├── .gitattributes ├── .gitignore ├── GEMINI_INSIGHTS.md ├── INSTALLATION.md ├── LICENSE ├── README.md ├── app.py ├── install.bat ├── install.py ├── install.sh ├── requirements.txt └── utils ├── audio_processing.py ├── cache.py ├── diarization.py ├── export.py ├── gpu_utils.py ├── keyword_extraction.py ├── ollama_integration.py ├── summarization.py ├── transcription.py ├── translation.py └── validation.py /.gitattributes: -------------------------------------------------------------------------------- 1 | # Auto detect text files and perform LF normalization 2 | * text=auto 3 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | # Python virtual environment 2 | venv/ 3 | __pycache__/ 4 | *.pyc 5 | 6 | # IDE files 7 | .vscode/ 8 | .idea/ 9 | 10 | # OS files 11 | .env 12 | .DS_Store 13 | Thumbs.db 14 | -------------------------------------------------------------------------------- /GEMINI_INSIGHTS.md: -------------------------------------------------------------------------------- 1 | # Gemini Insights: OBS Recording Transcriber 2 | 3 | ## Project Overview 4 | The OBS Recording Transcriber is a Python application built with Streamlit that processes video recordings (particularly from OBS Studio) to generate transcripts and summaries using AI models. The application uses Whisper for transcription and Hugging Face Transformers for summarization. 5 | 6 | ## Key Improvement Areas 7 | 8 | ### 1. UI Enhancements 9 | - **Implemented:** 10 | - Responsive layout with columns for better organization 11 | - Expanded sidebar with categorized settings 12 | - Custom CSS for improved button styling 13 | - Spinner for long-running operations 14 | - Expanded transcript view by default 15 | 16 | - **Additional Recommendations:** 17 | - Add a dark mode toggle 18 | - Implement progress bars for each processing step 19 | - Add tooltips for complex options 20 | - Create a dashboard view for batch processing results 21 | - Add visualization of transcript segments with timestamps 22 | 23 | ### 2. Ollama Local API Integration 24 | - **Implemented:** 25 | - Local API integration for offline summarization 26 | - Model selection from available Ollama models 27 | - Chunking for long texts 28 | - Fallback to online models when Ollama fails 29 | 30 | - **Additional Recommendations:** 31 | - Add temperature and other generation parameters as advanced options 32 | - Implement streaming responses for real-time feedback 33 | - Cache results to avoid reprocessing 34 | - Add support for custom Ollama model creation with specific instructions 35 | - Implement parallel processing for multiple chunks 36 | 37 | ### 3. Subtitle Export Formats 38 | - **Implemented:** 39 | - SRT export with proper formatting 40 | - ASS export with basic styling 41 | - Multi-format export options 42 | - Automatic segment creation from plain text 43 | 44 | - **Additional Recommendations:** 45 | - Add customizable styling options for ASS subtitles 46 | - Implement subtitle editing before export 47 | - Add support for VTT format for web videos 48 | - Implement subtitle timing adjustment 49 | - Add batch export for multiple files 50 | 51 | ### 4. 
Architecture and Code Quality 52 | - **Recommendations:** 53 | - Implement proper error handling and logging throughout 54 | - Add unit tests for critical components 55 | - Create a configuration file for default settings 56 | - Implement caching for processed files 57 | - Add type hints throughout the codebase 58 | - Document API endpoints for potential future web service 59 | 60 | ### 5. Performance Optimizations 61 | - **Recommendations:** 62 | - Implement parallel processing for batch operations 63 | - Add GPU acceleration configuration options 64 | - Optimize memory usage for large files 65 | - Implement incremental processing for very long recordings 66 | - Add compression options for exported files 67 | 68 | ### 6. Additional Features 69 | - **Recommendations:** 70 | - Speaker diarization (identifying different speakers) 71 | - Language detection and translation 72 | - Keyword extraction and timestamp linking 73 | - Integration with video editing software 74 | - Batch processing queue with email notifications 75 | - Custom vocabulary for domain-specific terminology 76 | 77 | ## Implementation Roadmap 78 | 1. **Phase 1 (Completed):** Basic UI improvements, Ollama integration, and subtitle export 79 | 2. **Phase 2 (Completed):** Performance optimizations and additional export formats 80 | - Added WebVTT export format for web videos 81 | - Implemented GPU acceleration with automatic device selection 82 | - Added caching system for faster processing of previously transcribed files 83 | - Optimized memory usage with configurable memory limits 84 | - Added compression options for exported files 85 | - Enhanced ASS subtitle styling options 86 | - Added progress indicators for better user feedback 87 | 3. **Phase 3 (Completed):** Advanced features like speaker diarization and translation 88 | - Implemented speaker diarization to identify different speakers in recordings 89 | - Added language detection and translation capabilities 90 | - Integrated keyword extraction with timestamp linking 91 | - Created interactive transcript with keyword highlighting 92 | - Added named entity recognition for better content analysis 93 | - Generated keyword index with timestamp references 94 | - Provided speaker statistics and word count analysis 95 | 4. **Phase 4:** Integration with other tools and services 96 | 97 | ## Technical Considerations 98 | - Ensure compatibility with different Whisper model sizes 99 | - Handle large files efficiently to prevent memory issues 100 | - Provide graceful degradation when optional dependencies are missing 101 | - Maintain backward compatibility with existing workflows 102 | - Consider containerization for easier deployment 103 | 104 | ## Conclusion 105 | The OBS Recording Transcriber has a solid foundation but can be significantly enhanced with the suggested improvements. The focus should be on improving user experience, adding offline processing capabilities, and expanding export options to make the tool more versatile for different use cases. -------------------------------------------------------------------------------- /INSTALLATION.md: -------------------------------------------------------------------------------- 1 | # Installation Guide for OBS Recording Transcriber 2 | 3 | This guide will help you install all the necessary dependencies for the OBS Recording Transcriber application, including the advanced features from Phase 3. 4 | 5 | ## Prerequisites 6 | 7 | Before installing the Python packages, you need to set up some prerequisites: 8 | 9 | ### 1. 
Python 3.8 or higher 10 | 11 | Make sure you have Python 3.8 or higher installed. You can download it from [python.org](https://www.python.org/downloads/). 12 | 13 | ### 2. FFmpeg 14 | 15 | FFmpeg is required for audio processing: 16 | 17 | - **Windows**: 18 | - Download from [gyan.dev/ffmpeg/builds](https://www.gyan.dev/ffmpeg/builds/) 19 | - Extract the ZIP file 20 | - Add the `bin` folder to your system PATH 21 | 22 | - **macOS**: 23 | ```bash 24 | brew install ffmpeg 25 | ``` 26 | 27 | - **Linux**: 28 | ```bash 29 | sudo apt update 30 | sudo apt install ffmpeg 31 | ``` 32 | 33 | ### 3. Visual C++ Build Tools (Windows only) 34 | 35 | Some packages like `tokenizers` require C++ build tools: 36 | 37 | 1. Download and install [Visual C++ Build Tools](https://visualstudio.microsoft.com/visual-cpp-build-tools/) 38 | 2. During installation, select "Desktop development with C++" 39 | 40 | ## Installation Steps 41 | 42 | ### 1. Create a Virtual Environment (Recommended) 43 | 44 | ```bash 45 | # Create a virtual environment 46 | python -m venv venv 47 | 48 | # Activate the virtual environment 49 | # Windows 50 | venv\Scripts\activate 51 | # macOS/Linux 52 | source venv/bin/activate 53 | ``` 54 | 55 | ### 2. Install PyTorch 56 | 57 | For better performance, install PyTorch with CUDA support if you have an NVIDIA GPU: 58 | 59 | ```bash 60 | # Windows/Linux with CUDA 61 | pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 62 | 63 | # macOS or CPU-only 64 | pip install torch torchvision torchaudio 65 | ``` 66 | 67 | ### 3. Install Dependencies 68 | 69 | ```bash 70 | # Install all dependencies from requirements.txt 71 | pip install -r requirements.txt 72 | ``` 73 | 74 | ### 4. Troubleshooting Common Issues 75 | 76 | #### Tokenizers Installation Issues 77 | 78 | If you encounter issues with `tokenizers` installation: 79 | 80 | 1. Make sure you have Visual C++ Build Tools installed (Windows) 81 | 2. Try installing Rust: [rustup.rs](https://rustup.rs/) 82 | 3. Install tokenizers separately: 83 | ```bash 84 | pip install tokenizers --no-binary tokenizers 85 | ``` 86 | 87 | #### PyAnnote.Audio Access 88 | 89 | To use speaker diarization, you need a HuggingFace token with access to the pyannote models: 90 | 91 | 1. Create an account on [HuggingFace](https://huggingface.co/) 92 | 2. Generate an access token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) 93 | 3. Request access to [pyannote/speaker-diarization-3.0](https://huggingface.co/pyannote/speaker-diarization-3.0) 94 | 4. Set the token in the application when prompted or as an environment variable: 95 | ```bash 96 | # Windows 97 | set HF_TOKEN=your_token_here 98 | # macOS/Linux 99 | export HF_TOKEN=your_token_here 100 | ``` 101 | 102 | #### Memory Issues with Large Files 103 | 104 | If you encounter memory issues with large files: 105 | 106 | 1. Use a smaller Whisper model (e.g., "base" instead of "large") 107 | 2. Reduce the GPU memory fraction in the application settings 108 | 3. Increase your system's swap space/virtual memory 109 | 110 | ## Running the Application 111 | 112 | After installation, run the application with: 113 | 114 | ```bash 115 | streamlit run app.py 116 | ``` 117 | 118 | ## Optional: Ollama Setup for Local Summarization 119 | 120 | To use Ollama for local summarization: 121 | 122 | 1. Install Ollama from [ollama.ai](https://ollama.ai/) 123 | 2. Pull a model: 124 | ```bash 125 | ollama pull llama3 126 | ``` 127 | 3. 
Uncomment the Ollama line in requirements.txt and install: 128 | ```bash 129 | pip install ollama 130 | ``` 131 | 132 | ## Verifying Installation 133 | 134 | To verify that all components are working correctly: 135 | 136 | 1. Run the application 137 | 2. Check that GPU acceleration is available (if applicable) 138 | 3. Test a small video file with basic transcription 139 | 4. Gradually enable advanced features like diarization and translation 140 | 141 | If you encounter any issues, check the application logs for specific error messages. -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2025 DataAnts-AI 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Video Transcriber 2 | 3 | ## Project Overview 4 | The video Recording Transcriber is a Python application built with Streamlit that processes video recordings (particularly from OBS Studio) to generate transcripts and summaries using AI models. The application uses Whisper for transcription and Hugging Face Transformers for summarization. 5 | 6 | 7 |  8 | 9 | Demo here 10 | 11 | https://github.com/user-attachments/assets/990e63fc-232e-46a0-afdf-ca8836d46a13 12 | 13 | 14 | ## Installation 15 | 16 | ### Easy Installation (Recommended) 17 | 18 | #### Windows 19 | 1. Download or clone the repository 20 | 2. Run `install.bat` by double-clicking it 21 | 3. Follow the on-screen instructions 22 | 23 | #### Linux/macOS 24 | 1. Download or clone the repository 25 | 2. Open a terminal in the project directory 26 | 3. Make the install script executable: `chmod +x install.sh` 27 | 4. Run the script: `./install.sh` 28 | 5. Follow the on-screen instructions 29 | 30 | ### Manual Installation 31 | 1. Clone the repo. 32 | ``` 33 | git clone https://github.com/DataAnts-AI/VideoTranscriber.git 34 | cd VideoTranscriber 35 | ``` 36 | 37 | 2. Install dependencies: 38 | ``` 39 | pip install -r requirements.txt 40 | ``` 41 | 42 | Notes: 43 | - Ensure that the versions align with the features you use and your system compatibility. 44 | - torch version should match the capabilities of your hardware (e.g., CUDA support for GPUs). 
45 | - For advanced features like speaker diarization, you'll need a HuggingFace token. 46 | - See `INSTALLATION.md` for detailed instructions and troubleshooting. 47 | 48 | 3. Run the application: 49 | ``` 50 | streamlit run app.py 51 | ``` 52 | 53 | ## Usage 54 | 1. Set your base folder where OBS recordings are stored 55 | 2. Select a recording from the dropdown 56 | 3. Choose transcription and summarization models 57 | 4. Configure performance settings (GPU acceleration, caching) 58 | 5. Select export formats and compression options 59 | 6. Click "Process Recording" to start 60 | 61 | ## Advanced Features 62 | - **Speaker Diarization**: Identify and label different speakers in your recordings 63 | - **Translation**: Automatically detect language and translate to multiple languages 64 | - **Keyword Extraction**: Extract important keywords with timestamp links 65 | - **Interactive Transcript**: Navigate through the transcript with keyword highlighting 66 | - **GPU Acceleration**: Utilize your GPU for faster processing 67 | - **Caching**: Save processing time by caching results 68 | 69 | 70 | 71 | ## Key Improvement Areas 72 | 73 | ### 1. UI Enhancements 74 | - **Implemented:** 75 | - Responsive layout with columns for better organization 76 | - Expanded sidebar with categorized settings 77 | - Custom CSS for improved button styling 78 | - Spinner for long-running operations 79 | - Expanded transcript view by default 80 | 81 | - **Additional Recommendations:** 82 | - Add a dark mode toggle 83 | - Implement progress bars for each processing step 84 | - Add tooltips for complex options 85 | - Create a dashboard view for batch processing results 86 | - Add visualization of transcript segments with timestamps 87 | 88 | ### 2. Ollama Local API Integration 89 | - **Implemented:** 90 | - Local API integration for offline summarization 91 | - Model selection from available Ollama models 92 | - Chunking for long texts 93 | - Fallback to online models when Ollama fails 94 | 95 | - **Additional Recommendations:** 96 | - Add temperature and other generation parameters as advanced options 97 | - Implement streaming responses for real-time feedback 98 | - Cache results to avoid reprocessing 99 | - Add support for custom Ollama model creation with specific instructions 100 | - Implement parallel processing for multiple chunks 101 | 102 | ### 3. Subtitle Export Formats 103 | - **Implemented:** 104 | - SRT export with proper formatting 105 | - ASS export with basic styling 106 | - Multi-format export options 107 | - Automatic segment creation from plain text 108 | 109 | - **Additional Recommendations:** 110 | - Add customizable styling options for ASS subtitles 111 | - Implement subtitle editing before export 112 | - Add support for VTT format for web videos 113 | - Implement subtitle timing adjustment 114 | - Add batch export for multiple files 115 | 116 | ### 4. Architecture and Code Quality 117 | - **Recommendations:** 118 | - Implement proper error handling and logging throughout 119 | - Add unit tests for critical components 120 | - Create a configuration file for default settings 121 | - Implement caching for processed files 122 | - Add type hints throughout the codebase 123 | - Document API endpoints for potential future web service 124 | 125 | ### 5. 
Performance Optimizations 126 | - **Recommendations:** 127 | - Implement parallel processing for batch operations 128 | - Add GPU acceleration configuration options 129 | - Optimize memory usage for large files 130 | - Implement incremental processing for very long recordings 131 | - Add compression options for exported files 132 | 133 | ### 6. Additional Features 134 | - **Recommendations:** 135 | - Speaker diarization (identifying different speakers) 136 | - Language detection and translation 137 | - Keyword extraction and timestamp linking 138 | - Integration with video editing software 139 | - Batch processing queue with email notifications 140 | - Custom vocabulary for domain-specific terminology 141 | 142 | ## Implementation Roadmap 143 | 1. **Phase 1 (Completed):** Basic UI improvements, Ollama integration, and subtitle export 144 | 2. **Phase 2 (Completed):** Performance optimizations and additional export formats 145 | - Added WebVTT export format for web videos 146 | - Implemented GPU acceleration with automatic device selection 147 | - Added caching system for faster processing of previously transcribed files 148 | - Optimized memory usage with configurable memory limits 149 | - Added compression options for exported files 150 | - Enhanced ASS subtitle styling options 151 | - Added progress indicators for better user feedback 152 | 3. **Phase 3 (Completed):** Advanced features like speaker diarization and translation 153 | - Implemented speaker diarization to identify different speakers in recordings 154 | - Added language detection and translation capabilities 155 | - Integrated keyword extraction with timestamp linking 156 | - Created interactive transcript with keyword highlighting 157 | - Added named entity recognition for better content analysis 158 | - Generated keyword index with timestamp references 159 | - Provided speaker statistics and word count analysis 160 | 4. **Phase 4:** Integration with other tools and services (In progess) 161 | 162 | 163 | Reach out to support@dataants.org if you need assistance with any AI solutions - we offer support for n8n workflows, local RAG chatbots, and ERP and Financial reporting. 
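For convenience, a minimal quick start collected from the installation steps above (Linux/macOS commands shown; see `INSTALLATION.md` for the Windows equivalents). The HuggingFace token is a placeholder value and is only needed if you enable speaker diarization:

```bash
# One-time setup
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# Optional: only required for speaker diarization (placeholder value)
export HF_TOKEN=your_token_here

# Launch the app
streamlit run app.py
```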
164 | -------------------------------------------------------------------------------- /app.py: -------------------------------------------------------------------------------- 1 | import streamlit as st 2 | from utils.audio_processing import extract_audio 3 | from utils.transcription import transcribe_audio 4 | from utils.summarization import summarize_text 5 | from utils.validation import validate_environment 6 | from utils.export import export_transcript 7 | from pathlib import Path 8 | import os 9 | import logging 10 | import humanize 11 | from datetime import timedelta 12 | 13 | # Configure logging 14 | logging.basicConfig(level=logging.INFO) 15 | logger = logging.getLogger(__name__) 16 | 17 | # Try to import Ollama integration, but don't fail if it's not available 18 | try: 19 | from utils.ollama_integration import check_ollama_available, list_available_models, chunk_and_summarize 20 | OLLAMA_AVAILABLE = check_ollama_available() 21 | except ImportError: 22 | OLLAMA_AVAILABLE = False 23 | 24 | # Try to import GPU utilities, but don't fail if not available 25 | try: 26 | from utils.gpu_utils import get_gpu_info, configure_gpu 27 | GPU_UTILS_AVAILABLE = True 28 | except ImportError: 29 | GPU_UTILS_AVAILABLE = False 30 | 31 | # Try to import caching utilities, but don't fail if not available 32 | try: 33 | from utils.cache import get_cache_size, clear_cache 34 | CACHE_AVAILABLE = True 35 | except ImportError: 36 | CACHE_AVAILABLE = False 37 | 38 | # Try to import diarization utilities, but don't fail if not available 39 | try: 40 | from utils.diarization import transcribe_with_diarization 41 | DIARIZATION_AVAILABLE = True 42 | except ImportError: 43 | DIARIZATION_AVAILABLE = False 44 | 45 | # Try to import translation utilities, but don't fail if not available 46 | try: 47 | from utils.translation import transcribe_and_translate, get_language_name 48 | TRANSLATION_AVAILABLE = True 49 | except ImportError: 50 | TRANSLATION_AVAILABLE = False 51 | 52 | # Try to import keyword extraction utilities, but don't fail if not available 53 | try: 54 | from utils.keyword_extraction import extract_keywords_from_transcript, generate_keyword_index, generate_interactive_transcript 55 | KEYWORD_EXTRACTION_AVAILABLE = True 56 | except ImportError: 57 | KEYWORD_EXTRACTION_AVAILABLE = False 58 | 59 | def main(): 60 | # Set page configuration 61 | st.set_page_config( 62 | page_title="OBS Recording Transcriber", 63 | page_icon="🎥", 64 | layout="wide", 65 | initial_sidebar_state="expanded" 66 | ) 67 | 68 | # Custom CSS for better UI 69 | st.markdown(""" 70 | 102 | """, unsafe_allow_html=True) 103 | 104 | st.title("🎥 OBS Recording Transcriber") 105 | st.caption("Process your OBS recordings with AI transcription and summarization") 106 | 107 | # Sidebar configuration 108 | st.sidebar.header("Settings") 109 | 110 | # Allow the user to select a base folder 111 | base_folder = st.sidebar.text_input( 112 | "Enter the base folder path:", 113 | value=str(Path.home()) 114 | ) 115 | 116 | base_path = Path(base_folder) 117 | 118 | # Model selection 119 | st.sidebar.subheader("Model Settings") 120 | 121 | # Transcription model selection 122 | transcription_model = st.sidebar.selectbox( 123 | "Transcription Model", 124 | ["tiny", "base", "small", "medium", "large"], 125 | index=1, 126 | help="Select the Whisper model size. Larger models are more accurate but slower." 
127 | ) 128 | 129 | # Summarization model selection 130 | summarization_options = ["Hugging Face (Online)", "Ollama (Local)"] if OLLAMA_AVAILABLE else ["Hugging Face (Online)"] 131 | summarization_method = st.sidebar.selectbox( 132 | "Summarization Method", 133 | summarization_options, 134 | index=0, 135 | help="Select the summarization method. Ollama runs locally but requires installation." 136 | ) 137 | 138 | # If Ollama is selected, show model selection 139 | ollama_model = None 140 | if OLLAMA_AVAILABLE and summarization_method == "Ollama (Local)": 141 | available_models = list_available_models() 142 | if available_models: 143 | ollama_model = st.sidebar.selectbox( 144 | "Ollama Model", 145 | available_models, 146 | index=0 if "llama3" in available_models else 0, 147 | help="Select the Ollama model to use for summarization." 148 | ) 149 | else: 150 | st.sidebar.warning("No Ollama models found. Please install models using 'ollama pull model_name'.") 151 | 152 | # Advanced features 153 | st.sidebar.subheader("Advanced Features") 154 | 155 | # Speaker diarization 156 | use_diarization = st.sidebar.checkbox( 157 | "Speaker Diarization", 158 | value=False, 159 | disabled=not DIARIZATION_AVAILABLE, 160 | help="Identify different speakers in the recording." 161 | ) 162 | 163 | # Show HF token input if diarization is enabled 164 | hf_token = None 165 | if use_diarization and DIARIZATION_AVAILABLE: 166 | hf_token = st.sidebar.text_input( 167 | "HuggingFace Token", 168 | type="password", 169 | help="Required for speaker diarization. Get your token at huggingface.co/settings/tokens" 170 | ) 171 | 172 | num_speakers = st.sidebar.number_input( 173 | "Number of Speakers", 174 | min_value=1, 175 | max_value=10, 176 | value=2, 177 | help="Specify the number of speakers if known, or leave at default for auto-detection." 178 | ) 179 | 180 | # Translation 181 | use_translation = st.sidebar.checkbox( 182 | "Translation", 183 | value=False, 184 | disabled=not TRANSLATION_AVAILABLE, 185 | help="Translate the transcript to another language." 186 | ) 187 | 188 | # Target language selection if translation is enabled 189 | target_lang = None 190 | if use_translation and TRANSLATION_AVAILABLE: 191 | target_lang = st.sidebar.selectbox( 192 | "Target Language", 193 | ["en", "es", "fr", "de", "it", "pt", "nl", "ru", "zh", "ja", "ko", "ar"], 194 | format_func=lambda x: f"{get_language_name(x)} ({x})", 195 | help="Select the language to translate to." 196 | ) 197 | 198 | # Keyword extraction 199 | use_keywords = st.sidebar.checkbox( 200 | "Keyword Extraction", 201 | value=False, 202 | disabled=not KEYWORD_EXTRACTION_AVAILABLE, 203 | help="Extract keywords and link them to timestamps." 204 | ) 205 | 206 | if use_keywords and KEYWORD_EXTRACTION_AVAILABLE: 207 | max_keywords = st.sidebar.slider( 208 | "Max Keywords", 209 | min_value=5, 210 | max_value=30, 211 | value=15, 212 | help="Maximum number of keywords to extract." 213 | ) 214 | 215 | # Performance settings 216 | st.sidebar.subheader("Performance Settings") 217 | 218 | # GPU acceleration 219 | use_gpu = st.sidebar.checkbox( 220 | "Use GPU Acceleration", 221 | value=True if GPU_UTILS_AVAILABLE else False, 222 | disabled=not GPU_UTILS_AVAILABLE, 223 | help="Use GPU for faster processing if available." 
224 | ) 225 | 226 | # Show GPU info if available 227 | if GPU_UTILS_AVAILABLE and use_gpu: 228 | gpu_info = get_gpu_info() 229 | if gpu_info["cuda_available"]: 230 | gpu_devices = [f"{d['name']} ({humanize.naturalsize(d['total_memory'])})" for d in gpu_info["cuda_devices"]] 231 | st.sidebar.info(f"GPU(s) available: {', '.join(gpu_devices)}") 232 | elif gpu_info["mps_available"]: 233 | st.sidebar.info("Apple Silicon GPU (MPS) available") 234 | else: 235 | st.sidebar.warning("No GPU detected. Using CPU.") 236 | 237 | # Memory usage 238 | memory_fraction = st.sidebar.slider( 239 | "GPU Memory Usage", 240 | min_value=0.1, 241 | max_value=1.0, 242 | value=0.8, 243 | step=0.1, 244 | disabled=not (GPU_UTILS_AVAILABLE and use_gpu), 245 | help="Fraction of GPU memory to use. Lower if you encounter out-of-memory errors." 246 | ) 247 | 248 | # Caching options 249 | use_cache = st.sidebar.checkbox( 250 | "Use Caching", 251 | value=True if CACHE_AVAILABLE else False, 252 | disabled=not CACHE_AVAILABLE, 253 | help="Cache transcription results to avoid reprocessing the same files." 254 | ) 255 | 256 | # Cache management 257 | if CACHE_AVAILABLE and use_cache: 258 | cache_size, cache_files = get_cache_size() 259 | if cache_size > 0: 260 | st.sidebar.info(f"Cache: {humanize.naturalsize(cache_size)} ({cache_files} files)") 261 | if st.sidebar.button("Clear Cache"): 262 | cleared = clear_cache() 263 | st.sidebar.success(f"Cleared {cleared} cache files") 264 | 265 | # Export options 266 | st.sidebar.subheader("Export Options") 267 | export_format = st.sidebar.multiselect( 268 | "Export Formats", 269 | ["TXT", "SRT", "VTT", "ASS"], 270 | default=["TXT"], 271 | help="Select the formats to export the transcript." 272 | ) 273 | 274 | # Compression options 275 | compress_exports = st.sidebar.checkbox( 276 | "Compress Exports", 277 | value=False, 278 | help="Compress exported files to save space." 279 | ) 280 | 281 | if compress_exports: 282 | compression_type = st.sidebar.radio( 283 | "Compression Format", 284 | ["gzip", "zip"], 285 | index=0, 286 | help="Select the compression format for exported files." 
287 | ) 288 | else: 289 | compression_type = None 290 | 291 | # ASS subtitle styling 292 | if "ASS" in export_format: 293 | st.sidebar.subheader("ASS Subtitle Styling") 294 | show_style_options = st.sidebar.checkbox("Customize ASS Style", value=False) 295 | 296 | if show_style_options: 297 | ass_style = {} 298 | ass_style["fontname"] = st.sidebar.selectbox( 299 | "Font", 300 | ["Arial", "Helvetica", "Times New Roman", "Courier New", "Comic Sans MS"], 301 | index=0 302 | ) 303 | ass_style["fontsize"] = st.sidebar.slider("Font Size", 12, 72, 48) 304 | ass_style["alignment"] = st.sidebar.selectbox( 305 | "Alignment", 306 | ["2 (Bottom Center)", "1 (Bottom Left)", "3 (Bottom Right)", "8 (Top Center)"], 307 | index=0 308 | ).split()[0] # Extract just the number 309 | ass_style["bold"] = "-1" if st.sidebar.checkbox("Bold", value=True) else "0" 310 | ass_style["italic"] = "-1" if st.sidebar.checkbox("Italic", value=False) else "0" 311 | else: 312 | ass_style = None 313 | 314 | # Validate environment 315 | env_errors = validate_environment(base_path) 316 | if env_errors: 317 | st.error("## Environment Issues") 318 | for error in env_errors: 319 | st.markdown(f"- {error}") 320 | return 321 | 322 | # File selection 323 | recordings = list(base_path.glob("*.mp4")) 324 | if not recordings: 325 | st.warning(f"📂 No recordings found in the folder: {base_folder}!") 326 | return 327 | 328 | selected_file = st.selectbox("Choose a recording", recordings) 329 | 330 | # Process button with spinner 331 | if st.button("🚀 Start Processing"): 332 | # Create a progress bar 333 | progress_bar = st.progress(0) 334 | status_text = st.empty() 335 | 336 | try: 337 | # Update progress 338 | status_text.text("Extracting audio...") 339 | progress_bar.progress(10) 340 | 341 | # Process based on selected features 342 | if use_diarization and DIARIZATION_AVAILABLE and hf_token: 343 | # Transcribe with speaker diarization 344 | status_text.text("Transcribing with speaker diarization...") 345 | num_speakers_arg = int(num_speakers) if num_speakers > 0 else None 346 | diarized_segments, diarized_transcript = transcribe_with_diarization( 347 | selected_file, 348 | whisper_model=transcription_model, 349 | num_speakers=num_speakers_arg, 350 | use_gpu=use_gpu, 351 | hf_token=hf_token 352 | ) 353 | segments = diarized_segments 354 | transcript = diarized_transcript 355 | elif use_translation and TRANSLATION_AVAILABLE: 356 | # Transcribe and translate 357 | status_text.text("Transcribing and translating...") 358 | original_segments, translated_segments, original_transcript, translated_transcript = transcribe_and_translate( 359 | selected_file, 360 | whisper_model=transcription_model, 361 | target_lang=target_lang, 362 | use_gpu=use_gpu 363 | ) 364 | segments = translated_segments 365 | transcript = translated_transcript 366 | # Store original for display 367 | original_text = original_transcript 368 | else: 369 | # Standard transcription 370 | status_text.text("Transcribing audio...") 371 | segments, transcript = transcribe_audio( 372 | selected_file, 373 | model=transcription_model, 374 | use_cache=use_cache, 375 | use_gpu=use_gpu, 376 | memory_fraction=memory_fraction 377 | ) 378 | 379 | progress_bar.progress(50) 380 | 381 | if transcript: 382 | # Extract keywords if requested 383 | keyword_timestamps = None 384 | entity_timestamps = None 385 | if use_keywords and KEYWORD_EXTRACTION_AVAILABLE: 386 | status_text.text("Extracting keywords...") 387 | keyword_timestamps, entity_timestamps = extract_keywords_from_transcript( 388 | 
transcript, 389 | segments, 390 | max_keywords=max_keywords, 391 | use_gpu=use_gpu 392 | ) 393 | 394 | # Generate keyword index 395 | keyword_index = generate_keyword_index(keyword_timestamps, entity_timestamps) 396 | 397 | # Generate interactive transcript 398 | interactive_transcript = generate_interactive_transcript( 399 | segments, 400 | keyword_timestamps, 401 | entity_timestamps 402 | ) 403 | 404 | # Generate summary based on selected method 405 | status_text.text("Generating summary...") 406 | if OLLAMA_AVAILABLE and summarization_method == "Ollama (Local)" and ollama_model: 407 | summary = chunk_and_summarize(transcript, model=ollama_model) 408 | if not summary: 409 | st.warning("Ollama summarization failed. Falling back to Hugging Face.") 410 | summary = summarize_text( 411 | transcript, 412 | use_gpu=use_gpu, 413 | memory_fraction=memory_fraction 414 | ) 415 | else: 416 | summary = summarize_text( 417 | transcript, 418 | use_gpu=use_gpu, 419 | memory_fraction=memory_fraction 420 | ) 421 | 422 | progress_bar.progress(80) 423 | status_text.text("Preparing results...") 424 | 425 | # Display results in tabs 426 | tab1, tab2, tab3 = st.tabs(["Summary", "Transcript", "Advanced"]) 427 | 428 | with tab1: 429 | st.subheader("🖍 Summary") 430 | st.write(summary) 431 | 432 | # If translation was used, show original language 433 | if use_translation and TRANSLATION_AVAILABLE and 'original_text' in locals(): 434 | with st.expander("Original Language Summary"): 435 | original_summary = summarize_text( 436 | original_text, 437 | use_gpu=use_gpu, 438 | memory_fraction=memory_fraction 439 | ) 440 | st.write(original_summary) 441 | 442 | with tab2: 443 | st.subheader("📜 Full Transcript") 444 | 445 | # Show interactive transcript if keywords were extracted 446 | if use_keywords and KEYWORD_EXTRACTION_AVAILABLE and 'interactive_transcript' in locals(): 447 | st.markdown(interactive_transcript, unsafe_allow_html=True) 448 | else: 449 | st.text(transcript) 450 | 451 | # If translation was used, show original language 452 | if use_translation and TRANSLATION_AVAILABLE and 'original_text' in locals(): 453 | with st.expander("Original Language Transcript"): 454 | st.text(original_text) 455 | 456 | with tab3: 457 | # Show keyword index if available 458 | if use_keywords and KEYWORD_EXTRACTION_AVAILABLE and 'keyword_index' in locals(): 459 | st.subheader("🔑 Keyword Index") 460 | st.markdown(keyword_index) 461 | 462 | # Show speaker information if available 463 | if use_diarization and DIARIZATION_AVAILABLE: 464 | st.subheader("🎙️ Speaker Information") 465 | speakers = set(segment.get('speaker', 'UNKNOWN') for segment in segments) 466 | st.write(f"Detected {len(speakers)} speakers: {', '.join(speakers)}") 467 | 468 | # Count words per speaker 469 | speaker_words = {} 470 | for segment in segments: 471 | speaker = segment.get('speaker', 'UNKNOWN') 472 | words = len(segment['text'].split()) 473 | if speaker in speaker_words: 474 | speaker_words[speaker] += words 475 | else: 476 | speaker_words[speaker] = words 477 | 478 | # Display speaker statistics 479 | st.write("### Speaker Statistics") 480 | for speaker, words in speaker_words.items(): 481 | st.write(f"- **{speaker}**: {words} words") 482 | 483 | # Export options 484 | st.subheader("💾 Export Options") 485 | export_cols = st.columns(len(export_format)) 486 | 487 | output_base = Path(selected_file).stem 488 | 489 | for i, format_type in enumerate(export_format): 490 | with export_cols[i]: 491 | if format_type == "TXT": 492 | st.download_button( 493 | 
label=f"Download {format_type}", 494 | data=transcript, 495 | file_name=f"{output_base}_transcript.txt", 496 | mime="text/plain" 497 | ) 498 | elif format_type in ["SRT", "VTT", "ASS"]: 499 | # Export to subtitle format 500 | output_path = export_transcript( 501 | transcript, 502 | output_base, 503 | format_type.lower(), 504 | segments=segments, 505 | compress=compress_exports, 506 | compression_type=compression_type, 507 | style=ass_style if format_type == "ASS" and ass_style else None 508 | ) 509 | 510 | # Read the exported file for download 511 | with open(output_path, 'rb') as f: 512 | subtitle_content = f.read() 513 | 514 | # Determine file extension 515 | file_ext = f".{format_type.lower()}" 516 | if compress_exports: 517 | file_ext += ".gz" if compression_type == "gzip" else ".zip" 518 | 519 | st.download_button( 520 | label=f"Download {format_type}", 521 | data=subtitle_content, 522 | file_name=f"{output_base}{file_ext}", 523 | mime="application/octet-stream" 524 | ) 525 | 526 | # Clean up the temporary file 527 | os.remove(output_path) 528 | 529 | # Complete progress 530 | progress_bar.progress(100) 531 | status_text.text("Processing complete!") 532 | else: 533 | st.error("❌ Failed to process recording") 534 | except Exception as e: 535 | st.error(f"An error occurred: {e}") 536 | st.write(e) # This will show the traceback in the Streamlit app 537 | 538 | if __name__ == "__main__": 539 | main() 540 | -------------------------------------------------------------------------------- /install.bat: -------------------------------------------------------------------------------- 1 | @echo off 2 | echo =================================================== 3 | echo OBS Recording Transcriber - Windows Installation 4 | echo =================================================== 5 | echo. 6 | 7 | :: Check for Python 8 | python --version > nul 2>&1 9 | if %errorlevel% neq 0 ( 10 | echo Python not found! Please install Python 3.8 or higher. 11 | echo Download from: https://www.python.org/downloads/ 12 | echo Make sure to check "Add Python to PATH" during installation. 13 | pause 14 | exit /b 1 15 | ) 16 | 17 | :: Run the installation script 18 | echo Running installation script... 19 | python install.py 20 | 21 | echo. 22 | echo If the installation was successful, you can run the application with: 23 | echo streamlit run app.py 24 | echo. 25 | pause -------------------------------------------------------------------------------- /install.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | """ 3 | Installation script for OBS Recording Transcriber. 4 | This script helps install all required dependencies and checks for common issues. 
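Run it directly with "python install.py"; the install.bat and install.sh wrappers invoke this script after checking that a suitable Python interpreter is available.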
5 | """ 6 | 7 | import os 8 | import sys 9 | import platform 10 | import subprocess 11 | import shutil 12 | from pathlib import Path 13 | 14 | def print_header(text): 15 | """Print a formatted header.""" 16 | print("\n" + "=" * 80) 17 | print(f" {text}") 18 | print("=" * 80) 19 | 20 | def print_step(text): 21 | """Print a step in the installation process.""" 22 | print(f"\n>> {text}") 23 | 24 | def run_command(command, check=True): 25 | """Run a shell command and return the result.""" 26 | try: 27 | result = subprocess.run( 28 | command, 29 | shell=True, 30 | check=check, 31 | stdout=subprocess.PIPE, 32 | stderr=subprocess.PIPE, 33 | text=True 34 | ) 35 | return result 36 | except subprocess.CalledProcessError as e: 37 | print(f"Error executing command: {command}") 38 | print(f"Error message: {e.stderr}") 39 | return None 40 | 41 | def check_python_version(): 42 | """Check if Python version is 3.8 or higher.""" 43 | print_step("Checking Python version") 44 | version = sys.version_info 45 | if version.major < 3 or (version.major == 3 and version.minor < 8): 46 | print(f"Python 3.8 or higher is required. You have {sys.version}") 47 | print("Please upgrade your Python installation.") 48 | return False 49 | print(f"Python version: {sys.version}") 50 | return True 51 | 52 | def check_ffmpeg(): 53 | """Check if FFmpeg is installed.""" 54 | print_step("Checking FFmpeg installation") 55 | result = shutil.which("ffmpeg") 56 | if result is None: 57 | print("FFmpeg not found in PATH.") 58 | print("Please install FFmpeg:") 59 | if platform.system() == "Windows": 60 | print(" - Download from: https://www.gyan.dev/ffmpeg/builds/") 61 | print(" - Extract and add the bin folder to your PATH") 62 | elif platform.system() == "Darwin": # macOS 63 | print(" - Install with Homebrew: brew install ffmpeg") 64 | else: # Linux 65 | print(" - Install with apt: sudo apt update && sudo apt install ffmpeg") 66 | return False 67 | 68 | # Check FFmpeg version 69 | version_result = run_command("ffmpeg -version") 70 | if version_result: 71 | print(f"FFmpeg is installed: {version_result.stdout.splitlines()[0]}") 72 | return True 73 | return False 74 | 75 | def check_gpu(): 76 | """Check for GPU availability.""" 77 | print_step("Checking GPU availability") 78 | 79 | # Check for NVIDIA GPU 80 | if platform.system() == "Windows": 81 | nvidia_smi = shutil.which("nvidia-smi") 82 | if nvidia_smi: 83 | result = run_command("nvidia-smi", check=False) 84 | if result and result.returncode == 0: 85 | print("NVIDIA GPU detected:") 86 | for line in result.stdout.splitlines()[:10]: 87 | print(f" {line}") 88 | return "nvidia" 89 | 90 | # Check for Apple Silicon 91 | if platform.system() == "Darwin" and platform.machine() == "arm64": 92 | print("Apple Silicon (M1/M2) detected") 93 | return "apple" 94 | 95 | print("No GPU detected or GPU drivers not installed. CPU will be used for processing.") 96 | return "cpu" 97 | 98 | def setup_virtual_env(): 99 | """Set up a virtual environment.""" 100 | print_step("Setting up virtual environment") 101 | 102 | # Check if venv module is available 103 | try: 104 | import venv 105 | print("Python venv module is available") 106 | except ImportError: 107 | print("Python venv module is not available. 
Please install it.") 108 | return False 109 | 110 | # Create virtual environment if it doesn't exist 111 | venv_path = Path("venv") 112 | if venv_path.exists(): 113 | print(f"Virtual environment already exists at {venv_path}") 114 | activate_venv() 115 | return True 116 | 117 | print(f"Creating virtual environment at {venv_path}") 118 | try: 119 | subprocess.run([sys.executable, "-m", "venv", "venv"], check=True) 120 | print("Virtual environment created successfully") 121 | activate_venv() 122 | return True 123 | except subprocess.CalledProcessError as e: 124 | print(f"Error creating virtual environment: {e}") 125 | return False 126 | 127 | def activate_venv(): 128 | """Activate the virtual environment.""" 129 | print_step("Activating virtual environment") 130 | 131 | venv_path = Path("venv") 132 | if not venv_path.exists(): 133 | print("Virtual environment not found") 134 | return False 135 | 136 | # Get the path to the activate script 137 | if platform.system() == "Windows": 138 | activate_script = venv_path / "Scripts" / "activate.bat" 139 | activate_cmd = f"call {activate_script}" 140 | else: 141 | activate_script = venv_path / "bin" / "activate" 142 | activate_cmd = f"source {activate_script}" 143 | 144 | print(f"To activate the virtual environment, run:") 145 | print(f" {activate_cmd}") 146 | 147 | # We can't actually activate the venv in this script because it would only 148 | # affect the subprocess, not the parent process. We just provide instructions. 149 | return True 150 | 151 | def install_pytorch(gpu_type): 152 | """Install PyTorch with appropriate GPU support.""" 153 | print_step("Installing PyTorch") 154 | 155 | if gpu_type == "nvidia": 156 | print("Installing PyTorch with CUDA support") 157 | cmd = "pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118" 158 | elif gpu_type == "apple": 159 | print("Installing PyTorch with MPS support") 160 | cmd = "pip install torch torchvision torchaudio" 161 | else: 162 | print("Installing PyTorch (CPU version)") 163 | cmd = "pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu" 164 | 165 | result = run_command(cmd) 166 | if result and result.returncode == 0: 167 | print("PyTorch installed successfully") 168 | return True 169 | else: 170 | print("Failed to install PyTorch") 171 | return False 172 | 173 | def install_dependencies(): 174 | """Install dependencies from requirements.txt.""" 175 | print_step("Installing dependencies from requirements.txt") 176 | 177 | requirements_path = Path("requirements.txt") 178 | if not requirements_path.exists(): 179 | print("requirements.txt not found") 180 | return False 181 | 182 | result = run_command("pip install -r requirements.txt") 183 | if result and result.returncode == 0: 184 | print("Dependencies installed successfully") 185 | return True 186 | else: 187 | print("Some dependencies failed to install. 
See error messages above.") 188 | return False 189 | 190 | def install_tokenizers(): 191 | """Install tokenizers package separately.""" 192 | print_step("Installing tokenizers package") 193 | 194 | # First try the normal installation 195 | result = run_command("pip install tokenizers", check=False) 196 | if result and result.returncode == 0: 197 | print("Tokenizers installed successfully") 198 | return True 199 | 200 | # If that fails, try the no-binary option 201 | print("Standard installation failed, trying alternative method...") 202 | result = run_command("pip install tokenizers --no-binary tokenizers", check=False) 203 | if result and result.returncode == 0: 204 | print("Tokenizers installed successfully with alternative method") 205 | return True 206 | 207 | print("Failed to install tokenizers. You may need to install Rust or Visual C++ Build Tools.") 208 | if platform.system() == "Windows": 209 | print("Download Visual C++ Build Tools: https://visualstudio.microsoft.com/visual-cpp-build-tools/") 210 | print("Install Rust: https://rustup.rs/") 211 | return False 212 | 213 | def check_installation(): 214 | """Verify the installation by importing key packages.""" 215 | print_step("Verifying installation") 216 | 217 | packages_to_check = [ 218 | "streamlit", 219 | "torch", 220 | "transformers", 221 | "whisper", 222 | "numpy", 223 | "sklearn" 224 | ] 225 | 226 | all_successful = True 227 | for package in packages_to_check: 228 | try: 229 | __import__(package) 230 | print(f"✓ {package} imported successfully") 231 | except ImportError: 232 | print(f"✗ Failed to import {package}") 233 | all_successful = False 234 | 235 | # Check optional packages 236 | optional_packages = [ 237 | "pyannote.audio", 238 | "iso639" 239 | ] 240 | 241 | print("\nChecking optional packages:") 242 | for package in optional_packages: 243 | try: 244 | if package == "pyannote.audio": 245 | # Just try to import pyannote 246 | __import__("pyannote") 247 | else: 248 | __import__(package) 249 | print(f"✓ {package} imported successfully") 250 | except ImportError: 251 | print(f"⚠ {package} not available (required for some advanced features)") 252 | 253 | return all_successful 254 | 255 | def main(): 256 | """Main installation function.""" 257 | print_header("OBS Recording Transcriber - Installation Script") 258 | 259 | # Check prerequisites 260 | if not check_python_version(): 261 | return 262 | 263 | ffmpeg_available = check_ffmpeg() 264 | gpu_type = check_gpu() 265 | 266 | # Setup environment 267 | if not setup_virtual_env(): 268 | print("Failed to set up virtual environment. Continuing with system Python...") 269 | 270 | # Install packages 271 | print("\nReady to install packages. Make sure your virtual environment is activated.") 272 | input("Press Enter to continue...") 273 | 274 | install_pytorch(gpu_type) 275 | install_dependencies() 276 | install_tokenizers() 277 | 278 | # Verify installation 279 | success = check_installation() 280 | 281 | print_header("Installation Summary") 282 | print(f"Python: {'✓ OK' if check_python_version() else '✗ Needs upgrade'}") 283 | print(f"FFmpeg: {'✓ Installed' if ffmpeg_available else '✗ Not found'}") 284 | print(f"GPU Support: {gpu_type.upper()}") 285 | print(f"Dependencies: {'✓ Installed' if success else '⚠ Some issues'}") 286 | 287 | print("\nNext steps:") 288 | if not ffmpeg_available: 289 | print("1. Install FFmpeg (required for audio processing)") 290 | 291 | print("1. 
Activate your virtual environment:") 292 | if platform.system() == "Windows": 293 | print(" venv\\Scripts\\activate") 294 | else: 295 | print(" source venv/bin/activate") 296 | 297 | print("2. Run the application:") 298 | print(" streamlit run app.py") 299 | 300 | print("\nFor advanced features like speaker diarization:") 301 | print("1. Get a HuggingFace token: https://huggingface.co/settings/tokens") 302 | print("2. Request access to pyannote models: https://huggingface.co/pyannote/speaker-diarization-3.0") 303 | 304 | print("\nSee INSTALLATION.md for more details and troubleshooting.") 305 | 306 | if __name__ == "__main__": 307 | main() -------------------------------------------------------------------------------- /install.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | echo "===================================================" 4 | echo " OBS Recording Transcriber - Unix Installation" 5 | echo "===================================================" 6 | echo 7 | 8 | # Check for Python 9 | if ! command -v python3 &> /dev/null; then 10 | echo "Python 3 not found! Please install Python 3.8 or higher." 11 | echo "For Ubuntu/Debian: sudo apt update && sudo apt install python3 python3-pip python3-venv" 12 | echo "For macOS: brew install python3" 13 | exit 1 14 | fi 15 | 16 | # Make the script executable 17 | chmod +x install.py 18 | 19 | # Run the installation script 20 | echo "Running installation script..." 21 | python3 ./install.py 22 | 23 | echo 24 | echo "If the installation was successful, you can run the application with:" 25 | echo "streamlit run app.py" 26 | echo -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | # OBS Recording Transcriber Dependencies 2 | # Core dependencies 3 | streamlit==1.26.0 4 | moviepy==1.0.3 5 | openai-whisper==20231117 6 | transformers>=4.21.1 7 | torch>=1.7.0 8 | torchaudio>=0.7.0 9 | requests>=2.28.0 10 | humanize>=4.6.0 11 | 12 | # Phase 2 dependencies 13 | scikit-learn>=1.0.0 14 | numpy>=1.20.0 15 | 16 | # Phase 3 dependencies 17 | pyannote.audio>=2.1.1 18 | iso639>=0.1.4 19 | protobuf>=3.20.0,<4.0.0 20 | tokenizers>=0.13.2 21 | scipy>=1.7.0 22 | matplotlib>=3.5.0 23 | soundfile>=0.10.3 24 | ffmpeg-python>=0.2.0 25 | 26 | # Optional: Ollama Python client (uncomment to install) 27 | # ollama 28 | 29 | # Installation notes: 30 | # 1. For Windows users, you may need to install PyTorch separately: 31 | # pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 32 | # 33 | # 2. For tokenizers issues, try installing Visual C++ Build Tools: 34 | # https://visualstudio.microsoft.com/visual-cpp-build-tools/ 35 | # 36 | # 3. For pyannote.audio, you'll need a HuggingFace token with access to: 37 | # https://huggingface.co/pyannote/speaker-diarization-3.0 38 | # 39 | # 4. 
FFmpeg is required for audio processing: 40 | # Windows: https://www.gyan.dev/ffmpeg/builds/ 41 | # Mac: brew install ffmpeg 42 | # Linux: apt-get install ffmpeg 43 | -------------------------------------------------------------------------------- /utils/audio_processing.py: -------------------------------------------------------------------------------- 1 | from moviepy.editor import AudioFileClip 2 | from pathlib import Path 3 | 4 | def extract_audio(video_path: Path): 5 | """Extract audio from a video file.""" 6 | try: 7 | audio = AudioFileClip(str(video_path)) 8 | audio_path = video_path.parent / f"{video_path.stem}_audio.wav" 9 | audio.write_audiofile(str(audio_path), verbose=False, logger=None) 10 | return audio_path 11 | except Exception as e: 12 | raise RuntimeError(f"Audio extraction failed: {e}") 13 | -------------------------------------------------------------------------------- /utils/cache.py: -------------------------------------------------------------------------------- 1 | """ 2 | Caching utilities for the OBS Recording Transcriber. 3 | Provides functions to cache and retrieve transcription and summarization results. 4 | """ 5 | 6 | import json 7 | import hashlib 8 | import os 9 | from pathlib import Path 10 | import logging 11 | import time 12 | 13 | # Configure logging 14 | logging.basicConfig(level=logging.INFO) 15 | logger = logging.getLogger(__name__) 16 | 17 | # Default cache directory 18 | CACHE_DIR = Path.home() / ".obs_transcriber_cache" 19 | 20 | 21 | def get_file_hash(file_path): 22 | """ 23 | Generate a hash for a file based on its content and modification time. 24 | 25 | Args: 26 | file_path (Path): Path to the file 27 | 28 | Returns: 29 | str: Hash string representing the file 30 | """ 31 | file_path = Path(file_path) 32 | if not file_path.exists(): 33 | return None 34 | 35 | # Get file stats 36 | stats = file_path.stat() 37 | file_size = stats.st_size 38 | mod_time = stats.st_mtime 39 | 40 | # Create a hash based on path, size and modification time 41 | # This is faster than hashing the entire file content 42 | hash_input = f"{file_path.absolute()}|{file_size}|{mod_time}" 43 | return hashlib.md5(hash_input.encode()).hexdigest() 44 | 45 | 46 | def get_cache_path(file_path, model=None, operation=None): 47 | """ 48 | Get the cache file path for a given input file and operation. 49 | 50 | Args: 51 | file_path (Path): Path to the original file 52 | model (str, optional): Model used for processing 53 | operation (str, optional): Operation type (e.g., 'transcribe', 'summarize') 54 | 55 | Returns: 56 | Path: Path to the cache file 57 | """ 58 | file_path = Path(file_path) 59 | file_hash = get_file_hash(file_path) 60 | 61 | if not file_hash: 62 | return None 63 | 64 | # Create cache directory if it doesn't exist 65 | cache_dir = CACHE_DIR 66 | cache_dir.mkdir(parents=True, exist_ok=True) 67 | 68 | # Create a cache filename based on the hash and optional parameters 69 | cache_name = file_hash 70 | if model: 71 | cache_name += f"_{model}" 72 | if operation: 73 | cache_name += f"_{operation}" 74 | 75 | return cache_dir / f"{cache_name}.json" 76 | 77 | 78 | def save_to_cache(file_path, data, model=None, operation=None): 79 | """ 80 | Save data to cache. 
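Cache entries are keyed by get_cache_path(), which hashes the source file's absolute path, size, and modification time, so a modified recording will not match a stale cache entry and will be reprocessed.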
81 | 82 | Args: 83 | file_path (Path): Path to the original file 84 | data (dict): Data to cache 85 | model (str, optional): Model used for processing 86 | operation (str, optional): Operation type 87 | 88 | Returns: 89 | bool: True if successful, False otherwise 90 | """ 91 | cache_path = get_cache_path(file_path, model, operation) 92 | if not cache_path: 93 | return False 94 | 95 | try: 96 | # Add metadata to the cached data 97 | cache_data = { 98 | "original_file": str(Path(file_path).absolute()), 99 | "timestamp": time.time(), 100 | "model": model, 101 | "operation": operation, 102 | "data": data 103 | } 104 | 105 | with open(cache_path, 'w', encoding='utf-8') as f: 106 | json.dump(cache_data, f, ensure_ascii=False, indent=2) 107 | 108 | logger.info(f"Cached data saved to {cache_path}") 109 | return True 110 | except Exception as e: 111 | logger.error(f"Error saving cache: {e}") 112 | return False 113 | 114 | 115 | def load_from_cache(file_path, model=None, operation=None, max_age=None): 116 | """ 117 | Load data from cache if available and not expired. 118 | 119 | Args: 120 | file_path (Path): Path to the original file 121 | model (str, optional): Model used for processing 122 | operation (str, optional): Operation type 123 | max_age (float, optional): Maximum age of cache in seconds 124 | 125 | Returns: 126 | dict or None: Cached data or None if not available 127 | """ 128 | cache_path = get_cache_path(file_path, model, operation) 129 | if not cache_path or not cache_path.exists(): 130 | return None 131 | 132 | try: 133 | with open(cache_path, 'r', encoding='utf-8') as f: 134 | cache_data = json.load(f) 135 | 136 | # Check if cache is expired 137 | if max_age is not None: 138 | cache_time = cache_data.get("timestamp", 0) 139 | if time.time() - cache_time > max_age: 140 | logger.info(f"Cache expired for {file_path}") 141 | return None 142 | 143 | logger.info(f"Loaded data from cache: {cache_path}") 144 | return cache_data.get("data") 145 | except Exception as e: 146 | logger.error(f"Error loading cache: {e}") 147 | return None 148 | 149 | 150 | def clear_cache(max_age=None): 151 | """ 152 | Clear all cache files or only expired ones. 153 | 154 | Args: 155 | max_age (float, optional): Maximum age of cache in seconds 156 | 157 | Returns: 158 | int: Number of files deleted 159 | """ 160 | if not CACHE_DIR.exists(): 161 | return 0 162 | 163 | count = 0 164 | for cache_file in CACHE_DIR.glob("*.json"): 165 | try: 166 | if max_age is not None: 167 | # Check if file is expired 168 | with open(cache_file, 'r', encoding='utf-8') as f: 169 | cache_data = json.load(f) 170 | 171 | cache_time = cache_data.get("timestamp", 0) 172 | if time.time() - cache_time <= max_age: 173 | continue # Skip non-expired files 174 | 175 | # Delete the file 176 | os.remove(cache_file) 177 | count += 1 178 | except Exception as e: 179 | logger.error(f"Error deleting cache file {cache_file}: {e}") 180 | 181 | logger.info(f"Cleared {count} cache files") 182 | return count 183 | 184 | 185 | def get_cache_size(): 186 | """ 187 | Get the total size of the cache directory. 
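Only the *.json cache entries under CACHE_DIR are counted; the Streamlit sidebar uses this to display cache usage alongside the Clear Cache button.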
188 | 189 | Returns: 190 | tuple: (size_bytes, file_count) 191 | """ 192 | if not CACHE_DIR.exists(): 193 | return 0, 0 194 | 195 | total_size = 0 196 | file_count = 0 197 | 198 | for cache_file in CACHE_DIR.glob("*.json"): 199 | try: 200 | total_size += cache_file.stat().st_size 201 | file_count += 1 202 | except Exception: 203 | pass 204 | 205 | return total_size, file_count -------------------------------------------------------------------------------- /utils/diarization.py: -------------------------------------------------------------------------------- 1 | """ 2 | Speaker diarization utilities for the OBS Recording Transcriber. 3 | Provides functions to identify different speakers in audio recordings. 4 | """ 5 | 6 | import logging 7 | import os 8 | import numpy as np 9 | from pathlib import Path 10 | import torch 11 | from pyannote.audio import Pipeline 12 | from pyannote.core import Segment 13 | import whisper 14 | 15 | # Configure logging 16 | logging.basicConfig(level=logging.INFO) 17 | logger = logging.getLogger(__name__) 18 | 19 | # Try to import GPU utilities, but don't fail if not available 20 | try: 21 | from utils.gpu_utils import get_optimal_device 22 | GPU_UTILS_AVAILABLE = True 23 | except ImportError: 24 | GPU_UTILS_AVAILABLE = False 25 | 26 | # Default HuggingFace auth token environment variable 27 | HF_TOKEN_ENV = "HF_TOKEN" 28 | 29 | 30 | def get_diarization_pipeline(use_gpu=True, hf_token=None): 31 | """ 32 | Initialize the speaker diarization pipeline. 33 | 34 | Args: 35 | use_gpu (bool): Whether to use GPU acceleration if available 36 | hf_token (str, optional): HuggingFace API token for accessing the model 37 | 38 | Returns: 39 | Pipeline or None: Diarization pipeline if successful, None otherwise 40 | """ 41 | # Check if token is provided or in environment 42 | if hf_token is None: 43 | hf_token = os.environ.get(HF_TOKEN_ENV) 44 | if hf_token is None: 45 | logger.error(f"HuggingFace token not provided. Set {HF_TOKEN_ENV} environment variable or pass token directly.") 46 | return None 47 | 48 | try: 49 | # Configure device 50 | device = torch.device("cpu") 51 | if use_gpu and GPU_UTILS_AVAILABLE: 52 | device = get_optimal_device() 53 | logger.info(f"Using device: {device} for diarization") 54 | 55 | # Initialize the pipeline 56 | pipeline = Pipeline.from_pretrained( 57 | "pyannote/speaker-diarization-3.0", 58 | use_auth_token=hf_token 59 | ) 60 | 61 | # Move to appropriate device 62 | if device.type == "cuda": 63 | pipeline = pipeline.to(torch.device(device)) 64 | 65 | return pipeline 66 | except Exception as e: 67 | logger.error(f"Error initializing diarization pipeline: {e}") 68 | return None 69 | 70 | 71 | def diarize_audio(audio_path, pipeline=None, num_speakers=None, use_gpu=True, hf_token=None): 72 | """ 73 | Perform speaker diarization on an audio file. 
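Requires access to the pyannote/speaker-diarization-3.0 model on HuggingFace; pass hf_token explicitly or set the HF_TOKEN environment variable (see INSTALLATION.md).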
74 | 75 | Args: 76 | audio_path (Path): Path to the audio file 77 | pipeline (Pipeline, optional): Pre-initialized diarization pipeline 78 | num_speakers (int, optional): Number of speakers (if known) 79 | use_gpu (bool): Whether to use GPU acceleration if available 80 | hf_token (str, optional): HuggingFace API token 81 | 82 | Returns: 83 | dict: Dictionary mapping time segments to speaker IDs 84 | """ 85 | audio_path = Path(audio_path) 86 | 87 | # Initialize pipeline if not provided 88 | if pipeline is None: 89 | pipeline = get_diarization_pipeline(use_gpu, hf_token) 90 | if pipeline is None: 91 | return None 92 | 93 | try: 94 | # Run diarization 95 | logger.info(f"Running speaker diarization on {audio_path}") 96 | diarization = pipeline(audio_path, num_speakers=num_speakers) 97 | 98 | # Extract speaker segments 99 | speaker_segments = {} 100 | for turn, _, speaker in diarization.itertracks(yield_label=True): 101 | segment = (turn.start, turn.end) 102 | speaker_segments[segment] = speaker 103 | 104 | return speaker_segments 105 | except Exception as e: 106 | logger.error(f"Error during diarization: {e}") 107 | return None 108 | 109 | 110 | def apply_diarization_to_transcript(transcript_segments, speaker_segments): 111 | """ 112 | Apply speaker diarization results to transcript segments. 113 | 114 | Args: 115 | transcript_segments (list): List of transcript segments with timing info 116 | speaker_segments (dict): Dictionary mapping time segments to speaker IDs 117 | 118 | Returns: 119 | list: Updated transcript segments with speaker information 120 | """ 121 | if not speaker_segments: 122 | return transcript_segments 123 | 124 | # Convert speaker segments to a more usable format 125 | speaker_ranges = [(Segment(start, end), speaker) 126 | for (start, end), speaker in speaker_segments.items()] 127 | 128 | # Update transcript segments with speaker information 129 | for segment in transcript_segments: 130 | segment_start = segment['start'] 131 | segment_end = segment['end'] 132 | segment_range = Segment(segment_start, segment_end) 133 | 134 | # Find overlapping speaker segments 135 | overlaps = [] 136 | for (spk_range, speaker) in speaker_ranges: 137 | overlap = segment_range.intersect(spk_range) 138 | if overlap: 139 | overlaps.append((overlap.duration, speaker)) 140 | 141 | # Assign the speaker with the most overlap 142 | if overlaps: 143 | overlaps.sort(reverse=True) # Sort by duration (descending) 144 | segment['speaker'] = overlaps[0][1] 145 | else: 146 | segment['speaker'] = "UNKNOWN" 147 | 148 | return transcript_segments 149 | 150 | 151 | def format_transcript_with_speakers(transcript_segments): 152 | """ 153 | Format transcript with speaker labels. 154 | 155 | Args: 156 | transcript_segments (list): List of transcript segments with speaker info 157 | 158 | Returns: 159 | str: Formatted transcript with speaker labels 160 | """ 161 | formatted_lines = [] 162 | current_speaker = None 163 | 164 | for segment in transcript_segments: 165 | speaker = segment.get('speaker', 'UNKNOWN') 166 | text = segment['text'].strip() 167 | 168 | # Add speaker label when speaker changes 169 | if speaker != current_speaker: 170 | formatted_lines.append(f"\n[{speaker}]") 171 | current_speaker = speaker 172 | 173 | formatted_lines.append(text) 174 | 175 | return " ".join(formatted_lines) 176 | 177 | 178 | def transcribe_with_diarization(audio_path, whisper_model="base", num_speakers=None, 179 | use_gpu=True, hf_token=None): 180 | """ 181 | Transcribe audio with speaker diarization. 
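    Runs Whisper first, then pyannote diarization, and assigns each transcript
    segment the speaker with the largest time overlap; if diarization is not
    available it falls back to the plain transcript. A minimal usage sketch
    (the file name is illustrative and HF_TOKEN is assumed to be set):

        segments, transcript = transcribe_with_diarization("meeting.mp4",
                                                           whisper_model="base",
                                                           num_speakers=2)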
182 | 183 | Args: 184 | audio_path (Path): Path to the audio file 185 | whisper_model (str): Whisper model size to use 186 | num_speakers (int, optional): Number of speakers (if known) 187 | use_gpu (bool): Whether to use GPU acceleration if available 188 | hf_token (str, optional): HuggingFace API token 189 | 190 | Returns: 191 | tuple: (diarized_segments, formatted_transcript) 192 | """ 193 | audio_path = Path(audio_path) 194 | 195 | # Configure device 196 | device = torch.device("cpu") 197 | if use_gpu and GPU_UTILS_AVAILABLE: 198 | device = get_optimal_device() 199 | 200 | try: 201 | # Step 1: Transcribe audio with Whisper 202 | logger.info(f"Transcribing audio with Whisper model: {whisper_model}") 203 | model = whisper.load_model(whisper_model, device=device if device.type != "mps" else "cpu") 204 | result = model.transcribe(str(audio_path)) 205 | transcript_segments = result["segments"] 206 | 207 | # Step 2: Perform speaker diarization 208 | logger.info("Performing speaker diarization") 209 | pipeline = get_diarization_pipeline(use_gpu, hf_token) 210 | if pipeline is None: 211 | logger.warning("Diarization pipeline not available, returning transcript without speakers") 212 | return transcript_segments, result["text"] 213 | 214 | speaker_segments = diarize_audio(audio_path, pipeline, num_speakers, use_gpu) 215 | 216 | # Step 3: Apply diarization to transcript 217 | if speaker_segments: 218 | diarized_segments = apply_diarization_to_transcript(transcript_segments, speaker_segments) 219 | formatted_transcript = format_transcript_with_speakers(diarized_segments) 220 | return diarized_segments, formatted_transcript 221 | else: 222 | return transcript_segments, result["text"] 223 | 224 | except Exception as e: 225 | logger.error(f"Error in transcribe_with_diarization: {e}") 226 | return None, None -------------------------------------------------------------------------------- /utils/export.py: -------------------------------------------------------------------------------- 1 | """ 2 | Subtitle export utilities for the OBS Recording Transcriber. 3 | Supports exporting transcripts to SRT, ASS, and WebVTT subtitle formats. 4 | """ 5 | 6 | from pathlib import Path 7 | import re 8 | from datetime import timedelta 9 | import gzip 10 | import zipfile 11 | import logging 12 | 13 | # Configure logging 14 | logging.basicConfig(level=logging.INFO) 15 | logger = logging.getLogger(__name__) 16 | 17 | 18 | def format_timestamp_srt(timestamp_ms): 19 | """ 20 | Format a timestamp in milliseconds to SRT format (HH:MM:SS,mmm). 21 | 22 | Args: 23 | timestamp_ms (int): Timestamp in milliseconds 24 | 25 | Returns: 26 | str: Formatted timestamp string 27 | """ 28 | hours, remainder = divmod(timestamp_ms, 3600000) 29 | minutes, remainder = divmod(remainder, 60000) 30 | seconds, milliseconds = divmod(remainder, 1000) 31 | return f"{int(hours):02d}:{int(minutes):02d}:{int(seconds):02d},{int(milliseconds):03d}" 32 | 33 | 34 | def format_timestamp_ass(timestamp_ms): 35 | """ 36 | Format a timestamp in milliseconds to ASS format (H:MM:SS.cc). 
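    Unlike the SRT variant, hours are not zero-padded and milliseconds are
    truncated to centiseconds, so 3723450 ms becomes "1:02:03.45".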
37 | 38 | Args: 39 | timestamp_ms (int): Timestamp in milliseconds 40 | 41 | Returns: 42 | str: Formatted timestamp string 43 | """ 44 | hours, remainder = divmod(timestamp_ms, 3600000) 45 | minutes, remainder = divmod(remainder, 60000) 46 | seconds, remainder = divmod(remainder, 1000) 47 | centiseconds = remainder // 10 48 | return f"{int(hours)}:{int(minutes):02d}:{int(seconds):02d}.{int(centiseconds):02d}" 49 | 50 | 51 | def format_timestamp_vtt(timestamp_ms): 52 | """ 53 | Format a timestamp in milliseconds to WebVTT format (HH:MM:SS.mmm). 54 | 55 | Args: 56 | timestamp_ms (int): Timestamp in milliseconds 57 | 58 | Returns: 59 | str: Formatted timestamp string 60 | """ 61 | hours, remainder = divmod(timestamp_ms, 3600000) 62 | minutes, remainder = divmod(remainder, 60000) 63 | seconds, milliseconds = divmod(remainder, 1000) 64 | return f"{int(hours):02d}:{int(minutes):02d}:{int(seconds):02d}.{int(milliseconds):03d}" 65 | 66 | 67 | def export_to_srt(segments, output_path): 68 | """ 69 | Export transcript segments to SRT format. 70 | 71 | Args: 72 | segments (list): List of transcript segments with start, end, and text 73 | output_path (Path): Path to save the SRT file 74 | 75 | Returns: 76 | Path: Path to the saved SRT file 77 | """ 78 | with open(output_path, 'w', encoding='utf-8') as f: 79 | for i, segment in enumerate(segments, 1): 80 | start_time = format_timestamp_srt(int(segment['start'] * 1000)) 81 | end_time = format_timestamp_srt(int(segment['end'] * 1000)) 82 | 83 | f.write(f"{i}\n") 84 | f.write(f"{start_time} --> {end_time}\n") 85 | f.write(f"{segment['text'].strip()}\n\n") 86 | 87 | return output_path 88 | 89 | 90 | def export_to_ass(segments, output_path, video_width=1920, video_height=1080, style=None): 91 | """ 92 | Export transcript segments to ASS format with styling. 
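    Any keys in the optional style dict override the defaults defined below.
    A minimal usage sketch (the file name and style values are illustrative;
    alignment "8" is top centre in ASS):

        export_to_ass(segments, Path("talk.ass"),
                      style={"fontsize": "36", "alignment": "8"})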
93 | 94 | Args: 95 | segments (list): List of transcript segments with start, end, and text 96 | output_path (Path): Path to save the ASS file 97 | video_width (int): Width of the video in pixels 98 | video_height (int): Height of the video in pixels 99 | style (dict, optional): Custom style parameters 100 | 101 | Returns: 102 | Path: Path to the saved ASS file 103 | """ 104 | # Default style 105 | default_style = { 106 | "fontname": "Arial", 107 | "fontsize": "48", 108 | "primary_color": "&H00FFFFFF", # White 109 | "secondary_color": "&H000000FF", # Blue 110 | "outline_color": "&H00000000", # Black 111 | "back_color": "&H80000000", # Semi-transparent black 112 | "bold": "-1", # True 113 | "italic": "0", # False 114 | "alignment": "2", # Bottom center 115 | } 116 | 117 | # Apply custom style if provided 118 | if style: 119 | default_style.update(style) 120 | 121 | # ASS header template 122 | ass_header = f"""[Script Info] 123 | Title: Transcription 124 | ScriptType: v4.00+ 125 | WrapStyle: 0 126 | PlayResX: {video_width} 127 | PlayResY: {video_height} 128 | ScaledBorderAndShadow: yes 129 | 130 | [V4+ Styles] 131 | Format: Name, Fontname, Fontsize, PrimaryColour, SecondaryColour, OutlineColour, BackColour, Bold, Italic, Underline, StrikeOut, ScaleX, ScaleY, Spacing, Angle, BorderStyle, Outline, Shadow, Alignment, MarginL, MarginR, MarginV, Encoding 132 | Style: Default,{default_style['fontname']},{default_style['fontsize']},{default_style['primary_color']},{default_style['secondary_color']},{default_style['outline_color']},{default_style['back_color']},{default_style['bold']},{default_style['italic']},0,0,100,100,0,0,1,2,2,{default_style['alignment']},10,10,10,1 133 | 134 | [Events] 135 | Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text 136 | """ 137 | 138 | with open(output_path, 'w', encoding='utf-8') as f: 139 | f.write(ass_header) 140 | 141 | for segment in segments: 142 | start_time = format_timestamp_ass(int(segment['start'] * 1000)) 143 | end_time = format_timestamp_ass(int(segment['end'] * 1000)) 144 | text = segment['text'].strip().replace('\n', '\\N') 145 | 146 | f.write(f"Dialogue: 0,{start_time},{end_time},Default,,0,0,0,,{text}\n") 147 | 148 | return output_path 149 | 150 | 151 | def export_to_vtt(segments, output_path): 152 | """ 153 | Export transcript segments to WebVTT format. 154 | 155 | Args: 156 | segments (list): List of transcript segments with start, end, and text 157 | output_path (Path): Path to save the WebVTT file 158 | 159 | Returns: 160 | Path: Path to the saved WebVTT file 161 | """ 162 | with open(output_path, 'w', encoding='utf-8') as f: 163 | # WebVTT header 164 | f.write("WEBVTT\n\n") 165 | 166 | for i, segment in enumerate(segments, 1): 167 | start_time = format_timestamp_vtt(int(segment['start'] * 1000)) 168 | end_time = format_timestamp_vtt(int(segment['end'] * 1000)) 169 | 170 | # Optional cue identifier 171 | f.write(f"{i}\n") 172 | f.write(f"{start_time} --> {end_time}\n") 173 | f.write(f"{segment['text'].strip()}\n\n") 174 | 175 | return output_path 176 | 177 | 178 | def transcript_to_segments(transcript, segment_duration=5.0): 179 | """ 180 | Convert a plain transcript to timed segments for subtitle export. 181 | Used when the original segments are not available. 
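    Timing is only an estimate: each sentence is assumed to last
    max(2.0, word_count / 2.5) seconds, so a 10-word sentence gets 4.0 seconds
    and very short sentences are padded to 2.0 seconds. The segment_duration
    argument is currently not used by this estimate.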
182 | 183 | Args: 184 | transcript (str): Full transcript text 185 | segment_duration (float): Duration of each segment in seconds 186 | 187 | Returns: 188 | list: List of segments with start, end, and text 189 | """ 190 | # Split transcript into sentences 191 | sentences = re.split(r'(?<=[.!?])\s+', transcript) 192 | segments = [] 193 | 194 | current_time = 0.0 195 | for sentence in sentences: 196 | if not sentence.strip(): 197 | continue 198 | 199 | # Estimate duration based on word count (approx. 2.5 words per second) 200 | word_count = len(sentence.split()) 201 | duration = max(2.0, word_count / 2.5) 202 | 203 | segments.append({ 204 | 'start': current_time, 205 | 'end': current_time + duration, 206 | 'text': sentence 207 | }) 208 | 209 | current_time += duration 210 | 211 | return segments 212 | 213 | 214 | def compress_file(input_path, compression_type='gzip'): 215 | """ 216 | Compress a file using the specified compression method. 217 | 218 | Args: 219 | input_path (Path): Path to the file to compress 220 | compression_type (str): Type of compression ('gzip' or 'zip') 221 | 222 | Returns: 223 | Path: Path to the compressed file 224 | """ 225 | input_path = Path(input_path) 226 | 227 | if compression_type == 'gzip': 228 | output_path = input_path.with_suffix(input_path.suffix + '.gz') 229 | with open(input_path, 'rb') as f_in: 230 | with gzip.open(output_path, 'wb') as f_out: 231 | f_out.write(f_in.read()) 232 | return output_path 233 | 234 | elif compression_type == 'zip': 235 | output_path = input_path.with_suffix('.zip') 236 | with zipfile.ZipFile(output_path, 'w', zipfile.ZIP_DEFLATED) as zipf: 237 | zipf.write(input_path, arcname=input_path.name) 238 | return output_path 239 | 240 | else: 241 | logger.warning(f"Unsupported compression type: {compression_type}") 242 | return input_path 243 | 244 | 245 | def export_transcript(transcript, output_path, format_type='srt', segments=None, 246 | compress=False, compression_type='gzip', style=None): 247 | """ 248 | Export transcript to the specified subtitle format. 249 | 250 | Args: 251 | transcript (str): Full transcript text 252 | output_path (Path): Base path for the output file (without extension) 253 | format_type (str): 'srt', 'ass', or 'vtt' 254 | segments (list, optional): List of transcript segments with timing information 255 | compress (bool): Whether to compress the output file 256 | compression_type (str): Type of compression ('gzip' or 'zip') 257 | style (dict, optional): Custom style parameters for ASS format 258 | 259 | Returns: 260 | Path: Path to the saved subtitle file 261 | """ 262 | output_path = Path(output_path) 263 | 264 | # If segments are not provided, create them from the transcript 265 | if segments is None: 266 | segments = transcript_to_segments(transcript) 267 | 268 | if format_type.lower() == 'srt': 269 | output_file = output_path.with_suffix('.srt') 270 | result_path = export_to_srt(segments, output_file) 271 | elif format_type.lower() == 'ass': 272 | output_file = output_path.with_suffix('.ass') 273 | result_path = export_to_ass(segments, output_file, style=style) 274 | elif format_type.lower() == 'vtt': 275 | output_file = output_path.with_suffix('.vtt') 276 | result_path = export_to_vtt(segments, output_file) 277 | else: 278 | raise ValueError(f"Unsupported format type: {format_type}. 
Use 'srt', 'ass', or 'vtt'.") 279 | 280 | # Compress the file if requested 281 | if compress: 282 | result_path = compress_file(result_path, compression_type) 283 | 284 | return result_path -------------------------------------------------------------------------------- /utils/gpu_utils.py: -------------------------------------------------------------------------------- 1 | """ 2 | GPU utilities for the OBS Recording Transcriber. 3 | Provides functions to detect and configure GPU acceleration. 4 | """ 5 | 6 | import logging 7 | import os 8 | import platform 9 | import subprocess 10 | import torch 11 | 12 | # Configure logging 13 | logging.basicConfig(level=logging.INFO) 14 | logger = logging.getLogger(__name__) 15 | 16 | 17 | def get_gpu_info(): 18 | """ 19 | Get information about available GPUs. 20 | 21 | Returns: 22 | dict: Information about available GPUs 23 | """ 24 | gpu_info = { 25 | "cuda_available": torch.cuda.is_available(), 26 | "cuda_device_count": torch.cuda.device_count() if torch.cuda.is_available() else 0, 27 | "cuda_devices": [], 28 | "mps_available": hasattr(torch.backends, "mps") and torch.backends.mps.is_available() 29 | } 30 | 31 | # Get CUDA device information 32 | if gpu_info["cuda_available"]: 33 | for i in range(gpu_info["cuda_device_count"]): 34 | device_props = torch.cuda.get_device_properties(i) 35 | gpu_info["cuda_devices"].append({ 36 | "index": i, 37 | "name": device_props.name, 38 | "total_memory": device_props.total_memory, 39 | "compute_capability": f"{device_props.major}.{device_props.minor}" 40 | }) 41 | 42 | return gpu_info 43 | 44 | 45 | def get_optimal_device(): 46 | """ 47 | Get the optimal device for computation. 48 | 49 | Returns: 50 | torch.device: The optimal device (cuda, mps, or cpu) 51 | """ 52 | if torch.cuda.is_available(): 53 | # If multiple GPUs are available, select the one with the most memory 54 | if torch.cuda.device_count() > 1: 55 | max_memory = 0 56 | best_device = 0 57 | for i in range(torch.cuda.device_count()): 58 | device_props = torch.cuda.get_device_properties(i) 59 | if device_props.total_memory > max_memory: 60 | max_memory = device_props.total_memory 61 | best_device = i 62 | return torch.device(f"cuda:{best_device}") 63 | return torch.device("cuda:0") 64 | elif hasattr(torch.backends, "mps") and torch.backends.mps.is_available(): 65 | return torch.device("mps") 66 | else: 67 | return torch.device("cpu") 68 | 69 | 70 | def set_memory_limits(memory_fraction=0.8): 71 | global torch 72 | import torch 73 | """ 74 | Set memory limits for GPU usage. 75 | 76 | Args: 77 | memory_fraction (float): Fraction of GPU memory to use (0.0 to 1.0) 78 | 79 | Returns: 80 | bool: True if successful, False otherwise 81 | """ 82 | if not torch.cuda.is_available(): 83 | return False 84 | 85 | try: 86 | # Set memory fraction for each device 87 | for i in range(torch.cuda.device_count()): 88 | torch.cuda.set_per_process_memory_fraction(memory_fraction, i) 89 | 90 | return True 91 | except Exception as e: 92 | logger.error(f"Error setting memory limits: {e}") 93 | return False 94 | 95 | 96 | def optimize_for_inference(): 97 | """ 98 | Apply optimizations for inference. 
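    Globally disables autograd and enables cuDNN benchmarking, so it should be
    called once before inference starts rather than from code that still needs
    gradients.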
99 | 100 | Returns: 101 | bool: True if successful, False otherwise 102 | """ 103 | try: 104 | # Set deterministic algorithms for reproducibility 105 | torch.backends.cudnn.deterministic = True 106 | 107 | # Enable cuDNN benchmark mode for optimized performance 108 | torch.backends.cudnn.benchmark = True 109 | 110 | # Disable gradient calculation for inference 111 | torch.set_grad_enabled(False) 112 | 113 | return True 114 | except Exception as e: 115 | logger.error(f"Error optimizing for inference: {e}") 116 | return False 117 | 118 | 119 | def get_recommended_batch_size(model_size="base"): 120 | """ 121 | Get recommended batch size based on available GPU memory. 122 | 123 | Args: 124 | model_size (str): Size of the model (tiny, base, small, medium, large) 125 | 126 | Returns: 127 | int: Recommended batch size 128 | """ 129 | # Default batch sizes for CPU 130 | default_batch_sizes = { 131 | "tiny": 16, 132 | "base": 8, 133 | "small": 4, 134 | "medium": 2, 135 | "large": 1 136 | } 137 | 138 | # If CUDA is not available, return default CPU batch size 139 | if not torch.cuda.is_available(): 140 | return default_batch_sizes.get(model_size, 1) 141 | 142 | # Approximate memory requirements in GB for different model sizes 143 | memory_requirements = { 144 | "tiny": 1, 145 | "base": 2, 146 | "small": 4, 147 | "medium": 8, 148 | "large": 16 149 | } 150 | 151 | # Get available GPU memory 152 | device = get_optimal_device() 153 | if device.type == "cuda": 154 | device_idx = device.index 155 | device_props = torch.cuda.get_device_properties(device_idx) 156 | available_memory_gb = device_props.total_memory / (1024 ** 3) 157 | 158 | # Calculate batch size based on available memory 159 | model_memory = memory_requirements.get(model_size, 2) 160 | max_batch_size = int(available_memory_gb / model_memory) 161 | 162 | # Ensure batch size is at least 1 163 | return max(1, max_batch_size) 164 | 165 | # For MPS or other devices, return default 166 | return default_batch_sizes.get(model_size, 1) 167 | 168 | 169 | def configure_gpu(model_size="base", memory_fraction=0.8): 170 | """ 171 | Configure GPU settings for optimal performance. 172 | 173 | Args: 174 | model_size (str): Size of the model (tiny, base, small, medium, large) 175 | memory_fraction (float): Fraction of GPU memory to use (0.0 to 1.0) 176 | 177 | Returns: 178 | dict: Configuration information 179 | """ 180 | gpu_info = get_gpu_info() 181 | device = get_optimal_device() 182 | 183 | # Set memory limits if using CUDA 184 | if device.type == "cuda": 185 | set_memory_limits(memory_fraction) 186 | 187 | # Apply inference optimizations 188 | optimize_for_inference() 189 | 190 | # Get recommended batch size 191 | batch_size = get_recommended_batch_size(model_size) 192 | 193 | config = { 194 | "device": device, 195 | "batch_size": batch_size, 196 | "gpu_info": gpu_info, 197 | "memory_fraction": memory_fraction if device.type == "cuda" else None 198 | } 199 | 200 | logger.info(f"GPU configuration: Using {device} with batch size {batch_size}") 201 | return config -------------------------------------------------------------------------------- /utils/keyword_extraction.py: -------------------------------------------------------------------------------- 1 | """ 2 | Keyword extraction utilities for the OBS Recording Transcriber. 3 | Provides functions to extract keywords and link them to timestamps. 
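Keywords are scored with TF-IDF over the transcript sentences, named entities
come from a Hugging Face token-classification pipeline, and both are mapped
back to the transcript segments in which they occur.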
4 | """ 5 | 6 | import logging 7 | import re 8 | import torch 9 | import numpy as np 10 | from pathlib import Path 11 | from transformers import pipeline, AutoTokenizer, AutoModelForTokenClassification 12 | from sklearn.feature_extraction.text import TfidfVectorizer 13 | from collections import Counter 14 | 15 | # Configure logging 16 | logging.basicConfig(level=logging.INFO) 17 | logger = logging.getLogger(__name__) 18 | 19 | # Try to import GPU utilities, but don't fail if not available 20 | try: 21 | from utils.gpu_utils import get_optimal_device 22 | GPU_UTILS_AVAILABLE = True 23 | except ImportError: 24 | GPU_UTILS_AVAILABLE = False 25 | 26 | # Default models 27 | NER_MODEL = "dslim/bert-base-NER" 28 | 29 | 30 | def extract_keywords_tfidf(text, max_keywords=10, ngram_range=(1, 2)): 31 | """ 32 | Extract keywords using TF-IDF. 33 | 34 | Args: 35 | text (str): Text to extract keywords from 36 | max_keywords (int): Maximum number of keywords to extract 37 | ngram_range (tuple): Range of n-grams to consider 38 | 39 | Returns: 40 | list: List of (keyword, score) tuples 41 | """ 42 | try: 43 | # Preprocess text 44 | text = text.lower() 45 | 46 | # Remove common stopwords 47 | stopwords = {'a', 'an', 'the', 'and', 'or', 'but', 'if', 'because', 'as', 'what', 48 | 'when', 'where', 'how', 'who', 'which', 'this', 'that', 'these', 'those', 49 | 'then', 'just', 'so', 'than', 'such', 'both', 'through', 'about', 'for', 50 | 'is', 'of', 'while', 'during', 'to', 'from', 'in', 'out', 'on', 'off', 'by'} 51 | 52 | # Create sentences for better TF-IDF analysis 53 | sentences = re.split(r'[.!?]', text) 54 | sentences = [s.strip() for s in sentences if s.strip()] 55 | 56 | if not sentences: 57 | return [] 58 | 59 | # Apply TF-IDF 60 | vectorizer = TfidfVectorizer( 61 | max_features=100, 62 | stop_words=stopwords, 63 | ngram_range=ngram_range 64 | ) 65 | 66 | try: 67 | tfidf_matrix = vectorizer.fit_transform(sentences) 68 | feature_names = vectorizer.get_feature_names_out() 69 | 70 | # Calculate average TF-IDF score across all sentences 71 | avg_tfidf = np.mean(tfidf_matrix.toarray(), axis=0) 72 | 73 | # Get top keywords 74 | keywords = [(feature_names[i], avg_tfidf[i]) for i in avg_tfidf.argsort()[::-1]] 75 | 76 | # Filter out single-character keywords and limit to max_keywords 77 | keywords = [(k, s) for k, s in keywords if len(k) > 1][:max_keywords] 78 | 79 | return keywords 80 | except ValueError as e: 81 | logger.warning(f"TF-IDF extraction failed: {e}") 82 | return [] 83 | 84 | except Exception as e: 85 | logger.error(f"Error extracting keywords with TF-IDF: {e}") 86 | return [] 87 | 88 | 89 | def extract_named_entities(text, model=NER_MODEL, use_gpu=True): 90 | """ 91 | Extract named entities from text. 
92 | 93 | Args: 94 | text (str): Text to extract entities from 95 | model (str): Model to use for NER 96 | use_gpu (bool): Whether to use GPU acceleration if available 97 | 98 | Returns: 99 | list: List of (entity, type) tuples 100 | """ 101 | # Configure device 102 | device = torch.device("cpu") 103 | if use_gpu and GPU_UTILS_AVAILABLE: 104 | device = get_optimal_device() 105 | device_arg = 0 if device.type == "cuda" else -1 106 | else: 107 | device_arg = -1 108 | 109 | try: 110 | # Initialize the pipeline 111 | ner_pipeline = pipeline("ner", model=model, device=device_arg, aggregation_strategy="simple") 112 | 113 | # Split text into manageable chunks if too long 114 | max_length = 512 115 | if len(text) > max_length: 116 | chunks = [text[i:i+max_length] for i in range(0, len(text), max_length)] 117 | else: 118 | chunks = [text] 119 | 120 | # Process each chunk 121 | all_entities = [] 122 | for chunk in chunks: 123 | entities = ner_pipeline(chunk) 124 | all_entities.extend(entities) 125 | 126 | # Extract entity text and type 127 | entity_info = [(entity["word"], entity["entity_group"]) for entity in all_entities] 128 | 129 | return entity_info 130 | except Exception as e: 131 | logger.error(f"Error extracting named entities: {e}") 132 | return [] 133 | 134 | 135 | def find_keyword_timestamps(segments, keywords): 136 | """ 137 | Find timestamps for keywords in transcript segments. 138 | 139 | Args: 140 | segments (list): List of transcript segments with timing info 141 | keywords (list): List of keywords to find 142 | 143 | Returns: 144 | dict: Dictionary mapping keywords to lists of timestamps 145 | """ 146 | keyword_timestamps = {} 147 | 148 | # Convert keywords to lowercase for case-insensitive matching 149 | if isinstance(keywords[0], tuple): 150 | # If keywords is a list of (keyword, score) tuples 151 | keywords_lower = [k.lower() for k, _ in keywords] 152 | else: 153 | # If keywords is just a list of keywords 154 | keywords_lower = [k.lower() for k in keywords] 155 | 156 | # Process each segment 157 | for segment in segments: 158 | segment_text = segment["text"].lower() 159 | start_time = segment["start"] 160 | end_time = segment["end"] 161 | 162 | # Check each keyword 163 | for i, keyword in enumerate(keywords_lower): 164 | if keyword in segment_text: 165 | # Get the original case of the keyword 166 | original_keyword = keywords[i][0] if isinstance(keywords[0], tuple) else keywords[i] 167 | 168 | # Initialize the list if this is the first occurrence 169 | if original_keyword not in keyword_timestamps: 170 | keyword_timestamps[original_keyword] = [] 171 | 172 | # Add the timestamp 173 | keyword_timestamps[original_keyword].append({ 174 | "start": start_time, 175 | "end": end_time, 176 | "context": segment["text"] 177 | }) 178 | 179 | return keyword_timestamps 180 | 181 | 182 | def extract_keywords_from_transcript(transcript, segments, max_keywords=15, use_gpu=True): 183 | """ 184 | Extract keywords from transcript and link them to timestamps. 
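    Combines the TF-IDF keywords with the most frequent named entities, then
    looks up the segments in which each one occurs. A minimal usage sketch:

        keywords, entities = extract_keywords_from_transcript(transcript, segments)
        index_md = generate_keyword_index(keywords, entities)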
185 | 186 | Args: 187 | transcript (str): Full transcript text 188 | segments (list): List of transcript segments with timing info 189 | max_keywords (int): Maximum number of keywords to extract 190 | use_gpu (bool): Whether to use GPU acceleration if available 191 | 192 | Returns: 193 | tuple: (keyword_timestamps, entities_with_timestamps) 194 | """ 195 | try: 196 | # Extract keywords using TF-IDF 197 | tfidf_keywords = extract_keywords_tfidf(transcript, max_keywords=max_keywords) 198 | 199 | # Extract named entities 200 | entities = extract_named_entities(transcript, use_gpu=use_gpu) 201 | 202 | # Count entity occurrences and get the most frequent ones 203 | entity_counter = Counter([entity for entity, _ in entities]) 204 | top_entities = [(entity, count) for entity, count in entity_counter.most_common(max_keywords)] 205 | 206 | # Find timestamps for keywords and entities 207 | keyword_timestamps = find_keyword_timestamps(segments, tfidf_keywords) 208 | entity_timestamps = find_keyword_timestamps(segments, top_entities) 209 | 210 | return keyword_timestamps, entity_timestamps 211 | 212 | except Exception as e: 213 | logger.error(f"Error extracting keywords from transcript: {e}") 214 | return {}, {} 215 | 216 | 217 | def generate_keyword_index(keyword_timestamps, entity_timestamps=None): 218 | """ 219 | Generate a keyword index with timestamps. 220 | 221 | Args: 222 | keyword_timestamps (dict): Dictionary mapping keywords to timestamp lists 223 | entity_timestamps (dict, optional): Dictionary mapping entities to timestamp lists 224 | 225 | Returns: 226 | str: Formatted keyword index 227 | """ 228 | lines = ["# Keyword Index\n"] 229 | 230 | # Add keywords section 231 | if keyword_timestamps: 232 | lines.append("## Keywords\n") 233 | for keyword, timestamps in sorted(keyword_timestamps.items()): 234 | if timestamps: 235 | times = [f"{int(ts['start'] // 60):02d}:{int(ts['start'] % 60):02d}" for ts in timestamps] 236 | lines.append(f"- **{keyword}**: {', '.join(times)}\n") 237 | 238 | # Add entities section 239 | if entity_timestamps: 240 | lines.append("\n## Named Entities\n") 241 | for entity, timestamps in sorted(entity_timestamps.items()): 242 | if timestamps: 243 | times = [f"{int(ts['start'] // 60):02d}:{int(ts['start'] % 60):02d}" for ts in timestamps] 244 | lines.append(f"- **{entity}**: {', '.join(times)}\n") 245 | 246 | return "".join(lines) 247 | 248 | 249 | def generate_interactive_transcript(segments, keyword_timestamps=None, entity_timestamps=None): 250 | """ 251 | Generate an interactive transcript with keyword highlighting. 252 | 253 | Args: 254 | segments (list): List of transcript segments with timing info 255 | keyword_timestamps (dict, optional): Dictionary mapping keywords to timestamp lists 256 | entity_timestamps (dict, optional): Dictionary mapping entities to timestamp lists 257 | 258 | Returns: 259 | str: HTML formatted interactive transcript 260 | """ 261 | # Combine keywords and entities 262 | all_keywords = {} 263 | if keyword_timestamps: 264 | all_keywords.update(keyword_timestamps) 265 | if entity_timestamps: 266 | all_keywords.update(entity_timestamps) 267 | 268 | # Generate HTML 269 | html = ["
") 293 | html.append(f"
") 295 | 296 | html.append(" {speaker_html}{highlighted_text}") 294 | html.append("