├── LICENSE
├── README.md
├── acent.py
├── app.py
├── bot.py
├── isfahan.json
├── libraries.docx
├── main.py
├── shirazi.json
├── src
│   ├── __pycache__
│   │   ├── acent.cpython-38.pyc
│   │   ├── asr_utils.cpython-38.pyc
│   │   ├── audio_utils.cpython-38.pyc
│   │   ├── audio_utils.cpython-39.pyc
│   │   ├── make_it_correct.cpython-38.pyc
│   │   └── subtitle_utils.cpython-38.pyc
│   ├── acent.py
│   ├── asr_main.py
│   ├── asr_utils.py
│   ├── audio_utils.py
│   ├── helpers.py
│   ├── make_it_correct.py
│   ├── subtitle_utils.py
│   ├── system_prompt.txt
│   └── translation_utils.py
├── static
│   └── uploads
│       └── fc67d12e-a983-48e1-aa8c-5ddaa026588a
│           └── sample1.mp4
└── templates
    ├── home.html
    └── output.html

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 | 
3 | Copyright (c) 2025 Taha Bakhtari
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Video Processing and Subtitle Generation Bot
2 | 
3 | ## Overview
4 | 
5 | This project is a video processing bot that extracts audio from video files, performs speech recognition, generates subtitles, and supports transcription correction and Persian dialect conversion (Isfahani and Shirazi). It relies on dedicated libraries for audio processing, speech recognition, and translation.
6 | 
7 | ## Features
8 | 
9 | - **Audio Extraction**: Extracts audio from video files.
10 | - **Speech Recognition**: Converts audio to text, using VAD (Voice Activity Detection) to split the audio into speech segments first.
11 | - **Subtitle Generation**: Creates SRT files for subtitles.
12 | - **Text Correction**: Uses OpenAI's API to correct transcription errors.
13 | - **Dialect Conversion**: Converts subtitles to a selected dialect.
14 | - **Web Interface**: A Flask-based web interface for user interaction.
15 | 
16 | ## Requirements
17 | 
18 | - Python 3.8+
19 | - Required libraries:
20 |   - `flask`
21 |   - `pyrogram`
22 |   - `librosa`
23 |   - `soundfile`
24 |   - `numpy`
25 |   - `deep_translator`
26 |   - `transformers`
27 |   - `torch`
28 |   - `openai`
29 |   - `noisereduce`
30 |   - `pyloudnorm`
31 |   - `av`
32 |   - `scipy`
33 |   - `werkzeug`
34 | 
35 | ## Installation
36 | 
37 | 1. Clone the repository:
38 | 
39 |    ```bash
40 |    git clone <repository-url>
41 |    cd <repository-folder>
42 |    ```
43 | 
44 | 2. Install the required dependencies (a sample `pip` command is sketched after this README).
45 | 
46 | 3. Set up your OpenAI API key in the `make_it_correct.py` file.
47 | 
48 | 4. Create the directories used for input and output files:
49 | 
50 |    ```bash
51 |    mkdir -p data/input_videos data/audio_outputs data/text_outputs data/subtitles data/chunks
52 |    ```
53 | 
54 | ## Usage
55 | 
56 | ### Running the Bot
57 | 
58 | To run the Telegram bot, execute:
59 | 
60 | ```bash
61 | python bot.py
62 | ```
63 | 
64 | ### Web Interface
65 | 
66 | To start the web interface, run:
67 | 
68 | ```bash
69 | python app.py
70 | ```
71 | 
72 | Then navigate to `http://localhost:5000` in your web browser.
73 | 
74 | ### Commands
75 | 
76 | - **/start**: Start the bot and receive instructions.
77 | - **Upload a video**: Send a video file to the bot for processing.
78 | 
79 | ### Processing Steps
80 | 
81 | 1. Upload a video file.
82 | 2. Select the original language of the video.
83 | 3. Choose the target language for subtitles.
84 | 4. Decide whether to enhance audio quality.
85 | 5. Choose whether to correct the subtitles.
86 | 6. Select the desired dialect for the subtitles.
87 | 
88 | ## Contributing
89 | 
90 | Contributions are welcome! Please feel free to submit a pull request or open an issue with suggestions or improvements.
91 | 
92 | ## License
93 | 
94 | This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
95 | 
96 | ## Acknowledgments
97 | 
98 | - [OpenAI](https://openai.com/) for providing the API for text correction.
99 | - [Hugging Face](https://huggingface.co/) for the speech recognition models.
100 | - [Deep Translator](https://pypi.org/project/deep-translator/) for translation capabilities.
101 | 
102 | https://github.com/user-attachments/assets/a8bd37d0-7c1f-4df3-876e-2319bcb995d8
103 | 
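Step 2 of the installation leaves the package list implicit. A minimal sketch, assuming everything is installed from PyPI under its usual name (the `deep_translator` module is published as `deep-translator`; `torch` and `scipy` are needed by `src/asr_utils.py` and `src/audio_utils.py` respectively):

```bash
pip install flask pyrogram librosa soundfile numpy deep-translator \
    transformers torch openai noisereduce pyloudnorm av scipy werkzeug
```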
--------------------------------------------------------------------------------
/acent.py:
--------------------------------------------------------------------------------
1 | import json
2 | import re
3 | 
4 | with open("shirazi.json", "r", encoding="utf-8") as file:
5 |     shirazi_dict = json.load(file)
6 | def convert_to_shirazi(text):
7 |     words = re.findall(r'\b\w+\b', text)  # extract the words, ignoring punctuation
8 |     converted_text = text
9 | 
10 |     for word in words:
11 |         if word in shirazi_dict:
12 |             converted_text = re.sub(rf'\b{word}\b', shirazi_dict[word], converted_text)
13 | 
14 |     return converted_text
15 | text = """سلام شما چطوری؟ حالت خوبه؟
16 | میشه موبایل را به من بدی؟"""
17 | 
18 | converted_text = convert_to_shirazi(text)
19 | print(converted_text)
--------------------------------------------------------------------------------
/app.py:
--------------------------------------------------------------------------------
1 | import os
2 | import logging
3 | import time
4 | from flask import Flask, request, render_template, send_from_directory, redirect, url_for, flash, jsonify
5 | from werkzeug.utils import secure_filename
6 | from src.audio_utils import process_video_audio
7 | from src.asr_utils import chunk_based_transcription_vad, save_transcript_segments
8 | from src.subtitle_utils import generate_srt, generate_dual_srt
9 | from src.make_it_correct import make_it_correct_openai
10 | 
11 | logging.basicConfig(level=logging.INFO)
12 | logger = logging.getLogger(__name__)
13 | 
14 | app = Flask(__name__)
15 | app.secret_key = "your_secret_key_here"
16 | 
17 | UPLOAD_FOLDER = 'data/input_videos'
18 | CHUNKS_FOLDER = 'data/chunks'
19 | OUTPUT_FOLDER = 'data'
20 | ALLOWED_EXTENSIONS = {'mp4', 'avi', 'mov', 'mkv'}
21 | 
22 | app.config['UPLOAD_FOLDER'] = UPLOAD_FOLDER
23 | app.config['CHUNKS_FOLDER'] = CHUNKS_FOLDER
24 | app.config['MAX_CONTENT_LENGTH'] = 100 * 1024 * 1024  # 100 MB upload cap
25 | 
26 | app.jinja_env.cache = {}
27 | 
28 | for directory in [
29 |     'data/input_videos',
30 |     'data/audio_outputs',
31 |     'data/text_outputs',
32 |     'data/subtitles',
33 |     'data/chunks'
34 | ]:
35 |     os.makedirs(directory, exist_ok=True)
36 | 
37 | def allowed_file(filename):
38 |     return '.' in filename and filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS
39 | 
40 | @app.route('/')
41 | def index():
42 |     return render_template('home.html')
43 | 
44 | @app.route('/upload', methods=['POST'])
45 | def upload_file():
46 |     if 'video' not in request.files:
47 |         flash('فایل ویدیو ارسال نشده است.')
48 |         return redirect(request.url)
49 | 
50 |     file = request.files['video']
51 |     if file.filename == '':
52 |         flash('هیچ فایلی انتخاب نشده است.')
53 |         return redirect(request.url)
54 | 
55 |     if file and allowed_file(file.filename):
56 |         filename = secure_filename(file.filename)
57 |         video_path = os.path.join(app.config['UPLOAD_FOLDER'], filename)
58 |         file.save(video_path)
59 | 
60 |         original_lang = request.form.get('original_lang', 'en')
61 |         target_lang = request.form.get('target_lang', 'fa')
62 |         enhance_audio = 'enhance_audio' in request.form
63 |         correct_text = 'correct_text' in request.form
64 |         dialect = request.form.get('dialect', 'none')
65 | 
66 |         job_data = {
67 |             'video_path': video_path,
68 |             'filename': filename,
69 |             'original_lang': original_lang,
70 |             'target_lang': target_lang,
71 |             'enhance_audio': enhance_audio,
72 |             'correct_text': correct_text,
73 |             'dialect': dialect
74 |         }
75 | 
76 |         return process_video(job_data)
77 | 
78 |     flash('فرمت فایل نامعتبر است. فرمت‌های مجاز: mp4, avi, mov, mkv')
79 |     return redirect(url_for('index'))
80 | 
81 | @app.route('/upload-chunk', methods=['POST'])
82 | def upload_chunk():
83 |     chunk_number = int(request.form.get('chunk_number'))
84 |     total_chunks = int(request.form.get('total_chunks'))
85 |     filename = secure_filename(request.form.get('filename'))
86 |     file_id = request.form.get('file_id')
87 | 
88 |     file_chunk_dir = os.path.join(app.config['CHUNKS_FOLDER'], file_id)
89 |     os.makedirs(file_chunk_dir, exist_ok=True)
90 | 
91 |     chunk_file = request.files['chunk']
92 |     chunk_path = os.path.join(file_chunk_dir, f"chunk_{chunk_number}")
93 |     chunk_file.save(chunk_path)
94 | 
95 |     if chunk_number == total_chunks - 1:
96 |         video_path = os.path.join(app.config['UPLOAD_FOLDER'], filename)
97 |         with open(video_path, 'wb') as outfile:
98 |             for i in range(total_chunks):
99 |                 chunk_path = os.path.join(file_chunk_dir, f"chunk_{i}")
100 |                 with open(chunk_path, 'rb') as infile:
101 |                     outfile.write(infile.read())
102 |         for i in range(total_chunks):
103 |             chunk_path = os.path.join(file_chunk_dir, f"chunk_{i}")
104 |             if os.path.exists(chunk_path):
105 |                 os.remove(chunk_path)
106 |         os.rmdir(file_chunk_dir)
107 | 
108 |         return jsonify({'status': 'success', 'message': 'آپلود فایل کامل شد', 'filename': filename})
109 | 
110 |     return jsonify({'status': 'success', 'message': f'دریافت chunk {chunk_number + 1}/{total_chunks}'})
111 | 
112 | @app.route('/process', methods=['POST'])
113 | def process_uploaded():
114 |     filename = request.form.get('filename')
115 |     if not filename or not os.path.exists(os.path.join(app.config['UPLOAD_FOLDER'], filename)):
116 |         flash('فایل پیدا نشد یا نامعتبر است.')
117 |         return redirect(url_for('index'))
118 | 
119 |     original_lang = request.form.get('original_lang', 'en')
120 |     target_lang = request.form.get('target_lang', 'fa')
121 |     enhance_audio = request.form.get('enhance_audio') == '1'
122 |     correct_text = request.form.get('correct_text') == '1'
123 |     dialect = request.form.get('dialect', 'none')
124 | 
125 |     job_data = {
126 |         'video_path': os.path.join(app.config['UPLOAD_FOLDER'], filename),
127 |         'filename': filename,
128 |         'original_lang': original_lang,
129 |         'target_lang': target_lang,
130 |         'enhance_audio': enhance_audio,
131 |         'correct_text': correct_text,
132 |         'dialect': dialect
133 |     }
134 | 
135 |     return process_video(job_data)
136 | 
137 | def process_video(job_data):
138 |     try:
139 |         video_path = job_data['video_path']
140 |         filename_base = os.path.splitext(job_data['filename'])[0]
141 |         original_lang = job_data['original_lang']
142 |         target_lang = job_data['target_lang']
143 |         enhance_audio = job_data['enhance_audio']
144 |         correct_text = job_data['correct_text']
145 |         dialect = job_data['dialect']
146 | 
147 |         clean_audio_path = f"data/audio_outputs/{filename_base}_clean.wav"
148 |         transcript_path = f"data/text_outputs/{filename_base}_transcript.txt"
149 |         srt_path = f"data/subtitles/{filename_base}.srt"
150 |         dual_srt_path = f"data/subtitles/{filename_base}_dual.srt"
151 | 
152 |         logger.info(f"پردازش صدا برای {video_path}...")
153 |         process_video_audio(video_path, clean_audio_path, target_sr=16000, enhance=enhance_audio)
154 | 
155 |         logger.info("اجرای رونویسی با VAD...")
156 |         segments = chunk_based_transcription_vad(clean_audio_path, sr=16000)
157 |         save_transcript_segments(segments, transcript_path)
158 | 
159 |         logger.info("تولید فایل‌های زیرنویس...")
160 |         generate_srt(segments, srt_path)
161 |         generate_dual_srt(segments, original_lang, target_lang, dual_srt_path)
162 | 
163 |         if correct_text:
164 |             logger.info("اصلاح زیرنویس با استفاده از make_it_correct_openai...")
165 |             with open("src/system_prompt.txt", "r", encoding="utf-8") as f:
166 |                 system_prompt = f.read()
167 |             message_srt = open(srt_path, "r", encoding="utf-8").read()
168 |             message_dual_srt = open(dual_srt_path, "r", encoding="utf-8").read()
169 | 
170 |             corrected_srt = make_it_correct_openai(message=message_srt, system_prompt=system_prompt)
171 |             corrected_dual_srt = make_it_correct_openai(message=message_dual_srt, system_prompt=system_prompt)
172 | 
173 |             with open(srt_path, "w", encoding="utf-8") as f:
174 |                 f.write(corrected_srt)
175 |             with open(dual_srt_path, "w", encoding="utf-8") as f:
176 |                 f.write(corrected_dual_srt)
177 | 
178 |         if dialect != "none":
179 |             logger.info("اعمال تبدیل گویش...")
180 |             from src.acent import convert_to_dialect
181 |             current_srt = open(srt_path, "r", encoding="utf-8").read()
182 |             current_dual_srt = open(dual_srt_path, "r", encoding="utf-8").read()
183 | 
184 |             converted_srt = convert_to_dialect(dialect, current_srt)
185 |             converted_dual_srt = convert_to_dialect(dialect, current_dual_srt)
186 | 
187 |             with open(srt_path, "w", encoding="utf-8") as f:
188 |                 f.write(converted_srt)
189 |             with open(dual_srt_path, "w", encoding="utf-8") as f:
190 |                 f.write(converted_dual_srt)
191 | 
192 |         results = {
193 |             'filename': job_data['filename'],
194 |             'original_lang': original_lang,
195 |             'target_lang': target_lang,
196 |             'transcript_path': os.path.basename(transcript_path),
197 |             'srt_path': os.path.basename(srt_path),
198 |             'dual_srt_path': os.path.basename(dual_srt_path)
199 |         }
200 | 
201 |         return render_template('output.html', results=results, current_time=int(time.time()))
202 | 
203 |     except Exception as e:
204 |         logger.error(f"خطا در پردازش: {str(e)}")
205 |         flash(f"خطا در پردازش ویدیو: {str(e)}")
206 |         return redirect(url_for('index'))
207 | 
208 | @app.route('/download/<path:filename>')
209 | def download_file(filename):
210 |     directory = os.path.dirname(filename)
211 |     file = os.path.basename(filename)
212 |     if file.endswith('.srt'):
213 |         response = send_from_directory(os.path.join(OUTPUT_FOLDER, directory), file, as_attachment=request.args.get('download', 'false').lower() == 'true')
214 |         response.headers['Content-Type'] = 'application/x-subrip; charset=utf-8'
215 |         response.headers['Access-Control-Allow-Origin'] = '*'
216 |         return response
217 |     return send_from_directory(os.path.join(OUTPUT_FOLDER, directory), file, as_attachment=True)
218 | 
219 | @app.route('/serve-video/<filename>')
220 | def serve_video(filename):
221 |     response = send_from_directory(os.path.join(OUTPUT_FOLDER, 'input_videos'), filename, as_attachment=False)
222 |     response.headers['Access-Control-Allow-Origin'] = '*'
223 |     return response
224 | 
225 | @app.route('/update_subtitle', methods=['POST'])
226 | def update_subtitle():
227 |     file_type = request.form.get('file_type', 'srt')
228 |     new_content = request.form.get('new_content')
229 |     file_name = request.form.get('file_name')
230 |     if not file_name:
231 |         return jsonify({'status': 'error', 'message': 'نام فایل مشخص نشده است.'}), 400
232 |     file_path = os.path.join("data/subtitles", file_name)
233 |     try:
234 |         with open(file_path, "w", encoding="utf-8") as f:
235 |             f.write(new_content)
236 |         return jsonify({'status': 'success', 'message': 'زیرنویس به‌روزرسانی شد.'})
237 |     except Exception as e:
238 |         return jsonify({'status': 'error', 'message': str(e)}), 500
239 | 
240 | if __name__ == "__main__":
241 |     app.run(debug=True, host='0.0.0.0', port=5000)
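The `/upload` route above accepts a single multipart POST. A quick way to exercise it from the command line, assuming the server is running locally and `sample1.mp4` exists in the current directory (`enhance_audio` and `correct_text` are presence-checked, so passing any value enables them):

```bash
curl -X POST http://localhost:5000/upload \
  -F "video=@sample1.mp4" \
  -F "original_lang=en" \
  -F "target_lang=fa" \
  -F "enhance_audio=1" \
  -F "correct_text=1" \
  -F "dialect=none"
```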
--------------------------------------------------------------------------------
/bot.py:
--------------------------------------------------------------------------------
1 | import os
2 | import logging
3 | from pyrogram import Client, filters
4 | from pyrogram.types import Message
5 | from src.audio_utils import process_video_audio
6 | from src.asr_utils import chunk_based_transcription_vad, save_transcript_segments
7 | from src.subtitle_utils import generate_srt, generate_dual_srt
8 | from src.make_it_correct import make_it_correct_openai
9 | from src.acent import convert_to_dialect
10 | 
11 | logging.basicConfig(level=logging.INFO)
12 | logger = logging.getLogger(__name__)
13 | 
14 | API_ID = "x"
15 | API_HASH = "x"
16 | BOT_TOKEN = "x:x-x"
17 | 
18 | UPLOAD_FOLDER = 'data/input_videos'
19 | AUDIO_OUTPUTS = 'data/audio_outputs'
20 | TEXT_OUTPUTS = 'data/text_outputs'
21 | SUBTITLES_FOLDER = 'data/subtitles'
22 | CHUNKS_FOLDER = 'data/chunks'
23 | 
24 | for folder in [UPLOAD_FOLDER, AUDIO_OUTPUTS, TEXT_OUTPUTS, SUBTITLES_FOLDER, CHUNKS_FOLDER]:
25 |     os.makedirs(folder, exist_ok=True)
26 | 
27 | pending_jobs = {}
28 | 
29 | def process_video(job_data):
30 |     try:
31 |         video_path = job_data['video_path']
32 |         filename = job_data['filename']
33 |         filename_base = os.path.splitext(filename)[0]
34 |         original_lang = job_data['original_lang']
35 |         target_lang = job_data['target_lang']
36 |         enhance_audio = job_data['enhance_audio']
37 |         correct_text = job_data['correct_text']
38 |         dialect = job_data['dialect']
39 | 
40 |         clean_audio_path = os.path.join(AUDIO_OUTPUTS, f"{filename_base}_clean.wav")
41 |         transcript_path = os.path.join(TEXT_OUTPUTS, f"{filename_base}_transcript.txt")
42 |         srt_path = os.path.join(SUBTITLES_FOLDER, f"{filename_base}.srt")
43 |         dual_srt_path = os.path.join(SUBTITLES_FOLDER, f"{filename_base}_dual.srt")
44 | 
45 |         logger.info(f"پردازش صدا برای {video_path}...")
46 |         process_video_audio(video_path, clean_audio_path, target_sr=16000, enhance=enhance_audio)
47 | 
48 |         logger.info("اجرای رونویسی با VAD...")
49 |         segments = chunk_based_transcription_vad(clean_audio_path, sr=16000)
50 |         save_transcript_segments(segments, transcript_path)
51 | 
52 |         logger.info("تولید فایل‌های زیرنویس...")
53 |         generate_srt(segments, srt_path)
54 |         generate_dual_srt(segments, original_lang, target_lang, dual_srt_path)
55 | 
56 |         if correct_text:
57 |             logger.info("اصلاح زیرنویس...")
58 |             with open("src/system_prompt.txt", "r", encoding="utf-8") as f:
59 |                 system_prompt = f.read()
60 |             with open(srt_path, "r", encoding="utf-8") as f:
61 |                 message_srt = f.read()
62 |             with open(dual_srt_path, "r", encoding="utf-8") as f:
63 |                 message_dual_srt = f.read()
64 | 
65 |             result_srt = make_it_correct_openai(message=message_srt, system_prompt=system_prompt)
66 |             result_dual_srt = make_it_correct_openai(message=message_dual_srt, system_prompt=system_prompt)
67 | 
68 |             with open(srt_path, "w", encoding="utf-8") as f:
69 |                 f.write(result_srt)
70 |             with open(dual_srt_path, "w", encoding="utf-8") as f:
71 |                 f.write(result_dual_srt)
72 | 
73 |         if dialect != "none":
74 |             logger.info("اعمال تبدیل گویش...")
75 |             with open(srt_path, "r", encoding="utf-8") as f:
76 |                 current_srt = f.read()
77 |             with open(dual_srt_path, "r", encoding="utf-8") as f:
78 |                 current_dual_srt = f.read()
79 | 
80 |             converted_srt = convert_to_dialect(dialect, current_srt)
81 |             converted_dual_srt = convert_to_dialect(dialect, current_dual_srt)
82 | 
83 |             with open(srt_path, "w", encoding="utf-8") as f:
84 |                 f.write(converted_srt)
85 |             with open(dual_srt_path, "w", encoding="utf-8") as f:
86 |                 f.write(converted_dual_srt)
87 | 
88 |         return {
89 |             'filename': filename,
90 |             'original_lang': original_lang,
91 |             'target_lang': target_lang,
92 |             'transcript_path': transcript_path,
93 |             'srt_path': srt_path,
94 |             'dual_srt_path': dual_srt_path
95 |         }
96 | 
97 |     except Exception as e:
98 |         logger.error(f"خطا در پردازش: {str(e)}")
99 |         raise
100 | 
101 | app = Client("video_processor_bot", api_id=API_ID, api_hash=API_HASH, bot_token=BOT_TOKEN)
102 | 
103 | @app.on_message(filters.command("start"))
104 | def start_handler(client: Client, message: Message):
105 |     message.reply_text(
106 |         "سلام! یک فایل ویدیویی برای من ارسال کنید تا صدای آن را استخراج کرده و زیرنویس تولید کنم."
107 |     )
108 | 
109 | @app.on_message(filters.video | filters.document)
110 | def video_handler(client: Client, message: Message):
111 |     try:
112 |         if message.document:
113 |             original_file_name = message.document.file_name
114 |         elif message.video and hasattr(message.video, "file_name"):
115 |             original_file_name = message.video.file_name
116 |         else:
117 |             original_file_name = "video.mp4"
118 | 
119 |         download_path = os.path.join(UPLOAD_FOLDER, original_file_name)
120 |         file_path = message.download(file_name=download_path)
121 | 
122 |         pending_jobs[message.from_user.id] = {
123 |             'job_data': {'video_path': file_path, 'filename': os.path.basename(file_path)},
124 |             'step': 1,
125 |             'params': {}
126 |         }
127 | 
128 |         message.reply_text(
129 |             "ویدیو دریافت شد! لطفا تنظیمات پردازش را انجام دهید.\n\n"
130 |             "مرحله 1: لطفا زبان اصلی را انتخاب کنید:\n"
131 |             "en - انگلیسی\n"
132 |             "fa - فارسی\n"
133 |             "es - اسپانیایی"
134 |         )
135 | 
136 |     except Exception as e:
137 |         logger.error(f"خطا در دریافت ویدیو: {str(e)}")
138 |         message.reply_text(f"خطایی رخ داد: {str(e)}")
139 | 
140 | def parse_bool(text):
141 |     return text.strip().lower() in ('بله', 'yes', 'true', '1')
142 | 
143 | @app.on_message(filters.text)
144 | def conversation_handler(client: Client, message: Message):
145 |     user_id = message.from_user.id
146 |     if user_id not in pending_jobs:
147 |         return
148 | 
149 |     state = pending_jobs[user_id]
150 |     current_step = state['step']
151 |     text = message.text.strip()
152 | 
153 |     if current_step == 1:
154 |         state['params']['original_lang'] = text
155 |         state['step'] = 2
156 |         message.reply_text(
157 |             "مرحله 2: لطفا زبان مقصد را انتخاب کنید:\n"
158 |             "fa - فارسی\n"
159 |             "en - انگلیسی\n"
160 |             "es - اسپانیایی"
161 |         )
162 | 
163 |     elif current_step == 2:
164 |         state['params']['target_lang'] = text
165 |         state['step'] = 3
166 |         message.reply_text("مرحله 3: آیا می‌خواهید کیفیت صدا بهبود یابد؟ (بله/خیر)")
167 | 
168 |     elif current_step == 3:
169 |         state['params']['enhance_audio'] = parse_bool(text)
170 |         state['step'] = 4
171 |         message.reply_text("مرحله 4: آیا می‌خواهید زیرنویس‌ها اصلاح شوند؟ (بله/خیر)")
172 | 
173 |     elif current_step == 4:
174 |         state['params']['correct_text'] = parse_bool(text)
175 |         state['step'] = 5
176 |         message.reply_text(
177 |             "مرحله 5: لطفا گویش مورد نظر را انتخاب کنید:\n"
178 |             "none - بدون تغییر\n"
179 |             "isfahani - اصفهانی\n"
180 |             "shirazi - شیرازی"
181 |         )
182 | 
183 |     elif current_step == 5:
184 |         state['params']['dialect'] = text
185 |         job_data = state['job_data']
186 |         job_data.update(state['params'])
187 |         pending_jobs.pop(user_id)
188 | 
189 |         message.reply_text("در حال پردازش ویدیو... لطفا صبر کنید.")
190 | 
191 |         try:
192 |             results = process_video(job_data)
193 |             reply_message = (
194 |                 "پردازش تکمیل شد!\n\n"
195 |                 f"فایل متن: {os.path.basename(results['transcript_path'])}\n"
196 |                 f"فایل زیرنویس: {os.path.basename(results['srt_path'])}\n"
197 |                 f"فایل زیرنویس دوزبانه: {os.path.basename(results['dual_srt_path'])}"
198 |             )
199 |             message.reply_text(reply_message)
200 |             message.reply_document(results['transcript_path'], caption="متن")
201 |             message.reply_document(results['srt_path'], caption="زیرنویس")
202 |             message.reply_document(results['dual_srt_path'], caption="زیرنویس دوزبانه")
203 | 
204 |         except Exception as e:
205 |             logger.error(f"خطا در پردازش ویدیو: {str(e)}")
206 |             message.reply_text(f"خطایی در پردازش رخ داد: {str(e)}")
207 | 
208 | if __name__ == "__main__":
209 |     print("ربات در حال اجراست...")
210 |     app.run()
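The `API_ID`, `API_HASH`, and `BOT_TOKEN` values above are placeholders. One way to supply real credentials without committing them is to read them from the environment; a minimal sketch (the variable names here are illustrative, not something the project defines):

```python
import os

# Hypothetical environment variable names; export them before starting the bot.
API_ID = os.environ["TELEGRAM_API_ID"]
API_HASH = os.environ["TELEGRAM_API_HASH"]
BOT_TOKEN = os.environ["TELEGRAM_BOT_TOKEN"]
```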
--------------------------------------------------------------------------------
/isfahan.json:
--------------------------------------------------------------------------------
1 | {
2 |     "شما": "شوما",
3 |     "کجا": "کوجا",
4 |     "چادر": "چادور",
5 |     "در": "دری",
6 |     "گل": "گلی",
7 |     "آدم": "آدمی",
8 |     "خوب": "خب",
9 |     "پراید": "پرایت",
10 |     "آرد": "آرت",
11 |     "کارد": "کارت",
12 |     "گاو": "گاب",
13 |     "از": "ازِ",
14 |     "قفَس": "قفِس",
15 |     "اَزَش": "اِزِش",
16 |     "بِزَن": "بِزِن",
17 |     "امروز": "اَمروز",
18 |     "جمعه": "جمعَه",
19 |     "سفید": "سیفید",
20 |     "حیف": "حَیف",
21 |     "فشار": "فَشار",
22 |     "بچه‌ها": "بِچا",
23 |     "گربه‌ها": "گربا",
24 |     "می‌جهد": "میجد",
25 |     "بره": "برد",
26 |     "بشه": "بشد",
27 |     "بهتر": "بیتِرِس",
28 |     "سر": "سری",
29 |     "بهش": "بشش",
30 |     "هم": "وم",
31 |     "می‌آییم": "می‌ییم",
32 |     "می‌خواهی": "می‌خَی",
33 |     "می‌شنوی": "میشنُوی",
34 |     "میگی": "میگوی",
35 |     "بگیر": "بیگیر",
36 |     "بشین": "بیشین",
37 |     "بریز": "بیریز",
38 |     "ببین": "بیبین",
39 |     "دیوار": "دیفال",
40 |     "گنبد": "گنبِز",
41 |     "ماساژ": "ماساج",
42 |     "پاساژ": "پاساج",
43 |     "رفتید": "رفدین",
44 |     "گفتید": "گفدین",
45 |     "می‌رَوید": "می‌رین",
46 |     "می‌گویید": "می‌گوین",
47 |     "بلگ": "برگ",
48 |     "قیل": "قیر",
49 |     "خوبی": "خبیا",
50 |     "آویزون": "آونگون",
51 |     "طالبی": "سمسوری",
52 |     "سر و سینه": "سکاسینه",
53 |     "صورت": "سکاصورت",
54 |     "شاید": "بلکی",
55 |     "جوی آب": "مادی",
56 |     "پاتختی": "بندرتخت",
57 |     "زگیل": "توتولی",
58 |     "خوره": "آکله",
59 |     "ترکیدن": "پُکیدن",
60 |     "سیزده": "سینزَ",
61 |     "نوزده": "نونزَ",
62 |     "دوازده": "دوازَ",
63 |     "گرفتم": "اِسِدم",
64 |     "عجله": "جل",
65 |     "کلاغ": "غِلاغ",
66 |     "مغازه": "دُکون",
67 |     "خواب": "خاب",
68 |     "خانه": "خونه",
69 |     "خواهر": "خاهر",
70 |     "برادر": "بارادَر",
71 |     "پول": "پُل",
72 |     "بیرون": "بیرونَس",
73 |     "چکار": "چِکا",
74 |     "بلند": "بیلند",
75 |     "می‌رویم": "می‌ریم",
76 |     "چه": "چی",
77 |     "بخور": "بُخور",
78 |     "بده": "بدِس",
79 |     "بیار": "بییار",
80 |     "بزن": "بِزِن",
81 |     "نکن": "نیِکُن",
82 |     "نخوردی": "نَخُردی",
83 |     "بپرس": "بِپُرس",
84 |     "چه‌کار": "چِکا",
85 |     "دوست": "دوس",
86 |     "من": "منِس",
87 |     "نمی‌دونم": "نیمی‌دونِم",
88 |     "دارم": "دارم",
89 |     "آنجا": "اونجا",
90 |     "چرا": "چَرا",
91 |     "نمی‌خوام": "نیمی‌خَم",
92 |     "اصلا": "اَصلاً",
93 |     "بهتره": "بیتِرِس",
94 |     "خیلی": "خیلی",
95 |     "یعنی": "یَنی",
96 |     "می‌ری": "می‌ریس",
97 |     "بیایم": "بییَم",
98 |     "ببینم": "بیبینَم",
99 |     "معلوم": "معلو",
100 |     "دارم": "دارم",
101 |     "راست": "راست",
102 |     "واقعا": "واقعاً",
103 |     "من": "منِس",
104 |     "خودت": "خودت",
105 |     "عجیب": "عجیبِس",
106 |     "دوباره": "دُباره",
107 |     "همیشه": "هِمیشا",
108 |     "کلی": "کِلی",
109 |     "همه": "هِما",
110 |     "دیگه": "دیگه",
111 |     "باشه": "باشِس"
112 | }
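Each entry in `isfahan.json` maps a standard Persian form to its Isfahani rendering, and `src/acent.py` applies them as whole-word substitutions. A small usage sketch, assuming it is run from the repository root where the JSON files live:

```python
from src.acent import convert_to_dialect

# "شما" -> "شوما" and "کجا" -> "کوجا" per isfahan.json; unmapped words pass through.
print(convert_to_dialect("isfahani", "شما کجا هستید؟"))
# Expected output: شوما کوجا هستید؟
```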
--------------------------------------------------------------------------------
/libraries.docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TahaBakhtari/SubtitleGenerator/2e2553a40d66c95c4d7d123ebfd0aa6bb63cbc09/libraries.docx
--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
1 | import os
2 | import logging
3 | from src.audio_utils import process_video_audio
4 | from src.asr_utils import chunk_based_transcription_vad, save_transcript_segments
5 | from src.subtitle_utils import generate_srt, generate_dual_srt
6 | from src.acent import convert_to_dialect
7 | 
8 | logging.basicConfig(level=logging.INFO)
9 | logger = logging.getLogger(__name__)
10 | 
11 | def main():
12 |     video_path = "data/input_videos/sample1.mp4"
13 |     clean_audio_path = "data/audio_outputs/sample_clean.wav"
14 |     transcript_path = "data/text_outputs/sample_transcript.txt"
15 |     srt_path = "data/subtitles/sample.srt"
16 |     dual_srt_path = "data/subtitles/sample_dual.srt"
17 | 
18 |     original_lang = input("Enter the video language code (e.g., 'en', 'fa'): ").strip().lower() or "en"
19 |     target_lang = input("Enter the target language code for subtitles (e.g., 'fa'): ").strip().lower() or "fa"
20 |     enhance_input = input("Enhance the audio? (Y/n): ").strip().lower()
21 |     enhance_audio = enhance_input not in ["n", "no"]
22 | 
23 |     for d in [os.path.dirname(clean_audio_path),
24 |               os.path.dirname(transcript_path),
25 |               os.path.dirname(srt_path),
26 |               os.path.dirname(dual_srt_path)]:
27 |         if not os.path.exists(d):
28 |             os.makedirs(d, exist_ok=True)
29 | 
30 |     try:
31 |         process_video_audio(video_path, clean_audio_path, target_sr=16000, enhance=enhance_audio)
32 |         logger.info("Phase 1 completed successfully.")
33 | 
34 |         segments = chunk_based_transcription_vad(clean_audio_path, sr=16000)
35 |         save_transcript_segments(segments, transcript_path)
36 |         logger.info("Phase 2 completed successfully.")
37 | 
38 |         generate_srt(segments, srt_path)
39 |         generate_dual_srt(segments, original_lang, target_lang, dual_srt_path)
40 | 
41 |         correct_text = input("Do you want to correct the text? (Y/n): ").strip().lower()
42 |         if correct_text != 'n':
43 |             print("Correcting the text...")
44 |             message_srt = open("data/subtitles/sample.srt", mode="r", encoding="utf-8").read()
45 |             message_dual_srt = open("data/subtitles/sample_dual.srt", mode="r", encoding="utf-8").read()
46 |             with open("src/system_prompt.txt", "r", encoding="utf-8") as f:
47 |                 system_prompt = f.read()
48 |             from src.make_it_correct import make_it_correct_openai
49 |             result_srt = make_it_correct_openai(message=message_srt, system_prompt=system_prompt)
50 |             result_dual_srt = make_it_correct_openai(message=message_dual_srt, system_prompt=system_prompt)
51 |             open("data/subtitles/sample.srt", "w", encoding="utf-8").write(result_srt)
52 |             open("data/subtitles/sample_dual.srt", "w", encoding="utf-8").write(result_dual_srt)
53 | 
54 |         accent_correction = input("Do you want to apply accent correction to the transcript? (y/n): ").strip().lower()
55 |         if accent_correction == 'y':
56 |             dialect = input("Enter the dialect for accent correction (e.g., 'shirazi', 'isfahani'): ").strip().lower()
57 |             with open(transcript_path, "r", encoding="utf-8") as f:
58 |                 transcript_text = f.read()
59 |             corrected_text = convert_to_dialect(dialect, transcript_text)
60 |             with open(transcript_path, "w", encoding="utf-8") as f:
61 |                 f.write(corrected_text)
62 |             logger.info("Accent correction applied successfully.")
63 | 
64 |         logger.info("Phase 3 completed successfully.")
65 |     except Exception as e:
66 |         logger.error(f"Processing error: {e}")
67 | 
68 | if __name__ == "__main__":
69 |     main()
--------------------------------------------------------------------------------
/shirazi.json:
--------------------------------------------------------------------------------
1 | {
2 |     "کردن": "کِردن",
3 |     "را": "رِ",
4 |     "کاکا": "کاکو",
5 |     "بابا": "بابو",
6 |     "دادا": "دادو",
7 |     "آقا": "آقو",
8 |     "ماما": "مامو",
9 |     "پاپا": "پاپو",
10 |     "بله": "ها",
11 |     "تایید": "ها عامو",
12 |     "چاه": "چا",
13 |     "تب": "تو",
14 |     "لب": "لو",
15 |     "نان": "نون",
16 |     "بادام": "بودام",
17 |     "لاله": "لولِه",
18 |     "کاسه": "کوسه",
19 |     "استاد": "اساد",
20 |     "شنبه": "شمبه",
21 |     "پنجره": "پمجره",
22 |     "بهت زده": "بت زده",
23 |     "فهمیده": "فمیده",
24 |     "قفل": "قلف",
25 |     "سارا": "سارو",
26 |     "چرا": "چرو",
27 |     "کجا": "کجو",
28 |     "سرب": "سرو",
29 |     "مدت": "مد",
30 |     "ببین": "ببیم",
31 |     "کوهستان": "کوشستان",
32 |     "آب": "آبو",
33 |     "خورشید": "خورشیدو",
34 |     "ستاره": "ستارو",
35 |     "گلدان": "گلدونو",
36 |     "دیوار": "دیوو",
37 |     "زمین": "زمینو",
38 |     "آسمان": "آسمانو",
39 |     "دریا": "دریاو",
40 |     "باغ": "باغو",
41 |     "شهر": "شهرو",
42 |     "مرد": "مردو",
43 |     "زن": "زنو",
44 |     "فرزند": "فرزندو",
45 |     "دوست": "دوستو",
46 |     "زندگی": "زندگیو",
47 |     "وقت": "وقتو",
48 |     "موسیقی": "موسیقیو",
49 |     "دانشگاه": "دانشگاهو",
50 |     "پرنده": "پرندو",
51 |     "گل": "گول",
52 |     "آبشار": "آبشارو",
53 |     "جاده": "جادِو",
54 |     "ساحل": "ساحلو",
55 |     "موزه": "موزِو",
56 |     "رستوران": "رستورانو",
57 |     "هتل": "هتلو",
58 |     "فست فود": "فست فودو",
59 |     "مدیر": "مدیرو",
60 |     "آشپز": "آشپزو",
61 |     "سیب": "سیبُو",
62 |     "ترشی": "ترشیو",
63 |     "شیر": "شیرُو",
64 |     "ماهی": "ماهیُو",
65 |     "پیام": "پیامو",
66 |     "خبر": "خبرُو",
67 |     "سوال": "سوالُو",
68 |     "پاسخ": "پاسخو",
69 |     "معلم": "معلمو",
70 |     "دانش‌آموز": "دانش‌آموزو",
71 |     "فرصت": "فرصتو",
72 |     "ساعت": "ساعتو",
73 |     "روز": "روزو",
74 |     "هفته": "هفتو",
75 |     "ماه": "ماهُو",
76 |     "سال": "سالُو",
77 |     "سرد": "سردو",
78 |     "گرم": "گرمو",
79 |     "باران": "بارانو",
80 |     "برف": "برفو",
81 |     "ابر": "ابرو",
82 |     "آفتاب": "آفتابو",
83 |     "سایه": "سایِو",
84 |     "گنجه": "گنچِو",
85 |     "صدا": "صداو",
86 |     "آواز": "آوازو",
87 |     "زبان": "زبانُو",
88 |     "حرف": "حرفُو",
89 |     "چهره": "چهرِو",
90 |     "لبخند": "لبخندو",
91 |     "شادی": "شادیو",
92 |     "غم": "غمُو",
93 |     "دل": "دلُو",
94 |     "حس": "حسُو",
95 |     "درد": "دردو",
96 |     "درمان": "درمانو",
97 |     "خیال": "خیالو",
98 |     "افسردگی": "افسردگیو",
99 |     "خوشحالی": "خوشحالیو",
100 |     "همبستگی": "همبستگیو",
101 |     "تفاهم": "تفاهمو",
102 |     "احترام": "احترامو",
103 |     "کتاب": "کتابو",
104 |     "دفتر": "دفترُو",
105 |     "قلم": "قلمُو",
106 |     "میز": "میزو",
107 |     "صندلی": "صندلیو",
108 |     "ماشین": "ماشینو",
109 |     "خانه": "خونه‌و",
110 |     "دوچرخه": "دوچرخو",
111 |     "باغچه": "باغچه‌و",
112 |     "ورزش": "ورزشو",
113 |     "بازی": "بازیو",
114 |     "سینما": "سینماو",
115 |     "تلویزیون": "تلویزیونو",
116 |     "رادیو": "رادیوُ",
117 |     "آشپزخانه": "آشپزخونه‌و",
118 |     "گیتار": "گیتارو",
119 |     "پیانو": "پیانوُ",
120 |     "تلفن": "تلفونو",
121 |     "موبایل": "موبایلو",
122 |     "اینترنت": "اینترنتو",
123 |     "فوتبال": "فوتبالو",
124 |     "بسکتبال": "بسکتبولو",
125 |     "والیبال": "والیبالو",
126 |     "استخر": "استخرُو",
127 |     "کت": "کت‌بو",
128 |     "پرینتر": "پرینترو",
129 |     "کامپیوتر": "کامپیوترُو",
130 |     "بیسکویت": "بیسکوییتو",
131 |     "شکلات": "شکلاتو",
132 |     "نوشیدنی": "نوشیدنیو",
133 |     "سبزی": "سبزیو",
134 |     "میوه": "میوه‌و",
135 |     "پخت": "پختو",
136 |     "غذا": "غذایو",
137 |     "شوربا": "شوربو",
138 |     "سیب‌زمینی": "سیب‌زمینیو",
139 |     "گوجه‌فرنگی": "گوجه‌فرنگیُو",
140 |     "هویج": "هویجو",
141 |     "موز": "موزو",
142 |     "شیرینی": "شیرینیو",
143 |     "کرم": "کرمُو",
144 |     "عطر": "عطرِو",
145 |     "بوته": "بوته‌و",
146 |     "درخت": "درختو",
147 |     "برگ": "برگُو",
148 |     "ریش": "ریشو",
149 |     "لب": "لبتو",
150 |     "چشم": "چشمو",
151 |     "گوش": "گوشو",
152 |     "بینی": "بینیو",
153 |     "کفش": "کفشو",
154 |     "کیف": "کیفو",
155 |     "کوه": "کوهو",
156 |     "جزیره": "جزیرو",
157 |     "دریاچه": "دریاچِو",
158 |     "گل": "گولو",
159 |     "مدرسه": "مدرسِو",
160 |     "دانشجو": "دانشجُو",
161 |     "کتابخانه": "کتابخانِو",
162 |     "فیلم": "فیلمو",
163 |     "تئاتر": "تئاترو",
164 |     "کافه": "کافِو",
165 |     "خیابان": "خیابانو",
166 |     "پل": "پلو",
167 |     "روستا": "روستاو",
168 |     "شهرستان": "شهرستانو",
169 |     "کشور": "کشورو",
170 |     "ملی": "ملیِو",
171 |     "بین‌المللی": "بین‌المللیو",
172 |     "زیبا": "زیباو",
173 |     "بزرگ": "بزرگو",
174 |     "کوچک": "کوچکو",
175 |     "سریع": "سریعو",
176 |     "آهسته": "آهستِو"
177 | }
"اینترنتو", 123 | "فوتبال": "فوتبالو", 124 | "بسکتبال": "بسکتبولو", 125 | "والیبال": "والیبالو", 126 | "استخر": "استخرُو", 127 | "کت": "کت‌بو", 128 | "پرینتر": "پرینترو", 129 | "کامپیوتر": "کامپیوترُو", 130 | "بیسکویت": "بیسکوییتو", 131 | "شکلات": "شکلاتو", 132 | "نوشیدنی": "نوشیدنیو", 133 | "سبزی": "سبزیو", 134 | "میوه": "میوه‌و", 135 | "پخت": "پختو", 136 | "غذا": "غذایو", 137 | "شوربا": "شوربو", 138 | "سیب‌زمینی": "سیب‌زمینیو", 139 | "گوجه‌فرنگی": "گوجه‌فرنگیُو", 140 | "هویج": "هویجو", 141 | "موز": "موزو", 142 | "شیرینی": "شیرینیو", 143 | "کرم": "کرمُو", 144 | "عطر": "عطرِو", 145 | "بوته": "بوته‌و", 146 | "درخت": "درختو", 147 | "برگ": "برگُو", 148 | "ریش": "ریشو", 149 | "لب": "لبتو", 150 | "چشم": "چشمو", 151 | "گوش": "گوشو", 152 | "بینی": "بینیو", 153 | "کفش": "کفشو", 154 | "کیف": "کیفو", 155 | "کوه": "کوهو", 156 | "جزیره": "جزیرو", 157 | "دریاچه": "دریاچِو", 158 | "گل": "گولو", 159 | "مدرسه": "مدرسِو", 160 | "دانشجو": "دانشجُو", 161 | "کتابخانه": "کتابخانِو", 162 | "فیلم": "فیلمو", 163 | "تئاتر": "تئاترو", 164 | "کافه": "کافِو", 165 | "خیابان": "خیابانو", 166 | "پل": "پلو", 167 | "روستا": "روستاو", 168 | "شهرستان": "شهرستانو", 169 | "کشور": "کشورو", 170 | "ملی": "ملیِو", 171 | "بین‌المللی": "بین‌المللیو", 172 | "زیبا": "زیباو", 173 | "بزرگ": "بزرگو", 174 | "کوچک": "کوچکو", 175 | "سریع": "سریعو", 176 | "آهسته": "آهستِو" 177 | } -------------------------------------------------------------------------------- /src/__pycache__/acent.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TahaBakhtari/SubtitleGenerator/2e2553a40d66c95c4d7d123ebfd0aa6bb63cbc09/src/__pycache__/acent.cpython-38.pyc -------------------------------------------------------------------------------- /src/__pycache__/asr_utils.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TahaBakhtari/SubtitleGenerator/2e2553a40d66c95c4d7d123ebfd0aa6bb63cbc09/src/__pycache__/asr_utils.cpython-38.pyc -------------------------------------------------------------------------------- /src/__pycache__/audio_utils.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TahaBakhtari/SubtitleGenerator/2e2553a40d66c95c4d7d123ebfd0aa6bb63cbc09/src/__pycache__/audio_utils.cpython-38.pyc -------------------------------------------------------------------------------- /src/__pycache__/audio_utils.cpython-39.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TahaBakhtari/SubtitleGenerator/2e2553a40d66c95c4d7d123ebfd0aa6bb63cbc09/src/__pycache__/audio_utils.cpython-39.pyc -------------------------------------------------------------------------------- /src/__pycache__/make_it_correct.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TahaBakhtari/SubtitleGenerator/2e2553a40d66c95c4d7d123ebfd0aa6bb63cbc09/src/__pycache__/make_it_correct.cpython-38.pyc -------------------------------------------------------------------------------- /src/__pycache__/subtitle_utils.cpython-38.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/TahaBakhtari/SubtitleGenerator/2e2553a40d66c95c4d7d123ebfd0aa6bb63cbc09/src/__pycache__/subtitle_utils.cpython-38.pyc 
--------------------------------------------------------------------------------
/src/acent.py:
--------------------------------------------------------------------------------
1 | import json
2 | import re
3 | 
4 | def load_dialect_dict(dialect):
5 |     if dialect == "isfahani":
6 |         with open("isfahan.json", "r", encoding="utf-8") as file:
7 |             return json.load(file)
8 |     elif dialect == "shirazi":
9 |         with open("shirazi.json", "r", encoding="utf-8") as file:
10 |             return json.load(file)
11 |     else:
12 |         raise ValueError("گویش پشتیبانی نشده است.")
13 | 
14 | def convert_to_dialect(dialect, text):
15 |     dialect_dict = load_dialect_dict(dialect)
16 |     words = re.findall(r'\b\w+\b', text)
17 |     converted_text = text
18 |     for word in words:
19 |         if word in dialect_dict:
20 |             converted_text = re.sub(rf'\b{word}\b', dialect_dict[word], converted_text)
21 |     return converted_text
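One caveat with `convert_to_dialect` as written: `re.findall(r'\b\w+\b', text)` tokenizes on word boundaries, and the zero-width non-joiner (U+200C) used in keys such as «بچه‌ها» is not a word character, so ZWNJ-joined and multi-word entries (e.g. «جوی آب») can never surface as a single token and therefore never fire. A hedged sketch of a variant that matches the dictionary keys directly instead of tokenizing first; note it deliberately trades away the whole-word guarantee, so short keys can also match inside longer words:

```python
import json
import re

def convert_with_full_keys(json_path: str, text: str) -> str:
    """Replace every dictionary key found in `text`, longest key first,
    so multi-word and ZWNJ-joined entries win over their sub-parts."""
    with open(json_path, "r", encoding="utf-8") as f:
        mapping = json.load(f)
    # One alternation over all escaped keys, longest first.
    pattern = "|".join(re.escape(k) for k in sorted(mapping, key=len, reverse=True))
    return re.sub(pattern, lambda m: mapping[m.group(0)], text)
```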
--------------------------------------------------------------------------------
/src/asr_main.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TahaBakhtari/SubtitleGenerator/2e2553a40d66c95c4d7d123ebfd0aa6bb63cbc09/src/asr_main.py
--------------------------------------------------------------------------------
/src/asr_utils.py:
--------------------------------------------------------------------------------
1 | import logging
2 | import os
3 | import librosa
4 | import numpy as np
5 | import tempfile
6 | import soundfile as sf
7 | from typing import List, Tuple
8 | from transformers import pipeline
9 | import torch
10 | 
11 | logging.basicConfig(level=logging.INFO)
12 | logger = logging.getLogger(__name__)
13 | 
14 | def load_asr_pipeline():
15 |     device = 0 if torch.cuda.is_available() else -1
16 |     logger.info("Loading HF pipeline with openai/whisper-large-v3-turbo model.")
17 |     asr_pipe = pipeline(
18 |         task="automatic-speech-recognition",
19 |         model="openai/whisper-large-v3-turbo",
20 |         return_timestamps=True,
21 |         chunk_length_s=30,
22 |         device=device
23 |     )
24 |     return asr_pipe
25 | 
26 | def vad_segmentation_energy(audio: np.ndarray, sr: int, frame_duration_ms: int = 30, energy_threshold_factor: float = 0.1, merge_threshold: float = 0.5) -> List[Tuple[float, float, np.ndarray]]:
27 |     frame_length = int(sr * frame_duration_ms / 1000)
28 |     num_frames = int(np.ceil(len(audio) / frame_length))
29 |     rms_values = []
30 |     for i in range(num_frames):
31 |         frame = audio[i * frame_length:min((i + 1) * frame_length, len(audio))]
32 |         if len(frame) > 0:
33 |             rms = np.sqrt(np.mean(frame ** 2))
34 |             rms_values.append(rms)
35 |         else:
36 |             rms_values.append(0)
37 |     rms_values = np.array(rms_values)
38 |     sorted_rms = np.sort(rms_values)
39 |     noise_level = np.percentile(sorted_rms, 10)
40 |     median_rms = np.median(sorted_rms)
41 |     energy_threshold = noise_level + energy_threshold_factor * (median_rms - noise_level)
42 |     logger.info(f"Adaptive energy threshold set to: {energy_threshold:.6f}")
43 |     segments = []
44 |     segment_start = None
45 |     for i, rms in enumerate(rms_values):
46 |         time = i * frame_duration_ms / 1000.0
47 |         if rms > energy_threshold and segment_start is None:
48 |             segment_start = time
49 |         elif rms <= energy_threshold and segment_start is not None:
50 |             segment_end = time
51 |             if segment_end - segment_start > 0.2:
52 |                 segments.append((segment_start, segment_end))
53 |             segment_start = None
54 |     if segment_start is not None:
55 |         segments.append((segment_start, num_frames * frame_duration_ms / 1000.0))
56 |     merged_segments = []
57 |     if segments:
58 |         cur_start, cur_end = segments[0]
59 |         for start, end in segments[1:]:
60 |             if start - cur_end < merge_threshold:
61 |                 cur_end = end
62 |             else:
63 |                 merged_segments.append((cur_start, cur_end))
64 |                 cur_start, cur_end = start, end
65 |         merged_segments.append((cur_start, cur_end))
66 |     extended_segments = []
67 |     padding = 0.1
68 |     for start, end in merged_segments:
69 |         ext_start = max(0, start - padding)
70 |         ext_end = min(len(audio) / sr, end + padding)
71 |         extended_segments.append((ext_start, ext_end))
72 |     segments_info = [(start, end, audio[int(start * sr):int(end * sr)]) for (start, end) in extended_segments]
73 |     logger.info(f"Energy-based VAD produced {len(segments_info)} segments.")
74 |     return segments_info
75 | 
76 | def chunk_based_transcription_vad(audio_path: str, sr: int = 16000) -> List[Tuple[float, float, str, List[Tuple[float, float, str]]]]:
77 |     asr_pipe = load_asr_pipeline()
78 |     audio, file_sr = librosa.load(audio_path, sr=sr, mono=True)
79 |     segments = vad_segmentation_energy(audio, sr, frame_duration_ms=30, energy_threshold_factor=0.1, merge_threshold=0.5)
80 |     segments_info = []
81 |     for (start, end, seg_audio) in segments:
82 |         with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp_wav:
83 |             seg_path = tmp_wav.name
84 |         sf.write(seg_path, seg_audio, sr, format="WAV")
85 |         try:
86 |             result = asr_pipe(seg_path)
87 |             if isinstance(result, dict) and "text" in result and "chunks" in result:
88 |                 text = result["text"].strip()
89 |                 words = [(w["timestamp"][0] + start, w["timestamp"][1] + start, w["text"]) for w in result["chunks"]]
90 |             else:
91 |                 text = "[Transcription Error]"
92 |                 words = []
93 |         except Exception as e:
94 |             logger.error(f"Error transcribing segment from {start:.2f} to {end:.2f}: {e}")
95 |             text = "[Transcription Error]"
96 |             words = []
97 |         os.remove(seg_path)
98 |         if text and not text.isspace() and text != ".":
99 |             segments_info.append((start, end, text, words))
100 |     logger.info("VAD-based transcription with word-level timestamps completed.")
101 |     return segments_info
102 | 
103 | def save_transcript_segments(segments: List[Tuple[float, float, str, List[Tuple[float, float, str]]]], output_path: str):
104 |     try:
105 |         with open(output_path, "w", encoding="utf-8") as f:
106 |             for (start, end, text, words) in segments:
107 |                 f.write(f"[{start:.2f} - {end:.2f}] {text}\n")
108 |                 for w_start, w_end, w_text in words:
109 |                     f.write(f"    ({w_start:.2f} - {w_end:.2f}): {w_text}\n")
110 |         logger.info(f"Segmented transcript saved to {output_path}.")
111 |     except Exception as e:
112 |         logger.error(f"Error saving segmented transcript: {e}")
113 |         raise
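For reference, the adaptive threshold above is `P10 + 0.1 * (median - P10)` over the per-frame RMS values, i.e. sitting just above the estimated noise floor. `chunk_based_transcription_vad` then returns a list of `(start, end, text, words)` tuples; a minimal sketch of consuming it, assuming a 16 kHz WAV produced by `process_video_audio`:

```python
from src.asr_utils import chunk_based_transcription_vad

segments = chunk_based_transcription_vad("data/audio_outputs/sample_clean.wav", sr=16000)
for start, end, text, words in segments:
    print(f"[{start:7.2f} - {end:7.2f}] {text} ({len(words)} word timestamps)")
```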
--------------------------------------------------------------------------------
/src/audio_utils.py:
--------------------------------------------------------------------------------
1 | import av
2 | import numpy as np
3 | import librosa
4 | import soundfile as sf
5 | import logging
6 | import noisereduce as nr
7 | import pyloudnorm as pyln
8 | from scipy import signal
9 | 
10 | logging.basicConfig(level=logging.INFO)
11 | logger = logging.getLogger(__name__)
12 | 
13 | def extract_audio_from_video(video_path: str):
14 |     try:
15 |         container = av.open(video_path)
16 |     except Exception as e:
17 |         logger.error(f"Error opening video {video_path}: {e}")
18 |         raise e
19 | 
20 |     audio_stream = None
21 |     for stream in container.streams:
22 |         if stream.type == 'audio':
23 |             audio_stream = stream
24 |             break
25 |     if audio_stream is None:
26 |         raise ValueError("No audio stream found in the video.")
27 | 
28 |     audio_frames = []
29 |     sample_rate = audio_stream.rate if hasattr(audio_stream, "rate") else 48000
30 |     try:
31 |         for frame in container.decode(audio_stream):
32 |             try:
33 |                 pcm = frame.to_ndarray()
34 |             except TypeError:
35 |                 pcm = frame.to_ndarray(format='flt')
36 | 
37 |             if pcm.ndim > 1:
38 |                 pcm = np.mean(pcm, axis=0)
39 |             audio_frames.append(pcm)
40 |     except Exception as e:
41 |         logger.error(f"Error decoding audio frames: {e}")
42 |         raise e
43 | 
44 |     if not audio_frames:
45 |         raise ValueError("No audio frames decoded.")
46 |     audio_data = np.concatenate(audio_frames, axis=0)
47 |     logger.info("Audio extraction from video completed successfully.")
48 |     return audio_data, sample_rate
49 | 
50 | def speech_enhance(audio_data: np.ndarray, sr: int) -> np.ndarray:
51 |     logger.info("Starting speech enhancement...")
52 |     nyquist = 0.5 * sr
53 |     low = 250 / nyquist
54 |     high = 4000 / nyquist
55 |     b, a = signal.butter(2, [low, high], btype='band')
56 |     filtered_audio = signal.filtfilt(b, a, audio_data)
57 |     filtered_audio = 0.7 * filtered_audio + 0.3 * audio_data
58 |     percentile_val = np.percentile(np.abs(filtered_audio), 10)
59 |     noise_mask = np.abs(filtered_audio) < percentile_val * 1.5
60 |     if np.sum(noise_mask) > sr * 0.1:
61 |         noise_sample = filtered_audio[noise_mask]
62 |     else:
63 |         noise_sample = filtered_audio[:int(sr * 0.2)]
64 |     reduced_noise_audio = nr.reduce_noise(
65 |         y=filtered_audio,
66 |         sr=sr,
67 |         y_noise=noise_sample,
68 |         stationary=False,
69 |         prop_decrease=0.25,
70 |         n_fft=1024,
71 |         win_length=512,
72 |         n_jobs=1,
73 |         thresh_n_mult_nonstationary=1.5,
74 |     )
75 |     def adaptive_compress(audio, threshold=-20, ratio=1.5):
76 |         threshold_linear = 10 ** (threshold / 20)
77 |         output = np.zeros_like(audio)
78 |         for i, sample in enumerate(audio):
79 |             if abs(sample) > threshold_linear:
80 |                 if sample > 0:
81 |                     output[i] = threshold_linear + (sample - threshold_linear) / ratio
82 |                 else:
83 |                     output[i] = -threshold_linear + (sample + threshold_linear) / ratio
84 |             else:
85 |                 output[i] = sample
86 |         return output
87 |     compressed_audio = adaptive_compress(reduced_noise_audio)
88 |     meter = pyln.Meter(sr)
89 |     loudness = meter.integrated_loudness(compressed_audio)
90 |     target_loudness = -18.0
91 |     normalized_audio = pyln.normalize.loudness(compressed_audio, loudness, target_loudness)
92 |     if np.max(np.abs(normalized_audio)) > 0.99:
93 |         normalized_audio = normalized_audio / np.max(np.abs(normalized_audio)) * 0.95
94 |     logger.info("Speech enhancement completed.")
95 |     return normalized_audio
96 | 
97 | def reduce_noise_and_enhance(audio_data: np.ndarray, sr: int) -> np.ndarray:
98 |     enhanced_audio = speech_enhance(audio_data, sr)
99 |     def deess(audio, sr):
100 |         nyquist = 0.5 * sr
101 |         high_freq = 5000 / nyquist
102 |         ess_b, ess_a = signal.butter(3, high_freq, btype='highpass')
103 |         ess_band = signal.filtfilt(ess_b, ess_a, audio)
104 |         ess_band_compressed = np.tanh(ess_band * 0.8) / 0.8
105 |         ess_energy = np.sqrt(np.mean(ess_band**2))
106 |         compressed_energy = np.sqrt(np.mean(ess_band_compressed**2))
107 |         if compressed_energy > 0:
108 |             gain = ess_energy / compressed_energy
109 |             ess_band_compressed *= gain
110 |         low_freq = 4800 / nyquist
111 |         low_b, low_a = signal.butter(3, low_freq, btype='lowpass')
112 |         low_band = signal.filtfilt(low_b, low_a, audio)
113 |         result = low_band + ess_band_compressed * 0.7
114 |         return result
115 |     deessed_audio = deess(enhanced_audio, sr)
116 |     meter = pyln.Meter(sr)
117 |     loudness = meter.integrated_loudness(deessed_audio)
118 |     target_loudness = -18.0
119 |     final_audio = pyln.normalize.loudness(deessed_audio, loudness, target_loudness)
120 |     logger.info("Enhanced audio processing completed.")
121 |     return final_audio
122 | 
123 | def save_audio_to_wav(audio_data: np.ndarray, sr: int, output_path: str):
124 |     try:
125 |         sf.write(output_path, audio_data, sr)
126 |         logger.info(f"Audio saved to {output_path}.")
127 |     except Exception as e:
128 |         logger.error(f"Error saving audio to {output_path}: {e}")
129 |         raise e
130 | 
131 | def process_video_audio(video_path: str, output_path: str, target_sr: int = 16000, enhance: bool = True):
132 |     logger.info(f"Starting processing for video: {video_path}")
133 |     raw_audio, orig_sr = extract_audio_from_video(video_path)
134 |     logger.info(f"Resampling audio from {orig_sr} Hz to {target_sr} Hz...")
135 |     audio_data = librosa.resample(raw_audio,
136 |                                   orig_sr=orig_sr,
137 |                                   target_sr=target_sr,
138 |                                   res_type='kaiser_best')
139 | 
140 |     if enhance:
141 |         processed_audio = reduce_noise_and_enhance(audio_data, target_sr)
142 |         orig_output_path = output_path.replace(".wav", "_original.wav")
143 |         save_audio_to_wav(audio_data, target_sr, orig_output_path)
144 |         logger.info(f"Original resampled audio saved to {orig_output_path} for comparison.")
145 |     else:
146 |         logger.info("Enhancement disabled; using resampled audio without further processing.")
147 |         processed_audio = audio_data
148 | 
149 |     save_audio_to_wav(processed_audio, target_sr, output_path)
150 |     logger.info("Phase 1: Audio processing completed.")
151 | 
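`adaptive_compress` in `speech_enhance` walks the signal one sample at a time in pure Python, which is slow for minutes of 16 kHz audio. A sketch of an equivalent NumPy formulation with the same threshold/ratio semantics, applied symmetrically through the sign:

```python
import numpy as np

def adaptive_compress_vectorized(audio: np.ndarray, threshold: float = -20, ratio: float = 1.5) -> np.ndarray:
    t = 10 ** (threshold / 20)  # dB threshold -> linear amplitude
    magnitude = np.abs(audio)
    # Above the knee, keep the threshold and divide the excess by the ratio.
    compressed = np.sign(audio) * (t + (magnitude - t) / ratio)
    return np.where(magnitude > t, compressed, audio)
```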
--------------------------------------------------------------------------------
/src/helpers.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TahaBakhtari/SubtitleGenerator/2e2553a40d66c95c4d7d123ebfd0aa6bb63cbc09/src/helpers.py
--------------------------------------------------------------------------------
/src/make_it_correct.py:
--------------------------------------------------------------------------------
1 | def make_it_correct_openai(message, system_prompt):
2 |     import openai
3 |     openai.api_key = "KEY"
4 |     response = openai.ChatCompletion.create(
5 |         model="gpt-4o-mini",
6 |         messages=[
7 |             {"role": "system", "content": system_prompt},
8 |             {"role": "user", "content": message}
9 |         ]
10 |     )
11 |     return response.choices[0].message['content']
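`openai.ChatCompletion.create` is the legacy pre-1.0 interface and was removed in `openai>=1.0`. A sketch of the same call against the current client, assuming the key is supplied through the `OPENAI_API_KEY` environment variable instead of the hard-coded placeholder:

```python
from openai import OpenAI

def make_it_correct_openai_v1(message: str, system_prompt: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": message},
        ],
    )
    return response.choices[0].message.content
```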
--------------------------------------------------------------------------------
/src/subtitle_utils.py:
--------------------------------------------------------------------------------
1 | import logging
2 | from typing import List, Tuple, Dict
3 | from deep_translator import GoogleTranslator
4 | import re
5 | 
6 | logging.basicConfig(level=logging.INFO)
7 | logger = logging.getLogger(__name__)
8 | 
9 | def format_srt_timestamp(seconds: float) -> str:
10 |     hrs = int(seconds // 3600)
11 |     mins = int((seconds % 3600) // 60)
12 |     secs = int(seconds % 60)
13 |     millis = int((seconds - int(seconds)) * 1000)
14 |     return f"{hrs:02d}:{mins:02d}:{secs:02d},{millis:03d}"
15 | 
16 | def smart_segment_text(text: str, max_chars: int = 42) -> List[str]:
17 |     if len(text) <= max_chars:
18 |         return [text]
19 | 
20 |     segments = []
21 |     break_chars = ['. ', '! ', '? ', '; ', ', ', ' ']
22 | 
23 |     while len(text) > max_chars:
24 |         best_break_idx = -1
25 | 
26 |         for char in break_chars:
27 |             slice_to_check = text[max_chars//2:max_chars]
28 |             pos = slice_to_check.find(char)
29 | 
30 |             if pos != -1:
31 |                 best_break_idx = pos + max_chars//2 + len(char) - 1
32 |                 break
33 | 
34 |         if best_break_idx == -1:
35 |             best_break_idx = max_chars
36 |             while best_break_idx < len(text) and text[best_break_idx] != ' ':
37 |                 best_break_idx -= 1
38 |             if best_break_idx <= max_chars // 2:
39 |                 best_break_idx = max_chars
40 |         segments.append(text[:best_break_idx+1].strip())
41 |         text = text[best_break_idx+1:].strip()
42 |     if text:
43 |         segments.append(text)
44 | 
45 |     return segments
46 | 
47 | def optimize_subtitle_timing(words: List[Tuple[float, float, str]],
48 |                              min_duration: float = 1.0,
49 |                              max_duration: float = 7.0,
50 |                              max_chars_per_line: int = 42) -> List[Tuple[float, float, str]]:
51 |     if not words:
52 |         return []
53 | 
54 |     full_text = " ".join(word[2] for word in words)
55 |     full_text = re.sub(r'\s+', ' ', full_text).strip()
56 | 
57 |     start_time = words[0][0]
58 |     end_time = words[-1][1]
59 | 
60 |     duration = end_time - start_time
61 | 
62 |     if duration < min_duration:
63 |         end_time = start_time + min_duration
64 | 
65 |     if len(full_text) <= max_chars_per_line and duration <= max_duration:
66 |         return [(start_time, end_time, full_text)]
67 | 
68 |     text_segments = smart_segment_text(full_text, max_chars_per_line)
69 | 
70 |     result = []
71 |     segment_count = len(text_segments)
72 | 
73 |     word_to_time = {}
74 |     for word_start, word_end, word_text in words:
75 |         for word in word_text.split():
76 |             word = word.strip().lower()
77 |             if word not in word_to_time and word:
78 |                 word_to_time[word] = (word_start, word_end)
79 | 
80 |     current_position = 0
81 |     for i, segment in enumerate(text_segments):
82 |         segment_words = segment.lower().split()
83 | 
84 |         if not segment_words:
85 |             continue
86 | 
87 |         if i == 0:
88 |             seg_start = start_time
89 |         else:
90 |             for word in segment_words:
91 |                 if word in word_to_time:
92 |                     seg_start = word_to_time[word][0]
93 |                     break
94 |             else:
95 |                 seg_start = start_time + (duration * current_position / len(full_text))
96 | 
97 |         if i == segment_count - 1:
98 |             seg_end = end_time
99 |         else:
100 |             for word in reversed(segment_words):
101 |                 if word in word_to_time:
102 |                     seg_end = word_to_time[word][1]
103 |                     break
104 |             else:
105 |                 current_position += len(segment)
106 |                 seg_end = start_time + (duration * current_position / len(full_text))
107 | 
108 |         if seg_end - seg_start < min_duration:
109 |             seg_end = seg_start + min_duration
110 | 
111 |         result.append((seg_start, seg_end, segment))
112 | 
113 |     return result
114 | 
115 | def generate_srt(segments: List[Tuple[float, float, str, List[Tuple[float, float, str]]]], srt_output: str):
116 |     try:
117 |         with open(srt_output, "w", encoding="utf-8") as f:
118 |             subtitle_index = 1
119 | 
120 |             for segment_start, segment_end, text, words in segments:
121 |                 if not words:
122 |                     subtitles = [(segment_start, segment_end, text)]
123 |                 else:
124 |                     subtitles = optimize_subtitle_timing(words)
125 | 
126 |                 for start, end, subtitle_text in subtitles:
127 |                     start_ts = format_srt_timestamp(start)
128 |                     end_ts = format_srt_timestamp(end)
129 | 
130 |                     f.write(f"{subtitle_index}\n{start_ts} --> {end_ts}\n{subtitle_text.strip()}\n\n")
131 |                     subtitle_index += 1
132 | 
133 |         logger.info(f"SRT file saved to {srt_output}.")
134 |     except Exception as e:
135 |         logger.error(f"Error generating SRT: {e}")
136 |         raise
137 | 
138 | def generate_dual_srt(segments: List[Tuple[float, float, str, List[Tuple[float, float, str]]]], original_lang: str, target_lang: str, srt_output: str):
139 |     translator = GoogleTranslator(source=original_lang, target=target_lang)
140 | 
141 |     context_window = []
142 |     grouped_segments = []
143 | 
144 |     for i, segment in enumerate(segments):
145 |         start, end, text, words = segment
146 | 
147 |         if i == 0 or start - segments[i-1][1] > 2.0:
148 |             if context_window:
149 |                 grouped_segments.append(context_window)
150 |                 context_window = []
151 | 
152 |             context_window = [segment]
153 |         else:
154 |             context_window.append(segment)
155 | 
156 |     if context_window:
157 |         grouped_segments.append(context_window)
158 | 
159 |     try:
160 |         with open(srt_output, "w", encoding="utf-8") as f:
161 |             subtitle_index = 1
162 | 
163 |             for group in grouped_segments:
164 |                 combined_text = " ".join(segment[2] for segment in group)
165 | 
166 |                 try:
167 |                     translated_full = translator.translate(combined_text)
168 | 
169 |                     orig_lengths = [len(segment[2]) for segment in group]
170 |                     total_orig_len = sum(orig_lengths)
171 |                     trans_approx_positions = []
172 | 
173 |                     pos = 0
174 |                     for length in orig_lengths[:-1]:
175 |                         ratio = length / total_orig_len
176 |                         pos += int(ratio * len(translated_full))
177 |                         trans_approx_positions.append(pos)
178 | 
179 |                     translated_segments = []
180 |                     last_pos = 0
181 | 
182 |                     for pos in trans_approx_positions:
183 |                         break_chars = ['. ', '! ', '? ', '; ', ', ', ' ']
184 |                         best_pos = pos
185 | 
186 |                         window = 10
187 |                         for char in break_chars:
188 |                             window_text = translated_full[max(0, pos-window):min(len(translated_full), pos+window)]
189 |                             char_pos = window_text.find(char)
190 | 
191 |                             if char_pos != -1:
192 |                                 best_pos = max(0, pos-window) + char_pos + len(char)
193 |                                 break
194 | 
195 |                         translated_segments.append(translated_full[last_pos:best_pos].strip())
196 |                         last_pos = best_pos
197 | 
198 |                     translated_segments.append(translated_full[last_pos:].strip())
199 | 
200 |                 except Exception as e:
201 |                     logger.error(f"Translation error: {e}")
202 |                     translated_segments = []
203 |                     for segment in group:
204 |                         try:
205 |                             translated = translator.translate(segment[2])
206 |                             translated_segments.append(translated)
207 |                         except Exception:
208 |                             translated_segments.append("[Translation Error]")
209 | 
210 |                 for i, (segment, translated) in enumerate(zip(group, translated_segments)):
211 |                     segment_start, segment_end, text, words = segment
212 | 
213 |                     if not words:
214 |                         subtitles = [(segment_start, segment_end, translated)]
215 |                     else:
216 |                         orig_subtitles = optimize_subtitle_timing(words)
217 | 
218 |                         trans_lines = smart_segment_text(translated, max_chars=42)
219 | 
220 |                         if len(orig_subtitles) == len(trans_lines):
221 |                             subtitles = [(start, end, trans)
222 |                                          for (start, end, _), trans in zip(orig_subtitles, trans_lines)]
223 |                         else:
224 |                             if len(orig_subtitles) > 0:
225 |                                 if len(trans_lines) == 1:
226 |                                     subtitles = [(start, end, translated) for start, end, _ in orig_subtitles]
227 |                                 else:
228 |                                     ratio = len(trans_lines) / len(orig_subtitles)
229 |                                     subtitles = []
230 | 
231 |                                     for j, (start, end, _) in enumerate(orig_subtitles):
232 |                                         start_idx = min(int(j * ratio), len(trans_lines) - 1)
233 |                                         end_idx = min(int((j + 1) * ratio), len(trans_lines))
234 | 
235 |                                         if start_idx < end_idx:
236 |                                             combined_text = " ".join(trans_lines[start_idx:end_idx])
237 |                                             subtitles.append((start, end, combined_text))
238 |                                         else:
239 |                                             subtitles.append((start, end, trans_lines[min(j, len(trans_lines)-1)]))
240 |                             else:
241 |                                 subtitles = [(segment_start, segment_end, translated)]
242 | 
243 |                     for start, end, trans_text in subtitles:
244 |                         start_ts = format_srt_timestamp(start)
245 |                         end_ts = format_srt_timestamp(end)
246 | 
247 |                         f.write(f"{subtitle_index}\n{start_ts} --> {end_ts}\n{trans_text.strip()}\n\n")
248 |                         subtitle_index += 1
249 | 
250 |         logger.info(f"Translated SRT saved to {srt_output}.")
251 |     except Exception as e:
252 |         logger.error(f"Error generating translated SRT: {e}")
253 |         raise
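A quick sanity check of the pure helpers above: `format_srt_timestamp` is fully deterministic, and `smart_segment_text` prefers a punctuation break inside the second half of its 42-character window before falling back to the nearest space:

```python
from src.subtitle_utils import format_srt_timestamp, smart_segment_text

print(format_srt_timestamp(3661.5))  # -> 01:01:01,500
print(smart_segment_text(
    "This subtitle line is long enough that it must be split. Here it continues."
))  # -> a list of chunks, each at most ~42 characters
```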
--------------------------------------------------------------------------------
/src/system_prompt.txt:
--------------------------------------------------------------------------------
1 | در این سیستم وظیفه‌ات به عنوان یک اصلاح‌کننده تخصصی زیرنویس است. ورودی شامل بلوک‌هایی با فرمت دقیق زیر است:
2 | 
3 | "
4 | شماره
5 | زمان شروع --> زمان پایان
6 | متن
7 | "
8 | 
9 | دستورالعمل‌های دقیق:
10 | 
11 | تنها اشتباهات املایی و نحوی موجود در متن‌های استخراج‌شده (از تبدیل صدا به متن) را اصلاح کن.
12 | ساختار، فرمت و تایم‌های موجود (زمان شروع و پایان) را دست نخورده همانگونه که هستند حفظ کن؛ هیچ تغییری در آنها ایجاد نکن.
13 | اگر یک کلمه اشتباه نوشته شده است، آن را به شکل صحیح و استاندارد اصلاح کن (مثلاً «پانل» را به «پنل» و برای مثال اگر مرتبط بود چیک رو به چک . «ویب» را به «وب» در صورت همراهی با کلمه «سایت» یا مشابه آن اصلاح کن).
14 | هیچ حرف یا توضیح اضافه‌ای به متن اضافه نکن؛ فقط نسخه اصلاح‌شده را به همان شکل ورودی (با حفظ ترتیب شماره‌ها و تایم‌ها) برگردان.
15 | خروجی نهایی باید دقیقاً همان قالب ورودی باشد اما تنها با اصلاحات لازم در متن‌های اشتباه.
--------------------------------------------------------------------------------
/src/translation_utils.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TahaBakhtari/SubtitleGenerator/2e2553a40d66c95c4d7d123ebfd0aa6bb63cbc09/src/translation_utils.py
--------------------------------------------------------------------------------
/static/uploads/fc67d12e-a983-48e1-aa8c-5ddaa026588a/sample1.mp4:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/TahaBakhtari/SubtitleGenerator/2e2553a40d66c95c4d7d123ebfd0aa6bb63cbc09/static/uploads/fc67d12e-a983-48e1-aa8c-5ddaa026588a/sample1.mp4
--------------------------------------------------------------------------------
/templates/home.html:
--------------------------------------------------------------------------------
[The HTML markup of this template did not survive extraction; only its text content remains. The page is titled "تولید زیرنویس ویدیو" (Video Subtitle Generation) and contains the upload form: a video file input with a "حداکثر سایز فایل: 100 مگابایت" (maximum file size: 100 MB) note, selects for the original and target languages, checkboxes for audio enhancement and subtitle correction, a dialect select, a submit button, and a flash-message area rendered with a Jinja `get_flashed_messages()` loop.]
--------------------------------------------------------------------------------
/templates/output.html:
--------------------------------------------------------------------------------
[The HTML markup of this template did not survive extraction; only its text content remains. The page is titled "نتایج تولید زیرنویس" (Subtitle Generation Results) and shows: a video preview with subtitles ("پیش نمایش ویدیو با زیرنویس"), a subtitle settings panel ("تنظیمات زیرنویس"), a video info block ("اطلاعات ویدیو") rendering `{{ results.filename }}`, `{{ results.original_lang }}`, and `{{ results.target_lang }}`, tabbed viewers for the transcript, the SRT, and the dual-language SRT (each with an "در حال بارگذاری..." loading spinner), a manual subtitle editing form ("ویرایش دستی زیرنویس"), and a download section ("دانلود فایل‌ها"), followed by the page's script block.]
--------------------------------------------------------------------------------