├── .github └── FUNDING.yml ├── requirements.txt ├── LICENSE ├── README.md └── telegram-scraper.py /.github/FUNDING.yml: -------------------------------------------------------------------------------- 1 | ko_fi: unnohwn 2 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | aiohappyeyeballs==2.6.1 2 | aiohttp==3.12.14 3 | aiosignal==1.4.0 4 | asyncio==3.4.3 5 | attrs==25.3.0 6 | frozenlist==1.7.0 7 | idna==3.10 8 | multidict==6.6.3 9 | propcache==0.3.2 10 | pyaes==1.6.1 11 | pyasn1==0.6.1 12 | qrcode==8.0 13 | rsa==4.9.1 14 | Telethon==1.40.0 15 | yarl==1.20.1 16 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2024 unnohwn 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Telegram Channel Scraper 📱 2 | 3 | A powerful Python script that allows you to scrape messages and media from Telegram channels using the Telethon library. Features include real-time continuous scraping, media downloading, and data export capabilities. 4 | 5 | ``` 6 | ___________________ _________ 7 | \__ ___/ _____/ / _____/ 8 | | | / \ ___ \_____ \ 9 | | | \ \_\ \/ \ 10 | |____| \______ /_______ / 11 | \/ \/ 12 | ``` 13 | 14 | ## What's New in v3.1 🎉 15 | 16 | **Enhanced Message Data:** 17 | - **Message statistics** - Captures views, forwards, and post_author for each message 18 | - **Reactions support** - Records all emoji reactions with counts (e.g., "😀 12 👍 3") 19 | - **Automatic database migration** - Seamlessly adds new columns to existing databases 20 | - **Richer exports** - All new data included in CSV/JSON exports 21 | 22 | **Improved Channel Management:** 23 | - **Channel names displayed** - Shows channel names alongside IDs everywhere 24 | - **Smart filtering** - List option now only shows Channels and Groups (no private chats) 25 | - **channels_list.csv export** - Automatically saves channel list with names, IDs, usernames, and types 26 | - **"all" selection** - Quickly add all listed channels at once 27 | - **Better export naming** - Files now named as `ID_username.csv` and `ID_username.json` 28 | 29 | **Bug Fixes:** 30 | - **Fixed channel ID parsing** - Resolved "invalid literal for int()" error in fix missing media 31 | - **Better entity resolution** - Handles both numeric IDs and channel 
usernames 32 | - **Improved error messages** - Shows channel names with IDs for clearer debugging 33 | 34 | ## Features 🚀 35 | 36 | - **QR Code & Phone Authentication** - Choose your preferred login method 37 | - Scrape messages with full metadata (views, forwards, reactions, post author) 38 | - Download media files with parallel processing and unique naming 39 | - Real-time continuous scraping 40 | - Export data to JSON and CSV formats with enhanced metadata 41 | - SQLite database storage with automatic schema migration 42 | - Resume capability (saves progress) 43 | - Interactive menu with channel names and numbered selection 44 | - Smart channel filtering (only shows channels/groups) 45 | - Progress tracking with visual progress bars 46 | - Automatic channels list export to CSV 47 | 48 | ## Prerequisites 📋 49 | 50 | Before running the script, you'll need: 51 | 52 | - Python 3.7 or higher 53 | - A Telegram account 54 | - API credentials from Telegram 55 | 56 | ### Required Python packages 57 | 58 | ``` 59 | pip install -r requirements.txt 60 | ``` 61 | 62 | ## Getting Telegram API Credentials 🔑 63 | 64 | 1. Visit https://my.telegram.org/auth 65 | 2. Log in with your phone number 66 | 3. Click on "API development tools" 67 | 4. Fill in the form: 68 | - App title: Your app name 69 | - Short name: Your app short name 70 | - Platform: Can be left as "Desktop" 71 | - Description: Brief description of your app 72 | 5. Click "Create application" 73 | 6. You'll receive: 74 | - `api_id`: A number 75 | - `api_hash`: A string of letters and numbers 76 | 77 | Keep these credentials safe; you'll need them to run the script! 78 | 79 | ## Setup and Running 🔧 80 | 81 | 1. Clone the repository: 82 | ```bash 83 | git clone https://github.com/unnohwn/telegram-scraper.git 84 | cd telegram-scraper 85 | ``` 86 | 87 | 2. Install requirements: 88 | ```bash 89 | pip install -r requirements.txt 90 | ``` 91 | 92 | 3.
Run the script: 93 | ```bash 94 | python telegram-scraper.py 95 | ``` 96 | 97 | 4. On first run, you'll be prompted to enter: 98 | - Your API ID (from my.telegram.org) 99 | - Your API Hash (from my.telegram.org) 100 | - **Choose authentication method:** 101 | - **QR Code** (Recommended) - Scan with your phone (no phone number needed) 102 | - **Phone Number** - Traditional SMS verification 103 | 104 | ## Usage 📝 105 | 106 | The script provides a clean interactive menu: 107 | 108 | ``` 109 | ======================================== 110 | TELEGRAM SCRAPER 111 | ======================================== 112 | [S] Scrape channels 113 | [C] Continuous scraping 114 | [M] Media scraping: ON 115 | [L] List & add channels 116 | [R] Remove channels 117 | [E] Export data 118 | [T] Rescrape media 119 | [Q] Quit 120 | ======================================== 121 | ``` 122 | 123 | ### Channel Selection Made Easy 🔢 124 | 125 | Instead of typing long channel IDs, use numbers: 126 | 127 | **Adding Channels:** 128 | ``` 129 | [1] Tech News (ID: -1002116176890, Type: Channel, Username: @technews) 130 | [2] Python Dev (ID: -1001597139842, Type: Group, Username: @pythondev) 131 | [3] Daily Updates (ID: -1002274713954, Type: Channel, Username: @dailyupdates) 132 | 133 | Enter: 1,3 (adds channels 1 and 3) 134 | Or: all (adds all listed channels) 135 | ``` 136 | 137 | **Viewing Your Channels:** 138 | ``` 139 | [1] Tech News (ID: -1002116176890), Last Message ID: 5234, Messages: 12450 140 | [2] Python Dev (ID: -1001597139842), Last Message ID: 8192, Messages: 45782 141 | ``` 142 | 143 | **Scraping Channels:** 144 | - Single: `1` 145 | - Multiple: `1,3,5` 146 | - All: `all` 147 | - Mix formats: `1,-1001597139842,3` 148 | 149 | ## Data Storage 💾 150 | 151 | ### Database Structure 152 | 153 | Data is stored in SQLite databases, one per channel: 154 | - Location: `./channelname/channelname.db` 155 | - Optimized with indexes for fast queries 156 | - WAL mode for better performance 157 | - Schema 
includes: message_id, date, sender info, message text, media info, reply_to, post_author, views, forwards, reactions 158 | - Automatic migration adds new columns to existing databases 159 | 160 | ### Media Storage 📁 161 | 162 | Media files are stored with unique naming: 163 | - Location: `./channelname/media/` 164 | - Format: `{message_id}-{original_name}` 165 | - **No more file overwrites** - Each file gets a unique name 166 | 167 | ### Exported Data 📊 168 | 169 | Export formats: 170 | 1. **CSV**: `./channelname/channelid_username.csv` 171 | 2. **JSON**: `./channelname/channelid_username.json` 172 | 3. **Channel List**: `./channels_list.csv` (automatically created when using [L] option) 173 | 174 | All exports include complete message metadata: views, forwards, reactions, and post author information. 175 | 176 | ## Performance Features ⚙️ 177 | 178 | - **5 concurrent downloads** for faster media processing 179 | - **Batch database operations** for optimal speed 180 | - **Progress bars** with real-time feedback 181 | - **Resume capability** - Continue where you left off 182 | - **Memory-efficient** exports for large datasets 183 | 184 | ## Error Handling 🛠️ 185 | 186 | - Automatic retry with exponential backoff 187 | - Rate limit compliance 188 | - Network error recovery 189 | - State preservation during interruptions 190 | 191 | ## Limitations ⚠️ 192 | 193 | - Respects Telegram's rate limits 194 | - Can only access public channels or channels you're a member of 195 | - Media download size limits apply as per Telegram's restrictions 196 | 197 | ## License 📄 198 | 199 | This project is licensed under the MIT License - see the LICENSE file for details. 200 | 201 | ## Disclaimer ⚖️ 202 | 203 | This tool is for educational purposes only.
Make sure to: 204 | - Respect Telegram's Terms of Service 205 | - Obtain necessary permissions before scraping 206 | - Use responsibly and ethically 207 | - Comply with data protection regulations 208 | -------------------------------------------------------------------------------- /telegram-scraper.py: -------------------------------------------------------------------------------- 1 | import os 2 | import sqlite3 3 | import json 4 | import csv 5 | import asyncio 6 | import time 7 | import sys 8 | import uuid 9 | import warnings 10 | from dataclasses import dataclass 11 | from typing import Dict, List, Optional, Any 12 | from pathlib import Path 13 | from io import StringIO 14 | from telethon import TelegramClient 15 | from telethon.tl.types import MessageMediaPhoto, MessageMediaDocument, MessageMediaWebPage, User, PeerChannel, Channel, Chat 16 | from telethon.errors import FloodWaitError, SessionPasswordNeededError 17 | import qrcode 18 | 19 | warnings.filterwarnings("ignore", message="Using async sessions support is an experimental feature") 20 | 21 | def display_ascii_art(): 22 | WHITE = "\033[97m" 23 | RESET = "\033[0m" 24 | art = r""" 25 | ___________________ _________ 26 | \__ ___/ _____/ / _____/ 27 | | | / \ ___ \_____ \ 28 | | | \ \_\ \/ \ 29 | |____| \______ /_______ / 30 | \/ \/ 31 | """ 32 | print(WHITE + art + RESET) 33 | 34 | @dataclass 35 | class MessageData: 36 | message_id: int 37 | date: str 38 | sender_id: int 39 | first_name: Optional[str] 40 | last_name: Optional[str] 41 | username: Optional[str] 42 | message: str 43 | media_type: Optional[str] 44 | media_path: Optional[str] 45 | reply_to: Optional[int] 46 | post_author: Optional[str] 47 | views: Optional[int] 48 | forwards: Optional[int] 49 | reactions: Optional[str] 50 | 51 | class OptimizedTelegramScraper: 52 | def __init__(self): 53 | self.STATE_FILE = 'state.json' 54 | self.state = self.load_state() 55 | self.client = None 56 | self.continuous_scraping_active = False 57 | 
self.max_concurrent_downloads = 5 58 | self.batch_size = 100 59 | self.state_save_interval = 50 60 | self.db_connections = {} 61 | 62 | def load_state(self) -> Dict[str, Any]: 63 | if os.path.exists(self.STATE_FILE): 64 | try: 65 | with open(self.STATE_FILE, 'r') as f: 66 | return json.load(f) 67 | except: 68 | pass 69 | return { 70 | 'api_id': None, 71 | 'api_hash': None, 72 | 'channels': {}, 73 | 'channel_names': {}, 74 | 'scrape_media': True, 75 | } 76 | 77 | def save_state(self): 78 | try: 79 | with open(self.STATE_FILE, 'w') as f: 80 | json.dump(self.state, f, indent=2) 81 | except Exception as e: 82 | print(f"Failed to save state: {e}") 83 | 84 | def get_db_connection(self, channel: str) -> sqlite3.Connection: 85 | if channel not in self.db_connections: 86 | channel_dir = Path(channel) 87 | channel_dir.mkdir(exist_ok=True) 88 | 89 | db_file = channel_dir / f'{channel}.db' 90 | conn = sqlite3.connect(str(db_file), check_same_thread=False) 91 | conn.execute('''CREATE TABLE IF NOT EXISTS messages 92 | (id INTEGER PRIMARY KEY, message_id INTEGER UNIQUE, date TEXT, 93 | sender_id INTEGER, first_name TEXT, last_name TEXT, username TEXT, 94 | message TEXT, media_type TEXT, media_path TEXT, reply_to INTEGER, 95 | post_author TEXT, views INTEGER, forwards INTEGER, reactions TEXT)''') 96 | conn.execute('CREATE INDEX IF NOT EXISTS idx_message_id ON messages(message_id)') 97 | conn.execute('CREATE INDEX IF NOT EXISTS idx_date ON messages(date)') 98 | conn.execute('PRAGMA journal_mode=WAL') 99 | conn.execute('PRAGMA synchronous=NORMAL') 100 | conn.commit() 101 | 102 | self.migrate_database(conn) 103 | 104 | self.db_connections[channel] = conn 105 | 106 | return self.db_connections[channel] 107 | 108 | def migrate_database(self, conn: sqlite3.Connection): 109 | cursor = conn.cursor() 110 | cursor.execute("PRAGMA table_info(messages)") 111 | columns = {row[1] for row in cursor.fetchall()} 112 | 113 | migrations = [] 114 | if 'post_author' not in columns: 115 | 
migrations.append('ALTER TABLE messages ADD COLUMN post_author TEXT') 116 | if 'views' not in columns: 117 | migrations.append('ALTER TABLE messages ADD COLUMN views INTEGER') 118 | if 'forwards' not in columns: 119 | migrations.append('ALTER TABLE messages ADD COLUMN forwards INTEGER') 120 | if 'reactions' not in columns: 121 | migrations.append('ALTER TABLE messages ADD COLUMN reactions TEXT') 122 | 123 | for migration in migrations: 124 | try: 125 | conn.execute(migration) 126 | except: 127 | pass 128 | 129 | if migrations: 130 | conn.commit() 131 | 132 | def close_db_connections(self): 133 | for conn in self.db_connections.values(): 134 | conn.close() 135 | self.db_connections.clear() 136 | 137 | def batch_insert_messages(self, channel: str, messages: List[MessageData]): 138 | if not messages: 139 | return 140 | 141 | conn = self.get_db_connection(channel) 142 | data = [(msg.message_id, msg.date, msg.sender_id, msg.first_name, 143 | msg.last_name, msg.username, msg.message, msg.media_type, 144 | msg.media_path, msg.reply_to, msg.post_author, msg.views, 145 | msg.forwards, msg.reactions) for msg in messages] 146 | 147 | conn.executemany('''INSERT OR IGNORE INTO messages 148 | (message_id, date, sender_id, first_name, last_name, username, 149 | message, media_type, media_path, reply_to, post_author, views, 150 | forwards, reactions) 151 | VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)''', data) 152 | conn.commit() 153 | 154 | async def download_media(self, channel: str, message) -> Optional[str]: 155 | if not message.media or not self.state['scrape_media']: 156 | return None 157 | 158 | if isinstance(message.media, MessageMediaWebPage): 159 | return None 160 | 161 | try: 162 | channel_dir = Path(channel) 163 | media_folder = channel_dir / 'media' 164 | media_folder.mkdir(exist_ok=True) 165 | 166 | if isinstance(message.media, MessageMediaPhoto): 167 | original_name = getattr(message.file, 'name', None) or "photo.jpg" 168 | ext = "jpg" 169 | elif 
isinstance(message.media, MessageMediaDocument): 170 | ext = getattr(message.file, 'ext', 'bin') if message.file else 'bin' 171 | original_name = getattr(message.file, 'name', None) or f"document.{ext}" 172 | else: 173 | return None 174 | 175 | base_name = Path(original_name).stem 176 | extension = Path(original_name).suffix or f".{ext}" 177 | unique_filename = f"{message.id}-{base_name}{extension}" 178 | media_path = media_folder / unique_filename 179 | 180 | existing_files = list(media_folder.glob(f"{message.id}-*")) 181 | if existing_files: 182 | return str(existing_files[0]) 183 | 184 | for attempt in range(3): 185 | try: 186 | downloaded_path = await message.download_media(file=str(media_path)) 187 | if downloaded_path and Path(downloaded_path).exists(): 188 | return downloaded_path 189 | else: 190 | return None 191 | except FloodWaitError as e: 192 | if attempt < 2: 193 | await asyncio.sleep(e.seconds) 194 | else: 195 | return None 196 | except Exception: 197 | if attempt < 2: 198 | await asyncio.sleep(2 ** attempt) 199 | else: 200 | return None 201 | 202 | return None 203 | except Exception: 204 | return None 205 | 206 | async def update_media_path(self, channel: str, message_id: int, media_path: str): 207 | conn = self.get_db_connection(channel) 208 | conn.execute('UPDATE messages SET media_path = ? 
WHERE message_id = ?', 209 | (media_path, message_id)) 210 | conn.commit() 211 | 212 | async def scrape_channel(self, channel: str, offset_id: int): 213 | try: 214 | entity = await self.client.get_entity(PeerChannel(int(channel)) if channel.startswith('-') else channel) 215 | result = await self.client.get_messages(entity, offset_id=offset_id, reverse=True, limit=0) 216 | total_messages = result.total 217 | 218 | if total_messages == 0: 219 | print(f"No messages found in channel {channel}") 220 | return 221 | 222 | print(f"Found {total_messages} messages in channel {channel}") 223 | 224 | message_batch = [] 225 | media_tasks = [] 226 | processed_messages = 0 227 | last_message_id = offset_id 228 | semaphore = asyncio.Semaphore(self.max_concurrent_downloads) 229 | 230 | async for message in self.client.iter_messages(entity, offset_id=offset_id, reverse=True): 231 | try: 232 | sender = await message.get_sender() 233 | 234 | reactions_str = None 235 | if message.reactions and message.reactions.results: 236 | reactions_parts = [] 237 | for reaction in message.reactions.results: 238 | emoji = getattr(reaction.reaction, 'emoticon', '') 239 | count = reaction.count 240 | if emoji: 241 | reactions_parts.append(f"{emoji} {count}") 242 | if reactions_parts: 243 | reactions_str = ' '.join(reactions_parts) 244 | 245 | msg_data = MessageData( 246 | message_id=message.id, 247 | date=message.date.strftime('%Y-%m-%d %H:%M:%S'), 248 | sender_id=message.sender_id, 249 | first_name=getattr(sender, 'first_name', None) if isinstance(sender, User) else None, 250 | last_name=getattr(sender, 'last_name', None) if isinstance(sender, User) else None, 251 | username=getattr(sender, 'username', None) if isinstance(sender, User) else None, 252 | message=message.message or '', 253 | media_type=message.media.__class__.__name__ if message.media else None, 254 | media_path=None, 255 | reply_to=message.reply_to_msg_id if message.reply_to else None, 256 | post_author=message.post_author, 257 | 
views=message.views, 258 | forwards=message.forwards, 259 | reactions=reactions_str 260 | ) 261 | 262 | message_batch.append(msg_data) 263 | 264 | if self.state['scrape_media'] and message.media and not isinstance(message.media, MessageMediaWebPage): 265 | media_tasks.append(message) 266 | 267 | last_message_id = message.id 268 | processed_messages += 1 269 | 270 | if len(message_batch) >= self.batch_size: 271 | self.batch_insert_messages(channel, message_batch) 272 | message_batch.clear() 273 | 274 | if processed_messages % self.state_save_interval == 0: 275 | self.state['channels'][channel] = last_message_id 276 | self.save_state() 277 | 278 | progress = (processed_messages / total_messages) * 100 279 | bar_length = 30 280 | filled_length = int(bar_length * processed_messages // total_messages) 281 | bar = '█' * filled_length + '░' * (bar_length - filled_length) 282 | 283 | sys.stdout.write(f"\r📄 Messages: [{bar}] {progress:.1f}% ({processed_messages}/{total_messages})") 284 | sys.stdout.flush() 285 | 286 | except Exception as e: 287 | print(f"\nError processing message {message.id}: {e}") 288 | 289 | if message_batch: 290 | self.batch_insert_messages(channel, message_batch) 291 | 292 | if media_tasks: 293 | total_media = len(media_tasks) 294 | completed_media = 0 295 | successful_downloads = 0 296 | print(f"\n📥 Downloading {total_media} media files...") 297 | 298 | semaphore = asyncio.Semaphore(self.max_concurrent_downloads) 299 | 300 | async def download_single_media(message): 301 | async with semaphore: 302 | return await self.download_media(channel, message) 303 | 304 | batch_size = 10 305 | for i in range(0, len(media_tasks), batch_size): 306 | batch = media_tasks[i:i + batch_size] 307 | tasks = [asyncio.create_task(download_single_media(msg)) for msg in batch] 308 | 309 | for j, task in enumerate(tasks): 310 | try: 311 | media_path = await task 312 | if media_path: 313 | await self.update_media_path(channel, batch[j].id, media_path) 314 | 
successful_downloads += 1 315 | except Exception: 316 | pass 317 | 318 | completed_media += 1 319 | progress = (completed_media / total_media) * 100 320 | bar_length = 30 321 | filled_length = int(bar_length * completed_media // total_media) 322 | bar = '█' * filled_length + '░' * (bar_length - filled_length) 323 | 324 | sys.stdout.write(f"\r📥 Media: [{bar}] {progress:.1f}% ({completed_media}/{total_media})") 325 | sys.stdout.flush() 326 | 327 | print(f"\n✅ Media download complete! ({successful_downloads}/{total_media} successful)") 328 | 329 | self.state['channels'][channel] = last_message_id 330 | self.save_state() 331 | print(f"\nCompleted scraping channel {channel}") 332 | 333 | except Exception as e: 334 | print(f"Error with channel {channel}: {e}") 335 | 336 | async def rescrape_media(self, channel: str): 337 | conn = self.get_db_connection(channel) 338 | cursor = conn.cursor() 339 | cursor.execute('SELECT message_id FROM messages WHERE media_type IS NOT NULL AND media_type != "MessageMediaWebPage" AND media_path IS NULL') 340 | message_ids = [row[0] for row in cursor.fetchall()] 341 | 342 | channel_name = self.state.get('channel_names', {}).get(channel, 'Unknown') 343 | 344 | if not message_ids: 345 | print(f"No media files to reprocess for {channel_name} (ID: {channel})") 346 | return 347 | 348 | print(f"📥 Reprocessing {len(message_ids)} media files for {channel_name} (ID: {channel})") 349 | 350 | try: 351 | if channel.lstrip('-').isdigit(): 352 | entity = await self.client.get_entity(PeerChannel(int(channel))) 353 | else: 354 | entity = await self.client.get_entity(channel) 355 | semaphore = asyncio.Semaphore(self.max_concurrent_downloads) 356 | completed_media = 0 357 | successful_downloads = 0 358 | 359 | async def download_single_media(message): 360 | async with semaphore: 361 | return await self.download_media(channel, message) 362 | 363 | batch_size = 10 364 | for i in range(0, len(message_ids), batch_size): 365 | batch_ids = message_ids[i:i + 
batch_size] 366 | messages = await self.client.get_messages(entity, ids=batch_ids) 367 | 368 | valid_messages = [msg for msg in messages if msg and msg.media and not isinstance(msg.media, MessageMediaWebPage)] 369 | tasks = [asyncio.create_task(download_single_media(msg)) for msg in valid_messages] 370 | 371 | for j, task in enumerate(tasks): 372 | try: 373 | media_path = await task 374 | if media_path: 375 | await self.update_media_path(channel, valid_messages[j].id, media_path) 376 | successful_downloads += 1 377 | except Exception: 378 | pass 379 | 380 | completed_media += 1 381 | progress = (completed_media / len(message_ids)) * 100 382 | bar_length = 30 383 | filled_length = int(bar_length * completed_media // len(message_ids)) 384 | bar = '█' * filled_length + '░' * (bar_length - filled_length) 385 | 386 | sys.stdout.write(f"\r🔄 Rescrape: [{bar}] {progress:.1f}% ({completed_media}/{len(message_ids)})") 387 | sys.stdout.flush() 388 | 389 | print(f"\n✅ Media reprocessing complete! ({successful_downloads}/{len(message_ids)} successful)") 390 | 391 | except Exception as e: 392 | print(f"Error reprocessing media: {e}") 393 | 394 | async def fix_missing_media(self, channel: str): 395 | conn = self.get_db_connection(channel) 396 | cursor = conn.cursor() 397 | 398 | cursor.execute('SELECT COUNT(*) FROM messages WHERE media_type IS NOT NULL AND media_type != "MessageMediaWebPage"') 399 | total_with_media = cursor.fetchone()[0] 400 | 401 | cursor.execute('SELECT COUNT(*) FROM messages WHERE media_type IS NOT NULL AND media_type != "MessageMediaWebPage" AND media_path IS NOT NULL') 402 | total_with_files = cursor.fetchone()[0] 403 | 404 | missing_count = total_with_media - total_with_files 405 | 406 | channel_name = self.state.get('channel_names', {}).get(channel, 'Unknown') 407 | print(f"\n📊 Media Analysis for {channel_name} (ID: {channel}):") 408 | print(f"Messages with media: {total_with_media}") 409 | print(f"Media files downloaded: {total_with_files}") 410 | 
print(f"Missing media files: {missing_count}") 411 | 412 | if missing_count == 0: 413 | print("✅ All media files are already downloaded!") 414 | return 415 | 416 | cursor.execute('SELECT message_id, media_type FROM messages WHERE media_type IS NOT NULL AND media_type != "MessageMediaWebPage" AND (media_path IS NULL OR media_path = "")') 417 | missing_media = cursor.fetchall() 418 | 419 | if not missing_media: 420 | print("✅ No missing media found!") 421 | return 422 | 423 | print(f"\n🔧 Attempting to download {len(missing_media)} missing media files...") 424 | 425 | try: 426 | if channel.lstrip('-').isdigit(): 427 | entity = await self.client.get_entity(PeerChannel(int(channel))) 428 | else: 429 | entity = await self.client.get_entity(channel) 430 | semaphore = asyncio.Semaphore(self.max_concurrent_downloads) 431 | completed_media = 0 432 | successful_downloads = 0 433 | 434 | async def download_single_media(message): 435 | async with semaphore: 436 | return await self.download_media(channel, message) 437 | 438 | batch_size = 10 439 | for i in range(0, len(missing_media), batch_size): 440 | batch = missing_media[i:i + batch_size] 441 | message_ids = [msg[0] for msg in batch] 442 | 443 | messages = await self.client.get_messages(entity, ids=message_ids) 444 | valid_messages = [msg for msg in messages if msg and msg.media and not isinstance(msg.media, MessageMediaWebPage)] 445 | 446 | tasks = [asyncio.create_task(download_single_media(msg)) for msg in valid_messages] 447 | 448 | for j, task in enumerate(tasks): 449 | try: 450 | media_path = await task 451 | if media_path: 452 | await self.update_media_path(channel, valid_messages[j].id, media_path) 453 | successful_downloads += 1 454 | except Exception: 455 | pass 456 | 457 | completed_media += 1 458 | progress = (completed_media / len(missing_media)) * 100 459 | bar_length = 30 460 | filled_length = int(bar_length * completed_media // len(missing_media)) 461 | bar = '█' * filled_length + '░' * (bar_length - 
filled_length) 462 | 463 | sys.stdout.write(f"\r🔧 Fix Media: [{bar}] {progress:.1f}% ({completed_media}/{len(missing_media)})") 464 | sys.stdout.flush() 465 | 466 | print(f"\n✅ Media fix complete! ({successful_downloads}/{len(missing_media)} successful)") 467 | 468 | except Exception as e: 469 | print(f"Error fixing missing media: {e}") 470 | 471 | async def continuous_scraping(self): 472 | self.continuous_scraping_active = True 473 | 474 | try: 475 | while self.continuous_scraping_active: 476 | start_time = time.time() 477 | 478 | for channel in self.state['channels']: 479 | if not self.continuous_scraping_active: 480 | break 481 | print(f"\nChecking for new messages in channel: {channel}") 482 | await self.scrape_channel(channel, self.state['channels'][channel]) 483 | 484 | elapsed = time.time() - start_time 485 | sleep_time = max(0, 60 - elapsed) 486 | if sleep_time > 0: 487 | await asyncio.sleep(sleep_time) 488 | 489 | except asyncio.CancelledError: 490 | print("Continuous scraping stopped") 491 | finally: 492 | self.continuous_scraping_active = False 493 | 494 | def get_export_filename(self, channel: str): 495 | username = self.state.get('channel_names', {}).get(channel, 'no_username') 496 | return f"{channel}_{username}" 497 | 498 | def export_to_csv(self, channel: str): 499 | conn = self.get_db_connection(channel) 500 | filename = self.get_export_filename(channel) 501 | csv_file = Path(channel) / f'{filename}.csv' 502 | 503 | cursor = conn.cursor() 504 | cursor.execute('SELECT * FROM messages ORDER BY date') 505 | columns = [description[0] for description in cursor.description] 506 | 507 | with open(csv_file, 'w', newline='', encoding='utf-8') as f: 508 | writer = csv.writer(f) 509 | writer.writerow(columns) 510 | 511 | while True: 512 | rows = cursor.fetchmany(1000) 513 | if not rows: 514 | break 515 | writer.writerows(rows) 516 | 517 | def export_to_json(self, channel: str): 518 | conn = self.get_db_connection(channel) 519 | filename =
self.get_export_filename(channel) 520 | json_file = Path(channel) / f'{filename}.json' 521 | 522 | cursor = conn.cursor() 523 | cursor.execute('SELECT * FROM messages ORDER BY date') 524 | columns = [description[0] for description in cursor.description] 525 | 526 | with open(json_file, 'w', encoding='utf-8') as f: 527 | f.write('[\n') 528 | first_row = True 529 | 530 | while True: 531 | rows = cursor.fetchmany(1000) 532 | if not rows: 533 | break 534 | 535 | for row in rows: 536 | if not first_row: 537 | f.write(',\n') 538 | else: 539 | first_row = False 540 | 541 | data = dict(zip(columns, row)) 542 | json.dump(data, f, ensure_ascii=False, indent=2) 543 | 544 | f.write('\n]') 545 | 546 | async def export_data(self): 547 | if not self.state['channels']: 548 | print("No channels to export") 549 | return 550 | 551 | for channel in self.state['channels']: 552 | print(f"Exporting data for channel {channel}...") 553 | try: 554 | self.export_to_csv(channel) 555 | self.export_to_json(channel) 556 | print(f"✅ Completed export for channel {channel}") 557 | except Exception as e: 558 | print(f"❌ Export failed for channel {channel}: {e}") 559 | 560 | async def view_channels(self): 561 | if not self.state['channels']: 562 | print("No channels saved") 563 | return 564 | 565 | print("\nCurrent channels:") 566 | for i, (channel, last_id) in enumerate(self.state['channels'].items(), 1): 567 | try: 568 | conn = self.get_db_connection(channel) 569 | cursor = conn.cursor() 570 | cursor.execute('SELECT COUNT(*) FROM messages') 571 | count = cursor.fetchone()[0] 572 | channel_name = self.state.get('channel_names', {}).get(channel, 'Unknown') 573 | print(f"[{i}] {channel_name} (ID: {channel}), Last Message ID: {last_id}, Messages: {count}") 574 | except Exception: 575 | channel_name = self.state.get('channel_names', {}).get(channel, 'Unknown') 576 | print(f"[{i}] {channel_name} (ID: {channel}), Last Message ID: {last_id}") 577 | 578 | async def list_channels(self): 579 | try: 580 | print("\nList
of channels and groups joined by account:") 581 | count = 1 582 | channels_data = [] 583 | async for dialog in self.client.iter_dialogs(): 584 | entity = dialog.entity 585 | if dialog.id != 777000 and (isinstance(entity, Channel) or isinstance(entity, Chat)): 586 | channel_type = "Channel" if isinstance(entity, Channel) and entity.broadcast else "Group" 587 | username = getattr(entity, 'username', None) or 'no_username' 588 | print(f"[{count}] {dialog.title} (ID: {dialog.id}, Type: {channel_type}, Username: @{username})") 589 | channels_data.append({ 590 | 'number': count, 591 | 'channel_name': dialog.title, 592 | 'channel_id': str(dialog.id), 593 | 'username': username, 594 | 'type': channel_type 595 | }) 596 | count += 1 597 | 598 | if channels_data: 599 | csv_file = Path('channels_list.csv') 600 | with open(csv_file, 'w', newline='', encoding='utf-8') as f: 601 | writer = csv.DictWriter(f, fieldnames=['number', 'channel_name', 'channel_id', 'username', 'type']) 602 | writer.writeheader() 603 | writer.writerows(channels_data) 604 | print(f"\n✅ Saved channels list to {csv_file}") 605 | 606 | return channels_data 607 | 608 | except Exception as e: 609 | print(f"Error listing channels: {e}") 610 | return [] 611 | 612 | def display_qr_code_ascii(self, qr_login): 613 | qr = qrcode.QRCode(box_size=1, border=1) 614 | qr.add_data(qr_login.url) 615 | qr.make() 616 | 617 | f = StringIO() 618 | qr.print_ascii(out=f) 619 | f.seek(0) 620 | print(f.read()) 621 | 622 | async def qr_code_auth(self): 623 | print("\nChoosing QR Code authentication...") 624 | print("Please scan the QR code with your Telegram app:") 625 | print("1. Open Telegram on your phone") 626 | print("2. Go to Settings > Devices > Scan QR") 627 | print("3. 
Scan the code below\n") 628 | 629 | qr_login = await self.client.qr_login() 630 | self.display_qr_code_ascii(qr_login) 631 | 632 | try: 633 | await qr_login.wait() 634 | print("\n✅ Successfully logged in via QR code!") 635 | return True 636 | except SessionPasswordNeededError: 637 | password = input("Two-factor authentication enabled. Enter your password: ") 638 | await self.client.sign_in(password=password) 639 | print("\n✅ Successfully logged in with 2FA!") 640 | return True 641 | except Exception as e: 642 | print(f"\n❌ QR code authentication failed: {e}") 643 | return False 644 | 645 | async def phone_auth(self): 646 | phone = input("Enter your phone number: ") 647 | await self.client.send_code_request(phone) 648 | code = input("Enter the code you received: ") 649 | 650 | try: 651 | await self.client.sign_in(phone, code) 652 | print("\n✅ Successfully logged in via phone!") 653 | return True 654 | except SessionPasswordNeededError: 655 | password = input("Two-factor authentication enabled. Enter your password: ") 656 | await self.client.sign_in(password=password) 657 | print("\n✅ Successfully logged in with 2FA!") 658 | return True 659 | except Exception as e: 660 | print(f"\n❌ Phone authentication failed: {e}") 661 | return False 662 | 663 | async def initialize_client(self): 664 | if not all([self.state.get('api_id'), self.state.get('api_hash')]): 665 | print("\n=== API Configuration Required ===") 666 | print("You need to provide API credentials from https://my.telegram.org") 667 | try: 668 | self.state['api_id'] = int(input("Enter your API ID: ")) 669 | self.state['api_hash'] = input("Enter your API Hash: ") 670 | self.save_state() 671 | except ValueError: 672 | print("Invalid API ID. 
Must be a number.") 673 | return False 674 | 675 | self.client = TelegramClient('session', self.state['api_id'], self.state['api_hash']) 676 | 677 | try: 678 | await self.client.connect() 679 | except Exception as e: 680 | print(f"Failed to connect: {e}") 681 | return False 682 | 683 | if not await self.client.is_user_authorized(): 684 | print("\n=== Choose Authentication Method ===") 685 | print("[1] QR Code (Recommended - No phone number needed)") 686 | print("[2] Phone Number (Traditional method)") 687 | 688 | while True: 689 | choice = input("Enter your choice (1 or 2): ").strip() 690 | if choice in ['1', '2']: 691 | break 692 | print("Please enter 1 or 2") 693 | 694 | success = await self.qr_code_auth() if choice == '1' else await self.phone_auth() 695 | 696 | if not success: 697 | print("Authentication failed. Please try again.") 698 | await self.client.disconnect() 699 | return False 700 | else: 701 | print("✅ Already authenticated!") 702 | 703 | return True 704 | 705 | def parse_channel_selection(self, choice): 706 | channels_list = list(self.state['channels'].keys()) 707 | selected_channels = [] 708 | 709 | if choice.lower() == 'all': 710 | return channels_list 711 | 712 | for selection in [x.strip() for x in choice.split(',')]: 713 | try: 714 | if selection.startswith('-'): 715 | if selection in self.state['channels']: 716 | selected_channels.append(selection) 717 | else: 718 | print(f"Channel ID {selection} not found in your channels") 719 | else: 720 | num = int(selection) 721 | if 1 <= num <= len(channels_list): 722 | selected_channels.append(channels_list[num - 1]) 723 | else: 724 | print(f"Invalid channel number: {num}. Valid range: 1-{len(channels_list)}") 725 | except ValueError: 726 | print(f"Invalid input: {selection}. Use numbers (1,2,3) or full IDs (-100123...)") 727 | 728 | return selected_channels 729 | 730 | async def scrape_specific_channels(self): 731 | if not self.state['channels']: 732 | print("No channels available. 
Use [L] to add channels first") 733 | return 734 | 735 | await self.view_channels() 736 | print("\n📥 Scrape Options:") 737 | print("• Single: 1 or -1001234567890") 738 | print("• Multiple: 1,3,5 or mix formats") 739 | print("• All channels: all") 740 | 741 | choice = input("\nEnter selection: ").strip() 742 | selected_channels = self.parse_channel_selection(choice) 743 | 744 | if selected_channels: 745 | print(f"\n🚀 Starting scrape of {len(selected_channels)} channel(s)...") 746 | for i, channel in enumerate(selected_channels, 1): 747 | print(f"\n[{i}/{len(selected_channels)}] Scraping: {channel}") 748 | await self.scrape_channel(channel, self.state['channels'][channel]) 749 | print(f"\n✅ Completed scraping {len(selected_channels)} channel(s)!") 750 | else: 751 | print("❌ No valid channels selected") 752 | 753 | async def manage_channels(self): 754 | while True: 755 | print("\n" + "="*40) 756 | print(" TELEGRAM SCRAPER") 757 | print("="*40) 758 | print("[S] Scrape channels") 759 | print("[C] Continuous scraping") 760 | print(f"[M] Media scraping: {'ON' if self.state['scrape_media'] else 'OFF'}") 761 | print("[L] List & add channels") 762 | print("[R] Remove channels") 763 | print("[E] Export data") 764 | print("[T] Rescrape media") 765 | print("[F] Fix missing media") 766 | print("[Q] Quit") 767 | print("="*40) 768 | 769 | choice = input("Enter your choice: ").lower().strip() 770 | 771 | try: 772 | if choice == 'r': 773 | if not self.state['channels']: 774 | print("No channels to remove") 775 | continue 776 | 777 | await self.view_channels() 778 | print("\nTo remove channels:") 779 | print("• Single: 1 or -1001234567890") 780 | print("• Multiple: 1,2,3 or mix formats") 781 | selection = input("Enter selection: ").strip() 782 | selected_channels = self.parse_channel_selection(selection) 783 | 784 | if selected_channels: 785 | removed_count = 0 786 | for channel in selected_channels: 787 | if channel in self.state['channels']: 788 | del 
self.state['channels'][channel] 789 | print(f"✅ Removed channel {channel}") 790 | removed_count += 1 791 | else: 792 | print(f"❌ Channel {channel} not found") 793 | 794 | if removed_count > 0: 795 | self.save_state() 796 | print(f"\n🎉 Removed {removed_count} channel(s)!") 797 | await self.view_channels() 798 | else: 799 | print("No channels were removed") 800 | else: 801 | print("No valid channels selected") 802 | 803 | elif choice == 's': 804 | await self.scrape_specific_channels() 805 | 806 | elif choice == 'm': 807 | self.state['scrape_media'] = not self.state['scrape_media'] 808 | self.save_state() 809 | print(f"\n✅ Media scraping {'enabled' if self.state['scrape_media'] else 'disabled'}") 810 | 811 | elif choice == 'c': 812 | task = asyncio.create_task(self.continuous_scraping()) 813 | print("Continuous scraping started. Press Ctrl+C to stop.") 814 | try: 815 | await asyncio.sleep(float('inf')) 816 | except KeyboardInterrupt: 817 | self.continuous_scraping_active = False 818 | task.cancel() 819 | print("\nStopping continuous scraping...") 820 | try: 821 | await task 822 | except asyncio.CancelledError: 823 | pass 824 | 825 | elif choice == 'e': 826 | await self.export_data() 827 | 828 | elif choice == 'l': 829 | channels_data = await self.list_channels() 830 | 831 | if not channels_data: 832 | continue 833 | 834 | print("\nTo add channels from the list above:") 835 | print("• Single: 1 or -1001234567890") 836 | print("• Multiple: 1,3,5 or mix formats") 837 | print("• All channels: all") 838 | print("• Press Enter to skip adding") 839 | selection = input("\nEnter selection (or Enter to skip): ").strip() 840 | 841 | if selection: 842 | added_count = 0 843 | 844 | if selection.lower() == 'all': 845 | for channel_info in channels_data: 846 | channel_id = channel_info['channel_id'] 847 | if channel_id not in self.state['channels']: 848 | self.state['channels'][channel_id] = 0 849 | if 'channel_names' not in self.state: 850 | self.state['channel_names'] = {} 851 | 
self.state['channel_names'][channel_id] = channel_info['username'] 852 | print(f"✅ Added channel {channel_info['channel_name']} (ID: {channel_id})") 853 | added_count += 1 854 | else: 855 | print(f"Channel {channel_info['channel_name']} already added") 856 | else: 857 | for sel in [x.strip() for x in selection.split(',')]: 858 | try: 859 | if sel.startswith('-'): 860 | channel_id = sel 861 | channel_info = next((c for c in channels_data if c['channel_id'] == channel_id), None) 862 | if not channel_info: 863 | print(f"Channel ID {channel_id} not found") 864 | continue 865 | else: 866 | num = int(sel) 867 | if 1 <= num <= len(channels_data): 868 | channel_info = channels_data[num - 1] 869 | channel_id = channel_info['channel_id'] 870 | else: 871 | print(f"Invalid number: {num}. Choose 1-{len(channels_data)}") 872 | continue 873 | 874 | if channel_id in self.state['channels']: 875 | print(f"Channel {channel_info['channel_name']} already added") 876 | else: 877 | self.state['channels'][channel_id] = 0 878 | if 'channel_names' not in self.state: 879 | self.state['channel_names'] = {} 880 | self.state['channel_names'][channel_id] = channel_info['username'] 881 | print(f"✅ Added channel {channel_info['channel_name']} (ID: {channel_id})") 882 | added_count += 1 883 | 884 | except ValueError: 885 | print(f"Invalid input: {sel}") 886 | 887 | if added_count > 0: 888 | self.save_state() 889 | print(f"\n🎉 Added {added_count} new channel(s)!") 890 | await self.view_channels() 891 | else: 892 | print("No new channels were added") 893 | 894 | elif choice == 't': 895 | if not self.state['channels']: 896 | print("No channels available. Add channels first") 897 | continue 898 | 899 | await self.view_channels() 900 | print("\nEnter channel NUMBER (1,2,3...) 
or full channel ID (-100123...)") 901 | selection = input("Enter your selection: ").strip() 902 | selected_channels = self.parse_channel_selection(selection) 903 | 904 | if len(selected_channels) == 1: 905 | channel = selected_channels[0] 906 | print(f"Rescaping media for channel: {channel}") 907 | await self.rescrape_media(channel) 908 | elif len(selected_channels) > 1: 909 | print("Please select only one channel for media rescaping") 910 | else: 911 | print("No valid channel selected") 912 | 913 | elif choice == 'f': 914 | if not self.state['channels']: 915 | print("No channels available. Add channels first") 916 | continue 917 | 918 | await self.view_channels() 919 | print("\nEnter channel NUMBER (1,2,3...) or full channel ID (-100123...)") 920 | selection = input("Enter your selection: ").strip() 921 | selected_channels = self.parse_channel_selection(selection) 922 | 923 | if len(selected_channels) == 1: 924 | channel = selected_channels[0] 925 | await self.fix_missing_media(channel) 926 | elif len(selected_channels) > 1: 927 | print("Please select only one channel for fixing missing media") 928 | else: 929 | print("No valid channel selected") 930 | 931 | elif choice == 'q': 932 | print("\n👋 Goodbye!") 933 | self.close_db_connections() 934 | if self.client: 935 | await self.client.disconnect() 936 | sys.exit() 937 | 938 | else: 939 | print("Invalid option") 940 | 941 | except Exception as e: 942 | print(f"Error: {e}") 943 | 944 | async def run(self): 945 | display_ascii_art() 946 | if await self.initialize_client(): 947 | try: 948 | await self.manage_channels() 949 | finally: 950 | self.close_db_connections() 951 | if self.client: 952 | await self.client.disconnect() 953 | else: 954 | print("Failed to initialize client. 
Exiting.") 955 | 956 | async def main(): 957 | scraper = OptimizedTelegramScraper() 958 | await scraper.run() 959 | 960 | if __name__ == '__main__': 961 | try: 962 | asyncio.run(main()) 963 | except KeyboardInterrupt: 964 | print("\nProgram interrupted. Exiting...") 965 | sys.exit() 966 | --------------------------------------------------------------------------------