├── .gitignore ├── LICENSE ├── Makefile ├── README.md ├── channelscraper ├── __init__.py ├── app.py ├── bots.py ├── channel.py ├── client.py ├── config.py ├── config.yaml ├── driver.py ├── message.py ├── scrapeChannelMetadata.py └── utilities.py ├── input ├── .gitkeep ├── channel_info.csv └── channels.csv ├── requirements.txt └── scrape.sh /.gitignore: -------------------------------------------------------------------------------- 1 | /scripts/+4367763544147.session 2 | ./.idea 3 | venv 4 | __pycache__ 5 | /output/ 6 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 Peter Walchhofer, Valentin Peter 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | venv: 2 | python3 -m venv venv; \ 3 | . venv/bin/activate; \ 4 | pip install -r requirements.txt 5 | 6 | install: venv 7 | 8 | .PHONY: update 9 | update: 10 | . venv/bin/activate; \ 11 | pip install -r requirements.txt 12 | 13 | .PHONY: clean 14 | clean: 15 | rm -rf venv 16 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Telegram Scraper 2 | 3 | This Telegram scraper collects telegram messages, comments (comments.bot/comments.app) and media files. It was originally build for this [story](https://www.addendum.org/news/telegram-netzwerk-sellner/) on behalf of [Addendum](https://addendum.org). 4 | 5 | **Contributors:** [@PeterWalchhofer](https://github.com/PeterWalchhofer), [@vali101](https://github.com/vali101), [@fin](https://github.com/fin) 6 | ## Notes 7 | This scraper was written before Telgram introduced its native comment feature for broadcasting channels. Nowadays, comment.bot/comment.app extensions are rarely used anymore. Feel free to make a PR to support this feature. 8 | [@vali101](https://github.com/vali101) re-wrote the scraper for his bachelor thesis [here](https://github.com/vali101/telegraph) with a focus on snowball sampling. His code may help for extending the scraper or for finding interesting channels in the first place. Check it out! 9 | 10 | ## Setup 11 | 12 | ### Requirements: 13 | - Google Chrome (However, in theory you could als use Firefox when installing the necessary driver manually ) 14 | - Python 3 15 | - Telegram Account (Phone Number) 16 | - A lot of storage if downloading media 17 | - Time - arround 3000msg and comments/ per Minute. 18 | 19 | ### Getting Started 20 | 21 | 1. Install dependencies `make install` OR just install requirements.txt (using venv is recommended) 22 | 2. Create your own `channel.csv` as explained in the next section 23 | 3. Put the phone-number of the linked telegram account int the `config.yaml` 24 | 4. Get your API-key [here](https://my.telegram.org/auth?to=apps) an put them inside the `config.yaml`. 25 | 5. `sh scrape.sh` to start the scraper OR run `channelscraper/python app.py` 26 | 6. The outputs will be stored in the `/output` directory. 27 | 28 | 29 | ### Input Data 30 | You need to create your own `channels.csv`and put it in the `/input` folder. 31 | 32 | Only **Link** and **Broadcast** Relevant for scraping. The csv should have the form described below. There also is an example csv in the folder. 33 | 34 | Kategorie | Name | **Link** | @ | **Broadcast** 35 | --- | --- | --- | --- | --- 36 | Gruppe Typ XY | Example Channel | https://t.me/example_channel | example_channel | TRUE 37 | 38 | * Kategorie(optional): Metadata to annotate channel 39 | * Name(optional): Not identifier Name 40 | * Link: Link to channel 41 | * @ (optional): Indentifier Name 42 | * Broadcast: ´True´ if channel is Broadcasting Channel ´else´ false. Broadcasting channels are large one-to-many channels that only allow owners to write messages. 43 | 44 | ### Configuration 45 | 46 | The Scraper can be further configured via the `channelscraper/config.yaml`. 47 | 48 | ## Features 49 | ### Messages 50 | The Scraper extracts all messages from a channel. It is also possible to scrape only those messages that were written in the last x days. This can be set in the `config.yaml`. 51 | 52 | ### Comment Bots 53 | - In many broadcasting channels comment-bots are used in order to provide feedback from the audience. It is also possible to scrape those messages. Currently comments.app and comments.bot bots are supported. 54 | - Be careful. The date format is different from the telegram-api and has to be parsed manually (e.g. Dec 09) 55 | - As there is a "Load more comments" button it has to be clicked using javascript. Selenium is used to interact with the chrome driver that is installed automatically. 56 | #### Comments.app 57 | - Unique username extraction is working most of the time. However, if the user has deleted its account, this is not possible. 58 | #### Comments.bot 59 | - Unique usernames are not extracted, because we found no way to find out without querying the api at a high cost. 60 | - Only the display name is persited 61 | 62 | ## Further remarks 63 | - Messages and comments are persisted in the same csv-file. To tell them apart use the `isComment` column. Additionally, the ID includes a period in the format msgId.commId (e.g. 101.3) 64 | - We allocated ids to the comments manually. The telegram message ids are unique within a channel. 65 | 66 | ## Optional 67 | 68 | If you want to run selenium with docker use `selenium/standalone-chrome:3.141.59-yttrium` 69 | Also see: 70 | https://stackoverflow.com/questions/45323271/how-to-run-selenium-with-chrome-in-docker 71 | -------------------------------------------------------------------------------- /channelscraper/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PeterWalchhofer/Telescrape/733220a72f5cf40001803007ae8528b8ae6849a3/channelscraper/__init__.py -------------------------------------------------------------------------------- /channelscraper/app.py: -------------------------------------------------------------------------------- 1 | from datetime import datetime 2 | from datetime import timedelta 3 | import traceback 4 | import logging 5 | import os 6 | import yaml 7 | import csv 8 | import asyncio 9 | 10 | from utilities import getInputPath 11 | from channel import Channel 12 | from driver import Driver 13 | 14 | 15 | def getChannelList(filename): 16 | """Initialises the CSV of channels to scrape given by ADDENDUM.""" 17 | channelList = [] 18 | with open(getInputPath() + '/' + filename, newline='', encoding='utf-8') as csvfile: 19 | reader = csv.reader(csvfile) 20 | csvList = list(reader) 21 | for row in csvList[1:]: 22 | channelList.append(Channel(row[2], row[4])) 23 | return channelList 24 | 25 | 26 | config = yaml.safe_load(open("config.yaml")) 27 | input_file = config["input_channel_file"] 28 | channels = getChannelList(input_file) 29 | 30 | for channel in channels: 31 | channelPath = getInputPath() + "/" + channel.username 32 | logging.info("Collecting channel: " + channel.username) 33 | logging.info("-> Collecting messages") 34 | # Scrape Channels 35 | try: 36 | channel.scrape() 37 | except: 38 | traceback.print_exc() 39 | pass 40 | 41 | try: 42 | channel.getChannelUsers() 43 | except: 44 | traceback.print_exc() 45 | pass 46 | 47 | # WriteCsv 48 | try: 49 | channel.writeCsv() 50 | except: 51 | traceback.print_exc() 52 | pass 53 | 54 | Driver.closeDriver() 55 | -------------------------------------------------------------------------------- /channelscraper/bots.py: -------------------------------------------------------------------------------- 1 | import selenium 2 | import telethon.tl.types.messages 3 | from urllib.request import Request, urlopen 4 | from selenium.webdriver.support.ui import WebDriverWait 5 | from selenium.webdriver.common.by import By 6 | from selenium.webdriver.support import expected_conditions as EC 7 | from selenium.common.exceptions import ElementNotInteractableException 8 | from bs4 import BeautifulSoup 9 | import urllib.request 10 | import traceback 11 | import time 12 | import logging 13 | 14 | from client import Client 15 | from driver import Driver 16 | from utilities import concat, eliminateWhitespaces 17 | from message import Message 18 | 19 | 20 | 21 | class Bots: 22 | client = Client.getClient() 23 | driver = Driver.getDriver() 24 | 25 | @staticmethod 26 | def javaScriptLoadMoreHack(url, provider): 27 | """ All it does is to click on the 'load more comments' button in order to scrape all the messages""" 28 | if provider == "bot": 29 | index = 1 30 | selector = "sg_load" 31 | else: 32 | index = 0 33 | selector = "bc-load-more" 34 | # use firefox to get page with javascript generated content 35 | Bots.driver.get(url) 36 | resultDriver = Bots.loadMore(Bots.driver, index, selector) 37 | 38 | # store it to string variable 39 | if resultDriver is None: 40 | return None 41 | else: 42 | page_source = resultDriver.page_source 43 | return BeautifulSoup(page_source, "html.parser") 44 | 45 | @staticmethod 46 | def loadMore(driver, index, selector): 47 | """Recursive call of load more if there are lots of comments""" 48 | selection = driver.find_elements_by_css_selector("." + selector) 49 | try: 50 | if len(selection) > index: 51 | selection[index].click() 52 | else: 53 | return None 54 | except ElementNotInteractableException: 55 | if (index == 1): 56 | Bots.loadMore(driver, 0, selector) 57 | else: 58 | return None 59 | # wait for the page to load 60 | try: 61 | element = WebDriverWait(driver, 3).until( 62 | EC.invisibility_of_element_located((By.CLASS_NAME, selector)) 63 | ) 64 | time.sleep(1) 65 | return driver 66 | except selenium.common.exceptions.TimeoutException: 67 | return Bots.loadMore(driver, index, selector) 68 | 69 | @staticmethod 70 | async def scrapeCommentsApp(url, messageId, queryUser): 71 | """Scrapes https://comments.app for comment feature extension 72 | """ 73 | commentList = list() 74 | userList = list() 75 | # soup = BeautifulSoup(page, 'html.parser') 76 | soup = Bots.javaScriptLoadMoreHack(url, "app") 77 | if soup is None: 78 | try: 79 | page = urllib.request.urlopen(url) 80 | soup = BeautifulSoup(page, "html.parser") 81 | except Exception: 82 | logging.info("Exception on loading comments occurred") 83 | return 84 | 85 | comments = soup.find_all('div', class_='bc-comment-box') 86 | count = 1 # to create an id for comments, which the naturaly do not have. 87 | for comment in comments: 88 | Bots.__parseCommentFromApp(comment, commentList, userList, count, messageId, queryUser) 89 | count = count + 1 90 | 91 | return commentList, userList 92 | 93 | 94 | @staticmethod 95 | def scrapeCommentsBot(url, channel_users, messageId): 96 | """ Scrapes https://comments.bot/ website for comment feature extension.""" 97 | hdr = { 98 | 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11', 99 | 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 100 | 'Referer': 'https://cssspritegenerator.com', 101 | 'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3', 102 | 'Accept-Encoding': 'none', 103 | 'Accept-Language': 'en-US,en;q=0.8', 104 | 'Connection': 'keep-alive'} 105 | 106 | soup = Bots.javaScriptLoadMoreHack(url, "bot") 107 | if soup is None: 108 | try: 109 | req = Request(url, headers=hdr) 110 | page = urlopen(req) 111 | soup = BeautifulSoup(page, 'html.parser') 112 | except: 113 | logging.info("Could not open comments app.") 114 | 115 | commentList = [] 116 | if soup is not None: 117 | comments = soup.find_all('div', class_='comment-content') 118 | count = 1 119 | for comment in comments: 120 | # Save comment Message 121 | new_comment = Message() 122 | new_comment.isComment = True 123 | new_comment.parent = messageId 124 | # Save Id 125 | new_comment.id = str(messageId) + "." + str(count) 126 | # Save Text 127 | text_list = comment.find('div', class_='comment-text').contents 128 | text_list = [elem for elem in text_list if str(elem) != "
"] # Eliminate Html 129 | new_comment.text = eliminateWhitespaces(' '.join([str(elem) for elem in text_list])) 130 | 131 | # Save replyId 132 | reply_text_list = comment.find('span', class_='comment-reply-text') 133 | if reply_text_list is not None: 134 | reply_text = eliminateWhitespaces(' '.join([str(elem) for elem in reply_text_list])) 135 | try: 136 | new_comment.replyToMessageId = next( 137 | (new_comment for new_comment in commentList if new_comment.text == 138 | reply_text), None).id 139 | except: 140 | logging.info("Could not find quoted comment in Bot: " + url) 141 | count = count + 1 142 | # Find not identified User 143 | new_comment.sender_name = comment.find('div', class_='name-row').findChildren()[0].contents[0] 144 | commentUser = telethon.types.User(0) 145 | commentUser.first_name = new_comment.sender_name 146 | channel_users.append(commentUser) 147 | 148 | new_comment.timestamp = comment.find('div', class_='comment-date').findChildren()[0].contents[0] 149 | 150 | # No identified Users 151 | commentList.append(new_comment) 152 | 153 | return commentList 154 | 155 | @staticmethod 156 | def __parseCommentFromApp(comment, commentList, userList, count, messageId, queryUser): 157 | # Save comment Message 158 | new_comment = Message() 159 | new_comment.isComment = True 160 | new_comment.parent = messageId 161 | new_comment.id = str(messageId) + "." + str(count) 162 | 163 | # Get all text elements of a comment 164 | textList = None 165 | try: 166 | textList = comment.find_all('div', class_='bc-comment-text') 167 | except: 168 | logging.info("This thread has no comments") 169 | 170 | # Save comment text and reply id 171 | try: 172 | # Find comment text and quoted comment id. 173 | if len(textList) > 1: 174 | try: 175 | new_comment.text = textList[1].text 176 | new_comment.replyToMessageId = next( 177 | (new_comment for new_comment in commentList if new_comment.text == 178 | textList[0].text), None).id 179 | except AttributeError: 180 | logging.info( 181 | "Could not find quoted comment" + ", MessageId: " + str(new_comment.id)) 182 | # Find only comment text 183 | else: 184 | try: 185 | new_comment.text = textList[0].text 186 | except IndexError: 187 | logging.info("Could not find comment text" + ", MessageId: " + str(messageId)) 188 | 189 | except: 190 | traceback.print_exc() 191 | logging.info("An error occurred reading comment text" + ", MessageId: " + str(messageId)) 192 | 193 | # Find not identified User 194 | new_comment.sender_name = comment.find('span', class_='bc-comment-author-name').contents[0].contents[0] 195 | # Find identified User and save User ich channel.users and save message.sendername 196 | 197 | try: 198 | identifierName = comment.find('span', class_='bc-comment-author-name').contents[0]['href'].rsplit('/', 1)[ 199 | -1] 200 | except: 201 | identifierName = "" 202 | user = None 203 | 204 | # Query user if identifier name was found 205 | if not identifierName == "" and queryUser: 206 | try: 207 | user = Bots.getEntity(identifierName) 208 | userList.append(user) 209 | except ValueError: 210 | # User for some reason not found. 211 | pass 212 | 213 | new_comment.username = identifierName 214 | if user is not None: 215 | new_comment.sender_name = concat(user.first_name, user.last_name) 216 | new_comment.sender = user.id 217 | else: 218 | commentUser = telethon.types.User(0) 219 | commentUser.first_name = new_comment.sender_name 220 | 221 | new_comment.timestamp = comment.find('time')['datetime'] 222 | commentList.append(new_comment) 223 | 224 | @staticmethod 225 | async def getEntity(identifierName): 226 | return await Bots.client.get_entity(identifierName) 227 | -------------------------------------------------------------------------------- /channelscraper/channel.py: -------------------------------------------------------------------------------- 1 | import asyncio 2 | import csv 3 | import logging 4 | import os 5 | import time 6 | import traceback 7 | from datetime import datetime 8 | 9 | import telethon.tl.types.messages 10 | from bots import Bots 11 | from client import Client 12 | from config import Config 13 | from message import Message 14 | from utilities import concat, wait, calcDateOffset, getOutputPath, create_path_if_not_exists, extractUrls 15 | 16 | 17 | class Channel: 18 | config = Config.getConfig() 19 | client = Client.getClient() 20 | logging.basicConfig(format='[%(levelname) 5s/%(asctime)s] %(name)s: %(message)s', 21 | level=logging.INFO) 22 | 23 | def __init__(self, link, isBroadcastingChannel): 24 | self.id = None 25 | 26 | self.users = list() 27 | if isBroadcastingChannel == "True": 28 | self.isBroadcastingChannel = True 29 | else: 30 | self.isBroadcastingChannel = False 31 | self.username = link.rsplit('/', 1)[-1] 32 | 33 | self.path = getOutputPath() + "/" + self.username 34 | self.messages = list() 35 | 36 | def scrape(self): 37 | create_path_if_not_exists(self.path) 38 | 39 | if Channel.config.get("scrape_mode") == "OFFSET_SCRAPE": 40 | self.getRecentChannelMessages() 41 | elif Channel.config.get("scrape_mode") == "FULL_SCRAPE": 42 | self.getAllChannelMessages() 43 | else: 44 | raise AttributeError("Invalid scraping mode set in config file.") 45 | 46 | # Collects messages and comments from a given channel. 47 | def getRecentChannelMessages(self): 48 | """ Scrapes messages of a channel of the last X days and saves the information in the channel object. 49 | """ 50 | 51 | async def main(): 52 | async for message in Channel.client.iter_messages(self.username, offset_date=calcDateOffset( 53 | Channel.config.get("scrape_offset")), reverse=True): 54 | if type(message) == telethon.tl.types.Message: 55 | await self.parseMessage(message) 56 | 57 | with Channel.client: 58 | try: 59 | Channel.client.loop.run_until_complete(main()) 60 | except telethon.errors.ServerError: 61 | logging.info("Server error: Passed") 62 | pass 63 | except telethon.errors.FloodWaitError as e: 64 | logging.info("FloodWaitError: Sleep for " + str(e.seconds)) 65 | time.sleep(e.seconds) 66 | 67 | self.messages = list(reversed(self.messages)) 68 | 69 | # Collects messages and comments from a given channel. 70 | def getAllChannelMessages(self): 71 | """ Scrapes all messages of a channel and saves the information in the channel object. 72 | """ 73 | 74 | async def main(): 75 | async for message in Channel.client.iter_messages(self.username): 76 | if type(message) == telethon.tl.types.Message: 77 | await self.parseMessage(message) 78 | 79 | with Channel.client: 80 | try: 81 | Channel.client.loop.run_until_complete(main()) 82 | except telethon.errors.ServerError: 83 | logging.info("Server error: Passed") 84 | pass 85 | except telethon.errors.FloodWaitError as e: 86 | logging.info("FloodWaitError: Sleep for " + str(e.seconds)) 87 | time.sleep(e.seconds) 88 | 89 | async def parseMessage(self, message): 90 | # Wait to prevent getting blocked 91 | await wait() 92 | 93 | new_message = Message() 94 | new_message.id = message.id 95 | new_message.sender = message.sender_id 96 | try: 97 | first_name = message.sender.first_name 98 | last_name = message.sender.last_name 99 | except AttributeError: 100 | first_name = "" 101 | last_name = "" 102 | new_message.sender_name = concat(first_name, last_name) 103 | try: 104 | new_message.username = message.sender.username 105 | except AttributeError: 106 | pass 107 | new_message.replyToMessageId = message.reply_to_msg_id 108 | new_message.edit_date = message.edit_date 109 | new_message.entities = message.entities 110 | new_message.post_author = message.post_author 111 | new_message.timestamp = message.date 112 | new_message.text = message.text 113 | new_message.views = message.views 114 | new_message.media = type(message.media) 115 | self.member_count = message.chat.participants_count 116 | 117 | # Saves the channel from which the message was forwarded. 118 | try: 119 | await self.__parseForward(message, new_message) 120 | except AttributeError: 121 | pass 122 | 123 | if type(message.media) == telethon.types.MessageMediaPhoto and Channel.config.get("media_download"): 124 | mediapath = self.path + "/media/" + str(new_message.id) 125 | if not os.path.exists(mediapath + ".jpg"): 126 | try: 127 | await message.download_media(mediapath) 128 | except telethon.errors.FloodWaitError as e: 129 | logging.info("Waiting " + str(e.seconds) + " seconds: FloodWaitError") 130 | await asyncio.sleep(e.seconds) 131 | except telethon.errors.RpcCallFailError: 132 | pass 133 | await asyncio.sleep(1) 134 | 135 | # Checks which kind of comment bot is used by the provider of the group a uses the correct scraper. 136 | # --> then fills the comment list for each messages with the comments (prints "no comments" if no comment 137 | # bot is used. 138 | comments = list() 139 | if message.buttons is not None and message.forward is None: 140 | buttons = message.buttons 141 | 142 | for button in buttons: 143 | button_url = None 144 | try: 145 | button_url = button[0].button.url[:21] 146 | except AttributeError: 147 | pass 148 | 149 | if button_url == 'https://comments.bot/': 150 | logging.info("---> Found comments.bot...") 151 | new_message.hasComments = True 152 | new_message.bot_url = button[0].button.url 153 | try: 154 | comments.extend(Bots.scrapeCommentsBot(new_message.bot_url, self.users, message.id)) 155 | except Exception: 156 | traceback.print_exc() 157 | elif button[0].text[-8:] == 'comments': 158 | logging.info("---> Found comments.app...") 159 | new_message.hasComments = True 160 | new_message.bot_url = button[0].button.url 161 | try: 162 | commentsAppComments, commentsAppUsers = \ 163 | await Bots.scrapeCommentsApp(new_message.bot_url, message.id, 164 | Channel.config.get("query_users")) 165 | comments.extend(commentsAppComments) 166 | self.users.extend(commentsAppUsers) 167 | except Exception: 168 | traceback.print_exc() 169 | 170 | new_message.comments = comments 171 | self.messages.append(new_message) 172 | 173 | def writeCsv(self): 174 | if len(self.messages) == 0: 175 | raise LookupError("Nothing to write. You have to execute 'scrape' method first.") 176 | 177 | chatlogs_csv = self.path + "/chatlogs_" + str(datetime.now().strftime("%Y-%m-%d--%H-%M-%S")) + ".csv" 178 | users_csv = self.path + "/users" + str(datetime.now().strftime("%Y-%m-%d--%H-%M-%S")) + ".csv" 179 | 180 | # WRITE MESSAGES AND COMMENTS 181 | with open(chatlogs_csv, "w", encoding="utf-8", newline='') as chatFile: 182 | writer = csv.writer(chatFile) 183 | writer.writerow(Message.getMessageHeader()) 184 | for message in self.messages: 185 | message.urls = extractUrls(message) 186 | writer.writerow(message.getMessageRow(self.username, self.member_count, self.isBroadcastingChannel)) 187 | 188 | for comment in message.comments: 189 | comment.urls = extractUrls(comment) 190 | writer.writerow(comment.getMessageRow(self.username, self.member_count, self.isBroadcastingChannel)) 191 | 192 | with open(users_csv, "w", encoding="utf-8", newline='') as users_csv: 193 | writer = csv.writer(users_csv) 194 | writer.writerow(Message.getUserHeader()) 195 | for user in self.users: 196 | # Write in user table. 197 | writer.writerow( 198 | [self.username, user.id, user.first_name, user.last_name, concat(user.first_name, user.last_name), 199 | user.phone, user.bot, user.verified, user.username]) 200 | 201 | async def __parseForward(self, message, new_message): 202 | if message.forward is not None: 203 | if message.forward.chat is not None: 204 | new_message.forward = message.forward.chat.username 205 | new_message.forwardId = message.forward.chat.id 206 | if new_message.forwardId is None: 207 | new_message.forwardId = message.forward.channel_id 208 | elif message.forward.sender is not None: 209 | sender = message.forward.sender 210 | new_message.forward = concat(sender.first_name, sender.last_name) 211 | new_message.forwardId = sender.id 212 | else: 213 | new_message.forward = "Unknown" 214 | new_message.forwardId = "Unknown" 215 | 216 | if message.forward.original_fwd is not None: 217 | new_message.forward_msg_id = message.forward.original_fwd.channel_post 218 | new_message.forward_msg_date = message.forward.date 219 | 220 | def getChannelUsers(self): 221 | """ Scrape the users of a channel, if it is not a broadcasting channel.""" 222 | if self.isBroadcastingChannel: 223 | return 224 | 225 | async def main(): 226 | async for user in Channel.client.iter_participants(self.username, aggressive=True): 227 | if type(user) == telethon.types.User: 228 | if user not in self.users: 229 | self.users.append(user) 230 | 231 | with Channel.client: 232 | try: 233 | Channel.client.loop.run_until_complete(main()) 234 | except telethon.errors.FloodWaitError as e: 235 | print("FloodWaitError: Sleep for " + str(e.seconds)) 236 | time.sleep(e.seconds) 237 | -------------------------------------------------------------------------------- /channelscraper/client.py: -------------------------------------------------------------------------------- 1 | from telethon import TelegramClient, sync 2 | from config import Config 3 | 4 | 5 | class Client: 6 | config = Config.getConfig() 7 | client = None 8 | 9 | @staticmethod 10 | def getClient(): 11 | """ Static access method. """ 12 | if Client.client is None: 13 | Client() 14 | 15 | return Client.client 16 | 17 | def __init__(self): 18 | """ Virtually private constructor. """ 19 | if Client.client is not None: 20 | raise Exception("This class is a singleton!") 21 | else: 22 | # API CONNECTION # 23 | api_id = Client.config.get("api_id") 24 | api_hash = Client.config.get("api_hash") 25 | phone = Client.config.get('phone') 26 | Client.client = TelegramClient(phone, api_id, api_hash) 27 | Client.client.connect() 28 | 29 | # LOGIN 30 | if not Client.client.is_user_authorized(): 31 | Client.client.send_code_request(phone) 32 | Client.client.sign_in(phone, input('Enter the code: ')) 33 | -------------------------------------------------------------------------------- /channelscraper/config.py: -------------------------------------------------------------------------------- 1 | import yaml 2 | 3 | 4 | class Config: 5 | config = None 6 | 7 | @staticmethod 8 | def getConfig(): 9 | """ Static access method. """ 10 | if Config.config is None: 11 | Config() 12 | 13 | return Config.config 14 | 15 | def __init__(self): 16 | Config.config = yaml.safe_load(open("config.yaml")) 17 | -------------------------------------------------------------------------------- /channelscraper/config.yaml: -------------------------------------------------------------------------------- 1 | #### API CREDENTIALS ##### 2 | # Credentials for telegram API 3 | api_id: #Put Numeric API ID here 4 | api_hash: #Put API hash string here 5 | phone: #Put phone number string here 6 | 7 | ### INPUT CHANNEL FILES ### 8 | #input_channel_file: telegram_accounts.csv 9 | input_channel_file: channels.csv 10 | 11 | ### SCRAPING SETTINGS #### 12 | # More extensive logging if true. 13 | debug_mode: false 14 | 15 | # Specify scrape mode: FULL_SCRAPE vs. OFFSET_SCRAPE 16 | scrape_mode: FULL_SCRAPE 17 | 18 | # Only relevant on OFFSET_SCRAPE mode. Set the amount of days to go back in chat history. 19 | scrape_offset: 30 20 | 21 | # There is a different driver needed for aws. 22 | driver_mode: "normal" 23 | 24 | # If true, the script will run slower in order to avoid being blocked. 25 | init: False 26 | 27 | # If true, media files will be downloaded as well. On default those are photos. If you want something different edit source-code. 28 | media_download: False 29 | 30 | # Query comments.bot/app usernames. Attention: This can lead to a API-ban more likely. 31 | query_users: False 32 | -------------------------------------------------------------------------------- /channelscraper/driver.py: -------------------------------------------------------------------------------- 1 | import os 2 | import platform 3 | from config import Config 4 | from selenium.webdriver import Chrome, Remote 5 | from selenium.webdriver.common.desired_capabilities import DesiredCapabilities 6 | from selenium import webdriver 7 | import chromedriver_autoinstaller 8 | 9 | 10 | class Driver: 11 | driver = None 12 | config = Config.getConfig() 13 | 14 | @staticmethod 15 | def getDriver(): 16 | """ Static access method. """ 17 | if Driver.driver is None: 18 | Driver() 19 | return Driver.driver 20 | 21 | @staticmethod 22 | def closeDriver(): 23 | Driver.driver.quit() 24 | 25 | def __init__(self): 26 | """ Virtually private constructor. """ 27 | if Driver.driver is not None: 28 | raise Exception("This class is a singleton!") 29 | else: 30 | if Driver.config.get("driver_mode") == "live": 31 | Driver.driver = Remote("http://127.0.0.1:4444/wd/hub", DesiredCapabilities.CHROME) 32 | else: 33 | chromedriver_autoinstaller.install() 34 | Driver.driver = webdriver.Chrome() 35 | -------------------------------------------------------------------------------- /channelscraper/message.py: -------------------------------------------------------------------------------- 1 | from utilities import getClassName 2 | 3 | 4 | class Message: 5 | def __init__(self): 6 | self.text = "" 7 | self.id = None 8 | self.timestamp = None 9 | self.replyToMessageId = "" 10 | self.views = None 11 | self.sender = None 12 | self.replyToMessageId = "" 13 | self.edit_date = None 14 | self.entities = list() 15 | self.forward = None 16 | self.forwardId = None 17 | self.forward_msg_id = None 18 | self.forward_msg_date = None 19 | self.post_author = None 20 | self.comments = [] 21 | self.sender_name = None 22 | self.username = None 23 | self.urls = None 24 | self.media = None 25 | self.isComment = False 26 | self.hasComments = False 27 | self.parent = None 28 | self.bot_url = None 29 | self.isDeleted = "" 30 | self.bot_url = None 31 | 32 | @staticmethod 33 | def getMessageHeader(): 34 | return ["channel", "member_count", "broadcast", "id", "timestamp", "content", "user_id", "first_and_last_name", 35 | "username", 36 | "views", "edit-date", 37 | "replyToId", "forward", "forward_id", "forward_msg_id", "forward_date", "URLs", "media", "hasComments", 38 | "isComment", 39 | "bot_url", 40 | "parent", "isDeleted"] 41 | 42 | @staticmethod 43 | def getUserHeader(): 44 | return ["channel", "Id", "first_name", "last_name", "first_and_last_name", "phone", "bot", 45 | "verified", 46 | "username"] 47 | 48 | def getMessageRow(self, channel_username, channel_member_count, channel_broadcast): 49 | row = [channel_username, channel_member_count, channel_broadcast, self.id, self.timestamp, self.text, 50 | self.sender, 51 | self.sender_name, 52 | self.username, self.views, 53 | self.edit_date, self.replyToMessageId, self.forward, self.forwardId, self.forward_msg_id, 54 | self.forward_msg_date, 55 | ", ".join(self.urls), 56 | getClassName(self.media), self.hasComments, self.isComment, self.bot_url, 57 | self.parent, self.isDeleted] 58 | return row 59 | -------------------------------------------------------------------------------- /channelscraper/scrapeChannelMetadata.py: -------------------------------------------------------------------------------- 1 | import time 2 | 3 | import telethon 4 | import os 5 | import yaml 6 | import csv 7 | from client import Client 8 | import traceback 9 | from channel import Channel 10 | from telethon.tl.functions.channels import GetFullChannelRequest 11 | from utilities import getInputPath, getOutputPath 12 | from driver import Driver 13 | 14 | 15 | def getChannelList(filename): 16 | """Initialises the CSV of channels to scrape given by ADDENDUM.""" 17 | channelList = [] 18 | with open(getInputPath() + '/' + filename, newline='', 19 | encoding='utf-8') as csvfile: 20 | reader = csv.reader(csvfile) 21 | csvList = list(reader) 22 | for row in csvList[1:]: 23 | channelList.append(row[2].split("/")[-1]) 24 | return channelList 25 | 26 | client = Client.getClient() 27 | config = yaml.safe_load(open("config.yaml")) 28 | input_file = config["input_channel_file"] 29 | channels = getChannelList(input_file) 30 | 31 | channel_info_list = list() 32 | i = 1 33 | for channel in channels: 34 | i = i + 1 35 | print("Channel: " + channel + " Nr: " + str(i)) 36 | time.sleep(3) 37 | try: 38 | channel_entity = client.get_entity(channel) 39 | channel_full_info = client(GetFullChannelRequest(channel=channel_entity)) 40 | 41 | if channel_full_info.full_chat.location is not None: 42 | geo_point = channel_full_info.full_chat.location.geo_point 43 | address = channel_full_info.full_chat.location.address 44 | else: 45 | geo_point = "" 46 | address = "" 47 | channel_info_list.append( 48 | [channel_entity.username, channel_entity.id, channel_full_info.full_chat.about, channel_entity.broadcast, 49 | channel_entity.date, channel_full_info.full_chat.participants_count, geo_point 50 | , address]) 51 | except: 52 | traceback.print_exc() 53 | channelExists = False 54 | print("Channel '" + channel + "' does not exist.") 55 | 56 | with open(getInputPath() + '/' + '/channel_info.csv', mode="w", 57 | newline='', 58 | encoding='utf-8') as csvfile: 59 | writer = csv.writer(csvfile) 60 | writer.writerow(["channel", "id", "about", "broadcast", "created_at", "members", "location", "address"]) 61 | for channel in channel_info_list: 62 | writer.writerow(channel) 63 | 64 | Driver.closeDriver() 65 | -------------------------------------------------------------------------------- /channelscraper/utilities.py: -------------------------------------------------------------------------------- 1 | import re 2 | from random import random 3 | from datetime import datetime, timedelta 4 | import asyncio 5 | import os 6 | import telethon 7 | import logging 8 | 9 | 10 | def getInputPath(): 11 | path = os.path.dirname(os.path.dirname(__file__)) + "/input" 12 | create_path_if_not_exists(path) 13 | return path 14 | 15 | 16 | def getOutputPath(): 17 | path = os.path.dirname(os.path.dirname(__file__)) + "/output" 18 | create_path_if_not_exists(path) 19 | return path 20 | 21 | 22 | def concat(first_name, last_name): 23 | if first_name is None: 24 | first_name = "" 25 | if last_name is None: 26 | last_name = "" 27 | return first_name + " " + last_name 28 | 29 | 30 | def eliminateWhitespaces(text): 31 | return re.sub(r"\s\s+", " ", text) 32 | 33 | 34 | def calcDateOffset(offset): 35 | if offset <= 0: 36 | raise AttributeError("Offset must be greater 0") 37 | return datetime.now() - timedelta(days=offset) 38 | 39 | 40 | async def wait(): 41 | """ Random sleep time to avoid getting blocked by telegram.""" 42 | randomNum = random() / 10 43 | if randomNum < 0.0001: 44 | await asyncio.sleep(3) 45 | await asyncio.sleep(randomNum) 46 | 47 | 48 | def create_path_if_not_exists(channelPath): 49 | if not os.path.exists(channelPath): 50 | os.makedirs(channelPath) 51 | 52 | 53 | def extractUrls(message): 54 | """ Url extraction by RegEx and telegram url entity to get maximum results""" 55 | urls_regex = re.findall(r"(?Phttps?://[^\s]+)", message.text) 56 | # if found through regex add it, if not already found (because it has bugs) 57 | for i in range(len(urls_regex)): 58 | url_reg = urls_regex[i] 59 | match = re.search(r".jpg|.htm|.html|.mp4", url_reg) 60 | if match is not None: 61 | urls_regex[i] = url_reg[0:match.end()] 62 | 63 | entities = message.entities 64 | if entities is None: 65 | return urls_regex 66 | urls = [] 67 | 68 | for entity in entities: 69 | if type(entity) == telethon.tl.types.MessageEntityUrl: 70 | try: 71 | enc_text = message.text.encode('utf-16-le') 72 | url = enc_text[entity.offset * 2:(entity.offset + entity.length) * 2].decode('utf-16-le') 73 | if (re.match("http|www", url)): 74 | urls.append(url) 75 | except UnicodeDecodeError: 76 | logging.info("Unicode Error for text") 77 | pass 78 | 79 | for url in urls: 80 | for url_reg in urls_regex: 81 | if url in url_reg: 82 | urls_regex.remove(url_reg) 83 | 84 | urls.extend(urls_regex) 85 | list(dict.fromkeys(urls)) # remove duplicates 86 | return urls 87 | 88 | 89 | def getClassName(object): 90 | if object is not None: 91 | return object.__name__ 92 | else: 93 | return "" 94 | -------------------------------------------------------------------------------- /input/.gitkeep: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/PeterWalchhofer/Telescrape/733220a72f5cf40001803007ae8528b8ae6849a3/input/.gitkeep -------------------------------------------------------------------------------- /input/channel_info.csv: -------------------------------------------------------------------------------- 1 | channel,id,about,broadcast,created_at,members,location,address 2 | channelolero,1246406916,,True,2019-12-16 09:58:02+00:00,2,, 3 | -------------------------------------------------------------------------------- /input/channels.csv: -------------------------------------------------------------------------------- 1 | "Kategorie","Name","Link","@","Broadcast","asdasd" 2 | "Gruppe Typ XY","Example Channel","https://t.me/channelolero","example_channel","TRUE", 3 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | selenium 2 | telethon 3 | beautifulsoup4 4 | pyyaml 5 | chromedriver-autoinstaller -------------------------------------------------------------------------------- /scrape.sh: -------------------------------------------------------------------------------- 1 | #!/bin/sh 2 | 3 | . venv/bin/activate 4 | cd ./channelscraper 5 | 6 | if [ $# -eq 0 ] 7 | then 8 | python3 app.py 9 | else 10 | if [ $1 == "meta" ]; then 11 | python3 scrapeChannelMetadata.py 12 | fi 13 | fi 14 | 15 | deactivate --------------------------------------------------------------------------------