├── .gitignore
├── LICENSE
├── Makefile
├── README.md
├── channelscraper
│   ├── __init__.py
│   ├── app.py
│   ├── bots.py
│   ├── channel.py
│   ├── client.py
│   ├── config.py
│   ├── config.yaml
│   ├── driver.py
│   ├── message.py
│   ├── scrapeChannelMetadata.py
│   └── utilities.py
├── input
│   ├── .gitkeep
│   ├── channel_info.csv
│   └── channels.csv
├── requirements.txt
└── scrape.sh
/.gitignore:
--------------------------------------------------------------------------------
1 | /scripts/+4367763544147.session
2 | .idea/
3 | venv
4 | __pycache__
5 | /output/
6 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2020 Peter Walchhofer, Valentin Peter
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
1 | venv:
2 | python3 -m venv venv; \
3 | . venv/bin/activate; \
4 | pip install -r requirements.txt
5 |
6 | install: venv
7 |
8 | .PHONY: update
9 | update:
10 | . venv/bin/activate; \
11 | pip install -r requirements.txt
12 |
13 | .PHONY: clean
14 | clean:
15 | rm -rf venv
16 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Telegram Scraper
2 |
3 | This Telegram scraper collects Telegram messages, comments (comments.bot/comments.app) and media files. It was originally built for this [story](https://www.addendum.org/news/telegram-netzwerk-sellner/) on behalf of [Addendum](https://addendum.org).
4 |
5 | **Contributors:** [@PeterWalchhofer](https://github.com/PeterWalchhofer), [@vali101](https://github.com/vali101), [@fin](https://github.com/fin)
6 | ## Notes
7 | This scraper was written before Telegram introduced its native comment feature for broadcasting channels. Nowadays, the comments.bot/comments.app extensions are rarely used anymore. Feel free to make a PR to support this feature.
8 | [@vali101](https://github.com/vali101) rewrote the scraper for his bachelor thesis [here](https://github.com/vali101/telegraph) with a focus on snowball sampling. His code may help with extending this scraper or with finding interesting channels in the first place. Check it out!
9 |
10 | ## Setup
11 |
12 | ### Requirements:
13 | - Google Chrome (in theory you could also use Firefox by installing the necessary driver manually)
14 | - Python 3
15 | - Telegram Account (Phone Number)
16 | - A lot of storage if downloading media
17 | - Time: expect around 3,000 messages and comments per minute
18 |
19 | ### Getting Started
20 |
21 | 1. Install dependencies with `make install` OR just install `requirements.txt` manually (using a venv is recommended)
22 | 2. Create your own `channels.csv` as explained in the next section
23 | 3. Put the phone number of the linked Telegram account into `config.yaml`
24 | 4. Get your API key [here](https://my.telegram.org/auth?to=apps) and put it inside `config.yaml`.
25 | 5. Run `sh scrape.sh` to start the scraper OR run `python app.py` from inside the `channelscraper` directory
26 | 6. The outputs will be stored in the `/output` directory.
27 |
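Putting these steps together, a typical first run might look like this (`$EDITOR` stands for your editor of choice):

```sh
make install                        # create the venv and install requirements.txt
$EDITOR input/channels.csv          # list the channels to scrape
$EDITOR channelscraper/config.yaml  # fill in api_id, api_hash and phone
sh scrape.sh                        # scrape messages (plus comments/media, per config)
sh scrape.sh meta                   # optional: scrape channel metadata instead
```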
28 |
29 | ### Input Data
30 | You need to create your own `channels.csv` and put it in the `/input` folder.
31 |
32 | Only **Link** and **Broadcast** are relevant for scraping. The CSV should have the form described below. There is also an example CSV in the folder.
33 |
34 | Kategorie | Name | **Link** | @ | **Broadcast**
35 | --- | --- | --- | --- | ---
36 | Gruppe Typ XY | Example Channel | https://t.me/example_channel | example_channel | TRUE
37 |
38 | * Kategorie (optional): metadata to annotate the channel
39 | * Name (optional): display name, not an identifier
40 | * Link: link to the channel
41 | * @ (optional): identifier name
42 | * Broadcast: `TRUE` if the channel is a broadcasting channel, else `FALSE`. Broadcasting channels are large one-to-many channels that only allow owners to write messages.
43 |
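In raw CSV form (compare the example `input/channels.csv`), the row above looks like this. Note that `app.py` reads **Link** and **Broadcast** by column position (third and fifth), so keep the column order:

```csv
Kategorie,Name,Link,@,Broadcast
Gruppe Typ XY,Example Channel,https://t.me/example_channel,example_channel,TRUE
```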
44 | ### Configuration
45 |
46 | The Scraper can be further configured via the `channelscraper/config.yaml`.
47 |
48 | ## Features
49 | ### Messages
50 | The Scraper extracts all messages from a channel. It is also possible to scrape only those messages that were written in the last X days; this can be set in the `config.yaml`.
51 |
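For example, to scrape only the last 30 days, set the corresponding keys in `channelscraper/config.yaml`:

```yaml
scrape_mode: OFFSET_SCRAPE
scrape_offset: 30  # days to go back in the chat history
```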
52 | ### Comment Bots
53 | - In many broadcasting channels, comment bots are used to collect feedback from the audience. It is also possible to scrape those messages. Currently, comments.app and comments.bot are supported.
54 | - Be careful: the date format differs from the Telegram API's and has to be parsed manually (e.g. `Dec 09`; see the sketch below).
55 | - As there is a "Load more comments" button, it has to be clicked using JavaScript. Selenium is used to interact with the Chrome driver, which is installed automatically.
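A minimal sketch of such manual date parsing (`parse_bot_date` is an illustrative helper, not part of the scraper; since the year is missing from the string, the current year is assumed):

```python
from datetime import datetime

def parse_bot_date(raw: str) -> datetime:
    # "Dec 09" -> month and day; the year is not part of the string.
    parsed = datetime.strptime(raw.strip(), "%b %d")
    return parsed.replace(year=datetime.now().year)

print(parse_bot_date("Dec 09"))  # e.g. 2024-12-09 00:00:00, depending on the run date
```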
56 | #### Comments.app
57 | - Unique username extraction works most of the time. However, if a user has deleted their account, this is not possible.
58 | #### Comments.bot
59 | - Unique usernames are not extracted, because we found no way to obtain them without querying the API at a high cost.
60 | - Only the display name is persisted.
61 |
62 | ## Further remarks
63 | - Messages and comments are persisted in the same CSV file. To tell them apart, use the `isComment` column. Additionally, a comment's ID includes a period, in the format msgId.commId (e.g. 101.3).
64 | - We allocated IDs to the comments manually; the Telegram message IDs are unique within a channel.
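For downstream analysis, the schema can be consumed like this (a sketch; the file name is illustrative, as real chatlog files carry a timestamp suffix):

```python
import csv

with open("output/example_channel/chatlogs_2020-01-01--12-00-00.csv",
          encoding="utf-8", newline="") as f:
    rows = list(csv.DictReader(f))

messages = [r for r in rows if r["isComment"] == "False"]
comments = [r for r in rows if r["isComment"] == "True"]
# A comment id such as "101.3" means comment 3 under message 101;
# the "parent" column holds the parent message id.
```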
65 |
66 | ## Optional
67 |
68 | If you want to run Selenium with Docker, use `selenium/standalone-chrome:3.141.59-yttrium`.
69 | Also see:
70 | https://stackoverflow.com/questions/45323271/how-to-run-selenium-with-chrome-in-docker
71 |
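A minimal sketch (the port matches the `http://127.0.0.1:4444/wd/hub` address used in `driver.py`; `--shm-size` is a common workaround for Chrome crashing in containers):

```sh
# Start the pinned standalone Chrome image with Selenium listening on port 4444
docker run -d -p 4444:4444 --shm-size=2g selenium/standalone-chrome:3.141.59-yttrium
```

Then set `driver_mode: "live"` in `channelscraper/config.yaml` so the scraper connects to the remote Selenium instance instead of starting a local Chrome.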
--------------------------------------------------------------------------------
/channelscraper/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PeterWalchhofer/Telescrape/733220a72f5cf40001803007ae8528b8ae6849a3/channelscraper/__init__.py
--------------------------------------------------------------------------------
/channelscraper/app.py:
--------------------------------------------------------------------------------
1 | from datetime import datetime
2 | from datetime import timedelta
3 | import traceback
4 | import logging
5 | import os
6 | import yaml
7 | import csv
8 | import asyncio
9 |
10 | from utilities import getInputPath
11 | from channel import Channel
12 | from driver import Driver
13 |
14 |
15 | def getChannelList(filename):
16 | """Initialises the CSV of channels to scrape given by ADDENDUM."""
17 | channelList = []
18 | with open(getInputPath() + '/' + filename, newline='', encoding='utf-8') as csvfile:
19 | reader = csv.reader(csvfile)
20 | csvList = list(reader)
21 | for row in csvList[1:]:
22 | channelList.append(Channel(row[2], row[4]))
23 | return channelList
24 |
25 |
26 | config = yaml.safe_load(open("config.yaml"))
27 | input_file = config["input_channel_file"]
28 | channels = getChannelList(input_file)
29 |
30 | for channel in channels:
31 | channelPath = getInputPath() + "/" + channel.username
32 | logging.info("Collecting channel: " + channel.username)
33 | logging.info("-> Collecting messages")
34 | # Scrape Channels
35 | try:
36 | channel.scrape()
37 | except Exception:
38 | traceback.print_exc()
39 | pass
40 |
41 | try:
42 | channel.getChannelUsers()
43 | except Exception:
44 | traceback.print_exc()
45 | pass
46 |
47 | # WriteCsv
48 | try:
49 | channel.writeCsv()
50 | except Exception:
51 | traceback.print_exc()
52 | pass
53 |
54 | Driver.closeDriver()
55 |
--------------------------------------------------------------------------------
/channelscraper/bots.py:
--------------------------------------------------------------------------------
1 | import selenium
2 | import telethon.tl.types.messages
3 | from urllib.request import Request, urlopen
4 | from selenium.webdriver.support.ui import WebDriverWait
5 | from selenium.webdriver.common.by import By
6 | from selenium.webdriver.support import expected_conditions as EC
7 | from selenium.common.exceptions import ElementNotInteractableException
8 | from bs4 import BeautifulSoup
9 | import urllib.request
10 | import traceback
11 | import time
12 | import logging
13 |
14 | from client import Client
15 | from driver import Driver
16 | from utilities import concat, eliminateWhitespaces
17 | from message import Message
18 |
19 |
20 |
21 | class Bots:
22 | client = Client.getClient()
23 | driver = Driver.getDriver()
24 |
25 | @staticmethod
26 | def javaScriptLoadMoreHack(url, provider):
27 | """ All it does is to click on the 'load more comments' button in order to scrape all the messages"""
28 | if provider == "bot":
29 | index = 1
30 | selector = "sg_load"
31 | else:
32 | index = 0
33 | selector = "bc-load-more"
34 | # use the Selenium Chrome driver to render the page, including JavaScript-generated content
35 | Bots.driver.get(url)
36 | resultDriver = Bots.loadMore(Bots.driver, index, selector)
37 |
38 | # parse the rendered page source
39 | if resultDriver is None:
40 | return None
41 | else:
42 | page_source = resultDriver.page_source
43 | return BeautifulSoup(page_source, "html.parser")
44 |
45 | @staticmethod
46 | def loadMore(driver, index, selector):
47 | """Recursive call of load more if there are lots of comments"""
48 | selection = driver.find_elements_by_css_selector("." + selector)
49 | try:
50 | if len(selection) > index:
51 | selection[index].click()
52 | else:
53 | return None
54 | except ElementNotInteractableException:
55 | if index == 1:
56 | return Bots.loadMore(driver, 0, selector)
57 | else:
58 | return None
59 | # wait for the page to load
60 | try:
61 | element = WebDriverWait(driver, 3).until(
62 | EC.invisibility_of_element_located((By.CLASS_NAME, selector))
63 | )
64 | time.sleep(1)
65 | return driver
66 | except selenium.common.exceptions.TimeoutException:
67 | return Bots.loadMore(driver, index, selector)
68 |
69 | @staticmethod
70 | async def scrapeCommentsApp(url, messageId, queryUser):
71 | """Scrapes https://comments.app for comment feature extension
72 | """
73 | commentList = list()
74 | userList = list()
75 | # soup = BeautifulSoup(page, 'html.parser')
76 | soup = Bots.javaScriptLoadMoreHack(url, "app")
77 | if soup is None:
78 | try:
79 | page = urllib.request.urlopen(url)
80 | soup = BeautifulSoup(page, "html.parser")
81 | except Exception:
82 | logging.info("Exception on loading comments occurred")
83 | return
84 |
85 | comments = soup.find_all('div', class_='bc-comment-box')
86 | count = 1  # to create an id for comments, which they naturally do not have.
87 | for comment in comments:
88 | await Bots.__parseCommentFromApp(comment, commentList, userList, count, messageId, queryUser)
89 | count = count + 1
90 |
91 | return commentList, userList
92 |
93 |
94 | @staticmethod
95 | def scrapeCommentsBot(url, channel_users, messageId):
96 | """ Scrapes https://comments.bot/ website for comment feature extension."""
97 | hdr = {
98 | 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
99 | 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
100 | 'Referer': 'https://cssspritegenerator.com',
101 | 'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
102 | 'Accept-Encoding': 'none',
103 | 'Accept-Language': 'en-US,en;q=0.8',
104 | 'Connection': 'keep-alive'}
105 |
106 | soup = Bots.javaScriptLoadMoreHack(url, "bot")
107 | if soup is None:
108 | try:
109 | req = Request(url, headers=hdr)
110 | page = urlopen(req)
111 | soup = BeautifulSoup(page, 'html.parser')
112 | except Exception:
113 | logging.info("Could not open comments.bot page.")
114 |
115 | commentList = []
116 | if soup is not None:
117 | comments = soup.find_all('div', class_='comment-content')
118 | count = 1
119 | for comment in comments:
120 | # Save comment Message
121 | new_comment = Message()
122 | new_comment.isComment = True
123 | new_comment.parent = messageId
124 | # Save Id
125 | new_comment.id = str(messageId) + "." + str(count)
126 | # Save Text
127 | text_list = comment.find('div', class_='comment-text').contents
128 | text_list = [elem for elem in text_list if str(elem) != "\n"]  # Eliminate HTML line breaks
129 | new_comment.text = eliminateWhitespaces(' '.join([str(elem) for elem in text_list]))
130 |
131 | # Save replyId
132 | reply_text_list = comment.find('span', class_='comment-reply-text')
133 | if reply_text_list is not None:
134 | reply_text = eliminateWhitespaces(' '.join([str(elem) for elem in reply_text_list]))
135 | try:
136 | new_comment.replyToMessageId = next(
137 | (c for c in commentList if c.text ==
138 | reply_text), None).id
139 | except AttributeError:
140 | logging.info("Could not find quoted comment in Bot: " + url)
141 | count = count + 1
142 | # Find not identified User
143 | new_comment.sender_name = comment.find('div', class_='name-row').findChildren()[0].contents[0]
144 | commentUser = telethon.types.User(0)
145 | commentUser.first_name = new_comment.sender_name
146 | channel_users.append(commentUser)
147 |
148 | new_comment.timestamp = comment.find('div', class_='comment-date').findChildren()[0].contents[0]
149 |
150 | # No identified Users
151 | commentList.append(new_comment)
152 |
153 | return commentList
154 |
155 | @staticmethod
156 | async def __parseCommentFromApp(comment, commentList, userList, count, messageId, queryUser):
157 | # Save comment Message
158 | new_comment = Message()
159 | new_comment.isComment = True
160 | new_comment.parent = messageId
161 | new_comment.id = str(messageId) + "." + str(count)
162 |
163 | # Get all text elements of a comment
164 | textList = None
165 | try:
166 | textList = comment.find_all('div', class_='bc-comment-text')
167 | except:
168 | logging.info("This thread has no comments")
169 |
170 | # Save comment text and reply id
171 | try:
172 | # Find comment text and quoted comment id.
173 | if len(textList) > 1:
174 | try:
175 | new_comment.text = textList[1].text
176 | new_comment.replyToMessageId = next(
177 | (c for c in commentList if c.text ==
178 | textList[0].text), None).id
179 | except AttributeError:
180 | logging.info(
181 | "Could not find quoted comment" + ", MessageId: " + str(new_comment.id))
182 | # Find only comment text
183 | else:
184 | try:
185 | new_comment.text = textList[0].text
186 | except IndexError:
187 | logging.info("Could not find comment text" + ", MessageId: " + str(messageId))
188 |
189 | except Exception:
190 | traceback.print_exc()
191 | logging.info("An error occurred reading comment text" + ", MessageId: " + str(messageId))
192 |
193 | # Find not identified User
194 | new_comment.sender_name = comment.find('span', class_='bc-comment-author-name').contents[0].contents[0]
195 | # Find identified user, save the user in channel.users and set message.sender_name
196 |
197 | try:
198 | identifierName = comment.find('span', class_='bc-comment-author-name').contents[0]['href'].rsplit('/', 1)[
199 | -1]
200 | except Exception:
201 | identifierName = ""
202 | user = None
203 |
204 | # Query user if identifier name was found
205 | if not identifierName == "" and queryUser:
206 | try:
207 | user = await Bots.getEntity(identifierName)
208 | userList.append(user)
209 | except ValueError:
210 | # User for some reason not found.
211 | pass
212 |
213 | new_comment.username = identifierName
214 | if user is not None:
215 | new_comment.sender_name = concat(user.first_name, user.last_name)
216 | new_comment.sender = user.id
217 | else:
218 | commentUser = telethon.types.User(0)
219 | commentUser.first_name = new_comment.sender_name
220 |
221 | new_comment.timestamp = comment.find('time')['datetime']
222 | commentList.append(new_comment)
223 |
224 | @staticmethod
225 | async def getEntity(identifierName):
226 | return await Bots.client.get_entity(identifierName)
227 |
--------------------------------------------------------------------------------
/channelscraper/channel.py:
--------------------------------------------------------------------------------
1 | import asyncio
2 | import csv
3 | import logging
4 | import os
5 | import time
6 | import traceback
7 | from datetime import datetime
8 |
9 | import telethon.tl.types.messages
10 | from bots import Bots
11 | from client import Client
12 | from config import Config
13 | from message import Message
14 | from utilities import concat, wait, calcDateOffset, getOutputPath, create_path_if_not_exists, extractUrls
15 |
16 |
17 | class Channel:
18 | config = Config.getConfig()
19 | client = Client.getClient()
20 | logging.basicConfig(format='[%(levelname) 5s/%(asctime)s] %(name)s: %(message)s',
21 | level=logging.INFO)
22 |
23 | def __init__(self, link, isBroadcastingChannel):
24 | self.id = None
25 |
26 | self.users = list()
27 | if isBroadcastingChannel == "True":
28 | self.isBroadcastingChannel = True
29 | else:
30 | self.isBroadcastingChannel = False
31 | self.username = link.rsplit('/', 1)[-1]
32 |
33 | self.path = getOutputPath() + "/" + self.username
34 | self.messages = list()
35 |
36 | def scrape(self):
37 | create_path_if_not_exists(self.path)
38 |
39 | if Channel.config.get("scrape_mode") == "OFFSET_SCRAPE":
40 | self.getRecentChannelMessages()
41 | elif Channel.config.get("scrape_mode") == "FULL_SCRAPE":
42 | self.getAllChannelMessages()
43 | else:
44 | raise AttributeError("Invalid scraping mode set in config file.")
45 |
46 | # Collects messages and comments from a given channel.
47 | def getRecentChannelMessages(self):
48 | """ Scrapes messages of a channel of the last X days and saves the information in the channel object.
49 | """
50 |
51 | async def main():
52 | async for message in Channel.client.iter_messages(self.username, offset_date=calcDateOffset(
53 | Channel.config.get("scrape_offset")), reverse=True):
54 | if type(message) == telethon.tl.types.Message:
55 | await self.parseMessage(message)
56 |
57 | with Channel.client:
58 | try:
59 | Channel.client.loop.run_until_complete(main())
60 | except telethon.errors.ServerError:
61 | logging.info("Server error: Passed")
62 | pass
63 | except telethon.errors.FloodWaitError as e:
64 | logging.info("FloodWaitError: Sleep for " + str(e.seconds))
65 | time.sleep(e.seconds)
66 |
67 | self.messages = list(reversed(self.messages))
68 |
69 | # Collects messages and comments from a given channel.
70 | def getAllChannelMessages(self):
71 | """ Scrapes all messages of a channel and saves the information in the channel object.
72 | """
73 |
74 | async def main():
75 | async for message in Channel.client.iter_messages(self.username):
76 | if type(message) == telethon.tl.types.Message:
77 | await self.parseMessage(message)
78 |
79 | with Channel.client:
80 | try:
81 | Channel.client.loop.run_until_complete(main())
82 | except telethon.errors.ServerError:
83 | logging.info("Server error: Passed")
84 | pass
85 | except telethon.errors.FloodWaitError as e:
86 | logging.info("FloodWaitError: Sleep for " + str(e.seconds))
87 | time.sleep(e.seconds)
88 |
89 | async def parseMessage(self, message):
90 | # Wait to prevent getting blocked
91 | await wait()
92 |
93 | new_message = Message()
94 | new_message.id = message.id
95 | new_message.sender = message.sender_id
96 | try:
97 | first_name = message.sender.first_name
98 | last_name = message.sender.last_name
99 | except AttributeError:
100 | first_name = ""
101 | last_name = ""
102 | new_message.sender_name = concat(first_name, last_name)
103 | try:
104 | new_message.username = message.sender.username
105 | except AttributeError:
106 | pass
107 | new_message.replyToMessageId = message.reply_to_msg_id
108 | new_message.edit_date = message.edit_date
109 | new_message.entities = message.entities
110 | new_message.post_author = message.post_author
111 | new_message.timestamp = message.date
112 | new_message.text = message.text
113 | new_message.views = message.views
114 | new_message.media = type(message.media)
115 | self.member_count = message.chat.participants_count
116 |
117 | # Saves the channel from which the message was forwarded.
118 | try:
119 | await self.__parseForward(message, new_message)
120 | except AttributeError:
121 | pass
122 |
123 | if type(message.media) == telethon.types.MessageMediaPhoto and Channel.config.get("media_download"):
124 | mediapath = self.path + "/media/" + str(new_message.id)
125 | if not os.path.exists(mediapath + ".jpg"):
126 | try:
127 | await message.download_media(mediapath)
128 | except telethon.errors.FloodWaitError as e:
129 | logging.info("Waiting " + str(e.seconds) + " seconds: FloodWaitError")
130 | await asyncio.sleep(e.seconds)
131 | except telethon.errors.RpcCallFailError:
132 | pass
133 | await asyncio.sleep(1)
134 |
135 | # Checks which kind of comment bot is used by the provider of the group and uses the correct scraper.
136 | # --> then fills the comment list for each message with the comments (nothing is added if no comment
137 | # bot is used).
138 | comments = list()
139 | if message.buttons is not None and message.forward is None:
140 | buttons = message.buttons
141 |
142 | for button in buttons:
143 | button_url = None
144 | try:
145 | button_url = button[0].button.url[:21]
146 | except AttributeError:
147 | pass
148 |
149 | if button_url == 'https://comments.bot/':
150 | logging.info("---> Found comments.bot...")
151 | new_message.hasComments = True
152 | new_message.bot_url = button[0].button.url
153 | try:
154 | comments.extend(Bots.scrapeCommentsBot(new_message.bot_url, self.users, message.id))
155 | except Exception:
156 | traceback.print_exc()
157 | elif button[0].text[-8:] == 'comments':
158 | logging.info("---> Found comments.app...")
159 | new_message.hasComments = True
160 | new_message.bot_url = button[0].button.url
161 | try:
162 | commentsAppComments, commentsAppUsers = \
163 | await Bots.scrapeCommentsApp(new_message.bot_url, message.id,
164 | Channel.config.get("query_users"))
165 | comments.extend(commentsAppComments)
166 | self.users.extend(commentsAppUsers)
167 | except Exception:
168 | traceback.print_exc()
169 |
170 | new_message.comments = comments
171 | self.messages.append(new_message)
172 |
173 | def writeCsv(self):
174 | if len(self.messages) == 0:
175 | raise LookupError("Nothing to write. You have to execute 'scrape' method first.")
176 |
177 | chatlogs_csv = self.path + "/chatlogs_" + str(datetime.now().strftime("%Y-%m-%d--%H-%M-%S")) + ".csv"
178 | users_csv = self.path + "/users_" + str(datetime.now().strftime("%Y-%m-%d--%H-%M-%S")) + ".csv"
179 |
180 | # WRITE MESSAGES AND COMMENTS
181 | with open(chatlogs_csv, "w", encoding="utf-8", newline='') as chatFile:
182 | writer = csv.writer(chatFile)
183 | writer.writerow(Message.getMessageHeader())
184 | for message in self.messages:
185 | message.urls = extractUrls(message)
186 | writer.writerow(message.getMessageRow(self.username, self.member_count, self.isBroadcastingChannel))
187 |
188 | for comment in message.comments:
189 | comment.urls = extractUrls(comment)
190 | writer.writerow(comment.getMessageRow(self.username, self.member_count, self.isBroadcastingChannel))
191 |
192 | with open(users_csv, "w", encoding="utf-8", newline='') as usersFile:
193 | writer = csv.writer(usersFile)
194 | writer.writerow(Message.getUserHeader())
195 | for user in self.users:
196 | # Write in user table.
197 | writer.writerow(
198 | [self.username, user.id, user.first_name, user.last_name, concat(user.first_name, user.last_name),
199 | user.phone, user.bot, user.verified, user.username])
200 |
201 | async def __parseForward(self, message, new_message):
202 | if message.forward is not None:
203 | if message.forward.chat is not None:
204 | new_message.forward = message.forward.chat.username
205 | new_message.forwardId = message.forward.chat.id
206 | if new_message.forwardId is None:
207 | new_message.forwardId = message.forward.channel_id
208 | elif message.forward.sender is not None:
209 | sender = message.forward.sender
210 | new_message.forward = concat(sender.first_name, sender.last_name)
211 | new_message.forwardId = sender.id
212 | else:
213 | new_message.forward = "Unknown"
214 | new_message.forwardId = "Unknown"
215 |
216 | if message.forward.original_fwd is not None:
217 | new_message.forward_msg_id = message.forward.original_fwd.channel_post
218 | new_message.forward_msg_date = message.forward.date
219 |
220 | def getChannelUsers(self):
221 | """ Scrape the users of a channel, if it is not a broadcasting channel."""
222 | if self.isBroadcastingChannel:
223 | return
224 |
225 | async def main():
226 | async for user in Channel.client.iter_participants(self.username, aggressive=True):
227 | if type(user) == telethon.types.User:
228 | if user not in self.users:
229 | self.users.append(user)
230 |
231 | with Channel.client:
232 | try:
233 | Channel.client.loop.run_until_complete(main())
234 | except telethon.errors.FloodWaitError as e:
235 | print("FloodWaitError: Sleep for " + str(e.seconds))
236 | time.sleep(e.seconds)
237 |
--------------------------------------------------------------------------------
/channelscraper/client.py:
--------------------------------------------------------------------------------
1 | from telethon import TelegramClient, sync  # 'sync' makes client methods callable synchronously
2 | from config import Config
3 |
4 |
5 | class Client:
6 | config = Config.getConfig()
7 | client = None
8 |
9 | @staticmethod
10 | def getClient():
11 | """ Static access method. """
12 | if Client.client is None:
13 | Client()
14 |
15 | return Client.client
16 |
17 | def __init__(self):
18 | """ Virtually private constructor. """
19 | if Client.client is not None:
20 | raise Exception("This class is a singleton!")
21 | else:
22 | # API CONNECTION #
23 | api_id = Client.config.get("api_id")
24 | api_hash = Client.config.get("api_hash")
25 | phone = Client.config.get('phone')
26 | Client.client = TelegramClient(phone, api_id, api_hash)
27 | Client.client.connect()
28 |
29 | # LOGIN
30 | if not Client.client.is_user_authorized():
31 | Client.client.send_code_request(phone)
32 | Client.client.sign_in(phone, input('Enter the code: '))
33 |
--------------------------------------------------------------------------------
/channelscraper/config.py:
--------------------------------------------------------------------------------
1 | import yaml
2 |
3 |
4 | class Config:
5 | config = None
6 |
7 | @staticmethod
8 | def getConfig():
9 | """ Static access method. """
10 | if Config.config is None:
11 | Config()
12 |
13 | return Config.config
14 |
15 | def __init__(self):
16 | Config.config = yaml.safe_load(open("config.yaml"))
17 |
--------------------------------------------------------------------------------
/channelscraper/config.yaml:
--------------------------------------------------------------------------------
1 | #### API CREDENTIALS #####
2 | # Credentials for telegram API
3 | api_id: #Put Numeric API ID here
4 | api_hash: #Put API hash string here
5 | phone: #Put phone number string here
6 |
7 | ### INPUT CHANNEL FILES ###
8 | #input_channel_file: telegram_accounts.csv
9 | input_channel_file: channels.csv
10 |
11 | ### SCRAPING SETTINGS ####
12 | # More extensive logging if true.
13 | debug_mode: false
14 |
15 | # Specify scrape mode: FULL_SCRAPE vs. OFFSET_SCRAPE
16 | scrape_mode: FULL_SCRAPE
17 |
18 | # Only relevant on OFFSET_SCRAPE mode. Set the amount of days to go back in chat history.
19 | scrape_offset: 30
20 |
21 | # A different driver is needed on AWS: set to "live" to use a remote Selenium instance (see driver.py).
22 | driver_mode: "normal"
23 |
24 | # If true, the script will run slower in order to avoid being blocked.
25 | init: False
26 |
27 | # If true, media files will be downloaded as well. By default these are photos; edit the source code if you want other media types.
28 | media_download: False
29 |
30 | # Query comments.bot/app usernames. Attention: this makes an API ban more likely.
31 | query_users: False
32 |
--------------------------------------------------------------------------------
/channelscraper/driver.py:
--------------------------------------------------------------------------------
1 | import os
2 | import platform
3 | from config import Config
4 | from selenium.webdriver import Chrome, Remote
5 | from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
6 | from selenium import webdriver
7 | import chromedriver_autoinstaller
8 |
9 |
10 | class Driver:
11 | driver = None
12 | config = Config.getConfig()
13 |
14 | @staticmethod
15 | def getDriver():
16 | """ Static access method. """
17 | if Driver.driver is None:
18 | Driver()
19 | return Driver.driver
20 |
21 | @staticmethod
22 | def closeDriver():
23 | if Driver.driver is not None:
24 | Driver.driver.quit()
25 | def __init__(self):
26 | """ Virtually private constructor. """
27 | if Driver.driver is not None:
28 | raise Exception("This class is a singleton!")
29 | else:
30 | if Driver.config.get("driver_mode") == "live":
31 | Driver.driver = Remote("http://127.0.0.1:4444/wd/hub", DesiredCapabilities.CHROME)
32 | else:
33 | chromedriver_autoinstaller.install()
34 | Driver.driver = webdriver.Chrome()
35 |
--------------------------------------------------------------------------------
/channelscraper/message.py:
--------------------------------------------------------------------------------
1 | from utilities import getClassName
2 |
3 |
4 | class Message:
5 | def __init__(self):
6 | self.text = ""
7 | self.id = None
8 | self.timestamp = None
9 | self.replyToMessageId = ""
10 | self.views = None
11 | self.sender = None
13 | self.edit_date = None
14 | self.entities = list()
15 | self.forward = None
16 | self.forwardId = None
17 | self.forward_msg_id = None
18 | self.forward_msg_date = None
19 | self.post_author = None
20 | self.comments = []
21 | self.sender_name = None
22 | self.username = None
23 | self.urls = None
24 | self.media = None
25 | self.isComment = False
26 | self.hasComments = False
27 | self.parent = None
28 | self.bot_url = None
29 | self.isDeleted = ""
31 |
32 | @staticmethod
33 | def getMessageHeader():
34 | return ["channel", "member_count", "broadcast", "id", "timestamp", "content", "user_id", "first_and_last_name",
35 | "username",
36 | "views", "edit-date",
37 | "replyToId", "forward", "forward_id", "forward_msg_id", "forward_date", "URLs", "media", "hasComments",
38 | "isComment",
39 | "bot_url",
40 | "parent", "isDeleted"]
41 |
42 | @staticmethod
43 | def getUserHeader():
44 | return ["channel", "Id", "first_name", "last_name", "first_and_last_name", "phone", "bot",
45 | "verified",
46 | "username"]
47 |
48 | def getMessageRow(self, channel_username, channel_member_count, channel_broadcast):
49 | row = [channel_username, channel_member_count, channel_broadcast, self.id, self.timestamp, self.text,
50 | self.sender,
51 | self.sender_name,
52 | self.username, self.views,
53 | self.edit_date, self.replyToMessageId, self.forward, self.forwardId, self.forward_msg_id,
54 | self.forward_msg_date,
55 | ", ".join(self.urls),
56 | getClassName(self.media), self.hasComments, self.isComment, self.bot_url,
57 | self.parent, self.isDeleted]
58 | return row
59 |
--------------------------------------------------------------------------------
/channelscraper/scrapeChannelMetadata.py:
--------------------------------------------------------------------------------
1 | import time
2 |
3 | import telethon
4 | import os
5 | import yaml
6 | import csv
7 | from client import Client
8 | import traceback
9 | from channel import Channel
10 | from telethon.tl.functions.channels import GetFullChannelRequest
11 | from utilities import getInputPath, getOutputPath
12 | from driver import Driver
13 |
14 |
15 | def getChannelList(filename):
16 | """Initialises the CSV of channels to scrape given by ADDENDUM."""
17 | channelList = []
18 | with open(getInputPath() + '/' + filename, newline='',
19 | encoding='utf-8') as csvfile:
20 | reader = csv.reader(csvfile)
21 | csvList = list(reader)
22 | for row in csvList[1:]:
23 | channelList.append(row[2].split("/")[-1])
24 | return channelList
25 |
26 | client = Client.getClient()
27 | config = yaml.safe_load(open("config.yaml"))
28 | input_file = config["input_channel_file"]
29 | channels = getChannelList(input_file)
30 |
31 | channel_info_list = list()
32 | i = 0
33 | for channel in channels:
34 | i = i + 1
35 | print("Channel: " + channel + " Nr: " + str(i))
36 | time.sleep(3)
37 | try:
38 | channel_entity = client.get_entity(channel)
39 | channel_full_info = client(GetFullChannelRequest(channel=channel_entity))
40 |
41 | if channel_full_info.full_chat.location is not None:
42 | geo_point = channel_full_info.full_chat.location.geo_point
43 | address = channel_full_info.full_chat.location.address
44 | else:
45 | geo_point = ""
46 | address = ""
47 | channel_info_list.append(
48 | [channel_entity.username, channel_entity.id, channel_full_info.full_chat.about, channel_entity.broadcast,
49 | channel_entity.date, channel_full_info.full_chat.participants_count, geo_point
50 | , address])
51 | except Exception:
52 | traceback.print_exc()
53 | print("Channel '" + channel + "' does not exist.")
55 |
56 | with open(getInputPath() + '/channel_info.csv', mode="w",
57 | newline='',
58 | encoding='utf-8') as csvfile:
59 | writer = csv.writer(csvfile)
60 | writer.writerow(["channel", "id", "about", "broadcast", "created_at", "members", "location", "address"])
61 | for channel in channel_info_list:
62 | writer.writerow(channel)
63 |
64 | Driver.closeDriver()
65 |
--------------------------------------------------------------------------------
/channelscraper/utilities.py:
--------------------------------------------------------------------------------
1 | import re
2 | from random import random
3 | from datetime import datetime, timedelta
4 | import asyncio
5 | import os
6 | import telethon
7 | import logging
8 |
9 |
10 | def getInputPath():
11 | path = os.path.dirname(os.path.dirname(__file__)) + "/input"
12 | create_path_if_not_exists(path)
13 | return path
14 |
15 |
16 | def getOutputPath():
17 | path = os.path.dirname(os.path.dirname(__file__)) + "/output"
18 | create_path_if_not_exists(path)
19 | return path
20 |
21 |
22 | def concat(first_name, last_name):
23 | if first_name is None:
24 | first_name = ""
25 | if last_name is None:
26 | last_name = ""
27 | return first_name + " " + last_name
28 |
29 |
30 | def eliminateWhitespaces(text):
31 | return re.sub(r"\s\s+", " ", text)
32 |
33 |
34 | def calcDateOffset(offset):
35 | if offset <= 0:
36 | raise ValueError("Offset must be greater than 0")
37 | return datetime.now() - timedelta(days=offset)
38 |
39 |
40 | async def wait():
41 | """ Random sleep time to avoid getting blocked by telegram."""
42 | randomNum = random() / 10
43 | if randomNum < 0.0001:
44 | await asyncio.sleep(3)
45 | await asyncio.sleep(randomNum)
46 |
47 |
48 | def create_path_if_not_exists(channelPath):
49 | if not os.path.exists(channelPath):
50 | os.makedirs(channelPath)
51 |
52 |
53 | def extractUrls(message):
54 | """ Url extraction by RegEx and telegram url entity to get maximum results"""
55 | urls_regex = re.findall(r"(?Phttps?://[^\s]+)", message.text)
56 | # if found through regex add it, if not already found (because it has bugs)
57 | for i in range(len(urls_regex)):
58 | url_reg = urls_regex[i]
59 | match = re.search(r".jpg|.htm|.html|.mp4", url_reg)
60 | if match is not None:
61 | urls_regex[i] = url_reg[0:match.end()]
62 |
63 | entities = message.entities
64 | if entities is None:
65 | return urls_regex
66 | urls = []
67 |
68 | for entity in entities:
69 | if type(entity) == telethon.tl.types.MessageEntityUrl:
70 | try:
71 | enc_text = message.text.encode('utf-16-le')  # entity offsets/lengths are in UTF-16 code units
72 | url = enc_text[entity.offset * 2:(entity.offset + entity.length) * 2].decode('utf-16-le')
73 | if (re.match("http|www", url)):
74 | urls.append(url)
75 | except UnicodeDecodeError:
76 | logging.info("Unicode Error for text")
77 | pass
78 |
79 | for url in urls:
80 | for url_reg in list(urls_regex):  # iterate over a copy: removing while iterating skips elements
81 | if url in url_reg:
82 | urls_regex.remove(url_reg)
83 |
84 | urls.extend(urls_regex)
85 | urls = list(dict.fromkeys(urls))  # remove duplicates, preserving order
86 | return urls
87 |
88 |
89 | def getClassName(obj):
90 | if obj is not None:
91 | return obj.__name__
92 | else:
93 | return ""
94 |
--------------------------------------------------------------------------------
/input/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PeterWalchhofer/Telescrape/733220a72f5cf40001803007ae8528b8ae6849a3/input/.gitkeep
--------------------------------------------------------------------------------
/input/channel_info.csv:
--------------------------------------------------------------------------------
1 | channel,id,about,broadcast,created_at,members,location,address
2 | channelolero,1246406916,,True,2019-12-16 09:58:02+00:00,2,,
3 |
--------------------------------------------------------------------------------
/input/channels.csv:
--------------------------------------------------------------------------------
1 | "Kategorie","Name","Link","@","Broadcast","asdasd"
2 | "Gruppe Typ XY","Example Channel","https://t.me/channelolero","example_channel","TRUE",
3 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | selenium
2 | telethon
3 | beautifulsoup4
4 | pyyaml
5 | chromedriver-autoinstaller
--------------------------------------------------------------------------------
/scrape.sh:
--------------------------------------------------------------------------------
1 | #!/bin/sh
2 |
3 | . venv/bin/activate
4 | cd ./channelscraper
5 |
6 | if [ $# -eq 0 ]
7 | then
8 | python3 app.py
9 | else
10 | if [ "$1" = "meta" ]; then
11 | python3 scrapeChannelMetadata.py
12 | fi
13 | fi
14 |
15 | deactivate
--------------------------------------------------------------------------------