├── .gitignore
├── LICENSE
├── Makefile
├── README.md
├── channelscraper
│   ├── __init__.py
│   ├── app.py
│   ├── bots.py
│   ├── channel.py
│   ├── client.py
│   ├── config.py
│   ├── config.yaml
│   ├── driver.py
│   ├── message.py
│   ├── scrapeChannelMetadata.py
│   └── utilities.py
├── input
│   ├── .gitkeep
│   ├── channel_info.csv
│   └── channels.csv
├── requirements.txt
└── scrape.sh
/.gitignore:
--------------------------------------------------------------------------------
1 | /scripts/+4367763544147.session
2 | .idea/
3 | venv
4 | __pycache__
5 | /output/
6 |
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | MIT License
2 |
3 | Copyright (c) 2020 Peter Walchhofer, Valentin Peter
4 |
5 | Permission is hereby granted, free of charge, to any person obtaining a copy
6 | of this software and associated documentation files (the "Software"), to deal
7 | in the Software without restriction, including without limitation the rights
8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 |
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 |
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 |
--------------------------------------------------------------------------------
/Makefile:
--------------------------------------------------------------------------------
1 | venv:
2 | python3 -m venv venv; \
3 | . venv/bin/activate; \
4 | pip install -r requirements.txt
5 |
6 | install: venv
7 |
8 | .PHONY: update
9 | update:
10 | . venv/bin/activate; \
11 | pip install -r requirements.txt
12 |
13 | .PHONY: clean
14 | clean:
15 | rm -rf venv
16 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Telegram Scraper
2 |
3 | This Telegram scraper collects Telegram messages, comments (comments.bot/comments.app) and media files. It was originally built for this [story](https://www.addendum.org/news/telegram-netzwerk-sellner/) on behalf of [Addendum](https://addendum.org).
4 |
5 | **Contributors:** [@PeterWalchhofer](https://github.com/PeterWalchhofer), [@vali101](https://github.com/vali101), [@fin](https://github.com/fin)
6 | ## Notes
7 | This scraper was written before Telegram introduced its native comment feature for broadcasting channels. Nowadays, the comments.bot/comments.app extensions are rarely used anymore. Feel free to make a PR to support this feature.
8 | [@vali101](https://github.com/vali101) rewrote the scraper for his bachelor thesis [here](https://github.com/vali101/telegraph) with a focus on snowball sampling. His code may help with extending this scraper or with finding interesting channels in the first place. Check it out!
9 |
10 | ## Setup
11 |
12 | ### Requirements:
13 | - Google Chrome (in theory you could also use Firefox by installing the necessary driver manually)
14 | - Python 3
15 | - Telegram Account (Phone Number)
16 | - A lot of storage if downloading media
17 | - Time: expect around 3,000 messages and comments per minute
18 |
19 | ### Getting Started
20 |
21 | 1. Install dependencies with `make install` OR just install `requirements.txt` manually (using a venv is recommended)
22 | 2. Create your own `channels.csv` as explained in the next section
23 | 3. Put the phone number of the linked Telegram account into `config.yaml`
24 | 4. Get your API key [here](https://my.telegram.org/auth?to=apps) and put it inside `config.yaml`.
25 | 5. Run `sh scrape.sh` to start the scraper OR run `python app.py` from inside the `channelscraper` directory
26 | 6. The outputs will be stored in the `/output` directory.
27 |
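Putting these steps together, a typical first run might look like this (`$EDITOR` stands for your editor of choice):

```sh
make install                        # create the venv and install requirements.txt
$EDITOR input/channels.csv          # list the channels to scrape
$EDITOR channelscraper/config.yaml  # fill in api_id, api_hash and phone
sh scrape.sh                        # scrape messages (plus comments/media, per config)
sh scrape.sh meta                   # optional: scrape channel metadata instead
```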
28 |
29 | ### Input Data
30 | You need to create your own `channels.csv` and put it in the `/input` folder.
31 |
32 | Only **Link** and **Broadcast** are relevant for scraping. The CSV should have the form described below. There is also an example CSV in the folder.
33 |
34 | Kategorie | Name | **Link** | @ | **Broadcast**
35 | --- | --- | --- | --- | ---
36 | Gruppe Typ XY | Example Channel | https://t.me/example_channel | example_channel | TRUE
37 |
38 | * Kategorie (optional): metadata to annotate the channel
39 | * Name (optional): display name, not an identifier
40 | * Link: link to the channel
41 | * @ (optional): identifier name
42 | * Broadcast: `TRUE` if the channel is a broadcasting channel, else `FALSE`. Broadcasting channels are large one-to-many channels that only allow owners to write messages.
43 |
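In raw CSV form (compare the example `input/channels.csv`), the row above looks like this. Note that `app.py` reads **Link** and **Broadcast** by column position (third and fifth), so keep the column order:

```csv
Kategorie,Name,Link,@,Broadcast
Gruppe Typ XY,Example Channel,https://t.me/example_channel,example_channel,TRUE
```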
44 | ### Configuration
45 |
46 | The Scraper can be further configured via the `channelscraper/config.yaml`.
47 |
48 | ## Features
49 | ### Messages
50 | The Scraper extracts all messages from a channel. It is also possible to scrape only those messages that were written in the last X days; this can be set in the `config.yaml`.
51 |
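For example, to scrape only the last 30 days, set the corresponding keys in `channelscraper/config.yaml`:

```yaml
scrape_mode: OFFSET_SCRAPE
scrape_offset: 30  # days to go back in the chat history
```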
52 | ### Comment Bots
53 | - In many broadcasting channels, comment bots are used to collect feedback from the audience. It is also possible to scrape those messages. Currently, comments.app and comments.bot are supported.
54 | - Be careful: the date format differs from the Telegram API's and has to be parsed manually (e.g. `Dec 09`; see the sketch below).
55 | - As there is a "Load more comments" button, it has to be clicked using JavaScript. Selenium is used to interact with the Chrome driver, which is installed automatically.
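A minimal sketch of such manual date parsing (`parse_bot_date` is an illustrative helper, not part of the scraper; since the year is missing from the string, the current year is assumed):

```python
from datetime import datetime

def parse_bot_date(raw: str) -> datetime:
    # "Dec 09" -> month and day; the year is not part of the string.
    parsed = datetime.strptime(raw.strip(), "%b %d")
    return parsed.replace(year=datetime.now().year)

print(parse_bot_date("Dec 09"))  # e.g. 2024-12-09 00:00:00, depending on the run date
```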
56 | #### Comments.app
57 | - Unique username extraction works most of the time. However, if a user has deleted their account, this is not possible.
58 | #### Comments.bot
59 | - Unique usernames are not extracted, because we found no way to obtain them without querying the API at a high cost.
60 | - Only the display name is persisted.
61 |
62 | ## Further remarks
63 | - Messages and comments are persisted in the same CSV file. To tell them apart, use the `isComment` column. Additionally, a comment's ID includes a period, in the format msgId.commId (e.g. 101.3).
64 | - We allocated IDs to the comments manually; the Telegram message IDs are unique within a channel.
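For downstream analysis, the schema can be consumed like this (a sketch; the file name is illustrative, as real chatlog files carry a timestamp suffix):

```python
import csv

with open("output/example_channel/chatlogs_2020-01-01--12-00-00.csv",
          encoding="utf-8", newline="") as f:
    rows = list(csv.DictReader(f))

messages = [r for r in rows if r["isComment"] == "False"]
comments = [r for r in rows if r["isComment"] == "True"]
# A comment id such as "101.3" means comment 3 under message 101;
# the "parent" column holds the parent message id.
```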
65 |
66 | ## Optional
67 |
68 | If you want to run Selenium with Docker, use `selenium/standalone-chrome:3.141.59-yttrium`.
69 | Also see:
70 | https://stackoverflow.com/questions/45323271/how-to-run-selenium-with-chrome-in-docker
71 |
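A minimal sketch (the port matches the `http://127.0.0.1:4444/wd/hub` address used in `driver.py`; `--shm-size` is a common workaround for Chrome crashing in containers):

```sh
# Start the pinned standalone Chrome image with Selenium listening on port 4444
docker run -d -p 4444:4444 --shm-size=2g selenium/standalone-chrome:3.141.59-yttrium
```

Then set `driver_mode: "live"` in `channelscraper/config.yaml` so the scraper connects to the remote Selenium instance instead of starting a local Chrome.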
--------------------------------------------------------------------------------
/channelscraper/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PeterWalchhofer/Telescrape/733220a72f5cf40001803007ae8528b8ae6849a3/channelscraper/__init__.py
--------------------------------------------------------------------------------
/channelscraper/app.py:
--------------------------------------------------------------------------------
1 | from datetime import datetime
2 | from datetime import timedelta
3 | import traceback
4 | import logging
5 | import os
6 | import yaml
7 | import csv
8 | import asyncio
9 |
10 | from utilities import getInputPath
11 | from channel import Channel
12 | from driver import Driver
13 |
14 |
15 | def getChannelList(filename):
16 | """Initialises the CSV of channels to scrape given by ADDENDUM."""
17 | channelList = []
18 | with open(getInputPath() + '/' + filename, newline='', encoding='utf-8') as csvfile:
19 | reader = csv.reader(csvfile)
20 | csvList = list(reader)
21 | for row in csvList[1:]:
22 | channelList.append(Channel(row[2], row[4]))
23 | return channelList
24 |
25 |
26 | config = yaml.safe_load(open("config.yaml"))
27 | input_file = config["input_channel_file"]
28 | channels = getChannelList(input_file)
29 |
30 | for channel in channels:
31 | channelPath = getInputPath() + "/" + channel.username
32 | logging.info("Collecting channel: " + channel.username)
33 | logging.info("-> Collecting messages")
34 | # Scrape Channels
35 | try:
36 | channel.scrape()
37 | except Exception:
38 | traceback.print_exc()
39 | pass
40 |
41 | try:
42 | channel.getChannelUsers()
43 | except Exception:
44 | traceback.print_exc()
45 | pass
46 |
47 | # WriteCsv
48 | try:
49 | channel.writeCsv()
50 | except Exception:
51 | traceback.print_exc()
52 | pass
53 |
54 | Driver.closeDriver()
55 |
--------------------------------------------------------------------------------
/channelscraper/bots.py:
--------------------------------------------------------------------------------
1 | import selenium
2 | import telethon.tl.types.messages
3 | from urllib.request import Request, urlopen
4 | from selenium.webdriver.support.ui import WebDriverWait
5 | from selenium.webdriver.common.by import By
6 | from selenium.webdriver.support import expected_conditions as EC
7 | from selenium.common.exceptions import ElementNotInteractableException
8 | from bs4 import BeautifulSoup
9 | import urllib.request
10 | import traceback
11 | import time
12 | import logging
13 |
14 | from client import Client
15 | from driver import Driver
16 | from utilities import concat, eliminateWhitespaces
17 | from message import Message
18 |
19 |
20 |
21 | class Bots:
22 | client = Client.getClient()
23 | driver = Driver.getDriver()
24 |
25 | @staticmethod
26 | def javaScriptLoadMoreHack(url, provider):
27 | """ All it does is to click on the 'load more comments' button in order to scrape all the messages"""
28 | if provider == "bot":
29 | index = 1
30 | selector = "sg_load"
31 | else:
32 | index = 0
33 | selector = "bc-load-more"
34 | # use the Selenium Chrome driver to render the page, including JavaScript-generated content
35 | Bots.driver.get(url)
36 | resultDriver = Bots.loadMore(Bots.driver, index, selector)
37 |
38 | # parse the rendered page source
39 | if resultDriver is None:
40 | return None
41 | else:
42 | page_source = resultDriver.page_source
43 | return BeautifulSoup(page_source, "html.parser")
44 |
45 | @staticmethod
46 | def loadMore(driver, index, selector):
47 | """Recursive call of load more if there are lots of comments"""
48 | selection = driver.find_elements_by_css_selector("." + selector)
49 | try:
50 | if len(selection) > index:
51 | selection[index].click()
52 | else:
53 | return None
54 | except ElementNotInteractableException:
55 | if index == 1:
56 | return Bots.loadMore(driver, 0, selector)
57 | else:
58 | return None
59 | # wait for the page to load
60 | try:
61 | element = WebDriverWait(driver, 3).until(
62 | EC.invisibility_of_element_located((By.CLASS_NAME, selector))
63 | )
64 | time.sleep(1)
65 | return driver
66 | except selenium.common.exceptions.TimeoutException:
67 | return Bots.loadMore(driver, index, selector)
68 |
69 | @staticmethod
70 | async def scrapeCommentsApp(url, messageId, queryUser):
71 | """Scrapes https://comments.app for comment feature extension
72 | """
73 | commentList = list()
74 | userList = list()
75 | # soup = BeautifulSoup(page, 'html.parser')
76 | soup = Bots.javaScriptLoadMoreHack(url, "app")
77 | if soup is None:
78 | try:
79 | page = urllib.request.urlopen(url)
80 | soup = BeautifulSoup(page, "html.parser")
81 | except Exception:
82 | logging.info("Exception on loading comments occurred")
83 | return
84 |
85 | comments = soup.find_all('div', class_='bc-comment-box')
86 | count = 1  # to create an id for comments, which they naturally do not have.
87 | for comment in comments:
88 | await Bots.__parseCommentFromApp(comment, commentList, userList, count, messageId, queryUser)
89 | count = count + 1
90 |
91 | return commentList, userList
92 |
93 |
94 | @staticmethod
95 | def scrapeCommentsBot(url, channel_users, messageId):
96 | """ Scrapes https://comments.bot/ website for comment feature extension."""
97 | hdr = {
98 | 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
99 | 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
100 | 'Referer': 'https://cssspritegenerator.com',
101 | 'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
102 | 'Accept-Encoding': 'none',
103 | 'Accept-Language': 'en-US,en;q=0.8',
104 | 'Connection': 'keep-alive'}
105 |
106 | soup = Bots.javaScriptLoadMoreHack(url, "bot")
107 | if soup is None:
108 | try:
109 | req = Request(url, headers=hdr)
110 | page = urlopen(req)
111 | soup = BeautifulSoup(page, 'html.parser')
112 | except Exception:
113 | logging.info("Could not open comments.bot page.")
114 |
115 | commentList = []
116 | if soup is not None:
117 | comments = soup.find_all('div', class_='comment-content')
118 | count = 1
119 | for comment in comments:
120 | # Save comment Message
121 | new_comment = Message()
122 | new_comment.isComment = True
123 | new_comment.parent = messageId
124 | # Save Id
125 | new_comment.id = str(messageId) + "." + str(count)
126 | # Save Text
127 | text_list = comment.find('div', class_='comment-text').contents
128 | text_list = [elem for elem in text_list if str(elem) != "\n"]  # Eliminate HTML line breaks
129 | new_comment.text = eliminateWhitespaces(' '.join([str(elem) for elem in text_list]))
130 |
131 | # Save replyId
132 | reply_text_list = comment.find('span', class_='comment-reply-text')
133 | if reply_text_list is not None:
134 | reply_text = eliminateWhitespaces(' '.join([str(elem) for elem in reply_text_list]))
135 | try:
136 | new_comment.replyToMessageId = next(
137 | (c for c in commentList if c.text ==
138 | reply_text), None).id
139 | except AttributeError:
140 | logging.info("Could not find quoted comment in Bot: " + url)
141 | count = count + 1
142 | # Find not identified User
143 | new_comment.sender_name = comment.find('div', class_='name-row').findChildren()[0].contents[0]
144 | commentUser = telethon.types.User(0)
145 | commentUser.first_name = new_comment.sender_name
146 | channel_users.append(commentUser)
147 |
148 | new_comment.timestamp = comment.find('div', class_='comment-date').findChildren()[0].contents[0]
149 |
150 | # No identified Users
151 | commentList.append(new_comment)
152 |
153 | return commentList
154 |
155 | @staticmethod
156 | async def __parseCommentFromApp(comment, commentList, userList, count, messageId, queryUser):
157 | # Save comment Message
158 | new_comment = Message()
159 | new_comment.isComment = True
160 | new_comment.parent = messageId
161 | new_comment.id = str(messageId) + "." + str(count)
162 |
163 | # Get all text elements of a comment
164 | textList = None
165 | try:
166 | textList = comment.find_all('div', class_='bc-comment-text')
167 | except:
168 | logging.info("This thread has no comments")
169 |
170 | # Save comment text and reply id
171 | try:
172 | # Find comment text and quoted comment id.
173 | if len(textList) > 1:
174 | try:
175 | new_comment.text = textList[1].text
176 | new_comment.replyToMessageId = next(
177 | (c for c in commentList if c.text ==
178 | textList[0].text), None).id
179 | except AttributeError:
180 | logging.info(
181 | "Could not find quoted comment" + ", MessageId: " + str(new_comment.id))
182 | # Find only comment text
183 | else:
184 | try:
185 | new_comment.text = textList[0].text
186 | except IndexError:
187 | logging.info("Could not find comment text" + ", MessageId: " + str(messageId))
188 |
189 | except Exception:
190 | traceback.print_exc()
191 | logging.info("An error occurred reading comment text" + ", MessageId: " + str(messageId))
192 |
193 | # Find not identified User
194 | new_comment.sender_name = comment.find('span', class_='bc-comment-author-name').contents[0].contents[0]
195 | # Find identified user, save the user in channel.users and set message.sender_name
196 |
197 | try:
198 | identifierName = comment.find('span', class_='bc-comment-author-name').contents[0]['href'].rsplit('/', 1)[
199 | -1]
200 | except Exception:
201 | identifierName = ""
202 | user = None
203 |
204 | # Query user if identifier name was found
205 | if not identifierName == "" and queryUser:
206 | try:
207 | user = await Bots.getEntity(identifierName)
208 | userList.append(user)
209 | except ValueError:
210 | # User for some reason not found.
211 | pass
212 |
213 | new_comment.username = identifierName
214 | if user is not None:
215 | new_comment.sender_name = concat(user.first_name, user.last_name)
216 | new_comment.sender = user.id
217 | else:
218 | commentUser = telethon.types.User(0)
219 | commentUser.first_name = new_comment.sender_name
220 |
221 | new_comment.timestamp = comment.find('time')['datetime']
222 | commentList.append(new_comment)
223 |
224 | @staticmethod
225 | async def getEntity(identifierName):
226 | return await Bots.client.get_entity(identifierName)
227 |
--------------------------------------------------------------------------------
/channelscraper/channel.py:
--------------------------------------------------------------------------------
1 | import asyncio
2 | import csv
3 | import logging
4 | import os
5 | import time
6 | import traceback
7 | from datetime import datetime
8 |
9 | import telethon.tl.types.messages
10 | from bots import Bots
11 | from client import Client
12 | from config import Config
13 | from message import Message
14 | from utilities import concat, wait, calcDateOffset, getOutputPath, create_path_if_not_exists, extractUrls
15 |
16 |
17 | class Channel:
18 | config = Config.getConfig()
19 | client = Client.getClient()
20 | logging.basicConfig(format='[%(levelname) 5s/%(asctime)s] %(name)s: %(message)s',
21 | level=logging.INFO)
22 |
23 | def __init__(self, link, isBroadcastingChannel):
24 | self.id = None
25 |
26 | self.users = list()
27 | if isBroadcastingChannel == "True":
28 | self.isBroadcastingChannel = True
29 | else:
30 | self.isBroadcastingChannel = False
31 | self.username = link.rsplit('/', 1)[-1]
32 |
33 | self.path = getOutputPath() + "/" + self.username
34 | self.messages = list()
35 |
36 | def scrape(self):
37 | create_path_if_not_exists(self.path)
38 |
39 | if Channel.config.get("scrape_mode") == "OFFSET_SCRAPE":
40 | self.getRecentChannelMessages()
41 | elif Channel.config.get("scrape_mode") == "FULL_SCRAPE":
42 | self.getAllChannelMessages()
43 | else:
44 | raise AttributeError("Invalid scraping mode set in config file.")
45 |
46 | # Collects messages and comments from a given channel.
47 | def getRecentChannelMessages(self):
48 | """ Scrapes messages of a channel of the last X days and saves the information in the channel object.
49 | """
50 |
51 | async def main():
52 | async for message in Channel.client.iter_messages(self.username, offset_date=calcDateOffset(
53 | Channel.config.get("scrape_offset")), reverse=True):
54 | if type(message) == telethon.tl.types.Message:
55 | await self.parseMessage(message)
56 |
57 | with Channel.client:
58 | try:
59 | Channel.client.loop.run_until_complete(main())
60 | except telethon.errors.ServerError:
61 | logging.info("Server error: Passed")
62 | pass
63 | except telethon.errors.FloodWaitError as e:
64 | logging.info("FloodWaitError: Sleep for " + str(e.seconds))
65 | time.sleep(e.seconds)
66 |
67 | self.messages = list(reversed(self.messages))
68 |
69 | # Collects messages and comments from a given channel.
70 | def getAllChannelMessages(self):
71 | """ Scrapes all messages of a channel and saves the information in the channel object.
72 | """
73 |
74 | async def main():
75 | async for message in Channel.client.iter_messages(self.username):
76 | if type(message) == telethon.tl.types.Message:
77 | await self.parseMessage(message)
78 |
79 | with Channel.client:
80 | try:
81 | Channel.client.loop.run_until_complete(main())
82 | except telethon.errors.ServerError:
83 | logging.info("Server error: Passed")
84 | pass
85 | except telethon.errors.FloodWaitError as e:
86 | logging.info("FloodWaitError: Sleep for " + str(e.seconds))
87 | time.sleep(e.seconds)
88 |
89 | async def parseMessage(self, message):
90 | # Wait to prevent getting blocked
91 | await wait()
92 |
93 | new_message = Message()
94 | new_message.id = message.id
95 | new_message.sender = message.sender_id
96 | try:
97 | first_name = message.sender.first_name
98 | last_name = message.sender.last_name
99 | except AttributeError:
100 | first_name = ""
101 | last_name = ""
102 | new_message.sender_name = concat(first_name, last_name)
103 | try:
104 | new_message.username = message.sender.username
105 | except AttributeError:
106 | pass
107 | new_message.replyToMessageId = message.reply_to_msg_id
108 | new_message.edit_date = message.edit_date
109 | new_message.entities = message.entities
110 | new_message.post_author = message.post_author
111 | new_message.timestamp = message.date
112 | new_message.text = message.text
113 | new_message.views = message.views
114 | new_message.media = type(message.media)
115 | self.member_count = message.chat.participants_count
116 |
117 | # Saves the channel from which the message was forwarded.
118 | try:
119 | await self.__parseForward(message, new_message)
120 | except AttributeError:
121 | pass
122 |
123 | if type(message.media) == telethon.types.MessageMediaPhoto and Channel.config.get("media_download"):
124 | mediapath = self.path + "/media/" + str(new_message.id)
125 | if not os.path.exists(mediapath + ".jpg"):
126 | try:
127 | await message.download_media(mediapath)
128 | except telethon.errors.FloodWaitError as e:
129 | logging.info("Waiting " + str(e.seconds) + " seconds: FloodWaitError")
130 | await asyncio.sleep(e.seconds)
131 | except telethon.errors.RpcCallFailError:
132 | pass
133 | await asyncio.sleep(1)
134 |
135 | # Checks which kind of comment bot is used by the provider of the group and uses the correct scraper.
136 | # --> then fills the comment list for each message with the comments (nothing is added if no comment
137 | # bot is used).
138 | comments = list()
139 | if message.buttons is not None and message.forward is None:
140 | buttons = message.buttons
141 |
142 | for button in buttons:
143 | button_url = None
144 | try:
145 | button_url = button[0].button.url[:21]
146 | except AttributeError:
147 | pass
148 |
149 | if button_url == 'https://comments.bot/':
150 | logging.info("---> Found comments.bot...")
151 | new_message.hasComments = True
152 | new_message.bot_url = button[0].button.url
153 | try:
154 | comments.extend(Bots.scrapeCommentsBot(new_message.bot_url, self.users, message.id))
155 | except Exception:
156 | traceback.print_exc()
157 | elif button[0].text[-8:] == 'comments':
158 | logging.info("---> Found comments.app...")
159 | new_message.hasComments = True
160 | new_message.bot_url = button[0].button.url
161 | try:
162 | commentsAppComments, commentsAppUsers = \
163 | await Bots.scrapeCommentsApp(new_message.bot_url, message.id,
164 | Channel.config.get("query_users"))
165 | comments.extend(commentsAppComments)
166 | self.users.extend(commentsAppUsers)
167 | except Exception:
168 | traceback.print_exc()
169 |
170 | new_message.comments = comments
171 | self.messages.append(new_message)
172 |
173 | def writeCsv(self):
174 | if len(self.messages) == 0:
175 | raise LookupError("Nothing to write. You have to execute 'scrape' method first.")
176 |
177 | chatlogs_csv = self.path + "/chatlogs_" + str(datetime.now().strftime("%Y-%m-%d--%H-%M-%S")) + ".csv"
178 | users_csv = self.path + "/users_" + str(datetime.now().strftime("%Y-%m-%d--%H-%M-%S")) + ".csv"
179 |
180 | # WRITE MESSAGES AND COMMENTS
181 | with open(chatlogs_csv, "w", encoding="utf-8", newline='') as chatFile:
182 | writer = csv.writer(chatFile)
183 | writer.writerow(Message.getMessageHeader())
184 | for message in self.messages:
185 | message.urls = extractUrls(message)
186 | writer.writerow(message.getMessageRow(self.username, self.member_count, self.isBroadcastingChannel))
187 |
188 | for comment in message.comments:
189 | comment.urls = extractUrls(comment)
190 | writer.writerow(comment.getMessageRow(self.username, self.member_count, self.isBroadcastingChannel))
191 |
192 | with open(users_csv, "w", encoding="utf-8", newline='') as usersFile:
193 | writer = csv.writer(usersFile)
194 | writer.writerow(Message.getUserHeader())
195 | for user in self.users:
196 | # Write in user table.
197 | writer.writerow(
198 | [self.username, user.id, user.first_name, user.last_name, concat(user.first_name, user.last_name),
199 | user.phone, user.bot, user.verified, user.username])
200 |
201 | async def __parseForward(self, message, new_message):
202 | if message.forward is not None:
203 | if message.forward.chat is not None:
204 | new_message.forward = message.forward.chat.username
205 | new_message.forwardId = message.forward.chat.id
206 | if new_message.forwardId is None:
207 | new_message.forwardId = message.forward.channel_id
208 | elif message.forward.sender is not None:
209 | sender = message.forward.sender
210 | new_message.forward = concat(sender.first_name, sender.last_name)
211 | new_message.forwardId = sender.id
212 | else:
213 | new_message.forward = "Unknown"
214 | new_message.forwardId = "Unknown"
215 |
216 | if message.forward.original_fwd is not None:
217 | new_message.forward_msg_id = message.forward.original_fwd.channel_post
218 | new_message.forward_msg_date = message.forward.date
219 |
220 | def getChannelUsers(self):
221 | """ Scrape the users of a channel, if it is not a broadcasting channel."""
222 | if self.isBroadcastingChannel:
223 | return
224 |
225 | async def main():
226 | async for user in Channel.client.iter_participants(self.username, aggressive=True):
227 | if type(user) == telethon.types.User:
228 | if user not in self.users:
229 | self.users.append(user)
230 |
231 | with Channel.client:
232 | try:
233 | Channel.client.loop.run_until_complete(main())
234 | except telethon.errors.FloodWaitError as e:
235 | print("FloodWaitError: Sleep for " + str(e.seconds))
236 | time.sleep(e.seconds)
237 |
--------------------------------------------------------------------------------
/channelscraper/client.py:
--------------------------------------------------------------------------------
1 | from telethon import TelegramClient, sync  # 'sync' makes client methods callable synchronously
2 | from config import Config
3 |
4 |
5 | class Client:
6 | config = Config.getConfig()
7 | client = None
8 |
9 | @staticmethod
10 | def getClient():
11 | """ Static access method. """
12 | if Client.client is None:
13 | Client()
14 |
15 | return Client.client
16 |
17 | def __init__(self):
18 | """ Virtually private constructor. """
19 | if Client.client is not None:
20 | raise Exception("This class is a singleton!")
21 | else:
22 | # API CONNECTION #
23 | api_id = Client.config.get("api_id")
24 | api_hash = Client.config.get("api_hash")
25 | phone = Client.config.get('phone')
26 | Client.client = TelegramClient(phone, api_id, api_hash)
27 | Client.client.connect()
28 |
29 | # LOGIN
30 | if not Client.client.is_user_authorized():
31 | Client.client.send_code_request(phone)
32 | Client.client.sign_in(phone, input('Enter the code: '))
33 |
--------------------------------------------------------------------------------
/channelscraper/config.py:
--------------------------------------------------------------------------------
1 | import yaml
2 |
3 |
4 | class Config:
5 | config = None
6 |
7 | @staticmethod
8 | def getConfig():
9 | """ Static access method. """
10 | if Config.config is None:
11 | Config()
12 |
13 | return Config.config
14 |
15 | def __init__(self):
16 | Config.config = yaml.safe_load(open("config.yaml"))
17 |
--------------------------------------------------------------------------------
/channelscraper/config.yaml:
--------------------------------------------------------------------------------
1 | #### API CREDENTIALS #####
2 | # Credentials for telegram API
3 | api_id: #Put Numeric API ID here
4 | api_hash: #Put API hash string here
5 | phone: #Put phone number string here
6 |
7 | ### INPUT CHANNEL FILES ###
8 | #input_channel_file: telegram_accounts.csv
9 | input_channel_file: channels.csv
10 |
11 | ### SCRAPING SETTINGS ####
12 | # More extensive logging if true.
13 | debug_mode: false
14 |
15 | # Specify scrape mode: FULL_SCRAPE vs. OFFSET_SCRAPE
16 | scrape_mode: FULL_SCRAPE
17 |
18 | # Only relevant on OFFSET_SCRAPE mode. Set the amount of days to go back in chat history.
19 | scrape_offset: 30
20 |
21 | # A different driver is needed on AWS: set to "live" to use a remote Selenium instance (see driver.py).
22 | driver_mode: "normal"
23 |
24 | # If true, the script will run slower in order to avoid being blocked.
25 | init: False
26 |
27 | # If true, media files will be downloaded as well. By default these are photos; edit the source code if you want other media types.
28 | media_download: False
29 |
30 | # Query comments.bot/app usernames. Attention: this makes an API ban more likely.
31 | query_users: False
32 |
--------------------------------------------------------------------------------
/channelscraper/driver.py:
--------------------------------------------------------------------------------
1 | import os
2 | import platform
3 | from config import Config
4 | from selenium.webdriver import Chrome, Remote
5 | from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
6 | from selenium import webdriver
7 | import chromedriver_autoinstaller
8 |
9 |
10 | class Driver:
11 | driver = None
12 | config = Config.getConfig()
13 |
14 | @staticmethod
15 | def getDriver():
16 | """ Static access method. """
17 | if Driver.driver is None:
18 | Driver()
19 | return Driver.driver
20 |
21 | @staticmethod
22 | def closeDriver():
23 | if Driver.driver is not None:
24 | Driver.driver.quit()
25 | def __init__(self):
26 | """ Virtually private constructor. """
27 | if Driver.driver is not None:
28 | raise Exception("This class is a singleton!")
29 | else:
30 | if Driver.config.get("driver_mode") == "live":
31 | Driver.driver = Remote("http://127.0.0.1:4444/wd/hub", DesiredCapabilities.CHROME)
32 | else:
33 | chromedriver_autoinstaller.install()
34 | Driver.driver = webdriver.Chrome()
35 |
--------------------------------------------------------------------------------
/channelscraper/message.py:
--------------------------------------------------------------------------------
1 | from utilities import getClassName
2 |
3 |
4 | class Message:
5 | def __init__(self):
6 | self.text = ""
7 | self.id = None
8 | self.timestamp = None
9 | self.replyToMessageId = ""
10 | self.views = None
11 | self.sender = None
13 | self.edit_date = None
14 | self.entities = list()
15 | self.forward = None
16 | self.forwardId = None
17 | self.forward_msg_id = None
18 | self.forward_msg_date = None
19 | self.post_author = None
20 | self.comments = []
21 | self.sender_name = None
22 | self.username = None
23 | self.urls = None
24 | self.media = None
25 | self.isComment = False
26 | self.hasComments = False
27 | self.parent = None
28 | self.bot_url = None
29 | self.isDeleted = ""
31 |
32 | @staticmethod
33 | def getMessageHeader():
34 | return ["channel", "member_count", "broadcast", "id", "timestamp", "content", "user_id", "first_and_last_name",
35 | "username",
36 | "views", "edit-date",
37 | "replyToId", "forward", "forward_id", "forward_msg_id", "forward_date", "URLs", "media", "hasComments",
38 | "isComment",
39 | "bot_url",
40 | "parent", "isDeleted"]
41 |
42 | @staticmethod
43 | def getUserHeader():
44 | return ["channel", "Id", "first_name", "last_name", "first_and_last_name", "phone", "bot",
45 | "verified",
46 | "username"]
47 |
48 | def getMessageRow(self, channel_username, channel_member_count, channel_broadcast):
49 | row = [channel_username, channel_member_count, channel_broadcast, self.id, self.timestamp, self.text,
50 | self.sender,
51 | self.sender_name,
52 | self.username, self.views,
53 | self.edit_date, self.replyToMessageId, self.forward, self.forwardId, self.forward_msg_id,
54 | self.forward_msg_date,
55 | ", ".join(self.urls),
56 | getClassName(self.media), self.hasComments, self.isComment, self.bot_url,
57 | self.parent, self.isDeleted]
58 | return row
59 |
--------------------------------------------------------------------------------
/channelscraper/scrapeChannelMetadata.py:
--------------------------------------------------------------------------------
1 | import time
2 |
3 | import telethon
4 | import os
5 | import yaml
6 | import csv
7 | from client import Client
8 | import traceback
9 | from channel import Channel
10 | from telethon.tl.functions.channels import GetFullChannelRequest
11 | from utilities import getInputPath, getOutputPath
12 | from driver import Driver
13 |
14 |
15 | def getChannelList(filename):
16 | """Initialises the CSV of channels to scrape given by ADDENDUM."""
17 | channelList = []
18 | with open(getInputPath() + '/' + filename, newline='',
19 | encoding='utf-8') as csvfile:
20 | reader = csv.reader(csvfile)
21 | csvList = list(reader)
22 | for row in csvList[1:]:
23 | channelList.append(row[2].split("/")[-1])
24 | return channelList
25 |
26 | client = Client.getClient()
27 | config = yaml.safe_load(open("config.yaml"))
28 | input_file = config["input_channel_file"]
29 | channels = getChannelList(input_file)
30 |
31 | channel_info_list = list()
32 | i = 0
33 | for channel in channels:
34 | i = i + 1
35 | print("Channel: " + channel + " Nr: " + str(i))
36 | time.sleep(3)
37 | try:
38 | channel_entity = client.get_entity(channel)
39 | channel_full_info = client(GetFullChannelRequest(channel=channel_entity))
40 |
41 | if channel_full_info.full_chat.location is not None:
42 | geo_point = channel_full_info.full_chat.location.geo_point
43 | address = channel_full_info.full_chat.location.address
44 | else:
45 | geo_point = ""
46 | address = ""
47 | channel_info_list.append(
48 | [channel_entity.username, channel_entity.id, channel_full_info.full_chat.about, channel_entity.broadcast,
49 | channel_entity.date, channel_full_info.full_chat.participants_count, geo_point
50 | , address])
51 | except Exception:
52 | traceback.print_exc()
53 | print("Channel '" + channel + "' does not exist.")
55 |
56 | with open(getInputPath() + '/channel_info.csv', mode="w",
57 | newline='',
58 | encoding='utf-8') as csvfile:
59 | writer = csv.writer(csvfile)
60 | writer.writerow(["channel", "id", "about", "broadcast", "created_at", "members", "location", "address"])
61 | for channel in channel_info_list:
62 | writer.writerow(channel)
63 |
64 | Driver.closeDriver()
65 |
--------------------------------------------------------------------------------
/channelscraper/utilities.py:
--------------------------------------------------------------------------------
1 | import re
2 | from random import random
3 | from datetime import datetime, timedelta
4 | import asyncio
5 | import os
6 | import telethon
7 | import logging
8 |
9 |
10 | def getInputPath():
11 | path = os.path.dirname(os.path.dirname(__file__)) + "/input"
12 | create_path_if_not_exists(path)
13 | return path
14 |
15 |
16 | def getOutputPath():
17 | path = os.path.dirname(os.path.dirname(__file__)) + "/output"
18 | create_path_if_not_exists(path)
19 | return path
20 |
21 |
22 | def concat(first_name, last_name):
23 | if first_name is None:
24 | first_name = ""
25 | if last_name is None:
26 | last_name = ""
27 | return first_name + " " + last_name
28 |
29 |
30 | def eliminateWhitespaces(text):
31 | return re.sub(r"\s\s+", " ", text)
32 |
33 |
34 | def calcDateOffset(offset):
35 | if offset <= 0:
36 | raise ValueError("Offset must be greater than 0")
37 | return datetime.now() - timedelta(days=offset)
38 |
39 |
40 | async def wait():
41 | """ Random sleep time to avoid getting blocked by telegram."""
42 | randomNum = random() / 10
43 | if randomNum < 0.0001:
44 | await asyncio.sleep(3)
45 | await asyncio.sleep(randomNum)
46 |
47 |
48 | def create_path_if_not_exists(channelPath):
49 | if not os.path.exists(channelPath):
50 | os.makedirs(channelPath)
51 |
52 |
53 | def extractUrls(message):
54 | """ Url extraction by RegEx and telegram url entity to get maximum results"""
55 | urls_regex = re.findall(r"(?Phttps?://[^\s]+)", message.text)
56 | # if found through regex add it, if not already found (because it has bugs)
57 | for i in range(len(urls_regex)):
58 | url_reg = urls_regex[i]
59 | match = re.search(r".jpg|.htm|.html|.mp4", url_reg)
60 | if match is not None:
61 | urls_regex[i] = url_reg[0:match.end()]
62 |
63 | entities = message.entities
64 | if entities is None:
65 | return urls_regex
66 | urls = []
67 |
68 | for entity in entities:
69 | if type(entity) == telethon.tl.types.MessageEntityUrl:
70 | try:
71 | enc_text = message.text.encode('utf-16-le')  # entity offsets/lengths are in UTF-16 code units
72 | url = enc_text[entity.offset * 2:(entity.offset + entity.length) * 2].decode('utf-16-le')
73 | if (re.match("http|www", url)):
74 | urls.append(url)
75 | except UnicodeDecodeError:
76 | logging.info("Unicode Error for text")
77 | pass
78 |
79 | for url in urls:
80 | for url_reg in list(urls_regex):  # iterate over a copy: removing while iterating skips elements
81 | if url in url_reg:
82 | urls_regex.remove(url_reg)
83 |
84 | urls.extend(urls_regex)
85 | urls = list(dict.fromkeys(urls))  # remove duplicates, preserving order
86 | return urls
87 |
88 |
89 | def getClassName(obj):
90 | if obj is not None:
91 | return obj.__name__
92 | else:
93 | return ""
94 |
--------------------------------------------------------------------------------
/input/.gitkeep:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/PeterWalchhofer/Telescrape/733220a72f5cf40001803007ae8528b8ae6849a3/input/.gitkeep
--------------------------------------------------------------------------------
/input/channel_info.csv:
--------------------------------------------------------------------------------
1 | channel,id,about,broadcast,created_at,members,location,address
2 | channelolero,1246406916,,True,2019-12-16 09:58:02+00:00,2,,
3 |
--------------------------------------------------------------------------------
/input/channels.csv:
--------------------------------------------------------------------------------
1 | "Kategorie","Name","Link","@","Broadcast","asdasd"
2 | "Gruppe Typ XY","Example Channel","https://t.me/channelolero","example_channel","TRUE",
3 |
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | selenium
2 | telethon
3 | beautifulsoup4
4 | pyyaml
5 | chromedriver-autoinstaller
--------------------------------------------------------------------------------
/scrape.sh:
--------------------------------------------------------------------------------
1 | #!/bin/sh
2 |
3 | . venv/bin/activate
4 | cd ./channelscraper
5 |
6 | if [ $# -eq 0 ]
7 | then
8 | python3 app.py
9 | else
10 | if [ "$1" = "meta" ]; then
11 | python3 scrapeChannelMetadata.py
12 | fi
13 | fi
14 |
15 | deactivate
--------------------------------------------------------------------------------