├── setup.cfg
├── .gitignore
├── requirements.txt
├── renovate.json
├── settings.ini
├── README.md
└── scraper.py

/setup.cfg:
--------------------------------------------------------------------------------
[flake8]
max-line-length = 99

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
*.log
*.db
log/
output/

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
colorlog==4.7.2
requests==2.25.1
lxml==4.6.2

--------------------------------------------------------------------------------
/renovate.json:
--------------------------------------------------------------------------------
{
  "extends": [
    "config:base"
  ]
}

--------------------------------------------------------------------------------
/settings.ini:
--------------------------------------------------------------------------------
[GENERAL]
PasteLimit = 0
PBLink = http://pastebin.com/
DownloadWorkers = 2
NewPasteCheckInterval = 5
ConnectionRetryInterval = 30
IPBlockedWaitTime = 3600

[LOGGING]
RotationLog = log/pastebin-scraper.log
MaxRotationSize = 2097152
RotationBackupCount = 3

[STDOUT]
Enable = no
ContentDisplayLimit = 100
ShowName = yes
ShowLang = yes
ShowLink = yes
ShowData = yes
DataEncoding = utf-8

[MYSQL]
Enable = no
TableName = pastes
Host = 127.0.0.1
Port = 6603
Username = root
Password = pastebin-scraper

[SQLITE]
Enable = no
Filename = pastes.db
TableName = pastes

[FILE]
Enable = no
ContentDisplayLimit = 0
ShowName = yes
ShowLang = yes
ShowLink = yes
ShowData = yes
DataEncoding = utf-8

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
## pastebin-scraper

This is a multithreaded scraping script for [Pastebin](http://pastebin.com/). It scrapes the main site for new pastes, downloads their raw content and processes it according to a user-defined output format.

### WHY?
Fun.

### Installation
The usual dance.
```
pip install -r requirements.txt
```

Define all required specs in `settings.ini`. Should you decide to go with a database output, make sure the respective connector is installed. At the moment MySQL via `pymysql` and SQLite via the connector built into Python 3 are supported; both are driven through SQLAlchemy, so the `sqlalchemy` package has to be installed as well.
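If you go with one of the database outputs, it can save a failed first run to check that the optional dependencies import cleanly; `scraper.py` logs a similar hint when its SQLAlchemy import fails. A minimal check could look like the sketch below (it assumes you installed `sqlalchemy`, and `pymysql` for MySQL, yourself; neither is listed in `requirements.txt`):

```python
# Sanity check for the optional database dependencies used by scraper.py.
# SQLAlchemy is needed for any database output, PyMySQL only for [MYSQL].
import sqlalchemy
print('SQLAlchemy', sqlalchemy.__version__)

import pymysql  # skip this if you only use the [SQLITE] output
print('PyMySQL import OK')
```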
Also note that the file output creates a subdirectory `output` and dumps every paste as a separate file into it.

### Settings
`ini` is a highly underrated file format. Here are some definitions of what the settings parameters actually do.

#### GENERAL
- `PasteLimit` Stop after having scraped n pastes. Set to 0 for indefinite scraping
- `PBLink` URL to Pastebin or another equivalent site
- `DownloadWorkers` Number of workers that download the raw paste content and further process it
- `NewPasteCheckInterval` Time to wait before checking the main site for new pastes again
- `ConnectionRetryInterval` Time to wait before retrying a download after a connection error
- `IPBlockedWaitTime` Time to wait until checking the main site again after the scraper's IP has been blocked

#### LOGGING
- `RotationLog` Location of the rotating log file that contains debug output
- `MaxRotationSize` Size in bytes before another log file is created
- `RotationBackupCount` Maximum number of log files to keep

#### STDOUT / FILE
- `Enable` Enable formatted stdout or file output of paste data
- `ContentDisplayLimit` Maximum number of characters to show before content is cut off (0 to display all)
- `ShowName` Display the paste name
- `ShowLang` Display the paste language
- `ShowLink` Display the complete paste link
- `ShowData` Display the raw paste content
- `DataEncoding` Encoding of the raw paste data

#### MYSQL
- `Enable` Enable MySQL output
- `TableName` Main table name to insert data into
- `Host` MySQL server host
- `Port` MySQL server port
- `Username` MySQL server user
- `Password` User password

#### SQLITE
- `Enable` Enable SQLite output
- `Filename` Filename the database should be saved as (usually ends with .db)
- `TableName` Main table name to insert data into
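For MySQL it helps to know what the scraper actually connects to: `PasteDBConnector._get_db_engine` in `scraper.py` builds an SQLAlchemy URL of the form `mysql+pymysql://user:password@host:port/<TableName>?charset=utf8`, so the `TableName` value also serves as the database (schema) name and a database of that name has to exist before the first run (the table itself is created automatically). A rough connectivity check with the default `[MYSQL]` values from `settings.ini` might look like the sketch below; swap in your own host and credentials.

```python
# Sketch of the MySQL connection scraper.py sets up, using the default
# [MYSQL] values from settings.ini (host 127.0.0.1, port 6603, user root,
# password pastebin-scraper, TableName/database "pastes").
from sqlalchemy import create_engine

url = 'mysql+pymysql://root:pastebin-scraper@127.0.0.1:6603/pastes?charset=utf8'
engine = create_engine(url)

with engine.connect():
    print('MySQL connection OK')
```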
---

If you use this thing for some cool data analysis or even research, let me know if I can help!

Inspiration for this scraper was taken from [here](http://www.michielovertoom.com/python/pastebin-abused/).

--------------------------------------------------------------------------------
/scraper.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3

import logging
import logging.handlers
import os
import sys
import threading
import time
from datetime import datetime
from os import path

import requests
from lxml import html

import configparser
import queue
from colorlog import ColoredFormatter


class PasteDBConnector(object):
    supported = ('MYSQL', 'SQLITE')

    def __init__(self, db, **kwargs):
        try:
            self.logger = logging.getLogger('pastebin-scraper')
            from sqlalchemy.ext.declarative import declarative_base
        except ImportError:
            self.logger.error('SQLAlchemy import failed. Make sure the SQLAlchemy Python library '
                              'is installed! To check your existing installation run: '
                              'python3 -c "import sqlalchemy;print(sqlalchemy.__version__)"')
            raise
        if db not in self.supported:
            msg = 'The specified database %s is not supported. Please choose an engine from %s' % \
                  (db, ', '.join(self.supported))
            self.logger.error(msg)
            raise ValueError(msg)
        self.db = db
        self.Base = declarative_base()
        self.engine = self._get_db_engine(**kwargs)
        self.session = self._get_db_session(self.engine)
        self.paste_model = self._get_paste_model(self.Base, **kwargs)
        self.Base.metadata.create_all(self.engine)

    def _get_db_engine(self, **kwargs):
        from sqlalchemy import create_engine
        if self.db == 'MYSQL':
            # use the PyMySQL driver
            location = 'mysql+pymysql://'
            location += '{username}:{password}@{host}:{port}'.format(
                host=kwargs.pop('host'),
                port=kwargs.pop('port'),
                username=kwargs.pop('username'),
                password=kwargs.pop('password'),
            )
            location += '/{table_name}?charset={charset}'.format(
                table_name=kwargs.pop('table_name'),
                charset='utf8'
            )
            self.logger.info('Using MySQL at ' + location)
            return create_engine(location)
        elif self.db == 'SQLITE':
            location = 'sqlite+pysqlite:///' + kwargs.pop('filename')
            self.logger.info('Using SQLite at ' + location)
            return create_engine(location)

    def _get_db_session(self, engine):
        from sqlalchemy.orm import sessionmaker
        return sessionmaker(bind=engine)()

    def _get_paste_model(self, base, **kwargs):
        db = self.db

        from sqlalchemy import Column, Integer, String, DateTime
        if db == 'MYSQL':
            from sqlalchemy.dialects.mysql import LONGTEXT
        elif db == 'SQLITE':
            from sqlalchemy import UnicodeText

        class Paste(base):
            __tablename__ = kwargs.pop('table_name')

            id = Column(Integer, primary_key=True)
            name = Column('name', String(60))
            lang = Column('language', String(30))
            link = Column('link', String(28))  # Assuming format http://pastebin.com/XXXXXXXX
            date = Column('date', DateTime())
            if db == 'MYSQL':
                data = Column('data', LONGTEXT(charset='utf8'))
            else:
                data = Column('data', UnicodeText())

            def __repr__(self):
                # (return statement truncated in this dump)

# [Lines 92-212 of the original scraper.py are missing from this dump: the rest of the
#  Paste model's __repr__, the connector's add() method, and the PastebinScraper class
#  with its configuration/logging/queue setup and the head of the _get_paste_data()
#  archive loop. The file resumes below inside that loop's ">= paste_limit" check.]

                    # Break for limits % 8 != 0
                    break
                name_link = paste.cssselect('a')[0]
                name = name_link.text_content().strip()
                href = name_link.get('href')[1:]  # Get rid of leading /
                data = paste.cssselect('span')[0].text_content().split('|')
                language = None
                if len(data) == 2:
                    # Got language
                    language = data[0].strip()
                paste_data = (name, language, href)
                self.logger.debug('Paste scraped: ' + str(paste_data))
                if paste_data[2] not in self.pastes_seen:
                    # New paste detected
                    self.logger.debug('Scheduling new paste:' + str(paste_data))
                    self.pastes_seen.add(paste_data[2])
                    self.pastes.put(paste_data)
                delay = self.conf_general.getint('NewPasteCheckInterval')
                time.sleep(delay)
                paste_counter += 1
                self.logger.debug('Paste counter now at ' + str(paste_counter))
                if paste_counter % 100 == 0:
                    self.logger.info('Scheduled %d pastes' % paste_counter)

    def _download_paste(self):
        while True:
            paste = self.pastes.get()  # (name, lang, href)
            self.logger.debug('Fetching raw paste ' + paste[2])
            link = self.conf_general['PBLink'] + 'raw/' + paste[2]
            data = self._handle_data_download(link)

            self.logger.debug('Fetched {} with {} - {}'.format(
                link,
                data.status_code,
                data.reason
            ))
            if self.conf_stdout.getboolean('Enable'):
                self._write_to_stdout(paste, data)
            if self.conf_mysql.getboolean('Enable'):
                self._write_to_mysql(paste, data)
            if self.conf_file.getboolean('Enable'):
                self._write_to_file(paste, data)
            if self.conf_sqlite.getboolean('Enable'):
                self._write_to_sqlite(paste, data)

    def _handle_data_download(self, link):
        while True:
            try:
                data = requests.get(link)
            except requests.RequestException:
                retry = self.conf_general.getint('ConnectionRetryInterval')
                self.logger.debug(
                    'Error connecting to %s: Retry in %ss, TRACE: %s' %
                    (link, retry, sys.exc_info())
                )
                self.logger.info('Connection problems - trying again in %ss' % retry)
                time.sleep(retry)
            else:
                if data.status_code == 403 and b'Pastebin.com has blocked your IP' in data.content:
                    self.logger.info('Our IP has been blocked. Trying again in an hour.')
                    time.sleep(self.conf_general.getint('IPBlockedWaitTime'))
                    # retry the request once the block has (hopefully) expired
                    continue
                return data

    def _assemble_output(self, conf, paste, data):
        output = ''
        if conf.getboolean('ShowName'):
            output += 'Name: %s\n' % paste[0]
        if conf.getboolean('ShowLang'):
            output += 'Lang: %s\n' % paste[1]
        if conf.getboolean('ShowLink'):
            output += 'Link: %s\n' % (self.conf_general['PBLink'] + paste[2])
        if conf.getboolean('ShowData'):
            encoding = conf['DataEncoding']
            limit = conf.getint('ContentDisplayLimit')
            if limit > 0:
                output += '\n%s\n\n' % data.content.decode(encoding)[:limit]
            else:
                output += '\n%s\n\n' % data.content.decode(encoding)
        return output

    def _write_to_stdout(self, paste, data):
        output = self._assemble_output(self.conf_stdout, paste, data)
        sys.stdout.write(output)

    def _write_to_mysql(self, paste, data):
        self.mysql_conn.add(paste, data)

    def _write_to_sqlite(self, paste, data):
        self.sqlite_conn.add(paste, data)

    def _write_to_file(self, paste, data):
        # Date and paste ID
        fname = '%s_%s.txt' % (datetime.now().strftime('%Y-%m-%d.%H-%M-%S'), paste[2])
        with open(path.join('output', fname), 'w') as f:
            output = self._assemble_output(self.conf_file, paste, data)
            f.write(output)

    def run(self):
        for i in range(self.conf_general.getint('DownloadWorkers')):
            t = threading.Thread(target=self._download_paste)
            t.setDaemon(True)
            t.start()
        s = threading.Thread(target=self._get_paste_data)
        s.start()
        s.join()


if __name__ == '__main__':
    ps = PastebinScraper()
    ps.run()

--------------------------------------------------------------------------------
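As a closing example, here is one way to peek at what a run with the `[SQLITE]` output enabled has collected, assuming the defaults `Filename = pastes.db` and `TableName = pastes` (the column names come from the `Paste` model in `scraper.py`):

```python
# Inspect pastes collected by the SQLITE output. Assumes the default
# settings.ini values: Filename = pastes.db, TableName = pastes.
import sqlite3

conn = sqlite3.connect('pastes.db')
rows = conn.execute(
    'SELECT name, language, link, date FROM pastes ORDER BY date DESC LIMIT 5'
)
for name, language, link, date in rows:
    print(date, language, name, link)
conn.close()
```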