├── log
    └── README.md
├── res
    ├── sniff-paste-pic.jpg
    ├── sql-statements.txt
    ├── nmapfilter.conf
    └── regexesToAdd.json
├── requirements.txt
├── .gitignore
├── settings.ini
├── README.md
└── sniff-paste.py


/log/README.md:
--------------------------------------------------------------------------------
1 | Scraper logs located here
2 | 

--------------------------------------------------------------------------------
/res/sniff-paste-pic.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/needmorecowbell/sniff-paste/HEAD/res/sniff-paste-pic.jpg

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | requests==2.10.0
2 | colorlog==2.7.0
3 | lxml==3.4.4
4 | SQLAlchemy==1.2.9
5 | nmap==0.0.1
6 | cssselect==1.0.3
7 | ipaddress==1.0.22
8 | 

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | *.log
2 | *.db
3 | log/*
4 | !log/README.md
5 | out/*.txt
6 | *.html
7 | *.xml
8 | 
9 | configdebug.py
10 | settings.debug.ini
11 | 
12 | __pycache__/
13 | 
14 | 

--------------------------------------------------------------------------------
/settings.ini:
--------------------------------------------------------------------------------
1 | [GENERAL]
2 | PasteLimit = 50
3 | PBLink = http://pastebin.com/
4 | DownloadWorkers = 2
5 | NewPasteCheckInterval = 5
6 | ConnectionRetryInterval = 30
7 | IPBlockedWaitTime = 3600
8 | 
9 | [LOGGING]
10 | RotationLog = log/pastebin-scraper.log
11 | MaxRotationSize = 2097152
12 | RotationBackupCount = 3
13 | 
14 | [STDOUT]
15 | Enable = yes
16 | ContentDisplayLimit = 100
17 | ShowName = yes
18 | ShowLang = yes
19 | ShowLink = yes
20 | ShowData = yes
21 | DataEncoding = utf-8
22 | 
23 | [MYSQL]
24 | Enable = yes
25 | TableName = sniff_paste
26 | Host = 127.0.0.1
27 | Port = 3306
28 | Username = root
29 | Password = password
30 | 
31 | 

--------------------------------------------------------------------------------
/res/sql-statements.txt:
--------------------------------------------------------------------------------
1 | sql-statements= {
2 |     "Top 10 IP Occurrences in Pastes":"SELECT ip, COUNT(*) AS magnitude from ips Group By ip Order by magnitude DESC Limit 10",
3 |     "Top 10 Email Occurrences in Pastes":"SELECT email, COUNT(*) AS magnitude from emails Group By email Order by magnitude DESC Limit 10",
4 |     "Top 10 Phone Number Occurrences in Pastes":"SELECT phone, COUNT(*) AS magnitude from phones Group By phone Order by magnitude DESC Limit 10",
5 |     "Top 10 URL Occurrences in Pastes":"SELECT url, COUNT(*) AS magnitude from links Group By url Order by magnitude DESC Limit 10",
6 |     "Top 10 Pastes with Highest IP Counts":"SELECT link, COUNT(link) AS frequency FROM ips GROUP BY link ORDER by frequency DESC Limit 10",
7 |     "Top 10 Pastes with Highest Email Counts":"SELECT link, COUNT(link) AS frequency FROM emails GROUP BY link ORDER by frequency DESC Limit 10",
8 |     "Top 10 Pastes with Highest URL Counts":"SELECT link, COUNT(link) AS frequency FROM links GROUP BY link ORDER by frequency DESC Limit 10"
9 | }
10 | 

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | ## Sniff-Paste: OSINT Pastebin Harvester
2 | 
3 | 

4 | 5 |

6 | 
7 | A multithreaded Pastebin scraper that stores pastes in a MySQL database, then scans them for noteworthy information.
8 | 
9 | Use sniff-paste.py to run the entire process of collection, logging, and harvesting automatically. Set the paste limit to 0 to scrape indefinitely; in that case, press Ctrl+C to stop scraping. Any useful information will be in the database, along with a link back to the original paste it was found in.
10 | 
11 | 
12 | ## Installation
13 | 
14 | `sudo apt install libxslt-dev python3-lxml python3-nmap xsltproc mysql-server`
15 | 
16 | `pip3 install -r requirements.txt`
17 | 
18 | - Create a database named `sniff_paste` on the MySQL server
19 | - Fill in settings.ini
20 | 
21 | `python3 sniff-paste.py`
22 | 
23 | This will scrape Pastebin for the latest pastes, up to the configured `PasteLimit`, then analyze them for IP addresses, emails, and phone numbers. It filters out duplicates and runs scans on some of the harvested data.
24 | 
25 | ## Database Structure
26 | - `sniff_paste` -- root db
27 |   - `pastes` -- stores paste with full text, date, link, title, and language
28 |   - `emails` -- stores emails with a link to the source paste
29 |   - `links` -- stores urls with a link to the source paste
30 |   - `ip` -- stores ip with connectivity and a link to the source paste
31 |   - `phones` -- stores phone numbers with a link to the source paste
32 |   - `secrets` -- stores secret type with a link to the source paste
33 |   - `ports` -- stores port scan info (port, status, service, version, ip)
34 |   - `cryptos` -- stores cryptocurrency findings with a link to the source paste
35 | 
36 | **Crypto findings are not certain to be valid; consider them low-probability findings.**
37 | 
38 | 
39 | ## Notes
40 | 
41 | - Please contribute! If there's an error, let me know -- even better if you can fix it :)
42 | - Regex contributions would be very helpful, and should be pretty easy to add!
43 | - This tool is in the process of a bigger update, where the scraper can send all new pastes to my new project needmorecowbell/Funnel. I'm trying to consolidate all of my osint tools into one streamlined solution. 44 | -------------------------------------------------------------------------------- /res/nmapfilter.conf: -------------------------------------------------------------------------------- 1 | 0.0.0.0/8 2 | 10.0.0.0/8 3 | 100.64.0.0/10 4 | 127.0.0.0/8 5 | 169.254.0.0/16 6 | 172.16.0.0/12 7 | 192.0.0.0/24 8 | 192.0.0.0/29 9 | 192.0.0.170/32 10 | 192.0.0.171/32 11 | 192.0.2.0/24 12 | 192.88.99.0/24 13 | 192.168.0.0/16 14 | 198.18.0.0/15 15 | 198.51.100.0/24 16 | 203.0.113.0/24 17 | 240.0.0.0/4 18 | 255.255.255.255/32 19 | 153.11.0.0/16 20 | 4.53.201.0/24 21 | 5.152.179.0/24 22 | 8.12.162.0/24 23 | 8.12.163.0/24 24 | 8.12.164.0/24 25 | 8.14.84.0/22 26 | 8.14.145.0/24 27 | 8.14.146.0/24 28 | 8.14.147.0/24 29 | 8.17.250.0/24 30 | 8.17.251.0/24 31 | 8.17.252.0/24 32 | 23.27.0.0/16 33 | 23.231.128.0/17 34 | 37.72.172.0/23 35 | 38.72.200.0/22 36 | 50.93.192.0/24 37 | 50.93.193.0/24 38 | 50.93.194.0/24 39 | 50.93.195.0/24 40 | 50.93.196.0/24 41 | 50.93.197.0/24 42 | 50.115.128.0/20 43 | 50.117.0.0/17 44 | 50.118.128.0/17 45 | 63.141.222.0/24 46 | 64.62.253.0/24 47 | 64.92.96.0/19 48 | 64.145.79.0/24 49 | 64.145.82.0/23 50 | 64.158.146.0/23 51 | 65.49.24.0/24 52 | 65.49.93.0/24 53 | 65.162.192.0/22 54 | 66.79.160.0/19 55 | 66.160.191.0/24 56 | 68.68.96.0/20 57 | 69.46.64.0/19 58 | 69.176.80.0/20 59 | 72.13.80.0/20 60 | 72.52.76.0/24 61 | 74.82.43.0/24 62 | 74.82.160.0/19 63 | 74.114.88.0/22 64 | 74.115.0.0/24 65 | 74.115.2.0/24 66 | 74.115.4.0/24 67 | 74.122.100.0/22 68 | 75.127.0.0/24 69 | 103.251.91.0/24 70 | 108.171.32.0/24 71 | 108.171.42.0/24 72 | 108.171.52.0/24 73 | 108.171.62.0/24 74 | 118.193.78.0/23 75 | 130.93.16.0/23 76 | 136.0.0.0/16 77 | 142.111.0.0/16 78 | 142.252.0.0/16 79 | 146.82.55.93 80 | 149.54.136.0/21 81 | 149.54.152.0/21 82 | 
166.88.0.0/16 83 | 172.252.0.0/16 84 | 173.245.64.0/19 85 | 173.245.194.0/23 86 | 173.245.220.0/22 87 | 173.252.192.0/18 88 | 178.18.16.0/22 89 | 178.18.26.0/24 90 | 178.18.27.0/24 91 | 178.18.28.0/24 92 | 178.18.29.0/24 93 | 183.182.22.0/24 94 | 192.92.114.0/24 95 | 192.155.160.0/19 96 | 192.177.0.0/16 97 | 192.186.0.0/18 98 | 192.249.64.0/20 99 | 192.250.240.0/20 100 | 194.110.214.0/24 101 | 198.12.120.0/24 102 | 198.12.121.0/24 103 | 198.12.122.0/24 104 | 198.144.240.0/20 105 | 199.33.120.0/24 106 | 199.33.124.0/22 107 | 199.48.147.0/24 108 | 199.68.196.0/22 109 | 199.127.240.0/21 110 | 199.187.168.0/22 111 | 199.188.238.0/23 112 | 199.255.208.0/24 113 | 203.12.6.0/24 114 | 204.13.64.0/21 115 | 204.16.192.0/21 116 | 204.19.238.0/24 117 | 204.74.208.0/20 118 | 205.159.189.0/24 119 | 205.164.0.0/18 120 | 205.209.128.0/18 121 | 206.108.52.0/23 122 | 206.165.4.0/24 123 | 208.77.40.0/21 124 | 208.80.4.0/22 125 | 208.123.223.0/24 126 | 209.51.185.0/24 127 | 209.54.48.0/20 128 | 209.107.192.0/23 129 | 209.107.210.0/24 130 | 209.107.212.0/24 131 | 211.156.110.0/23 132 | 216.83.33.0/24 133 | 216.83.34.0/24 134 | 216.83.35.0/24 135 | 216.83.36.0/24 136 | 216.83.37.0/24 137 | 216.83.38.0/24 138 | 216.83.39.0/24 139 | 216.83.40.0/24 140 | 216.83.42.0/24 141 | 216.83.43.0/24 142 | 216.83.44.0/24 143 | 216.83.45.0/24 144 | 216.83.46.0/24 145 | 216.83.47.0/24 146 | 216.83.48.0/24 147 | 216.83.49.0/24 148 | 216.83.51.0/24 149 | 216.83.52.0/24 150 | 216.83.53.0/24 151 | 216.83.54.0/24 152 | 216.83.55.0/24 153 | 216.83.56.0/24 154 | 216.83.57.0/24 155 | 216.83.58.0/24 156 | 216.83.59.0/24 157 | 216.83.60.0/24 158 | 216.83.61.0/24 159 | 216.83.62.0/24 160 | 216.83.63.0/24 161 | 216.151.183.0/24 162 | 216.151.190.0/23 163 | 216.172.128.0/19 164 | 216.185.36.0/24 165 | 216.218.233.0/24 166 | 216.224.112.0/20 167 | 194.77.40.242 168 | 194.77.40.246 169 | 165.160.0.0/16 170 | 129.123.0.0/16 171 | 144.39.0.0/16 172 | 204.113.91.0/24 173 | 
-------------------------------------------------------------------------------- /res/regexesToAdd.json: -------------------------------------------------------------------------------- 1 | "ip" : { 2 | "ipv6": "i/^(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))$/", 3 | 4 | "ipv4": "/^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/" 5 | }, 6 | 7 | "links": { 8 | "youtube-channel":"/https?:\/\/(www\.)?youtube.com\/channel\/UC([-_a-z0-9]{22})/i", 9 | "youtube-video":"/https?:\/\/(?:youtu\.be\/|(?:[a-z]{2,3}\.)?youtube\.com\/watch(?:\?|#\!)v=)([\w-]{11}).*/gi", 10 | "onion-link":"(?:^[a-z2-7]{16}\.onion$)|(?:^[a-z2-7]{56}\.onion$)", 11 | "https":"https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+", 12 | "http":"http?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+" 13 | }, 14 | 15 | "general":{ 16 | "street":"\d{1,4} [\w\s]{1,20}(?:street|st|avenue|ave|road|rd|highway|hwy|square|sq|trail|trl|drive|dr|court|ct|park|parkway|pkwy|circle|cir|boulevard|blvd)\W?(?=\s|$)", 17 | "credit-cards":"((?:(?:\\d{4}[- ]?){3}\\d{4}|\\d{15,16}))(?![\\d])", 18 | "emails":"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+" 19 | 20 | }, 21 | 22 | "secrets" : { 23 | "Slack Token": "(xox[p|b|o|a]-[0-9]{12}-[0-9]{12}-[0-9]{12}-[a-z0-9]{32})", 24 | "RSA private key": "-----BEGIN RSA PRIVATE KEY-----", 25 | "SSH (OPENSSH) private key": "-----BEGIN OPENSSH PRIVATE KEY-----", 26 | "SSH 
(DSA) private key": "-----BEGIN DSA PRIVATE KEY-----",
27 |     "SSH (EC) private key": "-----BEGIN EC PRIVATE KEY-----",
28 |     "PGP private key block": "-----BEGIN PGP PRIVATE KEY BLOCK-----",
29 |     "Facebook Oauth": "[f|F][a|A][c|C][e|E][b|B][o|O][o|O][k|K].*['|\"][0-9a-f]{32}['|\"]",
30 |     "Twitter Oauth": "[t|T][w|W][i|I][t|T][t|T][e|E][r|R].*['|\"][0-9a-zA-Z]{35,44}['|\"]",
31 |     "GitHub": "[g|G][i|I][t|T][h|H][u|U][b|B].*['|\"][0-9a-zA-Z]{35,40}['|\"]",
32 |     "Google Oauth": "(\"client_secret\":\"[a-zA-Z0-9-_]{24}\")",
33 |     "AWS API Key": "AKIA[0-9A-Z]{16}",
34 |     "Heroku API Key": "[h|H][e|E][r|R][o|O][k|K][u|U].*[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{12}",
35 |     "Generic Secret": "[s|S][e|E][c|C][r|R][e|E][t|T].*['|\"][0-9a-zA-Z]{32,45}['|\"]"
36 | },
37 | 
38 | "cryptos":{
39 |     "bitcoin-address" : "[13][a-km-zA-HJ-NP-Z1-9]{25,34}" ,
40 |     "bitcoin-uri" : "bitcoin:([13][a-km-zA-HJ-NP-Z1-9]{25,34})" ,
41 |     "bitcoin-xpub-key" : "(xpub[a-km-zA-HJ-NP-Z1-9]{100,108})(\\?c=\\d*&h=bip\\d{2,3})?" ,
42 |     "monero-address": "(?:^4[0-9AB][1-9A-HJ-NP-Za-km-z]{93}$)",
43 |     "ethereum-address": "(?:^0x[a-fA-F0-9]{40}$)",
44 |     "litecoin-address":"(?:^[LM3][a-km-zA-HJ-NP-Z1-9]{26,33}$)",
45 |     "bitcoin-cash-address":"(?:^[13][a-km-zA-HJ-NP-Z1-9]{33}$)",
46 |     "dash-address":"(?:^X[1-9A-HJ-NP-Za-km-z]{33}$)",
47 |     "ripple-address":"(?:^r[0-9a-zA-Z]{33}$)",
48 |     "neo-address":"(?:^A[0-9a-zA-Z]{33}$)",
49 |     "dogecoin-address":"(?:^D{1}[5-9A-HJ-NP-U]{1}[1-9A-HJ-NP-Za-km-z]{32}$)"
50 | }
51 | 

--------------------------------------------------------------------------------
/sniff-paste.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | import json
3 | from ipaddress import IPv4Address as IPv4
4 | from ipaddress import IPv4Network as IPv4Net
5 | import logging
6 | import socket
7 | import logging.handlers
8 | import os
9 | import sys
10 | import threading
11 | import time
12 | from datetime import datetime
13 | from os import path
14 | import re
15 | import requests
16 | from lxml import html
17 | 
18 | import nmap
19 | 
20 | import configparser
21 | import queue
22 | from colorlog import ColoredFormatter
23 | 
24 | from sqlalchemy import Column, Integer, String, DateTime, Boolean
25 | from sqlalchemy.dialects.mysql import LONGTEXT
26 | 
27 | 
28 | debug = False
29 | 
30 | IPStack = []
31 | 
32 | 
33 | secretRegexes = {
34 |     "Slack Token": "(xox[p|b|o|a]-[0-9]{12}-[0-9]{12}-[0-9]{12}-[a-z0-9]{32})",
35 |     "RSA private key": "-----BEGIN RSA PRIVATE KEY-----",
36 |     "SSH (OPENSSH) private key": "-----BEGIN OPENSSH PRIVATE KEY-----",
37 |     "SSH (DSA) private key": "-----BEGIN DSA PRIVATE KEY-----",
38 |     "SSH (EC) private key": "-----BEGIN EC PRIVATE KEY-----",
39 |     "PGP private key block": "-----BEGIN PGP PRIVATE KEY BLOCK-----",
40 |     "Facebook Oauth": "[f|F][a|A][c|C][e|E][b|B][o|O][o|O][k|K].*['|\"][0-9a-f]{32}['|\"]",
41 |     "Twitter Oauth": "[t|T][w|W][i|I][t|T][t|T][e|E][r|R].*['|\"][0-9a-zA-Z]{35,44}['|\"]",
42 |     "GitHub": "[g|G][i|I][t|T][h|H][u|U][b|B].*['|\"][0-9a-zA-Z]{35,40}['|\"]",
43 |     "Google Oauth": "(\"client_secret\":\"[a-zA-Z0-9-_]{24}\")",
44 |     "AWS API Key": "AKIA[0-9A-Z]{16}",
45 |     "Heroku API Key": "[h|H][e|E][r|R][o|O][k|K][u|U].*[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{12}",
46 |     "Generic Secret": "[s|S][e|E][c|C][r|R][e|E][t|T].*['|\"][0-9a-zA-Z]{32,45}['|\"]"
47 | }
48 | 
49 | cryptoRegexes={
50 |     "bitcoin-address" : "[13][a-km-zA-HJ-NP-Z1-9]{25,34}" ,
51 |     "bitcoin-uri" : "bitcoin:([13][a-km-zA-HJ-NP-Z1-9]{25,34})" ,
52 |     "bitcoin-xpub-key" : "(xpub[a-km-zA-HJ-NP-Z1-9]{100,108})(\\?c=\\d*&h=bip\\d{2,3})?" ,
53 |     "monero-address": "(?:^4[0-9AB][1-9A-HJ-NP-Za-km-z]{93}$)",
54 |     "ethereum-address": "(?:^0x[a-fA-F0-9]{40}$)",
55 |     "litecoin-address":"(?:^[LM3][a-km-zA-HJ-NP-Z1-9]{26,33}$)",
56 |     "bitcoin-cash-address":"(?:^[13][a-km-zA-HJ-NP-Z1-9]{33}$)",
57 |     "dash-address":"(?:^X[1-9A-HJ-NP-Za-km-z]{33}$)",
58 |     "ripple-address":"(?:^r[0-9a-zA-Z]{33}$)",
59 |     "neo-address":"(?:^A[0-9a-zA-Z]{33}$)",
60 |     "dogecoin-address":"(?:^D{1}[5-9A-HJ-NP-U]{1}[1-9A-HJ-NP-Za-km-z]{32}$)"
61 | }
62 | 
63 | class PasteDBConnector(object):
64 |     supported = ('MYSQL',)  # trailing comma makes this a tuple, not a plain string
65 | 
66 |     def __init__(self, db, **kwargs):
67 |         try:
68 |             self.logger = logging.getLogger('pastebin-scraper')
69 |             from sqlalchemy.ext.declarative import declarative_base
70 |         except ImportError:
71 |             self.logger.error('SQLAlchemy import failed. Make sure the SQLAlchemy Python library '
72 |                               'is installed! 
To check your existing installation run: '
73 |                               'python3 -c "import sqlalchemy;print(sqlalchemy.__version__)"')
74 |         self.db = db
75 |         self.Base = declarative_base()
76 |         self.engine = self._get_db_engine(**kwargs)
77 |         self.session = self._get_db_session(self.engine)
78 | 
79 |         #Create tables for credentials
80 |         self.paste_model = self._get_paste_model(self.Base, **kwargs)
81 |         self.email_model = self._get_email_model(self.Base)
82 |         self.link_model = self._get_link_model(self.Base)
83 |         self.ip_model = self._get_ip_model(self.Base)
84 |         self.phone_model = self._get_phone_model(self.Base)
85 |         self.secret_model = self._get_secret_model(self.Base)
86 |         self.crypto_model = self._get_crypto_model(self.Base)
87 |         self.port_model = self._get_port_model(self.Base)
88 |         self.Base.metadata.create_all(self.engine)
89 | 
90 |         #Nmap Worker
91 | 
92 |         nmapper = threading.Thread(target=self._scan_network)
93 |         nmapper.daemon = True  # setDaemon() is deprecated
94 |         nmapper.start()
95 | 
96 |     def isFiltered(self,ip):
97 |         ip=IPv4(ip)  # parse the address once, outside the loop
98 |         with open("res/nmapfilter.conf") as ipFilters:
99 |             for networkString in ipFilters:
100 |                 network=IPv4Net(networkString.rstrip("\r\n"))
101 |                 if(ip in network):
102 | 
103 |                     self.logger.debug("["+str(ip)+"] is a filtered address, do not scan.")
104 |                     return True
105 |         return False
106 | 
107 | 
108 |     def _get_db_engine(self, **kwargs):
109 |         from sqlalchemy import create_engine
110 | 
111 |         if self.db == 'MYSQL':
112 |             # use the PyMySQL connector
113 |             location = 'mysql+pymysql://'
114 |             location += '{username}:{password}@{host}:{port}'.format(
115 |                 host=kwargs.pop('host'),
116 |                 port=kwargs.pop('port'),
117 |                 username=kwargs.pop('username'),
118 |                 password=kwargs.pop('password'),
119 |             )
120 |             location += '/{table_name}?charset={charset}'.format(
121 |                 table_name=kwargs.pop('table_name'),
122 |                 charset='utf8'
123 |             )
124 | 
125 |             self.logger.info('Using MySQL')
126 |         return create_engine(location)
127 | 
128 |     def _get_db_session(self, engine):
129 |         from sqlalchemy.orm 
import sessionmaker 130 | return sessionmaker(bind=engine)() 131 | 132 | def _get_paste_model(self, base, **kwargs): 133 | class Paste(base): 134 | __tablename__ = "pastes" 135 | 136 | id = Column(Integer, primary_key=True) 137 | name = Column('name', String(60)) 138 | lang = Column('language', String(30)) 139 | link = Column('link', String(28)) # Assuming format http://pastebin.com/XXXXXXXX 140 | date = Column('date', DateTime()) 141 | 142 | data = Column('data', LONGTEXT(charset='utf8')) 143 | 144 | def __repr__(self): 145 | return "= paste_limit): 606 | # Break for limits % 8 != 0 607 | break 608 | name_link = paste.cssselect('a')[0] 609 | name = name_link.text_content().strip() 610 | href = name_link.get('href')[1:] # Get rid of leading / 611 | data = paste.cssselect('span')[0].text_content().split('|') 612 | language = None 613 | if len(data) == 2: 614 | # Got language 615 | language = data[0].strip() 616 | paste_data = (name, language, href) 617 | self.logger.debug('Paste scraped: ' + str(paste_data)) 618 | if paste_data[2] not in self.pastes_seen: 619 | # New paste detected 620 | self.logger.debug('Scheduling new paste:' + str(paste_data)) 621 | self.pastes_seen.add(paste_data[2]) 622 | self.pastes.put(paste_data) 623 | delay = self.conf_general.getint('NewPasteCheckInterval') 624 | time.sleep(delay) 625 | paste_counter += 1 626 | self.logger.debug('Paste counter now at ' + str(paste_counter)) 627 | if paste_counter % 100 == 0: 628 | self.logger.info('Scheduled %d pastes' % paste_counter) 629 | 630 | 631 | 632 | def _download_paste(self): 633 | while True: 634 | paste = self.pastes.get() # (name, lang, href) 635 | self.logger.debug('Fetching raw paste ' + paste[2]) 636 | link = self.conf_general['PBLink'] + 'raw/' + paste[2] 637 | data = self._handle_data_download(link) 638 | 639 | self.logger.debug('Fetched {} with {} - {}'.format( 640 | link, 641 | data.status_code, 642 | data.reason 643 | )) 644 | if self.conf_stdout.getboolean('Enable'): 645 | 
self._write_to_stdout(paste, data)
646 |             if self.conf_mysql.getboolean('Enable'):
647 |                 self._write_to_mysql(paste, data)
648 | 
649 |     def _handle_data_download(self, link):
650 |         while True:
651 |             try:
652 |                 data = requests.get(link)
653 |             except requests.exceptions.RequestException:  # leave Ctrl+C (KeyboardInterrupt) uncaught
654 |                 retry = self.conf_general.getint('ConnectionRetryInterval')
655 |                 self.logger.debug(
656 |                     'Error connecting to %s: Retry in %ss, TRACE: %s' %
657 |                     (link, retry, sys.exc_info())
658 |                 )
659 |                 self.logger.info('Connection problems - trying again in %ss' % retry)
660 |                 time.sleep(retry)
661 |             else:
662 |                 if data.status_code == 403 and b'Pastebin.com has blocked your IP' in data.content:
663 |                     self.logger.info('Our IP has been blocked. Trying again in an hour.')
664 |                     time.sleep(self.conf_general.getint('IPBlockedWaitTime'))
665 |                 return data
666 | 
667 |     def _assemble_output(self, conf, paste, data):
668 |         output = ''
669 |         if conf.getboolean('ShowName'):
670 |             output += 'Name: %s\n' % paste[0]
671 |         if conf.getboolean('ShowLang'):
672 |             output += 'Lang: %s\n' % paste[1]
673 |         if conf.getboolean('ShowLink'):
674 |             output += 'Link: %s\n' % (self.conf_general['PBLink'] + paste[2])
675 |         if conf.getboolean('ShowData'):
676 |             encoding = conf['DataEncoding']
677 |             limit = conf.getint('ContentDisplayLimit')
678 |             if limit > 0:
679 |                 output += '\n%s\n\n' % data.content.decode(encoding)[:limit]
680 |             else:
681 |                 output += '\n%s\n\n' % data.content.decode(encoding)
682 |         return output
683 | 
684 |     def _write_to_stdout(self, paste, data):
685 |         output = self._assemble_output(self.conf_stdout, paste, data)
686 |         sys.stdout.write(output)
687 | 
688 |     def _write_to_mysql(self, paste, data):
689 |         self.mysql_conn.add(paste, data)
690 | 
691 | 
692 |     def run(self):
693 |         for i in range(self.conf_general.getint('DownloadWorkers')):
694 |             t = threading.Thread(target=self._download_paste)
695 |             t.daemon = True  # setDaemon() is deprecated
696 |             t.start()
697 | 
698 |         s = threading.Thread(target=self._get_paste_data)
699 |         s.start()
700 |         s.join()
701 | 
702 | 
703 | if 
__name__ == '__main__':
704 |     ps = PastebinScraper()
705 |     ps.run()
706 | 

--------------------------------------------------------------------------------
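
The `isFiltered` check in `sniff-paste.py` compares a harvested IP against the networks listed in `res/nmapfilter.conf` before any scan is run. The following is a minimal standalone sketch of that idea, not the project's actual implementation. The helper names `load_filters` and `is_filtered` are hypothetical, the sample entries are copied from the filter file, and skipping blank lines and `#` comments is an added assumption (the shipped file contains neither).

```python
from ipaddress import ip_address, ip_network


def load_filters(lines):
    """Parse one network per line into ip_network objects (hypothetical helper)."""
    networks = []
    for line in lines:
        entry = line.strip()
        # Assumption: tolerate blank lines and '#' comments, which the
        # original isFiltered() does not handle.
        if not entry or entry.startswith("#"):
            continue
        # strict=False accepts entries whose host bits are set;
        # a bare address such as 146.82.55.93 becomes a /32 network.
        networks.append(ip_network(entry, strict=False))
    return networks


def is_filtered(ip, networks):
    """Return True if the address falls inside any filtered network."""
    addr = ip_address(ip)
    return any(addr in net for net in networks)


# A few entries in the same format as res/nmapfilter.conf.
sample_lines = ["10.0.0.0/8", "192.168.0.0/16", "", "146.82.55.93"]
networks = load_filters(sample_lines)

for candidate in ("10.1.2.3", "146.82.55.93", "8.8.8.8"):
    verdict = "skip" if is_filtered(candidate, networks) else "scan"
    print(candidate, verdict)
```

Parsing the filter list once and reusing it avoids reopening the file for every harvested address, which may matter when many IPs are queued for scanning.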