├── log
│   └── README.md
├── res
│   ├── sniff-paste-pic.jpg
│   ├── sql-statements.txt
│   ├── nmapfilter.conf
│   └── regexesToAdd.json
├── requirements.txt
├── .gitignore
├── settings.ini
├── README.md
└── sniff-paste.py
/log/README.md:
--------------------------------------------------------------------------------
1 | Scraper logs located here
2 |
--------------------------------------------------------------------------------
/res/sniff-paste-pic.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/needmorecowbell/sniff-paste/HEAD/res/sniff-paste-pic.jpg
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | requests==2.10.0
2 | colorlog==2.7.0
3 | lxml==3.4.4
4 | SQLAlchemy==1.2.9
5 | nmap==0.0.1
6 | cssselect==1.0.3
7 | ipaddress==1.0.22
8 | PyMySQL
9 |
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | *.log
2 | *.db
3 | log/*
4 | !log/README.md
5 | out/*.txt
6 | *.html
7 | *.xml
8 |
9 | configdebug.py
10 | settings.debug.ini
11 |
12 | __pycache__/
13 |
14 |
--------------------------------------------------------------------------------
/settings.ini:
--------------------------------------------------------------------------------
1 | [GENERAL]
2 | PasteLimit = 50
3 | PBLink = http://pastebin.com/
4 | DownloadWorkers = 2
5 | NewPasteCheckInterval = 5
6 | ConnectionRetryInterval = 30
7 | IPBlockedWaitTime = 3600
8 |
9 | [LOGGING]
10 | RotationLog = log/pastebin-scraper.log
11 | MaxRotationSize = 2097152
12 | RotationBackupCount = 3
13 |
14 | [STDOUT]
15 | Enable = yes
16 | ContentDisplayLimit = 100
17 | ShowName = yes
18 | ShowLang = yes
19 | ShowLink = yes
20 | ShowData = yes
21 | DataEncoding = utf-8
22 |
23 | [MYSQL]
24 | Enable = yes
25 | TableName = sniff_paste
26 | Host = 127.0.0.1
27 | Port = 3306
28 | Username = root
29 | Password = password
30 |
31 |
--------------------------------------------------------------------------------
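sniff-paste.py reads these values through `configparser` (see its `getint` and `getboolean` calls). The following is a minimal sketch of that lookup, assuming an inline copy of a few keys instead of the file on disk; `load_settings` and the returned dictionary keys are hypothetical names used only for illustration.

```python
import configparser

# Inline excerpt of settings.ini so the sketch runs without the file on disk.
SAMPLE = """
[GENERAL]
PasteLimit = 50
NewPasteCheckInterval = 5

[STDOUT]
Enable = yes
ContentDisplayLimit = 100
"""


def load_settings(text):
    """Parse INI text and return a few typed values (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    general = parser["GENERAL"]
    stdout = parser["STDOUT"]
    return {
        # getint/getboolean convert the raw strings; "yes" becomes True
        "paste_limit": general.getint("PasteLimit"),
        "check_interval": general.getint("NewPasteCheckInterval"),
        "stdout_enabled": stdout.getboolean("Enable"),
        "display_limit": stdout.getint("ContentDisplayLimit"),
    }


settings = load_settings(SAMPLE)
print(settings)
```

To read the real file, `parser.read("settings.ini")` would replace `read_string`.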
/res/sql-statements.txt:
--------------------------------------------------------------------------------
1 | sql-statements= {
2 | "Top 10 IP Occurrences in Pastes":"SELECT ip, COUNT(*) AS magnitude from ips Group By ip Order by magnitude DESC Limit 10",
3 | "Top 10 Email Occurrences in Pastes":"SELECT email, COUNT(*) AS magnitude from emails Group By email Order by magnitude DESC Limit 10",
4 | "Top 10 Phone Number Occurrences in Pastes":"SELECT phone, COUNT(*) AS magnitude from phones Group By phone Order by magnitude DESC Limit 10",
5 | "Top 10 URL Occurrences in Pastes":"SELECT url, COUNT(*) AS magnitude from links Group By url Order by magnitude DESC Limit 10",
6 | "Top 10 Pastes with Highest IP Counts":"SELECT link, COUNT(link) AS frequency FROM ips GROUP BY link ORDER by frequency DESC Limit 10",
7 | "Top 10 Pastes with Highest Email Counts":"SELECT link, COUNT(link) AS frequency FROM emails GROUP BY link ORDER by frequency DESC Limit 10",
8 | "Top 10 Pastes with Highest URL Counts":"SELECT link, COUNT(link) AS frequency FROM links GROUP BY link ORDER by frequency DESC Limit 10"
9 | }
10 |
--------------------------------------------------------------------------------
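The "Top 10" statements share one aggregate shape: group by a column, count the rows, sort descending, and keep the first ten. The sketch below shows that shape with an in-memory SQLite database as a stand-in for MySQL (an assumption made so the example runs anywhere). The `emails` table and its `email` and `link` columns follow the statements above; `top_occurrences` and the sample rows are hypothetical.

```python
import sqlite3

# Table/column pairs taken from the statements above; anything else is rejected
# so that identifiers are never built from untrusted input.
ALLOWED = {("ips", "ip"), ("emails", "email"), ("phones", "phone"), ("links", "url")}


def top_occurrences(conn, table, column, limit=10):
    """Return (value, count) rows for the most frequent values in a column."""
    if (table, column) not in ALLOWED:
        raise ValueError("unsupported table/column: %s.%s" % (table, column))
    query = (
        "SELECT {col}, COUNT(*) AS magnitude FROM {tbl} "
        "GROUP BY {col} ORDER BY magnitude DESC LIMIT ?"
    ).format(col=column, tbl=table)
    return conn.execute(query, (limit,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emails (email TEXT, link TEXT)")
conn.executemany(
    "INSERT INTO emails VALUES (?, ?)",
    [
        ("a@example.com", "AAAA1111"),
        ("a@example.com", "BBBB2222"),
        ("b@example.com", "AAAA1111"),
    ],
)
rows = top_occurrences(conn, "emails", "email")
print(rows)
```

The grouping column has to match the selected column; grouping `emails` by a column it does not have would fail in MySQL.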
/README.md:
--------------------------------------------------------------------------------
1 | ## Sniff-Paste: OSINT Pastebin Harvester
2 |
3 |
4 |
5 |
6 |
7 | A multithreaded Pastebin scraper that stores pastes in a MySQL database, then scans them for noteworthy information such as IP addresses, emails, phone numbers, URLs, secrets, and cryptocurrency addresses.
8 |
9 | Run `sniff-paste.py` to handle collection, logging, and harvesting automatically. Set `PasteLimit` to 0 in `settings.ini` to scrape indefinitely; in that mode, press Ctrl+C to stop. Any findings are stored in the database along with a link back to the paste they were found in.
10 |
11 |
12 | ## Installation
13 |
14 | `sudo apt install libxslt-dev python3-lxml python3-nmap xsltproc mysql-server`
15 |
16 | `pip3 install -r requirements.txt`
17 |
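18 | - Create a database named `sniff_paste` on the MySQL server (the name must match `TableName` in `settings.ini`)
19 | - Fill in `settings.ini`, in particular the `[MYSQL]` credentials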
20 |
21 | `python3 sniff-paste.py`
22 |
23 | This scrapes Pastebin for the most recent pastes (up to `PasteLimit`), then analyzes them for IP addresses, emails, and phone numbers. It filters out duplicates and runs nmap port scans on harvested IP addresses that are not excluded by `res/nmapfilter.conf`.
24 |
25 | ## Database Structure
26 | - `sniff_paste` -- root db
27 |   - `pastes` -- stores each paste with its full text, date, link, title, and language
28 |   - `emails` -- stores emails with a link to the source paste
29 |   - `links` -- stores URLs with a link to the source paste
30 |   - `ips` -- stores IP addresses with connectivity status and a link to the source paste
31 |   - `phones` -- stores phone numbers with a link to the source paste
32 |   - `secrets` -- stores the secret type with a link to the source paste
33 |   - `ports` -- stores port scan info (port, status, service, version, ip)
34 |   - `cryptos` -- stores cryptocurrency findings with a link to the source paste
35 |
36 | **Crypto findings are not guaranteed to be valid; treat them as low-probability findings.**
37 |
38 |
39 | ## Notes
40 |
41 | - Please contribute! If there's an error let me know -- even better if you can fix it :)
42 | - Regex Contributions would be very helpful, and should be pretty easy to add!
43 | - This tool is in the process of a bigger update, where the scraper can send all new pastes to my new project needmorecowbell/Funnel. I'm trying to consolidate all of my osint tools into one streamlined solution.
44 |
--------------------------------------------------------------------------------
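`_get_db_engine` in sniff-paste.py assembles a SQLAlchemy URL from the `[MYSQL]` settings, and `TableName` lands in the database position of that URL, which is why the database itself has to be called `sniff_paste`. A minimal sketch of that assembly follows. `build_mysql_url` is a hypothetical helper, and URL-quoting the credentials is an addition that the script does not perform.

```python
from urllib.parse import quote_plus


def build_mysql_url(username, password, host, port, table_name, charset="utf8"):
    """Return a SQLAlchemy-style URL for the PyMySQL driver (illustrative only)."""
    # quote_plus keeps characters such as '@' or '/' in the credentials from
    # breaking the URL; the original script formats them in unquoted.
    return "mysql+pymysql://{user}:{pw}@{host}:{port}/{db}?charset={charset}".format(
        user=quote_plus(username),
        pw=quote_plus(password),
        host=host,
        port=port,
        db=table_name,  # TableName from settings.ini is used as the database name
        charset=charset,
    )


url = build_mysql_url("root", "password", "127.0.0.1", 3306, "sniff_paste")
print(url)
```

The `mysql+pymysql` scheme means the PyMySQL package must be importable alongside SQLAlchemy.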
/res/nmapfilter.conf:
--------------------------------------------------------------------------------
1 | 0.0.0.0/8
2 | 10.0.0.0/8
3 | 100.64.0.0/10
4 | 127.0.0.0/8
5 | 169.254.0.0/16
6 | 172.16.0.0/12
7 | 192.0.0.0/24
8 | 192.0.0.0/29
9 | 192.0.0.170/32
10 | 192.0.0.171/32
11 | 192.0.2.0/24
12 | 192.88.99.0/24
13 | 192.168.0.0/16
14 | 198.18.0.0/15
15 | 198.51.100.0/24
16 | 203.0.113.0/24
17 | 240.0.0.0/4
18 | 255.255.255.255/32
19 | 153.11.0.0/16
20 | 4.53.201.0/24
21 | 5.152.179.0/24
22 | 8.12.162.0/24
23 | 8.12.163.0/24
24 | 8.12.164.0/24
25 | 8.14.84.0/22
26 | 8.14.145.0/24
27 | 8.14.146.0/24
28 | 8.14.147.0/24
29 | 8.17.250.0/24
30 | 8.17.251.0/24
31 | 8.17.252.0/24
32 | 23.27.0.0/16
33 | 23.231.128.0/17
34 | 37.72.172.0/23
35 | 38.72.200.0/22
36 | 50.93.192.0/24
37 | 50.93.193.0/24
38 | 50.93.194.0/24
39 | 50.93.195.0/24
40 | 50.93.196.0/24
41 | 50.93.197.0/24
42 | 50.115.128.0/20
43 | 50.117.0.0/17
44 | 50.118.128.0/17
45 | 63.141.222.0/24
46 | 64.62.253.0/24
47 | 64.92.96.0/19
48 | 64.145.79.0/24
49 | 64.145.82.0/23
50 | 64.158.146.0/23
51 | 65.49.24.0/24
52 | 65.49.93.0/24
53 | 65.162.192.0/22
54 | 66.79.160.0/19
55 | 66.160.191.0/24
56 | 68.68.96.0/20
57 | 69.46.64.0/19
58 | 69.176.80.0/20
59 | 72.13.80.0/20
60 | 72.52.76.0/24
61 | 74.82.43.0/24
62 | 74.82.160.0/19
63 | 74.114.88.0/22
64 | 74.115.0.0/24
65 | 74.115.2.0/24
66 | 74.115.4.0/24
67 | 74.122.100.0/22
68 | 75.127.0.0/24
69 | 103.251.91.0/24
70 | 108.171.32.0/24
71 | 108.171.42.0/24
72 | 108.171.52.0/24
73 | 108.171.62.0/24
74 | 118.193.78.0/23
75 | 130.93.16.0/23
76 | 136.0.0.0/16
77 | 142.111.0.0/16
78 | 142.252.0.0/16
79 | 146.82.55.93
80 | 149.54.136.0/21
81 | 149.54.152.0/21
82 | 166.88.0.0/16
83 | 172.252.0.0/16
84 | 173.245.64.0/19
85 | 173.245.194.0/23
86 | 173.245.220.0/22
87 | 173.252.192.0/18
88 | 178.18.16.0/22
89 | 178.18.26.0/24
90 | 178.18.27.0/24
91 | 178.18.28.0/24
92 | 178.18.29.0/24
93 | 183.182.22.0/24
94 | 192.92.114.0/24
95 | 192.155.160.0/19
96 | 192.177.0.0/16
97 | 192.186.0.0/18
98 | 192.249.64.0/20
99 | 192.250.240.0/20
100 | 194.110.214.0/24
101 | 198.12.120.0/24
102 | 198.12.121.0/24
103 | 198.12.122.0/24
104 | 198.144.240.0/20
105 | 199.33.120.0/24
106 | 199.33.124.0/22
107 | 199.48.147.0/24
108 | 199.68.196.0/22
109 | 199.127.240.0/21
110 | 199.187.168.0/22
111 | 199.188.238.0/23
112 | 199.255.208.0/24
113 | 203.12.6.0/24
114 | 204.13.64.0/21
115 | 204.16.192.0/21
116 | 204.19.238.0/24
117 | 204.74.208.0/20
118 | 205.159.189.0/24
119 | 205.164.0.0/18
120 | 205.209.128.0/18
121 | 206.108.52.0/23
122 | 206.165.4.0/24
123 | 208.77.40.0/21
124 | 208.80.4.0/22
125 | 208.123.223.0/24
126 | 209.51.185.0/24
127 | 209.54.48.0/20
128 | 209.107.192.0/23
129 | 209.107.210.0/24
130 | 209.107.212.0/24
131 | 211.156.110.0/23
132 | 216.83.33.0/24
133 | 216.83.34.0/24
134 | 216.83.35.0/24
135 | 216.83.36.0/24
136 | 216.83.37.0/24
137 | 216.83.38.0/24
138 | 216.83.39.0/24
139 | 216.83.40.0/24
140 | 216.83.42.0/24
141 | 216.83.43.0/24
142 | 216.83.44.0/24
143 | 216.83.45.0/24
144 | 216.83.46.0/24
145 | 216.83.47.0/24
146 | 216.83.48.0/24
147 | 216.83.49.0/24
148 | 216.83.51.0/24
149 | 216.83.52.0/24
150 | 216.83.53.0/24
151 | 216.83.54.0/24
152 | 216.83.55.0/24
153 | 216.83.56.0/24
154 | 216.83.57.0/24
155 | 216.83.58.0/24
156 | 216.83.59.0/24
157 | 216.83.60.0/24
158 | 216.83.61.0/24
159 | 216.83.62.0/24
160 | 216.83.63.0/24
161 | 216.151.183.0/24
162 | 216.151.190.0/23
163 | 216.172.128.0/19
164 | 216.185.36.0/24
165 | 216.218.233.0/24
166 | 216.224.112.0/20
167 | 194.77.40.242
168 | 194.77.40.246
169 | 165.160.0.0/16
170 | 129.123.0.0/16
171 | 144.39.0.0/16
172 | 204.113.91.0/24
173 |
--------------------------------------------------------------------------------
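Each line of this filter is either a CIDR network or a single address, and `isFiltered` in sniff-paste.py checks a harvested IP against every entry before scanning. The check can be sketched with the standard `ipaddress` module as below. `load_networks`, `is_filtered`, and the short sample list are hypothetical; the sketch parses the list once up front instead of re-reading the file per lookup.

```python
from ipaddress import ip_address, ip_network


def load_networks(lines):
    """Parse filter lines into network objects, skipping blank lines."""
    networks = []
    for line in lines:
        entry = line.strip()
        if entry:
            # A bare address such as 146.82.55.93 becomes a /32 network.
            networks.append(ip_network(entry))
    return networks


def is_filtered(ip, networks):
    """Return True when the address falls inside any filtered network."""
    address = ip_address(ip)
    return any(address in network for network in networks)


# A few entries copied from the filter above, plus a blank line.
SAMPLE = ["10.0.0.0/8", "192.168.0.0/16", "146.82.55.93", ""]
networks = load_networks(SAMPLE)

for candidate in ("10.1.2.3", "146.82.55.93", "8.8.8.8"):
    print(candidate, is_filtered(candidate, networks))
```

With the real file, `load_networks(open("res/nmapfilter.conf"))` would produce the full list.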
/res/regexesToAdd.json:
--------------------------------------------------------------------------------
1 | "ip" : {
2 | "ipv6": "i/^(([0-9a-fA-F]{1,4}:){7,7}[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,7}:|([0-9a-fA-F]{1,4}:){1,6}:[0-9a-fA-F]{1,4}|([0-9a-fA-F]{1,4}:){1,5}(:[0-9a-fA-F]{1,4}){1,2}|([0-9a-fA-F]{1,4}:){1,4}(:[0-9a-fA-F]{1,4}){1,3}|([0-9a-fA-F]{1,4}:){1,3}(:[0-9a-fA-F]{1,4}){1,4}|([0-9a-fA-F]{1,4}:){1,2}(:[0-9a-fA-F]{1,4}){1,5}|[0-9a-fA-F]{1,4}:((:[0-9a-fA-F]{1,4}){1,6})|:((:[0-9a-fA-F]{1,4}){1,7}|:)|fe80:(:[0-9a-fA-F]{0,4}){0,4}%[0-9a-zA-Z]{1,}|::(ffff(:0{1,4}){0,1}:){0,1}((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])|([0-9a-fA-F]{1,4}:){1,4}:((25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9])\.){3,3}(25[0-5]|(2[0-4]|1{0,1}[0-9]){0,1}[0-9]))$/",
3 |
4 | "ipv4": "/^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/"
5 | },
6 |
7 | "links": {
8 | "youtube-channel":"/https?:\/\/(www\.)?youtube.com\/channel\/UC([-_a-z0-9]{22})/i",
9 | "youtube-video":"/https?:\/\/(?:youtu\.be\/|(?:[a-z]{2,3}\.)?youtube\.com\/watch(?:\?|#\!)v=)([\w-]{11}).*/gi",
10 | "onion-link":"(?:^[a-z2-7]{16}\.onion$)|(?:^[a-z2-7]{56}\.onion$)",
11 | "https":"https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+",
12 | "http":"http://(?:[-\w.]|(?:%[\da-fA-F]{2}))+"
13 | },
14 |
15 | "general":{
16 | "street":"\d{1,4} [\w\s]{1,20}(?:street|st|avenue|ave|road|rd|highway|hwy|square|sq|trail|trl|drive|dr|court|ct|park|parkway|pkwy|circle|cir|boulevard|blvd)\W?(?=\s|$)",
17 | "credit-cards":"((?:(?:\\d{4}[- ]?){3}\\d{4}|\\d{15,16}))(?![\\d])",
18 | "emails":"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"
19 |
20 | },
21 |
22 | "secrets" : {
23 | "Slack Token": "(xox[p|b|o|a]-[0-9]{12}-[0-9]{12}-[0-9]{12}-[a-z0-9]{32})",
24 | "RSA private key": "-----BEGIN RSA PRIVATE KEY-----",
25 | "SSH (OPENSSH) private key": "-----BEGIN OPENSSH PRIVATE KEY-----",
26 | "SSH (DSA) private key": "-----BEGIN DSA PRIVATE KEY-----",
27 | "SSH (EC) private key": "-----BEGIN EC PRIVATE KEY-----",
28 | "PGP private key block": "-----BEGIN PGP PRIVATE KEY BLOCK-----",
29 | "Facebook Oauth": "[f|F][a|A][c|C][e|E][b|B][o|O][o|O][k|K].*['|\"][0-9a-f]{32}['|\"]",
30 | "Twitter Oauth": "[t|T][w|W][i|I][t|T][t|T][e|E][r|R].*['|\"][0-9a-zA-Z]{35,44}['|\"]",
31 | "GitHub": "[g|G][i|I][t|T][h|H][u|U][b|B].*['|\"][0-9a-zA-Z]{35,40}['|\"]",
32 | "Google Oauth": "(\"client_secret\":\"[a-zA-Z0-9-_]{24}\")",
33 | "AWS API Key": "AKIA[0-9A-Z]{16}",
34 | "Heroku API Key": "[h|H][e|E][r|R][o|O][k|K][u|U].*[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{12}",
35 | "Generic Secret": "[s|S][e|E][c|C][r|R][e|E][t|T].*['|\"][0-9a-zA-Z]{32,45}['|\"]"
36 | },
37 |
38 | "cryptos":{
39 | "bitcoin-address" : "[13][a-km-zA-HJ-NP-Z1-9]{25,34}" ,
40 | "bitcoin-uri" : "bitcoin:([13][a-km-zA-HJ-NP-Z1-9]{25,34})" ,
41 | "bitcoin-xpub-key" : "(xpub[a-km-zA-HJ-NP-Z1-9]{100,108})(\\?c=\\d*&h=bip\\d{2,3})?" ,
42 | "monero-address": "(?:^4[0-9AB][1-9A-HJ-NP-Za-km-z]{93}$)",
43 | "ethereum-address": "(?:^0x[a-fA-F0-9]{40}$)",
44 | "litecoin-address":"(?:^[LM3][a-km-zA-HJ-NP-Z1-9]{26,33}$)",
45 | "bitcoin-cash-address":"(?:^[13][a-km-zA-HJ-NP-Z1-9]{33}$)",
46 | "dash-address":"(?:^X[1-9A-HJ-NP-Za-km-z]{33}$)",
47 | "ripple-address":"(?:^r[0-9a-zA-Z]{33}$)",
48 | "neo-address":"(?:^A[0-9a-zA-Z]{33}$)",
49 | "dogecoin-address":"(?:^D{1}[5-9A-HJ-NP-U]{1}[1-9A-HJ-NP-Za-km-z]{32}$)"
50 | }
51 |
--------------------------------------------------------------------------------
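Patterns like these are typically applied by iterating over a name-to-regex mapping and collecting every match. The sketch below assumes that approach, with two patterns copied from the list above; `scan_text` and the sample text are hypothetical. It also shows why the `^...$` anchored patterns need `re.MULTILINE` to match anything inside a multi-line paste.

```python
import re

# Two patterns copied from the list above.
PATTERNS = {
    "AWS API Key": r"AKIA[0-9A-Z]{16}",
    "ethereum-address": r"(?:^0x[a-fA-F0-9]{40}$)",
}


def scan_text(text, patterns=PATTERNS, flags=re.MULTILINE):
    """Return (pattern name, matched text) pairs for every match in text."""
    findings = []
    for name, pattern in patterns.items():
        for match in re.finditer(pattern, text, flags):
            findings.append((name, match.group(0)))
    return findings


sample = "\n".join([
    "aws_key = AKIAIOSFODNN7EXAMPLE",
    "0x0123456789abcdef0123456789abcdef01234567",
    "send to 0x0123456789abcdef0123456789abcdef01234567 please",
])

for name, value in scan_text(sample):
    print(name, value)
```

Only the address that fills a whole line is reported. Without `re.MULTILINE` the anchored pattern would match only when the entire paste is a single address, so anchored entries may need either the flag or their anchors removed before they are useful on paste bodies.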
/sniff-paste.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python3
2 | import json
3 | from ipaddress import IPv4Address as IPv4
4 | from ipaddress import IPv4Network as IPv4Net
5 | import logging
6 | import socket
7 | import logging.handlers
8 | import os
9 | import sys
10 | import threading
11 | import time
12 | from datetime import datetime
13 | from os import path
14 | import re
15 | import requests
16 | from lxml import html
17 |
18 | import nmap
19 |
20 | import configparser
21 | import queue
22 | from colorlog import ColoredFormatter
23 |
24 | from sqlalchemy import Column, Integer, String, DateTime, Boolean
25 | from sqlalchemy.dialects.mysql import LONGTEXT
26 |
27 |
28 | debug = False
29 |
30 | IPStack = []
31 |
32 |
33 | secretRegexes = {
34 | "Slack Token": "(xox[p|b|o|a]-[0-9]{12}-[0-9]{12}-[0-9]{12}-[a-z0-9]{32})",
35 | "RSA private key": "-----BEGIN RSA PRIVATE KEY-----",
36 | "SSH (OPENSSH) private key": "-----BEGIN OPENSSH PRIVATE KEY-----",
37 | "SSH (DSA) private key": "-----BEGIN DSA PRIVATE KEY-----",
38 | "SSH (EC) private key": "-----BEGIN EC PRIVATE KEY-----",
39 | "PGP private key block": "-----BEGIN PGP PRIVATE KEY BLOCK-----",
40 | "Facebook Oauth": "[f|F][a|A][c|C][e|E][b|B][o|O][o|O][k|K].*['|\"][0-9a-f]{32}['|\"]",
41 | "Twitter Oauth": "[t|T][w|W][i|I][t|T][t|T][e|E][r|R].*['|\"][0-9a-zA-Z]{35,44}['|\"]",
42 | "GitHub": "[g|G][i|I][t|T][h|H][u|U][b|B].*['|\"][0-9a-zA-Z]{35,40}['|\"]",
43 | "Google Oauth": "(\"client_secret\":\"[a-zA-Z0-9-_]{24}\")",
44 | "AWS API Key": "AKIA[0-9A-Z]{16}",
45 | "Heroku API Key": "[h|H][e|E][r|R][o|O][k|K][u|U].*[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{12}",
46 | "Generic Secret": "[s|S][e|E][c|C][r|R][e|E][t|T].*['|\"][0-9a-zA-Z]{32,45}['|\"]"
47 | }
48 |
49 | cryptoRegexes={
50 | "bitcoin-address" : "[13][a-km-zA-HJ-NP-Z1-9]{25,34}" ,
51 | "bitcoin-uri" : "bitcoin:([13][a-km-zA-HJ-NP-Z1-9]{25,34})" ,
52 | "bitcoin-xpub-key" : "(xpub[a-km-zA-HJ-NP-Z1-9]{100,108})(\\?c=\\d*&h=bip\\d{2,3})?" ,
53 | "monero-address": "(?:^4[0-9AB][1-9A-HJ-NP-Za-km-z]{93}$)",
54 | "ethereum-address": "(?:^0x[a-fA-F0-9]{40}$)",
55 | "litecoin-address":"(?:^[LM3][a-km-zA-HJ-NP-Z1-9]{26,33}$)",
56 | "bitcoin-cash-address":"(?:^[13][a-km-zA-HJ-NP-Z1-9]{33}$)",
57 | "dash-address":"(?:^X[1-9A-HJ-NP-Za-km-z]{33}$)",
58 | "ripple-address":"(?:^r[0-9a-zA-Z]{33}$)",
59 | "neo-address":"(?:^A[0-9a-zA-Z]{33}$)",
60 | "dogecoin-address":"(?:^D{1}[5-9A-HJ-NP-U]{1}[1-9A-HJ-NP-Za-km-z]{32}$)"
61 | }
62 |
63 | class PasteDBConnector(object):
64 |     supported = ('MYSQL',)
65 |
66 |     def __init__(self, db, **kwargs):
67 |         self.logger = logging.getLogger('pastebin-scraper')
68 |         try:
69 |             from sqlalchemy.ext.declarative import declarative_base
70 |         except ImportError:
71 |             self.logger.error('SQLAlchemy import failed. Make sure the SQLAlchemy Python library '
72 |                               'is installed! To check your existing installation run: '
73 |                               'python3 -c "import sqlalchemy;print(sqlalchemy.__version__)"')
74 |             raise  # declarative_base is required below, so do not continue without it
75 |         self.db = db
76 |         self.Base = declarative_base()
77 |         self.engine = self._get_db_engine(**kwargs)
78 |         self.session = self._get_db_session(self.engine)
79 |
80 |         # Create tables for pastes and the findings extracted from them
81 |         self.paste_model = self._get_paste_model(self.Base, **kwargs)
82 |         self.email_model = self._get_email_model(self.Base)
83 |         self.link_model = self._get_link_model(self.Base)
84 |         self.ip_model = self._get_ip_model(self.Base)
85 |         self.phone_model = self._get_phone_model(self.Base)
86 |         self.secret_model = self._get_secret_model(self.Base)
87 |         self.crypto_model = self._get_crypto_model(self.Base)
88 |         self.port_model = self._get_port_model(self.Base)
89 |         self.Base.metadata.create_all(self.engine)
90 |
91 |         # Nmap worker, run as a background daemon thread
92 |         nmapper = threading.Thread(target=self._scan_network)
93 |         nmapper.daemon = True
94 |         nmapper.start()
95 |
96 |     def isFiltered(self, ip):
97 |         ip = IPv4(ip)  # parse once instead of on every loop iteration
98 |         with open("res/nmapfilter.conf") as ipFilters:
99 |             for networkString in ipFilters:
100 |                 networkString = networkString.strip()
101 |                 if networkString and ip in IPv4Net(networkString):  # blank lines are skipped
102 |                     self.logger.debug("[" + str(ip) + "] is a filtered address, do not scan.")
103 |                     return True
104 |         return False
105 |
106 |
107 |
108 |     def _get_db_engine(self, **kwargs):
109 |         from sqlalchemy import create_engine
110 |
111 |         if self.db == 'MYSQL':
112 |             # use the PyMySQL driver (mysql+pymysql)
113 |             location = 'mysql+pymysql://'
114 |             location += '{username}:{password}@{host}:{port}'.format(
115 |                 host=kwargs.pop('host'),
116 |                 port=kwargs.pop('port'),
117 |                 username=kwargs.pop('username'),
118 |                 password=kwargs.pop('password'),
119 |             )
120 |             location += '/{table_name}?charset={charset}'.format(
121 |                 table_name=kwargs.pop('table_name'),
122 |                 charset='utf8'
123 |             )
124 |
125 |             self.logger.info('Using MySQL')
126 |         return create_engine(location)
127 |
128 |     def _get_db_session(self, engine):
129 |         from sqlalchemy.orm import sessionmaker
130 |         return sessionmaker(bind=engine)()
131 |
132 |     def _get_paste_model(self, base, **kwargs):
133 |         class Paste(base):
134 |             __tablename__ = "pastes"
135 |
136 |             id = Column(Integer, primary_key=True)
137 |             name = Column('name', String(60))
138 |             lang = Column('language', String(30))
139 |             link = Column('link', String(28))  # Assuming format http://pastebin.com/XXXXXXXX
140 |             date = Column('date', DateTime())
141 |
142 |             data = Column('data', LONGTEXT(charset='utf8'))
143 |
144 |             def __repr__(self):
145 | return "= paste_limit):
606 | # Break for limits % 8 != 0
607 | break
608 | name_link = paste.cssselect('a')[0]
609 | name = name_link.text_content().strip()
610 | href = name_link.get('href')[1:] # Get rid of leading /
611 | data = paste.cssselect('span')[0].text_content().split('|')
612 | language = None
613 | if len(data) == 2:
614 | # Got language
615 | language = data[0].strip()
616 | paste_data = (name, language, href)
617 | self.logger.debug('Paste scraped: ' + str(paste_data))
618 | if paste_data[2] not in self.pastes_seen:
619 | # New paste detected
620 | self.logger.debug('Scheduling new paste:' + str(paste_data))
621 | self.pastes_seen.add(paste_data[2])
622 | self.pastes.put(paste_data)
623 | delay = self.conf_general.getint('NewPasteCheckInterval')
624 | time.sleep(delay)
625 | paste_counter += 1
626 | self.logger.debug('Paste counter now at ' + str(paste_counter))
627 | if paste_counter % 100 == 0:
628 | self.logger.info('Scheduled %d pastes' % paste_counter)
629 |
630 |
631 |
632 |     def _download_paste(self):
633 |         while True:
634 |             paste = self.pastes.get()  # (name, lang, href)
635 |             self.logger.debug('Fetching raw paste ' + paste[2])
636 |             link = self.conf_general['PBLink'] + 'raw/' + paste[2]
637 |             data = self._handle_data_download(link)
638 |
639 |             self.logger.debug('Fetched {} with {} - {}'.format(
640 |                 link,
641 |                 data.status_code,
642 |                 data.reason
643 |             ))
644 |             if self.conf_stdout.getboolean('Enable'):
645 |                 self._write_to_stdout(paste, data)
646 |             if self.conf_mysql.getboolean('Enable'):
647 |                 self._write_to_mysql(paste, data)
648 |
649 |     def _handle_data_download(self, link):
650 |         while True:
651 |             try:
652 |                 data = requests.get(link)
653 |             except requests.exceptions.RequestException:
654 |                 retry = self.conf_general.getint('ConnectionRetryInterval')
655 |                 self.logger.debug('Error connecting to %s: Retry in %ss, TRACE: %s' %
656 |                                   (link, retry, sys.exc_info()))
657 |                 self.logger.info('Connection problems - trying again in %ss' % retry)
658 |                 time.sleep(retry)
659 |             else:
660 |                 if data.status_code == 403 and b'Pastebin.com has blocked your IP' in data.content:
661 |                     wait = self.conf_general.getint('IPBlockedWaitTime')
662 |                     self.logger.info('Our IP has been blocked. Trying again in %ss.' % wait)
663 |                     time.sleep(wait)
664 |                     continue  # retry rather than returning the block page as paste data
665 |                 return data
666 |
667 |     def _assemble_output(self, conf, paste, data):
668 |         output = ''
669 |         if conf.getboolean('ShowName'):
670 |             output += 'Name: %s\n' % paste[0]
671 |         if conf.getboolean('ShowLang'):
672 |             output += 'Lang: %s\n' % paste[1]
673 |         if conf.getboolean('ShowLink'):
674 |             output += 'Link: %s\n' % (self.conf_general['PBLink'] + paste[2])
675 |         if conf.getboolean('ShowData'):
676 |             encoding = conf['DataEncoding']
677 |             limit = conf.getint('ContentDisplayLimit')
678 |             if limit > 0:
679 |                 output += '\n%s\n\n' % data.content.decode(encoding)[:limit]
680 |             else:
681 |                 output += '\n%s\n\n' % data.content.decode(encoding)
682 |         return output
683 |
684 |     def _write_to_stdout(self, paste, data):
685 |         output = self._assemble_output(self.conf_stdout, paste, data)
686 |         sys.stdout.write(output)
687 |
688 |     def _write_to_mysql(self, paste, data):
689 |         self.mysql_conn.add(paste, data)
690 |
691 |
692 |     def run(self):
693 |         for i in range(self.conf_general.getint('DownloadWorkers')):
694 |             t = threading.Thread(target=self._download_paste)
695 |             t.daemon = True
696 |             t.start()
697 |
698 |         s = threading.Thread(target=self._get_paste_data)
699 |         s.start()
700 |         s.join()
701 |
702 |
703 | if __name__ == '__main__':
704 |     ps = PastebinScraper()
705 |     ps.run()
706 |
--------------------------------------------------------------------------------
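`run()` follows a producer/consumer layout: one thread schedules pastes onto a shared queue while `DownloadWorkers` daemon threads pull from it. The sketch below shows that layout in isolation. `run_pipeline` is a hypothetical name, and the `upper()` call is only a stand-in for downloading and storing a paste.

```python
import queue
import threading


def run_pipeline(items, worker_count=2):
    """Process items with daemon worker threads and return the sorted results."""
    jobs = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            item = jobs.get()  # blocks until the producer schedules something
            try:
                processed = item.upper()  # stand-in for the real download step
                with results_lock:
                    results.append(processed)
            finally:
                jobs.task_done()

    for _ in range(worker_count):
        threading.Thread(target=worker, daemon=True).start()

    for item in items:  # the producer side: schedule every item
        jobs.put(item)

    jobs.join()  # wait until every scheduled item has been processed
    return sorted(results)


print(run_pipeline(["a1b2c3d4", "e5f6g7h8", "i9j0k1l2"]))
```

The sketch waits on `Queue.join()` instead of joining only the producer thread. Because daemon threads are stopped when the main thread exits, joining just the producer may drop pastes that are still queued or mid-download, depending on how the rest of the script handles shutdown.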