├── bitcointalk_ANN
│   ├── __init__.py
│   ├── spiders
│   │   ├── old_files
│   │   │   ├── merge.py
│   │   │   ├── get_urls.py
│   │   │   ├── ANN_info.py
│   │   │   ├── urls_spider.py
│   │   │   ├── spider_bitcointalk.py
│   │   │   ├── tests_spider.py
│   │   │   ├── spider_base_url.py
│   │   │   ├── ANN_runfile.py
│   │   │   ├── runfile.py
│   │   │   ├── posts_spider.py
│   │   │   ├── bitcointalk_spider.py
│   │   │   ├── bitcointalk_spider_test.py
│   │   │   ├── add_css.py
│   │   │   └── urls.py
│   │   ├── urls.pickle
│   │   ├── __init__.py
│   │   └── bitcointalk_spider.py
│   ├── proxy_list.txt
│   ├── items.py
│   ├── pipelines.py
│   ├── middlewares.py
│   ├── helper.py
│   ├── settings.py
│   ├── .idea
│   │   └── workspace.xml
│   ├── style.html
│   └── pages
│       └── 7.html
├── .idea
│   ├── vcs.xml
│   ├── dictionaries
│   │   └── Shasa.xml
│   ├── misc.xml
│   ├── modules.xml
│   ├── bitcointalk_ANN.iml
│   └── workspace.xml
├── scrapy.cfg
├── runfile.py
└── README.md
/bitcointalk_ANN/__init__.py:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/bitcointalk_ANN/spiders/old_files/merge.py:
--------------------------------------------------------------------------------
1 |
2 |
3 |
--------------------------------------------------------------------------------
/bitcointalk_ANN/spiders/old_files/get_urls.py:
--------------------------------------------------------------------------------
1 | BASE_URL = r'https://bitcointalk.org/index.php?topic=421615.0'
--------------------------------------------------------------------------------
/bitcointalk_ANN/spiders/urls.pickle:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/shasafoster/bitcointalk-ANN/HEAD/bitcointalk_ANN/spiders/urls.pickle
--------------------------------------------------------------------------------
/.idea/vcs.xml:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
--------------------------------------------------------------------------------
/bitcointalk_ANN/spiders/__init__.py:
--------------------------------------------------------------------------------
1 | # This package will contain the spiders of your Scrapy project
2 | #
3 | # Please refer to the documentation for information on how to create and manage
4 | # your spiders.
5 |
--------------------------------------------------------------------------------
/.idea/dictionaries/Shasa.xml:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 | pinkcoin
5 | shasa
6 |
7 |
8 |
--------------------------------------------------------------------------------
/bitcointalk_ANN/proxy_list.txt:
--------------------------------------------------------------------------------
1 | http://5.196.189.50:8080
2 | http://54.36.182.96:3128
3 | http://89.236.17.106:3128
4 | http://163.172.217.103:31288
5 | http://203.74.4.7:80
6 | http://203.74.4.6:80
7 |
8 |
9 |
10 |
11 |
--------------------------------------------------------------------------------
/.idea/misc.xml:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
--------------------------------------------------------------------------------
/bitcointalk_ANN/spiders/old_files/ANN_info.py:
--------------------------------------------------------------------------------
1 | # This file stores information that may often change
2 |
3 | BASE_URL = r'https://bitcointalk.org/index.php?topic=421615.0'
4 | CRYPTO_NAME = r'Dash'
5 | BASE_DIR = r'C:/Users/Shasa/Documents/Projects/bitcointalk_ANN/'
--------------------------------------------------------------------------------
/.idea/modules.xml:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
--------------------------------------------------------------------------------
/scrapy.cfg:
--------------------------------------------------------------------------------
1 | # Automatically created by: scrapy startproject
2 | #
3 | # For more information about the [deploy] section see:
4 | # https://scrapyd.readthedocs.org/en/latest/deploy.html
5 |
6 | [settings]
7 | default = bitcointalk_ANN.settings
8 |
9 | [deploy]
10 | #url = http://localhost:6800/
11 | project = bitcointalk_ANN
12 |
--------------------------------------------------------------------------------
/bitcointalk_ANN/items.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 |
3 | # Define here the models for your scraped items
4 | #
5 | # See documentation in:
6 | # http://doc.scrapy.org/en/latest/topics/items.html
7 |
8 | import scrapy
9 |
10 |
11 | class BitcointalkAnnItem(scrapy.Item):
12 | # define the fields for your item here like:
13 | # name = scrapy.Field()
14 | pass
15 |
16 |
17 | class PostsItem(scrapy.Item):
18 | page_number = scrapy.Field()
19 | posts = scrapy.Field()
20 |
--------------------------------------------------------------------------------
/.idea/bitcointalk_ANN.iml:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
--------------------------------------------------------------------------------
/runfile.py:
--------------------------------------------------------------------------------
1 | from scrapy.crawler import CrawlerProcess
2 | from scrapy.utils.project import get_project_settings
3 | from bitcointalk_ANN.helper import *
4 |
5 |
6 | def script():
7 | num_of_thread_pages, crypto_currency = get_urls()
8 |
9 | process = CrawlerProcess(get_project_settings())
10 | process.crawl('bitcointalk')
11 | process.start() # the script will block here until the crawling is finished
12 |
13 | num_of_scraped_pages = merge(crypto_currency)
14 |
15 | print_log(crypto_currency, num_of_thread_pages, num_of_scraped_pages)
16 |
17 |
18 | script()
19 |
20 |
--------------------------------------------------------------------------------
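runfile.py drives the whole pipeline through three functions imported from bitcointalk_ANN/helper.py (get_urls, merge, print_log), but helper.py's source is not reproduced in this dump. The sketch below is only a guess at what those functions might look like, pieced together from how runfile.py calls them, from pipelines.py, and from the scripts in spiders/old_files; the paths, parsing details, and Python 3 idioms are all assumptions rather than the real implementation.

```python
# Hypothetical sketch of bitcointalk_ANN/helper.py (the real file is not shown here).
import os
import pickle
import urllib.request
from bs4 import BeautifulSoup

PAGES_DIR = './bitcointalk_ANN/pages'                     # where PostPipeline saves pages (assumed)
URLS_PICKLE = './bitcointalk_ANN/spiders/urls.pickle'     # read by BitcointalkSpider at import time


def get_urls():
    """Prompt for a currency, locate its [ANN] thread and pickle the page URLs."""
    crypto_currency = input("Enter the name of the crypto economic protocol: ").lower()

    # The 'Announcement' link on coinmarketcap.com points at the bitcointalk thread
    page = urllib.request.urlopen('https://coinmarketcap.com/currencies/' + crypto_currency)
    base_url = BeautifulSoup(page, 'lxml').find('a', href=True, text='Announcement')['href']

    # Each thread page shows 20 posts; page i starts at offset 20*(i-1)
    thread = BeautifulSoup(urllib.request.urlopen(base_url), 'lxml')
    page_links = thread.select('#bodyarea table a')
    num_pages = max(int(a.get_text(strip=True)) for a in page_links
                    if a.get_text(strip=True).isdigit())
    urls = [base_url] + [base_url[:-1] + str(20 * (i - 1)) for i in range(2, num_pages + 1)]

    with open(URLS_PICKLE, 'wb') as handle:
        pickle.dump(urls, handle)
    return num_pages, crypto_currency


def merge(crypto_currency):
    """Concatenate the per-page html files written by PostPipeline into one document."""
    pages = sorted(os.listdir(PAGES_DIR), key=lambda name: int(name.split('.')[0]))
    with open(crypto_currency + '.html', 'w', encoding='utf-8') as out:
        for name in pages:
            with open(os.path.join(PAGES_DIR, name), encoding='utf-8') as f:
                out.write(f.read())
    return len(pages)


def print_log(crypto_currency, num_of_thread_pages, num_of_scraped_pages):
    print('%s: scraped %d of %d thread pages'
          % (crypto_currency, num_of_scraped_pages, num_of_thread_pages))
```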
/bitcointalk_ANN/pipelines.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 |
3 | # Define your item pipelines here
4 | #
5 | # Don't forget to add your pipeline to the ITEM_PIPELINES setting
6 | # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
7 |
8 | from bs4 import BeautifulSoup
9 |
10 |
11 | class PostPipeline(object):
12 |
13 | def process_item(self, item, spider):
14 |
15 | filename = r'C:/Users/Shasa/PycharmProjects/bitcointalk/bitcointalk_ANN/bitcointalk_ANN/pages' \
16 | + '/' + str(item['page_number']) + '.html'
17 | with open(filename, 'wb') as f:
18 | soup = BeautifulSoup(item['posts'], 'lxml')  # specify a parser explicitly
19 | f.write(soup.prettify(encoding='utf-8'))
20 | print('Saving page ' + str(item['page_number']))
21 | f.close()
22 |
23 |
24 |
--------------------------------------------------------------------------------
/bitcointalk_ANN/spiders/old_files/urls_spider.py:
--------------------------------------------------------------------------------
1 | from ANN_info import BASE_URL
2 | import scrapy
3 | import pickle
4 |
5 |
6 | class UrlsSpider(scrapy.Spider):
7 | name = "urls"
8 |
9 | def start_requests(self):
10 |
11 | yield scrapy.Request(url=BASE_URL, callback=self.parse)
12 |
13 | def parse(self, response):
14 |
15 | # Extract the number of pages in the bitcointalk thread
16 | table = response.xpath('//div[@id="bodyarea"]/table')[0]
17 | num_pages = max([int(x) for x in table.xpath('./tr/td/a/text()').extract()])
18 |
19 |
20 | # Create list of pages in thread for spider to parse and then pickle the list
21 | urls = [BASE_URL] + [BASE_URL[:-1] + str(int(20 * (i - 1))) for i in range(2, num_pages + 1)]
22 | urls = urls[:2]
23 | output = open('urls.pkl', 'wb')
24 | pickle.dump(urls, output)
25 | output.close()
--------------------------------------------------------------------------------
/bitcointalk_ANN/spiders/old_files/spider_bitcointalk.py:
--------------------------------------------------------------------------------
1 | import scrapy
2 | import os
3 | import pickle
4 | from ANN_info import BASE_DIR, CRYPTO_NAME
5 | from bs4 import BeautifulSoup
6 | class BitcointalkSpider(scrapy.Spider):
7 | name = "bitcointalk"
8 |
9 | def start_requests(self):
10 | # Delete text file if exists
11 | try:
12 | path = os.path.join(BASE_DIR, (CRYPTO_NAME + r'.html'))
13 | os.remove(path)
14 | except OSError:
15 | pass
16 |
17 | pkl_file = open('urls.pkl','rb')
18 | urls = pickle.load(pkl_file)
19 | urls = [urls[0]]
20 | pkl_file.close()
21 |
22 | # Parse urls
23 | for url in urls:
24 | yield scrapy.Request(url=url, callback=self.parse)
25 |
26 | def parse(self, response):
27 |
28 | # The posts from the webpage
29 | table = response.xpath('//div[@id="bodyarea"]/form[@id="quickModForm"]/table')[0]
30 | posts = table.xpath('./tr')
31 |
32 |
33 | filename = CRYPTO_NAME + '.html'
34 | with open(filename, 'a') as f:
35 | f.write(BeautifulSoup(table.extract(), 'lxml').encode('utf8'))
36 | #for post in posts:
37 | # f.write(BeautifulSoup(post.extract(),'lxml').encode('utf8'))
38 | f.close()
39 | self.log('Saved file %s' % filename)
--------------------------------------------------------------------------------
/bitcointalk_ANN/spiders/old_files/tests_spider.py:
--------------------------------------------------------------------------------
1 | from ANN_info import BASE_URL
2 | from ANN_info import CRYPTO_NAME
3 | from ANN_info import BASE_DIR
4 | import scrapy
5 | from bs4 import BeautifulSoup
6 | import os
7 | import pickle
8 |
9 |
10 | class TestsSpider(scrapy.Spider):
11 | name = "tests"
12 |
13 | def start_requests(self):
14 | # Delete text file if exists
15 | try:
16 | path = os.path.join(BASE_DIR, (CRYPTO_NAME + r'.html'))
17 | os.remove(path)
18 | except OSError:
19 | pass
20 |
21 | pkl_file = open('urls.pkl','rb')
22 | urls = pickle.load(pkl_file)
23 | urls = [urls[0]]
24 | pkl_file.close()
25 |
26 | # Parse urls
27 | for url in urls:
28 | yield scrapy.Request(url=url, callback=self.parse)
29 |
30 | def parse(self, response):
31 |
32 | # The posts from the webpage
33 | table = response.xpath('//div[@id="bodyarea"]/form[@id="quickModForm"]/table')[0]
34 | posts = table.xpath('./tr')
35 |
36 |
37 | filename = CRYPTO_NAME + '.html'
38 | with open(filename, 'a') as f:
39 | f.write(BeautifulSoup(table.extract(), 'lxml').encode('utf8'))
40 | #for post in posts:
41 | # f.write(BeautifulSoup(post.extract(),'lxml').encode('utf8'))
42 | f.close()
43 | self.log('Saved file %s' % filename)
--------------------------------------------------------------------------------
/bitcointalk_ANN/spiders/old_files/spider_base_url.py:
--------------------------------------------------------------------------------
1 | import pickle
2 | from bs4 import BeautifulSoup
3 | import urllib2
4 | from lxml import etree
5 | import scrapy
6 | from spider_bitcointalk import BitcointalkSpider
7 | from scrapy.crawler import CrawlerProcess
8 |
9 | # Prompt the user for input (via command prompt)
10 | name = raw_input("Enter the name of the crypto economic protocol: ")
11 |
12 | # Get base url from coinmarketcap.com
13 | url = r'https://coinmarketcap.com/currencies/' + name
14 | response = urllib2.urlopen(url)
15 | soup = BeautifulSoup(response, 'lxml')
16 | base_url = soup.find('a', href=True, text='Announcement')['href']
17 |
18 | # Extract the number of pages in the bitcointalk thread
19 | response = urllib2.urlopen(base_url)
20 | html_parser = etree.HTMLParser()
21 | tree = etree.parse(response, html_parser)
22 | table = tree.xpath('//div[@id="bodyarea"]/table')[0]
23 | num_pages = max([int(x) for x in table.xpath('./tr/td/a/text()')])
24 |
25 | # Create list of page urls in thread for spider to parse and then pickle the list
26 | urls = [base_url] + [base_url[:-1] + str(int(20 * (i - 1))) for i in range(2, num_pages + 1)]
27 | urls = urls[:2]
28 | output = open('urls.pkl', 'wb')
29 | pickle.dump(urls, output)
30 | output.close()
31 |
32 | process = CrawlerProcess({
33 | 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
34 | })
35 |
36 | # Run spider
37 | process.crawl(BitcointalkSpider)
38 | process.start()
--------------------------------------------------------------------------------
/bitcointalk_ANN/spiders/old_files/ANN_runfile.py:
--------------------------------------------------------------------------------
1 | import pickle
2 | from bs4 import BeautifulSoup
3 | import urllib2
4 | from lxml import etree
5 | import scrapy
6 | from spider_bitcointalk import BitcointalkSpider
7 | from scrapy.crawler import CrawlerProcess
8 |
9 | # Prompt the user for input (via command prompt)
10 | name = raw_input("Enter the name of the crypto economic protocol: ")
11 |
12 | # Get base url from coinmarketcap.com
13 | url = r'https://coinmarketcap.com/currencies/' + name
14 | response = urllib2.urlopen(url)
15 | soup = BeautifulSoup(response, 'lxml')
16 | base_url = soup.find('a', href=True, text='Announcement')['href']
17 |
18 | # Extract the number of pages in the bitcointalk thread
19 | response = urllib2.urlopen(base_url)
20 | html_parser = etree.HTMLParser()
21 | tree = etree.parse(response, html_parser)
22 | table = tree.xpath('//div[@id="bodyarea"]/table')[0]
23 | num_pages = max([int(x) for x in table.xpath('./tr/td/a/text()')])
24 |
25 | # Create list of page urls in thread for spider to parse and then pickle the list
26 | name_urls = [base_url] + [base_url[:-1] + str(int(20 * (i - 1))) for i in range(2, num_pages + 1)]
27 | path = r'C:/Users/Shasa/Documents/Projects/bitcointalk_ANN/bitcointalk_ANN/spiders/name_urls.pkl'
28 | output = open(path, 'wb')
29 | pickle.dump(name_urls, output)
30 | output.close()
31 |
32 | process = CrawlerProcess({
33 | 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
34 | })
35 |
36 | # Run spider
37 | process.crawl(BitcointalkSpider)
38 | process.start()
--------------------------------------------------------------------------------
/bitcointalk_ANN/spiders/bitcointalk_spider.py:
--------------------------------------------------------------------------------
1 | import scrapy
2 | import pickle
3 | import re
4 | from collections import Counter
5 | from bitcointalk_ANN.items import PostsItem
6 |
7 |
8 | with open('./bitcointalk_ANN/spiders/urls.pickle', 'rb') as handle:
9 | urls = pickle.load(handle)
10 |
11 |
12 | class BitcointalkSpider(scrapy.Spider):
13 | name = "bitcointalk"
14 |
15 | def start_requests(self):
16 | # Parse urls
17 | for i, url in enumerate(urls):
18 | yield scrapy.Request(url=url, callback=self.parse, meta={'page_number': i}, dont_filter=True)
19 |
20 | def parse(self, response):
21 |
22 | # We only want user posts (no ads, deleted posts etc)
23 | table = response.xpath('//div[@id="bodyarea"]/form[@id="quickModForm"]/table')[0]
24 | rows = list(table.xpath('./tr'))
25 | joined = ''.join([str(row) for row in rows])
26 | results = re.findall(r'<tr class="[\w]+">', joined)  # collect each row's class attribute
27 | most_common_result = Counter(results).most_common()[0][0]
28 | most_common_class = re.findall(r'"[\w]+"', most_common_result)[0].replace('"', '')
29 | x_path = './tr[@class="' + most_common_class + '"]'
30 | post_list = table.xpath(x_path)
31 | posts = ''.join([post.extract() for post in post_list])
32 |
33 | # Create PostsItem item and assign variables
34 | posts_item = PostsItem()
35 | posts_item['page_number'] = response.request.meta['page_number']
36 | posts_item['posts'] = posts
37 | yield posts_item
38 |
39 |
--------------------------------------------------------------------------------
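BitcointalkSpider.parse above keeps only the rows whose class attribute is the most frequent one in the table, on the assumption that genuine user posts all share a single row class while ads, separators and deleted posts do not. The following tiny standalone illustration of that heuristic is not part of the repository; the HTML string and class names are made up.

```python
import re
from collections import Counter

# Three rows: two genuine posts sharing one class, one ad row with another
joined = ('<tr class="windowbg">post 1</tr>'
          '<tr class="ad">advert</tr>'
          '<tr class="windowbg">post 2</tr>')

results = re.findall(r'<tr class="[\w]+">', joined)         # opening tags with their class
most_common = Counter(results).most_common()[0][0]          # '<tr class="windowbg">'
post_class = re.findall(r'"[\w]+"', most_common)[0].strip('"')
print(post_class)  # windowbg -> used to build the XPath './tr[@class="windowbg"]'
```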
/bitcointalk_ANN/spiders/old_files/runfile.py:
--------------------------------------------------------------------------------
1 | import pickle
2 | from lxml import etree
3 | from fetch_css import *
4 | import urllib2
5 | from bs4 import BeautifulSoup
6 | import subprocess
7 |
8 | # Prompt the user for input (via command prompt)
9 | crypto_currency = raw_input("Enter the name of the crypto economic protocol: ")
10 |
11 | # Get base url for the bitcointalk [ANN] from coinmarketcap.com
12 | url = r'https://coinmarketcap.com/currencies/' + crypto_currency
13 | response = urllib2.urlopen(url)
14 | soup = BeautifulSoup(response, 'lxml')
15 | base_url = soup.find('a', href=True, text='Announcement')['href']
16 |
17 | # Extract the number of pages in the bitcointalk thread
18 | response = urllib2.urlopen(base_url)
19 | html_parser = etree.HTMLParser()
20 | tree = etree.parse(response, html_parser)
21 | table = tree.xpath('//div[@id="bodyarea"]/table')[0]
22 | num_pages = max([int(x) for x in table.xpath('./tr/td/a/text()')])
23 |
24 | # Create list of page urls in thread for spider to parse and then pickle the list
25 | name_urls = [crypto_currency] + [base_url] + [base_url[:-1] + str(int(20 * (i - 1))) for i in range(2, num_pages + 1)]
26 | path = r'C:\Users\Shasa\Documents\Projects\bitcointalk_ANN\bitcointalk_ANN\name_urls.pkl'
27 | output = open(path, 'wb')
28 | pickle.dump(name_urls, output)
29 | output.close()
30 |
31 | # Extract the CSS of the bitcointalk webpage and write to file
32 | print('---------------------------')
33 | print('Extracting CSS...')
34 | print('---------------------------')
35 | write_css(crypto_currency, base_url)
36 |
37 | # python 3.5+
38 | # subprocess.run(['scrapy crawl bitcointalk'])
39 |
--------------------------------------------------------------------------------
/bitcointalk_ANN/spiders/old_files/posts_spider.py:
--------------------------------------------------------------------------------
1 | import scrapy
2 | import os
3 |
4 |
5 | class PostsSpider(scrapy.Spider):
6 | name = "posts"
7 |
8 | def start_requests(self):
9 | try:
10 | path = r"C:\Users\Shasa\Documents\Projects\bitcointalk_ANN\posts.txt"
11 | os.remove(path)
12 | except OSError:
13 | pass
14 |
15 | urls = [
16 | 'https://bitcointalk.org/index.php?topic=421615.0'
17 | ]
18 | for url in urls:
19 | yield scrapy.Request(url=url, callback=self.parse)
20 |
21 | def parse(self, response):
22 |
23 | filename = 'post.txt'
24 | with open(filename, 'a') as f:
25 |
26 | # get title of ANN thread and write
27 | title = response.xpath('//title/text()').extract_first()
28 | f.write(title.encode('utf8'))
29 | f.write("\n")
30 |
31 | # extract table of posts
32 | table = response.xpath('//div[@id="bodyarea"]/form[@id="quickModForm"]/table')[0]
33 |
34 | # Get the info on the posters
35 | poster_info = table.css('.poster_info')
36 |
37 | # Get the info on the post
38 | post_info = table.css('.td_headerandpost .post')
39 |
40 | for s in poster_info:
41 | username = s.css('a::text').extract()[0]
42 | user_info = s.css('.smalltext').xpath('./text()').re('[ \w . \w ]+')[:3]
43 |
44 |
45 | # write username, rank level, user activity
46 | if any(c.isalpha() for c in username):
47 | f.write(username.encode('utf8'))
48 | f.write(',')
49 |
50 | for ss in user_info:
51 | f.write(ss.encode('utf8'))
52 | f.write(',')
53 | f.write("\n")
54 |
55 | self.log('Saved file %s' % filename)
56 |
--------------------------------------------------------------------------------
/bitcointalk_ANN/middlewares.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 |
3 | # Define here the models for your spider middleware
4 | #
5 | # See documentation in:
6 | # http://doc.scrapy.org/en/latest/topics/spider-middleware.html
7 |
8 | from scrapy import signals
9 |
10 |
11 | class BitcointalkAnnSpiderMiddleware(object):
12 | # Not all methods need to be defined. If a method is not defined,
13 | # scrapy acts as if the spider middleware does not modify the
14 | # passed objects.
15 |
16 | @classmethod
17 | def from_crawler(cls, crawler):
18 | # This method is used by Scrapy to create your spiders.
19 | s = cls()
20 | crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
21 | return s
22 |
23 | def process_spider_input(self, response, spider):
24 | # Called for each response that goes through the spider
25 | # middleware and into the spider.
26 |
27 | # Should return None or raise an exception.
28 | return None
29 |
30 | def process_spider_output(self, response, result, spider):
31 | # Called with the results returned from the Spider, after
32 | # it has processed the response.
33 |
34 | # Must return an iterable of Request, dict or Item objects.
35 | for i in result:
36 | yield i
37 |
38 | def process_spider_exception(self, response, exception, spider):
39 | # Called when a spider or process_spider_input() method
40 | # (from other spider middleware) raises an exception.
41 |
42 | # Should return either None or an iterable of Response, dict
43 | # or Item objects.
44 | pass
45 |
46 | def process_start_requests(self, start_requests, spider):
47 | # Called with the start requests of the spider, and works
48 | # similarly to the process_spider_output() method, except
49 | # that it doesn’t have a response associated.
50 |
51 | # Must return only requests (not items).
52 | for r in start_requests:
53 | yield r
54 |
55 | def spider_opened(self, spider):
56 | spider.logger.info('Spider opened: %s' % spider.name)
57 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # bitcointalk-ANN
2 | Aim: to scrape a 1000+ page bitcointalk [ANN] thread into a single, highly readable html document for better reading and analysis.
3 | An example of such a thread: https://bitcointalk.org/index.php?topic=421615.20
4 |
5 | ## Introduction
6 | Reading the bitcointalk [ANN] thread for a crypto-currency is a useful tool for (investment) analysis of that crypto-currency.
7 | The issues faced by a reader of a bitcointalk [ANN] thread are:
8 | 1. There are often 1000+ pages in the [ANN] thread, so you have to click the 'next' button 1000+ times
9 | 2. There are ads, user footers/mottos, icons, etc. that affect readability
10 | 3. The styling is unappealing
11 |
12 | ## Timeline
13 | The three issues outlined above define the timeline of the project.
14 | The first challenge has been addressed and completed.
15 |
16 | ### To Do.
17 | 1. Remove the ads, annoying icons, user footers and mottos from the document
18 | 2. Make the styling attractive and highly readable (think medium.com)
19 |
20 |
21 | ## Install / Use
22 |
23 | #### Install packages
24 | * Scrapy (https://scrapy.org/): `pip install scrapy`
25 | * BeautifulSoup (https://www.crummy.com/software/BeautifulSoup/bs4/doc/): `pip install beautifulsoup4`
26 | * lxml (http://lxml.de/installation.html): `pip install lxml`
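Or install all three in one go (the same packages as above, combined into a single command):
```
$ pip install scrapy beautifulsoup4 lxml
```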
27 |
28 | #### Step 1.
29 | Create a new directory (folder) on your computer
30 |
31 | #### Step 2.
32 | Clone the repository into this new directory on your computer
33 |
34 | #### Step 3.
35 | Open the command prompt in this new directory
36 |
37 | #### Step 4.
38 | Enter:
39 | ```
40 | $ python runfile.py
41 | ```
42 | * *The command prompt will ask you to enter the name of the crypto-currency you want to create the [ANN] document for.*
43 | * *This command should take 1-3 seconds to run*
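For example, a run might look like the following (the prompt text is taken from the scripts in `spiders/old_files`; the current helper.py may word it differently, and "pinkcoin" is just an illustrative currency name):
```
$ python runfile.py
Enter the name of the crypto economic protocol: pinkcoin
```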
44 |
45 | #### Step 5.
46 | Enter:
47 | ```
48 | $scrapy crawl bitcointalk
49 | ```
50 | * *This command will run the spider*
51 | * *This command will take much longer to run (it depends highly on the number of webpages the spider has to parse)*
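Among Scrapy's usual log output you should see one line per saved page, printed by `PostPipeline` in pipelines.py (the page numbers here are illustrative and may not arrive in order):
```
Saving page 0
Saving page 1
Saving page 2
```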
52 |
53 | #### Step 6.
54 | After the spider has finished running, provided there were no errors, an .html document will have been created in the top-level directory you created in Step 1.
55 |
--------------------------------------------------------------------------------
/bitcointalk_ANN/spiders/old_files/bitcointalk_spider.py:
--------------------------------------------------------------------------------
1 | from bs4 import BeautifulSoup
2 | import os
3 | import urllib2
4 | from lxml import etree
5 | import scrapy
6 | import add_css
7 |
8 | # Prompt the user for input (via command prompt)
9 | crypto_currency = raw_input("Enter the name of the crypto economic protocol: ").lower()
10 | print(r'Parsing ' + r'"https://coinmarketcap.com/currencies/' + crypto_currency + r'"...')
11 |
12 | # Get base url from coinmarketcap.com
13 | url = r'https://coinmarketcap.com/currencies/' + crypto_currency
14 | response = urllib2.urlopen(url)
15 | soup = BeautifulSoup(response, 'lxml')
16 | base_url = soup.find('a', href=True, text='Announcement')['href']
17 |
18 | # Extract the number of pages in the bitcointalk.com thread
19 | forum_response = urllib2.urlopen(base_url)
20 | html_parser = etree.HTMLParser()
21 | tree = etree.parse(forum_response, html_parser)
22 | table = tree.xpath('//div[@id="bodyarea"]/table')[0]
23 |
24 | num_pages = []
25 | for x in table.xpath('./tr/td/a/text()'):
26 | try:
27 | num_pages.append(int(x))
28 | except ValueError:
29 | pass
30 | num_pages = max(num_pages)
31 | urls = [base_url] + [base_url[:-1] + str(int(20 * (i - 1))) for i in range(2, num_pages + 1)]
32 | urls = urls[:2]
33 |
34 | # Extract the CSS pages
35 | add_css.write_css(crypto_currency, base_url)
36 |
37 | class BitcointalkSpider(scrapy.Spider):
38 | name = "bitcointalk"
39 |
40 | def start_requests(self):
41 |
42 | # Delete html file for the crypto-currency if exists
43 | try:
44 | base = r'C:\Users\Shasa\PycharmProjects\bitcointalk\bitcointalk_ANN'
45 | path = os.path.join(base, (crypto_currency + r'.html'))
46 | os.remove(path)
47 | except OSError:
48 | pass
49 |
50 | # Parse urls
51 | for url in urls:
52 | yield scrapy.Request(url=url, callback=self.parse)
53 |
54 | def parse(self, response):
55 |
56 | # The posts from the webpage
57 | table = response.xpath('//div[@id="bodyarea"]/form[@id="quickModForm"]/table')[0]
58 | # posts = table.xpath('./tr')
59 |
60 | filename = crypto_currency + '.html'
61 | with open(filename, 'a') as f:
62 | #f.write(BeautifulSoup(table.extract(), 'lxml').encode('utf8'))
63 |
64 | posts = table.xpath('./tr')
65 | for post in posts:
66 | f.write(BeautifulSoup(post.extract(), 'lxml').encode('utf8'))
67 |
68 | # x = list(table.xpath('./tr'))
69 | # re.findall(r'<tr class="[\w]+">', x)
70 |
71 | f.close()
72 | self.log('Saved file %s' % filename)
73 |
--------------------------------------------------------------------------------
/bitcointalk_ANN/spiders/old_files/bitcointalk_spider_test.py:
--------------------------------------------------------------------------------
1 | from bs4 import BeautifulSoup
2 | import os
3 | import urllib2
4 | from lxml import etree
5 | import scrapy
6 | import add_css
7 | import re
8 | from collections import Counter
9 |
10 | # Prompt the user for input (via command prompt)
11 | #crypto_currency = raw_input("Enter the name of the crypto economic protocol: ").lower()
12 | #print(r'Parsing ' + r'"https://coinmarketcap.com/currencies/' + crypto_currency + r'"...')
13 |
14 | # Get base url from coinmarketcap.com
15 | crypto_currency = 'pinkcoin'
16 |
17 | response = urllib2.urlopen(r'https://coinmarketcap.com/currencies/' + crypto_currency)
18 | soup = BeautifulSoup(response, 'lxml')
19 | base_url = soup.find('a', href=True, text='Announcement')['href']
20 |
21 | # Extract the number of pages in the bitcointalk.com thread
22 | forum_response = urllib2.urlopen(base_url)
23 | html_parser = etree.HTMLParser()
24 | tree = etree.parse(forum_response, html_parser)
25 | index_table = tree.xpath('//div[@id="bodyarea"]/table')[0]
26 |
27 | num_pages = []
28 | for x in index_table.xpath('./tr/td/a/text()'):
29 | try:
30 | num_pages.append(int(x))
31 | except ValueError:
32 | pass
33 | num_pages = max(num_pages)
34 | urls = [base_url] + [base_url[:-1] + str(int(20 * (i - 1))) for i in range(2, num_pages + 1)]
35 |
36 |
37 | class BitcointalkSpider(scrapy.Spider):
38 | name = "bitcointalkTest"
39 |
40 | def start_requests(self):
41 |
42 | # Delete html file for the crypto-currency if exists
43 | try:
44 | base = r'C:\Users\Shasa\PycharmProjects\bitcointalk\bitcointalk_ANN'
45 | path = os.path.join(base, (crypto_currency + r'.html'))
46 | os.remove(path)
47 | except OSError:
48 | pass
49 |
50 | with open(r'./style.html', 'r') as f:
51 | style = f.read()
52 | f.close()
53 |
54 | with open(crypto_currency + '.html', 'a') as f:
55 | f.write(style)
56 | f.close()
57 |
58 | # Parse urls
59 | for i, url in enumerate(urls):
60 | yield scrapy.Request(url=url, meta={'priority': i}, callback=self.parse, )
61 |
62 | def parse(self, response):
63 |
64 | # We only want user posts (no ads, deleted posts etc)
65 | table = response.xpath('//div[@id="bodyarea"]/form[@id="quickModForm"]/table')[0]
66 | rows = list(table.xpath('./tr'))
67 | joined = ''.join([str(row) for row in rows])
68 | results = re.findall(r'<tr class="[\w]+">', joined)