├── .gitignore ├── README.md ├── amazon ├── amazon │ ├── __init__.py │ ├── __pycache__ │ │ ├── __init__.cpython-36.pyc │ │ ├── helper.cpython-36.pyc │ │ ├── items.cpython-36.pyc │ │ ├── settings.cpython-36.pyc │ │ └── sql.cpython-36.pyc │ ├── helper.py │ ├── items.py │ ├── main.py │ ├── middlewares │ │ ├── AmazonSpiderMiddleware.py │ │ ├── ProxyMiddleware.py │ │ ├── RotateUserAgentMiddleware.py │ │ └── __init__.py │ ├── mysqlpipelines │ │ ├── __init__.py │ │ ├── __pycache__ │ │ │ ├── __init__.cpython-36.pyc │ │ │ └── pipelines.cpython-36.pyc │ │ ├── pipelines.py │ │ └── sql.py │ ├── pipelines.py │ ├── proxy.json │ ├── settings-demo.py │ ├── settings.py │ ├── spiders │ │ ├── __init__.py │ │ ├── asin_spider.py │ │ ├── cate_spider.py │ │ ├── detail_spider.py │ │ ├── keyword_ranking_spider.py │ │ ├── proxy │ │ │ ├── __init__.py │ │ │ ├── __pycache__ │ │ │ │ ├── __init__.cpython-36.pyc │ │ │ │ ├── fineproxy_spider.cpython-36.pyc │ │ │ │ └── kuaidaili_spider.cpython-36.pyc │ │ │ ├── fineproxy_spider.py │ │ │ ├── kuaidaili_spider.py │ │ │ └── privateproxy_spider.py │ │ ├── reivew_profile_spider.py │ │ ├── review_detail_spider.py │ │ └── sales_ranking_spider.py │ └── sql.py ├── db │ ├── ipricejot.sql │ └── py_salesranking_and_review.sql ├── requirements.txt └── scrapy.cfg └── amazon2 ├── __init__.py ├── amazon2 ├── __init__.py ├── items.py ├── middlewares │ ├── AmazonSpiderMiddleware.py │ ├── RotateUserAgentMiddleware.py │ └── __init__.py ├── pipelines.py ├── settings.py └── spiders │ ├── AmazonBaseSpider.py │ ├── DemoSpider.py │ └── __init__.py ├── requirements.txt └── scrapy.cfg /.gitignore: -------------------------------------------------------------------------------- 1 | .idea 2 | **/__pycache__ 3 | amazon/amazon/*.json 4 | /amazon/amazon/settings.py 5 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # amazon-scrapy 2 | Scrape the details and lowest prices of Amazon best-seller products with a Python spider 3 | 4 | Still under development, welcome to join! 5 | 6 | Some of this code has been applied to pricejot.com. 7 | https://www.pricejot.com/ 8 | 9 | 10 | 11 | ## How to install 12 | Install Python 3.6 13 | pip install -r requirements.txt 14 | 15 | If you are a contributor, 16 | we recommend installing pipreqs (pip install pipreqs) and 17 | running "pipreqs /path/to/project" whenever you add new packages to the code 18 | 19 | 20 | ## Source url 21 | * list page 22 | https://www.amazon.com/Best-Sellers/zgbs/ updated every two hours. 23 | * price detail page 24 | https://www.amazon.com/gp/offer-listing/B01FCTAEK4 25 | 26 | 27 | ## TODO 28 | 1. Scrape the reviews by ASIN https://www.amazon.com/Libratone-ONE-Click-Caribbean-Green/product-reviews/B00ZU33BBC 29 | 2. Scrape the details from the ASIN pool (50% done) 30 | 31 | 32 | ## Done 33 | 1. Scrape the level-1 categories from https://www.amazon.com/Best-Sellers/zgbs/ and store them in MySQL 34 | 2. Scrape the best-seller ASINs from https://www.amazon.com/Best-Sellers/zgbs/ and store them in MySQL 35 | 3. Scrape the keyword rankings by ASIN 36 | 37 | 38 | ## Reference documents 39 | * python https://bop.molun.net/ 40 | * scrapy https://docs.scrapy.org/en/latest/ 41 | 42 | ## About proxy service 43 | 1. We recommend swiftproxy https://www.swiftproxy.net/ 44 | 2. We fetch a random proxy list from around the world, e.g.: https://www.swiftproxy.net/api/proxy/get_proxy_ip?num=100&regions=GLOBAL&protocol=http&return_type=txt&lb=1&sb= 45 | 3. Note that you need to log in and set the whitelist IP (white_ip) before the proxy can be used.
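## How to run (example)

The spiders are launched one at a time with Scrapy's crawl command; amazon/amazon/main.py shows the Python equivalent. The snippet below is only an illustrative sketch: the spider names come from the classes under amazon/amazon/spiders/, the file name run_example.py is hypothetical, and the ASIN is the sample one from the TODO list above.

```python
# run_example.py (hypothetical), mirroring amazon/amazon/main.py.
# scrapy.cmdline.execute() starts the crawler and does not return, so run one
# spider per process. Run cate -> asin -> detail in that order, because each
# later spider reads what the previous one stored in MySQL.
from scrapy.cmdline import execute

execute("scrapy crawl cate".split())
# Other spider names: asin, detail, keyword_ranking, sales_ranking, or e.g.
# execute("scrapy crawl review_detail -a asin=B00ZU33BBC -a daily=1".split())
```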
46 | 47 | 48 | ## Contact us 49 | Email: huangdingzhi@foxmail.com 50 | Wechat: dzdzhuang 51 | 52 | Based on Python 3.6 53 | 54 | ## License 55 | 56 | The MIT License (http://opensource.org/licenses/MIT) 57 | 58 | Please feel free to use and contribute to the development. 59 | 60 | ## Contribution 61 | 62 | If you have any ideas or suggestions to improve amazon-scrapy, you are welcome to submit an issue or pull request. 63 | 64 | ## Backer 65 | > Your support is our greatest motivation to keep going. 66 | 67 | Donate 68 | Donate 69 | 70 | ### Thanks 71 | 72 | | Backer | Fee | 73 | | --- | --- | 74 | | Jet.Zhang | ¥18.88 | 75 | | Mike | ¥18.88 | 76 | | 飞龙 | ¥5 | 77 | 78 | -------------------------------------------------------------------------------- /amazon/amazon/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dynamohuang/amazon-scrapy/33afe5b482d3d55a065084289e20540ce6d99081/amazon/amazon/__init__.py -------------------------------------------------------------------------------- /amazon/amazon/__pycache__/__init__.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dynamohuang/amazon-scrapy/33afe5b482d3d55a065084289e20540ce6d99081/amazon/amazon/__pycache__/__init__.cpython-36.pyc -------------------------------------------------------------------------------- /amazon/amazon/__pycache__/helper.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dynamohuang/amazon-scrapy/33afe5b482d3d55a065084289e20540ce6d99081/amazon/amazon/__pycache__/helper.cpython-36.pyc -------------------------------------------------------------------------------- /amazon/amazon/__pycache__/items.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dynamohuang/amazon-scrapy/33afe5b482d3d55a065084289e20540ce6d99081/amazon/amazon/__pycache__/items.cpython-36.pyc -------------------------------------------------------------------------------- /amazon/amazon/__pycache__/settings.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dynamohuang/amazon-scrapy/33afe5b482d3d55a065084289e20540ce6d99081/amazon/amazon/__pycache__/settings.cpython-36.pyc -------------------------------------------------------------------------------- /amazon/amazon/__pycache__/sql.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dynamohuang/amazon-scrapy/33afe5b482d3d55a065084289e20540ce6d99081/amazon/amazon/__pycache__/sql.cpython-36.pyc -------------------------------------------------------------------------------- /amazon/amazon/helper.py: -------------------------------------------------------------------------------- 1 | import datetime 2 | 3 | import re 4 | import pytz 5 | 6 | from math import ceil 7 | from random import Random 8 | 9 | from amazon import settings 10 | 11 | 12 | class Helper(object): 13 | tz = pytz.timezone(settings.TIMEZONE) 14 | 15 | @classmethod 16 | def get_num_split_comma(cls, value): 17 | # num = value.split(',') 18 | # ranknum = '' 19 | # if len(num) > 1: 20 | # for n in num: 21 | # ranknum += n 22 | # return ranknum 23 | # else: 24 | # return value 25 | 26 | return value.replace(',', '') 27 | 28 | @classmethod 29 | def 
get_star_split_str(cls, value): 30 | rate = value.split('out of 5 stars') # 分割字符串 31 | return rate[0].strip() 32 | 33 | @classmethod 34 | def get_date_split_str(cls, value): 35 | return value.split('on')[1].strip() 36 | 37 | @classmethod 38 | def convert_date_str(cls, date_str): 39 | return datetime.datetime.strptime(date_str, '%B %d, %Y') 40 | 41 | @classmethod 42 | def delay_forty_days(cls): 43 | now = datetime.datetime.now() 44 | delay14 = now + datetime.timedelta(days=-40) # 计算往前40天之后的时间 45 | return delay14 46 | 47 | @classmethod 48 | def get_rank_classify(cls, spider_str): 49 | result = re.match(r'#([0-9,]+)(?:.*)in (.*) \(.*[Ss]ee [Tt]op.*\)', spider_str) 50 | return result.groups() 51 | 52 | @classmethod 53 | def get_keyword_page_num(cls, rank): 54 | page_num = ceil(int(rank) / 16) 55 | return page_num 56 | 57 | @classmethod 58 | def get_keyword_page_range(cls, page_num): 59 | return range(page_num - 4 if page_num - 4 > 0 else 1, page_num + 4 if page_num + 4 <= 20 else 20) 60 | 61 | @classmethod 62 | def random_str(cls, randomlength): 63 | str = '' 64 | chars = 'AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz0123456789' 65 | length = len(chars) - 1 66 | random = Random() 67 | for i in range(randomlength): 68 | str += chars[random.randint(0, length)] 69 | return str 70 | 71 | @classmethod 72 | def get_now_date(cls): 73 | now = datetime.datetime.now(cls.tz).strftime('%Y-%m-%d %H:%M:%S') 74 | return now 75 | -------------------------------------------------------------------------------- /amazon/amazon/items.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Define here the models for your scraped items 4 | # 5 | # See documentation in: 6 | # http://doc.scrapy.org/en/latest/topics/items.html 7 | 8 | import scrapy 9 | 10 | 11 | 12 | class CateItem(scrapy.Item): 13 | title = scrapy.Field() 14 | link = scrapy.Field() 15 | level = scrapy.Field() 16 | pid = scrapy.Field() 17 | pass 18 | 19 | class AsinBestItem(scrapy.Item): 20 | asin = scrapy.Field() 21 | cid = scrapy.Field() 22 | rank = scrapy.Field() 23 | pass 24 | 25 | class DetailItem(scrapy.Item): 26 | asin = scrapy.Field() 27 | image = scrapy.Field() 28 | title = scrapy.Field() 29 | star = scrapy.Field() 30 | reviews = scrapy.Field() 31 | seller_price = scrapy.Field() 32 | amazon_price = scrapy.Field() 33 | pass 34 | 35 | class ReviewProfileItem(scrapy.Item): 36 | asin = scrapy.Field() 37 | product = scrapy.Field() 38 | brand = scrapy.Field() 39 | seller = scrapy.Field() 40 | image = scrapy.Field() 41 | review_total = scrapy.Field() 42 | review_rate = scrapy.Field() 43 | pct_five = scrapy.Field() 44 | pct_four = scrapy.Field() 45 | pct_three = scrapy.Field() 46 | pct_two = scrapy.Field() 47 | pct_one = scrapy.Field() 48 | pass 49 | 50 | 51 | class ReviewDetailItem(scrapy.Item): 52 | asin = scrapy.Field() 53 | review_id = scrapy.Field() 54 | reviewer = scrapy.Field() 55 | review_url = scrapy.Field() 56 | star = scrapy.Field() 57 | date = scrapy.Field() 58 | title = scrapy.Field() 59 | content = scrapy.Field() 60 | pass 61 | 62 | 63 | class KeywordRankingItem(scrapy.Item): 64 | skwd_id = scrapy.Field() 65 | rank = scrapy.Field() 66 | date = scrapy.Field() 67 | 68 | 69 | class SalesRankingItem(scrapy.Item): 70 | rank = scrapy.Field() 71 | classify = scrapy.Field() 72 | asin = scrapy.Field() 73 | -------------------------------------------------------------------------------- /amazon/amazon/main.py: 
-------------------------------------------------------------------------------- 1 | from scrapy.cmdline import execute 2 | #Just run once 3 | execute("scrapy crawl detail ".split()) 4 | -------------------------------------------------------------------------------- /amazon/amazon/middlewares/AmazonSpiderMiddleware.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Define here the models for your spider middleware 4 | # 5 | # See documentation in: 6 | # http://doc.scrapy.org/en/latest/topics/spider-middleware.html 7 | 8 | from scrapy import signals 9 | from datetime import datetime 10 | 11 | 12 | class AmazonSpiderMiddleware(object): 13 | # Not all methods need to be defined. If a method is not defined, 14 | # scrapy acts as if the spider middleware does not modify the 15 | # passed objects. 16 | 17 | @classmethod 18 | def from_crawler(cls, crawler): 19 | # This method is used by Scrapy to create your spiders. 20 | s = cls() 21 | crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) 22 | return s 23 | 24 | def process_spider_input(self, response, spider): 25 | # Called for each response that goes through the spider 26 | # middleware and into the spider. 27 | 28 | # Should return None or raise an exception. 29 | return None 30 | 31 | def process_spider_output(self, response, result, spider): 32 | # Called with the results returned from the Spider, after 33 | # it has processed the response. 34 | 35 | # Must return an iterable of Request, dict or Item objects. 36 | for i in result: 37 | yield i 38 | 39 | def process_spider_exception(self, response, exception, spider): 40 | # Called when a spider or process_spider_input() method 41 | # (from other spider middleware) raises an exception. 42 | 43 | # Should return either None or an iterable of Response, dict 44 | # or Item objects. 45 | pass 46 | 47 | def process_start_requests(self, start_requests, spider): 48 | # Called with the start requests of the spider, and works 49 | # similarly to the process_spider_output() method, except 50 | # that it doesn’t have a response associated. 51 | 52 | # Must return only requests (not items). 
53 | for r in start_requests: 54 | yield r 55 | 56 | def spider_opened(self, spider): 57 | spider.started_on = datetime.now() 58 | 59 | 60 | -------------------------------------------------------------------------------- /amazon/amazon/middlewares/ProxyMiddleware.py: -------------------------------------------------------------------------------- 1 | import json 2 | import random 3 | import redis 4 | import time 5 | from amazon import settings 6 | 7 | 8 | class ProxyMiddleware(object): 9 | def __init__(self): 10 | with open('proxy.json', 'r') as f: 11 | self.proxies = json.load(f) 12 | self.r = redis.Redis(host=settings.REDIS_HOST, port=settings.REDIS_PORT, db=settings.REDIS_DB, 13 | password=settings.REDIS_PASSWORD) 14 | 15 | def process_request(self, request, spider): 16 | while True: 17 | proxy = random.choice(self.proxies) 18 | if self.proxyReady(proxy): 19 | request.meta['proxy'] = 'http://{}'.format(proxy) 20 | break 21 | 22 | def proxyReady(self, proxy): 23 | key = proxy + settings.BOT_NAME 24 | retult = self.r.exists(key) 25 | if retult: 26 | return False 27 | else: 28 | self.r.setex(key, 1, 15) 29 | return True 30 | -------------------------------------------------------------------------------- /amazon/amazon/middlewares/RotateUserAgentMiddleware.py: -------------------------------------------------------------------------------- 1 | import random 2 | from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware 3 | 4 | 5 | class RotateUserAgentMiddleware(UserAgentMiddleware): 6 | def __init__(self, user_agent=''): 7 | UserAgentMiddleware.__init__(self) 8 | self.user_agent = user_agent 9 | 10 | def process_request(self, request, spider): 11 | ua = random.choice(self.user_agent_list) 12 | if ua: 13 | # print(ua) 14 | request.headers.setdefault('User-Agent', ua) 15 | 16 | # the default user_agent_list composes chrome,I E,firefox,Mozilla,opera,netscape 17 | # for more user agent strings,you can find it in http://www.useragentstring.com/pages/useragentstring.php 18 | user_agent_list = [ 19 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1" \ 20 | "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11", \ 21 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6", \ 22 | "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6", \ 23 | "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1", \ 24 | "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5", \ 25 | "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5", \ 26 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", \ 27 | "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", \ 28 | "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", \ 29 | "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", \ 30 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", \ 31 | "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", \ 32 | "Mozilla/5.0 (Windows NT 
6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", \ 33 | "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", \ 34 | "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3", \ 35 | "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24", \ 36 | "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24" 37 | ] 38 | -------------------------------------------------------------------------------- /amazon/amazon/middlewares/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dynamohuang/amazon-scrapy/33afe5b482d3d55a065084289e20540ce6d99081/amazon/amazon/middlewares/__init__.py -------------------------------------------------------------------------------- /amazon/amazon/mysqlpipelines/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dynamohuang/amazon-scrapy/33afe5b482d3d55a065084289e20540ce6d99081/amazon/amazon/mysqlpipelines/__init__.py -------------------------------------------------------------------------------- /amazon/amazon/mysqlpipelines/__pycache__/__init__.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dynamohuang/amazon-scrapy/33afe5b482d3d55a065084289e20540ce6d99081/amazon/amazon/mysqlpipelines/__pycache__/__init__.cpython-36.pyc -------------------------------------------------------------------------------- /amazon/amazon/mysqlpipelines/__pycache__/pipelines.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dynamohuang/amazon-scrapy/33afe5b482d3d55a065084289e20540ce6d99081/amazon/amazon/mysqlpipelines/__pycache__/pipelines.cpython-36.pyc -------------------------------------------------------------------------------- /amazon/amazon/mysqlpipelines/pipelines.py: -------------------------------------------------------------------------------- 1 | from scrapy.exceptions import DropItem 2 | 3 | from amazon.helper import Helper 4 | from amazon.sql import ReviewSql, RankingSql 5 | from .sql import Sql 6 | from amazon.items import CateItem, ReviewProfileItem, ReviewDetailItem, SalesRankingItem, KeywordRankingItem 7 | from amazon.items import AsinBestItem 8 | from amazon.items import DetailItem 9 | 10 | class AmazonPipeline(object): 11 | def process_item(self,item,spider): 12 | if isinstance(item,CateItem): 13 | Sql.insert_cate_log(item) 14 | print('save category: '+ item['title']) 15 | pass 16 | 17 | if isinstance(item,AsinBestItem): 18 | Sql.cache_best_asin(item) 19 | print('save best seller: '+item['asin']) 20 | pass 21 | 22 | if isinstance(item, ReviewProfileItem): 23 | ReviewSql.insert_profile_item(item) 24 | return item 25 | 26 | if isinstance(item, ReviewDetailItem): 27 | delay_date = Helper.delay_forty_days() # 40天的截止时间 28 | item_date = Helper.convert_date_str(item['date']) 29 | if item_date < delay_date: # 判断是否过了40天限额,如果超出范围 则抛弃此item 30 | raise DropItem('the review_id:[%s] has been expired' % item['review_id']) 31 | else: 32 | item['review_url'] = 'https://www.amazon.com' + item['review_url'] 33 | item['date'] = item_date.strftime('%Y-%m-%d') 34 | ReviewSql.insert_detail_item(item) 35 | 36 | return item 37 | 38 | if isinstance(item, 
SalesRankingItem): 39 | RankingSql.insert_sales_ranking(item) 40 | return item 41 | 42 | if isinstance(item, KeywordRankingItem): 43 | RankingSql.insert_keyword_ranking(item) 44 | return item 45 | 46 | if isinstance(item, DetailItem): 47 | return item 48 | 49 | pass 50 | 51 | 52 | 53 | 54 | 55 | -------------------------------------------------------------------------------- /amazon/amazon/mysqlpipelines/sql.py: -------------------------------------------------------------------------------- 1 | import pymysql 2 | from amazon import settings 3 | 4 | 5 | db = pymysql.connect(settings.MYSQL_HOST, settings.MYSQL_USER, settings.MYSQL_PASSWORD, settings.MYSQL_DB, charset=settings.MYSQL_CHARSET, cursorclass=pymysql.cursors.DictCursor) 6 | cursor = db.cursor() 7 | 8 | 9 | class Sql: 10 | 11 | asin_pool = [] 12 | 13 | @classmethod 14 | def insert_cate_log(cls, item): 15 | sql = "INSERT INTO py_cates (title,link,level,pid) VALUES ('%s', '%s','%d','%d')" % (item['title'],item['link'],item['level'],item['pid']) 16 | try: 17 | cursor.execute(sql) 18 | db.commit() 19 | except: 20 | db.rollback() 21 | pass 22 | 23 | @classmethod 24 | def clear_cate(cls, level): 25 | sql = "truncate table py_cates" 26 | try: 27 | cursor.execute(sql) 28 | db.commit() 29 | except: 30 | db.rollback() 31 | pass 32 | 33 | @classmethod 34 | def cache_best_asin(cls, item): 35 | cls.asin_pool.append((item['asin'], item['cid'], item['rank'])) 36 | pass 37 | 38 | @classmethod 39 | def store_best_asin(cls): 40 | sql_clear = "truncate table py_asin_best" 41 | sql = "INSERT INTO py_asin_best (asin,cid,rank) VALUES (%s, %s, %s)" 42 | try: 43 | cursor.execute(sql_clear) 44 | cursor.executemany(sql,cls.asin_pool) 45 | db.commit() 46 | except Exception as err: 47 | print(err) 48 | db.rollback() 49 | pass 50 | 51 | @classmethod 52 | def findall_cate_level1(cls): 53 | sql = "SELECT id,link FROM py_cates WHERE level < 2" 54 | cursor.execute(sql) 55 | return cursor.fetchall() 56 | 57 | @classmethod 58 | def findall_asin_level1(cls): 59 | sql = "SELECT distinct(asin), cid FROM py_asin_best limit 0,300" 60 | cursor.execute(sql) 61 | return cursor.fetchall() 62 | 63 | 64 | 65 | 66 | -------------------------------------------------------------------------------- /amazon/amazon/pipelines.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Define your item pipelines here 4 | # 5 | # Don't forget to add your pipeline to the ITEM_PIPELINES setting 6 | # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html 7 | 8 | 9 | class AmazonPipeline(object): 10 | def process_item(self, item, spider): 11 | return item 12 | 13 | 14 | -------------------------------------------------------------------------------- /amazon/amazon/proxy.json: -------------------------------------------------------------------------------- 1 | ["121.40.199.105:80", "210.16.120.244:3128", "139.59.125.112:8080", "128.199.75.94:80", "128.199.138.78:8080", "139.59.243.186:443", "128.199.193.202:3128", "128.199.74.233:8080", "47.52.96.5:80", "128.199.191.123:3128", "139.59.117.11:3128", "54.158.134.115:80", "120.25.211.80:9999", "209.198.197.165:80", "120.24.208.42:9999", "195.222.68.87:3128", "180.254.219.129:80", "104.131.63.78:3128", "104.131.132.131:3128", "165.227.124.179:3128", "119.81.197.124:3128", "188.166.221.165:3128", "36.85.246.175:80", "94.181.196.171:5328", "125.162.153.208:80", "36.83.66.57:80", "139.59.21.143:3128", "139.59.125.112:80", "165.227.92.177:80", "167.114.221.238:3128", 
"94.23.157.1:8081", "149.202.195.236:443", "95.189.123.74:8080", "41.78.25.185:3128", "61.135.155.82:443", "124.47.7.45:80", "125.99.100.200:8080", "178.27.197.105:80", "190.153.210.237:80", "95.110.189.166:80", "91.93.135.47:80", "95.110.189.185:80", "82.165.151.230:80", "82.67.68.28:80", "177.4.173.242:80", "123.59.51.130:8080", "36.85.246.175:80", "196.43.197.25:80", "128.199.169.17:80", "40.114.14.173:80", "64.237.61.242:80", "31.14.40.113:3128", "104.155.189.170:80", "114.215.102.168:8081", "196.43.197.26:80", "31.47.198.61:80", "178.32.213.128:80", "142.4.214.9:88", "199.15.198.7:8080", "199.15.198.9:8080", "199.15.198.10:8080", "94.153.172.75:80", "125.162.153.208:80", "36.83.66.57:80", "104.207.147.8:3256", "64.34.21.84:80", "83.169.17.103:80", "203.74.4.0:80", "203.74.4.5:80", "203.74.4.2:80", "203.74.4.6:80", "203.74.4.3:80", "203.74.4.1:80", "203.74.4.7:80", "203.74.4.4:80", "120.77.255.133:8088", "114.215.103.121:8081", "139.196.104.28:9000", "183.240.87.229:8080", "61.153.67.110:9999", "212.83.164.85:80", "167.114.196.153:80", "212.184.12.11:80", "103.15.251.75:80", "193.108.38.23:80", "46.38.52.36:8081", "177.207.234.14:80", "193.70.3.144:80", "202.78.227.33:80", "61.5.207.102:80", "62.210.249.233:80", "88.198.39.58:80", "35.154.200.203:80", "107.170.214.74:80", "54.233.168.79:80", "54.158.134.115:80", "202.79.36.119:8080", "195.14.242.39:80", "185.141.164.8:8080", "180.254.225.18:80", "168.234.75.142:80", "120.199.64.163:8081", "78.134.212.173:80", "120.77.210.59:80", "120.25.211.80:9999", "121.40.199.105:80", "209.198.197.165:80", "122.192.66.50:808", "120.24.208.42:9999", "119.28.74.189:808", "195.222.68.87:3128", "180.254.219.129:80", "50.203.117.22:80", "209.141.61.84:80", "191.253.67.206:8080", "94.19.39.81:3128", "200.42.45.211:80", "52.41.94.5:80", "168.128.29.75:80", "130.0.24.28:8080", "196.43.197.27:80", "52.65.157.207:80", "57.100.3.252:808", "186.103.239.190:80", "181.221.5.145:80", "124.47.7.38:80", "124.172.191.51:80", "219.91.255.179:80", "193.205.4.176:80", "37.59.36.212:88", "86.102.106.150:8080", "62.210.51.150:80", "138.197.154.98:80", "189.84.213.2:80", "195.55.85.254:80", "91.121.171.104:8888", "82.224.48.173:80", "51.255.161.222:80", "185.44.69.44:3128", "213.108.201.82:80", "82.200.205.49:3128", "51.254.127.194:8081"] -------------------------------------------------------------------------------- /amazon/amazon/settings-demo.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Scrapy settings for amazon project 4 | # 5 | # For simplicity, this file contains only settings considered important or 6 | # commonly used. 
You can find more settings consulting the documentation: 7 | # 8 | # http://doc.scrapy.org/en/latest/topics/settings.html 9 | # http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html 10 | # http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html 11 | 12 | BOT_NAME = 'amazon' 13 | 14 | SPIDER_MODULES = ['amazon.spiders'] 15 | NEWSPIDER_MODULE = 'amazon.spiders' 16 | 17 | 18 | # Crawl responsibly by identifying yourself (and your website) on the user-agent 19 | USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36' 20 | 21 | # Obey robots.txt rules 22 | ROBOTSTXT_OBEY = False 23 | 24 | # Configure maximum concurrent requests performed by Scrapy (default: 16) 25 | CONCURRENT_REQUESTS = 32 26 | 27 | # Configure a delay for requests for the same website (default: 0) 28 | # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay 29 | # See also autothrottle settings and docs 30 | #DOWNLOAD_DELAY = 3 31 | # The download delay setting will honor only one of: 32 | #CONCURRENT_REQUESTS_PER_DOMAIN = 16 33 | #CONCURRENT_REQUESTS_PER_IP = 16 34 | 35 | # Disable cookies (enabled by default) 36 | COOKIES_ENABLED = False 37 | 38 | # Disable Telnet Console (enabled by default) 39 | #TELNETCONSOLE_ENABLED = False 40 | 41 | # Override the default request headers: 42 | #DEFAULT_REQUEST_HEADERS = { 43 | # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 44 | # 'Accept-Language': 'en', 45 | #} 46 | 47 | # Enable or disable spider middlewares 48 | # See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html 49 | SPIDER_MIDDLEWARES = { 50 | 'amazon.middlewares.AmazonSpiderMiddleware.AmazonSpiderMiddleware': 543, 51 | } 52 | 53 | # Enable or disable downloader middlewares 54 | # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html 55 | #DOWNLOADER_MIDDLEWARES = { 56 | # 'amazon.middlewares.MyCustomDownloaderMiddleware': 543, 57 | #} 58 | DOWNLOADER_MIDDLEWARES = { 59 | 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None, 60 | 'amazon.middlewares.RotateUserAgentMiddleware.RotateUserAgentMiddleware': 543, 61 | 'amazon.middlewares.ProxyMiddleware.ProxyMiddleware': 542, 62 | } 63 | 64 | # Enable or disable extensions 65 | # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html 66 | #EXTENSIONS = { 67 | # 'scrapy.extensions.telnet.TelnetConsole': None, 68 | #} 69 | 70 | # Configure item pipelines 71 | # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html 72 | ITEM_PIPELINES = { 73 | #'amazon.pipelines.AmazonPipeline': 300, 74 | 'amazon.mysqlpipelines.pipelines.AmazonPipeline':1, 75 | } 76 | 77 | # Enable and configure the AutoThrottle extension (disabled by default) 78 | # See http://doc.scrapy.org/en/latest/topics/autothrottle.html 79 | #AUTOTHROTTLE_ENABLED = True 80 | # The initial download delay 81 | #AUTOTHROTTLE_START_DELAY = 5 82 | # The maximum download delay to be set in case of high latencies 83 | #AUTOTHROTTLE_MAX_DELAY = 60 84 | # The average number of requests Scrapy should be sending in parallel to 85 | # each remote server 86 | #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 87 | # Enable showing throttling stats for every response received: 88 | #AUTOTHROTTLE_DEBUG = False 89 | 90 | # Enable and configure HTTP caching (disabled by default) 91 | # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings 92 | 
#HTTPCACHE_ENABLED = True 93 | #HTTPCACHE_EXPIRATION_SECS = 0 94 | #HTTPCACHE_DIR = 'httpcache' 95 | #HTTPCACHE_IGNORE_HTTP_CODES = [] 96 | #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage' 97 | #LOG_LEVEL = 'ERROR' 98 | #mysql 99 | MYSQL_HOST = '127.0.0.1' 100 | MYSQL_USER = 'dev' 101 | MYSQL_PASSWORD = '123456' 102 | MYSQL_PORT = 3306 103 | MYSQL_DB = 'pricejot_test' 104 | MYSQL_CHARSET = 'utf8mb4' 105 | 106 | MYSQL = { 107 | 'host': MYSQL_HOST, 108 | 'port': MYSQL_PORT, 109 | 'user': MYSQL_USER, 110 | 'password': MYSQL_PASSWORD, 111 | 'charset': MYSQL_CHARSET, 112 | 'database': MYSQL_DB 113 | } 114 | RETRY_TIMES = 30 115 | DOWNLOAD_TIMEOUT = 30 116 | FEED_EXPORT_ENCODING = 'utf-8' 117 | 118 | TIMEZONE = 'America/Los_Angeles' -------------------------------------------------------------------------------- /amazon/amazon/settings.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Scrapy settings for amazon project 4 | # 5 | # For simplicity, this file contains only settings considered important or 6 | # commonly used. You can find more settings consulting the documentation: 7 | # 8 | # http://doc.scrapy.org/en/latest/topics/settings.html 9 | # http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html 10 | # http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html 11 | 12 | BOT_NAME = 'amazon' 13 | 14 | SPIDER_MODULES = ['amazon.spiders'] 15 | NEWSPIDER_MODULE = 'amazon.spiders' 16 | 17 | 18 | # Crawl responsibly by identifying yourself (and your website) on the user-agent 19 | USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36' 20 | 21 | # Obey robots.txt rules 22 | ROBOTSTXT_OBEY = False 23 | 24 | # Configure maximum concurrent requests performed by Scrapy (default: 16) 25 | CONCURRENT_REQUESTS = 32 26 | 27 | # Configure a delay for requests for the same website (default: 0) 28 | # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay 29 | # See also autothrottle settings and docs 30 | #DOWNLOAD_DELAY = 3 31 | # The download delay setting will honor only one of: 32 | #CONCURRENT_REQUESTS_PER_DOMAIN = 16 33 | #CONCURRENT_REQUESTS_PER_IP = 16 34 | 35 | # Disable cookies (enabled by default) 36 | COOKIES_ENABLED = False 37 | 38 | # Disable Telnet Console (enabled by default) 39 | #TELNETCONSOLE_ENABLED = False 40 | 41 | # Override the default request headers: 42 | #DEFAULT_REQUEST_HEADERS = { 43 | # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 44 | # 'Accept-Language': 'en', 45 | #} 46 | 47 | # Enable or disable spider middlewares 48 | # See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html 49 | SPIDER_MIDDLEWARES = { 50 | 'amazon.middlewares.AmazonSpiderMiddleware.AmazonSpiderMiddleware': 543, 51 | } 52 | 53 | # Enable or disable downloader middlewares 54 | # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html 55 | #DOWNLOADER_MIDDLEWARES = { 56 | # 'amazon.middlewares.MyCustomDownloaderMiddleware': 543, 57 | #} 58 | DOWNLOADER_MIDDLEWARES = { 59 | 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None, 60 | 'amazon.middlewares.RotateUserAgentMiddleware.RotateUserAgentMiddleware': 543, 61 | 'amazon.middlewares.ProxyMiddleware.ProxyMiddleware': 542, 62 | } 63 | 64 | # Enable or disable extensions 65 | # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html 
66 | #EXTENSIONS = { 67 | # 'scrapy.extensions.telnet.TelnetConsole': None, 68 | #} 69 | 70 | # Configure item pipelines 71 | # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html 72 | ITEM_PIPELINES = { 73 | #'amazon.pipelines.AmazonPipeline': 300, 74 | 'amazon.mysqlpipelines.pipelines.AmazonPipeline':1, 75 | } 76 | 77 | # Enable and configure the AutoThrottle extension (disabled by default) 78 | # See http://doc.scrapy.org/en/latest/topics/autothrottle.html 79 | #AUTOTHROTTLE_ENABLED = True 80 | # The initial download delay 81 | #AUTOTHROTTLE_START_DELAY = 5 82 | # The maximum download delay to be set in case of high latencies 83 | #AUTOTHROTTLE_MAX_DELAY = 60 84 | # The average number of requests Scrapy should be sending in parallel to 85 | # each remote server 86 | #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 87 | # Enable showing throttling stats for every response received: 88 | #AUTOTHROTTLE_DEBUG = False 89 | 90 | # Enable and configure HTTP caching (disabled by default) 91 | # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings 92 | #HTTPCACHE_ENABLED = True 93 | #HTTPCACHE_EXPIRATION_SECS = 0 94 | #HTTPCACHE_DIR = 'httpcache' 95 | #HTTPCACHE_IGNORE_HTTP_CODES = [] 96 | #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage' 97 | #LOG_LEVEL = 'ERROR' 98 | #mysql 99 | MYSQL_HOST = '127.0.0.1' 100 | MYSQL_USER = 'dev' 101 | MYSQL_PASSWORD = '123456' 102 | MYSQL_PORT = 3306 103 | MYSQL_DB = 'pricejot_test' 104 | MYSQL_CHARSET = 'utf8mb4' 105 | 106 | MYSQL = { 107 | 'host': MYSQL_HOST, 108 | 'port': MYSQL_PORT, 109 | 'user': MYSQL_USER, 110 | 'password': MYSQL_PASSWORD, 111 | 'charset': MYSQL_CHARSET, 112 | 'database': MYSQL_DB 113 | } 114 | RETRY_TIMES = 30 115 | DOWNLOAD_TIMEOUT = 30 116 | FEED_EXPORT_ENCODING = 'utf-8' 117 | 118 | TIMEZONE = 'America/Los_Angeles' -------------------------------------------------------------------------------- /amazon/amazon/spiders/__init__.py: -------------------------------------------------------------------------------- 1 | # This package will contain the spiders of your Scrapy project 2 | # 3 | # Please refer to the documentation for information on how to create and manage 4 | # your spiders. 5 | -------------------------------------------------------------------------------- /amazon/amazon/spiders/asin_spider.py: -------------------------------------------------------------------------------- 1 | import scrapy 2 | import json 3 | from amazon.items import AsinBestItem 4 | import pydispatch 5 | from scrapy import signals 6 | from datetime import datetime 7 | from amazon.mysqlpipelines.pipelines import Sql 8 | class AsinSpider(scrapy.Spider): 9 | name = "asin" 10 | custom_settings = { 11 | 'LOG_LEVEL': 'ERROR', 12 | 'LOG_ENABLED': True, 13 | 'LOG_STDOUT': True 14 | } 15 | 16 | def __init__(self): 17 | scrapy.Spider.__init__(self) 18 | pydispatch.dispatcher.connect(self.handle_spider_closed, signals.spider_closed) 19 | # all asin scrapied will store in the array 20 | self.asin_pool = [] 21 | 22 | def start_requests(self): 23 | cates = Sql.findall_cate_level1() 24 | for row in cates: 25 | row['link'] += '?ajax=1' 26 | yield scrapy.Request(url=row['link']+'&pg=1', callback=self.parse, meta={'cid': row['id'], 'page': 1, 'link': row['link']}) 27 | 28 | def parse(self, response): 29 | list = response.css('.zg_itemImmersion') 30 | 31 | # scrapy next page go go go ! 
32 | response.meta['page'] = response.meta['page'] +1 33 | if response.meta['page'] < 6: 34 | yield scrapy.Request(url=response.meta['link']+'&pg='+str(response.meta['page']), callback=self.parse, meta=response.meta) 35 | 36 | # yield the asin 37 | for row in list: 38 | try: 39 | info = row.css('.zg_itemWrapper')[0].css('div::attr(data-p13n-asin-metadata)')[0].extract() 40 | rank = int(float(row.css('.zg_rankNumber::text')[0].extract())) 41 | 42 | except: 43 | continue 44 | pass 45 | info = json.loads(info) 46 | item = AsinBestItem() 47 | item['asin'] = info['asin'] 48 | item['cid'] = response.meta['cid'] 49 | item['rank'] = rank 50 | yield item 51 | 52 | def handle_spider_closed(self, spider): 53 | Sql.store_best_asin() 54 | work_time = datetime.now() - spider.started_on 55 | print('total spent:', work_time) 56 | print('done') 57 | 58 | 59 | 60 | -------------------------------------------------------------------------------- /amazon/amazon/spiders/cate_spider.py: -------------------------------------------------------------------------------- 1 | import scrapy 2 | from amazon.items import CateItem 3 | from amazon.mysqlpipelines.sql import Sql 4 | 5 | class CateSpider(scrapy.Spider): 6 | name = "cate" 7 | custom_settings = { 8 | 'LOG_LEVEL': 'ERROR', 9 | 'LOG_ENABLED': True, 10 | 'LOG_STDOUT': True 11 | } 12 | level = 1 13 | 14 | def start_requests(self): 15 | 16 | urls = [ 17 | 'https://www.amazon.com/Best-Sellers/zgbs/', 18 | ] 19 | Sql.clear_cate(self.level) 20 | for url in urls: 21 | yield scrapy.Request(url=url, callback=self.parse, meta={'level': self.level}) 22 | 23 | def parse(self, response): 24 | 25 | if response.meta['level'] == 1: 26 | list = response.css('#zg_browseRoot ul')[0].css('li a') 27 | elif response.meta['level'] == 2: 28 | list = response.css('#zg_browseRoot ul')[0].css('ul')[0].css('li a') 29 | else: 30 | return 0 31 | item = CateItem() 32 | leve_cur = response.meta['level'] 33 | response.meta['level'] = response.meta['level'] + 1 34 | 35 | for one in list: 36 | item['title'] = one.css('::text')[0].extract() 37 | link = one.css('::attr(href)')[0].extract() 38 | item['link'] = link.split('ref=')[0] 39 | item['level'] = leve_cur 40 | item['pid'] = 1 41 | yield item 42 | if int(float(self.level)) > 1: 43 | yield scrapy.Request(url=item['link'], callback=self.parse, meta=response.meta) 44 | 45 | 46 | 47 | 48 | -------------------------------------------------------------------------------- /amazon/amazon/spiders/detail_spider.py: -------------------------------------------------------------------------------- 1 | import scrapy 2 | from amazon.items import DetailItem 3 | from amazon.mysqlpipelines.pipelines import Sql 4 | import pydispatch 5 | import re 6 | from amazon.helper import Helper 7 | from scrapy import signals 8 | from datetime import datetime 9 | 10 | 11 | 12 | class DetailSpider(scrapy.Spider): 13 | name = "detail" 14 | custom_settings = { 15 | 'LOG_LEVEL': 'ERROR', 16 | 'LOG_ENABLED': True, 17 | 'LOG_STDOUT': True 18 | } 19 | 20 | def __init__(self): 21 | scrapy.Spider.__init__(self) 22 | pydispatch.dispatcher.connect(self.handle_spider_closed, signals.spider_closed) 23 | # all asin scrapied will store in the array 24 | self.product_pool = {} 25 | self.log = [] 26 | self.products = [] 27 | 28 | def start_requests(self): 29 | self.products = Sql.findall_asin_level1() 30 | print(len(self.products)) 31 | for row in self.products: 32 | yield scrapy.Request( 33 | url='https://www.amazon.com/gp/offer-listing/' + row['asin'] + '/?f_new=true', 34 | 
callback=self.listing_parse, 35 | meta={ 36 | 'asin': row['asin'], 37 | 'cid': row['cid'], 38 | } 39 | ) 40 | 41 | def review_parse(self, response): 42 | item = self.fetch_detail_from_review_page(response) 43 | self.product_pool[item['asin']] = item 44 | yield item 45 | 46 | def listing_parse(self, response): 47 | print(response.status) 48 | 49 | if not response.css('#olpProductImage'): 50 | yield scrapy.Request( 51 | url='https://www.amazon.com/product-reviews/' + response.meta['asin'], 52 | callback=self.review_parse, 53 | meta={'asin': response.meta['asin'], 'cid': response.meta['cid']} 54 | ) 55 | return 56 | try: 57 | item = self.fetch_detail_from_listing_page(response) 58 | self.product_pool[item['asin']] = item 59 | except Exception as err: 60 | print(err) 61 | print(response.meta['asin']) 62 | yield item 63 | 64 | def handle_spider_closed(self, spider): 65 | work_time = datetime.now() - spider.started_on 66 | print('total spent:', work_time) 67 | print(len(self.product_pool), 'item fetched') 68 | print(self.product_pool) 69 | print('done') 70 | print(self.log) 71 | 72 | 73 | 74 | 75 | def fetch_detail_from_listing_page(self, response): 76 | item = DetailItem() 77 | item['asin'] = response.meta['asin'] 78 | item['image'] = response.css('#olpProductImage img::attr(src)')[0].extract().strip().replace('_SS160', '_SS320') 79 | item['title'] = response.css('title::text')[0].extract().split(':')[2].strip() 80 | 81 | try: 82 | item['star'] = response.css('.a-icon-star span::text')[0].extract().split(' ')[0].strip() 83 | except: 84 | item['star'] = 0 85 | try: 86 | item['reviews'] = response.css('.a-size-small > .a-link-normal::text')[0].extract().strip().split(' ')[0] 87 | except: 88 | item['reviews'] = 0 89 | 90 | price_info_list = response.css(".olpOffer[role=\"row\"] ") 91 | item['amazon_price'] = 0 92 | item['seller_price'] = 0 93 | for row in price_info_list: 94 | if (item['amazon_price'] == 0) and row.css(".olpSellerName > img"): 95 | try: 96 | item['amazon_price'] = row.css('.olpOfferPrice::text')[0].extract().strip().lstrip('$') 97 | except: 98 | item['amazon_price'] = 0 99 | continue 100 | if (item['seller_price'] == 0) and (not row.css(".olpSellerName > img")): 101 | try: 102 | item['seller_price'] = row.css('.olpOfferPrice::text')[0].extract().strip().lstrip('$') 103 | except: 104 | item['seller_price'] = 0 105 | return item 106 | 107 | def fetch_detail_from_review_page(self, response): 108 | 109 | 110 | info = response.css('#cm_cr-product_info')[0].extract() 111 | item = DetailItem() 112 | item['asin'] = response.meta['asin'] 113 | item['image'] = response.css('.product-image img::attr(src)')[0].extract().strip().replace('S60', 'S320') 114 | item['title'] = response.css('.product-title >h1>a::text')[0].extract().strip() 115 | item['star'] = re.findall("([0-9].[0-9]) out of", info)[0] 116 | 117 | # 获取评价总数 118 | item['reviews'] = response.css('.AverageCustomerReviews .totalReviewCount::text')[0].extract().strip() 119 | item['reviews'] = Helper.get_num_split_comma(item['reviews']) 120 | item['seller_price'] = 0 121 | item['amazon_price'] = 0 122 | price = response.css('.arp-price::text')[0].extract().strip().lstrip('$') 123 | item['amazon_price'] = price 124 | return item 125 | -------------------------------------------------------------------------------- /amazon/amazon/spiders/keyword_ranking_spider.py: -------------------------------------------------------------------------------- 1 | import logging 2 | 3 | import scrapy 4 | from pydispatch import dispatcher 5 | from scrapy 
import signals 6 | from amazon.helper import Helper 7 | from amazon.items import KeywordRankingItem 8 | from amazon.sql import RankingSql 9 | 10 | 11 | class KeywordRankingSpider(scrapy.Spider): 12 | name = 'keyword_ranking' 13 | custom_settings = { 14 | 'LOG_LEVEL': 'ERROR', 15 | 'LOG_FILE': 'keyword_ranking.json', 16 | 'LOG_ENABLED': True, 17 | 'LOG_STDOUT': True, 18 | 'CONCURRENT_REQUESTS_PER_DOMAIN': 30 19 | } 20 | 21 | def __init__(self, *args, **kwargs): 22 | super().__init__(*args, **kwargs) 23 | self.items = {} 24 | self.found = {} 25 | self.keyword_pool = {} 26 | self.store_poll = {} 27 | self.store_date = {} 28 | dispatcher.connect(self.init_scrapy, signals.engine_started) 29 | dispatcher.connect(self.close_scrapy, signals.engine_stopped) 30 | 31 | def start_requests(self): 32 | for keyword, poll in self.keyword_pool.items(): 33 | yield scrapy.Request(('https://www.amazon.com/s/?field-keywords=%s&t=' + Helper.random_str(10)) % keyword, 34 | self.load_first_page, meta={'items': poll}) 35 | 36 | def parse(self, response): 37 | result_li = response.xpath('//li[@data-asin]') 38 | for item in response.meta['items']: 39 | if len(result_li) == 0: 40 | self.found[item['id']] = 'none' 41 | logging.warning("[keyword none] url: [%s] skwd_id:[%s] asin:[%s] \r\n body: %s" % (response.url, item['id'],item['asin'], response.body)) 42 | else: 43 | for result in result_li: 44 | data_asin = result.xpath('./@data-asin').extract()[0] 45 | if data_asin == item['asin']: 46 | # print(item) 47 | self.found[item['id']] = True 48 | # keywordItem = KeywordRankingItem() 49 | data_id = result.xpath('./@id').extract()[0] 50 | item_id = data_id.split('_')[1] 51 | rank = int(item_id) +1 52 | if item['id'] in self.store_poll.keys(): 53 | self.store_poll[item['id']].append(rank) 54 | else: 55 | self.store_poll[item['id']] = [rank] 56 | self.store_date[item['id']] = Helper.get_now_date() 57 | break 58 | 59 | def load_first_page(self, response): 60 | page = response.css('#bottomBar span.pagnDisabled::text').extract() 61 | page = 1 if len(page) == 0 else int(page[0]) 62 | page_num = 1 63 | while page_num <= page: 64 | # yield scrapy.Request(response.url + '&page=%s' % page_num, self.parse, meta={'asin': response.meta['item']['asin'], 65 | # 'skwd_id': response.meta['item']['id']}) 66 | yield scrapy.Request(response.url + '&page=%s' % page_num, self.parse, meta={'items': response.meta['items']}) 67 | page_num += 1 68 | 69 | def init_scrapy(self): 70 | self.items = RankingSql.fetch_keywords_ranking() 71 | for item in self.items: 72 | if item['keyword'] in self.keyword_pool.keys(): 73 | self.keyword_pool[item['keyword']].append({'id': item['id'], 'asin': item['asin']}) 74 | else: 75 | self.keyword_pool[item['keyword']] = [{'id': item['id'], 'asin': item['asin']}] 76 | 77 | self.found = {item['id']: False for item in self.items} 78 | 79 | def close_scrapy(self): 80 | for skwd_id, is_found in self.found.items(): 81 | if is_found is not True: 82 | if is_found == 'none': 83 | # RankingSql.update_keywords_none_rank(skwd_id) 84 | logging.info('[keyword none] skwd_id:[%s]' % skwd_id) 85 | else: 86 | RankingSql.update_keywords_expire_rank(skwd_id) 87 | else: 88 | keywordrank = KeywordRankingItem() 89 | keywordrank['skwd_id'] = skwd_id 90 | keywordrank['rank'] = min(self.store_poll[skwd_id]) 91 | keywordrank['date'] = self.store_date[skwd_id] 92 | RankingSql.insert_keyword_ranking(keywordrank) 93 | 94 | 95 | -------------------------------------------------------------------------------- 
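The keyword ranking spider above can see the same keyword/ASIN pair on several search result pages, so it pools every rank it finds and stores only the best one when the crawl closes. A minimal, self-contained sketch of that aggregation step (the dictionaries and values below are hypothetical stand-ins for the spider's store_poll / store_date state):

from math import ceil

# Hypothetical pooled results: skwd_id -> all ranks seen across the search pages.
store_poll = {101: [35, 18, 3], 102: [7]}
store_date = {101: '2017-09-01 10:00:00', 102: '2017-09-01 10:00:00'}

for skwd_id, ranks in store_poll.items():
    best_rank = min(ranks)          # same rule as close_scrapy: keep the lowest rank found
    page = ceil(best_rank / 16)     # Helper.get_keyword_page_num: roughly 16 results per page
    print(skwd_id, best_rank, page, store_date[skwd_id])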
/amazon/amazon/spiders/proxy/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dynamohuang/amazon-scrapy/33afe5b482d3d55a065084289e20540ce6d99081/amazon/amazon/spiders/proxy/__init__.py -------------------------------------------------------------------------------- /amazon/amazon/spiders/proxy/__pycache__/__init__.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dynamohuang/amazon-scrapy/33afe5b482d3d55a065084289e20540ce6d99081/amazon/amazon/spiders/proxy/__pycache__/__init__.cpython-36.pyc -------------------------------------------------------------------------------- /amazon/amazon/spiders/proxy/__pycache__/fineproxy_spider.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dynamohuang/amazon-scrapy/33afe5b482d3d55a065084289e20540ce6d99081/amazon/amazon/spiders/proxy/__pycache__/fineproxy_spider.cpython-36.pyc -------------------------------------------------------------------------------- /amazon/amazon/spiders/proxy/__pycache__/kuaidaili_spider.cpython-36.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dynamohuang/amazon-scrapy/33afe5b482d3d55a065084289e20540ce6d99081/amazon/amazon/spiders/proxy/__pycache__/kuaidaili_spider.cpython-36.pyc -------------------------------------------------------------------------------- /amazon/amazon/spiders/proxy/fineproxy_spider.py: -------------------------------------------------------------------------------- 1 | import scrapy 2 | import re 3 | import json 4 | 5 | class FineproxySpider(scrapy.Spider): 6 | name = "fineproxy" 7 | custom_settings = { 8 | 'LOG_LEVEL': 'ERROR', 9 | 'LOG_ENABLED': True, 10 | 'LOG_STDOUT': True 11 | } 12 | 13 | 14 | def start_requests(self): 15 | url = "http://fineproxy.org/eng/fresh-proxies/" 16 | yield scrapy.Request(url=url, callback=self.parse, meta={}) 17 | 18 | def parse(self, response): 19 | pattern = "Fast proxies: (.*)Other fresh and working proxies:" 20 | tmp = re.findall(pattern, response.text)[0] 21 | proxy = re.findall("([0-9]{1,4}.[0-9]{1,4}.[0-9]{1,4}.[0-9]{1,4}:[0-9]{1,4})", tmp) 22 | with open('proxy.json', 'w') as f: 23 | json.dump(proxy, f) 24 | 25 | 26 | 27 | 28 | 29 | -------------------------------------------------------------------------------- /amazon/amazon/spiders/proxy/kuaidaili_spider.py: -------------------------------------------------------------------------------- 1 | import scrapy 2 | import re 3 | import json 4 | 5 | class KuaidailiSpider(scrapy.Spider): 6 | name = "kuaidaili" 7 | # custom_settings = { 8 | # 'LOG_LEVEL': 'ERROR', 9 | # 'LOG_ENABLED': True, 10 | # 'LOG_STDOUT': True 11 | # } 12 | 13 | 14 | def start_requests(self): 15 | 16 | self.headers = { 17 | 'Host': 'www.kuaidaili.com', 18 | 'Upgrade-Insecure-Requests': '1', 19 | 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:52.0) Gecko/20100101 Firefox/52.0', 20 | } 21 | 22 | url = "http://www.kuaidaili.com/free/inha/" 23 | yield scrapy.Request(url=url, callback=self.parse, meta={}) 24 | 25 | def parse(self, response): 26 | print(response.status) 27 | print('3333') 28 | print(response.css('.center tr').re('td')) 29 | 30 | 31 | 32 | 33 | -------------------------------------------------------------------------------- /amazon/amazon/spiders/proxy/privateproxy_spider.py: 
-------------------------------------------------------------------------------- 1 | import scrapy 2 | import json 3 | import pymysql 4 | from amazon import settings 5 | 6 | class privateproxySpider(scrapy.Spider): 7 | name = "privateproxy" 8 | custom_settings = { 9 | 'LOG_LEVEL': 'ERROR', 10 | 'LOG_ENABLED': True, 11 | 'LOG_STDOUT': True 12 | } 13 | 14 | def start_requests(self): 15 | url = "http://www.qq.com" 16 | db = pymysql.connect(settings.MYSQL_HOST, settings.MYSQL_USER, settings.MYSQL_PASSWORD, settings.MYSQL_DB, charset=settings.MYSQL_CHARSET, cursorclass=pymysql.cursors.DictCursor) 17 | cursor = db.cursor() 18 | 19 | sql = "SELECT CONCAT_WS(':', ip, port) AS proxy FROM proxy where work = 1" 20 | cursor.execute(sql) 21 | 22 | proxy_array = [] 23 | proxy_list = cursor.fetchall() 24 | for item in proxy_list: 25 | proxy_array.append(item['proxy']) 26 | 27 | with open('proxy.json', 'w') as f: 28 | json.dump(proxy_array, f) 29 | yield scrapy.Request(url=url, callback=self.parse, meta={}) 30 | 31 | def parse(self, response): 32 | print('proxy update done') 33 | 34 | 35 | 36 | 37 | 38 | -------------------------------------------------------------------------------- /amazon/amazon/spiders/reivew_profile_spider.py: -------------------------------------------------------------------------------- 1 | import scrapy 2 | 3 | from amazon.helper import Helper 4 | from amazon.items import ReviewProfileItem 5 | 6 | 7 | class ProfileSpider(scrapy.Spider): 8 | name = 'review_profile' 9 | custom_settings = { 10 | 'LOG_LEVEL': 'ERROR', 11 | 'LOG_FILE': 'profile.json', 12 | 'LOG_ENABLED': True, 13 | 'LOG_STDOUT': True 14 | } 15 | 16 | def __init__(self, asin, *args, **kwargs): 17 | super().__init__(*args, **kwargs) 18 | self.asin = asin 19 | 20 | def start_requests(self): 21 | yield scrapy.Request('https://www.amazon.com/product-reviews/%s' % self.asin, callback=self.parse) 22 | 23 | def parse(self, response): 24 | item = ReviewProfileItem() 25 | 26 | item['asin'] = response.meta['asin'] if 'asin' in response.meta else self.asin 27 | # 获取平均评价数值 28 | average = response.css('.averageStarRatingNumerical a span::text').extract() # 获取平均评价值 29 | item['review_rate'] = Helper.get_star_split_str(average[0]) # 获取平均值 30 | # 获取评价总数 31 | total = response.css('.AverageCustomerReviews .totalReviewCount::text').extract() # 获取评价总数 32 | item['review_total'] = Helper.get_num_split_comma(total[0]) 33 | # 获取产品名称 34 | product = response.css('.product-title h1 a::text').extract() 35 | item['product'] = product[0] 36 | # 获取产品 brand 37 | item['brand'] = response.css('.product-by-line a::text').extract()[0] 38 | item['image'] = response.css('.product-image img::attr(src)').extract()[0] 39 | 40 | # 获取产品商家 41 | item['seller'] = item['brand'] 42 | # 获取各星评价百分比数 43 | review_summary = response.css('.reviewNumericalSummary .histogram ' 44 | '#histogramTable tr td:last-child').re(r'\d{1,3}\%') 45 | 46 | pct = list(map(lambda x: x[0:-1], review_summary)) 47 | 48 | item['pct_five'] = pct[0] 49 | item['pct_four'] = pct[1] 50 | item['pct_three'] = pct[2] 51 | item['pct_two'] = pct[3] 52 | item['pct_one'] = pct[4] 53 | 54 | yield item 55 | -------------------------------------------------------------------------------- /amazon/amazon/spiders/review_detail_spider.py: -------------------------------------------------------------------------------- 1 | import math 2 | import subprocess 3 | 4 | import scrapy 5 | from pydispatch import dispatcher 6 | from scrapy import signals 7 | 8 | from amazon.helper import Helper 9 | from amazon.items import 
ReviewDetailItem, ReviewProfileItem 10 | from amazon.sql import ReviewSql 11 | 12 | 13 | class ReviewSpider(scrapy.Spider): 14 | name = 'review_detail' 15 | custom_settings = { 16 | 'LOG_LEVEL': 'ERROR', 17 | 'LOG_FILE': 'review_detail.json', 18 | 'LOG_ENABLED': True, 19 | 'LOG_STDOUT': True 20 | } 21 | 22 | def __init__(self, asin, daily=0, *args, **kwargs): 23 | super().__init__(*args, **kwargs) 24 | self.asin = asin 25 | self.last_review = 0 26 | self.profile_update_self = False # profile自动计数更新 27 | self.updated = False # profile是否更新过 28 | self.daily = True if int(daily) == 1 else False # 判断是否是每日更新 29 | self.start_urls = [ 30 | 'https://www.amazon.com/product-reviews/%s?sortBy=recent&filterByStar=three_star' % self.asin, 31 | 'https://www.amazon.com/product-reviews/%s?sortBy=recent&filterByStar=two_star' % self.asin, 32 | 'https://www.amazon.com/product-reviews/%s?sortBy=recent&filterByStar=one_star' % self.asin 33 | ] 34 | dispatcher.connect(self.update_profile_self, signals.engine_stopped) 35 | dispatcher.connect(self.init_profile, signals.engine_started) 36 | 37 | def start_requests(self): 38 | self.load_profile() 39 | for url in self.start_urls: 40 | yield scrapy.Request(url, callback=self.get_detail) 41 | 42 | def parse(self, response): 43 | reviews = response.css('.review-views .review') 44 | for row in reviews: 45 | item = ReviewDetailItem() 46 | item['asin'] = self.asin 47 | item['review_id'] = row.css('div::attr(id)')[0].extract() 48 | item['reviewer'] = row.css('.author::text')[0].extract() 49 | item['title'] = row.css('.review-title::text')[0].extract() 50 | item['review_url'] = row.css('.review-title::attr(href)')[0].extract() 51 | item['date'] = Helper.get_date_split_str(row.css('.review-date::text')[0].extract()) 52 | item['star'] = Helper.get_star_split_str(row.css('.review-rating span::text')[0].extract()) 53 | content = row.css('.review-data .review-text::text').extract() 54 | item['content'] = '
'.join(content) if len(content) > 0 else '' 55 | yield item 56 | 57 | def get_detail(self, response): 58 | # 获取页面数 59 | page = response.css('ul.a-pagination li a::text') 60 | 61 | i = 1 62 | # 获取评价总数 63 | total = response.css('.AverageCustomerReviews .totalReviewCount::text').extract() # 获取评价总数 64 | now_total = Helper.get_num_split_comma(total[0]) 65 | last_review = self.last_review 66 | sub_total = int(now_total) - int(last_review) 67 | if sub_total != 0: 68 | # if sub_total != 0: # 若计算出的页数 不为0 则说明有新的评论,更新profile 69 | self.updated = True 70 | yield scrapy.Request('https://www.amazon.com/product-reviews/%s' % self.asin, 71 | callback=self.profile_parse) 72 | if len(page) < 3: # 若找到的a标签总数小于3 说明没有page组件 只有1页数据 73 | yield scrapy.Request(url=response.url + '&pageNumber=1', callback=self.parse) 74 | else: 75 | if self.daily: 76 | page_num = math.ceil(sub_total / 10) 77 | print('update item page_num is %s' % page_num) 78 | else: 79 | self.profile_update_self = True 80 | page_num = Helper.get_num_split_comma(page[len(page) - 3].extract()) # 获得总页数 81 | while i <= int(page_num): 82 | yield scrapy.Request(url=response.url + '&pageNumber=%s' % i, 83 | callback=self.parse) 84 | i = i + 1 85 | else: 86 | print('there is no item to update') 87 | 88 | def profile_parse(self, response): 89 | item = ReviewProfileItem() 90 | 91 | item['asin'] = self.asin 92 | # 获取平均评价数值 93 | average = response.css('.averageStarRatingNumerical a span::text').extract() # 获取平均评价值 94 | item['review_rate'] = Helper.get_star_split_str(average[0]) # 获取平均值 95 | # 获取评价总数 96 | total = response.css('.AverageCustomerReviews .totalReviewCount::text').extract() # 获取评价总数 97 | item['review_total'] = Helper.get_num_split_comma(total[0]) 98 | # 获取产品名称 99 | product = response.css('.product-title h1 a::text').extract() 100 | item['product'] = product[0] 101 | # 获取产品 brand 102 | item['brand'] = response.css('.product-by-line a::text').extract()[0] 103 | item['image'] = response.css('.product-image img::attr(src)').extract()[0] 104 | 105 | # 获取产品商家 106 | item['seller'] = item['brand'] 107 | # 获取各星评价百分比数 108 | review_summary = response.css('.reviewNumericalSummary .histogram ' 109 | '#histogramTable tr td:last-child').re(r'\d{1,3}\%') 110 | 111 | pct = list(map(lambda x: x[0:-1], review_summary)) 112 | 113 | item['pct_five'] = pct[0] 114 | item['pct_four'] = pct[1] 115 | item['pct_three'] = pct[2] 116 | item['pct_two'] = pct[3] 117 | item['pct_one'] = pct[4] 118 | 119 | yield item 120 | 121 | def load_profile(self): 122 | # 若没有profile记录 则抓取新的profile 录入数据库 123 | if self.last_review is False: 124 | self.profile_update_self = True 125 | print('this asin profile is not exist, now to get the profile of asin:', self.asin) 126 | yield scrapy.Request('https://www.amazon.com/product-reviews/%s' % self.asin, 127 | callback=self.profile_parse) 128 | self.last_review = ReviewSql.get_last_review_total(self.asin) 129 | 130 | # scrapy 完成后 加载,如果是没有记录的profile 初次insert lastest_review为0 将所有多余的数据跑完后 防止第二次重复跑取,将latest_total更新 131 | def update_profile_self(self): 132 | if self.profile_update_self is True and self.updated is False: 133 | # 若需要自主更新 并且 未更新状态 134 | ReviewSql.update_profile_self(self.asin) 135 | 136 | # scrapy 开始前加载,获取目前asin的latest_review 137 | def init_profile(self): 138 | self.last_review = ReviewSql.get_last_review_total(self.asin) 139 | -------------------------------------------------------------------------------- /amazon/amazon/spiders/sales_ranking_spider.py: -------------------------------------------------------------------------------- 1 | from 
datetime import datetime 2 | 3 | import scrapy 4 | 5 | from pydispatch import dispatcher 6 | from scrapy import signals 7 | 8 | from amazon.helper import Helper 9 | from amazon.items import SalesRankingItem 10 | from amazon.sql import RankingSql 11 | 12 | 13 | class SalesRankingSpider(scrapy.Spider): 14 | name = 'sales_ranking' 15 | custom_settings = { 16 | 'LOG_LEVEL': 'ERROR', 17 | 'LOG_FILE': 'sales_ranking.json', 18 | 'LOG_ENABLED': True, 19 | 'LOG_STDOUT': True 20 | } 21 | 22 | def __init__(self, **kwargs): 23 | super().__init__(**kwargs) 24 | self.items = [] 25 | dispatcher.connect(self.load_asin, signals.engine_started) 26 | 27 | def start_requests(self): 28 | for item in self.items: 29 | yield scrapy.Request('https://www.amazon.com/dp/%s' % item['asin'], self.parse, meta={'item': item}) 30 | 31 | def parse(self, response): 32 | product_detail = response.xpath('//div/table').re(r'#[0-9,]+(?:.*)in.*\(.*[Ss]ee [Tt]op.*\)') 33 | if len(product_detail) == 0: 34 | product_detail = response.css('div #SalesRank').re(r'#[0-9,]+(?:.*)in.*\(.*[Ss]ee [Tt]op.*\)') 35 | if len(product_detail) != 0: 36 | item = SalesRankingItem() 37 | key_rank_str = product_detail[0] 38 | key_rank_tuple = Helper.get_rank_classify(key_rank_str) 39 | item['rank'] = Helper.get_num_split_comma(key_rank_tuple[0]) 40 | item['classify'] = key_rank_tuple[1] 41 | item['asin'] = response.meta['item']['asin'] 42 | yield item 43 | else: 44 | raise Exception('catch asin[%s] sales ranking error' % response.meta['item']['asin']) 45 | 46 | def load_asin(self): 47 | self.items = RankingSql.fetch_sales_ranking() 48 | -------------------------------------------------------------------------------- /amazon/amazon/sql.py: -------------------------------------------------------------------------------- 1 | from datetime import datetime 2 | 3 | import pymysql 4 | import pytz 5 | 6 | from amazon import settings 7 | 8 | 9 | def conn_db(): 10 | db_conf = settings.MYSQL 11 | db_conf['cursorclass'] = pymysql.cursors.DictCursor 12 | conn = pymysql.connect(**db_conf) 13 | conn.autocommit(1) 14 | return conn 15 | 16 | 17 | def cursor_db(conn): 18 | return conn.cursor() 19 | 20 | 21 | class ReviewSql(object): 22 | conn = conn_db() 23 | cursor = cursor_db(conn) 24 | 25 | 26 | @classmethod 27 | def insert_profile_item(cls, item): 28 | sql = "INSERT INTO `py_review_profile`" \ 29 | "(`asin`, `product`, `brand`, `seller`, `image`," \ 30 | "`review_total`, `review_rate`, `pct_five`, `pct_four`, `pct_three`, " \ 31 | "`pct_two`, `pct_one`, `latest_total`) " \ 32 | "VALUES ('%s', %s, %s, %s, '%s', '%s', " \ 33 | "'%s', '%s', '%s', '%s', '%s', '%s', 0)" %\ 34 | (item['asin'], cls.conn.escape(item['product']), cls.conn.escape(item['brand']), cls.conn.escape(item['seller']), item['image'], 35 | item['review_total'], item['review_rate'], item['pct_five'], item['pct_four'], 36 | item['pct_three'], item['pct_two'], item['pct_one']) 37 | try: 38 | if cls.check_exist_profile(item['asin']): 39 | cls.update_profile_item(item) 40 | print('update review profile--[asin]:', item['asin']) 41 | else: 42 | cls.cursor.execute(sql) 43 | cls.conn.commit() 44 | print('save review profile--[asin]:', item['asin']) 45 | except pymysql.MySQLError as e: 46 | with open('sql.log', 'a') as i: 47 | i.write('profile sql error![error]:' + str(e) + '\n') 48 | print(e) 49 | cls.conn.rollback() 50 | pass 51 | 52 | @classmethod 53 | def update_profile_item(cls, item): 54 | sql = "UPDATE `py_review_profile` SET `latest_total`=`review_total`,`product`=%s, `brand`=%s, `seller`=%s, `image`=%s, 
`review_total`='%s', `review_rate`='%s'," \ 55 | "`pct_five`='%s', `pct_four`='%s', `pct_three`='%s', `pct_two`='%s', `pct_one`='%s' " \ 56 | "WHERE `asin`='%s'" % \ 57 | (cls.conn.escape(item['product']), cls.conn.escape(item['brand']), cls.conn.escape(item['seller']), cls.conn.escape(item['image']), 58 | item['review_total'], item['review_rate'], item['pct_five'], item['pct_four'], item['pct_three'], item['pct_two'], 59 | item['pct_one'], item['asin']) 60 | try: 61 | cls.cursor.execute(sql) 62 | cls.conn.commit() 63 | except pymysql.MySQLError as e: 64 | print(e) 65 | cls.conn.rollback() 66 | 67 | @classmethod 68 | def check_exist_profile(cls, asin): 69 | sql = "SELECT * FROM `py_review_profile` WHERE (`asin` = '%s')" % (asin) 70 | result = cls.cursor.execute(sql) 71 | if result: 72 | return True 73 | else: 74 | return False 75 | 76 | @classmethod 77 | def insert_detail_item(cls, item): 78 | sql = "INSERT INTO `py_review_detail`(`asin`, `review_id`, `reviewer`, `review_url`, `star`, `date`, `title`, `content`) " \ 79 | "VALUES ('%s', '%s', %s, '%s', '%s', '%s', %s, %s)" % \ 80 | (item['asin'], item['review_id'], cls.conn.escape(item['reviewer']), item['review_url'], item['star'], 81 | item['date'], cls.conn.escape(item['title']), cls.conn.escape(item['content'])) 82 | try: 83 | if cls.check_exist_detail(item['asin'], item['review_id']) is not True: 84 | print('save review detail--[asin]:', item['asin'], '[reviewID]:', item['review_id']) 85 | cls.cursor.execute(sql) 86 | cls.conn.commit() 87 | except pymysql.MySQLError as e: 88 | print(e) 89 | cls.conn.rollback() 90 | pass 91 | 92 | @classmethod 93 | def check_exist_detail(cls, asin, review_id): 94 | sql = "SELECT * FROM `py_review_detail` WHERE `asin` = '%s' AND `review_id`='%s'" % (asin, review_id) 95 | result = cls.cursor.execute(sql) 96 | if result: 97 | return True 98 | else: 99 | return False 100 | 101 | @classmethod 102 | def get_last_review_total(cls, asin): 103 | sql = "SELECT `review_total`, `latest_total` FROM `py_review_profile` WHERE `asin`='%s'" % asin 104 | cls.cursor.execute(sql) 105 | item = cls.cursor.fetchone() 106 | if item: 107 | return item['latest_total'] 108 | else: 109 | return False 110 | 111 | @classmethod 112 | def update_profile_self(cls, asin): 113 | sql = "UPDATE `py_review_profile` SET `latest_total` = `review_total` WHERE `asin`='%s'" % asin 114 | cls.cursor.execute(sql) 115 | cls.conn.commit() 116 | 117 | 118 | class RankingSql(object): 119 | expire_rank = 500 120 | conn = conn_db() 121 | cursor = cursor_db(conn) 122 | py_keyword_table = 'py_salesranking_keywords' # table written to by the spider 123 | py_sales_table = 'py_salesrankings' 124 | keyword_table = 'salesranking_keywords' 125 | sales_table = 'salesrankings' 126 | tz = pytz.timezone(settings.TIMEZONE) 127 | 128 | @classmethod 129 | def insert_sales_ranking(cls, item): 130 | now = datetime.now(cls.tz).strftime('%Y-%m-%d %H:%M:%S') 131 | sql = "INSERT INTO `%s`(`asin`, `rank`, `classify`, `date`) VALUES ('%s', '%s', %s, '%s')" % \ 132 | (cls.py_sales_table, item['asin'], item['rank'], cls.conn.escape(item['classify']), now) 133 | update_sql = "UPDATE `%s` SET `last_rank`=`rank`, `status`=1, `classify`=%s, `rank`='%s', `updated_at`='%s' WHERE `asin` = '%s'" % \ 134 | (cls.sales_table, cls.conn.escape(item['classify']), item['rank'], now, item['asin']) 135 | try: 136 | cls.cursor.execute(sql) 137 | cls.cursor.execute(update_sql) 138 | cls.conn.commit() 139 | print('save sales_rank:', item) 140 | except pymysql.DatabaseError as error: 141 | print(error) 142 | 
cls.conn.rollback() 143 | 144 | @classmethod 145 | def insert_keyword_ranking(cls, item): 146 | sql = "INSERT INTO `%s`(`skwd_id`, `rank`, `date`) VALUES ('%s', '%s', '%s')" % \ 147 | (cls.py_keyword_table, item['skwd_id'], item['rank'], item['date']) 148 | update_sql = "UPDATE `%s` SET `last_rank`=`rank`, `rank`='%s', `status`=1, `updated_at`='%s' WHERE `id`='%s'" % \ 149 | (cls.keyword_table, item['rank'], item['date'], item['skwd_id']) 150 | try: 151 | cls.cursor.execute(sql) 152 | cls.cursor.execute(update_sql) 153 | cls.conn.commit() 154 | print('save keyword_rank:', item) 155 | except pymysql.DatabaseError as error: 156 | print(error) 157 | cls.conn.rollback() 158 | 159 | @classmethod 160 | def fetch_sales_ranking(cls): 161 | sql = "SELECT `id`, `asin` FROM `%s`WHERE `status` =1 AND `deleted_at` is NULL" % cls.sales_table 162 | cls.cursor.execute(sql) 163 | item = cls.cursor.fetchall() 164 | return item 165 | 166 | @classmethod 167 | def fetch_keywords_ranking(cls): 168 | sql = "SELECT `a`.`id`, `a`.`keyword`, `a`.`rank` as `rank`, `b`.`asin` as `asin` FROM `%s` as `a` " \ 169 | "LEFT JOIN `%s` as `b` ON `b`.`id`=`a`.`sk_id` WHERE `b`.`deleted_at` is NULL AND `a`.`deleted_at` is NULL " % \ 170 | (cls.keyword_table, cls.sales_table) 171 | cls.cursor.execute(sql) 172 | item = cls.cursor.fetchall() 173 | return item 174 | 175 | @classmethod 176 | def update_keywords_expire_rank(cls, skwd_id): 177 | now = datetime.now(cls.tz).strftime('%Y-%m-%d %H:%M:%S') 178 | sql = "UPDATE `%s` SET `last_rank`=`rank`, `rank`='%s', `updated_at`='%s', `status`=1 WHERE `id`='%s'" % (cls.keyword_table, cls.expire_rank, now, skwd_id) 179 | py_sql = "INSERT INTO `%s`(`skwd_id`, `rank`, `date`) VALUES ('%s', '%s', '%s')" % (cls.py_keyword_table, skwd_id, cls.expire_rank, now) 180 | try: 181 | cls.cursor.execute(sql) 182 | cls.cursor.execute(py_sql) 183 | cls.conn.commit() 184 | print('update keyword_rank: [', skwd_id, '] expired') 185 | except pymysql.DataError as error: 186 | print(error) 187 | cls.conn.rollback() 188 | 189 | @classmethod 190 | def update_keywords_none_rank(cls, skwd_id): 191 | now = datetime.now(cls.tz).strftime('%Y-%m-%d %H:%M:%S') 192 | sql = "UPDATE `%s` SET `updated_at`='%s', `status`=2 WHERE `id`='%s'" % (cls.keyword_table, now, skwd_id) 193 | try: 194 | cls.cursor.execute(sql) 195 | print('update keyword_rank: [', skwd_id, '] none') 196 | except pymysql.DataError as error: 197 | print(error) 198 | cls.conn.rollback() 199 | 200 | 201 | 202 | 203 | 204 | -------------------------------------------------------------------------------- /amazon/db/ipricejot.sql: -------------------------------------------------------------------------------- 1 | SET NAMES utf8; 2 | SET FOREIGN_KEY_CHECKS = 0; 3 | 4 | -- ---------------------------- 5 | -- Table structure for `py_cates` 6 | -- ---------------------------- 7 | DROP TABLE IF EXISTS `py_cates`; 8 | CREATE TABLE `py_cates` ( 9 | `id` int(11) NOT NULL AUTO_INCREMENT, 10 | `title` varchar(512) NOT NULL, 11 | `link` varchar(512) NOT NULL, 12 | `level` tinyint(2) NOT NULL DEFAULT '1', 13 | `pid` int(11) NOT NULL DEFAULT '0', 14 | `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP, 15 | PRIMARY KEY (`id`) 16 | ) ENGINE=InnoDB AUTO_INCREMENT=38 DEFAULT CHARSET=utf8mb4; 17 | 18 | -- ---------------------------- 19 | -- Table structure for `py_asin_best` 20 | -- ---------------------------- 21 | DROP TABLE IF EXISTS `py_asin_best`; 22 | CREATE TABLE `py_asin_best` ( 23 | `id` int(11) NOT NULL AUTO_INCREMENT, 24 | `asin` char(10) NOT NULL, 25 | 
`cid` int(11) NOT NULL, 26 | `rank` tinyint(4) NOT NULL, 27 | `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP, 28 | PRIMARY KEY (`id`) 29 | ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4; 30 | 31 | SET FOREIGN_KEY_CHECKS = 1; 32 | -------------------------------------------------------------------------------- /amazon/db/py_salesranking_and_review.sql: -------------------------------------------------------------------------------- 1 | /* 2 | SQLyog v10.2 3 | MySQL - 5.7.18-log : Database - ipricejot 4 | ********************************************************************* 5 | */ 6 | 7 | 8 | /*!40101 SET NAMES utf8 */; 9 | 10 | /*!40101 SET SQL_MODE=''*/; 11 | 12 | /*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */; 13 | /*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */; 14 | /*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */; 15 | /*!40111 SET @OLD_SQL_NOTES=@@SQL_NOTES, SQL_NOTES=0 */; 16 | CREATE DATABASE /*!32312 IF NOT EXISTS*/`ipricejot` /*!40100 DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_bin */; 17 | 18 | USE `ipricejot`; 19 | 20 | /*Table structure for table `py_review_detail` */ 21 | 22 | DROP TABLE IF EXISTS `py_review_detail`; 23 | 24 | CREATE TABLE `py_review_detail` ( 25 | `id` bigint(20) NOT NULL AUTO_INCREMENT, 26 | `asin` varchar(11) CHARACTER SET utf8 NOT NULL COMMENT 'asin号', 27 | `review_id` varchar(50) CHARACTER SET utf8 NOT NULL COMMENT '评论id号', 28 | `reviewer` varchar(255) CHARACTER SET utf8 NOT NULL COMMENT '评论者', 29 | `review_url` varchar(255) COLLATE utf8mb4_bin NOT NULL COMMENT '评价链接', 30 | `star` varchar(4) CHARACTER SET utf8 NOT NULL DEFAULT '0' COMMENT '评论星级', 31 | `date` varchar(255) CHARACTER SET utf8 NOT NULL COMMENT '评论日期', 32 | `title` varchar(255) CHARACTER SET utf8 NOT NULL COMMENT '评论标题', 33 | `content` text CHARACTER SET utf8 COMMENT '评论内容', 34 | PRIMARY KEY (`id`), 35 | UNIQUE KEY `asin_review_id_unique` (`asin`,`review_id`) 36 | ) ENGINE=InnoDB AUTO_INCREMENT=2706 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin; 37 | 38 | /*Table structure for table `py_review_profile` */ 39 | 40 | DROP TABLE IF EXISTS `py_review_profile`; 41 | 42 | CREATE TABLE `py_review_profile` ( 43 | `id` bigint(20) NOT NULL AUTO_INCREMENT, 44 | `asin` varchar(11) NOT NULL COMMENT 'asin号', 45 | `product` varchar(500) NOT NULL COMMENT '产品名', 46 | `brand` varchar(255) NOT NULL COMMENT '商品标签', 47 | `seller` varchar(255) DEFAULT NULL COMMENT '销售商家', 48 | `image` varchar(255) NOT NULL DEFAULT '' COMMENT '图片地址', 49 | `review_total` int(11) NOT NULL DEFAULT '0' COMMENT '评论总数', 50 | `review_rate` varchar(4) NOT NULL DEFAULT '0' COMMENT '评论平均分值', 51 | `pct_five` tinyint(2) NOT NULL DEFAULT '0' COMMENT '5星所占比分比', 52 | `pct_four` tinyint(2) NOT NULL DEFAULT '0' COMMENT '4星所占百分比', 53 | `pct_three` tinyint(2) NOT NULL DEFAULT '0' COMMENT '3星所占百分比', 54 | `pct_two` tinyint(2) NOT NULL DEFAULT '0' COMMENT '2星所占百分比', 55 | `pct_one` tinyint(2) NOT NULL DEFAULT '0' COMMENT '1星所占百分比', 56 | `latest_total` int(11) DEFAULT NULL COMMENT '上一次的评论总数', 57 | PRIMARY KEY (`id`), 58 | UNIQUE KEY `unique_asin` (`asin`) 59 | ) ENGINE=InnoDB AUTO_INCREMENT=22 DEFAULT CHARSET=utf8; 60 | 61 | /*Table structure for table `py_salesranking_keywords` */ 62 | 63 | DROP TABLE IF EXISTS `py_salesranking_keywords`; 64 | 65 | CREATE TABLE `py_salesranking_keywords` ( 66 | `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT, 67 | `skwd_id` int(11) NOT NULL COMMENT 'salesranking_keyword_id', 68 | `rank` int(11) NOT NULL COMMENT '排名', 69 | `date` 
timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '爬取时间', 70 | PRIMARY KEY (`id`) 71 | ) ENGINE=InnoDB AUTO_INCREMENT=11 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci; 72 | 73 | /*Table structure for table `py_salesrankings` */ 74 | 75 | DROP TABLE IF EXISTS `py_salesrankings`; 76 | 77 | CREATE TABLE `py_salesrankings` ( 78 | `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT, 79 | `sk_id` int(11) NOT NULL COMMENT 'salesranking_id', 80 | `rank` int(11) NOT NULL COMMENT '排名', 81 | `classify` varchar(150) COLLATE utf8_unicode_ci NOT NULL DEFAULT '' COMMENT '分类', 82 | `date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT '爬取时间', 83 | PRIMARY KEY (`id`) 84 | ) ENGINE=InnoDB AUTO_INCREMENT=13 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci; 85 | 86 | /*Table structure for table `salesranking_keywords` */ 87 | 88 | DROP TABLE IF EXISTS `salesranking_keywords`; 89 | 90 | CREATE TABLE `salesranking_keywords` ( 91 | `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT, 92 | `sk_id` int(11) NOT NULL COMMENT 'saleranking_id', 93 | `keyword` varchar(255) COLLATE utf8_unicode_ci NOT NULL COMMENT '关键字', 94 | `status` tinyint(4) NOT NULL DEFAULT '0' COMMENT '抓取状态 0抓取中 1成功 2抓取失败', 95 | `rank` int(11) NOT NULL DEFAULT '0' COMMENT '当前排名', 96 | `last_rank` int(11) NOT NULL DEFAULT '0' COMMENT '上次排名', 97 | `deleted_at` timestamp NULL DEFAULT NULL, 98 | `created_at` timestamp NULL DEFAULT NULL, 99 | `updated_at` timestamp NULL DEFAULT NULL, 100 | PRIMARY KEY (`id`), 101 | UNIQUE KEY `salesranking_keywords_sk_id_keyword_unique` (`sk_id`,`keyword`) 102 | ) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci; 103 | 104 | /*Table structure for table `salesrankings` */ 105 | 106 | DROP TABLE IF EXISTS `salesrankings`; 107 | 108 | CREATE TABLE `salesrankings` ( 109 | `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT, 110 | `sid` int(11) NOT NULL COMMENT 'seller_uid', 111 | `asin` varchar(11) COLLATE utf8_unicode_ci NOT NULL COMMENT '商品asin号', 112 | `title` varchar(500) COLLATE utf8_unicode_ci NOT NULL COMMENT '商品名称', 113 | `image` varchar(255) COLLATE utf8_unicode_ci NOT NULL COMMENT '商品图片', 114 | `link` varchar(255) COLLATE utf8_unicode_ci NOT NULL COMMENT '亚马逊商品链接', 115 | `classify` varchar(150) COLLATE utf8_unicode_ci DEFAULT NULL COMMENT '商品分类', 116 | `status` tinyint(4) NOT NULL DEFAULT '0' COMMENT '抓取状态 0抓取中 1成功 2抓取失败', 117 | `rank` int(11) NOT NULL DEFAULT '0' COMMENT '目前排名', 118 | `last_rank` int(11) NOT NULL DEFAULT '0' COMMENT '上次排名', 119 | `deleted_at` timestamp NULL DEFAULT NULL, 120 | `created_at` timestamp NULL DEFAULT NULL, 121 | `updated_at` timestamp NULL DEFAULT NULL, 122 | PRIMARY KEY (`id`), 123 | UNIQUE KEY `salesrankings_sid_asin_unique` (`sid`,`asin`) 124 | ) ENGINE=InnoDB AUTO_INCREMENT=5 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci; 125 | 126 | /*!40101 SET SQL_MODE=@OLD_SQL_MODE */; 127 | /*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */; 128 | /*!40014 SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS */; 129 | /*!40111 SET SQL_NOTES=@OLD_SQL_NOTES */; 130 | -------------------------------------------------------------------------------- /amazon/requirements.txt: -------------------------------------------------------------------------------- 1 | Scrapy==1.4.0 2 | PyMySQL==0.7.11 3 | PyDispatcher==2.0.5 4 | pytz==2017.2 5 | -------------------------------------------------------------------------------- /amazon/scrapy.cfg: -------------------------------------------------------------------------------- 1 | # 
Automatically created by: scrapy startproject 2 | # 3 | # For more information about the [deploy] section see: 4 | # https://scrapyd.readthedocs.org/en/latest/deploy.html 5 | 6 | [settings] 7 | default = amazon.settings 8 | 9 | [deploy] 10 | #url = http://localhost:6800/ 11 | project = amazon 12 | -------------------------------------------------------------------------------- /amazon2/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dynamohuang/amazon-scrapy/33afe5b482d3d55a065084289e20540ce6d99081/amazon2/__init__.py -------------------------------------------------------------------------------- /amazon2/amazon2/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dynamohuang/amazon-scrapy/33afe5b482d3d55a065084289e20540ce6d99081/amazon2/amazon2/__init__.py -------------------------------------------------------------------------------- /amazon2/amazon2/items.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Define here the models for your scraped items 4 | # 5 | # See documentation in: 6 | # http://doc.scrapy.org/en/latest/topics/items.html 7 | 8 | import scrapy 9 | 10 | 11 | class Amazon2Item(scrapy.Item): 12 | # define the fields for your item here like: 13 | # name = scrapy.Field() 14 | pass 15 | -------------------------------------------------------------------------------- /amazon2/amazon2/middlewares/AmazonSpiderMiddleware.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Define here the models for your spider middleware 4 | # 5 | # See documentation in: 6 | # http://doc.scrapy.org/en/latest/topics/spider-middleware.html 7 | 8 | from scrapy import signals 9 | from datetime import datetime 10 | 11 | 12 | class AmazonSpiderMiddleware(object): 13 | # Not all methods need to be defined. If a method is not defined, 14 | # scrapy acts as if the spider middleware does not modify the 15 | # passed objects. 16 | 17 | @classmethod 18 | def from_crawler(cls, crawler): 19 | # This method is used by Scrapy to create your spiders. 20 | s = cls() 21 | crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) 22 | return s 23 | 24 | def process_spider_input(self, response, spider): 25 | # Called for each response that goes through the spider 26 | # middleware and into the spider. 27 | 28 | # Should return None or raise an exception. 29 | return None 30 | 31 | def process_spider_output(self, response, result, spider): 32 | # Called with the results returned from the Spider, after 33 | # it has processed the response. 34 | 35 | # Must return an iterable of Request, dict or Item objects. 36 | for i in result: 37 | yield i 38 | 39 | def process_spider_exception(self, response, exception, spider): 40 | # Called when a spider or process_spider_input() method 41 | # (from other spider middleware) raises an exception. 42 | 43 | # Should return either None or an iterable of Response, dict 44 | # or Item objects. 45 | pass 46 | 47 | def process_start_requests(self, start_requests, spider): 48 | # Called with the start requests of the spider, and works 49 | # similarly to the process_spider_output() method, except 50 | # that it doesn’t have a response associated. 51 | 52 | # Must return only requests (not items). 
53 | for r in start_requests: 54 | yield r 55 | 56 | def spider_opened(self, spider): 57 | spider.started_on = datetime.now() 58 | 59 | 60 | -------------------------------------------------------------------------------- /amazon2/amazon2/middlewares/RotateUserAgentMiddleware.py: -------------------------------------------------------------------------------- 1 | import random 2 | from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware 3 | 4 | 5 | class RotateUserAgentMiddleware(UserAgentMiddleware): 6 | def __init__(self, user_agent=''): 7 | UserAgentMiddleware.__init__(self) 8 | self.user_agent = user_agent 9 | 10 | def process_request(self, request, spider): 11 | ua = random.choice(self.user_agent_list) 12 | if ua: 13 | # print(ua) 14 | request.headers.setdefault('User-Agent', ua) 15 | 16 | # the default user_agent_list is composed of Chrome, IE, Firefox, Mozilla, Opera and Netscape user agents 17 | # for more user agent strings, see http://www.useragentstring.com/pages/useragentstring.php 18 | user_agent_list = [ 19 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1", \ 20 | "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11", \ 21 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6", \ 22 | "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6", \ 23 | "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1", \ 24 | "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5", \ 25 | "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5", \ 26 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", \ 27 | "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", \ 28 | "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", \ 29 | "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", \ 30 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", \ 31 | "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", \ 32 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", \ 33 | "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", \ 34 | "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3", \ 35 | "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24", \ 36 | "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24" 37 | ] 38 | -------------------------------------------------------------------------------- /amazon2/amazon2/middlewares/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/dynamohuang/amazon-scrapy/33afe5b482d3d55a065084289e20540ce6d99081/amazon2/amazon2/middlewares/__init__.py -------------------------------------------------------------------------------- /amazon2/amazon2/pipelines.py: 
-------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Define your item pipelines here 4 | # 5 | # Don't forget to add your pipeline to the ITEM_PIPELINES setting 6 | # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html 7 | 8 | 9 | class Amazon2Pipeline(object): 10 | def process_item(self, item, spider): 11 | return item 12 | -------------------------------------------------------------------------------- /amazon2/amazon2/settings.py: -------------------------------------------------------------------------------- 1 | 2 | BOT_NAME = 'amazon2' 3 | 4 | SPIDER_MODULES = ['amazon2.spiders'] 5 | NEWSPIDER_MODULE = 'amazon2.spiders' 6 | 7 | ROBOTSTXT_OBEY = False 8 | 9 | CONCURRENT_REQUESTS = 32 10 | 11 | COOKIES_ENABLED = False 12 | 13 | SPIDER_MIDDLEWARES = { 14 | 'amazon2.middlewares.AmazonSpiderMiddleware.AmazonSpiderMiddleware': 543, 15 | } 16 | 17 | DOWNLOADER_MIDDLEWARES = { 18 | 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None, 19 | 'amazon2.middlewares.RotateUserAgentMiddleware.RotateUserAgentMiddleware': 543, 20 | } 21 | -------------------------------------------------------------------------------- /amazon2/amazon2/spiders/AmazonBaseSpider.py: -------------------------------------------------------------------------------- 1 | import scrapy 2 | import pydispatch.dispatcher 3 | from scrapy import signals 4 | from datetime import datetime 5 | 6 | 7 | class AmazonBaseSpider(scrapy.Spider): 8 | name = "AmazonBase" 9 | custom_settings = { 10 | 'LOG_LEVEL': 'ERROR', 11 | 'LOG_ENABLED': True, 12 | 'LOG_STDOUT': True, 13 | } 14 | 15 | def __init__(self): 16 | pydispatch.dispatcher.connect(self.handle_spider_closed, signals.spider_closed) 17 | self.result_pool = {} 18 | self.log = [] 19 | 20 | def start_requests(self): 21 | return 22 | 23 | def parse(self, response): 24 | return 25 | 26 | def print_progress(self, spider): 27 | work_time = datetime.now() - spider.started_on 28 | print('Spent:', work_time, ':', len(self.result_pool), 'item fetched') 29 | 30 | def handle_spider_closed(self): 31 | return 32 | -------------------------------------------------------------------------------- /amazon2/amazon2/spiders/DemoSpider.py: -------------------------------------------------------------------------------- 1 | import scrapy 2 | from amazon2.spiders.AmazonBaseSpider import AmazonBaseSpider 3 | 4 | 5 | # scrapy crawl demo -a asin=B07K97BQDF 6 | class DemoSpider(AmazonBaseSpider): 7 | name = "demo" 8 | 9 | def __init__(self, asin='B07K97BQDF'): 10 | AmazonBaseSpider.__init__(self) 11 | self.asin = asin 12 | 13 | def start_requests(self): 14 | yield scrapy.Request( 15 | url='https://www.amazon.com/dp/' + self.asin, 16 | callback=self.parse, 17 | meta={ 18 | 'asin': self.asin, 19 | 'cid': -10 20 | } 21 | ) 22 | 23 | def parse(self, response): 24 | print(response.meta['asin']) 25 | self.result_pool[response.meta['asin']] = {} 26 | self.result_pool[response.meta['asin']]['title'] = 'title for ' + response.meta['asin'] 27 | 28 | # Bingo! 
Here we get the result, and you can store it or output it 29 | def handle_spider_closed(self, spider): 30 | print(self.result_pool.get(self.asin)) 31 | AmazonBaseSpider.print_progress(self, spider) 32 | -------------------------------------------------------------------------------- /amazon2/amazon2/spiders/__init__.py: -------------------------------------------------------------------------------- 1 | # This package will contain the spiders of your Scrapy project 2 | # 3 | # Please refer to the documentation for information on how to create and manage 4 | # your spiders. 5 | -------------------------------------------------------------------------------- /amazon2/requirements.txt: -------------------------------------------------------------------------------- 1 | Scrapy==1.5.* 2 | -------------------------------------------------------------------------------- /amazon2/scrapy.cfg: -------------------------------------------------------------------------------- 1 | # Automatically created by: scrapy startproject 2 | # 3 | # For more information about the [deploy] section see: 4 | # https://scrapyd.readthedocs.org/en/latest/deploy.html 5 | 6 | [settings] 7 | default = amazon2.settings 8 | 9 | [deploy] 10 | #url = http://localhost:6800/ 11 | project = amazon2 12 | --------------------------------------------------------------------------------
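
A note on the SQL layer: ReviewSql and RankingSql in amazon/amazon/sql.py build every statement with Python "%" string formatting plus manual conn.escape() calls, which is easy to get wrong as columns are added. If you extend that module, pymysql can quote and escape the values itself when they are passed to cursor.execute() as a parameter tuple. The sketch below is illustrative only and is not a file in this repository; insert_detail_item_safe is a hypothetical name, while the py_review_detail columns and the conn_db() helper are the ones defined above.

import pymysql


def insert_detail_item_safe(conn, item):
    # Same columns as ReviewSql.insert_detail_item, but the values travel as
    # execute() parameters, so pymysql handles the quoting and escaping.
    sql = ("INSERT INTO `py_review_detail`"
           "(`asin`, `review_id`, `reviewer`, `review_url`, `star`, `date`, `title`, `content`) "
           "VALUES (%s, %s, %s, %s, %s, %s, %s, %s)")
    params = (item['asin'], item['review_id'], item['reviewer'], item['review_url'],
              item['star'], item['date'], item['title'], item['content'])
    try:
        with conn.cursor() as cursor:
            cursor.execute(sql, params)
        conn.commit()
    except pymysql.MySQLError as error:
        print(error)
        conn.rollback()

From a pipeline this would be called as insert_detail_item_safe(conn_db(), item), mirroring how the existing helpers obtain their connection.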