├── picture
    ├── README.md
    ├── true.jpg
    └── undefined.jpg
├── README.md
└── taobao_spider.py


/picture/README.md:
--------------------------------------------------------------------------------
1 | ## The images here serve no real purpose, so this folder can be ignored :grimacing:
2 | 


--------------------------------------------------------------------------------
/picture/true.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/15920036578/TMALL_Spider/HEAD/picture/true.jpg


--------------------------------------------------------------------------------
/picture/undefined.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/15920036578/TMALL_Spider/HEAD/picture/undefined.jpg


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
 1 | # Tmall Spider
 2 | ![](https://img.shields.io/badge/Python-3.5.3-green.svg) ![](https://img.shields.io/badge/selenium-3.141.0-green.svg) ![](https://img.shields.io/badge/pyquery-1.4.0-green.svg)
 3 | #### Tmall official site - https://www.tmall.com/
 4 | |Author|Gobi Xu|
 5 | |---|---|
 6 | |Email|xusanity@aliyun.com|
 7 | ****
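 8 | ## Disclaimer
 9 | #### Everything here is for learning and discussion only. Do not use it for any commercial purpose.
10 | ## Before you start
11 | #### You will need to
12 | - **Register a [Taobao account](https://reg.taobao.com/member/reg/fill_mobile.htm)**
13 | - **Register a [Weibo account](https://weibo.com/signup/signup.php)**
14 | - **Download the [Chrome browser](https://chrome.en.softonic.com/)**
15 | - **Download the [chromedriver](http://chromedriver.storage.googleapis.com/index.html) build that matches your Chrome version**
16 | #### Quick notes
17 | - **The whole crawl is driven by selenium**
18 | - **Requests must carry cookies (that is, you must log in first); otherwise captchas appear very easily**
19 | ## Runtime environment
20 | #### Version: Python3
21 | ## Install dependencies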
22 | ```
23 | pip install selenium
24 | pip install pyquery
25 | ```
26 | ## Details
27 | - **Avoid being detected as a bot**
28 | ###### In a regular Chrome browser, typing window.navigator.webdriver into the Console normally returns undefined
29 | ![enter image description here](picture/undefined.jpg)
30 | ###### In a Chrome window launched by chromedriver, the same expression returns true
31 | ![enter image description here](picture/true.jpg)
32 | ###### Workaround
33 | ```
34 | options = webdriver.ChromeOptions()
35 | options.add_experimental_option('excludeSwitches', ['enable-automation'])  # drop the enable-automation switch ("developer mode")
36 | # Adding this option is meant to keep the chromedriver session from being detected as a bot
37 | ```
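On some Chrome builds the `excludeSwitches` option alone may not be enough to change what `window.navigator.webdriver` returns. A possible complement, sketched here under that assumption, is to ask Chrome through the DevTools protocol to redefine the property before any page script runs. The helper name `hide_webdriver_flag` and the constant `HIDE_WEBDRIVER_JS` are hypothetical and are not part of taobao_spider.py. The sketch also assumes a Selenium version whose Chrome driver exposes `execute_cdp_cmd`.

```python
# JavaScript that redefines navigator.webdriver so it reads as undefined.
# (Hypothetical constant, not part of taobao_spider.py.)
HIDE_WEBDRIVER_JS = (
    "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
)


def hide_webdriver_flag(driver):
    """Register HIDE_WEBDRIVER_JS to run at the start of every new document.

    `driver` is assumed to be a selenium Chrome WebDriver; only its
    execute_cdp_cmd() method is used, so the helper needs no selenium import.
    """
    return driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument",
        {"source": HIDE_WEBDRIVER_JS},
    )
```

It would be called once, right after the browser is created and before the first `get()`, for example `hide_webdriver_flag(self.browser)`. Whether this is enough to avoid detection on a given site is not something this sketch can guarantee.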
38 | - **Imitate human behavior**
39 | ###### Scroll down the page gradually (fixed 100 px steps with occasional pauses)
40 | ```
41 | for i in range(1, 52):
42 |   drop_down = "var q=document.documentElement.scrollTop=" + str(i*100)
43 |   self.browser.execute_script(drop_down)
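44 |   time.sleep(0.01)  # the smaller the value, the smoother the scroll and the more human it looks
45 |   # Imitate a human, who stops scrolling on spotting an interesting product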
46 |   if i == 5:
47 |     time.sleep(1.3)
48 |   if i == 15:
49 |     time.sleep(1.9)
50 |   if i == 29:
51 |     time.sleep(0.7)
52 |   if i == 44:
53 |     time.sleep(0.3)
54 | ```
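The loop above advances `scrollTop` by a fixed 100 px per step, so the scroll speed is constant between pauses. If a scroll that actually accelerates is wanted, one option is to precompute positions that grow quadratically. The following is only a sketch: the function names `accelerated_scroll_positions` and `scroll_script` are hypothetical, and the 5100 px total simply mirrors the `51 * 100` reached by the loop above.

```python
def accelerated_scroll_positions(total_px, steps):
    """Return `steps` scrollTop values that grow quadratically up to total_px.

    Position i is total_px * (i / steps) ** 2, so the distance covered per
    step increases over time (constant acceleration from rest).
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [round(total_px * (i / steps) ** 2) for i in range(1, steps + 1)]


def scroll_script(position):
    """Build the JavaScript snippet that moves the page to `position`."""
    return "document.documentElement.scrollTop=" + str(position)


if __name__ == "__main__":
    # Print the first few scripts that would be handed to execute_script().
    for pos in accelerated_scroll_positions(5100, 51)[:5]:
        print(scroll_script(pos))
```

Each generated string could be passed to `self.browser.execute_script(...)` in place of the fixed-step `drop_down` string, keeping the same `time.sleep` pauses.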
55 | ###### Drag the captcha slider to the right with acceleration (a slider captcha may appear each time the next product list page is requested)
56 | ```
57 | slider_button = WebDriverWait(self.browser, 5, 0.5).until(EC.presence_of_element_located((By.ID, 'nc_1_n1z')))
58 | action = ActionChains(self.browser)
59 | action.click_and_hold(slider_button).perform()
60 | action.reset_actions()
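61 | # Imitate a human: drag the slider to the right (each offset is larger than the last, so the drag accelerates)
62 | for i in range(100):
63 |   action.move_by_offset(i*1, 0).perform()
64 |   time.sleep(0.01)  # the smaller the value, the smoother the drag and the more human it looks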
65 | ```
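The offsets in the loop above add up to `0 + 1 + ... + 99 = 4950` px, which is probably far wider than the slider track, so the drag relies on the pointer simply running past the end. A more controlled variant, sketched here under the assumption that the track width is known, splits a fixed distance into growing offsets. The function name `accelerating_offsets` is hypothetical, and the 300 px width in the demo is a placeholder, not a measured value.

```python
def accelerating_offsets(distance, steps):
    """Split `distance` pixels into `steps` integer offsets that grow over time.

    The cumulative position follows distance * (i / steps) ** 2, so the
    offsets add up to exactly `distance` (for an integer distance) while
    each one tends to be larger than the one before it.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    positions = [round(distance * (i / steps) ** 2) for i in range(steps + 1)]
    return [b - a for a, b in zip(positions, positions[1:])]


if __name__ == "__main__":
    offsets = accelerating_offsets(300, 30)  # 300 px is an assumed track width
    print(offsets, sum(offsets))
```

Each offset could then be fed to `action.move_by_offset(offset, 0)`, and a final `action.release().perform()` would let go of the mouse button once the target distance is covered.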
66 | ## Finally
67 | - **taobao_spider.py is heavily commented, so dig in :meat_on_bone:**
68 | - **If you have any questions, feel free to email :email: me and I will reply as soon as I can.**
69 | 


--------------------------------------------------------------------------------
/taobao_spider.py:
--------------------------------------------------------------------------------
  1 | # -*- coding: utf-8 -*-
  2 | __author__ = 'Gobi Xu'
  3 | 
  4 | 
  5 | import re
  6 | import time
  7 | from selenium import webdriver
  8 | from selenium.webdriver import ActionChains
  9 | from selenium.webdriver.support.ui import WebDriverWait
 10 | from selenium.webdriver.support import expected_conditions as EC
 11 | from selenium.webdriver.common.by import By
 12 | from pyquery import PyQuery as pq
 13 | 
 14 | 
 15 | class TaobaoLogin(object):
 16 |     def __init__(self, username, password, chromedriver_path):
 17 |         self.url = 'https://login.taobao.com/member/login.jhtml'  # Taobao login URL
 18 |         self.username = username  # account name passed in
 19 |         self.password = password  # password passed in
 20 |         options = webdriver.ChromeOptions()
 21 |         options.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2})  # do not load images, to speed up page loads
 22 |         options.add_experimental_option('excludeSwitches', ['enable-automation'])  # "developer mode": meant to keep sites from recognizing that Selenium is in use
 23 |         self.browser = webdriver.Chrome(executable_path=chromedriver_path, options=options)  # chromedriver path passed in, plus the configured options (Selenium 3 style; Selenium 4 takes a Service object instead)
 24 |         self.browser.maximize_window()  # maximize the window
 25 |         self.wait = WebDriverWait(self.browser, 10)  # explicit wait with a 10-second timeout
 26 | 
 27 |     def login(self):
 28 |         self.browser.get(self.url)
 29 |         username_password_button = self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.login-box.no-longlogin.module-quick > .hd > .login-switch')))  # CSS selector: the "log in with account and password" button
 30 |         username_password_button.click()  # click it
 31 |         weibo_button = self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.weibo-login')))  # CSS selector: the "log in with Weibo" button
 32 |         weibo_button.click()  # click it
 33 |         input_username = self.wait.until(EC.presence_of_element_located((By.XPATH, '//input[@name="username"]')))  # XPath: the account field
 34 |         input_username.send_keys(self.username)  # type the account name
 35 |         input_password = self.wait.until(EC.presence_of_element_located((By.XPATH, '//input[@name="password"]')))  # XPath: the password field
 36 |         input_password.send_keys(self.password)  # type the password
 37 |         login_button = self.wait.until(EC.presence_of_element_located((By.XPATH, '//span[text()="登录"]')))  # XPath: the login button (its on-page label is 登录, "Log in")
 38 |         login_button.click()  # click the login button
 39 |         taobao_name = self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.site-nav-login-info-nick')))  # CSS selector: the Taobao nickname
 40 |         print(''.join(['Taobao account: ', taobao_name.text]))  # print the Taobao nickname
 41 | 
 42 |     def getPageTotal(self):
 43 |         page_total = self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.ui-page-skip > form')))  # CSS selector: the total-page-count box on the product list page
 44 |         page_total = page_total.text
 45 |         page_total = re.match(r'.*?(\d+).*', page_total).group(1)  # keep only the first number in the text
 46 |         return page_total
 47 | 
 48 |     def dropDown(self):
 49 |         # Imitate a human scrolling down the page (fixed 100 px steps with short pauses)
 50 |         for i in range(1, 52):
 51 |             drop_down = "var q=document.documentElement.scrollTop=" + str(i*100)
 52 |             self.browser.execute_script(drop_down)
 53 |             time.sleep(0.01)
 54 |             if i == 5:
 55 |                 time.sleep(0.7)
 56 |             if i == 15:
 57 |                 time.sleep(0.5)
 58 |             if i == 29:
 59 |                 time.sleep(0.3)
 60 |             if i == 44:
 61 |                 time.sleep(0.1)
 62 |         # Jump straight to the bottom
 63 |         # drop_down = "var q=document.documentElement.scrollTop=10000"
 64 |         # self.browser.execute_script(drop_down)
 65 | 
 66 |     def nextPage(self):
 67 |         # Find the next-page button and click it
 68 |         next_page_submit = self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.ui-page-next')))
 69 |         next_page_submit.click()
 70 | 
 71 |     def sliderVerification(self):
 72 |         # After each page turn, check for a slider captcha
 73 |         try:
 74 |             slider_button = WebDriverWait(self.browser, 5, 0.5).until(EC.presence_of_element_located((By.ID, 'nc_1_n1z')))
 75 |             action = ActionChains(self.browser)
 76 |             action.click_and_hold(slider_button).perform()
 77 |             action.reset_actions()
 78 |             # Imitate a human: drag the slider to the right (each offset is larger than the last, so the drag accelerates)
 79 |             for i in range(100):
 80 |                 action.move_by_offset(i*1, 0).perform()
 81 |                 time.sleep(0.01)
 82 |             action.reset_actions()
 83 |         except Exception:
 84 |             print('No slider captcha detected')
 85 | 
 86 |     def crawlGoods(self, category):
 87 |         self.login()
 88 |         self.browser.get('https://list.tmall.com/search_product.htm?q={0}'.format(category))  # Tmall product list URL; format() fills in the category to crawl
 89 |         page_total = self.getPageTotal()  # total number of product list pages
 90 |         print(''.join(['Total pages in this category: ', page_total]))
 91 |         for page in range(1, int(page_total) + 1):  # visit every product list page, including the last
 92 |             goods_total = self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#J_ItemList .product .product-iWrap')))  # wait until the product list on the current page has rendered (first product present)
 93 |             page_frame = self.wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.ui-page-skipTo')))  # current-page input box
 94 |             page_now = page_frame.get_attribute('value')  # current page number
 95 |             print(''.join(['Current page: ', page_now, ' ', 'Total pages: ', page_total]))
 96 |             doc = pq(self.browser.page_source)  # parse the current page source with pyquery
 97 |             goods_items = doc('#J_ItemList .product').items()  # all product nodes on the current page
 98 |             for item in goods_items:  # loop over the products
 99 |                 goods_title = item.find('.productTitle').text().replace('\n', '')  # product title, cleaned
100 |                 goods_sales_volume = item.find('.productStatus').text().replace('该款月成交\n', '').replace('笔', '')  # monthly sales, with the surrounding page text (该款月成交 ... 笔) stripped
101 |                 if goods_sales_volume:  # monthly sales may be missing
102 |                     goods_sales_volume = int(float(goods_sales_volume.replace('万', ''))*10000) if '万' in goods_sales_volume else int(goods_sales_volume)  # 万 means 10,000, e.g. 1.5万 -> 15000
103 |                 else:
104 |                     goods_sales_volume = 0
105 |                 goods_price = float(item.find('.productPrice').text().replace('¥\n', ''))  # product price, cleaned
106 |                 goods_shop = item.find('.productShop').text().replace('\n', '')  # shop name, cleaned
107 |                 goods_url = ''.join(['https:', item.find('.productImg').attr('href')])  # detail page URL, prefixed with https:
108 |                 goods_id = re.match(r'.*?id=?(\d+)&.*', goods_url).group(1)  # product id
109 |                 print(''.join(['Product id: ', goods_id, '\u0020\u0020\u0020\u0020\u0020', 'Title: ', goods_title, '\u0020\u0020\u0020\u0020\u0020', 'Price: ', str(goods_price), '\u0020\u0020\u0020\u0020\u0020', 'Monthly sales: ', str(goods_sales_volume), '\u0020\u0020\u0020\u0020\u0020', 'Shop: ', goods_shop, '\u0020\u0020\u0020\u0020\u0020', 'Detail page URL: ', goods_url]))
110 |             if page < int(page_total):  # the last page has no next page to open
111 |                 self.dropDown()  # scroll down
112 |                 self.nextPage()  # click the next-page button
113 |                 time.sleep(2)
114 |                 self.sliderVerification()  # check for a slider captcha
115 |                 time.sleep(2)
116 | 
117 | 
118 | username = '123456789'  # your Weibo account
119 | password = '*********'  # your Weibo password
120 | chromedriver_path = 'x:/xxxxxxxx/xxxxxxxxx.exe'  # path to your chromedriver executable
121 | category = '某某某'  # placeholder: the category you want to crawl
122 | 
123 | if __name__ == '__main__':
124 |     a = TaobaoLogin(username, password, chromedriver_path)
125 |     a.crawlGoods(category)
126 | 


--------------------------------------------------------------------------------