├── .gitignore
├── Dockerfile
├── README.md
├── docker-compose.yml
├── media
│   ├── 15584541954959.jpg
│   ├── 15584542414888.jpg
│   ├── 15584544774749.jpg
│   ├── 15584545518249.jpg
│   ├── 15584546784023.jpg
│   ├── 15584547028361.jpg
│   ├── 15584578582622.jpg
│   ├── 15584579051963.jpg
│   ├── 15584581938431.jpg
│   ├── 15584582326072.jpg
│   ├── 15584585019970.jpg
│   ├── 15610827417503.jpg
│   ├── 15610832298058.jpg
│   ├── 15632498867519.jpg
│   ├── 95AE10B3227FDE0637AB227A5A8267E3.png
│   ├── A580D0082CCEE0621F98FAF003C5530E.png
│   └── 赞赏码.png
├── requirements.txt
└── wechat-spider
    ├── config.py
    ├── config.yaml
    ├── core
    │   ├── capture_packet.py
    │   ├── data_pipeline.py
    │   ├── deal_data.py
    │   └── task_manager.py
    ├── create_tables.py
    ├── db
    │   ├── mysqldb.py
    │   └── redisdb.py
    ├── run.py
    └── utils
        ├── log.py
        ├── selector.py
        └── tools.py
/.gitignore:
--------------------------------------------------------------------------------
1 | */__pycache__
2 | *.pyc
3 | .svn/
4 | log/*
5 | config/
6 | .vs/
7 | .vscode/
8 | *.log
9 | venv
10 | .venv/
11 | test.py
12 | .DS_Store
13 | .idea
14 | .git
15 |
16 | config.yaml
17 | mariadb_data/
18 | mitmproxy/
--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM python:3.7
2 |
3 | RUN cat /etc/apt/sources.list | awk -F[/:] '{print $4}' | sort | uniq | grep -v "^$" | xargs -I{} sed -i 's|{}|mirrors.aliyun.com|g' /etc/apt/sources.list && \
4 | apt update && \
5 | apt install -y psmisc netcat wait-for-it && \
6 | apt-get clean && \
7 | rm -rf /var/lib/apt/lists/*
8 |
9 | WORKDIR /app
10 |
11 | COPY requirements.txt requirements.txt
12 |
13 | RUN pip3 install -i https://mirrors.aliyun.com/pypi/simple -r requirements.txt
14 |
15 | COPY . .
16 |
17 | WORKDIR /app/wechat-spider
18 |
19 | EXPOSE 8080
20 | EXPOSE 8081
21 |
22 | # ENTRYPOINT [ "python3", "./run.py" ]
23 |
24 | CMD [ "python3", "./run.py" ]
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # WeChat Spider
2 |
3 |
8 |
  9 | The following is the deployment guide.
 10 | 
 11 | For the technical documentation, see: [https://t.zsxq.com/7ubmqNJ](https://t.zsxq.com/7ubmqNJ)
 12 | 
 13 | For the reverse-engineering-based crawling approach, see: [https://wx.zsxq.com/dweb2/index/topic_detail/215584212588541](https://wx.zsxq.com/dweb2/index/topic_detail/215584212588541)
14 |
 15 | ## Features
 16 | 
 17 | - [x] Detect new articles published by each official account daily
 18 | - [x] Crawl official account profiles
 19 | - [x] Crawl article lists
 20 | - [x] Crawl article content
 21 | - [x] Crawl read, like, and comment counts
 22 | - [x] Crawl comments
 23 | - [x] Convert temporary links to permanent links
 24 | 
 25 | Download link for the packaged executable:
 26 | 
 27 | Link: https://pan.baidu.com/s/1hyhj6YnV-L9w8LPx42FFzQ  Password: qnk6
 28 | 
 29 | ## Highlights
 30 | 
 31 | 1. **No installation**: supports macOS and Windows; just double-click the executable
 32 | 2. **Automated**: configure the list of official accounts to monitor, start the program, and it crawls account and article data automatically every day
 33 | 3. **Easy integration**: crawled data is stored in MySQL, making downstream processing simple
 34 | 4. **No missed data**: task states are tracked so that no official account or article is skipped
 35 | 5. **Distributed**: multiple WeChat accounts can crawl in parallel; Android, iPhone, macOS, and Windows WeChat clients are all supported
36 |
 37 | ## Sample Data
 38 | 
 39 | **1. Official account data**
 40 | 
 41 | 
 42 | **2. Article list data**
 43 | 
 44 | 
 45 | **3. Article data**
 46 | 
 47 | 
 48 | **4. Read/like/comment count data**
 49 | 
 50 | 
 51 | **5. Comment data**
 52 | 
 53 | 
 54 | ## Requirements
 55 | 
 56 | 1. MySQL: stores the crawled data and the task tables
 57 | 2. Redis: caches tasks to reduce the number of MySQL operations
 58 | 
 59 | ## Installation
 60 | 
 61 | > Consult the following installation notes as needed; they are for reference only. Environments differ, so the steps may vary slightly; online resources can also help
 62 | 
 63 | ### 1. Install MySQL
 64 | #### 1.1 Windows
 65 | #### 1.2 macOS
 66 | ### 2. Install Redis
 67 | #### 2.1 Windows
 68 | #### 2.2 macOS
 69 | ### 3. Install the certificate
 70 | 
 71 | Visit mitm.it in a browser (it works while the proxy is active) to download it, or search online for how to install the mitmproxy certificate
 72 | 
 73 | #### 3.1 iPhone
 74 | 1. After downloading and installing, don't forget the final step:
 75 | 2. Open Settings - General - About - Certificate Trust Settings
 76 | 3. Enable the mitmproxy option.
 77 | 
 78 | #### 3.2 Android
 79 | 1. Verify after installation:
 80 | 2. Open Settings - Security - Trusted credentials
 81 | 3. Check that the installed certificate is listed
 82 | 
 83 | #### 3.3 Windows
 84 | 1. Double-click the downloaded certificate to run it
 85 | 2. Install it to the Local Machine
 86 | 3. Skip the step that asks for a key
 87 | 4. Choose "Place all certificates in the following store", then pick "Trusted Root Certification Authorities"
 88 | 5. When the warning dialog pops up, click "Yes"
 89 | 
 90 | #### 3.4 macOS
 91 | 1. Double-click the downloaded certificate to install it
 92 | 2. Open Keychain Access.app
 93 | 3. Select login (Keychains) and Certificates (Category), then find mitmproxy
 94 | 4. Click mitmproxy and set Trust to Always Trust
95 |
96 |
 97 | ### 4. Configure the proxy
 98 | 
 99 | > If you use a phone, make sure the phone and the computer running wechat-spider are connected to the same router
100 | 
101 | #### 4.1 iPhone
102 | 
103 | Open Settings - Wi-Fi - the connected Wi-Fi - Configure Proxy - Manual
104 | Enter the IP of the machine running the spider and port 8080
105 | 
106 | #### 4.2 Android
107 | 
108 | Open Settings - WLAN - long-press the connected network - Modify network - Advanced options - Manual
109 | Enter the IP of the machine running the spider and port 8080
110 | 
111 | #### 4.3 Windows
112 | Open Chrome Settings -> Advanced
113 | 
114 | 
115 | 
116 | #### 4.4 macOS
117 | 
118 | Open System Preferences.app - Network - Advanced - Proxies - Secure Web Proxy (HTTPS)
119 | Enter the IP of the machine running the spider and port 8080
120 |
121 | 
122 | 
123 |
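Before touching the phone, you can sanity-check that mitmproxy is reachable from another machine on the same network. A minimal sketch (assumes the `requests` package, which is not in requirements.txt, and that 192.168.1.2 is the IP printed by run.py on startup):

```python
import requests  # assumption: not in requirements.txt; pip3 install requests

# Replace 192.168.1.2 with the IP that run.py prints on startup.
proxies = {
    'http': 'http://192.168.1.2:8080',
    'https': 'http://192.168.1.2:8080',
}

# When the proxy is working, mitm.it serves mitmproxy's certificate page.
resp = requests.get('http://mitm.it', proxies=proxies, timeout=5)
print(resp.status_code)  # 200 means the proxy answered
```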
124 |
125 |
126 | ## Usage
127 | 
128 | ### 1. Install the certificate and configure the proxy as described above
129 | ### 2. Configure config.yaml correctly
130 | 
131 | This mainly means filling in the MySQL and Redis connection settings and making sure both can actually be reached
132 | 
133 | ### 3. Create the wechat database
134 |
135 | 
136 |
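If you prefer a script over a GUI client, a minimal sketch using PyMySQL (already in requirements.txt); the connection parameters are assumptions and must match your config.yaml:

```python
import pymysql

# Connection parameters are assumptions; use the values from your config.yaml.
conn = pymysql.connect(host='localhost', port=3306, user='root', password='root')
try:
    with conn.cursor() as cursor:
        # Same charset/collation that the tables in create_tables.py use.
        cursor.execute(
            "CREATE DATABASE IF NOT EXISTS wechat "
            "DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci"
        )
    conn.commit()
finally:
    conn.close()
```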
137 |
138 | ### 4. Start wechat-spider
139 | 
140 | If auto_create_tables is true in the config, this step creates the MySQL tables automatically. Setting it to true for the first start, then back to false once the tables exist, is recommended. Start the spider with `python3 run.py` from the wechat-spider directory, or with `docker-compose up -d --build` using the bundled compose file
141 | 
142 | ### 5. Add official-account tasks
143 | 
144 | 
145 | Insert rows into wechat_account_task, e.g.:
146 | 
147 | Filling in only __biz is enough, e.g. MzIxNzg1ODQ0MQ== (see the sketch below)
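
A minimal seeding sketch with PyMySQL; the wechat_account_task table and its __biz column come from create_tables.py, while the connection parameters are assumptions that must match config.yaml:

```python
import pymysql

# Connection parameters are assumptions; match them to config.yaml.
conn = pymysql.connect(host='localhost', port=3306, user='root',
                       password='root', db='wechat')
try:
    with conn.cursor() as cursor:
        # Only __biz is required; the spider fills in the remaining columns.
        cursor.execute(
            "INSERT INTO wechat_account_task (__biz) VALUES (%s)",
            ('MzIxNzg1ODQ0MQ==',)
        )
    conn.commit()
finally:
    conn.close()
```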
148 |
149 | ### 6. Open any official account and view its message history
150 | 
151 | 
152 | 
153 | When the notice shown in the red box above appears, everything is working; after a short while you can check the database for data
154 | 
155 | Technical discussion
156 | ----
157 | If you have questions or suggestions, join the QQ group to discuss them. Please include the note `微信爬虫学习交流`
158 |
159 |
160 |
161 |
162 | ## FAQ
163 | 
164 | ### 1. MySQL connection problems
165 | 
166 | Problem: connecting raises an "object supporting the buffer API required" exception
167 | 
168 | Fix: if the password is purely numeric, e.g. 123456, YAML parses it as an integer while PyMySQL expects a string, so wrap it in double quotes in the config file:
169 | 
170 |     mysqldb:
171 |       ip: localhost
172 |       port: 3306
173 |       db: wechat
174 |       user: root
175 |       passwd: "123456"
176 |       auto_create_tables: true # whether to create tables automatically; set true while the tables don't exist, false once they do, to speed up startup
177 |
178 | ### 2. Certificate or security warnings even though the proxy is configured correctly
179 | 
180 | The certificate I bundled has expired; see https://www.cnblogs.com/yunlongaimeng/p/9617708.html for how to install the certificate yourself
181 | 
182 | ### 3. "No task" messages
183 | 
184 | Check whether the wechat_account_task table has __biz values in it. Adding a few more for testing can help
185 |
186 | ### 4. Exception: DISCARD without MULTI
187 | 
188 | 
189 | 
190 | ### 5. Starts normally but captures no packets
191 | 
192 | 1. Check that the proxy is configured
193 | 2. Check whether the port is already in use (a quick check follows below)
194 |
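For the port check, a quick sketch (8080 is the default service_port in config.yaml):

```python
import socket

# connect_ex returns 0 when something is already listening on the port.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
in_use = s.connect_ex(('127.0.0.1', 8080)) == 0
s.close()
print('port 8080 already in use:', in_use)
```
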
195 | ## Donations
196 | 
197 | Keeping an open-source project going is not easy: maintaining the code and answering questions takes most of the time. To keep the content coming, and **if this project happens to help you**, your support is much appreciated (* ̄︶ ̄).
198 | 
199 | Deployment support and Q&A are available (for donors and developers who submit PRs only).
200 | 
201 | WeChat: boris_tm
202 |
203 | 
204 |
205 |
206 |
--------------------------------------------------------------------------------
/docker-compose.yml:
--------------------------------------------------------------------------------
1 | version: '3'
2 | services:
3 | wechat-spider:
4 | image: wechat-spider:latest
5 | build: .
6 | restart: always
7 | privileged: true
8 | command: ["wait-for-it", "mariadb:3306", "--", "python3", "./run.py"]
9 | ports:
10 | - '8080:8080'
11 | - '8081:8081'
12 | volumes:
13 | - "./mitmproxy:/root/.mitmproxy"
14 | - "./config:/app/wechat-spider/config"
15 | depends_on:
16 | - mariadb
17 | - redis
18 | links:
19 | - redis
20 | - mariadb
21 | redis:
22 | image: redis:3.2
23 | restart: always
24 | mariadb:
25 | image: bitnami/mariadb:10.5-debian-10
26 | ports:
27 | - '3306:3306'
28 | volumes:
29 | - './mariadb_data:/bitnami/mariadb'
30 | environment:
31 | - MARIADB_ROOT_USER=root
32 | - MARIADB_ROOT_PASSWORD=root
33 | - MARIADB_CHARACTER_SET=utf8mb4
34 | - MARIADB_COLLATE=utf8mb4_unicode_ci
35 | healthcheck:
36 | test: ['CMD', '/opt/bitnami/scripts/mariadb/healthcheck.sh']
37 | interval: 15s
38 | timeout: 5s
39 | retries: 6
--------------------------------------------------------------------------------
/media/15584541954959.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15584541954959.jpg
--------------------------------------------------------------------------------
/media/15584542414888.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15584542414888.jpg
--------------------------------------------------------------------------------
/media/15584544774749.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15584544774749.jpg
--------------------------------------------------------------------------------
/media/15584545518249.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15584545518249.jpg
--------------------------------------------------------------------------------
/media/15584546784023.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15584546784023.jpg
--------------------------------------------------------------------------------
/media/15584547028361.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15584547028361.jpg
--------------------------------------------------------------------------------
/media/15584578582622.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15584578582622.jpg
--------------------------------------------------------------------------------
/media/15584579051963.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15584579051963.jpg
--------------------------------------------------------------------------------
/media/15584581938431.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15584581938431.jpg
--------------------------------------------------------------------------------
/media/15584582326072.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15584582326072.jpg
--------------------------------------------------------------------------------
/media/15584585019970.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15584585019970.jpg
--------------------------------------------------------------------------------
/media/15610827417503.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15610827417503.jpg
--------------------------------------------------------------------------------
/media/15610832298058.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15610832298058.jpg
--------------------------------------------------------------------------------
/media/15632498867519.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15632498867519.jpg
--------------------------------------------------------------------------------
/media/95AE10B3227FDE0637AB227A5A8267E3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/95AE10B3227FDE0637AB227A5A8267E3.png
--------------------------------------------------------------------------------
/media/A580D0082CCEE0621F98FAF003C5530E.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/A580D0082CCEE0621F98FAF003C5530E.png
--------------------------------------------------------------------------------
/media/赞赏码.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/赞赏码.png
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | w3lib==1.22.0
2 | mitmproxy==7.0.3
3 | better_exceptions==0.2.2
4 | PyMySQL==0.10.0
5 | parsel==1.6.0
6 | six==1.15.0
7 | redis==2.10.6
8 | DBUtils==1.3
9 | PyYAML==5.4
--------------------------------------------------------------------------------
/wechat-spider/config.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | '''
3 | Created on 2019/5/18 11:54 AM
4 | ---------
5 | @summary:
6 | ---------
7 | @author:
8 | '''
9 | import os
10 | import socket
11 | import sys
12 | import shutil
13 |
14 | import yaml # pip3 install pyyaml
15 |
16 | if 'python' in sys.executable:
17 | abs_path = lambda file: os.path.abspath(os.path.join(os.path.dirname(__file__), file))
18 | else:
19 |     abs_path = lambda file: os.path.abspath(os.path.join(os.path.dirname(sys.executable), file))  # after packaging on macOS, __file__ points to the user's home directory rather than the executable's path
20 |
21 | if not os.path.exists(abs_path('./config/config.yaml')):
22 |     os.makedirs(abs_path('./config'), exist_ok=True)  # shutil.copyfile cannot create the directory itself
23 |     shutil.copyfile(abs_path('./config.yaml'), abs_path('./config/config.yaml'))
24 | config = yaml.full_load(open(abs_path('./config/config.yaml'), encoding='utf8'))
25 |
26 |
27 | def get_host_ip():
28 | """
29 |     Gets the local IP via UDP: "connect" a UDP socket to an external address, then read the local IP from the socket's own address.
30 |     No packet is actually sent, so nothing shows up in a packet-capture tool
31 | :return:
32 | """
33 | s = None
34 | try:
35 | s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
36 | s.connect(('8.8.8.8', 80))
37 | ip = s.getsockname()[0]
38 | finally:
39 | if s:
40 | s.close()
41 |
42 | return ip
43 |
44 |
45 | IP = get_host_ip()
46 |
--------------------------------------------------------------------------------
/wechat-spider/config.yaml:
--------------------------------------------------------------------------------
1 | mysqldb:
2 | ip: "mariadb"
3 | port: 3306
4 | db: test
5 | user: root
6 | passwd: "root"
 7 |   auto_create_tables: True # whether to create tables automatically; set true while the tables don't exist, false once they do, to speed up startup
8 |
9 | redisdb:
10 | ip: redis
11 | port: 6379
12 | db: 0
13 | passwd:
14 |
15 | spider:
16 |   monitor_interval: 3600 # interval, in seconds, between scans of each official account for newly published articles
17 |   ignore_haved_crawl_today_article_account: true # skip accounts whose article published today has already been crawled, i.e. stop monitoring them for the rest of the day
18 |   redis_task_cache_root_key: wechat # root key under which tasks are cached in redis, e.g. wechat:
19 |   zombie_account_not_publish_article_days: 90 # an account that has published nothing for 90 consecutive days is marked as a zombie account and no longer monitored
20 |   spider_interval:
21 |     min_sleep_time: 5
22 |     max_sleep_time: 10
23 |   no_task_sleep_time: 3600 # how long to sleep when there is no task
24 |   service_port: 8080 # port the service listens on
25 |   # crawl_time_range: 2019-07-10 00:00:00~2019-07-01 00:00:00 # time range to crawl; write ~2019-07-01 00:00:00 to leave the recent end open, or leave it unset to crawl the full history
26 | crawl_time_range:
27 |
28 | log:
29 | level: INFO
30 | to_file: false
31 | log_path: ./logs/wechat_spider.log
32 |
33 | mitm:
34 |   log_level: 0 # verbosity of the mitm framework's logging, between 0 and 3; larger values print more detail, default 1. See: https://docs.mitmproxy.org/stable/concepts-options/
35 |
--------------------------------------------------------------------------------
/wechat-spider/core/capture_packet.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | '''
3 | Created on 2019/5/8 10:59 PM
4 | ---------
5 | @summary:
6 | ---------
7 | @author:
8 | '''
9 | import re
10 |
11 | import mitmproxy
12 | from mitmproxy import ctx
13 |
14 | from core.deal_data import deal_data
15 |
16 |
17 | class WechatCapture():
18 |
19 | def response(self, flow: mitmproxy.http.HTTPFlow):
20 | url = flow.request.url
21 |
22 | next_page = None
23 | try:
24 |             if 'mp/profile_ext?action=home' in url or 'mp/profile_ext?action=getmsg' in url:  # article list, in both HTML and JSON formats
25 | 
26 |                 ctx.log.info('extracting article list data')
27 |                 next_page = deal_data.deal_article_list(url, flow.response.text)
28 | 
29 |                 flow.response.text = re.sub('', '', flow.response.text)
30 | 
31 |             elif '/s?__biz=' in url or '/mp/appmsg/show?__biz=' in url or '/mp/rumor' in url:  # article content; mp/appmsg/show?__biz is the old 2014-era link format; mp/rumor marks articles flagged as misinformation
32 | 
33 |                 ctx.log.info('extracting article content')
34 |                 next_page = deal_data.deal_article(url, flow.response.text)
35 | 
36 |                 # strip the security headers from the article response, otherwise the injected <script>setTimeout(function() {window.location.href = 'url';}, sleep_time);</script> does not execute
37 |                 flow.response.headers.pop('Content-Security-Policy', None)
38 |                 flow.response.headers.pop('content-security-policy-report-only', None)
39 |                 flow.response.headers.pop('Strict-Transport-Security', None)
40 | 
41 |                 # disable caching
42 |                 flow.response.headers['Cache-Control'] = 'no-cache, no-store, must-revalidate'
43 | 
44 |                 # strip images
45 |                 flow.response.text = re.sub('', '', flow.response.text)
46 | 
47 |             elif 'mp/getappmsgext' in url:  # read counts / view counts
48 | 
49 |                 ctx.log.info('extracting read/view counts')
50 |                 deal_data.deal_article_dynamic_info(flow.request.data.content.decode('utf-8'), flow.response.text)
51 | 
52 |             elif '/mp/appmsg_comment' in url:  # comment list
53 | 
54 |                 ctx.log.info('extracting the comment list')
55 | deal_data.deal_comment(url, flow.response.text)
56 |
57 | except Exception as e:
58 | # log.exception(e)
59 | next_page = "Exception: {}".format(e)
60 |
61 | if next_page:
62 |             # change the response content-type from json to text
63 | flow.response.headers['content-type'] = 'text/html; charset=UTF-8'
64 | if 'window.location.reload()' in next_page:
65 | flow.response.set_text(next_page)
66 | else:
67 | flow.response.set_text(next_page + flow.response.text)
68 |
69 |
70 | addons = [
71 | WechatCapture(),
72 | ]
73 |
74 | # run with: mitmdump -s capture_packet.py -p 8080
75 |
--------------------------------------------------------------------------------
/wechat-spider/core/data_pipeline.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | '''
3 | Created on 2019/5/13 12:44 AM
4 | ---------
5 | @summary:
6 | ---------
7 | @author:
8 | '''
9 | from db.mysqldb import MysqlDB
10 | import utils.tools as tools
11 | from utils.log import log
12 | from config import config
13 |
14 | db = MysqlDB(**config.get('mysqldb'))
15 |
16 |
17 | def save_account(data):
18 | log.debug(tools.dumps_json(data))
19 |
20 | sql = tools.make_insert_sql('wechat_account', data, insert_ignore=True)
21 | db.add(sql)
22 |
23 |
24 | def save_article_list(datas: list):
25 | log.debug(tools.dumps_json(datas))
26 |
27 | sql, articles = tools.make_batch_sql('wechat_article_list', datas)
28 | db.add_batch(sql, articles)
29 |
30 |     # save per-article tasks
31 | article_task = [
32 | {
33 | "sn": article.get('sn'),
34 | "article_url": article.get('url'),
35 | "__biz": article.get('__biz')
36 | }
37 | for article in datas
38 | ]
39 |
40 | sql, article_task = tools.make_batch_sql('wechat_article_task', article_task)
41 | db.add_batch(sql, article_task)
42 |
43 |
44 | def save_article(data):
45 | log.debug(tools.dumps_json(data))
46 |
47 | sql = tools.make_insert_sql('wechat_article', data, insert_ignore=True)
48 | return db.add(sql)
49 |
50 |
51 | def save_article_dynamic(data):
52 | log.debug(tools.dumps_json(data))
53 |
54 | sql = tools.make_insert_sql('wechat_article_dynamic', data, insert_ignore=True)
55 | db.add(sql)
56 |
57 |
58 | def save_article_commnet(datas: list):
59 | log.debug(tools.dumps_json(datas))
60 |
61 | sql, datas = tools.make_batch_sql('wechat_article_comment', datas)
62 | db.add_batch(sql, datas)
63 |
--------------------------------------------------------------------------------
/wechat-spider/core/deal_data.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on 2019/5/11 6:37 PM
4 | ---------
 5 | @summary: process the captured data
6 | ---------
7 | @author:
8 | """
9 | from utils.selector import Selector
10 | import utils.tools as tools
11 | from utils.log import log
12 | from core import data_pipeline
13 | from core.task_manager import TaskManager
14 |
15 |
16 | class DealData:
17 | def __init__(self):
18 | self._task_manager = TaskManager()
19 | self._task_manager.reset_task()
20 |
21 | def __parse_account_info(self, data, req_url):
22 | """
23 | @summary:
24 | ---------
25 | @param data:
26 | ---------
27 | @result:
28 | """
29 | __biz = tools.get_param(req_url, "__biz")
30 |
31 | regex = 'id="nickname">(.*?)'
32 | account = tools.get_info(data, regex, fetch_one=True).strip()
33 |
34 |         regex = 'profile_avatar">.*?
--------------------------------------------------------------------------------
/wechat-spider/core/task_manager.py:
--------------------------------------------------------------------------------
                    {monitor_interval}
 76 |                     {publish_time_condition}
 77 |                 )
 78 |                 OR (last_spider_time IS NULL)
 79 |             )
 80 |         '''.format(monitor_interval=self._monitor_interval, publish_time_condition=publish_time_condition)
81 |
82 | tasks = self._mysqldb.find(sql, to_json=True)
83 | if tasks:
84 | self._redis.zadd(self._account_task_key, tasks)
85 | task = self.__get_task_from_redis(self._account_task_key)
86 |
87 | return task
88 |
89 | def get_article_task(self):
90 | """
91 | 获取文章任务
92 | :return:
93 | {'article_url': 'http://mp.weixin.qq.com/s?__biz=MzIxNzg1ODQ0MQ==&mid=2247485501&idx=1&sn=92721338ddbf7d907eaf03a70a0715bd&chksm=97f220dba085a9cd2b9a922fb174c767603203d6dbd2a7d3a6dc41b3400a0c477a8d62b96396&scene=27#wechat_redirect'}
94 | 或
95 | None
96 | """
97 | task = self.__get_task_from_redis(self._article_task_key)
98 | if not task:
99 | sql = 'select id, article_url from wechat_article_task where state = 0 limit 5000'
100 | tasks = self._mysqldb.find(sql)
101 | if tasks:
102 |                 # update the task states
103 | task_ids = str(tuple([task[0] for task in tasks])).replace(',)', ')')
104 | sql = 'update wechat_article_task set state = 2 where id in %s' % (task_ids)
105 | self._mysqldb.update(sql)
106 |
107 | else:
108 | sql = 'select id, article_url from wechat_article_task where state = 2 limit 5000'
109 | tasks = self._mysqldb.find(sql)
110 |
111 | if tasks:
112 | task_json = [
113 | {
114 | 'article_url': article_url
115 | }
116 | for id, article_url in tasks
117 | ]
118 | self._redis.zadd(self._article_task_key, task_json)
119 | task = self.__get_task_from_redis(self._article_task_key)
120 |
121 | return task
122 |
123 | def update_article_task_state(self, sn, state=1):
124 | sql = 'update wechat_article_task set state = %s where sn = "%s"' % (state, sn)
125 | self._mysqldb.update(sql)
126 |
127 | def record_last_article_publish_time(self, __biz, last_publish_time):
128 | self._redis.hset(self._last_article_publish_time, __biz, last_publish_time or '')
129 |
130 | def is_reach_last_article_publish_time(self, __biz, publish_time):
131 | last_publish_time = self._redis.hget(self._last_article_publish_time, __biz)
132 | if not last_publish_time:
133 |             # check whether mysql has this task
134 | sql = "select last_publish_time from wechat_account_task where __biz = '%s'" % __biz
135 | data = self._mysqldb.find(sql)
136 | if data: # [(None,)] / []
137 | last_publish_time = str(data[0][0] or '')
138 | self.record_last_article_publish_time(__biz, last_publish_time)
139 |
140 | if last_publish_time is None:
141 | return
142 |
143 | if publish_time < last_publish_time:
144 | return True
145 |
146 | return False
147 |
148 | def is_in_crawl_time_range(self, publish_time):
149 | """
150 |         whether publish_time falls within the configured crawl time range
151 |         :param publish_time:
152 |         :return: the position relative to the time range
153 | """
154 | if not publish_time or (not self._crawl_time_range[0] and not self._crawl_time_range[1]):
155 | return TaskManager.IS_IN_TIME_RANGE
156 |
157 |         if self._crawl_time_range[0]:  # upper bound of the time range
158 | if publish_time > self._crawl_time_range[0]:
159 | return TaskManager.NOT_REACH_TIME_RANGE
160 |
161 | if publish_time <= self._crawl_time_range[0] and publish_time >= self._crawl_time_range[1]:
162 | return TaskManager.IS_IN_TIME_RANGE
163 |
164 |         if publish_time < self._crawl_time_range[1]:  # lower bound
165 | return TaskManager.OVER_MIN_TIME_RANGE
166 |
167 | return TaskManager.IS_IN_TIME_RANGE
168 |
169 | def record_new_last_article_publish_time(self, __biz, new_last_publish_time):
170 | self._redis.hset(self._new_last_article_publish_time, __biz, new_last_publish_time)
171 |
172 | def get_new_last_article_publish_time(self, __biz):
173 | return self._redis.hget(self._new_last_article_publish_time, __biz)
174 |
175 | def update_account_last_publish_time(self, __biz, last_publish_time):
176 | sql = 'update wechat_account_task set last_publish_time = "{}", last_spider_time="{}" where __biz="{}"'.format(
177 | last_publish_time, tools.get_current_date(), __biz
178 | )
179 | self._mysqldb.update(sql)
180 |
181 | def is_zombie_account(self, last_publish_timestamp):
182 | if tools.get_current_timestamp() - last_publish_timestamp > self._zombie_account_not_publish_article_days * 86400:
183 | return True
184 | return False
185 |
186 | def sign_account_is_zombie(self, __biz, last_publish_time=None):
187 | if last_publish_time:
188 | sql = 'update wechat_account_task set last_publish_time = "{}", last_spider_time="{}", is_zombie=1 where __biz="{}"'.format(
189 | last_publish_time, tools.get_current_date(), __biz
190 | )
191 | else:
192 | sql = 'update wechat_account_task set last_spider_time="{}", is_zombie=1 where __biz="{}"'.format(
193 | tools.get_current_date(), __biz
194 | )
195 |
196 | self._mysqldb.update(sql)
197 |
198 | def get_task(self, url=None, tip=''):
199 | """
200 |         Get a task
201 |         :param url: when a url is given, return the task wrapping that url; otherwise take an account task first, falling back to an article task. When there is no task at all, sleep for a while before fetching again
202 | :return:
203 | """
204 |
205 | sleep_time = random.randint(self._spider_interval_min, self._spider_interval_max)
206 |
207 | if not url:
208 | account_task = self.get_account_task()
209 | if account_task:
210 | __biz = account_task.get('__biz')
211 | last_publish_time = account_task.get('last_publish_time')
212 | self.record_last_article_publish_time(__biz, last_publish_time)
213 |                 tip = 'crawling the article list'
214 |                 url = 'https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz={}&scene=124#wechat_redirect'.format(__biz)
215 |             else:
216 |                 article_task = self.get_article_task()
217 |                 if article_task:
218 |                     tip = 'crawling article details'
219 |                     url = article_task.get('article_url')
220 |                 else:
221 |                     sleep_time = config.get('spider').get('no_task_sleep_time')
222 |                     log.info('no task, sleeping {}s'.format(sleep_time))
223 |                     tip = 'no task '
224 | 
225 |         if url:
226 |             next_page = "{tip} sleeping {sleep_time}s, next refresh at {begin_spider_time} ".format(
227 |                 tip=tip and tip + ' ', sleep_time=sleep_time, begin_spider_time=tools.timestamp_to_date(tools.get_current_timestamp() + sleep_time), url=url, sleep_time_msec=sleep_time * 1000
228 |             )
229 |         else:
230 |             next_page = "{tip} sleeping {sleep_time}s, next refresh at {begin_spider_time} ".format(
231 |                 tip=tip and tip + ' ', sleep_time=sleep_time, begin_spider_time=tools.timestamp_to_date(tools.get_current_timestamp() + sleep_time), sleep_time_msec=sleep_time * 1000
232 |             )
233 |
234 | return next_page
235 |
236 | def reset_task(self):
237 |         # clear the redis cache
238 | keys = self._task_root_key + "*"
239 | keys = self._redis.getkeys(keys)
240 | if keys:
241 | for key in keys:
242 | self._redis.clear(key)
243 |
244 |         # reset the tasks
245 | sql = "update wechat_article_task set state = 0 where state = 2"
246 | self._mysqldb.update(sql)
247 |
248 |
249 | if __name__ == '__main__':
250 | task_manager = TaskManager()
251 |
252 | result = task_manager.get_task()
253 | print(result)
254 |
--------------------------------------------------------------------------------
/wechat-spider/create_tables.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | '''
3 | Created on 2019/5/20 11:47 PM
4 | ---------
5 | @summary:
6 | ---------
7 | @author:
8 | '''
9 |
10 | from db.mysqldb import MysqlDB
11 | from config import config
12 |
13 | def _create_database(mysqldb, dbname):
 14 |     mysqldb.execute("CREATE DATABASE IF NOT EXISTS `%s`;" % dbname)  # MysqlDB.execute takes a single sql string, so the name is interpolated here
15 |
16 | def _create_table(mysqldb, sql):
17 | mysqldb.execute(sql)
18 |
19 |
20 | def create_table():
21 | wechat_article_list_table = '''
22 | CREATE TABLE IF NOT EXISTS `wechat_article_list` (
23 | `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
24 | `title` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
25 | `digest` varchar(2000) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
26 | `url` varchar(500) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
27 | `source_url` varchar(1000) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
28 | `cover` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
29 | `subtype` int(11) DEFAULT NULL,
30 | `is_multi` int(11) DEFAULT NULL,
31 | `author` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
32 | `copyright_stat` int(11) DEFAULT NULL,
33 | `duration` int(11) DEFAULT NULL,
34 | `del_flag` int(11) DEFAULT NULL,
35 | `type` int(11) DEFAULT NULL,
36 | `publish_time` datetime DEFAULT NULL,
37 | `sn` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
38 | `spider_time` datetime DEFAULT NULL,
39 | `__biz` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
40 | PRIMARY KEY (`id`),
41 | UNIQUE KEY `sn` (`sn`)
42 | ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=DYNAMIC
43 | '''
44 |
45 | wechat_article_task_table = '''
46 | CREATE TABLE IF NOT EXISTS `wechat_article_task` (
47 | `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
48 | `sn` varchar(50) DEFAULT NULL,
49 | `article_url` varchar(255) DEFAULT NULL,
50 | `state` int(11) DEFAULT '0' COMMENT '文章抓取状态,0 待抓取 2 抓取中 1 抓取完毕 -1 抓取失败',
51 | `__biz` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
52 | PRIMARY KEY (`id`),
53 | UNIQUE KEY `sn` (`sn`) USING BTREE,
54 | KEY `state` (`state`) USING BTREE
55 | ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
56 | '''
57 |
58 | wechat_article_dynamic_table = '''
59 | CREATE TABLE IF NOT EXISTS `wechat_article_dynamic` (
60 | `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
61 | `sn` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
62 | `read_num` int(11) DEFAULT NULL,
63 | `like_num` int(11) DEFAULT NULL,
64 | `comment_count` int(11) DEFAULT NULL,
65 | `spider_time` datetime DEFAULT NULL,
66 | `__biz` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
67 | PRIMARY KEY (`id`),
68 | UNIQUE KEY `sn` (`sn`)
69 | ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=DYNAMIC
70 | '''
71 |
72 | wechat_article_comment_table = '''
73 | CREATE TABLE IF NOT EXISTS `wechat_article_comment` (
74 | `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
75 | `comment_id` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL COMMENT '与文章关联',
76 | `nick_name` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
77 | `logo_url` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
78 | `content` varchar(2000) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
79 | `create_time` datetime DEFAULT NULL,
80 | `content_id` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL COMMENT '本条评论内容的id',
81 | `like_num` int(11) DEFAULT NULL,
82 | `is_top` int(11) DEFAULT NULL,
83 | `spider_time` datetime DEFAULT NULL,
84 | `__biz` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
85 | PRIMARY KEY (`id`),
86 | UNIQUE KEY `content_id` (`content_id`),
87 | KEY `comment_id` (`comment_id`)
88 | ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=DYNAMIC
89 | '''
90 |
91 | wechat_article_table = '''
92 | CREATE TABLE IF NOT EXISTS `wechat_article` (
93 | `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
94 | `account` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
95 | `title` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
96 | `url` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
97 | `author` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
98 | `publish_time` datetime DEFAULT NULL,
99 | `__biz` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
100 | `digest` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
101 | `cover` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
102 | `pics_url` text CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
103 | `content_html` text COLLATE utf8mb4_unicode_ci,
104 | `source_url` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
105 | `comment_id` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
106 | `sn` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
107 | `spider_time` datetime DEFAULT NULL,
108 | PRIMARY KEY (`id`),
109 | UNIQUE KEY `sn` (`sn`),
110 | KEY `__biz` (`__biz`)
111 | ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=DYNAMIC
112 | '''
113 |
114 | wechat_account_task_table = '''
115 | CREATE TABLE IF NOT EXISTS `wechat_account_task` (
116 | `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
117 | `__biz` varchar(50) DEFAULT NULL,
118 | `last_publish_time` datetime DEFAULT NULL COMMENT '上次抓取到的文章发布时间,做文章增量采集用',
119 | `last_spider_time` datetime DEFAULT NULL COMMENT '上次抓取时间,用于同一个公众号每隔一段时间扫描一次',
120 | `is_zombie` int(11) DEFAULT '0' COMMENT '僵尸号 默认3个月未发布内容为僵尸号,不再检测',
121 | PRIMARY KEY (`id`)
122 | ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
123 | '''
124 |
125 | wechat_account_table = '''
126 | CREATE TABLE IF NOT EXISTS `wechat_account` (
127 | `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
128 | `__biz` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
129 | `account` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
130 | `head_url` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
131 | `summary` varchar(500) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
132 | `qr_code` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
133 | `verify` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
134 | `spider_time` datetime DEFAULT NULL,
135 | PRIMARY KEY (`id`),
136 | UNIQUE KEY `__biz` (`__biz`)
137 | ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=DYNAMIC
138 | '''
139 |
140 | if config.get('mysqldb').get('auto_create_tables'):
141 | mysqldb = MysqlDB(**config.get('mysqldb'))
142 | # _create_database(mysqldb, config.get('mysqldb').get('db'))
143 | _create_table(mysqldb, wechat_article_list_table)
144 | _create_table(mysqldb, wechat_article_task_table)
145 | _create_table(mysqldb, wechat_article_dynamic_table)
146 | _create_table(mysqldb, wechat_article_comment_table)
147 | _create_table(mysqldb, wechat_article_table)
148 | _create_table(mysqldb, wechat_account_task_table)
149 | _create_table(mysqldb, wechat_account_table)
150 |
--------------------------------------------------------------------------------
/wechat-spider/db/mysqldb.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | '''
3 | Created on 2016-11-16 16:25
4 | ---------
  5 | @summary: mysql database helper
6 | ---------
7 | @author: Boris
8 | '''
9 | import datetime
10 | import json
11 |
12 | import pymysql
13 | from DBUtils.PooledDB import PooledDB
14 | from pymysql import cursors
15 | from pymysql import err
16 |
17 | from utils.log import log
18 |
19 |
20 | def auto_retry(func):
 21 |     def wrapper(*args, **kwargs):
22 | for i in range(3):
23 | try:
24 | return func(*args, **kwargs)
25 | except (err.InterfaceError, err.OperationalError, err.ProgrammingError) as e:
26 | log.error('''
27 | error:%s
28 | sql: %s
29 | ''' % (e, kwargs.get('sql') or args[1]))
30 |
 31 |     return wrapper
32 |
33 |
34 | class MysqlDB():
35 |
36 | def __init__(self, ip=None, port=None, db=None, user=None, passwd=None, **kwargs):
 37 |         # the settings may be modified at runtime, so the defaults cannot be assigned here directly; they are loaded afterwards
38 | try:
39 | self.connect_pool = PooledDB(creator=pymysql, mincached=1, maxcached=100, maxconnections=100, blocking=True, ping=7,
 40 |                                      host=ip, port=port, user=user, passwd=passwd, db=db, charset='utf8mb4', cursorclass=cursors.SSCursor)  # use a server-side cursor; the default client-side cursor keeps growing memory during large multi-threaded bulk inserts
41 | except Exception as e:
42 | input('''
43 | ******************************************
 44 |             Could not connect to the mysql database.
 45 |             Your current connection settings are:
46 | ip = {}
47 | port = {}
48 | db = {}
49 | user = {}
50 | passwd = {}
 51 |             Please install and configure mysql correctly as described in the guide, then restart this program
52 | Exception: {}'''.format(ip, port, db, user, passwd, str(e))
53 | )
54 | import sys
55 | sys.exit()
56 |
57 | def get_connection(self):
58 | conn = self.connect_pool.connection(shareable=False)
59 | # cursor = conn.cursor(cursors.SSCursor)
60 | cursor = conn.cursor()
61 |
62 | return conn, cursor
63 |
64 | def close_connection(self, conn, cursor):
65 | cursor.close()
66 | conn.close()
67 |
68 | def size_of_connections(self):
69 | '''
 70 |         the number of currently active connections
71 | @return:
72 | '''
73 | return self.connect_pool._connections
74 |
75 | def size_of_connect_pool(self):
76 | '''
 77 |         the total number of connections in the pool
78 | @return:
79 | '''
80 | return len(self.connect_pool._idle_cache)
81 |
82 | @auto_retry
83 | def find(self, sql, limit=0, to_json=False, cursor=None):
84 | '''
85 | @summary:
 86 |             no rows: returns ()
 87 |             rows: if limit == 1, returns (data1, data2)
 88 |                   otherwise returns ((data1, data2),)
89 | ---------
90 | @param sql:
91 | @param limit:
92 | ---------
93 | @result:
94 | '''
95 | conn, cursor = self.get_connection()
96 |
97 | cursor.execute(sql)
98 |
99 | if limit == 1:
100 |             result = cursor.fetchone()  # fetches everything then truncates; not recommended
101 | elif limit > 1:
102 |             result = cursor.fetchmany(limit)  # fetches everything then truncates; not recommended
103 | else:
104 | result = cursor.fetchall()
105 |
106 | if to_json:
107 | columns = [i[0] for i in cursor.description]
108 |
109 |             # handle date/time types
110 | def fix_lob(row):
111 | def convert(col):
112 | if isinstance(col, (datetime.date, datetime.time)):
113 | return str(col)
114 | elif isinstance(col, str) and (col.startswith('{') or col.startswith('[')):
115 | try:
116 | return json.loads(col)
117 | except:
118 | return col
119 | else:
120 | return col
121 |
122 | return [convert(c) for c in row]
123 |
124 | result = [fix_lob(row) for row in result]
125 | result = [dict(zip(columns, r)) for r in result]
126 |
127 | self.close_connection(conn, cursor)
128 |
129 | return result
130 |
131 | def add(self, sql, exception_callfunc=''):
132 | affect_count = None
133 |
134 | try:
135 | conn, cursor = self.get_connection()
136 | affect_count = cursor.execute(sql)
137 | conn.commit()
138 |
139 | except Exception as e:
140 | log.error('''
141 | error:%s
142 | sql: %s
143 | ''' % (e, sql))
144 | if exception_callfunc:
145 | exception_callfunc(e)
146 | finally:
147 | self.close_connection(conn, cursor)
148 |
149 | return affect_count
150 |
151 | def add_batch(self, sql, datas):
152 | '''
153 | @summary:
154 | ---------
155 | @ param sql: insert ignore into (xxx,xxx) values (%s, %s, %s)
156 |         @param datas: [[..], [...]]
157 | ---------
158 | @result:
159 | '''
160 | affect_count = None
161 |
162 | try:
163 | conn, cursor = self.get_connection()
164 | affect_count = cursor.executemany(sql, datas)
165 | conn.commit()
166 |
167 | except Exception as e:
168 | log.error('''
169 | error:%s
170 | sql: %s
171 | ''' % (e, sql))
172 | finally:
173 | self.close_connection(conn, cursor)
174 |
175 | return affect_count
176 |
177 | def update(self, sql):
178 | try:
179 | conn, cursor = self.get_connection()
180 | cursor.execute(sql)
181 | conn.commit()
182 |
183 | except Exception as e:
184 | log.error('''
185 | error:%s
186 | sql: %s
187 | ''' % (e, sql))
188 | return False
189 | else:
190 | return True
191 | finally:
192 | self.close_connection(conn, cursor)
193 |
194 | def delete(self, sql):
195 | try:
196 | conn, cursor = self.get_connection()
197 | cursor.execute(sql)
198 | conn.commit()
199 |
200 | except Exception as e:
201 | log.error('''
202 | error:%s
203 | sql: %s
204 | ''' % (e, sql))
205 | return False
206 | else:
207 | return True
208 | finally:
209 | self.close_connection(conn, cursor)
210 |
211 | def execute(self, sql):
212 | try:
213 | conn, cursor = self.get_connection()
214 | cursor.execute(sql)
215 | conn.commit()
216 |
217 | except Exception as e:
218 | log.error('''
219 | error:%s
220 | sql: %s
221 | ''' % (e, sql))
222 | return False
223 | else:
224 | return True
225 | finally:
226 | self.close_connection(conn, cursor)
227 |
228 | def set_unique_key(self, table, key):
229 | try:
230 | sql = 'alter table %s add unique (%s)' % (table, key)
231 |
232 | conn, cursor = self.get_connection()
233 | cursor.execute(sql)
234 | conn.commit()
235 |
236 | except Exception as e:
237 | log.error(table + ' ' + str(e) + ' key = ' + key)
238 | return False
239 | else:
240 |             log.debug('created a unique index on table %s, key %s' % (table, key))
241 | return True
242 | finally:
243 | self.close_connection(conn, cursor)
244 |
245 |
246 | if __name__ == '__main__':
247 | db = MysqlDB()
248 | sql = "select is_done from qiancheng_job_list_batch_record where id = 3"
249 |
250 | data = db.find(sql)
251 | print(data)
252 |
--------------------------------------------------------------------------------
/wechat-spider/db/redisdb.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | '''
3 | Created on 2016-11-16 16:25
4 | ---------
  5 | @summary: redis database helper
6 | ---------
7 | @author: Boris
8 | '''
9 |
10 | from utils.log import log
11 | import redis
12 |
13 |
14 | # setting.REDISDB_DB = 0
15 | # setting.REDISDB_IP_PORTS = 'localhost:6379'
16 | # setting.REDISDB_USER_PASS = None
17 |
18 |
19 | class RedisDB():
20 |
21 | def __init__(self, ip=None, port=None, db=None, passwd=None, decode_responses=True):
22 | self._is_redis_cluster = False
23 | try:
 24 |             self._redis = redis.Redis(host=ip, port=port, db=db, password=passwd, decode_responses=decode_responses)  # the redis default port is 6379
25 | self._redis.ping()
26 | except Exception as e:
27 | input('''
28 | ******************************************
 29 |             Could not connect to the redis database.
 30 |             Your current connection settings are:
31 | ip = {}
32 | port = {}
33 | db = {}
34 | passwd = {}
 35 |             Please install and configure redis correctly as described in the guide, then restart this program
36 | Exception: {}'''.format(ip, port, db, passwd, str(e))
37 | )
38 | import sys
39 | sys.exit()
40 |
41 | def sadd(self, table, values):
42 | '''
 43 |         @summary: store values in an unordered set, deduplicated
 44 |         ---------
 45 |         @param table:
 46 |         @param values: a single value or a list
 47 |         ---------
 48 |         @result: returns 0 if the value already exists, otherwise inserts it and returns 1. Batch adds return None
49 | '''
50 |
51 | if isinstance(values, list):
 52 |             pipe = self._redis.pipeline(transaction=True)  # by default redis-py takes and returns a pooled connection per command; a pipeline batches several commands into one round trip and is atomic by default
53 |
54 | if not self._is_redis_cluster:
55 | pipe.multi()
56 | for value in values:
57 | pipe.sadd(table, value)
58 | pipe.execute()
59 |
60 | else:
61 | return self._redis.sadd(table, values)
62 |
63 | def sget(self, table, count=1, is_pop=True):
64 | '''
 65 |         returns a list, e.g. ['1'] or []
66 | @param table:
67 | @param count:
68 | @param is_pop:
69 | @return:
70 | '''
71 | datas = []
72 | if is_pop:
73 | count = count if count <= self.sget_count(table) else self.sget_count(table)
74 | if count:
75 | if count > 1:
 76 |                     pipe = self._redis.pipeline(transaction=True)  # a pipeline batches several commands into one round trip; atomic by default
77 |
78 | if not self._is_redis_cluster:
79 | pipe.multi()
80 | while count:
81 | pipe.spop(table)
82 | count -= 1
83 | datas = pipe.execute()
84 |
85 | else:
86 | datas.append(self._redis.spop(table))
87 |
88 | else:
89 | datas = self._redis.srandmember(table, count)
90 |
91 | return datas
92 |
93 | def srem(self, table, values):
94 | '''
 95 |         @summary: remove the given members from the set
 96 |         ---------
 97 |         @param table:
 98 |         @param values: a single value or a list
99 | ---------
100 | @result:
101 | '''
102 | if isinstance(values, list):
103 |             pipe = self._redis.pipeline(transaction=True)  # a pipeline batches several commands into one round trip; atomic by default
104 |
105 | if not self._is_redis_cluster:
106 | pipe.multi()
107 | for value in values:
108 | pipe.srem(table, value)
109 | pipe.execute()
110 | else:
111 | self._redis.srem(table, values)
112 |
113 | def sget_count(self, table):
114 | return self._redis.scard(table)
115 |
116 | def sdelete(self, table):
117 | '''
118 |         @summary: delete a large set key (a table holding lots of data)
119 |             uses the sscan command to scan ~500 members at a time, then srem to remove each one
120 |             a direct delete on a huge key can block Redis, leading to failovers and application crashes.
121 | ---------
122 | @param table:
123 | ---------
124 | @result:
125 | '''
126 |         # when the SCAN cursor argument is 0 the server starts a new iteration, and it returns a cursor of 0 once the iteration is complete
127 | cursor = '0'
128 | while cursor != 0:
129 | cursor, data = self._redis.sscan(table, cursor=cursor, count=500)
130 | for item in data:
131 | # pipe.srem(table, item)
132 | self._redis.srem(table, item)
133 |
134 | # pipe.execute()
135 |
136 | def zadd(self, table, values, prioritys=0):
137 | '''
138 |         @summary: store values in a sorted set, deduplicated (existing values get their score updated)
139 |         ---------
140 |         @param table:
141 |         @param values: a single value or a list
142 |         @param prioritys: priority; a double, a single value or a list. Items are ordered by this field, smaller values first. Optional; defaults to 0
143 |         ---------
144 |         @result: returns 0 if the value already exists, otherwise inserts it and returns 1. Batch adds return [0, 1 ...]
145 | '''
146 | if isinstance(values, list):
147 | if not isinstance(prioritys, list):
148 | prioritys = [prioritys] * len(values)
149 | else:
150 |                 assert len(values) == len(prioritys), 'values and prioritys must correspond one to one'
151 |
152 | pipe = self._redis.pipeline(transaction=True)
153 |
154 | if not self._is_redis_cluster:
155 | pipe.multi()
156 | for value, priority in zip(values, prioritys):
157 | if self._is_redis_cluster:
158 | pipe.zadd(table, priority, value)
159 | else:
160 | pipe.zadd(table, value, priority)
161 | return pipe.execute()
162 |
163 | else:
164 | if self._is_redis_cluster:
165 | return self._redis.zadd(table, prioritys, values)
166 | else:
167 | return self._redis.zadd(table, values, prioritys)
168 |
169 | def zget(self, table, count=1, is_pop=True):
170 | '''
171 |         @summary: fetch items from the sorted set; lower scores (higher priority) come first
172 |         ---------
173 |         @param table:
174 |         @param count: number of items; -1 returns everything
175 |         @param is_pop: whether to remove fetched items from the set, default yes
176 |         ---------
177 |         @result: a list
178 | '''
179 |         start_pos = 0  # inclusive
180 | end_pos = count - 1 if count > 0 else count
181 |
182 |         pipe = self._redis.pipeline(transaction=True)  # a pipeline batches several commands into one round trip; atomic by default
183 |
184 | if not self._is_redis_cluster:
185 |             pipe.multi()  # mark the start of the transaction, see http://www.runoob.com/redis/redis-transactions.html
186 |         pipe.zrange(table, start_pos, end_pos)  # read
187 |         if is_pop:
188 |             pipe.zremrangebyrank(table, start_pos, end_pos)  # delete
189 | results, *count = pipe.execute()
190 | return results
191 |
192 | def zremrangebyscore(self, table, priority_min, priority_max):
193 | '''
194 |         remove members by score, inclusive range
195 |         @param table:
196 |         @param priority_min:
197 |         @param priority_max:
198 |         @return: the number of members removed
199 | '''
200 | return self._redis.zremrangebyscore(table, priority_min, priority_max)
201 |
202 | def zrangebyscore(self, table, priority_min, priority_max, count=None, is_pop=True):
203 | '''
204 |         @summary: return the data in the given score range, inclusive
205 |         ---------
206 |         @param table:
207 |         @param priority_min: lower scores mean higher priority
208 |         @param priority_max:
209 |         @param count: how many to fetch; empty means everything in the range
210 |         @param is_pop: whether to delete the fetched items
211 | ---------
212 | @result:
213 | '''
214 |
215 |         # use a lua script to keep the operation atomic
216 | lua = '''
217 | local key = KEYS[1]
218 | local min_score = ARGV[2]
219 | local max_score = ARGV[3]
220 | local is_pop = ARGV[4]
221 | local count = ARGV[5]
222 |
223 |         -- read the values
224 | local datas = nil
225 | if count then
226 | datas = redis.call('zrangebyscore', key, min_score, max_score, 'limit', 0, count)
227 | else
228 | datas = redis.call('zrangebyscore', key, min_score, max_score)
229 | end
230 |
231 |         -- remove the values just read from redis (is_pop arrives as the string '1' or '0')
232 |         if is_pop == '1' then
233 | for i=1, #datas do
234 | redis.call('zrem', key, datas[i])
235 | end
236 | end
237 |
238 |
239 | return datas
240 |
241 | '''
242 | cmd = self._redis.register_script(lua)
243 |         if count:
244 |             res = cmd(keys=[table], args=[table, priority_min, priority_max, int(is_pop), count])  # int(is_pop): a bare bool would reach the script as 'True'/'False', both of which are truthy in Lua
245 |         else:
246 |             res = cmd(keys=[table], args=[table, priority_min, priority_max, int(is_pop)])
247 |
248 | return res
249 |
250 | def zrangebyscore_increase_score(self, table, priority_min, priority_max, increase_score, count=None):
251 | '''
252 |         @summary: return the data in the given score range, inclusive, adjusting scores at the same time
253 |         ---------
254 |         @param table:
255 |         @param priority_min: minimum score
256 |         @param priority_max: maximum score
257 |         @param increase_score: score delta; positive adds to the current score, negative subtracts
258 |         @param count: how many to fetch; empty means everything in the range
259 | ---------
260 | @result:
261 | '''
262 |
263 |         # use a lua script to keep the operation atomic
264 | lua = '''
265 | local key = KEYS[1]
266 | local min_score = ARGV[1]
267 | local max_score = ARGV[2]
268 | local increase_score = ARGV[3]
269 | local count = ARGV[4]
270 |
271 |         -- read the values
272 | local datas = nil
273 | if count then
274 | datas = redis.call('zrangebyscore', key, min_score, max_score, 'limit', 0, count)
275 | else
276 | datas = redis.call('zrangebyscore', key, min_score, max_score)
277 | end
278 |
279 |         -- adjust the priorities
280 | for i=1, #datas do
281 | redis.call('zincrby', key, increase_score, datas[i])
282 | end
283 |
284 | return datas
285 |
286 | '''
287 | cmd = self._redis.register_script(lua)
288 | if count:
289 | res = cmd(keys=[table], args=[priority_min, priority_max, increase_score, count])
290 | else:
291 | res = cmd(keys=[table], args=[priority_min, priority_max, increase_score])
292 |
293 | return res
294 |
295 | def zrangebyscore_set_score(self, table, priority_min, priority_max, score, count=None):
296 | '''
297 |         @summary: return the data in the given score range, inclusive, setting their scores at the same time
298 |         ---------
299 |         @param table:
300 |         @param priority_min: minimum score
301 |         @param priority_max: maximum score
302 |         @param score: the score to set
303 |         @param count: how many to fetch; empty means everything in the range
304 | ---------
305 | @result:
306 | '''
307 |
308 |         # use a lua script to keep the operation atomic
309 | lua = '''
310 | local key = KEYS[1]
311 | local min_score = ARGV[1]
312 | local max_score = ARGV[2]
313 | local set_score = ARGV[3]
314 | local count = ARGV[4]
315 |
316 |         -- read the values
317 | local datas = nil
318 | if count then
319 | datas = redis.call('zrangebyscore', key, min_score, max_score, 'withscores','limit', 0, count)
320 | else
321 | datas = redis.call('zrangebyscore', key, min_score, max_score, 'withscores')
322 | end
323 |
324 |         local real_datas = {} -- the results
325 |         -- adjust the priorities
326 | for i=1, #datas, 2 do
327 | local data = datas[i]
328 | local score = datas[i+1]
329 |
330 |             table.insert(real_datas, data) -- append the value
331 |
332 | redis.call('zincrby', key, set_score - score, datas[i])
333 | end
334 |
335 | return real_datas
336 |
337 | '''
338 | cmd = self._redis.register_script(lua)
339 | if count:
340 | res = cmd(keys=[table], args=[priority_min, priority_max, score, count])
341 | else:
342 | res = cmd(keys=[table], args=[priority_min, priority_max, score])
343 |
344 | return res
345 |
346 | def zget_count(self, table, priority_min=None, priority_max=None):
347 | '''
348 |         @summary: get the number of items in the table
349 |         ---------
350 |         @param table:
351 |         @param priority_min: lower bound of the score range (inclusive)
352 |         @param priority_max: upper bound of the score range (inclusive)
353 | ---------
354 | @result:
355 | '''
356 |
357 | if priority_min != None and priority_max != None:
358 | return self._redis.zcount(table, priority_min, priority_max)
359 | else:
360 | return self._redis.zcard(table)
361 |
362 | def zrem(self, table, values):
363 | '''
364 |         @summary: remove the given members from the sorted set
365 |         ---------
366 |         @param table:
367 |         @param values: a single value or a list
368 | ---------
369 | @result:
370 | '''
371 | if isinstance(values, list):
372 |             pipe = self._redis.pipeline(transaction=True)  # a pipeline batches several commands into one round trip; atomic by default
373 |
374 | if not self._is_redis_cluster:
375 | pipe.multi()
376 | for value in values:
377 | pipe.zrem(table, value)
378 | pipe.execute()
379 | else:
380 | self._redis.zrem(table, values)
381 |
382 | def zexists(self, table, values):
383 | '''
384 |         use zscore to check whether the given member(s) exist
385 | @param values:
386 | @return:
387 | '''
388 | is_exists = []
389 |
390 | if isinstance(values, list):
391 |             pipe = self._redis.pipeline(transaction=True)  # a pipeline batches several commands into one round trip; atomic by default
392 | pipe.multi()
393 | for value in values:
394 | pipe.zscore(table, value)
395 | is_exists_temp = pipe.execute()
396 | for is_exist in is_exists_temp:
397 | if is_exist != None:
398 | is_exists.append(1)
399 | else:
400 | is_exists.append(0)
401 |
402 | else:
403 | is_exists = self._redis.zscore(table, values)
404 | is_exists = 1 if is_exists != None else 0
405 |
406 | return is_exists
407 |
408 | def lpush(self, table, values):
409 | if isinstance(values, list):
410 |             pipe = self._redis.pipeline(transaction=True)  # a pipeline batches several commands into one round trip; atomic by default
411 |
412 | if not self._is_redis_cluster:
413 | pipe.multi()
414 | for value in values:
415 |                 pipe.rpush(table, value)  # pushes to the right so that lpop pops in FIFO order
416 | pipe.execute()
417 |
418 | else:
419 | return self._redis.rpush(table, values)
420 |
421 | def lpop(self, table, count=1):
422 | '''
423 | @summary:
424 | ---------
425 | @param table:
426 | @param count:
427 | ---------
428 |         @result: returns a list when count > 1
429 | '''
430 | datas = None
431 |
432 | count = count if count <= self.lget_count(table) else self.lget_count(table)
433 |
434 | if count:
435 | if count > 1:
436 |                 pipe = self._redis.pipeline(transaction=True)  # a pipeline batches several commands into one round trip; atomic by default
437 |
438 | if not self._is_redis_cluster:
439 | pipe.multi()
440 | while count:
441 | pipe.lpop(table)
442 | count -= 1
443 | datas = pipe.execute()
444 |
445 | else:
446 | datas = self._redis.lpop(table)
447 |
448 | return datas
449 |
450 | def rpoplpush(self, from_table, to_table=None):
451 | '''
452 |         pop the last element (the tail) of from_table and return it to the client.
453 |         push the popped element onto the head of to_table.
454 |         if from_table and to_table are the same, the tail element is moved to the head and returned; this special case can be seen as a rotation of the list
455 | @param from_table:
456 | @param to_table:
457 | @return:
458 | '''
459 |
460 | if not to_table:
461 | to_table = from_table
462 |
463 | return self._redis.rpoplpush(from_table, to_table)
464 |
465 | def lget_count(self, table):
466 | return self._redis.llen(table)
467 |
468 | def lrem(self, table, value, num=0):
469 | return self._redis.lrem(table, value, num)
470 |
471 | def hset(self, table, key, value):
472 | '''
473 | @summary:
474 |             if the key does not exist, a new hash is created and the HSET is applied.
475 |             if the field already exists, its old value is overwritten
476 |         ---------
477 |         @param table:
478 |         @param key:
479 |         @param value:
480 |         ---------
481 |         @result: 1 newly inserted; 0 overwritten
482 | '''
483 |
484 | return self._redis.hset(table, key, value)
485 |
486 | def hincrby(self, table, key, increment):
487 | return self._redis.hincrby(table, key, increment)
488 |
489 | def hget(self, table, key, is_pop=False):
490 | if not is_pop:
491 | return self._redis.hget(table, key)
492 | else:
493 | lua = '''
494 | local key = KEYS[1]
495 | local field = ARGV[1]
496 |
497 |             -- read the value
498 | local datas = redis.call('hget', key, field)
499 |             -- delete it
500 | redis.call('hdel', key, field)
501 |
502 | return datas
503 |
504 | '''
505 | cmd = self._redis.register_script(lua)
506 | res = cmd(keys=[table], args=[key])
507 |
508 | return res
509 |
510 | def hgetall(self, table):
511 | return self._redis.hgetall(table)
512 |
513 | def hexists(self, table, key):
514 | return self._redis.hexists(table, key)
515 |
516 | def hdel(self, table, *keys):
517 | '''
518 |         @summary: delete the given field(s); several may be passed
519 | ---------
520 | @param table:
521 | @param *keys:
522 | ---------
523 | @result:
524 | '''
525 |
526 | self._redis.hdel(table, *keys)
527 |
528 | def hget_count(self, table):
529 | return self._redis.hlen(table)
530 |
531 | def setbit(self, table, offsets, values):
532 | '''
533 |         set the bit(s) at the given offset(s) of the string; returns the previous value(s)
534 |         @param table:
535 |         @param offsets: a single value or a list
536 |         @param values: a single value or a list
537 |         @return: a list / a single value
538 | '''
539 | if isinstance(offsets, list):
540 | if not isinstance(values, list):
541 | values = [values] * len(offsets)
542 | else:
543 |                 assert len(offsets) == len(values), 'offsets and values must correspond one to one'
544 |
545 |             pipe = self._redis.pipeline(transaction=True)  # a pipeline batches several commands into one round trip; atomic by default
546 | pipe.multi()
547 |
548 | for offset, value in zip(offsets, values):
549 | pipe.setbit(table, offset, value)
550 |
551 | return pipe.execute()
552 |
553 | else:
554 | return self._redis.setbit(table, offsets, values)
555 |
556 | def getbit(self, table, offsets):
557 | '''
558 |         read the bit(s) at the given offset(s) of the string
559 |         @param table:
560 |         @param offsets: a list is supported
561 |         @return: a list / a single value
562 | '''
563 | if isinstance(offsets, list):
564 |             pipe = self._redis.pipeline(transaction=True)  # a pipeline batches several commands into one round trip; atomic by default
565 | pipe.multi()
566 | for offset in offsets:
567 | pipe.getbit(table, offset)
568 |
569 | return pipe.execute()
570 |
571 | else:
572 | return self._redis.getbit(table, offsets)
573 |
574 | def bitcount(self, table):
575 | return self._redis.bitcount(table)
576 |
577 | def strset(self, table, value, **kwargs):
578 | return self._redis.set(table, value, **kwargs)
579 |
580 | def strget(self, table):
581 | return self._redis.get(table)
582 |
583 | def strlen(self, table):
584 | return self._redis.strlen(table)
585 |
586 | def getkeys(self, regex):
587 | return self._redis.keys(regex)
588 |
589 | def exists_key(self, key):
590 | return self._redis.exists(key)
591 |
592 | def set_expire(self, key, seconds):
593 | '''
594 |         @summary: set the expiration time of a key
595 | ---------
596 | @param key:
597 |         @param seconds: time to live, in seconds
598 | ---------
599 | @result:
600 | '''
601 |
602 | self._redis.expire(key, seconds)
603 |
604 | def clear(self, table):
605 | try:
606 | self._redis.delete(table)
607 | except Exception as e:
608 | log.error(e)
609 |
610 | def get_redis_obj(self):
611 | return self._redis
612 |
613 |
614 | if __name__ == '__main__':
615 | db = RedisDB(db=1)
616 | print(db)
617 |
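618 |     # Usage sketch exercising the helpers above (key names are hypothetical, not the project schema):
619 |     db.get_redis_obj().rpush('demo:tasks', 'task1', 'task2')
620 |     print(db.rpoplpush('demo:tasks'))  # rotation: the tail element 'task2' moves to the head
621 |     db.hset('demo:state', 'task1', 'done')
622 |     print(db.hget('demo:state', 'task1', is_pop=True))  # atomic get + delete via the Lua script
623 |     print(db.setbit('demo:bits', [1, 3], 1))  # pipelined; returns the previous bit values, e.g. [0, 0]
624 |     print(db.bitcount('demo:bits'))  # -> 2
625 |     for key in ('demo:tasks', 'demo:state', 'demo:bits'):
626 |         db.clear(key)
627 |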
--------------------------------------------------------------------------------
/wechat-spider/run.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | '''
3 | Created on 2019/5/18 9:52 PM
4 | ---------
5 | @summary:
6 | ---------
7 | @author:
8 | '''
9 |
10 | from core.capture_packet import WechatCapture
11 | from create_tables import create_table
12 | from mitmproxy import options
13 | from mitmproxy import proxy
14 | from mitmproxy.tools.dump import DumpMaster
15 | from config import config, IP
16 |
17 |
18 | def start():
19 | ip = IP
20 | port = config.get('spider').get('service_port')
21 |
22 |     print("Tip: proxy service at IP {} port {}; make sure the device's proxy is configured accordingly".format(ip, port))
23 |
24 | myaddon = WechatCapture()
25 | opts = options.Options(listen_port=port)
26 |     pconf = proxy.config.ProxyConfig(opts)  # mitmproxy 4-era API; ProxyConfig/ProxyServer were removed in later mitmproxy releases
27 | m = DumpMaster(opts)
28 | m.options.set('flow_detail={mitm_log_level}'.format(mitm_log_level=config.get('mitm').get('log_level')))
29 | m.server = proxy.server.ProxyServer(pconf)
30 | m.addons.add(myaddon)
31 |
32 | try:
33 | m.run()
34 | except KeyboardInterrupt:
35 | m.shutdown()
36 |
37 |
38 | if __name__ == '__main__':
39 |     create_table()
40 |     start()
41 |
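42 | # Usage sketch: with spider.service_port set in config.yaml (e.g. 8080), run
43 | #   python3 run.py
44 | # then point the phone's Wi-Fi proxy at this machine's IP and that port,
45 | # with the mitmproxy certificate installed and trusted on the device.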
--------------------------------------------------------------------------------
/wechat-spider/utils/log.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on 2018-12-08 16:50
4 | ---------
5 | @summary:
6 | ---------
7 | @author:
8 | """
9 |
10 | import logging
11 | import os
12 | import sys
13 | from logging.handlers import BaseRotatingHandler
14 | from config import config
15 |
16 | from better_exceptions import format_exception
17 |
18 | LOG_FORMAT = "%(threadName)s|%(asctime)s|%(filename)s|%(funcName)s|line:%(lineno)d|%(levelname)s| %(message)s"
19 | PRINT_EXCEPTION_DETAILS = True
20 |
21 |
22 | # Rewrite RotatingFileHandler to customize rotated log file names
23 | # stdlib naming: xxx.log xxx.log.1 xxx.log.2 xxx.log.3, newest to oldest
24 | # here: xxx.log xxx1.log xxx2.log ...; with a 2-digit backupCount the suffix is 01 02 ..., with 3 digits 001 002 ..., newest to oldest
25 | class RotatingFileHandler(BaseRotatingHandler):
26 |
27 | def __init__(
28 | self, filename, mode="a", maxBytes=0, backupCount=0, encoding=None, delay=0
29 | ):
30 | # if maxBytes > 0:
31 | # mode = 'a'
32 | BaseRotatingHandler.__init__(self, filename, mode, encoding, delay)
33 | self.maxBytes = maxBytes
34 | self.backupCount = backupCount
35 | self.placeholder = str(len(str(backupCount)))
36 |
37 | def doRollover(self):
38 | if self.stream:
39 | self.stream.close()
40 | self.stream = None
41 | if self.backupCount > 0:
42 | for i in range(self.backupCount - 1, 0, -1):
43 |                 sfn = ("%0" + self.placeholder + "d.") % i  # e.g. '%02d.' % 2 -> '02.'
44 | sfn = sfn.join(self.baseFilename.split("."))
45 | # sfn = "%d_%s" % (i, self.baseFilename)
46 | # dfn = "%d_%s" % (i + 1, self.baseFilename)
47 | dfn = ("%0" + self.placeholder + "d.") % (i + 1)
48 | dfn = dfn.join(self.baseFilename.split("."))
49 | if os.path.exists(sfn):
50 | # print "%s -> %s" % (sfn, dfn)
51 | if os.path.exists(dfn):
52 | os.remove(dfn)
53 | os.rename(sfn, dfn)
54 | dfn = (("%0" + self.placeholder + "d.") % 1).join(
55 | self.baseFilename.split(".")
56 | )
57 | if os.path.exists(dfn):
58 | os.remove(dfn)
59 | # Issue 18940: A file may not have been created if delay is True.
60 | if os.path.exists(self.baseFilename):
61 | os.rename(self.baseFilename, dfn)
62 | if not self.delay:
63 | self.stream = self._open()
64 |
65 | def shouldRollover(self, record):
66 |
67 | if self.stream is None: # delay was set...
68 | self.stream = self._open()
69 | if self.maxBytes > 0: # are we rolling over?
70 | msg = "%s\n" % self.format(record)
71 | self.stream.seek(0, 2) # due to non-posix-compliant Windows feature
72 | if self.stream.tell() + len(msg) >= self.maxBytes:
73 | return 1
74 | return 0
75 |
76 |
77 | def get_logger(
78 | name, path="", log_level="DEBUG", is_write_to_file=False, is_write_to_stdout=True
79 | ):
80 | """
81 |     @summary: get a logger
82 |     ---------
83 |     @param name: logger name
84 |     @param path: log file path, e.g. D://xxx.log
85 |     @param log_level: log level CRITICAL/ERROR/WARNING/INFO/DEBUG
86 |     @param is_write_to_file: whether to write to a file, default False (is_write_to_stdout controls stdout output, default True)
87 |     ---------
88 |     @result: a configured logging.Logger
89 | """
90 |     name = name.split(os.sep)[-1].split(".")[0]  # derive the logger name from a file path
91 |
92 | logger = logging.getLogger(name)
93 | logger.setLevel(log_level)
94 |
95 | formatter = logging.Formatter(LOG_FORMAT)
96 | if PRINT_EXCEPTION_DETAILS:
97 | formatter.formatException = lambda exc_info: format_exception(*exc_info)
98 |
99 |     # RotatingFileHandler: keep at most 20 backup files of 10 MB each (see the parameters below)
100 | if is_write_to_file:
101 | if path and not os.path.exists(os.path.dirname(path)):
102 | os.makedirs(os.path.dirname(path))
103 |
104 | rf_handler = RotatingFileHandler(
105 | path, mode="w", maxBytes=10 * 1024 * 1024, backupCount=20, encoding="utf8"
106 | )
107 | rf_handler.setFormatter(formatter)
108 | logger.addHandler(rf_handler)
109 |
110 | if is_write_to_stdout:
111 | stream_handler = logging.StreamHandler()
112 | stream_handler.stream = sys.stdout
113 | stream_handler.setFormatter(formatter)
114 |         # avoid adding a duplicate stdout handler
115 | handle_exists = 0
116 | for _handler in logger.handlers:
117 | if (
118 | isinstance(_handler, logging.StreamHandler)
119 | and _handler.stream == sys.stdout
120 | ):
121 | handle_exists = 1
122 | if not handle_exists:
123 | logger.addHandler(stream_handler)
124 |
125 | return logger
126 |
127 |
128 | # logging.disable(logging.DEBUG)  # silence all logging
129 |
130 | # loggers whose output we want to suppress
131 | STOP_LOGS = [
132 | # ES
133 | "urllib3.response",
134 | "urllib3.connection",
135 | "elasticsearch.trace",
136 | "requests.packages.urllib3.util",
137 | "requests.packages.urllib3.util.retry",
138 | "urllib3.util",
139 | "requests.packages.urllib3.response",
140 | "requests.packages.urllib3.contrib.pyopenssl",
141 | "requests.packages",
142 | "urllib3.util.retry",
143 | "requests.packages.urllib3.contrib",
144 | "requests.packages.urllib3.connectionpool",
145 | "requests.packages.urllib3.poolmanager",
146 | "urllib3.connectionpool",
147 | "requests.packages.urllib3.connection",
148 | "elasticsearch",
149 | "log_request_fail",
150 | # requests
151 | "requests",
152 | "selenium.webdriver.remote.remote_connection",
153 | "selenium.webdriver.remote",
154 | "selenium.webdriver",
155 | "selenium",
156 | # markdown
157 | "MARKDOWN",
158 | "build_extension",
159 | # newspaper
160 | "calculate_area",
161 | "largest_image_url",
162 | "newspaper.images",
163 | "newspaper",
164 | "Importing",
165 | "PIL",
166 | ]
167 |
168 | # silence noisy third-party loggers
169 | for STOP_LOG in STOP_LOGS:
170 |     log_level = logging.ERROR
171 |     logging.getLogger(STOP_LOG).setLevel(log_level)
172 |
173 | # print(logging.Logger.manager.loggerDict)  # list the names of loggers currently in use
174 |
175 | # log level severity: critical > error > warning > info > debug
176 | log = get_logger(
177 | name="wechat_spider",
178 | path=config.get('log').get('log_path'),
179 | log_level=config.get('log').get('level'),
180 | is_write_to_file=config.get('log').get('to_file'),
181 | )
182 |
183 | if __name__ == "__main__":
184 | try:
185 | a = 1
186 | b = 0
187 | c = a / b
188 | except Exception as e:
189 | log.exception(e)
190 |
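191 |     # File-rotation sketch (the path below is hypothetical, not part of the original module):
192 |     file_log = get_logger(name="rotate_demo", path="log/rotate_demo.log",
193 |                           log_level="INFO", is_write_to_file=True)
194 |     file_log.info("rotates at 10 MB, keeping backups rotate_demo01.log ... rotate_demo20.log")
195 |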
--------------------------------------------------------------------------------
/wechat-spider/utils/selector.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | '''
3 | Created on 2018-10-08 15:33:37
4 | ---------
5 | @summary: a re-defined Selector built on parsel
6 | ---------
7 | @author:
8 | @email:
9 | '''
10 | import re
11 |
12 | import six
13 | from parsel import Selector as ParselSelector
14 | from parsel import SelectorList as ParselSelectorList
15 | from w3lib.html import replace_entities as w3lib_replace_entities
16 |
17 |
18 | def extract_regex(regex, text, replace_entities=True, flags=0):
19 | """Extract a list of unicode strings from the given text/encoding using the following policies:
20 | * if the regex contains a named group called "extract" that will be returned
21 |     * if the regex contains multiple numbered groups, all those will be returned (as tuples; flattening is disabled here)
22 |     * if the regex doesn't contain any group the entire regex match is returned
23 | """
24 | if isinstance(regex, six.string_types):
25 | regex = re.compile(regex, flags=flags)
26 |
27 | if 'extract' in regex.groupindex:
28 | # named group
29 | try:
30 | extracted = regex.search(text).group('extract')
31 | except AttributeError:
32 | strings = []
33 | else:
34 | strings = [extracted] if extracted is not None else []
35 | else:
36 | # full regex or numbered groups
37 | strings = regex.findall(text)
38 |
39 |     # strings = flatten(strings)  # this would flatten nested lists, which we don't want here
40 | if not replace_entities:
41 | return strings
42 |
43 | values = []
44 | for value in strings:
45 |         if isinstance(value, (list, tuple)):  # w3lib_replace_entities cannot take a list/tuple
46 | values.append([w3lib_replace_entities(v, keep=['lt', 'amp']) for v in value])
47 | else:
48 | values.append(w3lib_replace_entities(value, keep=['lt', 'amp']))
49 |
50 | return values
51 |
52 |
53 | class SelectorList(ParselSelectorList):
54 | """
55 | The :class:`SelectorList` class is a subclass of the builtin ``list``
56 | class, which provides a few additional methods.
57 | """
58 |
59 | def re_first(self, regex, default=None, replace_entities=True, flags=re.S):
60 | """
61 | Call the ``.re()`` method for the first element in this list and
62 |         return the result as a unicode string. If the list is empty or the
63 | regex doesn't match anything, return the default value (``None`` if
64 | the argument is not provided).
65 |
66 | By default, character entity references are replaced by their
67 |         corresponding character (except for ``&`` and ``<``).
68 | Passing ``replace_entities`` as ``False`` switches off these
69 | replacements.
70 | """
71 |
72 | datas = self.re(regex, replace_entities=replace_entities, flags=flags)
73 | return datas[0] if datas else default
74 |
75 | def re(self, regex, replace_entities=True, flags=re.S):
76 | """
77 |         Call the ``.re()`` method for each element in this list and return
78 |         the results: one element's matches directly, or a list of per-element matches.
79 |
80 | By default, character entity references are replaced by their
81 |         corresponding character (except for ``&`` and ``<``).
82 | Passing ``replace_entities`` as ``False`` switches off these
83 | replacements.
84 | """
85 | datas = [x.re(regex, replace_entities=replace_entities, flags=flags) for x in self]
86 | return datas[0] if len(datas) == 1 else datas
87 |
88 |
89 | class Selector(ParselSelector):
90 | selectorlist_cls = SelectorList
91 |
92 | def __str__(self):
93 | data = repr(self.get())
94 | return "<%s xpath=%r data=%s>" % (type(self).__name__, self._expr, data)
95 |
96 | __repr__ = __str__
97 |
98 | def re_first(self, regex, default=None, replace_entities=True, flags=re.S):
99 | """
100 | Apply the given regex and return the first unicode string which
101 | matches. If there is no match, return the default value (``None`` if
102 | the argument is not provided).
103 |
104 | By default, character entity references are replaced by their
105 |         corresponding character (except for ``&`` and ``<``).
106 | Passing ``replace_entities`` as ``False`` switches off these
107 | replacements.
108 | """
109 |
110 | datas = self.re(regex, replace_entities=replace_entities, flags=flags)
111 |
112 | return datas[0] if datas else default
113 |
114 | def re(self, regex, replace_entities=True, flags=re.S):
115 | """
116 | Apply the given regex and return a list of unicode strings with the
117 | matches.
118 |
119 | ``regex`` can be either a compiled regular expression or a string which
120 | will be compiled to a regular expression using ``re.compile(regex)``.
121 |
122 | By default, character entity references are replaced by their
123 |         corresponding character (except for ``&`` and ``<``).
124 | Passing ``replace_entities`` as ``False`` switches off these
125 | replacements.
126 | """
127 |
128 | return extract_regex(regex, self.get(), replace_entities=replace_entities, flags=flags)
129 |
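130 |
131 | if __name__ == '__main__':
132 |     # Usage sketch (hypothetical HTML): a named group called "extract" is returned alone
133 |     html = '<p>阅读 <span id="readNum">3204</span></p>'
134 |     print(Selector(text=html).re_first(r'id="readNum">(?P<extract>\d+)<'))  # -> '3204'
135 |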
--------------------------------------------------------------------------------
/wechat-spider/utils/tools.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | '''
3 | Created on 2019/5/19 3:03 PM
4 | ---------
5 | @summary:
6 | ---------
7 | @author:
8 | '''
9 | import datetime
10 | import json
11 | import re
12 | import ssl
13 | import time
14 | import uuid
15 | from pprint import pformat
16 | import hashlib
17 |
18 | import pymysql
19 | from utils.log import log
20 |
21 | # globally disable ssl certificate verification
22 | ssl._create_default_https_context = ssl._create_unverified_context
23 |
24 | _regexs = {}
25 |
26 |
27 | # @log_function_time
28 | def get_info(html, regexs, allow_repeat=True, fetch_one=False, split=None):
29 | regexs = isinstance(regexs, str) and [regexs] or regexs
30 |
31 | infos = []
32 | for regex in regexs:
33 | if regex == '':
34 | continue
35 |
36 | if regex not in _regexs.keys():
37 | _regexs[regex] = re.compile(regex, re.S)
38 |
39 | if fetch_one:
40 | infos = _regexs[regex].search(html)
41 | if infos:
42 | infos = infos.groups()
43 | else:
44 | continue
45 | else:
46 | infos = _regexs[regex].findall(str(html))
47 |
48 | if len(infos) > 0:
49 | # print(regex)
50 | break
51 |
52 | if fetch_one:
53 | infos = infos if infos else ('',)
54 | return infos if len(infos) > 1 else infos[0]
55 | else:
56 | infos = allow_repeat and infos or sorted(set(infos), key=infos.index)
57 | infos = split.join(infos) if split else infos
58 | return infos
59 |
60 |
61 | def get_param(url, key):
62 | params = url.split('?')[-1].split('&')
63 | for param in params:
64 | key_value = param.split('=', 1)
65 | if key == key_value[0]:
66 | return key_value[1]
67 | return None
68 |
69 |
70 | def get_current_timestamp():
71 | return int(time.time())
72 |
73 |
74 | def get_current_date(date_format='%Y-%m-%d %H:%M:%S'):
75 | return datetime.datetime.now().strftime(date_format)
76 | # return time.strftime(date_format, time.localtime(time.time()))
77 |
78 |
79 | def timestamp_to_date(timestamp, time_format='%Y-%m-%d %H:%M:%S'):
80 | '''
81 |     @summary: convert a unix timestamp to a date string
82 |     ---------
83 |     @param timestamp: unix timestamp, in seconds
84 |     @param time_format: date format
85 |     ---------
86 |     @result: the formatted date string
87 | '''
88 |
89 | date = time.localtime(timestamp)
90 | return time.strftime(time_format, date)
91 |
92 |
93 | def get_json(json_str):
94 | '''
95 |     @summary: parse a json object
96 |     ---------
97 |     @param json_str: a json-formatted string
98 |     ---------
99 |     @result: the parsed json object ({} on failure)
100 | '''
101 |
102 | try:
103 | return json.loads(json_str) if json_str else {}
104 | except Exception as e1:
105 | try:
106 | json_str = json_str.strip()
107 | json_str = json_str.replace("'", '"')
108 |             keys = get_info(json_str, r"(\w+):")
109 | for key in keys:
110 | json_str = json_str.replace(key, '"%s"' % key)
111 |
112 | return json.loads(json_str) if json_str else {}
113 |
114 | except Exception as e2:
115 | log.error(
116 | '''
117 | e1: %s
118 | format json_str: %s
119 | e2: %s
120 | ''' % (e1, json_str, e2)
121 | )
122 |
123 | return {}
124 |
125 |
126 | def dumps_json(json_, indent=4):
127 | '''
128 |     @summary: pretty-format json for printing
129 |     ---------
130 |     @param json_: a json-formatted string or a json object
131 |     ---------
132 |     @result: the formatted string
133 | '''
134 | try:
135 | if isinstance(json_, str):
136 | json_ = get_json(json_)
137 |
138 | json_ = json.dumps(json_, ensure_ascii=False, indent=indent, skipkeys=True)
139 |
140 | except Exception as e:
141 | log.error(e)
142 | json_ = pformat(json_)
143 |
144 | return json_
145 |
146 |
147 | ############
148 | def format_sql_value(value):
149 | if isinstance(value, str):
150 | value = pymysql.escape_string(value)
151 |
152 | elif isinstance(value, list) or isinstance(value, dict):
153 | value = dumps_json(value, indent=None)
154 |
155 | elif isinstance(value, bool):
156 | value = int(value)
157 |
158 | return value
159 |
160 |
161 | def list2str(datas):
162 | '''
163 |     convert a list to a sql values string
164 | :param datas: [1, 2]
165 | :return: (1, 2)
166 | '''
167 | data_str = str(tuple(datas))
168 |     data_str = re.sub(r",\)$", ')', data_str)  # drop the trailing comma of a 1-tuple, e.g. (1,) -> (1)
169 | return data_str
170 |
171 |
172 | def make_insert_sql(table, data, auto_update=False, update_columns=(), insert_ignore=False):
173 | '''
174 |     @summary: for mysql; oracle date columns would need to_date handling (TODO)
175 |     ---------
176 |     @param table:
177 |     @param data: row data as a dict
178 |     @param auto_update: use replace into, fully overwriting an existing row
179 |     @param update_columns: columns to update on duplicate key conflict; when set, auto_update is ignored
180 |     @param insert_ignore: skip the insert if the row already exists
181 | ---------
182 | @result:
183 | '''
184 |
185 | keys = ['`{}`'.format(key) for key in data.keys()]
186 | keys = list2str(keys).replace("'", '')
187 |
188 | values = [format_sql_value(value) for value in data.values()]
189 | values = list2str(values)
190 |
191 | if update_columns:
192 | if not isinstance(update_columns, (tuple, list)):
193 | update_columns = [update_columns]
194 | update_columns_ = ', '.join(["{key}=values({key})".format(key=key) for key in update_columns])
195 | sql = 'insert%s into {table} {keys} values {values} on duplicate key update %s' % (' ignore' if insert_ignore else '', update_columns_)
196 |
197 | elif auto_update:
198 | sql = 'replace into {table} {keys} values {values}'
199 | else:
200 | sql = 'insert%s into {table} {keys} values {values}' % (' ignore' if insert_ignore else '')
201 |
202 |     sql = sql.format(table=table, keys=keys, values=values).replace('None', 'null')  # note: this also rewrites the literal substring 'None' inside values
203 | return sql
204 |
205 |
206 | def make_update_sql(table, data, condition):
207 | '''
208 |     @summary: for mysql; oracle date columns would need to_date handling (TODO)
209 |     ---------
210 |     @param table:
211 |     @param data: row data as a dict
212 |     @param condition: where clause condition
213 | ---------
214 | @result:
215 | '''
216 | key_values = []
217 |
218 | for key, value in data.items():
219 | value = format_sql_value(value)
220 | if isinstance(value, str):
221 | key_values.append("`{}`='{}'".format(key, value))
222 | elif value is None:
223 | key_values.append("`{}`={}".format(key, 'null'))
224 | else:
225 | key_values.append("`{}`={}".format(key, value))
226 |
227 | key_values = ', '.join(key_values)
228 |
229 | sql = 'update {table} set {key_values} where {condition}'
230 | sql = sql.format(table=table, key_values=key_values, condition=condition)
231 | return sql
232 |
233 |
234 | def make_batch_sql(table, datas, auto_update=False, update_columns=()):
235 | '''
236 |     @summary: build a batch sql statement plus its value rows
237 |     ---------
238 |     @param table:
239 |     @param datas: row data, [{...}, ...]
240 |     @param auto_update: use replace into, fully overwriting existing rows
241 |     @param update_columns: columns to update on duplicate key conflict; when set, auto_update is ignored
242 | ---------
243 | @result:
244 | '''
245 | if not datas:
246 | return
247 |
248 | keys = list(datas[0].keys())
249 | values_placeholder = ['%s'] * len(keys)
250 |
251 | values = []
252 | for data in datas:
253 | value = []
254 | for key in keys:
255 | current_data = data.get(key)
256 | current_data = format_sql_value(current_data)
257 |
258 | value.append(current_data)
259 |
260 | values.append(value)
261 |
262 | keys = ['`{}`'.format(key) for key in keys]
263 | keys = str(keys).replace('[', '(').replace(']', ')').replace("'", '')
264 | values_placeholder = str(values_placeholder).replace('[', '(').replace(']', ')').replace("'", '')
265 |
266 | if update_columns:
267 | if not isinstance(update_columns, (tuple, list)):
268 | update_columns = [update_columns]
269 | update_columns_ = ', '.join(["`{key}`=values(`{key}`)".format(key=key) for key in update_columns])
270 | sql = 'insert into {table} {keys} values {values_placeholder} on duplicate key update {update_columns}'.format(table=table, keys=keys, values_placeholder=values_placeholder, update_columns=update_columns_)
271 | elif auto_update:
272 | sql = 'replace into {table} {keys} values {values_placeholder}'.format(table=table, keys=keys, values_placeholder=values_placeholder)
273 | else:
274 | sql = 'insert ignore into {table} {keys} values {values_placeholder}'.format(table=table, keys=keys, values_placeholder=values_placeholder)
275 |
276 | return sql, values
277 |
278 |
279 | ##########
280 |
281 | def get_mac_address():
282 | mac = uuid.UUID(int=uuid.getnode()).hex[-12:]
283 | return ":".join([mac[e:e + 2] for e in range(0, 11, 2)])
284 |
285 |
286 | def get_md5(*args):
287 | '''
288 |     @summary: get a unique 32-char md5
289 |     ---------
290 |     @param *args: the values combined into the dedup key
291 | ---------
292 | @result: 7c8684bcbdfcea6697650aa53d7b1405
293 | '''
294 |
295 | m = hashlib.md5()
296 | for arg in args:
297 | m.update(str(arg).encode())
298 |
299 | return m.hexdigest()
300 |
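301 |
302 | if __name__ == '__main__':
303 |     # Usage sketch (table and column names here are hypothetical, not the project schema):
304 |     print(make_insert_sql('wechat_article', {'title': 'demo', 'read_num': 100}, insert_ignore=True))
305 |     # -> insert ignore into wechat_article (`title`, `read_num`) values ('demo', 100)
306 |     print(make_update_sql('wechat_article', {'read_num': 200}, 'id=1'))
307 |     # -> update wechat_article set `read_num`=200 where id=1
308 |     sql, values = make_batch_sql('wechat_article', [{'title': 'a'}, {'title': 'b'}])
309 |     print(sql, values)  # the %s-placeholder sql is meant for cursor.executemany(sql, values)
310 |     print(get_param('https://mp.weixin.qq.com/s?__biz=abc&mid=1', '__biz'))  # -> abc
311 |     print(get_md5('https://mp.weixin.qq.com/s?__biz=abc&mid=1', '2019-05-19'))  # stable 32-char dedup key
312 |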
--------------------------------------------------------------------------------