├── .gitignore
├── Dockerfile
├── README.md
├── docker-compose.yml
├── media
│   ├── 15584541954959.jpg
│   ├── 15584542414888.jpg
│   ├── 15584544774749.jpg
│   ├── 15584545518249.jpg
│   ├── 15584546784023.jpg
│   ├── 15584547028361.jpg
│   ├── 15584578582622.jpg
│   ├── 15584579051963.jpg
│   ├── 15584581938431.jpg
│   ├── 15584582326072.jpg
│   ├── 15584585019970.jpg
│   ├── 15610827417503.jpg
│   ├── 15610832298058.jpg
│   ├── 15632498867519.jpg
│   ├── 95AE10B3227FDE0637AB227A5A8267E3.png
│   ├── A580D0082CCEE0621F98FAF003C5530E.png
│   └── 赞赏码.png
├── requirements.txt
└── wechat-spider
    ├── config.py
    ├── config.yaml
    ├── core
    │   ├── capture_packet.py
    │   ├── data_pipeline.py
    │   ├── deal_data.py
    │   └── task_manager.py
    ├── create_tables.py
    ├── db
    │   ├── mysqldb.py
    │   └── redisdb.py
    ├── run.py
    └── utils
        ├── log.py
        ├── selector.py
        └── tools.py
/.gitignore:
--------------------------------------------------------------------------------
1 | */__pycache__
2 | *.pyc
3 | .svn/
4 | log/*
5 | config/
6 | .vs/
7 | .vscode/
8 | *.log
9 | venv
10 | .venv/
11 | test.py
12 | .DS_Store
13 | .idea
14 | .git
15 |
16 | config.yaml
17 | mariadb_data/
18 | mitmproxy/
--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM python:3.7
2 |
3 | RUN cat /etc/apt/sources.list | awk -F[/:] '{print $4}' | sort | uniq | grep -v "^$" | xargs -I{} sed -i 's|{}|mirrors.aliyun.com|g' /etc/apt/sources.list && \
4 | apt update && \
5 | apt install -y psmisc netcat wait-for-it && \
6 | apt-get clean && \
7 | rm -rf /var/lib/apt/lists/*
8 |
9 | WORKDIR /app
10 |
11 | COPY requirements.txt requirements.txt
12 |
13 | RUN pip3 install -i https://mirrors.aliyun.com/pypi/simple -r requirements.txt
14 |
15 | COPY . .
16 |
17 | WORKDIR /app/wechat-spider
18 |
19 | EXPOSE 8080
20 | EXPOSE 8081
21 |
22 | # ENTRYPOINT [ "python3", "./run.py" ]
23 |
24 | CMD [ "python3", "./run.py" ]
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # WeChat Spider
2 |
3 |
8 |
  9 | The following is the deployment guide.
 10 | 
 11 | For the technical documentation, see: [https://t.zsxq.com/7ubmqNJ](https://t.zsxq.com/7ubmqNJ)
 12 | 
 13 | For the reverse-engineering-based crawling approach, see: [https://wx.zsxq.com/dweb2/index/topic_detail/215584212588541](https://wx.zsxq.com/dweb2/index/topic_detail/215584212588541)
14 |
 15 | ## Features
 16 | 
 17 | - [x] Detect new articles published by each official account daily
 18 | - [x] Crawl official account profiles
 19 | - [x] Crawl article lists
 20 | - [x] Crawl article content
 21 | - [x] Crawl read, like, and comment counts
 22 | - [x] Crawl comments
 23 | - [x] Convert temporary links to permanent links
 24 | 
 25 | Download link for the packaged executable:
 26 | 
 27 | Link: https://pan.baidu.com/s/1hyhj6YnV-L9w8LPx42FFzQ  Password: qnk6
 28 | 
 29 | ## Highlights
 30 | 
 31 | 1. **No installation**: supports macOS and Windows; just double-click the executable
 32 | 2. **Automated**: configure the list of official accounts to monitor, start the program, and it crawls account and article data automatically every day
 33 | 3. **Easy integration**: crawled data is stored in MySQL, making downstream processing simple
 34 | 4. **No missed data**: task states are tracked so that no official account or article is skipped
 35 | 5. **Distributed**: multiple WeChat accounts can crawl in parallel; Android, iPhone, macOS, and Windows WeChat clients are all supported
36 |
 37 | ## Sample Data
 38 | 
 39 | **1. Official account data**
 40 | 
 41 | 
 42 | **2. Article list data**
 43 | 
 44 | 
 45 | **3. Article data**
 46 | 
 47 | 
 48 | **4. Read/like/comment count data**
 49 | 
 50 | 
 51 | **5. Comment data**
 52 | 
 53 | 
 54 | ## Requirements
 55 | 
 56 | 1. MySQL: stores the crawled data and the task tables
 57 | 2. Redis: caches tasks to reduce the number of MySQL operations
 58 | 
 59 | ## Installation
 60 | 
 61 | > Consult the following installation notes as needed; they are for reference only. Environments differ, so the steps may vary slightly; online resources can also help
 62 | 
 63 | ### 1. Install MySQL
 64 | #### 1.1 Windows
 65 | #### 1.2 macOS
 66 | ### 2. Install Redis
 67 | #### 2.1 Windows
 68 | #### 2.2 macOS
 69 | ### 3. Install the certificate
 70 | 
 71 | Visit mitm.it in a browser (it works while the proxy is active) to download it, or search online for how to install the mitmproxy certificate
 72 | 
 73 | #### 3.1 iPhone
 74 | 1. After downloading and installing, don't forget the final step:
 75 | 2. Open Settings - General - About - Certificate Trust Settings
 76 | 3. Enable the mitmproxy option.
 77 | 
 78 | #### 3.2 Android
 79 | 1. Verify after installation:
 80 | 2. Open Settings - Security - Trusted credentials
 81 | 3. Check that the installed certificate is listed
 82 | 
 83 | #### 3.3 Windows
 84 | 1. Double-click the downloaded certificate to run it
 85 | 2. Install it to the Local Machine
 86 | 3. Skip the step that asks for a key
 87 | 4. Choose "Place all certificates in the following store", then pick "Trusted Root Certification Authorities"
 88 | 5. When the warning dialog pops up, click "Yes"
 89 | 
 90 | #### 3.4 macOS
 91 | 1. Double-click the downloaded certificate to install it
 92 | 2. Open Keychain Access.app
 93 | 3. Select login (Keychains) and Certificates (Category), then find mitmproxy
 94 | 4. Click mitmproxy and set Trust to Always Trust
95 |
96 |
 97 | ### 4. Configure the proxy
 98 | 
 99 | > If you use a phone, make sure the phone and the computer running wechat-spider are connected to the same router
100 | 
101 | #### 4.1 iPhone
102 | 
103 | Open Settings - Wi-Fi - the connected Wi-Fi - Configure Proxy - Manual
104 | Enter the IP of the machine running the spider and port 8080
105 | 
106 | #### 4.2 Android
107 | 
108 | Open Settings - WLAN - long-press the connected network - Modify network - Advanced options - Manual
109 | Enter the IP of the machine running the spider and port 8080
110 | 
111 | #### 4.3 Windows
112 | Open Chrome Settings -> Advanced
113 | 
114 | 
115 | 
116 | #### 4.4 macOS
117 | 
118 | Open System Preferences.app - Network - Advanced - Proxies - Secure Web Proxy (HTTPS)
119 | Enter the IP of the machine running the spider and port 8080
120 |
121 | 
122 | 
123 |
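Before touching the phone, you can sanity-check that mitmproxy is reachable from another machine on the same network. A minimal sketch (assumes the `requests` package, which is not in requirements.txt, and that 192.168.1.2 is the IP printed by run.py on startup):

```python
import requests  # assumption: not in requirements.txt; pip3 install requests

# Replace 192.168.1.2 with the IP that run.py prints on startup.
proxies = {
    'http': 'http://192.168.1.2:8080',
    'https': 'http://192.168.1.2:8080',
}

# When the proxy is working, mitm.it serves mitmproxy's certificate page.
resp = requests.get('http://mitm.it', proxies=proxies, timeout=5)
print(resp.status_code)  # 200 means the proxy answered
```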
124 |
125 |
126 | ## Usage
127 | 
128 | ### 1. Install the certificate and configure the proxy as described above
129 | ### 2. Configure config.yaml correctly
130 | 
131 | This mainly means filling in the MySQL and Redis connection settings and making sure both can actually be reached
132 | 
133 | ### 3. Create the wechat database
134 |
135 | 
136 |
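If you prefer a script over a GUI client, a minimal sketch using PyMySQL (already in requirements.txt); the connection parameters are assumptions and must match your config.yaml:

```python
import pymysql

# Connection parameters are assumptions; use the values from your config.yaml.
conn = pymysql.connect(host='localhost', port=3306, user='root', password='root')
try:
    with conn.cursor() as cursor:
        # Same charset/collation that the tables in create_tables.py use.
        cursor.execute(
            "CREATE DATABASE IF NOT EXISTS wechat "
            "DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci"
        )
    conn.commit()
finally:
    conn.close()
```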
137 |
138 | ### 4. Start wechat-spider
139 | 
140 | If auto_create_tables is true in the config, this step creates the MySQL tables automatically. Setting it to true for the first start, then back to false once the tables exist, is recommended. Start the spider with `python3 run.py` from the wechat-spider directory, or with `docker-compose up -d --build` using the bundled compose file
141 | 
142 | ### 5. Add official-account tasks
143 | 
144 | 
145 | Insert rows into wechat_account_task, e.g.:
146 | 
147 | Filling in only __biz is enough, e.g. MzIxNzg1ODQ0MQ== (see the sketch below)
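
A minimal seeding sketch with PyMySQL; the wechat_account_task table and its __biz column come from create_tables.py, while the connection parameters are assumptions that must match config.yaml:

```python
import pymysql

# Connection parameters are assumptions; match them to config.yaml.
conn = pymysql.connect(host='localhost', port=3306, user='root',
                       password='root', db='wechat')
try:
    with conn.cursor() as cursor:
        # Only __biz is required; the spider fills in the remaining columns.
        cursor.execute(
            "INSERT INTO wechat_account_task (__biz) VALUES (%s)",
            ('MzIxNzg1ODQ0MQ==',)
        )
    conn.commit()
finally:
    conn.close()
```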
148 |
149 | ### 6. Open any official account and view its message history
150 | 
151 | 
152 | 
153 | When the notice shown in the red box above appears, everything is working; after a short while you can check the database for data
154 | 
155 | Technical discussion
156 | ----
157 | If you have questions or suggestions, join the QQ group to discuss them. Please include the note `微信爬虫学习交流`
158 |
159 |
160 |
161 |
162 | ## FAQ
163 | 
164 | ### 1. MySQL connection problems
165 | 
166 | Problem: connecting raises an "object supporting the buffer API required" exception
167 | 
168 | Fix: if the password is purely numeric, e.g. 123456, YAML parses it as an integer while PyMySQL expects a string, so wrap it in double quotes in the config file:
169 | 
170 |     mysqldb:
171 |       ip: localhost
172 |       port: 3306
173 |       db: wechat
174 |       user: root
175 |       passwd: "123456"
176 |       auto_create_tables: true # whether to create tables automatically; set true while the tables don't exist, false once they do, to speed up startup
177 |
178 | ### 2. Certificate or security warnings even though the proxy is configured correctly
179 | 
180 | The certificate I bundled has expired; see https://www.cnblogs.com/yunlongaimeng/p/9617708.html for how to install the certificate yourself
181 | 
182 | ### 3. "No task" messages
183 | 
184 | Check whether the wechat_account_task table has __biz values in it. Adding a few more for testing can help
185 |
186 | ### 4. Exception: DISCARD without MULTI
187 | 
188 | 
189 | 
190 | ### 5. Starts normally but captures no packets
191 | 
192 | 1. Check that the proxy is configured
193 | 2. Check whether the port is already in use (a quick check follows below)
194 |
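For the port check, a quick sketch (8080 is the default service_port in config.yaml):

```python
import socket

# connect_ex returns 0 when something is already listening on the port.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
in_use = s.connect_ex(('127.0.0.1', 8080)) == 0
s.close()
print('port 8080 already in use:', in_use)
```
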
195 | ## Donations
196 | 
197 | Keeping an open-source project going is not easy: maintaining the code and answering questions takes most of the time. To keep the content coming, and **if this project happens to help you**, your support is much appreciated (* ̄︶ ̄).
198 | 
199 | Deployment support and Q&A are available (for donors and developers who submit PRs only).
200 | 
201 | WeChat: boris_tm
202 |
203 | 
204 |
205 |
206 |
--------------------------------------------------------------------------------
/docker-compose.yml:
--------------------------------------------------------------------------------
1 | version: '3'
2 | services:
3 | wechat-spider:
4 | image: wechat-spider:latest
5 | build: .
6 | restart: always
7 | privileged: true
8 | command: ["wait-for-it", "mariadb:3306", "--", "python3", "./run.py"]
9 | ports:
10 | - '8080:8080'
11 | - '8081:8081'
12 | volumes:
13 | - "./mitmproxy:/root/.mitmproxy"
14 | - "./config:/app/wechat-spider/config"
15 | depends_on:
16 | - mariadb
17 | - redis
18 | links:
19 | - redis
20 | - mariadb
21 | redis:
22 | image: redis:3.2
23 | restart: always
24 | mariadb:
25 | image: bitnami/mariadb:10.5-debian-10
26 | ports:
27 | - '3306:3306'
28 | volumes:
29 | - './mariadb_data:/bitnami/mariadb'
30 | environment:
31 | - MARIADB_ROOT_USER=root
32 | - MARIADB_ROOT_PASSWORD=root
33 | - MARIADB_CHARACTER_SET=utf8mb4
34 | - MARIADB_COLLATE=utf8mb4_unicode_ci
35 | healthcheck:
36 | test: ['CMD', '/opt/bitnami/scripts/mariadb/healthcheck.sh']
37 | interval: 15s
38 | timeout: 5s
39 | retries: 6
--------------------------------------------------------------------------------
/media/15584541954959.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15584541954959.jpg
--------------------------------------------------------------------------------
/media/15584542414888.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15584542414888.jpg
--------------------------------------------------------------------------------
/media/15584544774749.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15584544774749.jpg
--------------------------------------------------------------------------------
/media/15584545518249.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15584545518249.jpg
--------------------------------------------------------------------------------
/media/15584546784023.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15584546784023.jpg
--------------------------------------------------------------------------------
/media/15584547028361.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15584547028361.jpg
--------------------------------------------------------------------------------
/media/15584578582622.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15584578582622.jpg
--------------------------------------------------------------------------------
/media/15584579051963.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15584579051963.jpg
--------------------------------------------------------------------------------
/media/15584581938431.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15584581938431.jpg
--------------------------------------------------------------------------------
/media/15584582326072.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15584582326072.jpg
--------------------------------------------------------------------------------
/media/15584585019970.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15584585019970.jpg
--------------------------------------------------------------------------------
/media/15610827417503.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15610827417503.jpg
--------------------------------------------------------------------------------
/media/15610832298058.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15610832298058.jpg
--------------------------------------------------------------------------------
/media/15632498867519.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15632498867519.jpg
--------------------------------------------------------------------------------
/media/95AE10B3227FDE0637AB227A5A8267E3.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/95AE10B3227FDE0637AB227A5A8267E3.png
--------------------------------------------------------------------------------
/media/A580D0082CCEE0621F98FAF003C5530E.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/A580D0082CCEE0621F98FAF003C5530E.png
--------------------------------------------------------------------------------
/media/赞赏码.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/赞赏码.png
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | w3lib==1.22.0
2 | mitmproxy==7.0.3
3 | better_exceptions==0.2.2
4 | PyMySQL==0.10.0
5 | parsel==1.6.0
6 | six==1.15.0
7 | redis==2.10.6
8 | DBUtils==1.3
9 | PyYAML==5.4
--------------------------------------------------------------------------------
/wechat-spider/config.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | '''
3 | Created on 2019/5/18 11:54 AM
4 | ---------
5 | @summary:
6 | ---------
7 | @author:
8 | '''
9 | import os
10 | import socket
11 | import sys
12 | import shutil
13 |
14 | import yaml # pip3 install pyyaml
15 |
16 | if 'python' in sys.executable:
17 | abs_path = lambda file: os.path.abspath(os.path.join(os.path.dirname(__file__), file))
18 | else:
19 |     abs_path = lambda file: os.path.abspath(os.path.join(os.path.dirname(sys.executable), file))  # after packaging on macOS, __file__ points to the user's home directory rather than the executable's path
20 |
21 | if not os.path.exists(abs_path('./config/config.yaml')):
22 |     os.makedirs(abs_path('./config'), exist_ok=True)  # shutil.copyfile cannot create the directory itself
23 |     shutil.copyfile(abs_path('./config.yaml'), abs_path('./config/config.yaml'))
24 | config = yaml.full_load(open(abs_path('./config/config.yaml'), encoding='utf8'))
25 |
26 |
27 | def get_host_ip():
28 | """
29 |     Gets the local IP via UDP: "connect" a UDP socket to an external address, then read the local IP from the socket's own address.
30 |     No packet is actually sent, so nothing shows up in a packet-capture tool
31 | :return:
32 | """
33 | s = None
34 | try:
35 | s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
36 | s.connect(('8.8.8.8', 80))
37 | ip = s.getsockname()[0]
38 | finally:
39 | if s:
40 | s.close()
41 |
42 | return ip
43 |
44 |
45 | IP = get_host_ip()
46 |
--------------------------------------------------------------------------------
/wechat-spider/config.yaml:
--------------------------------------------------------------------------------
1 | mysqldb:
2 | ip: "mariadb"
3 | port: 3306
4 | db: test
5 | user: root
6 | passwd: "root"
 7 |   auto_create_tables: True # whether to create tables automatically; set true while the tables don't exist, false once they do, to speed up startup
8 |
9 | redisdb:
10 | ip: redis
11 | port: 6379
12 | db: 0
13 | passwd:
14 |
15 | spider:
16 |   monitor_interval: 3600 # interval, in seconds, between scans of each official account for newly published articles
17 |   ignore_haved_crawl_today_article_account: true # skip accounts whose article published today has already been crawled, i.e. stop monitoring them for the rest of the day
18 |   redis_task_cache_root_key: wechat # root key under which tasks are cached in redis, e.g. wechat:
19 |   zombie_account_not_publish_article_days: 90 # an account that has published nothing for 90 consecutive days is marked as a zombie account and no longer monitored
20 |   spider_interval:
21 |     min_sleep_time: 5
22 |     max_sleep_time: 10
23 |   no_task_sleep_time: 3600 # how long to sleep when there is no task
24 |   service_port: 8080 # port the service listens on
25 |   # crawl_time_range: 2019-07-10 00:00:00~2019-07-01 00:00:00 # time range to crawl; write ~2019-07-01 00:00:00 to leave the recent end open, or leave it unset to crawl the full history
26 | crawl_time_range:
27 |
28 | log:
29 | level: INFO
30 | to_file: false
31 | log_path: ./logs/wechat_spider.log
32 |
33 | mitm:
34 |   log_level: 0 # verbosity of the mitm framework's logging, between 0 and 3; larger values print more detail, default 1. See: https://docs.mitmproxy.org/stable/concepts-options/
35 |
--------------------------------------------------------------------------------
/wechat-spider/core/capture_packet.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | '''
3 | Created on 2019/5/8 10:59 PM
4 | ---------
5 | @summary:
6 | ---------
7 | @author:
8 | '''
9 | import re
10 |
11 | import mitmproxy
12 | from mitmproxy import ctx
13 |
14 | from core.deal_data import deal_data
15 |
16 |
17 | class WechatCapture():
18 |
19 | def response(self, flow: mitmproxy.http.HTTPFlow):
20 | url = flow.request.url
21 |
22 | next_page = None
23 | try:
24 |             if 'mp/profile_ext?action=home' in url or 'mp/profile_ext?action=getmsg' in url:  # article list, in both HTML and JSON formats
25 | 
26 |                 ctx.log.info('extracting article list data')
27 |                 next_page = deal_data.deal_article_list(url, flow.response.text)
28 | 
29 |                 flow.response.text = re.sub('', '', flow.response.text)
30 | 
31 |             elif '/s?__biz=' in url or '/mp/appmsg/show?__biz=' in url or '/mp/rumor' in url:  # article content; mp/appmsg/show?__biz is the old 2014-era link format; mp/rumor marks articles flagged as misinformation
32 | 
33 |                 ctx.log.info('extracting article content')
34 |                 next_page = deal_data.deal_article(url, flow.response.text)
35 | 
36 |                 # strip the security headers from the article response, otherwise the injected <script>setTimeout(function() {window.location.href = 'url';}, sleep_time);</script> does not execute
37 |                 flow.response.headers.pop('Content-Security-Policy', None)
38 |                 flow.response.headers.pop('content-security-policy-report-only', None)
39 |                 flow.response.headers.pop('Strict-Transport-Security', None)
40 | 
41 |                 # disable caching
42 |                 flow.response.headers['Cache-Control'] = 'no-cache, no-store, must-revalidate'
43 | 
44 |                 # strip images
45 |                 flow.response.text = re.sub('', '', flow.response.text)
46 | 
47 |             elif 'mp/getappmsgext' in url:  # read counts / view counts
48 | 
49 |                 ctx.log.info('extracting read/view counts')
50 |                 deal_data.deal_article_dynamic_info(flow.request.data.content.decode('utf-8'), flow.response.text)
51 | 
52 |             elif '/mp/appmsg_comment' in url:  # comment list
53 | 
54 |                 ctx.log.info('extracting the comment list')
55 | deal_data.deal_comment(url, flow.response.text)
56 |
57 | except Exception as e:
58 | # log.exception(e)
59 | next_page = "Exception: {}".format(e)
60 |
61 | if next_page:
62 |             # change the response content-type from json to text
63 | flow.response.headers['content-type'] = 'text/html; charset=UTF-8'
64 | if 'window.location.reload()' in next_page:
65 | flow.response.set_text(next_page)
66 | else:
67 | flow.response.set_text(next_page + flow.response.text)
68 |
69 |
70 | addons = [
71 | WechatCapture(),
72 | ]
73 |
74 | # run with: mitmdump -s capture_packet.py -p 8080
75 |
--------------------------------------------------------------------------------
/wechat-spider/core/data_pipeline.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | '''
3 | Created on 2019/5/13 12:44 AM
4 | ---------
5 | @summary:
6 | ---------
7 | @author:
8 | '''
9 | from db.mysqldb import MysqlDB
10 | import utils.tools as tools
11 | from utils.log import log
12 | from config import config
13 |
14 | db = MysqlDB(**config.get('mysqldb'))
15 |
16 |
17 | def save_account(data):
18 | log.debug(tools.dumps_json(data))
19 |
20 | sql = tools.make_insert_sql('wechat_account', data, insert_ignore=True)
21 | db.add(sql)
22 |
23 |
24 | def save_article_list(datas: list):
25 | log.debug(tools.dumps_json(datas))
26 |
27 | sql, articles = tools.make_batch_sql('wechat_article_list', datas)
28 | db.add_batch(sql, articles)
29 |
30 |     # save per-article tasks
31 | article_task = [
32 | {
33 | "sn": article.get('sn'),
34 | "article_url": article.get('url'),
35 | "__biz": article.get('__biz')
36 | }
37 | for article in datas
38 | ]
39 |
40 | sql, article_task = tools.make_batch_sql('wechat_article_task', article_task)
41 | db.add_batch(sql, article_task)
42 |
43 |
44 | def save_article(data):
45 | log.debug(tools.dumps_json(data))
46 |
47 | sql = tools.make_insert_sql('wechat_article', data, insert_ignore=True)
48 | return db.add(sql)
49 |
50 |
51 | def save_article_dynamic(data):
52 | log.debug(tools.dumps_json(data))
53 |
54 | sql = tools.make_insert_sql('wechat_article_dynamic', data, insert_ignore=True)
55 | db.add(sql)
56 |
57 |
58 | def save_article_commnet(datas: list):
59 | log.debug(tools.dumps_json(datas))
60 |
61 | sql, datas = tools.make_batch_sql('wechat_article_comment', datas)
62 | db.add_batch(sql, datas)
63 |
--------------------------------------------------------------------------------
/wechat-spider/core/deal_data.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on 2019/5/11 6:37 PM
4 | ---------
 5 | @summary: process the captured data
6 | ---------
7 | @author:
8 | """
9 | from utils.selector import Selector
10 | import utils.tools as tools
11 | from utils.log import log
12 | from core import data_pipeline
13 | from core.task_manager import TaskManager
14 |
15 |
16 | class DealData:
17 | def __init__(self):
18 | self._task_manager = TaskManager()
19 | self._task_manager.reset_task()
20 |
21 | def __parse_account_info(self, data, req_url):
22 | """
23 | @summary:
24 | ---------
25 | @param data:
26 | ---------
27 | @result:
28 | """
29 | __biz = tools.get_param(req_url, "__biz")
30 |
31 | regex = 'id="nickname">(.*?)'
32 | account = tools.get_info(data, regex, fetch_one=True).strip()
33 |
34 |         regex = 'profile_avatar">.*?
--------------------------------------------------------------------------------
/wechat-spider/core/task_manager.py:
--------------------------------------------------------------------------------
                    {monitor_interval}
 76 |                     {publish_time_condition}
 77 |                 )
 78 |                 OR (last_spider_time IS NULL)
 79 |             )
 80 |         '''.format(monitor_interval=self._monitor_interval, publish_time_condition=publish_time_condition)
81 |
82 | tasks = self._mysqldb.find(sql, to_json=True)
83 | if tasks:
84 | self._redis.zadd(self._account_task_key, tasks)
85 | task = self.__get_task_from_redis(self._account_task_key)
86 |
87 | return task
88 |
89 | def get_article_task(self):
90 | """
91 | 获取文章任务
92 | :return:
93 | {'article_url': 'http://mp.weixin.qq.com/s?__biz=MzIxNzg1ODQ0MQ==&mid=2247485501&idx=1&sn=92721338ddbf7d907eaf03a70a0715bd&chksm=97f220dba085a9cd2b9a922fb174c767603203d6dbd2a7d3a6dc41b3400a0c477a8d62b96396&scene=27#wechat_redirect'}
94 | 或
95 | None
96 | """
97 | task = self.__get_task_from_redis(self._article_task_key)
98 | if not task:
99 | sql = 'select id, article_url from wechat_article_task where state = 0 limit 5000'
100 | tasks = self._mysqldb.find(sql)
101 | if tasks:
102 |                 # update the task states
103 | task_ids = str(tuple([task[0] for task in tasks])).replace(',)', ')')
104 | sql = 'update wechat_article_task set state = 2 where id in %s' % (task_ids)
105 | self._mysqldb.update(sql)
106 |
107 | else:
108 | sql = 'select id, article_url from wechat_article_task where state = 2 limit 5000'
109 | tasks = self._mysqldb.find(sql)
110 |
111 | if tasks:
112 | task_json = [
113 | {
114 | 'article_url': article_url
115 | }
116 | for id, article_url in tasks
117 | ]
118 | self._redis.zadd(self._article_task_key, task_json)
119 | task = self.__get_task_from_redis(self._article_task_key)
120 |
121 | return task
122 |
123 | def update_article_task_state(self, sn, state=1):
124 | sql = 'update wechat_article_task set state = %s where sn = "%s"' % (state, sn)
125 | self._mysqldb.update(sql)
126 |
127 | def record_last_article_publish_time(self, __biz, last_publish_time):
128 | self._redis.hset(self._last_article_publish_time, __biz, last_publish_time or '')
129 |
130 | def is_reach_last_article_publish_time(self, __biz, publish_time):
131 | last_publish_time = self._redis.hget(self._last_article_publish_time, __biz)
132 | if not last_publish_time:
133 |             # check whether mysql has this task
134 | sql = "select last_publish_time from wechat_account_task where __biz = '%s'" % __biz
135 | data = self._mysqldb.find(sql)
136 | if data: # [(None,)] / []
137 | last_publish_time = str(data[0][0] or '')
138 | self.record_last_article_publish_time(__biz, last_publish_time)
139 |
140 | if last_publish_time is None:
141 | return
142 |
143 | if publish_time < last_publish_time:
144 | return True
145 |
146 | return False
147 |
148 | def is_in_crawl_time_range(self, publish_time):
149 | """
150 |         whether publish_time falls within the configured crawl time range
151 |         :param publish_time:
152 |         :return: the position relative to the time range
153 | """
154 | if not publish_time or (not self._crawl_time_range[0] and not self._crawl_time_range[1]):
155 | return TaskManager.IS_IN_TIME_RANGE
156 |
157 |         if self._crawl_time_range[0]:  # upper bound of the time range
158 | if publish_time > self._crawl_time_range[0]:
159 | return TaskManager.NOT_REACH_TIME_RANGE
160 |
161 | if publish_time <= self._crawl_time_range[0] and publish_time >= self._crawl_time_range[1]:
162 | return TaskManager.IS_IN_TIME_RANGE
163 |
164 |         if publish_time < self._crawl_time_range[1]:  # lower bound
165 | return TaskManager.OVER_MIN_TIME_RANGE
166 |
167 | return TaskManager.IS_IN_TIME_RANGE
168 |
169 | def record_new_last_article_publish_time(self, __biz, new_last_publish_time):
170 | self._redis.hset(self._new_last_article_publish_time, __biz, new_last_publish_time)
171 |
172 | def get_new_last_article_publish_time(self, __biz):
173 | return self._redis.hget(self._new_last_article_publish_time, __biz)
174 |
175 | def update_account_last_publish_time(self, __biz, last_publish_time):
176 | sql = 'update wechat_account_task set last_publish_time = "{}", last_spider_time="{}" where __biz="{}"'.format(
177 | last_publish_time, tools.get_current_date(), __biz
178 | )
179 | self._mysqldb.update(sql)
180 |
181 | def is_zombie_account(self, last_publish_timestamp):
182 | if tools.get_current_timestamp() - last_publish_timestamp > self._zombie_account_not_publish_article_days * 86400:
183 | return True
184 | return False
185 |
186 | def sign_account_is_zombie(self, __biz, last_publish_time=None):
187 | if last_publish_time:
188 | sql = 'update wechat_account_task set last_publish_time = "{}", last_spider_time="{}", is_zombie=1 where __biz="{}"'.format(
189 | last_publish_time, tools.get_current_date(), __biz
190 | )
191 | else:
192 | sql = 'update wechat_account_task set last_spider_time="{}", is_zombie=1 where __biz="{}"'.format(
193 | tools.get_current_date(), __biz
194 | )
195 |
196 | self._mysqldb.update(sql)
197 |
198 | def get_task(self, url=None, tip=''):
199 | """
200 |         Get a task
201 |         :param url: when a url is given, return the task wrapping that url; otherwise take an account task first, falling back to an article task. When there is no task at all, sleep for a while before fetching again
202 | :return:
203 | """
204 |
205 | sleep_time = random.randint(self._spider_interval_min, self._spider_interval_max)
206 |
207 | if not url:
208 | account_task = self.get_account_task()
209 | if account_task:
210 | __biz = account_task.get('__biz')
211 | last_publish_time = account_task.get('last_publish_time')
212 | self.record_last_article_publish_time(__biz, last_publish_time)
213 |                 tip = 'crawling the article list'
214 |                 url = 'https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz={}&scene=124#wechat_redirect'.format(__biz)
215 |             else:
216 |                 article_task = self.get_article_task()
217 |                 if article_task:
218 |                     tip = 'crawling article details'
219 |                     url = article_task.get('article_url')
220 |                 else:
221 |                     sleep_time = config.get('spider').get('no_task_sleep_time')
222 |                     log.info('no task, sleeping {}s'.format(sleep_time))
223 |                     tip = 'no task '
224 | 
225 |         if url:
226 |             next_page = "{tip} sleeping {sleep_time}s, next refresh at {begin_spider_time} ".format(
227 |                 tip=tip and tip + ' ', sleep_time=sleep_time, begin_spider_time=tools.timestamp_to_date(tools.get_current_timestamp() + sleep_time), url=url, sleep_time_msec=sleep_time * 1000
228 |             )
229 |         else:
230 |             next_page = "{tip} sleeping {sleep_time}s, next refresh at {begin_spider_time} ".format(
231 |                 tip=tip and tip + ' ', sleep_time=sleep_time, begin_spider_time=tools.timestamp_to_date(tools.get_current_timestamp() + sleep_time), sleep_time_msec=sleep_time * 1000
232 |             )
233 |
234 | return next_page
235 |
236 | def reset_task(self):
237 |         # clear the redis cache
238 | keys = self._task_root_key + "*"
239 | keys = self._redis.getkeys(keys)
240 | if keys:
241 | for key in keys:
242 | self._redis.clear(key)
243 |
244 |         # reset the tasks
245 | sql = "update wechat_article_task set state = 0 where state = 2"
246 | self._mysqldb.update(sql)
247 |
248 |
249 | if __name__ == '__main__':
250 | task_manager = TaskManager()
251 |
252 | result = task_manager.get_task()
253 | print(result)
254 |
--------------------------------------------------------------------------------
/wechat-spider/create_tables.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | '''
3 | Created on 2019/5/20 11:47 PM
4 | ---------
5 | @summary:
6 | ---------
7 | @author:
8 | '''
9 |
10 | from db.mysqldb import MysqlDB
11 | from config import config
12 |
13 | def _create_database(mysqldb, dbname):
 14 |     mysqldb.execute("CREATE DATABASE IF NOT EXISTS `%s`;" % dbname)  # MysqlDB.execute takes a single sql string, so the name is interpolated here
15 |
16 | def _create_table(mysqldb, sql):
17 | mysqldb.execute(sql)
18 |
19 |
20 | def create_table():
21 | wechat_article_list_table = '''
22 | CREATE TABLE IF NOT EXISTS `wechat_article_list` (
23 | `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
24 | `title` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
25 | `digest` varchar(2000) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
26 | `url` varchar(500) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
27 | `source_url` varchar(1000) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
28 | `cover` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
29 | `subtype` int(11) DEFAULT NULL,
30 | `is_multi` int(11) DEFAULT NULL,
31 | `author` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
32 | `copyright_stat` int(11) DEFAULT NULL,
33 | `duration` int(11) DEFAULT NULL,
34 | `del_flag` int(11) DEFAULT NULL,
35 | `type` int(11) DEFAULT NULL,
36 | `publish_time` datetime DEFAULT NULL,
37 | `sn` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
38 | `spider_time` datetime DEFAULT NULL,
39 | `__biz` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
40 | PRIMARY KEY (`id`),
41 | UNIQUE KEY `sn` (`sn`)
42 | ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=DYNAMIC
43 | '''
44 |
45 | wechat_article_task_table = '''
46 | CREATE TABLE IF NOT EXISTS `wechat_article_task` (
47 | `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
48 | `sn` varchar(50) DEFAULT NULL,
49 | `article_url` varchar(255) DEFAULT NULL,
50 | `state` int(11) DEFAULT '0' COMMENT '文章抓取状态,0 待抓取 2 抓取中 1 抓取完毕 -1 抓取失败',
51 | `__biz` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
52 | PRIMARY KEY (`id`),
53 | UNIQUE KEY `sn` (`sn`) USING BTREE,
54 | KEY `state` (`state`) USING BTREE
55 | ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
56 | '''
57 |
58 | wechat_article_dynamic_table = '''
59 | CREATE TABLE IF NOT EXISTS `wechat_article_dynamic` (
60 | `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
61 | `sn` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
62 | `read_num` int(11) DEFAULT NULL,
63 | `like_num` int(11) DEFAULT NULL,
64 | `comment_count` int(11) DEFAULT NULL,
65 | `spider_time` datetime DEFAULT NULL,
66 | `__biz` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
67 | PRIMARY KEY (`id`),
68 | UNIQUE KEY `sn` (`sn`)
69 | ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=DYNAMIC
70 | '''
71 |
72 | wechat_article_comment_table = '''
73 | CREATE TABLE IF NOT EXISTS `wechat_article_comment` (
74 | `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
75 | `comment_id` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL COMMENT '与文章关联',
76 | `nick_name` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
77 | `logo_url` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
78 | `content` varchar(2000) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
79 | `create_time` datetime DEFAULT NULL,
80 | `content_id` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL COMMENT '本条评论内容的id',
81 | `like_num` int(11) DEFAULT NULL,
82 | `is_top` int(11) DEFAULT NULL,
83 | `spider_time` datetime DEFAULT NULL,
84 | `__biz` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
85 | PRIMARY KEY (`id`),
86 | UNIQUE KEY `content_id` (`content_id`),
87 | KEY `comment_id` (`comment_id`)
88 | ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=DYNAMIC
89 | '''
90 |
91 | wechat_article_table = '''
92 | CREATE TABLE IF NOT EXISTS `wechat_article` (
93 | `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
94 | `account` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
95 | `title` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
96 | `url` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
97 | `author` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
98 | `publish_time` datetime DEFAULT NULL,
99 | `__biz` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
100 | `digest` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
101 | `cover` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
102 | `pics_url` text CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci,
103 | `content_html` text COLLATE utf8mb4_unicode_ci,
104 | `source_url` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
105 | `comment_id` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
106 | `sn` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
107 | `spider_time` datetime DEFAULT NULL,
108 | PRIMARY KEY (`id`),
109 | UNIQUE KEY `sn` (`sn`),
110 | KEY `__biz` (`__biz`)
111 | ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=DYNAMIC
112 | '''
113 |
114 | wechat_account_task_table = '''
115 | CREATE TABLE IF NOT EXISTS `wechat_account_task` (
116 | `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
117 | `__biz` varchar(50) DEFAULT NULL,
118 | `last_publish_time` datetime DEFAULT NULL COMMENT '上次抓取到的文章发布时间,做文章增量采集用',
119 | `last_spider_time` datetime DEFAULT NULL COMMENT '上次抓取时间,用于同一个公众号每隔一段时间扫描一次',
120 | `is_zombie` int(11) DEFAULT '0' COMMENT '僵尸号 默认3个月未发布内容为僵尸号,不再检测',
121 | PRIMARY KEY (`id`)
122 | ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
123 | '''
124 |
125 | wechat_account_table = '''
126 | CREATE TABLE IF NOT EXISTS `wechat_account` (
127 | `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
128 | `__biz` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
129 | `account` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
130 | `head_url` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
131 | `summary` varchar(500) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
132 | `qr_code` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
133 | `verify` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
134 | `spider_time` datetime DEFAULT NULL,
135 | PRIMARY KEY (`id`),
136 | UNIQUE KEY `__biz` (`__biz`)
137 | ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=DYNAMIC
138 | '''
139 |
140 | if config.get('mysqldb').get('auto_create_tables'):
141 | mysqldb = MysqlDB(**config.get('mysqldb'))
142 | # _create_database(mysqldb, config.get('mysqldb').get('db'))
143 | _create_table(mysqldb, wechat_article_list_table)
144 | _create_table(mysqldb, wechat_article_task_table)
145 | _create_table(mysqldb, wechat_article_dynamic_table)
146 | _create_table(mysqldb, wechat_article_comment_table)
147 | _create_table(mysqldb, wechat_article_table)
148 | _create_table(mysqldb, wechat_account_task_table)
149 | _create_table(mysqldb, wechat_account_table)
150 |
--------------------------------------------------------------------------------
/wechat-spider/db/mysqldb.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | '''
3 | Created on 2016-11-16 16:25
4 | ---------
  5 | @summary: mysql database helper
6 | ---------
7 | @author: Boris
8 | '''
9 | import datetime
10 | import json
11 |
12 | import pymysql
13 | from DBUtils.PooledDB import PooledDB
14 | from pymysql import cursors
15 | from pymysql import err
16 |
17 | from utils.log import log
18 |
19 |
20 | def auto_retry(func):
 21 |     def wrapper(*args, **kwargs):
22 | for i in range(3):
23 | try:
24 | return func(*args, **kwargs)
25 | except (err.InterfaceError, err.OperationalError, err.ProgrammingError) as e:
26 | log.error('''
27 | error:%s
28 | sql: %s
29 | ''' % (e, kwargs.get('sql') or args[1]))
30 |
 31 |     return wrapper
32 |
33 |
34 | class MysqlDB():
35 |
36 | def __init__(self, ip=None, port=None, db=None, user=None, passwd=None, **kwargs):
 37 |         # the settings may be modified at runtime, so the defaults cannot be assigned here directly; they are loaded afterwards
38 | try:
39 | self.connect_pool = PooledDB(creator=pymysql, mincached=1, maxcached=100, maxconnections=100, blocking=True, ping=7,
 40 |                                      host=ip, port=port, user=user, passwd=passwd, db=db, charset='utf8mb4', cursorclass=cursors.SSCursor)  # use a server-side cursor; the default client-side cursor keeps growing memory during large multi-threaded bulk inserts
41 | except Exception as e:
42 | input('''
43 | ******************************************
 44 |             Could not connect to the mysql database.
 45 |             Your current connection settings are:
46 | ip = {}
47 | port = {}
48 | db = {}
49 | user = {}
50 | passwd = {}
 51 |             Please install and configure mysql correctly as described in the guide, then restart this program
52 | Exception: {}'''.format(ip, port, db, user, passwd, str(e))
53 | )
54 | import sys
55 | sys.exit()
56 |
57 | def get_connection(self):
58 | conn = self.connect_pool.connection(shareable=False)
59 | # cursor = conn.cursor(cursors.SSCursor)
60 | cursor = conn.cursor()
61 |
62 | return conn, cursor
63 |
64 | def close_connection(self, conn, cursor):
65 | cursor.close()
66 | conn.close()
67 |
68 | def size_of_connections(self):
69 | '''
 70 |         the number of currently active connections
71 | @return:
72 | '''
73 | return self.connect_pool._connections
74 |
75 | def size_of_connect_pool(self):
76 | '''
 77 |         the total number of connections in the pool
78 | @return:
79 | '''
80 | return len(self.connect_pool._idle_cache)
81 |
82 | @auto_retry
83 | def find(self, sql, limit=0, to_json=False, cursor=None):
84 | '''
85 | @summary:
 86 |             no rows: returns ()
 87 |             rows: if limit == 1, returns (data1, data2)
 88 |                   otherwise returns ((data1, data2),)
89 | ---------
90 | @param sql:
91 | @param limit:
92 | ---------
93 | @result:
94 | '''
95 | conn, cursor = self.get_connection()
96 |
97 | cursor.execute(sql)
98 |
99 | if limit == 1:
100 |             result = cursor.fetchone()  # fetches everything then truncates; not recommended
101 | elif limit > 1:
102 |             result = cursor.fetchmany(limit)  # fetches everything then truncates; not recommended
103 | else:
104 | result = cursor.fetchall()
105 |
106 | if to_json:
107 | columns = [i[0] for i in cursor.description]
108 |
109 |             # handle date/time types
110 | def fix_lob(row):
111 | def convert(col):
112 | if isinstance(col, (datetime.date, datetime.time)):
113 | return str(col)
114 | elif isinstance(col, str) and (col.startswith('{') or col.startswith('[')):
115 | try:
116 | return json.loads(col)
117 | except:
118 | return col
119 | else:
120 | return col
121 |
122 | return [convert(c) for c in row]
123 |
124 | result = [fix_lob(row) for row in result]
125 | result = [dict(zip(columns, r)) for r in result]
126 |
127 | self.close_connection(conn, cursor)
128 |
129 | return result
130 |
131 | def add(self, sql, exception_callfunc=''):
132 | affect_count = None
133 |
134 | try:
135 | conn, cursor = self.get_connection()
136 | affect_count = cursor.execute(sql)
137 | conn.commit()
138 |
139 | except Exception as e:
140 | log.error('''
141 | error:%s
142 | sql: %s
143 | ''' % (e, sql))
144 | if exception_callfunc:
145 | exception_callfunc(e)
146 | finally:
147 | self.close_connection(conn, cursor)
148 |
149 | return affect_count
150 |
151 | def add_batch(self, sql, datas):
152 | '''
153 | @summary:
154 | ---------
155 | @ param sql: insert ignore into (xxx,xxx) values (%s, %s, %s)
156 |         @param datas: [[..], [...]]
157 | ---------
158 | @result:
159 | '''
160 | affect_count = None
161 |
162 | try:
163 | conn, cursor = self.get_connection()
164 | affect_count = cursor.executemany(sql, datas)
165 | conn.commit()
166 |
167 | except Exception as e:
168 | log.error('''
169 | error:%s
170 | sql: %s
171 | ''' % (e, sql))
172 | finally:
173 | self.close_connection(conn, cursor)
174 |
175 | return affect_count
176 |
177 | def update(self, sql):
178 | try:
179 | conn, cursor = self.get_connection()
180 | cursor.execute(sql)
181 | conn.commit()
182 |
183 | except Exception as e:
184 | log.error('''
185 | error:%s
186 | sql: %s
187 | ''' % (e, sql))
188 | return False
189 | else:
190 | return True
191 | finally:
192 | self.close_connection(conn, cursor)
193 |
194 | def delete(self, sql):
195 | try:
196 | conn, cursor = self.get_connection()
197 | cursor.execute(sql)
198 | conn.commit()
199 |
200 | except Exception as e:
201 | log.error('''
202 | error:%s
203 | sql: %s
204 | ''' % (e, sql))
205 | return False
206 | else:
207 | return True
208 | finally:
209 | self.close_connection(conn, cursor)
210 |
211 | def execute(self, sql):
212 | try:
213 | conn, cursor = self.get_connection()
214 | cursor.execute(sql)
215 | conn.commit()
216 |
217 | except Exception as e:
218 | log.error('''
219 | error:%s
220 | sql: %s
221 | ''' % (e, sql))
222 | return False
223 | else:
224 | return True
225 | finally:
226 | self.close_connection(conn, cursor)
227 |
228 | def set_unique_key(self, table, key):
229 | try:
230 | sql = 'alter table %s add unique (%s)' % (table, key)
231 |
232 | conn, cursor = self.get_connection()
233 | cursor.execute(sql)
234 | conn.commit()
235 |
236 | except Exception as e:
237 | log.error(table + ' ' + str(e) + ' key = ' + key)
238 | return False
239 | else:
240 |             log.debug('created a unique index on table %s, key %s' % (table, key))
241 | return True
242 | finally:
243 | self.close_connection(conn, cursor)
244 |
245 |
246 | if __name__ == '__main__':
247 | db = MysqlDB()
248 | sql = "select is_done from qiancheng_job_list_batch_record where id = 3"
249 |
250 | data = db.find(sql)
251 | print(data)
252 |
--------------------------------------------------------------------------------
/wechat-spider/db/redisdb.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | '''
3 | Created on 2016-11-16 16:25
4 | ---------
  5 | @summary: redis database helper
6 | ---------
7 | @author: Boris
8 | '''
9 |
10 | from utils.log import log
11 | import redis
12 |
13 |
14 | # setting.REDISDB_DB = 0
15 | # setting.REDISDB_IP_PORTS = 'localhost:6379'
16 | # setting.REDISDB_USER_PASS = None
17 |
18 |
19 | class RedisDB():
20 |
21 | def __init__(self, ip=None, port=None, db=None, passwd=None, decode_responses=True):
22 | self._is_redis_cluster = False
23 | try:
 24 |             self._redis = redis.Redis(host=ip, port=port, db=db, password=passwd, decode_responses=decode_responses)  # the redis default port is 6379
25 | self._redis.ping()
26 | except Exception as e:
27 | input('''
28 | ******************************************
 29 |             Could not connect to the redis database.
 30 |             Your current connection settings are:
31 | ip = {}
32 | port = {}
33 | db = {}
34 | passwd = {}
 35 |             Please install and configure redis correctly as described in the guide, then restart this program
36 | Exception: {}'''.format(ip, port, db, passwd, str(e))
37 | )
38 | import sys
39 | sys.exit()
40 |
41 | def sadd(self, table, values):
42 | '''
 43 |         @summary: store values in an unordered set, deduplicated
 44 |         ---------
 45 |         @param table:
 46 |         @param values: a single value or a list
 47 |         ---------
 48 |         @result: returns 0 if the value already exists, otherwise inserts it and returns 1. Batch adds return None
49 | '''
50 |
51 | if isinstance(values, list):
 52 |             pipe = self._redis.pipeline(transaction=True)  # by default redis-py takes and returns a pooled connection per command; a pipeline batches several commands into one round trip and is atomic by default
53 |
54 | if not self._is_redis_cluster:
55 | pipe.multi()
56 | for value in values:
57 | pipe.sadd(table, value)
58 | pipe.execute()
59 |
60 | else:
61 | return self._redis.sadd(table, values)
62 |
63 | def sget(self, table, count=1, is_pop=True):
64 | '''
 65 |         returns a list, e.g. ['1'] or []
66 | @param table:
67 | @param count:
68 | @param is_pop:
69 | @return:
70 | '''
71 | datas = []
72 | if is_pop:
73 | count = count if count <= self.sget_count(table) else self.sget_count(table)
74 | if count:
75 | if count > 1:
 76 |                     pipe = self._redis.pipeline(transaction=True)  # a pipeline batches several commands into one round trip; atomic by default
77 |
78 | if not self._is_redis_cluster:
79 | pipe.multi()
80 | while count:
81 | pipe.spop(table)
82 | count -= 1
83 | datas = pipe.execute()
84 |
85 | else:
86 | datas.append(self._redis.spop(table))
87 |
88 | else:
89 | datas = self._redis.srandmember(table, count)
90 |
91 | return datas
92 |
93 | def srem(self, table, values):
94 | '''
 95 |         @summary: remove the given members from the set
 96 |         ---------
 97 |         @param table:
 98 |         @param values: a single value or a list
99 | ---------
100 | @result:
101 | '''
102 | if isinstance(values, list):
103 |             pipe = self._redis.pipeline(transaction=True)  # a pipeline batches several commands into one round trip; atomic by default
104 |
105 | if not self._is_redis_cluster:
106 | pipe.multi()
107 | for value in values:
108 | pipe.srem(table, value)
109 | pipe.execute()
110 | else:
111 | self._redis.srem(table, values)
112 |
113 | def sget_count(self, table):
114 | return self._redis.scard(table)
115 |
116 | def sdelete(self, table):
117 | '''
118 |         @summary: delete a large set key (a table holding lots of data)
119 |             uses the sscan command to scan ~500 members at a time, then srem to remove each one
120 |             a direct delete on a huge key can block Redis, leading to failovers and application crashes.
121 | ---------
122 | @param table:
123 | ---------
124 | @result:
125 | '''
126 |         # when the SCAN cursor argument is 0 the server starts a new iteration, and it returns a cursor of 0 once the iteration is complete
127 | cursor = '0'
128 | while cursor != 0:
129 | cursor, data = self._redis.sscan(table, cursor=cursor, count=500)
130 | for item in data:
131 | # pipe.srem(table, item)
132 | self._redis.srem(table, item)
133 |
134 | # pipe.execute()
135 |
136 | def zadd(self, table, values, prioritys=0):
137 | '''
138 |         @summary: store values in a sorted set, deduplicated (existing values get their score updated)
139 |         ---------
140 |         @param table:
141 |         @param values: a single value or a list
142 |         @param prioritys: priority; a double, a single value or a list. Items are ordered by this field, smaller values first. Optional; defaults to 0
143 |         ---------
144 |         @result: returns 0 if the value already exists, otherwise inserts it and returns 1. Batch adds return [0, 1 ...]
145 | '''
146 | if isinstance(values, list):
147 | if not isinstance(prioritys, list):
148 | prioritys = [prioritys] * len(values)
149 | else:
150 |                 assert len(values) == len(prioritys), 'values and prioritys must correspond one to one'
151 |
152 | pipe = self._redis.pipeline(transaction=True)
153 |
154 | if not self._is_redis_cluster:
155 | pipe.multi()
156 | for value, priority in zip(values, prioritys):
157 | if self._is_redis_cluster:
158 | pipe.zadd(table, priority, value)
159 | else:
160 | pipe.zadd(table, value, priority)
161 | return pipe.execute()
162 |
163 | else:
164 | if self._is_redis_cluster:
165 | return self._redis.zadd(table, prioritys, values)
166 | else:
167 | return self._redis.zadd(table, values, prioritys)
168 |
169 | def zget(self, table, count=1, is_pop=True):
170 | '''
171 |         @summary: fetch items from the sorted set; lower scores (higher priority) come first
172 |         ---------
173 |         @param table:
174 |         @param count: number of items; -1 returns everything
175 |         @param is_pop: whether to remove fetched items from the set, default yes
176 |         ---------
177 |         @result: a list
178 | '''
179 |         start_pos = 0  # inclusive
180 | end_pos = count - 1 if count > 0 else count
181 |
182 |         pipe = self._redis.pipeline(transaction=True)  # a pipeline batches several commands into one round trip; atomic by default
183 |
184 | if not self._is_redis_cluster:
185 |             pipe.multi()  # mark the start of the transaction, see http://www.runoob.com/redis/redis-transactions.html
186 |         pipe.zrange(table, start_pos, end_pos)  # read
187 |         if is_pop:
188 |             pipe.zremrangebyrank(table, start_pos, end_pos)  # delete
189 | results, *count = pipe.execute()
190 | return results
191 |
192 | def zremrangebyscore(self, table, priority_min, priority_max):
193 | '''
194 |         remove members by score, inclusive range
195 |         @param table:
196 |         @param priority_min:
197 |         @param priority_max:
198 |         @return: the number of members removed
199 | '''
200 | return self._redis.zremrangebyscore(table, priority_min, priority_max)
201 |
202 | def zrangebyscore(self, table, priority_min, priority_max, count=None, is_pop=True):
203 | '''
204 |         @summary: return the data in the given score range, inclusive
205 |         ---------
206 |         @param table:
207 |         @param priority_min: lower scores mean higher priority
208 |         @param priority_max:
209 |         @param count: how many to fetch; empty means everything in the range
210 |         @param is_pop: whether to delete the fetched items
211 | ---------
212 | @result:
213 | '''
214 |
215 |         # use a lua script to keep the operation atomic
216 | lua = '''
217 | local key = KEYS[1]
218 | local min_score = ARGV[2]
219 | local max_score = ARGV[3]
220 | local is_pop = ARGV[4]
221 | local count = ARGV[5]
222 |
223 |         -- read the values
224 | local datas = nil
225 | if count then
226 | datas = redis.call('zrangebyscore', key, min_score, max_score, 'limit', 0, count)
227 | else
228 | datas = redis.call('zrangebyscore', key, min_score, max_score)
229 | end
230 |
231 |         -- remove the values just read from redis (is_pop arrives as the string '1' or '0')
232 |         if is_pop == '1' then
233 | for i=1, #datas do
234 | redis.call('zrem', key, datas[i])
235 | end
236 | end
237 |
238 |
239 | return datas
240 |
241 | '''
242 | cmd = self._redis.register_script(lua)
243 |         if count:
244 |             res = cmd(keys=[table], args=[table, priority_min, priority_max, int(is_pop), count])  # int(is_pop): a bare bool would reach the script as 'True'/'False', both of which are truthy in Lua
245 |         else:
246 |             res = cmd(keys=[table], args=[table, priority_min, priority_max, int(is_pop)])
247 |
248 | return res
249 |
250 | def zrangebyscore_increase_score(self, table, priority_min, priority_max, increase_score, count=None):
251 | '''
252 |         @summary: return the data in the given score range, inclusive, adjusting scores at the same time
253 |         ---------
254 |         @param table:
255 |         @param priority_min: minimum score
256 |         @param priority_max: maximum score
257 |         @param increase_score: score delta; positive adds to the current score, negative subtracts
258 |         @param count: how many to fetch; empty means everything in the range
259 | ---------
260 | @result:
261 | '''
262 |
263 |         # use a lua script to keep the operation atomic
264 | lua = '''
265 | local key = KEYS[1]
266 | local min_score = ARGV[1]
267 | local max_score = ARGV[2]
268 | local increase_score = ARGV[3]
269 | local count = ARGV[4]
270 |
271 |         -- read the values
272 | local datas = nil
273 | if count then
274 | datas = redis.call('zrangebyscore', key, min_score, max_score, 'limit', 0, count)
275 | else
276 | datas = redis.call('zrangebyscore', key, min_score, max_score)
277 | end
278 |
279 |         -- adjust the priorities
280 | for i=1, #datas do
281 | redis.call('zincrby', key, increase_score, datas[i])
282 | end
283 |
284 | return datas
285 |
286 | '''
287 | cmd = self._redis.register_script(lua)
288 | if count:
289 | res = cmd(keys=[table], args=[priority_min, priority_max, increase_score, count])
290 | else:
291 | res = cmd(keys=[table], args=[priority_min, priority_max, increase_score])
292 |
293 | return res
294 |
295 | def zrangebyscore_set_score(self, table, priority_min, priority_max, score, count=None):
296 | '''
297 |         @summary: return the data in the given score range, inclusive, setting their scores at the same time
298 |         ---------
299 |         @param table:
300 |         @param priority_min: minimum score
301 |         @param priority_max: maximum score
302 |         @param score: the score to set
303 |         @param count: how many to fetch; empty means everything in the range
304 | ---------
305 | @result:
306 | '''
307 |
308 |         # use a lua script to keep the operation atomic
309 | lua = '''
310 | local key = KEYS[1]
311 | local min_score = ARGV[1]
312 | local max_score = ARGV[2]
313 | local set_score = ARGV[3]
314 | local count = ARGV[4]
315 |
316 |         -- read the values
317 | local datas = nil
318 | if count then
319 | datas = redis.call('zrangebyscore', key, min_score, max_score, 'withscores','limit', 0, count)
320 | else
321 | datas = redis.call('zrangebyscore', key, min_score, max_score, 'withscores')
322 | end
323 |
324 |         local real_datas = {} -- the results
325 |         -- adjust the priorities
326 | for i=1, #datas, 2 do
327 | local data = datas[i]
328 | local score = datas[i+1]
329 |
330 |             table.insert(real_datas, data) -- append the value
331 |
332 | redis.call('zincrby', key, set_score - score, datas[i])
333 | end
334 |
335 | return real_datas
336 |
337 | '''
338 | cmd = self._redis.register_script(lua)
339 | if count:
340 | res = cmd(keys=[table], args=[priority_min, priority_max, score, count])
341 | else:
342 | res = cmd(keys=[table], args=[priority_min, priority_max, score])
343 |
344 | return res
345 |
346 | def zget_count(self, table, priority_min=None, priority_max=None):
347 | '''
348 |         @summary: get the number of items in the table
349 |         ---------
350 |         @param table:
351 |         @param priority_min: lower bound of the score range (inclusive)
352 |         @param priority_max: upper bound of the score range (inclusive)
353 | ---------
354 | @result:
355 | '''
356 |
357 | if priority_min != None and priority_max != None:
358 | return self._redis.zcount(table, priority_min, priority_max)
359 | else:
360 | return self._redis.zcard(table)
361 |
362 | def zrem(self, table, values):
363 | '''
364 |         @summary: remove the given members from the sorted set
365 |         ---------
366 |         @param table:
367 |         @param values: a single value or a list
368 | ---------
369 | @result:
370 | '''
371 | if isinstance(values, list):
372 |             pipe = self._redis.pipeline(transaction=True)  # a pipeline batches several commands into one round trip; atomic by default
373 |
374 | if not self._is_redis_cluster:
375 | pipe.multi()
376 | for value in values:
377 | pipe.zrem(table, value)
378 | pipe.execute()
379 | else:
380 | self._redis.zrem(table, values)
381 |
382 | def zexists(self, table, values):
383 | '''
384 |         use zscore to check whether the given member(s) exist
385 | @param values:
386 | @return:
387 | '''
388 | is_exists = []
389 |
390 | if isinstance(values, list):
391 |             pipe = self._redis.pipeline(transaction=True)  # a pipeline batches several commands into one round trip; atomic by default
392 | pipe.multi()
393 | for value in values:
394 | pipe.zscore(table, value)
395 | is_exists_temp = pipe.execute()
396 | for is_exist in is_exists_temp:
397 | if is_exist != None:
398 | is_exists.append(1)
399 | else:
400 | is_exists.append(0)
401 |
402 | else:
403 | is_exists = self._redis.zscore(table, values)
404 | is_exists = 1 if is_exists != None else 0
405 |
406 | return is_exists
407 |
408 | def lpush(self, table, values):
409 | if isinstance(values, list):
410 |             pipe = self._redis.pipeline(transaction=True)  # a pipeline batches several commands into one round trip; atomic by default
411 |
412 | if not self._is_redis_cluster:
413 | pipe.multi()
414 | for value in values:
415 |                 pipe.rpush(table, value)  # pushes to the right so that lpop pops in FIFO order
416 | pipe.execute()
417 |
418 | else:
419 | return self._redis.rpush(table, values)
420 |
421 | def lpop(self, table, count=1):
422 | '''
423 | @summary:
424 | ---------
425 | @param table:
426 | @param count:
427 | ---------
428 |         @result: returns a list when count > 1
429 | '''
430 | datas = None
431 |
432 | count = count if count <= self.lget_count(table) else self.lget_count(table)
433 |
434 | if count:
435 | if count > 1:
436 |                 pipe = self._redis.pipeline(transaction=True)  # a pipeline batches several commands into one round trip; atomic by default
437 |
438 | if not self._is_redis_cluster:
439 | pipe.multi()
440 | while count:
441 | pipe.lpop(table)
442 | count -= 1
443 | datas = pipe.execute()
444 |
445 | else:
446 | datas = self._redis.lpop(table)
447 |
448 | return datas
449 |
450 | def rpoplpush(self, from_table, to_table=None):
451 | '''
452 |         pop the last element (the tail) of from_table and return it to the client.
453 |         push the popped element onto the head of to_table.
454 |         if from_table and to_table are the same, the tail element is moved to the head and returned; this special case can be seen as a rotation of the list
455 | @param from_table:
456 | @param to_table:
457 | @return:
458 | '''
459 |
460 | if not to_table:
461 | to_table = from_table
462 |
463 | return self._redis.rpoplpush(from_table, to_table)
464 |
465 | def lget_count(self, table):
466 | return self._redis.llen(table)
467 |
468 | def lrem(self, table, value, num=0):
469 | return self._redis.lrem(table, value, num)
470 |
471 | def hset(self, table, key, value):
472 | '''
473 | @summary:
474 |             if the key does not exist, a new hash is created and the HSET is applied.
475 |             if the field already exists, its old value is overwritten
476 |         ---------
477 |         @param table:
478 |         @param key:
479 |         @param value:
480 |         ---------
481 |         @result: 1 newly inserted; 0 overwritten
482 | '''
483 |
484 | return self._redis.hset(table, key, value)
485 |
486 | def hincrby(self, table, key, increment):
487 | return self._redis.hincrby(table, key, increment)
488 |
489 | def hget(self, table, key, is_pop=False):
490 | if not is_pop:
491 | return self._redis.hget(table, key)
492 | else:
493 | lua = '''
494 | local key = KEYS[1]
495 | local field = ARGV[1]
496 |
497 |             -- read the value
498 | local datas = redis.call('hget', key, field)
499 |             -- delete it
500 | redis.call('hdel', key, field)
501 |
502 | return datas
503 |
504 | '''
505 | cmd = self._redis.register_script(lua)
506 | res = cmd(keys=[table], args=[key])
507 |
508 | return res
509 |
510 | def hgetall(self, table):
511 | return self._redis.hgetall(table)
512 |
513 | def hexists(self, table, key):
514 | return self._redis.hexists(table, key)
515 |
516 | def hdel(self, table, *keys):
517 | '''
518 |         @summary: delete the given field(s); several may be passed
519 | ---------
520 | @param table:
521 | @param *keys:
522 | ---------
523 | @result:
524 | '''
525 |
526 | self._redis.hdel(table, *keys)
527 |
528 | def hget_count(self, table):
529 | return self._redis.hlen(table)
530 |
531 | def setbit(self, table, offsets, values):
532 | '''
533 |         set the bit(s) at the given offset(s) of the string; returns the previous value(s)
534 |         @param table:
535 |         @param offsets: a single value or a list
536 |         @param values: a single value or a list
537 |         @return: a list / a single value
538 | '''
539 | if isinstance(offsets, list):
540 | if not isinstance(values, list):
541 | values = [values] * len(offsets)
542 | else:
543 |                 assert len(offsets) == len(values), 'offsets and values must correspond one to one'
544 |
545 |             pipe = self._redis.pipeline(transaction=True)  # a pipeline batches several commands into one round trip; atomic by default
546 | pipe.multi()
547 |
548 | for offset, value in zip(offsets, values):
549 | pipe.setbit(table, offset, value)
550 |
551 | return pipe.execute()
552 |
553 | else:
554 | return self._redis.setbit(table, offsets, values)
555 |
556 | def getbit(self, table, offsets):
557 | '''
558 |         read the bit(s) at the given offset(s) of the string
559 |         @param table:
560 |         @param offsets: a list is supported
561 |         @return: a list / a single value
562 | '''
563 | if isinstance(offsets, list):
564 |             pipe = self._redis.pipeline(transaction=True)  # a pipeline batches several commands into one round trip; atomic by default
565 | pipe.multi()
566 | for offset in offsets:
567 | pipe.getbit(table, offset)
568 |
569 | return pipe.execute()
570 |
571 | else:
572 | return self._redis.getbit(table, offsets)
573 |
574 | def bitcount(self, table):
575 | return self._redis.bitcount(table)
576 |
577 | def strset(self, table, value, **kwargs):
578 | return self._redis.set(table, value, **kwargs)
579 |
580 | def strget(self, table):
581 | return self._redis.get(table)
582 |
583 | def strlen(self, table):
584 | return self._redis.strlen(table)
585 |
586 | def getkeys(self, regex):
587 | return self._redis.keys(regex)
588 |
589 | def exists_key(self, key):
590 | return self._redis.exists(key)
591 |
592 | def set_expire(self, key, seconds):
593 | '''
594 |         @summary: set the expiration time of a key
595 | ---------
596 | @param key:
597 |         @param seconds: time to live, in seconds
598 | ---------
599 | @result:
600 | '''
601 |
602 | self._redis.expire(key, seconds)
603 |
604 | def clear(self, table):
605 | try:
606 | self._redis.delete(table)
607 | except Exception as e:
608 | log.error(e)
609 |
610 | def get_redis_obj(self):
611 | return self._redis
612 |
613 |
614 | if __name__ == '__main__':
615 | db = RedisDB(db=1)
616 | print(db)
617 |
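618 |     # Usage sketch exercising the helpers above (key names are hypothetical, not the project schema):
619 |     db.get_redis_obj().rpush('demo:tasks', 'task1', 'task2')
620 |     print(db.rpoplpush('demo:tasks'))  # rotation: the tail element 'task2' moves to the head
621 |     db.hset('demo:state', 'task1', 'done')
622 |     print(db.hget('demo:state', 'task1', is_pop=True))  # atomic get + delete via the Lua script
623 |     print(db.setbit('demo:bits', [1, 3], 1))  # pipelined; returns the previous bit values, e.g. [0, 0]
624 |     print(db.bitcount('demo:bits'))  # -> 2
625 |     for key in ('demo:tasks', 'demo:state', 'demo:bits'):
626 |         db.clear(key)
627 |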
--------------------------------------------------------------------------------
/wechat-spider/run.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | '''
3 | Created on 2019/5/18 9:52 PM
4 | ---------
5 | @summary:
6 | ---------
7 | @author:
8 | '''
9 |
10 | from core.capture_packet import WechatCapture
11 | from create_tables import create_table
12 | from mitmproxy import options
13 | from mitmproxy import proxy
14 | from mitmproxy.tools.dump import DumpMaster
15 | from config import config, IP
16 |
17 |
18 | def start():
19 | ip = IP
20 | port = config.get('spider').get('service_port')
21 |
22 |     print("Tip: proxy service at IP {} port {}; make sure the device's proxy is configured accordingly".format(ip, port))
23 |
24 | myaddon = WechatCapture()
25 | opts = options.Options(listen_port=port)
26 |     pconf = proxy.config.ProxyConfig(opts)  # mitmproxy 4-era API; ProxyConfig/ProxyServer were removed in later mitmproxy releases
27 | m = DumpMaster(opts)
28 | m.options.set('flow_detail={mitm_log_level}'.format(mitm_log_level=config.get('mitm').get('log_level')))
29 | m.server = proxy.server.ProxyServer(pconf)
30 | m.addons.add(myaddon)
31 |
32 | try:
33 | m.run()
34 | except KeyboardInterrupt:
35 | m.shutdown()
36 |
37 |
38 | if __name__ == '__main__':
39 |     create_table()
40 |     start()
41 |
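42 | # Usage sketch: with spider.service_port set in config.yaml (e.g. 8080), run
43 | #   python3 run.py
44 | # then point the phone's Wi-Fi proxy at this machine's IP and that port,
45 | # with the mitmproxy certificate installed and trusted on the device.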
--------------------------------------------------------------------------------
/wechat-spider/utils/log.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | """
3 | Created on 2018-12-08 16:50
4 | ---------
5 | @summary:
6 | ---------
7 | @author:
8 | """
9 |
10 | import logging
11 | import os
12 | import sys
13 | from logging.handlers import BaseRotatingHandler
14 | from config import config
15 |
16 | from better_exceptions import format_exception
17 |
18 | LOG_FORMAT = "%(threadName)s|%(asctime)s|%(filename)s|%(funcName)s|line:%(lineno)d|%(levelname)s| %(message)s"
19 | PRINT_EXCEPTION_DETAILS = True
20 |
21 |
22 | # Rewrite RotatingFileHandler to customize rotated log file names
23 | # stdlib naming: xxx.log xxx.log.1 xxx.log.2 xxx.log.3, newest to oldest
24 | # here: xxx.log xxx1.log xxx2.log ...; with a 2-digit backupCount the suffix is 01 02 ..., with 3 digits 001 002 ..., newest to oldest
25 | class RotatingFileHandler(BaseRotatingHandler):
26 |
27 | def __init__(
28 | self, filename, mode="a", maxBytes=0, backupCount=0, encoding=None, delay=0
29 | ):
30 | # if maxBytes > 0:
31 | # mode = 'a'
32 | BaseRotatingHandler.__init__(self, filename, mode, encoding, delay)
33 | self.maxBytes = maxBytes
34 | self.backupCount = backupCount
35 | self.placeholder = str(len(str(backupCount)))
36 |
37 | def doRollover(self):
38 | if self.stream:
39 | self.stream.close()
40 | self.stream = None
41 | if self.backupCount > 0:
42 | for i in range(self.backupCount - 1, 0, -1):
43 |                 sfn = ("%0" + self.placeholder + "d.") % i  # e.g. '%02d.' % 2 -> '02.'
44 | sfn = sfn.join(self.baseFilename.split("."))
45 | # sfn = "%d_%s" % (i, self.baseFilename)
46 | # dfn = "%d_%s" % (i + 1, self.baseFilename)
47 | dfn = ("%0" + self.placeholder + "d.") % (i + 1)
48 | dfn = dfn.join(self.baseFilename.split("."))
49 | if os.path.exists(sfn):
50 | # print "%s -> %s" % (sfn, dfn)
51 | if os.path.exists(dfn):
52 | os.remove(dfn)
53 | os.rename(sfn, dfn)
54 | dfn = (("%0" + self.placeholder + "d.") % 1).join(
55 | self.baseFilename.split(".")
56 | )
57 | if os.path.exists(dfn):
58 | os.remove(dfn)
59 | # Issue 18940: A file may not have been created if delay is True.
60 | if os.path.exists(self.baseFilename):
61 | os.rename(self.baseFilename, dfn)
62 | if not self.delay:
63 | self.stream = self._open()
64 |
65 | def shouldRollover(self, record):
66 |
67 | if self.stream is None: # delay was set...
68 | self.stream = self._open()
69 | if self.maxBytes > 0: # are we rolling over?
70 | msg = "%s\n" % self.format(record)
71 | self.stream.seek(0, 2) # due to non-posix-compliant Windows feature
72 | if self.stream.tell() + len(msg) >= self.maxBytes:
73 | return 1
74 | return 0
75 |
76 |
77 | def get_logger(
78 | name, path="", log_level="DEBUG", is_write_to_file=False, is_write_to_stdout=True
79 | ):
80 | """
81 |     @summary: get a logger
82 |     ---------
83 |     @param name: logger name
84 |     @param path: log file path, e.g. D://xxx.log
85 |     @param log_level: log level CRITICAL/ERROR/WARNING/INFO/DEBUG
86 |     @param is_write_to_file: whether to write to a file, default False (is_write_to_stdout controls stdout output, default True)
87 |     ---------
88 |     @result: a configured logging.Logger
89 | """
90 |     name = name.split(os.sep)[-1].split(".")[0]  # derive the logger name from a file path
91 |
92 | logger = logging.getLogger(name)
93 | logger.setLevel(log_level)
94 |
95 | formatter = logging.Formatter(LOG_FORMAT)
96 | if PRINT_EXCEPTION_DETAILS:
97 | formatter.formatException = lambda exc_info: format_exception(*exc_info)
98 |
99 |     # RotatingFileHandler: keep at most 20 backup files of 10 MB each (see the parameters below)
100 | if is_write_to_file:
101 | if path and not os.path.exists(os.path.dirname(path)):
102 | os.makedirs(os.path.dirname(path))
103 |
104 | rf_handler = RotatingFileHandler(
105 | path, mode="w", maxBytes=10 * 1024 * 1024, backupCount=20, encoding="utf8"
106 | )
107 | rf_handler.setFormatter(formatter)
108 | logger.addHandler(rf_handler)
109 |
110 | if is_write_to_stdout:
111 | stream_handler = logging.StreamHandler()
112 | stream_handler.stream = sys.stdout
113 | stream_handler.setFormatter(formatter)
114 |         # avoid adding a duplicate stdout handler
115 | handle_exists = 0
116 | for _handler in logger.handlers:
117 | if (
118 | isinstance(_handler, logging.StreamHandler)
119 | and _handler.stream == sys.stdout
120 | ):
121 | handle_exists = 1
122 | if not handle_exists:
123 | logger.addHandler(stream_handler)
124 |
125 | return logger
126 |
127 |
128 | # logging.disable(logging.DEBUG)  # silence all logging
129 |
130 | # loggers whose output we want to suppress
131 | STOP_LOGS = [
132 | # ES
133 | "urllib3.response",
134 | "urllib3.connection",
135 | "elasticsearch.trace",
136 | "requests.packages.urllib3.util",
137 | "requests.packages.urllib3.util.retry",
138 | "urllib3.util",
139 | "requests.packages.urllib3.response",
140 | "requests.packages.urllib3.contrib.pyopenssl",
141 | "requests.packages",
142 | "urllib3.util.retry",
143 | "requests.packages.urllib3.contrib",
144 | "requests.packages.urllib3.connectionpool",
145 | "requests.packages.urllib3.poolmanager",
146 | "urllib3.connectionpool",
147 | "requests.packages.urllib3.connection",
148 | "elasticsearch",
149 | "log_request_fail",
150 | # requests
151 | "requests",
152 | "selenium.webdriver.remote.remote_connection",
153 | "selenium.webdriver.remote",
154 | "selenium.webdriver",
155 | "selenium",
156 | # markdown
157 | "MARKDOWN",
158 | "build_extension",
159 | # newspaper
160 | "calculate_area",
161 | "largest_image_url",
162 | "newspaper.images",
163 | "newspaper",
164 | "Importing",
165 | "PIL",
166 | ]
167 |
168 | # silence noisy third-party loggers
169 | for STOP_LOG in STOP_LOGS:
170 |     log_level = logging.ERROR
171 |     logging.getLogger(STOP_LOG).setLevel(log_level)
172 |
173 | # print(logging.Logger.manager.loggerDict)  # list the names of loggers currently in use
174 |
175 | # log level severity: critical > error > warning > info > debug
176 | log = get_logger(
177 | name="wechat_spider",
178 | path=config.get('log').get('log_path'),
179 | log_level=config.get('log').get('level'),
180 | is_write_to_file=config.get('log').get('to_file'),
181 | )
182 |
183 | if __name__ == "__main__":
184 | try:
185 | a = 1
186 | b = 0
187 | c = a / b
188 | except Exception as e:
189 | log.exception(e)
190 |
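191 |     # File-rotation sketch (the path below is hypothetical, not part of the original module):
192 |     file_log = get_logger(name="rotate_demo", path="log/rotate_demo.log",
193 |                           log_level="INFO", is_write_to_file=True)
194 |     file_log.info("rotates at 10 MB, keeping backups rotate_demo01.log ... rotate_demo20.log")
195 |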
--------------------------------------------------------------------------------
/wechat-spider/utils/selector.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | '''
3 | Created on 2018-10-08 15:33:37
4 | ---------
5 | @summary: a re-defined Selector built on parsel
6 | ---------
7 | @author:
8 | @email:
9 | '''
10 | import re
11 |
12 | import six
13 | from parsel import Selector as ParselSelector
14 | from parsel import SelectorList as ParselSelectorList
15 | from w3lib.html import replace_entities as w3lib_replace_entities
16 |
17 |
18 | def extract_regex(regex, text, replace_entities=True, flags=0):
19 | """Extract a list of unicode strings from the given text/encoding using the following policies:
20 | * if the regex contains a named group called "extract" that will be returned
21 |     * if the regex contains multiple numbered groups, all those will be returned (as tuples; flattening is disabled here)
22 |     * if the regex doesn't contain any group the entire regex match is returned
23 | """
24 | if isinstance(regex, six.string_types):
25 | regex = re.compile(regex, flags=flags)
26 |
27 | if 'extract' in regex.groupindex:
28 | # named group
29 | try:
30 | extracted = regex.search(text).group('extract')
31 | except AttributeError:
32 | strings = []
33 | else:
34 | strings = [extracted] if extracted is not None else []
35 | else:
36 | # full regex or numbered groups
37 | strings = regex.findall(text)
38 |
39 |     # strings = flatten(strings)  # this would flatten nested lists, which we don't want here
40 | if not replace_entities:
41 | return strings
42 |
43 | values = []
44 | for value in strings:
45 |         if isinstance(value, (list, tuple)):  # w3lib_replace_entities cannot take a list/tuple
46 | values.append([w3lib_replace_entities(v, keep=['lt', 'amp']) for v in value])
47 | else:
48 | values.append(w3lib_replace_entities(value, keep=['lt', 'amp']))
49 |
50 | return values
51 |
52 |
53 | class SelectorList(ParselSelectorList):
54 | """
55 | The :class:`SelectorList` class is a subclass of the builtin ``list``
56 | class, which provides a few additional methods.
57 | """
58 |
59 | def re_first(self, regex, default=None, replace_entities=True, flags=re.S):
60 | """
61 | Call the ``.re()`` method for the first element in this list and
62 |         return the result as a unicode string. If the list is empty or the
63 | regex doesn't match anything, return the default value (``None`` if
64 | the argument is not provided).
65 |
66 | By default, character entity references are replaced by their
67 |         corresponding character (except for ``&`` and ``<``).
68 | Passing ``replace_entities`` as ``False`` switches off these
69 | replacements.
70 | """
71 |
72 | datas = self.re(regex, replace_entities=replace_entities, flags=flags)
73 | return datas[0] if datas else default
74 |
75 | def re(self, regex, replace_entities=True, flags=re.S):
76 | """
77 |         Call the ``.re()`` method for each element in this list and return
78 |         the results: one element's matches directly, or a list of per-element matches.
79 |
80 | By default, character entity references are replaced by their
81 |         corresponding character (except for ``&`` and ``<``).
82 | Passing ``replace_entities`` as ``False`` switches off these
83 | replacements.
84 | """
85 | datas = [x.re(regex, replace_entities=replace_entities, flags=flags) for x in self]
86 | return datas[0] if len(datas) == 1 else datas
87 |
88 |
89 | class Selector(ParselSelector):
90 | selectorlist_cls = SelectorList
91 |
92 | def __str__(self):
93 | data = repr(self.get())
94 | return "<%s xpath=%r data=%s>" % (type(self).__name__, self._expr, data)
95 |
96 | __repr__ = __str__
97 |
98 | def re_first(self, regex, default=None, replace_entities=True, flags=re.S):
99 | """
100 | Apply the given regex and return the first unicode string which
101 | matches. If there is no match, return the default value (``None`` if
102 | the argument is not provided).
103 |
104 | By default, character entity references are replaced by their
105 |         corresponding character (except for ``&`` and ``<``).
106 | Passing ``replace_entities`` as ``False`` switches off these
107 | replacements.
108 | """
109 |
110 | datas = self.re(regex, replace_entities=replace_entities, flags=flags)
111 |
112 | return datas[0] if datas else default
113 |
114 | def re(self, regex, replace_entities=True, flags=re.S):
115 | """
116 | Apply the given regex and return a list of unicode strings with the
117 | matches.
118 |
119 | ``regex`` can be either a compiled regular expression or a string which
120 | will be compiled to a regular expression using ``re.compile(regex)``.
121 |
122 | By default, character entity references are replaced by their
123 |         corresponding character (except for ``&`` and ``<``).
124 | Passing ``replace_entities`` as ``False`` switches off these
125 | replacements.
126 | """
127 |
128 | return extract_regex(regex, self.get(), replace_entities=replace_entities, flags=flags)
129 |
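130 |
131 | if __name__ == '__main__':
132 |     # Usage sketch (hypothetical HTML): a named group called "extract" is returned alone
133 |     html = '<p>阅读 <span id="readNum">3204</span></p>'
134 |     print(Selector(text=html).re_first(r'id="readNum">(?P<extract>\d+)<'))  # -> '3204'
135 |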
--------------------------------------------------------------------------------
/wechat-spider/utils/tools.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | '''
3 | Created on 2019/5/19 3:03 PM
4 | ---------
5 | @summary:
6 | ---------
7 | @author:
8 | '''
9 | import datetime
10 | import json
11 | import re
12 | import ssl
13 | import time
14 | import uuid
15 | from pprint import pformat
16 | import hashlib
17 |
18 | import pymysql
19 | from utils.log import log
20 |
21 | # globally disable ssl certificate verification
22 | ssl._create_default_https_context = ssl._create_unverified_context
23 |
24 | _regexs = {}
25 |
26 |
27 | # @log_function_time
28 | def get_info(html, regexs, allow_repeat=True, fetch_one=False, split=None):
29 | regexs = isinstance(regexs, str) and [regexs] or regexs
30 |
31 | infos = []
32 | for regex in regexs:
33 | if regex == '':
34 | continue
35 |
36 | if regex not in _regexs.keys():
37 | _regexs[regex] = re.compile(regex, re.S)
38 |
39 | if fetch_one:
40 | infos = _regexs[regex].search(html)
41 | if infos:
42 | infos = infos.groups()
43 | else:
44 | continue
45 | else:
46 | infos = _regexs[regex].findall(str(html))
47 |
48 | if len(infos) > 0:
49 | # print(regex)
50 | break
51 |
52 | if fetch_one:
53 | infos = infos if infos else ('',)
54 | return infos if len(infos) > 1 else infos[0]
55 | else:
56 | infos = allow_repeat and infos or sorted(set(infos), key=infos.index)
57 | infos = split.join(infos) if split else infos
58 | return infos
59 |
60 |
61 | def get_param(url, key):
62 | params = url.split('?')[-1].split('&')
63 | for param in params:
64 | key_value = param.split('=', 1)
65 | if key == key_value[0]:
66 | return key_value[1]
67 | return None
68 |
69 |
70 | def get_current_timestamp():
71 | return int(time.time())
72 |
73 |
74 | def get_current_date(date_format='%Y-%m-%d %H:%M:%S'):
75 | return datetime.datetime.now().strftime(date_format)
76 | # return time.strftime(date_format, time.localtime(time.time()))
77 |
78 |
79 | def timestamp_to_date(timestamp, time_format='%Y-%m-%d %H:%M:%S'):
80 | '''
81 |     @summary: convert a unix timestamp to a date string
82 |     ---------
83 |     @param timestamp: unix timestamp, in seconds
84 |     @param time_format: date format
85 |     ---------
86 |     @result: the formatted date string
87 | '''
88 |
89 | date = time.localtime(timestamp)
90 | return time.strftime(time_format, date)
91 |
92 |
93 | def get_json(json_str):
94 | '''
95 |     @summary: parse a json object
96 |     ---------
97 |     @param json_str: a json-formatted string
98 |     ---------
99 |     @result: the parsed json object ({} on failure)
100 | '''
101 |
102 | try:
103 | return json.loads(json_str) if json_str else {}
104 | except Exception as e1:
105 | try:
106 | json_str = json_str.strip()
107 | json_str = json_str.replace("'", '"')
108 |             keys = get_info(json_str, r"(\w+):")
109 | for key in keys:
110 | json_str = json_str.replace(key, '"%s"' % key)
111 |
112 | return json.loads(json_str) if json_str else {}
113 |
114 | except Exception as e2:
115 | log.error(
116 | '''
117 | e1: %s
118 | format json_str: %s
119 | e2: %s
120 | ''' % (e1, json_str, e2)
121 | )
122 |
123 | return {}
124 |
125 |
126 | def dumps_json(json_, indent=4):
127 | '''
128 |     @summary: pretty-format json for printing
129 |     ---------
130 |     @param json_: a json-formatted string or a json object
131 |     ---------
132 |     @result: the formatted string
133 | '''
134 | try:
135 | if isinstance(json_, str):
136 | json_ = get_json(json_)
137 |
138 | json_ = json.dumps(json_, ensure_ascii=False, indent=indent, skipkeys=True)
139 |
140 | except Exception as e:
141 | log.error(e)
142 | json_ = pformat(json_)
143 |
144 | return json_
145 |
146 |
147 | ############
148 | def format_sql_value(value):
149 | if isinstance(value, str):
150 | value = pymysql.escape_string(value)
151 |
152 | elif isinstance(value, list) or isinstance(value, dict):
153 | value = dumps_json(value, indent=None)
154 |
155 | elif isinstance(value, bool):
156 | value = int(value)
157 |
158 | return value
159 |
160 |
161 | def list2str(datas):
162 | '''
163 |     convert a list to a sql values string
164 | :param datas: [1, 2]
165 | :return: (1, 2)
166 | '''
167 | data_str = str(tuple(datas))
168 |     data_str = re.sub(r",\)$", ')', data_str)  # drop the trailing comma of a 1-tuple, e.g. (1,) -> (1)
169 | return data_str
170 |
171 |
172 | def make_insert_sql(table, data, auto_update=False, update_columns=(), insert_ignore=False):
173 | '''
174 |     @summary: for mysql; oracle date columns would need to_date handling (TODO)
175 |     ---------
176 |     @param table:
177 |     @param data: row data as a dict
178 |     @param auto_update: use replace into, fully overwriting an existing row
179 |     @param update_columns: columns to update on duplicate key conflict; when set, auto_update is ignored
180 |     @param insert_ignore: skip the insert if the row already exists
181 | ---------
182 | @result:
183 | '''
184 |
185 | keys = ['`{}`'.format(key) for key in data.keys()]
186 | keys = list2str(keys).replace("'", '')
187 |
188 | values = [format_sql_value(value) for value in data.values()]
189 | values = list2str(values)
190 |
191 | if update_columns:
192 | if not isinstance(update_columns, (tuple, list)):
193 | update_columns = [update_columns]
194 | update_columns_ = ', '.join(["{key}=values({key})".format(key=key) for key in update_columns])
195 | sql = 'insert%s into {table} {keys} values {values} on duplicate key update %s' % (' ignore' if insert_ignore else '', update_columns_)
196 |
197 | elif auto_update:
198 | sql = 'replace into {table} {keys} values {values}'
199 | else:
200 | sql = 'insert%s into {table} {keys} values {values}' % (' ignore' if insert_ignore else '')
201 |
202 |     sql = sql.format(table=table, keys=keys, values=values).replace('None', 'null')  # note: this also rewrites the literal substring 'None' inside values
203 | return sql
204 |
205 |
206 | def make_update_sql(table, data, condition):
207 | '''
208 |     @summary: for mysql; oracle date columns would need to_date handling (TODO)
209 |     ---------
210 |     @param table:
211 |     @param data: row data as a dict
212 |     @param condition: where clause condition
213 | ---------
214 | @result:
215 | '''
216 | key_values = []
217 |
218 | for key, value in data.items():
219 | value = format_sql_value(value)
220 | if isinstance(value, str):
221 | key_values.append("`{}`='{}'".format(key, value))
222 | elif value is None:
223 | key_values.append("`{}`={}".format(key, 'null'))
224 | else:
225 | key_values.append("`{}`={}".format(key, value))
226 |
227 | key_values = ', '.join(key_values)
228 |
229 | sql = 'update {table} set {key_values} where {condition}'
230 | sql = sql.format(table=table, key_values=key_values, condition=condition)
231 | return sql
232 |
233 |
234 | def make_batch_sql(table, datas, auto_update=False, update_columns=()):
235 | '''
236 |     @summary: build a batch sql statement plus its value rows
237 |     ---------
238 |     @param table:
239 |     @param datas: row data, [{...}, ...]
240 |     @param auto_update: use replace into, fully overwriting existing rows
241 |     @param update_columns: columns to update on duplicate key conflict; when set, auto_update is ignored
242 | ---------
243 | @result:
244 | '''
245 | if not datas:
246 | return
247 |
248 | keys = list(datas[0].keys())
249 | values_placeholder = ['%s'] * len(keys)
250 |
251 | values = []
252 | for data in datas:
253 | value = []
254 | for key in keys:
255 | current_data = data.get(key)
256 | current_data = format_sql_value(current_data)
257 |
258 | value.append(current_data)
259 |
260 | values.append(value)
261 |
262 | keys = ['`{}`'.format(key) for key in keys]
263 | keys = str(keys).replace('[', '(').replace(']', ')').replace("'", '')
264 | values_placeholder = str(values_placeholder).replace('[', '(').replace(']', ')').replace("'", '')
265 |
266 | if update_columns:
267 | if not isinstance(update_columns, (tuple, list)):
268 | update_columns = [update_columns]
269 | update_columns_ = ', '.join(["`{key}`=values(`{key}`)".format(key=key) for key in update_columns])
270 | sql = 'insert into {table} {keys} values {values_placeholder} on duplicate key update {update_columns}'.format(table=table, keys=keys, values_placeholder=values_placeholder, update_columns=update_columns_)
271 | elif auto_update:
272 | sql = 'replace into {table} {keys} values {values_placeholder}'.format(table=table, keys=keys, values_placeholder=values_placeholder)
273 | else:
274 | sql = 'insert ignore into {table} {keys} values {values_placeholder}'.format(table=table, keys=keys, values_placeholder=values_placeholder)
275 |
276 | return sql, values
277 |
278 |
279 | ##########
280 |
281 | def get_mac_address():
282 | mac = uuid.UUID(int=uuid.getnode()).hex[-12:]
283 | return ":".join([mac[e:e + 2] for e in range(0, 11, 2)])
284 |
285 |
286 | def get_md5(*args):
287 | '''
288 |     @summary: get a unique 32-char md5
289 |     ---------
290 |     @param *args: the values combined into the dedup key
291 | ---------
292 | @result: 7c8684bcbdfcea6697650aa53d7b1405
293 | '''
294 |
295 | m = hashlib.md5()
296 | for arg in args:
297 | m.update(str(arg).encode())
298 |
299 | return m.hexdigest()
300 |
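301 |
302 | if __name__ == '__main__':
303 |     # Usage sketch (table and column names here are hypothetical, not the project schema):
304 |     print(make_insert_sql('wechat_article', {'title': 'demo', 'read_num': 100}, insert_ignore=True))
305 |     # -> insert ignore into wechat_article (`title`, `read_num`) values ('demo', 100)
306 |     print(make_update_sql('wechat_article', {'read_num': 200}, 'id=1'))
307 |     # -> update wechat_article set `read_num`=200 where id=1
308 |     sql, values = make_batch_sql('wechat_article', [{'title': 'a'}, {'title': 'b'}])
309 |     print(sql, values)  # the %s-placeholder sql is meant for cursor.executemany(sql, values)
310 |     print(get_param('https://mp.weixin.qq.com/s?__biz=abc&mid=1', '__biz'))  # -> abc
311 |     print(get_md5('https://mp.weixin.qq.com/s?__biz=abc&mid=1', '2019-05-19'))  # stable 32-char dedup key
312 |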
--------------------------------------------------------------------------------