├── .gitignore
├── Dockerfile
├── README.md
├── docker-compose.yml
├── media
│   ├── 15584541954959.jpg
│   ├── 15584542414888.jpg
│   ├── 15584544774749.jpg
│   ├── 15584545518249.jpg
│   ├── 15584546784023.jpg
│   ├── 15584547028361.jpg
│   ├── 15584578582622.jpg
│   ├── 15584579051963.jpg
│   ├── 15584581938431.jpg
│   ├── 15584582326072.jpg
│   ├── 15584585019970.jpg
│   ├── 15610827417503.jpg
│   ├── 15610832298058.jpg
│   ├── 15632498867519.jpg
│   ├── 95AE10B3227FDE0637AB227A5A8267E3.png
│   ├── A580D0082CCEE0621F98FAF003C5530E.png
│   └── 赞赏码.png
├── requirements.txt
└── wechat-spider
    ├── config.py
    ├── config.yaml
    ├── core
    │   ├── capture_packet.py
    │   ├── data_pipeline.py
    │   ├── deal_data.py
    │   └── task_manager.py
    ├── create_tables.py
    ├── db
    │   ├── mysqldb.py
    │   └── redisdb.py
    ├── run.py
    └── utils
        ├── log.py
        ├── selector.py
        └── tools.py
/.gitignore:
--------------------------------------------------------------------------------
1 | */__pycache__
2 | *.pyc
3 | .svn/
4 | log/*
5 | config/
6 | .vs/
7 | .vscode/
8 | *.log
9 | venv
10 | .venv/
11 | test.py
12 | .DS_Store
13 | .idea
14 | .git
15 | 
16 | config.yaml
17 | mariadb_data/
18 | mitmproxy/
--------------------------------------------------------------------------------
/Dockerfile:
--------------------------------------------------------------------------------
1 | FROM python:3.7
2 | 
3 | RUN cat /etc/apt/sources.list | awk -F[/:] '{print $4}' | sort | uniq | grep -v "^$" | xargs -I{} sed -i 's|{}|mirrors.aliyun.com|g' /etc/apt/sources.list && \
4 |     apt update && \
5 |     apt install -y psmisc netcat wait-for-it && \
6 |     apt-get clean && \
7 |     rm -rf /var/lib/apt/lists/*
8 | 
9 | WORKDIR /app
10 | 
11 | COPY requirements.txt requirements.txt
12 | 
13 | RUN pip3 install -i https://mirrors.aliyun.com/pypi/simple -r requirements.txt
14 | 
15 | COPY . .
16 | 
17 | WORKDIR /app/wechat-spider
18 | 
19 | EXPOSE 8080
20 | EXPOSE 8081
21 | 
22 | # ENTRYPOINT [ "python3", "./run.py" ]
23 | 
24 | CMD [ "python3", "./run.py" ]
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # WeChat Spider
2 | 
3 | 
8 | 
9 | The following is the deployment guide.
10 | 
11 | For the technical documentation, see: [https://t.zsxq.com/7ubmqNJ](https://t.zsxq.com/7ubmqNJ)
12 | 
13 | For the reverse-engineering based crawling approach, see: [https://wx.zsxq.com/dweb2/index/topic_detail/215584212588541](https://wx.zsxq.com/dweb2/index/topic_detail/215584212588541)
14 | 
15 | ## Features
16 | 
17 | - [x] Detect articles newly published by official accounts each day
18 | - [x] Crawl official account profiles
19 | - [x] Crawl article lists
20 | - [x] Crawl article content
21 | - [x] Crawl read, like, and comment counts
22 | - [x] Crawl comments
23 | - [x] Convert temporary article links to permanent links
24 | 
25 | Download link for the packaged executable:
26 | 
27 | Link: https://pan.baidu.com/s/1hyhj6YnV-L9w8LPx42FFzQ  Password: qnk6
28 | 
29 | ## Highlights
30 | 
31 | 1. **No installation**: supports Mac and Windows; double-click the executable to run
32 | 2. **Automated**: just configure the list of official accounts to monitor; once started, the software crawls account and article data automatically every day
33 | 3. **Easy integration**: crawled data is stored in MySQL, which makes downstream processing convenient
34 | 4. **No missed data**: task-state flags guarantee that no account and no article is skipped
35 | 5. **Distributed**: multiple WeChat accounts can crawl at the same time; the WeChat client can run on Android, iPhone, Mac, or Windows
36 | 
37 | ## Data Samples
38 | 
39 | **1. Official account data**
40 | ![-w829](media/15584541954959.jpg)
41 | 
42 | **2. Article list data**
43 | ![-w1369](media/15584542414888.jpg)
44 | 
45 | **3. Article data**
46 | ![-w1466](media/15584545518249.jpg)
47 | 
48 | **4. Read/like/comment count data**
49 | ![-w623](media/15584546784023.jpg)
50 | 
51 | **5. Comment data**
52 | ![-w1033](media/15584547028361.jpg)
53 | 
54 | ## Environment
55 | 
56 | 1. MySQL: stores the crawled data and the task tables
57 | 2. Redis: caches tasks, reducing the number of MySQL operations
58 | 
59 | ## Installation & Configuration
60 | 
61 | > Consult the following notes as needed; they are for reference only. Every environment differs, so installation details may vary — online resources can fill the gaps.
62 | 
63 | ### 1. Install MySQL
64 | #### 1.1 Windows
65 | #### 1.2 Mac
66 | ### 2. Install Redis
67 | #### 2.1 Windows
68 | #### 2.2 Mac
69 | ### 3. Install the certificate
70 | 
71 | With the proxy running, visit mitm.it in a browser and download the certificate, or search online for how to install the mitmproxy certificate.
72 | 
73 | #### 3.1 iPhone
74 | 1. After downloading and installing, don't forget the last step:
75 | 2. Open Settings → General → About → Certificate Trust Settings
76 | 3. Enable the mitmproxy option.
77 | 
78 | #### 3.2 Android
79 | 1. Verify after installing:
80 | 2. Open Settings → Security → Trusted credentials
81 | 3. Check that the installed certificate is listed
82 | 
83 | #### 3.3 Windows
84 | 1. Double-click the downloaded certificate to run the installer
85 | 2. Install it for the local machine
86 | 3. Skip the step that asks for a key
87 | 4. Choose "Place all certificates in the following store", then select "Trusted Root Certification Authorities"
88 | 5. Finally, click "Yes" in the warning dialog
89 | 
90 | #### 3.4 Mac
91 | 1. Double-click the downloaded certificate to install it
92 | 2. Open Keychain Access.app
93 | 3. Under login (Keychains) and Certificates (Category), find mitmproxy
94 | 4. Double-click mitmproxy and set Trust to Always Trust
95 | 
96 | 
97 | ### 4. Configure the proxy
98 | 
99 | > If you use a phone, make sure the phone and the computer running wechat-spider are connected to the same router.
100 | 
101 | #### 4.1 iPhone
102 | 
103 | Open Settings → Wi-Fi → the connected Wi-Fi → Configure Proxy → Manual.
104 | Enter the IP of the machine running the service and port 8080.
105 | 
106 | #### 4.2 Android
107 | 
108 | Open Settings → WLAN → long-press the connected network → Modify network → Advanced options → Manual.
109 | Enter the IP of the machine running the service and port 8080.
110 | 
111 | #### 4.3 Windows
112 | Open Chrome → Settings → Advanced:
113 | ![A580D0082CCEE0621F98FAF003C5530E](media/A580D0082CCEE0621F98FAF003C5530E.png)
114 | ![95AE10B3227FDE0637AB227A5A8267E3](media/95AE10B3227FDE0637AB227A5A8267E3.png)
115 | 
116 | #### 4.4 Mac
117 | 
118 | Open System Preferences.app → Network → Advanced → Proxies → Secure Web Proxy (HTTPS).
119 | Enter the IP of the machine running the service and port 8080.
120 | 
121 | ![-w668](media/15584581938431.jpg)
122 | ![-w667](media/15584582326072.jpg)
123 | 
124 | 
125 | 
126 | ## Usage
127 | 
128 | ### 1. Install the certificate and configure the proxy as described above
129 | ### 2. Configure config.yaml correctly
130 | 
131 | This mainly means the MySQL and Redis connection settings; make sure both can actually be reached.
132 | 
133 | ### 3. Create the database `wechat`
134 | 
135 | ![-w418](media/15610827417503.jpg)
136 | 
137 | 
138 | ### 4. Start wechat-spider
139 | 
140 | If auto_create_tables is true in the config, this step creates the MySQL tables automatically. Set it to true for the first start and back to false once the tables exist.
141 | 
142 | ### 5. Dispatch official-account tasks
143 | 
144 | ![-w201](media/15584578582622.jpg)
145 | Insert rows into wechat_account_task, for example:
146 | ![-w503](media/15584579051963.jpg)
147 | Filling in only __biz is enough, e.g.: MzIxNzg1ODQ0MQ==
148 | 
149 | ### 6. Open any official account and view its message history
150 | 
151 | ![-w637](media/15584585019970.jpg)
152 | 
153 | When the message in the red box above appears, everything is working; after a short while you can check the database for data.
154 | 
155 | Technical exchange
156 | ----
157 | If you have questions or suggestions, join the QQ group to discuss them. Please add the note `微信爬虫学习交流` when applying.
158 | 
159 | 
160 | 
161 | 
162 | ## FAQ
163 | 
164 | ### 1. MySQL connection problems
165 | 
166 | Problem: connecting raises an `object supporting the buffer api required` exception
167 | ![](media/15610832298058.jpg)
168 | Solution: if the password is numeric, e.g. 123456, wrap it in double quotes in the config file, like this:
169 | 
170 |     mysqldb:
171 |       ip: localhost
172 |       port: 3306
173 |       db: wechat
174 |       user: root
175 |       passwd: "123456"
176 |       auto_create_tables: true  # auto-create tables; set true while the tables don't exist, false once they do, to speed up startup
177 | 
178 | ### 2. Certificate or security warnings although the proxy is configured correctly
179 | 
180 | The bundled certificate has expired; see https://www.cnblogs.com/yunlongaimeng/p/9617708.html for how to install the certificate yourself.
181 | 
182 | ### 3. "No task" message
183 | 
184 | Check whether __biz values were dispatched into the wechat_account_task table. Dispatch a few more for testing.
185 | 
186 | ### 4. Exception: DISCARD without MULTI
187 | 
188 | ![-w406](media/15632498867519.jpg)
189 | 
190 | ### 5. No packets captured after a normal start
191 | 
192 | 1. Check that the proxy is configured
193 | 2. Check whether the port is already in use
194 | 
195 | ## Donations
196 | 
197 | Open source is hard work: maintaining the code and answering questions takes most of the time. To keep the project going — and **if this project happens to help you** — your support is much appreciated (* ̄︶ ̄).
198 | 
199 | Deployment support and Q&A are available (for donors and developers who submit PRs only).
200 | 
201 | WeChat: boris_tm
202 | 
203 | ![赞赏码](media/%E8%B5%9E%E8%B5%8F%E7%A0%81.png)
204 | 
205 | 
--------------------------------------------------------------------------------
/docker-compose.yml:
--------------------------------------------------------------------------------
1 | version: '3'
2 | services:
3 |   wechat-spider:
4 |     image: wechat-spider:latest
5 |     build: .
6 |     restart: always
7 |     privileged: true
8 |     command: ["wait-for-it", "mariadb:3306", "--", "python3", "./run.py"]
9 |     ports:
10 |       - '8080:8080'
11 |       - '8081:8081'
12 |     volumes:
13 |       - "./mitmproxy:/root/.mitmproxy"
14 |       - "./config:/app/wechat-spider/config"
15 |     depends_on:
16 |       - mariadb
17 |       - redis
18 |     links:
19 |       - redis
20 |       - mariadb
21 |   redis:
22 |     image: redis:3.2
23 |     restart: always
24 |   mariadb:
25 |     image: bitnami/mariadb:10.5-debian-10
26 |     ports:
27 |       - '3306:3306'
28 |     volumes:
29 |       - './mariadb_data:/bitnami/mariadb'
30 |     environment:
31 |       - MARIADB_ROOT_USER=root
32 |       - MARIADB_ROOT_PASSWORD=root
33 |       - MARIADB_CHARACTER_SET=utf8mb4
34 |       - MARIADB_COLLATE=utf8mb4_unicode_ci
35 |     healthcheck:
36 |       test: ['CMD', '/opt/bitnami/scripts/mariadb/healthcheck.sh']
37 |       interval: 15s
38 |       timeout: 5s
39 |       retries: 6
--------------------------------------------------------------------------------
/media/15584541954959.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15584541954959.jpg
--------------------------------------------------------------------------------
/media/15584542414888.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15584542414888.jpg
--------------------------------------------------------------------------------
/media/15584544774749.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15584544774749.jpg
--------------------------------------------------------------------------------
/media/15584545518249.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15584545518249.jpg -------------------------------------------------------------------------------- /media/15584546784023.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15584546784023.jpg -------------------------------------------------------------------------------- /media/15584547028361.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15584547028361.jpg -------------------------------------------------------------------------------- /media/15584578582622.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15584578582622.jpg -------------------------------------------------------------------------------- /media/15584579051963.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15584579051963.jpg -------------------------------------------------------------------------------- /media/15584581938431.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15584581938431.jpg -------------------------------------------------------------------------------- /media/15584582326072.jpg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15584582326072.jpg -------------------------------------------------------------------------------- /media/15584585019970.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15584585019970.jpg -------------------------------------------------------------------------------- /media/15610827417503.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15610827417503.jpg -------------------------------------------------------------------------------- /media/15610832298058.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15610832298058.jpg -------------------------------------------------------------------------------- /media/15632498867519.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/15632498867519.jpg -------------------------------------------------------------------------------- /media/95AE10B3227FDE0637AB227A5A8267E3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/95AE10B3227FDE0637AB227A5A8267E3.png -------------------------------------------------------------------------------- /media/A580D0082CCEE0621F98FAF003C5530E.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/A580D0082CCEE0621F98FAF003C5530E.png
--------------------------------------------------------------------------------
/media/赞赏码.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/striver-ing/wechat-spider/7702f8f06e49896d78bf38ed79392d19cef1c245/media/赞赏码.png
--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
1 | w3lib==1.22.0
2 | mitmproxy==7.0.3
3 | better_exceptions==0.2.2
4 | PyMySQL==0.10.0
5 | parsel==1.6.0
6 | six==1.15.0
7 | redis==2.10.6
8 | DBUtils==1.3
9 | PyYAML==5.4
--------------------------------------------------------------------------------
/wechat-spider/config.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | '''
3 | Created on 2019/5/18 11:54 AM
4 | ---------
5 | @summary: loads config/config.yaml and resolves the local IP
6 | ---------
7 | @author:
8 | '''
9 | import os
10 | import socket
11 | import sys
12 | import shutil
13 | 
14 | import yaml  # pip3 install pyyaml
15 | 
16 | if 'python' in sys.executable:
17 |     abs_path = lambda file: os.path.abspath(os.path.join(os.path.dirname(__file__), file))
18 | else:
19 |     # after packaging on mac, __file__ points to the user's home directory, not to the executable's directory
20 |     abs_path = lambda file: os.path.abspath(os.path.join(os.path.dirname(sys.executable), file))
21 | 
22 | if not os.path.exists('./config/config.yaml'):
23 |     os.makedirs('./config', exist_ok=True)  # make sure the directory exists, e.g. when running outside docker
24 |     shutil.copyfile('./config.yaml', './config/config.yaml')
25 | 
26 | config = yaml.full_load(open(abs_path('./config/config.yaml'), encoding='utf8'))
27 | 
28 | 
29 | def get_host_ip():
30 |     """
31 |     Resolve the local IP via UDP: create a UDP socket, "connect" it to a public
32 |     address, and read the local address back from the socket. No packet is
33 |     actually sent, so nothing shows up in a packet capture.
34 |     :return: the local IP
35 |     """
36 |     ip = '127.0.0.1'
37 |     s = None
38 |     try:
39 |         s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
40 |         s.connect(('8.8.8.8', 80))
41 |         ip = s.getsockname()[0]
42 |     finally:
43 |         if s:
44 |             s.close()
45 | 
46 |     return ip
47 | 
48 | 
49 | IP = get_host_ip()
--------------------------------------------------------------------------------
/wechat-spider/config.yaml:
--------------------------------------------------------------------------------
1 | mysqldb:
2 |   ip: "mariadb"
3 |   port: 3306
4 |   db: test
5 |   user: root
6 |   passwd: "root"
7 |   auto_create_tables: True  # auto-create tables; set true while the tables don't exist, false once they do, to speed up startup
8 | 
9 | redisdb:
10 |   ip: redis
11 |   port: 6379
12 |   db: 0
13 |   passwd:
14 | 
15 | spider:
16 |   monitor_interval: 3600  # interval, in seconds, between scans of an account for newly published articles
17 |   ignore_haved_crawl_today_article_account: true  # skip accounts whose article published today was already crawled, i.e. stop monitoring them for the rest of the day
18 |   redis_task_cache_root_key: wechat  # root key for tasks cached in redis, e.g. wechat:
19 |   zombie_account_not_publish_article_days: 90  # an account with no new article for 90 consecutive days is judged a zombie account and no longer monitored
20 |   spider_interval:
21 |     min_sleep_time: 5
22 |     max_sleep_time: 10
23 |     no_task_sleep_time: 3600  # sleep time when there is no task
24 |   service_port: 8080  # port of the service
25 |   # crawl_time_range: 2019-07-10 00:00:00~2019-07-01 00:00:00  # time range to crawl; "~2019-07-01 00:00:00" leaves the recent end open; leave unset to crawl the full history
26 |   crawl_time_range:
27 | 
28 | log:
29 |   level: INFO
30 |   to_file: false
31 |   log_path: ./logs/wechat_spider.log
32 | 
33 | mitm:
34 |   log_level: 0  # mitmproxy log verbosity, 0~3; higher values produce more detailed output, default is 1. See: https://docs.mitmproxy.org/stable/concepts-options/
--------------------------------------------------------------------------------
/wechat-spider/core/capture_packet.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | '''
3 | Created on 2019/5/8 10:59 PM
4 | ---------
5 | @summary: mitmproxy addon that extracts WeChat data from captured responses
6 | ---------
7 | @author:
8 | '''
9 | import re
10 | 
11 | import mitmproxy
12 | from mitmproxy import ctx
13 | 
14 | from core.deal_data import deal_data
15 | 
16 | 
17 | class WechatCapture():
18 | 
19 |     def response(self, flow: mitmproxy.http.HTTPFlow):
20 |         url = flow.request.url
21 | 
22 |         next_page = None
23 |         try:
24 |             if 'mp/profile_ext?action=home' in url or 'mp/profile_ext?action=getmsg' in url:  # article list, in both html and json form
25 | 
26 |                 ctx.log.info('extracting article list data')
27 |                 next_page = deal_data.deal_article_list(url, flow.response.text)
28 | 
29 |                 flow.response.text = re.sub('', '', flow.response.text)
30 | 
31 |             elif '/s?__biz=' in url or '/mp/appmsg/show?__biz=' in url or '/mp/rumor' in url:  # article content; mp/appmsg/show?__biz is the old 2014-era link format; mp/rumor marks disputed articles
32 | 
33 |                 ctx.log.info('extracting article content')
34 |                 next_page = deal_data.deal_article(url, flow.response.text)
35 | 
36 |                 # strip the security headers from the article response, otherwise the injected <script>setTimeout(function() {window.location.href = 'url';}, sleep_time);</script> does not execute
37 |                 flow.response.headers.pop('Content-Security-Policy', None)
38 |                 flow.response.headers.pop('content-security-policy-report-only', None)
39 |                 flow.response.headers.pop('Strict-Transport-Security', None)
40 | 
41 |                 # disable caching
42 |                 flow.response.headers['Cache-Control'] = 'no-cache, no-store, must-revalidate'
43 | 
44 |                 # strip images
45 |                 flow.response.text = re.sub('', '', flow.response.text)
46 | 
47 |             elif 'mp/getappmsgext' in url:  # read counts and view counts
48 | 
49 |                 ctx.log.info('extracting read/view counts')
50 |                 deal_data.deal_article_dynamic_info(flow.request.data.content.decode('utf-8'), flow.response.text)
51 | 
52 |             elif '/mp/appmsg_comment' in url:  # comment list
53 | 
54 |                 ctx.log.info('extracting comment list')
55 |                 deal_data.deal_comment(url, flow.response.text)
56 | 
57 |         except Exception as e:
58 |             # log.exception(e)
59 |             next_page = "Exception: {}".format(e)
60 | 
61 |         if next_page:
62 |             # change the response content-type from json to text
63 |             flow.response.headers['content-type'] = 'text/html; charset=UTF-8'
64 |             if 'window.location.reload()' in next_page:
65 |                 flow.response.set_text(next_page)
66 |             else:
67 |                 flow.response.set_text(next_page + flow.response.text)
68 | 
69 | 
70 | addons = [
71 |     WechatCapture(),
72 | ]
73 | 
74 | # run with: mitmdump -s capture_packet.py -p 8080
--------------------------------------------------------------------------------
/wechat-spider/core/data_pipeline.py:
-------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | ''' 3 | Created on 2019/5/13 12:44 AM 4 | --------- 5 | @summary: 6 | --------- 7 | @author: 8 | ''' 9 | from db.mysqldb import MysqlDB 10 | import utils.tools as tools 11 | from utils.log import log 12 | from config import config 13 | 14 | db = MysqlDB(**config.get('mysqldb')) 15 | 16 | 17 | def save_account(data): 18 | log.debug(tools.dumps_json(data)) 19 | 20 | sql = tools.make_insert_sql('wechat_account', data, insert_ignore=True) 21 | db.add(sql) 22 | 23 | 24 | def save_article_list(datas: list): 25 | log.debug(tools.dumps_json(datas)) 26 | 27 | sql, articles = tools.make_batch_sql('wechat_article_list', datas) 28 | db.add_batch(sql, articles) 29 | 30 | # 存文章任务 31 | article_task = [ 32 | { 33 | "sn": article.get('sn'), 34 | "article_url": article.get('url'), 35 | "__biz": article.get('__biz') 36 | } 37 | for article in datas 38 | ] 39 | 40 | sql, article_task = tools.make_batch_sql('wechat_article_task', article_task) 41 | db.add_batch(sql, article_task) 42 | 43 | 44 | def save_article(data): 45 | log.debug(tools.dumps_json(data)) 46 | 47 | sql = tools.make_insert_sql('wechat_article', data, insert_ignore=True) 48 | return db.add(sql) 49 | 50 | 51 | def save_article_dynamic(data): 52 | log.debug(tools.dumps_json(data)) 53 | 54 | sql = tools.make_insert_sql('wechat_article_dynamic', data, insert_ignore=True) 55 | db.add(sql) 56 | 57 | 58 | def save_article_commnet(datas: list): 59 | log.debug(tools.dumps_json(datas)) 60 | 61 | sql, datas = tools.make_batch_sql('wechat_article_comment', datas) 62 | db.add_batch(sql, datas) 63 | -------------------------------------------------------------------------------- /wechat-spider/core/deal_data.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on 2019/5/11 6:37 PM 4 | --------- 5 | @summary: 处理数据 6 | --------- 7 | @author: 8 | 
""" 9 | from utils.selector import Selector 10 | import utils.tools as tools 11 | from utils.log import log 12 | from core import data_pipeline 13 | from core.task_manager import TaskManager 14 | 15 | 16 | class DealData: 17 | def __init__(self): 18 | self._task_manager = TaskManager() 19 | self._task_manager.reset_task() 20 | 21 | def __parse_account_info(self, data, req_url): 22 | """ 23 | @summary: 24 | --------- 25 | @param data: 26 | --------- 27 | @result: 28 | """ 29 | __biz = tools.get_param(req_url, "__biz") 30 | 31 | regex = 'id="nickname">(.*?)' 32 | account = tools.get_info(data, regex, fetch_one=True).strip() 33 | 34 | regex = 'profile_avatar">.*? {monitor_interval} 76 | {publish_time_condition} 77 | ) 78 | OR (last_spider_time IS NULL) 79 | ) 80 | '''.format(monitor_interval=self._monitor_interval, publish_time_condition=publish_time_condition) 81 | 82 | tasks = self._mysqldb.find(sql, to_json=True) 83 | if tasks: 84 | self._redis.zadd(self._account_task_key, tasks) 85 | task = self.__get_task_from_redis(self._account_task_key) 86 | 87 | return task 88 | 89 | def get_article_task(self): 90 | """ 91 | 获取文章任务 92 | :return: 93 | {'article_url': 'http://mp.weixin.qq.com/s?__biz=MzIxNzg1ODQ0MQ==&mid=2247485501&idx=1&sn=92721338ddbf7d907eaf03a70a0715bd&chksm=97f220dba085a9cd2b9a922fb174c767603203d6dbd2a7d3a6dc41b3400a0c477a8d62b96396&scene=27#wechat_redirect'} 94 | 或 95 | None 96 | """ 97 | task = self.__get_task_from_redis(self._article_task_key) 98 | if not task: 99 | sql = 'select id, article_url from wechat_article_task where state = 0 limit 5000' 100 | tasks = self._mysqldb.find(sql) 101 | if tasks: 102 | # 更新任务状态 103 | task_ids = str(tuple([task[0] for task in tasks])).replace(',)', ')') 104 | sql = 'update wechat_article_task set state = 2 where id in %s' % (task_ids) 105 | self._mysqldb.update(sql) 106 | 107 | else: 108 | sql = 'select id, article_url from wechat_article_task where state = 2 limit 5000' 109 | tasks = self._mysqldb.find(sql) 110 | 
111 | if tasks: 112 | task_json = [ 113 | { 114 | 'article_url': article_url 115 | } 116 | for id, article_url in tasks 117 | ] 118 | self._redis.zadd(self._article_task_key, task_json) 119 | task = self.__get_task_from_redis(self._article_task_key) 120 | 121 | return task 122 | 123 | def update_article_task_state(self, sn, state=1): 124 | sql = 'update wechat_article_task set state = %s where sn = "%s"' % (state, sn) 125 | self._mysqldb.update(sql) 126 | 127 | def record_last_article_publish_time(self, __biz, last_publish_time): 128 | self._redis.hset(self._last_article_publish_time, __biz, last_publish_time or '') 129 | 130 | def is_reach_last_article_publish_time(self, __biz, publish_time): 131 | last_publish_time = self._redis.hget(self._last_article_publish_time, __biz) 132 | if not last_publish_time: 133 | # 查询mysql里是否有该任务 134 | sql = "select last_publish_time from wechat_account_task where __biz = '%s'" % __biz 135 | data = self._mysqldb.find(sql) 136 | if data: # [(None,)] / [] 137 | last_publish_time = str(data[0][0] or '') 138 | self.record_last_article_publish_time(__biz, last_publish_time) 139 | 140 | if last_publish_time is None: 141 | return 142 | 143 | if publish_time < last_publish_time: 144 | return True 145 | 146 | return False 147 | 148 | def is_in_crawl_time_range(self, publish_time): 149 | """ 150 | 是否在时间范围 151 | :param publish_time: 152 | :return: 是否达时间范围 153 | """ 154 | if not publish_time or (not self._crawl_time_range[0] and not self._crawl_time_range[1]): 155 | return TaskManager.IS_IN_TIME_RANGE 156 | 157 | if self._crawl_time_range[0]: # 时间范围 上限 158 | if publish_time > self._crawl_time_range[0]: 159 | return TaskManager.NOT_REACH_TIME_RANGE 160 | 161 | if publish_time <= self._crawl_time_range[0] and publish_time >= self._crawl_time_range[1]: 162 | return TaskManager.IS_IN_TIME_RANGE 163 | 164 | if publish_time < self._crawl_time_range[1]: # 下限 165 | return TaskManager.OVER_MIN_TIME_RANGE 166 | 167 | return TaskManager.IS_IN_TIME_RANGE 
168 | 169 | def record_new_last_article_publish_time(self, __biz, new_last_publish_time): 170 | self._redis.hset(self._new_last_article_publish_time, __biz, new_last_publish_time) 171 | 172 | def get_new_last_article_publish_time(self, __biz): 173 | return self._redis.hget(self._new_last_article_publish_time, __biz) 174 | 175 | def update_account_last_publish_time(self, __biz, last_publish_time): 176 | sql = 'update wechat_account_task set last_publish_time = "{}", last_spider_time="{}" where __biz="{}"'.format( 177 | last_publish_time, tools.get_current_date(), __biz 178 | ) 179 | self._mysqldb.update(sql) 180 | 181 | def is_zombie_account(self, last_publish_timestamp): 182 | if tools.get_current_timestamp() - last_publish_timestamp > self._zombie_account_not_publish_article_days * 86400: 183 | return True 184 | return False 185 | 186 | def sign_account_is_zombie(self, __biz, last_publish_time=None): 187 | if last_publish_time: 188 | sql = 'update wechat_account_task set last_publish_time = "{}", last_spider_time="{}", is_zombie=1 where __biz="{}"'.format( 189 | last_publish_time, tools.get_current_date(), __biz 190 | ) 191 | else: 192 | sql = 'update wechat_account_task set last_spider_time="{}", is_zombie=1 where __biz="{}"'.format( 193 | tools.get_current_date(), __biz 194 | ) 195 | 196 | self._mysqldb.update(sql) 197 | 198 | def get_task(self, url=None, tip=''): 199 | """ 200 | 获取任务 201 | :param url: 指定url时,返回该url包装后的任务。否则先取公众号任务,无则取文章任务。若均无任务,则休眠一段时间之后再取 202 | :return: 203 | """ 204 | 205 | sleep_time = random.randint(self._spider_interval_min, self._spider_interval_max) 206 | 207 | if not url: 208 | account_task = self.get_account_task() 209 | if account_task: 210 | __biz = account_task.get('__biz') 211 | last_publish_time = account_task.get('last_publish_time') 212 | self.record_last_article_publish_time(__biz, last_publish_time) 213 | tip = '正在抓取列表' 214 | url = 
'https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz={}&scene=124#wechat_redirect'.format(__biz) 215 | else: 216 | article_task = self.get_article_task() 217 | if article_task: 218 | tip = '正在抓取详情' 219 | url = article_task.get('article_url') 220 | else: 221 | sleep_time = config.get('spider').get('no_task_sleep_time') 222 | log.info('暂无任务 休眠 {}s'.format(sleep_time)) 223 | tip = '暂无任务 ' 224 | 225 | if url: 226 | next_page = "{tip} 休眠 {sleep_time}s 下次刷新时间 {begin_spider_time} ".format( 227 | tip=tip and tip + ' ', sleep_time=sleep_time, begin_spider_time=tools.timestamp_to_date(tools.get_current_timestamp() + sleep_time), url=url, sleep_time_msec=sleep_time * 1000 228 | ) 229 | else: 230 | next_page = "{tip} 休眠 {sleep_time}s 下次刷新时间 {begin_spider_time} ".format( 231 | tip=tip and tip + ' ', sleep_time=sleep_time, begin_spider_time=tools.timestamp_to_date(tools.get_current_timestamp() + sleep_time), sleep_time_msec=sleep_time * 1000 232 | ) 233 | 234 | return next_page 235 | 236 | def reset_task(self): 237 | # 清除redis缓存 238 | keys = self._task_root_key + "*" 239 | keys = self._redis.getkeys(keys) 240 | if keys: 241 | for key in keys: 242 | self._redis.clear(key) 243 | 244 | # 重设任务 245 | sql = "update wechat_article_task set state = 0 where state = 2" 246 | self._mysqldb.update(sql) 247 | 248 | 249 | if __name__ == '__main__': 250 | task_manager = TaskManager() 251 | 252 | result = task_manager.get_task() 253 | print(result) 254 | -------------------------------------------------------------------------------- /wechat-spider/create_tables.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | ''' 3 | Created on 2019/5/20 11:47 PM 4 | --------- 5 | @summary: 6 | --------- 7 | @author: 8 | ''' 9 | 10 | from db.mysqldb import MysqlDB 11 | from config import config 12 | 13 | def _create_database(mysqldb, dbname): 14 | mysqldb.execute("CREATE DATABASE IF NOT EXISTS `%s`;", (dbname)) 15 | 16 | def 
_create_table(mysqldb, sql): 17 | mysqldb.execute(sql) 18 | 19 | 20 | def create_table(): 21 | wechat_article_list_table = ''' 22 | CREATE TABLE IF NOT EXISTS `wechat_article_list` ( 23 | `id` int(11) unsigned NOT NULL AUTO_INCREMENT, 24 | `title` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 25 | `digest` varchar(2000) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 26 | `url` varchar(500) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 27 | `source_url` varchar(1000) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 28 | `cover` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 29 | `subtype` int(11) DEFAULT NULL, 30 | `is_multi` int(11) DEFAULT NULL, 31 | `author` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 32 | `copyright_stat` int(11) DEFAULT NULL, 33 | `duration` int(11) DEFAULT NULL, 34 | `del_flag` int(11) DEFAULT NULL, 35 | `type` int(11) DEFAULT NULL, 36 | `publish_time` datetime DEFAULT NULL, 37 | `sn` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 38 | `spider_time` datetime DEFAULT NULL, 39 | `__biz` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 40 | PRIMARY KEY (`id`), 41 | UNIQUE KEY `sn` (`sn`) 42 | ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=DYNAMIC 43 | ''' 44 | 45 | wechat_article_task_table = ''' 46 | CREATE TABLE IF NOT EXISTS `wechat_article_task` ( 47 | `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT, 48 | `sn` varchar(50) DEFAULT NULL, 49 | `article_url` varchar(255) DEFAULT NULL, 50 | `state` int(11) DEFAULT '0' COMMENT '文章抓取状态,0 待抓取 2 抓取中 1 抓取完毕 -1 抓取失败', 51 | `__biz` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 52 | PRIMARY KEY (`id`), 53 | UNIQUE KEY `sn` (`sn`) USING BTREE, 54 | KEY `state` (`state`) USING BTREE 55 | ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci 56 | ''' 57 | 58 | wechat_article_dynamic_table = ''' 59 | CREATE TABLE IF NOT EXISTS `wechat_article_dynamic` ( 60 | `id` int(10) unsigned NOT NULL AUTO_INCREMENT, 61 | 
`sn` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 62 | `read_num` int(11) DEFAULT NULL, 63 | `like_num` int(11) DEFAULT NULL, 64 | `comment_count` int(11) DEFAULT NULL, 65 | `spider_time` datetime DEFAULT NULL, 66 | `__biz` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 67 | PRIMARY KEY (`id`), 68 | UNIQUE KEY `sn` (`sn`) 69 | ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=DYNAMIC 70 | ''' 71 | 72 | wechat_article_comment_table = ''' 73 | CREATE TABLE IF NOT EXISTS `wechat_article_comment` ( 74 | `id` int(10) unsigned NOT NULL AUTO_INCREMENT, 75 | `comment_id` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL COMMENT '与文章关联', 76 | `nick_name` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 77 | `logo_url` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 78 | `content` varchar(2000) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 79 | `create_time` datetime DEFAULT NULL, 80 | `content_id` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL COMMENT '本条评论内容的id', 81 | `like_num` int(11) DEFAULT NULL, 82 | `is_top` int(11) DEFAULT NULL, 83 | `spider_time` datetime DEFAULT NULL, 84 | `__biz` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 85 | PRIMARY KEY (`id`), 86 | UNIQUE KEY `content_id` (`content_id`), 87 | KEY `comment_id` (`comment_id`) 88 | ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=DYNAMIC 89 | ''' 90 | 91 | wechat_article_table = ''' 92 | CREATE TABLE IF NOT EXISTS `wechat_article` ( 93 | `id` int(10) unsigned NOT NULL AUTO_INCREMENT, 94 | `account` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 95 | `title` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 96 | `url` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 97 | `author` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 98 | `publish_time` datetime DEFAULT NULL, 99 | `__biz` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 100 | `digest` varchar(255) 
COLLATE utf8mb4_unicode_ci DEFAULT NULL, 101 | `cover` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 102 | `pics_url` text CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci, 103 | `content_html` text COLLATE utf8mb4_unicode_ci, 104 | `source_url` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 105 | `comment_id` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 106 | `sn` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 107 | `spider_time` datetime DEFAULT NULL, 108 | PRIMARY KEY (`id`), 109 | UNIQUE KEY `sn` (`sn`), 110 | KEY `__biz` (`__biz`) 111 | ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=DYNAMIC 112 | ''' 113 | 114 | wechat_account_task_table = ''' 115 | CREATE TABLE IF NOT EXISTS `wechat_account_task` ( 116 | `id` int(11) unsigned NOT NULL AUTO_INCREMENT, 117 | `__biz` varchar(50) DEFAULT NULL, 118 | `last_publish_time` datetime DEFAULT NULL COMMENT '上次抓取到的文章发布时间,做文章增量采集用', 119 | `last_spider_time` datetime DEFAULT NULL COMMENT '上次抓取时间,用于同一个公众号每隔一段时间扫描一次', 120 | `is_zombie` int(11) DEFAULT '0' COMMENT '僵尸号 默认3个月未发布内容为僵尸号,不再检测', 121 | PRIMARY KEY (`id`) 122 | ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci 123 | ''' 124 | 125 | wechat_account_table = ''' 126 | CREATE TABLE IF NOT EXISTS `wechat_account` ( 127 | `id` int(10) unsigned NOT NULL AUTO_INCREMENT, 128 | `__biz` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 129 | `account` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 130 | `head_url` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 131 | `summary` varchar(500) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 132 | `qr_code` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 133 | `verify` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL, 134 | `spider_time` datetime DEFAULT NULL, 135 | PRIMARY KEY (`id`), 136 | UNIQUE KEY `__biz` (`__biz`) 137 | ) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4 
COLLATE=utf8mb4_unicode_ci ROW_FORMAT=DYNAMIC 138 | ''' 139 | 140 | if config.get('mysqldb').get('auto_create_tables'): 141 | mysqldb = MysqlDB(**config.get('mysqldb')) 142 | # _create_database(mysqldb, config.get('mysqldb').get('db')) 143 | _create_table(mysqldb, wechat_article_list_table) 144 | _create_table(mysqldb, wechat_article_task_table) 145 | _create_table(mysqldb, wechat_article_dynamic_table) 146 | _create_table(mysqldb, wechat_article_comment_table) 147 | _create_table(mysqldb, wechat_article_table) 148 | _create_table(mysqldb, wechat_account_task_table) 149 | _create_table(mysqldb, wechat_account_table) 150 | -------------------------------------------------------------------------------- /wechat-spider/db/mysqldb.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | ''' 3 | Created on 2016-11-16 16:25 4 | --------- 5 | @summary: 操作mysql数据库 6 | --------- 7 | @author: Boris 8 | ''' 9 | import datetime 10 | import json 11 | 12 | import pymysql 13 | from DBUtils.PooledDB import PooledDB 14 | from pymysql import cursors 15 | from pymysql import err 16 | 17 | from utils.log import log 18 | 19 | 20 | def auto_retry(func): 21 | def wrapper(*args, **kwargs): 22 | for i in range(3): 23 | try: 24 | return func(*args, **kwargs) 25 | except (err.InterfaceError, err.OperationalError, err.ProgrammingError) as e: 26 | log.error(''' 27 | error:%s 28 | sql: %s 29 | ''' % (e, kwargs.get('sql') or args[1])) 30 | 31 | return wrapper 32 | 33 | 34 | class MysqlDB(): 35 | 36 | def __init__(self, ip=None, port=None, db=None, user=None, passwd=None, **kwargs): 37 | # 可能会改setting中的值,所以此处不能直接赋值为默认值,需要后加载赋值 38 | try: 39 | self.connect_pool = PooledDB(creator=pymysql, mincached=1, maxcached=100, maxconnections=100, blocking=True, ping=7, 40 | host=ip, port=port, user=user, passwd=passwd, db=db, charset='utf8mb4', cursorclass=cursors.SSCursor) # cursorclass 使用服务端游标,默认的游标在多线程下大批量插入数据会使内存递增 41 | except Exception as e: 42 |
input(''' 43 | ****************************************** 44 | 未连接到mysql数据库, 45 | 您当前的连接信息为: 46 | ip = {} 47 | port = {} 48 | db = {} 49 | user = {} 50 | passwd = {} 51 | 请参考教程正确安装配置mysql,然后重启本程序 52 | Exception: {}'''.format(ip, port, db, user, passwd, str(e)) 53 | ) 54 | import sys 55 | sys.exit() 56 | 57 | def get_connection(self): 58 | conn = self.connect_pool.connection(shareable=False) 59 | # cursor = conn.cursor(cursors.SSCursor) 60 | cursor = conn.cursor() 61 | 62 | return conn, cursor 63 | 64 | def close_connection(self, conn, cursor): 65 | cursor.close() 66 | conn.close() 67 | 68 | def size_of_connections(self): 69 | ''' 70 | 当前活跃的连接数 71 | @return: 72 | ''' 73 | return self.connect_pool._connections 74 | 75 | def size_of_connect_pool(self): 76 | ''' 77 | 池子里一共有多少连接 78 | @return: 79 | ''' 80 | return len(self.connect_pool._idle_cache) 81 | 82 | @auto_retry 83 | def find(self, sql, limit=0, to_json=False, cursor=None): 84 | ''' 85 | @summary: 86 | 无数据: 返回() 87 | 有数据: 若limit == 1 则返回 (data1, data2) 88 | 否则返回 ((data1, data2),) 89 | --------- 90 | @param sql: 91 | @param limit: 92 | --------- 93 | @result: 94 | ''' 95 | conn, cursor = self.get_connection() 96 | 97 | cursor.execute(sql) 98 | 99 | if limit == 1: 100 | result = cursor.fetchone() # 全部查出来,截取 不推荐使用 101 | elif limit > 1: 102 | result = cursor.fetchmany(limit) # 全部查出来,截取 不推荐使用 103 | else: 104 | result = cursor.fetchall() 105 | 106 | if to_json: 107 | columns = [i[0] for i in cursor.description] 108 | 109 | # 处理时间类型 110 | def fix_lob(row): 111 | def convert(col): 112 | if isinstance(col, (datetime.date, datetime.time)): 113 | return str(col) 114 | elif isinstance(col, str) and (col.startswith('{') or col.startswith('[')): 115 | try: 116 | return json.loads(col) 117 | except: 118 | return col 119 | else: 120 | return col 121 | 122 | return [convert(c) for c in row] 123 | 124 | result = [fix_lob(row) for row in result] 125 | result = [dict(zip(columns, r)) for r in result] 126 | 127 |
self.close_connection(conn, cursor) 128 | 129 | return result 130 | 131 | def add(self, sql, exception_callfunc=''): 132 | affect_count = None 133 | 134 | conn, cursor = self.get_connection()  # 先取连接,避免获取连接失败时 finally 中 conn/cursor 未定义 135 | try: 136 | affect_count = cursor.execute(sql) 137 | conn.commit() 138 | 139 | except Exception as e: 140 | log.error(''' 141 | error:%s 142 | sql: %s 143 | ''' % (e, sql)) 144 | if exception_callfunc: 145 | exception_callfunc(e) 146 | finally: 147 | self.close_connection(conn, cursor) 148 | 149 | return affect_count 150 | 151 | def add_batch(self, sql, datas): 152 | ''' 153 | @summary: 154 | --------- 155 | @param sql: insert ignore into (xxx,xxx) values (%s, %s, %s) 156 | @param datas: [[..], [...]] 157 | --------- 158 | @result: 159 | ''' 160 | affect_count = None 161 | 162 | conn, cursor = self.get_connection() 163 | try: 164 | affect_count = cursor.executemany(sql, datas) 165 | conn.commit() 166 | 167 | except Exception as e: 168 | log.error(''' 169 | error:%s 170 | sql: %s 171 | ''' % (e, sql)) 172 | finally: 173 | self.close_connection(conn, cursor) 174 | 175 | return affect_count 176 | 177 | def update(self, sql): 178 | conn, cursor = self.get_connection() 179 | try: 180 | cursor.execute(sql) 181 | conn.commit() 182 | 183 | except Exception as e: 184 | log.error(''' 185 | error:%s 186 | sql: %s 187 | ''' % (e, sql)) 188 | return False 189 | else: 190 | return True 191 | finally: 192 | self.close_connection(conn, cursor) 193 | 194 | def delete(self, sql): 195 | conn, cursor = self.get_connection() 196 | try: 197 | cursor.execute(sql) 198 | conn.commit() 199 | 200 | except Exception as e: 201 | log.error(''' 202 | error:%s 203 | sql: %s 204 | ''' % (e, sql)) 205 | return False 206 | else: 207 | return True 208 | finally: 209 | self.close_connection(conn, cursor) 210 | 211 | def execute(self, sql): 212 | conn, cursor = self.get_connection() 213 | try: 214 | cursor.execute(sql) 215 | conn.commit() 216 | 217 | except Exception as e: 218 | log.error(''' 219 | error:%s
220 | sql: %s 221 | ''' % (e, sql)) 222 | return False 223 | else: 224 | return True 225 | finally: 226 | self.close_connection(conn, cursor) 227 | 228 | def set_unique_key(self, table, key): 229 | sql = 'alter table %s add unique (%s)' % (table, key) 230 | 231 | conn, cursor = self.get_connection() 232 | try: 233 | cursor.execute(sql) 234 | conn.commit() 235 | 236 | except Exception as e: 237 | log.error(table + ' ' + str(e) + ' key = ' + key) 238 | return False 239 | else: 240 | log.debug('%s表创建唯一索引成功 索引为 %s' % (table, key)) 241 | return True 242 | finally: 243 | self.close_connection(conn, cursor) 244 | 245 | 246 | if __name__ == '__main__': 247 | db = MysqlDB() 248 | sql = "select is_done from qiancheng_job_list_batch_record where id = 3" 249 | 250 | data = db.find(sql) 251 | print(data) 252 | -------------------------------------------------------------------------------- /wechat-spider/db/redisdb.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | ''' 3 | Created on 2016-11-16 16:25 4 | --------- 5 | @summary: 操作redis数据库 6 | --------- 7 | @author: Boris 8 | ''' 9 | 10 | from utils.log import log 11 | import redis 12 | 13 | 14 | # setting.REDISDB_DB = 0 15 | # setting.REDISDB_IP_PORTS = 'localhost:6379' 16 | # setting.REDISDB_USER_PASS = None 17 | 18 | 19 | class RedisDB(): 20 | 21 | def __init__(self, ip=None, port=None, db=None, passwd=None, decode_responses=True): 22 | self._is_redis_cluster = False 23 | try: 24 | self._redis = redis.Redis(host=ip, port=port, db=db, password=passwd, decode_responses=decode_responses) # redis默认端口是6379 25 | self._redis.ping() 26 | except Exception as e: 27 | input(''' 28 | ****************************************** 29 | 未连接到redis数据库, 30 | 您当前的连接信息为: 31 | ip = {} 32 | port = {} 33 | db = {} 34 | passwd = {} 35 | 请参考教程正确安装配置redis,然后重启本程序 36 | Exception: {}'''.format(ip, port, db, passwd, str(e)) 37 | ) 38 | import sys 39 | sys.exit() 40 | 41 | def sadd(self,
table, values): 42 | ''' 43 | @summary: 使用无序set集合存储数据, 去重 44 | --------- 45 | @param table: 46 | @param values: 值; 支持list 或 单个值 47 | --------- 48 | @result: 若库中存在 返回0,否则入库,返回1。 批量添加返回None 49 | ''' 50 | 51 | if isinstance(values, list): 52 | pipe = self._redis.pipeline(transaction=True) # redis-py默认在执行每次请求都会创建(连接池申请连接)和断开(归还连接池)一次连接操作,如果想要在一次请求中指定多个命令,则可以使用pipline实现一次请求指定多个命令,并且默认情况下一次pipline 是原子性操作。 53 | 54 | if not self._is_redis_cluster: 55 | pipe.multi() 56 | for value in values: 57 | pipe.sadd(table, value) 58 | pipe.execute() 59 | 60 | else: 61 | return self._redis.sadd(table, values) 62 | 63 | def sget(self, table, count=1, is_pop=True): 64 | ''' 65 | 返回 list 如 ['1'] 或 [] 66 | @param table: 67 | @param count: 68 | @param is_pop: 69 | @return: 70 | ''' 71 | datas = [] 72 | if is_pop: 73 | count = count if count <= self.sget_count(table) else self.sget_count(table) 74 | if count: 75 | if count > 1: 76 | pipe = self._redis.pipeline(transaction=True) # redis-py默认在执行每次请求都会创建(连接池申请连接)和断开(归还连接池)一次连接操作,如果想要在一次请求中指定多个命令,则可以使用pipline实现一次请求指定多个命令,并且默认情况下一次pipline 是原子性操作。 77 | 78 | if not self._is_redis_cluster: 79 | pipe.multi() 80 | while count: 81 | pipe.spop(table) 82 | count -= 1 83 | datas = pipe.execute() 84 | 85 | else: 86 | datas.append(self._redis.spop(table)) 87 | 88 | else: 89 | datas = self._redis.srandmember(table, count) 90 | 91 | return datas 92 | 93 | def srem(self, table, values): 94 | ''' 95 | @summary: 移除集合中的指定元素 96 | --------- 97 | @param table: 98 | @param values: 一个或者列表 99 | --------- 100 | @result: 101 | ''' 102 | if isinstance(values, list): 103 | pipe = self._redis.pipeline(transaction=True) # redis-py默认在执行每次请求都会创建(连接池申请连接)和断开(归还连接池)一次连接操作,如果想要在一次请求中指定多个命令,则可以使用pipline实现一次请求指定多个命令,并且默认情况下一次pipline 是原子性操作。 104 | 105 | if not self._is_redis_cluster: 106 | pipe.multi() 107 | for value in values: 108 | pipe.srem(table, value) 109 | pipe.execute() 110 | else: 111 | self._redis.srem(table, values) 112 | 113 | def sget_count(self, table): 114 | return 
self._redis.scard(table) 115 | 116 | def sdelete(self, table): 117 | ''' 118 | @summary: 删除set集合的大键(数据量大的表) 119 | 删除大set键,使用sscan命令,每次扫描集合中500个元素,再用srem命令每次删除一个键 120 | 若直接用delete命令,会导致Redis阻塞,出现故障切换和应用程序崩溃的故障。 121 | --------- 122 | @param table: 123 | --------- 124 | @result: 125 | ''' 126 | # 当 SCAN 命令的游标参数被设置为 0 时, 服务器将开始一次新的迭代, 而当服务器向用户返回值为 0 的游标时, 表示迭代已结束 127 | cursor = '0' 128 | while cursor != 0: 129 | cursor, data = self._redis.sscan(table, cursor=cursor, count=500) 130 | for item in data: 131 | # pipe.srem(table, item) 132 | self._redis.srem(table, item) 133 | 134 | # pipe.execute() 135 | 136 | def zadd(self, table, values, prioritys=0): 137 | ''' 138 | @summary: 使用有序set集合存储数据, 去重(值存在更新) 139 | --------- 140 | @param table: 141 | @param values: 值; 支持list 或 单个值 142 | @param prioritys: 优先级; double类型,支持list 或 单个值。 根据此字段的值来排序, 值越小越优先。 可不传值,默认value的优先级为0 143 | --------- 144 | @result:若库中存在 返回0,否则入库,返回1。 批量添加返回 [0, 1 ...] 145 | ''' 146 | if isinstance(values, list): 147 | if not isinstance(prioritys, list): 148 | prioritys = [prioritys] * len(values) 149 | else: 150 | assert len(values) == len(prioritys), 'values值要与prioritys值一一对应' 151 | 152 | pipe = self._redis.pipeline(transaction=True) 153 | 154 | if not self._is_redis_cluster: 155 | pipe.multi() 156 | for value, priority in zip(values, prioritys): 157 | if self._is_redis_cluster: 158 | pipe.zadd(table, priority, value) 159 | else: 160 | pipe.zadd(table, value, priority) 161 | return pipe.execute() 162 | 163 | else: 164 | if self._is_redis_cluster: 165 | return self._redis.zadd(table, prioritys, values) 166 | else: 167 | return self._redis.zadd(table, values, prioritys) 168 | 169 | def zget(self, table, count=1, is_pop=True): 170 | ''' 171 | @summary: 从有序set集合中获取数据 优先返回分数小的(优先级高的) 172 | --------- 173 | @param table: 174 | @param count: 数量 -1 返回全部数据 175 | @param is_pop:获取数据后,是否在原set集合中删除,默认是 176 | --------- 177 | @result: 列表 178 | ''' 179 | start_pos = 0 # 包含 180 | end_pos = count - 1 if count > 0 else count 181 
| 182 | pipe = self._redis.pipeline(transaction=True) # redis-py默认在执行每次请求都会创建(连接池申请连接)和断开(归还连接池)一次连接操作,如果想要在一次请求中指定多个命令,则可以使用pipline实现一次请求指定多个命令,并且默认情况下一次pipline 是原子性操作。 183 | 184 | if not self._is_redis_cluster: 185 | pipe.multi() # 标记事务的开始 参考 http://www.runoob.com/redis/redis-transactions.html 186 | pipe.zrange(table, start_pos, end_pos) # 取值 187 | if is_pop: 188 | pipe.zremrangebyrank(table, start_pos, end_pos) # 删除 189 | results, *count = pipe.execute() 190 | return results 191 | 192 | def zremrangebyscore(self, table, priority_min, priority_max): 193 | ''' 194 | 根据分数移除成员 闭区间 195 | @param table: 196 | @param priority_min: 197 | @param priority_max: 198 | @return: 被移除的成员个数 199 | ''' 200 | return self._redis.zremrangebyscore(table, priority_min, priority_max) 201 | 202 | def zrangebyscore(self, table, priority_min, priority_max, count=None, is_pop=True): 203 | ''' 204 | @summary: 返回指定分数区间的数据 闭区间 205 | --------- 206 | @param table: 207 | @param priority_min: 优先级越小越优先 208 | @param priority_max: 209 | @param count: 获取的数量,为空则表示分数区间内的全部数据 210 | @param is_pop: 是否删除 211 | --------- 212 | @result: 213 | ''' 214 | 215 | # 使用lua脚本, 保证操作的原子性 216 | lua = ''' 217 | local key = KEYS[1] 218 | local min_score = ARGV[2] 219 | local max_score = ARGV[3] 220 | local is_pop = ARGV[4] 221 | local count = ARGV[5] 222 | 223 | -- 取值 224 | local datas = nil 225 | if count then 226 | datas = redis.call('zrangebyscore', key, min_score, max_score, 'limit', 0, count) 227 | else 228 | datas = redis.call('zrangebyscore', key, min_score, max_score) 229 | end 230 | 231 | -- 删除redis中刚取到的值 232 | if (is_pop) then 233 | for i=1, #datas do 234 | redis.call('zrem', key, datas[i]) 235 | end 236 | end 237 | 238 | 239 | return datas 240 | 241 | ''' 242 | cmd = self._redis.register_script(lua) 243 | if count: 244 | res = cmd(keys=[table], args=[table, priority_min, priority_max, is_pop, count]) 245 | else: 246 | res = cmd(keys=[table], args=[table, priority_min, priority_max, is_pop]) 247 | 248 | return 
res 249 | 250 | def zrangebyscore_increase_score(self, table, priority_min, priority_max, increase_score, count=None): 251 | ''' 252 | @summary: 返回指定分数区间的数据 闭区间, 同时修改分数 253 | --------- 254 | @param table: 255 | @param priority_min: 最小分数 256 | @param priority_max: 最大分数 257 | @param increase_score: 分数值增量 正数则在原有的分数上叠加,负数则相减 258 | @param count: 获取的数量,为空则表示分数区间内的全部数据 259 | --------- 260 | @result: 261 | ''' 262 | 263 | # 使用lua脚本, 保证操作的原子性 264 | lua = ''' 265 | local key = KEYS[1] 266 | local min_score = ARGV[1] 267 | local max_score = ARGV[2] 268 | local increase_score = ARGV[3] 269 | local count = ARGV[4] 270 | 271 | -- 取值 272 | local datas = nil 273 | if count then 274 | datas = redis.call('zrangebyscore', key, min_score, max_score, 'limit', 0, count) 275 | else 276 | datas = redis.call('zrangebyscore', key, min_score, max_score) 277 | end 278 | 279 | --修改优先级 280 | for i=1, #datas do 281 | redis.call('zincrby', key, increase_score, datas[i]) 282 | end 283 | 284 | return datas 285 | 286 | ''' 287 | cmd = self._redis.register_script(lua) 288 | if count: 289 | res = cmd(keys=[table], args=[priority_min, priority_max, increase_score, count]) 290 | else: 291 | res = cmd(keys=[table], args=[priority_min, priority_max, increase_score]) 292 | 293 | return res 294 | 295 | def zrangebyscore_set_score(self, table, priority_min, priority_max, score, count=None): 296 | ''' 297 | @summary: 返回指定分数区间的数据 闭区间, 同时修改分数 298 | --------- 299 | @param table: 300 | @param priority_min: 最小分数 301 | @param priority_max: 最大分数 302 | @param score: 分数值 303 | @param count: 获取的数量,为空则表示分数区间内的全部数据 304 | --------- 305 | @result: 306 | ''' 307 | 308 | # 使用lua脚本, 保证操作的原子性 309 | lua = ''' 310 | local key = KEYS[1] 311 | local min_score = ARGV[1] 312 | local max_score = ARGV[2] 313 | local set_score = ARGV[3] 314 | local count = ARGV[4] 315 | 316 | -- 取值 317 | local datas = nil 318 | if count then 319 | datas = redis.call('zrangebyscore', key, min_score, max_score, 'withscores','limit', 0, count) 320 | else 
321 | datas = redis.call('zrangebyscore', key, min_score, max_score, 'withscores') 322 | end 323 | 324 | local real_datas = {} -- 数据 325 | --修改优先级 326 | for i=1, #datas, 2 do 327 | local data = datas[i] 328 | local score = datas[i+1] 329 | 330 | table.insert(real_datas, data) -- 添加数据 331 | 332 | redis.call('zincrby', key, set_score - score, datas[i]) 333 | end 334 | 335 | return real_datas 336 | 337 | ''' 338 | cmd = self._redis.register_script(lua) 339 | if count: 340 | res = cmd(keys=[table], args=[priority_min, priority_max, score, count]) 341 | else: 342 | res = cmd(keys=[table], args=[priority_min, priority_max, score]) 343 | 344 | return res 345 | 346 | def zget_count(self, table, priority_min=None, priority_max=None): 347 | ''' 348 | @summary: 获取表数据的数量 349 | --------- 350 | @param table: 351 | @param priority_min:优先级范围 最小值(包含) 352 | @param priority_max:优先级范围 最大值(包含) 353 | --------- 354 | @result: 355 | ''' 356 | 357 | if priority_min != None and priority_max != None: 358 | return self._redis.zcount(table, priority_min, priority_max) 359 | else: 360 | return self._redis.zcard(table) 361 | 362 | def zrem(self, table, values): 363 | ''' 364 | @summary: 移除集合中的指定元素 365 | --------- 366 | @param table: 367 | @param values: 一个或者列表 368 | --------- 369 | @result: 370 | ''' 371 | if isinstance(values, list): 372 | pipe = self._redis.pipeline(transaction=True) # redis-py默认在执行每次请求都会创建(连接池申请连接)和断开(归还连接池)一次连接操作,如果想要在一次请求中指定多个命令,则可以使用pipline实现一次请求指定多个命令,并且默认情况下一次pipline 是原子性操作。 373 | 374 | if not self._is_redis_cluster: 375 | pipe.multi() 376 | for value in values: 377 | pipe.zrem(table, value) 378 | pipe.execute() 379 | else: 380 | self._redis.zrem(table, values) 381 | 382 | def zexists(self, table, values): 383 | ''' 384 | 利用zscore判断某元素是否存在 385 | @param values: 386 | @return: 387 | ''' 388 | is_exists = [] 389 | 390 | if isinstance(values, list): 391 | pipe = self._redis.pipeline(transaction=True) # 
redis-py默认在执行每次请求都会创建(连接池申请连接)和断开(归还连接池)一次连接操作,如果想要在一次请求中指定多个命令,则可以使用pipline实现一次请求指定多个命令,并且默认情况下一次pipline 是原子性操作。 392 | pipe.multi() 393 | for value in values: 394 | pipe.zscore(table, value) 395 | is_exists_temp = pipe.execute() 396 | for is_exist in is_exists_temp: 397 | if is_exist != None: 398 | is_exists.append(1) 399 | else: 400 | is_exists.append(0) 401 | 402 | else: 403 | is_exists = self._redis.zscore(table, values) 404 | is_exists = 1 if is_exists != None else 0 405 | 406 | return is_exists 407 | 408 | def lpush(self, table, values): 409 | if isinstance(values, list): 410 | pipe = self._redis.pipeline(transaction=True) # redis-py默认在执行每次请求都会创建(连接池申请连接)和断开(归还连接池)一次连接操作,如果想要在一次请求中指定多个命令,则可以使用pipline实现一次请求指定多个命令,并且默认情况下一次pipline 是原子性操作。 411 | 412 | if not self._is_redis_cluster: 413 | pipe.multi() 414 | for value in values: 415 | pipe.rpush(table, value) 416 | pipe.execute() 417 | 418 | else: 419 | return self._redis.rpush(table, values) 420 | 421 | def lpop(self, table, count=1): 422 | ''' 423 | @summary: 424 | --------- 425 | @param table: 426 | @param count: 427 | --------- 428 | @result: count>1时返回列表 429 | ''' 430 | datas = None 431 | 432 | count = count if count <= self.lget_count(table) else self.lget_count(table) 433 | 434 | if count: 435 | if count > 1: 436 | pipe = self._redis.pipeline(transaction=True) # redis-py默认在执行每次请求都会创建(连接池申请连接)和断开(归还连接池)一次连接操作,如果想要在一次请求中指定多个命令,则可以使用pipline实现一次请求指定多个命令,并且默认情况下一次pipline 是原子性操作。 437 | 438 | if not self._is_redis_cluster: 439 | pipe.multi() 440 | while count: 441 | pipe.lpop(table) 442 | count -= 1 443 | datas = pipe.execute() 444 | 445 | else: 446 | datas = self._redis.lpop(table) 447 | 448 | return datas 449 | 450 | def rpoplpush(self, from_table, to_table=None): 451 | ''' 452 | 将列表 from_table 中的最后一个元素(尾元素)弹出,并返回给客户端。 453 | 将 from_table 弹出的元素插入到列表 to_table ,作为 to_table 列表的的头元素。 454 | 如果 from_table 和 to_table 相同,则列表中的表尾元素被移动到表头,并返回该元素,可以把这种特殊情况视作列表的旋转(rotation)操作 455 | @param from_table: 456 | @param to_table: 
457 | @return: 458 | ''' 459 | 460 | if not to_table: 461 | to_table = from_table 462 | 463 | return self._redis.rpoplpush(from_table, to_table) 464 | 465 | def lget_count(self, table): 466 | return self._redis.llen(table) 467 | 468 | def lrem(self, table, value, num=0): 469 | return self._redis.lrem(table, value, num) 470 | 471 | def hset(self, table, key, value): 472 | ''' 473 | @summary: 474 | 如果 key 不存在,一个新的哈希表被创建并进行 HSET 操作。 475 | 如果域 field 已经存在于哈希表中,旧值将被覆盖 476 | --------- 477 | @param table: 478 | @param key: 479 | @param value: 480 | --------- 481 | @result: 1 新插入; 0 覆盖 482 | ''' 483 | 484 | return self._redis.hset(table, key, value) 485 | 486 | def hincrby(self, table, key, increment): 487 | return self._redis.hincrby(table, key, increment) 488 | 489 | def hget(self, table, key, is_pop=False): 490 | if not is_pop: 491 | return self._redis.hget(table, key) 492 | else: 493 | lua = ''' 494 | local key = KEYS[1] 495 | local field = ARGV[1] 496 | 497 | -- 取值 498 | local datas = redis.call('hget', key, field) 499 | -- 删除值 500 | redis.call('hdel', key, field) 501 | 502 | return datas 503 | 504 | ''' 505 | cmd = self._redis.register_script(lua) 506 | res = cmd(keys=[table], args=[key]) 507 | 508 | return res 509 | 510 | def hgetall(self, table): 511 | return self._redis.hgetall(table) 512 | 513 | def hexists(self, table, key): 514 | return self._redis.hexists(table, key) 515 | 516 | def hdel(self, table, *keys): 517 | ''' 518 | @summary: 删除对应的key 可传多个 519 | --------- 520 | @param table: 521 | @param *keys: 522 | --------- 523 | @result: 524 | ''' 525 | 526 | self._redis.hdel(table, *keys) 527 | 528 | def hget_count(self, table): 529 | return self._redis.hlen(table) 530 | 531 | def setbit(self, table, offsets, values): 532 | ''' 533 | 设置字符串数组某一位的值, 返回之前的值 534 | @param table: 535 | @param offsets: 支持列表或单个值 536 | @param values: 支持列表或单个值 537 | @return: list / 单个值 538 | ''' 539 | if isinstance(offsets, list): 540 | if not isinstance(values, list): 541 | values = 
[values] * len(offsets) 542 | else: 543 | assert len(offsets) == len(values), 'offsets值要与values值一一对应' 544 | 545 | pipe = self._redis.pipeline(transaction=True) # redis-py默认在执行每次请求都会创建(连接池申请连接)和断开(归还连接池)一次连接操作,如果想要在一次请求中指定多个命令,则可以使用pipline实现一次请求指定多个命令,并且默认情况下一次pipline 是原子性操作。 546 | pipe.multi() 547 | 548 | for offset, value in zip(offsets, values): 549 | pipe.setbit(table, offset, value) 550 | 551 | return pipe.execute() 552 | 553 | else: 554 | return self._redis.setbit(table, offsets, values) 555 | 556 | def getbit(self, table, offsets): 557 | ''' 558 | 取字符串数组某一位的值 559 | @param table: 560 | @param offsets: 支持列表 561 | @return: list / 单个值 562 | ''' 563 | if isinstance(offsets, list): 564 | pipe = self._redis.pipeline(transaction=True) # redis-py默认在执行每次请求都会创建(连接池申请连接)和断开(归还连接池)一次连接操作,如果想要在一次请求中指定多个命令,则可以使用pipline实现一次请求指定多个命令,并且默认情况下一次pipline 是原子性操作。 565 | pipe.multi() 566 | for offset in offsets: 567 | pipe.getbit(table, offset) 568 | 569 | return pipe.execute() 570 | 571 | else: 572 | return self._redis.getbit(table, offsets) 573 | 574 | def bitcount(self, table): 575 | return self._redis.bitcount(table) 576 | 577 | def strset(self, table, value, **kwargs): 578 | return self._redis.set(table, value, **kwargs) 579 | 580 | def strget(self, table): 581 | return self._redis.get(table) 582 | 583 | def strlen(self, table): 584 | return self._redis.strlen(table) 585 | 586 | def getkeys(self, regex): 587 | return self._redis.keys(regex) 588 | 589 | def exists_key(self, key): 590 | return self._redis.exists(key) 591 | 592 | def set_expire(self, key, seconds): 593 | ''' 594 | @summary: 设置过期时间 595 | --------- 596 | @param key: 597 | @param seconds: 秒 598 | --------- 599 | @result: 600 | ''' 601 | 602 | self._redis.expire(key, seconds) 603 | 604 | def clear(self, table): 605 | try: 606 | self._redis.delete(table) 607 | except Exception as e: 608 | log.error(e) 609 | 610 | def get_redis_obj(self): 611 | return self._redis 612 | 613 | 614 | if __name__ == '__main__': 615 | db = 
RedisDB(db=1) 616 | print(db) 617 | -------------------------------------------------------------------------------- /wechat-spider/run.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | ''' 3 | Created on 2019/5/18 9:52 PM 4 | --------- 5 | @summary: 6 | --------- 7 | @author: 8 | ''' 9 | 10 | from core.capture_packet import WechatCapture 11 | from create_tables import create_table 12 | from mitmproxy import options 13 | from mitmproxy import proxy 14 | from mitmproxy.tools.dump import DumpMaster 15 | from config import config, IP 16 | 17 | 18 | def start(): 19 | ip = IP 20 | port = config.get('spider').get('service_port') 21 | 22 | print("温馨提示:服务IP {} 端口 {} 请确保代理已配置".format(ip, port)) 23 | 24 | myaddon = WechatCapture() 25 | opts = options.Options(listen_port=port) 26 | pconf = proxy.config.ProxyConfig(opts) 27 | m = DumpMaster(opts) 28 | m.options.set('flow_detail={mitm_log_level}'.format(mitm_log_level=config.get('mitm').get('log_level'))) 29 | m.server = proxy.server.ProxyServer(pconf) 30 | m.addons.add(myaddon) 31 | 32 | try: 33 | m.run() 34 | except KeyboardInterrupt: 35 | m.shutdown() 36 | 37 | 38 | create_table() 39 | start() 40 | -------------------------------------------------------------------------------- /wechat-spider/utils/log.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | """ 3 | Created on 2018-12-08 16:50 4 | --------- 5 | @summary: 6 | --------- 7 | @author: 8 | """ 9 | 10 | import logging 11 | import os 12 | import sys 13 | from logging.handlers import BaseRotatingHandler 14 | from config import config 15 | 16 | from better_exceptions import format_exception 17 | 18 | LOG_FORMAT = "%(threadName)s|%(asctime)s|%(filename)s|%(funcName)s|line:%(lineno)d|%(levelname)s| %(message)s" 19 | PRINT_EXCEPTION_DETAILS = True 20 | 21 | 22 | # 重写 RotatingFileHandler 自定义log的文件名 23 | # 原来 xxx.log xxx.log.1 xxx.log.2 xxx.log.3 
文件由近及远 24 | # 现在 xxx.log xxx1.log xxx2.log 如果backupCount 是2位数时 则 01 02 03 三位数 001 002 .. 文件由近及远 25 | class RotatingFileHandler(BaseRotatingHandler): 26 | 27 | def __init__( 28 | self, filename, mode="a", maxBytes=0, backupCount=0, encoding=None, delay=0 29 | ): 30 | # if maxBytes > 0: 31 | # mode = 'a' 32 | BaseRotatingHandler.__init__(self, filename, mode, encoding, delay) 33 | self.maxBytes = maxBytes 34 | self.backupCount = backupCount 35 | self.placeholder = str(len(str(backupCount))) 36 | 37 | def doRollover(self): 38 | if self.stream: 39 | self.stream.close() 40 | self.stream = None 41 | if self.backupCount > 0: 42 | for i in range(self.backupCount - 1, 0, -1): 43 | sfn = ("%0" + self.placeholder + "d.") % i # '%2d.'%i -> 02 44 | sfn = sfn.join(self.baseFilename.split(".")) 45 | # sfn = "%d_%s" % (i, self.baseFilename) 46 | # dfn = "%d_%s" % (i + 1, self.baseFilename) 47 | dfn = ("%0" + self.placeholder + "d.") % (i + 1) 48 | dfn = dfn.join(self.baseFilename.split(".")) 49 | if os.path.exists(sfn): 50 | # print "%s -> %s" % (sfn, dfn) 51 | if os.path.exists(dfn): 52 | os.remove(dfn) 53 | os.rename(sfn, dfn) 54 | dfn = (("%0" + self.placeholder + "d.") % 1).join( 55 | self.baseFilename.split(".") 56 | ) 57 | if os.path.exists(dfn): 58 | os.remove(dfn) 59 | # Issue 18940: A file may not have been created if delay is True. 60 | if os.path.exists(self.baseFilename): 61 | os.rename(self.baseFilename, dfn) 62 | if not self.delay: 63 | self.stream = self._open() 64 | 65 | def shouldRollover(self, record): 66 | 67 | if self.stream is None: # delay was set... 68 | self.stream = self._open() 69 | if self.maxBytes > 0: # are we rolling over? 
70 | msg = "%s\n" % self.format(record) 71 | self.stream.seek(0, 2) # due to non-posix-compliant Windows feature 72 | if self.stream.tell() + len(msg) >= self.maxBytes: 73 | return 1 74 | return 0 75 | 76 | 77 | def get_logger( 78 | name, path="", log_level="DEBUG", is_write_to_file=False, is_write_to_stdout=True 79 | ): 80 | """ 81 | @summary: 获取log 82 | --------- 83 | @param name: log名 84 | @param path: log文件存储路径 如 D://xxx.log 85 | @param log_level: log等级 CRITICAL/ERROR/WARNING/INFO/DEBUG 86 | @param is_write_to_file: 是否写入到文件 默认否 87 | --------- 88 | @result: 89 | """ 90 | name = name.split(os.sep)[-1].split(".")[0] # 取文件名 91 | 92 | logger = logging.getLogger(name) 93 | logger.setLevel(log_level) 94 | 95 | formatter = logging.Formatter(LOG_FORMAT) 96 | if PRINT_EXCEPTION_DETAILS: 97 | formatter.formatException = lambda exc_info: format_exception(*exc_info) 98 | 99 | # 定义一个RotatingFileHandler,最多备份20个日志文件,每个日志文件最大10M 100 | if is_write_to_file: 101 | if path and not os.path.exists(os.path.dirname(path)): 102 | os.makedirs(os.path.dirname(path)) 103 | 104 | rf_handler = RotatingFileHandler( 105 | path, mode="w", maxBytes=10 * 1024 * 1024, backupCount=20, encoding="utf8" 106 | ) 107 | rf_handler.setFormatter(formatter) 108 | logger.addHandler(rf_handler) 109 | 110 | if is_write_to_stdout: 111 | stream_handler = logging.StreamHandler() 112 | stream_handler.stream = sys.stdout 113 | stream_handler.setFormatter(formatter) 114 | # 检查是否已存在 115 | handle_exists = 0 116 | for _handler in logger.handlers: 117 | if ( 118 | isinstance(_handler, logging.StreamHandler) 119 | and _handler.stream == sys.stdout 120 | ): 121 | handle_exists = 1 122 | if not handle_exists: 123 | logger.addHandler(stream_handler) 124 | 125 | return logger 126 | 127 | 128 | # logging.disable(logging.DEBUG) # 关闭所有log 129 | 130 | # 不让打印log的配置 131 | STOP_LOGS = [ 132 | # ES 133 | "urllib3.response", 134 | "urllib3.connection", 135 | "elasticsearch.trace", 136 | "requests.packages.urllib3.util", 137 |
"requests.packages.urllib3.util.retry", 138 | "urllib3.util", 139 | "requests.packages.urllib3.response", 140 | "requests.packages.urllib3.contrib.pyopenssl", 141 | "requests.packages", 142 | "urllib3.util.retry", 143 | "requests.packages.urllib3.contrib", 144 | "requests.packages.urllib3.connectionpool", 145 | "requests.packages.urllib3.poolmanager", 146 | "urllib3.connectionpool", 147 | "requests.packages.urllib3.connection", 148 | "elasticsearch", 149 | "log_request_fail", 150 | # requests 151 | "requests", 152 | "selenium.webdriver.remote.remote_connection", 153 | "selenium.webdriver.remote", 154 | "selenium.webdriver", 155 | "selenium", 156 | # markdown 157 | "MARKDOWN", 158 | "build_extension", 159 | # newspaper 160 | "calculate_area", 161 | "largest_image_url", 162 | "newspaper.images", 163 | "newspaper", 164 | "Importing", 165 | "PIL", 166 | ] 167 | 168 | # 关闭日志打印 169 | for STOP_LOG in STOP_LOGS: 170 | log_level = eval("logging.ERROR") 171 | logging.getLogger(STOP_LOG).setLevel(log_level) 172 | 173 | # print(logging.Logger.manager.loggerDict) # 取使用debug模块的name 174 | 175 | # 日志级别大小关系为:critical > error > warning > info > debug 176 | log = get_logger( 177 | name="wechat_spider", 178 | path=config.get('log').get('log_path'), 179 | log_level=config.get('log').get('level'), 180 | is_write_to_file=config.get('log').get('to_file'), 181 | ) 182 | 183 | if __name__ == "__main__": 184 | try: 185 | a = 1 186 | b = 0 187 | c = a / b 188 | except Exception as e: 189 | log.exception(e) 190 | -------------------------------------------------------------------------------- /wechat-spider/utils/selector.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | ''' 3 | Created on 2018-10-08 15:33:37 4 | --------- 5 | @summary: 重新定义 selector 6 | --------- 7 | @author: 8 | @email: 9 | ''' 10 | import re 11 | 12 | import six 13 | from parsel import Selector as ParselSelector 14 | from parsel import SelectorList as 
ParselSelectorList 15 | from w3lib.html import replace_entities as w3lib_replace_entities 16 | 17 | 18 | def extract_regex(regex, text, replace_entities=True, flags=0): 19 | """Extract a list of unicode strings from the given text/encoding using the following policies: 20 | * if the regex contains a named group called "extract" that will be returned 21 | * if the regex contains multiple numbered groups, all those will be returned (flattened) 22 | * if the regex doesn't contain any group the entire regex matching is returned 23 | """ 24 | if isinstance(regex, six.string_types): 25 | regex = re.compile(regex, flags=flags) 26 | 27 | if 'extract' in regex.groupindex: 28 | # named group 29 | try: 30 | extracted = regex.search(text).group('extract') 31 | except AttributeError: 32 | strings = [] 33 | else: 34 | strings = [extracted] if extracted is not None else [] 35 | else: 36 | # full regex or numbered groups 37 | strings = regex.findall(text) 38 | 39 | # strings = flatten(strings) # 这东西会把多维列表铺平 40 | if not replace_entities: 41 | return strings 42 | 43 | values = [] 44 | for value in strings: 45 | if isinstance(value, (list, tuple)): # w3lib_replace_entities 不能接收list tuple 46 | values.append([w3lib_replace_entities(v, keep=['lt', 'amp']) for v in value]) 47 | else: 48 | values.append(w3lib_replace_entities(value, keep=['lt', 'amp'])) 49 | 50 | return values 51 | 52 | 53 | class SelectorList(ParselSelectorList): 54 | """ 55 | The :class:`SelectorList` class is a subclass of the builtin ``list`` 56 | class, which provides a few additional methods. 57 | """ 58 | 59 | def re_first(self, regex, default=None, replace_entities=True, flags=re.S): 60 | """ 61 | Call the ``.re()`` method for the first element in this list and 62 | return the result in an unicode string. If the list is empty or the 63 | regex doesn't match anything, return the default value (``None`` if 64 | the argument is not provided). 
65 |
66 |         By default, character entity references are replaced by their
67 |         corresponding character (except for ``&`` and ``<``).
68 |         Passing ``replace_entities`` as ``False`` switches off these
69 |         replacements.
70 |         """
71 |
72 |         datas = self.re(regex, replace_entities=replace_entities, flags=flags)
73 |         return datas[0] if datas else default
74 |
75 |     def re(self, regex, replace_entities=True, flags=re.S):
76 |         """
77 |         Call the ``.re()`` method for each element in this list and return
78 |         the per-element results (a single flat list when this list holds one element).
79 |
80 |         By default, character entity references are replaced by their
81 |         corresponding character (except for ``&`` and ``<``).
82 |         Passing ``replace_entities`` as ``False`` switches off these
83 |         replacements.
84 |         """
85 |         datas = [x.re(regex, replace_entities=replace_entities, flags=flags) for x in self]
86 |         return datas[0] if len(datas) == 1 else datas
87 |
88 |
89 | class Selector(ParselSelector):
90 |     selectorlist_cls = SelectorList
91 |
92 |     def __str__(self):
93 |         data = repr(self.get())
94 |         return "<%s xpath=%r data=%s>" % (type(self).__name__, self._expr, data)
95 |
96 |     __repr__ = __str__
97 |
98 |     def re_first(self, regex, default=None, replace_entities=True, flags=re.S):
99 |         """
100 |         Apply the given regex and return the first unicode string which
101 |         matches. If there is no match, return the default value (``None`` if
102 |         the argument is not provided).
103 |
104 |         By default, character entity references are replaced by their
105 |         corresponding character (except for ``&`` and ``<``).
106 |         Passing ``replace_entities`` as ``False`` switches off these
107 |         replacements.
108 |         """
109 |
110 |         datas = self.re(regex, replace_entities=replace_entities, flags=flags)
111 |
112 |         return datas[0] if datas else default
113 |
114 |     def re(self, regex, replace_entities=True, flags=re.S):
115 |         """
116 |         Apply the given regex and return a list of unicode strings with the
117 |         matches.
118 |
119 |         ``regex`` can be either a compiled regular expression or a string which
120 |         will be compiled to a regular expression using ``re.compile(regex)``.
121 |
122 |         By default, character entity references are replaced by their
123 |         corresponding character (except for ``&`` and ``<``).
124 |         Passing ``replace_entities`` as ``False`` switches off these
125 |         replacements.
126 |         """
127 |
128 |         return extract_regex(regex, self.get(), replace_entities=replace_entities, flags=flags)
129 |
--------------------------------------------------------------------------------
/wechat-spider/utils/tools.py:
--------------------------------------------------------------------------------
1 | # -*- coding: utf-8 -*-
2 | '''
3 | Created on 2019/5/19 3:03 PM
4 | ---------
5 | @summary:
6 | ---------
7 | @author:
8 | '''
9 | import datetime
10 | import json
11 | import re
12 | import ssl
13 | import time
14 | import uuid
15 | from pprint import pformat
16 | import hashlib
17 |
18 | import pymysql
19 | from utils.log import log
20 |
21 | # globally disable ssl certificate verification
22 | ssl._create_default_https_context = ssl._create_unverified_context
23 |
24 | _regexs = {}
25 |
26 |
27 | # @log_function_time
28 | def get_info(html, regexs, allow_repeat=True, fetch_one=False, split=None):
29 |     regexs = [regexs] if isinstance(regexs, str) else regexs
30 |
31 |     infos = []
32 |     for regex in regexs:
33 |         if regex == '':
34 |             continue
35 |
36 |         if regex not in _regexs:  # cache compiled patterns
37 |             _regexs[regex] = re.compile(regex, re.S)
38 |
39 |         if fetch_one:
40 |             infos = _regexs[regex].search(html)
41 |             if infos:
42 |                 infos = infos.groups()
43 |             else:
44 |                 continue
45 |         else:
46 |             infos = _regexs[regex].findall(str(html))
47 |
48 |         if len(infos) > 0:
49 |             # print(regex)
50 |             break
51 |
52 |     if fetch_one:
53 |         infos = infos if infos else ('',)
54 |         return infos if len(infos) > 1 else infos[0]
55 |     else:
56 |         infos = infos if allow_repeat else sorted(set(infos), key=infos.index)
57 |         infos = split.join(infos) if split else infos
58 |         return infos
59 |
60 |
61 | def get_param(url, key):
62 |     params = url.split('?')[-1].split('&')
63 |     for param in params:
64 |         key_value = param.split('=', 1)
65 |         if key == key_value[0]:
66 |             return key_value[1]
67 |     return None
68 |
69 |
70 | def get_current_timestamp():
71 |     return int(time.time())
72 |
73 |
74 | def get_current_date(date_format='%Y-%m-%d %H:%M:%S'):
75 |     return datetime.datetime.now().strftime(date_format)
76 |     # return time.strftime(date_format, time.localtime(time.time()))
77 |
78 |
79 | def timestamp_to_date(timestamp, time_format='%Y-%m-%d %H:%M:%S'):
80 |     '''
81 |     @summary: convert a timestamp to a date string
82 |     ---------
83 |     @param timestamp: Unix timestamp
84 |     @param time_format: date format
85 |     ---------
86 |     @result: the formatted date
87 |     '''
88 |
89 |     date = time.localtime(timestamp)
90 |     return time.strftime(time_format, date)
91 |
92 |
93 | def get_json(json_str):
94 |     '''
95 |     @summary: parse a json object
96 |     ---------
97 |     @param json_str: a json-formatted string
98 |     ---------
99 |     @result: the parsed json object
100 |     '''
101 |
102 |     try:
103 |         return json.loads(json_str) if json_str else {}
104 |     except Exception as e1:
105 |         try:
106 |             json_str = json_str.strip()
107 |             json_str = json_str.replace("'", '"')
108 |             keys = get_info(json_str, r"(\w+):")  # raw string avoids invalid-escape warnings
109 |             for key in keys:
110 |                 json_str = json_str.replace(key, '"%s"' % key)
111 |
112 |             return json.loads(json_str) if json_str else {}
113 |
114 |         except Exception as e2:
115 |             log.error(
116 |                 '''
117 |                 e1: %s
118 |                 format json_str: %s
119 |                 e2: %s
120 |                 ''' % (e1, json_str, e2)
121 |             )
122 |
123 |         return {}
124 |
125 |
126 | def dumps_json(json_, indent=4):
127 |     '''
128 |     @summary: pretty-format json for printing
129 |     ---------
130 |     @param json_: a json string or json object
131 |     ---------
132 |     @result: the formatted string
133 |     '''
134 |     try:
135 |         if isinstance(json_, str):
136 |             json_ = get_json(json_)
137 |
138 |         json_ = json.dumps(json_, ensure_ascii=False, indent=indent, skipkeys=True)
139 |
140 |     except Exception as e:
141 |         log.error(e)
142 |         json_ = pformat(json_)
143 |
144 |     return json_
145 |
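The fallback in `get_json` above (swap single quotes for double quotes, then double-quote bare keys) can be restated as a standalone sketch. `lenient_loads` is a hypothetical name, not part of this module, and the substring-based key quoting shares the original's caveat that key text occurring inside values would be rewritten too:

```python
import json
import re


def lenient_loads(json_str):
    """Standalone restatement of get_json's fallback parsing (illustrative)."""
    try:
        return json.loads(json_str) if json_str else {}
    except Exception:
        # repair step 1: swap single quotes for double quotes
        fixed = json_str.strip().replace("'", '"')
        # repair step 2: double-quote bare keys, e.g. {count: 2} -> {"count": 2}
        for key in re.findall(r"(\w+):", fixed):
            fixed = fixed.replace("%s:" % key, '"%s":' % key)
        return json.loads(fixed)


print(lenient_loads("{'name': 'wechat'}"))  # {'name': 'wechat'}
print(lenient_loads("{count: 2}"))          # {'count': 2}
```

This heuristic only handles the two malformations the original targets; anything else still raises from `json.loads`.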
146 |
147 | ############
148 | def format_sql_value(value):
149 |     if isinstance(value, str):
150 |         value = pymysql.escape_string(value)  # moved to pymysql.converters.escape_string in PyMySQL >= 1.0
151 |
152 |     elif isinstance(value, list) or isinstance(value, dict):
153 |         value = dumps_json(value, indent=None)
154 |
155 |     elif isinstance(value, bool):
156 |         value = int(value)
157 |
158 |     return value
159 |
160 |
161 | def list2str(datas):
162 |     '''
163 |     convert a list to a parenthesized string
164 |     :param datas: [1, 2]
165 |     :return: (1, 2)
166 |     '''
167 |     data_str = str(tuple(datas))
168 |     data_str = re.sub(r",\)$", ')', data_str)  # drop the trailing comma of 1-tuples
169 |     return data_str
170 |
171 |
172 | def make_insert_sql(table, data, auto_update=False, update_columns=(), insert_ignore=False):
173 |     '''
174 |     @summary: for mysql; oracle dates would need to_date handling (TODO)
175 |     ---------
176 |     @param table:
177 |     @param data: row data as a dict
178 |     @param auto_update: use replace into, fully overwriting existing rows
179 |     @param update_columns: columns to update, defaults to all; when set, auto_update is ignored and these columns are updated on duplicate key
180 |     @param insert_ignore: skip rows that already exist
181 |     ---------
182 |     @result:
183 |     '''
184 |
185 |     keys = ['`{}`'.format(key) for key in data.keys()]
186 |     keys = list2str(keys).replace("'", '')
187 |
188 |     values = [format_sql_value(value) for value in data.values()]
189 |     values = list2str(values)
190 |
191 |     if update_columns:
192 |         if not isinstance(update_columns, (tuple, list)):
193 |             update_columns = [update_columns]
194 |         update_columns_ = ', '.join(["{key}=values({key})".format(key=key) for key in update_columns])
195 |         sql = 'insert%s into {table} {keys} values {values} on duplicate key update %s' % (' ignore' if insert_ignore else '', update_columns_)
196 |
197 |     elif auto_update:
198 |         sql = 'replace into {table} {keys} values {values}'
199 |     else:
200 |         sql = 'insert%s into {table} {keys} values {values}' % (' ignore' if insert_ignore else '')
201 |
202 |     sql = sql.format(table=table, keys=keys, values=values).replace('None', 'null')  # note: also rewrites 'None' inside string values
203 |     return sql
204 |
205 |
206 | def make_update_sql(table, data, condition):
207 |     '''
208 |     @summary: for mysql; oracle dates would need to_date handling (TODO)
209 |     ---------
210 |     @param table:
211 |     @param data: row data as a dict
212 |     @param condition: where condition
213 |     ---------
214 |     @result:
215 |     '''
216 |     key_values = []
217 |
218 |     for key, value in data.items():
219 |         value = format_sql_value(value)
220 |         if isinstance(value, str):
221 |             key_values.append("`{}`='{}'".format(key, value))
222 |         elif value is None:
223 |             key_values.append("`{}`={}".format(key, 'null'))
224 |         else:
225 |             key_values.append("`{}`={}".format(key, value))
226 |
227 |     key_values = ', '.join(key_values)
228 |
229 |     sql = 'update {table} set {key_values} where {condition}'
230 |     sql = sql.format(table=table, key_values=key_values, condition=condition)
231 |     return sql
232 |
233 |
234 | def make_batch_sql(table, datas, auto_update=False, update_columns=()):
235 |     '''
236 |     @summary: build a batched sql statement
237 |     ---------
238 |     @param table:
239 |     @param datas: row data, [{...}]
240 |     @param auto_update: use replace into, fully overwriting existing rows
241 |     @param update_columns: columns to update, defaults to all; when set, auto_update is ignored and these columns are updated on duplicate key
242 |     ---------
243 |     @result:
244 |     '''
245 |     if not datas:
246 |         return
247 |
248 |     keys = list(datas[0].keys())
249 |     values_placeholder = ['%s'] * len(keys)
250 |
251 |     values = []
252 |     for data in datas:
253 |         value = []
254 |         for key in keys:
255 |             current_data = data.get(key)
256 |             current_data = format_sql_value(current_data)
257 |
258 |             value.append(current_data)
259 |
260 |         values.append(value)
261 |
262 |     keys = ['`{}`'.format(key) for key in keys]
263 |     keys = str(keys).replace('[', '(').replace(']', ')').replace("'", '')
264 |     values_placeholder = str(values_placeholder).replace('[', '(').replace(']', ')').replace("'", '')
265 |
266 |     if update_columns:
267 |         if not isinstance(update_columns, (tuple, list)):
268 |             update_columns = [update_columns]
269 |         update_columns_ = ', '.join(["`{key}`=values(`{key}`)".format(key=key) for key in update_columns])
270 |         sql = 'insert into {table} {keys} values {values_placeholder} on duplicate key update {update_columns}'.format(table=table, keys=keys, values_placeholder=values_placeholder, update_columns=update_columns_)
271 |     elif auto_update:
272 |         sql = 'replace into {table} {keys} values {values_placeholder}'.format(table=table, keys=keys, values_placeholder=values_placeholder)
273 |     else:
274 |         sql = 'insert ignore into {table} {keys} values {values_placeholder}'.format(table=table, keys=keys, values_placeholder=values_placeholder)
275 |
276 |     return sql, values
277 |
278 |
279 | ##########
280 |
281 | def get_mac_address():
282 |     mac = uuid.UUID(int=uuid.getnode()).hex[-12:]
283 |     return ":".join([mac[e:e + 2] for e in range(0, 12, 2)])
284 |
285 |
286 | def get_md5(*args):
287 |     '''
288 |     @summary: get a unique 32-character md5 digest
289 |     ---------
290 |     @param *args: the values combined into the digest
291 |     ---------
292 |     @result: 7c8684bcbdfcea6697650aa53d7b1405
293 |     '''
294 |
295 |     m = hashlib.md5()
296 |     for arg in args:
297 |         m.update(str(arg).encode())
298 |
299 |     return m.hexdigest()
300 |
--------------------------------------------------------------------------------
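The SQL helpers in tools.py need a live MySQL connection, but the pure-stdlib helpers can be sanity-checked standalone. This sketch restates `get_param` and `get_md5` from tools.py so it runs without the project's imports; the example URL is illustrative only:

```python
import hashlib


def get_param(url, key):
    # restated from tools.py: naive query-string lookup
    params = url.split('?')[-1].split('&')
    for param in params:
        key_value = param.split('=', 1)
        if key == key_value[0]:
            return key_value[1]
    return None


def get_md5(*args):
    # restated from tools.py: md5 over the concatenated str() of each arg
    m = hashlib.md5()
    for arg in args:
        m.update(str(arg).encode())
    return m.hexdigest()


url = "https://mp.weixin.qq.com/s?__biz=MjM5MTA=&mid=2650&idx=1"
print(get_param(url, "mid"))      # 2650
print(len(get_md5("biz", 2650)))  # 32
```

Note that `get_md5("biz", 2650)` hashes the concatenation `"biz2650"`, so argument boundaries are not preserved; this is how the spider derives stable deduplication keys from several fields.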