├── LICENSE ├── README.md ├── spiderAPI ├── __init__.py ├── baidumap.py ├── dianping.py ├── github.py ├── lagou.py └── proxyip.py └── spiderFile ├── ECUT_get_grade.py ├── ECUT_pos_html.py ├── JD_spider.py ├── baidu_sy_img.py ├── baidu_wm_img.py ├── fuckCTF.py ├── get_baike.py ├── get_history_weather.py ├── get_photos.py ├── get_tj_accident_info.py ├── get_top_sec_com.py ├── get_web_all_img.py ├── github_hot.py ├── kantuSpider.py ├── lagou_position_spider.py ├── one_img.py ├── one_update.py ├── search_useful_camera_ip_address.py ├── student_img.py └── xz_picture_spider.py /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2018 yhf 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | ```shell 2 | ( 3 | )\ ) ) ) ( ( 4 | (()/( ( ( /( ( /( )\ ( ) ( ( )\ ( ( 5 | /(_)))\ ) )\()))\()) ( ( (((_) )( ( /( )\))( ((_) ))\ )( 6 | (_)) (()/( (_))/((_)\ )\ )\ ) )\___ (()\ )(_))((_)()\ _ /((_)(()\ 7 | | _ \ )(_))| |_ | |(_) ((_) _(_/(((/ __| ((_)((_)_ _(()((_)| |(_)) ((_) 8 | | _/| || || _|| ' \ / _ \| ' \))| (__ | '_|/ _` |\ V V /| |/ -_) | '_| 9 | |_| \_, | \__||_||_|\___/|_||_| \___||_| \__,_| \_/\_/ |_|\___| |_| 10 | |__/ 11 | —————— by yanghangfeng 12 | ``` 13 | #

PythonCrawler: 用 Python 编写的爬虫项目集合 :bug:(本项目代码仅作为爬虫技术学习之用,学习者务必遵循中华人民共和国法律!)


32 | 33 | # IPWO全球代理资源 | 为采集、跨境与测试项目提供支持(免费试用,爬虫使用强烈推荐!!!) 34 | ### 官网地址 35 | [👉 访问 IPWO 官网](https://www.ipwo.net/?code=WSESV2ONN) 36 | ### 产品简介 37 | * 免费试用,先体验再选择 38 | * 9000万+真实住宅IP,覆盖220+国家和地区 39 | * 支持动态住宅代理、静态住宅代理(ISP) 40 | * 适用于数据抓取、电商、广告验证、SEO监控等场景 41 | * 支持HTTP/HTTPS/SOCKS5协议,兼容性强 42 | * 纯净IP池,实时更新,99.9%连接成功率 43 | * 支持指定国家城市地区访问,保护隐私 44 | 45 | # spiderFile模块简介 46 | 47 | 1. [baidu_sy_img.py](https://github.com/yhangf/PythonCrawler/blob/master/spiderFile/baidu_sy_img.py): **抓取百度的`高清摄影`图片。** 48 | 2. [baidu_wm_img.py](https://github.com/yhangf/PythonCrawler/blob/master/spiderFile/baidu_wm_img.py): **抓取百度图片`唯美意境`模块。** 49 | 3. [get_photos.py](https://github.com/yhangf/PythonCrawler/blob/master/spiderFile/get_photos.py): **抓取百度贴吧某话题下的所有图片。** 50 | 4. [get_web_all_img.py](https://github.com/yhangf/PythonCrawler/blob/master/spiderFile/get_web_all_img.py): **抓取整个网站的图片。** 51 | 5. [lagou_position_spider.py](https://github.com/yhangf/PythonCrawler/blob/master/spiderFile/lagou_position_spider.py): **任意输入关键字,一键抓取与关键字相关的职位招聘信息,并保存到本地文件。** 52 | 6. [student_img.py](https://github.com/yhangf/PythonCrawler/blob/master/spiderFile/student_img.py): **自动化获取自己学籍证件照。** 53 | 7. [JD_spider.py](https://github.com/yhangf/PythonCrawler/blob/master/spiderFile/JD_spider.py): **大批量抓取京东商品id和标签。** 54 | 8. [ECUT_pos_html.py](https://github.com/yhangf/PythonCrawler/blob/master/spiderFile/ECUT_pos_html.py): **抓取学校官网所有校园招聘信息,并保存为html格式,图片也会镶嵌在html中。** 55 | 9. [ECUT_get_grade.py](https://github.com/yhangf/PythonCrawler/blob/master/spiderFile/ECUT_get_grade.py): **模拟登陆学校官网,抓取成绩并计算平均学分绩。** 56 | 10. [github_hot.py](https://github.com/yhangf/PythonCrawler/blob/master/spiderFile/github_hot.py): **抓取github上面热门语言所对应的项目,并把项目简介和项目主页地址保存到本地文件。** 57 | 11. [xz_picture_spider.py](https://github.com/yhangf/PythonCrawler/blob/master/spiderFile/xz_picture_spider.py): **应一位知友的请求,抓取某网站上面所有的写真图片。** 58 | 12. [one_img.py](https://github.com/yhangf/PythonCrawler/blob/master/spiderFile/one_img.py): **抓取one文艺网站的图片。** 59 | 13. [get_baike.py](https://github.com/yhangf/PythonCrawler/blob/master/spiderFile/get_baike.py): **任意输入一个关键词抓取百度百科的介绍。** 60 | 14. [kantuSpider.py](https://github.com/yhangf/PythonCrawler/blob/master/spiderFile/kantuSpider.py): **抓取看图网站上的所有图片。** 61 | 15. [fuckCTF.py](https://github.com/yhangf/PythonCrawler/blob/master/spiderFile/fuckCTF.py): **通过selenium模拟登入合天网站,自动修改原始密码。** 62 | 16. [one_update.py](https://github.com/yhangf/PythonCrawler/blob/master/spiderFile/one_update.py): **更新抓取one文艺网站的代码,添加一句箴言的抓取。** 63 | 17. [get_history_weather.py](https://github.com/yhangf/PythonCrawler/blob/master/spiderFile/get_history_weather.py): **抓取广州市2019年第一季度的天气数据。** 64 | 18. [search_useful_camera_ip_address.py](https://github.com/yhangf/PythonCrawler/blob/master/spiderFile/search_useful_camera_ip_address.py): **摄像头弱密码安全科普。** 65 | 19. [get_top_sec_com.py](https://github.com/yhangf/PythonCrawler/blob/master/spiderFile/get_top_sec_com.py): **异步编程获取A股市场网络安全版块公司市值排名情况,并以图片格式保存下来。** 66 | 20. 
[get_tf_accident_info.py](https://github.com/yhangf/PythonCrawler/blob/master/spiderFile/get_tj_accident_info.py): **同步和异步编程结合获取天津市应急管理局所有事故信息。** 67 | --- 68 | # spiderAPI模块简介 69 | 70 | #### 本模块提供一些网站的API爬虫接口,功能可能不是很全因此可塑性很大智慧的你如果有兴趣可以继续改进。 71 | 72 | ##### 1.大众点评 73 | 74 | ```python 75 | from spiderAPI.dianping import * 76 | 77 | ''' 78 | citys = { 79 | '北京': '2', '上海': '1', '广州': '4', '深圳': '7', '成都': '8', '重庆': '9', '杭州': '3', '南京': '5', '沈阳': '18', '苏州': '6', '天津': '10','武汉': '16', '西安': '17', '长沙': '344', '大连': '19', '济南': '22', '宁波': '11', '青岛': '21', '无锡': '13', '厦门': '15', '郑州': '160' 80 | } 81 | 82 | ranktype = { 83 | '最佳餐厅': 'score', '人气餐厅': 'popscore', '口味最佳': 'score1', '环境最佳': 'score2', '服务最佳': 'score3' 84 | } 85 | ''' 86 | 87 | result=bestRestaurant(cityId=1, rankType='popscore')#获取人气餐厅 88 | 89 | shoplist=dpindex(cityId=1, page=1)#商户风云榜 90 | 91 | restaurantlist=restaurantList('http://www.dianping.com/search/category/2/10/p2')#获取餐厅 92 | 93 | ``` 94 | 95 | ##### 2.获取代理IP 96 | 爬取[代理IP](http://proxy.ipcn.org) 97 | ```python 98 | from spiderAPI.proxyip import get_enableips 99 | 100 | enableips=get_enableips() 101 | 102 | ``` 103 | 104 | ##### 3.百度地图 105 | 106 | 百度地图提供的API,对查询有一些限制,这里找出了web上查询的接口。 107 | ```python 108 | from spiderAPI.baidumap import * 109 | 110 | citys=citys()#获取城市列表 111 | result=search(keyword="美食", citycode="257", page=1)#获取搜索结果 112 | 113 | ``` 114 | 115 | ##### 4.模拟登录github 116 | ```python 117 | from spiderAPI.github import GitHub 118 | 119 | github = GitHub() 120 | github.login() # 这一步会提示你输入用户名和密码 121 | github.show_timeline() # 获取github主页时间线 122 | # 更多的功能有待你们自己去发掘 123 | ``` 124 | 125 | ##### 5.拉勾网 126 | ```python 127 | from spiderAPI.lagou import * 128 | 129 | lagou_spider(key='数据挖掘', page=1) # 获取关键字为数据挖掘的招聘信息 130 | ``` 131 | -------------------------------------------------------------------------------- /spiderAPI/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/yhangf/PythonCrawler/fac1ffc9c9e6a04875b55d54cd67dbf72ac39db2/spiderAPI/__init__.py -------------------------------------------------------------------------------- /spiderAPI/baidumap.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import json 3 | 4 | headers = { 5 | 'Host': "map.baidu.com", 6 | "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", 7 | "Accept-Encoding": "gzip, deflate", 8 | "Accept-Language": "en-US,en;q=0.5", 9 | "Connection": "keep-alive", 10 | "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:39.0) Gecko/20100101 Firefox/39.0"} 11 | 12 | 13 | def citys(): 14 | html = requests.get( 15 | 'http://map.baidu.com/?newmap=1&reqflag=pcmap&biz=1&from=webmap&da_par=baidu&pcevaname=pc4.1&qt=s&da_src=searchBox.button&wd=美食&c=1&src=0&wd2=&sug=0&l=5&b=(7002451.220000001,1994587.88;19470675.22,7343963.88)&from=webmap&biz_forward={%22scaler%22:1,%22styles%22:%22pl%22}&sug_forward=&tn=B_NORMAL_MAP&nn=0&u_loc=12736591.152491,3547888.166124&ie=utf-8&t=1459951988807', headers=headers).text 16 | data = json.loads(html) 17 | result = [] 18 | for item in data['more_city']: 19 | for city in item['city']: 20 | result.append(city) 21 | for item in data['content']: 22 | result.append(item) 23 | return result 24 | 25 | 26 | def search(keyword, citycode, page): 27 | html = requests.get('http://map.baidu.com/?newmap=1&reqflag=pcmap&biz=1&from=webmap&da_par=baidu&pcevaname=pc4.1&qt=con&from=webmap&c=' + str(citycode) + '&wd=' + keyword + 
'&wd2=&pn=' + str( 28 | page) + '&nn=' + str(page * 10) + '&db=0&sug=0&addr=0&&da_src=pcmappg.poi.page&on_gel=1&src=7&gr=3&l=12&tn=B_NORMAL_MAP&u_loc=12736591.152491,3547888.166124&ie=utf-8', headers=headers).text 29 | data = json.loads(html)['content'] 30 | return data 31 | -------------------------------------------------------------------------------- /spiderAPI/dianping.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import json 3 | import os 4 | from bs4 import BeautifulSoup 5 | 6 | headers = { 7 | 'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:39.0) Gecko/20100101 Firefox/39.0', 8 | 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 9 | 'Accept-Language': 'en-US,en;q=0.5', 10 | 'Accept-Encoding': 'gzip, deflate', 11 | 'Connection': 'keep-alive'} 12 | 13 | 14 | def bestRestaurant(cityId=1, rankType='popscore'): 15 | html = requests.get('http://www.dianping.com/mylist/ajax/shoprank?cityId=%s&shopType=10&rankType=%s&categoryId=0' % 16 | (cityId, rankType), headers=headers).text 17 | result = json.loads(html)['shopBeans'] 18 | return result 19 | 20 | 21 | def getCityId(): 22 | citys = {'北京': '2', '上海': '1', '广州': '4', '深圳': '7', '成都': '8', '重庆': '9', '杭州': '3', '南京': '5', '沈阳': '18', '苏州': '6', '天津': '10', 23 | '武汉': '16', '西安': '17', '长沙': '344', '大连': '19', '济南': '22', '宁波': '11', '青岛': '21', '无锡': '13', '厦门': '15', '郑州': '160'} 24 | return citys 25 | 26 | 27 | def getRankType(): 28 | RankType = {'最佳餐厅': 'score', '人气餐厅': 'popscore', 29 | '口味最佳': 'score1', '环境最佳': 'score2', '服务最佳': 'score3'} 30 | return RankType 31 | 32 | 33 | def dpindex(cityId=1, page=1): 34 | url = 'http://dpindex.dianping.com/dpindex?region=&category=&type=rank&city=%s&p=%s' % ( 35 | cityId, page) 36 | html = requests.get(url, headers=headers).text 37 | table = BeautifulSoup(html, 'lxml').find( 38 | 'div', attrs={'class': 'idxmain-subcontainer'}).find_all('li') 39 | result = [] 40 | for item in table: 41 | shop = {} 42 | shop['name'] = item.find('div', attrs={'class': 'field-name'}).get_text() 43 | shop['url'] = item.find('a').get('href') 44 | shop['num'] = item.find('div', attrs={'class': 'field-num'}).get_text() 45 | shop['addr'] = item.find('div', attrs={'class': 'field-addr'}).get_text() 46 | shop['index'] = item.find('div', attrs={'class': 'field-index'}).get_text() 47 | result.append(shop) 48 | return result 49 | 50 | 51 | def restaurantList(url): 52 | html = requests.get(url, headers=headers, timeout=30).text.replace('\r', '').replace('\n', '') 53 | table = BeautifulSoup(html, 'lxml').find('div', id='shop-all-list').find_all('li') 54 | result = [] 55 | for item in table: 56 | shop = {} 57 | soup = item.find('div', attrs={'class': 'txt'}) 58 | tit = soup.find('div', attrs={'class': 'tit'}) 59 | comment = soup.find('div', attrs={'class': 'comment'}) 60 | tag_addr = soup.find('div', attrs={'class': 'tag-addr'}) 61 | shop['name'] = tit.find('a').get_text() 62 | shop['star'] = comment.find('span').get('title') 63 | shop['review-num'] = comment.find('a', 64 | attrs={'class': 'review-num'}).get_text().replace('条点评', '') 65 | shop['mean-price'] = comment.find('a', attrs={'class': 'mean-price'}).get_text() 66 | shop['type'] = tag_addr.find('span', attrs={'class': 'tag'}).get_text() 67 | shop['addr'] = tag_addr.find('span', attrs={'class': 'addr'}).get_text() 68 | try: 69 | comment_list = soup.find('span', attrs={'class': 'comment-list'}).find_all('span') 70 | except: 71 | comment_list = [] 72 | score = [] 73 | for i in 
comment_list: 74 | score.append(i.get_text()) 75 | shop['score'] = score 76 | tags = [] 77 | try: 78 | for i in tit.find('div', attrs={'class': 'promo-icon'}).find_all('a'): 79 | try: 80 | tags += i.get('class') 81 | except: 82 | tags.append(i.get('class')[0]) 83 | except: 84 | pass 85 | shop['tags'] = tags 86 | result.append(shop) 87 | return result 88 | -------------------------------------------------------------------------------- /spiderAPI/github.py: -------------------------------------------------------------------------------- 1 | import requests 2 | from bs4 import BeautifulSoup 3 | import json 4 | 5 | 6 | headers = { 7 | 'Host': "github.com", 8 | 'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:44.0) Gecko/20100101 Firefox/44.0', 9 | 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 10 | 'Accept-Language': 'en-US,en;q=0.5', 11 | 'Accept-Encoding': 'gzip, deflate', 12 | 'Connection': 'keep-alive' 13 | } 14 | 15 | 16 | class GitHub(): 17 | 18 | def __init__(self): 19 | self.session = requests.session() 20 | self.timeline = [] 21 | self.name = '' 22 | self.user = '' 23 | self.passwd = '' 24 | 25 | def login(self): 26 | self.user = input('please input username:') 27 | self.passwd = input('please input password:') 28 | html = self.session.get('https://github.com/login', headers=headers).text 29 | authenticity_token = BeautifulSoup(html, 'lxml').find( 30 | 'input', {'name': 'authenticity_token'}).get('value') 31 | data = { 32 | 'commit': "Sign+in", 33 | 'utf8': "✓", 34 | 'login': self.user, 35 | 'password': self.passwd, 36 | 'authenticity_token': authenticity_token 37 | } 38 | html = self.session.post('https://github.com/session', data=data, headers=headers).text 39 | self.name = BeautifulSoup(html, 'lxml').find( 40 | 'strong', {'class': 'css-truncate-target'}).get_text() 41 | 42 | def get_timeline(self, page=1): 43 | html = self.session.get( 44 | 'https://github.com/dashboard/index/{page}?utf8=%E2%9C%93'.format(page=page), headers=headers).text 45 | table = BeautifulSoup(html, 'lxml').find( 46 | 'div', id='dashboard').find_all('div', {'class': 'alert'}) 47 | for item in table: 48 | line = {} 49 | line['thing'] = item.find('div', {'class': 'title'}).get_text( 50 | ).replace('\r', '').replace('\n', '') 51 | line['time'] = item.find('relative-time').get('datetime') 52 | self.timeline.append(line) 53 | 54 | def show_timeline(self): 55 | keys = ['who', 'do', 'to', 'time'] 56 | for line in self.timeline: 57 | text = line['time'] + ' ' + line['thing'] 58 | print('*' + text + ' ' * (80 - len(text) - 2) + '*') 59 | print('*-*-*' * 16) 60 | 61 | def overview(self, user=None): 62 | if user == None: 63 | user = self.name 64 | html = self.session.get('https://github.com/' + user, headers=headers).text 65 | return overview 66 | 67 | 68 | -------------------------------------------------------------------------------- /spiderAPI/lagou.py: -------------------------------------------------------------------------------- 1 | import json 2 | import requests as rq 3 | 4 | 5 | def lagou_spider(key=None, page=None): 6 | lagou_url = 'http://www.lagou.com/jobs/positionAjax.json?first=false&pn={0}&kd={1}' 7 | lagou_python_data = [] 8 | for i in range(page): 9 | print('抓取第{0}页'.format(i + 1)) 10 | lagou_url_ = lagou_url.format(i, key) 11 | lagou_data = json.loads(rq.get(lagou_url_).text) 12 | lagou_python_data.extend(lagou_data['content']['positionResult']['result']) 13 | return lagou_python_data 14 | 
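# 用法示意(假设:上面的 positionAjax 接口仍按原样返回 JSON;实际抓取时拉勾通常还要求先访问列表页获取 Cookie,
# 并携带 Referer / User-Agent 请求头,否则接口可能返回反爬提示,此处仅演示调用与落盘方式):
if __name__ == '__main__':
    import json

    positions = lagou_spider(key='数据挖掘', page=2)
    with open('lagou_positions.json', 'w', encoding='utf-8') as fp:
        json.dump(positions, fp, ensure_ascii=False, indent=2)
    print('共抓取 {0} 条职位信息'.format(len(positions)))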
-------------------------------------------------------------------------------- /spiderAPI/proxyip.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import threading 3 | import re 4 | 5 | 6 | enableips = [] 7 | 8 | 9 | class IsEnable(threading.Thread): 10 | 11 | def __init__(self, ip): 12 | super(IsEnable, self).__init__() 13 | self.ip = ip 14 | self.proxies = { 15 | 'http': 'http://%s' % ip 16 | } 17 | 18 | def run(self): 19 | global enableips 20 | try: 21 | html = requests.get('http://httpbin.org/ip', proxies=self.proxies, timeout=5).text 22 | result = eval(html)['origin'] 23 | if result in self.ip: 24 | enableips.append(self.ip) 25 | except: 26 | return False 27 | 28 | 29 | def parser(url): 30 | html = requests.get(url).text 31 | ips = re.findall('\d+\.\d+\.\d+\.\d+:\d+', html) 32 | return ips 33 | 34 | 35 | def get_enableips(): 36 | global enableips 37 | urls = ['http://proxy.ipcn.org/proxya.html', 'http://proxy.ipcn.org/proxya2.html', 38 | 'http://proxy.ipcn.org/proxyb.html', 'http://proxy.ipcn.org/proxyb2.html'] 39 | for url in urls: 40 | ips = parser(url) 41 | threadings = [] 42 | for ip in ips: 43 | work = IsEnable(ip) 44 | work.setDaemon(True) 45 | threadings.append(work) 46 | for work in threadings: 47 | work.start() 48 | for work in threadings: 49 | work.join() 50 | return enableips 51 | -------------------------------------------------------------------------------- /spiderFile/ECUT_get_grade.py: -------------------------------------------------------------------------------- 1 | import re 2 | import requests 3 | import numpy as np 4 | import pandas as pd 5 | 6 | 7 | def warn(*args, **kw): pass 8 | import warnings 9 | warnings.warn = warn 10 | 11 | print('*' * 30 + '东华理工大学' + '*' * 30) 12 | print('*' * 30 + '作者:杨航锋' + '*' * 30) 13 | print('*' * 30 + '版本:v1.0' + '*' * 30) 14 | print('\n') 15 | print('请输你学号:') 16 | username = input() 17 | print('请输入密码:') 18 | password = input() 19 | print('\n') 20 | 21 | login_url = 'https://cas.ecit.cn/index.jsp?service=http://portal.ecit.cn/Authentication' 22 | 23 | 24 | def get_LT(login_url): 25 | html = requests.get(login_url, verify=False).text 26 | regex = re.compile('') 27 | lt = re.findall(regex, html)[0] 28 | return lt 29 | 30 | LT = get_LT(login_url) 31 | 32 | data = { 33 | 'lt': LT, 34 | 'password': password, 35 | 'username': username 36 | } 37 | 38 | headers = { 39 | 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 40 | 'Accept-Encoding': 'gzip, deflate, br', 41 | 'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3', 42 | 'Connection': 'keep-alive', 43 | 'Host': 'cas.ecit.cn', 44 | 'Referer': 'https://cas.ecit.cn/index.jsp?service=http://portal.ecit.cn/Authentication', 45 | 'Upgrade-Insecure-Requests': '1', 46 | 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:48.0) Gecko/20100101 Firefox/48.0' 47 | } 48 | 49 | s = requests.Session() 50 | s.post(login_url, data=data, headers=headers, verify=False) 51 | html = s.get('http://jw.ecit.cn').text 52 | reg = re.compile('') 53 | start_url = re.findall(reg, html)[0] 54 | s.post(start_url) 55 | 56 | grade_url = 'http://jw.ecit.cn/gradeLnAllAction.do?type=ln&oper=qbinfo&lnxndm=2016-2017%D1%A7%C4%EA%B5%DA%D2%BB%D1%A7%C6%DA(%C1%BD%D1%A7%C6%DA)#qb_2016-2017%E5%AD%A6%E5%B9%B4%E7%AC%AC%E4%B8%80%E5%AD%A6%E6%9C%9F(%E4%B8%A4%E5%AD%A6%E6%9C%9F)' 57 | grade_html = s.get(grade_url).text 58 | 59 | cloumn_name_reg = re.compile('\s*(.*?)\s*') 60 | cloumn_name = re.findall(cloumn_name_reg, grade_html)[0:7] 61 | 
del cloumn_name[3] 62 | 63 | grade_reg = re.compile('\s*(.*?)\s*') 64 | grade_pre_list = re.findall(grade_reg, grade_html) 65 | grade_pre1_list = [i for i in grade_pre_list if not i.startswith(' ')] 66 | grade_list = [] 67 | for _ in grade_pre1_list: 68 | if not _.startswith('(.*?) 

') 72 | grade = re.findall(reg_, _)[0] 73 | grade_list.append(grade) 74 | 75 | grade_data_ = pd.DataFrame() 76 | grade_data = np.array(grade_list).reshape(-1, 6) 77 | for i, j in zip(cloumn_name, range(6)): 78 | grade_data_[i] = grade_data[:, j] 79 | 80 | print('1:打印最新的五门成绩') 81 | print('2:保存所有的成绩到本地文件夹') 82 | print('3:打印学位课成绩并计算平均学分绩') 83 | print('\n') 84 | select = input('请输入你的请求:') 85 | if select == '1': 86 | print(grade_data_[-5:]) 87 | elif select == '2': 88 | grade_data_.to_csv('./grade_data.csv', index=False) 89 | print('成绩已保存在运行此程序的文件夹') 90 | elif select == '3': 91 | xw_grade = grade_data_[(grade_data_['课程名'] == '*数学分析(I)') | (grade_data_['课程名'] == '高等代数(I)') | 92 | (grade_data_['课程名'] == 'C语言程序设计基础') | (grade_data_['课程名'] == '大学英语(II)') | 93 | (grade_data_['课程名'] == '*常微分方程') | (grade_data_['课程名'] == '*概率论') | 94 | (grade_data_['课程名'] == '数据结构')] 95 | print(xw_grade) 96 | print('\n') 97 | avg_grade = np.sum((xw_grade.学分.astype(float) * xw_grade.成绩.astype(float))) / \ 98 | np.sum(xw_grade.学分.astype(float)) 99 | print('平均学分绩={0}'.format(avg_grade)) 100 | input('按任意键结束') 101 | -------------------------------------------------------------------------------- /spiderFile/ECUT_pos_html.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import re from bs4 3 | import BeautifulSoup as bs 4 | 5 | 6 | def crawl_all_main_url(page=10): 7 | # 默认抓取官网前十页招聘信息的url 8 | all_url_list = [] 9 | for _ in range(1, page+1): 10 | url = 'http://zjc.ecit.edu.cn/jy/app/newslist.php?BigClassName=%D5%D0%C6%B8%D0%C5%CF%A2&Page={0}'.format(_) 11 | page_html = requests.get(url).text 12 | x_url_reg = re.compile('
(.*?)') 20 | explain_text = re.findall(explain_text_reg, html)[0] 21 | if ('时间' and '地点') in explain_text: 22 | return True 23 | else: pass 24 | def save_html(): 25 | all_url_list = crawl_all_main_url() 26 | for son_url in all_url_list: 27 | if get_title(son_url): 28 | text_html = requests.get(son_url).content.decode('gbk') 29 | domain_url = 'http://zjc.ecit.edu.cn/jy' 30 | img_url_reg = re.compile('border=0 src="\.\.(.*?)"') 31 | child_url = re.findall(img_url_reg, text_html) 32 | if child_url != []: 33 | img_url = domain_url + child_url[0] 34 | re_url = 'src="..{0}"'.format(child_url[0]) 35 | end_url = 'src="{0}"'.format(img_url) 36 | end_html = text_html.replace(re_url, end_url) 37 | soup = bs(end_html, 'lxml') 38 | text_div = soup.find_all('div', id='main')[0] 39 | with open('./{0}.html'.format(son_url[-11:]), 'wb') as file: 40 | text_html = 'U职网提供数据咨询服务 {0} '.format(text_div) file.write(text_html.encode('utf-8')) 41 | else: 42 | with open('./{0}.html'.format(son_url[-11:]), 'wb') as file: 43 | html = requests.get(son_url).content.decode('gbk') 44 | soup = bs(text_html, 'lxml') 45 | text_div = soup.find_all('div', id='main')[0] 46 | text_html = 'U职网提供数据咨询服务 {0} '.format(text_div) 47 | file.write(text_html.encode('utf-8')) 48 | else: continue 49 | if __name__ == '__main__': 50 | save_html() 51 | -------------------------------------------------------------------------------- /spiderFile/JD_spider.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import re 3 | import pandas as pd 4 | 5 | def get_data(): 6 | jj_url1 = 'http://search.jd.com/s_new.php?keyword=%E5%AE%B6%E5%B1%85%E7%94%A8%E5%93%81&enc=utf-8&qrst=1&rt=1&stop=1&pt=1&vt=2&sttr=1&offset=6&page=' 7 | jj_url2 = '&s=53&click=0' 8 | bt_ = [] 9 | _id = [] 10 | url_list = [] 11 | for i in range(1, 10, 2): 12 | jj_url = jj_url1 + str(i) + jj_url2 13 | url_list.append(jj_url) 14 | html = requests.get(jj_url).content.decode('utf-8') 15 | reg1 = re.compile('') 17 | bt = re.findall(reg1, html) 18 | id_ = re.findall(reg2, html) 19 | bt_.extend(bt) 20 | _id.extend(id_) 21 | return bt_, _id 22 | 23 | def split_str(_id): 24 | zid = [] 25 | for _ in _id: 26 | zid.append(_.split('_')[2]) 27 | return zid 28 | 29 | def save_data(zid, bt_): 30 | data = pd.DataFrame({ 31 | '标题': bt_, 32 | 'ID': zid 33 | }) 34 | data.to_excel('./家居用品.xlsx', index=False) 35 | 36 | def start_main(): 37 | bt_, _id = get_data() 38 | zid = split_str(_id) 39 | save_data(zid, bt_) 40 | 41 | if __name__ == '__main__': 42 | start_main() 43 | -------------------------------------------------------------------------------- /spiderFile/baidu_sy_img.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import re 3 | 4 | url = 'http://image.baidu.com/search/index' 5 | headers = { 6 | 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0', 7 | 'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3', 8 | 'Accept-Encoding': 'gzip, deflate', 9 | 'Referer': 'http://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&fm=detail&lm=-1&st=-1&sf=2&fmq=&pv=&ic=0&nc=1&z=&se=&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&word=%E9%AB%98%E6%B8%85%E6%91%84%E5%BD%B1&oq=%E9%AB%98%E6%B8%85%E6%91%84%E5%BD%B1&rsp=-1', 10 | 'Cookie': 'HOSUPPORT=1; UBI=fi_PncwhpxZ%7ETaMMzY0i9qXJ9ATcu3rvxFIc-a7KI9byBcYk%7EjBVmPGIbL3LTKKJ2D17mh5VfJ5yjlCncAb2yhPI5sZM51Qo7tpCemygM0VNUzuTBJwYF8OYmi3nsCCzbpo5U9tLSzkZfcQ1rxUcJSzaipThg__; 
HISTORY=fec845b215cd8e8be424cf320de232722d0050; PTOKEN=ff58b208cc3c16596889e0a20833991d; STOKEN=1b1f4b028b5a4415aa1dd9794ff061d312ad2a822d52418f3f1ffabbc0ac6142; SAVEUSERID=0868a2b4c9d166dc85e605f0dfd153; USERNAMETYPE=3; PSTM=1454309602; BAIDUID=E5493FD55CFE5424BA25B1996943B3B6:FG=1; BIDUPSID=B7D6D9EFA208B7B8C7CB6EF8F827BD4E; BDUSS=VSeFB6UXBmRWc3UEdFeXhKOFRvQm4ySmVmTkVEN2N0bldnM2o5RHdyaE54ZDlXQVFBQUFBJCQAAAAAAAAAAAEAAABzhCtU3Mbj5cfl0e8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAE04uFZNOLhWZW; H_PS_PSSID=1447_18282_17946_18205_18559_17001_17073_15479_12166_18086_10634; BDRCVFR[feWj1Vr5u3D]=I67x6TjHwwYf0; BDRCVFR[X_XKQks0S63]=mk3SLVN4HKm; BDRCVFR[dG2JNJb_ajR]=mk3SLVN4HKm', 11 | } 12 | 13 | 14 | def get_html(url, headers): 15 | data = { 16 | 'cl': '2', 17 | 'ct': '201326592', 18 | 'face': '0', 19 | 'fp': 'result', 20 | 'gsm': '200001e', 21 | 'ic': '0', 22 | 'ie': 'utf-8', 23 | 'ipn': 'rj', 24 | 'istype': '2', 25 | 'lm': '-1', 26 | 'nc': '1', 27 | 'oe': 'utf-8', 28 | 'pn': '30', 29 | 'queryword': '高清摄影', 30 | 'rn': '30', 31 | 'st': '-1', 32 | 'tn': 'resultjson_com', 33 | 'word': '高清摄影' 34 | } 35 | 36 | page = requests.get(url, data, headers=headers).text 37 | return page 38 | 39 | 40 | def get_img(page, headers): 41 | # img_url_list = [] 42 | reg = re.compile('http://.*?\.jpg') 43 | imglist1 = re.findall(reg, page) 44 | imglist2 = imglist1[0: len(imglist1): 3] 45 | # [img_url_list.append(i) for i in imglist if not i in img_url_list] 46 | x = 0 47 | for imgurl in imglist2: 48 | bin = requests.get(imgurl, headers=headers).content 49 | with open('./%s.jpg' % x, 'wb') as file: 50 | file.write(bin) 51 | x += 1 52 | 53 | if __name__ == '__main__': 54 | page = get_html(url, headers) 55 | get_img(page, headers) 56 | -------------------------------------------------------------------------------- /spiderFile/baidu_wm_img.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import re 3 | 4 | url = 'http://image.baidu.com/search/index' 5 | date = { 6 | 'cl': '2', 7 | 'ct': '201326592', 8 | 'fp': 'result', 9 | 'gsm': '1e', 10 | 'ie': 'utf-8', 11 | 'ipn': 'rj', 12 | 'istype': '2', 13 | 'lm': '-1', 14 | 'nc': '1', 15 | 'oe': 'utf-8', 16 | 'pn': '30', 17 | 'queryword': '唯美意境图片', 18 | 'rn': '30', 19 | 'st': '-1', 20 | 'tn': 'resultjson_com', 21 | 'word': '唯美意境图片' 22 | } 23 | headers = { 24 | 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0', 25 | 'Accept': 'text/plain, */*; q=0.01', 26 | 'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3', 27 | 'Accept-Encoding': 'gzip, deflate', 28 | 'X-Requested-With': 'XMLHttpRequest', 29 | 'Referer': 'http://image.baidu.com/search/index?ct=201326592&cl=2&st=-1&lm=-1&nc=1&ie=utf-8&tn=baiduimage&ipn=r&rps=1&pv=&fm=rs3&word=%E5%94%AF%E7%BE%8E%E6%84%8F%E5%A2%83%E5%9B%BE%E7%89%87&ofr=%E9%AB%98%E6%B8%85%E6%91%84%E5%BD%B1', 30 | 'Cookie': 'BDqhfp=%E5%94%AF%E7%BE%8E%E6%84%8F%E5%A2%83%E5%9B%BE%E7%89%87%26%26NaN-1undefined-1undefined%26%260%26%261; Hm_lvt_737dbb498415dd39d8abf5bc2404b290=1455016371,1455712809,1455769605,1455772886; PSTM=1454309602; BAIDUID=E5493FD55CFE5424BA25B1996943B3B6:FG=1; BIDUPSID=B7D6D9EFA208B7B8C7CB6EF8F827BD4E; BDUSS=VSeFB6UXBmRWc3UEdFeXhKOFRvQm4ySmVmTkVEN2N0bldnM2o5RHdyaE54ZDlXQVFBQUFBJCQAAAAAAAAAAAEAAABzhCtU3Mbj5cfl0e8AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAE04uFZNOLhWZW; H_PS_PSSID=1447_18282_17946_15479_12166_18086_10634; Hm_lpvt_737dbb498415dd39d8abf5bc2404b290=1455788775; firstShowTip=1; 
BDRCVFR[-pGxjrCMryR]=mk3SLVN4HKm; BDRCVFR[dG2JNJb_ajR]=mk3SLVN4HKm', 31 | 'Connection': 'keep-alive' 32 | } 33 | 34 | 35 | def get_page(url, date, headers): 36 | page = requests.get(url, date, headers=headers).text 37 | return page 38 | 39 | 40 | def get_img(page, headers): 41 | reg = re.compile('http://.*?\.jpg') 42 | imglist = re.findall(reg, page)[::3] 43 | x = 0 44 | for imgurl in imglist: 45 | with open('E:/Pic/%s.jpg' % x, 'wb') as file: 46 | file.write(requests.get(imgurl, headers=headers).content) 47 | x += 1 48 | 49 | if __name__ == '__main__': 50 | page = get_page(url, date, headers) 51 | get_img(page, headers) 52 | -------------------------------------------------------------------------------- /spiderFile/fuckCTF.py: -------------------------------------------------------------------------------- 1 | 2 | """ 3 | author: 杨航锋 4 | date : 2018.8.19 5 | mood : 嗯,比较无聊,甚至还有点想吃黄焖鸡米饭😋 6 | """ 7 | 8 | 9 | import os 10 | import random 11 | import functools 12 | 13 | from PIL import Image 14 | from selenium import webdriver 15 | 16 | 17 | class fuckCTF: 18 | 19 | def __init__(self, username, old_password): 20 | self.url = "http://hetianlab.com/" 21 | self.login_url = "http://hetianlab.com/loginLab.do" 22 | self.username = username 23 | self.old_password = old_password 24 | self.new_password = (yield_new_password(), "******")[0] 25 | self.options = webdriver.FirefoxOptions() 26 | self.options.add_argument("-headless") 27 | self.browser = webdriver.Firefox(options=self.options) 28 | print("init ok") 29 | 30 | def login_hetian(self): 31 | self.browser.get(self.login_url) 32 | self.browser.find_element_by_id("userEmail").clear() 33 | self.browser.find_element_by_id("userEmail").send_keys(self.username) 34 | self.browser.find_element_by_id("passwordIn").clear() 35 | self.browser.find_element_by_id("passwordIn").send_keys(self.old_password) 36 | self.browser.get_screenshot_as_file(self.username + '/' + "login.png") 37 | self.browser.find_element_by_id("registButIn").click() 38 | self.browser.get(self.url) 39 | print("login_hetian running ok!") 40 | 41 | def get_personl_information_page(self): 42 | grzx_btn = self.browser.find_element_by_xpath("/html/body/div[1]/div[1]/div/div/div[2]/ul/li[2]/a") 43 | self.browser.execute_script("$(arguments[0]).click()", grzx_btn) 44 | self.browser.get("http://hetianlab.com/getUserInfo.do") 45 | print("get_personl_information_page running ok!") 46 | 47 | def get_password_setting_page(self): 48 | mmsz_btn = self.browser.find_element_by_xpath("/html/body/div[2]/div/div[1]/ul/ul[3]/li[2]") 49 | self.browser.execute_script("$(arguments[0]).click()", mmsz_btn) 50 | self.browser.find_element_by_id("person").click() 51 | self.browser.find_element_by_class_name("check") 52 | print("get_password_setting_page running ok!") 53 | 54 | def setting_password(self): 55 | self.browser.find_element_by_id("oldpwd").clear() 56 | self.browser.find_element_by_id("oldpwd").send_keys(self.old_password) 57 | self.browser.find_element_by_id("newpwd").clear() 58 | self.browser.find_element_by_id("newpwd").send_keys(self.new_password) 59 | self.browser.find_element_by_id("quepwd").clear() 60 | self.browser.find_element_by_id("quepwd").send_keys(self.new_password) 61 | print("setting_password running ok!") 62 | 63 | def get_v_code(self): 64 | status = self.browser.get_screenshot_as_file(self.username + '/' + "v_code.png") 65 | if status: 66 | img = Image.open(self.username + '/' + "v_code.png") 67 | img.show() 68 | self.v_code = input("请输入验证码: ") 69 | 
self.browser.find_element_by_class_name("code").send_keys(self.v_code) 70 | else: 71 | raise("截屏失败!") 72 | print("get_v_code running ok!") 73 | 74 | def submit_data(self): 75 | self.browser.find_element_by_id("submitbtn").click() 76 | self.browser.get_screenshot_as_file(self.username + '/' + "result.png") 77 | self.browser.quit() 78 | print("submit_data running ok!") 79 | 80 | def make_portfolio(self): 81 | if not os.path.exists(self.username): 82 | os.makedirs(self.username) 83 | print("make_portfolio running ok!") 84 | 85 | def save_success_data(self): 86 | with open("./username_and_password_data_successed.log", "a+") as fp: 87 | fp.write( 88 | "username" + ": {}".format(self.username) + "\t" 89 | "password" + ": {}".format(self.new_password) + 90 | "\n" 91 | ) 92 | print("save_success_data running ok!") 93 | 94 | def save_failed_data(self): 95 | with open("./username_and_password_data_failed.log", "a+") as fp: 96 | fp.write( 97 | "username" + ": {}".format(self.username) + "\n" 98 | ) 99 | print("save_failed_data running ok!") 100 | 101 | def main(self): 102 | try: 103 | self.make_portfolio() 104 | self.login_hetian() 105 | self.get_personl_information_page() 106 | self.get_password_setting_page() 107 | self.setting_password() 108 | self.get_v_code() 109 | self.submit_data() 110 | self.save_success_data() 111 | except: 112 | self.save_failed_data() 113 | 114 | 115 | def gen_decorator(gen): 116 | @functools.wraps(gen) 117 | def inner(*args, **kwargs): 118 | return next(gen(*args, **kwargs)) 119 | return inner 120 | 121 | 122 | @gen_decorator 123 | def yield_new_password(): 124 | strings = list("abcdefghijklmnopqrstuvwxyz1234567890!@#$%^&*()") 125 | yield "".join(random.choices(strings, k=6)) 126 | 127 | 128 | def yield_usernames(n): 129 | prefix = "ctf2018_gzhu" 130 | postfix = "@dh.com" 131 | for num in range(n): 132 | if num < 10: 133 | infix = '0' + str(num) 134 | else: 135 | infix = str(num) 136 | yield prefix + infix + postfix 137 | 138 | 139 | if __name__ == "__main__": 140 | for username in yield_usernames(100): 141 | ctfer = fuckCTF(username, "******") 142 | ctfer.main() 143 | -------------------------------------------------------------------------------- /spiderFile/get_baike.py: -------------------------------------------------------------------------------- 1 | import re 2 | import requests as rq 3 | 4 | def get_baidubaike(): 5 | 6 | keyword = input('please input wordkey:') 7 | url = 'http://baike.baidu.com/item/{}'.format(keyword) 8 | html = rq.get(url).content.decode('utf-8') 9 | 10 | regex = re.compile('content="(.*?)">') 11 | words = re.findall(regex, html)[0] 12 | return words 13 | 14 | if __name__ == '__main__': 15 | words = get_baidubaike() 16 | print(words) 17 | 18 | 19 | 20 | 21 | -------------------------------------------------------------------------------- /spiderFile/get_history_weather.py: -------------------------------------------------------------------------------- 1 | import re 2 | import pandas as pd 3 | import requests as rq 4 | from bs4 import BeautifulSoup 5 | 6 | 7 | def get_data(url): 8 | html = rq.get(url).content.decode("gbk") 9 | soup = BeautifulSoup(html, "html.parser") 10 | tr_list = soup.find_all("tr") 11 | dates, conditions, temperatures = [], [], [] 12 | for data in tr_list[1:]: 13 | sub_data = data.text.split() 14 | dates.append(sub_data[0]) 15 | conditions.append("".join(sub_data[1:3])) 16 | temperatures.append("".join(sub_data[3:6])) 17 | _data = pd.DataFrame() 18 | _data["日期"] = dates 19 | _data["天气状况"] = conditions 20 | _data["气温"] = 
temperatures 21 | return _data 22 | 23 | # 获取广州市2019年第一季度天气状况 24 | data_1_month = get_data("http://www.tianqihoubao.com/lishi/guangzhou/month/201901.html") 25 | data_2_month = get_data("http://www.tianqihoubao.com/lishi/guangzhou/month/201902.html") 26 | data_3_month = get_data("http://www.tianqihoubao.com/lishi/guangzhou/month/201903.html") 27 | 28 | 29 | data = pd.concat([data_1_month, data_2_month, data_3_month]).reset_index(drop=True) 30 | 31 | data.to_csv("guangzhou_history_weather_data.csv", index=False, encoding="utf-8") 32 | -------------------------------------------------------------------------------- /spiderFile/get_photos.py: -------------------------------------------------------------------------------- 1 | import requests 2 | from bs4 import BeautifulSoup 3 | 4 | url = 'http://tieba.baidu.com/p/4178314700' 5 | 6 | 7 | def GetHtml(url): 8 | html = requests.get(url).text 9 | return html 10 | 11 | 12 | def GetImg(html): 13 | soup = BeautifulSoup(html, 'html.parser') 14 | imglist = [] 15 | for photourl in soup.find_all('img'): 16 | imglist.append(photourl.get('src')) 17 | x = 0 18 | for imgurl in imglist: 19 | with open('E:/Pic/%s.jpg' % x, 'wb') as file: 20 | file.write(requests.get(imgurl).content) 21 | x += 1 22 | 23 | if __name__ == '__main__': 24 | html = GetHtml(url) 25 | GetImg(html) 26 | -------------------------------------------------------------------------------- /spiderFile/get_tj_accident_info.py: -------------------------------------------------------------------------------- 1 | import re 2 | import joblib 3 | import asyncio 4 | import aiohttp 5 | import requests as rq 6 | from bs4 import BeautifulSoup 7 | 8 | def yield_all_page_url(root_url, page=51): 9 | """生成所有的页面url 10 | @param root_url: 首页url 11 | type root_url: str 12 | @param page: 爬取的页面个数 13 | type page: int 14 | """ 15 | # 观察网站翻页结构可知 16 | page_url_list = [f"{root_url}index_{i}.html" for i in range(1, page)] 17 | # 添加首页url 18 | page_url_list.insert(0, root_url) 19 | return page_url_list 20 | 21 | async def get_info_page_url(url, session): 22 | regex = re.compile("') 43 | html = rq.get(url, headers=HEADERS).content.decode("utf-8") 44 | soup = BeautifulSoup(html) 45 | title = re.search(title_regex, html) 46 | content_1 = soup.find("div", class_="TRS_UEDITOR TRS_WEB") 47 | content_2 = soup.find("div", class_="view TRS_UEDITOR trs_paper_default trs_word") 48 | content_3 = soup.find("div", class_="view TRS_UEDITOR trs_paper_default trs_web") 49 | if content_1: 50 | content = content_1.text 51 | elif content_2: 52 | content = content_2.text 53 | elif content_3: 54 | content = content_3.text 55 | else: 56 | content = "" 57 | return {"title": title.groups()[0], "content": content} 58 | 59 | def get_all_data(all_info_page_url_list): 60 | all_data = [] 61 | for i, url in enumerate(all_info_page_url_list): 62 | all_data.append(get_data(url)) 63 | print(i, url, all_data[-1]) 64 | joblib.dump(all_data, "all_data.joblib") 65 | 66 | 67 | if __name__ == "__main__": 68 | root_url = "http://yjgl.tj.gov.cn/ZWGK6939/SGXX3106/" 69 | agent_part_1 = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " 70 | agent_part_2 = "(KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36" 71 | HEADERS = {"Host": "yjgl.tj.gov.cn", 72 | "Connection": "keep-alive", 73 | "User-Agent": agent_part_1 + agent_part_2, 74 | "Referer": "http://static.bshare.cn/"} 75 | page_url_list = yield_all_page_url(root_url, page=51) 76 | all_info_page_url_list = asyncio.run(get_all_info_page_url(root_url, page_url_list)) 77 | 
joblib.dump("all_info_page_url_list", all_info_page_url_list) 78 | -------------------------------------------------------------------------------- /spiderFile/get_top_sec_com.py: -------------------------------------------------------------------------------- 1 | import re 2 | import os 3 | import time 4 | import joblib 5 | import asyncio 6 | import aiohttp 7 | import requests as rq 8 | 9 | import pandas as pd 10 | import matplotlib.pyplot as plt 11 | # import nest_asyncio 12 | # nest_asyncio.apply() 13 | 14 | class getTopSecCom: 15 | def __init__(self, top=None): 16 | self.headers = {"Referer": "http://quote.eastmoney.com/", 17 | "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36"} 18 | self.bk_url = "http://71.push2.eastmoney.com/api/qt/clist/get?cb=jQuery1124034348162124675374_1612595298605&pn=1&pz=85&po=1&np=1&ut=bd1d9ddb04089700cf9c27f6f7426281&fltt=2&invt=2&fid=f62&fs=b:BK0655&fields=f12,f14&_=1612595298611" 19 | self.shares_api = "https://xueqiu.com/S/" 20 | self.top = top 21 | if not os.path.exists("./useful_sec_com_list"): 22 | self.useful_sec_com_list = self.get_sec_com_code() 23 | else: 24 | with open("./useful_sec_com_list", "rb") as fp: 25 | self.useful_sec_com_list = joblib.load(fp) 26 | 27 | def get_sec_com_code(self): 28 | html = rq.get(self.bk_url, headers=self.headers).content.decode("utf-8") 29 | sec_com_list = eval(re.findall("\[(.*?)\]", html)[0]) 30 | useful_sec_com_list = [[i["f12"], i["f14"]] for i in sec_com_list if "ST" not in i["f14"]] 31 | 32 | # 0和3开头的为深证上市股票前缀为sz,6开头的为上证上市股票前缀为sh 33 | for sec_com in useful_sec_com_list: 34 | if sec_com[0][0] == "6": 35 | sec_com[0] = "sh" + sec_com[0] 36 | else: 37 | sec_com[0] = "sz" + sec_com[0] 38 | with open("useful_sec_com_list", "wb") as fp: 39 | joblib.dump(useful_sec_com_list, fp) 40 | return useful_sec_com_list 41 | 42 | async def async_get_shares_details(self, sec_com, url): 43 | async with aiohttp.ClientSession() as session: 44 | async with session.get(url, headers=self.headers) as response: 45 | html = await response.text() 46 | market_value = re.search("总市值:(.*?)亿", html) 47 | if market_value: 48 | return [*sec_com, market_value.groups()[0]] 49 | 50 | async def async_get_all_shares(self): 51 | tasks = [] 52 | for sec_com in self.useful_sec_com_list: 53 | url = self.shares_api + sec_com[0] 54 | tasks.append( 55 | asyncio.create_task( 56 | self.async_get_shares_details(sec_com, url) 57 | ) 58 | ) 59 | done, pendding = await asyncio.wait(tasks) 60 | return [share.result() for share in done if share.result()] 61 | 62 | def get_shares_details(self): 63 | all_shares = [] 64 | for sec_com in self.useful_sec_com_list: 65 | url = self.shares_api + sec_com[0] 66 | response = rq.get(url, headers=self.headers).content.decode("utf-8") 67 | market_value = re.search("总市值:(.*?)亿", response) 68 | if market_value: 69 | all_shares.append([*sec_com, market_value.groups()[0]]) 70 | return all_shares 71 | 72 | def yield_picture(self, save_path): 73 | # all_shares = self.get_shares_details() # 同步代码 74 | all_shares = asyncio.run(self.async_get_all_shares()) # 异步代码 75 | df = pd.DataFrame(all_shares, columns=["股票代码", "公司", "市值(亿)"]) 76 | df["市值(亿)"] = df["市值(亿)"].astype(float) 77 | date = time.strftime("%Y年%m月%d日", time.localtime()) 78 | df.sort_values(by="市值(亿)", ascending=False, inplace=True) 79 | df.index = range(1, df.shape[0]+1) 80 | 81 | plt.rcParams['font.sans-serif'] = ['SimHei'] 82 | plt.rcParams['axes.unicode_minus'] = False 83 | 84 | 85 | fig = 
plt.figure(dpi=400) 86 | ax = fig.add_subplot(111, frame_on=False) 87 | ax.xaxis.set_visible(False) 88 | ax.yaxis.set_visible(False) 89 | _ = pd.plotting.table(ax, df, loc="best", cellLoc="center") 90 | ax.set_title(f"{date}A股网安版块公司市值排名", fontsize=10) 91 | plt.savefig(save_path, bbox_inches="tight") 92 | 93 | if __name__ == "__main__": 94 | m = getTopSecCom() 95 | m.yield_picture("rank.png") 96 | -------------------------------------------------------------------------------- /spiderFile/get_web_all_img.py: -------------------------------------------------------------------------------- 1 | import re 2 | import time 3 | import requests 4 | 5 | 6 | def get_html(url, headers): 7 | html = requests.get(url, timeout=100, headers=headers).text 8 | return html 9 | 10 | 11 | def get_main_url(html): 12 | reg = re.compile('http://.*?\.jpg') 13 | main_imglist = re.findall(reg, html) 14 | return main_imglist 15 | 16 | 17 | def get_son_url(html): 18 | initurl = 'http://www.woyaogexing.com' 19 | reg = re.compile('/tupian/weimei/\d+/\d+\.html') 20 | son_urllist_init = re.findall(reg, html) 21 | son_urlist = set(son_urllist_init) 22 | son_url_final = [] 23 | for son_url in son_urlist: 24 | son_url_final.append(initurl + son_url) 25 | return son_url_final # 结果是所有含有图片的网页地址 26 | 27 | 28 | def get_all_sonurl(son_url_final, headers): 29 | son_imglist = [] 30 | for sonurl in son_url_final: 31 | son_html = requests.get(sonurl, timeout=100, headers=headers).text 32 | son_reg = re.compile('http://.*?\.jpg') 33 | son_imglist1 = re.findall(son_reg, son_html) 34 | for temp in son_imglist1: 35 | son_imglist.append(temp) 36 | return son_imglist # 结果是所有子网页图片的地址 37 | 38 | 39 | def get_all_img(main_imglist, son_imglist, headers): 40 | global x # 使用全局变量使每次的变量不清除,这个问题有待完美解决! 41 | for imgurl in main_imglist: 42 | son_imglist.append(imgurl) 43 | for imgurl in son_imglist: 44 | with open('E:/Pic2/%s.jpg' % x, 'wb') as file: 45 | file.write(requests.get(imgurl, timeout=100, headers=headers).content) 46 | time.sleep(0.1) 47 | x += 1 48 | 49 | 50 | def turn_page(): 51 | page_list = ['http://www.woyaogexing.com/tupian/weimei/index.html'] 52 | for i in range(1, 7): 53 | page_list.append('http://www.woyaogexing.com/tupian/weimei/index_' + str(i) + '.html') 54 | return page_list 55 | 56 | if __name__ == '__main__': 57 | headers = { 58 | 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0', 59 | 'Accept': 'text/plain, */*; q=0.01', 60 | 'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3', 61 | 'Accept-Encoding': 'gzip, deflate', 62 | 'Cookie': 'bdshare_firstime=1456041345958; Hm_lvt_a077b6b44aeefe3829d03416d9cb4ec3=1456041346; Hm_lpvt_a077b6b44aeefe3829d03416d9cb4ec3=1456048504', 63 | } 64 | x = 0 65 | page_list = ['http://www.woyaogexing.com/tupian/weimei/index.html'] 66 | for i in range(2, 20): 67 | page_list.append('http://www.woyaogexing.com/tupian/weimei/index_' + str(i) + '.html') 68 | for p in range(6): 69 | html = get_html(page_list[p], headers) 70 | main_imglist = get_main_url(html) 71 | son_url_final = get_son_url(html) 72 | son_imglist = get_all_sonurl(son_url_final, headers) 73 | get_all_img(main_imglist, son_imglist, headers) 74 | -------------------------------------------------------------------------------- /spiderFile/github_hot.py: -------------------------------------------------------------------------------- 1 | import re 2 | import requests 3 | import pandas as pd 4 | import numpy as np 5 | 6 | 7 | def hot_github(keyword): 8 | url = 
'https://github.com/trending/{0}'.format(keyword) 9 | main_url = 'https://github.com{0}' 10 | html = requests.get(url).content.decode('utf-8') 11 | reg_hot_url = re.compile('<h3>\s*<a href="(.*?)"')  # 注:原正则中的HTML标签在归档时被剥离,本行与下方 url_abstract_reg 的选择器为按页面结构的近似重建 12 | hot_url = [main_url.format(i) for i in re.findall(reg_hot_url, html)] 13 | url_abstract_reg = re.compile('<p class="[^"]*">\s*(.*?)\s*</p>
') 14 | summary_text = re.findall(url_abstract_reg, html) 15 | hotDF = pd.DataFrame() 16 | hotDF['项目简介'] = summary_text 17 | hotDF['项目地址'] = hot_url 18 | hotDF.to_csv('./github_hot.csv', index=False) 19 | 20 | if __name__ == '__main__': 21 | keyword = input('请输入查找的热门语言:') 22 | hot_github(keyword) 23 | -------------------------------------------------------------------------------- /spiderFile/kantuSpider.py: -------------------------------------------------------------------------------- 1 | import re 2 | import os 3 | import time 4 | 5 | import requests as rq 6 | 7 | 8 | def get_all_page(page): 9 | url = 'http://52kantu.cn/?page={}'.format(page) 10 | html = rq.get(url).text 11 | 12 | return html 13 | 14 | 15 | def get_img_url(html): 16 | regex = re.compile('') 10 | img_url = re.findall(reg, page) 11 | if img_url != []: 12 | with open('./{}.jpg'.format(count), 'wb') as file: 13 | try: 14 | img_data = requests.get(img_url[0]).content 15 | file.write(img_data) 16 | count += 1 17 | except: 18 | pass 19 | print('OK!') 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | -------------------------------------------------------------------------------- /spiderFile/one_update.py: -------------------------------------------------------------------------------- 1 | import re 2 | import requests as rq 3 | 4 | ROOT_URL = "http://wufazhuce.com/one/" 5 | URL_NUM = 14 6 | 7 | def yield_url(ROOT_URL, URL_NUM): 8 | return ROOT_URL + str(URL_NUM) 9 | 10 | def get_html(url): 11 | return rq.get(url).content.decode("utf-8") 12 | 13 | def get_data(html): 14 | img_url_regex = re.compile('') 15 | cite_regex = re.compile('
(.*?)
', re.S) 16 | img_url = re.findall(img_url_regex, html)[0] 17 | cite = re.findall(cite_regex, html)[0].strip() 18 | return img_url, cite 19 | 20 | def save_data(img_url, cite, URL_NUM): 21 | with open("./{}.jpg".format(URL_NUM), "wb") as fp: 22 | fp.write(rq.get(img_url).content) 23 | with open("./cite{}.txt".format(URL_NUM), "w") as fp: 24 | fp.write(cite) 25 | return URL_NUM + 1 26 | 27 | def main(ROOT_URL, URL_NUM, number): 28 | for _ in range(number): 29 | url = yield_url(ROOT_URL, URL_NUM) 30 | html = get_html(url) 31 | img_url, cite = get_data(html) 32 | URL_NUM = save_data(img_url, cite, URL_NUM) 33 | 34 | if __name__ == "__main__": 35 | try: 36 | main(ROOT_URL, URL_NUM, 20) 37 | except: 38 | pass 39 | -------------------------------------------------------------------------------- /spiderFile/search_useful_camera_ip_address.py: -------------------------------------------------------------------------------- 1 | import re 2 | import tqdm 3 | import time 4 | from selenium import webdriver 5 | from selenium.webdriver.common.by import By 6 | from selenium.webdriver.chrome.options import Options 7 | from selenium.webdriver.support.ui import WebDriverWait 8 | from selenium.webdriver.support import expected_conditions as EC 9 | from selenium.common.exceptions import NoAlertPresentException, TimeoutException 10 | 11 | # 扫描网站可自己寻找,代码仅演示逻辑 12 | country = "IN" #印度 13 | city = "" 14 | login_url = "" 15 | query_url = "" 16 | city_url = "" 17 | USER_NAME = "" 18 | PASSWORD = "" 19 | 20 | # 无头浏览器配置 21 | chrome_options = Options() 22 | chrome_options.add_argument("--headless") 23 | chrome_options.add_argument("--disable-gpu") 24 | chrome_options.add_argument("log-level=3") 25 | browser = webdriver.Chrome(chrome_options=chrome_options) 26 | browser.set_page_load_timeout(10) 27 | 28 | #登录模块 29 | browser.get(login_url) 30 | WebDriverWait(browser, 30).until( 31 | EC.presence_of_element_located((By.XPATH, '//*[@name="login_submit"]')) 32 | ) 33 | browser.find_element_by_id("username").clear() 34 | browser.find_element_by_id("username").send_keys(USER_NAME) 35 | browser.find_element_by_id("password").clear() 36 | browser.find_element_by_id("password").send_keys(PASSWORD) 37 | browser.find_element_by_name("login_submit").click() 38 | 39 | #抓取潜在的摄像头url,默认抓取两页 40 | if city: 41 | query_url += city_url 42 | 43 | latent_camera_url = [] 44 | browser.get(query_url) 45 | WebDriverWait(browser, 30).until( 46 | EC.presence_of_element_located((By.CLASS_NAME, 'button')) 47 | ) 48 | html = browser.page_source 49 | latent_camera_url += re.findall('
') 16 | kids_url = [domain_url.format(i) for i in re.findall(kids_url_regex, start_html)] 17 | for kid_url in kids_url: 18 | all_pic_urls = [] 19 | pic_html = requests.get(kid_url).content.decode('gb2312') 20 | # 抓取标题 21 | title_regex = re.compile('(.*?)') 22 | title = re.findall(title_regex, pic_html)[0] 23 | # 抓取封面图片url 24 | parent_pic_regex = re.compile('') 28 | kids_pic_url = re.findall(kids_pic_regex, pic_html) 29 | # 合并封面url列表和子图url列表 30 | all_pic_urls.extend(parent_pic) 31 | all_pic_urls.extend(kids_pic_url) 32 | # 下载并存储图片 33 | if not os.path.exists('./{0}'.format(title)): 34 | os.mkdir('./{0}'.format(title)) 35 | s = requests.Session() 36 | for count, pic_url in enumerate(all_pic_urls): 37 | with open('./{0}/{1}.jpg'.format(title, count), 'wb') as file: 38 | try: 39 | file.write(s.get(pic_url, timeout=5).content) 40 | except: 41 | pass 42 | 43 | if __name__ == '__main__': 44 | Spidermain() 45 | --------------------------------------------------------------------------------