├── README.md └── 拉勾网 ├── 不同城市不同岗位数据总数.ipynb ├── 全部城市数据分析岗位招聘数据.ipynb └── 拉勾网反爬机制及攻克方式总结 /README.md: -------------------------------------------------------------------------------- 1 | # Python Web-Scraping Study Notes and Practice Projects 2 | 3 | ## Contents 4 | ### Reference books 5 | * 《Python网络爬虫从入门到精通》 6 | * 《Python编程从入门到实践》 7 | ### Study notes 8 | ### Scraping practice projects 9 | * __Lagou (拉勾网) job postings__ 10 | - [Total number of postings by city and position](https://github.com/EvelynZP/Python-Spider/blob/master/%E6%8B%89%E5%8B%BE%E7%BD%91/%E4%B8%8D%E5%90%8C%E5%9F%8E%E5%B8%82%E4%B8%8D%E5%90%8C%E5%B2%97%E4%BD%8D%E6%95%B0%E6%8D%AE%E6%80%BB%E6%95%B0.ipynb) 11 | - [Scraping data-analyst postings for all cities](https://github.com/EvelynZP/Python-Spider/blob/master/%E6%8B%89%E5%8B%BE%E7%BD%91/%E5%85%A8%E9%83%A8%E5%9F%8E%E5%B8%82%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90%E5%B2%97%E4%BD%8D%E6%8B%9B%E8%81%98%E6%95%B0%E6%8D%AE.ipynb) 12 | - Analysis of data-analyst positions based on the scraped Lagou data 13 | - [Summary of Lagou's anti-scraping measures and how to get around them](https://github.com/EvelynZP/Python-Spider/blob/master/%E6%8B%89%E5%8B%BE%E7%BD%91/%E6%8B%89%E5%8B%BE%E7%BD%91%E5%8F%8D%E7%88%AC%E6%9C%BA%E5%88%B6%E5%8F%8A%E6%94%BB%E5%85%8B%E6%96%B9%E5%BC%8F%E6%80%BB%E7%BB%93) 14 | * __Douban (豆瓣) movie short reviews__ 15 | -------------------------------------------------------------------------------- /拉勾网/不同城市不同岗位数据总数.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 31, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stdout", 10 | "output_type": "stream", 11 | "text": [ 12 | "540\n" 13 | ] 14 | } 15 | ], 16 | "source": [ 17 | "import requests\n", 18 | "import time\n", 19 | "import json \n", 20 | "import random\n", 21 | "from urllib import parse\n", 22 | "def main(city,keyword): \n", 23 | " proxie = [\n", 24 | " \"134.249.156.3:82\",\n", 25 | " \"1.198.72.239:9999\",\n", 26 | " \"103.26.245.190:43328\"\n", 27 | " ]\n", 28 | " proxies = {\n", 29 | " \"http\": random.choice(proxie) # pick a single proxy string, not a one-element list\n", 30 | " }\n", 31 | " city = parse.quote(city)\n", 32 | " keyword = parse.quote(keyword)\n", 33 | " url_start = \"https://www.lagou.com/jobs/list_\"+keyword+\"?city=\"+city+\"&cl=false&fromSearch=true&labelWords=&suginput=\" \n", 34 | " url_parse = \"https://www.lagou.com/jobs/positionAjax.json?city=\"+city+\"&needAddtionalResult=false\"\n", 35 | " \n", 36 | " headers = { \n", 37 | " 'Accept': 'application/json, text/javascript, */*; q=0.01', \n", 38 | " 'Referer': url_start, \n", 39 | " 'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36' \n", 40 | " } \n", 41 | " keyword_un = parse.unquote(keyword)\n", 42 | " data = { 'first': 'true', 'pn': 1, 'kd': keyword_un }\n", 43 | " s = requests.Session() \n", 44 | " s.get(url_start, headers=headers, timeout=3) # request the landing page first to obtain cookies \n", 45 | " cookie = s.cookies # cookies issued for this search \n", 46 | " response = s.post(url_parse, data=data, headers=headers, proxies=proxies,cookies=cookie, timeout=3) # POST for the JSON payload of this search \n", 47 | " time.sleep(5)\n", 48 | " response.encoding = response.apparent_encoding \n", 49 | " text = json.loads(response.text)\n", 50 | " \n", 51 | " info = text[\"content\"][\"positionResult\"][\"totalCount\"] \n", 52 | " print(info)\n", 53 | "if __name__ == '__main__':\n", 54 | " city = '上海'\n", 55 | " keyword = '数据分析'\n", 56 | " main(city,keyword)" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [] 65 | } 66 | ], 67 | "metadata": { 68 | "kernelspec": { 69 | "display_name": "Python 3", 70 | "language": "python", 71 | "name": "python3" 72 | }, 73 | "language_info": 
{ 74 | "codemirror_mode": { 75 | "name": "ipython", 76 | "version": 3 77 | }, 78 | "file_extension": ".py", 79 | "mimetype": "text/x-python", 80 | "name": "python", 81 | "nbconvert_exporter": "python", 82 | "pygments_lexer": "ipython3", 83 | "version": "3.7.0" 84 | } 85 | }, 86 | "nbformat": 4, 87 | "nbformat_minor": 2 88 | } 89 | -------------------------------------------------------------------------------- /拉勾网/全部城市数据分析岗位招聘数据.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 33, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import requests\n", 10 | "from urllib import parse\n", 11 | "from lxml import etree\n", 12 | "import random\n", 13 | "import time\n", 14 | "import math\n", 15 | "import pandas as pd\n", 16 | "import json\n", 17 | "from bs4 import BeautifulSoup\n", 18 | "import re \n", 19 | "from requests.packages.urllib3.exceptions import InsecureRequestWarning \n", 20 | "# 禁用安全请求警告 \n", 21 | "requests.packages.urllib3.disable_warnings(InsecureRequestWarning)\n", 22 | "\n", 23 | "'''\n", 24 | "获取拉勾网在线的所有城市名称\n", 25 | "页面解析:在源代码可以直接查看到,故为静态网页,可以直接提取,利用BeatuifulSoup解析返回结果\n", 26 | "keyword:所要查询关键字,如:'数据分析'\n", 27 | "'''\n", 28 | "def get_citys(keyword):\n", 29 | " keyword = parse.quote(keyword)\n", 30 | " link = \"https://www.lagou.com/jobs/allCity.html?keyword=\"+keyword+\"&px=default&city=%E5%85%A8%E5%9B%BD&positionNum=500+&companyNum=0&isCompanySelected=false&labelWords=\"\n", 31 | " headers = {\n", 32 | " 'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Mobile Safari/537.36'\n", 33 | " }\n", 34 | " r = requests.get(link,headers=headers)\n", 35 | " citys = []\n", 36 | " soup = BeautifulSoup(r.text,'html.parser')\n", 37 | " city_word_list = soup.find_all('ul',class_ ='city_list')\n", 38 | " for i in range(len(city_word_list)):\n", 39 | " city_list = city_word_list[i].find_all('li')\n", 40 | " for j in range(len(city_list)):\n", 41 | " city = city_list[j].a.text.strip()\n", 42 | " citys.append(city)\n", 43 | "# print(city)\n", 44 | " return citys\n", 45 | " " 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": 34, 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "def get_url(city,keyword,gm):\n", 55 | " \n", 56 | " '''\n", 57 | " 编码为URL格式,直接传送的地址非URL格式需要利用urllib.parse进行处理\n", 58 | " 错误类型:UnicodeEncodeError\n", 59 | " 'latin-1' codec can't encode characters in position 87-89: ordinal not in range(256)\n", 60 | " '''\n", 61 | " city = parse.quote(city)\n", 62 | " keyword = parse.quote(keyword)\n", 63 | " gm = parse.quote(gm)\n", 64 | " \n", 65 | " url_parse = \"https://www.lagou.com/jobs/positionAjax.json?px=default&city=\"+city+\"&needAddtionalResult=false&gm=\"+gm\n", 66 | " url_start = \"https://www.lagou.com/jobs/list_\"+keyword+\"?px=default&gm=\"+ gm + \"&city=\"+city\n", 67 | " return url_parse,url_start" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": 35, 73 | "metadata": {}, 74 | "outputs": [], 75 | "source": [ 76 | "#随机选择执行的IP,防止被封\n", 77 | "def get_proxies():#goole免费代理IP替换或者添加,越多越好\n", 78 | " proxie = [\n", 79 | " \"134.249.156.3:82\",\n", 80 | " \"1.198.72.239:9999\",\n", 81 | " \"103.26.245.190:43328\"]\n", 82 | " \n", 83 | " proxies = {\"http\":str(random.sample(proxie,1))}\n", 84 | " return proxies\n", 85 | "def get_agents():#替换你自己的User-Agent,直接运行不了\n", 86 | " agents = [\n", 87 | " '**Mozilla/5.0 (Windows NT 
10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3626.121 Safari/537.36',\n", 88 | " '**Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3729.157 Mobile Safari/537.36'\n", 89 | " ]\n", 90 | " agent = random.sample(agents,1)\n", 91 | " return agent\n", 92 | "#定制headers,设置模拟浏览器,反爬\n", 93 | "def get_headers(url_start,agent):\n", 94 | " url_start = url_start\n", 95 | " agent = agent\n", 96 | " headers = {\n", 97 | " #所有数据均来自Network-XHR-headers-Requsts Headers\n", 98 | " 'Accept': 'application/json, text/javascript, */*; q=0.01', \n", 99 | " 'Referer': url_start,\n", 100 | " #因为本地的User-Agent已经被封锁,此时有了网上的一个虚拟的,为了反爬可以随机使用几个,类似于下面的proxies\n", 101 | "# 'Cookie':'user_trace_token=20190515230141-12934d87-ec36-47a4-b6f5-b9fdf6116988; _ga=GA1.2.1677629794.1557932504; _gat=1; LGSID=20190515230142-597afe8c-7722-11e9-99a5-525400f775ce; PRE_UTM=m_cf_cpt_sogou_ztch6; PRE_HOST=; PRE_SITE=; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2Flanding-page%2Fpc%2Fposition2.html%3Futm_source%3Dm_cf_cpt_sogou_ztch6; LGUID=20190515230142-597b0038-7722-11e9-99a5-525400f775ce; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1557932504; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2216abc03944948-07d289ba2ecf5-353166-921600-16abc039450303%22%2C%22%24device_id%22%3A%2216abc03944948-07d289ba2ecf5-353166-921600-16abc039450303%22%7D; sajssdk_2015_cross_new_user=1; _gid=GA1.2.694067267.1557932513; LG_LOGIN_USER_ID=90be76343c3af8869dd6d12563545c28b420c207df6a38cceeb53f0a5d64da34; LG_HAS_LOGIN=1; _putrc=A7C9A275E2B1F6EB123F89F2B170EADC; JSESSIONID=ABAAABAAADEAAFIAD4122C7F738CEEAE9C99959A9756EB3; login=true; unick=%E5%BC%A0%E5%B9%B3; showExpriedIndex=1; showExpriedCompanyHome=1; showExpriedMyPublish=1; hasDeliver=1; gate_login_token=450a1f502d12ad1d9f320d846f06d1ad92603529c81fa8ff8921a53bcf54a82d; index_location_city=%E5%8C%97%E4%BA%AC; X_HTTP_TOKEN=c7dc7114524f2f9f15523975516bfb19cb3eab323b; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1557932554; LGRID=20190515230231-76ee1774-7722-11e9-99a5-525400f775ce; SEARCH_ID=caaf8eb6597d4ecbbe9a2d807d68e78f',\n", 102 | " 'Host':'www.lagou.com',\n", 103 | " 'Origin': 'https://www.lagou.com',\n", 104 | " 'User-Agent': str(agent),\n", 105 | " 'X-Anit-Forge-Code': '0',\n", 106 | " 'X-Anit-Forge-Token': 'None',\n", 107 | " 'X-Requested-With': 'XMLHttpRequest',\n", 108 | " 'Connection':'keep_alive'\n", 109 | " } \n", 110 | " return headers" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": 36, 116 | "metadata": {}, 117 | "outputs": [], 118 | "source": [ 119 | "def get_json(city,keyword,gm,pn):\n", 120 | " #kd是搜索的关键字,若改为“数据分析”即是数据分析相关的数据\n", 121 | " data = { 'first': 'true', 'pn': pn, 'kd': keyword }#keyword若是中文,需要为编码的形式\n", 122 | " url_parse,url_start = get_url(city,keyword,gm)\n", 123 | " agent = get_agents()\n", 124 | " headers = get_headers(url_start,agent)\n", 125 | " proxies = get_proxies()\n", 126 | " print(proxies)\n", 127 | " # 请求首页获取cookies \n", 128 | " s = requests.Session() \n", 129 | " s = s.get(url_start, headers=headers, proxies=proxies,timeout=3) # 请求首页获取cookies ,allow_redirects=False,verify=False\n", 130 | " cookie = s.cookies # 为此类别获取的cookies \n", 131 | " response = requests.post(url_parse,headers=headers,data=data,proxies=proxies,cookies=cookie,timeout=5)\n", 132 | " sleep_time = random.randint(4,5)+random.random()\n", 133 | " time.sleep(sleep_time)\n", 134 | " response.encoding = response.apparent_encoding \n", 135 | "# print(response.text)\n", 136 | " text = json.loads(response.text)\n", 
137 | " return text" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": 37, 143 | "metadata": {}, 144 | "outputs": [], 145 | "source": [ 146 | "'''\n", 147 | "获取每个城市的全部数据\n", 148 | "'''\n", 149 | "def get_city_Pages(city,keyword):\n", 150 | " \n", 151 | " data = { 'first': 'true', 'pn': 1, 'kd': keyword }#keyword若是中文,需要为编码的形式\n", 152 | " city = parse.quote(city)\n", 153 | " keyword = parse.quote(keyword)\n", 154 | " url_parse = \"https://www.lagou.com/jobs/positionAjax.json?px=default&city=\"+city+\"&needAddtionalResult=false\"\n", 155 | " url_start = \"https://www.lagou.com/jobs/list_\"+keyword+\"?px=default&city=\"+city\n", 156 | " agent = get_agents()\n", 157 | " headers = get_headers(url_start,agent)\n", 158 | " proxies = get_proxies()\n", 159 | " # 请求首页获取cookies \n", 160 | " s = requests.Session() \n", 161 | " s = s.get(url_start, headers=headers, proxies=proxies,timeout=3) # 请求首页获取cookies \n", 162 | " cookie = s.cookies # 为此类别获取的cookies \n", 163 | " response = requests.post(url_parse,headers=headers,data=data,proxies=proxies,cookies=cookie,timeout=5)\n", 164 | " sleep_time = random.randint(4,5)+random.random()\n", 165 | " time.sleep(sleep_time)\n", 166 | " response.encoding = response.apparent_encoding \n", 167 | "# print(response.text)\n", 168 | " text = json.loads(response.text)\n", 169 | "# print(text)\n", 170 | " totalcount = 0\n", 171 | " if 'content' in text:\n", 172 | " totalcount = text['content']['positionResult']['totalCount']\n", 173 | " else:\n", 174 | " print(\"Not exits\")\n", 175 | " totalPages = math.ceil(float(totalcount)/15)\n", 176 | " city_un = parse.unquote(city)\n", 177 | " if totalPages > 0 :\n", 178 | " print(\"%s共%s条,可分为%s页\"%(city_un,totalcount,totalPages))\n", 179 | "# sleep_time = random.randint(5,15)+random.random()\n", 180 | "# time.sleep(sleep_time)#keyerror问题的解决\n", 181 | "# print(\"停留%s\"%sleep_time)\n", 182 | " else:\n", 183 | "# city_un = parse.quote(city)\n", 184 | " print(\"%s-----无数据\"%(city_un))\n", 185 | "# print(\"共%s条,可分为%s页\"%(totalcount,totalPages))\n", 186 | " return totalPages" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": 38, 192 | "metadata": {}, 193 | "outputs": [], 194 | "source": [ 195 | "def get_totalPages(city,keyword,gm):\n", 196 | " city = city\n", 197 | " keyword = keyword\n", 198 | " totalcount = 0\n", 199 | "# pn = 1\n", 200 | " text = get_json(city,keyword,gm,'1')\n", 201 | " if 'content' in text:\n", 202 | " totalcount = text['content']['positionResult']['totalCount']\n", 203 | " else:\n", 204 | " print(\"Not exits\")\n", 205 | " totalPages = math.ceil(float(totalcount)/15)\n", 206 | " if totalPages > 0 :\n", 207 | " print(\"%s公司规模为%s共%s条,可分为%s页\"%(city,gm,totalcount,totalPages)) \n", 208 | "# sleep_time = random.randint(5,15)+random.random()\n", 209 | "# time.sleep(sleep_time)#keyerror问题的解决\n", 210 | "# print(\"停留%s\"%sleep_time)\n", 211 | " else:\n", 212 | " print(\"%s公司规模为%s---无数据\"%(city,gm))\n", 213 | " \n", 214 | " return totalPages" 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": 39, 220 | "metadata": {}, 221 | "outputs": [], 222 | "source": [ 223 | "def get_info(city,keyword,gm,totalPages):\n", 224 | " city = city\n", 225 | " keyword = keyword\n", 226 | " totalPages = totalPages\n", 227 | " position_info_all = []#存储数据\n", 228 | " for page in range(1,totalPages+1):#从1开始循环因为首页的页数编码是1\n", 229 | " time_start = time.time()\n", 230 | " text = get_json(city,keyword,gm,page)\n", 231 | "# print(text)\n", 232 | "# print(proxies)\n", 233 | " 
if 'content'in text:\n", 234 | " info = text[\"content\"][\"positionResult\"][\"result\"] \n", 235 | " for i in info:\n", 236 | " position_info_single = []\n", 237 | " position_info_single.append(i.get('positionId','NA'))#职位编号\n", 238 | " position_info_single.append(i.get('positionName','NA'))#职位名称\n", 239 | " position_info_single.append(i.get('salary','NA'))#薪酬\n", 240 | " position_info_single.append(i.get('workYear','NA'))#工作经验\n", 241 | " position_info_single.append(i.get('skillLables','NA'))#技能要求\n", 242 | " position_info_single.append(i.get('positionAdvantage','NA'))#职位优势\n", 243 | " position_info_single.append(i.get('education','NA'))#学历要求\n", 244 | " position_info_single.append(i.get('jobNature','NA'))#工作性质\n", 245 | " position_info_single.append(i.get('createTime','NA'))#发布时间\n", 246 | " position_info_single.append(i.get('companyFullName','NA'))#公司\n", 247 | " position_info_single.append(i.get('city','NA'))#城市\n", 248 | " position_info_single.append(i.get('companySize','NA'))#公司规模\n", 249 | " position_info_single.append(i.get('district','NA'))#区域\n", 250 | " position_info_single.append(i.get('financeStage','NA'))#融资情况\n", 251 | " position_info_single.append(i.get('firstType','NA'))#公司类别\n", 252 | " position_info_single.append(i.get('industryField','NA'))#涉及领域\n", 253 | " position_info_single.append(i.get('isSchoolJob','NA'))#是否校招\n", 254 | " position_info_single.append(i.get('subwayline','NA'))#地铁\n", 255 | " position_info_single.append(i.get('stationname','NA'))#站点\n", 256 | " position_info_single.append(i.get('latitude','NA'))#经度\n", 257 | " position_info_single.append(i.get('longitude','NA'))#纬度\n", 258 | " position_info_all.append(position_info_single) \n", 259 | " print(\"第%s页爬取成功,position_info_all现有数据%s行\"%(str(page),str(len(position_info_all))))\n", 260 | " else:\n", 261 | " print(\"Not exits\")\n", 262 | " sleep_time = random.randint(5,15)+random.random()\n", 263 | " time.sleep(sleep_time)#keyerror问题的解决\n", 264 | " time_end = time.time()\n", 265 | " on_time = time_end-time_start\n", 266 | " print(\"本页爬行时间%s\"%str(on_time))\n", 267 | " return position_info_all" 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": 52, 273 | "metadata": { 274 | "scrolled": true 275 | }, 276 | "outputs": [ 277 | { 278 | "name": "stdout", 279 | "output_type": "stream", 280 | "text": [ 281 | "293\n", 282 | "珠海共3条,可分为1页\n", 283 | "{'http': \"['218.64.69.79:8080']\"}\n", 284 | "珠海公司规模为2000人以上共3条,可分为1页\n", 285 | "{'http': \"['163.125.68.135:8888 ']\"}\n", 286 | "第1页爬取成功,position_info_all现有数据3行\n", 287 | "本页爬行时间16.876307725906372\n", 288 | "珠海的公司规模2000人以上的数据爬完\n", 289 | "{'http': \"['14.115.104.97:808 ']\"}\n", 290 | "珠海公司规模为少于15人,15-50人---无数据\n", 291 | "停留6.7492073387582385\n", 292 | "{'http': \"['134.249.156.3:82']\"}\n", 293 | "珠海公司规模为50-150人---无数据\n", 294 | "停留11.043163736858903\n", 295 | "{'http': \"['1.198.72.193:9999']\"}\n", 296 | "Not exits\n", 297 | "珠海公司规模为150-500人---无数据\n", 298 | "停留12.172892347874113\n", 299 | "{'http': \"['183.129.207.86:11206 ']\"}\n", 300 | "珠海公司规模为500-2000人---无数据\n", 301 | "停留10.925358249150833\n", 302 | "position_info_city有3条数据\n", 303 | "珠海的全部数据爬取成功\n", 304 | "停留13.576861105003818\n", 305 | "中山共1条,可分为1页\n", 306 | "{'http': \"['115.28.148.192:8118']\"}\n", 307 | "中山公司规模为2000人以上---无数据\n", 308 | "停留9.102117120811412\n", 309 | "{'http': \"['134.249.156.3:82']\"}\n", 310 | "中山公司规模为少于15人,15-50人---无数据\n", 311 | "停留7.852819219941637\n", 312 | "{'http': \"['1.192.243.166:9999 ']\"}\n", 313 | "中山公司规模为50-150人---无数据\n", 314 | "停留15.585111278379154\n", 
315 | "{'http': \"['163.125.68.135:8888 ']\"}\n", 316 | "中山公司规模为150-500人共1条,可分为1页\n", 317 | "{'http': \"['171.80.2.110:9999']\"}\n", 318 | "第1页爬取成功,position_info_all现有数据1行\n", 319 | "本页爬行时间20.245324850082397\n", 320 | "中山的公司规模150-500人的数据爬完\n", 321 | "{'http': \"['218.64.69.79:8080']\"}\n", 322 | "中山公司规模为500-2000人---无数据\n", 323 | "停留12.88375543553882\n", 324 | "position_info_city有4条数据\n", 325 | "中山的全部数据爬取成功\n", 326 | "停留12.943937855541314\n", 327 | "镇江-----无数据\n", 328 | "停留9.276375587886204\n", 329 | "湛江-----无数据\n", 330 | "停留8.685334392703105\n", 331 | "株洲共1条,可分为1页\n", 332 | "{'http': \"['120.83.108.77:9999 ']\"}\n", 333 | "株洲公司规模为2000人以上---无数据\n", 334 | "停留5.101241187083019\n", 335 | "{'http': \"['122.193.244.243:9999 ']\"}\n", 336 | "株洲公司规模为少于15人,15-50人共1条,可分为1页\n", 337 | "{'http': \"['14.115.104.97:808 ']\"}\n", 338 | "Not exits\n", 339 | "本页爬行时间16.907893180847168\n", 340 | "株洲的公司规模少于15人,15-50人的数据爬完\n", 341 | "{'http': \"['1.192.243.166:9999 ']\"}\n", 342 | "株洲公司规模为50-150人---无数据\n", 343 | "停留7.160573629384684\n", 344 | "{'http': \"['115.28.148.192:8118']\"}\n", 345 | "株洲公司规模为150-500人---无数据\n", 346 | "停留7.411860943214564\n", 347 | "{'http': \"['1.198.72.239:9999']\"}\n", 348 | "株洲公司规模为500-2000人---无数据\n", 349 | "停留7.505609365573439\n", 350 | "position_info_city有4条数据\n", 351 | "株洲的全部数据爬取成功\n", 352 | "停留12.079427715518644\n", 353 | "淄博-----无数据\n", 354 | "停留5.942007353497083\n", 355 | "肇庆-----无数据\n", 356 | "停留8.137823183299782\n", 357 | "张家口-----无数据\n", 358 | "停留13.28549592337901\n", 359 | "漳州-----无数据\n", 360 | "停留14.207343278491953\n", 361 | "遵义-----无数据\n", 362 | "停留14.95192348923606\n", 363 | "驻马店-----无数据\n", 364 | "停留11.949976561928546\n", 365 | "长治-----无数据\n", 366 | "停留15.335753043318482\n", 367 | "枣庄-----无数据\n", 368 | "停留14.370120856259682\n", 369 | "资阳-----无数据\n", 370 | "停留6.685431142101895\n", 371 | "舟山-----无数据\n", 372 | "停留8.37063967450687\n", 373 | "自贡-----无数据\n", 374 | "停留7.527898435115426\n", 375 | "周口共1条,可分为1页\n", 376 | "{'http': \"['112.85.169.126:9999']\"}\n", 377 | "周口公司规模为2000人以上---无数据\n", 378 | "停留6.014899216823151\n", 379 | "{'http': \"['1.197.204.217:9999 ']\"}\n", 380 | "周口公司规模为少于15人,15-50人---无数据\n", 381 | "停留13.197189074964884\n", 382 | "{'http': \"['112.85.164.68:9999']\"}\n", 383 | "周口公司规模为50-150人共1条,可分为1页\n", 384 | "{'http': \"['1.198.72.239:9999']\"}\n", 385 | "第1页爬取成功,position_info_all现有数据1行\n", 386 | "本页爬行时间19.017383813858032\n", 387 | "周口的公司规模50-150人的数据爬完\n", 388 | "{'http': \"['120.83.108.77:9999 ']\"}\n", 389 | "周口公司规模为150-500人---无数据\n", 390 | "停留14.04838082143874\n", 391 | "{'http': \"['103.26.245.190:43328']\"}\n", 392 | "周口公司规模为500-2000人---无数据\n", 393 | "停留9.220707758209855\n", 394 | "position_info_city有5条数据\n", 395 | "周口的全部数据爬取成功\n", 396 | "停留5.143233270941914\n", 397 | "昭通-----无数据\n", 398 | "停留7.923482816445571\n", 399 | "张掖-----无数据\n", 400 | "停留13.237375799756679\n", 401 | "张家界-----无数据\n", 402 | "停留11.974681287128595\n", 403 | "中卫-----无数据\n", 404 | "停留7.505663442219061\n", 405 | "全部数据存储成功(CSV格式)----持续奔跑中!\n", 406 | "[[5858185, '数据分析师', '15k-30k', '不限', ['数据挖掘', '数据架构', 'MySQL', '数据分析'], '免费三餐,数据分析', '不限', '全职', '2019-04-29 17:45:35', '北京金山软件有限公司', '珠海', '2000人以上', '香洲区', '上市公司', '开发|测试|运维类', '移动互联网', 0, None, None, '22.34793', '113.601205'], [1978586, '数据分析师', '1k-2k', '应届毕业生', [], '提供岗位培训,表现优异者可转正', '不限', '实习', '2019-05-20 16:25:19', '成都西山居互动娱乐科技有限公司珠海分公司', '珠海', '2000人以上', '香洲区', '不需要融资', '产品|需求|项目类', '游戏', 1, None, None, '22.348471', '113.601247'], [5945889, '高级数据分析师', '18k-30k', '3-5年', ['数据分析', '数据挖掘', '数据处理'], '团队优秀 业务前景好 六险一金 年度旅行', '本科', '全职', 
'2019-05-22 17:37:33', '北京金山软件有限公司', '珠海', '2000人以上', '香洲区', '上市公司', '开发|测试|运维类', '移动互联网', 0, None, None, '22.347929', '113.601205'], [5339042, '数据分析实习生(中山)', '2k-3k', '应届毕业生', ['数据分析', 'MySQL', '数据处理'], '提供转正,周末双休,专题培训,法定假日', '本科', '实习', '2019-05-22 09:34:00', '广东精点数据科技股份有限公司', '中山', '150-500人', '中山市市辖区', '不需要融资', '开发|测试|运维类', '移动互联网,数据服务', 1, None, None, '22.503981', '113.403275'], [5784906, '数据分析工程师', '4k-5k', '1-3年', [], '员工旅游、补充医疗保险、绩效奖金、餐补', '大专', '全职', '2019-05-22 17:22:48', '中科三清科技有限公司', '周口', '50-150人', '川汇区', '不需要融资', '产品|需求|项目类', '其他,移动互联网', 0, None, None, '33.61007', '114.669126']]\n" 407 | ] 408 | } 409 | ], 410 | "source": [ 411 | "def main():\n", 412 | " #公司规模\n", 413 | " gms = ['2000人以上','少于15人,15-50人','50-150人','150-500人','500-2000人'] \n", 414 | " keyword = '数据分析'\n", 415 | " citys = get_citys(keyword)\n", 416 | " position_info_all = []\n", 417 | " columns = ['职位编号','职位名称','薪酬','工作经验','技能要求','职位优势','学历要求','工作形式','发布时间',\n", 418 | " '公司','城市','规模','区域','融资情况','公司类别','涉及领域','是否校招',\n", 419 | " '地铁','站点','经度','纬度']\n", 420 | " print(len(citys))\n", 421 | " for i in range(0,len(citys)):\n", 422 | " position_info_city = []\n", 423 | " city = citys[i]\n", 424 | " cityPages = get_city_Pages(city,keyword)\n", 425 | " if cityPages > 0 :\n", 426 | " for j in range(len(gms)):\n", 427 | " gm = gms[j]\n", 428 | " totalPages = get_totalPages(city,keyword,gm)\n", 429 | " if totalPages>0:\n", 430 | " \n", 431 | " position_info_city_gm = get_info(city,keyword,gm,totalPages)\n", 432 | " for m in range(len(position_info_city_gm)):\n", 433 | " item = position_info_city_gm[m]\n", 434 | " position_info_city.append(item)\n", 435 | " position_info_all.append(item)\n", 436 | " # print(position_info_all)\n", 437 | " print(\"%s的公司规模%s的数据爬完\"%(city,gm))\n", 438 | " else:\n", 439 | " # print(\"%s的公司规模%s无数据\"%(city,gm))\n", 440 | " sleep_time = random.randint(5,15)+random.random()\n", 441 | " time.sleep(sleep_time)#keyerror问题的解决\n", 442 | " print(\"停留%s\"%sleep_time)\n", 443 | " print(\"position_info_all有%s条数据\"%len(position_info_all))\n", 444 | " print(\"%s的全部数据爬取成功\"%city)\n", 445 | " sleep_time = random.randint(5,15)+random.random()\n", 446 | " time.sleep(sleep_time)#keyerror问题的解决\n", 447 | " print(\"停留%s\"%sleep_time)\n", 448 | " df_city = pd.DataFrame(data = position_info_city,columns =columns )\n", 449 | " df_city.to_csv('%s.csv'%(city),index = False,encoding=\"utf_8_sig\")\n", 450 | " else:\n", 451 | " sleep_time = random.randint(5,15)+random.random()\n", 452 | " time.sleep(sleep_time)#keyerror问题的解决\n", 453 | " print(\"停留%s\"%sleep_time)\n", 454 | "# position_info_all.append(position_info_city)\n", 455 | " df_all = pd.DataFrame(data=position_info_all,columns=columns)\n", 456 | " df_all.to_csv('all_city.csv',index = False,encoding=\"utf_8_sig\")\n", 457 | " print(\"全部数据存储成功(CSV格式)----持续奔跑中!\") \n", 458 | " print(position_info_all)\n", 459 | "\n", 460 | "if __name__ == '__main__':\n", 461 | " main()" 462 | ] 463 | }, 464 | { 465 | "cell_type": "code", 466 | "execution_count": null, 467 | "metadata": {}, 468 | "outputs": [], 469 | "source": [] 470 | } 471 | ], 472 | "metadata": { 473 | "kernelspec": { 474 | "display_name": "Python 3", 475 | "language": "python", 476 | "name": "python3" 477 | }, 478 | "language_info": { 479 | "codemirror_mode": { 480 | "name": "ipython", 481 | "version": 3 482 | }, 483 | "file_extension": ".py", 484 | "mimetype": "text/x-python", 485 | "name": "python", 486 | "nbconvert_exporter": "python", 487 | "pygments_lexer": "ipython3", 488 | "version": "3.6.5" 489 | } 490 | 
}, 491 | "nbformat": 4, 492 | "nbformat_minor": 2 493 | } 494 | -------------------------------------------------------------------------------- /拉勾网/拉勾网反爬机制及攻克方式总结: -------------------------------------------------------------------------------- 1 | 1 After roughly three consecutive pages are scraped, the API starts returning empty results 2 | Workaround: 3 | Build pools of User-Agent strings and proxy IPs and pick from them at random for every new page request 4 | 2 Posting directly to the Ajax URL taken from the request headers returns an "IP blocked" message 5 | Workaround 1: 6 | While logged in, copy the cookies from the browser's request headers to fill out the custom headers. In this project, however, that still did not solve the problem 7 | Workaround 2: 8 | Request the search landing page with a requests.Session first, record session.cookies, and send those cookies with the subsequent POST. 9 | Note that the cookies have to be fetched again every time the proxy IP changes 10 | 3 KeyError exceptions 11 | Workaround: 12 | Use time.sleep to pause between pages and stretch out the interval between requests 13 | 4 Latin-1 encoding errors 14 | Workaround: 15 | A URL must not contain raw Chinese characters, so encode them into standard URL form with urllib.parse.quote; urllib.parse.unquote decodes them back to Chinese 16 | Note that if the data passed to requests.post(url, headers=headers, data=data) contains a Chinese keyword, that keyword must NOT be URL-encoded 17 | 5 Only a limited amount of data can be scraped 18 | Workaround 1: 19 | Lagou displays at most 30 pages of 15 postings each, yet the JSON responses actually cover far more records. 20 | content -> positionResult -> totalCount gives the total number of postings, but not all of them can necessarily be fetched; Lagou's anti-scraping really is pushed to the extreme. 21 | The content-length entry in the headers corresponds to the maximum number of pages that can be fetched; that page count * 15 is the amount of data we can ultimately obtain 22 | Workaround 2: 23 | Split the query into mutually independent groups of fewer than 450 records each whose union is the full data set, e.g. filter by company size, scrape each group separately, then merge the results. 24 | 6 Removing SSL verification in Python requests and silencing the InsecureRequestWarning printed to the console. 25 | Workaround: 26 | from requests.packages.urllib3.exceptions import InsecureRequestWarning # suppress insecure-request warnings 27 | requests.packages.urllib3.disable_warnings(InsecureRequestWarning) 28 | 7 Building a pandas DataFrame with from_records raises "AssertionError: 1 columns passed, passed data had 22 columns" 29 | Cause: lists of different dimensions were repeatedly appended to the same list 30 | Workaround: 31 | Keep separate overall and per-city accumulator lists and append only rows of the same dimension 32 | --------------------------------------------------------------------------------
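
A note on points 1, 2 and 4 above: the sketch below condenses the request flow used in the notebooks into one function. It is illustrative only — the proxy addresses are the free ones listed in the notebooks and will almost certainly be dead, the User-Agent strings are the ones the notebooks use, and fetch_total is a hypothetical helper name, not part of the repository.

import random
import requests
from urllib import parse

UA_POOL = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36',
    'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Mobile Safari/537.36',
]
PROXY_POOL = ['134.249.156.3:82', '1.198.72.239:9999', '103.26.245.190:43328']

def fetch_total(city, keyword):
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'User-Agent': random.choice(UA_POOL),        # point 1: rotate the User-Agent
    }
    proxies = {'http': random.choice(PROXY_POOL)}    # point 1: rotate the proxy IP
    city_q = parse.quote(city)                       # point 4: only the URL parts get quoted
    kw_q = parse.quote(keyword)
    url_start = 'https://www.lagou.com/jobs/list_' + kw_q + '?city=' + city_q
    url_parse = ('https://www.lagou.com/jobs/positionAjax.json?city=' + city_q
                 + '&needAddtionalResult=false')
    headers['Referer'] = url_start
    s = requests.Session()
    s.get(url_start, headers=headers, proxies=proxies, timeout=3)   # point 2: landing page issues the cookies
    data = {'first': 'true', 'pn': 1, 'kd': keyword}                # point 4: 'kd' stays as un-encoded Chinese
    response = s.post(url_parse, data=data, headers=headers, proxies=proxies,
                      cookies=s.cookies, timeout=5)
    return response.json()['content']['positionResult']['totalCount']

Called as fetch_total('上海', '数据分析'), this mirrors the first notebook and returns the total posting count for that city and keyword, provided the proxies are alive.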
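
Points 5 (workaround 2) and 7 combine naturally: query each company-size group separately so no single group exceeds the cap, append one flat, fixed-length row per posting, and build the DataFrame only at the end. A minimal sketch, assuming a fetch_page(city, keyword, gm, pn) helper that returns the parsed JSON (standing in for the second notebook's get_json); the column subset shown is illustrative:

import math
import pandas as pd

COMPANY_SIZES = ['2000人以上', '少于15人,15-50人', '50-150人', '150-500人', '500-2000人']
COLUMNS = ['positionId', 'positionName', 'salary', 'city', 'companySize']  # illustrative subset of the JSON fields

def scrape_city(city, keyword, fetch_page):
    rows = []                                    # point 7: one flat list, every row the same length
    for gm in COMPANY_SIZES:                     # point 5: independent groups whose union is the full set
        total = (fetch_page(city, keyword, gm, 1)
                 .get('content', {}).get('positionResult', {}).get('totalCount', 0))
        for pn in range(1, math.ceil(total / 15) + 1):   # 15 postings per page
            text = fetch_page(city, keyword, gm, pn)
            for p in text.get('content', {}).get('positionResult', {}).get('result', []):
                rows.append([p.get(col, 'NA') for col in COLUMNS])
    return pd.DataFrame(rows, columns=COLUMNS)

Because every appended row has exactly len(COLUMNS) entries, the final DataFrame construction avoids the "1 columns passed, passed data had 22 columns" assertion described in point 7.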