├── README.md └── 拉勾网 ├── 不同城市不同岗位数据总数.ipynb ├── 全部城市数据分析岗位招聘数据.ipynb └── 拉勾网反爬机制及攻克方式总结 /README.md: -------------------------------------------------------------------------------- 1 | # Python Web-Scraping Study Notes and Practice Projects 2 | 3 | ## Contents 4 | ### Reference books 5 | * 《Python网络爬虫从入门到精通》 6 | * 《Python编程从入门到实践》 7 | ### Study notes 8 | ### Scraping practice projects 9 | * __Lagou (拉勾网) job postings__ 10 | - [Total number of postings by city and position](https://github.com/EvelynZP/Python-Spider/blob/master/%E6%8B%89%E5%8B%BE%E7%BD%91/%E4%B8%8D%E5%90%8C%E5%9F%8E%E5%B8%82%E4%B8%8D%E5%90%8C%E5%B2%97%E4%BD%8D%E6%95%B0%E6%8D%AE%E6%80%BB%E6%95%B0.ipynb) 11 | - [Scraping data-analyst postings for all cities](https://github.com/EvelynZP/Python-Spider/blob/master/%E6%8B%89%E5%8B%BE%E7%BD%91/%E5%85%A8%E9%83%A8%E5%9F%8E%E5%B8%82%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90%E5%B2%97%E4%BD%8D%E6%8B%9B%E8%81%98%E6%95%B0%E6%8D%AE.ipynb) 12 | - Analysis of data-analyst positions based on the scraped Lagou data 13 | - [Summary of Lagou's anti-scraping measures and how to get around them](https://github.com/EvelynZP/Python-Spider/blob/master/%E6%8B%89%E5%8B%BE%E7%BD%91/%E6%8B%89%E5%8B%BE%E7%BD%91%E5%8F%8D%E7%88%AC%E6%9C%BA%E5%88%B6%E5%8F%8A%E6%94%BB%E5%85%8B%E6%96%B9%E5%BC%8F%E6%80%BB%E7%BB%93) 14 | * __Douban (豆瓣) movie short reviews__ 15 | -------------------------------------------------------------------------------- /拉勾网/不同城市不同岗位数据总数.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 31, 6 | "metadata": {}, 7 | "outputs": [ 8 | { 9 | "name": "stdout", 10 | "output_type": "stream", 11 | "text": [ 12 | "540\n" 13 | ] 14 | } 15 | ], 16 | "source": [ 17 | "import requests\n", 18 | "import time\n", 19 | "import json \n", 20 | "import random\n", 21 | "from urllib import parse\n", 22 | "def main(city,keyword): \n", 23 | " proxie = [\n", 24 | " \"134.249.156.3:82\",\n", 25 | " \"1.198.72.239:9999\",\n", 26 | " \"103.26.245.190:43328\"\n", 27 | " ]\n", 28 | " proxies = {\n", 29 | " \"http\": random.choice(proxie) # pick a single proxy string, not a one-element list\n", 30 | " }\n", 31 | " city = parse.quote(city)\n", 32 | " keyword = parse.quote(keyword)\n", 33 | " url_start = \"https://www.lagou.com/jobs/list_\"+keyword+\"?city=\"+city+\"&cl=false&fromSearch=true&labelWords=&suginput=\" \n", 34 | " url_parse = \"https://www.lagou.com/jobs/positionAjax.json?city=\"+city+\"&needAddtionalResult=false\"\n", 35 | " \n", 36 | " headers = { \n", 37 | " 'Accept': 'application/json, text/javascript, */*; q=0.01', \n", 38 | " 'Referer': url_start, \n", 39 | " 'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36' \n", 40 | " } \n", 41 | " keyword_un = parse.unquote(keyword)\n", 42 | " data = { 'first': 'true', 'pn': 1, 'kd': keyword_un }\n", 43 | " s = requests.Session() \n", 44 | " s.get(url_start, headers=headers, timeout=3) # request the landing page first to obtain cookies \n", 45 | " cookie = s.cookies # cookies issued for this search \n", 46 | " response = s.post(url_parse, data=data, headers=headers, proxies=proxies,cookies=cookie, timeout=3) # POST for the JSON payload of this search \n", 47 | " time.sleep(5)\n", 48 | " response.encoding = response.apparent_encoding \n", 49 | " text = json.loads(response.text)\n", 50 | " \n", 51 | " info = text[\"content\"][\"positionResult\"][\"totalCount\"] \n", 52 | " print(info)\n", 53 | "if __name__ == '__main__':\n", 54 | " city = '上海'\n", 55 | " keyword = '数据分析'\n", 56 | " main(city,keyword)" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [] 65 | } 66 | ], 67 | "metadata": { 68 | "kernelspec": { 69 | "display_name": "Python 3", 70 | "language": "python", 71 | "name": "python3" 72 | }, 73 | "language_info": 
{ 74 | "codemirror_mode": { 75 | "name": "ipython", 76 | "version": 3 77 | }, 78 | "file_extension": ".py", 79 | "mimetype": "text/x-python", 80 | "name": "python", 81 | "nbconvert_exporter": "python", 82 | "pygments_lexer": "ipython3", 83 | "version": "3.7.0" 84 | } 85 | }, 86 | "nbformat": 4, 87 | "nbformat_minor": 2 88 | } 89 | -------------------------------------------------------------------------------- /拉勾网/全部城市数据分析岗位招聘数据.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 33, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "import requests\n", 10 | "from urllib import parse\n", 11 | "from lxml import etree\n", 12 | "import random\n", 13 | "import time\n", 14 | "import math\n", 15 | "import pandas as pd\n", 16 | "import json\n", 17 | "from bs4 import BeautifulSoup\n", 18 | "import re \n", 19 | "from requests.packages.urllib3.exceptions import InsecureRequestWarning \n", 20 | "# 禁用安全请求警告 \n", 21 | "requests.packages.urllib3.disable_warnings(InsecureRequestWarning)\n", 22 | "\n", 23 | "'''\n", 24 | "获取拉勾网在线的所有城市名称\n", 25 | "页面解析:在源代码可以直接查看到,故为静态网页,可以直接提取,利用BeatuifulSoup解析返回结果\n", 26 | "keyword:所要查询关键字,如:'数据分析'\n", 27 | "'''\n", 28 | "def get_citys(keyword):\n", 29 | " keyword = parse.quote(keyword)\n", 30 | " link = \"https://www.lagou.com/jobs/allCity.html?keyword=\"+keyword+\"&px=default&city=%E5%85%A8%E5%9B%BD&positionNum=500+&companyNum=0&isCompanySelected=false&labelWords=\"\n", 31 | " headers = {\n", 32 | " 'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Mobile Safari/537.36'\n", 33 | " }\n", 34 | " r = requests.get(link,headers=headers)\n", 35 | " citys = []\n", 36 | " soup = BeautifulSoup(r.text,'html.parser')\n", 37 | " city_word_list = soup.find_all('ul',class_ ='city_list')\n", 38 | " for i in range(len(city_word_list)):\n", 39 | " city_list = city_word_list[i].find_all('li')\n", 40 | " for j in range(len(city_list)):\n", 41 | " city = city_list[j].a.text.strip()\n", 42 | " citys.append(city)\n", 43 | "# print(city)\n", 44 | " return citys\n", 45 | " " 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": 34, 51 | "metadata": {}, 52 | "outputs": [], 53 | "source": [ 54 | "def get_url(city,keyword,gm):\n", 55 | " \n", 56 | " '''\n", 57 | " 编码为URL格式,直接传送的地址非URL格式需要利用urllib.parse进行处理\n", 58 | " 错误类型:UnicodeEncodeError\n", 59 | " 'latin-1' codec can't encode characters in position 87-89: ordinal not in range(256)\n", 60 | " '''\n", 61 | " city = parse.quote(city)\n", 62 | " keyword = parse.quote(keyword)\n", 63 | " gm = parse.quote(gm)\n", 64 | " \n", 65 | " url_parse = \"https://www.lagou.com/jobs/positionAjax.json?px=default&city=\"+city+\"&needAddtionalResult=false&gm=\"+gm\n", 66 | " url_start = \"https://www.lagou.com/jobs/list_\"+keyword+\"?px=default&gm=\"+ gm + \"&city=\"+city\n", 67 | " return url_parse,url_start" 68 | ] 69 | }, 70 | { 71 | "cell_type": "code", 72 | "execution_count": 35, 73 | "metadata": {}, 74 | "outputs": [], 75 | "source": [ 76 | "#随机选择执行的IP,防止被封\n", 77 | "def get_proxies():#goole免费代理IP替换或者添加,越多越好\n", 78 | " proxie = [\n", 79 | " \"134.249.156.3:82\",\n", 80 | " \"1.198.72.239:9999\",\n", 81 | " \"103.26.245.190:43328\"]\n", 82 | " \n", 83 | " proxies = {\"http\":str(random.sample(proxie,1))}\n", 84 | " return proxies\n", 85 | "def get_agents():#替换你自己的User-Agent,直接运行不了\n", 86 | " agents = [\n", 87 | " '**Mozilla/5.0 (Windows NT 
10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3626.121 Safari/537.36',\n", 88 | " '**Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3729.157 Mobile Safari/537.36'\n", 89 | " ]\n", 90 | " agent = random.sample(agents,1)\n", 91 | " return agent\n", 92 | "#定制headers,设置模拟浏览器,反爬\n", 93 | "def get_headers(url_start,agent):\n", 94 | " url_start = url_start\n", 95 | " agent = agent\n", 96 | " headers = {\n", 97 | " #所有数据均来自Network-XHR-headers-Requsts Headers\n", 98 | " 'Accept': 'application/json, text/javascript, */*; q=0.01', \n", 99 | " 'Referer': url_start,\n", 100 | " #因为本地的User-Agent已经被封锁,此时有了网上的一个虚拟的,为了反爬可以随机使用几个,类似于下面的proxies\n", 101 | "# 'Cookie':'user_trace_token=20190515230141-12934d87-ec36-47a4-b6f5-b9fdf6116988; _ga=GA1.2.1677629794.1557932504; _gat=1; LGSID=20190515230142-597afe8c-7722-11e9-99a5-525400f775ce; PRE_UTM=m_cf_cpt_sogou_ztch6; PRE_HOST=; PRE_SITE=; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2Flanding-page%2Fpc%2Fposition2.html%3Futm_source%3Dm_cf_cpt_sogou_ztch6; LGUID=20190515230142-597b0038-7722-11e9-99a5-525400f775ce; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1557932504; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2216abc03944948-07d289ba2ecf5-353166-921600-16abc039450303%22%2C%22%24device_id%22%3A%2216abc03944948-07d289ba2ecf5-353166-921600-16abc039450303%22%7D; sajssdk_2015_cross_new_user=1; _gid=GA1.2.694067267.1557932513; LG_LOGIN_USER_ID=90be76343c3af8869dd6d12563545c28b420c207df6a38cceeb53f0a5d64da34; LG_HAS_LOGIN=1; _putrc=A7C9A275E2B1F6EB123F89F2B170EADC; JSESSIONID=ABAAABAAADEAAFIAD4122C7F738CEEAE9C99959A9756EB3; login=true; unick=%E5%BC%A0%E5%B9%B3; showExpriedIndex=1; showExpriedCompanyHome=1; showExpriedMyPublish=1; hasDeliver=1; gate_login_token=450a1f502d12ad1d9f320d846f06d1ad92603529c81fa8ff8921a53bcf54a82d; index_location_city=%E5%8C%97%E4%BA%AC; X_HTTP_TOKEN=c7dc7114524f2f9f15523975516bfb19cb3eab323b; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1557932554; LGRID=20190515230231-76ee1774-7722-11e9-99a5-525400f775ce; SEARCH_ID=caaf8eb6597d4ecbbe9a2d807d68e78f',\n", 102 | " 'Host':'www.lagou.com',\n", 103 | " 'Origin': 'https://www.lagou.com',\n", 104 | " 'User-Agent': str(agent),\n", 105 | " 'X-Anit-Forge-Code': '0',\n", 106 | " 'X-Anit-Forge-Token': 'None',\n", 107 | " 'X-Requested-With': 'XMLHttpRequest',\n", 108 | " 'Connection':'keep_alive'\n", 109 | " } \n", 110 | " return headers" 111 | ] 112 | }, 113 | { 114 | "cell_type": "code", 115 | "execution_count": 36, 116 | "metadata": {}, 117 | "outputs": [], 118 | "source": [ 119 | "def get_json(city,keyword,gm,pn):\n", 120 | " #kd是搜索的关键字,若改为“数据分析”即是数据分析相关的数据\n", 121 | " data = { 'first': 'true', 'pn': pn, 'kd': keyword }#keyword若是中文,需要为编码的形式\n", 122 | " url_parse,url_start = get_url(city,keyword,gm)\n", 123 | " agent = get_agents()\n", 124 | " headers = get_headers(url_start,agent)\n", 125 | " proxies = get_proxies()\n", 126 | " print(proxies)\n", 127 | " # 请求首页获取cookies \n", 128 | " s = requests.Session() \n", 129 | " s = s.get(url_start, headers=headers, proxies=proxies,timeout=3) # 请求首页获取cookies ,allow_redirects=False,verify=False\n", 130 | " cookie = s.cookies # 为此类别获取的cookies \n", 131 | " response = requests.post(url_parse,headers=headers,data=data,proxies=proxies,cookies=cookie,timeout=5)\n", 132 | " sleep_time = random.randint(4,5)+random.random()\n", 133 | " time.sleep(sleep_time)\n", 134 | " response.encoding = response.apparent_encoding \n", 135 | "# print(response.text)\n", 136 | " text = json.loads(response.text)\n", 
137 | " return text" 138 | ] 139 | }, 140 | { 141 | "cell_type": "code", 142 | "execution_count": 37, 143 | "metadata": {}, 144 | "outputs": [], 145 | "source": [ 146 | "'''\n", 147 | "获取每个城市的全部数据\n", 148 | "'''\n", 149 | "def get_city_Pages(city,keyword):\n", 150 | " \n", 151 | " data = { 'first': 'true', 'pn': 1, 'kd': keyword }#keyword若是中文,需要为编码的形式\n", 152 | " city = parse.quote(city)\n", 153 | " keyword = parse.quote(keyword)\n", 154 | " url_parse = \"https://www.lagou.com/jobs/positionAjax.json?px=default&city=\"+city+\"&needAddtionalResult=false\"\n", 155 | " url_start = \"https://www.lagou.com/jobs/list_\"+keyword+\"?px=default&city=\"+city\n", 156 | " agent = get_agents()\n", 157 | " headers = get_headers(url_start,agent)\n", 158 | " proxies = get_proxies()\n", 159 | " # 请求首页获取cookies \n", 160 | " s = requests.Session() \n", 161 | " s = s.get(url_start, headers=headers, proxies=proxies,timeout=3) # 请求首页获取cookies \n", 162 | " cookie = s.cookies # 为此类别获取的cookies \n", 163 | " response = requests.post(url_parse,headers=headers,data=data,proxies=proxies,cookies=cookie,timeout=5)\n", 164 | " sleep_time = random.randint(4,5)+random.random()\n", 165 | " time.sleep(sleep_time)\n", 166 | " response.encoding = response.apparent_encoding \n", 167 | "# print(response.text)\n", 168 | " text = json.loads(response.text)\n", 169 | "# print(text)\n", 170 | " totalcount = 0\n", 171 | " if 'content' in text:\n", 172 | " totalcount = text['content']['positionResult']['totalCount']\n", 173 | " else:\n", 174 | " print(\"Not exits\")\n", 175 | " totalPages = math.ceil(float(totalcount)/15)\n", 176 | " city_un = parse.unquote(city)\n", 177 | " if totalPages > 0 :\n", 178 | " print(\"%s共%s条,可分为%s页\"%(city_un,totalcount,totalPages))\n", 179 | "# sleep_time = random.randint(5,15)+random.random()\n", 180 | "# time.sleep(sleep_time)#keyerror问题的解决\n", 181 | "# print(\"停留%s\"%sleep_time)\n", 182 | " else:\n", 183 | "# city_un = parse.quote(city)\n", 184 | " print(\"%s-----无数据\"%(city_un))\n", 185 | "# print(\"共%s条,可分为%s页\"%(totalcount,totalPages))\n", 186 | " return totalPages" 187 | ] 188 | }, 189 | { 190 | "cell_type": "code", 191 | "execution_count": 38, 192 | "metadata": {}, 193 | "outputs": [], 194 | "source": [ 195 | "def get_totalPages(city,keyword,gm):\n", 196 | " city = city\n", 197 | " keyword = keyword\n", 198 | " totalcount = 0\n", 199 | "# pn = 1\n", 200 | " text = get_json(city,keyword,gm,'1')\n", 201 | " if 'content' in text:\n", 202 | " totalcount = text['content']['positionResult']['totalCount']\n", 203 | " else:\n", 204 | " print(\"Not exits\")\n", 205 | " totalPages = math.ceil(float(totalcount)/15)\n", 206 | " if totalPages > 0 :\n", 207 | " print(\"%s公司规模为%s共%s条,可分为%s页\"%(city,gm,totalcount,totalPages)) \n", 208 | "# sleep_time = random.randint(5,15)+random.random()\n", 209 | "# time.sleep(sleep_time)#keyerror问题的解决\n", 210 | "# print(\"停留%s\"%sleep_time)\n", 211 | " else:\n", 212 | " print(\"%s公司规模为%s---无数据\"%(city,gm))\n", 213 | " \n", 214 | " return totalPages" 215 | ] 216 | }, 217 | { 218 | "cell_type": "code", 219 | "execution_count": 39, 220 | "metadata": {}, 221 | "outputs": [], 222 | "source": [ 223 | "def get_info(city,keyword,gm,totalPages):\n", 224 | " city = city\n", 225 | " keyword = keyword\n", 226 | " totalPages = totalPages\n", 227 | " position_info_all = []#存储数据\n", 228 | " for page in range(1,totalPages+1):#从1开始循环因为首页的页数编码是1\n", 229 | " time_start = time.time()\n", 230 | " text = get_json(city,keyword,gm,page)\n", 231 | "# print(text)\n", 232 | "# print(proxies)\n", 233 | " 
if 'content'in text:\n", 234 | " info = text[\"content\"][\"positionResult\"][\"result\"] \n", 235 | " for i in info:\n", 236 | " position_info_single = []\n", 237 | " position_info_single.append(i.get('positionId','NA'))#职位编号\n", 238 | " position_info_single.append(i.get('positionName','NA'))#职位名称\n", 239 | " position_info_single.append(i.get('salary','NA'))#薪酬\n", 240 | " position_info_single.append(i.get('workYear','NA'))#工作经验\n", 241 | " position_info_single.append(i.get('skillLables','NA'))#技能要求\n", 242 | " position_info_single.append(i.get('positionAdvantage','NA'))#职位优势\n", 243 | " position_info_single.append(i.get('education','NA'))#学历要求\n", 244 | " position_info_single.append(i.get('jobNature','NA'))#工作性质\n", 245 | " position_info_single.append(i.get('createTime','NA'))#发布时间\n", 246 | " position_info_single.append(i.get('companyFullName','NA'))#公司\n", 247 | " position_info_single.append(i.get('city','NA'))#城市\n", 248 | " position_info_single.append(i.get('companySize','NA'))#公司规模\n", 249 | " position_info_single.append(i.get('district','NA'))#区域\n", 250 | " position_info_single.append(i.get('financeStage','NA'))#融资情况\n", 251 | " position_info_single.append(i.get('firstType','NA'))#公司类别\n", 252 | " position_info_single.append(i.get('industryField','NA'))#涉及领域\n", 253 | " position_info_single.append(i.get('isSchoolJob','NA'))#是否校招\n", 254 | " position_info_single.append(i.get('subwayline','NA'))#地铁\n", 255 | " position_info_single.append(i.get('stationname','NA'))#站点\n", 256 | " position_info_single.append(i.get('latitude','NA'))#经度\n", 257 | " position_info_single.append(i.get('longitude','NA'))#纬度\n", 258 | " position_info_all.append(position_info_single) \n", 259 | " print(\"第%s页爬取成功,position_info_all现有数据%s行\"%(str(page),str(len(position_info_all))))\n", 260 | " else:\n", 261 | " print(\"Not exits\")\n", 262 | " sleep_time = random.randint(5,15)+random.random()\n", 263 | " time.sleep(sleep_time)#keyerror问题的解决\n", 264 | " time_end = time.time()\n", 265 | " on_time = time_end-time_start\n", 266 | " print(\"本页爬行时间%s\"%str(on_time))\n", 267 | " return position_info_all" 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": 52, 273 | "metadata": { 274 | "scrolled": true 275 | }, 276 | "outputs": [ 277 | { 278 | "name": "stdout", 279 | "output_type": "stream", 280 | "text": [ 281 | "293\n", 282 | "珠海共3条,可分为1页\n", 283 | "{'http': \"['218.64.69.79:8080']\"}\n", 284 | "珠海公司规模为2000人以上共3条,可分为1页\n", 285 | "{'http': \"['163.125.68.135:8888 ']\"}\n", 286 | "第1页爬取成功,position_info_all现有数据3行\n", 287 | "本页爬行时间16.876307725906372\n", 288 | "珠海的公司规模2000人以上的数据爬完\n", 289 | "{'http': \"['14.115.104.97:808 ']\"}\n", 290 | "珠海公司规模为少于15人,15-50人---无数据\n", 291 | "停留6.7492073387582385\n", 292 | "{'http': \"['134.249.156.3:82']\"}\n", 293 | "珠海公司规模为50-150人---无数据\n", 294 | "停留11.043163736858903\n", 295 | "{'http': \"['1.198.72.193:9999']\"}\n", 296 | "Not exits\n", 297 | "珠海公司规模为150-500人---无数据\n", 298 | "停留12.172892347874113\n", 299 | "{'http': \"['183.129.207.86:11206 ']\"}\n", 300 | "珠海公司规模为500-2000人---无数据\n", 301 | "停留10.925358249150833\n", 302 | "position_info_city有3条数据\n", 303 | "珠海的全部数据爬取成功\n", 304 | "停留13.576861105003818\n", 305 | "中山共1条,可分为1页\n", 306 | "{'http': \"['115.28.148.192:8118']\"}\n", 307 | "中山公司规模为2000人以上---无数据\n", 308 | "停留9.102117120811412\n", 309 | "{'http': \"['134.249.156.3:82']\"}\n", 310 | "中山公司规模为少于15人,15-50人---无数据\n", 311 | "停留7.852819219941637\n", 312 | "{'http': \"['1.192.243.166:9999 ']\"}\n", 313 | "中山公司规模为50-150人---无数据\n", 314 | "停留15.585111278379154\n", 
315 | "{'http': \"['163.125.68.135:8888 ']\"}\n", 316 | "中山公司规模为150-500人共1条,可分为1页\n", 317 | "{'http': \"['171.80.2.110:9999']\"}\n", 318 | "第1页爬取成功,position_info_all现有数据1行\n", 319 | "本页爬行时间20.245324850082397\n", 320 | "中山的公司规模150-500人的数据爬完\n", 321 | "{'http': \"['218.64.69.79:8080']\"}\n", 322 | "中山公司规模为500-2000人---无数据\n", 323 | "停留12.88375543553882\n", 324 | "position_info_city有4条数据\n", 325 | "中山的全部数据爬取成功\n", 326 | "停留12.943937855541314\n", 327 | "镇江-----无数据\n", 328 | "停留9.276375587886204\n", 329 | "湛江-----无数据\n", 330 | "停留8.685334392703105\n", 331 | "株洲共1条,可分为1页\n", 332 | "{'http': \"['120.83.108.77:9999 ']\"}\n", 333 | "株洲公司规模为2000人以上---无数据\n", 334 | "停留5.101241187083019\n", 335 | "{'http': \"['122.193.244.243:9999 ']\"}\n", 336 | "株洲公司规模为少于15人,15-50人共1条,可分为1页\n", 337 | "{'http': \"['14.115.104.97:808 ']\"}\n", 338 | "Not exits\n", 339 | "本页爬行时间16.907893180847168\n", 340 | "株洲的公司规模少于15人,15-50人的数据爬完\n", 341 | "{'http': \"['1.192.243.166:9999 ']\"}\n", 342 | "株洲公司规模为50-150人---无数据\n", 343 | "停留7.160573629384684\n", 344 | "{'http': \"['115.28.148.192:8118']\"}\n", 345 | "株洲公司规模为150-500人---无数据\n", 346 | "停留7.411860943214564\n", 347 | "{'http': \"['1.198.72.239:9999']\"}\n", 348 | "株洲公司规模为500-2000人---无数据\n", 349 | "停留7.505609365573439\n", 350 | "position_info_city有4条数据\n", 351 | "株洲的全部数据爬取成功\n", 352 | "停留12.079427715518644\n", 353 | "淄博-----无数据\n", 354 | "停留5.942007353497083\n", 355 | "肇庆-----无数据\n", 356 | "停留8.137823183299782\n", 357 | "张家口-----无数据\n", 358 | "停留13.28549592337901\n", 359 | "漳州-----无数据\n", 360 | "停留14.207343278491953\n", 361 | "遵义-----无数据\n", 362 | "停留14.95192348923606\n", 363 | "驻马店-----无数据\n", 364 | "停留11.949976561928546\n", 365 | "长治-----无数据\n", 366 | "停留15.335753043318482\n", 367 | "枣庄-----无数据\n", 368 | "停留14.370120856259682\n", 369 | "资阳-----无数据\n", 370 | "停留6.685431142101895\n", 371 | "舟山-----无数据\n", 372 | "停留8.37063967450687\n", 373 | "自贡-----无数据\n", 374 | "停留7.527898435115426\n", 375 | "周口共1条,可分为1页\n", 376 | "{'http': \"['112.85.169.126:9999']\"}\n", 377 | "周口公司规模为2000人以上---无数据\n", 378 | "停留6.014899216823151\n", 379 | "{'http': \"['1.197.204.217:9999 ']\"}\n", 380 | "周口公司规模为少于15人,15-50人---无数据\n", 381 | "停留13.197189074964884\n", 382 | "{'http': \"['112.85.164.68:9999']\"}\n", 383 | "周口公司规模为50-150人共1条,可分为1页\n", 384 | "{'http': \"['1.198.72.239:9999']\"}\n", 385 | "第1页爬取成功,position_info_all现有数据1行\n", 386 | "本页爬行时间19.017383813858032\n", 387 | "周口的公司规模50-150人的数据爬完\n", 388 | "{'http': \"['120.83.108.77:9999 ']\"}\n", 389 | "周口公司规模为150-500人---无数据\n", 390 | "停留14.04838082143874\n", 391 | "{'http': \"['103.26.245.190:43328']\"}\n", 392 | "周口公司规模为500-2000人---无数据\n", 393 | "停留9.220707758209855\n", 394 | "position_info_city有5条数据\n", 395 | "周口的全部数据爬取成功\n", 396 | "停留5.143233270941914\n", 397 | "昭通-----无数据\n", 398 | "停留7.923482816445571\n", 399 | "张掖-----无数据\n", 400 | "停留13.237375799756679\n", 401 | "张家界-----无数据\n", 402 | "停留11.974681287128595\n", 403 | "中卫-----无数据\n", 404 | "停留7.505663442219061\n", 405 | "全部数据存储成功(CSV格式)----持续奔跑中!\n", 406 | "[[5858185, '数据分析师', '15k-30k', '不限', ['数据挖掘', '数据架构', 'MySQL', '数据分析'], '免费三餐,数据分析', '不限', '全职', '2019-04-29 17:45:35', '北京金山软件有限公司', '珠海', '2000人以上', '香洲区', '上市公司', '开发|测试|运维类', '移动互联网', 0, None, None, '22.34793', '113.601205'], [1978586, '数据分析师', '1k-2k', '应届毕业生', [], '提供岗位培训,表现优异者可转正', '不限', '实习', '2019-05-20 16:25:19', '成都西山居互动娱乐科技有限公司珠海分公司', '珠海', '2000人以上', '香洲区', '不需要融资', '产品|需求|项目类', '游戏', 1, None, None, '22.348471', '113.601247'], [5945889, '高级数据分析师', '18k-30k', '3-5年', ['数据分析', '数据挖掘', '数据处理'], '团队优秀 业务前景好 六险一金 年度旅行', '本科', '全职', 
'2019-05-22 17:37:33', '北京金山软件有限公司', '珠海', '2000人以上', '香洲区', '上市公司', '开发|测试|运维类', '移动互联网', 0, None, None, '22.347929', '113.601205'], [5339042, '数据分析实习生(中山)', '2k-3k', '应届毕业生', ['数据分析', 'MySQL', '数据处理'], '提供转正,周末双休,专题培训,法定假日', '本科', '实习', '2019-05-22 09:34:00', '广东精点数据科技股份有限公司', '中山', '150-500人', '中山市市辖区', '不需要融资', '开发|测试|运维类', '移动互联网,数据服务', 1, None, None, '22.503981', '113.403275'], [5784906, '数据分析工程师', '4k-5k', '1-3年', [], '员工旅游、补充医疗保险、绩效奖金、餐补', '大专', '全职', '2019-05-22 17:22:48', '中科三清科技有限公司', '周口', '50-150人', '川汇区', '不需要融资', '产品|需求|项目类', '其他,移动互联网', 0, None, None, '33.61007', '114.669126']]\n" 407 | ] 408 | } 409 | ], 410 | "source": [ 411 | "def main():\n", 412 | " #公司规模\n", 413 | " gms = ['2000人以上','少于15人,15-50人','50-150人','150-500人','500-2000人'] \n", 414 | " keyword = '数据分析'\n", 415 | " citys = get_citys(keyword)\n", 416 | " position_info_all = []\n", 417 | " columns = ['职位编号','职位名称','薪酬','工作经验','技能要求','职位优势','学历要求','工作形式','发布时间',\n", 418 | " '公司','城市','规模','区域','融资情况','公司类别','涉及领域','是否校招',\n", 419 | " '地铁','站点','经度','纬度']\n", 420 | " print(len(citys))\n", 421 | " for i in range(0,len(citys)):\n", 422 | " position_info_city = []\n", 423 | " city = citys[i]\n", 424 | " cityPages = get_city_Pages(city,keyword)\n", 425 | " if cityPages > 0 :\n", 426 | " for j in range(len(gms)):\n", 427 | " gm = gms[j]\n", 428 | " totalPages = get_totalPages(city,keyword,gm)\n", 429 | " if totalPages>0:\n", 430 | " \n", 431 | " position_info_city_gm = get_info(city,keyword,gm,totalPages)\n", 432 | " for m in range(len(position_info_city_gm)):\n", 433 | " item = position_info_city_gm[m]\n", 434 | " position_info_city.append(item)\n", 435 | " position_info_all.append(item)\n", 436 | " # print(position_info_all)\n", 437 | " print(\"%s的公司规模%s的数据爬完\"%(city,gm))\n", 438 | " else:\n", 439 | " # print(\"%s的公司规模%s无数据\"%(city,gm))\n", 440 | " sleep_time = random.randint(5,15)+random.random()\n", 441 | " time.sleep(sleep_time)#keyerror问题的解决\n", 442 | " print(\"停留%s\"%sleep_time)\n", 443 | " print(\"position_info_all有%s条数据\"%len(position_info_all))\n", 444 | " print(\"%s的全部数据爬取成功\"%city)\n", 445 | " sleep_time = random.randint(5,15)+random.random()\n", 446 | " time.sleep(sleep_time)#keyerror问题的解决\n", 447 | " print(\"停留%s\"%sleep_time)\n", 448 | " df_city = pd.DataFrame(data = position_info_city,columns =columns )\n", 449 | " df_city.to_csv('%s.csv'%(city),index = False,encoding=\"utf_8_sig\")\n", 450 | " else:\n", 451 | " sleep_time = random.randint(5,15)+random.random()\n", 452 | " time.sleep(sleep_time)#keyerror问题的解决\n", 453 | " print(\"停留%s\"%sleep_time)\n", 454 | "# position_info_all.append(position_info_city)\n", 455 | " df_all = pd.DataFrame(data=position_info_all,columns=columns)\n", 456 | " df_all.to_csv('all_city.csv',index = False,encoding=\"utf_8_sig\")\n", 457 | " print(\"全部数据存储成功(CSV格式)----持续奔跑中!\") \n", 458 | " print(position_info_all)\n", 459 | "\n", 460 | "if __name__ == '__main__':\n", 461 | " main()" 462 | ] 463 | }, 464 | { 465 | "cell_type": "code", 466 | "execution_count": null, 467 | "metadata": {}, 468 | "outputs": [], 469 | "source": [] 470 | } 471 | ], 472 | "metadata": { 473 | "kernelspec": { 474 | "display_name": "Python 3", 475 | "language": "python", 476 | "name": "python3" 477 | }, 478 | "language_info": { 479 | "codemirror_mode": { 480 | "name": "ipython", 481 | "version": 3 482 | }, 483 | "file_extension": ".py", 484 | "mimetype": "text/x-python", 485 | "name": "python", 486 | "nbconvert_exporter": "python", 487 | "pygments_lexer": "ipython3", 488 | "version": "3.6.5" 489 | } 490 | 
}, 491 | "nbformat": 4, 492 | "nbformat_minor": 2 493 | } 494 | -------------------------------------------------------------------------------- /拉勾网/拉勾网反爬机制及攻克方式总结: -------------------------------------------------------------------------------- 1 | 1 After roughly three consecutive pages are scraped, the API starts returning empty results 2 | Workaround: 3 | Build pools of User-Agent strings and proxy IPs and pick from them at random for every new page request 4 | 2 Posting directly to the Ajax URL taken from the request headers returns an "IP blocked" message 5 | Workaround 1: 6 | While logged in, copy the cookies from the browser's request headers to fill out the custom headers. In this project, however, that still did not solve the problem 7 | Workaround 2: 8 | Request the search landing page with a requests.Session first, record session.cookies, and send those cookies with the subsequent POST. 9 | Note that the cookies have to be fetched again every time the proxy IP changes 10 | 3 KeyError exceptions 11 | Workaround: 12 | Use time.sleep to pause between pages and stretch out the interval between requests 13 | 4 Latin-1 encoding errors 14 | Workaround: 15 | A URL must not contain raw Chinese characters, so encode them into standard URL form with urllib.parse.quote; urllib.parse.unquote decodes them back to Chinese 16 | Note that if the data passed to requests.post(url, headers=headers, data=data) contains a Chinese keyword, that keyword must NOT be URL-encoded 17 | 5 Only a limited amount of data can be scraped 18 | Workaround 1: 19 | Lagou displays at most 30 pages of 15 postings each, yet the JSON responses actually cover far more records. 20 | content -> positionResult -> totalCount gives the total number of postings, but not all of them can necessarily be fetched; Lagou's anti-scraping really is pushed to the extreme. 21 | The content-length entry in the headers corresponds to the maximum number of pages that can be fetched; that page count * 15 is the amount of data we can ultimately obtain 22 | Workaround 2: 23 | Split the query into mutually independent groups of fewer than 450 records each whose union is the full data set, e.g. filter by company size, scrape each group separately, then merge the results. 24 | 6 Removing SSL verification in Python requests and silencing the InsecureRequestWarning printed to the console. 25 | Workaround: 26 | from requests.packages.urllib3.exceptions import InsecureRequestWarning # suppress insecure-request warnings 27 | requests.packages.urllib3.disable_warnings(InsecureRequestWarning) 28 | 7 Building a pandas DataFrame with from_records raises "AssertionError: 1 columns passed, passed data had 22 columns" 29 | Cause: lists of different dimensions were repeatedly appended to the same list 30 | Workaround: 31 | Keep separate overall and per-city accumulator lists and append only rows of the same dimension 32 | --------------------------------------------------------------------------------
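
A note on points 1, 2 and 4 above: the sketch below condenses the request flow used in the notebooks into one function. It is illustrative only — the proxy addresses are the free ones listed in the notebooks and will almost certainly be dead, the User-Agent strings are the ones the notebooks use, and fetch_total is a hypothetical helper name, not part of the repository.

import random
import requests
from urllib import parse

UA_POOL = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36',
    'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Mobile Safari/537.36',
]
PROXY_POOL = ['134.249.156.3:82', '1.198.72.239:9999', '103.26.245.190:43328']

def fetch_total(city, keyword):
    headers = {
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'User-Agent': random.choice(UA_POOL),        # point 1: rotate the User-Agent
    }
    proxies = {'http': random.choice(PROXY_POOL)}    # point 1: rotate the proxy IP
    city_q = parse.quote(city)                       # point 4: only the URL parts get quoted
    kw_q = parse.quote(keyword)
    url_start = 'https://www.lagou.com/jobs/list_' + kw_q + '?city=' + city_q
    url_parse = ('https://www.lagou.com/jobs/positionAjax.json?city=' + city_q
                 + '&needAddtionalResult=false')
    headers['Referer'] = url_start
    s = requests.Session()
    s.get(url_start, headers=headers, proxies=proxies, timeout=3)   # point 2: landing page issues the cookies
    data = {'first': 'true', 'pn': 1, 'kd': keyword}                # point 4: 'kd' stays as un-encoded Chinese
    response = s.post(url_parse, data=data, headers=headers, proxies=proxies,
                      cookies=s.cookies, timeout=5)
    return response.json()['content']['positionResult']['totalCount']

Called as fetch_total('上海', '数据分析'), this mirrors the first notebook and returns the total posting count for that city and keyword, provided the proxies are alive.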
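
Points 5 (workaround 2) and 7 combine naturally: query each company-size group separately so no single group exceeds the cap, append one flat, fixed-length row per posting, and build the DataFrame only at the end. A minimal sketch, assuming a fetch_page(city, keyword, gm, pn) helper that returns the parsed JSON (standing in for the second notebook's get_json); the column subset shown is illustrative:

import math
import pandas as pd

COMPANY_SIZES = ['2000人以上', '少于15人,15-50人', '50-150人', '150-500人', '500-2000人']
COLUMNS = ['positionId', 'positionName', 'salary', 'city', 'companySize']  # illustrative subset of the JSON fields

def scrape_city(city, keyword, fetch_page):
    rows = []                                    # point 7: one flat list, every row the same length
    for gm in COMPANY_SIZES:                     # point 5: independent groups whose union is the full set
        total = (fetch_page(city, keyword, gm, 1)
                 .get('content', {}).get('positionResult', {}).get('totalCount', 0))
        for pn in range(1, math.ceil(total / 15) + 1):   # 15 postings per page
            text = fetch_page(city, keyword, gm, pn)
            for p in text.get('content', {}).get('positionResult', {}).get('result', []):
                rows.append([p.get(col, 'NA') for col in COLUMNS])
    return pd.DataFrame(rows, columns=COLUMNS)

Because every appended row has exactly len(COLUMNS) entries, the final DataFrame construction avoids the "1 columns passed, passed data had 22 columns" assertion described in point 7.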