├── .gitignore
├── README.md
├── config.json
├── config_utils
│   ├── __init__.py
│   └── configreader.py
├── crawlspider
│   ├── __init__.py
│   ├── html_downloader.py
│   ├── html_header.py
│   ├── html_outputer.py
│   ├── html_parse_item.py
│   ├── html_parser.py
│   ├── spider.py
│   ├── url_manager.py
│   └── user_agents.py
├── output
│   ├── *技术讨论区*.html
│   ├── index.html
│   ├── 不骑马的亚洲的电影.html
│   ├── 大家都说中文的电影.html
│   ├── 欧美电影.html
│   ├── 纪念达盖尔的板块.html
│   └── 骑马的亚洲的电影.html
└── quick_start.py
/.gitignore:
--------------------------------------------------------------------------------
1 | # Byte-compiled / optimized / DLL files
2 | __pycache__/
3 | *.py[cod]
4 | *$py.class
5 |
6 | # C extensions
7 | *.so
8 |
9 | # Distribution / packaging
10 | .Python
11 | env/
12 | build/
13 | develop-eggs/
14 | dist/
15 | downloads/
16 | eggs/
17 | .eggs/
18 | lib/
19 | lib64/
20 | parts/
21 | sdist/
22 | var/
23 | wheels/
24 | *.egg-info/
25 | .installed.cfg
26 | *.egg
27 |
28 | # PyInstaller
29 | # Usually these files are written by a python script from a template
30 | # before PyInstaller builds the exe, so as to inject date/other infos into it.
31 | *.manifest
32 | *.spec
33 |
34 | # Installer logs
35 | pip-log.txt
36 | pip-delete-this-directory.txt
37 |
38 | # Unit test / coverage reports
39 | htmlcov/
40 | .tox/
41 | .coverage
42 | .coverage.*
43 | .cache
44 | nosetests.xml
45 | coverage.xml
46 | *.cover
47 | .hypothesis/
48 |
49 | # Translations
50 | *.mo
51 | *.pot
52 |
53 | # Django stuff:
54 | *.log
55 | local_settings.py
56 |
57 | # Flask stuff:
58 | instance/
59 | .webassets-cache
60 |
61 | # Scrapy stuff:
62 | .scrapy
63 |
64 | # Sphinx documentation
65 | docs/_build/
66 |
67 | # PyBuilder
68 | target/
69 |
70 | # Jupyter Notebook
71 | .ipynb_checkpoints
72 |
73 | # pyenv
74 | .python-version
75 |
76 | # celery beat schedule file
77 | celerybeat-schedule
78 |
79 | # SageMath parsed files
80 | *.sage.py
81 |
82 | # dotenv
83 | .env
84 |
85 | # virtualenv
86 | .venv
87 | venv/
88 | ENV/
89 |
90 | # Spyder project settings
91 | .spyderproject
92 | .spyproject
93 |
94 | # Rope project settings
95 | .ropeproject
96 |
97 | # mkdocs documentation
98 | /site
99 |
100 | # mypy
101 | .mypy_cache/
102 |
103 | /.idea/
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | > 1024 is a good site
2 | 
3 | ***First of all, this hands-on series assumes you can find the 1024 site on your own. I do not provide its address here. To be clear, this is only about using the attitude and methods of computer science to analyze a problem; it has no affiliation whatsoever with the 1024 site.***
4 | 
5 | On the 1024 site, maybe you are like me: you enjoy browsing the tech discussion board and reading the daily news digest posts. Doesn't it drive you crazy that threads in a board are sorted by reply time, so you can never find the posts you actually want to read? Do you find yourself scrolling from the top again and again just to spot the posts you haven't read today?
6 | 
7 | Don't worry, these problems have bothered me too! The community is huge and threads scroll by quickly. Because of how the board is laid out, every time I visit I have to start from the top to figure out which posts I haven't seen today, reading titles on the left and post times on the right. It's tiring, I don't like it, and it wastes time.
8 | 
9 | As a programmer, I figured all of these problems could be solved by writing my own Python crawler.
10 | 
11 | #### I think this is the ***most convenient***, ***most awesome***, ***handiest***, ***most practical*** crawler around! Put what you learn to real use: ***making code genuinely make my life easier is exactly what I write programs for***.
12 | 
13 | ## The problem we have now:
14 | Forum threads are sorted by the time of the latest reply, so to see each day's newly published posts you always have to scan the whole board from the top. It's annoying and a waste of time.
15 |
16 | 
17 |
18 | ## What we want it to look like
19 | Threads listed in the order they were posted, so catching up on each day's new content takes no effort.
20 | 
21 | If we write a crawler to solve this, the overall structure looks roughly like this:
22 |
23 | 
24 |
25 | There are a few parts here:
26 | - **config.json**: the configuration file. For now it needs the following information:
27 |     1. The URL of the 1024 site
28 |     2. The file location for the crawler's output
29 |     3. The maximum page num the crawler should fetch
30 |     4. The board info, i.e. the forum board names (*these can be customized*) and each board's fid
31 | - **Url_manager**: manages the URLs waiting to be crawled.
32 | - **Html_downloader**: fetches the web pages.
33 | - **Html_parser**: the crawler's page parser.
34 | - **Html_outputer**: outputs the crawler's results.
35 | 
36 | The structure above is simple, and so is the flow: *we first set up the local config.json file and start the program; the crawler automatically fetches the first few pages of each configured board, filters the scraped posts by posting time, sorts the collected information chronologically, and finally writes it out as html files to open in the local web browser. In the browser you can see each post's id, title, and posting time, and clicking a title jumps to the post in the community.*
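
A minimal sketch of that end-to-end flow, mirroring what `quick_start.py` and `Spider.crawl_1024` in this repo do:

```python
from config_utils.configreader import ConfigParams
from crawlspider.spider import Spider

# Load the local config, crawl the configured boards, write the html pages,
# and open the generated index.html in the default browser.
config = ConfigParams('config.json')
spider = Spider(*config.get_1024_config())
spider.crawl_1024()
```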
37 |
38 | Just like that, the content-rich caoliu site is turned into the simplest possible locally generated ***html*** files.
39 | 
40 | The index page of our reorganized site:
41 | 
42 |
43 | A reorganized board looks like this:
44 |
45 | 
46 |
47 |
48 |
49 | 
50 |
51 | This is much simpler and easier on the eyes, with no more hunting for posts one by one. Posts we have already read even show up in a different color. It saves a lot of time. Below is a quick rundown of the technical points used in the project.
52 | 
53 | ### Technical overview
54 | There are plenty of mature crawler frameworks out there, such as `Scrapy`, and I have used `Scrapy` before. It is certainly powerful, but with it you miss out on the fun of crawling, so I decided to build a crawler from scratch and get an up-close feel for crawlers and for the joy of `Python`.
55 | 
56 | #### Overall stack
57 | - `python 3.6`
58 | - `requests`
59 | - `BeautifulSoup4`
60 | - `webbrowser`
61 | - `json`
62 |
63 | #### Config.json
64 | This is the configuration file; the basic parameters the program needs are written into this json file. The related reader class is `configreader` in `config_utils`.
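
A minimal sketch of what the reader provides (mirroring `ConfigParams.get_1024_config` in `config_utils/configreader.py`):

```python
from config_utils.configreader import ConfigParams

config = ConfigParams('config.json')
# Returns: the site's root URL, the output directory, the file:// URL of the
# generated index.html, the max pages to crawl per board, and the
# board-name -> fid mapping.
url_root, file_root, file_url, max_pages, block_info = config.get_1024_config()
```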
65 |
66 | 
67 |
68 | #### Url_manager
69 | A `dict` stores the board names and their corresponding board `URL`s, with a few simple methods for working with the `URL`s.
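
A stripped-down sketch of that idea (the real `UrlManager` in `crawlspider/url_manager.py` additionally tracks new vs. already-crawled URLs per board; the root URL and fids below are placeholders):

```python
# Board name -> entry URL, using the same URL pattern HtmlParser.get_urls() builds.
url_root = "http://example.com/"                 # placeholder root URL
block_info = {"*技术讨论区*": 7, "欧美电影": 4}  # board name -> fid, as in config.json

block_urls = {
    name: url_root + "thread0806.php?fid=%s&search=&page=1" % fid
    for name, fid in block_info.items()
}
print(block_urls)
```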
70 |
71 | #### Html_downloader
72 | Uses the `requests` module to access the web pages and fetch the page data, which is the basis for the parsing steps that follow.
73 | Because the `1024 site` has anti-crawling measures in place, I rotate different `HTTP header`s when making requests, which has worked reasonably well so far. The header information lives in the `user_agents` file.
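
A minimal sketch of that request pattern (the actual code lives in `html_downloader.py` and `html_header.py`; the URL passed in would come from the URL manager):

```python
import random

import requests

from crawlspider.user_agents import agents  # pool of User-Agent strings


def download(url):
    # Pick a random User-Agent per request so consecutive requests look different.
    headers = {"User-Agent": random.choice(agents)}
    resp = requests.get(url, headers=headers, timeout=10)
    resp.encoding = 'gbk'  # the forum pages are gbk-encoded
    return resp.text
```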
74 |
75 | #### Html_parser
76 | Uses `BeautifulSoup` to parse the `html`. Every post has a *unique id*. Each post is wrapped in a `CaoliuItem` and the results are handed to `html_outputer`. The lookup is done through `html` `tag`s rather than regular expressions, which may be a bit *rigid*.
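
A simplified sketch of that tag-based lookup (condensed from `_get_titles` in `html_parser.py`; the class names `tal` and `s3` come from the forum's markup):

```python
from bs4 import BeautifulSoup


def extract_posts(html_cont):
    soup = BeautifulSoup(html_cont, 'html.parser')
    posts = []
    # Each title cell carries class "tal"; the posting time sits in a sibling
    # cell with class "s3" under the same parent row.
    for cell in soup.find_all(attrs={"class": "tal"}):
        post_time = cell.parent.find(attrs={"class": "s3"})
        if post_time is None:
            continue
        link = cell.find('h3').find('a')
        posts.append((link.text, link.get('href'), post_time.text))
    return posts
```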
77 |
78 | #### Html_outputer
79 | This class turns the parsed results collected earlier into `html` files. The final output has an `index` page, and each board gets its own page; they all link to one another, so clicking around is smooth and super convenient.
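
The linking is just plain anchor tags; a stripped-down sketch of the index page (the full version in `html_outputer.py` also writes one table per board, sorted by post time):

```python
import time


def write_index(file_root, block_pages):
    # block_pages: board name -> path of that board's generated html page
    with open(file_root + "index.html", "w", encoding="utf-8") as f:
        f.write("<html><body>")
        f.write("<h1>%s</h1>" % time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
        for name, path in block_pages.items():
            f.write('<p><a href="%s">%s</a></p>' % (path, name))
        f.write("</body></html>")
```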
80 |
81 | ### Things to improve (TODO)
82 | - The overall structure is clear, but it still needs refinement. Building a crawler as powerful as `Scrapy` has to happen step by step.
83 | - The crawler is still fairly weak and does not use multithreading. The next version could add multithreading to improve both speed and quality.
84 | - The `parser` depends too heavily on the site's layout; if the layout changes, the `parser` has to change. This is a common problem for all crawlers, and I am still thinking about how to make this part more flexible and less rigid.
85 | - The `output` `html` files are not pretty enough.
86 | - In the next version I would like to hook the parsed results up to `MongoDB` as a local copy, so that earlier post information can still be viewed.
87 | - Next it would be best to crawl one level deeper into each post and automatically download the images or torrent files. I built an image-and-torrent downloader with `Scrapy` before, but it would be better to integrate that with this hand-written crawler.
88 | - Ideally the crawler would extend to other sites, such as Weibo, V2EX, and similar news sites. Bouncing between these sites every day, opening this one and that one, really does waste time; it would be much nicer to aggregate their daily updates in one place and read everything on a single site in one sitting.
89 | - The final version would turn this program into a backend service deployed on a server, so that a single visit each day shows that day's updates from every site: the effect of ***"visit one, and you've visited them all"***.
90 | 
91 | The source code of this project is available through the ***Read the original*** link.
92 | 
93 | Finally, a little bonus: follow the WeChat official account **皮克啪的铲屎官** and reply "1024" to find what you need~
94 | 
95 |
96 |
97 |
98 |
--------------------------------------------------------------------------------
/config.json:
--------------------------------------------------------------------------------
1 | {
2 | "url_root":"http://cc.itbb.men/",
3 | "file_dir":"/Users/SwyftG/Github/Daily1024/output/",
4 | "file_url":"file:///Users/SwyftG/Github/Daily1024/output/index.html",
5 | "max_pages": 3,
6 | "block_info": {
7 | "*技术讨论区*":7,
8 | "骑马的亚洲的电影":15,
9 | "不骑马的亚洲的电影":2,
10 | "大家都说中文的电影":25,
11 | "欧美电影":4,
12 | "纪念达盖尔的板块":16
13 | }
14 | }
15 |
16 |
17 |
18 |
--------------------------------------------------------------------------------
/config_utils/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/SwyftG/Daily1024/9a08c4f49166a4f690f3dc2768c725d2a366735d/config_utils/__init__.py
--------------------------------------------------------------------------------
/config_utils/configreader.py:
--------------------------------------------------------------------------------
1 | # encoding: utf-8
2 | import json
3 | 
4 | 
5 | class ConfigParams(object):
6 |     def __init__(self, path):
7 |         with open(path, "r", encoding="utf-8") as config_file:
8 |             file_json = json.load(config_file)
9 |         self.url_root = file_json['url_root']
10 |         self.file_root = file_json['file_dir']
11 |         self.file_url = file_json['file_url']
12 |         self.max_pages = file_json['max_pages']
13 |         self.block_info = file_json['block_info']
14 | 
15 |     def __str__(self):
16 |         return "Url_root: %s \nFile_root: %s \nFile_url: %s \nMax_pages: %s\nBlock_info: %s" % (self.url_root, self.file_root, self.file_url, self.max_pages, self.block_info)
17 | 
18 |     def get_1024_config(self):
19 |         print(self)
20 |         return self.url_root, self.file_root, self.file_url, self.max_pages, self.block_info
21 | 
--------------------------------------------------------------------------------
/crawlspider/__init__.py:
--------------------------------------------------------------------------------
1 | # encoding: utf-8
2 | __author__ = 'lianggao'
3 | __date__ = '2018/5/7 下午2:31'
--------------------------------------------------------------------------------
/crawlspider/html_downloader.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | # -*- coding: utf-8 -*-
3 | import requests as request
4 | from crawlspider.html_header import HtmlHeader
5 |
6 |
7 | class HtmlDownloader(object):
8 |     def __init__(self):
9 |         self.header = HtmlHeader()
10 | 
11 |     def download_data(self, url):
12 |         if url is None:
13 |             return None
14 |         head = self.header.get_header()
15 |         result = request.get(url, headers=head, timeout=10)
16 |         result.encoding = 'gbk'
17 |         return result.text
18 |
--------------------------------------------------------------------------------
/crawlspider/html_header.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | # -*- coding: utf-8 -*-
3 | import random
4 | from crawlspider.user_agents import agents
5 |
6 |
7 | class HtmlHeader(object):
8 |     def get_header(self):
9 |         agent = random.choice(agents)
10 |         return {"User-Agent": agent}
11 |
--------------------------------------------------------------------------------
/crawlspider/html_outputer.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | # -*- coding: utf-8 -*-
3 | 
4 | import time
5 | 
6 | 
7 | class HtmlOutputer(object):
8 |     def __init__(self, file_root):
9 |         self.data = []
10 |         self.file_root = file_root
11 |         self.data_list = {}
12 | 
13 |     # Shadowed by the three-argument collect_data below (Python keeps only
14 |     # the last definition with a given name); the spider uses the later one.
15 |     def collect_data(self, new_data):
16 |         if new_data is None:
17 |             return
18 |         for item in new_data:
19 |             if item not in self.data:
20 |                 self.data.append(item)
21 | 
22 |     def collect_data(self, name, new_data):
23 |         if new_data is None:
24 |             return
25 |         temp_block = self.data_list.get(name)
26 |         if temp_block is None:
27 |             self.data_list[name] = new_data
28 |         else:
29 |             for item in new_data:
30 |                 if item not in temp_block:
31 |                     temp_block.append(item)
32 | 
33 |     def _sort_data(self):
34 |         # For each block: newest time of day first, then keep posts grouped
35 |         # by the leading day field of post_time.
36 |         for item in self.data_list:
37 |             data = self.data_list.get(item)
38 |             data.sort(key=lambda k: (k.post_time[-5:]), reverse=True)
39 |             data.sort(key=lambda k: (k.post_time[0:2]))
40 | 
41 |     def output_html(self):
42 |         self._sort_data()
43 |         filename = self.file_root + "index.html"
44 |         block_info = {}
45 |         result_root_file = open(filename, 'w', encoding='utf-8')
46 |         result_root_file.write("<html>")
47 |         result_root_file.write("<head><meta charset=\"utf-8\"></head>\n")
48 |         result_root_file.write("<body>")
49 |         result_root_file.write("<h1> %s </h1>\n" % time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
50 |         result_root_file.write("<table>\n")
51 |         for item_name in self.data_list:
52 |             item_file_url = self.file_root + item_name + ".html"
53 |             block_info[item_name] = item_file_url
54 |             result_root_file.write("<tr>")
55 |             result_root_file.write("<td><a href=\"%s\"> %s </a></td>" % (item_file_url, item_name))
56 |             result_root_file.write("</tr>\n")
57 |         result_root_file.write("</table>\n")
58 |         result_root_file.write("</body>")
59 |         result_root_file.write("</html>")
60 |         result_root_file.close()
61 | 
62 |         for block_name in block_info:
63 |             block_data_list = self.data_list.get(block_name)
64 |             block_file = open(block_info[block_name], 'w', encoding='utf-8')
65 |             block_file.write("<html>")
66 |             block_file.write("<head><meta charset=\"utf-8\"></head>")
67 |             block_file.write("<body>")
68 |             block_file.write("<h1> %s </h1>\n" % time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
69 |             block_file.write("<table>\n")
70 |             pre_time = ""
71 |             block_file.write("<tr>")
72 |             block_file.write("<td>************************** %s ****************************</td>" % block_name)
73 |             block_file.write("</tr>\n")
74 |             for data in block_data_list:
75 |                 if pre_time not in data.post_time:
76 |                     # Insert a separator row whenever the post day changes.
77 |                     block_file.write("<tr>")
78 |                     block_file.write("<td>******************************************************</td>")
79 |                     block_file.write("</tr>\n")
80 |                 block_file.write("<tr>")
81 |                 block_file.write("<td> %s </td>" % data.post_id)
82 |                 block_file.write("<td><a href=\"%s\"> %s </a></td>" % (data.post_url, data.post_title))
83 |                 block_file.write("<td> %s </td>" % data.post_time)
84 |                 block_file.write("</tr>\n")
85 |                 pre_time = data.post_time[0:2]
86 |             block_file.write("</table>\n")
87 |             block_file.write("</body>")
88 |             block_file.write("</html>")
89 |             block_file.close()
--------------------------------------------------------------------------------
/crawlspider/html_parse_item.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | # -*- coding: utf-8 -*-
3 |
4 |
5 | class CaoliuItem(object):
6 |     def __init__(self, post_title, post_url, post_time, post_id=0):
7 |         self.post_title = post_title
8 |         self.post_url = post_url
9 |         self.post_time = post_time
10 |         self.post_id = post_id
11 |         self.download_url = None
12 | 
13 |     def __str__(self):
14 |         return "Id: %s \nName: %s \nUrl: %s \nTime: %s \n--------------------------------------" % (self.post_id, self.post_title, self.post_url, self.post_time)
15 | 
16 |     def set_download_url(self, url):
17 |         self.download_url = url
--------------------------------------------------------------------------------
/crawlspider/html_parser.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | # -*- coding: utf-8 -*-
3 | from bs4 import BeautifulSoup
4 | from crawlspider.html_parse_item import CaoliuItem
5 |
6 | keywords = ['今天', '昨天', 'Top-marks']
7 |
8 |
9 | class HtmlParser(object):
10 |     def parse(self, page_url, html_cont, max_pages):  # shadowed by the four-argument parse() below
11 |         if page_url is None or html_cont is None:
12 |             return
13 |         soup = BeautifulSoup(html_cont, 'html.parser')
14 |         url, data = self._get_titles(page_url, soup, max_pages)
15 |         return url, data
16 | 
17 |     def parse(self, name, page_url, html_cont, max_pages):
18 |         if page_url is None or html_cont is None:
19 |             return
20 |         soup = BeautifulSoup(html_cont, 'html.parser')
21 |         url, data = self._get_titles(page_url, soup, max_pages, name)
22 |         return url, data
23 | 
24 |     def _get_titles(self, page_url, soup, max_pages=1, name=None):
25 |         result_data = []
26 |         post_blocks = soup.find_all(attrs={"class": "tal"})
27 |         for item in post_blocks:
28 |             post_parent = item.parent
29 |             post_time = post_parent.find(attrs={"class": "s3"})
30 |             if post_time is None:
31 |                 continue
32 |             post_block = item.find('h3').find('a')
33 |             post_name = post_block.text
34 |             temp_url = post_block.get('href')
35 |             if "tid" in temp_url:
36 |                 post_id = temp_url[-7:]
37 |             else:
38 |                 post_id = temp_url[-12:-5]
39 |             post_url = page_url[0:19] + temp_url  # first 19 chars == the configured root URL (e.g. "http://cc.itbb.men/")
40 |             parse_item = CaoliuItem(post_name, post_url, post_time.text, post_id)
41 |             result_data.append(parse_item)
42 |         page_count = int(page_url[-1:])  # assumes a single-digit page number at the end of the URL
43 |         if page_count < int(max_pages):
44 |             page_count += 1
45 |             next_url = page_url[:-1] + str(page_count)
46 |         else:
47 |             next_url = ""
48 |         return next_url, result_data
49 | 
50 |     def get_urls(self, page_url, block_info):
51 |         result_list = {}
52 |         for block_name in block_info:
53 |             block_url = page_url + "thread0806.php?fid=" + str(block_info[block_name]) + "&search=&page=1"
54 |             result_list[block_name] = block_url
55 |         return result_list
56 |
57 |
58 |
--------------------------------------------------------------------------------
/crawlspider/spider.py:
--------------------------------------------------------------------------------
1 | # encoding: utf-8
2 | import webbrowser
3 | from crawlspider.url_manager import UrlManager
4 | from crawlspider.html_downloader import HtmlDownloader
5 | from crawlspider.html_parser import HtmlParser
6 | from crawlspider.html_outputer import HtmlOutputer
7 |
8 |
9 | class Spider(object):
10 |     def __init__(self, url_root, file_root, file_url, max_pages, block_info):
11 |         self.file_root = file_root
12 |         self.url_root = url_root
13 |         self.file_url = file_url
14 |         self.max_pages = max_pages
15 |         self.block_info = block_info
16 |         self.urls = UrlManager()
17 |         self.downloader = HtmlDownloader()
18 |         self.parser = HtmlParser()
19 |         self.outputer = HtmlOutputer(file_root)
20 | 
21 |     def crawl_1024(self):
22 |         parse_result = self.parser.get_urls(self.url_root, self.block_info)
23 |         for item in parse_result:
24 |             self.urls.add_new_url_in_wrapper(item, parse_result.get(item))
25 | 
26 |         while self.urls.has_new_url():
27 |             name, url = self.urls.get_new_url()
28 |             print("name: %s\ncraw: %s" % (name, url))
29 |             html_cont = self.downloader.download_data(url)
30 |             new_url, new_data = self.parser.parse(name, url, html_cont, self.max_pages)
31 |             self.urls.add_new_url_in_wrapper(name, new_url)
32 |             self.outputer.collect_data(name, new_data)
33 | 
34 |         self.outputer.output_html()
35 |         webbrowser.open_new(self.file_url)
36 |         webbrowser.get()
37 |
--------------------------------------------------------------------------------
/crawlspider/url_manager.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | # -*- coding: utf-8 -*-
3 |
4 |
5 | class UrlManager(object):
6 |     def __init__(self):
7 |         self.url_wrapper_list = {}
8 |         self.new_urls = set()
9 |         self.old_urls = set()
10 |         self.cur_wrapper = None
11 | 
12 |     def add_new_url(self, url):
13 |         if url is None or len(url) == 0:
14 |             return
15 |         if url not in self.new_urls and url not in self.old_urls:
16 |             self.new_urls.add(url)
17 | 
18 |     def has_new_url(self):
19 |         for item in self.url_wrapper_list:
20 |             temp_data = self.url_wrapper_list.get(item)
21 |             if temp_data.has_new_url():
22 |                 self.cur_wrapper = temp_data
23 |                 return True
24 |         return False
25 | 
26 |     def get_new_url(self):
27 |         if self.cur_wrapper is None:
28 |             for item in self.url_wrapper_list:
29 |                 if self.url_wrapper_list.get(item).has_new_url():
30 |                     self.cur_wrapper = self.url_wrapper_list.get(item)
31 |                     return self.cur_wrapper.name, self.cur_wrapper.get_new_url()
32 |             self.cur_wrapper = None
33 |         else:
34 |             temp_url = self.cur_wrapper.get_new_url()
35 |             if temp_url is not None:
36 |                 return self.cur_wrapper.name, temp_url
37 |             else:
38 |                 self.cur_wrapper = None
39 |         return None, None
40 | 
41 |     def add_new_urls(self, urls):
42 |         if urls is None or len(urls) == 0:
43 |             return
44 |         for url in urls:
45 |             self.add_new_url(url)
46 | 
47 |     def add_new_url_in_wrapper(self, name, url):
48 |         if url is None:
49 |             return
50 |         temp_result = self.url_wrapper_list.get(name)
51 |         if temp_result is None:
52 |             wrapper_item = UrlWrapper(name)
53 |             wrapper_item.add_new_url(url)
54 |             self.url_wrapper_list[name] = wrapper_item
55 |         else:
56 |             self.url_wrapper_list.get(name).add_new_url(url)
57 | 
58 | 
59 | class UrlWrapper(object):
60 |     def __init__(self, name):
61 |         self.name = name
62 |         self.new_urls = set()
63 |         self.old_urls = set()
64 | 
65 |     def add_new_url(self, url):
66 |         if url is None or len(url) == 0:
67 |             return
68 |         if url not in self.new_urls and url not in self.old_urls:
69 |             self.new_urls.add(url)
70 | 
71 |     def get_new_url(self):
72 |         new_url = self.new_urls.pop()
73 |         self.old_urls.add(new_url)
74 |         return new_url
75 | 
76 |     def has_new_url(self):
77 |         return len(self.new_urls) != 0
--------------------------------------------------------------------------------
/crawlspider/user_agents.py:
--------------------------------------------------------------------------------
1 | # encoding=utf-8
2 | agents = [
3 | "Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
4 | "Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)",
5 | "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5",
6 | "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9",
7 | "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7",
8 | "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14",
9 | "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14",
10 | "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20",
11 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27",
12 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1",
13 | "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2",
14 | "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7",
15 | "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre",
16 | "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10",
17 | "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)",
18 | "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5",
19 | "Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)",
20 | "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
21 | "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
22 | "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
23 | "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2",
24 | "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",
25 | "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre",
26 | "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0 )",
27 | "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)",
28 | "Mozilla/5.0 (Windows; U; Windows XP) Gecko MultiZilla/1.6.1.0a",
29 | "Mozilla/2.02E (Win95; U)",
30 | "Mozilla/3.01Gold (Win95; I)",
31 | "Mozilla/4.8 [en] (Windows NT 5.1; U)",
32 | "Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko Netscape/7.1 (ax)",
33 | "HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
34 | "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0",
35 | "Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) AppleWebkit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
36 | "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
37 | "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
38 | "Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
39 | "Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
40 | "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
41 | "Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
42 | "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
43 | "Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
44 | "Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3",
45 | "Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
46 | "Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
47 | "Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1",
48 | "Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
49 | "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
50 | "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
51 | "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
52 | "Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
53 | "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
54 | "Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522 (KHTML, like Gecko) Safari/419.3",
55 | "Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
56 | "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
57 | "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17",
58 | "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
59 | "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
60 | "Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
61 | "Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
62 | "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2",
63 | "Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
64 | "Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1",
65 | ]
66 |
--------------------------------------------------------------------------------
/output/*技术讨论区*.html:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/output/index.html:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/output/不骑马的亚洲的电影.html:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/output/大家都说中文的电影.html:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/output/欧美电影.html:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/output/纪念达盖尔的板块.html:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/output/骑马的亚洲的电影.html:
--------------------------------------------------------------------------------
1 |
--------------------------------------------------------------------------------
/quick_start.py:
--------------------------------------------------------------------------------
1 | # encoding: utf-8
2 | from config_utils.configreader import ConfigParams
3 | from crawlspider.spider import Spider
4 |
5 |
6 | def main():
7 |     config = ConfigParams('config.json')
8 |     url_root, file_root, file_url, max_pages, block_info = config.get_1024_config()
9 |     spider = Spider(url_root, file_root, file_url, max_pages, block_info)
10 |     spider.crawl_1024()
11 | 
12 | 
13 | if __name__ == '__main__':
14 |     main()
15 |
--------------------------------------------------------------------------------