├── Scrapy ├── tutorial │ ├── tutorial │ │ ├── __init__.py │ │ ├── spiders │ │ │ └── __init__.py │ │ ├── pipelines.py.tmpl │ │ ├── items.py.tmpl │ │ └── settings.py.tmpl │ └── scrapy.cfg ├── 00-Scrapy安装.txt ├── 02-Scrapy创建项目.txt └── 01-Scrapy安装失败解决方案.txt ├── __pycache__ └── ChromeCookies.cpython-34.pyc ├── 17-Phantomjs.py ├── .gitattributes ├── 13-CookieDeciphering.py ├── .gitignore ├── 08-IdentifyingCode.py ├── 09-downPicture.py ├── 14-ChromePassword.py ├── 24-FilesDownload.py ├── README.md ├── 01-URL.py ├── 07-BaiduLenovo.py ├── ChromeCookies.py ├── 23-C315Check.py ├── 16-selenium.py ├── 18-WeiboAnalbum.py ├── 12-ChromeCookie1.py ├── 12-ChromeCookie2.py ├── 10-zhihuLogin.py ├── 15-ZhihuAnswerList.py ├── 02-BFS.py ├── 05-tieba.py ├── 06-JDprice.py ├── 21-DoubanMovieTypeTop.py ├── 04-Login.py ├── 20-DoubanMovieTop250.py ├── 22-PyQuery.py ├── 11-CSDNBlogList.py ├── 03-Chrome.py └── 19-BeautifulSoup.py /Scrapy/tutorial/tutorial/__init__.py: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /Scrapy/00-Scrapy安装.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Jueee/PythonWebCrawlers/HEAD/Scrapy/00-Scrapy安装.txt -------------------------------------------------------------------------------- /Scrapy/02-Scrapy创建项目.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Jueee/PythonWebCrawlers/HEAD/Scrapy/02-Scrapy创建项目.txt -------------------------------------------------------------------------------- /Scrapy/01-Scrapy安装失败解决方案.txt: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Jueee/PythonWebCrawlers/HEAD/Scrapy/01-Scrapy安装失败解决方案.txt -------------------------------------------------------------------------------- /__pycache__/ChromeCookies.cpython-34.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Jueee/PythonWebCrawlers/HEAD/__pycache__/ChromeCookies.cpython-34.pyc -------------------------------------------------------------------------------- /Scrapy/tutorial/tutorial/spiders/__init__.py: -------------------------------------------------------------------------------- 1 | # This package will contain the spiders of your Scrapy project 2 | # 3 | # Please refer to the documentation for information on how to create and manage 4 | # your spiders. 
5 | -------------------------------------------------------------------------------- /17-Phantomjs.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | ''' 4 | 动态爬虫工具 Phantomjs 的安装与使用(通过JS渲染实现)。 5 | ''' 6 | ''' 7 | Phantomjs 安装:到PhantomJS的官方网站上下载,然后放到python的安装目录。 8 | 官网地址:http://phantomjs.org/download.html 9 | ''' 10 | from tornado_fetcher import Fetcher -------------------------------------------------------------------------------- /Scrapy/tutorial/scrapy.cfg: -------------------------------------------------------------------------------- 1 | # Automatically created by: scrapy startproject 2 | # 3 | # For more information about the [deploy] section see: 4 | # https://scrapyd.readthedocs.org/en/latest/deploy.html 5 | 6 | [settings] 7 | default = ${project_name}.settings 8 | 9 | [deploy] 10 | #url = http://localhost:6800/ 11 | project = ${project_name} 12 | -------------------------------------------------------------------------------- /Scrapy/tutorial/tutorial/pipelines.py.tmpl: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Define your item pipelines here 4 | # 5 | # Don't forget to add your pipeline to the ITEM_PIPELINES setting 6 | # See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html 7 | 8 | 9 | class ${ProjectName}Pipeline(object): 10 | def process_item(self, item, spider): 11 | return item 12 | -------------------------------------------------------------------------------- /Scrapy/tutorial/tutorial/items.py.tmpl: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | # Define here the models for your scraped items 4 | # 5 | # See documentation in: 6 | # http://doc.scrapy.org/en/latest/topics/items.html 7 | 8 | import scrapy 9 | 10 | 11 | class ${ProjectName}Item(scrapy.Item): 12 | # define the fields for your item here like: 13 | # name = scrapy.Field() 14 | pass 15 | -------------------------------------------------------------------------------- /.gitattributes: -------------------------------------------------------------------------------- 1 | # Auto detect text files and perform LF normalization 2 | * text=auto 3 | 4 | # Custom for Visual Studio 5 | *.cs diff=csharp 6 | 7 | # Standard to msysgit 8 | *.doc diff=astextplain 9 | *.DOC diff=astextplain 10 | *.docx diff=astextplain 11 | *.DOCX diff=astextplain 12 | *.dot diff=astextplain 13 | *.DOT diff=astextplain 14 | *.pdf diff=astextplain 15 | *.PDF diff=astextplain 16 | *.rtf diff=astextplain 17 | *.RTF diff=astextplain 18 | -------------------------------------------------------------------------------- /13-CookieDeciphering.py: -------------------------------------------------------------------------------- 1 | ''' 2 | Chrome 33+浏览器 Cookies encrypted_value解密脚本 3 | ''' 4 | ''' 5 | Chrome浏览器版本33以上对Cookies进行了加密. 
6 | 7 | 用SQLite Developer打开Chrome的Cookies文件就会发现,原来的value字段已经为空,取而代之的是加密的encrypted_value。 8 | ''' 9 | import sqlite3 10 | import win32crypt 11 | import os 12 | 13 | cookie_file_path = os.path.join(os.environ['LOCALAPPDATA'],r'Google\Chrome\User Data\Default\Cookies') 14 | print('Cookies文件的地址为:%s' % cookie_file_path) 15 | if not os.path.exists(cookie_file_path): 16 | raise Exception('Cookies 文件不存在...') 17 | sql_exe="select host_key,name,value,path,encrypted_value from cookies"; 18 | conn = sqlite3.connect(cookie_file_path) 19 | for row in conn.execute(sql_exe): 20 | ret = win32crypt.CryptUnprotectData(row[4], None, None, None, 0) 21 | print('Cookie的Key:%-40s,Cookie名:%-50s,Cookie值:%s' % (row[0],row[1],ret[1].decode())) 22 | conn.close() 23 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | /result/* 2 | 3 | # Windows image file caches 4 | Thumbs.db 5 | ehthumbs.db 6 | 7 | # Folder config file 8 | Desktop.ini 9 | 10 | # Recycle Bin used on file shares 11 | $RECYCLE.BIN/ 12 | 13 | # Windows Installer files 14 | *.cab 15 | *.msi 16 | *.msm 17 | *.msp 18 | 19 | # Windows shortcuts 20 | *.lnk 21 | 22 | # ========================= 23 | # Operating System Files 24 | # ========================= 25 | 26 | # OSX 27 | # ========================= 28 | 29 | .DS_Store 30 | .AppleDouble 31 | .LSOverride 32 | 33 | # Thumbnails 34 | ._* 35 | 36 | # Files that might appear in the root of a volume 37 | .DocumentRevisions-V100 38 | .fseventsd 39 | .Spotlight-V100 40 | .TemporaryItems 41 | .Trashes 42 | .VolumeIcon.icns 43 | 44 | # Directories potentially created on remote AFP share 45 | .AppleDB 46 | .AppleDesktop 47 | Network Trash Folder 48 | Temporary Items 49 | .apdisk 50 | -------------------------------------------------------------------------------- /08-IdentifyingCode.py: -------------------------------------------------------------------------------- 1 | from PIL import Image 2 | 3 | image_name = r'result\08-IdentifyingCode\test1.jpg' 4 | 5 | sx = 20 6 | sy = 16 7 | ex = 8 8 | ey = 10 9 | st = 20 10 | 11 | def gc(a): 12 | if a>180: 13 | return 0 14 | else: 15 | return 1 16 | 17 | def disp(im): 18 | sizex, sizey = im.size 19 | tz = [] 20 | for y in range(sizey): 21 | t = [] 22 | for x in range(sizex): 23 | t.append(gc(im.getpixel((x,y)))) 24 | tz.append(t) 25 | for i in tz: 26 | print('') 27 | for l in i: 28 | print(l, sep='', end='') 29 | return tz 30 | 31 | im = Image.open(image_name) 32 | im = im.convert('L') 33 | 34 | im_new = [] 35 | for i in range(5): 36 | im1 = im.crop((sx+(i*st),sy,sx+ex+(i+st),sy+ey)) 37 | im_new.append(im1) 38 | 39 | for i in im_new: 40 | disp(i) 41 | print('') 42 | 43 | #input('') -------------------------------------------------------------------------------- /09-downPicture.py: -------------------------------------------------------------------------------- 1 | ''' 2 | 爬取某个网页上的所有图片资源 3 | 4 | ''' 5 | import urllib.request 6 | import socket 7 | import re 8 | import sys 9 | import os 10 | 11 | targetDir = r'result\09-downPicture' #文件保存路径 12 | 13 | def destFile(path): 14 | if not os.path.isdir(targetDir): 15 | os.mkdir(targetDir) 16 | pos = path.rindex('/') 17 | t = os.path.join(targetDir, path[pos+1:]) 18 | return t 19 | 20 | def downPicture(weburl): 21 | webheaders = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'} 22 | req = urllib.request.Request(url=weburl, headers=webheaders) #构造请求报头 23 | webpage = 
urllib.request.urlopen(req) #发送请求报头 24 | contentBytes = webpage.read() 25 | for link,t in set(re.findall(r'(http:[^\s]*?(jpg|png|gif))', str(contentBytes))): #正则表达式查找所有的图片 26 | print(link) 27 | try: 28 | urllib.request.urlretrieve(link, destFile(link)) #下载图片 29 | except : 30 | print('失败') 31 | 32 | if __name__ == '__main__': 33 | weburl = 'http://www.douban.com/' 34 | downPicture(weburl) -------------------------------------------------------------------------------- /14-ChromePassword.py: -------------------------------------------------------------------------------- 1 | ''' 2 | 获取Chrome浏览器已保存的账号和密码。 3 | ''' 4 | # Chrome浏览器已保存的密码都保存在一个sqlite3数据库文件中,和Cookies数据库在同一个文件夹. 5 | # C:\Users\Jueee\AppData\Local\Google\Chrome\User Data\Default\Login Data 6 | ''' 7 | 使用CryptUnprotectData函数解密数据库中的密码字段,即可还原密码,只需要User权限,并且只能是User权限。 8 | ''' 9 | ''' 10 | 为了防止出现读写出错,建议先把数据库临时拷贝到当前目录。 11 | ''' 12 | import os,sys 13 | import shutil 14 | import sqlite3 15 | import win32crypt 16 | 17 | db_file_path = os.path.join(os.environ['LOCALAPPDATA'],r'Google\Chrome\User Data\Default\Login Data') 18 | print(db_file_path) 19 | 20 | tmp_file = os.path.join(os.path.dirname(sys.executable),'tmp_tmp_tmp') 21 | print(tmp_file) 22 | if os.path.exists(tmp_file): 23 | os.remove(tmp_file) 24 | shutil.copyfile(db_file_path,tmp_file) 25 | 26 | conn = sqlite3.connect(tmp_file) 27 | for row in conn.execute('select signon_realm,username_value,password_value from logins'): 28 | try: 29 | ret = win32crypt.CryptUnprotectData(row[2],None,None,None,0) 30 | print('网站:%-50s,用户名:%-20s,密码:%s' % (row[0][:50],row[1],ret[1].decode('gbk'))) 31 | except Exception as e: 32 | print('获取Chrome密码失败...') 33 | raise e 34 | conn.close() 35 | os.remove(tmp_file) 36 | -------------------------------------------------------------------------------- /24-FilesDownload.py: -------------------------------------------------------------------------------- 1 | ''' 2 | 爬取文件集。 3 | ''' 4 | import requests 5 | import re,time,os 6 | 7 | USER_NAMBER = 'yunzhan365' # 子路径 8 | targetDir = 'result\\24-FilesDownload.py\\'+USER_NAMBER #文件保存路径 9 | 10 | # 获取保存路径 11 | def destFile(path,name=''): 12 | if not os.path.isdir(targetDir): 13 | os.makedirs(targetDir) 14 | pos = path.rindex('/') 15 | pom = path.rindex('.') 16 | if name=='': 17 | t = os.path.join(targetDir, path[pos+1:]) 18 | else: 19 | t = os.path.join(targetDir, name + '.' 
+ path[pom+1:]) 20 | return t 21 | 22 | # 保存图片 23 | def saveImage(imgUrl,name=''): 24 | response = requests.get(imgUrl, stream=True) 25 | image = response.content 26 | imgPath = destFile(imgUrl,name) 27 | try: 28 | with open(imgPath ,"wb") as jpg: 29 | jpg.write(image) 30 | print('保存图片成功!%s' % imgPath) 31 | return 32 | except IOError: 33 | print('保存图片成功!%s' % imgUrl) 34 | return 35 | finally: 36 | jpg.close 37 | 38 | if __name__=='__main__': 39 | for n in range(1,99): 40 | album_url = 'https://book.yunzhan365.com/pcqz/stgm/files/mobile/'+str(n)+'.jpg' 41 | saveImage(album_url, str(n).zfill(4)) 42 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 05-WebCrawlers 2 | 网络爬虫(Web Crawlers)学习笔记。 3 | 4 | ---------- 5 | 6 | ### 内容说明: 7 | #### 1、Scrapy文件夹: 8 | web抓取框架Scrapy学习笔记。 9 | 10 | #### 2、其他: 11 | + 01-URL.py:用Python抓取指定URL页面。 12 | + 02-BFS.py:使用队列来实现爬虫的广度优先搜索(BFS)算法。 13 | + 03-Chrome.py:伪装浏览器来访问网站。 14 | + 04-Login.py:模拟用户登录(以登录 CSDN 网站为例)。 15 | + 05-tieba.py:爬取百度贴吧的HTML网页到本地。 16 | + 06-JDprice.py:爬虫获取京东的商品价格,并把爬取结果保存至Excel。 17 | + 07-BaiduLenovo.py:百度搜索框联想词的获取。 18 | + 08-IdentifyingCode.py:读取验证码图片。 19 | + 09-downPicture.py:爬取某个网页上的所有图片资源。 20 | + 10-zhihuLogin.py:知乎网的登录。 21 | + 11-CSDNBlogList.py:根据用户名,获取该用户的CSDN的博客列表。 22 | + 12-ChromeCookie.py:在Python中使用Chrome浏览器已有的Cookies发起HTTP请求。 23 | + 13-CookieDeciphering.py:Chrome 33+浏览器 Cookies encrypted_value 解密。 24 | + 14-ChromePassword.py:获取Chrome浏览器已保存的账号和密码。 25 | + 15-ZhihuAnswerList.py:获取某个用户的知乎回答列表及赞同数(静态网页爬虫)。 26 | + 16-selenium.py:动态爬虫工具 selenium 的安装与使用(通过控制浏览器实现)。 27 | + 17-Phantomjs.py:动态爬虫工具 Phantomjs 的安装与使用(通过JS渲染实现)。 28 | + 18-WeiboAnalbum.py:爬取新浪微博某个用户的头像相册(通过分析API JSON)。 29 | + 19-BeautifulSoup.py:Beautiful Soup 学习笔记(python3中的爬虫匹配神器)。 30 | + 20-DoubanMovieTop250.py:爬取豆瓣评分最高的250部电影(使用Beautiful Soup)。 31 | + 21-DoubanMovieTypeTop.py:按类别爬取豆瓣评分最高的电影(使用Beautiful Soup)。 32 | + 22-PyQuery.py:Python中PyQuery库的使用总结。 33 | + 23-C315Check.py:根据物流防伪码,查询所购商品是否正品。 34 | -------------------------------------------------------------------------------- /01-URL.py: -------------------------------------------------------------------------------- 1 | ''' 2 | 用Python抓取指定页面 3 | ''' 4 | #encoding:UTF-8 5 | import urllib.request 6 | 7 | 8 | ''' 9 | urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False) 10 | 11 | urlopen()函数返回一个 http.client.HTTPResponse 对象 12 | ''' 13 | # 用Python抓取指定页面的源码 14 | if __name__ != '__main__': 15 | url = 'http://www.baidu.com' 16 | data = urllib.request.urlopen(url) 17 | print(data) 18 | print(data.info()) 19 | print(type(data)) 20 | print(data.geturl()) 21 | print(data.getcode()) 22 | print(data.read()) 23 | 24 | # 获取页码状态和源码 25 | if __name__ != '__main__': 26 | url = 'http://www.douban.com/' 27 | req = urllib.request.Request(url) 28 | req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25') 29 | 30 | with urllib.request.urlopen(req) as f: 31 | print('status:',f.status,f.reason) 32 | for k,v in f.getheaders(): 33 | print('%s:%s' % (k,v)) 34 | print(f.read().decode('utf-8')) 35 | 36 | # 用Python简单处理URL 37 | ''' 38 | data是一个字典, 然后通过urllib.parse.urlencode()来将data转换为 ‘word=Jecvay+Notes’的字符串, 最后和url合并为full_url, 39 | ''' 40 | if __name__=='__main__': 41 | data = {} 42 | data['wd'] = 'ju eee' 43 | url_values = urllib.parse.urlencode(data) 44 | url = 
'http://www.baidu.com/s' 45 | full_url = url + url_values 46 | print(full_url) 47 | 48 | data = urllib.request.urlopen(full_url).read() 49 | print(data) 50 | -------------------------------------------------------------------------------- /07-BaiduLenovo.py: -------------------------------------------------------------------------------- 1 | ''' 2 | 百度搜索框理想词的获取 3 | ''' 4 | 5 | import urllib.request 6 | import re 7 | 8 | def get_baidu_lenovo(codeStr): 9 | pass 10 | # urllib的quote()方法控制对特殊字符的URL编码 11 | # 如将"百度"编码为"%E7%99%BE%E5%BA%A6" 12 | gjc = urllib.request.quote(codeStr) 13 | url = 'http://suggestion.baidu.com/su?wd=' + gjc 14 | headers = { 15 | 'User-Agent':'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6', 16 | 'Accept' : '*/*', 17 | 'Connection' : 'Keep-Alive' 18 | } 19 | 20 | req = urllib.request.Request(url, headers=headers) 21 | html = urllib.request.urlopen(req).read().decode('gbk') 22 | lenovoStr = re.search(r's:\[(.*?)\]', html).group(1) 23 | print('“%s”的理想词为:%s' % (codeStr, lenovoStr)) 24 | 25 | codelist = ['百度','谷歌','GitHub','老罗','韩寒','%'] 26 | for i in codelist: 27 | get_baidu_lenovo(i) 28 | ''' 29 | 运行结果为: 30 | 31 | “百度”的理想词为:"百度云","百度翻译","百度地图","百度杀毒","百度卫士","百度音乐","百度网盘","百度文库","百度糯米","百度外卖" 32 | “谷歌”的理想词为:"谷歌翻译","谷歌地图","谷歌浏览器","谷歌地球","谷歌学术","谷歌僵尸地图","谷歌浏览器官方下载","谷歌地图高清卫星地图","谷歌邮箱","谷歌搜索" 33 | “GitHub”的理想词为:"github for windows","github 教程","github desktop","github 下载","github中文网","github使用教程","github for mac","github desktop 教程","github是什么","github删除repository" 34 | “老罗”的理想词为:"老罗语录","老罗的android之旅","老罗英语培训","老罗android视频教程","老罗三句名言","老罗斯福","老罗android开发视频教程","老罗英语培训网站","老罗微博","老罗android" 35 | “韩寒”的理想词为:"韩寒女儿","韩寒后会无期","韩寒 对话","韩寒吧","韩寒现象","韩寒经典语录","韩寒餐厅被罚","韩寒 白龙马","韩寒电影","韩寒博客" 36 | “%”的理想词为:"%s","%d","%2c","%g","%a","%x","%windir%","%f","%20","%u" 37 | ''' 38 | -------------------------------------------------------------------------------- /ChromeCookies.py: -------------------------------------------------------------------------------- 1 | ''' 2 | 在Python中使用Chrome浏览器已有的Cookies发起HTTP请求。 3 | 4 | 参考博客:http://blog.csdn.net/pipisorry/article/details/47980653 5 | ''' 6 | import subprocess 7 | import sqlite3 8 | import win32crypt 9 | import re,os 10 | import requests 11 | 12 | def get_chrome_cookies(url): 13 | DIST_COOKIE_FILENAME = '.\python-chrome-cookies' 14 | SOUR_COOKIE_FILENAME = os.path.join(os.environ['LOCALAPPDATA'],r'Google\Chrome\User Data\Default\Cookies') 15 | print(SOUR_COOKIE_FILENAME) 16 | if not os.path.exists(SOUR_COOKIE_FILENAME): 17 | raise Exception('Cookies 文件不存在...') 18 | subprocess.call(['copy', SOUR_COOKIE_FILENAME, DIST_COOKIE_FILENAME], shell=True) 19 | conn = sqlite3.connect(".\python-chrome-cookies") 20 | ret_dict = {} 21 | for row in conn.execute("SELECT host_key, name, path, value, encrypted_value FROM cookies"): 22 | if __name__=='__main__': 23 | print(row[0],row[1]) 24 | if row[0] != url: 25 | continue 26 | ret = win32crypt.CryptUnprotectData(row[4], None, None, None, 0) 27 | ret_dict[row[1]] = ret[1].decode() 28 | conn.close() 29 | subprocess.call(['del', '.\python-chrome-cookies'], shell=True) 30 | return ret_dict 31 | 32 | # 使用方法参考 33 | if __name__=='__main__': 34 | print('------使用requests进行解析访问------') 35 | DOMAIN_NAME = '.zhihu.com' 36 | get_url = r'https://www.zhihu.com/people/jueee/answers' 37 | response = requests.get(get_url, cookies=get_chrome_cookies(DOMAIN_NAME)) 38 | 39 | html_doc = response.text.encode('gbk','ignore').decode('gbk') 40 | print(html_doc) 41 | ### 42 | for match in 
re.finditer(r'<a class="question_link" href="(.*?)">(.*?)</a>', html_doc):  # group(1)=link, group(2)=title; the <a ...> pattern is reconstructed, the original attribute filter was lost
        link = match.group(1)
        title = match.group(2)
        print(link, title)
    ###
-------------------------------------------------------------------------------- /23-C315Check.py: --------------------------------------------------------------------------------
'''
Check whether a purchased product is genuine, given its logistics anti-counterfeiting code.

Anti-counterfeiting traceability and logistics management service system of the
China Association for Quality Inspection: http://www.c315.cn/
'''

from pyquery import PyQuery as pq
import re, random
import urllib.request


# Run the verification query
def get_check(id):
    url = 'http://www.c315.cn/test2.asp?imageField.x=10&imageField.y=8&textfield2=&textfield=' + id
    html = urllib.request.urlopen(url).read().decode('gbk')
    m = re.search(r'<td[^>]*>(.*?)</td>', html)  # result cell; the original <td> attribute filter was not preserved
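
# A hedged sketch (added for illustration, not from the original file): the query
# result could also be parsed with the already-imported PyQuery instead of a raw
# regular expression. The table markup used below is an assumption, not the real
# c315.cn page structure.
def parse_check_result(html):
    doc = pq(html)
    # Collect the text of every table cell and drop the empty ones.
    cells = [pq(td).text().strip() for td in doc('td')]
    return [c for c in cells if c]

# Example with a stand-in snippet:
# parse_check_result('<table><tr><td>Result</td><td>The code is valid</td></tr></table>')
# -> ['Result', 'The code is valid']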
-------------------------------------------------------------------------------- /22-PyQuery.py: --------------------------------------------------------------------------------
'''
A summary of how to use the PyQuery library in Python.
'''
from pyquery import PyQuery as pq


print('---- 3. Getting elements by HTML tag ----')
d = pq("<div><p>test 1</p><p>test 2</p></div>")
print(d('p'))         # <p>test 1</p><p>test 2</p>
print(d('p').html())  # test 1


print('---- 4. eq(index): get the element at the given index ----')
d = pq("<div><p>test 1</p><p>test 2</p></div>")
print(d('p').eq(0))   # <p>test 1</p>
print(d('p').eq(1))   # <p>test 2</p>


print('---- 5. filter(selector): pick matched elements by class name or id ----')
d = pq("<div><p>test 1</p><p class='two'>test 2</p></div>")
print(d('p').filter('.two'))  # <p class="two">test 2</p>


print('---- 6. find(): search nested elements ----')
d = pq("<div><p>test 1</p><p>test 2</p></div>")
print(d('div').find('p'))        # <p>test 1</p><p>test 2</p>
print(d('div').find('p').eq(0))  # <p>test 1</p>


print('---- 7. Getting elements directly by class name or id ----')
d = pq("<div><p id='one'>test 1</p><p class='two'>test 2</p></div>")
print(d('#one').html())  # test 1
print(d('.two').html())  # test 2


print('---- 12. children(selector=None): get the child elements ----')
d = pq("<span><p id='one'>hello</p><p id='two'>world</p></span>")
print(d.children())        # <p id="one">hello</p><p id="two">world</p>
print(d.children('#two'))  # <p id="two">world</p>


print('---- 13. parents(selector=None): get the parent elements ----')
d = pq("<span><p id='one'>hello</p><p id='two'>world</p></span>")
print(d('p').parents())           # <span><p id="one">hello</p><p id="two">world</p></span>
print(d('#one').parents('span'))  # <span><p id="one">hello</p><p id="two">world</p></span>
print(d('#one').parents('p'))     # []


print('---- 14. clone(): return a copy of a node ----')
d = pq("<span><p id='one'>hello</p><p id='two'>world</p></span>")
print(d('#one'))          # <p id="one">hello</p>
print(d('#one').clone())  # <p id="one">hello</p>


print('---- 15. empty(): remove the content of a node ----')
d = pq("<span><p id='one'>hello</p><p id='two'>world</p></span>")
print(d)  # <span><p id="one">hello</p><p id="two">world</p></span>
d('#one').empty()
print(d)  # <span><p id="one"></p><p id="two">world</p></span>


print('---- 16. nextAll(selector=None): return all following sibling elements ----')
d = pq("<p id='one'>hello</p><p id='two'>world</p>")
print(d('p#one').nextAll())  # <p id="two">world</p>


print('---- 17. not_(selector): return elements that do not match the selector ----')
d = pq("<p id='one'>hello</p><p id='two'>world</p>")
print(d('p').not_('#two'))  # <p id="one">hello</p>
115 | 116 | 117 | 118 | ''' 119 | 爬取豆瓣电影页面中主演 120 | ''' 121 | if __name__ == '__main__': 122 | print('----爬取豆瓣电影页面中主演----') 123 | # 读取Batman Begins页面 124 | doc = pq(url='http://movie.douban.com/subject/3077412/') 125 | # 遍历starring节点 126 | starring = doc("a[rel='v:starring']") 127 | # 转化为Map 128 | stars = starring.map(lambda i,e:pq(e).text()) 129 | print('<<%s>>的主演:' % (doc("span[property='v:itemreviewed']").text())) 130 | for i in stars: 131 | print(i) 132 | ''' 133 | 执行结果: 134 | 135 | ----爬取豆瓣电影页面中主演---- 136 | <<寻龙诀>>的主演: 137 | 陈坤 138 | 黄渤 139 | 舒淇 140 | 杨颖 141 | 夏雨 142 | 刘晓庆 143 | 颜卓灵 144 | 曹操 145 | 张东 146 | 黄西 147 | 僧格仁钦 148 | ''' -------------------------------------------------------------------------------- /11-CSDNBlogList.py: -------------------------------------------------------------------------------- 1 | ''' 2 | 根据用户名,获取该用户的CSDN的博客列表 3 | ''' 4 | 5 | import urllib.request 6 | import re 7 | 8 | CSDN_URL = 'http://blog.csdn.net' 9 | 10 | # 获取主页网址 11 | def get_blog_url(bloger): 12 | return 'http://blog.csdn.net/'+bloger+'/article/list' 13 | 14 | # 根据网址获取HTML 15 | def get_blog_html(url): 16 | headers = { 17 | 'Connection': 'Keep-Alive', 18 | 'Accept': 'text/html, application/xhtml+xml, */*', 19 | 'Accept-Language': 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3', 20 | 'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko' 21 | } 22 | req = urllib.request.Request(url, headers = headers) 23 | html = urllib.request.urlopen(req).read().decode() 24 | return html 25 | 26 | # 获取页码 27 | def get_page_num(html): 28 | page = re.search(r'共(\d+)页', html).group(1) 29 | return page 30 | 31 | # 获取博客发表时间 32 | def get_blog_time(html): 33 | blogtime = re.search(r'(.*?)', html).group(1) 34 | return blogtime 35 | 36 | # 获取博客阅读次数 37 | def get_blog_reader(html): 38 | reader = re.search(r'(.*?)', html).group(1) 39 | return reader 40 | 41 | # 获取页面列表 42 | def get_blog_list(html): 43 | content = re.search(r'\s*([^\s]*)\s*', html) 44 | for match in re.finditer(r'\s*([^\s]*)\s*', html): 45 | link = match.group(1) 46 | title = match.group(2) 47 | blogurl = CSDN_URL + link 48 | bloghtml = get_blog_html(blogurl) 49 | blogtime = get_blog_time(bloghtml) 50 | blogreader = get_blog_reader(bloghtml) 51 | print('%s %s %+10s %-50s' % (link,blogtime,blogreader,title)) 52 | 53 | if __name__ == '__main__': 54 | bloger = 'oYunTaoLianWu' 55 | blogurl = get_blog_url(bloger) 56 | html = get_blog_html(blogurl) 57 | page = get_page_num(html) 58 | for x in range(int(page)): 59 | pageurl = blogurl + '/' + str(x+1) 60 | print('第%s页的博客目录如下(%s):' % (x+1,pageurl)) 61 | html = get_blog_html(pageurl) 62 | get_blog_list(html) 63 | 64 | 65 | ''' 66 | 运行结果: 67 | 68 | 第1页的博客目录如下(http://blog.csdn.net/oYunTaoLianWu/article/list/1): 69 | /jueblog/article/details/33700479 2014-06-23 09:07 916人阅读 notepad++列块编辑操作 70 | /jueblog/article/details/26486821 2014-05-21 17:17 891人阅读 【Chrome】Chrome插件开发(一)插件的简单实现 71 | /jueblog/article/details/17465225 2013-12-21 13:40 3137人阅读 【Java】实现按中文首字母排序 72 | /jueblog/article/details/16972635 2013-11-26 22:04 7031人阅读 【实用技术】WIN7系统下U盘安装了ubuntu13.04双系统 73 | /jueblog/article/details/16103925 2013-11-13 22:04 1171人阅读 【Android】Android蓝牙开发深入解析 74 | /jueblog/article/details/15013635 2013-11-09 23:27 2038人阅读 【Android】App自动更新之通知栏下载 75 | /jueblog/article/details/14600521 2013-11-08 23:27 2491人阅读 【Android】网络图片加载优化(一)利用弱引用缓存异步加载 76 | /jueblog/article/details/14497181 2013-11-07 22:43 8588人阅读 【Android】第三方QQ账号登录的实现 77 | /jueblog/article/details/13434551 2013-10-29 00:26 6374人阅读 【Java】内部类与外部类的互访使用小结 78 
| /jueblog/article/details/13164349 2013-10-27 02:08 8249人阅读 【Android】PULL解析XML文件 79 | /jueblog/article/details/12985045 2013-10-24 01:06 16847人阅读 【Android】Web开发之使用WebView控件展示Web页面 80 | 第2页的博客目录如下(http://blog.csdn.net/oYunTaoLianWu/article/list/2): 81 | /jueblog/article/details/12984417 2013-10-24 00:50 1396人阅读 【Android】Wifi管理与应用 82 | /jueblog/article/details/12983821 2013-10-24 00:38 1670人阅读 【Android】Web开发之通知栏下载更新APP 83 | /jueblog/article/details/12958737 2013-10-23 00:54 2754人阅读 【Android】Web开发之显示网络图片的两种方法 84 | /jueblog/article/details/12958159 2013-10-23 00:40 2658人阅读 【Android】Web开发之通过Apache接口处理Http请求 85 | /jueblog/article/details/12847239 2013-10-18 01:09 3142人阅读 【Android】MediaPlayer使用方法简单介绍 86 | /jueblog/article/details/12806909 2013-10-17 00:29 2334人阅读 【Android】Web开发之通过标准Java接口处理Http请求 87 | /jueblog/article/details/12764325 2013-10-16 01:23 3012人阅读 【Android】Activity与服务Service绑定 88 | /jueblog/article/details/12721651 2013-10-15 00:43 1681人阅读 【Android】利用服务Service创建标题栏通知 89 | /jueblog/article/details/12721555 2013-10-15 00:36 2253人阅读 【Android】利用广播Broadcast接收SMS短信 90 | /jueblog/article/details/12691855 2013-10-14 01:07 6796人阅读 【Android】利用广播BroadCast监听网络的变化 91 | /jueblog/article/details/12668215 2013-10-13 02:34 2909人阅读 【Android】Activity遮罩效果的实现 92 | /jueblog/article/details/12667463 2013-10-13 02:28 12967人阅读 【Android】BroadCast广播机制应用与实例 93 | /jueblog/article/details/12655269 2013-10-12 17:41 1182人阅读 【Android】Handler应用(四):AsyncTask的用法与实例 94 | /jueblog/article/details/12627403 2013-10-12 00:57 3269人阅读 【Android】Handler应用(三):从服务器端分页加载更新ListView 95 | ''' -------------------------------------------------------------------------------- /03-Chrome.py: -------------------------------------------------------------------------------- 1 | ''' 2 | http://www.yiibai.com/python/python3-webbug-series3.html 3 | ''' 4 | ''' 5 | 上一次我自学爬虫的时候, 写了一个简陋的勉强能运行的爬虫alpha. 6 | alpha版有很多问题: 7 | 1、比如一个网站上不了, 爬虫却一直在等待连接返回response, 不知道超时跳过; 8 | 2、或者有的网站专门拦截爬虫程序, 我们的爬虫也不会伪装自己成为浏览器正规部队; 9 | 3、并且抓取的内容没有保存到本地, 没有什么作用 10 | ''' 11 | 12 | import re 13 | import urllib.request 14 | import http.cookiejar 15 | import urllib 16 | from collections import deque 17 | import datetime 18 | 19 | 20 | ''' 21 | 添加超时跳过功能 22 | 23 | 首先, 我简单地将 24 | urlop = urllib.request.urlopen(url) 25 | 改为 26 | urlop = urllib.request.urlopen(url, timeout = 2) 27 | 28 | 运行后发现, 当发生超时, 程序因为exception中断 29 | 30 | 于是我把这一句也放在try .. except 结构里, 问题解决. 31 | ''' 32 | 33 | ''' 34 | 支持自动跳转 35 | 36 | 在爬 http://baidu.com 的时候, 爬回来一个没有什么内容的东西, 这个东西告诉我们应该跳转到 http://www.baidu.com . 37 | 但是我们的爬虫并不支持自动跳转, 现在我们来加上这个功能, 让爬虫在爬 baidu.com 的时候能够抓取 www.baidu.com 的内容. 38 | 39 | 首先我们要知道爬 http://baidu.com 的时候他返回的页面是怎么样的, 这个我们既可以用 Fiddler 看, 也可以写一个小爬虫来抓取. 40 | b'\n 41 | \n 42 | \n' 43 | 利用 html 的 meta 来刷新与重定向的代码, 其中的0是等待0秒后跳转, 也就是立即跳转. 44 | ''' 45 | 46 | ''' 47 | 伪装浏览器正规军 48 | 49 | 现在详细研究一下如何让网站们把我们的Python爬虫当成正规的浏览器来访. 50 | 因为如果不这么伪装自己, 有的网站就爬不回来了. 51 | 如果看过理论方面的知识, 就知道我们是要在 GET 的时候将 User-Agent 添加到header里. 52 | 53 | 在 GET 的时候添加 header 有很多方法, 下面介绍两种方法. 
54 | ''' 55 | 56 | ''' 57 | 第一种方法比较简便直接, 但是不好扩展功能, 代码如下: 58 | 59 | ''' 60 | if __name__ != '__main__': 61 | url = 'http://www.baidu.com/' 62 | req = urllib.request.Request(url, headers = { 63 | 'Connection': 'Keep-Alive', 64 | 'Accept': 'text/html, application/xhtml+xml, */*', 65 | 'Accept-Language': 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3', 66 | 'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko' 67 | }) 68 | oper = urllib.request.urlopen(req) 69 | data = oper.read() 70 | print(data) 71 | 72 | ''' 73 | 第二种方法使用了 build_opener 这个方法, 用来自定义 opener, 这种方法的好处是可以方便的拓展功能. 74 | 例如下面的代码就拓展了自动处理 Cookies 的功能. 75 | ''' 76 | if __name__ != '__main__': 77 | # head: dict of header 78 | def makeMyOpener(head = { 79 | 'Connection': 'Keep-Alive', 80 | 'Accept': 'text/html, application/xhtml+xml, */*', 81 | 'Accept-Language': 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3', 82 | 'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko' 83 | }): 84 | cj = http.cookiejar.CookieJar() 85 | opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj)) 86 | header = [] 87 | for key, value in head.items(): 88 | elem = (key, value) 89 | header.append(elem) 90 | opener.addheaders = header 91 | return opener 92 | 93 | oper = makeMyOpener() 94 | uop = oper.open('http://www.baidu.com/', timeout = 1000) 95 | data = uop.read() 96 | print(data) 97 | 98 | 99 | ''' 100 | 101 | ''' 102 | if __name__!='__main__': 103 | 104 | data = urllib.request.urlopen('http://baidu.com').read() 105 | print(data) 106 | 107 | 108 | ''' 109 | 保存抓回来的报文 110 | 111 | Python 的文件操作还是相当方便的. 112 | 我们可以讲抓回来的数据 data 以二进制形式保存, 也可以经过 decode() 处理成为字符串后以文本形式保存. 113 | 改动一下打开文件的方式就能用不同的姿势保存文件了. 114 | 下面是参考代码: 115 | ''' 116 | def get_log(data,fileName='study'): 117 | save_path = 'result\\01-Chrome\\'+fileName+'.txt' 118 | f_obj = open(save_path, 'a+') # wb 表示打开方式 a 表示追加 119 | f_obj.write(data) 120 | f_obj.close() 121 | 122 | 123 | 124 | def get_url_deque(url): 125 | fileName = re.search(r'://(.*)', url).group(1).replace('/','-') 126 | queue = deque() 127 | visited = set() 128 | get_log('开始抓取:%s\n' % url,fileName) 129 | starttime = datetime.datetime.now() 130 | queue.append(url) 131 | cnt = 0 132 | 133 | while queue: 134 | url = queue.popleft() # 队首元素出队 135 | visited |= {url} # 标记为已访问 136 | 137 | get_log('已经抓取:'+str(cnt)+'正在抓取<---'+url + '\n',fileName) 138 | 139 | req = urllib.request.Request(url) 140 | req.add_header('User-Agent', 'Mozilla/6.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/8.0 Mobile/10A5376e Safari/8536.25') 141 | 142 | cnt += 1 143 | try: 144 | urlop = urllib.request.urlopen(req, timeout = 2000) 145 | except: 146 | continue 147 | 148 | # 用getheader()函数来获取抓取到的文件类型, 是html再继续分析其中的链接 149 | if 'html' not in urlop.getheader('Content-Type'): 150 | continue 151 | 152 | # 避免程序异常中止, 用try..catch处理异常 153 | try: 154 | data = urlop.read().decode('utf-8') 155 | except: 156 | continue 157 | 158 | # 正则表达式提取页面中所有队列, 并判断是否已经访问过, 然后加入待爬队列 159 | linkre = re.compile('href=\"(.+?)\"') 160 | for x in linkre.findall(data): 161 | if 'http' in x and x not in visited: 162 | queue.append(x) 163 | get_log('加入队列 ---> ' + x + '\n',fileName) 164 | endtime = datetime.datetime.now() 165 | get_log('抓取完毕!共耗时:%s.seconds \n' % (endtime - starttime),fileName) 166 | 167 | if __name__=='__main__': 168 | url = 'https://github.com/Jueee/04-LiaoXueFeng' # 入口页面, 可以换成别的 169 | get_url_deque(url) 170 | 171 | 
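
# Added illustration (not part of the original 03-Chrome.py): a compact fetch helper
# that combines the techniques walked through above -- a cookie-aware opener, a
# browser-like User-Agent, a timeout, and decoding by the charset the server declares.
# The fallback encoding and the error handling are assumptions, not the author's choices.
import http.cookiejar
import urllib.request

def fetch(url, timeout=10):
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
    opener.addheaders = [('User-Agent',
                          'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko')]
    try:
        with opener.open(url, timeout=timeout) as resp:
            charset = resp.headers.get_content_charset() or 'utf-8'
            return resp.read().decode(charset, errors='ignore')
    except Exception as e:
        print('fetch failed: %s (%s)' % (url, e))
        return None

# fetch('http://www.baidu.com/')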
-------------------------------------------------------------------------------- /19-BeautifulSoup.py: --------------------------------------------------------------------------------
'''
Beautiful Soup (the go-to matching/parsing tool for crawlers in Python 3).

Further reading: the Beautiful Soup documentation (Chinese edition)
http://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
'''
'''
Beautiful Soup is an HTML/XML parser written in Python. It handles malformed markup well
and builds a parse tree, and it provides simple, commonly used operations for navigating,
searching and modifying that tree, which saves a lot of programming time.
'''
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
# Get a BeautifulSoup object
soup = BeautifulSoup(html)
# Print the document with the standard indented structure
print(soup.prettify())

'''
# Simple ways to navigate the structured data:
'''
print(soup.title)
# <title>The Dormouse's story</title>
print(soup.title.name)
# title
print(soup.title.string)
# The Dormouse's story
print(soup.title.parent.name)
# head
print(soup.p)
# <p class="title"><b>The Dormouse's story</b></p>
print(soup.p['class'])
# ['title']
print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
print(soup.find_all('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
print(soup.find(id='link3'))
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

# Find the links of all <a> tags in the document:
for link in soup.find_all('a'):
    print(link.get('href'))
'''
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
'''

# Get all of the text content of the document:
print(soup.get_text())


'''
Kinds of objects

Beautiful Soup converts a complex HTML document into a tree in which every node is a Python object.
All objects fall into four kinds: Tag, NavigableString, BeautifulSoup, Comment.
'''
'''
Tag
A Tag object corresponds to an XML or HTML tag in the original document:
'''
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
print(type(tag))
# <class 'bs4.element.Tag'>

'''
Name
Every tag has a name, available as .name:
'''
print(tag.name)
# b
# If you change a tag's name, the change is reflected in any HTML markup generated from now on:
tag.name = "blockquote"
print(tag)
# <blockquote class="boldest">Extremely bold</blockquote>
'''
Attributes
A tag may have any number of attributes. The tag <b class="boldest"> has a "class" attribute
whose value is "boldest". Tag attributes are handled just like a dictionary:
'''
print(tag['class'])
# ['boldest']
# Attributes can also be read directly with dotted access, e.g. .attrs:
print(tag.attrs)
# {'class': ['boldest']}

tag['class'] = 'verybold'
tag['id'] = 1
print(tag)
# <blockquote class="verybold" id="1">Extremely bold</blockquote>
del tag['id']
print(tag)
# <blockquote class="verybold">Extremely bold</blockquote>
'''
Multi-valued attributes
'''
css_soup = BeautifulSoup('<p class="body strikeout"></p>')
print(css_soup.p['class'])
# ['body', 'strikeout']
# If an attribute looks like it has more than one value, but it is not defined as a
# multi-valued attribute in any version of HTML, Beautiful Soup returns it as a string
id_soup = BeautifulSoup('<p id="my id"></p>')
print(id_soup.p['id'])
# my id
# When a tag is turned back into a string, the values of a multi-valued attribute are joined
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>')
print(rel_soup.a['rel'])
# ['index']
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)
# <p>Back to the <a rel="index contents">homepage</a></p>

'''
Navigable strings
Strings are usually contained inside tags. Beautiful Soup wraps the strings in a tag
with the NavigableString class:
'''
print(tag.string)
# Extremely bold
type(tag.string)
# <class 'bs4.element.NavigableString'>
# A string inside a tag cannot be edited in place, but it can be replaced with
# another string, using replace_with():
tag.string.replace_with("No longer bold")
print(tag)
# <blockquote class="verybold">No longer bold</blockquote>


print('---------Navigating the document tree---------')

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html)
'''
Navigating the document tree
'''
print(soup.head)
print(soup.title)
# Dotted attribute access only returns the first tag with that name:
print(soup.body.b)
# To get all of the <a> tags:
print(soup.find_all('a'))



# A tag's .contents attribute lists the tag's children:
head_tag = soup.head
print(head_tag.contents)
print(head_tag.contents[0])
print(head_tag.contents[0].contents)
print(head_tag.contents[0].contents[0])
# The .children generator iterates over a tag's children:
for child in head_tag.contents[0]:
    print(child)


# .descendants
print('--------.descendants--------')
# .contents and .children only cover a tag's direct children.
# .descendants iterates recursively over all of a tag's descendants
for child in head_tag.descendants:
    print(child)

# .string
# If a tag has only one child of type NavigableString, that child is available as .string:
print(head_tag.string)
# If a tag contains more than one child, .string cannot decide which child's content to
# return, so .string is None:
print(soup.html.string)




# .strings and stripped_strings
# If a tag contains more than one string, they can be iterated with .strings:
for string in soup.strings:
    print(repr(string))

# The output can contain a lot of extra whitespace and blank lines; .stripped_strings removes them:
# (Note: strings consisting entirely of whitespace are skipped, and leading/trailing whitespace is stripped.)
for string in soup.stripped_strings:
    print(repr(string))

# Use the .parent attribute to get an element's parent node.


--------------------------------------------------------------------------------
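
# Added illustration (not part of the original 19-BeautifulSoup.py, which stops at the
# sentence introducing .parent): a minimal example of .parent and .parents on the same
# "three sisters" soup built above.
link = soup.a
print(link.parent.name)
# p
print(soup.title.parent.name)
# head
# .parents iterates over all ancestors, up to the BeautifulSoup document itself:
for parent in link.parents:
    print(parent.name)
# p
# body
# html
# [document]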