├── .gitattributes ├── LICENSE ├── README.md ├── qq音乐的框架 ├── __pycache__ │ └── music_db.cpython-37.pyc ├── main.py ├── music_db.py ├── sign.js └── test.py ├── 中国疫情地图 └── 地图.py ├── 京东框架 ├── Celery.py ├── client.py ├── main.py ├── shop.csv └── test.py ├── 抖音爬取签名,喜欢列表和关注列表 └── 爬取抖音的用户信息和视频连接 │ ├── __pycache__ │ ├── shujuku.cpython-37.pyc │ └── xgorgon.cpython-37.pyc │ ├── ceshi.py │ ├── douying.py │ ├── main.py │ ├── shujuku.py │ ├── wangyexingxi.py │ ├── xgorgon.py │ └── 抖音里面的xg算法,因为一些原因就不适合开源了。.md ├── 最好大学网 └── daxue.py ├── 爬取快代理构建代理池 └── dailici.py ├── 爬取房天下的框架 ├── __pycache__ │ └── shujuku.cpython-37.pyc ├── main.py ├── shujuku.py ├── test.py ├── 清洗后城市二手房数据.csv └── 爬取房天下的信息 │ ├── code │ ├── FTXSpider.py │ ├── chromedriver.exe │ ├── zufangcitymatch.xlsx │ └── 租房2020-10-10 │ │ ├── 包头2020-10-10房天下租房.xlsx │ │ ├── 北海2020-10-10房天下租房.xlsx │ │ └── 安庆2020-10-10房天下租房.xlsx │ └── main.py ├── 爬取抖音无水印视频 └── douying.py ├── 爬取淘宝商品信息基于selenium框架 ├── taobao.py ├── taobaopachong.py ├── 对数据进行清洗.py └── 爬取商品属性.py ├── 爬取百度文库的doc格式 ├── paqubaiduwenku.py └── 抓取百度文库所有内容 │ ├── paqu-ppt │ └── pdf.py │ ├── paqubaiduwenku.py │ ├── 作文.pptx │ ├── 带GUI的爬取百度文库 │ ├── README.md │ ├── img │ │ ├── 1.png │ │ ├── 2.png │ │ ├── 3.png │ │ ├── 4.png │ │ ├── 5.png │ │ └── 6.png │ ├── requirements.txt │ ├── setup.bat │ ├── src │ │ ├── chromedriver.exe │ │ └── wenku.py │ ├── 代码分析 │ │ ├── Ajax知识点补充.md │ │ ├── JSP知识的小补充.md │ │ └── 爬虫代码解读.md │ └── 爬取百度文库.exe │ └── 泪水作文8篇.pdf ├── 爬取豆瓣图书生成Excel表和词云 ├── axis.png └── main.py ├── 用异步去爬取天猫的商品信息 └── paqutianmao.py ├── 破解千图批量下载图片 └── paqutupian.py ├── 破解有道翻译,做成自己的小字典 └── youdao-pojie.py ├── 获取新冠肺炎实时数据 └── paqufeiyang.py └── 酷狗音乐 └── main.py /.gitattributes: -------------------------------------------------------------------------------- 1 | *.js linguist-language=Python 2 | *.css linguist-language=Python 3 | *.html linguist-language=Python 4 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | MIT License 2 | 3 | Copyright (c) 2020 有猫腻 4 | 5 | Permission is hereby granted, free of charge, to any person obtaining a copy 6 | of this software and associated documentation files (the "Software"), to deal 7 | in the Software without restriction, including without limitation the rights 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 9 | copies of the Software, and to permit persons to whom the Software is 10 | furnished to do so, subject to the following conditions: 11 | 12 | The above copyright notice and this permission notice shall be included in all 13 | copies or substantial portions of the Software. 14 | 15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 21 | SOFTWARE. 
22 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Projects and personal notes 2 | A collection of small, fun projects, each implementing one small feature; feel free to download them and play around. 3 | 4 | A few things to note: 5 | 6 | # 1. To crawl Douyin, first use the Nox emulator (夜神模拟器) to simulate a phone login environment, then capture the traffic with Fiddler; after that, crawling the app works essentially the same way as crawling a web page. 7 | 8 | 9 | 10 | # 2. Baidu Wenku, Qiantu (千图), Fangtianxia (房天下) and JD all involve anti-crawling mechanisms, so you need to know some JavaScript to figure out how to bypass them. 11 | 12 | 13 | 14 | # 3. To follow most of the material here, it is best to learn the basics of web crawling first and then come back to these projects. 15 | 16 | 17 | 18 | # 4. The JD project still has some gaps, because it involves distributed crawling; frankly I have not mastered that part, so it is incomplete and you will have to explore it on your own. 19 | 20 | 21 | 22 | # 5. Looking ahead, it is worth learning Docker and k8s, most of which are written in Go. For crawler engineers, Go is arguably more useful than Java: its C-like syntax makes it fairly approachable for Python developers. 23 | 24 | 25 | 26 | # 6. Redis (in-memory store), MySQL (relational database) and MongoDB (document database) serve different purposes. Crawler engineers mostly use Redis and MySQL, and many job postings require solid Redis skills; using Redis well can greatly increase crawling throughput. 27 | 28 | 29 | 30 | # 7. For JS reverse engineering, first expose an interface in the cracked JS file and call it from Python, since the two languages have to be bridged. 31 | 32 | ```javascript 33 | rsaPassword = function(t){ 34 | var e = new D; 35 | e.setPublic("xxx"); 36 | return e.encrypt(t); 37 | } 38 | function getPwd(pwd){ 39 | return rsaPassword(pwd); 40 | } 41 | //getPwd is the interface: pass in the value to be encrypted and it returns the result to Python 42 | ``` 43 | 44 | 45 | 46 | ```python 47 | #import the JS bridge package 48 | import execjs 49 | #define the helper 50 | def getpwd(password): 51 | #read the js file as utf8 52 | with open("xxx.js",'r',encoding='utf8') as f: 53 | content = f.read() 54 | #compile the JS source 55 | jsdata = execjs.compile(content) 56 | #call the JS function by name and pass in the argument 57 | pw = jsdata.call('getPwd',password) 58 | print('pw:',pw) 59 | return pw 60 | 61 | 62 | if __name__ == '__main__': 63 | getpwd('123456') 64 | 65 | ``` 66 | 67 | This is a boilerplate pattern; follow it and in 99% of cases you can get the value you need. 68 | 69 |
-------------------------------------------------------------------------------- /qq音乐的框架/__pycache__/music_db.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/13060923171/Crawl-Project/320d69b1d9051db6f43836d7d8854fa5af9d56fb/qq音乐的框架/__pycache__/music_db.cpython-37.pyc --------------------------------------------------------------------------------
/qq音乐的框架/main.py: -------------------------------------------------------------------------------- 1 | import requests 2 | from urllib import parse 3 | #导入数学库 4 | import math 5 | #导入数据库 6 | from music_db import SQLsession,Song 7 | import os 8 | #导入多线程,多进程 9 | from concurrent.futures import ThreadPoolExecutor,ProcessPoolExecutor 10 | 11 | headers = { 12 | "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36", 13 | "accept-language": "zh-CN,zh;q=0.9", 14 | "accept-encoding": "gzip, deflate, br", 15 | "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9", 16 | "cache-control": "max-age=600", 17 | "Referer": "https://y.qq.com/portal/singer_list.html", 18 | } 19 | #根据url下载歌曲 20 | def download(song_mid,sing_name): 21 | #定义headers请求头 22 | headers = { 23 | 'cookie': 'pgv_pvid=2128245208; pac_uid=0_6b1c785781d54; pgv_pvi=772980736; RK=8x5lwvVnY1;' 24 | ' ptcz=baca3422f148c8897bd71cb3765e7c08bf0dddc2aac46b34eb2f6b669e38d215;' 25 | ' ptui_loginuin=1766228968@qq.com; ts_refer=www.baidu.com/link; ts_uid=2958609676; ' 26 | 'pgv_si=s6667516928; pgv_info=ssid=s1048782460; player_exist=1; qqmusic_fromtag=66;' 27 | ' userAction=1; yqq_stat=0; _qpsvr_localtk=0.5664181233633159;' 28 | 'psrf_qqunionid=E7D5E8B282E958B5ED555246677BCD41; psrf_qqrefresh_' 29 | 'token=DC32336F11952FA5867192F46CF15FD5; 
tmeLoginType=2; qqmusic_' 30 | 'key=Q_H_L_2Sqn2y50eyOV1i5dcbk613wim45KnxmEK5ofj1RsBgxgHN-xLkK25EjEAQ2jvs1;' 31 | ' psrf_qqopenid=33B542A190FCA799A663FDDCB25EA8F0; qm_' 32 | 'keyst=Q_H_L_2Sqn2y50eyOV1i5dcbk613wim45KnxmEK5ofj1RsBgxgHN-xLkK25EjEAQ2jvs1; ' 33 | 'euin=oK4qNe-l7KvPoz**; psrf_access_token_expiresAt=1601793607; ' 34 | 'psrf_musickey_createtime=1594017607; psrf_qqaccess_token=8838D613ABE40CD4A345D8E550EBB967; ' 35 | 'uin=1598275443; ts_last=y.qq.com/portal/player.html; yplayer_open=1; yq_index=3', 36 | 'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36', 37 | 'referer': 'https://y.qq.com/portal/player.html' 38 | } 39 | #导入data参数并且用parse加入url里面,从而获得不同歌曲的URL达到下载 40 | data = '{"req_0":{"module":"vkey.GetVkeyServer","method":"CgiGetVkey","param":' \ 41 | '{"guid":"2128245208","songmid":["%s"],"songtype":[0],"uin":"1598275443","loginflag"' \ 42 | ':1,"platform":"20"}},"comm":{"uin":1598275443,"format":"json","ct":24,"cv":0}}' % str( 43 | song_mid) 44 | url = 'https://u.y.qq.com/cgi-bin/musicu.fcg?-=getplaysongvkey17693804549459324' \ 45 | '&g_tk=5381&loginUin=3262637034&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8' \ 46 | '¬ice=0&platform=yqq.json&needNewCode=0&data={}'.format(parse.quote(data)) 47 | #去获取这个网页的的json值 48 | vkey = requests.get(url, headers=headers) 49 | #用去定位到purl 50 | purl = vkey.json()['req_0']['data']['midurlinfo'][0]['purl'] 51 | url = 'https://ws.stream.qqmusic.qq.com/' + purl 52 | html = requests.get(url) 53 | filename = 'qq音乐' 54 | #创建一个文件夹,当这个文件不存在的时候自动生成一个文件夹 55 | if not os.path.exists(filename): 56 | os.makedirs(filename) 57 | #通过获取到的URL下载对应的歌曲 58 | with open('./{}/{}.m4a'.format(filename, sing_name), 'wb') as f: 59 | print('\n正在下载{}歌曲.....\n'.format(sing_name)) 60 | #下载并保存这个html的全部内容也就是下载歌曲 61 | f.write(html.content) 62 | 63 | #获取歌手信息 64 | def get_singer_data(mid,singer_name): 65 | params = '{"comm":{"ct":24,"cv":0},"singerSongList":{"method":"GetSingerSongList",' \ 66 | '"param":{"order":1,"singerMid":"%s","begin":0,"num":10},' \ 67 | '"module":"musichall.song_list_server"}}' % str(mid) 68 | 69 | url = 'https://u.y.qq.com/cgi-bin/musicu.fcg?-=getSingerSong9513357793133783&' \ 70 | 'g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8' \ 71 | '¬ice=0&platform=yqq.json&needNewCode=0*&data={}'.format(parse.quote(params)) 72 | #做到中转的作用 73 | html = requests.session() 74 | #用get来获取这个网页的内容,并且转化为json 75 | content = html.get(url, headers=headers).json() 76 | #定位这个歌手总歌曲的数量 77 | songs_num = content['singerSongList']['data']['totalNum'] 78 | #连接数据库 79 | session = SQLsession() 80 | 81 | #因为一个歌手一次性最多只能获取80首歌,所以我们做一个循环 82 | if int(songs_num) <= 80: 83 | params = '{"comm":{"ct":24,"cv":0},"singerSongList":{"method":"GetSingerSongList",' \ 84 | '"param":{"order":1,"singerMid":"%s","begin":0,"num":%s},' \ 85 | '"module":"musichall.song_list_server"}}' % (str(mid), str(songs_num)) 86 | 87 | url = 'https://u.y.qq.com/cgi-bin/musicu.fcg?-=getSingerSong9513357793133783&' \ 88 | 'g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8' \ 89 | '¬ice=0&platform=yqq.json&needNewCode=0*&data={}'.format(parse.quote(params)) 90 | html = requests.session() 91 | content = html.get(url, headers=headers).json() 92 | #开始定位到相对位置 93 | datas = content['singerSongList']['data']['songList'] 94 | for song in datas: 95 | #去获取相应歌曲的名字,mid,歌手名字,歌曲的专辑 96 | song_name = song['songInfo']['name'] 97 | song_ablum = song['songInfo']['album']['name'] 98 | singer_name = singer_name 99 | 
song_mid = song['songInfo']['mid'] 100 | try: 101 | #存入数据库 102 | song = Song( 103 | # 第一个是你数据库的名字,第二个就是存进入的信息 104 | song_name=song_name, 105 | song_ablum=song_ablum, 106 | song_mid=song_mid, 107 | singer_name=singer_name, 108 | ) 109 | session.add(song) 110 | session.commit() 111 | print('commit') 112 | except: 113 | session.rollback() 114 | print('rollback') 115 | print(singer_name,song_name,song_ablum,song_mid) 116 | #获取对应的参数,传入到下载的参数里面 117 | download(song_mid,singer_name) 118 | 119 | else: 120 | for a in range(0, songs_num, 80): 121 | params = '{"comm":{"ct":24,"cv":0},"singerSongList":{"method":"GetSingerSongList",' \ 122 | '"param":{"order":1,"singerMid":"%s","begin":%s,"num":%s},' \ 123 | '"module":"musichall.song_list_server"}}' % (str(mid), int(a), int(songs_num)) 124 | 125 | url = 'https://u.y.qq.com/cgi-bin/musicu.fcg?-=getSingerSong9513357793133783&' \ 126 | 'g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8' \ 127 | '¬ice=0&platform=yqq.json&needNewCode=0*&data={}'.format(parse.quote(params)) 128 | html = requests.session() 129 | content = html.get(url, headers=headers).json() 130 | datas = content['singerSongList']['data']['songList'] 131 | for song in datas: 132 | song_name = song['songInfo']['name'] 133 | song_ablum = song['songInfo']['album']['name'] 134 | singer_name = singer_name 135 | song_mid = song['songInfo']['mid'] 136 | try: 137 | song = Song( 138 | # 第一个是你数据库的名字,第二个就是存进入的信息 139 | song_name=song_name, 140 | song_ablum=song_ablum, 141 | song_mid=song_mid, 142 | singer_name=singer_name, 143 | ) 144 | session.add(song) 145 | session.commit() 146 | print('commit') 147 | except: 148 | session.rollback() 149 | print('rollback') 150 | print(singer_name, song_name, song_ablum, song_mid) 151 | download(song_mid, singer_name) 152 | 153 | #去获取每一页的全部歌手的mid和名字 154 | def get_singer_mid(index): 155 | #index=1---27 156 | data='{"comm":{"ct":24,"cv":0},"singerList":{"module":"Music.SingerListServer"' \ 157 | ',"method":"get_singer_list","param":{"area":-100,"sex":-100,"genre":-100,' \ 158 | '"index":%s,"sin":0,"cur_page":1}}}' % (str(index)) 159 | url='https://u.y.qq.com/cgi-bin/musicu.fcg?-=getUCGI0432880619182503' \ 160 | '&g_tk=571600846&loginUin=0&hostUin=0&format=json&inCharset=utf8&out' \ 161 | 'Charset=utf-8¬ice=0&platform=yqq.json&needNewCode=0' \ 162 | '&data={}'.format(parse.quote(data)) 163 | html = requests.get(url).json() 164 | #总共一共有多少歌手 165 | total = html['singerList']['data']['total'] 166 | #一页只有80个歌手,除以80可以知道每一个字母的总的页数有多少 167 | pages = int(math.floor(int(total) / 80)) 168 | thread_number = pages 169 | Thread=ThreadPoolExecutor(max_workers=thread_number) 170 | #设置一个翻页,这里sin=80为1页 171 | sin = 0 172 | for page in range(1, pages): 173 | data = '{"comm":{"ct":24,"cv":0},"singerList":{"module":"Music.SingerListServer",' \ 174 | '"method":"get_singer_list","param":{"area":-100,"sex":-100,"genre":-100,"' \ 175 | 'index":%s,"sin":%d,"cur_page":%s}}}' % (str(index), sin, str(page)) 176 | 177 | url = 'https://u.y.qq.com/cgi-bin/musicu.fcg?-=getUCGI0432880619182503' \ 178 | '&g_tk=571600846&loginUin=0&hostUin=0&format=json&inCharset=utf8&out' \ 179 | 'Charset=utf-8¬ice=0&platform=yqq.json&needNewCode=0' \ 180 | '&data={}'.format(parse.quote(data)) 181 | html=requests.get(url,headers=headers).json() 182 | sings=html['singerList']['data']['singerlist'] 183 | for sing in sings: 184 | singer_name = sing['singer_name'] 185 | mid = sing['singer_mid'] 186 | Thread.submit(get_singer_data, mid, singer_name) 187 | sin += 80 188 | 189 | 190 | def myProcess(): 191 
| #开5个进程,加快爬取的速度 192 | with ProcessPoolExecutor(max_workers=5) as exe: 193 | #i为一个字母,这里一共有26个字母加一个#号,所以,就写一个循环函数,来爬取全部内容 194 | for i in range(1,28): 195 | exe.submit(get_singer_mid,i) 196 | 197 | 198 | if __name__ == '__main__': 199 | myProcess() -------------------------------------------------------------------------------- /qq音乐的框架/music_db.py: -------------------------------------------------------------------------------- 1 | from sqlalchemy import * 2 | from sqlalchemy.orm import sessionmaker, scoped_session 3 | from sqlalchemy.ext.declarative import declarative_base 4 | engine = create_engine( 5 | "mysql+pymysql://root:root@127.0.0.1:3306/test", 6 | #超过连接池大小外最多可以创建的连接数 7 | max_overflow=500, 8 | #连接池的大小 9 | pool_size=100, 10 | #是否显示开发信息 11 | echo=False, 12 | ) 13 | #创建一个基类 14 | BASE= declarative_base() 15 | 16 | class Song(BASE): 17 | #定义这个数据库表的名字 18 | __tablename__ = 'song' 19 | #设置一些对应的值 20 | song_id = Column(Integer,primary_key=True,autoincrement=True) 21 | singer_name= Column(String(50)) 22 | song_name = Column(String(64)) 23 | song_ablum = Column(String(64)) 24 | song_number = Column(String(50)) 25 | song_mid = Column(String(50)) 26 | 27 | 28 | #把引擎加入基类里面 29 | BASE.metadata.create_all(engine) 30 | DBsession = sessionmaker(bind=engine) 31 | SQLsession = scoped_session(DBsession) -------------------------------------------------------------------------------- /qq音乐的框架/sign.js: -------------------------------------------------------------------------------- 1 | this.window=this; 2 | var youmaoni=null; 3 | 4 | !function(n, t) { 5 | "object" == typeof exports && "undefined" != typeof module ? module.exports = t() : "function" == typeof define && define.amd ? define(t) : (n = n || self).getSecuritySign = t() 6 | } (this, 7 | function() { 8 | "use strict"; 9 | var n = function() { 10 | if ("undefined" != typeof self) return self; 11 | if ("undefined" != typeof window) return window; 12 | if ("undefined" != typeof global) return global; 13 | throw new Error("unable to locate global object") 14 | } (); 15 | n.__sign_hash_20200305 = function(n, t) { 16 | function f(n, t) { 17 | return n << t | n >>> 32 - t 18 | } 19 | function h(n, t) { 20 | var o, e, u, p, r; 21 | return u = 2147483648 & n, 22 | p = 2147483648 & t, 23 | r = (1073741823 & n) + (1073741823 & t), 24 | (o = 1073741824 & n) & (e = 1073741824 & t) ? 2147483648 ^ r ^ u ^ p: o | e ? 1073741824 & r ? 
3221225472 ^ r ^ u ^ p: 1073741824 ^ r ^ u ^ p: r ^ u ^ p 25 | } 26 | function o(n, t, o, e, u, p, r) { 27 | var i; 28 | return n = h(n, h(h((i = t) & o | ~i & e, u), r)), 29 | h(f(n, p), t) 30 | } 31 | function e(n, t, o, e, u, p, r) { 32 | var i; 33 | return n = h(n, h(h(t & (i = e) | o & ~i, u), r)), 34 | h(f(n, p), t) 35 | } 36 | function u(n, t, o, e, u, p, r) { 37 | return n = h(n, h(h(t ^ o ^ e, u), r)), 38 | h(f(n, p), t) 39 | } 40 | function p(n, t, o, e, u, p, r) { 41 | return n = h(n, h(h(o ^ (t | ~e), u), r)), 42 | h(f(n, p), t) 43 | } 44 | function r(n) { 45 | var t, o = "", 46 | e = ""; 47 | for (t = 0; t <= 3; t++) o += (e = "0" + (n >>> 8 * t & 255).toString(16)).substr(e.length - 2, 2); 48 | return o 49 | } 50 | var i, l, c, g, a, s, v, d, y, b; 51 | for (t = t || 32, i = function(n) { 52 | for (var t, o = n.length, 53 | e = o + 8, 54 | u = 16 * (1 + (e - e % 64) / 64), p = Array(u - 1), r = 0, i = 0; i < o;) r = i % 4 * 8, 55 | p[t = (i - i % 4) / 4] = p[t] | n.charCodeAt(i) << r, 56 | i++; 57 | return r = i % 4 * 8, 58 | p[t = (i - i % 4) / 4] = p[t] | 128 << r, 59 | p[u - 2] = o << 3, 60 | p[u - 1] = o >>> 29, 61 | p 62 | } (n = function(n) { 63 | n = n.replace(/\r\n/g, "\n"); 64 | for (var t = "", 65 | o = 0; o < n.length; o++) { 66 | var e = n.charCodeAt(o); 67 | e < 128 ? t += String.fromCharCode(e) : (127 < e && e < 2048 ? t += String.fromCharCode(e >> 6 | 192) : (t += String.fromCharCode(e >> 12 | 224), t += String.fromCharCode(e >> 6 & 63 | 128)), t += String.fromCharCode(63 & e | 128)) 68 | } 69 | return t 70 | } (n)), v = 1732584193, d = 4023233417, y = 2562383102, b = 271733878, l = 0; l < i.length; l += 16) v = o(c = v, g = d, a = y, s = b, i[l + 0], 7, 3614090360), 71 | b = o(b, v, d, y, i[l + 1], 12, 3905402710), 72 | y = o(y, b, v, d, i[l + 2], 17, 606105819), 73 | d = o(d, y, b, v, i[l + 3], 22, 3250441966), 74 | v = o(v, d, y, b, i[l + 4], 7, 4118548399), 75 | b = o(b, v, d, y, i[l + 5], 12, 1200080426), 76 | y = o(y, b, v, d, i[l + 6], 17, 2821735955), 77 | d = o(d, y, b, v, i[l + 7], 22, 4249261313), 78 | v = o(v, d, y, b, i[l + 8], 7, 1770035416), 79 | b = o(b, v, d, y, i[l + 9], 12, 2336552879), 80 | y = o(y, b, v, d, i[l + 10], 17, 4294925233), 81 | d = o(d, y, b, v, i[l + 11], 22, 2304563134), 82 | v = o(v, d, y, b, i[l + 12], 7, 1804603682), 83 | b = o(b, v, d, y, i[l + 13], 12, 4254626195), 84 | y = o(y, b, v, d, i[l + 14], 17, 2792965006), 85 | v = e(v, d = o(d, y, b, v, i[l + 15], 22, 1236535329), y, b, i[l + 1], 5, 4129170786), 86 | b = e(b, v, d, y, i[l + 6], 9, 3225465664), 87 | y = e(y, b, v, d, i[l + 11], 14, 643717713), 88 | d = e(d, y, b, v, i[l + 0], 20, 3921069994), 89 | v = e(v, d, y, b, i[l + 5], 5, 3593408605), 90 | b = e(b, v, d, y, i[l + 10], 9, 38016083), 91 | y = e(y, b, v, d, i[l + 15], 14, 3634488961), 92 | d = e(d, y, b, v, i[l + 4], 20, 3889429448), 93 | v = e(v, d, y, b, i[l + 9], 5, 568446438), 94 | b = e(b, v, d, y, i[l + 14], 9, 3275163606), 95 | y = e(y, b, v, d, i[l + 3], 14, 4107603335), 96 | d = e(d, y, b, v, i[l + 8], 20, 1163531501), 97 | v = e(v, d, y, b, i[l + 13], 5, 2850285829), 98 | b = e(b, v, d, y, i[l + 2], 9, 4243563512), 99 | y = e(y, b, v, d, i[l + 7], 14, 1735328473), 100 | v = u(v, d = e(d, y, b, v, i[l + 12], 20, 2368359562), y, b, i[l + 5], 4, 4294588738), 101 | b = u(b, v, d, y, i[l + 8], 11, 2272392833), 102 | y = u(y, b, v, d, i[l + 11], 16, 1839030562), 103 | d = u(d, y, b, v, i[l + 14], 23, 4259657740), 104 | v = u(v, d, y, b, i[l + 1], 4, 2763975236), 105 | b = u(b, v, d, y, i[l + 4], 11, 
1272893353), 106 | y = u(y, b, v, d, i[l + 7], 16, 4139469664), 107 | d = u(d, y, b, v, i[l + 10], 23, 3200236656), 108 | v = u(v, d, y, b, i[l + 13], 4, 681279174), 109 | b = u(b, v, d, y, i[l + 0], 11, 3936430074), 110 | y = u(y, b, v, d, i[l + 3], 16, 3572445317), 111 | d = u(d, y, b, v, i[l + 6], 23, 76029189), 112 | v = u(v, d, y, b, i[l + 9], 4, 3654602809), 113 | b = u(b, v, d, y, i[l + 12], 11, 3873151461), 114 | y = u(y, b, v, d, i[l + 15], 16, 530742520), 115 | v = p(v, d = u(d, y, b, v, i[l + 2], 23, 3299628645), y, b, i[l + 0], 6, 4096336452), 116 | b = p(b, v, d, y, i[l + 7], 10, 1126891415), 117 | y = p(y, b, v, d, i[l + 14], 15, 2878612391), 118 | d = p(d, y, b, v, i[l + 5], 21, 4237533241), 119 | v = p(v, d, y, b, i[l + 12], 6, 1700485571), 120 | b = p(b, v, d, y, i[l + 3], 10, 2399980690), 121 | y = p(y, b, v, d, i[l + 10], 15, 4293915773), 122 | d = p(d, y, b, v, i[l + 1], 21, 2240044497), 123 | v = p(v, d, y, b, i[l + 8], 6, 1873313359), 124 | b = p(b, v, d, y, i[l + 15], 10, 4264355552), 125 | y = p(y, b, v, d, i[l + 6], 15, 2734768916), 126 | d = p(d, y, b, v, i[l + 13], 21, 1309151649), 127 | v = p(v, d, y, b, i[l + 4], 6, 4149444226), 128 | b = p(b, v, d, y, i[l + 11], 10, 3174756917), 129 | y = p(y, b, v, d, i[l + 2], 15, 718787259), 130 | d = p(d, y, b, v, i[l + 9], 21, 3951481745), 131 | v = h(v, c), 132 | d = h(d, g), 133 | y = h(y, a), 134 | b = h(b, s); 135 | return 32 == t ? r(v) + r(d) + r(y) + r(b) : r(d) + r(y) 136 | }, 137 | function i(f, h, l, c, g) { 138 | g = g || [[this], [{}]]; 139 | for (var t = [], o = null, n = [function() { 140 | return ! 0 141 | }, 142 | function() {}, 143 | function() { 144 | g.length = l[h++] 145 | }, 146 | function() { 147 | g.push(l[h++]) 148 | }, 149 | function() { 150 | g.pop() 151 | }, 152 | function() { 153 | var n = l[h++], 154 | t = g[g.length - 2 - n]; 155 | g[g.length - 2 - n] = g.pop(), 156 | g.push(t) 157 | }, 158 | function() { 159 | g.push(g[g.length - 1]) 160 | }, 161 | function() { 162 | g.push([g.pop(), g.pop()].reverse()) 163 | }, 164 | function() { 165 | g.push([c, g.pop()]) 166 | }, 167 | function() { 168 | g.push([g.pop()]) 169 | }, 170 | function() { 171 | var n = g.pop(); 172 | g.push(n[0][n[1]]) 173 | }, 174 | function() { 175 | g.push(g[g.pop()[0]][0]) 176 | }, 177 | function() { 178 | var n = g[g.length - 2]; 179 | n[0][n[1]] = g[g.length - 1] 180 | }, 181 | function() { 182 | g[g[g.length - 2][0]][0] = g[g.length - 1] 183 | }, 184 | function() { 185 | var n = g.pop(), 186 | t = g.pop(); 187 | g.push([t[0][t[1]], n]) 188 | }, 189 | function() { 190 | var n = g.pop(); 191 | g.push([g[g.pop()][0], n]) 192 | }, 193 | function() { 194 | var n = g.pop(); 195 | g.push(delete n[0][n[1]]) 196 | }, 197 | function() { 198 | var n = []; 199 | for (var t in g.pop()) n.push(t); 200 | g.push(n) 201 | }, 202 | function() { 203 | g[g.length - 1].length ? 
g.push(g[g.length - 1].shift(), !0) : g.push(void 0, !1) 204 | }, 205 | function() { 206 | var n = g[g.length - 2], 207 | t = Object.getOwnPropertyDescriptor(n[0], n[1]) || { 208 | configurable: !0, 209 | enumerable: !0 210 | }; 211 | t.get = g[g.length - 1], 212 | Object.defineProperty(n[0], n[1], t) 213 | }, 214 | function() { 215 | var n = g[g.length - 2], 216 | t = Object.getOwnPropertyDescriptor(n[0], n[1]) || { 217 | configurable: !0, 218 | enumerable: !0 219 | }; 220 | t.set = g[g.length - 1], 221 | Object.defineProperty(n[0], n[1], t) 222 | }, 223 | function() { 224 | h = l[h++] 225 | }, 226 | function() { 227 | var n = l[h++]; 228 | g[g.length - 1] && (h = n) 229 | }, 230 | function() { 231 | throw g[g.length - 1] 232 | }, 233 | function() { 234 | var n = l[h++], 235 | t = n ? g.slice( - n) : []; 236 | g.length -= n, 237 | g.push(g.pop().apply(c, t)) 238 | }, 239 | function() { 240 | var n = l[h++], 241 | t = n ? g.slice( - n) : []; 242 | g.length -= n; 243 | var o = g.pop(); 244 | g.push(o[0][o[1]].apply(o[0], t)) 245 | }, 246 | function() { 247 | var n = l[h++], 248 | t = n ? g.slice( - n) : []; 249 | g.length -= n, 250 | t.unshift(null), 251 | g.push(new(Function.prototype.bind.apply(g.pop(), t))) 252 | }, 253 | function() { 254 | var n = l[h++], 255 | t = n ? g.slice( - n) : []; 256 | g.length -= n, 257 | t.unshift(null); 258 | var o = g.pop(); 259 | g.push(new(Function.prototype.bind.apply(o[0][o[1]], t))) 260 | }, 261 | function() { 262 | g.push(!g.pop()) 263 | }, 264 | function() { 265 | g.push(~g.pop()) 266 | }, 267 | function() { 268 | g.push(typeof g.pop()) 269 | }, 270 | function() { 271 | g[g.length - 2] = g[g.length - 2] == g.pop() 272 | }, 273 | function() { 274 | g[g.length - 2] = g[g.length - 2] === g.pop() 275 | }, 276 | function() { 277 | g[g.length - 2] = g[g.length - 2] > g.pop() 278 | }, 279 | function() { 280 | g[g.length - 2] = g[g.length - 2] >= g.pop() 281 | }, 282 | function() { 283 | g[g.length - 2] = g[g.length - 2] << g.pop() 284 | }, 285 | function() { 286 | g[g.length - 2] = g[g.length - 2] >> g.pop() 287 | }, 288 | function() { 289 | g[g.length - 2] = g[g.length - 2] >>> g.pop() 290 | }, 291 | function() { 292 | g[g.length - 2] = g[g.length - 2] + g.pop() 293 | }, 294 | function() { 295 | g[g.length - 2] = g[g.length - 2] - g.pop() 296 | }, 297 | function() { 298 | g[g.length - 2] = g[g.length - 2] * g.pop() 299 | }, 300 | function() { 301 | g[g.length - 2] = g[g.length - 2] / g.pop() 302 | }, 303 | function() { 304 | g[g.length - 2] = g[g.length - 2] % g.pop() 305 | }, 306 | function() { 307 | g[g.length - 2] = g[g.length - 2] | g.pop() 308 | }, 309 | function() { 310 | g[g.length - 2] = g[g.length - 2] & g.pop() 311 | }, 312 | function() { 313 | g[g.length - 2] = g[g.length - 2] ^ g.pop() 314 | }, 315 | function() { 316 | g[g.length - 2] = g[g.length - 2] in g.pop() 317 | }, 318 | function() { 319 | g[g.length - 2] = g[g.length - 2] instanceof g.pop() 320 | }, 321 | function() { 322 | g[g[g.length - 1][0]] = void 0 === g[g[g.length - 1][0]] ? 
[] : g[g[g.length - 1][0]] 323 | }, 324 | function() { 325 | for (var e = l[h++], u = [], n = l[h++], t = l[h++], p = [], o = 0; o < n; o++) u[l[h++]] = g[l[h++]]; 326 | for (var r = 0; r < t; r++) p[r] = l[h++]; 327 | g.push(function n() { 328 | var t = u.slice(0); 329 | t[0] = [this], 330 | t[1] = [arguments], 331 | t[2] = [n]; 332 | for (var o = 0; o < p.length && o < arguments.length; o++) 0 < p[o] && (t[p[o]] = [arguments[o]]); 333 | return i(f, e, l, c, t) 334 | }) 335 | }, 336 | function() { 337 | t.push([l[h++], g.length, l[h++]]) 338 | }, 339 | function() { 340 | t.pop() 341 | }, 342 | function() { 343 | return !! o 344 | }, 345 | function() { 346 | o = null 347 | }, 348 | function() { 349 | g[g.length - 1] += String.fromCharCode(l[h++]) 350 | }, 351 | function() { 352 | g.push("") 353 | }, 354 | function() { 355 | g.push(void 0) 356 | }, 357 | function() { 358 | g.push(null) 359 | }, 360 | function() { 361 | g.push(!0) 362 | }, 363 | function() { 364 | g.push(!1) 365 | }, 366 | function() { 367 | g.length -= l[h++] 368 | }, 369 | function() { 370 | g[g.length - 1] = l[h++] 371 | }, 372 | function() { 373 | var n = g.pop(), 374 | t = g[g.length - 1]; 375 | t[0][t[1]] = g[n[0]][0] 376 | }, 377 | function() { 378 | var n = g.pop(), 379 | t = g[g.length - 1]; 380 | t[0][t[1]] = n[0][n[1]] 381 | }, 382 | function() { 383 | var n = g.pop(), 384 | t = g[g.length - 1]; 385 | g[t[0]][0] = g[n[0]][0] 386 | }, 387 | function() { 388 | var n = g.pop(), 389 | t = g[g.length - 1]; 390 | g[t[0]][0] = n[0][n[1]] 391 | }, 392 | function() { 393 | g[g.length - 2] = g[g.length - 2] < g.pop() 394 | }, 395 | function() { 396 | g[g.length - 2] = g[g.length - 2] <= g.pop() 397 | }];;) try { 398 | for (; ! n[l[h++]]();); 399 | if (o) throw o; 400 | return g.pop() 401 | } catch(n) { 402 | var e = t.pop(); 403 | if (void 0 === e) throw n; 404 | o = n, 405 | h = e[0], 406 | g.length = e[1], 407 | e[2] && (g[e[2]][0] = o) 408 | } 409 | } (120731, 0, [21, 34, 50, 100, 57, 50, 102, 50, 98, 99, 101, 52, 54, 97, 52, 99, 55, 56, 52, 49, 57, 54, 57, 49, 56, 98, 102, 100, 100, 48, 48, 55, 55, 102, 2, 10, 3, 2, 9, 48, 61, 3, 9, 48, 61, 4, 9, 48, 61, 5, 9, 48, 61, 6, 9, 48, 61, 7, 9, 48, 61, 8, 9, 48, 61, 9, 9, 48, 4, 21, 427, 54, 2, 15, 3, 2, 9, 48, 61, 3, 9, 48, 61, 4, 9, 48, 61, 5, 9, 48, 61, 6, 9, 48, 61, 7, 9, 48, 61, 8, 9, 48, 61, 9, 9, 48, 61, 10, 9, 48, 61, 11, 9, 48, 61, 12, 9, 48, 61, 13, 9, 48, 61, 14, 9, 48, 61, 10, 9, 55, 54, 97, 54, 98, 54, 99, 54, 100, 54, 101, 54, 102, 54, 103, 54, 104, 54, 105, 54, 106, 54, 107, 54, 108, 54, 109, 54, 110, 54, 111, 54, 112, 54, 113, 54, 114, 54, 115, 54, 116, 54, 117, 54, 118, 54, 119, 54, 120, 54, 121, 54, 122, 54, 48, 54, 49, 54, 50, 54, 51, 54, 52, 54, 53, 54, 54, 54, 55, 54, 56, 54, 57, 13, 4, 61, 11, 9, 55, 54, 77, 54, 97, 54, 116, 54, 104, 8, 55, 54, 102, 54, 108, 54, 111, 54, 111, 54, 114, 14, 55, 54, 77, 54, 97, 54, 116, 54, 104, 8, 55, 54, 114, 54, 97, 54, 110, 54, 100, 54, 111, 54, 109, 14, 25, 0, 3, 4, 9, 11, 3, 3, 9, 11, 39, 3, 1, 38, 40, 3, 3, 9, 11, 38, 25, 1, 13, 4, 61, 12, 9, 55, 13, 4, 61, 13, 9, 3, 0, 13, 4, 4, 3, 13, 9, 11, 3, 11, 9, 11, 66, 22, 306, 4, 21, 422, 24, 4, 3, 14, 9, 55, 54, 77, 54, 97, 54, 116, 54, 104, 8, 55, 54, 102, 54, 108, 54, 111, 54, 111, 54, 114, 14, 55, 54, 77, 54, 97, 54, 116, 54, 104, 8, 55, 54, 114, 54, 97, 54, 110, 54, 100, 54, 111, 54, 109, 14, 25, 0, 3, 10, 9, 55, 54, 108, 54, 101, 54, 110, 54, 103, 54, 116, 54, 104, 15, 10, 40, 25, 1, 13, 4, 61, 12, 9, 6, 11, 3, 10, 9, 3, 14, 9, 11, 15, 10, 38, 13, 4, 61, 13, 9, 6, 11, 
6, 5, 1, 5, 0, 3, 1, 38, 13, 4, 61, 0, 5, 0, 43, 4, 21, 291, 61, 3, 12, 9, 11, 0, 3, 9, 9, 49, 72, 0, 2, 3, 4, 13, 4, 61, 8, 9, 21, 721, 3, 2, 8, 3, 2, 9, 48, 61, 3, 9, 48, 61, 4, 9, 48, 61, 5, 9, 48, 61, 6, 9, 48, 61, 7, 9, 48, 4, 55, 54, 115, 54, 101, 54, 108, 54, 102, 8, 10, 30, 55, 54, 117, 54, 110, 54, 100, 54, 101, 54, 102, 54, 105, 54, 110, 54, 101, 54, 100, 32, 28, 22, 510, 4, 21, 523, 22, 4, 55, 54, 115, 54, 101, 54, 108, 54, 102, 8, 10, 0, 55, 54, 119, 54, 105, 54, 110, 54, 100, 54, 111, 54, 119, 8, 10, 30, 55, 54, 117, 54, 110, 54, 100, 54, 101, 54, 102, 54, 105, 54, 110, 54, 101, 54, 100, 32, 28, 22, 566, 4, 21, 583, 3, 4, 55, 54, 119, 54, 105, 54, 110, 54, 100, 54, 111, 54, 119, 8, 10, 0, 55, 54, 103, 54, 108, 54, 111, 54, 98, 54, 97, 54, 108, 8, 10, 30, 55, 54, 117, 54, 110, 54, 100, 54, 101, 54, 102, 54, 105, 54, 110, 54, 101, 54, 100, 32, 28, 22, 626, 4, 21, 643, 25, 4, 55, 54, 103, 54, 108, 54, 111, 54, 98, 54, 97, 54, 108, 8, 10, 0, 55, 54, 69, 54, 114, 54, 114, 54, 111, 54, 114, 8, 55, 54, 117, 54, 110, 54, 97, 54, 98, 54, 108, 54, 101, 54, 32, 54, 116, 54, 111, 54, 32, 54, 108, 54, 111, 54, 99, 54, 97, 54, 116, 54, 101, 54, 32, 54, 103, 54, 108, 54, 111, 54, 98, 54, 97, 54, 108, 54, 32, 54, 111, 54, 98, 54, 106, 54, 101, 54, 99, 54, 116, 27, 1, 23, 56, 0, 49, 444, 0, 0, 24, 0, 13, 4, 61, 8, 9, 55, 54, 95, 54, 95, 54, 103, 54, 101, 54, 116, 54, 83, 54, 101, 54, 99, 54, 117, 54, 114, 54, 105, 54, 116, 54, 121, 54, 83, 54, 105, 54, 103, 54, 110, 15, 21, 1126, 49, 2, 14, 3, 2, 9, 48, 61, 3, 9, 48, 61, 4, 9, 48, 61, 5, 9, 48, 61, 6, 9, 48, 61, 7, 9, 48, 61, 8, 9, 48, 61, 9, 9, 48, 61, 10, 9, 48, 61, 11, 9, 48, 61, 9, 9, 55, 54, 108, 54, 111, 54, 99, 54, 97, 54, 116, 54, 105, 54, 111, 54, 110, 8, 10, 30, 55, 54, 117, 54, 110, 54, 100, 54, 101, 54, 102, 54, 105, 54, 110, 54, 101, 54, 100, 32, 28, 22, 862, 21, 932, 21, 4, 55, 54, 108, 54, 111, 54, 99, 54, 97, 54, 116, 54, 105, 54, 111, 54, 110, 8, 55, 54, 104, 54, 111, 54, 115, 54, 116, 14, 55, 54, 105, 54, 110, 54, 100, 54, 101, 54, 120, 54, 79, 54, 102, 14, 55, 54, 121, 54, 46, 54, 113, 54, 113, 54, 46, 54, 99, 54, 111, 54, 109, 25, 1, 3, 0, 3, 1, 39, 32, 22, 963, 4, 55, 54, 67, 54, 74, 54, 66, 54, 80, 54, 65, 54, 67, 54, 114, 54, 82, 54, 117, 54, 78, 54, 121, 54, 55, 21, 974, 50, 4, 3, 12, 9, 11, 3, 8, 3, 10, 24, 2, 13, 4, 61, 10, 9, 3, 13, 9, 55, 54, 95, 54, 95, 54, 115, 54, 105, 54, 103, 54, 110, 54, 95, 54, 104, 54, 97, 54, 115, 54, 104, 54, 95, 54, 50, 54, 48, 54, 50, 54, 48, 54, 48, 54, 51, 54, 48, 54, 53, 15, 10, 22, 1030, 21, 1087, 22, 4, 3, 13, 9, 55, 54, 95, 54, 95, 54, 115, 54, 105, 54, 103, 54, 110, 54, 95, 54, 104, 54, 97, 54, 115, 54, 104, 54, 95, 54, 50, 54, 48, 54, 50, 54, 48, 54, 48, 54, 51, 54, 48, 54, 53, 15, 3, 9, 9, 11, 3, 3, 9, 11, 38, 25, 1, 13, 4, 61, 11, 9, 3, 12, 9, 11, 3, 10, 3, 53, 3, 37, 39, 24, 2, 13, 4, 4, 55, 54, 122, 54, 122, 54, 97, 3, 11, 9, 11, 38, 3, 10, 9, 11, 38, 0, 49, 771, 2, 1, 12, 9, 13, 8, 3, 12, 4, 4, 56, 0], n); 410 | var t = n.__getSecuritySign; 411 | youmaoni=t; 412 | return t 413 | }); 414 | 415 | function test(){ 416 | return youmaoni("123"); 417 | } 418 | 419 | function getSIgn(data) { 420 | return youmaoni(data); 421 | } -------------------------------------------------------------------------------- /qq音乐的框架/test.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import uuid 3 | from urllib import parse 4 | import math 5 | import execjs 6 | import os 7 | from concurrent.futures import 
ThreadPoolExecutor,ProcessPoolExecutor 8 | headers = { 9 | "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36", 10 | "accept-language": "zh-CN,zh;q=0.9", 11 | "accept-encoding": "gzip, deflate, br", 12 | "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9", 13 | "cache-control": "max-age=600", 14 | "Referer": "https://y.qq.com/portal/singer_list.html", 15 | } 16 | def get_singer_mid(index): 17 | #index=1---27 18 | data='{"comm":{"ct":24,"cv":0},"singerList":{"module":"Music.SingerListServer"' \ 19 | ',"method":"get_singer_list","param":{"area":-100,"sex":-100,"genre":-100,' \ 20 | '"index":%s,"sin":0,"cur_page":1}}}' % (str(index)) 21 | url='https://u.y.qq.com/cgi-bin/musicu.fcg?-=getUCGI0432880619182503' \ 22 | '&g_tk=571600846&loginUin=0&hostUin=0&format=json&inCharset=utf8&out' \ 23 | 'Charset=utf-8¬ice=0&platform=yqq.json&needNewCode=0' \ 24 | '&data={}'.format(parse.quote(data)) 25 | html=requests.get(url).json() 26 | total=html["singerList"]["data"]["total"] 27 | pages=int(math.floor(int(total)/80)) 28 | thread_number=pages 29 | Thread=ThreadPoolExecutor(max_workers=thread_number) 30 | sin = 0 31 | for page in range(1, pages): 32 | data = '{"comm":{"ct":24,"cv":0},"singerList":{"module":"Music.SingerListServer",' \ 33 | '"method":"get_singer_list","param":{"area":-100,"sex":-100,"genre":-100,"' \ 34 | 'index":%s,"sin":%d,"cur_page":%s}}}' % (str(index), sin, str(page)) 35 | 36 | url = 'https://u.y.qq.com/cgi-bin/musicu.fcg?-=getUCGI0432880619182503' \ 37 | '&g_tk=571600846&loginUin=0&hostUin=0&format=json&inCharset=utf8&out' \ 38 | 'Charset=utf-8¬ice=0&platform=yqq.json&needNewCode=0' \ 39 | '&data={}'.format(parse.quote(data)) 40 | html=requests.get(url,headers=headers).json() 41 | sings=html['singerList']['data']['singerlist'] 42 | for sing in sings: 43 | singer_name = sing['singer_name'] 44 | mid = sing['singer_mid'] 45 | get_singer_data(mid,singer_name) 46 | sin += 80 47 | 48 | def get_singer_data(mid,singer_name): 49 | params = '{"comm":{"ct":24,"cv":0},"singerSongList":{"method":"GetSingerSongList",' \ 50 | '"param":{"order":1,"singerMid":"%s","begin":0,"num":10},' \ 51 | '"module":"musichall.song_list_server"}}' % str(mid) 52 | 53 | url = 'https://u.y.qq.com/cgi-bin/musicu.fcg?-=getSingerSong9513357793133783&' \ 54 | 'g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8' \ 55 | '¬ice=0&platform=yqq.json&needNewCode=0*&data={}'.format(parse.quote(params)) 56 | html=requests.session() 57 | content=html.get(url,headers=headers).json() 58 | songs_num=content['singerSongList']['data']['totalNum'] 59 | 60 | if int(songs_num)<=80: 61 | params = '{"comm":{"ct":24,"cv":0},"singerSongList":{"method":"GetSingerSongList",' \ 62 | '"param":{"order":1,"singerMid":"%s","begin":0,"num":%s},' \ 63 | '"module":"musichall.song_list_server"}}' % (str(mid), str(songs_num)) 64 | 65 | url = 'https://u.y.qq.com/cgi-bin/musicu.fcg?-=getSingerSong9513357793133783&' \ 66 | 'g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8' \ 67 | '¬ice=0&platform=yqq.json&needNewCode=0*&data={}'.format(parse.quote(params)) 68 | html = requests.session() 69 | content =html.get(url,headers=headers).json() 70 | datas = content['singerSongList']['data']['songList'] 71 | for song in datas: 72 | song_name = song['songInfo']['name'] 73 | song_ablum = song['songInfo']['album']['name'] 74 | singer_name = singer_name 75 | 
song_mid = song['songInfo']['mid'] 76 | print(singer_name,song_name,song_ablum,song_mid) 77 | 78 | else: 79 | for a in range(0,songs_num,80): 80 | params = '{"comm":{"ct":24,"cv":0},"singerSongList":{"method":"GetSingerSongList",' \ 81 | '"param":{"order":1,"singerMid":"%s","begin":%s,"num":%s},' \ 82 | '"module":"musichall.song_list_server"}}' % (str(mid), int(a), int(songs_num)) 83 | 84 | url = 'https://u.y.qq.com/cgi-bin/musicu.fcg?-=getSingerSong9513357793133783&' \ 85 | 'g_tk=5381&loginUin=0&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8' \ 86 | '¬ice=0&platform=yqq.json&needNewCode=0*&data={}'.format(parse.quote(params)) 87 | html = requests.session() 88 | content = html.get(url, headers=headers).json() 89 | datas = content['singerSongList']['data']['songList'] 90 | for song in datas: 91 | song_name = song['songInfo']['name'] 92 | song_ablum = song['songInfo']['album']['name'] 93 | singer_name = singer_name 94 | song_mid = song['songInfo']['mid'] 95 | print(singer_name,song_name,song_ablum,song_mid) 96 | download(song_mid,singer_name) 97 | def download(song_mid,sing_name): 98 | headers = { 99 | 'cookie': 'pgv_pvid=2128245208; pac_uid=0_6b1c785781d54; pgv_pvi=772980736; RK=8x5lwvVnY1;' 100 | ' ptcz=baca3422f148c8897bd71cb3765e7c08bf0dddc2aac46b34eb2f6b669e38d215;' 101 | ' ptui_loginuin=1766228968@qq.com; ts_refer=www.baidu.com/link; ts_uid=2958609676; ' 102 | 'pgv_si=s6667516928; pgv_info=ssid=s1048782460; player_exist=1; qqmusic_fromtag=66;' 103 | ' userAction=1; yqq_stat=0; _qpsvr_localtk=0.5664181233633159;' 104 | 'psrf_qqunionid=E7D5E8B282E958B5ED555246677BCD41; psrf_qqrefresh_' 105 | 'token=DC32336F11952FA5867192F46CF15FD5; tmeLoginType=2; qqmusic_' 106 | 'key=Q_H_L_2Sqn2y50eyOV1i5dcbk613wim45KnxmEK5ofj1RsBgxgHN-xLkK25EjEAQ2jvs1;' 107 | ' psrf_qqopenid=33B542A190FCA799A663FDDCB25EA8F0; qm_' 108 | 'keyst=Q_H_L_2Sqn2y50eyOV1i5dcbk613wim45KnxmEK5ofj1RsBgxgHN-xLkK25EjEAQ2jvs1; ' 109 | 'euin=oK4qNe-l7KvPoz**; psrf_access_token_expiresAt=1601793607; ' 110 | 'psrf_musickey_createtime=1594017607; psrf_qqaccess_token=8838D613ABE40CD4A345D8E550EBB967; ' 111 | 'uin=1598275443; ts_last=y.qq.com/portal/player.html; yplayer_open=1; yq_index=3', 112 | 'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36', 113 | 'referer': 'https://y.qq.com/portal/player.html' 114 | } 115 | 116 | data = '{"req_0":{"module":"vkey.GetVkeyServer","method":"CgiGetVkey","param":' \ 117 | '{"guid":"2128245208","songmid":["%s"],"songtype":[0],"uin":"1598275443","loginflag"' \ 118 | ':1,"platform":"20"}},"comm":{"uin":1598275443,"format":"json","ct":24,"cv":0}}' % str( 119 | song_mid) 120 | url = 'https://u.y.qq.com/cgi-bin/musicu.fcg?-=getplaysongvkey17693804549459324' \ 121 | '&g_tk=5381&loginUin=3262637034&hostUin=0&format=json&inCharset=utf8&outCharset=utf-8' \ 122 | '¬ice=0&platform=yqq.json&needNewCode=0&data={}'.format(parse.quote(data)) 123 | vkey = requests.get(url, headers=headers) 124 | purl = vkey.json()['req_0']['data']['midurlinfo'][0]['purl'] 125 | url = 'https://ws.stream.qqmusic.qq.com/' + purl 126 | html = requests.get(url) 127 | filename = 'qq音乐' 128 | if not os.path.exists(filename): 129 | os.makedirs(filename) 130 | with open('./{}/{}.m4a'.format(filename, sing_name), 'wb') as f: 131 | print('\n正在下载{}歌曲.....\n'.format(sing_name)) 132 | f.write(html.content) 133 | 134 | 135 | 136 | 137 | if __name__ == '__main__': 138 | get_singer_mid(2) -------------------------------------------------------------------------------- 
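Note on the qq音乐 folder above: sign.js exposes a `getSIgn(data)` entry point (plus a small `test()` helper), but neither main.py nor test.py shows the Python side of the bridge. A minimal sketch of how it could be called with execjs, following the interface pattern from point 7 of the README; the payload string here is only a placeholder, not a real QQ Music request:

```python
# Sketch only: bridge the repo's sign.js into Python via execjs (PyExecJS).
# "sign.js" and its exported getSIgn() come from this folder; the payload is a placeholder.
import execjs

def get_sign(payload: str) -> str:
    # read and compile the obfuscated signing script
    with open("sign.js", "r", encoding="utf-8") as f:
        ctx = execjs.compile(f.read())
    # call the JS entry point with the request data that needs to be signed
    return ctx.call("getSIgn", payload)

if __name__ == '__main__':
    print(get_sign('{"comm":{"ct":24,"cv":0}}'))
```
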
/中国疫情地图/地图.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import json 3 | import re 4 | from pyecharts.charts import Map 5 | from pyecharts import options 6 | #先请求到数据的页面 7 | result = requests.get( 8 | 'https://interface.sina.cn/news/wap/fymap2020_data.d.json?1580097300739&&callback=sinajp_1580097300873005379567841634181') 9 | #用正则去获取具体的数字 10 | json_str = re.search("\(+([^)]*)\)+", result.text).group(1) 11 | html = f"{json_str}" 12 | #然后用json把json转化为python 13 | table = json.loads(f"{html}") 14 | province_data = [] 15 | for province in table['data']['list']: 16 | province_data.append((province['name'], province['value'])) 17 | city_data = [] 18 | # 循环获取城市名称和对应的确诊数据 19 | for city in province['city']: 20 | # 这里要注意对应上地图的名字需要使用mapName这个字段 21 | city_data.append((city['mapName'], city['conNum'])) 22 | # 使用Map,创建省份地图 23 | map_province = Map() 24 | # 设置地图上的标题和数据标记,添加省份和确诊人数 25 | map_province.set_global_opts(title_opts=options.TitleOpts( 26 | title=province['name'] + "全国疫情的总数:" + province['value']), 27 | visualmap_opts=options.VisualMapOpts(is_piecewise=True, # 设置是否为分段显示 28 | # 自定义数据范围和对应的颜色,这里我是取色工具获取的颜色值,不容易呀。 29 | pieces=[ 30 | {"min": 1000, "label": '>1000', 31 | "color": "#6F171F"}, 32 | {"min": 500, "max": 1000, 33 | "label": '500-1000', "color": "#C92C34"}, 34 | {"min": 100, "max": 499, 35 | "label": '100-499', "color": "#E35B52"}, 36 | {"min": 10, "max": 99, 37 | "label": '10-99', "color": "#F39E86"}, 38 | {"min": 1, "max": 9, "label": '1-9', "color": "#FDEBD0"}])) 39 | # 将 数据添加进去,生成省份地图,所以maptype要对应省份。 40 | map_province.add("确诊人数:", city_data, maptype=province['name']) 41 | # 一切完成,那么生成一个省份的html网页文件,取上对应省份的名字。 42 | 43 | # 创建国家地图 44 | map_country = Map() 45 | # 设置地图上的标题和数据标记,添加确诊人数 46 | map_country.set_global_opts(title_opts=options.TitleOpts( 47 | title="全国确诊人数:" + str(table['data']["gntotal"])), 48 | visualmap_opts=options.VisualMapOpts(is_piecewise=True, # 设置是否为分段显示 49 | # 自定义数据范围和对应的颜色,这里我是取色工具获取的颜色值,不容易呀。 50 | pieces=[ 51 | # 不指定 max,表示 max 为无限大(Infinity)。 52 | {"min": 1000, "label": '>1000', 53 | "color": "#6F171F"}, 54 | {"min": 500, "max": 1000, 55 | "label": '500-1000', "color": "#C92C34"}, 56 | {"min": 100, "max": 499, 57 | "label": '100-499', "color": "#E35B52"}, 58 | {"min": 10, "max": 99, 59 | "label": '10-99', "color": "#F39E86"}, 60 | {"min": 1, "max": 9, "label": '1-9', "color": "#FDEBD0"}])) 61 | # 将数据添加进去,生成中国地图,所以maptype要对应china。 62 | map_country.add("确诊人数:", province_data, maptype="china") 63 | # 一切完成,那么生成一个html网页文件。 64 | map_country.render("country.html") 65 | print("生成完成!!!") -------------------------------------------------------------------------------- /京东框架/Celery.py: -------------------------------------------------------------------------------- 1 | from celery import Celery 2 | import requests,re,json 3 | app = Celery( 4 | 'tasks', 5 | backend='redis://127.0.0.1:6379/2', 6 | broker='redis://127.0.0.1:6379/1', 7 | ) 8 | headers = { 9 | 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36', 10 | } 11 | def get_id(url): 12 | id = re.compile('\d+') 13 | res = id.findall(url) 14 | return res[0] 15 | @app.task 16 | def get_comm(url,comm_num): 17 | #存放结果 18 | good_comments = "" 19 | #获取评论 20 | item_id = get_id(url) 21 | pages = comm_num//10 22 | if pages>99: 23 | pages = 99 24 | for page in range(0,pages): 25 | comm_url = 
'https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId={}&score=0&sortType=5&page={}&pageSize=10&isShadowSku=0&rid=0&fold=1'.format(item_id,page) 26 | headers['Referer'] = url 27 | json_decode = requests.get(comm_url,headers = headers).text 28 | try: 29 | if json_decode: 30 | start = json_decode.find('{"productAttr"') 31 | end = json_decode.find('"afterDays":0}]}') + len('"afterDays":0}]}') 32 | results = json.loads(json_decode[start:end])['comments'] 33 | for result in results: 34 | content = result['content'] 35 | good_comments += "{}|".format(content) 36 | except Exception as e: 37 | pass 38 | return item_id,good_comments 39 | -------------------------------------------------------------------------------- /京东框架/client.py: -------------------------------------------------------------------------------- 1 | import redis 2 | import json 3 | def write_csv(row): 4 | with open('shop.txt','a+',encoding='utf8')as f: 5 | f.write(str(row)+'\n') 6 | r = redis.Redis(host='127.0.0.1',port=6379,db=2) 7 | keys = r.keys() 8 | for key in keys(): 9 | res = r.get(key) 10 | res = json.loads(res.decode('utf-8')) 11 | results = res.get('result') 12 | write_csv(results) -------------------------------------------------------------------------------- /京东框架/main.py: -------------------------------------------------------------------------------- 1 | import requests 2 | from bs4 import BeautifulSoup 3 | import re,json,csv 4 | import threadpool 5 | from urllib import parse 6 | headers = { 7 | 'cookie': 'unpl=V2_ZzNtbRAEQxYiDBNTKR1cAmIGEg1KVEYVcgxFBH4ZCQIyABpbclRCFnQUR1NnGlQUZwEZWUtcQRdFCEdkeB5fA2AFEFlBZxVLK14bADlNDEY1WnwHBAJfF3ILQFJ8HlQMZAEUbXJUQyV1CXZdeR1aB2QHE1tyZ0QlRThGXXMbXQZXAiJcchUXSXEKQVVzGBEMZQcUX0FTQhNFCXZX; __jdv=76161171|google-search|t_262767352_googlesearch|cpc|kwd-362776698237_0_cb12f5d6c516441a9241652a41d6d297|1593410310158; __jdu=835732507; areaId=19; ipLoc-djd=19-1601-50256-0; PCSYCityID=CN_440000_440100_440114; shshshfpa=b3947298-5c63-ba93-8e7d-b89e3e422382-1593410312; shshshfpb=eVvsT1HAgXe1EsnsQQ6HTpQ%3D%3D; __jda=122270672.835732507.1593410309.1593410309.1593410310.1; __jdc=122270672; shshshfp=158c0090e5888d932458419e12bac1d7; rkv=V0100; 3AB9D23F7A4B3C9B=VLVTNQOO6BLWETXYSO5XADLGXR7OIDM3NHDDPRNYKWBPH45RRTYXIJNGG5TFHJ5YYFBFDEARKUWAM3XO4ZWTNCDX7U; qrsc=3; shshshsID=0c6834aad4a33312fc6c9eadbfb29e65_6_1593410449685; __jdb=122270672.6.835732507|1.1593410310', 8 | 'referer': 'https://search.jd.com/Search?keyword=python&wq=python&page=3&s=61&click=0', 9 | 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36', 10 | } 11 | id_comm_dict = {} 12 | KEYWORD = parse.quote('python') 13 | 14 | def get_comm(url,comm_num): 15 | #存放结果 16 | good_comments = "" 17 | #获取评论 18 | item_id = get_id(url) 19 | pages = comm_num//10 20 | if pages>99: 21 | pages = 99 22 | for page in range(0,pages): 23 | comm_url = 'https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId={}&score=0&sortType=5&page={}&pageSize=10&isShadowSku=0&rid=0&fold=1'.format(item_id,page) 24 | headers['Referer'] = url 25 | html = requests.get(comm_url, headers=headers) 26 | json_decode = html.text 27 | try: 28 | if json_decode: 29 | start = json_decode.find('{"productAttr"') 30 | end = json_decode.find('"afterDays":0}]}') + len('"afterDays":0}]}') 31 | results = json.loads(json_decode[start:end])['comments'] 32 | for result in results: 33 | content = result['content'] 34 | good_comments += 
"{}|".format(content) 35 | except Exception as e: 36 | pass 37 | return item_id,good_comments 38 | 39 | def get_index(url): 40 | session = requests.Session() 41 | session.headers = headers 42 | html = session.get(url) 43 | soup = BeautifulSoup(html.text,'lxml') 44 | items = soup.select('li.gl-item') 45 | for item in items: 46 | base = 'https://item.jd.com/' 47 | inner_url = item.select_one('.gl-i-wrap div.p-img a').get('href') 48 | inner_url = parse.urljoin(base,inner_url) 49 | item_id = get_id(inner_url) 50 | comm_num = get_comm_num(inner_url) 51 | # if comm_num>0: 52 | # id_comm_dict[item_id] = get_comm.delay(inner_url,comm_num) 53 | shop_info_data = get_shop_info(inner_url) 54 | price = item.select('div.p-price strong i')[0].text 55 | shop_info_data['price'] = price 56 | shop_info_data['comm_num'] = comm_num 57 | shop_info_data['item_id'] = item_id 58 | print(shop_info_data) 59 | write_csv(shop_info_data) 60 | 61 | head = ['shop_name','shop_evaluation','logistics','sale_server','shop_brand','price','comm_num','item_id'] 62 | def write_csv(row): 63 | with open('shop.csv','a+',encoding='utf-8')as f: 64 | csv_write = csv.DictWriter(f,head) 65 | csv_write.writerow(row) 66 | 67 | def get_comm_num(url): 68 | item_id = get_id(url) 69 | comm_url = 'https://club.jd.com/comment/productCommentSummaries.action?referenceIds={}&callback=jQuery5999681'.format(item_id) 70 | comment = requests.get(comm_url,headers = headers) 71 | json_decode = comment.text 72 | start = json_decode.find('{"CommentsCount":') 73 | end = json_decode.find('PoorRateStyle":1}]}') + len('PoorRateStyle":1}]}') 74 | try: 75 | result = json.loads(json_decode[start:end])['CommentsCount'] 76 | except: 77 | return 0 78 | comm_num = result[0]['CommentCount'] 79 | return comm_num 80 | def get_shop_info(url): 81 | shop_data = {} 82 | html = requests.get(url,headers = headers) 83 | soup = BeautifulSoup(html.text,'lxml') 84 | try: 85 | shop_name = soup.select('div.mt h3 a')[0].text 86 | except: 87 | shop_name = '京东' 88 | shop_score = soup.select('.score-part span.score-detail em') 89 | try: 90 | shop_evaluation = shop_score[0].text 91 | logistics = shop_score[1].text 92 | sale_server = shop_score[2].text 93 | except: 94 | shop_evaluation = None 95 | logistics = None 96 | sale_server = None 97 | shop_info = soup.select('div.p-parameter ul') 98 | shop_brand = shop_info[0].select('ul li a')[0].text 99 | try: 100 | shop_other = shop_info[1].select('li') 101 | for s in shop_other: 102 | data = s.text.split(':') 103 | key = data[0] 104 | value = data[1] 105 | shop_data[key] = value 106 | except: 107 | pass 108 | shop_data['shop_name']= shop_name 109 | shop_data['shop_evaluation'] = shop_evaluation 110 | shop_data['logistics'] = logistics 111 | shop_data['sale_server'] = sale_server 112 | shop_data['shop_brand'] = shop_brand 113 | return shop_data 114 | 115 | def get_id(url): 116 | id = re.compile('\d+') 117 | res = id.findall(url) 118 | return res[0] 119 | if __name__ == '__main__': 120 | #先创建一个列表用来存放URL 121 | urls = [] 122 | #找到他们的规律,创建一个个URL 123 | for i in range(1,200,2): 124 | url = "https://search.jd.com/Search?keyword={}&wq={}&page={}".format(KEYWORD,KEYWORD,i) 125 | #把创建好的URL用空元组这种形式一条条存入URLS列表里面 126 | urls.append(([url,],None)) 127 | #创建100个线程 128 | pool = threadpool.ThreadPool(100) 129 | #往线程里面添加URL,makeRequests创建任务,创建100个任务 130 | reque = threadpool.makeRequests(get_index,urls) 131 | #用一个for循环线程池 132 | for r in reque: 133 | #putRequest提交这100个任务,往线程池里面提交100个任务 134 | pool.putRequest(r) 135 | #最后等待这个线程池结束 136 | pool.wait() 
-------------------------------------------------------------------------------- /京东框架/shop.csv: -------------------------------------------------------------------------------- 1 | 电子工业出版社,,,,电子工业出版社,41.00,4502,12393882 2 | 电子工业出版社,,,,电子工业出版社,66.20,25415,12367744 3 | 人民邮电出版社,,,,人民邮电出版社,57.80,0,12418677 4 | 电子工业出版社,,,,电子工业出版社,54.40,0,12528357 5 | 机械工业出版社,,,,机械工业出版社,61.40,15155,12593612 6 | 人民邮电出版社,,,,人民邮电出版社,81.90,0,12835020 7 | 电子工业出版社,,,,电子工业出版社,70.20,0,12827014 8 | 静默时光图书专营店,9.01 高,9.67 高,9.03 中,静默时光图书专营店,72.00,0,40906323506 9 | 电子工业出版社,,,,电子工业出版社,54.40,25415,12480719 10 | 广东人民出版社图书专营店,9.14 高,9.35 高,8.71 低,广东人民出版社图书专营店,69.80,0,65651547064 11 | 人民邮电出版社,,,,人民邮电出版社,66.20,0,12592731 12 | 电子工业出版社官方旗舰店,9.17 中,8.92 中,8.77 低,电子工业出版社官方旗舰店,84.96,0,39232866417 13 | 人民邮电出版社,,,,人民邮电出版社,115.60,0,12568882 14 | 华心图书专营店,8.63 低,9.74 高,9.10 中,华心图书专营店,413.96,0,62760270425 15 | 清华大学出版社,,,,清华大学出版社,58.70,0,12600117 16 | 品阅轩图书专营店,8.49 低,8.29 低,9.40 高,品阅轩图书专营店,70.30,0,66406991415 17 | 电子工业出版社,,,,电子工业出版社,74.50,25415,12261161 18 | 人民邮电出版社,,,,人民邮电出版社,82.90,0,11896385 19 | 人民邮电出版社,,,,人民邮电出版社,41.00,0,11666319 20 | 中国人民大学出版社,,,,中国人民大学出版社,27.90,0,12296243 21 | 人民邮电出版社,,,,人民邮电出版社,91.10,0,12832594 22 | 清华大学出版社,,,,清华大学出版社,42.30,0,12418597 23 | 电子工业出版社,,,,电子工业出版社,74.50,0,12365097 24 | 翔坤图书专营店,9.06 中,9.87 高,8.75 低,翔坤图书专营店,155.25,0,46330246763 25 | 文轩网旗舰店,9.36 高,9.78 高,8.87 低,文轩网旗舰店,76.80,436,25740989105 26 | 化学工业出版社,,,,化学工业出版社,58.40,0,12711528 27 | 电子工业出版社,,,,电子工业出版社,78.00,18185,12333252 28 | 人民邮电出版社,,,,人民邮电出版社,57.80,0,12899622 29 | 博库网旗舰店,9.28 高,9.00 中,9.01 中,博库网旗舰店,22.50,0,1392325966 30 | 清华大学出版社,,,,清华大学出版社,21.90,0,11760279 31 | 人民邮电出版社,,,,人民邮电出版社,57.80,0,11485222 32 | 淘乐思图书专营店,9.21 高,9.09 中,9.06 中,淘乐思图书专营店,55.50,0,34183052585 33 | 中国电力出版社,,,,中国电力出版社,66.00,0,12777434 34 | 人民邮电出版社,,,,人民邮电出版社,57.80,0,12382259 35 | 华心图书专营店,8.63 低,9.74 高,9.10 中,华心图书专营店,39.90,0,34550128533 36 | 人民邮电出版社,,,,人民邮电出版社,49.40,0,12163851 37 | 电子工业出版社,,,,电子工业出版社,62.30,0,12801066 38 | 机械工业出版社,,,,机械工业出版社,149.50,0,12808812 39 | 电子工业出版社,,,,电子工业出版社,78.00,0,12385123 40 | 机械工业出版社,,,,机械工业出版社,97.00,29006,12656938 41 | 明日科技京东自营旗舰店,,,,吉林大学出版社,191.30,43745,12597786 42 | 人民邮电出版社,,,,人民邮电出版社,57.80,0,12323267 43 | 亿书图书专营店,-- --,-- --,-- --,亿书图书专营店,48.00,0,67462338606 44 | 电子工业出版社,,,,电子工业出版社,46.50,0,12820408 45 | 机械工业出版社,,,,机械工业出版社,61.40,0,12591062 46 | 电子工业出版社,,,,电子工业出版社,41.00,13800,12071148 47 | 京东,,,,北京大学出版社,92.20,0,12659301 48 | 电子工业出版社,,,,电子工业出版社,54.40,0,12364204 49 | 电子工业出版社,,,,电子工业出版社,55.00,0,12516591 50 | 京东,,,,北京大学出版社,83.70,0,12673001 51 | 人民邮电出版社,,,,人民邮电出版社,65.60,0,12736346 52 | 人民邮电出版社,,,,人民邮电出版社,37.70,0,12303057 53 | 润知天下图书专营店,-- --,-- --,-- --,润知天下图书专营店,79.80,0,39979430016 54 | 京东,,,,华中科技大学出版社,69.30,0,12793050 55 | 书香神州图书专营店,8.81 低,9.59 高,8.80 低,书香神州图书专营店,179.00,0,41425888019 56 | 清华大学出版社,,,,清华大学出版社,84.20,0,12276775 57 | 机械工业出版社,,,,机械工业出版社,65.20,511,11889583 58 | 电子工业出版社,,,,电子工业出版社,70.20,65996,12700534 59 | 人民邮电出版社,,,,人民邮电出版社,57.80,0,12273412 60 | 机械工业出版社,,,,机械工业出版社,65.20,0,12391038 61 | 机械工业出版社,,,,机械工业出版社,70.30,0,12741670 62 | 清华大学出版社,,,,清华大学出版社,50.20,0,12227940 63 | 中国铁道出版社,,,,中国铁道出版社,56.80,0,12700670 64 | 机械工业出版社,,,,机械工业出版社,56.90,0,12506442 65 | 墨涵图书专营店,9.37 高,9.32 高,9.56 高,墨涵图书专营店,118.00,0,36818666321 66 | 人民邮电出版社,,,,人民邮电出版社,41.00,0,12550065 67 | 清华大学出版社,,,,清华大学出版社,33.20,0,12810084 68 | 清华大学出版社,,,,清华大学出版社,84.20,698,12578878 69 | 华研外语官方旗舰店,8.38 低,9.64 高,8.78 低,华研外语官方旗舰店,49.80,0,47862468421 70 | 机械工业出版社,,,,机械工业出版社,105.60,0,12482000 71 | 机械工业出版社,,,,机械工业出版社,65.20,0,12442779 72 | 人民邮电出版社,,,,人民邮电出版社,68.30,0,12651725 73 | 人民邮电出版社,,,,人民邮电出版社,57.80,0,12283479 
74 | 人民邮电出版社,,,,人民邮电出版社,74.50,0,12550997 75 | 书香神州图书专营店,8.81 低,9.59 高,8.80 低,书香神州图书专营店,320.00,0,49719079251 76 | 清华大学出版社,,,,清华大学出版社,30.60,0,12509361 77 | 机械工业出版社,,,,机械工业出版社,40.40,0,12012431 78 | 人民邮电出版社,,,,人民邮电出版社,66.20,0,12832968 79 | 中国法律图书旗舰店,8.89 中,9.70 高,9.01 中,中国法律图书旗舰店,61.00,0,65353249462 80 | 清华大学出版社,,,,清华大学出版社,58.70,0,12482971 81 | 电子工业出版社,,,,电子工业出版社,93.80,0,12838698 82 | 中国水利水电出版社,,,,中国水利水电出版社,85.30,0,12615304 83 | 人民邮电出版社,,,,人民邮电出版社,57.80,0,12418677 84 | 人民邮电出版社,,,,人民邮电出版社,61.40,0,11993134 85 | 人民邮电出版社,,,,人民邮电出版社,82.90,0,12219342 86 | 电子工业出版社,,,,电子工业出版社,100.90,0,12844424 87 | 静默时光图书专营店,9.01 高,9.67 高,9.03 中,静默时光图书专营店,72.00,0,40906323506 88 | 人民邮电出版社,,,,人民邮电出版社,156.60,0,12842874 89 | 明日科技京东自营旗舰店,,,,吉林大学出版社,82.30,0,12647829 90 | 电子工业出版社,,,,电子工业出版社,70.20,0,12654474 91 | 人民邮电出版社,,,,人民邮电出版社,66.20,0,12592731 92 | 明日科技京东自营旗舰店,,,,吉林大学出版社,67.00,0,12353915 93 | 明日科技京东自营旗舰店,,,,吉林大学出版社,67.00,0,12859710 94 | 人民邮电出版社,,,,人民邮电出版社,74.50,0,12794078 95 | 华心图书专营店,8.63 低,9.74 高,9.10 中,华心图书专营店,413.96,0,62760270425 96 | 机械工业出版社,,,,机械工业出版社,98.20,0,12398725 97 | 机械工业出版社,,,,机械工业出版社,81.70,2292,12568751 98 | 电子工业出版社,,,,电子工业出版社,70.20,47782,12654873 99 | 人民邮电出版社,,,,人民邮电出版社,41.00,0,11666319 100 | 智博尚书京东自营旗舰店,,,,中国水利水电出版社,63.80,6356,12821118 101 | 机械工业出版社,,,,机械工业出版社,180.70,29009,12452929 102 | 人民邮电出版社,,,,人民邮电出版社,66.30,0,12585125 103 | 翔坤图书专营店,9.06 中,9.87 高,8.75 低,翔坤图书专营店,155.25,0,46330246763 104 | 北京大学出版社,,,,北京大学出版社,61.20,3546,12776266 105 | 清华大学出版社,,,,清华大学出版社,108.80,3331,12417265 106 | 人民邮电出版社,,,,人民邮电出版社,54.30,0,12863192 107 | 电子工业出版社,,,,电子工业出版社,78.00,18185,12333252 108 | 机械工业出版社,,,,机械工业出版社,86.00,0,12425597 109 | 人民邮电出版社,,,,人民邮电出版社,78.00,0,12682860 110 | 人民邮电出版社,,,,人民邮电出版社,63.50,0,11943853 111 | 人民邮电出版社,,,,人民邮电出版社,57.80,0,11485222 112 | 人民邮电出版社,,,,人民邮电出版社,54.40,0,12335366 113 | 人民邮电出版社,,,,人民邮电出版社,57.80,0,12063813 114 | 人民邮电出版社,,,,人民邮电出版社,66.30,0,12279949 115 | 华心图书专营店,8.63 低,9.74 高,9.10 中,华心图书专营店,39.90,0,34550128533 116 | 明日科技京东自营旗舰店,,,,吉林大学出版社,82.30,76665,12859724 117 | 文轩网旗舰店,9.36 高,9.78 高,8.87 低,文轩网旗舰店,57.10,0,1060018965 118 | 电子工业出版社,,,,电子工业出版社,78.00,0,12385123 119 | 电子工业出版社,,,,电子工业出版社,43.90,47784,12492797 120 | 中国水利水电出版社,,,,中国水利水电出版社,72.90,0,12458274 121 | 人民邮电出版社,,,,人民邮电出版社,57.80,0,12230702 122 | 亿书图书专营店,-- --,-- --,-- --,亿书图书专营店,48.00,0,67462338606 123 | 电子工业出版社,,,,电子工业出版社,101.70,12479,12829246 124 | 人民邮电出版社,,,,人民邮电出版社,91.30,0,12293703 125 | 人民邮电出版社,,,,人民邮电出版社,99.40,47402,12627795 126 | 电子工业出版社,,,,电子工业出版社,41.00,13800,12071148 127 | 清华大学出版社,,,,清华大学出版社,71.20,135910,12594658 128 | 中国水利水电出版社,,,,中国水利水电出版社,63.80,0,12392747 129 | 智博尚书京东自营旗舰店,,,,中国水利水电出版社,94.80,0,12626565 130 | 人民邮电出版社,,,,人民邮电出版社,65.60,0,12736346 131 | 人民邮电出版社,,,,人民邮电出版社,105.80,0,12186192 132 | 人民邮电出版社,,,,人民邮电出版社,82.90,0,12219342 133 | 人民邮电出版社,,,,人民邮电出版社,127.90,0,12830348 134 | 书香神州图书专营店,8.81 低,9.59 高,8.80 低,书香神州图书专营店,179.00,0,41425888019 135 | 中国水利水电出版社,,,,中国水利水电出版社,74.80,0,12645941 136 | 明日科技京东自营旗舰店,,,,吉林大学出版社,82.30,0,12647829 137 | 人民邮电出版社,,,,人民邮电出版社,66.20,0,12570153 138 | 人民邮电出版社,,,,人民邮电出版社,57.80,0,12273412 139 | 明日科技京东自营旗舰店,,,,吉林大学出版社,107.50,76665,12512461 140 | 人民邮电出版社,,,,人民邮电出版社,74.50,0,12794078 141 | 文轩网旗舰店,9.36 高,9.78 高,8.87 低,文轩网旗舰店,67.60,0,21738292624 142 | 中国铁道出版社,,,,中国铁道出版社,56.80,0,12700670 143 | 明日科技京东自营旗舰店,,,,吉林大学出版社,83.80,76665,12451724 144 | 电子工业出版社,,,,电子工业出版社,70.20,47782,12654873 145 | 清华大学出版社,,,,清华大学出版社,33.20,0,12810084 146 | 机械工业出版社,,,,机械工业出版社,48.70,29008,11864820 147 | 人民邮电出版社,,,,人民邮电出版社,75.40,0,12333540 148 | 人民邮电出版社,,,,人民邮电出版社,66.30,0,12585125 149 | 机械工业出版社,,,,机械工业出版社,65.20,0,12442779 150 | 
北京大学出版社,,,,北京大学出版社,61.20,3546,12776266 151 | 凤凰新华书店旗舰店,8.74 低,8.71 低,8.73 低,凤凰新华书店旗舰店,78.00,0,31329707691 152 | 人民邮电出版社,,,,人民邮电出版社,79.70,0,12409581 153 | 机械工业出版社,,,,机械工业出版社,86.00,0,12425597 154 | 书香神州图书专营店,8.81 低,9.59 高,8.80 低,书香神州图书专营店,320.00,0,49719079251 155 | 人民邮电出版社,,,,人民邮电出版社,54.30,0,11896415 156 | 人民邮电出版社,,,,人民邮电出版社,46.50,0,12372646 157 | 人民邮电出版社,,,,人民邮电出版社,54.40,0,12335366 158 | 中国法律图书旗舰店,8.89 中,9.70 高,9.01 中,中国法律图书旗舰店,61.00,0,65353249462 159 | 人民邮电出版社,,,,人民邮电出版社,66.20,0,12301195 160 | 人民邮电出版社,,,,人民邮电出版社,78.00,0,11936238 161 | 文轩网旗舰店,9.36 高,9.78 高,8.87 低,文轩网旗舰店,57.10,0,1060018965 162 | 清华大学出版社,,,,清华大学出版社,67.20,92609,12460540 163 | 文轩网旗舰店,9.36 高,9.78 高,8.87 低,文轩网旗舰店,149.50,1152,40580160814 164 | 清华大学出版社,,,,清华大学出版社,67.80,92612,12467272 165 | 人民邮电出版社,,,,人民邮电出版社,57.80,0,12230702 166 | 机械工业出版社,,,,机械工业出版社,53.60,13140,12450676 167 | 人民邮电出版社,,,,人民邮电出版社,49.40,0,12786896 168 | 高等教育出版社,,,,高等教育出版社,27.30,0,12128326 169 | 人民邮电出版社,,,,人民邮电出版社,99.40,47402,12627795 170 | 电子工业出版社,,,,电子工业出版社,62.30,0,12580392 171 | 人民邮电出版社,,,,人民邮电出版社,85.10,0,11681561 172 | 友杰图书专营店,8.86 低,9.64 高,8.79 低,友杰图书专营店,54.50,0,27176955108 173 | 清华大学出版社,,,,清华大学出版社,67.20,0,12667860 174 | 智博尚书京东自营旗舰店,,,,中国水利水电出版社,94.80,0,12626565 175 | 机械工业出版社,,,,机械工业出版社,164.20,29006,12461168 176 | 人民邮电出版社,,,,人民邮电出版社,74.50,0,12631217 177 | 翔坤图书专营店,9.06 中,9.87 高,8.75 低,翔坤图书专营店,54.50,0,45873713471 178 | 人民邮电出版社,,,,人民邮电出版社,127.90,0,12830348 179 | 中国水利水电出版社,,,,中国水利水电出版社,212.50,0,12562129 180 | 清华大学出版社,,,,清华大学出版社,75.70,44370,12788164 181 | 人民邮电出版社,,,,人民邮电出版社,66.20,0,12570153 182 | 静默时光图书专营店,9.01 高,9.67 高,9.03 中,静默时光图书专营店,299.00,0,70445819119 183 | 人民邮电出版社,,,,人民邮电出版社,109.50,0,12659831 184 | 文轩网旗舰店,9.36 高,9.78 高,8.87 低,文轩网旗舰店,44.20,0,65865938532 185 | 文轩网旗舰店,9.36 高,9.78 高,8.87 低,文轩网旗舰店,67.60,0,21738292624 186 | 人民邮电出版社,,,,人民邮电出版社,41.40,0,12526039 187 | 电子工业出版社,,,,电子工业出版社,93.00,0,12456107 188 | 人民邮电出版社,,,,人民邮电出版社,41.00,0,11948817 189 | 机械工业出版社,,,,机械工业出版社,75.10,0,12614732 190 | 机械工业出版社,,,,机械工业出版社,48.70,29008,11864820 191 | -------------------------------------------------------------------------------- /京东框架/test.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import json 3 | url = 'https://club.jd.com/comment/productCommentSummaries.action?referenceIds=12611705,12660454,12610002,64397711026,12669803,24675618191,12598158,12675942,69543169703,12241204,69257978102,12622435,31373896774,12445029,29308861250,12585016,12375644,37830680848,12052200,12346637,12661197,11748995,12397576,12660458,11487324,12701880,68665370422,12613720,12513593,69004796494,12509944&callback=jQuery5999681&_=1593412248445' 4 | headers = { 5 | "Cookie": "unpl=V2_ZzNtbRAEQxYiDBNTKR1cAmIGEg1KVEYVcgxFBH4ZCQIyABpbclRCFnQUR1NnGlQUZwEZWUtcQRdFCEdkeB5fA2AFEFlBZxVLK14bADlNDEY1WnwHBAJfF3ILQFJ8HlQMZAEUbXJUQyV1CXZdeR1aB2QHE1tyZ0QlRThGXXMbXQZXAiJcchUXSXEKQVVzGBEMZQcUX0FTQhNFCXZX; __jdv=76161171|google-search|t_262767352_googlesearch|cpc|kwd-362776698237_0_cb12f5d6c516441a9241652a41d6d297|1593410310158; __jdu=835732507; areaId=19; ipLoc-djd=19-1601-50256-0; PCSYCityID=CN_440000_440100_440114; shshshfpa=b3947298-5c63-ba93-8e7d-b89e3e422382-1593410312; shshshfpb=eVvsT1HAgXe1EsnsQQ6HTpQ%3D%3D; __jdc=122270672; shshshfp=158c0090e5888d932458419e12bac1d7; 3AB9D23F7A4B3C9B=VLVTNQOO6BLWETXYSO5XADLGXR7OIDM3NHDDPRNYKWBPH45RRTYXIJNGG5TFHJ5YYFBFDEARKUWAM3XO4ZWTNCDX7U; shshshsID=86f31dd02161606d1c9bf211a7b066fd_1_1593415724113; __jda=122270672.835732507.1593410309.1593410310.1593415724.2; __jdb=122270672.1.835732507|2.1593415724; 
JSESSIONID=F5D0DC3E7CDA9CFFAA42F672B5826835.s1; jwotest_product=99", 6 | "Host": "club.jd.com", 7 | "Referer": "https://item.jd.com/12611705.html", 8 | "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36" 9 | } 10 | # html = requests.get(url,headers=headers) 11 | # json_decode = html.text 12 | # start = json_decode.find('{"CommentsCount":') 13 | # end = json_decode.find('PoorRateStyle":1}]}')+len('PoorRateStyle":1}]}') 14 | # results = json.loads(json_decode[start:end]) 15 | # for result in results['CommentsCount']: 16 | # print(result['CommentCount']) 17 | url2= 'https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=12611705&score=0&sortType=5&page=1&pageSize=10&isShadowSku=0&rid=0&fold=1' 18 | html = requests.get(url2,headers= headers) 19 | json_decode = html.text 20 | start = json_decode.find('{"productAttr"') 21 | end = json_decode.find('"afterDays":0}]}')+len('"afterDays":0}]}') 22 | results = json.loads(json_decode[start:end]) 23 | for result in results['comments']: 24 | print(result["content"]) 25 | 26 | -------------------------------------------------------------------------------- /抖音爬取签名,喜欢列表和关注列表/爬取抖音的用户信息和视频连接/__pycache__/shujuku.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/13060923171/Crawl-Project/320d69b1d9051db6f43836d7d8854fa5af9d56fb/抖音爬取签名,喜欢列表和关注列表/爬取抖音的用户信息和视频连接/__pycache__/shujuku.cpython-37.pyc -------------------------------------------------------------------------------- /抖音爬取签名,喜欢列表和关注列表/爬取抖音的用户信息和视频连接/__pycache__/xgorgon.cpython-37.pyc: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/13060923171/Crawl-Project/320d69b1d9051db6f43836d7d8854fa5af9d56fb/抖音爬取签名,喜欢列表和关注列表/爬取抖音的用户信息和视频连接/__pycache__/xgorgon.cpython-37.pyc -------------------------------------------------------------------------------- /抖音爬取签名,喜欢列表和关注列表/爬取抖音的用户信息和视频连接/ceshi.py: -------------------------------------------------------------------------------- 1 | from urllib import parse 2 | from xgorgon import douyin_xgorgon 3 | import requests 4 | import re 5 | import time 6 | cookies = "sessionid=" 7 | xtttoken = "" 8 | 9 | device_dict ={'iid': '', 10 | 'device_id': '', 11 | 'openudid': '', 12 | 'uuid': '', 13 | 'cdid': '' 14 | } 15 | 16 | def change_params(url,device_dict=None): 17 | params_item = {} 18 | lot_url = url.split('?')[0]+'?' 
19 | for i in parse.urlparse(url).query.split('&'): 20 | k = i.split('=')[0] 21 | try: 22 | params_item[k] = i.split('=')[1] 23 | except: 24 | params_item[k] = None 25 | if device_dict: 26 | params_item['openudid'] = device_dict['openudid'] 27 | params_item['iid']= device_dict['iid'] 28 | params_item['device_id']= device_dict['device_id'] 29 | params_item['uuid']= device_dict['uuid'] 30 | params_item['cdid']= device_dict['cdid'] 31 | new_url = lot_url+parse.unquote_plus(parse.urlencode(params_item)) 32 | return new_url 33 | 34 | 35 | def get(url,proxies=None): 36 | headers = douyin_xgorgon(url=url,cookies=cookies,xtttoken=xtttoken) 37 | doc = requests.get(url, headers=headers,proxies=proxies).json() 38 | return doc 39 | 40 | def user_name(keyword): 41 | url = 'https://search-hl.amemv.com/aweme/v1/discover/search/?ts=1596089594&_rticket=1596089584130&os_api=23&device_platform=android&device_type=MI%205s&iid=3535618201363773&version_code=100400&app_name=aweme&openudid=68d42b816654c06d&device_id=2638416712040606&os_version=6.0.1&aid=1128&channel=tengxun_new&ssmix=a&manifest_version_code=100401&dpi=270&cdid=6beadddd-ede3-4fc1-99f0-d351d4c76445&version_name=10.4.0&resolution=810*1440&language=zh&device_brand=Xiaomi&app_type=normal&ac=wifi&update_version_code=10409900&uuid=350000000060778' 42 | url = change_params(url, device_dict) 43 | headers = douyin_xgorgon(url=url, cookies=cookies, xtttoken=xtttoken) 44 | data = { 45 | 'cursor':0, 46 | 'keyword':keyword, 47 | 'count': 1, 48 | 'hot_search': 0, 49 | 'is_pull_refresh': 1, 50 | 'search_source': None, 51 | 'search_id': None, 52 | 'type': 1, 53 | 'query_correct_type': 1 54 | } 55 | requests.packages.urllib3.disable_warnings() 56 | doc = requests.post(url, headers=headers,data=data,verify=False).text 57 | sec_uids = re.compile('"sec_uid":"(.*?)"',re.I|re.S) 58 | sec_uid = sec_uids.findall(doc) 59 | user_ids = re.compile('"uid":"(.*?)"', re.I | re.S) 60 | user_id= user_ids.findall(doc) 61 | try: 62 | for i in range(len(sec_uid)): 63 | uid = sec_uid[i] 64 | id = user_id[i] 65 | time.sleep(1) 66 | user_list(id,uid) 67 | count_list(uid) 68 | except: 69 | pass 70 | 71 | 72 | def user_list(id,uid): 73 | url = 'https://api3-normal-c-lf.amemv.com/aweme/v1/user/following/list/?user_id={}&sec_user_id={}' \ 74 | '&max_time=1595739196&count=20&offset=0' \ 75 | '&source_type=1&address_book_access=2&gps_access=1&vcd_count=0&vcd_auth_first_time=0&ts=1595737080&cpu_support64=false&storage_type=2' \ 76 | '&host_abi=armeabi-v7a&_rticket=1595737080341&mac_address=F4%3A09%3AD8%3A33%3AEE%3A9A&mcc_mnc=46001&os_api=23' \ 77 | '&device_platform=android&device_type=SM-G9008V&iid=2339350999993501&version_code=110800&app_name=aweme&openudid=c5c0babc0b33a19b' \ 78 | '&device_id=2743971277974349&os_version=6.0.1&aid=1128&channel=douyin-huidu-guanwang-control1&ssmix=a&manifest_version_code=110801&dpi=480&cdid=92d6111d-fa05-4987-a2bf-13b22d7caec2' \ 79 | '&version_name=11.8.0&resolution=1080*1920&language=zh&device_brand=samsung&app_type=normal&ac=wifi&update_version_code=11809900&uuid=866174600901389'.format(id,uid) 80 | url = change_params(url, device_dict) 81 | headers = douyin_xgorgon(url=url, cookies=cookies, xtttoken=xtttoken) 82 | requests.packages.urllib3.disable_warnings() 83 | doc = requests.post(url, headers=headers,verify=False).json() 84 | try: 85 | total = doc["total"] 86 | for i in range(total): 87 | uid = doc["followings"][i]["uid"] 88 | print(uid) 89 | guangzhu_uid(uid) 90 | except: 91 | pass 92 | 93 | def count_list(uid): 94 | url_base = 
'https://api3-normal-c-lf.amemv.com/aweme/v1/aweme/favorite/?invalid_item_count=0&' \ 95 | 'is_hiding_invalid_item=0&max_cursor=0&' \ 96 | 'sec_user_id={}&count=20&os_api=22&device_type=MI%209&ssmix=a&manifest_version_code=110801&' \ 97 | 'dpi=320&uuid=866174600901389&app_name=aweme&version_name=11.8.0&ts=1596114855&cpu_support64=false&' \ 98 | 'storage_type=0&app_type=normal&ac=wifi&host_abi=armeabi-v7a&update_version_code=11809900&channel=tengxun_new&' \ 99 | '_rticket=1596114842311&device_platform=android&iid=2339350999993501&version_code=110800&' \ 100 | 'mac_address=80%3AC5%3AF2%3A70%3A8A%3A3B&cdid=92d6111d-fa05-4987-a2bf-13b22d7caec2&' \ 101 | 'openudid=c5c0babc0b33a19b&device_id=2743971277974349&resolution=1600*900&os_version=5.1.1&language=zh&' \ 102 | 'device_brand=Xiaomi&aid=1128&mcc_mnc=46007&os_api=23' \ 103 | '&device_platform=android&device_type=SM-G9008V&iid=2339350999993501&version_code=110800&app_name=aweme&' \ 104 | 'openudid=c5c0babc0b33a19b' \ 105 | '&device_id=2743971277974349&os_version=6.0.1&aid=1128&channel=douyin-huidu-guanwang-control1&' \ 106 | 'ssmix=a&manifest_version_code=110801&dpi=480&cdid=92d6111d-fa05-4987-a2bf-13b22d7caec2' \ 107 | '&version_name=11.8.0&resolution=1080*1920&language=zh&device_brand=samsung&app_type=normal&ac=wifi&' \ 108 | 'update_version_code=11809900&uuid=866174600901389'.format(uid) 109 | 110 | page = 0 111 | while 1: 112 | url = change_params(url_base.replace('max_cursor=0','max_cursor={}'.format(page)), device_dict) 113 | headers = douyin_xgorgon(url=url, cookies=cookies, xtttoken=xtttoken) 114 | requests.packages.urllib3.disable_warnings() 115 | doc = requests.post(url, headers=headers, verify=False).json() 116 | 117 | if doc['has_more'] !=1: 118 | print("没有下一页了") 119 | break 120 | if len(doc['aweme_list']) == 0: 121 | raise ("aweme_list Error") 122 | 123 | page = doc['max_cursor'] 124 | time.sleep(1) 125 | try: 126 | for i in range(20): 127 | uid = doc['aweme_list'][i]['author']['uid'] 128 | print(uid) 129 | xihuan_uid(uid) 130 | except: 131 | pass 132 | continue 133 | 134 | def xihuan_uid(uid): 135 | with open("喜欢列表.txt","a+")as f: 136 | f.write(uid+"\n") 137 | print("写入成功") 138 | 139 | def guangzhu_uid(uid): 140 | with open("关注列表.txt","a+")as f: 141 | f.write(uid+"\n") 142 | print("写入成功") 143 | 144 | if __name__ == '__main__': 145 | while True: 146 | user_name("dy6i3fk5dhj4") 147 | 148 | 149 | -------------------------------------------------------------------------------- /抖音爬取签名,喜欢列表和关注列表/爬取抖音的用户信息和视频连接/douying.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import urllib3 3 | ''' 4 | GET https://api3-core-c-hl.amemv.com/aweme/v1/aweme/post/?source=0&publish_video_strategy_type=0&max_cursor=1587528101000&sec_user_id=MS4wLjABAAAA4s3jerVDPUA_xvyoGhRypnn8ijAtUfrt9rCWL2aXxtU&count=10&ts=1587635299&host_abi=armeabi-v7a&_rticket=1587635299508&mcc_mnc=46007& HTTP/1.1 5 | Host: api3-core-c-hl.amemv.com 6 | Connection: keep-alive 7 | Cookie: odin_tt=fab0188042f9c0722c90b1fbaf5233d30ddb78a41267bacbfc7c1fb216d37344df795f4e08e975d557d0c274b1c761da039574e4eceaae4a8441f72167d64afb 8 | X-SS-REQ-TICKET: 1587635299505 9 | sdk-version: 1 10 | X-SS-DP: 1128 11 | x-tt-trace-id: 00-a67026290de17aa15402ce8ee4a90468-a67026290de17aa1-01 12 | User-Agent: com.ss.android.ugc.aweme/100801 (Linux; U; Android 5.1.1; zh_CN; MI 9; Build/NMF26X; Cronet/TTNetVersion:8109b77c 2020-04-15 QuicVersion:0144d358 2020-03-24) 13 | X-Gorgon: 0404c0d100004fe124c18b36d03baf0768c181e105b1af5e8167 14 | X-Khronos: 
1587635299 15 | x-common-params-v2: os_api=22&device_platform=android&device_type=MI%209&iid=78795828897640&version_code=100800&app_name=aweme&openudid=80c5f2708a3b6304&device_id=3966668942355688&os_version=5.1.1&aid=1128&channel=tengxun_new&ssmix=a&manifest_version_code=100801&dpi=320&cdid=e390170c-0cb5-42ad-8bf6-d25dc4c7e3a3&version_name=10.8.0&resolution=900*1600&language=zh&device_brand=Xiaomi&app_type=normal&ac=wifi&update_version_code=10809900&uuid=863254643501389 16 | 17 | 18 | ''' 19 | 20 | 21 | # 下载视频代码,创建一个文件夹来存放抖音的视频 22 | def download_video(url, title): 23 | with open("{}.mp4".format(title), "wb") as f: 24 | f.write(requests.get(url).content) 25 | print("下载视频{}完毕".format(title)) 26 | 27 | #怎么去爬取APP里面的视频 28 | def get_video(): 29 | #通过我们的fiddler这个抓包工具来获取我们想要爬取某个账户里面全部视频的URL 30 | url = "GET https://api3-core-c-lf.amemv.com/aweme/v1/aweme/post/?source=0&publish_video_strategy_type=0&max_cursor=1590752981000&sec_user_id=MS4wLjABAAAAcXW9VYbv07hczERdiLoQil_TRW6GbwWc_BuRU1pczaCq9GQavlvKFhl_qIqE4yZ6&count=10&ts=1594477988&cpu_support64=false&storage_type=0&host_abi=armeabi-v7a&_rticket=1594477986155&mac_address=80%3AC5%3AF2%3A70%3A8A%3A3B&mcc_mnc=46007& HTTP/1.1" 31 | #构建我们的headers,这些对应的数据都是通过我们的fiddler获取的 32 | headers = { 33 | 'Host': 'api3-core-c-lf.amemv.com', 34 | 'Connection': 'keep-alive', 35 | 'Cookie': 'install_id=2339350999993501; ttreq=1$7a4d72914f4cef66e2e2ff13b5dc74a9ab180c06; passport_csrf_token=a4f3fb89f64b4fa8c707293c951c0c17; d_ticket=19b0a970bd0b508bdde6a5128f580f540c2d6; odin_tt=c3c9b378984696b77432b71b951c0e34a773411cce385120c69196cc6529b214c7d5c8716d1fc6f4cc2cb701d61a48b4; sid_guard=fdbd63a338be8acb4a08a1621c41fea6%7C1594464835%7C5184000%7CWed%2C+09-Sep-2020+10%3A53%3A55+GMT; uid_tt=760bb76af4748dcf85a4a0c5a9c5b146; uid_tt_ss=760bb76af4748dcf85a4a0c5a9c5b146; sid_tt=fdbd63a338be8acb4a08a1621c41fea6; sessionid=fdbd63a338be8acb4a08a1621c41fea6; sessionid_ss=fdbd63a338be8acb4a08a1621c41fea6', 36 | 'X-SS-REQ-TICKET': '1594464868804', 37 | 'passport-sdk-version': '17', 38 | 'X-Tt-Token': '00fdbd63a338be8acb4a08a1621c41fea6c5165e3a78a6e6e8bad4d8602a9fba4f29f111b5425b14f07ecf6df18c6b940518', 39 | 'sdk-version': '2', 40 | 'X-SS-DP': '1128', 41 | 'x-tt-trace-id': '00-3d831b3e0d9bfa0994c2b4de0dc30468-3d831b3e0d9bfa09-01', 42 | 'User-Agent': 'com.ss.android.ugc.aweme/110801 (Linux; U; Android 5.1.1; zh_CN; OPPO R11 Plus; Build/NMF26X; Cronet/TTNetVersion:71e8fd11 2020-06-10 QuicVersion:7aee791b 2020-06-05)', 43 | 'Accept-Encoding': 'gzip, deflate', 44 | 'X-Gorgon': '0404d8954001fffd06f451b46c120f09798b487f8c591c2f6bce', 45 | 'X-Khronos': '1594464868', 46 | 'x-common-params-v2': 'os_api=22&device_platform=android&device_type=OPPO%20R11%20Plus&iid=2339350999993501&version_code=110800&app_name=aweme&openudid=c5c0babc0b33a19b&device_id=2743971277974349&os_version=5.1.1&aid=1128&channel=tengxun_new&ssmix=a&manifest_version_code=110801&dpi=320&cdid=92d6111d-fa05-4987-a2bf-13b22d7caec2&version_name=11.8.0&resolution=900*1600&language=zh&device_brand=OPPO&app_type=normal&ac=wifi&update_version_code=11809900&uuid=866174600901389', 47 | } 48 | 49 | #无视证书的请求 50 | requests.packages.urllib3.disable_warnings() 51 | html = requests.get(url, headers=headers, verify=False) 52 | #把数据用json来全部获取下来 53 | json_data = html.json()["aweme_list"] 54 | #循环叠带我们的数据,把它们一一展示出来 55 | for j in json_data: 56 | title = j['desc'] 57 | print(title) 58 | print(j['video']['play_addr']['url_list'][0]) 59 | #把最后每个视频对应的URL打印出来,再根据我们的下载函数,把它们全部下载到自己的电脑里面 60 | download_video(j['video']['play_addr']['url_list'][0], title) 61 
| 62 | 63 | if __name__ == '__main__': 64 | get_video() -------------------------------------------------------------------------------- /抖音爬取签名,喜欢列表和关注列表/爬取抖音的用户信息和视频连接/main.py: -------------------------------------------------------------------------------- 1 | from urllib import parse 2 | from xgorgon import douyin_xgorgon 3 | import requests 4 | import re 5 | 6 | cookies = "sessionid=" 7 | xtttoken = "" 8 | 9 | device_dict ={'iid': '', 10 | 'device_id': '', 11 | 'openudid': '', 12 | 'uuid': '', 13 | 'cdid': '' 14 | } 15 | 16 | def change_params(url,device_dict=None): 17 | params_item = {} 18 | lot_url = url.split('?')[0]+'?' 19 | for i in parse.urlparse(url).query.split('&'): 20 | k = i.split('=')[0] 21 | try: 22 | params_item[k] = i.split('=')[1] 23 | except: 24 | params_item[k] = None 25 | if device_dict: 26 | params_item['openudid'] = device_dict['openudid'] 27 | params_item['iid']= device_dict['iid'] 28 | params_item['device_id']= device_dict['device_id'] 29 | params_item['uuid']= device_dict['uuid'] 30 | params_item['cdid']= device_dict['cdid'] 31 | new_url = lot_url+parse.unquote_plus(parse.urlencode(params_item)) 32 | return new_url 33 | 34 | def get(url,proxies=None): 35 | headers = douyin_xgorgon(url=url,cookies=cookies,xtttoken=xtttoken) 36 | doc = requests.get(url, headers=headers,proxies=proxies, verify=False).json() 37 | return doc 38 | 39 | 40 | device_dict ={'iid': '3729134641503981', 41 | 'device_id': '2743971277974349', 42 | 'openudid': 'c5c0babc0b33a19b', 43 | 'uuid': '866174600901389', 44 | 'cdid': 'a4ff527f-e409-47ce-ae32-59c555cdd653' 45 | } 46 | 47 | 48 | 49 | '''搜索用户列表''' 50 | def search_user(keyword,cursor): 51 | """ 52 | 搜索用户信息 53 | keyword: 关键词 54 | :return: response->json 55 | """ 56 | url ='https://search-hl.amemv.com/aweme/v1/discover/search/?ts=1594792387&_rticket=1594187269781&os_api=23&device_platform=android&device_type=MI%205s&iid=3729134641503981&version_code=100400&app_name=aweme&openudid=c5c0babc0b33a19b&device_id=2743971277974349&os_version=6.0.1&aid=1128&channel=tengxun_new&ssmix=a&manifest_version_code=100401&dpi=270&cdid=a4ff527f-e409-47ce-ae32-59c555cdd653&version_name=10.4.0&resolution=810*1440&language=zh&device_brand=Xiaomi&app_type=normal&ac=wifi&update_version_code=10409900&uuid=866174600901389' 57 | url = change_params(url) 58 | headers = douyin_xgorgon(url=url,cookies=cookies,xtttoken=xtttoken) 59 | data = { 60 | 'cursor': cursor, 61 | 'keyword':keyword, 62 | 'count': 10, 63 | 'hot_search': 0, 64 | 'is_pull_refresh': 1, 65 | 'search_source': None, 66 | 'search_id':None, 67 | 'type':1, 68 | 'query_correct_type': 1 69 | } 70 | requests.packages.urllib3.disable_warnings() 71 | doc = requests.post(url, headers=headers,data=data, verify=False).text 72 | print(doc) 73 | # sec_uid = re.compile('"sec_uid":"(.*?)"',re.I | re.S) 74 | # sec_uids = sec_uid.findall(doc) 75 | # has_more = re.compile('"has_more":(.*?),', re.I | re.S) 76 | # result = has_more.findall(doc) 77 | # for i in sec_uids: 78 | # with open('shuju6.text','a+')as f: 79 | # f.write(i) 80 | # f.write('\n') 81 | # print('保存完毕') 82 | # print(i) 83 | # print(result) 84 | 85 | 86 | if __name__ == '__main__': 87 | # for i in range(0,1000,30): 88 | search_user('美妆',30) 89 | 90 | 91 | 92 | -------------------------------------------------------------------------------- /抖音爬取签名,喜欢列表和关注列表/爬取抖音的用户信息和视频连接/shujuku.py: -------------------------------------------------------------------------------- 1 | from sqlalchemy import create_engine 2 | from sqlalchemy import Column,Integer,String,Text 3 | 
from sqlalchemy.orm import sessionmaker,scoped_session 4 | from sqlalchemy.ext.declarative import declarative_base 5 | 6 | Base = declarative_base() 7 | engine = create_engine( 8 | "mysql+pymysql://root:root@127.0.0.1:3306/test?charset=utf8mb4", 9 | #超过连接池大小外最多可以创建的连接数 10 | max_overflow=500, 11 | #连接池的大小 12 | pool_size=100, 13 | #是否显示开发信息 14 | echo=False, 15 | ) 16 | 17 | class Tik(Base): 18 | __tablename__ = 'tik' 19 | id = Column(Integer, primary_key=True, autoincrement=True) 20 | name = Column(String(40)) 21 | user_id = Column(String(40)) 22 | intro = Column(Text()) 23 | fans = Column(String(50)) 24 | 25 | Base.metadata.create_all(engine) 26 | session = sessionmaker(engine) 27 | sess = scoped_session(session) -------------------------------------------------------------------------------- /抖音爬取签名,喜欢列表和关注列表/爬取抖音的用户信息和视频连接/wangyexingxi.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import re 3 | from shujuku import sess,Tik 4 | from concurrent.futures import ThreadPoolExecutor 5 | 6 | 7 | list = [] 8 | with open('shuju6.text', 'r')as f: 9 | contents = f.readlines() 10 | for c in contents: 11 | content = c.strip() 12 | list.append(content) 13 | headers = { 14 | 'accept': 'application/json', 15 | 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36', 16 | } 17 | 18 | 19 | def get_html(url): 20 | html = requests.get(url,headers= headers) 21 | contents = html.text 22 | #抖音个人简介 23 | signature = re.compile('signature":"(.*?)"',re.I | re.S) 24 | intro = signature.findall(contents) 25 | #抖音用户的名字 26 | nickname = re.compile('"nickname":"(.*?)"',re.I | re.S) 27 | name = nickname.findall(contents) 28 | #抖音上面的粉丝数量 29 | follower_count = re.compile('"follower_count":(.*?),',re.I | re.S) 30 | fans = follower_count.findall(contents) 31 | #抖音ID 32 | unique_id = re.compile('"unique_id":"(.*?)"',re.I | re.S) 33 | ID = unique_id.findall(contents) 34 | print('用户名称:{}\n用户ID:{}\n个人简介:{}\n粉丝数量:{}\n'.format(name,ID,intro,fans)) 35 | try: 36 | tik = Tik( 37 | name = name, 38 | user_id = ID, 39 | intro = intro, 40 | fans = fans 41 | ) 42 | sess.add(tik) 43 | sess.commit() 44 | print('commit') 45 | except Exception as e: 46 | print("rollback",e) 47 | sess.rollback() 48 | 49 | if __name__ == '__main__': 50 | count = 0 51 | for uid in list: 52 | url = 'https://www.iesdouyin.com/web/api/v2/user/info/?' 
\ 53 | 'sec_uid={}'.format(uid) 54 | with ThreadPoolExecutor(max_workers=10)as e: 55 | futures = [e.submit(get_html,url)] 56 | count += 1 57 | print(count) 58 | sess.close() -------------------------------------------------------------------------------- /抖音爬取签名,喜欢列表和关注列表/爬取抖音的用户信息和视频连接/xgorgon.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | import hashlib 4 | from urllib import request 5 | import time 6 | import gzip 7 | 8 | 9 | byteTable1 ="D6 28 3B 71 70 76 BE 1B A4 FE 19 57 5E 6C BC 21 B2 14 37 7D 8C A2 FA 67 55 6A 95 E3 FA 67 78 ED 8E 55 33 89 A8 CE 36 B3 5C D6 B2 6F 96 C4 34 B9 6A EC 34 95 C4 FA 72 FF B8 42 8D FB EC 70 F0 85 46 D8 B2 A1 E0 CE AE 4B 7D AE A4 87 CE E3 AC 51 55 C4 36 AD FC C4 EA 97 70 6A 85 37 6A C8 68 FA FE B0 33 B9 67 7E CE E3 CC 86 D6 9F 76 74 89 E9 DA 9C 78 C5 95 AA B0 34 B3 F2 7D B2 A2 ED E0 B5 B6 88 95 D1 51 D6 9E 7D D1 C8 F9 B7 70 CC 9C B6 92 C5 FA DD 9F 28 DA C7 E0 CA 95 B2 DA 34 97 CE 74 FA 37 E9 7D C4 A2 37 FB FA F1 CF AA 89 7D 55 AE 87 BC F5 E9 6A C4 68 C7 FA 76 85 14 D0 D0 E5 CE FF 19 D6 E5 D6 CC F1 F4 6C E9 E7 89 B2 B7 AE 28 89 BE 5E DC 87 6C F7 51 F2 67 78 AE B3 4B A2 B3 21 3B 55 F8 B3 76 B2 CF B3 B3 FF B3 5E 71 7D FA FC FF A8 7D FE D8 9C 1B C4 6A F9 88 B5 E5" 10 | 11 | def getXGon(url,stub,cookies): 12 | NULL_MD5_STRING = "00000000000000000000000000000000" 13 | sb="" 14 | if len(url)<1 : 15 | sb =NULL_MD5_STRING 16 | else: 17 | sb =encryption(url) 18 | if len(stub)<1: 19 | sb+=NULL_MD5_STRING 20 | else: 21 | sb+=stub 22 | if len(cookies)<1: 23 | sb+=NULL_MD5_STRING 24 | else: 25 | sb+=encryption(cookies) 26 | index = cookies.index("sessionid=") 27 | if index == -1: 28 | sb+=NULL_MD5_STRING 29 | else: 30 | sessionid = cookies[index+10:] 31 | if sessionid.__contains__(';'): 32 | endIndex = sessionid.index(';') 33 | sessionid = sessionid[:endIndex] 34 | sb+=encryption(sessionid) 35 | return sb 36 | 37 | 38 | 39 | def encryption(url): 40 | obj = hashlib.md5() 41 | obj.update(url.encode("UTF-8")) 42 | secret = obj.hexdigest() 43 | return secret.lower() 44 | 45 | 46 | 47 | def initialize(data): 48 | myhex = 0 49 | byteTable2 = byteTable1.split(" ") 50 | for i in range(len(data)): 51 | hex1 = 0 52 | if i==0: 53 | hex1= int(byteTable2[int(byteTable2[0],16)-1],16) 54 | byteTable2[i]=hex(hex1) 55 | # byteTable2[i] = Integer.toHexString(hex1); 56 | elif i==1: 57 | temp= int("D6",16)+int("28",16) 58 | if temp>256: 59 | temp-=256 60 | hex1 = int(byteTable2[temp-1],16) 61 | myhex = temp 62 | byteTable2[i] = hex(hex1) 63 | else: 64 | temp = myhex+int(byteTable2[i], 16) 65 | if temp > 256: 66 | temp -= 256 67 | hex1 = int(byteTable2[temp - 1], 16) 68 | myhex = temp 69 | byteTable2[i] = hex(hex1) 70 | if hex1*2>256: 71 | hex1 = hex1*2 - 256 72 | else: 73 | hex1 = hex1*2 74 | hex2 = byteTable2[hex1 - 1] 75 | result = int(hex2,16)^int(data[i],16) 76 | data[i] = hex(result) 77 | for i in range(len(data)): 78 | data[i] = data[i].replace("0x", "") 79 | return data 80 | 81 | 82 | 83 | def handle(data): 84 | for i in range(len(data)): 85 | byte1 = data[i] 86 | if len(byte1)<2: 87 | byte1+='0' 88 | else: 89 | byte1 = data[i][1] +data[i][0] 90 | if i1: 101 | byte3 = byte3[1] +byte3[0] 102 | else: 103 | byte3+="0" 104 | byte4 = int(byte3,16)^int("FF",16) 105 | byte4 = byte4 ^ int("14",16) 106 | data[i] = hex(byte4).replace("0x","") 107 | return data 108 | 109 | 110 | 111 | def xGorgon(timeMillis,inputBytes): 112 | data1 = [] 113 | data1.append("3") 114 | data1.append("61") 115 | data1.append("41") 116 | 
data1.append("10") 117 | data1.append("80") 118 | data1.append("0") 119 | data2 = input(timeMillis,inputBytes) 120 | data2 = initialize(data2) 121 | data2 = handle(data2) 122 | for i in range(len(data2)): 123 | data1.append(data2[i]) 124 | 125 | xGorgonStr = "" 126 | for i in range(len(data1)): 127 | temp = data1[i]+"" 128 | if len(temp)>1: 129 | xGorgonStr += temp 130 | else: 131 | xGorgonStr +="0" 132 | xGorgonStr+=temp 133 | return xGorgonStr 134 | 135 | 136 | 137 | def input(timeMillis,inputBytes): 138 | result = [] 139 | for i in range(4): 140 | if inputBytes[i]<0: 141 | temp = hex(inputBytes[i])+'' 142 | temp = temp[6:] 143 | result.append(temp) 144 | else: 145 | temp = hex(inputBytes[i]) + '' 146 | result.append(temp) 147 | for i in range(4): 148 | result.append("0") 149 | for i in range(4): 150 | if inputBytes[i+32]<0: 151 | result.append(hex(inputBytes[i+32])+'')[6:] 152 | else: 153 | result.append(hex(inputBytes[i + 32]) + '') 154 | for i in range(4): 155 | result.append("0") 156 | tempByte = hex(int(timeMillis))+"" 157 | tempByte = tempByte.replace("0x","") 158 | for i in range(4): 159 | a = tempByte[i * 2:2 * i + 2] 160 | result.append(tempByte[i*2:2*i+2]) 161 | for i in range(len(result)): 162 | result[i] = result[i].replace("0x","") 163 | return result 164 | 165 | 166 | 167 | def strToByte(str): 168 | length = len(str) 169 | str2 = str 170 | bArr =[] 171 | i=0 172 | while i < length: 173 | # bArr[i/2] = b'\xff\xff\xff'+(str2hex(str2[i]) << 4+str2hex(str2[i+1])).to_bytes(1, "big") 174 | a = str2[i] 175 | b = str2[1+i] 176 | c = ((str2hex(a) << 4)+str2hex(b)) 177 | bArr.append(c) 178 | i+=2 179 | return bArr 180 | 181 | 182 | 183 | def str2hex(s): 184 | odata = 0 185 | su =s.upper() 186 | for c in su: 187 | tmp=ord(c) 188 | if tmp <= ord('9') : 189 | odata = odata << 4 190 | odata += tmp - ord('0') 191 | elif ord('A') <= tmp <= ord('F'): 192 | odata = odata << 4 193 | odata += tmp - ord('A') + 10 194 | return odata 195 | 196 | 197 | def doGetGzip(url,headers,charset): 198 | req = request.Request(url) 199 | for key in headers: 200 | req.add_header(key,headers[key]) 201 | with request.urlopen(req) as f: 202 | data = f.read() 203 | return gzip.decompress(data).decode() 204 | 205 | 206 | 207 | 208 | def douyin_xgorgon(url,cookies,xtttoken): 209 | ts = str(time.time()).split(".")[0] 210 | _rticket = str(time.time() * 1000).split(".")[0] 211 | params = url[url.index('?')+1:] 212 | STUB = "" 213 | s = getXGon(params,STUB,cookies) 214 | gorgon = xGorgon(ts,strToByte(s)) 215 | 216 | headers={ 217 | "X-Gorgon":gorgon, 218 | "X-Khronos": ts, 219 | "sdk-version":"1", 220 | "Cookie": cookies, 221 | "Accept-Encoding": "gzip", 222 | "X-SS-REQ-TICKET": _rticket, 223 | "Host": "aweme.snssdk.com", 224 | "Connection": "Keep-Alive", 225 | 'User-Agent': 'okhttp/3.10.0.1', 226 | "x-tt-token":xtttoken 227 | } 228 | 229 | return headers 230 | 231 | 232 | 233 | -------------------------------------------------------------------------------- /抖音爬取签名,喜欢列表和关注列表/爬取抖音的用户信息和视频连接/抖音里面的xg算法,因为一些原因就不适合开源了。.md: -------------------------------------------------------------------------------- 1 | # 抖音里面的xg算法,因为一些原因就不适合开源了。 -------------------------------------------------------------------------------- /最好大学网/daxue.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import re 3 | from bs4 import BeautifulSoup 4 | import time 5 | headers = { 6 | "Accept": 
"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9", 7 | "Cookie": "Hm_lvt_2ce94714199fe618dcebb5872c6def14=1594741637; Hm_lpvt_2ce94714199fe618dcebb5872c6def14=1594741768", 8 | "Host": "www.zuihaodaxue.cn", 9 | "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36" 10 | } 11 | session = requests.session() 12 | session.headers = headers 13 | 14 | def get_html9(url): 15 | html = session.get(url) 16 | #它的解码等于他当前的页面的解码,这样破解里面字体的反爬 17 | html.encoding = html.apparent_encoding 18 | if html.status_code == 200: 19 | content = html.text 20 | #用正则去定位排名 21 | rankings = re.compile('">(.*?)',re.I|re.S) 22 | ranking = rankings.findall(content) 23 | soup = BeautifulSoup(content,'lxml') 24 | list = [] 25 | for i in range(len(ranking)): 26 | #定位大学名称 27 | daxues = soup.select("td.align-left a")[i].text 28 | list.append(daxues) 29 | print(list) 30 | #定位大学排名 31 | states = re.compile('title="查看(.*?)大学排名">', re.I | re.S) 32 | state = states.findall(content) 33 | state_ranks = re.compile('(.*?)',re.I|re.S) 34 | state_rank = state_ranks.findall(content) 35 | grades = re.compile('\d+(.*?)(.*?)(.*?)',re.I|re.S) 54 | ranking = rankings.findall(content) 55 | soup = BeautifulSoup(content,'lxml') 56 | list = [] 57 | for i in range(len(ranking)): 58 | daxues = soup.select("td.align-left a")[i].text 59 | list.append(daxues) 60 | print(list) 61 | states = re.compile('title="查看(.*?)大学排名">', re.I | re.S) 62 | state = states.findall(content) 63 | state_ranks = re.compile('(.*?)',re.I|re.S) 64 | state_rank = state_ranks.findall(content) 65 | grades = re.compile('\d+(.*?)(.*?)(.*?)',re.I|re.S) 84 | ranking = rankings.findall(content) 85 | soup = BeautifulSoup(content,'lxml') 86 | list = [] 87 | for i in range(len(ranking)): 88 | daxues = soup.select("td.align-left a")[i].text 89 | list.append(daxues) 90 | print(list) 91 | states = re.compile('title="查看(.*?)大学排名">', re.I | re.S) 92 | state = states.findall(content) 93 | state_ranks = re.compile('(.*?)',re.I|re.S) 94 | state_rank = state_ranks.findall(content) 95 | grades = re.compile('\d+(.*?)(.*?) 
3000: 130 | wb.save(file_name) 131 | return 132 | if house_id not in self.quchong[city_name]: 133 | # print(house_id, tupian, price, renttype, shiting, mianji, chaoxiang, xiaqu, jiedao, xiaoqu, jiaotong) 134 | print(f'正在爬取:{city_name}-->第{row_count}条租房信息', ) 135 | # 保存数据 136 | 137 | self.save_to_excel(ws, row_count, [self.today_str,city_name,tupian, price, renttype, shiting, mianji, chaoxiang, xiaqu, jiedao, xiaoqu,jiaotong,]) 138 | row_count += 1 139 | self.quchong[city_name].append(house_id) # 将爬取过的房子id放进去,用于去重 140 | else: 141 | print('已存在') 142 | if next_url: 143 | html = self.get_html(next_url) 144 | wb.save(file_name) 145 | 146 | def run_spider(self, city_url_list): 147 | for city_url in city_url_list: 148 | try: 149 | current_city_url = city_url 150 | html = self.get_html(city_url) 151 | print(city_url) 152 | city_name = re.findall(re.compile('class="s4Box">(.*?)'), html)[0] # 获取城市名 153 | self.quchong[city_name] = [] # 构建{'城市名': [租房1,2,3,4,]}用于去重 154 | self.parse(current_city_url, html, city_name) 155 | except: 156 | pass 157 | 158 | # 数组拆分 (将一个大元组拆分多个小元组,用于多线程任务分配) 159 | def div_list(self, ls, n): 160 | result = [] 161 | cut = int(len(ls)/n) 162 | if cut == 0: 163 | ls = [[x] for x in ls] 164 | none_array = [[] for i in range(0, n-len(ls))] 165 | return ls+none_array 166 | for i in range(0, n-1): 167 | result.append(ls[cut*i:cut*(1+i)]) 168 | result.append(ls[cut*(n-1):len(ls)]) 169 | return result 170 | 171 | def save_to_excel(self, ws, row_count, data): 172 | for index, value in enumerate(data): 173 | ws.cell(row=row_count+1, column=index + 1, value=value) # openpyxl 是以1,开始第一行,第一列 174 | 175 | if __name__ == '__main__': 176 | options = webdriver.ChromeOptions() 177 | options.add_experimental_option('excludeSwitches', ['enable-automation']) 178 | # options.add_argument('--headless') 179 | browser = webdriver.Chrome(options=options,executable_path='chromedriver.exe') 180 | wait = WebDriverWait(browser, 10) 181 | spider = FTXSpider() 182 | if not os.path.exists(f'租房{spider.today_str}'): 183 | os.mkdir(f'租房{spider.today_str}') 184 | pool = ThreadPool(1) # 创建一个包含5个线程的线程池 185 | pool.map(spider.run_spider, spider.div_list(spider.start_urls, 1)) 186 | pool.close() # 关闭线程池的写入 187 | pool.join() # 阻塞,保证子线程运行完毕后再继续主进程 188 | 189 | 190 | 191 | # 单线程 192 | # for city_url in spider.start_urls: 193 | # spider.run_spider([city_url]) 194 | 195 | -------------------------------------------------------------------------------- /爬取房天下的框架/爬取房天下的信息/code/chromedriver.exe: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/13060923171/Crawl-Project/320d69b1d9051db6f43836d7d8854fa5af9d56fb/爬取房天下的框架/爬取房天下的信息/code/chromedriver.exe -------------------------------------------------------------------------------- /爬取房天下的框架/爬取房天下的信息/code/zufangcitymatch.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/13060923171/Crawl-Project/320d69b1d9051db6f43836d7d8854fa5af9d56fb/爬取房天下的框架/爬取房天下的信息/code/zufangcitymatch.xlsx -------------------------------------------------------------------------------- /爬取房天下的框架/爬取房天下的信息/code/租房2020-10-10/包头2020-10-10房天下租房.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/13060923171/Crawl-Project/320d69b1d9051db6f43836d7d8854fa5af9d56fb/爬取房天下的框架/爬取房天下的信息/code/租房2020-10-10/包头2020-10-10房天下租房.xlsx -------------------------------------------------------------------------------- 
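
The FTXSpider code above splits its list of city URLs into chunks with `div_list()` and hands one chunk to each worker via `pool.map()`. Below is a minimal, self-contained sketch of that pattern; it assumes `ThreadPool` is `multiprocessing.dummy.Pool` (the import is not visible in the excerpt above) and uses placeholder URLs rather than real city pages.

```python
# Sketch of the div_list() + ThreadPool.map() task-splitting pattern used by
# FTXSpider above. URLs are placeholders; ThreadPool is assumed to be
# multiprocessing.dummy.Pool, which runs the mapped function in threads.
from multiprocessing.dummy import Pool as ThreadPool


def div_list(ls, n):
    """Split `ls` into `n` roughly equal chunks, padding with empty lists."""
    cut = len(ls) // n
    if cut == 0:
        wrapped = [[x] for x in ls]
        return wrapped + [[] for _ in range(n - len(wrapped))]
    result = [ls[cut * i:cut * (i + 1)] for i in range(n - 1)]
    result.append(ls[cut * (n - 1):])
    return result


def crawl_chunk(url_chunk):
    # Stand-in for spider.run_spider(city_url_list): just report the URLs.
    for url in url_chunk:
        print("crawling", url)


if __name__ == '__main__':
    city_urls = ['https://example.com/city/{}'.format(i) for i in range(7)]
    pool = ThreadPool(3)                           # three worker threads
    pool.map(crawl_chunk, div_list(city_urls, 3))  # one chunk per worker
    pool.close()                                   # stop accepting new tasks
    pool.join()                                    # wait for all threads to finish
```

Note that the original script runs with `ThreadPool(1)` and `div_list(spider.start_urls, 1)`, i.e. effectively single-threaded; raising the pool size and the number of chunks together is what actually parallelizes the per-city crawls.
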
/爬取房天下的框架/爬取房天下的信息/code/租房2020-10-10/北海2020-10-10房天下租房.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/13060923171/Crawl-Project/320d69b1d9051db6f43836d7d8854fa5af9d56fb/爬取房天下的框架/爬取房天下的信息/code/租房2020-10-10/北海2020-10-10房天下租房.xlsx -------------------------------------------------------------------------------- /爬取房天下的框架/爬取房天下的信息/code/租房2020-10-10/安庆2020-10-10房天下租房.xlsx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/13060923171/Crawl-Project/320d69b1d9051db6f43836d7d8854fa5af9d56fb/爬取房天下的框架/爬取房天下的信息/code/租房2020-10-10/安庆2020-10-10房天下租房.xlsx -------------------------------------------------------------------------------- /爬取房天下的框架/爬取房天下的信息/main.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/13060923171/Crawl-Project/320d69b1d9051db6f43836d7d8854fa5af9d56fb/爬取房天下的框架/爬取房天下的信息/main.py -------------------------------------------------------------------------------- /爬取抖音无水印视频/douying.py: -------------------------------------------------------------------------------- 1 | import requests 2 | 3 | ''' 4 | GET https://api3-core-c-hl.amemv.com/aweme/v1/aweme/post/?source=0&publish_video_strategy_type=0&max_cursor=1587528101000&sec_user_id=MS4wLjABAAAA4s3jerVDPUA_xvyoGhRypnn8ijAtUfrt9rCWL2aXxtU&count=10&ts=1587635299&host_abi=armeabi-v7a&_rticket=1587635299508&mcc_mnc=46007& HTTP/1.1 5 | Host: api3-core-c-hl.amemv.com 6 | Connection: keep-alive 7 | Cookie: odin_tt=fab0188042f9c0722c90b1fbaf5233d30ddb78a41267bacbfc7c1fb216d37344df795f4e08e975d557d0c274b1c761da039574e4eceaae4a8441f72167d64afb 8 | X-SS-REQ-TICKET: 1587635299505 9 | sdk-version: 1 10 | X-SS-DP: 1128 11 | x-tt-trace-id: 00-a67026290de17aa15402ce8ee4a90468-a67026290de17aa1-01 12 | User-Agent: com.ss.android.ugc.aweme/100801 (Linux; U; Android 5.1.1; zh_CN; MI 9; Build/NMF26X; Cronet/TTNetVersion:8109b77c 2020-04-15 QuicVersion:0144d358 2020-03-24) 13 | X-Gorgon: 0404c0d100004fe124c18b36d03baf0768c181e105b1af5e8167 14 | X-Khronos: 1587635299 15 | x-common-params-v2: os_api=22&device_platform=android&device_type=MI%209&iid=78795828897640&version_code=100800&app_name=aweme&openudid=80c5f2708a3b6304&device_id=3966668942355688&os_version=5.1.1&aid=1128&channel=tengxun_new&ssmix=a&manifest_version_code=100801&dpi=320&cdid=e390170c-0cb5-42ad-8bf6-d25dc4c7e3a3&version_name=10.8.0&resolution=900*1600&language=zh&device_brand=Xiaomi&app_type=normal&ac=wifi&update_version_code=10809900&uuid=863254643501389 16 | 17 | 18 | ''' 19 | 20 | 21 | # 下载视频代码,创建一个文件夹来存放抖音的视频 22 | def download_video(url, title): 23 | with open("{}.mp4".format(title), "wb") as f: 24 | f.write(requests.get(url).content) 25 | print("下载视频{}完毕".format(title)) 26 | 27 | #怎么去爬取APP里面的视频 28 | def get_video(): 29 | #通过我们的fiddler这个抓包工具来获取我们想要爬取某个账户里面全部视频的URL 30 | url = "https://api3-core-c-hl.amemv.com/aweme/v1/aweme/post/?source=0&publish_video_strategy_type=0&max_cursor=1587528101000&sec_user_id=MS4wLjABAAAA4s3jerVDPUA_xvyoGhRypnn8ijAtUfrt9rCWL2aXxtU&count=10&ts=1587635299&host_abi=armeabi-v7a&_rticket=1587635299508&mcc_mnc=46007&" 31 | #构建我们的headers,这些对应的数据都是通过我们的fiddler获取的 32 | headers = { 33 | 'Cookie': 'odin_tt=fab0188042f9c0722c90b1fbaf5233d30ddb78a41267bacbfc7c1fb216d37344df795f4e08e975d557d0c274b1c761da039574e4eceaae4a8441f72167d64afb', 34 | 'X-SS-REQ-TICKET': '1587635299505', 35 | 'sdk-version': '1', 36 | 'X-SS-DP': '1128', 37 | 'x-tt-trace-id': 
'00-a67026290de17aa15402ce8ee4a90468-a67026290de17aa1-01', 38 | 'User-Agent': 'com.ss.android.ugc.aweme/100801 (Linux; U; Android 5.1.1; zh_CN; MI 9; Build/NMF26X; Cronet/TTNetVersion:8109b77c 2020-04-15 QuicVersion:0144d358 2020-03-24)', 39 | 'X-Gorgon': '0404c0d100004fe124c18b36d03baf0768c181e105b1af5e8167', 40 | 'X-Khronos': '1587635299', 41 | 'x-common-params-v2': 'os_api=22&device_platform=android&device_type=MI%209&iid=78795828897640&version_code=100800&app_name=aweme&openudid=80c5f2708a3b6304&device_id=3966668942355688&os_version=5.1.1&aid=1128&channel=tengxun_new&ssmix=a&manifest_version_code=100801&dpi=320&cdid=e390170c-0cb5-42ad-8bf6-d25dc4c7e3a3&version_name=10.8.0&resolution=900*1600&language=zh&device_brand=Xiaomi&app_type=normal&ac=wifi&update_version_code=10809900&uuid=863254643501389' 42 | } 43 | #无视证书的请求 44 | html = requests.get(url, headers=headers, verify=False) 45 | #把数据用json来全部获取下来 46 | json_data = html.json()["aweme_list"] 47 | #循环叠带我们的数据,把它们一一展示出来 48 | for j in json_data: 49 | title = j['desc'] 50 | print(title) 51 | print(j['video']['play_addr']['url_list'][0]) 52 | #把最后每个视频对应的URL打印出来,再根据我们的下载函数,把它们全部下载到自己的电脑里面 53 | download_video(j['video']['play_addr']['url_list'][0], title) 54 | 55 | 56 | if __name__ == '__main__': 57 | get_video() -------------------------------------------------------------------------------- /爬取淘宝商品信息基于selenium框架/taobao.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import re 3 | from urllib import parse 4 | headers = { 5 | "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36", 6 | "referer": "https://tb.alicdn.com/snapshot/index.html", 7 | 'cookie': 't=884491259d4aed9aac3cd83e5798c433; cna=UU81Fxb46woCAWUv7c0BLoMd; sgcookie=ERElHyZEXq%2FBxbIAKkMLf; tracknick=%5Cu53F6%5Cu95EE%5Cu8C01%5Cu662F%5Cu8FB0%5Cu5357; _cc_=V32FPkk%2Fhw%3D%3D; enc=UvoaKN2E%2F5qKScgssIA7s34lg2c%2B7mFKY6bD58vrwGvLTZKDyYj7UQ0p3hGnXJK11f8JrZT5ky54YNi0i73Few%3D%3D; tfstk=cIOdBdvB3cmha_TF3QHGFR3VyY-dafFd2ys4w4-E6MTnQmN8NsxviIpfnv_Yv13O.; thw=cn; hng=CN%7Czh-CN%7CCNY%7C156; cookie2=1165897f57a1ed424d42db9d3a99ff7d; v=0; _tb_token_=77a6e3fa3eb98; alitrackid=tb.alicdn.com; lastalitrackid=tb.alicdn.com; JSESSIONID=42FB5C5D5D65C270436BAF43224830CB; isg=BPb2H7f2tUx9pkBnqiw8IaAaRyz4FzpR25dtfWDcO1mro5U9yaZ-YfUau3_PPzJp; l=eBTUSTCcQZnRM5Q_BO5alurza77TaQdf1nVzaNbMiInca6TFta8TVNQqOBKvSdtjgt5j2eKrb3kJjRhM8W4LRjkDBeYBRs5mpfpp8e1..', 8 | } 9 | 10 | keyword = input("请输入你要搜索的信息:") 11 | def get_parse(url): 12 | html = requests.get(url,headers= headers) 13 | if html.status_code ==200: 14 | print('页面正常') 15 | get_html(html) 16 | else: 17 | print(html.status_code) 18 | 19 | def get_html(html): 20 | #用正则表达式去获取商品的名称,价格,商家名称和商家位置 21 | content = html.text 22 | #定位商品名称 23 | names = re.compile('"raw_title":"(.*?)"', re.I | re.S) 24 | name = names.findall(content) 25 | #定位价格 26 | prices = re.compile('"view_price":"(.*?)"',re.I|re.S) 27 | price = prices.findall(content) 28 | #定位商家名称 29 | nicks = re.compile('"nick":"(.*?)"',re.I|re.S) 30 | nick = nicks.findall(content) 31 | #定位商家位置 32 | item_locs = re.compile('"item_loc":"(.*?)"', re.I | re.S) 33 | item_loc= item_locs.findall(content) 34 | #先算出爬出来正则的长度,从而确定循环,把商品的名称,价格,位置全部有序的全部打印出来 35 | for j in range(len(name)): 36 | print('商品名称:{}\n价格:{}\n商家名称:{}\n商家位置:{}\n'.format(name[j], price[j], nick[j], item_loc[j])) 37 | 38 | if __name__ == '__main__': 39 | for i in range(0,45,44): 40 | url = 
'https://s.taobao.com/search?q={}&imgfile=&commend=all&ssid=s5-e&' \ 41 | 'search_type=item&sourceId=tb.index&spm=a21bo.2017.201856-taobao-item.1&' \ 42 | 'ie=utf8&initiative_id=tbindexz_20170306&bcoffset=1&ntoffset=1&p4ppushleft=2%2C48&s={}'.format(parse.quote(keyword),i) 43 | get_parse(url) 44 | 45 | 46 | 47 | -------------------------------------------------------------------------------- /爬取淘宝商品信息基于selenium框架/taobaopachong.py: -------------------------------------------------------------------------------- 1 | from selenium import webdriver 2 | from selenium.common.exceptions import TimeoutException 3 | from selenium.webdriver.common.by import By 4 | from selenium.webdriver.support import expected_conditions as EC 5 | from selenium.webdriver.support.ui import WebDriverWait 6 | from urllib import parse 7 | import pandas as pd 8 | from pyquery import PyQuery 9 | import time 10 | import json 11 | import re 12 | #定义一个变量,最好用大写这个是约定俗成 13 | KEYWORD = '月饼' 14 | #定位Chromedriver这个工具的位置 15 | options = webdriver.ChromeOptions() 16 | options.add_experimental_option("prefs",{"profile.mamaged_default_content_settings.images":2}) 17 | options.add_experimental_option('excludeSwitches',['enable-automation']) 18 | browser = webdriver.Chrome(executable_path="C:\\Users\\96075\\Desktop\\全部资料\\Python\\爬虫\\chromedriver.exe",options=options) 19 | #设置等待时间 20 | wait = WebDriverWait(browser,10) 21 | url ='https://www.taobao.com/' 22 | 23 | def crawl_page(): 24 | try: 25 | browser.get(url)#获取网页 26 | #用xpath语法定位到输入框 27 | input = wait.until(EC.presence_of_element_located(( 28 | By.XPATH,'//*[@id="q"]' 29 | ))) 30 | #用xpath语法定位到搜索框 31 | button = wait.until(EC.element_to_be_clickable(( 32 | By.XPATH,'//*[@id="J_TSearchForm"]/div[1]/button' 33 | ))) 34 | input.send_keys(KEYWORD)#输入关键词 35 | button.click()#模拟鼠标点击 36 | #等到python爬虫页面的总页数加载出来 37 | total = wait.until(EC.presence_of_element_located(( 38 | By.XPATH,'//*[@id="mainsrp-pager"]/div/div/div/div[1]' 39 | ))).text 40 | # 发现总页数有逗号 41 | total = re.sub(r',|,','',total) 42 | #数据清洗,将共100页后面的逗号去掉,淘宝里的是大写的逗号 43 | print(total) 44 | totalnum = int(re.compile('(\d+)').search(total).group(1)) 45 | # 只取出100这个数字 46 | print("第1页:") 47 | # 获取数据 48 | get_products() 49 | #返回总的页数 50 | return totalnum 51 | except: 52 | crawl_page() 53 | 54 | def get_products(): 55 | list_price = [] 56 | list_title = [] 57 | list_deal = [] 58 | list_picture = [] 59 | # 用工具去爬取这个页面 60 | html = browser.page_source 61 | # 打印这个页面的信息 62 | doc = PyQuery(html) 63 | # 去定位获取我们需要的信息 64 | items = doc("#mainsrp-itemlist .items .item").items() 65 | for item in items: 66 | price = item.find(".price").text(), 67 | list_price.append(price) 68 | title = item.find(".title").text(), 69 | list_title.append(title) 70 | deal = item.find(".deal-cnt").text(), 71 | list_deal.append(deal) 72 | picture = parse.urljoin('http:', item.find(".img").attr("data-src")) 73 | list_picture.append(picture) 74 | df = pd.DataFrame() 75 | df["商品名称"] = list_title 76 | df["商品价格"] = list_price 77 | df["商品销量"] = list_deal 78 | df["图片链接"] = list_picture 79 | try: 80 | df.to_csv("商品的基本信息.csv", mode="a+", header=None, index=None, encoding="utf-8") 81 | print("写入成功") 82 | except: 83 | print("当页数据写入失败") 84 | 85 | def next_page(): 86 | # 获取总页数的值,并且调用search获取第一页数据 87 | totalnum = crawl_page() 88 | # 初始为1,因为我第一页已经获取过数据了 89 | num = 1 90 | # 首先进来的是第1页,共100页,所以只需要翻页99次 91 | while num != totalnum - 1: 92 | print("第%s页:" %str(num+1) ) 93 | # 用修改s属性的方式翻页 94 | browser.get('https://s.taobao.com/search?q={}&s={}'.format(KEYWORD,44 * num)) 95 | # 等待10秒 96 | 
browser.implicitly_wait(10) 97 | # 获取数据 98 | get_products() 99 | #延迟3秒 100 | time.sleep(3) 101 | # 自增 102 | num +=1 103 | 104 | #写一个循环函数,用于爬取多页信息的内容 105 | def main(): 106 | next_page() 107 | if __name__ == '__main__': 108 | main() -------------------------------------------------------------------------------- /爬取淘宝商品信息基于selenium框架/对数据进行清洗.py: -------------------------------------------------------------------------------- 1 | def clearBlankLine(): 2 | file1 = open(input("请输入要清洗的文本(包括后缀):"), 'r',encoding="utf-8") # 要去掉空行的文件 3 | file2 = open('清洗好的文本.txt', 'w', encoding='utf-8') # 生成没有空行的文件 4 | try: 5 | for line in file1.readlines(): 6 | line = line.replace("'","").replace('"',"").replace('[','').replace(']','').replace(",)","").replace("(","") 7 | file2.write(line) 8 | finally: 9 | file1.close() 10 | file2.close() 11 | 12 | 13 | if __name__ == '__main__': 14 | clearBlankLine() 15 | -------------------------------------------------------------------------------- /爬取淘宝商品信息基于selenium框架/爬取商品属性.py: -------------------------------------------------------------------------------- 1 | from selenium import webdriver 2 | from selenium.common.exceptions import TimeoutException 3 | from selenium.webdriver.common.by import By 4 | from selenium.webdriver.support import expected_conditions as EC 5 | from selenium.webdriver.support.ui import WebDriverWait 6 | from pyquery import PyQuery 7 | import time 8 | import json 9 | import re 10 | 11 | #定义一个变量,最好用大写这个是约定俗成 12 | KEYWORD = '月饼' 13 | #定位Chromedriver这个工具的位置 14 | options = webdriver.ChromeOptions() 15 | options.add_experimental_option("prefs",{"profile.mamaged_default_content_settings.images":2}) 16 | options.add_experimental_option('excludeSwitches',['enable-automation']) 17 | browser = webdriver.Chrome(executable_path="C:\\Users\\96075\\Desktop\\全部资料\\Python\\爬虫\\chromedriver.exe",options=options) 18 | #设置等待时间 19 | wait = WebDriverWait(browser,10) 20 | url ='https://www.taobao.com/' 21 | 22 | list_property = [] 23 | #先写一个去调动输入框定位一个商品类型的函数,并且获取总的页数 24 | def crawl_page(): 25 | try: 26 | browser.get(url)#获取网页 27 | #用xpath语法定位到输入框 28 | input = wait.until(EC.presence_of_element_located(( 29 | By.XPATH,'//*[@id="q"]' 30 | ))) 31 | #用xpath语法定位到搜索框 32 | button = wait.until(EC.element_to_be_clickable(( 33 | By.XPATH,'//*[@id="J_TSearchForm"]/div[1]/button' 34 | ))) 35 | input.send_keys(KEYWORD)#输入关键词 36 | button.click()#模拟鼠标点击 37 | #等到python爬虫页面的总页数加载出来 38 | total = wait.until(EC.presence_of_element_located(( 39 | By.XPATH,'//*[@id="mainsrp-pager"]/div/div/div/div[1]' 40 | ))).text 41 | # 发现总页数有逗号 42 | total = re.sub(r',|,','',total) 43 | #数据清洗,将共100页后面的逗号去掉,淘宝里的是大写的逗号 44 | print(total) 45 | totalnum = int(re.compile('(\d+)').search(total).group(1)) 46 | # 只取出100这个数字 47 | print("第1页:") 48 | # 获取数据 49 | get_products() 50 | #返回总的页数 51 | return totalnum 52 | except: 53 | crawl_page() 54 | #然后根据获取总的页数来获取每一页商品的内容和信息 55 | def next_page(): 56 | # 获取总页数的值,并且调用search获取第一页数据 57 | totalnum = crawl_page() 58 | # 初始为1,因为我第一页已经获取过数据了 59 | num = 1 60 | # 首先进来的是第1页,共100页,所以只需要翻页99次 61 | while num != totalnum - 1: 62 | print("第%s页:" %str(num+1) ) 63 | # 用修改s属性的方式翻页 64 | browser.get('https://s.taobao.com/search?q={}&s={}'.format(KEYWORD,44 * num)) 65 | # 等待10秒 66 | browser.implicitly_wait(10) 67 | # 获取数据 68 | get_products() 69 | #延迟3秒 70 | time.sleep(3) 71 | # 自增 72 | num +=1 73 | #再根据获取到的商品信息来定位来商品的链接从而进入商品的详细页面 74 | def get_products(): 75 | #用工具去爬取这个页面 76 | html = browser.page_source 77 | #打印这个页面的信息 78 | doc = PyQuery(html) 79 | #去定位获取我们需要的信息 80 | items = doc("#mainsrp-itemlist 
.items .item").items() 81 | for item in items: 82 | product ={ 83 | "href":item.find(".pic-link").attr("href"), 84 | } 85 | href = product["href"] 86 | url = "https:{}".format(href).replace("https:https:","https:") 87 | get_url(url) 88 | 89 | #获取到详细页面之后去获取每个商品信息的属性,并且把这些属性抓取下来 90 | def get_url(url): 91 | browser.get(url) 92 | time.sleep(2) 93 | html= browser.page_source 94 | doc = PyQuery(html) 95 | items = doc('div#attributes.attributes .attributes-list li').text() 96 | 97 | data = { 98 | 'property':items 99 | } 100 | print(data['property']) 101 | save_to_file(data['property']) 102 | #写一个保存文件的函数 103 | def save_to_file(result): 104 | #a是追加信息的意思 105 | with open("商品属性.text","a",encoding='utf-8') as f: 106 | #把python转化为json,然后用json的形式保存下来,ensure_ascii=False是识别有没有中文的意思 107 | f.write(json.dumps(result,ensure_ascii=False)+"\n") 108 | print("存储到text成功") 109 | 110 | if __name__ == '__main__': 111 | next_page() -------------------------------------------------------------------------------- /爬取百度文库的doc格式/paqubaiduwenku.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import re 3 | 4 | #首先先去获取这个函数 5 | def get_html(url): 6 | try: 7 | #这里用到防错机制先获取这个页面用get方法 8 | r = requests.get(url,headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.62 Safari/537.36"}) 9 | #这句话的意思就是这个HTTP回应内容的编码方式 =这个内容的备用编码方式, 10 | # 这样写的意义就是不用指定某种编码,而是直接调用这个内容的编码 11 | r.encoding = r.apparent_encoding 12 | #放回这个内容以text的形式 13 | return r.text 14 | except: 15 | print("URL request error") 16 | 17 | #开始解析我们的doc文件 18 | def parse_doc(html): 19 | #先设置result为空,方便存放 20 | result = '' 21 | #用我们的正则去获取我们想要的URL 22 | url_list = re.findall("(https.*?0.json.*?)\\\\x22}", html) 23 | #并且把获取到的URL替换成正确的URL 24 | url_list = [addr.replace("\\\\\\/","/") for addr in url_list] 25 | #最后打印出来 26 | print(url_list) 27 | #开始调用我们的URL,因为最后5条是用不了的,所以对它们进行切片 28 | for url in url_list[:-5]: 29 | content = get_html(url) 30 | y = 0 31 | #把这个列表全部打印出来,因为这里是用到了反爬虫机制,所以我们要开始解析页面 32 | txtlists = re.findall('"c":"(.*?)".*?"y":(.*?),',content) 33 | for item in txtlists: 34 | if not y==item[1]: 35 | y = item[1] 36 | n = '\n' 37 | else: 38 | n = '' 39 | result += n 40 | #最后的结果,把我们破解好的内容一条条的打印上去,解码方式就是utf-8,因为还有因为解码还没解,所以我采用了最大的解码方法 41 | result += item[0].encode("utf-8").decode("unicode_escape","ignore") 42 | return result 43 | 44 | def main(): 45 | #输入我们想要爬取这个文章的连接 46 | url = input("请输入你要获取百度文库的URL连接:") 47 | html = get_html(url) 48 | #爬取这个页面的一些信息 49 | wenku_title = re.findall("\'title\'.*?\'(.*?)\'",html)[0] 50 | wenku_type = re.findall("\'docType\'.*?\'(.*?)\'",html)[0] 51 | wenku_id = re.findall("'docId'.*?'(.*?)'",html)[0] 52 | print("文章类型",wenku_type) 53 | print("文档ID",wenku_id) 54 | result =parse_doc(html) 55 | filename= wenku_title+'.doc' 56 | with open(filename,"w",encoding="utf-8") as f: 57 | f.write(result) 58 | print("文件保存为{}.doc".format(wenku_title)) 59 | 60 | if __name__ == '__main__': 61 | main() -------------------------------------------------------------------------------- /爬取百度文库的doc格式/抓取百度文库所有内容/paqu-ppt/pdf.py: -------------------------------------------------------------------------------- 1 | import requests 2 | from selenium import webdriver 3 | from lxml import etree 4 | import re 5 | from selenium.webdriver.common.keys import Keys 6 | import time 7 | from PIL import Image 8 | import os 9 | from bs4 import BeautifulSoup 10 | from docx import Document 11 | import sys 12 | 13 | #首先先去获取这个函数 14 | def get_html(url): 15 | try: 16 | 
#这里用到防错机制先获取这个页面用get方法 17 | r = requests.get(url,headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.62 Safari/537.36"}) 18 | #这句话的意思就是这个HTTP回应内容的编码方式 =这个内容的备用编码方式, 19 | # 这样写的意义就是不用指定某种编码,而是直接调用这个内容的编码 20 | r.encoding = r.apparent_encoding 21 | #放回这个内容以text的形式 22 | return r.text 23 | except: 24 | print("URL request error") 25 | 26 | -------------------------------------------------------------------------------- /爬取百度文库的doc格式/抓取百度文库所有内容/paqubaiduwenku.py: -------------------------------------------------------------------------------- 1 | import requests 2 | import re 3 | 4 | #首先先去获取这个函数 5 | def get_html(url): 6 | try: 7 | #这里用到防错机制先获取这个页面用get方法 8 | r = requests.get(url,headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.62 Safari/537.36"}) 9 | #这句话的意思就是这个HTTP回应内容的编码方式 =这个内容的备用编码方式, 10 | # 这样写的意义就是不用指定某种编码,而是直接调用这个内容的编码 11 | r.encoding = r.apparent_encoding 12 | #放回这个内容以text的形式 13 | return r.text 14 | except: 15 | print("URL request error") 16 | 17 | #开始解析我们的doc文件 18 | def parse_doc(html): 19 | #先设置result为空,方便存放 20 | result = '' 21 | #用我们的正则去获取我们想要的URL 22 | url_list = re.findall("(https.*?0.json.*?)\\\\x22}", html) 23 | #并且把获取到的URL替换成正确的URL 24 | url_list = [addr.replace("\\\\\\/","/") for addr in url_list] 25 | #最后打印出来 26 | print(url_list) 27 | #开始调用我们的URL,因为最后5条是用不了的,所以对它们进行切片 28 | for url in url_list[:-5]: 29 | content = get_html(url) 30 | y = 0 31 | #把这个列表全部打印出来,因为这里是用到了反爬虫机制,所以我们要开始解析页面 32 | txtlists = re.findall('"c":"(.*?)".*?"y":(.*?),',content) 33 | for item in txtlists: 34 | if not y==item[1]: 35 | y = item[1] 36 | n = '\n' 37 | else: 38 | n = '' 39 | result += n 40 | #最后的结果,把我们破解好的内容一条条的打印上去,解码方式就是utf-8,因为还有因为解码还没解,所以我采用了最大的解码方法 41 | result += item[0].encode("utf-8").decode("unicode_escape","ignore") 42 | return result 43 | 44 | def main(): 45 | #输入我们想要爬取这个文章的连接 46 | url = input("请输入你要获取百度文库的URL连接:") 47 | html = get_html(url) 48 | #爬取这个页面的一些信息 49 | wenku_title = re.findall("\'title\'.*?\'(.*?)\'",html)[0] 50 | wenku_type = re.findall("\'docType\'.*?\'(.*?)\'",html)[0] 51 | wenku_id = re.findall("'docId'.*?'(.*?)'",html)[0] 52 | print("文章类型",wenku_type) 53 | print("文档ID",wenku_id) 54 | result =parse_doc(html) 55 | filename= wenku_title+'.doc' 56 | with open(filename,"w",encoding="utf-8") as f: 57 | f.write(result) 58 | print("文件保存为{}.doc".format(wenku_title)) 59 | 60 | if __name__ == '__main__': 61 | main() -------------------------------------------------------------------------------- /爬取百度文库的doc格式/抓取百度文库所有内容/作文.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/13060923171/Crawl-Project/320d69b1d9051db6f43836d7d8854fa5af9d56fb/爬取百度文库的doc格式/抓取百度文库所有内容/作文.pptx -------------------------------------------------------------------------------- /爬取百度文库的doc格式/抓取百度文库所有内容/带GUI的爬取百度文库/README.md: -------------------------------------------------------------------------------- 1 | # 爬取百度文库 2 | 3 | ## Before use: 4 | 5 | ```python 6 | 执行setup.bat 7 | ``` 8 | 9 | ## Requirements 10 | 11 | - Python 3 环境 12 | - 将文件夹放置在一个路径文件夹名中**没有空格**的位置 13 | 14 | ## 使用说明 15 | 16 | 用户界面: 17 | 18 | ![1](.\img\1.png) 19 | 20 | 输入要爬取的百度文库网页的url地址: 21 | 22 | ![2](.\img\2.png) 23 | 24 | 如果没有设置环境变量,需要手动输入本地python.exe文件的绝对路径,如果已经设置环境变量,不需要修改该部分: 25 | 26 | ![3](.\img\3.png) 27 | 28 | 如果输入了错误的网址或python路径,会弹窗报错: 29 | 30 | ![4](.\img\5.png) 31 | 32 | 选择是否爬取文本内容: 33 | 34 | ![4](.\img\4.png) 35 | 36 | 
爬取成功后会有文字提示,爬出结果保存在文件夹内: 37 | 38 | ![6](.\img\6.png) 39 | 40 | 41 | ## 联系作者 42 | 43 | 本项目由xkw和zll共同完成,如有疑惑请咨询: 44 | 45 | xkw:xiasen99@gmail.com 46 | 47 | zll:zh20010728@126.com https://github.com/zll-hust 48 | 49 | 50 | -------------------------------------------------------------------------------- /爬取百度文库的doc格式/抓取百度文库所有内容/带GUI的爬取百度文库/img/1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/13060923171/Crawl-Project/320d69b1d9051db6f43836d7d8854fa5af9d56fb/爬取百度文库的doc格式/抓取百度文库所有内容/带GUI的爬取百度文库/img/1.png -------------------------------------------------------------------------------- /爬取百度文库的doc格式/抓取百度文库所有内容/带GUI的爬取百度文库/img/2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/13060923171/Crawl-Project/320d69b1d9051db6f43836d7d8854fa5af9d56fb/爬取百度文库的doc格式/抓取百度文库所有内容/带GUI的爬取百度文库/img/2.png -------------------------------------------------------------------------------- /爬取百度文库的doc格式/抓取百度文库所有内容/带GUI的爬取百度文库/img/3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/13060923171/Crawl-Project/320d69b1d9051db6f43836d7d8854fa5af9d56fb/爬取百度文库的doc格式/抓取百度文库所有内容/带GUI的爬取百度文库/img/3.png -------------------------------------------------------------------------------- /爬取百度文库的doc格式/抓取百度文库所有内容/带GUI的爬取百度文库/img/4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/13060923171/Crawl-Project/320d69b1d9051db6f43836d7d8854fa5af9d56fb/爬取百度文库的doc格式/抓取百度文库所有内容/带GUI的爬取百度文库/img/4.png -------------------------------------------------------------------------------- /爬取百度文库的doc格式/抓取百度文库所有内容/带GUI的爬取百度文库/img/5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/13060923171/Crawl-Project/320d69b1d9051db6f43836d7d8854fa5af9d56fb/爬取百度文库的doc格式/抓取百度文库所有内容/带GUI的爬取百度文库/img/5.png -------------------------------------------------------------------------------- /爬取百度文库的doc格式/抓取百度文库所有内容/带GUI的爬取百度文库/img/6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/13060923171/Crawl-Project/320d69b1d9051db6f43836d7d8854fa5af9d56fb/爬取百度文库的doc格式/抓取百度文库所有内容/带GUI的爬取百度文库/img/6.png -------------------------------------------------------------------------------- /爬取百度文库的doc格式/抓取百度文库所有内容/带GUI的爬取百度文库/requirements.txt: -------------------------------------------------------------------------------- 1 | pillow 2 | requests 3 | selenium 4 | lxml 5 | bs4 6 | -------------------------------------------------------------------------------- /爬取百度文库的doc格式/抓取百度文库所有内容/带GUI的爬取百度文库/setup.bat: -------------------------------------------------------------------------------- 1 | @echo off 2 | pip install -r requirements.txt 3 | pip install python-docx -------------------------------------------------------------------------------- /爬取百度文库的doc格式/抓取百度文库所有内容/带GUI的爬取百度文库/src/chromedriver.exe: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/13060923171/Crawl-Project/320d69b1d9051db6f43836d7d8854fa5af9d56fb/爬取百度文库的doc格式/抓取百度文库所有内容/带GUI的爬取百度文库/src/chromedriver.exe -------------------------------------------------------------------------------- /爬取百度文库的doc格式/抓取百度文库所有内容/带GUI的爬取百度文库/src/wenku.py: -------------------------------------------------------------------------------- 1 | import requests 2 | from selenium 
import webdriver 3 | from lxml import etree 4 | import re 5 | from selenium.webdriver.common.keys import Keys 6 | import time 7 | from PIL import Image 8 | import os 9 | from bs4 import BeautifulSoup 10 | import bs4 11 | from docx import Document 12 | import sys 13 | 14 | def getHTMLText(url): 15 | header = {'User-agent': 'Googlebot'} 16 | try: 17 | r = requests.get(url, headers=header, timeout=30) 18 | r.raise_for_status() 19 | r.encoding = 'gbk' 20 | # r.encoding = r.apparent_encoding 21 | return r.text 22 | except: 23 | return '' 24 | 25 | def parse_type(content): 26 | return re.findall(r"docType.*?\:.*?\'(.*?)\'\,", content)[0] 27 | 28 | def parse_txt(html): 29 | plist = [] 30 | soup = BeautifulSoup(html, "html.parser") 31 | plist.append(soup.title.string) 32 | for div in soup.find_all('div', attrs={"class": "bd doc-reader"}): 33 | plist.extend(div.get_text().split('\n')) 34 | plist = [c.replace(' ', '') for c in plist] 35 | plist = [c.replace('\x0c', '') for c in plist] 36 | return plist 37 | 38 | def print_docx(plist, filename): 39 | file = open(filename + '.txt', 'w',encoding='utf-8') 40 | for str in plist: 41 | file.write(str) 42 | file.write('\n') 43 | file.close() 44 | with open(filename + '.txt', encoding='utf-8') as f: 45 | docu = Document() 46 | docu.add_paragraph(f.read()) 47 | docu.save(filename + '.docx') 48 | 49 | def parse_doc(url, folderPath): 50 | driver = webdriver.Chrome(r'./src/chromedriver.exe') 51 | driver.get(url) 52 | # 找到‘继续阅读’按钮 定位至还剩35页未读,继续阅读 53 | button = driver.find_element_by_xpath("//*[@id='html-reader-go-more']/div[2]/div[1]/span") 54 | # 按下按钮 55 | driver.execute_script("arguments[0].click();", button) 56 | time.sleep(1) 57 | source = re.compile(r'/(.*?)') 58 | number = int(source.findall(driver.page_source)[0]) 59 | # 获取页码数 60 | # number = total[1] 61 | time.sleep(1) 62 | for i in range(2,number): 63 | driver.find_element_by_class_name("page-input").clear() 64 | driver.find_element_by_class_name("page-input").send_keys(f'{i}') 65 | driver.find_element_by_class_name("page-input").send_keys(Keys.ENTER) 66 | time.sleep(1) 67 | html=etree.HTML(driver.page_source) 68 | # 找到picture容器 69 | links=html.xpath("//div[@class='reader-pic-item']/@style") 70 | # 找到图片对应的url 71 | part = re.compile(r'url[(](.*?)[)]') 72 | qa="".join(links) 73 | z=part.findall(qa) 74 | if i == 2: 75 | for m in range(3): 76 | pic = requests.get(z[m]).content 77 | with open(f'./照片/{m+1}.jpg','wb') as f: 78 | f.write(pic) 79 | f.close() 80 | else: 81 | pic = requests.get(z[2]).content 82 | with open(f'./照片/{i+1}.jpg','wb') as f: 83 | f.write(pic) 84 | f.close() 85 | time.sleep(1) 86 | driver.quit() 87 | 88 | def parse_other(url, folderPath): 89 | driver = webdriver.Chrome(r'./src/chromedriver.exe') 90 | driver.get(url) 91 | # 找到‘继续阅读’按钮 定位至还剩35页未读,继续阅读 92 | button = driver.find_element_by_xpath("//*[@id='html-reader-go-more']/div[2]/div[1]/span") 93 | # 按下按钮 94 | driver.execute_script("arguments[0].click();", button) 95 | time.sleep(1) 96 | source = re.compile(r'/(.*?)') 97 | number = int(source.findall(driver.page_source)[0]) 98 | # 获取页码数 99 | # number = total[1] 100 | time.sleep(1) 101 | # 获取图片 102 | for i in range(2,number): 103 | driver.find_element_by_class_name("page-input").clear() 104 | driver.find_element_by_class_name("page-input").send_keys(f'{i}') 105 | driver.find_element_by_class_name("page-input").send_keys(Keys.ENTER) 106 | time.sleep(1) 107 | html=etree.HTML(driver.page_source) 108 | # 找到picture容器"//div[@class='reader-pic-item']/@style" 109 | 
z=html.xpath('//div[@class="ppt-image-wrap"]/img/@src') 110 | # print(z) 111 | # 保存图片 112 | if i == 2: 113 | for m in range(3): 114 | pic = requests.get(z[m]).content 115 | with open(folderPath + f'/{m + 1}.jpg','wb') as f: 116 | f.write(pic) 117 | f.close() 118 | else: 119 | pic = requests.get(z[i]).content 120 | with open(folderPath + f'/{i + 1}.jpg','wb') as f: 121 | f.write(pic) 122 | f.close() 123 | time.sleep(1) 124 | driver.quit() 125 | 126 | 127 | def print_pdf(folderPath, filename): 128 | files = os.listdir(folderPath) 129 | jpgFiles = [] 130 | sources = [] 131 | for file in files: 132 | if 'jpg' in file: 133 | jpgFiles.append(file) 134 | tep = [] 135 | for i in jpgFiles: 136 | ex = i.split('.') 137 | tep.append(int(ex[0])) 138 | tep.sort() 139 | jpgFiles=[folderPath +'/'+ str(i) + '.jpg' for i in tep] 140 | output = Image.open(jpgFiles[0]) 141 | jpgFiles.pop(0) 142 | for file in jpgFiles: 143 | img = Image.open(file) 144 | img = img.convert("P") 145 | sources.append(img) 146 | output.save(f"{filename}.pdf","PDF",save_all=True,append_images=sources) 147 | 148 | def main(url, istxt): 149 | try: 150 | ticks = time.time() # 获取时间(用于命名文件夹) 151 | filepath = './照片' + str(ticks) # 保存爬取的图片 152 | filename = './爬取结果' + str(ticks) # 爬取生成的文件名 153 | if not os.path.exists(filepath): # 新建文件夹 154 | os.mkdir(filepath) 155 | html = getHTMLText(url) # requests库爬取 156 | type = parse_type(html) # 获取文库文件类型:ppt, pdf, docx 157 | 158 | # 当你要爬取文档的文本时,打开下列注释 159 | if(istxt == "1"): 160 | type = 'txt' 161 | 162 | if type == 'txt' : 163 | plist = parse_txt(html) 164 | print_docx(plist, filename) 165 | elif type == 'doc' or type == 'pdf': 166 | parse_doc(url, filepath) 167 | print_pdf(filepath , filename) 168 | else: 169 | parse_other(url, filepath) 170 | print_pdf(filepath, filename) 171 | print('1') 172 | except: 173 | print('0') 174 | 175 | if __name__ == '__main__': 176 | main(sys.argv[1],sys.argv[2]) 177 | # url = 'https://wenku.baidu.com/view/5292b2bc0166f5335a8102d276a20029bd64638c.html?fr=search' 178 | # istxt = "0" 179 | # main(url,istxt) -------------------------------------------------------------------------------- /爬取百度文库的doc格式/抓取百度文库所有内容/带GUI的爬取百度文库/代码分析/Ajax知识点补充.md: -------------------------------------------------------------------------------- 1 | # Ajax 2 | 3 |   AJAX技术与其说是一种“技术”,不如说是一种“方案”。如上文所述,在网页中使用JavaScript 加载页面中数据的过程,都可以看作AJAX技术。 4 | 5 | ​ AJAX技术改变了过去用户浏览网站时一个请求对应一个页面的模式,允许浏览器通过异步请求来获取数据,从而使得一个页面能够呈现并容纳更多的内容,同时也就意味着更多的功能。 6 | 7 | ​ 只要用户使用的是主流的浏览器,同时允许浏览器执行JavaScript,用户就能够享受网页中的AJAX内容。 8 | 9 |   AJAX技术在逐渐流行的同时,也面临着一些批评和意见。由于JavaScript本身是作为客户端脚本语言在浏览器的基础上执行的,因此,浏览器兼容性成为不可忽视的问题。 10 | 11 | ​ 另外,由于JavaScript在某种程度上实现了业务逻辑的分离(此前的业务逻辑统一由服务器端实现),因此在代码维护上也存在一些效率问题。但总体而言,AJAX技术已经成为现代网站技术中的中流砥柱,受到了广泛的欢迎。AJAX目前的使用场景十分广泛,很多时候普通用户甚至察觉不到网页正在使用AJAX技术。 12 | 13 | ​ 以知乎的首页信息流为例,与用户的主要交互方式就是用户通过下拉页面(具体操作可通过鼠标滚轮、拖动滚动条等实现)查看更多动态,而在一部分动态(对于知乎而言包括被关注用户的点赞和回答等)展示完毕后,就会显示一段加载动画并呈现后续的动态内容。此处的页面动画其实只是“障眼法”,在这个过程中,JavasScript脚本已向服务器请求发送相关数据,并最终加载到页面之中。这时页面显然没有进行全部刷新,而是只“新”刷新了一部分,通过这种异步加载的方式完成了对新内容的获取和呈现,这个过程就是典型的AJAX应用。 14 | 15 |   比较尴尬的是,爬虫一般不能执行包括“加载新内容”或者“跳到下一页”等功能在内的各类写在网页中的JavaScript代码。如本节开头所述,爬虫会获取网站的原始HTML页面,由于它没有像浏览器一样的执行JavaScript脚本的能力,因此也就不会为网页运行JavaScript。 16 | 17 | ​ 最终,爬虫爬取到的结果就会和浏览器里显示的结果有所差异,很多时候便不能直接获得想要的关键信息。 18 | 19 | ​ 为解决这个尴尬处境,基于Python编写的爬虫程序可以做出两种改进,一种是通过分析AJAX内容(需要开发者手动观察和实验),观察其请求目标、请求内容和请求的参数等信息,最终编写程序来模拟这样的JavaScript 请求,从而获取信息(这个过程也可以叫作“逆向工程”)。 20 | 21 | ​ 另外一种方式则比较取巧,那就是直接模拟出浏览器环境,使得程序得以通过浏览器模拟工具“移花接木”,最终通过浏览器渲染后的页面来获得信息。这两种方式的选择与JavaScript在网页中的具体使用方法有关。 
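22 | 
23 | As a minimal sketch of the first approach described above (reverse-engineering the AJAX request and replaying it yourself), the snippet below sends the same kind of XHR request that the page's JavaScript would send and reads the JSON it returns. The URL, query parameters and field names here are placeholders for illustration only, not a real endpoint; substitute the request target and parameters you actually observe in the browser's Network panel.
24 | 
25 | ```python
26 | import requests
27 | 
28 | # Headers copied from the browser's Network panel; a realistic User-Agent
29 | # (and often a Referer) is usually enough for a first attempt.
30 | headers = {
31 |     "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
32 |                   "(KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
33 |     "Referer": "https://example.com/list",  # placeholder
34 | }
35 | 
36 | def fetch_ajax_page(page):
37 |     """Replay the XHR request that the page's JavaScript would normally send."""
38 |     # Placeholder endpoint and query string; replace them with the real ones
39 |     # observed in DevTools (request URL, query parameters, form data).
40 |     url = "https://example.com/api/items"
41 |     params = {"page": page, "size": 20}
42 |     resp = requests.get(url, headers=headers, params=params, timeout=10)
43 |     resp.raise_for_status()
44 |     # AJAX endpoints usually return JSON, so no HTML parsing is needed.
45 |     return resp.json()
46 | 
47 | if __name__ == '__main__':
48 |     print(fetch_ajax_page(1))
49 | ```
50 | 
51 | When the request parameters are signed or encrypted, the second approach (driving a real browser, as the selenium-based wenku.py in this project does) is often the quicker route to working code, at the cost of speed and resource usage.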
-------------------------------------------------------------------------------- /爬取百度文库的doc格式/抓取百度文库所有内容/带GUI的爬取百度文库/代码分析/JSP知识的小补充.md: -------------------------------------------------------------------------------- 1 | # JSP知识的小补充 2 | ## 网页组成 3 | 4 | 网页是由 HTML、CSS、JavaScript 组成的。 5 | 6 | HTML 是用来搭建整个网页的骨架,而 CSS 是为了让整个页面更好看,包括我们看到的颜色,每个模块的大小、位置等都是由 CSS 来控制的,JavaScript 是用来让整个网页“动起来”,这个动起来有两层意思,一层是网页的数据动态交互,还有一层是真正的动,比如我们都见过一些网页上的动画,一般都是由 JavaScript 配合 CSS 来完成的。 7 | 8 | 不同类型的文字通过不同类型的标签来表示,如图片用 `<img>` 标签表示,视频用