├── Notes.md
├── dyspider.py
├── head.py
├── pics
│   ├── Screenshot.png
│   ├── desc.png
│   ├── id.png
│   ├── logo.jpg
│   ├── pc.png
│   └── video.png
├── readme.md
└── requirements.txt

/Notes.md:
--------------------------------------------------------------------------------
## Unlocking mobile packet capture

1. Make sure the phone and the computer are on the same LAN.
2. In Charles, open Proxy - Proxy Settings - HTTP Proxy, set the port to 8888, and check "Enable transparent HTTP proxying".
3. In Charles, choose Help - SSL Proxying - Install Charles Root Certificate on a Mobile Device or Remote Browser.
4. Following the prompt from step 3, open the settings of the Wi-Fi network the phone is connected to, choose Proxy - Manual, and enter the IP and port shown in step 3.
5. Use a third-party browser to visit chls.pro/ssl and download the PEM certificate, then install it via Settings - More settings - System security - Install from storage - Choose file.
6. Once this is done, you can start capturing the phone's traffic.

#### Result

Open any Douyin video and watch the captured URLs. Filtering by the keyword dy shows the video source URL:

http://v3-dy-z.ixigua.com/fcbe3c977ce612147e47e6450ae9b60c/5b87abed/video/m/220279cd152b24b4a1d953cab94c6818f18115a9d1d00010a7eaf9e9a44/

Try a single video first.

Find the video URL in Charles, copy it as cURL, paste it into Postman to generate the requests code, tweak the code so that it saves the file, and bingo!

```python
import requests

url = "http://v3-dy-y.ixigua.com/7c0a21fa9c4c6108f1482d6b6769ca6c/5b921c99/video/m/2208175becfbb8a40e48ccf94c58df84eff115b57ab000081cb4d8ed0c5/"

# Headers copied from the captured request
headers = {
    'Range': "bytes=0-10485760",
    'Vpwp-Type': "preloader",
    'Vpwp-Key': "683A5AD25479CE507F8D2EC34799A6E6",
    'Vpwp-Raw-Key': "v0200f5d0000be8koqaepr11885ngtr0720p",
    'Vpwp-Flag': "0",
    'Host': "v3-dy-y.ixigua.com",
    'User-Agent': "okhttp/3.10.0.1",
    'Cache-Control': "no-cache",
    'Postman-Token': "47c1d073-7f67-44b2-8c2e-ee680cfb8966"
}

response = requests.request("GET", url, headers=headers)

# The with block closes the file on exit, so no explicit close() is needed
with open('dy.mp4', 'wb') as f:
    f.write(response.content)
```

A single video has been downloaded.

## Problems encountered when crawling in bulk

#### Ran into a 443 error

Close Charles.

#### Ran into a 302 redirect

Tell requests not to follow the 301/302 redirect and read the Location header (the URL of the redirect target) instead:

```python
response = requests.request("GET", url, headers=headers, allow_redirects=False)
video_url = response.headers['Location']
```

## Next steps

How can the crawler be made more general?

Check the parameters:

```python
querystring = {
    "user_id": "68152168500",
    "count": "21",
    "max_cursor": max_cursor,
    "aid": "1128",
    "_signature": "9HucchAar.RLDW2U0fv6DPR7nG",
    "dytk": "14d65256b82dd042058b0eca9f85461b"
}
```

It turns out that aid and _signature are not required, and neither is the cookie; only dytk must be sent. Is it a "douyin token"? How can it be decoded?
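
The parameter check described above can be sketched as a small elimination loop: drop one parameter at a time and see whether the request still succeeds. This is only an illustration, not the code used in this repo. `find_required_params` and `fake_request_ok` are hypothetical names, and the stand-in's rule (the request succeeds only when `user_id` and `dytk` are present) is an assumption made for the demo, loosely mirroring the finding that `dytk` is required while `aid` and `_signature` are not.

```python
from typing import Callable, Dict, List

# Parameters from the captured request; max_cursor is set to "0" for the demo.
QUERY: Dict[str, str] = {
    "user_id": "68152168500",
    "count": "21",
    "max_cursor": "0",
    "aid": "1128",
    "_signature": "9HucchAar.RLDW2U0fv6DPR7nG",
    "dytk": "14d65256b82dd042058b0eca9f85461b",
}


def find_required_params(params: Dict[str, str],
                         request_ok: Callable[[Dict[str, str]], bool]) -> List[str]:
    """Return the keys whose removal makes the request fail."""
    required = []
    for key in params:
        # Send every parameter except the one under test.
        trimmed = {k: v for k, v in params.items() if k != key}
        if not request_ok(trimmed):
            required.append(key)
    return required


def fake_request_ok(params: Dict[str, str]) -> bool:
    """Hypothetical stand-in for the real endpoint (assumed rule, no network)."""
    return "user_id" in params and "dytk" in params


print(find_required_params(QUERY, fake_request_ok))  # ['user_id', 'dytk']
```

Against the real endpoint, `request_ok` would presumably wrap a `requests.get` call and check that the returned JSON contains a non-empty `aweme_list`; that check is likewise an assumption.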

Solved: dytk is in the response of the first share link.

--------------------------------------------------------------------------------
/dyspider.py:
--------------------------------------------------------------------------------
#!/usr/bin/python
# -*- coding: utf-8 -*-

import os
import re
import sys
from argparse import ArgumentParser
from time import sleep

import requests

from head import download_headers, video_headers, Web_UA

VIDEO_URLS, PAGE = [], 1


def get_all_video_urls(user_id, max_cursor, dytk):
    '''
    Collect the source URLs of all of the user's videos.
    '''
    global PAGE
    url = "https://www.amemv.com/aweme/v1/aweme/post/?"
    params = {'user_id'   : user_id,
              'count'     : 21,
              'max_cursor': max_cursor,
              'dytk'      : dytk}
    try:
        response = requests.request("GET", url, params=params, headers=video_headers)
        if response.status_code == 200:
            data = response.json()
            for li in data['aweme_list']:
                name = li.get('share_info').get('share_desc')
                video_url = li.get('video').get('play_addr').get('url_list')[0]
                VIDEO_URLS.append([name, video_url])

            # Keep paging while the API reports more videos
            if data['has_more'] == 1 and data.get('max_cursor') != 0:
                sleep(1)
                print('Collecting video URLs, page %s' % (PAGE))
                PAGE += 1
                return get_all_video_urls(user_id, data.get('max_cursor'), dytk)
            else:
                return
        else:
            print(response.status_code)
            return
    except Exception as e:
        print('failed,', e)
        return


def download_video(index, username, name, url, retry=3):
    '''
    Download one video and show the progress.
    '''
    print("\rDownloading video %s: %s" % (index, name))
    try:
        # The listed URL answers with a 302; the real address is in Location
        response = requests.get(url, headers=download_headers, timeout=15, allow_redirects=False)
        video_url = response.headers['Location']
        # Stream the body so the progress bar advances chunk by chunk
        video_response = requests.get(video_url, stream=True, headers=download_headers, timeout=15)

        # Save the video and show the download progress
        if video_response.status_code == 200:
            video_size = int(video_response.headers['Content-Length'])
            with open('%s/%s.mp4' % (username, name), 'wb') as f:
                data_length = 0
                for data in video_response.iter_content(chunk_size=1024):
                    data_length += len(data)
                    f.write(data)
                    done = int(50 * data_length / video_size)
                    sys.stdout.write("\rProgress: [%s%s]" % ('█' * done, ' ' * (50 - done)))
                    sys.stdout.flush()
        # Retry up to 3 times on failure
        elif retry:
            download_video(index, username, name, url, retry - 1)
    except Exception as e:
        print('download failed,', name, e)


def get_name_and_dytk(num):
    '''
    Get the user name and dytk from the share page.
    '''
    url = "https://www.amemv.com/share/user/%s" % num
    headers = {'user-agent': Web_UA}
    try:
        response = requests.request("GET", url, headers=headers)
        name = re.findall('(.*?)', response.text)[0]
        dytk = re.findall("dytk: '(.*?)'", response.text)[0]
        return name, dytk
    except Exception:
        # Return a pair so the caller can still unpack the result
        return None, None


def makedir(name):
    '''
    Create the folder named after the user.
    '''
    if not os.path.isdir(name):
        os.mkdir(name)


def get_parser():
    '''
    Parse the command-line arguments.
    '''
    parser = ArgumentParser()
    parser.add_argument('--uid', dest='user_id', type=int, help='Douyin ID of the user')
    return parser.parse_args()


def main():
    ''' Entry point '''
    args = get_parser()
    _id = args.user_id if args.user_id else int(input('Enter the Douyin user ID to crawl: '))
    username, dytk = get_name_and_dytk(_id)
    if not username or not dytk:
        print('failed to get the user name or dytk for', _id)
        return
    makedir(username)
    get_all_video_urls(_id, 0, dytk)
    for index, item in enumerate(VIDEO_URLS, 1):
        name = item[0]
        # Generic share text ("Douyin - original music short-video community"):
        # append the index so these files do not overwrite each other
        if name == '抖音-原创音乐短视频社区':
            name = name + str(index)
        url = item[1]
        download_video(index, username, name, url)
        sleep(1)


if __name__ == '__main__':
    main()

--------------------------------------------------------------------------------
/head.py:
--------------------------------------------------------------------------------
#!/usr/bin/python
# -*- coding: utf-8 -*-

'''Headers kept in a separate file'''

Web_UA = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36'

# Headers used when fetching the list of video URLs
video_headers = {
    'accept-encoding': "gzip, deflate, br",
    'accept-language': "zh-CN,zh;q=0.9,zh-TW;q=0.8,en-US;q=0.7,en;q=0.6",
    'user-agent': "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1",
    'accept': "application/json",
    'referer': "https://www.amemv.com/share/user/99311023790?u_code=md2m918k&timestamp=1536311519&utm_source=qq&utm_campaign=client_share&utm_medium=android&app=aweme&iid=43539176723",
    'authority': "www.amemv.com",
    'x-requested-with': "XMLHttpRequest",
    'Cache-Control': "no-cache",
    'Postman-Token': "a3873c36-c20c-45f3-baf9-d80c6fdae811"
}
# Headers used when downloading the video files
download_headers = {
    'Connection': "keep-alive",
    'Upgrade-Insecure-Requests': "1",
    'User-Agent': "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 "
                  "(KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1",
    'Accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*"
              "/*;q=0.8",
    'Accept-Encoding': "gzip, deflate, br",
    'Accept-Language': "zh-CN,zh;q=0.9,zh-TW;q=0.8,en-US;q=0.7,en;q=0.6",
    'Cache-Control': "no-cache",
    'Postman-Token': "29cf9311-b9cd-4171-afe1-53e6cdabaabf"
}

--------------------------------------------------------------------------------
/pics/Screenshot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lfykid/TikTokSpider/089661ff743c422794358ce88b7af5f1eafe8cff/pics/Screenshot.png

--------------------------------------------------------------------------------
/pics/desc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lfykid/TikTokSpider/089661ff743c422794358ce88b7af5f1eafe8cff/pics/desc.png

--------------------------------------------------------------------------------
/pics/id.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lfykid/TikTokSpider/089661ff743c422794358ce88b7af5f1eafe8cff/pics/id.png

--------------------------------------------------------------------------------
/pics/logo.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lfykid/TikTokSpider/089661ff743c422794358ce88b7af5f1eafe8cff/pics/logo.jpg

--------------------------------------------------------------------------------
/pics/pc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lfykid/TikTokSpider/089661ff743c422794358ce88b7af5f1eafe8cff/pics/pc.png

--------------------------------------------------------------------------------
/pics/video.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lfykid/TikTokSpider/089661ff743c422794358ce88b7af5f1eafe8cff/pics/video.png

--------------------------------------------------------------------------------
/readme.md:
--------------------------------------------------------------------------------
# TikTokSpider

## Usage

1. Run dyspider.py and enter the ID of the user to crawl (note: this is not the Douyin number shown in the mobile app).

2. Or pass the user ID with the --uid command-line argument:

```shell
python dyspider.py --uid 64742778880
```

- The Douyin number is not the Douyin ID. The Douyin ID is found here:

To find the Douyin ID, open the profile share link in a browser. For detailed steps, see

[How to find the Douyin number and Douyin ID in Douyin](https://jingyan.baidu.com/article/d2b1d102ce2a885c7e37d4f5.html)

## Download result

Running the code creates a folder named after the target user inside the project folder; all videos are saved there.

## Analysis

Start from the share page of the user's profile.

1. Open the user's profile, tap Share card - As link, send the profile share link to a computer, and open it in Chrome to see the user's profile page.

2. The user's posts are not visible yet. Switch Chrome to mobile device mode and refresh, and bingo, the posts appear.

3. Click the posts tab, scroll down, and check the Network panel to find the request that returns the list of post URLs:

```
https://www.amemv.com/aweme/v1/aweme/post/?user_id=6xx1xx0&count=21&max_cursor=0&aid=1128&_signature=TG2uvBAbGAHzG19a.rniF0xtrq&dytk=14d65256b82dd042058b0eca9f85461b
```

4. On inspection, this is an AJAX request; everything we need is in its response.

5. Simulate the request: select it, choose Copy as cURL, and paste it into Postman to convert it into requests code.

6. The returned JSON has a has_more field: 1 means more posts can be loaded by scrolling down, 0 means this is the last page.

7. When scrolling down, only max_cursor changes in the new request URL, and the new value is the max_cursor returned by the previous response. With has_more and max_cursor we can fetch all the URLs.

8. Write a recursive function, get_all_video_urls, that crawls all the URLs based on has_more; the termination condition is has_more == 0.

9. Use a global list (VIDEO_URLS in dyspider.py) to store the name and URL of every video found.

10. Running the code shows that the downloaded video data is in the wrong format. Going back to check, the URL from step 9 is not the real video URL but a 302 redirect; the real video address is in the Location response header of that request.

11. Add code that disables redirect following to get the real video URL:

```python
response = requests.request("GET", url, headers=headers, allow_redirects=False)
video_url = response.headers['Location']
```

#### Done!

Update 2018/09/11: added a download progress bar.

todolist:

- [x] Exception handling
- [ ] Logging
- [x] Download progress bar

--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
appnope==0.1.0
backcall==0.1.0
certifi==2018.8.24
chardet==3.0.4
decorator==4.3.0
idna==2.7
ipython==6.5.0
ipython-genutils==0.2.0
jedi==0.12.1
parso==0.3.1
pexpect==4.6.0
pickleshare==0.7.4
prompt-toolkit==1.0.15
ptyprocess==0.6.0
Pygments==2.2.0
requests==2.19.1
simplegeneric==0.8.1
six==1.11.0
tqdm==4.25.0
traitlets==4.3.2
urllib3==1.23
wcwidth==0.1.7

--------------------------------------------------------------------------------