├── Notes.md
├── dyspider.py
├── head.py
├── pics
│   ├── Screenshot.png
│   ├── desc.png
│   ├── id.png
│   ├── logo.jpg
│   ├── pc.png
│   └── video.png
├── readme.md
└── requirements.txt


/Notes.md:
--------------------------------------------------------------------------------
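## Unlocking mobile packet capture

1. Make sure the phone and the computer are on the same LAN.
2. In Charles, go to Proxy - Proxy Settings - HTTP Proxy, set the port to 8888, and check "Enable transparent HTTP proxying".
3. In Charles, go to Help - SSL Proxying - Install Charles Root Certificate on a Mobile Device or Remote Browser.
4. Following the prompt from step 3, set the proxy on the Wi-Fi network the phone is connected to: Proxy - Manual - enter the IP and port shown in step 3.
5. Open chls.pro/ssl in a third-party browser to download the PEM certificate, then install it via Settings - More Settings - System Security - Install from Storage - select the file.
6. Once the steps above are done, you can start capturing the phone's traffic.
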
#### Results

Open any Douyin video and look at the captured URLs. Filtering by the keyword dy shows that the video source URL is

http://v3-dy-z.ixigua.com/fcbe3c977ce612147e47e6450ae9b60c/5b87abed/video/m/220279cd152b24b4a1d953cab94c6818f18115a9d1d00010a7eaf9e9a44/

Try a single video first.

In Charles, find the video URL and choose Copy as cURL, paste it into Postman to generate the requests code, then tweak and save the code. Bingo!

```python
import requests

url = "http://v3-dy-y.ixigua.com/7c0a21fa9c4c6108f1482d6b6769ca6c/5b921c99/video/m/2208175becfbb8a40e48ccf94c58df84eff115b57ab000081cb4d8ed0c5/"

# Headers copied from the captured request via Postman.
# Range asks for roughly the first 10 MiB of the file only.
headers = {
    'Range':         "bytes=0-10485760",
    'Vpwp-Type':     "preloader",
    'Vpwp-Key':      "683A5AD25479CE507F8D2EC34799A6E6",
    'Vpwp-Raw-Key':  "v0200f5d0000be8koqaepr11885ngtr0720p",
    'Vpwp-Flag':     "0",
    'Host':          "v3-dy-y.ixigua.com",
    'User-Agent':    "okhttp/3.10.0.1",
    'Cache-Control': "no-cache",
    'Postman-Token': "47c1d073-7f67-44b2-8c2e-ee680cfb8966"
}

response = requests.get(url, headers=headers, timeout=15)
response.raise_for_status()  # stop on a 4xx/5xx instead of saving an error page

# The with-block closes the file on exit, so no explicit f.close() is needed.
with open('dy.mp4', 'wb') as f:
    f.write(response.content)
```

A single video is downloaded.


## Problems encountered in batch crawling

#### Ran into 443 errors

Close Charles.

#### Ran into 302 redirects

Tell requests not to follow the 301/302 redirect, and read the redirect target URL from the Location header instead.

```python
response = requests.request("GET", url, headers=headers, allow_redirects=False)
video_url = response.headers['Location']
```

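The Location lookup above raises a KeyError whenever the server answers without a redirect. A slightly more defensive variant is sketched below. The helper name `resolve_video_url` is made up for this note, and it only assumes a response object that exposes `status_code`, `headers` and `url` the way `requests.Response` does.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def resolve_video_url(response):
    """Return the redirect target of a response, or its own URL if it is not a redirect.

    Hypothetical helper (not part of dyspider.py). `response` is expected to come
    from a request made with allow_redirects=False.
    """
    location = response.headers.get('Location')
    if response.status_code in REDIRECT_CODES and location:
        # Location may be a relative path, so resolve it against the request URL.
        return urljoin(response.url, location)
    return response.url
```

It would be called as `video_url = resolve_video_url(requests.get(url, headers=headers, allow_redirects=False))`.
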
## Next steps

How can the crawler be made more general?

Check the parameters:

```python
max_cursor = 0  # 0 requests the first page

querystring = {
    "user_id":    "68152168500",
    "count":      "21",
    "max_cursor": max_cursor,
    "aid":        "1128",
    "_signature": "9HucchAar.RLDW2U0fv6DPR7nG",
    "dytk":       "14d65256b82dd042058b0eca9f85461b"
}
```

Testing shows that aid and _signature do not have to be sent, and neither does the cookie. Only dytk is required (a "Douyin token", perhaps?). How can it be decrypted?

Solved: dytk is included in the response to the initial share link.
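
As a sketch of that lookup: the two regular expressions below are the ones used in dyspider.py, while the function name `extract_name_and_dytk` and the sample HTML are made up for illustration. The real page markup may differ or change.

```python
import re

NICKNAME_RE = re.compile(r'<p class="nickname">(.*?)</p>')
DYTK_RE = re.compile(r"dytk: '(.*?)'")


def extract_name_and_dytk(html):
    """Return (nickname, dytk) found in the share page HTML, or None if either is missing."""
    name = NICKNAME_RE.search(html)
    dytk = DYTK_RE.search(html)
    if name is None or dytk is None:
        return None
    return name.group(1), dytk.group(1)


# Hypothetical fragment shaped like the share page, for demonstration only.
sample = '''
<p class="nickname">some_user</p>
<script>init({ uid: "68152168500", dytk: '14d65256b82dd042058b0eca9f85461b' });</script>
'''
print(extract_name_and_dytk(sample))
# prints ('some_user', '14d65256b82dd042058b0eca9f85461b')
```

Returning None instead of indexing into an empty list makes a changed page layout easy to detect in the caller.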


--------------------------------------------------------------------------------
/dyspider.py:
--------------------------------------------------------------------------------
#!/usr/bin/python
# -*- coding: utf-8 -*-

import os
import re
import sys
from argparse import ArgumentParser
from time import sleep

import requests

from head import download_headers, video_headers, Web_UA

# Collected [name, url] pairs, and the page counter used in progress messages.
VIDEO_URLS, PAGE = [], 1


def get_all_video_urls(user_id, max_cursor, dytk):
    '''
    Collect the source URLs of all the user's videos into VIDEO_URLS.
    '''
    global PAGE
    api_url = "https://www.amemv.com/aweme/v1/aweme/post/?"
    params = {'user_id'   : user_id,
              'count'     : 21,
              'max_cursor': max_cursor,
              'dytk'      : dytk}
    try:
        response = requests.get(api_url, params=params, headers=video_headers, timeout=15)
        if response.status_code == 200:
            data = response.json()
            for li in data['aweme_list']:
                name = li.get('share_info').get('share_desc')
                # Take the first entry of url_list as the video address.
                video_url = li.get('video').get('play_addr').get('url_list')[0]
                VIDEO_URLS.append([name, video_url])

            # Keep "scrolling" for more videos while the API reports has_more
            if data['has_more'] == 1 and data.get('max_cursor') != 0:
                sleep(1)
                print('Collecting video URLs, page %s' % PAGE)
                PAGE += 1
                return get_all_video_urls(user_id, data.get('max_cursor'), dytk)
            else:
                return
        else:
            print(response.status_code)
            return
    except Exception as e:
        print('failed,', e)
        return


def download_video(index, username, name, url, retry=3):
    '''
    Download one video and show the progress.
    '''
    print("\rDownloading video %s: %s" % (index, name))
    try:
        # The collected URL answers with a 302; the real video address is in Location.
        response = requests.get(url, headers=download_headers, timeout=15, allow_redirects=False)
        video_url = response.headers.get('Location', url)
        # stream=True so that iter_content reads the body chunk by chunk.
        video_response = requests.get(video_url, stream=True, headers=download_headers, timeout=15)

        # Save the video and show the download progress
        if video_response.status_code == 200:
            # Content-Length may be absent; fall back to 0 and skip the bar.
            video_size = int(video_response.headers.get('Content-Length', 0))
            # Replace characters that are not allowed in file names.
            filename = re.sub(r'[\\/:*?"<>|\r\n]', '_', name)
            with open('%s/%s.mp4' % (username, filename), 'wb') as f:
                data_length = 0
                for data in video_response.iter_content(chunk_size=1024):
                    data_length += len(data)
                    f.write(data)
                    if video_size:
                        done = min(50, int(50 * data_length / video_size))
                        sys.stdout.write("\rProgress: [%s%s]" % ('█' * done, ' ' * (50 - done)))
                        sys.stdout.flush()
        # Retry up to 3 times on failure
        elif retry:
            download_video(index, username, name, url, retry - 1)
    except Exception as e:
        print('download failed,', name, e)


def get_name_and_dytk(num):
    '''
    Get the user's nickname and dytk from the share page.
    Returns None if the page cannot be fetched or parsed.
    '''
    url = "https://www.amemv.com/share/user/%s" % num
    headers = {'user-agent': Web_UA}
    try:
        response = requests.get(url, headers=headers, timeout=15)
        name = re.findall('<p class="nickname">(.*?)</p>', response.text)[0]
        dytk = re.findall("dytk: '(.*?)'", response.text)[0]
        return name, dytk
    except (requests.RequestException, IndexError):
        return None


def makedir(name):
    '''
    Create the folder named after the user if it does not exist yet.
    '''
    os.makedirs(name, exist_ok=True)


def get_parser():
    '''
    Parse the command-line arguments.
    '''
    parser = ArgumentParser()
    parser.add_argument('--uid', dest='user_id', type=int, help="the user's Douyin ID")
    return parser.parse_args()


def main():
    ''' Entry point '''
    args = get_parser()
    _id = args.user_id if args.user_id else int(input('Enter the Douyin user ID to crawl: '))
    result = get_name_and_dytk(_id)
    if not result:
        # get_name_and_dytk returns None when the share page cannot be parsed.
        sys.exit('Could not get the user name and dytk for id %s' % _id)
    username, dytk = result
    makedir(username)
    get_all_video_urls(_id, 0, dytk)
    for index, item in enumerate(VIDEO_URLS, 1):
        name = item[0]
        # Several videos can carry this generic share text; append the index
        # so their file names stay unique.
        if name == '抖音-原创音乐短视频社区':
            name = name + str(index)
        url = item[1]
        download_video(index, username, name, url)
        sleep(1)


if __name__ == '__main__':
    main()



--------------------------------------------------------------------------------
/head.py:
--------------------------------------------------------------------------------
#!/usr/bin/python
# -*- coding: utf-8 -*-

'''Headers kept in a separate file'''

Web_UA = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36'

# Headers used when crawling the list of video URLs.
# br is left out of accept-encoding: requests cannot decode Brotli without an extra package.
video_headers = {
    'accept-encoding':  "gzip, deflate",
    'accept-language':  "zh-CN,zh;q=0.9,zh-TW;q=0.8,en-US;q=0.7,en;q=0.6",
    'user-agent':       "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1",
    'accept':           "application/json",
    'referer':          "https://www.amemv.com/share/user/99311023790?u_code=md2m918k&timestamp=1536311519&utm_source=qq&utm_campaign=client_share&utm_medium=android&app=aweme&iid=43539176723",
    'authority':        "www.amemv.com",
    'x-requested-with': "XMLHttpRequest",
    'Cache-Control':    "no-cache",
    'Postman-Token':    "a3873c36-c20c-45f3-baf9-d80c6fdae811"
}
# Headers used for the final download of the video file
download_headers = {
    'Connection':                "keep-alive",
    'Upgrade-Insecure-Requests': "1",
    'User-Agent':                "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/604.1.38 "
                                 "(KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1",
    'Accept':                    "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*"
                                 "/*;q=0.8",
    'Accept-Encoding':           "gzip, deflate",
    'Accept-Language':           "zh-CN,zh;q=0.9,zh-TW;q=0.8,en-US;q=0.7,en;q=0.6",
    'Cache-Control':             "no-cache",
    'Postman-Token':             "29cf9311-b9cd-4171-afe1-53e6cdabaabf"
}



--------------------------------------------------------------------------------
/pics/Screenshot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lfykid/TikTokSpider/089661ff743c422794358ce88b7af5f1eafe8cff/pics/Screenshot.png


--------------------------------------------------------------------------------
/pics/desc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lfykid/TikTokSpider/089661ff743c422794358ce88b7af5f1eafe8cff/pics/desc.png


--------------------------------------------------------------------------------
/pics/id.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lfykid/TikTokSpider/089661ff743c422794358ce88b7af5f1eafe8cff/pics/id.png


--------------------------------------------------------------------------------
/pics/logo.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lfykid/TikTokSpider/089661ff743c422794358ce88b7af5f1eafe8cff/pics/logo.jpg


--------------------------------------------------------------------------------
/pics/pc.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lfykid/TikTokSpider/089661ff743c422794358ce88b7af5f1eafe8cff/pics/pc.png


--------------------------------------------------------------------------------
/pics/video.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/lfykid/TikTokSpider/089661ff743c422794358ce88b7af5f1eafe8cff/pics/video.png


--------------------------------------------------------------------------------
/readme.md:
--------------------------------------------------------------------------------
![logo](https://github.com/huangke19/TikTokSpider/raw/master/pics/logo.jpg)



# TikTokSpider

![dd](https://github.com/huangke19/LagouSpider/raw/master/lines/bird.jpg)

## Usage

1. Run dyspider.py and enter the ID of the user you want to crawl (note: this is not the Douyin handle shown in the mobile app)

   ![dd](https://github.com/huangke19/TikTokSpider/raw/master/pics/desc.png)

2. Or pass the user ID with the --uid command-line argument

   ```shell
   python dyspider.py --uid 64742778880
   ```

- The Douyin handle (抖音号) is not the Douyin ID. The Douyin ID is found here:

  ![dd](https://github.com/huangke19/TikTokSpider/raw/master/pics/id.png)

  To see the Douyin ID, open the profile's share link in a browser. For detailed steps, see

  [How to find the Douyin ID behind a Douyin handle (抖音短视频的抖音号-抖音ID怎么看？)](https://jingyan.baidu.com/article/d2b1d102ce2a885c7e37d4f5.html)


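Since the ID shows up in the share link once it is opened in a browser, it can also be pulled out with a small helper. This is only a sketch: `user_id_from_share_url` is a made-up name, and it assumes the opened link has the `/share/user/<id>` form that dyspider.py requests.

```python
import re


def user_id_from_share_url(url):
    """Return the numeric user ID from a .../share/user/<id> URL, or None if absent."""
    match = re.search(r'/share/user/(\d+)', url)
    return int(match.group(1)) if match else None


print(user_id_from_share_url('https://www.amemv.com/share/user/64742778880?utm_source=qq'))
# prints 64742778880
```
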
![dd](https://github.com/huangke19/LagouSpider/raw/master/lines/bird.jpg)

## Download result

Running the code creates a folder named after the target user inside the project folder, and all videos are saved there.

![dd](https://github.com/huangke19/TikTokSpider/raw/master/pics/video.png)

![dd](https://github.com/huangke19/LagouSpider/raw/master/lines/bird.jpg)

## Analysis

Start from the share page of the user's profile.

1. Open the user's profile, tap Share Profile and choose the link option, then send the share link to a computer and open it in Chrome. The user's profile page is now visible.

   ![dd](https://github.com/huangke19/TikTokSpider/raw/master/pics/Screenshot.png)

2. The user's posts are not visible yet. Switch Chrome to mobile mode and refresh. Bingo! The posts show up.

   ![dd](https://github.com/huangke19/TikTokSpider/raw/master/pics/pc.png)

3. Click the posts tab, scroll down, and watch the Network panel. The URL that returns the list of posts appears:

   ```
   https://www.amemv.com/aweme/v1/aweme/post/?user_id=6xx1xx0&count=21&max_cursor=0&aid=1128&_signature=TG2uvBAbGAHzG19a.rniF0xtrq&dytk=14d65256b82dd042058b0eca9f85461b
   ```

4. On inspection this is an Ajax request, and the response contains everything we need.

5. To replay the request, click the link, choose Copy as cURL, and paste it into Postman to convert it into requests code.

6. The returned JSON has a has_more field: 1 means more posts can be loaded by scrolling, 0 means this is the last page.

7. When scrolling, only max_cursor changes in each new URL, and its value is the max_cursor returned by the previous request. With has_more and max_cursor, all the URLs can be fetched.

8. Write a recursive function get_all_video_urls that follows has_more to crawl all the URLs. Recursion stops when has_more == 0.

9. Use a global list, VIDEO_URLS = [], to store the name and URL of every video crawled.

10. Running the finished code shows that the saved video data is in the wrong format. On checking, the URL from step 9 is not the real video URL but a 302 redirect. The real video address is in the Location field of the response headers for that URL.

11. Add code that disables redirect following and reads the real video URL

    ```python
    response = requests.request("GET", url, headers=headers, allow_redirects=False)
    video_url = response.headers['Location']
    ```



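The pagination in steps 6 to 8 can be sketched without touching the network. In the example below, `collect_all`, `fetch_page` and `FAKE_PAGES` are made-up names. `fetch_page` stands in for the real request, and only the has_more, max_cursor and aweme_list fields come from the analysis above.

```python
def collect_all(fetch_page, max_cursor=0):
    """Recursively collect the items of every page.

    fetch_page(max_cursor) must return a dict shaped like the API response:
    {'aweme_list': [...], 'has_more': 0 or 1, 'max_cursor': <int>}.
    """
    data = fetch_page(max_cursor)
    items = list(data['aweme_list'])
    # Stop when has_more == 0; otherwise recurse with the cursor just returned.
    if data['has_more'] == 1:
        items += collect_all(fetch_page, data['max_cursor'])
    return items


# Hypothetical three-page response set standing in for the real API.
FAKE_PAGES = {
    0:  {'aweme_list': ['v1', 'v2'], 'has_more': 1, 'max_cursor': 10},
    10: {'aweme_list': ['v3', 'v4'], 'has_more': 1, 'max_cursor': 20},
    20: {'aweme_list': ['v5'],       'has_more': 0, 'max_cursor': 0},
}

print(collect_all(FAKE_PAGES.get))
# prints ['v1', 'v2', 'v3', 'v4', 'v5']
```

Passing the fetch function in as an argument keeps the paging logic testable on its own. In the real crawler that function would issue the GET request with user_id, count, max_cursor and dytk.
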
#### All done!



Update 2018/09/11: added a download progress bar



todolist:

- [x] Exception handling
- [ ] Logging
- [x] Download progress bar
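
The progress bar in download_video boils down to the idea below, shown as a standalone sketch. `render_bar` is a made-up helper name, and the byte counts are simulated rather than read from a real response.

```python
import sys


def render_bar(done_bytes, total_bytes, width=50):
    """Return a text bar of `width` cells, filled in proportion to done_bytes / total_bytes."""
    # Guard against an unknown total (0) and against overshooting the bar width.
    filled = min(width, int(width * done_bytes / total_bytes)) if total_bytes else 0
    return '[%s%s]' % ('█' * filled, ' ' * (width - filled))


# Simulate a 4096-byte download arriving in 1024-byte chunks.
total = 4096
for received in range(1024, total + 1, 1024):
    # \r returns to the start of the line so each update overwrites the previous one.
    sys.stdout.write('\r' + render_bar(received, total, width=20))
    sys.stdout.flush()
sys.stdout.write('\n')
```
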


--------------------------------------------------------------------------------
/requirements.txt:
--------------------------------------------------------------------------------
appnope==0.1.0
backcall==0.1.0
certifi==2018.8.24
chardet==3.0.4
decorator==4.3.0
idna==2.7
ipython==6.5.0
ipython-genutils==0.2.0
jedi==0.12.1
parso==0.3.1
pexpect==4.6.0
pickleshare==0.7.4
prompt-toolkit==1.0.15
ptyprocess==0.6.0
Pygments==2.2.0
requests==2.19.1
simplegeneric==0.8.1
six==1.11.0
tqdm==4.25.0
traitlets==4.3.2
urllib3==1.23
wcwidth==0.1.7



--------------------------------------------------------------------------------