├── requirements.txt
├── .gitignore
├── config.json
├── LICENSE
├── README.md
└── main.py


/requirements.txt:
--------------------------------------------------------------------------------
1 | requests


--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | id.csv
2 | Hitokoto.csv
3 | Sentry.py
4 | .ipynb_checkpoints/
5 | Hitokoto.ipynb


--------------------------------------------------------------------------------
/config.json:
--------------------------------------------------------------------------------
 1 | {
 2 |     "path": "Hitokoto.csv",
 3 |     "times": "all",
 4 |     "delay": 0,
 5 |     "timeout": 60,
 6 |     "retry": 3,
 7 |     "from": false,
 8 |     "from_who": false,
 9 |     "creator": false,
10 |     "creator_uid": false,
11 |     "reviewer": false,
12 |     "uuid": false,
13 |     "created_at": false,
14 |     "duplicate": false,
15 |     "print": false
16 | }


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | MIT License
 2 | 
 3 | Copyright (c) 2020 Hitokoto-Spider
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
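 1 | # Hitokoto-Spider
 2 | ###### For the NetEase Cloud Music spider, head over to https://github.com/GamerNoTitle/Netease-Comment-Spider
 3 | 
 4 | This project works with the Hitokoto sentence library. I had long wanted to scrape Hitokoto's text library with Python, and this project is the result. It is still under development and depends on the requests library. Before the first run, enter
 5 | 
 6 | ```bash
 7 | $ pip install requests
 8 | ```
 9 | to install the required Python library. The URL being crawled is https://international.v1.hitokoto.cn/
10 | 
11 | Deduplication now works when appending to an existing file, and resuming an interrupted crawl is supported!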
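
Deduplication on append comes down to remembering which IDs are already stored. The following is a minimal sketch of that idea, not the project's code: main.py keeps the IDs in an `array` and scans it linearly, whereas this sketch swaps in a `set` for constant-time lookups. The helper name `store_if_new` and the in-memory `rows` list are hypothetical.

```python
def store_if_new(entry, seen_ids, rows):
    """Append `entry` to `rows` unless its ID was seen before; return True if stored."""
    entry_id = int(entry["id"])
    if entry_id in seen_ids:
        return False  # duplicate: discard it
    seen_ids.add(entry_id)
    rows.append(entry)
    return True


seen = {1, 2}   # IDs already present in the output file
stored = []     # stands in for rows appended to the CSV
store_if_new({"id": 2, "hitokoto": "old"}, seen, stored)  # duplicate, skipped
store_if_new({"id": 3, "hitokoto": "new"}, seen, stored)  # new, stored
```

In resume mode the set would be seeded with the IDs read from the existing CSV before the first request is made.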
12 | 
13 | ## Getting started
14 | 
15 | First, edit config.json to suit your needs. The default configuration is:
16 | 
17 | ```json
18 | {
19 |     "path": "Hitokoto.csv",
20 |     "times": "all",
21 |     "delay": 0,
22 |     "timeout": 60,
23 |     "retry": 3,
24 |     "from": false,
25 |     "from_who": false,
26 |     "creator": false,
27 |     "creator_uid": false,
28 |     "reviewer": false,
29 |     "uuid": false,
30 |     "created_at": false,
31 |     "duplicate": false,
32 |     "print": false
33 | }
34 | ```
35 | 
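36 | ``path`` is the output file path. It must include the file extension, and the extension must be csv.
37 | 
38 | ``times`` is the number of entries to fetch. If a previously crawled file already exists at that path, the spider enters resume mode: for example, if you ask for 2000 entries and the file already holds 1000 records, only 1000 more are fetched. It can also be set to "all" (the default) to fetch every Hitokoto entry.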
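
The resume arithmetic described for ``times`` can be sketched as follows. This is an illustration under stated assumptions rather than the project's actual code: the helper name `remaining_to_fetch` is hypothetical, and the CSV is assumed to have a header row with the numeric entry ID in the first column.

```python
import csv
import os


def remaining_to_fetch(path, target):
    """Return how many more entries are needed to reach `target` rows in the CSV at `path`."""
    if not os.path.exists(path):
        return target  # nothing has been crawled yet
    with open(path, "r", encoding="utf8", newline="") as f:
        rows = list(csv.reader(f))
    # Count only rows whose first column is a numeric ID; this skips the header row.
    existing = sum(1 for row in rows if row and row[0].isdigit())
    return max(target - existing, 0)
```

With 1000 records already on disk and a target of 2000, the function returns 1000, matching the example above.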
39 | 
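40 | ``delay`` is how long to wait after each fetch, in seconds (added with Hitokoto's QPS limit in mind).
41 | 
42 | ``timeout`` is the HTTP request timeout, in seconds.
43 | 
44 | ``retry`` is the number of retries. A request is retried automatically when the returned status code is not 200. Any non-negative integer is accepted.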
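
The retry behaviour described for ``retry`` might look roughly like the sketch below. It is not taken from main.py: `fetch_with_retry` is a hypothetical helper, and `fetch` stands for any zero-argument callable that returns an object with a `status_code` attribute or raises on a network failure.

```python
def fetch_with_retry(fetch, retry):
    """Call `fetch()` until it returns a response with status 200, at most `retry` times."""
    last_error = None
    for attempt in range(1, retry + 1):
        try:
            response = fetch()
        except Exception as exc:  # e.g. a timeout or a connection error
            last_error = exc
            continue
        if response.status_code == 200:
            return response
        # Any other status code counts as a failed attempt.
        last_error = RuntimeError("unexpected status {}".format(response.status_code))
    raise RuntimeError("gave up after {} attempts".format(retry)) from last_error
```

With the requests library, a call such as `fetch_with_retry(lambda: requests.get(url, timeout=60), 3)` would plug in the real HTTP request, where `url` is whichever endpoint is being crawled.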
45 | 
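46 | ``from`` is the source of the sentence, meaning the work it comes from. Only true and false are accepted.
47 | 
48 | ``from_who`` is also a source, but a person: whoever said the sentence. Only true and false are accepted.
49 | 
50 | ``creator`` is the creator of the entry; the creator's nickname is returned. Only true and false are accepted.
51 | 
52 | ``creator_uid`` is the UID of the entry's creator; the creator's UID is returned. Only true and false are accepted.
53 | 
54 | ``reviewer``: to be honest, I do not know what value this parameter returns, so it stays in for now. Only true and false are accepted.
55 | 
56 | ``uuid`` is the uuid of the entry's creator; the creator's uuid is returned. Only true and false are accepted.
57 | 
58 | ``created_at`` is the time the entry was submitted. Hitokoto returns a timestamp, which I convert to the ``YYYY-MM-DD HH:MM:SS`` format.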
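
The timestamp conversion for ``created_at`` can be sketched like this. The helper `format_created_at` is hypothetical, and the threshold used to tell milliseconds from seconds is an assumption of this sketch. main.py formats in local time; the sketch defaults to UTC so that the result does not depend on the machine's time zone.

```python
from datetime import datetime, timezone


def format_created_at(raw, tz=timezone.utc):
    """Format a Unix timestamp (seconds or milliseconds) as YYYY-MM-DD HH:MM:SS."""
    try:
        ts = int(raw)
    except (TypeError, ValueError):
        return "null"  # same placeholder main.py uses for unreadable values
    if ts > 10**11:  # assumption: a value this large is in milliseconds
        ts = ts / 1000
    return datetime.fromtimestamp(ts, tz).strftime("%Y-%m-%d %H:%M:%S")


print(format_created_at("1600000000"))  # 2020-09-13 12:26:40
```

Passing `tz=None` formats the value in the local time zone instead.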
59 | 
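60 | ``duplicate`` controls whether duplicate results are stored: true stores them, false discards them. Only true and false are accepted (not yet implemented).
61 | 
62 | ``print`` controls whether each fetched result is printed to the console. Only true and false are accepted.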
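
The true/false switches above decide which optional columns end up in the CSV. A compact way to express that, sketched here with the hypothetical names `OPTIONAL_FIELDS` and `build_header` (main.py uses one `if` per field instead), is:

```python
# Optional columns, in the order main.py appends them to the header.
OPTIONAL_FIELDS = ["from", "from_who", "creator", "creator_uid",
                   "reviewer", "uuid", "created_at"]


def build_header(conf):
    """Return the CSV header: three fixed columns plus every optional field set to true."""
    header = ["id", "sort", "hitokoto"]
    header.extend(name for name in OPTIONAL_FIELDS if conf.get(name))
    return header


print(build_header({"from": True, "uuid": True, "creator": False}))
# ['id', 'sort', 'hitokoto', 'from', 'uuid']
```

Keeping the field order in one list means the header and each data row can be built from the same source.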
63 | 
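64 | The CSV file is stored as UTF-8, so decoding it as GBK does not work and produces garbled text. UTF-8 was chosen because GBK cannot handle some of the characters.
65 | 
66 | **If you want to view the file in Excel, first open it as UTF-8 in an editor such as VS Code and save it with GBK encoding, otherwise the text will be garbled!**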
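
As a possible alternative to re-saving the file as GBK, a CSV can be written with Python's `utf-8-sig` codec, which prepends a byte order mark that Excel generally uses to recognise UTF-8. This is only a sketch and not what main.py does (it writes plain `utf8`); the helper name `write_rows` is hypothetical.

```python
import csv


def write_rows(path, header, rows):
    """Write a CSV file encoded as UTF-8 with a leading byte order mark."""
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
```

A file written this way should be read back with `encoding="utf-8-sig"` as well, so the mark is not mistaken for part of the first column name.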
67 | 
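68 | The development diary is published at http://bili33.top/posts/Hitokoto-Spider/
69 | 
70 | Made by a bored high school student; feedback from more experienced developers is welcome!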
71 | 
72 | # Progress
73 | - [x] Fetch Hitokoto content
74 | - [x] Export to a csv file
75 | - [x] Automatically classify Hitokoto categories
76 | - [x] Remove duplicate Hitokoto entries
77 | - [x] Custom output path
78 | - [x] Custom number of entries to fetch
79 | - [x] Custom delay
80 | - [x] JSON configuration file support
81 | - [x] Resume an interrupted crawl
82 | - [x] Retry on connection errors
83 | - [x] Option to store duplicate entries  [#3](https://github.com/GamerNoTitle/Hitokoto-Spider/issues/3)
84 | - [x] Option to print fetched results  [#3](https://github.com/GamerNoTitle/Hitokoto-Spider/issues/3)
85 | - [x] Duplicate rate display        [#3](https://github.com/GamerNoTitle/Hitokoto-Spider/issues/3)
86 | - [x] Duplicate fetch count display  [#3](https://github.com/GamerNoTitle/Hitokoto-Spider/issues/3)
87 | - [ ] Storage in Excel spreadsheet format     [#3](https://github.com/GamerNoTitle/Hitokoto-Spider/issues/3)
88 | - [x] Append to an existing table    [#3](https://github.com/GamerNoTitle/Hitokoto-Spider/issues/3)
89 | - [x] Deduplicate when appending    [#3](https://github.com/GamerNoTitle/Hitokoto-Spider/issues/3)
90 | - [ ] GUI support
91 | 


--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
  1 | import requests as r
  2 | import json as js
  3 | import csv
  4 | import os
  5 | import datetime
  6 | from array import array
  7 | import time
  8 | # Program start time
  9 | start_Pro=datetime.datetime.now()
 10 | failed=False
 11 | def create_csv(path):
 12 |     with open(path,"w+",newline="",encoding="utf8") as file:    # Create the file; "w+" truncates, so this is only called when the file does not exist yet
 13 |         csv_file = csv.writer(file)
 14 |         head = heads # CSV header row
 15 |         csv_file.writerow(head)
 16 | def append_csv(path):
 17 |     with open(path,"a+",newline='',encoding="utf8") as file:
 18 |         csv_file = csv.writer(file)
 19 |         data = [inputs]
 20 |         csv_file.writerows(data)
 21 | def read_config():
 22 |     with open("config.json") as json_file:
 23 |         config = js.load(json_file)
 24 |     return config
 25 | 
 26 | def get_num():
 27 |     status=r.get('https://status.hitokoto.cn/v1/statistic',timeout=timeout).json()
 28 |     # The total entry count sits under data -> status -> hitokoto -> total in the response
 29 |     num=status['data']['status']['hitokoto']['total']
 30 |     return num
 31 | 
 32 | def get_requests():
 33 |     global failed
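 34 |     try:
 35 |         res = r.get('https://v1.hitokoto.cn/',timeout=timeout) # Server response: a JSON body (res.text) plus a status code
 36 |         failed = res.status_code != 200   # Treat non-200 responses as failures so they are retried
 37 |         return res
 38 |     except r.exceptions.RequestException:   # Network errors and timeouts
 39 |         failed=True
 40 |         return None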
 41 | conf = read_config()
 42 | path = conf["path"]
 43 | heads = ["id","sort","hitokoto"]
 44 | delay = int(conf["delay"])
 45 | timeout = int(conf['timeout'])
 46 | if(conf['from']): heads.append("from")
 47 | if(conf['from_who']): heads.append("from_who")
 48 | if(conf['creator']): heads.append('creator')
 49 | if(conf['creator_uid']): heads.append('creator_uid')
 50 | if(conf['reviewer']): heads.append('reviewer')
 51 | if(conf['uuid']): heads.append('uuid')
 52 | if(conf['created_at']): heads.append("created_at")
 53 | if str(conf['times']) == 'all':
 54 |     num=get_num()
 55 | else:
 56 |     num = int(conf["times"])
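 57 | temp=array('i',[0])   # IDs fetched so far; index 0 is a placeholder
 58 | if (os.path.exists(path)!=True):    # Create the file if it does not exist yet
 59 |     create_csv(path)
 60 |     i=1   # Equals len(temp), matching the resume branch, so exactly num entries are stored
 61 | else:
 62 |     print('Resume mode enabled!')
 63 |     with open(path,'r',encoding='utf8',newline='') as file:
 64 |         ids_in_file=csv.reader(file)
 65 |         for id_in_file in ids_in_file:
 66 |             try:
 67 |                 temp.append(int(id_in_file[0])) # Add each ID already in the file to temp
 68 |             except (ValueError, IndexError):
 69 |                 pass   # Skip the header row ("id" is not an int) and blank rows
 70 |     i=0
 71 |     i=i+len(temp)
 72 | sorts=""
 73 | dup=0
 74 | all=0   # Total number of fetches
 75 | while True:
 76 |     if(i>num):   # i starts at len(temp), which includes the placeholder, so stop once i exceeds num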
 77 |         break
 78 |     time.sleep(delay)
 79 |     print("----------------------------------------------------------")
 80 |     print("正在获取新的一言……")
 81 |     print("Fetching new Hitokoto......")
 82 |     tried=0
 83 |     while True:
 84 |         tried+=1
 85 |         print('Fetch attempt {}/{}'.format(tried,conf['retry']))
 86 |         result=get_requests()
 87 |         if failed==False: 
 88 |             break
 89 |         elif(tried>=conf['retry']): 
 90 |             print('Retry limit exceeded; giving up. Please check your network connection!')
 91 |             raise SystemExit(1)   # os._exit() needs a status argument and skips cleanup handlers
 92 |         else:
 93 |             pass
 94 |     res=result
 95 |     all=all+1
 96 |     data=res.json() # Parse the JSON response body into a dict
 97 |     temp_minus=len(temp)-1
 98 |     if temp_minus!=0:
 99 |         t=1
100 |         print("Checking whether this result was already fetched...")
101 |         for t in range(len(temp)):
102 |             if(int(data["id"])==temp[t]):
103 |                 dup=dup+1
104 |                 end_Pro=datetime.datetime.now()
105 |                 print("Found an already-fetched result; discarding it...")
106 |                 print("Completed: {}/{}, elapsed: {}, total fetches: {}, duplicates: {}, duplicate rate: {}".format(len(temp)-1,num,end_Pro-start_Pro,all,dup,dup/all))
107 |                 break
108 |             elif(t==len(temp)-1):
109 |                 print("New result; saving it to the file...")
110 |                 if data["type"]== "a": sorts=("Anime")  # Map the category code back to its name
111 |                 elif data["type"]== "b": sorts=("Comic")
112 |                 elif data["type"]== "c": sorts=("Game")
113 |                 elif data["type"]== "d": sorts=("Novel")
114 |                 elif data["type"]== "e": sorts=("Myself")
115 |                 elif data["type"]== "f": sorts=("Internet")
116 |                 elif data["type"]== "g": sorts=("Other")
117 |                 elif data['type']== 'h': sorts=("Movie")
118 |                 elif data['type']== 'i': sorts=("Poem")
119 |                 elif data['type']== 'j': sorts=("Netease")
120 |                 elif data['type']== 'k': sorts=("Philosophy")
121 |                 elif data['type']== 'l': sorts=('Intelligent')
122 |                 else: sorts=('Unknown')
123 |                 inputs=[data["id"],sorts,data["hitokoto"]]
124 |                 if(conf['from']): inputs.append(data['from'])
125 |                 if(conf["from_who"]): 
126 |                     try:
127 |                         if(data['from_who']==None):
128 |                             inputs.append('null')
129 |                         else:
130 |                             inputs.append(data["from_who"])
131 |                     except KeyError:
132 |                         inputs.append("null")
133 |                 if(conf['creator']): inputs.append(data['creator'])
134 |                 if(conf['creator_uid']):
135 |                     try: 
136 |                         inputs.append(data['creator_uid'])
137 |                     except KeyError:
138 |                         inputs.append("null")
139 |                 if(conf['reviewer']):
140 |                     try:
141 |                         inputs.append(int(data['reviewer']))
142 |                     except KeyError:
143 |                         inputs.append('0')
144 |                 if(conf['uuid']):
145 |                     try:
146 |                         inputs.append(data['uuid'])
147 |                     except KeyError:
148 |                         inputs.append("null")
149 |                 try:
150 |                     timeArray = time.localtime(int(data['created_at']))
151 |                     created_at = time.strftime("%Y-%m-%d %H:%M:%S", timeArray)
152 |                 except ValueError:
153 |                     created_at = ('null')
154 |                 except OSError:
155 |                     timeArray = time.localtime(int(data['created_at'])/1000)
156 |                     created_at = time.strftime("%Y-%m-%d %H:%M:%S", timeArray)
157 |                 if(conf['created_at']): inputs.append(created_at)
158 |                 if(conf['print']): print(res.text)
159 |                 append_csv(path)
160 |                 temp.append(data["id"])
161 |                 end_Pro=datetime.datetime.now()
162 |                 print("Completed: {}/{}, elapsed: {}, total fetches: {}, duplicates: {}, duplicate rate: {}".format(len(temp)-1,num,end_Pro-start_Pro,all,dup,dup/all))
163 |                 i=i+1
164 |                 break
165 |     else:
166 |         if data["type"]== "a": sorts=("Anime")  # Map the category code back to its name
167 |         elif data["type"]== "b": sorts=("Comic")
168 |         elif data["type"]== "c": sorts=("Game")
169 |         elif data["type"]== "d": sorts=("Novel")
170 |         elif data["type"]== "e": sorts=("Myself")
171 |         elif data["type"]== "f": sorts=("Internet")
172 |         elif data["type"]== "g": sorts=("Other")
173 |         elif data['type']== 'h': sorts=("Movie")
174 |         elif data['type']== 'i': sorts=("Poem")
175 |         elif data['type']== 'j': sorts=("Netease")
176 |         elif data['type']== 'k': sorts=("Philosophy")
177 |         elif data['type']== 'l': sorts=('Intelligent')
178 |         else: sorts=('Unknown')
179 |         inputs=[data["id"],sorts,data["hitokoto"]]
180 |         if(conf['from']): inputs.append(data['from'])
181 |         if(conf["from_who"]): 
182 |             try:
183 |                 if(data['from_who']==None):
184 |                     inputs.append('null')
185 |                 else:
186 |                     inputs.append(data["from_who"])
187 |             except KeyError:
188 |                 inputs.append('null')
189 |         if(conf['creator']): inputs.append(data['creator'])
190 |         if(conf['creator_uid']):
191 |             try: 
192 |                 inputs.append(data['creator_uid'])
193 |             except KeyError:
194 |                 inputs.append("null")
195 |         if(conf['reviewer']):
196 |             try:
197 |                 inputs.append(int(data['reviewer']))
198 |             except KeyError:
199 |                 inputs.append('0')
200 |         if(conf['uuid']):
201 |             try:
202 |                 inputs.append(data['uuid'])
203 |             except KeyError:
204 |                 inputs.append("null")
205 |         try:
206 |             timeArray = time.localtime(int(data['created_at']))
207 |             created_at = time.strftime("%Y-%m-%d %H:%M:%S", timeArray)
208 |         except ValueError:
209 |             created_at = ('null')
210 |         except OSError:
211 |             timeArray = time.localtime(int(data['created_at'])/1000)
212 |             created_at = time.strftime("%Y-%m-%d %H:%M:%S", timeArray)
213 |         if(conf['created_at']): inputs.append(created_at)
214 |         if(conf['print']): print(res.text)
215 |         append_csv(path)
216 |         temp.append(data["id"])
217 |         end_Pro=datetime.datetime.now()
218 |         print("Completed: {}/{}, elapsed: {}, total fetches: {}, duplicates: {}, duplicate rate: {}".format(len(temp)-1,num,end_Pro-start_Pro,all,dup,dup/all))
219 |         i=i+1
220 | end_Pro=datetime.datetime.now()
221 | print('----------------------------------------------------------')
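222 | try:
223 |     print('Crawl finished! Entries: {}, elapsed: {}, total fetches: {}, duplicates: {}, duplicate rate: {}'.format(num,end_Pro-start_Pro,all,dup,dup/all))
224 | except ZeroDivisionError:
225 |     print('Crawl finished! Entries: {}, elapsed: {}, total fetches: {}, duplicates: {}, duplicate rate: 0'.format(num,end_Pro-start_Pro,all,dup))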
226 | 
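
main.py repeats the same `if`/`elif` chain twice to turn a category code into a name. One way to express that mapping once is a dictionary lookup, sketched below. The names `TYPE_NAMES` and `category_name` are hypothetical; the codes and labels are copied from the chain in main.py.

```python
# Category codes and labels as used in main.py.
TYPE_NAMES = {
    "a": "Anime", "b": "Comic", "c": "Game", "d": "Novel",
    "e": "Myself", "f": "Internet", "g": "Other", "h": "Movie",
    "i": "Poem", "j": "Netease", "k": "Philosophy", "l": "Intelligent",
}


def category_name(code):
    """Return the category label for a type code, or 'Unknown' for unrecognised codes."""
    return TYPE_NAMES.get(code, "Unknown")


print(category_name("c"))  # Game
print(category_name("z"))  # Unknown
```

Adding a new category then means adding one dictionary entry rather than editing two separate chains.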


--------------------------------------------------------------------------------