├── requirements.txt
├── .gitignore
├── config.json
├── LICENSE
├── README.md
└── main.py


/requirements.txt:
--------------------------------------------------------------------------------
requests

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
id.csv
Hitokoto.csv
Sentry.py
.ipynb_checkpoints/
Hitokoto.ipynb

--------------------------------------------------------------------------------
/config.json:
--------------------------------------------------------------------------------
{
    "path": "Hitokoto.csv",
    "times": "all",
    "delay": 0,
    "timeout": 60,
    "retry": 3,
    "from": false,
    "from_who": false,
    "creator": false,
    "creator_uid": false,
    "reviewer": false,
    "uuid": false,
    "created_at": false,
    "duplicate": false,
    "print": false
}

--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
MIT License

Copyright (c) 2020 Hitokoto-Spider

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# Hitokoto-Spider
###### For the NetEase Cloud Music spider, head over to https://github.com/GamerNoTitle/Netease-Comment-Spider

This project works with the Hitokoto (一言) sentence library. I had long wanted to crawl the Hitokoto text library with Python, which is how this project came about. It is still under development and depends on the requests library. Before the first run, enter

```bash
$ pip install requests
```
to install the required Python library! The crawled URL is: https://international.v1.hitokoto.cn/

Deduplication now works when appending to an existing file, and resuming an interrupted crawl is supported!
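
The resume behaviour mentioned above can be pictured with a short sketch. It is not the project's exact code: `load_seen_ids` and `remaining` are hypothetical helper names, and the sketch assumes that the first CSV column holds the entry id and that the header row is the only non-numeric value in that column.

```python
import csv
import os
import tempfile


def load_seen_ids(path):
    """Collect the integer ids stored in the first column of an existing CSV."""
    seen = set()
    if not os.path.exists(path):
        return seen  # nothing has been crawled yet
    with open(path, "r", encoding="utf8", newline="") as f:
        for row in csv.reader(f):
            if not row:
                continue  # skip blank lines
            try:
                seen.add(int(row[0]))
            except ValueError:
                continue  # skip the header row, whose first cell is "id"
    return seen


def remaining(target, seen):
    """Number of entries still to fetch to reach `target` in total."""
    return max(target - len(seen), 0)


if __name__ == "__main__":
    # Build a throwaway CSV so the demo does not touch a real crawl file.
    demo_path = os.path.join(tempfile.mkdtemp(), "Hitokoto.csv")
    with open(demo_path, "w", encoding="utf8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "sort", "hitokoto"])
        writer.writerow([12, "Anime", "example sentence"])
    seen = load_seen_ids(demo_path)
    print(seen, remaining(2000, seen))
```

Keeping the ids in a set makes the "already fetched?" check a constant-time lookup, which matters once the file holds thousands of entries.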
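
## How to use

First, edit your settings in config.json. The default configuration is:

```json
{
    "path": "Hitokoto.csv",
    "times": "all",
    "delay": 0,
    "timeout": 60,
    "retry": 3,
    "from": false,
    "from_who": false,
    "creator": false,
    "creator_uid": false,
    "reviewer": false,
    "uuid": false,
    "created_at": false,
    "duplicate": false,
    "print": false
}
```

Set ``path`` to the output file path. You must include the file extension yourself, and it must be csv.

Set ``times`` to the number of entries to fetch. If the target file already exists from an earlier crawl, the program enters resume mode: for example, if I need 2000 entries and the file already holds 1000 records, only 1000 more are fetched. It can also be set to "all" (the default), which fetches every Hitokoto entry.

Set ``delay`` to the time to wait after each fetch, in seconds (this option was added with Hitokoto's QPS limit in mind).

``timeout`` is the HTTP request timeout, in seconds.

``retry`` is the number of retries. When the returned status code is not 200, the request is retried automatically. Any non-negative integer is supported.

``from`` is the source, meaning the work the sentence comes from. Only true and false are supported.

``from_who`` is the source, meaning the person who said the sentence. Only true and false are supported.

``creator`` is the creator of the entry; the creator's nickname is returned. Only true and false are supported.

``creator_uid`` is the UID of the entry's creator; the creator's UID is returned. Only true and false are supported.

``reviewer``: honestly, I do not know what value this parameter returns, so I am leaving it in for now…… Only true and false are supported.

``uuid`` is the uuid of the entry's creator; the creator's uuid is returned. Only true and false are supported.

``created_at`` is the time the entry was submitted. Hitokoto returns a timestamp, which I convert to the ``YYYY-MM-DD HH:MM:SS`` format.

``duplicate`` controls whether duplicate results are stored: true stores them, false does not. Only true and false are supported (not finished yet).

``print`` controls whether each fetched result is printed to the console. Only true and false are supported.

The csv file is stored in UTF-8, so decoding it as GBK does not work and produces garbled text. UTF-8 was chosen because GBK cannot handle some characters……

**If you want to view the file in Excel, be sure to first open it as UTF-8 in an editor such as VS Code and save it as GBK; otherwise the text will be garbled!!!**

The development diary is published at http://bili33.top/posts/Hitokoto-Spider/

Made by a bored high-school student; feedback from experts is welcome!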
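
The manual re-save described in the Excel note above can also be scripted. The following is a minimal sketch and not part of the project: `reencode_csv` is a hypothetical helper. Its default target, `utf-8-sig`, writes a byte order mark that Excel generally uses to detect UTF-8; that is an assumption about your Excel version. Passing `dst_encoding="gbk"` mirrors the manual steps instead, in which case characters that GBK cannot represent are replaced with `?`.

```python
import os
import tempfile


def reencode_csv(src_path, dst_path, src_encoding="utf-8", dst_encoding="utf-8-sig"):
    """Copy a CSV file to a new file while changing its text encoding."""
    # newline="" disables newline translation, so line endings are copied unchanged.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
            open(dst_path, "w", encoding=dst_encoding, errors="replace", newline="") as dst:
        for line in src:
            dst.write(line)


if __name__ == "__main__":
    # Demo on a throwaway file rather than a real crawl result.
    folder = tempfile.mkdtemp()
    source = os.path.join(folder, "Hitokoto.csv")
    target = os.path.join(folder, "Hitokoto-excel.csv")
    with open(source, "w", encoding="utf-8", newline="") as f:
        f.write("id,sort,hitokoto\r\n1,Anime,一言\r\n")
    reencode_csv(source, target)
    with open(target, "rb") as f:
        print(f.read()[:3])  # the first three bytes are the UTF-8 byte order mark
```

The original file is left untouched, so the UTF-8 copy remains available for resume mode.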

# Progress overview
- [x] Fetch Hitokoto content
- [x] Export to a csv file
- [x] Automatically map Hitokoto type codes to categories
- [x] Remove duplicate Hitokoto entries
- [x] Custom output path
- [x] Custom fetch count
- [x] Custom delay
- [x] JSON config file support
- [x] Resume an interrupted crawl
- [x] Retry on connection errors
- [x] Option to store duplicate entries [#3](https://github.com/GamerNoTitle/Hitokoto-Spider/issues/3)
- [x] Option to print fetched results [#3](https://github.com/GamerNoTitle/Hitokoto-Spider/issues/3)
- [x] Duplicate rate display [#3](https://github.com/GamerNoTitle/Hitokoto-Spider/issues/3)
- [x] Duplicate fetch count display [#3](https://github.com/GamerNoTitle/Hitokoto-Spider/issues/3)
- [ ] Storage in Excel spreadsheet format [#3](https://github.com/GamerNoTitle/Hitokoto-Spider/issues/3)
- [x] Append to an existing table [#3](https://github.com/GamerNoTitle/Hitokoto-Spider/issues/3)
- [x] Deduplicate when appending [#3](https://github.com/GamerNoTitle/Hitokoto-Spider/issues/3)
- [ ] GUI support

--------------------------------------------------------------------------------
/main.py:
--------------------------------------------------------------------------------
import requests as r
import json as js
import csv
import os
import sys
import datetime
import time

# Program start time, used for the elapsed-time display
start_Pro = datetime.datetime.now()
failed = False

# Maps Hitokoto type codes to category names
SORTS = {
    "a": "Anime", "b": "Comic", "c": "Game", "d": "Novel",
    "e": "Myself", "f": "Internet", "g": "Other", "h": "Movie",
    "i": "Poem", "j": "Netease", "k": "Philosophy", "l": "Intelligent",
}


def create_csv(path):
    # Only called when the file does not exist yet: create it and write the csv header
    with open(path, "w+", newline="", encoding="utf8") as file:
        csv_file = csv.writer(file)
        csv_file.writerow(heads)


def append_csv(path, row):
    # Append one row to the end of the csv file
    with open(path, "a+", newline='', encoding="utf8") as file:
        csv_file = csv.writer(file)
        csv_file.writerow(row)


def read_config():
    with open("config.json") as json_file:
        config = js.load(json_file)
    return config


def get_num():
    # Ask the Hitokoto status API for the total number of entries
    status = r.get('https://status.hitokoto.cn/v1/statistic', timeout=timeout).json()
    num = status['data']['status']['hitokoto']['total']
    return num


def get_requests():
    global failed
    try:
        # The server responds with a JSON body (res.text) and a status code
        res = r.get('https://v1.hitokoto.cn/', timeout=timeout)
        # A status code other than 200 counts as a failure and triggers a retry
        failed = res.status_code != 200
        return res
    except r.RequestException:
        failed = True
        return None


def build_row(data):
    # Build one csv row in the same column order as `heads`
    inputs = [data["id"], SORTS.get(data["type"], "Unknown"), data["hitokoto"]]
    if conf['from']:
        inputs.append(data['from'])
    if conf["from_who"]:
        from_who = data.get('from_who')
        inputs.append('null' if from_who is None else from_who)
    if conf['creator']:
        inputs.append(data['creator'])
    if conf['creator_uid']:
        inputs.append(data.get('creator_uid', 'null'))
    if conf['reviewer']:
        try:
            inputs.append(int(data['reviewer']))
        except KeyError:
            inputs.append('0')
    if conf['uuid']:
        inputs.append(data.get('uuid', 'null'))
    if conf['created_at']:
        try:
            timeArray = time.localtime(int(data['created_at']))
            created_at = time.strftime("%Y-%m-%d %H:%M:%S", timeArray)
        except ValueError:
            created_at = 'null'
        except (OSError, OverflowError):
            # Out of range for a timestamp in seconds: treat it as milliseconds
            timeArray = time.localtime(int(data['created_at']) / 1000)
            created_at = time.strftime("%Y-%m-%d %H:%M:%S", timeArray)
        inputs.append(created_at)
    return inputs


conf = read_config()
path = conf["path"]
heads = ["id", "sort", "hitokoto"]
delay = int(conf["delay"])
timeout = int(conf['timeout'])
for key in ("from", "from_who", "creator", "creator_uid", "reviewer", "uuid", "created_at"):
    if conf[key]:
        heads.append(key)
if str(conf['times']) == 'all':
    num = get_num()
else:
    num = int(conf["times"])

temp = set()  # IDs that have already been fetched
if not os.path.exists(path):  # Create the file if it does not exist
    create_csv(path)
else:
    print('Resume mode enabled!')
    with open(path, 'r', encoding='utf8', newline='') as file:
        for id_in_file in csv.reader(file):
            try:
                temp.add(int(id_in_file[0]))  # Record the ids already in the file
            except (ValueError, IndexError):
                pass  # Skip the header row ("id" is not an int) and blank lines

dup = 0
total = 0  # Total number of fetches, duplicates included
# Stop once `num` entries are stored; the old `i>num` check fetched one extra entry on a fresh file
while len(temp) < num:
    time.sleep(delay)
    print("----------------------------------------------------------")
    print("Fetching new Hitokoto......")
    tried = 0
    while True:
        tried += 1
        print('Fetch attempt {}/{}'.format(tried, conf['retry']))
        res = get_requests()
        if not failed:
            break
        elif tried >= conf['retry']:
            print('Retry limit exceeded, giving up. Please check your network connection!')
            sys.exit(1)  # os._exit() without a status argument raises TypeError
    total = total + 1
    data = res.json()  # Parse the JSON response body into a dict
    if temp:
        print("Checking whether this result was already fetched......")
    if int(data["id"]) in temp:
        dup = dup + 1
        print("Found an already fetched result, discarding......")
    else:
        print("New result, saving to file......")
        if conf['print']:
            print(res.text)
        append_csv(path, build_row(data))
        temp.add(int(data["id"]))
    end_Pro = datetime.datetime.now()
    print("Completed: {}/{}, elapsed: {}, total fetches: {}, duplicates: {}, duplicate rate: {}".format(len(temp), num, end_Pro - start_Pro, total, dup, dup / total))

end_Pro = datetime.datetime.now()
print('----------------------------------------------------------')
try:
    print('Crawl finished! Entries: {}, elapsed: {}, total fetches: {}, duplicates: {}, duplicate rate: {}'.format(num, end_Pro - start_Pro, total, dup, dup / total))
except ZeroDivisionError:
    print('Crawl finished! Entries: {}, elapsed: {}, total fetches: {}, duplicates: {}, duplicate rate: 0'.format(num, end_Pro - start_Pro, total, dup))

--------------------------------------------------------------------------------
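
`main.py` converts `created_at` by treating the value as seconds first and falling back to milliseconds only when `time.localtime` raises an error. Whether a large value raises at all may depend on the platform, so a magnitude check is one possible alternative. The sketch below is written under assumptions and is not the project's method: `format_created_at` and the `MS_THRESHOLD` cutoff are hypothetical, and it formats in UTC to keep the output reproducible, whereas `main.py` uses local time.

```python
from datetime import datetime, timezone

# Assumption: a value this large is a millisecond timestamp, not seconds.
MS_THRESHOLD = 10 ** 11


def format_created_at(value):
    """Format a seconds or milliseconds timestamp as YYYY-MM-DD HH:MM:SS (UTC)."""
    try:
        ts = int(value)
    except (TypeError, ValueError):
        return "null"  # same placeholder main.py writes for unusable values
    if ts >= MS_THRESHOLD:
        ts = ts / 1000  # convert milliseconds to seconds
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    # The same instant given in seconds and in milliseconds formats identically.
    print(format_created_at("1468605909"))
    print(format_created_at(1468605909000))
```

The cutoff trades a small ambiguity (millisecond values from before 1973 would be read as seconds) for behaviour that does not rely on platform-specific errors.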