├── 100w_data
│   ├── 1.txt
│   └── con.py
├── 250w_data
│   └── baiduyun.txt
├── README.md
├── WordVector.py
├── crawler.py
├── img
│   ├── 1.jpeg
│   ├── 1.txt
│   ├── 10.jpeg
│   ├── 11.jpeg
│   ├── 12.jpeg
│   ├── 13.jpg
│   ├── 14.jpeg
│   ├── 15.jpg
│   ├── 16.jpg
│   ├── 17.jpeg
│   ├── 18.jpg
│   ├── 19.jpeg
│   ├── 2.jpeg
│   ├── 29.jpg
│   ├── 3.jpg
│   ├── 4.jpg
│   ├── 5.jpg
│   ├── 6.jpg
│   ├── 7.jpeg
│   ├── 8.jpeg
│   └── 9.jpeg
├── neo4j
│   ├── 1
│   ├── CRUD.py
│   └── __init__.py
└── one_crawler.py

/100w_data/1.txt:
--------------------------------------------------------------------------------
1 | 链接:https://pan.baidu.com/s/1vfSURLzrZyHgJgq4Qa3O0w?pwd=vrx3
2 | 提取码:vrx3
3 | --来自百度网盘超级会员V1的分享
4 | 
5 | 链接:https://pan.baidu.com/s/1LhZ7YUKsHVhsfJIXZB9GhA?pwd=9m15
6 | 提取码:9m15
7 | --来自百度网盘超级会员V1的分享
8 | 
9 | 
10 | 
--------------------------------------------------------------------------------
/100w_data/con.py:
--------------------------------------------------------------------------------
1 | graph = Graph(
2 |     "bolt://localhost:7687",
3 |     auth = ('neo4j','exhibit-join-donor-orion-cable-4724')
4 | )
5 | 
--------------------------------------------------------------------------------
/250w_data/baiduyun.txt:
--------------------------------------------------------------------------------
1 | 链接:https://pan.baidu.com/s/12HvgoRuAU6rQU1sTkqrhIA?pwd=26z8
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Second batch: 2.5M entities + relations, 2023.5.5
2 | Excel files containing both entities and relations; the structure of this batch was optimized with SQL. The download link is in the 250w_data folder.
3 | - CALL apoc.export.csv.query("MATCH (n)-[r]-(m) RETURN id(n),n.entity_name,labels(n)[0],type(r),id(m),m.entity_name,labels(m)", "all_entities_and_relations.csv", {})
4 | 
5 | 
6 | # First batch: graph data with 1M entities
7 | I put it on Baidu Netdisk so it is easy to download and use. It was made by compressing the data directory of a Neo4j 5.5 database: after unpacking, drop it straight into the data folder under your Neo4j installation path. The login password and the download link are both in the 100w_data folder. Because the password is stored inside the database itself, you have to log in with the same password as mine first and change it afterwards.
8 | 
9 | ![](img/19.jpeg)
10 | 
11 | ![](img/29.jpg)
12 | 
13 | # How good is the graph? 2023.3.26
14 | After a few days of crawling there are already several hundred thousand entities, and in a few more days I will publish the first release with one million entities. As you can see, although many entities share the same name, they are kept apart by strict profile-based labels.
15 | 
16 | ![](img/18.jpg)
17 | ![](img/14.jpeg)
18 | ![](img/15.jpg)
19 | ![](img/16.jpg)
20 | ![](img/17.jpeg)
21 | 
22 | # The first batch of the graph 2023.3.19
23 | ![](img/10.jpeg)
24 | This is my first batch of data. It boosted my confidence a lot, but it looks monotonous, so I decided to change the nodes' labels and keep the original label in a separate property.
25 | **The effect after the change**
26 | 
27 | ![](img/11.jpeg)
28 | 
29 | ![](img/12.jpeg)
30 | 
31 | ![](img/13.jpg)
32 | 
33 | Doesn't it look a bit better now?
34 | 
35 | 
36 | # The first bug 2023.3.19
37 | While testing the code that links entities together I hit a bug: an expected element could not be found. The real cause is **polysemy** — for an ambiguous term, Baike asks you to pick one of several senses before you can reach the detailed page.
38 | The selection click itself is easy; the key question is which sense to pick.
39 | ![](img/6.jpg)
40 | ![](img/7.jpeg)
41 | 
42 | One option is to go by position — the first candidate is always the recommended one — but that is not rigorous. When a person chooses, they read the candidate descriptions and decide which one is most likely the "language" mentioned earlier. To imitate that process I brought in **gensim** for semantic analysis and let it pick the best candidate from a semantic point of view.
43 | ## Results with gensim
44 | gensim needs a pre-trained vector model. I found a suitable file online and share it here; the model is a few gigabytes and can be used directly.
45 | - Link: https://pan.baidu.com/s/1pKImAOWF9HS-NL_rWj_ASQ?pwd=zhbk  Extraction code: zhbk
46 | ```python
47 | from gensim.models import word2vec
48 | from gensim.models import Doc2Vec
49 | model = Doc2Vec.load(r'modelpath')
50 | similarity_between = model.wv.similarity(word1, cut_word)  # compute the similarity score
51 | ```
52 | Longer texts can be segmented into words first; then it is just a matter of looking for the most similar word.
53 | 
54 | ![](img/8.jpeg)
55 | 
56 | ![](img/9.jpeg)
57 | 
58 | The effect is quite clear: the similarity between 语言 (language) and 用语 (term) reaches 82%, while none of the scores for the second candidate exceed 30%.
59 | 
60 | 
61 | # Running the crawler 2023.3.12
62 | ![](img/5.jpg)
63 | 
64 | As the picture shows, for each entity I extract its name, its short profile, its attribute relations, and some body text. Here is how I intend to use each piece:
65 | - entity_name is the entity we want to build. An entity needs a unique identifier; for now I use the combination of name and profile as that identifier, so that any other entity related to it can be linked to it.
66 | - entity_profile is the entity's short description. It contains the entity's category plus whatever makes it special: here, for example, it says 中国史书中记载的第一个世袭制朝代 ("the first hereditary dynasty recorded in Chinese historical texts"), whereas the other dynasties only get the generic 中国历史朝代 ("Chinese historical dynasty").
67 | - attr_list is my favourite part, because most of the relations I can currently build come from here. "Where was the capital of the Xia dynasty" is a simple triple, and the capital, 阳城, is itself an entity. Some of these values carry a link and some are just plain names, so I treat them differently: if a value has a link, I fetch the linked page's entity name and profile, which yields a unique identifier and keeps the relation reliable; if it has no link, I synthesize the profile as "entity name + attribute name". The second attribute here, for instance, would be stored as {entity_name: "The Xia Dynasty", entity_profile: "夏朝的外文名"} (see the sketch after this list).
68 | - event_list, finally, holds the body-text sections. I find the content under these sub-headings interesting, so I store it as well.
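To make the intended record shape concrete, here is a small illustrative sketch. The field names match what `analyzing_baike_html()` in `crawler.py` returns; the concrete values and the href are made-up stand-ins for the 夏朝 example above.

```python
# Illustrative only: the shape of one parsed page and how each attr_list row becomes a triple.
# Field names follow analyzing_baike_html() in crawler.py; the values and the href are invented examples.
parsed = {
    "entity_name": "夏朝",
    "entity_profile": "中国史书中记载的第一个世袭制朝代",
    # each row: [attribute name, attribute value, href of the linked Baike page, or False for plain text]
    "attr_list": [
        ["都城", "阳城", "/item/阳城"],         # linked value: resolve its own (name, profile) pair
        ["外文名", "The Xia Dynasty", False],   # plain text: profile is synthesized as "夏朝的外文名"
    ],
    # each row: [section heading, section body text]
    "event_list": [
        ["国号", "……"],
    ],
}

# every attr_list row yields one (head entity, relation, tail entity) triple
for attr_name, attr_value, href in parsed["attr_list"]:
    print((parsed["entity_name"], attr_name, attr_value))
```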
69 | 
70 | # What should I do first? 2023.3.05
71 | The schema, without a doubt — the schema is the soul of a knowledge graph. Still, I chose not to spend my effort on the schema first: encyclopedia metadata already comes with a schema of its own, so I build the graph first and then plan to derive a schema from the category data that has already been collected.
72 | 
73 | ## Defining the schema
74 | ### Concept hierarchy
75 | Organize the knowledge the graph should express into a hierarchy of concepts, making sure that every parent category fully subsumes the concepts represented by its child categories.
76 | Existing resources such as http://schema.org can be consulted to fix the top-level concepts manually.
77 | The hierarchy must be robust, easy to maintain and extend, and able to absorb new requirements.
78 | - Design a hypernym/hyponym mining system to automatically build large numbers of fine-grained concepts (hypernyms).
79 | ### Relations and properties
80 | Once the concept hierarchy is defined, each category still needs its relations and properties. Relations describe links between different entities; properties describe an entity's intrinsic features.
81 | ### Constraints
82 | Constraints on relations and properties keep the data consistent and keep out abnormal values — for example, age must be of type Int and unique (single-valued).
83 | 
84 | ## Construction process
85 | ### Automated construction
86 | #### Getting encyclopedia pages
87 | - Seed sites: they determine what kind of graph you end up with, or at least the crawl priority.
88 | ![](img/1.jpeg)
89 | 
90 | - Depth traversal: mine the links in the body text, check whether each one is a Baike entity page, and if so push it onto the processing queue.
91 | ![](img/2.jpeg)
92 | 
93 | #### Extracting information
94 | - Attribute/relation extraction
95 | Part of every Baike page is a curated attribute box. It is the easiest place to obtain entity-attribute and entity-entity data, and the entities that appear there usually carry a lot of information.
96 | ![](img/3.jpg)
97 | 
98 | - Events
99 | These sections mainly enrich the content of the encyclopedia entity.
100 | ![](img/4.jpg)
--------------------------------------------------------------------------------
/WordVector.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # coding=utf-8
3 | # Creative time 2020/3/20
4 | # Creator HongYuan Guo
5 | 
6 | from gensim.models import word2vec
7 | from gensim.models import Doc2Vec
8 | import jieba.posseg as psg
9 | import jieba
10 | import time
11 | 
12 | class WordVector:
13 | 
14 |     def __init__(self):
15 |         self.corpus_path = 'Corpus\corpus_jingdong.txt'
16 |         self.static_model_path = 'baike_26g_news_13g_novel_229g.model'
17 | 
18 |     def Train_model(self, path, Size):
19 |         sentences = word2vec.Text8Corpus(path)  # load the corpus
20 |         model = word2vec.Word2Vec(sentences, size=Size)  # window defaults to 5 (gensim>=4 renamed size to vector_size)
21 |         model.save('static_models\WordVector_JD60W_' + str(Size) + '.model')
22 |         return model
23 | 
24 |     def get_model(self, path, size):
25 |         try:
26 |             model = Doc2Vec.load('static_models\WordVector_JD60W_' + str(size) + '.model')
27 |             print('模型已存在')
28 |         except Exception as e:
29 |             print('正在训练模型')
30 |             self.Train_model(path, size)
31 |             print('训练结束')
32 |         finally:
33 |             model = Doc2Vec.load('static_models\WordVector_JD60W_' + str(size) + '.model')
34 |             return model
35 | 
36 |     def Similarity_New(self, model, word1, word2):
37 |         word_list1 = []
38 |         word_list2 = []
39 |         similarity_TOP1 = 0
40 |         similarity_list = model.wv.most_similar(word2, topn=1)  # top-1 similarity for word2, used as a normalizer
41 |         for item in similarity_list:
42 |             similarity_TOP1 = item[1]
43 |         similarity_between = model.wv.similarity(word1, word2)  # similarity between the two words
44 | 
45 |         # print('{}和{}的相似程度为{}'.format(word1, word2, str(format(similarity_between / similarity_TOP1 * 100, '.2f')) + '%'))
46 |         for k, v in psg.cut(word1):
47 |             word_list1.append([k, v])
48 |         for k, v in psg.cut(word2):
49 |             word_list2.append([k, v])
50 |         print('{} 和 {} 的相似系数为{},相似程度为:{}'.format(word1, word2, str(similarity_between), str(format(similarity_between / similarity_TOP1 * 100, '.2f')) + '%'))
51 |         return [word_list1, word_list2, similarity_between / similarity_TOP1]
52 | 
53 | if __name__ == '__main__':
54 |     wv = WordVector()
55 |     # model = wv.get_model(wv.corpus_path, 200)
56 |     import time
57 |     load_start = time.time()
58 |     model = 
Doc2Vec.load(r'static_models\baike_26g_news_13g_novel_229g.model') 59 | load_finish = time.time() 60 | print('加载模型时长:',load_finish-load_start) 61 | word1 = u'语言' 62 | word2 = u'连横所著小说' 63 | similarity_TOP1 = 0 64 | 65 | # for cut_word, v in psg.cut(word2): 66 | for cut_word in jieba.lcut(word2 ,cut_all=True) : 67 | sim_start = time.time() 68 | similarity_list = model.wv.most_similar(cut_word, topn=1) #获取排名第一的相似之=值 69 | sim_stop = time.time() 70 | print('相似度计算时长:',sim_stop-sim_start) 71 | 72 | for item in similarity_list: 73 | similarity_TOP1 = [item[0],item[1]] 74 | similarity_between = model.wv.similarity(word1, cut_word) #计算相似值 75 | print('{}和{}的相似系数为{}'.format(word1, cut_word, str(similarity_between))) 76 | # print('与{}最相似的词为{},相似系数为{}'.format(word1,similarity_TOP1[0],similarity_TOP1[1])) 77 | print ( '{}和{}的最终相似程度为{}'.format (word1, cut_word, str (format(similarity_between / similarity_TOP1[1] * 100, '.2f')) + '%')) 78 | # for k,v in psg.cut(word1): 79 | # print(k,v) 80 | # for k,v in psg.cut(word2): 81 | # print(k,v) -------------------------------------------------------------------------------- /crawler.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: UTF-8 -*- 3 | ''' 4 | @Project :BK_Knowledge_Graph 5 | @File :crawler.py 6 | @IDE :PyCharm 7 | @Author :HongYuan Guo 8 | @Date :2023/3/14 19:52 9 | ''' 10 | import requests 11 | from bs4 import BeautifulSoup 12 | import multiprocessing 13 | 14 | import re 15 | import time 16 | try: 17 | from .neo4j.CRUD import neo4j_CRUD 18 | from .neo4j.CRUD import redis_crud 19 | 20 | except: 21 | from neo4j.CRUD import neo4j_CRUD 22 | from neo4j.CRUD import redis_crud 23 | 24 | 25 | crud = neo4j_CRUD() 26 | redis_c = redis_crud() 27 | 28 | def baike_crawler(url:str) -> requests.Response: 29 | ''' 30 | 爬取百度百科内容的代码 31 | :param url: 32 | :return: 33 | ''' 34 | res = requests.get(url) 35 | return res 36 | 37 | def clean(c:str) -> str: 38 | return c.replace('\xa0','').replace('\n','') 39 | 40 | def replace_zongkuohao(c:str) -> str: 41 | num = re.sub(r'\[[0-9\-]*\]', "", c) 42 | return num 43 | 44 | def analyzing_baike_html(htmlstr:str) -> dict: 45 | ''' 46 | 解析百科html 47 | :param htmlstr: 48 | :return: 49 | ''' 50 | 51 | soup = BeautifulSoup(htmlstr,'lxml') 52 | 53 | entity_name = soup.find('h1').text 54 | entity_profile = soup.find('div',attrs = {'class':'lemma-desc'}).text 55 | 56 | attr_list = [] 57 | attr_div = soup.find('div',attrs = {'class':'basic-info J-basic-info cmn-clearfix'}) 58 | 59 | attr_left = attr_div.find('dl',attrs = {'class':'basicInfo-block basicInfo-left'}) 60 | for dt in attr_left.find_all('dt'): 61 | attr_name = dt.text 62 | dd = dt.find_next('dd') 63 | 64 | dc = dd.children 65 | for item in dc: 66 | if item.name == 'a': 67 | try: 68 | href = item.attrs['href'] 69 | attr_value = item.text 70 | attr_list.append([clean(attr_name), clean(attr_value), href]) 71 | except: 72 | attr_value = item.text 73 | attr_list.append([clean(attr_name), clean(attr_value), False]) 74 | else: 75 | try: 76 | attr_value = item 77 | attr_list.append([clean(attr_name), clean(attr_value), False]) 78 | except: 79 | attr_value = item.text 80 | attr_list.append([clean(attr_name), clean(attr_value), False]) 81 | 82 | 83 | attr_right = attr_div.find('dl',attrs = {'class':'basicInfo-block basicInfo-right'}) 84 | for dt in attr_right.find_all('dt'): 85 | attr_name = dt.text 86 | dd = dt.find_next('dd') 87 | 88 | dc = dd.children 89 | for item in dc: 90 | if item.name == 'a': 91 | try: 
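                    # (annotation) An <a> child may lack an href, and non-tag children are bare text nodes;
                    # the try/except pairs below keep the href when it exists — so the linked entity can later
                    # be resolved to its own (entity_name, entity_profile) pair via get_href_titile() — and
                    # otherwise fall back to [attr_name, text, False].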
92 | href = item.attrs['href'] 93 | attr_value = item.text 94 | attr_list.append([clean(attr_name), clean(attr_value), href]) 95 | except: 96 | attr_value = item.text 97 | attr_list.append([clean(attr_name), clean(attr_value), False]) 98 | else: 99 | try: 100 | attr_value = item 101 | attr_list.append([clean(attr_name), clean(attr_value), False]) 102 | except: 103 | attr_value = item.text 104 | attr_list.append([clean(attr_name), clean(attr_value), False]) 105 | 106 | attr_list = clean_attr_value(attr_list) 107 | 108 | 109 | event_list = [] 110 | if len(soup.find_all('h3')) > 1: 111 | for h2 in soup.find_all('h2'): 112 | if h2.text == '目录': 113 | continue 114 | h3 = h2.find_next('h3') 115 | while h3: 116 | h3_p_h2 = h3.find_previous('h2').text 117 | if h3_p_h2 == h2.text: 118 | # print(h2.text,h3.text) 119 | content = '' 120 | content_div = h3.find_next('div',attrs = {'class':'para'}) 121 | while content_div: 122 | content_div_h3 = content_div.find_previous('h3').text 123 | if content_div_h3 == h3.text: 124 | content = content + clean(replace_zongkuohao(content_div.text)) + '\n' 125 | content_div = content_div.find_next('div',attrs = {'class':'para'}) 126 | else: 127 | break 128 | if '扫码' not in h3.text: 129 | event_list.append([h3.text,content]) 130 | h3 = h3.find_next('h3') 131 | else: 132 | break 133 | 134 | entity_href_list = [] 135 | entity_href = soup.find_all('a') 136 | for eh in entity_href: 137 | if 'href' in eh.attrs and 'item' in eh.attrs['href']: 138 | if '秒懂' in eh.attrs['href'] : 139 | continue 140 | else: 141 | entity_href_list.append('https://baike.baidu.com'+eh.attrs['href']) 142 | 143 | 144 | return { 145 | 'entity_name':entity_name, 146 | 'entity_profile':entity_profile, 147 | 'attr_list':attr_list, 148 | 'event_list':event_list, 149 | 'entity_href_list':entity_href_list 150 | } 151 | 152 | def analyzing_baike_html_get_title(htmlstr:str) -> str: 153 | soup = BeautifulSoup(htmlstr,'lxml') 154 | try: 155 | entity_profile = soup.find('div', attrs = {'class' : 'lemma-desc'}).text 156 | return entity_profile 157 | except: 158 | print('need choise') 159 | div = soup.find('div', attrs = {'class' : 'para'}) 160 | entity_profile_href = div.find('a').attrs['href'] 161 | r = baike_crawler('https://baike.baidu.com' + entity_profile_href) 162 | new_soup = BeautifulSoup(r.text,'lxml') 163 | entity_profile = new_soup.find('div', attrs = {'class' : 'lemma-desc'}).text 164 | return entity_profile 165 | 166 | def get_href_titile(url:str) -> str: 167 | r = baike_crawler(url) 168 | res = analyzing_baike_html_get_title(r.text) 169 | return res 170 | 171 | def clean_attr_value(attrlist:list) -> list: 172 | return_list = [] 173 | for item in attrlist: 174 | if '[' in item[1]: 175 | continue 176 | elif '' == item[1] or ' ' == item[1] or '、' == item[1]: 177 | continue 178 | elif item[1][0] == '、': 179 | return_list.append([item[0],item[1][1:],item[2]]) 180 | else: 181 | return_list.append([item[0],item[1],item[2]]) 182 | return return_list 183 | 184 | def analyzing_baike_url(url:str) -> dict: 185 | r = baike_crawler(url) 186 | res = analyzing_baike_html(r.text) 187 | return res 188 | 189 | def build_graph_from_url(url:str) -> None: 190 | graph_data = analyzing_baike_url(url) 191 | #------------------ 页面主实体构建----------------------------------------------------- 192 | entity = crud.creat_node(clabels = graph_data['entity_profile'],**{ 193 | 'entity_name':graph_data['entity_name'],'entity_profile':graph_data['entity_profile'] 194 | }) 195 | #------------------ 属性实体构建 
-------------------------------------------------------- 196 | for item in graph_data['attr_list']: 197 | if item[1] == '等': # 去掉多个实体后面的 等 字 198 | continue 199 | elif not item[2] and item[1][-1] == '等': 200 | item[1] = item[1][:-1] 201 | 202 | if item[2]: 203 | entity_profile = get_href_titile('https://baike.baidu.com'+item[2]) # 有链接的实体,就用百度给的标签 204 | attr_ent = crud.creat_node(clabels = entity_profile,**{ 205 | 'entity_name' : item[1], 206 | 'entity_profile' : entity_profile 207 | }) 208 | crud.creat_resp(entity,attr_ent,item[0]) 209 | else: 210 | attr_ent = crud.creat_node(clabels = item[0],**{ # 没有链接的实体,就用他的类别作为标签 211 | 'entity_name' : item[1], 212 | 'entity_profile' : graph_data['entity_name'] + '的' + item[0] 213 | }) 214 | crud.creat_resp(entity, attr_ent, item[0]) 215 | #-------------------- 普通实体构建 ----------------------------------------------------------- 216 | for item in graph_data['event_list']: #最后是普通的事件实体,统统使用event_list 217 | 218 | attr_ent = crud.creat_node(clabels = 'normal event',**{ 219 | 'entity_name' : item[0], 220 | 'entity_profile' : item[0], 221 | 'entity_content' : item[1], 222 | }) 223 | crud.creat_resp(entity,attr_ent,item[0]) 224 | 225 | for item in graph_data['entity_href_list']: 226 | if item and not redis_c.check_i_need_crawl(item) : # href存在且不在已经爬取过得集合中 227 | redis_c.insert_list(item) #将解析的实体链接加入到待爬取队列 228 | else: 229 | print('-url list exist') 230 | return None 231 | 232 | def crawler(): 233 | while True: 234 | href = redis_c.pop_list() # 弹出一个链接 235 | if href and not redis_c.check_i_need_crawl(href): # href存在且不在已经爬取过得集合中 236 | try: 237 | build_graph_from_url(href) 238 | print('Finish url', href) 239 | redis_c.insert_set(href) # 加入待爬取队列,失败了不需要重新爬取 240 | except Exception as e: 241 | print('Bad url','because:',str(e),href) 242 | else: 243 | print('had Finish url', href) 244 | time.sleep(0.1) 245 | 246 | 247 | if __name__ == '__main__': 248 | # build_graph_from_url('https://baike.baidu.com/item/夏朝/22101') 249 | # crawler() 250 | p_list = [] 251 | print('cpu count',multiprocessing.cpu_count()) 252 | for i in range(multiprocessing.cpu_count() - 2) : 253 | p = multiprocessing.Process(target = crawler) 254 | p.start() 255 | p_list.append(p) 256 | for k in p_list : 257 | k.join() 258 | 259 | 260 | 261 | 262 | -------------------------------------------------------------------------------- /img/1.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/1.jpeg -------------------------------------------------------------------------------- /img/1.txt: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /img/10.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/10.jpeg -------------------------------------------------------------------------------- /img/11.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/11.jpeg -------------------------------------------------------------------------------- /img/12.jpeg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/12.jpeg -------------------------------------------------------------------------------- /img/13.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/13.jpg -------------------------------------------------------------------------------- /img/14.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/14.jpeg -------------------------------------------------------------------------------- /img/15.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/15.jpg -------------------------------------------------------------------------------- /img/16.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/16.jpg -------------------------------------------------------------------------------- /img/17.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/17.jpeg -------------------------------------------------------------------------------- /img/18.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/18.jpg -------------------------------------------------------------------------------- /img/19.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/19.jpeg -------------------------------------------------------------------------------- /img/2.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/2.jpeg -------------------------------------------------------------------------------- /img/29.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/29.jpg -------------------------------------------------------------------------------- /img/3.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/3.jpg -------------------------------------------------------------------------------- /img/4.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/4.jpg -------------------------------------------------------------------------------- /img/5.jpg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/5.jpg -------------------------------------------------------------------------------- /img/6.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/6.jpg -------------------------------------------------------------------------------- /img/7.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/7.jpeg -------------------------------------------------------------------------------- /img/8.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/8.jpeg -------------------------------------------------------------------------------- /img/9.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/9.jpeg -------------------------------------------------------------------------------- /neo4j/1: -------------------------------------------------------------------------------- 1 | 111 2 | -------------------------------------------------------------------------------- /neo4j/CRUD.py: -------------------------------------------------------------------------------- 1 | # coding:utf-8 2 | from py2neo import Graph, Node, Relationship,NodeMatcher 3 | import glob 4 | import redis 5 | 6 | class redis_crud: 7 | 8 | def __init__(self): 9 | # 连接池 10 | self.redis_conn = redis.Redis(host='127.0.0.1', port=6379,db=1) 11 | self.redis_list_need_crawl = "redis_list_need_crawl" 12 | self.redis_set_had_crawl = "redis_set_had_crawl" # 爬取过的集合 13 | 14 | def insert_list(self,href) -> None: 15 | ''' 16 | 插入爬取队列,尾部添加 17 | :return: 18 | ''' 19 | self.redis_conn.rpush(self.redis_list_need_crawl, href) 20 | pass 21 | 22 | def pop_list(self) -> str: 23 | ''' 24 | 从头部获取一个待爬取的url 25 | :return: 26 | ''' 27 | return self.redis_conn.lpop(self.redis_list_need_crawl) 28 | 29 | def insert_set(self,href) -> None: 30 | ''' 31 | 插入已经爬取过得集合 32 | :return: 33 | ''' 34 | self.redis_conn.sadd(self.redis_set_had_crawl, href) 35 | 36 | 37 | def check_i_need_crawl(self,href:str) -> bool: 38 | ''' 39 | 检查已经爬取过得集合 40 | :return: 41 | ''' 42 | return self.redis_conn.sismember(self.redis_set_had_crawl,href) 43 | 44 | 45 | 46 | 47 | class neo4j_CRUD: 48 | 49 | def __init__(self): 50 | self.graph = self.get_connection() 51 | self.label = 'bkkg' 52 | 53 | #neo4j链接 54 | def get_connection(self): 55 | graph = Graph( 56 | "bolt://localhost:7687", 57 | auth = ('neo4j','exhibit-join-donor-orion-cable-4724') 58 | ) 59 | print('connection success') 60 | return graph 61 | 62 | #创建节点 63 | def creat_node(self,clabels,**kwargs) -> Node: 64 | node_e = self.select_node_or_resp(clabels,kwargs['entity_name'],kwargs['entity_profile']) 65 | if node_e in (None, '') : # 如果节点不存在 66 | node = Node(clabels,**kwargs,type=self.label) 67 | self.graph.create(node) 68 | print('creat node ',kwargs['entity_name'],kwargs['entity_profile']) 69 | return node 70 | else: 71 | print('node exist',kwargs['entity_name'],kwargs['entity_profile']) 72 | return node_e 73 | 74 | #建立关系 75 | def creat_resp(self,node1,node2,resp_name:str): 76 | resp = 
Relationship(node1, resp_name, node2) 77 | print('creat rela ',resp_name) 78 | self.graph.create(resp) 79 | 80 | #查询节点 81 | def select_node_or_resp(self,slabels,entity_name:str,entity_profile:str): 82 | matcher = NodeMatcher(self.graph) 83 | node_resp = matcher.match(slabels,entity_name=entity_name,entity_profile=entity_profile,type = self.label).first() 84 | return node_resp 85 | 86 | #切割字符串 87 | def cut_str(self,Str:str): 88 | if ':' in Str: 89 | return Str.split(':') #英文分隔 90 | else: 91 | return Str.split(':') #中文分隔 92 | 93 | 94 | if __name__ == '__main__': 95 | pass 96 | rc = redis_crud() 97 | a = 'https://baike.baidu.com/item/夏朝/22101' 98 | # rc.insert_list(a) 99 | # rc.insert_set(a) 100 | # print(rc.pop_list()) 101 | print(rc.check_i_need_crawl(a)) -------------------------------------------------------------------------------- /neo4j/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/neo4j/__init__.py -------------------------------------------------------------------------------- /one_crawler.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: UTF-8 -*- 3 | ''' 4 | @Project :BK_Knowledge_Graph 5 | @File :crawler.py 6 | @IDE :PyCharm 7 | @Author :HongYuan Guo 8 | @Date :2023/3/14 19:52 9 | ''' 10 | import requests 11 | from bs4 import BeautifulSoup 12 | import multiprocessing 13 | 14 | import re 15 | import time 16 | try: 17 | from .neo4j.CRUD import neo4j_CRUD 18 | from .neo4j.CRUD import redis_crud 19 | 20 | except: 21 | from neo4j.CRUD import neo4j_CRUD 22 | from neo4j.CRUD import redis_crud 23 | 24 | 25 | crud = neo4j_CRUD() 26 | redis_c = redis_crud() 27 | 28 | def baike_crawler(url:str) -> requests.Response: 29 | ''' 30 | 爬取百度百科内容的代码 31 | :param url: 32 | :return: 33 | ''' 34 | headers = { 35 | 'Connection': 'keep-alive', 36 | 'Host': 'baike.baidu.com', 37 | 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36 Edg/108.0.1462.54', 38 | 'cookies':'zhishiTopicRequestTime=1681813315491;' 39 | } 40 | res = requests.get(headers=headers,url=url) 41 | 42 | return res 43 | 44 | def clean(c:str) -> str: 45 | return c.replace('\xa0','').replace('\n','') 46 | 47 | def replace_zongkuohao(c:str) -> str: 48 | num = re.sub(r'\[[0-9\-]*\]', "", c) 49 | return num 50 | 51 | def analyzing_baike_html(htmlstr:str) -> dict: 52 | ''' 53 | 解析百科html 54 | :param htmlstr: 55 | :return: 56 | ''' 57 | 58 | soup = BeautifulSoup(htmlstr,'lxml') 59 | 60 | entity_name = soup.find('h1').text 61 | entity_profile = soup.find('div',attrs = {'class':'lemma-desc'}).text 62 | 63 | attr_list = [] 64 | attr_div = soup.find('div',attrs = {'class':'basic-info J-basic-info cmn-clearfix'}) 65 | 66 | attr_left = attr_div.find('dl',attrs = {'class':'basicInfo-block basicInfo-left'}) 67 | for dt in attr_left.find_all('dt'): 68 | attr_name = dt.text 69 | dd = dt.find_next('dd') 70 | 71 | dc = dd.children 72 | for item in dc: 73 | if item.name == 'a': 74 | try: 75 | href = item.attrs['href'] 76 | attr_value = item.text 77 | attr_list.append([clean(attr_name), clean(attr_value), href]) 78 | except: 79 | attr_value = item.text 80 | attr_list.append([clean(attr_name), clean(attr_value), False]) 81 | else: 82 | try: 83 | attr_value = item 84 | attr_list.append([clean(attr_name), clean(attr_value), False]) 85 | except: 86 | attr_value = 
item.text 87 | attr_list.append([clean(attr_name), clean(attr_value), False]) 88 | 89 | 90 | attr_right = attr_div.find('dl',attrs = {'class':'basicInfo-block basicInfo-right'}) 91 | for dt in attr_right.find_all('dt'): 92 | attr_name = dt.text 93 | dd = dt.find_next('dd') 94 | 95 | dc = dd.children 96 | for item in dc: 97 | if item.name == 'a': 98 | try: 99 | href = item.attrs['href'] 100 | attr_value = item.text 101 | attr_list.append([clean(attr_name), clean(attr_value), href]) 102 | except: 103 | attr_value = item.text 104 | attr_list.append([clean(attr_name), clean(attr_value), False]) 105 | else: 106 | try: 107 | attr_value = item 108 | attr_list.append([clean(attr_name), clean(attr_value), False]) 109 | except: 110 | attr_value = item.text 111 | attr_list.append([clean(attr_name), clean(attr_value), False]) 112 | 113 | attr_list = clean_attr_value(attr_list) 114 | 115 | 116 | event_list = [] 117 | if len(soup.find_all('h3')) > 1: 118 | for h2 in soup.find_all('h2'): 119 | if h2.text == '目录': 120 | continue 121 | h3 = h2.find_next('h3') 122 | while h3: 123 | h3_p_h2 = h3.find_previous('h2').text 124 | if h3_p_h2 == h2.text: 125 | # print(h2.text,h3.text) 126 | content = '' 127 | content_div = h3.find_next('div',attrs = {'class':'para'}) 128 | while content_div: 129 | content_div_h3 = content_div.find_previous('h3').text 130 | if content_div_h3 == h3.text: 131 | content = content + clean(replace_zongkuohao(content_div.text)) + '\n' 132 | content_div = content_div.find_next('div',attrs = {'class':'para'}) 133 | else: 134 | break 135 | if '扫码' not in h3.text: 136 | event_list.append([h3.text,content]) 137 | h3 = h3.find_next('h3') 138 | else: 139 | break 140 | 141 | entity_href_list = [] 142 | entity_href = soup.find_all('a') 143 | for eh in entity_href: 144 | if 'href' in eh.attrs and 'item' in eh.attrs['href']: 145 | if '秒懂' in eh.attrs['href'] : 146 | continue 147 | else: 148 | entity_href_list.append('https://baike.baidu.com'+eh.attrs['href']) 149 | 150 | 151 | return { 152 | 'entity_name':entity_name, 153 | 'entity_profile':entity_profile, 154 | 'attr_list':attr_list, 155 | 'event_list':event_list, 156 | 'entity_href_list':entity_href_list 157 | } 158 | 159 | def analyzing_baike_html_get_title(htmlstr:str) -> str: 160 | soup = BeautifulSoup(htmlstr,'lxml') 161 | try: 162 | entity_profile = soup.find('div', attrs = {'class' : 'lemma-desc'}).text 163 | return entity_profile 164 | except: 165 | print('need choise') 166 | div = soup.find('div', attrs = {'class' : 'para'}) 167 | entity_profile_href = div.find('a').attrs['href'] 168 | r = baike_crawler('https://baike.baidu.com' + entity_profile_href) 169 | new_soup = BeautifulSoup(r.text,'lxml') 170 | entity_profile = new_soup.find('div', attrs = {'class' : 'lemma-desc'}).text 171 | return entity_profile 172 | 173 | def get_href_titile(url:str) -> str: 174 | r = baike_crawler(url) 175 | res = analyzing_baike_html_get_title(r.text) 176 | return res 177 | 178 | def clean_attr_value(attrlist:list) -> list: 179 | return_list = [] 180 | for item in attrlist: 181 | if '[' in item[1]: 182 | continue 183 | elif '' == item[1] or ' ' == item[1] or '、' == item[1]: 184 | continue 185 | elif item[1][0] == '、': 186 | return_list.append([item[0],item[1][1:],item[2]]) 187 | else: 188 | return_list.append([item[0],item[1],item[2]]) 189 | return return_list 190 | 191 | 192 | def analyzing_baike_url(url:str) -> dict: 193 | r = baike_crawler(url) 194 | res = analyzing_baike_html(r.text) 195 | return res 196 | 197 | def build_graph_from_url(url:str) -> None: 
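    # (annotation) Summary of how the graph is assembled below:
    #   - the page's main entity becomes a node labeled with its lemma-desc profile text
    #   - attribute values that carry a Baike link are resolved via get_href_titile(), so the linked entity
    #     is labeled with its own profile and the (entity_name, entity_profile) pair stays unique
    #   - attribute values without a link are labeled with the attribute name and given the synthetic
    #     profile "<entity_name>的<attr_name>"
    #   - each (heading, body) pair in event_list becomes a 'normal event' node linked to the main entity
    #   - newly discovered entity links go onto the Redis crawl queue unless they were already crawled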
198 | graph_data = analyzing_baike_url(url) 199 | #------------------ 页面主实体构建----------------------------------------------------- 200 | entity = crud.creat_node(clabels = graph_data['entity_profile'],**{ 201 | 'entity_name':graph_data['entity_name'],'entity_profile':graph_data['entity_profile'] 202 | }) 203 | #------------------ 属性实体构建 -------------------------------------------------------- 204 | for item in graph_data['attr_list']: 205 | if item[1] == '等': # 去掉多个实体后面的 等 字 206 | continue 207 | elif not item[2] and item[1][-1] == '等': 208 | item[1] = item[1][:-1] 209 | 210 | if item[2]: 211 | entity_profile = get_href_titile('https://baike.baidu.com'+item[2]) # 有链接的实体,就用百度给的标签 212 | attr_ent = crud.creat_node(clabels = entity_profile,**{ 213 | 'entity_name' : item[1], 214 | 'entity_profile' : entity_profile 215 | }) 216 | crud.creat_resp(entity,attr_ent,item[0]) 217 | else: 218 | attr_ent = crud.creat_node(clabels = item[0],**{ # 没有链接的实体,就用他的类别作为标签 219 | 'entity_name' : item[1], 220 | 'entity_profile' : graph_data['entity_name'] + '的' + item[0] 221 | }) 222 | crud.creat_resp(entity, attr_ent, item[0]) 223 | #-------------------- 普通实体构建 ----------------------------------------------------------- 224 | for item in graph_data['event_list']: #最后是普通的事件实体,统统使用event_list 225 | 226 | attr_ent = crud.creat_node(clabels = 'normal event',**{ 227 | 'entity_name' : item[0], 228 | 'entity_profile' : item[0], 229 | 'entity_content' : item[1], 230 | }) 231 | crud.creat_resp(entity,attr_ent,item[0]) 232 | 233 | for item in graph_data['entity_href_list']: 234 | if item and not redis_c.check_i_need_crawl(item) : # href存在且不在已经爬取过得集合中 235 | redis_c.insert_list(item) #将解析的实体链接加入到待爬取队列 236 | else: 237 | print('-url list exist') 238 | return None 239 | 240 | def crawler(): 241 | while True: 242 | href = redis_c.pop_list() # 弹出一个链接 243 | if href and not redis_c.check_i_need_crawl(href): # href存在且不在已经爬取过得集合中 244 | try: 245 | build_graph_from_url(href) 246 | print('Finish url', href) 247 | redis_c.insert_set(href) # 加入待爬取队列,失败了不需要重新爬取 248 | except Exception as e: 249 | print('Bad url','because:',str(e),href) 250 | else: 251 | print('had Finish url', href) 252 | time.sleep(0.1) 253 | 254 | 255 | if __name__ == '__main__': 256 | crawler() 257 | # build_graph_from_url('https://baike.baidu.com/item/%E9%B2%81%E8%BF%85/36231') 258 | 259 | --------------------------------------------------------------------------------
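One last practical note: once `crawler()` has populated Neo4j, the result can be spot-checked with py2neo. This is only a minimal sketch — it assumes the same local connection settings used in `neo4j/CRUD.py` and `100w_data/con.py`, and 夏朝 is just an example entity name.

```python
# Minimal sketch: inspect one crawled entity and its outgoing relations.
# Assumes a local Neo4j reachable with the same credentials as neo4j/CRUD.py.
from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "exhibit-join-donor-orion-cable-4724"))

rows = graph.run(
    "MATCH (n {entity_name: $name, type: 'bkkg'})-[r]->(m) "
    "RETURN n.entity_name AS head, type(r) AS rel, m.entity_name AS tail LIMIT 25",
    name="夏朝",  # example entity; change to anything already crawled
).data()

for row in rows:
    print(row["head"], "-[{}]->".format(row["rel"]), row["tail"])
```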