├── 100w_data
│   ├── 1.txt
│   └── con.py
├── 250w_data
│   └── baiduyun.txt
├── README.md
├── WordVector.py
├── crawler.py
├── img
│   ├── 1.jpeg
│   ├── 1.txt
│   ├── 10.jpeg
│   ├── 11.jpeg
│   ├── 12.jpeg
│   ├── 13.jpg
│   ├── 14.jpeg
│   ├── 15.jpg
│   ├── 16.jpg
│   ├── 17.jpeg
│   ├── 18.jpg
│   ├── 19.jpeg
│   ├── 2.jpeg
│   ├── 29.jpg
│   ├── 3.jpg
│   ├── 4.jpg
│   ├── 5.jpg
│   ├── 6.jpg
│   ├── 7.jpeg
│   ├── 8.jpeg
│   └── 9.jpeg
├── neo4j
│   ├── 1
│   ├── CRUD.py
│   └── __init__.py
└── one_crawler.py

/100w_data/1.txt:
--------------------------------------------------------------------------------
1 | 链接:https://pan.baidu.com/s/1vfSURLzrZyHgJgq4Qa3O0w?pwd=vrx3
2 | 提取码:vrx3
3 | --来自百度网盘超级会员V1的分享
4 | 
5 | 链接:https://pan.baidu.com/s/1LhZ7YUKsHVhsfJIXZB9GhA?pwd=9m15
6 | 提取码:9m15
7 | --来自百度网盘超级会员V1的分享
8 | 
9 | 
10 | 
--------------------------------------------------------------------------------
/100w_data/con.py:
--------------------------------------------------------------------------------
1 | graph = Graph(
2 |     "bolt://localhost:7687",
3 |     auth = ('neo4j','exhibit-join-donor-orion-cable-4724')
4 | )
5 | 
--------------------------------------------------------------------------------
/250w_data/baiduyun.txt:
--------------------------------------------------------------------------------
1 | 链接:https://pan.baidu.com/s/12HvgoRuAU6rQU1sTkqrhIA?pwd=26z8
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Second batch: 2.5M entities + relations, 2023.5.5
2 | Excel files containing both entities and relations; the structure of this batch was optimized with SQL. The download link is in the 250w_data folder.
3 | - CALL apoc.export.csv.query("MATCH (n)-[r]-(m) RETURN id(n),n.entity_name,labels(n)[0],type(r),id(m),m.entity_name,labels(m)", "all_entities_and_relations.csv", {})
4 | 
5 | 
6 | # First batch: graph data with 1M entities
7 | I put it on Baidu Netdisk so it is easy to download and use. It was made by compressing the data directory of a Neo4j 5.5 database: after unpacking, drop it straight into the data folder under your Neo4j installation path. The login password and the download link are both in the 100w_data folder. Because the password is stored inside the database itself, you have to log in with the same password as mine first and change it afterwards.
8 | 
9 | ![](img/19.jpeg)
10 | 
11 | ![](img/29.jpg)
12 | 
13 | # How good is the graph? 2023.3.26
14 | After a few days of crawling there are already several hundred thousand entities, and in a few more days I will publish the first release with one million entities. As you can see, although many entities share the same name, they are kept apart by strict profile-based labels.
15 | 
16 | ![](img/18.jpg)
17 | ![](img/14.jpeg)
18 | ![](img/15.jpg)
19 | ![](img/16.jpg)
20 | ![](img/17.jpeg)
21 | 
22 | # The first batch of the graph 2023.3.19
23 | ![](img/10.jpeg)
24 | This is my first batch of data. It boosted my confidence a lot, but it looks monotonous, so I decided to change the nodes' labels and keep the original label in a separate property.
25 | **The effect after the change**
26 | 
27 | ![](img/11.jpeg)
28 | 
29 | ![](img/12.jpeg)
30 | 
31 | ![](img/13.jpg)
32 | 
33 | Doesn't it look a bit better now?
34 | 
35 | 
36 | # The first bug 2023.3.19
37 | While testing the code that links entities together I hit a bug: an expected element could not be found. The real cause is **polysemy** — for an ambiguous term, Baike asks you to pick one of several senses before you can reach the detailed page.
38 | The selection click itself is easy; the key question is which sense to pick.
39 | ![](img/6.jpg)
40 | ![](img/7.jpeg)
41 | 
42 | One option is to go by position — the first candidate is always the recommended one — but that is not rigorous. When a person chooses, they read the candidate descriptions and decide which one is most likely the "language" mentioned earlier. To imitate that process I brought in **gensim** for semantic analysis and let it pick the best candidate from a semantic point of view.
43 | ## Results with gensim
44 | gensim needs a pre-trained vector model. I found a suitable file online and share it here; the model is a few gigabytes and can be used directly.
45 | - Link: https://pan.baidu.com/s/1pKImAOWF9HS-NL_rWj_ASQ?pwd=zhbk  Extraction code: zhbk
46 | ```python
47 | from gensim.models import word2vec
48 | from gensim.models import Doc2Vec
49 | model = Doc2Vec.load(r'modelpath')
50 | similarity_between = model.wv.similarity(word1, cut_word)  # compute the similarity score
51 | ```
52 | Longer texts can be segmented into words first; then it is just a matter of looking for the most similar word.
53 | 
54 | ![](img/8.jpeg)
55 | 
56 | ![](img/9.jpeg)
57 | 
58 | The effect is quite clear: the similarity between 语言 (language) and 用语 (term) reaches 82%, while none of the scores for the second candidate exceed 30%.
59 | 
60 | 
61 | # Running the crawler 2023.3.12
62 | ![](img/5.jpg)
63 | 
64 | As the picture shows, for each entity I extract its name, its short profile, its attribute relations, and some body text. Here is how I intend to use each piece:
65 | - entity_name is the entity we want to build. An entity needs a unique identifier; for now I use the combination of name and profile as that identifier, so that any other entity related to it can be linked to it.
66 | - entity_profile is the entity's short description. It contains the entity's category plus whatever makes it special: here, for example, it says 中国史书中记载的第一个世袭制朝代 ("the first hereditary dynasty recorded in Chinese historical texts"), whereas the other dynasties only get the generic 中国历史朝代 ("Chinese historical dynasty").
67 | - attr_list is my favourite part, because most of the relations I can currently build come from here. "Where was the capital of the Xia dynasty" is a simple triple, and the capital, 阳城, is itself an entity. Some of these values carry a link and some are just plain names, so I treat them differently: if a value has a link, I fetch the linked page's entity name and profile, which yields a unique identifier and keeps the relation reliable; if it has no link, I synthesize the profile as "entity name + attribute name". The second attribute here, for instance, would be stored as {entity_name: "The Xia Dynasty", entity_profile: "夏朝的外文名"} (see the sketch after this list).
68 | - event_list, finally, holds the body-text sections. I find the content under these sub-headings interesting, so I store it as well.
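To make the intended record shape concrete, here is a small illustrative sketch. The field names match what `analyzing_baike_html()` in `crawler.py` returns; the concrete values and the href are made-up stand-ins for the 夏朝 example above.

```python
# Illustrative only: the shape of one parsed page and how each attr_list row becomes a triple.
# Field names follow analyzing_baike_html() in crawler.py; the values and the href are invented examples.
parsed = {
    "entity_name": "夏朝",
    "entity_profile": "中国史书中记载的第一个世袭制朝代",
    # each row: [attribute name, attribute value, href of the linked Baike page, or False for plain text]
    "attr_list": [
        ["都城", "阳城", "/item/阳城"],         # linked value: resolve its own (name, profile) pair
        ["外文名", "The Xia Dynasty", False],   # plain text: profile is synthesized as "夏朝的外文名"
    ],
    # each row: [section heading, section body text]
    "event_list": [
        ["国号", "……"],
    ],
}

# every attr_list row yields one (head entity, relation, tail entity) triple
for attr_name, attr_value, href in parsed["attr_list"]:
    print((parsed["entity_name"], attr_name, attr_value))
```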
69 | 
70 | # What should I do first? 2023.3.05
71 | The schema, without a doubt — the schema is the soul of a knowledge graph. Still, I chose not to spend my effort on the schema first: encyclopedia metadata already comes with a schema of its own, so I build the graph first and then plan to derive a schema from the category data that has already been collected.
72 | 
73 | ## Defining the schema
74 | ### Concept hierarchy
75 | Organize the knowledge the graph should express into a hierarchy of concepts, making sure that every parent category fully subsumes the concepts represented by its child categories.
76 | Existing resources such as http://schema.org can be consulted to fix the top-level concepts manually.
77 | The hierarchy must be robust, easy to maintain and extend, and able to absorb new requirements.
78 | - Design a hypernym/hyponym mining system to automatically build large numbers of fine-grained concepts (hypernyms).
79 | ### Relations and properties
80 | Once the concept hierarchy is defined, each category still needs its relations and properties. Relations describe links between different entities; properties describe an entity's intrinsic features.
81 | ### Constraints
82 | Constraints on relations and properties keep the data consistent and keep out abnormal values — for example, age must be of type Int and unique (single-valued).
83 | 
84 | ## Construction process
85 | ### Automated construction
86 | #### Getting encyclopedia pages
87 | - Seed sites: they determine what kind of graph you end up with, or at least the crawl priority.
88 | ![](img/1.jpeg)
89 | 
90 | - Depth traversal: mine the links in the body text, check whether each one is a Baike entity page, and if so push it onto the processing queue.
91 | ![](img/2.jpeg)
92 | 
93 | #### Extracting information
94 | - Attribute/relation extraction
95 | Part of every Baike page is a curated attribute box. It is the easiest place to obtain entity-attribute and entity-entity data, and the entities that appear there usually carry a lot of information.
96 | ![](img/3.jpg)
97 | 
98 | - Events
99 | These sections mainly enrich the content of the encyclopedia entity.
100 | ![](img/4.jpg)
--------------------------------------------------------------------------------
/WordVector.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/env python
2 | # coding=utf-8
3 | # Creative time 2020/3/20
4 | # Creator HongYuan Guo
5 | 
6 | from gensim.models import word2vec
7 | from gensim.models import Doc2Vec
8 | import jieba.posseg as psg
9 | import jieba
10 | import time
11 | 
12 | class WordVector:
13 | 
14 |     def __init__(self):
15 |         self.corpus_path = 'Corpus\corpus_jingdong.txt'
16 |         self.static_model_path = 'baike_26g_news_13g_novel_229g.model'
17 | 
18 |     def Train_model(self, path, Size):
19 |         sentences = word2vec.Text8Corpus(path)  # load the corpus
20 |         model = word2vec.Word2Vec(sentences, size=Size)  # window defaults to 5 (gensim>=4 renamed size to vector_size)
21 |         model.save('static_models\WordVector_JD60W_' + str(Size) + '.model')
22 |         return model
23 | 
24 |     def get_model(self, path, size):
25 |         try:
26 |             model = Doc2Vec.load('static_models\WordVector_JD60W_' + str(size) + '.model')
27 |             print('模型已存在')
28 |         except Exception as e:
29 |             print('正在训练模型')
30 |             self.Train_model(path, size)
31 |             print('训练结束')
32 |         finally:
33 |             model = Doc2Vec.load('static_models\WordVector_JD60W_' + str(size) + '.model')
34 |             return model
35 | 
36 |     def Similarity_New(self, model, word1, word2):
37 |         word_list1 = []
38 |         word_list2 = []
39 |         similarity_TOP1 = 0
40 |         similarity_list = model.wv.most_similar(word2, topn=1)  # top-1 similarity for word2, used as a normalizer
41 |         for item in similarity_list:
42 |             similarity_TOP1 = item[1]
43 |         similarity_between = model.wv.similarity(word1, word2)  # similarity between the two words
44 | 
45 |         # print('{}和{}的相似程度为{}'.format(word1, word2, str(format(similarity_between / similarity_TOP1 * 100, '.2f')) + '%'))
46 |         for k, v in psg.cut(word1):
47 |             word_list1.append([k, v])
48 |         for k, v in psg.cut(word2):
49 |             word_list2.append([k, v])
50 |         print('{} 和 {} 的相似系数为{},相似程度为:{}'.format(word1, word2, str(similarity_between), str(format(similarity_between / similarity_TOP1 * 100, '.2f')) + '%'))
51 |         return [word_list1, word_list2, similarity_between / similarity_TOP1]
52 | 
53 | if __name__ == '__main__':
54 |     wv = WordVector()
55 |     # model = wv.get_model(wv.corpus_path, 200)
56 |     import time
57 |     load_start = time.time()
58 |     model = 
Doc2Vec.load(r'static_models\baike_26g_news_13g_novel_229g.model') 59 | load_finish = time.time() 60 | print('加载模型时长:',load_finish-load_start) 61 | word1 = u'语言' 62 | word2 = u'连横所著小说' 63 | similarity_TOP1 = 0 64 | 65 | # for cut_word, v in psg.cut(word2): 66 | for cut_word in jieba.lcut(word2 ,cut_all=True) : 67 | sim_start = time.time() 68 | similarity_list = model.wv.most_similar(cut_word, topn=1) #获取排名第一的相似之=值 69 | sim_stop = time.time() 70 | print('相似度计算时长:',sim_stop-sim_start) 71 | 72 | for item in similarity_list: 73 | similarity_TOP1 = [item[0],item[1]] 74 | similarity_between = model.wv.similarity(word1, cut_word) #计算相似值 75 | print('{}和{}的相似系数为{}'.format(word1, cut_word, str(similarity_between))) 76 | # print('与{}最相似的词为{},相似系数为{}'.format(word1,similarity_TOP1[0],similarity_TOP1[1])) 77 | print ( '{}和{}的最终相似程度为{}'.format (word1, cut_word, str (format(similarity_between / similarity_TOP1[1] * 100, '.2f')) + '%')) 78 | # for k,v in psg.cut(word1): 79 | # print(k,v) 80 | # for k,v in psg.cut(word2): 81 | # print(k,v) -------------------------------------------------------------------------------- /crawler.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: UTF-8 -*- 3 | ''' 4 | @Project :BK_Knowledge_Graph 5 | @File :crawler.py 6 | @IDE :PyCharm 7 | @Author :HongYuan Guo 8 | @Date :2023/3/14 19:52 9 | ''' 10 | import requests 11 | from bs4 import BeautifulSoup 12 | import multiprocessing 13 | 14 | import re 15 | import time 16 | try: 17 | from .neo4j.CRUD import neo4j_CRUD 18 | from .neo4j.CRUD import redis_crud 19 | 20 | except: 21 | from neo4j.CRUD import neo4j_CRUD 22 | from neo4j.CRUD import redis_crud 23 | 24 | 25 | crud = neo4j_CRUD() 26 | redis_c = redis_crud() 27 | 28 | def baike_crawler(url:str) -> requests.Response: 29 | ''' 30 | 爬取百度百科内容的代码 31 | :param url: 32 | :return: 33 | ''' 34 | res = requests.get(url) 35 | return res 36 | 37 | def clean(c:str) -> str: 38 | return c.replace('\xa0','').replace('\n','') 39 | 40 | def replace_zongkuohao(c:str) -> str: 41 | num = re.sub(r'\[[0-9\-]*\]', "", c) 42 | return num 43 | 44 | def analyzing_baike_html(htmlstr:str) -> dict: 45 | ''' 46 | 解析百科html 47 | :param htmlstr: 48 | :return: 49 | ''' 50 | 51 | soup = BeautifulSoup(htmlstr,'lxml') 52 | 53 | entity_name = soup.find('h1').text 54 | entity_profile = soup.find('div',attrs = {'class':'lemma-desc'}).text 55 | 56 | attr_list = [] 57 | attr_div = soup.find('div',attrs = {'class':'basic-info J-basic-info cmn-clearfix'}) 58 | 59 | attr_left = attr_div.find('dl',attrs = {'class':'basicInfo-block basicInfo-left'}) 60 | for dt in attr_left.find_all('dt'): 61 | attr_name = dt.text 62 | dd = dt.find_next('dd') 63 | 64 | dc = dd.children 65 | for item in dc: 66 | if item.name == 'a': 67 | try: 68 | href = item.attrs['href'] 69 | attr_value = item.text 70 | attr_list.append([clean(attr_name), clean(attr_value), href]) 71 | except: 72 | attr_value = item.text 73 | attr_list.append([clean(attr_name), clean(attr_value), False]) 74 | else: 75 | try: 76 | attr_value = item 77 | attr_list.append([clean(attr_name), clean(attr_value), False]) 78 | except: 79 | attr_value = item.text 80 | attr_list.append([clean(attr_name), clean(attr_value), False]) 81 | 82 | 83 | attr_right = attr_div.find('dl',attrs = {'class':'basicInfo-block basicInfo-right'}) 84 | for dt in attr_right.find_all('dt'): 85 | attr_name = dt.text 86 | dd = dt.find_next('dd') 87 | 88 | dc = dd.children 89 | for item in dc: 90 | if item.name == 'a': 91 | try: 
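                    # (annotation) An <a> child may lack an href, and non-tag children are bare text nodes;
                    # the try/except pairs below keep the href when it exists — so the linked entity can later
                    # be resolved to its own (entity_name, entity_profile) pair via get_href_titile() — and
                    # otherwise fall back to [attr_name, text, False].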
92 | href = item.attrs['href'] 93 | attr_value = item.text 94 | attr_list.append([clean(attr_name), clean(attr_value), href]) 95 | except: 96 | attr_value = item.text 97 | attr_list.append([clean(attr_name), clean(attr_value), False]) 98 | else: 99 | try: 100 | attr_value = item 101 | attr_list.append([clean(attr_name), clean(attr_value), False]) 102 | except: 103 | attr_value = item.text 104 | attr_list.append([clean(attr_name), clean(attr_value), False]) 105 | 106 | attr_list = clean_attr_value(attr_list) 107 | 108 | 109 | event_list = [] 110 | if len(soup.find_all('h3')) > 1: 111 | for h2 in soup.find_all('h2'): 112 | if h2.text == '目录': 113 | continue 114 | h3 = h2.find_next('h3') 115 | while h3: 116 | h3_p_h2 = h3.find_previous('h2').text 117 | if h3_p_h2 == h2.text: 118 | # print(h2.text,h3.text) 119 | content = '' 120 | content_div = h3.find_next('div',attrs = {'class':'para'}) 121 | while content_div: 122 | content_div_h3 = content_div.find_previous('h3').text 123 | if content_div_h3 == h3.text: 124 | content = content + clean(replace_zongkuohao(content_div.text)) + '\n' 125 | content_div = content_div.find_next('div',attrs = {'class':'para'}) 126 | else: 127 | break 128 | if '扫码' not in h3.text: 129 | event_list.append([h3.text,content]) 130 | h3 = h3.find_next('h3') 131 | else: 132 | break 133 | 134 | entity_href_list = [] 135 | entity_href = soup.find_all('a') 136 | for eh in entity_href: 137 | if 'href' in eh.attrs and 'item' in eh.attrs['href']: 138 | if '秒懂' in eh.attrs['href'] : 139 | continue 140 | else: 141 | entity_href_list.append('https://baike.baidu.com'+eh.attrs['href']) 142 | 143 | 144 | return { 145 | 'entity_name':entity_name, 146 | 'entity_profile':entity_profile, 147 | 'attr_list':attr_list, 148 | 'event_list':event_list, 149 | 'entity_href_list':entity_href_list 150 | } 151 | 152 | def analyzing_baike_html_get_title(htmlstr:str) -> str: 153 | soup = BeautifulSoup(htmlstr,'lxml') 154 | try: 155 | entity_profile = soup.find('div', attrs = {'class' : 'lemma-desc'}).text 156 | return entity_profile 157 | except: 158 | print('need choise') 159 | div = soup.find('div', attrs = {'class' : 'para'}) 160 | entity_profile_href = div.find('a').attrs['href'] 161 | r = baike_crawler('https://baike.baidu.com' + entity_profile_href) 162 | new_soup = BeautifulSoup(r.text,'lxml') 163 | entity_profile = new_soup.find('div', attrs = {'class' : 'lemma-desc'}).text 164 | return entity_profile 165 | 166 | def get_href_titile(url:str) -> str: 167 | r = baike_crawler(url) 168 | res = analyzing_baike_html_get_title(r.text) 169 | return res 170 | 171 | def clean_attr_value(attrlist:list) -> list: 172 | return_list = [] 173 | for item in attrlist: 174 | if '[' in item[1]: 175 | continue 176 | elif '' == item[1] or ' ' == item[1] or '、' == item[1]: 177 | continue 178 | elif item[1][0] == '、': 179 | return_list.append([item[0],item[1][1:],item[2]]) 180 | else: 181 | return_list.append([item[0],item[1],item[2]]) 182 | return return_list 183 | 184 | def analyzing_baike_url(url:str) -> dict: 185 | r = baike_crawler(url) 186 | res = analyzing_baike_html(r.text) 187 | return res 188 | 189 | def build_graph_from_url(url:str) -> None: 190 | graph_data = analyzing_baike_url(url) 191 | #------------------ 页面主实体构建----------------------------------------------------- 192 | entity = crud.creat_node(clabels = graph_data['entity_profile'],**{ 193 | 'entity_name':graph_data['entity_name'],'entity_profile':graph_data['entity_profile'] 194 | }) 195 | #------------------ 属性实体构建 
-------------------------------------------------------- 196 | for item in graph_data['attr_list']: 197 | if item[1] == '等': # 去掉多个实体后面的 等 字 198 | continue 199 | elif not item[2] and item[1][-1] == '等': 200 | item[1] = item[1][:-1] 201 | 202 | if item[2]: 203 | entity_profile = get_href_titile('https://baike.baidu.com'+item[2]) # 有链接的实体,就用百度给的标签 204 | attr_ent = crud.creat_node(clabels = entity_profile,**{ 205 | 'entity_name' : item[1], 206 | 'entity_profile' : entity_profile 207 | }) 208 | crud.creat_resp(entity,attr_ent,item[0]) 209 | else: 210 | attr_ent = crud.creat_node(clabels = item[0],**{ # 没有链接的实体,就用他的类别作为标签 211 | 'entity_name' : item[1], 212 | 'entity_profile' : graph_data['entity_name'] + '的' + item[0] 213 | }) 214 | crud.creat_resp(entity, attr_ent, item[0]) 215 | #-------------------- 普通实体构建 ----------------------------------------------------------- 216 | for item in graph_data['event_list']: #最后是普通的事件实体,统统使用event_list 217 | 218 | attr_ent = crud.creat_node(clabels = 'normal event',**{ 219 | 'entity_name' : item[0], 220 | 'entity_profile' : item[0], 221 | 'entity_content' : item[1], 222 | }) 223 | crud.creat_resp(entity,attr_ent,item[0]) 224 | 225 | for item in graph_data['entity_href_list']: 226 | if item and not redis_c.check_i_need_crawl(item) : # href存在且不在已经爬取过得集合中 227 | redis_c.insert_list(item) #将解析的实体链接加入到待爬取队列 228 | else: 229 | print('-url list exist') 230 | return None 231 | 232 | def crawler(): 233 | while True: 234 | href = redis_c.pop_list() # 弹出一个链接 235 | if href and not redis_c.check_i_need_crawl(href): # href存在且不在已经爬取过得集合中 236 | try: 237 | build_graph_from_url(href) 238 | print('Finish url', href) 239 | redis_c.insert_set(href) # 加入待爬取队列,失败了不需要重新爬取 240 | except Exception as e: 241 | print('Bad url','because:',str(e),href) 242 | else: 243 | print('had Finish url', href) 244 | time.sleep(0.1) 245 | 246 | 247 | if __name__ == '__main__': 248 | # build_graph_from_url('https://baike.baidu.com/item/夏朝/22101') 249 | # crawler() 250 | p_list = [] 251 | print('cpu count',multiprocessing.cpu_count()) 252 | for i in range(multiprocessing.cpu_count() - 2) : 253 | p = multiprocessing.Process(target = crawler) 254 | p.start() 255 | p_list.append(p) 256 | for k in p_list : 257 | k.join() 258 | 259 | 260 | 261 | 262 | -------------------------------------------------------------------------------- /img/1.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/1.jpeg -------------------------------------------------------------------------------- /img/1.txt: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /img/10.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/10.jpeg -------------------------------------------------------------------------------- /img/11.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/11.jpeg -------------------------------------------------------------------------------- /img/12.jpeg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/12.jpeg -------------------------------------------------------------------------------- /img/13.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/13.jpg -------------------------------------------------------------------------------- /img/14.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/14.jpeg -------------------------------------------------------------------------------- /img/15.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/15.jpg -------------------------------------------------------------------------------- /img/16.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/16.jpg -------------------------------------------------------------------------------- /img/17.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/17.jpeg -------------------------------------------------------------------------------- /img/18.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/18.jpg -------------------------------------------------------------------------------- /img/19.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/19.jpeg -------------------------------------------------------------------------------- /img/2.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/2.jpeg -------------------------------------------------------------------------------- /img/29.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/29.jpg -------------------------------------------------------------------------------- /img/3.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/3.jpg -------------------------------------------------------------------------------- /img/4.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/4.jpg -------------------------------------------------------------------------------- /img/5.jpg: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/5.jpg -------------------------------------------------------------------------------- /img/6.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/6.jpg -------------------------------------------------------------------------------- /img/7.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/7.jpeg -------------------------------------------------------------------------------- /img/8.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/8.jpeg -------------------------------------------------------------------------------- /img/9.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/img/9.jpeg -------------------------------------------------------------------------------- /neo4j/1: -------------------------------------------------------------------------------- 1 | 111 2 | -------------------------------------------------------------------------------- /neo4j/CRUD.py: -------------------------------------------------------------------------------- 1 | # coding:utf-8 2 | from py2neo import Graph, Node, Relationship,NodeMatcher 3 | import glob 4 | import redis 5 | 6 | class redis_crud: 7 | 8 | def __init__(self): 9 | # 连接池 10 | self.redis_conn = redis.Redis(host='127.0.0.1', port=6379,db=1) 11 | self.redis_list_need_crawl = "redis_list_need_crawl" 12 | self.redis_set_had_crawl = "redis_set_had_crawl" # 爬取过的集合 13 | 14 | def insert_list(self,href) -> None: 15 | ''' 16 | 插入爬取队列,尾部添加 17 | :return: 18 | ''' 19 | self.redis_conn.rpush(self.redis_list_need_crawl, href) 20 | pass 21 | 22 | def pop_list(self) -> str: 23 | ''' 24 | 从头部获取一个待爬取的url 25 | :return: 26 | ''' 27 | return self.redis_conn.lpop(self.redis_list_need_crawl) 28 | 29 | def insert_set(self,href) -> None: 30 | ''' 31 | 插入已经爬取过得集合 32 | :return: 33 | ''' 34 | self.redis_conn.sadd(self.redis_set_had_crawl, href) 35 | 36 | 37 | def check_i_need_crawl(self,href:str) -> bool: 38 | ''' 39 | 检查已经爬取过得集合 40 | :return: 41 | ''' 42 | return self.redis_conn.sismember(self.redis_set_had_crawl,href) 43 | 44 | 45 | 46 | 47 | class neo4j_CRUD: 48 | 49 | def __init__(self): 50 | self.graph = self.get_connection() 51 | self.label = 'bkkg' 52 | 53 | #neo4j链接 54 | def get_connection(self): 55 | graph = Graph( 56 | "bolt://localhost:7687", 57 | auth = ('neo4j','exhibit-join-donor-orion-cable-4724') 58 | ) 59 | print('connection success') 60 | return graph 61 | 62 | #创建节点 63 | def creat_node(self,clabels,**kwargs) -> Node: 64 | node_e = self.select_node_or_resp(clabels,kwargs['entity_name'],kwargs['entity_profile']) 65 | if node_e in (None, '') : # 如果节点不存在 66 | node = Node(clabels,**kwargs,type=self.label) 67 | self.graph.create(node) 68 | print('creat node ',kwargs['entity_name'],kwargs['entity_profile']) 69 | return node 70 | else: 71 | print('node exist',kwargs['entity_name'],kwargs['entity_profile']) 72 | return node_e 73 | 74 | #建立关系 75 | def creat_resp(self,node1,node2,resp_name:str): 76 | resp = 
Relationship(node1, resp_name, node2) 77 | print('creat rela ',resp_name) 78 | self.graph.create(resp) 79 | 80 | #查询节点 81 | def select_node_or_resp(self,slabels,entity_name:str,entity_profile:str): 82 | matcher = NodeMatcher(self.graph) 83 | node_resp = matcher.match(slabels,entity_name=entity_name,entity_profile=entity_profile,type = self.label).first() 84 | return node_resp 85 | 86 | #切割字符串 87 | def cut_str(self,Str:str): 88 | if ':' in Str: 89 | return Str.split(':') #英文分隔 90 | else: 91 | return Str.split(':') #中文分隔 92 | 93 | 94 | if __name__ == '__main__': 95 | pass 96 | rc = redis_crud() 97 | a = 'https://baike.baidu.com/item/夏朝/22101' 98 | # rc.insert_list(a) 99 | # rc.insert_set(a) 100 | # print(rc.pop_list()) 101 | print(rc.check_i_need_crawl(a)) -------------------------------------------------------------------------------- /neo4j/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/GuoHongYuan/BKnowledgeGraph/9bb482b2995e2b050e9e9fa57ef78d5780bfe69f/neo4j/__init__.py -------------------------------------------------------------------------------- /one_crawler.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: UTF-8 -*- 3 | ''' 4 | @Project :BK_Knowledge_Graph 5 | @File :crawler.py 6 | @IDE :PyCharm 7 | @Author :HongYuan Guo 8 | @Date :2023/3/14 19:52 9 | ''' 10 | import requests 11 | from bs4 import BeautifulSoup 12 | import multiprocessing 13 | 14 | import re 15 | import time 16 | try: 17 | from .neo4j.CRUD import neo4j_CRUD 18 | from .neo4j.CRUD import redis_crud 19 | 20 | except: 21 | from neo4j.CRUD import neo4j_CRUD 22 | from neo4j.CRUD import redis_crud 23 | 24 | 25 | crud = neo4j_CRUD() 26 | redis_c = redis_crud() 27 | 28 | def baike_crawler(url:str) -> requests.Response: 29 | ''' 30 | 爬取百度百科内容的代码 31 | :param url: 32 | :return: 33 | ''' 34 | headers = { 35 | 'Connection': 'keep-alive', 36 | 'Host': 'baike.baidu.com', 37 | 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36 Edg/108.0.1462.54', 38 | 'cookies':'zhishiTopicRequestTime=1681813315491;' 39 | } 40 | res = requests.get(headers=headers,url=url) 41 | 42 | return res 43 | 44 | def clean(c:str) -> str: 45 | return c.replace('\xa0','').replace('\n','') 46 | 47 | def replace_zongkuohao(c:str) -> str: 48 | num = re.sub(r'\[[0-9\-]*\]', "", c) 49 | return num 50 | 51 | def analyzing_baike_html(htmlstr:str) -> dict: 52 | ''' 53 | 解析百科html 54 | :param htmlstr: 55 | :return: 56 | ''' 57 | 58 | soup = BeautifulSoup(htmlstr,'lxml') 59 | 60 | entity_name = soup.find('h1').text 61 | entity_profile = soup.find('div',attrs = {'class':'lemma-desc'}).text 62 | 63 | attr_list = [] 64 | attr_div = soup.find('div',attrs = {'class':'basic-info J-basic-info cmn-clearfix'}) 65 | 66 | attr_left = attr_div.find('dl',attrs = {'class':'basicInfo-block basicInfo-left'}) 67 | for dt in attr_left.find_all('dt'): 68 | attr_name = dt.text 69 | dd = dt.find_next('dd') 70 | 71 | dc = dd.children 72 | for item in dc: 73 | if item.name == 'a': 74 | try: 75 | href = item.attrs['href'] 76 | attr_value = item.text 77 | attr_list.append([clean(attr_name), clean(attr_value), href]) 78 | except: 79 | attr_value = item.text 80 | attr_list.append([clean(attr_name), clean(attr_value), False]) 81 | else: 82 | try: 83 | attr_value = item 84 | attr_list.append([clean(attr_name), clean(attr_value), False]) 85 | except: 86 | attr_value = 
item.text 87 | attr_list.append([clean(attr_name), clean(attr_value), False]) 88 | 89 | 90 | attr_right = attr_div.find('dl',attrs = {'class':'basicInfo-block basicInfo-right'}) 91 | for dt in attr_right.find_all('dt'): 92 | attr_name = dt.text 93 | dd = dt.find_next('dd') 94 | 95 | dc = dd.children 96 | for item in dc: 97 | if item.name == 'a': 98 | try: 99 | href = item.attrs['href'] 100 | attr_value = item.text 101 | attr_list.append([clean(attr_name), clean(attr_value), href]) 102 | except: 103 | attr_value = item.text 104 | attr_list.append([clean(attr_name), clean(attr_value), False]) 105 | else: 106 | try: 107 | attr_value = item 108 | attr_list.append([clean(attr_name), clean(attr_value), False]) 109 | except: 110 | attr_value = item.text 111 | attr_list.append([clean(attr_name), clean(attr_value), False]) 112 | 113 | attr_list = clean_attr_value(attr_list) 114 | 115 | 116 | event_list = [] 117 | if len(soup.find_all('h3')) > 1: 118 | for h2 in soup.find_all('h2'): 119 | if h2.text == '目录': 120 | continue 121 | h3 = h2.find_next('h3') 122 | while h3: 123 | h3_p_h2 = h3.find_previous('h2').text 124 | if h3_p_h2 == h2.text: 125 | # print(h2.text,h3.text) 126 | content = '' 127 | content_div = h3.find_next('div',attrs = {'class':'para'}) 128 | while content_div: 129 | content_div_h3 = content_div.find_previous('h3').text 130 | if content_div_h3 == h3.text: 131 | content = content + clean(replace_zongkuohao(content_div.text)) + '\n' 132 | content_div = content_div.find_next('div',attrs = {'class':'para'}) 133 | else: 134 | break 135 | if '扫码' not in h3.text: 136 | event_list.append([h3.text,content]) 137 | h3 = h3.find_next('h3') 138 | else: 139 | break 140 | 141 | entity_href_list = [] 142 | entity_href = soup.find_all('a') 143 | for eh in entity_href: 144 | if 'href' in eh.attrs and 'item' in eh.attrs['href']: 145 | if '秒懂' in eh.attrs['href'] : 146 | continue 147 | else: 148 | entity_href_list.append('https://baike.baidu.com'+eh.attrs['href']) 149 | 150 | 151 | return { 152 | 'entity_name':entity_name, 153 | 'entity_profile':entity_profile, 154 | 'attr_list':attr_list, 155 | 'event_list':event_list, 156 | 'entity_href_list':entity_href_list 157 | } 158 | 159 | def analyzing_baike_html_get_title(htmlstr:str) -> str: 160 | soup = BeautifulSoup(htmlstr,'lxml') 161 | try: 162 | entity_profile = soup.find('div', attrs = {'class' : 'lemma-desc'}).text 163 | return entity_profile 164 | except: 165 | print('need choise') 166 | div = soup.find('div', attrs = {'class' : 'para'}) 167 | entity_profile_href = div.find('a').attrs['href'] 168 | r = baike_crawler('https://baike.baidu.com' + entity_profile_href) 169 | new_soup = BeautifulSoup(r.text,'lxml') 170 | entity_profile = new_soup.find('div', attrs = {'class' : 'lemma-desc'}).text 171 | return entity_profile 172 | 173 | def get_href_titile(url:str) -> str: 174 | r = baike_crawler(url) 175 | res = analyzing_baike_html_get_title(r.text) 176 | return res 177 | 178 | def clean_attr_value(attrlist:list) -> list: 179 | return_list = [] 180 | for item in attrlist: 181 | if '[' in item[1]: 182 | continue 183 | elif '' == item[1] or ' ' == item[1] or '、' == item[1]: 184 | continue 185 | elif item[1][0] == '、': 186 | return_list.append([item[0],item[1][1:],item[2]]) 187 | else: 188 | return_list.append([item[0],item[1],item[2]]) 189 | return return_list 190 | 191 | 192 | def analyzing_baike_url(url:str) -> dict: 193 | r = baike_crawler(url) 194 | res = analyzing_baike_html(r.text) 195 | return res 196 | 197 | def build_graph_from_url(url:str) -> None: 
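    # (annotation) Summary of how the graph is assembled below:
    #   - the page's main entity becomes a node labeled with its lemma-desc profile text
    #   - attribute values that carry a Baike link are resolved via get_href_titile(), so the linked entity
    #     is labeled with its own profile and the (entity_name, entity_profile) pair stays unique
    #   - attribute values without a link are labeled with the attribute name and given the synthetic
    #     profile "<entity_name>的<attr_name>"
    #   - each (heading, body) pair in event_list becomes a 'normal event' node linked to the main entity
    #   - newly discovered entity links go onto the Redis crawl queue unless they were already crawled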
198 | graph_data = analyzing_baike_url(url) 199 | #------------------ 页面主实体构建----------------------------------------------------- 200 | entity = crud.creat_node(clabels = graph_data['entity_profile'],**{ 201 | 'entity_name':graph_data['entity_name'],'entity_profile':graph_data['entity_profile'] 202 | }) 203 | #------------------ 属性实体构建 -------------------------------------------------------- 204 | for item in graph_data['attr_list']: 205 | if item[1] == '等': # 去掉多个实体后面的 等 字 206 | continue 207 | elif not item[2] and item[1][-1] == '等': 208 | item[1] = item[1][:-1] 209 | 210 | if item[2]: 211 | entity_profile = get_href_titile('https://baike.baidu.com'+item[2]) # 有链接的实体,就用百度给的标签 212 | attr_ent = crud.creat_node(clabels = entity_profile,**{ 213 | 'entity_name' : item[1], 214 | 'entity_profile' : entity_profile 215 | }) 216 | crud.creat_resp(entity,attr_ent,item[0]) 217 | else: 218 | attr_ent = crud.creat_node(clabels = item[0],**{ # 没有链接的实体,就用他的类别作为标签 219 | 'entity_name' : item[1], 220 | 'entity_profile' : graph_data['entity_name'] + '的' + item[0] 221 | }) 222 | crud.creat_resp(entity, attr_ent, item[0]) 223 | #-------------------- 普通实体构建 ----------------------------------------------------------- 224 | for item in graph_data['event_list']: #最后是普通的事件实体,统统使用event_list 225 | 226 | attr_ent = crud.creat_node(clabels = 'normal event',**{ 227 | 'entity_name' : item[0], 228 | 'entity_profile' : item[0], 229 | 'entity_content' : item[1], 230 | }) 231 | crud.creat_resp(entity,attr_ent,item[0]) 232 | 233 | for item in graph_data['entity_href_list']: 234 | if item and not redis_c.check_i_need_crawl(item) : # href存在且不在已经爬取过得集合中 235 | redis_c.insert_list(item) #将解析的实体链接加入到待爬取队列 236 | else: 237 | print('-url list exist') 238 | return None 239 | 240 | def crawler(): 241 | while True: 242 | href = redis_c.pop_list() # 弹出一个链接 243 | if href and not redis_c.check_i_need_crawl(href): # href存在且不在已经爬取过得集合中 244 | try: 245 | build_graph_from_url(href) 246 | print('Finish url', href) 247 | redis_c.insert_set(href) # 加入待爬取队列,失败了不需要重新爬取 248 | except Exception as e: 249 | print('Bad url','because:',str(e),href) 250 | else: 251 | print('had Finish url', href) 252 | time.sleep(0.1) 253 | 254 | 255 | if __name__ == '__main__': 256 | crawler() 257 | # build_graph_from_url('https://baike.baidu.com/item/%E9%B2%81%E8%BF%85/36231') 258 | 259 | --------------------------------------------------------------------------------
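One last practical note: once `crawler()` has populated Neo4j, the result can be spot-checked with py2neo. This is only a minimal sketch — it assumes the same local connection settings used in `neo4j/CRUD.py` and `100w_data/con.py`, and 夏朝 is just an example entity name.

```python
# Minimal sketch: inspect one crawled entity and its outgoing relations.
# Assumes a local Neo4j reachable with the same credentials as neo4j/CRUD.py.
from py2neo import Graph

graph = Graph("bolt://localhost:7687", auth=("neo4j", "exhibit-join-donor-orion-cable-4724"))

rows = graph.run(
    "MATCH (n {entity_name: $name, type: 'bkkg'})-[r]->(m) "
    "RETURN n.entity_name AS head, type(r) AS rel, m.entity_name AS tail LIMIT 25",
    name="夏朝",  # example entity; change to anything already crawled
).data()

for row in rows:
    print(row["head"], "-[{}]->".format(row["rel"]), row["tail"])
```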