├── README.md
├── convert_to_text.py
├── segment_by_jieba.py
├── segment_by_pinyin.py
├── segment_char.py
├── test_pinyin.py
├── test_word2vec.py
├── test_word2vec_char.py
├── test_word2vec_pinyin.py
├── train_word2vec.py
├── train_word2vec_char.py
└── train_word2vec_pinyin.py
/README.md:
--------------------------------------------------------------------------------
# python3_wiki_word2vec
Train Chinese word vectors, character vectors, and pinyin vectors on the Chinese Wikipedia dump with Python 3.
Already-processed files:
Aliyun Drive link: https://www.aliyundrive.com/s/p35V1WmiBJS
# Dependencies
```
gensim
jieba
pypinyin
opencc-python-reimplemented
```
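All four can be installed in one go (assuming `pip` points at your Python 3 environment):
```bash
pip install gensim jieba pypinyin opencc-python-reimplemented
```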

## Step 1
Strip the HTML markup from the raw dump and save the result as a plain-text file:
```bash
python convert_to_text.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.txt
```
## Step 2
Convert traditional characters to simplified ones. First install the converter:
```bash
pip install opencc-python-reimplemented
```
Then run it on the command line:
```bash
python -m opencc -c t2s -i wiki.zh.txt -o wiki.zh.simp.txt
```
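If you would rather do the conversion inside Python, here is a minimal sketch using the same package (file names match the command above; adjust as needed):
```python
# Minimal sketch: traditional-to-simplified conversion with
# opencc-python-reimplemented, processed line by line to keep memory flat.
from opencc import OpenCC

cc = OpenCC('t2s')  # t2s = traditional Chinese to simplified Chinese
with open('wiki.zh.txt', encoding='utf-8') as fin, \
     open('wiki.zh.simp.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(cc.convert(line))
```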
## Step 3
### Build the word-vector training corpus
Segment the text into words with jieba. If the module is not installed yet:
```bash
pip install jieba
```
Then run:
```bash
python segment_by_jieba.py
```
Finally train the vectors:
```bash
python train_word2vec.py
```
### Build the character-vector training corpus
First run:
```bash
python segment_char.py
```
Then run:
```bash
python train_word2vec_char.py
```
### Build the pinyin-vector training corpus
Install the following package first:
```bash
pip install pypinyin
```
You can test it with:
```bash
python test_pinyin.py
```
Then run:
```bash
python segment_by_pinyin.py
```
Finally run:
```bash
python train_word2vec_pinyin.py
```
## Load the trained models and test them
### Load the word vectors
```bash
python test_word2vec.py
```
Result:
```
==============================
Words most similar to 孙悟空:
悟空 0.8302149176597595
牛魔王 0.8158780336380005
唐僧 0.7910414934158325
猪八戒 0.7805720567703247
沙悟净 0.7804343104362488
沙僧 0.7628636956214905
铁扇公主 0.7468816637992859
哪吒 0.7401531338691711
唐三藏 0.7286755442619324
齐天大圣 0.728164553642273
==============================
==============================
皇上+国王=皇后+?
臣子 0.5882754325866699
侍臣 0.5720297694206238
三世 0.5707935690879822
教皇 0.5681069493293762
下臣 0.5630483031272888
新王 0.561499297618866
陛下 0.5612003207206726
叔向 0.5568481683731079
圣上 0.5560325384140015
皇帝 0.5549168586730957
==============================
==============================
Find the word that does not belong in "太后 妃子 贵人 贵妃 才人"
妃子
==============================
==============================
Similarity between 书籍 and 书本
0.6888837
Similarity between 逛街 and 书本
0.15222512
```
### Load the character vectors
```bash
python test_word2vec_char.py
```
Result:
```
==============================
Characters most similar to 丑:
卯 0.5284875631332397
酉 0.5041337013244629
巳 0.4905036985874176
绯 0.4649895429611206
戌 0.43988290429115295
娇 0.42739951610565186
壬 0.40833044052124023
新 0.38065898418426514
猪 0.37993788719177246
戊 0.3759212791919708
==============================
==============================
Characters most similar to 龚:
冯 0.836471676826477
廖 0.8297753930091858
郝 0.8270807266235352
杨 0.8269028067588806
陈 0.8242413997650146
吴 0.8150491714477539
姚 0.8111284375190735
郭 0.807446300983429
潘 0.8055393099784851
谭 0.7994782328605652
==============================
==============================
Characters most similar to 美:
韩 0.6687709093093872
际 0.6437357664108276
英 0.5671981573104858
家 0.521937906742096
泰 0.5166231393814087
欧 0.43756103515625
中 0.35331031680107117
艺 0.35141149163246155
樔 0.3467445373535156
民 0.32968077063560486
==============================
```
### Load the pinyin vectors
```bash
python test_word2vec_pinyin.py
```
Result:
```
==============================
Tokens most similar to chou3:
mao3 0.7094196677207947
yao3 0.45190462470054626
nong4 0.4128718078136444
man2 0.407620906829834
shuo1 0.3986712098121643
niang2 0.39512279629707336
dai3 0.382381796836853
xia1 0.3749469220638275
mao1 0.37324249744415283
dan3 0.37195447087287903
==============================
==============================
Tokens most similar to gong1:
umpc 0.34139925241470337
tie3 0.3174058496952057
meng1 0.3167680501937866
guan3 0.3146922290325165
mcfly 0.30235010385513306
ホテル 0.29963892698287964
esp 0.2994944155216217
chuang4 0.2983870208263397
vertex 0.29366573691368103
のあたる 0.29353320598602295
==============================
==============================
Tokens most similar to mei3:
ying1 0.45298585295677185
han2 0.4176357686519623
jia1 0.4083157479763031
ou1 0.3890629708766937
トメ 0.3689119815826416
ベロキス 0.3615325093269348
bash 0.34474748373031616
かず 0.3278074264526367
からみた 0.3273840844631195
あゆ 0.32595932483673096
==============================
```
# Final notes
Things you could try (a sketch of all three filters follows this list):
- When building the word-vector corpus, use jieba's stop-word support to filter stop words out of the segmented text.
- When building the character corpus, filter out English and other non-Chinese characters or strings.
- When building the pinyin corpus, likewise remove English and other non-pinyin characters or strings.
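A minimal sketch of all three filters. The stop-word file name `stopwords.txt` is a placeholder (supply your own list), and the pinyin pattern assumes the TONE3 style used in this project:
```python
import re

# Hypothetical stop-word list, one word per line; supply your own file.
with open('stopwords.txt', encoding='utf-8') as f:
    stopwords = set(line.strip() for line in f)

def filter_words(tokens):
    # Drop stop words from a jieba-segmented token list.
    return [t for t in tokens if t and t not in stopwords]

def filter_chars(text):
    # Keep CJK Unified Ideographs only; drops Latin letters, digits, punctuation.
    return [c for c in text if '\u4e00' <= c <= '\u9fff']

PINYIN = re.compile(r'^[a-z]+[1-5]$')  # TONE3 tokens such as zhong1, guo2

def filter_pinyin(tokens):
    # Keep only tokens that look like tone-numbered pinyin syllables.
    return [t for t in tokens if PINYIN.match(t)]
```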

# References
https://github.com/Embedding/Chinese-Word-Vectors:
a large collection of pre-trained Chinese vectors.
https://github.com/AimeeLee77/wiki_zh_word2vec: this project is a modified version of that one. Differences:
(1) ported to Python 3;
(2) replaced the traditional-to-simplified conversion package, since the one used there no longer works;
(3) updated the vector-loading code for the new gensim API;
(4) added character and pinyin vectors.

--------------------------------------------------------------------------------
/convert_to_text.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Convert the XML wiki dump to plain text.

import logging
import os.path
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])  # script name
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    if len(sys.argv) < 3:
        print('usage: python convert_to_text.py <dump.xml.bz2> <output.txt>')
        sys.exit(1)

    inp, outp = sys.argv[1:3]
    space = " "
    i = 0

    output = open(outp, 'w', encoding='utf-8')
    # WikiCorpus is gensim's Wikipedia-dump reader; passing an empty
    # dictionary skips vocabulary building, which we don't need here.
    wiki = WikiCorpus(inp, dictionary={})
    # get_texts() yields each article as a list of tokens with markup and
    # punctuation already stripped; write one article per line.
    for text in wiki.get_texts():
        output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles.")

    output.close()
    logger.info("Finished. Saved " + str(i) + " articles.")
--------------------------------------------------------------------------------
/segment_by_jieba.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Read the corpus line by line and segment each line into words with jieba.

import jieba
import codecs

if __name__ == '__main__':
    f = codecs.open('wiki.zh.simp.txt', 'r', encoding='utf8')
    target = codecs.open('wiki.zh.simp.seg.txt', 'w', encoding='utf8')
    print('open files.')

    lineNum = 1
    line = f.readline()
    while line:
        print('---processing ', lineNum, ' article---')
        seg_list = jieba.cut(line, cut_all=False)  # accurate mode
        line_seg = ' '.join(seg_list)
        target.writelines(line_seg)
        lineNum = lineNum + 1
        line = f.readline()

    print('well done.')
    f.close()
    target.close()
--------------------------------------------------------------------------------
/segment_by_pinyin.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Read the corpus line by line and convert each line to tone-numbered pinyin.
from pypinyin import lazy_pinyin, Style
import codecs

if __name__ == '__main__':
    f = codecs.open('wiki.zh.simp.txt', 'r', encoding='utf8')
    target = codecs.open('wiki.zh.simp.seg.pinyin.txt', 'w', encoding='utf8')
    print('open files.')

    lineNum = 1
    line = f.readline()
    style = Style.TONE3  # tone number appended to the syllable: zhong1 guo2
    while line:
        print('---processing ', lineNum, ' article---')
        tokens = lazy_pinyin(line.strip(), style=style)
        tmp = []
        for i in tokens:
            i = i.strip()
            if i and ' ' not in i:  # drop runs of English text, which usually contain spaces
                tmp.append(i)
        line_seg = " ".join(tmp)
        target.write(line_seg + '\n')  # keep one article per line
        lineNum = lineNum + 1
        line = f.readline()

    print('well done.')
    f.close()
    target.close()
--------------------------------------------------------------------------------
/segment_char.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Read the corpus line by line and split each line into single characters.

import codecs

if __name__ == '__main__':
    f = codecs.open('wiki.zh.simp.txt', 'r', encoding='utf8')
    target = codecs.open('wiki.zh.simp.seg.char.txt', 'w', encoding='utf8')
    print('open files.')

    lineNum = 1
    line = f.readline()
    while line:
        print('---processing ', lineNum, ' article---')
        # One token per character; skip whitespace so the join below
        # doesn't emit empty tokens.
        seg_list = [c for c in line if not c.isspace()]
        line_seg = ' '.join(seg_list)
        target.write(line_seg + '\n')  # keep one article per line
        lineNum = lineNum + 1
        line = f.readline()

    print('well done.')
    f.close()
    target.close()
--------------------------------------------------------------------------------
/test_pinyin.py:
--------------------------------------------------------------------------------
from pypinyin import lazy_pinyin, Style

# Quick sanity check for pypinyin: TONE3 appends the tone number (wo3, ai4, ...).
style = Style.TONE3
line = lazy_pinyin('我爱北京天安门 娃哈哈 style 你好123', style=style)
tmp = []
for i in line:
    # Toned pinyin like wo3 is neither all letters nor all digits, so this
    # keeps pinyin tokens and drops plain English words and bare numbers.
    if not i.isalpha() and not i.isdigit():
        tmp.append(i)
print(tmp)
--------------------------------------------------------------------------------
/test_word2vec.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Test the trained word-vector model.

import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')  # silence gensim warnings
from gensim.models import KeyedVectors

if __name__ == '__main__':
    fdir = '/data02/gob/model_hub/wiki_zh_vec/'
    # model = KeyedVectors.load_word2vec_format(fdir+'wiki.zh.text.vector', binary=False, unicode_errors='ignore')
    model = KeyedVectors.load_word2vec_format(fdir + 'wiki.zh.text.vector', binary=False)
    word = model.most_similar(u"孙悟空")
    print('==============================')
    print('Words most similar to 孙悟空:')
    for t in word:
        print(t[0], t[1])
    print('==============================')
    print('==============================')
    print('皇上+国王=皇后+?')
    word = model.most_similar(positive=[u'皇上', u'国王'], negative=[u'皇后'])
    for t in word:
        print(t[0], t[1])
    print('==============================')
    print('==============================')
    print('Find the word that does not belong in "太后 妃子 贵人 贵妃 才人"')
    print(model.doesnt_match(u'太后 妃子 贵人 贵妃 才人'.split()))
    print('==============================')
    print('==============================')
    print('Similarity between 书籍 and 书本')
    print(model.similarity(u'书籍', u'书本'))
    print('Similarity between 逛街 and 书本')
    print(model.similarity(u'逛街', u'书本'))
--------------------------------------------------------------------------------
/test_word2vec_char.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Test the trained character-vector model.

import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')  # silence gensim warnings
from gensim.models import KeyedVectors

if __name__ == '__main__':
    fdir = '/data02/gob/model_hub/wiki_zh_char_vec/'
    # model = KeyedVectors.load_word2vec_format(fdir+'wiki.zh.text.vector', binary=False, unicode_errors='ignore')
    model = KeyedVectors.load_word2vec_format(fdir + 'wiki.zh.text.char.vector', binary=False)
    word = model.most_similar(u"丑")
    print('==============================')
    print('Characters most similar to 丑:')
    for t in word:
        print(t[0], t[1])
    print('==============================')
    print('==============================')
    word = model.most_similar(u"龚")
    print('Characters most similar to 龚:')
    for t in word:
        print(t[0], t[1])
    print('==============================')
    print('==============================')
    word = model.most_similar(u"美")
    print('Characters most similar to 美:')
    for t in word:
        print(t[0], t[1])
    print('==============================')
31 | """
32 | print('==============================')
33 | print('皇上+国王=皇后+?')
34 | word = model.most_similar(positive=[u'皇上',u'国王'],negative=[u'皇后'])
35 | for t in word:
36 | print(t[0],t[1])
37 | print('==============================')
38 | print('==============================')
39 | print('找出“太后 妃子 贵人 贵妃 才人”不匹配的词语')
40 | print(model.doesnt_match(u'太后 妃子 贵人 贵妃 才人'.split()))
41 | print('==============================')
42 | print('==============================')
43 | print('找出“书籍和书本"的相似度')
44 | print(model.similarity(u'书籍',u'书本'))
45 | print('找出"逛街和书本"的相似度')
46 | print(model.similarity(u'逛街',u'书本'))
47 | """
48 |
--------------------------------------------------------------------------------
/test_word2vec_pinyin.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Test the trained pinyin-vector model.

import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')  # silence gensim warnings
from gensim.models import KeyedVectors

if __name__ == '__main__':
    fdir = '/data02/gob/model_hub/wiki_zh_pinyin_vec/'
    # model = KeyedVectors.load_word2vec_format(fdir+'wiki.zh.text.vector', binary=False, unicode_errors='ignore')
    model = KeyedVectors.load_word2vec_format(fdir + 'wiki.zh.text.pinyin.vector', binary=False)
    word = model.most_similar("chou3")
    print('==============================')
    print('Tokens most similar to chou3:')
    for t in word:
        print(t[0], t[1])
    print('==============================')
    print('==============================')
    word = model.most_similar("gong1")
    print('Tokens most similar to gong1:')
    for t in word:
        print(t[0], t[1])
    print('==============================')
    print('==============================')
    word = model.most_similar("mei3")
    print('Tokens most similar to mei3:')
    for t in word:
        print(t[0], t[1])
    print('==============================')
31 | """
32 | print('==============================')
33 | print('皇上+国王=皇后+?')
34 | word = model.most_similar(positive=[u'皇上',u'国王'],negative=[u'皇后'])
35 | for t in word:
36 | print(t[0],t[1])
37 | print('==============================')
38 | print('==============================')
39 | print('找出“太后 妃子 贵人 贵妃 才人”不匹配的词语')
40 | print(model.doesnt_match(u'太后 妃子 贵人 贵妃 才人'.split()))
41 | print('==============================')
42 | print('==============================')
43 | print('找出“书籍和书本"的相似度')
44 | print(model.similarity(u'书籍',u'书本'))
45 | print('找出"逛街和书本"的相似度')
46 | print(model.similarity(u'逛街',u'书本'))
47 | """
48 |
--------------------------------------------------------------------------------
/train_word2vec.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Train word vectors with gensim's word2vec.

import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')  # silence gensim warnings

import logging
import os
import sys

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence


if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s', level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # inp is the input corpus; outp1 is the gensim model; outp2 holds the
    # vectors in the original C word2vec text format.
    fdir = '/data02/gob/model_hub/wiki_zh_vec/'
    if not os.path.exists(fdir):  # create the output directory, as the other train scripts do
        os.mkdir(fdir)
    inp = 'wiki.zh.simp.seg.txt'
    outp1 = fdir + 'wiki.zh.text.model'
    outp2 = fdir + 'wiki.zh.text.vector'

    # Train the model (gensim defaults to CBOW; pass sg=1 for skip-gram).
    model = Word2Vec(LineSentence(inp), vector_size=100, window=5, min_count=5,
                     workers=4, epochs=10)

    # Save the gensim model and the plain-text vectors.
    model.save(outp1)
    model.wv.save_word2vec_format(outp2, binary=False)
--------------------------------------------------------------------------------
/train_word2vec_char.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Train character vectors with gensim's word2vec.

import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')  # silence gensim warnings

import logging
import os
import sys

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence


if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s', level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # inp is the input corpus; outp1 is the gensim model; outp2 holds the
    # vectors in the original C word2vec text format.
    fdir = '/data02/gob/model_hub/wiki_zh_char_vec/'
    if not os.path.exists(fdir):
        os.mkdir(fdir)
    inp = 'wiki.zh.simp.seg.char.txt'
    outp1 = fdir + 'wiki.zh.text.char.model'
    outp2 = fdir + 'wiki.zh.text.char.vector'

    # Train the model (gensim defaults to CBOW; pass sg=1 for skip-gram).
    model = Word2Vec(LineSentence(inp), vector_size=100, window=5, min_count=5,
                     workers=4, epochs=10)

    # Save the gensim model and the plain-text vectors.
    model.save(outp1)
    model.wv.save_word2vec_format(outp2, binary=False)
--------------------------------------------------------------------------------
/train_word2vec_pinyin.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Train pinyin vectors with gensim's word2vec.

import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')  # silence gensim warnings

import logging
import os
import sys

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence


if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s', level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # inp is the input corpus; outp1 is the gensim model; outp2 holds the
    # vectors in the original C word2vec text format.
    fdir = '/data02/gob/model_hub/wiki_zh_pinyin_vec/'
    if not os.path.exists(fdir):
        os.mkdir(fdir)
    inp = 'wiki.zh.simp.seg.pinyin.txt'
    outp1 = fdir + 'wiki.zh.text.pinyin.model'
    outp2 = fdir + 'wiki.zh.text.pinyin.vector'

    # Train the model (gensim defaults to CBOW; pass sg=1 for skip-gram).
    model = Word2Vec(LineSentence(inp), vector_size=100, window=5, min_count=5,
                     workers=4, epochs=10)

    # Save the gensim model and the plain-text vectors.
    model.save(outp1)
    model.wv.save_word2vec_format(outp2, binary=False)
--------------------------------------------------------------------------------