├── README.md
├── convert_to_text.py
├── segment_by_jieba.py
├── segment_by_pinyin.py
├── segment_char.py
├── test_pinyin.py
├── test_word2vec.py
├── test_word2vec_char.py
├── test_word2vec_pinyin.py
├── train_word2vec.py
├── train_word2vec_char.py
└── train_word2vec_pinyin.py

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
# python3_wiki_word2vec
Train Chinese-Wikipedia word vectors, character vectors, and pinyin vectors with Python 3.
Pre-processed files:
Aliyun Drive link: https://www.aliyundrive.com/s/p35V1WmiBJS
# Dependencies
```
gensim
jieba
pypinyin
opencc-python-reimplemented
```

## Step 1
Filter the HTML markup out of the raw dump and store the result as a plain-text file:
```
python convert_to_text.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.txt
```
## Step 2
Convert traditional characters to simplified ones. First install this package:
```
pip install opencc-python-reimplemented
```
Then run on the command line:
```
python -m opencc -c t2s -i wiki.zh.txt -o wiki.zh.simp.txt
```
## Step 3
### Build the word-vector training corpus
Segment the text with jieba. If the module is not installed yet:
```
pip install jieba
```
Then run:
```
python segment_by_jieba.py
```
Finally:
```
python train_word2vec.py
```
### Build the character-vector training corpus
First run:
```
python segment_char.py
```
Then run:
```
python train_word2vec_char.py
```
### Build the pinyin-vector training corpus
Install this package first:
```
pip install pypinyin
```
You can try it out with:
```
python test_pinyin.py
```
Then run:
```
python segment_by_pinyin.py
```
Finally:
```
python train_word2vec_pinyin.py
```
## Load the trained models and test them
### Load the word vectors
```
python test_word2vec.py
```
Output:
```
==============================
求孙悟空的相似词:
悟空 0.8302149176597595
牛魔王 0.8158780336380005
唐僧 0.7910414934158325
猪八戒 0.7805720567703247
沙悟净 0.7804343104362488
沙僧 0.7628636956214905
铁扇公主 0.7468816637992859
哪吒 0.7401531338691711
唐三藏 0.7286755442619324
齐天大圣 0.728164553642273
==============================
==============================
皇上+国王=皇后+?
臣子 0.5882754325866699
侍臣 0.5720297694206238
三世 0.5707935690879822
教皇 0.5681069493293762
下臣 0.5630483031272888
新王 0.561499297618866
陛下 0.5612003207206726
叔向 0.5568481683731079
圣上 0.5560325384140015
皇帝 0.5549168586730957
==============================
==============================
找出“太后 妃子 贵人 贵妃 才人”不匹配的词语
妃子
==============================
==============================
找出“书籍和书本”的相似度
0.6888837
找出“逛街和书本”的相似度
0.15222512
```
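To query the trained vectors from your own code rather than through `test_word2vec.py`, a minimal sketch (assuming the vector file written by `train_word2vec.py` is in the working directory):
```python
from gensim.models import KeyedVectors

# plain-text vectors saved by train_word2vec.py
model = KeyedVectors.load_word2vec_format('wiki.zh.text.vector', binary=False)

print(model.most_similar(u'孙悟空', topn=5))  # five nearest neighbours
print(model.similarity(u'书籍', u'书本'))      # cosine similarity of two words
```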
### Load the character vectors
```
python test_word2vec_char.py
```
Output:
```
==============================
求丑的相似词:
卯 0.5284875631332397
酉 0.5041337013244629
巳 0.4905036985874176
绯 0.4649895429611206
戌 0.43988290429115295
娇 0.42739951610565186
壬 0.40833044052124023
新 0.38065898418426514
猪 0.37993788719177246
戊 0.3759212791919708
==============================
==============================
求龚的相似词:
冯 0.836471676826477
廖 0.8297753930091858
郝 0.8270807266235352
杨 0.8269028067588806
陈 0.8242413997650146
吴 0.8150491714477539
姚 0.8111284375190735
郭 0.807446300983429
潘 0.8055393099784851
谭 0.7994782328605652
==============================
==============================
求美的相似词:
韩 0.6687709093093872
际 0.6437357664108276
英 0.5671981573104858
家 0.521937906742096
泰 0.5166231393814087
欧 0.43756103515625
中 0.35331031680107117
艺 0.35141149163246155
樔 0.3467445373535156
民 0.32968077063560486
==============================
```
### Load the pinyin vectors
```
python test_word2vec_pinyin.py
```
Output:
```
==============================
求chou3的相似词:
mao3 0.7094196677207947
yao3 0.45190462470054626
nong4 0.4128718078136444
man2 0.407620906829834
shuo1 0.3986712098121643
niang2 0.39512279629707336
dai3 0.382381796836853
xia1 0.3749469220638275
mao1 0.37324249744415283
dan3 0.37195447087287903
==============================
==============================
求gong1的相似词:
umpc 0.34139925241470337
tie3 0.3174058496952057
meng1 0.3167680501937866
guan3 0.3146922290325165
mcfly 0.30235010385513306
ホテル 0.29963892698287964
esp 0.2994944155216217
chuang4 0.2983870208263397
vertex 0.29366573691368103
のあたる 0.29353320598602295
==============================
==============================
求mei3的相似词:
ying1 0.45298585295677185
han2 0.4176357686519623
jia1 0.4083157479763031
ou1 0.3890629708766937
トメ 0.3689119815826416
ベロキス 0.3615325093269348
bash 0.34474748373031616
かず 0.3278074264526367
からみた 0.3273840844631195
あゆ 0.32595932483673096
==============================
```
# Final notes
A few things you could try (see the sketches right after this list):
- When building the word-vector corpus, use jieba's stop-word support to filter stop words out.
- When building the character corpus, filter out English and other non-Chinese characters and strings.
- When building the pinyin corpus, likewise remove English and other non-pinyin characters and strings.
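Minimal sketches of these three ideas, as referenced above. `stopwords.txt` is a hypothetical one-word-per-line stop-word list, not a file in this repo:
```python
import jieba

# (1) stop-word filtering for the word-vector corpus
with open('stopwords.txt', encoding='utf8') as f:  # hypothetical stop-word list
    stopwords = set(w.strip() for w in f)

def segment_line(line):
    # keep jieba tokens that are neither stop words nor whitespace
    return ' '.join(w for w in jieba.cut(line.strip())
                    if w not in stopwords and not w.isspace())
```
For ideas (2) and (3), a regular expression is enough:
```python
import re

# (2) keep only CJK characters when building the character corpus
def chinese_chars_only(line):
    return ' '.join(re.findall(r'[\u4e00-\u9fff]', line))

# (3) keep only tone-numbered syllables such as zhong1 or hao3; note that
# neutral-tone syllables carry no number under Style.TONE3 by default and
# would need an extra rule
def pinyin_tokens_only(tokens):
    return [t for t in tokens if re.fullmatch(r'[a-z]+[1-5]', t)]
```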
# References
https://github.com/Embedding/Chinese-Word-Vectors: a large collection of pre-trained Chinese word vectors.

https://github.com/AimeeLee77/wiki_zh_word2vec: this project is a modified version of that one. The differences:
(1) ported to Python 3;
(2) swapped the traditional-to-simplified conversion package, since the one used in that project no longer works;
(3) updated the vector-loading code for recent gensim versions;
(4) added character vectors and pinyin vectors.
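A note on point (3): recent gensim releases changed both APIs used here. Loading moved from `Word2Vec.load_word2vec_format` to `KeyedVectors.load_word2vec_format` (gensim ≥ 1.0), and gensim 4 renamed the training parameters `size` to `vector_size` and `iter` to `epochs`, which is why the train scripts below call:
```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# gensim 4 parameter names; under gensim 3 these were size= and iter=
model = Word2Vec(LineSentence('wiki.zh.simp.seg.txt'),
                 vector_size=100, window=5, min_count=5, workers=4, epochs=10)
```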
--------------------------------------------------------------------------------
/convert_to_text.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Convert the XML wiki dump into plain text, one article per line.

import logging
import os.path
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])  # script name, used as the logger name
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    if len(sys.argv) < 3:
        print("usage: python convert_to_text.py <zhwiki-dump.xml.bz2> <output.txt>")
        sys.exit(1)

    inp, outp = sys.argv[1:3]
    space = " "
    i = 0

    output = open(outp, 'w', encoding='utf-8')
    wiki = WikiCorpus(inp, dictionary={})  # gensim's Wikipedia corpus reader
    # get_texts() yields each article as a list of tokens with markup and
    # punctuation already stripped; write one article per line
    for text in wiki.get_texts():
        output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles.")

    output.close()
    logger.info("Finished. Saved " + str(i) + " articles.")

--------------------------------------------------------------------------------
/segment_by_jieba.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Read the corpus line by line and segment it with jieba.

import jieba
import codecs

if __name__ == '__main__':
    f = codecs.open('wiki.zh.simp.txt', 'r', encoding='utf8')
    target = codecs.open('wiki.zh.simp.seg.txt', 'w', encoding='utf8')
    print('open files.')

    lineNum = 1
    line = f.readline()
    while line:
        print('---processing ', lineNum, ' article---')
        seg_list = jieba.cut(line.strip(), cut_all=False)  # precise mode
        target.write(' '.join(seg_list) + '\n')
        lineNum = lineNum + 1
        line = f.readline()

    print('well done.')
    f.close()
    target.close()

--------------------------------------------------------------------------------
/segment_by_pinyin.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Read the corpus line by line and convert it to tone-numbered pinyin.

from pypinyin import lazy_pinyin, Style
import codecs

if __name__ == '__main__':
    f = codecs.open('wiki.zh.simp.txt', 'r', encoding='utf8')
    target = codecs.open('wiki.zh.simp.seg.pinyin.txt', 'w', encoding='utf8')
    print('open files.')

    lineNum = 1
    line = f.readline()
    style = Style.TONE3  # tone number appended to each syllable, e.g. hao3
    while line:
        print('---processing ', lineNum, ' article---')
        syllables = lazy_pinyin(line.strip(), style=style)
        tmp = []
        for s in syllables:
            # lazy_pinyin passes non-Chinese runs through unchanged; runs of
            # English text generally contain spaces, so drop anything with one
            if ' ' not in s:
                tmp.append(s)
        target.write(' '.join(tmp) + '\n')
        lineNum = lineNum + 1
        line = f.readline()

    print('well done.')
    f.close()
    target.close()
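A side note on `Style.TONE3`, used by segment_by_pinyin.py above: it appends the tone number to each syllable, which is where tokens such as `chou3` and `gong1` in the README's test output come from. A quick check:
```python
from pypinyin import lazy_pinyin, Style

print(lazy_pinyin('中文维基', style=Style.TONE3))
# expected: ['zhong1', 'wen2', 'wei2', 'ji1']
```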
--------------------------------------------------------------------------------
/segment_char.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Read the corpus line by line and split it into individual characters.

import codecs

if __name__ == '__main__':
    f = codecs.open('wiki.zh.simp.txt', 'r', encoding='utf8')
    target = codecs.open('wiki.zh.simp.seg.char.txt', 'w', encoding='utf8')
    print('open files.')

    lineNum = 1
    line = f.readline()
    while line:
        print('---processing ', lineNum, ' article---')
        # one token per character; skip the spaces and the trailing newline
        seg_list = [c for c in line.strip() if not c.isspace()]
        target.write(' '.join(seg_list) + '\n')
        lineNum = lineNum + 1
        line = f.readline()

    print('well done.')
    f.close()
    target.close()

--------------------------------------------------------------------------------
/test_pinyin.py:
--------------------------------------------------------------------------------
from pypinyin import lazy_pinyin, Style

style = Style.TONE3
line = lazy_pinyin('我爱北京天安门 娃哈哈 style 你好123', style=style)
tmp = []
for i in line:
    tok = i.strip()
    # keep tone-numbered syllables; drop the pure-letter runs (English),
    # pure-digit runs and whitespace that lazy_pinyin passes through
    if tok and not tok.isalpha() and not tok.isdigit():
        tmp.append(tok)
print(tmp)

--------------------------------------------------------------------------------
/test_word2vec.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Test the trained word-vector model.

import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')  # suppress gensim warnings
from gensim.models import KeyedVectors

if __name__ == '__main__':
    fdir = '/data02/gob/model_hub/wiki_zh_vec/'
    # model = KeyedVectors.load_word2vec_format(fdir+'wiki.zh.text.vector', binary=False, unicode_errors='ignore')
    model = KeyedVectors.load_word2vec_format(fdir+'wiki.zh.text.vector', binary=False)
    word = model.most_similar(u"孙悟空")
    print('==============================')
    print('求孙悟空的相似词:')
    for t in word:
        print(t[0], t[1])
    print('==============================')
    print('==============================')
    print('皇上+国王=皇后+?')
    # most_similar sums the positive vectors and subtracts the negative ones,
    # i.e. this computes 皇上 + 国王 - 皇后
    word = model.most_similar(positive=[u'皇上', u'国王'], negative=[u'皇后'])
    for t in word:
        print(t[0], t[1])
    print('==============================')
    print('==============================')
    print('找出“太后 妃子 贵人 贵妃 才人”不匹配的词语')
    print(model.doesnt_match(u'太后 妃子 贵人 贵妃 才人'.split()))
    print('==============================')
    print('==============================')
    print('找出“书籍和书本”的相似度')
    print(model.similarity(u'书籍', u'书本'))
    print('找出“逛街和书本”的相似度')
    print(model.similarity(u'逛街', u'书本'))
--------------------------------------------------------------------------------
/test_word2vec_char.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Test the trained character-vector model.

import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')  # suppress gensim warnings
from gensim.models import KeyedVectors

if __name__ == '__main__':
    fdir = '/data02/gob/model_hub/wiki_zh_char_vec/'
    # model = KeyedVectors.load_word2vec_format(fdir+'wiki.zh.text.char.vector', binary=False, unicode_errors='ignore')
    model = KeyedVectors.load_word2vec_format(fdir+'wiki.zh.text.char.vector', binary=False)
    word = model.most_similar(u"丑")
    print('==============================')
    print('求丑的相似词:')
    for t in word:
        print(t[0], t[1])
    print('==============================')
    print('==============================')
    word = model.most_similar(u"龚")
    print('求龚的相似词:')
    for t in word:
        print(t[0], t[1])
    print('==============================')
    print('==============================')
    word = model.most_similar(u"美")
    print('求美的相似词:')
    for t in word:
        print(t[0], t[1])
    print('==============================')

--------------------------------------------------------------------------------
/test_word2vec_pinyin.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Test the trained pinyin-vector model.

import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')  # suppress gensim warnings
from gensim.models import KeyedVectors

if __name__ == '__main__':
    fdir = '/data02/gob/model_hub/wiki_zh_pinyin_vec/'
    # model = KeyedVectors.load_word2vec_format(fdir+'wiki.zh.text.pinyin.vector', binary=False, unicode_errors='ignore')
    model = KeyedVectors.load_word2vec_format(fdir+'wiki.zh.text.pinyin.vector', binary=False)
    word = model.most_similar("chou3")
    print('==============================')
    print('求chou3的相似词:')
    for t in word:
        print(t[0], t[1])
    print('==============================')
    print('==============================')
    word = model.most_similar("gong1")
    print('求gong1的相似词:')
    for t in word:
        print(t[0], t[1])
    print('==============================')
    print('==============================')
    word = model.most_similar("mei3")
    print('求mei3的相似词:')
    for t in word:
        print(t[0], t[1])
    print('==============================')
--------------------------------------------------------------------------------
/train_word2vec.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Train word vectors with gensim's word2vec.

import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')  # suppress gensim warnings

import logging
import os
import os.path
import sys

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s', level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # inp is the input corpus, outp1 the gensim model file,
    # outp2 the vectors in the original C word2vec text format
    fdir = '/data02/gob/model_hub/wiki_zh_vec/'
    if not os.path.exists(fdir):
        os.mkdir(fdir)
    inp = 'wiki.zh.simp.seg.txt'
    outp1 = fdir + 'wiki.zh.text.model'
    outp2 = fdir + 'wiki.zh.text.vector'

    # train the word2vec model (gensim defaults to CBOW; pass sg=1 for skip-gram)
    model = Word2Vec(LineSentence(inp), vector_size=100, window=5, min_count=5,
                     workers=4, epochs=10)

    # save the full model and the plain-text vectors
    model.save(outp1)
    model.wv.save_word2vec_format(outp2, binary=False)

--------------------------------------------------------------------------------
/train_word2vec_char.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Train character vectors with gensim's word2vec.

import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')  # suppress gensim warnings

import logging
import os
import os.path
import sys

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s', level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # inp is the input corpus, outp1 the gensim model file,
    # outp2 the vectors in the original C word2vec text format
    fdir = '/data02/gob/model_hub/wiki_zh_char_vec/'
    if not os.path.exists(fdir):
        os.mkdir(fdir)
    inp = 'wiki.zh.simp.seg.char.txt'
    outp1 = fdir + 'wiki.zh.text.char.model'
    outp2 = fdir + 'wiki.zh.text.char.vector'

    # train the word2vec model (gensim defaults to CBOW; pass sg=1 for skip-gram)
    model = Word2Vec(LineSentence(inp), vector_size=100, window=5, min_count=5,
                     workers=4, epochs=10)

    # save the full model and the plain-text vectors
    model.save(outp1)
    model.wv.save_word2vec_format(outp2, binary=False)
--------------------------------------------------------------------------------
/train_word2vec_pinyin.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Train pinyin vectors with gensim's word2vec.

import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')  # suppress gensim warnings

import logging
import os
import os.path
import sys

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s', level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # inp is the input corpus, outp1 the gensim model file,
    # outp2 the vectors in the original C word2vec text format
    fdir = '/data02/gob/model_hub/wiki_zh_pinyin_vec/'
    if not os.path.exists(fdir):
        os.mkdir(fdir)
    inp = 'wiki.zh.simp.seg.pinyin.txt'
    outp1 = fdir + 'wiki.zh.text.pinyin.model'
    outp2 = fdir + 'wiki.zh.text.pinyin.vector'

    # train the word2vec model (gensim defaults to CBOW; pass sg=1 for skip-gram)
    model = Word2Vec(LineSentence(inp), vector_size=100, window=5, min_count=5,
                     workers=4, epochs=10)

    # save the full model and the plain-text vectors
    model.save(outp1)
    model.wv.save_word2vec_format(outp2, binary=False)
--------------------------------------------------------------------------------
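A closing usage note: each train script saves the model twice, once via `model.save` (a full model whose training state can be reloaded) and once as plain-text vectors. A minimal sketch of continuing training from the saved full model, assuming `wiki.zh.text.model` from train_word2vec.py sits in the working directory and `more_corpus.txt` is a hypothetical extra corpus:
```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

model = Word2Vec.load('wiki.zh.text.model')  # full model, training state included
# for a corpus of a different size, pass its own sentence count as total_examples
model.train(LineSentence('more_corpus.txt'),
            total_examples=model.corpus_count, epochs=model.epochs)
model.wv.save_word2vec_format('wiki.zh.text.updated.vector', binary=False)
```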