├── .gitignore
├── README.md
└── scel2mmseg.py


/.gitignore:
--------------------------------------------------------------------------------
1 | *.pyc
2 | *~
3 | .project
4 | .settings
5 | .pydevproject
6 | carevenv
7 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
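 1 | Convert Sogou cell dictionaries (.scel) into mmseg dictionaries
 2 | 
 3 | Features:
 4 | 
 5 |  - scel2mmseg.py: converts a .scel file into a .txt file in mmseg format
 6 |  
 7 |  Usage: python scel2mmseg.py a.scel a.txt
 8 | 
 9 |  Batch conversion: python scel2mmseg.py <directory of .scel files> a.txt
10 |  
11 |  Note: every newly added word gets a frequency of 1. The format is explained below:
12 | > Each record occupies two lines. The first line is the term, in the form [term]\t[frequency]. For a single character, the value is the frequency with which that character forms a word by itself; it has to be measured on a large pre-segmented corpus, and users adding or removing words normally do not need to change it. For multi-character words the frequency must be 1. The second line is a placeholder: the LibMMSeg code was adapted from another Coreseek segmentation library (an N-gram model), where the second line held the word's frequency distribution across parts of speech. LibMMSeg users can simply put "x:1" on the second line.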
13 | 
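The two-line record format described above can be sketched in a few lines of Python 3. The helper names `format_mmseg_record` and `format_mmseg_records` are hypothetical and are not part of this repository; the sketch only illustrates the layout that scel2mmseg.py writes for each word.

```python
def format_mmseg_record(word, freq=1):
    """Return the two-line mmseg record for one word (hypothetical helper).

    Line 1 is "[term]<TAB>[frequency]"; line 2 is the "x:1" placeholder.
    """
    return "%s\t%d\nx:1\n" % (word, freq)


def format_mmseg_records(words):
    """Concatenate the records for several words, all with frequency 1."""
    return "".join(format_mmseg_record(w) for w in words)


if __name__ == "__main__":
    # Two multi-character words, so the frequency stays at 1.
    print(format_mmseg_records(["搜狗", "词库"]), end="")
```

Multi-character words keep the default frequency of 1; a different value would only make sense for single characters whose frequency was measured on a corpus.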
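14 |  - mergedict.py: merges several mmseg .txt files into a single .txt file
15 |  
16 |  Usage: python mergedict.py unigram.txt b.txt c.txt new.txt
17 |  
18 |  Note: each .txt file may be in mmseg format or in one-word-per-line format (in which case the frequency defaults to 1)
19 |  
20 |  Caution: merging removes duplicates, so a word that already appeared in an earlier file is not appended to the output. For that reason unigram.txt must be listed first.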
21 | 
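The first-occurrence-wins behaviour described for mergedict.py can be sketched as follows. This is not the actual mergedict.py, which is not included in this listing; `parse_lines` and `merge` are hypothetical names, the sample entries are invented, and the sketch assumes that any line starting with "x:" is a placeholder line.

```python
def parse_lines(lines):
    """Yield (word, freq) pairs from mmseg-format or one-word-per-line text."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("x:"):
            continue  # skip blank lines and the "x:..." placeholder line
        word, _, freq = line.partition("\t")
        # A line without a tab has no frequency, so it defaults to 1.
        yield (word.strip(), int(freq) if freq.strip() else 1)


def merge(sources):
    """Merge several dictionaries; the first occurrence of a word wins."""
    seen = set()
    merged = []
    for lines in sources:
        for word, freq in parse_lines(lines):
            if word not in seen:
                seen.add(word)
                merged.append((word, freq))
    return merged


if __name__ == "__main__":
    # Invented sample data: an mmseg-format list and a plain word list.
    unigram = ["的\t12345\n", "x:1\n", "词库\t1\n", "x:1\n"]
    extra = ["词库\n", "搜狗\n"]
    for word, freq in merge([unigram, extra]):
        print("%s\t%d" % (word, freq))
```

Because duplicates are dropped in favour of the earlier file, passing the unigram list first keeps its frequencies for any word that also appears in a later file.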
22 | 
23 | 
24 | ### Use cases
25 | 1. [叮当星球](https://www.aboutstudy.net)
26 | 2. [极客星球](https://www.geekshared.com)
27 | 3. [Web攻城志](http://blog.vimge.com)
28 | 4. [Creem](https://www.creem.io/bip/geekshared)
29 | 
30 | 
31 | 
32 | 


--------------------------------------------------------------------------------
/scel2mmseg.py:
--------------------------------------------------------------------------------
 1 | import struct
 2 | import os, sys, glob
 3 | 
 4 | def read_utf16_str (f, offset=-1, size=2):
 5 |     if offset >= 0:
 6 |         f.seek(offset)
 7 |     raw = f.read(size)  # size is a byte count, not a character count
 8 |     return raw.decode('UTF-16LE')
 9 | 
10 | def read_uint16 (f):
11 |     return struct.unpack ('<H', f.read(2))[0]
12 | 
13 | def get_word_from_sogou_cell_dict (fname):
14 |     f = open (fname, 'rb')
15 |     file_size = os.path.getsize (fname)
16 |     
17 |     hz_offset = 0
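18 |     mask = struct.unpack ('B', f.read(128)[4:5])[0]  # slicing yields a byte string on Python 2 and 3
19 |     if mask == 0x44:
20 |         hz_offset = 0x2628
21 |     elif mask == 0x45:
22 |         hz_offset = 0x26c4
23 |     else:
24 |         sys.exit("%s: unrecognized .scel header byte 0x%02x" % (fname, mask))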
25 |     
26 |     title   = read_utf16_str (f, 0x130, 0x338  - 0x130)
27 |     type    = read_utf16_str (f, 0x338, 0x540  - 0x338)
28 |     desc    = read_utf16_str (f, 0x540, 0xd40  - 0x540)
29 |     samples = read_utf16_str (f, 0xd40, 0x1540 - 0xd40)
30 |     
31 |     py_map = {}
32 |     f.seek(0x1540+4)  # start of the pinyin table
33 |     
34 |     while True:
35 |         py_code = read_uint16 (f)
36 |         py_len  = read_uint16 (f)  # length in bytes
37 |         py_str  = read_utf16_str (f, -1, py_len)
38 |     
39 |         if py_code not in py_map:
40 |             py_map[py_code] = py_str
41 |     
42 |         if py_str == 'zuo':  # 'zuo' is treated as the last entry of the table
43 |             break
44 |     
45 |     f.seek(hz_offset)
46 |     while f.tell() != file_size:
47 |         word_count   = read_uint16 (f)
48 |         pinyin_count = read_uint16 (f) // 2  # integer division so range() also works on Python 3
49 |     
50 |         py_set = []
51 |         for i in range(pinyin_count):
52 |             py_id = read_uint16(f)
53 |             py_set.append(py_map[py_id])
54 |         py_str = "'".join (py_set)
55 | 
56 |         for i in range(word_count):
57 |             word_len = read_uint16(f)
58 |             word_str = read_utf16_str (f, -1, word_len)
59 |             f.read(12)  # skip the 12 bytes that follow each word
60 |             yield py_str, word_str
61 | 
62 |     f.close()
63 | 
64 | def showtxt (records):
65 |     for (pystr, word) in records:
66 |         print("%d %s" % (len(word), word))
67 | 
68 | def store(records, f):
69 |     # f must be opened in binary mode; each word becomes a two-line record
70 |     for (pystr, word) in records:
71 |         f.write((u"%s\t1\nx:1\n" % word).encode("utf8"))
72 | 
73 | def main ():
74 |     if len (sys.argv) != 3:
75 |         print("Unknown Option \n usage: python %s file.scel new.txt" % (sys.argv[0]))
76 |         sys.exit (1)
77 | 
78 |     # If the first argument is a directory, every .scel file in it is converted and the results are combined into one txt file
79 |     if os.path.isdir(sys.argv[1]):
80 |         for fileName in glob.glob(os.path.join(sys.argv[1], '*.scel')):
81 |             print(fileName)
82 |             generator = get_word_from_sogou_cell_dict(fileName)
83 |             with open(sys.argv[2], "ab") as f:
84 |                 store(generator, f)
85 | 
86 |     else:
87 |         generator = get_word_from_sogou_cell_dict (sys.argv[1])
88 |         with open(sys.argv[2], "wb") as f:
89 |             store(generator, f)
90 |             #showtxt(generator)
91 | 
92 | if __name__ == "__main__":
93 |     main()
94 |     
95 |     
96 | 
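The inner loop of `get_word_from_sogou_cell_dict` reads each word as a 2-byte little-endian length, that many bytes of UTF-16LE text, and 12 further bytes that it skips. A minimal Python 3 sketch of that step on an in-memory buffer follows; `read_word` and `pack_word` are hypothetical helpers, and the buffer is synthetic rather than taken from a real .scel file.

```python
import io
import struct


def read_uint16(f):
    """Read one little-endian unsigned 16-bit integer."""
    return struct.unpack("<H", f.read(2))[0]


def read_word(f):
    """Read one length-prefixed UTF-16LE word and skip the 12 bytes after it."""
    size = read_uint16(f)            # length of the word in bytes
    word = f.read(size).decode("utf-16-le")
    f.read(12)                       # trailing bytes the script ignores
    return word


def pack_word(word):
    """Build a synthetic record in the same layout (for illustration only)."""
    raw = word.encode("utf-16-le")
    return struct.pack("<H", len(raw)) + raw + b"\x00" * 12


buf = io.BytesIO(pack_word("搜狗") + pack_word("词库"))
words = [read_word(buf), read_word(buf)]
print(words)
```

The length prefix counts bytes, not characters, which is why a two-character word is stored with a length of 4.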


--------------------------------------------------------------------------------