├── .hgignore ├── INSTALL.md ├── README.md ├── debian ├── changelog ├── compat ├── control ├── copyright ├── docs ├── rules └── source │ └── format ├── dict.txt.gz ├── doc ├── ps_cls.md ├── ps_cutter.md ├── ps_dbmgr.md └── train.md ├── example.py ├── ps_cls ├── ps_cutter ├── ps_dbmgr ├── ps_spider ├── segment ├── __init__.py ├── cut.py ├── dbase.py ├── dictdb.py ├── dyn.py ├── fisher.py └── train.py ├── setup.py └── test.py /.hgignore: -------------------------------------------------------------------------------- 1 | syntax: glob 2 | 3 | pylint.* 4 | txt/* 5 | cls/* 6 | *.db 7 | *.elc 8 | *.pyc 9 | *.pyo 10 | *.so 11 | *.o 12 | *~ 13 | -------------------------------------------------------------------------------- /INSTALL.md: -------------------------------------------------------------------------------- 1 | # 准备 # 2 | 3 | 如果你不准备进行安装部署,可以跳过安装和打包这两步。如果你打算使用cutter工具,请安装chardet。如果你打算使用spider工具,请安装html2text。 4 | 5 | # 获得代码 # 6 | 7 | 你可以使用以下代码,直接从版本库中复制一个可用版本出来。 8 | 9 | hg clone https://shell909090@code.google.com/p/python-segment/ 10 | 11 | 或者可以从这里下载一个最新版本的包。 12 | 13 | # 直接安装 # 14 | 15 | 使用setup.py直接安装。 16 | 17 | # debian打包 # 18 | 19 | debuild直接打包。 20 | 21 | # 词典生成 # 22 | 23 | 按照如下方式,使用dbmgr生成frq.db文件。 24 | 25 | gunzip dict.tar.gz 26 | ./ps_dbmgr create dict.txt 27 | 28 | 你可以看到生成了frq.db,这是词典的默认文件名。注意,词典文件的格式和具体的版本有关,换用版本后最好重新生成词典。 29 | 30 | # 命令行使用 # 31 | 32 | 假定有一个文本文件,test.txt,里面内容是中文平文本,编码任意。 33 | 34 | ./ps_cutter cutshow test.txt 35 | 36 | cutter会自动推测编码。 37 | 38 | # 代码使用 # 39 | 40 | 假如当前有一个frq.db词库。 41 | 42 | import segment 43 | cut = segment.get_cutter('frq.db') 44 | print list(cut.parse(u'工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作')) 45 | 46 | 注意,仅仅使用parse是不会进行分词的,因为parse返回的是一个生成器。 47 | 48 | # 分类器命令行例子 # 49 | 50 | 假定有一个spam目录,一个ham目录,一个example.txt文件。以下过程可以分析example.txt归属于哪个分类。 51 | 52 | ./ps_cls create_fisher 53 | ./ps_cls train_fisher hamdir/ spam/ 54 | ./ps_cls classify_fisher example.txt 55 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # 简介 # 2 | 3 | python-segment是一个纯python实现的分词库,他的目标是提供一个可用的,完善的分词系统和训练环境,包括一个可用的词典。 4 | 5 | # 词典说明 # 6 | 7 | python-segment的词典是带词频无词性词典,程序基于剪枝和词频概率工作,不考虑词性,不考虑马尔可夫链。词典含两部分内容,单字词频和词组词频。两者的统计和使用是分离的。 8 | 9 | * 单字词频,某个词的出现次数。 10 | * 词粗词频,某个词组的出现次数。 11 | 12 | 词典一般有两种形态,marshal格式和txt格式,dbmgr工具提供了词典的管理界面。 13 | 14 | # 词频规则 # 15 | 16 | 词频有两种表示方法,一种为词语的出现次数。在marshal词典和txt词典中使用此种表示方法。另一种为词的出现次数除以总词量的对数,在计算中通常采用后者。在实际计算概率的时候,采用通乘所有碎片的概率。使用对数词频可以将这一行为化为加法。 17 | 18 | # 内置词典使用说明 # 19 | 20 | 内置词典是txt格式词典,通过gzip压缩。解压后使用dbmgr可以生成一个marshal格式词典。 21 | 22 | # 性能说明 # 23 | 24 | 在一台虚拟机上测试的结果,载入词典后消耗内存(带python)大约60m,分词效率大约100k字/秒。注意,默认情况下,程序使用yield返回分词结果,这不会消耗太多内存。但是如果需要保留分词得到的每个词语碎片,将耗费大量内存。根据测试,一个10M的文本文件(大约500W字)需要120m以上的内存来保持词语碎片。 25 | 26 | # 授权 # 27 | 28 | The MIT License (MIT) 29 | 30 | Copyright (c) 2010 Shell.Xu 31 | 32 | Permission is hereby granted, free of charge, to any person obtaining a copy 33 | of this software and associated documentation files (the "Software"), to deal 34 | in the Software without restriction, including without limitation the rights 35 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 36 | copies of the Software, and to permit persons to whom the Software is 37 | furnished to do so, subject to the following conditions: 38 | 39 | The above copyright notice and this permission notice shall be included in 40 | all copies or substantial portions of the Software. 
41 | 42 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 43 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 44 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 45 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 46 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 47 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 48 | THE SOFTWARE. 49 | -------------------------------------------------------------------------------- /debian/changelog: -------------------------------------------------------------------------------- 1 | python-segment (1.0+1) unstable; urgency=low 2 | 3 | * Initial release 4 | 5 | -- Shell Xu Sat, 02 Oct 2010 17:18:39 +0800 6 | -------------------------------------------------------------------------------- /debian/compat: -------------------------------------------------------------------------------- 1 | 7 2 | -------------------------------------------------------------------------------- /debian/control: -------------------------------------------------------------------------------- 1 | Source: python-segment 2 | Section: python 3 | Priority: optional 4 | Maintainer: Shell Xu 5 | Build-Depends: debhelper (>= 7.0.50~) 6 | Standards-Version: 3.9.2 7 | Homepage: http://shell909090.com/ 8 | 9 | Package: python-segment 10 | Architecture: all 11 | Depends: ${shlibs:Depends}, ${misc:Depends}, ${python:Depends} 12 | Description: Chinese word segmentation library written in Python 13 | A pure-Python Chinese word segmentation library. You can train your own dictionary. 14 | -------------------------------------------------------------------------------- /debian/copyright: -------------------------------------------------------------------------------- 1 | Format: http://anonscm.debian.org/viewvc/dep/web/deps/dep5.mdwn?revision=174 2 | Upstream-Name: python-segment 3 | Source: https://shell909090.com/ 4 | 5 | Files: * 6 | Copyright: 2010 Shell Xu 7 | License: MIT 8 | 9 | License: MIT 10 | Permission is hereby granted, free of charge, to any person obtaining a copy 11 | of this software and associated documentation files (the "Software"), to deal 12 | in the Software without restriction, including without limitation the rights 13 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 14 | copies of the Software, and to permit persons to whom the Software is 15 | furnished to do so, subject to the following conditions: 16 | . 17 | The above copyright notice and this permission notice shall be included in 18 | all copies or substantial portions of the Software. 19 | . 20 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 21 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 22 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 23 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 24 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 25 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN 26 | THE SOFTWARE.
27 | -------------------------------------------------------------------------------- /debian/docs: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/shell909090/python-segment/ca725a3827bdd0df6e6b5ee1adaa08aa6230685c/debian/docs -------------------------------------------------------------------------------- /debian/rules: -------------------------------------------------------------------------------- 1 | #!/usr/bin/make -f 2 | # -*- makefile -*- 3 | 4 | %: 5 | dh $@ --with python2 6 | -------------------------------------------------------------------------------- /debian/source/format: -------------------------------------------------------------------------------- 1 | 3.0 (native) 2 | -------------------------------------------------------------------------------- /dict.txt.gz: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/shell909090/python-segment/ca725a3827bdd0df6e6b5ee1adaa08aa6230685c/dict.txt.gz -------------------------------------------------------------------------------- /doc/ps_cls.md: -------------------------------------------------------------------------------- 1 | # 简介 # 2 | 3 | cls是分类系统,负责各种分类器的console界面。 4 | 5 | # 参数 # 6 | 7 | * -d: 用于指定分词词典。 8 | * -f: 用于指定统计频率词典。 9 | 10 | # 命令 # 11 | 12 | * create_fisher: 从一个txt词库生成一个fisher的marshal词库,如果目标词库存在则覆盖原始内容。 13 | 14 | ps_cls create_fisher 15 | 16 | 注意,内容为空。 17 | 18 | * train_fisher: 统计ham目录和spam目录,并且更新数据库。 19 | 20 | ps_cls train_fisher hamdir spamdir 21 | 22 | * classify_fisher: 分析文件,计算出归属ham和spam的可能性。 23 | 24 | ps_cls classify_fisher spam.txt 25 | -------------------------------------------------------------------------------- /doc/ps_cutter.md: -------------------------------------------------------------------------------- 1 | # 简介 # 2 | 3 | cutter是分词和训练的console界面。 4 | 5 | # 参数 # 6 | 7 | * -d: 用于指定目标marshal词典。 8 | 9 | # 命令 # 10 | 11 | * cutstr: cutstr将一个句子分词,并且在屏幕上展现分词过程。 12 | 13 | cutstr: cutstr sentence : cut string and show out. 14 | 15 | * cut: cut负责分词一个或多个文件,不在屏幕上打印。 16 | 17 | cut: cut filepath ... : cut a file and not print out. 18 | 19 | * cutshow: cutshow负责分词一个或多个文件,在屏幕上打印。 20 | 21 | cutshow: cutshow filepath ... : cut a file and print out. 22 | 23 | * frqtrain: frqtrain负责训练一个或多个文件的词频。 24 | 25 | frqtrain: frqtrain filepath ... : train frequency by files. 26 | 27 | * frqtrains: frqtrains负责训练一个目录下所有文件的词频。 28 | 29 | frqtrains: frqtrains dirpath : train frequency by all files under dir. 30 | 31 | * newtrain: newtrain负责训练一个或多个文件的新词。 32 | 33 | newtrain: newtrain filepath ... : train new words by files. 34 | 35 | * newtrains: newtrains负责训练一个目录下所有文件的新词。 36 | 37 | newtrains: newtrains dirpath : train new words by all files under dir. 38 | 39 | * frqstat: frqstat负责训练一个或多个文件的字频。 40 | 41 | frqstat: frqstat filepath ... : statistics frequency of char in files. 42 | 43 | * frqstats: frqstats负责训练一个目录下所有文件的字频。 44 | 45 | frqstats: frqstats dirpath : statistics frequency in all files under dir. 
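
# Example #

A minimal session, assuming frq.db has already been generated with ps_dbmgr (see INSTALL.md) and that test.txt and a corpus/ directory of UTF-8 text files exist; both paths are placeholders:

    ./ps_cutter cutstr 长春市长春节致辞
    ./ps_cutter cutshow test.txt
    ./ps_cutter frqtrains corpus/

cutstr prints the segmentation of the sentence together with the intermediate scoring, cutshow prints the segmentation of each file, and frqtrains accumulates word frequencies over every file under corpus/ and writes them back to the dictionary. The -d option points any command at a dictionary other than the default frq.db.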
46 | -------------------------------------------------------------------------------- /doc/ps_dbmgr.md: -------------------------------------------------------------------------------- 1 | # 简介 # 2 | 3 | dbmgr是词典管理工具,他提供了词典的console管理界面。 4 | 5 | # 参数 # 6 | 7 | * -d: 用于指定目标marshal词典。 8 | 9 | # 命令 # 10 | 11 | * create: 从一个txt词库生成一个marshal词库,如果目标词库存在则覆盖原始内容。 12 | 13 | ps_dbmgr create dict.txt 14 | 15 | dict.txt必须存在且格式正确,文件编码必须为utf-8,且格式正确。 16 | 17 | * importdb: 从一个txt词库导入数据,目标词库必须存在,新数据叠加到目标词库上。 18 | 19 | ps_dbmgr importdb dict.txt 20 | 21 | * exportdb: 导出数据到txt文件上,目标如果存在则覆盖。 22 | 23 | ps_dbmgr exportdb dict.txt 24 | 25 | * add: 增加中文词语词频。如果词语不存在则添加。 26 | 27 | ps_dbmgr add 中文 10 28 | 29 | * remove: 删除中文词语。 30 | 31 | ps_dbmgr remove 中文 32 | 33 | * lookup: 查找中文字和词语的对数频率。 34 | 35 | ps_dbmgr lookup 中文 36 | 37 | 注意,长度为1的会查找字频表,而长度超过1会查找词频表。表中不存在的字词会按照出现概率为1进行计算。 38 | 39 | * cals: 计算该字符串的对数频率和。不理会其中的词语,直接将字符串中的每个字的对数频率求和。 40 | 41 | ps_dbmgr cals 中文 42 | 43 | * stat: 词语数据库的统计数据。包括有多少个词,总词频多少,平均词频多少,最高词频多少,首层dict宽度,平均二层宽度。 44 | 45 | ps_dbmgr stat 46 | 47 | * waterlevel: 计算该水位以上的词语数量。即查找出现次数大于等于当前值的所有词语总量。 48 | 49 | ps_dbmgr waterlevel 1.1 50 | 51 | * scale: 所有词语的出现次数乘以一个调整因子。 52 | 53 | ps_dbmgr scale 0.9 54 | 55 | 该功能通常用于词库合并,将两个词库的平均出现频率调整到一致后进行合并比较好。 56 | 57 | * shrink: 切割词库,删除所有小于指定水平的词。 58 | 59 | ps_dbmgr shrink 0.9 60 | 61 | 可以通过waterlevel计算出还会残留多少词。 62 | 63 | * flat: 扁平化,将所有词频调整为1。 64 | 65 | ps_dbmgr flat 66 | 67 | * trans_cedict: 将cedict词典转换为可用的txt词典。 68 | 69 | ps_dbmgr trans_cedict in.txt out.txt 70 | 71 | * trans_plain: 将无词频平文本词典转换为可用的txt词典。 72 | 73 | ps_dbmgr trans_plain in.txt out.txt 74 | -------------------------------------------------------------------------------- /doc/train.md: -------------------------------------------------------------------------------- 1 | # 词库训练简介 # 2 | 3 | python-segment依赖于词库和词频工作,因此取得合适的词库就变成一个比较重要的问题。 4 | 5 | # 准备工作 # 6 | 7 | 准备一些文本样本,一个足够大的初始词库。文本样本至少在1000W汉字以上,以utf-8格式存储,大约30M以上大小。推荐1亿汉字以上,以utf-8格式存储,大约100M。初始词库不应当小于30W词汇量,推荐50W以上。汉语常用词大约15W左右。 8 | 9 | # 字频统计 # 10 | 11 | 使用cutter的frqstats功能,统计文本样本,得到字频表。通常字频表直接输出即可使用,一般中文常用字大约6000个上下。其余字字频计为1。 字频统计可以用于各种类别的txt上,得到不同的字频结果。不过通常来说意义不大。因为字频影响最大的是高频字,而汉语的高频字大多数情况下字频相对固定。 12 | 13 | # 词库转换 # 14 | 15 | 初始词库通常是cedict格式或者是平文本格式,使用dbmgr的trans_plain或者trans_cedict功能,将其转换为初始带词频词表。 16 | 17 | # 生成初始txt词库 # 18 | 19 | 将字频表和词库合并,就可以得到初始txt词库。 20 | 21 | # 新词训练 # 22 | 23 | 首先导入初始txt词库,进行第一次新词训练。使用cutter的newtrains功能,可以得到一批高频未识别词和频率。注意,这一过程会耗费比较长的时间和内存。请准备一台好点的机器来做。人工阅览这些词,将其中不合适的删除,就可以得到带词频新词表。将这个新词表添加到初始txt词库后面,就可以得到未经词频训练的完整词库。 新词训练也适用于各种类别的文本素材,尤其是针对特性领域,使用新词训练理论上可以识别出一批新词,从而增强机器的分词准确性。然而注意,新词训练是基于不可分词语,对于在常见词基础上附加而成的新词识别能力很差,而且需要大量人力进行新词识别。 24 | 25 | # 词频训练 # 26 | 27 | 使用cutter的frqtrains功能,对文本进行统计,可以得到词语在文本中的出现概率。注意,这一过程会耗费比较长的时间和内存。请准备一台好点的机器来做。训练完成后会自动保存词频。 词频训练对不同领域的文本素材有很强的针对性。在针对某个领域进行训练后,可以有效提高该领域的识别度。但是注意,如果词库不涵盖该领域所需词汇的话,词频训练效果很难体现。 28 | 29 | # 词典输出和保存 # 30 | 31 | marshal词典保存了相当多的数据结构,加载速度快。然而可读性差,文件大小大。因此可以用dbmgr的exportdb功能输出为txt格式词典,而后用gzip压缩。一般输出为txt词典会减小25%的大小,gzip压缩会减小66%的大小。以范例词典而言,压缩得到的词典为原始词典的1/4。 32 | 33 | # 词典的合并 # 34 | 35 | 一般来说,用户的常用词典只有一个。但是基于某种理由,我们常常需要对词典进行合并,添加新词,或者补充训练结果。对词典的合并通常依照“词频调整-合并-重新训练词频”的步骤进行。 36 | 37 | 1. 首先调整词频,将两者词语平均出现概率调整到一致的地步。这步的最主要目地是为了防止后续训练时,某个词典中的词平均词频太小,导致被选中的概率相对偏小,从而出现聚集效应。 38 | 2. 词典合并,通过dbmgr的importdb指令进行词典合并。 39 | 3. 重训练,对合并后的词典,推荐将平均词频调整到1,重新进行词频训练。 40 | 41 | # 有效词的截取 # 42 | 43 | 一个词典内不是所有词都有同样的效用,某些词出现的概率会比其他词低很多,在整个识别过程中,多付出相当数量的资源而无法获得太多的效果提升。我们可以通过训练-词频截取的方法去除这些无效词。 44 | 45 | 1. 首先调整平均词频到1。 46 | 2. 针对某个样本进行词频训练,得到训练后词表。 47 | 3. 
通过waterlevel指令进行水位查询,确定截取水位。通常推荐的截取水位为1.1,因为未出现的词语的默认次数为1。而在训练后词表中出现次数为1的词相当于在文本中未出现。 48 | 4. 通过shrink指令进行词频截取,截取结果会自动保存为工作词典。 49 | 50 | 有效词的截取能够有效的增强系统的性能,减少词典内存占用,增加工作速度。但是去掉的无效词并不代表真的无效。去掉多余的无效词后,在识别的精度上会受到一定的影响。有效词的截取属于以精度换取资源的做法 51 | -------------------------------------------------------------------------------- /example.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # -*- coding: utf-8 -*- 3 | ''' 4 | @date: 2011-11-22 5 | @author: shell.xu 6 | ''' 7 | import segment 8 | 9 | cut = segment.get_cutter('frq.db') 10 | 11 | def cuttest(word): 12 | print word 13 | print u'|'.join([i for i in cut.parse(word.decode('utf-8'))]) 14 | 15 | def test_dyn1(): 16 | cuttest("这是一个伸手不见五指的黑夜。我叫孙悟空,我爱北京,我爱Python和C++。") 17 | cuttest("我不喜欢日本和服。") 18 | cuttest("雷猴回归人间。") 19 | cuttest("工信处女干事每月经过下属科室都要亲口交代24口交换机等技术性器件的安装工作") 20 | cuttest("我需要廉租房") 21 | cuttest("永和服装饰品有限公司") 22 | cuttest("我爱北京天安门") 23 | cuttest("abc") 24 | cuttest("隐马尔可夫") 25 | cuttest("雷猴是个好网站") 26 | cuttest("“Microsoft”一词由“MICROcomputer(微型计算机)”和“SOFTware(软件)”两部分组成") 27 | cuttest("草泥马和欺实马是今年的流行词汇") 28 | cuttest("伊藤洋华堂总府店") 29 | cuttest("中国科学院计算技术研究所") 30 | cuttest("罗密欧与朱丽叶") 31 | cuttest("我购买了道具和服装") 32 | cuttest("PS: 我觉得开源有一个好处,就是能够敦促自己不断改进,避免敞帚自珍") 33 | cuttest("湖北省石首市") 34 | cuttest("总经理完成了这件事情") 35 | cuttest("电脑修好了") 36 | cuttest("做好了这件事情就一了百了了") 37 | cuttest("人们审美的观点是不同的") 38 | cuttest("我们买了一个美的空调") 39 | cuttest("线程初始化时我们要注意") 40 | cuttest("一个分子是由好多原子组织成的") 41 | cuttest("祝你马到功成") 42 | cuttest("他掉进了无底洞里") 43 | cuttest("中国的首都是北京") 44 | cuttest("孙君意") 45 | cuttest("外交部发言人马朝旭") 46 | cuttest("领导人会议和第四届东亚峰会") 47 | cuttest("在过去的这五年") 48 | cuttest("还需要很长的路要走") 49 | cuttest("60周年首都阅兵") 50 | cuttest("你好人们审美的观点是不同的") 51 | cuttest("买水果然后来世博园") 52 | cuttest("买水果然后去世博园") 53 | cuttest("但是后来我才知道你是对的") 54 | cuttest("存在即合理") 55 | cuttest("的的的的的在的的的的就以和和和") 56 | cuttest("I love你,不以为耻,反以为rong") 57 | cuttest("hello你好人们审美的观点是不同的") 58 | cuttest("很好但主要是基于网页形式") 59 | cuttest("hello你好人们审美的观点是不同的") 60 | cuttest("为什么我不能拥有想要的生活") 61 | cuttest("后来我才") 62 | cuttest("此次来中国是为了") 63 | cuttest("使用了它就可以解决一些问题") 64 | cuttest(",使用了它就可以解决一些问题") 65 | cuttest("其实使用了它就可以解决一些问题") 66 | cuttest("好人使用了它就可以解决一些问题") 67 | cuttest("是因为和国家") 68 | cuttest("老年搜索还支持") 69 | cuttest("干脆就把那部蒙人的闲法给废了拉倒!RT @laoshipukong : 27日,全国人大常委会第三次审议侵权责任法草案,删除了有关医疗损害责任“举证倒置”的规定。在医患纠纷中本已处于弱势地位的消费者由此将陷入万劫不复的境地。 ") 70 | 71 | if __name__ == '__main__': 72 | test_dyn1() 73 | -------------------------------------------------------------------------------- /ps_cls: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # -*- coding: utf-8 -*- 3 | ''' 4 | @date: 2011-11-24 5 | @author: shell.xu 6 | ''' 7 | import os, sys, time, getopt 8 | import traceback, unicodedata 9 | import chardet, segment 10 | from os import path 11 | 12 | segdbname = 'frq.db' 13 | dbname = 'classify.db' 14 | 15 | def walkdir(basepath, func, *params): 16 | for base, dirs, files in os.walk(basepath): 17 | for filename in files: 18 | try: func(path.join(base, filename), *params) 19 | except: traceback.print_exc() 20 | 21 | def create_fisher(): 22 | ''' create : create fisher db. ''' 23 | fisher = segment.Fisher() 24 | fisher.savefile(dbname) 25 | 26 | def train_fisher(hamdir, spamdir): 27 | ''' train_fisher hamdir spamdir : train fisher with ham and spam. 
''' 28 | seg = segment.get_cutter(segdbname) 29 | fisher = segment.Fisher(dbname) 30 | fisher.set_segment(seg.parse) 31 | walkdir(hamdir, fisher.trainfile, 'ham') 32 | walkdir(spamdir, fisher.trainfile, 'spam') 33 | fisher.sync() 34 | 35 | def classify_fisher(*filepathes): 36 | ''' classify_fisher filepath .. : classify file. ''' 37 | seg = segment.get_cutter(segdbname) 38 | fisher = segment.Fisher(dbname) 39 | fisher.set_segment(seg.parse) 40 | for fp in filepathes: 41 | fisher.classifyfile(fp, ['ham', 'spam']) 42 | 43 | cmds = ['create_fisher', 'train_fisher', 'classify_fisher',] 44 | def main(): 45 | opts, args = getopt.getopt(sys.argv[1:], 'd:f:') 46 | for opt, val in opts: 47 | if opt == '-d': 48 | global segdbname 49 | segdbname = val 50 | elif opt == '-f': 51 | global dbname 52 | dbname = val 53 | if len(args) == 0: 54 | print '%s cmds params ...' % sys.argv[0] 55 | for name in cmds: print '%s:%s' % (name, eval(name).__doc__) 56 | else: eval(args[0])(*args[1:]) 57 | 58 | if __name__ == '__main__': main() 59 | -------------------------------------------------------------------------------- /ps_cutter: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # -*- coding: utf-8 -*- 3 | ''' 4 | @date: 2011-11-16 5 | @author: shell.xu 6 | ''' 7 | import os, sys, time, getopt, traceback 8 | import unicodedata, chardet, segment 9 | from os import path 10 | 11 | dbname = 'frq.db' 12 | 13 | def walkdir(basepath, func, *params): 14 | for base, dirs, files in os.walk(basepath): 15 | for filename in files: 16 | try: func(path.join(base, filename), *params) 17 | except: traceback.print_exc() 18 | 19 | def cutstr(sentence): 20 | ''' cutstr sentence : cut string and show out. ''' 21 | cut = segment.get_cutter(dbname) 22 | segment.DynamicCutter.DEBUG = True 23 | print '|'.join(cut.parse(sentence.decode('utf-8'))) 24 | 25 | def cut(*filepathes): 26 | ''' cut filepath ... : cut a file and not print out. ''' 27 | cut = segment.get_cutter(dbname) 28 | for fp in filepathes: list(cut.parsefile(fp)) 29 | 30 | def cutshow(*filepathes): 31 | ''' cutshow filepath ... : cut a file and print out. ''' 32 | cut = segment.get_cutter(dbname) 33 | for fp in filepathes: print '|'.join(cut.parsefile(fp)).encode('utf-8') 34 | 35 | def frqtrain(*filepathes): 36 | ''' frqtrain filepath ... : train frequency by files. ''' 37 | stat = segment.StatCutter(segment.dictdb(dbname)) 38 | cut = segment.StringCutter(stat) 39 | for fp in filepathes: cut.parsefile(filepath) 40 | stat.train() 41 | 42 | def frqtrains(basepath): 43 | ''' frqtrains dirpath : train frequency by all files under dir. ''' 44 | stat = segment.StatCutter(segment.dictdb(dbname)) 45 | cut = segment.StringCutter(stat) 46 | walkdir(basepath, cut.paesefile) 47 | stat.train(True) 48 | 49 | def newtrain(*filepathes): 50 | ''' newtrain filepath ... : train new words by files. ''' 51 | new = segment.NewCutter(segment.dictdb(dbname)) 52 | cut = segment.StringCutter(new) 53 | for fp in filepathes: cut.paesefile(fp) 54 | for word, frq in new.get_highfrq(): 55 | print word.encode('utf-8'), frq 56 | 57 | def newtrains(basepath): 58 | ''' newtrains dirpath : train new words by all files under dir. 
''' 59 | new = segment.NewCutter(segment.dictdb(dbname)) 60 | cut = segment.StringCutter(new) 61 | walkdir(basepath, cut.paesefile) 62 | for word, frq in new.get_highfrq(): 63 | print word.encode('utf-8'), frq 64 | 65 | def frqfile(filepath, frq): 66 | print 'process', filepath 67 | data = segment.readfile_cd(filepath) 68 | for i in data: 69 | if i not in frq: frq[i] = 0 70 | frq[i] += 1 71 | 72 | def frqstat(*filepathes): 73 | ''' frqstat filepath ... : statistics frequency of char in files. ''' 74 | frq = {} 75 | for fp in filepathes: frqfile(fp, frq) 76 | for k, v in sorted(frq.items(), key = lambda x:x[1], reverse = True): 77 | if unicodedata.category(k) != 'Lo': continue 78 | print k.encode('utf-8'), v 79 | 80 | def frqstats(basepath): 81 | ''' frqstats dirpath : statistics frequency in all files under dir. ''' 82 | frq = {} 83 | walkdir(basepath, frqfile, frq) 84 | for k, v in sorted(frq.items(), key = lambda x:x[1], reverse = True): 85 | if unicodedata.category(k) != 'Lo': continue 86 | print k.encode('utf-8'), v 87 | 88 | cmds = ['cutstr', 'cut', 'cutshow', 'frqtrain', 'frqtrains', 89 | 'newtrain', 'newtrains', 'frqstat', 'frqstats'] 90 | def main(): 91 | opts, args = getopt.getopt(sys.argv[1:], 'd:') 92 | for opt, val in opts: 93 | if opt == '-d': 94 | global dbname 95 | dbname = val 96 | if len(args) == 0: 97 | print '%s cmds params ...' % sys.argv[0] 98 | for name in cmds: print '%s:%s' % (name, eval(name).__doc__) 99 | else: eval(args[0])(*args[1:]) 100 | 101 | if __name__ == '__main__': main() 102 | -------------------------------------------------------------------------------- /ps_dbmgr: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # -*- coding: utf-8 -*- 3 | ''' 4 | @date: 2011-11-16 5 | @author: shell.xu 6 | ''' 7 | import os, sys, getopt 8 | import segment 9 | 10 | dbname = 'frq.db' 11 | 12 | def create(filepath): 13 | ''' create filepath : create db form txt dict. ''' 14 | ddb = segment.dictdb() 15 | ddb.importtxt(filepath) 16 | ddb.savefile(dbname) 17 | 18 | def importdb(filepath): 19 | ''' importdb filepath : import txt dict to db. ''' 20 | ddb = segment.dictdb(dbname) 21 | ddb.importtxt(filepath) 22 | ddb.sync() 23 | 24 | def exportdb(filepath): 25 | ''' exportdb filepath : export txt dict from db. ''' 26 | ddb = segment.dictdb(dbname) 27 | with open(filepath, 'w') as fo: ddb.exporttxt(fo) 28 | 29 | def add(word, frq): 30 | ''' add word frq : add a new word with frequency. ''' 31 | ddb = segment.dictdb(dbname) 32 | ddb.add(word.decode('utf-8'), float(frq)) 33 | ddb.sync() 34 | 35 | def remove(word): 36 | ''' remove word : remove word. ''' 37 | ddb = segment.dictdb(dbname) 38 | ddb.remove(word.decode('utf-8')) 39 | ddb.sync() 40 | 41 | def lookup(word): 42 | ''' lookup word : lookup word from dict. ''' 43 | ddb = segment.dictdb(dbname) 44 | word = word.decode('utf-8') 45 | if len(word) == 1: print word, ddb.gets(word) 46 | else: print word, ddb.get(word) 47 | 48 | def cals(word): 49 | ''' cals word : calculate word frequency as char set. ''' 50 | ddb = segment.dictdb(dbname) 51 | word = word.decode('utf-8') 52 | print word, ddb.cals(word) 53 | 54 | def stat(): 55 | ''' stat : dict db statistic. 
''' 56 | ddb = segment.dictdb(dbname) 57 | count, sumval = len(list(ddb.values())), sum(ddb.values()) 58 | print 'count: %d\nsumvalue: %f' % (count, sumval) 59 | print 'avgvalue: %f\nmaxvalue: %f' % (sumval / count, max(ddb.values())) 60 | print 'root: %d\navgdict: %f' % (len(ddb.db), float(count) / len(ddb.db)) 61 | 62 | def waterlevel(wl): 63 | ''' waterlevel wl : see how many word has higher frequency then waterlevel. ''' 64 | ddb = segment.dictdb(dbname) 65 | print len([i for i in ddb.values() if i >= threshold]) 66 | 67 | def scale(factor): 68 | ''' scale factor : scale dict with factor. ''' 69 | ddb = segment.dictdb(dbname) 70 | f = float(factor) 71 | for h, vs in ddb.db.iteritems(): 72 | for k, v in vs.iteritems(): vs[k] = v * f 73 | ddb.normalize() 74 | ddb.sync() 75 | 76 | def shrink(threshold): 77 | ''' shrink threshold : remove words which has lower frequency then threshold. ''' 78 | ddb = segment.dictdb(dbname) 79 | t = float(threshold) 80 | zero = [] 81 | for h, vs in ddb.db.iteritems(): 82 | ddb.db[h] = dict([(k, v) for k, v in vs.iteritems() if v >= t]) 83 | if len(ddb.db[h]) == 0: zero.append(h) 84 | for z in zero: del ddb.db[z] 85 | ddb.normalize() 86 | ddb.sync() 87 | 88 | def flat(): 89 | ''' flat : set every word's frequency to 1. ''' 90 | ddb = segment.dictdb(dbname) 91 | for h, vs in ddb.db.iteritems(): 92 | for k, v in vs.iteritems(): vs[k] = 1 93 | ddb.normalize() 94 | ddb.sync() 95 | 96 | def write_dict(words, outfile): 97 | with open(outfile, 'w') as fo: 98 | for word in list(words): 99 | fo.write((u'%s 1\n' % word).encode('utf-8')) 100 | 101 | def read_file(infile, func, words): 102 | with open(infile, 'r') as fi: 103 | for line in fi: 104 | if line.startswith('#'): continue 105 | word = func(line) 106 | if word: words.add(word) 107 | 108 | def trans_cedict(infile, outfile): 109 | ''' trans_cedict infile outfile : translate cedict infile to outfile. ''' 110 | words = set() 111 | def __inner(line, words): 112 | word = line.split()[1].decode('utf-8') 113 | if len(word) > 1: return word 114 | read_file(infile, __inner, words) 115 | write_dict(words, outfile) 116 | 117 | def trans_plain(infile, outfile): 118 | ''' trans_plain infile outfile : translate plain infile to outfile. ''' 119 | words = set() 120 | def __inner(line, words): 121 | try: word = line.decode('utf-8').strip() 122 | except: return 123 | if len(word) > 1: return word 124 | read_file(infile, __inner, words) 125 | write_dict(words, outfile) 126 | 127 | cmds = ['create', 'importdb', 'exportdb', 'add', 'remove', 128 | 'lookup', 'cals', 'stat', 'waterlevel', 'scale', 129 | 'shrink', 'flat', 'trans_cedict', 'trans_plain'] 130 | def main(): 131 | opts, args = getopt.getopt(sys.argv[1:], 'd:') 132 | for opt, val in opts: 133 | if opt == '-d': 134 | global dbname 135 | dbname = val 136 | if len(args) == 0: 137 | print '%s cmds params ...' 
% sys.argv[0] 138 | for name in cmds: print '%s:%s' % (name, eval(name).__doc__) 139 | else: eval(args[0])(*args[1:]) 140 | 141 | if __name__ == '__main__': main() 142 | -------------------------------------------------------------------------------- /ps_spider: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # -*- coding: utf-8 -*- 3 | ''' 4 | @date: 2011-11-18 5 | @author: shell.xu 6 | ''' 7 | import os, sys, urllib 8 | import threading, urlparse, traceback 9 | import html2text, chardet 10 | from StringIO import StringIO 11 | from lxml import etree, html 12 | sys.path.insert(0, '../') 13 | import segment 14 | 15 | queue = [] 16 | 17 | class spider(threading.Thread): 18 | 19 | def __init__(self, cut, fix): 20 | super(spider, self).__init__() 21 | self.cut, self.fix = cut, fix 22 | 23 | def run(self): 24 | while queue: 25 | try: self.proc(queue.pop()) 26 | except: traceback.print_exc() 27 | 28 | def proc(self, job): 29 | if urlparse.urlparse(job[0]).netloc != self.fix: return 30 | print 'process %s' % job[0] 31 | httpfile = urllib.urlopen(job[0]) 32 | content_type = httpfile.info()['Content-Type'] 33 | if content_type not in ['text/html', 'text/plain']: return 34 | data = httpfile.read() 35 | enc = chardet.detect(data[1200:]).get('encoding', 'utf-8') 36 | if enc is None: enc = 'utf-8' 37 | data = data.decode(enc, 'ignore') 38 | if content_type == 'text/html': 39 | if job[1] > 0: 40 | doc = html.fromstring(data) 41 | doc.make_links_absolute(job[0]) 42 | for link in doc.iterlinks(): 43 | queue.append((link[2], job[1] - 1)) 44 | txt = html2text.html2text(data) 45 | elif content_type == 'text/plain': txt = data 46 | list(self.cut.parse(txt)) 47 | 48 | def main(): 49 | db = segment.dictdb('frq.db') 50 | queue.append((sys.argv[2], 2)) 51 | trains, spiders = [], [] 52 | for i in xrange(10): 53 | if sys.argv[1] == 'new': 54 | new = segment.NewCutter(db) 55 | cut = segment.StringCutter(new) 56 | trains.append(new) 57 | else: 58 | stat = segment.StatCutter(db) 59 | cut = segment.StringCutter(stat) 60 | trains.append(stat) 61 | s = spider(cut, urlparse.urlparse(sys.argv[2]).netloc) 62 | s.start() 63 | spiders.append(s) 64 | for s in spiders: s.join() 65 | t = reduce(lambda x, y: x.join(y), trains) 66 | if sys.argv[1] == 'new': 67 | for k, v in t.get_highfrq(): 68 | print k.encode('utf-8'), v 69 | else: t.train(True) 70 | 71 | if __name__ == '__main__': main() 72 | -------------------------------------------------------------------------------- /segment/__init__.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # -*- coding: utf-8 -*- 3 | ''' 4 | @date: 2011-11-16 5 | @author: shell.xu 6 | ''' 7 | from dbase import readfile_cd 8 | from dictdb import dictdb 9 | from cut import StringCutter 10 | from dyn import DynamicCutter 11 | from train import StatCutter, NewCutter 12 | from fisher import Fisher 13 | 14 | def get_cutter(filepath): 15 | return StringCutter(DynamicCutter(dictdb(filepath))) 16 | -------------------------------------------------------------------------------- /segment/cut.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # -*- coding: utf-8 -*- 3 | ''' 4 | @date: 2011-11-16 5 | @author: shell.xu 6 | ''' 7 | from dbase import * 8 | import unicodedata 9 | chrcat = unicodedata.category 10 | 11 | punct_set = set(['Ll', 'Lu', 'Lo', 'Nd']) 12 | def split_punct(stc, s): 13 | for e in xrange(s, len(stc)): 14 | if 
chrcat(stc[e]) in punct_set: return e 15 | return len(stc) 16 | 17 | def split_chinese(stc, s): 18 | for e in xrange(s, len(stc)): 19 | if chrcat(stc[e]) != 'Lo': return e 20 | return len(stc) 21 | 22 | def split_english(stc, s): 23 | for e in xrange(s, len(stc)): 24 | t = chrcat(stc[e]) 25 | if t not in ['Lu', 'Ll'] and stc[e] not in '-.\'': return e 26 | if e != s and pre_type == 'Ll' and t == 'Lu': return e 27 | pre_type = t 28 | return len(stc) 29 | 30 | def split_number(stc, s): 31 | for e in xrange(s, len(stc)): 32 | t = chrcat(stc[e]) 33 | if t != 'Nd' and stc[e] not in u',.': return e 34 | return len(stc) 35 | 36 | class StringCutter(CutterBase): 37 | def __init__(self, next): self.next = next 38 | 39 | def parse(self, stc): 40 | s, l = 0, len(stc) 41 | while s < l: 42 | t = chrcat(stc[s]) 43 | if t == 'Lo': 44 | e = split_chinese(stc, s) 45 | if not self.next: yield stc[s:e] 46 | else: 47 | for i in self.next.parse(stc[s:e]): yield i 48 | elif t == 'Nd': 49 | e = split_number(stc, s) 50 | if self.next: self.next.stop() 51 | yield stc[s:e] 52 | elif t in ['Ll', 'Lu']: 53 | e = split_english(stc, s) 54 | if self.next: self.next.stop() 55 | yield stc[s:e] 56 | else: 57 | e = split_punct(stc, s) 58 | if self.next: self.next.stop() 59 | yield stc[s:e] 60 | s = e 61 | 62 | def parsetp(self, stc): 63 | s, l = 0, len(stc) 64 | while s < l: 65 | t = chrcat(stc[s]) 66 | if t == 'Lo': 67 | e = split_chinese(stc, s) 68 | if not self.next: yield stc[s:e], 0 69 | else: 70 | for i in self.next.parsetp(stc[s:e]): yield i 71 | elif t == 'Nd': 72 | e = split_number(stc, s) 73 | if self.next: self.next.stop() 74 | yield stc[s:e], 2 75 | elif t in ['Ll', 'Lu']: 76 | e = split_english(stc, s) 77 | if self.next: self.next.stop() 78 | yield stc[s:e], 2 79 | else: 80 | e = split_punct(stc, s) 81 | if self.next: self.next.stop() 82 | yield stc[s:e], 2 83 | s = e 84 | -------------------------------------------------------------------------------- /segment/dbase.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # -*- coding: utf-8 -*- 3 | ''' 4 | @date: 2011-11-24 5 | @author: shell.xu 6 | ''' 7 | import chardet 8 | 9 | class dbase(object): 10 | def __init__(self, filepath = None): 11 | self.close() 12 | self.filepath = filepath 13 | if self.filepath: self.loadfile(filepath) 14 | 15 | def sync(self): 16 | if self.filepath: self.savefile(self.filepath) 17 | 18 | def savefile(self, filepath): 19 | with open(filepath, 'wb') as fo: self.save(fo) 20 | def loadfile(self, filepath): 21 | with open(filepath, 'rb') as fi: self.load(fi) 22 | 23 | def readfile_cd(filepath): 24 | with open(filepath, 'r') as fi: data = fi.read() 25 | if len(data) < 120: enc = chardet.detect(data)['encoding'] 26 | else: enc = chardet.detect(data[:120])['encoding'] 27 | if enc is None: enc = 'utf-8' 28 | return data.decode(enc, 'ignore') 29 | 30 | class CutterBase(object): 31 | def parsefile(self, filepath): 32 | print 'process', filepath 33 | return self.parse(readfile_cd(filepath)) 34 | 35 | -------------------------------------------------------------------------------- /segment/dictdb.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # -*- coding: utf-8 -*- 3 | ''' 4 | @date: 2011-11-16 5 | @author: shell.xu 6 | ''' 7 | import math, marshal 8 | from dbase import dbase 9 | 10 | class dictdb(dbase): 11 | 12 | def save(self, fo): 13 | marshal.dump((self.db, self.sdb, self.dbs, self.sdbs), fo) 14 | def load(self, fi): 
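        # The marshalled tuple is (db, sdb, dbs, sdbs): db maps the first two
        # characters of a word to {rest_of_word: raw count}, sdb maps a single
        # character to its raw count, and dbs/sdbs cache log(total count) of each
        # table so get() and gets() can return log-probabilities (log(count) - log(total)).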
15 | self.db, self.sdb, self.dbs, self.sdbs = marshal.load(fi) 16 | 17 | def close(self): 18 | self.db, self.sdb, self.dbs, self.sdbs = {}, {}, None, None 19 | 20 | def importtxt(self, filepath): 21 | with open(filepath, 'r') as fi: 22 | for line in fi: 23 | i = line.strip().decode('utf-8').split() 24 | if len(i[0]) == 1: self.sdb[i[0]] = float(i[1]) 25 | else: self.add(i[0], float(i[1])) 26 | self.normalize() 27 | 28 | def exporttxt(self, fo): 29 | for h, vs in self.db.iteritems(): 30 | for k, v in vs.iteritems(): 31 | d = u'%s %f\n' % (h + k, v) 32 | fo.write(d.encode('utf-8')) 33 | for k, v in self.sdb.iteritems(): 34 | fo.write('%s %f\n' % (k.encode('utf-8'), v)) 35 | 36 | def normalize(self): 37 | self.dbs = math.log(sum(self.values())) 38 | self.sdbs = math.log(sum(self.sdb.values())) 39 | 40 | def gets(self, w): 41 | return math.log(self.sdb.get(w, 1)) - self.sdbs 42 | 43 | def cals(self, ws): 44 | return sum(map(self.gets, ws)) 45 | 46 | def hifrqs(self, num): 47 | return sorted(self.sdb.items(), lambda x:x[1], reverse = True)[num:] 48 | 49 | def items(self): 50 | for h, vs in self.db.iteritems(): 51 | for k, v in vs.iteritems(): yield h + k, v 52 | 53 | def values(self): 54 | for vs in self.db.values(): 55 | for i in vs.values(): yield i 56 | 57 | def add(self, w, f): 58 | if len(w) < 2: return 59 | h, r = w[:2], w[2:] 60 | d = self.db.setdefault(h, {}) 61 | d[r] = d.get(r, 0) + f 62 | 63 | def remove(self, w): 64 | if len(w) < 2: return 65 | h, r = w[:2], w[2:] 66 | d = self.db.setdefault(h, {}) 67 | if r not in d: return 68 | del d[r] 69 | 70 | def get(self, w): 71 | if len(w) < 2: return 0 72 | h, r = w[:2], w[2:] 73 | if h not in self.db: return 0 74 | if r not in self.db[h]: return 0 75 | return math.log(self.db[h][r]) - self.dbs 76 | 77 | def match(self, sentence): 78 | if len(sentence) < 2: return 79 | h, r = sentence[:2], sentence[2:] 80 | if h not in self.db: return 81 | return [(h + k, math.log(v) - self.dbs) 82 | for k, v in self.db[h].iteritems() if r.startswith(k)] 83 | -------------------------------------------------------------------------------- /segment/dyn.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # -*- coding: utf-8 -*- 3 | ''' 4 | @date: 2011-11-16 5 | @author: shell.xu 6 | ''' 7 | from dbase import * 8 | from dictdb import dictdb 9 | 10 | class DynamicCutter(CutterBase): 11 | DEBUG = False 12 | 13 | def __init__(self, db): 14 | self.db, self.cache = db, {} 15 | 16 | @classmethod 17 | def cmp_subtree(cls, r1, r2): 18 | if r1 is None: return True 19 | if cls.DEBUG: print '+', r1; print '-', r2 20 | c1 = [len(r) for r, i in r1[1] if i == 0] 21 | c2 = [len(r) for r, i in r2[1] if i == 0] 22 | s1, s2 = sum(c1), sum(c2) 23 | if s1 != s2: return s1 > s2 24 | s1, s2 = len(r1[1]) - len(c1), len(r2[1]) - len(c2) 25 | if s1 != s2: return s1 > s2 26 | return r1[0] < r2[0] 27 | 28 | def rfindc(self, sentence, start_pos = 0): 29 | for i in xrange(start_pos, len(sentence)): 30 | cset = self.db.match(sentence[i:]) 31 | if cset: return i, cset 32 | return -1, [] 33 | 34 | def split(self, sentence): 35 | if sentence not in self.cache: 36 | self.cache[sentence] = self._split(sentence) 37 | return self.cache[sentence] 38 | def stop(self): self.cache = {} 39 | 40 | def _split(self, sentence): 41 | if not sentence: return 0, [] 42 | if len(sentence) == 1: return self.db.gets(sentence), [(sentence, 0),] 43 | 44 | # looking for the first matches word 45 | best_rslt = None 46 | poc, cset = self.rfindc(sentence) 47 | 
if poc == -1: 48 | return (sum(map(self.db.gets, sentence)), [(sentence, 0),]) 49 | pre_cut, rest = sentence[:poc], sentence[poc:] 50 | 51 | # see which of matches is the bast choice 52 | for c, f in cset: 53 | r = self.split(rest[len(c):])[:] 54 | temp_rslt = (f + r[0], [(c, 1),] + r[1]) 55 | if self.cmp_subtree(best_rslt, temp_rslt): 56 | best_rslt = temp_rslt 57 | 58 | # looking for choices not match in this position 59 | maxlen = min([len(c) for c, f in cset]) 60 | poc, cset = self.rfindc(rest, 1) 61 | while poc < maxlen and cset: 62 | r = self.split(rest[poc:]) 63 | temp_rslt = (self.db.cals(rest[:poc]) + r[0], 64 | [(rest[:poc], 0),] + r[1]) 65 | if self.cmp_subtree(best_rslt, temp_rslt): 66 | best_rslt = temp_rslt 67 | poc, cset = self.rfindc(rest, poc + 1) 68 | 69 | if pre_cut: 70 | best_rslt = (self.db.cals(pre_cut) + best_rslt[0], 71 | [(pre_cut, 0),] + best_rslt[1]) 72 | if self.DEBUG: 73 | print '%s => %f %s' % (sentence, best_rslt[0], best_rslt[1]) 74 | return best_rslt 75 | 76 | def parse(self, sentence): 77 | frq, rslt = self.split(sentence) 78 | for word, tp in rslt: yield word 79 | 80 | def parsetp(self, sentence): 81 | return self.split(sentence)[1] 82 | -------------------------------------------------------------------------------- /segment/fisher.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # -*- coding: utf-8 -*- 3 | ''' 4 | @date: 2011-11-15 5 | @author: shell.xu 6 | ''' 7 | import os, sys, math, marshal 8 | import unicodedata 9 | from dbase import dbase, readfile_cd 10 | 11 | def fixinit(mid): 12 | def getfunc(func): 13 | return lambda *p: (func(*p) + mid) / 2 14 | return getfunc 15 | 16 | class Fisher(dbase): 17 | FORBIDDEN = set(['Pd', 'Nd']) 18 | 19 | def save(self, fo): marshal.dump(self.freq, fo) 20 | def load(self, fi): self.freq = marshal.load(fi) 21 | 22 | def close(self): self.freq = {} 23 | 24 | def set_segment(self, segfunc): 25 | self.segfunc = segfunc 26 | 27 | def add(self, terms, cls): 28 | for t in terms: 29 | frq = self.freq.setdefault(t, {}) 30 | if cls not in frq: frq[cls] = 0 31 | frq[cls] += 1 32 | 33 | def remove(self, terms, cls): 34 | for t in terms: self.freq[t][cls] -= 1 35 | 36 | @staticmethod 37 | def invchi2(chi, df): 38 | m = chi / 2.0 39 | s = term = math.exp(-m) 40 | for i in xrange(1, df//2): 41 | term *= m / i 42 | s += term 43 | return min(s, 1) 44 | 45 | @fixinit(0.5) 46 | def cprob(self, t, cls): 47 | if t not in self.freq: return 0 48 | frq = self.freq[t] 49 | return float(frq.get(cls, 0)) / sum(frq.values()) 50 | 51 | def prob(self, terms, cls): 52 | fs = sum(map(lambda t: math.log(self.cprob(t, cls)), 53 | terms)) 54 | return self.invchi2(-2 * fs, len(terms) * 2) 55 | 56 | def proc_text(self, data): 57 | r = set() 58 | fdbfunc = lambda x: unicodedata.category(x) in self.FORBIDDEN 59 | for t in self.segfunc(data): 60 | if len(t) < 2: continue 61 | if any(map(fdbfunc, t)): continue 62 | r.add(t) 63 | return r 64 | 65 | def train(self, data, cls): self.add(self.proc_text(data), cls) 66 | 67 | def trainfile(self, filepath, cls): 68 | print 'process', cls, filepath 69 | return self.train(readfile_cd(filepath), cls) 70 | 71 | def classify(self, data, clses): 72 | terms = self.proc_text(data) 73 | for cls in clses: 74 | print '\t', cls, self.prob(terms, cls) 75 | 76 | def classifyfile(self, filepath, clses): 77 | print 'classify', filepath 78 | return self.classify(readfile_cd(filepath), clses) 79 | 
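# Usage sketch (mirrors the wiring in the ps_cls script; the dictionary and
# sample paths below are assumptions, not files shipped with the package):
#
#     import segment
#     seg = segment.get_cutter('frq.db')          # segmentation dictionary
#     segment.Fisher().savefile('classify.db')    # create an empty classifier db
#     fisher = segment.Fisher('classify.db')
#     fisher.set_segment(seg.parse)               # Fisher tokenizes through the cutter
#     fisher.trainfile('ham/0001.txt', 'ham')
#     fisher.trainfile('spam/0001.txt', 'spam')
#     fisher.sync()                               # write the updated counts back
#     fisher.classifyfile('mail.txt', ['ham', 'spam'])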
-------------------------------------------------------------------------------- /segment/train.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # -*- coding: utf-8 -*- 3 | ''' 4 | @date: 2011-11-17 5 | @author: shell.xu 6 | ''' 7 | import os, sys 8 | import dyn, dictdb 9 | 10 | class StatCutter(dyn.DynamicCutter): 11 | 12 | def __init__(self, db): 13 | super(StatCutter, self).__init__(db) 14 | self.wordfrq = {} 15 | 16 | def parse(self, sentence): 17 | frq, rslt = self.split(sentence) 18 | for word, tp in rslt: 19 | if tp != 0: 20 | if word not in self.wordfrq: self.wordfrq[word] = 0 21 | self.wordfrq[word] += 1 22 | yield word 23 | 24 | def train(self, sync = False): 25 | for k, v in self.wordfrq.iteritems(): self.db.add(k, v) 26 | self.db.normalize() 27 | if sync: self.db.sync() 28 | 29 | def join(self, stat): 30 | for k, v in stat.wordfrq.iteritems(): 31 | self.wordfrq[k] = self.wordfrq.get(k, 0) + v 32 | return self 33 | 34 | class NewCutter(dyn.DynamicCutter): 35 | 36 | def __init__(self, db): 37 | super(StatCutter, self).__init__(db) 38 | self.wordfrq, self.hifrqs = {}, db.hifrqs(40) 39 | 40 | def parse(self, word): 41 | frq, rslt = self.split(sentence) 42 | for word, tp in rslt: 43 | if tp == 0 and len(word) >= 2: 44 | sp = word.strip(self.hifrqs) 45 | if len(sp) >= 2: 46 | if sp not in self.wordfrq: self.wordfrq[sp] = 0 47 | self.wordfrq[sp] += 1 48 | yield word 49 | 50 | def get_highfrq(self): 51 | r = sorted(self.wordfrq.items(), key = lambda x: x[1], reverse = True) 52 | avg = int(float(sum(map(lambda x:x[1], r))) / len(r)) + 1 53 | return filter(lambda x:x[1]>avg, r) 54 | 55 | def join(self, new): 56 | for k, v in new.wordfrq.iteritems(): 57 | self.wordfrq[k] = self.wordfrq.get(k, 0) + v 58 | return self 59 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # -*- coding: utf-8 -*- 3 | ''' 4 | @date: 2010-10-02 5 | @author: shell.xu 6 | ''' 7 | from distutils.core import setup 8 | 9 | setup(name = 'segment', version = '1.0', url = 'http://shell909090.com/', 10 | author = 'Shell.E.Xu', author_email = 'shell909090@gmail.com', 11 | description = 'segmentation library written by python', 12 | packages = ['segment',], scripts=['ps_cutter', 'ps_dbmgr', 'ps_cls']) 13 | -------------------------------------------------------------------------------- /test.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python 2 | # -*- coding: utf-8 -*- 3 | ''' 4 | @date: 2010-10-01 5 | @author: shell.xu 6 | ''' 7 | import unittest 8 | import segment 9 | 10 | db = segment.dictdb('frq.db') 11 | 12 | class utCutter(unittest.TestCase): 13 | 14 | def setUp(self): 15 | self.c = segment.StringCutter(None) 16 | 17 | def test_english_words(self): 18 | self.assertEqual(list(self.c.parse(u"BigBigPig")), 19 | [u"Big",u"Big",u"Pig"]) 20 | 21 | def test_not_english_words(self): 22 | self.assertEqual(list(self.c.parse(u"LOLI")), 23 | [u"LOLI"]) 24 | 25 | def test_english_number(self): 26 | self.assertEqual(list(self.c.parse(u"abc123")), 27 | [u"abc", u"123"]) 28 | 29 | def test_chinese_number(self): 30 | self.assertEqual(list(self.c.parse(u"我们123")), 31 | [u"我们", u"123"]) 32 | 33 | def test_chinese_english(self): 34 | self.assertEqual(list(self.c.parse(u"我们abc")), 35 | [u"我们", u"abc"]) 36 | 37 | def test_mix(self): 38 | 
self.assertEqual(list(self.c.parse(u"我们123abc")), 39 | [u"我们", u"123", u"abc"]) 40 | 41 | def test_number(self): 42 | self.assertEqual(list(self.c.parse(u"abc123,456")), 43 | [u"abc",u"123,456"]) 44 | 45 | def test_split(self): 46 | self.assertEqual(list(self.c.parse(u"abc,def")), 47 | [u"abc", u",", u'def']) 48 | self.assertEqual(list(self.c.parse(u"xyz:ddd")), 49 | [u"xyz", u":", u'ddd']) 50 | 51 | def test_english_punct(self): 52 | self.assertEqual(list(self.c.parse(u"we're half-ready")), 53 | [u"we're", u" ", u"half-ready"]) 54 | 55 | def test_not_number(self): 56 | rslt = [u"we", u"‘", u"re", u" ", u"half", u"——", u"ready"] 57 | self.assertEqual(list(self.c.parse(u"we‘re half——ready")), rslt) 58 | 59 | class utDynamic(unittest.TestCase): 60 | def setUp(self): 61 | self.c = segment.DynamicCutter(db) 62 | 63 | def test1(self): 64 | self.assertEqual(list(self.c.parse(u'有机会见面')), 65 | [u'有', u'机会', u'见面']) 66 | 67 | def test2(self): 68 | self.assertEqual(list(self.c.parse(u'长春市长春节致辞')), 69 | [u'长春', u'市长', u'春节', u'致辞']) 70 | 71 | def test3(self): 72 | self.assertEqual(list(self.c.parse(u'长春市长春药店')), 73 | [u'长春市', u'长春', u'药店']) 74 | 75 | def test4(self): 76 | self.assertEqual(list(self.c.parse(u'有意见分歧')), 77 | [u'有', u'意见分歧']) 78 | 79 | def test5(self): 80 | rslt = [u'吹毛求疵', u'和鱼鹰是', u'两个', 81 | u'有', u'魔力', u'的', u'单词'] 82 | self.assertEqual(list(self.c.parse(u'吹毛求疵和鱼鹰是两个有魔力的单词')), rslt) 83 | 84 | def test6(self): 85 | self.assertEqual(list(self.c.parse(u'王强大小')), 86 | [u'王', u'强大', u'小']) 87 | 88 | def test7(self): 89 | self.assertEqual(list(self.c.parse(u'毛泽东北京华烟云')), 90 | [u'毛泽东', u'北京', u'华', u'烟云']) 91 | 92 | def test8(self): 93 | self.assertEqual(list(self.c.parse(u'魔王育成计划')), 94 | [u'魔王', u'育成', u'计划']) 95 | 96 | def test9(self): 97 | self.assertEqual(list(self.c.parse(u'遥远古古巴比伦')), 98 | [u'遥远', u'古', u'古', u'巴比伦']) 99 | 100 | if __name__ == '__main__': unittest.main() 101 | --------------------------------------------------------------------------------
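
example.py and test.py only exercise the cutter; doc/train.md describes frequency training through the ps_cutter commands. A minimal sketch of the equivalent library calls, assuming frq.db and a small corpus/ directory of UTF-8 text exist (both are placeholders):

    import segment

    db = segment.dictdb('frq.db')         # marshal dictionary built by ps_dbmgr create
    stat = segment.StatCutter(db)         # counts each dictionary word the cutter picks
    cut = segment.StringCutter(stat)      # splits off punctuation, numbers and ASCII words,
                                          # passing only the Chinese runs to the statistical cutter

    for filepath in ('corpus/a.txt', 'corpus/b.txt'):
        list(cut.parsefile(filepath))     # parsefile returns a generator; drain it to count

    stat.train(True)                      # fold the counts into the dict and sync to frq.db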