├── setup.cfg
├── LICENSE
├── README.md
├── .gitignore
├── setup.py
├── wordiscovery.py
└── docs
    └── wordiscovery.ipynb

/setup.cfg:
--------------------------------------------------------------------------------
1 | [bdist_wheel]
2 | universal=1
3 | 
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | The MIT License (MIT)
2 | 
3 | Copyright (c) 2017 @flykun.com
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy of
6 | this software and associated documentation files (the "Software"), to deal in
7 | the Software without restriction, including without limitation the rights to
8 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
9 | the Software, and to permit persons to whom the Software is furnished to do so,
10 | subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
17 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
18 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
19 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
20 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
21 | 
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | 
2 | # Discovering Chinese New Words with Word Frequency, Mutual Information and Information Entropy
3 | 
4 | **New word discovery** is an important step in Chinese natural language processing. A **new word** is only "new" relative to something "old": it is a relative notion, and new words exist relative to a domain (finance, medicine) and to a time (past, present). [Text mining](https://zh.wikipedia.org/wiki/文本挖掘) starts by [segmenting](https://zh.wikipedia.org/wiki/中文自动分词) the text into words, and because general-purpose segmenters are not accurate enough, a **custom dictionary** is usually added to close the gap. Discovering new words and adding them to that dictionary is therefore an important part of text mining.
5 | 
6 | Wikipedia defines a [**word**](https://zh.wikipedia.org/wiki/單詞) as follows:
7 | >In linguistics, a **word** (in Chinese also 词, 词语 or 单字) is the smallest unit that can be used on its own and that carries semantic or pragmatic content, i.e. a literal or practical meaning. A collection of words is called a vocabulary or a terminology; for example, all Chinese words together form the "Chinese vocabulary", and the specialized words of medicine form "medical terminology". A dictionary is a reference book that gives the pronunciation, definitions, example sentences and usage of words; some dictionaries only cover the vocabulary of a particular field.
8 | 
9 | From a purely semantic point of view, the French for "apple" is "pomme", while the French for "potato" is "pomme de terre". Under the definition above, "potato" would be torn apart, even though "pomme de terre" is the smallest unit that expresses the meaning "potato". The problem is even more frequent in institution names: "Paris 3" is the short name of 巴黎第三大学 (the University of Paris 3); if "Paris" and "3" are read separately as a place name and a number, the two pieces no longer express that meaning. Chinese has similar cases: in "北京大学" (Peking University), "北京" and "大学" can each be used as a minimal unit, meaning "Beijing" and "university", but segmented that way the name reads as "a university in Beijing", so "北京大学" is itself a minimal unit of meaning. A few years ago there was a film called《夏洛特烦恼》: should the title be read as "夏洛特 烦恼" or as "夏洛 特 烦恼"? That is a classic segmentation problem.
10 | 
11 | From a pragmatic point of view these problems seem solvable: we know that in everyday life "pomme de terre" means "potato" rather than "apple in the ground", anyone who has studied in Paris knows what "Paris 3" refers to, and "北京大学" refers to that famous university, just as viewers of the film have no trouble reading its title as "夏洛 特 烦恼".
12 | 
13 | As for how to discover new words, the article [互联网时代的社会语言学:基于SNS的文本数据挖掘](http://www.matrix67.com/blog/archives/5044) computes, for every **text fragment**, an internal **cohesion** score and a **freedom** score measuring how independently the fragment is used, and classifies fragments into words and non-words by thresholding the two. The original article explains cohesion and freedom with very accessible examples, so only the computation is given here. The method still leaves plenty of room for optimization and can be tuned gradually in practice.
14 | 
15 | ## Requirements
16 | 
17 | python >= 3.5
18 | 
19 | ## Installation
20 | 
21 | ```bash
22 | python setup.py install
23 | ```
24 | 
25 | ## Usage
26 | 
27 | ```python
28 | import wordiscovery as wd
29 | 
30 | text = """
31 | 新词发现任务是中文自然语言处理的重要步骤。新词有“新”就有“旧”,属于一个相对个概念,在相对的领域(金融、医疗),在相对的时间(过去、现在)都存在新词。文本挖掘会先将文本分词,而通用分词器精度不过,通常需要添加自定义字典补足精度,所以发现新词并加入字典,成为文本挖掘的一个重要工作。
32 | """
33 | 
34 | f = wd.WordDiscovery()
35 | 
36 | # Default thresholds; tune them freely to fit your text:
37 | # minimum information entropy 0.001
38 | # minimum mutual information 4
39 | # minimum word frequency 2
40 | f.parse(text)  # f.parse(text, 0.001, 4, 2)
41 | # {'分词': (2, 5.18271944179699, 0.6931471805599453),
42 | # '字典': (2, 6.2813317304651, 0.6931471805599453),
43 | # '文本': (3, 4.895037369345209, 0.6365141682948128),
44 | # '文本挖掘': (2, 5.588184549905154, 0.6931471805599453),
45 | # '新词': (4, 4.371789225580661, 1.0397207708399179),
46 | # '相对': (3, 4.3842117455792184, 0.6365141682948128),
47 | # '精度': (2, 6.2813317304651, 0.6931471805599453),
48 | # '通常': (2, 5.18271944179699, 0.6931471805599453),
49 | # '重要': (2, 5.028568761969732, 0.6931471805599453),
50 | # '需要': (2, 5.028568761969732, 0.6931471805599453),
51 | # '领域': (2, 6.2813317304651, 0.6931471805599453)}
52 | ```
53 | 
54 | ## Details
55 | 
56 | [wordiscovery explained](https://github.com/Ushiao/wordiscovery/blob/master/docs/wordiscovery.ipynb)
57 | 
58 | 
59 | 
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | *.pyc
2 | .DS_Store
3 | .idea/
4 | .ipynb_checkpoints
5 | instance/
6 | 
7 | # Eclipse
8 | 
9 | *.pydevproject
10 | .project
11 | .metadata
12 | bin/
13 | tmp/
14 | *.tmp
15 | *.bak
16 | *.swp
17 | *~.nib
18 | local.properties
19 | .classpath
20 | .settings/
21 | .loadpath
22 | 
23 | # External tool builders
24 | .externalToolBuilders/
25 | 
26 | # Locally stored "Eclipse launch configurations"
27 | *.launch
28 | 
29 | # CDT-specific
30 | .cproject
31 | 
32 | # PDT-specific
33 | .buildpath
34 | 
35 | 
36 | # Visual Studio
37 | 
38 | ## Ignore Visual Studio temporary files, build results, and
39 | ## files generated by popular Visual Studio add-ons.
40 | 
41 | # User-specific files
42 | *.suo
43 | *.user
44 | *.sln.docstates
45 | 
46 | # Build results
47 | [Dd]ebug/
48 | [Rr]elease/
49 | *_i.c
50 | *_p.c
51 | *.ilk
52 | *.meta
53 | *.obj
54 | *.pch
55 | *.pdb
56 | *.pgc
57 | *.pgd
58 | *.rsp
59 | *.sbr
60 | *.tlb
61 | *.tli
62 | *.tlh
63 | *.tmp
64 | *.vspscc
65 | .builds
66 | *.dotCover
67 | 
68 | ## TODO: If you have NuGet Package Restore enabled, uncomment this
69 | #packages/
70 | 
71 | # Visual C++ cache files
72 | ipch/
73 | *.aps
74 | *.ncb
75 | *.opensdf
76 | *.sdf
77 | 
78 | # Visual Studio profiler
79 | *.psess
80 | *.vsp
81 | 
82 | # ReSharper is a .NET coding add-in
83 | _ReSharper*
84 | 
85 | # Installshield output folder
86 | [Ee]xpress
87 | 
88 | # DocProject is a documentation generator add-in
89 | DocProject/buildhelp/
90 | DocProject/Help/*.HxT
91 | DocProject/Help/*.HxC
92 | DocProject/Help/*.hhc
93 | DocProject/Help/*.hhk
94 | DocProject/Help/*.hhp
95 | DocProject/Help/Html2
96 | DocProject/Help/html
97 | 
98 | # Click-Once directory
99 | publish
100 | 
101 | # Others
102 | [Bb]in
103 | [Oo]bj
104 | sql
105 | TestResults
106 | *.Cache
107 | ClientBin
108 | stylecop.*
109 | ~$*
110 | *.dbmdl
111 | Generated_Code #added for RIA/Silverlight projects
112 | 
113 | # Backup & report files from converting an old project file to a newer
114 | # Visual Studio version.
Backup files are not needed, because we have git ;-) 115 | _UpgradeReport_Files/ 116 | Backup*/ 117 | UpgradeLog*.XML 118 | ############ 119 | ## pycharm 120 | ############ 121 | .idea 122 | 123 | ############ 124 | ## Windows 125 | ############ 126 | 127 | # Windows image file caches 128 | Thumbs.db 129 | 130 | # Folder config file 131 | Desktop.ini 132 | 133 | 134 | ############# 135 | ## Python 136 | ############# 137 | 138 | *.py[co] 139 | 140 | # Packages 141 | *.egg 142 | *.egg-info 143 | dist 144 | build 145 | eggs 146 | parts 147 | bin 148 | var 149 | sdist 150 | develop-eggs 151 | .installed.cfg 152 | 153 | # Installer logs 154 | pip-log.txt 155 | 156 | # Unit test / coverage reports 157 | .coverage 158 | .tox 159 | 160 | #Translations 161 | *.mo 162 | 163 | #Mr Developer 164 | .mr.developer.cfg 165 | 166 | # Mac crap 167 | .DS_Store 168 | *.log 169 | test/tmp/* 170 | 171 | #jython 172 | *.class 173 | 174 | MANIFEST 175 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | """A setuptools based setup module. 2 | See: 3 | https://packaging.python.org/en/latest/distributing.html 4 | https://github.com/pypa/sampleproject 5 | """ 6 | 7 | # Always prefer setuptools over distutils 8 | from setuptools import setup, find_packages 9 | from os import path 10 | 11 | here = path.abspath(path.dirname(__file__)) 12 | 13 | # Get the long description from the README file 14 | #with open(path.join(here, 'README.md')) as f: 15 | #long_description = f.read() 16 | long_description = """ 17 | 词频、互信息、信息熵发现中文新词 18 | ================================ 19 | 20 | **新词发现**\ 任务是中文自然语言处理的重要步骤。\ **新词**\ 有“新”就有“旧”,属于一个相对个概念,在相对的领域(金融、医疗),在相对的时间(过去、现在)都存在新词。\ `文本挖掘 `__\ 会先将文本\ `分词 `__\ ,而通用分词器精度不过,通常需要添加\ **自定义字典**\ 补足精度,所以发现新词并加入字典,成为文本挖掘的一个重要工作。 21 | 22 | `单词 `__\ 的定义,来自维基百科的定义如下: 23 | 24 | 在语言学中,\ **单词**\ (又称为词、词语、单字;英语对应用语为“word”)是能独立运用并含有语义内容或语用内容(即具有表面含义或实际含义)的最小单位。单词的集合称为词汇、术语,例如:所有中文单词统称为“中文词汇”,医学上专用的词统称为“医学术语”等。词典是为词语提供音韵、词义解释、例句、用法等等的工具书,有的词典只修录特殊领域的词汇。 25 | 26 | 单从语义角度,“苹果“的法语是”pomme”,而“土豆”的法语是“pomme de 27 | terre”,若按上面的定义,“土豆”是要被拆的面目全非,但“pomme de 28 | terre”是却是表达“土豆”这个语义的最小单位;在机构名中,这类问题出现的更频繁,“Paris 29 | 3”是“巴黎第三大学”的简称,如果“Paris”和“3”分别表示地名和数字,那这两个就无法表达“巴黎第三大学”的语义。而中文也有类似的例子,“北京大学”的”北京“和”大学“都可以作为一个最小单位来使用,分别表示”地方名“和“大学”,如果这样分词,那么就可以理解为“北京的大学”了,所以“北京大学”是一个表达语义的最小单位。前几年有部电影《夏洛特烦恼》,我们是要理解为“夏洛特 30 | 烦恼“还是”夏洛 特 烦恼“,这就是很经典的分词问题。 31 | 32 | 但是从语用角度,这些问题似乎能被解决,我们知道“pomme de 33 | terre”在日常生活中一般作为“土豆”而不是“土里的苹果”,在巴黎学习都知道“Paris 34 | 3”,就像我们提到“北京大学”特指那所著名的高等学府一样。看过电影《夏洛特烦恼》的观众很容易的就能区分这个标题应该看为“夏洛 35 | 特 烦恼”。 36 | 37 | 发现新词的方法,《\ `互联网时代的社会语言学:基于SNS的文本数据挖掘 `__ 38 | 》一文,里面提到的给每一个文本串计算\ **文本片段**\ 的\ **凝固程度**\ 和文本串对外的使用\ **自由度**\ ,通过设定阈值来将文本串分类为词和非词两类。原文给了十分通俗易懂的例子来解释凝固度和自动度。这里放上计算方法。这个方法还有许多地方需要优化,在之后的实践中慢慢调整了。 39 | 40 | 环境 41 | ---- 42 | 43 | :: 44 | 45 | python >= 3.5 46 | 47 | 安装 48 | ---- 49 | 50 | .. code:: bash 51 | 52 | python setup.py install 53 | 54 | 使用说明 55 | -------- 56 | 57 | .. 
code:: python 58 | 59 | import wordiscovery as wd 60 | 61 | text = "新词发现任务是中文自然语言处理的重要步骤。 62 | 新词有新就有旧,属于一个相对个概念,在相对的领域(金融、医疗), 63 | 在相对的时间(过去、现在)都存在新词。文本挖掘会先将文本分词, 64 | 而通用分词器精度不过,通常需要添加自定义字典补足精度, 65 | 所以发现新词并加入字典,成为文本挖掘的一个重要工作。 66 | " 67 | 68 | f = wd.Wordiscovery() 69 | 70 | # 解析过程默认参数, 根据文本自由调节这几个阈值 71 | # 最小信息熵0.01 72 | # 最小互信息4 73 | # 最小词频2 74 | f.parse(text) # f.parse(text, 0.01, 4, 2) 75 | # {'分词': (2, 5.18271944179699, 0.6931471805599453), 76 | # '字典': (2, 6.2813317304651, 0.6931471805599453), 77 | # '文本': (3, 4.895037369345209, 0.6365141682948128), 78 | # '文本挖掘': (2, 5.588184549905154, 0.6931471805599453), 79 | # '新词': (4, 4.371789225580661, 1.0397207708399179), 80 | # '相对': (3, 4.3842117455792184, 0.6365141682948128), 81 | # '精度': (2, 6.2813317304651, 0.6931471805599453), 82 | # '通常': (2, 5.18271944179699, 0.6931471805599453), 83 | # '重要': (2, 5.028568761969732, 0.6931471805599453), 84 | # '需要': (2, 5.028568761969732, 0.6931471805599453), 85 | # '领域': (2, 6.2813317304651, 0.6931471805599453)} 86 | 87 | 详细说明 88 | -------- 89 | 90 | `wordicovery解释 `__ 91 | """ 92 | 93 | setup( 94 | name='wordiscovery', 95 | 96 | # Versions should comply with PEP440. For a discussion on single-sourcing 97 | # the version across setup.py and the project code, see 98 | # https://packaging.python.org/en/latest/single_source_version.html 99 | version='0.1.4.6', 100 | 101 | description='A Chinese new word discovery', 102 | long_description=long_description, 103 | 104 | # The project's main homepage. 105 | url='https://github.com/ushiao/wordiscovery', 106 | 107 | # Author details 108 | author='Kun JIN', 109 | author_email='jin.kun@flykun.com', 110 | 111 | # Choose your license 112 | license='MIT', 113 | 114 | # See https://pypi.python.org/pypi?%4Aaction=list_classifiers 115 | classifiers=[ 116 | # How mature is this project? Common values are 117 | # 3 - Alpha 118 | # 4 - Beta 119 | # 5 - Production/Stable 120 | 'Development Status :: 4 - Beta', 121 | 122 | # Indicate who your project is intended for 123 | 'Intended Audience :: Developers', 124 | 'Topic :: Software Development :: Build Tools', 125 | 126 | # Pick your license as you wish (should match "license" above) 127 | 'License :: OSI Approved :: MIT License', 128 | 129 | # Specify the Python versions you support here. In particular, ensure 130 | # that you indicate whether you support Python 2, Python 3 or both. 131 | # 'Programming Language :: Python :: 2', 132 | # 'Programming Language :: Python :: 2.7', 133 | 'Programming Language :: Python :: 3', 134 | 'Programming Language :: Python :: 3.3', 135 | 'Programming Language :: Python :: 3.4', 136 | 'Programming Language :: Python :: 3.5', 137 | ], 138 | 139 | # What does your project relate to? 140 | keywords='NLP, new word discorvery', 141 | 142 | # You can just specify the packages manually here if your project is 143 | # simple. Or you can use find_packages(). 144 | #packages=find_packages(exclude=['contrib', 'docs', 'tests']), 145 | #packages=["wordiscovery"], 146 | py_modules=["wordiscovery"], 147 | 148 | # List run-time dependencies here. These will be installed by pip when 149 | # your project is installed. For an analysis of "install_requires" vs pip's 150 | # requirements files see: 151 | # https://packaging.python.org/en/latest/requirements.html 152 | #install_requires=[ 153 | # 'six==1.11.0'], 154 | 155 | # List additional groups of dependencies here (e.g. development 156 | # dependencies). 
You can install these using the following syntax,
157 |     # for example:
158 |     # $ pip install -e .[dev,test]
159 |     # extras_require={
160 |     #     'dev': ['check-manifest'],
161 |     #     'test': ['coverage'],
162 |     # },
163 | 
164 |     # If there are data files included in your packages that need to be
165 |     # installed, specify them here.  If using Python 2.6 or less, then these
166 |     # have to be included in MANIFEST.in as well.
167 |     # package_data={
168 |     #     'tagword': ['*.*',
169 |     #                 'tokenizer/*',
170 |     #                 'tokenizer/models/*',
171 |     #                 'tokenizer/data/*',
172 |     #                 ],
173 |     #},
174 | 
175 |     # Although 'package_data' is the preferred approach, in some case you may
176 |     # need to place data files outside of your packages. See:
177 |     # http://docs.python.org/3.4/distutils/setupscript.html#installing-additional-files  # noqa
178 |     # In this case, 'data_file' will be installed into '/my_data'
179 |     # data_files=[('my_data', ['data/data_file'])],
180 | 
181 |     # To provide executable scripts, use entry points in preference to the
182 |     # "scripts" keyword. Entry points provide cross-platform support and allow
183 |     # pip to create the appropriate form of executable for the target platform.
184 |     # entry_points={
185 |     #     'console_scripts': [
186 |     #         '=sample:main',
187 |     #     ],
188 |     # },
189 | )
190 | 
--------------------------------------------------------------------------------
/wordiscovery.py:
--------------------------------------------------------------------------------
1 | # coding: utf-8
2 | 
3 | import math
4 | 
5 | 
6 | class TrieNode(object):
7 |     """A node of the trie. Each node keeps a reference to its parent, its own
8 |     frequency ("frequence"), its children, and the summed frequency of its
9 |     children; these counts drive the entropy and mutual-information calculations.
10 |     """
11 | 
12 |     def __init__(self,
13 |                  frequence=0,
14 |                  children_frequence=0,
15 |                  parent=None):
16 |         self.parent = parent
17 |         self.frequence = frequence
18 |         self.children = {}
19 |         self.children_frequence = children_frequence
20 | 
21 |     def insert(self, char):
22 |         self.children_frequence += 1
23 |         self.children[char] = self.children.get(char, TrieNode(parent=self))
24 |         self.children[char].frequence += 1
25 |         return self.children[char]
26 | 
27 |     def fetch(self, char):
28 |         return self.children[char]
29 | 
30 | 
31 | class TrieTree(object):
32 |     def __init__(self, size=6):
33 |         self._root = TrieNode()
34 |         self.size = size
35 | 
36 |     def get_root(self):
37 |         return self._root
38 | 
39 |     def insert(self, chunk):
40 |         node = self._root
41 |         for char in chunk:
42 |             node = node.insert(char)
43 |         if len(chunk) < self.size:
44 |             # add an "EOS" symbol at the end of a chunk shorter than the tree size
45 |             node.insert("EOS")
46 | 
47 |     def fetch(self, chunk):
48 |         node = self._root
49 |         for char in chunk:
50 |             node = node.fetch(char)
51 |         return node
52 | 
53 | # In[153]:
54 | 
55 | 
56 | class WordDiscovery(object):
57 |     def __init__(self, ngram_size=6):
58 |         self.puncs = ['【','】',')','(','、',',','“','”',
59 |                       '。','《','》',' ','-','!','?','.',
60 |                       '\'','[',']',':','/','.','"','\u3000',
61 |                       '’','.',',','…','?',';','·','%','(',
62 |                       '#',')',';','>','<','$', ' ', ' ','\ufeff']
63 | 
64 |         self.fw_ngram = TrieTree(ngram_size)
65 |         self.bw_ngram = TrieTree(ngram_size)
66 |         self.ngram_size = ngram_size
67 | 
68 |     def preparse(self, text):
69 | 
70 |         # replace punctuation with "\n"
71 |         for punc in self.puncs:
72 |             text = text.replace(punc, "\n")
73 | 
74 |         # TODO: split alphabetic strings and numbers out of the Chinese text
75 |         #regex_num_alpha = re.compile()
76 |         #text = re.sub(r"([a-zA-Z0-9]+)", r"\n\1", text, flags=re.M)
77 | 
78 |         chunks, bchunks = [], []
79 | 
80 |         # split the text into lines
81 |         for line in text.strip().split("\n"):
82 |             line = line.strip()
83 |             bline = line[::-1]
84 |             for start in range(len(line)):
85 |                 # insert every ngram window into the forward and backward tries
86 |                 end = start + self.ngram_size
87 |                 chunk = line[start:end]
88 |                 bchunk = bline[start:end]
89 |                 self.fw_ngram.insert(chunk)
90 |                 self.bw_ngram.insert(bchunk)
91 | 
92 |                 if len(chunk) == self.ngram_size:  # a candidate is at most ngram_size - 1 chars
93 |                     chunk = chunk[:-1]
94 | 
95 |                 while len(chunk) > 1:
96 |                     chunks.append(chunk)
97 |                     bchunks.append(chunk[::-1])
98 |                     chunk = chunk[:-1]
99 | 
100 |         return chunks, bchunks
101 | 
102 |     def calc_entropy(self, chunks, ngram):
103 | 
104 |         def entropy(sample, total):
105 |             """Entropy"""
106 |             s = float(sample)
107 |             t = float(total)
108 |             result = - s/t * math.log(s/t)
109 |             return result
110 | 
111 |         def parse(chunk, ngram):
112 |             node = ngram.fetch(chunk)
113 |             total = node.children_frequence
114 |             return sum([entropy(sub_node.frequence,
115 |                                 total) for sub_node in node.children.values()])
116 | 
117 |         word2entropy = {}
118 |         for chunk in chunks:
119 |             word2entropy[chunk] = parse(chunk, ngram)
120 |         return word2entropy
121 | 
122 |     def calc_mutualinfo(self, chunks, ngram):
123 |         """Mutual Information
124 |         log(p(x,y)/(p(x)*p(y))) = log(p(y|x)/p(y))"""
125 | 
126 |         def parse(chunk, root):
127 |             sub_node_y_x = ngram.fetch(chunk)
128 |             node = sub_node_y_x.parent
129 |             sub_node_y = root.children[chunk[-1]]
130 | 
131 |             prob_y_x = float(sub_node_y_x.frequence) / node.children_frequence
132 |             prob_y = float(sub_node_y.frequence) / root.children_frequence
133 |             mutualinfo = math.log(prob_y_x / prob_y)
134 |             return mutualinfo, sub_node_y_x.frequence
135 | 
136 |         word2mutualinfo = {}
137 |         root = ngram.get_root()
138 |         for chunk in chunks:
139 |             word2mutualinfo[chunk] = parse(chunk, root)
140 |         return word2mutualinfo
141 | 
142 |     def parse(self,
143 |               text,
144 |               entropy_threshold=0.001,
145 |               mutualinfo_threshold=4,
146 |               freq_threshold=2):
147 |         chunks, bchunks = self.preparse(text)
148 |         return self._fetch_final(chunks,
149 |                                  bchunks,
150 |                                  entropy_threshold,
151 |                                  mutualinfo_threshold,
152 |                                  freq_threshold
153 |                                  )
154 | 
155 |     def _fetch_final(self,
156 |                      chunks,
157 |                      bchunks,
158 |                      entropy_threshold=0.001,
159 |                      mutualinfo_threshold=4,
160 |                      freq_threshold=2):
161 |         fw_entropy = self.calc_entropy(chunks, self.fw_ngram)
162 |         bw_entropy = self.calc_entropy(bchunks, self.bw_ngram)
163 |         fw_mi = self.calc_mutualinfo(chunks, self.fw_ngram)
164 |         bw_mi = self.calc_mutualinfo(bchunks, self.bw_ngram)
165 | 
166 |         final = {}
167 |         for k, v in fw_entropy.items():
168 |             if k[::-1] in bw_mi and k in fw_mi:
169 |                 mi_min = min(fw_mi[k][0], bw_mi[k[::-1]][0])
170 |                 word_prob = min(fw_mi[k][1], bw_mi[k[::-1]][1])
171 |                 if mi_min < mutualinfo_threshold:
172 |                     continue
173 |             else:
174 |                 continue
175 |             if word_prob < freq_threshold:
176 |                 continue
177 |             if k[::-1] in bw_entropy:
178 |                 en_min = min(v, bw_entropy[k[::-1]])
179 |                 if en_min < entropy_threshold:
180 |                     continue
181 |             else:
182 |                 continue
183 |             final[k] = (word_prob, mi_min, en_min)
184 |         return final
185 | 
186 | # In[155]:
187 | 
188 | 
189 | def main(filename):
190 |     with open(filename, "r", encoding="utf-8") as inf:
191 |         text = inf.read()
192 |     f = WordDiscovery(6)
193 |     word_info = f.parse(text,
194 |                         entropy_threshold=0.001,
195 |                         mutualinfo_threshold=4,
196 |                         freq_threshold=3)
197 |     for k, v in sorted(word_info.items(),
198 |                        key=lambda x: x[1][0],
199 |                        reverse=False):
200 |         print("%+9s\t%-5d\t%.4f\t%.4f" % (k, v[0], v[1], v[2]))
201 | 
202 | 
203 | if __name__ == "__main__":
204 | 
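    # Demo entry point. "data/shijiuda.txt" appears to be the author's sample
    # corpus (十九大, the opening speech of the 19th National Congress, the text
    # behind the results shown in docs/wordiscovery.ipynb); point main() at any
    # UTF-8 text file of your own to run the discovery on another corpus. The
    # thresholds handed to parse() are the minimum boundary entropy, the minimum
    # pointwise mutual information and the minimum frequency that a candidate
    # word must reach to be kept in the result.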
import os 205 | main(os.path.join("data", 206 | "shijiuda.txt") 207 | ) 208 | 209 | 210 | -------------------------------------------------------------------------------- /docs/wordiscovery.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 词频、互信息、信息熵发现中文新词" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "**新词发现**任务是中文自然语言处理的重要步骤。**新词**有“新”就有“旧”,属于一个相对个概念,在相对的领域(金融、医疗),在相对的时间(过去、现在)都存在新词。[文本挖掘](https://zh.wikipedia.org/wiki/文本挖掘)会先将文本[分词](https://zh.wikipedia.org/wiki/中文自动分词),而通用分词器精度不过,通常需要添加**自定义字典**补足精度,所以发现新词并加入字典,成为文本挖掘的一个重要工作。" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "[**单词**](https://zh.wikipedia.org/wiki/單詞)的定义,来自维基百科的定义如下:\n", 22 | ">在语言学中,**单词**(又称为词、词语、单字;英语对应用语为“word”)是能独立运用并含有语义内容或语用内容(即具有表面含义或实际含义)的最小单位。单词的集合称为词汇、术语,例如:所有中文单词统称为“中文词汇”,医学上专用的词统称为“医学术语”等。词典是为词语提供音韵、词义解释、例句、用法等等的工具书,有的词典只修录特殊领域的词汇。\n", 23 | "\n", 24 | "单从语义角度,“苹果“的法语是\"pomme\",而“土豆”的法语是\"pomme de terre\",若按上面的定义,“土豆”是要被拆的面目全非,但\"pomme de terre\"是却是表达“土豆”这个语义的最小单位;在机构名中,这类问题出现的更频繁,\"Paris 3\"是\"巴黎第三大学\"的简称,如果\"Paris\"和\"3\"分别表示地名和数字,那这两个就无法表达“巴黎第三大学”的语义。而中文也有类似的例子,“北京大学”的”北京“和”大学“都可以作为一个最小单位来使用,分别表示”地方名“和“大学”,如果这样分词,那么就可以理解为“北京的大学”了,所以“北京大学”是一个表达语义的最小单位。前几年有部电影《夏洛特烦恼》,我们是要理解为“夏洛特 烦恼“还是”夏洛 特 烦恼“,这就是很经典的分词问题。\n", 25 | "\n", 26 | "但是从语用角度,这些问题似乎能被解决,我们知道\"pomme de terre\"在日常生活中一般作为“土豆”而不是“土里的苹果”,在巴黎学习都知道“Paris 3”,就像我们提到“北京大学”特指那所著名的高等学府一样。看过电影《夏洛特烦恼》的观众很容易的就能区分这个标题应该看为“夏洛 特 烦恼”。" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "发现新词的方法,《[互联网时代的社会语言学:基于SNS的文本数据挖掘](http://www.matrix67.com/blog/archives/5044]) 》一文,里面提到的给每一个文本串计算**文本片段**的**凝固程度**和文本串对外的使用**自由度**,通过设定阈值来将文本串分类为词和非词两类。原文给了十分通俗易懂的例子来解释凝固度和自动度。这里放上计算方法。这个方法还有许多地方需要优化,在之后的实践中慢慢调整了。" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "## 文本片段" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "**文本片段**,最常用的方法就是[n元语法(ngram)](https://zh.wikipedia.org/wiki/N元语法),将分本分成多个n长度的文本片段。数据结构,这里采用Trie树的方案,这个方案是简单容易实现,而且用Python的字典做Hash索引实现起来也很优美,唯独的一个问题是所有的数据都存在内存中,这会使得内存占用量非常大,如果要把这个工程化使用,还需要采用其他方案,比如硬盘检索。\n", 48 | "" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "class TrieNode(object):\n", 60 | " def __init__(self, \n", 61 | " frequence=0, \n", 62 | " children_frequence=0, \n", 63 | " parent=None):\n", 64 | "\n", 65 | " self.parent = parent\n", 66 | " self.frequence = frequence\n", 67 | " self.children = {} \n", 68 | " self.children_frequence = children_frequence\n", 69 | "\n", 70 | " def insert(self, char):\n", 71 | " self.children_frequence += 1\n", 72 | " self.children[char] = self.children.get(char, TrieNode(parent=self))\n", 73 | " self.children[char].frequence += 1\n", 74 | " return self.children[char]\n", 75 | " \n", 76 | " def fetch(self, char):\n", 77 | " return self.children[char]\n", 78 | " \n", 79 | "class TrieTree(object):\n", 80 | " def __init__(self, size=6):\n", 81 | " self._root = TrieNode()\n", 82 | " self.size = size\n", 83 | " \n", 84 | " def get_root(self):\n", 85 | " return self._root\n", 86 | " \n", 87 | " def insert(self, chunk):\n", 88 | " node = self._root\n", 89 | " for char in chunk:\n", 90 | " node = node.insert(char)\n", 91 | " if len(chunk) < self.size:\n", 92 | " # add symbol 
\"EOS\" at end of line trunck\n", 93 | " node.insert(\"EOS\")\n", 94 | "\n", 95 | " def fetch(self, chunk):\n", 96 | " node = self._root\n", 97 | " for char in chunk:\n", 98 | " node = node.fetch(char)\n", 99 | " return node" 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "metadata": {}, 105 | "source": [ 106 | "Trie树的结构上,我添加了几个参数,parent,frequence,children_frequence,他们分别是:\n", 107 | "- parent,当前节点的父节点,如果是“树根”的时候,这个父节点为空;\n", 108 | "- frequence,当前节点出现的频次,在Trie树上,也可以表示某个文本片段的频次,比如\"中国\",“国”这个节点的frequence是100的时候,“中国”俩字也出现了100次。这个可以作为最后的词频过滤用。\n", 109 | "- children_frequence,当前接点下有子节点的\"frequence\"的总和。比如在刚才的例子上加上“中间”出现了99次,那么“中”这个节点的children_frequence的值是199次。\n", 110 | "这样的构造让第二部分的计算更加方面。\n", 111 | "\n", 112 | "这个任务中需要构建两棵Trie树,表示正向和反向两个字符片段集。" 113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": {}, 118 | "source": [ 119 | "## 自由度" 120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "metadata": {}, 125 | "source": [ 126 | "**自由度**,使用信息论中的[信息熵](https://zh.wikipedia.org/wiki/熵_(信息论))构建文本片段左右熵,公式[1]。熵越大,表示该片段和左右邻字符相互关系的不稳定性越高,那么越有可能作为独立的片段使用。公式[1]第一个等号后面的I(x)表示x的自信息。\n", 127 | "\\begin{align} \n", 128 | " H(X) = \\sum_{i} {\\mathrm{P}(x_i)\\,\\mathrm{I}(x_i)} = -\\sum_{i} {\\mathrm{P}(x_i) \\log \\mathrm{P}(x_i)} [1]\n", 129 | "\\end{align} " 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": null, 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [ 138 | "def calc_entropy(chunks, ngram):\n", 139 | " \"\"\"计算信息熵\n", 140 | " Args:\n", 141 | " chunks,是所有数据的文本片段\n", 142 | " ngram,是Trie树\n", 143 | " Return:\n", 144 | " word2entropy,返回一个包含每个chunk和对应信息熵的字典。\n", 145 | " \"\"\"\n", 146 | " def entropy(sample, total):\n", 147 | " \"\"\"Entropy\"\"\"\n", 148 | " s = float(sample)\n", 149 | " t = float(total)\n", 150 | " result = - s/t * math.log(s/t)\n", 151 | " return result\n", 152 | "\n", 153 | " def parse(chunk, ngram):\n", 154 | " node = ngram.fetch(chunk)\n", 155 | " total = node.children_frequence\n", 156 | " return sum([entropy(sub_node.frequence, \n", 157 | " total) for sub_node in node.children.values()])\n", 158 | "\n", 159 | " word2entropy = {}\n", 160 | " for chunk in chunks:\n", 161 | " word2entropy[chunk] = parse(chunk, ngram) \n", 162 | " return word2entropy" 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "metadata": {}, 168 | "source": [ 169 | "## 凝固度" 170 | ] 171 | }, 172 | { 173 | "cell_type": "markdown", 174 | "metadata": {}, 175 | "source": [ 176 | "**凝固度**,用信息论中的**互信息**表示,公式[2]。在概率论中,如果x跟y不相关,则p(x,y)=p(x)p(y)。二者相关性越大,则p(x,y)就相比于p(x)p(y)越大。用后面的式子可能更好理解,在y出现的情况下x出现的条件概率p(x|y)除以x本身出现的概率p(x),自然就表示x跟y的相关程度。 \n", 177 | "\\begin{align} \n", 178 | "I(x;y) = \\log\\frac{p(x,y)}{p(x)p(y)} = \\log\\frac{p(x|y)}{p(x)} = \\log\\frac{p(y|x)}{p(y)} [2]\n", 179 | "\\end{align}\n", 180 | "\n", 181 | "这里比较容易产生一个概念的混淆,维基百科将式[2]定义为[点互信息](https://en.wikipedia.org/wiki/Pointwise_mutual_information),[互信息](https://zh.wikipedia.org/wiki/互信息)的定义如下:\n", 182 | "\\begin{align} \n", 183 | "I(X;Y) = \\sum_{y \\in Y} \\sum_{x \\in X} \n", 184 | " p(x,y) \\log{ \\left(\\frac{p(x,y)}{p(x)\\,p(y)}\n", 185 | " \\right) }\\ [3]\n", 186 | "\\end{align}\n", 187 | "在傅祖芸编著的《信息论——基础理论与应用(第4版)》的绪论中,把式[2]定义为互信息,而式[3]定义为平均互信息,就像信息熵指的是**平均自信息**。" 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": null, 193 | "metadata": {}, 194 | "outputs": [], 195 | "source": [ 196 | "def calc_mutualinfo(chunks, ngram):\n", 197 | " \"\"\"计算互信息\n", 198 | " Args:\n", 199 | " chunks,是所有数据的文本片段\n", 200 | " ngram,是Trie树\n", 201 | 
" Return:\n", 202 | " word2mutualinfo,返回一个包含每个chunk和对应互信息的字典。\n", 203 | " \"\"\"\n", 204 | " def parse(chunk, root):\n", 205 | " sub_node_y_x = ngram.fetch(chunk)\n", 206 | " node = sub_node_y_x.parent\n", 207 | " sub_node_y = root.children[chunk[-1]]\n", 208 | "\n", 209 | " # 这里采用互信息log(p(y|x)/p(y))的计算方法\n", 210 | " prob_y_x = float(sub_node_y_x.frequence) / node.children_frequence\n", 211 | " prob_y = float(sub_node_y.frequence) / root.children_frequence\n", 212 | " mutualinfo = math.log(prob_y_x / prob_y)\n", 213 | " return mutualinfo, sub_node_y_x.frequence\n", 214 | "\n", 215 | " word2mutualinfo = {} \n", 216 | " root = ngram.get_root()\n", 217 | " for chunk in chunks:\n", 218 | " word2mutualinfo[chunk] = parse(chunk, root)\n", 219 | " return word2mutualinfo" 220 | ] 221 | }, 222 | { 223 | "cell_type": "markdown", 224 | "metadata": {}, 225 | "source": [ 226 | "## 过滤" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "metadata": {}, 232 | "source": [ 233 | "最终计算得出互信息、信息熵,甚至也统计了词频,最后一步就是根据阈值对词进行过滤。" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": null, 239 | "metadata": {}, 240 | "outputs": [], 241 | "source": [ 242 | "def _fetch_final(fw_entropy,\n", 243 | " bw_entropy,\n", 244 | " fw_mi,\n", 245 | " bw_mi\n", 246 | " entropy_threshold=0.8,\n", 247 | " mutualinfo_threshold=7,\n", 248 | " freq_threshold=10):\n", 249 | " final = {}\n", 250 | " for k, v in fw_entropy.items():\n", 251 | " last_node = self.fw_ngram\n", 252 | " if k[::-1] in bw_mi and k in fw_mi:\n", 253 | " mi_min = min(fw_mi[k][0], bw_mi[k[::-1]][0])\n", 254 | " word_prob = min(fw_mi[k][1], bw_mi[k[::-1]][1])\n", 255 | " if mi_min < mutualinfo_threshold:\n", 256 | " continue\n", 257 | " else:\n", 258 | " continue\n", 259 | " if word_prob < freq_threshold:\n", 260 | " continue\n", 261 | " if k[::-1] in bw_entropy:\n", 262 | " en_min = min(v, bw_entropy[k[::-1]])\n", 263 | " if en_min < entropy_threshold:\n", 264 | " continue\n", 265 | " else:\n", 266 | " continue\n", 267 | " final[k] = (word_prob, mi_min, en_min)\n", 268 | " return final" 269 | ] 270 | }, 271 | { 272 | "cell_type": "markdown", 273 | "metadata": {}, 274 | "source": [ 275 | "## 结果" 276 | ] 277 | }, 278 | { 279 | "cell_type": "markdown", 280 | "metadata": {}, 281 | "source": [ 282 | "最终,通过这个方法对这次十九大的开幕发言做的一个词汇发现,ngram的n=10,结果按词频排序输出,可以发现这次十九大谈了许多内容,不一一说了。这个结果还存在不少问题,比如“二〇”,这在阈值的设置上还不够准确,可以尝试使用机器学习的方法来获取阈值。" 283 | ] 284 | }, 285 | { 286 | "cell_type": "markdown", 287 | "metadata": {}, 288 | "source": [ 289 | "经济|70 改革|69 我们|64 必须|61 领导|60 完善|57 历史|44 不断|43 群众|43 教育|43 战略|42 思想|40 世界|39 问题|37 提高|37 组织|36 监督|35 加快|35 依法|34 精神|33 团结|33 复兴|32 保障|31 奋斗|30 根本|29 环境|29 军队|29 开放|27 服务|27 理论|26 干部|26 创造|26 基础|25 意识|25 维护|25 协商|24 解决|24 贯彻|23 斗争|23 目标|21 统筹|20 始终|19 方式|19 水平|19 科学|19 利益|19 市场|19 基层|19 积极|18 马克思|18 反对|18 道路|18 自然|18 增长|17 科技|17 稳定|17 原则|17 两岸|17 取得|16 质量|16 农村|16 矛盾|16 协调|15 巩固|15 收入|15 绿色|15 自觉|15 方针|15 纪律|15 长期|15 保证|15 同胞|15 命运|14 美好生活|14 五年|14 传统|14 繁荣|14 没有|14 使命|13 广泛|13 日益|13 价值|13 健康|13 资源|13 参与|13 突出|13 腐败|13 充分|13 梦想|13 任何|13 二〇|13 代表|12 阶段|12 深刻|12 布局|12 区域|12 贸易|12 核心|12 城乡|12 生态文明|12 工程|12 任务|12 地区|12 责任|12 认识|12 胜利|11 贡献|11 覆盖|11 生态环境|11 具有|11 面临|11 各种|11 培育|11 企业|11 继续|10 团结带领|10 提升|10 明显|10 弘扬|10 脱贫|10 贫困|10 标准|10 注重|10 基本实现|10 培养|10 青年|10" 290 | ] 291 | }, 292 | { 293 | "cell_type": "markdown", 294 | "metadata": {}, 295 | "source": [ 296 | "## 代码下载地址" 297 | ] 298 | }, 299 | { 300 | "cell_type": "markdown", 301 | "metadata": {}, 302 | "source": [ 303 | "git clone 
https://github.com/Ushiao/new-word-discovery.git" 304 | ] 305 | } 306 | ], 307 | "metadata": { 308 | "kernelspec": { 309 | "display_name": "Python 3", 310 | "language": "python", 311 | "name": "python3" 312 | }, 313 | "language_info": { 314 | "codemirror_mode": { 315 | "name": "ipython", 316 | "version": 3 317 | }, 318 | "file_extension": ".py", 319 | "mimetype": "text/x-python", 320 | "name": "python", 321 | "nbconvert_exporter": "python", 322 | "pygments_lexer": "ipython3", 323 | "version": "3.6.3" 324 | } 325 | }, 326 | "nbformat": 4, 327 | "nbformat_minor": 2 328 | } 329 | --------------------------------------------------------------------------------