├── setup.cfg
├── LICENSE
├── README.md
├── .gitignore
├── setup.py
├── wordiscovery.py
└── docs
    └── wordiscovery.ipynb

/setup.cfg:
--------------------------------------------------------------------------------
1 | [bdist_wheel]
2 | universal=1
3 | 
--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
1 | The MIT License (MIT)
2 | 
3 | Copyright (c) 2017 @flykun.com
4 | 
5 | Permission is hereby granted, free of charge, to any person obtaining a copy of
6 | this software and associated documentation files (the "Software"), to deal in
7 | the Software without restriction, including without limitation the rights to
8 | use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
9 | the Software, and to permit persons to whom the Software is furnished to do so,
10 | subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
17 | FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
18 | COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
19 | IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
20 | CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
21 | 
22 | 
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | 
2 | # Discovering Chinese New Words with Word Frequency, Mutual Information and Information Entropy
3 | 
4 | **New word discovery** is an important step in Chinese natural language processing. A **new word** is only "new" relative to something "old": it is a relative notion, and new words exist relative to a domain (finance, medicine) and to a time (past, present). [Text mining](https://zh.wikipedia.org/wiki/文本挖掘) starts by [segmenting](https://zh.wikipedia.org/wiki/中文自动分词) the text into words, and because general-purpose segmenters are not accurate enough, a **custom dictionary** is usually added to close the gap. Discovering new words and adding them to that dictionary is therefore an important part of text mining.
5 | 
6 | Wikipedia defines a [**word**](https://zh.wikipedia.org/wiki/單詞) as follows:
7 | >In linguistics, a **word** (in Chinese also 词, 词语 or 单字) is the smallest unit that can be used on its own and that carries semantic or pragmatic content, i.e. a literal or practical meaning. A collection of words is called a vocabulary or a terminology; for example, all Chinese words together form the "Chinese vocabulary", and the specialized words of medicine form "medical terminology". A dictionary is a reference book that gives the pronunciation, definitions, example sentences and usage of words; some dictionaries only cover the vocabulary of a particular field.
8 | 
9 | From a purely semantic point of view, the French for "apple" is "pomme", while the French for "potato" is "pomme de terre". Under the definition above, "potato" would be torn apart, even though "pomme de terre" is the smallest unit that expresses the meaning "potato". The problem is even more frequent in institution names: "Paris 3" is the short name of 巴黎第三大学 (the University of Paris 3); if "Paris" and "3" are read separately as a place name and a number, the two pieces no longer express that meaning. Chinese has similar cases: in "北京大学" (Peking University), "北京" and "大学" can each be used as a minimal unit, meaning "Beijing" and "university", but segmented that way the name reads as "a university in Beijing", so "北京大学" is itself a minimal unit of meaning. A few years ago there was a film called《夏洛特烦恼》: should the title be read as "夏洛特 烦恼" or as "夏洛 特 烦恼"? That is a classic segmentation problem.
10 | 
11 | From a pragmatic point of view these problems seem solvable: we know that in everyday life "pomme de terre" means "potato" rather than "apple in the ground", anyone who has studied in Paris knows what "Paris 3" refers to, and "北京大学" refers to that famous university, just as viewers of the film have no trouble reading its title as "夏洛 特 烦恼".
12 | 
13 | As for how to discover new words, the article [互联网时代的社会语言学:基于SNS的文本数据挖掘](http://www.matrix67.com/blog/archives/5044) computes, for every **text fragment**, an internal **cohesion** score and a **freedom** score measuring how independently the fragment is used, and classifies fragments into words and non-words by thresholding the two. The original article explains cohesion and freedom with very accessible examples, so only the computation is given here. The method still leaves plenty of room for optimization and can be tuned gradually in practice.
14 | 
15 | ## Requirements
16 | 
17 | python >= 3.5
18 | 
19 | ## Installation
20 | 
21 | ```bash
22 | python setup.py install
23 | ```
24 | 
25 | ## Usage
26 | 
27 | ```python
28 | import wordiscovery as wd
29 | 
30 | text = """
31 | 新词发现任务是中文自然语言处理的重要步骤。新词有“新”就有“旧”,属于一个相对个概念,在相对的领域(金融、医疗),在相对的时间(过去、现在)都存在新词。文本挖掘会先将文本分词,而通用分词器精度不过,通常需要添加自定义字典补足精度,所以发现新词并加入字典,成为文本挖掘的一个重要工作。
32 | """
33 | 
34 | f = wd.WordDiscovery()
35 | 
36 | # Default thresholds; tune them freely to fit your text:
37 | # minimum information entropy 0.001
38 | # minimum mutual information 4
39 | # minimum word frequency 2
40 | f.parse(text)  # f.parse(text, 0.001, 4, 2)
41 | # {'分词': (2, 5.18271944179699, 0.6931471805599453),
42 | # '字典': (2, 6.2813317304651, 0.6931471805599453),
43 | # '文本': (3, 4.895037369345209, 0.6365141682948128),
44 | # '文本挖掘': (2, 5.588184549905154, 0.6931471805599453),
45 | # '新词': (4, 4.371789225580661, 1.0397207708399179),
46 | # '相对': (3, 4.3842117455792184, 0.6365141682948128),
47 | # '精度': (2, 6.2813317304651, 0.6931471805599453),
48 | # '通常': (2, 5.18271944179699, 0.6931471805599453),
49 | # '重要': (2, 5.028568761969732, 0.6931471805599453),
50 | # '需要': (2, 5.028568761969732, 0.6931471805599453),
51 | # '领域': (2, 6.2813317304651, 0.6931471805599453)}
52 | ```
53 | 
54 | ## Details
55 | 
56 | [wordiscovery explained](https://github.com/Ushiao/wordiscovery/blob/master/docs/wordiscovery.ipynb)
57 | 
58 | 
59 | 
--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
1 | *.pyc
2 | .DS_Store
3 | .idea/
4 | .ipynb_checkpoints
5 | instance/
6 | 
7 | # Eclipse
8 | 
9 | *.pydevproject
10 | .project
11 | .metadata
12 | bin/
13 | tmp/
14 | *.tmp
15 | *.bak
16 | *.swp
17 | *~.nib
18 | local.properties
19 | .classpath
20 | .settings/
21 | .loadpath
22 | 
23 | # External tool builders
24 | .externalToolBuilders/
25 | 
26 | # Locally stored "Eclipse launch configurations"
27 | *.launch
28 | 
29 | # CDT-specific
30 | .cproject
31 | 
32 | # PDT-specific
33 | .buildpath
34 | 
35 | 
36 | # Visual Studio
37 | 
38 | ## Ignore Visual Studio temporary files, build results, and
39 | ## files generated by popular Visual Studio add-ons.
40 | 
41 | # User-specific files
42 | *.suo
43 | *.user
44 | *.sln.docstates
45 | 
46 | # Build results
47 | [Dd]ebug/
48 | [Rr]elease/
49 | *_i.c
50 | *_p.c
51 | *.ilk
52 | *.meta
53 | *.obj
54 | *.pch
55 | *.pdb
56 | *.pgc
57 | *.pgd
58 | *.rsp
59 | *.sbr
60 | *.tlb
61 | *.tli
62 | *.tlh
63 | *.tmp
64 | *.vspscc
65 | .builds
66 | *.dotCover
67 | 
68 | ## TODO: If you have NuGet Package Restore enabled, uncomment this
69 | #packages/
70 | 
71 | # Visual C++ cache files
72 | ipch/
73 | *.aps
74 | *.ncb
75 | *.opensdf
76 | *.sdf
77 | 
78 | # Visual Studio profiler
79 | *.psess
80 | *.vsp
81 | 
82 | # ReSharper is a .NET coding add-in
83 | _ReSharper*
84 | 
85 | # Installshield output folder
86 | [Ee]xpress
87 | 
88 | # DocProject is a documentation generator add-in
89 | DocProject/buildhelp/
90 | DocProject/Help/*.HxT
91 | DocProject/Help/*.HxC
92 | DocProject/Help/*.hhc
93 | DocProject/Help/*.hhk
94 | DocProject/Help/*.hhp
95 | DocProject/Help/Html2
96 | DocProject/Help/html
97 | 
98 | # Click-Once directory
99 | publish
100 | 
101 | # Others
102 | [Bb]in
103 | [Oo]bj
104 | sql
105 | TestResults
106 | *.Cache
107 | ClientBin
108 | stylecop.*
109 | ~$*
110 | *.dbmdl
111 | Generated_Code #added for RIA/Silverlight projects
112 | 
113 | # Backup & report files from converting an old project file to a newer
114 | # Visual Studio version.
Backup files are not needed, because we have git ;-) 115 | _UpgradeReport_Files/ 116 | Backup*/ 117 | UpgradeLog*.XML 118 | ############ 119 | ## pycharm 120 | ############ 121 | .idea 122 | 123 | ############ 124 | ## Windows 125 | ############ 126 | 127 | # Windows image file caches 128 | Thumbs.db 129 | 130 | # Folder config file 131 | Desktop.ini 132 | 133 | 134 | ############# 135 | ## Python 136 | ############# 137 | 138 | *.py[co] 139 | 140 | # Packages 141 | *.egg 142 | *.egg-info 143 | dist 144 | build 145 | eggs 146 | parts 147 | bin 148 | var 149 | sdist 150 | develop-eggs 151 | .installed.cfg 152 | 153 | # Installer logs 154 | pip-log.txt 155 | 156 | # Unit test / coverage reports 157 | .coverage 158 | .tox 159 | 160 | #Translations 161 | *.mo 162 | 163 | #Mr Developer 164 | .mr.developer.cfg 165 | 166 | # Mac crap 167 | .DS_Store 168 | *.log 169 | test/tmp/* 170 | 171 | #jython 172 | *.class 173 | 174 | MANIFEST 175 | -------------------------------------------------------------------------------- /setup.py: -------------------------------------------------------------------------------- 1 | """A setuptools based setup module. 2 | See: 3 | https://packaging.python.org/en/latest/distributing.html 4 | https://github.com/pypa/sampleproject 5 | """ 6 | 7 | # Always prefer setuptools over distutils 8 | from setuptools import setup, find_packages 9 | from os import path 10 | 11 | here = path.abspath(path.dirname(__file__)) 12 | 13 | # Get the long description from the README file 14 | #with open(path.join(here, 'README.md')) as f: 15 | #long_description = f.read() 16 | long_description = """ 17 | 词频、互信息、信息熵发现中文新词 18 | ================================ 19 | 20 | **新词发现**\ 任务是中文自然语言处理的重要步骤。\ **新词**\ 有“新”就有“旧”,属于一个相对个概念,在相对的领域(金融、医疗),在相对的时间(过去、现在)都存在新词。\ `文本挖掘 `__\ 会先将文本\ `分词 `__\ ,而通用分词器精度不过,通常需要添加\ **自定义字典**\ 补足精度,所以发现新词并加入字典,成为文本挖掘的一个重要工作。 21 | 22 | `单词 `__\ 的定义,来自维基百科的定义如下: 23 | 24 | 在语言学中,\ **单词**\ (又称为词、词语、单字;英语对应用语为“word”)是能独立运用并含有语义内容或语用内容(即具有表面含义或实际含义)的最小单位。单词的集合称为词汇、术语,例如:所有中文单词统称为“中文词汇”,医学上专用的词统称为“医学术语”等。词典是为词语提供音韵、词义解释、例句、用法等等的工具书,有的词典只修录特殊领域的词汇。 25 | 26 | 单从语义角度,“苹果“的法语是”pomme”,而“土豆”的法语是“pomme de 27 | terre”,若按上面的定义,“土豆”是要被拆的面目全非,但“pomme de 28 | terre”是却是表达“土豆”这个语义的最小单位;在机构名中,这类问题出现的更频繁,“Paris 29 | 3”是“巴黎第三大学”的简称,如果“Paris”和“3”分别表示地名和数字,那这两个就无法表达“巴黎第三大学”的语义。而中文也有类似的例子,“北京大学”的”北京“和”大学“都可以作为一个最小单位来使用,分别表示”地方名“和“大学”,如果这样分词,那么就可以理解为“北京的大学”了,所以“北京大学”是一个表达语义的最小单位。前几年有部电影《夏洛特烦恼》,我们是要理解为“夏洛特 30 | 烦恼“还是”夏洛 特 烦恼“,这就是很经典的分词问题。 31 | 32 | 但是从语用角度,这些问题似乎能被解决,我们知道“pomme de 33 | terre”在日常生活中一般作为“土豆”而不是“土里的苹果”,在巴黎学习都知道“Paris 34 | 3”,就像我们提到“北京大学”特指那所著名的高等学府一样。看过电影《夏洛特烦恼》的观众很容易的就能区分这个标题应该看为“夏洛 35 | 特 烦恼”。 36 | 37 | 发现新词的方法,《\ `互联网时代的社会语言学:基于SNS的文本数据挖掘 `__ 38 | 》一文,里面提到的给每一个文本串计算\ **文本片段**\ 的\ **凝固程度**\ 和文本串对外的使用\ **自由度**\ ,通过设定阈值来将文本串分类为词和非词两类。原文给了十分通俗易懂的例子来解释凝固度和自动度。这里放上计算方法。这个方法还有许多地方需要优化,在之后的实践中慢慢调整了。 39 | 40 | 环境 41 | ---- 42 | 43 | :: 44 | 45 | python >= 3.5 46 | 47 | 安装 48 | ---- 49 | 50 | .. code:: bash 51 | 52 | python setup.py install 53 | 54 | 使用说明 55 | -------- 56 | 57 | .. 
code:: python 58 | 59 | import wordiscovery as wd 60 | 61 | text = "新词发现任务是中文自然语言处理的重要步骤。 62 | 新词有新就有旧,属于一个相对个概念,在相对的领域(金融、医疗), 63 | 在相对的时间(过去、现在)都存在新词。文本挖掘会先将文本分词, 64 | 而通用分词器精度不过,通常需要添加自定义字典补足精度, 65 | 所以发现新词并加入字典,成为文本挖掘的一个重要工作。 66 | " 67 | 68 | f = wd.Wordiscovery() 69 | 70 | # 解析过程默认参数, 根据文本自由调节这几个阈值 71 | # 最小信息熵0.01 72 | # 最小互信息4 73 | # 最小词频2 74 | f.parse(text) # f.parse(text, 0.01, 4, 2) 75 | # {'分词': (2, 5.18271944179699, 0.6931471805599453), 76 | # '字典': (2, 6.2813317304651, 0.6931471805599453), 77 | # '文本': (3, 4.895037369345209, 0.6365141682948128), 78 | # '文本挖掘': (2, 5.588184549905154, 0.6931471805599453), 79 | # '新词': (4, 4.371789225580661, 1.0397207708399179), 80 | # '相对': (3, 4.3842117455792184, 0.6365141682948128), 81 | # '精度': (2, 6.2813317304651, 0.6931471805599453), 82 | # '通常': (2, 5.18271944179699, 0.6931471805599453), 83 | # '重要': (2, 5.028568761969732, 0.6931471805599453), 84 | # '需要': (2, 5.028568761969732, 0.6931471805599453), 85 | # '领域': (2, 6.2813317304651, 0.6931471805599453)} 86 | 87 | 详细说明 88 | -------- 89 | 90 | `wordicovery解释 `__ 91 | """ 92 | 93 | setup( 94 | name='wordiscovery', 95 | 96 | # Versions should comply with PEP440. For a discussion on single-sourcing 97 | # the version across setup.py and the project code, see 98 | # https://packaging.python.org/en/latest/single_source_version.html 99 | version='0.1.4.6', 100 | 101 | description='A Chinese new word discovery', 102 | long_description=long_description, 103 | 104 | # The project's main homepage. 105 | url='https://github.com/ushiao/wordiscovery', 106 | 107 | # Author details 108 | author='Kun JIN', 109 | author_email='jin.kun@flykun.com', 110 | 111 | # Choose your license 112 | license='MIT', 113 | 114 | # See https://pypi.python.org/pypi?%4Aaction=list_classifiers 115 | classifiers=[ 116 | # How mature is this project? Common values are 117 | # 3 - Alpha 118 | # 4 - Beta 119 | # 5 - Production/Stable 120 | 'Development Status :: 4 - Beta', 121 | 122 | # Indicate who your project is intended for 123 | 'Intended Audience :: Developers', 124 | 'Topic :: Software Development :: Build Tools', 125 | 126 | # Pick your license as you wish (should match "license" above) 127 | 'License :: OSI Approved :: MIT License', 128 | 129 | # Specify the Python versions you support here. In particular, ensure 130 | # that you indicate whether you support Python 2, Python 3 or both. 131 | # 'Programming Language :: Python :: 2', 132 | # 'Programming Language :: Python :: 2.7', 133 | 'Programming Language :: Python :: 3', 134 | 'Programming Language :: Python :: 3.3', 135 | 'Programming Language :: Python :: 3.4', 136 | 'Programming Language :: Python :: 3.5', 137 | ], 138 | 139 | # What does your project relate to? 140 | keywords='NLP, new word discorvery', 141 | 142 | # You can just specify the packages manually here if your project is 143 | # simple. Or you can use find_packages(). 144 | #packages=find_packages(exclude=['contrib', 'docs', 'tests']), 145 | #packages=["wordiscovery"], 146 | py_modules=["wordiscovery"], 147 | 148 | # List run-time dependencies here. These will be installed by pip when 149 | # your project is installed. For an analysis of "install_requires" vs pip's 150 | # requirements files see: 151 | # https://packaging.python.org/en/latest/requirements.html 152 | #install_requires=[ 153 | # 'six==1.11.0'], 154 | 155 | # List additional groups of dependencies here (e.g. development 156 | # dependencies). 
You can install these using the following syntax,
157 |     # for example:
158 |     # $ pip install -e .[dev,test]
159 |     # extras_require={
160 |     #     'dev': ['check-manifest'],
161 |     #     'test': ['coverage'],
162 |     # },
163 | 
164 |     # If there are data files included in your packages that need to be
165 |     # installed, specify them here.  If using Python 2.6 or less, then these
166 |     # have to be included in MANIFEST.in as well.
167 |     # package_data={
168 |     #     'tagword': ['*.*',
169 |     #                 'tokenizer/*',
170 |     #                 'tokenizer/models/*',
171 |     #                 'tokenizer/data/*',
172 |     #                 ],
173 |     #},
174 | 
175 |     # Although 'package_data' is the preferred approach, in some case you may
176 |     # need to place data files outside of your packages. See:
177 |     # http://docs.python.org/3.4/distutils/setupscript.html#installing-additional-files  # noqa
178 |     # In this case, 'data_file' will be installed into '/my_data'
179 |     # data_files=[('my_data', ['data/data_file'])],
180 | 
181 |     # To provide executable scripts, use entry points in preference to the
182 |     # "scripts" keyword. Entry points provide cross-platform support and allow
183 |     # pip to create the appropriate form of executable for the target platform.
184 |     # entry_points={
185 |     #     'console_scripts': [
186 |     #         '=sample:main',
187 |     #     ],
188 |     # },
189 | )
190 | 
--------------------------------------------------------------------------------
/wordiscovery.py:
--------------------------------------------------------------------------------
1 | # coding: utf-8
2 | 
3 | import math
4 | 
5 | 
6 | class TrieNode(object):
7 |     """A node of the trie. Each node keeps a reference to its parent, its own
8 |     frequency ("frequence"), its children, and the summed frequency of its
9 |     children; these counts drive the entropy and mutual-information calculations.
10 |     """
11 | 
12 |     def __init__(self,
13 |                  frequence=0,
14 |                  children_frequence=0,
15 |                  parent=None):
16 |         self.parent = parent
17 |         self.frequence = frequence
18 |         self.children = {}
19 |         self.children_frequence = children_frequence
20 | 
21 |     def insert(self, char):
22 |         self.children_frequence += 1
23 |         self.children[char] = self.children.get(char, TrieNode(parent=self))
24 |         self.children[char].frequence += 1
25 |         return self.children[char]
26 | 
27 |     def fetch(self, char):
28 |         return self.children[char]
29 | 
30 | 
31 | class TrieTree(object):
32 |     def __init__(self, size=6):
33 |         self._root = TrieNode()
34 |         self.size = size
35 | 
36 |     def get_root(self):
37 |         return self._root
38 | 
39 |     def insert(self, chunk):
40 |         node = self._root
41 |         for char in chunk:
42 |             node = node.insert(char)
43 |         if len(chunk) < self.size:
44 |             # add an "EOS" symbol at the end of a chunk shorter than the tree size
45 |             node.insert("EOS")
46 | 
47 |     def fetch(self, chunk):
48 |         node = self._root
49 |         for char in chunk:
50 |             node = node.fetch(char)
51 |         return node
52 | 
53 | # In[153]:
54 | 
55 | 
56 | class WordDiscovery(object):
57 |     def __init__(self, ngram_size=6):
58 |         self.puncs = ['【','】',')','(','、',',','“','”',
59 |                       '。','《','》',' ','-','!','?','.',
60 |                       '\'','[',']',':','/','.','"','\u3000',
61 |                       '’','.',',','…','?',';','·','%','(',
62 |                       '#',')',';','>','<','$', ' ', ' ','\ufeff']
63 | 
64 |         self.fw_ngram = TrieTree(ngram_size)
65 |         self.bw_ngram = TrieTree(ngram_size)
66 |         self.ngram_size = ngram_size
67 | 
68 |     def preparse(self, text):
69 | 
70 |         # replace punctuation with "\n"
71 |         for punc in self.puncs:
72 |             text = text.replace(punc, "\n")
73 | 
74 |         # TODO: split alphabetic strings and numbers out of the Chinese text
75 |         #regex_num_alpha = re.compile()
76 |         #text = re.sub(r"([a-zA-Z0-9]+)", r"\n\1", text, flags=re.M)
77 | 
78 |         chunks, bchunks = [], []
79 | 
80 |         # split the text into lines
81 |         for line in text.strip().split("\n"):
82 |             line = line.strip()
83 |             bline = line[::-1]
84 |             for start in range(len(line)):
85 |                 # insert every ngram window into the forward and backward tries
86 |                 end = start + self.ngram_size
87 |                 chunk = line[start:end]
88 |                 bchunk = bline[start:end]
89 |                 self.fw_ngram.insert(chunk)
90 |                 self.bw_ngram.insert(bchunk)
91 | 
92 |                 if len(chunk) == self.ngram_size:  # a candidate is at most ngram_size - 1 chars
93 |                     chunk = chunk[:-1]
94 | 
95 |                 while len(chunk) > 1:
96 |                     chunks.append(chunk)
97 |                     bchunks.append(chunk[::-1])
98 |                     chunk = chunk[:-1]
99 | 
100 |         return chunks, bchunks
101 | 
102 |     def calc_entropy(self, chunks, ngram):
103 | 
104 |         def entropy(sample, total):
105 |             """Entropy"""
106 |             s = float(sample)
107 |             t = float(total)
108 |             result = - s/t * math.log(s/t)
109 |             return result
110 | 
111 |         def parse(chunk, ngram):
112 |             node = ngram.fetch(chunk)
113 |             total = node.children_frequence
114 |             return sum([entropy(sub_node.frequence,
115 |                                 total) for sub_node in node.children.values()])
116 | 
117 |         word2entropy = {}
118 |         for chunk in chunks:
119 |             word2entropy[chunk] = parse(chunk, ngram)
120 |         return word2entropy
121 | 
122 |     def calc_mutualinfo(self, chunks, ngram):
123 |         """Mutual Information
124 |         log(p(x,y)/(p(x)*p(y))) = log(p(y|x)/p(y))"""
125 | 
126 |         def parse(chunk, root):
127 |             sub_node_y_x = ngram.fetch(chunk)
128 |             node = sub_node_y_x.parent
129 |             sub_node_y = root.children[chunk[-1]]
130 | 
131 |             prob_y_x = float(sub_node_y_x.frequence) / node.children_frequence
132 |             prob_y = float(sub_node_y.frequence) / root.children_frequence
133 |             mutualinfo = math.log(prob_y_x / prob_y)
134 |             return mutualinfo, sub_node_y_x.frequence
135 | 
136 |         word2mutualinfo = {}
137 |         root = ngram.get_root()
138 |         for chunk in chunks:
139 |             word2mutualinfo[chunk] = parse(chunk, root)
140 |         return word2mutualinfo
141 | 
142 |     def parse(self,
143 |               text,
144 |               entropy_threshold=0.001,
145 |               mutualinfo_threshold=4,
146 |               freq_threshold=2):
147 |         chunks, bchunks = self.preparse(text)
148 |         return self._fetch_final(chunks,
149 |                                  bchunks,
150 |                                  entropy_threshold,
151 |                                  mutualinfo_threshold,
152 |                                  freq_threshold
153 |                                  )
154 | 
155 |     def _fetch_final(self,
156 |                      chunks,
157 |                      bchunks,
158 |                      entropy_threshold=0.001,
159 |                      mutualinfo_threshold=4,
160 |                      freq_threshold=2):
161 |         fw_entropy = self.calc_entropy(chunks, self.fw_ngram)
162 |         bw_entropy = self.calc_entropy(bchunks, self.bw_ngram)
163 |         fw_mi = self.calc_mutualinfo(chunks, self.fw_ngram)
164 |         bw_mi = self.calc_mutualinfo(bchunks, self.bw_ngram)
165 | 
166 |         final = {}
167 |         for k, v in fw_entropy.items():
168 |             if k[::-1] in bw_mi and k in fw_mi:
169 |                 mi_min = min(fw_mi[k][0], bw_mi[k[::-1]][0])
170 |                 word_prob = min(fw_mi[k][1], bw_mi[k[::-1]][1])
171 |                 if mi_min < mutualinfo_threshold:
172 |                     continue
173 |             else:
174 |                 continue
175 |             if word_prob < freq_threshold:
176 |                 continue
177 |             if k[::-1] in bw_entropy:
178 |                 en_min = min(v, bw_entropy[k[::-1]])
179 |                 if en_min < entropy_threshold:
180 |                     continue
181 |             else:
182 |                 continue
183 |             final[k] = (word_prob, mi_min, en_min)
184 |         return final
185 | 
186 | # In[155]:
187 | 
188 | 
189 | def main(filename):
190 |     with open(filename, "r", encoding="utf-8") as inf:
191 |         text = inf.read()
192 |     f = WordDiscovery(6)
193 |     word_info = f.parse(text,
194 |                         entropy_threshold=0.001,
195 |                         mutualinfo_threshold=4,
196 |                         freq_threshold=3)
197 |     for k, v in sorted(word_info.items(),
198 |                        key=lambda x: x[1][0],
199 |                        reverse=False):
200 |         print("%+9s\t%-5d\t%.4f\t%.4f" % (k, v[0], v[1], v[2]))
201 | 
202 | 
203 | if __name__ == "__main__":
204 | 
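    # Demo entry point. "data/shijiuda.txt" appears to be the author's sample
    # corpus (十九大, the opening speech of the 19th National Congress, the text
    # behind the results shown in docs/wordiscovery.ipynb); point main() at any
    # UTF-8 text file of your own to run the discovery on another corpus. The
    # thresholds handed to parse() are the minimum boundary entropy, the minimum
    # pointwise mutual information and the minimum frequency that a candidate
    # word must reach to be kept in the result.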
import os 205 | main(os.path.join("data", 206 | "shijiuda.txt") 207 | ) 208 | 209 | 210 | -------------------------------------------------------------------------------- /docs/wordiscovery.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "# 词频、互信息、信息熵发现中文新词" 8 | ] 9 | }, 10 | { 11 | "cell_type": "markdown", 12 | "metadata": {}, 13 | "source": [ 14 | "**新词发现**任务是中文自然语言处理的重要步骤。**新词**有“新”就有“旧”,属于一个相对个概念,在相对的领域(金融、医疗),在相对的时间(过去、现在)都存在新词。[文本挖掘](https://zh.wikipedia.org/wiki/文本挖掘)会先将文本[分词](https://zh.wikipedia.org/wiki/中文自动分词),而通用分词器精度不过,通常需要添加**自定义字典**补足精度,所以发现新词并加入字典,成为文本挖掘的一个重要工作。" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "metadata": {}, 20 | "source": [ 21 | "[**单词**](https://zh.wikipedia.org/wiki/單詞)的定义,来自维基百科的定义如下:\n", 22 | ">在语言学中,**单词**(又称为词、词语、单字;英语对应用语为“word”)是能独立运用并含有语义内容或语用内容(即具有表面含义或实际含义)的最小单位。单词的集合称为词汇、术语,例如:所有中文单词统称为“中文词汇”,医学上专用的词统称为“医学术语”等。词典是为词语提供音韵、词义解释、例句、用法等等的工具书,有的词典只修录特殊领域的词汇。\n", 23 | "\n", 24 | "单从语义角度,“苹果“的法语是\"pomme\",而“土豆”的法语是\"pomme de terre\",若按上面的定义,“土豆”是要被拆的面目全非,但\"pomme de terre\"是却是表达“土豆”这个语义的最小单位;在机构名中,这类问题出现的更频繁,\"Paris 3\"是\"巴黎第三大学\"的简称,如果\"Paris\"和\"3\"分别表示地名和数字,那这两个就无法表达“巴黎第三大学”的语义。而中文也有类似的例子,“北京大学”的”北京“和”大学“都可以作为一个最小单位来使用,分别表示”地方名“和“大学”,如果这样分词,那么就可以理解为“北京的大学”了,所以“北京大学”是一个表达语义的最小单位。前几年有部电影《夏洛特烦恼》,我们是要理解为“夏洛特 烦恼“还是”夏洛 特 烦恼“,这就是很经典的分词问题。\n", 25 | "\n", 26 | "但是从语用角度,这些问题似乎能被解决,我们知道\"pomme de terre\"在日常生活中一般作为“土豆”而不是“土里的苹果”,在巴黎学习都知道“Paris 3”,就像我们提到“北京大学”特指那所著名的高等学府一样。看过电影《夏洛特烦恼》的观众很容易的就能区分这个标题应该看为“夏洛 特 烦恼”。" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "metadata": {}, 32 | "source": [ 33 | "发现新词的方法,《[互联网时代的社会语言学:基于SNS的文本数据挖掘](http://www.matrix67.com/blog/archives/5044]) 》一文,里面提到的给每一个文本串计算**文本片段**的**凝固程度**和文本串对外的使用**自由度**,通过设定阈值来将文本串分类为词和非词两类。原文给了十分通俗易懂的例子来解释凝固度和自动度。这里放上计算方法。这个方法还有许多地方需要优化,在之后的实践中慢慢调整了。" 34 | ] 35 | }, 36 | { 37 | "cell_type": "markdown", 38 | "metadata": {}, 39 | "source": [ 40 | "## 文本片段" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "metadata": {}, 46 | "source": [ 47 | "**文本片段**,最常用的方法就是[n元语法(ngram)](https://zh.wikipedia.org/wiki/N元语法),将分本分成多个n长度的文本片段。数据结构,这里采用Trie树的方案,这个方案是简单容易实现,而且用Python的字典做Hash索引实现起来也很优美,唯独的一个问题是所有的数据都存在内存中,这会使得内存占用量非常大,如果要把这个工程化使用,还需要采用其他方案,比如硬盘检索。\n", 48 | "" 51 | ] 52 | }, 53 | { 54 | "cell_type": "code", 55 | "execution_count": null, 56 | "metadata": {}, 57 | "outputs": [], 58 | "source": [ 59 | "class TrieNode(object):\n", 60 | " def __init__(self, \n", 61 | " frequence=0, \n", 62 | " children_frequence=0, \n", 63 | " parent=None):\n", 64 | "\n", 65 | " self.parent = parent\n", 66 | " self.frequence = frequence\n", 67 | " self.children = {} \n", 68 | " self.children_frequence = children_frequence\n", 69 | "\n", 70 | " def insert(self, char):\n", 71 | " self.children_frequence += 1\n", 72 | " self.children[char] = self.children.get(char, TrieNode(parent=self))\n", 73 | " self.children[char].frequence += 1\n", 74 | " return self.children[char]\n", 75 | " \n", 76 | " def fetch(self, char):\n", 77 | " return self.children[char]\n", 78 | " \n", 79 | "class TrieTree(object):\n", 80 | " def __init__(self, size=6):\n", 81 | " self._root = TrieNode()\n", 82 | " self.size = size\n", 83 | " \n", 84 | " def get_root(self):\n", 85 | " return self._root\n", 86 | " \n", 87 | " def insert(self, chunk):\n", 88 | " node = self._root\n", 89 | " for char in chunk:\n", 90 | " node = node.insert(char)\n", 91 | " if len(chunk) < self.size:\n", 92 | " # add symbol 
\"EOS\" at end of line trunck\n", 93 | " node.insert(\"EOS\")\n", 94 | "\n", 95 | " def fetch(self, chunk):\n", 96 | " node = self._root\n", 97 | " for char in chunk:\n", 98 | " node = node.fetch(char)\n", 99 | " return node" 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "metadata": {}, 105 | "source": [ 106 | "Trie树的结构上,我添加了几个参数,parent,frequence,children_frequence,他们分别是:\n", 107 | "- parent,当前节点的父节点,如果是“树根”的时候,这个父节点为空;\n", 108 | "- frequence,当前节点出现的频次,在Trie树上,也可以表示某个文本片段的频次,比如\"中国\",“国”这个节点的frequence是100的时候,“中国”俩字也出现了100次。这个可以作为最后的词频过滤用。\n", 109 | "- children_frequence,当前接点下有子节点的\"frequence\"的总和。比如在刚才的例子上加上“中间”出现了99次,那么“中”这个节点的children_frequence的值是199次。\n", 110 | "这样的构造让第二部分的计算更加方面。\n", 111 | "\n", 112 | "这个任务中需要构建两棵Trie树,表示正向和反向两个字符片段集。" 113 | ] 114 | }, 115 | { 116 | "cell_type": "markdown", 117 | "metadata": {}, 118 | "source": [ 119 | "## 自由度" 120 | ] 121 | }, 122 | { 123 | "cell_type": "markdown", 124 | "metadata": {}, 125 | "source": [ 126 | "**自由度**,使用信息论中的[信息熵](https://zh.wikipedia.org/wiki/熵_(信息论))构建文本片段左右熵,公式[1]。熵越大,表示该片段和左右邻字符相互关系的不稳定性越高,那么越有可能作为独立的片段使用。公式[1]第一个等号后面的I(x)表示x的自信息。\n", 127 | "\\begin{align} \n", 128 | " H(X) = \\sum_{i} {\\mathrm{P}(x_i)\\,\\mathrm{I}(x_i)} = -\\sum_{i} {\\mathrm{P}(x_i) \\log \\mathrm{P}(x_i)} [1]\n", 129 | "\\end{align} " 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": null, 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [ 138 | "def calc_entropy(chunks, ngram):\n", 139 | " \"\"\"计算信息熵\n", 140 | " Args:\n", 141 | " chunks,是所有数据的文本片段\n", 142 | " ngram,是Trie树\n", 143 | " Return:\n", 144 | " word2entropy,返回一个包含每个chunk和对应信息熵的字典。\n", 145 | " \"\"\"\n", 146 | " def entropy(sample, total):\n", 147 | " \"\"\"Entropy\"\"\"\n", 148 | " s = float(sample)\n", 149 | " t = float(total)\n", 150 | " result = - s/t * math.log(s/t)\n", 151 | " return result\n", 152 | "\n", 153 | " def parse(chunk, ngram):\n", 154 | " node = ngram.fetch(chunk)\n", 155 | " total = node.children_frequence\n", 156 | " return sum([entropy(sub_node.frequence, \n", 157 | " total) for sub_node in node.children.values()])\n", 158 | "\n", 159 | " word2entropy = {}\n", 160 | " for chunk in chunks:\n", 161 | " word2entropy[chunk] = parse(chunk, ngram) \n", 162 | " return word2entropy" 163 | ] 164 | }, 165 | { 166 | "cell_type": "markdown", 167 | "metadata": {}, 168 | "source": [ 169 | "## 凝固度" 170 | ] 171 | }, 172 | { 173 | "cell_type": "markdown", 174 | "metadata": {}, 175 | "source": [ 176 | "**凝固度**,用信息论中的**互信息**表示,公式[2]。在概率论中,如果x跟y不相关,则p(x,y)=p(x)p(y)。二者相关性越大,则p(x,y)就相比于p(x)p(y)越大。用后面的式子可能更好理解,在y出现的情况下x出现的条件概率p(x|y)除以x本身出现的概率p(x),自然就表示x跟y的相关程度。 \n", 177 | "\\begin{align} \n", 178 | "I(x;y) = \\log\\frac{p(x,y)}{p(x)p(y)} = \\log\\frac{p(x|y)}{p(x)} = \\log\\frac{p(y|x)}{p(y)} [2]\n", 179 | "\\end{align}\n", 180 | "\n", 181 | "这里比较容易产生一个概念的混淆,维基百科将式[2]定义为[点互信息](https://en.wikipedia.org/wiki/Pointwise_mutual_information),[互信息](https://zh.wikipedia.org/wiki/互信息)的定义如下:\n", 182 | "\\begin{align} \n", 183 | "I(X;Y) = \\sum_{y \\in Y} \\sum_{x \\in X} \n", 184 | " p(x,y) \\log{ \\left(\\frac{p(x,y)}{p(x)\\,p(y)}\n", 185 | " \\right) }\\ [3]\n", 186 | "\\end{align}\n", 187 | "在傅祖芸编著的《信息论——基础理论与应用(第4版)》的绪论中,把式[2]定义为互信息,而式[3]定义为平均互信息,就像信息熵指的是**平均自信息**。" 188 | ] 189 | }, 190 | { 191 | "cell_type": "code", 192 | "execution_count": null, 193 | "metadata": {}, 194 | "outputs": [], 195 | "source": [ 196 | "def calc_mutualinfo(chunks, ngram):\n", 197 | " \"\"\"计算互信息\n", 198 | " Args:\n", 199 | " chunks,是所有数据的文本片段\n", 200 | " ngram,是Trie树\n", 201 | 
" Return:\n", 202 | " word2mutualinfo,返回一个包含每个chunk和对应互信息的字典。\n", 203 | " \"\"\"\n", 204 | " def parse(chunk, root):\n", 205 | " sub_node_y_x = ngram.fetch(chunk)\n", 206 | " node = sub_node_y_x.parent\n", 207 | " sub_node_y = root.children[chunk[-1]]\n", 208 | "\n", 209 | " # 这里采用互信息log(p(y|x)/p(y))的计算方法\n", 210 | " prob_y_x = float(sub_node_y_x.frequence) / node.children_frequence\n", 211 | " prob_y = float(sub_node_y.frequence) / root.children_frequence\n", 212 | " mutualinfo = math.log(prob_y_x / prob_y)\n", 213 | " return mutualinfo, sub_node_y_x.frequence\n", 214 | "\n", 215 | " word2mutualinfo = {} \n", 216 | " root = ngram.get_root()\n", 217 | " for chunk in chunks:\n", 218 | " word2mutualinfo[chunk] = parse(chunk, root)\n", 219 | " return word2mutualinfo" 220 | ] 221 | }, 222 | { 223 | "cell_type": "markdown", 224 | "metadata": {}, 225 | "source": [ 226 | "## 过滤" 227 | ] 228 | }, 229 | { 230 | "cell_type": "markdown", 231 | "metadata": {}, 232 | "source": [ 233 | "最终计算得出互信息、信息熵,甚至也统计了词频,最后一步就是根据阈值对词进行过滤。" 234 | ] 235 | }, 236 | { 237 | "cell_type": "code", 238 | "execution_count": null, 239 | "metadata": {}, 240 | "outputs": [], 241 | "source": [ 242 | "def _fetch_final(fw_entropy,\n", 243 | " bw_entropy,\n", 244 | " fw_mi,\n", 245 | " bw_mi\n", 246 | " entropy_threshold=0.8,\n", 247 | " mutualinfo_threshold=7,\n", 248 | " freq_threshold=10):\n", 249 | " final = {}\n", 250 | " for k, v in fw_entropy.items():\n", 251 | " last_node = self.fw_ngram\n", 252 | " if k[::-1] in bw_mi and k in fw_mi:\n", 253 | " mi_min = min(fw_mi[k][0], bw_mi[k[::-1]][0])\n", 254 | " word_prob = min(fw_mi[k][1], bw_mi[k[::-1]][1])\n", 255 | " if mi_min < mutualinfo_threshold:\n", 256 | " continue\n", 257 | " else:\n", 258 | " continue\n", 259 | " if word_prob < freq_threshold:\n", 260 | " continue\n", 261 | " if k[::-1] in bw_entropy:\n", 262 | " en_min = min(v, bw_entropy[k[::-1]])\n", 263 | " if en_min < entropy_threshold:\n", 264 | " continue\n", 265 | " else:\n", 266 | " continue\n", 267 | " final[k] = (word_prob, mi_min, en_min)\n", 268 | " return final" 269 | ] 270 | }, 271 | { 272 | "cell_type": "markdown", 273 | "metadata": {}, 274 | "source": [ 275 | "## 结果" 276 | ] 277 | }, 278 | { 279 | "cell_type": "markdown", 280 | "metadata": {}, 281 | "source": [ 282 | "最终,通过这个方法对这次十九大的开幕发言做的一个词汇发现,ngram的n=10,结果按词频排序输出,可以发现这次十九大谈了许多内容,不一一说了。这个结果还存在不少问题,比如“二〇”,这在阈值的设置上还不够准确,可以尝试使用机器学习的方法来获取阈值。" 283 | ] 284 | }, 285 | { 286 | "cell_type": "markdown", 287 | "metadata": {}, 288 | "source": [ 289 | "经济|70 改革|69 我们|64 必须|61 领导|60 完善|57 历史|44 不断|43 群众|43 教育|43 战略|42 思想|40 世界|39 问题|37 提高|37 组织|36 监督|35 加快|35 依法|34 精神|33 团结|33 复兴|32 保障|31 奋斗|30 根本|29 环境|29 军队|29 开放|27 服务|27 理论|26 干部|26 创造|26 基础|25 意识|25 维护|25 协商|24 解决|24 贯彻|23 斗争|23 目标|21 统筹|20 始终|19 方式|19 水平|19 科学|19 利益|19 市场|19 基层|19 积极|18 马克思|18 反对|18 道路|18 自然|18 增长|17 科技|17 稳定|17 原则|17 两岸|17 取得|16 质量|16 农村|16 矛盾|16 协调|15 巩固|15 收入|15 绿色|15 自觉|15 方针|15 纪律|15 长期|15 保证|15 同胞|15 命运|14 美好生活|14 五年|14 传统|14 繁荣|14 没有|14 使命|13 广泛|13 日益|13 价值|13 健康|13 资源|13 参与|13 突出|13 腐败|13 充分|13 梦想|13 任何|13 二〇|13 代表|12 阶段|12 深刻|12 布局|12 区域|12 贸易|12 核心|12 城乡|12 生态文明|12 工程|12 任务|12 地区|12 责任|12 认识|12 胜利|11 贡献|11 覆盖|11 生态环境|11 具有|11 面临|11 各种|11 培育|11 企业|11 继续|10 团结带领|10 提升|10 明显|10 弘扬|10 脱贫|10 贫困|10 标准|10 注重|10 基本实现|10 培养|10 青年|10" 290 | ] 291 | }, 292 | { 293 | "cell_type": "markdown", 294 | "metadata": {}, 295 | "source": [ 296 | "## 代码下载地址" 297 | ] 298 | }, 299 | { 300 | "cell_type": "markdown", 301 | "metadata": {}, 302 | "source": [ 303 | "git clone 
https://github.com/Ushiao/new-word-discovery.git" 304 | ] 305 | } 306 | ], 307 | "metadata": { 308 | "kernelspec": { 309 | "display_name": "Python 3", 310 | "language": "python", 311 | "name": "python3" 312 | }, 313 | "language_info": { 314 | "codemirror_mode": { 315 | "name": "ipython", 316 | "version": 3 317 | }, 318 | "file_extension": ".py", 319 | "mimetype": "text/x-python", 320 | "name": "python", 321 | "nbconvert_exporter": "python", 322 | "pygments_lexer": "ipython3", 323 | "version": "3.6.3" 324 | } 325 | }, 326 | "nbformat": 4, 327 | "nbformat_minor": 2 328 | } 329 | --------------------------------------------------------------------------------