├── .gitignore
├── HISTORY.md
├── LICENSE
├── README.md
├── example
    ├── example01.py
    └── example02.py
├── setup.py
├── test
    ├── Segmentation_test.py
    ├── TextRank4Keyword_test.py
    ├── TextRank4Sentence_test.py
    ├── codecs_test.py
    ├── doc
    │   ├── 01.txt
    │   ├── 02.txt
    │   ├── 03.txt
    │   ├── 04.txt
    │   └── 05.txt
    ├── jieba_test.py
    └── util_test.py
└── textrank4zh
    ├── Segmentation.py
    ├── TextRank4Keyword.py
    ├── TextRank4Sentence.py
    ├── __init__.py
    ├── stopwords.txt
    └── util.py


/.gitignore:
--------------------------------------------------------------------------------
1 | build
2 | dist
3 | MANIFEST
4 | textrank4zh/__pycache__
5 | *.pyc
6 | test.py


--------------------------------------------------------------------------------
/HISTORY.md:
--------------------------------------------------------------------------------
 1 | ### 2014
 2 | 
 3 | Implemented the main features.
 4 | 
 5 | ### 2015-12
 6 | 
 7 | Updated to v0.2.
 8 | 
 9 | The interface changed.
10 | 
11 | 


--------------------------------------------------------------------------------
/LICENSE:
--------------------------------------------------------------------------------
 1 | The MIT License (MIT)
 2 | 
 3 | Copyright (c) 2015 Letian Sun
 4 | 
 5 | Permission is hereby granted, free of charge, to any person obtaining a copy
 6 | of this software and associated documentation files (the "Software"), to deal
 7 | in the Software without restriction, including without limitation the rights
 8 | to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 9 | copies of the Software, and to permit persons to whom the Software is
10 | furnished to do so, subject to the following conditions:
11 | 
12 | The above copyright notice and this permission notice shall be included in all
13 | copies or substantial portions of the Software.
14 | 
15 | THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16 | IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17 | FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18 | AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19 | LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20 | OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21 | SOFTWARE.
22 | 
23 | 


--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
  1 | # TextRank4ZH
  2 | 
  3 | The TextRank algorithm can extract keywords and a summary (the most important sentences) from a text. TextRank4ZH is a Python implementation of TextRank for Chinese text.
  4 | 
  5 | ## Installation
  6 | 
  7 | Option 1:
  8 | ```
  9 | $ python setup.py install --user
 10 | ```
 11 | 
 12 | Option 2:
 13 | ```
 14 | $ sudo python setup.py install
 15 | ```
 16 | 
 17 | Option 3:
 18 | ```
 19 | $ pip install textrank4zh --user
 20 | ```
 21 | 
 22 | Option 4:
 23 | ```
 24 | $ sudo pip install textrank4zh
 25 | ```
 26 | 
 27 | On Python 3, replace `python` with `python3` and `pip` with `pip3` in the commands above.
 28 | 
 29 | 
 30 | ## Uninstallation
 31 | ```plain
 32 | $ pip uninstall textrank4zh
 33 | ```
 34 | 
 35 | ## Dependencies
 36 | jieba >= 0.35  
 37 | numpy >= 1.7.1  
 38 | networkx >= 1.9.1  
 39 | 
 40 | ## Compatibility
 41 | Tested with Python 2.7.9 and Python 3.4.3.
 42 | 
 43 | 
 44 | ## How it works
 45 | 
 46 | For the details of TextRank, see:
 47 | 
 48 | > Mihalcea R, Tarau P. TextRank: Bringing order into texts[C]. Association for Computational Linguistics, 2004.
 49 | 
 50 | An introduction to how TextRank4ZH works and how to use it (in Chinese): [使用TextRank算法为文本生成关键字和摘要](https://www.letiantian.xyz/p/101666.html)
 51 | 
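 52 | ### Keyword extraction
 53 | Split the original text into sentences. In each sentence, optionally filter out stop words and optionally keep only words with the specified parts of speech. This yields a set of sentences and a set of words.
 54 | 
 55 | Each word is a node in the PageRank graph. With a window size of k, suppose a sentence consists of the following words in order:
 56 | ```
 57 | w1, w2, w3, w4, w5, ..., wn
 58 | ```
 59 | Then `w1, w2, ..., wk`, `w2, w3, ..., wk+1`, `w3, w4, ..., wk+2`, and so on are each a window. An undirected, unweighted edge connects the nodes of any two words that appear in the same window.
 60 | 
 61 | The importance of each word node is computed on the resulting graph. The most important words serve as keywords.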
 62 | 
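The graph construction and ranking described above can be sketched in plain Python. This is only an illustrative sketch, not the TextRank4ZH implementation: the helper names `build_cooccurrence_graph` and `pagerank`, the fixed iteration count, and the toy word lists are assumptions made for this example. The package lists networkx as a dependency and presumably uses it for the ranking step.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(sentences, window=2):
    """Link every pair of distinct words that share a window (hypothetical helper)."""
    graph = defaultdict(set)
    for words in sentences:
        # Slide a window of `window` consecutive words over the sentence.
        for start in range(max(len(words) - window + 1, 1)):
            for a, b in combinations(words[start:start + window], 2):
                if a != b:
                    graph[a].add(b)
                    graph[b].add(a)
    return graph


def pagerank(graph, alpha=0.85, iterations=50):
    """Power iteration on an undirected, unweighted graph (hypothetical helper)."""
    nodes = list(graph)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        score = {
            n: (1 - alpha) / len(nodes)
            + alpha * sum(score[m] / len(graph[m]) for m in graph[n])
            for n in nodes
        }
    return score


# Toy input: each inner list is one already-segmented sentence.
sentences = [["w1", "w2", "w3"], ["w2", "w4"]]
scores = pagerank(build_cooccurrence_graph(sentences, window=2))
for word in sorted(scores, key=scores.get, reverse=True):
    print(word, round(scores[word], 4))
```

Words that never share a window with another word get no node in this sketch, so they cannot become keywords.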
 63 | 
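 64 | ### Keyphrase extraction
 65 | First extract a number of keywords as described under keyword extraction. If several of these keywords appear next to each other in the original text, they can be combined into a keyphrase.
 66 | 
 67 | For example, in an article about `支持向量机` (support vector machine), the keywords `支持`, `向量`, and `机` may be found; keyphrase extraction then yields `支持向量机`.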
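Merging adjacent keywords could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the library's code: `extract_keyphrases` is a hypothetical helper, and the reading of `min_occur_num` as "minimum number of occurrences of the phrase in the text" is inferred from the parameter name used in the example script.

```python
def extract_keyphrases(sentences, keywords, min_occur_num=2):
    """Join runs of adjacent keywords into phrases and keep the frequent ones."""
    keywords = set(keywords)
    candidates = set()
    for words in sentences:
        run = []
        for word in list(words) + [None]:  # the None sentinel flushes the last run
            if word in keywords:
                run.append(word)
            else:
                if len(run) > 1:  # a phrase needs at least two keywords
                    candidates.add("".join(run))
                run = []
    # Count occurrences sentence by sentence so phrases never span a boundary.
    joined = ["".join(words) for words in sentences]
    return [
        phrase for phrase in sorted(candidates)
        if sum(text.count(phrase) for text in joined) >= min_occur_num
    ]


# Toy input: already-segmented sentences and the keywords found for them.
sentences = [["支持", "向量", "机", "是", "方法"], ["支持", "向量", "机", "分类"]]
print(extract_keyphrases(sentences, ["支持", "向量", "机"], min_occur_num=2))
```

Joining with an empty string suits Chinese text; for space-delimited languages a separator would be needed.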
 68 | 
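 69 | ### Summary generation
 70 | Treat each sentence as a node in the graph. If two sentences are similar, the corresponding nodes are connected by an undirected, weighted edge whose weight is the similarity.
 71 | 
 72 | The sentences that the PageRank algorithm ranks as most important can be used as the summary.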
 73 | 
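A small sketch of sentence ranking on a weighted similarity graph follows. The similarity measure is the word-overlap formula from the TextRank paper cited above (shared words divided by the sum of the logs of the sentence lengths); whether TextRank4ZH uses exactly this measure is an assumption here, and `sentence_similarity` and `rank_sentences` are hypothetical names.

```python
import math


def sentence_similarity(words_a, words_b):
    """Word overlap normalised by log sentence lengths (TextRank paper formula)."""
    shared = len(set(words_a) & set(words_b))
    if shared == 0:
        return 0.0
    denom = math.log(len(words_a)) + math.log(len(words_b))
    return shared / denom if denom > 0 else 0.0


def rank_sentences(sentences, alpha=0.85, iterations=50):
    """Return sentence indices ordered from most to least important."""
    n = len(sentences)
    if n == 0:
        return []
    # Symmetric weight matrix: weight[i][j] is the similarity of sentences i and j.
    weight = [[sentence_similarity(sentences[i], sentences[j]) if i != j else 0.0
               for j in range(n)] for i in range(n)]
    out = [sum(row) for row in weight]
    score = [1.0 / n] * n
    for _ in range(iterations):
        # Each sentence receives score from its neighbours in proportion to edge weight.
        score = [(1 - alpha) / n
                 + alpha * sum(weight[j][i] * score[j] / out[j]
                               for j in range(n) if out[j] > 0)
                 for i in range(n)]
    return sorted(range(n), key=lambda i: score[i], reverse=True)


# Toy input: each inner list is one already-segmented sentence.
sentences = [["a", "b", "c", "d"], ["a", "b", "x"], ["c", "d", "y"], ["p", "q", "r"]]
print(rank_sentences(sentences))
```

Taking the first few indices of the returned ranking and restoring document order gives a simple extractive summary.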
 74 | 
 75 | ## Examples
 76 | See [example](./example) and [test](./test).
 77 | 
 78 | example/example01.py:
 79 | 
 80 | ```python
 81 | #-*- encoding:utf-8 -*-
 82 | from __future__ import print_function
 83 | 
 84 | import sys
 85 | try:
 86 |     reload(sys)
 87 |     sys.setdefaultencoding('utf-8')
 88 | except:
 89 |     pass
 90 | 
 91 | import codecs
 92 | from textrank4zh import TextRank4Keyword, TextRank4Sentence
 93 | 
 94 | text = codecs.open('../test/doc/01.txt', 'r', 'utf-8').read()
 95 | tr4w = TextRank4Keyword()
 96 | 
 97 | tr4w.analyze(text=text, lower=True, window=2)  # In py2, text must be a UTF-8 encoded str or a unicode object; in py3, UTF-8 encoded bytes or a str object
 98 | 
 99 | print( '关键词：' )
100 | for item in tr4w.get_keywords(20, word_min_len=1):
101 |     print(item.word, item.weight)
102 | 
103 | print()
104 | print( '关键短语：' )
105 | for phrase in tr4w.get_keyphrases(keywords_num=20, min_occur_num= 2):
106 |     print(phrase)
107 | 
108 | tr4s = TextRank4Sentence()
109 | tr4s.analyze(text=text, lower=True, source = 'all_filters')
110 | 
111 | print()
112 | print( '摘要：' )
113 | for item in tr4s.get_key_sentences(num=3):
114 |     print(item.index, item.weight, item.sentence)  # index is the sentence's position in the text, weight is its score
115 | ```
116 | 
117 | The output is:
118 | ```plain
119 | 关键词：
120 | 媒体 0.02155864734852778
121 | 高圆圆 0.020220281898126486
122 | 微 0.01671909730824073
123 | 宾客 0.014328439104001788
124 | 赵又廷 0.014035488254875914
125 | 答谢 0.013759845912857732
126 | 谢娜 0.013361244496632448
127 | 现身 0.012724133346018603
128 | 记者 0.01227742092899235
129 | 新人 0.01183128428494362
130 | 北京 0.011686712993089671
131 | 博 0.011447168887452668
132 | 展示 0.010889176260920504
133 | 捧场 0.010507502237123278
134 | 礼物 0.010447275379792245
135 | 张杰 0.009558332870902892
136 | 当晚 0.009137982757893915
137 | 戴 0.008915271161035208
138 | 酒店 0.00883521621207796
139 | 外套 0.008822082954131174
140 | 
141 | 关键短语：
142 | 微博
143 | 
144 | 摘要：
146 | 0 0.0709719557171 中新网北京12月1日电(记者 张曦) 30日晚，高圆圆和赵又廷在京举行答谢宴，诸多明星现身捧场，其中包括张杰(微博)、谢娜(微博)夫妇、何炅(微博)、蔡康永(微博)、徐克、张凯丽、黄轩(微博)等
147 | 6 0.0541037236415 高圆圆身穿粉色外套，看到大批记者在场露出娇羞神色，赵又廷则戴着鸭舌帽，十分淡定，两人快步走进电梯，未接受媒体采访
148 | 27 0.0490428312984 记者了解到，出席高圆圆、赵又廷答谢宴的宾客近百人，其中不少都是女方的高中同学
149 | 
150 | ```
151 | 
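152 | ## Usage
153 | 
154 | When processing a text, the classes TextRank4Keyword and TextRank4Sentence split it into four representations:
155 | 
156 | * sentences: a list of sentences.
157 | * words_no_filter: a two-level list obtained by segmenting each sentence in sentences into words.
158 | * words_no_stop_words: the two-level list obtained by removing stop words from words_no_filter.
159 | * words_all_filters: the two-level list obtained by keeping only the words in words_no_stop_words that have the specified parts of speech.
160 | 
161 | For example, given: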
162 | ```
163 | 这间酒店位于北京东三环，里面摆放很多雕塑，文艺气息十足。答谢宴于晚上8点开始。
164 | ```
165 | 
166 | ```python
167 | #-*- encoding:utf-8 -*-
168 | from __future__ import print_function
169 | import codecs
170 | from textrank4zh import TextRank4Keyword, TextRank4Sentence
171 | 
172 | import sys
173 | try:
174 |     reload(sys)
175 |     sys.setdefaultencoding('utf-8')
176 | except:
177 |     pass
178 | 
179 | text = "这间酒店位于北京东三环，里面摆放很多雕塑，文艺气息十足。答谢宴于晚上8点开始。"
180 | tr4w = TextRank4Keyword()
181 | 
182 | tr4w.analyze(text=text, lower=True, window=2)
183 | 
184 | print()
185 | print('sentences:')
186 | for s in tr4w.sentences:
187 |     print(s)                 # unicode in py2, str in py3.
188 | 
189 | print()
190 | print('words_no_filter')
191 | for words in tr4w.words_no_filter:
192 |     print('/'.join(words))   # unicode in py2, str in py3.
193 | 
194 | print()
195 | print('words_no_stop_words')
196 | for words in tr4w.words_no_stop_words:
197 |     print('/'.join(words))   # unicode in py2, str in py3.
198 | 
199 | print()
200 | print('words_all_filters')
201 | for words in tr4w.words_all_filters:
202 |     print('/'.join(words))   # unicode in py2, str in py3.
203 | ```
204 | 
205 | The output is:
206 | ```plain
207 | sentences:
208 | 这间酒店位于北京东三环，里面摆放很多雕塑，文艺气息十足
209 | 答谢宴于晚上8点开始
210 | 
211 | words_no_filter
212 | 这/间/酒店/位于/北京/东三环/里面/摆放/很多/雕塑/文艺/气息/十足
213 | 答谢/宴于/晚上/8/点/开始
214 | 
215 | words_no_stop_words
216 | 间/酒店/位于/北京/东三环/里面/摆放/很多/雕塑/文艺/气息/十足
217 | 答谢/宴于/晚上/8/点
218 | 
219 | words_all_filters
220 | 酒店/位于/北京/东三环/摆放/雕塑/文艺/气息
221 | 答谢/宴于/晚上
222 | 
223 | ```
224 | 
225 | 
226 | ## API
227 | TODO.
228 | 
229 | For class implementations and function parameters, see the comments in the source code.
230 | 
231 | ## License
232 | [MIT](./LICENSE)
233 | 
234 | 
235 | 
236 | 
237 | 
238 | 
239 | 
240 | 
241 | 
242 | 


--------------------------------------------------------------------------------
/example/example01.py:
--------------------------------------------------------------------------------
 1 | #-*- encoding:utf-8 -*-
 2 | from __future__ import print_function
 3 | 
 4 | import sys
 5 | try:
 6 |     reload(sys)
 7 |     sys.setdefaultencoding('utf-8')
 8 | except:
 9 |     pass
10 | 
11 | import codecs
12 | from textrank4zh import TextRank4Keyword, TextRank4Sentence
13 | 
14 | text = codecs.open('../test/doc/01.txt', 'r', 'utf-8').read()
15 | tr4w = TextRank4Keyword()
16 | 
17 | tr4w.analyze(text=text, lower=True, window=2)   # In py2, text must be a UTF-8 encoded str or a unicode object; in py3, UTF-8 encoded bytes or a str object
18 | 
19 | print( '关键词：' )
20 | for item in tr4w.get_keywords(20, word_min_len=1):
21 |     print(item.word, item.weight)
22 | 
23 | print()
24 | print( '关键短语：' )
25 | for phrase in tr4w.get_keyphrases(keywords_num=20, min_occur_num= 2):
26 |     print(phrase)
27 | 
28 | tr4s = TextRank4Sentence()
29 | tr4s.analyze(text=text, lower=True, source = 'all_filters')
30 | 
31 | print()
32 | print( '摘要：' )
33 | for item in tr4s.get_key_sentences(num=3):
34 |     print(item.index, item.weight, item.sentence)


--------------------------------------------------------------------------------
/example/example02.py:
--------------------------------------------------------------------------------
 1 | #-*- encoding:utf-8 -*-
 2 | from __future__ import print_function
 3 | import codecs
 4 | from textrank4zh import TextRank4Keyword, TextRank4Sentence
 5 | 
 6 | import sys
 7 | try:
 8 |     reload(sys)
 9 |     sys.setdefaultencoding('utf-8')
10 | except:
11 |     pass
12 | 
13 | text = "这间酒店位于北京东三环，里面摆放很多雕塑，文艺气息十足。答谢宴于晚上8点开始。"
14 | tr4w = TextRank4Keyword()
15 | 
16 | tr4w.analyze(text=text, lower=True, window=2)
17 | 
18 | print()
19 | print('sentences:')
20 | for s in tr4w.sentences:
21 |     print(s)                 # unicode in py2, str in py3.
22 | 
23 | print()
24 | print('words_no_filter')
25 | for words in tr4w.words_no_filter:
26 |     print('/'.join(words))   # unicode in py2, str in py3.
27 | 
28 | print()
29 | print('words_no_stop_words')
30 | for words in tr4w.words_no_stop_words:
31 |     print('/'.join(words))   # unicode in py2, str in py3.
32 | 
33 | print()
34 | print('words_all_filters')
35 | for words in tr4w.words_all_filters:
36 |     print('/'.join(words))   # unicode in py2, str in py3.


--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
 1 | # -*- coding: utf-8 -*-
 2 | from distutils.core import setup
 3 | LONGDOC = """
 4 | Please go to https://github.com/someus/TextRank4ZH for more info.
 5 | """
 6 | 
 7 | setup(
 8 |     name='textrank4zh',
 9 |     version='0.3',
10 |     description='Extract keywords and abstract Chinese article',
11 |     long_description=LONGDOC,
12 |     author='Letian Sun',
13 |     author_email='sunlt1699@gmail.com',
14 |     url='https://github.com/someus/TextRank4ZH',
15 |     license="MIT",
16 |     classifiers=[
17 |         'Intended Audience :: Developers',
18 |         'License :: OSI Approved :: MIT License',
19 |         'Operating System :: OS Independent',
20 |         'Natural Language :: Chinese (Simplified)',
21 |         'Natural Language :: Chinese (Traditional)',
22 |         'Programming Language :: Python :: 2',
23 |         'Programming Language :: Python :: 2.7',
24 |         'Programming Language :: Python :: 3',
25 |         'Programming Language :: Python :: 3.4',
26 |         'Topic :: Text Processing',
27 |         'Topic :: Text Processing :: Linguistic',
28 |     ],
29 |     keywords='NLP,Chinese,Keywords extraction, Abstract extraction',
30 |     install_requires=['jieba >= 0.35', 'numpy >= 1.7.1', 'networkx >= 1.9.1'],
31 |     packages=['textrank4zh'],
32 |     package_dir={'textrank4zh':'textrank4zh'},
33 |     package_data={'textrank4zh':['*.txt',]},
34 | )


--------------------------------------------------------------------------------
/test/Segmentation_test.py:
--------------------------------------------------------------------------------
 1 | #-*- encoding:utf-8 -*-
 2 | from __future__ import print_function
 3 | 
 4 | import sys
 5 | try:
 6 |     reload(sys)
 7 |     sys.setdefaultencoding('utf-8')
 8 | except:
 9 |     pass
10 | 
11 | import codecs
12 | from textrank4zh import Segmentation
13 | 
14 | seg = Segmentation.Segmentation()
15 | 
16 | text = codecs.open('./doc/01.txt', 'r', 'utf-8', 'ignore').read()
17 | text = "视频里，我们的杰宝热情地用英文和全场观众打招呼并清唱了一段《Heal The World》。我们的世界充满了未知数。"
18 | 
19 | result = seg.segment(text=text, lower=True)
20 | 
21 | for key in result:
22 |     print(key)
23 | 
24 | print(20*'#')
25 | for s in result['sentences']:
26 |     print(s)
27 | 
28 | print(20*'*')
29 | for s in result.sentences:
30 |     print (s)
31 | 
32 | print()
33 | for ss in result.words_no_filter:
34 |     print( '  '.join(ss) )
35 | 
36 | print()
37 | for ss in result.words_no_stop_words:
38 |     print( ' / '.join(ss) )
39 | 
40 | print()
41 | for ss in result.words_all_filters:
42 |     print (' | '.join(ss) )
43 | 


--------------------------------------------------------------------------------
/test/TextRank4Keyword_test.py:
--------------------------------------------------------------------------------
 1 | #-*- encoding:utf-8 -*-
 2 | from __future__ import print_function
 3 | 
 4 | import sys
 5 | try:
 6 |     reload(sys)
 7 |     sys.setdefaultencoding('utf-8')
 8 | except:
 9 |     pass
10 | 
11 | import codecs
12 | from textrank4zh import TextRank4Keyword
13 | 
14 | text = codecs.open('./doc/02.txt', 'r', 'utf-8').read()
15 | # text = "世界的美好。世界美国英国。 世界和平。"
16 | 
17 | tr4w = TextRank4Keyword()
18 | tr4w.analyze(text=text,lower=True, window=3, pagerank_config={'alpha':0.85})
19 | 
20 | for item in tr4w.get_keywords(30, word_min_len=2):
21 |     print(item.word, item.weight, type(item.word))
22 | 
23 | print('--phrase--')
24 | 
25 | for phrase in tr4w.get_keyphrases(keywords_num=20, min_occur_num = 0):
26 |     print(phrase, type(phrase))


--------------------------------------------------------------------------------
/test/TextRank4Sentence_test.py:
--------------------------------------------------------------------------------
 1 | #-*- encoding:utf-8 -*-
 2 | from __future__ import print_function
 3 | 
 4 | import sys
 5 | try:
 6 |     reload(sys)
 7 |     sys.setdefaultencoding('utf-8')
 8 | except:
 9 |     pass
10 | 
11 | import codecs
12 | from textrank4zh import TextRank4Sentence
13 | 
14 | text = codecs.open('./doc/03.txt', 'r', 'utf-8').read()
15 | text = "这间酒店位于北京东三环，里面摆放很多雕塑，文艺气息十足。答谢宴于晚上8点开始。"
16 | tr4s = TextRank4Sentence()
17 | tr4s.analyze(text=text, lower=True, source = 'all_filters')
18 | 
19 | for st in tr4s.sentences:
20 |     print(type(st), st)
21 | 
22 | print(20*'*')
23 | for item in tr4s.get_key_sentences(num=4):
24 |     print(item.weight, item.sentence, type(item.sentence))


--------------------------------------------------------------------------------
/test/codecs_test.py:
--------------------------------------------------------------------------------
1 | #-*- encoding:utf-8 -*-
2 | from __future__ import print_function
3 | 
4 | 
5 | import codecs
6 | text = codecs.open('./doc/01.txt', 'r', 'utf-8', 'ignore').read()
7 | print( type(text) )  # in py2 is unicode, py3 is str


--------------------------------------------------------------------------------
/test/doc/01.txt:
--------------------------------------------------------------------------------
 1 | 中新网北京12月1日电(记者 张曦) 30日晚，高圆圆和赵又廷在京举行答谢宴，诸多明星现身捧场，其中包括张杰(微博)、谢娜(微博)夫妇、何炅(微博)、蔡康永(微博)、徐克、张凯丽、黄轩(微博)等。
 2 | 
 3 | 30日中午，有媒体曝光高圆圆和赵又廷现身台北桃园机场的照片，照片中两人小动作不断，尽显恩爱。事实上，夫妻俩此行是回女方老家北京举办答谢宴。
 4 | 
 5 | 群星捧场 谢娜张杰亮相
 6 | 
 7 | 当晚不到7点，两人十指紧扣率先抵达酒店。这间酒店位于北京东三环，里面摆放很多雕塑，文艺气息十足。
 8 | 
 9 | 高圆圆身穿粉色外套，看到大批记者在场露出娇羞神色，赵又廷则戴着鸭舌帽，十分淡定，两人快步走进电梯，未接受媒体采访。
10 | 
11 | 随后，谢娜、何炅也一前一后到场庆贺，并对一对新人表示恭喜。接着蔡康永满脸笑容现身，他直言：“我没有参加台湾婚礼，所以这次觉得蛮开心。”
12 | 
13 | 曾与赵又廷合作《狄仁杰之神都龙王》的导演徐克则携女助理亮相，面对媒体的长枪短炮，他只大呼“恭喜！恭喜！”
14 | 
15 | 作为高圆圆的好友，黄轩虽然拍杂志收工较晚，但也赶过来参加答谢宴。问到给新人带什么礼物，他大方拉开外套，展示藏在包里厚厚的红包，并笑言：“封红包吧！”但不愿透露具体数额。
16 | 
17 | 值得一提的是，当晚10点，张杰压轴抵达酒店，他戴着黑色口罩，透露因刚下飞机所以未和妻子谢娜同行。虽然他没有接受采访，但在进电梯后大方向媒体挥手致意。
18 | 
19 | 《我们结婚吧》主创捧场
20 | 
21 | 黄海波(微博)获释仍未出席
22 | 
23 | 在电视剧《咱们结婚吧》里，饰演高圆圆母亲的张凯丽，当晚身穿黄色大衣出席，但只待了一个小时就匆忙离去。
24 | 
25 | 同样有份参演该剧，并扮演高圆圆男闺蜜的大左(微信号：dazuozone) 也到场助阵，28日，他已在台湾参加两人的盛大婚礼。大左30日晚接受采访时直言当时场面感人，“每个人都哭得稀里哗啦，晚上是吴宗宪(微博)(微信号：wushowzongxian) 主持，现场欢声笑语，讲了好多不能播的事，新人都非常开心”。
26 | 
27 | 最令人关注的是在这部剧里和高圆圆出演夫妻的黄海波。巧合的是，他刚好于30日收容教育期满，解除收容教育。
28 | 
29 | 答谢宴细节
30 | 
31 | 宾客近百人，获赠礼物
32 | 
33 | 记者了解到，出席高圆圆、赵又廷答谢宴的宾客近百人，其中不少都是女方的高中同学。
34 | 
35 | 答谢宴位于酒店地下一层，现场安保森严，大批媒体只好在酒店大堂等待。期间有工作人员上来送上喜糖，代两位新人向媒体问好。
36 | 
37 | 记者注意到，虽然答谢宴于晚上8点开始，但从9点开始就陆续有宾客离开，每个宾客都手持礼物，有宾客大方展示礼盒，只见礼盒上印有两只正在接吻的烫金兔子，不过工作人员迅速赶来，拒绝宾客继续展示。


--------------------------------------------------------------------------------
/test/doc/02.txt:
--------------------------------------------------------------------------------
 1 | 如何在美国把贪官送进监狱——
 2 | 法律用美国的：被用来修理黑帮的美国联邦法律也能“顺便”对付中国贪官
 3 | 
 4 | 在美国起诉中国贪官，不可能适用中国自己的法律。所以，必须得搞清楚外逃贪官触犯美国法律的证据。贪官明明是在中国国内贪腐，还能触犯到美国的法律？没错。
 5 | 
 6 | 有先例可循。1994年—2001年，原中国银行开平支行三任行长许超凡、余振东和许国俊勾结贪污、挪用了4.85亿美元巨资。他们都逃向了美国。后来，余振东和二许分别在美国被起诉。以“二许案”为例，这两个巨贪触犯了多项美国联邦刑法，首当其冲的是《反勒索及受贿组织法》。该法律是美国在上个世纪70年代通过的，当时的立意是对付各种黑帮。由于黑帮犯罪常常是一套完整的步骤，所以这个法案把有组织犯罪作为一条完整的“产业链”做考虑。具体来说，二许在中国国内贪污后，后续有一系列涉及到美国的行为——通过各种办法洗钱；把赃款转移到美国；为转移非法所得开设空壳公司……这三个人用了拉斯维加斯的赌场洗钱。所以，最后都是在拉斯维加斯所在的内华达州被美联邦法院审判。
 7 | 
 8 | 除了“有组织犯罪”相关法条外，洗钱、伪造签证等贪官可能涉及到的触及美国法律行为都能被提起控诉。这里的美国法律主要指的是美国联邦法律，而不是州法律，所以这些贪官也是被联邦警察给抓获的。
 9 | 
10 | 总之，在国内的贪污行为是“上游”，中国的检察官们没可能因为这些发生在中国的“上游”事件要求美国法院给中国贪官定罪；而美国法官也不可能运用中国的法律来做判决。不过，把赃款和人转移出去这个“下游”过程是有很大部分是在美国发生的，可能触犯到各种美国法律。严格说起来，要想在美国对贪官们治罪，那么得找到他们的贪污关联行为触犯到美国法律的证据。
11 | 在美国坐牢的贪官许超凡和许国俊曾经通过拉斯维加斯的赌场洗钱在美国坐牢的贪官许超凡和许国俊曾经通过拉斯维加斯的赌场洗钱
12 | 一般也在美国蹲监狱：美国法律对付中国贪官并不手软，他们可能被判得很重
13 | 
14 | 既然是运用美国法律判的，也得在美国服刑。而大家会担心，会不会“有组织犯罪”等罪名对付中国贪官太过温和、间接，对他们下手轻呢？其实不会。还是说“二许案”，他们一个判25年，另一个是22年。因为所涉及的犯罪基本在美国都是重罪。2009年，法制日报的报道《中行开平案八年追诉始末》分析道，“‘二许’此次在美国所获刑期，均已经超出我国刑法有期徒刑的最高量刑标准。”
15 | 
16 | 在二许坐满牢之后，他们面临着被美国驱逐出境。他们都是通过欺诈的手段获得了美国的签证。
17 | 不过也有办法把贪官“换回”中国坐牢：中、美和嫌犯三方达成协议
18 | 余振东被遣返回中国后被判处12年有期徒刑。余振东被遣返回中国后被判处12年有期徒刑。
19 | 
20 | 前文提到的余振东案里，余在美国被判入监144个月，但他现在处于中国的监狱中。这又是怎么回事呢？原来，余振东表示自愿接受遣返。余振东向美国方面递交《递解出境司法命令和放弃听证约定申请书》，承认自己在美所犯罪行应导致递解出境的法律后果，并且明确指定中国为其递解出境的接收国。当然，这种“自愿”是有前提条件的。因为中国也向美国的司法机关作出承诺，余振东在中国国内被宣判的刑期不会长于美国。而也因为余振东的自愿认罪，美国司法机关对他的判罚从轻。
21 | 还可以追究共同犯罪的贪官家属刑责：贪官的家属倘若一起触犯了美国法律，也得受罚
22 | 贪官背后往往有“贪内助”，而参与了犯罪的贪官家属也可能在美国被起诉贪官背后往往有“贪内助”，而参与了犯罪的贪官家属也可能在美国被起诉
23 | 
24 | “二许案”中一共有四个人被追究刑责。因为，两个贪官的太太也没少参与触犯到美国法律的洗钱等行为。她们分别被判处监禁8年。除了大家能够想到的洗钱等常规动作外，这两对巨贪夫妻很“奇葩”的一点是，两位妻子先通过和美国人假结婚获得了美国公民资格。没有后顾之忧，真丈夫们开始疯狂地转移资金。等到逃跑时候，男人们也运用了假结婚的方式。所以在“二许案”的指控中，有一项是“护照、签证欺诈”。
25 | 要做到以上这些，重要的还是中国官方的努力，争取美国的积极合作
26 | 
27 | 看起来，好像动用美国的司法体系来追究中国贪官的刑责并不难，也只需要美国方面的努力。那么，这是一条追诉逃美贪官的康庄大道？当然不是这样的。否则不会在2009年“二许案”宣判之后，暂时没有再出现过这样的案例。余振东案和“二许案”在当年都轰动一时，关系重大。因此是被当作大案要案在办。其时，恰逢中国和美国签署了《刑事司法互助协定》不久。所以这三个金融系统大蛀虫首当其冲被起诉了——几个巨贪被美国联邦警察逮捕就是中方努力的结果。中国也向美方提供了大量的证据，证明钱财是非法所得。找出财产转移链、挖出洗钱的细节……种种犯罪事实都需要经过繁琐的查证。另一方面，美国办案子也需要付出大量的司法成本，所以不能希冀美国的司法部门多么主动地去发现中国外逃贪官。
28 | 
29 | 当然，时代在前行。随着国内反腐的高涨，海外追逃也越来越得到重视。这次外交部条约法律司司长徐宏的发言，也给了大家一个期许。
30 | 美国司法部关于“二许案”的通告美国司法部关于“二许案”的通告
31 | 如何在美国打官司，向贪官要回钱——
32 | 拿着刑事判决的结果来打民事官司追款相对容易
33 | 
34 | 对于民事诉讼追赃，《联合国反腐败公约》里有制度支持。而相对容易的一种形式就是拿着刑事判决去追赃。刑事判决对于财产的非法性是强有力的证明。所以，“二许案”后，中国银行在美国当地提起诉讼，追回了一些财产。
35 | 
36 | 尽管“二许案”的许多赃款并没有转移到美国，而是在加拿大，美国的这份刑事判决也有助于“苦主”在加拿大追偿。就在今年11月24日，加拿大的大不列颠哥伦比亚BC省法院正式开庭审理中国银行向许超凡妻子和母亲追赃的民事诉讼。
37 | 
38 | 一些学者认为，中国国内的刑事判决也是有助于发起民事诉讼法追赃的。不过一个现实是，中国的刑法不允许“缺席审判”，贪官不到位就没法动了，因此许多学者也提出中国应该建立起相关的制度来。
39 | 不管刑事，直接打民事官司也可以，就是费时、费力、费钱
40 | 
41 | 民事诉讼相对于刑事诉讼来说要容易得多。所以被认为是一个非常好的向外逃贪官追责的路径。追回贪官的赃款，既挽回损失，还能够断了贪官的财源。
42 | 
43 | 当然，以上都是最理想的说法。实际情况难多了。所以公开报道的海外成功追赃的民事诉讼案例真是屈指可数。在美国，目前唯一公开的一起是前述的中国银行向“二许”追赃。但是情况特殊，并且真正的大头在加拿大，所以参考性不强。倒是有一起不在美国，在澳大利亚的民事诉讼追赃可以做参考。被诉方是原北京市城乡建设集团副总经理李化学。只是过程非常艰辛曲折，为了顺利起诉，中方不得不聘请了一名当地律师。付出和回报存疑。办案人员彭唯良检察官的原话是：“在国外打官司，经济上必须有坚强的后盾来支持。另外，由于语言上的障碍，有些我们想通过律师要达到的目的，律师不太了解，返工的次数比较多。”
44 | 
45 | 当然，这里需要说明的一点是，民事诉讼的主体也不宜是中国政府，而是具体的单位。所以在北京城乡集团的案子里，尽管检察官们为民事诉讼付出了大量的努力，但还得找来单位做原告。
46 | 目前经验看，最省事、有效的是争取到美国司法部的最大限度合作
47 | 陈水扁用非法所得在美国购买的房产陈水扁用非法所得在美国购买的房产
48 | 
49 | 说一个台湾地区的例子。陈水扁弊案爆发后，被美国司法部发现，陈家用“不法所得”在美国购买了两处房产。后来，由美国司法部出面进行没收。美国司法部也需要向法院提出诉讼。这其实是属于美国的一个“腐败政府国家资产追回”计划。这个案子最后的结果是，法院支持了美国司法部的请求，陈家房产被拍卖。而根据相关法律，拍卖所得美国是有权分得一部分的。
50 | 
51 | 由美国司法部出面提起诉讼，恐怕是最好的办法了。而这需要两点：第一，还是追赃国的申请和完整的证据；第二，则涉及到一个积极性问题。对追缴财产进行分享也是国际上一个比较流行的做法，可以大限度地调动赃款流向国的积极性。不失为一个参考。
52 | 结语
53 | 看来，在美国起诉贪官确实可行。但是，不管是追人还是追钱，都存在一个和美国的紧密合作问题，不然也是白搭。
54 | 


--------------------------------------------------------------------------------
/test/doc/03.txt:
--------------------------------------------------------------------------------
 1 | 据BI消息，Netflix 正准备在本月上线其最新的原创剧集《马可波罗》。而据纽约时报报道，《马可波罗》第一季 10 集的总投资高达 9000 万美元，这不仅创下了 Netflix 的最高电视剧投资记录，在全球电视剧制作成本的排名中也是数一数二的，仅次于 HBO 原创的《权利的游戏》。
 2 | 
 3 | 《马可波罗》在意大利、哈萨克斯坦、马来西亚等多国取景拍摄，数百名演员来自多个国家，电视剧把传奇冒险、战争、武术、性诱惑、政治阴谋等元素都融了进去，看起来会包含不少大家喜闻乐见的题材。Netflix 也为《马可波罗》的播出制定了庞大的市场营销计划。比如，Netflix 将携主要演员参加巴西的圣地亚哥国际动漫展，另外也会在墨西哥的一个大型购物中心展示《马可波罗》演出所用的服装和道具。
 4 | 
 5 | Netflix 怎么会如此大手笔？毫无疑问 Netflix 对这部剧寄予了重望——Netflix 的海外市场。Netflix 现在已经进入全球 50 多个国家和地区，付费订户高达 5000 万人。由于在美国本土的增长开始下滑，寻求海外增长机会成为 Netflix 的当务之急。除了四处购买电影电视剧的全球版权等海外市场豪赌，Netflix 在影视内容上也有一场豪赌——鸿篇巨制的《马可波罗》。此剧由独立片商威斯坦公司制作，Netflix 掌握全球版权，将从 12 月 12 日开始在 Netflix 面向全球订户提供点播。
 6 | 
 7 | Netflix 早期靠《纸牌屋》和《女子监狱》等原创剧一鸣惊人，但可惜的是，Netflix 并没有掌握《纸牌屋》等剧的海外版权。比如在德国和法国，观众可以在电视频道上收看该剧。不过，《纸牌屋》的成功仍然帮助 Netflix 提升了知名度。目前，Netflix 正在筹拍中的原创剧有不少，《马可波罗》成功与否可以在一定程度上验证 Netflix 的原创剧战略是否在海外市场是否有效。
 8 | 
 9 | 不过，一些媒体行业分析师预测，Netflix 的国际化将会遇到各国本地视频网站的狙击。此外 HBO 也是其最强劲的竞争对手。众所周知，HBO 在国际化方面已经先行一步。比如在中国市场，HBO 就刚刚与腾讯视频签订了独家播放权。而迄今为止，Netflix 尚未在亚洲任何一个国家展开业务。
10 | 


--------------------------------------------------------------------------------
/test/doc/04.txt:
--------------------------------------------------------------------------------
1 | 京菜擅长烤、爆、烧、焖、涮，听起来豪爽，吃起来痛快。北京烤鸭是来京游玩必食的美味；西四缸瓦市一家名叫砂锅居的老店所烧的砂锅白肉名满京城，相传他们用的原汤已有二三百年历史；涮羊肉是最受北京人欢迎的冬令美食，其中阳坊涮肉连锁店以价格便宜，味道正宗而倍受青睐。可以以半份起卖，在全市有许多分店。除此之外，还有东来顺、又一顺、能仁居的涮羊肉名气也很大。
2 | 北京风味小吃有600多年历史，包括汉民风味小吃、回民风味小吃和宫廷风味小吃等300多种。
3 | 北京的各大饭店历来是名厨荟萃，如北京饭店的谭家菜、建国饭店的法式西餐都是别处不易享用到的佳肴；北京还有正宗的法式、美式、意式、俄式餐厅和日本料理、韩国烧烤以及越南、印尼、泰国风味的菜馆。若为省时实惠，还可以光顾街头小店，这里不乏北京特有的包子、饺子、面条及家常炒菜，当然，环境就不如大餐馆讲究了。 
4 | 东直门内大街原是北京最富特色的餐饮一条街，大街南北两侧云集了各种风味的餐馆，多为24小时营业。但现在因为拆迁，这条路上的餐饮店多已搬迁。 


--------------------------------------------------------------------------------
/test/doc/05.txt:
--------------------------------------------------------------------------------
1 | 支持向量机（英语：Support Vector Machine，常简称为SVM）是一种监督式学习的方法，可广泛地应用于统计分类以及回归分析。
2 | 
3 | 支持向量机属于一般化线性分类器，也可以被认为是提克洛夫规范化（Tikhonov Regularization）方法的一个特例。这族分类器的特点是他们能够同时最小化经验误差与最大化几何边缘区，因此支持向量机也被称为最大边缘区分类器。
4 | 
5 | 支持向量机构造一个超平面或者多个超平面，这些超平面可能是高维的，甚至可能是无限多维的。在分类任务中，它的原理是，将决策面（超平面）放置在这样的一个位置，两类中接近这个位置的点距离的都最远。我们来考虑两类线性可分问题，如果要在两个类之间画一条线，那么按照支持向量机的原理，我们会先找两类之间最大的空白间隔，然后在空白间隔的中点画一条线，这条线平行于空白间隔。通过核函数，可以使得支持向量机对非线性可分的任务进行分类。一个极好的指南是C.J.C Burges的《模式识别支持向量机指南》。van der Walt和Barnard将支持向量机和其他分类器进行了比较。


--------------------------------------------------------------------------------
/test/jieba_test.py:
--------------------------------------------------------------------------------
 1 | #-*- encoding:utf-8 -*-
 2 | from __future__ import print_function
 3 | 
 4 | import sys
 5 | try:
 6 |     reload(sys)
 7 |     sys.setdefaultencoding('utf-8')
 8 | except:
 9 |     pass
10 | 
11 | import jieba.posseg as pseg
12 | words = pseg.cut("我爱北京天安门.。；‘你的#")
13 | for w in words:
14 |     # print(w.word)
15 |     print('{0} {1}'.format(w.word, w.flag))
16 |     print(type(w.word))  # in py2 is unicode, py3 is str
17 | 
18 | 


--------------------------------------------------------------------------------
/test/util_test.py:
--------------------------------------------------------------------------------
 1 | #-*- encoding:utf-8 -*-
 2 | from __future__ import print_function
 3 | 
 4 | from textrank4zh import util
 5 | 
 6 | def testAttrDict():
 7 |     r = util.AttrDict(a=2)
 8 |     print( r )
 9 |     print( r.a )
10 |     print( r['a'] )
11 | 
12 | def testCombine():
13 |     print(20*'*')
14 |     for item in util.combine(['a', 'b', 'c', 'd'], 2):
15 |         print(item)
16 |     print()
17 |     for item in util.combine(['a', 'b', 'c', 'd'], 3):
18 |         print (item)
19 | 
20 | def testDebug():
21 |     import sys
22 |     print(sys.getdefaultencoding())
23 |     util.debug('你好')
24 |     util.debug(u'世界')
25 | 
26 | 
27 | if __name__ == "__main__":
28 |     testAttrDict()
29 |     testCombine()
30 |     testDebug()
31 | 


--------------------------------------------------------------------------------
/textrank4zh/Segmentation.py:
--------------------------------------------------------------------------------
  1 | #-*- encoding:utf-8 -*-
  2 | """
  3 | @author:   letian
  4 | @homepage: http://www.letiantian.me
  5 | @github:   https://github.com/someus/
  6 | """
  7 | from __future__ import (absolute_import, division, print_function,
  8 |                         unicode_literals)
  9 | 
 10 | import jieba.posseg as pseg
 11 | import codecs
 12 | import os
 13 | 
 14 | from . import util
 15 | 
 16 | def get_default_stop_words_file():
 17 |     d = os.path.dirname(os.path.realpath(__file__))
 18 |     return os.path.join(d, 'stopwords.txt')
 19 | 
 20 | class WordSegmentation(object):
 21 |     """ Word segmentation """
 22 |     
 23 |     def __init__(self, stop_words_file = None, allow_speech_tags = util.allow_speech_tags):
 24 |         """
 25 |         Keyword arguments:
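 26 |         stop_words_file    -- path to the stop-word file, UTF-8 encoded, one stop word per line. If it is not a str, the default stop words are used
 27 |         allow_speech_tags  -- list of part-of-speech tags used for filtering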
 28 |         """     
 29 |         
 30 |         allow_speech_tags = [util.as_text(item) for item in allow_speech_tags]
 31 | 
 32 |         self.default_speech_tag_filter = allow_speech_tags
 33 |         self.stop_words = set()
 34 |         self.stop_words_file = get_default_stop_words_file()
 35 |         if type(stop_words_file) is str:
 36 |             self.stop_words_file = stop_words_file
 37 |         for word in codecs.open(self.stop_words_file, 'r', 'utf-8', 'ignore'):
 38 |             self.stop_words.add(word.strip())
 39 |     
 40 |     def segment(self, text, lower = True, use_stop_words = True, use_speech_tags_filter = False):
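 41 |         """Segment a piece of text into words and return the result as a list.
 42 | 
 43 |         Keyword arguments:
 44 |         lower                  -- whether to lowercase the words (applies to English)
 45 |         use_stop_words         -- if True, filter out the words found in the stop-word set
 46 |         use_speech_tags_filter -- whether to filter by part of speech. If True, filter with self.default_speech_tag_filter; otherwise do not filter.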
 47 |         """
 48 |         text = util.as_text(text)
 49 |         jieba_result = pseg.cut(text)
 50 |         
 51 |         if use_speech_tags_filter == True:
 52 |             jieba_result = [w for w in jieba_result if w.flag in self.default_speech_tag_filter]
 53 |         else:
 54 |             jieba_result = [w for w in jieba_result]
 55 | 
 56 |         # Remove special symbols
 57 |         word_list = [w.word.strip() for w in jieba_result if w.flag!='x']
 58 |         word_list = [word for word in word_list if len(word)>0]
 59 |         
 60 |         if lower:
 61 |             word_list = [word.lower() for word in word_list]
 62 | 
 63 |         if use_stop_words:
 64 |             word_list = [word.strip() for word in word_list if word.strip() not in self.stop_words]
 65 | 
 66 |         return word_list
 67 |         
 68 |     def segment_sentences(self, sentences, lower=True, use_stop_words=True, use_speech_tags_filter=False):
 69 |         """Convert each element (sentence) of the list `sentences` into a list of words.
 70 |         
 71 |         sentences -- list in which each element is a sentence (a string)
 72 |         """
 73 |         
 74 |         res = []
 75 |         for sentence in sentences:
 76 |             res.append(self.segment(text=sentence, 
 77 |                                     lower=lower, 
 78 |                                     use_stop_words=use_stop_words, 
 79 |                                     use_speech_tags_filter=use_speech_tags_filter))
 80 |         return res
 81 |         
 82 | class SentenceSegmentation(object):
 83 |     """ Sentence segmentation """
 84 |     
 85 |     def __init__(self, delimiters=util.sentence_delimiters):
 86 |         """
 87 |         Keyword arguments:
 88 |         delimiters -- iterable of delimiters used to split the text into sentences
 89 |         """
 90 |         self.delimiters = set([util.as_text(item) for item in delimiters])
 91 |     
 92 |     def segment(self, text):
 93 |         res = [util.as_text(text)]
 94 |         
 95 |         util.debug(res)
 96 |         util.debug(self.delimiters)
 97 | 
 98 |         for sep in self.delimiters:
 99 |             text, res = res, []
100 |             for seq in text:
101 |                 res += seq.split(sep)
102 |         res = [s.strip() for s in res if len(s.strip()) > 0]
103 |         return res 
104 |         
105 | class Segmentation(object):
106 |     
107 |     def __init__(self, stop_words_file = None, 
108 |                     allow_speech_tags = util.allow_speech_tags,
109 |                     delimiters = util.sentence_delimiters):
110 |         """
111 |         Keyword arguments:
112 |         stop_words_file -- stop-word file
113 |         delimiters      -- set of symbols used to split the text into sentences
114 |         """
115 |         self.ws = WordSegmentation(stop_words_file=stop_words_file, allow_speech_tags=allow_speech_tags)
116 |         self.ss = SentenceSegmentation(delimiters=delimiters)
117 |         
118 |     def segment(self, text, lower = False):
119 |         text = util.as_text(text)
120 |         sentences = self.ss.segment(text)
121 |         words_no_filter = self.ws.segment_sentences(sentences=sentences, 
122 |                                                     lower = lower, 
123 |                                                     use_stop_words = False,
124 |                                                     use_speech_tags_filter = False)
125 |         words_no_stop_words = self.ws.segment_sentences(sentences=sentences, 
126 |                                                     lower = lower, 
127 |                                                     use_stop_words = True,
128 |                                                     use_speech_tags_filter = False)
129 | 
130 |         words_all_filters = self.ws.segment_sentences(sentences=sentences, 
131 |                                                     lower = lower, 
132 |                                                     use_stop_words = True,
133 |                                                     use_speech_tags_filter = True)
134 | 
135 |         return util.AttrDict(
136 |                     sentences           = sentences, 
137 |                     words_no_filter     = words_no_filter, 
138 |                     words_no_stop_words = words_no_stop_words, 
139 |                     words_all_filters   = words_all_filters
140 |                 )
141 |     
142 |         
143 | 
144 | if __name__ == '__main__':
145 |     pass


--------------------------------------------------------------------------------
/textrank4zh/TextRank4Keyword.py:
--------------------------------------------------------------------------------
  1 | #-*- encoding:utf-8 -*-
  2 | """
  3 | @author:   letian
  4 | @homepage: http://www.letiantian.me
  5 | @github:   https://github.com/someus/
  6 | """
  7 | from __future__ import (absolute_import, division, print_function,
  8 |                         unicode_literals)
  9 | 
 10 | import networkx as nx
 11 | import numpy as np
 12 | 
 13 | from . import util
 14 | from .Segmentation import Segmentation
 15 | 
 16 | class TextRank4Keyword(object):
 17 |     
 18 |     def __init__(self, stop_words_file = None, 
 19 |                  allow_speech_tags = util.allow_speech_tags, 
 20 |                  delimiters = util.sentence_delimiters):
 21 |         """
 22 |         Keyword arguments:
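 23 |         stop_words_file  --  str, path to a stop-word file (one stop word per line); for any other type, the default stop-word file is used.
 24 |         delimiters       --  defaults to `?!;？！。；…\n`; used to split the text into sentences.
 25 |         
 26 |         Object Var:
 27 |         self.words_no_filter      --  two-level list obtained by segmenting each sentence in sentences into words.
 28 |         self.words_no_stop_words  --  two-level list obtained by removing the stop words from words_no_filter.
 29 |         self.words_all_filters    --  two-level list obtained by keeping only the words in words_no_stop_words that have an allowed part-of-speech tag.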
 30 |         """
 31 |         self.text = ''
 32 |         self.keywords = None
 33 |         
 34 |         self.seg = Segmentation(stop_words_file=stop_words_file, 
 35 |                                 allow_speech_tags=allow_speech_tags, 
 36 |                                 delimiters=delimiters)
 37 | 
 38 |         self.sentences = None
 39 |         self.words_no_filter = None     # 2-D list
 40 |         self.words_no_stop_words = None
 41 |         self.words_all_filters = None
 42 |         
 43 |     def analyze(self, text, 
 44 |                 window = 2, 
 45 |                 lower = False,
 46 |                 vertex_source = 'all_filters',
 47 |                 edge_source = 'no_stop_words',
 48 |                 pagerank_config = {'alpha': 0.85,}):
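 49 |         """Analyze the text.
 50 | 
 51 |         Keyword arguments:
 52 |         text       --  the text content, a string.
 53 |         window     --  window size, int, used to build the edges between words. Defaults to 2.
 54 |         lower      --  whether to convert the text to lowercase. Defaults to False.
 55 |         vertex_source   --  which of words_no_filter, words_no_stop_words, words_all_filters is used to build the nodes of the PageRank graph.
 56 |                             Defaults to `'all_filters'`; valid values are `'no_filter', 'no_stop_words', 'all_filters'`. Keywords are also drawn from `vertex_source`.
 57 |         edge_source     --  which of words_no_filter, words_no_stop_words, words_all_filters is used to build the edges between nodes of the PageRank graph.
 58 |                             Defaults to `'no_stop_words'`; valid values are `'no_filter', 'no_stop_words', 'all_filters'`. Edges are built together with the `window` parameter.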
 59 |         """
 60 |         
 61 |         # self.text = util.as_text(text)
 62 |         self.text = text
 63 |         self.word_index = {}
 64 |         self.index_word = {}
 65 |         self.keywords = []
 66 |         self.graph = None
 67 |         
 68 |         result = self.seg.segment(text=text, lower=lower)
 69 |         self.sentences = result.sentences
 70 |         self.words_no_filter = result.words_no_filter
 71 |         self.words_no_stop_words = result.words_no_stop_words
 72 |         self.words_all_filters   = result.words_all_filters
 73 | 
 74 |         util.debug(20*'*')
 75 |         util.debug('self.sentences in TextRank4Keyword:\n', ' || '.join(self.sentences))
 76 |         util.debug('self.words_no_filter in TextRank4Keyword:\n', self.words_no_filter)
 77 |         util.debug('self.words_no_stop_words in TextRank4Keyword:\n', self.words_no_stop_words)
 78 |         util.debug('self.words_all_filters in TextRank4Keyword:\n', self.words_all_filters)
 79 | 
 80 | 
 81 |         options = ['no_filter', 'no_stop_words', 'all_filters']
 82 | 
 83 |         if vertex_source in options:
 84 |             _vertex_source = result['words_'+vertex_source]
 85 |         else:
 86 |             _vertex_source = result['words_all_filters']
 87 | 
 88 |         if edge_source in options:
 89 |             _edge_source   = result['words_'+edge_source]
 90 |         else:
 91 |             _edge_source   = result['words_no_stop_words']
 92 | 
 93 |         self.keywords = util.sort_words(_vertex_source, _edge_source, window = window, pagerank_config = pagerank_config)
 94 | 
 95 |     def get_keywords(self, num = 6, word_min_len = 1):
 96 |         """Return the num most important keywords whose length is at least word_min_len.
 97 | 
 98 |         Return:
 99 |         A list of keyword items, each with a `word` and a `weight`.
100 |         """
101 |         result = []
102 |         count = 0
103 |         for item in self.keywords:
104 |             if count >= num:
105 |                 break
106 |             if len(item.word) >= word_min_len:
107 |                 result.append(item)
108 |                 count += 1
109 |         return result
110 |     
111 |     def get_keyphrases(self, keywords_num = 12, min_occur_num = 2): 
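112 |         """Get key phrases.
113 |         Builds candidate phrases from runs of adjacent words that are among the top keywords_num keywords; a phrase is kept only if it occurs at least min_occur_num times in the original text.
114 | 
115 |         Return:
116 |         A list of key phrases.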
117 |         """
118 |         keywords_set = set([ item.word for item in self.get_keywords(num=keywords_num, word_min_len = 1)])
119 |         keyphrases = set()
120 |         for sentence in self.words_no_filter:
121 |             one = []
122 |             for word in sentence:
123 |                 if word in keywords_set:
124 |                     one.append(word)
125 |                 else:
126 |                     if len(one) >  1:
127 |                         keyphrases.add(''.join(one))
128 |                     if len(one) == 0:
129 |                         continue
130 |                     else:
131 |                         one = []
132 |             # Fallback: flush a run of keywords that reaches the end of the sentence
133 |             if len(one) >  1:
134 |                 keyphrases.add(''.join(one))
135 | 
136 |         return [phrase for phrase in keyphrases 
137 |                 if self.text.count(phrase) >= min_occur_num]
138 | 
139 | if __name__ == '__main__':
140 |     pass
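
The phrase-building loop in `get_keyphrases` can be illustrated on its own. The sketch below follows the same steps (collect runs of adjacent keywords, join each run of two or more, keep those that occur often enough in the text), but `merge_keyphrases` and its sample data are hypothetical and the segmentation is written by hand rather than produced by the package.

```python
def merge_keyphrases(sentences, keywords, text, min_occur_num=2):
    """Join runs of adjacent keywords into phrases and filter by frequency.

    sentences -- two-level list of words (one sub-list per sentence)
    keywords  -- iterable of keyword strings
    text      -- the original text, used to count phrase occurrences
    """
    keywords = set(keywords)
    phrases = set()
    for sentence in sentences:
        run = []
        for word in sentence:
            if word in keywords:
                run.append(word)
            else:
                # A non-keyword ends the current run.
                if len(run) > 1:
                    phrases.add(''.join(run))
                run = []
        # Flush a run that reaches the end of the sentence.
        if len(run) > 1:
            phrases.add(''.join(run))
    return sorted(p for p in phrases if text.count(p) >= min_occur_num)


# Hand-segmented sample input (assumption: not produced by a real tokenizer).
sample_text = "机器学习很有趣。我喜欢机器学习"
sample_sentences = [["机器", "学习", "很", "有趣"], ["我", "喜欢", "机器", "学习"]]
sample_keywords = ["机器", "学习", "有趣"]

phrases = merge_keyphrases(sample_sentences, sample_keywords, sample_text)
```

Words are joined without a separator, which suits Chinese text; for space-delimited languages the joined phrase would generally not be found by `text.count`.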


--------------------------------------------------------------------------------
/textrank4zh/TextRank4Sentence.py:
--------------------------------------------------------------------------------
 1 | #-*- encoding:utf-8 -*-
 2 | """
 3 | @author:   letian
 4 | @homepage: http://www.letiantian.me
 5 | @github:   https://github.com/someus/
 6 | """
 7 | from __future__ import (absolute_import, division, print_function,
 8 |                         unicode_literals)
 9 | 
10 | import networkx as nx
11 | import numpy as np
12 | 
13 | from . import util
14 | from .Segmentation import Segmentation
15 | 
16 | class TextRank4Sentence(object):
17 |     
18 |     def __init__(self, stop_words_file = None, 
19 |                  allow_speech_tags = util.allow_speech_tags,
20 |                  delimiters = util.sentence_delimiters):
21 |         """
22 |         Keyword arguments:
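23 |         stop_words_file  --  str, path to a stop-word file; if it is not a str, the default stop-word file is used.
24 |         delimiters       --  defaults to `?!;？！。；…\n`; used to split the text into sentences.
25 |         
26 |         Object Var:
27 |         self.sentences               --  list of sentences.
28 |         self.words_no_filter         --  two-level list obtained by segmenting each sentence in sentences into words.
29 |         self.words_no_stop_words     --  two-level list obtained by removing the stop words from words_no_filter.
30 |         self.words_all_filters       --  two-level list obtained by keeping only the words in words_no_stop_words that have an allowed part-of-speech tag.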
31 |         """
32 |         self.seg = Segmentation(stop_words_file=stop_words_file,
33 |                                 allow_speech_tags=allow_speech_tags,
34 |                                 delimiters=delimiters)
35 |         
36 |         self.sentences = None
37 |         self.words_no_filter = None     # 2-D list
38 |         self.words_no_stop_words = None
39 |         self.words_all_filters = None
40 |         
41 |         self.key_sentences = None
42 |         
43 |     def analyze(self, text, lower = False, 
44 |               source = 'no_stop_words', 
45 |               sim_func = util.get_similarity,
46 |               pagerank_config = {'alpha': 0.85,}):
47 |         """
48 |         Keyword arguments:
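49 |         text                 --  the text content, a string.
50 |         lower                --  whether to convert the text to lowercase. Defaults to False.
51 |         source               --  which of words_no_filter, words_no_stop_words, words_all_filters is used to compute the similarity between sentences.
52 |                                  Defaults to `'no_stop_words'`; valid values are `'no_filter', 'no_stop_words', 'all_filters'`.
53 |         sim_func             --  the function used to compute the similarity of two sentences.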
54 |         """
55 |         
56 |         self.key_sentences = []
57 |         
58 |         result = self.seg.segment(text=text, lower=lower)
59 |         self.sentences = result.sentences
60 |         self.words_no_filter = result.words_no_filter
61 |         self.words_no_stop_words = result.words_no_stop_words
62 |         self.words_all_filters   = result.words_all_filters
63 | 
64 |         options = ['no_filter', 'no_stop_words', 'all_filters']
65 |         if source in options:
66 |             _source = result['words_'+source]
67 |         else:
68 |             _source = result['words_no_stop_words']
69 | 
70 |         self.key_sentences = util.sort_sentences(sentences = self.sentences,
71 |                                                  words     = _source,
72 |                                                  sim_func  = sim_func,
73 |                                                  pagerank_config = pagerank_config)
74 | 
75 |             
76 |     def get_key_sentences(self, num = 6, sentence_min_len = 6):
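77 |         """Return the num most important sentences whose length is at least sentence_min_len, for building a summary.
78 | 
79 |         Return:
80 |         A list of sentence items, each with `index`, `sentence` and `weight`.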
81 |         """
82 |         result = []
83 |         count = 0
84 |         for item in self.key_sentences:
85 |             if count >= num:
86 |                 break
87 |             if len(item['sentence']) >= sentence_min_len:
88 |                 result.append(item)
89 |                 count += 1
90 |         return result
91 |     
92 | 
93 | if __name__ == '__main__':
94 |     pass
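
`TextRank4Sentence.analyze` ranks sentences by running PageRank over a sentence-similarity graph. The package delegates that step to `nx.pagerank`; the sketch below is a plain power-iteration stand-in written only to show the idea. The function `pagerank`, its `max_iter` and `tol` parameters, and the hand-written similarity matrix are assumptions, and its scores are not guaranteed to match NetworkX digit for digit.

```python
def pagerank(weights, alpha=0.85, max_iter=100, tol=1e-9):
    """Power-iteration PageRank over a square matrix of non-negative weights."""
    n = len(weights)
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        # Score held by nodes without outgoing weight is spread uniformly.
        dangling = sum(scores[i] for i in range(n) if out_sum[i] == 0)
        new_scores = []
        for j in range(n):
            incoming = sum(scores[i] * weights[i][j] / out_sum[i]
                           for i in range(n) if out_sum[i] > 0)
            new_scores.append((1 - alpha) / n + alpha * (incoming + dangling / n))
        delta = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:
            break
    return scores


# Hypothetical similarities: sentence 0 overlaps with sentences 1 and 2,
# while sentences 1 and 2 share nothing with each other.
similarity = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
scores = pagerank(similarity)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

The sentence that is similar to the most other sentences ends up first in `ranking`, which is the property `get_key_sentences` relies on when it takes the top items.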


--------------------------------------------------------------------------------
/textrank4zh/__init__.py:
--------------------------------------------------------------------------------
1 | #-*- encoding:utf-8 -*-
2 | from __future__ import absolute_import
3 | from .TextRank4Keyword import TextRank4Keyword
4 | from .TextRank4Sentence import TextRank4Sentence
5 | from . import Segmentation
6 | from . import util
7 | 
8 | version = '0.2'


--------------------------------------------------------------------------------
/textrank4zh/stopwords.txt:
--------------------------------------------------------------------------------
   1 | ?
   2 | 、
   3 | 。
   4 | “
   5 | ”
   6 | 《
   7 | 》
   8 | ！
   9 | ，
  10 | ：
  11 | ；
  12 | ？
  13 | 啊
  14 | 阿
  15 | 哎
  16 | 哎呀
  17 | 哎哟
  18 | 唉
  19 | 俺
  20 | 俺们
  21 | 按
  22 | 按照
  23 | 吧
  24 | 吧哒
  25 | 把
  26 | 罢了
  27 | 被
  28 | 本
  29 | 本着
  30 | 比
  31 | 比方
  32 | 比如
  33 | 鄙人
  34 | 彼
  35 | 彼此
  36 | 边
  37 | 别
  38 | 别的
  39 | 别说
  40 | 并
  41 | 并且
  42 | 不比
  43 | 不成
  44 | 不单
  45 | 不但
  46 | 不独
  47 | 不管
  48 | 不光
  49 | 不过
  50 | 不仅
  51 | 不拘
  52 | 不论
  53 | 不怕
  54 | 不然
  55 | 不如
  56 | 不特
  57 | 不惟
  58 | 不问
  59 | 不只
  60 | 朝
  61 | 朝着
  62 | 趁
  63 | 趁着
  64 | 乘
  65 | 冲
  66 | 除
  67 | 除此之外
  68 | 除非
  69 | 除了
  70 | 此
  71 | 此间
  72 | 此外
  73 | 从
  74 | 从而
  75 | 打
  76 | 待
  77 | 但
  78 | 但是
  79 | 当
  80 | 当着
  81 | 到
  82 | 得
  83 | 的
  84 | 的话
  85 | 等
  86 | 等等
  87 | 地
  88 | 第
  89 | 叮咚
  90 | 对
  91 | 对于
  92 | 多
  93 | 多少
  94 | 而
  95 | 而况
  96 | 而且
  97 | 而是
  98 | 而外
  99 | 而言
 100 | 而已
 101 | 尔后
 102 | 反过来
 103 | 反过来说
 104 | 反之
 105 | 非但
 106 | 非徒
 107 | 否则
 108 | 嘎
 109 | 嘎登
 110 | 该
 111 | 赶
 112 | 个
 113 | 各
 114 | 各个
 115 | 各位
 116 | 各种
 117 | 各自
 118 | 给
 119 | 根据
 120 | 跟
 121 | 故
 122 | 故此
 123 | 固然
 124 | 关于
 125 | 管
 126 | 归
 127 | 果然
 128 | 果真
 129 | 过
 130 | 哈
 131 | 哈哈
 132 | 呵
 133 | 和
 134 | 何
 135 | 何处
 136 | 何况
 137 | 何时
 138 | 嘿
 139 | 哼
 140 | 哼唷
 141 | 呼哧
 142 | 乎
 143 | 哗
 144 | 还是
 145 | 还有
 146 | 换句话说
 147 | 换言之
 148 | 或
 149 | 或是
 150 | 或者
 151 | 极了
 152 | 及
 153 | 及其
 154 | 及至
 155 | 即
 156 | 即便
 157 | 即或
 158 | 即令
 159 | 即若
 160 | 即使
 161 | 几
 162 | 几时
 163 | 己
 164 | 既
 165 | 既然
 166 | 既是
 167 | 继而
 168 | 加之
 169 | 假如
 170 | 假若
 171 | 假使
 172 | 鉴于
 173 | 将
 174 | 较
 175 | 较之
 176 | 叫
 177 | 接着
 178 | 结果
 179 | 借
 180 | 紧接着
 181 | 进而
 182 | 尽
 183 | 尽管
 184 | 经
 185 | 经过
 186 | 就
 187 | 就是
 188 | 就是说
 189 | 据
 190 | 具体地说
 191 | 具体说来
 192 | 开始
 193 | 开外
 194 | 靠
 195 | 咳
 196 | 可
 197 | 可见
 198 | 可是
 199 | 可以
 200 | 况且
 201 | 啦
 202 | 来
 203 | 来着
 204 | 离
 205 | 例如
 206 | 哩
 207 | 连
 208 | 连同
 209 | 两者
 210 | 了
 211 | 临
 212 | 另
 213 | 另外
 214 | 另一方面
 215 | 论
 216 | 嘛
 217 | 吗
 218 | 慢说
 219 | 漫说
 220 | 冒
 221 | 么
 222 | 每
 223 | 每当
 224 | 们
 225 | 莫若
 226 | 某
 227 | 某个
 228 | 某些
 229 | 拿
 230 | 哪
 231 | 哪边
 232 | 哪儿
 233 | 哪个
 234 | 哪里
 235 | 哪年
 236 | 哪怕
 237 | 哪天
 238 | 哪些
 239 | 哪样
 240 | 那
 241 | 那边
 242 | 那儿
 243 | 那个
 244 | 那会儿
 245 | 那里
 246 | 那么
 247 | 那么些
 248 | 那么样
 249 | 那时
 250 | 那些
 251 | 那样
 252 | 乃
 253 | 乃至
 254 | 呢
 255 | 能
 256 | 你
 257 | 你们
 258 | 您
 259 | 宁
 260 | 宁可
 261 | 宁肯
 262 | 宁愿
 263 | 哦
 264 | 呕
 265 | 啪达
 266 | 旁人
 267 | 呸
 268 | 凭
 269 | 凭借
 270 | 其
 271 | 其次
 272 | 其二
 273 | 其他
 274 | 其它
 275 | 其一
 276 | 其余
 277 | 其中
 278 | 起
 279 | 起见
 280 | 起见
 281 | 岂但
 282 | 恰恰相反
 283 | 前后
 284 | 前者
 285 | 且
 286 | 然而
 287 | 然后
 288 | 然则
 289 | 让
 290 | 人家
 291 | 任
 292 | 任何
 293 | 任凭
 294 | 如
 295 | 如此
 296 | 如果
 297 | 如何
 298 | 如其
 299 | 如若
 300 | 如上所述
 301 | 若
 302 | 若非
 303 | 若是
 304 | 啥
 305 | 上下
 306 | 尚且
 307 | 设若
 308 | 设使
 309 | 甚而
 310 | 甚么
 311 | 甚至
 312 | 省得
 313 | 时候
 314 | 什么
 315 | 什么样
 316 | 使得
 317 | 是
 318 | 是的
 319 | 首先
 320 | 谁
 321 | 谁知
 322 | 顺
 323 | 顺着
 324 | 似的
 325 | 虽
 326 | 虽然
 327 | 虽说
 328 | 虽则
 329 | 随
 330 | 随着
 331 | 所
 332 | 所以
 333 | 他
 334 | 他们
 335 | 他人
 336 | 它
 337 | 它们
 338 | 她
 339 | 她们
 340 | 倘
 341 | 倘或
 342 | 倘然
 343 | 倘若
 344 | 倘使
 345 | 腾
 346 | 替
 347 | 通过
 348 | 同
 349 | 同时
 350 | 哇
 351 | 万一
 352 | 往
 353 | 望
 354 | 为
 355 | 为何
 356 | 为了
 357 | 为什么
 358 | 为着
 359 | 喂
 360 | 嗡嗡
 361 | 我
 362 | 我们
 363 | 呜
 364 | 呜呼
 365 | 乌乎
 366 | 无论
 367 | 无宁
 368 | 毋宁
 369 | 嘻
 370 | 吓
 371 | 相对而言
 372 | 像
 373 | 向
 374 | 向着
 375 | 嘘
 376 | 呀
 377 | 焉
 378 | 沿
 379 | 沿着
 380 | 要
 381 | 要不
 382 | 要不然
 383 | 要不是
 384 | 要么
 385 | 要是
 386 | 也
 387 | 也罢
 388 | 也好
 389 | 一
 390 | 一般
 391 | 一旦
 392 | 一方面
 393 | 一来
 394 | 一切
 395 | 一样
 396 | 一则
 397 | 依
 398 | 依照
 399 | 矣
 400 | 以
 401 | 以便
 402 | 以及
 403 | 以免
 404 | 以至
 405 | 以至于
 406 | 以致
 407 | 抑或
 408 | 因
 409 | 因此
 410 | 因而
 411 | 因为
 412 | 哟
 413 | 用
 414 | 由
 415 | 由此可见
 416 | 由于
 417 | 有
 418 | 有的
 419 | 有关
 420 | 有些
 421 | 又
 422 | 于
 423 | 于是
 424 | 于是乎
 425 | 与
 426 | 与此同时
 427 | 与否
 428 | 与其
 429 | 越是
 430 | 云云
 431 | 哉
 432 | 再说
 433 | 再者
 434 | 在
 435 | 在下
 436 | 咱
 437 | 咱们
 438 | 则
 439 | 怎
 440 | 怎么
 441 | 怎么办
 442 | 怎么样
 443 | 怎样
 444 | 咋
 445 | 照
 446 | 照着
 447 | 者
 448 | 这
 449 | 这边
 450 | 这儿
 451 | 这个
 452 | 这会儿
 453 | 这就是说
 454 | 这里
 455 | 这么
 456 | 这么点儿
 457 | 这么些
 458 | 这么样
 459 | 这时
 460 | 这些
 461 | 这样
 462 | 正如
 463 | 吱
 464 | 之
 465 | 之类
 466 | 之所以
 467 | 之一
 468 | 只是
 469 | 只限
 470 | 只要
 471 | 只有
 472 | 至
 473 | 至于
 474 | 诸位
 475 | 着
 476 | 着呢
 477 | 自
 478 | 自从
 479 | 自个儿
 480 | 自各儿
 481 | 自己
 482 | 自家
 483 | 自身
 484 | 综上所述
 485 | 总的来看
 486 | 总的来说
 487 | 总的说来
 488 | 总而言之
 489 | 总之
 490 | 纵
 491 | 纵令
 492 | 纵然
 493 | 纵使
 494 | 遵照
 495 | 作为
 496 | 兮
 497 | 呃
 498 | 呗
 499 | 咚
 500 | 咦
 501 | 喏
 502 | 啐
 503 | 喔唷
 504 | 嗬
 505 | 嗯
 506 | 嗳
 507 | a
 508 | able
 509 | about
 510 | above
 511 | abroad
 512 | according
 513 | accordingly
 514 | across
 515 | actually
 516 | adj
 517 | after
 518 | afterwards
 519 | again
 520 | against
 521 | ago
 522 | ahead
 523 | ain't
 524 | all
 525 | allow
 526 | allows
 527 | almost
 528 | alone
 529 | along
 530 | alongside
 531 | already
 532 | also
 533 | although
 534 | always
 535 | am
 536 | amid
 537 | amidst
 538 | among
 539 | amongst
 540 | an
 541 | and
 542 | another
 543 | any
 544 | anybody
 545 | anyhow
 546 | anyone
 547 | anything
 548 | anyway
 549 | anyways
 550 | anywhere
 551 | apart
 552 | appear
 553 | appreciate
 554 | appropriate
 555 | are
 556 | aren't
 557 | around
 558 | as
 559 | a's
 560 | aside
 561 | ask
 562 | asking
 563 | associated
 564 | at
 565 | available
 566 | away
 567 | awfully
 568 | b
 569 | back
 570 | backward
 571 | backwards
 572 | be
 573 | became
 574 | because
 575 | become
 576 | becomes
 577 | becoming
 578 | been
 579 | before
 580 | beforehand
 581 | begin
 582 | behind
 583 | being
 584 | believe
 585 | below
 586 | beside
 587 | besides
 588 | best
 589 | better
 590 | between
 591 | beyond
 592 | both
 593 | brief
 594 | but
 595 | by
 596 | c
 597 | came
 598 | can
 599 | cannot
 600 | cant
 601 | can't
 602 | caption
 603 | cause
 604 | causes
 605 | certain
 606 | certainly
 607 | changes
 608 | clearly
 609 | c'mon
 610 | co
 611 | co.
 612 | com
 613 | come
 614 | comes
 615 | concerning
 616 | consequently
 617 | consider
 618 | considering
 619 | contain
 620 | containing
 621 | contains
 622 | corresponding
 623 | could
 624 | couldn't
 625 | course
 626 | c's
 627 | currently
 628 | d
 629 | dare
 630 | daren't
 631 | definitely
 632 | described
 633 | despite
 634 | did
 635 | didn't
 636 | different
 637 | directly
 638 | do
 639 | does
 640 | doesn't
 641 | doing
 642 | done
 643 | don't
 644 | down
 645 | downwards
 646 | during
 647 | e
 648 | each
 649 | edu
 650 | eg
 651 | eight
 652 | eighty
 653 | either
 654 | else
 655 | elsewhere
 656 | end
 657 | ending
 658 | enough
 659 | entirely
 660 | especially
 661 | et
 662 | etc
 663 | even
 664 | ever
 665 | evermore
 666 | every
 667 | everybody
 668 | everyone
 669 | everything
 670 | everywhere
 671 | ex
 672 | exactly
 673 | example
 674 | except
 675 | f
 676 | fairly
 677 | far
 678 | farther
 679 | few
 680 | fewer
 681 | fifth
 682 | first
 683 | five
 684 | followed
 685 | following
 686 | follows
 687 | for
 688 | forever
 689 | former
 690 | formerly
 691 | forth
 692 | forward
 693 | found
 694 | four
 695 | from
 696 | further
 697 | furthermore
 698 | g
 699 | get
 700 | gets
 701 | getting
 702 | given
 703 | gives
 704 | go
 705 | goes
 706 | going
 707 | gone
 708 | got
 709 | gotten
 710 | greetings
 711 | h
 712 | had
 713 | hadn't
 714 | half
 715 | happens
 716 | hardly
 717 | has
 718 | hasn't
 719 | have
 720 | haven't
 721 | having
 722 | he
 723 | he'd
 724 | he'll
 725 | hello
 726 | help
 727 | hence
 728 | her
 729 | here
 730 | hereafter
 731 | hereby
 732 | herein
 733 | here's
 734 | hereupon
 735 | hers
 736 | herself
 737 | he's
 738 | hi
 739 | him
 740 | himself
 741 | his
 742 | hither
 743 | hopefully
 744 | how
 745 | howbeit
 746 | however
 747 | hundred
 748 | i
 749 | i'd
 750 | ie
 751 | if
 752 | ignored
 753 | i'll
 754 | i'm
 755 | immediate
 756 | in
 757 | inasmuch
 758 | inc
 759 | inc.
 760 | indeed
 761 | indicate
 762 | indicated
 763 | indicates
 764 | inner
 765 | inside
 766 | insofar
 767 | instead
 768 | into
 769 | inward
 770 | is
 771 | isn't
 772 | it
 773 | it'd
 774 | it'll
 775 | its
 776 | it's
 777 | itself
 778 | i've
 779 | j
 780 | just
 781 | k
 782 | keep
 783 | keeps
 784 | kept
 785 | know
 786 | known
 787 | knows
 788 | l
 789 | last
 790 | lately
 791 | later
 792 | latter
 793 | latterly
 794 | least
 795 | less
 796 | lest
 797 | let
 798 | let's
 799 | like
 800 | liked
 801 | likely
 802 | likewise
 803 | little
 804 | look
 805 | looking
 806 | looks
 807 | low
 808 | lower
 809 | ltd
 810 | m
 811 | made
 812 | mainly
 813 | make
 814 | makes
 815 | many
 816 | may
 817 | maybe
 818 | mayn't
 819 | me
 820 | mean
 821 | meantime
 822 | meanwhile
 823 | merely
 824 | might
 825 | mightn't
 826 | mine
 827 | minus
 828 | miss
 829 | more
 830 | moreover
 831 | most
 832 | mostly
 833 | mr
 834 | mrs
 835 | much
 836 | must
 837 | mustn't
 838 | my
 839 | myself
 840 | n
 841 | name
 842 | namely
 843 | nd
 844 | near
 845 | nearly
 846 | necessary
 847 | need
 848 | needn't
 849 | needs
 850 | neither
 851 | never
 852 | neverf
 853 | neverless
 854 | nevertheless
 855 | new
 856 | next
 857 | nine
 858 | ninety
 859 | no
 860 | nobody
 861 | non
 862 | none
 863 | nonetheless
 864 | noone
 865 | no-one
 866 | nor
 867 | normally
 868 | not
 869 | nothing
 870 | notwithstanding
 871 | novel
 872 | now
 873 | nowhere
 874 | o
 875 | obviously
 876 | of
 877 | off
 878 | often
 879 | oh
 880 | ok
 881 | okay
 882 | old
 883 | on
 884 | once
 885 | one
 886 | ones
 887 | one's
 888 | only
 889 | onto
 890 | opposite
 891 | or
 892 | other
 893 | others
 894 | otherwise
 895 | ought
 896 | oughtn't
 897 | our
 898 | ours
 899 | ourselves
 900 | out
 901 | outside
 902 | over
 903 | overall
 904 | own
 905 | p
 906 | particular
 907 | particularly
 908 | past
 909 | per
 910 | perhaps
 911 | placed
 912 | please
 913 | plus
 914 | possible
 915 | presumably
 916 | probably
 917 | provided
 918 | provides
 919 | q
 920 | que
 921 | quite
 922 | qv
 923 | r
 924 | rather
 925 | rd
 926 | re
 927 | really
 928 | reasonably
 929 | recent
 930 | recently
 931 | regarding
 932 | regardless
 933 | regards
 934 | relatively
 935 | respectively
 936 | right
 937 | round
 938 | s
 939 | said
 940 | same
 941 | saw
 942 | say
 943 | saying
 944 | says
 945 | second
 946 | secondly
 947 | see
 948 | seeing
 949 | seem
 950 | seemed
 951 | seeming
 952 | seems
 953 | seen
 954 | self
 955 | selves
 956 | sensible
 957 | sent
 958 | serious
 959 | seriously
 960 | seven
 961 | several
 962 | shall
 963 | shan't
 964 | she
 965 | she'd
 966 | she'll
 967 | she's
 968 | should
 969 | shouldn't
 970 | since
 971 | six
 972 | so
 973 | some
 974 | somebody
 975 | someday
 976 | somehow
 977 | someone
 978 | something
 979 | sometime
 980 | sometimes
 981 | somewhat
 982 | somewhere
 983 | soon
 984 | sorry
 985 | specified
 986 | specify
 987 | specifying
 988 | still
 989 | sub
 990 | such
 991 | sup
 992 | sure
 993 | t
 994 | take
 995 | taken
 996 | taking
 997 | tell
 998 | tends
 999 | th
1000 | than
1001 | thank
1002 | thanks
1003 | thanx
1004 | that
1005 | that'll
1006 | thats
1007 | that's
1008 | that've
1009 | the
1010 | their
1011 | theirs
1012 | them
1013 | themselves
1014 | then
1015 | thence
1016 | there
1017 | thereafter
1018 | thereby
1019 | there'd
1020 | therefore
1021 | therein
1022 | there'll
1023 | there're
1024 | theres
1025 | there's
1026 | thereupon
1027 | there've
1028 | these
1029 | they
1030 | they'd
1031 | they'll
1032 | they're
1033 | they've
1034 | thing
1035 | things
1036 | think
1037 | third
1038 | thirty
1039 | this
1040 | thorough
1041 | thoroughly
1042 | those
1043 | though
1044 | three
1045 | through
1046 | throughout
1047 | thru
1048 | thus
1049 | till
1050 | to
1051 | together
1052 | too
1053 | took
1054 | toward
1055 | towards
1056 | tried
1057 | tries
1058 | truly
1059 | try
1060 | trying
1061 | t's
1062 | twice
1063 | two
1064 | u
1065 | un
1066 | under
1067 | underneath
1068 | undoing
1069 | unfortunately
1070 | unless
1071 | unlike
1072 | unlikely
1073 | until
1074 | unto
1075 | up
1076 | upon
1077 | upwards
1078 | us
1079 | use
1080 | used
1081 | useful
1082 | uses
1083 | using
1084 | usually
1085 | v
1086 | value
1087 | various
1088 | versus
1089 | very
1090 | via
1091 | viz
1092 | vs
1093 | w
1094 | want
1095 | wants
1096 | was
1097 | wasn't
1098 | way
1099 | we
1100 | we'd
1101 | welcome
1102 | well
1103 | we'll
1104 | went
1105 | were
1106 | we're
1107 | weren't
1108 | we've
1109 | what
1110 | whatever
1111 | what'll
1112 | what's
1113 | what've
1114 | when
1115 | whence
1116 | whenever
1117 | where
1118 | whereafter
1119 | whereas
1120 | whereby
1121 | wherein
1122 | where's
1123 | whereupon
1124 | wherever
1125 | whether
1126 | which
1127 | whichever
1128 | while
1129 | whilst
1130 | whither
1131 | who
1132 | who'd
1133 | whoever
1134 | whole
1135 | who'll
1136 | whom
1137 | whomever
1138 | who's
1139 | whose
1140 | why
1141 | will
1142 | willing
1143 | wish
1144 | with
1145 | within
1146 | without
1147 | wonder
1148 | won't
1149 | would
1150 | wouldn't
1151 | x
1152 | y
1153 | yes
1154 | yet
1155 | you
1156 | you'd
1157 | you'll
1158 | your
1159 | you're
1160 | yours
1161 | yourself
1162 | yourselves
1163 | you've
1164 | z
1165 | zero


--------------------------------------------------------------------------------
/textrank4zh/util.py:
--------------------------------------------------------------------------------
  1 | #-*- encoding:utf-8 -*-
  2 | """
  3 | @author:   letian
  4 | @homepage: http://www.letiantian.me
  5 | @github:   https://github.com/someus/
  6 | """
  7 | from __future__ import (absolute_import, division, print_function,
  8 |                         unicode_literals)
  9 | 
 10 | import os
 11 | import math
 12 | import networkx as nx
 13 | import numpy as np
 14 | import sys
 15 | 
 16 | try:
 17 |     reload(sys)
 18 |     sys.setdefaultencoding('utf-8')
 19 | except Exception:
 20 |     pass
 21 |     
 22 | sentence_delimiters = ['?', '!', ';', '？', '！', '。', '；', '……', '…', '\n']
 23 | allow_speech_tags = ['an', 'i', 'j', 'l', 'n', 'nr', 'nrfg', 'ns', 'nt', 'nz', 't', 'v', 'vd', 'vn', 'eng']
 24 | 
 25 | PY2 = sys.version_info[0] == 2
 26 | if not PY2:
 27 |     # Python 3.x and up
 28 |     text_type    = str
 29 |     string_types = (str,)
 30 |     xrange       = range
 31 | 
 32 |     def as_text(v):  # return a unicode (text) string
 33 |         if v is None:
 34 |             return None
 35 |         elif isinstance(v, bytes):
 36 |             return v.decode('utf-8', errors='ignore')
 37 |         elif isinstance(v, str):
 38 |             return v
 39 |         else:
 40 |             raise ValueError('Unknown type %r' % type(v))
 41 | 
 42 |     def is_text(v):
 43 |         return isinstance(v, text_type)
 44 | 
 45 | else:
 46 |     # Python 2.x
 47 |     text_type    = unicode
 48 |     string_types = (str, unicode)
 49 |     xrange       = xrange
 50 | 
 51 |     def as_text(v):
 52 |         if v is None:
 53 |             return None
 54 |         elif isinstance(v, unicode):
 55 |             return v
 56 |         elif isinstance(v, str):
 57 |             return v.decode('utf-8', errors='ignore')
 58 |         else:
 59 |             raise ValueError('Invalid type %r' % type(v))
 60 | 
 61 |     def is_text(v):
 62 |         return isinstance(v, text_type)
 63 | 
 64 | __DEBUG = None
 65 | 
 66 | def debug(*args):
 67 |     global __DEBUG
 68 |     if __DEBUG is None:
 69 |         try:
 70 |             if os.environ['DEBUG'] == '1':
 71 |                 __DEBUG = True
 72 |             else:
 73 |                 __DEBUG = False
 74 |         except KeyError:
 75 |             __DEBUG = False
 76 |     if __DEBUG:
 77 |         print( ' '.join([str(arg) for arg in args]) )
 78 | 
 79 | class AttrDict(dict):
 80 |     """Dict that can get attribute by dot"""
 81 |     def __init__(self, *args, **kwargs):
 82 |         super(AttrDict, self).__init__(*args, **kwargs)
 83 |         self.__dict__ = self
 84 | 
 85 | 
 86 | def combine(word_list, window = 2):
 87 |     """Build the word pairs that fall within the given window; used to build edges between words.
 88 |     
 89 |     Keyword arguments:
 90 |     word_list  --  list of str, a list of words.
 91 |     window     --  int, window size.
 92 |     """
 93 |     if window < 2: window = 2
 94 |     for x in xrange(1, window):
 95 |         if x >= len(word_list):
 96 |             break
 97 |         word_list2 = word_list[x:]
 98 |         res = zip(word_list, word_list2)
 99 |         for r in res:
100 |             yield r
101 | 
102 | def get_similarity(word_list1, word_list2):
103 |     """Default function for computing the similarity of two sentences.
104 | 
105 |     Keyword arguments:
106 |     word_list1, word_list2  --  the two sentences, each given as a list of words
107 |     """
108 |     words   = list(set(word_list1 + word_list2))        
109 |     vector1 = [float(word_list1.count(word)) for word in words]
110 |     vector2 = [float(word_list2.count(word)) for word in words]
111 |     
112 |     vector3 = [vector1[x]*vector2[x]  for x in xrange(len(vector1))]
113 |     vector4 = [1 for num in vector3 if num > 0.]
114 |     co_occur_num = sum(vector4)
115 | 
116 |     if abs(co_occur_num) <= 1e-12:
117 |         return 0.
118 |     
119 |     denominator = math.log(float(len(word_list1))) + math.log(float(len(word_list2))) # denominator
120 |     
121 |     if abs(denominator) < 1e-12:
122 |         return 0.
123 |     
124 |     return co_occur_num / denominator
125 | 
126 | def sort_words(vertex_source, edge_source, window = 2, pagerank_config = {'alpha': 0.85,}):
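127 |     """Sort words by importance, from most to least important.
128 | 
129 |     Keyword arguments:
130 |     vertex_source   --  2-D list; each sub-list is a sentence whose elements are words; these words become the PageRank nodes
131 |     edge_source     --  2-D list; each sub-list is a sentence whose elements are words; PageRank edges are built from the positions of the words
132 |     window          --  any two words among `window` adjacent words in a sentence are considered connected by an edge
133 |     pagerank_config --  PageRank settings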
134 |     """
135 |     sorted_words   = []
136 |     word_index     = {}
137 |     index_word     = {}
138 |     _vertex_source = vertex_source
139 |     _edge_source   = edge_source
140 |     words_number   = 0
141 |     for word_list in _vertex_source:
142 |         for word in word_list:
143 |             if not word in word_index:
144 |                 word_index[word] = words_number
145 |                 index_word[words_number] = word
146 |                 words_number += 1
147 | 
148 |     graph = np.zeros((words_number, words_number))
149 |     
150 |     for word_list in _edge_source:
151 |         for w1, w2 in combine(word_list, window):
152 |             if w1 in word_index and w2 in word_index:
153 |                 index1 = word_index[w1]
154 |                 index2 = word_index[w2]
155 |                 graph[index1][index2] = 1.0
156 |                 graph[index2][index1] = 1.0
157 | 
158 |     debug('graph:\n', graph)
159 |     
160 |     nx_graph = nx.from_numpy_matrix(graph)
161 |     scores = nx.pagerank(nx_graph, **pagerank_config)          # this is a dict
162 |     sorted_scores = sorted(scores.items(), key = lambda item: item[1], reverse=True)
163 |     for index, score in sorted_scores:
164 |         item = AttrDict(word=index_word[index], weight=score)
165 |         sorted_words.append(item)
166 | 
167 |     return sorted_words
168 | 
169 | def sort_sentences(sentences, words, sim_func = get_similarity, pagerank_config = {'alpha': 0.85,}):
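170 |     """Sort sentences by importance, from most to least important.
171 | 
172 |     Keyword arguments:
173 |     sentences         --  list whose elements are sentences
174 |     words             --  2-D list; each sub-list corresponds to a sentence in sentences and consists of words
175 |     sim_func          --  computes the similarity of two sentences; its arguments are two lists of words
176 |     pagerank_config   --  PageRank settings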
177 |     """
178 |     sorted_sentences = []
179 |     _source = words
180 |     sentences_num = len(_source)        
181 |     graph = np.zeros((sentences_num, sentences_num))
182 |     
183 |     for x in xrange(sentences_num):
184 |         for y in xrange(x, sentences_num):
185 |             similarity = sim_func( _source[x], _source[y] )
186 |             graph[x, y] = similarity
187 |             graph[y, x] = similarity
188 |             
189 |     nx_graph = nx.from_numpy_matrix(graph)
190 |     scores = nx.pagerank(nx_graph, **pagerank_config)              # this is a dict
191 |     sorted_scores = sorted(scores.items(), key = lambda item: item[1], reverse=True)
192 | 
193 |     for index, score in sorted_scores:
194 |         item = AttrDict(index=index, sentence=sentences[index], weight=score)
195 |         sorted_sentences.append(item)
196 | 
197 |     return sorted_sentences
198 | 
199 | if __name__ == '__main__':
200 |     pass
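
Two helpers in `util.py` carry most of the graph construction: `combine` yields the word pairs that become edges, and `get_similarity` scores two sentences as the number of distinct shared words divided by the sum of the logarithms of their lengths. The sketch below re-implements both in plain Python to make the behaviour easy to check by hand. The names `window_pairs` and `overlap_similarity` are hypothetical, and the code is an illustration rather than a drop-in replacement.

```python
import math


def window_pairs(words, window=2):
    """Return the word pairs whose distance is less than `window`."""
    window = max(window, 2)          # a window below 2 is treated as 2
    pairs = []
    for offset in range(1, window):
        if offset >= len(words):
            break
        # Pair each word with the word `offset` positions after it.
        pairs.extend(zip(words, words[offset:]))
    return pairs


def overlap_similarity(words1, words2):
    """Distinct shared words divided by log(len(words1)) + log(len(words2))."""
    shared = len(set(words1) & set(words2))
    if shared == 0:
        return 0.0
    denominator = math.log(len(words1)) + math.log(len(words2))
    if abs(denominator) < 1e-12:     # both sentences have exactly one word
        return 0.0
    return shared / denominator


pairs_w2 = window_pairs(["a", "b", "c", "d"], window=2)
pairs_w3 = window_pairs(["a", "b", "c", "d"], window=3)
sim = overlap_similarity(["a", "b", "c"], ["b", "c", "d"])
```

With `window=2` only neighbouring words are paired; raising it to 3 also pairs words that are two positions apart. The similarity of the two sample sentences is 2 / (ln 3 + ln 3), and two identical one-word sentences score 0 because the denominator vanishes.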


--------------------------------------------------------------------------------