├── img
│   ├── 课程.png
│   ├── dadeng.png
│   ├── wordcloud.png
│   ├── wordshiftor.png
│   ├── 大邓和他的Python.png
│   └── 词云图测试.html
├── cntext
│   ├── dictionary
│   │   ├── __init__.py
│   │   └── README.md
│   ├── stats
│   │   ├── __init__.py
│   │   └── stats.py
│   ├── similarity
│   │   ├── __init__.py
│   │   ├── README.md
│   │   └── similarity.py
│   ├── visualization
│   │   ├── __init__.py
│   │   └── visualization.py
│   ├── sentiment
│   │   ├── __init__.py
│   │   └── sentiment.py
│   └── __init__.py
├── examples
│   ├── data
│   │   └── w2v_seeds
│   │       ├── respect.txt
│   │       ├── teamwork.txt
│   │       ├── quality.txt
│   │       ├── innovation.txt
│   │       └── integrity.txt
│   ├── output
│   │   └── w2v_candi_words
│   │       ├── teamwork.txt
│   │       └── integrity.txt
│   ├── 01-cntext.ipynb
│   ├── 02-统计信息.ipynb
│   ├── 03-扩充词典.ipynb
│   ├── 04-情感计算.ipynb
│   ├── 05-文本相似.ipynb
│   └── 06-可视化.ipynb
├── setup.py
└── README.md
/img/课程.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhw3051/cntext/HEAD/img/课程.png
--------------------------------------------------------------------------------
/cntext/dictionary/__init__.py:
--------------------------------------------------------------------------------
1 | from cntext.dictionary.dictionary import *
2 |
--------------------------------------------------------------------------------
/cntext/stats/__init__.py:
--------------------------------------------------------------------------------
1 | from cntext.stats.stats import term_freq, readability
--------------------------------------------------------------------------------
/img/dadeng.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhw3051/cntext/HEAD/img/dadeng.png
--------------------------------------------------------------------------------
/cntext/similarity/__init__.py:
--------------------------------------------------------------------------------
1 | from cntext.similarity.similarity import similarity_score
--------------------------------------------------------------------------------
/img/wordcloud.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhw3051/cntext/HEAD/img/wordcloud.png
--------------------------------------------------------------------------------
/img/wordshiftor.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhw3051/cntext/HEAD/img/wordshiftor.png
--------------------------------------------------------------------------------
/img/大邓和他的Python.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/zhw3051/cntext/HEAD/img/大邓和他的Python.png
--------------------------------------------------------------------------------
/cntext/visualization/__init__.py:
--------------------------------------------------------------------------------
1 | from cntext.visualization.visualization import wordcloud, wordshiftor
--------------------------------------------------------------------------------
/examples/data/w2v_seeds/respect.txt:
--------------------------------------------------------------------------------
1 | respectful
2 | talent
3 | talented
4 | employee
5 | dignity
6 | empowerment
7 | empower
--------------------------------------------------------------------------------
/cntext/sentiment/__init__.py:
--------------------------------------------------------------------------------
1 | from cntext.sentiment.sentiment import senti_by_dutir, senti_by_hownet, senti_by_diydict, init_jieba
2 |
--------------------------------------------------------------------------------
/examples/data/w2v_seeds/teamwork.txt:
--------------------------------------------------------------------------------
1 | teamwork
2 | collaboration
3 | collaborate
4 | collaborative
5 | cooperation
6 | cooperate
7 | cooperative
--------------------------------------------------------------------------------
/examples/data/w2v_seeds/quality.txt:
--------------------------------------------------------------------------------
1 | quality
2 | customer
3 | customer_commitment
4 | dedication
5 | dedicated
6 | dedicate
7 | customer_expectation
--------------------------------------------------------------------------------
/examples/data/w2v_seeds/innovation.txt:
--------------------------------------------------------------------------------
1 | innovation
2 | innovate
3 | innovative
4 | creativity
5 | creative
6 | create
7 | passion
8 | passionate
9 | efficiency
10 | efficient
11 | excellence
12 | pride
--------------------------------------------------------------------------------
/examples/data/w2v_seeds/integrity.txt:
--------------------------------------------------------------------------------
1 | integrity
2 | ethic
3 | ethical
4 | accountable
5 | accountability
6 | trust
7 | honesty
8 | honest
9 | honestly
10 | fairness
11 | responsibility
12 | responsible
13 | transparency
14 | transparent
--------------------------------------------------------------------------------
/cntext/__init__.py:
--------------------------------------------------------------------------------
1 | from cntext.dictionary import *
2 | from cntext.sentiment import senti_by_dutir, senti_by_hownet, senti_by_diydict, init_jieba
3 | from cntext.similarity import similarity_score
4 | from cntext.stats import term_freq, readability
5 | from cntext.visualization import wordcloud, wordshiftor
6 |
7 |
--------------------------------------------------------------------------------
/setup.py:
--------------------------------------------------------------------------------
1 | from setuptools import setup
2 | import setuptools
3 |
4 | setup(
5 | name='cntext', # 包名字
6 | version='0.9', # 包版本
7 | description='中文文本分析库,可对文本进行词频统计、词典扩充、情绪分析、相似度、可读性等', # 简单描述
8 | author='大邓', # 作者
9 | author_email='thunderhit@qq.com', # 邮箱
10 | url='https://github.com/thunderhit/cntext', # 包的主页
11 | packages=setuptools.find_packages(),
12 |     install_requires=['jieba', 'numpy', 'scikit-learn==1.0', 'matplotlib', 'pyecharts', 'shifterator'],
13 | python_requires='>=3.5',
14 | license="MIT",
15 | keywords=['chinese', 'text mining', 'sentiment', 'sentiment analysis', 'natural language processing', 'sentiment dictionary development', 'text similarity'],
16 | long_description=open('README.md').read(), # 读取的Readme文档内容
17 | long_description_content_type="text/markdown") # 指定包文档格式为markdown
18 | #py_modules = ['eventextraction.py']
19 |
--------------------------------------------------------------------------------
/examples/output/w2v_candi_words/teamwork.txt:
--------------------------------------------------------------------------------
1 | teamwork
2 | collaboration
3 | collaborate
4 | collaborative
5 | cooperation
6 | cooperate
7 | cooperative
8 | submissions
9 | conduct
10 | collaborations
11 | nda
12 | alamos
13 | tigit
14 | backbone
15 | subsequently
16 | toxicology
17 | governance
18 | tolerability
19 | ipafricept
20 | svi
21 | compounds
22 | cooperative
23 | 1003
24 | pd1
25 | oncomed
26 | clia
27 | proof
28 | assisting
29 | reel-to-reel
30 | best-in-class
31 | pivotal
32 | interaction
33 | monotherapy
34 | crm
35 | anti-tigit
36 | opt
37 | 3454
38 | comprehensive
39 | establishing
40 | interactive
41 | innovent
42 | non-small
43 | investigative
44 | in-depth
45 | specs
46 | regarded
47 | bispecific
48 | predictive
49 | inhibitors
50 | distributes
51 | investigation
52 | roche
53 | teach
54 | anti-pd1
55 | contacts
56 | militaries
57 | reel
58 | gem
59 | fluor
60 | adopt
61 | inviting
62 | vantictumab
63 | expose
64 | biologic
65 | certification
66 | dispensing
67 | clinic
68 | apc
69 | gitr
70 | communist
71 | celgene
72 | medications
73 | intermediaries
74 | enargas
75 | primes
76 | integral
77 | authority
78 | measuring
79 | freightliner
80 | synchronized
81 | accessible
82 | bayer
83 | independently
84 | listeners
85 | dystrophin
86 | unanimous
87 | proof-of-concept
88 | proceeding
89 | records
90 | alzheimer
91 | seamless
92 | c-12
93 | stakeholders
94 | oversight
95 | coalition
96 | examining
97 | engagement
98 | gsk
99 | motor
100 | jewelfish
101 | pre-ind
102 | managements
103 | modify
104 | attorneys
105 | compliant
106 | venue
107 | instructed
108 |
--------------------------------------------------------------------------------
/examples/output/w2v_candi_words/integrity.txt:
--------------------------------------------------------------------------------
1 | integrity
2 | ethic
3 | ethical
4 | accountable
5 | accountability
6 | trust
7 | honesty
8 | honest
9 | honestly
10 | fairness
11 | responsibility
12 | responsible
13 | transparency
14 | transparent
15 | win-win
16 | threat
17 | stakeholders
18 | advice
19 | dissatisfaction
20 | ecosystem
21 | imbalances
22 | integral
23 | reputable
24 | knowledgeable
25 | mindset
26 | professionals
27 | seriously
28 | trusted
29 | intermediaries
30 | responsibility
31 | durability
32 | assistant
33 | deserves
34 | embrace
35 | examine
36 | urge
37 | uniquely
38 | blockchain
39 | friendly
40 | manifested
41 | ourself
42 | redefined
43 | attracting
44 | deeply
45 | specs
46 | inviting
47 | marketers
48 | bottlers
49 | command-and-control
50 | gigantic
51 | oversight
52 | experts
53 | reward
54 | enterprises
55 | attracted
56 | examination
57 | intelligence
58 | reinforce
59 | responsible
60 | educate
61 | engage
62 | workflow
63 | embraced
64 | legitimate
65 | lies
66 | investigate
67 | clever
68 | powerlistings
69 | hr
70 | terror
71 | blame
72 | chose
73 | involves
74 | adopt
75 | incentivized
76 | refocusing
77 | eager
78 | approached
79 | domain
80 | spirit
81 | karat
82 | thrive
83 | beverages
84 | merits
85 | emotional
86 | retailing
87 | reassessed
88 | seamless
89 | advantageous
90 | holistic
91 | viability
92 | selecting
93 | technological
94 | tailor
95 | alignment
96 | in-depth
97 | sponsor
98 | align
99 | fuels
100 | definitively
101 | venue
102 | examining
103 | audiences
104 | credentials
105 | engaging
106 | governance
107 | handled
108 | server-to-server
109 | disdainful
110 | motivation
111 | intrinsic
112 | subsequently
113 | humanly
114 | peace
115 |
--------------------------------------------------------------------------------
/cntext/visualization/visualization.py:
--------------------------------------------------------------------------------
1 | import pyecharts.options as opts
2 | from pyecharts.charts import WordCloud
3 | from cntext.stats.stats import term_freq
4 | from shifterator import EntropyShift
5 |
6 |
7 | def wordcloud(text, title, html_path):
8 | """
9 | 使用pyecharts库绘制词云图
10 | :param text: 中文文本字符串数据
11 | :param title: 词云图标题
12 | :param html_path: 词云图html文件存储路径
13 | :return:
14 | """
15 | wordfreq_dict = dict(term_freq(text))
16 | wordfreqs = [(word, str(freq)) for word, freq in wordfreq_dict.items()]
17 | wc = WordCloud()
18 | wc.add(series_name="", data_pair=wordfreqs, word_size_range=[20, 100])
19 | wc.set_global_opts(
20 | title_opts=opts.TitleOpts(title=title,
21 | title_textstyle_opts=opts.TextStyleOpts(font_size=23)
22 | ),
23 | tooltip_opts=opts.TooltipOpts(is_show=True))
24 | wc.render(html_path) #存储位置
25 | print('可视化完成,请前往 {} 查看'.format(html_path))
26 |
27 |
28 |
29 |
30 | def wordshiftor(text1, text2, title, top_n=50, matplotlib_family='Arial Unicode MS'):
31 | """
32 | 使用shifterator库绘制词移图,可用于查看两文本在词语信息熵上的区别
33 | :param text1: 文本数据1;字符串
34 | :param text2: 文本数据2;字符串
35 | :param title: 词移图标题
36 |     :param top_n: 显示最常用的前n词; 默认值50
37 |     :param matplotlib_family: matplotlib中文字体,默认"Arial Unicode MS";如绘图字体乱码,请参考下面提示
38 |
39 | 设置参数matplotlib_family,需要先运行下面代码获取本机字体列表
40 | from matplotlib.font_manager import FontManager
41 | mpl_fonts = set(f.name for f in FontManager().ttflist)
42 | print(mpl_fonts)
43 | """
44 | import matplotlib
45 | matplotlib.rc("font", family=matplotlib_family)
46 | type2freq_1 = term_freq(text1)
47 |
48 | type2freq_2 = term_freq(text2)
49 |
50 | entropy_shift = EntropyShift(type2freq_1=type2freq_1,
51 | type2freq_2=type2freq_2,
52 | base=2)
53 | entropy_shift.get_shift_graph(title=title, top_n=top_n)
54 |
55 |
--------------------------------------------------------------------------------
/cntext/stats/stats.py:
--------------------------------------------------------------------------------
1 | from cntext.dictionary.dictionary import ADV_words, CONJ_words, STOPWORDS_zh
2 | import re
3 | import jieba
4 | from collections import Counter
5 | import numpy as np
6 |
7 |
8 | def term_freq(text):
9 | text = ''.join(re.findall('[\u4e00-\u9fa5]+', text))
10 | words = jieba.lcut(text)
11 | words = [w for w in words if w not in STOPWORDS_zh]
12 | return Counter(words)
13 |
14 |
15 |
16 | def readability(text, language='chinese'):
17 | """
18 | 文本可读性,指标越大,文章复杂度越高,可读性越差。
19 | ------------
20 | 【英文可读性】公式 4.71 x (characters/words) + 0.5 x (words/sentences) - 21.43;
21 | 【中文可读性】 参考自 【徐巍,姚振晔,陈冬华.中文年报可读性:衡量与检验[J].会计研究,2021(03):28-44.】
22 | readability1 ---每个分句中的平均字数
23 | readability2 ---每个句子中副词和连词所占的比例
24 | readability3 ---参考Fog Index, readability3=(readability1+readability2)×0.5
25 | 以上三个指标越大,都说明文本的复杂程度越高,可读性越差。
26 |
27 | """
28 | if language=='english':
29 | text = text.lower()
30 | num_of_characters = len(text)
31 | num_of_words = len(text.split(" "))
32 | num_of_sentences = len(re.split('[\.!\?\n;]+', text))
33 | ari = (
34 | 4.71 * (num_of_characters / num_of_words)
35 | + 0.5 * (num_of_words / num_of_sentences)
36 | - 21.43
37 | )
38 |
39 | return {"readability": ari}
40 | if language=='chinese':
41 | adv_conj_words = set(ADV_words+CONJ_words)
42 | zi_num_per_sent = []
43 | adv_conj_ratio_per_sent = []
44 |         sentences = [s for s in re.split('[\.。!!?\?\n;;]+', text) if s]  # 过滤切分产生的空句,避免除零
45 | for sent in sentences:
46 | adv_conj_num = 0
47 | zi_num_per_sent.append(len(sent))
48 | words = jieba.lcut(sent)
49 | for w in words:
50 | if w in adv_conj_words:
51 | adv_conj_num+=1
52 | adv_conj_ratio_per_sent.append(adv_conj_num/len(words))
53 | readability1 = np.mean(zi_num_per_sent)
54 | readability2 = np.mean(adv_conj_ratio_per_sent)
55 | readability3 = (readability1+readability2)*0.5
56 | return {'readability1': readability1,
57 | 'readability2': readability2,
58 | 'readability3': readability3}
59 |
60 |
61 |
62 |
63 |
64 |
65 |
66 |
--------------------------------------------------------------------------------
/cntext/similarity/README.md:
--------------------------------------------------------------------------------
1 |
2 |
3 |
4 |
5 | # simtext
6 |
7 | simtext可以计算两文档间四大文本相似性指标,分别为:
8 |
9 | - Sim_Cosine cosine相似性
10 | - Sim_Jaccard Jaccard相似性
11 | - Sim_MinEdit 最小编辑距离
12 | - Sim_Simple 微软Word中的track changes
13 |
14 | 具体算法介绍可参看 Cohen, Malloy & Nguyen (2018) 第60页;下方另给出其中余弦与Jaccard两个指标的最小计算示意。
15 |
16 | 
17 |
18 | ### 安装
19 |
20 | ```
21 | pip install simtext
22 | ```
23 |
24 | ### 使用
25 |
26 | 中文文本相似性
27 |
28 | ```python
29 | from simtext import similarity
30 |
31 | text1 = '在宏观经济背景下,为继续优化贷款结构,重点发展可以抵抗经济周期不良的贷款'
32 | text2 = '在宏观经济背景下,为继续优化贷款结构,重点发展可三年专业化、集约化、综合金融+物联网金融四大金融特色的基础上'
33 |
34 | sim = similarity()
35 | res = sim.compute(text1, text2)
36 | print(res)
37 | ```
38 |
39 | Run
40 |
41 | ```
42 | {'Sim_Cosine': 0.46475800154489,
43 | 'Sim_Jaccard': 0.3333333333333333,
44 | 'Sim_MinEdit': 29,
45 | 'Sim_Simple': 0.9889595182335229}
46 | ```
47 |
48 |
49 |
50 | 英文文本相似性
51 |
52 | ```python
53 | from simtext import similarity
54 |
55 | A = 'We expect demand to increase.'
56 | B = 'We expect worldwide demand to increase.'
57 | C = 'We expect weakness in sales'
58 |
59 | sim = similarity()
60 | AB = sim.compute(A, B)
61 | AC = sim.compute(A, C)
62 |
63 | print(AB)
64 | print(AC)
65 | ```
66 |
67 | Run
68 |
69 | ```
70 | {'Sim_Cosine': 0.9128709291752769,
71 | 'Sim_Jaccard': 0.8333333333333334,
72 | 'Sim_MinEdit': 2,
73 | 'Sim_Simple': 0.9545454545454546}
74 |
75 | {'Sim_Cosine': 0.39999999999999997,
76 | 'Sim_Jaccard': 0.25,
77 | 'Sim_MinEdit': 4,
78 | 'Sim_Simple': 0.9315789473684211}
79 |
80 | ```
81 |
82 |
83 |
84 | ### 参考文献
85 |
86 | Cohen, Lauren, Christopher Malloy, and Quoc Nguyen. *Lazy prices*. No. w25084. National Bureau of Economic Research, 2018.
87 |
88 | ## 如果
89 |
90 | 如果您是经管人文社科专业背景,编程小白,面临海量文本数据采集和处理分析艰巨任务,个人建议学习[《python网络爬虫与文本数据分析》](https://ke.qq.com/course/482241?tuin=163164df)视频课。作为文科生,一样也是从两眼一抹黑开始,这门课程是用五年时间凝缩出来的。自认为讲的很通俗易懂o(* ̄︶ ̄*)o,
91 |
92 | - python入门
93 | - 网络爬虫
94 | - 数据读取
95 | - 文本分析入门
96 | - 机器学习与文本分析
97 | - 文本分析在经管研究中的应用
98 |
99 | 感兴趣的童鞋不妨 戳一下[《python网络爬虫与文本数据分析》](https://ke.qq.com/course/482241?tuin=163164df)进来看看~
100 |
101 | [](https://ke.qq.com/course/482241?tuin=163164df)
102 |
103 |
104 |
105 | ## 更多
106 |
107 | - [B站:大邓和他的python](https://space.bilibili.com/122592901/channel/detail?cid=66008)
108 |
109 | - 公众号:大邓和他的python
110 |
111 | - [知乎专栏:数据科学家](https://zhuanlan.zhihu.com/dadeng)
112 |
113 |
114 | 
115 |
--------------------------------------------------------------------------------
/examples/02-统计信息.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {},
7 | "outputs": [
8 | {
9 | "name": "stderr",
10 | "output_type": "stream",
11 | "text": [
12 | "Building prefix dict from the default dictionary ...\n",
13 | "Dumping model to file cache /var/folders/sc/3mnt5tgs419_hk7s16gq61p80000gn/T/jieba.cache\n",
14 | "Loading model cost 0.718 seconds.\n",
15 | "Prefix dict has been built successfully.\n"
16 | ]
17 | },
18 | {
19 | "data": {
20 | "text/plain": [
21 | "Counter({'看待': 1,\n",
22 | " '网文': 1,\n",
23 | " '作者': 1,\n",
24 | " '黑客': 1,\n",
25 | " '大佬': 1,\n",
26 | " '盗号': 1,\n",
27 | " '改文因': 1,\n",
28 | " '万分': 1,\n",
29 | " '惭愧': 1,\n",
30 | " '停': 1})"
31 | ]
32 | },
33 | "execution_count": 1,
34 | "metadata": {},
35 | "output_type": "execute_result"
36 | }
37 | ],
38 | "source": [
39 | "from cntext.stats import term_freq, readability\n",
40 | "\n",
41 | "text = '如何看待一网文作者被黑客大佬盗号改文,因万分惭愧而停更'\n",
42 | "term_freq(text)"
43 | ]
44 | },
45 | {
46 | "cell_type": "code",
47 | "execution_count": 2,
48 | "metadata": {},
49 | "outputs": [
50 | {
51 | "data": {
52 | "text/plain": [
53 | "{'readability1': 27.0,\n",
54 | " 'readability2': 0.17647058823529413,\n",
55 | " 'readability3': 13.588235294117647}"
56 | ]
57 | },
58 | "execution_count": 2,
59 | "metadata": {},
60 | "output_type": "execute_result"
61 | }
62 | ],
63 | "source": [
64 | "readability(text)"
65 | ]
66 | },
67 | {
68 | "cell_type": "code",
69 | "execution_count": null,
70 | "metadata": {},
71 | "outputs": [],
72 | "source": []
73 | }
74 | ],
75 | "metadata": {
76 | "kernelspec": {
77 | "display_name": "Python 3",
78 | "language": "python",
79 | "name": "python3"
80 | },
81 | "language_info": {
82 | "codemirror_mode": {
83 | "name": "ipython",
84 | "version": 3
85 | },
86 | "file_extension": ".py",
87 | "mimetype": "text/x-python",
88 | "name": "python",
89 | "nbconvert_exporter": "python",
90 | "pygments_lexer": "ipython3",
91 | "version": "3.7.5"
92 | },
93 | "toc": {
94 | "base_numbering": 1,
95 | "nav_menu": {},
96 | "number_sections": true,
97 | "sideBar": true,
98 | "skip_h1_title": false,
99 | "title_cell": "Table of Contents",
100 | "title_sidebar": "Contents",
101 | "toc_cell": false,
102 | "toc_position": {},
103 | "toc_section_display": true,
104 | "toc_window_display": false
105 | }
106 | },
107 | "nbformat": 4,
108 | "nbformat_minor": 4
109 | }
110 |
--------------------------------------------------------------------------------
/cntext/similarity/similarity.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | # -*- coding: utf-8 -*-
3 | from sklearn.feature_extraction.text import CountVectorizer
4 | from sklearn.metrics.pairwise import cosine_similarity
5 | from difflib import Differ, SequenceMatcher
6 | import jieba
7 | from math import *
8 | import warnings
9 |
10 | warnings.filterwarnings('ignore')
11 |
12 |
13 | def check_contain_chinese(check_str):
14 | for ch in check_str:
15 | if u'\u4e00' <= ch <= u'\u9fff':
16 | return True
17 | return False
18 |
19 |
20 | def transform(text1, text2):
21 | """
22 | 把文本text1,text2转化为英文样式的text1,text2和向量vec1,vec2
23 | :param text1:
24 | :param text2:
25 | :return:
26 | """
27 |
28 | if check_contain_chinese(text1):
29 | text1 = ' '.join(jieba.lcut(text1))
30 | text2 = ' '.join(jieba.lcut(text2))
31 | else:
32 | pass
33 |
34 | corpus = [text1, text2]
35 | cv = CountVectorizer(binary=True)
36 | cv.fit(corpus)
37 | vec1 = cv.transform([text1]).toarray()
38 | vec2 = cv.transform([text2]).toarray()
39 | return text1, text2, vec1, vec2
40 |
41 |
42 | #def cosine_similarity(vec1, vec2):
43 | # cos_sim = cosine_similarity(vec1, vec2)[0][0]
44 | # return cos_sim[0][0]
45 |
46 | def jaccard_similarity(vec1, vec2):
47 | """ returns the jaccard similarity between two lists """
48 | vec1 = set([idx for idx, v in enumerate(vec1[0]) if v > 0])
49 | vec2 = set([idx for idx, v in enumerate(vec2[0]) if v > 0])
50 | return len(vec1 & vec2) / len(vec1 | vec2)
51 |
52 | def minedit_similarity(text1, text2):
53 | words1 = jieba.lcut(text1.lower())
54 | words2 = jieba.lcut(text2.lower())
55 | leven_cost = 0
56 | s = SequenceMatcher(None, words1, words2)
57 | for tag, i1, i2, j1, j2 in s.get_opcodes():
58 | if tag == 'replace':
59 | leven_cost += max(i2 - i1, j2 - j1)
60 | elif tag == 'insert':
61 | leven_cost += (j2 - j1)
62 | elif tag == 'delete':
63 | leven_cost += (i2 - i1)
64 | return leven_cost
65 |
66 | def simple_similarity(text1, text2):
67 | words1 = jieba.lcut(text1.lower())
68 | words2 = jieba.lcut(text2.lower())
69 | diff = Differ()
70 | diff_manipulate = list(diff.compare(words1, words2))
71 | c = len(diff_manipulate) / (len(words1) + len(words2))
72 | cmax = max([len(words1), len(words2)])
73 | return (cmax - c) / cmax
74 |
75 |
76 |
77 | def similarity_score(text1, text2):
78 | """
79 | 对输入的text1和text2进行相似性计算,返回相似性信息
80 | :param text1: 文本字符串
81 | :param text2: 文本字符串
82 | :return: 字典, 形如{
83 | 'Sim_Cosine':0.8,
84 | 'Sim_Jaccard': 0.3,
85 | 'Sim_MinEdit': 0.5,
86 | 'Sim_Simple': 0.8
87 | }
88 | """
89 | text11, text22, vec1, vec2 = transform(text1, text2)
90 | data = {
91 | 'Sim_Cosine': cosine_similarity(vec1, vec2)[0][0],
92 | 'Sim_Jaccard': jaccard_similarity(vec1, vec2),
93 | 'Sim_MinEdit': minedit_similarity(text11, text22),
94 | 'Sim_Simple': simple_similarity(text11, text22)
95 | }
96 | return data
97 |
98 |
--------------------------------------------------------------------------------
/examples/05-文本相似.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "from cntext.similarity import similarity_score"
10 | ]
11 | },
12 | {
13 | "cell_type": "code",
14 | "execution_count": 2,
15 | "metadata": {},
16 | "outputs": [
17 | {
18 | "name": "stdout",
19 | "output_type": "stream",
20 | "text": [
21 | "Help on function similarity_score in module cntext.similarity.similarity:\n",
22 | "\n",
23 | "similarity_score(text1, text2)\n",
24 | " 对输入的text1和text2进行相似性计算,返回相似性信息\n",
25 | " :param text1: 文本字符串\n",
26 | " :param text2: 文本字符串\n",
27 | " :return: 字典, 形如{\n",
28 | " 'Sim_Cosine':0.8,\n",
29 | " 'Sim_Jaccard': 0.3,\n",
30 | " 'Sim_MinEdit': 0.5,\n",
31 | " 'Sim_Simple': 0.8\n",
32 | " }\n",
33 | "\n"
34 | ]
35 | }
36 | ],
37 | "source": [
38 | "help(similarity_score)"
39 | ]
40 | },
41 | {
42 | "cell_type": "code",
43 | "execution_count": 3,
44 | "metadata": {},
45 | "outputs": [
46 | {
47 | "name": "stderr",
48 | "output_type": "stream",
49 | "text": [
50 | "Building prefix dict from the default dictionary ...\n",
51 | "Loading model from cache /var/folders/sc/3mnt5tgs419_hk7s16gq61p80000gn/T/jieba.cache\n",
52 | "Loading model cost 0.633 seconds.\n",
53 | "Prefix dict has been built successfully.\n"
54 | ]
55 | },
56 | {
57 | "data": {
58 | "text/plain": [
59 | "{'Sim_Cosine': 0.816496580927726,\n",
60 | " 'Sim_Jaccard': 0.6666666666666666,\n",
61 | " 'Sim_MinEdit': 1,\n",
62 | " 'Sim_Simple': 0.9183673469387755}"
63 | ]
64 | },
65 | "execution_count": 3,
66 | "metadata": {},
67 | "output_type": "execute_result"
68 | }
69 | ],
70 | "source": [
71 | "text1 = '编程真好玩编程真好玩'\n",
72 | "text2 = '游戏真好玩编程真好玩'\n",
73 | "\n",
74 | "similarity_score(text1, text2)"
75 | ]
76 | },
77 | {
78 | "cell_type": "code",
79 | "execution_count": null,
80 | "metadata": {},
81 | "outputs": [],
82 | "source": []
83 | }
84 | ],
85 | "metadata": {
86 | "kernelspec": {
87 | "display_name": "Python 3",
88 | "language": "python",
89 | "name": "python3"
90 | },
91 | "language_info": {
92 | "codemirror_mode": {
93 | "name": "ipython",
94 | "version": 3
95 | },
96 | "file_extension": ".py",
97 | "mimetype": "text/x-python",
98 | "name": "python",
99 | "nbconvert_exporter": "python",
100 | "pygments_lexer": "ipython3",
101 | "version": "3.7.5"
102 | },
103 | "toc": {
104 | "base_numbering": 1,
105 | "nav_menu": {},
106 | "number_sections": true,
107 | "sideBar": true,
108 | "skip_h1_title": false,
109 | "title_cell": "Table of Contents",
110 | "title_sidebar": "Contents",
111 | "toc_cell": false,
112 | "toc_position": {},
113 | "toc_section_display": true,
114 | "toc_window_display": false
115 | }
116 | },
117 | "nbformat": 4,
118 | "nbformat_minor": 4
119 | }
120 |
--------------------------------------------------------------------------------
/examples/03-扩充词典.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {},
7 | "outputs": [
8 | {
9 | "name": "stderr",
10 | "output_type": "stream",
11 | "text": [
12 | "Building prefix dict from the default dictionary ...\n",
13 | "Loading model from cache /var/folders/sc/3mnt5tgs419_hk7s16gq61p80000gn/T/jieba.cache\n"
14 | ]
15 | },
16 | {
17 | "name": "stdout",
18 | "output_type": "stream",
19 | "text": [
20 | "step 1/4:...seg corpus ...\n"
21 | ]
22 | },
23 | {
24 | "name": "stderr",
25 | "output_type": "stream",
26 | "text": [
27 | "Loading model cost 0.678 seconds.\n",
28 | "Prefix dict has been built successfully.\n"
29 | ]
30 | },
31 | {
32 | "name": "stdout",
33 | "output_type": "stream",
34 | "text": [
35 | "step 1/4 finished:...cost 60.78995203971863...\n",
36 | "step 2/4:...collect cowords ...\n",
37 | "step 2/4 finished:...cost 0.6169600486755371...\n",
38 | "step 3/4:...compute sopmi ...\n",
39 | "step 1/4 finished:...cost 0.26422882080078125...\n",
40 | "step 4/4:...save candiwords ...\n",
41 | "finished! cost 61.8965539932251\n"
42 | ]
43 | }
44 | ],
45 | "source": [
46 | "from cntext.dictionary import SoPmi\n",
47 | "import os\n",
48 | "\n",
49 | "sopmier = SoPmi(cwd=os.getcwd(),\n",
50 | " input_txt_file='data/sopmi_corpus.txt', #原始数据,您的语料\n",
51 | " seedword_txt_file='data/sopmi_seed_words.txt', #人工标注的初始种子词\n",
52 | " ) \n",
53 | "\n",
54 | "sopmier.sopmi()"
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": null,
60 | "metadata": {},
61 | "outputs": [],
62 | "source": []
63 | },
64 | {
65 | "cell_type": "code",
66 | "execution_count": 1,
67 | "metadata": {},
68 | "outputs": [
69 | {
70 | "name": "stdout",
71 | "output_type": "stream",
72 | "text": [
73 | "数据预处理开始.......\n",
74 | "预处理结束...........\n",
75 | "Word2Vec模型训练开始......\n",
76 | "已将模型存入 /Users/thunderhit/Desktop/cntext/test/output/w2v_candi_words/w2v.model \n",
77 | "准备寻找每个seed在语料中所有的相似候选词\n",
78 | "初步搜寻到 572 个相似的候选词\n",
79 | "计算每个候选词 与 integrity 的相似度, 选出相似度最高的前 100 个候选词\n",
80 | "已完成 【integrity 类】 的词语筛选,并保存于 /Users/thunderhit/Desktop/cntext/test/output/w2v_candi_words/integrity.txt, 耗时 46 秒\n",
81 | "准备寻找每个seed在语料中所有的相似候选词\n",
82 | "初步搜寻到 516 个相似的候选词\n",
83 | "计算每个候选词 与 innovation 的相似度, 选出相似度最高的前 100 个候选词\n",
84 | "已完成 【innovation 类】 的词语筛选,并保存于 /Users/thunderhit/Desktop/cntext/test/output/w2v_candi_words/innovation.txt, 耗时 46 秒\n",
85 | "准备寻找每个seed在语料中所有的相似候选词\n",
86 | "初步搜寻到 234 个相似的候选词\n",
87 | "计算每个候选词 与 quality 的相似度, 选出相似度最高的前 100 个候选词\n",
88 | "已完成 【quality 类】 的词语筛选,并保存于 /Users/thunderhit/Desktop/cntext/test/output/w2v_candi_words/quality.txt, 耗时 46 秒\n",
89 | "准备寻找每个seed在语料中所有的相似候选词\n",
90 | "初步搜寻到 243 个相似的候选词\n",
91 | "计算每个候选词 与 respect 的相似度, 选出相似度最高的前 100 个候选词\n",
92 | "已完成 【respect 类】 的词语筛选,并保存于 /Users/thunderhit/Desktop/cntext/test/output/w2v_candi_words/respect.txt, 耗时 46 秒\n",
93 | "准备寻找每个seed在语料中所有的相似候选词\n",
94 | "初步搜寻到 319 个相似的候选词\n",
95 | "计算每个候选词 与 teamwork 的相似度, 选出相似度最高的前 100 个候选词\n",
96 | "已完成 【teamwork 类】 的词语筛选,并保存于 /Users/thunderhit/Desktop/cntext/test/output/w2v_candi_words/teamwork.txt, 耗时 46 秒\n"
97 | ]
98 | }
99 | ],
100 | "source": [
101 | "from cntext.dictionary import W2VModels\n",
102 | "import os\n",
103 | "\n",
104 | "#初始化模型\n",
105 | "model = W2VModels(cwd=os.getcwd()) #语料数据 w2v_corpus.txt\n",
106 | "model.train(input_txt_file='data/w2v_corpus.txt')\n",
107 | "\n",
108 | "\n",
109 |     "#根据种子词,筛选出每类词最相近的前100个词\n",
110 | "model.find(seedword_txt_file='data/w2v_seeds/integrity.txt', \n",
111 | " topn=100)\n",
112 | "model.find(seedword_txt_file='data/w2v_seeds/innovation.txt', \n",
113 | " topn=100)\n",
114 | "model.find(seedword_txt_file='data/w2v_seeds/quality.txt', \n",
115 | " topn=100)\n",
116 | "model.find(seedword_txt_file='data/w2v_seeds/respect.txt', \n",
117 | " topn=100)\n",
118 | "model.find(seedword_txt_file='data/w2v_seeds/teamwork.txt', \n",
119 | " topn=100)"
120 | ]
121 | },
122 | {
123 | "cell_type": "code",
124 | "execution_count": null,
125 | "metadata": {},
126 | "outputs": [],
127 | "source": []
128 | }
129 | ],
130 | "metadata": {
131 | "kernelspec": {
132 | "display_name": "Python 3",
133 | "language": "python",
134 | "name": "python3"
135 | },
136 | "language_info": {
137 | "codemirror_mode": {
138 | "name": "ipython",
139 | "version": 3
140 | },
141 | "file_extension": ".py",
142 | "mimetype": "text/x-python",
143 | "name": "python",
144 | "nbconvert_exporter": "python",
145 | "pygments_lexer": "ipython3",
146 | "version": "3.7.5"
147 | },
148 | "toc": {
149 | "base_numbering": 1,
150 | "nav_menu": {},
151 | "number_sections": true,
152 | "sideBar": true,
153 | "skip_h1_title": false,
154 | "title_cell": "Table of Contents",
155 | "title_sidebar": "Contents",
156 | "toc_cell": false,
157 | "toc_position": {},
158 | "toc_section_display": true,
159 | "toc_window_display": false
160 | }
161 | },
162 | "nbformat": 4,
163 | "nbformat_minor": 4
164 | }
165 |
--------------------------------------------------------------------------------
/examples/04-情感计算.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "from cntext.sentiment import senti_by_dutir, senti_by_hownet, senti_by_diydict"
10 | ]
11 | },
12 | {
13 | "cell_type": "code",
14 | "execution_count": 2,
15 | "metadata": {},
16 | "outputs": [
17 | {
18 | "name": "stdout",
19 | "output_type": "stream",
20 | "text": [
21 | "Help on function senti_by_dutir in module cntext.sentiment.sentiment:\n",
22 | "\n",
23 | "senti_by_dutir(text)\n",
24 | " 使用大连理工大学情感本体库DUTIR,仅计算文本中各个情绪词出现次数\n",
25 | " :param text: 中文文本字符串\n",
26 |     "    :return: 返回文本情感统计信息,类似于这样{'word_num': 22, 'sentence_num': 2, 'stopword_num': 6, '好_num': 0, '乐_num': 4, '哀_num': 0, '怒_num': 0, '惧_num': 0, '恶_num': 0, '惊_num': 0}\n",
27 | "\n"
28 | ]
29 | }
30 | ],
31 | "source": [
32 | "help(senti_by_dutir)"
33 | ]
34 | },
35 | {
36 | "cell_type": "code",
37 | "execution_count": 3,
38 | "metadata": {},
39 | "outputs": [
40 | {
41 | "name": "stderr",
42 | "output_type": "stream",
43 | "text": [
44 | "Building prefix dict from the default dictionary ...\n",
45 | "Loading model from cache /var/folders/sc/3mnt5tgs419_hk7s16gq61p80000gn/T/jieba.cache\n",
46 | "Loading model cost 0.671 seconds.\n",
47 | "Prefix dict has been built successfully.\n"
48 | ]
49 | },
50 | {
51 | "data": {
52 | "text/plain": [
53 | "{'word_num': 12,\n",
54 | " 'sentence_num': 2,\n",
55 | " 'stopword_num': 4,\n",
56 | " '好_num': 0,\n",
57 | " '乐_num': 1,\n",
58 | " '哀_num': 0,\n",
59 | " '怒_num': 0,\n",
60 | " '惧_num': 0,\n",
61 | " '恶_num': 0,\n",
62 | " '惊_num': 0}"
63 | ]
64 | },
65 | "execution_count": 3,
66 | "metadata": {},
67 | "output_type": "execute_result"
68 | }
69 | ],
70 | "source": [
71 | "text = '今天股票大涨,心情倍爽,非常开心啊。'\n",
72 | "\n",
73 | "senti_by_dutir(text)"
74 | ]
75 | },
76 | {
77 | "cell_type": "code",
78 | "execution_count": 4,
79 | "metadata": {},
80 | "outputs": [
81 | {
82 | "name": "stdout",
83 | "output_type": "stream",
84 | "text": [
85 | "Help on function senti_by_hownet in module cntext.sentiment.sentiment:\n",
86 | "\n",
87 | "senti_by_hownet(text, adj_adv=False)\n",
88 | " 使用知网Hownet词典进行(中)文本数据的情感分析;\n",
89 | " :param text: 待分析的中文文本数据\n",
90 | " :param adj_adv: 是否考虑副词(否定词、程度词)对情绪形容词的反转和情感强度修饰作用,默认False。默认False只统计情感形容词出现个数;\n",
91 | " :return: 返回情感信息\n",
92 | "\n"
93 | ]
94 | }
95 | ],
96 | "source": [
97 | "help(senti_by_hownet)"
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": 5,
103 | "metadata": {},
104 | "outputs": [
105 | {
106 | "data": {
107 | "text/plain": [
108 | "{'word_num': 12,\n",
109 | " 'stopword_num': 4,\n",
110 | " 'sentence_num': 1,\n",
111 | " 'pos_word_num': 2,\n",
112 | " 'neg_word_num': 0}"
113 | ]
114 | },
115 | "execution_count": 5,
116 | "metadata": {},
117 | "output_type": "execute_result"
118 | }
119 | ],
120 | "source": [
121 | "senti_by_hownet(text)"
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "execution_count": 6,
127 | "metadata": {},
128 | "outputs": [
129 | {
130 | "data": {
131 | "text/plain": [
132 | "{'sentence_num': 1,\n",
133 | " 'word_num': 12,\n",
134 | " 'stopword_num': 3,\n",
135 | " 'pos_score': 13.0,\n",
136 | " 'neg_score': 0.0}"
137 | ]
138 | },
139 | "execution_count": 6,
140 | "metadata": {},
141 | "output_type": "execute_result"
142 | }
143 | ],
144 | "source": [
145 | "senti_by_hownet(text, adj_adv=True)"
146 | ]
147 | },
148 | {
149 | "cell_type": "code",
150 | "execution_count": 7,
151 | "metadata": {},
152 | "outputs": [
153 | {
154 | "name": "stdout",
155 | "output_type": "stream",
156 | "text": [
157 | "Help on function senti_by_diydict in module cntext.sentiment.sentiment:\n",
158 | "\n",
159 | "senti_by_diydict(text, sentiwords)\n",
160 | " 使用diy词典进行情感分析,计算各个情绪词出现次数,未考虑强度副词、否定词对情感的复杂影响,\n",
161 | " :param text: 待分析中文文本\n",
162 | " :param sentiwords: 情感词字典;\n",
163 | " {'category1': 'category1 词语列表',\n",
164 | " 'category2': 'category2词语列表',\n",
165 | " 'category3': 'category3词语列表',\n",
166 | " ...\n",
167 | " }\n",
168 | " \n",
169 | " :return:\n",
170 | "\n"
171 | ]
172 | }
173 | ],
174 | "source": [
175 | "help(senti_by_diydict)"
176 | ]
177 | },
178 | {
179 | "cell_type": "code",
180 | "execution_count": 8,
181 | "metadata": {},
182 | "outputs": [
183 | {
184 | "data": {
185 | "text/plain": [
186 | "{'pos_num': 1,\n",
187 | " 'neg_num': 0,\n",
188 | " 'adv_num': 1,\n",
189 | " 'stopword_num': 4,\n",
190 | " 'sentence_num': 2,\n",
191 | " 'word_num': 12}"
192 | ]
193 | },
194 | "execution_count": 8,
195 | "metadata": {},
196 | "output_type": "execute_result"
197 | }
198 | ],
199 | "source": [
200 | "sentiwords = {'pos': ['开心', '愉快', '倍爽'],\n",
201 | " 'neg': ['难过', '悲伤'],\n",
202 | " 'adv': ['倍']}\n",
203 | "\n",
204 | "text = '今天股票大涨,心情倍爽,非常开心啊。'\n",
205 | "senti_by_diydict(text, sentiwords)"
206 | ]
207 | },
208 | {
209 | "cell_type": "code",
210 | "execution_count": null,
211 | "metadata": {},
212 | "outputs": [],
213 | "source": []
214 | }
215 | ],
216 | "metadata": {
217 | "kernelspec": {
218 | "display_name": "Python 3",
219 | "language": "python",
220 | "name": "python3"
221 | },
222 | "language_info": {
223 | "codemirror_mode": {
224 | "name": "ipython",
225 | "version": 3
226 | },
227 | "file_extension": ".py",
228 | "mimetype": "text/x-python",
229 | "name": "python",
230 | "nbconvert_exporter": "python",
231 | "pygments_lexer": "ipython3",
232 | "version": "3.7.5"
233 | },
234 | "toc": {
235 | "base_numbering": 1,
236 | "nav_menu": {},
237 | "number_sections": true,
238 | "sideBar": true,
239 | "skip_h1_title": false,
240 | "title_cell": "Table of Contents",
241 | "title_sidebar": "Contents",
242 | "toc_cell": false,
243 | "toc_position": {},
244 | "toc_section_display": true,
245 | "toc_window_display": false
246 | }
247 | },
248 | "nbformat": 4,
249 | "nbformat_minor": 4
250 | }
251 |
--------------------------------------------------------------------------------
/cntext/sentiment/sentiment.py:
--------------------------------------------------------------------------------
1 | #!/usr/bin/python
2 | # -*- coding: utf-8 -*-
3 | import jieba
4 | import numpy as np
5 | import pathlib
6 | import re
7 | from cntext.dictionary.dictionary import *
8 |
9 |
10 | def init_jieba(diydict=dict()):
11 | """
12 | jieba词典初始化, 为防止情感词被错分,需要在调用情感函数前,先运行init_jieba()
13 | :param diydict: 自定义词典,默认空字典; 当使用senti_by_diydict时,才设置diydict
14 | :return:
15 | """
16 |
17 | for key in diydict.keys():
18 | for word in diydict[key]:
19 | jieba.add_word(word)
20 | dictlists = [DUTIR_Ais, DUTIR_Wus, DUTIR_Haos, DUTIR_Jings,
21 | DUTIR_Jus, DUTIR_Les, DUTIR_Nus, HOWNET_deny,
22 | HOWNET_extreme, HOWNET_ish, HOWNET_more, HOWNET_neg,
23 | HOWNET_pos, HOWNET_very]
24 | for wordlist in dictlists:
25 | for word in wordlist:
26 | jieba.add_word(word)
27 |
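# 调用示意(假设的用法,仅作说明):
#   import cntext
#   cntext.init_jieba()                                             # 使用内置Hownet/DUTIR词典做情感分析前先初始化
#   cntext.init_jieba(diydict={'pos': ['倍爽'], 'neg': ['难过']})    # 使用senti_by_diydict前,把自定义词典一并加入jieba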
28 |
29 |
30 |
31 |
32 | def judgeodd(num):
33 | """
34 | 判断奇数偶数。当情感词前方有偶数个否定词,情感极性方向不变。奇数会改变情感极性方向。
35 | """
36 | if (num % 2) == 0:
37 | return 'even'
38 | else:
39 | return 'odd'
40 |
41 |
42 |
43 |
44 |
45 |
46 | def Adj_senti(text):
47 | """
48 | 简单情感分析,仅计算各个情绪形容词出现次数(占比), 未考虑强度副词、否定词对情感的复杂影响。
49 | :param text: 待分析的中文文本数据
50 | :return: 返回情感信息
51 | """
52 | length, sentences, pos, neg, stopword_num = 0, 0, 0, 0, 0
53 | sentences = [s for s in re.split('[\.。!!?\?\n;;]+', text) if s]
54 | sentence_num = len(sentences)
55 | words = jieba.lcut(text)
56 | length = len(words)
57 | for w in words:
58 | if w in STOPWORDS_zh:
59 | stopword_num += 1
60 | if w in HOWNET_pos:
61 | pos+=1
62 | elif w in HOWNET_neg:
63 | neg+=1
64 | else:
65 | pass
66 | return {'word_num': length,
67 | 'stopword_num': stopword_num,
68 | 'sentence_num': sentence_num,
69 | 'pos_word_num': pos,
70 | 'neg_word_num': neg}
71 |
72 |
73 |
74 | def AdjAdv_senti(text):
75 | """
76 | 情感分析,考虑副词对情绪形容词的修饰作用和否定词的反转作用,
77 | 其中副词对情感形容词的情感赋以权重,
78 | 否定词确定情感值正负。
79 |
80 | :param text: 待分析的中文文本数据
81 | :return: 返回情感信息
82 | """
83 | sentences = [s for s in re.split('[\.。!!?\?\n;;]+', text) if s]
84 | wordnum = len(jieba.lcut(text))
85 | count1 = []
86 | count2 = []
87 | stopword_num = 0
88 | for sen in sentences:
89 | segtmp = jieba.lcut(sen)
90 | i = 0 # 记录扫描到的词的位置
91 | a = 0 # 记录情感词的位置
92 | poscount = 0 # 积极词的第一次分值
93 | poscount2 = 0 # 积极词反转后的分值
94 | poscount3 = 0 # 积极词的最后分值(包括叹号的分值)
95 | negcount = 0
96 | negcount2 = 0
97 | negcount3 = 0
98 | for w in segtmp:
99 | if w in STOPWORDS_zh:
100 | stopword_num+=1
101 | if w in HOWNET_pos: # 判断词语是否是情感词
102 | poscount += 1
103 | c = 0
104 | for w in segtmp[a:i]: # 扫描情感词前的程度词
105 | if w in HOWNET_extreme:
106 | poscount *= 4.0
107 | elif w in HOWNET_very:
108 | poscount *= 3.0
109 | elif w in HOWNET_more:
110 | poscount *= 2.0
111 | elif w in HOWNET_ish:
112 | poscount *= 0.5
113 | elif w in HOWNET_deny:
114 | c += 1
115 | if judgeodd(c) == 'odd': # 扫描情感词前的否定词数
116 | poscount *= -1.0
117 | poscount2 += poscount
118 | poscount = 0
119 | poscount3 = poscount + poscount2 + poscount3
120 | poscount2 = 0
121 | else:
122 | poscount3 = poscount + poscount2 + poscount3
123 | poscount = 0
124 | a = i + 1 # 情感词的位置变化
125 |
126 | elif w in HOWNET_neg: # 消极情感的分析,与上面一致
127 | negcount += 1
128 | d = 0
129 | for w in segtmp[a:i]:
130 | if w in HOWNET_extreme:
131 | negcount *= 4.0
132 | elif w in HOWNET_very:
133 | negcount *= 3.0
134 | elif w in HOWNET_more:
135 | negcount *= 2.0
136 | elif w in HOWNET_ish:
137 | negcount *= 0.5
138 | elif w in HOWNET_deny:
139 | d += 1
140 | if judgeodd(d) == 'odd':
141 | negcount *= -1.0
142 | negcount2 += negcount
143 | negcount = 0
144 | negcount3 = negcount + negcount2 + negcount3
145 | negcount2 = 0
146 | else:
147 | negcount3 = negcount + negcount2 + negcount3
148 | negcount = 0
149 | a = i + 1
150 | elif w == '!' or w == '!': ##判断句子是否有感叹号
151 | for w2 in segtmp[::-1]: # 扫描感叹号前的情感词,发现后权值+2,然后退出循环
152 |                     if w2 in HOWNET_pos or w2 in HOWNET_neg:
153 | poscount3 += 2
154 | negcount3 += 2
155 | break
156 | i += 1 # 扫描词位置前移
157 |
158 | # 以下是防止出现负数的情况
159 | pos_count = 0
160 | neg_count = 0
161 | if poscount3 < 0 and negcount3 > 0:
162 | neg_count += negcount3 - poscount3
163 | pos_count = 0
164 | elif negcount3 < 0 and poscount3 > 0:
165 | pos_count = poscount3 - negcount3
166 | neg_count = 0
167 | elif poscount3 < 0 and negcount3 < 0:
168 | neg_count = -poscount3
169 | pos_count = -negcount3
170 | else:
171 | pos_count = poscount3
172 | neg_count = negcount3
173 |
174 | count1.append([pos_count, neg_count])
175 | count2.append(count1)
176 | count1 = []
177 |
178 | pos_result = []
179 | neg_result = []
180 | for sentence in count2:
181 | score_array = np.array(sentence)
182 | pos = np.sum(score_array[:, 0])
183 | neg = np.sum(score_array[:, 1])
184 | pos_result.append(pos)
185 | neg_result.append(neg)
186 |
187 | pos_score = np.sum(np.array(pos_result))
188 | neg_score = np.sum(np.array(neg_result))
189 | score = {'sentence_num': len(count2),
190 | 'word_num':wordnum,
191 | 'stopword_num': stopword_num,
192 | 'pos_score': pos_score,
193 | 'neg_score': neg_score}
194 | return score
195 |
196 |
197 |
198 | def senti_by_hownet(text, adj_adv=False):
199 | """
200 | 使用知网Hownet词典进行(中)文本数据的情感分析;
201 | :param text: 待分析的中文文本数据
202 | :param adj_adv: 是否考虑副词(否定词、程度词)对情绪形容词的反转和情感强度修饰作用,默认False。默认False只统计情感形容词出现个数;
203 | :return: 返回情感信息
204 | """
205 | if adj_adv==True:
206 | return AdjAdv_senti(text)
207 | else:
208 | return Adj_senti(text)
209 |
210 |
211 |
212 | def senti_by_dutir(text):
213 | """
214 | 使用大连理工大学情感本体库DUTIR,仅计算文本中各个情绪词出现次数
215 | :param text: 中文文本字符串
216 |     :return: 返回文本情感统计信息,类似于这样{'word_num': 22, 'sentence_num': 2, 'stopword_num': 6, '好_num': 0, '乐_num': 4, '哀_num': 0, '怒_num': 0, '惧_num': 0, '恶_num': 0, '惊_num': 0}
217 | """
218 | wordnum, sentences, hao, le, ai, nu, ju, wu, jing, stopwords =0, 0, 0, 0, 0, 0, 0, 0, 0, 0
219 | sentences = len(re.split('[\.。!!?\?\n;;]+', text))
220 | words = jieba.lcut(text)
221 | wordnum = len(words)
222 | for w in words:
223 | if w in STOPWORDS_zh:
224 | stopwords+=1
225 | if w in DUTIR_Haos:
226 | hao += 1
227 | elif w in DUTIR_Les:
228 | le += 1
229 | elif w in DUTIR_Ais:
230 | ai += 1
231 | elif w in DUTIR_Nus:
232 | nu += 1
233 | elif w in DUTIR_Jus:
234 | ju += 1
235 | elif w in DUTIR_Wus:
236 | wu += 1
237 | elif w in DUTIR_Jings:
238 | jing += 1
239 | else:
240 | pass
241 | result = {'word_num':wordnum,
242 | 'sentence_num':sentences,
243 | 'stopword_num':stopwords,
244 | '好_num':hao, '乐_num':le, '哀_num':ai, '怒_num':nu, '惧_num':ju, '恶_num': wu, '惊_num':jing}
245 | return result
246 |
247 |
248 |
249 |
250 |
256 |
257 | def senti_by_diydict(text, sentiwords):
258 | """
259 | 使用diy词典进行情感分析,计算各个情绪词出现次数,未考虑强度副词、否定词对情感的复杂影响,
260 | :param text: 待分析中文文本
261 | :param sentiwords: 情感词字典;
262 | {'category1': 'category1 词语列表',
263 | 'category2': 'category2词语列表',
264 | 'category3': 'category3词语列表',
265 | ...
266 | }
267 |
268 | :return:
269 | """
270 | result_dict = dict()
271 | senti_categorys = sentiwords.keys()
272 |
273 | for senti_category in senti_categorys:
274 | result_dict[senti_category+'_num'] = 0
275 |
276 | word_num, sentence_num, stopword_num = 0,0,0
277 | sentence_num = len(re.split('[\.。!!?\?\n;;]+', text))
278 | words = jieba.lcut(text)
279 | wordnum = len(words)
280 | for word in words:
281 | if word in STOPWORDS_zh:
282 | stopword_num+=1
283 | for senti_category in senti_categorys:
284 | if word in sentiwords[senti_category]:
285 |                 result_dict[senti_category+'_num'] += 1
286 | result_dict['stopword_num'] = stopword_num
287 | result_dict['sentence_num'] = sentence_num
288 | result_dict['word_num'] = wordnum
289 | return result_dict
290 |
291 |
292 |
293 |
294 |
295 |
296 |
--------------------------------------------------------------------------------
/cntext/dictionary/README.md:
--------------------------------------------------------------------------------
1 |
2 | # 一、项目意义
3 |
4 | 情感分析大多是基于情感词典对文本数据进行分析,所以情感词典好坏、是否完备充足是文本分析的关键。
5 |
6 | 目前常用的词典都是基于形容词,有
7 |
8 | - 知网HowNet
9 | - 大连理工大学情感本体库
10 |
11 | 但是形容词类型的词典在某些情况下不适用,比如
12 |
13 | **华为手机外壳采用金属制作,更耐摔**
14 |
15 | 由于句子中没有形容词,使用形容词情感词典计算得到的情感得分为0。但是**耐摔**这个动词具有**正面积极情绪**,这个句子的情感得分理应为**正**。
16 |
17 |
18 |
19 | 可见,能够简单快速地构建不同领域(手机、汽车等)的情感词典十分重要。但人工构建太慢,如果让机器先把最有可能带情感的候选词找出来,再由人工筛选构建词典,效率会高很多。那么如何快速、自动地新建或扩充词表呢?
20 |
21 |
22 |
23 |
24 |
25 | # 二、构建思路
26 |
27 | - 共现法,参考https://github.com/liuhuanyong/SentimentWordExpansion
28 | - 词向量,参考https://github.com/MS20190155/Measuring-Corporate-Culture-Using-Machine-Learning
29 |
30 |
31 |
32 |
33 |
34 |
35 |
36 | ## 2.1 共现法扩充词表
37 |
38 | 计算机领域有一个算法叫做SO-PMI(情感倾向点互信息)。简单地讲,个体之间不是完全独立的,往往物以类聚、人以群分。如果我们一开始设定少量的
39 |
40 | - 初始正面种子词
41 | - 初始负面种子词
42 |
43 | 程序会按照“物以类聚人以群分”的思路,
44 |
45 | - 根据**初始正面种子词**找到很多大概率为**正面情感的候选词**
46 | - 根据**初始负种子词**找到很多大概率为**负面情感的候选词**
47 |
48 | 这一方法的原始作者是刘焕勇,项目地址 https://github.com/liuhuanyong/SentimentWordExpansion ,我仅仅做了简单的封装。
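SO-PMI打分的思路可以用下面的极简示意来理解(玩具语料,仅为说明思路,并非wordexpansion/cntext的真实实现):

```python
# SO-PMI示意:候选词与正面种子词的点互信息之和,减去与负面种子词的点互信息之和
import math
from collections import Counter
from itertools import combinations

pos_seeds = {'稳健', '稳定'}   # 初始正面种子词(示例)
neg_seeds = {'休克', '关门'}   # 初始负面种子词(示例)

# 假设语料已经分句、分词
corpus = [['市场', '稳定', '企业', '盈利'],
          ['企业', '休克', '裁员'],
          ['政策', '稳健', '企业', '盈利']]

word_freq = Counter(w for sent in corpus for w in set(sent))
pair_freq = Counter(frozenset(p) for sent in corpus
                    for p in combinations(set(sent), 2))
n = len(corpus)

def pmi(w1, w2):
    """点互信息 log( p(w1,w2) / (p(w1)*p(w2)) ),未共现记为0"""
    co = pair_freq[frozenset((w1, w2))]
    if co == 0:
        return 0.0
    return math.log(co * n / (word_freq[w1] * word_freq[w2]))

def so_pmi(word):
    return (sum(pmi(word, s) for s in pos_seeds)
            - sum(pmi(word, s) for s in neg_seeds))

print(so_pmi('盈利'))   # 大于0,倾向正面
print(so_pmi('裁员'))   # 小于0,倾向负面
```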
49 |
50 |
51 |
52 |
53 |
54 | ## 2.2 词向量扩充词表
55 |
56 | 共现法中,词语之间仍被当作彼此独立、不可比较的个体。其实词语里潜藏着很多线索,例如中国传统文化中的金木水火土、性别(阴阳)等信息。例如
57 |
58 | - “铁”、“铜”、“钢”
59 | - “国王“、“王后“、“男人“、“女人“
60 |
61 | 如果能抽取出每个词的特征,把每个词用同样长度的向量表示(例如100维),那么用中学阶段的余弦公式就可以计算任意两个词的相似度。
62 |
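余弦相似度本身的计算很简单,下面用玩具向量(并非真实训练出的词向量)示意:

```python
# 词向量 + 余弦相似度示意:向量夹角越小(余弦越接近1),两个词越相近
import numpy as np

vectors = {
    '铁':   np.array([0.9, 0.1, 0.0, 0.2]),
    '铜':   np.array([0.8, 0.2, 0.1, 0.1]),
    '王后': np.array([0.1, 0.9, 0.8, 0.0]),
}

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos_sim(vectors['铁'], vectors['铜']))    # 约0.98,语义相近
print(cos_sim(vectors['铁'], vectors['王后']))  # 约0.16,相距较远
```
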
63 | 参照论文使用机器学习构建五类企业文化词典
64 |
65 | > Kai Li, Feng Mai, Rui Shen, Xinyan Yan, [**Measuring Corporate Culture Using Machine Learning**](https://academic.oup.com/rfs/advance-article-abstract/doi/10.1093/rfs/hhaa079/5869446?redirectedFrom=fulltext), *The Review of Financial Studies*, 2020
66 | >
67 | > 代码发布在github https://github.com/MS20190155/Measuring-Corporate-Culture-Using-Machine-Learning
68 |
69 | 原作者在github上发布的代码,受限于我的技术水平,很难直接拿来用,于是我做了两处修改:
70 |
71 | - 原作者使用的stanfordnlp处理英文分词;wordexpansion改为jieba处理中文、nltk处理英文
72 |
73 | - 原作者在构建word2vec模型,考虑了Ngram;wordexpansion未考虑Ngram
74 |
75 |
76 |
77 | 两处更改,降低了代码的复杂程度,方便自己封装成包,供大家使用。大家也可根据自己能力,直接使用作者提供的代码。
78 |
79 |
80 |
81 |
82 |
83 |
84 |
85 |
86 | # 三、安装
87 |
88 |
89 |
90 | 最简单的安装方式如下。由于国内外网络不稳定,可能需要多尝试几次。
91 |
92 | ```
93 | pip3 install wordexpansion
94 | ```
95 |
96 |
97 |
98 |
99 |
100 | # 四、test项目文件目录
101 |
102 | >**注意:**
103 | >所有的txt文件,不论输入的还是程序输出的结果,均采用utf-8编码。
104 |
105 | ```
106 | |---test
107 | |---共现法
108 | |--find_newwords.py #共现法测试代码
109 | |--corpus1.txt #语料(媒体报道)文本数据,5.5M
110 | |--test_seed_words.txt #情感种子词,需要手动构建
111 | |--neg_candi.txt #find_newwords.py运行后发现的负面候选词
112 | |--pos_candi.txt #find_newwords.py运行后发现的正面候选词
113 |
114 | |---词向量法
115 | |--run_w2v.py #词向量法测试代码
116 | |--corpus2.txt #语料(企业文化)文本数据,34M
117 | |--seeds #五种企业文化初始候选词(5个txt)
118 | |--model #word2vec训练过程中的模型(运行时产生的副产品)
119 | |--candidate_words #五种企业文化词典(5个txt)
120 |
121 |
122 | ```
123 |
124 |
125 |
126 | # 五、共现法代码
127 |
128 | ### 5.1 准备构建种子词
129 |
130 | 我们最终希望得到的情感词典可能包含几万个词,但种子词也许只需要100个(正面词50个,负面词50个)就够了。
131 |
132 | 手动构建的种子词典**test_seed_words.txt**中
133 |
134 | - 每行一个词
135 | - 每个词用neg或pos标记
136 | - 词与标记用tab键间隔
137 |
138 | ```
139 | 休克 neg
140 | 如出一辙 neg
141 | 渴求 neg
142 | 扎堆 neg
143 | 休整 neg
144 | 关门 neg
145 | 阴晴不定 neg
146 | 喜忧参半 neg
147 | 起起伏伏 neg
148 | 一厢情愿 neg
149 | 松紧 neg
150 | 最全 pos
151 | 雄风 pos
152 | 稳健 pos
153 | 稳定 pos
154 | 拉平 pos
155 | 保供 pos
156 | 修正 pos
157 | 稳 pos
158 | 稳住 pos
159 | 保养 pos
160 | ...
161 | ...
162 | ```
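按照上述格式读取种子词文件的最小示意(文件名沿用上文的test_seed_words.txt,utf-8编码):

```python
# 每行"词<tab>标签",读入后得到 {'pos': [...], 'neg': [...]}
seed_words = {'pos': [], 'neg': []}
with open('test_seed_words.txt', encoding='utf-8') as f:
    for line in f:
        if not line.strip():
            continue
        word, label = line.strip().split('\t')
        seed_words[label].append(word)

print(len(seed_words['pos']), len(seed_words['neg']))
```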
163 |
164 |
165 |
166 | ### 5.2 发现情感新词
167 |
168 | 已经安装好了**wordexpansion**,现在我们新建一个名为**find_newwords.py**的测试代码
169 |
170 | 代码内容如下:
171 |
172 | ```python
173 | from wordexpansion import ChineseSoPmi
174 |
175 | sopmier = ChineseSoPmi(inputtext_file='test_corpus.txt',
176 | seedword_txtfile='test_seed_words.txt',
177 | pos_candi_txt_file='pos_candi.txt',
178 | neg_candi_txtfile='neg_candi.txt')
179 | sopmier.sopmi()
180 | ```
181 |
182 |
183 |
184 | 我们的语料数据**test_corpus.txt**约5.5M,100个种子词,运行程序大概耗时60s。
185 |
186 |
187 |
188 | ### 5.3 输出的结果
189 |
190 | **find_newwords.py**运行结束后,会在**同文件夹内(find_newwords.py所在的文件夹)**发现有两个新的txt文件
191 |
192 | - pos_candi.txt
193 | - neg_candi.txt
194 |
195 | 打开**pos_candi.txt**, 我们看到
196 |
197 | ```
198 | word,sopmi,polarity,word_length,postag
199 | 保持,87.28493062512524,pos,2,v
200 | 风险,70.15627986116269,pos,2,n
201 | 货币政策,66.28476448498694,pos,4,n
202 | 发展,64.40272795986517,pos,2,vn
203 | 不要,63.71800916752807,pos,2,df
204 | 理念,61.2024367757337,pos,2,n
205 | 整体,59.415315156715586,pos,2,n
206 | 下,59.321140440512984,pos,1,f
207 | 引导,58.5817208758171,pos,2,v
208 | 投资,57.71720491331896,pos,2,vn
209 | 加强,57.067969337267684,pos,2,v
210 | 自己,53.25503772499689,pos,2,r
211 | 提升,52.80686380719989,pos,2,v
212 | 和,52.12334472663675,pos,1,c
213 | 稳步,51.58193211655792,pos,2,d
214 | 重要,51.095865548255034,pos,2,a
215 | ...
216 | ```
217 |
218 | 打开**neg_candi.txt**, 我们看到
219 |
220 | ```
221 | word,sopmi,polarity,word_length,postag
222 | 心灵,33.17993872989303,neg,2,n
223 | 期间,31.77900620939178,neg,2,f
224 | 西溪,30.87839808390589,neg,2,ns
225 | 人事,29.594976229171877,neg,2,n
226 | 复杂,29.47870186147108,neg,2,a
227 | 直到,27.86014637934966,neg,2,v
228 | 宰客,27.27304813428452,neg,2,nr
229 | 保险,26.433136238404746,neg,2,n
230 | 迎来,25.83859896903048,neg,2,v
231 | 至少,25.105021416064616,neg,2,d
232 | 融资,25.09148586460598,neg,2,vn
233 | 或,24.48343281812743,neg,1,c
234 | 列,22.20695894382675,neg,1,v
235 | 存在,22.041049266517774,neg,2,v
236 | ...
237 | ```
238 |
239 |
240 |
241 | 从上面的结果看,正面候选词质量较好,负面候选词则稍逊一筹。虽然不够理想,但已经节约了大量人工时间。
242 |
243 | 现在电脑已经帮我们找出候选词,人工只需要逐个检查neg_candi.txt和pos_candi.txt,把不带相应情感倾向的词剔除掉。经过一段时间的剔除工作,针对具体研究领域的专业情感词典就构建出来了。
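人工剔除之前,也可以先按PMI值、词长、词性做一轮机器初筛,例如(阈值为假设值,请按自己的语料调整):

```python
# 对pos_candi.txt(列为 word,sopmi,polarity,word_length,postag)做简单初筛的示意
import pandas as pd

df = pd.read_csv('pos_candi.txt')
keep = df[(df['sopmi'] > 55)                             # PMI得分门槛
          & (df['word_length'] >= 2)                     # 去掉单字词
          & (df['postag'].isin(['a', 'v', 'vn', 'd']))]  # 只保留形容词、动词、副词等
print(keep['word'].tolist())
```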
244 |
245 |
246 |
247 | # 六、词向量法代码
248 |
249 | ## 6.1 准备种子词
250 |
251 | 词向量法会为原始语料中的每个词训练出词向量,此时如果给模型传入种子词,就能根据向量距离的远近找出多个近义词。这里人工构建了五大类企业文化的种子词,分别存放在txt中,即
252 |
253 | - innovation.txt
254 | - integrity.txt
255 | - quality.txt
256 | - respect.txt
257 | - teamwork.txt
258 |
259 | 注意,在txt中,每行一个词语。
260 |
261 |
262 |
263 | ### 6.2 发现情感新词
264 |
265 | 已经安装好了**wordexpansion**,现在我们新建一个名为**run_w2v.py**的测试代码
266 |
267 | 代码内容如下:
268 |
269 | ```python
270 | from wordexpansion import W2VModels
271 | import pandas as pd
272 | import os
273 |
274 | #初始化模型
275 | model = W2VModels(cwd=os.getcwd())
276 | model.train(documents=list(open('data/documents.txt').readlines()))
277 |
278 | #导入种子词
279 | integrity = [w for w in open('seeds/integrity.txt').read().split('\n') if w!='']
280 | innovation = [w for w in open('seeds/innovation.txt').read().split('\n') if w!='']
281 | quality = [w for w in open('seeds/quality.txt').read().split('\n') if w!='']
282 | respect = [w for w in open('seeds/respect.txt').read().split('\n') if w!='']
283 | teamwork = [w for w in open('seeds/teamwork.txt').read().split('\n') if w!='']
284 |
285 | #根据种子词,筛选出每类词最相近的前100个词
286 | model.find(seedwords=integrity, seedwordsname='integrity', topn=100)
287 | model.find(seedwords=innovation, seedwordsname='innovation', topn=100)
288 | model.find(seedwords=quality, seedwordsname='quality', topn=100)
289 | model.find(seedwords=respect, seedwordsname='respect', topn=100)
290 | model.find(seedwords=teamwork, seedwordsname='teamwork', topn=100)
291 |
292 | ```
293 |
294 |
295 |
296 | 我们的语料数据documents.txt约30+M,50多个种子词,运行程序大概耗时30s。
297 |
298 |
299 |
300 | ### 6.3 输出的结果
301 |
302 | **run_w2v.py**运行结束后,会在**candidate_words内**发现有5个新的txt文件
303 |
304 | - innovation.txt
305 | - integrity.txt
306 | - quality.txt
307 | - respect.txt
308 | - teamwork.txt
309 |
310 | 打开**innovation.txt**, 我们看到
311 |
312 | ```
313 | innovation
314 | innovate
315 | innovative
316 | creativity
317 | creative
318 | create
319 | passion
320 | passionate
321 | efficiency
322 | efficient
323 | excellence
324 | pride
325 | enhance
326 | expertise
327 | optimizing
328 | adapt
329 | capability
330 | awareness
331 | creating
332 | value-added
333 | optimize
334 | leveraging
335 | attract
336 | innovative
337 | manufacture
338 | efficient
339 | integrate
340 | better-for-you
341 | enhanced
342 | efficiently
343 | consolidate
344 | automation
345 | resources
346 | infrastructure
347 | innovation
348 | talent
349 | skills
350 | communicate
351 | differentiated
352 | network
353 | supporting
354 | dsd
355 | capture
356 | efficiency
357 | capabilities
358 | productive
359 | speed
360 | organized
361 | manual
362 | manage
363 | cost-effective
364 | simpler
365 | training
366 | technology
367 | merchandising
368 | interact
369 | drive
370 | organization
371 | reliability
372 | backbone
373 | strengthen
374 | attracting
375 | maximizing
376 | fine-tune
377 | enable
378 | headquarter
379 | platform
380 | tightly
381 | aligned
382 | flexible
383 | fulfillment
384 | rationalize
385 | back-office
386 | ensure
387 | manufacturing
388 | efficiencies
389 | effort
390 | technological
391 | retain
392 | proprietary
393 | durable
394 | diligent
395 | wap
396 | talented
397 | excitement
398 | logistical
399 | utilize
400 | bandwidth
401 | invest
402 | diversify
403 | higher-margin
404 | pride
405 | selecting
406 | managing
407 | departments
408 | engaging
409 | coordination
410 | multinational
411 | efforts
412 | store-within-store
413 | procurement
414 | workarounds
415 | nurture
416 | provides
417 | breadth
418 | viable
419 | superb
420 | digital
421 | smarter
422 | introducing
423 | beef
424 | proposition
425 | ```
426 |
427 | 打开**respect.txt**,我们看到
428 |
429 | ```
430 | respectful
431 | talent
432 | talented
433 | employee
434 | dignity
435 | empowerment
436 | empower
437 | skills
438 | backbone
439 | training
440 | database
441 | designers
442 | sdk
443 | recruit
444 | engine
445 | dealers
446 | selecting
447 | resource
448 | onsite
449 | computer
450 | functions
451 | wholesalers
452 | educational
453 | expertise
454 | coordination
455 | value-added
456 | creative
457 | individuals
458 | managers
459 | pride
460 | technological
461 | awareness
462 | salespeople
463 | organized
464 | electrical
465 | reputation
466 | tools
467 | web-based
468 | fulfillment
469 | in-house
470 | staff
471 | motor
472 | crm
473 | communications
474 | attracting
475 | departments
476 | databases
477 | warsaw
478 | optimized
479 | functionality
480 | faces
481 | tool
482 | supported
483 | commission-based
484 | transportation
485 | centralized
486 | sponsor
487 | knowledge
488 | train
489 | assigned
490 | physician
491 | viability
492 | brokerage
493 | networks
494 | culture
495 | interior
496 | connecting
497 | leveraging
498 | mwd
499 | systems
500 | incentivized
501 | mission
502 | affiliated
503 | high-quality
504 | ecosystem
505 | eradication
506 | processes
507 | simplify
508 | on-site
509 | continuously
510 | recruiting
511 | practices
512 | dedicated
513 | adequately
514 | headquarter
515 | var
516 | practice
517 | airclic
518 | police
519 | architectural
520 | painlessly
521 | employing
522 | near-field
523 | corporations
524 | organization
525 | onshore
526 | adjacencies
527 | social
528 | well-known
529 | trained
530 | sap
531 | complement
532 | odms
533 | resources
534 | gasification
535 | salesforce
536 | third-party
537 |
538 | ```
539 |
540 | 同理,在其他几类企业文化词典txt中产生了符合预期的词语。
541 |
542 | 现在电脑已经帮我们找出5类企业文化候选词,人工只需要逐个检查这5个txt文件,把不属于对应企业文化类别的词剔除掉。经过一段时间的剔除工作,针对具体研究领域的专业词典就构建出来了。
543 |
544 |
545 |
546 |
547 |
548 |
549 |
550 | # 七、注意:
551 | 1. so_pmi算法效果受训练语料影响,语料规模越大,效果越好
552 |
553 | 2. so_pmi算法效率受训练语料影响,语料越大,训练越耗时。100个种子词,5M的数据,大约耗时62.679秒
554 |
555 | 3. 候选词的选择,可根据PMI值,词长,词性设定规则,进行筛选
556 |
557 | 4. **所有的txt文件,不论输入的还是程序输出的结果,均采用utf-8编码。**
558 |
559 | 5. 词向量法没有考虑Ngram,如果采用了Ngram, 可能会挖掘出该场景下的词语组合。但是程序运行时间可能会更慢。
560 |
561 | 6. 如果刚好也想使用**企业文化5大类**这个具体场景,记得引用论文:
562 |
563 |    > Kai Li, Feng Mai, Rui Shen, Xinyan Yan, Measuring Corporate Culture Using Machine Learning, *The Review of Financial Studies*, 2020
564 |
565 |
566 |
567 |
568 |
569 | # 如果
570 |
571 | 如果您是经管人文社科专业背景,编程小白,面临海量文本数据采集和处理分析艰巨任务,可以参看[《python网络爬虫与文本数据分析》](https://ke.qq.com/course/482241?tuin=163164df)视频课。作为文科生,一样也是从两眼一抹黑开始,这门课程是用五年时间凝缩出来的。自认为讲的很通俗易懂o(* ̄︶ ̄*)o,
572 |
573 | - python入门
574 | - 网络爬虫
575 | - 数据读取
576 | - 文本分析入门
577 | - 机器学习与文本分析
578 | - 文本分析在经管研究中的应用
579 |
580 | 感兴趣的童鞋不妨 戳一下[《python网络爬虫与文本数据分析》](https://ke.qq.com/course/482241?tuin=163164df)进来看看~
581 |
582 | [](https://ke.qq.com/course/482241?tuin=163164df)
583 |
584 |
585 |
586 | # 更多
587 |
588 | - [B站:大邓和他的python](https://space.bilibili.com/122592901/channel/detail?cid=66008)
589 |
590 | - 公众号:大邓和他的python
591 |
592 | - [知乎专栏:数据科学家](https://zhuanlan.zhihu.com/dadeng)
593 |
594 | 
595 |
596 |
597 |
598 |
599 |
600 |
601 |
--------------------------------------------------------------------------------
/examples/01-cntext.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": 1,
6 | "metadata": {},
7 | "outputs": [
8 | {
9 | "name": "stdout",
10 | "output_type": "stream",
11 | "text": [
12 | "Help on package cntext:\n",
13 | "\n",
14 | "NAME\n",
15 | " cntext\n",
16 | "\n",
17 | "PACKAGE CONTENTS\n",
18 | " description (package)\n",
19 | " dictionary (package)\n",
20 | " sentiment (package)\n",
21 | " similarity (package)\n",
22 | " visualization (package)\n",
23 | "\n",
24 | "DATA\n",
25 | " ADV_words = ['都', '全', '单', '共', '光', '尽', '净', '仅', '就', '只', '一共', '...\n",
26 | " CONJ_words = ['乃', '乍', '与', '无', '且', '丕', '为', '共', '其', '况', '厥', '...\n",
27 | " DUTIR_Ais = {'sigh', '一命呜呼', '一场春梦', '一场空', '一头跌在菜刀上-切肤之痛', '一念之差', .....\n",
28 | " DUTIR_Haos = {'1兒巴经', '3x', '8错', 'BUCUO', 'Cool毙', 'NB', ...}\n",
29 | " DUTIR_Jings = {'848', 'FT', '_god', 'yun', '一个骰子掷七点-出乎意料', '一举成名', ......\n",
30 | " DUTIR_Jus = {'一则以喜,一则以惧', '一发千钧', '一年被蛇咬,三年怕草索', '一座皆惊', '一脸横肉', '一蛇两头...\n",
31 | " DUTIR_Les = {':)', 'CC', 'Happy', 'LOL', '_so', 'haha', ...}\n",
32 | " DUTIR_Nus = {'2气斗狠', 'MD', 'TNND', 'gun', 'kao', '一刀两断', ...}\n",
33 | " DUTIR_Wus = {'B4', 'BD', 'BS', 'HC', 'HJ', 'JJWW', ...}\n",
34 | " HOWNET_deny = {'不', '不可', '不是', '不能', '不要', '休', ...}\n",
35 | " HOWNET_extreme = {'万', '万万', '万分', '万般', '不亦乐乎', '不可开交', ...}\n",
36 | " HOWNET_ish = {'一些', '一点', '一点儿', '不丁点儿', '不大', '不怎么', ...}\n",
37 | " HOWNET_more = {'多', '大不了', '如斯', '尤甚', '强', '愈', ...}\n",
38 | " HOWNET_neg = {'一下子爆发', '一下子爆发的一连串', '一不小心', '一个屁', '一仍旧贯', '一偏', ...}\n",
39 | " HOWNET_pos = {'', '一专多能', '一丝不差', '一丝不苟', '一个心眼儿', '一五一十', ...}\n",
40 | " HOWNET_very = {'不为过', '不少', '不胜', '不过', '何啻', '何止', ...}\n",
41 | " STOPWORDS_en = {'a', 'about', 'above', 'across', 'after', 'afterwards'...\n",
42 | " STOPWORDS_zh = {'、', '。', '〈', '〉', '《', '》', ...}\n",
43 | "\n",
44 | "FILE\n",
45 | " /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/cntext/__init__.py\n",
46 | "\n",
47 | "\n"
48 | ]
49 | }
50 | ],
51 | "source": [
52 | "import cntext\n",
53 | "\n",
54 | "help(cntext)"
55 | ]
56 | },
57 | {
58 | "cell_type": "code",
59 | "execution_count": 2,
60 | "metadata": {},
61 | "outputs": [
62 | {
63 | "name": "stdout",
64 | "output_type": "stream",
65 | "text": [
66 | "\n",
67 | " 【大连理工大学情感本体库】\n",
68 | " 七大情绪分类,依次是哀、恶、好、惊、惧、乐、怒;对应的情绪词表依次:\n",
69 | " DUTIR_Ais = {\"泣血捶膺\", \"望断白云\", \"日暮途穷\", \"身微力薄\"...}\n",
70 | " DUTIR_Wus = {\"饰非遂过\", \"恶语\", \"毁害\", \"恶籍盈指\", \"脾气爆躁\", \"淫贱\", \"凌乱\"...}\n",
71 | " DUTIR_Haos = {\"打破砂锅璺到底\", \"多彩\", \"披沙拣金\", \"见机行事\", \"精神饱满\"...}\n",
72 | " DUTIR_Jings = {\"骇人视听\", \"拍案惊奇\", \"悬念\", \"无翼而飞\", \"原来\", \"冷门\"...}\n",
73 | " DUTIR_Jus ={\"山摇地动\", \"月黑风高\", \"流血\", \"老鼠偷猫饭-心惊肉跳\", \"一发千钧\"...}\n",
74 | " DUTIR_Les ={\"含哺鼓腹\", \"欢呼鼓舞\", \"莺歌蝶舞\", \"将伯之助\", \"逸兴横飞\", \"舒畅\"...}\n",
75 | " DUTIR_Nus = {\"怨气满腹\", \"面有愠色\", \"愤愤\", \"直眉瞪眼\", \"负气斗狠\", \"挑眼\"...}\n",
76 | " \n",
77 | " 【知网Hownet词典】\n",
78 | " 含正负形容词、否定词、副词等词表,对应的词表依次:\n",
79 | " HOWNET_deny = {\"不\", \"不是\", \"不能\", \"不可\"...}\n",
80 | " HOWNET_extreme = {\"百分之百\", \"倍加\", \"备至\", \"不得了\"...}\n",
81 | " HOWNET_ish = {\"点点滴滴\", \"多多少少\", \"怪\", \"好生\", \"还\", \"或多或少\"...}\n",
82 | " HOWNET_more = {\"大不了\", \"多\", \"更\", \"比较\", \"更加\", \"更进一步\", \"更为\", \"还\", \"还要\"...}\n",
83 | " HOWNET_neg = {\"压坏\", \"鲁莽的\", \"被控犯罪\", \"银根紧\", \"警惕的\", \"残缺\", \"致污物\", \"柔弱\"...}\n",
84 | " HOWNET_pos = {\"无误\", \"感激不尽\", \"受大众欢迎\", \"敬礼\", \"文雅\", \"一尘不染\", \"高精度\", \"兴盛\"...}\n",
85 | " HOWNET_very = {\"不为过\", \"超\", \"超额\", \"超外差\", \"超微结构\", \"超物质\", \"出头\"...}\n",
86 | " \n",
87 | " 【停用词表】\n",
88 | " 中英文停用词表,依次\n",
89 | " STOPWORDS_zh = {\"经\", \"得\", \"则甚\", \"跟\", \"好\", \"具体地说\"...}\n",
90 | " STOPWORDS_en = {'a', 'about', 'above', 'across', 'after'...}\n",
91 | " \n",
92 | " 【中文副词/连词】\n",
93 | " 副词ADV、连词CONJ\n",
94 | " ADV_words = ['都', '全', '单', '共', '光'...}\n",
95 | " CONJ_words = ['乃', '乍', '与', '无', '且'...}\n",
96 | " \n"
97 | ]
98 | }
99 | ],
100 | "source": [
101 | "from cntext import dict_info\n",
102 | "\n",
103 | "dict_info()"
104 | ]
105 | },
106 | {
107 | "cell_type": "code",
108 | "execution_count": 3,
109 | "metadata": {
110 | "collapsed": true
111 | },
112 | "outputs": [
113 | {
114 | "data": {
115 | "text/plain": [
116 | "['乃',\n",
117 | " '乍',\n",
118 | " '与',\n",
119 | " '无',\n",
120 | " '且',\n",
121 | " '丕',\n",
122 | " '为',\n",
123 | " '共',\n",
124 | " '其',\n",
125 | " '况',\n",
126 | " '厥',\n",
127 | " '则',\n",
128 | " '那',\n",
129 | " '兼',\n",
130 | " '凭',\n",
131 | " '即',\n",
132 | " '却',\n",
133 | " '今',\n",
134 | " '以',\n",
135 | " '令',\n",
136 | " '会',\n",
137 | " '任',\n",
138 | " '但',\n",
139 | " '使',\n",
140 | " '便',\n",
141 | " '倘',\n",
142 | " '借',\n",
143 | " '假',\n",
144 | " '傥',\n",
145 | " '单',\n",
146 | " '讵',\n",
147 | " '设',\n",
148 | " '谓',\n",
149 | " '及',\n",
150 | " '苟',\n",
151 | " '若',\n",
152 | " '连',\n",
153 | " '迨',\n",
154 | " '适',\n",
155 | " '将',\n",
156 | " '并',\n",
157 | " '当',\n",
158 | " '带',\n",
159 | " '句',\n",
160 | " '同',\n",
161 | " '向',\n",
162 | " '和',\n",
163 | " '唯',\n",
164 | " '噎',\n",
165 | " '噫',\n",
166 | " '宁',\n",
167 | " '如',\n",
168 | " '饶',\n",
169 | " '抑',\n",
170 | " '浸',\n",
171 | " '纵',\n",
172 | " '维',\n",
173 | " '缘',\n",
174 | " '坐',\n",
175 | " '因',\n",
176 | " '惟',\n",
177 | " '就',\n",
178 | " '子',\n",
179 | " '焉',\n",
180 | " '然',\n",
181 | " '载',\n",
182 | " '旋',\n",
183 | " '或',\n",
184 | " '所',\n",
185 | " '既',\n",
186 | " '斯',\n",
187 | " '果',\n",
188 | " '故',\n",
189 | " '是',\n",
190 | " '暨',\n",
191 | " '必',\n",
192 | " '忍',\n",
193 | " '总',\n",
194 | " '恁',\n",
195 | " '更',\n",
196 | " '脱',\n",
197 | " '爰',\n",
198 | " '甚',\n",
199 | " '盖',\n",
200 | " '直',\n",
201 | " '矧',\n",
202 | " '由',\n",
203 | " '用',\n",
204 | " '虽',\n",
205 | " '而',\n",
206 | " '耳',\n",
207 | " '要',\n",
208 | " '至',\n",
209 | " '第',\n",
210 | " '管',\n",
211 | " '自',\n",
212 | " '跟',\n",
213 | " '须',\n",
214 | " '',\n",
215 | " '纵使',\n",
216 | " '纵令',\n",
217 | " '纵然',\n",
218 | " '再说',\n",
219 | " '虚词',\n",
220 | " '至于',\n",
221 | " '至若',\n",
222 | " '至乎',\n",
223 | " '至如',\n",
224 | " '只有',\n",
225 | " '只要',\n",
226 | " '至乃',\n",
227 | " '致使',\n",
228 | " '自然',\n",
229 | " '再不',\n",
230 | " '于是乎',\n",
231 | " '于是',\n",
232 | " '又且',\n",
233 | " '与其',\n",
234 | " '由于',\n",
235 | " '因而',\n",
236 | " '因为',\n",
237 | " '以致',\n",
238 | " '以至',\n",
239 | " '要是',\n",
240 | " '以及',\n",
241 | " '以下',\n",
242 | " '以为',\n",
243 | " '虚字',\n",
244 | " '焉乃',\n",
245 | " '勿然',\n",
246 | " '万一',\n",
247 | " '无论',\n",
248 | " '忘其',\n",
249 | " '亡其',\n",
250 | " '倘然',\n",
251 | " '倘使',\n",
252 | " '所以',\n",
253 | " '顺推',\n",
254 | " '庶几',\n",
255 | " '顺接',\n",
256 | " '是故',\n",
257 | " '是以',\n",
258 | " '甚至',\n",
259 | " '设或',\n",
260 | " '设若',\n",
261 | " '便乃',\n",
262 | " '便做',\n",
263 | " '别管',\n",
264 | " '不然',\n",
265 | " '不论',\n",
266 | " '不怕',\n",
267 | " '不拘',\n",
268 | " '不仅',\n",
269 | " '不但',\n",
270 | " '不过',\n",
271 | " '不管',\n",
272 | " '不料',\n",
273 | " '诚然',\n",
274 | " '词类',\n",
275 | " '除非',\n",
276 | " '但凡',\n",
277 | " '从而',\n",
278 | " '等到',\n",
279 | " '当使',\n",
280 | " '分句',\n",
281 | " '而亦',\n",
282 | " '而乃',\n",
283 | " '尔其',\n",
284 | " '而且',\n",
285 | " '反而',\n",
286 | " '而况',\n",
287 | " '否则',\n",
288 | " '固然',\n",
289 | " '故尔',\n",
290 | " '果然',\n",
291 | " '或曰',\n",
292 | " '或是',\n",
293 | " '或者',\n",
294 | " '何况',\n",
295 | " '及至',\n",
296 | " '及以',\n",
297 | " '既然',\n",
298 | " '即使',\n",
299 | " '假借义',\n",
300 | " '加以',\n",
301 | " '加之',\n",
302 | " '尽管',\n",
303 | " '借如',\n",
304 | " '借令',\n",
305 | " '尽管',\n",
306 | " '借使',\n",
307 | " '就是',\n",
308 | " '可是',\n",
309 | " '况乎',\n",
310 | " '况于',\n",
311 | " '连絶',\n",
312 | " '况且',\n",
313 | " '哪怕',\n",
314 | " '乃至',\n",
315 | " '譬如',\n",
316 | " '丕则',\n",
317 | " '丕乃',\n",
318 | " '且夫',\n",
319 | " '然则',\n",
320 | " '然而',\n",
321 | " '任凭',\n",
322 | " '如其',\n",
323 | " '如果']"
324 | ]
325 | },
326 | "execution_count": 3,
327 | "metadata": {},
328 | "output_type": "execute_result"
329 | }
330 | ],
331 | "source": [
332 | "from cntext import CONJ_words, ADV_words\n",
333 | "\n",
334 | "CONJ_words"
335 | ]
336 | },
337 | {
338 | "cell_type": "code",
339 | "execution_count": 4,
340 | "metadata": {
341 | "collapsed": true
342 | },
343 | "outputs": [
344 | {
345 | "data": {
346 | "text/plain": [
347 | "['都',\n",
348 | " '全',\n",
349 | " '单',\n",
350 | " '共',\n",
351 | " '光',\n",
352 | " '尽',\n",
353 | " '净',\n",
354 | " '仅',\n",
355 | " '就',\n",
356 | " '只',\n",
357 | " '一共',\n",
358 | " '一起',\n",
359 | " '一同',\n",
360 | " '一道',\n",
361 | " '一齐',\n",
362 | " '一概',\n",
363 | " '一味',\n",
364 | " '统统',\n",
365 | " '总共',\n",
366 | " '仅仅',\n",
367 | " '惟独',\n",
368 | " '可',\n",
369 | " '倒',\n",
370 | " '一定',\n",
371 | " '必定',\n",
372 | " '必然',\n",
373 | " '却',\n",
374 | " '',\n",
375 | " '就',\n",
376 | " '幸亏',\n",
377 | " '难道',\n",
378 | " '何尝',\n",
379 | " '偏偏',\n",
380 | " '索性',\n",
381 | " '简直',\n",
382 | " '反正',\n",
383 | " '多亏',\n",
384 | " '也许',\n",
385 | " '大约',\n",
386 | " '好在',\n",
387 | " '敢情',\n",
388 | " '不',\n",
389 | " '没',\n",
390 | " '没有',\n",
391 | " '别',\n",
392 | " '刚',\n",
393 | " '恰好',\n",
394 | " '正',\n",
395 | " '将',\n",
396 | " '老是',\n",
397 | " '总是',\n",
398 | " '早就',\n",
399 | " '已经',\n",
400 | " '正在',\n",
401 | " '立刻',\n",
402 | " '马上',\n",
403 | " '起初',\n",
404 | " '原先',\n",
405 | " '一向',\n",
406 | " '永远',\n",
407 | " '从来',\n",
408 | " '偶尔',\n",
409 | " '随时',\n",
410 | " '忽然',\n",
411 | " '很',\n",
412 | " '极',\n",
413 | " '最',\n",
414 | " '太',\n",
415 | " '更',\n",
416 | " '更加',\n",
417 | " '格外',\n",
418 | " '十分',\n",
419 | " '极其',\n",
420 | " '比较',\n",
421 | " '相当',\n",
422 | " '稍微',\n",
423 | " '略微',\n",
424 | " '多么',\n",
425 | " '仿佛',\n",
426 | " '渐渐',\n",
427 | " '百般',\n",
428 | " '特地',\n",
429 | " '互相',\n",
430 | " '擅自',\n",
431 | " '几乎',\n",
432 | " '逐渐',\n",
433 | " '逐步',\n",
434 | " '猛然',\n",
435 | " '依然',\n",
436 | " '仍然',\n",
437 | " '当然',\n",
438 | " '毅然',\n",
439 | " '果然',\n",
440 | " '差点儿']"
441 | ]
442 | },
443 | "execution_count": 4,
444 | "metadata": {},
445 | "output_type": "execute_result"
446 | }
447 | ],
448 | "source": [
449 | "ADV_words"
450 | ]
451 | },
452 | {
453 | "cell_type": "code",
454 | "execution_count": null,
455 | "metadata": {},
456 | "outputs": [],
457 | "source": []
458 | }
459 | ],
460 | "metadata": {
461 | "kernelspec": {
462 | "display_name": "Python 3",
463 | "language": "python",
464 | "name": "python3"
465 | },
466 | "language_info": {
467 | "codemirror_mode": {
468 | "name": "ipython",
469 | "version": 3
470 | },
471 | "file_extension": ".py",
472 | "mimetype": "text/x-python",
473 | "name": "python",
474 | "nbconvert_exporter": "python",
475 | "pygments_lexer": "ipython3",
476 | "version": "3.7.5"
477 | },
478 | "toc": {
479 | "base_numbering": 1,
480 | "nav_menu": {},
481 | "number_sections": true,
482 | "sideBar": true,
483 | "skip_h1_title": false,
484 | "title_cell": "Table of Contents",
485 | "title_sidebar": "Contents",
486 | "toc_cell": false,
487 | "toc_position": {},
488 | "toc_section_display": true,
489 | "toc_window_display": false
490 | }
491 | },
492 | "nbformat": 4,
493 | "nbformat_minor": 4
494 | }
495 |
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # cntext
2 |
3 | A Chinese text-analysis library: word-frequency statistics, dictionary expansion, sentiment and emotion analysis, text similarity, readability, and more.
4 | 
5 | - [GitHub repo](https://github.com/hidadeng/cntext) ``https://github.com/hidadeng/cntext``
6 | - [PyPI page](https://pypi.org/project/cntext/) ``https://pypi.org/project/cntext/``
7 | - [Video course: **Python网络爬虫与文本数据分析**](https://ke.qq.com/course/482241?tuin=163164df)
8 |
9 |
10 |
11 | The package contains the following modules:
12 | 
13 | - **cntext**
14 |   - **stats**  text statistics, readability, etc.
15 |   - **dictionary**  building word lists (dictionaries)
16 |   - **sentiment**  sentiment analysis
17 |   - **similarity**  text similarity
18 |   - **visualization**  visualization, e.g. word clouds
19 |
20 |
21 |
22 |
23 |
24 | ## Installation
25 |
26 | ```
27 | pip install cntext==0.9
28 | ```
29 |
30 |
31 |
32 |
33 |
34 | ## 1. cntext
35 | 
36 | View basic information about cntext:
37 |
38 | ```python
39 | import cntext
40 |
41 | help(cntext)
42 | ```
43 |
44 | Run
45 |
46 | ```
47 | Help on package cntext:
48 |
49 | NAME
50 | cntext
51 |
52 | PACKAGE CONTENTS
53 | description (package)
54 | dictionary (package)
55 | sentiment (package)
56 | similarity (package)
57 | visualization (package)
58 |
59 | DATA
60 | ADV_words = ['都', '全', '单', '共', '光', '尽', '净', '仅', '就', '只', '一共', '...
61 | CONJ_words = ['乃', '乍', '与', '无', '且', '丕', '为', '共', '其', '况', '厥', '...
62 | DUTIR_Ais = {'sigh', '一命呜呼', '一场春梦', '一场空', '一头跌在菜刀上-切肤之痛', '一念之差', .....
63 | DUTIR_Haos = {'1兒巴经', '3x', '8错', 'BUCUO', 'Cool毙', 'NB', ...}
64 | DUTIR_Jings = {'848', 'FT', '_god', 'yun', '一个骰子掷七点-出乎意料', '一举成名', ......
65 | DUTIR_Jus = {'一则以喜,一则以惧', '一发千钧', '一年被蛇咬,三年怕草索', '一座皆惊', '一脸横肉', '一蛇两头...
66 | DUTIR_Les = {':)', 'CC', 'Happy', 'LOL', '_so', 'haha', ...}
67 | DUTIR_Nus = {'2气斗狠', 'MD', 'TNND', 'gun', 'kao', '一刀两断', ...}
68 | DUTIR_Wus = {'B4', 'BD', 'BS', 'HC', 'HJ', 'JJWW', ...}
69 | HOWNET_deny = {'不', '不可', '不是', '不能', '不要', '休', ...}
70 | HOWNET_extreme = {'万', '万万', '万分', '万般', '不亦乐乎', '不可开交', ...}
71 | HOWNET_ish = {'一些', '一点', '一点儿', '不丁点儿', '不大', '不怎么', ...}
72 | HOWNET_more = {'多', '大不了', '如斯', '尤甚', '强', '愈', ...}
73 | HOWNET_neg = {'一下子爆发', '一下子爆发的一连串', '一不小心', '一个屁', '一仍旧贯', '一偏', ...}
74 | HOWNET_pos = {'', '一专多能', '一丝不差', '一丝不苟', '一个心眼儿', '一五一十', ...}
75 | HOWNET_very = {'不为过', '不少', '不胜', '不过', '何啻', '何止', ...}
76 | STOPWORDS_en = {'a', 'about', 'above', 'across', 'after', 'afterwards'...
77 | STOPWORDS_zh = {'、', '。', '〈', '〉', '《', '》', ...}
78 |
79 | FILE
80 | /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/cntext/__init__.py
81 | ```
82 |
83 |
84 |
85 | ```python
86 | from cntext import dict_info
87 |
88 | dict_info()
89 | ```
90 |
91 | Run
92 |
93 | ```
94 | 【大连理工大学情感本体库】
95 | 七大情绪分类,依次是哀、恶、好、惊、惧、乐、怒;对应的情绪词表依次:
96 | DUTIR_Ais = {"泣血捶膺", "望断白云", "日暮途穷", "身微力薄"...}
97 | DUTIR_Wus = {"饰非遂过", "恶语", "毁害", "恶籍盈指", "脾气爆躁", "淫贱", "凌乱"...}
98 | DUTIR_Haos = {"打破砂锅璺到底", "多彩", "披沙拣金", "见机行事", "精神饱满"...}
99 | DUTIR_Jings = {"骇人视听", "拍案惊奇", "悬念", "无翼而飞", "原来", "冷门"...}
100 | DUTIR_Jus ={"山摇地动", "月黑风高", "流血", "老鼠偷猫饭-心惊肉跳", "一发千钧"...}
101 | DUTIR_Les ={"含哺鼓腹", "欢呼鼓舞", "莺歌蝶舞", "将伯之助", "逸兴横飞", "舒畅"...}
102 | DUTIR_Nus = {"怨气满腹", "面有愠色", "愤愤", "直眉瞪眼", "负气斗狠", "挑眼"...}
103 |
104 | 【知网Hownet词典】
105 | 含正负形容词、否定词、副词等词表,对应的词表依次:
106 | HOWNET_deny = {"不", "不是", "不能", "不可"...}
107 | HOWNET_extreme = {"百分之百", "倍加", "备至", "不得了"...}
108 | HOWNET_ish = {"点点滴滴", "多多少少", "怪", "好生", "还", "或多或少"...}
109 | HOWNET_more = {"大不了", "多", "更", "比较", "更加", "更进一步", "更为", "还", "还要"...}
110 | HOWNET_neg = {"压坏", "鲁莽的", "被控犯罪", "银根紧", "警惕的", "残缺", "致污物", "柔弱"...}
111 | HOWNET_pos = {"无误", "感激不尽", "受大众欢迎", "敬礼", "文雅", "一尘不染", "高精度", "兴盛"...}
112 | HOWNET_very = {"不为过", "超", "超额", "超外差", "超微结构", "超物质", "出头"...}
113 |
114 | 【停用词表】
115 | 中英文停用词表,依次
116 | STOPWORDS_zh = {"经", "得", "则甚", "跟", "好", "具体地说"...}
117 | STOPWORDS_en = {'a', 'about', 'above', 'across', 'after'...}
118 |
119 | 【中文副词/连词】
120 | 副词ADV、连词CONJ
121 | ADV_words = ['都', '全', '单', '共', '光'...}
122 | CONJ_words = ['乃', '乍', '与', '无', '且'...}
123 | ```
124 |
125 |
126 |
127 | View a word list:
128 |
129 | ```python
130 | from cntext import CONJ_words, ADV_words
131 |
132 | # get the conjunction word list
133 | CONJ_words
134 | ```
135 |
136 | Run
137 |
138 | ```
139 | ['乃',
140 | '乍',
141 | '与',
142 | '无',
143 | '且',
144 | '丕',
145 | '为',
146 | '共',
147 | '其',
148 | '况',
149 | '厥',
150 | '则',
151 | '那',
152 | '兼',
153 | ...
154 | ]
155 | ```
156 |
157 |
158 |
159 |
160 |
161 | ## 2. stats
162 | 
163 | Currently includes:
164 | 
165 | - term_freq  word-frequency function, returns a Counter
166 | - readability  Chinese readability metrics
167 |
168 | ```python
169 | from cntext.stats import term_freq, readability
170 |
171 | text = '如何看待一网文作者被黑客大佬盗号改文,因万分惭愧而停更'
172 | term_freq(text)
173 | ```
174 |
175 | ```
176 | Counter({'看待': 1,
177 | '网文': 1,
178 | '作者': 1,
179 | '黑客': 1,
180 | '大佬': 1,
181 | '盗号': 1,
182 | '改文因': 1,
183 | '万分': 1,
184 | '惭愧': 1,
185 | '停': 1})
186 | ```
187 |
188 |
189 |
190 |
191 |
192 | The **Chinese readability** metrics follow
193 |
194 | > 徐巍,姚振晔,陈冬华.中文年报可读性:衡量与检验[J].会计研究,2021(03):28-44.
195 |
196 | - readability1 --- average number of characters per clause
197 | - readability2 --- proportion of adverbs and conjunctions in each sentence
198 | - readability3 --- modeled on the Fog Index: readability3 = (readability1 + readability2) × 0.5
199 | 
200 | 
201 | The larger any of these three indicators, the more complex the text and the poorer its readability.
202 |
203 | ```python
204 | readability(text)
205 | ```
206 |
207 | ```
208 | {'readability1': 27.0,
209 | 'readability2': 0.17647058823529413,
210 | 'readability3': 13.588235294117647}
211 | ```
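
As a check on the example above, readability3 = (27.0 + 0.17647) × 0.5 ≈ 13.588, which matches the value returned.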
212 |
213 |
214 |
215 |
216 |
217 |
218 |
219 | ## 3. dictionary
220 | 
221 | This module builds and expands word lists (dictionaries). It contains:
222 | 
223 | - SoPmi  expand a dictionary with the co-occurrence (SO-PMI) method
224 | - W2VModels  expand a dictionary with word2vec word embeddings
225 | 
226 | ### 3.1 SoPmi (co-occurrence method)
227 |
228 | ```python
229 | from cntext.dictionary import SoPmi
230 | import os
231 |
232 | sopmier = SoPmi(cwd=os.getcwd(),
233 |                 input_txt_file='data/sopmi_corpus.txt',        # raw data: your corpus
234 |                 seedword_txt_file='data/sopmi_seed_words.txt', # manually selected seed words
235 | )
236 |
237 | sopmier.sopmi()
238 | ```
239 |
240 | Run
241 |
242 | ```
243 | step 1/4:...seg corpus ...
244 | Loading model cost 0.678 seconds.
245 | Prefix dict has been built successfully.
246 | step 1/4 finished:...cost 60.78995203971863...
247 | step 2/4:...collect cowords ...
248 | step 2/4 finished:...cost 0.6169600486755371...
249 | step 3/4:...compute sopmi ...
250 | step 1/4 finished:...cost 0.26422882080078125...
251 | step 4/4:...save candiwords ...
252 | finished! cost 61.8965539932251
253 | ```
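
For reference, SoPmi follows the standard SO-PMI idea (Turney & Littman, 2003): a candidate word is scored by how much more strongly it co-occurs with positive seed words than with negative ones. A generic formulation (cntext's exact counting and smoothing may differ) is:

$$\mathrm{SO\text{-}PMI}(w)=\sum_{p\in P}\mathrm{PMI}(w,p)-\sum_{n\in N}\mathrm{PMI}(w,n),\qquad \mathrm{PMI}(w,s)=\log_2\frac{p(w,s)}{p(w)\,p(s)}$$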
254 |
255 |
256 |
257 |
258 |
259 | ### 3.2 W2VModels (word vectors)
260 |
261 | ```python
262 | from cntext.dictionary import W2VModels
263 | import os
264 |
265 | # initialize the model
266 | model = W2VModels(cwd=os.getcwd())  # corpus file: data/w2v_corpus.txt
267 | model.train(input_txt_file='data/w2v_corpus.txt')
268 |
269 |
270 | # for each seed-word category, select the 100 words most similar to its seed words
271 | model.find(seedword_txt_file='data/w2v_seeds/integrity.txt',
272 | topn=100)
273 | model.find(seedword_txt_file='data/w2v_seeds/innovation.txt',
274 | topn=100)
275 | model.find(seedword_txt_file='data/w2v_seeds/quality.txt',
276 | topn=100)
277 | model.find(seedword_txt_file='data/w2v_seeds/respect.txt',
278 | topn=100)
279 | model.find(seedword_txt_file='data/w2v_seeds/teamwork.txt',
280 | topn=100)
281 | ```
282 |
283 | Run
284 |
285 | ```
286 | 数据预处理开始.......
287 | 预处理结束...........
288 | Word2Vec模型训练开始......
289 | 已将模型存入 /Users/Desktop/cntext/test/output/w2v_candi_words/w2v.model
290 |
291 | 准备寻找每个seed在语料中所有的相似候选词
292 | 初步搜寻到 572 个相似的候选词
293 | 计算每个候选词 与 integrity 的相似度, 选出相似度最高的前 100 个候选词
294 | 已完成 【integrity 类】 的词语筛选,并保存于 /Users/Desktop/cntext/test/output/w2v_candi_words/integrity.txt, 耗时 46 秒
295 |
296 | 准备寻找每个seed在语料中所有的相似候选词
297 | 初步搜寻到 516 个相似的候选词
298 | 计算每个候选词 与 innovation 的相似度, 选出相似度最高的前 100 个候选词
299 | 已完成 【innovation 类】 的词语筛选,并保存于 /Users/Desktop/cntext/test/output/w2v_candi_words/innovation.txt, 耗时 46 秒
300 |
301 | 准备寻找每个seed在语料中所有的相似候选词
302 | 初步搜寻到 234 个相似的候选词
303 | 计算每个候选词 与 quality 的相似度, 选出相似度最高的前 100 个候选词
304 | 已完成 【quality 类】 的词语筛选,并保存于 /Users/Desktop/cntext/test/output/w2v_candi_words/quality.txt, 耗时 46 秒
305 |
306 | 准备寻找每个seed在语料中所有的相似候选词
307 | 初步搜寻到 243 个相似的候选词
308 | 计算每个候选词 与 respect 的相似度, 选出相似度最高的前 100 个候选词
309 | 已完成 【respect 类】 的词语筛选,并保存于 /Users/Desktop/cntext/test/output/w2v_candi_words/respect.txt, 耗时 46 秒
310 |
311 | 准备寻找每个seed在语料中所有的相似候选词
312 | 初步搜寻到 319 个相似的候选词
313 | 计算每个候选词 与 teamwork 的相似度, 选出相似度最高的前 100 个候选词
314 | 已完成 【teamwork 类】 的词语筛选,并保存于 /Users/Desktop/cntext/test/output/w2v_candi_words/teamwork.txt, 耗时 46 秒
315 | ```
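
The output above shows the trained model being saved to `output/w2v_candi_words/w2v.model`. Assuming that file is a standard gensim Word2Vec model (an assumption; check what your gensim version actually produces), it can be reloaded and queried directly:

```python
from gensim.models import Word2Vec

# assumption: the file saved by W2VModels is a regular gensim Word2Vec model
model = Word2Vec.load('output/w2v_candi_words/w2v.model')

# nearest neighbours of a seed word assumed to be in the vocabulary
print(model.wv.most_similar('integrity', topn=10))
```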
316 |
317 |
318 |
319 |
320 |
321 | ## 4. sentiment
322 | 
323 | - senti_by_hownet  **sentiment** analysis with the HowNet dictionary
324 | - senti_by_dutir  **emotion** analysis with the DUTIR affective lexicon (Dalian University of Technology)
325 | - senti_by_diydict  **sentiment** analysis with a **user-defined dictionary**
326 |
327 |
328 |
329 | ### 4.1 senti_by_hownet(text, adj_adv=False)
330 |
331 | Sentiment analysis of Chinese text with the HowNet dictionary; counts the occurrences (or scores) of positive and negative sentiment words.
332 | 
333 | - text: the Chinese text to analyze
334 | - adj_adv: whether to account for adverbs (negations and degree words) that flip or intensify sentiment adjectives; default False, in which case only occurrences of sentiment adjectives are counted
335 |
336 | ```python
337 | from cntext.sentiment import senti_by_hownet
338 |
339 | text = '今天股票大涨,心情倍爽,非常开心啊。'
340 |
341 | senti_by_hownet(text)
342 | ```
343 |
344 | Run
345 |
346 | ```
347 | {'word_num': 12,
348 | 'sentence_num': 2,
349 | 'stopword_num': 4,
350 | '好_num': 0,
351 | '乐_num': 1,
352 | '哀_num': 0,
353 | '怒_num': 0,
354 | '惧_num': 0,
355 | '恶_num': 0,
356 | '惊_num': 0}
357 | ```
358 |
359 |
360 |
361 | Accounting for adverbs (negations and degree words) that flip or intensify sentiment adjectives:
362 |
363 | ```python
364 | senti_by_hownet(text, adj_adv=True)
365 | ```
366 |
367 | Run
368 |
369 | ```
370 | {'sentence_num': 1,
371 | 'word_num': 12,
372 | 'stopword_num': 3,
373 | 'pos_score': 13.0,
374 | 'neg_score': 0.0}
375 | ```
376 |
377 |
378 |
379 |
380 |
381 | ### 4.2 senti_by_dutir(text)
382 |
383 | Emotion analysis with the DUTIR affective lexicon (Dalian University of Technology); counts how many words from each emotion category appear.
384 |
385 | ```python
386 | from cntext.sentiment import senti_by_dutir
387 |
388 | text = '今天股票大涨,心情倍爽,非常开心啊。'
389 |
390 | senti_by_dutir(text)
391 | ```
392 |
393 | Run
394 |
395 | ```
396 | {'word_num': 12,
397 | 'sentence_num': 2,
398 | 'stopword_num': 4,
399 | '好_num': 0,
400 | '乐_num': 1,
401 | '哀_num': 0,
402 | '怒_num': 0,
403 | '惧_num': 0,
404 | '恶_num': 0,
405 | '惊_num': 0}
406 | ```
407 |
408 | >The emotion analysis uses the affective lexicon ontology of Dalian University of Technology (DUTIR). If you publish research based on it, please mind the user licence agreement.
409 | >
410 | >If you use this resource in a published paper or other research output, please add a statement such as “使用了大连理工大学信息检索研究室的情感词汇本体” (i.e. that the affective lexicon ontology of the DUT Information Retrieval Laboratory was used).
411 | >
412 | >And add the reference: 徐琳宏,林鸿飞,潘宇,等. 情感词汇本体的构造[J]. 情报学报, 2008, 27(2): 180-185.
413 | >
414 | >
415 |
416 |
417 |
418 |
419 |
420 | ### 4.3 senti_by_diydict(text, sentiwords)
421 | 
422 | Sentiment analysis with a user-defined (DIY) dictionary; it counts occurrences of the words in each category and does not model the more complex effects of degree adverbs or negations.
423 | 
424 | - text: the Chinese text to analyze
425 | - sentiwords: a dict mapping category names to word lists, e.g.
426 |   {'category1': [words of category1],
427 |    'category2': [words of category2],
428 |    'category3': [words of category3],
429 |    ...
430 |   }
431 |
432 | ```python
433 | from cntext.sentiment import senti_by_diydict
434 | 
435 | sentiwords = {'pos': ['开心', '愉快', '倍爽'], 'neg': ['难过', '悲伤'], 'adv': ['倍']}
436 |
437 | text = '今天股票大涨,心情倍爽,非常开心啊。'
438 | senti_by_diydict(text, sentiwords)
439 | ```
440 |
441 | Run
442 |
443 | ```
444 | {'pos_num': 1,
445 | 'neg_num': 0,
446 | 'adv_num': 1,
447 | 'stopword_num': 4,
448 | 'sentence_num': 2,
449 | 'word_num': 12}
450 | ```
451 |
452 |
453 |
454 |
455 |
456 | ### 4.4 Note
457 | 
458 | **Return values**: keys ending in **num** are raw word-occurrence counts; score values take adverbs and negation words into account, so they are sentiment-category scores rather than word frequencies.
459 |
460 |
461 |
462 |
463 |
464 | ## 5. similarity
465 | 
466 | Computes the similarity of two texts with cosine, Jaccard, minimum edit distance and related measures; the implementation follows
467 |
468 | > Cohen, Lauren, Christopher Malloy, and Quoc Nguyen. Lazy prices. No. w25084. National Bureau of Economic Research, 2018.
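
As a rough illustration of what the cosine and Jaccard measures compute, here is a set-based sketch (not cntext's actual implementation; treating `真好玩` as a single pre-segmented token is an assumption):

```python
import math

def cosine_jaccard(words1, words2):
    """Set-based cosine and Jaccard similarity over two token lists."""
    s1, s2 = set(words1), set(words2)
    inter = len(s1 & s2)
    cosine = inter / math.sqrt(len(s1) * len(s2)) if s1 and s2 else 0.0
    jaccard = inter / len(s1 | s2) if (s1 | s2) else 0.0
    return cosine, jaccard

# toy example with already-segmented tokens
print(cosine_jaccard(['编程', '真好玩', '编程', '真好玩'],
                     ['游戏', '真好玩', '编程', '真好玩']))
# -> (0.816..., 0.666...)
```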
469 |
470 |
471 |
472 | ```python
473 | from cntext.similarity import similarity_score
474 |
475 | text1 = '编程真好玩编程真好玩'
476 | text2 = '游戏真好玩编程真好玩'
477 |
478 | similarity_score(text1, text2)
479 | ```
480 |
481 | Run
482 |
483 | ```
484 | {'Sim_Cosine': 0.816496580927726,
485 | 'Sim_Jaccard': 0.6666666666666666,
486 | 'Sim_MinEdit': 1,
487 | 'Sim_Simple': 0.9183673469387755}
488 | ```
489 |
490 |
491 |
492 |
493 |
494 | ## 6. visualization
495 | 
496 | Text visualization; currently includes wordcloud and wordshiftor
497 | 
498 | - wordcloud  word-cloud chart
499 | - wordshiftor  word-shift chart comparing two texts
500 | 
501 | ### 6.1 wordcloud(text, title, html_path)
502 | 
503 | - text: Chinese text string
504 | - title: title of the word cloud
505 | - html_path: path of the output html file
506 |
507 | ```python
508 | from cntext.visualization import wordcloud
509 |
510 | text1 = """在信息化时代,各种各样的数据被广泛采集和利用,有些数据看似无关紧要甚至好像是公开的,但同样关乎国家安全。11月1日是《反间谍法》颁布实施七周年。近年来,国家安全机关按照《反间谍法》《数据安全法》有关规定,依法履行数据安全监管职责,在全国范围内开展涉外数据专项执法行动,发现一些境外数据公司长期、大量、实时搜集我境内船舶数据,数据安全领域的“商业间谍”魅影重重。
511 |
512 | 2020年6月,国家安全机关在反间谍专项行动中发现,有境外数据公司通过网络在境内私下招募“数据贡献员”。广东省湛江市国家安全局据此开展调查,在麻斜军港附近发现有可疑的无线电设备在持续搜集湛江港口舰船数据,并通过互联网实时传往境外。在临近海港的一个居民楼里,国家安全机关工作人员最终锁定了位置。
513 |
514 | 一套简易的无线电设备是AIS陆基基站,用来接收AIS系统发射的船舶数据。AIS系统是船舶身份自动识别系统,国际海事组织要求300总吨以上船舶必须强制安装。船只在航行过程中,通过AIS系统向其他船只和主管部门发送船只航向、航速、目的港等信息,用于航行避让、交通导航、轨迹回溯等功能。国家安全机关查获的设备虽然看上去简陋,功能却十分强大。
515 |
516 | 国家安全机关进一步调查发现,这个基站的来历并不简单。2016年,湛江市的无线电爱好者郑某偶然收到一封境外某海事数据公司发来的邀请邮件。
517 |
518 | 作为资深的无线电爱好者,能免费领取价值几千元的设备还能获取更多的船舶信息,郑某当然心动。而且,这个基站的架设也非常容易,只要简单组装连上家里的网络,自己的任务就算完成。郑某马上浏览了这家公司申请无线电设备的页面,并按对方要求填写了信息。
519 |
520 | """
521 |
522 | wordcloud(text=text1,
523 | title='词云图测试',
524 | html_path='output/词云图测试.html')
525 | ```
526 |
527 | Run
528 |
529 | [**Click to view the word cloud**](examples/output/词云图测试.html)
530 |
531 | 
532 |
533 |
534 |
535 |
536 |
537 | ### 6.2 wordshiftor(text1, text2, title, top_n, matplotlib_family)
538 | 
539 | - text1: text 1, a string
540 | - text2: text 2, a string
541 | - title: title of the word-shift chart
542 | - top_n: show the top n most frequent words; default 15
543 | - matplotlib_family: matplotlib font family for Chinese characters, default "Arial Unicode MS"; if the plot labels come out garbled, see the note below
544 |
545 | ```python
546 | text1 = """在信息化时代,各种各样的数据被广泛采集和利用,有些数据看似无关紧要甚至好像是公开的,但同样关乎国家安全。11月1日是《反间谍法》颁布实施七周年。近年来,国家安全机关按照《反间谍法》《数据安全法》有关规定,依法履行数据安全监管职责,在全国范围内开展涉外数据专项执法行动,发现一些境外数据公司长期、大量、实时搜集我境内船舶数据,数据安全领域的“商业间谍”魅影重重。
547 |
548 | 2020年6月,国家安全机关在反间谍专项行动中发现,有境外数据公司通过网络在境内私下招募“数据贡献员”。广东省湛江市国家安全局据此开展调查,在麻斜军港附近发现有可疑的无线电设备在持续搜集湛江港口舰船数据,并通过互联网实时传往境外。在临近海港的一个居民楼里,国家安全机关工作人员最终锁定了位置。
549 |
550 | 一套简易的无线电设备是AIS陆基基站,用来接收AIS系统发射的船舶数据。AIS系统是船舶身份自动识别系统,国际海事组织要求300总吨以上船舶必须强制安装。船只在航行过程中,通过AIS系统向其他船只和主管部门发送船只航向、航速、目的港等信息,用于航行避让、交通导航、轨迹回溯等功能。国家安全机关查获的设备虽然看上去简陋,功能却十分强大。
551 |
552 | 国家安全机关进一步调查发现,这个基站的来历并不简单。2016年,湛江市的无线电爱好者郑某偶然收到一封境外某海事数据公司发来的邀请邮件。
553 |
554 | 作为资深的无线电爱好者,能免费领取价值几千元的设备还能获取更多的船舶信息,郑某当然心动。而且,这个基站的架设也非常容易,只要简单组装连上家里的网络,自己的任务就算完成。郑某马上浏览了这家公司申请无线电设备的页面,并按对方要求填写了信息。
555 |
556 | """
557 |
558 |
559 | text2 = """
560 | 通知强调,各地商务主管部门要紧紧围绕保供稳价工作目标,压实“菜篮子”市长负责制,细化工作措施;强化横向协作与纵向联动,加强与有关部门的工作协调,形成工作合力;建立完善省际间和本地区联保联供机制,健全有关工作方案,根据形势及时开展跨区域调运;加强市场运行监测,每日跟踪蔬菜、肉类等重点生活必需品供求和价格变化情况,及时预测,及早预警。
561 |
562 | 通知要求,各地支持鼓励大型农产品流通企业与蔬菜、粮油、畜禽养殖等农产品生产基地建立紧密合作关系,签订长期供销协议;耐储蔬菜要提前采购,锁定货源,做好本地菜与客菜之间,北菜与南菜之间、设施菜与露天菜之间的梯次轮换和衔接供应;健全完备本地肉类储备规模及管理制度;北方省份要按时完成本年度冬春蔬菜储备计划,南方省份要根据自身情况建立完善蔬菜储备;及时投放肉类、蔬菜等生活必需品储备,补充市场供应。
563 | """
564 |
565 | from cntext.visualization import wordshiftor
566 |
567 | wordshiftor(text1=text1,
568 | text2=text2,
569 | title='两文本对比')
570 | ```
571 |
572 | Run
573 |
574 | 
575 |
576 |
577 |
578 | **Note**
579 | 
580 | > Before setting the matplotlib_family parameter, run the code below to list the fonts available on your machine
581 | > from matplotlib.font_manager import FontManager
582 | > mpl_fonts = set(f.name for f in FontManager().ttflist)
583 | > print(mpl_fonts)
584 |
585 |
586 |
587 | ## If...
588 | 
589 | If you come from a business, humanities or social-science background, are new to programming, and face the daunting task of collecting and analyzing large amounts of text data, have a look at the video course [《python网络爬虫与文本数据分析》](https://ke.qq.com/course/482241?tuin=163164df). As a liberal-arts student I also started from scratch; this course condenses five years of experience, and I like to think it explains things in a very accessible way o(* ̄︶ ̄*)o. It covers:
590 | 
591 | - Python basics
592 | - web crawling
593 | - reading data
594 | - introduction to text analysis
595 | - machine learning and text analysis
596 | - applications of text analysis in business and economics research
597 | 
598 | If this sounds interesting, click [《python网络爬虫与文本数据分析》](https://ke.qq.com/course/482241?tuin=163164df) and have a look~
599 |
600 | [](https://ke.qq.com/course/482241?tuin=163164df)
601 |
602 |
603 |
604 | ## More
605 | 
606 | - [Bilibili: 大邓和他的python](https://space.bilibili.com/122592901/channel/detail?cid=66008)
607 | 
608 | - WeChat official account: 大邓和他的python
609 | 
610 | - [Zhihu column: 数据科学家](https://zhuanlan.zhihu.com/dadeng)
611 |
612 |
613 | 
614 |
615 |
--------------------------------------------------------------------------------
/img/词云图测试.html:
--------------------------------------------------------------------------------
1 |
2 |
3 |