├── README.md
├── data
│   ├── coments1.csv
│   ├── coments2.csv
│   ├── coments3.csv
│   └── coments4.csv
├── data_pre.py
├── lda.py
└── model.py

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# Predicting Product Sales with Text Mining

— A case study of mobile-phone sales on an e-commerce platform

### Overview

The raw data comes from a web crawler and is preprocessed by data_pre.py. lda.py then extracts feature nouns from the comments; for each comment, the sentiment words and degree adverbs around each feature noun are combined into a weighted score, producing a DataFrame of comment scores whose columns are the feature nouns. Finally, features are selected with PCA, Pearson correlation, etc., three base models (LRModel, SVM, Xgboost) are trained on them, and the sales ranking is predicted.

### Data preprocessing

#### Sentence splitting

A comment may consist of several sentences, and each sentence may discuss different content or different product features, so classifying feature-related sentences at the whole-comment level easily causes confusion. Unlike English, Chinese words are not separated by spaces, so Chinese text has to be segmented with dedicated methods. Here the Python package jieba is used for segmentation (see the preprocessing sketch under "Example sketches" at the end of this README).

#### Word segmentation and POS tagging

Both product feature words and sentiment/opinion words must be separated out of continuous sentences by segmentation, and they are mostly nouns and adjectives. POS tagging after segmentation therefore helps us identify these words and lays the data foundation for the later text-processing steps.

#### Stopword removal

Prepositions, measure words, particles, punctuation and similar words carry no meaning for this task and need to be removed, so the comment corpus is filtered against a stopword list and stripped of punctuation. This can be done by loading a stopword file in Python and filtering against it.

### Extracting feature words with LDA

#### The LDA model

1. An unsupervised machine-learning technique that can uncover latent topic information in large document collections or corpora.
2. It uses the bag-of-words approach, which treats each document as a word-frequency vector, turning text into numeric information that is easy to model. Bag-of-words ignores word order, which simplifies the problem but also leaves room for improving the model.
3. Each document is a probability distribution over topics, and each topic is a probability distribution over words.

#### Noun filtering

Product features are mostly nouns, so on top of the preprocessed corpus all other words are dropped and only nouns are kept, further narrowing the range of candidate feature words. Not every noun in the corpus can serve as a product feature; nouns such as 时间 (time), 地点 (place), 人物 (person) or 东西 (thing) usually are not. To obtain candidates that express product features more accurately, the noun corpus is filtered further; applying filtering rules to the nouns is the usual approach.

#### Synonym merging

Chinese can express one feature in many ways: 价钱, 价格 and 价位 all mean "price". In online reviews, consumers follow their own language habits rather than a single expression, so some variants appear often enough to be selected while others do not. To solve this, before the LDA model extracts feature words, the different expressions of the same meaning are unified into a single word.

#### LDA feature-extraction steps

A minimal gensim sketch of these steps is given under "Example sketches" at the end of this README.

1) Use all words in the comment corpus as the dictionary of the LDA model
2) Use this dictionary to convert all comments into the LDA model's corpus
3) Train the LDA model on the corpus
4) Obtain the word probability distribution of each topic
5) Use a threshold to select suitable words as product feature words

### Sentiment polarity and degree computation

#### Building sentiment dictionaries

First we build polarity dictionaries, i.e. a positive dictionary and a negative dictionary, containing polarity words such as 好 (good), 漂亮 (pretty), 差 (bad) and 烂 (awful). With the polarity dictionaries alone we can only judge a sentence's polarity; computing the degree of polarity needs further dictionaries, the most important being the adverb dictionary.

#### Adverb dictionary

It mainly contains negation adverbs and degree adverbs. Negation adverbs raise the accuracy of the dictionary-based polarity judgment: 质量不好 ("the quality is not good"), for example, becomes negative once the negation dictionary is applied. Degree adverbs such as 较 (fairly), 极 (extremely), 稍微 (slightly) and 有点 (a bit) are given different weights, so a degree can be computed on top of the polarity.

#### Assigning weights to the dictionaries

Given the sentiment and degree dictionaries, the different word classes are assigned the following weights before the degree computation:

| Word class | Weight |
| ------ | ------ |
| positive sentiment word | +1 |
| negative sentiment word | -1 |
| negation adverb | -1 |
| "most" adverb | 2 |
| "very" adverb | 1.5 |
| "more" adverb | 1.25 |
| "ish" (lsh) adverb | 0.5 |
| "insufficient" adverb | 0.25 |

#### Polarity degree computation

For every sentence containing a feature word, compute the polarity degree below and take the mean as the polarity degree of that product feature (see the scoring sketch under "Example sketches" below):

Score = score * weight

### Regression models for predicting the sales ranking

1. Three basic modes for processing the extracted features (PCA, PCC, None)
2. Three basic models for training (LRModel, SVM, Xgboost)
3. Grid search with sklearn over the models above to select the best hyperparameters
4. Visualization of the predictions against the true values
5. Models evaluated with R²; the best model reaches R² = 0.96

# ———— To be continued......2019/03/05
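### Example sketches

A minimal sketch of the preprocessing steps above (jieba segmentation, POS tagging, stopword filtering). The stopword path is the one used in data_pre.py; the sample comment and the printed tags are illustrative only.

```python
import jieba.posseg as pseg
import pandas as pd

# load the stopword list shipped with the project
stopwords = set(pd.read_csv('./data/StopwordsCN.txt').stopword)

def preprocess(comment):
    """Segment a comment, keep (word, POS) pairs, and drop stopwords."""
    return [(word, flag) for word, flag in pseg.cut(comment)
            if word not in stopwords]

print(preprocess('这款手机质量好,就是价钱有点贵。'))
# e.g. [('手机', 'n'), ('质量', 'n'), ('好', 'a'), ('价钱', 'n'), ...]
```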
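The five LDA steps above, sketched with gensim in the same way GetBestLdaModel and ChooseFeatureWord do it in lda.py. The two toy documents stand in for the per-comment noun lists produced by preprocessing.

```python
from gensim import corpora, models

noun_words = [['手机', '质量', '价格'], ['屏幕', '价格', '物流']]  # toy noun lists

dic = corpora.Dictionary(noun_words)                      # step 1: dictionary
corpus = [dic.doc2bow(text) for text in noun_words]       # step 2: corpus
lda = models.LdaModel(corpus, id2word=dic, num_topics=2)  # step 3: train
for topic_id in range(2):                                 # step 4: word distributions
    for word, prob in lda.show_topic(topic_id):
        if prob >= 0.05:                                  # step 5: threshold filter
            print(topic_id, word, prob)
```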
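And a minimal sketch of the scoring rule Score = score * weight, following the window used by GetCommentScore in lda.py (sentiment word one position after the feature word, degree adverb two positions after). The two toy dictionaries stand in for the project's full lexicons.

```python
e_dict = {'好': 1, '差': -1}          # sentiment words: +1 / -1
adv_dict = {'很': 1.5, '有点': 0.5}   # degree adverbs and their weights

def feature_score(tokens, feature):
    """Score one feature word from the sentiment word and adverb that follow it."""
    score, weight = 0, 1
    if feature in tokens:
        i = tokens.index(feature)
        if i + 1 < len(tokens) and tokens[i + 1] in e_dict:
            score += e_dict[tokens[i + 1]]
        if i + 2 < len(tokens) and tokens[i + 2] in adv_dict:
            weight *= adv_dict[tokens[i + 2]]
    return score * weight

print(feature_score(['质量', '好', '很'], '质量'))  # 1 * 1.5 = 1.5
```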
--------------------------------------------------------------------------------
/data_pre.py:
--------------------------------------------------------------------------------

# -*- coding: utf-8 -*-
"""
Created on Sat Feb 10 21:48:06 2019

@author: Administrator
"""
import re

import jieba
import numpy as np
import pandas as pd

stopwords_path = './data/StopwordsCN.txt'
data_1_path = './data/data1.csv'
data_2_path = './data/data2.csv'
data_3_path = './data/data3.csv'
data_4_path = './data/data4.csv'


def ReadData(data_path):
    '''Read one CSV file.'''

    return pd.read_csv(data_path, index_col=False)


def ConcatData(path1, path2, path3, path4):
    '''Concatenate the four data files into one DataFrame.'''

    data_1 = ReadData(path1)
    data_2 = ReadData(path2)
    data_3 = ReadData(path3)
    data_4 = ReadData(path4)
    data_all = pd.concat([data_1, data_2, data_3, data_4]
                         ).reset_index(drop=True)

    return data_all


def GetNum(x):
    '''
    Extract the numeric value from fields such as 价格1, 星级 and 评论数量.
    '''

    if isinstance(x, str):
        x = x.replace(',', '')
        return float(re.findall(r'\d+\.?\d*', x)[0])
    return np.nan


def GetPrice_2(x):
    '''
    Extract the number from the 价格2 field, which may also contain
    unrelated text such as "此商品仅剩 1 件 - 欲购从速" (only 1 left in stock).
    '''

    if isinstance(x, str) and '¥' in x:
        x = x.replace(',', '')
        return float(re.findall(r'\d+\.?\d*', x)[0])
    return np.nan


def NameId(x):
    '''Replace product names (名称) with integer ids; also return the mapping.'''

    temp = list(x.名称.unique())
    temp_dict = dict(zip(temp, range(len(temp))))
    x.名称 = x.名称.map(temp_dict)

    return x, temp_dict


def CutWords(words):
    '''Segment the text and remove stopwords.'''

    cut_words = list(jieba.cut(words.replace(' ', '')))
    cut_words_df = pd.DataFrame({'words': cut_words})

    # remove stopwords
    stopwords = pd.read_csv(stopwords_path)
    new_words = cut_words_df[~cut_words_df.words.isin(stopwords.stopword)]
    cut_words_list = new_words.words.values.tolist()

    return ''.join(cut_words_list)


# Earlier version, kept for reference: concatenates and segments all
# comments of each product instead of averaging the numeric columns.
# def DataGroupBy(x):
#     final_data = pd.DataFrame(columns=['名称', '评论'])
#     i = 0
#     for name, group in x.groupby('名称'):
#         class_comments = ''
#         for comment in group.评论:
#             if isinstance(comment, str):
#                 class_comments += comment
#         final_data.loc[i, '名称'] = name
#         final_data.loc[i, '评论'] = CutWords(class_comments)
#         i += 1
#     return final_data


def DataGroupBy(x):
    '''Return the per-product (名称) mean of every numeric column.'''

    return x.groupby("名称").mean()


def GetAllComment(data):
    '''Concatenate all comments into a single string.'''

    comments = ''
    for comment in data.评论:
        if isinstance(comment, str):
            comments += comment

    return comments
def GetFinalData():
    '''Run the full cleaning pipeline and return the final data.'''

    data_all = ConcatData(data_1_path, data_2_path, data_3_path, data_4_path)
    data_all, name_dict = NameId(data_all)
    data_all.价格2 = data_all.价格2.apply(GetPrice_2)
    for col in ['价格1', '星级', '评论数量', '排名1']:
        data_all[col] = data_all[col].apply(GetNum)

    return data_all, name_dict


if __name__ == '__main__':

    data_all, name_dict = GetFinalData()
    comment_df = DataGroupBy(data_all)
    comment_all = GetAllComment(data_all)

--------------------------------------------------------------------------------
/lda.py:
--------------------------------------------------------------------------------

# -*- coding: utf-8 -*-
"""
Created on Sun Feb 12 14:54:08 2019

@author: Administrator
"""
import math

import jieba
import jieba.posseg as pseg
import numpy as np
import pandas as pd
from gensim import corpora
from gensim import models

from data_pre import ReadData, GetFinalData

stopwords_path = './data/StopwordsCN.txt'
simi_path = './data/simi_words.txt'
positive_path = './data/posdict.txt'
negative_path = './data/negdict.txt'
mostdict_path = './data/mostdict.txt'
verydict_path = './data/verydict.txt'
moredict_path = './data/moredict.txt'
lshdict_path = './data/lshdict.txt'
insufficientdict_path = './data/insufficientdict.txt'


def SimiWords(simi_path):
    '''Build the synonym-replacement dict: every word in a synonym group
    (classes) maps to the first word of its group.'''

    simi_words_df = pd.read_csv(simi_path, names=['classes', 'labels', 'nums'],
                                sep=' ')
    simi_words_df = simi_words_df.drop_duplicates('labels', keep='first')

    simi_dict = {}
    for _, syn in simi_words_df.groupby('classes').labels:
        head = syn.iloc[0]  # canonical form of the group
        for word in syn:
            simi_dict[word] = head

    return simi_dict, list(simi_dict)


def CutWords_list(comment):
    '''Segment one comment into a word list (used later when weighting
    adverbs and positive/negative sentiment words).'''

    if isinstance(comment, str):
        return list(jieba.cut(comment))
    return []
# Earlier version, kept for reference: noun filtering plus synonym merging
# over one segmented string.
# def FeatureWords(cut_words, simi_path):
#     final_words = []
#     dict_simi, simi_words_ = SimiWords(simi_path)
#     for i in pseg.cut(cut_words):
#         if i.flag == 'n':
#             if i.word in simi_words_:
#                 final_words.append(dict_simi.get(i.word))
#             else:
#                 final_words.append(i.word)
#     return final_words


def LoadStopWord(stopwords_path):
    '''Load the stopword list.'''

    stopwords = pd.read_csv(stopwords_path)

    return stopwords.stopword.tolist()


def GetNoun(comment_data, stopwords_path, simi_path):
    '''Post-process the segmented comments: drop stopwords, keep only nouns
    of length >= 2, and replace synonyms with their canonical form.'''

    count = 1
    noun_words = []
    stop_words = LoadStopWord(stopwords_path)
    dict_simi, simi_words = SimiWords(simi_path)
    for i in comment_data:
        print("Extracting nouns from comment {} of {}, please wait......".format(
            count, len(comment_data)))
        count += 1
        if isinstance(i, str):
            word_list = []
            for word, flag in pseg.cut(i):
                if flag == 'n' and word not in stop_words and len(word) >= 2:
                    word_list.append(dict_simi.get(word, word))
            noun_words.append(word_list)
        else:
            noun_words.append([])

    return noun_words


def Perplexity(ldamodel, test_data, dictionary, size_dictionary, num_topics):
    '''Evaluate an LDA model: perplexity on a held-out corpus.'''

    prob_doc_sum = 0.0
    topic_word_list = []
    for topic_id in range(num_topics):
        topic_word = ldamodel.show_topic(topic_id, size_dictionary)
        topic_word_list.append({word: prob for word, prob in topic_word})
    doc_topics_list = [ldamodel.get_document_topics(doc, minimum_probability=0)
                       for doc in test_data]
    testset_word_num = 0

    for i, doc in enumerate(test_data):
        prob_doc = 0.0  # log-probability of this document
        doc_word_num = 0
        for word_id, num in doc:
            prob_word = 0.0
            doc_word_num += num
            word = dictionary[word_id]
            for topic_id in range(num_topics):
                prob_topic = doc_topics_list[i][topic_id][1]
                prob_topic_word = topic_word_list[topic_id][word]
                prob_word += prob_topic * prob_topic_word
            prob_doc += num * math.log(prob_word)  # weight by word count
        prob_doc_sum += prob_doc
        testset_word_num += doc_word_num
    prep = math.exp(-prob_doc_sum / testset_word_num)
    print("Perplexity with {} topics: {}.".format(num_topics, int(prep)))

    return prep


def GetBestLdaModel(n_words):
    '''Train LDA models with 5-29 topics and keep the one with the lowest
    perplexity; return its topics.'''

    best_prep = float('inf')
    best_lda = None
    best_num_topic = None
    words_test = n_words[:len(n_words) // 3]
    dic = corpora.Dictionary(n_words)
    corpus = [dic.doc2bow(text) for text in n_words]
    test_corpus = [dic.doc2bow(text) for text in words_test]
    for num_topic in range(5, 30):
        lda = models.LdaModel(corpus, id2word=dic, num_topics=num_topic)
        prep = Perplexity(lda, test_corpus, dic, len(dic.keys()), num_topic)
        if prep <= best_prep:
            best_prep = prep
            best_lda = lda
            best_num_topic = num_topic
    print("Best number of topics: %d, perplexity: %d" % (best_num_topic, best_prep))

    return best_lda.print_topics(num_topics=best_num_topic)
def ChooseFeatureWord(ldaout, s):
    '''Keep topic words whose probability is at least the threshold s.'''

    feature_words = []
    for _, topic in ldaout:
        for term in topic.replace('"', '').replace(' ', '').split('+'):
            prob, word = term.split('*')
            if float(prob) >= s:
                print(word)
                feature_words.append(word)

    return feature_words


def GetDict(data, weight):
    '''Map every word in data to the given weight.'''

    return {word: weight for word in data}


def GetFeatureWordData(positive_path, negative_path, mostdict_path, verydict_path,
                       moredict_path, lshdict_path, insufficientdict_path):
    '''Load the sentiment and adverb lexicons and build the weight dicts.'''

    poslist = ReadData(positive_path).posdict.tolist()
    neglist = ReadData(negative_path).negdict.tolist()
    mostlist = ReadData(mostdict_path).mostdict.tolist()
    verylist = ReadData(verydict_path).verydict.tolist()
    morelist = ReadData(moredict_path).moredict.tolist()
    lshlist = ReadData(lshdict_path).lshdict.tolist()
    insufficientlist = ReadData(insufficientdict_path).insufficientdict.tolist()

    e_list = poslist + neglist
    e_dict = {**GetDict(poslist, 1), **GetDict(neglist, -1)}

    adv_list = mostlist + verylist + morelist + lshlist + insufficientlist
    adv_dict = {**GetDict(mostlist, 2), **GetDict(verylist, 1.5),
                **GetDict(morelist, 1.25), **GetDict(lshlist, 0.5),
                **GetDict(insufficientlist, 0.25)}

    return e_list, e_dict, adv_list, adv_dict


def GetCommentScore(feature_words, comments_list):
    '''Score every comment on every feature word; returns a DataFrame with
    one column per feature word.'''

    comment_score = pd.DataFrame(
        data=np.zeros((len(comments_list), len(feature_words))),
        columns=feature_words)
    e_list, e_dict, adv_list, adv_dict = GetFeatureWordData(
        positive_path, negative_path, mostdict_path, verydict_path,
        moredict_path, lshdict_path, insufficientdict_path)
    for i in range(len(comments_list)):
        print("Scoring comment %d of %d." % (i, len(comments_list)))
        for feature in feature_words:
            feature_weight = 1
            feature_score = 0
            if feature in comments_list[i]:
                curr_index = comments_list[i].index(feature)
                try:
                    # sentiment word right after the feature word
                    if comments_list[i][curr_index + 1] in e_list:
                        feature_score += e_dict.get(comments_list[i][curr_index + 1])
                except IndexError:
                    print("No sentiment word after the feature word")

                try:
                    # degree adverb two positions after the feature word
                    if comments_list[i][curr_index + 2] in adv_list:
                        feature_weight *= adv_dict.get(comments_list[i][curr_index + 2])
                except IndexError:
                    print("No adverb after the feature word")

            comment_score.loc[i, feature] = feature_weight * feature_score
    comment_score.to_csv(r"C:\Users\Administrator\Desktop\DM\comment_score_1.csv",
                         index=False, encoding='utf_8_sig')

    return comment_score


def GetModelData():
    '''Join the cleaned raw data with the feature-word scores to build the
    data the models are trained on.'''

    data_all, name_dict = GetFinalData()
    noun_words = GetNoun(data_all.评论, stopwords_path, simi_path)
    ldaout = GetBestLdaModel(noun_words)
    feature_words = ChooseFeatureWord(ldaout, 0.05)
    comments_list = data_all.评论.apply(CutWords_list).tolist()
    comment_score = GetCommentScore(feature_words, comments_list)
    data_all.drop('评论', axis=1, inplace=True)
    data_all = data_all.join(comment_score)

    return data_all
--------------------------------------------------------------------------------
/model.py:
--------------------------------------------------------------------------------

import datetime

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xgboost
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

from lda import GetModelData


def FillMean(data_s):
    '''Fill the NaNs of a Series with its mean.'''

    return data_s.fillna(data_s.mean())


def FillNa(data):
    '''
    The target to predict is 排名2.
    Drop rows where 排名2 is missing.
    Fill 价格1, 星级 and 评论数量 with their means.
    Fill missing 价格2 with the corresponding 价格1.
    '''

    data = data[data.排名2.notna()].reset_index(drop=True)
    data.价格1 = FillMean(data.价格1)
    data.星级 = FillMean(data.星级)
    data.评论数量 = FillMean(data.评论数量)
    indexes = np.where(np.isnan(data.价格2))[0]
    data.loc[indexes, '价格2'] = data.loc[indexes, '价格1']

    return data


def GetColumns(data, mode):
    '''
    mode:
        None:  return the data unchanged.
        'PCC': keep the columns whose absolute Pearson correlation with
               排名2 is at least 0.005.
        'PCA': reduce the data to 18 PCA components.
    '''

    if mode is None:
        y = data.排名2
        data.drop(['排名2'], axis=1, inplace=True)

    elif mode == 'PCC':
        columns_corr_ = np.abs(data.corr().排名2)
        data = data[list(columns_corr_[columns_corr_ >= 0.005].index)]
        y = data.排名2
        data.drop(['排名2'], axis=1, inplace=True)

    elif mode == 'PCA':
        y = data.排名2
        data.drop(['排名2'], axis=1, inplace=True)
        data = pd.DataFrame(data=PCA(n_components=18).fit_transform(
            data), columns=['PCA_' + str(i) for i in range(18)])

    return data, y


def PreData(data, mode=None, group=False):
    '''
    Preprocessing: choose the column mode, optionally group by 名称,
    standardize the data, and split it into train/test sets (test size 0.3).
    '''

    x = FillNa(data)
    if group:
        x = x.groupby('名称').mean().reset_index(drop=True)
    else:
        x.drop(['名称'], axis=1, inplace=True)
    x, y = GetColumns(x, mode)
    columns = x.columns
    x = pd.DataFrame(data=StandardScaler().fit_transform(x), columns=columns)

    return train_test_split(x, y, test_size=0.3, random_state=0)
def LRModel(x_train, x_test, y_train, y_test):
    '''Train the linear-regression model.'''

    model = LinearRegression()
    start_time = datetime.datetime.now()
    model.fit(x_train, y_train)
    end_time = datetime.datetime.now()
    print("Linear-regression coefficients:")
    print(model.coef_)
    print("\n")
    y_pred = model.predict(x_test)
    r2_loss = r2_score(y_test, y_pred)
    print("Linear-regression R2: {}".format(r2_loss))
    print("LR model total time: %d s" % (end_time - start_time).seconds)

    return model, r2_loss


def SVMModel(x_train, x_test, y_train, y_test):
    '''SVM model with grid search over the hyperparameters.'''

    model = SVR()
    start_time = datetime.datetime.now()
    param_grid = {'C': [0.01, 0.1, 1, 10, 100],
                  'gamma': [0.1, 1, 10],
                  'kernel': ['linear', 'rbf'],
                  }
    grid_model = GridSearchCV(model, param_grid, cv=5, scoring='r2', n_jobs=-1)
    grid_model.fit(x_train, y_train)
    end_time = datetime.datetime.now()
    print("Best SVM parameters:")
    print(grid_model.best_params_)
    y_pred = grid_model.predict(x_test)
    r2_loss = r2_score(y_test, y_pred)
    print("SVM R2 on the test set: {}".format(r2_loss))
    print("SVM grid search total time: %d s" % (end_time - start_time).seconds)

    return grid_model, r2_loss


def XBModel(x_train, x_test, y_train, y_test):
    '''Xgboost model with grid search over the hyperparameters.'''

    model = xgboost.XGBRegressor(n_jobs=-1)
    start_time = datetime.datetime.now()
    param_grid = {'learning_rate': [0.5, 0.1, 0.05, 0.01],
                  'n_estimators': [400, 600, 800, 1000, 1200, 1600],
                  'max_depth': [3, 4, 5],
                  }
    grid_model = GridSearchCV(model, param_grid, cv=5, scoring='r2', n_jobs=-1)
    grid_model.fit(x_train, y_train)
    end_time = datetime.datetime.now()
    print("Best xgboost hyperparameters:")
    print(grid_model.best_params_)
    y_pred = grid_model.predict(x_test)
    r2_loss = r2_score(y_test, y_pred)
    print("xgboost R2 on the test set: {}".format(r2_loss))
    print("Xgboost grid search total time: %d s" % (end_time - start_time).seconds)

    return grid_model, r2_loss


def PlotResult(model, x_test, y_test):
    '''Plot the predicted against the true values (first 100 test samples).'''

    x = [i + 1 for i in range(len(y_test))][:100]
    y_pred = model.predict(x_test)
    plt.rcParams['font.family'] = ['sans-serif']
    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.figure(figsize=(12, 6))
    plt.plot(x, y_test[:100], c='green', label='y_true')
    plt.plot(x, y_pred[:100], c='red', label='y_pred')
    plt.legend()
    plt.xlabel("sample")
    plt.ylabel("value")
    plt.title("Predicted vs. true values")
    plt.show()


if __name__ == '__main__':

    '''
    Three basic models: LRModel, Xgboost, SVM.
    Three basic modes:  PCA, PCC, None.
    Grouping the data or not: True, False.
    Training runs combine the options above.
    '''
    model_data = GetModelData()
    x_train, x_test, y_train, y_test = PreData(model_data, mode='PCA', group=False)
    lr_model, lr_r2_loss = LRModel(x_train, x_test, y_train, y_test)
    PlotResult(lr_model, x_test, y_test)
    #svm_model, svm_r2_loss = SVMModel(x_train, x_test, y_train, y_test)
    #PlotResult(svm_model, x_test, y_test)
    #xgboost_model, xgboost_r2_loss = XBModel(x_train, x_test, y_train, y_test)
    #PlotResult(xgboost_model, x_test, y_test)
--------------------------------------------------------------------------------