├── README.md
├── base
│   ├── __init__.py
│   ├── dataset.py
│   ├── feipeng
│   │   ├── __init__.py
│   │   ├── feature_engineering.py
│   │   └── model.py
│   ├── keras_helper.py
│   ├── utils.py
│   └── yuml
│       └── models.py
├── clean.sh
├── data
│   ├── SMP 竞赛DUTIRTONE.pptx
│   ├── SMP_DUTIRTONE.pptx
│   └── user_data
│       ├── city_loca.dict
│       ├── city_prov.dict
│       ├── emoji.txt
│       ├── enum_list.txt
│       ├── keywords.txt
│       ├── latitude.dict
│       ├── location.txt
│       ├── short_prov.dict
│       └── stopwords.txt
├── main.py
├── others
│   └── banner.jpg
├── process_data.py
├── run.sh
├── stack_age.py
├── stack_loca.py
└── submission
    └── empty.csv

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------

# 2nd Place Solution for SMP CUP 2016

Competition link: https://biendata.com/competition/smpcup2016/

Team members: [Yumeng Li](https://github.com/liyumeng), [Peng Fei](https://github.com/feidapeng), [Hengchao Li](https://github.com/hengchao0248)

Task description:

> Using the provided Sina Weibo data (user profile information, users' Weibo posts and follower lists; see the competition's data description), participants build profiles of Weibo users. This breaks down into three tasks:
> Task 1: infer the user's age group (3 labels: -1979 / 1980-1989 / 1990+)
> Task 2: infer the user's gender (2 labels: male / female)
> Task 3: infer the user's region (8 labels: Northeast / North China / Central China / East China / Northwest / Southwest / South China / Overseas)

## 1. Requirements and file layout
The code requires Python 3 and the following packages:
```
anaconda3
theano 0.9.0
keras (with theano as the backend)
xgboost
gensim
jieba
```
Running the pipeline also requires the raw corpus and the pre-trained word2vec model files, which have been uploaded to Baidu Yun (1.3 GB in total).

Raw corpus download: http://pan.baidu.com/s/1o8lV37s  password: wyk8

word2vec model download: http://pan.baidu.com/s/1ciWjpk  password: cvlo

The files are organized as follows.

Put the raw data in this directory:
```
data/raw_data
    train
    valid
```

Put the word2vec vector files in this directory:
```
data/word2vec/
    smp.w2v.300d           word2vec model file used by gensim
    smp.w2v.300d.syn0.npy  word2vec model file used by gensim
```

Purpose of the remaining directories and files:
```
data/user_data/
    short_prov.dict    abbreviated province names
    location.txt       mapping from province to region
    latitude.dict      mapping from province to latitude/longitude
    keywords.txt       curated keyword list
    enum_list.txt      label values of the three tasks
    emoji.txt          curated emoji list
    city_prov.dict     mapping from city to province
    city_loca.dict     mapping from city to region
    stopwords.txt      stop-word list
data/feature_data/     temporary files written while the pipeline runs
data/models/           model weight files produced while the pipeline runs
```

## 2. Running

> The pipeline is fairly time-consuming; a server with a GPU is recommended.
> On a laptop running Arch Linux with an i7-6700HQ CPU, a GTX960M GPU, 16 GB of RAM and an SSD, a full run takes about 90 minutes and uses about 10 GB of disk space.

```
# Running run.sh directly reuses the model parameters saved in data/models
./run.sh
#------------------------------------
# To retrain everything from scratch, run
./clean.sh
./run.sh
# Different random seed settings may also lead to slightly different results
```

## 3. Output files
The pipeline writes its output to the following two folders:
```
data/feature_data/
    features.v1.pkl          features after the first processing pass, grouped per user
    features.v2.pkl          the same features converted to numpy arrays
    f_letter_svd.300.cache   tfidf features of the Weibo text split into characters, reduced to 300 dimensions by SVD
    f_word_svd.300.cache     tfidf features of the Weibo text split into words, reduced to 300 dimensions by SVD
    f_source_svd.300.cache   count features of the Weibo source text split into characters, reduced to 300 dimensions by SVD
    f_w2v_tfidf.300.cache    300-dimensional sentence vectors: tfidf-weighted combination of the word2vec vectors of the words in each post
    loca.empty.pkl           predictions used to fill in the missing location labels in the training set
    loca.source.feature      count features of the place names that appear in the Weibo source text
    yuml.age.feature         probability outputs of the xgb, mcnn and mcnn2 models, produced by stack_age.py

data/models/
    fp.age.feature           weight file produced by main.py during training
    fp.gender.feature        weight file produced by main.py during training
    loca.em_nn.weight        weights of the BP neural network trained in the stacking step
    loca.em_knn.weight       weights of the KNN model trained in the stacking step
    loca.em_mcnn.weight      weights of the MCNN model trained in the stacking step
    loca.em_mcnn3.weight     weights of the MCNN3 model trained in the stacking step
    yuml.age.feature         weight file trained by stack_age.py
```
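For quick inspection of the cached feature files listed above, `base/dataset.py` provides small loader helpers. A minimal sketch (assuming `run.sh` has already completed and that the working directory sits inside the project root folder, which `base/dataset.py` locates by the folder name `DUTIRTone`):

```
# Load the cached feature files via the helpers defined in base/dataset.py.
from base.dataset import load_v1, load_v2

features_v1 = load_v1()   # data/feature_data/features.v1.pkl: features grouped per user
features_v2 = load_v2()   # data/feature_data/features.v2.pkl: the numpy-array version
print(type(features_v1), type(features_v2))
```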
## 4. Other

If you find this project useful, *please give it a star (top right corner)*. Thanks!

[Download the slides (PPTX)](data/SMP_DUTIRTONE.pptx)

Other competitions we have taken part in:

[Final winner solution for the 2016 CCF Big Data competition: Sogou user profiling for precision marketing](https://github.com/hengchao0248/ccf2016_sougou)

[1st Place Solution for the customer profiling (user profiling) task of the 2016 CCF Big Data competition](https://github.com/feidapeng/2016CCF_StateGrid_UserProfile)

[Tsinghua Data Science Winter School 2017 Link Prediction](https://github.com/liyumeng/LinkPrediction)

![](others/banner.jpg)

--------------------------------------------------------------------------------
/base/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/liyumeng/SmpCup2016/e23611fb5a590b357ecc6451d87f81b8e2a62615/base/__init__.py
--------------------------------------------------------------------------------
/base/dataset.py:
--------------------------------------------------------------------------------
1 | from gensim.models.word2vec import Word2Vec
2 | import pickle
3 | import os
4 | from os.path import dirname
5 | 
6 | #---------Configuration------------------
7 | 
8 | # Name of the project root folder
9 | project_name='DUTIRTone'
10 | 
11 | __cur=os.path.abspath('.')
12 | project_path=__cur[:__cur.rindex(project_name)+len(project_name)]
13 | 
14 | # Path of the data directory
15 | smp_path=project_path+'/data'
16 | 
17 | # Output path for feature files
18 | feature_path=smp_path+'/feature_data'
19 | 
20 | # Path where result (submission) files are stored
21 | submission_path=project_path+'/submission'
22 | 
23 | #-------End of configuration-----------------
24 | 
25 | def load_v1():
26 |     return pickle.load(open(feature_path+'/features.v1.pkl','rb'))
27 | 
28 | def load_v2():
29 |     return pickle.load(open(feature_path+'/features.v2.pkl','rb'))
30 | 
31 | '''Number of followers for each user'''
32 | def load_links_dict():
33 |     link_dict={}
34 |     with open(smp_path+'/raw_data/train/train_links.txt') as f:
35 |         items=[item.strip().split(' ') for item in f]
36 |         for item in items:
37 |             link_dict[item[0]]=len(set(item[1:]))
38 | 
39 |     with open(smp_path+'/raw_data/valid/valid_links.txt') as f:
40 |         items=[item.strip().split(' ') for item in f]
41 |         for item in items:
42 |             link_dict[item[0]]=len(set(item[1:]))
43 | 
44 |     return link_dict
45 | 
46 | def load_w2v(dim=300):
47 |     if dim==200:
48 |         return Word2Vec.load(smp_path+'/word2vec/smp.w2v.200d')
49 |     if dim==300:
50 |         return Word2Vec.load(smp_path+'/word2vec/smp.w2v.300d')
51 |     return None
52 | 
53 | 
54 | def load_glove(dim=50):
55 |     return Word2Vec.load(smp_path+'/glove/smp.glove.%dd'%dim)
56 | 
--------------------------------------------------------------------------------
/base/feipeng/__init__.py:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/liyumeng/SmpCup2016/e23611fb5a590b357ecc6451d87f81b8e2a62615/base/feipeng/__init__.py
--------------------------------------------------------------------------------
/base/feipeng/feature_engineering.py:
--------------------------------------------------------------------------------
1 | '''
2 | 特征工程:数据预处理
3 | '''
4 | import numpy as np
5 | import pandas as pd
6 | import os
7 | import re
8 | import jieba
9 | import pickle
10 | import codecs
11 | 
12 | class features(object):
13 |     def __init__(self, inpaths, outpaths):
14 |         print('---加载数据---')
15 |         # 训练集
16 |         self.train_info = inpaths['train_info']
17 |         self.train_labels = inpaths['train_labels']
18 |         self.train_links = inpaths['train_links']
19 |         self.train_status = inpaths['train_status']
20 |         # 测试集
21 |         self.test_info = inpaths['test_info']
22 |         self.test_nolabels = inpaths['test_nolabels']
23 |         self.test_links = 
inpaths['test_links'] 24 | self.test_status = inpaths['test_status'] 25 | # 停用词 26 | self.stopwords = inpaths['stopwords'] 27 | 28 | # 中间文件 29 | self.status_file = outpaths['status_file'] 30 | self.text_file = outpaths['text_file'] 31 | # 最终文件 32 | self.data_file = outpaths['data_file'] 33 | self.features_file = outpaths['features_file'] 34 | print('---加载数据---') 35 | 36 | self.new_labels = inpaths['new_labels'] 37 | labels = [] 38 | with codecs.open(self.train_labels, 'r', encoding='utf-8')as f: 39 | for line in f: 40 | line = line.replace('||',',') 41 | labels.append(line) 42 | with codecs.open(self.new_labels, 'w', encoding='utf-8')as f: 43 | for line in labels: 44 | f.write(line) 45 | self.train_labels = self.new_labels 46 | 47 | # 开始构造训练集和测试集的特征 48 | self.df = pd.read_csv(self.train_labels, names=['uid','gender','birthday','location'], encoding='utf-8') 49 | train_x = pd.DataFrame(self.df.uid) 50 | test_x = pd.read_csv(self.test_nolabels, encoding='utf-8', names=['uid']) 51 | # 合并 52 | self.data_x = pd.concat([train_x,test_x], axis=0, ignore_index=True) 53 | print ('训练集+测试集共有{}个样本'.format(self.data_x.shape[0])) 54 | self.features = self.data_x[:] # 特征 55 | 56 | def build(self): 57 | print('---处理微博status文件---') 58 | self.process_status() 59 | print('---建立统计特征---') 60 | self.build_feature() 61 | print('---处理文本特征---') 62 | self.process_text() 63 | print('---处理粉丝信息---') 64 | self.process_fans() 65 | print('---输出完毕---') 66 | pickle.dump(self.data_x, open(self.data_file,'wb')) 67 | pickle.dump(self.features, open(self.features_file, 'wb')) 68 | 69 | 70 | def process_status(self): 71 | if os.path.isfile(self.status_file): 72 | os.remove(self.status_file) 73 | if os.path.isfile(self.text_file): 74 | os.remove(self.text_file) 75 | 76 | paths = [self.train_status, self.test_status] 77 | for path in paths: 78 | with codecs.open(path,'r',encoding='utf-8') as f,\ 79 | codecs.open(self.status_file,'a',encoding='utf-8') as status_out,\ 80 | codecs.open(self.text_file,'a',encoding='utf-8') as text_out: 81 | i = 0 82 | for line in f: 83 | item = line.strip().split(',',5) 84 | 85 | status_info = item[:5] 86 | if len(status_info) != 5: 87 | continue 88 | status_out.write(','.join(status_info)+'\n') 89 | text_info = item[5] 90 | text_out.write(text_info+'\n') 91 | i+=1 92 | print ('{}条微博'.format(i)) 93 | self.status = pd.read_csv(self.status_file, names=['uid','retweet','review','source','time']) 94 | self.text = pd.read_csv(self.text_file, names=['content'], sep='delimiter') 95 | 96 | if not self.text.shape[0] == self.status.shape[0]: 97 | print ('status 和 text 不匹配!!!') 98 | 99 | self.text['uid'] = self.status.uid 100 | self.text = self.text[['uid','content']] 101 | 102 | # 除去多余的微博 103 | self.status = self.status[self.status.uid.apply(lambda x:x in self.data_x.uid.values)] 104 | self.text = self.text[self.text.uid.apply(lambda x:x in self.data_x.uid.values)] 105 | 106 | ''' 107 | 处理时间 108 | 微博时间分为三种格式: 109 | 1、2015-11-10 09:13:35 110 | 2、今天 00:15 111 | 3、7分钟前 112 | 经过统计,将2中‘今天’替换为‘2016-06-28’,方便计算 113 | 最后一条微博的时间为“2016-06-28 22:32:00”, 将X分钟前设为“2016-06-28 23:00:00” 114 | 有6条错误格式的时间“2014-06-12 00:25:45 来自”。 去掉“来自” 115 | ''' 116 | def processtime(x): 117 | reg_time1 = re.compile('^\d{4}-\d{2}-\d{2} \d{2}:') 118 | reg_time2 = re.compile('^\d{1,2}分钟前$') 119 | pattern1 = reg_time1.match(x) 120 | if not pattern1: 121 | if reg_time2.match(x): 122 | x = '2016-06-28 23:00:00' 123 | x = x.replace('今天', '2016-06-28') 124 | return x[:19] 125 | 126 | # 处理时间 127 | self.status['time'] = self.status['time'].map(lambda x: 
processtime(x)) 128 | self.status['time'] = self.status['time'].map(lambda x: pd.to_datetime(x, errors='coerce')) 129 | self.status['date'] = self.status.time.map(lambda x: x.strftime('%Y-%m-%d')) 130 | self.status['week'] = self.status['time'].map(lambda x: x.dayofweek) 131 | self.status['hour'] = self.status['time'].map(lambda x: x.hour) 132 | # 统计微博词数 133 | self.text['word_count'] = self.text.content.apply(lambda x: len(str(x).strip().split())) 134 | self.text['source'] = self.status.source 135 | 136 | 137 | def build_feature(self): 138 | # 微博总数 139 | temp = pd.DataFrame(self.status.groupby('uid').size(),columns=['weibo_count']).reset_index() 140 | self.features = self.features.merge(temp[['uid','weibo_count']], how='left',on='uid') 141 | 142 | # 微博去重总数 143 | temp = pd.DataFrame(self.text.drop_duplicates().groupby('uid').size(),columns=['weibo_unique_count']).reset_index() 144 | self.features = self.features.merge(temp[['uid','weibo_unique_count']], how='left',on='uid') 145 | 146 | # 转发评论数 147 | temp = self.status.groupby('uid').sum().reset_index().rename(columns={'retweet':'retweet_count','review':'review_count'}) 148 | self.features = self.features.merge(temp[['uid', 'retweet_count', 'review_count']], how='left', on='uid') 149 | 150 | # 带转发的微博数、带评论的微博数 151 | temp = pd.DataFrame(self.status[self.status.retweet > 0].groupby('uid').size(),columns=['retweet_weibo_count']).reset_index() 152 | self.features = self.features.merge(temp[['uid','retweet_weibo_count']], how='left',on='uid') 153 | temp = pd.DataFrame(self.status[self.status.review > 0].groupby('uid').size(),columns=['review_weibo_count']).reset_index() 154 | self.features = self.features.merge(temp[['uid','review_weibo_count']], how='left',on='uid') 155 | self.features.fillna(0, inplace=True) 156 | 157 | # 平均转发、评论数 158 | self.features['retweet_average_count'] = self.features.retweet_count / self.features.weibo_count 159 | self.features['review_average_count'] = self.features.review_count / self.features.weibo_count 160 | 161 | # 微博转发率(有转发的微博/微博总数) retweet_rate 162 | # 微博评论率(有评论的微博/微博总数) review_rate 163 | self.features['retweet_rate'] = self.features.retweet_weibo_count / self.features.weibo_count 164 | self.features['review_rate'] = self.features.review_weibo_count / self.features.weibo_count 165 | 166 | 167 | # 来源总数 168 | temp = pd.DataFrame(self.status.groupby('uid').source.nunique()).reset_index().rename(columns={'source':'source_count'}) 169 | self.features = self.features.merge(temp[['uid','source_count']], how='left',on='uid') 170 | 171 | # 微博登录天数 day_post_count 172 | temp = pd.DataFrame(self.status.groupby('uid').date.nunique()).reset_index().rename(columns={'date':'day_post_count'}) 173 | self.features = self.features.merge(temp[['uid','day_post_count']], how='left',on='uid') 174 | 175 | # 微博总天数 day_total_count 176 | temp = pd.DataFrame(((pd.to_datetime((self.status.groupby('uid').date.max())) - \ 177 | pd.to_datetime((self.status.groupby('uid').date.min())))/ np.timedelta64(1, 'D')).astype(float)\ 178 | ).reset_index().rename(columns={'date':'day_total_count'}) 179 | self.features = self.features.merge(temp[['uid','day_total_count']], how='left',on='uid') 180 | 181 | # 活跃天数比(发微博的天数/(最后一天-第一天)) day_rate 182 | self.features['day_rate'] = self.features.day_post_count / self.features.day_total_count 183 | 184 | # 日均微博数 everyday_weibo_count 185 | self.features['everyday_weibo_count'] = self.features.weibo_count / self.features.day_post_count 186 | 187 | # 总词数 word_total_count 188 | # 微博平均词数 word_average_count 189 | temp = 
pd.DataFrame(self.text.groupby('uid').word_count.sum()).reset_index().rename(columns={'word_count':'word_total_count'}) 190 | self.features = self.features.merge(temp[['uid','word_total_count']], how='left',on='uid') 191 | self.features['word_average_count'] = self.features.word_total_count / self.features.weibo_count 192 | 193 | # 周几的微博 194 | for i in range(7): 195 | temp = pd.DataFrame(self.status[self.status.week == i].groupby('uid').size(),columns=['weibo_count_of_week{}'.format(i+1)]).reset_index() 196 | self.features = self.features.merge(temp[['uid','weibo_count_of_week{}'.format(i+1)]], how='left',on='uid') 197 | self.features.fillna(0, inplace=True) 198 | 199 | # 工作日和周末的微博 200 | self.features['weibo_count_of_workday'] = (self.features.weibo_count_of_week1 + self.features.weibo_count_of_week2 + self.features.weibo_count_of_week3 +\ 201 | self.features.weibo_count_of_week4 + self.features.weibo_count_of_week5) 202 | self.features['weibo_count_of_weekend'] = (self.features.weibo_count_of_week6 + self.features.weibo_count_of_week7) 203 | 204 | # 各天微博的比例 205 | for i in range(1,8): 206 | self.features['weibo_rate_of_week{}'.format(i)] = (self.features['weibo_count_of_week{}'.format(i)] / self.features.weibo_count) 207 | self.features['weibo_rate_of_workday'] = self.features.weibo_count_of_workday / self.features.weibo_count 208 | self.features['weibo_rate_of_weekend'] = self.features.weibo_count_of_weekend / self.features.weibo_count 209 | 210 | # 每天各个时段的微博数量 211 | temp = pd.DataFrame(self.status[self.status.hour.apply(lambda x: x in range(0,6))].groupby(self.status.uid).size(),\ 212 | columns=['weibo_count_of_midnight']).reset_index() 213 | self.features = self.features.merge(temp[['uid','weibo_count_of_midnight']], how='left',on='uid') 214 | temp = pd.DataFrame(self.status[self.status.hour.apply(lambda x: x in range(6,12))].groupby(self.status.uid).size(),\ 215 | columns=['weibo_count_of_morning']).reset_index() 216 | self.features = self.features.merge(temp[['uid','weibo_count_of_morning']], how='left',on='uid') 217 | temp = pd.DataFrame(self.status[self.status.hour.apply(lambda x: x in range(12,18))].groupby(self.status.uid).size(),\ 218 | columns=['weibo_count_of_afternoon']).reset_index() 219 | self.features = self.features.merge(temp[['uid','weibo_count_of_afternoon']], how='left',on='uid') 220 | temp = pd.DataFrame(self.status[self.status.hour.apply(lambda x: x in range(18,24))].groupby(self.status.uid).size(),\ 221 | columns=['weibo_count_of_night']).reset_index() 222 | self.features = self.features.merge(temp[['uid','weibo_count_of_night']], how='left',on='uid') 223 | self.features.fillna(0, inplace=True) 224 | 225 | # 各个时段的微博比例 226 | self.features['weibo_rate_of_midnight'] = self.features.weibo_count_of_midnight / self.features.weibo_count 227 | self.features['weibo_rate_of_morning'] = self.features.weibo_count_of_morning / self.features.weibo_count 228 | self.features['weibo_rate_of_afternoon'] = self.features.weibo_count_of_afternoon / self.features.weibo_count 229 | self.features['weibo_rate_of_night'] = self.features.weibo_count_of_night / self.features.weibo_count 230 | 231 | # 各小时的微博数量 232 | for i in range(24): 233 | temp = pd.DataFrame(self.status[self.status.hour == i].groupby('uid').size(),columns=['weibo_count_of_hour{}'.format(i)]).reset_index() 234 | self.features = self.features.merge(temp[['uid','weibo_count_of_hour{}'.format(i)]], how='left',on='uid') 235 | self.features.fillna(0, inplace=True) 236 | for i in range(24): 237 | 
self.features['weibo_rate_of_hour{}'.format(i)] = (self.features['weibo_count_of_hour{}'.format(i)] / self.features.weibo_count) 238 | 239 | # 按时间分段 间隔3小时 240 | for i in range(0,24,3): 241 | temp = pd.DataFrame(self.status[self.status.hour.apply(lambda x: x in range(i,i+3))].groupby(self.status.uid).size(),\ 242 | columns=['weibo_count_of_{}_plus3'.format(i)]).reset_index() 243 | self.features = self.features.merge(temp[['uid','weibo_count_of_{}_plus3'.format(i)]], how='left',on='uid') 244 | self.features.fillna(0, inplace=True) 245 | for i in range(0,24,3): 246 | self.features['weibo_rate_of_{}_plus3'.format(i)] = (self.features['weibo_count_of_{}_plus3'.format(i)] / self.features.weibo_count) 247 | 248 | 249 | # 替换掉inf值 250 | self.features = self.features.replace(np.inf, 0) 251 | self.features.fillna(0, inplace=True) 252 | 253 | del temp 254 | 255 | 256 | def process_text(self): 257 | # 停用词 258 | with codecs.open(self.stopwords, encoding='utf-8', errors='ignore')as f: 259 | stop =set() 260 | for line in f: 261 | stop.add(line.strip()) 262 | def de_stop(line): 263 | line = line.strip().split() 264 | res = [] 265 | for word in line: 266 | if word not in stop: 267 | res.append(word) 268 | return ' '.join(res) 269 | # 去停用词的微博正文 270 | self.text['words'] = self.text.content.apply(lambda x: de_stop(x)) 271 | temp = pd.DataFrame(self.text.groupby('uid')['words'].apply(lambda x: ' '.join(x))).reset_index() 272 | self.data_x = self.data_x.merge(temp[['uid','words']], how='left',on='uid') 273 | # source信息 274 | temp = pd.DataFrame(self.status.groupby('uid')['source'].apply(lambda x: ' '.join(x))).reset_index() 275 | self.data_x = self.data_x.merge(temp[['uid','source']], how='left',on='uid') 276 | # source分词 277 | def fenci(line): 278 | line = line.strip().split() 279 | line = ''.join(line) 280 | seglist = jieba.cut(line) 281 | line = ' '.join(seglist) 282 | return line 283 | self.data_x['source_fenci'] = self.data_x.source.apply(lambda x:fenci(x)) 284 | self.data_x['weibo_and_source'] = (self.data_x.words + self.data_x.source_fenci) 285 | 286 | 287 | def process_fans(self): 288 | with open(self.train_links) as f: 289 | res = [] 290 | for line in f: 291 | items = line.strip().split() 292 | uid = int(items[0]) 293 | fans = ' '.join(items[1:]) 294 | number = len(fans.split()) 295 | res.append([uid, fans, number]) 296 | trainfans = pd.DataFrame(res, columns=['uid','fans','count_of_fans']) 297 | 298 | with open(self.test_links) as f: 299 | res = [] 300 | for line in f: 301 | items = line.strip().split() 302 | uid = int(items[0]) 303 | fans = ' '.join(items[1:]) 304 | number = len(fans.split()) 305 | res.append([uid, fans, number]) 306 | testfans = pd.DataFrame(res, columns=['uid','fans','count_of_fans']) 307 | 308 | fans = pd.concat([trainfans, testfans], axis=0) 309 | fans.drop_duplicates(inplace=1) 310 | 311 | self.data_x = self.data_x.merge(fans[['uid','fans','count_of_fans']], how='left',on='uid') 312 | self.data_x['has_fans'] = 0 313 | self.data_x['has_fans'][self.data_x.fans.notnull()] = 1 314 | self.data_x.fillna(0, inplace=True) 315 | 316 | # fans info 317 | self.features['has_fans'] = self.data_x.has_fans 318 | self.features['count_of_fans'] = self.data_x.count_of_fans 319 | 320 | self.features['weibo_per_fans'] = self.features.weibo_count / self.features.count_of_fans 321 | self.features['ret_per_fans'] = self.features.retweet_count / self.features.count_of_fans 322 | self.features['rev_per_fans'] = self.features.review_count / self.features.count_of_fans 323 | 324 | self.features = 
self.features.replace(np.inf, 0) 325 | self.features.fillna(0, inplace=True) 326 | 327 | self.features['fans_0_50'] = 0 328 | self.features['fans_0_50'][(self.features.count_of_fans > 0) & (self.features.count_of_fans <= 50)] = 1 329 | 330 | self.features['fans_50_100'] = 0 331 | self.features['fans_50_100'][(self.features.count_of_fans > 50) & (self.features.count_of_fans <= 100)] = 1 332 | 333 | self.features['fans_100_200'] = 0 334 | self.features['fans_100_200'][(self.features.count_of_fans > 100) & (self.features.count_of_fans <= 200)] = 1 335 | 336 | self.features['fans_200_500'] = 0 337 | self.features['fans_200_500'][(self.features.count_of_fans > 200) & (self.features.count_of_fans <= 500)] = 1 338 | 339 | self.features['fans_500_1000'] = 0 340 | self.features['fans_500_1000'][(self.features.count_of_fans > 500) & (self.features.count_of_fans <= 1000)] = 1 341 | 342 | self.features['fans_1000'] = 0 343 | self.features['fans_1000'][(self.features.count_of_fans > 1000)] = 1 -------------------------------------------------------------------------------- /base/feipeng/model.py: -------------------------------------------------------------------------------- 1 | import xgboost as xgb 2 | import pickle 3 | import numpy as np 4 | import pandas as pd 5 | from sklearn.preprocessing import LabelEncoder 6 | from sklearn.feature_extraction.text import TfidfVectorizer 7 | from sklearn.cross_validation import StratifiedKFold, StratifiedShuffleSplit 8 | from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier, BaggingClassifier 9 | from sklearn.linear_model import LogisticRegression 10 | 11 | 12 | class age(object): 13 | def __init__(self, data_file, feature_file, stack_file, train_labels): 14 | #保存stack特征 15 | self.stack_file = stack_file 16 | # 特征和微博数据X 17 | self.features = pickle.load(open(feature_file, 'rb')) 18 | self.data_x = pickle.load(open(data_file, 'rb')) 19 | 20 | # 标签Y 21 | self.df = pd.read_csv(train_labels, names=['uid','gender','birthday','location'], encoding='utf-8') 22 | def age(x): 23 | if x <= 1979: 24 | return u'-1979' 25 | elif x>=1980 and x<=1989: 26 | return u'1980-1989' 27 | else: 28 | return u'1990+' 29 | self.df['age'] = self.df.birthday.apply(lambda x:age(x)) 30 | 31 | self.df = self.df[['uid','age']] 32 | 33 | self.le_age = LabelEncoder() 34 | self.le_age.fit(self.df.age) 35 | self.df['y_age'] = self.le_age.transform(self.df.age) 36 | 37 | def stacking(self): 38 | X = self.data_x.weibo_and_source[:] 39 | vectormodel = TfidfVectorizer(ngram_range=(1,1), min_df=3,use_idf=False, smooth_idf=False, sublinear_tf=True, norm=False) 40 | X = vectormodel.fit_transform(X) 41 | 42 | # 数据 43 | y = self.df.y_age 44 | train_x = X[:len(y)] 45 | test_x = X[len(y):].tocsc() 46 | 47 | np.random.seed(0) 48 | 49 | n_folds = 5 50 | n_class = 3 51 | 52 | train_x_id = range(train_x.shape[0]) 53 | val_x_id = range(test_x.shape[0]) 54 | 55 | X = train_x 56 | y = y 57 | X_submission = test_x 58 | 59 | 60 | skf = list(StratifiedKFold(y, n_folds)) 61 | 62 | clfs = [ 63 | LogisticRegression(penalty='l1',n_jobs=-1,C=1.0), 64 | LogisticRegression(penalty='l2',n_jobs=-1,C=1.0), 65 | RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='gini'), 66 | RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='entropy'), 67 | ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='gini'), 68 | ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='entropy') 69 | ] 70 | 71 | 72 | dataset_blend_train = np.zeros((X.shape[0], 
len(clfs)*n_class)) 73 | dataset_blend_test = np.zeros((X_submission.shape[0], len(clfs)*n_class)) 74 | 75 | for j, clf in enumerate(clfs): 76 | print (j, clf) 77 | dataset_blend_test_j = np.zeros((X_submission.shape[0], n_class)) 78 | for i, (train, test) in enumerate(skf): 79 | print ('Fold ',i) 80 | X_train = X[train] 81 | y_train = y[train] 82 | X_test = X[test] 83 | y_test = y[test] 84 | clf.fit(X_train, y_train) 85 | y_submission = clf.predict_proba(X_test) 86 | dataset_blend_train[test, j*n_class:j*n_class+n_class] = y_submission 87 | dataset_blend_test_j += clf.predict_proba(X_submission) 88 | dataset_blend_test[:,j*n_class:j*n_class+n_class] = dataset_blend_test_j[:,]/n_folds 89 | 90 | all_X_1 = np.concatenate((dataset_blend_train, dataset_blend_test), axis=0) 91 | 92 | # xgboost 93 | temp = np.zeros((len(y),n_class)) 94 | test = np.zeros((test_x.shape[0], n_class)) 95 | test_x = test_x.tocsc() 96 | dtest = xgb.DMatrix(test_x) 97 | for tra, val in StratifiedKFold(y, 5, random_state=658): 98 | X_train = train_x[tra] 99 | y_train = y[tra] 100 | X_val = train_x[val] 101 | y_val = y[val] 102 | 103 | x_train = X_train.tocsc() 104 | x_val = X_val.tocsc() 105 | 106 | dtrain = xgb.DMatrix(x_train, y_train) 107 | dval = xgb.DMatrix(x_val) 108 | 109 | params = { 110 | "objective": "multi:softprob", 111 | "booster": "gblinear", 112 | "eval_metric": "merror", 113 | "num_class":3, 114 | 'max_depth':3, 115 | 'min_child_weight':1.5, 116 | 'subsample':0.7, 117 | 'colsample_bytree':1, 118 | 'gamma':2.5, 119 | "eta": 0.01, 120 | "lambda":1, 121 | 'alpha':0, 122 | "silent": 1, 123 | } 124 | watchlist = [(dtrain, 'train')] 125 | model = xgb.train(params, dtrain, 2000, evals=watchlist, 126 | early_stopping_rounds=200, verbose_eval=200) 127 | result = model.predict(dval) 128 | temp[val] = result[:] 129 | 130 | res = model.predict(dtest) 131 | test += res 132 | test /= 5 133 | all_X_2 = np.concatenate((temp, test), axis=0) 134 | 135 | ############################################################################# 136 | ############################################################################# 137 | # merge 138 | all_X = np.concatenate((all_X_1, all_X_2), axis=1) 139 | pickle.dump(all_X, open(self.stack_file,'wb')) 140 | 141 | def concat_features(self, outfeatures=None): 142 | print('concat features...') 143 | all_X = pickle.load(open(self.stack_file,'rb')) 144 | myfeature = self.features.drop(['uid'],axis=1).as_matrix() 145 | #train+test set 146 | self.all_X = np.concatenate((all_X, myfeature), axis=1) 147 | 148 | #concat_outfea 149 | if outfeatures: 150 | featureslist = pickle.load(open(outfeatures, 'rb')) 151 | for fea in featureslist: 152 | self.all_X = np.concatenate((self.all_X, fea), axis=1) 153 | #train set 154 | self.X = self.all_X[:self.df.shape[0]] 155 | self.y = self.df.y_age 156 | 157 | print ('特征维数为{}维'.format(self.X.shape[1])) 158 | 159 | def fit_transform(self, result_age): 160 | print('bagging...') 161 | n = 8 162 | score = 0 163 | pres = [] 164 | i=1 165 | for tra, val in StratifiedShuffleSplit(self.y, n, test_size=0.2, random_state=233): 166 | print('run {}/{}'.format(i,n)) 167 | i+=1 168 | 169 | X_train = self.X[tra] 170 | y_train = self.y[tra] 171 | X_val = self.X[val] 172 | y_val = self.y[val] 173 | 174 | dtrain = xgb.DMatrix(X_train, y_train) 175 | dval = xgb.DMatrix(X_val, y_val) 176 | dtest = xgb.DMatrix(self.all_X[self.df.shape[0]:]) 177 | 178 | params = { 179 | "objective": "multi:softmax", 180 | "booster": "gbtree", 181 | "eval_metric": "merror", 182 | "num_class":3, 183 | 
'max_depth':3, 184 | 'min_child_weight':1.5, 185 | 'subsample':0.7, 186 | 'colsample_bytree':1, 187 | 'gamma':2.5, 188 | "eta": 0.01, 189 | "lambda":1, 190 | 'alpha':0, 191 | "silent": 1, 192 | } 193 | watchlist = [(dtrain, 'train'), (dval, 'eval')] 194 | 195 | bst = xgb.train(params, dtrain, 2000, evals=watchlist, 196 | early_stopping_rounds=200, verbose_eval=False) 197 | score += bst.best_score 198 | 199 | pre = bst.predict(dtest) 200 | pres.append(pre) 201 | 202 | score /= n 203 | score = 1 - score 204 | print('*********************************************') 205 | print('*********************************************') 206 | print("******年龄平均准确率为{}**************".format(score)) 207 | print('*********************************************') 208 | print('*********************************************') 209 | 210 | # vote 211 | pres = np.array(pres).T.astype('int64') 212 | pre = [] 213 | for line in pres: 214 | pre.append(np.bincount(line).argmax()) 215 | 216 | result = pd.DataFrame(pre, columns=['age']) 217 | result['age'] = result.age.apply(lambda x: int(x)) 218 | result['age'] = self.le_age.inverse_transform(result.age) 219 | result.to_csv(result_age, index=None) 220 | print('result saved!') 221 | 222 | 223 | 224 | 225 | 226 | class gender(object): 227 | def __init__(self, data_file, feature_file, stack_file, train_labels): 228 | #保存stack特征 229 | self.stack_file = stack_file 230 | # 特征和微博数据X 231 | self.features = pickle.load(open(feature_file, 'rb')) 232 | self.data_x = pickle.load(open(data_file, 'rb')) 233 | 234 | # 标签Y 235 | self.df = pd.read_csv(train_labels, names=['uid','gender','birthday','location'], encoding='utf-8') 236 | self.df = self.df[['uid','gender']] 237 | 238 | self.le_gender = LabelEncoder() 239 | self.le_gender.fit(self.df.gender) 240 | self.df['y_gender'] = self.le_gender.transform(self.df.gender) 241 | 242 | def stacking(self): 243 | X = self.data_x.weibo_and_source[:] 244 | vectormodel = TfidfVectorizer(ngram_range=(1,1), min_df=3,use_idf=False, smooth_idf=False, sublinear_tf=True, norm=False) 245 | X = vectormodel.fit_transform(X) 246 | 247 | # 数据 248 | y = self.df.y_gender 249 | train_x = X[:len(y)] 250 | test_x = X[len(y):].tocsc() 251 | 252 | np.random.seed(9) 253 | 254 | n_folds = 5 255 | n_class = 2 256 | 257 | train_x_id = range(train_x.shape[0]) 258 | val_x_id = range(test_x.shape[0]) 259 | 260 | X = train_x 261 | y = y 262 | X_submission = test_x 263 | 264 | 265 | skf = list(StratifiedKFold(y, n_folds, random_state=99)) 266 | 267 | clfs = [ 268 | LogisticRegression(penalty='l1',n_jobs=-1,C=1.0), 269 | LogisticRegression(penalty='l2',n_jobs=-1,C=1.0), 270 | RandomForestClassifier(n_estimators=200, n_jobs=-1, criterion='gini', random_state=9), 271 | RandomForestClassifier(n_estimators=200, n_jobs=-1, criterion='entropy', random_state=9), 272 | ExtraTreesClassifier(n_estimators=200, n_jobs=-1, criterion='gini', random_state=9), 273 | ExtraTreesClassifier(n_estimators=200, n_jobs=-1, criterion='entropy', random_state=9) 274 | ] 275 | 276 | 277 | dataset_blend_train = np.zeros((X.shape[0], len(clfs)*n_class)) 278 | dataset_blend_test = np.zeros((X_submission.shape[0], len(clfs)*n_class)) 279 | 280 | for j, clf in enumerate(clfs): 281 | np.random.seed(9) 282 | print (j, clf) 283 | dataset_blend_test_j = np.zeros((X_submission.shape[0], n_class)) 284 | for i, (train, test) in enumerate(skf): 285 | print ('Fold ',i) 286 | X_train = X[train] 287 | y_train = y[train] 288 | X_test = X[test] 289 | y_test = y[test] 290 | clf.fit(X_train, y_train) 291 | y_submission = 
clf.predict_proba(X_test) 292 | dataset_blend_train[test, j*n_class:j*n_class+n_class] = y_submission 293 | dataset_blend_test_j += clf.predict_proba(X_submission) 294 | dataset_blend_test[:,j*n_class:j*n_class+n_class] = dataset_blend_test_j[:,]/n_folds 295 | 296 | all_X_1 = np.concatenate((dataset_blend_train, dataset_blend_test), axis=0) 297 | 298 | # xgboost 299 | temp = np.zeros((len(y),n_class)) 300 | test = np.zeros((test_x.shape[0], n_class)) 301 | test_x = test_x.tocsc() 302 | dtest = xgb.DMatrix(test_x) 303 | for tra, val in StratifiedKFold(y, 5, random_state=23): 304 | X_train = train_x[tra] 305 | y_train = y[tra] 306 | X_val = train_x[val] 307 | y_val = y[val] 308 | 309 | x_train = X_train.tocsc() 310 | x_val = X_val.tocsc() 311 | 312 | dtrain = xgb.DMatrix(x_train, y_train) 313 | dval = xgb.DMatrix(x_val) 314 | 315 | params = { 316 | "objective": "multi:softprob", 317 | "booster": "gblinear", 318 | "eval_metric": "merror", 319 | "num_class":2, 320 | 'max_depth':3, 321 | 'min_child_weight':1.5, 322 | 'subsample':0.7, 323 | 'colsample_bytree':1, 324 | 'gamma':2.5, 325 | "eta": 0.01, 326 | "lambda":1, 327 | 'alpha':0, 328 | "silent": 1, 329 | 'seed':1 330 | } 331 | watchlist = [(dtrain, 'train')] 332 | model = xgb.train(params, dtrain, 2000, evals=watchlist, 333 | early_stopping_rounds=50, verbose_eval=1000) 334 | result = model.predict(dval) 335 | temp[val] = result[:] 336 | 337 | res = model.predict(dtest) 338 | test += res 339 | test /= n_folds 340 | all_X_2 = np.concatenate((temp, test), axis=0) 341 | 342 | ############################### 343 | ############################### 344 | all_X = np.concatenate((all_X_1, all_X_2), axis=1) 345 | pickle.dump(all_X, open(self.stack_file, 'wb')) 346 | 347 | def fit_transform(self, result_gender): 348 | print('concat features...') 349 | all_X = pickle.load(open(self.stack_file,'rb')) 350 | myfeature = self.features.drop(['uid'],axis=1).as_matrix() 351 | #train+test set 352 | self.all_X = np.concatenate((all_X, myfeature), axis=1) 353 | #train set 354 | self.X = self.all_X[:self.df.shape[0]] 355 | self.y = self.df.y_gender 356 | print ('特征维数为{}维'.format(self.X.shape[1])) 357 | 358 | print('bagging') 359 | n = 7 360 | score = 0 361 | pres = [] 362 | i=1 363 | for tra, val in StratifiedShuffleSplit(self.y, n, test_size=0.2, random_state=7): 364 | 365 | print('run {}/{}'.format(i,n)) 366 | i+=1 367 | 368 | X_train = self.X[tra] 369 | y_train = self.y[tra] 370 | X_val = self.X[val] 371 | y_val = self.y[val] 372 | 373 | dtrain = xgb.DMatrix(X_train, y_train) 374 | dval = xgb.DMatrix(X_val, y_val) 375 | dtest = xgb.DMatrix(self.all_X[self.df.shape[0]:]) 376 | 377 | params = { 378 | "objective": "binary:logistic", 379 | "booster": "gbtree", 380 | "eval_metric": "error", 381 | 'max_depth':3, 382 | 'min_child_weight':1.5, 383 | 'subsample':0.7, 384 | 'colsample_bytree':1, 385 | 'gamma':2.5, 386 | "eta": 0.01, 387 | "lambda":1, 388 | 'alpha':0, 389 | "silent": 1, 390 | } 391 | watchlist = [(dtrain, 'train'), (dval, 'eval')] 392 | 393 | bst = xgb.train(params, dtrain, 2000, evals=watchlist, 394 | early_stopping_rounds=200, verbose_eval=False) 395 | score += bst.best_score 396 | 397 | pre = bst.predict(dtest) 398 | pre[pre>=0.5] = 1 399 | pre[pre<0.5] = 0 400 | pres.append(pre) 401 | 402 | 403 | score /= n 404 | score = 1 - score 405 | print('*********************************************') 406 | print('*********************************************') 407 | print("******性别平均准确率为{}**************".format(score)) 408 | 
print('*********************************************') 409 | print('*********************************************') 410 | 411 | # vote 412 | pres = np.array(pres).T.astype('int64') 413 | pre = [] 414 | for line in pres: 415 | pre.append(np.bincount(line).argmax()) 416 | 417 | result = pd.DataFrame(pre, columns=['gender']) 418 | result['gender'] = self.le_gender.inverse_transform(result.gender) 419 | result.to_csv(result_gender, index=None) -------------------------------------------------------------------------------- /base/keras_helper.py: -------------------------------------------------------------------------------- 1 | from keras import backend as K 2 | import numpy as np 3 | from keras.callbacks import Callback 4 | class ModelCheckpointPlus(Callback): 5 | ''' 6 | 定义最优代数为val_loss-val_acc的最小值 7 | ''' 8 | def __init__(self, filepath, monitor='val_loss+', verbose=0, 9 | save_best_only=True, save_weights_only=False, 10 | mode='auto',verbose_show=5): 11 | super(ModelCheckpointPlus, self).__init__() 12 | self.monitor = monitor 13 | self.verbose = verbose 14 | self.verbose_show=verbose_show 15 | self.filepath = filepath 16 | self.save_best_only = save_best_only 17 | self.save_weights_only = save_weights_only 18 | 19 | if mode not in ['auto', 'min', 'max']: 20 | warnings.warn('ModelCheckpoint mode %s is unknown, ' 21 | 'fallback to auto mode.' % (mode), 22 | RuntimeWarning) 23 | mode = 'auto' 24 | 25 | if mode == 'min': 26 | self.monitor_op = np.less 27 | self.best = np.Inf 28 | elif mode == 'max': 29 | self.monitor_op = np.greater 30 | self.best = -np.Inf 31 | else: 32 | if 'acc' in self.monitor: 33 | self.monitor_op = np.greater 34 | self.best = -np.Inf 35 | else: 36 | self.monitor_op = np.less 37 | self.best = np.Inf 38 | 39 | def on_epoch_end(self, epoch, logs={}): 40 | filepath = self.filepath.format(epoch=epoch, **logs) 41 | if self.save_best_only: 42 | if self.monitor=='val_loss+': 43 | loss_val = logs.get('val_loss') 44 | acc_val = logs.get('val_acc') 45 | if loss_val is not None and acc_val is not None: 46 | current = loss_val-acc_val 47 | else: 48 | current=None 49 | else: 50 | current=logs.get(self.monitor) 51 | 52 | if current is None: 53 | warnings.warn('Can save best model only with %s available, ' 54 | 'skipping.' 
% (self.monitor), RuntimeWarning) 55 | else: 56 | if self.monitor_op(current, self.best): 57 | if self.verbose > 0: 58 | print('Epoch %05d: %s improved from %0.5f to %0.5f,' 59 | ' saving model to %s' 60 | % (epoch, self.monitor, self.best, 61 | current, filepath)) 62 | self.best = current 63 | self.best_loss = logs.get('val_loss') 64 | self.best_acc = logs.get('val_acc') 65 | if self.save_weights_only: 66 | self.model.save_weights(filepath, overwrite=True) 67 | else: 68 | self.model.save(filepath, overwrite=True) 69 | else: 70 | if self.verbose > 0: 71 | print('Epoch %05d: %s did not improve' % 72 | (epoch, self.monitor)) 73 | else: 74 | if self.verbose > 0: 75 | print('Epoch %05d: saving model to %s' % (epoch, filepath)) 76 | if self.save_weights_only: 77 | self.model.save_weights(filepath, overwrite=True) 78 | else: 79 | self.model.save(filepath, overwrite=True) 80 | 81 | if self.verbose_show>0 and epoch%self.verbose_show==0: 82 | print("epoch: %d - loss: %.4f - acc: %.4f - val_loss: %.4f - val_acc: %.4f" 83 | %(epoch,logs.get('loss'),logs.get('acc'),logs.get('val_loss'),logs.get('val_acc'))) 84 | -------------------------------------------------------------------------------- /base/utils.py: -------------------------------------------------------------------------------- 1 | import math 2 | import random 3 | import scipy 4 | from scipy.sparse.csr import csr_matrix 5 | from scipy import sparse 6 | import numpy as np 7 | from base.dataset import load_w2v,smp_path 8 | 9 | ''' 10 | 工具函数 11 | ''' 12 | def get(items,i): 13 | return [item[i] for item in items] 14 | 15 | '''载入资源''' 16 | def load_keywords(): 17 | with open(smp_path+'/user_data/keywords.txt',encoding='utf8') as f: 18 | items=[item.strip() for item in f.readlines()] 19 | return items 20 | 21 | '''计算两个词集余弦的相似度''' 22 | def cal_similar(words1,words2): 23 | dict1={} 24 | dict2={} 25 | for w in words1: 26 | dict1[w]=dict1.get(w,0)+1 27 | 28 | total1=0 29 | for w in dict1: 30 | total1+=dict1[w]*dict1[w] 31 | 32 | for w in words2: 33 | dict2[w]=dict2.get(w,0)+1 34 | total2=0 35 | for w in dict2: 36 | total2+=dict2[w]*dict2[w] 37 | 38 | res=0 39 | for w in dict1: 40 | if w in dict2: 41 | res+=dict1[w]*dict2[w]/math.sqrt(total1*total2) 42 | return res 43 | 44 | 45 | '''去除高度相似的微博''' 46 | def remove_duplicate(sentences): 47 | random.seed(86) 48 | max_cnt=100 49 | res=[] 50 | for sen in sentences: 51 | items=sen.split() 52 | data=[] 53 | for i in range(len(items)): 54 | is_similar=False 55 | for j in range(len(data)): 56 | if cal_similar(items[i],data[j])>0.9: 57 | is_similar=True 58 | break 59 | if is_similar==False: 60 | data.append(items[i]) 61 | if len(data)>max_cnt: 62 | random.shuffle(data) 63 | data=data[:max_cnt] 64 | res.append(data) 65 | 66 | return [' '.join(item) for item in res] 67 | 68 | 69 | 70 | def merge(fs,filter_indexs=None): 71 | ''' 72 | 传入多个特征的集合,每个特征由[train,test]这样的list组成 73 | filter_indexs=(filter_index,train_index,valid_index) 74 | 其中filter_index用来预先排除训练集中有缺陷的数据 75 | train_index和valid_index分别为训练集和验证集的索引值 76 | ''' 77 | print(type(fs[0][0])) 78 | tmp=[] 79 | for f in fs: 80 | if f[0].ndim==1: 81 | f=[f[0].reshape(f[0].shape[0],1),f[1].reshape(f[1].shape[0],1)] 82 | if filter_indexs!=None: 83 | if len(f)==2: 84 | f=[f[0][filter_indexs[0]],f[1]] 85 | f=[f[0][filter_indexs[1]],f[0][filter_indexs[2]],f[1]] 86 | tmp.append(f) 87 | 88 | '''判断是train_data,test_data还是train_data,valid_data,test_data''' 89 | colCnt=len(tmp[0]) 90 | if type(tmp[0][0])==scipy.sparse.csr.csr_matrix: 91 | res=[sparse.hstack(tuple([item[i] for item in 
tmp])) for i in range(colCnt)] 92 | else: 93 | res=[np.hstack(tuple([item[i] for item in tmp])) for i in range(colCnt)] 94 | print(res[0].shape) 95 | return res 96 | 97 | 98 | ''' 99 | 显示各个类别的准确率 100 | ''' 101 | def describe(py,ry,detail=False): 102 | if np.array(py).ndim>1: 103 | py=[np.argmax(y) for y in py] 104 | 105 | right_cnt=len([y for r,y in zip(ry,py) if y==r]) 106 | total_cnt=len(ry) 107 | res=right_cnt/total_cnt 108 | print('准确率:',res,'(%d/%d)'%(right_cnt,total_cnt)) 109 | if(detail): 110 | keys=list(set(ry)) 111 | for k in keys: 112 | right_cnt=len([r for r,y in zip(ry,py) if y==r and y==k]) 113 | total_cnt=len([y for y in ry if y==k]) 114 | predict_cnt=len([y for y in py if y==k]) 115 | p=right_cnt/(predict_cnt+0.0000001) 116 | r=right_cnt/(total_cnt+0.0000001) 117 | print('类别',k,'准确率=%.4f'%p,'(%d/%d)'%(right_cnt,predict_cnt), 118 | '\t召回率=%.4f'%r,'(%d/%d)'%(right_cnt,total_cnt),'\tf=%.4f'%(2*p*r/(p+r+0.00001))) 119 | return res 120 | 121 | ''' 122 | 传入feature集合整合成特征集 123 | ''' 124 | def get_xs(fs): 125 | if len(fs)==1: 126 | return fs[0] 127 | 128 | if type(fs[0])==csr_matrix: 129 | return sparse.hstack((fs)).tocsc() 130 | else: 131 | tmp=[] 132 | for f in fs: 133 | if type(f)==csr_matrix: 134 | tmp.append(f.toarray()) 135 | else: 136 | tmp.append(f) 137 | return np.hstack(tmp) 138 | 139 | ''' 140 | 传入单词表,返回词向量集合 141 | ''' 142 | def get_word_vectors(words,w2v_model=None): 143 | if w2v_model==None: 144 | w2v_model=load_w2v() 145 | vectors=[] 146 | cnt=0 147 | for w in words: 148 | if w in w2v_model: 149 | vectors.append(w2v_model[w]) 150 | else: 151 | vectors.append(np.zeros(w2v_model.vector_size)) 152 | cnt+=1 153 | print('不在词表中的词数量:',cnt) 154 | return np.array(vectors) 155 | 156 | 157 | 158 | 159 | 160 | 161 | 162 | 163 | -------------------------------------------------------------------------------- /base/yuml/models.py: -------------------------------------------------------------------------------- 1 | import keras 2 | from keras.models import Sequential 3 | from keras.layers.core import Dense,Activation,Flatten,Dropout 4 | from keras.layers.convolutional import Convolution2D,MaxPooling1D,Convolution1D,MaxPooling2D,AveragePooling2D 5 | from keras.utils.np_utils import to_categorical 6 | from keras.callbacks import EarlyStopping 7 | from keras.engine.topology import Merge 8 | from base.keras_helper import ModelCheckpointPlus 9 | from sklearn.metrics import log_loss,accuracy_score 10 | import numpy as np 11 | from sklearn.cross_validation import StratifiedShuffleSplit,StratifiedKFold 12 | from sklearn.metrics import log_loss 13 | from scipy.optimize import minimize 14 | from collections import Counter 15 | import pickle 16 | import xgboost as xgb 17 | 18 | 19 | class StackEnsemble(object): 20 | ''' 21 | Stack可以利用一个元分类器,使用5折样本 22 | pred_proba: True调用predict_proba函数,False调用predict函数 23 | need_valid: True将给分类器的fit同时传入训练集和验证集(用于earlystop),False只传入训练集 24 | ''' 25 | def __init__(self,model_creator,n_folds=5,seed=100,multi_input=False,pred_proba=True,need_valid=True): 26 | self.model_creator=model_creator 27 | self.n_folds=n_folds 28 | self.seed=seed 29 | self.multi_input=multi_input 30 | self.pred_proba=pred_proba 31 | self.need_valid=need_valid 32 | self.models=[] 33 | self.fit_yprob=[] #训练集的预测结果 34 | self.predict_yprob=[] #测试集的预测结果(5折bagging) 35 | self.indexes=None 36 | pass 37 | def fit(self,X,y): 38 | ''' 39 | 训练Stack Ensemble模型,会得到n_folds个分类器 40 | ''' 41 | y_prob=np.zeros(1) 42 | indexes=StratifiedKFold(y,n_folds=self.n_folds,shuffle=True,random_state=self.seed) 43 | 
self.indexes=indexes 44 | cnt=0 45 | for ti,vi in indexes: 46 | print('--------stack-%d-------------'%(cnt+1)) 47 | cnt+=1 48 | model=self.model_creator() 49 | #兼容keras的多输入 50 | if self.multi_input: 51 | Xti,Xvi=[x[ti] for x in X],[x[vi] for x in X] 52 | else: 53 | Xti,Xvi=X[ti],X[vi] 54 | 55 | #训练模型 56 | if self.need_valid: 57 | model.fit(Xti,y[ti],Xvi,y[vi]) 58 | else: 59 | model.fit(Xti,y[ti]) 60 | 61 | if self.pred_proba: 62 | y_p=model.predict_proba(Xvi) 63 | else: 64 | y_p=model.predict(Xvi) 65 | 66 | if y_prob.shape[0]==1: 67 | y_prob=np.zeros((len(y),y_p.shape[1])) 68 | y_prob[vi]=y_p 69 | 70 | self.models.append(model) 71 | print('log loss: %f, accuracy: %f'%(log_loss(y[vi],y_p),accuracy_score(y[vi],np.argmax(y_p,axis=1)))) 72 | print('----------------------------------') 73 | print('----log loss: %f, accuracy: %f'%(log_loss(y,y_prob),accuracy_score(y,np.argmax(y_prob,axis=1)))) 74 | 75 | self.fit_yprob=y_prob #得到对训练集的预测值 76 | return y_prob 77 | 78 | def predict(self,X,type='aver'): 79 | return self.predict_proba(X,type=type) 80 | 81 | ''' 82 | type表示返回数据的格式 83 | aver: 取平均值返回 84 | raw: 返回原始数据(由多个分类器输出结果构成的数组) 85 | ''' 86 | def predict_proba(self,X,type='aver'): 87 | if self.pred_proba: 88 | res= [model.predict_proba(X) for model in self.models] 89 | else: 90 | res= [model.predict(X) for model in self.models] 91 | self.predict_yprob=res 92 | 93 | if type=='aver': 94 | return np.average(res,axis=0) 95 | elif type=='max': 96 | return np.max(res,axis=0) 97 | else: 98 | return res 99 | ''' 100 | 获得下一级分类器的输入数据 101 | ''' 102 | def get_next_input(self): 103 | aver=np.average(self.predict_yprob,axis=0) 104 | return np.vstack((self.fit_yprob,aver)) 105 | 106 | ''' 107 | 保存模型 108 | ''' 109 | def save(self,filename,save_model=False): 110 | if save_model==True: 111 | pickle.dump([self.models,self.fit_yprob,self.predict_yprob,self.indexes],open(filename,'wb')) 112 | else: 113 | pickle.dump([self.fit_yprob,self.predict_yprob,self.indexes],open(filename,'wb')) 114 | 115 | ''' 116 | 载入模型 117 | ''' 118 | def load(self,filename): 119 | items=pickle.load(open(filename,'rb')) 120 | if len(items)==4: 121 | self.models,self.fit_yprob,self.predict_yprob,self.indexes=items 122 | else: 123 | self.fit_yprob,self.predict_yprob,self.indexes=items 124 | 125 | 126 | 127 | 128 | class WeightEnsemble(object): 129 | ''' 130 | 对多个模型的输出结果进行加权,接口类似sklearn 131 | ''' 132 | def __init__(self): 133 | pass 134 | 135 | def fit(self, y_probs, y_true): 136 | self.y_true=y_true 137 | self.y_probs=y_probs 138 | starting_values = [0.5]*len(y_probs) 139 | bounds = [(0,1)]*len(y_probs) 140 | cons = ({'type':'eq','fun':lambda w: 1-sum(w)}) 141 | res=minimize(self.weight_log_loss, starting_values, method='SLSQP', bounds=bounds, constraints=cons) 142 | self.weights=res.x 143 | return res 144 | 145 | def predict(self,y_probs): 146 | res=0 147 | for y_prob,w in zip(y_probs,self.weights): 148 | res+=y_prob*w 149 | return res 150 | 151 | def weight_log_loss(self,weights): 152 | final_ypred=0 153 | for weight,p in zip(weights,self.y_probs): 154 | final_ypred+=weight*p 155 | return log_loss(self.y_true,final_ypred) 156 | 157 | 158 | class WeightVoter(object): 159 | ''' 160 | 获得获得不同stack的权重 161 | 每个stack中的5个模型进行投票 162 | ''' 163 | def __init__(self): 164 | self.models=[] 165 | 166 | def append(self,model): 167 | self.models.append(model) 168 | 169 | def extend(self,models): 170 | self.models.extend(models) 171 | 172 | def fit(self,y): 173 | self.y=y 174 | self.en_models=[] 175 | self.y_probs=[] 176 | index=0 177 | for ti,vi in 
self.models[0].indexes: 178 | fs=[em.fit_yprob[vi] for em in self.models] 179 | model=WeightEnsemble() 180 | model.fit(fs,self.y[vi]) 181 | y_p=model.predict([em.predict_yprob[index] for em in self.models]) 182 | self.y_probs.append(y_p) 183 | self.en_models.append(model) 184 | 185 | index+=1 186 | return self.en_models 187 | 188 | def vote(self): 189 | preds=np.argmax(self.y_probs,axis=2).transpose() 190 | y_pred=[sorted(Counter(items).items(),key=lambda x:x[1],reverse=True)[0][0] for items in preds] 191 | return y_pred 192 | 193 | def fit_vote(self,y): 194 | self.fit(y) 195 | return self.vote() 196 | 197 | class MCNN2(object): 198 | ''' 199 | cnn与人工特征混合,输入数据为2组 200 | 使用word2vec*tfidf的cnn并与人工特征混合,接口与sklearn分类器一致 201 | ''' 202 | def __init__(self,cnn_input_dim,ext_input_dim,num_class=3,num_channel=1,seed=100): 203 | self.seed=seed 204 | self.num_class=num_class 205 | self.num_channel=num_channel 206 | self.build(cnn_input_dim,ext_input_dim) 207 | 208 | def build(self,vector_dim,ext_feature_dim): 209 | #句子特征 210 | model=Sequential() 211 | model.add(Convolution2D(100,1,vector_dim,input_shape=(self.num_channel,100,vector_dim),activation='relu')) 212 | model.add(Dropout(0.5)) 213 | model.add(MaxPooling2D(pool_size=(50,1))) 214 | model.add(Flatten()) 215 | model.add(Dropout(0.5)) 216 | model.add(Dense(100,activation='tanh')) 217 | model.add(Dropout(0.5)) 218 | 219 | #用户整体特征 220 | model2=Sequential() 221 | model2.add(Dense(100,input_dim=ext_feature_dim,activation='tanh')) 222 | model2.add(Dropout(0.5)) 223 | 224 | merged_model= Sequential() 225 | merged_model.add(Merge([model, model2], mode='concat', concat_axis=1)) 226 | merged_model.add(Dense(self.num_class)) 227 | merged_model.add(Activation('softmax')) 228 | 229 | merged_model.compile(loss='categorical_crossentropy',optimizer='adadelta',metrics=['accuracy'],) 230 | 231 | self.model=merged_model 232 | self.earlyStopping=EarlyStopping(monitor='val_loss', patience=25, verbose=0, mode='auto') 233 | self.checkpoint=ModelCheckpointPlus(filepath='weights.hdf5',monitor='val_loss',verbose_show=20) 234 | 235 | def fit(self,X,y,Xvi=None,yvi=None): 236 | np.random.seed(self.seed) 237 | yc=to_categorical(y) 238 | if Xvi is None: 239 | self.model.fit(X,yc,nb_epoch=1000,verbose=0,validation_split=0.2,batch_size=32,callbacks=[self.earlyStopping,self.checkpoint]) 240 | else: 241 | ycvi=to_categorical(yvi) 242 | self.model.fit(X,yc,nb_epoch=1000,verbose=0,validation_data=[Xvi,ycvi], 243 | batch_size=32,callbacks=[self.earlyStopping,self.checkpoint]) 244 | self.model.load_weights('weights.hdf5') 245 | return self.model 246 | 247 | def predict(self,X): 248 | return self.predict_proba(X) 249 | 250 | def predict_proba(self,X): 251 | return self.model.predict(X) 252 | 253 | 254 | class MCNN3(object): 255 | ''' 256 | cnn与人工特征混合,输入数据为3组 257 | 使用word2vec*tfidf的cnn并与人工特征混合,接口与sklearn分类器一致 258 | ''' 259 | def __init__(self,input_dims,num_class=8,num_channel=1,seed=100): 260 | self.seed=seed 261 | self.num_class=num_class 262 | self.num_channel=num_channel 263 | self.build(input_dims) 264 | 265 | def build(self,input_dims): 266 | #句子特征 267 | model=Sequential() 268 | model.add(Convolution2D(100,1,input_dims[0],input_shape=(self.num_channel,100,input_dims[0]),activation='relu')) 269 | model.add(Dropout(0.5)) 270 | model.add(MaxPooling2D(pool_size=(50,1))) 271 | model.add(Flatten()) 272 | model.add(Dropout(0.5)) 273 | model.add(Dense(100,activation='tanh')) 274 | model.add(Dropout(0.5)) 275 | 276 | #用户整体特征 277 | model2=Sequential() 278 | 
model2.add(Dense(100,input_dim=input_dims[1],activation='tanh')) 279 | model2.add(Dropout(0.5)) 280 | 281 | #时间地域特征 282 | model3=Sequential() 283 | model3.add(Dense(output_dim=800,input_dim=input_dims[2],activation='tanh')) 284 | model3.add(Dropout(0.5)) 285 | model3.add(Dense(output_dim=300,activation='tanh')) 286 | model3.add(Dropout(0.5)) 287 | 288 | merged_model= Sequential() 289 | merged_model.add(Merge([model, model2,model3], mode='concat', concat_axis=1)) 290 | merged_model.add(Dense(self.num_class)) 291 | merged_model.add(Activation('softmax')) 292 | 293 | merged_model.compile(loss='categorical_crossentropy',optimizer='adadelta',metrics=['accuracy'],) 294 | 295 | self.model=merged_model 296 | self.earlyStopping=EarlyStopping(monitor='val_loss', patience=25, verbose=0, mode='auto') 297 | self.checkpoint=ModelCheckpointPlus(filepath='weights.hdf5',monitor='val_loss',verbose_show=20) 298 | 299 | def fit(self,X,y,Xvi=None,yvi=None): 300 | np.random.seed(self.seed) 301 | yc=to_categorical(y) 302 | if Xvi is None: 303 | self.model.fit(X,yc,nb_epoch=1000,verbose=0,validation_split=0.2,batch_size=32,callbacks=[self.earlyStopping,self.checkpoint]) 304 | else: 305 | ycvi=to_categorical(yvi) 306 | self.model.fit(X,yc,nb_epoch=1000,verbose=0,validation_data=[Xvi,ycvi], 307 | batch_size=32,callbacks=[self.earlyStopping,self.checkpoint]) 308 | self.model.load_weights('weights.hdf5') 309 | return self.model 310 | 311 | def predict(self,X): 312 | return self.predict_proba(X) 313 | 314 | def predict_proba(self,X): 315 | return self.model.predict(X) 316 | 317 | 318 | 319 | class XGB(object): 320 | def __init__(self,params,early_stop=50,verbose=100): 321 | self.params=params 322 | self.early_stop=early_stop 323 | self.verbose=verbose 324 | 325 | def fit(self,X,y,Xvi=None,yvi=None): 326 | if Xvi is None: 327 | ti,vi=list(StratifiedShuffleSplit(y,test_size=0.2,random_state=100,n_iter=1))[0] 328 | dtrain=xgb.DMatrix(X[ti],label=y[ti]) 329 | dvalid=xgb.DMatrix(X[vi],label=y[vi]) 330 | else: 331 | dtrain=xgb.DMatrix(X,label=y) 332 | dvalid=xgb.DMatrix(Xvi,label=yvi) 333 | watchlist=[(dtrain,'train'),(dvalid,'val')] 334 | self.model=xgb.train(self.params,dtrain,num_boost_round=2000,early_stopping_rounds=self.early_stop, 335 | evals=(watchlist),verbose_eval=self.verbose) 336 | return self.model 337 | 338 | def predict(self,X): 339 | return self.predict_proba(X) 340 | 341 | def predict_proba(self,X): 342 | return self.model.predict(xgb.DMatrix(X)) -------------------------------------------------------------------------------- /clean.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | rm data/models/* 3 | -------------------------------------------------------------------------------- /data/SMP 竞赛DUTIRTONE.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/liyumeng/SmpCup2016/e23611fb5a590b357ecc6451d87f81b8e2a62615/data/SMP 竞赛DUTIRTONE.pptx -------------------------------------------------------------------------------- /data/SMP_DUTIRTONE.pptx: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/liyumeng/SmpCup2016/e23611fb5a590b357ecc6451d87f81b8e2a62615/data/SMP_DUTIRTONE.pptx -------------------------------------------------------------------------------- /data/user_data/city_loca.dict: -------------------------------------------------------------------------------- 1 | 宁夏,西北 2 | 哈尔滨,东北 3 | 湘潭,华中 4 | 杭州,华东 5 | 庆阳,西北 6 | 
阜新,东北 7 | 海南,华南 8 | 阿坝,西南 9 | 韶关,华南 10 | 湖州,华东 11 | 潮州,华南 12 | 西安,西北 13 | 白银,西北 14 | 淮南,华东 15 | 和田,西北 16 | 益阳,华中 17 | 徐州,华东 18 | 黔南,西南 19 | 重庆,西南 20 | 池州,华东 21 | 锡林郭勒盟,华北 22 | 安阳,华中 23 | 海北,西北 24 | 阿勒泰,西北 25 | 玉林,华南 26 | 日喀则,西南 27 | 楚雄,西南 28 | 自贡,西南 29 | 贵阳,西南 30 | 西宁,西北 31 | 通辽,华北 32 | 伊春,东北 33 | 张家界,华中 34 | 荆门,华中 35 | 漳州,华东 36 | 阿拉善盟,华北 37 | 吉安,华东 38 | 承德,华北 39 | 玉树,西北 40 | 铜陵,华东 41 | 呼和浩特,华北 42 | 鄂州,华中 43 | 梅州,华南 44 | 南充,西南 45 | 包头,华北 46 | 山东,华东 47 | 湖北,华中 48 | 大庆,东北 49 | 北京,华北 50 | 朝阳,东北 51 | 惠州,华南 52 | 天门,华中 53 | 朔州,华北 54 | 龙岩,华东 55 | 南通,华东 56 | 绥化,东北 57 | 渭南,西北 58 | 上饶,华东 59 | 茂名,华南 60 | 滁州,华东 61 | 长春,东北 62 | 郴州,华中 63 | 成都,西南 64 | 三亚,华南 65 | 鸡西,东北 66 | 三门峡,华中 67 | 博州,西北 68 | 定西,西北 69 | 辽阳,东北 70 | 扬州,华东 71 | 乌海,华北 72 | 济源,华中 73 | 临沂,华东 74 | 德阳,西南 75 | 阜阳,华东 76 | 南宁,华南 77 | 盐城,华东 78 | 济宁,华东 79 | 南阳,华中 80 | 泉州,华东 81 | 大兴安岭,东北 82 | 乌兰察布,华北 83 | 宜昌,华中 84 | 钦州,华南 85 | 甘南,西北 86 | 武汉,华中 87 | 桂林,华南 88 | 宿迁,华东 89 | 兴安盟,华北 90 | 贵港,华南 91 | 廊坊,华北 92 | 神农架林区,华中 93 | 宣城,华东 94 | 江苏,华东 95 | 阳江,华南 96 | 揭阳,华南 97 | 常州,华东 98 | 呼伦贝尔,华北 99 | 广东,华南 100 | 乐山,西南 101 | 遵义,西南 102 | 浙江,华东 103 | 海口,华南 104 | 安庆,华东 105 | 阿克苏,西北 106 | 黄山,华东 107 | 大连,东北 108 | 三明,华东 109 | 延安,西北 110 | 长治,华北 111 | 山南,西南 112 | 大理,西南 113 | 巴彦淖尔,华北 114 | 运城,华北 115 | 葫芦岛,东北 116 | 天津,华北 117 | 白城,东北 118 | 甘肃,西北 119 | 延边,东北 120 | 东营,华东 121 | 汉中,西北 122 | 邯郸,华北 123 | 黄石,华中 124 | 铜川,西北 125 | 西藏,西南 126 | 南昌,华东 127 | 百色,华南 128 | 海西,西北 129 | 固原,西北 130 | 汕尾,华南 131 | 六盘水,西南 132 | 秦皇岛,华北 133 | 达州,西南 134 | 那曲,西南 135 | 衢州,华东 136 | 巴中,西南 137 | 陕西,西北 138 | 温州,华东 139 | 吐鲁番,西北 140 | 淄博,华东 141 | 淮北,华东 142 | 晋城,华北 143 | 安徽,华东 144 | 沧州,华北 145 | 合肥,华东 146 | 乌鲁木齐,西北 147 | 湖南,华中 148 | 金华,华东 149 | 克州,西北 150 | 榆林,西北 151 | 其它,华南 152 | 周口,华中 153 | 鞍山,东北 154 | 迪庆,西南 155 | 贺州,华南 156 | 丹东,东北 157 | 开封,华中 158 | 马鞍山,华东 159 | 宝鸡,西北 160 | 孝感,华中 161 | 铜仁,西南 162 | 中山,华南 163 | 肇庆,华南 164 | 林芝,西南 165 | 宿州,华东 166 | 苏州,华东 167 | 东莞,华南 168 | 绵阳,西南 169 | 莆田,华东 170 | 石嘴山,西北 171 | 景德镇,华东 172 | 衡水,华北 173 | 宁波,华东 174 | 临汾,华北 175 | 通化,东北 176 | 邢台,华北 177 | 鹤壁,华中 178 | 昭通,西南 179 | 咸阳,西北 180 | 张家口,华北 181 | 锦州,东北 182 | 广元,西南 183 | 松源,东北 184 | 商丘,华中 185 | 海东,西北 186 | 日照,华东 187 | 舟山,华东 188 | 枣庄,华东 189 | 四川,西南 190 | 河池,华南 191 | 济南,华东 192 | 营口,东北 193 | 甘孜,西南 194 | 泰州,华东 195 | 无锡,华东 196 | 宜宾,西南 197 | 深圳,华南 198 | 邵阳,华中 199 | 濮阳,华中 200 | 潜江,华中 201 | 保定,华北 202 | 汕头,华南 203 | 资阳,西南 204 | 衡阳,华中 205 | 黔西南,西南 206 | 威海,华东 207 | 盘锦,东北 208 | 娄底,华中 209 | 武威,西北 210 | 新疆,西北 211 | 喀什,西北 212 | 澳门,华南 213 | 防城港,华南 214 | 怀化,华中 215 | 酒泉,西北 216 | 吉林,东北 217 | 上海,华东 218 | 文山,西南 219 | 晋中,华北 220 | 丽水,华东 221 | 德宏,西南 222 | 太原,华北 223 | 台州,华东 224 | 临夏,西北 225 | 铁岭,东北 226 | 清远,华南 227 | 安康,西北 228 | 镇江,华东 229 | 昌都,西南 230 | 中卫,西北 231 | 齐齐哈尔,东北 232 | 河北,华北 233 | 怒江,西南 234 | 湛江,华南 235 | 眉山,西南 236 | 沈阳,东北 237 | 嘉峪关,西北 238 | 恩施,华中 239 | 广州,华南 240 | 洛阳,华中 241 | 遂宁,西南 242 | 潍坊,华东 243 | 亳州,华东 244 | 德州,华东 245 | 仙桃,华中 246 | 崇左,华南 247 | 毕节,西南 248 | 广安,西南 249 | 白山,东北 250 | 红河,西南 251 | 南平,华东 252 | 北海,华南 253 | 哈密,西北 254 | 山西,华北 255 | 贵州,西南 256 | 四平,东北 257 | 宜春,华东 258 | 昆明,西南 259 | 岳阳,华中 260 | 滨州,华东 261 | 莱芜,华东 262 | 云浮,华南 263 | 昌吉州,西北 264 | 忻州,华北 265 | 塔城,西北 266 | 克拉玛依,西北 267 | 随州,华中 268 | 永州,华中 269 | 荆州,华中 270 | 安顺,西南 271 | 长沙,华中 272 | 珠海,华南 273 | 郑州,华中 274 | 厦门,华东 275 | 蚌埠,华东 276 | 梧州,华南 277 | 台湾,华东 278 | 辽源,东北 279 | 凉山,西北 280 | 陇南,西北 281 | 株洲,华中 282 | 十堰,华中 283 | 黔东南,西南 284 | 临沧,西南 285 | 兰州,西北 286 | 石家庄,华北 287 | 焦作,华中 288 | 果洛,西北 289 | 双鸭山,东北 290 | 芜湖,华东 291 | 咸宁,华中 292 | 来宾,华南 293 | 黄南,西北 294 | 鄂尔多斯,华北 295 | 佛山,华南 296 | 江门,华南 297 | 本溪,东北 298 | 青海,西北 299 | 七台河,东北 300 | 抚顺,东北 301 | 鹤岗,东北 302 | 萍乡,华东 303 | 江西,华东 304 | 信阳,华中 305 | 九江,华东 306 | 
抚州,华东 307 | 普洱,西南 308 | 广西,华南 309 | 辽宁,东北 310 | 攀枝花,西南 311 | 嘉兴,华东 312 | 天水,西北 313 | 曲靖,西南 314 | 阳泉,华北 315 | 柳州,华南 316 | 常德,华中 317 | 金昌,西北 318 | 版纳,西南 319 | 南京,华东 320 | 香港,华南 321 | 鹰潭,华东 322 | 赣州,华东 323 | 保山,西南 324 | 丽江,西南 325 | 烟台,华东 326 | 牡丹江,东北 327 | 银川,西北 328 | 泸州,西南 329 | 大同,华北 330 | 玉溪,西南 331 | 连云港,华东 332 | 聊城,华东 333 | 淮安,华东 334 | 黑河,东北 335 | 襄樊,华中 336 | 许昌,华中 337 | 平凉,西北 338 | 六安,华东 339 | 漯河,华中 340 | 赤峰,华北 341 | 内蒙古,华北 342 | 吕梁,华北 343 | 雅安,西南 344 | 驻马店,华中 345 | 湘西,华中 346 | 张掖,西北 347 | 吴忠,西北 348 | 云南,西南 349 | 河南,华中 350 | 唐山,华北 351 | 商洛,西北 352 | 巴州,西北 353 | 河源,华南 354 | 绍兴,华东 355 | 宁德,华东 356 | 新乡,华中 357 | 菏泽,华东 358 | 黑龙江,东北 359 | 伊犁州直,西北 360 | 新余,华东 361 | 泰安,华东 362 | 黄冈,华中 363 | 青岛,华东 364 | 内江,西南 365 | 佳木斯,东北 366 | 拉萨,西南 367 | 福州,华东 368 | 巢湖,华东 369 | 福建,华东 370 | 平顶山,华中 -------------------------------------------------------------------------------- /data/user_data/city_prov.dict: -------------------------------------------------------------------------------- 1 | 台湾,台湾 2 | 香港,香港 3 | 澳门,澳门 4 | 上海,上海 5 | 红河,云南 6 | 迪庆,云南 7 | 文山,云南 8 | 普洱,云南 9 | 丽江,云南 10 | 版纳,云南 11 | 昭通,云南 12 | 昆明,云南 13 | 曲靖,云南 14 | 大理,云南 15 | 怒江,云南 16 | 临沧,云南 17 | 玉溪,云南 18 | 楚雄,云南 19 | 保山,云南 20 | 德宏,云南 21 | 阿拉善盟,内蒙古 22 | 通辽,内蒙古 23 | 赤峰,内蒙古 24 | 乌兰察布,内蒙古 25 | 乌海,内蒙古 26 | 呼和浩特,内蒙古 27 | 巴彦淖尔,内蒙古 28 | 包头,内蒙古 29 | 锡林郭勒盟,内蒙古 30 | 呼伦贝尔,内蒙古 31 | 鄂尔多斯,内蒙古 32 | 兴安盟,内蒙古 33 | 北京,北京 34 | 白山,吉林 35 | 松源,吉林 36 | 吉林,吉林 37 | 白城,吉林 38 | 延边,吉林 39 | 长春,吉林 40 | 辽源,吉林 41 | 通化,吉林 42 | 四平,吉林 43 | 攀枝花,四川 44 | 阿坝,四川 45 | 绵阳,四川 46 | 南充,四川 47 | 乐山,四川 48 | 泸州,四川 49 | 内江,四川 50 | 自贡,四川 51 | 德阳,四川 52 | 甘孜,四川 53 | 广安,四川 54 | 遂宁,四川 55 | 广元,四川 56 | 资阳,四川 57 | 成都,四川 58 | 达州,四川 59 | 雅安,四川 60 | 宜宾,四川 61 | 巴中,四川 62 | 眉山,四川 63 | 天津,天津 64 | 中卫,宁夏 65 | 固原,宁夏 66 | 银川,宁夏 67 | 吴忠,宁夏 68 | 石嘴山,宁夏 69 | 凉山,宁夏 70 | 六安,安徽 71 | 宿州,安徽 72 | 合肥,安徽 73 | 阜阳,安徽 74 | 黄山,安徽 75 | 池州,安徽 76 | 淮北,安徽 77 | 滁州,安徽 78 | 铜陵,安徽 79 | 马鞍山,安徽 80 | 亳州,安徽 81 | 宣城,安徽 82 | 巢湖,安徽 83 | 安庆,安徽 84 | 芜湖,安徽 85 | 蚌埠,安徽 86 | 淮南,安徽 87 | 莱芜,山东 88 | 聊城,山东 89 | 东营,山东 90 | 潍坊,山东 91 | 枣庄,山东 92 | 日照,山东 93 | 青岛,山东 94 | 烟台,山东 95 | 临沂,山东 96 | 威海,山东 97 | 济南,山东 98 | 滨州,山东 99 | 菏泽,山东 100 | 德州,山东 101 | 济宁,山东 102 | 淄博,山东 103 | 泰安,山东 104 | 晋中,山西 105 | 吕梁,山西 106 | 运城,山西 107 | 大同,山西 108 | 太原,山西 109 | 忻州,山西 110 | 长治,山西 111 | 临汾,山西 112 | 朔州,山西 113 | 晋城,山西 114 | 阳泉,山西 115 | 揭阳,广东 116 | 云浮,广东 117 | 东莞,广东 118 | 梅州,广东 119 | 阳江,广东 120 | 清远,广东 121 | 湛江,广东 122 | 惠州,广东 123 | 汕尾,广东 124 | 潮州,广东 125 | 汕头,广东 126 | 肇庆,广东 127 | 茂名,广东 128 | 河源,广东 129 | 广州,广东 130 | 韶关,广东 131 | 佛山,广东 132 | 中山,广东 133 | 江门,广东 134 | 珠海,广东 135 | 深圳,广东 136 | 贵港,广西 137 | 来宾,广西 138 | 北海,广西 139 | 玉林,广西 140 | 钦州,广西 141 | 百色,广西 142 | 河池,广西 143 | 崇左,广西 144 | 梧州,广西 145 | 桂林,广西 146 | 南宁,广西 147 | 贺州,广西 148 | 柳州,广西 149 | 防城港,广西 150 | 喀什,新疆 151 | 巴州,新疆 152 | 哈密,新疆 153 | 乌鲁木齐,新疆 154 | 吐鲁番,新疆 155 | 克拉玛依,新疆 156 | 昌吉州,新疆 157 | 克州,新疆 158 | 和田,新疆 159 | 阿勒泰,新疆 160 | 阿克苏,新疆 161 | 塔城,新疆 162 | 伊犁州直,新疆 163 | 博州,新疆 164 | 无锡,江苏 165 | 连云港,江苏 166 | 南京,江苏 167 | 淮安,江苏 168 | 泰州,江苏 169 | 徐州,江苏 170 | 南通,江苏 171 | 盐城,江苏 172 | 宿迁,江苏 173 | 常州,江苏 174 | 扬州,江苏 175 | 镇江,江苏 176 | 苏州,江苏 177 | 鹰潭,江西 178 | 南昌,江西 179 | 九江,江西 180 | 新余,江西 181 | 景德镇,江西 182 | 上饶,江西 183 | 赣州,江西 184 | 吉安,江西 185 | 抚州,江西 186 | 宜春,江西 187 | 萍乡,江西 188 | 承德,河北 189 | 沧州,河北 190 | 张家口,河北 191 | 唐山,河北 192 | 秦皇岛,河北 193 | 邢台,河北 194 | 廊坊,河北 195 | 邯郸,河北 196 | 石家庄,河北 197 | 衡水,河北 198 | 保定,河北 199 | 漯河,河南 200 | 商丘,河南 201 | 濮阳,河南 202 | 洛阳,河南 203 | 安阳,河南 204 | 许昌,河南 205 | 开封,河南 206 | 鹤壁,河南 207 | 南阳,河南 208 | 周口,河南 209 | 新乡,河南 210 | 驻马店,河南 211 | 平顶山,河南 212 | 三门峡,河南 213 | 焦作,河南 214 | 济源,河南 215 | 信阳,河南 216 | 郑州,河南 217 | 杭州,浙江 218 | 温州,浙江 219 | 金华,浙江 220 | 
宁波,浙江 221 | 舟山,浙江 222 | 绍兴,浙江 223 | 湖州,浙江 224 | 嘉兴,浙江 225 | 台州,浙江 226 | 衢州,浙江 227 | 丽水,浙江 228 | 海口,海南 229 | 其它,海南 230 | 三亚,海南 231 | 十堰,湖北 232 | 孝感,湖北 233 | 荆门,湖北 234 | 仙桃,湖北 235 | 襄樊,湖北 236 | 神农架林区,湖北 237 | 黄冈,湖北 238 | 咸宁,湖北 239 | 武汉,湖北 240 | 潜江,湖北 241 | 宜昌,湖北 242 | 荆州,湖北 243 | 鄂州,湖北 244 | 黄石,湖北 245 | 天门,湖北 246 | 随州,湖北 247 | 恩施,湖北 248 | 娄底,湖南 249 | 衡阳,湖南 250 | 常德,湖南 251 | 湘潭,湖南 252 | 益阳,湖南 253 | 张家界,湖南 254 | 株洲,湖南 255 | 湘西,湖南 256 | 长沙,湖南 257 | 永州,湖南 258 | 怀化,湖南 259 | 郴州,湖南 260 | 岳阳,湖南 261 | 邵阳,湖南 262 | 陇南,甘肃 263 | 嘉峪关,甘肃 264 | 庆阳,甘肃 265 | 张掖,甘肃 266 | 平凉,甘肃 267 | 定西,甘肃 268 | 酒泉,甘肃 269 | 兰州,甘肃 270 | 临夏,甘肃 271 | 武威,甘肃 272 | 白银,甘肃 273 | 甘南,甘肃 274 | 天水,甘肃 275 | 金昌,甘肃 276 | 漳州,福建 277 | 宁德,福建 278 | 南平,福建 279 | 泉州,福建 280 | 三明,福建 281 | 莆田,福建 282 | 福州,福建 283 | 厦门,福建 284 | 龙岩,福建 285 | 昌都,西藏 286 | 拉萨,西藏 287 | 阿里,西藏 288 | 那曲,西藏 289 | 山南,西藏 290 | 日喀则,西藏 291 | 林芝,西藏 292 | 六盘水,贵州 293 | 黔东南,贵州 294 | 铜仁,贵州 295 | 毕节,贵州 296 | 黔南,贵州 297 | 遵义,贵州 298 | 安顺,贵州 299 | 黔西南,贵州 300 | 贵阳,贵州 301 | 抚顺,辽宁 302 | 本溪,辽宁 303 | 大连,辽宁 304 | 沈阳,辽宁 305 | 盘锦,辽宁 306 | 营口,辽宁 307 | 葫芦岛,辽宁 308 | 辽阳,辽宁 309 | 阜新,辽宁 310 | 丹东,辽宁 311 | 铁岭,辽宁 312 | 朝阳,辽宁 313 | 鞍山,辽宁 314 | 锦州,辽宁 315 | 重庆,重庆 316 | 安康,陕西 317 | 咸阳,陕西 318 | 汉中,陕西 319 | 延安,陕西 320 | 铜川,陕西 321 | 商洛,陕西 322 | 宝鸡,陕西 323 | 渭南,陕西 324 | 西安,陕西 325 | 榆林,陕西 326 | 黄南,青海 327 | 玉树,青海 328 | 海东,青海 329 | 海北,青海 330 | 西宁,青海 331 | 海西,青海 332 | 果洛,青海 333 | 海南,青海 334 | 哈尔滨,黑龙江 335 | 七台河,黑龙江 336 | 黑河,黑龙江 337 | 大庆,黑龙江 338 | 大兴安岭,黑龙江 339 | 佳木斯,黑龙江 340 | 伊春,黑龙江 341 | 鹤岗,黑龙江 342 | 绥化,黑龙江 343 | 牡丹江,黑龙江 344 | 鸡西,黑龙江 345 | 双鸭山,黑龙江 346 | 齐齐哈尔,黑龙江 347 | 陕西,陕西 348 | 河北,河北 349 | 重庆,重庆 350 | 广西,广西 351 | 福建,福建 352 | 江西,江西 353 | 湖南,湖南 354 | 内蒙古,内蒙古 355 | 山西,山西 356 | 宁夏,宁夏 357 | 浙江,浙江 358 | 青海,青海 359 | 江苏,江苏 360 | 辽宁,辽宁 361 | 上海,上海 362 | 云南,云南 363 | 贵州,贵州 364 | 北京,北京 365 | 西藏,西藏 366 | 广东,广东 367 | 新疆,新疆 368 | 安徽,安徽 369 | 海南,海南 370 | 山东,山东 371 | 吉林,吉林 372 | 湖北,湖北 373 | 甘肃,甘肃 374 | 四川,四川 375 | 河南,河南 376 | 天津,天津 377 | 黑龙江,黑龙江 -------------------------------------------------------------------------------- /data/user_data/emoji.txt: -------------------------------------------------------------------------------- 1 | [泪] 2 | [哈哈] 3 | [偷笑] 4 | [嘻嘻] 5 | [爱你] 6 | [good] 7 | [星星] 8 | [心] 9 | [抓狂] 10 | [笑cry] 11 | [太开心] 12 | [怒] 13 | [鼓掌] 14 | [doge] 15 | [微笑] 16 | [礼物] 17 | [赞] 18 | [给力] 19 | [眼泪] 20 | [可爱] 21 | [威武] 22 | [许愿] 23 | [汗] 24 | [馋嘴] 25 | [话筒] 26 | [飞机] 27 | [害羞] 28 | [哼] 29 | [酷] 30 | [色] 31 | [发红包] 32 | [悲伤] 33 | [衰] 34 | [耶] 35 | [可怜] 36 | [晕] 37 | [兔子] 38 | [太阳] 39 | [喵喵] 40 | [蛋糕] 41 | [挖鼻] 42 | [得瑟] 43 | [拜拜] 44 | [花心] 45 | [笑哈哈] 46 | [生病] 47 | [围观] 48 | [偷乐] 49 | [蜡烛] 50 | [得意地笑] 51 | [呵呵] 52 | [奥特曼] 53 | [挖鼻屎] 54 | [亲亲] 55 | [噢耶] 56 | [祈祷] 57 | [怒骂] 58 | [思考] 59 | [吃惊] 60 | [微风] 61 | [鲜花] 62 | [爱心] 63 | [玫瑰] 64 | [睡觉] 65 | [钱] 66 | [强] 67 | [开心] 68 | [挤眼] 69 | [月亮] 70 | [傻笑] 71 | [阴险] 72 | [飞吻] 73 | [失望] 74 | [惊恐] 75 | [泪流满面] 76 | [委屈] 77 | [大哭] 78 | [打哈气] 79 | [好棒] 80 | [xkl转圈] 81 | [鞭炮] 82 | [黑线] 83 | [鄙视] 84 | [加油啊] 85 | [伤心] 86 | [强壮] 87 | [崩溃] 88 | [浮云] 89 | [嘘] 90 | [求关注] 91 | [困] 92 | [来] 93 | [弱] 94 | [赞啊] 95 | [闭眼] 96 | [呲牙] 97 | [吐] 98 | [猪头] 99 | [互粉] 100 | [拳头] 101 | [囧] 102 | [火焰] 103 | [草泥马] 104 | [花痴] 105 | [熊猫] 106 | [星光] 107 | [飞个吻] 108 | [转] 109 | [羞嗒嗒] 110 | [疑问] 111 | [下雨] 112 | [微博之星] 113 | [西瓜] 114 | [脸红] 115 | [鬼脸二] 116 | [向右] 117 | [音乐] 118 | [转发] 119 | [狗] 120 | [干杯] 121 | [礼花] 122 | [愤怒] 123 | [大笑] 124 | [向左] 125 | [顶] 126 | [好爱哦] 127 | [傻眼] 128 | [哭笑] 129 | [吐舌头] 130 | [前进] 131 | [抱抱] 132 | [哈欠] 133 | [闭嘴] 134 | [发红包啦] 135 | [神马] 136 | [打脸] 137 | [抢到啦] 138 | [个人求助] 139 | [拍照] 140 | 
[做鬼脸] 141 | [睡] 142 | [半星] 143 | [得意] 144 | [雪] 145 | [haha] 146 | [电话] 147 | [咖啡] 148 | [心形礼物] 149 | [钟] 150 | [玩去啦] 151 | [右哼哼] 152 | [不好意思] 153 | [幸运之星] 154 | [白眼] 155 | [坏笑] 156 | [bm可爱] 157 | [圣诞树] 158 | [樱花] 159 | [幽灵] 160 | [萌] 161 | [小嘴] 162 | [右上] 163 | [撒花] 164 | [作揖] 165 | [懒得理你] 166 | [瞌睡] 167 | [七夕快乐] 168 | [斜眼] 169 | [眨眼] 170 | [不开心] 171 | [震惊] 172 | [四叶草] 173 | [相爱] 174 | [巨汗] 175 | [调皮] 176 | [好喜欢] 177 | [愉快] 178 | [男孩儿] 179 | [胜利] 180 | [奋斗] 181 | [音乐盒] 182 | [推荐] 183 | [气球] 184 | [卡门] 185 | [女孩儿] 186 | [打哈欠] 187 | [感冒] 188 | [惊讶] 189 | [羊年大吉] 190 | [纠结] 191 | [悲催] 192 | [蝴蝶结] 193 | [雷锋] 194 | [电影] 195 | [抠鼻屎] 196 | [给劲] 197 | [淚] 198 | [难过] 199 | [tm] 200 | [喜] 201 | [闪电] 202 | [左哼哼] 203 | [雪人] 204 | [马到成功] 205 | [地点] 206 | [左上] 207 | [lt切克闹] 208 | [ok] 209 | [最右] 210 | [好囧] 211 | [圣诞老人] 212 | [快银] 213 | [微博有礼] 214 | [大红灯笼] 215 | [平装] 216 | [moc转发] 217 | [流泪] 218 | [紧张] 219 | [din推撞] 220 | [看涨] 221 | [口罩] 222 | [妖怪] 223 | [拥抱] 224 | [emoji] 225 | [绿叶] 226 | [小汽车] 227 | [看看] 228 | [沙尘暴] 229 | [ppb鼓掌] 230 | [困死了] 231 | [互相膜拜] 232 | [握手] 233 | [跑] 234 | [花朵] 235 | [绿丝带] 236 | [心碎] 237 | [足球] 238 | [围脖] 239 | [无聊] 240 | [帅] 241 | [爱心传递] 242 | [落叶] 243 | [好] 244 | [温暖帽子] 245 | [置顶] 246 | [书呆子] 247 | [鼻涕] 248 | [实惠] 249 | [hold住] 250 | [猴子] 251 | [喇叭] 252 | [水滴] 253 | [抵消] 254 | [张嘴] 255 | [红桃] 256 | [撇嘴] 257 | [白云] 258 | [奶牛] 259 | [汽车] 260 | [我的手机] 261 | [猫咪] 262 | [憨笑] 263 | [向日葵] 264 | [织] 265 | [自行车] 266 | [去旅行] 267 | [甩甩手] 268 | [被电] 269 | [星星眼] 270 | [爱心发光] 271 | [手套] 272 | [女孩] 273 | [躁狂症] 274 | [不要] 275 | [霹雳] 276 | [挖鼻孔] 277 | [甜馨得瑟] 278 | [咒骂] 279 | [直播ing] 280 | [大雄微笑] 281 | [啤酒] 282 | [风扇] 283 | [愛你] 284 | [勾引] 285 | [人身攻击] 286 | [裙子] 287 | [最差] 288 | -------------------------------------------------------------------------------- /data/user_data/enum_list.txt: -------------------------------------------------------------------------------- 1 | m,f 2 | -1979,1980-1989,1990+ 3 | 华北,华东,华南,西南,华中,东北,西北,境外,None 4 | -------------------------------------------------------------------------------- /data/user_data/keywords.txt: -------------------------------------------------------------------------------- 1 | 红包 2 | 分享 -------------------------------------------------------------------------------- /data/user_data/latitude.dict: -------------------------------------------------------------------------------- 1 | 陕西,108.8,34.8 2 | 河北,115.8,38.7 3 | 重庆,107.4,29.9 4 | 广西,108.9,23.7 5 | 福建,118.3,25.7 6 | 江西,115.8,27.8 7 | 湖南,111.9,27.6 8 | 内蒙古,114.4,42.7 9 | 山西,112.4,37.4 10 | 宁夏,106.2,37.6 11 | 浙江,120.5,29.4 12 | 青海,100.1,35.7 13 | 江苏,119.4,32.7 14 | 辽宁,122.7,41.2 15 | 上海,121.4,31.2 16 | 云南,101.7,25.1 17 | 贵州,107.0,26.8 18 | 北京,116.4,40.0 19 | 西藏,90.8,30.0 20 | 广东,113.6,23.1 21 | 新疆,83.7,42.6 22 | 安徽,117.4,31.9 23 | 海南,109.9,19.3 24 | 山东,118.1,36.3 25 | 吉林,126.0,43.4 26 | 湖北,112.9,30.8 27 | 甘肃,103.7,36.1 28 | 四川,104.0,30.1 29 | 河南,113.8,34.3 30 | 天津,117.3,39.2 31 | 黑龙江,128.4,46.8 32 | 香港,113,22.3 33 | 台湾,120,22.5 34 | 澳门,113.5,22.2 -------------------------------------------------------------------------------- /data/user_data/location.txt: -------------------------------------------------------------------------------- 1 | 东北:辽宁,吉林,黑龙江 2 | 华北:河北,山西,内蒙古,北京,天津 3 | 华东:山东,江苏,安徽,浙江,台湾,福建,江西,上海 4 | 华中:河南,湖北,湖南 5 | 华南:广东,广西,海南,香港,澳门 6 | 西南:云南,重庆,贵州,四川,西藏 7 | 西北:新疆,陕西,宁夏,青海,甘肃 8 | 境外:其他 -------------------------------------------------------------------------------- /data/user_data/short_prov.dict: -------------------------------------------------------------------------------- 1 | 
冀,华北 2 | 晋,华北 3 | 蒙,华北 4 | 沪,华东 5 | 浙,华东 6 | 皖,华东 7 | 闽,华东 8 | 赣,华东 9 | 豫,华中 10 | 鄂,华中 11 | 湘,华中 12 | 粤,华南 13 | 渝,西南 14 | 蜀,西南 15 | 黔,西南 16 | 滇,西南 17 | 陕,西北 18 | 陇,西北 -------------------------------------------------------------------------------- /data/user_data/stopwords.txt: -------------------------------------------------------------------------------- 1 | ~ 2 | . 3 | ! 4 | @ 5 | # 6 | $ 7 | % 8 | ^ 9 | & 10 | * 11 | ( 12 | ) 13 | ) 14 | [ 15 | ] 16 | ; 17 | ' 18 | : 19 | " 20 | . 21 | < 22 | > 23 | ? 24 | / 25 | \ 26 | ` 27 | * 28 | - 29 | + 30 | = 31 | _ 32 | ! 33 | @ 34 | # 35 | ¥ 36 | % 37 | …… 38 | & 39 | * 40 | ( 41 | ) 42 | —— 43 | ― 44 | ; 45 | ‘ 46 | : 47 | “ 48 | … 49 | ” 50 |   51 | 【 52 | 】 53 | Δ 54 | L 55 | 56 | 57 | Q 58 | N 59 | ? 60 | 、 61 | 。 62 | 》 63 | 《 64 | · 65 | . 66 | , 67 | , 68 | 69 | 1 70 | 2 71 | 3 72 | 4 73 | 5 74 | 6 75 | 7 76 | 8 77 | 9 78 | 0 79 | a 80 | b 81 | c 82 | d 83 | e 84 | f 85 | g 86 | h 87 | i 88 | j 89 | k 90 | l 91 | m 92 | n 93 | o 94 | p 95 | q 96 | r 97 | s 98 | t 99 | u 100 | v 101 | w 102 | x 103 | y 104 | z 105 | A 106 | B 107 | C 108 | D 109 | E 110 | F 111 | G 112 | H 113 | I 114 | J 115 | K 116 | L 117 | M 118 | N 119 | O 120 | P 121 | Q 122 | R 123 | S 124 | T 125 | U 126 | V 127 | W 128 | X 129 | Y 130 | Z 131 | 吗 132 | 一 133 | 的 134 | 了 135 | 啊 136 | 阿 137 | 是 138 | 我 139 | 用 140 | 也 141 | 就 142 | 买 143 | 有 144 | 在 145 | 个 146 | 大 147 | 去 148 | 这 149 | 麼 150 | 哎 151 | 唉 152 | 哦 153 | 呀 154 | 的 155 | 一 156 | 不 157 | 在 158 | 人 159 | 有 160 | 是 161 | 为 162 | 以 163 | 于 164 | 上 165 | 他 166 | 而 167 | 后 168 | 之 169 | 来 170 | 及 171 | 了 172 | 因 173 | 下 174 | 可 175 | 到 176 | 由 177 | 这 178 | 与 179 | 也 180 | 此 181 | 但 182 | 并 183 | 个 184 | 其 185 | 已 186 | 无 187 | 小 188 | 我 189 | 们 190 | 起 191 | 最 192 | 再 193 | 今 194 | 去 195 | 好 196 | 只 197 | 又 198 | 或 199 | 很 200 | 亦 201 | 某 202 | 把 203 | 那 204 | 你 205 | 乃 206 | 它 207 | 任何 208 | 连同 209 | 开外 210 | 再有 211 | 哪些 212 | 又及 213 | 当然 214 | 就是 215 | 遵照 216 | 以来 217 | 赖以 218 | 否则 219 | 此间 220 | 后者 221 | 按照 222 | 才是 223 | 自身 224 | 再则 225 | 就算 226 | 即便 227 | 有些 228 | 例如 229 | 它们 230 | 虽然 231 | 为此 232 | 以免 233 | 别处 234 | 我们 235 | 依据 236 | 趁着 237 | 就要 238 | 各位 239 | 别的 240 | 前者 241 | 不外乎 242 | 虽说 243 | 除此 244 | 个别 245 | 的话 246 | 甚而 247 | 那般 248 | 譬如 249 | 作为 250 | 谁人 251 | 进而 252 | 那边 253 | 首先 254 | 因此 255 | 果然 256 | 除非 257 | 以上 258 | 为何 259 | 要么 260 | 随时 261 | 诸如 262 | 还是 263 | 一旦 264 | 基于 265 | 本人 266 | 因而 267 | 继而 268 | 不单 269 | 此时 270 | 等等 271 | 截至 272 | 不但 273 | 故而 274 | 全体 275 | 从此 276 | 对于 277 | 朝着 278 | 怎样 279 | 以为 280 | 那儿 281 | 或是 282 | 本身 283 | 况且 284 | 处在 285 | 吧 286 | 那个 287 | 被 288 | 诸位 289 | 从而 290 | 比 291 | 各自 292 | 针对 293 | 此外 294 | 何处 295 | 为了 296 | 这般 297 | 别 298 | 仍旧 299 | 既然 300 | 反而 301 | 关于 302 | 较之 303 | 不管 304 | 趁 305 | 彼时 306 | 这边 307 | 不光 308 | 宁可 309 | 要是 310 | 其他 311 | 其它 312 | 由于 313 | 还要 314 | 经过 315 | 不过 316 | 来说 317 | 当 318 | 从 319 | 除了 320 | 既是 321 | 的确 322 | 得 323 | 说来 324 | 打 325 | 据此 326 | 只限于 327 | 还有 328 | 只怕 329 | 不尽 330 | 多会 331 | 正巧 332 | 凡 333 | 以至 334 | 以致 335 | 某个 336 | 与否 337 | 凭借 338 | 儿 339 | 不仅 340 | 尔 341 | 两者 342 | 该 343 | 另外 344 | 一来 345 | 正如 346 | 那里 347 | 不尽然 348 | 毋宁 349 | 这儿 350 | 嘿嘿 351 | 就是说 352 | 正是 353 | 既往 354 | 随着 355 | 于是 356 | 各 357 | 给 358 | 跟 359 | 那么 360 | 而后 361 | 和 362 | 何 363 | 似的 364 | 不料 365 | 其余 366 | 或者 367 | 介于 368 | 别人 369 | 还 370 | 这个 371 | 受到 372 | 只是 373 | 即使 374 | 即 375 | 几 376 | 不论 377 | 本着 378 | 既 379 | 及至 380 | 加以 381 | 多么 382 | 其中 383 | 别说 384 | 这会 385 | 依照 386 | 人们 387 | 如此 388 | 个人 389 | 出来 390 | 看 391 | 另一方面 392 | 唯有 
393 | 据 394 | 距 395 | 靠 396 | 接着 397 | 何况 398 | 啦 399 | 加之 400 | 至今 401 | 凡是 402 | 他们 403 | 一切 404 | 那时 405 | 只限 406 | 不然 407 | 许多 408 | 在于 409 | 某某 410 | 除外 411 | 来自 412 | 便于 413 | 同时 414 | 只消 415 | 只需 416 | 不如 417 | 只要 418 | 另 419 | 并不 420 | 不仅仅 421 | 这里 422 | 么 423 | 总之 424 | 因为 425 | 每 426 | 固然 427 | 不是 428 | 嘛 429 | 或者说 430 | 然而 431 | 假如 432 | 如何 433 | 这么 434 | 可见 435 | 如果 436 | 拿 437 | 简言之 438 | 多少 439 | 哪 440 | 光是 441 | 非但 442 | 呵呵 443 | 只有 444 | 只因 445 | 连带 446 | 正值 447 | 沿着 448 | 哪儿 449 | 他人 450 | 若非 451 | 怎么办 452 | 她们 453 | 您 454 | 凭 455 | 而且 456 | 与其 457 | 如同下 458 | 有的 459 | 那些 460 | 甚至 461 | 为止 462 | 无论 463 | 鉴于 464 | 嘻嘻 465 | 哪个 466 | 然后 467 | 直到 468 | 且 469 | 却 470 | 并非 471 | 对比 472 | 为着 473 | 一些 474 | 让 475 | 何时 476 | 仍 477 | 啥 478 | 而是 479 | 自从 480 | 比如 481 | 之所以 482 | 如 483 | 你们 484 | 若 485 | 使 486 | 那样 487 | 所以 488 | 得了 489 | 谁 490 | 当地 491 | 有关 492 | 所有 493 | 因之 494 | 用来 495 | 虽 496 | 随 497 | 所在 498 | 同 499 | 对待 500 | 而外 501 | 分别 502 | 所 503 | 她 504 | 某些 505 | 对方 506 | 哇 507 | 嗡 508 | 往 509 | 不只 510 | 但是 511 | 全部 512 | 尽管 513 | 些 514 | 大家 515 | 以便 516 | 自己 517 | 可是 518 | 反之 519 | 这些 520 | 向 521 | 什么 522 | 由此 523 | 万一 524 | 而已 525 | 何以 526 | 咱们 527 | 沿 528 | 值此 529 | 向着 530 | 哪怕 531 | 倘若 532 | 出于 533 | 哟 534 | 如上 535 | 如若 536 | 替代 537 | 用 538 | 什么样 539 | 如是 540 | 照着 541 | 此处 542 | 这样 543 | 每当 544 | 咱 545 | 此次 546 | 至于 547 | 则 548 | 怎 549 | 曾 550 | 至 551 | 致 552 | 此地 553 | 要不然 554 | 逐步 555 | 格里斯 556 | 本地 557 | 着 558 | 诸 559 | 要不 560 | 自 561 | 其次 562 | 尽管如此 563 | 遵循 564 | 乃至 565 | 若是 566 | 并且 567 | 如下 568 | 可以 569 | 才能 570 | 以及 571 | 彼此 572 | 根据 573 | 随后 574 | 有时 575 | 个人感觉 -------------------------------------------------------------------------------- /main.py: -------------------------------------------------------------------------------- 1 | import os 2 | from base.feipeng.feature_engineering import features 3 | from base.feipeng.model import age, gender 4 | import pandas as pd 5 | 6 | # 输入文件路径 7 | inpaths = { 8 | 'train_info' : 'data/raw_data/train/train_info.txt', 9 | 'train_labels' : 'data/raw_data/train/train_labels.txt', 10 | 'train_links' : 'data/raw_data/train/train_links.txt', 11 | 'train_status' : 'data/raw_data/train/train_status.txt', 12 | 'new_labels' : 'data/raw_data/train/new_labels.txt', 13 | 14 | 'test_info' : 'data/raw_data/valid/valid_info.txt', 15 | 'test_nolabels' : 'data/raw_data/valid/valid_nolabel.txt', 16 | 'test_links' : 'data/raw_data/valid/valid_links.txt', 17 | 'test_status' : 'data/raw_data/valid/valid_status.txt', 18 | 19 | 'stopwords' : 'data/user_data/stopwords.txt' 20 | } 21 | 22 | # 输出文件路径 23 | outpaths = { 24 | # 中间文件 25 | 'status_file' : 'data/feature_data/fp.status.txt', 26 | 'text_file' : 'data/feature_data/fp.text.txt', 27 | # 最终文件 28 | 'data_file' : 'data/feature_data/fp.data_x.pkl', 29 | 'features_file' : 'data/feature_data/fp.features.pkl' 30 | } 31 | 32 | # stack特征输出文件 33 | stack_age = 'data/models/fp.age.feature' 34 | stack_gender = 'data/models/fp.gender.feature' 35 | 36 | # 预测结果 37 | result_age = 'submission/age.csv' 38 | result_gender = 'submission/gender.csv' 39 | 40 | 41 | 42 | 43 | #################################################### 44 | ############### ############################# 45 | ############## ################################ 46 | ############## ############################## 47 | #################################################### 48 | 49 | 50 | def preprocess(): 51 | print('##################特征工程#####################') 52 | if os.path.isfile(outpaths['features_file']): 53 | print('yo 特征已有!') 54 | else: 55 | 
print('重新构建特征') 56 | myfeatures = features(inpaths, outpaths) 57 | myfeatures.build() 58 | 59 | def predict_age(lee_features): 60 | print('################## age #######################') 61 | model_age = age(outpaths['data_file'], outpaths['features_file'],\ 62 | stack_age, inpaths['new_labels']) 63 | if os.path.isfile(stack_age): 64 | print('stacked~ 开始训练第二级模型') 65 | else: 66 | print('stacking') 67 | model_age.stacking() 68 | model_age.concat_features(lee_features) 69 | model_age.fit_transform(result_age) 70 | 71 | def predict_gender(): 72 | print('################## gender #######################') 73 | model_gender = gender(outpaths['data_file'], outpaths['features_file'],\ 74 | stack_gender, inpaths['new_labels']) 75 | if os.path.isfile(stack_gender): 76 | print('stacked~ 开始训练第二级模型') 77 | else: 78 | print('stacking') 79 | model_gender.stacking() 80 | model_gender.fit_transform(result_gender) 81 | 82 | 83 | 84 | if __name__ == '__main__': 85 | ############################################### 86 | ################## 87 | # step1: 预处理 88 | preprocess() 89 | 90 | ############################################### 91 | ################## 92 | # step2: age 93 | lee_features = 'data/models/yuml.age.feature' # from lym 94 | predict_age(lee_features) 95 | 96 | ############################################### 97 | ################## 98 | # step3: gender 99 | predict_gender() 100 | 101 | # step4: merge 102 | print('merge result') 103 | result = pd.read_csv('submission/temp.csv') 104 | age_file = pd.read_csv(result_age) 105 | gender_file = pd.read_csv(result_gender) 106 | result.age = age_file.age 107 | result.gender = gender_file.gender 108 | result.to_csv('submission/final.csv', index=0) 109 | 110 | print('finish!') -------------------------------------------------------------------------------- /others/banner.jpg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/liyumeng/SmpCup2016/e23611fb5a590b357ecc6451d87f81b8e2a62615/others/banner.jpg -------------------------------------------------------------------------------- /process_data.py: -------------------------------------------------------------------------------- 1 | ''' 2 | 处理原始文本,输出为特征文件 3 | features.v1.pkl 4 | features.v2.pkl 5 | author:yuml 6 | 5052909@qq.com 7 | ''' 8 | import os,sys 9 | sys.path.append(os.path.abspath('../')) 10 | #检查所需的python包 11 | import theano 12 | import jieba 13 | import keras 14 | import xgboost 15 | import gensim 16 | import numpy as np 17 | #------------------- 18 | import random 19 | import pickle 20 | import os 21 | import datetime 22 | import re 23 | from base.utils import remove_duplicate 24 | from base.dataset import feature_path,smp_path 25 | 26 | #原始数据 27 | train_labels=smp_path+r'/raw_data/train/train_labels.txt' 28 | train_status=smp_path+'/raw_data/train/train_status.txt' 29 | test_nolabels=smp_path+r'/raw_data/valid/valid_nolabel.txt' 30 | test_status=smp_path+'/raw_data/valid/valid_status.txt' 31 | #城市映射 32 | location=smp_path+r'/user_data/location.txt' 33 | city_loca=smp_path+r'/user_data/city_loca.dict' 34 | short_prov=smp_path+r'/user_data/short_prov.dict' 35 | city_prov=smp_path+r'/user_data/city_prov.dict' 36 | prov_lat=smp_path+r'/user_data/latitude.dict' 37 | 38 | reg_topic=re.compile('# (.*?) 
#') 39 | reg_at=re.compile('@[\S]+') 40 | 41 | train_ids=[] #训练集的id顺序 42 | test_ids=[] #测试集的id顺序 43 | 44 | ''' 45 | 载入地区映射表和城市映射表 46 | ''' 47 | def load_location(): 48 | dict={} 49 | with open(location,encoding='utf8') as f: 50 | for line in f: 51 | tmp=line.strip().split(':') 52 | for item in tmp[1].split(','): 53 | dict[item]=tmp[0] 54 | return dict 55 | def load_dict(filename): 56 | city_dict={} 57 | with open(filename,encoding='utf8') as f: 58 | for line in f: 59 | items=line.strip().split(',') 60 | city_dict[items[0]]=items[1] 61 | return city_dict 62 | 63 | loca_dict=load_location() 64 | loca_map_dict=load_dict(city_loca) 65 | loca2_map_dict=load_dict(short_prov) #省份缩写 66 | prov_map_dict=load_dict(city_prov) 67 | lat_map_dict={} 68 | 69 | reg_city='|'.join(loca_map_dict.keys()) 70 | reg_short_prov='|'.join(loca2_map_dict.keys()) 71 | 72 | '''载入经纬度''' 73 | with open(prov_lat,encoding='utf8') as f: 74 | for line in f: 75 | items=line.strip().split(',') 76 | lat_map_dict[items[0]]=np.array([items[1],items[2]]) 77 | 78 | ''' 79 | 用户 80 | ''' 81 | class UserInfo(object): 82 | def __init__(self): 83 | self.source=[] 84 | self.times=[] 85 | self.weeks=[] 86 | self.hours=[] 87 | self.reviewCnt=[] 88 | self.forwardCnt=[] 89 | self.content=[] 90 | self.topics=[] 91 | self.citys=[] 92 | self.loca_maps=[] 93 | self.prov_maps=[] 94 | self.lat_maps=[] 95 | self.loca2_maps=[] #映射地名简写 96 | self.aver_word_cnt=0 97 | self.aver_length=0 98 | self.at=[] 99 | self.id='' 100 | self.gender='' 101 | self.location='' 102 | self.age='' 103 | self.prov='' 104 | self.timeImage=np.zeros(7*24) 105 | self.timeDict={} #过滤同一时间的微博 106 | self.stimes=[] #发微博的标准时间 107 | pass 108 | 109 | def parse(self,line): 110 | items=line.split('||') 111 | self.id=items[0] 112 | if items[1]=='m': 113 | self.gender=1 114 | elif items[1]=='f': 115 | self.gender=0 116 | else: 117 | self.gender=-999 118 | year=int(items[2]) 119 | if year<1980: 120 | self.age=0 121 | elif year>1989: 122 | self.age=2 123 | else: 124 | self.age=1 125 | prov=items[3].strip().split(' ')[0] 126 | self.prov=prov 127 | if prov=='None': 128 | self.location='None' 129 | else: 130 | self.location=loca_dict.get(prov,'境外') 131 | 132 | ''' 133 | 载入用户的label值 134 | ''' 135 | def load_train_dict(): 136 | user_dict={} 137 | train_ids.clear() 138 | with open(train_labels,encoding='utf8') as f: 139 | for line in f: 140 | u=UserInfo() 141 | u.parse(line) 142 | user_dict[u.id]=u 143 | train_ids.append(u.id) 144 | return user_dict 145 | 146 | 147 | 148 | #将同一个人的微博拼接在一起 149 | 150 | def scan_status(filename): 151 | with open(filename,encoding='utf8') as f: 152 | for line in f: 153 | items=line.strip().split(',') 154 | items[5]=','.join(items[5:]) 155 | yield items[:6] 156 | 157 | def get_week_hour(time_string): 158 | reg_time1=re.compile('^\d{4}-\d{2}-\d{2} \d{2}:') 159 | reg_time2=re.compile('^今天 (\d{2}):\d{2}') 160 | reg_time3=re.compile('^\d{1,2}分钟前$') 161 | 162 | ts=reg_time1.match(time_string) 163 | if ts!=None: 164 | t=datetime.datetime.strptime(ts.group(), "%Y-%m-%d %H:") 165 | else: 166 | ts=reg_time2.match(time_string) #今天 00:00 167 | t=datetime.datetime.strptime("2016-06-28 0:", "%Y-%m-%d %H:") 168 | if ts!=None: 169 | return 1,int(ts.groups()[0]),t 170 | else: 171 | if reg_time3.match(time_string)!=None: #3分钟前 默认是2016年6月28日19点,星期二 172 | return 1,19,t 173 | else: 174 | print(time_string) 175 | return -1,-1 176 | return t.weekday(),t.hour,t 177 | 178 | #将星期,小时转成向量形式 179 | def cnt_to_vector(cnts,dim): 180 | v=[0 for _ in range(dim)] 181 | for c in cnts: 182 | if c>-1: 
183 | v[c]+=1 184 | return v 185 | 186 | ''' 187 | 表情特征 188 | ''' 189 | import re 190 | reg_emoji=re.compile('\[.{1,8}?\]') 191 | 192 | with open(smp_path+'/user_data/emoji.txt',encoding='utf8') as f: 193 | emoji_set=set([item.strip() for item in f.readlines()]) 194 | 195 | def get_emoji(sentence): 196 | emojis=[item.replace(' ','') for item in reg_emoji.findall(sentence)] 197 | emojis=[item for item in emojis if item in emoji_set] 198 | return ' '.join(emojis) 199 | 200 | 201 | #拓展用户特征数量 202 | def extend_users(user_dict,status_file): 203 | for items in scan_status(status_file): 204 | if(len(items)>6): 205 | print(len(items)) 206 | if items[0] not in user_dict: 207 | continue 208 | user=user_dict[items[0]] 209 | user.source.append(items[3]) 210 | user.times.append(items[4]) 211 | user.reviewCnt.append(int(items[1])) 212 | user.forwardCnt.append(int(items[2])) 213 | ts=get_week_hour(items[4]) 214 | user.weeks.append(ts[0]) 215 | user.hours.append(ts[1]) 216 | user.stimes.append(ts[2]) #发微博的时间 217 | user.content.append(items[5]) 218 | user.topics.extend(reg_topic.findall(items[5])) 219 | user.at.extend(reg_at.findall(items[5])) 220 | user.emoji=get_emoji(items[5]) 221 | if ts[2] not in user.timeDict: 222 | user.timeImage[ts[0]*24+ts[1]]+=1 223 | user.timeDict[ts[2]]=1 224 | 225 | for key in user_dict: 226 | u=user_dict[key] 227 | u.week_vec=cnt_to_vector(u.weeks,7) 228 | u.hour_vec=cnt_to_vector(u.hours,24) 229 | words='\n'.join(u.content).split() 230 | u.aver_word_cnt=len(words)/len(u.content) 231 | u.aver_length=len('\n'.join(u.content))/len(u.content) 232 | 233 | for w in re.findall(reg_city,'\n'.join(u.content)): 234 | '''如果出现在地域字典里''' 235 | p=prov_map_dict[w] 236 | u.citys.append(w) 237 | u.loca_maps.append(loca_map_dict[w]) 238 | u.prov_maps.append(p) 239 | u.lat_maps.append(lat_map_dict[p]) 240 | 241 | for w in re.findall(reg_short_prov,'\n'.join(u.content)): 242 | '''如果出现在地名简写字典中''' 243 | u.loca2_maps.append(loca2_map_dict[w]) 244 | 245 | def get_data_by_dict(user_dict,ids): 246 | data=[] #train_data or test_data 247 | for id in ids: 248 | u=user_dict[id] 249 | #id, content, gender, age, location, topic, review, forward, source, week, hour, 250 | #aver_word_cnt, aver_length, citys,city_maps 251 | #emoji, timeImage, times 252 | data.append([u.id, '\n'.join(u.content), u.gender, u.age, u.location, u.prov, 253 | '\n'.join(u.topics), u.reviewCnt, u.forwardCnt, 254 | '\n'.join(u.source),'\n'.join(u.at),u.week_vec, 255 | u.hour_vec, u.aver_word_cnt, u.aver_length, 256 | '\n'.join(u.citys),'\n'.join(u.loca_maps),u.emoji,u.timeImage,u.stimes, 257 | '\n'.join(u.prov_maps),np.array(u.lat_maps).astype('float'),'\n'.join(u.loca2_maps)]) 258 | return data 259 | 260 | def get_test_data(test_nolabels,test_status): 261 | test_user_dict={} 262 | test_ids.clear() 263 | with open(test_nolabels,encoding='utf8') as f: 264 | for line in f: 265 | u=UserInfo() 266 | u.id=line.strip() 267 | test_user_dict[u.id]=u 268 | test_ids.append(u.id) 269 | extend_users(test_user_dict,test_status) 270 | test_data=get_data_by_dict(test_user_dict,test_ids) 271 | return test_data 272 | 273 | #----------------------保存数据到features.v1.pkl---------------------------------------------------------- 274 | 275 | ''' 276 | 持久化数据 277 | 训练集:3200个 278 | 测试集:980个 279 | 输出到/data/feature_data/features.v1.pkl中 280 | ''' 281 | 282 | user_dict=load_train_dict() 283 | extend_users(user_dict,train_status) 284 | print('user:%d'%len(user_dict)) 285 | train_data=get_data_by_dict(user_dict,train_ids) 286 | 
test_data=get_test_data(test_nolabels,test_status) 287 | print('train_data:%d, test_data:%d'%(len(train_data),len(test_data))) 288 | 289 | abs_path=os.path.abspath(feature_path) 290 | if os.path.exists(abs_path)==False: 291 | os.mkdir(abs_path) 292 | 293 | pickle.dump([train_data,test_data],open(feature_path+'/features.v1.pkl','wb')) 294 | print('train_data:%d, test_data:%d'%(len(train_data),len(test_data))) 295 | print('数据已保存到:%s/features.v1.pkl'%feature_path) 296 | print('...') 297 | #-------------------------继续生成features.v2.pkl----------------------------------------------------------- 298 | import pickle 299 | import numpy as np 300 | from base.dataset import smp_path,feature_path,load_v1,load_links_dict 301 | from base.utils import cal_similar,get,load_keywords,remove_duplicate,get_word_vectors 302 | train_data,test_data=load_v1() 303 | 304 | from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer 305 | '''获得计数类文本特征''' 306 | def get_f_cnter(xs,min_df=5): 307 | cnter=CountVectorizer(min_df=min_df) 308 | cnter.fit(xs[0]+xs[1]) 309 | f_xs=[cnter.transform(x) for x in xs] 310 | return f_xs 311 | 312 | '''将整数转换占比例形式''' 313 | def to_rate(a): 314 | b=np.sum(a,axis=1) 315 | return a/np.outer(b,np.ones(a.shape[1])) 316 | 317 | '''将日期转成int型特征''' 318 | def date_to_int(d): 319 | return (d.year-2009)*12+d.month-10 320 | 321 | #-------------------1. 文本特征------------------------------------- 322 | ''' 323 | tfidf类特征 324 | ''' 325 | 326 | from sklearn.decomposition import TruncatedSVD 327 | from scipy import sparse 328 | 329 | content=get(train_data,1),get(test_data,1) 330 | xs=content 331 | 332 | '''词tfidf特征''' 333 | tfidf=TfidfVectorizer(min_df=5,ngram_range=(1,2)) 334 | tfidf.fit(xs[0]) 335 | f_word=[tfidf.transform(x) for x in xs] 336 | print('f_word',f_word[0].shape) 337 | 338 | '''字tfidf特征''' 339 | tfidf2=TfidfVectorizer(min_df=5,ngram_range=(1,2),analyzer='char') 340 | tfidf2.fit(xs[0]) 341 | f_letter=[tfidf2.transform(x) for x in xs] 342 | print('f_letter',f_letter[0].shape) 343 | 344 | '''话题词特征''' 345 | theme=get(train_data,6),get(test_data,6) 346 | xs=theme 347 | tfidf3=TfidfVectorizer(min_df=5,ngram_range=(1,2),analyzer='char') 348 | tfidf3.fit(xs[0]+xs[1]) 349 | f_theme_word=[tfidf3.transform(x) for x in xs] 350 | 351 | '''话题特征''' 352 | nospace=[[item.replace(' ','') for item in t] for t in theme] 353 | f_theme=get_f_cnter(nospace) 354 | 355 | '''话题字特征''' 356 | xs=nospace 357 | tfidf3=TfidfVectorizer(min_df=5,ngram_range=(1,2),analyzer='char') 358 | tfidf3.fit(xs[0]+xs[1]) 359 | f_theme_letter=[tfidf3.transform(x) for x in xs] 360 | 361 | '''表情特征''' 362 | emoji=get(train_data,17),get(test_data,17) 363 | f_emoji=get_f_cnter(emoji) 364 | 365 | '''分享源''' 366 | shared=get(train_data,10),get(test_data,10) 367 | f_shared=get_f_cnter(shared) 368 | 369 | '''信息来源''' 370 | source=get(train_data,9),get(test_data,9) 371 | f_source=get_f_cnter(source) 372 | 373 | '''关键词表''' 374 | xs=content 375 | keywords=load_keywords() 376 | cnter=CountVectorizer(binary=True,vocabulary=keywords) 377 | cnter.fit(xs[0]+xs[1]) 378 | f_keywords=[cnter.transform(x) for x in xs] 379 | 380 | 381 | '''svd降维''' 382 | svd=TruncatedSVD(n_components=500) 383 | svd.fit(f_word[0]) 384 | f_word_pca=svd.transform(f_word[0]),svd.transform(f_word[1]) 385 | 386 | svd=TruncatedSVD(n_components=500) 387 | svd.fit(f_letter[0]) 388 | f_letter_pca=svd.transform(f_letter[0]),svd.transform(f_letter[1]) 389 | 390 | '''词tfidf特征 ngram1''' 391 | tfidf4=TfidfVectorizer(min_df=3,ngram_range=(1,1)) 392 | tfidf4.fit(xs[0]) 393 | 
f_word_n1=[tfidf4.transform(x) for x in xs] 394 | print('f_word_n1',f_word_n1[0].shape) 395 | 396 | '''字tfidf特征 ngram1''' 397 | tfidf5=TfidfVectorizer(min_df=3,ngram_range=(1,1),analyzer='char') 398 | tfidf5.fit(xs[0]) 399 | f_letter_n1=[tfidf5.transform(x) for x in xs] 400 | print('f_letter_n1',f_letter_n1[0].shape) 401 | 402 | '''source tfidf 特征''' 403 | source=get(train_data,9),get(test_data,9) 404 | tfidf6=TfidfVectorizer(min_df=3,ngram_range=(1,1)) 405 | tfidf6.fit(source[0]) 406 | f_source_tfidf=[tfidf6.transform(x) for x in source] 407 | 408 | f_text=[f_word,f_letter,f_word_pca,f_letter_pca,f_theme,f_theme_word,f_theme_letter, 409 | f_emoji,f_shared,f_source,f_keywords,f_word_n1,f_letter_n1,f_source_tfidf] 410 | 411 | #-------------------2. 统计特征------------------------------------------------------- 412 | import numpy as np 413 | '''统计类特征''' 414 | 415 | '''粉丝数''' 416 | links_dict=load_links_dict() 417 | 418 | def aver(items): 419 | if len(items)==0: 420 | return 0 421 | return np.average(items) 422 | 423 | def get_f_statistic(user_data): 424 | fs=[] 425 | for item in user_data: 426 | sens=item[1].split('\n') 427 | '''去除与自己重复的''' 428 | norSens=remove_duplicate(sens) 429 | 430 | '''去除分享后''' 431 | noShared=[] 432 | for source,sen in zip(item[9],sens): 433 | if source.find('分享') or sen.find('分享'): 434 | continue 435 | noShared.append(sen) 436 | 437 | cnt=len(sens) 438 | norCnt=len(norSens) 439 | nosCnt=len(noShared) 440 | rrate=(cnt-norCnt)/cnt 441 | 442 | review=np.array(item[7]) 443 | forward=np.array(item[8]) 444 | 445 | reviewCnt=np.sum(review) 446 | forwardCnt=np.sum(forward) 447 | 448 | averReview=aver(review) 449 | averForward=aver(forward) 450 | 451 | goodCnt=np.sum((review>4)+(forward>4)) 452 | goodRate=1.0*goodCnt/cnt 453 | 454 | averLetter=aver([len(sen) for sen in sens]) 455 | averWord=aver([len(sen.split()) for sen in sens]) 456 | 457 | norAverLetter=aver([len(sen) for sen in norSens]) 458 | norAverWord=aver([len(sen.split()) for sen in norSens]) 459 | 460 | nosAverLetter=aver([len(sen) for sen in noShared]) 461 | nosAverWord=aver([len(sen.split()) for sen in noShared]) 462 | 463 | linksCnt=links_dict.get(item[0],0) 464 | 465 | fs.append([cnt,norCnt,nosCnt,rrate,reviewCnt,forwardCnt, 466 | averReview,averForward,goodCnt,goodRate, 467 | averLetter,averWord,norAverLetter,norAverWord, 468 | nosAverLetter,nosAverWord,linksCnt]) 469 | 470 | return np.array(fs) 471 | 472 | 473 | f_stat=get_f_statistic(train_data),get_f_statistic(test_data) 474 | 475 | #---------------------3. 
时间特征------------------------------------------------ 476 | from collections import Counter 477 | import datetime 478 | '''时间类特征''' 479 | def get_f_times_by_user(user_item): 480 | times=sorted(user_item[19]) 481 | '''连续活跃小时数特征''' 482 | hourCnt=1 483 | longHourCnt=0 484 | for i in range(1,len(times)): 485 | if (times[i]-times[i-1]).seconds<3600*3: 486 | hourCnt+=1 487 | if hourCnt>longHourCnt: 488 | longHourCnt=hourCnt 489 | else: 490 | hourCnt=1 491 | 492 | '''连续活跃天数特征''' 493 | dayCnt=1 494 | longDayCnt=0 495 | for i in range(1,len(times)): 496 | if (times[i]-times[i-1]).days<2: 497 | if times[i].day-times[i-1].day==1: 498 | dayCnt+=1 499 | if dayCnt>longDayCnt: 500 | longDayCnt=dayCnt 501 | else: 502 | dayCnt=1 503 | 504 | hourDict=Counter([t.strftime('%Y%m%d%H') for t in times]) 505 | dayDict=Counter([t.strftime('%Y%m%d') for t in times]) 506 | monthDict=Counter([t.strftime('%Y%m') for t in times]) 507 | 508 | norHourCnt=len(hourDict) 509 | norDayCnt=len(dayDict) 510 | norMonthCnt=len(monthDict) 511 | 512 | maxHours=np.max(list(hourDict.values())) 513 | maxDays=np.max(list(dayDict.values())) 514 | 515 | ftCnt=np.array(user_item[18]) 516 | ftNormCnt=ftCnt/np.sum(ftCnt) 517 | fs=np.array([longHourCnt,longDayCnt,norHourCnt, 518 | norDayCnt,norMonthCnt,maxHours,maxDays]) 519 | return np.concatenate((ftCnt,ftNormCnt,fs)) 520 | 521 | def get_f_times(user_data): 522 | fs=[] 523 | for user_item in user_data: 524 | f=get_f_times_by_user(user_item) 525 | f_week=np.array(user_item[11]) 526 | f_hour=np.array(user_item[12]) 527 | 528 | f_norm_week=f_week/np.sum(f_week) 529 | f_norm_hour=f_hour/np.sum(f_hour) 530 | 531 | 532 | fs.append(np.concatenate((f_week,f_norm_week,f_hour,f_norm_hour,f))) 533 | return np.array(fs) 534 | 535 | f_times=get_f_times(train_data),get_f_times(test_data) 536 | 537 | '''统计每月发微博数量的分布''' 538 | def get_time_cnt(times): 539 | data=[] 540 | for t in times: 541 | data.append([(d.year-2009)*12+d.month-10 for d in t]) 542 | data=[' '.join(map(str,item)) for item in data] 543 | vocab=[str(i) for i in range(81)] 544 | cnter=CountVectorizer(min_df=0,token_pattern='(?u)\\b\\w+\\b',vocabulary=vocab) 545 | return cnter.fit_transform(data).toarray() 546 | 547 | 548 | times=get(train_data,19),get(test_data,19) 549 | f_monthDist=get_time_cnt(times[0]+times[1]) 550 | f_monthDist=f_monthDist[:len(times[0]),:],f_monthDist[len(times[0]):,:] 551 | 552 | '''统计最后一次发微博距今多少天''' 553 | tday=datetime.datetime.strptime("2016-06-28 0:", "%Y-%m-%d %H:") 554 | 555 | def get_last_day(times): 556 | f=[tday-np.max(t) for t in times] 557 | f=[t.days for t in f] 558 | f=np.array(f).reshape((len(f),1)) 559 | return f 560 | 561 | f_last_day=get_last_day(times[0]),get_last_day(times[1]) 562 | 563 | f_times=np.hstack((f_times[0],f_monthDist[0],f_last_day[0])),np.hstack((f_times[1],f_monthDist[1],f_last_day[1])) 564 | 565 | 566 | print('时间类特征:',f_times[0].shape) 567 | print('...') 568 | #--------------------4. 
地名特征 ---------------------------- 569 | '''城市映射''' 570 | citys=get(train_data,15),get(test_data,15) 571 | loca_maps=get(train_data,16),get(test_data,16) 572 | prov_maps=get(train_data,20),get(test_data,20) 573 | lat_maps=get(train_data,21),get(test_data,21) 574 | short_prov_maps=get(train_data,22),get(test_data,22) 575 | 576 | fp_citys=get_f_cnter(citys,1) 577 | fp_loca_maps=get_f_cnter(loca_maps,1) 578 | fp_prov_maps=get_f_cnter(prov_maps,1) 579 | fp_exist=[(np.array([len(c) for c in items])>0).astype('int') for items in citys] 580 | fp_loca2_maps=get_f_cnter(short_prov_maps,1) 581 | 582 | 583 | fp_exist=[item.reshape(item.shape[0],1) for item in fp_exist] 584 | 585 | 586 | def get_fp_lat(user_data): 587 | fs=[] 588 | for item in user_data: 589 | if len(item)==0: 590 | fs.append([0,0]) 591 | else: 592 | fs.append(np.average(item,axis=0)) 593 | return np.array(fs) 594 | fp_lat_maps=get_fp_lat(lat_maps[0]),get_fp_lat(lat_maps[1]) 595 | 596 | fp=[fp_citys,fp_loca_maps,fp_prov_maps,fp_exist,fp_lat_maps,fp_loca2_maps] 597 | 598 | #----------------------5. 抽取y值-------------------------------------- 599 | '''抽取y值''' 600 | ids=get(train_data,0),get(test_data,0) 601 | 602 | y_gen=get(train_data,2) 603 | y_age=get(train_data,3) 604 | y_loca=get(train_data,4) 605 | 606 | loca_enum='None,华北,华东,华南,西南,华中,东北,西北,境外'.split(',') 607 | y_loca=[loca_enum.index(y)-1 for y in y_loca] 608 | 609 | ys=np.array([y_gen,y_age,y_loca]) 610 | 611 | '''train 与 test特征分开输出''' 612 | f_text_train=[item[0] for item in f_text] 613 | f_text_test=[item[1] for item in f_text] 614 | 615 | fp_train=[item[0] for item in fp] 616 | fp_test=[item[1] for item in fp] 617 | 618 | f_train=[f_text_train,f_stat[0],f_times[0],fp_train] 619 | f_test=[f_text_test,f_stat[1],f_times[1],fp_test] 620 | 621 | f_content=(get(train_data,1),get(test_data,1)) 622 | pickle.dump([ids,ys,f_train,f_test,f_content],open(feature_path+'/features.v2.pkl','wb')) 623 | print('features.v2.pkl输出完毕!') 624 | print('...') 625 | #-----------------------6. 
输出word2vec--------------------------------------- 626 | 627 | '''word2vec''' 628 | from sklearn.feature_extraction.text import TfidfVectorizer 629 | from base.dataset import load_w2v 630 | import random 631 | def filter_content(content): 632 | return content.replace('\xa0',' ') 633 | 634 | def split_weibo(contents): 635 | sens=[] 636 | ids=[] 637 | for item in contents: 638 | c_sens=filter_content(item).split('\n') 639 | ids.append(range(len(sens),len(sens)+len(c_sens))) 640 | sens.extend(c_sens) 641 | return sens,ids 642 | ''' 643 | 将所有句子重新分配给每个人 644 | ''' 645 | def get_f_w2v(xs,ids,vector_size=300): 646 | x_cnn=[] 647 | max_dim=100 648 | for indexs in ids: 649 | item=[] 650 | for i in indexs[:min(max_dim,len(indexs))]: 651 | item.append(xs[i]) 652 | if len(item)0: 792 | fp_s_exist.append(1) 793 | else: 794 | fp_s_exist.append(0) 795 | fp_s_loca.append(' '.join(locas)) 796 | fp_s_prov.append(' '.join(provs)) 797 | 798 | def get_f_cnter2(xs,min_df=1): 799 | cnter=CountVectorizer(min_df=min_df) 800 | return cnter.fit_transform(xs) 801 | 802 | fp_s_loca=get_f_cnter2(fp_s_loca,1) 803 | fp_s_prov=get_f_cnter2(fp_s_prov,1) 804 | fp_s_exist=np.array(fp_s_exist).reshape((len(fp_s_exist),1)) 805 | ids=get(train_data+test_data,0) 806 | pickle.dump([ids,fp_s_loca,fp_s_prov,fp_s_exist],open(feature_path+'/loca.source.feature','wb')) 807 | #----------------------------------以句子为单位进行特征输出--------------------------------------------------- 808 | 809 | # data中存储了每个用户的数据 810 | # indexes中存储了用户对应的位置信息 811 | data=[] 812 | fids=[] 813 | indexes=[] 814 | for item in train_data+test_data: 815 | id=item[0] 816 | fids.append(id) 817 | sens=item[1].split('\n') 818 | reviews=item[7] 819 | forwards=item[8] 820 | sources=item[9].split('\n') 821 | times=item[19] 822 | years=[t.year for t in times] 823 | days=[t.day for t in times] 824 | months=[t.month for t in times] 825 | weeks=[t.weekday() for t in times] 826 | hours=[t.hour for t in times] 827 | begin=len(data) 828 | data.extend(list(zip(sens,reviews,forwards,sources,years,months,days,weeks,hours))) 829 | indexes.append(range(begin,len(data))) 830 | 831 | #关键词特征 832 | f_keys=[] 833 | import re 834 | rules=['#','分享','http','地图'] 835 | for item in data: 836 | val=np.zeros((len(rules),)) 837 | for i,rule in enumerate(rules): 838 | if re.search(rule,item[0])!=None: 839 | val[i]=1 840 | f_keys.append(val) 841 | f_keys=np.array(f_keys) 842 | 843 | #count feature 844 | f_cnt=[] 845 | for item in data: 846 | f_cnt.append([item[1],item[2]]) 847 | f_cnt=np.array(f_cnt) 848 | 849 | #time feature 850 | from keras.utils.np_utils import to_categorical 851 | f_week=to_categorical(get(data,7)) 852 | f_hour=to_categorical(get(data,8)) 853 | 854 | #char feature 855 | from sklearn.feature_extraction.text import TfidfVectorizer 856 | tfidf=TfidfVectorizer(min_df=3) 857 | f_char=tfidf.fit_transform(get(data,0)) 858 | from sklearn.decomposition import TruncatedSVD 859 | svd=TruncatedSVD(n_components=300-37) 860 | f_char_svd=svd.fit_transform(f_char) 861 | 862 | fs=np.hstack([f_keys,f_cnt,f_week,f_hour,f_char_svd]) 863 | f_cnn=[] 864 | for index in indexes: 865 | f=fs[index] 866 | if f.shape[0]<100: 867 | t=np.vstack((f,np.zeros((100-f.shape[0],300)))) 868 | else: 869 | t=f[:100] 870 | f_cnn.append(t) 871 | f_cnn=np.array(f_cnn) 872 | pickle.dump([fids,f_cnn],open(feature_path+'/f_sens.300.pkl','wb')) 873 | #-------------------------------- 程序运行完毕--------------------------------------------------- 874 | print('程序运行完毕') 875 | 
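#-------------------------------- 附:特征文件读取示例(仅供参考,非原流程的一部分)---------------------------------------------------
# A minimal, illustrative sketch of how the pickles written above can be unpacked downstream.
# Only the pickle layouts come from the dump calls in this script
# (features.v1.pkl -> [train_data, test_data]; features.v2.pkl -> [ids, ys, f_train, f_test, f_content],
# with f_train = [f_text, f_stat, f_times, fp]); the variable names below are arbitrary and chosen for clarity.
import pickle
from base.dataset import feature_path

with open(feature_path+'/features.v1.pkl','rb') as f:
    v1_train, v1_test = pickle.load(f)              # per-user raw fields: id, concatenated weibo text, labels, times, ...

with open(feature_path+'/features.v2.pkl','rb') as f:
    v2_ids, v2_ys, v2_train, v2_test, v2_content = pickle.load(f)

v2_text, v2_stat, v2_times, v2_loca = v2_train      # text / statistic / time / location feature groups
print('v1 train:%d, v1 test:%d' % (len(v1_train), len(v1_test)))
print('statistic features:', v2_stat.shape, 'time features:', v2_times.shape)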
-------------------------------------------------------------------------------- /run.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | feature_file="data/feature_data/f_sens.300.pkl" 3 | age_model_file="data/models/yuml.age.feature" 4 | 5 | if [ ! -f "$feature_file" ]; then 6 | python process_data.py 7 | fi 8 | if [ ! -f "$age_model_file" ]; then 9 | python stack_age.py 10 | fi 11 | python stack_loca.py 12 | python main.py 13 | -------------------------------------------------------------------------------- /stack_age.py: -------------------------------------------------------------------------------- 1 | import os,sys 2 | sys.path.append(os.path.abspath('.')) 3 | print('正在进行年龄特征变换...') 4 | import keras 5 | from keras.models import Sequential 6 | from keras.layers.core import Dense,Activation,Flatten,Dropout 7 | from keras.layers.convolutional import Convolution2D,MaxPooling1D,Convolution1D,MaxPooling2D,AveragePooling2D 8 | from keras.layers.embeddings import Embedding 9 | from keras.layers.recurrent import LSTM 10 | from keras.utils.np_utils import to_categorical 11 | from keras.optimizers import SGD 12 | from keras.callbacks import EarlyStopping 13 | from keras.engine.topology import Merge 14 | from base.keras_helper import ModelCheckpointPlus 15 | from base.yuml.models import StackEnsemble 16 | from base.yuml.models import MCNN2 17 | from base.dataset import load_v2,feature_path,smp_path 18 | from base.utils import merge,describe,get,get_xs,get_word_vectors 19 | from sklearn.linear_model import LogisticRegression as LR 20 | from sklearn.ensemble import RandomForestClassifier as RF 21 | from sklearn.ensemble import GradientBoostingClassifier as GBDT 22 | from sklearn.ensemble import AdaBoostClassifier as AdaBoost 23 | from sklearn.ensemble import BaggingClassifier 24 | from sklearn.ensemble import ExtraTreesClassifier 25 | from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer 26 | from sklearn.svm import SVC 27 | from sklearn import cross_validation 28 | from scipy import sparse 29 | import numpy as np 30 | import scipy.sparse as ss 31 | from sklearn.cross_validation import StratifiedShuffleSplit,StratifiedKFold 32 | ids,ys,f_train,f_test,f_content=load_v2() 33 | f_text,f_stat,f_times,fp=f_train 34 | f_text_test,f_stat_test,f_times_test,fp_test=f_test 35 | ReTrain=False 36 | y_age=ys[1] 37 | 38 | import pickle 39 | fids,f_w2v1,f_w2v1_test=pickle.load(open(feature_path+'/f_w2v_tfidf.300.cache','rb')) 40 | fids,f_w2v2,f_w2v2_test=pickle.load(open(feature_path+'/f_word_svd.300.cache','rb')) 41 | 42 | f_w2v=np.concatenate((f_w2v1,f_w2v2),axis=1) 43 | f_w2v_test=np.concatenate((f_w2v1_test,f_w2v2_test),axis=1) 44 | 45 | #--------------- MCNN --------------- 46 | class MCNN(object): 47 | ''' 48 | 使用word2vec*tfidf的cnn并与人工特征混合,接口与sklearn分类器一致 49 | ''' 50 | def __init__(self,cnn_input_dim,num_class=3): 51 | self.num_class=num_class 52 | self.build(cnn_input_dim) 53 | 54 | 55 | def build(self,vector_dim): 56 | #句子特征 57 | model=Sequential() 58 | model.add(Convolution2D(100,1,vector_dim,input_shape=(2,100,vector_dim),activation='relu')) 59 | model.add(Dropout(0.5)) 60 | model.add(MaxPooling2D(pool_size=(50,1))) 61 | model.add(Flatten()) 62 | model.add(Dropout(0.5)) 63 | model.add(Dense(100,activation='tanh')) 64 | model.add(Dropout(0.5)) 65 | model.add(Dense(3,activation='softmax')) 66 | model.compile(loss='categorical_crossentropy',optimizer='adadelta',metrics=['accuracy'],) 67 | 68 | self.model=model 69 | 
self.earlyStopping=EarlyStopping(monitor='val_loss', patience=25, verbose=0, mode='auto') 70 | self.checkpoint=ModelCheckpointPlus(filepath='weights.hdf5',monitor='val_loss',verbose_show=20) 71 | 72 | def fit(self,X,y,Xvi=None,yvi=None): 73 | yc=to_categorical(y) 74 | if Xvi is None: 75 | self.model.fit(X,yc,nb_epoch=1000,verbose=0,validation_split=0.2,batch_size=32,callbacks=[self.earlyStopping,self.checkpoint]) 76 | else: 77 | ycvi=to_categorical(yvi) 78 | self.model.fit(X,yc,nb_epoch=1000,verbose=0,validation_data=[Xvi,ycvi], 79 | batch_size=32,callbacks=[self.earlyStopping,self.checkpoint]) 80 | self.model.load_weights('weights.hdf5') 81 | return self.model 82 | 83 | def predict(self,X): 84 | return self.predict_proba(X) 85 | 86 | def predict_proba(self,X): 87 | return self.model.predict(X) 88 | 89 | #-----------------XGBoost -------------------------------- 90 | 91 | import xgboost as xgb 92 | class XGB(object): 93 | def __init__(self): 94 | self.params={ 95 | 'booster':'gblinear', 96 | 'eta':0.03, 97 | 'alpha':0.1, 98 | 'lambda':0, 99 | 'subsample':1, 100 | 'colsample_bytree':1, 101 | 'num_class':3, 102 | 'objective':'multi:softprob', 103 | 'eval_metric':'mlogloss', 104 | 'silent':1 105 | } 106 | pass 107 | 108 | def fit(self,X,y,Xvi=None,yvi=None): 109 | if Xvi is None: 110 | ti,vi=list(StratifiedShuffleSplit(y,test_size=0.2,random_state=100,n_iter=1))[0] 111 | dtrain=xgb.DMatrix(X[ti],label=y[ti]) 112 | dvalid=xgb.DMatrix(X[vi],label=y[vi]) 113 | else: 114 | dtrain=xgb.DMatrix(X,label=y) 115 | dvalid=xgb.DMatrix(Xvi,label=yvi) 116 | watchlist=[(dtrain,'train'),(dvalid,'val')] 117 | self.model=xgb.train(self.params,dtrain,num_boost_round=1000,early_stopping_rounds=25,evals=(watchlist),verbose_eval=100) 118 | return self.model 119 | 120 | def predict(self,X): 121 | return self.predict_proba(X) 122 | 123 | def predict_proba(self,X): 124 | return self.model.predict(xgb.DMatrix(X)) 125 | 126 | #----------- 获取特征 ------------ 127 | def get_xgb_X(f_train): 128 | f_text,f_stat,f_times,fp=f_train 129 | fs=(f_text[1],f_stat[:,[4,5,10,11,12,13,16]],) 130 | X=get_xs(fs) 131 | return X 132 | 133 | #--------------------mcnn2------------------------------------- 134 | def get_mcnn2_X(xtype='train'): 135 | if xtype=='train': 136 | f_text,f_stat,f_times,fp=f_train 137 | x_cnn=f_w2v 138 | else: 139 | f_text,f_stat,f_times,fp=f_test 140 | x_cnn=f_w2v_test 141 | 142 | f_norm_hour=f_times[:,38:62] 143 | f_hour=f_times[:,14:38] 144 | f_norm_week=f_times[:,7:14] 145 | f_week=f_times[:,0:7] 146 | fs=[f_text[12],f_text[3],f_stat[:,[4,5,10,11,12,13,16]],f_norm_hour] 147 | #fs=[f_text[12]] 148 | 149 | if xtype=='train': 150 | fs+=[em_xgb.get_next_input()[:3200]] 151 | else: 152 | fs+=[em_xgb.get_next_input()[3200:]] 153 | 154 | x_ext=get_xs(fs).toarray() 155 | return [x_cnn,x_ext] 156 | 157 | if __name__=='__main__': 158 | filename=smp_path+'/models/yuml.age.feature' 159 | print('将输出文件:',filename) 160 | if ReTrain==True or os.path.exists(filename)==False: 161 | X=f_w2v 162 | X_test=f_w2v_test 163 | np.random.seed(100) 164 | 165 | #----mcnn model----- 166 | em_mcnn=StackEnsemble(lambda:MCNN(300),multi_input=False) 167 | f2_cnn=em_mcnn.fit(X,y_age) 168 | f2_cnn_test=em_mcnn.predict(X_test) 169 | 170 | #----xgb model------- 171 | X=get_xgb_X(f_train) 172 | X_test=get_xgb_X(f_test) 173 | em_xgb=StackEnsemble(lambda:XGB()) 174 | f2_xgb=em_xgb.fit(X,y_age) 175 | f2_xgb_test=em_xgb.predict(X_test) 176 | 177 | #----mcnn2 model----- 178 | X=get_mcnn2_X('train') 179 | X_test=get_mcnn2_X('test') 180 | np.random.seed(100) 181 
| 182 | em_mcnn2=StackEnsemble(lambda:MCNN2(X[0].shape[3],X[1].shape[1],num_channel=2),multi_input=True) 183 | f2_cnn=em_mcnn2.fit(X,y_age) 184 | f2_cnn_test=em_mcnn2.predict(X_test) 185 | 186 | #-----------------------特征输出------------- 187 | import pickle 188 | pickle.dump([em_mcnn2.get_next_input(),em_xgb.get_next_input(),em_mcnn.get_next_input()],open(filename,'wb')) 189 | print('程序运行完毕') 190 | -------------------------------------------------------------------------------- /stack_loca.py: -------------------------------------------------------------------------------- 1 | ''' 2 | 本程序用来预测地域 3 | 运行前需要先运行process_data.py 4 | 将更新'/submission/temp.csv'文件中的地域一列 5 | ''' 6 | import os,sys 7 | sys.path.append(os.path.abspath('.')) 8 | print('正在预测地域...') 9 | from base.dataset import load_v2,feature_path,smp_path,load_v1 10 | from base.yuml.models import StackEnsemble 11 | from base.utils import describe,get,get_xs,get_word_vectors 12 | from sklearn.linear_model import LogisticRegression as LR 13 | from sklearn.ensemble import RandomForestClassifier as RF 14 | from sklearn.ensemble import GradientBoostingClassifier as GBDT 15 | from sklearn.ensemble import AdaBoostClassifier as AdaBoost 16 | from sklearn.ensemble import BaggingClassifier 17 | from sklearn.ensemble import ExtraTreesClassifier 18 | from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer 19 | from sklearn import cross_validation 20 | from scipy import sparse 21 | import numpy as np 22 | import scipy.sparse as ss 23 | from sklearn.cross_validation import StratifiedShuffleSplit,StratifiedKFold 24 | import pickle 25 | 26 | ReTrain=False 27 | 28 | ids,ys,f_train,f_test,f_content=load_v2() 29 | f_text,f_stat,f_times,fp=f_train 30 | f_text_test,f_stat_test,f_times_test,fp_test=f_test 31 | y_loca=ys[2] 32 | 33 | fids,fp_sloca,fp_sprov,fp_sexist=pickle.load(open(feature_path+'/loca.source.feature','rb')) 34 | n_folds=5 35 | 36 | #--------------------------定义神经网络模型---------------------------------------------- 37 | from keras.models import Sequential 38 | from keras.layers import Dense,Dropout,Activation 39 | from keras.utils.np_utils import to_categorical 40 | from keras.callbacks import EarlyStopping,ModelCheckpoint 41 | from keras.optimizers import SGD 42 | from base.keras_helper import ModelCheckpointPlus 43 | class LocaNN(object): 44 | ''' 45 | 3层BP神经网络 46 | ''' 47 | def __init__(self,input_dim,seed=100): 48 | self.seed=seed 49 | self.build(input_dim) 50 | 51 | def build(self,input_dim): 52 | model=Sequential() 53 | model.add(Dense(output_dim=800,input_dim=input_dim,activation='tanh')) 54 | model.add(Dropout(0.5)) 55 | model.add(Dense(output_dim=300,activation='tanh')) 56 | model.add(Dropout(0.5)) 57 | model.add(Dense(output_dim=8)) 58 | model.add(Dropout(0.3)) 59 | model.add(Activation('softmax')) 60 | 61 | model.compile(optimizer='adadelta',loss='categorical_crossentropy',metrics=['accuracy']) 62 | 63 | self.model=model 64 | self.earlyStopping=EarlyStopping(monitor='val_loss', patience=100, verbose=0, mode='auto') 65 | self.checkpoint=ModelCheckpointPlus(filepath='weights.hdf5',monitor='val_loss',verbose_show=20) 66 | 67 | def fit(self,X,y,Xvi=None,yvi=None): 68 | yc=to_categorical(y) 69 | 70 | if Xvi is None: 71 | self.model.fit(X,yc,nb_epoch=1000,verbose=0,validation_split=0.2,batch_size=32,callbacks=[self.earlyStopping,self.checkpoint]) 72 | else: 73 | ycvi=to_categorical(yvi) 74 | self.model.fit(X,yc,nb_epoch=1000,verbose=0,validation_data=[Xvi,ycvi], 75 | 
batch_size=32,callbacks=[self.earlyStopping,self.checkpoint]) 76 | self.model.load_weights('weights.hdf5') 77 | return self.model 78 | 79 | def predict(self,X): 80 | return self.predict_proba(X) 81 | 82 | def predict_proba(self,X): 83 | return self.model.predict(X) 84 | 85 | def get_nn_x(ftype): 86 | if ftype=='train': 87 | f_text,f_stat,f_times,fp=f_train 88 | ft_norm_hour=f_times[:,38:62] 89 | ft_norm_cnt=f_times[:,230:398] 90 | fs=(fp[0],fp[1],fp[2],fp[3],fp[4],ft_norm_hour,ft_norm_cnt,) 91 | else: 92 | f_text,f_stat,f_times,fp=f_test 93 | ft_norm_hour=f_times[:,38:62] 94 | ft_norm_cnt=f_times[:,230:398] 95 | fs=(fp[0],fp[1],fp[2],fp[3],fp[4],ft_norm_hour,ft_norm_cnt,) 96 | return get_xs(fs) 97 | 98 | 99 | def get_nn_x2(ftype): 100 | if ftype=='train': 101 | f_text,f_stat,f_times,fp=f_train 102 | ft_norm_hour=f_times[:,38:62] 103 | ft_norm_cnt=f_times[:,230:398] 104 | fs=(fp[0],fp[1],fp[2],fp[3],fp[4],ft_norm_hour,ft_norm_cnt, 105 | fp_sloca[:3200],fp_sprov[:3200],fp_sexist[:3200], 106 | em_age.get_next_input()[:3200], 107 | em_gen.get_next_input()[:3200]) 108 | else: 109 | f_text,f_stat,f_times,fp=f_test 110 | ft_norm_hour=f_times[:,38:62] 111 | ft_norm_cnt=f_times[:,230:398] 112 | fs=(fp[0],fp[1],fp[2],fp[3],fp[4],ft_norm_hour,ft_norm_cnt, 113 | fp_sloca[3200:],fp_sprov[3200:],fp_sexist[3200:], 114 | em_age.get_next_input()[3200:], 115 | em_gen.get_next_input()[3200:]) 116 | return get_xs(fs) 117 | 118 | #----------------填充空值---------------------------------------------- 119 | #填充y的空值 120 | loca_empth_path=feature_path+'/loca.empty.pkl' 121 | if os.path.exists(loca_empth_path): 122 | loca_empty=pickle.load(open(loca_empth_path,'rb')) 123 | tindexs=np.arange(len(ys[2]))[ys[2]==-1] 124 | if len(tindexs)>0: 125 | assert((tindexs==loca_empty[0]).all()) 126 | y_loca[tindexs]=loca_empty[1] 127 | else: 128 | filter_index=np.arange(len(ys[2]))[ys[2]!=-1] 129 | print('正在填充训练集中空缺的y值...') 130 | X=get_nn_x('train').toarray() 131 | X_test=get_nn_x('test').toarray() 132 | em_nn=StackEnsemble(lambda:LocaNN(X.shape[1]),n_folds=5) 133 | em_nn.fit(X[filter_index],y_loca[filter_index]) 134 | 135 | tindexs=np.arange(len(ys[2]))[ys[2]==-1] 136 | loca_empty=np.argmax(em_nn.predict(X[tindexs]),axis=1) 137 | pickle.dump([tindexs,loca_empty],open(loca_empth_path,'wb')) 138 | y_loca[tindexs]=loca_empty 139 | print('y值填充完毕') 140 | 141 | #----------------1. BP神经网络模型-------------------------------- 142 | 143 | 144 | X=get_nn_x('train').toarray() 145 | X_test=get_nn_x('test').toarray() 146 | em_nn=StackEnsemble(lambda:LocaNN(X.shape[1]),n_folds=n_folds) 147 | 148 | em_nn_path=smp_path+'/models/loca.em_nn.weight' 149 | if ReTrain==False and os.path.exists(em_nn_path): 150 | em_nn.load(em_nn_path) 151 | else: 152 | np.random.seed(50) 153 | em_nn.fit(X,y_loca) 154 | em_nn.predict(X_test) 155 | f2_nn=em_nn.get_next_input() 156 | em_nn.save(em_nn_path) 157 | 158 | #-----------------2. 
KNN -------------------------------------------- 159 | from sklearn.neighbors import KNeighborsClassifier as KNN 160 | import scipy.sparse as ss 161 | 162 | def get_knn_x(xtype): 163 | if xtype=='train': 164 | f_text,f_stat,f_times,fp=f_train 165 | ft_norm_hour=f_times[:,38:62] 166 | ft_norm_cnt=f_times[:,230:398] 167 | fs=(fp[0],fp[1],fp[3],fp[2],fp[4],ft_norm_hour,ft_norm_cnt,) 168 | 169 | else: 170 | f_text,f_stat,f_times,fp=f_test 171 | ft_norm_hour=f_times[:,38:62] 172 | ft_norm_cnt=f_times[:,230:398] 173 | fs=(fp[0],fp[1],fp[3],fp[2],fp[4],ft_norm_hour,ft_norm_cnt, ) 174 | 175 | return get_xs(fs) 176 | X=get_knn_x('train') 177 | X_test=get_knn_x('test') 178 | 179 | em_knn_path=smp_path+'/models/loca.em_knn.weight' 180 | em_knn=StackEnsemble(lambda:KNN(n_neighbors=20),need_valid=False,n_folds=n_folds) 181 | 182 | if ReTrain==False and os.path.exists(em_knn_path): 183 | em_knn.load(em_knn_path) 184 | else: 185 | em_knn.fit(X,y_loca) 186 | em_knn.predict(X_test) 187 | f2_knn=em_knn.get_next_input() 188 | em_knn.save(em_knn_path) 189 | 190 | #-----------------3. MCNN ----------------------------------------------------- 191 | import pickle 192 | fids,f_w2v1,f_w2v1_test=pickle.load(open(feature_path+'/f_w2v_tfidf.300.cache','rb')) 193 | fids,f_w2v2,f_w2v2_test=pickle.load(open(feature_path+'/f_word_svd.300.cache','rb')) 194 | #fids,f_w2v3,f_w2v3_test=pickle.load(open(feature_path+'/f_source_svd.300.cache','rb')) 195 | #fids,f_w2v4,f_w2v4_test=pickle.load(open(feature_path+'/f_letter_svd.300.cache','rb')) 196 | fids,f_sens=pickle.load(open(feature_path+'/f_sens.300.pkl','rb')) 197 | f_sens=f_sens.reshape((f_sens.shape[0],1,100,300)) 198 | f_w2v=np.concatenate((f_w2v1,f_w2v2,f_sens[:3200]),axis=1) 199 | f_w2v_test=np.concatenate((f_w2v1_test,f_w2v2_test,f_sens[3200:]),axis=1) 200 | 201 | 202 | from base.yuml.models import MCNN2 203 | def get_mcnn_x(xtype='train'): 204 | if xtype=='train': 205 | f_text,f_stat,f_times,fp=f_train 206 | x_cnn=f_w2v 207 | x_ext=f_text[2] 208 | else: 209 | f_text,f_stat,f_times,fp=f_test 210 | x_cnn=f_w2v_test 211 | x_ext=f_text[2] 212 | return [x_cnn,x_ext] 213 | 214 | X=get_mcnn_x('train') 215 | X_test=get_mcnn_x('test') 216 | 217 | em_mcnn_path=smp_path+'/models/loca.em_mcnn.weight' 218 | em_mcnn=StackEnsemble(lambda:MCNN2(X[0].shape[3],X[1].shape[1],num_class=8,num_channel=3),multi_input=True,n_folds=n_folds) 219 | if ReTrain==False and os.path.exists(em_mcnn_path): 220 | em_mcnn.load(em_mcnn_path) 221 | else: 222 | em_mcnn.fit(X,y_loca) 223 | em_mcnn.predict(X_test) 224 | f2_mcnn=em_mcnn.get_next_input() 225 | em_mcnn.save(em_mcnn_path) 226 | 227 | #-------------------4. 
MCNN3----------------------------------------------------- 228 | f_w2v=np.concatenate((f_w2v1,f_w2v2,f_sens[:3200]),axis=1) 229 | f_w2v_test=np.concatenate((f_w2v1_test,f_w2v2_test,f_sens[3200:]),axis=1) 230 | 231 | from base.yuml.models import MCNN3 232 | def get_mcnn3_x(xtype='train'): 233 | if xtype=='train': 234 | f_text,f_stat,f_times,fp=f_train 235 | x_cnn=f_w2v 236 | else: 237 | f_text,f_stat,f_times,fp=f_test 238 | x_cnn=f_w2v_test 239 | 240 | x_ext=f_text[2] 241 | 242 | ft_norm_hour=f_times[:,38:62] 243 | ft_norm_cnt=f_times[:,230:398] 244 | if xtype=='train': 245 | fs=(fp[0],fp[1],fp[2],fp[3],fp[4],ft_norm_hour,ft_norm_cnt) 246 | else: 247 | fs=(fp[0],fp[1],fp[2],fp[3],fp[4],ft_norm_hour,ft_norm_cnt) 248 | x_ext1=get_xs(fs).toarray() 249 | 250 | return [x_cnn,x_ext,x_ext1] 251 | 252 | X=get_mcnn3_x('train') 253 | X_test=get_mcnn3_x('test') 254 | 255 | shape=(X[0].shape[3],X[1].shape[1],X[2].shape[1]) 256 | 257 | em_mcnn3_path=smp_path+'/models/loca.em_mcnn3.weight' 258 | em_mcnn3=StackEnsemble(lambda:MCNN3(shape,num_class=8,num_channel=3),multi_input=True,n_folds=n_folds) 259 | 260 | if ReTrain==False and os.path.exists(em_mcnn3_path): 261 | em_mcnn3.load(em_mcnn3_path) 262 | else: 263 | em_mcnn3.fit(X,y_loca) 264 | em_mcnn3.predict(X_test) 265 | f2_mcnn3=em_mcnn3.get_next_input() 266 | em_mcnn3.save(em_mcnn3_path) 267 | 268 | 269 | #------------------------ 6. 加权投票-------------------------------------------------------- 270 | from base.yuml.models import WeightVoter 271 | ems=[em_nn,em_mcnn,em_knn,em_mcnn3] 272 | voter=WeightVoter() 273 | voter.extend(ems) 274 | y_pred=voter.fit_vote(y_loca) 275 | 276 | #-------------------------输出--------------------------------------------------------- 277 | from base.dataset import submission_path 278 | 279 | loca_enum='华北,华东,华南,西南,华中,东北,西北,境外'.split(',') 280 | y_pred=[loca_enum[y] for y in y_pred] 281 | 282 | #-------------- output ----------------------- 283 | y_dict={} 284 | for id,y in zip(ids[1],y_pred): 285 | y_dict[id]=y 286 | 287 | with open(submission_path+'/empty.csv',encoding='utf8') as f: 288 | items=[item.strip() for item in f] 289 | 290 | with open(submission_path+'/temp.csv','w',encoding='utf8') as f: 291 | f.write('%s\n'%items[0]) 292 | cnt=0 293 | for item in items[1:]: 294 | values=item.split(',') 295 | if y_dict[values[0]]!=values[3]: 296 | #print(values[3],'->',y_dict[values[0]]) 297 | cnt+=1 298 | f.write('%s,%s,%s,%s\n'%(values[0],values[1],values[2],y_dict[values[0]])) 299 | print('输出完毕,更新条数:',cnt) -------------------------------------------------------------------------------- /submission/empty.csv: -------------------------------------------------------------------------------- 1 | uid,age,gender,province 2 | 1743152063,None,None,None 3 | 1073390982,None,None,None 4 | 2137599524,None,None,None 5 | 2279196033,None,None,None 6 | 1039584863,None,None,None 7 | 2263800485,None,None,None 8 | 1727727530,None,None,None 9 | 2636071,None,None,None 10 | 2254551532,None,None,None 11 | 1354370475,None,None,None 12 | 2122894743,None,None,None 13 | 2103200284,None,None,None 14 | 2692899093,None,None,None 15 | 1748047122,None,None,None 16 | 1846837671,None,None,None 17 | 2179467223,None,None,None 18 | 1433793732,None,None,None 19 | 1278713991,None,None,None 20 | 1846054284,None,None,None 21 | 2327289553,None,None,None 22 | 1721034545,None,None,None 23 | 2941073947,None,None,None 24 | 2243519355,None,None,None 25 | 1114766430,None,None,None 26 | 1000056512,None,None,None 27 | 1758162851,None,None,None 28 | 2355697441,None,None,None 29 
| 1273817417,None,None,None 30 | 1692145762,None,None,None 31 | 1953987080,None,None,None 32 | 1440962973,None,None,None 33 | 2798122095,None,None,None 34 | 2715361861,None,None,None 35 | 2091990107,None,None,None 36 | 1223413445,None,None,None 37 | 1132523305,None,None,None 38 | 2499208892,None,None,None 39 | 2182851475,None,None,None 40 | 1787118805,None,None,None 41 | 2185732252,None,None,None 42 | 2081601367,None,None,None 43 | 3212181255,None,None,None 44 | 1825840883,None,None,None 45 | 1270963501,None,None,None 46 | 1786018667,None,None,None 47 | 1085562794,None,None,None 48 | 1880309142,None,None,None 49 | 3185870731,None,None,None 50 | 1072304107,None,None,None 51 | 1729674993,None,None,None 52 | 1680975557,None,None,None 53 | 1771792650,None,None,None 54 | 2389248703,None,None,None 55 | 2039354763,None,None,None 56 | 1651456263,None,None,None 57 | 2096022380,None,None,None 58 | 1751345681,None,None,None 59 | 2782707502,None,None,None 60 | 1766004535,None,None,None 61 | 2874402390,None,None,None 62 | 2416275737,None,None,None 63 | 1653003903,None,None,None 64 | 1896892533,None,None,None 65 | 1567571807,None,None,None 66 | 2655439467,None,None,None 67 | 1425545803,None,None,None 68 | 3363301662,None,None,None 69 | 2675275181,None,None,None 70 | 2706680547,None,None,None 71 | 2452231361,None,None,None 72 | 2265706987,None,None,None 73 | 3115383751,None,None,None 74 | 2068032752,None,None,None 75 | 2625501947,None,None,None 76 | 2057665367,None,None,None 77 | 2546877980,None,None,None 78 | 3072370235,None,None,None 79 | 2690601997,None,None,None 80 | 1879778817,None,None,None 81 | 2480224850,None,None,None 82 | 1824895633,None,None,None 83 | 2267788063,None,None,None 84 | 2266490493,None,None,None 85 | 1672728830,None,None,None 86 | 1163165185,None,None,None 87 | 2954775382,None,None,None 88 | 2643118927,None,None,None 89 | 1314664643,None,None,None 90 | 2785917050,None,None,None 91 | 2722968981,None,None,None 92 | 2008630695,None,None,None 93 | 1782686791,None,None,None 94 | 1155313510,None,None,None 95 | 2868360900,None,None,None 96 | 2856382664,None,None,None 97 | 1832712741,None,None,None 98 | 2714341463,None,None,None 99 | 2365955653,None,None,None 100 | 2380171235,None,None,None 101 | 2213378071,None,None,None 102 | 2848328527,None,None,None 103 | 2211674960,None,None,None 104 | 2492431274,None,None,None 105 | 2694174710,None,None,None 106 | 1623043823,None,None,None 107 | 2435322624,None,None,None 108 | 2944811934,None,None,None 109 | 1372195384,None,None,None 110 | 1760347330,None,None,None 111 | 1368064210,None,None,None 112 | 2479855844,None,None,None 113 | 2156376185,None,None,None 114 | 1281528805,None,None,None 115 | 1417001713,None,None,None 116 | 1822742032,None,None,None 117 | 1336549385,None,None,None 118 | 2813715602,None,None,None 119 | 1864350213,None,None,None 120 | 1827037650,None,None,None 121 | 2814139504,None,None,None 122 | 2027628523,None,None,None 123 | 1659985882,None,None,None 124 | 2670042354,None,None,None 125 | 1815876407,None,None,None 126 | 2841697115,None,None,None 127 | 2446564031,None,None,None 128 | 1878424003,None,None,None 129 | 2185615444,None,None,None 130 | 2357864747,None,None,None 131 | 2058804685,None,None,None 132 | 1796607647,None,None,None 133 | 2211874627,None,None,None 134 | 1828338772,None,None,None 135 | 2394410043,None,None,None 136 | 1798523071,None,None,None 137 | 1753249671,None,None,None 138 | 1874090361,None,None,None 139 | 2664758555,None,None,None 140 | 2690401704,None,None,None 141 | 3331138020,None,None,None 142 | 
1930729827,None,None,None 143 | 1849119523,None,None,None 144 | 2268566482,None,None,None 145 | 1465193434,None,None,None 146 | 1803443643,None,None,None 147 | 1832790984,None,None,None 148 | 1296376625,None,None,None 149 | 1059153991,None,None,None 150 | 1706779313,None,None,None 151 | 2441862452,None,None,None 152 | 2495080803,None,None,None 153 | 1836669740,None,None,None 154 | 2377056732,None,None,None 155 | 1717073433,None,None,None 156 | 1426331230,None,None,None 157 | 2259954174,None,None,None 158 | 2150961787,None,None,None 159 | 1764408274,None,None,None 160 | 1915896041,None,None,None 161 | 1602712610,None,None,None 162 | 2269751351,None,None,None 163 | 1905684062,None,None,None 164 | 2607301817,None,None,None 165 | 2579453675,None,None,None 166 | 1293736151,None,None,None 167 | 2916405137,None,None,None 168 | 2935303527,None,None,None 169 | 1978598113,None,None,None 170 | 1869007973,None,None,None 171 | 2708144347,None,None,None 172 | 1832695380,None,None,None 173 | 1266913000,None,None,None 174 | 2321559184,None,None,None 175 | 2541334467,None,None,None 176 | 1658908317,None,None,None 177 | 1315718013,None,None,None 178 | 1891044002,None,None,None 179 | 2403861287,None,None,None 180 | 1722189331,None,None,None 181 | 2800908681,None,None,None 182 | 1793183032,None,None,None 183 | 1827202442,None,None,None 184 | 2560722877,None,None,None 185 | 2586926194,None,None,None 186 | 2501396470,None,None,None 187 | 49202000,None,None,None 188 | 2542603432,None,None,None 189 | 1729586027,None,None,None 190 | 3086919671,None,None,None 191 | 2277951605,None,None,None 192 | 1428556057,None,None,None 193 | 3285116473,None,None,None 194 | 1278534055,None,None,None 195 | 1410150407,None,None,None 196 | 1558107674,None,None,None 197 | 2871873712,None,None,None 198 | 2642665271,None,None,None 199 | 1095546562,None,None,None 200 | 2386946767,None,None,None 201 | 1934690391,None,None,None 202 | 2124627542,None,None,None 203 | 1830922363,None,None,None 204 | 1745846270,None,None,None 205 | 1816341994,None,None,None 206 | 2475994283,None,None,None 207 | 1801120292,None,None,None 208 | 1752296795,None,None,None 209 | 3174687897,None,None,None 210 | 1421582511,None,None,None 211 | 1890240293,None,None,None 212 | 2764306630,None,None,None 213 | 2734405295,None,None,None 214 | 1453563652,None,None,None 215 | 3206598931,None,None,None 216 | 1820438203,None,None,None 217 | 2879829467,None,None,None 218 | 1718522074,None,None,None 219 | 1790923821,None,None,None 220 | 1793275663,None,None,None 221 | 2233005395,None,None,None 222 | 1591037164,None,None,None 223 | 2602099561,None,None,None 224 | 1097019671,None,None,None 225 | 2855803990,None,None,None 226 | 2024932054,None,None,None 227 | 1822217501,None,None,None 228 | 1795865003,None,None,None 229 | 2583909674,None,None,None 230 | 1882371191,None,None,None 231 | 1732397022,None,None,None 232 | 1903174085,None,None,None 233 | 1146143120,None,None,None 234 | 2480368647,None,None,None 235 | 1346066905,None,None,None 236 | 1045647587,None,None,None 237 | 1702092531,None,None,None 238 | 2319863523,None,None,None 239 | 1913436627,None,None,None 240 | 3196373475,None,None,None 241 | 2526375621,None,None,None 242 | 2679135333,None,None,None 243 | 1686662974,None,None,None 244 | 1191693630,None,None,None 245 | 2529410317,None,None,None 246 | 1135568164,None,None,None 247 | 2534892030,None,None,None 248 | 2108652481,None,None,None 249 | 1436183367,None,None,None 250 | 1936104012,None,None,None 251 | 2798107573,None,None,None 252 | 1504734535,None,None,None 253 | 
1659013510,None,None,None 254 | 2747643807,None,None,None 255 | 1912226102,None,None,None 256 | 1651786410,None,None,None 257 | 1222826910,None,None,None 258 | 1826793973,None,None,None 259 | 2942543017,None,None,None 260 | 2111913667,None,None,None 261 | 1704744581,None,None,None 262 | 2945624541,None,None,None 263 | 2785608072,None,None,None 264 | 1135270763,None,None,None 265 | 2230974612,None,None,None 266 | 2002830027,None,None,None 267 | 1706717647,None,None,None 268 | 1321246601,None,None,None 269 | 2218210763,None,None,None 270 | 2280137471,None,None,None 271 | 2288099201,None,None,None 272 | 2270123145,None,None,None 273 | 1081992905,None,None,None 274 | 2205082954,None,None,None 275 | 2381636693,None,None,None 276 | 1182037630,None,None,None 277 | 1962973332,None,None,None 278 | 1205540905,None,None,None 279 | 2598084612,None,None,None 280 | 1914984834,None,None,None 281 | 1581595273,None,None,None 282 | 2428776430,None,None,None 283 | 2430972351,None,None,None 284 | 3014553163,None,None,None 285 | 2114853170,None,None,None 286 | 1294340800,None,None,None 287 | 2257008791,None,None,None 288 | 1588576784,None,None,None 289 | 2659774670,None,None,None 290 | 2299291803,None,None,None 291 | 2301097593,None,None,None 292 | 1675984730,None,None,None 293 | 2858709532,None,None,None 294 | 3178960643,None,None,None 295 | 1234275454,None,None,None 296 | 2830071124,None,None,None 297 | 2689683321,None,None,None 298 | 1146555413,None,None,None 299 | 1807512607,None,None,None 300 | 2616699483,None,None,None 301 | 1168728283,None,None,None 302 | 1480133234,None,None,None 303 | 2456519795,None,None,None 304 | 1765212780,None,None,None 305 | 3005471383,None,None,None 306 | 1045284441,None,None,None 307 | 1761129900,None,None,None 308 | 2312754895,None,None,None 309 | 2476689617,None,None,None 310 | 2582387672,None,None,None 311 | 1909030637,None,None,None 312 | 1736834765,None,None,None 313 | 1658643883,None,None,None 314 | 2685939441,None,None,None 315 | 2538612732,None,None,None 316 | 1835461600,None,None,None 317 | 1871061914,None,None,None 318 | 2642296631,None,None,None 319 | 1217650882,None,None,None 320 | 1445687504,None,None,None 321 | 1843283224,None,None,None 322 | 2424095957,None,None,None 323 | 2291288353,None,None,None 324 | 2324795500,None,None,None 325 | 1699444081,None,None,None 326 | 2797247467,None,None,None 327 | 2490312060,None,None,None 328 | 1923595373,None,None,None 329 | 2243558527,None,None,None 330 | 2726244237,None,None,None 331 | 2247848767,None,None,None 332 | 1880070643,None,None,None 333 | 2490986060,None,None,None 334 | 1595594960,None,None,None 335 | 1740937421,None,None,None 336 | 2458430992,None,None,None 337 | 2696559771,None,None,None 338 | 2916238587,None,None,None 339 | 2605575117,None,None,None 340 | 1893837255,None,None,None 341 | 2695649093,None,None,None 342 | 2388498761,None,None,None 343 | 1654128650,None,None,None 344 | 1243629662,None,None,None 345 | 1776450881,None,None,None 346 | 2271037352,None,None,None 347 | 2119673924,None,None,None 348 | 1050263283,None,None,None 349 | 1693053184,None,None,None 350 | 2705143367,None,None,None 351 | 2718265731,None,None,None 352 | 2257146430,None,None,None 353 | 2159088280,None,None,None 354 | 1760380282,None,None,None 355 | 2958966260,None,None,None 356 | 2216646335,None,None,None 357 | 1668850612,None,None,None 358 | 2408102575,None,None,None 359 | 3216542700,None,None,None 360 | 2397453880,None,None,None 361 | 1649255173,None,None,None 362 | 2440572542,None,None,None 363 | 1715014170,None,None,None 364 | 
3203352097,None,None,None 365 | 1059328282,None,None,None 366 | 1836349401,None,None,None 367 | 1817304600,None,None,None 368 | 2678104171,None,None,None 369 | 2341227717,None,None,None 370 | 2061802911,None,None,None 371 | 1280693592,None,None,None 372 | 2216702742,None,None,None 373 | 2813641494,None,None,None 374 | 1904135293,None,None,None 375 | 2598233864,None,None,None 376 | 1195235102,None,None,None 377 | 1702693983,None,None,None 378 | 2770658780,None,None,None 379 | 1840686841,None,None,None 380 | 1070793077,None,None,None 381 | 1158228704,None,None,None 382 | 2355079001,None,None,None 383 | 1271015820,None,None,None 384 | 1252559512,None,None,None 385 | 2686846727,None,None,None 386 | 2294105054,None,None,None 387 | 1308616370,None,None,None 388 | 1442327465,None,None,None 389 | 1611260370,None,None,None 390 | 1659163742,None,None,None 391 | 1450983574,None,None,None 392 | 2718272545,None,None,None 393 | 1099126897,None,None,None 394 | 1791600157,None,None,None 395 | 1217402220,None,None,None 396 | 2745303555,None,None,None 397 | 1762178825,None,None,None 398 | 2824103253,None,None,None 399 | 2418735472,None,None,None 400 | 1759016503,None,None,None 401 | 1354243132,None,None,None 402 | 2256255854,None,None,None 403 | 1300685760,None,None,None 404 | 1760679433,None,None,None 405 | 2137862441,None,None,None 406 | 1437282885,None,None,None 407 | 1549030550,None,None,None 408 | 1086517472,None,None,None 409 | 1730491081,None,None,None 410 | 2149035802,None,None,None 411 | 1662043644,None,None,None 412 | 2459122511,None,None,None 413 | 1676479093,None,None,None 414 | 1439603061,None,None,None 415 | 1758879437,None,None,None 416 | 2273043910,None,None,None 417 | 2678034884,None,None,None 418 | 2532726222,None,None,None 419 | 2310334287,None,None,None 420 | 1376388924,None,None,None 421 | 1793267394,None,None,None 422 | 1908136651,None,None,None 423 | 1268688930,None,None,None 424 | 1607398312,None,None,None 425 | 2172776854,None,None,None 426 | 1238650702,None,None,None 427 | 2592522171,None,None,None 428 | 2814227943,None,None,None 429 | 2392219810,None,None,None 430 | 1134126684,None,None,None 431 | 2373739231,None,None,None 432 | 1726618517,None,None,None 433 | 2831291060,None,None,None 434 | 2861216440,None,None,None 435 | 1227917800,None,None,None 436 | 1628274681,None,None,None 437 | 1016823867,None,None,None 438 | 2499316531,None,None,None 439 | 1401113090,None,None,None 440 | 1397822322,None,None,None 441 | 2630818711,None,None,None 442 | 2643492720,None,None,None 443 | 2390270251,None,None,None 444 | 1773322485,None,None,None 445 | 1903688170,None,None,None 446 | 3236457582,None,None,None 447 | 1776050710,None,None,None 448 | 1287163315,None,None,None 449 | 1309459122,None,None,None 450 | 2727069225,None,None,None 451 | 1677071631,None,None,None 452 | 2728879023,None,None,None 453 | 2305025032,None,None,None 454 | 2605392322,None,None,None 455 | 2708344714,None,None,None 456 | 2635073511,None,None,None 457 | 1647554923,None,None,None 458 | 1729047741,None,None,None 459 | 1954235357,None,None,None 460 | 1433549103,None,None,None 461 | 3002374617,None,None,None 462 | 1761269744,None,None,None 463 | 2005164537,None,None,None 464 | 1940865523,None,None,None 465 | 2151176490,None,None,None 466 | 2095203825,None,None,None 467 | 2441431507,None,None,None 468 | 1884317152,None,None,None 469 | 2675181071,None,None,None 470 | 1108339247,None,None,None 471 | 2083840225,None,None,None 472 | 2515135520,None,None,None 473 | 2926478351,None,None,None 474 | 2671234942,None,None,None 475 | 
2633638123,None,None,None 476 | 2113707763,None,None,None 477 | 1879130153,None,None,None 478 | 1591632564,None,None,None 479 | 2233998230,None,None,None 480 | 2706689713,None,None,None 481 | 1771466270,None,None,None 482 | 2359906260,None,None,None 483 | 1004952991,None,None,None 484 | 1863748821,None,None,None 485 | 2251045950,None,None,None 486 | 2784952264,None,None,None 487 | 1836822713,None,None,None 488 | 2782432522,None,None,None 489 | 2933678981,None,None,None 490 | 2346259347,None,None,None 491 | 2626699337,None,None,None 492 | 2852283045,None,None,None 493 | 1219846652,None,None,None 494 | 2566961945,None,None,None 495 | 2737963973,None,None,None 496 | 2196914955,None,None,None 497 | 1759652113,None,None,None 498 | 2382403281,None,None,None 499 | 2395606577,None,None,None 500 | 1764313354,None,None,None 501 | 1240507852,None,None,None 502 | 2300420474,None,None,None 503 | 3115400871,None,None,None 504 | 2401030731,None,None,None 505 | 3095326151,None,None,None 506 | 1770839332,None,None,None 507 | 1222885067,None,None,None 508 | 1939454535,None,None,None 509 | 1535072383,None,None,None 510 | 2245118817,None,None,None 511 | 1099228584,None,None,None 512 | 2910098747,None,None,None 513 | 1686750962,None,None,None 514 | 2776366303,None,None,None 515 | 1562655870,None,None,None 516 | 2813504202,None,None,None 517 | 2697050121,None,None,None 518 | 1705128983,None,None,None 519 | 1810288755,None,None,None 520 | 2735422365,None,None,None 521 | 2416528024,None,None,None 522 | 1571367417,None,None,None 523 | 1879061103,None,None,None 524 | 1247586765,None,None,None 525 | 2550576333,None,None,None 526 | 1599841142,None,None,None 527 | 2294945435,None,None,None 528 | 2418466033,None,None,None 529 | 2824976872,None,None,None 530 | 1186575835,None,None,None 531 | 1744412633,None,None,None 532 | 1737854284,None,None,None 533 | 1747561052,None,None,None 534 | 1967630925,None,None,None 535 | 1583095435,None,None,None 536 | 2757798323,None,None,None 537 | 1909484543,None,None,None 538 | 2704711553,None,None,None 539 | 2091581501,None,None,None 540 | 2027300580,None,None,None 541 | 2794647204,None,None,None 542 | 1904432792,None,None,None 543 | 1866116872,None,None,None 544 | 1706332083,None,None,None 545 | 1832218692,None,None,None 546 | 2790437631,None,None,None 547 | 1615225710,None,None,None 548 | 2813274281,None,None,None 549 | 1668837942,None,None,None 550 | 1734117871,None,None,None 551 | 2614203315,None,None,None 552 | 1683262534,None,None,None 553 | 2496513862,None,None,None 554 | 2310256715,None,None,None 555 | 2720922411,None,None,None 556 | 1749859757,None,None,None 557 | 2534043921,None,None,None 558 | 2448366413,None,None,None 559 | 1800488812,None,None,None 560 | 2253935930,None,None,None 561 | 1628203092,None,None,None 562 | 2469953990,None,None,None 563 | 2588741600,None,None,None 564 | 1465512643,None,None,None 565 | 1670461631,None,None,None 566 | 2640148613,None,None,None 567 | 2323439873,None,None,None 568 | 2193574710,None,None,None 569 | 1403264367,None,None,None 570 | 1888359931,None,None,None 571 | 2611074525,None,None,None 572 | 2405586963,None,None,None 573 | 1419021967,None,None,None 574 | 1878959310,None,None,None 575 | 2295260380,None,None,None 576 | 1421941067,None,None,None 577 | 3095657495,None,None,None 578 | 2275451987,None,None,None 579 | 2206527971,None,None,None 580 | 1655593560,None,None,None 581 | 2111113635,None,None,None 582 | 1772586637,None,None,None 583 | 2352862915,None,None,None 584 | 2390180990,None,None,None 585 | 2585391774,None,None,None 586 | 
2512020233,None,None,None 587 | 1732567481,None,None,None 588 | 1407604085,None,None,None 589 | 1572243313,None,None,None 590 | 2356277673,None,None,None 591 | 1657602061,None,None,None 592 | 1371359263,None,None,None 593 | 1560811012,None,None,None 594 | 1047273002,None,None,None 595 | 1895743152,None,None,None 596 | 2331382124,None,None,None 597 | 1628401674,None,None,None 598 | 1093130001,None,None,None 599 | 2717877093,None,None,None 600 | 1462429103,None,None,None 601 | 2194141151,None,None,None 602 | 2528489734,None,None,None 603 | 1888869987,None,None,None 604 | 1178837544,None,None,None 605 | 1839518480,None,None,None 606 | 1072778333,None,None,None 607 | 2463286903,None,None,None 608 | 2016458675,None,None,None 609 | 2329691643,None,None,None 610 | 2439850033,None,None,None 611 | 1150794353,None,None,None 612 | 2791460161,None,None,None 613 | 2159642185,None,None,None 614 | 1414323002,None,None,None 615 | 2251016034,None,None,None 616 | 2742564803,None,None,None 617 | 1918100581,None,None,None 618 | 2822625592,None,None,None 619 | 1958048501,None,None,None 620 | 1599449007,None,None,None 621 | 1846242183,None,None,None 622 | 2240448475,None,None,None 623 | 1553062685,None,None,None 624 | 1606643417,None,None,None 625 | 2461401527,None,None,None 626 | 1788499767,None,None,None 627 | 1863279280,None,None,None 628 | 1707022000,None,None,None 629 | 1734380083,None,None,None 630 | 1938480904,None,None,None 631 | 2232799795,None,None,None 632 | 2826956437,None,None,None 633 | 1840587742,None,None,None 634 | 2959810315,None,None,None 635 | 1858400464,None,None,None 636 | 1838113625,None,None,None 637 | 2279193771,None,None,None 638 | 3013060127,None,None,None 639 | 2638616804,None,None,None 640 | 2556295304,None,None,None 641 | 1889568113,None,None,None 642 | 1778605763,None,None,None 643 | 2369865831,None,None,None 644 | 2284610294,None,None,None 645 | 1668823983,None,None,None 646 | 1871185137,None,None,None 647 | 2377559694,None,None,None 648 | 1965399672,None,None,None 649 | 1713338103,None,None,None 650 | 1904925072,None,None,None 651 | 1167081394,None,None,None 652 | 1750559541,None,None,None 653 | 1916398973,None,None,None 654 | 1907548435,None,None,None 655 | 2775971913,None,None,None 656 | 3218164590,None,None,None 657 | 1876360390,None,None,None 658 | 1607615535,None,None,None 659 | 1813528752,None,None,None 660 | 2013072154,None,None,None 661 | 1390484700,None,None,None 662 | 2275269785,None,None,None 663 | 1405271484,None,None,None 664 | 2560301405,None,None,None 665 | 2209946014,None,None,None 666 | 3225143501,None,None,None 667 | 1844290793,None,None,None 668 | 1655756205,None,None,None 669 | 1130222531,None,None,None 670 | 1428652062,None,None,None 671 | 2327118535,None,None,None 672 | 2697749224,None,None,None 673 | 1773757097,None,None,None 674 | 2705031232,None,None,None 675 | 2975228027,None,None,None 676 | 2758837101,None,None,None 677 | 2437604823,None,None,None 678 | 1881103992,None,None,None 679 | 1735195071,None,None,None 680 | 1320483941,None,None,None 681 | 2683921235,None,None,None 682 | 2216862800,None,None,None 683 | 2689280681,None,None,None 684 | 3136730833,None,None,None 685 | 2653601134,None,None,None 686 | 1736646085,None,None,None 687 | 1662463183,None,None,None 688 | 1345324695,None,None,None 689 | 2815234954,None,None,None 690 | 1510214665,None,None,None 691 | 2813086545,None,None,None 692 | 1976187052,None,None,None 693 | 1794958882,None,None,None 694 | 2103208243,None,None,None 695 | 1839501807,None,None,None 696 | 1732032003,None,None,None 697 | 
2719394111,None,None,None 698 | 3122114020,None,None,None 699 | 1801696915,None,None,None 700 | 2122094875,None,None,None 701 | 2563939270,None,None,None 702 | 1762895003,None,None,None 703 | 1993842637,None,None,None 704 | 1237884817,None,None,None 705 | 2911257187,None,None,None 706 | 2722045963,None,None,None 707 | 2038721554,None,None,None 708 | 2810383762,None,None,None 709 | 2160830792,None,None,None 710 | 3085160073,None,None,None 711 | 3169139663,None,None,None 712 | 2841127252,None,None,None 713 | 1713431704,None,None,None 714 | 1813872542,None,None,None 715 | 2492951980,None,None,None 716 | 2632945384,None,None,None 717 | 1629158130,None,None,None 718 | 2288058635,None,None,None 719 | 1887764852,None,None,None 720 | 2610885991,None,None,None 721 | 1522474181,None,None,None 722 | 1663382171,None,None,None 723 | 3123671877,None,None,None 724 | 1417677931,None,None,None 725 | 2315071971,None,None,None 726 | 1095796153,None,None,None 727 | 2095450481,None,None,None 728 | 1406980903,None,None,None 729 | 2280618897,None,None,None 730 | 1416782153,None,None,None 731 | 2253677264,None,None,None 732 | 1894546243,None,None,None 733 | 3037748720,None,None,None 734 | 1842306537,None,None,None 735 | 3141576902,None,None,None 736 | 1253327042,None,None,None 737 | 2180233507,None,None,None 738 | 1876750515,None,None,None 739 | 1962030043,None,None,None 740 | 2257391490,None,None,None 741 | 1853458031,None,None,None 742 | 2347570787,None,None,None 743 | 3173356190,None,None,None 744 | 1436588442,None,None,None 745 | 2610163407,None,None,None 746 | 1966207253,None,None,None 747 | 2301279774,None,None,None 748 | 2011480513,None,None,None 749 | 1878580605,None,None,None 750 | 1850504885,None,None,None 751 | 3037651094,None,None,None 752 | 1853447441,None,None,None 753 | 1785384652,None,None,None 754 | 2099366144,None,None,None 755 | 1832727450,None,None,None 756 | 3046519217,None,None,None 757 | 1965713822,None,None,None 758 | 2599364084,None,None,None 759 | 2188119752,None,None,None 760 | 1666939372,None,None,None 761 | 2193592587,None,None,None 762 | 2398214134,None,None,None 763 | 2173414305,None,None,None 764 | 1614003492,None,None,None 765 | 3186247615,None,None,None 766 | 1284130440,None,None,None 767 | 1571882663,None,None,None 768 | 1713915734,None,None,None 769 | 1067900262,None,None,None 770 | 2793565701,None,None,None 771 | 1630262283,None,None,None 772 | 2299128991,None,None,None 773 | 2187425003,None,None,None 774 | 1644215093,None,None,None 775 | 1967550143,None,None,None 776 | 2167424813,None,None,None 777 | 3175039507,None,None,None 778 | 1766696047,None,None,None 779 | 1800899567,None,None,None 780 | 3108171667,None,None,None 781 | 2763027141,None,None,None 782 | 1661693597,None,None,None 783 | 1663951822,None,None,None 784 | 2285002930,None,None,None 785 | 2856550494,None,None,None 786 | 2111040205,None,None,None 787 | 2985062583,None,None,None 788 | 22885151,None,None,None 789 | 2535883512,None,None,None 790 | 2887481154,None,None,None 791 | 1559133374,None,None,None 792 | 1887529105,None,None,None 793 | 2554042422,None,None,None 794 | 1840998010,None,None,None 795 | 2454397150,None,None,None 796 | 1950917351,None,None,None 797 | 2283802445,None,None,None 798 | 2423668637,None,None,None 799 | 2290405973,None,None,None 800 | 1048212783,None,None,None 801 | 2328822871,None,None,None 802 | 2987179462,None,None,None 803 | 1877322845,None,None,None 804 | 2918978982,None,None,None 805 | 1903836303,None,None,None 806 | 3191897454,None,None,None 807 | 1868606174,None,None,None 808 | 
1830503722,None,None,None 809 | 1908805083,None,None,None 810 | 2458637554,None,None,None 811 | 1197039675,None,None,None 812 | 1790176491,None,None,None 813 | 1734318362,None,None,None 814 | 1976091850,None,None,None 815 | 1210691424,None,None,None 816 | 2164601452,None,None,None 817 | 2114808810,None,None,None 818 | 2311715697,None,None,None 819 | 1879846724,None,None,None 820 | 2274064801,None,None,None 821 | 2299205073,None,None,None 822 | 2644198111,None,None,None 823 | 2739709015,None,None,None 824 | 2161412015,None,None,None 825 | 1650731783,None,None,None 826 | 2491784077,None,None,None 827 | 1819849227,None,None,None 828 | 2778702865,None,None,None 829 | 2015911243,None,None,None 830 | 2600664297,None,None,None 831 | 1733687943,None,None,None 832 | 2040664751,None,None,None 833 | 2191393865,None,None,None 834 | 2670203193,None,None,None 835 | 1785573771,None,None,None 836 | 2375437395,None,None,None 837 | 2884478055,None,None,None 838 | 1722066993,None,None,None 839 | 2658721611,None,None,None 840 | 1877933877,None,None,None 841 | 2118576943,None,None,None 842 | 1855272451,None,None,None 843 | 2689533247,None,None,None 844 | 1417966981,None,None,None 845 | 2772211020,None,None,None 846 | 1745632325,None,None,None 847 | 1874033965,None,None,None 848 | 1302459531,None,None,None 849 | 2611435467,None,None,None 850 | 2028512770,None,None,None 851 | 1864739361,None,None,None 852 | 1993952625,None,None,None 853 | 2400925723,None,None,None 854 | 2762134860,None,None,None 855 | 2307027872,None,None,None 856 | 2463043725,None,None,None 857 | 2419586747,None,None,None 858 | 1679187385,None,None,None 859 | 1445681057,None,None,None 860 | 1649227594,None,None,None 861 | 1060933804,None,None,None 862 | 2174832754,None,None,None 863 | 1895910890,None,None,None 864 | 2873632550,None,None,None 865 | 1419649945,None,None,None 866 | 1711134073,None,None,None 867 | 1740559111,None,None,None 868 | 1566646464,None,None,None 869 | 1671725772,None,None,None 870 | 1339581950,None,None,None 871 | 2714831411,None,None,None 872 | 1811358323,None,None,None 873 | 1810323837,None,None,None 874 | 3059311203,None,None,None 875 | 1216109575,None,None,None 876 | 2292600215,None,None,None 877 | 2406447784,None,None,None 878 | 1088659091,None,None,None 879 | 2704356785,None,None,None 880 | 2350816524,None,None,None 881 | 1512466564,None,None,None 882 | 1699496267,None,None,None 883 | 1339466254,None,None,None 884 | 2339848393,None,None,None 885 | 1085504381,None,None,None 886 | 1103173143,None,None,None 887 | 2389252782,None,None,None 888 | 2358050391,None,None,None 889 | 2684074833,None,None,None 890 | 1673025024,None,None,None 891 | 2848326144,None,None,None 892 | 2176533771,None,None,None 893 | 1886902677,None,None,None 894 | 2626136864,None,None,None 895 | 2379178513,None,None,None 896 | 1793330664,None,None,None 897 | 2269046802,None,None,None 898 | 1819046120,None,None,None 899 | 2053248305,None,None,None 900 | 2245076965,None,None,None 901 | 1777945141,None,None,None 902 | 1736156057,None,None,None 903 | 2008133701,None,None,None 904 | 2300801527,None,None,None 905 | 1751156287,None,None,None 906 | 2074899745,None,None,None 907 | 1404846341,None,None,None 908 | 1132080382,None,None,None 909 | 1571288013,None,None,None 910 | 3226702187,None,None,None 911 | 1826472915,None,None,None 912 | 2658395034,None,None,None 913 | 1416443200,None,None,None 914 | 2805169922,None,None,None 915 | 1871710311,None,None,None 916 | 2206361761,None,None,None 917 | 1867738613,None,None,None 918 | 1606033650,None,None,None 919 | 
2194704295,None,None,None 920 | 2815452554,None,None,None 921 | 2837585864,None,None,None 922 | 2233678780,None,None,None 923 | 2654046093,None,None,None 924 | 1910940833,None,None,None 925 | 1916203541,None,None,None 926 | 1150199684,None,None,None 927 | 3171253670,None,None,None 928 | 1659896143,None,None,None 929 | 2820147850,None,None,None 930 | 2464130854,None,None,None 931 | 2477862101,None,None,None 932 | 1703387805,None,None,None 933 | 1831711532,None,None,None 934 | 2070776595,None,None,None 935 | 2442151792,None,None,None 936 | 2101799347,None,None,None 937 | 1844475307,None,None,None 938 | 2185778163,None,None,None 939 | 2596054670,None,None,None 940 | 1872051102,None,None,None 941 | 1820119395,None,None,None 942 | 1632992170,None,None,None 943 | 1572416205,None,None,None 944 | 1480975182,None,None,None 945 | 1194956002,None,None,None 946 | 2504323690,None,None,None 947 | 1105953575,None,None,None 948 | 2302340807,None,None,None 949 | 2649293533,None,None,None 950 | 1730797780,None,None,None 951 | 2678269943,None,None,None 952 | 1723929691,None,None,None 953 | 1721105227,None,None,None 954 | 2408353751,None,None,None 955 | 2272614240,None,None,None 956 | 2485579657,None,None,None 957 | 1742659945,None,None,None 958 | 2907006627,None,None,None 959 | 1802546305,None,None,None 960 | 1229297394,None,None,None 961 | 2144732091,None,None,None 962 | 2751028113,None,None,None 963 | 1504414927,None,None,None 964 | 2233483874,None,None,None 965 | 2686472283,None,None,None 966 | 3173914124,None,None,None 967 | 1586303513,None,None,None 968 | 2008176975,None,None,None 969 | 1192430137,None,None,None 970 | 1669105932,None,None,None 971 | 2112764677,None,None,None 972 | 2665457347,None,None,None 973 | 1572441061,None,None,None 974 | 1862937800,None,None,None 975 | 2485710301,None,None,None 976 | 3214881233,None,None,None 977 | 2300214375,None,None,None 978 | 1798604215,None,None,None 979 | 1363163922,None,None,None 980 | 1862722514,None,None,None 981 | 3014169797,None,None,None 982 | 1951642605,None,None,None 983 | 2804925562,None,None,None 984 | 2295443055,None,None,None 985 | 1823988984,None,None,None 986 | 1866540195,None,None,None 987 | 1583400374,None,None,None 988 | 3212877060,None,None,None 989 | 1763118212,None,None,None 990 | 2319008431,None,None,None 991 | 1874259974,None,None,None 992 | 1868678142,None,None,None 993 | 2039378617,None,None,None 994 | 1239847773,None,None,None 995 | 2796017557,None,None,None 996 | 2290138117,None,None,None 997 | 1984942435,None,None,None 998 | 1464139943,None,None,None 999 | 1865169233,None,None,None 1000 | 1010206540,None,None,None 1001 | 2925230603,None,None,None 1002 | 2543460603,None,None,None 1003 | 3287905451,None,None,None 1004 | 67406402,None,None,None 1005 | 1809045643,None,None,None 1006 | 2872076174,None,None,None 1007 | 2342565487,None,None,None 1008 | 1219978810,None,None,None 1009 | 1903520125,None,None,None 1010 | 1017500627,None,None,None 1011 | 1743137672,None,None,None 1012 | 1483114385,None,None,None 1013 | 1764149357,None,None,None 1014 | 1744323660,None,None,None 1015 | 1772227557,None,None,None 1016 | 1837060462,None,None,None 1017 | 1863384824,None,None,None 1018 | 1788175537,None,None,None 1019 | 1938347541,None,None,None 1020 | 2608904951,None,None,None 1021 | 2406291987,None,None,None 1022 | 1741043280,None,None,None 1023 | 2273097695,None,None,None 1024 | 2016636711,None,None,None 1025 | 2880560447,None,None,None 1026 | 1873012403,None,None,None 1027 | 2743279522,None,None,None 1028 | 1180065853,None,None,None 1029 | 
1993840022,None,None,None 1030 | 1428406927,None,None,None 1031 | 1788769791,None,None,None 1032 | 1843634713,None,None,None 1033 | 1822269395,None,None,None 1034 | 1847150225,None,None,None 1035 | 1456017262,None,None,None 1036 | 1802528930,None,None,None 1037 | 1645835575,None,None,None 1038 | 2747927407,None,None,None 1039 | 2724423081,None,None,None 1040 | 1889305682,None,None,None 1041 | 1583431063,None,None,None 1042 | 1583403102,None,None,None 1043 | 2242908443,None,None,None 1044 | 1216652734,None,None,None 1045 | 1923524237,None,None,None 1046 | 2292717621,None,None,None 1047 | 3179603174,None,None,None 1048 | 3209997372,None,None,None 1049 | 1588233575,None,None,None 1050 | 2300132555,None,None,None 1051 | 1896734323,None,None,None 1052 | 2356313963,None,None,None 1053 | 1792108922,None,None,None 1054 | 2970974730,None,None,None 1055 | 2041223004,None,None,None 1056 | 1890998497,None,None,None 1057 | 1769434422,None,None,None 1058 | 2608802865,None,None,None 1059 | 1153511521,None,None,None 1060 | 1848989161,None,None,None 1061 | 2597538141,None,None,None 1062 | 1664282381,None,None,None 1063 | 2618662861,None,None,None 1064 | 2848088451,None,None,None 1065 | 2997540812,None,None,None 1066 | 2844982351,None,None,None 1067 | 1834880663,None,None,None 1068 | 1883996911,None,None,None 1069 | 2532595942,None,None,None 1070 | 2613058012,None,None,None 1071 | 2928740327,None,None,None 1072 | 2707993253,None,None,None 1073 | 1742712911,None,None,None 1074 | 2261120363,None,None,None 1075 | 2660985383,None,None,None 1076 | 1401616093,None,None,None 1077 | 2190978852,None,None,None 1078 | 2608924507,None,None,None 1079 | 1904072104,None,None,None 1080 | 1773785231,None,None,None 1081 | 1470566153,None,None,None 1082 | 1759034963,None,None,None 1083 | 2832247351,None,None,None 1084 | 2112859733,None,None,None 1085 | 1240850170,None,None,None 1086 | 1870425363,None,None,None 1087 | 2477656114,None,None,None 1088 | 1091761162,None,None,None 1089 | 1222072200,None,None,None 1090 | 1670881727,None,None,None 1091 | 2446567851,None,None,None 1092 | 1471289553,None,None,None 1093 | 2251922121,None,None,None 1094 | 1768035545,None,None,None 1095 | 2188135785,None,None,None 1096 | 1294249763,None,None,None 1097 | 2539755142,None,None,None 1098 | 1848841943,None,None,None 1099 | 3001994504,None,None,None 1100 | 2620158873,None,None,None 1101 | 1923306441,None,None,None 1102 | 1404844844,None,None,None 1103 | 2614561977,None,None,None 1104 | 1615566863,None,None,None 1105 | 2360970234,None,None,None 1106 | 3164697567,None,None,None 1107 | 1450687240,None,None,None 1108 | 1575585953,None,None,None 1109 | 2414219790,None,None,None 1110 | 1852724777,None,None,None 1111 | 1763565945,None,None,None 1112 | 1888628611,None,None,None 1113 | 1826335820,None,None,None 1114 | 2310425061,None,None,None 1115 | 1749140383,None,None,None 1116 | 1913052597,None,None,None 1117 | 1729661003,None,None,None 1118 | 1131296744,None,None,None 1119 | 2926286584,None,None,None 1120 | 1941510572,None,None,None 1121 | 2992002534,None,None,None 1122 | 1407883302,None,None,None 1123 | 1132999900,None,None,None 1124 | 2127663061,None,None,None 1125 | 2484696914,None,None,None 1126 | 2373255252,None,None,None 1127 | 2562694481,None,None,None 1128 | 1437654575,None,None,None 1129 | 1965157097,None,None,None 1130 | 1596789133,None,None,None 1131 | 1740803694,None,None,None 1132 | 2809972164,None,None,None 1133 | 2279362267,None,None,None 1134 | 2029970055,None,None,None 1135 | 2357029162,None,None,None 1136 | 
2179286194,None,None,None 1137 | 2398724721,None,None,None 1138 | 1764089704,None,None,None 1139 | 2165315272,None,None,None 1140 | 2627503774,None,None,None 1141 | 3053296471,None,None,None 1142 | 2331407000,None,None,None 1143 | 1805552233,None,None,None 1144 | 2515463507,None,None,None 1145 | 2722617213,None,None,None 1146 | 1179764700,None,None,None 1147 | 1912805047,None,None,None 1148 | 2720364610,None,None,None 1149 | 2272407802,None,None,None 1150 | 2615101877,None,None,None 1151 | 1342165903,None,None,None 1152 | 1653911911,None,None,None 1153 | 1775135170,None,None,None 1154 | 1854096291,None,None,None 1155 | 1413214321,None,None,None 1156 | 1747434920,None,None,None 1157 | 1407279765,None,None,None 1158 | 2114041470,None,None,None 1159 | 2312968481,None,None,None 1160 | 1624501527,None,None,None 1161 | 1972864963,None,None,None 1162 | 2685920901,None,None,None 1163 | 1782364423,None,None,None 1164 | 2737491320,None,None,None 1165 | 2537618544,None,None,None 1166 | 43465349,None,None,None 1167 | 2356851304,None,None,None 1168 | 1841864292,None,None,None 1169 | 1574885232,None,None,None 1170 | 1082311752,None,None,None 1171 | 1685833114,None,None,None 1172 | 2502415043,None,None,None 1173 | 1226349880,None,None,None 1174 | 1771376163,None,None,None 1175 | 1774481472,None,None,None 1176 | 1785958201,None,None,None 1177 | 1180361187,None,None,None 1178 | 2640036161,None,None,None 1179 | 2169084413,None,None,None 1180 | 2385008674,None,None,None 1181 | 1407764102,None,None,None 1182 | 1929988490,None,None,None 1183 | 2648362533,None,None,None 1184 | 1674829422,None,None,None 1185 | 2419170344,None,None,None 1186 | 1377769544,None,None,None 1187 | 1619729000,None,None,None 1188 | 2977019562,None,None,None 1189 | 2966297450,None,None,None 1190 | 1323618883,None,None,None 1191 | 3029237492,None,None,None 1192 | 1410895810,None,None,None 1193 | 1610962080,None,None,None 1194 | 2409108395,None,None,None 1195 | 2887476642,None,None,None 1196 | 1661992613,None,None,None 1197 | 1426226457,None,None,None 1198 | 1749153194,None,None,None 1199 | 2401667841,None,None,None 1200 | 1507062404,None,None,None 1201 | 1907346245,None,None,None 1202 | 2653791631,None,None,None 1203 | 3210332200,None,None,None 1204 | 2682306645,None,None,None 1205 | 2728552247,None,None,None 1206 | 1763745827,None,None,None 1207 | 2015436193,None,None,None 1208 | 2372881040,None,None,None 1209 | 1943200413,None,None,None 1210 | 1918261705,None,None,None 1211 | 1553575840,None,None,None 1212 | 1872343313,None,None,None 1213 | 1612339233,None,None,None 1214 | 2808456882,None,None,None 1215 | 1173091831,None,None,None 1216 | 2671991960,None,None,None 1217 | 1408478885,None,None,None 1218 | 2570299831,None,None,None 1219 | 1804227890,None,None,None 1220 | 1948043161,None,None,None 1221 | 3115402210,None,None,None 1222 | 2293447203,None,None,None 1223 | 1317372634,None,None,None 1224 | 1449282163,None,None,None 1225 | 2401335332,None,None,None 1226 | 1077416192,None,None,None 1227 | 1770543293,None,None,None 1228 | 1791486193,None,None,None 1229 | 3259907870,None,None,None 1230 | 1803874025,None,None,None 1231 | 2005126320,None,None,None 1232 | 2622026331,None,None,None 1233 | 2174600425,None,None,None 1234 | 2270206382,None,None,None 1235 | 2514275574,None,None,None 1236 | 2509440967,None,None,None 1237 | 2373230214,None,None,None 1238 | 2820317095,None,None,None 1239 | 2548427222,None,None,None 1240 | 2871090720,None,None,None 1241 | 1783343890,None,None,None 
--------------------------------------------------------------------------------
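The final location prediction in stack_loca.py above is produced by WeightVoter from base/yuml/models.py, which combines the stacked probability outputs of the BP network, KNN, MCNN and MCNN3 ensembles; its implementation is not included in this listing. The sketch below shows one common form of weighted soft voting over per-model class-probability matrices. It is only an assumption about the general technique, not the actual WeightVoter code, and the function name, weights and usage names are hypothetical.

```
import numpy as np

def weighted_soft_vote(proba_list, weights=None):
    """Average per-model class-probability matrices (n_samples, n_classes)
    with optional weights, then pick the highest-scoring class per sample."""
    if weights is None:
        weights = np.ones(len(proba_list))      # plain average by default
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()           # normalize weights to sum to 1
    combined = sum(w * p for w, p in zip(weights, proba_list))
    return np.argmax(combined, axis=1)          # predicted class index per sample

# hypothetical usage with the four models' probability outputs:
# y_pred = weighted_soft_vote([p_nn, p_knn, p_mcnn, p_mcnn3], weights=[1.0, 0.8, 1.0, 1.0])
```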