├── .gitignore
├── 2017的腾讯广告社交赛TOP10PPT.zip
├── 2018腾讯算法大赛参赛手册-V4.pdf
├── README.md
├── data.py
├── feature.py
├── ffm.py
├── ffmdata.py
├── forRank.py
├── forVal.py
├── imformation
│   └── README.md
├── lgb.py
├── nn.py
├── nnHelper.py
├── picture
│   ├── top1.png
│   └── top3.pdf
├── test.py
└── xlearn_ffm.py

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
/.idea
/data
/datas
/__pycache__

--------------------------------------------------------------------------------
/2017的腾讯广告社交赛TOP10PPT.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ColaDrill/2018spa/24bf9f906e06ee6cf58da3f6d8248dddc164ece2/2017的腾讯广告社交赛TOP10PPT.zip

--------------------------------------------------------------------------------
/2018腾讯算法大赛参赛手册-V4.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ColaDrill/2018spa/24bf9f906e06ee6cf58da3f6d8248dddc164ece2/2018腾讯算法大赛参赛手册-V4.pdf

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
The code analysis in this write-up has reached the point of reproducing the top-3 attention mechanism and is still being updated (the code is rough for now; it will gradually be refactored into classes and generalized into a reusable pipeline template).
We finished 29th out of roughly 1,500 teams in the 2018 Tencent Algorithm Competition. This post summarizes the top-10 solutions; it was my first big-data (machine learning) competition, so it is written as a beginner's primer.

Most Kaggle competitions can be won with feature engineering plus GBDT, but the Tencent competition is won on models, so it puts high demands on hardware and on DNN research.
Operations that papers and our experiments have so far shown to be effective:
* MSRAPrelu initialization from Microsoft Research Asia
* attention over multi-value features
* adaptive regularization
* MLPs in both the wide and the deep (vertical) direction
* learning-rate decay
* batch normalization, dropout
* focal loss (see the Gluon sketch further below)
@ GPU-optimized matrix multiplication can speed training up

Operations planned for future experiments:
# the Dice activation function is being tested
# log MVM is said to help
# try GAP (global average pooling) in place of the fully connected layers
In essence, the LR part mines first-order features, the FFM layer second-order ones, and the interest and MVM layers higher-order ones, so that the model digs out as much hidden information from the features as possible, with the MLP assigning the weights.
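The focal loss mentioned in the list above can be written as a small Gluon loss, since the repo's NN code (nn.py / nnHelper.py) is Gluon-based. This is a minimal illustrative sketch, not code from this repository; the class name and the default `alpha`/`gamma` values are assumptions.

```python
import mxnet as mx
from mxnet.gluon import loss


class FocalLoss(loss.Loss):
    """Binary focal loss on raw logits: -alpha * (1 - p_t)^gamma * log(p_t)."""

    def __init__(self, alpha=0.25, gamma=2.0, weight=None, batch_axis=0, **kwargs):
        super(FocalLoss, self).__init__(weight, batch_axis, **kwargs)
        self._alpha = alpha
        self._gamma = gamma

    def hybrid_forward(self, F, pred, label):
        label = F.reshape_like(label, pred)         # match label shape to logits
        p = F.sigmoid(pred)                         # predicted click probability
        p_t = p * label + (1 - p) * (1 - label)     # probability of the true class
        w = self._alpha * (1 - p_t) ** self._gamma  # down-weights easy examples
        return -F.mean(w * F.log(p_t + 1e-12), axis=self._batch_axis, exclude=True)
```

With `alpha=1` and `gamma=0` this reduces to the ordinary logistic loss, which makes it easy to compare against the existing setup when experimenting with the heavy class imbalance.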
Related papers:
Deep Interest Network: https://arxiv.org/pdf/1706.06978.pdf

First place this year went to Ge Wenqiang's team with an NN of their own design, similar to their IJCAI 2018 model; a picture of the model is at:

./picture/top1.png
Third place was a DEEP INTEREST FFM created by Tsinghua heavyweights (from the "Chayuan" and CS departments); the PPT poster is at:

./picture/top3.pdf
Most of the other leading models are small modifications of the NFFM model shared by guoday and nzc; that model is the champion model of the NJU Panda team from the first (2017) competition. GitHub:

guoday: https://github.com/guoday/Tencent2018_Lookalike_Rank7th (his xDeepFM is said to work well; I have not had time to study it yet)
nzc: https://github.com/nzc/tencent-contest
Links to top-20 experience posts:

Official baseline:
https://github.com/YouChouNoBB/2018-tencent-ad-competition-baseline

[腾讯广告算法赛开源-Top4-SDD](https://zhuanlan.zhihu.com/p/42089584)
https://blog.csdn.net/ML_SDD/article/details/81702168
[2018腾讯广告算法大赛总结(Rank6)-模型篇](https://zhuanlan.zhihu.com/p/38443751)
[第二届腾讯广告算法大赛总结(Rank 9)](https://zhuanlan.zhihu.com/p/38499275?utm_source=com.tencent.tim&utm_medium=social&utm_oi=571282483765710848)
[2018腾讯广告算法大赛Top10-特征工程](https://zhuanlan.zhihu.com/p/40479648)
[2018腾讯广告算法大赛总结/0.772229/Rank11](https://zhuanlan.zhihu.com/p/38034501)
[腾讯社交广告算法大赛(Top15)总结-特征工程](https://zhuanlan.zhihu.com/p/39491062)

All the models above are neural-network models: when the data volume is very large and the hardware limits are tight, NN models suit the scenario well (CTR NN primers are linked further down). This year's data was huge, and the final round demanded a lot of GPU and memory — roughly 400 GB of RAM. I will put the small dataset I used for studying and testing after the competition under the directory below; it runs on a 16 GB laptop:

./datas/

Competition details:

./2018腾讯算法大赛参赛手册

A big-data (machine learning) competition roughly follows the workflow below, with the goal of scoring high on the official metric (e.g. AUC, logloss):

1. build features
2. feed them to models
3. ensemble

Primers on the relevant basics:

./imformation/

Features ---- in this competition everyone's features were much the same:

Code for the common kinds of feature extraction is in ./feature.py (note the K-fold handling of the conversion-rate features and the Bayesian smoothing; the details are easy to look up).
How to mine features? See nzc's post: https://zhuanlan.zhihu.com/p/38341881

PS: it always depends on the specific case. Sometimes you can select features from the importances exposed by LightGBM's API, sometimes you need to look at the correlation with the label — but beware of leakage (using future data to predict the past, or the label to predict the label, which makes the offline score very high and the online score very low); K-fold schemes help avoid it. None of this proves a feature is useful or useless; the most reliable check is simply to add it and see whether the online score improves, but then you will run out of time, and the final round may bring different strong features anyway. The best way to learn feature engineering is to read the PPTs and code of past top-10 teams; reading too much feature-engineering theory wastes a lot of time. Basic exploratory visualization is covered in the primers, but after the competition I realized that what the strong players analyze is completely different from what a beginner like me analyzes, so I am still figuring that part out.

Models ---- this is a CTR-style competition; the usual models are tree models such as LightGBM, XGBoost and CatBoost, FFM models, and a series of more complex deep NN models.

Ensembling ---- I used weighted averaging plus the NTU team's inverse-sigmoid blending, trying the various combinations.

Roughly, the blending is done pairwise in a Huffman-like order: apply the inverse sigmoid to each single submission's online predictions, then interleave that with plain weighted averaging, because sometimes the weighted average beats the inverse-sigmoid blend; a more detailed write-up will follow.
Code: ./*********.py

Preliminary round:
In the preliminary round the effort should go into feature engineering; Microsoft's LightGBM is very efficient, so build as many features as you can and happily feed them to LGB.

Final round:
Let's focus on the final round. As noted above, it has a much larger dataset and tight hardware limits. At first everyone considered DeepFM; then guoday and nzc open-sourced NFFM, so people turned to studying NN models. Here are some CTR NN primers:

nzc on Zhihu: https://zhuanlan.zhihu.com/p/32885978
Shi Xiaowen on Jianshu: https://www.jianshu.com/nb/21403842
Xin Junbo on Zhihu: https://zhuanlan.zhihu.com/p/35484389

Briefly, DeepFFM concatenates the deep part and the FFM part side by side (in the wide direction), while NFFM works serially, feeding the FFM layer's output into the dense layers. Why it works so well is still not fully explained; one teacher's answer was that different environments and scenarios favour different models. The top-3 interest network with its attention mechanism is easier to interpret: for the multi-value features in this competition everyone used average pooling (similar to max pooling), but your interest weights must differ from ad to ad — if you click a lipstick ad, the Air Jordan styles you like should not carry much weight — so attention is used to assign the weights dynamically (a small Gluon sketch of this attention pooling is given at the end of this section). In that network LR expresses first-order features and FFM second-order features, and the top-1 team additionally used an MVM with a log transform to express higher-order features. The remaining details are in the paper:

Alibaba's DIN network: https://arxiv.org/pdf/1706.06978.pdf

The code here originally borrowed from an anonymous player's preliminary-round code in MXNet's new Gluon framework. I actually like this Apache framework that mixes imperative and symbolic styles, but some of its APIs are maintained rather slowly, so I am migrating to TensorFlow and PyTorch (sadly). I am currently trying to reproduce the top-1 model and to move the data processing to Spark, and will eventually distill everything into a convenient general-purpose template — continuously updated...
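To make the attention idea above concrete, here is a minimal Gluon sketch of ad-aware attention pooling over one multi-value feature (e.g. an interest field) in place of plain average pooling. It is illustrative only — the block name, the small scoring MLP, and the tensor shapes are assumptions, not the competition code.

```python
import mxnet as mx
from mxnet.gluon import nn


class AttentionPool(nn.HybridBlock):
    """Weight each interest embedding by its relevance to the candidate ad."""

    def __init__(self, hidden=16, **kwargs):
        super(AttentionPool, self).__init__(**kwargs)
        with self.name_scope():
            self.score = nn.HybridSequential()
            self.score.add(nn.Dense(hidden, activation='relu', flatten=False))
            self.score.add(nn.Dense(1, flatten=False))

    def hybrid_forward(self, F, interests, ad, mask):
        # interests: (batch, T, d) embeddings of one multi-value feature
        # ad:        (batch, d)    embedding of the candidate ad
        # mask:      (batch, T)    1 for real values, 0 for padding
        ad_tiled = F.broadcast_add(F.expand_dims(ad, axis=1), F.zeros_like(interests))
        pair = F.concat(interests, ad_tiled, interests * ad_tiled, dim=2)
        logits = F.squeeze(self.score(pair), axis=2)                # (batch, T)
        logits = F.where(mask, logits, -1e9 * F.ones_like(logits))  # mask out padding
        weights = F.softmax(logits, axis=-1)                        # attention weights
        pooled = F.sum(F.broadcast_mul(F.expand_dims(weights, axis=2), interests), axis=1)
        return pooled                                               # (batch, d)
```

Each multi-value field (interest1, kw1, topic1, ...) would get its own pooled vector this way before being fed to the MLP, which is essentially the mechanism described in the DIN paper linked above.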
93 | 94 | 95 | 另外一些踩过的坑与大家分享: 96 | 97 | # 特征清洗很重要,不然要重做特征很多次,做大量特征前要小批量验证代码正确,不然有BUG就很浪费时间 98 | #内存不够的时候read_csv数据类型别用默认的64位 99 | #del 后的内存要和谐的collect()一下 100 | -------------------------------------------------------------------------------- /data.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # encoding=utf8 3 | import gc 4 | import os 5 | import csv 6 | import json 7 | import random 8 | import hashlib 9 | import numpy as np 10 | import pandas as pd 11 | from scipy import sparse 12 | from collections import Counter 13 | from scipy.sparse import csr_matrix 14 | from collections import defaultdict 15 | from collections import OrderedDict 16 | 17 | User = ['appIdAction','appIdInstall','interest1','interest2','interest3','interest4','interest5','kw1','kw2','kw3','topic1','topic2','topic3','ct','marriageStatus','os'] 18 | Ad = ['aid'] 19 | User_Ad_idx = './datas/features/UA_idx.txt' 20 | ############################################ 21 | 22 | def mkdir(folder): 23 | folder = './datas/features/%s' % folder 24 | if not os.path.exists(folder): 25 | os.makedirs(folder) 26 | return folder 27 | import gc 28 | def getMerged(*keys, **kwargs): 29 | kind = kwargs.get('kind', 0) 30 | name1 = './datas/trainMerged.csv' 31 | name2 = './datas/test1Merged.csv' 32 | 33 | for name in [name1, name2]: 34 | if os.path.exists(name): continue 35 | print( "@@@@@", name) 36 | ad = pd.read_csv('./datas/adFeature.csv') 37 | if not os.path.exists(name.replace('Merged', '')): 38 | continue 39 | df = pd.read_csv(name.replace('Merged', '')) 40 | user = pd.read_csv('./datas/userFeature.csv') 41 | 42 | notUsed = ['appIdAction', 'appIdInstall'] 43 | user.drop(notUsed, axis=1, inplace=True) 44 | 45 | df = df.merge(ad).merge(user) 46 | if 'label' not in df: df['label'] = 0 47 | 48 | columns = sorted(df.columns.tolist()) 49 | columns.remove('label'); columns.append('label') 50 | df.to_csv(name, index=False, columns=columns) 51 | print(' ok ',name) 52 | del df,user 53 | gc.collect() 54 | print('merge ok') 55 | # for name in [name1, name2]: 56 | # df = pd.read_csv(name) 57 | # notUsed = ['appIdAction', 'appIdInstall'] 58 | # if notUsed in df.columns: 59 | # df.drop(notUsed, axis=1, inplace=True) 60 | # if 'label' not in df: df['label'] = 0 61 | # columns = sorted(df.columns.tolist()) 62 | # columns.remove('label'); columns.append('label') 63 | # df.to_csv(name, index=False, columns=columns) 64 | 65 | 66 | name3 = './datas/trainMerged1.csv' 67 | name4 = './datas/trainMerged2.csv' 68 | 69 | exists = os.path.exists 70 | if not (exists(name3) and exists(name4)): 71 | random.seed(2018) 72 | 73 | #t1n, t2n = int(7e+6), int(1e+6) 74 | 75 | df = pd.read_csv(name1) 76 | t1n, t2n = int(0.7 * len(df)), int(0.3 * len(df)) 77 | indices = list(range(len(df))) 78 | #random.shuffle(indices) 79 | indices3 = indices[:t1n] 80 | indices4 = indices[-t2n:] 81 | 82 | columns = sorted(df.columns.tolist()) 83 | columns.remove('label'); columns.append('label') 84 | df.take(indices3).to_csv( 85 | name3, index=False, columns=columns) 86 | df.take(indices4).to_csv( 87 | name4, index=False, columns=columns) 88 | del df 89 | gc.collect() 90 | print('merge ok') 91 | if not keys: keys = None 92 | if kind==1: return pd.read_csv(name1, usecols=keys) 93 | if kind==2: return pd.read_csv(name2, usecols=keys) 94 | if kind==0: return pd.concat([ 95 | pd.read_csv(name1, usecols=keys), 96 | pd.read_csv(name2, usecols=keys) 97 | ], axis=0) 98 | if kind==3: return pd.read_csv(name3, usecols=keys) 99 | if 
kind==4: return pd.read_csv(name4, usecols=keys) 100 | 101 | def getMergedA(*keys, **kwargs): 102 | kind = kwargs.get('kind', 0) 103 | name1 = './datas/trainMergedA.csv' 104 | name2 = './datas/test1MergedA.csv' 105 | 106 | for name in [name1, name2]: 107 | if os.path.exists(name): continue 108 | 109 | ad = pd.read_csv('./datas/adFeature.csv') 110 | df = pd.read_csv(name.replace('Merged', '')) 111 | user = pd.read_csv('./datas/userFeature.csv') 112 | 113 | df = df.merge(ad).merge(user) 114 | if 'label' not in df: df['label'] = 0 115 | 116 | columns = sorted(df.columns.tolist()) 117 | columns.remove('label'); columns.append('label') 118 | df.to_csv(name, index=False, columns=columns) 119 | 120 | if not keys: keys = None 121 | if kind==1: return pd.read_csv(name1, usecols=keys) 122 | if kind==2: return pd.read_csv(name2, usecols=keys) 123 | if kind==0: return pd.concat([ 124 | pd.read_csv(name1, usecols=keys), 125 | pd.read_csv(name2, usecols=keys), 126 | ], axis=0) 127 | 128 | ############################################ 129 | 130 | def getUF(): 131 | name = './datas/userFeature.data' 132 | 133 | for f in open(name): 134 | user = {} 135 | for uf in f.strip().split('|'): 136 | uf = uf.split(' ') 137 | user[uf[0]] = list(map(int, uf[1:])) 138 | 139 | yield user 140 | 141 | def userMaxLen(): 142 | folder = mkdir('msg') 143 | maxLen = defaultdict(int) 144 | maxFile = '%s/maxLen.txt' % folder 145 | 146 | if not os.path.exists(maxFile): 147 | for i, user in enumerate(getUF()): 148 | for key, vs in user.items(): 149 | maxLen[key] = max(maxLen[key], len(list(vs))) 150 | 151 | with open(maxFile, 'w') as f: 152 | for k in sorted(maxLen.keys()): 153 | f.write('%s %s\n' % (k, maxLen[k])) 154 | #print( 'Saved to %s ...' % saveName) 155 | 156 | for line in open(maxFile): 157 | k, v = line.strip().split() 158 | maxLen[k] = int(v) 159 | 160 | return maxLen 161 | 162 | def userFeature(): 163 | default = { k: 0 for k in userMaxLen() } 164 | keys = sorted(default.keys()) 165 | keys.remove('uid'); keys = ['uid'] + keys 166 | 167 | saveName = './datas/userFeature.csv' 168 | if os.path.exists(saveName): return 169 | 170 | with open(saveName, 'w') as sf: 171 | dw = csv.DictWriter(sf, fieldnames=keys) 172 | dw.writeheader() 173 | 174 | for i, user in enumerate(getUF()): 175 | data = default.copy() 176 | for key, vs in user.items(): 177 | data[key] = ' '.join(list(map(str,vs))) 178 | dw.writerow(data) 179 | 180 | if (i+1) % 100000 == 0: 181 | print( "%8d..." % (i+1)) 182 | print( 'Saved to %s ...' 
% saveName) 183 | 184 | ############################################ 185 | 186 | def getFFM(): 187 | print( 'start getFFM()') 188 | folder = mkdir('ffm') 189 | datas = [ 190 | ['trainMerged', 'trainFFM'], 191 | ['test1Merged', 'testFFM'], 192 | ] 193 | print( 'start get FFM data in getFFM()') 194 | result, flag = [], True 195 | 196 | for _, save in datas: 197 | save = '%s/%s.txt' % (folder, save) 198 | result.append(save) 199 | if not os.path.exists(save): 200 | flag = False 201 | if flag: 202 | return result 203 | 204 | count, _ = setCount()#return the dict-->{feat:{onehot : index}} and the number of dim 205 | fields = sorted(count.keys()) 206 | UA_idx, User_idx, Ad_idx = open(User_Ad_idx, 'w'), '', '' 207 | for k, v in enumerate(fields): 208 | if v in User: 209 | User_idx += ' '+str(k) 210 | elif v in Ad: 211 | Ad_idx += ' '+str(k) 212 | if not UA_idx.closed: 213 | UA_idx.write('%s\n' % User_idx) 214 | UA_idx.write('%s\n' % Ad_idx) 215 | UA_idx.close() 216 | 217 | result = [] 218 | 219 | for name, save in datas: 220 | print( 'start gen data ', save) 221 | save = '%s/%s.txt' % (folder, save) 222 | result.append(save) 223 | if os.path.exists(save): continue 224 | 225 | txt = open(save, 'w') 226 | name = './datas/%s.csv' % name 227 | 228 | for i, line in enumerate(open(name)):#for the merged df 229 | vvs = line.strip().split(',') 230 | if i == 0: fNs = vvs; continue#fns feature name 231 | 232 | s = '' 233 | for k, vs in enumerate(vvs):#vvs value of this line 234 | if fNs[k] == 'label': 235 | s = ('%d'%(vs=='1')) + s#replace the -1 label to 0 236 | continue 237 | 238 | ii = fields.index(fNs[k])#index of the feature 239 | for v in vs.split(' '):#all value of one feature 240 | try: 241 | float(v) 242 | except: 243 | continue 244 | v = str(int(float(v))) 245 | c = count[fNs[k]][v]#index of the value in his feature 246 | # try: 247 | # float(v) 248 | # except: 249 | # continue 250 | # if float(v) and c: 251 | if int(float(v)) and c: 252 | s += ' %d:%d:1' % (ii, c)#特征index 总体取值index 1131,0直空值过滤了(这不一定合理 253 | txt.write('%s\n' % s) 254 | txt.close() 255 | 256 | 257 | print( 'Saved to %s ...' % save) 258 | return result 259 | 260 | def splitFFM(): 261 | t1, save3 = getFFM()#get conut path(train ,test ) of file which have fomat (index of feature : index fo all dim : 1) 262 | 263 | folder = mkdir('ffm') 264 | exists = os.path.exists 265 | save1 = '%s/t1.txt' % folder 266 | save2 = '%s/t2.txt' % folder 267 | if exists(save1) and exists(save2): 268 | return save1, save2, save3 269 | 270 | random.seed(2018) 271 | lines = open(t1).readlines()#read the tranFFM 272 | #random.shuffle(lines) 273 | #t1n, t2n = int(7e+6), int(1e+6) 274 | t1n, t2n = int(0.7 * len(lines)), int(0.3 * len(lines)) 275 | print( 'start gen train valid data') 276 | with open(save1, 'w') as txt: 277 | txt.writelines(lines[:t1n]) 278 | print( 'Saved to %s ...' % save1) 279 | with open(save2, 'w') as txt: 280 | txt.writelines(lines[-t2n:]) 281 | print( 'Saved to %s ...' 
% save2) 282 | 283 | return save1, save2, save3 284 | 285 | def getCount(): 286 | '''count values for every column, 287 | when dtype==O, expend it''' 288 | folder = mkdir('msg') 289 | saveName = '%s/counts.json' % folder 290 | print( 'start take get counts.json') 291 | if os.path.exists(saveName): 292 | print( 'getCount already done!!') 293 | return json.load(open(saveName)) 294 | 295 | datas = OrderedDict() 296 | print( 'start get merged data') 297 | df = getMerged(kind=0)#.drop('label', 1) 298 | del df['label'] 299 | gc.collect() 300 | for fn in df: 301 | gc.collect() 302 | print( 'start count', fn) 303 | if df[fn].dtype == 'object': 304 | # df_cnt = pd.DataFrame() 305 | # r = ' '.join( df[fn].dropna().tolist() ) 306 | # df_cnt[fn] = r.split(' ') 307 | # datas[fn] = df_cnt[fn].value_counts().to_dict() 308 | df[fn] = df[fn].astype(str) 309 | d, ds = [], df[fn].str.split(' ', expand=False) 310 | for s in ds: d.extend(list(map(int, s))) 311 | datas[fn] = Counter(d) 312 | 313 | else: 314 | #datas[fn] = df[fn].dropna().astype('int').value_counts().to_dict() 315 | datas[fn] = Counter(df[fn]) 316 | del df[fn] 317 | 318 | json.dump(datas, open(saveName, 'w')) 319 | print( 'Saved to %s ...' % saveName) 320 | 321 | return datas 322 | 323 | def setCount(): 324 | print( 'start setCount') 325 | count = getCount()#每个特征一个field,a dict of dict [('lbs',{value:count})] 326 | fields = sorted(count.keys())#list of feature['lbs',''...] 327 | 328 | offset = 0 329 | for k in fields: 330 | print( 'get %s fields in var range' % k) 331 | t = 100 if k=='uid' else 100#threshold of every feature 332 | 333 | temp = OrderedDict(#one col feature's dic--> value : count 334 | sorted( 335 | count[k].items(),#return the tuple of dict 336 | reverse=1, key=lambda i: i[1],#sort by i[1] 337 | ) 338 | ) 339 | count[k] = defaultdict( int, {#dict --> {feat:{value : index}} 340 | str(u): i+1+offset 341 | for i, u in enumerate(temp) 342 | if temp[u]>=t and int(float(u))!=0 343 | })#低于100的过滤掉了,给key对应自己index!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!all mass for this op 344 | offset += len(count[k]) 345 | print('setcount ok') 346 | return count, offset#return the dict and the number of dim 347 | 348 | def getFFMDim(): 349 | print( 'start getFFMdim()') 350 | t1, t2, t3 = splitFFM()#get the train ,validate , test FFMfile(::) 351 | print( 'end splitFFM()') 352 | UA_idx = open(User_Ad_idx,'r') 353 | User, Ad = UA_idx.readline().strip().split(' '), UA_idx.readline().strip().split(' ') 354 | UA_idx.close() 355 | 356 | folder = mkdir('msg') 357 | saveName = '%s/ffmDim.csv' % folder 358 | if os.path.exists(saveName): 359 | df = pd.read_csv(saveName) 360 | return OrderedDict( 361 | zip(df.columns, df.values[0])), User, Ad 362 | 363 | datas = defaultdict(int) 364 | for t in [t1, t2, t3]: 365 | print( 'start countFFMDim in [%s]' % t) 366 | for line in open(t): 367 | temp = defaultdict(int) 368 | 369 | s = line.strip().split(' ') 370 | fs = map(int, map( 371 | lambda x: x.split(':')[0], s[1:])) 372 | 373 | for f in fs: temp['field#%03d'%f] += 1 374 | for key in temp:#number of value of each field 375 | datas[key] = max(temp[key], datas[key])#所有数据中每一行中出现该field次数最大数,也就是每个特征每行取值最大个数 376 | for key in datas: 377 | datas[key] = [datas[key]] 378 | 379 | pd.DataFrame.from_dict(datas).to_csv( 380 | saveName, index=False, columns=sorted(datas.keys())) 381 | print( 'Saved to %s ...' 
% saveName) 382 | 383 | return OrderedDict([ 384 | (k, datas[k][0]) for k in sorted(datas.keys())]), User, Ad#return the dimlist in witch include the max val_num of the col 385 | 386 | def getSparse(): 387 | offsets = [0]#every field's begin index 388 | ffmDim, _, _ = getFFMDim() 389 | for k in sorted(ffmDim.keys()): 390 | offsets.append(offsets[-1]+ffmDim[k]) 391 | 392 | print( 'start splitFFM') 393 | t1, t2, t3 = splitFFM()#get the train ,validate , test FFMfile(::) 394 | print( 'start genFFM data') 395 | folder = mkdir('ffm') 396 | save1 = '%s/t1X.npz' % folder 397 | save2 = '%s/t2X.npz' % folder 398 | save3 = '%s/testX.npz' % folder 399 | 400 | ydfs = [ 401 | [t1, save1], [t2, save2], [t3, save3], 402 | ] 403 | 404 | flag = True 405 | for t, save in ydfs: 406 | if not os.path.exists(save): 407 | flag = False 408 | if flag: return save1, save2, save3 409 | print( 'end genFFM data') 410 | #data, indices, indptr = [], [], [0]#data, indices, indptr ::index of 1131 of all of each line,每个非0元素行内相对位置,这行多少个元素(起始index 411 | print( 'start gen sparse matrix data') 412 | for k, (t, save) in enumerate(ydfs): 413 | print( 'processing[%d], libsvm src[%s], npz dest[%s]' % (k, t, save)) 414 | ydfs[k].append(0) 415 | data, indices, indptr = [], [], [0] # data, indices, indptr ::index of 1131 of all of each line,每个非0元素行内相对位置,这行多少个元素(起始index 416 | for line_number, line in enumerate(open(t)): 417 | ydfs[k][-1] += 1#line num of this file 418 | 419 | s = line.strip().split(' ') # label field:key:value 420 | # value is always 1 421 | i = list(map(int, s[:1] + list(map( 422 | lambda x: x.split(':')[1], s[1:])))) #get index 423 | fs = list(map(int, list(map( 424 | lambda x: x.split(':')[0], s[1:])))) #get field 425 | 426 | data.extend(i) #VALUE IN EVERY ROW 427 | indices.append(0) 428 | ios = [0] * (len(offsets)) # offset存每个field的特征数目 429 | #print 'start process line[%s]' % line_number 430 | #print 'len(ios) is:', len(ios) 431 | #print ios[29] 432 | #print 'max(fs) is:', np.max(fs) 433 | #print 'fs is:', fs 434 | #fs = list(set(fs)) 435 | for f in fs: 436 | ios[f] += 1#value num of this line 437 | #print('error:',len(offsets),len(ios),f) 438 | indices.append(offsets[f] + ios[f]) #每个非0元素行内相对位置 439 | #print 'ios is:', ios 440 | #print 'indices is:', indices 441 | #data.extend(i) 442 | indptr.append(indptr[-1] + len(i)) #这一行到哪个元素 443 | rows, cols = len(indptr) - 1, np.max(indices) + 1 444 | csr = csr_matrix((data, indices, indptr), shape=(rows, cols)) # 445 | t, save, l = ydfs[k] 446 | sparse.save_npz(save, csr[:-1], compressed=True) 447 | # rows, cols = len(indptr)-1, np.max(indices)+1 448 | # ''' 449 | # csr_matrix((data, indices, indptr), [shape=(M, N)]) 450 | # is the standard CSR representation where the column indices for row i are stored in indices[indptr[i]:indptr[i+1]] and their corresponding values are stored in data[indptr[i]:indptr[i+1]]. If the shape parameter is not supplied, the matrix dimensions are inferred from the index arrays. 
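# A tiny worked example of the CSR triple (added for clarity, not part of the original code):
#   data    = [1, 2, 3]
#   indices = [0, 2, 1]
#   indptr  = [0, 2, 3]
#   csr_matrix((data, indices, indptr), shape=(2, 3)).toarray()
#   -> [[1, 0, 2],
#       [0, 3, 0]]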
451 | # indptr表示某一行在data中的range,indices表示每条数据在第几列 452 | # ''' 453 | # csr = csr_matrix( 454 | # (data, indices, indptr), shape=(rows, cols))#data, indices, indptr ::index of 1131 of all of each line,每个非0元素行内相对位置,这行多少个元素(起始index 455 | # 456 | # start, end = 0, 0 457 | # for t, save, l in ydfs: 458 | # start, end = end, end + l 459 | # sparse.save_npz( 460 | # save, csr[start:end], compressed=True) 461 | print( 'end gen sparse matrix data') 462 | return save1, save2, save3 463 | 464 | def loadSparse(kind): 465 | return sparse.load_npz(getSparse()[kind]) 466 | 467 | ############################################ 468 | from sklearn.preprocessing import LabelEncoder 469 | if __name__ == '__main__': 470 | #userFeature 471 | 472 | mytype = {'uid': np.uint16, 'LBS': np.uint16, 'adCategoryId': np.uint16, 'advertiserId': np.uint16, 473 | 'age': np.uint16, 'aid': np.uint16, 'campaignId': np.uint16, 474 | 'carrier': np.uint16, 'consumptionAbility': np.uint16, 'creativeId': np.uint16, 'creativeSize': np.uint16, 475 | 'ct': object, 'education': np.uint16, 'gender': np.uint16, 'house': np.uint16, 'interest1': object, 476 | 'interest2': object, 'interest3': object, 'interest4': object, 'interest5': object, 'kw1': object, 477 | 'kw2': object, 'kw3': object, 'marriageStatus': object, 'os': object, 'productId': np.uint16, 478 | 'productType': np.uint16, 'topic1': object, 'topic2': object, 'topic3': object, 'label': np.int8, 479 | 'appIdAction': object, 'appIdInstall': object} 480 | # df = pd.read_csv('./datas/8g/8g.csv', header=0, iterator=True,dtype=mytype, usecols=mytype.keys()) 481 | # data = df.get_chunk(80000) 482 | # data.to_csv('./datas/trainMerged.csv',index=False) 483 | # data = df.get_chunk(20000) 484 | # data.to_csv('./datas/test1Merged.csv', index=False) 485 | # print('data ok') 486 | # del data 487 | # gc.collect() 488 | print(getFFM()) 489 | print(setCount()[1]) 490 | loadSparse(0) 491 | #os.system('python3 nn.py 1 0 test 0') 492 | -------------------------------------------------------------------------------- /feature.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | from sklearn.feature_extraction.text import CountVectorizer 3 | from sklearn.preprocessing import OneHotEncoder,LabelEncoder 4 | from scipy import sparse 5 | import os 6 | import gc 7 | import random 8 | import numpy as np 9 | import scipy.special as special 10 | from tqdm import tqdm 11 | from collections import * 12 | 13 | cross_feat = ['LBS_marriageStatus','adCategoryId_LBS','adCategoryId_age','adCategoryId_ct','adCategoryId_productId','advertiserId_LBS','campaignId_productId','campaignId_adCategoryId','campaignId_age','campaignId_ct','campaignId_marriageStatus','consumptionAbility_ct','education_ct','gender_os','productId_LBS','productType_marriageStatus','productType_ct'] 14 | 15 | stas_feat = ['adCategoryId_age','adCategoryId_marriageStatus','age_carrier','age_consumptionAbility','age_gender','age_house','age_os','campaignId_LBS','campaignId_marriageStatus','consumptionAbility_gender','creativeSize_age','creativeSize_marriageStatus','education_gender','productId_marriageStatus','productType_age','productType_consumptionAbility'] 16 | #多值特征 17 | def feat_mulval(data): 18 | for i in data.columns: 19 | i_types = data[i].dtypes 20 | if i_types != 'object': 21 | print('single', i) 22 | # train_and_test_and_adFeature[i] = train_and_test_and_adFeature[i].astype('category', copy=False) 23 | else: 24 | if str(i).startswith('marriageStatus'): 25 | tmp = data[i].apply(lambda 
x: len(str(x).split(' '))) 26 | i_length = min(5, tmp.max()) 27 | for sub_i in range(i_length): 28 | data[i + '_%d' % (sub_i)] = data[i].apply( 29 | lambda x: str(x).split(' ')[sub_i] if sub_i < len(str(x).split(' ')) else -1 30 | ) 31 | data[i + '_length'] = data[i].apply( 32 | lambda x: len(list(set(str(x).split(' ')))) 33 | ) 34 | return data 35 | #过滤低值 36 | def remove_lowcase(se,times): 37 | count = dict(se.value_counts()) 38 | se = se.map(lambda x : -1 if count[x] 2) 310 | data = pd.merge(data, temp, how='left', on=subset) 311 | del temp 312 | gc.collect() 313 | return data 314 | 315 | 316 | #均值特征 317 | def feat_mean(): 318 | res = data[['index']] 319 | 320 | grouped = data.groupby('appID')['age'].mean().reset_index() 321 | grouped.columns = ['appID', 'app_mean_age'] 322 | data = data.merge(grouped, how='left', on='appID') 323 | grouped = data.groupby('positionID')['age'].mean().reset_index() 324 | grouped.columns = ['positionID', 'position_mean_age'] 325 | data = data.merge(grouped, how='left', on='positionID') 326 | grouped = data.groupby('appCategory')['age'].mean().reset_index() 327 | grouped.columns = ['appCategory', 'appCategory_mean_age'] 328 | data = data.merge(grouped, how='left', on='appCategory') 329 | 330 | X_train = data.loc[data['label'] != -1, :] 331 | X_test = data.loc[data['label'] == -1, :] 332 | X_test.loc[:, 'instanceID'] = res.values 333 | del data, grouped 334 | gc.collect() 335 | return X_train, X_test 336 | 337 | #排序特征 338 | def get_rank_feat(data, col, name): 339 | data[name] = data.groupby(col).cumcount() + 1 340 | return data 341 | 342 | # 343 | def gen_pos_neg_aid_fea(): 344 | train_data = pd.read_csv('input/train.csv') 345 | test2_data = pd.read_csv('input/test2.csv') 346 | 347 | train_user = train_data.uid.unique() 348 | 349 | # user-aid dict 350 | uid_dict = defaultdict(list) 351 | for row in tqdm(train_data.itertuples(), total=len(train_data)): 352 | uid_dict[row[2]].append([row[1], row[3]]) 353 | 354 | # user convert 355 | uid_convert = {} 356 | for uid in tqdm(train_user): 357 | pos_aid, neg_aid = [], [] 358 | for data in uid_dict[uid]: 359 | if data[1] > 0: 360 | pos_aid.append(data[0]) 361 | else: 362 | neg_aid.append(data[0]) 363 | uid_convert[uid] = [pos_aid, neg_aid] 364 | 365 | test2_neg_pos_aid = {} 366 | for row in tqdm(test2_data.itertuples(), total=len(test2_data)): 367 | aid = row[1] 368 | uid = row[2] 369 | if uid_convert.get(uid, []) == []: 370 | test2_neg_pos_aid[row[0]] = ['', '', -1] 371 | else: 372 | pos_aid, neg_aid = uid_convert[uid][0].copy(), uid_convert[uid][1].copy() 373 | convert = len(pos_aid) / (len(pos_aid) + len(neg_aid)) if (len(pos_aid) + len(neg_aid)) > 0 else -1 374 | test2_neg_pos_aid[row[0]] = [' '.join(map(str, pos_aid)), ' '.join(map(str, neg_aid)), convert] 375 | df_test2 = pd.DataFrame.from_dict(data=test2_neg_pos_aid, orient='index') 376 | df_test2.columns = ['pos_aid', 'neg_aid', 'uid_convert'] 377 | 378 | train_neg_pos_aid = {} 379 | for row in tqdm(train_data.itertuples(), total=len(train_data)): 380 | aid = row[1] 381 | uid = row[2] 382 | pos_aid, neg_aid = uid_convert[uid][0].copy(), uid_convert[uid][1].copy() 383 | if aid in pos_aid: 384 | pos_aid.remove(aid) 385 | if aid in neg_aid: 386 | neg_aid.remove(aid) 387 | convert = len(pos_aid) / (len(pos_aid) + len(neg_aid)) if (len(pos_aid) + len(neg_aid)) > 0 else -1 388 | train_neg_pos_aid[row[0]] = [' '.join(map(str, pos_aid)), ' '.join(map(str, neg_aid)), convert] 389 | 390 | df_train = pd.DataFrame.from_dict(data=train_neg_pos_aid, orient='index') 391 | 
df_train.columns = ['pos_aid', 'neg_aid', 'uid_convert'] 392 | 393 | df_train.to_csv("dataset/train_neg_pos_aid.csv", index=False) 394 | df_test2.to_csv("dataset/test2_neg_pos_aid.csv", index=False) 395 | 396 | #zhada**平滑转化率 397 | # def count_feature(train_data, test1_data, test2_data, labels, k, test_only= False): 398 | # nums = len(train_data) 399 | # interval = nums // k 400 | # split_points = [] 401 | # for i in range(k): 402 | # split_points.append(i * interval) 403 | # split_points.append(nums) 404 | # 405 | # s = set() 406 | # for d in train_data: 407 | # xs = d.split(' ') 408 | # for x in xs: 409 | # s.add(x) 410 | # b = nums // len(s) 411 | # a = b*1.0 / 20 412 | # 413 | # train_res = [] 414 | # if not test_only: 415 | # for i in range(k): 416 | # tmp = [] 417 | # total_dict, pos_dict = gen_count_dict(train_data, labels, split_points[i],split_points[i+1]) 418 | # for j in range(split_points[i],split_points[i+1]): 419 | # xs = train_data[j].split(' ') 420 | # t = [] 421 | # for x in xs: 422 | # if not total_dict.has_key(x): 423 | # t.append(0.05) 424 | # continue 425 | # t.append((a + pos_dict[x]) / (b + total_dict[x])) 426 | # tmp.append(max(t)) 427 | # train_res.extend(tmp) 428 | # 429 | # total_dict, pos_dict = gen_count_dict(train_data, labels, 1, 0) 430 | # test1_res = [] 431 | # for d in test1_data: 432 | # xs = d.split(' ') 433 | # t = [] 434 | # for x in xs: 435 | # if not total_dict.has_key(x): 436 | # t.append(0.05) 437 | # continue 438 | # t.append((a + pos_dict[x]) / (b + total_dict[x])) 439 | # test1_res.append(max(t)) 440 | # 441 | # test2_res = [] 442 | # for d in test2_data: 443 | # xs = d.split(' ') 444 | # t = [] 445 | # for x in xs: 446 | # if not total_dict.has_key(x): 447 | # t.append(0.05) 448 | # continue 449 | # t.append((a + pos_dict[x]) / (b + total_dict[x])) 450 | # test2_res.append(max(t)) 451 | # 452 | # return train_res, test1_res, test2_res 453 | 454 | #dnn feature################################################################################################################################ 455 | def mutil_ids(train_df, dev_df, test_df, word2index): 456 | features_mutil = ['interest1', 'interest2', 'interest3', 'interest4', 'interest5', 'kw1', 'kw2', 'kw3', 'topic1', 457 | 'topic2', 'topic3', 'appIdAction', 'appIdInstall', 'marriageStatus', 'ct', 'os'] 458 | for s in features_mutil: 459 | cont = {} 460 | with open('ffm_data/train/' + str(s), 'w') as f: 461 | for lines in list(train_df[s].values): 462 | f.write(str(lines) + '\n') 463 | for line in lines.split(): 464 | if str(line) not in cont: 465 | cont[str(line)] = 0 466 | cont[str(line)] += 1 467 | 468 | with open('ffm_data/dev/' + str(s), 'w') as f: 469 | for line in list(dev_df[s].values): 470 | f.write(str(line) + '\n') 471 | 472 | with open('ffm_data/test/' + str(s), 'w') as f: 473 | for line in list(test_df[s].values): 474 | f.write(str(line) + '\n') 475 | index = [] 476 | for k in cont: 477 | if cont[k] >= threshold: 478 | index.append(k) 479 | word2index[s] = {} 480 | for idx, val in enumerate(index): 481 | word2index[s][val] = idx + 2 482 | print(s + ' done!') 483 | 484 | 485 | def len_features(train_df, dev_df, test_df, word2index): 486 | len_features = ['interest1', 'interest2', 'interest3', 'interest4', 'interest5', 'kw1', 'kw2', 'kw3', 'topic1', 487 | 'topic2', 'topic3'] 488 | for s in len_features: 489 | dev_df[s + '_len'] = dev_df[s].apply(lambda x: len(x.split()) if x != '-1' else 0) 490 | test_df[s + '_len'] = test_df[s].apply(lambda x: len(x.split()) if x != '-1' else 0) 491 | 
train_df[s + '_len'] = train_df[s].apply(lambda x: len(x.split()) if x != '-1' else 0) 492 | s = s + '_len' 493 | cont = {} 494 | with open('ffm_data/train/' + str(s), 'w') as f: 495 | for line in list(train_df[s].values): 496 | f.write(str(line) + '\n') 497 | if str(line) not in cont: 498 | cont[str(line)] = 0 499 | cont[str(line)] += 1 500 | 501 | with open('ffm_data/dev/' + str(s), 'w') as f: 502 | for line in list(dev_df[s].values): 503 | f.write(str(line) + '\n') 504 | 505 | with open('ffm_data/test/' + str(s), 'w') as f: 506 | for line in list(test_df[s].values): 507 | f.write(str(line) + '\n') 508 | index = [] 509 | for k in cont: 510 | if cont[k] >= threshold: 511 | index.append(k) 512 | word2index[s] = {} 513 | for idx, val in enumerate(index): 514 | word2index[s][val] = idx + 2 515 | del train_df[s] 516 | del dev_df[s] 517 | del test_df[s] 518 | gc.collect() 519 | print(s + ' done!') 520 | 521 | 522 | def count_features(train_df, dev_df, test_df, word2index): 523 | count_feature = ['uid'] 524 | data = train_df.append(dev_df) 525 | data = data.append(test_df) 526 | for s in count_feature: 527 | g = dict(data.groupby(s).size()) 528 | s_ = s 529 | s = s + '_count' 530 | cont = {} 531 | 532 | with open('ffm_data/train/' + str(s), 'w') as f: 533 | for line in list(train_df[s_].values): 534 | line = g[line] 535 | if str(line) not in cont: 536 | cont[str(line)] = 0 537 | cont[str(line)] += 1 538 | f.write(str(line) + '\n') 539 | 540 | with open('ffm_data/dev/' + str(s), 'w') as f: 541 | for line in list(dev_df[s_].values): 542 | line = g[line] 543 | f.write(str(line) + '\n') 544 | 545 | with open('ffm_data/test/' + str(s), 'w') as f: 546 | for line in list(test_df[s_].values): 547 | line = g[line] 548 | f.write(str(line) + '\n') 549 | index = [] 550 | for k in cont: 551 | if cont[k] >= threshold: 552 | index.append(k) 553 | word2index[s] = {} 554 | for idx, val in enumerate(index): 555 | word2index[s][val] = idx + 2 556 | print(s + ' done!') 557 | 558 | 559 | def kfold_features(train_df, dev_df, test_df, word2index): 560 | features_mutil = ['interest1', 'interest2', 'interest3', 'interest4', 'interest5', 'kw1', 'kw2', 'kw3', 'topic1', 561 | 'topic2', 'topic3', 'appIdAction', 'appIdInstall', 'marriageStatus', 'ct', 'os'] 562 | for f in features_mutil: 563 | del train_df[f] 564 | del dev_df[f] 565 | del test_df[f] 566 | gc.collect() 567 | count_feature = ['uid'] 568 | feature = ['advertiserId', 'campaignId', 'creativeId', 'creativeSize', 'adCategoryId', 'productId', 'productType'] 569 | for f in feature: 570 | train_df[f + '_uid'] = train_df[f] + train_df['uid'] * 10000000 571 | dev_df[f + '_uid'] = dev_df[f] + dev_df['uid'] * 10000000 572 | test_df[f + '_uid'] = test_df[f] + test_df['uid'] * 10000000 573 | count_feature.append(f + '_uid') 574 | 575 | for s in count_feature: 576 | temp = s 577 | kfold_static(train_df, dev_df, test_df, s) 578 | s = temp + '_positive_num' 579 | cont = {} 580 | with open('ffm_data/train/' + str(s), 'w') as f: 581 | for line in list(train_df[s].values): 582 | if str(line) not in cont: 583 | cont[str(line)] = 0 584 | cont[str(line)] += 1 585 | f.write(str(line) + '\n') 586 | 587 | with open('ffm_data/dev/' + str(s), 'w') as f: 588 | for line in list(dev_df[s].values): 589 | f.write(str(line) + '\n') 590 | 591 | with open('ffm_data/test/' + str(s), 'w') as f: 592 | for line in list(test_df[s].values): 593 | f.write(str(line) + '\n') 594 | index = [] 595 | for k in cont: 596 | if cont[k] >= threshold: 597 | index.append(k) 598 | word2index[s] = {} 599 | for idx, 
val in enumerate(index): 600 | word2index[s][val] = idx + 2 601 | pkl.dump(word2index, open('ffm_data/dic.pkl', 'wb')) 602 | print(s + ' done!') 603 | del train_df[s] 604 | del dev_df[s] 605 | del test_df[s] 606 | 607 | s = temp + '_negative_num' 608 | cont = {} 609 | with open('ffm_data/train/' + str(s), 'w') as f: 610 | for line in list(train_df[s].values): 611 | if str(line) not in cont: 612 | cont[str(line)] = 0 613 | cont[str(line)] += 1 614 | f.write(str(line) + '\n') 615 | 616 | with open('ffm_data/dev/' + str(s), 'w') as f: 617 | for line in list(dev_df[s].values): 618 | f.write(str(line) + '\n') 619 | 620 | with open('ffm_data/test/' + str(s), 'w') as f: 621 | for line in list(test_df[s].values): 622 | f.write(str(line) + '\n') 623 | index = [] 624 | for k in cont: 625 | if cont[k] >= threshold: 626 | index.append(k) 627 | word2index[s] = {} 628 | for idx, val in enumerate(index): 629 | word2index[s][val] = idx + 2 630 | pkl.dump(word2index, open('ffm_data/dic.pkl', 'wb')) 631 | del train_df[s] 632 | del dev_df[s] 633 | del test_df[s] 634 | gc.collect() 635 | print(s + ' done!') 636 | for f in feature: 637 | del train_df[f + '_uid'] 638 | del test_df[f + '_uid'] 639 | del dev_df[f + '_uid'] 640 | gc.collect() 641 | 642 | 643 | def kfold_static(train_df, dev_df, test_df, f): 644 | print("K-fold static:", f) 645 | # K-fold positive and negative num 646 | index = set(range(train_df.shape[0])) 647 | K_fold = [] 648 | for i in range(5): 649 | if i == 4: 650 | tmp = index 651 | else: 652 | tmp = random.sample(index, int(0.2 * train_df.shape[0])) 653 | index = index - set(tmp) 654 | print("Number:", len(tmp)) 655 | K_fold.append(tmp) 656 | positive = [-1 for i in range(train_df.shape[0])] 657 | negative = [-1 for i in range(train_df.shape[0])] 658 | for i in range(5): 659 | print('fold', i) 660 | pivot_index = K_fold[i] 661 | sample_idnex = [] 662 | for j in range(5): 663 | if j != i: 664 | sample_idnex += K_fold[j] 665 | dic = {} 666 | for item in train_df.iloc[sample_idnex][[f, 'label']].values: 667 | if item[0] not in dic: 668 | dic[item[0]] = [0, 0] 669 | dic[item[0]][item[1]] += 1 670 | uid = train_df[f].values 671 | for k in pivot_index: 672 | if uid[k] in dic: 673 | positive[k] = dic[uid[k]][1] 674 | negative[k] = dic[uid[k]][0] 675 | train_df[f + '_positive_num'] = positive 676 | train_df[f + '_negative_num'] = negative 677 | 678 | # for dev and test 679 | dic = {} 680 | for item in train_df[[f, 'label']].values: 681 | if item[0] not in dic: 682 | dic[item[0]] = [0, 0] 683 | dic[item[0]][item[1]] += 1 684 | positive = [] 685 | negative = [] 686 | for uid in dev_df[f].values: 687 | if uid in dic: 688 | positive.append(dic[uid][1]) 689 | negative.append(dic[uid][0]) 690 | else: 691 | positive.append(-1) 692 | negative.append(-1) 693 | dev_df[f + '_positive_num'] = positive 694 | dev_df[f + '_negative_num'] = negative 695 | print('dev', 'done') 696 | 697 | positive = [] 698 | negative = [] 699 | for uid in test_df[f].values: 700 | if uid in dic: 701 | positive.append(dic[uid][1]) 702 | negative.append(dic[uid][0]) 703 | else: 704 | positive.append(-1) 705 | negative.append(-1) 706 | test_df[f + '_positive_num'] = positive 707 | test_df[f + '_negative_num'] = negative 708 | print('test', 'done') 709 | print('avg of positive num', np.mean(train_df[f + '_positive_num']), np.mean(dev_df[f + '_positive_num']), 710 | np.mean(test_df[f + '_positive_num'])) 711 | print('avg of negative num', np.mean(train_df[f + '_negative_num']), np.mean(dev_df[f + '_negative_num']), 712 | 
np.mean(test_df[f + '_negative_num'])) 713 | 714 | 715 | 716 | def uid_seq_feature(train_data, test1_data, test2_data, label): 717 | count_dict = {}#存 该id: 追加这次出现的label 718 | seq_dict = {}#存序列字典 :该种序列出现次数 719 | seq_emb_dict = {}#存该序列的key 720 | train_seq = []#存每个学列的index 721 | ind = 0 722 | for i, d in enumerate(train_data): 723 | if not count_dict.__contains__(d): 724 | count_dict[d] = [] 725 | seq_key = ' '.join(count_dict[d][-4:]) 726 | if not seq_dict.__contains__(seq_key): 727 | seq_dict[seq_key] = 0 728 | seq_emb_dict[seq_key] = ind 729 | ind += 1 730 | seq_dict[seq_key] += 1 731 | train_seq.append(seq_emb_dict[seq_key]) 732 | count_dict[d].append(label[i]) 733 | test1_seq = [] 734 | for d in test1_data: 735 | if not count_dict.__contains__(d): 736 | seq_key = '' 737 | else: 738 | seq_key = ' '.join(count_dict[d][-4:]) 739 | if seq_emb_dict.__contains__(seq_key): 740 | key = seq_emb_dict[seq_key] 741 | else: 742 | key = 0 743 | test1_seq.append(key) 744 | test2_seq = [] 745 | for d in test2_data: 746 | if not count_dict.__contains__(d): 747 | seq_key = '' 748 | else: 749 | seq_key = ' '.join(count_dict[d][-4:]) 750 | if seq_emb_dict.__contains__(seq_key): 751 | key = seq_emb_dict[seq_key] 752 | else: 753 | key = 0 754 | test2_seq.append(key) 755 | 756 | def gen_uid_aid_fea(): 757 | ''' 758 | 载入数据, 提取aid, uid的全局统计特征 759 | ''' 760 | train_data = pd.read_csv('input/train.csv') 761 | test1_data = pd.read_csv('input/test1.csv') 762 | test2_data = pd.read_csv('input/test2.csv') 763 | 764 | ad_Feature = pd.read_csv('input/adFeature.csv') 765 | 766 | train_len = len(train_data) # 45539700 767 | test1_len = len(test1_data) 768 | test2_len = len(test2_data) # 11727304 769 | 770 | ad_Feature = pd.merge(ad_Feature, ad_Feature.groupby(['campaignId']).aid.nunique().reset_index( 771 | ).rename(columns={'aid': 'campaignId_aid_nunique'}), how='left', on='campaignId') 772 | 773 | df = pd.concat([train_data, test1_data, test2_data], axis=0) 774 | df = pd.merge(df, df.groupby(['uid'])['aid'].nunique().reset_index().rename( 775 | columns={'aid': 'uid_aid_nunique'}), how='left', on='uid') 776 | 777 | df = pd.merge(df, df.groupby(['aid'])['uid'].nunique().reset_index().rename( 778 | columns={'uid': 'aid_uid_nunique'}), how='left', on='aid') 779 | 780 | df['uid_count'] = df.groupby('uid')['aid'].transform('count') 781 | df = pd.merge(df, ad_Feature[['aid', 'campaignId_aid_nunique']], how='left', on='aid') 782 | 783 | fea_columns = ['campaignId_aid_nunique', 'uid_aid_nunique', 'aid_uid_nunique', 'uid_count', ] 784 | 785 | df[fea_columns].iloc[:train_len].to_csv('dataset/train_uid_aid.csv', index=False) 786 | df[fea_columns].iloc[train_len: train_len+test1_len].to_csv('dataset/test1_uid_aid.csv', index=False) 787 | df[fea_columns].iloc[-test2_len:].to_csv('dataset/test2_uid_aid.csv', index=False) 788 | 789 | 790 | def digitize(): 791 | uid_aid_train = pd.read_csv('dataset/train_uid_aid.csv') 792 | uid_aid_test1 = pd.read_csv('dataset/test1_uid_aid.csv') 793 | uid_aid_test2 = pd.read_csv('dataset/test2_uid_aid.csv') 794 | uid_aid_df = pd.concat([uid_aid_train, uid_aid_test1, uid_aid_test2], axis=0) 795 | for col in range(3): 796 | bins = [] 797 | for percent in [0, 20, 35, 50, 65, 85, 100]: 798 | bins.append(np.percentile(uid_aid_df.iloc[:, col], percent)) 799 | uid_aid_train.iloc[:, col] = np.digitize(uid_aid_train.iloc[:, col], bins, right=True) 800 | uid_aid_test1.iloc[:, col] = np.digitize(uid_aid_test1.iloc[:, col], bins, right=True) 801 | uid_aid_test2.iloc[:, col] = np.digitize(uid_aid_test2.iloc[:, 
col], bins, right=True) 802 | 803 | count_bins = [1, 2, 4, 6, 8, 10, 16, 27, 50] 804 | uid_aid_train.iloc[:, 3] = np.digitize(uid_aid_train.iloc[:, 3], count_bins, right=True) 805 | uid_aid_test1.iloc[:, 3] = np.digitize(uid_aid_test1.iloc[:, 3], count_bins, right=True) 806 | uid_aid_test2.iloc[:, 3] = np.digitize(uid_aid_test2.iloc[:, 3], count_bins, right=True) 807 | 808 | uid_convert_train = pd.read_csv("dataset/train_neg_pos_aid.csv", usecols=['uid_convert']) 809 | uid_convert_test2 = pd.read_csv("dataset/test2_neg_pos_aid.csv", usecols=['uid_convert']) 810 | 811 | convert_bins = [-1, 0, 0.1, 0.3, 0.5, 0.7, 1] 812 | uid_convert_train.iloc[:, 0] = np.digitize(uid_convert_train.iloc[:, 0], convert_bins, right=True) 813 | uid_convert_test2.iloc[:, 0] = np.digitize(uid_convert_test2.iloc[:, 0], convert_bins, right=True) 814 | 815 | uid_aid_train = pd.concat([uid_aid_train, uid_convert_train], axis=1) 816 | uid_aid_test2 = pd.concat([uid_aid_test2, uid_convert_test2], axis=1) 817 | 818 | uid_aid_train.to_csv('dataset/train_uid_aid_bin.csv', index=False) 819 | uid_aid_test2.to_csv('dataset/test2_uid_aid_bin.csv', index=False) 820 | 821 | 822 | ################# 823 | def feature_count(full, features=[]): 824 | new_feature = 'new_count' 825 | for i in features: 826 | get_series(i) 827 | new_feature += '_' + i 828 | log(new_feature) 829 | try: 830 | del full[new_feature] 831 | except: 832 | pass 833 | temp = full.groupby(features).size().reset_index().rename(columns={0: new_feature}) 834 | full = full.merge(temp, 'left', on=features) 835 | # save(full, new_feature) 836 | return full 837 | 838 | def Information_entropy(): 839 | for i in ['aid']: 840 | for j in ['age', 'gender', 'education', 'consumptionAbility', 'LBS', 'carrier', 'house', 841 | 'marriageStatus', 'ct', 'os']: 842 | t = time.time() 843 | full = feature_count(full, [i, j]) 844 | full['new_inf_' + i + '_' + j] = np.log1p( 845 | full['new_count_' + j] * full['new_count_' + i] / full['new_count_' + i + '_' + j] / len_full) 846 | 847 | min_v = full['new_inf_' + i + '_' + j].min() 848 | full['new_inf_' + i + '_' + j] = full['new_inf_' + i + '_' + j].apply( 849 | lambda x: int(float('%.1f' % min(x - min_v, 1.5)) * 10)) 850 | print(full['new_inf_' + i + '_' + j].value_counts()) 851 | save(full, 'new_inf_' + i + '_' + j, 'cate', max=15) 852 | 853 | # 多值特征的信息熵 854 | def get_inf(cond, keyword): 855 | get_series(cond) 856 | get_series(keyword) 857 | 858 | # 背景字典 每个ID出现的次数 859 | back_dict = {} 860 | # 不同条件下 每个id 出现的次数condi_dict 861 | condi_dict = {} 862 | # 不同条件下aid 每个id 出现的次数 863 | 864 | # 预先生成字典。省的判断慢 865 | for i in full[cond].unique(): 866 | condi_dict[i] = {} 867 | 868 | for i, row in full[[cond, keyword]].iterrows(): 869 | word_list = row[keyword].split() 870 | for word in word_list: 871 | # 对背景字典加1 872 | try: 873 | back_dict[word] = back_dict[word] + 1 874 | except: 875 | # 没有该词则设为0 876 | back_dict[word] = 1 877 | try: 878 | # 该条件下的该词的出现次数加1 879 | condi_dict[row[cond]][word] = condi_dict[row[cond]][word] + 1 880 | except: 881 | condi_dict[row[cond]][word] = 1 882 | 883 | # 先获取平均熵 884 | max_inf_list = [] 885 | mean_inf_list = [] 886 | condi_count = full.groupby(cond)[cond].count().to_dict() 887 | for i, row in full[[cond, keyword]].iterrows(): 888 | word_list = row[keyword].split() 889 | 890 | count = len(word_list) 891 | prob = 1 892 | prob_list = [] 893 | for word in word_list: 894 | temp_prob = condi_count[row[cond]] * back_dict[word] / condi_dict[row[cond]][word] / len_full 895 | prob = prob * temp_prob 896 | 
prob_list.append(temp_prob) 897 | mean_inf_list.append(np.log1p(prob) / count) 898 | max_inf_list.append( 899 | np.log1p(np.min(prob))) 900 | return max_inf_list, mean_inf_list 901 | 902 | # 计算maxpool 和meanpool 903 | for i in ['aid']: 904 | for j in ['interest1', 'interest2', 'interest3', 'interest4', 'interest5', 'kw1', 'kw2', 'kw3', 'topic1', 905 | 'topic2', 'topic3', 'appIdAction', 'appIdInstall']: 906 | log('new_inf_' + i + '_' + j) 907 | full['new_inf_' + i + '_' + j + '_max'], full['new_inf_' + i + '_' + j + '_mean'] = get_inf(i, j) 908 | 909 | min_v = full['new_inf_' + i + '_' + j + '_max'].min() 910 | full['new_inf_' + i + '_' + j + '_max'] = full['new_inf_' + i + '_' + j + '_max'].apply( 911 | lambda x: int(float('%.1f' % min(x - min_v, 1.5)) * 10)) 912 | save(full, 'new_inf_' + i + '_' + j + '_max', 'cate', 16) 913 | 914 | min_v = full['new_inf_' + i + '_' + j + '_mean'].min() 915 | full['new_inf_' + i + '_' + j + '_mean'] = full['new_inf_' + i + '_' + j + '_mean'].apply( 916 | lambda x: int(float('%.1f' % min(x - min_v, 1.5)) * 10)) 917 | save(full, 'new_inf_' + i + '_' + j + '_mean', 'cate', 16) -------------------------------------------------------------------------------- /ffm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import os 4 | import sys 5 | import csv 6 | import time 7 | import glob 8 | import zipfile 9 | import numpy as np 10 | import xlearn as xl 11 | 12 | from data import * 13 | 14 | d = pd.read_csv('',header=0,iterator=True) 15 | d = d.get_chunk() 16 | def train(): 17 | ffm = xl.create_ffm() 18 | train, test, _ = splitFFM() 19 | 20 | ffm.setTrain(train) 21 | ffm.setValidate(test) 22 | 23 | model = './modelFFM' 24 | sTime = time.strftime( 25 | '%m%d-%H%M', time.localtime(time.time())) 26 | if not os.path.exists(model): os.mkdir(model) 27 | model = '%s/xlModel_%s.txt' % (model, sTime) 28 | 29 | params = { 30 | 'epoch' : 100, 31 | 'metric' : 'auc', 32 | 'task' : 'binary', 33 | 34 | 'k' : 4, 35 | 'lr' : 0.02, 36 | 'lambda' : 1e-6, 37 | 'stop_window' : 3, 38 | } 39 | ffm.fit(params, model) 40 | 41 | def predict(): 42 | ffm = xl.create_ffm() 43 | _, _, test = splitFFM() 44 | 45 | ffm.setTest(test); ffm.setSigmoid() 46 | 47 | folder = './modelFFM' 48 | model = sorted( 49 | glob.glob('./modelFFM/xlModel_*.txt'))[-1] 50 | output = model.replace('Model', 'Output') 51 | ffm.predict(model, output) 52 | 53 | df = getMerged('aid', 'uid', kind=2) 54 | df['score'] = np.loadtxt(output) 55 | df.to_csv('submission.csv', index=False) 56 | 57 | zipName = '%s/submission.zip' % folder 58 | with zipfile.ZipFile(zipName, 'w') as f: 59 | f.write('submission.csv', compress_type=zipfile.ZIP_DEFLATED) 60 | 61 | 62 | if __name__ == '__main__': 63 | assert len(sys.argv)==2, 'Failed ...' 
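# Usage (from the branches below): `python ffm.py 1` trains the xlearn FFM model,
# `python ffm.py 0` runs prediction and writes submission.csv.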
64 | kind = int(sys.argv[1]) 65 | 66 | if kind == 1: 67 | train() 68 | elif kind == 0: 69 | predict() 70 | 71 | -------------------------------------------------------------------------------- /ffmdata.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Created by Bo Song on 2018/4/26 4 | 5 | import pandas as pd 6 | from pandas import get_dummies 7 | import lightgbm as lgb 8 | from sklearn.feature_extraction.text import CountVectorizer 9 | from sklearn.preprocessing import OneHotEncoder,LabelEncoder 10 | from scipy import sparse 11 | import numpy as np 12 | import os 13 | import gc 14 | 15 | 16 | path='data/' 17 | 18 | 19 | 20 | one_hot_feature=['LBS','age','carrier','consumptionAbility','education','gender','advertiserId','campaignId', 'creativeId', 21 | 'adCategoryId', 'productId', 'productType'] 22 | 23 | vector_feature=['interest1','interest2','interest5','kw1','kw2','topic1','topic2','os','ct','marriageStatus'] 24 | continus_feature=['creativeSize'] 25 | 26 | ad_feature=pd.read_csv(path+'adFeature.csv') 27 | user_feature=pd.read_csv(path+'userFeature.csv') 28 | 29 | train=pd.read_csv(path+'train.csv') 30 | test=pd.read_csv(path+'test1.csv') 31 | 32 | data=pd.concat([train,test]) 33 | data=pd.merge(data,ad_feature,on='aid',how='left') 34 | data=pd.merge(data,user_feature,on='uid',how='left') 35 | 36 | 37 | data=data.fillna(-1) 38 | data=data[one_hot_feature+vector_feature+continus_feature] 39 | 40 | class FFMFormat: 41 | def __init__(self,vector_feat,one_hot_feat,continus_feat): 42 | self.field_index_ = None 43 | self.feature_index_ = None 44 | self.vector_feat=vector_feat 45 | self.one_hot_feat=one_hot_feat 46 | self.continus_feat=continus_feat 47 | 48 | 49 | def get_params(self): 50 | pass 51 | 52 | def set_params(self, **parameters): 53 | pass 54 | 55 | def fit(self, df, y=None): 56 | self.field_index_ = {col: i for i, col in enumerate(df.columns)} 57 | self.feature_index_ = dict() 58 | last_idx = 0 59 | for col in df.columns: 60 | if col in self.one_hot_feat: 61 | print(col) 62 | df[col]=df[col].astype('int') 63 | vals = np.unique(df[col]) 64 | for val in vals: 65 | if val==-1: continue 66 | name = '{}_{}'.format(col, val) 67 | if name not in self.feature_index_: 68 | self.feature_index_[name] = last_idx 69 | last_idx += 1 70 | elif col in self.vector_feat: 71 | print(col) 72 | vals=[] 73 | for data in df[col].apply(str): 74 | if data!="-1": 75 | for word in data.strip().split(' '): 76 | vals.append(word) 77 | vals = np.unique(vals) 78 | for val in vals: 79 | if val=="-1": continue 80 | name = '{}_{}'.format(col, val) 81 | if name not in self.feature_index_: 82 | self.feature_index_[name] = last_idx 83 | last_idx += 1 84 | self.feature_index_[col] = last_idx 85 | last_idx += 1 86 | return self 87 | 88 | def fit_transform(self, df, y=None): 89 | self.fit(df, y) 90 | return self.transform(df) 91 | 92 | def transform_row_(self, row): 93 | ffm = [] 94 | 95 | for col, val in row.loc[row != 0].to_dict().items(): 96 | if col in self.one_hot_feat: 97 | name = '{}_{}'.format(col, val) 98 | if name in self.feature_index_: 99 | ffm.append('{}:{}:1'.format(self.field_index_[col], self.feature_index_[name])) 100 | # ffm.append('{}:{}:{}'.format(self.field_index_[col], self.feature_index_[col], 1)) 101 | elif col in self.vector_feat: 102 | for word in str(val).split(' '): 103 | name = '{}_{}'.format(col, word) 104 | if name in self.feature_index_: 105 | ffm.append('{}:{}:1'.format(self.field_index_[col], 
self.feature_index_[name])) 106 | elif col in self.continus_feat: 107 | if val!=-1: 108 | ffm.append('{}:{}:{}'.format(self.field_index_[col], self.feature_index_[col], val)) 109 | return ' '.join(ffm) 110 | 111 | def transform(self, df): 112 | # val=[] 113 | # for k,v in self.feature_index_.items(): 114 | # val.append(v) 115 | # val.sort() 116 | # print(val) 117 | # print(self.field_index_) 118 | # print(self.feature_index_) 119 | return pd.Series({idx: self.transform_row_(row) for idx, row in df.iterrows()}) 120 | 121 | tr = FFMFormat(vector_feature,one_hot_feature,continus_feature) 122 | user_ffm=tr.fit_transform(data) 123 | user_ffm.to_csv('ffm.csv',index=False) 124 | 125 | train = pd.read_csv(path + 'train.csv') 126 | test = pd.read_csv(path+'test1.csv') 127 | 128 | Y = np.array(train.pop('label')) 129 | len_train=len(train) 130 | 131 | with open('ffm.csv') as fin: 132 | f_train_out = open('train_ffm.csv','w') 133 | f_test_out = open('test_ffm.csv', 'w') 134 | for (i,line) in enumerate(fin): 135 | if i [50,] 163 | "learning_rate": hp.randint("learning_rate", 10), # [0,1,2,3,4,5] -> 0.05,0.06 164 | "subsample": hp.randint("subsample", 4), # [0,1,2,3] -> [0.7,0.8,0.9,1.0] 165 | "min_child_weight": hp.randint("min_child_weight", 10), 166 | "num_leaves": hp.randint("num_leaves", 2) 167 | } 168 | algo = partial(tpe.suggest, n_startup_jobs=1) 169 | best = fmin(LGB, space, algo=algo, max_evals=4) 170 | print(best) 171 | print(LGB(best)) 172 | 173 | HY() 174 | 175 | 176 | t_end = datetime.datetime.now() 177 | print('training time: %s' % ((t_end - t_start).seconds / 60)) 178 | sys.stdout.flush() 179 | -------------------------------------------------------------------------------- /nn.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf8 -*- 2 | 3 | import sys 4 | import csv 5 | import glob 6 | import random 7 | import zipfile 8 | import numpy as np 9 | import mxnet as mx 10 | from mxnet.gluon import nn 11 | 12 | from data import * 13 | from nnHelper import * 14 | 15 | logging.basicConfig( 16 | level=logging.INFO, 17 | datefmt='%Y%m%d-%H:%M:%S', 18 | format='%(asctime)s: %(message)s', 19 | ) 20 | ################################################### 21 | 22 | class MyData(object): 23 | def __init__(self, ctx, isTrain=True): 24 | self.ctx = ctx 25 | self.isTrain = isTrain 26 | 27 | global name 28 | bs = getBS(name) 29 | self.batchSize = bs 30 | 31 | self.loadDatas() 32 | 33 | def loadDatas(self): 34 | logging.info('Loading datas ...') 35 | 36 | self.datas = loadSparse( 37 | 0 if self.isTrain else 1) 38 | self.len = self.datas.shape[0] 39 | self.indices = list(range(self.len)) 40 | self.iters = self.len / self.batchSize 41 | 42 | logging.info('Loading datas done ...') 43 | 44 | def __next__(self): 45 | index, batchSize = self.index, self.batchSize 46 | indices = self.indices[index:index+batchSize] 47 | 48 | self.index += batchSize 49 | if self.index > self.len: raise StopIteration 50 | 51 | sData = self.datas[indices].toarray() 52 | datas = mx.nd.array( 53 | sData[:, 1:], self.ctx, dtype='int32') 54 | labels = mx.nd.array(sData[:, 0], self.ctx) 55 | 56 | return datas, labels 57 | 58 | def reset(self): 59 | self.index = 0 60 | #if self.isTrain: 61 | # #self.indices = random.sample(self.indices,len(self.indices)) 62 | # random.#shuffle(self.indices) 63 | 64 | def __iter__(self): 65 | return self 66 | 67 | class MyDataVal(object): 68 | def __init__(self, ctx): 69 | self.ctx = ctx 70 | 71 | global name 72 | bs = getBS(name) 73 | self.batchSize = bs 
74 | 75 | self.loadDatas() 76 | 77 | def loadDatas(self): 78 | logging.info('Loading datas ...') 79 | 80 | keys = ['aid', 'uid'] 81 | self.ids = getMerged(*keys, kind=4).values 82 | 83 | self.datas = loadSparse(1) 84 | self.len = self.datas.shape[0] 85 | 86 | logging.info('Loading datas done ...') 87 | 88 | def __next__(self): 89 | index, batchSize = self.index, self.batchSize 90 | if index >= self.len: raise StopIteration 91 | end = min(index + batchSize, self.len) 92 | 93 | ids = self.ids[index:end] 94 | ids = mx.nd.array( 95 | ids, self.ctx, dtype='int32') 96 | 97 | sData = self.datas[index:end].toarray() 98 | datas = mx.nd.array( 99 | sData[:, 1:], self.ctx, dtype='int32') 100 | self.index += self.batchSize 101 | 102 | return ids, datas 103 | 104 | def reset(self): 105 | self.index = 0 106 | 107 | def __iter__(self): 108 | return self 109 | 110 | class MyDataTest(object): 111 | def __init__(self, ctx): 112 | self.ctx = ctx 113 | 114 | global name 115 | bs = getBS(name) 116 | self.batchSize = bs 117 | 118 | self.loadDatas() 119 | 120 | def loadDatas(self): 121 | logging.info('Loading datas ...') 122 | 123 | keys = ['aid', 'uid'] 124 | self.ids = getMerged(*keys, kind=2).values 125 | 126 | self.datas = loadSparse(2) 127 | self.len = self.datas.shape[0] 128 | 129 | logging.info('Loading datas done ...') 130 | 131 | def __next__(self): 132 | index, batchSize = self.index, self.batchSize 133 | if index >= self.len: raise StopIteration 134 | end = min(index + batchSize, self.len) 135 | 136 | ids = self.ids[index:end] 137 | ids = mx.nd.array( 138 | ids, self.ctx, dtype='int32') 139 | 140 | sData = self.datas[index:end].toarray() 141 | datas = mx.nd.array( 142 | sData[:, 1:], self.ctx, dtype='int32') 143 | self.index += self.batchSize 144 | 145 | return ids, datas 146 | 147 | def reset(self): 148 | self.index = 0 149 | 150 | def __iter__(self): 151 | return self 152 | 153 | ################################################### 154 | 155 | class MyNet(nn.HybridSequential): 156 | def __init__(self, ctx): 157 | super(MyNet, self).__init__() 158 | global name 159 | 160 | self.init(ctx) 161 | if getSorN(name): 162 | self.hybridize() 163 | #self.hybridize() 164 | self.collect_params().initialize( 165 | #mx.init.MSRAPrelu(slope=0), 166 | mx.init.Xavier(), 167 | ctx 168 | ) 169 | 170 | def init(self, ctx): 171 | with self.name_scope(): 172 | global name 173 | global lossKind 174 | 175 | n = lossKind + 1 176 | self.add(getStart(name)) 177 | self.add(MyIBA(128, 'relu'))#全连接最后一层,另外正则可选+在loss 178 | self.add(MyIBA(128, 'relu')) 179 | self.add(nn.Dense(n)) 180 | #self.add(MyIBA(n, 'self')) 181 | 182 | class MyNN(MyModel): 183 | def getNet(self, ctx): 184 | global lossKind 185 | 186 | return MyNet(ctx), getLoss(lossKind) 187 | 188 | def getModel(self): 189 | global model; return model 190 | 191 | def getMetric(self): 192 | global lossKind 193 | 194 | return getMetric(lossKind), 'aucMetric' 195 | 196 | def getData(self, ctx): 197 | return MyData(ctx, True), MyData(ctx, False) 198 | 199 | def forDebug(self, out): 200 | pass 201 | 202 | def getTrainer(self, params, iters): 203 | opt = getOpt(iters) 204 | 205 | return mx.gluon.Trainer(params, opt) 206 | #return mx.gluon.Trainer(params, 'adam',{ 'clip_gradient': 2}) 207 | 208 | class MyNNV(MyPredict): 209 | def __init__(self, gpu=0): 210 | super(MyNNV, self).__init__(gpuID) 211 | model = self.getModel() 212 | self.name = '%s/val.csv' % model 213 | 214 | self.f = open(self.name, 'w') 215 | self.csv = csv.writer(self.f) 216 | self.csv.writerow(['aid', 'uid', 
'score']) 217 | 218 | def getModel(self): 219 | global model; return model 220 | 221 | def onDone(self): 222 | self.f.close() 223 | 224 | def getData(self, ctx): 225 | return MyDataVal(ctx) 226 | 227 | def getNet(self, ctx): 228 | model = self.getModel() 229 | name = sorted(glob.glob( 230 | '%s/*.params' % model))[-1] 231 | 232 | net = MyNet(ctx) 233 | net.load_params(name, ctx) 234 | logging.info('Load %s ...' % name) 235 | 236 | return net 237 | 238 | def preProcess(self, data): 239 | return data[1] 240 | 241 | def postProcess(self, data, pData, output): 242 | ids = data[0].asnumpy() 243 | if lossKind == 0: 244 | output = mx.nd.sigmoid(output) 245 | output = output.asnumpy()[:,0] 246 | if lossKind == 1: 247 | output = mx.nd.softmax(output) 248 | output = output.asnumpy()[:,1] 249 | if lossKind == 2: 250 | output1 = output[:, 0:2] 251 | output2 = output[:, 2:3] 252 | 253 | output1 = mx.nd.softmax(output1) 254 | output1 = output1.asnumpy()[:,1] 255 | 256 | output2 = mx.nd.sigmoid(output2) 257 | output2 = output2.asnumpy()[:,0] 258 | 259 | output = 0.5*(output1 + output2) 260 | 261 | for id, out in zip(ids, output): 262 | self.csv.writerow([id[0], id[1], '%.6f'%out]) 263 | 264 | class MyNNP(MyPredict): 265 | def __init__(self, gpu=0): 266 | super(MyNNP, self).__init__(gpuID) 267 | model = self.getModel() 268 | self.name = '%s/submission.csv' % model 269 | 270 | self.f = open(self.name, 'w') 271 | self.csv = csv.writer(self.f) 272 | self.csv.writerow(['aid', 'uid', 'score']) 273 | 274 | def getModel(self): 275 | global model; return model 276 | 277 | def onDone(self): 278 | self.f.close() 279 | 280 | model = self.getModel() 281 | zipName = '%s/submission.zip' % model 282 | with zipfile.ZipFile(zipName, 'w') as f: 283 | f.write( 284 | self.name, 'submission.csv', 285 | compress_type=zipfile.ZIP_DEFLATED 286 | ) 287 | 288 | def getData(self, ctx): 289 | return MyDataTest(ctx) 290 | 291 | def getNet(self, ctx): 292 | model = self.getModel() 293 | name = sorted(glob.glob( 294 | '%s/*.params' % model))[-1] 295 | 296 | net = MyNet(ctx) 297 | net.load_params(name, ctx) 298 | print( 'Load %s ...' % name) 299 | 300 | return net 301 | 302 | def preProcess(self, data): 303 | return data[1] 304 | 305 | def postProcess(self, data, pData, output): 306 | ids = data[0].asnumpy() 307 | if lossKind == 0: 308 | output = mx.nd.sigmoid(output) 309 | output = output.asnumpy()[:,0] 310 | if lossKind == 1: 311 | output = mx.nd.softmax(output) 312 | output = output.asnumpy()[:,1] 313 | if lossKind == 2: 314 | output1 = output[:, 0:2] 315 | output2 = output[:, 2:3] 316 | 317 | output1 = mx.nd.softmax(output1) 318 | output1 = output1.asnumpy()[:,1] 319 | 320 | output2 = mx.nd.sigmoid(output2) 321 | output2 = output2.asnumpy()[:,0] 322 | 323 | output = 0.5*(output1 + output2) 324 | 325 | for id, out in zip(ids, output): 326 | self.csv.writerow([id[0], id[1], '%.6f'%out]) 327 | 328 | ################################################### 329 | if __name__ == '__main__': 330 | mx.random.seed(2018) 331 | random.seed(2018) 332 | logging.info('All start...') 333 | 334 | 335 | assert len(sys.argv)==5, 'Failed ...' 
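    # Expected command line, inferred from the argument parsing below, e.g.:
    #     python nn.py 1 0 dffm3 0
    # argv[1] kind     : 1 = train, 2 = write validation scores, 0 = write test submission
    # argv[2] gpuID    : device index (note: nnHelper.py builds mx.cpu(gpuID) contexts
    #                    in this snapshot, despite the name)
    # argv[3] name     : network key looked up in nnHelper.getStart (e.g. 'dffm3', 'test')
    # argv[4] lossKind : 0 = sigmoid BCE, 1 = weighted softmax CE, 2 = average of both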
336 | kind, gpuID = list(map(int, sys.argv[1:3])) 337 | 338 | name = sys.argv[3] 339 | model = './modelNN_%s' % name 340 | 341 | lossKind = int(sys.argv[4]) 342 | if lossKind == 1: 343 | model += '#Softmax' 344 | if lossKind == 2: 345 | model += '#S2SLoss' 346 | 347 | if kind==1: 348 | m = MyNN(gpuID); m.train() 349 | if kind==2: 350 | m = MyNNV(gpuID); m.predict() 351 | if kind==0: 352 | m = MyNNP(gpuID); m.predict() 353 | 354 | 355 | logging.info('All done ...') 356 | 357 | -------------------------------------------------------------------------------- /nnHelper.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf8 -*- 2 | 3 | import os 4 | import random 5 | import logging 6 | import mxnet as mx 7 | import numpy as np 8 | from mxnet.gluon import nn 9 | from mxnet import autograd 10 | from mxnet.gluon import loss 11 | import os.path 12 | import time 13 | from data import * 14 | from sklearn.metrics import roc_auc_score 15 | 16 | logging.basicConfig( 17 | level=logging.INFO, 18 | datefmt='%Y%m%d-%H:%M:%S', 19 | format='%(asctime)s: %(message)s', 20 | ) 21 | 22 | ################################################### 23 | 24 | def getBS(name): 25 | # bSs = { 26 | # 'dffm31' : 200, 27 | # 'dfcm12' : 500, 28 | # 'dfcm13' : 500, 29 | # } 30 | # 31 | # if name in bSs: 32 | # return bSs[name] 33 | return 2048 34 | 35 | def getOpt(iters): 36 | # opt = mx.optimizer.AdaGrad( 37 | # wd=1e-3, 38 | # learning_rate=2e-4, 39 | # ) 40 | opt = mx.optimizer.Adam(wd=1e-3, 41 | learning_rate=1e-2,) 42 | 43 | return opt 44 | 45 | def getSorN(name): 46 | 47 | models = ( 48 | 'dfr', 49 | ) 50 | 51 | return name not in models 52 | 53 | def getStart(name): 54 | dims, User, Ad= getFFMDim() 55 | dims = dims.values() 56 | inC = 4989 57 | models = { 58 | 'test': lambda: Chen_DFM(dims, inC, 8, 128, User, Ad), # 2048 59 | 'test1': lambda: Guo_DFFM(dims, inC, 8, 256), # 4096 60 | 'testdin': lambda: MyDIN(dims, inC, 8, 128, User, Ad), # XXXX, 7325, XXXX 61 | 62 | 'dcn' : lambda: MyDCN(dims, inC, 8, 64, 3), 63 | 'dfcm' : lambda: MyDFCM(dims, inC, 8, 64), # 7370, 7406, XXXX 64 | 'dfcm11': lambda: MyDFCM(dims, inC, 16, 128), # XXXX, 7431, XXXX 65 | 'dfcm12': lambda: MyDFCM(dims, inC, 32, 128), # XXXX, 7482, XXXX 66 | 'dfcm13': lambda: MyDFCM(dims, inC, 64, 256), # XXXX, 7485, XXXX 67 | 68 | 'dfr' : lambda: MyDFR(dims, inC, 8, 64), 69 | 'dfu' : lambda: MyDFU(dims, inC, 8, 64), # 7269, XXXX, XXXX 70 | 'dfm' : lambda: MyDFM(dims, inC, 8, 64), # 7256, 7280, XXXX 71 | 72 | 'dfm2' : lambda: MyDFM2(dims, inC, 8, 64), # XXXX, 7369, XXXX 73 | 'dfm21' : lambda: MyDFM2(dims, inC, 16, 128), # XXXX, 7410, XXXX 74 | 'dfm22' : lambda: MyDFM2(dims, inC, 32, 128), # XXXX, 7445, XXXX 75 | 'dfm23' : lambda: MyDFM2(dims, inC, 32, 256), # XXXX, 7450, XXXX 76 | 'dfm24' : lambda: MyDFM2(dims, inC, 64, 256), # XXXX, 7468, XXXX 77 | 'dfm25' : lambda: MyDFM2(dims, inC, 128, 256), # XXXX, 7468, XXXX 78 | 'dfm26' : lambda: MyDFM2(dims, inC, 256, 256), # XXXX, 7475, XXXX 79 | 80 | 'dfz' : lambda: MyDIN(dims, inC, 8, 64), # XXXX, 7325, XXXX 81 | 'dfz11' : lambda: MyDFZ(dims, inC, 16, 128), # XXXX, 7402, XXXX 82 | 'dfz12' : lambda: MyDFZ(dims, inC, 32, 128), # XXXX, 7443, XXXX 83 | 'dfz13' : lambda: MyDFZ(dims, inC, 32, 256), # XXXX, 7439, XXXX 84 | 'dfz14' : lambda: MyDFZ(dims, inC, 64, 256), # XXXX, 7456, XXXX 85 | 'dfz15' : lambda: MyDFZ(dims, inC, 128, 256), # XXXX, 7463, XXXX 86 | 'dfz16' : lambda: MyDFZ(dims, inC, 256, 256), # XXXX, 7469, XXXX 87 | 88 | 'din' : lambda: MyDIN(dims, inC, 8, 64), # 
XXXX, 7325, XXXX 89 | 'dfin' : lambda: MyDFIN(dims, inC, 8, 64), # 7399, 7415, XXXX 90 | 'dfcn' : lambda: MyDFCN(dims, inC, 8, 64), # 7415, XXXX, XXXX 91 | 92 | 'dffm' : lambda: MyDFFM(dims, inC, 8, 64), # 7395, 7439, XXXX 93 | 'dffm2' : lambda: MyDFFM2(dims, inC, 8, 64), # 7396, 7444, XXXX 94 | 'dffm3' : lambda: MyDFFM3(dims, inC, 8, 256), # 2048 95 | 'dffm31': lambda: MyDFFM3(dims, inC, 16, 64), # XXXX, 7462, XXXX 96 | } 97 | 98 | return models[name]() 99 | 100 | def getLoss(lossKind): 101 | losses = { 102 | 0: MyLoss, 103 | 1: MyLoss2, 104 | 2: MyLoss3, 105 | } 106 | return losses[lossKind]() 107 | 108 | def getMetric(lossKind): 109 | metrics = { 110 | 0: MyMetric, 111 | 1: MyMetric2, 112 | 2: MyMetric3, 113 | } 114 | return metrics[lossKind]() 115 | 116 | def randomRange(start, end): 117 | r1 = r2 = 0 118 | while(r1 == r2): 119 | r1 = random.randint(start, end) 120 | r2 = random.randint(start, end) 121 | t1, t2 = min(r1,r2), max(r1,r2) 122 | return t1, t2 123 | 124 | ################################################### 125 | 126 | class MyBA(nn.HybridSequential): 127 | def __init__(self, act='relu'): 128 | super(MyBA, self).__init__() 129 | 130 | with self.name_scope(): 131 | self.add(nn.BatchNorm()) 132 | self.add(MyAct(act)) 133 | 134 | class MyAct(nn.HybridBlock): 135 | def __init__(self, act='relu'): 136 | super(MyAct, self).__init__() 137 | self.act = act 138 | 139 | def hybrid_forward(self, F, x): 140 | if self.act == 'self': return x 141 | return F.Activation(x, self.act) 142 | 143 | class MyIBA(nn.HybridSequential): 144 | def __init__(self, c, act='relu'): 145 | super(MyIBA, self).__init__() 146 | 147 | with self.name_scope(): 148 | self.add(nn.Dense(c))#全连接+bias 149 | self.add(MyBA(act))#可选 150 | 151 | class MyCBA(nn.HybridSequential): 152 | def __init__(self, c, act='relu'): 153 | super(MyCBA, self).__init__() 154 | 155 | with self.name_scope(): 156 | self.add(nn.Conv1D(*c)) 157 | self.add(MyBA(act)) 158 | 159 | class MyRes(nn.HybridBlock): 160 | def __init__(self, c1, c2): 161 | super(MyRes, self).__init__() 162 | 163 | with self.name_scope(): 164 | self.opr = MyAct('relu') 165 | self.op1 = MyIBA(c1, 'self') 166 | self.op2 = MyIBA(c2, 'relu') 167 | 168 | def hybrid_forward(self, F, x): 169 | return self.op2(self.opr(self.op1(x)+x)) 170 | 171 | class MyRes2(nn.HybridSequential): 172 | def __init__(self, c1, c2, num): 173 | super(MyRes2, self).__init__() 174 | 175 | with self.name_scope(): 176 | for i in range(num): 177 | self.add(MyRes(c1, c1)) 178 | self.add(MyRes(c1, c2)) 179 | 180 | #&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& 181 | class MyDPO(nn.HybridSequential): 182 | def __init__(self, rate=.5): 183 | super(MyDPO, self).__init__() 184 | self.rate = rate 185 | 186 | with self.name_scope(): 187 | self.add(nn.Dropout(.5)) 188 | 189 | class MyIDP(nn.HybridSequential): 190 | def __init__(self, c, c1): 191 | super(MyIDP, self).__init__() 192 | 193 | with self.name_scope(): 194 | self.add(nn.Dense(c, activation='sigmoid', flatten=False))#全连接+bias 195 | self.add(nn.Dense(c1, activation='sigmoid', flatten=False)) # 全连接+bias 196 | self.add(nn.Dense(1, flatten=False)) # 全连接+bias 197 | 198 | 199 | ##################################################### 200 | class MyC(nn.HybridBlock): 201 | def __init__(self, shapew,shapeb): 202 | super(MyC, self).__init__() 203 | 204 | with self.name_scope(): 205 | self.b = self.params.get('b', shape=shapeb) 206 | self.w = self.params.get('w', shape=shapew) 207 | 208 | def hybrid_forward(self, F, x0, x1, b, w): 209 | y = 
F.broadcast_add(F.dot( 210 | F.batch_dot(x0, x1, False, True), w), b) 211 | #F.FullyConnected() 212 | return y 213 | 214 | class MyE(nn.HybridBlock): 215 | def __init__(self, inC, outC): 216 | super(MyE, self).__init__() 217 | 218 | with self.name_scope(): 219 | self.inC = inC 220 | self.outC = outC 221 | 222 | self.w = self.params.get( 223 | 'weight', 224 | shape=(inC, outC), 225 | allow_deferred_init=True 226 | ) 227 | 228 | def hybrid_forward(self, F, x, w): 229 | zw = F.concat( 230 | F.zeros((1, self.outC)), w, dim=0) 231 | return F.Embedding(x, w, self.inC, self.outC) 232 | 233 | class MyEB(nn.HybridBlock): 234 | def __init__(self, dims, inC, outC): 235 | super(MyEB, self).__init__() 236 | 237 | self.d = dims#a list of fields' max length 238 | self.e = MyE(inC, outC) 239 | 240 | 241 | def onDone(self, F, result): 242 | return result 243 | 244 | def hybrid_forward(self, F, x): 245 | e = self.e(x)#e*v? 246 | result, start, end = [], 0, 0 247 | for i, size in enumerate(self.d): 248 | start, end = end, end + size 249 | sliced = F.slice_axis(e, 1, start, end) 250 | result.append(F.mean(sliced, 1, True))#av_pool 251 | 252 | return self.onDone(F, result) 253 | 254 | class MyAUEB(nn.HybridBlock): 255 | def __init__(self, dims, inC, outC, User ,Ad): 256 | super(MyAUEB, self).__init__() 257 | 258 | self.d = dims#a list of fields' max length 259 | self.e = MyE(inC, outC) 260 | self.User = [int(x) for x in User] 261 | self.Ad = [int(x) for x in Ad] 262 | 263 | 264 | def onDone(self, F, result, User, Ad): 265 | for k in User: 266 | result[k] = User[k] 267 | for k in Ad: 268 | result[k] = Ad[k] 269 | keys = sorted(result.keys()) 270 | res = [] 271 | for k in keys: 272 | res.append(result[k]) 273 | return res 274 | 275 | def hybrid_forward(self, F, x): 276 | e = self.e(x) # e*v? 277 | result, User, Ad, start, end = {}, {}, {}, 0, 0 278 | for i, size in enumerate(self.d): 279 | start, end = end, end + size 280 | sliced = F.slice_axis(e, 1, start, end) 281 | if i in self.User: 282 | User[i] = sliced 283 | elif i in self.Ad: 284 | Ad[i] = sliced 285 | else: 286 | result[i] = sliced 287 | 288 | # print(User[1].infer_shape(data=(2048,590)))#B*field*K 289 | # print(Ad[0].infer_shape(data=(2048, 590))) 290 | # User = F.concat(*User, dim=1)#B*F*K 291 | # print(User.infer_shape(data=(2048, 590))) 292 | # Ad = F.concat(*Ad, dim=1)#B*1*K 293 | # print(Ad.infer_shape(data=(2048, 590))) 294 | return self.onDone(F, result, User, Ad) 295 | 296 | class MyAttention(MyAUEB): 297 | def __init__(self, dims, inC, outC, User, Ad): 298 | super(MyAttention, self).__init__(dims, inC, outC, User, Ad) 299 | self.K = outC 300 | self.ip = MyIDP(80, 40) 301 | self.dims = dims 302 | 303 | # def onDone(self, F, result, User, Ad): 304 | # #concat embedding 305 | # for k in User: 306 | # result[k] = User[k] 307 | # for k in Ad: 308 | # result[k] = Ad[k] 309 | # keys = sorted(result.keys()) 310 | # res = [] 311 | # for k in keys: 312 | # res.append(result[k]) 313 | # return result 314 | 315 | def hybrid_forward(self, F, x): 316 | e = self.e(x) # e*v? 
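        # DIN-style attention over the multi-value user fields (a reading of the code
        # below): the single ad embedding is tiled to act as the query; query and each
        # user-field value are combined as [q, k, q - k, q * k] and scored by the small
        # MLP self.ip (MyIDP: 80 -> 40 -> 1); padded positions are masked out, the
        # scores are scaled by 1/sqrt(K) and softmax-normalised, and the weighted sum
        # over the field replaces the plain average pooling done in MyEB.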
317 | result, User, Ad, start, end = {}, {}, {}, 0, 0 318 | for i, size in enumerate(self.d): 319 | start, end = end, end + size 320 | sliced = F.slice_axis(e, 1, start, end) 321 | if i in self.User: 322 | User[i] = sliced 323 | elif i in self.Ad: 324 | Ad[i] = sliced 325 | else: 326 | result[i] = sliced 327 | 328 | # print(User[1].infer_shape(data=(2048,590)))#B*field*K 329 | # print(Ad[0].infer_shape(data=(2048, 590))) 330 | # User = F.concat(*User, dim=1)#B*F*K 331 | # print(User.infer_shape(data=(2048, 590))) 332 | # Ad = F.concat(*Ad, dim=1)#B*1*K 333 | # print(Ad.infer_shape(data=(2048, 590))) 334 | #AT part 335 | AU = {} 336 | for k in User:#[B, F, K] 337 | queries = Ad[list(Ad.keys())[0]] 338 | shape = User[k].infer_shape(data=(2048, 590))[1][0]#(,,) 339 | shapeF = list(self.dims)[k] 340 | print(queries.infer_shape(data=(2048, 590))) 341 | queries = F.tile(queries, [1, shapeF]) 342 | print(queries.infer_shape(data=(2048, 590))) 343 | queries = F.reshape(queries, [-1, shapeF, self.K]) 344 | print(queries.infer_shape(data=(2048, 590))) 345 | din_all = F.concat(queries, User[k], queries - User[k], queries * User[k], dim=-1) 346 | print(din_all.infer_shape(data=(2048, 590))) 347 | din_all = self.ip(din_all) 348 | print(din_all.infer_shape(data=(2048, 590))) 349 | outputs = F.reshape(din_all, [-1, 1, shapeF])#[B, 1, F] 350 | print(outputs.infer_shape(data=(2048, 590))) 351 | key_masks = F.broadcast_greater(User[k], F.zeros_like(User[k])) 352 | paddings = F.ones_like(outputs) * (-2 ** 32 + 1) 353 | outputs = F.where(key_masks, outputs, paddings) 354 | outputs = outputs / (shape[-1] ** 0.5) 355 | print(outputs.infer_shape(data=(2048, 590))) 356 | # Activation 357 | outputs = F.softmax(outputs) # [B, 1, T] 358 | # Weighted sum 359 | outputs = F.batch_dot(outputs, User[k]) # [B, 1, H] 360 | print(outputs.infer_shape(data=(2048, 590))) 361 | AU[k] = outputs 362 | 363 | return self.onDone(F, result, AU, Ad) 364 | 365 | class MyED(MyEB):#维度(各每行最大取值数),weight输入维度,输出维度 366 | def __init__(self, dims, inC, outC): 367 | super(MyED, self).__init__(dims, inC, outC) 368 | 369 | def onDone(self, F, result): 370 | return F.concat(*result, dim=1) 371 | 372 | class MyER(nn.HybridBlock): 373 | def __init__(self, dims, inC, outC): 374 | super(MyER, self).__init__() 375 | 376 | self.d = dims 377 | self.e = MyE(inC, outC) 378 | 379 | def hybrid_forward(self, F, x): 380 | flag = autograd.is_recording() 381 | 382 | e = self.e(x) 383 | result, start, end = [], 0, 0 384 | for i, size in enumerate(self.d): 385 | start, end = end, end + size 386 | 387 | t1, t2 = start, end 388 | if flag and size > 5: 389 | t1, t2 = randomRange(start, end) 390 | 391 | sliced = F.slice_axis(e, 1, t1, t2) 392 | result.append(F.mean(sliced, 1, True)) 393 | 394 | return F.concat(*result, dim=1) 395 | 396 | class MyU(nn.HybridBlock): 397 | def __init__(self, inC, outC): 398 | super(MyU, self).__init__() 399 | 400 | with self.name_scope(): 401 | self.inC = inC 402 | self.outC = outC 403 | 404 | self.w = self.params.get( 405 | 'weight', 406 | shape=(inC-1, outC), 407 | allow_deferred_init=True, 408 | init=mx.init.Uniform(0.1), 409 | ) 410 | 411 | def hybrid_forward(self, F, x, w): 412 | zw = F.concat( 413 | F.zeros((1, self.outC)), w, dim=0) 414 | return F.Embedding(x, zw, self.inC, self.outC) 415 | 416 | class MyUD(nn.HybridBlock): 417 | def __init__(self, dims, inC, outC): 418 | super(MyUD, self).__init__() 419 | 420 | self.d = dims 421 | self.e = MyU(inC, outC) 422 | 423 | def hybrid_forward(self, F, x): 424 | e = self.e(x) 425 | result, 
start, end = [], 0, 0 426 | for i, size in enumerate(self.d): 427 | start, end = end, end + size 428 | sliced = F.slice_axis(e, 1, start, end) 429 | result.append(F.mean(sliced, 1, True)) 430 | return F.concat(*result, dim=1) 431 | 432 | class MyZ(nn.HybridBlock): 433 | def __init__(self, inC, outC): 434 | super(MyZ, self).__init__() 435 | 436 | with self.name_scope(): 437 | self.inC = inC 438 | self.outC = outC 439 | 440 | self.w = self.params.get( 441 | 'weight', 442 | shape=(inC, outC), 443 | allow_deferred_init=True 444 | ) 445 | 446 | def hybrid_forward(self, F, x, w): 447 | return F.Embedding(x, w, self.inC, self.outC) 448 | 449 | class MyZB(nn.HybridBlock): 450 | def __init__(self, dims, inC, outC): 451 | super(MyZB, self).__init__() 452 | 453 | self.d = dims 454 | self.e = MyZ(inC, outC) 455 | 456 | def onDone(self, F, result): 457 | return result 458 | 459 | def hybrid_forward(self, F, x): 460 | e = self.e(x) 461 | result, start, end = [], 0, 0 462 | for i, size in enumerate(self.d): 463 | start, end = end, end + size 464 | sliced = F.slice_axis(e, 1, start, end) 465 | result.append(F.mean(sliced, 1, True)) 466 | 467 | return self.onDone(F, result) 468 | 469 | ################################################### 470 | class MyDPO(nn.HybridSequential): 471 | def __init__(self, rate=.5): 472 | super(MyDPO, self).__init__() 473 | self.rate = rate 474 | 475 | with self.name_scope(): 476 | self.add(nn.Dropout(.5)) 477 | 478 | 479 | class Chen_DFM(nn.HybridBlock): 480 | def __init__(self, dims, inC, outC, unit, User, Ad): 481 | super(Chen_DFM, self).__init__() 482 | 483 | with self.name_scope(): 484 | self.ba = MyBA('relu') 485 | #self.be = MyBA('relu') 486 | self.ip = MyIBA(unit, 'relu') 487 | self.ed = MyAttention(dims, inC, outC, User, Ad) 488 | self.lr = MyAttention(dims, inC, 1, User, Ad) 489 | self.dpo = MyDPO() 490 | self.n = len(dims) 491 | 492 | def hybrid_forward(self, F, x): 493 | # e = self.be(self.ed(x)) 494 | # 495 | # lr = self.be(self.lr(x)) 496 | e = self.ed(x) 497 | lr = self.lr(x) 498 | 499 | order1 = self.dpo(F.sum(lr, axis=2)) 500 | 501 | order2_1 = F.square(F.sum(e, axis=1)) 502 | order2_2 = F.sum(F.square(e), axis=1) 503 | order2 = self.dpo(0.5 * (order2_1 - order2_2)) 504 | 505 | dp = self.dpo(F.flatten(e)) 506 | # for i in range(0, 2): 507 | dp = self.dpo(self.ip(dp)) 508 | # in_shape, out_shape, uax_shape = out.infer_shape(data=(256,207)) 509 | # print(in_shape, out_shape, uax_shape) 510 | return self.ba( 511 | F.concat(dp, order1, order2, dim=1)) 512 | 513 | class Guo_DFFM(nn.HybridBlock): 514 | def __init__(self, dims, inC, outC, unit): 515 | super(Guo_DFFM, self).__init__() 516 | 517 | with self.name_scope(): 518 | self.n = n = len(dims) 519 | #self.be = MyBA('relu') 520 | self.ed = MyED(dims, inC, outC*n) 521 | 522 | def hybrid_forward(self, F, x): 523 | e = self.ed(x) 524 | #e = self.be(e) 525 | print(e.infer_shape(data=(2048, 590))) 526 | ss, es = [], F.split(e, self.n, 1) 527 | print(es.infer_shape(data=(2048, 590))) 528 | es1= F.split(es[0], self.n, 2) 529 | print(es1.infer_shape(data=(2048, 590))) 530 | for s in es: 531 | ss.append(F.split(s, self.n, 2))#[n*1*8,n*1*8,...] 
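        # Field-aware second-order term (FFM style), as implemented below: every field i
        # carries n separate embeddings, one per field it can interact with.  For a
        # field pair (i, j) the interaction is the element-wise product of ss[i][j]
        # (field i's embedding towards j) and ss[j][i]; the block below collects these
        # products with itertools.combinations, stacks them along the pair axis and
        # sums over that axis, giving one k-dimensional order-2 vector per sample.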
532 | ############################################################################### 533 | import itertools 534 | embed_var_dict = {} 535 | for (i1, i2) in itertools.combinations(list(range(0, self.n)), 2): 536 | c1, c2 = i1, i2 537 | embed_var_dict.setdefault(c1, {})[c2] = ss[c1][i2] # None * k 538 | embed_var_dict.setdefault(c2, {})[c1] = ss[c2][i1] # None * k 539 | x_mat = [] 540 | y_mat = [] 541 | input_size = 0 542 | for (c1, c2) in itertools.combinations(embed_var_dict.keys(), 2): 543 | input_size += 1 544 | x_mat.append(embed_var_dict[c1][c2]) #input_size * None * k 545 | y_mat.append(embed_var_dict[c2][c1]) #input_size * None * k 546 | print(x_mat[0].infer_shape(data=(2048, 590))) 547 | x_mat = F.concat(*x_mat,dim=1) 548 | print(x_mat.infer_shape(data=(2048, 590))) 549 | y_mat = F.concat(*y_mat, dim=1) 550 | res = x_mat*y_mat 551 | print(res.infer_shape(data=(2048, 590))) 552 | ########################################################### 553 | # result = [] 554 | # for i in range(self.n): 555 | # for k in range(i, self.n): 556 | # result.append(ss[i][k]*ss[k][i])#ele-wise 557 | # #print(ss[i][k].infer_shape(data=(2048, 590))) 558 | # res = F.concat(*result, dim=1) 559 | # print(res.infer_shape(data=(2048, 590))) 560 | order2=F.sum(res, axis=1) 561 | print(order2.infer_shape(data=(2048, 590))) 562 | 563 | return order2 564 | class MyDCN(nn.HybridBlock): 565 | def __init__(self, dims, inC, outC, unit, depth): 566 | super(MyDCN, self).__init__() 567 | 568 | with self.name_scope(): 569 | self.be = MyBA('relu') 570 | self.ba = MyBA('relu') 571 | self.ip = MyIBA(unit, 'relu') 572 | self.ed = MyED(dims, inC, outC) 573 | 574 | self.depth = depth 575 | for i in range(depth): 576 | setattr( 577 | self, 'MyC#%02d'%i, 578 | MyC((len(dims)*outC, 1)) 579 | ) 580 | setattr( 581 | self, 'MyBA#%02d'%i, 582 | MyBA() 583 | ) 584 | 585 | def hybrid_forward(self, F, x): 586 | e = self.be(self.ed(x)) 587 | deep = self.ip(F.flatten(e)) 588 | 589 | y = e = F.expand_dims(F.flatten(e), -1) 590 | for i in range(self.depth): 591 | mc = getattr(self, 'MyC#%02d'%i) 592 | ba = getattr(self, 'MyBA#%02d'%i) 593 | 594 | y = ba(mc(e, y)) 595 | cross = F.flatten(y) 596 | 597 | return self.ba(F.concat(deep, cross, dim=1)) 598 | 599 | class MyDFM(nn.HybridBlock): 600 | def __init__(self, dims, inC, outC, unit): 601 | super(MyDFM, self).__init__() 602 | 603 | with self.name_scope(): 604 | self.ba = MyBA('relu') 605 | self.be = MyBA('relu') 606 | self.ip = MyIBA(unit, 'relu') 607 | self.ed = MyED(dims, inC, outC) 608 | 609 | 610 | def hybrid_forward(self, F, x): 611 | ed = self.ed(x) 612 | # in_shape, out_shape, uax_shape = ed.infer_shape(data=(2048,590)) 613 | # print('inshape:',in_shape,'outshape:', out_shape) 614 | e = self.be(ed) 615 | in_shape, out_shape, uax_shape = e.infer_shape(data=(2048,590)) 616 | print('inshape:',in_shape,'outshape:', out_shape) 617 | deep = self.ip(e) 618 | in_shape, out_shape, uax_shape = deep.infer_shape(data=(2048,590)) 619 | print('inshape:',in_shape,'outshape:', out_shape) 620 | order1 = F.sum(e, axis=2) 621 | order2_1 = F.square(F.sum(e, axis=1)) 622 | order2_2 = F.sum(F.square(e), axis=1) 623 | order2 = 0.5*(order2_1-order2_2) 624 | 625 | return self.ba( 626 | F.concat(deep, order1, order2, dim=1)) 627 | 628 | class MyDFM2(nn.HybridBlock): 629 | def __init__(self, dims, inC, outC, unit): 630 | super(MyDFM2, self).__init__() 631 | 632 | with self.name_scope(): 633 | self.n = len(dims) 634 | self.ba = MyBA('relu') 635 | self.be = MyBA('relu') 636 | self.ip = MyIBA(unit, 'relu') 637 | 
self.ed = MyEB(dims, inC, outC) 638 | 639 | def hybrid_forward(self, F, x): 640 | x = self.ed(x) 641 | in_shape, out_shape, uax_shape = x[0].infer_shape(data=(2048,590)) 642 | print('inshape:',in_shape,'outshape:', out_shape) 643 | y = F.concat(*x, dim=1) 644 | e = self.be(y) 645 | in_shape, out_shape, uax_shape = e.infer_shape(data=(2048,590)) 646 | print('inshape:',in_shape,'outshape:', out_shape) 647 | deep = self.ip(e) 648 | order1 = F.sum(e, axis=2) 649 | 650 | # order2_1 = F.square(F.sum(e, axis=1)) 651 | # order2_2 = F.sum(F.square(e), axis=1) 652 | # order2 = 0.5*(order2_1-order2_2) 653 | order2 = [] 654 | for i in range(self.n): 655 | for k in range(i, self.n): 656 | order2.append(x[i]*x[k]) 657 | order2 = F.flatten(F.concat(*order2, dim=1)) 658 | 659 | return self.ba( 660 | F.concat(deep, order1, order2, dim=1)) 661 | 662 | class MyDFU(nn.HybridBlock): 663 | def __init__(self, dims, inC, outC, unit): 664 | super(MyDFU, self).__init__() 665 | 666 | with self.name_scope(): 667 | self.ba = MyBA('relu') 668 | self.be = MyBA('relu') 669 | self.ip = MyIBA(unit, 'relu') 670 | self.ed = MyUD(dims, inC, outC) 671 | 672 | def hybrid_forward(self, F, x): 673 | e = self.be(self.ed(x)) 674 | 675 | deep = self.ip(e) 676 | order1 = F.sum(e, axis=2) 677 | order2_1 = F.square(F.sum(e, axis=1)) 678 | order2_2 = F.sum(F.square(e), axis=1) 679 | order2 = 0.5*(order2_1-order2_2) 680 | 681 | return self.ba( 682 | F.concat(deep, order1, order2, dim=1)) 683 | 684 | class MyDFR(nn.HybridBlock): 685 | def __init__(self, dims, inC, outC, unit): 686 | super(MyDFR, self).__init__() 687 | 688 | with self.name_scope(): 689 | self.ba = MyBA('relu') 690 | self.be = MyBA('relu') 691 | self.ip = MyIBA(unit, 'relu') 692 | self.ed = MyER(dims, inC, outC) 693 | 694 | def hybrid_forward(self, F, x): 695 | e = self.be(self.ed(x)) 696 | 697 | deep = self.ip(e) 698 | order1 = F.sum(e, axis=2) 699 | order2_1 = F.square(F.sum(e, axis=1)) 700 | order2_2 = F.sum(F.square(e), axis=1) 701 | order2 = 0.5*(order2_1-order2_2) 702 | 703 | return self.ba( 704 | F.concat(deep, order1, order2, dim=1)) 705 | 706 | class MyDFZ(nn.HybridBlock): 707 | def __init__(self, dims, inC, outC, unit): 708 | super(MyDFZ, self).__init__() 709 | 710 | with self.name_scope(): 711 | self.n = len(dims) 712 | self.ba = MyBA('relu') 713 | self.be = MyBA('relu') 714 | self.ip = MyIBA(unit, 'relu') 715 | self.ed = MyZB(dims, inC, outC) 716 | 717 | def hybrid_forward(self, F, x): 718 | x = self.ed(x) 719 | y = F.concat(*x, dim=1) 720 | 721 | e = self.be(y) 722 | deep = self.ip(e) 723 | order1 = F.sum(e, axis=2) 724 | 725 | order2 = [] 726 | for i in range(self.n): 727 | for k in range(i+1, self.n): 728 | order2.append(x[i]*x[k]) 729 | order2 = F.flatten(F.concat(*order2, dim=1)) 730 | 731 | return self.ba( 732 | F.concat(deep, order1, order2, dim=1)) 733 | 734 | class MyDFCN(nn.HybridBlock): 735 | def __init__(self, dims, inC, outC, unit): 736 | super(MyDFCN, self).__init__() 737 | 738 | with self.name_scope(): 739 | self.n = n = len(dims) 740 | self.be = MyBA('relu') 741 | self.ba = MyBA('relu') 742 | self.ip = MyIBA(unit, 'relu') 743 | self.ed = MyED(dims, inC, outC*n) 744 | 745 | for i in range(n): 746 | layer = nn.HybridSequential() 747 | layer.add(MyBA('relu')) 748 | layer.add(MyCBA((n,3,1,1), 'relu')) 749 | layer.add(MyCBA((n,3,1,1), 'relu')) 750 | layer.add(MyCBA((n,3,1,1), 'relu')) 751 | setattr( 752 | self, 'MyLayer#%02d'%i, layer 753 | ) 754 | 755 | def hybrid_forward(self, F, x): 756 | e = self.be(self.ed(x)) 757 | 758 | deep = self.ip(e) 759 | 
order1 = F.sum(e, axis=2) 760 | 761 | ss, es = [], F.split(e, self.n, 1) 762 | for s in es: ss.append(F.split(s,self.n,2)) 763 | 764 | order2 = [] 765 | for i in range(self.n): 766 | temp = [] 767 | for k in range(self.n): 768 | temp.append(ss[i][k]*ss[k][i]) 769 | temp = F.concat(*temp, dim=1) 770 | temp = F.swapaxes(temp, 1, 2) 771 | 772 | layer = getattr(self, 'MyLayer#%02d'%i) 773 | temp = F.swapaxes(layer(temp), 1, 2) 774 | order2.append(F.sum(temp, axis=1)) 775 | order2 = F.concat(*order2, dim=1) 776 | 777 | return self.ba( 778 | F.concat(deep, order1, order2, dim=1)) 779 | 780 | class MyDFCM(nn.HybridBlock): 781 | def __init__(self, dims, inC, outC, unit): 782 | super(MyDFCM, self).__init__() 783 | 784 | with self.name_scope(): 785 | self.n = len(dims) 786 | self.ba = MyBA('relu') 787 | self.be = MyBA('relu') 788 | self.ip = MyIBA(unit, 'relu') 789 | self.ed = MyEB(dims, inC, outC) 790 | 791 | def hybrid_forward(self, F, x): 792 | es = self.ed(x) 793 | e = self.be(F.concat(*es, dim=1)) 794 | 795 | deep = self.ip(e) 796 | order1 = F.sum(e, axis=2) 797 | 798 | order2 = [] 799 | for i in range(self.n): 800 | for k in range(i+1, self.n): 801 | order2.append(F.batch_dot( 802 | es[i], es[k], True, False)) 803 | order2 = F.flatten(F.concat(*order2, dim=1)) 804 | 805 | return self.ba( 806 | F.concat(deep, order1, order2, dim=1)) 807 | 808 | 809 | 810 | class MyDIN(nn.HybridBlock): 811 | def __init__(self, dims, inC, outC, unit, unit2=0): 812 | super(MyDIN, self).__init__() 813 | 814 | with self.name_scope(): 815 | self.n = len(dims) 816 | self.ba = MyBA('relu') 817 | self.be = MyBA('relu') 818 | self.ip = MyIBA(unit, 'relu') 819 | self.ed = MyEB(dims, inC, outC) 820 | 821 | unit2 = unit2 or outC 822 | for i in range(self.n): 823 | for k in range(i + 1, self.n): 824 | setattr( 825 | self, 'fc#%02d#%02d' % (i, k), 826 | MyIBA(unit2, 'relu') 827 | ) 828 | 829 | def hybrid_forward(self, F, x): 830 | x = self.ed(x) 831 | y = F.concat(*x, dim=1) 832 | 833 | e = self.be(y) 834 | deep = self.ip(e) 835 | order1 = F.sum(e, axis=2) 836 | 837 | order2 = [] 838 | for i in range(self.n): 839 | for k in range(i + 1, self.n): 840 | xi, xk = x[i], x[k] 841 | fi = F.concat(xi, xi - xk, xk, dim=1) 842 | f = getattr(self, 'fc#%02d#%02d' % (i, k)) 843 | 844 | order2.append(f(fi)) 845 | order2 = F.concat(*order2, dim=1) 846 | 847 | return self.ba( 848 | F.concat(deep, order1, order2, dim=1)) 849 | 850 | class MyDFIN(nn.HybridBlock): 851 | def __init__(self, dims, inC, outC, unit): 852 | super(MyDFIN, self).__init__() 853 | 854 | with self.name_scope(): 855 | self.n = n = len(dims) 856 | self.be = MyBA('relu') 857 | self.ba = MyBA('relu') 858 | self.ip = MyIBA(unit, 'relu') 859 | self.ed = MyED(dims, inC, outC*n) 860 | 861 | for i in range(n): 862 | for k in range(i+1, n): 863 | setattr( 864 | self, 'fc#%02d#%02d'%(i,k), 865 | MyIBA(outC, 'relu') 866 | ) 867 | 868 | def hybrid_forward(self, F, x): 869 | e = self.be(self.ed(x)) 870 | 871 | deep = self.ip(e) 872 | order1 = F.sum(e, axis=2) 873 | 874 | ss, es = [], F.split(e, self.n, 1) 875 | for s in es: ss.append(F.split(s, self.n, 2)) 876 | 877 | order2 = [] 878 | for i in range(self.n): 879 | temp = [] 880 | for k in range(i+1, self.n): 881 | eik, eki = ss[i][k], ss[k][i] 882 | fi = F.concat(eik, eik-eki, eki, dim=1) 883 | f = getattr(self, 'fc#%02d#%02d' % (i,k)) 884 | 885 | temp.append(f(fi)) 886 | if len(temp) == 0: continue 887 | order2.append(F.concat(*temp, dim=1)) 888 | order2 = F.concat(*order2, dim=1) 889 | 890 | return self.ba( 891 | F.concat(deep, 
order1, order2, dim=1)) 892 | 893 | class MyDFFM(nn.HybridBlock): 894 | def __init__(self, dims, inC, outC, unit): 895 | super(MyDFFM, self).__init__() 896 | 897 | with self.name_scope(): 898 | self.n = n = len(dims) 899 | self.be = MyBA('relu') 900 | self.ed = MyED(dims, inC, outC*n) 901 | 902 | def hybrid_forward(self, F, x): 903 | e = self.be(self.ed(x)) 904 | 905 | ss, es = [], F.split(e, self.n, 1) 906 | 907 | for s in es: 908 | ss.append(F.split(s, self.n, 2))#[n*1*8,n*1*8,...] 909 | 910 | 911 | result = [] 912 | for i in range(self.n): 913 | for k in range(i, self.n): 914 | result.append(ss[i][k]*ss[k][i])#ele-wise 915 | #print(te.infer_shape(data=(2048, 590))) 916 | 917 | res = F.concat(*result, dim=1) 918 | print(res.infer_shape(data=(2048, 590))) 919 | order2=F.sum(res, axis=1) 920 | print(order2.infer_shape(data=(2048, 590))) 921 | return order2 922 | 923 | class MyDFFM2(nn.HybridBlock): 924 | def __init__(self, dims, inC, outC, unit): 925 | super(MyDFFM2, self).__init__() 926 | 927 | with self.name_scope(): 928 | self.n = n = len(dims) 929 | self.be = MyBA('relu') 930 | self.ed = MyED(dims, inC, outC*n) 931 | 932 | def hybrid_forward(self, F, x): 933 | e = self.be(self.ed(x)) 934 | 935 | ss, es = [], F.split(e, self.n, 1) 936 | for s in es: 937 | ss.append(F.split(s, self.n, 2)) 938 | 939 | result = [] 940 | for i in range(self.n): 941 | temp = [] 942 | for k in range(self.n): 943 | temp.append(ss[i][k]*ss[k][i]) 944 | temp = F.concat(*temp, dim=1) 945 | result.append(F.sum(temp, axis=1)) 946 | 947 | return F.concat(*result, dim=1) 948 | 949 | class MyDFFM3(nn.HybridBlock): 950 | def __init__(self, dims, inC, outC, unit): 951 | super(MyDFFM3, self).__init__() 952 | 953 | with self.name_scope(): 954 | self.n = n = len(dims) 955 | self.be = MyBA('relu') 956 | self.ba = MyBA('relu') 957 | self.ip = MyIBA(unit, 'relu') 958 | self.ed = MyED(dims, inC, outC*n) 959 | self.dpo = MyDPO() 960 | 961 | def hybrid_forward(self, F, x): 962 | e = self.be(self.ed(x)) 963 | 964 | deep = self.ip(e) 965 | order1 = F.sum(e, axis=2) 966 | 967 | ss, es = [], F.split(e, self.n, 1) 968 | for s in es: ss.append(F.split(s,self.n,2)) 969 | 970 | order2 = [] 971 | for i in range(self.n): 972 | temp = [] 973 | for k in range(self.n): 974 | temp.append(ss[i][k]*ss[k][i]) 975 | temp = F.concat(*temp, dim=1) 976 | order2.append(F.sum(temp, axis=1)) 977 | order2 = F.concat(*order2, dim=1) 978 | 979 | return self.ba( 980 | F.concat(deep, order1, order2, dim=1)) 981 | 982 | ################################################### 983 | 984 | class MyModel(object): 985 | def __init__(self, gpu=0): 986 | self.gpu = gpu 987 | 988 | def getNet(self, ctx): 989 | raise NotImplementedError 990 | 991 | def getModel(self): 992 | return './model' 993 | 994 | def getEpoch(self): 995 | return 1 996 | 997 | def getMetric(self): 998 | raise NotImplementedError 999 | 1000 | def getTrainer(self, params, iters): 1001 | opt = mx.optimizer.SGD( 1002 | wd=1e-3, 1003 | momentum=0.9, 1004 | learning_rate=1e-2, 1005 | lr_scheduler=mx.lr_scheduler.FactorScheduler( 1006 | iters*30, 1e-1, 1e-3 1007 | ) 1008 | ) 1009 | 1010 | return mx.gluon.Trainer(params, opt) 1011 | 1012 | def forDebug(self, out): 1013 | pass 1014 | 1015 | def train(self): 1016 | model = self.getModel() 1017 | if not os.path.exists(model): 1018 | os.mkdir(model) 1019 | 1020 | ctx = mx.cpu(self.gpu) 1021 | net, myLoss = self.getNet(ctx) 1022 | trainI, testI = self.getData(ctx) 1023 | metric, monitor = self.getMetric() 1024 | trainer = self.getTrainer( 1025 | 
net.collect_params(), trainI.iters) 1026 | 1027 | logging.info('') 1028 | result, epochs = 0, self.getEpoch() 1029 | 1030 | for epoch in range(1, epochs+1): 1031 | train_l_sum = mx.nd.array([0], ctx=ctx) 1032 | logging.info('Epoch[%04d] start ...' % epoch) 1033 | 1034 | list(map(lambda x: x.reset(), [trainI, metric])) 1035 | for batch_i, (data, label) in enumerate(trainI): 1036 | with autograd.record(): 1037 | 1038 | out = net.forward(data)#2048,590 1039 | self.forDebug(out) 1040 | loss = myLoss(out, label) 1041 | loss.backward() 1042 | # grads = [p.grad(ctx) for p in net.collect_params().values()] 1043 | # mx.gluon.utils.clip_global_norm( 1044 | # grads, .2 * 5 * data.shape[0]) 1045 | trainer.step(data.shape[0])#trainer.step(batch_size) 1046 | print('train loss:', loss.mean().asnumpy()) 1047 | ############################################### 1048 | # train_l_sum += loss.sum() / data.shape[0] 1049 | # eval_period = 1 1050 | # if batch_i % eval_period == 0 and batch_i > 0: 1051 | # cur_l = train_l_sum / eval_period 1052 | # print('epoch %d, batch %d, train loss %.6f, perplexity %.2f' 1053 | # % (epoch, batch_i, cur_l.asscalar(), 1054 | # cur_l.exp().asscalar())) 1055 | # train_l_sum = mx.nd.array([0], ctx=ctx) 1056 | ############################################# 1057 | metric.update(label, out) 1058 | for name, value in metric.get(): 1059 | logging.info('Epoch[%04d] Train-%s=%f ...', epoch, name, value) 1060 | 1061 | _result = None 1062 | list(map(lambda x: x.reset(), [testI, metric])) 1063 | l_sum = mx.nd.array([0], ctx=ctx) 1064 | n = 0 1065 | for data, label in testI: 1066 | out = net.forward(data) 1067 | l = myLoss(out, label) 1068 | l_sum += l.sum() 1069 | n += l.size 1070 | self.forDebug(out) 1071 | metric.update(label, out) 1072 | print('valid loss:', (l_sum / n).mean().asnumpy()) 1073 | for name, value in metric.get(): 1074 | if name == monitor : _result = value 1075 | logging.info('Epoch[%04d] Validation-%s=%f', epoch, name, value) 1076 | 1077 | if _result > result: 1078 | result = _result 1079 | name = '%s/%04d-%3.3f%%.params' % (model, epoch, result*100) 1080 | net.save_params(name) 1081 | logging.info('Save params to %s ...', name) 1082 | 1083 | logging.info('Epoch[%04d] done ...\n' % epoch) 1084 | 1085 | class MyPredict(object): 1086 | def __init__(self, gpu=0): 1087 | self.gpu = gpu 1088 | 1089 | def getNet(self, ctx): 1090 | raise NotImplementedError 1091 | 1092 | def preProcess(self, data): 1093 | return data 1094 | 1095 | def postProcess(self, data, pData, output): 1096 | raise NotImplementedError 1097 | 1098 | def onDone(self): 1099 | pass 1100 | 1101 | def predict(self): 1102 | ctx = mx.cpu(self.gpu) 1103 | 1104 | net = self.getNet(ctx) 1105 | testI = self.getData(ctx) 1106 | 1107 | logging.info('Start predicting ...') 1108 | 1109 | testI.reset() 1110 | for data in testI: 1111 | pData = self.preProcess(data) 1112 | out = net.forward(pData) 1113 | self.postProcess(data, pData, out) 1114 | 1115 | self.onDone() 1116 | logging.info('Prediction done ...') 1117 | 1118 | ################################################### 1119 | 1120 | class MyLoss(nn.HybridBlock): 1121 | def __init__(self): 1122 | super(MyLoss, self).__init__() 1123 | self.loss = loss.SigmoidBCELoss()#l2 1124 | 1125 | def hybrid_forward(self, F, pred, label): 1126 | sampleWeight = 20*label+(1-label) 1127 | return self.loss(pred, label) 1128 | 1129 | class MyMetric(object): 1130 | def get(self): 1131 | assert len(self.labels) != 0, 'Failed ...' 1132 | assert len(self.outputs) != 0, 'Failed ...' 
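        # update() stores each batch's labels and sigmoid scores; get() stacks them and
        # reports epoch-level AUC (sklearn roc_auc_score) plus accuracy at a 0.5
        # threshold, so the logged numbers reflect the whole epoch rather than one batch.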
1133 | 1134 | labels = np.hstack(self.labels) 1135 | outputs = np.hstack(self.outputs) 1136 | results = ( 1137 | ( 1138 | 'aucMetric', 1139 | self.myAuc(labels, outputs) 1140 | ), 1141 | ( 1142 | 'accMetric', 1143 | self.myAcc(labels, outputs) 1144 | ), 1145 | ) 1146 | 1147 | return results 1148 | 1149 | def myAuc(self, labels, outputs): 1150 | return roc_auc_score(labels, outputs) 1151 | 1152 | def myAcc(self, labels, outputs): 1153 | return np.mean(labels == (outputs>=0.5)) 1154 | 1155 | def reset(self): 1156 | self.labels = [] 1157 | self.outputs = [] 1158 | 1159 | def update(self, label, output): 1160 | label = label.asnumpy() 1161 | output = mx.nd.sigmoid(output) 1162 | output = output.asnumpy()[:,0] 1163 | 1164 | self.labels.append(label) 1165 | self.outputs.append(output) 1166 | 1167 | ################################################### 1168 | 1169 | class MyLoss2(nn.HybridBlock): 1170 | def __init__(self): 1171 | super(MyLoss2, self).__init__() 1172 | self.loss = loss.SoftmaxCELoss() 1173 | 1174 | def hybrid_forward(self, F, pred, label): 1175 | sampleWeight = 20*label+(1-label) 1176 | return self.loss(pred, label, sampleWeight) 1177 | 1178 | class MyMetric2(object): 1179 | def get(self): 1180 | assert len(self.labels) != 0, 'Failed ...' 1181 | assert len(self.outputs) != 0, 'Failed ...' 1182 | 1183 | labels = np.hstack(self.labels) 1184 | outputs = np.hstack(self.outputs) 1185 | results = ( 1186 | ( 1187 | 'aucMetric', 1188 | self.myAuc(labels, outputs) 1189 | ), 1190 | ( 1191 | 'accMetric', 1192 | self.myAcc(labels, outputs) 1193 | ), 1194 | ) 1195 | 1196 | return results 1197 | 1198 | def myAuc(self, labels, outputs): 1199 | return roc_auc_score(labels, outputs) 1200 | 1201 | def myAcc(self, labels, outputs): 1202 | return np.mean(labels == (outputs>=0.5)) 1203 | 1204 | def reset(self): 1205 | self.labels = [] 1206 | self.outputs = [] 1207 | 1208 | def update(self, label, output): 1209 | label = label.asnumpy() 1210 | output = mx.nd.softmax(output) 1211 | output = output.asnumpy()[:,1] 1212 | 1213 | self.labels.append(label) 1214 | self.outputs.append(output) 1215 | 1216 | ################################################### 1217 | 1218 | class MyLoss3(nn.HybridBlock): 1219 | def __init__(self): 1220 | super(MyLoss3, self).__init__() 1221 | self.loss1 = loss.SoftmaxCELoss() 1222 | self.loss2 = loss.SigmoidBCELoss() 1223 | 1224 | def hybrid_forward(self, F, pred, label): 1225 | sampleWeight = 20*label+(1-label) 1226 | 1227 | pred1 = F.slice_axis(pred, 1, 0, 2) 1228 | pred2 = F.slice_axis(pred, 1, 2, 3) 1229 | 1230 | loss1 = self.loss1(pred1, label, sampleWeight) 1231 | loss2 = self.loss2(pred2, label, sampleWeight) 1232 | 1233 | return 0.5 * (loss1 + loss2) 1234 | 1235 | class MyMetric3(object): 1236 | def get(self): 1237 | assert len(self.labels) != 0, 'Failed ...' 1238 | assert len(self.outputs) != 0, 'Failed ...' 
1239 | 1240 | labels = np.hstack(self.labels) 1241 | outputs = np.hstack(self.outputs) 1242 | results = ( 1243 | ( 1244 | 'aucMetric', 1245 | self.myAuc(labels, outputs) 1246 | ), 1247 | ( 1248 | 'accMetric', 1249 | self.myAcc(labels, outputs) 1250 | ), 1251 | ) 1252 | 1253 | return results 1254 | 1255 | def myAuc(self, labels, outputs): 1256 | return roc_auc_score(labels, outputs) 1257 | 1258 | def myAcc(self, labels, outputs): 1259 | return np.mean(labels == (outputs>=0.5)) 1260 | 1261 | def reset(self): 1262 | self.labels = [] 1263 | self.outputs = [] 1264 | 1265 | def update(self, label, output): 1266 | label = label.asnumpy() 1267 | 1268 | output1 = output[:, 0:2] 1269 | output2 = output[:, 2:3] 1270 | 1271 | output1 = mx.nd.softmax(output1) 1272 | output1 = output1.asnumpy()[:,1] 1273 | 1274 | output2 = mx.nd.sigmoid(output2) 1275 | output2 = output2.asnumpy()[:,0] 1276 | 1277 | output = 0.5*(output1 + output2) 1278 | 1279 | self.labels.append(label) 1280 | self.outputs.append(output) 1281 | 1282 | ################################################### 1283 | 1284 | -------------------------------------------------------------------------------- /picture/top1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ColaDrill/2018spa/24bf9f906e06ee6cf58da3f6d8248dddc164ece2/picture/top1.png -------------------------------------------------------------------------------- /picture/top3.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ColaDrill/2018spa/24bf9f906e06ee6cf58da3f6d8248dddc164ece2/picture/top3.pdf -------------------------------------------------------------------------------- /test.py: -------------------------------------------------------------------------------- 1 | import multiprocessing #导入多进程模块,实现多进程 2 | import os #验证,打印进程的id 3 | import numpy as np #这个模块是为了实现矩阵乘法 4 | 5 | import pandas as pd 6 | 7 | from sklearn.preprocessing import OneHotEncoder,LabelEncoder 8 | # data = pd.read_csv('./datas/trainMerged.csv') 9 | # for concat_feat in data.columns: 10 | # if data[concat_feat].dtype=='float': 11 | # try: 12 | # print(concat_feat) 13 | # data[concat_feat] = LabelEncoder().fit_transform((10000000*data[concat_feat]).apply(float)) 14 | # except: 15 | # print('str???????????????????????') 16 | # data[concat_feat] = LabelEncoder().fit_transform(data[concat_feat].astype(str)) 17 | # data.to_csv('./datas/trainMerged.csv',index=False) 18 | # 19 | # 20 | # print('ok') 21 | # 22 | # temp = OrderedDict( 23 | # sorted( 24 | # count['uid'].items(), # return the tuple of dict 25 | # reverse=1, key=lambda i: i[1], 26 | # ) 27 | # ) 28 | # 29 | # df = pd.DataFrame(temp) 30 | # print(df.describe()) 31 | 32 | # 33 | # df = pd.read_csv('./submission.csv') 34 | # test = pd.read_csv('./test1.csv') 35 | # test = pd.merge(test,df,on=['uid','aid'],how='left') 36 | # test.fillna(0,inplace=True) 37 | # test.to_csv('./submission1.csv',index=False, float_format='%.8f') 38 | 39 | def read_raw_data(path, max_num): 40 | f = open(path, 'r') 41 | features = f.readline() 42 | features = features.strip().split(',') 43 | dict = {} 44 | num = 0 45 | for line in f: 46 | if num >= max_num: 47 | break 48 | datas = line.strip().split(',') 49 | for i, d in enumerate(datas): 50 | if not dict.__contains__(features[i]): 51 | dict[features[i]] = [] 52 | dict[features[i]].append(d) 53 | num += 1 54 | f.close() 55 | return dict, num 56 | 57 | nums = 1000 58 | train_dict, train_num = 
read_raw_data('C:\\Users\\Administrator\\Downloads\\TencentSPA02-PreA-master\\TencentSPA02-PreA-master\\datas\\trainMerged1.csv', nums) 59 | 60 | def count_combine_feature_times(train_data_1, train_data_2, test1_data_1, test1_data_2, test2_data_1, test2_data_2): 61 | total_dict = {} 62 | count_dict = {} 63 | for i, d in enumerate(train_data_1): 64 | xs1 = d.split(' ') 65 | xs2 = train_data_2[i].split(',') 66 | for x1 in xs1: 67 | for x2 in xs2: 68 | ke = x1+'|'+x2 69 | if not total_dict.__contains__(ke): 70 | total_dict[ke] = 0 71 | total_dict[ke] += 1 72 | for i, d in enumerate(test1_data_1): 73 | xs1 = d.split(' ') 74 | xs2 = test1_data_2[i].split(',') 75 | for x1 in xs1: 76 | for x2 in xs2: 77 | ke = x1+'|'+x2 78 | if not total_dict.__contains__(ke): 79 | total_dict[ke] = 0 80 | total_dict[ke] += 1 81 | for i, d in enumerate(test2_data_1): 82 | xs1 = d.split(' ') 83 | xs2 = test2_data_2[i].split(',') 84 | for x1 in xs1: 85 | for x2 in xs2: 86 | ke = x1+'|'+x2 87 | if not total_dict.__contains__(ke): 88 | total_dict[ke] = 0 89 | total_dict[ke] += 1 90 | for key in total_dict: 91 | if not count_dict.__contains__(total_dict[key]): 92 | count_dict[total_dict[key]] = 0 93 | count_dict[total_dict[key]] += 1 94 | 95 | train_res = [] 96 | for i, d in enumerate(train_data_1): 97 | t = [] 98 | xs1 = d.split(' ') 99 | xs2 = train_data_2[i].split(',') 100 | for x1 in xs1: 101 | for x2 in xs2: 102 | ke = x1 + '|' + x2 103 | t.append(total_dict[ke]) 104 | train_res.append(max(t)) 105 | test1_res = [] 106 | for i, d in enumerate(test1_data_1): 107 | t = [] 108 | xs1 = d.split(' ') 109 | xs2 = test1_data_2[i].split(',') 110 | for x1 in xs1: 111 | for x2 in xs2: 112 | ke = x1 + '|' + x2 113 | t.append(total_dict[ke]) 114 | test1_res.append(max(t)) 115 | test2_res = [] 116 | for i, d in enumerate(test2_data_1): 117 | t = [] 118 | xs1 = d.split(' ') 119 | xs2 = test2_data_2[i].split(',') 120 | for x1 in xs1: 121 | for x2 in xs2: 122 | ke = x1 + '|' + x2 123 | t.append(total_dict[ke]) 124 | test2_res.append(max(t)) 125 | return np.array(train_res), np.array(test1_res), np.array(test2_res), count_dict 126 | 127 | def gen_count_dict(data, labels, begin, end): 128 | total_dict = {} 129 | pos_dict = {} 130 | for i, d in enumerate(data): 131 | if i >= begin and i < end: 132 | continue 133 | xs = d.split(' ') 134 | for x in xs: 135 | if not total_dict.__contains__(x): 136 | total_dict[x] = 0 137 | if not pos_dict.__contains__(x): 138 | pos_dict[x] = 0 139 | total_dict[x] += 1 140 | if labels[i] == '1': 141 | pos_dict[x] += 1 142 | return total_dict, pos_dict 143 | 144 | def count_pos_feature(train_data, test1_data, test2_data, labels, k, test_only= False, is_val = False): 145 | nums = len(train_data) 146 | last = nums 147 | if is_val: 148 | last = nums-4739700 149 | assert last > 0 150 | interval = last // k 151 | split_points = [] 152 | for i in range(k): 153 | split_points.append(i * interval) 154 | split_points.append(last) 155 | count_train_data = train_data[0:last] 156 | count_labels = labels[0:last] 157 | 158 | train_res = [] 159 | if not test_only: 160 | for i in range(k): 161 | print( i,"part counting") 162 | print( split_points[i], split_points[i+1]) 163 | tmp = [] 164 | total_dict, pos_dict = gen_count_dict(count_train_data, count_labels, split_points[i],split_points[i+1]) 165 | for j in range(split_points[i],split_points[i+1]): 166 | xs = train_data[j].split(' ') 167 | t = [] 168 | for x in xs: 169 | if not pos_dict.__contains__(x): 170 | t.append(0) 171 | continue 172 | t.append(pos_dict[x] + 1) 173 
| tmp.append(max(t)) 174 | train_res.extend(tmp) 175 | 176 | total_dict, pos_dict = gen_count_dict(count_train_data, count_labels, 1, 0) 177 | count_dict = {-1:0} 178 | for key in pos_dict: 179 | if not count_dict.__contains__(pos_dict[key]): 180 | count_dict[pos_dict[key]] = 0 181 | count_dict[pos_dict[key]] += 1 182 | 183 | if is_val: 184 | for i in range(last, nums): 185 | xs = train_data[i].split(' ') 186 | t = [] 187 | for x in xs: 188 | if not total_dict.__contains__(x): 189 | t.append(0) 190 | continue 191 | t.append(pos_dict[x] + 1) 192 | train_res.append(max(t)) 193 | 194 | test1_res = [] 195 | for d in test1_data: 196 | xs = d.split(' ') 197 | t = [] 198 | for x in xs: 199 | if not pos_dict.__contains__(x): 200 | t.append(0) 201 | continue 202 | t.append(pos_dict[x] + 1) 203 | test1_res.append(max(t)) 204 | 205 | test2_res = [] 206 | for d in test2_data: 207 | xs = d.split(' ') 208 | t = [] 209 | for x in xs: 210 | if not pos_dict.__contains__(x): 211 | t.append(0) 212 | continue 213 | t.append(pos_dict[x] + 1) 214 | test2_res.append(max(t)) 215 | 216 | return np.array(train_res), np.array(test1_res), np.array(test2_res), count_dict 217 | 218 | def combine_to_one(data1, data2): 219 | assert len(data1) == len(data2) 220 | new_res = [] 221 | for i, d in enumerate(data1): 222 | x1 = data1[i] 223 | x2 = data2[i] 224 | new_x = x1 + '|' + x2 225 | new_res.append(new_x) 226 | return new_res 227 | 228 | def uid_seq_feature(train_data, test1_data, test2_data, label): 229 | count_dict = {}#存 该id: 追加这次出现的label 230 | seq_dict = {}#存序列字典 :该种序列出现次数 231 | seq_emb_dict = {}#存该序列的key 232 | train_seq = []#存每个学列的index 233 | ind = 0 234 | for i, d in enumerate(train_data): 235 | if not count_dict.__contains__(d): 236 | count_dict[d] = [] 237 | seq_key = ' '.join(count_dict[d][-4:]) 238 | if not seq_dict.__contains__(seq_key): 239 | seq_dict[seq_key] = 0 240 | seq_emb_dict[seq_key] = ind 241 | ind += 1 242 | seq_dict[seq_key] += 1 243 | train_seq.append(seq_emb_dict[seq_key]) 244 | count_dict[d].append(label[i]) 245 | test1_seq = [] 246 | for d in test1_data: 247 | if not count_dict.__contains__(d): 248 | seq_key = '' 249 | else: 250 | seq_key = ' '.join(count_dict[d][-4:]) 251 | if seq_emb_dict.__contains__(seq_key): 252 | key = seq_emb_dict[seq_key] 253 | else: 254 | key = 0 255 | test1_seq.append(key) 256 | test2_seq = [] 257 | for d in test2_data: 258 | if not count_dict.__contains__(d): 259 | seq_key = '' 260 | else: 261 | seq_key = ' '.join(count_dict[d][-4:]) 262 | if seq_emb_dict.__contains__(seq_key): 263 | key = seq_emb_dict[seq_key] 264 | else: 265 | key = 0 266 | test2_seq.append(key) 267 | 268 | return np.array(train_seq), np.array(test1_seq), np.array(test2_seq), seq_emb_dict 269 | 270 | 271 | #train_res, test1_res, test2_res, f_dict = uid_seq_feature(train_dict['uid'], train_dict['uid'], 272 | # train_dict['uid'], train_dict['label']) 273 | 274 | # new_train_data = combine_to_one(train_dict['uid'], train_dict['aid']) 275 | # train_res, test1_res, test2_res, f_dict = count_pos_feature(new_train_data, new_train_data, 276 | # new_train_data, train_dict['label'], 5, 277 | # is_val=False) 278 | 279 | from collections import * 280 | from tqdm import tqdm 281 | def gen_pos_neg_aid_fea(): 282 | train_data = pd.read_csv('C:\\Users\\Administrator\\Downloads\\TencentSPA02-PreA-master\\TencentSPA02-PreA-master\\datas\\trainMerged1.csv') 283 | test2_data = pd.read_csv('C:\\Users\\Administrator\\Downloads\\TencentSPA02-PreA-master\\TencentSPA02-PreA-master\\datas\\trainMerged1.csv') 284 | 285 | 
train_user = train_data.uid.unique() 286 | 287 | # user-aid dict 288 | uid_dict = defaultdict(list) 289 | for row in tqdm(train_data.itertuples(), total=len(train_data)): 290 | uid_dict[row[2]].append([row[1], row[3]]) 291 | 292 | # user convert 293 | uid_convert = {} 294 | for uid in tqdm(train_user): 295 | pos_aid, neg_aid = [], [] 296 | for data in uid_dict[uid]: 297 | if data[1] > 0: 298 | pos_aid.append(data[0]) 299 | else: 300 | neg_aid.append(data[0]) 301 | uid_convert[uid] = [pos_aid, neg_aid] 302 | 303 | test2_neg_pos_aid = {} 304 | for row in tqdm(test2_data.itertuples(), total=len(test2_data)): 305 | aid = row[1] 306 | uid = row[2] 307 | if uid_convert.get(uid, []) == []: 308 | test2_neg_pos_aid[row[0]] = ['', '', -1] 309 | else: 310 | pos_aid, neg_aid = uid_convert[uid][0].copy(), uid_convert[uid][1].copy() 311 | convert = len(pos_aid) / (len(pos_aid) + len(neg_aid)) if (len(pos_aid) + len(neg_aid)) > 0 else -1 312 | test2_neg_pos_aid[row[0]] = [' '.join(map(str, pos_aid)), ' '.join(map(str, neg_aid)), convert] 313 | df_test2 = pd.DataFrame.from_dict(data=test2_neg_pos_aid, orient='index') 314 | df_test2.columns = ['pos_aid', 'neg_aid', 'uid_convert'] 315 | 316 | train_neg_pos_aid = {} 317 | for row in tqdm(train_data.itertuples(), total=len(train_data)): 318 | aid = row[1] 319 | uid = row[2] 320 | pos_aid, neg_aid = uid_convert[uid][0].copy(), uid_convert[uid][1].copy() 321 | if aid in pos_aid: 322 | pos_aid.remove(aid) 323 | if aid in neg_aid: 324 | neg_aid.remove(aid) 325 | convert = len(pos_aid) / (len(pos_aid) + len(neg_aid)) if (len(pos_aid) + len(neg_aid)) > 0 else -1 326 | train_neg_pos_aid[row[0]] = [' '.join(map(str, pos_aid)), ' '.join(map(str, neg_aid)), convert] 327 | 328 | df_train = pd.DataFrame.from_dict(data=train_neg_pos_aid, orient='index') 329 | df_train.columns = ['pos_aid', 'neg_aid', 'uid_convert'] 330 | 331 | df_train.to_csv("dataset/train_neg_pos_aid.csv", index=False) 332 | df_test2.to_csv("dataset/test2_neg_pos_aid.csv", index=False) 333 | 334 | def gen_uid_aid_fea(): 335 | ''' 336 | 载入数据, 提取aid, uid的全局统计特征 337 | ''' 338 | train_data = pd.read_csv('C:\\Users\\Administrator\\Downloads\\TencentSPA02-PreA-master\\TencentSPA02-PreA-master\\datas\\trainMerged1.csv') 339 | test1_data = pd.read_csv('C:\\Users\\Administrator\\Downloads\\TencentSPA02-PreA-master\\TencentSPA02-PreA-master\\datas\\trainMerged1.csv') 340 | test2_data = pd.read_csv('C:\\Users\\Administrator\\Downloads\\TencentSPA02-PreA-master\\TencentSPA02-PreA-master\\datas\\trainMerged1.csv') 341 | 342 | ad_Feature = pd.read_csv('C:\\Users\\Administrator\\Downloads\\TencentSPA02-PreA-master\\TencentSPA02-PreA-master\\datas\\trainMerged1.csv') 343 | 344 | train_len = len(train_data) # 45539700 345 | test1_len = len(test1_data) 346 | test2_len = len(test2_data) # 11727304 347 | 348 | ad_Feature = pd.merge(ad_Feature, ad_Feature.groupby(['campaignId']).aid.nunique().reset_index( 349 | ).rename(columns={'aid': 'campaignId_aid_nunique'}), how='left', on='campaignId') 350 | 351 | df = pd.concat([train_data, test1_data, test2_data], axis=0) 352 | df = pd.merge(df, df.groupby(['uid'])['aid'].nunique().reset_index().rename( 353 | columns={'aid': 'uid_aid_nunique'}), how='left', on='uid') 354 | 355 | df = pd.merge(df, df.groupby(['aid'])['uid'].nunique().reset_index().rename( 356 | columns={'uid': 'aid_uid_nunique'}), how='left', on='aid') 357 | 358 | df['uid_count'] = df.groupby('uid')['aid'].transform('count') 359 | df = pd.merge(df, ad_Feature[['aid', 'campaignId_aid_nunique']], how='left', on='aid') 
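    # Global statistical features built above, counted over train + test1 + test2
    # concatenated so train and test share the same value space:
    #   campaignId_aid_nunique - how many distinct ads the ad's campaign contains
    #   uid_aid_nunique        - how many distinct ads this user appears with
    #   aid_uid_nunique        - how many distinct users this ad was shown to
    #   uid_count              - total number of rows for this user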
360 | 361 | fea_columns = ['campaignId_aid_nunique', 'uid_aid_nunique', 'aid_uid_nunique', 'uid_count', ] 362 | 363 | df[fea_columns].iloc[:train_len].to_csv('dataset/train_uid_aid.csv', index=False) 364 | df[fea_columns].iloc[train_len: train_len+test1_len].to_csv('dataset/test1_uid_aid.csv', index=False) 365 | df[fea_columns].iloc[-test2_len:].to_csv('dataset/test2_uid_aid.csv', index=False) 366 | 367 | ad_Feature = pd.read_csv('C:\\Users\\Administrator\\Downloads\\TencentSPA02-PreA-master\\TencentSPA02-PreA-master\\datas\\trainMerged1.csv') 368 | 369 | te = ad_Feature.groupby(['campaignId']) 370 | 371 | te = te.aid 372 | te = te.nunique().reset_index().rename(columns={'aid': 'campaignId_aid_nunique'}) 373 | print('ok') 374 | # ad_Feature = pd.merge(ad_Feature, ad_Feature.groupby(['campaignId']).aid.nunique().reset_index( 375 | # ).rename(columns={'aid': 'campaignId_aid_nunique'}), how='left', on='campaignId') -------------------------------------------------------------------------------- /xlearn_ffm.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | import xlearn as xl 4 | import pandas as pd 5 | 6 | ffm_model=xl.create_ffm() 7 | #ffm_model=xl.setTrain("ffm_train.txt") 8 | #ffm_model.setValidate("ffm_val.txt") 9 | ffm_model.setTrain("./data/train.ffm") 10 | ffm_model.setTest("./data/test.ffm") 11 | ffm_model.setSigmoid() 12 | # ffm_model.setSign() 13 | 14 | param={'task':'binary','lr':0.05,'lambda':0.00002,'metric':'auc','k':8,'epoch':22,'stop_window':3} 15 | 16 | ffm_model.fit(param,"./data/xlearn_result/model.out") 17 | ffm_model.predict("./data/xlearn_result/model.out","./data/xlearn_result/output.txt") 18 | 19 | 20 | #ffm_model.cv(param) 21 | #ffm_model.fit(param,"ffm_result.txt") 22 | 23 | result=pd.read_table("./data/xlearn_result/output.txt",header=None) 24 | test=pd.read_csv("./data/test1.csv") 25 | test['score']=result[0] 26 | test.to_csv('./data/xlearn_result/submission_xlearn.csv',index=False) --------------------------------------------------------------------------------