├── .gitignore
├── 2017的腾讯广告社交赛TOP10PPT.zip
├── 2018腾讯算法大赛参赛手册-V4.pdf
├── README.md
├── data.py
├── feature.py
├── ffm.py
├── ffmdata.py
├── forRank.py
├── forVal.py
├── imformation
│   └── README.md
├── lgb.py
├── nn.py
├── nnHelper.py
├── picture
│   ├── top1.png
│   └── top3.pdf
├── test.py
└── xlearn_ffm.py

--------------------------------------------------------------------------------
/.gitignore:
--------------------------------------------------------------------------------
/.idea
/data
/datas
/__pycache__

--------------------------------------------------------------------------------
/2017的腾讯广告社交赛TOP10PPT.zip:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ColaDrill/2018spa/24bf9f906e06ee6cf58da3f6d8248dddc164ece2/2017的腾讯广告社交赛TOP10PPT.zip

--------------------------------------------------------------------------------
/2018腾讯算法大赛参赛手册-V4.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/ColaDrill/2018spa/24bf9f906e06ee6cf58da3f6d8248dddc164ece2/2018腾讯算法大赛参赛手册-V4.pdf

--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
The code analysis in this write-up has reached the point of reproducing the top-3 attention mechanism and is still being updated (the code is rough for now; it will gradually be refactored into classes and generalized into a reusable pipeline template).
We finished 29th out of roughly 1,500 teams in the 2018 Tencent Algorithm Competition. This post summarizes the top-10 solutions; it was my first big-data (machine learning) competition, so it is written as a beginner's primer.

Most Kaggle competitions can be won with feature engineering plus GBDT, but the Tencent competition is won on models, so it puts high demands on hardware and on DNN research.
Operations that papers and our experiments have so far shown to be effective:
* MSRAPrelu initialization from Microsoft Research Asia
* attention over multi-value features
* adaptive regularization
* MLPs in both the wide and the deep (vertical) direction
* learning-rate decay
* batch normalization, dropout
* focal loss (see the Gluon sketch further below)
@ GPU-optimized matrix multiplication can speed training up

Operations planned for future experiments:
# the Dice activation function is being tested
# log MVM is said to help
# try GAP (global average pooling) in place of the fully connected layers
In essence, the LR part mines first-order features, the FFM layer second-order ones, and the interest and MVM layers higher-order ones, so that the model digs out as much hidden information from the features as possible, with the MLP assigning the weights.
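The focal loss mentioned in the list above can be written as a small Gluon loss, since the repo's NN code (nn.py / nnHelper.py) is Gluon-based. This is a minimal illustrative sketch, not code from this repository; the class name and the default `alpha`/`gamma` values are assumptions.

```python
import mxnet as mx
from mxnet.gluon import loss


class FocalLoss(loss.Loss):
    """Binary focal loss on raw logits: -alpha * (1 - p_t)^gamma * log(p_t)."""

    def __init__(self, alpha=0.25, gamma=2.0, weight=None, batch_axis=0, **kwargs):
        super(FocalLoss, self).__init__(weight, batch_axis, **kwargs)
        self._alpha = alpha
        self._gamma = gamma

    def hybrid_forward(self, F, pred, label):
        label = F.reshape_like(label, pred)         # match label shape to logits
        p = F.sigmoid(pred)                         # predicted click probability
        p_t = p * label + (1 - p) * (1 - label)     # probability of the true class
        w = self._alpha * (1 - p_t) ** self._gamma  # down-weights easy examples
        return -F.mean(w * F.log(p_t + 1e-12), axis=self._batch_axis, exclude=True)
```

With `alpha=1` and `gamma=0` this reduces to the ordinary logistic loss, which makes it easy to compare against the existing setup when experimenting with the heavy class imbalance.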
Related papers:
Deep Interest Network: https://arxiv.org/pdf/1706.06978.pdf

First place this year went to Ge Wenqiang's team with an NN of their own design, similar to their IJCAI 2018 model; a picture of the model is at:

./picture/top1.png
Third place was a DEEP INTEREST FFM created by Tsinghua heavyweights (from the "Chayuan" and CS departments); the PPT poster is at:

./picture/top3.pdf
Most of the other leading models are small modifications of the NFFM model shared by guoday and nzc; that model is the champion model of the NJU Panda team from the first (2017) competition. GitHub:

guoday: https://github.com/guoday/Tencent2018_Lookalike_Rank7th (his xDeepFM is said to work well; I have not had time to study it yet)
nzc: https://github.com/nzc/tencent-contest
Links to top-20 experience posts:

Official baseline:
https://github.com/YouChouNoBB/2018-tencent-ad-competition-baseline

[腾讯广告算法赛开源-Top4-SDD](https://zhuanlan.zhihu.com/p/42089584)
https://blog.csdn.net/ML_SDD/article/details/81702168
[2018腾讯广告算法大赛总结(Rank6)-模型篇](https://zhuanlan.zhihu.com/p/38443751)
[第二届腾讯广告算法大赛总结(Rank 9)](https://zhuanlan.zhihu.com/p/38499275?utm_source=com.tencent.tim&utm_medium=social&utm_oi=571282483765710848)
[2018腾讯广告算法大赛Top10-特征工程](https://zhuanlan.zhihu.com/p/40479648)
[2018腾讯广告算法大赛总结/0.772229/Rank11](https://zhuanlan.zhihu.com/p/38034501)
[腾讯社交广告算法大赛(Top15)总结-特征工程](https://zhuanlan.zhihu.com/p/39491062)

All the models above are neural-network models: when the data volume is very large and the hardware limits are tight, NN models suit the scenario well (CTR NN primers are linked further down). This year's data was huge, and the final round demanded a lot of GPU and memory — roughly 400 GB of RAM. I will put the small dataset I used for studying and testing after the competition under the directory below; it runs on a 16 GB laptop:

./datas/

Competition details:

./2018腾讯算法大赛参赛手册

A big-data (machine learning) competition roughly follows the workflow below, with the goal of scoring high on the official metric (e.g. AUC, logloss):

1. build features
2. feed them to models
3. ensemble

Primers on the relevant basics:

./imformation/

Features ---- in this competition everyone's features were much the same:

Code for the common kinds of feature extraction is in ./feature.py (note the K-fold handling of the conversion-rate features and the Bayesian smoothing; the details are easy to look up).
How to mine features? See nzc's post: https://zhuanlan.zhihu.com/p/38341881

PS: it always depends on the specific case. Sometimes you can select features from the importances exposed by LightGBM's API, sometimes you need to look at the correlation with the label — but beware of leakage (using future data to predict the past, or the label to predict the label, which makes the offline score very high and the online score very low); K-fold schemes help avoid it. None of this proves a feature is useful or useless; the most reliable check is simply to add it and see whether the online score improves, but then you will run out of time, and the final round may bring different strong features anyway. The best way to learn feature engineering is to read the PPTs and code of past top-10 teams; reading too much feature-engineering theory wastes a lot of time. Basic exploratory visualization is covered in the primers, but after the competition I realized that what the strong players analyze is completely different from what a beginner like me analyzes, so I am still figuring that part out.

Models ---- this is a CTR-style competition; the usual models are tree models such as LightGBM, XGBoost and CatBoost, FFM models, and a series of more complex deep NN models.

Ensembling ---- I used weighted averaging plus the NTU team's inverse-sigmoid blending, trying the various combinations.

Roughly, the blending is done pairwise in a Huffman-like order: apply the inverse sigmoid to each single submission's online predictions, then interleave that with plain weighted averaging, because sometimes the weighted average beats the inverse-sigmoid blend; a more detailed write-up will follow.
Code: ./*********.py

Preliminary round:
In the preliminary round the effort should go into feature engineering; Microsoft's LightGBM is very efficient, so build as many features as you can and happily feed them to LGB.

Final round:
Let's focus on the final round. As noted above, it has a much larger dataset and tight hardware limits. At first everyone considered DeepFM; then guoday and nzc open-sourced NFFM, so people turned to studying NN models. Here are some CTR NN primers:

nzc on Zhihu: https://zhuanlan.zhihu.com/p/32885978
Shi Xiaowen on Jianshu: https://www.jianshu.com/nb/21403842
Xin Junbo on Zhihu: https://zhuanlan.zhihu.com/p/35484389

Briefly, DeepFFM concatenates the deep part and the FFM part side by side (in the wide direction), while NFFM works serially, feeding the FFM layer's output into the dense layers. Why it works so well is still not fully explained; one teacher's answer was that different environments and scenarios favour different models. The top-3 interest network with its attention mechanism is easier to interpret: for the multi-value features in this competition everyone used average pooling (similar to max pooling), but your interest weights must differ from ad to ad — if you click a lipstick ad, the Air Jordan styles you like should not carry much weight — so attention is used to assign the weights dynamically (a small Gluon sketch of this attention pooling is given at the end of this section). In that network LR expresses first-order features and FFM second-order features, and the top-1 team additionally used an MVM with a log transform to express higher-order features. The remaining details are in the paper:

Alibaba's DIN network: https://arxiv.org/pdf/1706.06978.pdf

The code here originally borrowed from an anonymous player's preliminary-round code in MXNet's new Gluon framework. I actually like this Apache framework that mixes imperative and symbolic styles, but some of its APIs are maintained rather slowly, so I am migrating to TensorFlow and PyTorch (sadly). I am currently trying to reproduce the top-1 model and to move the data processing to Spark, and will eventually distill everything into a convenient general-purpose template — continuously updated...
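To make the attention idea above concrete, here is a minimal Gluon sketch of ad-aware attention pooling over one multi-value feature (e.g. an interest field) in place of plain average pooling. It is illustrative only — the block name, the small scoring MLP, and the tensor shapes are assumptions, not the competition code.

```python
import mxnet as mx
from mxnet.gluon import nn


class AttentionPool(nn.HybridBlock):
    """Weight each interest embedding by its relevance to the candidate ad."""

    def __init__(self, hidden=16, **kwargs):
        super(AttentionPool, self).__init__(**kwargs)
        with self.name_scope():
            self.score = nn.HybridSequential()
            self.score.add(nn.Dense(hidden, activation='relu', flatten=False))
            self.score.add(nn.Dense(1, flatten=False))

    def hybrid_forward(self, F, interests, ad, mask):
        # interests: (batch, T, d) embeddings of one multi-value feature
        # ad:        (batch, d)    embedding of the candidate ad
        # mask:      (batch, T)    1 for real values, 0 for padding
        ad_tiled = F.broadcast_add(F.expand_dims(ad, axis=1), F.zeros_like(interests))
        pair = F.concat(interests, ad_tiled, interests * ad_tiled, dim=2)
        logits = F.squeeze(self.score(pair), axis=2)                # (batch, T)
        logits = F.where(mask, logits, -1e9 * F.ones_like(logits))  # mask out padding
        weights = F.softmax(logits, axis=-1)                        # attention weights
        pooled = F.sum(F.broadcast_mul(F.expand_dims(weights, axis=2), interests), axis=1)
        return pooled                                               # (batch, d)
```

Each multi-value field (interest1, kw1, topic1, ...) would get its own pooled vector this way before being fed to the MLP, which is essentially the mechanism described in the DIN paper linked above.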
93 | 94 | 95 | 另外一些踩过的坑与大家分享: 96 | 97 | # 特征清洗很重要,不然要重做特征很多次,做大量特征前要小批量验证代码正确,不然有BUG就很浪费时间 98 | #内存不够的时候read_csv数据类型别用默认的64位 99 | #del 后的内存要和谐的collect()一下 100 | -------------------------------------------------------------------------------- /data.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # encoding=utf8 3 | import gc 4 | import os 5 | import csv 6 | import json 7 | import random 8 | import hashlib 9 | import numpy as np 10 | import pandas as pd 11 | from scipy import sparse 12 | from collections import Counter 13 | from scipy.sparse import csr_matrix 14 | from collections import defaultdict 15 | from collections import OrderedDict 16 | 17 | User = ['appIdAction','appIdInstall','interest1','interest2','interest3','interest4','interest5','kw1','kw2','kw3','topic1','topic2','topic3','ct','marriageStatus','os'] 18 | Ad = ['aid'] 19 | User_Ad_idx = './datas/features/UA_idx.txt' 20 | ############################################ 21 | 22 | def mkdir(folder): 23 | folder = './datas/features/%s' % folder 24 | if not os.path.exists(folder): 25 | os.makedirs(folder) 26 | return folder 27 | import gc 28 | def getMerged(*keys, **kwargs): 29 | kind = kwargs.get('kind', 0) 30 | name1 = './datas/trainMerged.csv' 31 | name2 = './datas/test1Merged.csv' 32 | 33 | for name in [name1, name2]: 34 | if os.path.exists(name): continue 35 | print( "@@@@@", name) 36 | ad = pd.read_csv('./datas/adFeature.csv') 37 | if not os.path.exists(name.replace('Merged', '')): 38 | continue 39 | df = pd.read_csv(name.replace('Merged', '')) 40 | user = pd.read_csv('./datas/userFeature.csv') 41 | 42 | notUsed = ['appIdAction', 'appIdInstall'] 43 | user.drop(notUsed, axis=1, inplace=True) 44 | 45 | df = df.merge(ad).merge(user) 46 | if 'label' not in df: df['label'] = 0 47 | 48 | columns = sorted(df.columns.tolist()) 49 | columns.remove('label'); columns.append('label') 50 | df.to_csv(name, index=False, columns=columns) 51 | print(' ok ',name) 52 | del df,user 53 | gc.collect() 54 | print('merge ok') 55 | # for name in [name1, name2]: 56 | # df = pd.read_csv(name) 57 | # notUsed = ['appIdAction', 'appIdInstall'] 58 | # if notUsed in df.columns: 59 | # df.drop(notUsed, axis=1, inplace=True) 60 | # if 'label' not in df: df['label'] = 0 61 | # columns = sorted(df.columns.tolist()) 62 | # columns.remove('label'); columns.append('label') 63 | # df.to_csv(name, index=False, columns=columns) 64 | 65 | 66 | name3 = './datas/trainMerged1.csv' 67 | name4 = './datas/trainMerged2.csv' 68 | 69 | exists = os.path.exists 70 | if not (exists(name3) and exists(name4)): 71 | random.seed(2018) 72 | 73 | #t1n, t2n = int(7e+6), int(1e+6) 74 | 75 | df = pd.read_csv(name1) 76 | t1n, t2n = int(0.7 * len(df)), int(0.3 * len(df)) 77 | indices = list(range(len(df))) 78 | #random.shuffle(indices) 79 | indices3 = indices[:t1n] 80 | indices4 = indices[-t2n:] 81 | 82 | columns = sorted(df.columns.tolist()) 83 | columns.remove('label'); columns.append('label') 84 | df.take(indices3).to_csv( 85 | name3, index=False, columns=columns) 86 | df.take(indices4).to_csv( 87 | name4, index=False, columns=columns) 88 | del df 89 | gc.collect() 90 | print('merge ok') 91 | if not keys: keys = None 92 | if kind==1: return pd.read_csv(name1, usecols=keys) 93 | if kind==2: return pd.read_csv(name2, usecols=keys) 94 | if kind==0: return pd.concat([ 95 | pd.read_csv(name1, usecols=keys), 96 | pd.read_csv(name2, usecols=keys) 97 | ], axis=0) 98 | if kind==3: return pd.read_csv(name3, usecols=keys) 99 | if 
kind==4: return pd.read_csv(name4, usecols=keys) 100 | 101 | def getMergedA(*keys, **kwargs): 102 | kind = kwargs.get('kind', 0) 103 | name1 = './datas/trainMergedA.csv' 104 | name2 = './datas/test1MergedA.csv' 105 | 106 | for name in [name1, name2]: 107 | if os.path.exists(name): continue 108 | 109 | ad = pd.read_csv('./datas/adFeature.csv') 110 | df = pd.read_csv(name.replace('Merged', '')) 111 | user = pd.read_csv('./datas/userFeature.csv') 112 | 113 | df = df.merge(ad).merge(user) 114 | if 'label' not in df: df['label'] = 0 115 | 116 | columns = sorted(df.columns.tolist()) 117 | columns.remove('label'); columns.append('label') 118 | df.to_csv(name, index=False, columns=columns) 119 | 120 | if not keys: keys = None 121 | if kind==1: return pd.read_csv(name1, usecols=keys) 122 | if kind==2: return pd.read_csv(name2, usecols=keys) 123 | if kind==0: return pd.concat([ 124 | pd.read_csv(name1, usecols=keys), 125 | pd.read_csv(name2, usecols=keys), 126 | ], axis=0) 127 | 128 | ############################################ 129 | 130 | def getUF(): 131 | name = './datas/userFeature.data' 132 | 133 | for f in open(name): 134 | user = {} 135 | for uf in f.strip().split('|'): 136 | uf = uf.split(' ') 137 | user[uf[0]] = list(map(int, uf[1:])) 138 | 139 | yield user 140 | 141 | def userMaxLen(): 142 | folder = mkdir('msg') 143 | maxLen = defaultdict(int) 144 | maxFile = '%s/maxLen.txt' % folder 145 | 146 | if not os.path.exists(maxFile): 147 | for i, user in enumerate(getUF()): 148 | for key, vs in user.items(): 149 | maxLen[key] = max(maxLen[key], len(list(vs))) 150 | 151 | with open(maxFile, 'w') as f: 152 | for k in sorted(maxLen.keys()): 153 | f.write('%s %s\n' % (k, maxLen[k])) 154 | #print( 'Saved to %s ...' % saveName) 155 | 156 | for line in open(maxFile): 157 | k, v = line.strip().split() 158 | maxLen[k] = int(v) 159 | 160 | return maxLen 161 | 162 | def userFeature(): 163 | default = { k: 0 for k in userMaxLen() } 164 | keys = sorted(default.keys()) 165 | keys.remove('uid'); keys = ['uid'] + keys 166 | 167 | saveName = './datas/userFeature.csv' 168 | if os.path.exists(saveName): return 169 | 170 | with open(saveName, 'w') as sf: 171 | dw = csv.DictWriter(sf, fieldnames=keys) 172 | dw.writeheader() 173 | 174 | for i, user in enumerate(getUF()): 175 | data = default.copy() 176 | for key, vs in user.items(): 177 | data[key] = ' '.join(list(map(str,vs))) 178 | dw.writerow(data) 179 | 180 | if (i+1) % 100000 == 0: 181 | print( "%8d..." % (i+1)) 182 | print( 'Saved to %s ...' 
% saveName) 183 | 184 | ############################################ 185 | 186 | def getFFM(): 187 | print( 'start getFFM()') 188 | folder = mkdir('ffm') 189 | datas = [ 190 | ['trainMerged', 'trainFFM'], 191 | ['test1Merged', 'testFFM'], 192 | ] 193 | print( 'start get FFM data in getFFM()') 194 | result, flag = [], True 195 | 196 | for _, save in datas: 197 | save = '%s/%s.txt' % (folder, save) 198 | result.append(save) 199 | if not os.path.exists(save): 200 | flag = False 201 | if flag: 202 | return result 203 | 204 | count, _ = setCount()#return the dict-->{feat:{onehot : index}} and the number of dim 205 | fields = sorted(count.keys()) 206 | UA_idx, User_idx, Ad_idx = open(User_Ad_idx, 'w'), '', '' 207 | for k, v in enumerate(fields): 208 | if v in User: 209 | User_idx += ' '+str(k) 210 | elif v in Ad: 211 | Ad_idx += ' '+str(k) 212 | if not UA_idx.closed: 213 | UA_idx.write('%s\n' % User_idx) 214 | UA_idx.write('%s\n' % Ad_idx) 215 | UA_idx.close() 216 | 217 | result = [] 218 | 219 | for name, save in datas: 220 | print( 'start gen data ', save) 221 | save = '%s/%s.txt' % (folder, save) 222 | result.append(save) 223 | if os.path.exists(save): continue 224 | 225 | txt = open(save, 'w') 226 | name = './datas/%s.csv' % name 227 | 228 | for i, line in enumerate(open(name)):#for the merged df 229 | vvs = line.strip().split(',') 230 | if i == 0: fNs = vvs; continue#fns feature name 231 | 232 | s = '' 233 | for k, vs in enumerate(vvs):#vvs value of this line 234 | if fNs[k] == 'label': 235 | s = ('%d'%(vs=='1')) + s#replace the -1 label to 0 236 | continue 237 | 238 | ii = fields.index(fNs[k])#index of the feature 239 | for v in vs.split(' '):#all value of one feature 240 | try: 241 | float(v) 242 | except: 243 | continue 244 | v = str(int(float(v))) 245 | c = count[fNs[k]][v]#index of the value in his feature 246 | # try: 247 | # float(v) 248 | # except: 249 | # continue 250 | # if float(v) and c: 251 | if int(float(v)) and c: 252 | s += ' %d:%d:1' % (ii, c)#特征index 总体取值index 1131,0直空值过滤了(这不一定合理 253 | txt.write('%s\n' % s) 254 | txt.close() 255 | 256 | 257 | print( 'Saved to %s ...' % save) 258 | return result 259 | 260 | def splitFFM(): 261 | t1, save3 = getFFM()#get conut path(train ,test ) of file which have fomat (index of feature : index fo all dim : 1) 262 | 263 | folder = mkdir('ffm') 264 | exists = os.path.exists 265 | save1 = '%s/t1.txt' % folder 266 | save2 = '%s/t2.txt' % folder 267 | if exists(save1) and exists(save2): 268 | return save1, save2, save3 269 | 270 | random.seed(2018) 271 | lines = open(t1).readlines()#read the tranFFM 272 | #random.shuffle(lines) 273 | #t1n, t2n = int(7e+6), int(1e+6) 274 | t1n, t2n = int(0.7 * len(lines)), int(0.3 * len(lines)) 275 | print( 'start gen train valid data') 276 | with open(save1, 'w') as txt: 277 | txt.writelines(lines[:t1n]) 278 | print( 'Saved to %s ...' % save1) 279 | with open(save2, 'w') as txt: 280 | txt.writelines(lines[-t2n:]) 281 | print( 'Saved to %s ...' 
% save2) 282 | 283 | return save1, save2, save3 284 | 285 | def getCount(): 286 | '''count values for every column, 287 | when dtype==O, expend it''' 288 | folder = mkdir('msg') 289 | saveName = '%s/counts.json' % folder 290 | print( 'start take get counts.json') 291 | if os.path.exists(saveName): 292 | print( 'getCount already done!!') 293 | return json.load(open(saveName)) 294 | 295 | datas = OrderedDict() 296 | print( 'start get merged data') 297 | df = getMerged(kind=0)#.drop('label', 1) 298 | del df['label'] 299 | gc.collect() 300 | for fn in df: 301 | gc.collect() 302 | print( 'start count', fn) 303 | if df[fn].dtype == 'object': 304 | # df_cnt = pd.DataFrame() 305 | # r = ' '.join( df[fn].dropna().tolist() ) 306 | # df_cnt[fn] = r.split(' ') 307 | # datas[fn] = df_cnt[fn].value_counts().to_dict() 308 | df[fn] = df[fn].astype(str) 309 | d, ds = [], df[fn].str.split(' ', expand=False) 310 | for s in ds: d.extend(list(map(int, s))) 311 | datas[fn] = Counter(d) 312 | 313 | else: 314 | #datas[fn] = df[fn].dropna().astype('int').value_counts().to_dict() 315 | datas[fn] = Counter(df[fn]) 316 | del df[fn] 317 | 318 | json.dump(datas, open(saveName, 'w')) 319 | print( 'Saved to %s ...' % saveName) 320 | 321 | return datas 322 | 323 | def setCount(): 324 | print( 'start setCount') 325 | count = getCount()#每个特征一个field,a dict of dict [('lbs',{value:count})] 326 | fields = sorted(count.keys())#list of feature['lbs',''...] 327 | 328 | offset = 0 329 | for k in fields: 330 | print( 'get %s fields in var range' % k) 331 | t = 100 if k=='uid' else 100#threshold of every feature 332 | 333 | temp = OrderedDict(#one col feature's dic--> value : count 334 | sorted( 335 | count[k].items(),#return the tuple of dict 336 | reverse=1, key=lambda i: i[1],#sort by i[1] 337 | ) 338 | ) 339 | count[k] = defaultdict( int, {#dict --> {feat:{value : index}} 340 | str(u): i+1+offset 341 | for i, u in enumerate(temp) 342 | if temp[u]>=t and int(float(u))!=0 343 | })#低于100的过滤掉了,给key对应自己index!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!all mass for this op 344 | offset += len(count[k]) 345 | print('setcount ok') 346 | return count, offset#return the dict and the number of dim 347 | 348 | def getFFMDim(): 349 | print( 'start getFFMdim()') 350 | t1, t2, t3 = splitFFM()#get the train ,validate , test FFMfile(::) 351 | print( 'end splitFFM()') 352 | UA_idx = open(User_Ad_idx,'r') 353 | User, Ad = UA_idx.readline().strip().split(' '), UA_idx.readline().strip().split(' ') 354 | UA_idx.close() 355 | 356 | folder = mkdir('msg') 357 | saveName = '%s/ffmDim.csv' % folder 358 | if os.path.exists(saveName): 359 | df = pd.read_csv(saveName) 360 | return OrderedDict( 361 | zip(df.columns, df.values[0])), User, Ad 362 | 363 | datas = defaultdict(int) 364 | for t in [t1, t2, t3]: 365 | print( 'start countFFMDim in [%s]' % t) 366 | for line in open(t): 367 | temp = defaultdict(int) 368 | 369 | s = line.strip().split(' ') 370 | fs = map(int, map( 371 | lambda x: x.split(':')[0], s[1:])) 372 | 373 | for f in fs: temp['field#%03d'%f] += 1 374 | for key in temp:#number of value of each field 375 | datas[key] = max(temp[key], datas[key])#所有数据中每一行中出现该field次数最大数,也就是每个特征每行取值最大个数 376 | for key in datas: 377 | datas[key] = [datas[key]] 378 | 379 | pd.DataFrame.from_dict(datas).to_csv( 380 | saveName, index=False, columns=sorted(datas.keys())) 381 | print( 'Saved to %s ...' 
% saveName) 382 | 383 | return OrderedDict([ 384 | (k, datas[k][0]) for k in sorted(datas.keys())]), User, Ad#return the dimlist in witch include the max val_num of the col 385 | 386 | def getSparse(): 387 | offsets = [0]#every field's begin index 388 | ffmDim, _, _ = getFFMDim() 389 | for k in sorted(ffmDim.keys()): 390 | offsets.append(offsets[-1]+ffmDim[k]) 391 | 392 | print( 'start splitFFM') 393 | t1, t2, t3 = splitFFM()#get the train ,validate , test FFMfile(::) 394 | print( 'start genFFM data') 395 | folder = mkdir('ffm') 396 | save1 = '%s/t1X.npz' % folder 397 | save2 = '%s/t2X.npz' % folder 398 | save3 = '%s/testX.npz' % folder 399 | 400 | ydfs = [ 401 | [t1, save1], [t2, save2], [t3, save3], 402 | ] 403 | 404 | flag = True 405 | for t, save in ydfs: 406 | if not os.path.exists(save): 407 | flag = False 408 | if flag: return save1, save2, save3 409 | print( 'end genFFM data') 410 | #data, indices, indptr = [], [], [0]#data, indices, indptr ::index of 1131 of all of each line,每个非0元素行内相对位置,这行多少个元素(起始index 411 | print( 'start gen sparse matrix data') 412 | for k, (t, save) in enumerate(ydfs): 413 | print( 'processing[%d], libsvm src[%s], npz dest[%s]' % (k, t, save)) 414 | ydfs[k].append(0) 415 | data, indices, indptr = [], [], [0] # data, indices, indptr ::index of 1131 of all of each line,每个非0元素行内相对位置,这行多少个元素(起始index 416 | for line_number, line in enumerate(open(t)): 417 | ydfs[k][-1] += 1#line num of this file 418 | 419 | s = line.strip().split(' ') # label field:key:value 420 | # value is always 1 421 | i = list(map(int, s[:1] + list(map( 422 | lambda x: x.split(':')[1], s[1:])))) #get index 423 | fs = list(map(int, list(map( 424 | lambda x: x.split(':')[0], s[1:])))) #get field 425 | 426 | data.extend(i) #VALUE IN EVERY ROW 427 | indices.append(0) 428 | ios = [0] * (len(offsets)) # offset存每个field的特征数目 429 | #print 'start process line[%s]' % line_number 430 | #print 'len(ios) is:', len(ios) 431 | #print ios[29] 432 | #print 'max(fs) is:', np.max(fs) 433 | #print 'fs is:', fs 434 | #fs = list(set(fs)) 435 | for f in fs: 436 | ios[f] += 1#value num of this line 437 | #print('error:',len(offsets),len(ios),f) 438 | indices.append(offsets[f] + ios[f]) #每个非0元素行内相对位置 439 | #print 'ios is:', ios 440 | #print 'indices is:', indices 441 | #data.extend(i) 442 | indptr.append(indptr[-1] + len(i)) #这一行到哪个元素 443 | rows, cols = len(indptr) - 1, np.max(indices) + 1 444 | csr = csr_matrix((data, indices, indptr), shape=(rows, cols)) # 445 | t, save, l = ydfs[k] 446 | sparse.save_npz(save, csr[:-1], compressed=True) 447 | # rows, cols = len(indptr)-1, np.max(indices)+1 448 | # ''' 449 | # csr_matrix((data, indices, indptr), [shape=(M, N)]) 450 | # is the standard CSR representation where the column indices for row i are stored in indices[indptr[i]:indptr[i+1]] and their corresponding values are stored in data[indptr[i]:indptr[i+1]]. If the shape parameter is not supplied, the matrix dimensions are inferred from the index arrays. 
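# A tiny worked example of the CSR triple (added for clarity, not part of the original code):
#   data    = [1, 2, 3]
#   indices = [0, 2, 1]
#   indptr  = [0, 2, 3]
#   csr_matrix((data, indices, indptr), shape=(2, 3)).toarray()
#   -> [[1, 0, 2],
#       [0, 3, 0]]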
451 | # indptr表示某一行在data中的range,indices表示每条数据在第几列 452 | # ''' 453 | # csr = csr_matrix( 454 | # (data, indices, indptr), shape=(rows, cols))#data, indices, indptr ::index of 1131 of all of each line,每个非0元素行内相对位置,这行多少个元素(起始index 455 | # 456 | # start, end = 0, 0 457 | # for t, save, l in ydfs: 458 | # start, end = end, end + l 459 | # sparse.save_npz( 460 | # save, csr[start:end], compressed=True) 461 | print( 'end gen sparse matrix data') 462 | return save1, save2, save3 463 | 464 | def loadSparse(kind): 465 | return sparse.load_npz(getSparse()[kind]) 466 | 467 | ############################################ 468 | from sklearn.preprocessing import LabelEncoder 469 | if __name__ == '__main__': 470 | #userFeature 471 | 472 | mytype = {'uid': np.uint16, 'LBS': np.uint16, 'adCategoryId': np.uint16, 'advertiserId': np.uint16, 473 | 'age': np.uint16, 'aid': np.uint16, 'campaignId': np.uint16, 474 | 'carrier': np.uint16, 'consumptionAbility': np.uint16, 'creativeId': np.uint16, 'creativeSize': np.uint16, 475 | 'ct': object, 'education': np.uint16, 'gender': np.uint16, 'house': np.uint16, 'interest1': object, 476 | 'interest2': object, 'interest3': object, 'interest4': object, 'interest5': object, 'kw1': object, 477 | 'kw2': object, 'kw3': object, 'marriageStatus': object, 'os': object, 'productId': np.uint16, 478 | 'productType': np.uint16, 'topic1': object, 'topic2': object, 'topic3': object, 'label': np.int8, 479 | 'appIdAction': object, 'appIdInstall': object} 480 | # df = pd.read_csv('./datas/8g/8g.csv', header=0, iterator=True,dtype=mytype, usecols=mytype.keys()) 481 | # data = df.get_chunk(80000) 482 | # data.to_csv('./datas/trainMerged.csv',index=False) 483 | # data = df.get_chunk(20000) 484 | # data.to_csv('./datas/test1Merged.csv', index=False) 485 | # print('data ok') 486 | # del data 487 | # gc.collect() 488 | print(getFFM()) 489 | print(setCount()[1]) 490 | loadSparse(0) 491 | #os.system('python3 nn.py 1 0 test 0') 492 | -------------------------------------------------------------------------------- /feature.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | from sklearn.feature_extraction.text import CountVectorizer 3 | from sklearn.preprocessing import OneHotEncoder,LabelEncoder 4 | from scipy import sparse 5 | import os 6 | import gc 7 | import random 8 | import numpy as np 9 | import scipy.special as special 10 | from tqdm import tqdm 11 | from collections import * 12 | 13 | cross_feat = ['LBS_marriageStatus','adCategoryId_LBS','adCategoryId_age','adCategoryId_ct','adCategoryId_productId','advertiserId_LBS','campaignId_productId','campaignId_adCategoryId','campaignId_age','campaignId_ct','campaignId_marriageStatus','consumptionAbility_ct','education_ct','gender_os','productId_LBS','productType_marriageStatus','productType_ct'] 14 | 15 | stas_feat = ['adCategoryId_age','adCategoryId_marriageStatus','age_carrier','age_consumptionAbility','age_gender','age_house','age_os','campaignId_LBS','campaignId_marriageStatus','consumptionAbility_gender','creativeSize_age','creativeSize_marriageStatus','education_gender','productId_marriageStatus','productType_age','productType_consumptionAbility'] 16 | #多值特征 17 | def feat_mulval(data): 18 | for i in data.columns: 19 | i_types = data[i].dtypes 20 | if i_types != 'object': 21 | print('single', i) 22 | # train_and_test_and_adFeature[i] = train_and_test_and_adFeature[i].astype('category', copy=False) 23 | else: 24 | if str(i).startswith('marriageStatus'): 25 | tmp = data[i].apply(lambda 
x: len(str(x).split(' '))) 26 | i_length = min(5, tmp.max()) 27 | for sub_i in range(i_length): 28 | data[i + '_%d' % (sub_i)] = data[i].apply( 29 | lambda x: str(x).split(' ')[sub_i] if sub_i < len(str(x).split(' ')) else -1 30 | ) 31 | data[i + '_length'] = data[i].apply( 32 | lambda x: len(list(set(str(x).split(' ')))) 33 | ) 34 | return data 35 | #过滤低值 36 | def remove_lowcase(se,times): 37 | count = dict(se.value_counts()) 38 | se = se.map(lambda x : -1 if count[x] 2) 310 | data = pd.merge(data, temp, how='left', on=subset) 311 | del temp 312 | gc.collect() 313 | return data 314 | 315 | 316 | #均值特征 317 | def feat_mean(): 318 | res = data[['index']] 319 | 320 | grouped = data.groupby('appID')['age'].mean().reset_index() 321 | grouped.columns = ['appID', 'app_mean_age'] 322 | data = data.merge(grouped, how='left', on='appID') 323 | grouped = data.groupby('positionID')['age'].mean().reset_index() 324 | grouped.columns = ['positionID', 'position_mean_age'] 325 | data = data.merge(grouped, how='left', on='positionID') 326 | grouped = data.groupby('appCategory')['age'].mean().reset_index() 327 | grouped.columns = ['appCategory', 'appCategory_mean_age'] 328 | data = data.merge(grouped, how='left', on='appCategory') 329 | 330 | X_train = data.loc[data['label'] != -1, :] 331 | X_test = data.loc[data['label'] == -1, :] 332 | X_test.loc[:, 'instanceID'] = res.values 333 | del data, grouped 334 | gc.collect() 335 | return X_train, X_test 336 | 337 | #排序特征 338 | def get_rank_feat(data, col, name): 339 | data[name] = data.groupby(col).cumcount() + 1 340 | return data 341 | 342 | # 343 | def gen_pos_neg_aid_fea(): 344 | train_data = pd.read_csv('input/train.csv') 345 | test2_data = pd.read_csv('input/test2.csv') 346 | 347 | train_user = train_data.uid.unique() 348 | 349 | # user-aid dict 350 | uid_dict = defaultdict(list) 351 | for row in tqdm(train_data.itertuples(), total=len(train_data)): 352 | uid_dict[row[2]].append([row[1], row[3]]) 353 | 354 | # user convert 355 | uid_convert = {} 356 | for uid in tqdm(train_user): 357 | pos_aid, neg_aid = [], [] 358 | for data in uid_dict[uid]: 359 | if data[1] > 0: 360 | pos_aid.append(data[0]) 361 | else: 362 | neg_aid.append(data[0]) 363 | uid_convert[uid] = [pos_aid, neg_aid] 364 | 365 | test2_neg_pos_aid = {} 366 | for row in tqdm(test2_data.itertuples(), total=len(test2_data)): 367 | aid = row[1] 368 | uid = row[2] 369 | if uid_convert.get(uid, []) == []: 370 | test2_neg_pos_aid[row[0]] = ['', '', -1] 371 | else: 372 | pos_aid, neg_aid = uid_convert[uid][0].copy(), uid_convert[uid][1].copy() 373 | convert = len(pos_aid) / (len(pos_aid) + len(neg_aid)) if (len(pos_aid) + len(neg_aid)) > 0 else -1 374 | test2_neg_pos_aid[row[0]] = [' '.join(map(str, pos_aid)), ' '.join(map(str, neg_aid)), convert] 375 | df_test2 = pd.DataFrame.from_dict(data=test2_neg_pos_aid, orient='index') 376 | df_test2.columns = ['pos_aid', 'neg_aid', 'uid_convert'] 377 | 378 | train_neg_pos_aid = {} 379 | for row in tqdm(train_data.itertuples(), total=len(train_data)): 380 | aid = row[1] 381 | uid = row[2] 382 | pos_aid, neg_aid = uid_convert[uid][0].copy(), uid_convert[uid][1].copy() 383 | if aid in pos_aid: 384 | pos_aid.remove(aid) 385 | if aid in neg_aid: 386 | neg_aid.remove(aid) 387 | convert = len(pos_aid) / (len(pos_aid) + len(neg_aid)) if (len(pos_aid) + len(neg_aid)) > 0 else -1 388 | train_neg_pos_aid[row[0]] = [' '.join(map(str, pos_aid)), ' '.join(map(str, neg_aid)), convert] 389 | 390 | df_train = pd.DataFrame.from_dict(data=train_neg_pos_aid, orient='index') 391 | 
df_train.columns = ['pos_aid', 'neg_aid', 'uid_convert'] 392 | 393 | df_train.to_csv("dataset/train_neg_pos_aid.csv", index=False) 394 | df_test2.to_csv("dataset/test2_neg_pos_aid.csv", index=False) 395 | 396 | #zhada**平滑转化率 397 | # def count_feature(train_data, test1_data, test2_data, labels, k, test_only= False): 398 | # nums = len(train_data) 399 | # interval = nums // k 400 | # split_points = [] 401 | # for i in range(k): 402 | # split_points.append(i * interval) 403 | # split_points.append(nums) 404 | # 405 | # s = set() 406 | # for d in train_data: 407 | # xs = d.split(' ') 408 | # for x in xs: 409 | # s.add(x) 410 | # b = nums // len(s) 411 | # a = b*1.0 / 20 412 | # 413 | # train_res = [] 414 | # if not test_only: 415 | # for i in range(k): 416 | # tmp = [] 417 | # total_dict, pos_dict = gen_count_dict(train_data, labels, split_points[i],split_points[i+1]) 418 | # for j in range(split_points[i],split_points[i+1]): 419 | # xs = train_data[j].split(' ') 420 | # t = [] 421 | # for x in xs: 422 | # if not total_dict.has_key(x): 423 | # t.append(0.05) 424 | # continue 425 | # t.append((a + pos_dict[x]) / (b + total_dict[x])) 426 | # tmp.append(max(t)) 427 | # train_res.extend(tmp) 428 | # 429 | # total_dict, pos_dict = gen_count_dict(train_data, labels, 1, 0) 430 | # test1_res = [] 431 | # for d in test1_data: 432 | # xs = d.split(' ') 433 | # t = [] 434 | # for x in xs: 435 | # if not total_dict.has_key(x): 436 | # t.append(0.05) 437 | # continue 438 | # t.append((a + pos_dict[x]) / (b + total_dict[x])) 439 | # test1_res.append(max(t)) 440 | # 441 | # test2_res = [] 442 | # for d in test2_data: 443 | # xs = d.split(' ') 444 | # t = [] 445 | # for x in xs: 446 | # if not total_dict.has_key(x): 447 | # t.append(0.05) 448 | # continue 449 | # t.append((a + pos_dict[x]) / (b + total_dict[x])) 450 | # test2_res.append(max(t)) 451 | # 452 | # return train_res, test1_res, test2_res 453 | 454 | #dnn feature################################################################################################################################ 455 | def mutil_ids(train_df, dev_df, test_df, word2index): 456 | features_mutil = ['interest1', 'interest2', 'interest3', 'interest4', 'interest5', 'kw1', 'kw2', 'kw3', 'topic1', 457 | 'topic2', 'topic3', 'appIdAction', 'appIdInstall', 'marriageStatus', 'ct', 'os'] 458 | for s in features_mutil: 459 | cont = {} 460 | with open('ffm_data/train/' + str(s), 'w') as f: 461 | for lines in list(train_df[s].values): 462 | f.write(str(lines) + '\n') 463 | for line in lines.split(): 464 | if str(line) not in cont: 465 | cont[str(line)] = 0 466 | cont[str(line)] += 1 467 | 468 | with open('ffm_data/dev/' + str(s), 'w') as f: 469 | for line in list(dev_df[s].values): 470 | f.write(str(line) + '\n') 471 | 472 | with open('ffm_data/test/' + str(s), 'w') as f: 473 | for line in list(test_df[s].values): 474 | f.write(str(line) + '\n') 475 | index = [] 476 | for k in cont: 477 | if cont[k] >= threshold: 478 | index.append(k) 479 | word2index[s] = {} 480 | for idx, val in enumerate(index): 481 | word2index[s][val] = idx + 2 482 | print(s + ' done!') 483 | 484 | 485 | def len_features(train_df, dev_df, test_df, word2index): 486 | len_features = ['interest1', 'interest2', 'interest3', 'interest4', 'interest5', 'kw1', 'kw2', 'kw3', 'topic1', 487 | 'topic2', 'topic3'] 488 | for s in len_features: 489 | dev_df[s + '_len'] = dev_df[s].apply(lambda x: len(x.split()) if x != '-1' else 0) 490 | test_df[s + '_len'] = test_df[s].apply(lambda x: len(x.split()) if x != '-1' else 0) 491 | 
train_df[s + '_len'] = train_df[s].apply(lambda x: len(x.split()) if x != '-1' else 0) 492 | s = s + '_len' 493 | cont = {} 494 | with open('ffm_data/train/' + str(s), 'w') as f: 495 | for line in list(train_df[s].values): 496 | f.write(str(line) + '\n') 497 | if str(line) not in cont: 498 | cont[str(line)] = 0 499 | cont[str(line)] += 1 500 | 501 | with open('ffm_data/dev/' + str(s), 'w') as f: 502 | for line in list(dev_df[s].values): 503 | f.write(str(line) + '\n') 504 | 505 | with open('ffm_data/test/' + str(s), 'w') as f: 506 | for line in list(test_df[s].values): 507 | f.write(str(line) + '\n') 508 | index = [] 509 | for k in cont: 510 | if cont[k] >= threshold: 511 | index.append(k) 512 | word2index[s] = {} 513 | for idx, val in enumerate(index): 514 | word2index[s][val] = idx + 2 515 | del train_df[s] 516 | del dev_df[s] 517 | del test_df[s] 518 | gc.collect() 519 | print(s + ' done!') 520 | 521 | 522 | def count_features(train_df, dev_df, test_df, word2index): 523 | count_feature = ['uid'] 524 | data = train_df.append(dev_df) 525 | data = data.append(test_df) 526 | for s in count_feature: 527 | g = dict(data.groupby(s).size()) 528 | s_ = s 529 | s = s + '_count' 530 | cont = {} 531 | 532 | with open('ffm_data/train/' + str(s), 'w') as f: 533 | for line in list(train_df[s_].values): 534 | line = g[line] 535 | if str(line) not in cont: 536 | cont[str(line)] = 0 537 | cont[str(line)] += 1 538 | f.write(str(line) + '\n') 539 | 540 | with open('ffm_data/dev/' + str(s), 'w') as f: 541 | for line in list(dev_df[s_].values): 542 | line = g[line] 543 | f.write(str(line) + '\n') 544 | 545 | with open('ffm_data/test/' + str(s), 'w') as f: 546 | for line in list(test_df[s_].values): 547 | line = g[line] 548 | f.write(str(line) + '\n') 549 | index = [] 550 | for k in cont: 551 | if cont[k] >= threshold: 552 | index.append(k) 553 | word2index[s] = {} 554 | for idx, val in enumerate(index): 555 | word2index[s][val] = idx + 2 556 | print(s + ' done!') 557 | 558 | 559 | def kfold_features(train_df, dev_df, test_df, word2index): 560 | features_mutil = ['interest1', 'interest2', 'interest3', 'interest4', 'interest5', 'kw1', 'kw2', 'kw3', 'topic1', 561 | 'topic2', 'topic3', 'appIdAction', 'appIdInstall', 'marriageStatus', 'ct', 'os'] 562 | for f in features_mutil: 563 | del train_df[f] 564 | del dev_df[f] 565 | del test_df[f] 566 | gc.collect() 567 | count_feature = ['uid'] 568 | feature = ['advertiserId', 'campaignId', 'creativeId', 'creativeSize', 'adCategoryId', 'productId', 'productType'] 569 | for f in feature: 570 | train_df[f + '_uid'] = train_df[f] + train_df['uid'] * 10000000 571 | dev_df[f + '_uid'] = dev_df[f] + dev_df['uid'] * 10000000 572 | test_df[f + '_uid'] = test_df[f] + test_df['uid'] * 10000000 573 | count_feature.append(f + '_uid') 574 | 575 | for s in count_feature: 576 | temp = s 577 | kfold_static(train_df, dev_df, test_df, s) 578 | s = temp + '_positive_num' 579 | cont = {} 580 | with open('ffm_data/train/' + str(s), 'w') as f: 581 | for line in list(train_df[s].values): 582 | if str(line) not in cont: 583 | cont[str(line)] = 0 584 | cont[str(line)] += 1 585 | f.write(str(line) + '\n') 586 | 587 | with open('ffm_data/dev/' + str(s), 'w') as f: 588 | for line in list(dev_df[s].values): 589 | f.write(str(line) + '\n') 590 | 591 | with open('ffm_data/test/' + str(s), 'w') as f: 592 | for line in list(test_df[s].values): 593 | f.write(str(line) + '\n') 594 | index = [] 595 | for k in cont: 596 | if cont[k] >= threshold: 597 | index.append(k) 598 | word2index[s] = {} 599 | for idx, 
val in enumerate(index): 600 | word2index[s][val] = idx + 2 601 | pkl.dump(word2index, open('ffm_data/dic.pkl', 'wb')) 602 | print(s + ' done!') 603 | del train_df[s] 604 | del dev_df[s] 605 | del test_df[s] 606 | 607 | s = temp + '_negative_num' 608 | cont = {} 609 | with open('ffm_data/train/' + str(s), 'w') as f: 610 | for line in list(train_df[s].values): 611 | if str(line) not in cont: 612 | cont[str(line)] = 0 613 | cont[str(line)] += 1 614 | f.write(str(line) + '\n') 615 | 616 | with open('ffm_data/dev/' + str(s), 'w') as f: 617 | for line in list(dev_df[s].values): 618 | f.write(str(line) + '\n') 619 | 620 | with open('ffm_data/test/' + str(s), 'w') as f: 621 | for line in list(test_df[s].values): 622 | f.write(str(line) + '\n') 623 | index = [] 624 | for k in cont: 625 | if cont[k] >= threshold: 626 | index.append(k) 627 | word2index[s] = {} 628 | for idx, val in enumerate(index): 629 | word2index[s][val] = idx + 2 630 | pkl.dump(word2index, open('ffm_data/dic.pkl', 'wb')) 631 | del train_df[s] 632 | del dev_df[s] 633 | del test_df[s] 634 | gc.collect() 635 | print(s + ' done!') 636 | for f in feature: 637 | del train_df[f + '_uid'] 638 | del test_df[f + '_uid'] 639 | del dev_df[f + '_uid'] 640 | gc.collect() 641 | 642 | 643 | def kfold_static(train_df, dev_df, test_df, f): 644 | print("K-fold static:", f) 645 | # K-fold positive and negative num 646 | index = set(range(train_df.shape[0])) 647 | K_fold = [] 648 | for i in range(5): 649 | if i == 4: 650 | tmp = index 651 | else: 652 | tmp = random.sample(index, int(0.2 * train_df.shape[0])) 653 | index = index - set(tmp) 654 | print("Number:", len(tmp)) 655 | K_fold.append(tmp) 656 | positive = [-1 for i in range(train_df.shape[0])] 657 | negative = [-1 for i in range(train_df.shape[0])] 658 | for i in range(5): 659 | print('fold', i) 660 | pivot_index = K_fold[i] 661 | sample_idnex = [] 662 | for j in range(5): 663 | if j != i: 664 | sample_idnex += K_fold[j] 665 | dic = {} 666 | for item in train_df.iloc[sample_idnex][[f, 'label']].values: 667 | if item[0] not in dic: 668 | dic[item[0]] = [0, 0] 669 | dic[item[0]][item[1]] += 1 670 | uid = train_df[f].values 671 | for k in pivot_index: 672 | if uid[k] in dic: 673 | positive[k] = dic[uid[k]][1] 674 | negative[k] = dic[uid[k]][0] 675 | train_df[f + '_positive_num'] = positive 676 | train_df[f + '_negative_num'] = negative 677 | 678 | # for dev and test 679 | dic = {} 680 | for item in train_df[[f, 'label']].values: 681 | if item[0] not in dic: 682 | dic[item[0]] = [0, 0] 683 | dic[item[0]][item[1]] += 1 684 | positive = [] 685 | negative = [] 686 | for uid in dev_df[f].values: 687 | if uid in dic: 688 | positive.append(dic[uid][1]) 689 | negative.append(dic[uid][0]) 690 | else: 691 | positive.append(-1) 692 | negative.append(-1) 693 | dev_df[f + '_positive_num'] = positive 694 | dev_df[f + '_negative_num'] = negative 695 | print('dev', 'done') 696 | 697 | positive = [] 698 | negative = [] 699 | for uid in test_df[f].values: 700 | if uid in dic: 701 | positive.append(dic[uid][1]) 702 | negative.append(dic[uid][0]) 703 | else: 704 | positive.append(-1) 705 | negative.append(-1) 706 | test_df[f + '_positive_num'] = positive 707 | test_df[f + '_negative_num'] = negative 708 | print('test', 'done') 709 | print('avg of positive num', np.mean(train_df[f + '_positive_num']), np.mean(dev_df[f + '_positive_num']), 710 | np.mean(test_df[f + '_positive_num'])) 711 | print('avg of negative num', np.mean(train_df[f + '_negative_num']), np.mean(dev_df[f + '_negative_num']), 712 | 
np.mean(test_df[f + '_negative_num'])) 713 | 714 | 715 | 716 | def uid_seq_feature(train_data, test1_data, test2_data, label): 717 | count_dict = {}#存 该id: 追加这次出现的label 718 | seq_dict = {}#存序列字典 :该种序列出现次数 719 | seq_emb_dict = {}#存该序列的key 720 | train_seq = []#存每个学列的index 721 | ind = 0 722 | for i, d in enumerate(train_data): 723 | if not count_dict.__contains__(d): 724 | count_dict[d] = [] 725 | seq_key = ' '.join(count_dict[d][-4:]) 726 | if not seq_dict.__contains__(seq_key): 727 | seq_dict[seq_key] = 0 728 | seq_emb_dict[seq_key] = ind 729 | ind += 1 730 | seq_dict[seq_key] += 1 731 | train_seq.append(seq_emb_dict[seq_key]) 732 | count_dict[d].append(label[i]) 733 | test1_seq = [] 734 | for d in test1_data: 735 | if not count_dict.__contains__(d): 736 | seq_key = '' 737 | else: 738 | seq_key = ' '.join(count_dict[d][-4:]) 739 | if seq_emb_dict.__contains__(seq_key): 740 | key = seq_emb_dict[seq_key] 741 | else: 742 | key = 0 743 | test1_seq.append(key) 744 | test2_seq = [] 745 | for d in test2_data: 746 | if not count_dict.__contains__(d): 747 | seq_key = '' 748 | else: 749 | seq_key = ' '.join(count_dict[d][-4:]) 750 | if seq_emb_dict.__contains__(seq_key): 751 | key = seq_emb_dict[seq_key] 752 | else: 753 | key = 0 754 | test2_seq.append(key) 755 | 756 | def gen_uid_aid_fea(): 757 | ''' 758 | 载入数据, 提取aid, uid的全局统计特征 759 | ''' 760 | train_data = pd.read_csv('input/train.csv') 761 | test1_data = pd.read_csv('input/test1.csv') 762 | test2_data = pd.read_csv('input/test2.csv') 763 | 764 | ad_Feature = pd.read_csv('input/adFeature.csv') 765 | 766 | train_len = len(train_data) # 45539700 767 | test1_len = len(test1_data) 768 | test2_len = len(test2_data) # 11727304 769 | 770 | ad_Feature = pd.merge(ad_Feature, ad_Feature.groupby(['campaignId']).aid.nunique().reset_index( 771 | ).rename(columns={'aid': 'campaignId_aid_nunique'}), how='left', on='campaignId') 772 | 773 | df = pd.concat([train_data, test1_data, test2_data], axis=0) 774 | df = pd.merge(df, df.groupby(['uid'])['aid'].nunique().reset_index().rename( 775 | columns={'aid': 'uid_aid_nunique'}), how='left', on='uid') 776 | 777 | df = pd.merge(df, df.groupby(['aid'])['uid'].nunique().reset_index().rename( 778 | columns={'uid': 'aid_uid_nunique'}), how='left', on='aid') 779 | 780 | df['uid_count'] = df.groupby('uid')['aid'].transform('count') 781 | df = pd.merge(df, ad_Feature[['aid', 'campaignId_aid_nunique']], how='left', on='aid') 782 | 783 | fea_columns = ['campaignId_aid_nunique', 'uid_aid_nunique', 'aid_uid_nunique', 'uid_count', ] 784 | 785 | df[fea_columns].iloc[:train_len].to_csv('dataset/train_uid_aid.csv', index=False) 786 | df[fea_columns].iloc[train_len: train_len+test1_len].to_csv('dataset/test1_uid_aid.csv', index=False) 787 | df[fea_columns].iloc[-test2_len:].to_csv('dataset/test2_uid_aid.csv', index=False) 788 | 789 | 790 | def digitize(): 791 | uid_aid_train = pd.read_csv('dataset/train_uid_aid.csv') 792 | uid_aid_test1 = pd.read_csv('dataset/test1_uid_aid.csv') 793 | uid_aid_test2 = pd.read_csv('dataset/test2_uid_aid.csv') 794 | uid_aid_df = pd.concat([uid_aid_train, uid_aid_test1, uid_aid_test2], axis=0) 795 | for col in range(3): 796 | bins = [] 797 | for percent in [0, 20, 35, 50, 65, 85, 100]: 798 | bins.append(np.percentile(uid_aid_df.iloc[:, col], percent)) 799 | uid_aid_train.iloc[:, col] = np.digitize(uid_aid_train.iloc[:, col], bins, right=True) 800 | uid_aid_test1.iloc[:, col] = np.digitize(uid_aid_test1.iloc[:, col], bins, right=True) 801 | uid_aid_test2.iloc[:, col] = np.digitize(uid_aid_test2.iloc[:, 
col], bins, right=True) 802 | 803 | count_bins = [1, 2, 4, 6, 8, 10, 16, 27, 50] 804 | uid_aid_train.iloc[:, 3] = np.digitize(uid_aid_train.iloc[:, 3], count_bins, right=True) 805 | uid_aid_test1.iloc[:, 3] = np.digitize(uid_aid_test1.iloc[:, 3], count_bins, right=True) 806 | uid_aid_test2.iloc[:, 3] = np.digitize(uid_aid_test2.iloc[:, 3], count_bins, right=True) 807 | 808 | uid_convert_train = pd.read_csv("dataset/train_neg_pos_aid.csv", usecols=['uid_convert']) 809 | uid_convert_test2 = pd.read_csv("dataset/test2_neg_pos_aid.csv", usecols=['uid_convert']) 810 | 811 | convert_bins = [-1, 0, 0.1, 0.3, 0.5, 0.7, 1] 812 | uid_convert_train.iloc[:, 0] = np.digitize(uid_convert_train.iloc[:, 0], convert_bins, right=True) 813 | uid_convert_test2.iloc[:, 0] = np.digitize(uid_convert_test2.iloc[:, 0], convert_bins, right=True) 814 | 815 | uid_aid_train = pd.concat([uid_aid_train, uid_convert_train], axis=1) 816 | uid_aid_test2 = pd.concat([uid_aid_test2, uid_convert_test2], axis=1) 817 | 818 | uid_aid_train.to_csv('dataset/train_uid_aid_bin.csv', index=False) 819 | uid_aid_test2.to_csv('dataset/test2_uid_aid_bin.csv', index=False) 820 | 821 | 822 | ################# 823 | def feature_count(full, features=[]): 824 | new_feature = 'new_count' 825 | for i in features: 826 | get_series(i) 827 | new_feature += '_' + i 828 | log(new_feature) 829 | try: 830 | del full[new_feature] 831 | except: 832 | pass 833 | temp = full.groupby(features).size().reset_index().rename(columns={0: new_feature}) 834 | full = full.merge(temp, 'left', on=features) 835 | # save(full, new_feature) 836 | return full 837 | 838 | def Information_entropy(): 839 | for i in ['aid']: 840 | for j in ['age', 'gender', 'education', 'consumptionAbility', 'LBS', 'carrier', 'house', 841 | 'marriageStatus', 'ct', 'os']: 842 | t = time.time() 843 | full = feature_count(full, [i, j]) 844 | full['new_inf_' + i + '_' + j] = np.log1p( 845 | full['new_count_' + j] * full['new_count_' + i] / full['new_count_' + i + '_' + j] / len_full) 846 | 847 | min_v = full['new_inf_' + i + '_' + j].min() 848 | full['new_inf_' + i + '_' + j] = full['new_inf_' + i + '_' + j].apply( 849 | lambda x: int(float('%.1f' % min(x - min_v, 1.5)) * 10)) 850 | print(full['new_inf_' + i + '_' + j].value_counts()) 851 | save(full, 'new_inf_' + i + '_' + j, 'cate', max=15) 852 | 853 | # 多值特征的信息熵 854 | def get_inf(cond, keyword): 855 | get_series(cond) 856 | get_series(keyword) 857 | 858 | # 背景字典 每个ID出现的次数 859 | back_dict = {} 860 | # 不同条件下 每个id 出现的次数condi_dict 861 | condi_dict = {} 862 | # 不同条件下aid 每个id 出现的次数 863 | 864 | # 预先生成字典。省的判断慢 865 | for i in full[cond].unique(): 866 | condi_dict[i] = {} 867 | 868 | for i, row in full[[cond, keyword]].iterrows(): 869 | word_list = row[keyword].split() 870 | for word in word_list: 871 | # 对背景字典加1 872 | try: 873 | back_dict[word] = back_dict[word] + 1 874 | except: 875 | # 没有该词则设为0 876 | back_dict[word] = 1 877 | try: 878 | # 该条件下的该词的出现次数加1 879 | condi_dict[row[cond]][word] = condi_dict[row[cond]][word] + 1 880 | except: 881 | condi_dict[row[cond]][word] = 1 882 | 883 | # 先获取平均熵 884 | max_inf_list = [] 885 | mean_inf_list = [] 886 | condi_count = full.groupby(cond)[cond].count().to_dict() 887 | for i, row in full[[cond, keyword]].iterrows(): 888 | word_list = row[keyword].split() 889 | 890 | count = len(word_list) 891 | prob = 1 892 | prob_list = [] 893 | for word in word_list: 894 | temp_prob = condi_count[row[cond]] * back_dict[word] / condi_dict[row[cond]][word] / len_full 895 | prob = prob * temp_prob 896 | 
prob_list.append(temp_prob) 897 | mean_inf_list.append(np.log1p(prob) / count) 898 | max_inf_list.append( 899 | np.log1p(np.min(prob))) 900 | return max_inf_list, mean_inf_list 901 | 902 | # 计算maxpool 和meanpool 903 | for i in ['aid']: 904 | for j in ['interest1', 'interest2', 'interest3', 'interest4', 'interest5', 'kw1', 'kw2', 'kw3', 'topic1', 905 | 'topic2', 'topic3', 'appIdAction', 'appIdInstall']: 906 | log('new_inf_' + i + '_' + j) 907 | full['new_inf_' + i + '_' + j + '_max'], full['new_inf_' + i + '_' + j + '_mean'] = get_inf(i, j) 908 | 909 | min_v = full['new_inf_' + i + '_' + j + '_max'].min() 910 | full['new_inf_' + i + '_' + j + '_max'] = full['new_inf_' + i + '_' + j + '_max'].apply( 911 | lambda x: int(float('%.1f' % min(x - min_v, 1.5)) * 10)) 912 | save(full, 'new_inf_' + i + '_' + j + '_max', 'cate', 16) 913 | 914 | min_v = full['new_inf_' + i + '_' + j + '_mean'].min() 915 | full['new_inf_' + i + '_' + j + '_mean'] = full['new_inf_' + i + '_' + j + '_mean'].apply( 916 | lambda x: int(float('%.1f' % min(x - min_v, 1.5)) * 10)) 917 | save(full, 'new_inf_' + i + '_' + j + '_mean', 'cate', 16) -------------------------------------------------------------------------------- /ffm.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | 3 | import os 4 | import sys 5 | import csv 6 | import time 7 | import glob 8 | import zipfile 9 | import numpy as np 10 | import xlearn as xl 11 | 12 | from data import * 13 | 14 | d = pd.read_csv('',header=0,iterator=True) 15 | d = d.get_chunk() 16 | def train(): 17 | ffm = xl.create_ffm() 18 | train, test, _ = splitFFM() 19 | 20 | ffm.setTrain(train) 21 | ffm.setValidate(test) 22 | 23 | model = './modelFFM' 24 | sTime = time.strftime( 25 | '%m%d-%H%M', time.localtime(time.time())) 26 | if not os.path.exists(model): os.mkdir(model) 27 | model = '%s/xlModel_%s.txt' % (model, sTime) 28 | 29 | params = { 30 | 'epoch' : 100, 31 | 'metric' : 'auc', 32 | 'task' : 'binary', 33 | 34 | 'k' : 4, 35 | 'lr' : 0.02, 36 | 'lambda' : 1e-6, 37 | 'stop_window' : 3, 38 | } 39 | ffm.fit(params, model) 40 | 41 | def predict(): 42 | ffm = xl.create_ffm() 43 | _, _, test = splitFFM() 44 | 45 | ffm.setTest(test); ffm.setSigmoid() 46 | 47 | folder = './modelFFM' 48 | model = sorted( 49 | glob.glob('./modelFFM/xlModel_*.txt'))[-1] 50 | output = model.replace('Model', 'Output') 51 | ffm.predict(model, output) 52 | 53 | df = getMerged('aid', 'uid', kind=2) 54 | df['score'] = np.loadtxt(output) 55 | df.to_csv('submission.csv', index=False) 56 | 57 | zipName = '%s/submission.zip' % folder 58 | with zipfile.ZipFile(zipName, 'w') as f: 59 | f.write('submission.csv', compress_type=zipfile.ZIP_DEFLATED) 60 | 61 | 62 | if __name__ == '__main__': 63 | assert len(sys.argv)==2, 'Failed ...' 
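# Usage (from the branches below): `python ffm.py 1` trains the xlearn FFM model,
# `python ffm.py 0` runs prediction and writes submission.csv.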
64 | kind = int(sys.argv[1]) 65 | 66 | if kind == 1: 67 | train() 68 | elif kind == 0: 69 | predict() 70 | 71 | -------------------------------------------------------------------------------- /ffmdata.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/env python 2 | # -*- coding: utf-8 -*- 3 | # Created by Bo Song on 2018/4/26 4 | 5 | import pandas as pd 6 | from pandas import get_dummies 7 | import lightgbm as lgb 8 | from sklearn.feature_extraction.text import CountVectorizer 9 | from sklearn.preprocessing import OneHotEncoder,LabelEncoder 10 | from scipy import sparse 11 | import numpy as np 12 | import os 13 | import gc 14 | 15 | 16 | path='data/' 17 | 18 | 19 | 20 | one_hot_feature=['LBS','age','carrier','consumptionAbility','education','gender','advertiserId','campaignId', 'creativeId', 21 | 'adCategoryId', 'productId', 'productType'] 22 | 23 | vector_feature=['interest1','interest2','interest5','kw1','kw2','topic1','topic2','os','ct','marriageStatus'] 24 | continus_feature=['creativeSize'] 25 | 26 | ad_feature=pd.read_csv(path+'adFeature.csv') 27 | user_feature=pd.read_csv(path+'userFeature.csv') 28 | 29 | train=pd.read_csv(path+'train.csv') 30 | test=pd.read_csv(path+'test1.csv') 31 | 32 | data=pd.concat([train,test]) 33 | data=pd.merge(data,ad_feature,on='aid',how='left') 34 | data=pd.merge(data,user_feature,on='uid',how='left') 35 | 36 | 37 | data=data.fillna(-1) 38 | data=data[one_hot_feature+vector_feature+continus_feature] 39 | 40 | class FFMFormat: 41 | def __init__(self,vector_feat,one_hot_feat,continus_feat): 42 | self.field_index_ = None 43 | self.feature_index_ = None 44 | self.vector_feat=vector_feat 45 | self.one_hot_feat=one_hot_feat 46 | self.continus_feat=continus_feat 47 | 48 | 49 | def get_params(self): 50 | pass 51 | 52 | def set_params(self, **parameters): 53 | pass 54 | 55 | def fit(self, df, y=None): 56 | self.field_index_ = {col: i for i, col in enumerate(df.columns)} 57 | self.feature_index_ = dict() 58 | last_idx = 0 59 | for col in df.columns: 60 | if col in self.one_hot_feat: 61 | print(col) 62 | df[col]=df[col].astype('int') 63 | vals = np.unique(df[col]) 64 | for val in vals: 65 | if val==-1: continue 66 | name = '{}_{}'.format(col, val) 67 | if name not in self.feature_index_: 68 | self.feature_index_[name] = last_idx 69 | last_idx += 1 70 | elif col in self.vector_feat: 71 | print(col) 72 | vals=[] 73 | for data in df[col].apply(str): 74 | if data!="-1": 75 | for word in data.strip().split(' '): 76 | vals.append(word) 77 | vals = np.unique(vals) 78 | for val in vals: 79 | if val=="-1": continue 80 | name = '{}_{}'.format(col, val) 81 | if name not in self.feature_index_: 82 | self.feature_index_[name] = last_idx 83 | last_idx += 1 84 | self.feature_index_[col] = last_idx 85 | last_idx += 1 86 | return self 87 | 88 | def fit_transform(self, df, y=None): 89 | self.fit(df, y) 90 | return self.transform(df) 91 | 92 | def transform_row_(self, row): 93 | ffm = [] 94 | 95 | for col, val in row.loc[row != 0].to_dict().items(): 96 | if col in self.one_hot_feat: 97 | name = '{}_{}'.format(col, val) 98 | if name in self.feature_index_: 99 | ffm.append('{}:{}:1'.format(self.field_index_[col], self.feature_index_[name])) 100 | # ffm.append('{}:{}:{}'.format(self.field_index_[col], self.feature_index_[col], 1)) 101 | elif col in self.vector_feat: 102 | for word in str(val).split(' '): 103 | name = '{}_{}'.format(col, word) 104 | if name in self.feature_index_: 105 | ffm.append('{}:{}:1'.format(self.field_index_[col], 
self.feature_index_[name])) 106 | elif col in self.continus_feat: 107 | if val!=-1: 108 | ffm.append('{}:{}:{}'.format(self.field_index_[col], self.feature_index_[col], val)) 109 | return ' '.join(ffm) 110 | 111 | def transform(self, df): 112 | # val=[] 113 | # for k,v in self.feature_index_.items(): 114 | # val.append(v) 115 | # val.sort() 116 | # print(val) 117 | # print(self.field_index_) 118 | # print(self.feature_index_) 119 | return pd.Series({idx: self.transform_row_(row) for idx, row in df.iterrows()}) 120 | 121 | tr = FFMFormat(vector_feature,one_hot_feature,continus_feature) 122 | user_ffm=tr.fit_transform(data) 123 | user_ffm.to_csv('ffm.csv',index=False) 124 | 125 | train = pd.read_csv(path + 'train.csv') 126 | test = pd.read_csv(path+'test1.csv') 127 | 128 | Y = np.array(train.pop('label')) 129 | len_train=len(train) 130 | 131 | with open('ffm.csv') as fin: 132 | f_train_out = open('train_ffm.csv','w') 133 | f_test_out = open('test_ffm.csv', 'w') 134 | for (i,line) in enumerate(fin): 135 | if i [50,] 163 | "learning_rate": hp.randint("learning_rate", 10), # [0,1,2,3,4,5] -> 0.05,0.06 164 | "subsample": hp.randint("subsample", 4), # [0,1,2,3] -> [0.7,0.8,0.9,1.0] 165 | "min_child_weight": hp.randint("min_child_weight", 10), 166 | "num_leaves": hp.randint("num_leaves", 2) 167 | } 168 | algo = partial(tpe.suggest, n_startup_jobs=1) 169 | best = fmin(LGB, space, algo=algo, max_evals=4) 170 | print(best) 171 | print(LGB(best)) 172 | 173 | HY() 174 | 175 | 176 | t_end = datetime.datetime.now() 177 | print('training time: %s' % ((t_end - t_start).seconds / 60)) 178 | sys.stdout.flush() 179 | -------------------------------------------------------------------------------- /nn.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf8 -*- 2 | 3 | import sys 4 | import csv 5 | import glob 6 | import random 7 | import zipfile 8 | import numpy as np 9 | import mxnet as mx 10 | from mxnet.gluon import nn 11 | 12 | from data import * 13 | from nnHelper import * 14 | 15 | logging.basicConfig( 16 | level=logging.INFO, 17 | datefmt='%Y%m%d-%H:%M:%S', 18 | format='%(asctime)s: %(message)s', 19 | ) 20 | ################################################### 21 | 22 | class MyData(object): 23 | def __init__(self, ctx, isTrain=True): 24 | self.ctx = ctx 25 | self.isTrain = isTrain 26 | 27 | global name 28 | bs = getBS(name) 29 | self.batchSize = bs 30 | 31 | self.loadDatas() 32 | 33 | def loadDatas(self): 34 | logging.info('Loading datas ...') 35 | 36 | self.datas = loadSparse( 37 | 0 if self.isTrain else 1) 38 | self.len = self.datas.shape[0] 39 | self.indices = list(range(self.len)) 40 | self.iters = self.len / self.batchSize 41 | 42 | logging.info('Loading datas done ...') 43 | 44 | def __next__(self): 45 | index, batchSize = self.index, self.batchSize 46 | indices = self.indices[index:index+batchSize] 47 | 48 | self.index += batchSize 49 | if self.index > self.len: raise StopIteration 50 | 51 | sData = self.datas[indices].toarray() 52 | datas = mx.nd.array( 53 | sData[:, 1:], self.ctx, dtype='int32') 54 | labels = mx.nd.array(sData[:, 0], self.ctx) 55 | 56 | return datas, labels 57 | 58 | def reset(self): 59 | self.index = 0 60 | #if self.isTrain: 61 | # #self.indices = random.sample(self.indices,len(self.indices)) 62 | # random.#shuffle(self.indices) 63 | 64 | def __iter__(self): 65 | return self 66 | 67 | class MyDataVal(object): 68 | def __init__(self, ctx): 69 | self.ctx = ctx 70 | 71 | global name 72 | bs = getBS(name) 73 | self.batchSize = bs 
74 | 75 | self.loadDatas() 76 | 77 | def loadDatas(self): 78 | logging.info('Loading datas ...') 79 | 80 | keys = ['aid', 'uid'] 81 | self.ids = getMerged(*keys, kind=4).values 82 | 83 | self.datas = loadSparse(1) 84 | self.len = self.datas.shape[0] 85 | 86 | logging.info('Loading datas done ...') 87 | 88 | def __next__(self): 89 | index, batchSize = self.index, self.batchSize 90 | if index >= self.len: raise StopIteration 91 | end = min(index + batchSize, self.len) 92 | 93 | ids = self.ids[index:end] 94 | ids = mx.nd.array( 95 | ids, self.ctx, dtype='int32') 96 | 97 | sData = self.datas[index:end].toarray() 98 | datas = mx.nd.array( 99 | sData[:, 1:], self.ctx, dtype='int32') 100 | self.index += self.batchSize 101 | 102 | return ids, datas 103 | 104 | def reset(self): 105 | self.index = 0 106 | 107 | def __iter__(self): 108 | return self 109 | 110 | class MyDataTest(object): 111 | def __init__(self, ctx): 112 | self.ctx = ctx 113 | 114 | global name 115 | bs = getBS(name) 116 | self.batchSize = bs 117 | 118 | self.loadDatas() 119 | 120 | def loadDatas(self): 121 | logging.info('Loading datas ...') 122 | 123 | keys = ['aid', 'uid'] 124 | self.ids = getMerged(*keys, kind=2).values 125 | 126 | self.datas = loadSparse(2) 127 | self.len = self.datas.shape[0] 128 | 129 | logging.info('Loading datas done ...') 130 | 131 | def __next__(self): 132 | index, batchSize = self.index, self.batchSize 133 | if index >= self.len: raise StopIteration 134 | end = min(index + batchSize, self.len) 135 | 136 | ids = self.ids[index:end] 137 | ids = mx.nd.array( 138 | ids, self.ctx, dtype='int32') 139 | 140 | sData = self.datas[index:end].toarray() 141 | datas = mx.nd.array( 142 | sData[:, 1:], self.ctx, dtype='int32') 143 | self.index += self.batchSize 144 | 145 | return ids, datas 146 | 147 | def reset(self): 148 | self.index = 0 149 | 150 | def __iter__(self): 151 | return self 152 | 153 | ################################################### 154 | 155 | class MyNet(nn.HybridSequential): 156 | def __init__(self, ctx): 157 | super(MyNet, self).__init__() 158 | global name 159 | 160 | self.init(ctx) 161 | if getSorN(name): 162 | self.hybridize() 163 | #self.hybridize() 164 | self.collect_params().initialize( 165 | #mx.init.MSRAPrelu(slope=0), 166 | mx.init.Xavier(), 167 | ctx 168 | ) 169 | 170 | def init(self, ctx): 171 | with self.name_scope(): 172 | global name 173 | global lossKind 174 | 175 | n = lossKind + 1 176 | self.add(getStart(name)) 177 | self.add(MyIBA(128, 'relu'))#全连接最后一层,另外正则可选+在loss 178 | self.add(MyIBA(128, 'relu')) 179 | self.add(nn.Dense(n)) 180 | #self.add(MyIBA(n, 'self')) 181 | 182 | class MyNN(MyModel): 183 | def getNet(self, ctx): 184 | global lossKind 185 | 186 | return MyNet(ctx), getLoss(lossKind) 187 | 188 | def getModel(self): 189 | global model; return model 190 | 191 | def getMetric(self): 192 | global lossKind 193 | 194 | return getMetric(lossKind), 'aucMetric' 195 | 196 | def getData(self, ctx): 197 | return MyData(ctx, True), MyData(ctx, False) 198 | 199 | def forDebug(self, out): 200 | pass 201 | 202 | def getTrainer(self, params, iters): 203 | opt = getOpt(iters) 204 | 205 | return mx.gluon.Trainer(params, opt) 206 | #return mx.gluon.Trainer(params, 'adam',{ 'clip_gradient': 2}) 207 | 208 | class MyNNV(MyPredict): 209 | def __init__(self, gpu=0): 210 | super(MyNNV, self).__init__(gpuID) 211 | model = self.getModel() 212 | self.name = '%s/val.csv' % model 213 | 214 | self.f = open(self.name, 'w') 215 | self.csv = csv.writer(self.f) 216 | self.csv.writerow(['aid', 'uid', 
'score']) 217 | 218 | def getModel(self): 219 | global model; return model 220 | 221 | def onDone(self): 222 | self.f.close() 223 | 224 | def getData(self, ctx): 225 | return MyDataVal(ctx) 226 | 227 | def getNet(self, ctx): 228 | model = self.getModel() 229 | name = sorted(glob.glob( 230 | '%s/*.params' % model))[-1] 231 | 232 | net = MyNet(ctx) 233 | net.load_params(name, ctx) 234 | logging.info('Load %s ...' % name) 235 | 236 | return net 237 | 238 | def preProcess(self, data): 239 | return data[1] 240 | 241 | def postProcess(self, data, pData, output): 242 | ids = data[0].asnumpy() 243 | if lossKind == 0: 244 | output = mx.nd.sigmoid(output) 245 | output = output.asnumpy()[:,0] 246 | if lossKind == 1: 247 | output = mx.nd.softmax(output) 248 | output = output.asnumpy()[:,1] 249 | if lossKind == 2: 250 | output1 = output[:, 0:2] 251 | output2 = output[:, 2:3] 252 | 253 | output1 = mx.nd.softmax(output1) 254 | output1 = output1.asnumpy()[:,1] 255 | 256 | output2 = mx.nd.sigmoid(output2) 257 | output2 = output2.asnumpy()[:,0] 258 | 259 | output = 0.5*(output1 + output2) 260 | 261 | for id, out in zip(ids, output): 262 | self.csv.writerow([id[0], id[1], '%.6f'%out]) 263 | 264 | class MyNNP(MyPredict): 265 | def __init__(self, gpu=0): 266 | super(MyNNP, self).__init__(gpuID) 267 | model = self.getModel() 268 | self.name = '%s/submission.csv' % model 269 | 270 | self.f = open(self.name, 'w') 271 | self.csv = csv.writer(self.f) 272 | self.csv.writerow(['aid', 'uid', 'score']) 273 | 274 | def getModel(self): 275 | global model; return model 276 | 277 | def onDone(self): 278 | self.f.close() 279 | 280 | model = self.getModel() 281 | zipName = '%s/submission.zip' % model 282 | with zipfile.ZipFile(zipName, 'w') as f: 283 | f.write( 284 | self.name, 'submission.csv', 285 | compress_type=zipfile.ZIP_DEFLATED 286 | ) 287 | 288 | def getData(self, ctx): 289 | return MyDataTest(ctx) 290 | 291 | def getNet(self, ctx): 292 | model = self.getModel() 293 | name = sorted(glob.glob( 294 | '%s/*.params' % model))[-1] 295 | 296 | net = MyNet(ctx) 297 | net.load_params(name, ctx) 298 | print( 'Load %s ...' % name) 299 | 300 | return net 301 | 302 | def preProcess(self, data): 303 | return data[1] 304 | 305 | def postProcess(self, data, pData, output): 306 | ids = data[0].asnumpy() 307 | if lossKind == 0: 308 | output = mx.nd.sigmoid(output) 309 | output = output.asnumpy()[:,0] 310 | if lossKind == 1: 311 | output = mx.nd.softmax(output) 312 | output = output.asnumpy()[:,1] 313 | if lossKind == 2: 314 | output1 = output[:, 0:2] 315 | output2 = output[:, 2:3] 316 | 317 | output1 = mx.nd.softmax(output1) 318 | output1 = output1.asnumpy()[:,1] 319 | 320 | output2 = mx.nd.sigmoid(output2) 321 | output2 = output2.asnumpy()[:,0] 322 | 323 | output = 0.5*(output1 + output2) 324 | 325 | for id, out in zip(ids, output): 326 | self.csv.writerow([id[0], id[1], '%.6f'%out]) 327 | 328 | ################################################### 329 | if __name__ == '__main__': 330 | mx.random.seed(2018) 331 | random.seed(2018) 332 | logging.info('All start...') 333 | 334 | 335 | assert len(sys.argv)==5, 'Failed ...' 
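    # Expected command line, inferred from the argument parsing below, e.g.:
    #     python nn.py 1 0 dffm3 0
    # argv[1] kind     : 1 = train, 2 = write validation scores, 0 = write test submission
    # argv[2] gpuID    : device index (note: nnHelper.py builds mx.cpu(gpuID) contexts
    #                    in this snapshot, despite the name)
    # argv[3] name     : network key looked up in nnHelper.getStart (e.g. 'dffm3', 'test')
    # argv[4] lossKind : 0 = sigmoid BCE, 1 = weighted softmax CE, 2 = average of both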
336 | kind, gpuID = list(map(int, sys.argv[1:3])) 337 | 338 | name = sys.argv[3] 339 | model = './modelNN_%s' % name 340 | 341 | lossKind = int(sys.argv[4]) 342 | if lossKind == 1: 343 | model += '#Softmax' 344 | if lossKind == 2: 345 | model += '#S2SLoss' 346 | 347 | if kind==1: 348 | m = MyNN(gpuID); m.train() 349 | if kind==2: 350 | m = MyNNV(gpuID); m.predict() 351 | if kind==0: 352 | m = MyNNP(gpuID); m.predict() 353 | 354 | 355 | logging.info('All done ...') 356 | 357 | -------------------------------------------------------------------------------- /nnHelper.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf8 -*- 2 | 3 | import os 4 | import random 5 | import logging 6 | import mxnet as mx 7 | import numpy as np 8 | from mxnet.gluon import nn 9 | from mxnet import autograd 10 | from mxnet.gluon import loss 11 | import os.path 12 | import time 13 | from data import * 14 | from sklearn.metrics import roc_auc_score 15 | 16 | logging.basicConfig( 17 | level=logging.INFO, 18 | datefmt='%Y%m%d-%H:%M:%S', 19 | format='%(asctime)s: %(message)s', 20 | ) 21 | 22 | ################################################### 23 | 24 | def getBS(name): 25 | # bSs = { 26 | # 'dffm31' : 200, 27 | # 'dfcm12' : 500, 28 | # 'dfcm13' : 500, 29 | # } 30 | # 31 | # if name in bSs: 32 | # return bSs[name] 33 | return 2048 34 | 35 | def getOpt(iters): 36 | # opt = mx.optimizer.AdaGrad( 37 | # wd=1e-3, 38 | # learning_rate=2e-4, 39 | # ) 40 | opt = mx.optimizer.Adam(wd=1e-3, 41 | learning_rate=1e-2,) 42 | 43 | return opt 44 | 45 | def getSorN(name): 46 | 47 | models = ( 48 | 'dfr', 49 | ) 50 | 51 | return name not in models 52 | 53 | def getStart(name): 54 | dims, User, Ad= getFFMDim() 55 | dims = dims.values() 56 | inC = 4989 57 | models = { 58 | 'test': lambda: Chen_DFM(dims, inC, 8, 128, User, Ad), # 2048 59 | 'test1': lambda: Guo_DFFM(dims, inC, 8, 256), # 4096 60 | 'testdin': lambda: MyDIN(dims, inC, 8, 128, User, Ad), # XXXX, 7325, XXXX 61 | 62 | 'dcn' : lambda: MyDCN(dims, inC, 8, 64, 3), 63 | 'dfcm' : lambda: MyDFCM(dims, inC, 8, 64), # 7370, 7406, XXXX 64 | 'dfcm11': lambda: MyDFCM(dims, inC, 16, 128), # XXXX, 7431, XXXX 65 | 'dfcm12': lambda: MyDFCM(dims, inC, 32, 128), # XXXX, 7482, XXXX 66 | 'dfcm13': lambda: MyDFCM(dims, inC, 64, 256), # XXXX, 7485, XXXX 67 | 68 | 'dfr' : lambda: MyDFR(dims, inC, 8, 64), 69 | 'dfu' : lambda: MyDFU(dims, inC, 8, 64), # 7269, XXXX, XXXX 70 | 'dfm' : lambda: MyDFM(dims, inC, 8, 64), # 7256, 7280, XXXX 71 | 72 | 'dfm2' : lambda: MyDFM2(dims, inC, 8, 64), # XXXX, 7369, XXXX 73 | 'dfm21' : lambda: MyDFM2(dims, inC, 16, 128), # XXXX, 7410, XXXX 74 | 'dfm22' : lambda: MyDFM2(dims, inC, 32, 128), # XXXX, 7445, XXXX 75 | 'dfm23' : lambda: MyDFM2(dims, inC, 32, 256), # XXXX, 7450, XXXX 76 | 'dfm24' : lambda: MyDFM2(dims, inC, 64, 256), # XXXX, 7468, XXXX 77 | 'dfm25' : lambda: MyDFM2(dims, inC, 128, 256), # XXXX, 7468, XXXX 78 | 'dfm26' : lambda: MyDFM2(dims, inC, 256, 256), # XXXX, 7475, XXXX 79 | 80 | 'dfz' : lambda: MyDIN(dims, inC, 8, 64), # XXXX, 7325, XXXX 81 | 'dfz11' : lambda: MyDFZ(dims, inC, 16, 128), # XXXX, 7402, XXXX 82 | 'dfz12' : lambda: MyDFZ(dims, inC, 32, 128), # XXXX, 7443, XXXX 83 | 'dfz13' : lambda: MyDFZ(dims, inC, 32, 256), # XXXX, 7439, XXXX 84 | 'dfz14' : lambda: MyDFZ(dims, inC, 64, 256), # XXXX, 7456, XXXX 85 | 'dfz15' : lambda: MyDFZ(dims, inC, 128, 256), # XXXX, 7463, XXXX 86 | 'dfz16' : lambda: MyDFZ(dims, inC, 256, 256), # XXXX, 7469, XXXX 87 | 88 | 'din' : lambda: MyDIN(dims, inC, 8, 64), # 
XXXX, 7325, XXXX 89 | 'dfin' : lambda: MyDFIN(dims, inC, 8, 64), # 7399, 7415, XXXX 90 | 'dfcn' : lambda: MyDFCN(dims, inC, 8, 64), # 7415, XXXX, XXXX 91 | 92 | 'dffm' : lambda: MyDFFM(dims, inC, 8, 64), # 7395, 7439, XXXX 93 | 'dffm2' : lambda: MyDFFM2(dims, inC, 8, 64), # 7396, 7444, XXXX 94 | 'dffm3' : lambda: MyDFFM3(dims, inC, 8, 256), # 2048 95 | 'dffm31': lambda: MyDFFM3(dims, inC, 16, 64), # XXXX, 7462, XXXX 96 | } 97 | 98 | return models[name]() 99 | 100 | def getLoss(lossKind): 101 | losses = { 102 | 0: MyLoss, 103 | 1: MyLoss2, 104 | 2: MyLoss3, 105 | } 106 | return losses[lossKind]() 107 | 108 | def getMetric(lossKind): 109 | metrics = { 110 | 0: MyMetric, 111 | 1: MyMetric2, 112 | 2: MyMetric3, 113 | } 114 | return metrics[lossKind]() 115 | 116 | def randomRange(start, end): 117 | r1 = r2 = 0 118 | while(r1 == r2): 119 | r1 = random.randint(start, end) 120 | r2 = random.randint(start, end) 121 | t1, t2 = min(r1,r2), max(r1,r2) 122 | return t1, t2 123 | 124 | ################################################### 125 | 126 | class MyBA(nn.HybridSequential): 127 | def __init__(self, act='relu'): 128 | super(MyBA, self).__init__() 129 | 130 | with self.name_scope(): 131 | self.add(nn.BatchNorm()) 132 | self.add(MyAct(act)) 133 | 134 | class MyAct(nn.HybridBlock): 135 | def __init__(self, act='relu'): 136 | super(MyAct, self).__init__() 137 | self.act = act 138 | 139 | def hybrid_forward(self, F, x): 140 | if self.act == 'self': return x 141 | return F.Activation(x, self.act) 142 | 143 | class MyIBA(nn.HybridSequential): 144 | def __init__(self, c, act='relu'): 145 | super(MyIBA, self).__init__() 146 | 147 | with self.name_scope(): 148 | self.add(nn.Dense(c))#全连接+bias 149 | self.add(MyBA(act))#可选 150 | 151 | class MyCBA(nn.HybridSequential): 152 | def __init__(self, c, act='relu'): 153 | super(MyCBA, self).__init__() 154 | 155 | with self.name_scope(): 156 | self.add(nn.Conv1D(*c)) 157 | self.add(MyBA(act)) 158 | 159 | class MyRes(nn.HybridBlock): 160 | def __init__(self, c1, c2): 161 | super(MyRes, self).__init__() 162 | 163 | with self.name_scope(): 164 | self.opr = MyAct('relu') 165 | self.op1 = MyIBA(c1, 'self') 166 | self.op2 = MyIBA(c2, 'relu') 167 | 168 | def hybrid_forward(self, F, x): 169 | return self.op2(self.opr(self.op1(x)+x)) 170 | 171 | class MyRes2(nn.HybridSequential): 172 | def __init__(self, c1, c2, num): 173 | super(MyRes2, self).__init__() 174 | 175 | with self.name_scope(): 176 | for i in range(num): 177 | self.add(MyRes(c1, c1)) 178 | self.add(MyRes(c1, c2)) 179 | 180 | #&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& 181 | class MyDPO(nn.HybridSequential): 182 | def __init__(self, rate=.5): 183 | super(MyDPO, self).__init__() 184 | self.rate = rate 185 | 186 | with self.name_scope(): 187 | self.add(nn.Dropout(.5)) 188 | 189 | class MyIDP(nn.HybridSequential): 190 | def __init__(self, c, c1): 191 | super(MyIDP, self).__init__() 192 | 193 | with self.name_scope(): 194 | self.add(nn.Dense(c, activation='sigmoid', flatten=False))#全连接+bias 195 | self.add(nn.Dense(c1, activation='sigmoid', flatten=False)) # 全连接+bias 196 | self.add(nn.Dense(1, flatten=False)) # 全连接+bias 197 | 198 | 199 | ##################################################### 200 | class MyC(nn.HybridBlock): 201 | def __init__(self, shapew,shapeb): 202 | super(MyC, self).__init__() 203 | 204 | with self.name_scope(): 205 | self.b = self.params.get('b', shape=shapeb) 206 | self.w = self.params.get('w', shape=shapew) 207 | 208 | def hybrid_forward(self, F, x0, x1, b, w): 209 | y = 
F.broadcast_add(F.dot( 210 | F.batch_dot(x0, x1, False, True), w), b) 211 | #F.FullyConnected() 212 | return y 213 | 214 | class MyE(nn.HybridBlock): 215 | def __init__(self, inC, outC): 216 | super(MyE, self).__init__() 217 | 218 | with self.name_scope(): 219 | self.inC = inC 220 | self.outC = outC 221 | 222 | self.w = self.params.get( 223 | 'weight', 224 | shape=(inC, outC), 225 | allow_deferred_init=True 226 | ) 227 | 228 | def hybrid_forward(self, F, x, w): 229 | zw = F.concat( 230 | F.zeros((1, self.outC)), w, dim=0) 231 | return F.Embedding(x, w, self.inC, self.outC) 232 | 233 | class MyEB(nn.HybridBlock): 234 | def __init__(self, dims, inC, outC): 235 | super(MyEB, self).__init__() 236 | 237 | self.d = dims#a list of fields' max length 238 | self.e = MyE(inC, outC) 239 | 240 | 241 | def onDone(self, F, result): 242 | return result 243 | 244 | def hybrid_forward(self, F, x): 245 | e = self.e(x)#e*v? 246 | result, start, end = [], 0, 0 247 | for i, size in enumerate(self.d): 248 | start, end = end, end + size 249 | sliced = F.slice_axis(e, 1, start, end) 250 | result.append(F.mean(sliced, 1, True))#av_pool 251 | 252 | return self.onDone(F, result) 253 | 254 | class MyAUEB(nn.HybridBlock): 255 | def __init__(self, dims, inC, outC, User ,Ad): 256 | super(MyAUEB, self).__init__() 257 | 258 | self.d = dims#a list of fields' max length 259 | self.e = MyE(inC, outC) 260 | self.User = [int(x) for x in User] 261 | self.Ad = [int(x) for x in Ad] 262 | 263 | 264 | def onDone(self, F, result, User, Ad): 265 | for k in User: 266 | result[k] = User[k] 267 | for k in Ad: 268 | result[k] = Ad[k] 269 | keys = sorted(result.keys()) 270 | res = [] 271 | for k in keys: 272 | res.append(result[k]) 273 | return res 274 | 275 | def hybrid_forward(self, F, x): 276 | e = self.e(x) # e*v? 277 | result, User, Ad, start, end = {}, {}, {}, 0, 0 278 | for i, size in enumerate(self.d): 279 | start, end = end, end + size 280 | sliced = F.slice_axis(e, 1, start, end) 281 | if i in self.User: 282 | User[i] = sliced 283 | elif i in self.Ad: 284 | Ad[i] = sliced 285 | else: 286 | result[i] = sliced 287 | 288 | # print(User[1].infer_shape(data=(2048,590)))#B*field*K 289 | # print(Ad[0].infer_shape(data=(2048, 590))) 290 | # User = F.concat(*User, dim=1)#B*F*K 291 | # print(User.infer_shape(data=(2048, 590))) 292 | # Ad = F.concat(*Ad, dim=1)#B*1*K 293 | # print(Ad.infer_shape(data=(2048, 590))) 294 | return self.onDone(F, result, User, Ad) 295 | 296 | class MyAttention(MyAUEB): 297 | def __init__(self, dims, inC, outC, User, Ad): 298 | super(MyAttention, self).__init__(dims, inC, outC, User, Ad) 299 | self.K = outC 300 | self.ip = MyIDP(80, 40) 301 | self.dims = dims 302 | 303 | # def onDone(self, F, result, User, Ad): 304 | # #concat embedding 305 | # for k in User: 306 | # result[k] = User[k] 307 | # for k in Ad: 308 | # result[k] = Ad[k] 309 | # keys = sorted(result.keys()) 310 | # res = [] 311 | # for k in keys: 312 | # res.append(result[k]) 313 | # return result 314 | 315 | def hybrid_forward(self, F, x): 316 | e = self.e(x) # e*v? 
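        # DIN-style attention over the multi-value user fields (a reading of the code
        # below): the single ad embedding is tiled to act as the query; query and each
        # user-field value are combined as [q, k, q - k, q * k] and scored by the small
        # MLP self.ip (MyIDP: 80 -> 40 -> 1); padded positions are masked out, the
        # scores are scaled by 1/sqrt(K) and softmax-normalised, and the weighted sum
        # over the field replaces the plain average pooling done in MyEB.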
317 | result, User, Ad, start, end = {}, {}, {}, 0, 0 318 | for i, size in enumerate(self.d): 319 | start, end = end, end + size 320 | sliced = F.slice_axis(e, 1, start, end) 321 | if i in self.User: 322 | User[i] = sliced 323 | elif i in self.Ad: 324 | Ad[i] = sliced 325 | else: 326 | result[i] = sliced 327 | 328 | # print(User[1].infer_shape(data=(2048,590)))#B*field*K 329 | # print(Ad[0].infer_shape(data=(2048, 590))) 330 | # User = F.concat(*User, dim=1)#B*F*K 331 | # print(User.infer_shape(data=(2048, 590))) 332 | # Ad = F.concat(*Ad, dim=1)#B*1*K 333 | # print(Ad.infer_shape(data=(2048, 590))) 334 | #AT part 335 | AU = {} 336 | for k in User:#[B, F, K] 337 | queries = Ad[list(Ad.keys())[0]] 338 | shape = User[k].infer_shape(data=(2048, 590))[1][0]#(,,) 339 | shapeF = list(self.dims)[k] 340 | print(queries.infer_shape(data=(2048, 590))) 341 | queries = F.tile(queries, [1, shapeF]) 342 | print(queries.infer_shape(data=(2048, 590))) 343 | queries = F.reshape(queries, [-1, shapeF, self.K]) 344 | print(queries.infer_shape(data=(2048, 590))) 345 | din_all = F.concat(queries, User[k], queries - User[k], queries * User[k], dim=-1) 346 | print(din_all.infer_shape(data=(2048, 590))) 347 | din_all = self.ip(din_all) 348 | print(din_all.infer_shape(data=(2048, 590))) 349 | outputs = F.reshape(din_all, [-1, 1, shapeF])#[B, 1, F] 350 | print(outputs.infer_shape(data=(2048, 590))) 351 | key_masks = F.broadcast_greater(User[k], F.zeros_like(User[k])) 352 | paddings = F.ones_like(outputs) * (-2 ** 32 + 1) 353 | outputs = F.where(key_masks, outputs, paddings) 354 | outputs = outputs / (shape[-1] ** 0.5) 355 | print(outputs.infer_shape(data=(2048, 590))) 356 | # Activation 357 | outputs = F.softmax(outputs) # [B, 1, T] 358 | # Weighted sum 359 | outputs = F.batch_dot(outputs, User[k]) # [B, 1, H] 360 | print(outputs.infer_shape(data=(2048, 590))) 361 | AU[k] = outputs 362 | 363 | return self.onDone(F, result, AU, Ad) 364 | 365 | class MyED(MyEB):#维度(各每行最大取值数),weight输入维度,输出维度 366 | def __init__(self, dims, inC, outC): 367 | super(MyED, self).__init__(dims, inC, outC) 368 | 369 | def onDone(self, F, result): 370 | return F.concat(*result, dim=1) 371 | 372 | class MyER(nn.HybridBlock): 373 | def __init__(self, dims, inC, outC): 374 | super(MyER, self).__init__() 375 | 376 | self.d = dims 377 | self.e = MyE(inC, outC) 378 | 379 | def hybrid_forward(self, F, x): 380 | flag = autograd.is_recording() 381 | 382 | e = self.e(x) 383 | result, start, end = [], 0, 0 384 | for i, size in enumerate(self.d): 385 | start, end = end, end + size 386 | 387 | t1, t2 = start, end 388 | if flag and size > 5: 389 | t1, t2 = randomRange(start, end) 390 | 391 | sliced = F.slice_axis(e, 1, t1, t2) 392 | result.append(F.mean(sliced, 1, True)) 393 | 394 | return F.concat(*result, dim=1) 395 | 396 | class MyU(nn.HybridBlock): 397 | def __init__(self, inC, outC): 398 | super(MyU, self).__init__() 399 | 400 | with self.name_scope(): 401 | self.inC = inC 402 | self.outC = outC 403 | 404 | self.w = self.params.get( 405 | 'weight', 406 | shape=(inC-1, outC), 407 | allow_deferred_init=True, 408 | init=mx.init.Uniform(0.1), 409 | ) 410 | 411 | def hybrid_forward(self, F, x, w): 412 | zw = F.concat( 413 | F.zeros((1, self.outC)), w, dim=0) 414 | return F.Embedding(x, zw, self.inC, self.outC) 415 | 416 | class MyUD(nn.HybridBlock): 417 | def __init__(self, dims, inC, outC): 418 | super(MyUD, self).__init__() 419 | 420 | self.d = dims 421 | self.e = MyU(inC, outC) 422 | 423 | def hybrid_forward(self, F, x): 424 | e = self.e(x) 425 | result, 
start, end = [], 0, 0 426 | for i, size in enumerate(self.d): 427 | start, end = end, end + size 428 | sliced = F.slice_axis(e, 1, start, end) 429 | result.append(F.mean(sliced, 1, True)) 430 | return F.concat(*result, dim=1) 431 | 432 | class MyZ(nn.HybridBlock): 433 | def __init__(self, inC, outC): 434 | super(MyZ, self).__init__() 435 | 436 | with self.name_scope(): 437 | self.inC = inC 438 | self.outC = outC 439 | 440 | self.w = self.params.get( 441 | 'weight', 442 | shape=(inC, outC), 443 | allow_deferred_init=True 444 | ) 445 | 446 | def hybrid_forward(self, F, x, w): 447 | return F.Embedding(x, w, self.inC, self.outC) 448 | 449 | class MyZB(nn.HybridBlock): 450 | def __init__(self, dims, inC, outC): 451 | super(MyZB, self).__init__() 452 | 453 | self.d = dims 454 | self.e = MyZ(inC, outC) 455 | 456 | def onDone(self, F, result): 457 | return result 458 | 459 | def hybrid_forward(self, F, x): 460 | e = self.e(x) 461 | result, start, end = [], 0, 0 462 | for i, size in enumerate(self.d): 463 | start, end = end, end + size 464 | sliced = F.slice_axis(e, 1, start, end) 465 | result.append(F.mean(sliced, 1, True)) 466 | 467 | return self.onDone(F, result) 468 | 469 | ################################################### 470 | class MyDPO(nn.HybridSequential): 471 | def __init__(self, rate=.5): 472 | super(MyDPO, self).__init__() 473 | self.rate = rate 474 | 475 | with self.name_scope(): 476 | self.add(nn.Dropout(.5)) 477 | 478 | 479 | class Chen_DFM(nn.HybridBlock): 480 | def __init__(self, dims, inC, outC, unit, User, Ad): 481 | super(Chen_DFM, self).__init__() 482 | 483 | with self.name_scope(): 484 | self.ba = MyBA('relu') 485 | #self.be = MyBA('relu') 486 | self.ip = MyIBA(unit, 'relu') 487 | self.ed = MyAttention(dims, inC, outC, User, Ad) 488 | self.lr = MyAttention(dims, inC, 1, User, Ad) 489 | self.dpo = MyDPO() 490 | self.n = len(dims) 491 | 492 | def hybrid_forward(self, F, x): 493 | # e = self.be(self.ed(x)) 494 | # 495 | # lr = self.be(self.lr(x)) 496 | e = self.ed(x) 497 | lr = self.lr(x) 498 | 499 | order1 = self.dpo(F.sum(lr, axis=2)) 500 | 501 | order2_1 = F.square(F.sum(e, axis=1)) 502 | order2_2 = F.sum(F.square(e), axis=1) 503 | order2 = self.dpo(0.5 * (order2_1 - order2_2)) 504 | 505 | dp = self.dpo(F.flatten(e)) 506 | # for i in range(0, 2): 507 | dp = self.dpo(self.ip(dp)) 508 | # in_shape, out_shape, uax_shape = out.infer_shape(data=(256,207)) 509 | # print(in_shape, out_shape, uax_shape) 510 | return self.ba( 511 | F.concat(dp, order1, order2, dim=1)) 512 | 513 | class Guo_DFFM(nn.HybridBlock): 514 | def __init__(self, dims, inC, outC, unit): 515 | super(Guo_DFFM, self).__init__() 516 | 517 | with self.name_scope(): 518 | self.n = n = len(dims) 519 | #self.be = MyBA('relu') 520 | self.ed = MyED(dims, inC, outC*n) 521 | 522 | def hybrid_forward(self, F, x): 523 | e = self.ed(x) 524 | #e = self.be(e) 525 | print(e.infer_shape(data=(2048, 590))) 526 | ss, es = [], F.split(e, self.n, 1) 527 | print(es.infer_shape(data=(2048, 590))) 528 | es1= F.split(es[0], self.n, 2) 529 | print(es1.infer_shape(data=(2048, 590))) 530 | for s in es: 531 | ss.append(F.split(s, self.n, 2))#[n*1*8,n*1*8,...] 
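        # Field-aware second-order term (FFM style), as implemented below: every field i
        # carries n separate embeddings, one per field it can interact with.  For a
        # field pair (i, j) the interaction is the element-wise product of ss[i][j]
        # (field i's embedding towards j) and ss[j][i]; the block below collects these
        # products with itertools.combinations, stacks them along the pair axis and
        # sums over that axis, giving one k-dimensional order-2 vector per sample.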
532 | ############################################################################### 533 | import itertools 534 | embed_var_dict = {} 535 | for (i1, i2) in itertools.combinations(list(range(0, self.n)), 2): 536 | c1, c2 = i1, i2 537 | embed_var_dict.setdefault(c1, {})[c2] = ss[c1][i2] # None * k 538 | embed_var_dict.setdefault(c2, {})[c1] = ss[c2][i1] # None * k 539 | x_mat = [] 540 | y_mat = [] 541 | input_size = 0 542 | for (c1, c2) in itertools.combinations(embed_var_dict.keys(), 2): 543 | input_size += 1 544 | x_mat.append(embed_var_dict[c1][c2]) #input_size * None * k 545 | y_mat.append(embed_var_dict[c2][c1]) #input_size * None * k 546 | print(x_mat[0].infer_shape(data=(2048, 590))) 547 | x_mat = F.concat(*x_mat,dim=1) 548 | print(x_mat.infer_shape(data=(2048, 590))) 549 | y_mat = F.concat(*y_mat, dim=1) 550 | res = x_mat*y_mat 551 | print(res.infer_shape(data=(2048, 590))) 552 | ########################################################### 553 | # result = [] 554 | # for i in range(self.n): 555 | # for k in range(i, self.n): 556 | # result.append(ss[i][k]*ss[k][i])#ele-wise 557 | # #print(ss[i][k].infer_shape(data=(2048, 590))) 558 | # res = F.concat(*result, dim=1) 559 | # print(res.infer_shape(data=(2048, 590))) 560 | order2=F.sum(res, axis=1) 561 | print(order2.infer_shape(data=(2048, 590))) 562 | 563 | return order2 564 | class MyDCN(nn.HybridBlock): 565 | def __init__(self, dims, inC, outC, unit, depth): 566 | super(MyDCN, self).__init__() 567 | 568 | with self.name_scope(): 569 | self.be = MyBA('relu') 570 | self.ba = MyBA('relu') 571 | self.ip = MyIBA(unit, 'relu') 572 | self.ed = MyED(dims, inC, outC) 573 | 574 | self.depth = depth 575 | for i in range(depth): 576 | setattr( 577 | self, 'MyC#%02d'%i, 578 | MyC((len(dims)*outC, 1)) 579 | ) 580 | setattr( 581 | self, 'MyBA#%02d'%i, 582 | MyBA() 583 | ) 584 | 585 | def hybrid_forward(self, F, x): 586 | e = self.be(self.ed(x)) 587 | deep = self.ip(F.flatten(e)) 588 | 589 | y = e = F.expand_dims(F.flatten(e), -1) 590 | for i in range(self.depth): 591 | mc = getattr(self, 'MyC#%02d'%i) 592 | ba = getattr(self, 'MyBA#%02d'%i) 593 | 594 | y = ba(mc(e, y)) 595 | cross = F.flatten(y) 596 | 597 | return self.ba(F.concat(deep, cross, dim=1)) 598 | 599 | class MyDFM(nn.HybridBlock): 600 | def __init__(self, dims, inC, outC, unit): 601 | super(MyDFM, self).__init__() 602 | 603 | with self.name_scope(): 604 | self.ba = MyBA('relu') 605 | self.be = MyBA('relu') 606 | self.ip = MyIBA(unit, 'relu') 607 | self.ed = MyED(dims, inC, outC) 608 | 609 | 610 | def hybrid_forward(self, F, x): 611 | ed = self.ed(x) 612 | # in_shape, out_shape, uax_shape = ed.infer_shape(data=(2048,590)) 613 | # print('inshape:',in_shape,'outshape:', out_shape) 614 | e = self.be(ed) 615 | in_shape, out_shape, uax_shape = e.infer_shape(data=(2048,590)) 616 | print('inshape:',in_shape,'outshape:', out_shape) 617 | deep = self.ip(e) 618 | in_shape, out_shape, uax_shape = deep.infer_shape(data=(2048,590)) 619 | print('inshape:',in_shape,'outshape:', out_shape) 620 | order1 = F.sum(e, axis=2) 621 | order2_1 = F.square(F.sum(e, axis=1)) 622 | order2_2 = F.sum(F.square(e), axis=1) 623 | order2 = 0.5*(order2_1-order2_2) 624 | 625 | return self.ba( 626 | F.concat(deep, order1, order2, dim=1)) 627 | 628 | class MyDFM2(nn.HybridBlock): 629 | def __init__(self, dims, inC, outC, unit): 630 | super(MyDFM2, self).__init__() 631 | 632 | with self.name_scope(): 633 | self.n = len(dims) 634 | self.ba = MyBA('relu') 635 | self.be = MyBA('relu') 636 | self.ip = MyIBA(unit, 'relu') 637 | 
self.ed = MyEB(dims, inC, outC) 638 | 639 | def hybrid_forward(self, F, x): 640 | x = self.ed(x) 641 | in_shape, out_shape, uax_shape = x[0].infer_shape(data=(2048,590)) 642 | print('inshape:',in_shape,'outshape:', out_shape) 643 | y = F.concat(*x, dim=1) 644 | e = self.be(y) 645 | in_shape, out_shape, uax_shape = e.infer_shape(data=(2048,590)) 646 | print('inshape:',in_shape,'outshape:', out_shape) 647 | deep = self.ip(e) 648 | order1 = F.sum(e, axis=2) 649 | 650 | # order2_1 = F.square(F.sum(e, axis=1)) 651 | # order2_2 = F.sum(F.square(e), axis=1) 652 | # order2 = 0.5*(order2_1-order2_2) 653 | order2 = [] 654 | for i in range(self.n): 655 | for k in range(i, self.n): 656 | order2.append(x[i]*x[k]) 657 | order2 = F.flatten(F.concat(*order2, dim=1)) 658 | 659 | return self.ba( 660 | F.concat(deep, order1, order2, dim=1)) 661 | 662 | class MyDFU(nn.HybridBlock): 663 | def __init__(self, dims, inC, outC, unit): 664 | super(MyDFU, self).__init__() 665 | 666 | with self.name_scope(): 667 | self.ba = MyBA('relu') 668 | self.be = MyBA('relu') 669 | self.ip = MyIBA(unit, 'relu') 670 | self.ed = MyUD(dims, inC, outC) 671 | 672 | def hybrid_forward(self, F, x): 673 | e = self.be(self.ed(x)) 674 | 675 | deep = self.ip(e) 676 | order1 = F.sum(e, axis=2) 677 | order2_1 = F.square(F.sum(e, axis=1)) 678 | order2_2 = F.sum(F.square(e), axis=1) 679 | order2 = 0.5*(order2_1-order2_2) 680 | 681 | return self.ba( 682 | F.concat(deep, order1, order2, dim=1)) 683 | 684 | class MyDFR(nn.HybridBlock): 685 | def __init__(self, dims, inC, outC, unit): 686 | super(MyDFR, self).__init__() 687 | 688 | with self.name_scope(): 689 | self.ba = MyBA('relu') 690 | self.be = MyBA('relu') 691 | self.ip = MyIBA(unit, 'relu') 692 | self.ed = MyER(dims, inC, outC) 693 | 694 | def hybrid_forward(self, F, x): 695 | e = self.be(self.ed(x)) 696 | 697 | deep = self.ip(e) 698 | order1 = F.sum(e, axis=2) 699 | order2_1 = F.square(F.sum(e, axis=1)) 700 | order2_2 = F.sum(F.square(e), axis=1) 701 | order2 = 0.5*(order2_1-order2_2) 702 | 703 | return self.ba( 704 | F.concat(deep, order1, order2, dim=1)) 705 | 706 | class MyDFZ(nn.HybridBlock): 707 | def __init__(self, dims, inC, outC, unit): 708 | super(MyDFZ, self).__init__() 709 | 710 | with self.name_scope(): 711 | self.n = len(dims) 712 | self.ba = MyBA('relu') 713 | self.be = MyBA('relu') 714 | self.ip = MyIBA(unit, 'relu') 715 | self.ed = MyZB(dims, inC, outC) 716 | 717 | def hybrid_forward(self, F, x): 718 | x = self.ed(x) 719 | y = F.concat(*x, dim=1) 720 | 721 | e = self.be(y) 722 | deep = self.ip(e) 723 | order1 = F.sum(e, axis=2) 724 | 725 | order2 = [] 726 | for i in range(self.n): 727 | for k in range(i+1, self.n): 728 | order2.append(x[i]*x[k]) 729 | order2 = F.flatten(F.concat(*order2, dim=1)) 730 | 731 | return self.ba( 732 | F.concat(deep, order1, order2, dim=1)) 733 | 734 | class MyDFCN(nn.HybridBlock): 735 | def __init__(self, dims, inC, outC, unit): 736 | super(MyDFCN, self).__init__() 737 | 738 | with self.name_scope(): 739 | self.n = n = len(dims) 740 | self.be = MyBA('relu') 741 | self.ba = MyBA('relu') 742 | self.ip = MyIBA(unit, 'relu') 743 | self.ed = MyED(dims, inC, outC*n) 744 | 745 | for i in range(n): 746 | layer = nn.HybridSequential() 747 | layer.add(MyBA('relu')) 748 | layer.add(MyCBA((n,3,1,1), 'relu')) 749 | layer.add(MyCBA((n,3,1,1), 'relu')) 750 | layer.add(MyCBA((n,3,1,1), 'relu')) 751 | setattr( 752 | self, 'MyLayer#%02d'%i, layer 753 | ) 754 | 755 | def hybrid_forward(self, F, x): 756 | e = self.be(self.ed(x)) 757 | 758 | deep = self.ip(e) 759 | 
order1 = F.sum(e, axis=2) 760 | 761 | ss, es = [], F.split(e, self.n, 1) 762 | for s in es: ss.append(F.split(s,self.n,2)) 763 | 764 | order2 = [] 765 | for i in range(self.n): 766 | temp = [] 767 | for k in range(self.n): 768 | temp.append(ss[i][k]*ss[k][i]) 769 | temp = F.concat(*temp, dim=1) 770 | temp = F.swapaxes(temp, 1, 2) 771 | 772 | layer = getattr(self, 'MyLayer#%02d'%i) 773 | temp = F.swapaxes(layer(temp), 1, 2) 774 | order2.append(F.sum(temp, axis=1)) 775 | order2 = F.concat(*order2, dim=1) 776 | 777 | return self.ba( 778 | F.concat(deep, order1, order2, dim=1)) 779 | 780 | class MyDFCM(nn.HybridBlock): 781 | def __init__(self, dims, inC, outC, unit): 782 | super(MyDFCM, self).__init__() 783 | 784 | with self.name_scope(): 785 | self.n = len(dims) 786 | self.ba = MyBA('relu') 787 | self.be = MyBA('relu') 788 | self.ip = MyIBA(unit, 'relu') 789 | self.ed = MyEB(dims, inC, outC) 790 | 791 | def hybrid_forward(self, F, x): 792 | es = self.ed(x) 793 | e = self.be(F.concat(*es, dim=1)) 794 | 795 | deep = self.ip(e) 796 | order1 = F.sum(e, axis=2) 797 | 798 | order2 = [] 799 | for i in range(self.n): 800 | for k in range(i+1, self.n): 801 | order2.append(F.batch_dot( 802 | es[i], es[k], True, False)) 803 | order2 = F.flatten(F.concat(*order2, dim=1)) 804 | 805 | return self.ba( 806 | F.concat(deep, order1, order2, dim=1)) 807 | 808 | 809 | 810 | class MyDIN(nn.HybridBlock): 811 | def __init__(self, dims, inC, outC, unit, unit2=0): 812 | super(MyDIN, self).__init__() 813 | 814 | with self.name_scope(): 815 | self.n = len(dims) 816 | self.ba = MyBA('relu') 817 | self.be = MyBA('relu') 818 | self.ip = MyIBA(unit, 'relu') 819 | self.ed = MyEB(dims, inC, outC) 820 | 821 | unit2 = unit2 or outC 822 | for i in range(self.n): 823 | for k in range(i + 1, self.n): 824 | setattr( 825 | self, 'fc#%02d#%02d' % (i, k), 826 | MyIBA(unit2, 'relu') 827 | ) 828 | 829 | def hybrid_forward(self, F, x): 830 | x = self.ed(x) 831 | y = F.concat(*x, dim=1) 832 | 833 | e = self.be(y) 834 | deep = self.ip(e) 835 | order1 = F.sum(e, axis=2) 836 | 837 | order2 = [] 838 | for i in range(self.n): 839 | for k in range(i + 1, self.n): 840 | xi, xk = x[i], x[k] 841 | fi = F.concat(xi, xi - xk, xk, dim=1) 842 | f = getattr(self, 'fc#%02d#%02d' % (i, k)) 843 | 844 | order2.append(f(fi)) 845 | order2 = F.concat(*order2, dim=1) 846 | 847 | return self.ba( 848 | F.concat(deep, order1, order2, dim=1)) 849 | 850 | class MyDFIN(nn.HybridBlock): 851 | def __init__(self, dims, inC, outC, unit): 852 | super(MyDFIN, self).__init__() 853 | 854 | with self.name_scope(): 855 | self.n = n = len(dims) 856 | self.be = MyBA('relu') 857 | self.ba = MyBA('relu') 858 | self.ip = MyIBA(unit, 'relu') 859 | self.ed = MyED(dims, inC, outC*n) 860 | 861 | for i in range(n): 862 | for k in range(i+1, n): 863 | setattr( 864 | self, 'fc#%02d#%02d'%(i,k), 865 | MyIBA(outC, 'relu') 866 | ) 867 | 868 | def hybrid_forward(self, F, x): 869 | e = self.be(self.ed(x)) 870 | 871 | deep = self.ip(e) 872 | order1 = F.sum(e, axis=2) 873 | 874 | ss, es = [], F.split(e, self.n, 1) 875 | for s in es: ss.append(F.split(s, self.n, 2)) 876 | 877 | order2 = [] 878 | for i in range(self.n): 879 | temp = [] 880 | for k in range(i+1, self.n): 881 | eik, eki = ss[i][k], ss[k][i] 882 | fi = F.concat(eik, eik-eki, eki, dim=1) 883 | f = getattr(self, 'fc#%02d#%02d' % (i,k)) 884 | 885 | temp.append(f(fi)) 886 | if len(temp) == 0: continue 887 | order2.append(F.concat(*temp, dim=1)) 888 | order2 = F.concat(*order2, dim=1) 889 | 890 | return self.ba( 891 | F.concat(deep, 
order1, order2, dim=1)) 892 | 893 | class MyDFFM(nn.HybridBlock): 894 | def __init__(self, dims, inC, outC, unit): 895 | super(MyDFFM, self).__init__() 896 | 897 | with self.name_scope(): 898 | self.n = n = len(dims) 899 | self.be = MyBA('relu') 900 | self.ed = MyED(dims, inC, outC*n) 901 | 902 | def hybrid_forward(self, F, x): 903 | e = self.be(self.ed(x)) 904 | 905 | ss, es = [], F.split(e, self.n, 1) 906 | 907 | for s in es: 908 | ss.append(F.split(s, self.n, 2))#[n*1*8,n*1*8,...] 909 | 910 | 911 | result = [] 912 | for i in range(self.n): 913 | for k in range(i, self.n): 914 | result.append(ss[i][k]*ss[k][i])#ele-wise 915 | #print(te.infer_shape(data=(2048, 590))) 916 | 917 | res = F.concat(*result, dim=1) 918 | print(res.infer_shape(data=(2048, 590))) 919 | order2=F.sum(res, axis=1) 920 | print(order2.infer_shape(data=(2048, 590))) 921 | return order2 922 | 923 | class MyDFFM2(nn.HybridBlock): 924 | def __init__(self, dims, inC, outC, unit): 925 | super(MyDFFM2, self).__init__() 926 | 927 | with self.name_scope(): 928 | self.n = n = len(dims) 929 | self.be = MyBA('relu') 930 | self.ed = MyED(dims, inC, outC*n) 931 | 932 | def hybrid_forward(self, F, x): 933 | e = self.be(self.ed(x)) 934 | 935 | ss, es = [], F.split(e, self.n, 1) 936 | for s in es: 937 | ss.append(F.split(s, self.n, 2)) 938 | 939 | result = [] 940 | for i in range(self.n): 941 | temp = [] 942 | for k in range(self.n): 943 | temp.append(ss[i][k]*ss[k][i]) 944 | temp = F.concat(*temp, dim=1) 945 | result.append(F.sum(temp, axis=1)) 946 | 947 | return F.concat(*result, dim=1) 948 | 949 | class MyDFFM3(nn.HybridBlock): 950 | def __init__(self, dims, inC, outC, unit): 951 | super(MyDFFM3, self).__init__() 952 | 953 | with self.name_scope(): 954 | self.n = n = len(dims) 955 | self.be = MyBA('relu') 956 | self.ba = MyBA('relu') 957 | self.ip = MyIBA(unit, 'relu') 958 | self.ed = MyED(dims, inC, outC*n) 959 | self.dpo = MyDPO() 960 | 961 | def hybrid_forward(self, F, x): 962 | e = self.be(self.ed(x)) 963 | 964 | deep = self.ip(e) 965 | order1 = F.sum(e, axis=2) 966 | 967 | ss, es = [], F.split(e, self.n, 1) 968 | for s in es: ss.append(F.split(s,self.n,2)) 969 | 970 | order2 = [] 971 | for i in range(self.n): 972 | temp = [] 973 | for k in range(self.n): 974 | temp.append(ss[i][k]*ss[k][i]) 975 | temp = F.concat(*temp, dim=1) 976 | order2.append(F.sum(temp, axis=1)) 977 | order2 = F.concat(*order2, dim=1) 978 | 979 | return self.ba( 980 | F.concat(deep, order1, order2, dim=1)) 981 | 982 | ################################################### 983 | 984 | class MyModel(object): 985 | def __init__(self, gpu=0): 986 | self.gpu = gpu 987 | 988 | def getNet(self, ctx): 989 | raise NotImplementedError 990 | 991 | def getModel(self): 992 | return './model' 993 | 994 | def getEpoch(self): 995 | return 1 996 | 997 | def getMetric(self): 998 | raise NotImplementedError 999 | 1000 | def getTrainer(self, params, iters): 1001 | opt = mx.optimizer.SGD( 1002 | wd=1e-3, 1003 | momentum=0.9, 1004 | learning_rate=1e-2, 1005 | lr_scheduler=mx.lr_scheduler.FactorScheduler( 1006 | iters*30, 1e-1, 1e-3 1007 | ) 1008 | ) 1009 | 1010 | return mx.gluon.Trainer(params, opt) 1011 | 1012 | def forDebug(self, out): 1013 | pass 1014 | 1015 | def train(self): 1016 | model = self.getModel() 1017 | if not os.path.exists(model): 1018 | os.mkdir(model) 1019 | 1020 | ctx = mx.cpu(self.gpu) 1021 | net, myLoss = self.getNet(ctx) 1022 | trainI, testI = self.getData(ctx) 1023 | metric, monitor = self.getMetric() 1024 | trainer = self.getTrainer( 1025 | 
net.collect_params(), trainI.iters) 1026 | 1027 | logging.info('') 1028 | result, epochs = 0, self.getEpoch() 1029 | 1030 | for epoch in range(1, epochs+1): 1031 | train_l_sum = mx.nd.array([0], ctx=ctx) 1032 | logging.info('Epoch[%04d] start ...' % epoch) 1033 | 1034 | list(map(lambda x: x.reset(), [trainI, metric])) 1035 | for batch_i, (data, label) in enumerate(trainI): 1036 | with autograd.record(): 1037 | 1038 | out = net.forward(data)#2048,590 1039 | self.forDebug(out) 1040 | loss = myLoss(out, label) 1041 | loss.backward() 1042 | # grads = [p.grad(ctx) for p in net.collect_params().values()] 1043 | # mx.gluon.utils.clip_global_norm( 1044 | # grads, .2 * 5 * data.shape[0]) 1045 | trainer.step(data.shape[0])#trainer.step(batch_size) 1046 | print('train loss:', loss.mean().asnumpy()) 1047 | ############################################### 1048 | # train_l_sum += loss.sum() / data.shape[0] 1049 | # eval_period = 1 1050 | # if batch_i % eval_period == 0 and batch_i > 0: 1051 | # cur_l = train_l_sum / eval_period 1052 | # print('epoch %d, batch %d, train loss %.6f, perplexity %.2f' 1053 | # % (epoch, batch_i, cur_l.asscalar(), 1054 | # cur_l.exp().asscalar())) 1055 | # train_l_sum = mx.nd.array([0], ctx=ctx) 1056 | ############################################# 1057 | metric.update(label, out) 1058 | for name, value in metric.get(): 1059 | logging.info('Epoch[%04d] Train-%s=%f ...', epoch, name, value) 1060 | 1061 | _result = None 1062 | list(map(lambda x: x.reset(), [testI, metric])) 1063 | l_sum = mx.nd.array([0], ctx=ctx) 1064 | n = 0 1065 | for data, label in testI: 1066 | out = net.forward(data) 1067 | l = myLoss(out, label) 1068 | l_sum += l.sum() 1069 | n += l.size 1070 | self.forDebug(out) 1071 | metric.update(label, out) 1072 | print('valid loss:', (l_sum / n).mean().asnumpy()) 1073 | for name, value in metric.get(): 1074 | if name == monitor : _result = value 1075 | logging.info('Epoch[%04d] Validation-%s=%f', epoch, name, value) 1076 | 1077 | if _result > result: 1078 | result = _result 1079 | name = '%s/%04d-%3.3f%%.params' % (model, epoch, result*100) 1080 | net.save_params(name) 1081 | logging.info('Save params to %s ...', name) 1082 | 1083 | logging.info('Epoch[%04d] done ...\n' % epoch) 1084 | 1085 | class MyPredict(object): 1086 | def __init__(self, gpu=0): 1087 | self.gpu = gpu 1088 | 1089 | def getNet(self, ctx): 1090 | raise NotImplementedError 1091 | 1092 | def preProcess(self, data): 1093 | return data 1094 | 1095 | def postProcess(self, data, pData, output): 1096 | raise NotImplementedError 1097 | 1098 | def onDone(self): 1099 | pass 1100 | 1101 | def predict(self): 1102 | ctx = mx.cpu(self.gpu) 1103 | 1104 | net = self.getNet(ctx) 1105 | testI = self.getData(ctx) 1106 | 1107 | logging.info('Start predicting ...') 1108 | 1109 | testI.reset() 1110 | for data in testI: 1111 | pData = self.preProcess(data) 1112 | out = net.forward(pData) 1113 | self.postProcess(data, pData, out) 1114 | 1115 | self.onDone() 1116 | logging.info('Prediction done ...') 1117 | 1118 | ################################################### 1119 | 1120 | class MyLoss(nn.HybridBlock): 1121 | def __init__(self): 1122 | super(MyLoss, self).__init__() 1123 | self.loss = loss.SigmoidBCELoss()#l2 1124 | 1125 | def hybrid_forward(self, F, pred, label): 1126 | sampleWeight = 20*label+(1-label) 1127 | return self.loss(pred, label) 1128 | 1129 | class MyMetric(object): 1130 | def get(self): 1131 | assert len(self.labels) != 0, 'Failed ...' 1132 | assert len(self.outputs) != 0, 'Failed ...' 
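        # update() stores each batch's labels and sigmoid scores; get() stacks them and
        # reports epoch-level AUC (sklearn roc_auc_score) plus accuracy at a 0.5
        # threshold, so the logged numbers reflect the whole epoch rather than one batch.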
1133 | 1134 | labels = np.hstack(self.labels) 1135 | outputs = np.hstack(self.outputs) 1136 | results = ( 1137 | ( 1138 | 'aucMetric', 1139 | self.myAuc(labels, outputs) 1140 | ), 1141 | ( 1142 | 'accMetric', 1143 | self.myAcc(labels, outputs) 1144 | ), 1145 | ) 1146 | 1147 | return results 1148 | 1149 | def myAuc(self, labels, outputs): 1150 | return roc_auc_score(labels, outputs) 1151 | 1152 | def myAcc(self, labels, outputs): 1153 | return np.mean(labels == (outputs>=0.5)) 1154 | 1155 | def reset(self): 1156 | self.labels = [] 1157 | self.outputs = [] 1158 | 1159 | def update(self, label, output): 1160 | label = label.asnumpy() 1161 | output = mx.nd.sigmoid(output) 1162 | output = output.asnumpy()[:,0] 1163 | 1164 | self.labels.append(label) 1165 | self.outputs.append(output) 1166 | 1167 | ################################################### 1168 | 1169 | class MyLoss2(nn.HybridBlock): 1170 | def __init__(self): 1171 | super(MyLoss2, self).__init__() 1172 | self.loss = loss.SoftmaxCELoss() 1173 | 1174 | def hybrid_forward(self, F, pred, label): 1175 | sampleWeight = 20*label+(1-label) 1176 | return self.loss(pred, label, sampleWeight) 1177 | 1178 | class MyMetric2(object): 1179 | def get(self): 1180 | assert len(self.labels) != 0, 'Failed ...' 1181 | assert len(self.outputs) != 0, 'Failed ...' 1182 | 1183 | labels = np.hstack(self.labels) 1184 | outputs = np.hstack(self.outputs) 1185 | results = ( 1186 | ( 1187 | 'aucMetric', 1188 | self.myAuc(labels, outputs) 1189 | ), 1190 | ( 1191 | 'accMetric', 1192 | self.myAcc(labels, outputs) 1193 | ), 1194 | ) 1195 | 1196 | return results 1197 | 1198 | def myAuc(self, labels, outputs): 1199 | return roc_auc_score(labels, outputs) 1200 | 1201 | def myAcc(self, labels, outputs): 1202 | return np.mean(labels == (outputs>=0.5)) 1203 | 1204 | def reset(self): 1205 | self.labels = [] 1206 | self.outputs = [] 1207 | 1208 | def update(self, label, output): 1209 | label = label.asnumpy() 1210 | output = mx.nd.softmax(output) 1211 | output = output.asnumpy()[:,1] 1212 | 1213 | self.labels.append(label) 1214 | self.outputs.append(output) 1215 | 1216 | ################################################### 1217 | 1218 | class MyLoss3(nn.HybridBlock): 1219 | def __init__(self): 1220 | super(MyLoss3, self).__init__() 1221 | self.loss1 = loss.SoftmaxCELoss() 1222 | self.loss2 = loss.SigmoidBCELoss() 1223 | 1224 | def hybrid_forward(self, F, pred, label): 1225 | sampleWeight = 20*label+(1-label) 1226 | 1227 | pred1 = F.slice_axis(pred, 1, 0, 2) 1228 | pred2 = F.slice_axis(pred, 1, 2, 3) 1229 | 1230 | loss1 = self.loss1(pred1, label, sampleWeight) 1231 | loss2 = self.loss2(pred2, label, sampleWeight) 1232 | 1233 | return 0.5 * (loss1 + loss2) 1234 | 1235 | class MyMetric3(object): 1236 | def get(self): 1237 | assert len(self.labels) != 0, 'Failed ...' 1238 | assert len(self.outputs) != 0, 'Failed ...' 
1239 | 1240 | labels = np.hstack(self.labels) 1241 | outputs = np.hstack(self.outputs) 1242 | results = ( 1243 | ( 1244 | 'aucMetric', 1245 | self.myAuc(labels, outputs) 1246 | ), 1247 | ( 1248 | 'accMetric', 1249 | self.myAcc(labels, outputs) 1250 | ), 1251 | ) 1252 | 1253 | return results 1254 | 1255 | def myAuc(self, labels, outputs): 1256 | return roc_auc_score(labels, outputs) 1257 | 1258 | def myAcc(self, labels, outputs): 1259 | return np.mean(labels == (outputs>=0.5)) 1260 | 1261 | def reset(self): 1262 | self.labels = [] 1263 | self.outputs = [] 1264 | 1265 | def update(self, label, output): 1266 | label = label.asnumpy() 1267 | 1268 | output1 = output[:, 0:2] 1269 | output2 = output[:, 2:3] 1270 | 1271 | output1 = mx.nd.softmax(output1) 1272 | output1 = output1.asnumpy()[:,1] 1273 | 1274 | output2 = mx.nd.sigmoid(output2) 1275 | output2 = output2.asnumpy()[:,0] 1276 | 1277 | output = 0.5*(output1 + output2) 1278 | 1279 | self.labels.append(label) 1280 | self.outputs.append(output) 1281 | 1282 | ################################################### 1283 | 1284 | -------------------------------------------------------------------------------- /picture/top1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ColaDrill/2018spa/24bf9f906e06ee6cf58da3f6d8248dddc164ece2/picture/top1.png -------------------------------------------------------------------------------- /picture/top3.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/ColaDrill/2018spa/24bf9f906e06ee6cf58da3f6d8248dddc164ece2/picture/top3.pdf -------------------------------------------------------------------------------- /test.py: -------------------------------------------------------------------------------- 1 | import multiprocessing #导入多进程模块,实现多进程 2 | import os #验证,打印进程的id 3 | import numpy as np #这个模块是为了实现矩阵乘法 4 | 5 | import pandas as pd 6 | 7 | from sklearn.preprocessing import OneHotEncoder,LabelEncoder 8 | # data = pd.read_csv('./datas/trainMerged.csv') 9 | # for concat_feat in data.columns: 10 | # if data[concat_feat].dtype=='float': 11 | # try: 12 | # print(concat_feat) 13 | # data[concat_feat] = LabelEncoder().fit_transform((10000000*data[concat_feat]).apply(float)) 14 | # except: 15 | # print('str???????????????????????') 16 | # data[concat_feat] = LabelEncoder().fit_transform(data[concat_feat].astype(str)) 17 | # data.to_csv('./datas/trainMerged.csv',index=False) 18 | # 19 | # 20 | # print('ok') 21 | # 22 | # temp = OrderedDict( 23 | # sorted( 24 | # count['uid'].items(), # return the tuple of dict 25 | # reverse=1, key=lambda i: i[1], 26 | # ) 27 | # ) 28 | # 29 | # df = pd.DataFrame(temp) 30 | # print(df.describe()) 31 | 32 | # 33 | # df = pd.read_csv('./submission.csv') 34 | # test = pd.read_csv('./test1.csv') 35 | # test = pd.merge(test,df,on=['uid','aid'],how='left') 36 | # test.fillna(0,inplace=True) 37 | # test.to_csv('./submission1.csv',index=False, float_format='%.8f') 38 | 39 | def read_raw_data(path, max_num): 40 | f = open(path, 'r') 41 | features = f.readline() 42 | features = features.strip().split(',') 43 | dict = {} 44 | num = 0 45 | for line in f: 46 | if num >= max_num: 47 | break 48 | datas = line.strip().split(',') 49 | for i, d in enumerate(datas): 50 | if not dict.__contains__(features[i]): 51 | dict[features[i]] = [] 52 | dict[features[i]].append(d) 53 | num += 1 54 | f.close() 55 | return dict, num 56 | 57 | nums = 1000 58 | train_dict, train_num = 
read_raw_data('C:\\Users\\Administrator\\Downloads\\TencentSPA02-PreA-master\\TencentSPA02-PreA-master\\datas\\trainMerged1.csv', nums) 59 | 60 | def count_combine_feature_times(train_data_1, train_data_2, test1_data_1, test1_data_2, test2_data_1, test2_data_2): 61 | total_dict = {} 62 | count_dict = {} 63 | for i, d in enumerate(train_data_1): 64 | xs1 = d.split(' ') 65 | xs2 = train_data_2[i].split(',') 66 | for x1 in xs1: 67 | for x2 in xs2: 68 | ke = x1+'|'+x2 69 | if not total_dict.__contains__(ke): 70 | total_dict[ke] = 0 71 | total_dict[ke] += 1 72 | for i, d in enumerate(test1_data_1): 73 | xs1 = d.split(' ') 74 | xs2 = test1_data_2[i].split(',') 75 | for x1 in xs1: 76 | for x2 in xs2: 77 | ke = x1+'|'+x2 78 | if not total_dict.__contains__(ke): 79 | total_dict[ke] = 0 80 | total_dict[ke] += 1 81 | for i, d in enumerate(test2_data_1): 82 | xs1 = d.split(' ') 83 | xs2 = test2_data_2[i].split(',') 84 | for x1 in xs1: 85 | for x2 in xs2: 86 | ke = x1+'|'+x2 87 | if not total_dict.__contains__(ke): 88 | total_dict[ke] = 0 89 | total_dict[ke] += 1 90 | for key in total_dict: 91 | if not count_dict.__contains__(total_dict[key]): 92 | count_dict[total_dict[key]] = 0 93 | count_dict[total_dict[key]] += 1 94 | 95 | train_res = [] 96 | for i, d in enumerate(train_data_1): 97 | t = [] 98 | xs1 = d.split(' ') 99 | xs2 = train_data_2[i].split(',') 100 | for x1 in xs1: 101 | for x2 in xs2: 102 | ke = x1 + '|' + x2 103 | t.append(total_dict[ke]) 104 | train_res.append(max(t)) 105 | test1_res = [] 106 | for i, d in enumerate(test1_data_1): 107 | t = [] 108 | xs1 = d.split(' ') 109 | xs2 = test1_data_2[i].split(',') 110 | for x1 in xs1: 111 | for x2 in xs2: 112 | ke = x1 + '|' + x2 113 | t.append(total_dict[ke]) 114 | test1_res.append(max(t)) 115 | test2_res = [] 116 | for i, d in enumerate(test2_data_1): 117 | t = [] 118 | xs1 = d.split(' ') 119 | xs2 = test2_data_2[i].split(',') 120 | for x1 in xs1: 121 | for x2 in xs2: 122 | ke = x1 + '|' + x2 123 | t.append(total_dict[ke]) 124 | test2_res.append(max(t)) 125 | return np.array(train_res), np.array(test1_res), np.array(test2_res), count_dict 126 | 127 | def gen_count_dict(data, labels, begin, end): 128 | total_dict = {} 129 | pos_dict = {} 130 | for i, d in enumerate(data): 131 | if i >= begin and i < end: 132 | continue 133 | xs = d.split(' ') 134 | for x in xs: 135 | if not total_dict.__contains__(x): 136 | total_dict[x] = 0 137 | if not pos_dict.__contains__(x): 138 | pos_dict[x] = 0 139 | total_dict[x] += 1 140 | if labels[i] == '1': 141 | pos_dict[x] += 1 142 | return total_dict, pos_dict 143 | 144 | def count_pos_feature(train_data, test1_data, test2_data, labels, k, test_only= False, is_val = False): 145 | nums = len(train_data) 146 | last = nums 147 | if is_val: 148 | last = nums-4739700 149 | assert last > 0 150 | interval = last // k 151 | split_points = [] 152 | for i in range(k): 153 | split_points.append(i * interval) 154 | split_points.append(last) 155 | count_train_data = train_data[0:last] 156 | count_labels = labels[0:last] 157 | 158 | train_res = [] 159 | if not test_only: 160 | for i in range(k): 161 | print( i,"part counting") 162 | print( split_points[i], split_points[i+1]) 163 | tmp = [] 164 | total_dict, pos_dict = gen_count_dict(count_train_data, count_labels, split_points[i],split_points[i+1]) 165 | for j in range(split_points[i],split_points[i+1]): 166 | xs = train_data[j].split(' ') 167 | t = [] 168 | for x in xs: 169 | if not pos_dict.__contains__(x): 170 | t.append(0) 171 | continue 172 | t.append(pos_dict[x] + 1) 173 
| tmp.append(max(t)) 174 | train_res.extend(tmp) 175 | 176 | total_dict, pos_dict = gen_count_dict(count_train_data, count_labels, 1, 0) 177 | count_dict = {-1:0} 178 | for key in pos_dict: 179 | if not count_dict.__contains__(pos_dict[key]): 180 | count_dict[pos_dict[key]] = 0 181 | count_dict[pos_dict[key]] += 1 182 | 183 | if is_val: 184 | for i in range(last, nums): 185 | xs = train_data[i].split(' ') 186 | t = [] 187 | for x in xs: 188 | if not total_dict.__contains__(x): 189 | t.append(0) 190 | continue 191 | t.append(pos_dict[x] + 1) 192 | train_res.append(max(t)) 193 | 194 | test1_res = [] 195 | for d in test1_data: 196 | xs = d.split(' ') 197 | t = [] 198 | for x in xs: 199 | if not pos_dict.__contains__(x): 200 | t.append(0) 201 | continue 202 | t.append(pos_dict[x] + 1) 203 | test1_res.append(max(t)) 204 | 205 | test2_res = [] 206 | for d in test2_data: 207 | xs = d.split(' ') 208 | t = [] 209 | for x in xs: 210 | if not pos_dict.__contains__(x): 211 | t.append(0) 212 | continue 213 | t.append(pos_dict[x] + 1) 214 | test2_res.append(max(t)) 215 | 216 | return np.array(train_res), np.array(test1_res), np.array(test2_res), count_dict 217 | 218 | def combine_to_one(data1, data2): 219 | assert len(data1) == len(data2) 220 | new_res = [] 221 | for i, d in enumerate(data1): 222 | x1 = data1[i] 223 | x2 = data2[i] 224 | new_x = x1 + '|' + x2 225 | new_res.append(new_x) 226 | return new_res 227 | 228 | def uid_seq_feature(train_data, test1_data, test2_data, label): 229 | count_dict = {}#存 该id: 追加这次出现的label 230 | seq_dict = {}#存序列字典 :该种序列出现次数 231 | seq_emb_dict = {}#存该序列的key 232 | train_seq = []#存每个学列的index 233 | ind = 0 234 | for i, d in enumerate(train_data): 235 | if not count_dict.__contains__(d): 236 | count_dict[d] = [] 237 | seq_key = ' '.join(count_dict[d][-4:]) 238 | if not seq_dict.__contains__(seq_key): 239 | seq_dict[seq_key] = 0 240 | seq_emb_dict[seq_key] = ind 241 | ind += 1 242 | seq_dict[seq_key] += 1 243 | train_seq.append(seq_emb_dict[seq_key]) 244 | count_dict[d].append(label[i]) 245 | test1_seq = [] 246 | for d in test1_data: 247 | if not count_dict.__contains__(d): 248 | seq_key = '' 249 | else: 250 | seq_key = ' '.join(count_dict[d][-4:]) 251 | if seq_emb_dict.__contains__(seq_key): 252 | key = seq_emb_dict[seq_key] 253 | else: 254 | key = 0 255 | test1_seq.append(key) 256 | test2_seq = [] 257 | for d in test2_data: 258 | if not count_dict.__contains__(d): 259 | seq_key = '' 260 | else: 261 | seq_key = ' '.join(count_dict[d][-4:]) 262 | if seq_emb_dict.__contains__(seq_key): 263 | key = seq_emb_dict[seq_key] 264 | else: 265 | key = 0 266 | test2_seq.append(key) 267 | 268 | return np.array(train_seq), np.array(test1_seq), np.array(test2_seq), seq_emb_dict 269 | 270 | 271 | #train_res, test1_res, test2_res, f_dict = uid_seq_feature(train_dict['uid'], train_dict['uid'], 272 | # train_dict['uid'], train_dict['label']) 273 | 274 | # new_train_data = combine_to_one(train_dict['uid'], train_dict['aid']) 275 | # train_res, test1_res, test2_res, f_dict = count_pos_feature(new_train_data, new_train_data, 276 | # new_train_data, train_dict['label'], 5, 277 | # is_val=False) 278 | 279 | from collections import * 280 | from tqdm import tqdm 281 | def gen_pos_neg_aid_fea(): 282 | train_data = pd.read_csv('C:\\Users\\Administrator\\Downloads\\TencentSPA02-PreA-master\\TencentSPA02-PreA-master\\datas\\trainMerged1.csv') 283 | test2_data = pd.read_csv('C:\\Users\\Administrator\\Downloads\\TencentSPA02-PreA-master\\TencentSPA02-PreA-master\\datas\\trainMerged1.csv') 284 | 285 | 
train_user = train_data.uid.unique() 286 | 287 | # user-aid dict 288 | uid_dict = defaultdict(list) 289 | for row in tqdm(train_data.itertuples(), total=len(train_data)): 290 | uid_dict[row[2]].append([row[1], row[3]]) 291 | 292 | # user convert 293 | uid_convert = {} 294 | for uid in tqdm(train_user): 295 | pos_aid, neg_aid = [], [] 296 | for data in uid_dict[uid]: 297 | if data[1] > 0: 298 | pos_aid.append(data[0]) 299 | else: 300 | neg_aid.append(data[0]) 301 | uid_convert[uid] = [pos_aid, neg_aid] 302 | 303 | test2_neg_pos_aid = {} 304 | for row in tqdm(test2_data.itertuples(), total=len(test2_data)): 305 | aid = row[1] 306 | uid = row[2] 307 | if uid_convert.get(uid, []) == []: 308 | test2_neg_pos_aid[row[0]] = ['', '', -1] 309 | else: 310 | pos_aid, neg_aid = uid_convert[uid][0].copy(), uid_convert[uid][1].copy() 311 | convert = len(pos_aid) / (len(pos_aid) + len(neg_aid)) if (len(pos_aid) + len(neg_aid)) > 0 else -1 312 | test2_neg_pos_aid[row[0]] = [' '.join(map(str, pos_aid)), ' '.join(map(str, neg_aid)), convert] 313 | df_test2 = pd.DataFrame.from_dict(data=test2_neg_pos_aid, orient='index') 314 | df_test2.columns = ['pos_aid', 'neg_aid', 'uid_convert'] 315 | 316 | train_neg_pos_aid = {} 317 | for row in tqdm(train_data.itertuples(), total=len(train_data)): 318 | aid = row[1] 319 | uid = row[2] 320 | pos_aid, neg_aid = uid_convert[uid][0].copy(), uid_convert[uid][1].copy() 321 | if aid in pos_aid: 322 | pos_aid.remove(aid) 323 | if aid in neg_aid: 324 | neg_aid.remove(aid) 325 | convert = len(pos_aid) / (len(pos_aid) + len(neg_aid)) if (len(pos_aid) + len(neg_aid)) > 0 else -1 326 | train_neg_pos_aid[row[0]] = [' '.join(map(str, pos_aid)), ' '.join(map(str, neg_aid)), convert] 327 | 328 | df_train = pd.DataFrame.from_dict(data=train_neg_pos_aid, orient='index') 329 | df_train.columns = ['pos_aid', 'neg_aid', 'uid_convert'] 330 | 331 | df_train.to_csv("dataset/train_neg_pos_aid.csv", index=False) 332 | df_test2.to_csv("dataset/test2_neg_pos_aid.csv", index=False) 333 | 334 | def gen_uid_aid_fea(): 335 | ''' 336 | 载入数据, 提取aid, uid的全局统计特征 337 | ''' 338 | train_data = pd.read_csv('C:\\Users\\Administrator\\Downloads\\TencentSPA02-PreA-master\\TencentSPA02-PreA-master\\datas\\trainMerged1.csv') 339 | test1_data = pd.read_csv('C:\\Users\\Administrator\\Downloads\\TencentSPA02-PreA-master\\TencentSPA02-PreA-master\\datas\\trainMerged1.csv') 340 | test2_data = pd.read_csv('C:\\Users\\Administrator\\Downloads\\TencentSPA02-PreA-master\\TencentSPA02-PreA-master\\datas\\trainMerged1.csv') 341 | 342 | ad_Feature = pd.read_csv('C:\\Users\\Administrator\\Downloads\\TencentSPA02-PreA-master\\TencentSPA02-PreA-master\\datas\\trainMerged1.csv') 343 | 344 | train_len = len(train_data) # 45539700 345 | test1_len = len(test1_data) 346 | test2_len = len(test2_data) # 11727304 347 | 348 | ad_Feature = pd.merge(ad_Feature, ad_Feature.groupby(['campaignId']).aid.nunique().reset_index( 349 | ).rename(columns={'aid': 'campaignId_aid_nunique'}), how='left', on='campaignId') 350 | 351 | df = pd.concat([train_data, test1_data, test2_data], axis=0) 352 | df = pd.merge(df, df.groupby(['uid'])['aid'].nunique().reset_index().rename( 353 | columns={'aid': 'uid_aid_nunique'}), how='left', on='uid') 354 | 355 | df = pd.merge(df, df.groupby(['aid'])['uid'].nunique().reset_index().rename( 356 | columns={'uid': 'aid_uid_nunique'}), how='left', on='aid') 357 | 358 | df['uid_count'] = df.groupby('uid')['aid'].transform('count') 359 | df = pd.merge(df, ad_Feature[['aid', 'campaignId_aid_nunique']], how='left', on='aid') 
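    # Global statistical features built above, counted over train + test1 + test2
    # concatenated so train and test share the same value space:
    #   campaignId_aid_nunique - how many distinct ads the ad's campaign contains
    #   uid_aid_nunique        - how many distinct ads this user appears with
    #   aid_uid_nunique        - how many distinct users this ad was shown to
    #   uid_count              - total number of rows for this user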
360 | 361 | fea_columns = ['campaignId_aid_nunique', 'uid_aid_nunique', 'aid_uid_nunique', 'uid_count', ] 362 | 363 | df[fea_columns].iloc[:train_len].to_csv('dataset/train_uid_aid.csv', index=False) 364 | df[fea_columns].iloc[train_len: train_len+test1_len].to_csv('dataset/test1_uid_aid.csv', index=False) 365 | df[fea_columns].iloc[-test2_len:].to_csv('dataset/test2_uid_aid.csv', index=False) 366 | 367 | ad_Feature = pd.read_csv('C:\\Users\\Administrator\\Downloads\\TencentSPA02-PreA-master\\TencentSPA02-PreA-master\\datas\\trainMerged1.csv') 368 | 369 | te = ad_Feature.groupby(['campaignId']) 370 | 371 | te = te.aid 372 | te = te.nunique().reset_index().rename(columns={'aid': 'campaignId_aid_nunique'}) 373 | print('ok') 374 | # ad_Feature = pd.merge(ad_Feature, ad_Feature.groupby(['campaignId']).aid.nunique().reset_index( 375 | # ).rename(columns={'aid': 'campaignId_aid_nunique'}), how='left', on='campaignId') -------------------------------------------------------------------------------- /xlearn_ffm.py: -------------------------------------------------------------------------------- 1 | # -*- coding: utf-8 -*- 2 | 3 | import xlearn as xl 4 | import pandas as pd 5 | 6 | ffm_model=xl.create_ffm() 7 | #ffm_model=xl.setTrain("ffm_train.txt") 8 | #ffm_model.setValidate("ffm_val.txt") 9 | ffm_model.setTrain("./data/train.ffm") 10 | ffm_model.setTest("./data/test.ffm") 11 | ffm_model.setSigmoid() 12 | # ffm_model.setSign() 13 | 14 | param={'task':'binary','lr':0.05,'lambda':0.00002,'metric':'auc','k':8,'epoch':22,'stop_window':3} 15 | 16 | ffm_model.fit(param,"./data/xlearn_result/model.out") 17 | ffm_model.predict("./data/xlearn_result/model.out","./data/xlearn_result/output.txt") 18 | 19 | 20 | #ffm_model.cv(param) 21 | #ffm_model.fit(param,"ffm_result.txt") 22 | 23 | result=pd.read_table("./data/xlearn_result/output.txt",header=None) 24 | test=pd.read_csv("./data/test1.csv") 25 | test['score']=result[0] 26 | test.to_csv('./data/xlearn_result/submission_xlearn.csv',index=False) --------------------------------------------------------------------------------