├── ppt
│   └── 2020数字中国创新大赛-天才海神号-算法部分.pdf
├── README.md
└── code
    └── final.py

/ppt/2020数字中国创新大赛-天才海神号-算法部分.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fengdu78/tianchi_haiyang/HEAD/ppt/2020数字中国创新大赛-天才海神号-算法部分.pdf
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------
# Team 天才海神号 Final Code Description

This is the code that achieved our best score on leaderboard B of the semi-final stage of the Tianchi competition "2020 Digital China Innovation Contest, Algorithm Track: Smart Ocean Construction".

Competition link: https://tianchi.aliyun.com/competition/entrance/231768/rankingList

The code is in the [code](code/) folder.

The slides are in the [ppt](ppt/) folder.

Author: Team 天才海神号

Date: March 14, 2020

## Final result

### Score
```
score = 0.89937
```

### Run log
```
{
    "eval_score": 0.8993695740342255,
    "cost_time": 867,
    "info": "null",
    "deadline": 18000,
    "score_detail": {
        "success": "true",
        "score": 0.8993695740342255,
        "scoreJson": {
            "score": 0.8993695740342255
        }
    }
}
```


## Solution overview
## Strengths of the solution
- Simple and efficient, with a low risk of overfitting.
- Clear, compact logic: about 200 lines of code, highly readable and easy to extend and reuse.
- From millions of raw records we extract only 100+ effective features, and the whole pipeline runs in about 16 minutes.
- The feature engineering covers time, space, speed, displacement, and relative-value features: comprehensive yet concise.
- Suitable for real-world usage requirements.


## The pipeline consists of 5 modules
- Data loading
- Preprocessing
- Feature engineering
- Model training and prediction
- Saving the results


## 1 Data loading
We append the round-1 (preliminary) training data to the training pool for data augmentation:
```
# get_data('/tcdata/hy_round2_train_20200225', 'train')
# get_data('/tcdata/hy_round2_testA_20200225', 'testA')
# get_data('/tcdata/hy_round2_testB_20200312', 'testB')
# get_data('/tcdata/hy_round1_train_20200102', 'train_chusai')
```

## 2 Preprocessing
For easier handling, we map the Chinese values in the raw data to numbers and English names:
```
# label_dict1 = {'拖网': 0, '围网': 1, '刺网': 2}
# label_dict2 = {0: '拖网', 1: '围网', 2: '刺网'}
# name_dict = {'渔船ID': 'id', '速度': 'v', '方向': 'dir', 'type': 'label', 'lat': 'x', 'lon': 'y'}
```

We transform the raw coordinates into approximate planar coordinates and parse the timestamps:
```
df['x'] = df['x'] * 100000 - 5630000
df['y'] = df['y'] * 110000 + 2530000
df['time'] = pd.to_datetime(df['time'].apply(lambda x: '2019-' + x[:2] + '-' + x[2:4] + ' ' + x[5:]))
```

## 3 Feature engineering
Five groups of features.

### 3.1 Binning features
These include an approximation of the distance to the coastline.
```
Quantile-bin v into 200 bins and compute per-bin statistics
Bin x two ways (1000 quantile bins and a 10000-unit grid); per bin, count rows and distinct ids
Bin y the same two ways; per bin, count rows and distinct ids
Combine the x and y bins into grid cells; per cell, count rows and distinct ids
Group by x bin: offset of y from the bin's minimum y  # an approximation of distance to the coastline
Group by y bin: offset of x from the bin's minimum x  # an approximation of distance to the coastline
```

### 3.2 Step-wise spatial displacement features
```
Group by id; for x: difference to the previous x, to the next x, and across the two-step interval
Group by id; for y: difference to the previous y, to the next y, and across the two-step interval
From these differences, compute the displacement to the previous point, to the next point, and across the two-step interval, plus relative values
```

### 3.3 Text features over spatial movement
We extract Word2Vec embeddings, which capture the sequential context of the track.
```
Group by id, treat the x-y grid-cell ids as words, and train Word2Vec (window size 20) to extract 10-dimensional embeddings
```

### 3.4 Common statistical features and relative values
```
Group by id and compute common statistics over the other columns:
'id': ['count'], 'x_bin1': [mode], 'y_bin1': [mode], 'x_bin2': [mode], 'y_bin2': [mode], 'x_y_bin1': [mode],
'x': ['mean', 'max', 'min', 'std', np.ptp, start, end],
'y': ['mean', 'max', 'min', 'std', np.ptp, start, end],
'v': ['mean', 'max', 'min', 'std', np.ptp], 'dir': ['mean'],
'x_bin1_count': ['mean'], 'y_bin1_count': ['mean', 'max', 'min'],
'x_bin2_count': ['mean', 'max', 'min'], 'y_bin2_count': ['mean', 'max', 'min'],
'x_bin1_y_bin1_count': ['mean', 'max', 'min'],
'dist_move_prev': ['mean', 'max', 'std', 'min', 'sum'],
'x_y_min': ['mean', 'min'], 'y_x_min': ['mean', 'min'],
'x_y_max': ['mean', 'min'], 'y_x_max': ['mean', 'min'],
```
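
For reference, here is a minimal, self-contained sketch on toy data (not the competition files) of how the custom `start`, `end`, and `mode` aggregators in this group combine with pandas `groupby().agg()`; the full column specification is in [code/final.py](code/final.py):
```
import numpy as np
import pandas as pd


def mode(x):
    # most frequent value in the group
    return pd.Series(x).value_counts().index[0]


def start(x):
    # first point of the track (rows are assumed sorted by time)
    return x.iloc[0]


def end(x):
    # last point of the track
    return x.iloc[-1]


toy = pd.DataFrame({'id': [1, 1, 1, 2, 2],
                    'x': [0.0, 1.0, 4.0, 2.0, 2.0]})
g = toy.groupby('id').agg({'x': ['mean', 'max', 'min', 'std', np.ptp, start, end, mode]}).reset_index()
# flatten the MultiIndex columns to names like x_mean, x_ptp, x_start
g.columns = ['_'.join(col).strip('_') for col in g.columns]
print(g)
```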

### 3.5 Trip features
```
Total displacement from the first to the last point of the track
Its ratio to the summed step distances
Convert 'dist_move_prev_bin_sen' and 'v_bin_sen' into sparse one-hot count features
```

## 4 Model training and prediction
We train a LightGBM multi-class model on the full training data, validated with stratified K-fold.

## 5 Saving the results
The predictions are formatted and saved to the required directory as the final result, result.csv.


# enjoy : )
--------------------------------------------------------------------------------

/code/final.py:
--------------------------------------------------------------------------------
import pandas as pd
import numpy as np
from lightgbm.sklearn import LGBMClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from gensim.models import Word2Vec
from scipy import sparse
from tqdm import tqdm
import os
import gc
import time
import warnings
warnings.filterwarnings('ignore')


# Chinese label <-> integer mappings and Chinese -> English column names
label_dict1 = {'拖网': 0, '围网': 1, '刺网': 2}
label_dict2 = {0: '拖网', 1: '围网', 2: '刺网'}
name_dict = {'渔船ID': 'id', '速度': 'v', '方向': 'dir', 'type': 'label', 'lat': 'x', 'lon': 'y'}


def get_data(file_path, model):
    # Concatenate all per-ship CSV files under file_path into a single
    # {model}.csv, keeping only the first file's header line.
    paths = os.listdir(file_path)
    tmp = open(f'{model}.csv', 'w', encoding='utf-8')
    for t in tqdm(range(len(paths))):
        p = paths[t]
        with open(f'{file_path}/{p}', encoding='utf-8') as f:
            if t != 0:
                next(f)
            tmp.write(f.read())
    tmp.close()


ttt = time.time()

get_data('/tcdata/hy_round2_train_20200225', 'train')
get_data('/tcdata/hy_round2_testA_20200225', 'testA')
get_data('/tcdata/hy_round2_testB_20200312', 'testB')
get_data('/tcdata/hy_round1_train_20200102', 'train_chusai')

# flag == 1 marks round-1 train and test A (used only when fitting the
# global bin statistics and embeddings); trn == 1 marks labeled training data
train = pd.read_csv('train.csv')
train['flag'] = 0
train['trn'] = 1
test = pd.read_csv('testB.csv')
test['flag'] = 0
test['trn'] = 0
testA = pd.read_csv('testA.csv')
testA['flag'] = 1
testA['trn'] = 0
train_chusai = pd.read_csv('train_chusai.csv')
train_chusai['flag'] = 1
train_chusai['trn'] = 1

print(time.time() - ttt)

train.rename(columns=name_dict, inplace=True)
test.rename(columns=name_dict, inplace=True)
testA.rename(columns=name_dict, inplace=True)
train_chusai.rename(columns=name_dict, inplace=True)

# Map round-2 lon/lat to approximate planar coordinates; the round-1 data
# is concatenated afterwards, as its coordinates are already on this scale.
df = pd.concat([train, testA, test], axis=0, ignore_index=True)
df['x'] = df['x'] * 100000 - 5630000
df['y'] = df['y'] * 110000 + 2530000
df = pd.concat([train_chusai, df], axis=0, ignore_index=True)
df['time'] = pd.to_datetime(df['time'].apply(lambda x: '2019-' + x[:2] + '-' + x[2:4] + ' ' + x[5:]))
df = df.sort_values(['id', 'time']).reset_index(drop=True)
df['label'] = df['label'].map(label_dict1)
df.loc[df['trn'] == 0, 'label'] = -1

print(time.time() - ttt)
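

# Feature group 1 (README section 3.1): binning features.
# v is cut into 200 quantile bins; x and y are each binned two ways
# (1000 quantile bins and a fixed 10000-unit grid). Every bin gets a row
# count and a distinct-id count, and within each x (or y) bin the offset
# of y (or x) from the bin's extremes approximates the distance to the
# coastline.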
df['v_bin'] = pd.qcut(df['v'], 200, duplicates='drop')
df['v_bin'] = df['v_bin'].map(dict(zip(df['v_bin'].unique(), range(df['v_bin'].nunique()))))
for f in ['x', 'y']:
    df[f + '_bin1'] = pd.qcut(df[f], 1000, duplicates='drop')
    df[f + '_bin1'] = df[f + '_bin1'].map(dict(zip(df[f + '_bin1'].unique(), range(df[f + '_bin1'].nunique()))))
    df[f + '_bin2'] = df[f] // 10000
    df[f + '_bin1_count'] = df[f + '_bin1'].map(df[f + '_bin1'].value_counts())
    df[f + '_bin2_count'] = df[f + '_bin2'].map(df[f + '_bin2'].value_counts())
    df[f + '_bin1_id_nunique'] = df.groupby(f + '_bin1')['id'].transform('nunique')
    df[f + '_bin2_id_nunique'] = df.groupby(f + '_bin2')['id'].transform('nunique')
for i in [1, 2]:
    # combined x-y grid cells and their row counts
    df['x_y_bin{}'.format(i)] = df['x_bin{}'.format(i)].astype('str') + '_' + df['y_bin{}'.format(i)].astype('str')
    df['x_y_bin{}'.format(i)] = df['x_y_bin{}'.format(i)].map(
        dict(zip(df['x_y_bin{}'.format(i)].unique(), range(df['x_y_bin{}'.format(i)].nunique())))
    )
    df['x_bin{}_y_bin{}_count'.format(i, i)] = df['x_y_bin{}'.format(i)].map(df['x_y_bin{}'.format(i)].value_counts())
for stat in ['max', 'min']:
    # offset from the bin's extremes, an approximate distance to the coastline
    df['x_y_{}'.format(stat)] = df['y'] - df.groupby('x_bin1')['y'].transform(stat)
    df['y_x_{}'.format(stat)] = df['x'] - df.groupby('y_bin1')['x'].transform(stat)

print(time.time() - ttt)

# Feature group 2 (README section 3.2): step-wise displacement features.
g = df.groupby('id')
for f in ['x', 'y']:
    df[f + '_prev_diff'] = df[f] - g[f].shift(1)
    df[f + '_next_diff'] = df[f] - g[f].shift(-1)
    df[f + '_prev_next_diff'] = g[f].shift(1) - g[f].shift(-1)
df['dist_move_prev'] = np.sqrt(np.square(df['x_prev_diff']) + np.square(df['y_prev_diff']))
df['dist_move_next'] = np.sqrt(np.square(df['x_next_diff']) + np.square(df['y_next_diff']))
df['dist_move_prev_next'] = np.sqrt(np.square(df['x_prev_next_diff']) + np.square(df['y_prev_next_diff']))
df['dist_move_prev_bin'] = pd.qcut(df['dist_move_prev'], 50, duplicates='drop')
df['dist_move_prev_bin'] = df['dist_move_prev_bin'].map(
    dict(zip(df['dist_move_prev_bin'].unique(), range(df['dist_move_prev_bin'].nunique())))
)

print(time.time() - ttt)


# Feature group 3 (README section 3.3): Word2Vec over grid-cell "sentences".
def get_loc_list(x):
    # collapse consecutive duplicates so each sentence lists successive distinct grid cells
    prev = ''
    res = []
    for loc in x:
        loc = str(loc)
        if loc != prev:
            res.append(loc)
        prev = loc
    return res


size = 10
sentence = df.groupby('id')['x_y_bin1'].agg(get_loc_list).tolist()
model = Word2Vec(sentence, size=size, window=20, min_count=1, sg=1, workers=12, iter=10)
emb = []
for w in df['x_y_bin1'].unique():
    vec = [w]
    try:
        vec.extend(model[str(w)])
    except KeyError:
        vec.extend(np.ones(size) * -size)
    emb.append(vec)
emb_df = pd.DataFrame(emb)
emb_cols = ['x_y_bin1']
for i in range(size):
    emb_cols.append('x_y_bin1_emb_{}'.format(i))
emb_df.columns = emb_cols

print(time.time() - ttt)


def start(x):
    # first value of the group, None as fallback
    try:
        return x[0]
    except:
        return None


def end(x):
    # last value of the group
    try:
        return x[-1]
    except:
        return None


def mode(x):
    # most frequent value of the group
    try:
        return pd.Series(x).value_counts().index[0]
    except:
        return None
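

# Feature groups 4-5 (README sections 3.4-3.5): per-id aggregation.
# Rows with flag == 1 (round-1 train and test A) are dropped here; they
# only contributed to the global bin statistics and Word2Vec vocabulary.
# Each remaining track is reduced to one row per id: common statistics,
# the start/end/mode aggregators above, and the binned speed and
# step-distance sequences joined into comma-separated "sentences" for
# CountVectorizer.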
df = df[df['flag'] == 0].reset_index(drop=True)
for f in ['dist_move_prev_bin', 'v_bin']:
    df[f + '_sen'] = df['id'].map(df.groupby('id')[f].agg(lambda x: ','.join(x.astype(str))))
g = df.groupby('id').agg({
    'id': ['count'], 'x_bin1': [mode], 'y_bin1': [mode], 'x_bin2': [mode], 'y_bin2': [mode], 'x_y_bin1': [mode],
    'x': ['mean', 'max', 'min', 'std', np.ptp, start, end],
    'y': ['mean', 'max', 'min', 'std', np.ptp, start, end],
    'v': ['mean', 'max', 'min', 'std', np.ptp], 'dir': ['mean'],
    'x_bin1_count': ['mean'], 'y_bin1_count': ['mean', 'max', 'min'],
    'x_bin2_count': ['mean', 'max', 'min'], 'y_bin2_count': ['mean', 'max', 'min'],
    'x_bin1_y_bin1_count': ['mean', 'max', 'min'],
    'dist_move_prev': ['mean', 'max', 'std', 'min', 'sum'],
    'x_y_min': ['mean', 'min'], 'y_x_min': ['mean', 'min'],
    'x_y_max': ['mean', 'min'], 'y_x_max': ['mean', 'min'],
}).reset_index()
g.columns = ['_'.join(col).strip() for col in g.columns]
g.rename(columns={'id_': 'id'}, inplace=True)
cols = [f for f in g.keys() if f != 'id']

print(time.time() - ttt)

# One row per id: labels, sentence features, aggregated statistics, trip
# features, and grid-cell embeddings, stacked into one sparse matrix with
# the CountVectorizer features. Sorting by label puts test rows (-1) first.
df = df.drop_duplicates('id')[['id', 'label', 'dist_move_prev_bin_sen', 'v_bin_sen']].sort_values('id').reset_index(drop=True)
df = df.sort_values('label').reset_index(drop=True)
sub = df[df['label'] == -1].reset_index(drop=True)[['id']]
test_num = sub.shape[0]
labels = df[df['label'] != -1]['label'].values
df = df.merge(g, on='id', how='left')
df[cols] = df[cols].astype('float32')
# straight-line displacement from the first to the last point of the track
df['dist_total'] = np.sqrt(np.square(df['x_end'] - df['x_start']) + np.square(df['y_end'] - df['y_start']))
df['dist_rate'] = df['dist_total'] / (df['dist_move_prev_sum'] + 1e-8)
df = df.merge(emb_df, left_on='x_y_bin1_mode', right_on='x_y_bin1', how='left')
df_values = sparse.csr_matrix(df[cols + emb_cols[1:] + ['dist_total', 'dist_rate']].values)
for f in ['dist_move_prev_bin_sen', 'v_bin_sen']:
    cv = CountVectorizer(min_df=10).fit_transform(df[f].values)
    df_values = sparse.hstack((df_values, cv), 'csr')
test_values, train_values = df_values[:test_num], df_values[test_num:]
del df, df_values
gc.collect()

print(time.time() - ttt)


def f1(y_true, y_pred):
    # custom LightGBM eval: reshape the flat class-major predictions to
    # (n_samples, 3) and return macro F1 (higher is better)
    y_pred = np.transpose(np.reshape(y_pred, [3, -1]))
    return 'f1', f1_score(y_true, np.argmax(y_pred, axis=1), average='macro'), True


# Model training and prediction (README section 4): stratified 5-fold
# LightGBM with early stopping on macro F1; test predictions averaged over folds.
print(train_values.shape, test_values.shape)
test_pred = np.zeros((test_values.shape[0], 3))
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2020)
clf = LGBMClassifier(
    learning_rate=0.05,
    n_estimators=20000,
    num_leaves=63,
    subsample_freq=1,
    subsample=0.9,
    colsample_bytree=0.4,
    min_child_samples=10,
    random_state=2020,
    class_weight='balanced',
    metric='None'
)
for i, (trn_idx, val_idx) in enumerate(skf.split(train_values, labels)):
    trn_x, trn_y = train_values[trn_idx], labels[trn_idx]
    val_x, val_y = train_values[val_idx], labels[val_idx]
    clf.fit(
        trn_x, trn_y,
        eval_set=[(val_x, val_y)],
        eval_metric=f1,
        early_stopping_rounds=100,
        verbose=100
    )
    test_pred += clf.predict_proba(test_values) / skf.n_splits

# Save the results (README section 5): map labels back to Chinese and
# write the submission file.
sub['id'] = sub['id'].astype('int32')
sub['label'] = np.argmax(test_pred, axis=1)
sub['label'] = sub['label'].map(label_dict2)
sub = sub.sort_values('id').reset_index(drop=True)
sub.to_csv('result.csv', index=False, header=False)
--------------------------------------------------------------------------------