├── ppt
│   └── 2020数字中国创新大赛-天才海神号-算法部分.pdf
├── README.md
└── code
    └── final.py

/ppt/2020数字中国创新大赛-天才海神号-算法部分.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/fengdu78/tianchi_haiyang/HEAD/ppt/2020数字中国创新大赛-天才海神号-算法部分.pdf
--------------------------------------------------------------------------------

/README.md:
--------------------------------------------------------------------------------
# Team 天才海神号 Final Code Description

This is the code that achieved our best score on leaderboard B of the semi-final stage of the Tianchi competition "2020 Digital China Innovation Contest, Algorithm Track: Smart Ocean Construction".

Competition link: https://tianchi.aliyun.com/competition/entrance/231768/rankingList

The code is in the [code](code/) folder.

The slides are in the [ppt](ppt/) folder.

Author: Team 天才海神号

Date: March 14, 2020

## Final result

### Score
```
score = 0.89937
```

### Run log
```
{
    "eval_score": 0.8993695740342255,
    "cost_time": 867,
    "info": "null",
    "deadline": 18000,
    "score_detail": {
        "success": "true",
        "score": 0.8993695740342255,
        "scoreJson": {
            "score": 0.8993695740342255
        }
    }
}
```


## Solution overview
## Strengths of the solution
- Simple and efficient, with a low risk of overfitting.
- Clear, compact logic: about 200 lines of code, highly readable and easy to extend and reuse.
- From millions of raw records we extract only 100+ effective features, and the whole pipeline runs in about 16 minutes.
- The feature engineering covers time, space, speed, displacement, and relative-value features: comprehensive yet concise.
- Suitable for real-world usage requirements.


## The pipeline consists of 5 modules
- Data loading
- Preprocessing
- Feature engineering
- Model training and prediction
- Saving the results


## 1 Data loading
We append the round-1 (preliminary) training data to the training pool for data augmentation:
```
# get_data('/tcdata/hy_round2_train_20200225', 'train')
# get_data('/tcdata/hy_round2_testA_20200225', 'testA')
# get_data('/tcdata/hy_round2_testB_20200312', 'testB')
# get_data('/tcdata/hy_round1_train_20200102', 'train_chusai')
```

## 2 Preprocessing
For easier handling, we map the Chinese values in the raw data to numbers and English names:
```
# label_dict1 = {'拖网': 0, '围网': 1, '刺网': 2}
# label_dict2 = {0: '拖网', 1: '围网', 2: '刺网'}
# name_dict = {'渔船ID': 'id', '速度': 'v', '方向': 'dir', 'type': 'label', 'lat': 'x', 'lon': 'y'}
```

We transform the raw coordinates into approximate planar coordinates and parse the timestamps:
```
df['x'] = df['x'] * 100000 - 5630000
df['y'] = df['y'] * 110000 + 2530000
df['time'] = pd.to_datetime(df['time'].apply(lambda x: '2019-' + x[:2] + '-' + x[2:4] + ' ' + x[5:]))
```

## 3 Feature engineering
Five groups of features.

### 3.1 Binning features
These include an approximation of the distance to the coastline.
```
Quantile-bin v into 200 bins and compute per-bin statistics
Bin x two ways (1000 quantile bins and a 10000-unit grid); per bin, count rows and distinct ids
Bin y the same two ways; per bin, count rows and distinct ids
Combine the x and y bins into grid cells; per cell, count rows and distinct ids
Group by x bin: offset of y from the bin's minimum y  # an approximation of distance to the coastline
Group by y bin: offset of x from the bin's minimum x  # an approximation of distance to the coastline
```

### 3.2 Step-wise spatial displacement features
```
Group by id; for x: difference to the previous x, to the next x, and across the two-step interval
Group by id; for y: difference to the previous y, to the next y, and across the two-step interval
From these differences, compute the displacement to the previous point, to the next point, and across the two-step interval, plus relative values
```

### 3.3 Text features over spatial movement
We extract Word2Vec embeddings, which capture the sequential context of the track.
```
Group by id, treat the x-y grid-cell ids as words, and train Word2Vec (window size 20) to extract 10-dimensional embeddings
```

### 3.4 Common statistical features and relative values
```
Group by id and compute common statistics over the other columns:
'id': ['count'], 'x_bin1': [mode], 'y_bin1': [mode], 'x_bin2': [mode], 'y_bin2': [mode], 'x_y_bin1': [mode],
'x': ['mean', 'max', 'min', 'std', np.ptp, start, end],
'y': ['mean', 'max', 'min', 'std', np.ptp, start, end],
'v': ['mean', 'max', 'min', 'std', np.ptp], 'dir': ['mean'],
'x_bin1_count': ['mean'], 'y_bin1_count': ['mean', 'max', 'min'],
'x_bin2_count': ['mean', 'max', 'min'], 'y_bin2_count': ['mean', 'max', 'min'],
'x_bin1_y_bin1_count': ['mean', 'max', 'min'],
'dist_move_prev': ['mean', 'max', 'std', 'min', 'sum'],
'x_y_min': ['mean', 'min'], 'y_x_min': ['mean', 'min'],
'x_y_max': ['mean', 'min'], 'y_x_max': ['mean', 'min'],
```
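
For reference, here is a minimal, self-contained sketch on toy data (not the competition files) of how the custom `start`, `end`, and `mode` aggregators in this group combine with pandas `groupby().agg()`; the full column specification is in [code/final.py](code/final.py):
```
import numpy as np
import pandas as pd


def mode(x):
    # most frequent value in the group
    return pd.Series(x).value_counts().index[0]


def start(x):
    # first point of the track (rows are assumed sorted by time)
    return x.iloc[0]


def end(x):
    # last point of the track
    return x.iloc[-1]


toy = pd.DataFrame({'id': [1, 1, 1, 2, 2],
                    'x': [0.0, 1.0, 4.0, 2.0, 2.0]})
g = toy.groupby('id').agg({'x': ['mean', 'max', 'min', 'std', np.ptp, start, end, mode]}).reset_index()
# flatten the MultiIndex columns to names like x_mean, x_ptp, x_start
g.columns = ['_'.join(col).strip('_') for col in g.columns]
print(g)
```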

### 3.5 Trip features
```
Total displacement from the first to the last point of the track
Its ratio to the summed step distances
Convert 'dist_move_prev_bin_sen' and 'v_bin_sen' into sparse one-hot count features
```

## 4 Model training and prediction
We train a LightGBM multi-class model on the full training data, validated with stratified K-fold.

## 5 Saving the results
The predictions are formatted and saved to the required directory as the final result, result.csv.


# enjoy : )
--------------------------------------------------------------------------------

/code/final.py:
--------------------------------------------------------------------------------
import pandas as pd
import numpy as np
from lightgbm.sklearn import LGBMClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from gensim.models import Word2Vec
from scipy import sparse
from tqdm import tqdm
import os
import gc
import time
import warnings
warnings.filterwarnings('ignore')


# Chinese label <-> integer mappings and Chinese -> English column names
label_dict1 = {'拖网': 0, '围网': 1, '刺网': 2}
label_dict2 = {0: '拖网', 1: '围网', 2: '刺网'}
name_dict = {'渔船ID': 'id', '速度': 'v', '方向': 'dir', 'type': 'label', 'lat': 'x', 'lon': 'y'}


def get_data(file_path, model):
    # Concatenate all per-ship CSV files under file_path into a single
    # {model}.csv, keeping only the first file's header line.
    paths = os.listdir(file_path)
    tmp = open(f'{model}.csv', 'w', encoding='utf-8')
    for t in tqdm(range(len(paths))):
        p = paths[t]
        with open(f'{file_path}/{p}', encoding='utf-8') as f:
            if t != 0:
                next(f)
            tmp.write(f.read())
    tmp.close()


ttt = time.time()

get_data('/tcdata/hy_round2_train_20200225', 'train')
get_data('/tcdata/hy_round2_testA_20200225', 'testA')
get_data('/tcdata/hy_round2_testB_20200312', 'testB')
get_data('/tcdata/hy_round1_train_20200102', 'train_chusai')

# flag == 1 marks round-1 train and test A (used only when fitting the
# global bin statistics and embeddings); trn == 1 marks labeled training data
train = pd.read_csv('train.csv')
train['flag'] = 0
train['trn'] = 1
test = pd.read_csv('testB.csv')
test['flag'] = 0
test['trn'] = 0
testA = pd.read_csv('testA.csv')
testA['flag'] = 1
testA['trn'] = 0
train_chusai = pd.read_csv('train_chusai.csv')
train_chusai['flag'] = 1
train_chusai['trn'] = 1

print(time.time() - ttt)

train.rename(columns=name_dict, inplace=True)
test.rename(columns=name_dict, inplace=True)
testA.rename(columns=name_dict, inplace=True)
train_chusai.rename(columns=name_dict, inplace=True)

# Map round-2 lon/lat to approximate planar coordinates; the round-1 data
# is concatenated afterwards, as its coordinates are already on this scale.
df = pd.concat([train, testA, test], axis=0, ignore_index=True)
df['x'] = df['x'] * 100000 - 5630000
df['y'] = df['y'] * 110000 + 2530000
df = pd.concat([train_chusai, df], axis=0, ignore_index=True)
df['time'] = pd.to_datetime(df['time'].apply(lambda x: '2019-' + x[:2] + '-' + x[2:4] + ' ' + x[5:]))
df = df.sort_values(['id', 'time']).reset_index(drop=True)
df['label'] = df['label'].map(label_dict1)
df.loc[df['trn'] == 0, 'label'] = -1

print(time.time() - ttt)
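

# Feature group 1 (README section 3.1): binning features.
# v is cut into 200 quantile bins; x and y are each binned two ways
# (1000 quantile bins and a fixed 10000-unit grid). Every bin gets a row
# count and a distinct-id count, and within each x (or y) bin the offset
# of y (or x) from the bin's extremes approximates the distance to the
# coastline.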
df['v_bin'] = pd.qcut(df['v'], 200, duplicates='drop')
df['v_bin'] = df['v_bin'].map(dict(zip(df['v_bin'].unique(), range(df['v_bin'].nunique()))))
for f in ['x', 'y']:
    df[f + '_bin1'] = pd.qcut(df[f], 1000, duplicates='drop')
    df[f + '_bin1'] = df[f + '_bin1'].map(dict(zip(df[f + '_bin1'].unique(), range(df[f + '_bin1'].nunique()))))
    df[f + '_bin2'] = df[f] // 10000
    df[f + '_bin1_count'] = df[f + '_bin1'].map(df[f + '_bin1'].value_counts())
    df[f + '_bin2_count'] = df[f + '_bin2'].map(df[f + '_bin2'].value_counts())
    df[f + '_bin1_id_nunique'] = df.groupby(f + '_bin1')['id'].transform('nunique')
    df[f + '_bin2_id_nunique'] = df.groupby(f + '_bin2')['id'].transform('nunique')
for i in [1, 2]:
    # combined x-y grid cells and their row counts
    df['x_y_bin{}'.format(i)] = df['x_bin{}'.format(i)].astype('str') + '_' + df['y_bin{}'.format(i)].astype('str')
    df['x_y_bin{}'.format(i)] = df['x_y_bin{}'.format(i)].map(
        dict(zip(df['x_y_bin{}'.format(i)].unique(), range(df['x_y_bin{}'.format(i)].nunique())))
    )
    df['x_bin{}_y_bin{}_count'.format(i, i)] = df['x_y_bin{}'.format(i)].map(df['x_y_bin{}'.format(i)].value_counts())
for stat in ['max', 'min']:
    # offset from the bin's extremes, an approximate distance to the coastline
    df['x_y_{}'.format(stat)] = df['y'] - df.groupby('x_bin1')['y'].transform(stat)
    df['y_x_{}'.format(stat)] = df['x'] - df.groupby('y_bin1')['x'].transform(stat)

print(time.time() - ttt)

# Feature group 2 (README section 3.2): step-wise displacement features.
g = df.groupby('id')
for f in ['x', 'y']:
    df[f + '_prev_diff'] = df[f] - g[f].shift(1)
    df[f + '_next_diff'] = df[f] - g[f].shift(-1)
    df[f + '_prev_next_diff'] = g[f].shift(1) - g[f].shift(-1)
df['dist_move_prev'] = np.sqrt(np.square(df['x_prev_diff']) + np.square(df['y_prev_diff']))
df['dist_move_next'] = np.sqrt(np.square(df['x_next_diff']) + np.square(df['y_next_diff']))
df['dist_move_prev_next'] = np.sqrt(np.square(df['x_prev_next_diff']) + np.square(df['y_prev_next_diff']))
df['dist_move_prev_bin'] = pd.qcut(df['dist_move_prev'], 50, duplicates='drop')
df['dist_move_prev_bin'] = df['dist_move_prev_bin'].map(
    dict(zip(df['dist_move_prev_bin'].unique(), range(df['dist_move_prev_bin'].nunique())))
)

print(time.time() - ttt)


# Feature group 3 (README section 3.3): Word2Vec over grid-cell "sentences".
def get_loc_list(x):
    # collapse consecutive duplicates so each sentence lists successive distinct grid cells
    prev = ''
    res = []
    for loc in x:
        loc = str(loc)
        if loc != prev:
            res.append(loc)
        prev = loc
    return res


size = 10
sentence = df.groupby('id')['x_y_bin1'].agg(get_loc_list).tolist()
model = Word2Vec(sentence, size=size, window=20, min_count=1, sg=1, workers=12, iter=10)
emb = []
for w in df['x_y_bin1'].unique():
    vec = [w]
    try:
        vec.extend(model[str(w)])
    except KeyError:
        vec.extend(np.ones(size) * -size)
    emb.append(vec)
emb_df = pd.DataFrame(emb)
emb_cols = ['x_y_bin1']
for i in range(size):
    emb_cols.append('x_y_bin1_emb_{}'.format(i))
emb_df.columns = emb_cols

print(time.time() - ttt)


def start(x):
    # first value of the group, None as fallback
    try:
        return x[0]
    except:
        return None


def end(x):
    # last value of the group
    try:
        return x[-1]
    except:
        return None


def mode(x):
    # most frequent value of the group
    try:
        return pd.Series(x).value_counts().index[0]
    except:
        return None
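

# Feature groups 4-5 (README sections 3.4-3.5): per-id aggregation.
# Rows with flag == 1 (round-1 train and test A) are dropped here; they
# only contributed to the global bin statistics and Word2Vec vocabulary.
# Each remaining track is reduced to one row per id: common statistics,
# the start/end/mode aggregators above, and the binned speed and
# step-distance sequences joined into comma-separated "sentences" for
# CountVectorizer.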
df = df[df['flag'] == 0].reset_index(drop=True)
for f in ['dist_move_prev_bin', 'v_bin']:
    df[f + '_sen'] = df['id'].map(df.groupby('id')[f].agg(lambda x: ','.join(x.astype(str))))
g = df.groupby('id').agg({
    'id': ['count'], 'x_bin1': [mode], 'y_bin1': [mode], 'x_bin2': [mode], 'y_bin2': [mode], 'x_y_bin1': [mode],
    'x': ['mean', 'max', 'min', 'std', np.ptp, start, end],
    'y': ['mean', 'max', 'min', 'std', np.ptp, start, end],
    'v': ['mean', 'max', 'min', 'std', np.ptp], 'dir': ['mean'],
    'x_bin1_count': ['mean'], 'y_bin1_count': ['mean', 'max', 'min'],
    'x_bin2_count': ['mean', 'max', 'min'], 'y_bin2_count': ['mean', 'max', 'min'],
    'x_bin1_y_bin1_count': ['mean', 'max', 'min'],
    'dist_move_prev': ['mean', 'max', 'std', 'min', 'sum'],
    'x_y_min': ['mean', 'min'], 'y_x_min': ['mean', 'min'],
    'x_y_max': ['mean', 'min'], 'y_x_max': ['mean', 'min'],
}).reset_index()
g.columns = ['_'.join(col).strip() for col in g.columns]
g.rename(columns={'id_': 'id'}, inplace=True)
cols = [f for f in g.keys() if f != 'id']

print(time.time() - ttt)

# One row per id: labels, sentence features, aggregated statistics, trip
# features, and grid-cell embeddings, stacked into one sparse matrix with
# the CountVectorizer features. Sorting by label puts test rows (-1) first.
df = df.drop_duplicates('id')[['id', 'label', 'dist_move_prev_bin_sen', 'v_bin_sen']].sort_values('id').reset_index(drop=True)
df = df.sort_values('label').reset_index(drop=True)
sub = df[df['label'] == -1].reset_index(drop=True)[['id']]
test_num = sub.shape[0]
labels = df[df['label'] != -1]['label'].values
df = df.merge(g, on='id', how='left')
df[cols] = df[cols].astype('float32')
# straight-line displacement from the first to the last point of the track
df['dist_total'] = np.sqrt(np.square(df['x_end'] - df['x_start']) + np.square(df['y_end'] - df['y_start']))
df['dist_rate'] = df['dist_total'] / (df['dist_move_prev_sum'] + 1e-8)
df = df.merge(emb_df, left_on='x_y_bin1_mode', right_on='x_y_bin1', how='left')
df_values = sparse.csr_matrix(df[cols + emb_cols[1:] + ['dist_total', 'dist_rate']].values)
for f in ['dist_move_prev_bin_sen', 'v_bin_sen']:
    cv = CountVectorizer(min_df=10).fit_transform(df[f].values)
    df_values = sparse.hstack((df_values, cv), 'csr')
test_values, train_values = df_values[:test_num], df_values[test_num:]
del df, df_values
gc.collect()

print(time.time() - ttt)


def f1(y_true, y_pred):
    # custom LightGBM eval: reshape the flat class-major predictions to
    # (n_samples, 3) and return macro F1 (higher is better)
    y_pred = np.transpose(np.reshape(y_pred, [3, -1]))
    return 'f1', f1_score(y_true, np.argmax(y_pred, axis=1), average='macro'), True


# Model training and prediction (README section 4): stratified 5-fold
# LightGBM with early stopping on macro F1; test predictions averaged over folds.
print(train_values.shape, test_values.shape)
test_pred = np.zeros((test_values.shape[0], 3))
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2020)
clf = LGBMClassifier(
    learning_rate=0.05,
    n_estimators=20000,
    num_leaves=63,
    subsample_freq=1,
    subsample=0.9,
    colsample_bytree=0.4,
    min_child_samples=10,
    random_state=2020,
    class_weight='balanced',
    metric='None'
)
for i, (trn_idx, val_idx) in enumerate(skf.split(train_values, labels)):
    trn_x, trn_y = train_values[trn_idx], labels[trn_idx]
    val_x, val_y = train_values[val_idx], labels[val_idx]
    clf.fit(
        trn_x, trn_y,
        eval_set=[(val_x, val_y)],
        eval_metric=f1,
        early_stopping_rounds=100,
        verbose=100
    )
    test_pred += clf.predict_proba(test_values) / skf.n_splits

# Save the results (README section 5): map labels back to Chinese and
# write the submission file.
sub['id'] = sub['id'].astype('int32')
sub['label'] = np.argmax(test_pred, axis=1)
sub['label'] = sub['label'].map(label_dict2)
sub = sub.sort_values('id').reset_index(drop=True)
sub.to_csv('result.csv', index=False, header=False)
--------------------------------------------------------------------------------