├── README
├── add_dummy_label.py
├── bag
│   ├── converter
│   │   ├── 6.py
│   │   ├── common.py
│   │   └── group6.py
│   ├── mark
│   │   ├── Makefile
│   │   ├── README
│   │   └── src
│   │       ├── common.cpp
│   │       ├── common.h
│   │       ├── timer.cpp
│   │       ├── timer.h
│   │       └── train.cpp
│   ├── run.sh
│   ├── run
│   │   └── 6.py
│   └── util
│       ├── cat_id_click.py
│       ├── cat_submit.py
│       ├── common.py
│       ├── count_feat.py
│       ├── join_data.py
│       ├── parallel_do.py
│       └── parallelizer.py
├── base
│   ├── converter
│   │   ├── 2.py
│   │   └── common.py
│   ├── mark
│   │   └── mark1
│   │       ├── Makefile
│   │       ├── README
│   │       └── src
│   │           ├── common.cpp
│   │           ├── common.h
│   │           ├── timer.cpp
│   │           ├── timer.h
│   │           └── train.cpp
│   ├── run.py
│   ├── run
│   │   ├── app.py
│   │   └── site.py
│   └── util
│       ├── common.py
│       ├── gen_data.py
│       ├── merge_prediction.py
│       ├── parallelizer.py
│       ├── pickle_prediction.py
│       └── unpickle_prediction.py
├── ensemble
│   ├── mark
│   │   └── mark1
│   │       ├── Makefile
│   │       ├── README
│   │       └── src
│   │           ├── common.cpp
│   │           ├── common.h
│   │           ├── timer.cpp
│   │           ├── timer.h
│   │           └── train.cpp
│   ├── model
│   │   ├── app.id
│   │   │   ├── cvt.py
│   │   │   ├── data
│   │   │   ├── mark
│   │   │   ├── run.sh
│   │   │   └── util
│   │   ├── app.ip
│   │   │   ├── cvt.py
│   │   │   ├── data
│   │   │   ├── mark
│   │   │   ├── run.sh
│   │   │   └── util
│   │   ├── app
│   │   │   ├── cvt.py
│   │   │   ├── data
│   │   │   ├── mark
│   │   │   ├── run.sh
│   │   │   └── util
│   │   ├── app_category-0f2161f8
│   │   │   ├── cvt.py
│   │   │   ├── data
│   │   │   ├── mark
│   │   │   ├── run.sh
│   │   │   └── util
│   │   ├── app_id-92f5800b
│   │   │   ├── cvt.py
│   │   │   ├── data
│   │   │   ├── mark
│   │   │   ├── run.sh
│   │   │   └── util
│   │   ├── banner_pos-1
│   │   │   ├── cvt.py
│   │   │   ├── data
│   │   │   ├── mark
│   │   │   ├── run.sh
│   │   │   └── util
│   │   ├── device_conn_type-3
│   │   │   ├── cvt.py
│   │   │   ├── data
│   │   │   ├── mark
│   │   │   ├── run.sh
│   │   │   └── util
│   │   ├── site.cold_feature
│   │   │   ├── cvt.py
│   │   │   ├── data
│   │   │   ├── mark
│   │   │   ├── run.sh
│   │   │   └── util
│   │   ├── site.exd1d2
│   │   │   ├── cvt.py
│   │   │   ├── data
│   │   │   ├── mark
│   │   │   ├── run.sh
│   │   │   └── util
│   │   ├── site.id
│   │   │   ├── cvt.py
│   │   │   ├── data
│   │   │   ├── mark
│   │   │   ├── run.sh
│   │   │   └── util
│   │   ├── site.ip
│   │   │   ├── cvt.py
│   │   │   ├── data
│   │   │   ├── mark
│   │   │   ├── run.sh
│   │   │   └── util
│   │   ├── site
│   │   │   ├── cvt.py
│   │   │   ├── data
│   │   │   ├── mark
│   │   │   ├── run.sh
│   │   │   └── util
│   │   ├── site_category-3e814130
│   │   │   ├── cvt.py
│   │   │   ├── data
│   │   │   ├── mark
│   │   │   ├── run.sh
│   │   │   └── util
│   │   ├── site_category-f028772b
│   │   │   ├── cvt.py
│   │   │   ├── data
│   │   │   ├── mark
│   │   │   ├── run.sh
│   │   │   └── util
│   │   ├── site_domain-7e091613
│   │   │   ├── cvt.py
│   │   │   ├── data
│   │   │   ├── mark
│   │   │   ├── run.sh
│   │   │   └── util
│   │   └── site_id-e151e245
│   │       ├── cvt.py
│   │       ├── data
│   │       ├── mark
│   │       ├── run.sh
│   │       └── util
│   ├── run.sh
│   └── util
│       ├── calc_loss.py
│       ├── calc_loss2.py
│       ├── common.py
│       ├── ensemble.py
│       ├── gendata.py
│       ├── merge_prd.py
│       ├── mkprd.py
│       ├── parallelizer.py
│       ├── run.template.py
│       ├── runall.py
│       └── subset.py
├── license.txt
├── run.sh
├── run_all.sh
├── tr.rx.csv
└── va.rx.csv
--------------------------------------------------------------------------------
/README:
--------------------------------------------------------------------------------
4 Idiots' Approach for Click-through Rate Prediction
====================================================

Our team consists of:

Name              Kaggle ID         Affiliation
====================================================================
Yu-Chin Juan      guestwalk         National Taiwan University (NTU)
Wei-Sheng Chin    mandora           National Taiwan University (NTU)
Yong Zhuang       yolicat           National Taiwan University (NTU)
Michael Jahrer    Michael Jahrer    Opera Solutions

Our final model is an ensemble of NTU's model and Michael's model. Michael's
model is based on his work at Opera Solutions, so he cannot release his part.
Therefore, in the code and documents we present only NTU's model.

This README explains how to run our code. For an introduction to our
approach, please see

    http://www.csie.ntu.edu.tw/~r01922136/slides/kaggle-avazu.pdf

The model we use for this competition is called `field-aware factorization
machines.' We have released a package for this model at:

    http://www.csie.ntu.edu.tw/~r01922136/libffm



System Requirements
===================

- 64-bit Unix-like operating system

- Python 3

- g++ (with C++11 and OpenMP support)

- pandas (required only if you want to run the `bag' part. See `Step-by-step'
  below.)



Step-by-step
============

Our solution is an ensemble of 20 models. It is organized into the following
three parts:

    name        public score    private score    description
    ===========================================================================
    base        0.3832          0.3813           2 basic models

    bag         0.3826          0.3807           2 models using bag features

    ensemble    0.3817          0.3797           an ensemble of the above 4
                                                 models and 16 new small models

Because the `bag' part consumes a huge amount of memory (more than 64GB) and
the `ensemble' part takes a long time to run, these instructions guide you
through running our `base' part first. If you want to reproduce our best
result, please run the commands in the final step on a suitable machine.


1. First, use the following command to run a tiny end-to-end example

     $ ./run.sh x

2. Create a symbolic link to the training set

     $ ln -sf <path to train.csv> tr.r0.csv

3. Add a dummy label to the test set

     $ ./add_dummy_label.py <path to test.csv> va.r0.csv

4. Verify the checksums

     $ md5sum tr.r0.csv va.r0.csv
     f5d49ff28f41dc993b9ecb2372abb033  tr.r0.csv
     6edd380a5897bc16b61c5a626062f7b3  va.r0.csv

5. Reproduce our base submission

     $ ./run.sh 0

   Note: base.r0.prd is the submission file.

6. (optional) Reproduce our best submission

     $ ./run_all.sh x

   If it succeeds, then run

     $ ./run_all.sh 0

   Note: The algorithm in the `bag' part is non-deterministic, so the result
   can differ slightly from run to run.



==============

If you want to trace this code, please be prepared for it to take some
effort. We did not have enough time to polish the code here to improve its
readability. Sorry about that.
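
For reference, the prediction rule of a field-aware factorization machine can
be sketched in a few lines of Python. This is an illustration only (the
weights w and the feature triples below are hypothetical), not the optimized
C++ solvers shipped in the `mark' directories:

    import math

    def ffm_phi(feats, w):
        # feats: list of (field, feature, value) triples of one instance
        # w[j][f]: latent vector of feature j paired against field f
        phi = 0.0
        for a, (f1, j1, v1) in enumerate(feats):
            for f2, j2, v2 in feats[a+1:]:
                # field-aware: j1 uses its vector for j2's field, and
                # j2 uses its vector for j1's field
                dot = sum(x*y for x, y in zip(w[j1][f2], w[j2][f1]))
                phi += dot*v1*v2
        return phi

    def predict_ctr(feats, w):
        # squash through the logistic link to get a click probability
        return 1.0/(1.0 + math.exp(-ffm_phi(feats, w)))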
--------------------------------------------------------------------------------
/add_dummy_label.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3

import argparse, csv

parser = argparse.ArgumentParser(description='add a dummy click column to a csv file')
parser.add_argument('csv_path', type=str, nargs=1, help='path to the input csv file')
parser.add_argument('out_path', type=str, nargs=1, help='path to the output csv file')
args = parser.parse_args()

CSV_PATH, OUT_PATH = args.csv_path[0], args.out_path[0]

f = csv.writer(open(OUT_PATH, 'w'))
for i, row in enumerate(csv.reader(open(CSV_PATH))):
    if i == 0:
        row.insert(1, 'click')   # header row: insert the column name
    else:
        row.insert(1, '0')       # data rows: insert a dummy label
    f.writerow(row)
--------------------------------------------------------------------------------
/bag/converter/6.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3

import argparse, csv, sys, math

from common import *

if len(sys.argv) == 1:
    sys.argv.append('-h')

parser = argparse.ArgumentParser()
parser.add_argument('tr_src_path', type=str)
parser.add_argument('va_src_path', type=str)
parser.add_argument('tr_dst_path', type=str)
parser.add_argument('va_dst_path', type=str)
args = vars(parser.parse_args())

fields = ['pub_id', 'pub_domain', 'pub_category', 'banner_pos', 'device_model',
          'device_conn_type', 'C14', 'C17', 'C20', 'C21']

def convert(src_path, dst_path, is_train):
    with open(dst_path, 'w') as f:
        for row in csv.DictReader(open(src_path)):
            i = 1
            w = math.sqrt(2)/math.sqrt(15)
            feats = []

            for field in fields:
                v = hashstr(field+'-'+row[field])
                feats.append('{i}:{v}:{w:.20f}'.format(i=i, v=v, w=w))
                i += 1

            v = hashstr('hour-'+row['hour'][-2:])
            feats.append('{i}:{v}:{w:.20f}'.format(i=i, v=v, w=w))
            i += 1

            # keep frequent device ips; collapse rare ones into count buckets
            if int(row['device_ip_count']) > 1000:
                v = hashstr('device_ip-'+row['device_ip'])
                feats.append('{i}:{v}:{w:.20f}'.format(i=i, v=v, w=w))
            else:
                v = hashstr('device_ip-less-'+row['device_ip_count'])
                feats.append('{i}:{v}:{w:.20f}'.format(i=i, v=v, w=w))
            i += 1

            if int(row['device_id_count']) > 1000:
                v = hashstr('device_id-'+row['device_id'])
                feats.append('{i}:{v}:{w:.20f}'.format(i=i, v=v, w=w))
            else:
                v = hashstr('device_id-less-'+row['device_id_count'])
                feats.append('{i}:{v}:{w:.20f}'.format(i=i, v=v, w=w))
            i += 1

            if int(row['smooth_user_hour_count']) > 30:
                v = hashstr('smooth_user_hour_count-0')
                feats.append('{i}:{v}:{w:.20f}'.format(i=i, v=v, w=w))
            else:
                v = hashstr('smooth_user_hour_count-'+row['smooth_user_hour_count'])
                feats.append('{i}:{v}:{w:.20f}'.format(i=i, v=v, w=w))
            i += 1

            # note: 'user_click_histroy' (sic) matches the column name produced
            # by the upstream data-generation scripts, so the spelling must stay
            if int(row['user_count']) > 30:
                v = hashstr('user_click_histroy-'+row['user_count'])
                feats.append('{i}:{v}:{w:.20f}'.format(i=i, v=v, w=w))
            else:
                v = hashstr('user_click_histroy-'+row['user_count']+'-'+row['user_click_histroy'])
                feats.append('{i}:{v}:{w:.20f}'.format(i=i, v=v, w=w))
            i += 1

            f.write('{0} {1} {2}\n'.format(row['id'], row['click'], ' '.join(feats)))

convert(args['tr_src_path'], args['tr_dst_path'], True)
convert(args['va_src_path'], args['va_dst_path'], False)
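
# For orientation, each line written above has the form
#
#   <id> <click> <field-index>:<hashed-feature>:<weight> ...
#
# where every weight is sqrt(2)/sqrt(15) ~= 0.365148. A hypothetical example
# line with two active features (id and hash values made up):
#
#   10000169349117863715 0 1:383657:0.36514837167011074230 2:518170:0.36514837167011074230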
--------------------------------------------------------------------------------
/bag/converter/common.py:
--------------------------------------------------------------------------------
import hashlib, csv, math, os, subprocess

NR_BINS = 1000000

def hashstr(s):
    return str(int(hashlib.md5(s.encode('utf8')).hexdigest(), 16)%(NR_BINS-1)+1)

def open_with_first_line_skipped(path, skip=True):
    f = open(path)
    if not skip:
        return f
    next(f)
    return f

def split(path, nr_thread, has_header):

    def open_with_header_written(path, idx, header):
        f = open(path+'.__tmp__.{0}'.format(idx), 'w')
        if not has_header:
            return f
        f.write(header)
        return f

    def calc_nr_lines_per_thread():
        nr_lines = int(list(subprocess.Popen('wc -l {0}'.format(path), shell=True,
            stdout=subprocess.PIPE).stdout)[0].split()[0])
        if not has_header:
            nr_lines += 1
        return math.ceil(float(nr_lines)/nr_thread)

    header = open(path).readline()

    nr_lines_per_thread = calc_nr_lines_per_thread()

    idx = 0
    f = open_with_header_written(path, idx, header)
    for i, line in enumerate(open_with_first_line_skipped(path, has_header), start=1):
        if i%nr_lines_per_thread == 0:
            f.close()
            idx += 1
            f = open_with_header_written(path, idx, header)
        f.write(line)
    f.close()

def parallel_convert(cvt_path, arg_paths, nr_thread):

    workers = []
    for i in range(nr_thread):
        cmd = '{0}'.format(os.path.join('.', cvt_path))
        for path in arg_paths:
            cmd += ' {0}'.format(path+'.__tmp__.{0}'.format(i))
        worker = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE)
        workers.append(worker)
    for worker in workers:
        worker.communicate()

def cat(path, nr_thread):

    if os.path.exists(path):
        os.remove(path)
    for i in range(nr_thread):
        cmd = 'cat {svm}.__tmp__.{idx} >> {svm}'.format(svm=path, idx=i)
        p = subprocess.Popen(cmd, shell=True)
        p.communicate()

def delete(path, nr_thread):

    for i in range(nr_thread):
        os.remove('{0}.__tmp__.{1}'.format(path, i))

def def_user(row):

    # 'a99f214a' is the placeholder device_id; fall back to ip+model as the user key
    if row['device_id'] == 'a99f214a':
        user = 'ip-' + row['device_ip'] + '-' + row['device_model']
    else:
        user = 'id-' + row['device_id']

    return user

def is_app(row):

    # '85f751fd' is the placeholder site_id used for app traffic
    return row['site_id'] == '85f751fd'

def has_id_info(row):

    return row['device_id'] != 'a99f214a'
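
# Sketch of how the helpers above chain together (the driver scripts are not
# shown here; file names below are hypothetical):
#
#   NR_THREAD = 12
#   split('tr.csv', NR_THREAD, has_header=True)   # -> tr.csv.__tmp__.0 .. .11
#   split('va.csv', NR_THREAD, has_header=True)
#   parallel_convert('converter/6.py',
#                    ['tr.csv', 'va.csv', 'tr.ffm', 'va.ffm'], NR_THREAD)
#   cat('tr.ffm', NR_THREAD)                      # reassemble output shards
#   cat('va.ffm', NR_THREAD)
#   for p in ('tr.csv', 'va.csv', 'tr.ffm', 'va.ffm'):
#       delete(p, NR_THREAD)                      # remove temporary shards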
--------------------------------------------------------------------------------
/bag/converter/group6.py:
--------------------------------------------------------------------------------
#!/usr/bin/env python3
import argparse
import hashlib
import math
import multiprocessing
import os
import pandas as pd
import pickle
import subprocess
import time
from collections import Counter
from multiprocessing import Pool

# note: 'user_click_histroy' (sic) matches the column name in the generated csv files
f_fields = ['hour', 'banner_pos', 'device_id', 'device_ip', 'device_model',
            'device_conn_type', 'C14', 'C17', 'C20', 'C21', 'pub_id',
            'pub_domain', 'pub_category', 'device_id_count', 'device_ip_count',
            'user_count', 'smooth_user_hour_count', 'user_click_histroy']

def parse_args():
    parser = argparse.ArgumentParser('Calculate group features and dump them to a specified file')
    parser.add_argument('train', type=str, help='csv file')
    parser.add_argument('valid', type=str, help='csv file')
    parser.add_argument('partition', type=str, help='site/app')
    parser.add_argument('g_field', type=str, help='the fields used to group instances')
    parser.add_argument('a_field', type=str, help='the fields considered in each group')
    parser.add_argument('--gc_begin', type=int, default=16, help='the index of the first column in group features')
    parser.add_argument('--max_occur', type=int, default=100, help='threshold on value counts; any value occurring fewer times than this is replaced with its count')
    parser.add_argument('--max_sz_group', type=int, default=100, help='the upper limit on the size of each group')
    parser.add_argument('--max_nr_group_feats', type=int, default=2500, help='the maximum number of features in a group')
    return vars(parser.parse_args())

def hashstr(s, nr_bins=int(1e6)):
    return int(hashlib.md5(s.encode('utf8')).hexdigest(), 16)%(nr_bins-1)+1

def vtform(v, partition, c, cnts, max_occur):
    pub_in_raw = {'pub_id': {'app': 'app_id', 'site': 'site_id'},
                  'pub_domain': {'app': 'app_domain', 'site': 'site_domain'},
                  'pub_category': {'app': 'app_category', 'site': 'site_category'}}
    if c in pub_in_raw:
        c = pub_in_raw[c][partition]
    if c != 'hour':
        if v in cnts[c]:
            if cnts[c][v] >= max_occur:
                return c+'-'+v
            else:
                return c+'-less-'+str(cnts[c][v])
        else:
            return c+'-less'
    else:
        return c+'-'+v[-2:]

def generate_feats(df, partition, a_field, gc_begin, max_occur, max_sz_group, max_nr_group_feats, tr_path, va_path):
    g_added = set(a_field.split(',')) & set(f_fields)
    col_fm_indices = {c: i+gc_begin for i, c in enumerate(g_added)}
    # per-column value counts, pre-computed by an upstream script
    with open('fc.trva.r0.t2.pkl', 'rb') as fh:
        cnts = pickle.load(fh)
    with open(tr_path, 'wt') as f_tr, open(va_path, 'wt') as f_va:
        for gid, group in df.groupby('__kid__'):
            group_feats = dict()
            if len(group) < max_sz_group:
                for c in g_added:
                    group_feats[c] = Counter(group[c].apply(lambda x: vtform(x, partition, c, cnts, max_occur)))
                    c_norm = 1/math.sqrt(sum([w**2 for w in group_feats[c].values()]))/len(g_added)
                    for v, w in group_feats[c].items():
                        group_feats[c][v] = w*c_norm

            gf_str = ''
            for c, vws in group_feats.items():
                for v, w in vws.items():
                    gf_str += ' {0}:{1}:{2:.5f}'.format(col_fm_indices[c], int(hashstr('group-'+v)), w)

            for rid, row in group.iterrows():
                feats_str = row['id'] + gf_str
                if row['__src__'] == '__tr__':
                    f_tr.write(feats_str+'\n')
                elif row['__src__'] == '__va__':
                    f_va.write(feats_str+'\n')

def cat(combined, names):
    if os.path.exists(combined):
        os.remove(combined)
    for name in names:
        cmd = 'cat {0} >> {1}'.format(name, combined)
        p = subprocess.Popen(cmd, shell=True)
        p.communicate()


def delete(names):
    for name in names:
        cmd = 'rm {0}'.format(name)
        p = subprocess.Popen(cmd, shell=True)
        p.communicate()

def get_pid_table(df, col, sz_chunk):
    return df.groupby(col)['id'].count().cumsum().apply(lambda x: int(x/sz_chunk))
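
# Illustration of vtform's frequency thresholding (counts below are
# hypothetical): frequent values keep their identity, rare values collapse
# into buckets keyed by their count, and unseen values share a 'less' bucket.
#
#   cnts = {'site_id': {'85f751fd': 5000, '0a1b2c3d': 7}}
#   vtform('85f751fd', 'site', 'pub_id', cnts, 100)  -> 'site_id-85f751fd'
#   vtform('0a1b2c3d', 'site', 'pub_id', cnts, 100)  -> 'site_id-less-7'
#   vtform('deadbeef', 'site', 'pub_id', cnts, 100)  -> 'site_id-less'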
if __name__ == '__main__':
    args = parse_args()
    spec = '.T_{max_occur}.gins_{max_sz_group}.gfeat_{max_nr_group_feats}.gby_{g_field}.add_{a_field}'.format(
        max_occur=args['max_occur'], max_sz_group=args['max_sz_group'],
        max_nr_group_feats=args['max_nr_group_feats'],
        g_field=args['g_field'], a_field=args['a_field'])

    # loading
    start = time.time()
    tr = pd.read_csv(args['train'], dtype=str)
    tr['__src__'] = '__tr__'
    va = pd.read_csv(args['valid'], dtype=str)
    va['__src__'] = '__va__'
    trva = pd.concat([tr, va])
    if args['g_field'] != 'device_id':
        trva['__kid__'] = trva.apply(lambda row: '-'.join([row[c] for c in args['g_field'].split(',')]), axis=1)
    else:
        # 'a99f214a' is the placeholder device_id; fall back to ip+model as the user key
        trva['__kid__'] = trva.apply(lambda row: row['device_id'] if row['device_id'] != 'a99f214a' else row['device_ip']+'-'+row['device_model'], axis=1)
    del tr
    del va
    print('Loading: {0} sec.'.format(time.time()-start))

    # assign process IDs
    start = time.time()
    sz_chunk = max(20000, int(len(trva)/100) + 1)
    trva['__pid__'] = get_pid_table(trva, '__kid__', sz_chunk)[trva['__kid__']].values
    pids = sorted(set(trva['__pid__']))  # sort so file names line up with the groupby order below
    tr_files = [args['train']+'.__tmp__.'+str(k)+spec for k in pids]
    va_files = [args['valid']+'.__tmp__.'+str(k)+spec for k in pids]
    print('Compute the sizes of groups: {0} sec.'.format(time.time()-start))

    # compute group features in parallel
    start = time.time()
    nr_procs = multiprocessing.cpu_count()
    pool = Pool(processes=nr_procs)

    result = pool.starmap(generate_feats, [(g[1], args['partition'], args['a_field'], args['gc_begin'], args['max_occur'], args['max_sz_group'], args['max_nr_group_feats'], f_tr, f_va) for g, f_tr, f_va in zip(trva.groupby('__pid__'), tr_files, va_files)])
    pool.close()
    pool.join()
    print("Calculate groups' features: {0} sec.".format(time.time()-start))

    # combine results and delete redundant files
    start = time.time()
    tr_path = args['train']+'.group'
    va_path = args['valid']+'.group'
    cat(tr_path, tr_files)
    cat(va_path, va_files)
    delete(tr_files)
    delete(va_files)
    print('Clean temporary files: {0} sec.'.format(time.time()-start))
--------------------------------------------------------------------------------
/bag/mark/Makefile:
--------------------------------------------------------------------------------
CXX = g++
CXXFLAGS = -Wall -Wconversion -O3 -fPIC -std=c++0x -march=native -fopenmp
MAIN = mark18
FILES = common.cpp timer.cpp
SRCS = $(FILES:%.cpp=src/%.cpp)
HEADERS = $(FILES:%.cpp=src/%.h)

#DFLAG = -DNOSSE

all: $(MAIN)

mark18: src/train.cpp $(SRCS) $(HEADERS)
	$(CXX) $(CXXFLAGS) $(DFLAG) -o $@ $< $(SRCS)

clean:
	rm -f $(MAIN)
--------------------------------------------------------------------------------
/bag/mark/README:
--------------------------------------------------------------------------------
Data Format
===========
The input of this factorization machine solver consists of a label vector (y)
and a binary sparse matrix (X). The input format is:
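
(Inferred from the converters, e.g. bag/converter/6.py, which produce this
solver's input; treat the following as a sketch rather than an authoritative
specification.)

    <id> <label> <field>:<hashed-feature>:<value> ...

For example, a line with two active features might look like (all values
hypothetical):

    10000169349117863715 0 1:383657:0.36515 2:518170:0.36515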