├── README.md
├── taxii
│   ├── NTUST_MLclass_final.pdf
│   ├── README.md
│   ├── bidirectional_LSTM.py
│   ├── parallelize_data_processing(part of).py
│   └── sampling fliter machine.py
├── toxic_comment
│   ├── README.md
│   ├── hash_back.py
│   ├── model
│   │   ├── DPCNN.py
│   │   ├── GRU_Capsule.py
│   │   ├── LSTM_Attention.py
│   │   ├── README.md
│   │   ├── Single_GRU_glove_char.py
│   │   ├── Single_GRU_wiki.py
│   │   ├── Single_GRU_wiki_char(including preprocessing).py
│   │   ├── criteria.py
│   │   └── stacking.py
│   ├── out-of-fold-cv.py
│   ├── preprocess
│   │   ├── commen_preprocess.py
│   │   ├── correction.py
│   │   ├── glove_twitter_preprocess.py
│   │   └── word_net_lemmatize.py
│   └── toxic_comment_writting_report.pdf
└── toxic_rank_pic.png
/README.md:
--------------------------------------------------------------------------------
1 | # Kaggle-project-list
2 | 
3 | ## Competition and project list
4 | 
5 | ### 1. NTUST Machine Learning Class final project:
6 | #### Taxi travel time prediction:
7 | https://www.kaggle.com/c/pkdd-15-taxi-trip-time-prediction-ii
8 | #### File name: taxii; including writing report NTUST_MLclass_final.pdf
9 | (On GitHub the PDF link will not work; the file needs to be downloaded.)
10 | 
11 | Or download the writing report from: https://drive.google.com/drive/folders/1rtRz4IRec-6NAfTHGua1As5q4PIpZScq
12 | 
13 | ### 2. Toxic Comment Classification Challenge (Top 5% silver at final):
14 | #### Content: build multi-headed models capable of detecting different types of toxicity such as threats, obscenity, insults, and identity-based hate
15 | #### Host: Jigsaw (part of Alphabet, working with the Google AI conversation team)
16 | https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge
17 | 
18 | ![image](toxic_rank_pic.png)
19 | 
20 | #### File name: toxic_comment; including writing report toxic_comment_writting_report.pdf
21 | (On GitHub the PDF link will not work; the file needs to be downloaded.)
22 | 
23 | Or download the writing report from: https://drive.google.com/drive/folders/1rtRz4IRec-6NAfTHGua1As5q4PIpZScq
24 | 
--------------------------------------------------------------------------------
/taxii/NTUST_MLclass_final.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JasonEricZhan/Kaggle-project-list/460e33e6475d77a6db481b1e518421a9ddcb65cc/taxii/NTUST_MLclass_final.pdf
--------------------------------------------------------------------------------
/taxii/README.md:
--------------------------------------------------------------------------------
1 | Taxi travel time prediction:
2 | 
3 | https://www.kaggle.com/c/pkdd-15-taxi-trip-time-prediction-ii
4 | 
5 | ***
6 | # Complete data handling
7 | ## Data preprocessing:
8 | ### About IDs:
9 | 1. Label-encode TAXI_ID so the IDs become small consecutive integers, which makes them easier for the model to handle and saves training time.
10 | 2. Drop TRIP_ID, because it only labels each trip: every trip has a different value, so it carries no information useful to the training model.
11 | ### About time:
12 | The main goal is to derive cyclic time features, for example which day of the month or which hour of the day a trip starts. Machine learning models generalize poorly to feature values they have never seen before, and cyclic time features give future data some similarity with historical data. (A minimal code sketch of these steps is given at the end of this README.)
13 | 1. Take the base Unix timestamp for 12:00 am UTC 01/07/2013 and subtract it from TIMESTAMP, which gives the seconds elapsed within the year (from 01/07/2013 to 30/06/2014).
14 | 2. Divide the value obtained in 1 by 86400 (seconds in one day) and truncate to an integer to get the day of the year. New feature: day_of_year.
15 | 3. Divide the value obtained in 1 by 86400*30 (seconds in one 30-day "month") and truncate to an integer to get the month of the year. New feature: month.
16 | 4. Using the values obtained in 2 and 3, compute day_of_year minus month*30 to get the day of the month. New feature: day_of_month.
17 | 5. Take the value obtained in 4 mod 7 and add one to get the day of the week.
18 | New feature: day_of_week.
19 | 6. Compute (TIMESTAMP - month*30*86400 - day_of_month*86400) / (60*60): this takes the remaining seconds within the day, divides by the seconds in an hour, and truncates to an integer. Thus we get the hour of the day. New feature: hour_in_day.
20 | 7. Subtract hour_in_day from the non-truncated value computed in 6 to get the remaining fraction of an hour, multiply it by 60, and truncate to an integer to get the minute of the hour. New feature: minute_in_hour.
21 | 
22 | ### About position:
23 | 1. Take the data from POLYLINE and create four features: start
24 | point's longitude, start point's latitude, end point's longitude,
25 | end point's latitude.
26 | 2. Compute the distance between the start point and the end point with the Haversine formula (it accounts for the curvature of the earth's surface better than a traditional distance measure, such as Euclidean distance, on this data set). New feature: Haversine distance.
27 | ### About target value (travel time):
28 | 1. Count the points recorded in a trip, subtract one, and multiply by 15 seconds (as stated on Kaggle). Trips with no recorded points get zero.
29 | Target value: POLYLINE_time_second.
30 | 2. Another target value: POLYLINE_time_second_log. Because the distribution of the value above is very skewed, I take its logarithm, and the result looks close to a normal distribution, which is really good! Trips with no recorded points get zero.
31 | 3. Compute whether the travel path passes a "hot point" or not.
32 | 
33 | 
34 | ### About other features:
35 | 1. Tell the model which type each feature is (numerical or categorical), so I impute the missing values of ORIGIN_STAND and ORIGIN_CALL with -1 and use a label encoder to encode them (it encodes by frequency, i.e. the most frequent label gets 0, the second most frequent gets 1, and so on).
36 | 2. Create dummy variables for CALL_TYPE, which trains our neural network model better, and replace the original CALL_TYPE.
37 | ## Feature choosing:
38 | 1. Drop the zero-variance features, including DAY_TYPE and "MISSING_DATA".
39 | 2. Drop any feature whose values are always different across the data set, i.e. whose number of distinct values equals the length of the data set, including TRIP_ID.
40 | 3. Use a correlation plot to choose which "time" features to keep. Since the time features are all derived from TIMESTAMP, this process easily introduces collinearity (which can sometimes cause overfitting or poor performance). Thus, I drop the highly correlated ones, such as month, TIMESTAMP, and day_of_year.
41 | 
42 | ## Data choosing:
43 | 1. Keep the rows whose "MISSING_DATA" is false.
44 | 2. Keep the rows that have both a start point and an end point.
45 | ### Target value choosing:
46 | *Choose POLYLINE_time_second_log for training the model; after model
47 | training, I take the exponential to transform predictions back for comparison with POLYLINE_time_second.
48 | 
49 | 
50 | 
51 | 
52 | Detailed writing report: NTUST_MLclass_final.pdf
53 | 
54 | (On GitHub the PDF link will not work; the file needs to be downloaded.)
55 | 
56 | ** WARNING: IF YOU EXPAND THE TIME WINDOW, I WOULD NOT RECOMMEND USING A BI-DIRECTIONAL RNN, IT VIOLATES THE INTUITION!!
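
For reference, below is a minimal sketch of the cyclic time-feature steps from the "About time" section above; the implementation actually used in this project is the `time_processing` function in `sampling fliter machine.py`. It assumes `df` is a pandas DataFrame loaded from the competition's `train.csv` (so `TIMESTAMP` is a Unix timestamp in seconds); the helper name `add_time_features` is only for illustration.

```python
import pandas as pd

def add_time_features(df):
    data = df.copy()
    base = 1372636800                               # Unix timestamp of 12:00 am UTC 01/07/2013
    elapsed = data['TIMESTAMP'] - base              # seconds elapsed within the year
    data['day_of_year'] = (elapsed // 86400).astype(int)
    data['month'] = (elapsed // (86400 * 30)).astype(int)          # 30-day bucket, not a calendar month
    data['day_of_month'] = data['day_of_year'] - data['month'] * 30
    data['day_of_week'] = (data['day_of_month'] % 7) + 1
    # remaining seconds within the day, expressed in hours (float)
    hour_float = (elapsed - data['month'] * 30 * 86400
                  - data['day_of_month'] * 86400) / 3600.0
    data['hour_in_day'] = hour_float.astype(int)
    data['minute_in_hour'] = ((hour_float - data['hour_in_day']) * 60).astype(int)
    return data
```

Note that `month` here is a 30-day bucket rather than a true calendar month, which is exactly why step 3 above divides by 86400*30.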
57 | 
58 | 
59 | 
--------------------------------------------------------------------------------
/taxii/bidirectional_LSTM.py:
--------------------------------------------------------------------------------
1 | from keras.models import Sequential,Model
2 | from keras.layers.core import Dense,Activation
3 | from keras.layers import LSTM,GRU,Bidirectional,Dropout
4 | from keras.callbacks import EarlyStopping
5 | 
6 | 
7 | from keras import regularizers
8 | from keras import initializers
9 | from keras.optimizers import Adam
10 | 
11 | from sklearn.metrics import mean_absolute_error
12 | from sklearn.metrics import mean_squared_log_error
13 | import numpy as np
14 | from numpy.random import seed
15 | seed(1)
16 | 
17 | %matplotlib inline
18 | import matplotlib.pyplot as plt;
19 | #df and df_norm (the normalized feature matrix) are assumed to come from the preprocessing scripts in this folder
20 | count_prev=0
21 | mae_list=[]
22 | RMSL_list=[]
23 | for i in np.arange(0,10,2):
24 |     length_first_month=len(df.loc[(df['month']==i)]) #first month for training
25 |     length_second_month=len(df.loc[(df['month']==i+2)]) #second month for training
26 |     length_val=len(df.loc[df['month']==i+4]) #third month for validation
27 |     length_train=length_first_month+length_second_month #length for training
28 |     count_now=count_prev+length_train
29 |     train=df_norm[count_prev:count_now]
30 |     val=df_norm[count_now:count_now+length_val]
31 | 
32 |     train=train.reshape(-1,1,17) #the 17 depends on the number of features
33 |     val=val.reshape(-1,1,17)
34 | 
35 |     y_train=df.POLYLINE_time_second_log[count_prev:count_now]
36 |     y_val=df.POLYLINE_time_second_log[count_now:count_now+length_val]
37 | 
38 |     train_answer=df.POLYLINE_time_second[count_prev:count_now]
39 |     test_answer=df.POLYLINE_time_second[count_now:count_now+length_val]
40 | 
41 | 
42 |     count_prev=count_prev+length_first_month #move the window forward by one month
43 | 
44 |     model = Sequential()
45 |     #Bidirectional doubles the number of units automatically
46 |     model.add(Bidirectional(LSTM(128,recurrent_dropout=0.3,
47 |                                  return_sequences = True),input_shape=(1,train.shape[2])))
48 |     model.add(Dropout(0.2))
49 |     model.add(Bidirectional(LSTM(128,recurrent_dropout=0.3,
50 |                                  return_sequences = True)))
51 |     model.add(Dropout(0.2))
52 |     model.add(Bidirectional(LSTM(64,recurrent_dropout=0.3,
53 |                                  return_sequences = True)))
54 |     model.add(Dropout(0.2))
55 |     model.add(Bidirectional(LSTM(64,recurrent_dropout=0.3,
56 |                                  return_sequences = True)))
57 |     model.add(Dropout(0.2))
58 | 
59 |     model.add(Bidirectional(LSTM(32,recurrent_dropout=0.3,
60 |                                  return_sequences = False)))
61 |     #do not return the full sequence, so the dense layer aggregates the bidirectional LSTM units
62 |     model.add(Dropout(0.2))
63 |     model.add(Dense(1))
64 |     model.add(Activation('linear'))
65 |     print(model.summary())
66 | 
67 | 
68 |     adam=Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0,clipvalue=0.0015)
69 |     #clipvalue is set almost arbitrarily here; I just eyeballed a few epochs to pick the number, without much experimentation
70 |     model.compile(loss='mean_squared_logarithmic_error', optimizer=adam,metrics =['mae'])
71 | 
72 | 
73 |     earlystop = EarlyStopping(monitor='val_loss', min_delta=0, patience=5, verbose=0)
74 |     history = model.fit(train,y_train.values,batch_size=64,validation_data=(val, y_val),
75 |                         epochs=40,shuffle=False,callbacks=[earlystop])
76 |     pred=model.predict(val)
77 | 
78 |     ##validation plots for the NN model
79 |     print(history.history.keys())
80 |     plt.plot(history.history['loss'])
81 |     plt.plot(history.history['val_loss'])
82 |     plt.title('model loss')
83 |     plt.ylabel('mean squared logarithmic error')
84 |     plt.xlabel('epoch')
85 |     plt.legend(['train', 'val'],
loc='upper left') 86 | plt.show() 87 | 88 | print(history.history.keys()) 89 | plt.plot(history.history['mean_absolute_error']) 90 | plt.plot(history.history['val_mean_absolute_error']) 91 | plt.title('model accuracy') 92 | plt.ylabel('mean absolute error') 93 | plt.xlabel('epoch') 94 | plt.legend(['train', 'val'], loc='upper left') 95 | plt.show() 96 | 97 | 98 | 99 | pred=pred.reshape(-1,) 100 | 101 | mae=mean_absolute_error(np.exp(pred),test_answer) 102 | msle=mean_squared_log_error(np.exp(pred),test_answer) 103 | 104 | 105 | mae_list.append(mae) 106 | RMSL_list.append(np.sqrt(msle)) 107 | 108 | 109 | print ("# MAE: {}\n".format( mae ) ) 110 | print ("# Mean square log error: {}\n".format( msle ) ) 111 | print ("# Root Mean square log error: {}\n".format( np.sqrt(msle )) ) 112 | 113 | print ("# mean of MAE: {}\n".format( np.mean(mae_list) ) ) 114 | print ("# std of MAE: {}\n".format( np.std(mae_list) ) ) 115 | print ("# mean of Root Mean square log error: {}\n".format( np.mean(RMSL_list )) ) 116 | print ("# std of Root Mean square log error: {}\n".format( np.std(RMSL_list) ) ) 117 | -------------------------------------------------------------------------------- /taxii/parallelize_data_processing(part of).py: -------------------------------------------------------------------------------- 1 | 2 | import numpy as np 3 | from multiprocessing import Pool 4 | #add to the list 5 | #df there is pandas data frame 6 | 7 | #compute pass the hot point or not 8 | hot_point=[[-8.61470,41.14660],[-8.616,41.141],[-8.60658,41.14724],[-8.624817,41.177124],[-8.64927,41.170653],[-8.6382,41.159151], 9 | [-8.624817,41.177124],[-8.64927,41.170653],[-8.6382,41.159151]] 10 | 11 | def get_dist(lonlat,point): 12 | if len(lonlat) >0: 13 | lon_diff = np.abs(lonlat[0]-point[0])*np.pi/360.0 14 | lat_diff = np.abs(lonlat[1]-point[1])*np.pi/360.0 15 | a = np.sin(lat_diff)**2 + np.cos(point[0]*np.pi/180.0) * np.cos(point[1]*np.pi/180.0) * np.sin(lon_diff)**2 16 | d = 2*6371*np.arctan2(np.sqrt(a), np.sqrt(1-a)) 17 | return(d) 18 | else: 19 | return 0 20 | 21 | 22 | 23 | def pass_hot_point(lonlat): 24 | if len(lonlat) >0: 25 | length=len(lonlat) 26 | len_hot=len(hot_point) 27 | for i in range(0,length): 28 | for j in range(0,len_hot): 29 | dist=get_dist(lonlat[i],hot_point[j]) 30 | if(dist<0.1): #bias set to 100 meters 31 | return True 32 | else: 33 | pass 34 | #if(i%1000==0): 35 | #print(i) 36 | return False 37 | else: 38 | return False 39 | 40 | 41 | 42 | num_partitions = 3 #number of partitions to split dataframe 43 | num_cores = 3 #number of cores on the machine 44 | 45 | def parallelize_dataframe(df, func): 46 | df_split = np.array_split(df, num_partitions) 47 | pool = Pool(num_cores) 48 | df=pd.concat(pool.map(func, df_split)) 49 | pool.close() 50 | pool.join() 51 | return df 52 | 53 | 54 | def multiply_hot_point(data): 55 | data['pass_hot_point'] = data['POLYLINE'].apply(lambda x: pass_hot_point(x)) 56 | return data 57 | 58 | df= parallelize_dataframe(df, multiply_hot_point) 59 | 60 | 61 | -------------------------------------------------------------------------------- /taxii/sampling fliter machine.py: -------------------------------------------------------------------------------- 1 | """ 2 | Author:Edited and created by Eric 3 | 4 | """ 5 | # coding: utf-8 6 | 7 | # In[1]: 8 | 9 | import pandas as pd 10 | df=pd.read_csv('train.csv').sort_values(['TIMESTAMP']) 11 | 12 | 13 | 14 | 15 | # In[2]: 16 | 17 | df.columns 18 | 19 | 20 | # In[3]: 21 | 22 | def getFirst_LONGITUDE(x): 23 | if(len(x)>0): 24 | return x[0][0] 
25 | else: 26 | return "no information" 27 | 28 | def getFirst_LATITUDE(x): 29 | if(len(x)>0): 30 | return x[0][1] 31 | else: 32 | return "no information" 33 | 34 | def getLast_LONGITUDE(x): 35 | if(len(x)>0): 36 | return x[-1][0] 37 | else: 38 | return "no information" 39 | def getLast_LATITUDE(x): 40 | if(len(x)>0): 41 | return x[-1][1] 42 | else: 43 | return "no information" 44 | ### Get Haversine distance 45 | def get_dist(lonlat): 46 | if len(lonlat) >0: 47 | lon_diff = np.abs(lonlat[0][0]-lonlat[-1][0])*np.pi/360.0 48 | lat_diff = np.abs(lonlat[0][1]-lonlat[-1][1])*np.pi/360.0 49 | a = np.sin(lat_diff)**2 + np.cos(lonlat[-1][0]*np.pi/180.0) * np.cos(lonlat[0][0]*np.pi/180.0) * np.sin(lon_diff)**2 50 | d = 2*6371*np.arctan2(np.sqrt(a), np.sqrt(1-a)) 51 | return(d) 52 | else: 53 | return 0 54 | 55 | 56 | # In[4]: 57 | 58 | preprocessing 59 | 60 | 61 | # In[5]: 62 | 63 | def time_processing(df): 64 | print("at time processing....") 65 | data=df.copy() 66 | data['TIMESTAMP']=data['TIMESTAMP']-1372636800 67 | data['day_of_year']=(data['TIMESTAMP']/86400).astype(int) #day of year 68 | data['month']=(data['TIMESTAMP']/(86400*30)).astype(int) #month 69 | data['day_of_month']=(data['day_of_year']-data['month'].astype(int)*30) 70 | data['day_of_week']=(data['day_of_month']%7)+1 71 | data['month']=data['month'].astype(int) 72 | data['day_of_month']=data['day_of_month'].astype(int) 73 | data['hour_in_day_not_int']=(data['TIMESTAMP']-data['month']*30*86400-data['day_of_month']*86400)/(60*60) 74 | data['hour_in_day']=data['hour_in_day_not_int'].astype(int) 75 | data['minute_in_hour']=(data['hour_in_day_not_int']-data['hour_in_day'])*60 76 | data['minute_in_hour']=data['minute_in_hour'].astype(int) 77 | 78 | return data 79 | 80 | 81 | # In[8]: 82 | 83 | def geography_processing(df): 84 | print("at geography processing....") 85 | data=df.copy() 86 | data['POLYLINE_time_second']=data['POLYLINE'].apply(lambda x: (len(x)-1)*15) 87 | #print(data.describe()) 88 | data=data[data['POLYLINE_time_second']<10000] #testing data range 89 | data=data[data['POLYLINE_time_second']>0] 90 | #print(data.describe()) 91 | data['start_longitude']=data['POLYLINE'].apply(getFirst_LONGITUDE) 92 | data['start_latitude']=data['POLYLINE'].apply(getFirst_LATITUDE) 93 | data['end_longitude']=data['POLYLINE'].apply(getLast_LONGITUDE) 94 | data['end_latitude']=data['POLYLINE'].apply(getLast_LATITUDE) 95 | data['dist']=data['POLYLINE'].apply(get_dist) 96 | 97 | return data 98 | 99 | 100 | # In[ ]: 101 | 102 | import json 103 | from sklearn.utils import resample 104 | def sampling_fliter_machine(df,numberOfSample=0,proportion=0,double_hour=False,random_state=0): 105 | """ 106 | double_hour is to emphazie particular hour or not 107 | """ 108 | accumulator=[] 109 | data=df.copy() 110 | data=time_processing(data) 111 | #print(data.describe()) 112 | data=data.loc[data['hour_in_day']>=2] 113 | data=data.loc[data['hour_in_day']<18] 114 | for i in range(0,13): 115 | data_month=data.loc[data['month']==i].copy() 116 | #print(data_month.describe()) 117 | data_month['POLYLINE'] = data_month['POLYLINE'].apply(json.loads) 118 | data_month=geography_processing(data_month) 119 | #print(data_month.describe()) 120 | data_month=data_month[data_month['start_longitude']<=-7] 121 | data_month=data_month[data_month['start_longitude']>=-9] 122 | data_month=data_month[data_month['start_latitude']>=40] 123 | data_month=data_month[data_month['start_latitude']<=42] 124 | data_month=data_month[data_month['end_longitude']<=-7] 125 | 
data_month=data_month[data_month['end_longitude']>=-9] 126 | data_month=data_month[data_month['end_latitude']>=40] 127 | data_month=data_month[data_month['end_latitude']<=42] 128 | print("Now is "+str(i)+" month:") 129 | for j in range(2,18): 130 | set_=data_month.loc[data_month['hour_in_day']==j] 131 | #print(set_) 132 | if len(set_)==0: 133 | print("**Warning: month"+str(i)+" hour "+str(j)+" 's data is empty**") 134 | break 135 | if(numberOfSample>0): 136 | if len(set_) "+str(j)+" hour is complete") 156 | print("Data length is: {}\n".format( len(accumulator) )) 157 | else: 158 | if(proportion>0): 159 | length=int(len(set_)*proportion) 160 | if(double_hour==True): 161 | if((j==3) or (j==8) or (j==14) or (j==17)): 162 | if(length*2 "+str(j)+" hour is complete") 179 | print("Data length is: {}\n".format( len(accumulator) )) 180 | return accumulator 181 | 182 | 183 | 184 | # In[34]: 185 | 186 | import numpy as np 187 | sampling_data=sampling_fliter_machine(df,proportion=0.2,double_hour=True,random_state=0) 188 | 189 | 190 | #if you don't want to deal missing data 191 | sampling_data=sampling_data[sampling_data['MISSING_DATA']==False] 192 | # In[30]: 193 | 194 | get_ipython().magic(u'matplotlib inline') 195 | import matplotlib.pyplot as plt; 196 | plt.title('Travel time, if ceiling is 10000, get logarithm') 197 | plt.hist(np.log(sampling_data['POLYLINE_time_second']),bins=10,normed=True) 198 | plt.xlabel('POLYLINE_time_second(log)') 199 | 200 | 201 | # In[31]: 202 | 203 | get_ipython().magic(u'matplotlib inline') 204 | import matplotlib.pyplot as plt; 205 | plt.title('Travel time, if ceiling is 10000') 206 | plt.hist(sampling_data['POLYLINE_time_second'],bins=10,normed=True) 207 | plt.xlabel('POLYLINE_time_second') 208 | 209 | 210 | 211 | %matplotlib inline 212 | import matplotlib.pyplot as plt; 213 | plt.title('Travel time, if ceiling is 10000') 214 | plt.hist(sampling_data['hour_in_day'],bins=10,normed=True) 215 | plt.xlabel('POLYLINE_time_second') 216 | 217 | #saving memory 218 | del df 219 | -------------------------------------------------------------------------------- /toxic_comment/README.md: -------------------------------------------------------------------------------- 1 | ## Top 5% solution 2 | ### Detailed writting report file name: toxic_comment_writting_report.pdf 3 | Or please download writting report from: https://drive.google.com/drive/folders/1rtRz4IRec-6NAfTHGua1As5q4PIpZScq 4 | 5 | # Competition abstract: 6 | 7 | Discussing things you care about can be difficult. The threat of abuse and harassment online means that many people stop expressing themselves and give up on seeking different opinions. Platforms struggle to effectively facilitate conversations, leading many communities to limit or completely shut down user comments. 8 | In this competition, competiters are challenged to build a multi-headed model that’s capable of detecting different types of of toxicity like threats, obscenity, insults, and identity-based hate 9 | 10 | *The Conversation AI team, a research initiative founded by Jigsaw and Google (both a part of Alphabet) are working on the tools to help improve online conversation, the competiton is part of Jigsaw project. 
11 | 12 | Please see:https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge to get more detail if needed 13 | 14 | 15 | 16 | # Run on which machine and software enviroment: 17 | 18 | GPU GTx 1080 8GB is fine, 4 core cpu, ram at least 32 GB (for loading wiki.en.bin) 19 | 20 | python3, tensorflow 1.4, cuda 6 21 | 22 | (I mainly run on AWS p2.xlarge instance) 23 | 24 | # Data abstract: 25 | **** 26 | 27 | ## target label's occurance(toxic): 28 | 29 | |label|occurance|label|occurance|label|occurance| 30 | |-----|---------|-----|---------|-----|---------| 31 | |toxic|15294|servere toxic|1595|obscene|8449| 32 | |threat|478|insult|7877|identity hate|1405| 33 | 34 | ## toxic and not toxic data occurance: 35 | 36 | |type|occurance|type|occurance| 37 | |----|---------|----|---------| 38 | |not toxic|143346 |toxic| 16225| 39 | 40 | **** 41 | 42 | ## multi language occurance in toxic data (by comment): 43 | 44 | |language|occurance| 45 | |--------|---------| 46 | |af|1| 47 | |ar|3| 48 | |bg|1| 49 | |bs|1| 50 | |co|1| 51 | |cs|1| 52 | |cy|2| 53 | |da|4| 54 | |de|15| 55 | |el|1| 56 | |en|16126| 57 | |eo|1| 58 | |es|2| 59 | |fr|3| 60 | |fy|1| 61 | |gd|2| 62 | |haw|3| 63 | |hi|2| 64 | |hr|4| 65 | |hu|2| 66 | |id|2| 67 | |is|1| 68 | |ja|2| 69 | |la|1| 70 | |lb|1| 71 | |mi|1| 72 | |mt|1| 73 | |nl|6| 74 | |no|3| 75 | |pl|1| 76 | |pt|2| 77 | |ro|2| 78 | |sk|1| 79 | |so|2| 80 | |st|1| 81 | |su|1| 82 | |sv|2| 83 | |sw|3| 84 | |tl|13| 85 | |tr|1| 86 | |zh-CN|2| 87 | 88 | 89 | ## multi language occurance in the whole training set (by comment) 90 | 91 | |language|occurance| 92 | |--------|---------| 93 | |af|3| 94 | |ar|13| 95 | |bg|5| 96 | |bs|5| 97 | |ca|11| 98 | |co|3| 99 | |cs|2| 100 | |cy|6| 101 | |da|16| 102 | |de|65| 103 | |el|15| 104 | |en|158954| 105 | |eo|1| 106 | |es|45| 107 | |et|2| 108 | |eu|3| 109 | |fa|2| 110 | |fi|3| 111 | |fr|52| 112 | |fy|4| 113 | |ga|1| 114 | |gd|4| 115 | |gl|7| 116 | |ha|3| 117 | |haw|5| 118 | |hi|23| 119 | |hmn|1| 120 | |hr|11| 121 | |ht|1| 122 | |hu|11| 123 | |id|6| 124 | |ig|4| 125 | |is|4| 126 | |it|10| 127 | |ja|29| 128 | |jw|5| 129 | |ku|1| 130 | |ky|1| 131 | |la|11| 132 | |lb|2| 133 | |lt|4| 134 | |mg|2| 135 | |mi|2| 136 | |ms|6| 137 | |mt|1| 138 | |nl|29| 139 | |no|11| 140 | |ny|1| 141 | |pl|13| 142 | |pt|20| 143 | |ro|14| 144 | |ru|13| 145 | |sk|4| 146 | |sl|4| 147 | |so|4| 148 | |sq|1| 149 | |st|1| 150 | |su|1| 151 | |sv|11| 152 | |sw|6| 153 | |ta|1| 154 | |tl|30| 155 | |tr|10| 156 | |uk|2| 157 | |vi|14| 158 | |zh-CN|8| 159 | |zh-TW|5| 160 | |zu|3| 161 | 162 | 163 | ## multi language occurance in the whole testing set (by comment) 164 | 165 | |language|occurance| 166 | |--------|---------| 167 | |af|24| 168 | |am|2| 169 | |ar|251| 170 | |az|10| 171 | |bg|42| 172 | |bn|36| 173 | |bs|106| 174 | |ca|31| 175 | |ceb|19| 176 | |co|12| 177 | |cs|35| 178 | |cy|27| 179 | |da|73| 180 | |de|334| 181 | |el|158| 182 | |en|148847| 183 | |eo|11| 184 | |es|285| 185 | |et|16| 186 | |eu|5| 187 | |fa|147| 188 | |fi|30| 189 | |fr|123| 190 | |fy|22| 191 | |ga|11| 192 | |gd|7| 193 | |gl|16| 194 | |gu|8| 195 | |ha|11| 196 | |haw|38| 197 | |hi|202| 198 | |hmn|7| 199 | |hr|142| 200 | |ht|14| 201 | |hu|58| 202 | |id|76| 203 | |ig|4| 204 | |is|24| 205 | |it|60| 206 | |iw|41| 207 | |ja|79| 208 | |jw|26| 209 | |ka|38| 210 | |kk|1| 211 | |km|1| 212 | |kn|5| 213 | |ko|50| 214 | |ku|11| 215 | |la|32| 216 | |lb|11| 217 | |lt|16| 218 | |lv|8| 219 | |mg|5| 220 | |mi|21| 221 | |mk|14| 222 | |ml|12| 223 | |mn|6| 224 | |mr|11| 225 | |ms|30| 226 | |mt|20| 227 | |my|6| 228 | |ne|12| 229 | 
|nl|95| 230 | |no|60| 231 | |ny|5| 232 | |pa|5| 233 | |pl|72| 234 | |ps|10| 235 | |pt|130| 236 | |ro|85| 237 | |ru|139| 238 | |si|3| 239 | |sk|7| 240 | |sl|18| 241 | |sn|4| 242 | |so|50| 243 | |sq|54| 244 | |sr|12| 245 | |st|3| 246 | |su|15| 247 | |sv|75| 248 | |sw|16| 249 | |ta|24| 250 | |te|4| 251 | |th|22| 252 | |tl|111| 253 | |tr|100| 254 | |uk|17| 255 | |ur|20| 256 | |uz|8| 257 | |vi|111| 258 | |xh|3| 259 | |yi|2| 260 | |yo|2| 261 | |zh-CN|59| 262 | |zh-TW|31| 263 | |zu|5| 264 | 265 | ## language tag reference: 266 | Please see:https://sites.google.com/site/tomihasa/google-language-codes 267 | 268 | *can be the base of which language should be translated 269 | *The detection is done by the help of textblob library, the comment that less than 3 words will be unable to be detected 270 | *Seems de(Germany) is very reasonable to be translated 271 | 272 | **** 273 | 274 | ## leaky information: 275 | 276 | ips: in train 5804 different ips, in test 842 ips, but overlap only 41 ips 277 | 278 | user name: in train 157 different user name, in test 81 user name, but overlap only 1 user name 279 | 280 | reference from: https://www.kaggle.com/jagangupta/stop-the-s-toxic-comments-eda , it is a very good EDA!! 281 | 282 | **** 283 | 284 | # Preprocessing: 285 | 286 | 1.brute force(including tokenizing some specific punctuation) 287 | 2.lemmatize: 288 | * differenciate the word to 4 different type: 289 |  Noun,Verb,Adjective,Adverb 290 | * get the common form of them, for example: 291 | he hates-->he hate, 292 | Apples-->Apple, 293 | happier-->happy, 294 | he was walking slowly-->he be walk slowly 295 | 296 | # Model details: 297 | 298 | Please see the file name: model 299 | -------------------------------------------------------------------------------- /toxic_comment/hash_back.py: -------------------------------------------------------------------------------- 1 | 2 | #this script is to hash the out of fold back to orignial data's order 3 | 4 | import pickle 5 | import pandas as pd 6 | import numpy as np 7 | 8 | 9 | 10 | train = pd.read_csv('train.csv') 11 | test = pd.read_csv('test.csv') 12 | 13 | 14 | 15 | from sklearn.model_selection import StratifiedKFold 16 | splits = 10 17 | skf = StratifiedKFold(n_splits=splits, shuffle=True, random_state=42) 18 | 19 | # produce two lists of ids. 
each list has n items where n is the 20 | # number of folds and each item is a pandas series of indexed id numbers 21 | train_ids = [] 22 | val_ids = [] 23 | for i, (train_idx, val_idx) in enumerate(skf.split(np.zeros(train.shape[0]), y_multi)): 24 | train_ids.append(train.loc[train_idx, 'id']) 25 | val_ids.append(train.loc[val_idx, 'id']) 26 | 27 | 28 | 29 | oof_Capsule=pickle.load(open("prediction_RNN_Capsule.pkl", "rb")) 30 | oof_CNN=pickle.load(open("prediction_DPCNN.pkl", "rb")) 31 | oof_GRU=pickle.load(open("prediction_GRU.pkl", "rb")) 32 | oof_GRU_LSTM=pickle.load(open("test_average_LSTM.pkl", "rb")) 33 | 34 | 35 | #=============================Example=================================== 36 | 37 | container_before=oof_CNN #change to other fold file 38 | length_before=0 39 | length_after=0 40 | new=np.zeros((container_before.shape[0],container_before.shape[1])) 41 | for i in range(0,10): #take the fold 42 | length_after+=len(np.where(train.id.isin(train_ids[i]).values==0)[0]) 43 | if i ==9: 44 | input_segment=container_before[length_before:] 45 | else: 46 | input_segment=container_before[length_before:length_after] 47 | length_before+=len(np.where(train.id.isin(train_ids[i]).values==0)[0]) 48 | counter=0 49 | for j in np.where(train.id.isin(train_ids[i]).values==0)[0]: #transfer back to original index 50 | new[j,:]=input_segment[counter,:] 51 | counter+=1 52 | 53 | oof_CNN=new 54 | -------------------------------------------------------------------------------- /toxic_comment/model/DPCNN.py: -------------------------------------------------------------------------------- 1 | # This Python 3 environment comes with many helpful analytics libraries installed 2 | #Should have the same package as kernel, implement and modified by Eric 3 | 4 | 5 | 6 | 7 | 8 | from __future__ import absolute_import, division 9 | import sys, os, re, csv, codecs, numpy as np, pandas as pd 10 | 11 | 12 | from keras.preprocessing.sequence import pad_sequences 13 | from keras.layers import Dense, Input, LSTM, Embedding,Dropout,Activation,GRU,Conv1D,CuDNNGRU,CuDNNLSTM 14 | from keras.layers import SpatialDropout1D,MaxPool1D,GlobalAveragePooling1D,RepeatVector ,Add,PReLU 15 | from keras.layers import Bidirectional, GlobalMaxPool1D,BatchNormalization,concatenate,TimeDistributed,Merge,Flatten 16 | from keras.models import Model 17 | from keras import initializers, regularizers, constraints, optimizers, layers 18 | from keras.optimizers import Adam,SGD,Nadam 19 | from keras.callbacks import EarlyStopping, ModelCheckpoint 20 | from keras.layers.core import Layer 21 | from keras import initializers, regularizers, constraints 22 | from keras import backend as K 23 | from nltk.stem import SnowballStemmer 24 | embed_size = 200 # how big is each word vector 25 | max_features = 180000 # how many unique words to use (i.e num rows in embedding vector) 26 | maxlen=180 27 | 28 | 29 | 30 | 31 | train = pd.read_csv('train.csv') 32 | test = pd.read_csv('test.csv') 33 | 34 | 35 | 36 | merge=pd.concat([train,test]) 37 | df=merge.reset_index(drop=True) 38 | 39 | 40 | merge["comment_text"]=merge["comment_text"].fillna("_na_").values 41 | 42 | 43 | import pickle 44 | 45 | 46 | 47 | 48 | corpus_raw=df.comment_text 49 | 50 | 51 | 52 | 53 | 54 | import time 55 | 56 | start=time.time() 57 | 58 | 59 | from commen_preprocess import * 60 | from criteria import * 61 | 62 | 63 | corpus_clean= parallelize_dataframe(corpus_raw, multiply_columns_clean) 64 | pickle.dump(corpus_clean,open("tmp_noWordNet_clean.pkl", "wb")) 65 | 
corpus_twitter=pickle.load(open("tmp_noWordNet_clean.pkl", "rb")) 66 | 67 | 68 | 69 | 70 | end=time.time() 71 | 72 | timeStep=end-start 73 | 74 | print("spend sencond: "+str(timeStep)) 75 | 76 | 77 | 78 | 79 | df["comment_text"]=corpus_twitter 80 | 81 | 82 | 83 | train_cl=df[:train.shape[0]] 84 | test_cl=df[train.shape[0]:] 85 | 86 | list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"] 87 | y_tr = train_cl[list_classes].values 88 | list_sentences_train=train_cl.comment_text 89 | list_sentences_test=test_cl.comment_text 90 | 91 | 92 | print("....At....Tokenizer") 93 | 94 | 95 | puncuate=r'([\.\!\?\:\,])' 96 | 97 | from keras.preprocessing.text import Tokenizer 98 | tokenizer = Tokenizer(num_words=max_features,oov_token=puncuate) 99 | tokenizer.fit_on_texts(list(list_sentences_train)+list(list_sentences_test)) 100 | 101 | 102 | list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train) 103 | list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test) 104 | 105 | 106 | 107 | 108 | 109 | 110 | totalNumWords = [len(one_comment) for one_comment in list_tokenized_train] 111 | print("mean length:"+ str(np.mean(totalNumWords ))) 112 | print("max length:"+ str(max(totalNumWords) ) ) 113 | print("std length:"+ str(np.std(totalNumWords ))) 114 | 115 | 116 | 117 | 118 | print(" maxlen is:"+str(maxlen)) 119 | 120 | print("number of different word:"+ str(len(tokenizer.word_index.items()))) 121 | 122 | if len(tokenizer.word_index.items()) < max_features: 123 | max_features=len(tokenizer.word_index.items()) 124 | 125 | 126 | from keras.preprocessing import sequence 127 | print('Pad sequences (samples x time)') 128 | 129 | 130 | 131 | X_tr = pad_sequences(list_tokenized_train, maxlen=maxlen,padding='post') 132 | X_te = pad_sequences(list_tokenized_test, maxlen=maxlen,padding='post') 133 | 134 | 135 | 136 | print('x_train shape:', X_tr.shape) 137 | print('x_test shape:', X_te.shape) 138 | 139 | 140 | 141 | import os, re, csv, math, codecs 142 | print('loading word embeddings...') 143 | embeddings_index = {} 144 | f = codecs.open('crawl-300d-2M.vec', encoding='utf-8') 145 | from tqdm import tqdm 146 | for line in tqdm(f): 147 | values = line.rstrip().rsplit(' ') 148 | word = values[0] 149 | coefs = np.asarray(values[1:], dtype='float32') 150 | embeddings_index[word] = coefs 151 | f.close() 152 | print('found %s word vectors' % len(embeddings_index)) 153 | 154 | 155 | 156 | 157 | 158 | print('preparing embedding matrix...') 159 | words_not_found = [] 160 | nb_words = min(max_features, len(tokenizer.word_index)) 161 | print('number with words...'+str(nb_words)) 162 | 163 | embedding_matrix = np.zeros((nb_words, embed_size)) 164 | for word, i in tokenizer.word_index.items(): 165 | if i >= nb_words: 166 | continue 167 | embedding_vector = embeddings_index.get(word) 168 | if (embedding_vector is not None) and len(embedding_vector) > 0: 169 | # words not found in embedding index will be all-zeros. 
170 | embedding_matrix[i] = embedding_vector 171 | else: 172 | words_not_found.append(word) 173 | print('number of null word embeddings: %d' % np.sum(np.sum(embedding_matrix, axis=1) == 0)) 174 | 175 | 176 | import tensorflow as tf 177 | 178 | gpu_options = tf.GPUOptions(allow_growth=True) 179 | sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) 180 | 181 | #set up keras session 182 | 183 | tf.keras.backend.set_session(sess) 184 | 185 | from tensorflow.python.client import device_lib 186 | print(device_lib.list_local_devices()) 187 | 188 | 189 | from numpy.random import seed 190 | seed(1) 191 | 192 | 193 | 194 | #default parameter are represented by my own setting of model parameter 195 | 196 | 197 | def DPCNN(num_block=6,ngram=4,drop_ratio=0.15,last_drop_ratio=0.5): 198 | 199 | main_input=Input(shape=(maxlen,)) 200 | embedded_sequences= Embedding(max_features, embed_size,weights=[embedding_matrix],trainable=False)(main_input) 201 | embedded_sequences=SpatialDropout1D(0.22)(embedded_sequences) 202 | 203 | 204 | 205 | 206 | assert num_block > 1 207 | 208 | X_shortcut1 = embedded_sequences 209 | 210 | x= Conv1D(filters=hidden_dim,padding='same',kernel_size=ngram)(embedded_sequences) 211 | x= BatchNormalization()(x) 212 | x = Dropout(drop_ratio)(x) 213 | x= PReLU()(x) 214 | x= Conv1D(filters=hidden_dim,padding='same', kernel_size=ngram)(x) 215 | x= BatchNormalization()(x) 216 | x = Dropout(drop_ratio)(x) 217 | x= PReLU()(x) 218 | 219 | 220 | embedding_reshape=Conv1D(nb_filter=hidden_dim,kernel_size=1,padding='same',activation='linear')(X_shortcut1) 221 | # connect shortcut to the main path 222 | embedding_reshape = PReLU()(embedding_reshape) # pre activation 223 | x = Add()([embedding_reshape,x]) 224 | 225 | x = MaxPool1D(pool_size=4, strides=2, padding='valid')(x) 226 | 227 | 228 | 229 | for i in range(2,num_block): 230 | X_shortcut = x 231 | 232 | x = Conv1D(filters=hidden_dim,padding='same', kernel_size=ngram)(x) 233 | x= BatchNormalization()(x) 234 | x = Dropout(drop_ratio)(x) 235 | x = PReLU()(x) 236 | x = Conv1D(filters=hidden_dim,padding='same', kernel_size=ngram)(x) 237 | x= BatchNormalization()(x) 238 | x = Dropout(drop_ratio)(x) 239 | x = PReLU()(x) 240 | 241 | x = Add()([X_shortcut,x]) 242 | x = MaxPool1D(pool_size=4,strides=2, padding='valid')(x) 243 | 244 | 245 | X_shortcut_final=x 246 | x = Conv1D(filters=hidden_dim,padding='same', kernel_size=ngram)(x) 247 | x= BatchNormalization()(x) 248 | x = Dropout(drop_ratio)(x) 249 | x = PReLU()(x) 250 | x = Conv1D(filters=hidden_dim,padding='same', kernel_size=ngram)(x) 251 | x= BatchNormalization()(x) 252 | x = Dropout(drop_ratio)(x) 253 | x = PReLU()(x) 254 | 255 | x = Add()([X_shortcut_final,x]) 256 | 257 | x = GlobalMaxPool1D()(x) 258 | 259 | x = Dense(dense_filter, activation='linear')(x) 260 | x = BatchNormalization()(x) 261 | x = PReLU()(x) 262 | 263 | x = Add()([X_shortcut6,x]) 264 | 265 | x = GlobalMaxPool1D()(x) 266 | 267 | x = Dense(256, activation='linear')(x) 268 | x = BatchNormalization()(x) 269 | x = PReLU()(x) 270 | x = Dropout(last_drop_ratio)(x) 271 | 272 | 273 | x= Dense(6, activation="sigmoid",kernel_regularizer=regularizers.l2(1e-8))(x) 274 | 275 | 276 | 277 | model = Model(inputs=main_input, outputs=x) 278 | 279 | nadam=Nadam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, schedule_decay=0.0022) 280 | model.compile(loss='binary_crossentropy', 281 | optimizer=nadam, 282 | metrics=['accuracy',f1_score,auc]) 283 | print(model.summary()) 284 | return model 285 | 286 | 287 | batch_size = 400 288 | #total 
average roc_auc: 0.9865626176353365 289 | -------------------------------------------------------------------------------- /toxic_comment/model/GRU_Capsule.py: -------------------------------------------------------------------------------- 1 | # This Python 3 environment comes with many helpful analytics libraries installed 2 | #Should have the same package as kernel, implement and modified by Eric 3 | 4 | 5 | 6 | from __future__ import absolute_import, division 7 | import sys, os, re, csv, codecs, numpy as np, pandas as pd 8 | 9 | 10 | from keras.preprocessing.sequence import pad_sequences 11 | from keras.layers import Dense, Input, LSTM, Embedding,Dropout,Activation,GRU,Conv1D,CuDNNGRU,CuDNNLSTM 12 | from keras.layers import SpatialDropout1D,MaxPool1D,GlobalAveragePooling1D,RepeatVector,Add 13 | from keras.layers import Bidirectional, GlobalMaxPool1D,BatchNormalization,concatenate,TimeDistributed,Merge,Flatten 14 | from keras.models import Model 15 | from keras import initializers, regularizers, constraints, optimizers, layers 16 | from keras.optimizers import Adam,SGD,Nadam 17 | from keras.callbacks import EarlyStopping, ModelCheckpoint 18 | from keras.layers.core import Layer 19 | from keras import initializers, regularizers, constraints 20 | from keras import backend as K 21 | from nltk.stem import SnowballStemmer 22 | embed_size = 200 # how big is each word vector 23 | max_features = 180000 # how many unique words to use (i.e num rows in embedding vector) 24 | maxlen=180 25 | 26 | 27 | 28 | 29 | train = pd.read_csv('train.csv') 30 | test = pd.read_csv('test.csv') 31 | 32 | 33 | 34 | merge=pd.concat([train,test]) 35 | df=merge.reset_index(drop=True) 36 | 37 | 38 | merge["comment_text"]=merge["comment_text"].fillna("_na_").values 39 | 40 | 41 | import pickle 42 | 43 | 44 | 45 | 46 | corpus_raw=df.comment_text 47 | 48 | 49 | 50 | 51 | 52 | import time 53 | 54 | start=time.time() 55 | 56 | 57 | from commen_preprocess import * 58 | from criteria import * 59 | 60 | 61 | corpus_clean= parallelize_dataframe(corpus_raw, multiply_columns_clean) 62 | pickle.dump(corpus_clean,open("tmp_noWordNet_clean.pkl", "wb")) 63 | corpus_twitter=pickle.load(open("tmp_noWordNet_clean.pkl", "rb")) 64 | 65 | 66 | 67 | 68 | end=time.time() 69 | 70 | timeStep=end-start 71 | 72 | print("spend sencond: "+str(timeStep)) 73 | 74 | 75 | 76 | 77 | df["comment_text"]=corpus_twitter 78 | 79 | 80 | 81 | train_cl=df[:train.shape[0]] 82 | test_cl=df[train.shape[0]:] 83 | 84 | list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"] 85 | y_tr = train_cl[list_classes].values 86 | list_sentences_train=train_cl.comment_text 87 | list_sentences_test=test_cl.comment_text 88 | 89 | 90 | print("....At....Tokenizer") 91 | 92 | 93 | puncuate=r'([\.\!\?\:\,])' 94 | 95 | from keras.preprocessing.text import Tokenizer 96 | tokenizer = Tokenizer(num_words=max_features,oov_token=puncuate) 97 | tokenizer.fit_on_texts(list(list_sentences_train)+list(list_sentences_test)) 98 | 99 | 100 | list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train) 101 | list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test) 102 | 103 | 104 | 105 | 106 | 107 | 108 | totalNumWords = [len(one_comment) for one_comment in list_tokenized_train] 109 | print("mean length:"+ str(np.mean(totalNumWords ))) 110 | print("max length:"+ str(max(totalNumWords) ) ) 111 | print("std length:"+ str(np.std(totalNumWords ))) 112 | 113 | 114 | 115 | 116 | print(" maxlen is:"+str(maxlen)) 117 | 118 | print("number of 
different word:"+ str(len(tokenizer.word_index.items()))) 119 | 120 | if len(tokenizer.word_index.items()) < max_features: 121 | max_features=len(tokenizer.word_index.items()) 122 | 123 | 124 | from keras.preprocessing import sequence 125 | print('Pad sequences (samples x time)') 126 | 127 | 128 | 129 | X_tr = pad_sequences(list_tokenized_train, maxlen=maxlen,padding='post') 130 | X_te = pad_sequences(list_tokenized_test, maxlen=maxlen,padding='post') 131 | 132 | 133 | 134 | print('x_train shape:', X_tr.shape) 135 | print('x_test shape:', X_te.shape) 136 | 137 | 138 | 139 | import os, re, csv, math, codecs 140 | print('loading word embeddings...') 141 | embeddings_index = {} 142 | f = codecs.open('crawl-300d-2M.vec', encoding='utf-8') 143 | from tqdm import tqdm 144 | for line in tqdm(f): 145 | values = line.rstrip().rsplit(' ') 146 | word = values[0] 147 | coefs = np.asarray(values[1:], dtype='float32') 148 | embeddings_index[word] = coefs 149 | f.close() 150 | print('found %s word vectors' % len(embeddings_index)) 151 | 152 | 153 | embedding_matrix = np.zeros((nb_words, embed_size)) 154 | for word, i in tokenizer.word_index.items(): 155 | if i >= nb_words: 156 | continue 157 | embedding_vector = embeddings_index.get(word) 158 | if (embedding_vector is not None) and len(embedding_vector) > 0: 159 | # words not found in embedding index will be all-zeros. 160 | embedding_matrix[i] = embedding_vector 161 | else: 162 | words_not_found.append(word) 163 | print('number of null word embeddings: %d' % np.sum(np.sum(embedding_matrix, axis=1) == 0)) 164 | 165 | 166 | 167 | 168 | 169 | 170 | 171 | def squash(x, axis=-1): 172 | # s_squared_norm is really small 173 | # s_squared_norm = K.sum(K.square(x), axis, keepdims=True) + K.epsilon() 174 | # scale = K.sqrt(s_squared_norm)/ (0.5 + s_squared_norm) 175 | # return scale * x 176 | s_squared_norm = K.sum(K.square(x), axis, keepdims=True) 177 | scale = K.sqrt(s_squared_norm + K.epsilon()) 178 | return x / scale 179 | 180 | 181 | class Capsule(Layer): 182 | def __init__(self, num_capsule, dim_capsule, routings=3, kernel_size=(9, 1), share_weights=True, 183 | activation='default', **kwargs): 184 | super(Capsule, self).__init__(**kwargs) 185 | self.num_capsule = num_capsule 186 | self.dim_capsule = dim_capsule 187 | self.routings = routings 188 | self.kernel_size = kernel_size 189 | self.share_weights = share_weights 190 | if activation == 'default': 191 | self.activation = squash 192 | else: 193 | self.activation = Activation(activation) 194 | 195 | def build(self, input_shape): 196 | super(Capsule, self).build(input_shape) 197 | input_dim_capsule = input_shape[-1] 198 | if self.share_weights: 199 | self.W = self.add_weight(name='capsule_kernel', 200 | shape=(1, input_dim_capsule, 201 | self.num_capsule * self.dim_capsule), 202 | # shape=self.kernel_size, 203 | initializer='glorot_uniform', 204 | trainable=True) 205 | else: 206 | input_num_capsule = input_shape[-2] 207 | self.W = self.add_weight(name='capsule_kernel', 208 | shape=(input_num_capsule, 209 | input_dim_capsule, 210 | self.num_capsule * self.dim_capsule), 211 | initializer='glorot_uniform', 212 | trainable=True) 213 | 214 | def call(self, u_vecs): 215 | if self.share_weights: 216 | u_hat_vecs = K.conv1d(u_vecs, self.W) 217 | else: 218 | u_hat_vecs = K.local_conv1d(u_vecs, self.W, [1], [1]) 219 | 220 | batch_size = K.shape(u_vecs)[0] 221 | input_num_capsule = K.shape(u_vecs)[1] 222 | u_hat_vecs = K.reshape(u_hat_vecs, (batch_size, input_num_capsule, 223 | self.num_capsule, self.dim_capsule)) 
224 | u_hat_vecs = K.permute_dimensions(u_hat_vecs, (0, 2, 1, 3)) 225 | # final u_hat_vecs.shape = [None, num_capsule, input_num_capsule, dim_capsule] 226 | 227 | b = K.zeros_like(u_hat_vecs[:, :, :, 0]) # shape = [None, num_capsule, input_num_capsule] 228 | for i in range(self.routings): 229 | b = K.permute_dimensions(b, (0, 2, 1)) # shape = [None, input_num_capsule, num_capsule] 230 | c = K.softmax(b) 231 | c = K.permute_dimensions(c, (0, 2, 1)) 232 | b = K.permute_dimensions(b, (0, 2, 1)) 233 | outputs = self.activation(K.batch_dot(c, u_hat_vecs, [2, 2])) 234 | if i < self.routings - 1: 235 | b = K.batch_dot(outputs, u_hat_vecs, [2, 3]) 236 | 237 | return outputs 238 | 239 | def compute_output_shape(self, input_shape): 240 | return (None, self.num_capsule, self.dim_capsule) 241 | 242 | 243 | 244 | 245 | 246 | def bigru_capsule(): 247 | 248 | main_input=Input(shape=(maxlen,),name='main_input')#, name='main_input' 249 | 250 | embedded_sequences= Embedding(max_features, embed_size,weights=[embedding_matrix],trainable=False)(main_input) 251 | 252 | 253 | hidden_dim=80 #300/4 254 | 255 | Routings = 6 256 | Num_capsule = 16 257 | Dim_capsule = 32 258 | dropout_p = 0.4 259 | 260 | 261 | x=SpatialDropout1D(0.2)(embedded_sequences) 262 | x = Bidirectional(CuDNNGRU(hidden_dim,recurrent_regularizer=regularizers.l2(1e-6),return_sequences=True))(x) 263 | capsule = Capsule(num_capsule=Num_capsule, dim_capsule=Dim_capsule, routings=Routings, 264 | share_weights=True)(x) 265 | 266 | capsule = Flatten()(capsule) 267 | capsule = Dropout(dropout_p)(capsule) 268 | x=capsule 269 | 270 | x= Dense(6, activation="sigmoid",kernel_regularizer=regularizers.l2(1e-8))(x) 271 | 272 | 273 | 274 | model = Model(inputs=main_input, outputs=x) 275 | 276 | nadam=Nadam(lr=0.00125, beta_1=0.9, beta_2=0.999, epsilon=None, schedule_decay=0.0035) 277 | 278 | model.compile(loss='binary_crossentropy', 279 | optimizer=nadam, 280 | metrics=['accuracy',f1_score,auc]) 281 | print(model.summary()) 282 | return model 283 | 284 | 285 | batch_size = 512 286 | 287 | #total average roc_auc: 0.9884394886430595 288 | 289 | -------------------------------------------------------------------------------- /toxic_comment/model/LSTM_Attention.py: -------------------------------------------------------------------------------- 1 | # This Python 3 environment comes with many helpful analytics libraries installed 2 | #Should have the same package as kernel, implement and modified by Eric 3 | 4 | 5 | 6 | from __future__ import absolute_import, division 7 | import sys, os, re, csv, codecs, numpy as np, pandas as pd 8 | 9 | 10 | from keras.preprocessing.sequence import pad_sequences 11 | from keras.layers import Dense, Input, LSTM, Embedding,Dropout,Activation,GRU,Conv1D,CuDNNGRU,CuDNNLSTM 12 | from keras.layers import SpatialDropout1D,MaxPool1D,GlobalAveragePooling1D,RepeatVector,Add 13 | from keras.layers import Bidirectional, GlobalMaxPool1D,BatchNormalization,concatenate,TimeDistributed,Merge,Flatten 14 | from keras.models import Model 15 | from keras import initializers, regularizers, constraints, optimizers, layers 16 | from keras.optimizers import Adam,SGD,Nadam 17 | from keras.callbacks import EarlyStopping, ModelCheckpoint 18 | from keras.layers.core import Layer 19 | from keras import initializers, regularizers, constraints 20 | from keras import backend as K 21 | 22 | embed_size = 200 # how big is each word vector 23 | max_features = 180000 # how many unique words to use (i.e num rows in embedding vector) 24 | maxlen=180 25 | 26 | 27 | 28 | 
29 | train = pd.read_csv('train.csv') 30 | test = pd.read_csv('test.csv') 31 | 32 | 33 | 34 | merge=pd.concat([train,test]) 35 | df=merge.reset_index(drop=True) 36 | 37 | 38 | merge["comment_text"]=merge["comment_text"].fillna("_na_").values 39 | 40 | 41 | import pickle 42 | 43 | 44 | 45 | 46 | corpus_raw=df.comment_text 47 | 48 | 49 | 50 | 51 | 52 | import time 53 | 54 | start=time.time() 55 | 56 | 57 | from commen_preprocess import * 58 | from glove_twitter_preprocess import * 59 | from criteria import * 60 | 61 | corpus_pre1= parallelize_dataframe(corpus_raw, multiply_columns_clean) 62 | corpus_twitter= parallelize_dataframe(corpus_pre1, multiply_columns_glove_twitter_preprocess) 63 | pickle.dump(corpus_twitter,open("tmp_noWordNet_twitter.pkl", "wb")) 64 | corpus_twitter=pickle.load(open("tmp_noWordNet_twitter.pkl", "rb")) 65 | 66 | 67 | 68 | 69 | end=time.time() 70 | 71 | timeStep=end-start 72 | 73 | print("spend sencond: "+str(timeStep)) 74 | 75 | 76 | 77 | 78 | df["comment_text"]=corpus_twitter 79 | 80 | 81 | 82 | train_cl=df[:train.shape[0]] 83 | test_cl=df[train.shape[0]:] 84 | 85 | list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"] 86 | y_tr = train_cl[list_classes].values 87 | list_sentences_train=train_cl.comment_text 88 | list_sentences_test=test_cl.comment_text 89 | 90 | 91 | print("....At....Tokenizer") 92 | 93 | 94 | puncuate=r'([\.\!\?\:\,])' 95 | 96 | from keras.preprocessing.text import Tokenizer 97 | tokenizer = Tokenizer(num_words=max_features,oov_token=puncuate) 98 | tokenizer.fit_on_texts(list(list_sentences_train)+list(list_sentences_test)) 99 | 100 | 101 | list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train) 102 | list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test) 103 | 104 | 105 | 106 | 107 | 108 | 109 | totalNumWords = [len(one_comment) for one_comment in list_tokenized_train] 110 | print("mean length:"+ str(np.mean(totalNumWords ))) 111 | print("max length:"+ str(max(totalNumWords) ) ) 112 | print("std length:"+ str(np.std(totalNumWords ))) 113 | 114 | 115 | 116 | 117 | print(" maxlen is:"+str(maxlen)) 118 | 119 | print("number of different word:"+ str(len(tokenizer.word_index.items()))) 120 | 121 | if len(tokenizer.word_index.items()) < max_features: 122 | max_features=len(tokenizer.word_index.items()) 123 | 124 | 125 | from keras.preprocessing import sequence 126 | print('Pad sequences (samples x time)') 127 | 128 | 129 | 130 | X_tr = pad_sequences(list_tokenized_train, maxlen=maxlen,padding='post') 131 | X_te = pad_sequences(list_tokenized_test, maxlen=maxlen,padding='post') 132 | 133 | 134 | 135 | print('x_train shape:', X_tr.shape) 136 | print('x_test shape:', X_te.shape) 137 | 138 | 139 | import os, re, csv, math, codecs 140 | print('loading word embeddings...') 141 | embeddings_index = {} 142 | f = codecs.open('glove.twitter.27B.200d.txt', encoding='utf-8') 143 | from tqdm import tqdm 144 | for line in tqdm(f): 145 | values = line.rstrip().rsplit(' ') 146 | word = values[0] 147 | coefs = np.asarray(values[1:], dtype='float32') 148 | embeddings_index[word] = coefs 149 | f.close() 150 | print('found %s word vectors' % len(embeddings_index)) 151 | 152 | embedding_matrix = np.zeros((nb_words, embed_size)) 153 | for word, i in tokenizer.word_index.items(): 154 | if i >= nb_words: 155 | continue 156 | embedding_vector = embeddings_index.get(word) 157 | if (embedding_vector is not None) and len(embedding_vector) > 0: 158 | # words not found in embedding index will be all-zeros. 
159 | embedding_matrix[i] = embedding_vector 160 | else: 161 | words_not_found.append(word) 162 | print('number of null word embeddings: %d' % np.sum(np.sum(embedding_matrix, axis=1) == 0)) 163 | 164 | print("complete preprocess") 165 | 166 | 167 | 168 | import sys 169 | from os.path import dirname 170 | from keras import initializers 171 | from keras.engine import InputSpec, Layer 172 | from keras import backend as K 173 | 174 | """ 175 | From https://arxiv.org/pdf/1708.00524.pdf, 176 | Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm 177 | """ 178 | class AttentionWeightedAverage(Layer): 179 | """ 180 | #Computes a weighted average of the different channels across timesteps. 181 | #Uses 1 parameter pr. channel to compute the attention value for a single timestep. 182 | """ 183 | 184 | def __init__(self, return_attention=False, **kwargs): 185 | self.init = initializers.get('uniform') 186 | self.supports_masking = True 187 | self.return_attention = return_attention 188 | super(AttentionWeightedAverage, self).__init__(** kwargs) 189 | 190 | def build(self, input_shape): 191 | self.input_spec = [InputSpec(ndim=3)] 192 | assert len(input_shape) == 3 193 | 194 | self.W = self.add_weight(shape=(input_shape[2], 1), 195 | name='{}_W'.format(self.name), 196 | initializer=self.init) 197 | self.trainable_weights = [self.W] 198 | super(AttentionWeightedAverage, self).build(input_shape) 199 | 200 | def call(self, x, mask=None): 201 | # computes a probability distribution over the timesteps 202 | # uses 'max trick' for numerical stability 203 | # reshape is done to avoid issue with Tensorflow 204 | # and 1-dimensional weights 205 | logits = K.dot(x, self.W) 206 | x_shape = K.shape(x) 207 | logits = K.reshape(logits, (x_shape[0], x_shape[1])) 208 | ai = K.exp(logits - K.max(logits, axis=-1, keepdims=True)) 209 | 210 | # masked timesteps have zero weight 211 | if mask is not None: 212 | mask = K.cast(mask, K.floatx()) 213 | ai = ai * mask 214 | att_weights = ai / (K.sum(ai, axis=1, keepdims=True) + K.epsilon()) 215 | weighted_input = x * K.expand_dims(att_weights) 216 | result = K.sum(weighted_input, axis=1) 217 | if self.return_attention: 218 | return [result, att_weights] 219 | return result 220 | 221 | def get_output_shape_for(self, input_shape): 222 | return self.compute_output_shape(input_shape) 223 | 224 | def compute_output_shape(self, input_shape): 225 | output_len = input_shape[2] 226 | if self.return_attention: 227 | return [(input_shape[0], output_len), (input_shape[0], input_shape[1])] 228 | return (input_shape[0], output_len) 229 | 230 | def compute_mask(self, input, input_mask=None): 231 | if isinstance(input_mask, list): 232 | return [None] * len(input_mask) 233 | else: 234 | return None 235 | 236 | import tensorflow as tf 237 | 238 | gpu_options = tf.GPUOptions(allow_growth=True) 239 | sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) 240 | 241 | #set up keras session 242 | 243 | 244 | def lstm_attention(): 245 | 246 | main_input=Input(shape=(maxlen,),name='main_input') 247 | 248 | embedded_sequences= Embedding(max_features, embed_size,weights=[embedding_matrix],trainable=False)(main_input) 249 | 250 | 251 | hidden_dim=100 252 | 253 | x=SpatialDropout1D(0.21)(embedded_sequences) #0.1 254 | x_lstm_1 = Bidirectional(CuDNNLSTM(hidden_dim,recurrent_regularizer=regularizers.l2(1e-5),return_sequences=True))(x) 255 | x_lstm_2 = 
Bidirectional(CuDNNLSTM(hidden_dim,recurrent_regularizer=regularizers.l2(1e-5),return_sequences=True))(x_lstm_1) 256 | x_com = concatenate([x_lstm_1,x_lstm_2]) 257 | x_att_1 = AttentionWeightedAverage()(x_com) 258 | x_att_1= Dropout(0.225)(x_att_1) 259 | x= Dense(6, activation="sigmoid",kernel_regularizer=regularizers.l2(1e-8))(x_att_1) 260 | 261 | 262 | 263 | 264 | model = Model(inputs=main_input, outputs=x) 265 | nadam=Nadam(lr=0.00125, beta_1=0.9, beta_2=0.999, epsilon=None, schedule_decay=0.0035) 266 | model.compile(loss='binary_crossentropy', 267 | optimizer=nadam, 268 | metrics=['accuracy',f1_score,auc]) 269 | print(model.summary()) 270 | return model 271 | 272 | 273 | batch_size = 640 274 | 275 | #total average roc_auc: 0.9875268030202132 276 | -------------------------------------------------------------------------------- /toxic_comment/model/README.md: -------------------------------------------------------------------------------- 1 | 2 | ## word embedding: 3 | FastText: 4 | 1.wiki.en.bin 5 | 2.crawl-300d-2M.vec 6 | Glove: 7 | 1.glove.840B.300D.txt 8 | 2.glove.twitter.27B.200D.txt 9 | 10 | 11 | ## lb score(mean columnwise AUC): 12 | #### wiki.en.bin single gru (late submit): 13 | public:0.9860, private:0.9847 14 | #### wiki.en.bin and char word2vec single gru: 15 | public:0.9859, private:0.9846 16 | #### glove.840B.300D and char word2vec single gru: 17 | public: 0.9855, private:0.9847 18 | #### glove.twitter.27B.200D LSTM attention and skip connected channel: 19 | public:0.9852, private:0.9845 20 | #### crawl-300d-2M.vec DPCNN: 21 | public:0.9847, private:0.9827 22 | #### crawl-300d-2M.vec Capsule: 23 | public:0.9847, private:0.9841 24 | 25 | #### final ensemble: 26 | stacking with 7 models(one from teamate John Miller, lgbm lb public score: 0.9820) 27 | 28 | ensemble step: 29 | 1.average high correlated model(mean correlate of each corresponded column bigger than 0.98) : 30 | * 0.5 * DPCNN+ 0.5 * Capsule, and get the new out of fold (new out of fold--no.1) 31 | * 0.5 * Single gru+ 0.5 * Wiki.en.bin and char word2vec single gru, and get the new out of fold (new out of fold--no.2) 32 | 33 | 2.stacking 5 out of fold: 34 | * new out of fold--no.1 35 | * new out of fold--no.2 36 | * LSTM attention 37 | * glove and char single gru 38 | * lgbm (from teamate John Miller) 39 | -------------------------------------------------------------------------------- /toxic_comment/model/Single_GRU_glove_char.py: -------------------------------------------------------------------------------- 1 | # This Python 3 environment comes with many helpful analytics libraries installed 2 | #Should have the same package as kernel, implement and modified by Eric 3 | 4 | 5 | 6 | from __future__ import absolute_import, division 7 | import sys, os, re, csv, codecs, numpy as np, pandas as pd 8 | 9 | 10 | from keras.preprocessing.sequence import pad_sequences 11 | from keras.layers import Dense, Input, LSTM, Embedding,Dropout,Activation,GRU,Conv1D,CuDNNGRU,CuDNNLSTM 12 | from keras.layers import SpatialDropout1D,MaxPool1D,GlobalAveragePooling1D,RepeatVector,Add,PReLU 13 | from keras.layers import Bidirectional, GlobalMaxPool1D,BatchNormalization,concatenate,TimeDistributed,Merge,Flatten 14 | from keras.models import Model 15 | from keras import initializers, regularizers, constraints, optimizers, layers 16 | from keras.optimizers import Adam,SGD,Nadam 17 | from keras.callbacks import EarlyStopping, ModelCheckpoint 18 | from keras.layers.core import Layer 19 | from keras import initializers, regularizers, 
constraints 20 | from keras import backend as K 21 | 22 | 23 | embed_size = 300 24 | max_features = 160000 25 | maxlen=180 26 | 27 | 28 | #============= 29 | 30 | #.....preprocessing like wiki char.... 31 | 32 | #============= 33 | 34 | 35 | 36 | class Attention(Layer): 37 | def __init__(self, 38 | W_regularizer=None, b_regularizer=None, 39 | W_constraint=None, b_constraint=None, 40 | bias=True, **kwargs): 41 | """ 42 | Keras Layer that implements an Attention mechanism for temporal data. 43 | Supports Masking. 44 | Follows the work of Raffel et al. [https://arxiv.org/abs/1512.08756] 45 | # Input shape 46 | 3D tensor with shape: `(samples, steps, features)`. 47 | # Output shape 48 | 2D tensor with shape: `(samples, features)`. 49 | :param kwargs: 50 | Just put it on top of an RNN Layer (GRU/LSTM/SimpleRNN) with return_sequences=True. 51 | The dimensions are inferred based on the output shape of the RNN. 52 | Note: The layer has been tested with Keras 2.0.6 53 | Example: 54 | model.add(LSTM(64, return_sequences=True)) 55 | model.add(Attention()) 56 | # next add a Dense layer (for classification/regression) or whatever... 57 | """ 58 | self.supports_masking = True 59 | self.init = initializers.get('glorot_uniform') 60 | 61 | self.W_regularizer = regularizers.get(W_regularizer) 62 | self.b_regularizer = regularizers.get(b_regularizer) 63 | 64 | self.W_constraint = constraints.get(W_constraint) 65 | self.b_constraint = constraints.get(b_constraint) 66 | 67 | self.bias = bias 68 | super(Attention, self).__init__(**kwargs) 69 | 70 | def build(self, input_shape): 71 | assert len(input_shape) == 3 72 | 73 | self.W = self.add_weight((input_shape[-1],), 74 | initializer=self.init, 75 | name='{}_W'.format(self.name), 76 | regularizer=self.W_regularizer, 77 | constraint=self.W_constraint) 78 | if self.bias: 79 | self.b = self.add_weight((input_shape[1],), 80 | initializer='zero', 81 | name='{}_b'.format(self.name), 82 | regularizer=self.b_regularizer, 83 | constraint=self.b_constraint) 84 | else: 85 | self.b = None 86 | 87 | self.built = True 88 | 89 | def compute_mask(self, input, input_mask=None): 90 | # do not pass the mask to the next layers 91 | return None 92 | 93 | def call(self, x, mask=None): 94 | eij = dot_product(x, self.W) 95 | 96 | if self.bias: 97 | eij += self.b 98 | 99 | eij = K.tanh(eij) 100 | 101 | a = K.exp(eij) 102 | 103 | # apply mask after the exp. will be re-normalized next 104 | if mask is not None: 105 | # Cast the mask to floatX to avoid float64 upcasting in theano 106 | a *= K.cast(mask, K.floatx()) 107 | 108 | # in some cases especially in the early stages of training the sum may be almost zero 109 | # and this results in NaN's. A workaround is to add a very small positive number [ ] to the sum. 
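        # here K.epsilon() is used as that small positive constant; the original
        # division without it is kept commented out below for reference.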
110 | # a /= K.cast(K.sum(a, axis=1, keepdims=True), K.floatx()) 111 | a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx()) 112 | 113 | a = K.expand_dims(a) 114 | weighted_input = x * a 115 | return K.sum(weighted_input, axis=1) 116 | 117 | def compute_output_shape(self, input_shape): 118 | return input_shape[0], input_shape[-1] 119 | 120 | 121 | 122 | 123 | 124 | import tensorflow as tf 125 | 126 | gpu_options = tf.GPUOptions(allow_growth=True) 127 | sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) 128 | 129 | #set up keras session 130 | 131 | tf.keras.backend.set_session(sess) 132 | 133 | from tensorflow.python.client import device_lib 134 | print(device_lib.list_local_devices()) 135 | 136 | 137 | from numpy.random import seed 138 | seed(1) 139 | 140 | 141 | 142 | 143 | def bigru_pool_attention(attention=False): 144 | main_input=Input(shape=(maxlen,),name='main_input')#, name='main_input' 145 | Ngram_input= Input(shape=(maxlen_char,), name='aux_input')#, name='aux_input' 146 | embedded_sequences= Embedding(max_features, embed_size,weights=[embedding_matrix],trainable=False)(main_input) 147 | embedded_sequences_2= Embedding(weights_char.shape[0], 50,weights=[weights_char],trainable=True)(Ngram_input) 148 | 149 | #word level 150 | hidden_dim=128 151 | x=SpatialDropout1D(0.22)(embedded_sequences) #0.1 152 | x_gru_1 = Bidirectional(CuDNNGRU(hidden_dim,recurrent_regularizer=regularizers.l2(1e-6),return_sequences=True))(x) 153 | 154 | #char level 155 | hidden_dim=30 156 | x_2=SpatialDropout1D(0.21)(embedded_sequences_2) #0.1 157 | x_gru_2 = Bidirectional(CuDNNGRU(hidden_dim,recurrent_regularizer=regularizers.l2(1e-8),return_sequences=True))(x_2) 158 | 159 | 160 | x_ave_1=GlobalAveragePooling1D()(x_gru_1) 161 | x_ave_2=GlobalAveragePooling1D()(x_gru_2) 162 | x_ave= concatenate([x_ave_1,x_ave_2]) 163 | x_max_1=GlobalMaxPool1D()(x_gru_1) 164 | x_max_2=GlobalMaxPool1D()(x_gru_2) 165 | x_max= concatenate([x_max_1,x_max_2]) 166 | x_dense= concatenate([x_max,x_ave]) 167 | 168 | if attention: #did not use it at the final 169 | x_att_1=Attention()(x_gru_1) 170 | 171 | 172 | x_dense=BatchNormalization()(x_dense) 173 | x_dense= Dropout(0.35)(x_dense) 174 | x_dense = Dense(256, activation="elu")(x_dense) 175 | x_dense = Dropout(0.3)(x_dense) 176 | x_dense = Dense(128, activation="elu")(x_dense) 177 | x = Dropout(0.2)(x_dense) 178 | 179 | if attention: 180 | x=concatenate([x_att_1,x]) 181 | 182 | x= Dense(6, activation="sigmoid",kernel_regularizer=regularizers.l2(1e-8))(x) 183 | 184 | 185 | 186 | 187 | model = Model(inputs=main_input, outputs=x) 188 | nadam=Nadam(lr=0.00225, beta_1=0.9, beta_2=0.999, epsilon=None, schedule_decay=0.00325) 189 | model.compile(loss='binary_crossentropy', 190 | optimizer=nadam, 191 | metrics=['accuracy',f1_score,auc]) 192 | print(model.summary()) 193 | return model 194 | 195 | 196 | batch_size = 1600 #faster for char level embedding model 197 | 198 | #total average roc_auc: 0.988312005324 199 | -------------------------------------------------------------------------------- /toxic_comment/model/Single_GRU_wiki.py: -------------------------------------------------------------------------------- 1 | # This Python 3 environment comes with many helpful analytics libraries installed 2 | #Should have the same package as kernel, implement and modified by Eric 3 | 4 | 5 | 6 | from __future__ import absolute_import, division 7 | import sys, os, re, csv, codecs, numpy as np, pandas as pd 8 | 9 | 10 | from keras.preprocessing.sequence import 
pad_sequences 11 | from keras.layers import Dense, Input, LSTM, Embedding,Dropout,Activation,GRU,Conv1D,CuDNNGRU,CuDNNLSTM 12 | from keras.layers import SpatialDropout1D,MaxPool1D,GlobalAveragePooling1D,RepeatVector,Add,PReLU 13 | from keras.layers import Bidirectional, GlobalMaxPool1D,BatchNormalization,concatenate,TimeDistributed,Merge,Flatten 14 | from keras.models import Model 15 | from keras import initializers, regularizers, constraints, optimizers, layers 16 | from keras.optimizers import Adam,SGD,Nadam 17 | from keras.callbacks import EarlyStopping, ModelCheckpoint 18 | from keras.layers.core import Layer 19 | from keras import initializers, regularizers, constraints 20 | from keras import backend as K 21 | from nltk.stem import SnowballStemmer 22 | stemmer = SnowballStemmer('english') 23 | 24 | embed_size = 300 25 | max_features = 160000 26 | maxlen=180 27 | 28 | 29 | train = pd.read_csv('train.csv') 30 | test = pd.read_csv('test.csv') 31 | merge=pd.concat([train,test]) 32 | df=merge.reset_index(drop=True) 33 | 34 | 35 | 36 | corpus_raw=df.comment_text 37 | 38 | 39 | from commen_preprocess import * 40 | from word_net_lemmatize import * 41 | from criteria import * 42 | 43 | 44 | 45 | import time 46 | 47 | start=time.time() 48 | corpus_pre1= parallelize_dataframe(corpus_raw, multiply_columns_clean) 49 | corpus_lemmatize= parallelize_dataframe(corpus_pre1, multiply_columns_lemmatize_sentence) 50 | pickle.dump(corpus_lemmatize,open("tmp_WordNet_corpus_lemmatize.pkl", "wb")) 51 | corpus_lemmatize=pickle.load(open("tmp_WordNet_corpus_lemmatize.pkl", "rb")) 52 | 53 | end=time.time() 54 | 55 | timeStep=end-start 56 | 57 | print("spend sencond: "+str(timeStep)) 58 | 59 | 60 | 61 | 62 | 63 | df["comment_text"]=corpus_lemmatize 64 | 65 | 66 | 67 | train_cl=df[:train.shape[0]] 68 | test_cl=df[train.shape[0]:] 69 | 70 | print("....start....tokenizer") 71 | 72 | list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"] 73 | y_tr = train_cl[list_classes].values 74 | 75 | 76 | 77 | 78 | list_sentences_train=train_cl.comment_text 79 | list_sentences_test=test_cl.comment_text 80 | 81 | 82 | from numpy import asarray 83 | from numpy import zeros 84 | 85 | 86 | print("....At....Tokenizer") 87 | 88 | 89 | puncuate=r'([\.\!\?\:\,])' 90 | 91 | from keras.preprocessing.text import Tokenizer 92 | tokenizer = Tokenizer(num_words=max_features,oov_token=puncuate) 93 | tokenizer.fit_on_texts(list(list_sentences_train)+list(list_sentences_test)) 94 | 95 | 96 | list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train) 97 | list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test) 98 | 99 | 100 | 101 | 102 | 103 | totalNumWords = [len(one_comment) for one_comment in list_tokenized_train] 104 | print("mean length:"+ str(np.mean(totalNumWords ))) 105 | print("max length:"+ str(max(totalNumWords) ) ) 106 | print("std length:"+ str(np.std(totalNumWords ))) 107 | 108 | 109 | 110 | print("number of different word:"+ str(len(tokenizer.word_index.items()))) 111 | 112 | if len(tokenizer.word_index.items()) < max_features: 113 | max_features=len(tokenizer.word_index.items()) 114 | 115 | from keras.preprocessing import sequence 116 | print('Pad sequences (samples x time)') 117 | 118 | 119 | 120 | 121 | X_tr = pad_sequences(list_tokenized_train, maxlen=maxlen,padding='post') 122 | X_te = pad_sequences(list_tokenized_test, maxlen=maxlen,padding='post') 123 | 124 | 125 | 126 | 127 | print('x_train shape:', X_tr.shape) 128 | print('x_test shape:', X_te.shape) 129 | 
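# Illustration only (not part of the original pipeline): padding='post' appends
# zeros after the tokens instead of before them, e.g.
#   pad_sequences([[5, 3, 8]], maxlen=6, padding='post') -> [[5, 3, 8, 0, 0, 0]]
# and, with the default truncating='pre', comments longer than maxlen keep only
# their last maxlen tokens.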
130 | print("================") 131 | 132 | print(X_tr) 133 | print("================") 134 | print(X_te) 135 | 136 | print("================") 137 | 138 | 139 | 140 | 141 | 142 | 143 | from gensim.models.wrappers import FastText 144 | 145 | print("start...loading...wiki....en") 146 | 147 | model = FastText.load_fasttext_format('wiki.en') 148 | 149 | 150 | nb_words= min(max_features, len(tokenizer.word_index)) 151 | 152 | embedding_matrix = np.zeros((nb_words, embed_size)) 153 | for word, i in tokenizer.word_index.items(): 154 | if i >= nb_words: 155 | continue 156 | if word in model.wv: 157 | embedding_matrix[i] = model[word] 158 | print('Null word embeddings: %d' % np.sum(np.sum(embedding_matrix, axis=1) == 0)) 159 | 160 | 161 | 162 | 163 | import tensorflow as tf 164 | 165 | gpu_options = tf.GPUOptions(allow_growth=True) 166 | sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) 167 | 168 | #set up keras session 169 | 170 | tf.keras.backend.set_session(sess) 171 | 172 | from tensorflow.python.client import device_lib 173 | print(device_lib.list_local_devices()) 174 | 175 | 176 | from numpy.random import seed 177 | seed(1) 178 | 179 | 180 | 181 | def bigru_pool_model(): 182 | main_input=Input(shape=(maxlen,),name='main_input')#, name='main_input' 183 | embedded_sequences= Embedding(max_features, embed_size,weights=[embedding_matrix],trainable=trainable)(main_input) 184 | 185 | hidden_dim=136 186 | x=SpatialDropout1D(0.22)(embedded_sequences) #0.1 187 | x_gru_1 = Bidirectional(CuDNNGRU(hidden_dim,recurrent_regularizer=regularizers.l2(1e-6),return_sequences=True))(x) 188 | x_ave=GlobalAveragePooling1D()(x_gru_1) 189 | x_max=GlobalMaxPool1D()(x_gru_1) 190 | x_dense= concatenate([x_max,x_ave]) 191 | x_dense=BatchNormalization()(x_dense) 192 | x_dense= Dropout(0.35)(x_dense) 193 | x_dense = Dense(256, activation="elu")(x_dense) 194 | x_dense = Dropout(0.3)(x_dense) 195 | x_dense = Dense(128, activation="elu")(x_dense) 196 | x = Dropout(0.2)(x_dense) 197 | x= Dense(6, activation="sigmoid",kernel_regularizer=regularizers.l2(1e-8))(x) 198 | 199 | 200 | 201 | 202 | model = Model(inputs=main_input, outputs=x) 203 | nadam=Nadam(lr=0.00225, beta_1=0.9, beta_2=0.999, epsilon=None, schedule_decay=0.00325) 204 | model.compile(loss='binary_crossentropy', 205 | optimizer=nadam, 206 | metrics=['accuracy',f1_score,auc]) 207 | print(model.summary()) 208 | return model 209 | 210 | 211 | 212 | 213 | batch_size = 640 214 | 215 | #total average roc_auc: 0.9891360615629836 216 | -------------------------------------------------------------------------------- /toxic_comment/model/Single_GRU_wiki_char(including preprocessing).py: -------------------------------------------------------------------------------- 1 | # This Python 3 environment comes with many helpful analytics libraries installed 2 | #Should have the same package as kernel, implement and modified by Eric 3 | 4 | 5 | from __future__ import absolute_import, division 6 | import sys, os, re, csv, codecs, numpy as np, pandas as pd 7 | 8 | 9 | from keras.preprocessing.sequence import pad_sequences 10 | from keras.layers import Dense, Input, LSTM, Embedding,Dropout,Activation,GRU,Conv1D,CuDNNGRU,CuDNNLSTM 11 | from keras.layers import SpatialDropout1D,MaxPool1D,GlobalAveragePooling1D,RepeatVector,Add,PReLU 12 | from keras.layers import Bidirectional, GlobalMaxPool1D,BatchNormalization,concatenate,TimeDistributed,Merge,Flatten 13 | from keras.models import Model 14 | from keras import initializers, regularizers, constraints, optimizers, layers 
15 | from keras.optimizers import Adam,SGD,Nadam 16 | from keras.callbacks import EarlyStopping, ModelCheckpoint 17 | from keras.layers.core import Layer 18 | from keras import initializers, regularizers, constraints 19 | from keras import backend as K 20 | from criteria import * 21 | 22 | word_embed_size = 300 23 | char_embed_size=50 24 | max_features = 160000 25 | maxlen=180 26 | 27 | 28 | 29 | train = pd.read_csv('train.csv') 30 | test = pd.read_csv('test.csv') 31 | 32 | 33 | merge=pd.concat([train,test]) 34 | df=merge.reset_index(drop=True) 35 | corpus_raw=df.comment_text 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | APPO = { 45 | "aren't" : "are not", 46 | "can't" : "cannot", 47 | "couldn't" : "could not", 48 | "didn't" : "did not", 49 | "doesn't" : "does not", 50 | "don't" : "do not", 51 | "hadn't" : "had not", 52 | "hasn't" : "has not", 53 | "haven't" : "have not", 54 | "he'd" : "he would", 55 | "he'll" : "he will", 56 | "he's" : "he is", 57 | "i'd" : "I would", 58 | "i'd" : "I had", 59 | "i'll" : "I will", 60 | "i'm" : "I am", 61 | "isn't" : "is not", 62 | "it's" : "it is", 63 | "it'll":"it will", 64 | "i've" : "I have", 65 | "let's" : "let us", 66 | "mightn't" : "might not", 67 | "mustn't" : "must not", 68 | "shan't" : "shall not", 69 | "she'd" : "she would", 70 | "she'll" : "she will", 71 | "she's" : "she is", 72 | "shouldn't" : "should not", 73 | "that's" : "that is", 74 | "there's" : "there is", 75 | "they'd" : "they would", 76 | "they'll" : "they will", 77 | "they're" : "they are", 78 | "they've" : "they have", 79 | "we'd" : "we would", 80 | "we're" : "we are", 81 | "weren't" : "were not", 82 | "we've" : "we have", 83 | "what'll" : "what will", 84 | "what're" : "what are", 85 | "what's" : "what is", 86 | "what've" : "what have", 87 | "where's" : "where is", 88 | "who'd" : "who would", 89 | "who'll" : "who will", 90 | "who're" : "who are", 91 | "who's" : "who is", 92 | "who've" : "who have", 93 | "won't" : "will not", 94 | "wouldn't" : "would not", 95 | "you'd" : "you would", 96 | "you'll" : "you will", 97 | "you're" : "you are", 98 | "you've" : "you have", 99 | "'re": " are", 100 | "wasn't": "was not", 101 | "we'll":" will", 102 | "didn't": "did not", 103 | "tryin'":"trying" 104 | } 105 | 106 | 107 | 108 | repl = { 109 | "<3": " good ", 110 | ":d": " good ", 111 | ":dd": " good ", 112 | ":p": " good ", 113 | "8)": " good ", 114 | ":-)": " good ", 115 | ":)": " good ", 116 | ";)": " good ", 117 | "(-:": " good ", 118 | "(:": " good ", 119 | "yay!": " good ", 120 | "yay": " good ", 121 | "yaay": " good ", 122 | "yaaay": " good ", 123 | "yaaaay": " good ", 124 | "yaaaaay": " good ", 125 | ":/": " bad ", 126 | ":>": " sad ", 127 | ":')": " sad ", 128 | ":-(": " bad ", 129 | ":(": " bad ", 130 | ":s": " bad ", 131 | ":-s": " bad ", 132 | "<3": " heart ", 133 | ":d": " smile ", 134 | ":p": " smile ", 135 | ":dd": " smile ", 136 | "8)": " smile ", 137 | ":-)": " smile ", 138 | ":)": " smile ", 139 | ";)": " smile ", 140 | "(-:": " smile ", 141 | "(:": " smile ", 142 | ":/": " worry ", 143 | ":>": " angry ", 144 | ":')": " sad ", 145 | ":-(": " sad ", 146 | ":(": " sad ", 147 | ":s": " sad ", 148 | ":-s": " sad ", 149 | r"\br\b": "are", 150 | r"\bu\b": "you", 151 | r"\bhaha\b": "ha", 152 | r"\bhahaha\b": "ha"} 153 | 154 | 155 | 156 | bad_wordBank={ 157 | 'fage':"shove your balls up your own ass or the ass of another to stretch your scrotum skin", 158 | } 159 | 160 | 161 | 162 | 163 | 164 | print("....start....cleaning") 165 | 166 | from nltk.stem.wordnet import WordNetLemmatizer 167 | 
lem = WordNetLemmatizer() 168 | 169 | stop_words = ['the','a','an','and','but','if','or','because','as','what','which','this','that','these','those','then', 170 | 'just','so','than','such','both','through','about','for','is','of','while','during','to','What','Which', 171 | 'Is','If','While','This'] 172 | 173 | 174 | from nltk.tokenize import TweetTokenizer 175 | 176 | 177 | tokenizer=TweetTokenizer() 178 | 179 | 180 | re_tok = re.compile(r'([1234567890!@#$%^&*_+-=,./<>?;:"[][}]"\'\\|�鎿�𤲞阬威鄞捍朝溘甄蝓壇螞¯岑�''\t])') 181 | 182 | 183 | 184 | 185 | 186 | df['count_sent']=df["comment_text"].apply(lambda x: len(re.findall("\n",str(x)))+1) 187 | df['count_word']=df["comment_text"].apply(lambda x: len(str(x).split())) 188 | 189 | df['avg_sent_length']=df['count_word']/df['count_sent'] 190 | print(df['count_sent'].describe()) 191 | print(df['count_word'].describe()) 192 | print(df['avg_sent_length'].describe()) 193 | 194 | 195 | 196 | 197 | 198 | 199 | 200 | from nltk.tokenize import TweetTokenizer 201 | 202 | 203 | tokenizer=TweetTokenizer() 204 | 205 | def clean(comment): 206 | """ 207 | This function receives comments and returns clean word-list 208 | """ 209 | #Convert to lower case , so that Hi and hi are the same 210 | comment=comment.lower() 211 | #remove \n 212 | comment=re.sub(r"\n",".",comment) 213 | comment=re.sub(r"\\n\n",".",comment) 214 | comment=re.sub(r"fucksex","fuck sex",comment) 215 | comment=re.sub(r"f u c k","fuck",comment) 216 | comment=re.sub(r"幹","fuck",comment) 217 | #text = re.sub("www.* ", "", text) 218 | comment=re.sub(r"死","die",comment) 219 | comment=re.sub(r"他妈的","fuck",comment) 220 | comment=re.sub(r"去你妈的","fuck off",comment) 221 | comment=re.sub(r"肏你妈","fuck your mother",comment) 222 | comment=re.sub(r"肏你祖宗十八代","your ancestors to the 18th generation",comment) 223 | # remove leaky elements like ip,user 224 | comment=re.sub("\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}","",comment) 225 | #removing usernames 226 | comment=re.sub("\[\[.*\]","",comment) 227 | comment = re.sub(r"you ' re", "you are", comment) 228 | comment = re.sub(r"wtf","what the fuck", comment) 229 | comment = re.sub(r"i ' m", "I am", comment) 230 | comment = re.sub(r"I", "one", comment) 231 | comment = re.sub(r"II", "two", comment) 232 | comment = re.sub(r"III", "three", comment) 233 | comment = re.sub(r'牛', "cow", comment) 234 | comment=re.sub(r"mothjer","mother",comment) 235 | comment=re.sub(r"g e t r i d o f a l l i d i d p l e a s e j a ck a s s", 236 | "get rid of all i did please jackass",comment) 237 | comment=re.sub(r"nazi","nazy",comment) 238 | comment=re.sub(r"withought","with out",comment) 239 | s=comment 240 | 241 | s = s.replace('&', ' and ') 242 | s = s.replace('@', ' at ') 243 | s = s.replace('0', 'zero') 244 | s = s.replace('1', 'one') 245 | s = s.replace('2', 'two') 246 | s = s.replace('3', 'three') 247 | s = s.replace('4', 'four') 248 | s = s.replace('5', 'five') 249 | s = s.replace('6', 'six') 250 | s = s.replace('7', 'seven') 251 | s = s.replace('8', 'eight') 252 | s = s.replace('9', 'night') 253 | s = s.replace('雲水','') 254 | 255 | comment=s 256 | comment = re_tok.sub(' ', comment) 257 | 258 | words=tokenizer.tokenize(comment) 259 | 260 | 261 | words=[APPO[word] if word in APPO else word for word in words] 262 | words=[bad_wordBank[word] if word in bad_wordBank else word for word in words] 263 | words=[repl[word] if word in repl else word for word in words] 264 | words = [w for w in words if not w in stop_words] 265 | 266 | 267 | 268 | sent=" ".join(words) 269 | sent = 
re.sub(r'([\'\"\/\-\_\--\_])',' ', sent) 270 | # Remove some special characters 271 | clean_sent= re.sub(r'([\;\|•«\n])',' ', sent) 272 | 273 | return(clean_sent) 274 | 275 | 276 | 277 | 278 | 279 | 280 | 281 | from nltk.corpus import wordnet 282 | from nltk import word_tokenize, pos_tag 283 | from nltk.stem import WordNetLemmatizer 284 | 285 | 286 | def get_wordnet_pos(treebank_tag): 287 | if treebank_tag.startswith('V'): 288 | return wordnet.VERB 289 | elif treebank_tag.startswith('J'): 290 | return wordnet.ADJ 291 | elif treebank_tag.startswith('N'): 292 | return wordnet.NOUN 293 | elif treebank_tag.startswith('R'): 294 | return wordnet.ADV 295 | else: 296 | return None 297 | 298 | 299 | 300 | def lemmatize_all(sentence): 301 | wnl = WordNetLemmatizer() 302 | for word, tag in pos_tag(word_tokenize(sentence)): 303 | if tag.startswith("NN"): 304 | yield wnl.lemmatize(word, pos='n') 305 | elif tag.startswith('VB'): 306 | yield wnl.lemmatize(word, pos='v') 307 | elif tag.startswith('JJ'): 308 | yield wnl.lemmatize(word, pos='a') 309 | elif tag.startswith('R'): 310 | yield wnl.lemmatize(word, pos='r') 311 | else: 312 | yield word 313 | 314 | 315 | 316 | def lemmatize_sentence(sentence): 317 | res = [] 318 | lemmatizer = WordNetLemmatizer() 319 | for word, pos in pos_tag(word_tokenize(sentence)): 320 | wordnet_pos = get_wordnet_pos(pos) or wordnet.NOUN 321 | res.append(lemmatizer.lemmatize(word, pos=wordnet_pos)) 322 | res=" ".join(res) 323 | 324 | return res 325 | 326 | 327 | import pandas as pd 328 | import numpy as np 329 | 330 | from multiprocessing import Pool 331 | 332 | num_partitions = 8 #number of partitions to split dataframe 333 | num_cores = 4 #number of cores on your machine 334 | 335 | def parallelize_dataframe(df, func): 336 | df_split = np.array_split(df, num_partitions) 337 | pool = Pool(num_cores) 338 | df = pd.concat(pool.map(func, df_split)) 339 | pool.close() 340 | pool.join() 341 | return df 342 | 343 | def multiply_columns_clean(data): 344 | data = data.apply(lambda x: clean(x)) 345 | return data 346 | 347 | def multiply_columns_lemmatize_sentence(data): 348 | data=data.apply(lambda x:lemmatize_sentence(x)) 349 | return data 350 | 351 | 352 | 353 | 354 | def sent_len(x): 355 | doc=str(x).split("\n") 356 | count_one=0 357 | summation=0 358 | for word in doc: 359 | summation+=len(word) 360 | return summation/len(doc) 361 | 362 | 363 | 364 | 365 | 366 | import time 367 | 368 | start=time.time() 369 | corpus= parallelize_dataframe(corpus_raw, multiply_columns_clean) 370 | corpus= parallelize_dataframe(corpus, multiply_columns_lemmatize_sentence) 371 | 372 | print("dump 1") 373 | 374 | 375 | 376 | end=time.time() 377 | 378 | timeStep=end-start 379 | 380 | print("spend sencond: "+str(timeStep)) 381 | 382 | import pickle 383 | 384 | 385 | 386 | 387 | 388 | 389 | df["comment_text"]=corpus 390 | 391 | 392 | 393 | print(df["comment_text"]) 394 | print(df["comment_text"].isnull().sum()) 395 | 396 | 397 | 398 | 399 | print("....set..indirect..feature") 400 | 401 | 402 | 403 | print("set ngram feature") 404 | 405 | 406 | 407 | 408 | train_cl=df[:train.shape[0]] 409 | test_cl=df[train.shape[0]:] 410 | 411 | 412 | 413 | df['count_sent']=df["comment_text"].apply(lambda x: len(re.findall(" ",str(x)))+1) 414 | df['count_word']=df["comment_text"].apply(lambda x: len(str(x).split())) 415 | 416 | df['avg_sent_length']=df['count_word']/df['count_sent'] 417 | print(df['count_sent'].describe()) 418 | print(df['count_word'].describe()) 419 | print(df['avg_sent_length'].describe()) 420 | 421 | 
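# Illustration only (example values, not from the original script): the
# char_ngram helper defined in the next section joins overlapping character
# n-grams with the "-:-" separator, so for bigrams the intended output is
#   char_ngram("toxic", ngram=2) -> "to-:-ox-:-xi-:-ic"
# char2seq then maps each such n-gram to its index in the learned char vocabulary.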
422 | 423 | 424 | 425 | 426 | #===============char preprocessing================ 427 | 428 | 429 | def char_ngram(word,ngram=2): 430 | char_ngram_list=[word.decode('utf-8')[i:i+ngram] for i in range(len(word)-ngram+1)] 431 | char_ngram_sent=u"-:-".join(char_ngram_list) 432 | return char_ngram_sent 433 | 434 | 435 | 436 | 437 | def multiply_columns_char_ngram(data): 438 | data2 = data.apply(lambda x: char_ngram(str(x),ngram=4)) 439 | return data2 440 | 441 | 442 | 443 | #using same data as word embedding or the performance may be worse!? 444 | 445 | 446 | corpus_gram=df["comment_text"] 447 | corpus_gram=parallelize_dataframe(corpus, multiply_columns_char_ngram) 448 | 449 | df['count_sent']=df["comment_text"].apply(lambda x: len(re.findall(" ",str(x)))+1) 450 | df['count_word']=df["comment_text"].apply(lambda x: len(str(x).split())) 451 | 452 | df['avg_sent_length']=df['count_word']/df['count_sent'] 453 | print(df['count_sent'].describe()) 454 | print(df['count_word'].describe()) 455 | print(df['avg_sent_length'].describe()) 456 | 457 | 458 | 459 | from collections import Counter 460 | 461 | # part from Dieter 462 | def create_char_vocabulary_ngram(texts_arr,ngram_in=3,min_count_chars=50): 463 | idx=0 464 | for article in texts_arr: 465 | texts_arr[idx]=article.lower() 466 | idx+=1 467 | char_dict = {} 468 | for k, text in enumerate(texts_arr): 469 | ngram_text=char_ngram(text,ngram=ngram_in) 470 | list_char=ngram_text.split("-:-") 471 | for char in list_char: 472 | if char not in char_dict: 473 | char_dict[char]=1 474 | else: 475 | char_dict[char]=char_dict[char]+1 476 | raw_counts_char = list(char_dict) 477 | #print(raw_counts) 478 | print("{}-gram".format(ngram)) 479 | print('%s characters found' %len(raw_counts_char)) 480 | print('keepin characters with count >= %s' % min_count_chars) 481 | vocab = [ char for char in raw_counts_char if char_dict[char] >= min_count_chars] 482 | char2index = {char:(ind+1) for ind, char in enumerate(vocab)} 483 | for token in UNKNOWN_CHAR: 484 | char2index[token] = 0 485 | print(token) 486 | char2index[PAD_CHAR] = -1 487 | index2char = {ind:char for char, ind in char2index.items()} 488 | print('%s remaining characters' % len(char2index)) 489 | return char2index, index2char 490 | 491 | def char2seq(texts, maxlen): 492 | res = np.zeros((len(texts),maxlen)) 493 | for k,text in enumerate(texts): 494 | seq = np.zeros((len(text))) #equals padding with PAD_CHAR 495 | for l, char in enumerate(text): 496 | try: 497 | id = char2index[char] 498 | seq[l] = id 499 | except KeyError: 500 | seq[l] = char2index[UNKNOWN_CHAR] #if it is error, please replace it by 0 501 | seq = seq[:maxlen] 502 | res[k][:len(seq)] = seq 503 | return res 504 | 505 | 506 | UNKNOWN_CHAR = 'ⓤ' 507 | PAD_CHAR = '℗' 508 | 509 | 510 | sentences_train=corpus_gram.iloc[:train.shape[0]] 511 | sentences_test=corpus_gram.iloc[train.shape[0]:] 512 | 513 | 514 | 515 | totalNumWords = [len(one_comment) for one_comment in sentences_train] 516 | print("X_tr_2 mean length:"+ str(np.mean(totalNumWords ))) 517 | print("X_tr_2 max length:"+ str(max(totalNumWords) ) ) 518 | print("X_tr_2 std length:"+ str(np.std(totalNumWords ))) 519 | 520 | totalNumWords = [len(one_comment) for one_comment in sentences_test] 521 | print("X_te_2 mean length:"+ str(np.mean(totalNumWords ))) 522 | print("X_te_2 max length:"+ str(max(totalNumWords) ) ) 523 | print("X_te_2 std length:"+ str(np.std(totalNumWords ))) 524 | 525 | maxlen_char=720 #540 526 | 527 | 528 | sentences_train_bi=char_ngram(sentences_train,ngram=2) 529 | 
sentences_test_bi=char_ngram(sentences_test,ngram=2) 530 | X_tr_2 = char2seq(sentences_train_bi,maxlen_char) 531 | X_te_2 = char2seq(sentences_test_bi,maxlen_char) 532 | 533 | 534 | 535 | #tricky way to get the preprocessed data for training word2vec model 536 | train["comment_text"]=train_cl["comment_text"].iloc[:train.shape[0]] 537 | 538 | char_toxic=train.loc[train["clean"]==0,"char"] 539 | 540 | 541 | #===========char embedding training and get the embedding matrix================ 542 | 543 | 544 | char2index, index2char = create_char_vocabulary_ngram(corpus_gram.values(),min_count_chars=20) 545 | 546 | 547 | list_container=[] 548 | for article in char_toxic.values(): 549 | char_ngram_article=char_ngram(article,ngram=3) 550 | list_container.append(char_ngram_article) 551 | 552 | 553 | count=0 554 | for article in list_container: 555 | list_container[count]=article.split("-:-") 556 | count+=1 557 | 558 | 559 | from gensim.models import Word2Vec 560 | 561 | model= Word2Vec(sentences=list_container, size=char_embed_size, window=100, min_count=20, workers=2000, sg=0) 562 | 563 | 564 | model.save('mymodel_toxic') 565 | 566 | 567 | weights_char=model.wv.syn0 568 | 569 | np.save(open("self_train_weight_toxic.npz", 'wb'), weights_char) 570 | 571 | 572 | #weights_char=np.load(open("self_train_weight_toxic.npz", 'rb')) 573 | 574 | 575 | nb_char=len(char2index) 576 | char_embedding_matrix = np.zeros((nb_char, char_embed_size)) 577 | for word in char2index: 578 | idx=char2index[word] 579 | if word in model.wv: 580 | char_embedding_matrix[idx] = model[word] 581 | print('Null char embeddings: %d' % np.sum(np.sum(char_embedding_matrix, axis=1) == 0)) 582 | 583 | 584 | 585 | 586 | 587 | #===============tokenize================ 588 | 589 | print("....start....tokenizer") 590 | 591 | list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"] 592 | y_tr = train_cl[list_classes].values 593 | 594 | 595 | 596 | 597 | list_sentences_train=train_cl.comment_text 598 | list_sentences_test=test_cl.comment_text 599 | 600 | 601 | 602 | print("....start....pretrain") 603 | 604 | from numpy import asarray 605 | from numpy import zeros 606 | 607 | 608 | 609 | 610 | print("....At....Tokenizer") 611 | 612 | 613 | puncuate=r'([\.\!\?\:\,])' 614 | 615 | from keras.preprocessing.text import Tokenizer 616 | tokenizer = Tokenizer(num_words=max_features,oov_token=puncuate) 617 | tokenizer.fit_on_texts(list(list_sentences_train)+list(list_sentences_test)) 618 | 619 | 620 | 621 | list_tokenized_train = tokenizer.texts_to_sequences(list_sentences_train) 622 | list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test) 623 | 624 | 625 | 626 | 627 | 628 | 629 | 630 | totalNumWords = [len(one_comment) for one_comment in list_tokenized_train] 631 | print("mean length:"+ str(np.mean(totalNumWords ))) 632 | print("max length:"+ str(max(totalNumWords) ) ) 633 | print("std length:"+ str(np.std(totalNumWords ))) 634 | 635 | 636 | 637 | print(" maxlen is:"+str(maxlen)) 638 | 639 | print("number of different word:"+ str(len(tokenizer.word_index.items()))) 640 | 641 | if len(tokenizer.word_index.items()) < max_features: 642 | max_features=len(tokenizer.word_index.items()) 643 | 644 | 645 | from keras.preprocessing import sequence 646 | print('Pad sequences (samples x time)') 647 | 648 | 649 | maxlen=180 650 | X_tr = pad_sequences(list_tokenized_train, maxlen=maxlen,padding='post') 651 | X_te = pad_sequences(list_tokenized_test, maxlen=maxlen,padding='post') 652 | 653 | 654 | 655 | print('x_train 
shape:', X_tr.shape) 656 | print('x_test shape:', X_te.shape) 657 | 658 | print("================") 659 | 660 | print(X_tr) 661 | print("================") 662 | print(X_te) 663 | 664 | print("================") 665 | 666 | #print(X_tr_Ngram) 667 | #print(X_te_Ngram) 668 | 669 | 670 | from bs4 import BeautifulSoup 671 | 672 | 673 | data=pd.concat([list_sentences_train,list_sentences_test]) 674 | 675 | 676 | 677 | 678 | X_tr_1=X_tr 679 | X_te_1=X_te 680 | 681 | print('x_train_1 new shape:', X_tr_1.shape) 682 | print('x_test_1 new shape:', X_te_1.shape) 683 | 684 | 685 | 686 | 687 | X_tr_2=X_tr_Ngram 688 | X_te_2=X_te_Ngram 689 | 690 | print('x_train_2 new shape:', X_tr_2.shape) 691 | print('x_test_2 new shape:', X_te_2.shape) 692 | 693 | 694 | #=========start to load word embedding=========== 695 | 696 | 697 | from gensim.models.wrappers import FastText 698 | 699 | print("start...loading...wiki....en") 700 | 701 | model = FastText.load_fasttext_format('wiki.en') 702 | 703 | nb_words= min(max_features, len(tokenizer.word_index)) 704 | 705 | word_embedding_matrix = np.zeros((nb_words, word_embed_size)) 706 | for word, i in tokenizer.word_index.items(): 707 | if i >= nb_words: 708 | continue 709 | if word in model.wv: 710 | word_embedding_matrix[i] = model[word] 711 | print('Null word embeddings: %d' % np.sum(np.sum(word_embedding_matrix, axis=1) == 0)) 712 | 713 | 714 | 715 | import tensorflow as tf 716 | 717 | gpu_options = tf.GPUOptions(allow_growth=True) 718 | sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options)) 719 | 720 | #set up keras session 721 | 722 | tf.keras.backend.set_session(sess) 723 | 724 | from tensorflow.python.client import device_lib 725 | print(device_lib.list_local_devices()) 726 | 727 | 728 | from numpy.random import seed 729 | seed(1) 730 | 731 | 732 | 733 | def bigru_pool_model_multi_input(hidden_dim_1=136,hidden_dim_2=50): 734 | main_input=Input(shape=(maxlen,),name='main_input')#, name='main_input' 735 | Ngram_input= Input(shape=(maxlen_char,), name='aux_input')#, name='aux_input' 736 | embedded_sequences= Embedding(max_features, word_embed_size,weights=[word_embedding_matrix],trainable=False)(main_input) 737 | embedded_sequences_2= Embedding(nb_char, char_embed_size,weights=[char_embedding_matrix],trainable=True)(Ngram_input) 738 | 739 | #word level 740 | x=SpatialDropout1D(0.22)(embedded_sequences) #0.1 741 | x_gru_1 = Bidirectional(CuDNNGRU(hidden_dim_1,recurrent_regularizer=regularizers.l2(1e-6),return_sequences=True))(x) 742 | 743 | #char level 744 | x_2=SpatialDropout1D(0.21)(embedded_sequences_2) #0.1 745 | x_gru_2 = Bidirectional(CuDNNGRU(hidden_dim_2,recurrent_regularizer=regularizers.l2(1e-8),return_sequences=True))(x_2) 746 | 747 | 748 | x_ave_1=GlobalAveragePooling1D()(x_gru_1) 749 | x_ave_2=GlobalAveragePooling1D()(x_gru_2) 750 | x_ave= concatenate([x_ave_1,x_ave_2]) 751 | x_max_1=GlobalMaxPool1D()(x_gru_1) 752 | x_max_2=GlobalMaxPool1D()(x_gru_2) 753 | x_max= concatenate([x_max_1,x_max_2]) 754 | x_dense= concatenate([x_max,x_ave]) 755 | x_dense=BatchNormalization()(x_dense) 756 | x_dense= Dropout(0.35)(x_dense) 757 | x_dense = Dense(256, activation="elu")(x_dense) 758 | x_dense = Dropout(0.3)(x_dense) 759 | x_dense = Dense(128, activation="elu")(x_dense) 760 | x = Dropout(0.2)(x_dense) 761 | x= Dense(6, activation="sigmoid",kernel_regularizer=regularizers.l2(1e-8))(x) 762 | 763 | model = Model(inputs=[main_input,Ngram_input], outputs=x) 764 | nadam=Nadam(lr=0.00262, beta_1=0.9, beta_2=0.999, epsilon=None, schedule_decay=0.00325) 765 | 
model.compile(loss='binary_crossentropy', 766 | optimizer=nadam, 767 | metrics=['accuracy',f1_score,auc]) 768 | print(model.summary()) 769 | return model 770 | 771 | 772 | batch_size = 1280 #faster for char level embedding model 773 | 774 | #total average roc_auc: 0.9888378030202132 775 | 776 | 777 | -------------------------------------------------------------------------------- /toxic_comment/model/criteria.py: -------------------------------------------------------------------------------- 1 | tf.keras.backend.set_session(sess) 2 | 3 | from tensorflow.python.client import device_lib 4 | print(device_lib.list_local_devices()) 5 | 6 | 7 | from numpy.random import seed 8 | seed(1) 9 | 10 | 11 | import keras.backend as K 12 | 13 | def f1_score(y_true, y_pred): 14 | 15 | # Count positive samples. 16 | c1 = K.sum(K.round(K.clip(y_true * y_pred, 0, 1))) 17 | c2 = K.sum(K.round(K.clip(y_pred, 0, 1))) 18 | c3 = K.sum(K.round(K.clip(y_true, 0, 1))) 19 | 20 | # If there are no true samples, fix the F1 score at 0. 21 | if c3 == 0: 22 | return 0 23 | 24 | # How many selected items are relevant? 25 | precision = c1 / c2 26 | 27 | # How many relevant items are selected? 28 | recall = c1 / c3 29 | 30 | # Calculate f1_score 31 | f1_score = 2 * (precision * recall) / (precision + recall) 32 | return f1_score 33 | 34 | 35 | def binary_PFA(y_true, y_pred, threshold=K.variable(value=0.5)): 36 | y_pred = K.cast(y_pred >= threshold, 'float32') 37 | # N = total number of negative labels 38 | N = K.sum(1 - y_true) 39 | # FP = total number of false alerts, alerts from the negative class labels 40 | FP = K.sum(y_pred - y_pred * y_true) 41 | return FP/N 42 | #----------------------------------------------------------------------------------------------------------------------------------------------------- 43 | # P_TA prob true alerts for binary classifier 44 | """ 45 | The threshold here is simplify to 0.5, but in reality the threshold will moving, 46 | and here just for checking by very intuitive way, 47 | so finally it "has" to be checked by sklearn package!! 
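    In other words, the auc() metric below is only a coarse numerical approximation:
    it evaluates binary_PTA / binary_PFA over 1000 evenly spaced thresholds and sums
    the resulting bins, while the reported out-of-fold scores are computed with
    sklearn.metrics.roc_auc_score.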
48 | """ 49 | def binary_PTA(y_true, y_pred, threshold=K.variable(value=0.5)): 50 | y_pred = K.cast(y_pred >= threshold, 'float32') 51 | # P = total number of positive labels 52 | P = K.sum(y_true) 53 | # TP = total number of correct alerts, alerts from the positive class labels 54 | TP = K.sum(y_pred * y_true) 55 | return TP/P 56 | 57 | 58 | def auc(y_true, y_pred): 59 | ptas = tf.stack([binary_PTA(y_true,y_pred,k) for k in np.linspace(0, 1, 1000)],axis=0) 60 | pfas = tf.stack([binary_PFA(y_true,y_pred,k) for k in np.linspace(0, 1, 1000)],axis=0) 61 | pfas = tf.concat([tf.ones((1,)) ,pfas],axis=0) 62 | binSizes = -(pfas[1:]-pfas[:-1]) 63 | s = ptas*binSizes 64 | return K.sum(s, axis=0) 65 | -------------------------------------------------------------------------------- /toxic_comment/model/stacking.py: -------------------------------------------------------------------------------- 1 | 2 | 3 | import pandas as pd 4 | import numpy as np 5 | import xgboost as xgb 6 | from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier, 7 | GradientBoostingClassifier, ExtraTreesClassifier) 8 | from sklearn.svm import SVC 9 | from sklearn.cross_validation import KFold 10 | import xgboost as xgb 11 | 12 | 13 | 14 | 15 | train = pd.read_csv('train.csv') 16 | test = pd.read_csv('test.csv') 17 | 18 | list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"] 19 | y= train[list_classes].values 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | import pickle 28 | #======================= 29 | 30 | #........load out of fold....... 31 | 32 | #======================= 33 | class_names = list(train)[-6:] 34 | multarray = np.array([100000, 10000, 1000, 100, 10, 1]) 35 | y_multi = np.sum(train[class_names].values * multarray, axis=1) 36 | 37 | 38 | 39 | from sklearn.model_selection import StratifiedKFold 40 | splits = 10 41 | skf = StratifiedKFold(n_splits=splits, shuffle=True, random_state=42) 42 | 43 | # produce two lists of ids. each list has n items where n is the 44 | # number of folds and each item is a pandas series of indexed id numbers 45 | train_ids = [] 46 | val_ids = [] 47 | for i, (train_idx, val_idx) in enumerate(skf.split(np.zeros(train.shape[0]), y_multi)): 48 | train_ids.append(train.loc[train_idx, 'id']) 49 | val_ids.append(train.loc[val_idx, 'id']) 50 | 51 | 52 | 53 | 54 | 55 | #======================= 56 | 57 | #........hash back....... 
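# Sketch only -- the real loading lives in hash_back.py, which is not shown here.
# Each first-level model is assumed to have dumped its out-of-fold predictions and
# its averaged test predictions with pickle (as out-of-fold-cv.py does), so the
# arrays used below would be restored roughly like this (the file names are
# placeholders, not the actual ones):
#
#   oof_CNN      = pickle.load(open("OOF_DPCNN.pkl", "rb"))
#   test_ave_CNN = pickle.load(open("test_average_DPCNN.pkl", "rb"))
#
# and similarly for oof_Capsule, oof_GRU, oof_GRU_3, oof_GRU_no_char, oof_LSTM,
# oof_lgbm and their test_ave_* counterparts.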
58 | 59 | #======================= 60 | 61 | 62 | #absord high correlated data 63 | 64 | 65 | absord=0.5*oof_CNN+oof_Capsule*0.5 66 | absord_te=0.5*test_ave_CNN+test_ave_Capsule*0.5 67 | 68 | 69 | absord_GRU=oof_GRU_no_char*0.5+oof_GRU_3*0.5 70 | absord_GRU_te=test_ave_GRU_3*0.5+test_ave_no_char*0.5 71 | 72 | 73 | 74 | print(absord.shape) 75 | print(oof_GRU.shape) 76 | print(oof_LSTM.shape) 77 | print(oof_lgbm.shape) 78 | 79 | X_tr=np.hstack((absord,oof_GRU,oof_LSTM,oof_lgbm,absord_GRU)) 80 | X_te=np.hstack((absord_te,test_ave_GRU,test_ave_LSTM,test_ave_lgbm,absord_GRU_te)) 81 | 82 | 83 | 84 | 85 | 86 | def runXGB(train_X, train_y, test_X, test_y=None, feature_names=None, seed_val=2017, num_rounds=500): 87 | param = {} 88 | param['objective'] = 'binary:logistic' 89 | param['eta'] = 0.11 #0.12 90 | param['max_depth'] = 3 #4 91 | param['silent'] = 1 92 | #param['max_leaf_nodes'] = 2000 93 | param['eval_metric'] = 'logloss' 94 | param['min_child_weight'] = 1 95 | param['subsample'] = 0.65 96 | param['colsample_bytree'] = 0.785 97 | #param['booster']='dart' 98 | param['seed'] = seed_val 99 | num_rounds = num_rounds 100 | 101 | plst = list(param.items()) 102 | xgtrain = xgb.DMatrix(train_X, label=train_y) 103 | 104 | if test_y is not None: 105 | xgtest = xgb.DMatrix(test_X, label=test_y) 106 | watchlist = [ (xgtrain,'train'), (xgtest, 'test') ] 107 | model = xgb.train(plst, xgtrain, num_rounds, watchlist, early_stopping_rounds=15) 108 | else: 109 | xgtest = xgb.DMatrix(test_X) 110 | model = xgb.train(plst, xgtrain, num_rounds) 111 | 112 | pred_test_y = model.predict(xgtest) 113 | if test_y is not None: 114 | print('ROC AUC:', roc_auc_score( test_y, pred_test_y)) 115 | return model 116 | 117 | 118 | 119 | import xgboost as xgb 120 | from sklearn.metrics import roc_auc_score 121 | K_fold=10 122 | pred_val_accumulator=test_accumulator=None 123 | accumulator=[] 124 | for i in range(splits): 125 | print("=================================") 126 | print("Start on: "+str(i)+" fold") 127 | c_train_X = X_tr[train.id.isin(train_ids[i])] 128 | c_train_y = y[train.id.isin(train_ids[i])] 129 | c_val_X = X_tr[train.id.isin(val_ids[i])] 130 | c_val_y = y[train.id.isin(val_ids[i])] 131 | 132 | 133 | sub_accumulator=[] 134 | pred_val=np.zeros((len(c_val_X),len(list_classes))) 135 | y_test=np.zeros((len(test),len(list_classes))) 136 | for j in range(0,len(list_classes)): 137 | model = runXGB(c_train_X, c_train_y[:,j], c_val_X,c_val_y[:,j]) 138 | pred_val[:,j]=model.predict(xgb.DMatrix(c_val_X)) 139 | y_test[:,j]=model.predict(xgb.DMatrix(X_te)) 140 | result=pred_val[:,j].reshape(-1, 1) 141 | roc_score=roc_auc_score(c_val_y[:,j].reshape(-1, 1),result) 142 | print("#Column: "+str(j)+" Roc_auc_score: "+str(roc_score)) 143 | sub_accumulator.append(roc_score) 144 | 145 | if(i==0): 146 | pred_val_accumulator=pred_val 147 | test_accumulator=y_test 148 | else: 149 | pred_val_accumulator=np.vstack((pred_val_accumulator,pred_val)) 150 | test_accumulator=test_accumulator+y_test 151 | 152 | print("#Average Roc_auc_score is: {}\n".format( np.mean(sub_accumulator) )) 153 | pickle.dump(pred_val_accumulator,open("second_layer_tr"+str(i)+".pkl", "wb")) 154 | pickle.dump(test_accumulator,open("second_layer_te"+str(i)+".pkl", "wb")) 155 | accumulator.append(np.mean(sub_accumulator)) 156 | del model 157 | 158 | print("#Total average Roc_auc_score is: {}\n".format( np.mean(accumulator) )) 159 | print("#std Roc_auc_score is: {}\n".format( np.std(accumulator) )) 160 | 161 | test_accumulator=test_accumulator/10 162 | 163 | 
pickle.dump(pred_val_accumulator,open("second_layer_oof.pkl", "wb")) 164 | if test_accumulator is not None: 165 | pickle.dump(test_accumulator,open("second_layer.pkl", "wb")) 166 | 167 | 168 | 169 | 170 | 171 | -------------------------------------------------------------------------------- /toxic_comment/out-of-fold-cv.py: -------------------------------------------------------------------------------- 1 | from sklearn.metrics import roc_auc_score 2 | import numpy as np 3 | 4 | #train = train.sample(frac=1) 5 | 6 | class_names = list(train)[-6:] 7 | multarray = np.array([100000, 10000, 1000, 100, 10, 1]) 8 | y_multi = np.sum(train[class_names].values * multarray, axis=1) 9 | 10 | print(class_names) 11 | 12 | print(y_multi) 13 | 14 | 15 | from sklearn.model_selection import StratifiedKFold 16 | splits = 10 17 | skf = StratifiedKFold(n_splits=splits, shuffle=True, random_state=42) 18 | 19 | # produce two lists of ids. each list has n items where n is the 20 | # number of folds and each item is a pandas series of indexed id numbers 21 | train_ids = [] 22 | val_ids = [] 23 | for i, (train_idx, val_idx) in enumerate(skf.split(np.zeros(train.shape[0]), y_multi)): 24 | train_ids.append(train.loc[train_idx, 'id']) 25 | val_ids.append(train.loc[val_idx, 'id']) 26 | 27 | 28 | 29 | from sklearn.metrics import roc_auc_score 30 | splits 31 | 32 | accumulator=[] 33 | for i in range(splits): 34 | print("======") 35 | print(str(i)+"fold") 36 | print("======") 37 | c_train_X = X_tr[train.id.isin(train_ids[i])] 38 | c_train_y = y_tr[train.id.isin(train_ids[i])] 39 | c_val_X = X_tr[train.id.isin(val_ids[i])] 40 | c_val_y = y_tr[train.id.isin(val_ids[i])] 41 | #c_train_X_twitter=np.vstack((c_train_XOne_twitter,c_train_XTwo_twitter)) 42 | 43 | """" 44 | 45 | 46 | NN model training 47 | 48 | """ 49 | #record the curve plot 50 | import matplotlib 51 | matplotlib.use('Agg') 52 | import matplotlib.pyplot as plt 53 | 54 | 55 | 56 | ##code validation for NN model 57 | print(history.history.keys()) 58 | plt.clf() 59 | plt.plot(history.history['loss']) 60 | plt.plot(history.history['val_loss']) 61 | plt.title('model accuracy') 62 | plt.ylabel('loss') 63 | plt.xlabel('epoch') 64 | plt.legend(['train', 'val'], loc='upper left') 65 | plt.show() 66 | 67 | plt.savefig('k-fold-plot1'+str(i)+'.png', format='png') 68 | 69 | 70 | print(history.history.keys()) 71 | plt.clf() 72 | plt.plot(history.history['acc']) 73 | plt.plot(history.history['val_acc']) 74 | plt.title('model accuracy') 75 | plt.ylabel('acc') 76 | plt.xlabel('epoch') 77 | plt.legend(['train', 'val'], loc='upper left') 78 | plt.show() 79 | 80 | plt.savefig('k-fold-plot2'+str(i)+'.png', format='png') 81 | 82 | model.load_weights(file_path) 83 | pred_val = model.predict(c_val_X,batch_size=batch_size, verbose=1) 84 | #pred = model.predict( {'main_input':c_val_X, 'aux_input': c_val_X_twitter},batch_size=batch_size, verbose=1) 85 | y_test = model.predict( X_te_1,batch_size=batch_size, verbose=1) 86 | 87 | if(i==0): 88 | pred_val_accumulator=pred_val 89 | test_accumulator=y_test 90 | else: 91 | #pred_accumulator is not None: 92 | pred_val_accumulator=np.vstack((pred_val_accumulator,pred_val)) 93 | #test_accumulator is not None: 94 | test_accumulator=test_accumulator+y_test 95 | sub_accumulator=[] 96 | for j in range(0,len(list_classes)): 97 | result=pred_val[:,j].reshape(-1, 1) 98 | roc_score=roc_auc_score(c_val_y[:,j].reshape(-1, 1),result) 99 | print("#Column: "+str(j)+" Roc_auc_score: "+str(roc_score)) 100 | sub_accumulator.append(roc_score) 101 | print("#Average 
Roc_auc_score is: {}\n".format( np.mean(sub_accumulator) )) 102 | pickle.dump(pred_val_accumulator,open("OOF_"+str(i)+".pkl", "wb")) 103 | pickle.dump(test_accumulator,open("test_average_"+str(i)+".pkl", "wb")) 104 | accumulator.append(np.mean(sub_accumulator)) 105 | del model 106 | test_average=test_accumulator/K_fold 107 | 108 | print("#Total average Roc_auc_score is: {}\n".format( np.mean(accumulator) )) 109 | 110 | 111 | pickle.dump(pred_val_accumulator,open("OOF_.pkl", "wb")) 112 | if test_accumulator is not None: 113 | pickle.dump(test_average,open("test_average_.pkl", "wb")) 114 | 115 | -------------------------------------------------------------------------------- /toxic_comment/preprocess/commen_preprocess.py: -------------------------------------------------------------------------------- 1 | from __future__ import absolute_import, division 2 | import sys, os, re, csv, codecs, numpy as np, pandas as pd 3 | 4 | 5 | from keras.preprocessing.sequence import pad_sequences 6 | from nltk.stem import SnowballStemmer 7 | stemmer = SnowballStemmer('english') 8 | 9 | 10 | 11 | train = pd.read_csv('train.csv') 12 | test = pd.read_csv('test.csv') 13 | 14 | 15 | from sklearn.model_selection import train_test_split 16 | 17 | x=train.iloc[:,2:].sum() 18 | rowsums=train.iloc[:,2:].sum(axis=1) 19 | train['clean']=(rowsums==0) 20 | 21 | 22 | 23 | merge=pd.concat([train,test]) 24 | df=merge.reset_index(drop=True) 25 | 26 | 27 | merge["comment_text"]=merge["comment_text"].fillna("_na_").values 28 | 29 | corpus_raw=df.comment_text 30 | 31 | 32 | no_abbre = { 33 | "aren't" : "are not", 34 | "can't" : "cannot", 35 | "couldn't" : "could not", 36 | "didn't" : "did not", 37 | "doesn't" : "does not", 38 | "don't" : "do not", 39 | "hadn't" : "had not", 40 | "hasn't" : "has not", 41 | "haven't" : "have not", 42 | "he'd" : "he would", 43 | "he'll" : "he will", 44 | "he's" : "he is", 45 | "i'd" : "I would", 46 | "i'd" : "I had", 47 | "i'll" : "I will", 48 | "i'm" : "I am", 49 | "isn't" : "is not", 50 | "it's" : "it is", 51 | "it'll":"it will", 52 | "i've" : "I have", 53 | "let's" : "let us", 54 | "mightn't" : "might not", 55 | "mustn't" : "must not", 56 | "shan't" : "shall not", 57 | "she'd" : "she would", 58 | "she'll" : "she will", 59 | "she's" : "she is", 60 | "shouldn't" : "should not", 61 | "that's" : "that is", 62 | "there's" : "there is", 63 | "they'd" : "they would", 64 | "they'll" : "they will", 65 | "they're" : "they are", 66 | "they've" : "they have", 67 | "we'd" : "we would", 68 | "we're" : "we are", 69 | "weren't" : "were not", 70 | "we've" : "we have", 71 | "what'll" : "what will", 72 | "what're" : "what are", 73 | "what's" : "what is", 74 | "what've" : "what have", 75 | "where's" : "where is", 76 | "who'd" : "who would", 77 | "who'll" : "who will", 78 | "who're" : "who are", 79 | "who's" : "who is", 80 | "who've" : "who have", 81 | "won't" : "will not", 82 | "wouldn't" : "would not", 83 | "you'd" : "you would", 84 | "you'll" : "you will", 85 | "you're" : "you are", 86 | "you've" : "you have", 87 | "'re": " are", 88 | "wasn't": "was not", 89 | "we'll":" will", 90 | "didn't": "did not", 91 | "tryin'":"trying" 92 | } 93 | 94 | 95 | 96 | emoji = { 97 | "<3": " good ", 98 | ":d": " good ", 99 | ":dd": " good ", 100 | ":p": " good ", 101 | "8)": " good ", 102 | ":-)": " good ", 103 | ":)": " good ", 104 | ";)": " good ", 105 | "(-:": " good ", 106 | "(:": " good ", 107 | "yay!": " good ", 108 | "yay": " good ", 109 | "yaay": " good ", 110 | "yaaay": " good ", 111 | "yaaaay": " good ", 112 | 
"yaaaaay": " good ", 113 | ":/": " bad ", 114 | ":>": " sad ", 115 | ":')": " sad ", 116 | ":-(": " bad ", 117 | ":(": " bad ", 118 | ":s": " bad ", 119 | ":-s": " bad ", 120 | "<3": " heart ", 121 | ":d": " smile ", 122 | ":p": " smile ", 123 | ":dd": " smile ", 124 | "8)": " smile ", 125 | ":-)": " smile ", 126 | ":)": " smile ", 127 | ";)": " smile ", 128 | "(-:": " smile ", 129 | "(:": " smile ", 130 | ":/": " worry ", 131 | ":>": " angry ", 132 | ":')": " sad ", 133 | ":-(": " sad ", 134 | ":(": " sad ", 135 | ":s": " sad ", 136 | ":-s": " sad ", 137 | r"\br\b": "are", 138 | r"\bu\b": "you", 139 | r"\bhaha\b": "ha", 140 | r"\bhahaha\b": "ha"} 141 | 142 | 143 | #Can add more to do the extension 144 | bad_wordBank={ 145 | 'fage':"shove your balls up your own ass or the ass of another to stretch your scrotum skin", 146 | } 147 | 148 | 149 | print("....start....cleaning") 150 | 151 | 152 | 153 | #=================stop word===================== 154 | 155 | 156 | from nltk.stem.wordnet import WordNetLemmatizer 157 | lem = WordNetLemmatizer() 158 | 159 | stop_words = ['the','a','an','and','but','if','or','because','as','what','which','this','that','these','those','then', 160 | 'just','so','than','such','both','through','about','for','is','of','while','during','to'] 161 | 162 | import string 163 | 164 | from nltk.corpus import stopwords 165 | 166 | eng_stopwords = set(stopwords.words("english")) 167 | 168 | 169 | #=================other special code===================== 170 | 171 | re_tok = re.compile(r'([�鎿�𤲞阬威鄞捍朝溘甄蝓壇螞¯岑�''\t])') 172 | 173 | 174 | #=================replace the duplicate word===================== 175 | import re 176 | 177 | def substitute_repeats_fixed_len(text, nchars, ntimes=4): 178 | """" 179 | Find substrings that consist of `nchars` non-space characters 180 | and that are repeated at least `ntimes` consecutive times, 181 | and replace them with a single occurrence. 
182 | Examples: 183 | abbcccddddeeeee -> abcde (nchars = 1, ntimes = 2) 184 | abbcccddddeeeee -> abbcde (nchars = 1, ntimes = 3) 185 | abababcccababab -> abcccab (nchars = 2, ntimes = 2) 186 | """" 187 | return re.sub(r"(\S{{{}}})(\1{{{},}})".format(nchars, ntimes-1), 188 | r"\1", text) 189 | 190 | def substitute_repeats(text, ntimes=4): 191 | # Truncate consecutive repeats of short strings 192 | for nchars in range(1, 20): 193 | text = substitute_repeats_fixed_len(text, nchars, ntimes) 194 | return text 195 | 196 | 197 | 198 | #=================choose one of tokenizer======================= 199 | from nltk.tokenize import RegexpTokenizer 200 | 201 | tokenizer = RegexpTokenizer(r'\w+') 202 | 203 | from nltk.tokenize import TweetTokenizer 204 | 205 | tokenizer=TweetTokenizer() 206 | 207 | 208 | #=================final clean function======================= 209 | def clean(comment): 210 | comment=comment.lower() 211 | #remove \n 212 | comment=re.sub(r"\n",".",comment) 213 | comment=re.sub(r"\\n\n",".",comment) 214 | comment=re.sub(r"fucksex","fuck sex",comment) 215 | comment=re.sub(r"f u c k","fuck",comment) 216 | comment=re.sub(r"幹","fuck",comment) 217 | #Chinese bad word 218 | comment=re.sub(r"死","die",comment) 219 | comment=re.sub(r"他妈的","fuck",comment) 220 | comment=re.sub(r"去你妈的","fuck off",comment) 221 | comment=re.sub(r"肏你妈","fuck your mother",comment) 222 | comment=re.sub(r"肏你祖宗十八代","your ancestors to the 18th generation",comment) 223 | # remove leaky elements like ip,user 224 | comment=re.sub("\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}","",comment) 225 | #removing usernames 226 | comment=re.sub("\[\[.*\]","",comment) 227 | comment = re.sub(r"\'ve", " have ", comment) 228 | 229 | comment = re.sub(r"n't", " not ", comment) 230 | comment = re.sub(r"\'d", " would ", comment) 231 | comment = re.sub(r"\'ll", " will ", comment) 232 | comment = re.sub(r"ca not", "cannot", comment) 233 | comment = re.sub(r"you ' re", "you are", comment) 234 | comment = re.sub(r"wtf","what the fuck", comment) 235 | comment = re.sub(r"i ' m", "I am", comment) 236 | comment = re.sub(r"I", "one", comment) 237 | comment = re.sub(r"II", "two", comment) 238 | comment = re.sub(r"III", "three", comment) 239 | comment = re.sub(r'牛', "cow", comment) 240 | comment=re.sub(r"mothjer","mother",comment) 241 | comment=re.sub(r"g e t r i d o f a l l i d i d p l e a s e j a ck a s s", 242 | "get rid of all i did please jackass",comment) 243 | comment=re.sub(r"nazi","nazy",comment) 244 | comment=re.sub(r"withought","with out",comment) 245 | comment=substitute_repeats(comment) 246 | s=comment 247 | 248 | s = s.replace('&', ' and ') 249 | s = s.replace('@', ' at ') 250 | s = s.replace('0', 'zero') 251 | s = s.replace('1', 'one') 252 | s = s.replace('2', 'two') 253 | s = s.replace('3', 'three') 254 | s = s.replace('4', 'four') 255 | s = s.replace('5', 'five') 256 | s = s.replace('6', 'six') 257 | s = s.replace('7', 'seven') 258 | s = s.replace('8', 'eight') 259 | s = s.replace('9', 'night') 260 | s = s.replace('雲水','') 261 | 262 | comment=s 263 | comment = re_tok.sub(' ', comment) 264 | 265 | words=tokenizer.tokenize(comment) 266 | 267 | words=[no_abbre[word] if word in APPO else word for word in words] 268 | words=[bad_wordBank[word] if word in bad_wordBank else word for word in words] 269 | words=[emoji[word] if word in repl else word for word in words] 270 | words = [w for w in words if not w in stop_words] 271 | 272 | 273 | sent=" ".join(words) 274 | # Remove some special characters, or noise charater, but do not remove all!! 
275 | sent = re.sub(r'([\'\"\/\-\_\--\_])',' ', sent) 276 | clean_sent= re.sub(r'([\;\|•«\n])',' ', sent) 277 | 278 | return(clean_sent) 279 | 280 | 281 | 282 | #==================set up the multi core function to do preprocessing======================= 283 | 284 | import pandas as pd 285 | import numpy as np 286 | 287 | from multiprocessing import Pool 288 | 289 | num_partitions = 8 #number of partitions to split dataframe 290 | num_cores = 4 #number of cores on your machine 291 | 292 | def parallelize_dataframe(df, func): 293 | df_split = np.array_split(df, num_partitions) 294 | pool = Pool(num_cores) 295 | df = pd.concat(pool.map(func, df_split)) 296 | pool.close() 297 | pool.join() 298 | return df 299 | 300 | def multiply_columns_clean(data): 301 | data = data.apply(lambda x: clean(x)) 302 | return data 303 | 304 | 305 | -------------------------------------------------------------------------------- /toxic_comment/preprocess/correction.py: -------------------------------------------------------------------------------- 1 | import re 2 | from collections import Counter 3 | 4 | def words(text): return re.findall(r'\w+', text.lower()) 5 | 6 | 7 | import pandas as pd, numpy as np 8 | 9 | 10 | train = pd.read_csv('train.csv') 11 | test = pd.read_csv('test.csv') 12 | 13 | 14 | 15 | merge=pd.concat([train,test]) 16 | df=merge.reset_index(drop=True) 17 | 18 | 19 | #Assume using glove.840B.300d======================================= 20 | import os, re, csv, math, codecs 21 | print('loading word embeddings...') 22 | embeddings_index = {} 23 | #ranking index 24 | #from https://nlp.stanford.edu/pubs/glove.pdf: GloVe models word co-occurrence as a power-law function of the frequency rank of that word pair 25 | f = codecs.open('glove.840B.300d.txt', encoding='utf-8') 26 | #or any other word embedding whose index order depends on word frequency 27 | from tqdm import tqdm 28 | for line in tqdm(f): 29 | values = line.rstrip().rsplit(' ') 30 | word = values[0] 31 | coefs = np.asarray(values[1:], dtype='float32') 32 | embeddings_index[word] = coefs 33 | f.close() 34 | print('found %s word vectors' % len(embeddings_index)) 35 | 36 | 37 | words = embeddings_index 38 | 39 | w_rank = {} 40 | for i,word in enumerate(words): 41 | w_rank[word] = i 42 | 43 | WORDS=w_rank 44 | 45 | print("load ") 46 | 47 | import pickle 48 | 49 | 50 | corpus=pickle.load(open("tmp_clean.pkl", "rb")) 51 | 52 | 53 | def P(word): #part from CPMP 54 | "Probability of `word`." 55 | # use inverse of rank as proxy 56 | # returns 0 if the word isn't in the dictionary 57 | return - WORDS.get(word, 0) 58 | #======================================================================= 59 | 60 | 61 | #Assume not using glove.840B.300d======================================= 62 | #original version, suitable for any other embedding, like fastText wiki, but not recommended to use it like this 63 | 64 | #=========== 65 | #load data.... 66 | #=========== 67 | 68 | sum_of_words=len(WORDS) 69 | def P(word, N=sum_of_words): 70 | "Probability of `word`." 71 | return WORDS[word] / N 72 | #======================================================================= 73 | 74 | 75 | def correction(word,lower=4,upper=10): 76 | "Most probable spelling correction for word." 77 | length=len(word) 78 | if length<lower or length>upper: 79 | return word 80 | elif word in WORDS: 81 | return word 82 | return max(candidates(word), key=P) 83 | 84 | def candidates(word): 85 | "Generate possible spelling corrections for word." 
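    # The or-chain below falls through in priority order: keep the word if it is
    # already known, otherwise try candidates one edit away, then two edits away,
    # and finally return the word unchanged.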
--------------------------------------------------------------------------------
/toxic_comment/preprocess/correction.py:
--------------------------------------------------------------------------------
import re
from collections import Counter

def words(text): return re.findall(r'\w+', text.lower())


import pandas as pd, numpy as np


train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')


merge = pd.concat([train, test])
df = merge.reset_index(drop=True)


# Assume glove.840B.300d is used =======================================
import os, re, csv, math, codecs
print('loading word embeddings...')
embeddings_index = {}
# The line order of the GloVe file doubles as a frequency ranking: the GloVe
# paper (https://nlp.stanford.edu/pubs/glove.pdf) models co-occurrence as a
# power-law function of the frequency rank of a word pair, and the released
# vectors are sorted by frequency.
f = codecs.open('glove.840B.300d.txt', encoding='utf-8')
# any other embedding file whose lines are ordered by word frequency works too
from tqdm import tqdm
for line in tqdm(f):
    values = line.rstrip().rsplit(' ')
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print('found %s word vectors' % len(embeddings_index))


words = embeddings_index   # note: shadows the words() helper above, which is unused here

w_rank = {}
for i, word in enumerate(words):
    w_rank[word] = i

WORDS = w_rank

print("embedding ranks loaded")

import pickle


corpus = pickle.load(open("tmp_clean.pkl", "rb"))


def P(word):  # part from CPMP
    "Probability of `word`."
    # Use the negated rank as a proxy for probability: more frequent words have
    # a smaller rank, hence a larger value.  Unknown words only reach this
    # function through the final [word] fallback, where there is a single
    # candidate anyway.
    return - WORDS.get(word, 0)
#=======================================================================


# Alternative, if glove.840B.300d is NOT used ==========================
# Original count-based probability, suitable for other sources such as a
# fastText wiki vocabulary with counts (not recommended here).  Keep exactly
# one of the two P definitions active; the rank-based one above is used.
#
# (load data...)
#
# sum_of_words = len(WORDS)
# def P(word, N=sum_of_words):
#     "Probability of `word`."
#     return WORDS[word] / N
#=======================================================================


def correction(word, lower=4, upper=10):
    "Most probable spelling correction for word."
    # only try to correct tokens of medium length
    length = len(word)
    if length < lower or length > upper:
        return word
    elif word in WORDS:
        return word
    return max(candidates(word), key=P)

def candidates(word):
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words):
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))


from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()


regx = re.compile('[a-z]+')
record = set()
def clean_correction(comment):
    comment = comment.lower()
    words = tokenizer.tokenize(comment)
    _init_ = []
    for w in words:
        # only run the corrector on alphabetic tokens that have not already
        # come out of the corrector before
        if bool(regx.match(w)) and w not in record:
            w = correction(w)
            _init_.append(w)
            """ # save-space version: only cache tokens outside the dictionary
            if w not in WORDS:
                record.add(w)
            """
            # quick version: cache every corrected token
            record.add(w)
        else:
            _init_.append(w)
    words = _init_
    sent = " ".join(words)
    return sent


from multiprocessing import Pool

num_partitions = 12  # number of partitions to split the dataframe into
num_cores = 4        # number of cores on your machine


def parallelize_dataframe(df, func):
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df


def multiply_columns_correction(data):
    data = data.apply(lambda x: clean_correction(x))
    return data

print("======= start correction =======")


import time

start = time.time()

corpus = parallelize_dataframe(corpus, multiply_columns_correction)

end = time.time()

timeStep = end - start

print("seconds spent: " + str(timeStep))


pickle.dump(corpus, open("tmp_correction_glove.pkl", "wb"))
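# Optional sanity check (a minimal sketch): spot-check the corrector on a few
# deliberately misspelled tokens; the output depends on the embedding
# vocabulary and its frequency order.
if __name__ == "__main__":
    for _w in ["beleive", "recieve", "helllo"]:
        print(_w, "->", correction(_w))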
--------------------------------------------------------------------------------
/toxic_comment/preprocess/glove_twitter_preprocess.py:
--------------------------------------------------------------------------------
import re
import numpy as np
import pandas as pd


def glove_twitter_preprocess(text):
    """
    Adapted from https://nlp.stanford.edu/projects/glove/preprocess-twitter.rb
    (part of this is from Dieter).
    """
    # Different regex parts for smiley faces
    eyes = "[8:=;]"
    nose = r"['`\-]?"
    # URLs and user links
    text = re.sub(r"https?://.* ", "", text)
    text = re.sub(r"www.* ", "", text)
    text = re.sub(r"/", " / ", text)
    text = re.sub(r"\[\[User(.*)\|", '', text)
    text = re.sub(r"<3", '', text)
    # numbers
    text = re.sub(r"[-+]?[.\d]*[\d]+[:,.\d]*", "", text)
    # smileys built from the eyes/nose parts above
    text = re.sub(eyes + nose + r"[Dd)]", '', text)
    text = re.sub(r"[(d]" + nose + eyes, '', text)
    text = re.sub(eyes + nose + r"p", '', text)
    text = re.sub(eyes + nose + r"\(", '', text)
    text = re.sub(r"\)" + nose + eyes, '', text)
    text = re.sub(eyes + nose + r"[/|l*]", '', text)
    text = re.sub(r"[-+]?[.\d]*[\d]+[:,.\d]*", "", text)
    # collapse repeated punctuation
    text = re.sub(r"([!]){2,}", "! ", text)
    text = re.sub(r"([?]){2,}", "? ", text)
    text = re.sub(r"([.]){2,}", ". ", text)
    # collapse any character repeated three or more times to a single one
    pattern = re.compile(r"(.)\1{2,}")
    text = pattern.sub(r"\1" + " ", text)

    return text

def multiply_columns_glove_twitter_preprocess(data):
    data = data.apply(lambda x: glove_twitter_preprocess(x))
    return data
--------------------------------------------------------------------------------
/toxic_comment/preprocess/word_net_lemmatize.py:
--------------------------------------------------------------------------------
from nltk.corpus import wordnet
from nltk import word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer


def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None


# for the meaning of the Treebank tags, see
# https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

def lemmatize_sentence(sentence):
    res = []
    lemmatizer = WordNetLemmatizer()
    for word, pos in pos_tag(word_tokenize(sentence)):
        wordnet_pos = get_wordnet_pos(pos) or wordnet.NOUN
        res.append(lemmatizer.lemmatize(word, pos=wordnet_pos))
    res = " ".join(res)

    return res


import pickle

import pandas as pd
import numpy as np

from multiprocessing import Pool

num_partitions = 8  # number of partitions to split the dataframe into
num_cores = 4       # number of cores on your machine


def parallelize_dataframe(df, func):
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

def multiply_columns_lemmatize_sentence(data):
    data = data.apply(lambda x: lemmatize_sentence(x))
    return data

# load the cleaned corpus produced by the earlier preprocessing step
# (file name assumed; see commen_preprocess.py / correction.py)
corpus = pickle.load(open("tmp_clean.pkl", "rb"))

# with 4 cores this takes roughly 13 minutes; it only needs to be run once,
# unless a different preprocessing is applied
corpus = parallelize_dataframe(corpus, multiply_columns_lemmatize_sentence)
# store the result back to disk
pickle.dump(corpus, open("tmpWordNetlem.pkl", "wb"))
corpus = pickle.load(open("tmpWordNetlem.pkl", "rb"))
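# A minimal example of the POS-aware lemmatisation above: verbs stay verbs
# instead of being treated as nouns, e.g. "running" -> "run" and "were" -> "be".
if __name__ == "__main__":
    print(lemmatize_sentence("the cats were running"))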
--------------------------------------------------------------------------------
/toxic_comment/toxic_comment_writting_report.pdf:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JasonEricZhan/Kaggle-project-list/460e33e6475d77a6db481b1e518421a9ddcb65cc/toxic_comment/toxic_comment_writting_report.pdf
--------------------------------------------------------------------------------
/toxic_rank_pic.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/JasonEricZhan/Kaggle-project-list/460e33e6475d77a6db481b1e518421a9ddcb65cc/toxic_rank_pic.png
--------------------------------------------------------------------------------