├── README
├── pic
│   ├── Rplot.png
│   ├── exp1.png
│   ├── Rplot03.png
│   ├── Rplot1.png
│   └── Rplot2.png
├── Chiffon15221428248[文档1].docx
├── Chiffon15221428248[文档2].docx
├── maoning
│   └── MCMC.R
├── R语言中的向量化运算学习笔记.Rmd
├── Wangzf
│   └── reshap2.R
└── rong360.Rmd

/README:
--------------------------------------------------------------------------------
OpenDataMining
--------------------------------------------------------------------------------
/pic/Rplot.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Lchiffon/OpenDataMining/master/pic/Rplot.png
--------------------------------------------------------------------------------
/pic/exp1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Lchiffon/OpenDataMining/master/pic/exp1.png
--------------------------------------------------------------------------------
/pic/Rplot03.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Lchiffon/OpenDataMining/master/pic/Rplot03.png
--------------------------------------------------------------------------------
/pic/Rplot1.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Lchiffon/OpenDataMining/master/pic/Rplot1.png
--------------------------------------------------------------------------------
/pic/Rplot2.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Lchiffon/OpenDataMining/master/pic/Rplot2.png
--------------------------------------------------------------------------------
/Chiffon15221428248[文档1].docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Lchiffon/OpenDataMining/master/Chiffon15221428248[文档1].docx
--------------------------------------------------------------------------------
/Chiffon15221428248[文档2].docx:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Lchiffon/OpenDataMining/master/Chiffon15221428248[文档2].docx
--------------------------------------------------------------------------------
/maoning/MCMC.R:
--------------------------------------------------------------------------------
library(bayesm)
library(xlsx)
library(MASS)

# read the data (read.xlsx() requires a sheet; sheetIndex = 1 assumed)
xls_path <- "C:/Users/Administrator/Desktop/贝叶斯.xls"
mydata <- read.xlsx(xls_path, sheetIndex = 1)

pibeta  <- c(1, 1, 1, 1, 1)
betahat <- c(5.31, -7.06, 3.5, -3.41, 2.98)
# covariance matrix of beta (5 x 5, symmetric)
Sigma <- matrix(c(  5.6,   -7.7,   15.86, -2.24, -2.04,
                   -7.7,   35.73, -70.6,   2.89,  1.87,
                   15.86, -70.6,  185.18, -7.05, -0.68,
                   -2.24,   2.89,  -7.05,  1.45, -1.22,
                   -2.04,   1.87,  -0.68, -1.22,  9.41), nrow = 5)

# marginal of beta0: draw, sort, plot the theoretical normal density
set.seed(1234)
beta0 <- rnorm(50, betahat[1], sqrt(Sigma[1, 1]))
Beta0 <- sort(beta0)
p1 <- dnorm(Beta0, betahat[1], sqrt(Sigma[1, 1]))
plot(Beta0, p1, type = "l")

# same for beta2, with more draws
set.seed(1234)
beta2 <- rnorm(1000, betahat[3], sqrt(Sigma[3, 3]))
Beta2 <- sort(beta2)
p2 <- dnorm(Beta2, betahat[3], sqrt(Sigma[3, 3]))
plot(Beta2, p2, type = "l")

# joint draws from the multivariate normal
set.seed(1234)
beta <- mvrnorm(1000, betahat, Sigma)
h <- det(Sigma)^(-1/2)   # |Sigma|^(-1/2); the constant cancels after normalising
LBeta2 <- sort(beta[, 3])
pbeta2 <- dnorm(LBeta2, betahat[3], sqrt(Sigma[3, 3]))
p <- h * pbeta2 / sum(h * pbeta2)
par(mfrow = c(1, 2))
plot(LBeta2, p, type = "l", xlab = "Beta2", ylab = "Density", col = "blue")
plot(Beta2, p2, type = "l")
lines(Beta2, p2, type = "l", col = "red")  # note: redraws p2 over itself; overlaying p was likely intended
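
# Added sanity check (not part of the original script): overlay a kernel
# density estimate of the simulated beta2 draws on the theoretical
# N(betahat[3], Sigma[3,3]) curve -- for 1000 draws the two should agree closely.
par(mfrow = c(1, 1))
plot(Beta2, p2, type = "l", xlab = "Beta2", ylab = "Density")
lines(density(beta[, 3]), col = "blue", lty = 2)
legend("topright", legend = c("theoretical", "simulated"),
       col = c("black", "blue"), lty = 1:2)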
--------------------------------------------------------------------------------
/R语言中的向量化运算学习笔记.Rmd:
--------------------------------------------------------------------------------
---
title: "Vectorized Operations in R"
author: "Louis"
date: "Friday, May 08, 2015"
output: html_document
---

### Vectorized operations in R

We introduce R's vectorized operations mainly through the iris data set.
```{r}
str(iris)
```
---

## apply

**apply** applies a function over one margin of an array or matrix and returns the resulting values as a vector, array, or list.
```{r,echo=FALSE}
str(apply)
```
For example, the means of the first four columns of iris:
```{r}
apply(iris[1:4], 2, mean)
```
---

Or the row-wise maxima:
```{r}
apply(iris[1:4], 1, max)
```
---

## lapply and sapply

**lapply** and **sapply** apply a function to each element of x and return one value per element; here x can be a vector, a matrix, or a data frame. **sapply** is the user-friendly version of **lapply**: lapply always returns a list, while sapply simplifies the result to a vector or matrix when possible.
For example, the class of each column of iris:
```{r}
lapply(iris, class)
sapply(iris, class)
```
We can also use sapply to compute each column's mean:
```{r}
sapply(iris, mean)
```
---

## tapply (compared with aggregate)

**tapply** applies a function to groups of non-empty values (typically a vector), where the groups are determined by a set of factors.
For example, the mean and standard deviation of Sepal.Width for each Species:
```{r}
tapply(iris$Sepal.Width, INDEX = iris$Species, FUN = mean)
tapply(iris$Sepal.Width, INDEX = iris$Species, FUN = sd)
```
Here we also introduce the *aggregate* function. **aggregate** is an important data-aggregation function: it computes summaries of a data frame split by a set of factors.
```{r}
aggregate(iris[1:4], by = list(a = iris$Species), mean)
```
---

## mapply

mapply is a multivariate version of sapply: FUN is applied to the corresponding elements of each argument in ..., which are recycled if necessary.
```{r,echo=FALSE}
str(mapply)
```
A simple example:
```{r}
mapply(rep, 1:3, 3:1)
```
---

## vapply

vapply is similar to sapply, but its return value has a pre-specified type, so it is safer to use and sometimes faster.
vapply always simplifies; it checks that all values returned by FUN are compatible with FUN.VALUE, i.e. have the same length and type. Type order: logical < integer < double < complex.
```{r}
x <- data.frame(a = rnorm(4, 4, 4), b = rnorm(4, 5, 3), c = rnorm(4, 5, 3))
vapply(x, mean, c(c = 0))
```
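
To see why the FUN.VALUE check makes vapply safer, here is a small added illustration (not from the original notes): when result lengths differ, sapply silently falls back to returning a list, while vapply raises an error instead.
```{r, error=TRUE}
sapply(1:3, seq_len)              # lengths differ: sapply quietly returns a list
vapply(1:3, seq_len, numeric(1))  # vapply errors: values must match FUN.VALUE
```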
read.csv("titanic2.csv", stringsAsFactors = FALSE) 50 | head(titanic2) 51 | 52 | tidy <- melt(titanic2, id = c("class", "age", "fate"), variable.name = "gender") 53 | head(tidy) 54 | tidy <- dcast(tidy, class + age + gender ~ fate, value.var = "value") 55 | head(tidy) 56 | titanic2$rate <- with(tidy, round(survived / (survived + perished), 2)) 57 | head(tidy) 58 | 59 | 60 | ###3.Data split across many files 61 | cbind() 62 | rbind() 63 | 64 | 65 | ###4.Saving data 66 | read.csv() 67 | write.csv() 68 | 69 | write.csv(tidy, file = "F:/importance/tidy.csv", 70 | row.names = FALSE) 71 | 72 | readRDS() 73 | saveRDS() 74 | 75 | saveRDS(tidy, "F:/importance/tidy.rds") 76 | tidy2 <- readRDS("F:/importance/tidy.rds") 77 | head(tidy2) 78 | 79 | write.csv(tidy, file = bzfile("F:/importance/tidy.csv.bz2"), row.names = FALSE) #ѹ?????? 80 | tidy3 <- read.csv("F:/importance/tidy.csv.bz2") 81 | 82 | 83 | #------------------------------------------------------------------ 84 | write.table(x, file = "", append = FALSE, quote = TRUE, sep = " ", 85 | eol = "/n", na = "NA", dec = ".", row.names = TRUE, 86 | col.names = TRUE, qmethod = c("escape", "double"), 87 | fileEncoding = "") 88 | write.table(iris, "F:/iris.txt", sep = ',', quote = T, na = 'NA') 89 | library(xlsx) 90 | write.xlsx() 91 | library(foreign) 92 | write.foreign(df, datafile, codefile, 93 | package = c("SPSS", "Stata", "SAS"), ...) 94 | #------------------------------------------------------------------ 95 | 96 | 97 | #####Transforming Data 98 | library(ggplot2) 99 | options(stringsAsFactors = FALSE) 100 | setwd("F:/Rworkd/DW/Intro-to-R/Day 2/data") 101 | bnames <- read.csv("bnames.csv.bz2") 102 | head(bnames) 103 | births <- read.csv("births.csv") 104 | head(births) 105 | 106 | #--------------------------------------------------------------------- 107 | garrett <- bnames[bnames$name == "Garrett", ] 108 | qplot(year, prop, data = garrett, geom = "line") 109 | 110 | michael <- bnames[bnames$name == "Michael", ] 111 | qplot(year, prop, data = michael, geom = "line") 112 | qplot(year, prop, data = michael, geom = "point") 113 | qplot(year, prop, data = michael, geom = "line", color = sex) 114 | 115 | michaels <- bnames[bnames$name == "Michael" | bnames$name == "Michelle", ] 116 | qplot(year, prop, data = michaels, geom = "line", 117 | color = interaction(sex, name)) 118 | #------------------------------------------------------------------------ 119 | 120 | ####dplyr----------------------------------------------------- 121 | library(dplyr) 122 | bnames <- tbl_df(bnames) 123 | births <- tbl_df(births) 124 | class(bnames) 125 | 126 | ##filter keep rows by criteria---------------------------- 127 | df <- data.frame(color = c("blue", "black", "blue", "blue", "black"), 128 | value = 1:5) 129 | df 130 | tbl <- tbl_df(df) 131 | tbl 132 | 133 | filter(tbl, color == "blue") 134 | filter(df, value %in% c(1, 4)) 135 | #---------------------------------------------------- 136 | A <- filter(bnames, name == "Garrett") 137 | B <- filter(bnames, year %% 100 == 0 & sex == "girl") 138 | 139 | dim(filter(bnames, prop > 0.01 & year > 2000)) 140 | 141 | garrett <- filter(bnames, name == "Garrett") 142 | garrett$soundex[1] 143 | filter(bnames, soundex == "G630") 144 | 145 | filter(bnames, sex == "girl" & (year == 1900 | year== 2000)) 146 | dim(filter(bnames, year > 2000 & prop > 0.01)) 147 | #------------------------------------------------------ 148 | 149 | ##select pick columns by name------------------------------ 150 | select(tbl, color) 151 | select(tbl, -color) 152 
#----------------------------------------------
a1 <- select(bnames, soundex)
a2 <- select(bnames, -c(year, name, prop, sex))
a3 <- select(bnames, starts_with("sound"))  # prefix with - to drop instead
a4 <- select(bnames, ends_with("dex"))      # idem
a5 <- select(bnames, contains("sound"))     # idem
a6 <- select(bnames, matches(""))           # regex match; empty pattern matches all
a7 <- select(bnames, one_of("soundex"))
a8 <- select(bnames, everything())
# a9  <- select(x, from:to)              # a range of adjacent columns
# a10 <- select(x, num_range("x", 1:5))  # numbered columns x1 ... x5

a <- select(bnames, ends_with("ex"))

## arrange: reorder rows ------------------------------

df1 <- data.frame(color = c(4, 1, 5, 3, 2), value = 1:5)
df1

tbl1 <- tbl_df(df1)
tbl1

arrange(tbl1, color)
arrange(tbl1, desc(color))

arrange(bnames, desc(prop))
garrett <- filter(bnames, name == "Garrett")
arrange(garrett, desc(prop))


## mutate: add new variables -----------------------------------------------
mutate(tbl, double = value * 2)
mutate(tbl, double = value * 2, quadruple = double * 2)

## summarise: reduce variables to values ------------------------------------------
summarise(tbl, total = sum(value))
summarise(tbl, total = sum(value), avg = mean(value))


### Joining data sets ----------------------------------------------------------------------

x <- data.frame(
  name = c("John", "Paul", "George", "Ringo", "Stuart", "Pete"),
  instrument = c("guitar", "bass", "guitar", "drums", "bass", "drums"))
x
y <- data.frame(
  name = c("John", "Paul", "George", "Ringo", "Brian"),
  band = c("TRUE", "TRUE", "TRUE", "TRUE", "FALSE"))
y
left_join(x, y, by = "name")
inner_join(x, y, by = "name")
semi_join(x, y, by = "name")
anti_join(x, y, by = "name")
#------------------------------------------------------------------------
bnames2 <- left_join(bnames, births, by = c("year", "sex"))
bnames2
bnames2 <- mutate(bnames2, n = prop * births)
bnames2
bnames2 <- mutate(bnames2, n = round(prop * births))
bnames2

### Group-wise operations
garrett <- filter(bnames2, name == "Garrett")
garrett
sum(garrett$n)
summarise(garrett, total = sum(n))

## group_by -----------------------------------
df <- data.frame(
  color = c("blue", "black", "blue", "blue", "black"),
  value = 1:5)
df
tbl <- tbl_df(df)
tbl

summarise(tbl, total = sum(value))

group_by(tbl, color) %>%
  summarise(total = sum(value))

group_by(bnames2, name) %>%
  summarise(total = sum(n))
#---------------------------
group_by(bnames2, name, sex)
# roughly equivalent -- but note a second group_by() replaces the existing
# grouping by default; see the added demonstration below
group_by(bnames2, name) %>%
  group_by(sex)
#---------------------------
group_by(bnames2, name, sex) %>%
  summarise(total = sum(n))

# remove group specifications
group_by(bnames2, name, sex) %>%
  ungroup()
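
# Added demonstration (not in the original notes): group_by() normally
# *replaces* the current grouping; with add = TRUE it appends instead,
# which reproduces group_by(bnames2, name, sex) in two steps.
group_by(bnames2, name) %>%
  group_by(sex, add = TRUE) %>%
  summarise(total = sum(n))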
## summary functions
min(x)       # minimum
median(x)    # median
mean(x)      # arithmetic mean
quantile(x)  # sample quantiles
max(x)       # maximum
fivenum(x)   # Tukey five-number summary

n()          # the number of observations in the current group;
             # implemented specially for each data source and only
             # usable from within summarise(), mutate() and filter()
sum(x)       # sum

sum(x > 10)  # how many values exceed 10
mean(x > 10) # proportion of values exceeding 10

var(x)       # variance
sd(x)        # standard deviation
IQR(x)       # interquartile range
mad(x)       # median absolute deviation
range(x)     # minimum and maximum


p <- group_by(bnames2, soundex) %>%
  summarise(total = sum(n))
p
arrange(p, desc(total))
j500 <- filter(bnames, soundex == "J500")
unique(j500$name)

group_by(bnames2, year, sex) %>%
  summarise(births = sum(n))

group_by(bnames2, year, sex) %>%
  mutate(rank = rank(desc(prop)))
browseVignettes(package = "dplyr")
--------------------------------------------------------------------------------
/rong360.Rmd:
--------------------------------------------------------------------------------
---
title: "Rong360 Data Analysis"
author: "Chiffon"
date: "Sunday, May 03, 2015"
output: word_document
---

## Abstract

## Problem Statement

Rong360 is an intelligent search platform for the financial sector. It provides loan, credit-card, and wealth-management search and application services for small businesses and individual consumers; it currently partners with nearly 10,000 financial institutions, covers 30,000 financial products, serves 100 major cities nationwide, and is committed to offering users convenient, economical, and reliable financial products and services free of charge [1].

This report is based on the [open data](http://openresearch.rong360.com/) released by Rong360, which covers user features (site-visit records and credit-assessment records) and product attributes (issuing bank, application requirements, etc.). From these open data, the report provides:

1. Descriptive statistics of the users who placed orders, with visualizations.
2. The relationship between the order approval result and the user and order data.

Finally, the appendix contains the code and computing environment for the preliminary-round problem. The final model evaluation is based on the F1 score, i.e. the harmonic mean of recall and precision.


![pic](exp1.png)

## Part 1: Descriptive Statistics

Starting from the raw open data, we first merge the variables with few missing values into the `order_train` data (which contains, for each order, the user ID, product ID, etc.) and do some feature engineering, e.g. flags for "the product requires a house as collateral and the user owns a house" and "the product requires a car as collateral and the user owns a car". Even after merging, many records still cannot be matched:


Below are the variables with missing values and, among the missing records, the share of approved applications. Notably, quite a few variables share the same missing proportion: these are the page-view (PV) style variables from `user_final.txt`, the file recording users' visits to the rong360 website. This part of the data has many missing values, and in the later prediction step we found that dropping it actually improves the final result considerably, so the final program does not use the site-visit data.

```{r echo = F,eval = F}
load("tr4.Rdata")
names(train_final)
library(ggplot2)
library(reshape2)
train_final[train_final == -10000] = NA   # -10000 encodes missing

names(train_final)[81] = "result"
dat1 = melt(train_final, id = "result")
ggplot(data = dat1, aes(variable, fill = is.na(value))) +
  geom_bar() + coord_flip() +
  labs(title = "Proportion of missing values", fill = "missing")


dat2 = dat1[is.na(dat1$value), ]
ggplot(data = dat2, aes(variable, fill = result == 1)) +
  geom_bar() + coord_flip() +
  labs(title = "Approval rate among missing records", fill = "approved")

```

![fig](Rplot1.png)

Missing-proportion table:


## Part 2: Relationship Analysis

To describe how the variables relate to approval, we mine the relationships with several different models, along two main lines:

1. Descriptive models
2. Predictive models

Descriptive models aim to describe the relationship between the variables and the final prediction target, so their predictive performance is modest; the representative models are logistic regression and the CART decision tree. Both describe the data well: the logistic model quantifies each variable's effect on the odds of a successful application, while CART yields a simple decision tree from which business decision rules can be read off.

Predictive models, by contrast, use a stronger model to predict and effectively raise the F1 score, e.g. random forest, gbm, SVM, and xgboost; here we mainly use the boosted-tree methods gbm and xgboost. Because the data set is large, random forest could not even fit a simple model, and SVM requires complete, all-numeric data, so we finally chose xgboost.

Ultimately, the best prediction would use stacking: combining the different predictions and voting for an optimal conclusion. Given the time constraints and the source code that had to be submitted, stacking was not used.

Models are validated with a hold-out scheme: the data are split into a training set and a test set, the model is trained on the former and validated on the latter, and the best parameters are chosen by comparing the two. The first 80% of the records form the training set and the last 20% the test set. We split this way because the competition's training and test sets differ in structure: the training set covers dates 754-1546 while the test set covers 1547-1652, and over this shift the model's parameters change (for instance, the approval rate drops from 17% in the training set to about 11%).
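
Concretely, the split and the F1 metric used throughout look like this (extracted from the appendix code for reference; `train_final` is the merged training table, and `pred`/`labels` are 0/1 vectors):
```{r eval = F}
n = dim(train_final)[1]
index = round(n * 0.8):n          # the last 20% (later dates) held out
trainx = train_final[-index, ]    # training set
testx  = train_final[index, ]     # validation set

f1 = function(pred, labels){
  precision = sum(pred == 1 & labels == 1) / sum(pred == 1)
  recall    = sum(pred == 1 & labels == 1) / sum(labels == 1)
  2 * precision * recall / (precision + recall)
}
```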

### Descriptive models

#### Logistic regression
For description we first fit a logistic model:
```{r eval = F}
# load("tr4.Rdata")
model = glm(result ~ . - bank_id, data = train_final, family = "binomial")
summary(model)
pre = predict(model, test_final, type = "response")  # predicted probabilities
out = (pre > 0.215) + 0
writeLines(as.character(out), "submit/3.26.1.txt") # 0.2985
```

glm automatically expands factor variables into dummy variables. `bank_id` is removed because it is a factor with several hundred levels, which would overfit. Overall, the logistic model shows how each variable relates to the final outcome, but it is slow to fit and its predictions are only moderate (F1 never above 0.30).


#### rpart decision tree

A decision tree is a decision-analysis method that, given the probabilities of various scenarios, builds a tree to evaluate project risk and judge feasibility via the probability that the expected net present value is non-negative; it is a graphical application of probability analysis, named for the way its branches are drawn like the limbs of a tree. In machine learning, a decision tree is a predictive model representing a mapping between object attributes and object values. Entropy measures the disorder of a system; the ID3, C4.5 and C5.0 tree-generation algorithms use entropy, a measure based on the concept of entropy in information theory [2].

Here we grow the tree with CART, splitting on the Gini index.
![fig](Rplot03.png)

CART exposes some of the business structure nicely: for example, applicants with `fangkuan` > 90, `loan_quota_max` below 800k, and `fangkuan_num` above 292 have a high probability of approval...
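
For reference, a tree like the one in the figure could be grown along these lines (a minimal sketch, not the exact call used for the figure; it assumes `train_final` with the `result` label as in the logistic fit, and rpart's default classification split is already the Gini index):
```{r eval = F}
library(rpart)
fit = rpart(factor(result) ~ . - bank_id, data = train_final,
            method = "class", parms = list(split = "gini"))
plot(fit); text(fit, use.n = TRUE)   # draw the tree with node counts
```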

### Predictive models

The predictive models are mainly boosted trees, chiefly xgboost. xgboost stands for eXtreme Gradient Boosting; as the name suggests, it is a C++ implementation of the Gradient Boosting Machine, written by Tianqi Chen, a machine-learning researcher at the University of Washington. Feeling constrained in his research by the speed and accuracy of existing libraries, he began building the xgboost project a year earlier, and it gradually took shape the following summer. xgboost's defining features are that it automatically parallelizes over CPU threads and improves on the algorithm for higher accuracy. Its debut was Kaggle's Higgs boson signal-recognition competition, where its outstanding efficiency and high predictive accuracy drew wide attention from participants on the forum and secured it a place among more than 1,700 fiercely competing teams. As its reputation in the Kaggle community grew, teams have recently won first place in competitions with its help. [3]

Both of these algorithms (gbm and xgboost) are black-box models: no descriptive conclusions can be extracted from them to explain the relationships in the data well. Their predictive performance, however, is far better than the descriptive models' (roughly 5 points of F1 higher).


## Appendix 1: Code and Environment for Problem 3
Linux Deepin (Ubuntu) x64, score 0.353
```{r eval =FALSE}
library(caret)
library(dplyr)

################################## read the data

product = read.table("data/product.final.txt",
                     header = T, sep = "\t")

product_final = select(product, -is_p2p) %>%
  mutate(.,
         early_repayment_na = is.na(early_repayment) + 0,
         penalty_na = is.na(penalty) + 0,
         apply_ratio = fangkuan_num / apply_num)
product_final$apply_ratio[is.na(product_final$apply_ratio)] <- 0

apply(product_final, 2, function(x) sum(is.na(x)))
save(product_final, file = "newdata/product.Rdata")


rm(list = ls())


########################################## quality
quality = read.table("data/quality.final.txt",
                     header = T, sep = "\t")


quality = group_by(quality, user_id) %>%
  summarise(.,
            city_id = first(city_id),
            application_type = first(application_type),
            application_term = first(application_term),
            application_limit = first(application_limit),
            op_type = first(op_type),
            col_type = first(col_type),
            user_loan_experience = first(user_loan_experience),
            user_has_car = first(user_has_car),
            user_income_by_card = first(user_income_by_card),
            user_work_period = first(user_work_period),
            col_value = first(col_value),
            house_payment_records = first(house_payment_records),
            car_value = first(car_value),
            col_has_mortgage = first(col_has_mortgage),
            reapply_count = all(is.na(reapply_count)),
            product_type = first(product_type),
            apply_from = first(apply_from),
            platform = first(platform),
            spam_score = first(spam_score),
            mobile_verify = first(mobile_verify),
            source = first(source),
            medium = first(medium),
            mobile_source = first(mobile_source),
            mobile_medium = first(mobile_medium),
            bank_id = first(bank_id),
            quality_amount = n()
  )


quality_final = quality

save(quality_final, file = "newdata/quality.Rdata")
## load("newdata/quality.Rdata")

user = read.table("data/user.final.txt", header = T, sep = "\t")
me = first
# function(x){
#   mean(x, na.rm = T)
# }
# name.user = names(user)
user = group_by(user, user_id) %>%
  summarise(., pv = me(pv),
            pv_index_loan = me(pv_index_loan),
            pv_apply_total = me(pv_apply_total),
            pv_ask = me(pv_ask),
            pv_calculator = me(pv_calculator),
            order_count_loan = me(order_count_loan),
            pv_daikuan = me(pv_daikuan),
            pv_credit = me(pv_credit),
            pv_search_daikuan = me(pv_search_daikuan),
            pv_detail_daikuan = me(pv_detail_daikuan),
            pv_date = me(date), user_amount = n())
# names(user) = c(name.user, "user_amount")
user_final = user
save(user_final, file = "newdata/user.Rdata")

rm(list = ls())
load("newdata/product.Rdata")
load("newdata/quality.Rdata")
load("newdata/user.Rdata")
train = read.table("data/order_train.txt", header = T, sep = "\t")
test = read.table("data/order_test_no_label.txt", header = T, sep = "\t")




train_final = left_join(train, user_final, by = "user_id") %>%
  left_join(., quality_final, by = "user_id") %>%
  left_join(., product_final, by = "product_id")

test_final = left_join(test, user_final, by = "user_id") %>%
  left_join(., quality_final, by = "user_id") %>%
  left_join(., product_final, by = "product_id")


## All completed cases
dim(train_final)
# [1] 143152 59
dim(test_final)
# [1] 36108 58


yun = rbind(train_final[, -6], test_final)

##################### 1 city_fit

city_fit = yun %>%
  group_by(., city_id.x, city_id.y) %>%
  summarise(., n = n()) %>%
  arrange(., desc(n)) %>%
  summarise(., cityFit = first(as.character(city_id.y)))

yun = left_join(yun, city_fit, by = 'city_id.x') %>%
  mutate(., city_fit = 0 + (as.character(city_id.y) == cityFit),
         city_blank = is.na(city_id.x)) %>%
  select(., -c(city_id.y, cityFit)) %>%
  rename(., city_id = city_id.x)

######################### 2 limit

yun = mutate(yun, big_limit = limit > 100 & limit != 200,
             med_limit = limit < 100 & limit > 50 & limit %% 10 != 0,
             dig_limit = round(limit) != limit &
               !(limit %in% c(2.5, 3.5, 1.5, 4.5)))


######################### house


house_function = function(col_type, house_payment_records,
                          col_has_mortgage, col_value){
  a1 = col_type %in% c(1, 2, 3, 4, 5, 6, 8, 10, 12, 14, 16, 100)
  a2 = house_payment_records == 1
  a3 = col_has_mortgage == 2
  a4 = col_value != 0
  any(a1, a2, a3, a4, na.rm = T)
}

yun = mutate(yun, house_1 = col_type %in% c(1, 2, 3, 4, 5, 6, 8, 10, 12, 14, 16, 100),
             house_2 = col_value != 0)


table(yun$house_1[1:143152], train_final$result)


train_final = cbind(yun[1:143152, ], train$result)
test_final = yun[143153:179260, ]
AddVariable = function(data = train, yun = yun){
  require(dplyr)
  prepare1 = yun %>% group_by(., city_id) %>%
    summarise(., city_amount = n())
  prepare2 = yun %>% group_by(., bank_id.y) %>%
    summarise(., bank_amount = n())

  data = left_join(data, prepare1, by = "city_id") %>%
    left_join(., prepare2, by = "bank_id.y")

  mutate(data, weekday = factor(date %% 7),
         month = factor(date %% 365 %/% 31),
         fit_user = is.na(user_amount),
         fit_quality = is.na(quality_amount),
         big_city = city_amount > 1000,
         med_bank = bank_amount > 30,
         big_bank = bank_amount > 100,
         house_3 = house * house_1,
         house_4 = house * house_2 * house_1
  )

}



test_final = AddVariable(data = test_final,
                         yun = yun) %>%
  select(., -c(bank_id.y,
               product_id, user_id))


train_final = AddVariable(data = train_final,
                          yun = yun) %>%
  select(., -c(bank_id.y,
               product_id, user_id))

train_final[is.na(train_final)] = -10000
test_final[is.na(test_final)] = -10000

levels = table(trainx$city_id) %>%
  sort(., decreasing = T)


# j1 / j2 are extra feature columns constructed elsewhere (not shown)
train_final = cbind(train_final, j1)
test_final = cbind(test_final, j2)

n = dim(train_final)[1]

index = round(n * 0.8):n

trainx = train_final[-index, ]
testx = train_final[index, ]
save(train_final, test_final, trainx, testx, file = "tr4.Rdata")
rm(list = ls())

load("tr4.Rdata")
require(xgboost)
require(methods)
require(plyr)

apply(trainx, 2, function(x) sum(x == -10000))

load("newdata//user.Rdata")
name = names(user_final)
names(trainx) %in% name
trainx = trainx[, !names(trainx) %in% name]
testx = testx[, !names(testx) %in% name]
train_final = train_final[, !names(train_final) %in% name]
test_final = test_final[, !names(test_final) %in% name]


fc = function(pre = res, labels = train$result){
  precision = sum(pre == 1 & labels == 1) / sum(pre == 1)
  recall    = sum(pre == 1 & labels == 1) / sum(labels == 1)
  2 * precision * recall / (precision + recall)  # F1 score
}


change = function(x){
  as.numeric(x)
}




label <- as.numeric(as.character(trainx[, 69]))

data <- as.matrix(colwise(change)(trainx[, -69]))

data2 <- as.matrix(colwise(change)(testx[, -69]))
label2 = as.numeric(as.character(testx[, 69]))
# weight <- as.numeric(dtrain[[32]]) * testsize / length(label)

xgmat <- xgb.DMatrix(data, label = label, missing = -10000)
param <- list("objective" = "binary:logistic",
              "bst:eta" = 0.05,
              "bst:max_depth" = 5,
              "eval_metric" = "logloss",
              "gamma" = 1,
              "silent" = 1,
              "nthread" = 16,
              "min_child_weight" = 1.45
)
watchlist <- list("train" = xgmat)
nround = 300
print("loading data end, start to boost trees")





label3 <- as.numeric(as.character(train_final[, 69]))
data3 <- as.matrix(colwise(as.numeric)(train_final[, -69]))


data4 <- as.matrix(colwise(as.numeric)(test_final))

xgmat <- xgb.DMatrix(data3, label = label3, missing = -10000)

bst2 = xgb.train(param, xgmat, nround, watchlist)
# bst.cv = xgb.cv(param, xgmat, nround, nfold = 10, watchlist)
pre3 = predict(bst2, data3)

# scan thresholds 0.001 ... 0.999 for the one maximising F1 on the training set
ans1 = rep(0, 999)
for (i in 1:999){
  j = 0.001 * i
  res = pre3 > j
  ans1[i] = fc(pre = res, labels = label3)
}
summary(ans1)

which.max(ans1)

pre.final = predict(bst2, data4)
out = pre.final > 0.23
writeLines(as.character(out), "submit/4.14.1.txt") # 0.3417
```


## Appendix 2: References
[1] rong360.com
[2] http://baike.baidu.com/link?url=MkyRLzJyfyGzVwQXhyvZURRe625idG_-YaqXjCQudyKgE7mpiIklwxs-pgvUh_ps3m3Fveovoqjg6PjVEeL5bK
[3] http://cos.name/2015/03/xgboost/

![fig](Rplot.png)
--------------------------------------------------------------------------------