├── Spark中组件Mllib的学习100之模版.md ├── 1数据类型 ├── Spark中组件Mllib的学习12之密集向量和稀疏向量的生成.md ├── Spark中组件Mllib的学习13之给向量打标签.md ├── Spark中组件Mllib的学习15之创建分布式矩阵.md ├── Spark中组件Mllib的学习16之分布式行矩阵的四种形式.md ├── Spark中组件Mllib的学习14之从文本中读取带标签的数据,生成带label的向量.md └── Spark中组件Mllib的学习3之用户相似度计算.md ├── 7特征提取和转换 ├── Spark中组件Mllib的学习53之HashingTF理解和使用.md ├── Spark中组件Mllib的学习64之元素智能乘积ElementwiseProduct.md ├── Spark中组件Mllib的学习60之归一化(Normalizer)Normalization using L^Inf distance.md ├── Spark中组件Mllib的学习60之归一化(Normalizer)Normalization using L^Inf distance.md ├── Spark中组件Mllib的学习59之归一化(Normalizer)Normalization using L2 distance.md ├── Spark中组件Mllib的学习59之归一化(Normalizer)Normalization using L2 distance.md ├── Spark中组件Mllib的学习58之归一化(Normalizer)Normalization using L1 distance.md ├── Spark中组件Mllib的学习58之归一化(Normalizer)Normalization using L1 distance.md ├── Spark中组件Mllib的学习53之Word2Vec简单实例.md ├── Spark中组件Mllib的学习62之特征选择中的卡方选择器.md ├── Spark中组件Mllib的学习52之TF-IDF学习.md └── Spark中组件Mllib的学习54之word2Vec实例分析(text8数据集).md ├── 2基本统计 ├── Spark中组件Mllib的学习20之假设检验-卡方检验.md ├── Spark中组件Mllib的学习17之colStats_以列为基础计算统计量的基本数据.md ├── Spark中组件Mllib的学习21之随机数-RandomRDD产生.md ├── Spark中组件Mllib的学习42之rowMatrix的QR分解.md ├── Spark中组件Mllib的学习22之假设检验-卡方检验概念理解.md ├── Spark中组件Mllib的学习19之分层抽样.md └── Spark中组件Mllib的学习3之用户相似度计算.md ├── 8频繁项挖掘 ├── Spark中组件Mllib的学习67之关联规则AssociationRules.md ├── Spark中组件Mllib的学习68之PrefixSpan.md └── Spark中组件Mllib的学习66之FP-growth.md ├── 3分类和回归 ├── Spark中组件Mllib的学习24之线性回归1-小数据集.md ├── Spark中组件Mllib的学习35之随机森林(entropy)进行分类.md ├── Spark中组件Mllib的学习25之线性回归2-较大数据集(多元).md ├── Spark中组件Mllib的学习30之逻辑回归LogisticRegressionWithLBFGS.md ├── Spark中组件Mllib的学习34之决策树(使用entropy)_.md ├── Spark中组件Mllib的学习26之逻辑回归-简单数据集,带预测.md ├── Spark中组件Mllib的学习38之随机森林(使用variance)进行回归.md ├── Spark中组件Mllib的学习36之决策树(使用variance)进行回归.md ├── Spark中组件Mllib的学习33之决策树(使用Gini).md ├── Spark中组件Mllib的学习32之朴素贝叶斯分类器(伯努利朴素贝叶斯)_.md ├── Spark中组件Mllib的学习31之朴素贝叶斯分类器(多项式朴素贝叶斯).md ├── Spark中组件Mllib的学习37之随机森林(Gini)进行分类.md ├── Spark中组件Mllib的学习23之随机梯度下降(SGD).md ├── Spark中组件Mllib的学习40之梯度提升树(GBT)用于回归_.md └── Spark中组件Mllib的学习41之保序回归(Isotonic regression).md ├── 9评估度量 ├── Spark中组件Mllib的学习73之回归问题的评估.md ├── Spark中组件Mllib的学习71之对多标签分类进行评估.md ├── Spark中组件Mllib的学习70之对多类分类结果进行评估Multiclass classification.md └── Spark中组件Mllib的学习72之RankingSystem进行评估.md ├── 6降维 ├── Spark中组件Mllib的学习49之奇异值分解SVD(Singular value decomposition).md └── Spark中组件Mllib的学习50之主成份分析PCA.md ├── 5聚类 ├── Spark中组件Mllib的学习46之Power iteration clustering.md ├── Spark中组件Mllib的学习45之用高斯混合模型来预测.md ├── Spark中组件Mllib的学习44之高斯混合聚类GaussianMixture.md ├── Spark中组件Mllib的学习48之流式k均值(Streaming kmeans).md ├── Spark中组件Mllib的学习47之隐含狄利克雷分布(Latent Dirichlet allocation (LDA)学习.md └── Spark中组件Mllib的学习47之隐含狄利克雷分布(Latent Dirichlet allocation,LDA)学习.md ├── README.md ├── 11优化 └── Spark中组件Mllib的学习75之L-BFGS.md └── 10PMML模型输出 └── Spark中组件Mllib的学习74之预言模型标记语言PMML.md /Spark中组件Mllib的学习100之模版.md: -------------------------------------------------------------------------------- 1 | 2 | 更多代码请见:https://github.com/xubo245/SparkLearning 3 | 4 | Spark中组件Mllib的学习 5 | 6 | 1.解释 7 | 8 | 9 | 10 | 11 | 2.代码: 12 | 13 | 14 | 15 | 3.结果: 16 | 17 | 18 | 19 | 参考 20 | 21 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 22 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 23 | 【3】https://github.com/xubo245/SparkLearning 24 | 【4】book:Machine Learning with Spark ,Nick Pertreach 25 | 【5】book:Spark MlLib机器学习实战 26 | -------------------------------------------------------------------------------- /1数据类型/Spark中组件Mllib的学习12之密集向量和稀疏向量的生成.md: 
-------------------------------------------------------------------------------- 1 | 更多代码请见:https://github.com/xubo245/SparkLearning 2 | Spark中组件Mllib的学习之基础概念篇 3 | 1解释 4 | mllib生成Vector 5 | 6 | 2.代码: 7 | 8 | ``` 9 | /** 10 | * @author xubo 11 | * ref:Spark MlLib机器学习实战 12 | * more code:https://github.com/xubo245/SparkLearning 13 | * more blog:http://blog.csdn.net/xubo245 14 | */ 15 | package org.apache.spark.mllib.learning.basic 16 | 17 | import org.apache.spark.mllib.linalg.Vectors 18 | 19 | /** 20 | * Created by xubo on 2016/5/23. 21 | * Vector 22 | */ 23 | object VectorLearning { 24 | def main(args: Array[String]) { 25 | 26 | val vd = Vectors.dense(2, 0, 6) 27 | println(vd(2)) 28 | println(vd) 29 | 30 | //数据个数,序号,value 31 | val vs = Vectors.sparse(4, Array(0, 1, 2, 3), Array(9, 5, 2, 7)) 32 | println(vs(2)) 33 | println(vs) 34 | 35 | val vs2 = Vectors.sparse(4, Array(0, 2, 1, 3), Array(9, 5, 2, 7)) 36 | println(vs2(2)) 37 | println(vs2) 38 | 39 | 40 | } 41 | } 42 | 43 | ``` 44 | 45 | 3.结果: 46 | 47 | ``` 48 | 6.0 49 | [2.0,0.0,6.0] 50 | 2.0 51 | (4,[0,1,2,3],[9.0,5.0,2.0,7.0]) 52 | 5.0 53 | (4,[0,2,1,3],[9.0,5.0,2.0,7.0]) 54 | ``` 55 | 56 | 参考 57 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 58 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 59 | 【3】https://github.com/xubo245/SparkLearning 60 | -------------------------------------------------------------------------------- /1数据类型/Spark中组件Mllib的学习13之给向量打标签.md: -------------------------------------------------------------------------------- 1 | 更多代码请见:https://github.com/xubo245/SparkLearning 2 | Spark中组件Mllib的学习之基础概念篇 3 | 1解释 4 | 给数据打label,用于后续监督学习等 5 | 6 | 2.代码: 7 | 8 | ``` 9 | /** 10 | * @author xubo 11 | * ref:Spark MlLib机器学习实战 12 | * more code:https://github.com/xubo245/SparkLearning 13 | * more blog:http://blog.csdn.net/xubo245 14 | */ 15 | package org.apache.spark.mllib.learning.basic 16 | 17 | import org.apache.spark.mllib.util.MLUtils 18 | import org.apache.spark.{SparkConf, SparkContext} 19 | import org.apache.spark.mllib.linalg.Vectors 20 | import org.apache.spark.mllib.regression.LabeledPoint 21 | 22 | /** 23 | * Created by xubo on 2016/5/23. 
24 | * 给Vector打Label 25 | */ 26 | object LabeledPointLearning { 27 | def main(args: Array[String]) { 28 | 29 | val vd = Vectors.dense(2, 0, 6) 30 | val pos = LabeledPoint(1, vd) //对密集向量建立标记点 31 | println(pos.features) 32 | println(pos.label) 33 | println(pos) 34 | 35 | val vs = Vectors.sparse(4, Array(0, 1, 2, 3), Array(9, 5, 2, 7)) 36 | val neg = LabeledPoint(2, vs) //对稀疏向量建立标记点 37 | println(neg.features) 38 | println(neg.label) 39 | println(neg) 40 | 41 | 42 | } 43 | } 44 | 45 | ``` 46 | 47 | 3.结果: 48 | 49 | ``` 50 | [2.0,0.0,6.0] 51 | 1.0 52 | (1.0,[2.0,0.0,6.0]) 53 | (4,[0,1,2,3],[9.0,5.0,2.0,7.0]) 54 | 2.0 55 | (2.0,(4,[0,1,2,3],[9.0,5.0,2.0,7.0])) 56 | 57 | ``` 58 | 59 | 参考 60 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 61 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 62 | 【3】https://github.com/xubo245/SparkLearning 63 | -------------------------------------------------------------------------------- /7特征提取和转换/Spark中组件Mllib的学习53之HashingTF理解和使用.md: -------------------------------------------------------------------------------- 1 | 2 | 更多代码请见:https://github.com/xubo245/SparkLearning 3 | 4 | Spark中组件Mllib的学习 5 | 6 | 1.解释 7 | 8 | HashingTF将文档的每行转换成(hash值(2^20),(词的id,单个字符时同ascii码一样),(词频))形式 9 | 10 | 11 | 2.代码: 12 | 13 | test("hashing tf on an RDD") { 14 | val hashingTF = new HashingTF 15 | val localDocs: Seq[Seq[String]] = Seq( 16 | "a a b b b c d".split(" "), 17 | "a b c d a b c".split(" "), 18 | "c b a c b a a".split(" ")) 19 | val docs = sc.parallelize(localDocs, 2) 20 | assert(hashingTF.transform(docs).collect().toSet === localDocs.map(hashingTF.transform).toSet) 21 | 22 | println("docs:") 23 | docs.foreach(println) 24 | println("hashingTF.transform(docs).collect():") 25 | hashingTF.transform(docs).collect().foreach(println) 26 | println(" localDocs.map(hashingTF.transform):") 27 | localDocs.map(hashingTF.transform).foreach(println) 28 | 29 | } 30 | 31 | 32 | 3.结果: 33 | 34 | docs: 35 | WrappedArray(a, a, b, b, b, c, d) 36 | WrappedArray(a, b, c, d, a, b, c) 37 | WrappedArray(c, b, a, c, b, a, a) 38 | hashingTF.transform(docs).collect(): 39 | (1048576,[97,98,99,100],[2.0,3.0,1.0,1.0]) 40 | (1048576,[97,98,99,100],[2.0,2.0,2.0,1.0]) 41 | (1048576,[97,98,99],[3.0,2.0,2.0]) 42 | localDocs.map(hashingTF.transform): 43 | (1048576,[97,98,99,100],[2.0,3.0,1.0,1.0]) 44 | (1048576,[97,98,99,100],[2.0,2.0,2.0,1.0]) 45 | (1048576,[97,98,99],[3.0,2.0,2.0]) 46 | 47 | 48 | aa的id为3104 49 | 50 | 51 | 52 | 参考 53 | 54 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 55 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 56 | 【3】https://github.com/xubo245/SparkLearning 57 | 【4】book:Machine Learning with Spark ,Nick Pertreach 58 | 【5】book:Spark MlLib机器学习实战 59 | -------------------------------------------------------------------------------- /2基本统计/Spark中组件Mllib的学习20之假设检验-卡方检验.md: -------------------------------------------------------------------------------- 1 | 更多代码请见:https://github.com/xubo245/SparkLearning 2 | Spark中组件Mllib的学习之基础概念篇 3 | 1解释 4 | 分别对Vector和Matrix进行卡方检验 5 | 6 | 7 | 2.代码: 8 | 9 | ``` 10 | /** 11 | * @author xubo 12 | * ref:Spark MlLib机器学习实战 13 | * more code:https://github.com/xubo245/SparkLearning 14 | * more blog:http://blog.csdn.net/xubo245 15 | */ 16 | package org.apache.spark.mllib.learning.basic 17 | 18 | import org.apache.spark.mllib.linalg.{Matrices, Vectors} 19 | import org.apache.spark.mllib.stat.Statistics 20 | import org.apache.spark.{SparkConf, SparkContext} 21 | 22 | /** 23 | * Created by xubo on 2016/5/23. 
24 | */ 25 | object ChiSqLearning { 26 | def main(args: Array[String]) { 27 | val vd = Vectors.dense(1, 2, 3, 4, 5) 28 | val vdResult = Statistics.chiSqTest(vd) 29 | println(vdResult) 30 | println("-------------------------------") 31 | val mtx = Matrices.dense(3, 2, Array(1, 3, 5, 2, 4, 6)) 32 | val mtxResult = Statistics.chiSqTest(mtx) 33 | println(mtxResult) 34 | //print :方法、自由度、方法的统计量、p值 35 | } 36 | } 37 | 38 | ``` 39 | 40 | 3.结果: 41 | 42 | ``` 43 | Chi squared test summary: 44 | method: pearson 45 | degrees of freedom = 4 46 | statistic = 3.333333333333333 47 | pValue = 0.5036682742334986 48 | No presumption against null hypothesis: observed follows the same distribution as expected.. 49 | ------------------------------- 50 | Chi squared test summary: 51 | method: pearson 52 | degrees of freedom = 2 53 | statistic = 0.14141414141414144 54 | pValue = 0.931734784568187 55 | No presumption against null hypothesis: the occurrence of the outcomes is statistically independent.. 56 | ``` 57 | 58 | 参考 59 | 60 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 61 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 62 | 【3】https://github.com/xubo245/SparkLearning -------------------------------------------------------------------------------- /8频繁项挖掘/Spark中组件Mllib的学习67之关联规则AssociationRules.md: -------------------------------------------------------------------------------- 1 | 2 | 更多代码请见:https://github.com/xubo245/SparkLearning 3 | 4 | Spark中组件Mllib的学习 5 | 6 | 1.解释 7 | 8 | Association Rules 9 | 10 | AssociationRules implements a parallel rule generation algorithm for constructing rules that have a single item as the consequent. 11 | 12 | 13 | 2.代码: 14 | 15 | /** 16 | * @author xubo 17 | * ref:http://spark.apache.org/docs/1.5.2/mllib-guide.html 18 | * more code:https://github.com/xubo245/SparkLearning 19 | * more blog:http://blog.csdn.net/xubo245 20 | */ 21 | package org.apache.spark.mllib.FrequentPatternMining 22 | 23 | import org.apache.spark.util.SparkLearningFunSuite 24 | 25 | /** 26 | * Created by xubo on 2016/6/13. 
27 | */ 28 | class AssociationRulesFunSuite extends SparkLearningFunSuite { 29 | test("testFunSuite") { 30 | import org.apache.spark.rdd.RDD 31 | import org.apache.spark.mllib.fpm.AssociationRules 32 | import org.apache.spark.mllib.fpm.FPGrowth.FreqItemset 33 | 34 | val freqItemsets = sc.parallelize(Seq( 35 | new FreqItemset(Array("a"), 15L), 36 | new FreqItemset(Array("b"), 35L), 37 | new FreqItemset(Array("a", "b"), 12L) 38 | )); 39 | 40 | val ar = new AssociationRules() 41 | .setMinConfidence(0.8) 42 | val results = ar.run(freqItemsets) 43 | 44 | results.collect().foreach { rule => 45 | println("[" + rule.antecedent.mkString(",") 46 | + "=>" 47 | + rule.consequent.mkString(",") + "]," + rule.confidence) 48 | } 49 | } 50 | } 51 | 52 | 53 | 3.结果: 54 | 55 | [a=>b],0.8 56 | 57 | 结果分析:12/15=0.8 58 | 59 | 参考 60 | 61 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 62 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 63 | 【3】https://github.com/xubo245/SparkLearning 64 | 【4】book:Machine Learning with Spark ,Nick Pertreach 65 | 【5】book:Spark MlLib机器学习实战 66 | -------------------------------------------------------------------------------- /3分类和回归/Spark中组件Mllib的学习24之线性回归1-小数据集.md: -------------------------------------------------------------------------------- 1 | 更多代码请见:https://github.com/xubo245/SparkLearning 2 | Spark中组件Mllib的学习之回归分析篇 3 | 1解释 4 | 简单的对6组数据进行model的training,然后再利用model来predict具体的值 5 | 。过程中有输出model的权重 6 | 公式:f(x)=aX1+bX2 7 | 8 | 9 | 2.代码: 10 | 11 | ``` 12 | /** 13 | * @author xubo 14 | * ref:Spark MlLib机器学习实战 15 | * more code:https://github.com/xubo245/SparkLearning 16 | * more blog:http://blog.csdn.net/xubo245 17 | */ 18 | package org.apache.spark.mllib.learning.regression 19 | 20 | import org.apache.spark.mllib.linalg.Vectors 21 | import org.apache.spark.mllib.regression.{LinearRegressionWithSGD, LabeledPoint} 22 | import org.apache.spark.{SparkConf, SparkContext} 23 | 24 | /** 25 | * Created by xubo on 2016/5/23. 
26 | */ 27 | object LinearRegression2Learning { 28 | def main(args: Array[String]) { 29 | val conf = new SparkConf().setMaster("local[4]").setAppName(this.getClass().getSimpleName().filter(!_.equals('$'))) 30 | val sc = new SparkContext(conf) 31 | 32 | val data = sc.textFile("file/data/mllib/input/ridge-data/lpsa2.data") //获取数据集路径 33 | val parsedData = data.map { line => //开始对数据集处理 34 | val parts = line.split(',') //根据逗号进行分区 35 | LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble))) 36 | }.cache() //转化数据格式 37 | val model = LinearRegressionWithSGD.train(parsedData, 100, 0.1) //建立模型 38 | val result = model.predict(Vectors.dense(2,1)) //通过模型预测模型 39 | println("model weights:") 40 | println(model.weights) 41 | println("result:") 42 | println(result) //打印预测结果 43 | sc.stop 44 | } 45 | } 46 | 47 | ``` 48 | 数据: 49 | 50 | ``` 51 | 5,1 1 52 | 7,2 1 53 | 9,3 2 54 | 11,4 1 55 | 19,5 3 56 | 18,6 2 57 | ``` 58 | 59 | 3.结果: 60 | 61 | ``` 62 | model weights: 63 | [2.54036018771162,1.5591873026695686] 64 | result: 65 | 6.6399076780928095 66 | ``` 67 | 68 | 参考 69 | 70 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 71 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 72 | 【3】https://github.com/xubo245/SparkLearning 73 | 【4】Spark MlLib机器学习实战 74 | -------------------------------------------------------------------------------- /8频繁项挖掘/Spark中组件Mllib的学习68之PrefixSpan.md: -------------------------------------------------------------------------------- 1 | 2 | 更多代码请见:https://github.com/xubo245/SparkLearning 3 | 4 | Spark中组件Mllib的学习 5 | 6 | 1.解释 7 | 8 | PrefixSpan 是一种序列模式挖掘算法 9 | 10 | 序列模式定义:给定一个由不同序列组成的集合,其中,每个序列由不同的元素按顺序有序排列,每个元素由不同项目组成,同时给定一个用户指定的最小支持度阈值,序列模式挖掘就是找出所有的频繁子序列,即该子序列在序列集中的出现频率不低于用户指定的最小支持度阈值 11 | 12 | Spak.mllib PrefixSpan 需要配置以下参数 13 | 14 | 1) minSupport : 满足频度序列模式的最小支持度 15 | 16 | 2) maxPatternLength: 频度序列的最大长度。凡是超过此长度的频度序列都会被乎略。 17 | 18 | 3) maxLocalProjDBSize:本地迭代处理投影数据库(projected database)之前,需要满足前缀投影数据库(prefix-projecteddatabase)中最大的物品数。 19 | 20 | 2.代码: 21 | 22 | /** 23 | * @author xubo 24 | * ref:http://spark.apache.org/docs/1.5.2/mllib-guide.html 25 | * more code:https://github.com/xubo245/SparkLearning 26 | * more blog:http://blog.csdn.net/xubo245 27 | */ 28 | package org.apache.spark.mllib.FrequentPatternMining 29 | 30 | import org.apache.spark.util.SparkLearningFunSuite 31 | 32 | /** 33 | * Created by xubo on 2016/6/13. 
34 | */ 35 | class PrefixSpanFunSuite extends SparkLearningFunSuite { 36 | test("testFunSuite") { 37 | import org.apache.spark.mllib.fpm.PrefixSpan 38 | 39 | val sequences = sc.parallelize(Seq( 40 | Array(Array(1, 2), Array(3)), 41 | Array(Array(1), Array(3, 2), Array(1, 2)), 42 | Array(Array(1, 2), Array(5)), 43 | Array(Array(6)) 44 | ), 2).cache() 45 | val prefixSpan = new PrefixSpan() 46 | .setMinSupport(0.5) 47 | .setMaxPatternLength(5) 48 | val model = prefixSpan.run(sequences) 49 | model.freqSequences.collect().foreach { freqSequence => 50 | println( 51 | freqSequence.sequence.map(_.mkString("[", ", ", "]")).mkString("[", ", ", "]") + ", " + freqSequence.freq) 52 | } 53 | } 54 | } 55 | 56 | 57 | 3.结果: 58 | 59 | [[2]], 3 60 | [[3]], 2 61 | [[1]], 3 62 | [[2, 1]], 3 63 | [[1], [3]], 2 64 | 65 | 参考 66 | 67 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 68 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 69 | 【3】https://github.com/xubo245/SparkLearning 70 | 【4】book:Machine Learning with Spark ,Nick Pertreach 71 | 【5】book:Spark MlLib机器学习实战 72 | 【6】http://www.jone.tech/?p=41 73 | -------------------------------------------------------------------------------- /7特征提取和转换/Spark中组件Mllib的学习64之元素智能乘积ElementwiseProduct.md: -------------------------------------------------------------------------------- 1 | 2 | 更多代码请见:https://github.com/xubo245/SparkLearning 3 | 4 | Spark中组件Mllib的学习 5 | 6 | 1.解释 7 | 8 |   元素智能乘积 9 | 10 |   ElementwiseProduct对每一个输入向量乘以一个给定的“权重”向量。换句话说,就是通过一个乘子对数据集的每一列进行缩放。这个转换可以表示为如下的形式: 11 | 12 | ![](http://i.imgur.com/d6lY7G0.png) 13 | 14 | 2.代码: 15 | 16 | /** 17 | * @author xubo 18 | * ref:http://spark.apache.org/docs/1.5.2/mllib-guide.html 19 | * more code:https://github.com/xubo245/SparkLearning 20 | * more blog:http://blog.csdn.net/xubo245 21 | */ 22 | package org.apache.spark.mllib.FeatureExtractionAndTransformation 23 | 24 | import org.apache.spark.util.SparkLearningFunSuite 25 | 26 | /** 27 | * Created by xubo on 2016/6/13. 
28 | */ 29 | class ElementwiseProductFunSuite extends SparkLearningFunSuite { 30 | test("testFunSuite") { 31 | 32 | 33 | import org.apache.spark.SparkContext._ 34 | import org.apache.spark.mllib.feature.ElementwiseProduct 35 | import org.apache.spark.mllib.linalg.Vectors 36 | 37 | // Create some vector data; also works for sparse vectors 38 | val data = sc.parallelize(Array(Vectors.dense(1.0, 2.0, 3.0), Vectors.dense(4.0, 5.0, 6.0))) 39 | 40 | val transformingVector = Vectors.dense(0.0, 1.0, 2.0) 41 | val transformer = new ElementwiseProduct(transformingVector) 42 | 43 | // Batch transform and per-row transform give the same results: 44 | val transformedData = transformer.transform(data) 45 | val transformedData2 = data.map(x => transformer.transform(x)) 46 | 47 | println("data:") 48 | data.foreach(println) 49 | println("transformer:" + transformer.scalingVec) 50 | 51 | // transformer.foreach(println) 52 | println("transformedData:") 53 | transformedData.foreach(println) 54 | 55 | } 56 | } 57 | 58 | 59 | 3.结果: 60 | 61 | data: 62 | [1.0,2.0,3.0] 63 | [4.0,5.0,6.0] 64 | transformer:[0.0,1.0,2.0] 65 | transformedData: 66 | [0.0,5.0,12.0] 67 | [0.0,2.0,6.0] 68 | 69 | 70 | 参考 71 | 72 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 73 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 74 | 【3】https://github.com/xubo245/SparkLearning 75 | 【4】book:Machine Learning with Spark ,Nick Pertreach 76 | 【5】book:Spark MlLib机器学习实战 77 | -------------------------------------------------------------------------------- /1数据类型/Spark中组件Mllib的学习15之创建分布式矩阵.md: -------------------------------------------------------------------------------- 1 | 更多代码请见:https://github.com/xubo245/SparkLearning 2 | Spark中组件Mllib的学习之基础概念篇 3 | 1解释 4 | 创建分布式矩阵 5 | 6 | 2.代码: 7 | 8 | ``` 9 | 10 | /** 11 | * @author xubo 12 | * ref:Spark MlLib机器学习实战 13 | * more code:https://github.com/xubo245/SparkLearning 14 | * more blog:http://blog.csdn.net/xubo245 15 | */ 16 | package org.apache.spark.mllib.learning.basic 17 | 18 | import org.apache.spark.mllib.linalg.Matrices 19 | import org.apache.spark.mllib.util.MLUtils 20 | import org.apache.spark.{SparkContext, SparkConf} 21 | 22 | /** 23 | * Created by xubo on 2016/5/23. 24 | * 创建分布式矩阵 25 | */ 26 | object MatrixLearning { 27 | def main(args: Array[String]) { 28 | val mx = Matrices.dense(2, 3, Array(1, 2, 3, 4, 5, 6)) //创建一个分布式矩阵 29 | println(mx) //打印结果 30 | 31 | val arr=(1 to 6).toArray.map(_.toDouble) 32 | val mx2 = Matrices.dense(2, 3, arr) //创建一个分布式矩阵 33 | println(mx2) //打印结果 34 | 35 | val arr3=(1 to 20).toArray.map(_.toDouble) 36 | val mx3 = Matrices.dense(4, 5, arr3) //创建一个分布式矩阵 37 | println(mx3) //打印结果 38 | println(mx3.index(0,0)) 39 | println(mx3.index(1,1)) 40 | println(mx3.index(2,2)) 41 | println(mx3.numRows) 42 | println(mx3.numCols) 43 | } 44 | } 45 | 46 | 47 | ``` 48 | 49 | 3.结果: 50 | 51 | ``` 52 | 1.0 3.0 5.0 53 | 2.0 4.0 6.0 54 | 55 | 1.0 3.0 5.0 56 | 2.0 4.0 6.0 57 | 58 | 1.0 5.0 9.0 13.0 17.0 59 | 2.0 6.0 10.0 14.0 18.0 60 | 3.0 7.0 11.0 15.0 19.0 61 | 4.0 8.0 12.0 16.0 20.0 62 | 0 63 | 5 64 | 10 65 | 4 66 | 5 67 | ``` 68 | 69 | 感觉index有问题: 70 | 源码: 71 | 72 | ``` 73 | /** Return the index for the (i, j)-th element in the backing array. 
*/ 74 | private[mllib] def index(i: Int, j: Int): Int 75 | ``` 76 | 77 | 在dense的时候: 78 | 79 | ``` 80 | private[mllib] def index(i: Int, j: Int): Int = { 81 | if (!isTransposed) i + numRows * j else j + numCols * i 82 | } 83 | 84 | ``` 85 | 如果按照这个源码理解没问题 86 | 将数组改为: 87 | 88 | ``` 89 | val arr3=(21 to 40).toArray.map(_.toDouble) 90 | ``` 91 | 92 | ``` 93 | 21.0 25.0 29.0 33.0 37.0 94 | 22.0 26.0 30.0 34.0 38.0 95 | 23.0 27.0 31.0 35.0 39.0 96 | 24.0 28.0 32.0 36.0 40.0 97 | 0 98 | 5 99 | 10 100 | 4 101 | 5 102 | ``` 103 | 104 | 疑问:如何按照坐标打印元素?比如(1,1)对应6 105 | 106 | 107 | 参考 108 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 109 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 110 | 【3】https://github.com/xubo245/SparkLearning 111 | -------------------------------------------------------------------------------- /9评估度量/Spark中组件Mllib的学习73之回归问题的评估.md: -------------------------------------------------------------------------------- 1 | 2 | 更多代码请见:https://github.com/xubo245/SparkLearning 3 | 4 | Spark中组件Mllib的学习 5 | 6 | 1.解释 7 | 8 | Regression model evaluation 9 | ![](http://i.imgur.com/uxyr2kb.png) 10 | 11 | 2.代码: 12 | 13 | /** 14 | * @author xubo 15 | * ref:http://spark.apache.org/docs/1.5.2/mllib-guide.html 16 | * more code:https://github.com/xubo245/SparkLearning 17 | * more blog:http://blog.csdn.net/xubo245 18 | */ 19 | package org.apache.spark.mllib.EvaluationMetrics 20 | 21 | import org.apache.spark.util.SparkLearningFunSuite 22 | 23 | /** 24 | * Created by xubo on 2016/6/13. 25 | */ 26 | class RegressionModelEvaluationFunSuite extends SparkLearningFunSuite { 27 | test("testFunSuite") { 28 | 29 | 30 | import org.apache.spark.mllib.regression.LabeledPoint 31 | import org.apache.spark.mllib.regression.LinearRegressionModel 32 | import org.apache.spark.mllib.regression.LinearRegressionWithSGD 33 | import org.apache.spark.mllib.linalg.Vectors 34 | import org.apache.spark.mllib.evaluation.RegressionMetrics 35 | import org.apache.spark.mllib.util.MLUtils 36 | 37 | // Load the data 38 | val data = MLUtils.loadLibSVMFile(sc, "file/data/mllib/input/mllibFromSpark/sample_linear_regression_data.txt").cache() 39 | 40 | // Build the model 41 | val numIterations = 100 42 | val model = LinearRegressionWithSGD.train(data, numIterations) 43 | 44 | // Get predictions 45 | val valuesAndPreds = data.map{ point => 46 | val prediction = model.predict(point.features) 47 | (prediction, point.label) 48 | } 49 | 50 | // Instantiate metrics object 51 | val metrics = new RegressionMetrics(valuesAndPreds) 52 | 53 | // Squared error 54 | println(s"MSE = ${metrics.meanSquaredError}") 55 | println(s"RMSE = ${metrics.rootMeanSquaredError}") 56 | 57 | // R-squared 58 | println(s"R-squared = ${metrics.r2}") 59 | 60 | // Mean absolute error 61 | println(s"MAE = ${metrics.meanAbsoluteError}") 62 | 63 | // Explained variance 64 | println(s"Explained variance = ${metrics.explainedVariance}") 65 | 66 | 67 | } 68 | } 69 | 70 | 71 | 72 | 3.结果: 73 | 74 | MSE = 103.30968681818085 75 | RMSE = 10.164137288436281 76 | R-squared = 0.027639110967836777 77 | MAE = 8.148691907953307 78 | Explained variance = 2.888395201717894 79 | 80 | 参考 81 | 82 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 83 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 84 | 【3】https://github.com/xubo245/SparkLearning 85 | 【4】book:Machine Learning with Spark ,Nick Pertreach 86 | 【5】book:Spark MlLib机器学习实战 87 | -------------------------------------------------------------------------------- 
/2基本统计/Spark中组件Mllib的学习17之colStats_以列为基础计算统计量的基本数据.md: -------------------------------------------------------------------------------- 1 | 更多代码请见:https://github.com/xubo245/SparkLearning 2 | Spark中组件Mllib的学习之基础概念篇 3 | 1解释 4 | colStats:以列为基础计算统计量的基本数据 5 | 6 | 2.代码: 7 | 8 | ``` 9 | /** 10 | * @author xubo 11 | * ref:Spark MlLib机器学习实战 12 | * more code:https://github.com/xubo245/SparkLearning 13 | * more blog:http://blog.csdn.net/xubo245 14 | */ 15 | package org.apache.spark.mllib.learning.basic 16 | 17 | import org.apache.spark.mllib.linalg.Vectors 18 | import org.apache.spark.mllib.stat.Statistics 19 | import org.apache.spark.{SparkConf, SparkContext} 20 | 21 | /** 22 | * Created by xubo on 2016/5/23. 23 | */ 24 | object StatisticsColStatsLearning { 25 | def main(args: Array[String]) { 26 | val conf = new SparkConf().setMaster("local[4]").setAppName(this.getClass().getSimpleName().filter(!_.equals('$'))) 27 | val sc = new SparkContext(conf) 28 | // val rdd = sc.textFile("file/data/mllib/input/basic/MatrixRow.txt") //读取文件 29 | val rdd = sc.textFile("file/data/mllib/input/basic/stats.txt") //读取文件 30 | .map(_.split(' ') //按“ ”分割 31 | .map(_.toDouble)) //转成Double类型 32 | .map(line => Vectors.dense(line)) 33 | val summary = Statistics.colStats(rdd) //获取Statistics实例 34 | 35 | // rdd.foreach(each => print(each + " ")) 36 | rdd.foreach(println) 37 | println("rdd.count:" + rdd.count()) 38 | println() 39 | println(summary) 40 | println(summary.max) //最大 41 | println(summary.min) //最小 42 | println("count" + summary.count) //个数 43 | println(summary.numNonzeros) //非零 44 | println("variance:" + summary.variance) //方差 45 | println(summary.mean) //计算均值 46 | println(summary.variance) //计算标准差 47 | println(summary.normL1) //计算曼哈段距离:相加 48 | println(summary.normL2) //计算欧几里得距离:平方根 49 | 50 | 51 | // /行向量 52 | println("\n row Vector:") 53 | val vec = Vectors.dense(1, 2, 3, 4, 5) 54 | println(vec) 55 | println(vec.size) 56 | println(vec.numActives) 57 | // println(vec.variance)//不存在 58 | 59 | sc.stop 60 | } 61 | } 62 | 63 | ``` 64 | 65 | 3.结果: 66 | 67 | ``` 68 | [1.0] 69 | [2.0] 70 | [3.0] 71 | [4.0] 72 | [5.0] 73 | rdd.count:5 74 | 75 | org.apache.spark.mllib.stat.MultivariateOnlineSummarizer@7f9de19a 76 | [5.0] 77 | [1.0] 78 | count5 79 | [5.0] 80 | variance:[2.5] 81 | [3.0] 82 | [2.5] 83 | [15.0] 84 | [7.416198487095663] 85 | 86 | row Vector: 87 | [1.0,2.0,3.0,4.0,5.0] 88 | 5 89 | 5 90 | ``` 91 | 92 | 参考 93 | 94 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 95 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 96 | 【3】https://github.com/xubo245/SparkLearning 97 | -------------------------------------------------------------------------------- /3分类和回归/Spark中组件Mllib的学习35之随机森林(entropy)进行分类.md: -------------------------------------------------------------------------------- 1 | 更多代码请见:https://github.com/xubo245/SparkLearning 2 | Spark中组件Mllib的学习之分类篇 3 | 1解释 4 | 随机森林:RandomForest 5 | 大概思想就是生成多个决策树,都单独训练;如果来了一个数据,用各个决策树进行回归预测,如果是非连续结果,则取最多个数的值;如果连续,则取多个决策树结果的平均值。 6 | 7 | 2.代码: 8 | 9 | ``` 10 | /** 11 | * @author xubo 12 | * ref:Spark MlLib机器学习实战 13 | * more code:https://github.com/xubo245/SparkLearning 14 | * more blog:http://blog.csdn.net/xubo245 15 | */ 16 | package org.apache.spark.mllib.learning.classification 17 | 18 | import org.apache.spark.mllib.tree.{RandomForest, DecisionTree} 19 | import org.apache.spark.mllib.util.MLUtils 20 | import org.apache.spark.{SparkConf, SparkContext} 21 | 22 | /** 23 | * Created by xubo on 2016/5/23. 
24 | * 25 | */ 26 | object DecisionTrees3GBT { 27 | def main(args: Array[String]) { 28 | val conf = new SparkConf().setMaster("local[4]").setAppName(this.getClass().getSimpleName().filter(!_.equals('$'))) 29 | val sc = new SparkContext(conf) 30 | 31 | // Load and parse the data file. 32 | val data = MLUtils.loadLibSVMFile(sc, "file/data/mllib/input/classification/dt.txt") 33 | val numClasses = 2 //设定分类的数量 34 | val categoricalFeaturesInfo = Map[Int, Int]() //设置输入数据格式 35 | val numTrees = 3 //设置随机雨林中决策树的数目 36 | val featureSubsetStrategy = "auto" //设置属性在节点计算数 37 | val impurity = "entropy" //设定信息增益计算方式 38 | val maxDepth = 5 //设定树高度 39 | val maxBins = 3 //设定分裂数据集 40 | 41 | val model = RandomForest.trainClassifier(data, numClasses, categoricalFeaturesInfo, 42 | numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins) //建立模型 43 | 44 | model.trees.foreach(println) //打印每棵树的相信信息 45 | 46 | val labelAndPreds = data.take(2).map { point => 47 | val prediction = model.predict(point.features) 48 | (point.label, prediction) 49 | } 50 | labelAndPreds.foreach(println) 51 | println(model.toDebugString) 52 | sc.stop 53 | } 54 | } 55 | 56 | ``` 57 | 58 | 3.结果: 59 | 60 | ``` 61 | DecisionTreeModel classifier of depth 2 with 5 nodes 62 | DecisionTreeModel classifier of depth 1 with 3 nodes 63 | DecisionTreeModel classifier of depth 0 with 1 nodes 64 | (1.0,1.0) 65 | (0.0,0.0) 66 | TreeEnsembleModel classifier with 3 trees 67 | 68 | Tree 0: 69 | If (feature 2 <= 0.0) 70 | If (feature 0 <= 0.0) 71 | Predict: 0.0 72 | Else (feature 0 > 0.0) 73 | Predict: 1.0 74 | Else (feature 2 > 0.0) 75 | Predict: 0.0 76 | Tree 1: 77 | If (feature 2 <= 0.0) 78 | Predict: 1.0 79 | Else (feature 2 > 0.0) 80 | Predict: 0.0 81 | Tree 2: 82 | Predict: 1.0 83 | ``` 84 | 85 | 参考 86 | 87 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 88 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 89 | 【3】https://github.com/xubo245/SparkLearning -------------------------------------------------------------------------------- /7特征提取和转换/Spark中组件Mllib的学习60之归一化(Normalizer)Normalization using L^Inf distance.md: -------------------------------------------------------------------------------- 1 | 2 | 更多代码请见:https://github.com/xubo245/SparkLearning 3 | 4 | Spark中组件Mllib的学习 5 | 6 | 1.解释 7 | 8 | 归一化 9 | 10 | Normalization using L^Inf distance 11 | 12 | 公式:分母为元素中绝对值最大值 13 | 14 | 15 | 2.代码: 16 | 17 | test("Normalization using L^Inf distance.") { 18 | val lInfNormalizer = new Normalizer(Double.PositiveInfinity) 19 | 20 | val dataInf = data.map(lInfNormalizer.transform) 21 | val dataInfRDD = lInfNormalizer.transform(dataRDD) 22 | 23 | println("dataRDD:") 24 | dataRDD.foreach(println) 25 | println("dataInf:") 26 | dataInf.foreach(println) 27 | println("dataInfRDD:") 28 | dataInfRDD.foreach(println) 29 | 30 | assert((data, dataInf, dataInfRDD.collect()).zipped.forall { 31 | case (v1: DenseVector, v2: DenseVector, v3: DenseVector) => true 32 | case (v1: SparseVector, v2: SparseVector, v3: SparseVector) => true 33 | case _ => false 34 | }, "The vector type should be preserved after normalization.") 35 | 36 | assert((dataInf, dataInfRDD.collect()).zipped.forall((v1, v2) => v1 ~== v2 absTol 1E-5)) 37 | 38 | assert(dataInf(0).toArray.map(math.abs).max ~== 1.0 absTol 1E-5) 39 | assert(dataInf(2).toArray.map(math.abs).max ~== 1.0 absTol 1E-5) 40 | assert(dataInf(3).toArray.map(math.abs).max ~== 1.0 absTol 1E-5) 41 | assert(dataInf(4).toArray.map(math.abs).max ~== 1.0 absTol 1E-5) 42 | 43 | assert(dataInf(0) ~== Vectors.sparse(3, Seq((0, 
-0.86956522), (1, 1.0))) absTol 1E-5) 44 | assert(dataInf(1) ~== Vectors.dense(0.0, 0.0, 0.0) absTol 1E-5) 45 | assert(dataInf(2) ~== Vectors.dense(0.2, -0.36666667, -1.0) absTol 1E-5) 46 | assert(dataInf(3) ~== Vectors.sparse(3, Seq((1, 0.284375), (2, 1.0))) absTol 1E-5) 47 | assert(dataInf(4) ~== Vectors.dense(1.0, 0.12631579, 0.473684211) absTol 1E-5) 48 | assert(dataInf(5) ~== Vectors.sparse(3, Seq()) absTol 1E-5) 49 | } 50 | 51 | 52 | 3.结果: 53 | 54 | dataRDD: 55 | [0.6,-1.1,-3.0] 56 | (3,[1,2],[0.91,3.2]) 57 | (3,[0,1],[-2.0,2.3]) 58 | [0.0,0.0,0.0] 59 | (3,[0,1,2],[5.7,0.72,2.7]) 60 | (3,[],[]) 61 | dataInf: 62 | (3,[0,1],[-0.8695652173913044,1.0]) 63 | [0.0,0.0,0.0] 64 | [0.19999999999999998,-0.3666666666666667,-1.0] 65 | (3,[1,2],[0.284375,1.0]) 66 | (3,[0,1,2],[1.0,0.12631578947368421,0.4736842105263158]) 67 | (3,[],[]) 68 | dataInfRDD: 69 | [0.19999999999999998,-0.3666666666666667,-1.0] 70 | (3,[1,2],[0.284375,1.0]) 71 | (3,[0,1],[-0.8695652173913044,1.0]) 72 | [0.0,0.0,0.0] 73 | (3,[0,1,2],[1.0,0.12631578947368421,0.4736842105263158]) 74 | (3,[],[]) 75 | 76 | 77 | 结果分析: 78 | 79 | 80 | -3/3=-1 81 | 82 | 参考 83 | 84 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 85 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 86 | 【3】https://github.com/xubo245/SparkLearning 87 | 【4】book:Machine Learning with Spark ,Nick Pertreach 88 | 【5】book:Spark MlLib机器学习实战 89 | -------------------------------------------------------------------------------- /7特征提取和转换/Spark中组件Mllib的学习60之归一化(Normalizer)Normalization using L^Inf distance.md: -------------------------------------------------------------------------------- 1 | 2 | 更多代码请见:https://github.com/xubo245/SparkLearning 3 | 4 | Spark中组件Mllib的学习 5 | 6 | 1.解释 7 | 8 | 归一化 9 | 10 | Normalization using L^Inf distance 11 | 12 | 公式:分母为元素中绝对值最大值 13 | 14 | 15 | 2.代码: 16 | 17 | test("Normalization using L^Inf distance.") { 18 | val lInfNormalizer = new Normalizer(Double.PositiveInfinity) 19 | 20 | val dataInf = data.map(lInfNormalizer.transform) 21 | val dataInfRDD = lInfNormalizer.transform(dataRDD) 22 | 23 | println("dataRDD:") 24 | dataRDD.foreach(println) 25 | println("dataInf:") 26 | dataInf.foreach(println) 27 | println("dataInfRDD:") 28 | dataInfRDD.foreach(println) 29 | 30 | assert((data, dataInf, dataInfRDD.collect()).zipped.forall { 31 | case (v1: DenseVector, v2: DenseVector, v3: DenseVector) => true 32 | case (v1: SparseVector, v2: SparseVector, v3: SparseVector) => true 33 | case _ => false 34 | }, "The vector type should be preserved after normalization.") 35 | 36 | assert((dataInf, dataInfRDD.collect()).zipped.forall((v1, v2) => v1 ~== v2 absTol 1E-5)) 37 | 38 | assert(dataInf(0).toArray.map(math.abs).max ~== 1.0 absTol 1E-5) 39 | assert(dataInf(2).toArray.map(math.abs).max ~== 1.0 absTol 1E-5) 40 | assert(dataInf(3).toArray.map(math.abs).max ~== 1.0 absTol 1E-5) 41 | assert(dataInf(4).toArray.map(math.abs).max ~== 1.0 absTol 1E-5) 42 | 43 | assert(dataInf(0) ~== Vectors.sparse(3, Seq((0, -0.86956522), (1, 1.0))) absTol 1E-5) 44 | assert(dataInf(1) ~== Vectors.dense(0.0, 0.0, 0.0) absTol 1E-5) 45 | assert(dataInf(2) ~== Vectors.dense(0.2, -0.36666667, -1.0) absTol 1E-5) 46 | assert(dataInf(3) ~== Vectors.sparse(3, Seq((1, 0.284375), (2, 1.0))) absTol 1E-5) 47 | assert(dataInf(4) ~== Vectors.dense(1.0, 0.12631579, 0.473684211) absTol 1E-5) 48 | assert(dataInf(5) ~== Vectors.sparse(3, Seq()) absTol 1E-5) 49 | } 50 | 51 | 52 | 3.结果: 53 | 54 | dataRDD: 55 | [0.6,-1.1,-3.0] 56 | (3,[1,2],[0.91,3.2]) 57 | (3,[0,1],[-2.0,2.3]) 58 | 
[0.0,0.0,0.0] 59 | (3,[0,1,2],[5.7,0.72,2.7]) 60 | (3,[],[]) 61 | dataInf: 62 | (3,[0,1],[-0.8695652173913044,1.0]) 63 | [0.0,0.0,0.0] 64 | [0.19999999999999998,-0.3666666666666667,-1.0] 65 | (3,[1,2],[0.284375,1.0]) 66 | (3,[0,1,2],[1.0,0.12631578947368421,0.4736842105263158]) 67 | (3,[],[]) 68 | dataInfRDD: 69 | [0.19999999999999998,-0.3666666666666667,-1.0] 70 | (3,[1,2],[0.284375,1.0]) 71 | (3,[0,1],[-0.8695652173913044,1.0]) 72 | [0.0,0.0,0.0] 73 | (3,[0,1,2],[1.0,0.12631578947368421,0.4736842105263158]) 74 | (3,[],[]) 75 | 76 | 77 | 结果分析: 78 | 79 | 80 | -3/3=-1 81 | 82 | 参考 83 | 84 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 85 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 86 | 【3】https://github.com/xubo245/SparkLearning 87 | 【4】book:Machine Learning with Spark ,Nick Pertreach 88 | 【5】book:Spark MlLib机器学习实战 89 | -------------------------------------------------------------------------------- /2基本统计/Spark中组件Mllib的学习21之随机数-RandomRDD产生.md: -------------------------------------------------------------------------------- 1 | 更多代码请见:https://github.com/xubo245/SparkLearning 2 | Spark中组件Mllib的学习之基础概念篇 3 | 1解释 4 | 在org.apache.spark.mllib.random下RandomRDDs对象,处理生成RandomRDD,还可以生成uniformRDD、poissonRDD、exponentialRDD、gammaRDD等 5 | 6 | 7 | 2.代码: 8 | 9 | ``` 10 | /** 11 | * @author xubo 12 | * ref:Spark MlLib机器学习实战 13 | * more code:https://github.com/xubo245/SparkLearning 14 | * more blog:http://blog.csdn.net/xubo245 15 | */ 16 | package org.apache.spark.mllib.learning.basic 17 | 18 | import org.apache.spark.mllib.random.RandomRDDs._ 19 | import org.apache.spark.{SparkConf, SparkContext} 20 | 21 | /** 22 | * Created by xubo on 2016/5/23. 23 | */ 24 | object RandomRDDLearning { 25 | def main(args: Array[String]) { 26 | val conf = new SparkConf().setMaster("local[4]").setAppName(this.getClass().getSimpleName().filter(!_.equals('$'))) 27 | val sc = new SparkContext(conf) 28 | println("normalRDD:") 29 | val randomNum = normalRDD(sc, 10) 30 | randomNum.foreach(println) 31 | println("uniformRDD:") 32 | uniformRDD(sc, 10).foreach(println) 33 | println("poissonRDD:") 34 | poissonRDD(sc, 5,10).foreach(println) 35 | println("exponentialRDD:") 36 | exponentialRDD(sc,7, 10).foreach(println) 37 | println("gammaRDD:") 38 | gammaRDD(sc, 3,3,10).foreach(println) 39 | sc.stop 40 | } 41 | } 42 | 43 | ``` 44 | 45 | 3.结果: 46 | 47 | ``` 48 | normalRDD: 49 | 0.19139342057444655 50 | 0.42847625833602926 51 | 0.432676150766411 52 | 2.031243580737701 53 | -1.6210366564577097 54 | -0.5736390968158938 55 | 0.5118950917391826 56 | 0.36612870444413614 57 | -0.7841387585110905 58 | 0.11439913262616007 59 | uniformRDD: 60 | 0.2438450552072624 61 | 0.7003522704053741 62 | 0.24235558263747725 63 | 0.49701950142885765 64 | 0.46652368533423283 65 | 0.980827677073354 66 | 0.6825558070196546 67 | 0.4817949839139517 68 | 0.9965017651788755 69 | 0.7568845648015728 70 | poissonRDD: 71 | 2.0 72 | 2.0 73 | 4.0 74 | 6.0 75 | 4.0 76 | 2.0 77 | 4.0 78 | 2.0 79 | 3.0 80 | 9.0 81 | exponentialRDD: 82 | 12.214082193307469 83 | 4.682554578220504 84 | 0.9758739534780947 85 | 1.0228072708547165 86 | 5.844697536923258 87 | 1.11718191688843 88 | 18.3001169404778 89 | 3.0254219574726964 90 | 1.9807047388403134 91 | 7.218371820752084 92 | gammaRDD: 93 | 15.362945490679401 94 | 12.508341430761691 95 | 6.284582685039609 96 | 2.731284321611819 97 | 19.032454731810525 98 | 14.508395124068773 99 | 8.684880785422951 100 | 3.5329956660355206 101 | 15.852625148469828 102 | 4.284198644233831 103 | 104 | ``` 105 | 106 | 参考 107 | 
【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 108 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 109 | 【3】https://github.com/xubo245/SparkLearning -------------------------------------------------------------------------------- /6降维/Spark中组件Mllib的学习49之奇异值分解SVD(Singular value decomposition).md: -------------------------------------------------------------------------------- 1 | 2 | 更多代码请见:https://github.com/xubo245/SparkLearning 3 | 4 | Spark中组件Mllib的学习 5 | 6 | 1.解释 7 | 8 | 具体请参考【4】,讲的比较详细 9 | 10 | 11 | 2.代码: 12 | 13 | /** 14 | * @author xubo 15 | * ref:http://spark.apache.org/docs/1.5.2/mllib-guide.html 16 | * more code:https://github.com/xubo245/SparkLearning 17 | * more blog:http://blog.csdn.net/xubo245 18 | */ 19 | package org.apache.spark.mllib.DimensionalityReduction 20 | 21 | import org.apache.spark.mllib.linalg.Vectors 22 | import org.apache.spark.util.SparkLearningFunSuite 23 | 24 | /** 25 | * Created by xubo on 2016/6/13. 26 | * book:Machine Learning with Spark ,Nick Pertreach 27 | */ 28 | class SVDSuite extends SparkLearningFunSuite { 29 | test("testFunSuite") { 30 | import org.apache.spark.mllib.linalg.Matrix 31 | import org.apache.spark.mllib.linalg.distributed.RowMatrix 32 | import org.apache.spark.mllib.linalg.SingularValueDecomposition 33 | 34 | // val mat: RowMatrix =... 35 | val rdd = sc.textFile("file/data/mllib/input/basic/MatrixRow33.txt") //创建RDD文件路径 36 | .map(_.split(' ') //按“ ”分割 37 | .map(_.toDouble)) //转成Double类型 38 | .map(line => Vectors.dense(line)) //转成Vector格式 39 | val mat = new RowMatrix(rdd) 40 | 41 | 42 | // Compute the top 3 singular values and corresponding singular vectors. 43 | val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(3, computeU = true) 44 | val U: RowMatrix = svd.U // The U factor is a RowMatrix. 45 | val s = svd.s // The singular values are stored in a local dense vector. 46 | val V: Matrix = svd.V // The V factor is a local dense matrix. 47 | 48 | println("mat:") 49 | mat.rows.foreach(println) 50 | println("U:") 51 | U.rows.foreach(println) 52 | println("s:" + s) 53 | println("V:\n" + V) 54 | } 55 | } 56 | 57 | 58 | 59 | 3.结果: 60 | 61 | mat: 62 | [1.0,1.0,1.0] 63 | [1.0,1.0,1.0] 64 | [1.0,0.0,0.0] 65 | [1.0,1.0,0.0] 66 | U: 67 | [-0.6067637394094294,-0.3352266406762714,0.13950220040841022] 68 | [-0.2418162496491055,0.712015746118639,0.6592104964916438] 69 | [-0.45299054127146393,0.517957311021789,-0.7256168365450623] 70 | [-0.6067637394094294,-0.3352266406762714,0.13950220040841022] 71 | s:[2.809211800166755,0.88646771116676,0.5678944081980605] 72 | V: 73 | -0.6793130619863371 0.6311789687764828 0.37436195478307166 74 | -0.5932333119173848 -0.1720265367929079 -0.7864356987513785 75 | -0.431981482758553 -0.7563200248659911 0.49129626351156824 76 | 77 | 78 | 参考 79 | 80 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 81 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 82 | 【3】https://github.com/xubo245/SparkLearning 83 | 【4】http://spark.apache.org/docs/1.5.2/mllib-dimensionality-reduction.html 84 | 【5】book:Machine Learning with Spark ,Nick Pertreach 85 | -------------------------------------------------------------------------------- /3分类和回归/Spark中组件Mllib的学习25之线性回归2-较大数据集(多元).md: -------------------------------------------------------------------------------- 1 | 更多代码请见:https://github.com/xubo245/SparkLearning 2 | Spark中组件Mllib的学习之回归分析篇 3 | 1解释 4 | 5 | 对多组数据进行model的training,然后再利用model来predict具体的值。过程中有输出model的权重 公式:f(x)=a1X1+a2X2+a3X3+...... 
6 | 7 | 2.代码: 8 | 9 | ``` 10 | package org.apache.spark.mllib.learning.regression 11 | 12 | import java.text.SimpleDateFormat 13 | import java.util.Date 14 | 15 | import org.apache.log4j.{Level, Logger} 16 | import org.apache.spark.mllib.linalg.Vectors 17 | import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD} 18 | import org.apache.spark.{SparkConf, SparkContext} 19 | 20 | import scala.Array.canBuildFrom 21 | 22 | object LinearRegression { 23 | def main(args: Array[String]): Unit = { 24 | // 屏蔽不必要的日志显示终端上 25 | Logger.getLogger("org.apache.spark").setLevel(Level.ERROR) 26 | Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF) 27 | 28 | // 设置运行环境 29 | val conf = new SparkConf().setAppName(this.getClass().getSimpleName().filter(!_.equals('$'))).setMaster("local[4]") 30 | val sc = new SparkContext(conf) 31 | 32 | // Load and parse the data 33 | val data = sc.textFile("file/data/mllib/input/ridge-data/lpsa.data",1) 34 | //如果读入不加1,会产生两个文件,应该是默认生成了两个partition 35 | val parsedData = data.map { line => 36 | val parts = line.split(',') 37 | LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble))) 38 | } 39 | 40 | // Building the model 41 | //建立model的数据和predict的数据没有分开 42 | val numIterations = 100 43 | val model = LinearRegressionWithSGD.train(parsedData, numIterations) 44 | // for(i<-parsedData) println(i.label+":"+i.features); 45 | // Evaluate model on training examples and compute training error 46 | val valuesAndPreds = parsedData.map { point => 47 | val prediction = model.predict(point.features) 48 | (point.label, prediction) 49 | } 50 | //print model.weights 51 | var weifhts=model.weights 52 | println("model.weights"+weifhts) 53 | 54 | //save as file 55 | val iString = new SimpleDateFormat("yyyyMMddHHmmssSSS").format(new Date()) 56 | val path = "file/data/mllib/output/LinearRegression/" + iString + "/result" 57 | valuesAndPreds.saveAsTextFile(path) 58 | val MSE = valuesAndPreds.map { case (v, p) => math.pow((v - p), 2) }.reduce(_ + _) / valuesAndPreds.count 59 | println("training Mean Squared Error = " + MSE) 60 | 61 | sc.stop() 62 | } 63 | } 64 | ``` 65 | 数据请见github或者spark源码 66 | 67 | 3.结果: 68 | 69 | ``` 70 | model.weights[0.5808575763272221,0.18930001482946976,0.2803086929991066,0.1110834181777876,0.4010473965597895,-0.5603061626684255,-0.5804740464000981,0.8742741176970946] 71 | training Mean Squared Error = 6.207597210613579 72 | ``` 73 | 74 | 参考 75 | 76 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 77 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 78 | 【3】https://github.com/xubo245/SparkLearning 79 | 【4】Spark MlLib机器学习实战 80 | -------------------------------------------------------------------------------- /7特征提取和转换/Spark中组件Mllib的学习59之归一化(Normalizer)Normalization using L2 distance.md: -------------------------------------------------------------------------------- 1 | 2 | 更多代码请见:https://github.com/xubo245/SparkLearning 3 | 4 | Spark中组件Mllib的学习 5 | 6 | 1.解释 7 | 8 | 归一化 9 | 10 | Normalization using L2 distance 11 | 12 | 公式: 13 | 14 | 分母为:其中本例中p为2 15 | 16 | ![](http://i.imgur.com/Ll5M0O8.png) 17 | 18 | 19 | 2.代码: 20 | 21 | test("Normalization using L2 distance") { 22 | val l2Normalizer = new Normalizer() 23 | 24 | val data2 = data.map(l2Normalizer.transform) 25 | val data2RDD = l2Normalizer.transform(dataRDD) 26 | 27 | println("dataRDD:") 28 | dataRDD.foreach(println) 29 | println("data2RDD:") 30 | data2RDD.foreach(println) 31 | 32 | assert((data, data2, data2RDD.collect()).zipped.forall { 33 | case 
(v1: DenseVector, v2: DenseVector, v3: DenseVector) => true 34 | case (v1: SparseVector, v2: SparseVector, v3: SparseVector) => true 35 | case _ => false 36 | }, "The vector type should be preserved after normalization.") 37 | 38 | assert((data2, data2RDD.collect()).zipped.forall((v1, v2) => v1 ~== v2 absTol 1E-5)) 39 | 40 | assert(brzNorm(data2(0).toBreeze, 2) ~== 1.0 absTol 1E-5) 41 | assert(brzNorm(data2(2).toBreeze, 2) ~== 1.0 absTol 1E-5) 42 | assert(brzNorm(data2(3).toBreeze, 2) ~== 1.0 absTol 1E-5) 43 | assert(brzNorm(data2(4).toBreeze, 2) ~== 1.0 absTol 1E-5) 44 | 45 | assert(data2(0) ~== Vectors.sparse(3, Seq((0, -0.65617871), (1, 0.75460552))) absTol 1E-5) 46 | assert(data2(1) ~== Vectors.dense(0.0, 0.0, 0.0) absTol 1E-5) 47 | assert(data2(2) ~== Vectors.dense(0.184549876, -0.3383414, -0.922749378) absTol 1E-5) 48 | assert(data2(3) ~== Vectors.sparse(3, Seq((1, 0.27352993), (2, 0.96186349))) absTol 1E-5) 49 | assert(data2(4) ~== Vectors.dense(0.897906166, 0.113419726, 0.42532397) absTol 1E-5) 50 | assert(data2(5) ~== Vectors.sparse(3, Seq()) absTol 1E-5) 51 | } 52 | 53 | 54 | 3.结果: 55 | 56 | dataRDD: 57 | [0.6,-1.1,-3.0] 58 | (3,[0,1],[-2.0,2.3]) 59 | (3,[1,2],[0.91,3.2]) 60 | [0.0,0.0,0.0] 61 | (3,[0,1,2],[5.7,0.72,2.7]) 62 | (3,[],[]) 63 | data2RDD: 64 | [0.18454987557625951,-0.3383414385564758,-0.9227493778812975] 65 | (3,[1,2],[0.2735299305180406,0.9618634919315713]) 66 | (3,[0,1,2],[0.8979061661970154,0.11341972625646508,0.4253239734617441]) 67 | (3,[],[]) 68 | (3,[0,1],[-0.6561787149247866,0.7546055221635046]) 69 | [0.0,0.0,0.0] 70 | 71 | 结果分析: 72 | 73 | 74 | scala> 0.6/Math.sqrt(0.6*0.6+1.1*1.1+3.0*3) 75 | res32: Double = 0.18454987557625951 76 | 77 | scala> -1.1/Math.sqrt(0.6*0.6+1.1*1.1+3.0*3) 78 | res33: Double = -0.3383414385564758 79 | 80 | scala> -3.0/Math.sqrt(0.6*0.6+1.1*1.1+3.0*3) 81 | res34: Double = -0.9227493778812975 82 | 83 | 参考 84 | 85 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 86 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 87 | 【3】https://github.com/xubo245/SparkLearning 88 | 【4】book:Machine Learning with Spark ,Nick Pertreach 89 | 【5】book:Spark MlLib机器学习实战 90 | 【6】https://github.com/endymecy/spark-ml-source-analysis/blob/master/%E7%89%B9%E5%BE%81%E6%8A%BD%E5%8F%96%E5%92%8C%E8%BD%AC%E6%8D%A2/normalizer.md 91 | 92 | -------------------------------------------------------------------------------- /7特征提取和转换/Spark中组件Mllib的学习59之归一化(Normalizer)Normalization using L2 distance.md: -------------------------------------------------------------------------------- 1 | 2 | 更多代码请见:https://github.com/xubo245/SparkLearning 3 | 4 | Spark中组件Mllib的学习 5 | 6 | 1.解释 7 | 8 | 归一化 9 | 10 | Normalization using L2 distance 11 | 12 | 公式: 13 | 14 | 分母为:其中本例中p为2 15 | 16 | ![](http://i.imgur.com/Ll5M0O8.png) 17 | 18 | 19 | 2.代码: 20 | 21 | test("Normalization using L2 distance") { 22 | val l2Normalizer = new Normalizer() 23 | 24 | val data2 = data.map(l2Normalizer.transform) 25 | val data2RDD = l2Normalizer.transform(dataRDD) 26 | 27 | println("dataRDD:") 28 | dataRDD.foreach(println) 29 | println("data2RDD:") 30 | data2RDD.foreach(println) 31 | 32 | assert((data, data2, data2RDD.collect()).zipped.forall { 33 | case (v1: DenseVector, v2: DenseVector, v3: DenseVector) => true 34 | case (v1: SparseVector, v2: SparseVector, v3: SparseVector) => true 35 | case _ => false 36 | }, "The vector type should be preserved after normalization.") 37 | 38 | assert((data2, data2RDD.collect()).zipped.forall((v1, v2) => v1 ~== v2 absTol 1E-5)) 39 | 40 | 
assert(brzNorm(data2(0).toBreeze, 2) ~== 1.0 absTol 1E-5) 41 | assert(brzNorm(data2(2).toBreeze, 2) ~== 1.0 absTol 1E-5) 42 | assert(brzNorm(data2(3).toBreeze, 2) ~== 1.0 absTol 1E-5) 43 | assert(brzNorm(data2(4).toBreeze, 2) ~== 1.0 absTol 1E-5) 44 | 45 | assert(data2(0) ~== Vectors.sparse(3, Seq((0, -0.65617871), (1, 0.75460552))) absTol 1E-5) 46 | assert(data2(1) ~== Vectors.dense(0.0, 0.0, 0.0) absTol 1E-5) 47 | assert(data2(2) ~== Vectors.dense(0.184549876, -0.3383414, -0.922749378) absTol 1E-5) 48 | assert(data2(3) ~== Vectors.sparse(3, Seq((1, 0.27352993), (2, 0.96186349))) absTol 1E-5) 49 | assert(data2(4) ~== Vectors.dense(0.897906166, 0.113419726, 0.42532397) absTol 1E-5) 50 | assert(data2(5) ~== Vectors.sparse(3, Seq()) absTol 1E-5) 51 | } 52 | 53 | 54 | 3.结果: 55 | 56 | dataRDD: 57 | [0.6,-1.1,-3.0] 58 | (3,[0,1],[-2.0,2.3]) 59 | (3,[1,2],[0.91,3.2]) 60 | [0.0,0.0,0.0] 61 | (3,[0,1,2],[5.7,0.72,2.7]) 62 | (3,[],[]) 63 | data2RDD: 64 | [0.18454987557625951,-0.3383414385564758,-0.9227493778812975] 65 | (3,[1,2],[0.2735299305180406,0.9618634919315713]) 66 | (3,[0,1,2],[0.8979061661970154,0.11341972625646508,0.4253239734617441]) 67 | (3,[],[]) 68 | (3,[0,1],[-0.6561787149247866,0.7546055221635046]) 69 | [0.0,0.0,0.0] 70 | 71 | 结果分析: 72 | 73 | 74 | scala> 0.6/Math.sqrt(0.6*0.6+1.1*1.1+3.0*3) 75 | res32: Double = 0.18454987557625951 76 | 77 | scala> -1.1/Math.sqrt(0.6*0.6+1.1*1.1+3.0*3) 78 | res33: Double = -0.3383414385564758 79 | 80 | scala> -3.0/Math.sqrt(0.6*0.6+1.1*1.1+3.0*3) 81 | res34: Double = -0.9227493778812975 82 | 83 | 参考 84 | 85 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 86 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 87 | 【3】https://github.com/xubo245/SparkLearning 88 | 【4】book:Machine Learning with Spark ,Nick Pertreach 89 | 【5】book:Spark MlLib机器学习实战 90 | 【6】https://github.com/endymecy/spark-ml-source-analysis/blob/master/%E7%89%B9%E5%BE%81%E6%8A%BD%E5%8F%96%E5%92%8C%E8%BD%AC%E6%8D%A2/normalizer.md 91 | 92 | -------------------------------------------------------------------------------- /6降维/Spark中组件Mllib的学习50之主成份分析PCA.md: -------------------------------------------------------------------------------- 1 | 2 | 更多代码请见:https://github.com/xubo245/SparkLearning 3 | 4 | Spark中组件Mllib的学习 5 | 6 | 1.解释 7 | 8 | Principal component analysis (PCA):主成份分析 9 | 10 | 从数据矩阵中抽取矩阵的k个主向量,代表矩阵的主要影响向量 11 | 12 | 具体请看【5】和【6】 13 | 14 | 2.代码: 15 | 16 | /** 17 | * @author xubo 18 | * ref:http://spark.apache.org/docs/1.5.2/mllib-guide.html 19 | * more code:https://github.com/xubo245/SparkLearning 20 | * more blog:http://blog.csdn.net/xubo245 21 | */ 22 | package org.apache.spark.mllib.DimensionalityReduction 23 | 24 | import org.apache.spark.mllib.linalg.Vectors 25 | import org.apache.spark.util.SparkLearningFunSuite 26 | 27 | /** 28 | * Created by xubo on 2016/6/13. 29 | */ 30 | class PCASuite extends SparkLearningFunSuite { 31 | test("testFunSuite") { 32 | import org.apache.spark.mllib.linalg.Matrix 33 | import org.apache.spark.mllib.linalg.distributed.RowMatrix 34 | 35 | // val mat: RowMatrix = ... 36 | val rdd = sc.textFile("file/data/mllib/input/basic/MatrixRow33.txt") //创建RDD文件路径 37 | .map(_.split(' ') //按“ ”分割 38 | .map(_.toDouble)) //转成Double类型 39 | .map(line => Vectors.dense(line)) //转成Vector格式 40 | val mat = new RowMatrix(rdd) 41 | 42 | // Compute the top 10 principal components. 43 | val pc: Matrix = mat.computePrincipalComponents(3) // Principal components are stored in a local dense matrix. 
44 | 45 | // Project the rows to the linear space spanned by the top 10 principal components. 46 | val projected: RowMatrix = mat.multiply(pc) 47 | 48 | println("mat:") 49 | mat.rows.foreach(println) 50 | println("pc:") 51 | println(pc) 52 | println("projected:") 53 | projected.rows.foreach(println) 54 | } 55 | } 56 | 57 | 58 | 3.结果: 59 | 60 | 主成份为2: 61 | 62 | mat: 63 | [1.0,0.0,0.0] 64 | [1.0,1.0,1.0] 65 | [1.0,1.0,0.0] 66 | [1.0,1.0,1.0] 67 | pc: 68 | 0.0 0.0 69 | -0.6154122094026357 -0.7882054380161092 70 | -0.7882054380161091 0.6154122094026356 71 | projected: 72 | [0.0,0.0] 73 | [-1.403617647418745,-0.17279322861347357] 74 | [-0.6154122094026357,-0.7882054380161092] 75 | [-1.403617647418745,-0.17279322861347357] 76 | 77 | 主成份为3: 78 | 79 | mat: 80 | [1.0,0.0,0.0] 81 | [1.0,1.0,1.0] 82 | [1.0,1.0,0.0] 83 | [1.0,1.0,1.0] 84 | pc: 85 | 0.0 0.0 1.0 86 | -0.6154122094026357 -0.7882054380161092 0.0 87 | -0.7882054380161091 0.6154122094026356 0.0 88 | projected: 89 | [0.0,0.0,1.0] 90 | [-1.403617647418745,-0.17279322861347357,1.0] 91 | [-0.6154122094026357,-0.7882054380161092,1.0] 92 | [-1.403617647418745,-0.17279322861347357,1.0] 93 | 94 | 参考 95 | 96 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 97 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 98 | 【3】https://github.com/xubo245/SparkLearning 99 | 【4】book:Machine Learning with Spark ,Nick Pertreach 100 | 【5】Spark MlLib机器学习实战 101 | 【6】http://spark.apache.org/docs/1.5.2/mllib-dimensionality-reduction.html 102 | -------------------------------------------------------------------------------- /3分类和回归/Spark中组件Mllib的学习30之逻辑回归LogisticRegressionWithLBFGS.md: -------------------------------------------------------------------------------- 1 | 更多代码请见:https://github.com/xubo245/SparkLearning 2 | Spark中组件Mllib的学习之分类篇 3 | 1解释 4 | 5 | Limited-memory BFGS (L-BFGS or LM-BFGS) 6 | Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm 7 | =》 8 | LBFGS :Limited-memory Broyden–Fletcher–Goldfarb–Shanno 9 | 10 | 具体的概念在【4】、【5】中都有讲到,还没细看 11 | 12 | 13 | 2.代码: 14 | 15 | ``` 16 | /** 17 | * @author xubo 18 | * ref:Spark MlLib机器学习实战 19 | * more code:https://github.com/xubo245/SparkLearning 20 | * more blog:http://blog.csdn.net/xubo245 21 | */ 22 | package org.apache.spark.mllib.learning.regression 23 | 24 | import java.text.SimpleDateFormat 25 | import java.util.Date 26 | 27 | import org.apache.spark.{SparkConf, SparkContext} 28 | import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS} 29 | import org.apache.spark.mllib.evaluation.MulticlassMetrics 30 | import org.apache.spark.mllib.regression.LabeledPoint 31 | import org.apache.spark.mllib.util.MLUtils 32 | 33 | /** 34 | * Created by xubo on 2016/5/23. 35 | * 一元逻辑回归 36 | */ 37 | object LogisticRegressionWithLDFGS { 38 | def main(args: Array[String]) { 39 | val conf = new SparkConf().setMaster("local[4]").setAppName(this.getClass().getSimpleName().filter(!_.equals('$'))) 40 | val sc = new SparkContext(conf) 41 | 42 | // Load training data in LIBSVM format. 43 | val data = MLUtils.loadLibSVMFile(sc, "file/data/mllib/input/regression/sample_libsvm_data.txt") 44 | 45 | // Split data into training (60%) and test (40%). 46 | val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L) 47 | val training = splits(0).cache() 48 | val test = splits(1) 49 | 50 | // Run training algorithm to build the model 51 | val model = new LogisticRegressionWithLBFGS() 52 | .setNumClasses(10) 53 | .run(training) 54 | 55 | // Compute raw scores on the test set. 
56 | val predictionAndLabels = test.map { case LabeledPoint(label, features) => 57 | val prediction = model.predict(features) 58 | (prediction, label) 59 | } 60 | 61 | // Get evaluation metrics. 62 | val metrics = new MulticlassMetrics(predictionAndLabels) 63 | val precision = metrics.precision 64 | println("Precision = " + precision) 65 | 66 | // Save and load model 67 | val iString = new SimpleDateFormat("yyyyMMddHHmmssSSS").format(new Date()) 68 | val path = "file/data/mllib/output/regression/LogisticRegressionWithLDFGS" + iString + "/result" 69 | model.save(sc, path) 70 | val sameModel = LogisticRegressionModel.load(sc, path) 71 | sc.stop 72 | } 73 | } 74 | 75 | ``` 76 | 77 | 3.结果: 78 | 79 | ``` 80 | Precision = 1.0 81 | ``` 82 | 准确率也是1 83 | 84 | 85 | 参考 86 | 87 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 88 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 89 | 【3】https://github.com/xubo245/SparkLearning 90 | 【4】http://blog.csdn.net/itplus/article/details/21897715 91 | 【5】http://blog.csdn.net/zhirom/article/details/38332111 92 | 【6】https://en.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm -------------------------------------------------------------------------------- /5聚类/Spark中组件Mllib的学习46之Power iteration clustering.md: -------------------------------------------------------------------------------- 1 | 2 | 更多代码请见:https://github.com/xubo245/SparkLearning 3 | 4 | Spark中组件Mllib的学习 5 | 6 | 1.解释 7 | 8 | Power iteration clustering是生成图聚类的聚类算法,翻译为幂迭代聚类 9 | 测试数据为: 10 | 11 | 12 | 0 1 1.0 13 | 0 2 1.0 14 | 0 3 1.0 15 | 16 | 其中第一列和第二列都是点的id,第三列为相似度的值。 17 | 18 | 对于图的顶点聚类(顶点相似度作为边的属性)问题,幂迭代聚类(PIC)是高效并且易扩展的算法(参考: Lin and Cohen, Power Iteration Clustering)。MLlib包含了一个使用GraphX(MLlib)为基础的实现。算法的输入是RDD[srcID, dstID, similarity],输出是每个顶点对应的聚类的模型。相似度(similarity)必须是非负值。PIC假设相似度的衡量是对称的,也就是说在输入数据中,(srcID, dstID)顺序无关(例如:<1, 2, 0.1>, <2, 1, 0.1等价),但是只能出现一次。输入中没有指定相似度的点对,相似度会置0。MLlib中的PIC实现具有下列参数: 19 | 20 | k: 聚簇的数量 21 | maxIterations: 最大迭代次数 22 | initializationMode: 初始化模式:默认值“random”,表示使用一个随机向量作为顶点的聚类属性;也可以是“degree”,表示使用归一化的相似度和(作为顶点的聚类属性)。 23 | 具体请间【4】 24 | 25 | 2.代码: 26 | 27 | /** 28 | * @author xubo 29 | * ref:http://spark.apache.org/docs/1.5.2/mllib-guide.html 30 | * more code:https://github.com/xubo245/SparkLearning 31 | * more blog:http://blog.csdn.net/xubo245 32 | */ 33 | package org.apache.spark.mllib.clustering.PowerIterationClusteringLearning 34 | 35 | import org.apache.spark.util.SparkLearningFunSuite 36 | 37 | /** 38 | * Created by xubo on 2016/6/13. 
39 | */ 40 | class PICFromWebSuite extends SparkLearningFunSuite { 41 | test("testFunSuite") { 42 | import org.apache.spark.mllib.clustering.{PowerIterationClustering, PowerIterationClusteringModel} 43 | import org.apache.spark.mllib.linalg.Vectors 44 | 45 | // Load and parse the data 46 | val data = sc.textFile("file/data/mllib/input/mllibFromSpark/pic_data.txt") 47 | val similarities = data.map { line => 48 | val parts = line.split(' ') 49 | (parts(0).toLong, parts(1).toLong, parts(2).toDouble) 50 | } 51 | 52 | // Cluster the data into two classes using PowerIterationClustering 53 | val pic = new PowerIterationClustering() 54 | .setK(2) 55 | .setMaxIterations(10) 56 | val model = pic.run(similarities) 57 | 58 | model.assignments.foreach { a => 59 | println(s"${a.id} -> ${a.cluster}") 60 | } 61 | // model.pre 62 | // Save and load model 63 | // model.save(sc, "myModelPath") 64 | // val sameModel = PowerIterationClusteringModel.load(sc, "myModelPath")F 65 | } 66 | } 67 | 68 | 69 | 数据: 70 | 71 | 0 1 1.0 72 | 0 2 1.0 73 | 0 3 1.0 74 | 1 2 1.0 75 | 1 3 1.0 76 | 2 3 1.0 77 | 3 4 0.1 78 | 4 5 1.0 79 | 4 15 1.0 80 | 5 6 1.0 81 | 6 7 1.0 82 | 7 8 1.0 83 | 8 9 1.0 84 | 9 10 1.0 85 | 10 11 1.0 86 | 11 12 1.0 87 | 12 13 1.0 88 | 13 14 1.0 89 | 14 15 1.0 90 | 91 | 92 | 3.结果: 93 | 94 | 4 -> 0 95 | 14 -> 0 96 | 0 -> 1 97 | 6 -> 0 98 | 8 -> 0 99 | 12 -> 0 100 | 10 -> 0 101 | 2 -> 1 102 | 13 -> 0 103 | 15 -> 0 104 | 11 -> 0 105 | 1 -> 1 106 | 3 -> 1 107 | 7 -> 0 108 | 9 -> 0 109 | 5 -> 0 110 | 111 | 该模型中没有看到predict函数 112 | 113 | 参考 114 | 115 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 116 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 117 | 【3】https://github.com/xubo245/SparkLearning 118 | 【4】http://www.fuqingchuan.com/2015/03/609.html#power-iteration-clustering-pic 119 | -------------------------------------------------------------------------------- /7特征提取和转换/Spark中组件Mllib的学习58之归一化(Normalizer)Normalization using L1 distance.md: -------------------------------------------------------------------------------- 1 | 2 | 更多代码请见:https://github.com/xubo245/SparkLearning 3 | 4 | Spark中组件Mllib的学习 5 | 6 | 1.解释 7 | 8 |   规则化器缩放单个样本让其拥有单位范数。这是文本分类和聚类常用的操作。例如,两个规则化的TFIDF向量的点乘就是两个向量的cosine相似度。 9 | 10 |   Normalizer实现VectorTransformer,将一个向量规则化为转换的向量,或者将一个RDD规则化为另一个RDD。下面是一个规则化的例子。 11 | 12 | Normalization using L1 distance ,来自spark mllib 的test 13 | 14 | 15 | 16 | 2.代码: 17 | 18 | val data = Array( 19 | Vectors.sparse(3, Seq((0, -2.0), (1, 2.3))), 20 | Vectors.dense(0.0, 0.0, 0.0), 21 | Vectors.dense(0.6, -1.1, -3.0), 22 | Vectors.sparse(3, Seq((1, 0.91), (2, 3.2))), 23 | Vectors.sparse(3, Seq((0, 5.7), (1, 0.72), (2, 2.7))), 24 | Vectors.sparse(3, Seq()) 25 | ) 26 | 27 | lazy val dataRDD = sc.parallelize(data, 3) 28 | 29 | test("Normalization using L1 distance") { 30 | val l1Normalizer = new Normalizer(1) 31 | 32 | val data1 = data.map(l1Normalizer.transform) 33 | val data1RDD = l1Normalizer.transform(dataRDD) 34 | 35 | println("dataRDD:") 36 | dataRDD.foreach(println) 37 | println("data1RDD:") 38 | data1RDD.foreach(println) 39 | assert((data, data1, data1RDD.collect()).zipped.forall { 40 | case (v1: DenseVector, v2: DenseVector, v3: DenseVector) => true 41 | case (v1: SparseVector, v2: SparseVector, v3: SparseVector) => true 42 | case _ => false 43 | }, "The vector type should be preserved after normalization.") 44 | 45 | assert((data1, data1RDD.collect()).zipped.forall((v1, v2) => v1 ~== v2 absTol 1E-5)) 46 | 47 | assert(brzNorm(data1(0).toBreeze, 1) ~== 1.0 absTol 1E-5) 48 | 
assert(brzNorm(data1(2).toBreeze, 1) ~== 1.0 absTol 1E-5) 49 | assert(brzNorm(data1(3).toBreeze, 1) ~== 1.0 absTol 1E-5) 50 | assert(brzNorm(data1(4).toBreeze, 1) ~== 1.0 absTol 1E-5) 51 | 52 | assert(data1(0) ~== Vectors.sparse(3, Seq((0, -0.465116279), (1, 0.53488372))) absTol 1E-5) 53 | assert(data1(1) ~== Vectors.dense(0.0, 0.0, 0.0) absTol 1E-5) 54 | assert(data1(2) ~== Vectors.dense(0.12765957, -0.23404255, -0.63829787) absTol 1E-5) 55 | assert(data1(3) ~== Vectors.sparse(3, Seq((1, 0.22141119), (2, 0.7785888))) absTol 1E-5) 56 | assert(data1(4) ~== Vectors.dense(0.625, 0.07894737, 0.29605263) absTol 1E-5) 57 | assert(data1(5) ~== Vectors.sparse(3, Seq()) absTol 1E-5) 58 | } 59 | 60 | 3.结果: 61 | 62 | dataRDD: 63 | [0.6,-1.1,-3.0] 64 | (3,[0,1],[-2.0,2.3]) 65 | (3,[1,2],[0.91,3.2]) 66 | [0.0,0.0,0.0] 67 | (3,[0,1,2],[5.7,0.72,2.7]) 68 | (3,[],[]) 69 | data1RDD: 70 | [0.1276595744680851,-0.23404255319148937,-0.6382978723404255] 71 | (3,[1,2],[0.2214111922141119,0.778588807785888]) 72 | (3,[0,1],[-0.46511627906976744,0.5348837209302325]) 73 | [0.0,0.0,0.0] 74 | (3,[0,1,2],[0.625,0.07894736842105261,0.29605263157894735]) 75 | (3,[],[]) 76 | 77 | 结果分析: 78 | 79 | -3/(0.6+1.1+3)=-0.6382978723404255 80 | 81 | scala> 3.0/4.7 82 | res16: Double = 0.6382978723404255 83 | 84 | scala> 5.7/9.12 85 | res17: Double = 0.6250000000000001 86 | 87 | 故L1来归一化是将每个值除以所有元素绝对值只和 88 | 89 | 参考 90 | 91 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 92 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 93 | 【3】https://github.com/xubo245/SparkLearning 94 | 【4】book:Machine Learning with Spark ,Nick Pertreach 95 | 【5】book:Spark MlLib机器学习实战 96 | -------------------------------------------------------------------------------- /7特征提取和转换/Spark中组件Mllib的学习58之归一化(Normalizer)Normalization using L1 distance.md: -------------------------------------------------------------------------------- 1 | 2 | 更多代码请见:https://github.com/xubo245/SparkLearning 3 | 4 | Spark中组件Mllib的学习 5 | 6 | 1.解释 7 | 8 |   规则化器缩放单个样本让其拥有单位范数。这是文本分类和聚类常用的操作。例如,两个规则化的TFIDF向量的点乘就是两个向量的cosine相似度。 9 | 10 |   Normalizer实现VectorTransformer,将一个向量规则化为转换的向量,或者将一个RDD规则化为另一个RDD。下面是一个规则化的例子。 11 | 12 | Normalization using L1 distance ,来自spark mllib 的test 13 | 14 | 15 | 16 | 2.代码: 17 | 18 | val data = Array( 19 | Vectors.sparse(3, Seq((0, -2.0), (1, 2.3))), 20 | Vectors.dense(0.0, 0.0, 0.0), 21 | Vectors.dense(0.6, -1.1, -3.0), 22 | Vectors.sparse(3, Seq((1, 0.91), (2, 3.2))), 23 | Vectors.sparse(3, Seq((0, 5.7), (1, 0.72), (2, 2.7))), 24 | Vectors.sparse(3, Seq()) 25 | ) 26 | 27 | lazy val dataRDD = sc.parallelize(data, 3) 28 | 29 | test("Normalization using L1 distance") { 30 | val l1Normalizer = new Normalizer(1) 31 | 32 | val data1 = data.map(l1Normalizer.transform) 33 | val data1RDD = l1Normalizer.transform(dataRDD) 34 | 35 | println("dataRDD:") 36 | dataRDD.foreach(println) 37 | println("data1RDD:") 38 | data1RDD.foreach(println) 39 | assert((data, data1, data1RDD.collect()).zipped.forall { 40 | case (v1: DenseVector, v2: DenseVector, v3: DenseVector) => true 41 | case (v1: SparseVector, v2: SparseVector, v3: SparseVector) => true 42 | case _ => false 43 | }, "The vector type should be preserved after normalization.") 44 | 45 | assert((data1, data1RDD.collect()).zipped.forall((v1, v2) => v1 ~== v2 absTol 1E-5)) 46 | 47 | assert(brzNorm(data1(0).toBreeze, 1) ~== 1.0 absTol 1E-5) 48 | assert(brzNorm(data1(2).toBreeze, 1) ~== 1.0 absTol 1E-5) 49 | assert(brzNorm(data1(3).toBreeze, 1) ~== 1.0 absTol 1E-5) 50 | assert(brzNorm(data1(4).toBreeze, 1) 
~== 1.0 absTol 1E-5) 51 | 52 | assert(data1(0) ~== Vectors.sparse(3, Seq((0, -0.465116279), (1, 0.53488372))) absTol 1E-5) 53 | assert(data1(1) ~== Vectors.dense(0.0, 0.0, 0.0) absTol 1E-5) 54 | assert(data1(2) ~== Vectors.dense(0.12765957, -0.23404255, -0.63829787) absTol 1E-5) 55 | assert(data1(3) ~== Vectors.sparse(3, Seq((1, 0.22141119), (2, 0.7785888))) absTol 1E-5) 56 | assert(data1(4) ~== Vectors.dense(0.625, 0.07894737, 0.29605263) absTol 1E-5) 57 | assert(data1(5) ~== Vectors.sparse(3, Seq()) absTol 1E-5) 58 | } 59 | 60 | 3.结果: 61 | 62 | dataRDD: 63 | [0.6,-1.1,-3.0] 64 | (3,[0,1],[-2.0,2.3]) 65 | (3,[1,2],[0.91,3.2]) 66 | [0.0,0.0,0.0] 67 | (3,[0,1,2],[5.7,0.72,2.7]) 68 | (3,[],[]) 69 | data1RDD: 70 | [0.1276595744680851,-0.23404255319148937,-0.6382978723404255] 71 | (3,[1,2],[0.2214111922141119,0.778588807785888]) 72 | (3,[0,1],[-0.46511627906976744,0.5348837209302325]) 73 | [0.0,0.0,0.0] 74 | (3,[0,1,2],[0.625,0.07894736842105261,0.29605263157894735]) 75 | (3,[],[]) 76 | 77 | 结果分析: 78 | 79 | -3/(0.6+1.1+3)=-0.6382978723404255 80 | 81 | scala> 3.0/4.7 82 | res16: Double = 0.6382978723404255 83 | 84 | scala> 5.7/9.12 85 | res17: Double = 0.6250000000000001 86 | 87 | 故L1来归一化是将每个值除以所有元素绝对值只和 88 | 89 | 参考 90 | 91 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 92 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 93 | 【3】https://github.com/xubo245/SparkLearning 94 | 【4】book:Machine Learning with Spark ,Nick Pertreach 95 | 【5】book:Spark MlLib机器学习实战 96 | -------------------------------------------------------------------------------- /2基本统计/Spark中组件Mllib的学习42之rowMatrix的QR分解.md: -------------------------------------------------------------------------------- 1 | 2 | 更多代码请见:https://github.com/xubo245/SparkLearning 3 | 4 | Spark中组件Mllib的学习 5 | 6 | 1.解释 7 | 8 | ![](http://latex.codecogs.com/gif.latex?A=QR) 9 | 10 | 求矩阵A的Q和R分解矩阵 11 | 更多请见:【4】 12 | 13 | 14 | 15 | 2.代码: 16 | 17 | /** 18 | * @author xubo 19 | * ref:http://spark.apache.org/docs/1.5.2/mllib-guide.html 20 | * more code:https://github.com/xubo245/SparkLearning 21 | * more blog:http://blog.csdn.net/xubo245 22 | */ 23 | package org.apache.spark.mllib.learning.basic 24 | 25 | import org.apache.spark.mllib.linalg.Vectors 26 | import org.apache.spark.rdd.RDD 27 | import org.apache.spark.util.SparkLearningFunSuite 28 | 29 | /** 30 | * Created by xubo on 2016/6/13. 31 | * ref:http://blog.csdn.net/openspirit/article/details/13800067 32 | * 结论:与ref一致 33 | * 有些矩阵无法QR分解,会报空指针异常 34 | */ 35 | class RowMatrixSuite extends SparkLearningFunSuite { 36 | test("testFunSuite") { 37 | // val rdd = sc.parallelize(Array(1, 2, 3)) 38 | // println("count:" + rdd.count()) 39 | import org.apache.spark.mllib.linalg.Vector 40 | import org.apache.spark.mllib.linalg.distributed.RowMatrix 41 | val rdd = sc.textFile("file/data/mllib/input/basic/MatrixRow33.txt") //创建RDD文件路径 42 | .map(_.split(' ') //按“ ”分割 43 | .map(_.toDouble)) //转成Double类型 44 | .map(line => Vectors.dense(line)) //转成Vector格式 45 | val mat = new RowMatrix(rdd) 46 | 47 | // Get its size. 
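    // Note: tallSkinnyQR below assumes a "tall-and-skinny" matrix (many more rows than
    // columns). Called with computeQ = true it returns a QRDecomposition whose Q is a
    // distributed RowMatrix with orthonormal columns and whose R is a small local
    // upper-triangular Matrix, so Q * R reproduces the original rows up to rounding.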
48 | val m = mat.numRows() 49 | val n = mat.numCols() 50 | println("m:" + m) 51 | println("n:" + n) 52 | println("mat:" + mat) 53 | // QR decomposition 54 | val qrResult = mat.tallSkinnyQR(true) 55 | println() 56 | println("qrResult.R:\n" + qrResult.R) 57 | println("qrResult.Q:" + qrResult.Q) 58 | qrResult.Q.rows.foreach(println) 59 | } 60 | } 61 | 62 | 63 | 数据: 64 | 65 | 1 0 0 66 | 1 1 0 67 | 1 1 1 68 | 1 1 1 69 | 70 | 3.结果: 71 | 72 | m:4 73 | n:3 74 | mat:org.apache.spark.mllib.linalg.distributed.RowMatrix@4a34ddc9 75 | 2016-06-13 16:30:55 WARN LAPACK:61 - Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK 76 | 2016-06-13 16:30:55 WARN LAPACK:61 - Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK 77 | 78 | qrResult.R: 79 | 2.0000000000000004 1.4999999999999996 0.9999999999999998 80 | 0.0 -0.8660254037844386 -0.577350269189626 81 | 0.0 0.0 -0.8164965809277259 82 | qrResult.Q:org.apache.spark.mllib.linalg.distributed.RowMatrix@29dbcdf9 83 | 2016-06-13 16:30:56 WARN BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS 84 | 2016-06-13 16:30:56 WARN BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS 85 | [0.4999999999999999,-0.2886751345948133,-0.4082482904638629] 86 | [0.4999999999999999,-0.2886751345948133,-0.4082482904638629] 87 | [0.4999999999999999,0.8660254037844384,-2.719479911021037E-16] 88 | [0.4999999999999999,-0.2886751345948133,0.8164965809277263] 89 | 90 | 91 | 92 | 参考 93 | 94 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 95 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 96 | 【3】https://github.com/xubo245/SparkLearning 97 | 【4】http://blog.csdn.net/openspirit/article/details/13800067 98 | -------------------------------------------------------------------------------- /3分类和回归/Spark中组件Mllib的学习34之决策树(使用entropy)_.md: -------------------------------------------------------------------------------- 1 | 更多代码请见:https://github.com/xubo245/SparkLearning 2 | Spark中组件Mllib的学习之分类篇 3 | 1解释 4 | 5 | MLlib决策树支持三种不纯度的计算:gini、entropy、variance。其他的目前不支持 6 | 7 | ``` 8 | def fromString(name: String): Impurity = name match { 9 | case "gini" => Gini 10 | case "entropy" => Entropy 11 | case "variance" => Variance 12 | case _ => throw new IllegalArgumentException(s"Did not recognize Impurity name: $name") 13 | } 14 | } 15 | ``` 16 | 17 | ![这里写图片描述](http://img.blog.csdn.net/20160525150243291) 18 | 参考【4】 19 | 官网: 20 | 21 | ![这里写图片描述](http://img.blog.csdn.net/20160525151508370) 22 | 23 | 主要的决策树算法包括ID3,C4.5, CART等,参考【5】 24 | 25 | 2.代码: 26 | 27 | ``` 28 | /** 29 | * @author xubo 30 | * ref:Spark MlLib机器学习实战 31 | * more code:https://github.com/xubo245/SparkLearning 32 | * more blog:http://blog.csdn.net/xubo245 33 | */ 34 | package org.apache.spark.mllib.learning.classification 35 | 36 | import org.apache.spark.mllib.tree.DecisionTree 37 | import org.apache.spark.mllib.util.MLUtils 38 | import org.apache.spark.{SparkConf, SparkContext} 39 | 40 | /** 41 | * Created by xubo on 2016/5/23. 42 | * 43 | */ 44 | object DecisionTrees2ByEntropy { 45 | def main(args: Array[String]) { 46 | val conf = new SparkConf().setMaster("local[4]").setAppName(this.getClass().getSimpleName().filter(!_.equals('$'))) 47 | val sc = new SparkContext(conf) 48 | 49 | // Load and parse the data file. 
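    // Note on the "entropy" impurity chosen below: a node with class proportions p_i has
    // impurity -sum_i p_i * log2(p_i). For this 6-row dataset the root holds 4 positives
    // and 2 negatives, giving -(2/3)*log2(2/3) - (1/3)*log2(1/3) ≈ 0.918, which matches
    // the topNode impurity printed in the results; each split maximizes the information gain.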
50 | val data = MLUtils.loadLibSVMFile(sc, "file/data/mllib/input/classification/dt.txt") 51 | // Split the data into training and test sets (30% held out for testing) 52 | val numClasses = 2 //设定分类数量 53 | val categoricalFeaturesInfo = Map[Int, Int]() //设定输入格式 54 | val impurity = "entropy" //设定信息增益计算方式 55 | val maxDepth = 5 //设定树高度 56 | val maxBins = 3 //设定分裂数据集 57 | 58 | val model = DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo, 59 | impurity, maxDepth, maxBins) //建立模型 60 | println("model.depth:" + model.depth) 61 | println("model.numNodes:" + model.numNodes) 62 | println("model.topNode:" + model.topNode) 63 | 64 | val labelAndPreds = data.take(2).map { point => 65 | val prediction = model.predict(point.features) 66 | (point.label, prediction) 67 | } 68 | labelAndPreds.foreach(println) 69 | sc.stop 70 | } 71 | } 72 | 73 | ``` 74 | 75 | ``` 76 | 1 1:1 2:0 3:0 4:1 77 | 0 1:1 2:0 3:1 4:1 78 | 0 1:0 2:1 3:0 4:0 79 | 1 1:1 2:1 3:0 4:0 80 | 1 1:1 2:0 3:0 4:0 81 | 1 1:1 2:1 3:0 4:0 82 | 83 | ``` 84 | 85 | 86 | 3.结果: 87 | 88 | ``` 89 | model.depth:2 90 | model.numNodes:5 91 | model.topNode:id = 1, isLeaf = false, predict = 1.0 (prob = 0.6666666666666666), impurity = 0.9182958340544896, split = Some(Feature = 0, threshold = 0.0, featureType = Continuous, categories = List()), stats = Some(gain = 0.31668908831502096, impurity = 0.9182958340544896, left impurity = 0.0, right impurity = 0.7219280948873623) 92 | (1.0,1.0) 93 | (0.0,0.0) 94 | 95 | ``` 96 | 97 | 参考 98 | 99 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 100 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 101 | 【3】https://github.com/xubo245/SparkLearning 102 | 【4】《数据挖掘导论》 103 | 【5】http://blog.csdn.net/taigw/article/details/44840771 -------------------------------------------------------------------------------- /9评估度量/Spark中组件Mllib的学习71之对多标签分类进行评估.md: -------------------------------------------------------------------------------- 1 | 2 | 更多代码请见:https://github.com/xubo245/SparkLearning 3 | 4 | Spark中组件Mllib的学习 5 | 6 | 1.解释 7 | 8 | 9 | ![](http://i.imgur.com/iX6xZ1w.png) 10 | ![](http://i.imgur.com/WHgsjX8.png) 11 | 12 | 2.代码: 13 | 14 | /** 15 | * @author xubo 16 | * ref:http://spark.apache.org/docs/1.5.2/mllib-guide.html 17 | * more code:https://github.com/xubo245/SparkLearning 18 | * more blog:http://blog.csdn.net/xubo245 19 | */ 20 | package org.apache.spark.mllib.EvaluationMetrics 21 | 22 | import org.apache.spark.util.SparkLearningFunSuite 23 | 24 | /** 25 | * Created by xubo on 2016/6/13. 
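 * Informal reading guide for the metrics printed below (not a substitute for the API docs):
 * per-document precision and recall compare the predicted and true label sets and are then
 * averaged over documents; Hamming loss counts mismatched (document, label) slots over all
 * documents and distinct labels, here 7 mismatches / (7 docs * 3 labels) = 0.333; subset
 * accuracy only credits exact set matches, here 2 of 7 documents = 0.286.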
26 | */ 27 | class MultilabelClassificationFunSuite extends SparkLearningFunSuite { 28 | test("testFunSuite") { 29 | 30 | 31 | import org.apache.spark.mllib.evaluation.MultilabelMetrics 32 | import org.apache.spark.rdd.RDD; 33 | 34 | val scoreAndLabels: RDD[(Array[Double],Array[Double])] = sc.parallelize( 35 | Seq((Array(0.0, 1.0), Array(0.0, 2.0)), (Array(0.0, 2.0), Array(0.0, 1.0)), 36 | (Array(), Array(0.0)), 37 | (Array(2.0), Array(2.0)), 38 | (Array(2.0, 0.0), Array(2.0, 0.0)), 39 | (Array(0.0, 1.0, 2.0), Array(0.0, 1.0)), (Array(1.0), Array(1.0, 2.0))), 2) 40 | 41 | // Instantiate metrics object 42 | val metrics = new MultilabelMetrics(scoreAndLabels) 43 | 44 | // Summary stats 45 | println(s"Recall = ${metrics.recall}") 46 | println(s"Precision = ${metrics.precision}") 47 | println(s"F1 measure = ${metrics.f1Measure}") 48 | println(s"Accuracy = ${metrics.accuracy}") 49 | 50 | // Individual label stats 51 | metrics.labels.foreach(label => println(s"Class $label precision = ${metrics.precision(label)}")) 52 | metrics.labels.foreach(label => println(s"Class $label recall = ${metrics.recall(label)}")) 53 | metrics.labels.foreach(label => println(s"Class $label F1-score = ${metrics.f1Measure(label)}")) 54 | 55 | // Micro stats 56 | println(s"Micro recall = ${metrics.microRecall}") 57 | println(s"Micro precision = ${metrics.microPrecision}") 58 | println(s"Micro F1 measure = ${metrics.microF1Measure}") 59 | 60 | // Hamming loss 61 | println(s"Hamming loss = ${metrics.hammingLoss}") 62 | 63 | // Subset accuracy 64 | println(s"Subset accuracy = ${metrics.subsetAccuracy}") 65 | 66 | 67 | } 68 | } 69 | 70 | 71 | 72 | 3.结果: 73 | 74 | Recall = 0.6428571428571429 75 | Precision = 0.6666666666666666 76 | F1 measure = 0.6380952380952382 77 | Accuracy = 0.5476190476190476 78 | Class 0.0 precision = 1.0 79 | Class 1.0 precision = 0.6666666666666666 80 | Class 2.0 precision = 0.5 81 | Class 0.0 recall = 0.8 82 | Class 1.0 recall = 0.6666666666666666 83 | Class 2.0 recall = 0.5 84 | Class 0.0 F1-score = 0.888888888888889 85 | Class 1.0 F1-score = 0.6666666666666666 86 | Class 2.0 F1-score = 0.5 87 | Micro recall = 0.6666666666666666 88 | Micro precision = 0.7272727272727273 89 | Micro F1 measure = 0.6956521739130435 90 | Hamming loss = 0.3333333333333333 91 | Subset accuracy = 0.2857142857142857 92 | 93 | 参考 94 | 95 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 96 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 97 | 【3】https://github.com/xubo245/SparkLearning 98 | 【4】book:Machine Learning with Spark ,Nick Pertreach 99 | 【5】book:Spark MlLib机器学习实战 100 | -------------------------------------------------------------------------------- /7特征提取和转换/Spark中组件Mllib的学习53之Word2Vec简单实例.md: -------------------------------------------------------------------------------- 1 | 2 | 更多代码请见:https://github.com/xubo245/SparkLearning 3 | 4 | Spark中组件Mllib的学习 5 | 6 | 1.解释 7 | 8 | word2vec是NLP领域的重要算法,它的功能是将word用K维的dense vector来表达,训练集是语料库,不含标点,以空格断句。因此可以看作是种特征处理方法。 9 | 10 | 主要优点: 11 | 12 | 加法操作。 13 | 高效。单机可处理1小时2千万词。 14 | 15 | 16 | 17 | ![](http://images0.cnblogs.com/blog2015/679630/201506/181709121708170.png) 18 | 19 | 20 | 2.代码: 21 | 22 | /** 23 | * @author xubo 24 | * ref:http://spark.apache.org/docs/1.5.2/mllib-guide.html 25 | * more code:https://github.com/xubo245/SparkLearning 26 | * more blog:http://blog.csdn.net/xubo245 27 | */ 28 | package org.apache.spark.mllib.FeatureExtractionAndTransformation 29 | 30 | import org.apache.spark.util.SparkLearningFunSuite 31 | 32 | /** 33 | * Created by xubo 
on 2016/6/13. 34 | */ 35 | class word2VecSuite extends SparkLearningFunSuite { 36 | test("testFunSuite") { 37 | 38 | import org.apache.spark._ 39 | import org.apache.spark.rdd._ 40 | import org.apache.spark.SparkContext._ 41 | import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel} 42 | 43 | val input = sc.textFile("file/data/mllib/input/FeatureExtractionAndTransformation/text8").map(line => line.split(" ").toSeq) 44 | //java.lang.OutOfMemoryError: Java heap space 45 | 46 | 47 | // val input = sc.textFile("file/data/mllib/input/FeatureExtractionAndTransformation/a.txt").map(line => line.split(" ").toSeq) 48 | 49 | val word2vec = new Word2Vec() 50 | 51 | val model = word2vec.fit(input) 52 | 53 | val synonyms = model.findSynonyms("china", 1) 54 | // val synonyms = model.findSynonyms("hello", 2) 55 | // val synonyms = model.findSynonyms("hell", 2) 56 | println("synonyms:" + synonyms.length) 57 | for ((synonym, cosineSimilarity) <- synonyms) { 58 | println(s"$synonym $cosineSimilarity") 59 | } 60 | 61 | // Save and load model 62 | // model.save(sc, "myModelPath") 63 | // val sameModel = Word2VecModel.load(sc, "myModelPath") 64 | 65 | } 66 | 67 | test("testFunSuite ,code From book by pk") { 68 | 69 | import org.apache.spark.mllib.feature.Word2Vec 70 | 71 | // val input = sc.textFile("file/data/mllib/input/FeatureExtractionAndTransformation/text8").map(line => line.split(" ").toSeq) 72 | //java.lang.OutOfMemoryError: Java heap space 73 | val data = sc.textFile("file/data/mllib/input/FeatureExtractionAndTransformation/aWord2vec.txt").map(line => line.split(" ").toSeq) 74 | 75 | 76 | val word2vec = new Word2Vec() //创建词向量实例 77 | val model = word2vec.fit(data) //训练模型 78 | println(model.getVectors) //打印向量模型 79 | val synonyms = model.findSynonyms("spark", 1) //寻找spar的相似词 80 | println("synonyms:" + synonyms.length) 81 | for (synonym <- synonyms) { 82 | //打印找到的内容 83 | println(synonym) 84 | } 85 | } 86 | } 87 | 88 | 89 | 90 | 3.结果: 91 | 92 | Map(hello -> [F@21cab9e, spark -> [F@28471327) 93 | synonyms:1 94 | (hello,-8.927243828523911E-4) 95 | 96 | 前面一个spark官网的test由于报内存不足,所以没有放上来。 97 | 98 | 参考 99 | 100 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 101 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 102 | 【3】https://github.com/xubo245/SparkLearning 103 | 【4】book:Machine Learning with Spark ,Nick Pertreach 104 | 【5】book:Spark MlLib机器学习实战 105 | 【6】http://www.cnblogs.com/aezero/p/4586605.html 106 | -------------------------------------------------------------------------------- /3分类和回归/Spark中组件Mllib的学习26之逻辑回归-简单数据集,带预测.md: -------------------------------------------------------------------------------- 1 | 更多代码请见:https://github.com/xubo245/SparkLearning 2 | Spark中组件Mllib的学习之逻辑回归篇 3 | 1解释 4 | 什么是逻辑回归? 
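先给出模型形式作为参考(示意性的写法,推导与细节见下文及参考【4】):二分类逻辑回归把特征的线性组合经过 sigmoid 函数映射成正类概率

$$P(y=1\mid x)=\frac{1}{1+e^{-(w^{T}x+b)}}$$

预测时默认以 0.5 作为阈值判定类别,本文代码中的 LogisticRegressionWithSGD 即采用该默认阈值。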
5 | 6 | Logistic回归与多重线性回归实际上有很多相同之处,最大的区别就在于它们的因变量不同,其他的基本都差不多。正是因为如此,这两种回归可以归于同一个家族,即广义线性模型(generalizedlinear model)。 7 | 8 | 这一家族中的模型形式基本上都差不多,不同的就是因变量不同。 9 | 如果是连续的,就是多重线性回归; 10 | 如果是二项分布,就是Logistic回归; 11 | 如果是Poisson分布,就是Poisson回归; 12 | 如果是负二项分布,就是负二项回归。 13 | 14 | Logistic回归的因变量可以是二分类的,也可以是多分类的,但是二分类的更为常用,也更加容易解释。所以实际中最常用的就是二分类的Logistic回归。 15 | 16 | Logistic回归的主要用途: 17 | 18 | 寻找危险因素:寻找某一疾病的危险因素等; 19 | 预测:根据模型,预测在不同的自变量情况下,发生某病或某种情况的概率有多大; 20 | 判别:实际上跟预测有些类似,也是根据模型,判断某人属于某病或属于某种情况的概率有多大,也就是看一下这个人有多大的可能性是属于某病。 21 | 22 | 更多请见参考【4】,【4】中还有推理、迭代更新和求解过程,正则化也有 23 | 24 | 逻辑回归: 25 | ![这里写图片描述](http://img.blog.csdn.net/20160524215642836) 26 | 27 | 2.代码: 28 | 29 | ``` 30 | /** 31 | * @author xubo 32 | * ref:Spark MlLib机器学习实战 33 | * more code:https://github.com/xubo245/SparkLearning 34 | * more blog:http://blog.csdn.net/xubo245 35 | */ 36 | package org.apache.spark.mllib.learning.regression 37 | 38 | import org.apache.spark.mllib.classification.LogisticRegressionWithSGD 39 | import org.apache.spark.mllib.linalg.Vectors 40 | import org.apache.spark.mllib.regression.LabeledPoint 41 | import org.apache.spark.{SparkConf, SparkContext} 42 | 43 | /** 44 | * Created by xubo on 2016/5/23. 45 | * 一元逻辑回归 46 | */ 47 | object LogisticRegressionLearning { 48 | def main(args: Array[String]) { 49 | val conf = new SparkConf().setMaster("local[4]").setAppName(this.getClass().getSimpleName().filter(!_.equals('$'))) 50 | val sc = new SparkContext(conf) 51 | 52 | val data = sc.textFile("file/data/mllib/input/regression/logisticRegression1.data") //获取数据集路径 53 | val parsedData = data.map { line => //开始对数据集处理 54 | val parts = line.split('|') //根据逗号进行分区 55 | LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble))) 56 | }.cache() //转化数据格式 57 | parsedData.foreach(println) 58 | val model = LogisticRegressionWithSGD.train(parsedData, 50) //建立模型 59 | val target = Vectors.dense(-1) //创建测试值 60 | val resulet = model.predict(target) //根据模型计算结果 61 | println("model.weights:") 62 | println(model.weights) 63 | println(resulet) //打印结果 64 | println(model.predict(Vectors.dense(10))) 65 | sc.stop 66 | } 67 | } 68 | 69 | ``` 70 | 71 | 数据: 72 | 73 | ``` 74 | 1|2 75 | 1|3 76 | 1|4 77 | 1|5 78 | 1|6 79 | 0|7 80 | 0|8 81 | 0|9 82 | 0|10 83 | 0|11 84 | ``` 85 | 86 | 3.结果: 87 | 88 | ``` 89 | 2016-05-24 21:59:06 WARN :139 - Your hostname, xubo-PC resolves to a loopback/non-reachable address: fe80:0:0:0:482:722f:5976:ce1f%20, but we couldn't find any external IP address! 
90 | (0.0,[8.0]) 91 | (1.0,[2.0]) 92 | (0.0,[9.0]) 93 | (1.0,[3.0]) 94 | (0.0,[10.0]) 95 | (1.0,[4.0]) 96 | (0.0,[11.0]) 97 | (1.0,[5.0]) 98 | (1.0,[6.0]) 99 | (0.0,[7.0]) 100 | 2016-05-24 21:59:07 WARN BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS 101 | 2016-05-24 21:59:07 WARN BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS 102 | model.weights: 103 | [-0.10590621151462867] 104 | 1.0 105 | 0.0 106 | ``` 107 | 108 | 参考 109 | 110 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 111 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 112 | 【3】https://github.com/xubo245/SparkLearning 113 | 【4】http://blog.csdn.net/pakko/article/details/37878837 -------------------------------------------------------------------------------- /5聚类/Spark中组件Mllib的学习45之用高斯混合模型来预测.md: -------------------------------------------------------------------------------- 1 | 2 | 更多代码请见:https://github.com/xubo245/SparkLearning 3 | 4 | Spark中组件Mllib的学习 5 | 6 | 1.解释 7 | 前面主要是建立模型,这篇讲的是使用mllib中的gmm来预测,主要改自mllib test中的suite 8 | 9 | 10 | 11 | 2.代码: 12 | 13 | test("model prediction, parallel and local") { 14 | val data = sc.parallelize(GaussianTestData.data) 15 | val gmm = new GaussianMixture().setK(4).setSeed(0).run(data) 16 | 17 | val batchPredictions = gmm.predict(data) 18 | batchPredictions.zip(data).collect().foreach { case (batchPred, datum) => 19 | print("batchPred:"+batchPred) 20 | println(" datum:"+datum) 21 | assert(batchPred === gmm.predict(datum)) 22 | } 23 | /** ****************add by xubo 20160613 *************/ 24 | for (i <- 0 until gmm.k) { 25 | println("weight=%f\nmu=%s\nsigma=\n%s\n" format 26 | (gmm.weights(i), gmm.gaussians(i).mu, gmm.gaussians(i).sigma)) 27 | } 28 | 29 | /** ****************add by xubo 20160613 *************/ 30 | } 31 | 32 | 数据: 33 | 34 | object GaussianTestData { 35 | 36 | val data = Array( 37 | Vectors.dense(-5.1971), Vectors.dense(-2.5359), Vectors.dense(-3.8220), 38 | Vectors.dense(-5.2211), Vectors.dense(-5.0602), Vectors.dense(4.7118), 39 | Vectors.dense(6.8989), Vectors.dense(3.4592), Vectors.dense(4.6322), 40 | Vectors.dense(5.7048), Vectors.dense(4.6567), Vectors.dense(5.5026), 41 | Vectors.dense(4.5605), Vectors.dense(5.2043), Vectors.dense(6.2734) 42 | ) 43 | 44 | val data2: Array[Vector] = Array.tabulate(25) { i: Int => 45 | Vectors.dense(Array.tabulate(50)(i + _.toDouble)) 46 | } 47 | 48 | } 49 | 50 | 3.结果: 51 | k=2: 52 | 53 | batchPred:1 datum:[-5.1971] 54 | batchPred:1 datum:[-2.5359] 55 | batchPred:1 datum:[-3.822] 56 | batchPred:1 datum:[-5.2211] 57 | batchPred:1 datum:[-5.0602] 58 | batchPred:0 datum:[4.7118] 59 | batchPred:0 datum:[6.8989] 60 | batchPred:0 datum:[3.4592] 61 | batchPred:0 datum:[4.6322] 62 | batchPred:0 datum:[5.7048] 63 | batchPred:0 datum:[4.6567] 64 | batchPred:0 datum:[5.5026] 65 | batchPred:0 datum:[4.5605] 66 | batchPred:0 datum:[5.2043] 67 | batchPred:0 datum:[6.2734] 68 | weight=0.666667 69 | mu=[5.160440000000388] 70 | sigma= 71 | 0.8664462983997272 72 | 73 | weight=0.333333 74 | mu=[-4.367259999996172] 75 | sigma= 76 | 1.1098061864295243 77 | 78 | k=4: 79 | 80 | 81 | batchPred:0 datum:[-5.2211] 82 | batchPred:0 datum:[-5.0602] 83 | batchPred:2 datum:[4.7118] 84 | batchPred:1 datum:[6.8989] 85 | batchPred:2 datum:[3.4592] 86 | batchPred:2 datum:[4.6322] 87 | batchPred:2 datum:[5.7048] 88 | batchPred:2 datum:[4.6567] 89 | batchPred:2 datum:[5.5026] 90 | batchPred:2 datum:[4.5605] 91 | batchPred:2 datum:[5.2043] 92 | batchPred:2 datum:[6.2734] 93 | 
weight=0.199703 94 | mu=[-5.159541319359388] 95 | sigma= 96 | 0.005019244232416107 97 | 98 | weight=0.264231 99 | mu=[5.2562903953513125] 100 | sigma= 101 | 0.9154479492242813 102 | 103 | weight=0.402436 104 | mu=[5.09750654075634] 105 | sigma= 106 | 0.8242799738536565 107 | 108 | weight=0.133631 109 | mu=[-3.183244472524267] 110 | sigma= 111 | 0.42087592390746537 112 | 113 | 114 | 参考 115 | 116 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 117 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 118 | 【3】https://github.com/xubo245/SparkLearning 119 | 120 | 121 | 122 | 完整代码路径:【3】中可以找到 123 | 124 | package org.apache.spark.mllib.learning.clustering.GaussianMixture 125 | 126 | class GaussianMixtureFromSparkSuite -------------------------------------------------------------------------------- /1数据类型/Spark中组件Mllib的学习16之分布式行矩阵的四种形式.md: -------------------------------------------------------------------------------- 1 | 更多代码请见:https://github.com/xubo245/SparkLearning 2 | Spark中组件Mllib的学习之基础概念篇 3 | 1解释 4 | 分布式行矩阵有:基本行矩阵、index 行矩阵、坐标行矩阵、块行矩阵 5 | 功能一次增加 6 | 7 | 2.代码: 8 | 9 | ``` 10 | 11 | /** 12 | * @author xubo 13 | * ref:Spark MlLib机器学习实战 14 | * more code:https://github.com/xubo245/SparkLearning 15 | * more blog:http://blog.csdn.net/xubo245 16 | */ 17 | package org.apache.spark.mllib.learning.basic 18 | 19 | import org.apache.spark.mllib.linalg.Vectors 20 | import org.apache.spark.mllib.linalg.distributed._ 21 | import org.apache.spark.{SparkConf, SparkContext} 22 | 23 | /** 24 | * Created by xubo on 2016/5/23. 25 | */ 26 | object MatrixRowLearning { 27 | def main(args: Array[String]) { 28 | val conf = new SparkConf().setMaster("local").setAppName(this.getClass().getSimpleName().filter(!_.equals('$'))) 29 | val sc = new SparkContext(conf) 30 | 31 | println("First:Matrix ") 32 | val rdd = sc.textFile("file/data/mllib/input/basic/MatrixRow.txt") //创建RDD文件路径 33 | .map(_.split(' ') //按“ ”分割 34 | .map(_.toDouble)) //转成Double类型 35 | .map(line => Vectors.dense(line)) //转成Vector格式 36 | val rm = new RowMatrix(rdd) //读入行矩阵 37 | // for(i <- rm){ 38 | // println(i) 39 | // } 40 | //error 41 | //疑问:如何打印行矩阵所有值,如何定位? 42 | println(rm.numRows()) //打印列数 43 | println(rm.numCols()) //打印行数 44 | rm.rows.foreach(println) 45 | 46 | println("Second:index Row Matrix ") 47 | val rdd2 = sc.textFile("file/data/mllib/input/basic/MatrixRow.txt") //创建RDD文件路径 48 | .map(_.split(' ') //按“ ”分割 49 | .map(_.toDouble)) //转成Double类型 50 | .map(line => Vectors.dense(line)) //转化成向量存储 51 | .map((vd) => new IndexedRow(vd.size, vd)) //转化格式 52 | val irm = new IndexedRowMatrix(rdd2) //建立索引行矩阵实例 53 | println(irm.getClass) //打印类型 54 | irm.rows.foreach(println) //打印内容数据 55 | //如何定位? 56 | 57 | println("Third: Coordinate Row Matrix ") 58 | val rdd3 = sc.textFile("file/data/mllib/input/basic/MatrixRow.txt") //创建RDD文件路径 59 | .map(_.split(' ') //按“ ”分割 60 | .map(_.toDouble)) //转成Double类型 61 | .map(vue => (vue(0).toLong, vue(1).toLong, vue(2))) //转化成坐标格式 62 | .map(vue2 => new MatrixEntry(vue2 _1, vue2 _2, vue2 _3)) //转化成坐标矩阵格式 63 | val crm = new CoordinateMatrix(rdd3) //实例化坐标矩阵 64 | crm.entries.foreach(println) //打印数据 65 | println(crm.numCols()) 66 | println(crm.numCols()) 67 | // Return approximate number of distinct elements in the RDD. 
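    // Note: countApproxDistinct() below uses a HyperLogLog-style sketch (default relative
    // error 0.05), so its result is an estimate in general; with only two entries here it
    // returns 2. For the "Block Matrix" section left empty further down, one possible
    // sketch is crm.toBlockMatrix(), which converts this CoordinateMatrix into a BlockMatrix.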
68 | println(crm.entries.countApproxDistinct()) 69 | 70 | 71 | println("Fourth: Block Matrix :null") 72 | //块矩阵待完善 73 | 74 | sc.stop 75 | } 76 | } 77 | 78 | ``` 79 | 80 | 3.结果: 81 | 82 | ``` 83 | First:Matrix 84 | 2016-05-23 19:04:24 WARN :139 - Your hostname, xubo-PC resolves to a loopback/non-reachable address: fe80:0:0:0:482:722f:5976:ce1f%20, but we couldn't find any external IP address! 85 | 2 86 | 3 87 | [1.0,2.0,3.0] 88 | [4.0,5.0,6.0] 89 | Second:index Row Matrix 90 | class org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix 91 | IndexedRow(3,[1.0,2.0,3.0]) 92 | IndexedRow(3,[4.0,5.0,6.0]) 93 | Third: Coordinate Row Matrix 94 | MatrixEntry(1,2,3.0) 95 | MatrixEntry(4,5,6.0) 96 | 6 97 | 6 98 | 2 99 | Fourth: Block Matrix :null 100 | ``` 101 | 102 | 参考 103 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 104 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 105 | 【3】https://github.com/xubo245/SparkLearning 106 | -------------------------------------------------------------------------------- /5聚类/Spark中组件Mllib的学习44之高斯混合聚类GaussianMixture.md: -------------------------------------------------------------------------------- 1 | 2 | 更多代码请见:https://github.com/xubo245/SparkLearning 3 | 4 | Spark中组件Mllib的学习 5 | 6 | 1.解释 7 | 8 | GaussianMixture可以理解为多个高斯函数进行混合,每个高斯函数权重不一样。 9 | 10 | 源码分析请见【4】 11 | 12 | 13 | 14 | 2.代码: 15 | 16 | /** 17 | * @author xubo 18 | * ref:http://spark.apache.org/docs/1.5.2/mllib-guide.html 19 | * more code:https://github.com/xubo245/SparkLearning 20 | * more blog:http://blog.csdn.net/xubo245 21 | */ 22 | package org.apache.spark.mllib.learning.clustering.GaussianMixture 23 | 24 | import java.text.SimpleDateFormat 25 | import java.util.Date 26 | 27 | import org.apache.spark.mllib.clustering.{GaussianMixture, GaussianMixtureModel} 28 | import org.apache.spark.mllib.linalg.Vectors 29 | import org.apache.spark.util.SparkLearningFunSuite 30 | 31 | /** 32 | * Created by xubo on 2016/6/13. 
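 * A short informal note on the model fitted below: a Gaussian mixture writes the density as
 * p(x) = sum_k w_k * N(x; mu_k, sigma_k) with non-negative weights that sum to 1, and MLlib
 * estimates the weights, means and covariances with an EM-style iterative procedure; the
 * loop at the end of the test prints exactly these three quantities for each component.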
33 | */ 34 | class GaussianMixtureSuite extends SparkLearningFunSuite { 35 | test("testFunSuite") { 36 | 37 | // Load and parse the data 38 | val data = sc.textFile("file/data/mllib/input/mllibFromSpark/gmm_data.txt") 39 | val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble))).cache() 40 | 41 | // Cluster the data into two classes using GaussianMixture 42 | val gmm = new GaussianMixture().setK(2).run(parsedData) 43 | 44 | // Save and load model 45 | val iString = new SimpleDateFormat("yyyyMMddHHmmssSSS").format(new Date()) 46 | val output = "file/data/mllib/output/mllibFromSpark/myGMMModel" + iString 47 | println("output:" + output) 48 | println("gmm.weights:" + gmm.weights) 49 | println("gmm.weights:" + gmm.weights) 50 | println("gmm.weights.length:" + gmm.weights.length) 51 | 52 | // gmm.save(sc, output) 53 | // val sameModel = GaussianMixtureModel.load(sc, output) 54 | 55 | // output parameters of max-likelihood model 56 | for (i <- 0 until gmm.k) { 57 | println("weight=%f\nmu=%s\nsigma=\n%s\n" format 58 | (gmm.weights(i), gmm.gaussians(i).mu, gmm.gaussians(i).sigma)) 59 | } 60 | } 61 | } 62 | 63 | 64 | 65 | 3.结果: 66 | k=2: 67 | 68 | output:file/data/mllib/output/mllibFromSpark/myGMMModel20160613172124714 69 | gmm.weights:[D@76556aff 70 | gmm.weights.length:2 71 | weight=0.479516 72 | mu=[0.07229334132596348,0.016699872146612848] 73 | sigma= 74 | 4.787456829778775 1.8802887426093131 75 | 1.8802887426093131 0.9160786892956797 76 | 77 | weight=0.520484 78 | mu=[-0.10418516009966565,0.04279316009103921] 79 | sigma= 80 | 4.899755775046639 -2.002791396537124 81 | -2.002791396537124 1.0099533766097555 82 | 83 | 84 | k=3: 85 | 86 | output:file/data/mllib/output/mllibFromSpark/myGMMModel20160613173058582 87 | gmm.weights:[D@63f7f62 88 | gmm.weights.length:3 89 | weight=0.478456 90 | mu=[0.07294229123698849,0.016880200460870468] 91 | sigma= 92 | 4.799622783996325 1.884861638240691 93 | 1.884861638240691 0.9186484216430504 94 | 95 | weight=0.352254 96 | mu=[-0.016078882202045494,-0.09041925014095285] 97 | sigma= 98 | 4.214865453249383 -1.679616942501954 99 | -1.679616942501954 0.8180392275099243 100 | 101 | weight=0.169290 102 | mu=[-0.2882428708757841,0.31930454247949075] 103 | sigma= 104 | 6.239284901444058 -2.5886021455230255 105 | -2.5886021455230255 1.288079844267794 106 | 107 | 108 | 109 | 参考 110 | 111 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 112 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 113 | 【3】https://github.com/xubo245/SparkLearning 114 | 【4】http://blog.csdn.net/notheory/article/details/50219451 115 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | 2 | 系统环境: 3 | Spark-1.5.2 4 | hadoop2.6.0 5 | scala-2.10.4 6 | idea 15.0.4 7 | 8 | Spark mllib学习目录: 9 | 10 | **1.数据类型** 11 | 12 | Spark中组件Mllib的学习12之密集向量和稀疏向量的生成 13 | Spark中组件Mllib的学习13之给向量打标签 14 | Spark中组件Mllib的学习14之从文本中读取带标签的数据,生成带label的向量 15 | Spark中组件Mllib的学习15之创建分布式矩阵 16 | Spark中组件Mllib的学习16之分布式行矩阵的四种形式 17 | Spark中组件Mllib的学习43之BlockMatrix 18 | 19 | **2.基本统计** 20 | 21 | Spark中组件Mllib的学习3之用户相似度计算 22 | Spark中组件Mllib的学习17之colStats_以列为基础计算统计量的基本数据 23 | Spark中组件Mllib的学习18之corr_两组数据相关关系计算(Pearson、Spearman) 24 | Spark中组件Mllib的学习19之分层抽样 25 | Spark中组件Mllib的学习20之假设检验-卡方检验 26 | Spark中组件Mllib的学习21之随机数-RandomRDD产生 27 | Spark中组件Mllib的学习22之假设检验-卡方检验概念理解 28 | Spark中组件Mllib的学习42之rowMatrix的QR分解 29 | 30 | **3.分类和回归** 31 | 32 | Spark中组件Mllib的学习23之随机梯度下降(SGD) 33 | 
Spark中组件Mllib的学习24之线性回归1-小数据集 34 | Spark中组件Mllib的学习25之线性回归2-较大数据集(多元) 35 | Spark中组件Mllib的学习26之逻辑回归-简单数据集,带预测 36 | Spark中组件Mllib的学习27之逻辑回归-多元逻辑回归,较大数据集,带预测准确度计算 37 | Spark中组件Mllib的学习28之支持向量机SVM-方法1 38 | Spark中组件Mllib的学习29之支持向量机SVM-方法2 39 | Spark中组件Mllib的学习30之逻辑回归LogisticRegressionWithLBFGS 40 | Spark中组件Mllib的学习31之朴素贝叶斯分类器(多项式朴素贝叶斯) 41 | Spark中组件Mllib的学习32之朴素贝叶斯分类器(伯努利朴素贝叶斯) 42 | Spark中组件Mllib的学习33之决策树(使用Gini) 43 | Spark中组件Mllib的学习34之决策树(使用entropy) 44 | Spark中组件Mllib的学习35之随机森林(entropy)进行分类 45 | Spark中组件Mllib的学习36之决策树(使用variance)进行回归 46 | Spark中组件Mllib的学习37之随机森林(Gini)进行分类 47 | Spark中组件Mllib的学习38之随机森林(使用variance)进行回归 48 | Spark中组件Mllib的学习39之梯度提升树(GBT)用于分类 49 | Spark中组件Mllib的学习40之梯度提升树(GBT)用于回归 50 | Spark中组件Mllib的学习41之保序回归(Isotonic regression) 51 | 52 | 53 | **4.协同过滤** 54 | 55 | Spark中组件Mllib的学习2之MovieLensALS学习(集群run-eaxmples运行) 56 | Spark中组件Mllib的学习4之examples中的MovieLensALS修改本地运行 57 | Spark中组件Mllib的学习5之ALS测试(apache spark) 58 | Spark中组件Mllib的学习6之ALS测试(apache spark 含隐式转换) 59 | Spark中组件Mllib的学习7之ALS隐式转换训练的model来预测数据 60 | Spark中组件Mllib的学习8之ALS训练的model来预测数据 61 | Spark中组件Mllib的学习9之ALS训练的model来预测数据的准确率研究 62 | Spark中组件Mllib的学习10之修改MovieLens来对movieLen中的100k数据进行预测 63 | Spark中组件Mllib的学习11之使用ALS对movieLens中一百万条(1M)数据集进行训练,并对输入的新用户数据进行电影推荐 64 | 65 | **5.聚类** 66 | 67 | Spark中组件Mllib的学习1之Kmeans错误解决 68 | Spark中组件Mllib的学习44之高斯混合聚类GaussianMixture 69 | Spark中组件Mllib的学习45之用高斯混合模型来预测 70 | Spark中组件Mllib的学习46之Power iteration clustering 71 | Spark中组件Mllib的学习47之隐含狄利克雷分布(Latent Dirichlet allocation (LDA)学习 72 | Spark中组件Mllib的学习48之流式k均值(Streaming kmeans) 73 | 74 | **6.降维** 75 | 76 | Spark中组件Mllib的学习49之奇异值分解SVD(Singular value decomposition) 77 | Spark中组件Mllib的学习50之主成份分析PCA 78 | Spark中组件Mllib的学习51之使用PCA从数据集中得到主向量 79 | 80 | **7.特征提取和转换** 81 | 82 | Spark中组件Mllib的学习52之TF-IDF学习 83 | Spark中组件Mllib的学习53之HashingTF理解和使用 84 | Spark中组件Mllib的学习53之Word2Vec简单实例 85 | Spark中组件Mllib的学习54之word2Vec实例分析(text8数据集) 86 | Spark中组件Mllib的学习55之使用TfIdf来分析20news数据集 87 | Spark中组件Mllib的学习56之标准化(StandardScaler,来自SparkWeb) 88 | Spark中组件Mllib的学习57之标准化参数和公式理解(StandardScaler,来自SparkCode) 89 | Spark中组件Mllib的学习58之归一化(Normalizer)Normalization using L1 distance 90 | Spark中组件Mllib的学习59之归一化(Normalizer)Normalization using L2 distance 91 | Spark中组件Mllib的学习60之归一化(Normalizer)Normalization using L^Inf distance 92 | Spark中组件Mllib的学习61之归一化(Normalizer)SparkWeb实例分析 93 | Spark中组件Mllib的学习62之特征选择中的卡方选择器 94 | Spark中组件Mllib的学习63之特征选择中的卡方选择器实例(libsvm数据集) 95 | Spark中组件Mllib的学习64之元素智能乘积ElementwiseProduct 96 | Spark中组件Mllib的学习65之使用PCA进行特征转换 97 | 98 | **8.频繁项挖掘** 99 | 100 | Spark中组件Mllib的学习66之FP-growth 101 | Spark中组件Mllib的学习67之关联规则AssociationRules 102 | Spark中组件Mllib的学习68之PrefixSpan 103 | 104 | **9.评估度量** 105 | 106 | Spark中组件Mllib的学习69之对二分类进行评估Binary classification 107 | Spark中组件Mllib的学习70之对多类分类结果进行评估Multiclass classification 108 | Spark中组件Mllib的学习71之对多标签分类进行评估 109 | Spark中组件Mllib的学习72之RankingSystem进行评估 110 | Spark中组件Mllib的学习73之回归问题的评估 111 | 112 | **10.PMML模型输出** 113 | 114 | Spark中组件Mllib的学习74之预言模型标记语言PMML 115 | 116 | **11优化** 117 | 118 | Spark中组件Mllib的学习75之L-BFGS -------------------------------------------------------------------------------- /11优化/Spark中组件Mllib的学习75之L-BFGS.md: -------------------------------------------------------------------------------- 1 | 2 | 更多代码请见:https://github.com/xubo245/SparkLearning 3 | 4 | Spark中组件Mllib的学习 5 | 6 | 1.解释 7 | 8 | L-BFGS的概念请参考【6】 9 | 10 | 11 | 12 | 2.代码: 13 | 14 | /** 15 | * @author xubo 16 | * ref:http://spark.apache.org/docs/1.5.2/mllib-guide.html 17 | * more code:https://github.com/xubo245/SparkLearning 18 | * more 
blog:http://blog.csdn.net/xubo245 19 | */ 20 | package org.apache.spark.mllib.Optimization 21 | 22 | import org.apache.spark.util.SparkLearningFunSuite 23 | 24 | /** 25 | * Created by xubo on 2016/6/13. 26 | */ 27 | class LBFGSFunSuite extends SparkLearningFunSuite { 28 | test("testFunSuite") { 29 | 30 | 31 | import org.apache.spark.SparkContext 32 | import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics 33 | import org.apache.spark.mllib.linalg.Vectors 34 | import org.apache.spark.mllib.util.MLUtils 35 | import org.apache.spark.mllib.classification.LogisticRegressionModel 36 | import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater} 37 | 38 | val data = MLUtils.loadLibSVMFile(sc, "file/data/mllib/input/mllibFromSpark/sample_libsvm_data.txt") 39 | val numFeatures = data.take(1)(0).features.size 40 | 41 | // Split data into training (60%) and test (40%). 42 | val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L) 43 | 44 | // Append 1 into the training data as intercept. 45 | val training = splits(0).map(x => (x.label, MLUtils.appendBias(x.features))).cache() 46 | 47 | val test = splits(1) 48 | 49 | // Run training algorithm to build the model 50 | val numCorrections = 10 51 | val convergenceTol = 1e-4 52 | val maxNumIterations = 20 53 | val regParam = 0.1 54 | val initialWeightsWithIntercept = Vectors.dense(new Array[Double](numFeatures + 1)) 55 | 56 | val (weightsWithIntercept, loss) = LBFGS.runLBFGS( 57 | training, 58 | new LogisticGradient(), 59 | new SquaredL2Updater(), 60 | numCorrections, 61 | convergenceTol, 62 | maxNumIterations, 63 | regParam, 64 | initialWeightsWithIntercept) 65 | 66 | val model = new LogisticRegressionModel( 67 | Vectors.dense(weightsWithIntercept.toArray.slice(0, weightsWithIntercept.size - 1)), 68 | weightsWithIntercept(weightsWithIntercept.size - 1)) 69 | 70 | // Clear the default threshold. 71 | model.clearThreshold() 72 | 73 | // Compute raw scores on the test set. 74 | val scoreAndLabels = test.map { point => 75 | val score = model.predict(point.features) 76 | (score, point.label) 77 | } 78 | 79 | // Get evaluation metrics. 
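    // Note: because clearThreshold() was called above, predict() returns the raw class-1
    // probability instead of a 0/1 label, which is what BinaryClassificationMetrics expects;
    // areaUnderROC() then sweeps all score thresholds, so a value of 1.0 means every positive
    // test example was ranked above every negative one.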
80 | val metrics = new BinaryClassificationMetrics(scoreAndLabels) 81 | val auROC = metrics.areaUnderROC() 82 | 83 | println("Loss of each step in training process") 84 | loss.foreach(println) 85 | println("Area under ROC = " + auROC) 86 | 87 | 88 | } 89 | } 90 | 91 | 92 | 93 | 3.结果: 94 | 95 | Loss of each step in training process 96 | 0.6931471805599448 97 | 0.6493820266740578 98 | 0.21500294643532605 99 | 0.0021993980987416607 100 | 5.202713917046544E-4 101 | 3.4927255641289994E-4 102 | 1.888143055954077E-4 103 | 1.2596418162046915E-4 104 | 9.190860508937821E-5 105 | 7.563586578488929E-5 106 | 6.752517240852286E-5 107 | 6.361011786413444E-5 108 | 6.141097383991715E-5 109 | 5.932972379127243E-5 110 | 5.554554457038146E-5 111 | 4.480277717224111E-5 112 | 3.240944275555446E-5 113 | 3.0444586625324565E-5 114 | Area under ROC = 1.0 115 | 116 | 参考 117 | 118 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 119 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 120 | 【3】https://github.com/xubo245/SparkLearning 121 | 【4】book:Machine Learning with Spark ,Nick Pertreach 122 | 【5】book:Spark MlLib机器学习实战 123 | 【6】https://github.com/endymecy/spark-ml-source-analysis/blob/master/%E6%9C%80%E4%BC%98%E5%8C%96%E7%AE%97%E6%B3%95/L-BFGS/lbfgs.md 124 | 125 | -------------------------------------------------------------------------------- /3分类和回归/Spark中组件Mllib的学习38之随机森林(使用variance)进行回归.md: -------------------------------------------------------------------------------- 1 | 更多代码请见:https://github.com/xubo245/SparkLearning 2 | Spark中组件Mllib的学习之分类篇 3 | 1解释 4 | 随机森林(使用variance)进行回归 5 | 6 | 7 | 2.代码: 8 | 9 | ``` 10 | /** 11 | * @author xubo 12 | * ref:Spark MlLib机器学习实战 13 | * more code:https://github.com/xubo245/SparkLearning 14 | * more blog:http://blog.csdn.net/xubo245 15 | */ 16 | package org.apache.spark.mllib.learning.classification 17 | 18 | import org.apache.spark.mllib.tree.RandomForest 19 | import org.apache.spark.mllib.util.MLUtils 20 | import org.apache.spark.{SparkConf, SparkContext} 21 | import org.apache.spark.mllib.tree.RandomForest 22 | import org.apache.spark.mllib.tree.model.RandomForestModel 23 | import org.apache.spark.mllib.util.MLUtils 24 | 25 | /** 26 | * Created by xubo on 2016/5/23. 27 | */ 28 | object RandomForest3VarianceRegression { 29 | def main(args: Array[String]) { 30 | val conf = new SparkConf().setMaster("local[4]").setAppName(this.getClass().getSimpleName().filter(!_.equals('$'))) 31 | val sc = new SparkContext(conf) 32 | 33 | // Load and parse the data file. 34 | val data = MLUtils.loadLibSVMFile(sc, "file/data/mllib/input/classification/sample_libsvm_data.txt") 35 | 36 | // Split the data into training and test sets (30% held out for testing) 37 | val splits = data.randomSplit(Array(0.7, 0.3)) 38 | val (trainingData, testData) = (splits(0), splits(1)) 39 | 40 | // Train a RandomForest model. 41 | // Empty categoricalFeaturesInfo indicates all features are continuous. 42 | val numClasses = 2 43 | val categoricalFeaturesInfo = Map[Int, Int]() 44 | val numTrees = 3 // Use more in practice. 45 | val featureSubsetStrategy = "auto" // Let the algorithm choose. 
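    // Note: with numTrees > 1, the "auto" strategy samples a random feature subset at each
    // node, sqrt(#features) for classification and one third of the features for regression
    // (the case in this example); this per-node subsampling is what de-correlates the trees.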
46 | val impurity = "variance" 47 | val maxDepth = 4 48 | val maxBins = 32 49 | 50 | val model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo, 51 | numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins) 52 | 53 | // Evaluate model on test instances and compute test error 54 | val labelsAndPredictions = testData.map { point => 55 | val prediction = model.predict(point.features) 56 | (point.label, prediction) 57 | } 58 | val testMSE = labelsAndPredictions.map { case (v, p) => math.pow((v - p), 2) }.mean() 59 | println("Test Mean Squared Error = " + testMSE) 60 | println("Learned regression forest model:\n" + model.toDebugString) 61 | 62 | println("data.count:" + data.count()) 63 | println("trainingData.count:" + trainingData.count()) 64 | println("testData.count:" + testData.count()) 65 | println("model.algo:" + model.algo) 66 | println("model.trees:" + model.trees) 67 | 68 | // Save and load model 69 | // val iString = new SimpleDateFormat("yyyyMMddHHmmssSSS").format(new Date()) 70 | // val path = "file/data/mllib/output/classification/RandomForestModel" + iString + "/result" 71 | // model.save(sc, path) 72 | // val sameModel = RandomForestModel.load(sc, path) 73 | // println(sameModel.algo) 74 | sc.stop 75 | } 76 | } 77 | 78 | ``` 79 | 80 | 3.结果: 81 | 82 | ``` 83 | Test Mean Squared Error = 0.006944444444444447 84 | Learned regression forest model: 85 | TreeEnsembleModel regressor with 3 trees 86 | 87 | Tree 0: 88 | If (feature 379 <= 23.0) 89 | Predict: 0.0 90 | Else (feature 379 > 23.0) 91 | Predict: 1.0 92 | Tree 1: 93 | If (feature 434 <= 0.0) 94 | Predict: 0.0 95 | Else (feature 434 > 0.0) 96 | Predict: 1.0 97 | Tree 2: 98 | If (feature 490 <= 31.0) 99 | Predict: 0.0 100 | Else (feature 490 > 31.0) 101 | Predict: 1.0 102 | 103 | data.count:100 104 | trainingData.count:68 105 | testData.count:32 106 | model.algo:Regression 107 | model.trees:[Lorg.apache.spark.mllib.tree.model.DecisionTreeModel;@43f17f99 108 | ``` 109 | 110 | 参考 111 | 112 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 113 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 114 | 【3】https://github.com/xubo245/SparkLearning -------------------------------------------------------------------------------- /2基本统计/Spark中组件Mllib的学习22之假设检验-卡方检验概念理解.md: -------------------------------------------------------------------------------- 1 | 2 | 更多代码请见:https://github.com/xubo245/SparkLearning 3 | Spark中组件Mllib的学习之基础概念篇 4 | 1解释 5 | 参考【4】的博文讲的比较清楚了,只是里面有些错误。 6 | 定义 7 | 8 | 卡方检验就是统计样本的实际观测值与理论推断值之间的偏离程度,实际观测值与理论推断值之间的偏离程度就决定卡方值的大小,卡方值越大,越不符合;卡方值越小,偏差越小,越趋于符合,若两个值完全相等时,卡方值就为0,表明理论值完全符合。 9 | 10 | (1)提出原假设: 11 | H0:总体X的分布函数为F(x). 
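在看下图的皮尔逊统计量之前,先给出一份文字版公式备查(记号与下文一致:f_i 为落入第 i 个小区间的观测频数,np_i 为对应的理论频数,k 为区间个数):

$$\chi^{2}=\sum_{i=1}^{k}\frac{(f_i-np_i)^{2}}{np_i}$$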
12 | 13 | 基于皮尔逊的检验统计量: 14 | ![这里写图片描述](http://img.blog.csdn.net/20160524113039028) 15 | 16 | 理解:n次试验中样本值落入第i个小区间Ai的频率fi/n与概率pi应很接近,当H0不真时,则fi/n与pi相差很大。在假设成立的情况下服从自由度为k-1的卡方分布。 17 | 18 | 参考【4】中给了例子,比较好理解,下面是截图: 19 | ![这里写图片描述](http://img.blog.csdn.net/20160524113559186) 20 | 21 | 说明:19,34,24,10为实际测量值,括号内为计算值,比如26.2=(53/87)*43 22 | 计算卡方检验的值: 23 | 如上图3,也可以是下图专门的计算公式: 24 | ![这里写图片描述](http://img.blog.csdn.net/20160524113840250) 25 | 26 | p-value确定:具体的没理解,根据参考【4】查表可以知道大概在0.001 27 | 28 | 29 | 【4】中还给出了:“从表20-14可见,T1.2和T2.2数值都<5,且总例数大于40,故宜用校正公式(20.15)检验”,可以去看看 30 | 31 | 2.代码: 32 | 33 | ``` 34 | /** 35 | * @author xubo 36 | * ref:Spark MlLib机器学习实战 37 | * more code:https://github.com/xubo245/SparkLearning 38 | * more blog:http://blog.csdn.net/xubo245 39 | */ 40 | package org.apache.spark.mllib.learning.basic 41 | 42 | import org.apache.spark.mllib.linalg.{Matrix, Matrices, Vectors} 43 | import org.apache.spark.mllib.stat.Statistics 44 | import org.apache.spark.{SparkConf, SparkContext} 45 | 46 | /** 47 | * Created by xubo on 2016/5/23. 48 | */ 49 | object ChiSqLearning { 50 | def main(args: Array[String]) { 51 | val vd = Vectors.dense(1, 2, 3, 4, 5) 52 | val vdResult = Statistics.chiSqTest(vd) 53 | println(vd) 54 | println(vdResult) 55 | println("-------------------------------") 56 | val mtx = Matrices.dense(3, 2, Array(1, 3, 5, 2, 4, 6)) 57 | val mtxResult = Statistics.chiSqTest(mtx) 58 | println(mtx) 59 | println(mtxResult) 60 | //print :方法、自由度、方法的统计量、p值 61 | println("-------------------------------") 62 | val mtx2 = Matrices.dense(2, 2, Array(19.0, 34, 24, 10.0)) 63 | printChiSqTest(mtx2) 64 | printChiSqTest( Matrices.dense(2, 2, Array(26.0, 36, 7, 2.0))) 65 | // val mtxResult2 = Statistics.chiSqTest(mtx2) 66 | // println(mtx2) 67 | // println(mtxResult2) 68 | } 69 | 70 | def printChiSqTest(matrix: Matrix): Unit = { 71 | println("-------------------------------") 72 | val mtxResult2 = Statistics.chiSqTest(matrix) 73 | println(matrix) 74 | println(mtxResult2) 75 | } 76 | 77 | 78 | } 79 | 80 | ``` 81 | 82 | 3.结果: 83 | 84 | ``` 85 | [1.0,2.0,3.0,4.0,5.0] 86 | Chi squared test summary: 87 | method: pearson 88 | degrees of freedom = 4 89 | statistic = 3.333333333333333 90 | pValue = 0.5036682742334986 91 | No presumption against null hypothesis: observed follows the same distribution as expected.. 92 | ------------------------------- 93 | 1.0 2.0 94 | 3.0 4.0 95 | 5.0 6.0 96 | Chi squared test summary: 97 | method: pearson 98 | degrees of freedom = 2 99 | statistic = 0.14141414141414144 100 | pValue = 0.931734784568187 101 | No presumption against null hypothesis: the occurrence of the outcomes is statistically independent.. 102 | ------------------------------- 103 | ------------------------------- 104 | 19.0 24.0 105 | 34.0 10.0 106 | Chi squared test summary: 107 | method: pearson 108 | degrees of freedom = 1 109 | statistic = 9.999815802502738 110 | pValue = 0.0015655588405594223 111 | Very strong presumption against null hypothesis: the occurrence of the outcomes is statistically independent.. 112 | ------------------------------- 113 | 26.0 7.0 114 | 36.0 2.0 115 | Chi squared test summary: 116 | method: pearson 117 | degrees of freedom = 1 118 | statistic = 4.05869675818742 119 | pValue = 0.043944401832082036 120 | Strong presumption against null hypothesis: the occurrence of the outcomes is statistically independent.. 
121 | 122 | ``` 123 | 第四个例子可以用【4】中的校正公式,这里代码没用。 124 | 125 | 参考 126 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 127 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 128 | 【3】https://github.com/xubo245/SparkLearning 129 | 【4】http://blog.csdn.net/wermnb/article/details/6628555 130 | 【5】http://baike.baidu.com/link?url=y1Ryc0tbOLSL4zULGihtY3gXRbJO26FvHw05cfFYZ01V87h9h2gF0Bl2su2uA52TWq4FGnPAblXLX2jQhFRK3K -------------------------------------------------------------------------------- /3分类和回归/Spark中组件Mllib的学习36之决策树(使用variance)进行回归.md: -------------------------------------------------------------------------------- 1 | 更多代码请见:https://github.com/xubo245/SparkLearning 2 | Spark中组件Mllib的学习之分类篇 3 | 1解释 4 | 决策树(使用variance)进行回归 5 | 6 | 2.代码: 7 | 8 | ``` 9 | /** 10 | * @author xubo 11 | * ref:Spark MlLib机器学习实战 12 | * more code:https://github.com/xubo245/SparkLearning 13 | * more blog:http://blog.csdn.net/xubo245 14 | */ 15 | package org.apache.spark.mllib.learning.classification 16 | 17 | import java.text.SimpleDateFormat 18 | import java.util.Date 19 | 20 | import org.apache.spark.mllib.tree.DecisionTree 21 | import org.apache.spark.mllib.tree.model.DecisionTreeModel 22 | import org.apache.spark.mllib.util.MLUtils 23 | import org.apache.spark.{SparkConf, SparkContext} 24 | 25 | /** 26 | * Created by xubo on 2016/5/23. 27 | */ 28 | object DecisionTrees4ByVarianceRegression { 29 | def main(args: Array[String]) { 30 | val conf = new SparkConf().setMaster("local[4]").setAppName(this.getClass().getSimpleName().filter(!_.equals('$'))) 31 | val sc = new SparkContext(conf) 32 | 33 | // Load and parse the data file. 34 | val data = MLUtils.loadLibSVMFile(sc, "file/data/mllib/input/classification/sample_libsvm_data.txt") 35 | 36 | // Split the data into training and test sets (30% held out for testing) 37 | val splits = data.randomSplit(Array(0.7, 0.3)) 38 | val (trainingData, testData) = (splits(0), splits(1)) 39 | 40 | // Train a DecisionTree model. 41 | // Empty categoricalFeaturesInfo indicates all features are continuous. 
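    // Note on the "variance" impurity used for regression below: a node's impurity is the
    // variance of its training labels, (1/N) * sum_i (y_i - mean)^2; each split is chosen to
    // maximize the weighted variance reduction, and every leaf predicts the mean label of the
    // training samples that reach it.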
42 | val categoricalFeaturesInfo = Map[Int, Int]() 43 | val impurity = "variance" 44 | val maxDepth = 5 45 | val maxBins = 32 46 | 47 | val model = DecisionTree.trainRegressor(trainingData, categoricalFeaturesInfo, impurity, 48 | maxDepth, maxBins) 49 | 50 | // Evaluate model on test instances and compute test error 51 | val labelsAndPredictions = testData.map { point => 52 | val prediction = model.predict(point.features) 53 | (point.label, prediction) 54 | } 55 | // println("Learned classification tree model:\n" + model.toDebugString) 56 | println("data.count:" + data.count()) 57 | println("trainingData.count:" + trainingData.count()) 58 | println("testData.count:" + testData.count()) 59 | println("model.depth:" + model.depth) 60 | println("model.numNodes:" + model.numNodes) 61 | println("model.topNode:" + model.topNode) 62 | 63 | println("labelAndPreds") 64 | labelsAndPredictions.take(10).foreach(println) 65 | 66 | val testMSE = labelsAndPredictions.map { case (v, p) => math.pow((v - p), 2) }.mean() 67 | println("Test Mean Squared Error = " + testMSE) 68 | println("Learned regression tree model:\n" + model.toDebugString) 69 | 70 | // Save and load model 71 | // val iString = new SimpleDateFormat("yyyyMMddHHmmssSSS").format(new Date()) 72 | // val path = "file/data/mllib/output/classification/DecisionTreesLearning" + iString + "/result" 73 | // model.save(sc, path) 74 | // val sameModel = DecisionTreeModel.load(sc, path) 75 | // println(sameModel.algo) 76 | sc.stop 77 | } 78 | } 79 | 80 | ``` 81 | 82 | 3.结果: 83 | 84 | ``` 85 | data.count:100 86 | trainingData.count:65 87 | testData.count:35 88 | model.depth:2 89 | model.numNodes:5 90 | model.topNode:id = 1, isLeaf = false, predict = 0.6307692307692307 (prob = -1.0), impurity = 0.23289940828402367, split = Some(Feature = 434, threshold = 0.0, featureType = Continuous, categories = List()), stats = Some(gain = 0.2181301775147929, impurity = 0.23289940828402367, left impurity = 0.0384, right impurity = 0.0) 91 | labelAndPreds 92 | (1.0,1.0) 93 | (0.0,0.0) 94 | (1.0,1.0) 95 | (0.0,0.0) 96 | (0.0,0.0) 97 | (1.0,1.0) 98 | (0.0,0.0) 99 | (0.0,0.0) 100 | (1.0,1.0) 101 | (1.0,1.0) 102 | Test Mean Squared Error = 0.0 103 | Learned regression tree model: 104 | DecisionTreeModel regressor of depth 2 with 5 nodes 105 | If (feature 434 <= 0.0) 106 | If (feature 100 <= 165.0) 107 | Predict: 0.0 108 | Else (feature 100 > 165.0) 109 | Predict: 1.0 110 | Else (feature 434 > 0.0) 111 | Predict: 1.0 112 | ``` 113 | 114 | 参考 115 | 116 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 117 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 118 | 【3】https://github.com/xubo245/SparkLearning -------------------------------------------------------------------------------- /1数据类型/Spark中组件Mllib的学习14之从文本中读取带标签的数据,生成带label的向量.md: -------------------------------------------------------------------------------- 1 | 更多代码请见:https://github.com/xubo245/SparkLearning 2 | Spark中组件Mllib的学习之基础概念篇 3 | 1解释 4 | 从文本中读取带标签的数据,生成带label的向量 5 | 6 | 2.代码: 7 | 8 | ``` 9 | /** 10 | * @author xubo 11 | * ref:Spark MlLib机器学习实战 12 | * more code:https://github.com/xubo245/SparkLearning 13 | * more blog:http://blog.csdn.net/xubo245 14 | */ 15 | package org.apache.spark.mllib.learning.basic 16 | 17 | import org.apache.spark.mllib.util.MLUtils 18 | import org.apache.spark.{SparkContext, SparkConf} 19 | 20 | /** 21 | * Created by xubo on 2016/5/23. 
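 * 格式说明(便于对照下面的数据和输出):LIBSVM 每行形如 "label index1:value1 index2:value2 ...",
 * 其中 index 从 1 开始且递增;loadLibSVMFile 读入后得到 RDD[LabeledPoint],特征为稀疏向量且下标
 * 改为从 0 开始,这也是数据中的 128:51 在输出里变成下标 127 的原因。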
22 | * 从文本中读取带标签的数据 23 | */ 24 | object LabeledPointLoadlibSVMFile { 25 | def main(args: Array[String]) { 26 | val conf = new SparkConf().setMaster("local").setAppName(this.getClass().getSimpleName().filter(!_.equals('$'))) 27 | // println(this.getClass().getSimpleName().filter(!_.equals('$'))) 28 | //设置环境变量 29 | val sc = new SparkContext(conf) 30 | 31 | val mu = MLUtils.loadLibSVMFile(sc, "file/data/mllib/input/basic/sample_libsvm_data.txt") //读取文件 32 | mu.foreach(println) //打印内容 33 | 34 | sc.stop 35 | } 36 | } 37 | 38 | ``` 39 | 数据: 40 | 一行 41 | 42 | ``` 43 | 0 128:51 129:159 130:253 131:159 132:50 155:48 156:238 157:252 158:252 159:252 160:237 182:54 183:227 184:253 185:252 186:239 187:233 188:252 189:57 190:6 208:10 209:60 210:224 211:252 212:253 213:252 214:202 215:84 216:252 217:253 218:122 236:163 237:252 238:252 239:252 240:253 241:252 242:252 243:96 244:189 245:253 246:167 263:51 264:238 265:253 266:253 267:190 268:114 269:253 270:228 271:47 272:79 273:255 274:168 290:48 291:238 292:252 293:252 294:179 295:12 296:75 297:121 298:21 301:253 302:243 303:50 317:38 318:165 319:253 320:233 321:208 322:84 329:253 330:252 331:165 344:7 345:178 346:252 347:240 348:71 349:19 350:28 357:253 358:252 359:195 372:57 373:252 374:252 375:63 385:253 386:252 387:195 400:198 401:253 402:190 413:255 414:253 415:196 427:76 428:246 429:252 430:112 441:253 442:252 443:148 455:85 456:252 457:230 458:25 467:7 468:135 469:253 470:186 471:12 483:85 484:252 485:223 494:7 495:131 496:252 497:225 498:71 511:85 512:252 513:145 521:48 522:165 523:252 524:173 539:86 540:253 541:225 548:114 549:238 550:253 551:162 567:85 568:252 569:249 570:146 571:48 572:29 573:85 574:178 575:225 576:253 577:223 578:167 579:56 595:85 596:252 597:252 598:252 599:229 600:215 601:252 602:252 603:252 604:196 605:130 623:28 624:199 625:252 626:252 627:253 628:252 629:252 630:233 631:145 652:25 653:128 654:252 655:253 656:252 657:141 658:37 44 | 。。。 45 | 46 | ``` 47 | 48 | 3.结果: 49 | 50 | ``` 51 | 
(0.0,(692,[127,128,129,130,131,154,155,156,157,158,159,181,182,183,184,185,186,187,188,189,207,208,209,210,211,212,213,214,215,216,217,235,236,237,238,239,240,241,242,243,244,245,262,263,264,265,266,267,268,269,270,271,272,273,289,290,291,292,293,294,295,296,297,300,301,302,316,317,318,319,320,321,328,329,330,343,344,345,346,347,348,349,356,357,358,371,372,373,374,384,385,386,399,400,401,412,413,414,426,427,428,429,440,441,442,454,455,456,457,466,467,468,469,470,482,483,484,493,494,495,496,497,510,511,512,520,521,522,523,538,539,540,547,548,549,550,566,567,568,569,570,571,572,573,574,575,576,577,578,594,595,596,597,598,599,600,601,602,603,604,622,623,624,625,626,627,628,629,630,651,652,653,654,655,656,657],[51.0,159.0,253.0,159.0,50.0,48.0,238.0,252.0,252.0,252.0,237.0,54.0,227.0,253.0,252.0,239.0,233.0,252.0,57.0,6.0,10.0,60.0,224.0,252.0,253.0,252.0,202.0,84.0,252.0,253.0,122.0,163.0,252.0,252.0,252.0,253.0,252.0,252.0,96.0,189.0,253.0,167.0,51.0,238.0,253.0,253.0,190.0,114.0,253.0,228.0,47.0,79.0,255.0,168.0,48.0,238.0,252.0,252.0,179.0,12.0,75.0,121.0,21.0,253.0,243.0,50.0,38.0,165.0,253.0,233.0,208.0,84.0,253.0,252.0,165.0,7.0,178.0,252.0,240.0,71.0,19.0,28.0,253.0,252.0,195.0,57.0,252.0,252.0,63.0,253.0,252.0,195.0,198.0,253.0,190.0,255.0,253.0,196.0,76.0,246.0,252.0,112.0,253.0,252.0,148.0,85.0,252.0,230.0,25.0,7.0,135.0,253.0,186.0,12.0,85.0,252.0,223.0,7.0,131.0,252.0,225.0,71.0,85.0,252.0,145.0,48.0,165.0,252.0,173.0,86.0,253.0,225.0,114.0,238.0,253.0,162.0,85.0,252.0,249.0,146.0,48.0,29.0,85.0,178.0,225.0,253.0,223.0,167.0,56.0,85.0,252.0,252.0,252.0,229.0,215.0,252.0,252.0,252.0,196.0,130.0,28.0,199.0,252.0,252.0,253.0,252.0,252.0,233.0,145.0,25.0,128.0,252.0,253.0,252.0,141.0,37.0])) 52 | 。。。 53 | ``` 54 | 55 | 参考 56 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 57 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 58 | 【3】https://github.com/xubo245/SparkLearning 59 | -------------------------------------------------------------------------------- /9评估度量/Spark中组件Mllib的学习70之对多类分类结果进行评估Multiclass classification.md: -------------------------------------------------------------------------------- 1 | 2 | 更多代码请见:https://github.com/xubo245/SparkLearning 3 | 4 | Spark中组件Mllib的学习 5 | 6 | 1.解释 7 | 8 | 对多分类结果进行评估 9 | 10 | ![](http://i.imgur.com/uB5zk5z.png) 11 | ![](http://i.imgur.com/TwJs11O.png) 12 | 13 | 14 | 15 | 2.代码: 16 | 17 | /** 18 | * @author xubo 19 | * ref:http://spark.apache.org/docs/1.5.2/mllib-guide.html 20 | * more code:https://github.com/xubo245/SparkLearning 21 | * more blog:http://blog.csdn.net/xubo245 22 | */ 23 | package org.apache.spark.mllib.EvaluationMetrics 24 | 25 | import org.apache.spark.util.SparkLearningFunSuite 26 | 27 | /** 28 | * Created by xubo on 2016/6/13. 
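  * 补充说明:MulticlassMetrics 的输入是由 (prediction, label) 组成的 RDD[(Double, Double)],
  * 与具体分类模型无关,本例用 LogisticRegressionWithLBFGS 的预测结果构造。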
29 | */ 30 | class MulticlassClassificationFunSuite extends SparkLearningFunSuite { 31 | test("testFunSuite") { 32 | 33 | 34 | import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS 35 | import org.apache.spark.mllib.evaluation.MulticlassMetrics 36 | import org.apache.spark.mllib.regression.LabeledPoint 37 | import org.apache.spark.mllib.util.MLUtils 38 | 39 | // Load training data in LIBSVM format 40 | val data = MLUtils.loadLibSVMFile(sc, "file/data/mllib/input/mllibFromSpark/sample_multiclass_classification_data.txt") 41 | 42 | // Split data into training (60%) and test (40%) 43 | val Array(training, test) = data.randomSplit(Array(0.6, 0.4), seed = 11L) 44 | training.cache() 45 | 46 | // Run training algorithm to build the model 47 | val model = new LogisticRegressionWithLBFGS() 48 | .setNumClasses(3) 49 | .run(training) 50 | 51 | // Compute raw scores on the test set 52 | val predictionAndLabels = test.map { case LabeledPoint(label, features) => 53 | val prediction = model.predict(features) 54 | (prediction, label) 55 | } 56 | 57 | // Instantiate metrics object 58 | val metrics = new MulticlassMetrics(predictionAndLabels) 59 | 60 | // Confusion matrix 61 | println("Confusion matrix:") 62 | println(metrics.confusionMatrix) 63 | 64 | // Overall Statistics 65 | val precision = metrics.precision 66 | val recall = metrics.recall // same as true positive rate 67 | val f1Score = metrics.fMeasure 68 | println("Summary Statistics") 69 | println(s"Precision = $precision") 70 | println(s"Recall = $recall") 71 | println(s"F1 Score = $f1Score") 72 | 73 | // Precision by label 74 | val labels = metrics.labels 75 | labels.foreach { l => 76 | println(s"Precision($l) = " + metrics.precision(l)) 77 | } 78 | 79 | // Recall by label 80 | labels.foreach { l => 81 | println(s"Recall($l) = " + metrics.recall(l)) 82 | } 83 | 84 | // False positive rate by label 85 | labels.foreach { l => 86 | println(s"FPR($l) = " + metrics.falsePositiveRate(l)) 87 | } 88 | 89 | // F-measure by label 90 | labels.foreach { l => 91 | println(s"F1-Score($l) = " + metrics.fMeasure(l)) 92 | } 93 | 94 | // Weighted stats 95 | println(s"Weighted precision: ${metrics.weightedPrecision}") 96 | println(s"Weighted recall: ${metrics.weightedRecall}") 97 | println(s"Weighted F1 score: ${metrics.weightedFMeasure}") 98 | println(s"Weighted false positive rate: ${metrics.weightedFalsePositiveRate}") 99 | 100 | 101 | } 102 | } 103 | 104 | 105 | 3.结果: 106 | 107 | Confusion matrix: 108 | 11.0 0.0 1.0 109 | 0.0 12.0 0.0 110 | 10.0 0.0 16.0 111 | Summary Statistics 112 | Precision = 0.78 113 | Recall = 0.78 114 | F1 Score = 0.78 115 | Precision(0.0) = 0.5238095238095238 116 | Precision(1.0) = 1.0 117 | Precision(2.0) = 0.9411764705882353 118 | Recall(0.0) = 0.9166666666666666 119 | Recall(1.0) = 1.0 120 | Recall(2.0) = 0.6153846153846154 121 | FPR(0.0) = 0.2631578947368421 122 | FPR(1.0) = 0.0 123 | FPR(2.0) = 0.041666666666666664 124 | F1-Score(0.0) = 0.6666666666666667 125 | F1-Score(1.0) = 1.0 126 | F1-Score(2.0) = 0.744186046511628 127 | Weighted precision: 0.8551260504201681 128 | Weighted recall: 0.78 129 | Weighted F1 score: 0.7869767441860466 130 | Weighted false positive rate: 0.08482456140350877 131 | 132 | 133 | 134 | 参考 135 | 136 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 137 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 138 | 【3】https://github.com/xubo245/SparkLearning 139 | 【4】book:Machine Learning with Spark ,Nick Pertreach 140 | 【5】book:Spark MlLib机器学习实战 141 | 
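附录(补充示例):

下面是一个最小示意,假设 sc 已由测试基类提供,(prediction, label) 对为手工构造而非上文数据集的结果,仅用于单独理解混淆矩阵与总体指标,不属于原流程:

    import org.apache.spark.mllib.evaluation.MulticlassMetrics

    // 3 个类别、6 条记录的 (预测值, 真实标签),其中两条预测错误
    val toyPredictionAndLabels = sc.parallelize(Seq(
      (0.0, 0.0), (1.0, 1.0), (2.0, 2.0),
      (0.0, 1.0), (2.0, 2.0), (1.0, 0.0)))
    val toyMetrics = new MulticlassMetrics(toyPredictionAndLabels)
    println(toyMetrics.confusionMatrix) // 3x3 混淆矩阵:行为真实标签,列为预测标签
    println(toyMetrics.precision)       // 总体精度,此例为 4/6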
-------------------------------------------------------------------------------- /5聚类/Spark中组件Mllib的学习48之流式k均值(Streaming kmeans).md: -------------------------------------------------------------------------------- 1 | 2 | 更多代码请见:https://github.com/xubo245/SparkLearning 3 | 4 | Spark中组件Mllib的学习 5 | 6 | 1.解释 7 | 8 | 流式K均值 9 | 10 | 当数据以流式到达,就需要动态预测分类,每当新数据到来时要更新模型。MLlib提供了流式k均值聚类,该方法使用参数来控制数据的衰减。这个算法使用mini-batch k均值更新规则的一种泛化版本。对于每一批数据,将所有点赋给最近的簇,计算新的簇中心,然后使用下面的方法更新簇: 11 | ![](http://www.fuqingchuan.com/wp-content/uploads/2015/03/111.png) 12 | 13 | 14 | 15 | 2.代码: 16 | 17 | test("accuracy for single center and equivalence to grand average") { 18 | // set parameters 19 | val numBatches = 10 20 | val numPoints = 50 21 | val k = 1 22 | val d = 5 23 | val r = 0.1 24 | 25 | // create model with one cluster 26 | val model = new StreamingKMeans() 27 | .setK(1) 28 | .setDecayFactor(1.0) 29 | .setInitialCenters(Array(Vectors.dense(0.0, 0.0, 0.0, 0.0, 0.0)), Array(0.0)) 30 | 31 | // generate random data for k-means 32 | val (input, centers) = StreamingKMeansDataGenerator(numPoints, numBatches, k, d, r, 42) 33 | 34 | // setup and run the model training 35 | ssc = setupStreams(input, (inputDStream: DStream[Vector]) => { 36 | model.trainOn(inputDStream) 37 | inputDStream.count() 38 | }) 39 | runStreams(ssc, numBatches, numBatches) 40 | 41 | // estimated center should be close to true center 42 | assert(centers(0) ~== model.latestModel().clusterCenters(0) absTol 1E-1) 43 | /** ****************add by xubo 20160613 *************/ 44 | println("model.latestModel().clusterCenters:") 45 | model.latestModel().clusterCenters.foreach(println) 46 | println("model.latestModel().clusterWeights:") 47 | model.latestModel().clusterWeights.foreach(println) 48 | 49 | /** ****************add by xubo 20160613 *************/ 50 | // estimated center from streaming should exactly match the arithmetic mean of all data points 51 | // because the decay factor is set to 1.0 52 | val grandMean = 53 | input.flatten.map(x => x.toBreeze).reduce(_ + _) / (numBatches * numPoints).toDouble 54 | assert(model.latestModel().clusterCenters(0) ~== Vectors.dense(grandMean.toArray) absTol 1E-5) 55 | /** ****************add by xubo 20160613 *************/ 56 | //println("input") 57 | //input.foreach(println) 58 | println("grandMean") 59 | grandMean.foreach(println) 60 | 61 | /** ****************add by xubo 20160613 *************/ 62 | } 63 | 64 | 65 | 66 | 3.结果: 67 | 68 | model.latestModel().clusterCenters: 69 | [-0.4725511979691583,0.9644503899125422,-1.668776373542808,1.2721254429935838,0.37815209739836425] 70 | model.latestModel().clusterWeights: 71 | 500.0 72 | grandMean 73 | -0.4725511979691581 74 | 0.9644503899125427 75 | -1.6687763735428087 76 | 1.2721254429935853 77 | 0.37815209739836464 78 | 79 | input数据量有点多,就没有放上来了。 80 | 81 | 参考 82 | 83 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 84 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 85 | 【3】https://github.com/xubo245/SparkLearning 86 | 87 | 88 | 附录: 89 | 90 | /** 91 | * @author xubo 92 | * ref:http://spark.apache.org/docs/1.5.2/mllib-guide.html 93 | * more code:https://github.com/xubo245/SparkLearning 94 | * more blog:http://blog.csdn.net/xubo245 95 | */ 96 | package org.apache.spark.mllib.clustering.kmeans 97 | 98 | import org.apache.spark.SparkConf 99 | import org.apache.spark.mllib.linalg.Vectors 100 | import org.apache.spark.mllib.regression.LabeledPoint 101 | import org.apache.spark.mllib.clustering.StreamingKMeans 102 | import 
org.apache.spark.streaming.{Seconds, StreamingContext} 103 | import org.apache.spark.util.SparkLearningFunSuite 104 | 105 | /** 106 | * Created by xubo on 2016/6/13. 107 | * 需要集群运行,目前没有运行测试 108 | */ 109 | class StreamingKmeansFromWebSuite extends SparkLearningFunSuite { 110 | test("testFunSuite") { 111 | val conf = new SparkConf() 112 | .setMaster("local[4]") 113 | .setAppName("SparkLearningTest") 114 | val ssc = new StreamingContext(conf, Seconds(1)) 115 | val trainingData = ssc.textFileStream("file/data/mllib/input/trainingDic").map(Vectors.parse) 116 | val testData = ssc.textFileStream("file/data/mllib/input/testingDic").map(LabeledPoint.parse) 117 | val numDimensions = 3 118 | val numClusters = 2 119 | val model = new StreamingKMeans() 120 | .setK(numClusters) 121 | .setDecayFactor(1.0) 122 | .setRandomCenters(numDimensions, 0.0) 123 | model.trainOn(trainingData) 124 | model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print() 125 | 126 | ssc.start() 127 | ssc.awaitTermination() 128 | } 129 | } 130 | -------------------------------------------------------------------------------- /3分类和回归/Spark中组件Mllib的学习33之决策树(使用Gini).md: -------------------------------------------------------------------------------- 1 | 更多代码请见:https://github.com/xubo245/SparkLearning 2 | Spark中组件Mllib的学习之分类篇 3 | 1解释 4 | 决策树:Decision Trees 5 | 请见【4】【5】 6 | 数据每次是随机划分,所以准确率每次不一定 7 | 8 | 2.代码: 9 | 10 | ``` 11 | /** 12 | * @author xubo 13 | * ref:Spark MlLib机器学习实战 14 | * more code:https://github.com/xubo245/SparkLearning 15 | * more blog:http://blog.csdn.net/xubo245 16 | */ 17 | package org.apache.spark.mllib.learning.classification 18 | 19 | import java.text.SimpleDateFormat 20 | import java.util.Date 21 | 22 | import org.apache.spark.mllib.tree.DecisionTree 23 | import org.apache.spark.mllib.tree.model.DecisionTreeModel 24 | import org.apache.spark.mllib.util.MLUtils 25 | import org.apache.spark.{SparkConf, SparkContext} 26 | 27 | /** 28 | * Created by xubo on 2016/5/23. 29 | */ 30 | object DecisionTreesLearning { 31 | def main(args: Array[String]) { 32 | val conf = new SparkConf().setMaster("local[4]").setAppName(this.getClass().getSimpleName().filter(!_.equals('$'))) 33 | val sc = new SparkContext(conf) 34 | 35 | // Load and parse the data file. 36 | val data = MLUtils.loadLibSVMFile(sc, "file/data/mllib/input/classification/sample_libsvm_data.txt") 37 | // Split the data into training and test sets (30% held out for testing) 38 | val splits = data.randomSplit(Array(0.7, 0.3)) 39 | val (trainingData, testData) = (splits(0), splits(1)) 40 | 41 | // Train a DecisionTree model. 42 | // Empty categoricalFeaturesInfo indicates all features are continuous. 
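    // 参数含义(补充注释):numClasses 为类别数;maxDepth 限制树深以缓解过拟合;
    // maxBins 是连续特征离散化的最大分箱数,若声明了类别特征,maxBins 不能小于其最大取值个数。
    // 下面的取值沿用官方示例,实际使用时一般需结合验证集调整(示意性说明)。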
43 | val numClasses = 2 44 | val categoricalFeaturesInfo = Map[Int, Int]() 45 | val impurity = "gini" 46 | val maxDepth = 5 47 | val maxBins = 32 48 | 49 | val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, 50 | impurity, maxDepth, maxBins) 51 | 52 | // Evaluate model on test instances and compute test error 53 | val labelAndPreds = testData.map { point => 54 | val prediction = model.predict(point.features) 55 | (point.label, prediction) 56 | } 57 | val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count() 58 | println("Test Error = " + testErr) 59 | println("Learned classification tree model:\n" + model.toDebugString) 60 | println("data.count:" + data.count()) 61 | println("trainingData.count:" + trainingData.count()) 62 | println("testData.count:" + testData.count()) 63 | println("model.depth:"+model.depth) 64 | println("model.numNodes:"+model.numNodes) 65 | println("model.topNode:"+model.topNode) 66 | 67 | println("labelAndPreds") 68 | labelAndPreds.take(30).foreach(println) 69 | // Save and load model 70 | // val iString = new SimpleDateFormat("yyyyMMddHHmmssSSS").format(new Date()) 71 | // val path = "file/data/mllib/output/classification/DecisionTreesLearning" + iString + "/result" 72 | // model.save(sc, path) 73 | // val sameModel = DecisionTreeModel.load(sc, path) 74 | // println(sameModel.algo) 75 | sc.stop 76 | } 77 | } 78 | 79 | ``` 80 | 81 | 3.结果: 82 | 83 | ``` 84 | Test Error = 0.0 85 | Learned classification tree model: 86 | DecisionTreeModel classifier of depth 2 with 5 nodes 87 | If (feature 434 <= 0.0) 88 | If (feature 100 <= 165.0) 89 | Predict: 0.0 90 | Else (feature 100 > 165.0) 91 | Predict: 1.0 92 | Else (feature 434 > 0.0) 93 | Predict: 1.0 94 | 95 | data.count:100 96 | trainingData.count:78 97 | testData.count:22 98 | model.depth:2 99 | model.numNodes:5 100 | model.topNode:id = 1, isLeaf = false, predict = 1.0 (prob = 0.5384615384615384), impurity = 0.49704142011834324, split = Some(Feature = 434, threshold = 0.0, featureType = Continuous, categories = List()), stats = Some(gain = 0.47209339517031834, impurity = 0.49704142011834324, left impurity = 0.05259313367421467, right impurity = 0.0) 101 | labelAndPreds 102 | (0.0,0.0) 103 | (0.0,0.0) 104 | (1.0,1.0) 105 | (1.0,1.0) 106 | (1.0,1.0) 107 | (0.0,0.0) 108 | (1.0,1.0) 109 | (0.0,0.0) 110 | (1.0,1.0) 111 | (0.0,0.0) 112 | (1.0,1.0) 113 | (1.0,1.0) 114 | (0.0,0.0) 115 | (1.0,1.0) 116 | (1.0,1.0) 117 | (1.0,1.0) 118 | (0.0,0.0) 119 | (1.0,1.0) 120 | (1.0,1.0) 121 | (1.0,1.0) 122 | (1.0,1.0) 123 | (1.0,1.0) 124 | ``` 125 | 126 | 参考 127 | 128 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 129 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 130 | 【3】https://github.com/xubo245/SparkLearning 131 | 【4】http://blog.csdn.net/dark_scope/article/details/13168827 132 | 【5】http://spark.apache.org/docs/1.5.2/mllib-decision-tree.html -------------------------------------------------------------------------------- /3分类和回归/Spark中组件Mllib的学习32之朴素贝叶斯分类器(伯努利朴素贝叶斯)_.md: -------------------------------------------------------------------------------- 1 | 更多代码请见:https://github.com/xubo245/SparkLearning 2 | Spark中组件Mllib的学习之分类篇 3 | 1解释 4 | 5 | (1) 朴素贝叶斯分类器种类 6 | 7 | 在把训练集中的每个文档向量化的过程中,存在两个模型。一个是统计词在文档中出现的次数(多项式模型);一个是统计词是否在文档中出现过(柏努利模型) 8 | 目前mllib只支持多项式朴素贝叶斯和伯努利贝叶斯(spark-1.5.2),不支持高斯朴素贝叶斯。 9 | 10 | 根据: 11 | 12 | ``` 13 | /** 14 | * Trains a Naive Bayes model given an RDD of `(label, features)` pairs. 
15 | * 16 | * This is the Multinomial NB ([[http://tinyurl.com/lsdw6p]]) which can handle all kinds of 17 | * discrete data. For example, by converting documents into TF-IDF vectors, it can be used for 18 | * document classification. By making every vector a 0-1 vector, it can also be used as 19 | * Bernoulli NB ([[http://tinyurl.com/p7c96j6]]). The input feature values must be nonnegative. 20 | */ 21 | @Since("0.9.0") 22 | class NaiveBayes private ( 23 | private var lambda: Double, 24 | private var modelType: String) extends Serializable with Logging { 25 | 26 | import NaiveBayes.{Bernoulli, Multinomial} 27 | 28 | @Since("1.4.0") 29 | def this(lambda: Double) = this(lambda, NaiveBayes.Multinomial) 30 | 31 | ``` 32 | 33 | 三种朴素贝叶斯分类器都在【4】中有提到 34 | 35 | (2)伯努利贝叶斯分类器 36 | ![这里写图片描述](http://img.blog.csdn.net/20160525103109299) 37 | 参考【5】 38 | 39 | 2.代码: 40 | 41 | ``` 42 | /** 43 | * @author xubo 44 | * ref:Spark MlLib机器学习实战 45 | * more code:https://github.com/xubo245/SparkLearning 46 | * more blog:http://blog.csdn.net/xubo245 47 | */ 48 | package org.apache.spark.mllib.learning.classification 49 | 50 | import java.text.SimpleDateFormat 51 | import java.util.Date 52 | 53 | import org.apache.spark.mllib.classification.NaiveBayes._ 54 | import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel} 55 | import org.apache.spark.mllib.linalg.Vectors 56 | import org.apache.spark.mllib.regression.LabeledPoint 57 | import org.apache.spark.{SparkException, SparkConf, SparkContext} 58 | 59 | /** 60 | * Created by xubo on 2016/5/23. 61 | * From:NaiveBayesSuite.scala in spark 1.5.2 sources 62 | * another examples:NaiveBayesSuite test("Naive Bayes Bernoulli") 63 | */ 64 | object BernoulliNaiveBayesLearning { 65 | def main(args: Array[String]) { 66 | val conf = new SparkConf().setMaster("local[4]").setAppName(this.getClass().getSimpleName().filter(!_.equals('$'))) 67 | val sc = new SparkContext(conf) 68 | 69 | val badTrain = Seq( 70 | LabeledPoint(1.0, Vectors.dense(1.0)), 71 | // LabeledPoint(0.0, Vectors.dense(2.0)), 72 | LabeledPoint(1.0, Vectors.dense(1.0)), 73 | LabeledPoint(1.0, Vectors.dense(0.0))) 74 | 75 | 76 | val model1 = NaiveBayes.train(sc.makeRDD(badTrain, 2), 1.0, Bernoulli) 77 | println("model1:") 78 | println(model1) 79 | sc.makeRDD(badTrain, 2).foreach(println) 80 | 81 | val okTrain = Seq( 82 | LabeledPoint(1.0, Vectors.dense(1.0)), 83 | LabeledPoint(0.0, Vectors.dense(0.0)), 84 | LabeledPoint(1.0, Vectors.dense(1.0)), 85 | LabeledPoint(1.0, Vectors.dense(1.0)), 86 | LabeledPoint(0.0, Vectors.dense(0.0)), 87 | LabeledPoint(1.0, Vectors.dense(1.0)), 88 | LabeledPoint(1.0, Vectors.dense(1.0)) 89 | ) 90 | 91 | val badPredict = Seq( 92 | Vectors.dense(1.0), 93 | // Vectors.dense(2.0), 94 | Vectors.dense(1.0), 95 | Vectors.dense(0.0)) 96 | 97 | val model = NaiveBayes.train(sc.makeRDD(okTrain, 2), 1.0, Bernoulli) 98 | // intercept[SparkException] { 99 | val pre2 = model.predict(sc.makeRDD(badPredict, 2)).collect() 100 | // } 101 | println("model2:") 102 | sc.makeRDD(okTrain, 2).foreach(println) 103 | println("predict data:") 104 | sc.makeRDD(badPredict, 2).foreach(println) 105 | println(model) 106 | println("predict result:") 107 | pre2.foreach(println) 108 | 109 | sc.stop 110 | } 111 | } 112 | 113 | ``` 114 | 115 | 3.结果: 116 | 117 | ``` 118 | model1: 119 | org.apache.spark.mllib.classification.NaiveBayesModel@79d63340 120 | (1.0,[1.0]) 121 | (1.0,[1.0]) 122 | (1.0,[0.0]) 123 | model2: 124 | (1.0,[1.0]) 125 | (0.0,[0.0]) 126 | (1.0,[1.0]) 127 | (1.0,[1.0]) 128 | (0.0,[0.0]) 129 | 
(1.0,[1.0]) 130 | (1.0,[1.0]) 131 | predict data: 132 | [1.0] 133 | [0.0] 134 | [1.0] 135 | org.apache.spark.mllib.classification.NaiveBayesModel@3eda0bed 136 | predict result: 137 | 1.0 138 | 1.0 139 | 0.0 140 | ``` 141 | 142 | 参考 143 | 144 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 145 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 146 | 【3】https://github.com/xubo245/SparkLearning 147 | 【4】http://www.letiantian.me/2014-10-12-three-models-of-naive-nayes/ 148 | 【5】http://blog.csdn.net/xlinsist/article/details/51264829 -------------------------------------------------------------------------------- /3分类和回归/Spark中组件Mllib的学习31之朴素贝叶斯分类器(多项式朴素贝叶斯).md: -------------------------------------------------------------------------------- 1 | 更多代码请见:https://github.com/xubo245/SparkLearning 2 | Spark中组件Mllib的学习之分类篇 3 | 1 解释 4 | (1) 贝叶斯: 5 | ![这里写图片描述](http://img.blog.csdn.net/20160525100938165) 6 | 7 | 推广: 8 | ![这里写图片描述](http://img.blog.csdn.net/20160525100949310) 9 | 10 | (2)朴素贝叶斯: 11 | 12 | 为了简化计算,朴素贝叶斯算法做了一假设:“朴素的认为各个特征相互独立”。这么一来,上式的分子就简化成了: 13 | 14 | P(C)*P(F1|C)*P(F2|C)...P(Fn|C)。 15 | 16 | 这样简化过后,计算起来就方便多了。 17 | 18 | 朴素贝叶斯分类器 naive Bayes有: 19 | 20 | 多项式朴素贝叶斯( multinomial naive Bayes )和伯努利朴素贝叶斯( Bernoulli naive Bayes) 21 | 要注意的是,MultinomialNB这个分类器以出现次数作为特征值,使用的TF-IDF也能符合这类分布。 22 | 其他的朴素贝叶斯分类器如GaussianNB适用于高斯分布(正态分布)的特征,而BernoulliNB适用于伯努利分布(二值分布)的特征。 23 | 24 | 贝叶斯理论是处理不确定性信息的重要工具。作为一种不确定性推理方法,它基于概率和统计理论,具有坚实的数学基础,贝叶斯网络在处理不确定信息的智能化系统中已经得到了广泛的应用,并且成功地用于医疗诊断、统计决策、专家系统等领域。这些成功的应用,充分说明了贝叶斯技术是一种强有力的不确定性推理方法。贝叶斯分类器分为两种:一种是朴素贝叶斯分类器,另一种贝叶斯网分类器。 25 | 26 | 朴素贝叶斯分类器是一种有监督的学习方法,其假定一个属性的值对给定类的影响而独立于其他属性值,此限制条件较强,现实中往往不能满足,但是朴素贝叶斯分类器取得了较大的成功,表现出高精度和高效率,具有最小的误分类率,耗时开销小的特征。贝叶斯网分类器是一种有向无环图模型,能够表示属性集间的因果依赖。通过提供图形化的方法来表示知识,以条件概率分布表表示属性依赖关系的强弱,将先验信息和样本知识有机结合起来;通过贝叶斯概率对某一事件未来可能发生的概率进行估计,克服了基于规则的系统所具有的许多概念和计算上的困难。其优点是具有很强的学习和推理能力,能够很好地利用先验知识,缺点是对发生频率较低的事件预测效果不好,且推理与学习过程是NP—Hard的。 27 | 28 | 具体请看参考【4】【5】 29 | 30 | 31 | 32 | 2.代码: 33 | 34 | ``` 35 | /** 36 | * @author xubo 37 | * ref:Spark MlLib机器学习实战 38 | * more code:https://github.com/xubo245/SparkLearning 39 | * more blog:http://blog.csdn.net/xubo245 40 | */ 41 | package org.apache.spark.mllib.learning.classification 42 | 43 | import java.text.SimpleDateFormat 44 | import java.util.Date 45 | 46 | import org.apache.spark.mllib.classification.{LogisticRegressionModel, NaiveBayes, NaiveBayesModel} 47 | import org.apache.spark.mllib.linalg.Vectors 48 | import org.apache.spark.mllib.regression.LabeledPoint 49 | import org.apache.spark.{SparkConf, SparkContext} 50 | 51 | /** 52 | * Created by xubo on 2016/5/23. 53 | */ 54 | object NaiveBayesLearning { 55 | def main(args: Array[String]) { 56 | val conf = new SparkConf().setMaster("local[4]").setAppName(this.getClass().getSimpleName().filter(!_.equals('$'))) 57 | val sc = new SparkContext(conf) 58 | 59 | 60 | val data = sc.textFile("file/data/mllib/input/classification/sample_naive_bayes_data.txt") 61 | val parsedData = data.map { line => 62 | val parts = line.split(',') 63 | LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble))) 64 | } 65 | // Split data into training (60%) and test (40%). 
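    // 补充示例(示意,非原代码):可先抽样打印,确认 "label,f1 f2 f3" 格式被正确解析为 LabeledPoint
    // parsedData.take(3).foreach(println)
    // randomSplit 在给定 seed(如下文的 11L)时划分结果可复现,便于对比实验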
66 | val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L) 67 | val training = splits(0) 68 | val test = splits(1) 69 | 70 | val model = NaiveBayes.train(training, lambda = 1.0, modelType = "multinomial") 71 | 72 | val predictionAndLabel = test.map(p => (model.predict(p.features), p.label)) 73 | val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / test.count() 74 | 75 | 76 | println("result:") 77 | println("training.count:" + training.count()) 78 | println("test.count:" + test.count()) 79 | println("model.modelType:" + model.modelType) 80 | println("accuracy:" + accuracy) 81 | predictionAndLabel.take(10).foreach(println) 82 | // model. 83 | 84 | // Save and load model 85 | val iString = new SimpleDateFormat("yyyyMMddHHmmssSSS").format(new Date()) 86 | val path = "file/data/mllib/output/classification/NaiveBayesModel" + iString + "/result" 87 | model.save(sc, path) 88 | val sameModel = NaiveBayesModel.load(sc, path) 89 | println(sameModel.modelType) 90 | 91 | println("end") 92 | // model.save(sc, "myModelPath") 93 | // val sameModel = NaiveBayesModel.load(sc, "myModelPath") 94 | 95 | sc.stop 96 | } 97 | } 98 | 99 | ``` 100 | 101 | 3.结果: 102 | 103 | ``` 104 | result: 105 | training.count:10 106 | test.count:2 107 | model.modelType:multinomial 108 | accuracy:1.0 109 | (1.0,1.0) 110 | (2.0,2.0) 111 | SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". 112 | SLF4J: Defaulting to no-operation (NOP) logger implementation 113 | SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. 114 | 2016-05-24 23:00:45 WARN ParquetRecordReader:193 - Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl 115 | multinomial 116 | end 117 | ``` 118 | 准确率为1 119 | 120 | 121 | 参考 122 | 123 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 124 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 125 | 【3】https://github.com/xubo245/SparkLearning 126 | 【4】http://blog.csdn.net/sulliy/article/details/6629201 127 | 【5】http://blog.csdn.net/lsldd/article/details/41542107 -------------------------------------------------------------------------------- /3分类和回归/Spark中组件Mllib的学习37之随机森林(Gini)进行分类.md: -------------------------------------------------------------------------------- 1 | 更多代码请见:https://github.com/xubo245/SparkLearning 2 | Spark中组件Mllib的学习之分类篇 3 | 1解释 4 | 随机森林:RandomForest 5 | 大概思想就是生成多个决策树,都单独训练;如果来了一个数据,用各个决策树进行回归预测,如果是非连续结果,则取最多个数的值;如果连续,则取多个决策树结果的平均值。 6 | 7 | 8 | 2.代码: 9 | 10 | ``` 11 | /** 12 | * @author xubo 13 | * ref:Spark MlLib机器学习实战 14 | * more code:https://github.com/xubo245/SparkLearning 15 | * more blog:http://blog.csdn.net/xubo245 16 | */ 17 | package org.apache.spark.mllib.learning.classification 18 | 19 | import java.text.SimpleDateFormat 20 | import java.util.Date 21 | 22 | import org.apache.spark.mllib.tree.RandomForest 23 | import org.apache.spark.mllib.tree.model.RandomForestModel 24 | import org.apache.spark.mllib.util.MLUtils 25 | import org.apache.spark.{SparkConf, SparkContext} 26 | 27 | /** 28 | * Created by xubo on 2016/5/23. 29 | */ 30 | object RandomForest2Spark { 31 | def main(args: Array[String]) { 32 | val conf = new SparkConf().setMaster("local[4]").setAppName(this.getClass().getSimpleName().filter(!_.equals('$'))) 33 | val sc = new SparkContext(conf) 34 | 35 | // Load and parse the data file. 
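    // 补充注释:loadLibSVMFile 返回 RDD[LabeledPoint],标签与稀疏特征向量已解析完成,
    // 可直接作为 RandomForest.trainClassifier 的输入(示意性说明)。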
36 | val data = MLUtils.loadLibSVMFile(sc, "file/data/mllib/input/classification/sample_libsvm_data.txt") 37 | 38 | // Split the data into training and test sets (30% held out for testing) 39 | val splits = data.randomSplit(Array(0.7, 0.3)) 40 | val (trainingData, testData) = (splits(0), splits(1)) 41 | 42 | // Train a RandomForest model. 43 | // Empty categoricalFeaturesInfo indicates all features are continuous. 44 | val numClasses = 2 45 | val categoricalFeaturesInfo = Map[Int, Int]() 46 | val numTrees = 3 // Use more in practice. 47 | val featureSubsetStrategy = "auto" // Let the algorithm choose. 48 | val impurity = "gini" 49 | val maxDepth = 4 50 | val maxBins = 32 51 | 52 | val model = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, 53 | numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins) 54 | 55 | // Evaluate model on test instances and compute test error 56 | val labelAndPreds = testData.map { point => 57 | val prediction = model.predict(point.features) 58 | (point.label, prediction) 59 | } 60 | val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count() 61 | println("Test Error = " + testErr) 62 | println("Learned classification forest model:\n" + model.toDebugString) 63 | 64 | 65 | // println("Learned classification tree model:\n" + model.toDebugString) 66 | println("data.count:" + data.count()) 67 | println("trainingData.count:" + trainingData.count()) 68 | println("testData.count:" + testData.count()) 69 | println("model.algo:" + model.algo) 70 | println("model.trees:" + model.trees) 71 | 72 | println("labelAndPreds") 73 | labelAndPreds.take(10).foreach(println) 74 | 75 | // Save and load model 76 | // val iString = new SimpleDateFormat("yyyyMMddHHmmssSSS").format(new Date()) 77 | // val path = "file/data/mllib/output/classification/RandomForestModel" + iString + "/result" 78 | // model.save(sc, path) 79 | // val sameModel = RandomForestModel.load(sc, path) 80 | // println(sameModel.algo) 81 | sc.stop 82 | } 83 | } 84 | 85 | ``` 86 | 87 | 3.结果: 88 | 89 | ``` 90 | Test Error = 0.04 91 | Learned classification forest model: 92 | TreeEnsembleModel classifier with 3 trees 93 | 94 | Tree 0: 95 | If (feature 511 <= 0.0) 96 | If (feature 434 <= 0.0) 97 | Predict: 0.0 98 | Else (feature 434 > 0.0) 99 | Predict: 1.0 100 | Else (feature 511 > 0.0) 101 | Predict: 0.0 102 | Tree 1: 103 | If (feature 490 <= 31.0) 104 | Predict: 0.0 105 | Else (feature 490 > 31.0) 106 | Predict: 1.0 107 | Tree 2: 108 | If (feature 302 <= 0.0) 109 | If (feature 461 <= 0.0) 110 | If (feature 208 <= 107.0) 111 | Predict: 1.0 112 | Else (feature 208 > 107.0) 113 | Predict: 0.0 114 | Else (feature 461 > 0.0) 115 | Predict: 1.0 116 | Else (feature 302 > 0.0) 117 | Predict: 0.0 118 | 119 | data.count:100 120 | trainingData.count:75 121 | testData.count:25 122 | model.algo:Classification 123 | model.trees:[Lorg.apache.spark.mllib.tree.model.DecisionTreeModel;@753c93d5 124 | labelAndPreds 125 | (1.0,1.0) 126 | (1.0,0.0) 127 | (0.0,0.0) 128 | (0.0,0.0) 129 | (1.0,1.0) 130 | (0.0,0.0) 131 | (1.0,1.0) 132 | (1.0,1.0) 133 | (1.0,1.0) 134 | (0.0,0.0) 135 | ``` 136 | 137 | 参考 138 | 139 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 140 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 141 | 【3】https://github.com/xubo245/SparkLearning -------------------------------------------------------------------------------- /3分类和回归/Spark中组件Mllib的学习23之随机梯度下降(SGD).md: 
-------------------------------------------------------------------------------- 1 | 更多代码请见:https://github.com/xubo245/SparkLearning 2 | Spark中组件Mllib的学习之回归分析篇 3 | 1解释 4 | SGD(Stochastic Gradient Descent-随机梯度下降) 5 | 6 | ![这里写图片描述](http://img.blog.csdn.net/20160524165035690) 7 | 8 | sgd解决了梯度下降的两个问题: 收敛速度慢和陷入局部最优。 具体的介绍请见【4】、【5】和【6】 9 | 10 | 背景: 11 | 12 | 梯度下降法的缺点是: 13 | 靠近极小值时速度减慢。 14 | 直线搜索可能会产生一些问题。 15 | 可能会'之字型'地下降。 16 | 17 | ``` 18 | 随机梯度下降法stochastic gradient descent,也叫增量梯度下降 19 | 由于梯度下降法收敛速度慢,而随机梯度下降法会快很多 20 | 21 | –根据某个单独样例的误差增量计算权值更新,得到近似的梯度下降搜索(随机取一个样例) 22 | 23 | –可以看作为每个单独的训练样例定义不同的误差函数 24 | 25 | –在迭代所有训练样例时,这些权值更新的序列给出了对于原来误差函数的梯度下降的一个合理近似 26 | 27 | –通过使下降速率的值足够小,可以使随机梯度下降以任意程度接近于真实梯度下降 28 | 29 | •标准梯度下降和随机梯度下降之间的关键区别 30 | 31 | –标准梯度下降是在权值更新前对所有样例汇总误差,而随机梯度下降的权值是通过考查某个训练样例来更新的 32 | 33 | –在标准梯度下降中,权值更新的每一步对多个样例求和,需要更多的计算 34 | 35 | –标准梯度下降,由于使用真正的梯度,标准梯度下降对于每一次权值更新经常使用比随机梯度下降大的步长 36 | 37 | –如果标准误差曲面有多个局部极小值,随机梯度下降有时可能避免陷入这些局部极小值中 38 | ``` 39 | 40 | 2.代码: 41 | 42 | ``` 43 | /** 44 | * @author xubo 45 | * ref:Spark MlLib机器学习实战 46 | * more code:https://github.com/xubo245/SparkLearning 47 | * more blog:http://blog.csdn.net/xubo245 48 | */ 49 | package org.apache.spark.mllib.learning.regression 50 | 51 | import org.apache.spark.{SparkConf, SparkContext} 52 | 53 | import scala.collection.mutable.HashMap 54 | 55 | /** 56 | * Created by xubo on 2016/5/23. 57 | */ 58 | object SGDLearning { 59 | val data = HashMap[Int, Int]() 60 | 61 | //创建数据集 62 | def getData(): HashMap[Int, Int] = { 63 | //生成数据集内容 64 | for (i <- 1 to 50) { 65 | //创建50个数据 66 | data += (i -> (20 * i)) //写入公式y=2x 67 | } 68 | data //返回数据集 69 | } 70 | 71 | var θ: Double = 0 72 | //第一步假设θ为0 73 | var α: Double = 0.1 //设置步进系数 74 | 75 | def sgd(x: Double, y: Double) = { 76 | //设置迭代公式 77 | θ = θ - α * ((θ * x) - y) //迭代公式 78 | } 79 | 80 | def main(args: Array[String]) { 81 | val dataSource = getData() //获取数据集 82 | println("data:") 83 | dataSource.foreach(each => print(each + " ")) 84 | println("\nresult:") 85 | var num = 1; 86 | dataSource.foreach(myMap => { 87 | //开始迭代 88 | println(num + ":" + θ+" ("+myMap._1+","+myMap._2+")") 89 | sgd(myMap._1, myMap._2) //输入数据 90 | num = num + 1; 91 | }) 92 | println("最终结果θ值为 " + θ) //显示结果 93 | } 94 | } 95 | 96 | ``` 97 | 98 | 3.结果: 99 | 100 | ``` 101 | data: 102 | (23,460) (50,1000) (32,640) (41,820) (17,340) (8,160) (35,700) (44,880) (26,520) (11,220) (29,580) (38,760) (47,940) (20,400) (2,40) (5,100) (14,280) (46,920) (40,800) (49,980) (4,80) (13,260) (22,440) (31,620) (16,320) (7,140) (43,860) (25,500) (34,680) (10,200) (37,740) (1,20) (19,380) (28,560) (45,900) (27,540) (36,720) (18,360) (9,180) (21,420) (48,960) (3,60) (12,240) (30,600) (39,780) (15,300) (42,840) (24,480) (6,120) (33,660) 103 | result: 104 | 1:0.0 (23,460) 105 | 2:46.0 (50,1000) 106 | 3:-84.0 (32,640) 107 | 4:248.8 (41,820) 108 | 5:-689.2800000000002 (17,340) 109 | 6:516.4960000000003 (8,160) 110 | 7:119.29920000000004 (35,700) 111 | 8:-228.24800000000016 (44,880) 112 | 9:864.0432000000006 (26,520) 113 | 10:-1330.469120000001 (11,220) 114 | 11:155.04691200000025 (29,580) 115 | 12:-236.58913280000047 (38,760) 116 | 13:738.4495718400013 (47,940) 117 | 14:-2638.263415808005 (20,400) 118 | 15:2678.263415808006 (2,40) 119 | 16:2146.610732646405 (5,100) 120 | 17:1083.3053663232024 (14,280) 121 | 18:-405.3221465292811 (46,920) 122 | 19:1551.159727505412 (40,800) 123 | 20:-4573.4791825162365 (49,980) 124 | 21:17934.568811813326 (4,80) 125 | 22:10768.741287087996 (13,260) 126 | 23:-3204.6223861264007 (22,440) 127 | 
24:3889.546863351681 (31,620) 128 | 25:-8106.04841303853 (16,320) 129 | 26:4895.6290478231185 (7,140) 130 | 27:1482.6887143469353 (43,860) 131 | 28:-4806.872757344887 (25,500) 132 | 29:7260.309136017331 (34,680) 133 | 30:-17356.741926441595 (10,200) 134 | 31:20.0 (37,740) 135 | 32:20.0 (1,20) 136 | 33:20.0 (19,380) 137 | 34:20.0 (28,560) 138 | 35:20.0 (45,900) 139 | 36:20.0 (27,540) 140 | 37:20.0 (36,720) 141 | 38:20.0 (18,360) 142 | 39:20.0 (9,180) 143 | 40:20.0 (21,420) 144 | 41:20.0 (48,960) 145 | 42:20.0 (3,60) 146 | 43:20.0 (12,240) 147 | 44:20.0 (30,600) 148 | 45:20.0 (39,780) 149 | 46:20.0 (15,300) 150 | 47:20.0 (42,840) 151 | 48:20.0 (24,480) 152 | 49:20.0 (6,120) 153 | 50:20.0 (33,660) 154 | 最终结果θ值为 20.0 155 | ``` 156 | 157 | 分析: 158 | 当α为0.1的时候,一般30次计算就计算出来了;如果是0.5,一般15次计算就有正确结果 。如果是1,则50次都没有结果 159 | 160 | 参考 161 | 162 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 163 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 164 | 【3】https://github.com/xubo245/SparkLearning 165 | 【4】Spark MlLib机器学习实战 166 | 【5】http://blog.csdn.net/zbc1090549839/article/details/38149561 167 | 【6】http://blog.csdn.net/woxincd/article/details/7040944 168 | -------------------------------------------------------------------------------- /8频繁项挖掘/Spark中组件Mllib的学习66之FP-growth.md: -------------------------------------------------------------------------------- 1 | 2 | 更多代码请见:https://github.com/xubo245/SparkLearning 3 | 4 | Spark中组件Mllib的学习 5 | 6 | 1.解释 7 | 8 | FP-growth是数据挖掘里面很常见的一个算法 9 | 10 | 11 | 2.代码: 12 | 13 | /** 14 | * @author xubo 15 | * ref:http://spark.apache.org/docs/1.5.2/mllib-guide.html 16 | * more code:https://github.com/xubo245/SparkLearning 17 | * more blog:http://blog.csdn.net/xubo245 18 | */ 19 | package org.apache.spark.mllib.FrequentPatternMining 20 | 21 | import org.apache.spark.util.SparkLearningFunSuite 22 | 23 | /** 24 | * Created by xubo on 2016/6/13. 
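  * 参数含义(补充说明):setMinSupport(0.2) 为频繁项集的最小支持度(出现比例),
  * setNumPartitions(10) 控制并行分区数,generateAssociationRules(minConfidence) 的参数为规则最小置信度。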
25 | */ 26 | class FPgroupFunSuite extends SparkLearningFunSuite { 27 | test("testFunSuite") { 28 | import org.apache.spark.rdd.RDD 29 | import org.apache.spark.mllib.fpm.FPGrowth 30 | 31 | val data = sc.textFile("file/data/mllib/input/mllibFromSpark/sample_fpgrowth.txt") 32 | 33 | val transactions: RDD[Array[String]] = data.map(s => s.trim.split(' ')) 34 | 35 | val fpg = new FPGrowth() 36 | .setMinSupport(0.2) 37 | .setNumPartitions(10) 38 | val model = fpg.run(transactions) 39 | 40 | model.freqItemsets.collect().foreach { itemset => 41 | println(itemset.items.mkString("[", ",", "]") + ", " + itemset.freq) 42 | } 43 | 44 | val minConfidence = 0.8 45 | model.generateAssociationRules(minConfidence).collect().foreach { rule => 46 | println( 47 | rule.antecedent.mkString("[", ",", "]") 48 | + " => " + rule.consequent.mkString("[", ",", "]") 49 | + ", " + rule.confidence) 50 | } 51 | } 52 | } 53 | 54 | 55 | 3.结果: 56 | 57 | [z], 5 58 | [x], 4 59 | [x,z], 3 60 | [y], 3 61 | [y,x], 3 62 | [y,x,z], 3 63 | [y,z], 3 64 | [r], 3 65 | [r,x], 2 66 | [r,z], 2 67 | [s], 3 68 | [s,y], 2 69 | [s,y,x], 2 70 | [s,y,x,z], 2 71 | [s,y,z], 2 72 | [s,x], 3 73 | [s,x,z], 2 74 | [s,z], 2 75 | [t], 3 76 | [t,y], 3 77 | [t,y,x], 3 78 | [t,y,x,z], 3 79 | [t,y,z], 3 80 | [t,s], 2 81 | [t,s,y], 2 82 | [t,s,y,x], 2 83 | [t,s,y,x,z], 2 84 | [t,s,y,z], 2 85 | [t,s,x], 2 86 | [t,s,x,z], 2 87 | [t,s,z], 2 88 | [t,x], 3 89 | [t,x,z], 3 90 | [t,z], 3 91 | [p], 2 92 | [p,r], 2 93 | [p,r,z], 2 94 | [p,z], 2 95 | [q], 2 96 | [q,y], 2 97 | [q,y,x], 2 98 | [q,y,x,z], 2 99 | [q,y,z], 2 100 | [q,t], 2 101 | [q,t,y], 2 102 | [q,t,y,x], 2 103 | [q,t,y,x,z], 2 104 | [q,t,y,z], 2 105 | [q,t,x], 2 106 | [q,t,x,z], 2 107 | [q,t,z], 2 108 | [q,x], 2 109 | [q,x,z], 2 110 | [q,z], 2 111 | [t,s,y] => [x], 1.0 112 | [t,s,y] => [z], 1.0 113 | [y,x,z] => [t], 1.0 114 | [y] => [x], 1.0 115 | [y] => [z], 1.0 116 | [y] => [t], 1.0 117 | [p] => [r], 1.0 118 | [p] => [z], 1.0 119 | [q,t,z] => [y], 1.0 120 | [q,t,z] => [x], 1.0 121 | [q,y] => [x], 1.0 122 | [q,y] => [z], 1.0 123 | [q,y] => [t], 1.0 124 | [t,s,x] => [y], 1.0 125 | [t,s,x] => [z], 1.0 126 | [q,t,y,z] => [x], 1.0 127 | [q,t,x,z] => [y], 1.0 128 | [q,x] => [y], 1.0 129 | [q,x] => [t], 1.0 130 | [q,x] => [z], 1.0 131 | [t,x,z] => [y], 1.0 132 | [x,z] => [y], 1.0 133 | [x,z] => [t], 1.0 134 | [p,z] => [r], 1.0 135 | [t] => [y], 1.0 136 | [t] => [x], 1.0 137 | [t] => [z], 1.0 138 | [y,z] => [x], 1.0 139 | [y,z] => [t], 1.0 140 | [p,r] => [z], 1.0 141 | [t,s] => [y], 1.0 142 | [t,s] => [x], 1.0 143 | [t,s] => [z], 1.0 144 | [q,z] => [y], 1.0 145 | [q,z] => [t], 1.0 146 | [q,z] => [x], 1.0 147 | [q,y,z] => [x], 1.0 148 | [q,y,z] => [t], 1.0 149 | [y,x] => [z], 1.0 150 | [y,x] => [t], 1.0 151 | [q,x,z] => [y], 1.0 152 | [q,x,z] => [t], 1.0 153 | [t,y,z] => [x], 1.0 154 | [q,y,x] => [z], 1.0 155 | [q,y,x] => [t], 1.0 156 | [q,t,y,x] => [z], 1.0 157 | [t,s,x,z] => [y], 1.0 158 | [s,y,x] => [z], 1.0 159 | [s,y,x] => [t], 1.0 160 | [s,x,z] => [y], 1.0 161 | [s,x,z] => [t], 1.0 162 | [q,y,x,z] => [t], 1.0 163 | [s,y] => [x], 1.0 164 | [s,y] => [z], 1.0 165 | [s,y] => [t], 1.0 166 | [q,t,y] => [x], 1.0 167 | [q,t,y] => [z], 1.0 168 | [t,y] => [x], 1.0 169 | [t,y] => [z], 1.0 170 | [t,z] => [y], 1.0 171 | [t,z] => [x], 1.0 172 | [t,s,y,x] => [z], 1.0 173 | [t,y,x] => [z], 1.0 174 | [q,t] => [y], 1.0 175 | [q,t] => [x], 1.0 176 | [q,t] => [z], 1.0 177 | [q] => [y], 1.0 178 | [q] => [t], 1.0 179 | [q] => [x], 1.0 180 | [q] => [z], 1.0 181 | [t,s,z] => [y], 1.0 182 | [t,s,z] => [x], 1.0 183 | [t,x] => 
[y], 1.0 184 | [t,x] => [z], 1.0 185 | [s,z] => [y], 1.0 186 | [s,z] => [x], 1.0 187 | [s,z] => [t], 1.0 188 | [s,y,x,z] => [t], 1.0 189 | [s] => [x], 1.0 190 | [t,s,y,z] => [x], 1.0 191 | [s,y,z] => [x], 1.0 192 | [s,y,z] => [t], 1.0 193 | [q,t,x] => [y], 1.0 194 | [q,t,x] => [z], 1.0 195 | [r,z] => [p], 1.0 196 | 197 | 198 | 199 | 参考 200 | 201 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 202 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 203 | 【3】https://github.com/xubo245/SparkLearning 204 | 【4】book:Machine Learning with Spark ,Nick Pertreach 205 | 【5】book:Spark MlLib机器学习实战 206 | -------------------------------------------------------------------------------- /2基本统计/Spark中组件Mllib的学习19之分层抽样.md: -------------------------------------------------------------------------------- 1 | 更多代码请见:https://github.com/xubo245/SparkLearning 2 | Spark中组件Mllib的学习之基础概念篇 3 | 1解释 4 | 分层抽样的概念就不讲了,具体的操作: 5 | RDD有个操作可以直接进行抽样:sampleByKey和sample等,这里主要介绍这两个 6 | (1)将字符串长度为2划分为层1和层2,对层1和层2按不同的概率进行抽样 7 | 数据 8 | 9 | ``` 10 | aa 11 | bb 12 | cc 13 | dd 14 | ee 15 | aaa 16 | bbb 17 | ccc 18 | ddd 19 | eee 20 | ``` 21 | 比如: 22 | val fractions: Map[Int, Double] = (List((1, 0.2), (2, 0.8))).toMap //设定抽样格式 23 | sampleByKey(withReplacement = false, fractions, 0) 24 | fractions表示在层1抽0.2,在层2中抽0.8 25 | withReplacement false表示不重复抽样 26 | 0表示随机的seed 27 | 28 | 源码: 29 | 30 | ``` 31 | /** 32 | * Return a subset of this RDD sampled by key (via stratified sampling). 33 | * 34 | * Create a sample of this RDD using variable sampling rates for different keys as specified by 35 | * `fractions`, a key to sampling rate map, via simple random sampling with one pass over the 36 | * RDD, to produce a sample of size that's approximately equal to the sum of 37 | * math.ceil(numItems * samplingRate) over all key values. 38 | * 39 | * @param withReplacement whether to sample with or without replacement 40 | * @param fractions map of specific keys to sampling rates 41 | * @param seed seed for the random number generator 42 | * @return RDD containing the sampled subset 43 | */ 44 | def sampleByKey(withReplacement: Boolean, 45 | fractions: Map[K, Double], 46 | seed: Long = Utils.random.nextLong): RDD[(K, V)] = self.withScope { 47 | 48 | require(fractions.values.forall(v => v >= 0.0), "Negative sampling rates.") 49 | 50 | val samplingFunc = if (withReplacement) { 51 | StratifiedSamplingUtils.getPoissonSamplingFunction(self, fractions, false, seed) 52 | } else { 53 | StratifiedSamplingUtils.getBernoulliSamplingFunction(self, fractions, false, seed) 54 | } 55 | self.mapPartitionsWithIndex(samplingFunc, preservesPartitioning = true) 56 | } 57 | ``` 58 | 59 | (2)和(3)类似,请看代码 60 | 61 | 62 | 2.代码: 63 | 64 | ``` 65 | /** 66 | * @author xubo 67 | * ref:Spark MlLib机器学习实战 68 | * more code:https://github.com/xubo245/SparkLearning 69 | * more blog:http://blog.csdn.net/xubo245 70 | */ 71 | package org.apache.spark.mllib.learning.basic 72 | 73 | import org.apache.spark.{SparkConf, SparkContext} 74 | 75 | /** 76 | * Created by xubo on 2016/5/23. 
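  * 补充说明(示意):sampleByKey 只保证期望的抽样比例,各层实际样本数会有随机波动;
  * 若需要各层样本量更接近 math.ceil(numItems * samplingRate),可改用 sampleByKeyExact(需要额外遍历,开销更大)。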
77 | */ 78 | object StratifiedSamplingLearning { 79 | def main(args: Array[String]) { 80 | val conf = new SparkConf().setMaster("local[4]").setAppName(this.getClass().getSimpleName().filter(!_.equals('$'))) 81 | val sc = new SparkContext(conf) 82 | println("First:") 83 | val data = sc.textFile("file/data/mllib/input/basic/StratifiedSampling.txt") //读取数 84 | .map(row => { 85 | //开始处理 86 | if (row.length == 3) //判断字符数 87 | (row, 1) //建立对应map 88 | else (row, 2) //建立对应map 89 | }).map(each => (each._2, each._1)) 90 | data.foreach(println) 91 | println("sampleByKey:") 92 | val fractions: Map[Int, Double] = (List((1, 0.2), (2, 0.8))).toMap //设定抽样格式 93 | val approxSample = data.sampleByKey(withReplacement = false, fractions, 0) //计算抽样样本 94 | approxSample.foreach(println) 95 | 96 | println("Second:") 97 | //http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html#sampleByKey 98 | val randRDD = sc.parallelize(List((7, "cat"), (6, "mouse"), (7, "cup"), (6, "book"), (7, "tv"), (6, "screen"), (7, "heater"))) 99 | val sampleMap = List((7, 0.4), (6, 0.8)).toMap 100 | val sample2 = randRDD.sampleByKey(false, sampleMap, 42).collect 101 | sample2.foreach(println) 102 | 103 | println("Third:") 104 | //http://bbs.csdn.net/topics/390953396 105 | val a = sc.parallelize(1 to 20, 3) 106 | val b = a.sample(true, 0.8, 0) 107 | val c = a.sample(false, 0.8, 0) 108 | println("RDD a : " + a.collect().mkString(" , ")) 109 | println("RDD b : " + b.collect().mkString(" , ")) 110 | println("RDD c : " + c.collect().mkString(" , ")) 111 | sc.stop 112 | } 113 | } 114 | 115 | ``` 116 | 117 | 3.结果: 118 | 119 | ``` 120 | First: 121 | 2016-05-23 22:37:34 WARN :139 - Your hostname, xubo-PC resolves to a loopback/non-reachable address: fe80:0:0:0:482:722f:5976:ce1f%20, but we couldn't find any external IP address! 
122 | (2,aa) 123 | (1,bbb) 124 | (2,bb) 125 | (1,ccc) 126 | (2,cc) 127 | (1,ddd) 128 | (2,dd) 129 | (1,eee) 130 | (2,ee) 131 | (1,aaa) 132 | sampleByKey: 133 | (2,aa) 134 | (2,bb) 135 | (2,cc) 136 | (2,ee) 137 | Second: 138 | (7,cat) 139 | (6,mouse) 140 | (6,book) 141 | (6,screen) 142 | (7,heater) 143 | Third: 144 | RDD a : 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 , 10 , 11 , 12 , 13 , 14 , 15 , 16 , 17 , 18 , 19 , 20 145 | RDD b : 2 , 4 , 5 , 6 , 10 , 14 , 19 , 20 146 | RDD c : 1 , 2 , 4 , 5 , 8 , 10 , 12 , 13 , 14 , 15 , 16 , 17 , 18 , 19 , 20 147 | ``` 148 | 149 | 参考 150 | 151 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 152 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 153 | 【3】https://github.com/xubo245/SparkLearning -------------------------------------------------------------------------------- /3分类和回归/Spark中组件Mllib的学习40之梯度提升树(GBT)用于回归_.md: -------------------------------------------------------------------------------- 1 | 更多代码请见:https://github.com/xubo245/SparkLearning 2 | Spark中组件Mllib的学习之分类篇 3 | 1解释 4 | GBRT(Gradient Boost Regression Tree)渐进梯度回归树 5 | 同样的setCategoricalFeaturesInfo有问题。注释掉了。 6 | 7 | 2.代码: 8 | 9 | ``` 10 | /** 11 | * @author xubo 12 | * ref:Spark MlLib机器学习实战 13 | * more code:https://github.com/xubo245/SparkLearning 14 | * more blog:http://blog.csdn.net/xubo245 15 | */ 16 | package org.apache.spark.mllib.learning.classification 17 | 18 | import java.text.SimpleDateFormat 19 | import java.util.Date 20 | 21 | import org.apache.spark.mllib.tree.DecisionTree 22 | import org.apache.spark.mllib.util.MLUtils 23 | import org.apache.spark.{SparkConf, SparkContext} 24 | import org.apache.spark.mllib.tree.GradientBoostedTrees 25 | import org.apache.spark.mllib.tree.configuration.BoostingStrategy 26 | import org.apache.spark.mllib.tree.model.{DecisionTreeModel, GradientBoostedTreesModel} 27 | import org.apache.spark.mllib.util.MLUtils 28 | import java.util.Map 29 | import org.apache.spark.mllib.tree.GradientBoostedTrees 30 | import org.apache.spark.mllib.tree.configuration.BoostingStrategy 31 | import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel 32 | import org.apache.spark.mllib.util.MLUtils 33 | 34 | /** 35 | * Created by xubo on 2016/5/23. 36 | */ 37 | object GBTs2Regression { 38 | def main(args: Array[String]) { 39 | val conf = new SparkConf().setMaster("local[4]").setAppName(this.getClass().getSimpleName().filter(!_.equals('$'))) 40 | val sc = new SparkContext(conf) 41 | 42 | // Load and parse the data file. 43 | val data = MLUtils.loadLibSVMFile(sc, "file/data/mllib/input/classification/sample_libsvm_data.txt") 44 | 45 | // Split the data into training and test sets (30% held out for testing) 46 | val splits = data.randomSplit(Array(0.7, 0.3)) 47 | val (trainingData, testData) = (splits(0), splits(1)) 48 | 49 | // Train a GradientBoostedTrees model. 50 | // The defaultParams for Classification use LogLoss by default. 51 | val boostingStrategy = BoostingStrategy.defaultParams("Regression") 52 | boostingStrategy.setNumIterations(3) 53 | boostingStrategy.treeStrategy.setMaxDepth(5) 54 | // boostingStrategy.treeStrategy.setCategoricalFeaturesInfo(Map[Int, Int]()) 55 | 56 | // Train a GradientBoostedTrees model. 57 | // The defaultParams for Regression use SquaredError by default. 58 | // val boostingStrategy = BoostingStrategy.defaultParams("Regression") 59 | // boostingStrategy.numIterations = 3 // Note: Use more iterations in practice. 
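    // 补充示例(假设性写法,非原代码):若希望用验证集在迭代中做早停以缓解过拟合,
    // 可改用带验证的训练接口,大致如下(validationData 为假设的另一份 RDD[LabeledPoint]):
    // val modelWithValidation =
    //   new GradientBoostedTrees(boostingStrategy).runWithValidation(trainingData, validationData)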
60 | // boostingStrategy.treeStrategy.maxDepth = 5 61 | // // Empty categoricalFeaturesInfo indicates all features are continuous. 62 | // boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]() 63 | 64 | val model = GradientBoostedTrees.train(trainingData, boostingStrategy) 65 | 66 | // Evaluate model on test instances and compute test error 67 | val labelsAndPredictions = testData.map { point => 68 | val prediction = model.predict(point.features) 69 | (point.label, prediction) 70 | } 71 | val testMSE = labelsAndPredictions.map { case (v, p) => math.pow((v - p), 2) }.mean() 72 | println("Test Mean Squared Error = " + testMSE) 73 | println("Learned regression GBT model:\n" + model.toDebugString) 74 | 75 | 76 | 77 | println("data.count:" + data.count()) 78 | println("trainingData.count:" + trainingData.count()) 79 | println("testData.count:" + testData.count()) 80 | println("model.algo:" + model.algo) 81 | println("model.trees:" + model.trees) 82 | println("model.treeWeights:" + model.treeWeights) 83 | 84 | 85 | 86 | // Save and load model 87 | // val iString = new SimpleDateFormat("yyyyMMddHHmmssSSS").format(new Date()) 88 | // val path = "file/data/mllib/output/classification/GradientBoostedTreesModel" + iString + "/result" 89 | // model.save(sc, path) 90 | // val sameModel = DecisionTreeModel.load(sc, path) 91 | // println(sameModel.algo) 92 | sc.stop 93 | } 94 | } 95 | 96 | ``` 97 | 98 | 3.结果: 99 | 100 | ``` 101 | Test Mean Squared Error = 0.06896551724137932 102 | Learned regression GBT model: 103 | TreeEnsembleModel regressor with 3 trees 104 | 105 | Tree 0: 106 | If (feature 406 <= 72.0) 107 | If (feature 99 <= 0.0) 108 | Predict: 0.0 109 | Else (feature 99 > 0.0) 110 | Predict: 1.0 111 | Else (feature 406 > 72.0) 112 | Predict: 1.0 113 | Tree 1: 114 | Predict: 0.0 115 | Tree 2: 116 | Predict: 0.0 117 | 118 | data.count:100 119 | trainingData.count:71 120 | testData.count:29 121 | model.algo:Regression 122 | model.trees:[Lorg.apache.spark.mllib.tree.model.DecisionTreeModel;@5e9a7c29 123 | model.treeWeights:[D@78bf694d 124 | ``` 125 | 126 | 参考 127 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 128 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 129 | 【3】https://github.com/xubo245/SparkLearning -------------------------------------------------------------------------------- /9评估度量/Spark中组件Mllib的学习72之RankingSystem进行评估.md: -------------------------------------------------------------------------------- 1 | 2 | 更多代码请见:https://github.com/xubo245/SparkLearning 3 | 4 | Spark中组件Mllib的学习 5 | 6 | 1.解释 7 | 8 | 主要用于推荐系统 9 | 10 | ![](http://i.imgur.com/kIusv5x.png) 11 | 12 | 13 | 2.代码: 14 | 15 | /** 16 | * @author xubo 17 | * ref:http://spark.apache.org/docs/1.5.2/mllib-guide.html 18 | * more code:https://github.com/xubo245/SparkLearning 19 | * more blog:http://blog.csdn.net/xubo245 20 | */ 21 | package org.apache.spark.mllib.EvaluationMetrics 22 | 23 | import org.apache.spark.util.SparkLearningFunSuite 24 | 25 | /** 26 | * Created by xubo on 2016/6/13. 
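  * 补充说明:RankingMetrics 的输入是 RDD[(Array[T], Array[T])],即每个用户的
  * (推荐结果列表, 真实相关物品列表) 二元组;precisionAt(k)、ndcgAt(k) 等指标均基于该输入计算。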
27 | */ 28 | class RankingSystemsFunSuite extends SparkLearningFunSuite { 29 | test("testFunSuite") { 30 | 31 | 32 | import org.apache.spark.mllib.evaluation.{RegressionMetrics, RankingMetrics} 33 | import org.apache.spark.mllib.recommendation.{ALS, Rating} 34 | 35 | // Read in the ratings data 36 | val ratings = sc.textFile("file/data/mllib/input/mllibFromSpark/sample_movielens_data.txt").map { line => 37 | val fields = line.split("::") 38 | Rating(fields(0).toInt, fields(1).toInt, fields(2).toDouble - 2.5) 39 | }.cache() 40 | 41 | // Map ratings to 1 or 0, 1 indicating a movie that should be recommended 42 | val binarizedRatings = ratings.map(r => Rating(r.user, r.product, if (r.rating > 0) 1.0 else 0.0)).cache() 43 | 44 | // Summarize ratings 45 | val numRatings = ratings.count() 46 | val numUsers = ratings.map(_.user).distinct().count() 47 | val numMovies = ratings.map(_.product).distinct().count() 48 | println(s"Got $numRatings ratings from $numUsers users on $numMovies movies.") 49 | 50 | // Build the model 51 | val numIterations = 10 52 | val rank = 10 53 | val lambda = 0.01 54 | val model = ALS.train(ratings, rank, numIterations, lambda) 55 | 56 | // Define a function to scale ratings from 0 to 1 57 | def scaledRating(r: Rating): Rating = { 58 | val scaledRating = math.max(math.min(r.rating, 1.0), 0.0) 59 | Rating(r.user, r.product, scaledRating) 60 | } 61 | 62 | // Get sorted top ten predictions for each user and then scale from [0, 1] 63 | val userRecommended = model.recommendProductsForUsers(10).map { case (user, recs) => 64 | (user, recs.map(scaledRating)) 65 | } 66 | 67 | // Assume that any movie a user rated 3 or higher (which maps to a 1) is a relevant document 68 | // Compare with top ten most relevant documents 69 | val userMovies = binarizedRatings.groupBy(_.user) 70 | val relevantDocuments = userMovies.join(userRecommended).map { case (user, (actual, predictions)) => 71 | (predictions.map(_.product), actual.filter(_.rating > 0.0).map(_.product).toArray) 72 | } 73 | 74 | // Instantiate metrics object 75 | val metrics = new RankingMetrics(relevantDocuments) 76 | 77 | // Precision at K 78 | Array(1, 3, 5).foreach { k => 79 | println(s"Precision at $k = ${metrics.precisionAt(k)}") 80 | } 81 | 82 | // Mean average precision 83 | println(s"Mean average precision = ${metrics.meanAveragePrecision}") 84 | 85 | // Normalized discounted cumulative gain 86 | Array(1, 3, 5).foreach { k => 87 | println(s"NDCG at $k = ${metrics.ndcgAt(k)}") 88 | } 89 | 90 | // Get predictions for each data point 91 | val allPredictions = model.predict(ratings.map(r => (r.user, r.product))).map(r => ((r.user, r.product), r.rating)) 92 | val allRatings = ratings.map(r => ((r.user, r.product), r.rating)) 93 | val predictionsAndLabels = allPredictions.join(allRatings).map { case ((user, product), (predicted, actual)) => 94 | (predicted, actual) 95 | } 96 | 97 | // Get the RMSE using regression metrics 98 | val regressionMetrics = new RegressionMetrics(predictionsAndLabels) 99 | println(s"RMSE = ${regressionMetrics.rootMeanSquaredError}") 100 | 101 | // R-squared 102 | println(s"R-squared = ${regressionMetrics.r2}") 103 | 104 | 105 | } 106 | } 107 | 108 | 109 | 110 | 3.结果: 111 | 112 | Got 1501 ratings from 30 users on 100 movies. 
113 | 2016-06-14 21:05:44 WARN BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS 114 | 2016-06-14 21:05:44 WARN BLAS:61 - Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS 115 | 2016-06-14 21:05:45 WARN LAPACK:61 - Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK 116 | 2016-06-14 21:05:45 WARN LAPACK:61 - Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK 117 | Precision at 1 = 0.3 118 | Precision at 3 = 0.3888888888888889 119 | Precision at 5 = 0.4800000000000001 120 | Mean average precision = 0.26666169370152676 121 | NDCG at 1 = 0.3 122 | NDCG at 3 = 0.3687147515984829 123 | NDCG at 5 = 0.43771295011419203 124 | RMSE = 0.2403021781310338 125 | R-squared = 0.9590077885328022 126 | 127 | 参考 128 | 129 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 130 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 131 | 【3】https://github.com/xubo245/SparkLearning 132 | 【4】book:Machine Learning with Spark ,Nick Pertreach 133 | 【5】book:Spark MlLib机器学习实战 134 | -------------------------------------------------------------------------------- /7特征提取和转换/Spark中组件Mllib的学习62之特征选择中的卡方选择器.md: -------------------------------------------------------------------------------- 1 | 2 | 更多代码请见:https://github.com/xubo245/SparkLearning 3 | 4 | Spark中组件Mllib的学习 5 | 6 | 1.解释 7 | 8 |   特征选择试图识别相关的特征用于模型构建。它改变特征空间的大小,它可以提高速度以及统计学习行为。ChiSqSelector实现卡方特征选择,它操作于带有类别特征的标注数据。 ChiSqSelector根据独立的卡方测试对特征进行排序,然后选择排序最高的特征。 9 | 10 | 卡方选择(ChiSqSelector) 11 | 12 | ChiSqSelector是指使用卡方(Chi-Squared)做特征选择。该方法操作的是有标签的类别型数据。ChiSqSelector基于卡方检验来排序数据,然后选出卡方值较大(也就是跟标签最相关)的特征(topk)。 13 | 模型拟合 14 | 15 | ChiSqSelector 的构造函数有如下特征: 16 | 17 | numTopFeatures 保留的卡方较大的特征的数量。 18 | 19 | ChiSqSelector.fit() 方法以具有类别特征的RDD[LabeledPoint]为输入,计算汇总统计信息,然后返回ChiSqSelectorModel,这个类将输入数据转化到降维的特征空间。 20 | 21 | 模型实现了 VectorTransformer,这个类可以在Vector和RDD[Vector]上做卡方特征选择。 22 | 23 | 注意:也可以手工构造一个ChiSqSelectorModel,需要提供升序排列的特征索引。 24 | 25 | 2.代码: 26 | 27 | /* 28 | * Licensed to the Apache Software Foundation (ASF) under one or more 29 | * contributor license agreements. See the NOTICE file distributed with 30 | * this work for additional information regarding copyright ownership. 31 | * The ASF licenses this file to You under the Apache License, Version 2.0 32 | * (the "License"); you may not use this file except in compliance with 33 | * the License. You may obtain a copy of the License at 34 | * 35 | * http://www.apache.org/licenses/LICENSE-2.0 36 | * 37 | * Unless required by applicable law or agreed to in writing, software 38 | * distributed under the License is distributed on an "AS IS" BASIS, 39 | * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 40 | * See the License for the specific language governing permissions and 41 | * limitations under the License. 
42 | */ 43 | 44 | package org.apache.spark.mllib.FeatureExtractionAndTransformation 45 | 46 | import org.apache.spark.SparkFunSuite 47 | import org.apache.spark.mllib.feature.ChiSqSelector 48 | import org.apache.spark.mllib.linalg.Vectors 49 | import org.apache.spark.mllib.regression.LabeledPoint 50 | import org.apache.spark.mllib.util.MLlibTestSparkContext 51 | 52 | class ChiSqSelectorSuite extends SparkFunSuite with MLlibTestSparkContext { 53 | 54 | /* 55 | * Contingency tables 56 | * feature0 = {8.0, 0.0} 57 | * class 0 1 2 58 | * 8.0||1|0|1| 59 | * 0.0||0|2|0| 60 | * 61 | * feature1 = {7.0, 9.0} 62 | * class 0 1 2 63 | * 7.0||1|0|0| 64 | * 9.0||0|2|1| 65 | * 66 | * feature2 = {0.0, 6.0, 8.0, 5.0} 67 | * class 0 1 2 68 | * 0.0||1|0|0| 69 | * 6.0||0|1|0| 70 | * 8.0||0|1|0| 71 | * 5.0||0|0|1| 72 | * 73 | * Use chi-squared calculator from Internet 74 | */ 75 | 76 | test("ChiSqSelector transform test (sparse & dense vector)") { 77 | val labeledDiscreteData = sc.parallelize( 78 | Seq(LabeledPoint(0.0, Vectors.sparse(3, Array((0, 8.0), (1, 7.0)))), 79 | LabeledPoint(1.0, Vectors.sparse(3, Array((1, 9.0), (2, 6.0)))), 80 | LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 8.0))), 81 | LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 5.0)))), 2) 82 | val preFilteredData = 83 | Set(LabeledPoint(0.0, Vectors.dense(Array(0.0))), 84 | LabeledPoint(1.0, Vectors.dense(Array(6.0))), 85 | LabeledPoint(1.0, Vectors.dense(Array(8.0))), 86 | LabeledPoint(2.0, Vectors.dense(Array(5.0)))) 87 | val model = new ChiSqSelector(2).fit(labeledDiscreteData) 88 | val filteredData = labeledDiscreteData.map { lp => 89 | LabeledPoint(lp.label, model.transform(lp.features)) 90 | }.collect().toSet 91 | // assert(filteredData == preFilteredData) 92 | 93 | println("labeledDiscreteData:") 94 | labeledDiscreteData.foreach(println) 95 | println("model:") 96 | model.selectedFeatures.foreach(println) 97 | // model.selectedFeatures. 
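    // selectedFeatures 保存的是被选中特征的列索引(升序排列),
    // 本例 k=1 时只选出第 2 列,k=2 时选出第 0 列和第 2 列,对应下方结果中 model: 后打印的索引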
98 | println("filteredData:") 99 | filteredData.foreach(println) 100 | println("preFilteredData:") 101 | preFilteredData.foreach(println) 102 | } 103 | } 104 | 105 | 106 | 3.结果: 107 | 108 | (1)new ChiSqSelector(1) 109 | 110 | labeledDiscreteData: 111 | (0.0,(3,[0,1],[8.0,7.0])) 112 | (1.0,(3,[1,2],[9.0,6.0])) 113 | (1.0,[0.0,9.0,8.0]) 114 | (2.0,[8.0,9.0,5.0]) 115 | model: 116 | 2 117 | filteredData: 118 | (0.0,(1,[],[])) 119 | (1.0,(1,[0],[6.0])) 120 | (1.0,[8.0]) 121 | (2.0,[5.0]) 122 | preFilteredData: 123 | (0.0,[0.0]) 124 | (1.0,[6.0]) 125 | (1.0,[8.0]) 126 | (2.0,[5.0]) 127 | 128 | (2)new ChiSqSelector(2) 129 | 130 | labeledDiscreteData: 131 | (1.0,[0.0,9.0,8.0]) 132 | (2.0,[8.0,9.0,5.0]) 133 | (0.0,(3,[0,1],[8.0,7.0])) 134 | (1.0,(3,[1,2],[9.0,6.0])) 135 | model: 136 | 0 137 | 2 138 | filteredData: 139 | (0.0,(2,[0],[8.0])) 140 | (1.0,(2,[1],[6.0])) 141 | (1.0,[0.0,8.0]) 142 | (2.0,[8.0,5.0]) 143 | preFilteredData: 144 | (0.0,[0.0]) 145 | (1.0,[6.0]) 146 | (1.0,[8.0]) 147 | (2.0,[5.0]) 148 | 149 | 150 | 151 | 参考 152 | 153 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 154 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 155 | 【3】https://github.com/xubo245/SparkLearning 156 | 【4】book:Machine Learning with Spark ,Nick Pertreach 157 | 【5】book:Spark MlLib机器学习实战 158 | 【6】https://github.com/endymecy/spark-ml-source-analysis/blob/master/%E7%89%B9%E5%BE%81%E6%8A%BD%E5%8F%96%E5%92%8C%E8%BD%AC%E6%8D%A2/chi-square-selector.md 159 | 【7】http://www.fuqingchuan.com/2015/03/643.html#standardscaler 160 | -------------------------------------------------------------------------------- /5聚类/Spark中组件Mllib的学习47之隐含狄利克雷分布(Latent Dirichlet allocation (LDA)学习.md: -------------------------------------------------------------------------------- 1 | 2 | 更多代码请见:https://github.com/xubo245/SparkLearning 3 | 4 | Spark中组件Mllib的学习 5 | 6 | 1.解释 7 | 8 | 隐含狄利克雷分布(Latent Dirichlet allocation (LDA) 9 | 10 | 隐含狄利克雷分布(LDA) 是一个主题模型,它能够推理出一个文本文档集合的主体。LDA可以认为是一个聚类算法,原因如下: 11 | 12 | 主题对应聚类中心,文档对应数据集中的样本(数据行) 13 | 主题和文档都在一个特征空间中,其特征向量是词频向量。 14 | 跟使用传统的距离来评估聚类不一样的是,LDA使用评估方式是一个函数,该函数基于文档如何生成的统计模型。 15 | 16 | LDA以词频向量表示的文档集合作为输入。然后在最大似然函数上使用期望最大(EM)算法 来学习聚类。完成文档拟合之后,LDA提供: 17 | 18 | Topics: 推断出的主题,每个主体是单词上的概率分布。 19 | Topic distributions for documents: 对训练集中的每个文档,LDA给了一个在主题上的概率分布。 20 | 21 | LDA参数如下: 22 | 23 | k: 主题数量(或者说聚簇中心数量) 24 | maxIterations: EM算法的最大迭代次数。 25 | docConcentration: 文档在主题上分布的先验参数。当前必须大于1,值越大,推断出的分布越平滑。 26 | topicConcentration: 主题在单词上的先验分布参数。当前必须大于1,值越大,推断出的分布越平滑。 27 | checkpointInterval: 检查点间隔。maxIterations很大的时候,检查点可以帮助减少shuffle文件大小并且可以帮助故障恢复。 28 | 29 | 参考【4】 30 | 31 | 2.代码: 32 | 33 | /** 34 | * @author xubo 35 | * ref:http://spark.apache.org/docs/1.5.2/mllib-guide.html 36 | * more code:https://github.com/xubo245/SparkLearning 37 | * more blog:http://blog.csdn.net/xubo245 38 | */ 39 | package org.apache.spark.mllib.clustering.LDALearning 40 | 41 | import org.apache.spark.mllib.clustering.LDA 42 | import org.apache.spark.util.SparkLearningFunSuite 43 | 44 | /** 45 | * Created by xubo on 2016/6/13. 
46 | */ 47 | class LDAFromWebSuite extends SparkLearningFunSuite { 48 | test("testFunSuite") { 49 | 50 | 51 | import org.apache.spark.mllib.linalg.Vectors 52 | 53 | // Load and parse the data 54 | val data = sc.textFile("file/data/mllib/input/mllibFromSpark/sample_lda_data.txt") 55 | val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble))) 56 | // Index documents with unique IDs 57 | val corpus = parsedData.zipWithIndex.map(_.swap).cache() 58 | 59 | // Cluster the documents into three topics using LDA 60 | val ldaModel = new LDA().setK(3).run(corpus) 61 | 62 | //input data 63 | println("parsedData:") 64 | parsedData.foreach(println) 65 | println("corpus:") 66 | corpus.foreach(println) 67 | 68 | // Output topics. Each is a distribution over words (matching word count vectors) 69 | println("Learned topics (as distributions over vocab of " + ldaModel.vocabSize + " words):") 70 | val topics = ldaModel.topicsMatrix 71 | for (topic <- Range(0, 3)) { 72 | print("Topic " + topic + ":") 73 | for (word <- Range(0, ldaModel.vocabSize)) { 74 | print(" " + topics(word, topic)); 75 | } 76 | println() 77 | } 78 | 79 | // Save and load model. 80 | // ldaModel.save(sc, "myLDAModel") 81 | // val sameModel = DistributedLDAModel.load(sc, "myLDAModel") 82 | 83 | 84 | } 85 | } 86 | 87 | 数据: 88 | 89 | 1 2 6 0 2 3 1 1 0 0 3 90 | 1 3 0 1 3 0 0 2 0 0 1 91 | 1 4 1 0 0 4 9 0 1 2 0 92 | 2 1 0 3 0 0 5 0 2 3 9 93 | 3 1 1 9 3 0 2 0 0 1 3 94 | 4 2 0 3 4 5 1 1 1 4 0 95 | 2 1 0 3 0 0 5 0 2 2 9 96 | 1 1 1 9 2 1 2 0 0 1 3 97 | 4 4 0 3 4 2 1 3 0 0 0 98 | 2 8 2 0 3 0 2 0 2 7 2 99 | 1 1 1 9 0 2 2 0 0 3 3 100 | 4 1 0 0 4 5 1 3 0 1 0 101 | 102 | 103 | 3.结果: 104 | 105 | parsedData: 106 | [1.0,1.0,1.0,9.0,2.0,1.0,2.0,0.0,0.0,1.0,3.0] 107 | [1.0,2.0,6.0,0.0,2.0,3.0,1.0,1.0,0.0,0.0,3.0] 108 | [4.0,4.0,0.0,3.0,4.0,2.0,1.0,3.0,0.0,0.0,0.0] 109 | [1.0,3.0,0.0,1.0,3.0,0.0,0.0,2.0,0.0,0.0,1.0] 110 | [2.0,8.0,2.0,0.0,3.0,0.0,2.0,0.0,2.0,7.0,2.0] 111 | [1.0,4.0,1.0,0.0,0.0,4.0,9.0,0.0,1.0,2.0,0.0] 112 | [1.0,1.0,1.0,9.0,0.0,2.0,2.0,0.0,0.0,3.0,3.0] 113 | [2.0,1.0,0.0,3.0,0.0,0.0,5.0,0.0,2.0,3.0,9.0] 114 | [4.0,1.0,0.0,0.0,4.0,5.0,1.0,3.0,0.0,1.0,0.0] 115 | [3.0,1.0,1.0,9.0,3.0,0.0,2.0,0.0,0.0,1.0,3.0] 116 | [4.0,2.0,0.0,3.0,4.0,5.0,1.0,1.0,1.0,4.0,0.0] 117 | [2.0,1.0,0.0,3.0,0.0,0.0,5.0,0.0,2.0,2.0,9.0] 118 | corpus: 119 | (7,[1.0,1.0,1.0,9.0,2.0,1.0,2.0,0.0,0.0,1.0,3.0]) 120 | (8,[4.0,4.0,0.0,3.0,4.0,2.0,1.0,3.0,0.0,0.0,0.0]) 121 | (9,[2.0,8.0,2.0,0.0,3.0,0.0,2.0,0.0,2.0,7.0,2.0]) 122 | (0,[1.0,2.0,6.0,0.0,2.0,3.0,1.0,1.0,0.0,0.0,3.0]) 123 | (10,[1.0,1.0,1.0,9.0,0.0,2.0,2.0,0.0,0.0,3.0,3.0]) 124 | (1,[1.0,3.0,0.0,1.0,3.0,0.0,0.0,2.0,0.0,0.0,1.0]) 125 | (11,[4.0,1.0,0.0,0.0,4.0,5.0,1.0,3.0,0.0,1.0,0.0]) 126 | (2,[1.0,4.0,1.0,0.0,0.0,4.0,9.0,0.0,1.0,2.0,0.0]) 127 | (3,[2.0,1.0,0.0,3.0,0.0,0.0,5.0,0.0,2.0,3.0,9.0]) 128 | (4,[3.0,1.0,1.0,9.0,3.0,0.0,2.0,0.0,0.0,1.0,3.0]) 129 | (5,[4.0,2.0,0.0,3.0,4.0,5.0,1.0,1.0,1.0,4.0,0.0]) 130 | (6,[2.0,1.0,0.0,3.0,0.0,0.0,5.0,0.0,2.0,2.0,9.0]) 131 | Learned topics (as distributions over vocab of 11 words): 132 | Topic 0: 10.452679685126427 10.181668875779492 2.558644879228586 3.851438041310386 9.929534544713832 11.940154625598604 14.086100675895626 4.8961707413781115 2.2995952755592106 7.381361487130488 8.981150231959049 133 | Topic 1: 10.279220142758316 5.956661018866242 4.910211518699095 30.538789151743963 5.928882165794898 5.447495432535608 6.549479250479619 3.011959583638183 1.0753194351327675 3.217481558556803 9.62611924184504 134 | Topic 2: 5.268100172115256 12.861670105354264 
4.531143602072319 5.6097728069456565 9.141583289491269 4.612349941865787 10.364420073624755 2.0918696749837067 4.625085289308021 13.40115695431271 14.392730526195912 135 | 136 | 137 | 138 | 参考 139 | 140 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 141 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 142 | 【3】https://github.com/xubo245/SparkLearning 143 | 【4】http://www.fuqingchuan.com/2015/03/609.html#latent-dirichlet-allocation-lda -------------------------------------------------------------------------------- /5聚类/Spark中组件Mllib的学习47之隐含狄利克雷分布(Latent Dirichlet allocation,LDA)学习.md: -------------------------------------------------------------------------------- 1 | 2 | 更多代码请见:https://github.com/xubo245/SparkLearning 3 | 4 | Spark中组件Mllib的学习 5 | 6 | 1.解释 7 | 8 | 隐含狄利克雷分布(Latent Dirichlet allocation (LDA) 9 | 10 | 隐含狄利克雷分布(LDA) 是一个主题模型,它能够推理出一个文本文档集合的主体。LDA可以认为是一个聚类算法,原因如下: 11 | 12 | 主题对应聚类中心,文档对应数据集中的样本(数据行) 13 | 主题和文档都在一个特征空间中,其特征向量是词频向量。 14 | 跟使用传统的距离来评估聚类不一样的是,LDA使用评估方式是一个函数,该函数基于文档如何生成的统计模型。 15 | 16 | LDA以词频向量表示的文档集合作为输入。然后在最大似然函数上使用期望最大(EM)算法 来学习聚类。完成文档拟合之后,LDA提供: 17 | 18 | Topics: 推断出的主题,每个主体是单词上的概率分布。 19 | Topic distributions for documents: 对训练集中的每个文档,LDA给了一个在主题上的概率分布。 20 | 21 | LDA参数如下: 22 | 23 | k: 主题数量(或者说聚簇中心数量) 24 | maxIterations: EM算法的最大迭代次数。 25 | docConcentration: 文档在主题上分布的先验参数。当前必须大于1,值越大,推断出的分布越平滑。 26 | topicConcentration: 主题在单词上的先验分布参数。当前必须大于1,值越大,推断出的分布越平滑。 27 | checkpointInterval: 检查点间隔。maxIterations很大的时候,检查点可以帮助减少shuffle文件大小并且可以帮助故障恢复。 28 | 29 | 参考【4】 30 | 31 | 2.代码: 32 | 33 | /** 34 | * @author xubo 35 | * ref:http://spark.apache.org/docs/1.5.2/mllib-guide.html 36 | * more code:https://github.com/xubo245/SparkLearning 37 | * more blog:http://blog.csdn.net/xubo245 38 | */ 39 | package org.apache.spark.mllib.clustering.LDALearning 40 | 41 | import org.apache.spark.mllib.clustering.LDA 42 | import org.apache.spark.util.SparkLearningFunSuite 43 | 44 | /** 45 | * Created by xubo on 2016/6/13. 46 | */ 47 | class LDAFromWebSuite extends SparkLearningFunSuite { 48 | test("testFunSuite") { 49 | 50 | 51 | import org.apache.spark.mllib.linalg.Vectors 52 | 53 | // Load and parse the data 54 | val data = sc.textFile("file/data/mllib/input/mllibFromSpark/sample_lda_data.txt") 55 | val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble))) 56 | // Index documents with unique IDs 57 | val corpus = parsedData.zipWithIndex.map(_.swap).cache() 58 | 59 | // Cluster the documents into three topics using LDA 60 | val ldaModel = new LDA().setK(3).run(corpus) 61 | 62 | //input data 63 | println("parsedData:") 64 | parsedData.foreach(println) 65 | println("corpus:") 66 | corpus.foreach(println) 67 | 68 | // Output topics. Each is a distribution over words (matching word count vectors) 69 | println("Learned topics (as distributions over vocab of " + ldaModel.vocabSize + " words):") 70 | val topics = ldaModel.topicsMatrix 71 | for (topic <- Range(0, 3)) { 72 | print("Topic " + topic + ":") 73 | for (word <- Range(0, ldaModel.vocabSize)) { 74 | print(" " + topics(word, topic)); 75 | } 76 | println() 77 | } 78 | 79 | // Save and load model. 
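    // 若取消下面两行注释做保存/加载,还需要先 import org.apache.spark.mllib.clustering.DistributedLDAModel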
80 | // ldaModel.save(sc, "myLDAModel") 81 | // val sameModel = DistributedLDAModel.load(sc, "myLDAModel") 82 | 83 | 84 | } 85 | } 86 | 87 | 数据: 88 | 89 | 1 2 6 0 2 3 1 1 0 0 3 90 | 1 3 0 1 3 0 0 2 0 0 1 91 | 1 4 1 0 0 4 9 0 1 2 0 92 | 2 1 0 3 0 0 5 0 2 3 9 93 | 3 1 1 9 3 0 2 0 0 1 3 94 | 4 2 0 3 4 5 1 1 1 4 0 95 | 2 1 0 3 0 0 5 0 2 2 9 96 | 1 1 1 9 2 1 2 0 0 1 3 97 | 4 4 0 3 4 2 1 3 0 0 0 98 | 2 8 2 0 3 0 2 0 2 7 2 99 | 1 1 1 9 0 2 2 0 0 3 3 100 | 4 1 0 0 4 5 1 3 0 1 0 101 | 102 | 103 | 3.结果: 104 | 105 | parsedData: 106 | [1.0,1.0,1.0,9.0,2.0,1.0,2.0,0.0,0.0,1.0,3.0] 107 | [1.0,2.0,6.0,0.0,2.0,3.0,1.0,1.0,0.0,0.0,3.0] 108 | [4.0,4.0,0.0,3.0,4.0,2.0,1.0,3.0,0.0,0.0,0.0] 109 | [1.0,3.0,0.0,1.0,3.0,0.0,0.0,2.0,0.0,0.0,1.0] 110 | [2.0,8.0,2.0,0.0,3.0,0.0,2.0,0.0,2.0,7.0,2.0] 111 | [1.0,4.0,1.0,0.0,0.0,4.0,9.0,0.0,1.0,2.0,0.0] 112 | [1.0,1.0,1.0,9.0,0.0,2.0,2.0,0.0,0.0,3.0,3.0] 113 | [2.0,1.0,0.0,3.0,0.0,0.0,5.0,0.0,2.0,3.0,9.0] 114 | [4.0,1.0,0.0,0.0,4.0,5.0,1.0,3.0,0.0,1.0,0.0] 115 | [3.0,1.0,1.0,9.0,3.0,0.0,2.0,0.0,0.0,1.0,3.0] 116 | [4.0,2.0,0.0,3.0,4.0,5.0,1.0,1.0,1.0,4.0,0.0] 117 | [2.0,1.0,0.0,3.0,0.0,0.0,5.0,0.0,2.0,2.0,9.0] 118 | corpus: 119 | (7,[1.0,1.0,1.0,9.0,2.0,1.0,2.0,0.0,0.0,1.0,3.0]) 120 | (8,[4.0,4.0,0.0,3.0,4.0,2.0,1.0,3.0,0.0,0.0,0.0]) 121 | (9,[2.0,8.0,2.0,0.0,3.0,0.0,2.0,0.0,2.0,7.0,2.0]) 122 | (0,[1.0,2.0,6.0,0.0,2.0,3.0,1.0,1.0,0.0,0.0,3.0]) 123 | (10,[1.0,1.0,1.0,9.0,0.0,2.0,2.0,0.0,0.0,3.0,3.0]) 124 | (1,[1.0,3.0,0.0,1.0,3.0,0.0,0.0,2.0,0.0,0.0,1.0]) 125 | (11,[4.0,1.0,0.0,0.0,4.0,5.0,1.0,3.0,0.0,1.0,0.0]) 126 | (2,[1.0,4.0,1.0,0.0,0.0,4.0,9.0,0.0,1.0,2.0,0.0]) 127 | (3,[2.0,1.0,0.0,3.0,0.0,0.0,5.0,0.0,2.0,3.0,9.0]) 128 | (4,[3.0,1.0,1.0,9.0,3.0,0.0,2.0,0.0,0.0,1.0,3.0]) 129 | (5,[4.0,2.0,0.0,3.0,4.0,5.0,1.0,1.0,1.0,4.0,0.0]) 130 | (6,[2.0,1.0,0.0,3.0,0.0,0.0,5.0,0.0,2.0,2.0,9.0]) 131 | Learned topics (as distributions over vocab of 11 words): 132 | Topic 0: 10.452679685126427 10.181668875779492 2.558644879228586 3.851438041310386 9.929534544713832 11.940154625598604 14.086100675895626 4.8961707413781115 2.2995952755592106 7.381361487130488 8.981150231959049 133 | Topic 1: 10.279220142758316 5.956661018866242 4.910211518699095 30.538789151743963 5.928882165794898 5.447495432535608 6.549479250479619 3.011959583638183 1.0753194351327675 3.217481558556803 9.62611924184504 134 | Topic 2: 5.268100172115256 12.861670105354264 4.531143602072319 5.6097728069456565 9.141583289491269 4.612349941865787 10.364420073624755 2.0918696749837067 4.625085289308021 13.40115695431271 14.392730526195912 135 | 136 | 137 | 138 | 参考 139 | 140 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 141 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 142 | 【3】https://github.com/xubo245/SparkLearning 143 | 【4】http://www.fuqingchuan.com/2015/03/609.html#latent-dirichlet-allocation-lda -------------------------------------------------------------------------------- /7特征提取和转换/Spark中组件Mllib的学习52之TF-IDF学习.md: -------------------------------------------------------------------------------- 1 | 2 | 更多代码请见:https://github.com/xubo245/SparkLearning 3 | 4 | Spark中组件Mllib的学习 5 | 6 | 1.解释 7 | 8 | TF-IDF(Term frequency-inverse document frequency ) 是文本挖掘中一种广泛使用的特征向量化方法。TF-IDF反映了语料中单词对文档的重要程度。假设单词用t表示,文档用d表示,语料用D表示,那么文档频度DF(t, D)是包含单词t的文档数。如果我们只是使用词频度量重要性,就会很容易过分强调重负次数多但携带信息少的单词,例如:”a”, “the”以及”of”。如果某个单词在整个语料库中高频出现,意味着它没有携带专门针对某特殊文档的信息。逆文档频度(IDF)是单词携带信息量的数值度量。 9 | 10 | TF-IDF的概念参考【4】中也讲的很详细,例子也很详细。 11 | 12 | 13 | 14 | 15 | 2.代码: 16 | 17 | /** 18 | * @author xubo 19 | * 
ref:http://spark.apache.org/docs/1.5.2/mllib-guide.html 20 | * more code:https://github.com/xubo245/SparkLearning 21 | * more blog:http://blog.csdn.net/xubo245 22 | */ 23 | package org.apache.spark.mllib.FeatureExtractionAndTransformation 24 | 25 | import org.apache.spark.util.SparkLearningFunSuite 26 | 27 | /** 28 | * Created by xubo on 2016/6/13. 29 | */ 30 | class TFIDFSuite extends SparkLearningFunSuite { 31 | test("testFunSuite") { 32 | import org.apache.spark.rdd.RDD 33 | import org.apache.spark.SparkContext 34 | import org.apache.spark.mllib.feature.HashingTF 35 | import org.apache.spark.mllib.linalg.Vector 36 | 37 | // val sc: SparkContext = ... 38 | 39 | // Load documents (one per line). 40 | val documents: RDD[Seq[String]] = sc.textFile("file/data/mllib/input/FeatureExtractionAndTransformation/a.txt").map(_.split(" ").toSeq) 41 | 42 | val hashingTF = new HashingTF() 43 | val tf: RDD[Vector] = hashingTF.transform(documents) 44 | println("tf:" + tf) 45 | tf.foreach(println) 46 | import org.apache.spark.mllib.feature.IDF 47 | 48 | // ... continue from the previous example 49 | tf.cache() 50 | val idf = new IDF().fit(tf) 51 | val tfidf: RDD[Vector] = idf.transform(tf) 52 | // println("idf:" + idf.idf) 53 | // idf.idf 54 | println("tfidf:" + tfidf) 55 | tfidf.foreach(println) 56 | import org.apache.spark.mllib.feature.IDF 57 | 58 | // ... continue from the previous example 59 | // tf.cache() 60 | val idf2 = new IDF(minDocFreq = 2).fit(tf) 61 | val tfidf2: RDD[Vector] = idf2.transform(tf) 62 | // println("idf2:" + idf2.idf) 63 | // tf.foreach(println) 64 | println("tfidf2:" + tfidf2) 65 | tfidf2.foreach(println) 66 | } 67 | } 68 | 69 | 第一次数据: 70 | 71 | hello scala 72 | goodbyr spark 73 | hello spark 74 | hello mllib 75 | spark 76 | goodbyr spark 77 | 78 | 第二次数据: 79 | 80 | hello scala hello scala hello scala hello 81 | goodbyr spark 82 | hello spark 83 | hello mllib 84 | spark 85 | goodbyr spark 86 | 87 | 88 | 3.结果: 89 | 90 | 第一次: 91 | 92 | tf:MapPartitionsRDD[3] at map at HashingTF.scala:78 93 | 2016-06-13 21:40:33 WARN :139 - Your hostname, xubo-PC resolves to a loopback/non-reachable address: fe80:0:0:0:482:722f:5976:ce1f%39, but we couldn't find any external IP address! 
94 | (1048576,[179334,596178],[1.0,1.0]) 95 | (1048576,[198982,596178],[1.0,1.0]) 96 | (1048576,[586461],[1.0]) 97 | (1048576,[452894,586461],[1.0,1.0]) 98 | (1048576,[452894,586461],[1.0,1.0]) 99 | (1048576,[586461,596178],[1.0,1.0]) 100 | tfidf:MapPartitionsRDD[5] at mapPartitions at IDF.scala:182 101 | (1048576,[198982,596178],[1.252762968495368,0.5596157879354227]) 102 | (1048576,[452894,586461],[0.8472978603872037,0.3364722366212129]) 103 | (1048576,[586461,596178],[0.3364722366212129,0.5596157879354227]) 104 | (1048576,[179334,596178],[1.252762968495368,0.5596157879354227]) 105 | (1048576,[586461],[0.3364722366212129]) 106 | (1048576,[452894,586461],[0.8472978603872037,0.3364722366212129]) 107 | tfidf2:MapPartitionsRDD[7] at mapPartitions at IDF.scala:182 108 | (1048576,[198982,596178],[0.0,0.5596157879354227]) 109 | (1048576,[452894,586461],[0.8472978603872037,0.3364722366212129]) 110 | (1048576,[586461,596178],[0.3364722366212129,0.5596157879354227]) 111 | (1048576,[179334,596178],[0.0,0.5596157879354227]) 112 | (1048576,[586461],[0.3364722366212129]) 113 | (1048576,[452894,586461],[0.8472978603872037,0.3364722366212129]) 114 | 115 | 第二次: 116 | 117 | tf:MapPartitionsRDD[3] at map at HashingTF.scala:78 118 | 2016-06-13 21:51:20 WARN :139 - Your hostname, xubo-PC resolves to a loopback/non-reachable address: fe80:0:0:0:482:722f:5976:ce1f%39, but we couldn't find any external IP address! 119 | (1048576,[198982,596178],[3.0,4.0]) 120 | (1048576,[586461,596178],[1.0,1.0]) 121 | (1048576,[452894,586461],[1.0,1.0]) 122 | (1048576,[179334,596178],[1.0,1.0]) 123 | (1048576,[586461],[1.0]) 124 | (1048576,[452894,586461],[1.0,1.0]) 125 | idf:1048576 126 | tfidf:MapPartitionsRDD[5] at mapPartitions at IDF.scala:182 127 | (1048576,[198982,596178],[3.758288905486104,2.2384631517416906]) 128 | (1048576,[452894,586461],[0.8472978603872037,0.3364722366212129]) 129 | (1048576,[586461,596178],[0.3364722366212129,0.5596157879354227]) 130 | (1048576,[179334,596178],[1.252762968495368,0.5596157879354227]) 131 | (1048576,[586461],[0.3364722366212129]) 132 | (1048576,[452894,586461],[0.8472978603872037,0.3364722366212129]) 133 | idf2:1048576 134 | tfidf2:MapPartitionsRDD[7] at mapPartitions at IDF.scala:182 135 | (1048576,[198982,596178],[0.0,2.2384631517416906]) 136 | (1048576,[452894,586461],[0.8472978603872037,0.3364722366212129]) 137 | (1048576,[586461,596178],[0.3364722366212129,0.5596157879354227]) 138 | (1048576,[179334,596178],[0.0,0.5596157879354227]) 139 | (1048576,[586461],[0.3364722366212129]) 140 | (1048576,[452894,586461],[0.8472978603872037,0.3364722366212129]) 141 | 142 | 参考 143 | 144 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 145 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 146 | 【3】https://github.com/xubo245/SparkLearning 147 | 【4】book:Machine Learning with Spark ,Nick Pertreach 148 | 【5】book:Spark MlLib机器学习实战 149 | 【6】http://www.fuqingchuan.com/2015/03/643.html#tf-idf 150 | -------------------------------------------------------------------------------- /3分类和回归/Spark中组件Mllib的学习41之保序回归(Isotonic regression).md: -------------------------------------------------------------------------------- 1 | 更多代码请见:https://github.com/xubo245/SparkLearning 2 | Spark中组件Mllib的学习之分类篇 3 | 1解释 4 | 5 | 问题描述:给定一个无序数字序列,要求不改变每个元素的位置,但可以修改每个元素的值,修改后得到一个非递减序列,问如何使误差(该处取平方差)最小? 
6 | 保序回归法:从该序列的首元素往后观察,一旦出现乱序现象停止该轮观察,从该乱序元素开始逐个吸收元素组成一个序列,直到该序列所有元素的平均值小于或等于下一个待吸收的元素。 7 | 举例: 8 | 原始序列:<9, 10, 14> 9 | 结果序列:<9, 10, 14> 10 | 分析:从9往后观察,到最后的元素14都未发现乱序情况,不用处理。 11 | 原始序列:<9, 14, 10> 12 | 结果序列:<9, 12, 12> 13 | 参考【4】 14 | 15 | 2.代码: 16 | 17 | ``` 18 | /** 19 | * @author xubo 20 | * ref:Spark MlLib机器学习实战 21 | * more code:https://github.com/xubo245/SparkLearning 22 | * more blog:http://blog.csdn.net/xubo245 23 | */ 24 | package org.apache.spark.mllib.learning.classification 25 | 26 | import java.text.SimpleDateFormat 27 | import java.util.Date 28 | 29 | import org.apache.spark.mllib.tree.DecisionTree 30 | import org.apache.spark.mllib.util.MLUtils 31 | import org.apache.spark.{SparkConf, SparkContext} 32 | import org.apache.spark.mllib.tree.GradientBoostedTrees 33 | import org.apache.spark.mllib.tree.configuration.BoostingStrategy 34 | import org.apache.spark.mllib.tree.model.{DecisionTreeModel, GradientBoostedTreesModel} 35 | import org.apache.spark.mllib.util.MLUtils 36 | import java.util.Map 37 | import org.apache.spark.mllib.regression.{IsotonicRegression, IsotonicRegressionModel} 38 | 39 | /** 40 | * Created by xubo on 2016/5/23. 41 | */ 42 | object IsotonicRegression1 { 43 | def main(args: Array[String]) { 44 | val conf = new SparkConf().setMaster("local[4]").setAppName(this.getClass().getSimpleName().filter(!_.equals('$'))) 45 | val sc = new SparkContext(conf) 46 | 47 | // Load and parse the data file. 48 | // val data = MLUtils.loadLibSVMFile(sc, "file/data/mllib/input/classification/dt.txt") 49 | 50 | val data = sc.textFile("file/data/mllib/input/classification/sample_isotonic_regression_data.txt") 51 | 52 | // Create label, feature, weight tuples from input data with weight set to default value 1.0. 53 | val parsedData = data.map { line => 54 | val parts = line.split(',').map(_.toDouble) 55 | (parts(0), parts(1), 1.0) 56 | } 57 | 58 | // Split data into training (60%) and test (40%) sets. 59 | val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L) 60 | val training = splits(0) 61 | val test = splits(1) 62 | 63 | // Create isotonic regression model from training data. 64 | // Isotonic parameter defaults to true so it is only shown for demonstration 65 | val model = new IsotonicRegression().setIsotonic(true).run(training) 66 | 67 | // Create tuples of predicted and real labels. 68 | val predictionAndLabel = test.map { point => 69 | val predictedLabel = model.predict(point._2) 70 | (predictedLabel, point._1) 71 | } 72 | 73 | // Calculate mean squared error between predicted and real labels. 
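    // 注意:后面的 println(model.boundaries) 直接打印 Array[Double],只会输出类似 [D@7dd9d603 的对象引用,
    // 想查看具体边界值可以改用 model.boundaries.mkString(",") 或 model.boundaries.foreach(println)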
74 | val meanSquaredError = predictionAndLabel.map { case (p, l) => math.pow((p - l), 2) }.mean() 75 | println("Mean Squared Error = " + meanSquaredError) 76 | println("data.count:" + data.count()) 77 | println("trainingData.count:" + training.count()) 78 | println("testData.count:" + test.count()) 79 | println(model.boundaries) 80 | println(model.isotonic) 81 | model.predictions.take(10).foreach(println) 82 | println("predictionAndLabel") 83 | predictionAndLabel.take(10).foreach(println) 84 | 85 | 86 | // Save and load model 87 | val iString = new SimpleDateFormat("yyyyMMddHHmmssSSS").format(new Date()) 88 | val path = "file/data/mllib/output/classification/IsotonicRegressionModel" + iString + "/result" 89 | model.save(sc, path) 90 | val sameModel = IsotonicRegressionModel.load(sc, path) 91 | println(sameModel.isotonic) 92 | 93 | sc.stop 94 | } 95 | } 96 | 97 | ``` 98 | 99 | 3.结果: 100 | 101 | ``` 102 | Mean Squared Error = 0.004883368896285485 103 | data.count:100 104 | trainingData.count:64 105 | testData.count:36 106 | [D@7dd9d603 107 | true 108 | 0.1739693246153848 109 | 0.1739693246153848 110 | 0.196430394 111 | 0.196430394 112 | 0.20040796 113 | 0.29576747 114 | 0.51300357 115 | 0.51300357 116 | 0.5566037736363637 117 | 0.5566037736363637 118 | predictionAndLabel 119 | (0.1739693246153848,0.03926568) 120 | (0.1739693246153848,0.12952575) 121 | (0.1739693246153848,0.08873024) 122 | (0.18519985930769242,0.15247323) 123 | (0.196430394,0.19581846) 124 | (0.196430394,0.13717491) 125 | (0.196430394,0.19020908) 126 | (0.196430394,0.2009179) 127 | (0.198419177,0.18510964) 128 | (0.40438552,0.43396226) 129 | SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". 130 | SLF4J: Defaulting to no-operation (NOP) logger implementation 131 | SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. 
132 | 2016-05-25 16:58:04 WARN ParquetRecordReader:193 - Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl 133 | 2016-05-25 16:58:04 WARN ParquetRecordReader:193 - Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl 134 | 2016-05-25 16:58:04 WARN ParquetRecordReader:193 - Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl 135 | 2016-05-25 16:58:04 WARN ParquetRecordReader:193 - Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl 136 | true 137 | ``` 138 | 139 | 参考 140 | 141 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 142 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 143 | 【3】https://github.com/xubo245/SparkLearning 144 | 【4】http://blog.csdn.net/fsz521/article/details/7706250 -------------------------------------------------------------------------------- /7特征提取和转换/Spark中组件Mllib的学习54之word2Vec实例分析(text8数据集).md: -------------------------------------------------------------------------------- 1 | 2 | 更多代码请见:https://github.com/xubo245/SparkLearning 3 | 4 | Spark中组件Mllib的学习 5 | 6 | 1.解释 7 | 8 | text8数据集下载:http://mattmahoney.net/dc/text8.zip,没有上传到github,主要是由于大于50M,上传不了。 9 | 10 | 需要解压上传到hdfs 11 | 12 | 对text8数据集进行训练,然后查找与china相关的40个单词和相似度 13 | 14 | 使用的是余弦相似度 15 | 16 | 17 | 2.代码: 18 | 19 | package org.apache.spark.mllib.FeatureExtractionAndTransformation 20 | 21 | /** 22 | * Created by xubo on 2016/6/13. 23 | */ 24 | object Word2VecSparkWeb { 25 | def main(args: Array[String]) { 26 | import org.apache.spark._ 27 | import org.apache.spark.rdd._ 28 | import org.apache.spark.SparkContext._ 29 | import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel} 30 | val conf = new SparkConf() 31 | .setAppName("Word2VecSparkWeb") 32 | // println("start sc") 33 | val sc = new SparkContext(conf) 34 | // "file/data/mllib/input/FeatureExtractionAndTransformation/text8" 35 | val input = sc.textFile(args(0)).map(line => line.split(" ").toSeq) 36 | //java.lang.OutOfMemoryError: Java heap space 37 | 38 | 39 | // val input = sc.textFile("file/data/mllib/input/FeatureExtractionAndTransformation/a.txt").map(line => line.split(" ").toSeq) 40 | 41 | val word2vec = new Word2Vec() 42 | 43 | val model = word2vec.fit(input) 44 | 45 | val synonyms = model.findSynonyms("china", 40) 46 | // val synonyms = model.findSynonyms("hello", 2) 47 | // val synonyms = model.findSynonyms("hell", 2) 48 | println("synonyms:" + synonyms.length) 49 | for ((synonym, cosineSimilarity) <- synonyms) { 50 | println(s"$synonym $cosineSimilarity") 51 | } 52 | } 53 | } 54 | 55 | 脚本: 56 | 57 | hadoop@Master:~/xubo/project/sparkLearning/Word2VecSparkWeb$ cat run.sh 58 | #!/usr/bin/env bash 59 | spark-submit \ 60 | --class org.apache.spark.mllib.FeatureExtractionAndTransformation.Word2VecSparkWeb \ 61 | --master spark://Master:7077 \ 62 | --executor-memory 4096M \ 63 | --total-executor-cores 20 SparkLearning.jar /xubo/project/sparkLearning/text8 64 | 65 | 66 | 67 | 3.结果: 68 | 第一次运行 69 | 70 | synonyms:40 71 | taiwan 2.0250848910722588 72 | korea 1.8783838604188001 73 | japan 1.8418373325670603 74 | mongolia 1.6881217861875888 75 | thailand 1.6622166551684234 76 | republic 1.6286308610644606 77 | manchuria 1.6185821262551892 78 | 
kyrgyzstan 1.6155559907230572 79 | taiwan 2.0250848910722588 80 | korea 1.8783838604188001 81 | japan 1.8418373325670603 82 | mongolia 1.6881217861875888 83 | thailand 1.6622166551684234 84 | republic 1.6286308610644606 85 | manchuria 1.6185821262551892 86 | kyrgyzstan 1.6155559907230572 87 | laos 1.6103165736195577 88 | tibet 1.5989105922525122 89 | kazakhstan 1.5744151601314242 90 | singapore 1.5616986094026124 91 | macau 1.5499675794102241 92 | mainland 1.5375663678873703 93 | malaysia 1.5285559184299211 94 | tajikistan 1.5243343371990146 95 | india 1.5165506076453936 96 | nepal 1.5119063076061532 97 | pakistan 1.5024014083777038 98 | macedonia 1.5019503598037696 99 | russia 1.4935935877285467 100 | manchukuo 1.4881581055559592 101 | myanmar 1.4821476909912992 102 | indonesia 1.4793831566122821 103 | liberia 1.463797338924459 104 | xinjiang 1.4609436920718337 105 | philippines 1.4547708371463373 106 | shanghai 1.4503251746969463 107 | latvia 1.4386811130949109 108 | shenzhen 1.4199746865615956 109 | vietnam 1.418931441623602 110 | changsha 1.418418516788373 111 | 112 | 第二次运行: 113 | 114 | hadoop@Master:~/xubo/project/sparkLearning/Word2VecSparkWeb$ ./run.sh 115 | synonyms:40 116 | taiwan 1.9833786891085838 117 | korea 1.8726567347414271 118 | japan 1.7783736448331358 119 | republic 1.7004898528298036 120 | thailand 1.6917626667336083 121 | tibet 1.6878122461434133 122 | mongolia 1.652209839095614 123 | kyrgyzstan 1.645476213591011 124 | manchuria 1.6096494198211908 125 | nepal 1.6029630877195205 126 | singapore 1.5831389923108918 127 | xinjiang 1.5792116676867995 128 | guangdong 1.578964448792793 129 | laos 1.5787364724446695 130 | macau 1.5749300509413349 131 | indonesia 1.5711485054771392 132 | india 1.5706135472342697 133 | malaysia 1.5674786938684857 134 | shanghai 1.5370738084879059 135 | malaya 1.5315005636344519 136 | philippines 1.5288921196216254 137 | yuan 1.5130452356753659 138 | pakistan 1.498783617851528 139 | mainland 1.4975791691563867 140 | kazakhstan 1.4828377324602193 141 | guangzhou 1.479015936080569 142 | cambodia 1.4727652499197696 143 | tajikistan 1.469555846355169 144 | russia 1.4676529059005547 145 | uzbekistan 1.4619275437713692 146 | 147 | 第三次运行: 148 | 149 | hadoop@Master:~/xubo/project/sparkLearning/Word2VecSparkWeb$ ./run.sh 150 | synonyms:40 151 | taiwan 1.9153186898933967 152 | japan 1.79181381373875 153 | korea 1.775808989075448 154 | mongolia 1.7218100800986855 155 | thailand 1.7082852803611082 156 | indonesia 1.679178757980267 157 | malaysia 1.6728186293077718 158 | pakistan 1.6419833973021383 159 | india 1.6394439184980092 160 | laos 1.632823427649131 161 | kazakhstan 1.6233583074612854 162 | manchuria 1.6154866307442364 163 | republic 1.605084908540867 164 | nepal 1.5889757766337764 165 | tibet 1.5553698521689054 166 | mainland 1.5422192430156836 167 | cambodia 1.5393252900985754 168 | myanmar 1.5363584360594595 169 | kyrgyzstan 1.531264076283158 170 | singapore 1.5305132940994244 171 | philippines 1.523034664556905 172 | macau 1.5160013609117333 173 | xinjiang 1.4825373292002604 174 | latvia 1.471622519733523 175 | kenya 1.4696299318908457 176 | changsha 1.4664040946553276 177 | shanghai 1.455466110605061 178 | malaya 1.4548293900077052 179 | burma 1.4509943221704922 180 | ingushetia 1.4487999900318091 181 | 182 | 183 | 参考 184 | 185 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 186 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 187 | 【3】https://github.com/xubo245/SparkLearning 188 | 【4】book:Machine Learning with Spark ,Nick 
Pertreach 189 | 【5】book:Spark MlLib机器学习实战 190 | -------------------------------------------------------------------------------- /10PMML模型输出/Spark中组件Mllib的学习74之预言模型标记语言PMML.md: -------------------------------------------------------------------------------- 1 | 2 | 更多代码请见:https://github.com/xubo245/SparkLearning 3 | 4 | Spark中组件Mllib的学习 5 | 6 | 1.解释 7 | 8 | 全称预言模型标记语言(Predictive Model Markup Language),利用XML描述和存储数据挖掘模型,是一个已经被W3C所接受的标准。MML是一种基于XML的语言,用来定义预言模型。 9 | 10 | 11 | 12 | 2.代码: 13 | 14 | /** 15 | * @author xubo 16 | * ref:http://spark.apache.org/docs/1.5.2/mllib-guide.html 17 | * more code:https://github.com/xubo245/SparkLearning 18 | * more blog:http://blog.csdn.net/xubo245 19 | */ 20 | package org.apache.spark.mllib.EvaluationMetrics 21 | 22 | import org.apache.spark.util.SparkLearningFunSuite 23 | 24 | /** 25 | * Created by xubo on 2016/6/13. 26 | */ 27 | class RegressionModelEvaluationFunSuite extends SparkLearningFunSuite { 28 | test("testFunSuite") { 29 | 30 | 31 | import org.apache.spark.mllib.regression.LabeledPoint 32 | import org.apache.spark.mllib.regression.LinearRegressionModel 33 | import org.apache.spark.mllib.regression.LinearRegressionWithSGD 34 | import org.apache.spark.mllib.linalg.Vectors 35 | import org.apache.spark.mllib.evaluation.RegressionMetrics 36 | import org.apache.spark.mllib.util.MLUtils 37 | 38 | // Load the data 39 | val data = MLUtils.loadLibSVMFile(sc, "file/data/mllib/input/mllibFromSpark/sample_linear_regression_data.txt").cache() 40 | 41 | // Build the model 42 | val numIterations = 100 43 | val model = LinearRegressionWithSGD.train(data, numIterations) 44 | 45 | // Get predictions 46 | val valuesAndPreds = data.map{ point => 47 | val prediction = model.predict(point.features) 48 | (prediction, point.label) 49 | } 50 | 51 | // Instantiate metrics object 52 | val metrics = new RegressionMetrics(valuesAndPreds) 53 | 54 | // Squared error 55 | println(s"MSE = ${metrics.meanSquaredError}") 56 | println(s"RMSE = ${metrics.rootMeanSquaredError}") 57 | 58 | // R-squared 59 | println(s"R-squared = ${metrics.r2}") 60 | 61 | // Mean absolute error 62 | println(s"MAE = ${metrics.meanAbsoluteError}") 63 | 64 | // Explained variance 65 | println(s"Explained variance = ${metrics.explainedVariance}") 66 | 67 | 68 | } 69 | } 70 | 71 | 72 | 73 | 3.结果: 74 | 75 | PMML Model: 76 | 77 | 78 |
79 | (原输出此处为两段 PMML 格式的 XML,标签在转存时丢失,仅保留可恢复的生成时间与聚类中心数值) 80 | 2016-06-14T21:20:36 100 | 9.099999999999998 9.099999999999998 9.099999999999998 103 | 0.1 0.1 0.1 112 | 2016-06-14T21:20:40 132 | 9.099999999999998 9.099999999999998 9.099999999999998 135 | 0.1 0.1 0.1 138 |
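结合下面 analysis 部分打印的 clusterCenters,可以看出上面这段 PMML 描述的是一个含两个聚类中心的聚类模型。这里补一个最小示意,说明这类 PMML 输出通常如何得到:MLlib 中实现了 PMMLExportable 的模型(如 KMeansModel 和部分线性模型)自带 toPMML 接口。以下代码仅为示意,数据文件和保存路径都是假设值,并非上文原代码。

```
    // 最小示意:训练 KMeans 并导出 PMML(数据文件与保存路径均为假设值)
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // 假设 sc 已按前文方式创建;输入每行为空格分隔的特征值
    val data = sc.textFile("file/data/mllib/input/mllibFromSpark/kmeans_data.txt")
    val parsedData = data.map(s => Vectors.dense(s.trim.split(' ').map(_.toDouble))).cache()

    // 训练 2 个簇、迭代 20 次的 KMeans 模型
    val clusters = KMeans.train(parsedData, 2, 20)
    clusters.clusterCenters.foreach(println)

    // 导出 PMML:toPMML() 返回 XML 字符串,toPMML(sc, path) 直接保存到指定目录
    println("PMML Model:\n" + clusters.toPMML())
    clusters.toPMML(sc, "file/data/mllib/output/pmml/KMeansPMML")
```

其中 toPMML() 返回 PMML 的 XML 字符串,toPMML(sc, path) 以文本形式保存到指定目录,保存后可用任意支持 PMML 的引擎解析。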
139 | 140 | analysis: 141 | 142 | clusters.clusterCenters.foreach(println) 143 | 144 | [9.099999999999998,9.099999999999998,9.099999999999998] 145 | [0.1,0.1,0.1] 146 | 147 | 参考 148 | 149 | 【1】http://spark.apache.org/docs/1.5.2/mllib-guide.html 150 | 【2】http://spark.apache.org/docs/1.5.2/programming-guide.html 151 | 【3】https://github.com/xubo245/SparkLearning 152 | 【4】book:Machine Learning with Spark ,Nick Pertreach 153 | 【5】book:Spark MlLib机器学习实战 154 | -------------------------------------------------------------------------------- /1数据类型/Spark中组件Mllib的学习3之用户相似度计算.md: -------------------------------------------------------------------------------- 1 | 代码: 2 | 3 | ``` 4 | /** 5 | * @author xubo 6 | * time 2016.516 7 | * ref 《Spark MlLib 机器学习实战》P64 8 | */ 9 | package org.apache.spark.mllib.learning.recommend 10 | 11 | import org.apache.spark.{SparkConf, SparkContext} 12 | 13 | import scala.collection.mutable.Map 14 | 15 | object CollaborativeFilteringSpark { 16 | val conf = new SparkConf().setMaster("local").setAppName(this.getClass().getSimpleName().filter(!_.equals('$'))) 17 | println(this.getClass().getSimpleName().filter(!_.equals('$'))) 18 | //设置环境变量 19 | val sc = new SparkContext(conf) 20 | //实例化环境 21 | val users = sc.parallelize(Array("aaa", "bbb", "ccc", "ddd", "eee")) 22 | //设置用户 23 | val films = sc.parallelize(Array("smzdm", "ylxb", "znh", "nhsc", "fcwr")) //设置电影名 24 | 25 | val source = Map[String, Map[String, Int]]() 26 | //使用一个source嵌套map作为姓名电影名和分值的存储 27 | val filmSource = Map[String, Int]() 28 | 29 | //设置一个用以存放电影分的map 30 | def getSource(): Map[String, Map[String, Int]] = { 31 | //设置电影评分 32 | val user1FilmSource = Map("smzdm" -> 2, "ylxb" -> 3, "znh" -> 1, "nhsc" -> 0, "fcwr" -> 1) 33 | val user2FilmSource = Map("smzdm" -> 1, "ylxb" -> 2, "znh" -> 2, "nhsc" -> 1, "fcwr" -> 4) 34 | val user3FilmSource = Map("smzdm" -> 2, "ylxb" -> 1, "znh" -> 0, "nhsc" -> 1, "fcwr" -> 4) 35 | val user4FilmSource = Map("smzdm" -> 3, "ylxb" -> 2, "znh" -> 0, "nhsc" -> 5, "fcwr" -> 3) 36 | val user5FilmSource = Map("smzdm" -> 5, "ylxb" -> 3, "znh" -> 1, "nhsc" -> 1, "fcwr" -> 2) 37 | source += ("aaa" -> user1FilmSource) //对人名进行存储 38 | source += ("bbb" -> user2FilmSource) //对人名进行存储 39 | source += ("ccc" -> user3FilmSource) //对人名进行存储 40 | source += ("ddd" -> user4FilmSource) //对人名进行存储 41 | source += ("eee" -> user5FilmSource) //对人名进行存储 42 | source //返回嵌套map 43 | } 44 | 45 | //两两计算分值,采用余弦相似性 46 | def getCollaborateSource(user1: String, user2: String): Double = { 47 | val user1FilmSource = source.get(user1).get.values.toVector //获得第1个用户的评分 48 | val user2FilmSource = source.get(user2).get.values.toVector //获得第2个用户的评分 49 | val member = user1FilmSource.zip(user2FilmSource).map(d => d._1 * d._2).reduce(_ + _).toDouble //对公式分子部分进行计算 50 | val temp1 = math.sqrt(user1FilmSource.map(num => { 51 | //求出分母第1个变量值 52 | math.pow(num, 2) //数学计算 53 | }).reduce(_ + _)) //进行叠加 54 | val temp2 = math.sqrt(user2FilmSource.map(num => { 55 | ////求出分母第2个变量值 56 | math.pow(num, 2) //数学计算 57 | }).reduce(_ + _)) //进行叠加 58 | val denominator = temp1 * temp2 //求出分母 59 | member / denominator //进行计算 60 | } 61 | 62 | def main(args: Array[String]) { 63 | getSource() //初始化分数 64 | var name = "bbb" //设定目标对象 65 | users.foreach(user => { 66 | //迭代进行计算 67 | println(name + " 相对于 " + user + "的相似性分数是:" + getCollaborateSource(name, user)) 68 | }) 69 | println() 70 | name = "aaa" 71 | users.foreach(user => { 72 | //迭代进行计算 73 | println(name + " 相对于 " + user + "的相似性分数是:" + getCollaborateSource(name, user)) 74 | }) 75 | } 76 | } 77 | 78 | ``` 79 | 80 | 
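getCollaborateSource 里计算的是余弦相似度:cos(u, v) = u·v / (||u|| × ||v||)。以 aaa 和 bbb 为例手工验算一组(按电影逐项相乘再求和):分子 = 2×1 + 3×2 + 1×2 + 0×1 + 1×4 = 14,分母 = √(2²+3²+1²+0²+1²) × √(1²+2²+2²+1²+4²) = √15 × √26 ≈ 19.75,相似度 ≈ 14 / 19.75 ≈ 0.7089,与下面结果中的 0.7089175569585667 一致。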
81 | 结果: 82 | 83 | ``` 84 | D:\1win7\java\jdk\bin\java -Didea.launcher.port=7534 "-Didea.launcher.bin.path=D:\1win7\idea\IntelliJ IDEA Community Edition 15.0.4\bin" -Dfile.encoding=UTF-8 -classpath "D:\all\idea\SparkLearning\bin;D:\1win7\java\jdk\jre\lib\charsets.jar;D:\1win7\java\jdk\jre\lib\deploy.jar;D:\1win7\java\jdk\jre\lib\ext\access-bridge-64.jar;D:\1win7\java\jdk\jre\lib\ext\dnsns.jar;D:\1win7\java\jdk\jre\lib\ext\jaccess.jar;D:\1win7\java\jdk\jre\lib\ext\localedata.jar;D:\1win7\java\jdk\jre\lib\ext\sunec.jar;D:\1win7\java\jdk\jre\lib\ext\sunjce_provider.jar;D:\1win7\java\jdk\jre\lib\ext\sunmscapi.jar;D:\1win7\java\jdk\jre\lib\ext\zipfs.jar;D:\1win7\java\jdk\jre\lib\javaws.jar;D:\1win7\java\jdk\jre\lib\jce.jar;D:\1win7\java\jdk\jre\lib\jfr.jar;D:\1win7\java\jdk\jre\lib\jfxrt.jar;D:\1win7\java\jdk\jre\lib\jsse.jar;D:\1win7\java\jdk\jre\lib\management-agent.jar;D:\1win7\java\jdk\jre\lib\plugin.jar;D:\1win7\java\jdk\jre\lib\resources.jar;D:\1win7\java\jdk\jre\lib\rt.jar;D:\1win7\scala;D:\1win7\scala\lib;D:\1win7\java\otherJar\spark-assembly-1.5.2-hadoop2.6.0.jar;D:\1win7\java\otherJar\adam-apis_2.10-0.18.3-SNAPSHOT.jar;D:\1win7\java\otherJar\adam-cli_2.10-0.18.3-SNAPSHOT.jar;D:\1win7\java\otherJar\adam-core_2.10-0.18.3-SNAPSHOT.jar;D:\1win7\java\otherJar\SparkCSV\com.databricks_spark-csv_2.10-1.4.0.jar;D:\1win7\java\otherJar\SparkCSV\com.univocity_univocity-parsers-1.5.1.jar;D:\1win7\java\otherJar\SparkCSV\org.apache.commons_commons-csv-1.1.jar;D:\1win7\java\otherJar\SparkAvro\spark-avro_2.10-2.0.1.jar;D:\1win7\java\otherJar\SparkAvro\spark-avro_2.10-2.0.1-javadoc.jar;D:\1win7\java\otherJar\SparkAvro\spark-avro_2.10-2.0.1-sources.jar;D:\1win7\java\otherJar\avro\spark-avro_2.10-2.0.2-SNAPSHOT.jar;D:\1win7\java\otherJar\tachyon\tachyon-assemblies-0.7.1-jar-with-dependencies.jar;D:\1win7\scala\lib\scala-actors-migration.jar;D:\1win7\scala\lib\scala-actors.jar;D:\1win7\scala\lib\scala-library.jar;D:\1win7\scala\lib\scala-reflect.jar;D:\1win7\scala\lib\scala-swing.jar;D:\1win7\idea\IntelliJ IDEA Community Edition 15.0.4\lib\idea_rt.jar" com.intellij.rt.execution.application.AppMain org.apache.spark.mllib.learning.recommend.CollaborativeFilteringSpark 85 | CollaborativeFilteringSpark 86 | SLF4J: Class path contains multiple SLF4J bindings. 87 | SLF4J: Found binding in [jar:file:/D:/1win7/java/otherJar/spark-assembly-1.5.2-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] 88 | SLF4J: Found binding in [jar:file:/D:/1win7/java/otherJar/adam-cli_2.10-0.18.3-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class] 89 | SLF4J: Found binding in [jar:file:/D:/1win7/java/otherJar/tachyon/tachyon-assemblies-0.7.1-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class] 90 | SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. 91 | SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] 92 | 2016-05-16 20:57:50 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 93 | 2016-05-16 20:57:52 WARN MetricsSystem:71 - Using default name DAGScheduler for source because spark.app.id is not set. 
94 | bbb 相对于 aaa的相似性分数是:0.7089175569585667 95 | bbb 相对于 bbb的相似性分数是:1.0000000000000002 96 | bbb 相对于 ccc的相似性分数是:0.8780541105074453 97 | bbb 相对于 ddd的相似性分数是:0.6865554812287477 98 | bbb 相对于 eee的相似性分数是:0.6821910402406466 99 | 100 | aaa 相对于 aaa的相似性分数是:0.9999999999999999 101 | aaa 相对于 bbb的相似性分数是:0.7089175569585667 102 | aaa 相对于 ccc的相似性分数是:0.6055300708194983 103 | aaa 相对于 ddd的相似性分数是:0.564932682866032 104 | aaa 相对于 eee的相似性分数是:0.8981462390204985 105 | 106 | Process finished with exit code 0 107 | 108 | ``` -------------------------------------------------------------------------------- /2基本统计/Spark中组件Mllib的学习3之用户相似度计算.md: -------------------------------------------------------------------------------- 1 | 代码: 2 | 3 | ``` 4 | /** 5 | * @author xubo 6 | * time 2016.516 7 | * ref 《Spark MlLib 机器学习实战》P64 8 | */ 9 | package org.apache.spark.mllib.learning.recommend 10 | 11 | import org.apache.spark.{SparkConf, SparkContext} 12 | 13 | import scala.collection.mutable.Map 14 | 15 | object CollaborativeFilteringSpark { 16 | val conf = new SparkConf().setMaster("local").setAppName(this.getClass().getSimpleName().filter(!_.equals('$'))) 17 | println(this.getClass().getSimpleName().filter(!_.equals('$'))) 18 | //设置环境变量 19 | val sc = new SparkContext(conf) 20 | //实例化环境 21 | val users = sc.parallelize(Array("aaa", "bbb", "ccc", "ddd", "eee")) 22 | //设置用户 23 | val films = sc.parallelize(Array("smzdm", "ylxb", "znh", "nhsc", "fcwr")) //设置电影名 24 | 25 | val source = Map[String, Map[String, Int]]() 26 | //使用一个source嵌套map作为姓名电影名和分值的存储 27 | val filmSource = Map[String, Int]() 28 | 29 | //设置一个用以存放电影分的map 30 | def getSource(): Map[String, Map[String, Int]] = { 31 | //设置电影评分 32 | val user1FilmSource = Map("smzdm" -> 2, "ylxb" -> 3, "znh" -> 1, "nhsc" -> 0, "fcwr" -> 1) 33 | val user2FilmSource = Map("smzdm" -> 1, "ylxb" -> 2, "znh" -> 2, "nhsc" -> 1, "fcwr" -> 4) 34 | val user3FilmSource = Map("smzdm" -> 2, "ylxb" -> 1, "znh" -> 0, "nhsc" -> 1, "fcwr" -> 4) 35 | val user4FilmSource = Map("smzdm" -> 3, "ylxb" -> 2, "znh" -> 0, "nhsc" -> 5, "fcwr" -> 3) 36 | val user5FilmSource = Map("smzdm" -> 5, "ylxb" -> 3, "znh" -> 1, "nhsc" -> 1, "fcwr" -> 2) 37 | source += ("aaa" -> user1FilmSource) //对人名进行存储 38 | source += ("bbb" -> user2FilmSource) //对人名进行存储 39 | source += ("ccc" -> user3FilmSource) //对人名进行存储 40 | source += ("ddd" -> user4FilmSource) //对人名进行存储 41 | source += ("eee" -> user5FilmSource) //对人名进行存储 42 | source //返回嵌套map 43 | } 44 | 45 | //两两计算分值,采用余弦相似性 46 | def getCollaborateSource(user1: String, user2: String): Double = { 47 | val user1FilmSource = source.get(user1).get.values.toVector //获得第1个用户的评分 48 | val user2FilmSource = source.get(user2).get.values.toVector //获得第2个用户的评分 49 | val member = user1FilmSource.zip(user2FilmSource).map(d => d._1 * d._2).reduce(_ + _).toDouble //对公式分子部分进行计算 50 | val temp1 = math.sqrt(user1FilmSource.map(num => { 51 | //求出分母第1个变量值 52 | math.pow(num, 2) //数学计算 53 | }).reduce(_ + _)) //进行叠加 54 | val temp2 = math.sqrt(user2FilmSource.map(num => { 55 | ////求出分母第2个变量值 56 | math.pow(num, 2) //数学计算 57 | }).reduce(_ + _)) //进行叠加 58 | val denominator = temp1 * temp2 //求出分母 59 | member / denominator //进行计算 60 | } 61 | 62 | def main(args: Array[String]) { 63 | getSource() //初始化分数 64 | var name = "bbb" //设定目标对象 65 | users.foreach(user => { 66 | //迭代进行计算 67 | println(name + " 相对于 " + user + "的相似性分数是:" + getCollaborateSource(name, user)) 68 | }) 69 | println() 70 | name = "aaa" 71 | users.foreach(user => { 72 | //迭代进行计算 73 | println(name + " 相对于 " + user + "的相似性分数是:" + getCollaborateSource(name, user)) 74 
| }) 75 | } 76 | } 77 | 78 | ``` 79 | 80 | 81 | 结果: 82 | 83 | ``` 84 | D:\1win7\java\jdk\bin\java -Didea.launcher.port=7534 "-Didea.launcher.bin.path=D:\1win7\idea\IntelliJ IDEA Community Edition 15.0.4\bin" -Dfile.encoding=UTF-8 -classpath "D:\all\idea\SparkLearning\bin;D:\1win7\java\jdk\jre\lib\charsets.jar;D:\1win7\java\jdk\jre\lib\deploy.jar;D:\1win7\java\jdk\jre\lib\ext\access-bridge-64.jar;D:\1win7\java\jdk\jre\lib\ext\dnsns.jar;D:\1win7\java\jdk\jre\lib\ext\jaccess.jar;D:\1win7\java\jdk\jre\lib\ext\localedata.jar;D:\1win7\java\jdk\jre\lib\ext\sunec.jar;D:\1win7\java\jdk\jre\lib\ext\sunjce_provider.jar;D:\1win7\java\jdk\jre\lib\ext\sunmscapi.jar;D:\1win7\java\jdk\jre\lib\ext\zipfs.jar;D:\1win7\java\jdk\jre\lib\javaws.jar;D:\1win7\java\jdk\jre\lib\jce.jar;D:\1win7\java\jdk\jre\lib\jfr.jar;D:\1win7\java\jdk\jre\lib\jfxrt.jar;D:\1win7\java\jdk\jre\lib\jsse.jar;D:\1win7\java\jdk\jre\lib\management-agent.jar;D:\1win7\java\jdk\jre\lib\plugin.jar;D:\1win7\java\jdk\jre\lib\resources.jar;D:\1win7\java\jdk\jre\lib\rt.jar;D:\1win7\scala;D:\1win7\scala\lib;D:\1win7\java\otherJar\spark-assembly-1.5.2-hadoop2.6.0.jar;D:\1win7\java\otherJar\adam-apis_2.10-0.18.3-SNAPSHOT.jar;D:\1win7\java\otherJar\adam-cli_2.10-0.18.3-SNAPSHOT.jar;D:\1win7\java\otherJar\adam-core_2.10-0.18.3-SNAPSHOT.jar;D:\1win7\java\otherJar\SparkCSV\com.databricks_spark-csv_2.10-1.4.0.jar;D:\1win7\java\otherJar\SparkCSV\com.univocity_univocity-parsers-1.5.1.jar;D:\1win7\java\otherJar\SparkCSV\org.apache.commons_commons-csv-1.1.jar;D:\1win7\java\otherJar\SparkAvro\spark-avro_2.10-2.0.1.jar;D:\1win7\java\otherJar\SparkAvro\spark-avro_2.10-2.0.1-javadoc.jar;D:\1win7\java\otherJar\SparkAvro\spark-avro_2.10-2.0.1-sources.jar;D:\1win7\java\otherJar\avro\spark-avro_2.10-2.0.2-SNAPSHOT.jar;D:\1win7\java\otherJar\tachyon\tachyon-assemblies-0.7.1-jar-with-dependencies.jar;D:\1win7\scala\lib\scala-actors-migration.jar;D:\1win7\scala\lib\scala-actors.jar;D:\1win7\scala\lib\scala-library.jar;D:\1win7\scala\lib\scala-reflect.jar;D:\1win7\scala\lib\scala-swing.jar;D:\1win7\idea\IntelliJ IDEA Community Edition 15.0.4\lib\idea_rt.jar" com.intellij.rt.execution.application.AppMain org.apache.spark.mllib.learning.recommend.CollaborativeFilteringSpark 85 | CollaborativeFilteringSpark 86 | SLF4J: Class path contains multiple SLF4J bindings. 87 | SLF4J: Found binding in [jar:file:/D:/1win7/java/otherJar/spark-assembly-1.5.2-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] 88 | SLF4J: Found binding in [jar:file:/D:/1win7/java/otherJar/adam-cli_2.10-0.18.3-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class] 89 | SLF4J: Found binding in [jar:file:/D:/1win7/java/otherJar/tachyon/tachyon-assemblies-0.7.1-jar-with-dependencies.jar!/org/slf4j/impl/StaticLoggerBinder.class] 90 | SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. 91 | SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] 92 | 2016-05-16 20:57:50 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 93 | 2016-05-16 20:57:52 WARN MetricsSystem:71 - Using default name DAGScheduler for source because spark.app.id is not set. 
94 | bbb 相对于 aaa的相似性分数是:0.7089175569585667 95 | bbb 相对于 bbb的相似性分数是:1.0000000000000002 96 | bbb 相对于 ccc的相似性分数是:0.8780541105074453 97 | bbb 相对于 ddd的相似性分数是:0.6865554812287477 98 | bbb 相对于 eee的相似性分数是:0.6821910402406466 99 | 100 | aaa 相对于 aaa的相似性分数是:0.9999999999999999 101 | aaa 相对于 bbb的相似性分数是:0.7089175569585667 102 | aaa 相对于 ccc的相似性分数是:0.6055300708194983 103 | aaa 相对于 ddd的相似性分数是:0.564932682866032 104 | aaa 相对于 eee的相似性分数是:0.8981462390204985 105 | 106 | Process finished with exit code 0 107 | 108 | ``` --------------------------------------------------------------------------------