├── imgs ├── 1.1.png ├── 1.2.png ├── 1.3.png ├── 1.4.png ├── 1.5.png ├── 1.6.png ├── 2.1.png ├── 2.2.png ├── 2.3.png ├── 2.4.png ├── 3.1.png ├── 4.1.png ├── 4.2.png ├── 5.1.png └── 5.2.png ├── docs ├── graphx.pdf └── pregel-a_system_for_large-scale_graph_processing.pdf ├── operators ├── readme.md ├── cache.md ├── join.md ├── transformation.md ├── structure.md └── aggregate.md ├── SUMMARY.md ├── README.md ├── graphAlgorithm ├── BFS.md ├── shortest_path.md ├── ConnectedComponents.md ├── TriangleCounting.md └── PageRank.md ├── vertex-edge-triple.md ├── pregel-api.md ├── graphx-introduce.md ├── vertex-cut.md ├── parallel-graph-system.md └── .idea └── uiDesigner.xml /imgs/1.1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/endymecy/spark-graphx-source-analysis/HEAD/imgs/1.1.png -------------------------------------------------------------------------------- /imgs/1.2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/endymecy/spark-graphx-source-analysis/HEAD/imgs/1.2.png -------------------------------------------------------------------------------- /imgs/1.3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/endymecy/spark-graphx-source-analysis/HEAD/imgs/1.3.png -------------------------------------------------------------------------------- /imgs/1.4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/endymecy/spark-graphx-source-analysis/HEAD/imgs/1.4.png -------------------------------------------------------------------------------- /imgs/1.5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/endymecy/spark-graphx-source-analysis/HEAD/imgs/1.5.png -------------------------------------------------------------------------------- /imgs/1.6.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/endymecy/spark-graphx-source-analysis/HEAD/imgs/1.6.png -------------------------------------------------------------------------------- /imgs/2.1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/endymecy/spark-graphx-source-analysis/HEAD/imgs/2.1.png -------------------------------------------------------------------------------- /imgs/2.2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/endymecy/spark-graphx-source-analysis/HEAD/imgs/2.2.png -------------------------------------------------------------------------------- /imgs/2.3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/endymecy/spark-graphx-source-analysis/HEAD/imgs/2.3.png -------------------------------------------------------------------------------- /imgs/2.4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/endymecy/spark-graphx-source-analysis/HEAD/imgs/2.4.png -------------------------------------------------------------------------------- /imgs/3.1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/endymecy/spark-graphx-source-analysis/HEAD/imgs/3.1.png 
-------------------------------------------------------------------------------- /imgs/4.1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/endymecy/spark-graphx-source-analysis/HEAD/imgs/4.1.png -------------------------------------------------------------------------------- /imgs/4.2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/endymecy/spark-graphx-source-analysis/HEAD/imgs/4.2.png -------------------------------------------------------------------------------- /imgs/5.1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/endymecy/spark-graphx-source-analysis/HEAD/imgs/5.1.png -------------------------------------------------------------------------------- /imgs/5.2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/endymecy/spark-graphx-source-analysis/HEAD/imgs/5.2.png -------------------------------------------------------------------------------- /docs/graphx.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/endymecy/spark-graphx-source-analysis/HEAD/docs/graphx.pdf -------------------------------------------------------------------------------- /operators/readme.md: -------------------------------------------------------------------------------- 1 | # GraphX的图运算操作 2 | 3 | * [转换操作](transformation.md) 4 | * [结构操作](structure.md) 5 | * [关联操作](join.md) 6 | * [聚合操作](aggregate.md) 7 | * [缓存操作](cache.md) -------------------------------------------------------------------------------- /docs/pregel-a_system_for_large-scale_graph_processing.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/endymecy/spark-graphx-source-analysis/HEAD/docs/pregel-a_system_for_large-scale_graph_processing.pdf -------------------------------------------------------------------------------- /SUMMARY.md: -------------------------------------------------------------------------------- 1 | * [分布式图计算](parallel-graph-system.md) 2 | * [GraphX简介](graphx-introduce.md) 3 | * [GraphX点切分存储](vertex-cut.md) 4 | * [vertices、edges和triplets](vertex-edge-triple.md) 5 | * [图的构建](build-graph.md) 6 | * [GraphX的图运算操作](operators/readme.md) 7 | * [转换操作](operators/transformation.md) 8 | * [结构操作](operators/structure.md) 9 | * [关联操作](operators/join.md) 10 | * [聚合操作](operators/aggregate.md) 11 | * [缓存操作](operators/cache.md) 12 | * [GraphX Pregel API](pregel-api.md) 13 | * [图算法实现] 14 | * [宽度优先遍历](graphAlgorithm/BFS.md) 15 | * [单源最短路径](graphAlgorithm/shortest_path.md) 16 | * [连通组件](graphAlgorithm/ConnectedComponents.md) 17 | * [三角计数](graphAlgorithm/TriangleCounting.md) 18 | * [PageRank](graphAlgorithm/PageRank.md) -------------------------------------------------------------------------------- /operators/cache.md: -------------------------------------------------------------------------------- 1 | # 缓存操作 2 | 3 |   在`Spark`中,`RDD`默认是不缓存的。为了避免重复计算,当需要多次利用它们时,我们必须显示地缓存它们。`GraphX`中的图也有相同的方式。当利用到图多次时,确保首先访问`Graph.cache()`方法。 4 | 5 |   在迭代计算中,为了获得最佳的性能,不缓存可能是必须的。默认情况下,缓存的`RDD`和图会一直保留在内存中直到因为内存压力迫使它们以`LRU`的顺序删除。对于迭代计算,先前的迭代的中间结果将填充到缓存 6 | 中。虽然它们最终会被删除,但是保存在内存中的不需要的数据将会减慢垃圾回收。只有中间结果不需要,不缓存它们是更高效的。然而,因为图是由多个`RDD`组成的,正确的不持久化它们是困难的。对于迭代计算,我们建议使用`Pregel API`,它可以正确的不持久化中间结果。 7 | 8 |   
`GraphX`中的缓存操作有`cache`,`persist`,`unpersist`和`unpersistVertices`。它们的接口分别是: 9 | 10 | ```scala 11 | def persist(newLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED] 12 | def cache(): Graph[VD, ED] 13 | def unpersist(blocking: Boolean = true): Graph[VD, ED] 14 | def unpersistVertices(blocking: Boolean = true): Graph[VD, ED] 15 | ``` -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # `Spark GraphX`源码分析 2 | 3 |   `Spark GraphX`是一个新的`Spark API`,它用于图和分布式图(`graph-parallel`)的计算。`GraphX` 综合了 `Pregel` 和 `GraphLab` 两者的优点,即接口相对简单,又保证性能,可以应对点分割的图存储模式,胜任符合幂律分布的自然图的大型计算。 4 | 本专题会详细介绍`GraphX`的实现原理,并对`GraphX`的存储结构以及部分操作作详细分析。 5 | 6 |   本专题介绍的内容如下: 7 | 8 | ## 目录 9 | 10 | * [分布式图计算](parallel-graph-system.md) 11 | * [GraphX简介](graphx-introduce.md) 12 | * [GraphX点切分存储](vertex-cut.md) 13 | * [vertices、edges和triplets](vertex-edge-triple.md) 14 | * [图的构建](build-graph.md) 15 | * [GraphX的图运算操作](operators/readme.md) 16 | * [转换操作](operators/transformation.md) 17 | * [结构操作](operators/structure.md) 18 | * [关联操作](operators/join.md) 19 | * [聚合操作](operators/aggregate.md) 20 | * [缓存操作](operators/cache.md) 21 | * [GraphX Pregel API](pregel-api.md) 22 | * [图算法实现] 23 | * [宽度优先遍历](graphAlgorithm/BFS.md) 24 | * [单源最短路径](graphAlgorithm/shortest_path.md) 25 | * [连通组件](graphAlgorithm/ConnectedComponents.md) 26 | * [三角计数](graphAlgorithm/TriangleCounting.md) 27 | * [PageRank](graphAlgorithm/PageRank.md) 28 | -------------------------------------------------------------------------------- /graphAlgorithm/BFS.md: -------------------------------------------------------------------------------- 1 | # 广度优先遍历 2 | 3 | ```scala 4 | val graph = GraphLoader.edgeListFile(sc, "graphx/data/test_graph.txt") 5 | 6 | val root: VertexId = 1 7 | val initialGraph = graph.mapVertices((id, _) => if (id == root) 0.0 else 8 | Double.PositiveInfinity) 9 | 10 | val vprog = { (id: VertexId, attr: Double, msg: Double) => math.min(attr,msg) } 11 | 12 | val sendMessage = { (triplet: EdgeTriplet[Double, Int]) => 13 | var iter:Iterator[(VertexId, Double)] = Iterator.empty 14 | val isSrcMarked = triplet.srcAttr != Double.PositiveInfinity 15 | val isDstMarked = triplet.dstAttr != Double.PositiveInfinity 16 | if(!(isSrcMarked && isDstMarked)){ 17 | if(isSrcMarked){ 18 | iter = Iterator((triplet.dstId,triplet.srcAttr+1)) 19 | }else{ 20 | iter = Iterator((triplet.srcId,triplet.dstAttr+1)) 21 | } 22 | } 23 | iter 24 | } 25 | 26 | val reduceMessage = { (a: Double, b: Double) => math.min(a,b) } 27 | 28 | val bfs = initialGraph.pregel(Double.PositiveInfinity, 20)(vprog, sendMessage, reduceMessage) 29 | 30 | println(bfs.vertices.collect.mkString("\n")) 31 | ``` -------------------------------------------------------------------------------- /graphAlgorithm/shortest_path.md: -------------------------------------------------------------------------------- 1 | # 单源最短路径 2 | 3 | ```scala 4 | import scala.reflect.ClassTag 5 | 6 | import org.apache.spark.graphx._ 7 | 8 | /** 9 | * Computes shortest paths to the given set of landmark vertices, returning a graph where each 10 | * vertex attribute is a map containing the shortest-path distance to each reachable landmark. 11 | */ 12 | object ShortestPaths { 13 | /** Stores a map from the vertex id of a landmark to the distance to that landmark. 
*/ 14 | type SPMap = Map[VertexId, Int] 15 | 16 | private def makeMap(x: (VertexId, Int)*) = Map(x: _*) 17 | 18 | private def incrementMap(spmap: SPMap): SPMap = spmap.map { case (v, d) => v -> (d + 1) } 19 | 20 | private def addMaps(spmap1: SPMap, spmap2: SPMap): SPMap = 21 | (spmap1.keySet ++ spmap2.keySet).map { 22 | k => k -> math.min(spmap1.getOrElse(k, Int.MaxValue), spmap2.getOrElse(k, Int.MaxValue)) 23 | }.toMap 24 | 25 | /** 26 | * Computes shortest paths to the given set of landmark vertices. 27 | * 28 | * @tparam ED the edge attribute type (not used in the computation) 29 | * 30 | * @param graph the graph for which to compute the shortest paths 31 | * @param landmarks the list of landmark vertex ids. Shortest paths will be computed to each 32 | * landmark. 33 | * 34 | * @return a graph where each vertex attribute is a map containing the shortest-path distance to 35 | * each reachable landmark vertex. 36 | */ 37 | def run[VD, ED: ClassTag](graph: Graph[VD, ED], landmarks: Seq[VertexId]): Graph[SPMap, ED] = { 38 | val spGraph = graph.mapVertices { (vid, attr) => 39 | if (landmarks.contains(vid)) makeMap(vid -> 0) else makeMap() 40 | } 41 | 42 | val initialMessage = makeMap() 43 | 44 | def vertexProgram(id: VertexId, attr: SPMap, msg: SPMap): SPMap = { 45 | addMaps(attr, msg) 46 | } 47 | 48 | def sendMessage(edge: EdgeTriplet[SPMap, _]): Iterator[(VertexId, SPMap)] = { 49 | val newAttr = incrementMap(edge.dstAttr) 50 | if (edge.srcAttr != addMaps(newAttr, edge.srcAttr)) Iterator((edge.srcId, newAttr)) 51 | else Iterator.empty 52 | } 53 | 54 | Pregel(spGraph, initialMessage)(vertexProgram, sendMessage, addMaps) 55 | } 56 | } 57 | 58 | ``` -------------------------------------------------------------------------------- /vertex-edge-triple.md: -------------------------------------------------------------------------------- 1 | # `GraphX`中`vertices`、`edges`以及`triplets` 2 | 3 |   `vertices`、`edges`以及`triplets`是`GraphX`中三个非常重要的概念。我们在前文[GraphX介绍](graphx-introduce.md)中对这三个概念有初步的了解。 4 | 5 | ## 1 vertices 6 | 7 |   在`GraphX`中,`vertices`对应着名称为`VertexRDD`的`RDD`。这个`RDD`有顶点`id`和顶点属性两个成员变量。它的源码如下所示: 8 | 9 | ```scala 10 | abstract class VertexRDD[VD]( 11 | sc: SparkContext, 12 | deps: Seq[Dependency[_]]) extends RDD[(VertexId, VD)](sc, deps) 13 | ``` 14 |   从源码中我们可以看到,`VertexRDD`继承自`RDD[(VertexId, VD)]`,这里`VertexId`表示顶点`id`,`VD`表示顶点所带的属性的类别。这从另一个角度也说明`VertexRDD`拥有顶点`id`和顶点属性。 15 | 16 | ## 2 edges 17 | 18 |   在`GraphX`中,`edges`对应着`EdgeRDD`。这个`RDD`拥有三个成员变量,分别是源顶点`id`、目标顶点`id`以及边属性。它的源码如下所示: 19 | 20 | ```scala 21 | abstract class EdgeRDD[ED]( 22 | sc: SparkContext, 23 | deps: Seq[Dependency[_]]) extends RDD[Edge[ED]](sc, deps) 24 | ``` 25 |   从源码中我们可以看到,`EdgeRDD`继承自`RDD[Edge[ED]]`,即类型为`Edge[ED]`的`RDD`。`Edge[ED]`在后文会讲到。 26 | 27 | ## 3 triplets 28 | 29 |   在`GraphX`中,`triplets`对应着`EdgeTriplet`。它是一个三元组视图,这个视图逻辑上将顶点和边的属性保存为一个`RDD[EdgeTriplet[VD, ED]]`。可以通过下面的`Sql`表达式表示这个三元视图的含义: 30 | 31 | ```sql 32 | SELECT src.id, dst.id, src.attr, e.attr, dst.attr 33 | FROM edges AS e LEFT JOIN vertices AS src, vertices AS dst 34 | ON e.srcId = src.Id AND e.dstId = dst.Id 35 | ``` 36 |   同样,也可以通过下面图解的形式来表示它的含义: 37 | 38 |
![3.1](imgs/3.1.png)
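  在实际使用中,可以直接通过`Graph.triplets`以`RDD[EdgeTriplet[VD, ED]]`的形式访问这个三元组视图。下面给出一段简单的示例(示例中的顶点、边数据为自拟,并假设`sc`是已创建好的`SparkContext`):

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// 构造一个简单的图:顶点属性为用户名,边属性为关系类型
val users: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "alice"), (2L, "bob")))
val relations: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(1L, 2L, "follow")))
val graph: Graph[String, String] = Graph(users, relations)

// triplets视图同时暴露了边属性以及源、目的顶点的id和属性
graph.triplets.collect.foreach { t =>
  println(s"${t.srcId}(${t.srcAttr}) --${t.attr}--> ${t.dstId}(${t.dstAttr})")
}
```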
39 | 40 |   `EdgeTriplet`的源代码如下所示: 41 | 42 | ```scala 43 | class EdgeTriplet[VD, ED] extends Edge[ED] { 44 | //源顶点属性 45 | var srcAttr: VD = _ // nullValue[VD] 46 | //目标顶点属性 47 | var dstAttr: VD = _ // nullValue[VD] 48 | protected[spark] def set(other: Edge[ED]): EdgeTriplet[VD, ED] = { 49 | srcId = other.srcId 50 | dstId = other.dstId 51 | attr = other.attr 52 | this 53 | } 54 | ``` 55 |   `EdgeTriplet`类继承自`Edge`类,我们来看看这个父类: 56 | 57 | ```scala 58 | case class Edge[@specialized(Char, Int, Boolean, Byte, Long, Float, Double) ED] ( 59 | var srcId: VertexId = 0, 60 | var dstId: VertexId = 0, 61 | var attr: ED = null.asInstanceOf[ED]) 62 | extends Serializable 63 | ``` 64 |   `Edge`类中包含源顶点`id`,目标顶点`id`以及边的属性。所以从源代码中我们可以知道,`triplets`既包含了边属性也包含了源顶点的`id`和属性、目标顶点的`id`和属性。 65 | 66 | ## 4 参考文献 67 | 68 | 【1】[spark源码](https://github.com/apache/spark) 69 | 70 | 71 | 72 | -------------------------------------------------------------------------------- /graphAlgorithm/ConnectedComponents.md: -------------------------------------------------------------------------------- 1 | # 连通图 2 | 3 | ```scala 4 | 5 | import scala.reflect.ClassTag 6 | 7 | import org.apache.spark.graphx._ 8 | 9 | /** Connected components algorithm. */ 10 | object ConnectedComponents { 11 | /** 12 | * Compute the connected component membership of each vertex and return a graph with the vertex 13 | * value containing the lowest vertex id in the connected component containing that vertex. 14 | * 15 | * @tparam VD the vertex attribute type (discarded in the computation) 16 | * @tparam ED the edge attribute type (preserved in the computation) 17 | * @param graph the graph for which to compute the connected components 18 | * @param maxIterations the maximum number of iterations to run for 19 | * @return a graph with vertex attributes containing the smallest vertex in each 20 | * connected component 21 | */ 22 | def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED], 23 | maxIterations: Int): Graph[VertexId, ED] = { 24 | require(maxIterations > 0, s"Maximum of iterations must be greater than 0," + 25 | s" but got ${maxIterations}") 26 | 27 | val ccGraph = graph.mapVertices { case (vid, _) => vid } 28 | def sendMessage(edge: EdgeTriplet[VertexId, ED]): Iterator[(VertexId, VertexId)] = { 29 | if (edge.srcAttr < edge.dstAttr) { 30 | Iterator((edge.dstId, edge.srcAttr)) 31 | } else if (edge.srcAttr > edge.dstAttr) { 32 | Iterator((edge.srcId, edge.dstAttr)) 33 | } else { 34 | Iterator.empty 35 | } 36 | } 37 | val initialMessage = Long.MaxValue 38 | val pregelGraph = Pregel(ccGraph, initialMessage, 39 | maxIterations, EdgeDirection.Either)( 40 | vprog = (id, attr, msg) => math.min(attr, msg), 41 | sendMsg = sendMessage, 42 | mergeMsg = (a, b) => math.min(a, b)) 43 | ccGraph.unpersist() 44 | pregelGraph 45 | } // end of connectedComponents 46 | 47 | /** 48 | * Compute the connected component membership of each vertex and return a graph with the vertex 49 | * value containing the lowest vertex id in the connected component containing that vertex. 
50 | * 51 | * @tparam VD the vertex attribute type (discarded in the computation) 52 | * @tparam ED the edge attribute type (preserved in the computation) 53 | * @param graph the graph for which to compute the connected components 54 | * @return a graph with vertex attributes containing the smallest vertex in each 55 | * connected component 56 | */ 57 | def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): Graph[VertexId, ED] = { 58 | run(graph, Int.MaxValue) 59 | } 60 | } 61 | 62 | ``` 63 | -------------------------------------------------------------------------------- /operators/join.md: -------------------------------------------------------------------------------- 1 | # 关联操作 2 | 3 |   在许多情况下,有必要将外部数据加入到图中。例如,我们可能有额外的用户属性需要合并到已有的图中或者我们可能想从一个图中取出顶点特征加入到另外一个图中。这些任务可以用`join`操作完成。 4 | 主要的`join`操作如下所示。 5 | 6 | ```scala 7 | class Graph[VD, ED] { 8 | def joinVertices[U](table: RDD[(VertexId, U)])(map: (VertexId, VD, U) => VD) 9 | : Graph[VD, ED] 10 | def outerJoinVertices[U, VD2](table: RDD[(VertexId, U)])(map: (VertexId, VD, Option[U]) => VD2) 11 | : Graph[VD2, ED] 12 | } 13 | ``` 14 | 15 |   `joinVertices`操作`join`输入`RDD`和顶点,返回一个新的带有顶点特征的图。这些特征是通过在连接顶点的结果上使用用户定义的`map`函数获得的。没有匹配的顶点保留其原始值。 16 | 下面详细地来分析这两个函数。 17 | 18 | ## 1 joinVertices 19 | 20 | ```scala 21 | def joinVertices[U: ClassTag](table: RDD[(VertexId, U)])(mapFunc: (VertexId, VD, U) => VD) 22 | : Graph[VD, ED] = { 23 | val uf = (id: VertexId, data: VD, o: Option[U]) => { 24 | o match { 25 | case Some(u) => mapFunc(id, data, u) 26 | case None => data 27 | } 28 | } 29 | graph.outerJoinVertices(table)(uf) 30 | } 31 | ``` 32 |   我们可以看到,`joinVertices`的实现是通过`outerJoinVertices`来实现的。这是因为`join`本来就是`outer join`的一种特例。 33 | 34 | ## 2 outerJoinVertices 35 | 36 | ```scala 37 | override def outerJoinVertices[U: ClassTag, VD2: ClassTag] 38 | (other: RDD[(VertexId, U)]) 39 | (updateF: (VertexId, VD, Option[U]) => VD2) 40 | (implicit eq: VD =:= VD2 = null): Graph[VD2, ED] = { 41 | if (eq != null) { 42 | vertices.cache() 43 | // updateF preserves type, so we can use incremental replication 44 | val newVerts = vertices.leftJoin(other)(updateF).cache() 45 | val changedVerts = vertices.asInstanceOf[VertexRDD[VD2]].diff(newVerts) 46 | val newReplicatedVertexView = replicatedVertexView.asInstanceOf[ReplicatedVertexView[VD2, ED]] 47 | .updateVertices(changedVerts) 48 | new GraphImpl(newVerts, newReplicatedVertexView) 49 | } else { 50 | // updateF does not preserve type, so we must re-replicate all vertices 51 | val newVerts = vertices.leftJoin(other)(updateF) 52 | GraphImpl(newVerts, replicatedVertexView.edges) 53 | } 54 | } 55 | ``` 56 |   通过以上的代码我们可以看到,如果`updateF`不改变类型,我们只需要创建改变的顶点即可,否则我们要重新创建所有的顶点。我们讨论不改变类型的情况。 57 | 这种情况分三步。 58 | 59 | - 1 修改顶点属性值 60 | 61 | ```scala 62 | val newVerts = vertices.leftJoin(other)(updateF).cache() 63 | ``` 64 |   这一步会用顶点`RDD` `join` 传入的`RDD`,然后用`updateF`作用`joinRDD`中的所有顶点,改变它们的值。 65 | 66 | - 2 找到发生改变的顶点 67 | 68 | ```scala 69 | val changedVerts = vertices.asInstanceOf[VertexRDD[VD2]].diff(newVerts) 70 | ``` 71 | 72 | - 3 更新newReplicatedVertexView中边分区中的顶点属性 73 | 74 | ```scala 75 | val newReplicatedVertexView = replicatedVertexView.asInstanceOf[ReplicatedVertexView[VD2, ED]] 76 | .updateVertices(changedVerts) 77 | ``` 78 | 79 |   第2、3两步的源码已经在[转换操作](transformation.md)中详细介绍。 80 | 81 | 82 | -------------------------------------------------------------------------------- /graphAlgorithm/TriangleCounting.md: -------------------------------------------------------------------------------- 1 | # 三角计数 2 | 3 | ```scala 4 | import 
scala.reflect.ClassTag 5 | 6 | import org.apache.spark.graphx._ 7 | 8 | /** 9 | * Compute the number of triangles passing through each vertex. 10 | * 11 | * The algorithm is relatively straightforward and can be computed in three steps: 12 | * 13 | * 18 | * 19 | * There are two implementations. The default `TriangleCount.run` implementation first removes 20 | * self cycles and canonicalizes the graph to ensure that the following conditions hold: 21 | * 26 | * However, the canonicalization procedure is costly as it requires repartitioning the graph. 27 | * If the input data is already in "canonical form" with self cycles removed then the 28 | * `TriangleCount.runPreCanonicalized` should be used instead. 29 | * 30 | * {{{ 31 | * val canonicalGraph = graph.mapEdges(e => 1).removeSelfEdges().canonicalizeEdges() 32 | * val counts = TriangleCount.runPreCanonicalized(canonicalGraph).vertices 33 | * }}} 34 | * 35 | */ 36 | object TriangleCount { 37 | 38 | def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): Graph[Int, ED] = { 39 | // Transform the edge data something cheap to shuffle and then canonicalize 40 | val canonicalGraph = graph.mapEdges(e => true).removeSelfEdges().convertToCanonicalEdges() 41 | // Get the triangle counts 42 | val counters = runPreCanonicalized(canonicalGraph).vertices 43 | // Join them bath with the original graph 44 | graph.outerJoinVertices(counters) { (vid, _, optCounter: Option[Int]) => 45 | optCounter.getOrElse(0) 46 | } 47 | } 48 | 49 | 50 | def runPreCanonicalized[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): Graph[Int, ED] = { 51 | // Construct set representations of the neighborhoods 52 | val nbrSets: VertexRDD[VertexSet] = 53 | graph.collectNeighborIds(EdgeDirection.Either).mapValues { (vid, nbrs) => 54 | val set = new VertexSet(nbrs.length) 55 | var i = 0 56 | while (i < nbrs.length) { 57 | // prevent self cycle 58 | if (nbrs(i) != vid) { 59 | set.add(nbrs(i)) 60 | } 61 | i += 1 62 | } 63 | set 64 | } 65 | 66 | // join the sets with the graph 67 | val setGraph: Graph[VertexSet, ED] = graph.outerJoinVertices(nbrSets) { 68 | (vid, _, optSet) => optSet.getOrElse(null) 69 | } 70 | 71 | // Edge function computes intersection of smaller vertex with larger vertex 72 | def edgeFunc(ctx: EdgeContext[VertexSet, ED, Int]) { 73 | val (smallSet, largeSet) = if (ctx.srcAttr.size < ctx.dstAttr.size) { 74 | (ctx.srcAttr, ctx.dstAttr) 75 | } else { 76 | (ctx.dstAttr, ctx.srcAttr) 77 | } 78 | val iter = smallSet.iterator 79 | var counter: Int = 0 80 | while (iter.hasNext) { 81 | val vid = iter.next() 82 | if (vid != ctx.srcId && vid != ctx.dstId && largeSet.contains(vid)) { 83 | counter += 1 84 | } 85 | } 86 | ctx.sendToSrc(counter) 87 | ctx.sendToDst(counter) 88 | } 89 | 90 | // compute the intersection along edges 91 | val counters: VertexRDD[Int] = setGraph.aggregateMessages(edgeFunc, _ + _) 92 | // Merge counters with the graph and divide by two since each triangle is counted twice 93 | graph.outerJoinVertices(counters) { (_, _, optCounter: Option[Int]) => 94 | val dblCount = optCounter.getOrElse(0) 95 | // This algorithm double counts each triangle so the final count should be even 96 | require(dblCount % 2 == 0, "Triangle count resulted in an invalid number of triangles.") 97 | dblCount / 2 98 | } 99 | } 100 | } 101 | 102 | 103 | ``` -------------------------------------------------------------------------------- /pregel-api.md: -------------------------------------------------------------------------------- 1 | # Pregel API 2 | 3 |   
图本身是递归数据结构,顶点的属性依赖于它们邻居的属性,这些邻居的属性又依赖于自己邻居的属性。所以许多重要的图算法都是迭代的重新计算每个顶点的属性,直到满足某个确定的条件。 4 | 一系列的图并发(`graph-parallel`)抽象已经被提出来用来表达这些迭代算法。`GraphX`公开了一个类似`Pregel`的操作,它是广泛使用的`Pregel`和`GraphLab`抽象的一个融合。 5 | 6 |   `GraphX`中实现的这个更高级的`Pregel`操作是一个约束到图拓扑的批量同步(`bulk-synchronous`)并行消息抽象。`Pregel`操作者执行一系列的超步(`super steps`),在这些步骤中,顶点从 7 | 之前的超步中接收进入(`inbound`)消息的总和,为顶点属性计算一个新的值,然后在以后的超步中发送消息到邻居顶点。不像`Pregel`而更像`GraphLab`,消息通过边`triplet`的一个函数被并行计算, 8 | 消息的计算既会访问源顶点特征也会访问目的顶点特征。在超步中,没有收到消息的顶点会被跳过。当没有消息遗留时,`Pregel`操作停止迭代并返回最终的图。 9 | 10 |   注意,与标准的`Pregel`实现不同的是,`GraphX`中的顶点仅仅能发送信息给邻居顶点,并且可以利用用户自定义的消息函数并行地构造消息。这些限制允许对`GraphX`进行额外的优化。 11 | 12 |   下面的代码是`pregel`的具体实现。 13 | 14 | ```scala 15 | def apply[VD: ClassTag, ED: ClassTag, A: ClassTag] 16 | (graph: Graph[VD, ED], 17 | initialMsg: A, 18 | maxIterations: Int = Int.MaxValue, 19 | activeDirection: EdgeDirection = EdgeDirection.Either) 20 | (vprog: (VertexId, VD, A) => VD, 21 | sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)], 22 | mergeMsg: (A, A) => A) 23 | : Graph[VD, ED] = 24 | { 25 | var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache() 26 | // 计算消息 27 | var messages = g.mapReduceTriplets(sendMsg, mergeMsg) 28 | var activeMessages = messages.count() 29 | // 迭代 30 | var prevG: Graph[VD, ED] = null 31 | var i = 0 32 | while (activeMessages > 0 && i < maxIterations) { 33 | // 接收消息并更新顶点 34 | prevG = g 35 | g = g.joinVertices(messages)(vprog).cache() 36 | val oldMessages = messages 37 | // 发送新消息 38 | messages = g.mapReduceTriplets( 39 | sendMsg, mergeMsg, Some((oldMessages, activeDirection))).cache() 40 | activeMessages = messages.count() 41 | i += 1 42 | } 43 | g 44 | } 45 | ``` 46 | ## 1 pregel计算模型 47 | 48 |   `Pregel`计算模型中有三个重要的函数,分别是`vertexProgram`、`sendMessage`和`messageCombiner`。 49 | 50 | - `vertexProgram`:用户定义的顶点运行程序。它作用于每一个顶点,负责接收进来的信息,并计算新的顶点值。 51 | 52 | - `sendMsg`:发送消息 53 | 54 | - `mergeMsg`:合并消息 55 | 56 |   我们具体分析它的实现。根据代码可以知道,这个实现是一个迭代的过程。在开始迭代之前,先完成一些初始化操作: 57 | 58 | ```scala 59 | var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache() 60 | // 计算消息 61 | var messages = g.mapReduceTriplets(sendMsg, mergeMsg) 62 | var activeMessages = messages.count() 63 | ``` 64 |   程序首先用`vprog`函数处理图中所有的顶点,生成新的图。然后用生成的图调用聚合操作(`mapReduceTriplets`,实际的实现是我们前面章节讲到的`aggregateMessagesWithActiveSet`函数)获取聚合后的消息。 65 | `activeMessages`指`messages`这个`VertexRDD`中的顶点数。 66 | 67 |   下面就开始迭代操作了。在迭代内部,分为二步。 68 | 69 | - 1 接收消息,并更新顶点 70 | 71 | ```scala 72 | g = g.joinVertices(messages)(vprog).cache() 73 | //joinVertices的定义 74 | def joinVertices[U: ClassTag](table: RDD[(VertexId, U)])(mapFunc: (VertexId, VD, U) => VD) 75 | : Graph[VD, ED] = { 76 | val uf = (id: VertexId, data: VD, o: Option[U]) => { 77 | o match { 78 | case Some(u) => mapFunc(id, data, u) 79 | case None => data 80 | } 81 | } 82 | graph.outerJoinVertices(table)(uf) 83 | } 84 | ``` 85 |   这一步实际上是使用`outerJoinVertices`来更新顶点属性。`outerJoinVertices`在[关联操作](operators/join.md)中有详细介绍。 86 | 87 | - 2 发送新消息 88 | 89 | ```scala 90 | messages = g.mapReduceTriplets( 91 | sendMsg, mergeMsg, Some((oldMessages, activeDirection))).cache() 92 | ``` 93 |   注意,在上面的代码中,`mapReduceTriplets`多了一个参数`Some((oldMessages, activeDirection))`。这个参数的作用是:它使我们在发送新的消息时,会忽略掉那些两端都没有接收到消息的边,减少计算量。 94 | 95 | ## 2 pregel实现最短路径 96 | 97 | ```scala 98 | import org.apache.spark.graphx._ 99 | import org.apache.spark.graphx.util.GraphGenerators 100 | val graph: Graph[Long, Double] = 101 | GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble) 102 | val sourceId: VertexId = 42 // The 
ultimate source 103 | // 初始化图 104 | val initialGraph = graph.mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity) 105 | val sssp = initialGraph.pregel(Double.PositiveInfinity)( 106 | (id, dist, newDist) => math.min(dist, newDist), // Vertex Program 107 | triplet => { // Send Message 108 | if (triplet.srcAttr + triplet.attr < triplet.dstAttr) { 109 | Iterator((triplet.dstId, triplet.srcAttr + triplet.attr)) 110 | } else { 111 | Iterator.empty 112 | } 113 | }, 114 | (a,b) => math.min(a,b) // Merge Message 115 | ) 116 | println(sssp.vertices.collect.mkString("\n")) 117 | ``` 118 |   上面的例子中,`Vertex Program`函数定义如下: 119 | 120 | ```scala 121 | (id, dist, newDist) => math.min(dist, newDist) 122 | ``` 123 |   这个函数的定义显而易见,当两个消息来的时候,取它们当中路径的最小值。同理`Merge Message`函数也是同样的含义。 124 | 125 |   `Send Message`函数中,会首先比较`triplet.srcAttr + triplet.attr`和`triplet.dstAttr`,即比较加上边的属性后,这个值是否小于目的节点的属性,如果小于,则发送消息到目的顶点。 126 | 127 | ## 3 参考文献 128 | 129 | 【1】[spark源码](https://github.com/apache/spark) -------------------------------------------------------------------------------- /graphx-introduce.md: -------------------------------------------------------------------------------- 1 | # GraphX介绍 2 | 3 | ## 1 GraphX的优势 4 | 5 |   `GraphX`是一个新的`Spark API`,它用于图和分布式图(`graph-parallel`)的计算。`GraphX`通过引入弹性分布式属性图([Resilient Distributed Property Graph](property-graph.md)): 6 | 顶点和边均有属性的有向多重图,来扩展`Spark RDD`。为了支持图计算,`GraphX`开发了一组基本的功能操作以及一个优化过的`Pregel API`。另外,`GraphX`包含了一个快速增长的图算法和图`builders`的 7 | 集合,用以简化图分析任务。 8 | 9 |   从社交网络到语言建模,不断增长的规模以及图形数据的重要性已经推动了许多新的分布式图系统(如[Giraph](http://giraph.apache.org/)和[GraphLab](http://graphlab.org/))的发展。 10 | 通过限制计算类型以及引入新的技术来切分和分配图,这些系统可以高效地执行复杂的图形算法,比一般的分布式数据计算(`data-parallel`,如`spark`、`MapReduce`)快很多。 11 | 12 |
![2.1](imgs/2.1.png)
13 | 14 |   分布式图(`graph-parallel`)计算和分布式数据(`data-parallel`)计算类似,分布式数据计算采用了一种`record-centric`的集合视图,而分布式图计算采用了一种`vertex-centric`的图视图。 15 | 分布式数据计算通过同时处理独立的数据来获得并发的目的,分布式图计算则是通过对图数据进行分区(即切分)来获得并发的目的。更准确的说,分布式图计算递归地定义特征的转换函数(这种转换函数作用于邻居特征),通过并发地执行这些转换函数来获得并发的目的。 16 | 17 |   分布式图计算比分布式数据计算更适合图的处理,但是在典型的图处理流水线中,它并不能很好地处理所有操作。例如,虽然分布式图系统可以很好的计算`PageRank`以及`label diffusion`,但是它们不适合从不同的数据源构建图或者跨过多个图计算特征。 18 | 更准确的说,分布式图系统提供的更窄的计算视图无法处理那些构建和转换图结构以及跨越多个图的需求。分布式图系统中无法提供的这些操作需要数据在图本体之上移动并且需要一个图层面而不是单独的顶点或边层面的计算视图。例如,我们可能想限制我们的分析到几个子图上,然后比较结果。 19 | 这不仅需要改变图结构,还需要跨多个图计算。 20 | 21 |
![2.2](imgs/2.2.png)
22 | 23 |   我们如何处理数据取决于我们的目标,有时同一原始数据可能会处理成许多不同表和图的视图,并且图和表之间经常需要能够相互移动。如下图所示: 24 | 25 |
![2.3](imgs/2.3.png)
26 | 27 |   所以我们的图流水线必须通过组合`graph-parallel`和`data- parallel`来实现。但是这种组合必然会导致大量的数据移动以及数据复制,同时这样的系统也非常复杂。 28 | 例如,在传统的图计算流水线中,在`Table View`视图下,可能需要`Spark`或者`Hadoop`的支持,在`Graph View`这种视图下,可能需要`Prege`或者`GraphLab`的支持。也就是把图和表分在不同的系统中分别处理。 29 | 不同系统之间数据的移动和通信会成为很大的负担。 30 | 31 |   `GraphX`项目将`graph-parallel`和`data-parallel`统一到一个系统中,并提供了一个唯一的组合`API`。`GraphX`允许用户把数据当做一个图和一个集合(`RDD`),而不需要数据移动或者复制。也就是说`GraphX`统一了`Graph View`和`Table View`, 32 | 可以非常轻松的做`pipeline`操作。 33 | 34 | ## 2 弹性分布式属性图 35 | 36 |   `GraphX`的核心抽象是[弹性分布式属性图](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.Graph),它是一个有向多重图,带有连接到每个顶点和边的用户定义的对象。 37 | 有向多重图中多个并行的边共享相同的源和目的顶点。支持并行边的能力简化了建模场景,相同的顶点可能存在多种关系(例如`co-worker`和`friend`)。 38 | 每个顶点用一个唯一的64位长的标识符(`VertexID`)作为`key`。`GraphX`并没有对顶点标识强加任何排序。同样,边拥有相应的源和目的顶点标识符。 39 | 40 |   属性图扩展了`Spark RDD`的抽象,有`Table`和`Graph`两种视图,但是只需要一份物理存储。两种视图都有自己独有的操作符,从而使我们同时获得了操作的灵活性和执行的高效率。 41 | 属性图以`vertex(VD)`和`edge(ED)`类型作为参数类型,这些类型分别是顶点和边相关联的对象的类型。 42 | 43 |   在某些情况下,在同样的图中,我们可能希望拥有不同属性类型的顶点。这可以通过继承完成。例如,将用户和产品建模成一个二分图,我们可以用如下方式: 44 | 45 | ```scala 46 | class VertexProperty() 47 | case class UserProperty(val name: String) extends VertexProperty 48 | case class ProductProperty(val name: String, val price: Double) extends VertexProperty 49 | // The graph might then have the type: 50 | var graph: Graph[VertexProperty, String] = null 51 | ``` 52 | 53 |   和`RDD`一样,属性图是不可变的、分布式的、容错的。图的值或者结构的改变需要生成一个新的图来实现。注意,原始图中不受影响的部分都可以在新图中重用,用来减少存储的成本。 54 | 执行者使用一系列顶点分区方法来对图进行分区。如`RDD`一样,图的每个分区可以在发生故障的情况下被重新创建在不同的机器上。 55 | 56 |   逻辑上,属性图对应于一对类型化的集合(`RDD`),这个集合包含每一个顶点和边的属性。因此,图的类中包含访问图中顶点和边的成员变量。 57 | 58 | ```scala 59 | class Graph[VD, ED] { 60 | val vertices: VertexRDD[VD] 61 | val edges: EdgeRDD[ED] 62 | } 63 | ``` 64 | 65 |   `VertexRDD[VD]`和`EdgeRDD[ED]`类是`RDD[(VertexID, VD)]`和`RDD[Edge[ED]]`的继承和优化版本。`VertexRDD[VD]`和`EdgeRDD[ED]`都提供了额外的图计算功能并提供内部优化功能。 66 | 67 | ```scala 68 | abstract class VertexRDD[VD]( 69 | sc: SparkContext, 70 | deps: Seq[Dependency[_]]) extends RDD[(VertexId, VD)](sc, deps) 71 | 72 | abstract class EdgeRDD[ED]( 73 | sc: SparkContext, 74 | deps: Seq[Dependency[_]]) extends RDD[Edge[ED]](sc, deps) 75 | ``` 76 | 77 | ## 3 GraphX的图存储模式 78 | 79 |   `Graphx`借鉴`PowerGraph`,使用的是`Vertex-Cut`( 点分割 ) 方式存储图,用三个`RDD`存储图数据信息: 80 | 81 | - `VertexTable(id, data)`:`id`为顶点`id`, `data`为顶点属性 82 | 83 | - `EdgeTable(pid, src, dst, data)`:`pid` 为分区`id` ,`src`为源顶点`id` ,`dst`为目的顶点`id`,`data`为边属性 84 | 85 | - `RoutingTable(id, pid)`:`id` 为顶点`id` ,`pid` 为分区`id` 86 | 87 |   点分割存储实现如下图所示: 88 | 89 |
![2.4](imgs/2.4.png)
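  下面用一段说明性质的草图展示点分割存储的入口:构建图之后,可以调用`partitionBy`并传入划分策略,让`GraphX`按该策略把边重新分配到各个分区(示例数据为自拟,假设`sc`是已创建好的`SparkContext`):

```scala
import org.apache.spark.graphx._

// 用边列表构建图,GraphX会以点分割的方式存储它
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
val graph = Graph.fromEdges(edges, defaultValue = 0)

// 指定边的划分策略,例如二维划分EdgePartition2D
val partitioned = graph.partitionBy(PartitionStrategy.EdgePartition2D)
```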
90 | 91 |   在后文的[图构建](build-graph.md)部分,我们会详细介绍这三个部分。 92 | 93 | ## 4 GraphX底层设计的核心点 94 | 95 | - 1 对`Graph`视图的所有操作,最终都会转换成其关联的`Table`视图的`RDD`操作来完成。一个图的计算在逻辑上等价于一系列`RDD`的转换过程。因此,`Graph`最终具备了`RDD`的3个关键特性:不变性、分布性和容错性。其中最关键的是不变性。逻辑上,所有图的转换和操作都产生了一个新图;物理上,`GraphX`会有一定程度的不变顶点和边的复用优化,对用户透明。 96 | 97 | - 2 两种视图底层共用的物理数据,由`RDD[VertexPartition]`和`RDD[EdgePartition]`这两个`RDD`组成。点和边实际都不是以表`Collection[tuple]`的形式存储的,而是由`VertexPartition/EdgePartition`在内部存储一个带索引结构的分片数据块,以加速不同视图下的遍历速度。不变的索引结构在`RDD`转换过程中是共用的,降低了计算和存储开销。 98 | 99 | - 3 图的分布式存储采用点分割模式,而且使用`partitionBy`方法,由用户指定不同的划分策略。下一章会具体讲到划分策略。 100 | 101 | ## 5 参考文献 102 | 103 | 【1】[spark graphx参考文献](https://github.com/endymecy/spark-programming-guide-zh-cn/tree/master/graphx-programming-guide) 104 | 105 | 【2】[快刀初试:Spark GraphX在淘宝的实践](http://www.csdn.net/article/2014-08-07/2821097) 106 | 107 | 【3】[GraphX: Unifying Data-Parallel and Graph-Parallel](docs/graphx.pdf) -------------------------------------------------------------------------------- /vertex-cut.md: -------------------------------------------------------------------------------- 1 | # 点分割存储 2 | 3 |   在第一章分布式图系统中,我们介绍了图存储的两种方式:点分割存储和边分割存储。`GraphX`借鉴`powerGraph`,使用的是点分割方式存储图。这种存储方式特点是任何一条边只会出现在一台机器上,每个点有可能分布到不同的机器上。 4 | 当点被分割到不同机器上时,是相同的镜像,但是有一个点作为主点,其他的点作为虚点,当点的数据发生变化时,先更新主点的数据,然后将所有更新好的数据发送到虚点所在的所有机器,更新虚点。 5 | 这样做的好处是在边的存储上是没有冗余的,而且对于某个点与它的邻居的交互操作,只要满足交换律和结合律,就可以在不同的机器上面执行,网络开销较小。但是这种分割方式会存储多份点数据,更新点时, 6 | 会发生网络传输,并且有可能出现同步问题。 7 | 8 |   `GraphX`在进行图分割时,有几种不同的分区(`partition`)策略,它通过`PartitionStrategy`专门定义这些策略。在`PartitionStrategy`中,总共定义了`EdgePartition2D`、`EdgePartition1D`、`RandomVertexCut`以及 9 | `CanonicalRandomVertexCut`这四种不同的分区策略。下面分别介绍这几种策略。 10 | 11 | ## 1 RandomVertexCut 12 | 13 | ```scala 14 | case object RandomVertexCut extends PartitionStrategy { 15 | override def getPartition(src: VertexId, dst: VertexId, numParts: PartitionID): PartitionID = { 16 | math.abs((src, dst).hashCode()) % numParts 17 | } 18 | } 19 | ``` 20 |   这个方法比较简单,通过取源顶点和目标顶点`id`的哈希值来将边分配到不同的分区。这个方法会产生一个随机的边分割,两个顶点之间相同方向的边会分配到同一个分区。 21 | 22 | ## 2 CanonicalRandomVertexCut 23 | 24 | ```scala 25 | case object CanonicalRandomVertexCut extends PartitionStrategy { 26 | override def getPartition(src: VertexId, dst: VertexId, numParts: PartitionID): PartitionID = { 27 | if (src < dst) { 28 | math.abs((src, dst).hashCode()) % numParts 29 | } else { 30 | math.abs((dst, src).hashCode()) % numParts 31 | } 32 | } 33 | } 34 | ``` 35 |   这种分割方法和前一种方法没有本质的不同。不同的是,哈希值的产生带有确定的方向(即两个顶点中较小`id`的顶点在前)。两个顶点之间所有的边都会分配到同一个分区,而不管方向如何。 36 | 37 | ## 3 EdgePartition1D 38 | 39 | ```scala 40 | case object EdgePartition1D extends PartitionStrategy { 41 | override def getPartition(src: VertexId, dst: VertexId, numParts: PartitionID): PartitionID = { 42 | val mixingPrime: VertexId = 1125899906842597L 43 | (math.abs(src * mixingPrime) % numParts).toInt 44 | } 45 | } 46 | ``` 47 |   这种方法仅仅根据源顶点`id`来将边分配到不同的分区。有相同源顶点的边会分配到同一分区。 48 | 49 | ## 4 EdgePartition2D 50 | 51 | ```scala 52 | case object EdgePartition2D extends PartitionStrategy { 53 | override def getPartition(src: VertexId, dst: VertexId, numParts: PartitionID): PartitionID = { 54 | val ceilSqrtNumParts: PartitionID = math.ceil(math.sqrt(numParts)).toInt 55 | val mixingPrime: VertexId = 1125899906842597L 56 | if (numParts == ceilSqrtNumParts * ceilSqrtNumParts) { 57 | // Use old method for perfect squared to ensure we get same results 58 | val col: PartitionID = (math.abs(src * mixingPrime) % ceilSqrtNumParts).toInt 59 | val row: PartitionID = (math.abs(dst * mixingPrime) % ceilSqrtNumParts).toInt 60 | (col * 
ceilSqrtNumParts + row) % numParts 61 | } else { 62 | // Otherwise use new method 63 | val cols = ceilSqrtNumParts 64 | val rows = (numParts + cols - 1) / cols 65 | val lastColRows = numParts - rows * (cols - 1) 66 | val col = (math.abs(src * mixingPrime) % numParts / rows).toInt 67 | val row = (math.abs(dst * mixingPrime) % (if (col < cols - 1) rows else lastColRows)).toInt 68 | col * rows + row 69 | } 70 | } 71 | } 72 | ``` 73 |   这种分割方法同时使用到了源顶点`id`和目的顶点`id`。它使用稀疏边连接矩阵的2维区分来将边分配到不同的分区,从而保证顶点的备份数不大于`2 * sqrt(numParts)`的限制。这里`numParts`表示分区数。 74 | 这个方法的实现分两种情况,即分区数能完全开方和不能完全开方两种情况。当分区数能完全开方时,采用下面的方法: 75 | 76 | ```scala 77 | val col: PartitionID = (math.abs(src * mixingPrime) % ceilSqrtNumParts).toInt 78 | val row: PartitionID = (math.abs(dst * mixingPrime) % ceilSqrtNumParts).toInt 79 | (col * ceilSqrtNumParts + row) % numParts 80 | ``` 81 | 82 |   当分区数不能完全开方时,采用下面的方法。这个方法的最后一列允许拥有不同的行数。 83 | 84 | ```scala 85 | val cols = ceilSqrtNumParts 86 | val rows = (numParts + cols - 1) / cols 87 | //最后一列允许不同的行数 88 | val lastColRows = numParts - rows * (cols - 1) 89 | val col = (math.abs(src * mixingPrime) % numParts / rows).toInt 90 | val row = (math.abs(dst * mixingPrime) % (if (col < cols - 1) rows else lastColRows)).toInt 91 | col * rows + row 92 | ``` 93 |   下面举个例子来说明该方法。假设我们有一个拥有12个顶点的图,要把它切分到9台机器。我们可以用下面的稀疏矩阵来表示: 94 | 95 | ``` 96 | __________________________________ 97 | v0 | P0 * | P1 | P2 * | 98 | v1 | **** | * | | 99 | v2 | ******* | ** | **** | 100 | v3 | ***** | * * | * | 101 | ---------------------------------- 102 | v4 | P3 * | P4 *** | P5 ** * | 103 | v5 | * * | * | | 104 | v6 | * | ** | **** | 105 | v7 | * * * | * * | * | 106 | ---------------------------------- 107 | v8 | P6 * | P7 * | P8 * *| 108 | v9 | * | * * | | 109 | v10 | * | ** | * * | 110 | v11 | * <-E | *** | ** | 111 | ---------------------------------- 112 | ``` 113 | 114 |   上面的例子中`*`表示分配到处理器上的边。`E`表示连接顶点`v11`和`v1`的边,它被分配到了处理器`P6`上。为了获得边所在的处理器,我们将矩阵切分为`sqrt(numParts) * sqrt(numParts)`块。 115 | 注意,上图中与顶点`v11`相连接的边只出现在第一列的块`(P0,P3,P6)`或者最后一行的块`(P6,P7,P8)`中,这保证了`V11`的副本数不会超过`2 * sqrt(numParts)`份,在上例中即副本不能超过6份。 116 | 117 |   在上面的例子中,`P0`里面存在很多边,这会造成工作的不均衡。为了提高均衡,我们首先用顶点`id`乘以一个大的素数,然后再`shuffle`顶点的位置。乘以一个大的素数本质上不能解决不平衡的问题,只是减少了不平衡的情况发生。 118 | 119 | # 5 参考文献 120 | 121 | 【1】[spark源码](https://github.com/apache/spark) -------------------------------------------------------------------------------- /parallel-graph-system.md: -------------------------------------------------------------------------------- 1 | # 分布式图计算 2 | 3 |   在介绍`GraphX`之前,我们需要先了解分布式图计算框架。简言之,分布式图框架就是将大型图的各种操作封装成接口,让分布式存储、并行计算等复杂问题对上层透明,从而使工程师将焦点放在图相关的模型设计和使用上,而不用关心底层的实现细节。 4 | 分布式图框架的实现需要考虑两个问题,第一是怎样切分图以更好的计算和保存;第二是采用什么图计算模型。下面分别介绍这两个问题。 5 | 6 | # 1 图切分方式 7 | 8 |   图的切分总体上说有点切分和边切分两种方式。 9 | 10 | - 点切分:通过点切分之后,每条边只保存一次,并且出现在同一台机器上。邻居多的点会被分发到不同的节点上,增加了存储空间,并且有可能产生同步问题。但是,它的优点是减少了网络通信。 11 | 12 | - 边切分:通过边切分之后,顶点只保存一次,切断的边会打断保存在两台机器上。在基于边的操作时,对于两个顶点分到两个不同的机器的边来说,需要进行网络传输数据。这增加了网络传输的数据量,但好处是节约了存储空间。 13 | 14 |   以上两种切分方式虽然各有优缺点,但是点切分还是占有优势。`GraphX`以及后文提到的`Pregel`、`GraphLab`都使用到了点切分。 15 | 16 | # 2 图计算框架 17 | 18 |   图计算框架基本上都遵循分布式批同步(`Bulk Synchronous Parallell,BSP`)计算模式。基于`BSP`模式,目前有两种比较成熟的图计算框架:`Pregel`框架和`GraphLab`框架。 19 | 20 | ## 2.1 BSP 21 | 22 | ### 2.1.1 BSP基本原理 23 | 24 |   在`BSP`中,一次计算过程由一系列全局超步组成,每一个超步由并发计算、通信和同步三个步骤组成。同步完成,标志着这个超步的完成及下一个超步的开始。 25 | `BSP`模式的准则是批量同步(`bulk synchrony`),其独特之处在于超步(`superstep`)概念的引入。一个`BSP`程序同时具有水平和垂直两个方面的结构。从垂直上看,一个`BSP`程序由一系列串行的超步(`superstep`)组成,如图所示: 26 | 27 |
![1.1](imgs/1.1.png)
28 | 29 |   从水平上看,在一个超步中,所有的进程并行执行局部计算。一个超步可分为三个阶段,如图所示: 30 | 31 |
![1.2](imgs/1.2.png)
32 | 33 | - 本地计算阶段,每个处理器只对存储在本地内存中的数据进行本地计算。 34 | - 全局通信阶段,对任何非本地数据进行操作。 35 | - 栅栏同步阶段,等待所有通信行为的结束。 36 | 37 | ### 2.1.2 BSP模型特点 38 | 39 |   BSP模型有如下几个特点: 40 | 41 | - 1 将计算划分为一个一个的超步(`superstep`),有效避免死锁; 42 | 43 | - 2 将处理器和路由器分开,强调了计算任务和通信任务的分开,而路由器仅仅完成点到点的消息传递,不提供组合、复制和广播等功能,这样做既掩盖具体的互连网络拓扑,又简化了通信协议; 44 | 45 | - 3 采用障碍同步的方式、以硬件实现的全局同步是可控的粗粒度级,提供了执行紧耦合同步式并行算法的有效方式 46 | 47 | ## 2.2 `Pregel`框架 48 | 49 |   `Pregel`是一种面向图算法的分布式编程框架,采用迭代的计算模型:在每一轮,每个顶点处理上一轮收到的消息,并发出消息给其它顶点,并更新自身状态和拓扑结构(出、入边)等。 50 | 51 | ### 2.2.1 `Pregel`框架执行过程 52 | 53 |   在`Pregel`计算模式中,输入是一个有向图,该有向图的每一个顶点都有一个相应的由字符串描述的`vertex identifier`。每一个顶点都有一些属性,这些属性可以被修改,其初始值由用户定义。每一条有向边都和其源顶点关联,并且也拥有一些用户定义的属性和值,并同时还记录了其目的顶点的`ID`。 54 | 55 |   一个典型的`Pregel`计算过程如下:读取输入,初始化该图,当图被初始化好后,运行一系列的超步,每一次超步都在全局的角度上独立运行,直到整个计算结束,输出结果。 56 | 在每一次超步中,顶点的计算都是并行的,并且执行用户定义的同一个函数。每个顶点可以修改其自身的状态信息或以它为起点的出边的信息,从前序超步中接受消息,并传送给其后续超步,或者修改整个图的拓扑结构。边,在这种计算模式中并不是核心对象,没有相应的计算运行在其上。 57 | 58 |   算法是否能够结束取决于是否所有的顶点都已经`vote`标识其自身已经达到`halt`状态了。在`superstep 0`中,所有顶点都置于`active`状态,每一个`active`的顶点都会在计算的执行中在某一次的`superstep`中被计算。顶点通过将其自身的状态设置成`halt`来表示它已经不再`active`。这就表示该顶点没有进一步的计算需要进行,除非被其他的运算触发,而`Pregel`框架将不会在接下来的`superstep`中计算该顶点,除非该顶点收到一个其他`superstep`传送的消息。 59 | 如果顶点接收到消息,该消息将该顶点重新置`active`,那么在随后的计算中该顶点必须再次`deactive`其自身。整个计算在所有顶点都达到`inactive`状态,并且没有消息在传送的时候宣告结束。这种简单的状态机制在下图中描述: 60 | 61 |
![1.3](imgs/1.3.png)
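  为了帮助理解这个状态机,下面用一小段`Scala`草图模拟顶点在一个超步中的状态变化(`VertexState`、`superstep`等名称均为示例自拟,并非`Pregel`的真实接口):

```scala
// 顶点状态:属性值加上是否处于active状态(仅为示意)
case class VertexState(value: Double, active: Boolean)

// 一个超步:active的顶点或收到消息的顶点才参与计算,计算完成后voteToHalt
def superstep(v: VertexState, msgs: Seq[Double]): VertexState =
  if (v.active || msgs.nonEmpty) {
    // 这里以"取最小值"代表用户自定义的计算逻辑
    val newValue = (v.value +: msgs).min
    VertexState(newValue, active = false) // voteToHalt:置为inactive,等待新消息唤醒
  } else {
    v // inactive且没有收到消息:该顶点在本超步中被跳过
  }
```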
62 | 63 |   我们用`PageRank`为例来说明`Pregel`的计算过程。 64 | 65 | ```c++ 66 | def PageRank(v: Id, msgs: List[Double]) { 67 | // 计算消息和 68 | var msgSum = 0 69 | for (m <- msgs) { msgSum = msgSum + m } 70 | // 更新 PageRank (PR) 71 | A(v).PR = 0.15 + 0.85 * msgSum 72 | // 广播新的PR消息 73 | for (j <- OutNbrs(v)) { 74 | msg = A(v).PR / A(v).NumLinks 75 | send_msg(to=j, msg) 76 | } 77 | // 检查终止 78 | if (converged(A(v).PR)) voteToHalt(v) 79 | } 80 | ``` 81 | 82 |   以上代码中,顶点`v`首先接收来自上一次迭代的消息,计算它们的和。然后使用计算的消息和重新计算`PageRank`,之后程序广播这个重新计算的`PageRank`的值到顶点`v`的所有邻居,最后程序判断算法是否应该停止。 83 | 84 | ### 2.2.1 `Pregel`框架的消息模式 85 | 86 |   `Pregel`选择了一种纯消息传递的模式,忽略远程数据读取和其他共享内存的方式,这样做有两个原因。 87 | 88 | - 第一,消息的传递有足够高效的表达能力,不需要远程读取(`remote reads`)。 89 | 90 | - 第二,性能的考虑。在一个集群环境中,从远程机器上读取一个值是会有很高的延迟的,这种情况很难避免。而消息传递模式通过异步和批量的方式传递消息,可以缓解这种远程读取的延迟。 91 | 92 |   图算法其实也可以被写成是一系列的链式`MapReduce`作业。选择不同的模式的原因在于可用性和性能。`Pregel`将顶点和边在本地机器进行运算,而仅仅利用网络来传输信息,而不是传输数据。 93 | 而`MapReduce`本质上是面向函数的,所以将图算法用`MapReduce`来实现就需要将整个图的状态从一个阶段传输到另外一个阶段,这样就需要许多的通信和随之而来的序列化和反序列化的开销。另外,在一连串的`MapReduce`作业中各阶段需要协同工作也给编程增加了难度,这样的情况能够在`Pregel`的各轮超步的迭代中避免。 94 | 95 | ### 2.2.3 `Pregel`框架的缺点 96 | 97 |   这个模型虽然简单,但是缺陷明显,那就是对于邻居数很多的顶点,它需要处理的消息非常庞大,而且在这个模式下,它们是无法被并发处理的。所以对于符合幂律分布的自然图,这种计算模型下很容易发生假死或者崩溃。 98 | 99 | ## 2.3 `GraphLab`框架 100 | 101 |   `GraphLab`将数据抽象成`Graph`结构,将基于顶点切分的算法的执行过程抽象成`Gather、Apply、Scatter`三个步骤。以下面的例子作为一个说明。 102 | 103 |
![1.4](imgs/1.4.png)
104 | 105 |   示例中,需要完成对`V0`邻接顶点的求和计算,串行实现中,`V0`对其所有的邻接点进行遍历,累加求和。而`GraphLab`中,将顶点`V0`进行切分,将`V0`的边关系以及对应的邻接点部署在两台处理器上,各台机器上并行进行部分求和运算,然后通过`master`(蓝色)顶点和`mirror`(橘红色)顶点的通信完成最终的计算。 106 | 107 | ### 2.3.1 `GraphLab`框架的数据模型 108 | 109 |   对于分割的某个顶点,它会被部署到多台机器,一台机器作为`master`顶点,其余机器作为`mirror`。`master`作为所有`mirror`的管理者,负责给`mirror`安排具体计算任务;`mirror`作为该顶点在各台机器上的代理执行者,与`master`数据的保持同步。 110 | 111 |   对于某条边,`GraphLab`将其唯一部署在某一台机器上,而对边关联的顶点进行多份存储,解决了边数据量大的问题。 112 | 113 |   同一台机器上的所有顶点和边构成一个本地图(`local graph)`,在每台机器上,存在一份本地`id`到全局`id`的映射表。顶点是一个进程上所有线程共享的,在并行计算过程中,各个线程分摊进程中所有顶点的`gather->apply->scatter`操作。 114 | 115 |   我们用下面这个例子说明,`GraphLab`是怎么构建`Graph`的。图中,以顶点`v2`和`v3`进行分割。顶点`v2`和`v3`同时存在于两个进程中,并且两个线程共同分担顶点计算。 116 | 117 |
![1.5](imgs/1.5.png)
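  对于上面提到的本地`id`与全局`id`的映射,可以用下面这段假设性的草图来直观理解(`GraphX`的`EdgePartition`中也维护了类似的`local2global`/`global2local`结构,后文的转换操作一章会看到):

```scala
// 每台机器为本地图维护 本地下标 <-> 全局顶点id 的双向映射(示意)
val local2global: Array[Long] = Array(2L, 3L, 7L) // 本地下标 -> 全局id
val global2local: Map[Long, Int] = local2global.zipWithIndex.toMap

val localId = global2local(3L)       // 由全局id找到本地存储位置:1
val globalId = local2global(localId) // 再映射回全局id:3
```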
118 | 119 | ### 2.3.2 `GraphLab`框架的执行模型 120 | 121 |   每个顶点每一轮迭代会经过`gather -> apple -> scatter`三个阶段。 122 | 123 | - **Gather阶段**,工作顶点的边从连接顶点和自身收集数据。这一阶段对工作顶点、边都是只读的。 124 | 125 | - **Apply阶段**,`mirror`将`gather`阶段计算的结果发送给`master`顶点,`master`进行汇总并结合上一步的顶点数据,按照业务需求进行进一步的计算,然后更新`master`的顶点数据,并同步给`mirror`。`Apply`阶段中,工作顶点可修改,边不可修改。 126 | 127 | - **Scatter阶段**,工作顶点更新完成之后,更新边上的数据,并通知对其有依赖的邻结顶点更新状态。在`scatter`过程中,工作顶点只读,边上数据可写。 128 | 129 |   在执行模型中,`GraphLab`通过控制三个阶段的读写权限来达到互斥的目的。在`gather`阶段只读,`apply`对顶点只写,`scatter`对边只写。并行计算的同步通过`master`和`mirror`来实现,`mirror`相当于每个顶点对外的一个接口人,将复杂的数据通信抽象成顶点的行为。 130 | 131 |   下面这个例子说明`GraphLab`的执行模型: 132 | 133 |
![1.6](imgs/1.6.png)
134 | 135 |   利用`GraphLab`实现的`PageRank`的代码如下所示: 136 | 137 | ```c++ 138 | //汇总 139 | def Gather(a: Double, b: Double) = a + b 140 | //更新顶点 141 | def Apply(v, msgSum) { 142 | A(v).PR = 0.15 + 0.85 * msgSum 143 | if (converged(A(v).PR)) voteToHalt(v) 144 | } 145 | //更新边 146 | def Scatter(v, j) = A(v).PR / A(v).NumLinks 147 | ``` 148 | 149 |   由于`gather/scatter`函数是以单条边为操作粒度,所以对于一个顶点的众多邻边,可以分别由相应的节点独立调用`gather/scatter`函数。这一设计主要是为了适应点分割的图存储模式,从而避免`Pregel`模型会遇到的问题。 150 | 151 | # 3 GraphX 152 | 153 |    `GraphX`也是基于`BSP`模式。`GraphX`公开了一个类似`Pregel`的操作,它是广泛使用的`Pregel`和`GraphLab`抽象的一个融合。在`GraphX`中,`Pregel`操作者执行一系列的超步,在这些超步中,顶点从之前的超步中接收进入(`inbound`)消息,为顶点属性计算一个新的值,然后在以后的超步中发送消息到邻居顶点。 154 | 不像`Pregel`而更像`GraphLab`,消息通过边`triplet`的一个函数被并行计算,消息的计算既会访问源顶点特征也会访问目的顶点特征。在超步中,没有收到消息的顶点会被跳过。当没有消息遗留时,`Pregel`操作停止迭代并返回最终的图。 155 | 156 | # 4 参考文献 157 | 158 | 【1】[Preg el: A System for Larg e-Scale Graph Processing](docs/pregel-a_system_for_large-scale_graph_processing.pdf) 159 | 160 | 【2】[快刀初试:Spark GraphX在淘宝的实践](http://www.csdn.net/article/2014-08-07/2821097) 161 | 162 | 【3】[GraphLab:A New Parallel Framework for Machine Learning](http://www.select.cs.cmu.edu/code/graphlab/) -------------------------------------------------------------------------------- /operators/transformation.md: -------------------------------------------------------------------------------- 1 | # 转换操作 2 | 3 |   `GraphX`中的转换操作主要有`mapVertices`,`mapEdges`和`mapTriplets`三个,它们在`Graph`文件中定义,在`GraphImpl`文件中实现。下面分别介绍这三个方法。 4 | 5 | ## 1 `mapVertices` 6 | 7 |   `mapVertices`用来更新顶点属性。从图的构建那章我们知道,顶点属性保存在边分区中,所以我们需要改变的是边分区中的属性。 8 | 9 | ```scala 10 | override def mapVertices[VD2: ClassTag] 11 | (f: (VertexId, VD) => VD2)(implicit eq: VD =:= VD2 = null): Graph[VD2, ED] = { 12 | if (eq != null) { 13 | vertices.cache() 14 | // 使用方法f处理vertices 15 | val newVerts = vertices.mapVertexPartitions(_.map(f)).cache() 16 | //获得两个不同vertexRDD的不同 17 | val changedVerts = vertices.asInstanceOf[VertexRDD[VD2]].diff(newVerts) 18 | //更新ReplicatedVertexView 19 | val newReplicatedVertexView = replicatedVertexView.asInstanceOf[ReplicatedVertexView[VD2, ED]] 20 | .updateVertices(changedVerts) 21 | new GraphImpl(newVerts, newReplicatedVertexView) 22 | } else { 23 | GraphImpl(vertices.mapVertexPartitions(_.map(f)), replicatedVertexView.edges) 24 | } 25 | } 26 | ``` 27 |   上面的代码中,当`VD`和`VD2`类型相同时,我们可以重用没有发生变化的点,否则需要重新创建所有的点。我们分析`VD`和`VD2`相同的情况,分四步处理。 28 | 29 | - 1 使用方法`f`处理`vertices`,获得新的`VertexRDD` 30 | 31 | - 2 使用在`VertexRDD`中定义的`diff`方法求出新`VertexRDD`和源`VertexRDD`的不同 32 | 33 | ```scala 34 | override def diff(other: VertexRDD[VD]): VertexRDD[VD] = { 35 | val otherPartition = other match { 36 | case other: VertexRDD[_] if this.partitioner == other.partitioner => 37 | other.partitionsRDD 38 | case _ => 39 | VertexRDD(other.partitionBy(this.partitioner.get)).partitionsRDD 40 | } 41 | val newPartitionsRDD = partitionsRDD.zipPartitions( 42 | otherPartition, preservesPartitioning = true 43 | ) { (thisIter, otherIter) => 44 | val thisPart = thisIter.next() 45 | val otherPart = otherIter.next() 46 | Iterator(thisPart.diff(otherPart)) 47 | } 48 | this.withPartitionsRDD(newPartitionsRDD) 49 | } 50 | ``` 51 |   这个方法首先处理新生成的`VertexRDD`的分区,如果它的分区和源`VertexRDD`的分区一致,那么直接取出它的`partitionsRDD`,否则重新分区后取出它的`partitionsRDD`。 52 | 针对新旧两个`VertexRDD`的所有分区,调用`VertexPartitionBaseOps`中的`diff`方法求得分区的不同。 53 | 54 | ```scala 55 | def diff(other: Self[VD]): Self[VD] = { 56 | //首先判断 57 | if (self.index != other.index) { 58 | diff(createUsingIndex(other.iterator)) 59 | } else { 60 | val newMask = self.mask & other.mask 
61 | var i = newMask.nextSetBit(0) 62 | while (i >= 0) { 63 | if (self.values(i) == other.values(i)) { 64 | newMask.unset(i) 65 | } 66 | i = newMask.nextSetBit(i + 1) 67 | } 68 | this.withValues(other.values).withMask(newMask) 69 | } 70 | } 71 | ``` 72 |   该方法隐藏两个`VertexRDD`中相同的顶点信息,得到一个新的`VertexRDD`。 73 | 74 | - 3 更新`ReplicatedVertexView` 75 | 76 | ```scala 77 | def updateVertices(updates: VertexRDD[VD]): ReplicatedVertexView[VD, ED] = { 78 | //生成一个VertexAttributeBlock 79 | val shippedVerts = updates.shipVertexAttributes(hasSrcId, hasDstId) 80 | .setName("ReplicatedVertexView.updateVertices - shippedVerts %s %s (broadcast)".format( 81 | hasSrcId, hasDstId)) 82 | .partitionBy(edges.partitioner.get) 83 | //生成新的边RDD 84 | val newEdges = edges.withPartitionsRDD(edges.partitionsRDD.zipPartitions(shippedVerts) { 85 | (ePartIter, shippedVertsIter) => ePartIter.map { 86 | case (pid, edgePartition) => 87 | (pid, edgePartition.updateVertices(shippedVertsIter.flatMap(_._2.iterator))) 88 | } 89 | }) 90 | new ReplicatedVertexView(newEdges, hasSrcId, hasDstId) 91 | } 92 | ``` 93 |   `updateVertices`方法返回一个新的`ReplicatedVertexView`,它更新了边分区中包含的顶点属性。我们看看它的实现过程。首先看`shipVertexAttributes`方法的调用。 94 | 调用`shipVertexAttributes`方法会生成一个`VertexAttributeBlock`,`VertexAttributeBlock`包含当前分区的顶点属性,这些属性可以在特定的边分区使用。 95 | 96 | ```scala 97 | def shipVertexAttributes( 98 | shipSrc: Boolean, shipDst: Boolean): Iterator[(PartitionID, VertexAttributeBlock[VD])] = { 99 | Iterator.tabulate(routingTable.numEdgePartitions) { pid => 100 | val initialSize = if (shipSrc && shipDst) routingTable.partitionSize(pid) else 64 101 | val vids = new PrimitiveVector[VertexId](initialSize) 102 | val attrs = new PrimitiveVector[VD](initialSize) 103 | var i = 0 104 | routingTable.foreachWithinEdgePartition(pid, shipSrc, shipDst) { vid => 105 | if (isDefined(vid)) { 106 | vids += vid 107 | attrs += this(vid) 108 | } 109 | i += 1 110 | } 111 | //(边分区id,VertexAttributeBlock(顶点id,属性)) 112 | (pid, new VertexAttributeBlock(vids.trim().array, attrs.trim().array)) 113 | } 114 | } 115 | ``` 116 |   获得新的顶点属性之后,我们就可以调用`updateVertices`更新边中顶点的属性了,如下面代码所示: 117 | 118 | ```scala 119 | edgePartition.updateVertices(shippedVertsIter.flatMap(_._2.iterator)) 120 | //更新EdgePartition的属性 121 | def updateVertices(iter: Iterator[(VertexId, VD)]): EdgePartition[ED, VD] = { 122 | val newVertexAttrs = new Array[VD](vertexAttrs.length) 123 | System.arraycopy(vertexAttrs, 0, newVertexAttrs, 0, vertexAttrs.length) 124 | while (iter.hasNext) { 125 | val kv = iter.next() 126 | //global2local获得顶点的本地index 127 | newVertexAttrs(global2local(kv._1)) = kv._2 128 | } 129 | new EdgePartition( 130 | localSrcIds, localDstIds, data, index, global2local, local2global, newVertexAttrs, 131 | activeSet) 132 | } 133 | ``` 134 | 135 | ## 2 `mapEdges` 136 | 137 |   `mapEdges`用来更新边属性。 138 | 139 | ```scala 140 | override def mapEdges[ED2: ClassTag]( 141 | f: (PartitionID, Iterator[Edge[ED]]) => Iterator[ED2]): Graph[VD, ED2] = { 142 | val newEdges = replicatedVertexView.edges 143 | .mapEdgePartitions((pid, part) => part.map(f(pid, part.iterator))) 144 | new GraphImpl(vertices, replicatedVertexView.withEdges(newEdges)) 145 | } 146 | ``` 147 |   相比于`mapVertices`,`mapEdges`显然要简单得多,它只需要根据方法`f`生成新的`EdgeRDD`,然后再初始化即可。 148 | 149 | ## 3 `mapTriplets`:用来更新边属性 150 | 151 |   `mapTriplets`用来更新边属性。 152 | 153 | ```scala 154 | override def mapTriplets[ED2: ClassTag]( 155 | f: (PartitionID, Iterator[EdgeTriplet[VD, ED]]) => Iterator[ED2], 156 | tripletFields: TripletFields): Graph[VD, ED2] = { 157 | vertices.cache() 
158 | replicatedVertexView.upgrade(vertices, tripletFields.useSrc, tripletFields.useDst) 159 | val newEdges = replicatedVertexView.edges.mapEdgePartitions { (pid, part) => 160 | part.map(f(pid, part.tripletIterator(tripletFields.useSrc, tripletFields.useDst))) 161 | } 162 | new GraphImpl(vertices, replicatedVertexView.withEdges(newEdges)) 163 | } 164 | ``` 165 |   这段代码中,`replicatedVertexView`调用`upgrade`方法修改当前的`ReplicatedVertexView`,使调用者可以访问到指定级别的边信息(如仅仅可以读源顶点的属性)。 166 | 167 | ```scala 168 | def upgrade(vertices: VertexRDD[VD], includeSrc: Boolean, includeDst: Boolean) { 169 | //判断传递级别 170 | val shipSrc = includeSrc && !hasSrcId 171 | val shipDst = includeDst && !hasDstId 172 | if (shipSrc || shipDst) { 173 | val shippedVerts: RDD[(Int, VertexAttributeBlock[VD])] = 174 | vertices.shipVertexAttributes(shipSrc, shipDst) 175 | .setName("ReplicatedVertexView.upgrade(%s, %s) - shippedVerts %s %s (broadcast)".format( 176 | includeSrc, includeDst, shipSrc, shipDst)) 177 | .partitionBy(edges.partitioner.get) 178 | val newEdges = edges.withPartitionsRDD(edges.partitionsRDD.zipPartitions(shippedVerts) { 179 | (ePartIter, shippedVertsIter) => ePartIter.map { 180 | case (pid, edgePartition) => 181 | (pid, edgePartition.updateVertices(shippedVertsIter.flatMap(_._2.iterator))) 182 | } 183 | }) 184 | edges = newEdges 185 | hasSrcId = includeSrc 186 | hasDstId = includeDst 187 | } 188 | } 189 | ``` 190 |   最后,用`f`处理边,生成新的`RDD`,最后用新的数据初始化图。 191 | 192 | ## 4 总结 193 | 194 |   调用`mapVertices`,`mapEdges`和`mapTriplets`时,其内部的结构化索引(`Structural indices`)并不会发生变化,它们都重用路由表中的数据。 195 | -------------------------------------------------------------------------------- /operators/structure.md: -------------------------------------------------------------------------------- 1 | # 结构操作 2 | 3 |   当前的`GraphX`仅仅支持一组简单的常用结构性操作。下面是基本的结构性操作列表。 4 | 5 | ```scala 6 | class Graph[VD, ED] { 7 | def reverse: Graph[VD, ED] 8 | def subgraph(epred: EdgeTriplet[VD,ED] => Boolean, 9 | vpred: (VertexId, VD) => Boolean): Graph[VD, ED] 10 | def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED] 11 | def groupEdges(merge: (ED, ED) => ED): Graph[VD,ED] 12 | } 13 | ``` 14 |   下面分别介绍这四种函数的原理。 15 | 16 | # 1 reverse 17 | 18 |   `reverse`操作返回一个新的图,这个图的边的方向都是反转的。例如,这个操作可以用来计算反转的PageRank。因为反转操作没有修改顶点或者边的属性或者改变边的数量,所以我们可以 19 | 在不移动或者复制数据的情况下有效地实现它。 20 | 21 | ```scala 22 | override def reverse: Graph[VD, ED] = { 23 | new GraphImpl(vertices.reverseRoutingTables(), replicatedVertexView.reverse()) 24 | } 25 | def reverse(): ReplicatedVertexView[VD, ED] = { 26 | val newEdges = edges.mapEdgePartitions((pid, part) => part.reverse) 27 | new ReplicatedVertexView(newEdges, hasDstId, hasSrcId) 28 | } 29 | //EdgePartition中的reverse 30 | def reverse: EdgePartition[ED, VD] = { 31 | val builder = new ExistingEdgePartitionBuilder[ED, VD]( 32 | global2local, local2global, vertexAttrs, activeSet, size) 33 | var i = 0 34 | while (i < size) { 35 | val localSrcId = localSrcIds(i) 36 | val localDstId = localDstIds(i) 37 | val srcId = local2global(localSrcId) 38 | val dstId = local2global(localDstId) 39 | val attr = data(i) 40 | //将源顶点和目标顶点换位置 41 | builder.add(dstId, srcId, localDstId, localSrcId, attr) 42 | i += 1 43 | } 44 | builder.toEdgePartition 45 | } 46 | ``` 47 | 48 | ## 2 subgraph 49 | 50 |   `subgraph`操作利用顶点和边的判断式(`predicates`),返回的图仅仅包含满足顶点判断式的顶点、满足边判断式的边以及满足顶点判断式的`triple`。`subgraph`操作可以用于很多场景,如获取 51 | 感兴趣的顶点和边组成的图或者获取清除断开连接后的图。 52 | 53 | ```scala 54 | override def subgraph( 55 | epred: EdgeTriplet[VD, ED] => Boolean = x => true, 56 | vpred: (VertexId, VD) => 
Boolean = (a, b) => true): Graph[VD, ED] = { 57 | vertices.cache() 58 | // 过滤vertices, 重用partitioner和索引 59 | val newVerts = vertices.mapVertexPartitions(_.filter(vpred)) 60 | // 过滤 triplets 61 | replicatedVertexView.upgrade(vertices, true, true) 62 | val newEdges = replicatedVertexView.edges.filter(epred, vpred) 63 | new GraphImpl(newVerts, replicatedVertexView.withEdges(newEdges)) 64 | } 65 | ``` 66 |   该代码显示,`subgraph`方法的实现分两步:先过滤`VertexRDD`,然后再过滤`EdgeRDD`。如上,过滤`VertexRDD`比较简单,我们重点看过滤`EdgeRDD`的过程。 67 | 68 | ```scala 69 | def filter( 70 | epred: EdgeTriplet[VD, ED] => Boolean, 71 | vpred: (VertexId, VD) => Boolean): EdgeRDDImpl[ED, VD] = { 72 | mapEdgePartitions((pid, part) => part.filter(epred, vpred)) 73 | } 74 | //EdgePartition中的filter方法 75 | def filter( 76 | epred: EdgeTriplet[VD, ED] => Boolean, 77 | vpred: (VertexId, VD) => Boolean): EdgePartition[ED, VD] = { 78 | val builder = new ExistingEdgePartitionBuilder[ED, VD]( 79 | global2local, local2global, vertexAttrs, activeSet) 80 | var i = 0 81 | while (i < size) { 82 | // The user sees the EdgeTriplet, so we can't reuse it and must create one per edge. 83 | val localSrcId = localSrcIds(i) 84 | val localDstId = localDstIds(i) 85 | val et = new EdgeTriplet[VD, ED] 86 | et.srcId = local2global(localSrcId) 87 | et.dstId = local2global(localDstId) 88 | et.srcAttr = vertexAttrs(localSrcId) 89 | et.dstAttr = vertexAttrs(localDstId) 90 | et.attr = data(i) 91 | if (vpred(et.srcId, et.srcAttr) && vpred(et.dstId, et.dstAttr) && epred(et)) { 92 | builder.add(et.srcId, et.dstId, localSrcId, localDstId, et.attr) 93 | } 94 | i += 1 95 | } 96 | builder.toEdgePartition 97 | } 98 | ``` 99 |   因为用户可以看到`EdgeTriplet`的信息,所以我们不能重用`EdgeTriplet`,需要重新创建一个,然后在用`epred`函数处理。这里`localSrcIds,localDstIds,local2global`等前文均有介绍,在此不再赘述。 100 | 101 | ## 3 mask 102 | 103 |   `mask`操作构造一个子图,这个子图包含输入图中包含的顶点和边。它的实现很简单,顶点和边均做`inner join`操作即可。这个操作可以和`subgraph`操作相结合,基于另外一个相关图的特征去约束一个图。 104 | 105 | ```scala 106 | override def mask[VD2: ClassTag, ED2: ClassTag] ( 107 | other: Graph[VD2, ED2]): Graph[VD, ED] = { 108 | val newVerts = vertices.innerJoin(other.vertices) { (vid, v, w) => v } 109 | val newEdges = replicatedVertexView.edges.innerJoin(other.edges) { (src, dst, v, w) => v } 110 | new GraphImpl(newVerts, replicatedVertexView.withEdges(newEdges)) 111 | } 112 | ``` 113 | 114 | ## 4 groupEdges 115 | 116 |   `groupEdges`操作合并多重图中的并行边(如顶点对之间重复的边)。在大量的应用程序中,并行的边可以合并(它们的权重合并)为一条边从而降低图的大小。 117 | 118 | ```scala 119 | override def groupEdges(merge: (ED, ED) => ED): Graph[VD, ED] = { 120 | val newEdges = replicatedVertexView.edges.mapEdgePartitions( 121 | (pid, part) => part.groupEdges(merge)) 122 | new GraphImpl(vertices, replicatedVertexView.withEdges(newEdges)) 123 | } 124 | def groupEdges(merge: (ED, ED) => ED): EdgePartition[ED, VD] = { 125 | val builder = new ExistingEdgePartitionBuilder[ED, VD]( 126 | global2local, local2global, vertexAttrs, activeSet) 127 | var currSrcId: VertexId = null.asInstanceOf[VertexId] 128 | var currDstId: VertexId = null.asInstanceOf[VertexId] 129 | var currLocalSrcId = -1 130 | var currLocalDstId = -1 131 | var currAttr: ED = null.asInstanceOf[ED] 132 | // 迭代处理所有的边 133 | var i = 0 134 | while (i < size) { 135 | //如果源顶点和目的顶点都相同 136 | if (i > 0 && currSrcId == srcIds(i) && currDstId == dstIds(i)) { 137 | // 合并属性 138 | currAttr = merge(currAttr, data(i)) 139 | } else { 140 | // This edge starts a new run of edges 141 | if (i > 0) { 142 | // 添加到builder中 143 | builder.add(currSrcId, currDstId, currLocalSrcId, currLocalDstId, currAttr) 144 | } 145 | // Then 
## 5 应用举例

  下面这组来自官方文档的例子综合演示了`subgraph`和`mask`的用法:先用`subgraph`删除信息缺失的顶点及与其相连的边,再用`mask`把连通分量的计算结果约束到有效子图上。

```scala
// Create an RDD for the vertices
val users: RDD[(VertexId, (String, String))] =
  sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
                       (5L, ("franklin", "prof")), (2L, ("istoica", "prof")),
                       (4L, ("peter", "student"))))
// Create an RDD for edges
val relationships: RDD[Edge[String]] =
  sc.parallelize(Array(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
                       Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"),
                       Edge(4L, 0L, "student"), Edge(5L, 0L, "colleague")))
// Define a default user in case there are relationships with missing users
val defaultUser = ("John Doe", "Missing")
// Build the initial Graph
val graph = Graph(users, relationships, defaultUser)
// Notice that there is a user 0 (for which we have no information) connected to users
// 4 (peter) and 5 (franklin).
graph.triplets.map(
  triplet => triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1
).collect.foreach(println(_))
// Remove missing vertices as well as the edges connected to them
val validGraph = graph.subgraph(vpred = (id, attr) => attr._2 != "Missing")
// The valid subgraph will disconnect users 4 and 5 by removing user 0
validGraph.vertices.collect.foreach(println(_))
validGraph.triplets.map(
  triplet => triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1
).collect.foreach(println(_))

// Run Connected Components
val ccGraph = graph.connectedComponents() // No longer contains missing field
// Restrict the answer to the valid subgraph computed above
val validCCGraph = ccGraph.mask(validGraph)
```

## 6 参考文献

【1】[spark源码](https://github.com/apache/spark)
--------------------------------------------------------------------------------
/graphAlgorithm/PageRank.md:
--------------------------------------------------------------------------------
# PageRank

```scala
import scala.language.postfixOps
import scala.reflect.ClassTag 6 | 7 | import org.apache.spark.graphx._ 8 | import org.apache.spark.internal.Logging 9 | 10 | /** 11 | * PageRank algorithm implementation. There are two implementations of PageRank implemented. 12 | * 13 | * The first implementation uses the standalone [[Graph]] interface and runs PageRank 14 | * for a fixed number of iterations: 15 | * {{{ 16 | * var PR = Array.fill(n)( 1.0 ) 17 | * val oldPR = Array.fill(n)( 1.0 ) 18 | * for( iter <- 0 until numIter ) { 19 | * swap(oldPR, PR) 20 | * for( i <- 0 until n ) { 21 | * PR[i] = alpha + (1 - alpha) * inNbrs[i].map(j => oldPR[j] / outDeg[j]).sum 22 | * } 23 | * } 24 | * }}} 25 | * 26 | * The second implementation uses the [[Pregel]] interface and runs PageRank until 27 | * convergence: 28 | * 29 | * {{{ 30 | * var PR = Array.fill(n)( 1.0 ) 31 | * val oldPR = Array.fill(n)( 0.0 ) 32 | * while( max(abs(PR - oldPr)) > tol ) { 33 | * swap(oldPR, PR) 34 | * for( i <- 0 until n if abs(PR[i] - oldPR[i]) > tol ) { 35 | * PR[i] = alpha + (1 - \alpha) * inNbrs[i].map(j => oldPR[j] / outDeg[j]).sum 36 | * } 37 | * } 38 | * }}} 39 | * 40 | * `alpha` is the random reset probability (typically 0.15), `inNbrs[i]` is the set of 41 | * neighbors which link to `i` and `outDeg[j]` is the out degree of vertex `j`. 42 | * 43 | * Note that this is not the "normalized" PageRank and as a consequence pages that have no 44 | * inlinks will have a PageRank of alpha. 45 | */ 46 | object PageRank extends Logging { 47 | 48 | 49 | /** 50 | * Run PageRank for a fixed number of iterations returning a graph 51 | * with vertex attributes containing the PageRank and edge 52 | * attributes the normalized edge weight. 53 | * 54 | * @tparam VD the original vertex attribute (not used) 55 | * @tparam ED the original edge attribute (not used) 56 | * 57 | * @param graph the graph on which to compute PageRank 58 | * @param numIter the number of iterations of PageRank to run 59 | * @param resetProb the random reset probability (alpha) 60 | * 61 | * @return the graph containing with each vertex containing the PageRank and each edge 62 | * containing the normalized weight. 63 | */ 64 | def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED], numIter: Int, 65 | resetProb: Double = 0.15): Graph[Double, Double] = 66 | { 67 | runWithOptions(graph, numIter, resetProb) 68 | } 69 | 70 | /** 71 | * Run PageRank for a fixed number of iterations returning a graph 72 | * with vertex attributes containing the PageRank and edge 73 | * attributes the normalized edge weight. 74 | * 75 | * @tparam VD the original vertex attribute (not used) 76 | * @tparam ED the original edge attribute (not used) 77 | * 78 | * @param graph the graph on which to compute PageRank 79 | * @param numIter the number of iterations of PageRank to run 80 | * @param resetProb the random reset probability (alpha) 81 | * @param srcId the source vertex for a Personalized Page Rank (optional) 82 | * 83 | * @return the graph containing with each vertex containing the PageRank and each edge 84 | * containing the normalized weight. 
85 | * 86 | */ 87 | def runWithOptions[VD: ClassTag, ED: ClassTag]( 88 | graph: Graph[VD, ED], numIter: Int, resetProb: Double = 0.15, 89 | srcId: Option[VertexId] = None): Graph[Double, Double] = 90 | { 91 | require(numIter > 0, s"Number of iterations must be greater than 0," + 92 | s" but got ${numIter}") 93 | require(resetProb >= 0 && resetProb <= 1, s"Random reset probability must belong" + 94 | s" to [0, 1], but got ${resetProb}") 95 | 96 | val personalized = srcId isDefined 97 | val src: VertexId = srcId.getOrElse(-1L) 98 | 99 | // Initialize the PageRank graph with each edge attribute having 100 | // weight 1/outDegree and each vertex with attribute resetProb. 101 | // When running personalized pagerank, only the source vertex 102 | // has an attribute resetProb. All others are set to 0. 103 | var rankGraph: Graph[Double, Double] = graph 104 | // Associate the degree with each vertex 105 | .outerJoinVertices(graph.outDegrees) { (vid, vdata, deg) => deg.getOrElse(0) } 106 | // Set the weight on the edges based on the degree 107 | .mapTriplets( e => 1.0 / e.srcAttr, TripletFields.Src ) 108 | // Set the vertex attributes to the initial pagerank values 109 | .mapVertices { (id, attr) => 110 | if (!(id != src && personalized)) resetProb else 0.0 111 | } 112 | 113 | def delta(u: VertexId, v: VertexId): Double = { if (u == v) 1.0 else 0.0 } 114 | 115 | var iteration = 0 116 | var prevRankGraph: Graph[Double, Double] = null 117 | while (iteration < numIter) { 118 | rankGraph.cache() 119 | 120 | // Compute the outgoing rank contributions of each vertex, perform local preaggregation, and 121 | // do the final aggregation at the receiving vertices. Requires a shuffle for aggregation. 122 | val rankUpdates = rankGraph.aggregateMessages[Double]( 123 | ctx => ctx.sendToDst(ctx.srcAttr * ctx.attr), _ + _, TripletFields.Src) 124 | 125 | // Apply the final rank updates to get the new ranks, using join to preserve ranks of vertices 126 | // that didn't receive a message. Requires a shuffle for broadcasting updated ranks to the 127 | // edge partitions. 128 | prevRankGraph = rankGraph 129 | val rPrb = if (personalized) { 130 | (src: VertexId, id: VertexId) => resetProb * delta(src, id) 131 | } else { 132 | (src: VertexId, id: VertexId) => resetProb 133 | } 134 | 135 | rankGraph = rankGraph.joinVertices(rankUpdates) { 136 | (id, oldRank, msgSum) => rPrb(src, id) + (1.0 - resetProb) * msgSum 137 | }.cache() 138 | 139 | rankGraph.edges.foreachPartition(x => {}) // also materializes rankGraph.vertices 140 | logInfo(s"PageRank finished iteration $iteration.") 141 | prevRankGraph.vertices.unpersist(false) 142 | prevRankGraph.edges.unpersist(false) 143 | 144 | iteration += 1 145 | } 146 | 147 | rankGraph 148 | } 149 | 150 | /** 151 | * Run a dynamic version of PageRank returning a graph with vertex attributes containing the 152 | * PageRank and edge attributes containing the normalized edge weight. 153 | * 154 | * @tparam VD the original vertex attribute (not used) 155 | * @tparam ED the original edge attribute (not used) 156 | * 157 | * @param graph the graph on which to compute PageRank 158 | * @param tol the tolerance allowed at convergence (smaller => more accurate). 159 | * @param resetProb the random reset probability (alpha) 160 | * 161 | * @return the graph containing with each vertex containing the PageRank and each edge 162 | * containing the normalized weight. 
163 | */ 164 | def runUntilConvergence[VD: ClassTag, ED: ClassTag]( 165 | graph: Graph[VD, ED], tol: Double, resetProb: Double = 0.15): Graph[Double, Double] = 166 | { 167 | runUntilConvergenceWithOptions(graph, tol, resetProb) 168 | } 169 | 170 | /** 171 | * Run a dynamic version of PageRank returning a graph with vertex attributes containing the 172 | * PageRank and edge attributes containing the normalized edge weight. 173 | * 174 | * @tparam VD the original vertex attribute (not used) 175 | * @tparam ED the original edge attribute (not used) 176 | * 177 | * @param graph the graph on which to compute PageRank 178 | * @param tol the tolerance allowed at convergence (smaller => more accurate). 179 | * @param resetProb the random reset probability (alpha) 180 | * @param srcId the source vertex for a Personalized Page Rank (optional) 181 | * 182 | * @return the graph containing with each vertex containing the PageRank and each edge 183 | * containing the normalized weight. 184 | */ 185 | def runUntilConvergenceWithOptions[VD: ClassTag, ED: ClassTag]( 186 | graph: Graph[VD, ED], tol: Double, resetProb: Double = 0.15, 187 | srcId: Option[VertexId] = None): Graph[Double, Double] = 188 | { 189 | require(tol >= 0, s"Tolerance must be no less than 0, but got ${tol}") 190 | require(resetProb >= 0 && resetProb <= 1, s"Random reset probability must belong" + 191 | s" to [0, 1], but got ${resetProb}") 192 | 193 | val personalized = srcId.isDefined 194 | val src: VertexId = srcId.getOrElse(-1L) 195 | 196 | // Initialize the pagerankGraph with each edge attribute 197 | // having weight 1/outDegree and each vertex with attribute 1.0. 198 | val pagerankGraph: Graph[(Double, Double), Double] = graph 199 | // Associate the degree with each vertex 200 | .outerJoinVertices(graph.outDegrees) { 201 | (vid, vdata, deg) => deg.getOrElse(0) 202 | } 203 | // Set the weight on the edges based on the degree 204 | .mapTriplets( e => 1.0 / e.srcAttr ) 205 | // Set the vertex attributes to (initialPR, delta = 0) 206 | .mapVertices { (id, attr) => 207 | if (id == src) (resetProb, Double.NegativeInfinity) else (0.0, 0.0) 208 | } 209 | .cache() 210 | 211 | // Define the three functions needed to implement PageRank in the GraphX 212 | // version of Pregel 213 | def vertexProgram(id: VertexId, attr: (Double, Double), msgSum: Double): (Double, Double) = { 214 | val (oldPR, lastDelta) = attr 215 | val newPR = oldPR + (1.0 - resetProb) * msgSum 216 | (newPR, newPR - oldPR) 217 | } 218 | 219 | def personalizedVertexProgram(id: VertexId, attr: (Double, Double), 220 | msgSum: Double): (Double, Double) = { 221 | val (oldPR, lastDelta) = attr 222 | var teleport = oldPR 223 | val delta = if (src==id) 1.0 else 0.0 224 | teleport = oldPR*delta 225 | 226 | val newPR = teleport + (1.0 - resetProb) * msgSum 227 | val newDelta = if (lastDelta == Double.NegativeInfinity) newPR else newPR - oldPR 228 | (newPR, newDelta) 229 | } 230 | 231 | def sendMessage(edge: EdgeTriplet[(Double, Double), Double]) = { 232 | if (edge.srcAttr._2 > tol) { 233 | Iterator((edge.dstId, edge.srcAttr._2 * edge.attr)) 234 | } else { 235 | Iterator.empty 236 | } 237 | } 238 | 239 | def messageCombiner(a: Double, b: Double): Double = a + b 240 | 241 | // The initial message received by all vertices in PageRank 242 | val initialMessage = if (personalized) 0.0 else resetProb / (1.0 - resetProb) 243 | 244 | // Execute a dynamic version of Pregel. 
    val vp = if (personalized) {
      (id: VertexId, attr: (Double, Double), msgSum: Double) =>
        personalizedVertexProgram(id, attr, msgSum)
    } else {
      (id: VertexId, attr: (Double, Double), msgSum: Double) =>
        vertexProgram(id, attr, msgSum)
    }

    Pregel(pagerankGraph, initialMessage, activeDirection = EdgeDirection.Out)(
      vp, sendMessage, messageCombiner)
      .mapVertices((vid, attr) => attr._1)
  } // end of deltaPageRank

}
```
--------------------------------------------------------------------------------
/operators/aggregate.md:
--------------------------------------------------------------------------------
# 聚合操作

  `GraphX`中提供的聚合操作有`aggregateMessages`、`collectNeighborIds`和`collectNeighbors`三个,其中`aggregateMessages`在`GraphImpl`中实现,`collectNeighborIds`和`collectNeighbors`在`GraphOps`中实现。下面分别介绍这几个方法。

# 1 `aggregateMessages`

## 1.1 `aggregateMessages`接口

  `aggregateMessages`是`GraphX`最重要的`API`,用于替换`mapReduceTriplets`。目前`mapReduceTriplets`最终也是通过`aggregateMessages`来实现的。它的主要功能是沿着边向相邻顶点发送消息,并在每个顶点处合并收到的消息,最后返回`messageRDD`。`aggregateMessages`的接口如下:

```scala
def aggregateMessages[A: ClassTag](
    sendMsg: EdgeContext[VD, ED, A] => Unit,
    mergeMsg: (A, A) => A,
    tripletFields: TripletFields = TripletFields.All)
  : VertexRDD[A] = {
  aggregateMessagesWithActiveSet(sendMsg, mergeMsg, tripletFields, None)
}
```

  该接口有三个参数,分别为发消息函数、合并消息函数,以及声明发消息时会用到三元组中哪些字段的`tripletFields`。

- `sendMsg`:发消息函数。下面是一个发消息函数的例子(摘自一个`K-Core`算法的实现):

```scala
private def sendMsg(ctx: EdgeContext[KCoreVertex, Int, Map[Int, Int]]): Unit = {
  ctx.sendToDst(Map(ctx.srcAttr.preKCore -> -1, ctx.srcAttr.curKCore -> 1))
  ctx.sendToSrc(Map(ctx.dstAttr.preKCore -> -1, ctx.dstAttr.curKCore -> 1))
}
```

- `mergeMsg`:合并消息函数。该函数在`Map`阶段用于合并每个边分区内同一个顶点收到的多条消息,在`Reduce`阶段用于合并不同分区中`vertexId`相同的消息。

- `tripletFields`:声明`sendMsg`会访问三元组的哪些字段(如`TripletFields.Src`、`TripletFields.All`)。`GraphX`据此决定需要把哪些顶点属性复制到边分区,从而避免搬运用不到的数据。下文给出一个最小调用示例。
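  为了直观感受这三个参数如何配合,下面是一个最小调用示意(假设已有某个图`graph`,变量名仅为示例):用`aggregateMessages`统计每个顶点的入度。由于`sendMsg`不读取任何顶点属性,`tripletFields`可以声明为`TripletFields.None`,让`GraphX`跳过顶点属性的复制:

```scala
// 最小示例:统计每个顶点的入度
val inDeg: VertexRDD[Int] = graph.aggregateMessages[Int](
  ctx => ctx.sendToDst(1), // 沿每条边向目的顶点发送消息 1
  _ + _,                   // 合并同一顶点收到的所有消息
  TripletFields.None       // sendMsg 未访问 srcAttr/dstAttr,无需搬运顶点属性
)
```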
## 1.2 `aggregateMessages`处理流程

  `aggregateMessages`方法分为`Map`和`Reduce`两个阶段,下面我们分别就这两个阶段说明。

### 1.2.1 Map阶段

  从入口函数进入`aggregateMessagesWithActiveSet`函数。该函数首先使用`VertexRDD[VD]`更新`replicatedVertexView`,只更新其中`vertexRDD`的`attr`对象。如[构建图](../build-graph.md)中介绍的,`replicatedVertexView`是点和边的视图,点的属性有变化时,要同步更新边分区中缓存的顶点`attr`。

```scala
replicatedVertexView.upgrade(vertices, tripletFields.useSrc, tripletFields.useDst)
val view = activeSetOpt match {
  case Some((activeSet, _)) =>
    // 返回只包含活跃顶点的replicatedVertexView
    replicatedVertexView.withActiveSet(activeSet)
  case None =>
    replicatedVertexView
}
```

  程序然后会对`replicatedVertexView`的`edgeRDD`做`mapPartitions`操作,所有的操作都在每个边分区的迭代中完成,如下面的代码:

```scala
val preAgg = view.edges.partitionsRDD.mapPartitions(_.flatMap {
  case (pid, edgePartition) =>
    // 选择 scan 方法
    val activeFraction = edgePartition.numActives.getOrElse(0) / edgePartition.indexSize.toFloat
    activeDirectionOpt match {
      case Some(EdgeDirection.Both) =>
        if (activeFraction < 0.8) {
          edgePartition.aggregateMessagesIndexScan(sendMsg, mergeMsg, tripletFields,
            EdgeActiveness.Both)
        } else {
          edgePartition.aggregateMessagesEdgeScan(sendMsg, mergeMsg, tripletFields,
            EdgeActiveness.Both)
        }
      case Some(EdgeDirection.Either) =>
        edgePartition.aggregateMessagesEdgeScan(sendMsg, mergeMsg, tripletFields,
          EdgeActiveness.Either)
      case Some(EdgeDirection.Out) =>
        if (activeFraction < 0.8) {
          edgePartition.aggregateMessagesIndexScan(sendMsg, mergeMsg, tripletFields,
            EdgeActiveness.SrcOnly)
        } else {
          edgePartition.aggregateMessagesEdgeScan(sendMsg, mergeMsg, tripletFields,
            EdgeActiveness.SrcOnly)
        }
      case Some(EdgeDirection.In) =>
        edgePartition.aggregateMessagesEdgeScan(sendMsg, mergeMsg, tripletFields,
          EdgeActiveness.DstOnly)
      case _ => // None
        edgePartition.aggregateMessagesEdgeScan(sendMsg, mergeMsg, tripletFields,
          EdgeActiveness.Neither)
    }
})
```

  在分区内,根据`activeFraction`的大小选择进入`aggregateMessagesEdgeScan`还是`aggregateMessagesIndexScan`处理:前者顺序地扫描分区内所有的边,后者先按源顶点索引过滤,再扫描命中的边。我们重点分析`aggregateMessagesEdgeScan`。

```scala
def aggregateMessagesEdgeScan[A: ClassTag](
    sendMsg: EdgeContext[VD, ED, A] => Unit,
    mergeMsg: (A, A) => A,
    tripletFields: TripletFields,
    activeness: EdgeActiveness): Iterator[(VertexId, A)] = {
  val ctx = new AggregatingEdgeContext[VD, ED, A](mergeMsg, aggregates, bitset)
  var i = 0
  while (i < size) {
    val localSrcId = localSrcIds(i)
    val srcId = local2global(localSrcId)
    val localDstId = localDstIds(i)
    val dstId = local2global(localDstId)
    val srcAttr = if (tripletFields.useSrc) vertexAttrs(localSrcId) else null.asInstanceOf[VD]
    val dstAttr = if (tripletFields.useDst) vertexAttrs(localDstId) else null.asInstanceOf[VD]
    ctx.set(srcId, dstId, localSrcId, localDstId, srcAttr, dstAttr, data(i))
    sendMsg(ctx)
    i += 1
  }
  // 返回 (全局顶点id, 聚合后的消息) 的迭代器
  bitset.iterator.map { localId => (local2global(localId), aggregates(localId)) }
}
```

  该方法由两步组成,分别是获取顶点相关信息,以及发送消息。

- 获取顶点相关信息

  在前文介绍`edge partition`时,我们知道它包含`localSrcIds`、`localDstIds`、`data`、`index`、`global2local`、`local2global`、`vertexAttrs`这几个重要的数据结构。其中`localSrcIds`、`localDstIds`分别表示源顶点、目的顶点在当前分区中的本地索引。遍历第`i`条边时,先由`localSrcIds(i)`拿到源顶点的本地索引,再以它为下标从`local2global`数组中取得全局的`srcId`;顶点属性从`vertexAttrs`中取,边属性从`data`中取。

- 发送消息

  发消息前会根据接口中传入的`tripletFields`决定是否需要填充源顶点、目的顶点的属性。发消息的过程是:每处理一条边,就向以接收方顶点`localId`为下标的`aggregates`数组写入消息;如果`bitset`显示该位置已经存在消息,则先执行合并函数`mergeMsg`。

```scala
override def sendToSrc(msg: A) {
  send(_localSrcId, msg)
}
override def sendToDst(msg: A) {
  send(_localDstId, msg)
}
@inline private def send(localId: Int, msg: A) {
  if (bitset.get(localId)) {
    aggregates(localId) = mergeMsg(aggregates(localId), msg)
  } else {
    aggregates(localId) = msg
    bitset.set(localId)
  }
}
```

  各顶点的发消息过程互相独立:每条边只根据发送方向,向以相邻顶点`localId`为下标的数组写入或合并数据,因此可以并行执行。`Map`阶段最后返回消息`RDD`:`messages: RDD[(VertexId, VD2)]`。
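  为了说明这种"按`localId`就地聚合"的做法,下面是一个脱离`GraphX`的简化示意(非源码,函数名和消息类型均为假设),用普通数组和布尔数组代替源码中的`aggregates`与`bitset`:

```scala
import scala.reflect.ClassTag

// 简化示意(非 GraphX 源码):msgs 的每个元素是 (localId, 消息)
def localAggregate[A: ClassTag](numLocal: Int, msgs: Seq[(Int, A)],
    merge: (A, A) => A): Array[A] = {
  val aggregates = new Array[A](numLocal)
  val seen = new Array[Boolean](numLocal) // 对应源码中的 bitset
  for ((localId, msg) <- msgs) {
    if (seen(localId)) {
      aggregates(localId) = merge(aggregates(localId), msg) // 该位置已有消息则合并
    } else {
      aggregates(localId) = msg // 第一条消息直接写入并置位
      seen(localId) = true
    }
  }
  aggregates
}
```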
  `Map`阶段的执行流程如下图所示:

*(插图:graphx_aggmsg_map,Map阶段执行流程示意图)*

### 1.2.2 Reduce阶段

  `Reduce`阶段的实现就是调用下面的代码:

```scala
vertices.aggregateUsingIndex(preAgg, mergeMsg)

override def aggregateUsingIndex[VD2: ClassTag](
    messages: RDD[(VertexId, VD2)], reduceFunc: (VD2, VD2) => VD2): VertexRDD[VD2] = {
  val shuffled = messages.partitionBy(this.partitioner.get)
  val parts = partitionsRDD.zipPartitions(shuffled, true) { (thisIter, msgIter) =>
    thisIter.map(_.aggregateUsingIndex(msgIter, reduceFunc))
  }
  this.withPartitionsRDD[VD2](parts)
}
```

  上面的代码分两步实现:

- 1 对`messages`重新分区,分区器使用`VertexRDD`的`partitioner`,然后使用`zipPartitions`把消息分区和顶点分区对齐合并。

- 2 对位合并`attr`,聚合函数使用传入的`mergeMsg`函数。

```scala
def aggregateUsingIndex[VD2: ClassTag](
    iter: Iterator[Product2[VertexId, VD2]],
    reduceFunc: (VD2, VD2) => VD2): Self[VD2] = {
  val newMask = new BitSet(self.capacity)
  val newValues = new Array[VD2](self.capacity)
  iter.foreach { product =>
    val vid = product._1
    val vdata = product._2
    val pos = self.index.getPos(vid)
    if (pos >= 0) {
      if (newMask.get(pos)) {
        newValues(pos) = reduceFunc(newValues(pos), vdata)
      } else { // otherwise just store the new value
        newMask.set(pos)
        newValues(pos) = vdata
      }
    }
  }
  this.withValues(newValues).withMask(newMask)
}
```

  注意这里迭代的是消息分区`messagePartition`,而不是全部顶点:并不是每个顶点都会收到消息,消息集合通常比顶点集合小,迭代它更快。

  这段代码表示:根据`vertexId`从`index`中取到其下标`pos`;若`newMask`显示该位置已有值(即已收到过消息),就用`mergeMsg`(即`reduceFunc`)合并,否则直接赋值并置位`newMask`。
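  `aggregateUsingIndex`的合并语义可以用下面的简化示意来理解(非源码,这里用普通`Map`代替源码中的哈希索引):先按`vid`查下标`pos`,再用`mask`记录哪些位置收到过消息:

```scala
import scala.reflect.ClassTag

// 简化示意(非 GraphX 源码):按 vid -> pos 的索引合并消息
def mergeByIndex[A: ClassTag](index: Map[Long, Int], capacity: Int,
    msgs: Iterator[(Long, A)], reduce: (A, A) => A): (Array[A], Array[Boolean]) = {
  val values = new Array[A](capacity)
  val mask = new Array[Boolean](capacity)
  for ((vid, msg) <- msgs; pos <- index.get(vid)) { // 不在索引中的 vid 被跳过
    if (mask(pos)) values(pos) = reduce(values(pos), msg)
    else { values(pos) = msg; mask(pos) = true }
  }
  (values, mask)
}
```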
  `Reduce`阶段的过程如下图所示:

*(插图:graphx_aggmsg_map,Reduce阶段执行过程示意图)*

## 1.3 举例

  下面的例子计算每个用户的追随者(即`followers`)中比该用户年龄大的那些人的平均年龄。

```scala
// Import random graph generation library
import org.apache.spark.graphx.util.GraphGenerators
// Create a graph with "age" as the vertex property. Here we use a random graph for simplicity.
val graph: Graph[Double, Int] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapVertices( (id, _) => id.toDouble )
// Compute the number of older followers and their total age
val olderFollowers: VertexRDD[(Int, Double)] = graph.aggregateMessages[(Int, Double)](
  triplet => { // Map Function
    if (triplet.srcAttr > triplet.dstAttr) {
      // Send message to destination vertex containing counter and age
      triplet.sendToDst((1, triplet.srcAttr))
    }
  },
  // Add counter and age
  (a, b) => (a._1 + b._1, a._2 + b._2) // Reduce Function
)
// Divide total age by number of older followers to get average age of older followers
val avgAgeOfOlderFollowers: VertexRDD[Double] =
  olderFollowers.mapValues( (id, value) => value match { case (count, totalAge) => totalAge / count } )
// Display the results
avgAgeOfOlderFollowers.collect.foreach(println(_))
```

# 2 `collectNeighbors`

  该方法的作用是收集每个顶点的邻居顶点的顶点`id`和顶点属性。

```scala
def collectNeighbors(edgeDirection: EdgeDirection): VertexRDD[Array[(VertexId, VD)]] = {
  val nbrs = edgeDirection match {
    case EdgeDirection.Either =>
      graph.aggregateMessages[Array[(VertexId, VD)]](
        ctx => {
          ctx.sendToSrc(Array((ctx.dstId, ctx.dstAttr)))
          ctx.sendToDst(Array((ctx.srcId, ctx.srcAttr)))
        },
        (a, b) => a ++ b, TripletFields.All)
    case EdgeDirection.In =>
      graph.aggregateMessages[Array[(VertexId, VD)]](
        ctx => ctx.sendToDst(Array((ctx.srcId, ctx.srcAttr))),
        (a, b) => a ++ b, TripletFields.Src)
    case EdgeDirection.Out =>
      graph.aggregateMessages[Array[(VertexId, VD)]](
        ctx => ctx.sendToSrc(Array((ctx.dstId, ctx.dstAttr))),
        (a, b) => a ++ b, TripletFields.Dst)
    case EdgeDirection.Both =>
      throw new SparkException("collectEdges does not support EdgeDirection.Both. Use" +
        "EdgeDirection.Either instead.")
  }
  graph.vertices.leftJoin(nbrs) { (vid, vdata, nbrsOpt) =>
    nbrsOpt.getOrElse(Array.empty[(VertexId, VD)])
  }
}
```

  从上面的代码可以看到,第一步是根据`EdgeDirection`确定调用哪种`aggregateMessages`实现聚合操作。我们以`EdgeDirection.Either`的情况来说明,此时`aggregateMessages`中发送消息的函数为:

```scala
ctx => {
  ctx.sendToSrc(Array((ctx.dstId, ctx.dstAttr)))
  ctx.sendToDst(Array((ctx.srcId, ctx.srcAttr)))
},
```

  这个函数在处理每条边时都会同时向源顶点和目的顶点发送消息,消息内容分别为`(目的顶点id,目的顶点属性)`、`(源顶点id,源顶点属性)`。为什么要这样处理呢?因为每条边都由两个顶点组成:对这条边而言,需要向源顶点发送目的顶点的信息来记录它们之间的邻居关系,同理也要向目的顶点发送源顶点的信息。

  `merge`函数是一个集合合并操作,它把同一个顶点收到的所有邻居信息合并到一个数组中,如下所示:

```scala
(a, b) => a ++ b
```

  通过`aggregateMessages`获得包含邻居关系信息的`VertexRDD`后,把它和现有的`vertices`作`leftJoin`操作(没有邻居的顶点得到空数组),就得到了每个顶点的邻居信息。
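  下面是一个简单的使用示意(假设已有`graph: Graph[String, Int]`,变量名仅为示例)。需要注意,官方文档提醒这类收集邻居的操作会复制顶点信息、通信开销较大,顶点度很高时应谨慎使用:

```scala
// 收集每个顶点沿任意方向的全部邻居 (id, 属性)
val nbrs: VertexRDD[Array[(VertexId, String)]] =
  graph.collectNeighbors(EdgeDirection.Either)

nbrs.collect.foreach { case (vid, arr) =>
  println(s"$vid 的邻居: " + arr.map(_._1).mkString("[", ", ", "]"))
}
```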
# 3 `collectNeighborIds`

  该方法的作用是收集每个顶点的邻居顶点的顶点`id`。它的实现和`collectNeighbors`非常相似。

```scala
def collectNeighborIds(edgeDirection: EdgeDirection): VertexRDD[Array[VertexId]] = {
  val nbrs =
    if (edgeDirection == EdgeDirection.Either) {
      graph.aggregateMessages[Array[VertexId]](
        ctx => { ctx.sendToSrc(Array(ctx.dstId)); ctx.sendToDst(Array(ctx.srcId)) },
        _ ++ _, TripletFields.None)
    } else if (edgeDirection == EdgeDirection.Out) {
      graph.aggregateMessages[Array[VertexId]](
        ctx => ctx.sendToSrc(Array(ctx.dstId)),
        _ ++ _, TripletFields.None)
    } else if (edgeDirection == EdgeDirection.In) {
      graph.aggregateMessages[Array[VertexId]](
        ctx => ctx.sendToDst(Array(ctx.srcId)),
        _ ++ _, TripletFields.None)
    } else {
      throw new SparkException("It doesn't make sense to collect neighbor ids without a " +
        "direction. (EdgeDirection.Both is not supported; use EdgeDirection.Either instead.)")
    }
  graph.vertices.leftZipJoin(nbrs) { (vid, vdata, nbrsOpt) =>
    nbrsOpt.getOrElse(Array.empty[VertexId])
  }
}
```

  和`collectNeighbors`的实现不同的是,这里的`sendMsg`函数只向源顶点和目的顶点发送顶点`id`,不读取任何顶点属性,所以`tripletFields`统一使用`TripletFields.None`。其它实现基本一致。

```scala
ctx => { ctx.sendToSrc(Array(ctx.dstId)); ctx.sendToDst(Array(ctx.srcId)) }
```

# 4 参考文献

【1】[Graphx:构建graph和聚合消息](https://github.com/shijinkui/spark_study/blob/master/spark_graphx_analyze.markdown)

【2】[spark源码](https://github.com/apache/spark)
--------------------------------------------------------------------------------