├── README.md
├── pics
│   └── pci-001.png
└── sql
    └── catalyst
        └── src
            └── main
                └── scala
                    └── org
                        └── apache
                            └── spark
                                └── sql
                                    └── catalyst
                                        └── analysis
                                            └── ResolveHints.scala

/README.md:
--------------------------------------------------------------------------------
# Scenario
When analyzing data with Spark SQL, the most painful problem a join can run into is data skew. If a large table is joined to a small one, things are not too bad: a MAPJOIN (broadcast join) can break the skew, and Spark enables MAPJOIN automatically based on the spark.sql.autoBroadcastJoinThreshold setting. BUT when both tables are large, MAPJOIN can no longer help.


# Using a custom hint
The usual way to handle join-induced skew is to process the skewed keys separately and union the results back in at the end. The question is: how do you express that in plain SQL? A hand-written version is sketched below.
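To make this concrete, here is a minimal sketch of the hand-written workaround, assuming a SparkSession named `spark` and the example tables `leftTB`/`rightTB` used later in this README, with skewed `leftTB.id` values 5 and 6:

```
// Manual skew handling (sketch): keep the non-skewed keys in a normal join,
// broadcast-join the skewed keys separately, then stitch the two halves
// back together with UNION ALL.
val manual = spark.sql(
  """
    |SELECT f1, f2, f3, f4
    |FROM leftTB t1 LEFT JOIN rightTB t2 ON t1.id = t2.id
    |WHERE t1.id NOT IN (5, 6)
    |UNION ALL
    |SELECT /*+ MAPJOIN(t2) */ f1, f2, f3, f4
    |FROM leftTB t1 JOIN rightTB t2 ON t1.id = t2.id
    |WHERE t1.id IN (5, 6)
  """.stripMargin)
```

It works, but the pattern has to be rewritten by hand for every skewed query.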
This is where a custom hint comes in handy: adding one means extending the logic of the ResolveHints analyzer rule, and the change is fairly small — the full source is in this repository. The custom hint defined here is called SKEWED_JOIN.
Usage:

```SKEWED_JOIN(join_key(leftTB.field, rightTB.field), skewed_values('value1', 'value2'))```

Suppose we have the SQL:
```SELECT f1, f2, f3, f4 FROM leftTB t1 LEFT JOIN rightTB t2 on t1.id=t2.id```

leftTB row count: ```100 million```

rightTB row count: ```50 million```

Join key: ```leftTB.id = rightTB.id```

The analyzed plan:

```
Project [f1#1, f2#2, f3#4, f4#5]
+- Join LeftOuter, (id#0 = id#3)
   :- SubqueryAlias t1
   :  +- SubqueryAlias leftTB
   :     +- Relation[id#0,f1#1,f2#2] parquet
   +- SubqueryAlias t2
      +- SubqueryAlias rightTB
         +- Relation[id#3,f3#4,f4#5] parquet
```

Because the values 5 and 6 appear extremely often in the leftTB.id column, this join skews badly. (Note: the value distribution of rightTB.id is assumed to be normal; if it is also skewed, you hit a different problem — data explosion.) So we use the custom hint to handle the skew.

The SQL now becomes:

```
SELECT /*+ SKEWED_JOIN(join_key(leftTB.id,rightTB.id),skewed_values(5,6)) */ f1, f2, f3, f4 FROM leftTB t1 LEFT JOIN rightTB t2 on t1.id=t2.id
```

The analyzed plan:

```
Project [f1#1, f2#2, f3#4, f4#5]
+- ResolvedHint none
   +- Union
      :- Join LeftOuter, (id#0 = id#3)
      :  :- Filter NOT id#0 IN (5, 6)
      :  :  +- SubqueryAlias t1
      :  :     +- SubqueryAlias leftTB
      :  :        +- Relation[id#0,f1#1,f2#2] parquet
      :  +- Filter NOT id#3 IN (5, 6)
      :     +- SubqueryAlias t2
      :        +- SubqueryAlias rightTB
      :           +- Relation[id#3,f3#4,f4#5] parquet
      +- Join Inner, (id#0 = id#3)
         :- ResolvedHint (broadcast)
         :  +- Filter id#0 IN (5, 6)
         :     +- SubqueryAlias t1
         :        +- SubqueryAlias leftTB
         :           +- Relation[id#0,f1#1,f2#2] parquet
         +- ResolvedHint (broadcast)
            +- Filter id#3 IN (5, 6)
               +- SubqueryAlias t2
                  +- SubqueryAlias rightTB
                     +- Relation[id#3,f3#4,f4#5] parquet
```

From the plan we can see that the SKEWED_JOIN hint has split the tree into two joins: the skew-causing values are filtered out and handled with a separate MAPJOIN, and the two branches are unioned at the end.
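To verify the rewrite yourself, one quick way (assuming a Spark build that includes the modified ResolveHints rule and a SparkSession named `spark`) is to print the analyzed plan of the hinted query:

```
// Print the analyzed plan of the hinted query; the Union of the two joins
// shown above should appear in the output.
val hinted = spark.sql(
  """
    |SELECT /*+ SKEWED_JOIN(join_key(leftTB.id, rightTB.id), skewed_values(5, 6)) */
    |       f1, f2, f3, f4
    |FROM leftTB t1 LEFT JOIN rightTB t2 ON t1.id = t2.id
  """.stripMargin)
println(hinted.queryExecution.analyzed)
```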
# Execution result
![alt image](https://github.com/frb502/spark-skewed-join-hint/blob/master/pics/pci-001.png?raw=true)

# FAQ
Q: Does the rewritten join produce the same result as the original join?

A: The SKEWED_JOIN hint only filters the few skew-causing values out into a separate inner join and unions the result back in, so in principle the final result is not affected.


# Jianshu
[https://www.jianshu.com/p/ea52f3801d7b]

--------------------------------------------------------------------------------
/pics/pci-001.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/frb502/spark-skewed-join-hint/28c3ac5601f938d1a8c61548cf09f7ff998046c2/pics/pci-001.png
--------------------------------------------------------------------------------
/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveHints.scala:
--------------------------------------------------------------------------------
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.catalyst.analysis

import java.util.Locale

import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.catalyst.plans.Inner
import org.apache.spark.sql.catalyst.plans.logical._
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.catalyst.trees.CurrentOrigin
import org.apache.spark.sql.internal.SQLConf


/**
 * Collection of rules related to hints. The hints currently handled here are the broadcast join
 * hint and the custom SKEWED_JOIN hint.
 *
 * Note that this is separated into two rules because in the future we might introduce new hint
 * rules that have different ordering requirements from broadcast.
 */
object ResolveHints {

  /**
   * For the broadcast hint, we accept "BROADCAST", "BROADCASTJOIN", and "MAPJOIN", and a sequence
   * of relation aliases can be specified in the hint. A broadcast hint plan node will be inserted
   * on top of any relation (that is not aliased differently), subquery, or common table expression
   * that matches the specified name.
   *
   * The hint resolution works by recursively traversing down the query plan to find a relation or
   * subquery that matches one of the specified broadcast aliases. The traversal does not go past
   * any existing broadcast hints or subquery aliases.
   *
   * This rule must happen before common table expressions.
   */
  class ResolveBroadcastHints(conf: SQLConf) extends Rule[LogicalPlan] {
    private val BROADCAST_HINT_NAMES = Set("BROADCAST", "BROADCASTJOIN", "MAPJOIN")

    // SKEWED_JOIN(join_key(left.field, right.field), skewed_values('value1', 'value2'))
    private val SKEWED_JOIN = "SKEWED_JOIN"

    def resolver: Resolver = conf.resolver

    private def applyBroadcastHint(plan: LogicalPlan, toBroadcast: Set[String]): LogicalPlan = {
      // Whether to continue recursing down the tree
      var recurse = true

      val newNode = CurrentOrigin.withOrigin(plan.origin) {
        plan match {
          case u: UnresolvedRelation if toBroadcast.exists(resolver(_, u.tableIdentifier.table)) =>
            ResolvedHint(plan, HintInfo(broadcast = true))
          case r: SubqueryAlias if toBroadcast.exists(resolver(_, r.alias)) =>
            ResolvedHint(plan, HintInfo(broadcast = true))

          case _: ResolvedHint | _: View | _: With | _: SubqueryAlias =>
            // Don't traverse down these nodes.
            // For an existing broadcast hint, there is no point going down (if we do, we either
            // won't change the structure, or will introduce another broadcast hint that is
            // useless).
            // The rest (view, with, subquery) indicates different scopes that we shouldn't
            // traverse down. Note that technically when this rule is executed, we haven't
            // completed view resolution yet, so the view branch should be dead code. It is left
            // here to be more future proof in case we change the way we do view resolution.
            recurse = false
            plan

          case _ =>
            plan
        }
      }

      if ((plan fastEquals newNode) && recurse) {
        newNode.mapChildren(child => applyBroadcastHint(child, toBroadcast))
      } else {
        newNode
      }
    }

    private def getTbAlias(plan: LogicalPlan, tableName: String): String = {
      plan.map(lp => lp)
        .filter(_.isInstanceOf[SubqueryAlias])
        .map(_.asInstanceOf[SubqueryAlias])
        .filter(_.child.isInstanceOf[UnresolvedRelation])
        .filter { sa =>
          sa.child.asInstanceOf[UnresolvedRelation].tableName == tableName
        }.map(s => s"${s.alias}.").headOption.getOrElse("")
    }

    private def applySkewedJoinHint(plan: LogicalPlan, skewedJoin: SkewedJoin): LogicalPlan = {
      var recurse = true
      val newNode = CurrentOrigin.withOrigin(plan.origin) {
        plan match {
          case Join(left, right, joinType, condition) if condition.isDefined =>
            val joinKey = skewedJoin.joinKey
            val hasLeftTb = left.find { lp =>
              lp.isInstanceOf[UnresolvedRelation] &&
                lp.asInstanceOf[UnresolvedRelation].tableName == joinKey.leftTable
            }.isDefined

            val hasRightTb = right.find { lp =>
              lp.isInstanceOf[UnresolvedRelation] &&
                lp.asInstanceOf[UnresolvedRelation].tableName == joinKey.rightTable
            }.isDefined

            val leftField = getTbAlias(left, joinKey.leftTable) + joinKey.leftField
            val rightField = getTbAlias(right, joinKey.rightTable) + joinKey.rightField
            val joinKeys = condition.get.map(expr => expr)
              .filter(_.isInstanceOf[UnresolvedAttribute])
              .map(_.asInstanceOf[UnresolvedAttribute].name)
              .filter(n => n.endsWith(joinKey.leftField) || n.endsWith(joinKey.rightField))

            val newPlan = if (hasLeftTb && hasRightTb && joinKeys.length >= 2) {
              val inList = skewedJoin.skewedValues.map(Literal(_))
              val left1 = Filter(Not(In(UnresolvedAttribute(leftField), inList)), left)
              val right1 = Filter(Not(In(UnresolvedAttribute(rightField), inList)), right)
              val left2 = Filter(In(UnresolvedAttribute(leftField), inList), left)
              val right2 = Filter(In(UnresolvedAttribute(rightField), inList), right)

              val join1 = Join(left1, right1, joinType, condition)
              // use mapjoin for the skewed values
              val join2 = Join(ResolvedHint(left2, HintInfo(broadcast = true)),
                ResolvedHint(right2, HintInfo(broadcast = true)),
                Inner, condition)
              Union(Seq(join1, join2))
            } else plan
            ResolvedHint(newPlan)
          case _: ResolvedHint | _: View | _: With | _: SubqueryAlias =>
            recurse = false
            plan

          case _ =>
            plan
        }
      }
      if ((plan fastEquals newNode) && recurse) {
        newNode.mapChildren(child => applySkewedJoinHint(child, skewedJoin))
      } else {
        newNode
      }
    }

    def apply(plan: LogicalPlan): LogicalPlan = {
      val newNode = plan transformUp {
        case h: UnresolvedHint if BROADCAST_HINT_NAMES.contains(h.name.toUpperCase(Locale.ROOT)) =>
          if (h.parameters.isEmpty) {
            // If there is no table alias specified, turn the entire subtree into a BroadcastHint.
            ResolvedHint(h.child, HintInfo(broadcast = true))
          } else {
            // Otherwise, find within the subtree query plans that should be broadcasted.
            applyBroadcastHint(h.child, h.parameters.map {
              case tableName: String => tableName
              case tableId: UnresolvedAttribute => tableId.name
              case unsupported => throw new AnalysisException("Broadcast hint parameter should be" +
                s" an identifier or string but was $unsupported (${unsupported.getClass})")
            }.toSet)
          }
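
        // Example: a query such as
        //   SELECT /*+ SKEWED_JOIN(join_key(leftTB.id, rightTB.id), skewed_values(5, 6)) */ ...
        // reaches this rule as UnresolvedHint("SKEWED_JOIN", parameters, child). The two
        // function-style parameters arrive as UnresolvedFunction("join_key", ...) and
        // UnresolvedFunction("skewed_values", ...); the case below unpacks them into
        // SkewedJoin(JoinKey("leftTB.id", "rightTB.id"), Seq("5", "6")) before rewriting the
        // child plan with applySkewedJoinHint.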
        case h: UnresolvedHint if SKEWED_JOIN == h.name.toUpperCase(Locale.ROOT) =>
          val paramMap = h.parameters.map {
            case UnresolvedFunction(funId, children, _) =>
              (funId.funcName, children.map {
                case ua: UnresolvedAttribute => ua.name
                case other => other.toString
              })
            case unsupported => throw new AnalysisException("SKEWED_JOIN hint parameter should be" +
              s" a function call but was $unsupported (${unsupported.getClass})")
          }.toMap
          val joinKey = paramMap.get("join_key")
          val skewedValues = paramMap.get("skewed_values")
          if (joinKey.nonEmpty && joinKey.get.length == 2
            && skewedValues.nonEmpty && skewedValues.get.length > 0) {
            applySkewedJoinHint(h.child,
              SkewedJoin(JoinKey(joinKey.get(0), joinKey.get(1)), skewedValues.get))
          } else {
            ResolvedHint(h.child)
          }
      }
      newNode
    }
  }

  /**
   * Removes all the hints, used to remove invalid hints provided by the user.
   * This must be executed after all the other hint rules are executed.
   */
  object RemoveAllHints extends Rule[LogicalPlan] {
    def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
      case h: UnresolvedHint => h.child
    }
  }

  case class JoinKey(leftTbField: String, rightTbField: String) {
    val leftTable: String = leftTbField.split("\\.")(0)
    val leftField: String = leftTbField.split("\\.")(1)
    val rightTable: String = rightTbField.split("\\.")(0)
    val rightField: String = rightTbField.split("\\.")(1)
  }

  case class SkewedJoin(joinKey: JoinKey, skewedValues: Seq[String])
}
--------------------------------------------------------------------------------