Preface
From the previous posts we know that the overall Spark SQL processing flow is as follows:
sqlText is parsed by SqlParser into an Unresolved LogicalPlan;
the analyzer module binds it against the catalog, producing a Resolved LogicalPlan;
the optimizer module optimizes the Resolved LogicalPlan, producing an Optimized LogicalPlan;
SparkPlanner converts the LogicalPlan into a physical plan (SparkPlan);
prepareForExecution() turns the physical plan into an executable physical plan;
execute() runs the executable physical plan.
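Each of these stages can be inspected on a DataFrame's QueryExecution. A minimal sketch (the temp view name t is made up):

```scala
// Inspect every stage of the pipeline for one query
spark.range(3).createOrReplaceTempView("t")
val qe = spark.sql("select * from t").queryExecution

println(qe.logical)       // Unresolved LogicalPlan produced by the parser
println(qe.analyzed)      // Resolved LogicalPlan produced by the analyzer
println(qe.optimizedPlan) // Optimized LogicalPlan produced by the optimizer
println(qe.executedPlan)  // executable physical plan (after prepareForExecution)
// or simply: spark.sql("select * from t").explain(true)
```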
The analyzer module in detail
The Analyzer module binds the Unresolved LogicalPlan against the catalog metadata and finally turns it into a Resolved LogicalPlan. Let's follow the code:
```scala
// Code 1
spark.sql("select * from table").show(false)

// Code 2
def sql(sqlText: String): DataFrame = {
  Dataset.ofRows(self, sessionState.sqlParser.parsePlan(sqlText))
}

// Code 3
def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame = {
  val qe = sparkSession.sessionState.executePlan(logicalPlan)
  qe.assertAnalyzed()
  new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema))
}
```
The second half of code 2, sessionState.sqlParser.parsePlan(sqlText), was covered in the previous post: the sqlText is parsed into a syntax tree by the third-party parser ANTLR.
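A quick way to see the Unresolved LogicalPlan the parser produces on its own. A sketch (the table name t is made up, and the printed tree is only approximate):

```scala
// Run only the parser step: no catalog lookup has happened yet,
// so the relation and the star are still unresolved (note the leading quotes).
val unresolved = spark.sessionState.sqlParser.parsePlan("select * from t")
println(unresolved)
// 'Project [*]
// +- 'UnresolvedRelation `t`
```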
Next we enter code 3, where a QueryExecution object is created from the Unresolved LogicalPlan. This is a very important class: analyzer, optimizer, sparkPlan, executedPlan and so on are all triggered from it. Continuing with code 3:
```scala
// Code 4
def assertAnalyzed(): Unit = {
  // Analyzer is invoked outside the try block to avoid calling it again from within the
  // catch block below.
  analyzed
  ...
}

// Code 5
lazy val analyzed: LogicalPlan = {
  SparkSession.setActiveSession(sparkSession)
  sparkSession.sessionState.analyzer.execute(logical)
}
```
This ends up calling the analyzer's execute method, which is defined in Analyzer's parent class RuleExecutor. Analyzer also mixes in CheckAnalysis, which runs checks over the plan and throws a user-facing error if analysis fails:
```scala
class Analyzer(
    catalog: SessionCatalog,
    conf: SQLConf,
    maxIterations: Int)
  extends RuleExecutor[LogicalPlan] with CheckAnalysis {
```
Note the catalog parameter of type SessionCatalog in the constructor. This class manages temporary tables, views, functions and external metadata (such as the Hive metastore), and is the bridge the analyzer uses for binding.
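A rough feel for what the catalog manages, using the public Catalog API. A sketch (the view name people is made up):

```scala
// Register a session-local temporary view, then list what the catalog knows about
spark.range(3).toDF("id").createOrReplaceTempView("people")

spark.catalog.listTables().show(false)    // tables and views visible to this session
spark.catalog.listFunctions().show(false) // built-in and registered functions
spark.catalog.listDatabases().show(false) // databases from the external catalog (e.g. Hive)
```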
Classes that extend RuleExecutor (Analyzer, Optimizer) must implement the method def batches: Seq[Batch], which execute then iterates over. batches is made up of multiple Batch instances, and each Batch is made up of multiple Rules. Batch is defined as protected case class Batch(name: String, strategy: Strategy, rules: Rule[TreeType]*). Strategy is the batch's execution strategy, i.e. the maximum number of times it may be run (maxIterations): Once runs it exactly once, FixedPoint runs it repeatedly (100 times by default). A batch stops running when one of two conditions holds: the plan stops changing between iterations before maxIterations is reached, or the number of iterations reaches maxIterations. All the rules of a batch extend Rule, and execute simply walks through the batches and applies all of their rules to the LogicalPlan.
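A minimal sketch of how such a rule executor is wired up (the executor and the no-op rule here are hypothetical, only to show the shape; real rules live in the analyzer and optimizer packages):

```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.{Rule, RuleExecutor}

// A do-nothing rule: a Rule simply maps a plan to a (possibly rewritten) plan
object NoopRule extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan
}

object MyExecutor extends RuleExecutor[LogicalPlan] {
  override protected def batches: Seq[Batch] = Seq(
    Batch("MyFixedPointBatch", FixedPoint(100), NoopRule), // rerun until the plan stops changing, at most 100 times
    Batch("MyOnceBatch", Once, NoopRule)                   // applied exactly once
  )
}
```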
Next, let's see what execute actually does:
```scala
def execute(plan: TreeType): TreeType = {
  var curPlan = plan

  // iterate over all batches
  batches.foreach { batch =>
    val batchStartPlan = curPlan
    var iteration = 1      // each batch keeps its own counter
    var lastPlan = curPlan // plan before the pass, so we can tell whether the batch still has any effect
    var continue = true

    // Run until fix point (or the max number of iterations as specified in the strategy.
    while (continue) {
      curPlan = batch.rules.foldLeft(curPlan) {
        // apply every rule of this batch to the LogicalPlan in turn
        case (plan, rule) =>
          val startTime = System.nanoTime()
          val result = rule(plan) // apply the rule to the LogicalPlan
          val runTime = System.nanoTime() - startTime
          RuleExecutor.timeMap.addAndGet(rule.ruleName, runTime)

          if (!result.fastEquals(plan)) {
            logTrace(
              s"""
                |=== Applying Rule ${rule.ruleName} ===
                |${sideBySide(plan.treeString, result.treeString).mkString("\n")}
              """.stripMargin)
          }

          result
      }
      iteration += 1 // count how many times this batch has run
      if (iteration > batch.strategy.maxIterations) {
        // stop this batch once it exceeds the strategy's maximum number of iterations
        // Only log if this is a rule that is supposed to run more than once.
        if (iteration != 2) {
          val message = s"Max iterations (${iteration - 1}) reached for batch ${batch.name}"
          if (Utils.isTesting) {
            throw new TreeNodeException(curPlan, message, null)
          } else {
            logWarning(message)
          }
        }
        continue = false
      }

      if (curPlan.fastEquals(lastPlan)) {
        // stop this batch if the plan did not change during this pass
        logTrace(
          s"Fixed point reached for batch ${batch.name} after ${iteration - 1} iterations.")
        continue = false
      }
      lastPlan = curPlan
    }

    if (!batchStartPlan.fastEquals(curPlan)) {
      logDebug(
        s"""
          |=== Result of Batch ${batch.name} ===
          |${sideBySide(batchStartPlan.treeString, curPlan.treeString).mkString("\n")}
        """.stripMargin)
    } else {
      logTrace(s"Batch ${batch.name} has no effect.")
    }
  }

  curPlan
}
```
The main steps are annotated in the code above.
Batches and the rules inside them are applied one after another. After each pass over a batch, we check whether the batch has already run maxIterations times and whether the plan changed during the pass; if the limit is reached or the plan is unchanged, the batch is not run again.
The Analyzer's batches are as follows:
```scala
lazy val batches: Seq[Batch] = Seq(
  Batch("Hints", fixedPoint,
    new ResolveHints.ResolveBroadcastHints(conf),
    ResolveHints.RemoveAllHints),
  Batch("Simple Sanity Check", Once,
    LookupFunctions),
  Batch("Substitution", fixedPoint,
    CTESubstitution,
    WindowsSubstitution,
    EliminateUnions,
    new SubstituteUnresolvedOrdinals(conf)),
  Batch("Resolution", fixedPoint,
    ResolveTableValuedFunctions ::
    ResolveRelations ::
    ResolveReferences ::
    ResolveCreateNamedStruct ::
    ResolveDeserializer ::
    ResolveNewInstance ::
    ResolveUpCast ::
    ResolveGroupingAnalytics ::
    ResolvePivot ::
    ResolveOrdinalInOrderByAndGroupBy ::
    ResolveAggAliasInGroupBy ::
    ResolveMissingReferences ::
    ExtractGenerator ::
    ResolveGenerate ::
    ResolveFunctions ::
    ResolveAliases ::
    ResolveSubquery ::
    ResolveWindowOrder ::
    ResolveWindowFrame ::
    ResolveNaturalAndUsingJoin ::
    ExtractWindowExpressions ::
    GlobalAggregates ::
    ResolveAggregateFunctions ::
    TimeWindowing ::
    ResolveInlineTables(conf) ::
    ResolveTimeZone(conf) ::
    TypeCoercion.typeCoercionRules ++
    extendedResolutionRules : _*),
  Batch("Post-Hoc Resolution", Once, postHocResolutionRules: _*),
  Batch("View", Once,
    AliasViewChild(conf)),
  Batch("Nondeterministic", Once,
    PullOutNondeterministic),
  Batch("UDF", Once,
    HandleNullInputsForUDF),
  Batch("FixNullability", Once,
    FixNullability),
  Batch("Subquery", Once,
    UpdateOuterReferences),
  Batch("Cleanup", fixedPoint,
    CleanupAliases)
)
```
Back to code 3 (shown again below): once the analyzer has bound the plan against the catalog, a Dataset is constructed from sparkSession, the QueryExecution and a Row encoder and returned. None of the later modules have run yet; they are all lazy and only execute once an action is triggered.
```scala
def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame = {
  val qe = sparkSession.sessionState.executePlan(logicalPlan)
  qe.assertAnalyzed()
  new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema))
}
```
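A small illustration of that laziness. A sketch (it assumes a table or temp view named t exists):

```scala
// At this point only parsing and analysis have run (ofRows called assertAnalyzed);
// optimization, physical planning and execution are still un-evaluated lazy vals.
val df = spark.sql("select * from t")

df.queryExecution.optimizedPlan // touching the lazy val forces the optimizer to run
df.show()                       // an action: triggers physical planning and actual execution
```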
Next, let's take an example and see how a rule in the Analyzer module uses the catalog to do the binding.
ResolveRelations
This rule replaces UnresolvedRelation nodes by looking them up in the catalog:
```scala
UnresolvedRelation(tableIdentifier: TableIdentifier)

case class TableIdentifier(table: String, database: Option[String])
```
That is, both the database and the table name can be obtained from it. Next, starting from the rule's entry point, the apply method, let's see how the replacement happens step by step:
```scala
def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperators {
  case i @ InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) if child.resolved =>
    EliminateSubqueryAliases(lookupTableFromCatalog(u)) match {
      case v: View =>
        u.failAnalysis(s"Inserting into a view is not allowed. View: ${v.desc.identifier}.")
      case other => i.copy(table = other)
    }
  case u: UnresolvedRelation => resolveRelation(u)
}
```
The first thing executed is the plan's resolveOperators method, a higher-order method that takes a partial function (the rule) as its argument. Let's step into it:
```scala
def resolveOperators(rule: PartialFunction[LogicalPlan, LogicalPlan]): LogicalPlan = {
  if (!analyzed) {
    val afterRuleOnChildren = mapChildren(_.resolveOperators(rule))
    if (this fastEquals afterRuleOnChildren) {
      CurrentOrigin.withOrigin(origin) {
        rule.applyOrElse(this, identity[LogicalPlan])
      }
    } else {
      CurrentOrigin.withOrigin(origin) {
        rule.applyOrElse(afterRuleOnChildren, identity[LogicalPlan])
      }
    }
  } else {
    this
  }
}
```
It first checks whether this plan has already been analyzed. If not, it calls mapChildren, passing in resolveOperators itself, so this is a recursive call that processes the children before the node itself. If the plan after processing the children is identical to the current plan (for example because it has no children, or because the rule changed none of them), the rule is applied to this node; otherwise the rule is applied to the plan rebuilt from the transformed children.
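To make the bottom-up order concrete, here is a toy re-implementation of the same pattern on a minimal tree (purely illustrative, not Spark code):

```scala
// Children are rewritten first, then the rule is applied to the (possibly rebuilt) node itself.
sealed trait Node
case class Leaf(name: String) extends Node
case class Branch(children: Seq[Node]) extends Node

def transformUp(node: Node)(rule: PartialFunction[Node, Node]): Node = {
  val afterChildren = node match {
    case Branch(children) => Branch(children.map(transformUp(_)(rule)))
    case leaf             => leaf
  }
  // applyOrElse leaves the node unchanged when the rule does not match, like resolveOperators does
  rule.applyOrElse(afterChildren, identity[Node])
}

// Every Leaf is upper-cased before its parent Branch is visited.
val rewritten = transformUp(Branch(Seq(Leaf("a"), Leaf("b")))) {
  case Leaf(n) => Leaf(n.toUpperCase)
}
// rewritten == Branch(Seq(Leaf("A"), Leaf("B")))
```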
Going back, let's see how this rule is applied:
```scala
case i @ InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) if child.resolved =>
  EliminateSubqueryAliases(lookupTableFromCatalog(u)) match {
    case v: View =>
      u.failAnalysis(s"Inserting into a view is not allowed. View: ${v.desc.identifier}.")
    case other => i.copy(table = other)
  }
case u: UnresolvedRelation => resolveRelation(u)
```
Start with the second case: a plain UnresolvedRelation is handled by resolveRelation:
```scala
def resolveRelation(plan: LogicalPlan): LogicalPlan = plan match {
  // excludes queries that run directly on files, e.g. select * from parquet.`/path/to/query`
  case u: UnresolvedRelation if !isRunningDirectlyOnFiles(u.tableIdentifier) =>
    val defaultDatabase = AnalysisContext.get.defaultDatabase // get the default database
    val relation = lookupTableFromCatalog(u, defaultDatabase)
    resolveRelation(relation)
  // The view's child should be a logical plan parsed from the `desc.viewText`, the variable
  // `viewText` should be defined, or else we throw an error on the generation of the View
  // operator.
  case view @ View(desc, _, child) if !child.resolved =>
    // Resolve all the UnresolvedRelations and Views in the child.
    val newChild = AnalysisContext.withAnalysisContext(desc.viewDefaultDatabase) {
      if (AnalysisContext.get.nestedViewDepth > conf.maxNestedViewDepth) {
        view.failAnalysis(s"The depth of view ${view.desc.identifier} exceeds the maximum " +
          s"view resolution depth (${conf.maxNestedViewDepth}). Analysis is aborted to " +
          "avoid errors. Increase the value of spark.sql.view.maxNestedViewDepth to work " +
          "aroud this.")
      }
      execute(child)
    }
    view.copy(child = newChild)
  case p @ SubqueryAlias(_, view: View) =>
    val newChild = resolveRelation(view)
    p.copy(child = newChild)
  case _ => plan
}
```
On the first pass we hit the first case, which calls lookupTableFromCatalog to look the relation up in the catalog. That method ultimately calls SessionCatalog's lookupRelation:
```scala
def lookupRelation(name: TableIdentifier): LogicalPlan = {
  synchronized {
    val db = formatDatabaseName(name.database.getOrElse(currentDb))
    val table = formatTableName(name.table)
    if (db == globalTempViewManager.database) {
      globalTempViewManager.get(table).map { viewDef =>
        SubqueryAlias(table, viewDef)
      }.getOrElse(throw new NoSuchTableException(db, table))
    } else if (name.database.isDefined || !tempTables.contains(table)) {
      val metadata = externalCatalog.getTable(db, table)
      if (metadata.tableType == CatalogTableType.VIEW) {
        val viewText = metadata.viewText.getOrElse(sys.error("Invalid view without text."))
        // The relation is a view, so we wrap the relation by:
        // 1. Add a [[View]] operator over the relation to keep track of the view desc;
        // 2. Wrap the logical plan in a [[SubqueryAlias]] which tracks the name of the view.
        val child = View(
          desc = metadata,
          output = metadata.schema.toAttributes,
          child = parser.parsePlan(viewText))
        SubqueryAlias(table, child)
      } else {
        val tableRelation = CatalogRelation(
          metadata,
          // we assume all the columns are nullable.
          metadata.dataSchema.asNullable.toAttributes,
          metadata.partitionSchema.asNullable.toAttributes)
        SubqueryAlias(table, tableRelation)
      }
    } else {
      SubqueryAlias(table, tempTables(table))
    }
  }
}
```
If db equals globalTempViewManager.database, the view definition is fetched directly from globalTempViewManager and returned. globalTempViewManager maintains a global mapping from view name to the LogicalPlan that carries its metadata:
```scala
val viewDefinitions = new mutable.HashMap[String, LogicalPlan]
```
Otherwise, if the database is explicitly specified, or the table is not a session-local temporary table, the table's metadata (a CatalogTable) is fetched from the externalCatalog (e.g. Hive). This object contains the table's type (managed or external table, or view), its storage format, the column schema, and so on:
If the metadata says it is a view, a View object is built (which includes parsing viewText into a syntax tree with the parser module), wrapped in a SubqueryAlias and returned.
Otherwise the name refers to an actual table (e.g. a Hive table): a CatalogRelation is built from the metadata plus the data and partition column schemas, and a SubqueryAlias wrapping this tableRelation is returned. This is exactly where an unbound UnresolvedRelation gets replaced through the catalog.
In the remaining case the name refers to a session-level temporary table: the LogicalPlan holding its metadata is taken from tempTables, wrapped in a SubqueryAlias and returned.
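Which of these branches fires depends on how the relation was registered. A sketch (the view names are made up, and the last query assumes a table some_db.some_table exists in the external catalog):

```scala
val df = spark.range(3).toDF("id")

df.createOrReplaceTempView("people")     // session-local temp view -> the tempTables branch
df.createGlobalTempView("people_global") // global temp view -> the globalTempViewManager branch

spark.sql("select * from people")                    // resolved from tempTables
spark.sql("select * from global_temp.people_global") // db == globalTempViewManager.database ("global_temp")
spark.sql("select * from some_db.some_table")        // database specified -> externalCatalog (e.g. Hive metastore)
```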
Back in the resolveRelation method:
```scala
def resolveRelation(plan: LogicalPlan): LogicalPlan = plan match {
  case u: UnresolvedRelation if !isRunningDirectlyOnFiles(u.tableIdentifier) =>
    val defaultDatabase = AnalysisContext.get.defaultDatabase
    val relation = lookupTableFromCatalog(u, defaultDatabase)
    resolveRelation(relation)
  // The view's child should be a logical plan parsed from the `desc.viewText`, the variable
  // `viewText` should be defined, or else we throw an error on the generation of the View
  // operator.
  case view @ View(desc, _, child) if !child.resolved =>
    // Resolve all the UnresolvedRelations and Views in the child.
    val newChild = AnalysisContext.withAnalysisContext(desc.viewDefaultDatabase) {
      if (AnalysisContext.get.nestedViewDepth > conf.maxNestedViewDepth) {
        view.failAnalysis(s"The depth of view ${view.desc.identifier} exceeds the maximum " +
          s"view resolution depth (${conf.maxNestedViewDepth}). Analysis is aborted to " +
          "avoid errors. Increase the value of spark.sql.view.maxNestedViewDepth to work " +
          "aroud this.")
      }
      execute(child)
    }
    view.copy(child = newChild)
  case p @ SubqueryAlias(_, view: View) =>
    val newChild = resolveRelation(view)
    p.copy(child = newChild)
  case _ => plan
}
```
After lookupTableFromCatalog returns, resolveRelation is called again on the result:
case UnresolvedRelation: covered above.
case View: as the analysis above showed, this may be a view from an external catalog (e.g. Hive); its child is the Unresolved LogicalPlan obtained by parsing its viewText, and execute is called to analyze it. In short, for a view, the viewText goes through the parser and analyzer modules again.
case SubqueryAlias(_, view: View): resolveRelation is called on the view.
case _: any other plan is returned unchanged.
In short, after resolveRelation the returned plan has been bound to real metadata, whether it was fetched directly from globalTempViewManager, directly from tempTables, or built from metadata in the externalCatalog.
Back to the original apply method:
```scala
def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperators {
  case i @ InsertIntoTable(u: UnresolvedRelation, parts, child, _, _) if child.resolved =>
    EliminateSubqueryAliases(lookupTableFromCatalog(u)) match {
      case v: View =>
        u.failAnalysis(s"Inserting into a view is not allowed. View: ${v.desc.identifier}.")
      case other => i.copy(table = other)
    }
  case u: UnresolvedRelation => resolveRelation(u)
}
```
The second case has now been analyzed, so let's look at the first one: if the plan is an InsertIntoTable whose target table is still an UnresolvedRelation, lookupTableFromCatalog resolves it against the catalog and the result is passed through the EliminateSubqueryAliases rule:
```scala
object EliminateSubqueryAliases extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    case SubqueryAlias(_, child) => child
  }
}
```
There are two ways to traverse the children: transformDown (the default, pre-order) and transformUp (post-order).
Resolving an UnresolvedRelation may yield a SubqueryAlias, but what really matters is its child (e.g. a CatalogRelation); once resolution is done, the alias wrapper is dropped and only the child is kept.
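One way to observe this. A sketch (the temp view name t is made up): in the analyzed plan of an ordinary query the SubqueryAlias is still present, and it disappears once the optimizer, which also applies EliminateSubqueryAliases, has run:

```scala
spark.range(3).createOrReplaceTempView("t")
val qe = spark.sql("select * from t").queryExecution

println(qe.analyzed)      // still contains SubqueryAlias t wrapping the resolved relation
println(qe.optimizedPlan) // the alias wrapper is gone, only its child remains
```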
That completes the walk-through of the ResolveRelations rule; the other rules won't be covered one by one.
Author: BIGUFO
Source: https://www.jianshu.com/p/55530d1ee21b