
Hive Source Code Series (8): Semantic Analysis in the Compilation Module (Overview)


Semantic analysis is mainly about turning the AST tree into QueryBlocks (QB). Why convert to QueryBlock at all? As the earlier posts showed, the AST tree is still very abstract and carries no information about tables or columns. Semantic analysis breaks the AST tree apart, stores its pieces into a QueryBlock module by module, and attaches the corresponding metadata, which lays the groundwork for generating the logical execution plan. The sketch below shows roughly what a QueryBlock holds.
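To make "QueryBlock" concrete: in the Hive source the QueryBlock is the QB class (org.apache.hadoop.hive.ql.parse.QB). Below is a heavily trimmed sketch of its main fields, paraphrased from my reading of the Hive 2.x source; this is not verbatim Hive code and field names may differ slightly between versions.

    // Simplified sketch of org.apache.hadoop.hive.ql.parse.QB (trimmed, paraphrased)
    public class QB {
      private HashMap<String, String> aliasToTabs;  // table alias -> table name, filled during semantic analysis
      private HashMap<String, QBExpr> aliasToSubq;  // subquery alias -> nested query block
      private QBParseInfo qbp;                      // per-clause ASTs: SELECT, WHERE, GROUP BY, destinations...
      private QBMetaData qbm;                       // metastore objects (Table/Partition) for each alias
      private String id;                            // id of this query block
      private boolean isQuery;                      // whether this QB represents a query
      ...
    }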

 

A quick walk-through of semantic analysis

The entry point of the SQL compiler:

      BaseSemanticAnalyzer sem = SemanticAnalyzerFactory.get(queryState, tree);
      List<HiveSemanticAnalyzerHook> saHooks =
          getHooks(HiveConf.ConfVars.SEMANTIC_ANALYZER_HOOK,
              HiveSemanticAnalyzerHook.class);

      // Flush the metastore cache.  This assures that we don't pick up objects from a previous
      // query running in this same thread.  This has to be done after we get our semantic
      // analyzer (this is when the connection to the metastore is made) but before we analyze,
      // because at that point we need access to the objects.
      Hive.get().getMSC().flushCache();

      // Do semantic analysis and plan generation
      if (saHooks != null && !saHooks.isEmpty()) { // Hive's hook mechanism: hooks can check the statement before/after analysis
        HiveSemanticAnalyzerHookContext hookCtx = new HiveSemanticAnalyzerHookContextImpl();
        hookCtx.setConf(conf);
        hookCtx.setUserName(userName);
        hookCtx.setIpAddress(SessionState.get().getUserIpAddress());
        hookCtx.setCommand(command);
        for (HiveSemanticAnalyzerHook hook : saHooks) {
          tree = hook.preAnalyze(hookCtx, tree);
        }
        sem.analyze(tree, ctx);
        hookCtx.update(sem);
        for (HiveSemanticAnalyzerHook hook : saHooks) {
          hook.postAnalyze(hookCtx, sem.getAllRootTasks());
        }
      } else {
        sem.analyze(tree, ctx); // no hooks configured: go straight into compilation
      }

Before entering SQL compilation, Hive first checks whether the hive.semantic.analyzer.hook parameter is set. This is Hive's hook mechanism: by implementing the HiveSemanticAnalyzerHook interface you can run checks against the statement, with preAnalyze executed before compilation and postAnalyze executed after it. Here we care mostly about the compilation itself, sem.analyze(tree, ctx).
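As a concrete illustration, a minimal hook might look roughly like the sketch below. This is not code from the Hive source: the class RejectDropTableHook is made up for this example, and the exact hook API (method signatures, the AbstractSemanticAnalyzerHook base class) should be checked against the Hive version you are reading.

    import java.io.Serializable;
    import java.util.List;

    import org.apache.hadoop.hive.ql.exec.Task;
    import org.apache.hadoop.hive.ql.parse.ASTNode;
    import org.apache.hadoop.hive.ql.parse.AbstractSemanticAnalyzerHook;
    import org.apache.hadoop.hive.ql.parse.HiveParser;
    import org.apache.hadoop.hive.ql.parse.HiveSemanticAnalyzerHookContext;
    import org.apache.hadoop.hive.ql.parse.SemanticException;

    // Hypothetical hook: reject DROP TABLE statements before they are compiled.
    public class RejectDropTableHook extends AbstractSemanticAnalyzerHook {

      @Override
      public ASTNode preAnalyze(HiveSemanticAnalyzerHookContext context, ASTNode ast)
          throws SemanticException {
        // Runs before sem.analyze(); the returned AST is what actually gets compiled.
        if (ast.getType() == HiveParser.TOK_DROPTABLE) {
          throw new SemanticException("DROP TABLE is not allowed: " + context.getCommand());
        }
        return ast;
      }

      @Override
      public void postAnalyze(HiveSemanticAnalyzerHookContext context,
          List<Task<? extends Serializable>> rootTasks) throws SemanticException {
        // Runs after sem.analyze(); rootTasks are the generated root tasks.
      }
    }

Such a hook would be enabled through the hive.semantic.analyzer.hook parameter, which accepts a comma-separated list of fully qualified hook class names.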

sem is obtained from BaseSemanticAnalyzer sem = SemanticAnalyzerFactory.get(queryState, tree). This is the classic factory pattern from Java design patterns:

public static BaseSemanticAnalyzer get(QueryState queryState, ASTNode tree)
    throws SemanticException {
  if (tree.getToken() == null) {
    throw new RuntimeException("Empty Syntax Tree");
  } else {
    HiveOperation opType = commandType.get(tree.getType());
    queryState.setCommandType(opType);
    switch (tree.getType()) {
    case HiveParser.TOK_EXPLAIN:
      return new ExplainSemanticAnalyzer(queryState);
    case HiveParser.TOK_EXPLAIN_SQ_REWRITE:
      return new ExplainSQRewriteSemanticAnalyzer(queryState);
    case HiveParser.TOK_LOAD:
      return new LoadSemanticAnalyzer(queryState);
    case HiveParser.TOK_EXPORT:
      return new ExportSemanticAnalyzer(queryState);
    case HiveParser.TOK_IMPORT:
      return new ImportSemanticAnalyzer(queryState);
    case HiveParser.TOK_ALTERTABLE: {
      Tree child = tree.getChild(1);
      switch (child.getType()) {
        case HiveParser.TOK_ALTERTABLE_RENAME:
        case HiveParser.TOK_ALTERTABLE_TOUCH:
        case HiveParser.TOK_ALTERTABLE_ARCHIVE:
        case HiveParser.TOK_ALTERTABLE_UNARCHIVE:
        case HiveParser.TOK_ALTERTABLE_ADDCOLS:
        case HiveParser.TOK_ALTERTABLE_RENAMECOL:
        case HiveParser.TOK_ALTERTABLE_REPLACECOLS:
        case HiveParser.TOK_ALTERTABLE_DROPPARTS:
        case HiveParser.TOK_ALTERTABLE_ADDPARTS:
        case HiveParser.TOK_ALTERTABLE_PARTCOLTYPE:
        case HiveParser.TOK_ALTERTABLE_PROPERTIES:
        case HiveParser.TOK_ALTERTABLE_DROPPROPERTIES:
        case HiveParser.TOK_ALTERTABLE_EXCHANGEPARTITION:
        case HiveParser.TOK_ALTERTABLE_SKEWED:
        case HiveParser.TOK_ALTERTABLE_DROPCONSTRAINT:
        case HiveParser.TOK_ALTERTABLE_ADDCONSTRAINT:
          queryState.setCommandType(commandType.get(child.getType()));
          return new DDLSemanticAnalyzer(queryState);
      }
      opType =
          tablePartitionCommandType.get(child.getType())[tree.getChildCount() > 2 ? 1 : 0];
      queryState.setCommandType(opType);
      return new DDLSemanticAnalyzer(queryState);
    }
    case HiveParser.TOK_ALTERVIEW: {
      Tree child = tree.getChild(1);
      switch (child.getType()) {
        case HiveParser.TOK_ALTERVIEW_PROPERTIES:
        case HiveParser.TOK_ALTERVIEW_DROPPROPERTIES:
        case HiveParser.TOK_ALTERVIEW_ADDPARTS:
        case HiveParser.TOK_ALTERVIEW_DROPPARTS:
        case HiveParser.TOK_ALTERVIEW_RENAME:
          opType = commandType.get(child.getType());
          queryState.setCommandType(opType);
          return new DDLSemanticAnalyzer(queryState);
      }
      // TOK_ALTERVIEW_AS
      assert child.getType() == HiveParser.TOK_QUERY;
      queryState.setCommandType(HiveOperation.ALTERVIEW_AS);
      return new SemanticAnalyzer(queryState);
    }
    case HiveParser.TOK_CREATEDATABASE:
    case HiveParser.TOK_DROPDATABASE:
    case HiveParser.TOK_SWITCHDATABASE:
    case HiveParser.TOK_DROPTABLE:
    case HiveParser.TOK_DROPVIEW:
    case HiveParser.TOK_DESCDATABASE:
    case HiveParser.TOK_DESCTABLE:
    case HiveParser.TOK_DESCFUNCTION:
    case HiveParser.TOK_MSCK:
    case HiveParser.TOK_ALTERINDEX_REBUILD:
    case HiveParser.TOK_ALTERINDEX_PROPERTIES:
    case HiveParser.TOK_SHOWDATABASES:
    case HiveParser.TOK_SHOWTABLES:
    case HiveParser.TOK_SHOWCOLUMNS:
    case HiveParser.TOK_SHOW_TABLESTATUS:
    case HiveParser.TOK_SHOW_TBLPROPERTIES:
    case HiveParser.TOK_SHOW_CREATEDATABASE:
    case HiveParser.TOK_SHOW_CREATETABLE:
    case HiveParser.TOK_SHOWFUNCTIONS:
    case HiveParser.TOK_SHOWPARTITIONS:
    case HiveParser.TOK_SHOWINDEXES:
    case HiveParser.TOK_SHOWLOCKS:
    case HiveParser.TOK_SHOWDBLOCKS:
    case HiveParser.TOK_SHOW_COMPACTIONS:
    case HiveParser.TOK_SHOW_TRANSACTIONS:
    case HiveParser.TOK_ABORT_TRANSACTIONS:
    case HiveParser.TOK_SHOWCONF:
    case HiveParser.TOK_CREATEINDEX:
    case HiveParser.TOK_DROPINDEX:
    case HiveParser.TOK_ALTERTABLE_CLUSTER_SORT:
    case HiveParser.TOK_LOCKTABLE:
    case HiveParser.TOK_UNLOCKTABLE:
    case HiveParser.TOK_LOCKDB:
    case HiveParser.TOK_UNLOCKDB:
    case HiveParser.TOK_CREATEROLE:
    case HiveParser.TOK_DROPROLE:
    case HiveParser.TOK_GRANT:
    case HiveParser.TOK_REVOKE:
    case HiveParser.TOK_SHOW_GRANT:
    case HiveParser.TOK_GRANT_ROLE:
    case HiveParser.TOK_REVOKE_ROLE:
    case HiveParser.TOK_SHOW_ROLE_GRANT:
    case HiveParser.TOK_SHOW_ROLE_PRINCIPALS:
    case HiveParser.TOK_SHOW_ROLES:
    case HiveParser.TOK_ALTERDATABASE_PROPERTIES:
    case HiveParser.TOK_ALTERDATABASE_OWNER:
    case HiveParser.TOK_TRUNCATETABLE:
    case HiveParser.TOK_SHOW_SET_ROLE:
    case HiveParser.TOK_CACHE_METADATA:
      return new DDLSemanticAnalyzer(queryState);

    case HiveParser.TOK_CREATEFUNCTION:
    case HiveParser.TOK_DROPFUNCTION:
    case HiveParser.TOK_RELOADFUNCTION:
      return new FunctionSemanticAnalyzer(queryState);

    case HiveParser.TOK_ANALYZE:
      return new ColumnStatsSemanticAnalyzer(queryState);

    case HiveParser.TOK_CREATEMACRO:
    case HiveParser.TOK_DROPMACRO:
      return new MacroSemanticAnalyzer(queryState);

    case HiveParser.TOK_UPDATE_TABLE:
    case HiveParser.TOK_DELETE_FROM:
      return new UpdateDeleteSemanticAnalyzer(queryState);

    case HiveParser.TOK_START_TRANSACTION:
    case HiveParser.TOK_COMMIT:
    case HiveParser.TOK_ROLLBACK:
    case HiveParser.TOK_SET_AUTOCOMMIT:
    default: {
      // If CBO is enabled, use CalcitePlanner; otherwise use SemanticAnalyzer
      SemanticAnalyzer semAnalyzer = HiveConf
          .getBoolVar(queryState.getConf(), HiveConf.ConfVars.HIVE_CBO_ENABLED) ?
              new CalcitePlanner(queryState) : new SemanticAnalyzer(queryState);
      return semAnalyzer;
    }
    }
  }
}

Hive compiles different kinds of SQL with different analyzers: EXPLAIN goes through ExplainSemanticAnalyzer, DDL through DDLSemanticAnalyzer, LOAD through LoadSemanticAnalyzer, and so on. The factory pattern keeps these different responsibilities isolated from one another, decouples them to a degree, and makes the design easy to extend: if one day a new kind of statement needs its own compilation path, you write a new analyzer class and register it in SemanticAnalyzerFactory, roughly as in the toy sketch below.
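Here is a generic, self-contained illustration of that idea. This is plain Java written for this post, not Hive code; every name in it is invented. Each statement kind hides behind a common interface, and supporting a new kind means adding one case plus one new analyzer class.

    // Toy version of the factory used by SemanticAnalyzerFactory (all names invented).
    interface Analyzer {
      void analyze(String statement);
    }

    class QueryAnalyzer implements Analyzer {
      public void analyze(String statement) { System.out.println("compile query: " + statement); }
    }

    class DdlAnalyzer implements Analyzer {
      public void analyze(String statement) { System.out.println("compile DDL: " + statement); }
    }

    public class AnalyzerFactory {
      public static Analyzer get(String kind) {
        switch (kind) {
          case "QUERY":
            return new QueryAnalyzer();
          case "DDL":
            return new DdlAnalyzer();
          // A new statement kind only needs one more case here plus one new
          // Analyzer implementation; existing analyzers stay untouched.
          default:
            throw new IllegalArgumentException("unknown statement kind: " + kind);
        }
      }

      public static void main(String[] args) {
        AnalyzerFactory.get("DDL").analyze("DROP TABLE t");
      }
    }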

In practice, though, most statements are queries, and this source-code walk-through also revolves around queries, so we land in the default branch. There we find a check on hive.cbo.enable: if it is true, the CalcitePlanner class is used, otherwise the SemanticAnalyzer class. CBO is cost-based optimization and is very powerful; Hive 2.x put a lot of work into it, and it is enabled by default. While studying the source code we will first turn CBO off (set hive.cbo.enable=false) and come back to CBO in a dedicated post later.

HQL compilation is driven mainly by the SemanticAnalyzer.analyzeInternal method:

void analyzeInternal(ASTNode ast, PlannerContext plannerCtx) throws SemanticException {
    // 1. Generate Resolved Parse tree from syntax tree
    LOG.info("Starting Semantic Analysis");
    if (!genResolvedParseTree(ast, plannerCtx)) { // semantic analysis
      return;
    }

    // 2. Gen OP Tree from resolved Parse Tree
    Operator sinkOp = genOPTree(ast, plannerCtx); // generate the logical execution plan

    ...
    ...

    // 7. Perform Logical optimization
    if (LOG.isDebugEnabled()) {
      LOG.debug("Before logical optimization\n" + Operator.toString(pCtx.getTopOps().values()));
    }
    Optimizer optm = new Optimizer();
    optm.setPctx(pCtx);
    optm.initialize(conf);
    pCtx = optm.optimize(); // optimize the logical execution plan
    if (pCtx.getColumnAccessInfo() != null) {
      // set ColumnAccessInfo for view column authorization
      setColumnAccessInfo(pCtx.getColumnAccessInfo());
    }
    FetchTask origFetchTask = pCtx.getFetchTask();
    if (LOG.isDebugEnabled()) {
      LOG.debug("After logical optimization\n" + Operator.toString(pCtx.getTopOps().values()));
    }

    // 8. Generate column access stats if required - wait until column pruning
    // takes place during optimization
    boolean isColumnInfoNeedForAuth = SessionState.get().isAuthorizationModeV2()
        && HiveConf.getBoolVar(conf, HiveConf.ConfVars.HIVE_AUTHORIZATION_ENABLED);
    if (isColumnInfoNeedForAuth
        || HiveConf.getBoolVar(this.conf, HiveConf.ConfVars.HIVE_STATS_COLLECT_SCANCOLS)) {
      ColumnAccessAnalyzer columnAccessAnalyzer = new ColumnAccessAnalyzer(pCtx);
      // view column access info is carried by this.getColumnAccessInfo().
      setColumnAccessInfo(columnAccessAnalyzer.analyzeColumnAccess(this.getColumnAccessInfo()));
    }

    // 9. Optimize Physical op tree & Translate to target execution engine (MR,
    // TEZ..)
    if (!ctx.getExplainLogical()) {
      TaskCompiler compiler = TaskCompilerFactory.getCompiler(conf, pCtx);
      compiler.init(queryState, console, db);
      compiler.compile(pCtx, rootTasks, inputs, outputs); // generate and optimize the physical execution plan
      fetchTask = pCtx.getFetchTask();
    }
    LOG.info("Completed plan generation");

    // 10. put accessed columns to readEntity
    if (HiveConf.getBoolVar(this.conf, HiveConf.ConfVars.HIVE_STATS_COLLECT_SCANCOLS)) {
      putAccessedColumnsToReadEntity(inputs, columnAccessInfo);
    }

    // 11. if desired check we're not going over partition scan limits
    if (!ctx.getExplain()) {
      enforceScanLimits(pCtx, origFetchTask);
    }

    return;
  }

Semantic analysis is the first step in the code above, the genResolvedParseTree method:

boolean genResolvedParseTree(ASTNode ast, PlannerContext plannerCtx) throws SemanticException {
    ASTNode child = ast;
    this.ast = ast;
    viewsExpanded = new ArrayList<String>();
    ctesExpanded = new ArrayList<String>();

    .....

    // 4. continue analyzing from the child ASTNode.
    Phase1Ctx ctx_1 = initPhase1Ctx();
    preProcessForInsert(child, qb);
    if (!doPhase1(child, qb, ctx_1, plannerCtx)) { // core step 1
      // if phase1Result false return
      return false;
    }
    LOG.info("Completed phase 1 of Semantic Analysis");

    // 5. Resolve Parse Tree
    // Materialization is allowed if it is not a view definition
    getMetaData(qb, createVwDesc == null); // core step 2
    LOG.info("Completed getting MetaData in Semantic Analysis");

    plannerCtx.setParseTreeAttr(child, ctx_1);

    return true;
  }

The genResolvedParseTree method is fairly long; its two most important calls are doPhase1 and getMetaData.

doPhase1 is responsible for taking the AST tree apart and storing its pieces into the corresponding QB, while getMetaData is responsible for storing metadata such as tables and columns into the QB (a rough illustration follows).
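To make that concrete, here is a rough, comment-style illustration of what ends up in the QB for a simple query. This is not code from the Hive source; the getter names follow my reading of the Hive 2.x QB/QBParseInfo/QBMetaData classes and should be verified against the version you are studying.

    // After doPhase1 + getMetaData have run for:
    //   SELECT key, count(*) FROM src_tab a WHERE key > 10 GROUP BY key
    //
    // the QB roughly contains (names approximate):
    //   qb.getAliases()                      -> ["a"]          FROM-clause aliases
    //   qb.getTabNameForAlias("a")           -> "src_tab"      alias-to-table map, filled by doPhase1
    //   qb.getParseInfo()                    -> per-clause ASTs (SELECT, WHERE, GROUP BY, destination), filled by doPhase1
    //   qb.getMetaData().getSrcForAlias("a") -> the Table object fetched from the metastore, filled by getMetaData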

Once all of this has run, the semantic analysis module is done.

To summarize the code path of semantic analysis:

Driver.compile

-> SemanticAnalyzerFactory.get(queryState, tree)

-> SemanticAnalyzer.analyzeInternal

-> SemanticAnalyzer.genResolvedParseTree

-> SemanticAnalyzer.doPhase1

-> SemanticAnalyzer.getMetaData

Describing the path this way is not perfectly precise, but the point is simply to make the overall flow clear.

