Kylin Source Code Analysis: How Kylin Works, Seen Through the CubingJob Build Process

Source: Internet · Editor: 程序博客网 · Time: 2024/06/08 05:06

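In Kylin, the data source, the compute engine, and the storage layer cooperate to build a CubeSegment. When a request to build a new CubeSegment is sent to the Kylin server, it reaches the submitJob method of JobService, where the CubingJob is created through the following entry point: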

 job = EngineFactory.createBatchCubingJob(newSeg, submitter);

This shows that the entry point for building a CubingJob is provided by the compute engine.

private static ImplementationSwitch<IBatchCubingEngine> batchEngines;
static {
    Map<Integer, String> impls = KylinConfig.getInstanceFromEnv().getJobEngines();
    batchEngines = new ImplementationSwitch<>(impls, IBatchCubingEngine.class);
}
public static IBatchCubingEngine batchEngine(IEngineAware aware) {
    return batchEngines.get(aware.getEngineType());
}
public static DefaultChainedExecutable createBatchCubingJob(CubeSegment newSegment, String submitter) {
    return batchEngine(newSegment).createBatchCubingJob(newSegment, submitter);
}

Every compute engine that Kylin supports is registered in EngineFactory and stored in batchEngines. The engines currently supported are:

public Map<Integer, String> getJobEngines() {
    Map<Integer, String> r = Maps.newLinkedHashMap();
    // ref constants in IEngineAware
    r.put(0, "org.apache.kylin.engine.mr.MRBatchCubingEngine"); //IEngineAware.ID_MR_V1
    r.put(2, "org.apache.kylin.engine.mr.MRBatchCubingEngine2"); //IEngineAware.ID_MR_V2
    r.put(4, "org.apache.kylin.engine.spark.SparkBatchCubingEngine2"); //IEngineAware.ID_SPARK
    r.putAll(convertKeyToInteger(getPropertiesByPrefix("kylin.engine.provider.")));
    return r;
}

In Kylin, every compute engine implements the IBatchCubingEngine interface. In version 2.0.0 the default engine id is 2, which is MRBatchCubingEngine2, so the call above ends up in the createBatchCubingJob method of MRBatchCubingEngine2.

public DefaultChainedExecutable createBatchCubingJob(CubeSegment newSegment, String submitter) {
    return new BatchCubingJobBuilder2(newSegment, submitter).build();
}

The createBatchCubingJob method of MRBatchCubingEngine2 is very simple: it creates a BatchCubingJobBuilder2 instance, calls its build method, and returns a CubingJob.
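The lookup from an integer engine id to an implementation can be pictured with a small registry. The sketch below is not Kylin's ImplementationSwitch; it is a simplified, hypothetical stand-in (the class name SwitchSketch and its behavior of caching one instance per id are assumptions) that maps ids to class names and instantiates them lazily by reflection.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for an id -> implementation registry.
// Assumption: each implementation has a public no-arg constructor.
class SwitchSketch<I> {
    private final Map<Integer, String> classNames;
    private final Map<Integer, I> instances = new HashMap<>();
    private final Class<I> interfaceType;

    SwitchSketch(Map<Integer, String> classNames, Class<I> interfaceType) {
        this.classNames = new LinkedHashMap<>(classNames);
        this.interfaceType = interfaceType;
    }

    // Returns the implementation registered under the id, creating it on first use.
    synchronized I get(int id) {
        I cached = instances.get(id);
        if (cached != null) {
            return cached;
        }
        String name = classNames.get(id);
        if (name == null) {
            throw new IllegalArgumentException("No implementation registered for id " + id);
        }
        try {
            Object obj = Class.forName(name).getDeclaredConstructor().newInstance();
            I impl = interfaceType.cast(obj); // fails fast if the class does not implement I
            instances.put(id, impl);
            return impl;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + name, e);
        }
    }
}
```

With such a registry, a segment only needs to carry an integer (its engine type), and the factory resolves that integer to a concrete engine at the moment a job is created.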

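At this point the compute engine, MRBatchCubingEngine2, has appeared, but the data source and storage modules have not, and it is not yet clear how they cooperate with the engine. Read on.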

First, look at how BatchCubingJobBuilder2 is initialized.

public BatchCubingJobBuilder2(CubeSegment newSegment, String submitter) {
    super(newSegment, submitter);
    this.inputSide = MRUtil.getBatchCubingInputSide(seg);
    this.outputSide = MRUtil.getBatchCubingOutputSide2(seg);
}

super(newSegment, submitter) runs the parent class constructor, which initializes a few fields. The two fields that belong to BatchCubingJobBuilder2 itself are inputSide and outputSide, declared as follows:

private final IMRBatchCubingInputSide inputSide;
private final IMRBatchCubingOutputSide2 outputSide;

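These two fields represent the data source and the storage in the CubingJob build, and they are how each of them cooperates with the engine. To see how, first look at how they are obtained.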

public static IMRBatchCubingInputSide getBatchCubingInputSide(CubeSegment seg) {
    IJoinedFlatTableDesc flatDesc = EngineFactory.getJoinedFlatTableDesc(seg);
    return SourceFactory.createEngineAdapter(seg, IMRInput.class).getBatchCubingInputSide(flatDesc);
}

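This is how the inputSide instance is obtained. The first call, EngineFactory.getJoinedFlatTableDesc, follows the same path as createBatchCubingJob above and ends up in the getJoinedFlatTableDesc method of the MRBatchCubingEngine2 engine.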

public static IJoinedFlatTableDesc getJoinedFlatTableDesc(CubeSegment newSegment) {
    return batchEngine(newSegment).getJoinedFlatTableDesc(newSegment);
}

This method of MRBatchCubingEngine2 is again very simple: it returns a CubeJoinedFlatTableDesc instance, which can be understood as a wrapper around the data source tables.

public IJoinedFlatTableDesc getJoinedFlatTableDesc(CubeSegment newSegment) {
    return new CubeJoinedFlatTableDesc(newSegment);
}

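With the flatDesc instance in hand, the next step is to obtain the inputSide instance. The call looks familiar: it closely resembles the code that looks up the compute engine, so the implementation behind it is similar too: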

private static ImplementationSwitch<ISource> sources;
static {
    Map<Integer, String> impls = KylinConfig.getInstanceFromEnv().getSourceEngines();
    sources = new ImplementationSwitch<>(impls, ISource.class);
}
public static ISource tableSource(ISourceAware aware) {
    return sources.get(aware.getSourceType());
}
public static <T> T createEngineAdapter(ISourceAware table, Class<T> engineInterface) {
    return tableSource(table).adaptToBuildEngine(engineInterface);
}

It is cut from the same mold as the compute engine code. The data sources Kylin currently supports are:

public Map<Integer, String> getSourceEngines() {
    Map<Integer, String> r = Maps.newLinkedHashMap();
    // ref constants in ISourceAware
    r.put(0, "org.apache.kylin.source.hive.HiveSource");
    r.put(1, "org.apache.kylin.source.kafka.KafkaSource");
    r.putAll(convertKeyToInteger(getPropertiesByPrefix("kylin.source.provider.")));
    return r;
}

For offline (batch) data, Kylin uses HiveSource, so the rest of this analysis takes HiveSource as the example.

So SourceFactory's createEngineAdapter actually calls HiveSource's adaptToBuildEngine, which, given the IMRInput.class interface, returns a HiveMRInput instance.

public <I> I adaptToBuildEngine(Class<I> engineInterface) {
    if (engineInterface == IMRInput.class) {
        return (I) new HiveMRInput();
    } else {
        throw new RuntimeException("Cannot adapt to " + engineInterface);
    }
}

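This is where the compute engine MRBatchCubingEngine2 and the data source HiveSource become linked: HiveSource has to provide an implementation of IMRInput that fits MRBatchCubingEngine2, namely the HiveMRInput instance. It also shows that in Kylin the data source must actively adapt to the compute engine by implementing the input interface the engine requires.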

The inputSide instance can now be obtained through HiveMRInput's getBatchCubingInputSide method, whose implementation is very simple: it creates a new BatchCubingInputSide instance.

public IMRBatchCubingInputSide getBatchCubingInputSide(IJoinedFlatTableDesc flatDesc) {
    return new BatchCubingInputSide(flatDesc);
}

Next, look at how the outputSide instance is obtained.

public static IMRBatchCubingOutputSide2 getBatchCubingOutputSide2(CubeSegment seg) {
    return StorageFactory.createEngineAdapter(seg, IMROutput2.class).getBatchCubingOutputSide(seg);
}

This follows the same pattern as obtaining inputSide. The details are as follows:

private static ImplementationSwitch<IStorage> storages;
static {
    Map<Integer, String> impls = KylinConfig.getInstanceFromEnv().getStorageEngines();
    storages = new ImplementationSwitch<IStorage>(impls, IStorage.class);
}
public static IStorage storage(IStorageAware aware) {
    return storages.get(aware.getStorageType());
}
public static <T> T createEngineAdapter(IStorageAware aware, Class<T> engineInterface) {
    return storage(aware).adaptToBuildEngine(engineInterface);
}

The storage implementations Kylin currently supports are listed below; the default is HBaseStorage.

public Map<Integer, String> getStorageEngines() {
    Map<Integer, String> r = Maps.newLinkedHashMap();
    // ref constants in IStorageAware
    r.put(0, "org.apache.kylin.storage.hbase.HBaseStorage");
    r.put(1, "org.apache.kylin.storage.hybrid.HybridStorage");
    r.put(2, "org.apache.kylin.storage.hbase.HBaseStorage");
    r.putAll(convertKeyToInteger(getPropertiesByPrefix("kylin.storage.provider.")));
    return r;
}

The engine-adapted storage returned by HBaseStorage's adaptToBuildEngine is:

public <I> I adaptToBuildEngine(Class<I> engineInterface) {
    if (engineInterface == IMROutput.class) {
        return (I) new HBaseMROutput();
    } else if (engineInterface == IMROutput2.class) {
        return (I) new HBaseMROutput2Transition();
    } else {
        throw new RuntimeException("Cannot adapt to " + engineInterface);
    }
}

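Given the IMROutput2 interface, the implementation returned is HBaseMROutput2Transition. This is where the compute engine and the storage are linked: the storage implementation, HBaseStorage, has to provide an output implementation that fits the compute engine.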

The outputSide instance is then obtained through HBaseMROutput2Transition's getBatchCubingOutputSide method.

public IMRBatchCubingOutputSide2 getBatchCubingOutputSide(final CubeSegment seg) {
    return new IMRBatchCubingOutputSide2() {
        HBaseMRSteps steps = new HBaseMRSteps(seg);
        @Override
        public void addStepPhase2_BuildDictionary(DefaultChainedExecutable jobFlow) {
            jobFlow.addTask(steps.createCreateHTableStepWithStats(jobFlow.getId()));
        }
        @Override
        public void addStepPhase3_BuildCube(DefaultChainedExecutable jobFlow) {
            jobFlow.addTask(steps.createConvertCuboidToHfileStep(jobFlow.getId()));
            jobFlow.addTask(steps.createBulkLoadStep(jobFlow.getId()));
        }
        @Override
        public void addStepPhase4_Cleanup(DefaultChainedExecutable jobFlow) {
            // nothing to do
        }
    };
}

It returns a concrete implementation of the IMRBatchCubingOutputSide2 interface.

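By design, Kylin keeps its three major modules, data source, compute engine, and storage, independent of one another. Each only needs to implement the ISource, IBatchCubingEngine, or IStorage interface. A concrete implementation also has to be registered with SourceFactory, EngineFactory, or StorageFactory. The built-in implementations are registered in code, as shown above, and others are registered by adding the corresponding entries to the kylin.properties configuration file, for example: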

# Data source registration
kylin.source.provider.0=org.apache.kylin.source.hive.HiveSource
# Compute engine registration
kylin.engine.provider.2=org.apache.kylin.engine.mr.MRBatchCubingEngine2
# Storage registration
kylin.storage.provider.2=org.apache.kylin.storage.hbase.HBaseStorage
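Judging from the getJobEngines, getSourceEngines, and getStorageEngines listings, the built-in defaults are put into the map first and the prefixed properties are merged in afterwards, so a configured entry with the same id replaces the default. The following is a minimal sketch of that merge under this reading; ProviderConfigSketch and resolve are hypothetical names, and com.example.MySource in the usage note is an invented class name used only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper that mimics "defaults first, then properties under a prefix".
class ProviderConfigSketch {
    static Map<Integer, String> resolve(Map<Integer, String> defaults, Properties props, String prefix) {
        Map<Integer, String> result = new LinkedHashMap<>(defaults);
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(prefix)) {
                // The part after the prefix is the integer id, e.g. "0" in "kylin.source.provider.0".
                int id = Integer.parseInt(key.substring(prefix.length()));
                result.put(id, props.getProperty(key)); // same id: the configured class wins
            }
        }
        return result;
    }
}
```

For example, with defaults {0: HiveSource, 1: KafkaSource} and a property kylin.source.provider.0=com.example.MySource, id 0 would resolve to com.example.MySource while id 1 keeps its default.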

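When a Cube is created, the three modules are still independent of one another. When a concrete CubeSegment is built, the compute engine takes the lead: it asks the data source and the storage to adapt to it and to supply the adapter interfaces. If both can adapt, the CubingJob is built successfully; otherwise the CubingJob cannot be built.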

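This process also shows that a compute engine can work with different data source and storage implementations, as long as they implement the interfaces the engine requires. Likewise, a data source or storage is not tied to a single compute engine: it can adapt to other engines simply by implementing their interfaces. The relationship between them and the compute engines is therefore many-to-many, but the compute engine is in the leading position: a data source or storage that wants to use an engine must actively adapt to that engine's interfaces. This makes Kylin's design flexible and highly extensible. When a new data source, compute engine, or storage technology appears, implementing a few interfaces is enough to plug it into Kylin, while the overall framework stays the same.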
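The many-to-many relationship can be illustrated with a source that adapts to two different engine input contracts. This is only a sketch: MrInputSketch, SparkInputSketch, SourceSketch, and HiveLikeSource are hypothetical names, not Kylin interfaces, and the returned strings are placeholders for real adapter objects.

```java
// Hypothetical input contracts, one per compute engine.
interface MrInputSketch {
    String describe();
}

interface SparkInputSketch {
    String describe();
}

// Hypothetical source contract: the source is asked for an adapter to an engine interface.
interface SourceSketch {
    <I> I adaptToBuildEngine(Class<I> engineInterface);
}

// One source serving two engines; each branch returns the adapter that engine expects.
class HiveLikeSource implements SourceSketch {
    @Override
    public <I> I adaptToBuildEngine(Class<I> engineInterface) {
        if (engineInterface == MrInputSketch.class) {
            MrInputSketch input = () -> "hive -> mr";
            return engineInterface.cast(input);
        } else if (engineInterface == SparkInputSketch.class) {
            SparkInputSketch input = () -> "hive -> spark";
            return engineInterface.cast(input);
        }
        // The source has not adapted to this engine, so no job can be built with it.
        throw new IllegalArgumentException("Cannot adapt to " + engineInterface);
    }
}
```

The engine decides which interface it asks for; the source decides which interfaces it is able to serve. Supporting one more engine means adding one more branch on the source side, without touching the engine.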

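The analysis above shows that BatchCubingJobBuilder2 is the link that ties the data source, the compute engine, and the storage together; its constructor already connects the three modules. But how exactly do they cooperate? Recall that in MRBatchCubingEngine2's createBatchCubingJob method, right after a BatchCubingJobBuilder2 instance is created, its build method is called and a CubingJob instance is returned. The logic of build is as follows: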

public CubingJob build() {
    logger.info("MR_V2 new job to BUILD segment " + seg);
    final CubingJob result = CubingJob.createBuildJob(seg, submitter, config);
    final String jobId = result.getId();
    final String cuboidRootPath = getCuboidRootPath(jobId);
    // Phase 1: Create Flat Table & Materialize Hive View in Lookup Tables
    inputSide.addStepPhase1_CreateFlatTable(result);
    // Phase 2: Build Dictionary
    result.addTask(createFactDistinctColumnsStepWithStats(jobId));
    result.addTask(createBuildDictionaryStep(jobId));
    result.addTask(createSaveStatisticsStep(jobId));
    outputSide.addStepPhase2_BuildDictionary(result);
    // Phase 3: Build Cube
    addLayerCubingSteps(result, jobId, cuboidRootPath); // layer cubing, only selected algorithm will execute
    addInMemCubingSteps(result, jobId, cuboidRootPath); // inmem cubing, only selected algorithm will execute
    outputSide.addStepPhase3_BuildCube(result);
    // Phase 4: Update Metadata & Cleanup
    result.addTask(createUpdateCubeInfoAfterBuildStep(jobId));
    inputSide.addStepPhase4_Cleanup(result);
    outputSide.addStepPhase4_Cleanup(result);
    return result;
}

The logic above is fairly clear: the steps needed to build a CubeSegment are added, in order, to the CubingJob's task list.
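The way the engine, the input side, and the output side each contribute steps to one ordered list can be reduced to a toy model. The sketch below uses hypothetical, simplified types (JobFlowSketch, InputSideSketch, OutputSideSketch, and so on) and plain strings instead of executable steps; the step names loosely follow the build method, and the two cubing algorithms are collapsed into a single "cubing" entry.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical job: just an ordered list of step names.
class JobFlowSketch {
    private final List<String> tasks = new ArrayList<>();
    void addTask(String name) { tasks.add(name); }
    List<String> getTasks() { return tasks; }
}

// Steps the data source side contributes.
interface InputSideSketch {
    void addCreateFlatTable(JobFlowSketch flow);
    void addCleanup(JobFlowSketch flow);
}

// Steps the storage side contributes.
interface OutputSideSketch {
    void addBuildDictionary(JobFlowSketch flow);
    void addBuildCube(JobFlowSketch flow);
    void addCleanup(JobFlowSketch flow);
}

class HiveInputSketch implements InputSideSketch {
    public void addCreateFlatTable(JobFlowSketch flow) { flow.addTask("create flat hive table"); }
    public void addCleanup(JobFlowSketch flow) { flow.addTask("clean up source-side temporary data"); }
}

class HBaseOutputSketch implements OutputSideSketch {
    public void addBuildDictionary(JobFlowSketch flow) { flow.addTask("create htable"); }
    public void addBuildCube(JobFlowSketch flow) {
        flow.addTask("convert cuboid to hfile");
        flow.addTask("bulk load");
    }
    public void addCleanup(JobFlowSketch flow) { /* nothing to do */ }
}

// The engine owns the order; the two sides only fill in their own steps.
class JobBuilderSketch {
    static JobFlowSketch build(InputSideSketch in, OutputSideSketch out) {
        JobFlowSketch flow = new JobFlowSketch();
        in.addCreateFlatTable(flow);            // phase 1: source side
        flow.addTask("fact distinct columns");  // phase 2: engine steps
        flow.addTask("build dictionary");
        flow.addTask("save statistics");
        out.addBuildDictionary(flow);           // phase 2: storage side
        flow.addTask("cubing");                 // phase 3: engine step
        out.addBuildCube(flow);                 // phase 3: storage side
        flow.addTask("update cube info");       // phase 4: engine step
        in.addCleanup(flow);                    // phase 4: both sides clean up
        out.addCleanup(flow);
        return flow;
    }
}
```

Calling JobBuilderSketch.build(new HiveInputSketch(), new HBaseOutputSketch()) yields one flat, ordered task list, which is all the job needs in order to be executed step by step later.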

First, look at how the CubingJob is initialized, that is, CubingJob's createBuildJob method, which in turn calls initCubingJob. The logic is also straightforward.

public static CubingJob createBuildJob(CubeSegment seg, String submitter, JobEngineConfig config) {
    return initCubingJob(seg, "BUILD", submitter, config);
}
private static CubingJob initCubingJob(CubeSegment seg, String jobType, String submitter, JobEngineConfig config) {
    KylinConfig kylinConfig = config.getConfig();
    CubeInstance cube = seg.getCubeInstance();
    List<ProjectInstance> projList = ProjectManager.getInstance(kylinConfig).findProjects(cube.getType(), cube.getName());
    if (projList == null || projList.size() == 0) {
        throw new RuntimeException("Cannot find the project containing the cube " + cube.getName() + "!!!");
    } else if (projList.size() >= 2) {
        String msg = "Find more than one project containing the cube " + cube.getName() + ". It does't meet the uniqueness requirement!!! ";
        if (!config.getConfig().allowCubeAppearInMultipleProjects()) {
            throw new RuntimeException(msg);
        } else {
            logger.warn(msg);
        }
    }
    CubingJob result = new CubingJob();
    SimpleDateFormat format = new SimpleDateFormat("z yyyy-MM-dd HH:mm:ss");
    format.setTimeZone(TimeZone.getTimeZone(config.getTimeZone()));
    result.setDeployEnvName(kylinConfig.getDeployEnv());
    result.setProjectName(projList.get(0).getName());
    CubingExecutableUtil.setCubeName(seg.getCubeInstance().getName(), result.getParams());
    CubingExecutableUtil.setSegmentId(seg.getUuid(), result.getParams());
    result.setName(seg.getCubeInstance().getName() + " - " + seg.getName() + " - " + jobType + " - " + format.format(new Date(System.currentTimeMillis())));
    result.setSubmitter(submitter);
    result.setNotifyList(seg.getCubeInstance().getDescriptor().getNotifyList());
    return result;
}

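One point in this logic deserves attention. initCubingJob first looks up the project that contains the cube, and the lookup is by cube name. If cubes with the same name exist in different projects, all of those projects are returned, which is why the result is a List. The code then checks whether the configuration allows a cube to appear in multiple projects: if not, it throws an exception; if so, it only logs a warning. Further down, when the CubingJob's projectName is set, the first element of the returned list is used. So even if the configuration allows duplicate cube names, the projectName may be set incorrectly here. To be safe, keep cube names globally unique, not just unique within a project.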
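The risk can be reproduced with a toy lookup. The sketch below is not ProjectManager; ProjectLookupSketch is a hypothetical in-memory stand-in, and the project and cube names are made up. It only shows why taking the first match is unsafe once two projects contain a cube with the same name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory metadata: project name -> names of the cubes it contains.
class ProjectLookupSketch {
    private final Map<String, List<String>> projects = new LinkedHashMap<>();

    void addCube(String project, String cube) {
        projects.computeIfAbsent(project, p -> new ArrayList<>()).add(cube);
    }

    // Lookup by cube name only, so same-named cubes in different projects all match.
    List<String> findProjects(String cubeName) {
        List<String> found = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : projects.entrySet()) {
            if (e.getValue().contains(cubeName)) {
                found.add(e.getKey());
            }
        }
        return found;
    }
}
```

If project_a and project_b both contain a cube named sales_cube, findProjects("sales_cube") returns both, and taking element 0 labels the job with project_a even when the segment being built belongs to the cube in project_b.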

After the CubingJob is initialized, cuboidRootPath is obtained as follows:

public String getCuboidRootPath(String jobId) {
    return getRealizationRootPath(jobId) + "/cuboid/";
}
public String getRealizationRootPath(String jobId) {
    return getJobWorkingDir(jobId) + "/" + seg.getRealization().getName();
}
public String getJobWorkingDir(String jobId) {
    return getJobWorkingDir(config, jobId);
}
public static String getJobWorkingDir(JobEngineConfig conf, String jobId) {
    return getJobWorkingDir(conf.getHdfsWorkingDirectory(), jobId);
}
public static String getJobWorkingDir(String hdfsDir, String jobId) {
    if (!hdfsDir.endsWith("/")) {
        hdfsDir = hdfsDir + "/";
    }
    return hdfsDir + "kylin-" + jobId;
}

After this chain of calls, the resulting path is:

# Format
hdfs:///kylin/kylin_metadata/kylin-jobId/cubeName/cuboid
# Example
hdfs:///kylin/kylin_metadata/kylin-f8ce6d9a-6366-45ab-9a8e-b81037a3c924/test4_mode1_cube_1/cuboid
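As a worked example, the same concatenation can be run standalone. CuboidPathSketch is a hypothetical helper that takes the working directory and realization name as plain arguments instead of reading them from the config and the segment; the string operations mirror the listing above.

```java
// Hypothetical standalone version of the path concatenation.
class CuboidPathSketch {
    static String jobWorkingDir(String hdfsDir, String jobId) {
        if (!hdfsDir.endsWith("/")) {
            hdfsDir = hdfsDir + "/";
        }
        return hdfsDir + "kylin-" + jobId;
    }

    static String cuboidRootPath(String hdfsDir, String jobId, String realizationName) {
        return jobWorkingDir(hdfsDir, jobId) + "/" + realizationName + "/cuboid/";
    }
}
```

For the working directory hdfs:///kylin/kylin_metadata, the job id f8ce6d9a-6366-45ab-9a8e-b81037a3c924, and the cube test4_mode1_cube_1, this produces hdfs:///kylin/kylin_metadata/kylin-f8ce6d9a-6366-45ab-9a8e-b81037a3c924/test4_mode1_cube_1/cuboid/. Note the trailing slash that getCuboidRootPath appends.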

With the groundwork done, initialization of the CubingJob's task chain begins.

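Building a CubeSegment has four main phases: creating the flat table, building the dictionaries, building the Cube, and updating metadata and cleaning up. The build method follows exactly these phases.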

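Creating the flat table prepares the input data for building the CubeSegment. It is the data preparation phase, so its tasks belong to the data source side and are implemented there, in inputSide. The addStepPhase1_CreateFlatTable method of inputSide adds this phase's tasks to the task list, as follows: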

public void addStepPhase1_CreateFlatTable(DefaultChainedExecutable jobFlow) {
    final String cubeName = CubingExecutableUtil.getCubeName(jobFlow.getParams());
    final KylinConfig cubeConfig = CubeManager.getInstance(KylinConfig.getInstanceFromEnv()).getCube(cubeName).getConfig();
    JobEngineConfig conf = new JobEngineConfig(cubeConfig);
    final String hiveInitStatements = JoinedFlatTable.generateHiveInitStatements(flatTableDatabase);
    final String jobWorkingDir = getJobWorkingDir(jobFlow);
    // create flat table first, then count and redistribute
    jobFlow.addTask(createFlatHiveTableStep(hiveInitStatements, jobWorkingDir, cubeName));
    if (cubeConfig.isHiveRedistributeEnabled() == true) {
        jobFlow.addTask(createRedistributeFlatHiveTableStep(hiveInitStatements, cubeName));
    }
    AbstractExecutable task = createLookupHiveViewMaterializationStep(hiveInitStatements, jobWorkingDir);
    if (task != null) {
        jobFlow.addTask(task);
    }
}

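Here:
The first task extracts the required columns from the fact table and the dimension (lookup) tables in Hive and builds a flat table.
The second task redistributes the flat table produced by the first task by a chosen column, or randomly if no column is specified. The goal is a set of files of roughly equal size as input for the later build tasks, which guards against data skew. This task is added only when isHiveRedistributeEnabled() returns true.
The third task materializes the views in Hive. It is added only when createLookupHiveViewMaterializationStep returns a non-null task.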

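Building the dictionaries consists of extracting distinct column values, building the dictionaries, and saving statistics. This is implemented by the MR engine, so the build method adds these tasks to the task list directly. At the end of this phase the HBase table still has to be created, which is a storage-side task and is therefore implemented in outputSide.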

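Building the Cube generates the cuboid data, one after another, from the prepared data; this is the compute engine's job. Once all cuboid data has been generated, it still has to be put into the storage layer, so the addStepPhase3_BuildCube method of outputSide is called next. It has two steps: converting the cuboid data produced by the compute engine into the storage layer's format, here HFile, and loading the converted HFiles into HBase.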

public void addStepPhase3_BuildCube(DefaultChainedExecutable jobFlow) {
    jobFlow.addTask(steps.createConvertCuboidToHfileStep(jobFlow.getId()));
    jobFlow.addTask(steps.createBulkLoadStep(jobFlow.getId()));
}

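Updating metadata and cleaning up: once all the tasks above have finished, the compute engine updates the cube's metadata, and the data source and storage sides each clean up their own temporary data.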

This completes the construction of a CubingJob. Back at the starting point, the submitJob method of JobService, the following code runs once the CubingJob instance has been obtained:

getExecutableManager().addJob(job);

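This persists the CubingJob information to the Kylin metadata store, which here is the HBase table kylin_metadata. A few other operations follow, but the request ends there: once the CubingJob has been built, it is not actually submitted for execution.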

CubingJobs are actually executed by DefaultScheduler. It has a thread that scans all CubingJobs in HBase once a minute and submits those that need to run to a thread pool.

In other words, in Kylin, building a job and executing a job are asynchronous.
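The split between "persist the job" and "a scheduler picks it up later" can be sketched with standard java.util.concurrent classes. This is not DefaultScheduler: SchedulerSketch, its in-memory store (standing in for the metadata table), the three states, and the pool size are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical polling scheduler; the map stands in for the persisted job metadata.
class SchedulerSketch {
    enum State { READY, RUNNING, SUCCEED }

    private final Map<String, State> store = new ConcurrentHashMap<>();
    private final ExecutorService pool = Executors.newFixedThreadPool(2);
    private final ScheduledExecutorService fetcher = Executors.newSingleThreadScheduledExecutor();

    // "Submitting" a job only records it; nothing is executed here.
    void submit(String jobId) { store.put(jobId, State.READY); }

    State stateOf(String jobId) { return store.get(jobId); }

    // One scan: every READY job is marked RUNNING and handed to the pool.
    void pollOnce() {
        for (String jobId : store.keySet()) {
            if (store.replace(jobId, State.READY, State.RUNNING)) {
                pool.execute(() -> store.put(jobId, State.SUCCEED)); // placeholder for real work
            }
        }
    }

    // Periodic scanning, e.g. start(60) for once a minute.
    void start(long intervalSeconds) {
        fetcher.scheduleAtFixedRate(this::pollOnce, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
    }

    // Stops scanning and waits briefly for running jobs to finish.
    void shutdown() {
        fetcher.shutdownNow();
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

A job stays in READY after submit and only moves on when a scan runs, which is the asynchronous behavior described above: the server that records the job and the process that runs it do not have to be the same.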

The Kylin server has three working modes:

# Kylin server mode, valid value [all, query, job]
kylin.server.mode=all

query: provides the query service only
job: provides only the service that actually executes build jobs
all: combines the functions of query and job

In a Kylin cluster, only one server may run in all or job mode; there is no such limit on servers in query mode.

One difference between the query and job modes is that query mode does not start the DefaultScheduler, while job and all modes do. This guarantees that only one scheduler in the Kylin cluster executes CubingJobs.

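I had assumed that a server in query mode would only serve queries and could not build CubeSegments. In practice, however, a CubeSegment build can be submitted in query mode, which was surprising at the time. It now makes sense: building a CubeSegment is split into constructing the CubingJob and executing the CubingJob. A server in query mode only uses JobService's submitJob to persist the CubingJob to HBase, while the actual execution is still scheduled by the DefaultScheduler on the server running in all or job mode.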

That is how a CubingJob is built, scheduled, and executed.
