Kylin源码解析-构建层级分析

本文主要分析，在Kylin层级构建的时候，这个层级的获取流程

代码分析

下面是层级构建的代码，调用了CuboidUtil.getLongestDepth

protected void addLayerCubingSteps(final CubingJob result, final String jobId, final String cuboidRootPath) {
    // Don't know statistics so that tree cuboid scheduler is not determined. Determine the maxLevel at runtime
    // 不知道统计信息，所以cube调度程序不确定。确定运行时的maxLevel
    final int maxLevel = CuboidUtil.getLongestDepth(seg.getCuboidScheduler().getAllCuboidIds());
    // base cuboid step
    result.addTask(createBaseCuboidStep(getCuboidOutputPathsByLevel(cuboidRootPath, 0), jobId));
    // n dim cuboid steps
    for (int i = 1; i <= maxLevel; i++) {
        result.addTask(createNDimensionCuboidStep(getCuboidOutputPathsByLevel(cuboidRootPath, i - 1), getCuboidOutputPathsByLevel(cuboidRootPath, i), i, jobId));
    }
}

在CuboidUtil.getLongestDepth中，调用了CuboidStatsUtil.createDirectChildrenCache,生成了一个Map，key为cuboid，value为这个cuboid的直接的child列表，然后根据这个map，循环层级，生成一个最大的深度。所以核心代码在CuboidStatsUtil.createDirectChildrenCache。

public static int getLongestDepth(Set<Long> cuboidSet) {
       Map<Long, List<Long>> directChildrenCache = CuboidStatsUtil.createDirectChildrenCache(cuboidSet); //这里返回的事每一个cuboid的直接下级的list
       List<Long> cuboids = Lists.newArrayList(cuboidSet);
       Collections.sort(cuboids, new Comparator<Long>() {
           @Override
           public int compare(Long o1, Long o2) {
               return -Long.compare(o1, o2);
           }
       });

       //{1=[], 2=[], 3=[2, 1], 4=[], 5=[4, 1], 7=[5, 3], 15=[7]}
       //循环顺序 15 7 5 4 3 2 1
       int longestDepth = 0;
       Map<Long, Integer> cuboidDepthMap = Maps.newHashMap();
       for (Long cuboid : cuboids) {
           // 这里从最大到最小依次去计算，child的深度，肯定大于parent的深度，最大的那个，是没有深度的
           int parentDepth = cuboidDepthMap.get(cuboid) == null ? 0 : cuboidDepthMap.get(cuboid);
           for (Long childCuboid : directChildrenCache.get(cuboid)) {
               //parentDepth + 1 这里是获取父亲cuboid的深度，如果最大深度小于这个，则设置最大深度为这个。
               if (cuboidDepthMap.get(childCuboid) == null || cuboidDepthMap.get(childCuboid) < parentDepth + 1) {
                   cuboidDepthMap.put(childCuboid, parentDepth + 1);
                   if (longestDepth < parentDepth + 1) {
                       longestDepth = parentDepth + 1;
                   }
               }
           }
       }
       System.out.println(cuboidDepthMap);

       return longestDepth;
   }

下面是的核心代码createDirectChildrenCache


//15 b1111 , 12 b1100  13 b1101  14 b1110  // 1 0001  2 0010  4 0100      3 0011  5 0101 7 0111  15 1111
    public static Map<Long, List<Long>> createDirectChildrenCache(final Set<Long> cuboidSet) {
        /**
         * Sort the list by ascending order:
         * */
        final List<Long> cuboidList = Lists.newArrayList(cuboidSet);
        //这里根据自然顺序排序
        Collections.sort(cuboidList);
        /**
         * Sort the list by ascending order:  按升序排序
         * 1. the more bit count of its value, the bigger  他的二进制位中为1的数越多，就越大
         * 2. the larger of its value, the bigger 它的值越大，越大
         * */
        List<Integer> layerIdxList = Lists.newArrayListWithExpectedSize(cuboidList.size());
        for (int i = 0; i < cuboidList.size(); i++) {
            layerIdxList.add(i);
        }
        Collections.sort(layerIdxList, new Comparator<Integer>() {
            @Override
            public int compare(Integer i1, Integer i2) {
                Long o1 = cuboidList.get(i1);
                Long o2 = cuboidList.get(i2);
                int nBitDiff = Long.bitCount(o1) - Long.bitCount(o2);
                if (nBitDiff != 0) {
                    return nBitDiff;
                }
                return Long.compare(o1, o2);
            }
        });
        /**
         * *构造一个索引数组，用于指向layerIdxList中的位置* (layerCuboidList用于加速连续迭代)
         *
         * Construct an index array for pointing the position in layerIdxList
         * (layerCuboidList is for speeding up continuous iteration)
         * */
        int[] toLayerIdxArray = new int[layerIdxList.size()];
        final List<Long> layerCuboidList = Lists.newArrayListWithExpectedSize(cuboidList.size());
        for (int i = 0; i < layerIdxList.size(); i++) {
            int cuboidIdx = layerIdxList.get(i);
            toLayerIdxArray[cuboidIdx] = i;
            layerCuboidList.add(cuboidList.get(cuboidIdx));
        }

        int[] previousLayerLastIdxArray = new int[layerIdxList.size()];
        int currentBitCount = 0;
        int previousLayerLastIdx = -1;
        for (int i = 0; i < layerIdxList.size(); i++) {
            int cuboidIdx = layerIdxList.get(i);
            //这里获取根据layeridlist排序后的顺序  获取原始数据排序后的数据
            //比如layerIdxList 数据为  0 1 3 2 4 5 6
            //则这里获取的依次就是 1 2 4 3 5 7 15
            int nBits = Long.bitCount(cuboidList.get(cuboidIdx));
            //这里根据位数为1的数量做对比，如果两个cuboid的位数是一样的，则为相同
            if (nBits > currentBitCount) {
                currentBitCount = nBits;
                previousLayerLastIdx = i - 1;
            }
            previousLayerLastIdxArray[i] = previousLayerLastIdx;
        }
        //最后 previousLayerLastIdxArray 返回 -1 -1 -1 2 2 4 5
        //{1=[], 2=[], 3=[2, 1], 4=[], 5=[4, 1], 7=[5, 3], 15=[7]}
        Map<Long, List<Long>> directChildrenCache = Maps.newHashMap();
        for (int i = 0; i < cuboidList.size(); i++) {
            //cuboidList 顺序 1 2 3 4 5 7 15
            Long currentCuboid = cuboidList.get(i);
            LinkedList<Long> directChildren = Lists.newLinkedList();
            //这里拿到对应的cuboid对应的层级
            //拿到的对应的层级值值就是 -1 -1 2 -1 2 4 5
            int lastLayerIdx = previousLayerLastIdxArray[toLayerIdxArray[i]];
            /**
             * Choose one of the two scan strategies
             * 1. cuboids are sorted by its value, like 1,2,3,4,...
             * 2. cuboids are layered and sorted, like 1,2,4,8,...,3,5,...
             * */
            /**
             * 这里有两种扫描策略
             * 1. 按照本来的原生值排序的顺序，比如 1 2 3 4
             * 2. 根据分层来的
             * 有个问题： 这里为什么要用 i-1 和 lastLayerIds对比了
             * 这事因为上面previousLayerLastIdxArray 存的数据是一个层级的数据
             * 比如如果过每个值都是一层， 应该是 -1 0 1 2 3 4 5 ，因为有的值的 bit count 相同，所有相同的为同一级，就成了 -1 -1 -1 2 2 4 5
             * 所以这个 i-1 是一个正常的顺序 -1 0 1 2 3 4 5
             * 而lastLayerIdx 是一个真实的层级，
             *
             *
             */
            if (i - 1 <= lastLayerIdx) {
                //进入到这个循环，表明是进入了新的一层 比如 -1 -1 -1  在第一个-1的时候，会返回True，第二个的时候就是False了，
                //如果是新的一层
                /**
                 * 1. Adding cuboid by descending order
                 * */
                for (int j = i - 1; j >= 0; j--) {
                    //这里是直接拿当前cuboid和比它小的值进行对比，当然，如果是第一层的， 0-1 < 0 ，所以根本没有子cuboid，不会进入下一层
                    //他会和每一个比它小的进行对比，从当前的 i-1 到 0
                    checkAndAddDirectChild(directChildren, currentCuboid, cuboidList.get(j));
                }
            } else {
                //如果不是新得一层，则需要检查当前cuboid和前一层
                /**
                 * 1. Adding cuboid by descending order
                 * 2. Check from lower cuboid layer
                 * */
                for (int j = lastLayerIdx; j >= 0; j--) {
                    // layerCuboidList 值为 1 2 4 3 5 7 15
                    // 而 lastLayerIdx 是-1 -1 -1 2 2 4 5，所以当需要和前面的层级进行比较的时候，直接从layerCuboidList中依次从该层级往下拿就行了
                    // 比如 5 这个值，他的层级为2  而 3 也为2 ，所以5就走到这个分支，又因为 3 和 5 不可能有父子关系，所以就要从上一个层级开始进行判断
                    // 所以拿到的第一个 layerCuboidList.get(2) 就是 4了，依次类推，就能够判断了。这样做的目的其实是为了提高性能
                    checkAndAddDirectChild(directChildren, currentCuboid, layerCuboidList.get(j));
                }
            }
            directChildrenCache.put(currentCuboid, directChildren);
        }
        return directChildrenCache;
    }

主要分为这么几个阶段

根据cuboid的升序排序
生成一个layerIdxList，这个layerIdxList先只是插入了cuboidList的下标，然后再根据指定算法去排序，排序逻辑
1. 首先是cuboid转换成二进制后的1的数量排序
2. 如果数量相同，则根据值升序排序
构造一个索引数组，用于指向layerIdxList中的位置
生成一个layerCuboidList，存储的就是根据上面2逻辑排序后的值
构建一个 previousLayerLastIdxArray 数组，这个主要是标识一个层级关系，存的数据是一个层级的数据，比如如果过每个值都是一层，应该是 -1 0 1 2 3 4 5 ，因为有的值的 bit count 相同，所有相同的为同一级，就成了 -1 -1 -1 2 2 4 5，这里存的就是这样的值。
遍历cuboidList，这里就是根据原生的顺序(大小排序)进行遍历，遍历逻辑如下
1. 获取到对应的lastLayerIdx，也就是上面生成的那个层级
2. 然后进行判断，这里有两种扫描策略
  1. 按照本来的原生值排序的顺序，比如 1 2 3 4
  2. 根据分层
3. 如果是按照原生的顺序
  1. 则遍历从当前值减去1一直到0，和当前的值进行父子判断，如果是子，则加入到对应的list中去
4. 如果是按照层级
  1. 到这里来的一个原因是为了提高性能，对于同层级的，可能会有很多，所以同层级的和同层级的完全不用去比较是不是父子关系，比如3 和 5 ，是同一个层级，完全不用比较，所以就直接和改层级之前的数据进行比较，比如和 1、2进行比较

下面是对比父子关系的代码逻辑，比较简单就不在介绍了。

private static void checkAndAddDirectChild(List<Long> directChildren, Long currentCuboid, Long checkedCuboid) {
       // 这里是判断是否能够有上下级的关系
       if (isDescendant(checkedCuboid, currentCuboid)) {
           //默认是直接的
           boolean ifDirectChild = true;
           for (long directChild : directChildren) {
               //如果有子和这个checedCuboid是上下级，则不用再次引用了
               if (isDescendant(checkedCuboid, directChild)) {
                   ifDirectChild = false;
                   break;
               }
           }
           if (ifDirectChild) {
               directChildren.add(checkedCuboid);
           }
       }
   }
   // 比如 cuboidToCheck 是 12(1100) parentCuboid是13(1101) 则 12 & 13 = 12 则返回true
   public static boolean isDescendant(long cuboidToCheck, long parentCuboid) {
       return (cuboidToCheck & parentCuboid) == cuboidToCheck;
   }

测试与运行时数据分析

下面做一个测试，测试 1L, 2L, 4L, 3L, 5L, 7L, 15L这几个cuboid，下面会一步步介绍响应的数据结构里的数据

    @Test
    public void createDirectChildrenCacheTest11() {
//        Set<Long> cuboidSet1 = Sets.newHashSet(15L, 12L, 13L, 14L);//15 b1111 , 12 b1100  13 b1101  14 b1110
        Set<Long> cuboidSet1 = Sets.newHashSet(1L, 2L, 4L, 3L, 5L, 7L, 15L);// 1 0001  2 0010  4 0100      3 0011  5 0101 7 0111  15 1111
        Map<Long, List<Long>> directChildrenCache = CuboidStatsUtil.createDirectChildrenCache(cuboidSet1);
        System.out.println(directChildrenCache);
    }

原始数据

Index	Value	Binary
0	1	0001
1	2	0010
2	4	0100
3	3	0011
4	5	0101
5	7	0111
6	15	1111

cuboidList

第一次排序

1	Collections.sort(cuboidList)

这里是根据原生大小顺序进行排序

Index	Value	Binary
0	1	0001
1	2	0010
2	3	0011
3	4	0100
4	5	0101
5	7	0111
6	15	1111

layerIdxList

这里构建一个layerIdxList，首先先保存原始的cuboidList的索引

Index	Value
0	0
1	1
2	2
3	3
4	4
5	5
6	6

layerIdxList 排序

这里根据自定义排序方法进行排序，按照升序

cuboidList对应索引的Long值的二进制位中为1的数越多，就越大
如果位数相同，则它的值越大，越大

请看下面的结果，顺序发生了变化，3 和 2 调换了，因为cuboidList.get(2) 为 3(0011) ,而cuboidList.get(3)为4(0100),因为3的1的位数比2的1的位数多，所以4排在前面。

Index	Value
0	0
1	1
2	3
3	2
4	4
5	5
6	6

toLayerIdxArray

构造一个索引数组，用于指向layerIdxList中的位置，对应值如下

Index	Value
0	0
1	1
2	3
3	2
4	4
5	5
6	6

layerCuboidList

(layerCuboidList用于加速连续迭代)

Index	Value	Binary
0	1	0001
1	2	0010
2	4	0100
3	3	0011
4	5	0101
5	7	0111
6	15	1111

previousLayerLastIdxArray

Index	Value
0	-1
1	-1
2	-1
3	2
4	2
5	4
6	5

主要有下面几个步骤

循环layerIdxList
因为layerIdxList是根据bit count和原生大小排过序的，所以拿到的cuboidIdx的值就是从小到大拿
然后拿到cuboidList对应cuboidIdx的值，并且得到1的bit counts
然后和currentBitCount对比，因为currentBitCount默认是为0的，所以第一个肯定比它大
然后previousLayerLastIdx为-1
后面继续循环，如果 bit count值不变，如上，1、2、4、的bitcount都为1，则一直没有变，则对应的值为-1
以此类推

directChildrenCache

循环逻辑，请看下面的列表，Value是按照原生数据排序的，代表cuboidlist，lastLayerIdx是代表当前值对应的层级，i是循环索引，i - 1 <= lastLayerIdx 是做的判断，可以清晰的看到，当在一个新的层级的时候，比如第一次进入 -1 这个层级，上面的判断返回的事True，则表示需要和前面的原始数据进行比较，如果是同层级的后面的数据，则返回了False，直接和上一个层级的数据进行比较。

Index	Value	Binary	lastLayerIdx	i	i - 1 <= lastLayerIdx
0	1	0001	-1	0	True
1	2	0010	-1	1	False
2	3	0011	2	2	True
3	4	0100	-1	3	False
4	5	0101	2	4	False
5	7	0111	4	5	True
6	15	1111	5	6	True

构建结果

1	{1=[], 2=[], 3=[2, 1], 4=[], 5=[4, 1], 7=[5, 3], 15=[7]}

总结

本章通过代码加示例以及图表的方式，详细描述了构建层级的生成逻辑。如果想更详细的了解，可以自己clone项目下来阅读。项目地址https://github.com/pengshuangbao/kylin/blob/2.5.x-comment/

Kylin源码解析系列目录

构建引擎系列

1、Kylin源码解析-kylin构建任务生成与调度执行 | 编程狂想

2、Kylin源码解析-kylin构建流程总览 | 编程狂想

3、Kylin源码解析-构建引擎实现原理 | 编程狂想

4、Kylin源码解析-生成Hive宽表及其他操作 | 编程狂想

5、Kylin源码解析-提取事实表唯一列 | 编程狂想

6、Kylin源码解析-构建层级分析 | 编程狂想

7、Kylin源码解析-构建数据字典和生成Cuboid统计数据

8、Kylin源码解析-生成Hbase表

9、Kylin源码解析-构建Cuboid

10、Kylin源码解析-转换HDFS为Hfile

11、Kylin源码解析-加载Hfile到Hbase中

12、Kylin源码解析-修改元数据以及其他清理工作