treeAggregate
Official documentation:
Aggregates the elements of this RDD in a multi-level tree pattern.
Function prototypes:
    def treeAggregate[U](zeroValue: U, seqOp: JFunction2[U, T, U], combOp: JFunction2[U, U, U], depth: Int): U
    def treeAggregate[U](zeroValue: U, seqOp: JFunction2[U, T, U], combOp: JFunction2[U, U, U]): U
It can be understood as a more complex, multi-level aggregate.
Source code analysis:
    def treeAggregate[U: ClassTag](zeroValue: U)(
        seqOp: (U, T) => U,
        combOp: (U, U) => U,
        depth: Int = 2): U = withScope {
      require(depth >= 1, s"Depth must be greater than or equal to 1 but got $depth.")
      if (partitions.length == 0) {
        Utils.clone(zeroValue, context.env.closureSerializer.newInstance())
      } else {
        val cleanSeqOp = context.clean(seqOp)
        val cleanCombOp = context.clean(combOp)
        val aggregatePartition =
          (it: Iterator[T]) => it.aggregate(zeroValue)(cleanSeqOp, cleanCombOp)
        var partiallyAggregated = mapPartitions(it => Iterator(aggregatePartition(it)))
        var numPartitions = partiallyAggregated.partitions.length
        val scale = math.max(math.ceil(math.pow(numPartitions, 1.0 / depth)).toInt, 2)
        // If creating an extra level doesn't help reduce
        // the wall-clock time, we stop tree aggregation.
        // Don't trigger TreeAggregation when it doesn't save wall-clock time
        while (numPartitions > scale + math.ceil(numPartitions.toDouble / scale)) {
          numPartitions /= scale
          val curNumPartitions = numPartitions
          partiallyAggregated = partiallyAggregated.mapPartitionsWithIndex {
            (i, iter) => iter.map((i % curNumPartitions, _))
          }.reduceByKey(new HashPartitioner(curNumPartitions), cleanCombOp).values
        }
        partiallyAggregated.reduce(cleanCombOp)
      }
    }
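As the source shows, treeAggregate first aggregates each partition locally using Scala's aggregate function. It then computes scale from the depth parameter. While the number of partitions is still too large, it assigns each partial result the key i % curNumPartitions and merges the results by key with reduceByKey under a HashPartitioner, which divides the partition count by scale on every round. Finally, it runs a reduce over the remaining partial results. Tuning depth therefore lowers the cost of that final reduce.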
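The scale computation and the loop condition are easier to follow with concrete numbers. The following is a minimal sketch in plain Java with no Spark dependency; the class name TreeAggregatePlan and the methods scale and levels are hypothetical names introduced only for illustration. It repeats the arithmetic from the Scala source above and lists the partition count after each intermediate reduceByKey round.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Spark): replays the partition-count
// arithmetic from treeAggregate without running any job.
public class TreeAggregatePlan {

    // scale = max(ceil(numPartitions ^ (1 / depth)), 2), as in the source.
    public static int scale(int numPartitions, int depth) {
        return Math.max((int) Math.ceil(Math.pow(numPartitions, 1.0 / depth)), 2);
    }

    // Partition counts after each intermediate round; an empty list means
    // treeAggregate goes straight to the final reduce.
    public static List<Integer> levels(int numPartitions, int depth) {
        int scale = scale(numPartitions, depth);
        List<Integer> sizes = new ArrayList<>();
        int n = numPartitions;
        // Same stopping rule as the source: add a level only while it pays off.
        while (n > scale + Math.ceil((double) n / scale)) {
            n /= scale;
            sizes.add(n);
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(levels(50, 2));  // [6]
        System.out.println(levels(500, 3)); // [62, 7]
        System.out.println(levels(3, 2));   // []
    }
}
```

With only 3 partitions and the default depth of 2, no intermediate level is created, so a small RDD like the one in the example below is combined by the final reduce alone.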
Example:
    // Imports needed: java.util.Arrays, java.util.List,
    // org.apache.spark.api.java.JavaRDD,
    // org.apache.spark.api.java.function.Function and Function2.
    // javaSparkContext is an existing JavaSparkContext.
    List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
    JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 3);
    // Transformation: convert each Integer to a String
    JavaRDD<String> javaRDD1 = javaRDD.map(new Function<Integer, String>() {
        @Override
        public String call(Integer v1) throws Exception {
            return Integer.toString(v1);
        }
    });
    String result1 = javaRDD1.treeAggregate("0", new Function2<String, String, String>() {
        // seqOp: folds one element into the partition-local accumulator
        @Override
        public String call(String v1, String v2) throws Exception {
            System.out.println(v1 + "=seq=" + v2);
            return v1 + "=seq=" + v2;
        }
    }, new Function2<String, String, String>() {
        // combOp: merges two partial results
        @Override
        public String call(String v1, String v2) throws Exception {
            System.out.println(v1 + "<=comb=>" + v2);
            return v1 + "<=comb=>" + v2;
        }
    });
    System.out.println(result1);
treeReduce
Official documentation:
Reduces the elements of this RDD in a multi-level tree pattern.
Function prototypes:
    def treeReduce(f: JFunction2[T, T, T], depth: Int): T
    def treeReduce(f: JFunction2[T, T, T]): T
Similar to treeAggregate, except that seqOp and combOp are the same function.
Source code analysis:
    def treeReduce(f: (T, T) => T, depth: Int = 2): T = withScope {
      require(depth >= 1, s"Depth must be greater than or equal to 1 but got $depth.")
      val cleanF = context.clean(f)
      val reducePartition: Iterator[T] => Option[T] = iter => {
        if (iter.hasNext) {
          Some(iter.reduceLeft(cleanF))
        } else {
          None
        }
      }
      val partiallyReduced = mapPartitions(it => Iterator(reducePartition(it)))
      val op: (Option[T], Option[T]) => Option[T] = (c, x) => {
        if (c.isDefined && x.isDefined) {
          Some(cleanF(c.get, x.get))
        } else if (c.isDefined) {
          c
        } else if (x.isDefined) {
          x
        } else {
          None
        }
      }
      partiallyReduced.treeAggregate(Option.empty[T])(op, op, depth)
        .getOrElse(throw new UnsupportedOperationException("empty collection"))
    }
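As the source shows, treeReduce first reduces each partition with Scala's reduceLeft, wrapping the result in an Option so that an empty partition yields None. It then runs treeAggregate over the partially reduced RDD, using the same function for seqOp and combOp and an empty Option as the initial value. In practice, treeReduce can replace reduce when a single reduce step is expensive, because adjusting depth controls how much data is merged at each level.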
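The Option handling is the part of treeReduce that is easiest to misread. The following is a minimal sketch in plain Java with no Spark dependency; the class TreeReduceSketch and its methods are hypothetical names, partitions are modeled as in-memory lists, and java.util.Optional stands in for Scala's Option. It merges the partial results in a flat loop, so it illustrates the empty-partition logic only, not the multi-level tree.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.BinaryOperator;

// Hypothetical illustration (not Spark code) of how treeReduce treats
// empty partitions and an entirely empty collection.
public class TreeReduceSketch {

    // Mirrors reducePartition: an empty partition yields Optional.empty().
    public static <T> Optional<T> reducePartition(List<T> partition, BinaryOperator<T> f) {
        return partition.stream().reduce(f);
    }

    // Mirrors op: combine when both sides hold a value, otherwise keep
    // whichever side is present (or empty if neither is).
    public static <T> Optional<T> merge(Optional<T> c, Optional<T> x, BinaryOperator<T> f) {
        if (c.isPresent() && x.isPresent()) {
            return Optional.of(f.apply(c.get(), x.get()));
        }
        return c.isPresent() ? c : x;
    }

    // Starts from an empty Optional, like Option.empty[T] in the source,
    // and fails the same way when no partition held any element.
    public static <T> T reduceAll(List<List<T>> partitions, BinaryOperator<T> f) {
        Optional<T> acc = Optional.empty();
        for (List<T> partition : partitions) {
            acc = merge(acc, reducePartition(partition, f), f);
        }
        return acc.orElseThrow(() -> new UnsupportedOperationException("empty collection"));
    }
}
```

For instance, calling reduceAll on the partitions ["5"], [], ["1", "1"], ["4", "4", "2", "2"] with the function (a, b) -> a + "=" + b skips the empty partition and returns "5=1=1=4=4=2=2".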
Example:
    // Same imports as the treeAggregate example; javaSparkContext is an
    // existing JavaSparkContext.
    List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
    JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 5);
    JavaRDD<String> javaRDD1 = javaRDD.map(new Function<Integer, String>() {
        @Override
        public String call(Integer v1) throws Exception {
            return Integer.toString(v1);
        }
    });
    String result = javaRDD1.treeReduce(new Function2<String, String, String>() {
        @Override
        public String call(String v1, String v2) throws Exception {
            System.out.println(v1 + "=" + v2);
            return v1 + "=" + v2;
        }
    });
    // Print the reduced value (the original referenced an undefined variable here)
    System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + result);
Author: 小飞_侠_kobe
Link: https://www.jianshu.com/p/27222830d21a