[Spark Java API] Action (5): treeAggregate and treeReduce

Tags: Spark

treeAggregate

Official documentation:

Aggregates the elements of this RDD in a multi-level tree pattern.

Function prototypes:

def treeAggregate[U](    
    zeroValue: U,    
    seqOp: JFunction2[U, T, U],    
    combOp: JFunction2[U, U, U],
    depth: Int): U 
def treeAggregate[U](    
    zeroValue: U,    
    seqOp: JFunction2[U, T, U],    
    combOp: JFunction2[U, U, U]): U

**
Can be understood as a more elaborate, multi-level aggregate.
**

Source analysis:

def treeAggregate[U: ClassTag](zeroValue: U)(    
    seqOp: (U, T) => U,    
    combOp: (U, U) => U,    
    depth: Int = 2): U = withScope {  
  require(depth >= 1, s"Depth must be greater than or equal to 1 but got $depth.")  
  if (partitions.length == 0) {    
    Utils.clone(zeroValue, context.env.closureSerializer.newInstance())  
  } else {    
    val cleanSeqOp = context.clean(seqOp)    
    val cleanCombOp = context.clean(combOp)    
    val aggregatePartition =      
      (it: Iterator[T]) => it.aggregate(zeroValue)(cleanSeqOp, cleanCombOp)    
    var partiallyAggregated = mapPartitions(it => Iterator(aggregatePartition(it)))    
    var numPartitions = partiallyAggregated.partitions.length    
    val scale = math.max(math.ceil(math.pow(numPartitions, 1.0 / depth)).toInt, 2)    
    // If creating an extra level doesn't help reduce    
    // the wall-clock time, we stop tree aggregation.          
    // Don't trigger TreeAggregation when it doesn't save wall-clock time    
    while (numPartitions > scale + math.ceil(numPartitions.toDouble / scale)) {      
      numPartitions /= scale      
      val curNumPartitions = numPartitions      
      partiallyAggregated = partiallyAggregated.mapPartitionsWithIndex {        
        (i, iter) => iter.map((i % curNumPartitions, _))      
      }.reduceByKey(new HashPartitioner(curNumPartitions), cleanCombOp).values    
    }
    partiallyAggregated.reduce(cleanCombOp)
  }
}
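The scale computation and the loop condition in the source above are easier to follow with concrete numbers. The following is a minimal plain-Java sketch (no Spark dependency) that replays only that arithmetic. The class and method names are hypothetical and chosen for illustration; it is not part of the Spark API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: replays the partition-count arithmetic of treeAggregate.
class TreeAggregateLevels {

    // Returns the partition count of each intermediate level that the
    // while-loop in the quoted Scala source would create.
    static List<Integer> intermediatePartitionCounts(int numPartitions, int depth) {
        if (depth < 1) {
            throw new IllegalArgumentException(
                "Depth must be greater than or equal to 1 but got " + depth + ".");
        }
        // scale = max(ceil(numPartitions^(1/depth)), 2), as in the source
        int scale = Math.max((int) Math.ceil(Math.pow(numPartitions, 1.0 / depth)), 2);
        List<Integer> levels = new ArrayList<>();
        // Same stopping rule as the source: add a level only while it still pays off
        while (numPartitions > scale + Math.ceil((double) numPartitions / scale)) {
            numPartitions /= scale; // integer division, like Scala's Int
            levels.add(numPartitions);
        }
        return levels;
    }

    public static void main(String[] args) {
        System.out.println(intermediatePartitionCounts(3, 2));   // []
        System.out.println(intermediatePartitionCounts(50, 2));  // [6]
        System.out.println(intermediatePartitionCounts(200, 3)); // [33, 5]
    }
}
```

With only a handful of partitions the loop body never runs, so the partial results go straight to the final reduce.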

**
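As the source shows, treeAggregate first aggregates each partition locally with Scala's aggregate function. It then computes scale from the depth parameter. While there are still too many partitions, it gives each partial result the key i % curNumPartitions and merges the results by key with reduceByKey, which repartitions them into fewer partitions. Finally, it runs a reduce over whatever partial results remain. Tuning depth therefore lowers the cost of the final reduce.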
**
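One round of the key-based regrouping can be sketched in plain Java without Spark. The sketch below assumes the partial results are non-null and that there are at least as many of them as target partitions. The class and method names are hypothetical. Inside Spark the merge order within a key is not guaranteed, whereas this sketch merges in index order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

// Hypothetical helper: simulates one tree level on an in-memory list.
class RegroupSketch {

    // Partial result i is sent to bucket i % curNumPartitions, and every
    // bucket is merged with combOp, which is what reduceByKey does per key.
    static <U> List<U> regroup(List<U> partials, int curNumPartitions,
                               BinaryOperator<U> combOp) {
        List<U> merged = new ArrayList<>();
        for (int k = 0; k < curNumPartitions; k++) {
            merged.add(null); // null marks a bucket that is still empty
        }
        for (int i = 0; i < partials.size(); i++) {
            int key = i % curNumPartitions;
            U existing = merged.get(key);
            U value = partials.get(i);
            merged.set(key, existing == null ? value : combOp.apply(existing, value));
        }
        return merged;
    }

    public static void main(String[] args) {
        // Seven per-partition results collapsed into three buckets
        List<Integer> partials = Arrays.asList(1, 2, 3, 4, 5, 6, 7);
        System.out.println(regroup(partials, 3, Integer::sum)); // [12, 7, 9]
    }
}
```

Indices 0, 3 and 6 share key 0, indices 1 and 4 share key 1, and indices 2 and 5 share key 2, so seven partial results become three before the next level or the final reduce.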

Example:

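import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;

// javaSparkContext is an existing JavaSparkContext
List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 3);
// Transformation: convert each Integer to its String form
JavaRDD<String> javaRDD1 = javaRDD.map(new Function<Integer, String>() {
  @Override
  public String call(Integer v1) throws Exception {
    return Integer.toString(v1);
  }
});
// seqOp folds the elements of one partition; combOp merges partition results
String result1 = javaRDD1.treeAggregate("0", new Function2<String, String, String>() {
  @Override
  public String call(String v1, String v2) throws Exception {
    System.out.println(v1 + "=seq=" + v2);
    return v1 + "=seq=" + v2;
  }
}, new Function2<String, String, String>() {
  @Override
  public String call(String v1, String v2) throws Exception {
    System.out.println(v1 + "<=comb=>" + v2);
    return v1 + "<=comb=>" + v2;
  }
});
System.out.println(result1);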

treeReduce

Official documentation:

Reduces the elements of this RDD in a multi-level tree pattern.

Function prototypes:

def treeReduce(f: JFunction2[T, T, T], depth: Int): T
def treeReduce(f: JFunction2[T, T, T]): T

**
Similar to treeAggregate: it is effectively a treeAggregate whose seqOp and combOp are the same function.
**

Source analysis:

def treeReduce(f: (T, T) => T, depth: Int = 2): T = withScope {  
  require(depth >= 1, s"Depth must be greater than or equal to 1 but got $depth.")  
  val cleanF = context.clean(f)  
  val reducePartition: Iterator[T] => Option[T] = iter => {    
    if (iter.hasNext) {      
      Some(iter.reduceLeft(cleanF))    
    } else {      
      None    
    }  
  }  
  val partiallyReduced = mapPartitions(it => Iterator(reducePartition(it)))  
  val op: (Option[T], Option[T]) => Option[T] = (c, x) => {
    if (c.isDefined && x.isDefined) {
      Some(cleanF(c.get, x.get))
    } else if (c.isDefined) {
      c
    } else if (x.isDefined) {
      x
    } else {
      None
    }
  }
  partiallyReduced.treeAggregate(Option.empty[T])(op, op, depth)
    .getOrElse(throw new UnsupportedOperationException("empty collection"))
}

**
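As the source shows, treeReduce first reduces each partition with Scala's reduceLeft function. It then runs treeAggregate over the partially reduced RDD, using the same function for seqOp and combOp and an empty Option as the zero value. In practice treeReduce can replace reduce, mainly when a single reduce step is expensive: adjusting depth controls the size of each reduce step.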
**
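The Option-based merge is what lets treeReduce tolerate empty partitions. The following is a minimal plain-Java sketch of that idea, using java.util.Optional in place of Scala's Option and in-memory lists in place of partitions. The class name, method names and the sample partition layout are assumptions made for illustration, and the sketch merges partitions strictly left to right, which Spark does not promise.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.function.BinaryOperator;

// Hypothetical helper: mimics the reducePartition and op functions of treeReduce.
class TreeReduceSketch {

    // Reduce one partition; an empty partition yields Optional.empty()
    static <T> Optional<T> reducePartition(List<T> partition, BinaryOperator<T> f) {
        return partition.stream().reduce(f);
    }

    // Same four cases as `op` in the Scala source
    static <T> Optional<T> merge(Optional<T> c, Optional<T> x, BinaryOperator<T> f) {
        if (c.isPresent() && x.isPresent()) {
            return Optional.of(f.apply(c.get(), x.get()));
        }
        return c.isPresent() ? c : x;
    }

    // Fold all partitions starting from an empty value, then unwrap or fail
    static <T> T reduceAll(List<List<T>> partitions, BinaryOperator<T> f) {
        Optional<T> acc = Optional.empty();
        for (List<T> partition : partitions) {
            acc = merge(acc, reducePartition(partition, f), f);
        }
        return acc.orElseThrow(() -> new UnsupportedOperationException("empty collection"));
    }

    public static void main(String[] args) {
        // Assumed layout: five partitions, one of them empty
        List<List<String>> partitions = Arrays.asList(
            Arrays.asList("5"), Arrays.asList("1", "1"),
            Collections.<String>emptyList(),
            Arrays.asList("4", "4"), Arrays.asList("2", "2"));
        System.out.println(reduceAll(partitions, (a, b) -> a + "=" + b)); // 5=1=1=4=4=2=2
    }
}
```

Because the empty partition contributes Optional.empty(), it is skipped rather than passed to the user function. Only a fully empty input reaches the exception.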

Example:

// javaSparkContext is an existing JavaSparkContext
List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 5);
// Transformation: convert each Integer to its String form
JavaRDD<String> javaRDD1 = javaRDD.map(new Function<Integer, String>() {
  @Override
  public String call(Integer v1) throws Exception {
    return Integer.toString(v1);
  }
});
// The same function is applied within partitions and across partition results
String result = javaRDD1.treeReduce(new Function2<String, String, String>() {
  @Override
  public String call(String v1, String v2) throws Exception {
    System.out.println(v1 + "=" + v2);
    return v1 + "=" + v2;
  }
});
System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + result);

Author: 小飞_侠_kobe
Link: https://www.jianshu.com/p/27222830d21a
