首页手记 spark RDD，reduceByKey vs...

spark RDD，reduceByKey vs groupByKey

标签：

Spark

Spark 中有两个类似的api，分别是 reduceByKey 和 groupByKey 。这两个的功能类似，但底层实现却有些不同，那么为什么要这样设计呢？我们来从源码的角度分析一下。

先看两者的调用顺序（都是使用默认的Partitioner，即defaultPartitioner）

所用 spark 版本：spark 2.1.0

先看reduceByKey

Step1

  def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
    reduceByKey(defaultPartitioner(self), func)
  }

Setp2

  def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
    combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
  }

Setp3

def combineByKeyWithClassTag[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {    require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
    if (keyClass.isArray) {      if (mapSideCombine) {        throw new SparkException("Cannot use map-side combining with array keys.")
      }      if (partitioner.isInstanceOf[HashPartitioner]) {        throw new SparkException("HashPartitioner cannot partition array keys.")
      }
    }
    val aggregator = new Aggregator[K, V, C](      self.context.clean(createCombiner),      self.context.clean(mergeValue),      self.context.clean(mergeCombiners))    if (self.partitioner == Some(partitioner)) {      self.mapPartitions(iter => {
        val context = TaskContext.get()        new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
      }, preservesPartitioning = true)
    } else {      new ShuffledRDD[K, V, C](self, partitioner)
        .setSerializer(serializer)
        .setAggregator(aggregator)
        .setMapSideCombine(mapSideCombine)
    }
  }

姑且不去看方法里面的细节，我们会只要知道最后调用的是 combineByKeyWithClassTag 这个方法。这个方法有两个参数我们来重点看一下，

def combineByKeyWithClassTag[C](
      createCombiner: V => C,      mergeValue: (C, V) => C,      mergeCombiners: (C, C) => C,      partitioner: Partitioner,      mapSideCombine: Boolean = true,      serializer: Serializer = null)

首先是 partitioner 参数，这个即是 RDD 的分区设置。除了默认的 defaultPartitioner，Spark 还提供了 RangePartitioner 和 HashPartitioner 外，此外用户也可以自定义 partitioner 。通过源码可以发现如果是 HashPartitioner 的话，那么是会抛出一个错误的。

然后是 mapSideCombine 参数，这个参数正是 reduceByKey 和 groupByKey 最大不同的地方，它决定是是否会先在节点上进行一次 Combine 操作，下面会有更具体的例子来介绍。

然后是groupByKey

Step1

  def groupByKey(): RDD[(K, Iterable[V])] = self.withScope {
    groupByKey(defaultPartitioner(self))
  }

Step2

  def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {    // groupByKey shouldn't use map side combine because map side combine does not
    // reduce the amount of data shuffled and requires all map side data be inserted
    // into a hash table, leading to more objects in the old gen.
    val createCombiner = (v: V) => CompactBuffer(v)
    val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
    val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
    val bufs = combineByKeyWithClassTag[CompactBuffer[V]](
      createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
    bufs.asInstanceOf[RDD[(K, Iterable[V])]]
  }

Setp3

def combineByKeyWithClassTag[C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true,
      serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {    require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
    if (keyClass.isArray) {      if (mapSideCombine) {        throw new SparkException("Cannot use map-side combining with array keys.")
      }      if (partitioner.isInstanceOf[HashPartitioner]) {        throw new SparkException("HashPartitioner cannot partition array keys.")
      }
    }
    val aggregator = new Aggregator[K, V, C](      self.context.clean(createCombiner),      self.context.clean(mergeValue),      self.context.clean(mergeCombiners))    if (self.partitioner == Some(partitioner)) {      self.mapPartitions(iter => {
        val context = TaskContext.get()        new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
      }, preservesPartitioning = true)
    } else {      new ShuffledRDD[K, V, C](self, partitioner)
        .setSerializer(serializer)
        .setAggregator(aggregator)
        .setMapSideCombine(mapSideCombine)
    }
  }

结合上面 reduceByKey 的调用链，可以发现最终其实都是调用 combineByKeyWithClassTag 这个方法的，但调用的参数不同。
reduceByKey的调用

combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)

groupByKey的调用

combineByKeyWithClassTag[CompactBuffer[V]](
      createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)

正是两者不同的调用方式导致了两个方法的差别，我们分别来看

reduceByKey的泛型参数直接是[V]，而groupByKey的泛型参数是[CompactBuffer[V]]。这直接导致了 reduceByKey 和 groupByKey 的返回值不同，前者是RDD[(K, V)]，而后者是RDD[(K, Iterable[V])]
然后就是mapSideCombine = false 了，这个mapSideCombine 参数的默认是true的。这个值有什么用呢，上面也说了，这个参数的作用是控制要不要在map端进行初步合并（Combine）。可以看看下面具体的例子。

从功能上来说，可以发现 ReduceByKey 其实就是会在每个节点先进行一次合并的操作，而 groupByKey 没有。

这么来看 ReduceByKey 的性能会比 groupByKey 好很多，因为有些工作在节点已经处理了。那么 groupByKey 为什么存在，它的应用场景是什么呢？我也不清楚，如果观看这篇文章的读者知道的话不妨在评论里说出来吧。非常感谢！

作者：终日而思一
链接：https://www.jianshu.com/p/6b535117f826

点击查看更多内容

为 TA 点赞

若觉得本文不错，就分享一下吧！

评论

评论

共同学习，写下你的评论

评论加载中...

展开查看更多评论

作者其他优质文章

正在加载中

海绵宝宝撒

JAVA开发工程师

手记
篇

粉丝

40

获赞与收藏

127

关注作者，订阅最新文章

阅读免费教程

后端通用面试教程

41个小节 32886 371

网络编程入门教程

20个小节 13641 256

Pandas 入门教程

25个小节 20282 387

推荐

评论

收藏

共同学习，写下你的评论



感谢您的支持，我会继续努力的～

扫码打赏，你说多少就多少

赞赏金额会直接到老师账户

支付方式

打开微信扫一扫，即可进行扫码打赏哦

今天注册有机会得

100积分直接送

付费专栏免费学

大额优惠券免费领

立即参与放弃机会

点击
抽奖

慕课手记新用户专享福利

恭喜你，你的运气太好了，居然抽中了 100个积分！

恭喜你，抽中了价值元的专栏！

太棒了，直接落到你账户里！

积分商城里的罗技鼠标、机械键盘、
Kindle 阅读器、小米平衡车
Apple iPad （10.2英寸）、大额优惠券
在等着你去兑换了噢

作者：

免费赠送

兑换码：1111222211 复制

优惠券可用于购买实战课、体系课
无门槛使用

先去看看，有什么好东西马上兑换我爱学习，选课去


热搜

最近搜索清空

spark RDD，reduceByKey vs groupByKey

先看reduceByKey

然后是groupByKey

阅读免费教程