[Spark Java API] Transformation (11): reduceByKey, foldByKey

Tags: Spark

reduceByKey

Official documentation:

Merge the values for each key using an associative reduce function. 
This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce.

Function prototypes:

def reduceByKey(partitioner: Partitioner, func: JFunction2[V, V, V]): JavaPairRDD[K, V]
def reduceByKey(func: JFunction2[V, V, V], numPartitions: Int): JavaPairRDD[K, V]

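This function merges the values V belonging to each key K by applying a user-supplied reduce function.
The parameters are:

func: the reduce function, defined by the user as needed;
partitioner: the partitioner that decides which partition each key goes to;
numPartitions: the number of partitions; when only a number is given, a HashPartitioner is used.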

Source code analysis:

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {  
  combineByKey[V]((v: V) => v, func, func, partitioner)
}

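As the source shows, reduceByKey() is implemented on top of combineByKey(): createCombiner is a plain identity conversion, while mergeValue and mergeCombiners are both the user-supplied function. reduceByKey() corresponds to classic MapReduce, and its overall data flow is essentially the same as in Hadoop. Because combineByKey() enables combine() on the map side, reduceByKey() also combines on the map side by default. Before the shuffle, values are combined through a mapPartitions operation, which yields a MapPartitionsRDD; the shuffle then produces a ShuffledRDD; finally the reduce step (implemented with aggregate + mapPartitions()) yields another MapPartitionsRDD.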
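The data flow described above can be modelled without a cluster. The following is a minimal in-memory sketch, not Spark's implementation: the class `ReduceByKeySketch`, the helper `entry`, and the representation of partitions as nested lists are all assumptions made for illustration. It shows reduceByKey expressed as combineByKey with an identity createCombiner and the same function used for mergeValue and mergeCombiners, with a local combine per partition before the final merge.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;
import java.util.function.Function;

public class ReduceByKeySketch {

    // Hypothetical helper: builds one (key, value) record.
    static <K, V> Map.Entry<K, V> entry(K key, V value) {
        return new SimpleEntry<>(key, value);
    }

    // Simplified in-memory model of combineByKey (an assumption, not Spark code).
    static <K, V, C> Map<K, C> combineByKey(
            List<List<Map.Entry<K, V>>> partitions,
            Function<V, C> createCombiner,
            BiFunction<C, V, C> mergeValue,
            BinaryOperator<C> mergeCombiners) {
        // "Map side": combine values locally inside each partition.
        List<Map<K, C>> localResults = new ArrayList<>();
        for (List<Map.Entry<K, V>> partition : partitions) {
            Map<K, C> local = new LinkedHashMap<>();
            for (Map.Entry<K, V> record : partition) {
                K key = record.getKey();
                C combined = local.containsKey(key)
                        ? mergeValue.apply(local.get(key), record.getValue())
                        : createCombiner.apply(record.getValue());
                local.put(key, combined);
            }
            localResults.add(local);
        }
        // "Reduce side": merge the per-partition combiners of each key.
        Map<K, C> result = new LinkedHashMap<>();
        for (Map<K, C> local : localResults) {
            for (Map.Entry<K, C> e : local.entrySet()) {
                result.merge(e.getKey(), e.getValue(), mergeCombiners);
            }
        }
        return result;
    }

    // reduceByKey: identity createCombiner, func for both merge steps.
    static <K, V> Map<K, V> reduceByKey(
            List<List<Map.Entry<K, V>>> partitions, BinaryOperator<V> func) {
        return ReduceByKeySketch.<K, V, V>combineByKey(partitions, v -> v, func, func);
    }

    // Two partitions holding (key, 1) records, summed per key.
    static Map<Integer, Integer> demo() {
        List<List<Map.Entry<Integer, Integer>>> partitions = Arrays.asList(
                Arrays.asList(entry(1, 1), entry(2, 1), entry(1, 1)),
                Arrays.asList(entry(1, 1), entry(2, 1)));
        return reduceByKey(partitions, Integer::sum);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints {1=3, 2=2}
    }
}
```

Because the same function runs both inside a partition and across partitions, it has to be associative, as the official description requires.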

Example:

import java.util.Arrays;
import java.util.List;
import org.apache.spark.Partitioner;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

// javaSparkContext is assumed to be an already created JavaSparkContext
List<Integer> data = Arrays.asList(1, 2, 4, 3, 5, 6, 7);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data);

// Convert to (K, V) pairs
JavaPairRDD<Integer, Integer> javaPairRDD = javaRDD.mapToPair(new PairFunction<Integer, Integer, Integer>() {
    @Override
    public Tuple2<Integer, Integer> call(Integer integer) throws Exception {
        return new Tuple2<Integer, Integer>(integer, 1);
    }
});

// Default partitioning
JavaPairRDD<Integer, Integer> reduceByKeyRDD = javaPairRDD.reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer v1, Integer v2) throws Exception {
        return v1 + v2;
    }
});
System.out.println(reduceByKeyRDD.collect());

// Specify numPartitions
JavaPairRDD<Integer, Integer> reduceByKeyRDD2 = javaPairRDD.reduceByKey(new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer v1, Integer v2) throws Exception {
        return v1 + v2;
    }
}, 2);
System.out.println(reduceByKeyRDD2.collect());

// Custom partitioner
JavaPairRDD<Integer, Integer> reduceByKeyRDD4 = javaPairRDD.reduceByKey(new Partitioner() {
    @Override
    public int numPartitions() {
        return 2;
    }

    @Override
    public int getPartition(Object o) {
        // floorMod keeps the index non-negative even when hashCode() is negative
        return Math.floorMod(o.toString().hashCode(), numPartitions());
    }
}, new Function2<Integer, Integer, Integer>() {
    @Override
    public Integer call(Integer v1, Integer v2) throws Exception {
        return v1 + v2;
    }
});
System.out.println(reduceByKeyRDD4.collect());
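A custom Partitioner is expected to return an index between 0 and numPartitions - 1. Java's `%` operator keeps the sign of the dividend, and String.hashCode() can be negative for some strings, so a hash-modulo partitioner needs some care. For the small integer keys used in the example the hash of the key's string form is positive, so the issue does not show up there. A minimal sketch comparing the two ways of computing the index (the class and method names are hypothetical):

```java
public class PartitionIndexSketch {

    // Plain remainder: the result takes the sign of the hash, so it can be negative.
    static int remainderIndex(int hash, int numPartitions) {
        return hash % numPartitions;
    }

    // Floor modulus: always in [0, numPartitions) for a positive numPartitions.
    static int floorModIndex(int hash, int numPartitions) {
        return Math.floorMod(hash, numPartitions);
    }

    public static void main(String[] args) {
        int negativeHash = -7; // stands in for a key whose hashCode() is negative
        System.out.println(remainderIndex(negativeHash, 2)); // prints -1
        System.out.println(floorModIndex(negativeHash, 2));  // prints 1
        System.out.println(floorModIndex(49, 2));            // prints 1 ("1".hashCode() is 49)
    }
}
```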

foldByKey

Official documentation:

Merge the values for each key using an associative function and a neutral "zero value" which 
may be added to the result an arbitrary number of times, and must not change the result 
(e.g., Nil for list concatenation, 0 for addition, or 1 for multiplication.).

Function prototypes:

def foldByKey(zeroValue: V, partitioner: Partitioner, func: JFunction2[V, V, V]): JavaPairRDD[K, V]
def foldByKey(zeroValue: V, numPartitions: Int, func: JFunction2[V, V, V]): JavaPairRDD[K, V]
def foldByKey(zeroValue: V, func: JFunction2[V, V, V]): JavaPairRDD[K, V]

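This function folds and merges the values V belonging to each key K using the supplied function; zeroValue is the initial value that each fold starts from.
The parameters are:

zeroValue: the initial value;
numPartitions: the number of partitions; when only a number is given, a HashPartitioner is used;
partitioner: the partitioner that decides which partition each key goes to;
func: the merge function, defined by the user.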

Source code analysis:

def foldByKey( zeroValue: V,  partitioner: Partitioner)(func: (V, V) => V): RDD[(K, V)] = self.withScope {  
    // Serialize the zero value to a byte array so that we can get a new clone of it on each key  
    val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)  
    val zeroArray = new Array[Byte](zeroBuffer.limit)  
    zeroBuffer.get(zeroArray)  
    // When deserializing, use a lazy val to create just one instance of the serializer per task  
    lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()  
    val createZero = () => cachedSerializer.deserialize[V](ByteBuffer.wrap(zeroArray))  
    val cleanedFunc = self.context.clean(func)  
    combineByKey[V]((v: V) => cleanedFunc(createZero(), v), cleanedFunc, cleanedFunc, partitioner)
}

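As the implementation shows, foldByKey() is also built on combineByKey(): createCombiner initializes the accumulator by applying the user function to a fresh copy of zeroValue and the first V, while mergeValue and mergeCombiners are both the user-supplied function. Note that createCombiner runs once per key in every partition that contains the key, so zeroValue may enter the result several times. When aggregating the values of a key, choose zeroValue carefully so that it does not change the aggregated result.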
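To make the effect of zeroValue concrete, here is a minimal in-memory sketch, not Spark's implementation. The class `FoldByKeySketch`, the helper `entry`, and the modelling of partitions as nested lists are assumptions for illustration; the sketch also reuses the same zeroValue reference instead of cloning it through the serializer, which is only safe for immutable values such as String. It applies the function to zeroValue once per key per partition, as createCombiner does in the source above.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

public class FoldByKeySketch {

    // Hypothetical helper: builds one (key, value) record.
    static <K, V> Map.Entry<K, V> entry(K key, V value) {
        return new SimpleEntry<>(key, value);
    }

    // Simplified model of foldByKey: zeroValue seeds each key once per partition.
    static <K, V> Map<K, V> foldByKey(
            List<List<Map.Entry<K, V>>> partitions, V zeroValue, BinaryOperator<V> func) {
        Map<K, V> result = new LinkedHashMap<>();
        for (List<Map.Entry<K, V>> partition : partitions) {
            Map<K, V> local = new LinkedHashMap<>();
            for (Map.Entry<K, V> record : partition) {
                K key = record.getKey();
                V combined = local.containsKey(key)
                        ? func.apply(local.get(key), record.getValue())  // mergeValue
                        : func.apply(zeroValue, record.getValue());      // createCombiner
                local.put(key, combined);
            }
            // mergeCombiners: merge this partition's results into the final map.
            for (Map.Entry<K, V> e : local.entrySet()) {
                result.merge(e.getKey(), e.getValue(), func);
            }
        }
        return result;
    }

    // The same three records for key 1, laid out in one partition.
    static List<List<Map.Entry<Integer, String>>> onePartition() {
        return Arrays.asList(
                Arrays.asList(entry(1, "a"), entry(1, "b"), entry(1, "c")));
    }

    // The same three records for key 1, split across two partitions.
    static List<List<Map.Entry<Integer, String>>> twoPartitions() {
        return Arrays.asList(
                Arrays.asList(entry(1, "a"), entry(1, "b")),
                Arrays.asList(entry(1, "c")));
    }

    public static void main(String[] args) {
        BinaryOperator<String> join = (v1, v2) -> v1 + ":" + v2;
        // "X" is not neutral for join, so the result depends on the partitioning.
        System.out.println(foldByKey(onePartition(), "X", join));  // prints {1=X:a:b:c}
        System.out.println(foldByKey(twoPartitions(), "X", join)); // prints {1=X:a:b:X:c}
        // "" is neutral for plain concatenation, so the layout no longer matters.
        System.out.println(foldByKey(onePartition(), "", String::concat));  // prints {1=abc}
        System.out.println(foldByKey(twoPartitions(), "", String::concat)); // prints {1=abc}
    }
}
```

With a non-neutral zeroValue such as "X", the number of times it appears in the output reflects how the records happen to be partitioned, which is why the official description asks for a neutral value.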

Example:

import java.util.Arrays;
import java.util.List;
import java.util.Random;
import org.apache.spark.Partitioner;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

// javaSparkContext is assumed to be an already created JavaSparkContext
List<Integer> data = Arrays.asList(1, 2, 4, 3, 5, 6, 7, 1, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data);
final Random rand = new Random(10);

// Pair each key with a random digit rendered as a String
JavaPairRDD<Integer, String> javaPairRDD = javaRDD.mapToPair(new PairFunction<Integer, Integer, String>() {
    @Override
    public Tuple2<Integer, String> call(Integer integer) throws Exception {
        return new Tuple2<Integer, String>(integer, Integer.toString(rand.nextInt(10)));
    }
});

// Default partitioning
JavaPairRDD<Integer, String> foldByKeyRDD = javaPairRDD.foldByKey("X", new Function2<String, String, String>() {
    @Override
    public String call(String v1, String v2) throws Exception {
        return v1 + ":" + v2;
    }
});
System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + foldByKeyRDD.collect());

// Specify numPartitions
JavaPairRDD<Integer, String> foldByKeyRDD1 = javaPairRDD.foldByKey("X", 2, new Function2<String, String, String>() {
    @Override
    public String call(String v1, String v2) throws Exception {
        return v1 + ":" + v2;
    }
});
System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + foldByKeyRDD1.collect());

// Custom partitioner
JavaPairRDD<Integer, String> foldByKeyRDD2 = javaPairRDD.foldByKey("X", new Partitioner() {
    @Override
    public int numPartitions() {
        return 3;
    }

    @Override
    public int getPartition(Object key) {
        // floorMod keeps the index non-negative even when hashCode() is negative
        return Math.floorMod(key.toString().hashCode(), numPartitions());
    }
}, new Function2<String, String, String>() {
    @Override
    public String call(String v1, String v2) throws Exception {
        return v1 + ":" + v2;
    }
});
System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + foldByKeyRDD2.collect());

Author: 小飞_侠_kobe
Link: https://www.jianshu.com/p/164c02b682ed
