
[Spark Java API] Transformation (12): zipPartitions and zip

Tags: Spark

zipPartitions

Official documentation:

Zip this RDD's partitions with one (or more) RDD(s) and return a new RDD by applying a function 
to the zipped partitions. Assumes that all the RDDs have the *same number of partitions*, 
but does *not* require them to have the same number of elements in each partition.

Function signature:

def zipPartitions[U, V](    
    other: JavaRDDLike[U, _], 
    f: FlatMapFunction2[java.util.Iterator[T], java.util.Iterator[U], V]): JavaRDD[V]

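This function combines two RDDs partition by partition: the i-th partition of this RDD and the i-th partition of other are handed to f as a pair of iterators, and the values f returns form the corresponding partition of the new RDD.

Source code analysis: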

def zipPartitions[B: ClassTag, V: ClassTag]    
      (rdd2: RDD[B], preservesPartitioning: Boolean)    
      (f: (Iterator[T], Iterator[B]) => Iterator[V]): RDD[V] = withScope {  
    new ZippedPartitionsRDD2(sc, sc.clean(f), this, rdd2, preservesPartitioning)
}

private[spark] class ZippedPartitionsRDD2[A: ClassTag, B: ClassTag, V: ClassTag](
    sc: SparkContext,    
    var f: (Iterator[A], Iterator[B]) => Iterator[V],    
    var rdd1: RDD[A],    
    var rdd2: RDD[B],    
    preservesPartitioning: Boolean = false)  
  extends ZippedPartitionsBaseRDD[V](sc, List(rdd1, rdd2), preservesPartitioning) {  

  override def compute(s: Partition, context: TaskContext): Iterator[V] = {    
      val partitions = s.asInstanceOf[ZippedPartitionsPartition].partitions    
      f(rdd1.iterator(partitions(0), context), rdd2.iterator(partitions(1), context))  
  }  

  override def clearDependencies() {    
      super.clearDependencies()    
      rdd1 = null    
      rdd2 = null    
      f = null  
  }
}

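As the source shows, zipPartitions creates a ZippedPartitionsRDD2, which extends ZippedPartitionsBaseRDD. The getPartitions method of ZippedPartitionsBaseRDD checks that the RDDs being combined have the same number of partitions, but nothing in the implementation requires each partition to hold the same number of elements.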
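The behaviour described above can be modelled without Spark. The following is a minimal plain-Java sketch, not Spark's implementation: an RDD is represented as a list of partitions, and the class name ZipPartitionsSketch, the helper joinPairs, and the hard-coded partition contents are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.BiFunction;

// Plain-Java model of zipPartitions; no Spark classes are used.
// Each "RDD" is a list of partitions (a hypothetical representation).
class ZipPartitionsSketch {

    static <A, B, V> List<List<V>> zipPartitions(
            List<List<A>> rdd1,
            List<List<B>> rdd2,
            BiFunction<Iterator<A>, Iterator<B>, Iterator<V>> f) {
        // Models the getPartitions check: partition counts must match.
        if (rdd1.size() != rdd2.size()) {
            throw new IllegalArgumentException(
                "Can't zip RDDs with unequal numbers of partitions");
        }
        List<List<V>> result = new ArrayList<>();
        for (int i = 0; i < rdd1.size(); i++) {
            // Models compute: f receives one iterator per parent partition.
            Iterator<V> out = f.apply(rdd1.get(i).iterator(), rdd2.get(i).iterator());
            List<V> partition = new ArrayList<>();
            out.forEachRemaining(partition::add);
            result.add(partition);
        }
        return result;
    }

    // Joins elements pairwise until either iterator runs out.
    static Iterator<String> joinPairs(Iterator<Integer> a, Iterator<Integer> b) {
        List<String> out = new ArrayList<>();
        while (a.hasNext() && b.hasNext()) {
            out.add(a.next() + "_" + b.next());
        }
        return out.iterator();
    }

    public static void main(String[] args) {
        // Partition sizes differ (3 vs 2 in the last partition); only the count must match.
        List<List<Integer>> left = Arrays.asList(
            Arrays.asList(5, 1), Arrays.asList(1, 4), Arrays.asList(4, 2, 2));
        List<List<Integer>> right = Arrays.asList(
            Arrays.asList(3, 2), Arrays.asList(12, 5), Arrays.asList(6, 1));
        // prints [[5_3, 1_2], [1_12, 4_5], [4_6, 2_1]]
        System.out.println(zipPartitions(left, right, ZipPartitionsSketch::joinPairs));
    }
}
```

What happens to unmatched elements is entirely up to f: here joinPairs simply drops the trailing 2 in the last left-hand partition.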

Example:

// Assumes javaSparkContext is an existing JavaSparkContext.
List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 3);
List<Integer> data1 = Arrays.asList(3, 2, 12, 5, 6, 1);
JavaRDD<Integer> javaRDD1 = javaSparkContext.parallelize(data1, 3);
JavaRDD<String> zipPartitionsRDD = javaRDD.zipPartitions(javaRDD1, new FlatMapFunction2<Iterator<Integer>, Iterator<Integer>, String>() {
    // Spark 1.x signature. From Spark 2.0 on, call returns Iterator<String>,
    // so return linkedList.iterator() there instead.
    @Override
    public Iterable<String> call(Iterator<Integer> integerIterator, Iterator<Integer> integerIterator2) throws Exception {
        LinkedList<String> linkedList = new LinkedList<String>();
        // Pair elements until either partition is exhausted; leftovers are dropped.
        while (integerIterator.hasNext() && integerIterator2.hasNext()) {
            linkedList.add(integerIterator.next().toString() + "_" + integerIterator2.next().toString());
        }
        return linkedList;
    }
});
// Prints the prefix followed by [5_3, 1_2, 1_12, 4_5, 4_6, 2_1]
System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + zipPartitionsRDD.collect());
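To see which elements meet in each partition, it helps to work out how the two lists are split. The sketch below is plain Java and rests on an assumption: that parallelize cuts a local list into contiguous slices at positions i * n / numSlices (integer division), which is my understanding of how Spark slices an ordinary collection. The class SliceSketch and its slice method are hypothetical names, not Spark API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of cutting a list into numSlices contiguous partitions.
class SliceSketch {

    static <T> List<List<T>> slice(List<T> data, int numSlices) {
        List<List<T>> partitions = new ArrayList<>();
        int n = data.size();
        for (int i = 0; i < numSlices; i++) {
            // Assumed boundary rule: integer division of i * n by numSlices.
            int start = (int) ((long) i * n / numSlices);
            int end = (int) ((long) (i + 1) * n / numSlices);
            partitions.add(new ArrayList<>(data.subList(start, end)));
        }
        return partitions;
    }

    public static void main(String[] args) {
        // prints [[5, 1], [1, 4], [4, 2, 2]]
        System.out.println(slice(Arrays.asList(5, 1, 1, 4, 4, 2, 2), 3));
        // prints [[3, 2], [12, 5], [6, 1]]
        System.out.println(slice(Arrays.asList(3, 2, 12, 5, 6, 1), 3));
    }
}
```

Under that assumption the third partitions are [4, 2, 2] and [6, 1], so a function that stops when either iterator is exhausted never pairs the final 2.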

zip

Official documentation:

Zips this RDD with another one, returning key-value pairs with the first element in each RDD,
second element in each RDD, etc. Assumes that the two RDDs have the *same number of partitions* 
and the *same number of elements in each partition* (e.g. one was made through a map on the other).

Function signature:

def zip[U](other: JavaRDDLike[U, _]): JavaPairRDD[T, U]

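This function combines two RDDs into a single RDD of key/value pairs: the n-th element of this RDD becomes the key and the n-th element of other becomes the value.

Source code analysis: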

def zip[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {  
  zipPartitions(other, preservesPartitioning = false) { (thisIter, otherIter) =>    
    new Iterator[(T, U)] {      
      def hasNext: Boolean = (thisIter.hasNext, otherIter.hasNext) match {        
        case (true, true) => true        
        case (false, false) => false        
        case _ => throw new SparkException("Can only zip RDDs with " +
          "same number of elements in each partition")
      }      
      def next(): (T, U) = (thisIter.next(), otherIter.next())    
    }  
  }
}

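As the source shows, zip is built on zipPartitions with preservesPartitioning set to false; preservesPartitioning indicates whether the result keeps the parent RDD's partitioner. In addition, the two RDDs must have the same number of partitions and the same number of elements in each partition. A mismatch in element counts raises a SparkException when the affected partition is iterated.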
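The iterator that zip passes to zipPartitions can be rendered in plain Java to make the check explicit. This is a sketch under assumptions, not Spark code: StrictZipSketch, strictZip and collect are hypothetical names, Map.Entry stands in for Scala's tuple, and IllegalStateException stands in for SparkException.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Plain-Java rendering of the per-partition iterator used by zip.
class StrictZipSketch {

    static <T, U> Iterator<Map.Entry<T, U>> strictZip(Iterator<T> a, Iterator<U> b) {
        return new Iterator<Map.Entry<T, U>>() {
            @Override
            public boolean hasNext() {
                boolean left = a.hasNext();
                boolean right = b.hasNext();
                // Exactly one side exhausted: the partitions differ in size.
                if (left != right) {
                    throw new IllegalStateException(
                        "Can only zip RDDs with same number of elements in each partition");
                }
                return left;
            }

            @Override
            public Map.Entry<T, U> next() {
                return new SimpleEntry<>(a.next(), b.next());
            }
        };
    }

    // Drains the zipped iterator into a list, which triggers the size check.
    static <T, U> List<Map.Entry<T, U>> collect(Iterator<T> a, Iterator<U> b) {
        List<Map.Entry<T, U>> out = new ArrayList<>();
        strictZip(a, b).forEachRemaining(out::add);
        return out;
    }

    public static void main(String[] args) {
        // prints [5=3, 1=2]
        System.out.println(collect(
            Arrays.asList(5, 1).iterator(), Arrays.asList(3, 2).iterator()));
        try {
            collect(Arrays.asList(4, 2, 2).iterator(), Arrays.asList(6, 1).iterator());
        } catch (IllegalStateException e) {
            System.out.println("failed: " + e.getMessage());
        }
    }
}
```

The check is lazy: in the failing case two pairs are produced before the third call to hasNext detects the mismatch, which is why the error only appears once the partition is actually consumed.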

Example:

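List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 3);
List<Integer> data1 = Arrays.asList(3, 2, 12, 5, 6, 1, 7);
// Use the same number of slices as javaRDD. Without it the default parallelism
// applies, and zip fails unless that default happens to be 3.
JavaRDD<Integer> javaRDD1 = javaSparkContext.parallelize(data1, 3);
JavaPairRDD<Integer, Integer> zipRDD = javaRDD.zip(javaRDD1);
// Prints the prefix followed by [(5,3), (1,2), (1,12), (4,5), (4,6), (2,1), (2,7)]
System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + zipRDD.collect());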
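Whether two parallelized lists can be zipped depends on their per-partition sizes, not only on their total lengths. The following plain-Java sketch checks this ahead of time. It assumes that a local list of n elements is split at positions i * n / numSlices, which is my understanding of how Spark slices an ordinary collection; ZipCompatSketch, partitionSizes and canZip are hypothetical helpers, not Spark API.

```java
import java.util.Arrays;

// Plain-Java check of whether two sliced lists would have matching partitions.
class ZipCompatSketch {

    // Size of each partition for n elements cut into numSlices slices (assumed rule).
    static int[] partitionSizes(int n, int numSlices) {
        int[] sizes = new int[numSlices];
        for (int i = 0; i < numSlices; i++) {
            int start = (int) ((long) i * n / numSlices);
            int end = (int) ((long) (i + 1) * n / numSlices);
            sizes[i] = end - start;
        }
        return sizes;
    }

    // zip needs the same partition count and the same size in every partition.
    static boolean canZip(int n1, int slices1, int n2, int slices2) {
        return Arrays.equals(partitionSizes(n1, slices1), partitionSizes(n2, slices2));
    }

    public static void main(String[] args) {
        System.out.println(canZip(7, 3, 7, 3)); // prints true:  [2, 2, 3] vs [2, 2, 3]
        System.out.println(canZip(7, 3, 7, 2)); // prints false: [2, 2, 3] vs [3, 4]
        System.out.println(canZip(7, 3, 6, 3)); // prints false: [2, 2, 3] vs [2, 2, 2]
    }
}
```

In practice the safest way to meet both conditions is the one the official description mentions: derive one RDD from the other with a map, so that partitioning and element counts are inherited.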

Author: 小飞_侠_kobe
Link: https://www.jianshu.com/p/d19263471050
