Learning Spark Notes (7) -- Operations That Benefit from Partitioning

Source: Internet. Editor: 程序博客网. Time: 2024/06/08 08:17

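How do unary operations benefit? Take reduceByKey as an example. Normally reduceByKey reduces the values locally on each machine and then sends the partial results across the network to a single host for a further reduce. If the parent RDD carries partitioner information, all values for a given key already sit in one partition, so the reduce can happen entirely locally and nothing has to be sent across the network to other hosts.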
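To make the reasoning concrete, here is a small sketch in plain Scala with no Spark involved. The helper names (`partitionOf`, `hashPartition`, `reduceLocally`) and the sample data are made up for illustration, and the non-negative modulo only imitates what a hash partitioner does. Once records are hash-partitioned by key, each key lives in exactly one partition, so reducing every partition on its own already gives the final result.

```scala
object LocalReduceSketch {
  // Non-negative modulo of the key's hashCode (hypothetical stand-in for a hash partitioner).
  def partitionOf(key: Any, numPartitions: Int): Int = {
    val m = key.hashCode % numPartitions
    if (m < 0) m + numPartitions else m
  }

  // Distribute (key, value) records into numPartitions buckets by key.
  def hashPartition[K, V](data: Seq[(K, V)], numPartitions: Int): Vector[Seq[(K, V)]] = {
    val grouped = data.groupBy { case (k, _) => partitionOf(k, numPartitions) }
    Vector.tabulate(numPartitions)(i => grouped.getOrElse(i, Seq.empty))
  }

  // Sum the values per key, looking at one partition only.
  def reduceLocally[K](partition: Seq[(K, Int)]): Map[K, Int] =
    partition.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }

  val data: Seq[(String, Int)] = Seq("a" -> 1, "b" -> 2, "a" -> 3, "c" -> 4, "b" -> 5)
  val partitions: Vector[Seq[(String, Int)]] = hashPartition(data, 3)
  val localResults: Vector[Map[String, Int]] = partitions.map(p => reduceLocally(p))

  // Each key shows up in exactly one partition, so merging is a plain union
  // and no second, cross-partition reduce is needed.
  val keysPerPartition: Vector[Set[String]] = localResults.map(_.keySet)
  val merged: Map[String, Int] = localResults.flatten.toMap
}
```

Without the partitioning step, the same key could appear in several partitions, and the per-partition sums would have to be combined again, which is the cross-network step that reduceByKey otherwise performs.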

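How do binary operations benefit from partitioning? For join(), for example, pre-partitioning means that at least one of the RDDs (the one with the known partitioner) is not shuffled. If both RDDs have the same partitioner and are cached on the same machines, no shuffle happens at all, for example: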

val b = a.mapValues(x => x * x)
a.join(b)

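Because mapValues() does not change the keys and is not a shuffle operation, b keeps a's partitioner information and its partitions stay on the same machines as a's (provided a has a partitioner in the first place), so the binary operation join() produces no shuffle.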
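One hedged way to check this is to inspect the lineage rather than trust the description. The sketch below assumes Spark core is on the classpath and runs with a local master; the object name, the `shuffleIds` helper, the app name and the sample data are all made up for illustration. It counts the distinct shuffles behind a join when the second RDD comes from mapValues, and again when it comes from map, which drops the partitioner.

```scala
import org.apache.spark.{HashPartitioner, ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CoPartitionedJoinSketch {
  // Collect the ids of all distinct shuffles found in an RDD's lineage.
  def shuffleIds(rdd: RDD[_]): Set[Int] =
    rdd.dependencies.flatMap {
      case s: ShuffleDependency[_, _, _] => shuffleIds(s.rdd) + s.shuffleId
      case d => shuffleIds(d.rdd)
    }.toSet

  // Returns (a and b share a partitioner,
  //          number of shuffles behind a.join(b),
  //          number of shuffles behind a.join(c)).
  def run(): (Boolean, Int, Int) = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("co-partitioned-join-sketch"))
    try {
      val a = sc.parallelize(Seq(1 -> 2, 2 -> 3, 3 -> 4))
        .partitionBy(new HashPartitioner(4)) // the one shuffle we pay for up front
        .persist()
      val b = a.mapValues(x => x * x)             // keeps a's partitioner
      val c = a.map { case (k, v) => (k, v * v) } // map drops the partitioner
      (a.partitioner == b.partitioner,
        shuffleIds(a.join(b)).size,
        shuffleIds(a.join(c)).size)
    } finally {
      sc.stop()
    }
  }
}
```

In this sketch the join with b adds no shuffle beyond the initial partitionBy, while the join with c has to shuffle c. Matching partitioners only show that no shuffle is planned; whether the data is actually read from local memory also depends on a being persisted and on where tasks get scheduled.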

The operations that are affected by partitioning information are:

// Operations that always produce a child RDD with a Partitioner:
cogroup(), groupWith(), join(), leftOuterJoin(), rightOuterJoin(), groupByKey(), sort(), reduceByKey(), combineByKey(), partitionBy()

// Operations that produce a child RDD with a Partitioner only if the parent RDD has one:
mapValues(), flatMapValues(), filter()

Apart from the operations above, no other operation produces a result with a Partitioner.
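The difference is easy to observe through the `partitioner` field of each result. The following is a minimal sketch, assuming Spark core on the classpath and a local master; the object name, app name and sample data are illustrative only.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionerPropagationSketch {
  // For each operation, report whether its result still has a partitioner.
  def run(): Map[String, Boolean] = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("partitioner-propagation-sketch"))
    try {
      val base = sc.parallelize(Seq("a" -> 1, "b" -> 2, "c" -> 3))
        .partitionBy(new HashPartitioner(4))
      Map(
        // These three keep the parent's partitioner.
        "mapValues" -> base.mapValues(_ + 1).partitioner.isDefined,
        "flatMapValues" -> base.flatMapValues(v => Seq(v, v)).partitioner.isDefined,
        "filter" -> base.filter(_._2 > 1).partitioner.isDefined,
        // map could change the keys, so Spark forgets the partitioner
        // even though this particular function leaves the keys alone.
        "map" -> base.map { case (k, v) => (k, v) }.partitioner.isDefined
      )
    } finally {
      sc.stop()
    }
  }
}
```

Spark does not analyze the function passed to map, which is why even a key-preserving map loses the partitioner in this sketch.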

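The last three operations (mapValues(), flatMapValues(), filter()) only pass along a Partitioner that the parent already has.
So how does a child RDD get its Partitioner and its number of partitions? Below is the source of Partitioner.defaultPartitioner from Spark 2.0.0, which defines how both are chosen.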

/**
 * Choose a partitioner to use for a cogroup-like operation between a number of RDDs.
 *
 * If any of the RDDs already has a partitioner, choose that one.
 *
 * Otherwise, we use a default HashPartitioner. For the number of partitions, if
 * spark.default.parallelism is set, then we'll use the value from SparkContext
 * defaultParallelism, otherwise we'll use the max number of upstream partitions.
 *
 * Unless spark.default.parallelism is set, the number of partitions will be the
 * same as the number of partitions in the largest upstream RDD, as this should
 * be least likely to cause out-of-memory errors.
 *
 * We use two method parameters (rdd, others) to enforce callers passing at least 1 RDD.
 */
def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
  val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.length).reverse
  for (r <- bySize if r.partitioner.isDefined && r.partitioner.get.numPartitions > 0) {
    return r.partitioner.get
  }
  if (rdd.context.conf.contains("spark.default.parallelism")) {
    new HashPartitioner(rdd.context.defaultParallelism)
  } else {
    new HashPartitioner(bySize.head.partitions.length)
  }
}

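As the code shows:
1. If exactly one parent RDD has a Partitioner, the child RDD inherits it;
2. If several parent RDDs have a Partitioner, the child RDD inherits the one from the parent with the most partitions;
3. If no parent RDD has a Partitioner, the child RDD gets a default HashPartitioner;
4. In case 3, the child's partition count is spark.default.parallelism if that is set, and otherwise the partition count of the largest parent. In cases 1 and 2 the partition count comes from the inherited Partitioner.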
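The four rules can be replayed without Spark using a small model. This is a sketch of the 2.0.0 logic quoted in this post only; `Part`, `Rdd` and `choose` are hypothetical stand-ins rather than Spark types, and other Spark versions may choose differently.

```scala
object DefaultPartitionerModel {
  // Stand-ins for Spark's types, holding only the fields the rule looks at.
  final case class Part(kind: String, numPartitions: Int)
  final case class Rdd(numPartitions: Int, partitioner: Option[Part])

  // defaultParallelism = None models "spark.default.parallelism is not set".
  def choose(rdds: Seq[Rdd], defaultParallelism: Option[Int]): Part = {
    require(rdds.nonEmpty, "at least one RDD is required")
    // Largest parent first, as in the quoted source.
    val bySize = rdds.sortBy(_.numPartitions).reverse
    bySize.flatMap(_.partitioner).find(_.numPartitions > 0).getOrElse {
      Part("hash", defaultParallelism.getOrElse(bySize.head.numPartitions))
    }
  }

  val range4 = Part("range", 4)
  val hash16 = Part("hash", 16)

  // Rule 1: the only existing partitioner wins, even over a larger parent.
  val rule1 = choose(Seq(Rdd(8, None), Rdd(4, Some(range4))), None)
  // Rule 2: with several partitioners, the largest parent's one wins.
  val rule2 = choose(Seq(Rdd(4, Some(range4)), Rdd(16, Some(hash16))), None)
  // Rule 3: no partitioner anywhere, so hash with the largest parent's count.
  val rule3 = choose(Seq(Rdd(8, None), Rdd(3, None)), None)
  // Rule 4: no partitioner, but the configured parallelism decides the count.
  val rule4 = choose(Seq(Rdd(8, None), Rdd(3, None)), Some(2))
  // An existing partitioner is not overridden by the configured parallelism.
  val keepsExisting = choose(Seq(Rdd(4, Some(range4))), Some(2))
}
```

The last case is the one that is easy to miss: the configured parallelism is consulted only when no parent has a usable partitioner.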

An example of using partitionBy() to optimize a program, PageRank:

import org.apache.spark.HashPartitioner

// Assume that our neighbor list was saved as a Spark objectFile
val links = sc.objectFile[(String, Seq[String])]("links")
  // Hash-partition the links into 100 partitions
  .partitionBy(new HashPartitioner(100))
  // Persist; the default level is StorageLevel.MEMORY_ONLY
  .persist()

// Initialize each page's rank to 1.0; since we use mapValues, the resulting RDD
// will have the same partitioner as links (mapValues does not change partitioning)
var ranks = links.mapValues(v => 1.0)

// Run 10 iterations of PageRank
for (i <- 0 until 10) {
  // flatMap may change the keys, so contributions does not carry the partitioner
  // of links, although it still has 100 partitions
  val contributions = links.join(ranks).flatMap {
    case (pageId, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  // reduceByKey falls back to a default HashPartitioner and takes its partition
  // count from contributions (unless spark.default.parallelism is set);
  // mapValues then keeps that partitioner
  ranks = contributions.reduceByKey((x, y) => x + y).mapValues(v => 0.15 + 0.85 * v)
}

// Write out the final ranks
ranks.saveAsTextFile("ranks")

To maximize the potential for partitioning-related optimizations, use mapValues() or flatMapValues() whenever you are not changing an element's key.

1 0