An Analysis of the Execution Flow of Each Part of mapPartitions

Source: Internet. Published by: linux chs bg. Editor: 程序博客网. Time: 2024/06/10 21:08

How-to: Translate from MapReduce to Apache Spark http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/

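This article is very well written and is a good guide for anyone moving from MapReduce to Spark.

Midway through, the article discusses how to imitate MapReduce's cleanup() method and gives a solution, but the explanation is a little short on detail, so I would like to expand on it.

Solution

The article mainly cites a discussion on the Spark mailing list (http://mail-archives.apache.org/mod_mbox/spark-user/201407.mbox/%3CCAPH-c_O9kQO6yJ4khXUVdO=+D4vj=JfG2tP9eqn5RPko=dRNAg@mail.gmail.com%3E), which gives the following solution: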

Please be careful with that, it will not work as expected. First, it would have to be:

    rdd.mapPartitions { partition =>
      // Some setup code here  @1
      val result = partition.map(yourfunction)  // @2
      // Some cleanup code here  @3
      result
    }

because the function passed in to mapPartitions() needs to return an
Iterator, and if you do it like this, then the cleanup code will run
*before* the processing takes place because partition.map() is executed
lazily.

One example of what actually works is:

    rdd.mapPartitions { partition =>
      if (!partition.isEmpty) {
        // Some setup code here
        partition.map(item => {
          val output = yourfunction(item)
          if (!partition.hasNext) {
            // Some cleanup code here
          }
          output
        })
      } else {
        // return an empty Iterator of your return type
      }
    }

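Explanation of the solution

What does the statement val result = partition.map(yourfunction) actually do?

partition is an Iterator object. Iterator.map() does not perform any iteration; it only creates a new AbstractIterator object whose next() method calls the user-supplied yourfunction.

In other words, partition.map() returns as soon as it has created the AbstractIterator object, without doing any data processing at all.

So if the cleanup code is placed after partition.map(), it does not do its job: it runs before a single element has been processed.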
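The laziness described above can be reproduced without Spark, on a plain Scala Iterator. The following is a minimal sketch; the object name LazyMapDemo, the events buffer, and the doubling function are illustrative assumptions and do not come from the article or the mailing-list thread.

```scala
import scala.collection.mutable.ArrayBuffer

object LazyMapDemo {
  // Runs the "cleanup after map" pattern on a plain Iterator
  // and records the order in which things actually happen.
  def cleanupAfterMap(): List[String] = {
    val events = ArrayBuffer.empty[String]
    val partition = Iterator(1, 2, 3)

    events += "setup"
    // map() only wraps `partition`; the function body has not run yet.
    val result = partition.map { x =>
      events += s"process $x"
      x * 2
    }
    events += "cleanup" // runs immediately, before any element is processed

    // Only now, when the consumer drains the iterator, is the function invoked.
    result.foreach(_ => ())
    events.toList
  }

  def main(args: Array[String]): Unit =
    cleanupAfterMap().foreach(println)
}
```

Running main prints setup, cleanup, process 1, process 2, process 3, in that order: the cleanup step comes before every processing step.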

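The following walks through the whole process and shows where each method gets called.

First, locate the task that fetches the shuffle data:

    func(context, rdd.iterator(partition, context))

Here func is the operation actually performed by the corresponding action, for example the one inside rdd.collect();
rdd.iterator() is the real starting point of data reading, and the rest of this walkthrough follows the calls made from it; iterator() ends up calling the RDD's compute() method.

Each RDD implements its own compute() method.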

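The compute() method of MapPartitionsRDD:

    f(context, split.index, firstParent[T].iterator(split, context))

Here f is the transformation function passed in when the MapPartitionsRDD was created, usually the function literal the user supplied when calling a transformation;
firstParent[T].iterator(split, context) calls into the RDD that this RDD depends on, and so on up the chain until a ShuffledRDD is reached, whose compute() method is then called.

The compute() method of ShuffledRDD:

This is where the shuffle data is fetched from the various machines; an iterator over that data is then created and returned.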

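With that, we can go through the program's execution in the order in which things happen:

ShuffledRDD's compute() method reads the shuffle data and returns an iterator;
MapPartitionsRDD's compute() method takes the iterator obtained above and calls the function literal written at the application level, that is, the user-defined part of the code below.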

    rdd.mapPartitions { partition =>
      // Some setup code here  @1  (runs when the iterator is being transformed)
      val result = partition.map(yourfunction)  // yourfunction is only called as elements are actually pulled from the iterator
      // Some cleanup code here  @3  (also runs when the iterator is being transformed)
      result
    }

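In the user-defined part above, partition.map() creates and returns a new iterator. The next() method of this new iterator applies the user's yourfunction function literal to the value obtained from the next() method of the iterator that was passed in to the MapPartitionsRDD;
The iterator from the previous step is consumed by the concrete operation inside the action the user invoked, and that operation is func(context, rdd.iterator(partition, context)).

So only the "yourfunction" literal is called when data is actually being pulled from the iterator;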
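The order of events described above can be imitated with a toy model. The sketch below is not Spark code: MiniRDD, SourceRDD, and MiniMapPartitionsRDD are hypothetical, heavily simplified stand-ins (a single partition, no TaskContext) for RDD, ShuffledRDD, and MapPartitionsRDD, meant only to show which piece of code runs at which moment.

```scala
import scala.collection.mutable.ArrayBuffer

object ComputeChainSketch {
  // Hypothetical stand-in for Spark's RDD: one partition, no context.
  abstract class MiniRDD[T] {
    def compute(): Iterator[T]
    def iterator(): Iterator[T] = compute()
  }

  // Plays the role of ShuffledRDD: produces the iterator over the input data.
  class SourceRDD[T](data: Seq[T], log: ArrayBuffer[String]) extends MiniRDD[T] {
    def compute(): Iterator[T] = {
      log += "source.compute"
      data.iterator
    }
  }

  // Plays the role of MapPartitionsRDD: applies the user function f
  // to the parent's iterator.
  class MiniMapPartitionsRDD[T, U](parent: MiniRDD[T], f: Iterator[T] => Iterator[U])
      extends MiniRDD[U] {
    def compute(): Iterator[U] = f(parent.iterator())
  }

  def run(): List[String] = {
    val log = ArrayBuffer.empty[String]
    val source = new SourceRDD(Seq(1, 2), log)
    val mapped = new MiniMapPartitionsRDD[Int, Int](source, { partition =>
      log += "user code: build iterator"
      partition.map { x => log += s"yourfunction($x)"; x + 1 }
    })
    // Plays the role of func(context, rdd.iterator(partition, context)) in the task.
    val it = mapped.iterator()
    log += "action: start consuming"
    val total = it.sum
    log += s"action: result $total"
    log.toList
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```

In the recorded log, "source.compute" and "user code: build iterator" both appear before "action: start consuming", while the yourfunction entries appear only after it, once the action starts pulling elements.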

With the analysis above, the approach suggested on the mailing list is easy to understand.

Finally, a summary:

    lines.mapPartitions { valueIterator =>
      if (valueIterator.isEmpty) {
        Iterator[...]()
      } else {
        val dbConnection = ...
        valueIterator.map { item =>
          val transformedItem = ...
          if (!valueIterator.hasNext) {
            dbConnection.close()
          }
          transformedItem
        }
      }
    }

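lines.mapPartitions(usercode) is called on the driver. It creates one link in the DAG of RDDs and passes the user code into the newly created MapPartitionsRDD;
valueIterator => … is the "user code" mentioned above. It is called inside the RDD's compute() method, and its job is to build the RDD's Iterator object so that the data can be traversed through it;
item => … is executed only when the RDD is actually traversed. The previous step creates the new Iterator object, and the old Iterator object is captured by this function literal as part of its closure;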
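The three stages in this summary can be tried out on a plain Iterator. The sketch below assumes a hypothetical FakeConnection class in place of a real database connection and a made-up transformation (item * 10); userCode has the shape of the function one would pass to mapPartitions, but nothing here depends on Spark.

```scala
import scala.collection.mutable.ArrayBuffer

object CleanupOnLastElement {
  // Hypothetical stand-in for a database connection; it only records what happens to it.
  final class FakeConnection(log: ArrayBuffer[String]) {
    log += "open"
    def close(): Unit = { log += "close"; () }
  }

  // Same shape as the function passed to mapPartitions, written against a plain Iterator.
  def userCode(log: ArrayBuffer[String])(valueIterator: Iterator[Int]): Iterator[Int] =
    if (valueIterator.isEmpty) {
      Iterator.empty
    } else {
      val conn = new FakeConnection(log) // setup: runs when the iterator is built
      valueIterator.map { item =>
        log += s"transform $item"
        val transformed = item * 10
        // The old iterator is captured in the closure; once it is exhausted,
        // the current item is the last one, so release the resource.
        if (!valueIterator.hasNext) conn.close()
        transformed
      }
    }

  // Drains the resulting iterator and returns the output plus the recorded events.
  def run(input: Seq[Int]): (List[Int], List[String]) = {
    val log = ArrayBuffer.empty[String]
    val out = userCode(log)(input.iterator).toList
    (out, log.toList)
  }
}
```

For the input Seq(1, 2), the recorded events are open, transform 1, transform 2, close. Note that close() is reached only if the consumer drains the iterator to the end; a consumer that stops early would leave the connection open.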

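One more point. The English article says:

"Even so, it would only close the connection on the driver, not necessarily freeing resources allocated by serialized copies."

I think this statement is mistaken. As the analysis above shows, an RDD's compute() method runs on the executors, distributed across the cluster, and not on the driver. Both the open and the close of the connection therefore happen inside tasks on the executors, not on the driver.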

0 0