首页手记 6 Spark Streaming 中Job的动态生成

6 Spark Streaming 中Job的动态生成

标签：

Spark

本文内容以以Socket数据来源为例，通过WordCount计算来跟踪Job的生成
代码如下：

    objectNetworkWordCount {
      defmain(args:Array[String]) {        if (args.length< 2) {
          System.err.println("Usage:NetworkWordCount<hostname> <port>")
          System.exit(1)
        }
        val sparkConf= newSparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
        val ssc = newStreamingContext(sparkConf,Seconds(1))
        val lines= ssc.socketTextStream(args(0),args(1).toInt,StorageLevel.MEMORY_AND_DISK_SER)
        val words= lines.flatMap(_.split(""))
        val wordCounts= words.map(x => (x,1)).reduceByKey(_+ _)
        wordCounts.print()
        ssc.start()
        ssc.awaitTermination()
      ｝
    }

从ssc.start()开始看，在start方法中调用了scheduler的start()方法，这里的scheduler就是
JobScheduler，我们看start的代码

def start(): Unit = synchronized {    if (eventLoop != null) return // scheduler has already been started
    
    logDebug("Starting JobScheduler")
    eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {    override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)    override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
    }    // 启动JobScheduler的事件循环器
    eventLoop.start()    
    // attach rate controllers of input streams to receive batch completion updates
    for { inputDStream <- ssc.graph.getInputStreams
    rateController <- inputDStream.rateController
    } ssc.addStreamingListener(rateController)
    
    listenerBus.start(ssc.sparkContext)
    receiverTracker = new ReceiverTracker(ssc)
    inputInfoTracker = new InputInfoTracker(ssc)    // 启动ReceiverTracker,数据的接收逻辑从这里开始
    receiverTracker.start()    // 启动JobGenerator，job的生成从这里开始
    jobGenerator.start()
    logInfo("Started JobScheduler")
}

Spark Streaming由JobScheduler、ReceiverTracker、JobGenerator三大组件组成，其中ReceiverTracker、
JobGenerator包含在JobScheduler中。这里分别执行三大组件的start方法。

我们先看Job的生成，jobGenerator.start()方法。在JobGenerator的start方法中都做了什么，继续往下看。
首先启动了一个EventLoop并来回调processEvent方法，那么什么时候会触发回调呢，来看一下EventLoop的内部结构

 private[spark] abstract class EventLoop\[E](name: String) extends Logging {  //线程安全的阻塞队列
  private val eventQueue: BlockingQueue[E] = new LinkedBlockingDeque\[E]()  private val stopped = new AtomicBoolean(false)  private val eventThread = new Thread(name) {    //后台线程
    setDaemon(true)

    override def run(): Unit = {      try {        while (!stopped.get) {
          val event = eventQueue.take()          try {            //回调子类的onReceive方法，就是事件的逻辑代码
            onReceive(event)
          } catch {            case NonFatal(e) => {              try {
                onError(e)
              } catch {                case NonFatal(e) => logError("Unexpected error in " + name, e)
              }
            }
          }
        }
      } catch {        case ie: InterruptedException => // exit even if eventQueue is not empty
        case NonFatal(e) => logError("Unexpected error in " + name, e)
      }
    }
  }

  def start(): Unit = {    if (stopped.get) {      throw new IllegalStateException(name + " has already been stopped")
    }    // Call onStart before starting the event thread to make sure it happens before onReceive
    onStart()    // 启动事件循环器
    eventThread.start()
  }

  def stop(): Unit = {    // stopped.compareAndSet(false, true) 判断是否为false，同时赋值为true
    if (stopped.compareAndSet(false, true)) {
     eventThread.interrupt()
      var onStopCalled = false
      try {
        eventThread.join()        // Call onStop after the event thread exits to make sure onReceive happens before onStop
        onStopCalled = true
        onStop()
      } catch {        case ie: InterruptedException =>
          Thread.currentThread().interrupt()          if (!onStopCalled) {            // ie is thrown from `eventThread.join()`. Otherwise, we should not call `onStop` since
            // it's already called.
            onStop()
          }
      }
    } else {      // Keep quiet to allow calling `stop` multiple times.
    }
  }

  def post(event: E): Unit = {
    eventQueue.put(event)
  }

  def isActive: Boolean = eventThread.isAlive  protected def onStart(): Unit = {}  protected def onStop(): Unit = {}  protected def onReceive(event: E): Unit  protected def onError(e: Throwable): Unit

 }

在EventLoop内部其实是维护了一个队列，开辟了一条后台线程来回调实现类的onReceive方法。
那么是什么时候把事件放入EventLoop的队列中呢，就要找EventLoop的post方法了。在JobGenerator实例化的时
候创建了一个RecurringTimer，代码如下：

 private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,  // 回调 eventLoop.post(GenerateJobs(new Time(longTime)))将GenerateJobs事件放入事件循环器
  longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")

RecurringTimer就是一个定时器，看一下他的构造参数和内部代码，
* @param clock 时钟
* @param period 间歇时间
* @param callback 回调方法
* @param name 定时器的名称
很清楚的知道根据用户传入的时间间隔，周期性的回调callback方法。Callback就是前面看到的

longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")

将GenerateJobs事件提交到EventLoop的队列中，此时RecurringTimer还没有执行。
回到JobGenerator中的start方法向下看，因为是第一次运行，所以调用了startFirstTime方法。
在startFirstTime方法中，有一行关键代码timer.start(startTime.milliseconds)，终于看到了定时器的启动

从定时器的start方法开始往回看，周期性的回调eventLoop.post方法将GenerateJobs事件发送到EvenLoop的队列，然后回调rocessEvent方法，看generateJobs(time)。
generateJobs代码如下

private def generateJobs(time: Time) {  // Set the SparkEnv in this thread, so that job generation code can access the environment
  // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
  // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
  SparkEnv.set(ssc.env)  Try {
    jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
    graph.generateJobs(time) // generate jobs using allocated block
  } match {    case Success(jobs) =>      // 获取元数据信息
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)      // 提交jobSet
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}
进入graph.generateJobs(time) ，调用每一个outputStream的generateJob方法，generateJob代码如下private[streaming] def generateJob(time: Time): Option[Job] = {
  getOrCompute(time) match {    case Some(rdd) => {      // jobRunc中包装了runJob的方法
      val jobFunc = () => {
        val emptyFunc = { (iterator: Iterator[T]) => {} }
        context.sparkContext.runJob(rdd, emptyFunc)
      }
      Some(new Job(time, jobFunc))
    }    case None => None
  }
}

getOrCompute返回一个RDD，RDD的生成以后再说，定义了一个函数jobFunc，可以看到函数的作用是提交job，
把jobFunc封装到Job对象然后返回。

返回的是多个job，jobs生成成功后提交JobSet，代码如下
jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
然后分别提交每一个job，把job包装到JobHandler(Runnable子类)交给线程池运行，执行JobHandler的run
方法，调用job.run()，在Job的run方法中就一行，执行Try(func())，这个func()函数就是上面代码中
的jobFunc，看到这里整个Job的生成与提交就连通了。
下面附上一张Job动态生成流程图

以上内容如有错误，欢迎指正

作者：海纳百川_spark
链接：https://www.jianshu.com/p/717b3ed2be54

点击查看更多内容

为 TA 点赞

若觉得本文不错，就分享一下吧！

评论

评论

共同学习，写下你的评论

评论加载中...

展开查看更多评论

作者其他优质文章

正在加载中

富国沪深

算法工程师

手记
篇

粉丝

41

获赞与收藏

158

关注作者，订阅最新文章

阅读免费教程

后端通用面试教程

41个小节 30787 345

网络编程入门教程

20个小节 12717 239

Pandas 入门教程

25个小节 18599 341

推荐

评论

收藏

共同学习，写下你的评论



感谢您的支持，我会继续努力的～

扫码打赏，你说多少就多少

赞赏金额会直接到老师账户

支付方式

打开微信扫一扫，即可进行扫码打赏哦

今天注册有机会得

100积分直接送

付费专栏免费学

大额优惠券免费领

立即参与放弃机会

点击
抽奖

慕课手记新用户专享福利

恭喜你，你的运气太好了，居然抽中了 100个积分！

恭喜你，抽中了价值元的专栏！

太棒了，直接落到你账户里！

积分商城里的罗技鼠标、机械键盘、
Kindle 阅读器、小米平衡车
Apple iPad （10.2英寸）、大额优惠券
在等着你去兑换了噢

作者：

免费赠送

兑换码：1111222211 复制

优惠券可用于购买实战课、体系课
无门槛使用

先去看看，有什么好东西马上兑换我爱学习，选课去


热搜

最近搜索清空

6 Spark Streaming 中Job的动态生成

阅读免费教程