Spark can be deployed in several ways, the main ones being Local, Standalone, Mesos, and YARN.
(1) For learning or experimenting on a single machine, use the Local mode.
(2) For a real cluster, use Standalone, Mesos, or YARN: Standalone is Spark's built-in deployment mode, while Mesos and YARN are external resource-scheduling frameworks. In production, YARN is usually preferred because it integrates with an existing Hadoop system and makes it easier to manage the cluster and share its resources. Depending on where the Driver runs, Spark on YARN is further divided into YARN-Cluster and YARN-Client modes, as the example below illustrates.
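As a quick sketch of how the choice shows up in practice, the --master argument of spark-submit selects the deployment (app.py and the host names are placeholders):

spark-submit --master local[4] app.py                       # Local: 4 worker threads on one machine
spark-submit --master spark://master-host:7077 app.py       # Standalone: Spark's built-in cluster manager
spark-submit --master mesos://mesos-host:5050 app.py        # Mesos
spark-submit --master yarn --deploy-mode cluster app.py     # YARN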
YARN is Hadoop's new-generation resource manager, providing operating-system-level scheduling for the frameworks that run on top of it.
The key components of the YARN architecture:
ResourceManager (RM): the cluster-wide scheduler that arbitrates resources among all applications.
NodeManager (NM): the per-node agent that launches and monitors containers.
ApplicationMaster (AM): the per-application process that negotiates resources from the RM and drives the application's execution.
Container: the unit of resource allocation (a slice of memory and CPU on one node) in which tasks run.
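On a live cluster, the first two of these can be inspected with the stock Hadoop YARN CLI:

yarn node -list           # NodeManagers registered with the ResourceManager
yarn application -list    # running applications, each owning one ApplicationMaster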
Spark on YARN deployment modes
To deploy Spark on YARN, make sure that HADOOP_CONF_DIR or YARN_CONF_DIR (configurable in spark-env.sh) points to the directory on the client that contains the Hadoop cluster's configuration files. Spark uses this configuration to connect to the YARN ResourceManager and to write data to HDFS. The files in this directory are also distributed to the YARN cluster, so that every container the application uses sees the same configuration.
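A minimal spark-env.sh sketch; /etc/hadoop/conf is an assumed location, so substitute the directory that actually holds your cluster's core-site.xml and yarn-site.xml:

# In $SPARK_HOME/conf/spark-env.sh (the path below is an assumption)
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf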
Depending on where the Spark Driver runs, an application launched on YARN uses one of two deployment modes: yarn-cluster or yarn-client.
(1) In yarn-cluster mode, the Spark Driver runs inside the ApplicationMaster process managed by YARN, so the client can exit once the application has started. The ResourceManager's address is read from the Hadoop configuration, which is why the --master command-line argument can simply be set to yarn-client or yarn-cluster (see the sketch below).
(2) In yarn-client mode, the Spark Driver runs in the client process, and the ApplicationMaster is used only to request resources from YARN.
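As a minimal sketch, the two modes map to two legacy --master strings (app.py is a placeholder; on newer Spark versions the equivalent spelling is --master yarn plus --deploy-mode cluster or client):

spark-submit --master yarn-cluster app.py    # Driver runs inside the ApplicationMaster on the cluster
spark-submit --master yarn-client app.py     # Driver runs in the local client process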
The difference between yarn-cluster and yarn-client in Spark on YARN
yarn-cluster is the mode typically used in production, whereas yarn-client is better suited to interactive use and debugging, i.e. when you want to see the application's output right away.
First, a key concept: the Application Master. In YARN, every application instance has an ApplicationMaster process, which runs in the first container started for that application. It negotiates with the ResourceManager to request resources and, once they are granted, tells the NodeManagers to start containers on its behalf.
The difference between yarn-cluster and yarn-client therefore comes down to what the ApplicationMaster process does:
(1) In yarn-cluster mode, the Driver runs inside the AM (Application Master), which both requests resources from YARN and supervises the job's execution. After submitting a job, the user can close the client and the job keeps running on YARN; consequently, yarn-cluster mode is unsuitable for interactive jobs.
(2) In yarn-client mode, the ApplicationMaster only requests executors from YARN, while the client talks to the allocated containers to schedule their work, so the client cannot go away. The Driver runs on the client, and since the Driver contains the DAGScheduler and the TaskScheduler, the client must stay up until the entire application has finished.
(Two architecture figures are omitted here: the first depicted yarn-cluster mode, the second yarn-client mode.)
For example, submitting an application in yarn-cluster mode:
spark-submit --master yarn \
--deploy-mode cluster \
--num-executors 25 \
--executor-cores 2 \
--driver-memory 4g \
--executor-memory 4g \
--conf spark.broadcast.compress=true spark_data_analysis_cluster.py > /app/log/.out 2>&1
This command launches a YARN client program, which starts the default ApplicationMaster; the submitted application then runs as a child thread of the ApplicationMaster. The client periodically polls the ApplicationMaster for status updates and displays them in the terminal, and it exits once the application has finished.
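Because the Driver runs inside the ApplicationMaster in this mode, the application's stdout ends up in the AM container's logs rather than the local terminal. Assuming log aggregation is enabled on the cluster, it can be fetched after the run with the standard YARN log tool (the application ID below is a placeholder):

yarn logs -applicationId application_1500000000000_0001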
In yarn-client mode, the ApplicationMaster is responsible only for requesting from the RM the resources that the executors need. When running Spark on YARN, spark-shell and pyspark must use yarn-client mode, because their interactive Driver has to run locally. To launch a Spark application in yarn-client mode, simply pass yarn-client to the --master command-line argument, for example:
pyspark --master yarn-client \
--executor-memory 1g \
--driver-memory 1g \
--num-executors 4 \
--executor-cores 2
--num-executors: the total number of executors (YARN containers) allocated to the application;
--driver-memory: the maximum heap size allocated to the Driver;
--executor-memory: the maximum heap size allocated to each executor;
--executor-cores: the maximum number of processor cores allocated to each executor.
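Once the shell comes up, a quick sanity check that work really is spread across the requested executors; a minimal sketch to run at the pyspark prompt (the numbers are arbitrary):

>>> sc.parallelize(range(1000), 4).map(lambda x: x * x).sum()
332833500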
Below is part of the output of spark-submit --help:
[root@server106 ~]# spark-submit --help
Usage: spark-submit [options] <app jar | python file> [app arguments]
Usage: spark-submit --kill [submission ID] --master [spark://...]
Usage: spark-submit --status [submission ID] --master [spark://...]

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.
  --exclude-packages          Comma-separated list of groupId:artifactId, to exclude while
                              resolving the dependencies provided in --packages to avoid
                              dependency conflicts.
  --repositories              Comma-separated list of additional remote repositories to
                              search for the maven coordinates given with --packages.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor.
  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.
  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.
  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).
  --proxy-user NAME           User to impersonate when submitting the application.
  --help, -h                  Show this help message and exit
  --verbose, -v               Print additional debug output
  --version,                  Print the version of current Spark

 Spark on YARN:
  --executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode,
                              or all available cores on the worker in standalone mode)
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2).
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
  --principal PRINCIPAL       Principal to be used to login to KDC, while running on
                              secure HDFS.
  --keytab KEYTAB             The full path to the file that contains the keytab for the
                              principal specified above. This keytab will be copied to
                              the node running the Application Master via the Secure
                              Distributed Cache, for renewing the login tickets and the
                              delegation tokens periodically.
Author: 7125messi
Source: https://www.jianshu.com/p/3d33f5373120