Overview
At a high level, every Spark application consists of a driver program that runs the user’s main function and executes various parallel operations on a cluster. The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.
A second abstraction in Spark is shared variables that can be used in parallel operations. By default, when Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task. Sometimes, a variable needs to be shared across tasks, or between tasks and the driver program. Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums.
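As a minimal sketch of these two kinds of shared variables (assuming an existing SparkContext named sc; the variable names below are illustrative), a broadcast value is read on the executors via .value, while an accumulator is only added to from tasks and read back on the driver:
val broadcastVar = sc.broadcast(Array(1, 2, 3))   // cached read-only on every node
val accum = sc.longAccumulator("My Accumulator")  // tasks only "add" to it; the driver reads it
sc.parallelize(1 to 10).foreach { x =>
  // tasks read the broadcast value and add to the accumulator
  if (broadcastVar.value.contains(x)) accum.add(1)
}
println(accum.value)  // only the driver reads the accumulator's value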
This guide shows each of these features in each of Spark’s supported languages. It is easiest to follow along with if you launch Spark’s interactive shell – either bin/spark-shell for the Scala shell or bin/pyspark for the Python one.
Linking with Spark
Spark 2.2.0 is built and distributed to work with Scala 2.11 by default. (Spark can be built to work with other versions of Scala, too.) To write applications in Scala, you will need to use a compatible Scala version (e.g. 2.11.X).
To write a Spark application, you need to add a Maven dependency on Spark. Spark is available through Maven Central at:
groupId = org.apache.spark
artifactId = spark-core_2.11
version = 2.2.0
In addition, if you wish to access an HDFS cluster, you need to add a dependency on hadoop-client for your version of HDFS.
groupId = org.apache.hadoop
artifactId = hadoop-client
version = <your-hdfs-version>
Finally, you need to import some Spark classes into your program. Add the following lines:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
(Before Spark 1.3.0, you need to explicitly import org.apache.spark.SparkContext._ to enable essential implicit conversions.)
Initializing Spark
The first thing a Spark program must do is to create a SparkContext object, which tells Spark how to access a cluster. To create a SparkContext you first need to build a SparkConf object that contains information about your application.
Only one SparkContext may be active per JVM. You must stop() the active SparkContext before creating a new one.
val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)
The appName parameter is a name for your application to show on the cluster UI. master is a Spark, Mesos or YARN cluster URL, or a special “local” string to run in local mode. In practice, when running on a cluster, you will not want to hardcode master in the program, but rather launch the application with spark-submit and receive it there. However, for local testing and unit tests, you can pass “local” to run Spark in-process.
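Putting the pieces above together, a minimal sketch for local testing might look like the following (the application name "MyApp" and the local[2] master are illustrative choices, not requirements):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("MyApp").setMaster("local[2]")  // hardcoded master: local testing only
val sc = new SparkContext(conf)
// ... define RDDs and run jobs ...
sc.stop()  // stop the active SparkContext before creating a new one in the same JVM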
Using the Shell
In the Spark shell, a special interpreter-aware SparkContext is already created for you, in the variable called sc. Making your own SparkContext will not work. You can set which master the context connects to using the --master argument, and you can add JARs to the classpath by passing a comma-separated list to the --jars argument. You can also add dependencies (e.g. Spark Packages) to your shell session by supplying a comma-separated list of Maven coordinates to the --packages argument. Any additional repositories where dependencies might exist (e.g. Sonatype) can be passed to the --repositories argument. For example, to run bin/spark-shell on exactly four cores, use:
$ ./bin/spark-shell --master local[4]
Or, to also add code.jar to its classpath, use:
$ ./bin/spark-shell --master local[4] --jars code.jar
To include a dependency using Maven coordinates:
$ ./bin/spark-shell --master local[4] --packages "org.example:example:0.1"
For a complete list of options, run spark-shell --help. Behind the scenes, spark-shell invokes the more general spark-submit script.
Resilient Distributed Datasets (RDDs)
Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.
Parallelized Collections
Parallelized collections are created by calling SparkContext’s parallelize method on an existing collection in your driver program (a Scala Seq). The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. For example, here is how to create a parallelized collection holding the numbers 1 to 5:
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
Once created, the distributed dataset (distData) can be operated on in parallel. For example, we might call distData.reduce((a, b) => a + b) to add up the elements of the array. We describe operations on distributed datasets later on.
One important parameter for parallel collections is the number of partitions to cut the dataset into. Spark will run one task for each partition of the cluster. Typically you want 2-4 partitions for each CPU in your cluster. Normally, Spark tries to set the number of partitions automatically based on your cluster. However, you can also set it manually by passing it as a second parameter to parallelize (e.g. sc.parallelize(data, 10)). Note: some places in the code use the term slices (a synonym for partitions) to maintain backward compatibility.
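A quick way to check the effect of that second parameter in the shell (a sketch reusing the data array from the snippet above):
val distData = sc.parallelize(data, 10)  // explicitly request 10 partitions
distData.getNumPartitions                // returns 10; Spark runs one task per partition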
External Datasets
Spark can create distributed datasets from any storage source supported by Hadoop, including your local file system, HDFS, Cassandra, HBase, Amazon S3, etc. Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
Text file RDDs can be created using SparkContext’s textFile method. This method takes a URI for the file (either a local path on the machine, or a hdfs://, s3n://, etc URI) and reads it as a collection of lines. Here is an example invocation:
scala> val distFile = sc.textFile("data.txt")
distFile: org.apache.spark.rdd.RDD[String] = data.txt MapPartitionsRDD[10] at textFile at <console>:26
Once created, distFile can be acted on by dataset operations. For example, we can add up the sizes of all the lines using the map and reduce operations as follows: distFile.map(s => s.length).reduce((a, b) => a + b).
Some notes on reading files with Spark:
- If using a path on the local filesystem, the file must also be accessible at the same path on worker nodes. Either copy the file to all workers or use a network-mounted shared file system.
- All of Spark’s file-based input methods, including textFile, support running on directories, compressed files, and wildcards as well. For example, you can use textFile("/my/directory"), textFile("/my/directory/*.txt"), and textFile("/my/directory/*.gz").
- The textFile method also takes an optional second argument for controlling the number of partitions of the file. By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS), but you can also ask for a higher number of partitions by passing a larger value. Note that you cannot have fewer partitions than blocks (see the sketch after this list).
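A short sketch combining a wildcard path with an explicit partition count (the directory and the value 8 are illustrative):
val logs = sc.textFile("/my/directory/*.txt", 8)  // ask for at least 8 partitions
logs.map(_.length).reduce(_ + _)                  // total length of all lines across the matched files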
Apart from text files, Spark’s Scala API also supports several other data formats:
- SparkContext.wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with textFile, which would return one record per line in each file. Partitioning is determined by data locality which, in some cases, may result in too few partitions. For those cases, wholeTextFiles provides an optional second argument for controlling the minimal number of partitions.
- For SequenceFiles, use SparkContext’s sequenceFile[K, V] method where K and V are the types of key and values in the file. These should be subclasses of Hadoop’s Writable interface, like IntWritable and Text. In addition, Spark allows you to specify native types for a few common Writables; for example, sequenceFile[Int, String] will automatically read IntWritables and Texts.
- For other Hadoop InputFormats, you can use the SparkContext.hadoopRDD method, which takes an arbitrary JobConf and input format class, key class and value class. Set these the same way you would for a Hadoop job with your input source. You can also use SparkContext.newAPIHadoopRDD for InputFormats based on the “new” MapReduce API (org.apache.hadoop.mapreduce).
- RDD.saveAsObjectFile and SparkContext.objectFile support saving an RDD in a simple format consisting of serialized Java objects. While this is not as efficient as specialized formats like Avro, it offers an easy way to save any RDD.
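A sketch exercising the wholeTextFiles, sequenceFile, and object-file methods described above (all paths are hypothetical, and an existing SparkContext sc is assumed):
val files = sc.wholeTextFiles("/my/directory")             // RDD[(String, String)] of (filename, content)
val seqData = sc.sequenceFile[Int, String]("/my/seqfile")  // IntWritable/Text read as native Int/String
sc.parallelize(1 to 100).saveAsObjectFile("/my/objects")   // save as serialized Java objects
val restored = sc.objectFile[Int]("/my/objects")           // read them back as RDD[Int]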
RDD Operations
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently. For example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
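For instance, a small sketch of this laziness, reusing data.txt from the earlier example: nothing is read or computed until the final reduce, which is an action:
val lines = sc.textFile("data.txt")   // lazy: just a pointer to the file
val lengths = lines.map(_.length)     // lazy: the map is only recorded, not run
val total = lengths.reduce(_ + _)     // action: Spark now reads the file and computes the result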
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
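A sketch of the persist pattern, building on the data.txt example above (cache() is shorthand for the default memory-only persist):
val lengths = sc.textFile("data.txt").map(_.length)
lengths.persist()                  // equivalent to cache(); keep the computed partitions in memory
val total = lengths.reduce(_ + _)  // first action: computes lengths and caches it
val longest = lengths.max()        // second action: served from the cache, no re-read of the file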