Spark 2.1 Get Started

A first look at Spark

Check the Scala Environment

Spark 2.1 requires Scala 2.11 or later. If your environment does not have it, see the Scala 2.11 installation guide.

```
$ scala -version
Scala code runner version 2.10.2 -- Copyright 2002-2013, LAMP/EPFL
```

Download Spark 2.1

Download the source package

```
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0.tgz
```

Build and Install

Note: Starting with version 2.0, Spark is built with Scala 2.11 by default. Scala 2.10 users should download the Spark source package and build with Scala 2.10 support.

Because Spark has been built against Scala 2.11 by default since version 2.0, and the Scala version in the current environment is 2.10, we need to build and install it manually.

Building for Scala 2.10

  • To produce a Spark package compiled with Scala 2.10, use the -Dscala-2.10 property:
    ```
    $ ./dev/change-scala-version.sh 2.10
    $ ./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package
    ```

Note that support for Scala 2.10 is deprecated as of Spark 2.1.0 and may be removed in Spark 2.2.0.
This produces a complete Spark build.

  • Building submodules individually
    If you only need a single Spark submodule, for example Spark Streaming, you can build that module on its own instead of also building modules such as Spark SQL and GraphX.

    It’s possible to build Spark sub-modules using the mvn -pl option.
    For instance, you can build the Spark Streaming module using:

    ```
    $ ./build/mvn -pl :spark-streaming_2.11 clean install
    ```

where spark-streaming_2.11 is the artifactId as defined in the streaming/pom.xml file.

Continuous Compilation

Spark also supports continuous compilation; in other words, the build watches for source changes and recompiles them automatically.

We use the scala-maven-plugin which supports incremental and continuous compilation. E.g.

```
$ ./build/mvn scala:cc
```

should run continuous compilation (i.e. wait for changes). However, this has not been tested extensively. A couple of gotchas to note:

  • it only scans the paths src/main and src/test (see docs), so it will only work from within certain submodules that have that structure.
  • you'll typically need to run mvn install from the project root for compilation within specific submodules to work; this is because submodules that depend on other submodules do so via the spark-parent module.

Thus, the full flow for running continuous-compilation of the core submodule may look more like:

```
$ ./build/mvn install -DskipTests
$ cd core
$ ../build/mvn scala:cc
```

Issues

  • Build failure
    curl reports an SSL error. This is a bug in curl's SSL handling, and updating curl fixes it.
    The curl version in use after the update:

    ```
    curl 7.47.1 (x86_64-pc-linux-gnu) libcurl/7.47.1 OpenSSL/1.0.2g zlib/1.2.7 libssh2/1.4.3
    ```
  • Spark shell fails to start
    Running ./bin/spark-shell or ./bin/pyspark produces the following error:

    ```
    ./spark/spark-2.1.0/bin/spark-class: line 77: syntax error near unexpected token `"$ARG"'
    ./spark/spark-2.1.0/bin/spark-class: line 77: `CMD+=("$ARG")'
    ```

After some googling, a related explanation turned up in Apache's JIRA: commenters point out that this is a bash problem. Bash only supports the += operator since version 3.1, so simply upgrading bash resolves it. Check your version with:

```
$ bash --version
```
  • Other notes
    Unless your Scala version is the 2.11+ that Spark 2.1 expects, every rebuild has to go through the build steps above.

Interactive Shell

  • Start
    ```
    $ ./bin/spark-shell
    ```

or

```
$ ./bin/spark-shell --master local[2]
```

The --master option specifies the master URL for a distributed cluster, or local to run locally with one thread, or local[N] to run locally with N threads. You should start by using local for testing. For a full list of options, run Spark shell with the --help option.

  • Hello World

    Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs.

```
scala> val tf = sc.textFile("README.md")
tf: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[1] at textFile at <console>:27
scala> tf.count()
res0: Long = 104
scala> tf.first()
res1: String = # Apache Spark
```

For more of these operations, see the documentation on RDD actions and transformations.
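To make the distinction concrete, here is a minimal sketch (run inside spark-shell, where sc is already defined). Transformations such as filter and flatMap only describe a new RDD lazily, while actions such as count actually trigger a job; the RDD names below are illustrative only.

```
val lines = sc.textFile("README.md")                // base RDD
val sparkLines = lines.filter(_.contains("Spark"))  // transformation: lazy, nothing runs yet
val words = lines.flatMap(_.split(" "))             // transformation: lazy
println(sparkLines.count())                         // action: triggers a job
println(words.distinct().count())                   // action: triggers another job
```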

Examples

Spark Standalone Mode

  • Specify the master host and port
    ```
    $ cp conf/spark-env.sh.template conf/spark-env.sh
    $ vim conf/spark-env.sh
    ```

Add the following:

```
export SPARK_MASTER_HOST=0.0.0.0
export SPARK_MASTER_PORT=8077
export SPARK_MASTER_WEBUI_PORT=8078
```

  • Issue
    • Running ./bin/spark-shell throws java.net.UnknownHostException
      Troubleshooting: being new to Spark, I guessed it needed a Hadoop environment to run, so I installed and started Hadoop, but the same error appeared. Digging further, the real cause was that the hostname could not be resolved; fixing the hosts entry solved it.

Master + Slave

  • Start the master
    ```
    $ ./sbin/start-master.sh
    ```

Note: if you connect to the Spark cluster at this point, you get errors about insufficient resources, because no slave (worker) resources have been started for the cluster yet. (My working guess is that the master only manages resources and does not itself provide compute resources.)

  • Start a slave

    ```
    $ ./sbin/start-slave.sh spark://IP:PORT
    ```
  • Connect to the cluster

    ```
    $ ./bin/spark-shell --master spark://0.0.0.0:7077 --total-executor-cores 2
    ```

--total-executor-cores controls how many cores of the cluster the shell uses.

Submitting a Job

  • First job
    Sample code:
```
/** SimpleApp.scala */
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md"
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numOfA = logData.filter(line => line.contains("a")).count()
    val numOfB = logData.filter(line => line.contains("b")).count()
    // Use the s interpolator so the counts are actually substituted into the string
    println(s"Lines with a: $numOfA, Lines with b: $numOfB")
    sc.stop()
  }
}
```
Project layout:

```
simple
├── simple.sbt
└── src
    └── main
        └── scala
            └── SimpleApp.scala
```

Build file simple.sbt:

```
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"
```

Build command:

```
$ sbt package
```

> The build may fail with an error like
> *unresolved dependency: org.glassfish.hk2#hk2-utils;2.22.2: not found*
> **Solution**: drop the dependency that cannot be resolved; see [stackoverflow](http://stackoverflow.com/questions/20912369/sbt-fails-to-resolve-dependency-for-jersey-container-grizzly2-http-2-5-1) for details. A sketch of how this can look in simple.sbt follows below.
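One common way to drop an unresolved transitive dependency in sbt is the exclude method on a library dependency. The snippet below is a hypothetical sketch only, assuming the artifact is pulled in through spark-core; adjust the group and artifact names to whatever actually fails in your build.

```
// simple.sbt -- hypothetical: exclude the unresolved transitive artifact from spark-core
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0" exclude("org.glassfish.hk2", "hk2-utils")
```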
Submit the job:

```
$ spark-submit --class "SimpleApp" --master spark://0.0.0.0:7077 target/scala-2.11/simple-project_2.11-1.0.jar
```

Or submit in local mode:

```
$ spark-submit --class "SimpleApp" --master local[4] target/scala-2.11/simple-project_2.11-1.0.jar
```

> Besides command-line arguments, spark-submit also reads the configuration in conf/spark-defaults.conf, for example:

```
spark.master            spark://5.6.7.8:7077
spark.executor.memory   4g
spark.eventLog.enabled  true
spark.serializer        org.apache.spark.serializer.KryoSerializer
```
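The same properties can also be set programmatically on SparkConf inside the application; values set there take the highest precedence, ahead of spark-submit flags and spark-defaults.conf. A minimal sketch, reusing the example values above:

```
import org.apache.spark.{SparkConf, SparkContext}

// Example values only; properties set on SparkConf override spark-submit flags
// and conf/spark-defaults.conf for this application.
val conf = new SparkConf()
  .setAppName("Simple Application")
  .set("spark.executor.memory", "4g")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)
```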

Spark + YARN

// TODO

Basic Concepts

RDD (Resilient Distributed Dataset)
// TODO
