A First Look at Spark
Check the Scala environment
Spark 2.1 requires Scala 2.11 or above. If you do not have a suitable environment, see the Scala 2.11 installation guide.
```
$ scala -version
Scala code runner version 2.10.2 -- Copyright 2002-2013, LAMP/EPFL
```
Download Spark 2.1
Download the source package
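A sketch of fetching and unpacking the 2.1.0 source release (the archive URL is illustrative; pick a mirror from the Apache download page):
```
$ wget https://archive.apache.org/dist/spark/spark-2.1.0/spark-2.1.0.tgz
$ tar -xzf spark-2.1.0.tgz
$ cd spark-2.1.0
```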
Build and install
Note: Starting version 2.0, Spark is built with Scala 2.11 by default. Scala 2.10 users should download the Spark source package
and build with Scala 2.10 support.
Since Spark 2.0, releases are built against Scala 2.11 by default. Because the current environment has Scala 2.10, we need to build and install from source manually.
Building for Scala 2.10
- To produce a Spark package compiled with Scala 2.10, use the -Dscala-2.10 property:
```
$ ./dev/change-scala-version.sh 2.10
$ ./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package
```
Note that support for Scala 2.10 is deprecated as of Spark 2.1.0 and may be removed in Spark 2.2.0.
The full Spark build is now complete.
- Building submodules individually
If you only need one of Spark's submodules, say Spark Streaming, you can build just that module instead of also building Spark SQL, GraphX, and the rest. It's possible to build Spark sub-modules using the mvn -pl option.
For instance, you can build the Spark Streaming module using:
```
$ ./build/mvn -pl :spark-streaming_2.11 clean install
```
where spark-streaming_2.11 is the artifactId as defined in streaming/pom.xml file.
Continuous Compilation
Spark also supports continuous compilation: the build watches for source changes and recompiles them on the fly.
We use the scala-maven-plugin which supports incremental and continuous compilation. E.g.
```
$ ./build/mvn scala:cc
```
should run continuous compilation (i.e. wait for changes). However, this has not been tested extensively. A couple of gotchas to note:
- it only scans the paths src/main and src/test (see docs), so it will only work from within certain submodules that have that structure.
- you'll typically need to run mvn install from the project root for compilation within specific submodules to work; this is because submodules that depend on other submodules do so via the spark-parent module.
Thus, the full flow for running continuous-compilation of the core submodule may look more like:
```
$ ./build/mvn install -DskipTests
$ cd core
$ ../build/mvn scala:cc
```
Issue
Build failure
The build throws an SSL error; this is a bug in curl's SSL handling, and the fix is to upgrade curl.
curl version after the upgrade:
```
curl 7.47.1 (x86_64-pc-linux-gnu) libcurl/7.47.1 OpenSSL/1.0.2g zlib/1.2.7 libssh2/1.4.3
```
Errors when using the Spark shell
When running ./bin/spark-shell or ./bin/pyspark, the following error appears:
```
./spark/spark-2.1.0/bin/spark-class: line 77: syntax error near unexpected token `"$ARG"'
./spark/spark-2.1.0/bin/spark-class: line 77: `CMD+=("$ARG")'
```
Some Googling turned up an explanation in an Apache JIRA ticket: commenters point out this is a Bash problem, since Bash only supports the '+=' operator from version 3.1 onward, so upgrading Bash fixes it.
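As a quick sanity check, a sketch of verifying the Bash requirement (the array name is illustrative); the second line exercises the same += array append that spark-class uses:
```
$ bash --version                         # needs to report 3.1 or newer
$ ARR=(a); ARR+=(b); echo "${ARR[@]}"    # prints "a b" on Bash >= 3.1, errors on older versions
```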
- Other issues
Once the build is done, note that unless your Scala version is the 2.11+ that Spark 2.1 expects, every rebuild has to repeat the build steps above.
Command-line mode
- Start the shell:
```
$ ./bin/spark-shell
```
Or pass a master URL explicitly. The --master option specifies the master URL for a distributed cluster, or local to run locally with one thread, or local[N] to run locally with N threads. You should start by using local for testing. For a full list of options, run Spark shell with the --help option.
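For example, a minimal sketch (local[4] is just an illustrative choice of master):
```
$ ./bin/spark-shell --master local[4]
```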
- Hello World
Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs.
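A minimal sketch of the above, run inside spark-shell where sc is the pre-created SparkContext (README.md is an illustrative input path and must point at a real file):
```scala
// Create an RDD from a text file, then apply a transformation and a few actions.
val textFile = sc.textFile("README.md")                    // RDD[String], one element per line
textFile.count()                                           // action: number of lines
textFile.first()                                           // action: first line
val linesWithSpark = textFile.filter(_.contains("Spark"))  // transformation: derived RDD
linesWithSpark.count()                                     // action on the derived RDD
```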
For more of this syntactic sugar, see the documentation on actions and transformations.
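A couple more operations as a sketch, reusing the textFile RDD from the snippet above:
```scala
// transformation + action: word count of the longest line
textFile.map(_.split(" ").size).reduce((a, b) => if (a > b) a else b)

// classic word count: transformations stay lazy until collect() triggers the computation
val wordCounts = textFile.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wordCounts.collect()
```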
Examples
Spark Standalone Mode (standalone deployment)
- Specify the IP and port the master runs on:
```
$ cp conf/spark-env.sh.template conf/spark-env.sh
$ vim conf/spark-env.sh
```
Add the following:
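A sketch of what conf/spark-env.sh might contain here (the host and port values are illustrative; 7077 is the standalone master's default port):
```
SPARK_MASTER_HOST=0.0.0.0   # address the standalone master binds to (illustrative)
SPARK_MASTER_PORT=7077      # port the master listens on
```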
- Issue
- Running ./bin/spark-shell throws java.net.UnknownHostException
Troubleshooting: being new to Spark, I guessed it needed a Hadoop environment to run, so I installed and started Hadoop, only to hit the same error. Digging further, it turned out the hostname could not be resolved; fixing the hosts configuration solved it.
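A sketch of the hosts fix, assuming the unresolved name is the machine's own hostname (spark-node is purely illustrative):
```
# /etc/hosts
127.0.0.1   localhost
127.0.0.1   spark-node   # map the name reported by `hostname` to a reachable address
```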
master + slave
- Start the master:
```
$ ./sbin/start-master.sh
```
Note: if you connect to the Spark cluster at this point, you will get errors about insufficient resources, because no slave has been started to provide resources to the cluster. (My tentative guess is that the master only manages resources and does not contribute compute itself.)
Start a slave:
```
$ ./sbin/start-slave.sh spark://IP:PORT
```
Connect to the cluster:
```
$ ./bin/spark-shell --master spark://0.0.0.0:7077 --total-executor-cores 2
```
--total-executor-cores controls how many cores on the cluster the shell may use.
Submitting a job
- first task
Example code
```
├── simple.sbt
├── src
│   └── main
│       └── scala
│           └── SimpleApp.scala
```
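A minimal sketch of SimpleApp.scala, along the lines of the standard Spark self-contained-application example (the README.md path is illustrative and must point at an existing file):
```scala
/* src/main/scala/SimpleApp.scala */
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val logFile = "README.md"                       // illustrative input file
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()   // RDD of lines, cached for reuse
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    sc.stop()
  }
}
```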
simple.sbt:
```
name := "Simple Project"

version := "1.0"

scalaVersion := "2.11.8"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0"
```
Build the package and submit it, either to the standalone cluster or locally:
```
$ sbt package
$ spark-submit --class "SimpleApp" --master spark://0.0.0.0:7077 target/scala-2.11/simple-project_2.11-1.0.jar
$ spark-submit --class "SimpleApp" --master local[4] target/scala-2.11/simple-project_2.11-1.0.jar
```
These properties go in conf/spark-defaults.conf:
```
spark.master                     spark://5.6.7.8:7077
spark.executor.memory            4g
spark.eventLog.enabled           true
spark.serializer                 org.apache.spark.serializer.KryoSerializer
```
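A sketch of using these settings: copy the template into place, add the properties above, and spark-submit will read conf/spark-defaults.conf automatically (paths assume you run from the Spark home directory):
```
$ cp conf/spark-defaults.conf.template conf/spark-defaults.conf            # then append the properties above
$ spark-submit --class "SimpleApp" target/scala-2.11/simple-project_2.11-1.0.jar   # spark.master etc. now come from the defaults file
```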
Spark + YARN
// TODO
Basic concepts
RDD (Resilient Distributed Dataset)
// TODO