Spark Installation

Building spark-2.1.0 from source

Prerequisites

jdk1.7.0_79

scala-2.11.8

apache-maven-3.3.9
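
Before building, it is worth confirming that these versions are actually the ones on the PATH; a quick check:

  java -version
  scala -version
  mvn -version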

Steps

Method 1

./build/mvn -Dhttps.protocols=TLSv1,TLSv1.1,TLSv1.2 -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.7.0 -DskipTests clean package

  • Set MAVEN_OPTS before building:

    export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512M"

  • Specify the Scala version first:

    ./dev/change-scala-version.sh 2.11
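
Putting Method 1 together, the whole sequence, run from the top of the Spark source tree, is:

  export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512M"
  ./dev/change-scala-version.sh 2.11
  ./build/mvn -Dhttps.protocols=TLSv1,TLSv1.1,TLSv1.2 -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.7.0 -DskipTests clean package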

Method 2

./dev/make-distribution.sh --name 2.6.0-cdh5.7.0 --tgz -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver -Dhadoop.version=2.6.0-cdh5.7.0

  • Edit the ./dev/make-distribution.sh script:

    1. Comment out the VERSION, SCALA_VERSION, SPARK_HADOOP_VERSION, and SPARK_HIVE lookups and write in your own values directly.

    # VERSION resolves to the Spark version, 2.1.0 here:
    #VERSION=$("$MVN" help:evaluate -Dexpression=project.version $@ 2>/dev/null | grep -v "INFO" | tail -n 1)

    # SCALA_VERSION resolves to scala 2.11:
    #SCALA_VERSION=$("$MVN" help:evaluate -Dexpression=scala.binary.version $@ 2>/dev/null\
    #    | grep -v "INFO"\
    #    | tail -n 1)

    # SPARK_HADOOP_VERSION resolves to hadoop.version=2.6.0-cdh5.7.0:
    #SPARK_HADOOP_VERSION=$("$MVN" help:evaluate -Dexpression=hadoop.version $@ 2>/dev/null\
    #    | grep -v "INFO"\
    #    | tail -n 1)

    # SPARK_HIVE=1 means Hive support is enabled:
    #SPARK_HIVE=$("$MVN" help:evaluate -Dexpression=project.activeProfiles -pl sql/hive $@ 2>/dev/null\
    #    | grep -v "INFO"\
    #    | fgrep --count "<id>hive</id>";\
    #    # Reset exit status to 0, otherwise the script stops here if the last grep finds nothing\
    #    # because we use "set -o pipefail"
    #    echo -n)

    Then just paste the following right after the commented-out block:

    VERSION=2.1.0

    SCALA_VERSION=2.11

    SPARK_HADOOP_VERSION=2.6.0-cdh5.7.0

    SPARK_HIVE=1

  • Specify the Scala version:

    ./dev/change-scala-version.sh 2.11
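
If the build succeeds, --tgz leaves a deployable tarball in the Spark source root, named from the version and --name values above, so here presumably spark-2.1.0-bin-2.6.0-cdh5.7.0.tgz. A minimal sketch of unpacking it (the target directory is just an example, not from the original post):

  tar -zxvf spark-2.1.0-bin-2.6.0-cdh5.7.0.tgz -C /usr/local/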

Pitfall 0

Problem 1: [ERROR] Failed to execute goal on project spark-launcher_2.11: Could not resolve dependencies for project org.apache.spark:spark-launcher_2.11:jar:2.2.0: Failure to find org.apache.hadoop:hadoop-client:jar:2.6.0-cdh5.7.0 in https://repo1.maven.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of central has elapsed or updates are forced -> [Help 1]

This happens because the build pulls from the Apache repositories by default, but our hadoop version is a CDH build, so the CDH repository has to be configured. Open the pom.xml in the Spark source directory and add the Cloudera repo:

vi /usr/local/spark-test/app/spark-2.2.0/pom.xml and add the following inside the <repositories> element:

<repository>
  <id>cloudera</id>
  <name>cloudera Repository</name>
  <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
</repository>

Be careful to distinguish spaces from tabs when adding this; broken formatting will make the mvn build fail.
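
Because maven cached the earlier resolution failure, you may also need to force an update (mvn -U) or delete the stale entry under ~/.m2/repository/org/apache/hadoop. One way to sanity-check that the CDH artifact is now reachable before rerunning the whole build (a sketch using the standard maven-dependency-plugin get goal, not from the original post; it just downloads the jar into the local repository):

  mvn dependency:get -Dartifact=org.apache.hadoop:hadoop-client:2.6.0-cdh5.7.0 -DremoteRepositories=https://repository.cloudera.com/artifactory/cloudera-repos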

Pitfall 1

Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on project spark-tags_2.11: wrap: java.io.IOException: Cannot run program "/home/c/hadoop/jdk1.7.0_79/jre/bin/java" (in directory "."): error=13, Permission denied -> [Help 1]

Fix: chmod -R 777 the entire jdk1.7.0_79 directory.
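
For example, using the JDK path from the error message above:

  chmod -R 777 /home/c/hadoop/jdk1.7.0_79

Anything that makes jre/bin/java executable for the build user would do; 777 is simply the bluntest fix.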

Pitfall 2

The maven build fails with:

Received fatal alert: handshake_failure, along with similar SSL-related warnings.

Fix: in pom.xml, change the https in

<repository>
  <id>cloudera</id>
  <name>cloudera Repository</name>
  <url>https://repository.cloudera.com/artifactory/cloudera-repos</url>
</repository>

to http.
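
Alternatively, you can keep the https URL and instead pass -Dhttps.protocols=TLSv1,TLSv1.1,TLSv1.2 to the build, as the mvn commands above already do; the handshake failure usually means an old JDK is offering TLS versions the server no longer accepts.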

References

https://blog.csdn.net/chen_1122/article/details/77935149

https://blog.csdn.net/jiaotangX/article/details/78635133

https://blog.csdn.net/suisongtiao1799/article/details/80223068

http://feitianbenyue.iteye.com/blog/2429045

Pitfall 4

Because the earlier build command did not include the hive module, spark-sql errors out when accessing hive (the message tells you that Spark needs to be built with hive and hive-thriftserver support), and spark-shell quietly returns nothing (hive clearly has tables, but spark.sql("show tables").show displays no data).

Run:

./build/mvn -Dhttps.protocols=TLSv1,TLSv1.1,TLSv1.2 -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver -Dhadoop.version=2.6.0-cdh5.7.0 -DskipTests clean package

to rebuild with the hive dependencies added.

For problems that may come up during the rebuild, see:

https://stackoverflow.com/questions/36651611/failed-to-execute-goal-net-alchim31-mavenscala-maven-plugin3-2-2

Make sure the machine has network access for the entire build.
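
Once a build with -Phive -Phive-thriftserver completes, a quick check that Hive access works (assuming hive-site.xml has been copied into Spark's conf directory) is:

  ./bin/spark-shell
  scala> spark.sql("show tables").show()

If the rebuild worked, the Hive tables should now appear in the output.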