Winse Blog

Stopping and going, hustling and bustling, endlessly busy, knowing no fear.

Mixing YARN and HDFS from different Hadoop versions, and configuring spark-on-yarn

Hadoop provides two main functions: storage and compute. Since Hadoop 2, HDFS has been far more attractive than in Hadoop 1, in both stability and features such as HA. The HDFS in hadoop-2.2.0 is already quite stable, but newer YARN releases offer richer functionality. This post focuses on viewing logs under spark-on-yarn, and on configuring spark-yarn dynamic allocation.

The hadoop-2.2.0 HDFS was already in production; on top of it we set up and run yarn-2.6.0, together with spark-1.3.0-bin-2.2.0.

  • Build

I built inside a virtual machine, sharing the host machine's Maven repository. See 【VMware shared folders】 and 【VMware-Centos6 Build hadoop-2.6】. Watch out for the cmake_symlink_library error: a shared Windows directory cannot hold Linux symlinks.
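Before kicking off the native build, it can be worth confirming that the working directory actually supports symlinks. The check below is a small illustrative sketch (the `.symlink_test` file name is made up):

```shell
# Illustrative sanity check: hadoop's native build (cmake) creates
# symlinks, which fail on a VMware hgfs share of a Windows folder.
# Run this from the intended build directory first.
builddir=.
if ln -s /dev/null "$builddir/.symlink_test" 2>/dev/null; then
  rm -f "$builddir/.symlink_test"
  symlink_ok=yes
  echo "symlinks OK: safe to run the native build here"
else
  symlink_ok=no
  echo "symlinks unsupported: move the source onto a local Linux filesystem"
fi
```

On an hgfs-shared Windows folder the `ln -s` call fails, which is exactly what breaks cmake_symlink_library.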

tar zxvf ~/hadoop-2.6.0-src.tar.gz 
cd hadoop-2.6.0-src/
mvn package -Pdist,native -DskipTests -Dtar -Dmaven.javadoc.skip=true

# hadoop-hdfs is still 2.2, so Spark must be built here against version 2.2!
# Building with 2.6 leads to [UnsatisfiedLinkError:org.apache.hadoop.util.NativeCrc32.nativeComputeChunkedSumsByteArray ](http://blog.csdn.net/zeng_84_long/article/details/44340441)
cd spark-1.3.0
export MAVEN_OPTS="-Xmx3g -XX:MaxPermSize=1g -XX:ReservedCodeCacheSize=512m"
mvn clean package -Phadoop-2.2 -Pyarn -Phive -Phive-thriftserver -Dmaven.test.skip=true -Dmaven.javadoc.skip=true -DskipTests

vi make-distribution.sh # comment out the BUILD_COMMAND line so the package step is not run twice!
./make-distribution.sh  --mvn `which mvn` --tgz  --skip-java-test -Phadoop-2.6 -Pyarn -Dmaven.test.skip=true -Dmaven.javadoc.skip=true -DskipTests
[eshore@bigdatamgr1 spark-1.3.0-bin-2.2.0]$ cat conf/spark-defaults.conf 
# spark.master                     spark://bigdatamgr1:7077,bigdata8:7077
# spark.eventLog.enabled           true
# spark.eventLog.dir               hdfs://namenode:8021/directory
# spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
# spark.executor.extraJavaOptions       -Xmx16g -Xms16g -Xmn256m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:ParallelGCThreads=10
spark.driver.memory              48g
spark.executor.memory            48g
spark.sql.shuffle.partitions     200

#spark.scheduler.mode FAIR
spark.serializer  org.apache.spark.serializer.KryoSerializer
spark.driver.maxResultSize 8g
#spark.kryoserializer.buffer.max.mb 2048

spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 4
spark.shuffle.service.enabled true

[eshore@bigdatamgr1 conf]$ cat spark-env.sh 
#!/usr/bin/env bash

JAVA_HOME=/home/eshore/jdk1.7.0_60

# log4j

__add_to_classpath() {

  root=$1

  if [ -d "$root" ] ; then
    for f in `ls $root/*.jar | grep -v -E '/hive.*.jar'`  ; do
      if [ -n "$SPARK_DIST_CLASSPATH" ] ; then
        export SPARK_DIST_CLASSPATH=$SPARK_DIST_CLASSPATH:$f
      else
        export SPARK_DIST_CLASSPATH=$f
      fi
    done
  fi

}
# this appends the jars to the tail of SPARK_DIST_CLASSPATH
__add_to_classpath "/home/eshore/apache-hive-0.13.1/lib"

#export HADOOP_CONF_DIR=/data/opt/ibm/biginsights/hadoop-2.2.0/etc/hadoop
export HADOOP_CONF_DIR=/home/eshore/hadoop-2.6.0/etc/hadoop
export SPARK_CLASSPATH=$SPARK_CLASSPATH:/home/eshore/spark-1.3.0-bin-2.2.0/conf:$HADOOP_CONF_DIR

# HA
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=bi-00-01.bi.domain.com:2181 -Dspark.deploy.zookeeper.dir=/spark" 

SPARK_PID_DIR=${SPARK_HOME}/pids
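To make the filtering in __add_to_classpath above concrete, here is a self-contained sketch of the same jar-filter loop run against a throwaway directory (the jar names are illustrative):

```shell
# Runnable sketch of the filter used by __add_to_classpath: every jar in
# the directory is appended to SPARK_DIST_CLASSPATH except Hive's own jars.
demo=$(mktemp -d)
touch "$demo/hive-exec-0.13.1.jar" "$demo/datanucleus-core-3.2.10.jar"

SPARK_DIST_CLASSPATH=
for f in `ls $demo/*.jar | grep -v -E '/hive.*.jar'`; do
  if [ -n "$SPARK_DIST_CLASSPATH" ]; then
    SPARK_DIST_CLASSPATH=$SPARK_DIST_CLASSPATH:$f
  else
    SPARK_DIST_CLASSPATH=$f
  fi
done

echo "$SPARK_DIST_CLASSPATH"   # only datanucleus-core survives the filter
rm -rf "$demo"
```

Excluding the hive*.jar files presumably avoids clashing with the Hive 0.13 classes already bundled into the Spark assembly built with -Phive.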
  • Sync
for h in `cat slaves ` ; do rsync -vaz hadoop-2.6.0 $h:~/ --delete --exclude=work --exclude=logs --exclude=metastore_db --exclude=data --exclude=pids ; done
  • Start spark-hive-thrift

./sbin/start-thriftserver.sh --executor-memory 29g --master yarn-client

For a multi-tenant cluster, enabling dynamic allocation (similar to a resource pool) makes better use of the available resources. You can watch executor processes come and go via 【All Applications】-【ApplicationMaster】-【Executors】.
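Note that dynamic allocation on YARN also needs the external shuffle service registered as a NodeManager auxiliary service (spark.shuffle.service.enabled above only turns on the Spark side). A sketch of the yarn-site.xml entries, assuming Spark's yarn-shuffle jar has already been placed on the NodeManager classpath:

```xml
<!-- Illustrative yarn-site.xml fragment: register Spark's external
     shuffle service so executors can be released without losing
     their shuffle output. -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
```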

--END
