Winse Blog

Stop and go, hustle and bustle, always busy, not knowing what there is to fear.

Configuring a (Dynamic) Executor Pool for SparkSQL on YARN

Official configuration documentation

Hands-on

Modify the YARN configuration, add spark-yarn-shuffle.jar, sync the configuration and the jar to the NodeManager nodes, and restart YARN.

[hadoop@hadoop-master1 hadoop-2.6.3]$ vi etc/hadoop/yarn-site.xml 
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle,spark_shuffle</value>
</property>

<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

<property>
<name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
<value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>

[hadoop@hadoop-master1 hadoop-2.6.3]$ cp ~/spark-1.6.0-bin-2.6.3/lib/spark-1.6.0-yarn-shuffle.jar share/hadoop/yarn/

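# Push the jar (share/) and the configuration (etc/) to every host listed in etc/hadoop/slaves
for h in $(cat etc/hadoop/slaves) ; do rsync -az share "$h:~/hadoop-2.6.3/" ; done
for h in $(cat etc/hadoop/slaves) ; do rsync -az etc "$h:~/hadoop-2.6.3/" ; done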

rsync -vaz etc hadoop-master2:~/hadoop-2.6.3/
rsync -vaz share hadoop-master2:~/hadoop-2.6.3/

[hadoop@hadoop-master1 hadoop-2.6.3]$ sbin/stop-yarn.sh 
[hadoop@hadoop-master1 hadoop-2.6.3]$ sbin/start-yarn.sh 
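
Before restarting YARN it may be worth confirming that the edited file really lists spark_shuffle. The following is a minimal sketch, not part of Hadoop or Spark: check_aux_services is a hypothetical helper name, and it assumes the <name> and <value> elements sit on adjacent lines, as in the yarn-site.xml shown above.

```shell
#!/bin/sh
# Sketch only: check_aux_services is a hypothetical helper, not a Hadoop tool.
# Assumes <name> and <value> are on adjacent lines, as in the listing above.
check_aux_services() {
    site="$1"
    # Take the <value> on the line right after the aux-services <name>.
    value=$(grep -A1 '<name>yarn.nodemanager.aux-services</name>' "$site" \
        | sed -n 's:.*<value>\(.*\)</value>.*:\1:p')
    # Wrap in commas so spark_shuffle only matches as a whole list entry.
    case ",$value," in
        *,spark_shuffle,*) echo "ok" ;;
        *)                 echo "missing" ;;
    esac
}

# Example: check_aux_services etc/hadoop/yarn-site.xml
```

The same function could be run on each NodeManager host after the rsync step to confirm the configuration arrived.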

Hive 1.2.1 was already deployed (it matches the Hive version that Spark 1.6.0 is built against), so simply symlink hive-site.xml into spark/conf:

[hadoop@hadoop-master1 spark-1.6.0-bin-2.6.3]$ cd conf/
[hadoop@hadoop-master1 conf]$ ln -s ~/hive/conf/hive-site.xml 

[hadoop@hadoop-master1 spark-1.6.0-bin-2.6.3]$ ll conf/hive-site.xml 
lrwxrwxrwx. 1 hadoop hadoop 36 3月  25 11:30 conf/hive-site.xml -> /home/hadoop/hive/conf/hive-site.xml

Note: if Tez was configured previously, comment out the hive.execution.engine setting in hive-site.xml, or switch engines at launch time instead: bin/spark-sql --master yarn-client --hiveconf hive.execution.engine=mr
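
The launch-time override in the note can be applied only when it is needed. This is a rough sketch under stated assumptions: engine_override is a hypothetical helper, it expects <name> and <value> on adjacent lines, and it cannot tell whether the property sits inside an XML comment.

```shell
#!/bin/sh
# Sketch only: engine_override is a hypothetical helper.
# Prints the override flag when hive-site.xml selects tez, nothing otherwise.
# Limitation: a property that is commented out in the XML still matches.
engine_override() {
    site="$1"
    if grep -A1 '<name>hive.execution.engine</name>' "$site" \
        | grep -q '<value>tez</value>'; then
        echo "--hiveconf hive.execution.engine=mr"
    fi
}

# Example (the unquoted substitution is split into two arguments on purpose):
#   bin/spark-sql --master yarn-client $(engine_override conf/hive-site.xml)
```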

Modify the Spark configuration

spark-defaults.conf

[hadoop@hadoop-master1 conf]$ cat spark-defaults.conf 
spark.yarn.jar    hdfs:///spark/spark-assembly-1.6.0-hadoop2.6.3-ext-2.1.jar

spark.dynamicAllocation.enabled    true
spark.shuffle.service.enabled      true
spark.dynamicAllocation.executorIdleTimeout    600s
spark.dynamicAllocation.minExecutors    160
spark.dynamicAllocation.maxExecutors    1800
spark.dynamicAllocation.schedulerBacklogTimeout   5s

spark.driver.maxResultSize   0

spark.eventLog.enabled  true
spark.eventLog.compress  true
spark.eventLog.dir    hdfs:///spark-eventlogs
spark.yarn.historyServer.address hadoop-master2:18080

spark.serializer        org.apache.spark.serializer.KryoSerializer
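  • spark.yarn.jar: once this is set, Spark uses that file directly as the executors' main jar, so the Spark jar no longer has to be uploaded on every launch (pushing 180 MB each time is not a trivial cost)
  • enabled: to turn on dynamic allocation, both spark.dynamicAllocation.enabled and spark.shuffle.service.enabled must be set to true
  • executorIdleTimeout: how long an idle executor waits before it is released
  • schedulerBacklogTimeout: how long tasks must stay backlogged before additional executors are requested
  • minExecutors: the minimum number of executors kept alive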
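
To get a feel for how quickly the pool can grow from minExecutors toward maxExecutors: Spark's job-scheduling documentation describes the request policy as exponential (1 executor in the first round, then 2, 4, 8 and so on while tasks remain backlogged). The sketch below is a simplified model of that policy, not Spark's actual scheduler; it ignores the cap imposed by the number of pending tasks, and rounds_to_reach is a hypothetical helper name.

```shell
#!/bin/sh
# Simplified model, not Spark's scheduler: every round requests twice as many
# executors as the previous one (1, 2, 4, ...), never going past max_executors.
# rounds_to_reach is a hypothetical helper name.
rounds_to_reach() {
    current="$1"        # executors alive now (e.g. minExecutors)
    target="$2"         # executors the workload would like to have
    max_executors="$3"  # spark.dynamicAllocation.maxExecutors
    if [ "$target" -gt "$max_executors" ]; then
        target="$max_executors"
    fi
    step=1
    rounds=0
    while [ "$current" -lt "$target" ]; do
        current=$((current + step))
        step=$((step * 2))
        rounds=$((rounds + 1))
    done
    echo "$rounds"
}

# Example with the values above: rounds_to_reach 160 1800 1800
```

Under this model, going from 160 to 1800 executors takes 11 rounds. With schedulerBacklogTimeout at 5s, and assuming the sustained backlog timeout is left at its default (the same value), that is roughly a minute of continuous backlog.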

spark-env.sh environment variables

[hadoop@hadoop-master1 conf]$ cat spark-env.sh 
SPARK_CLASSPATH=/home/hadoop/hive/lib/mysql-connector-java-5.1.21-bin.jar:$SPARK_CLASSPATH
HADOOP_CONF_DIR=/home/hadoop/hadoop/etc/hadoop
SPARK_DRIVER_MEMORY=30g
SPARK_PID_DIR=/home/hadoop/tmp/pids
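
Since dynamic allocation only works when both switches are on, a quick check of spark-defaults.conf before starting anything can save a restart. A minimal sketch, assuming the whitespace-separated "key value" layout used above (the key=value form is not handled); conf_value and dyn_alloc_ready are hypothetical helper names.

```shell
#!/bin/sh
# Sketch only: conf_value and dyn_alloc_ready are hypothetical helpers.
# conf_value FILE KEY prints the value on the last line whose first field is
# KEY. Comment lines never match because their first field starts with "#".
conf_value() {
    awk -v key="$2" '$1 == key { v = $2 } END { print v }' "$1"
}

# Prints "ready" only if both switches required for dynamic allocation are true.
dyn_alloc_ready() {
    conf="$1"
    if [ "$(conf_value "$conf" spark.dynamicAllocation.enabled)" = "true" ] &&
       [ "$(conf_value "$conf" spark.shuffle.service.enabled)" = "true" ]; then
        echo "ready"
    else
        echo "not ready"
    fi
}

# Example: dyn_alloc_ready conf/spark-defaults.conf
```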

Startup

[hadoop@hadoop-master1 spark-1.6.0-bin-2.6.3]$ sbin/start-thriftserver.sh --master yarn-client

[hadoop@hadoop-master2 spark-1.6.0-bin-2.6.3]$ sbin/start-history-server.sh hdfs:///spark-eventlogs

# When the Spark build does not bundle the Hadoop jars, write a small script that adds Hadoop's dependencies to the classpath
[hadoop@hadoop-master2 spark-1.3.1-bin-hadoop2.6.3-without-hive]$ cat start-historyserver.sh 
#!/bin/sh

# Run from the directory that contains this script
bin=$(dirname "$0")
cd "$bin" || exit 1

# POSIX "." instead of "source", because the shebang is /bin/sh
. "$HADOOP_HOME/libexec/hadoop-config.sh"

export SPARK_PID_DIR=/home/hadoop/tmp/pids
export SPARK_CLASSPATH="$(hadoop classpath)"

# http://spark.apache.org/docs/latest/monitoring.html#viewing-after-the-fact
export SPARK_HISTORY_OPTS="-Dspark.history.fs.update.interval=30s -Dspark.history.fs.cleaner.enabled=true -Dspark.history.fs.logDirectory=hdfs:///spark-eventlogs"
sbin/start-history-server.sh 
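
Spark's daemon scripts record each started process in a pid file under SPARK_PID_DIR, so one way to confirm that the Thrift server and the history server actually came up is to test those pids. A minimal sketch; pid_alive is a hypothetical helper, and the glob in the example assumes the pid directory configured above.

```shell
#!/bin/sh
# Sketch only: pid_alive is a hypothetical helper.
# Prints "running" if the PID stored in the given file belongs to a live
# process that the current user may signal, otherwise "stopped".
pid_alive() {
    pidfile="$1"
    if [ -r "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; then
        echo "running"
    else
        echo "stopped"
    fi
}

# Example:
#   for f in /home/hadoop/tmp/pids/*.pid; do echo "$f: $(pid_alive "$f")"; done
```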

Done.

To recap the whole process: add the Spark shuffle service to YARN, configure the Spark parameters, and finally restart the services (YARN / HiveServer).

–END
