Winse Blog

Stop and go, hustle and bustle, always busy, not knowing what there is to fear.

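Building and Using Tez

A first look

A MapReduce job as shipped with Hadoop 2 can hand its intermediate data on only once; in other words, one job can aggregate (reduce) only once, and then has to write its result to disk. The Tez project extends the existing YARN architecture and uses a DAG (directed acyclic graph) to implement an MRR (map-reduce-reduce) job framework.

In the figure above, the MR job on the left has to store its data after finishing one stage and then launch another job to run the second reduce. Tez can run a further reduce directly after a reduce, which cuts the intermediate IO and the MapReduce job startup time.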
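As a loose analogy only (this is not how Tez is implemented), the difference resembles a shell job that stores its result in a file for a second step versus a pipeline that feeds one step directly into the next. The function names below are hypothetical, and Tez still shuffles data between stages; the point is only that no stored intermediate dataset and no second job launch are needed.

```shell
# Analogy only: two chained MR jobs hand data over through a stored file,
# while an MRR-style DAG feeds the first reduce straight into the second.

# Chained style: count words, store the result, then sort it in a second step.
count_then_sort_staged() {
  stage=$(mktemp)
  tr -s '[:space:]' '\n' < "$1" | sort | uniq -c > "$stage"   # "job 1" writes its output
  sort -k1,1n -k2,2 "$stage"                                   # "job 2" reads it back
  rm -f "$stage"
}

# Pipelined style: the same two steps without the intermediate file.
count_then_sort_piped() {
  tr -s '[:space:]' '\n' < "$1" | sort | uniq -c | sort -k1,1n -k2,2
}
```

Both functions take a text file as their only argument and print the same word counts ordered by frequency; only the handover between the two steps differs.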

Environment setup

  • Install/Deploy
  • hadoop-2.2.0(umcc97-44:hdfs, umcc97-79:yarn)
  • Built on Windows with Cygwin

Downloading and building Tez

First download tez-0.4.0-incubating.tar.gz. The build also needs the protoc program (see "Building Hadoop from Source"). After unpacking, build with mvn.

Administrator@winseliu /cygdrive/e/local/libs/big
$ tar zxvf tez-0.4.0-incubating.tar.gz

Administrator@winseliu /cygdrive/e/local/libs/big
$ cd tez-0.4.0-incubating/

Administrator@winseliu /cygdrive/e/local/libs/big/tez-0.4.0-incubating
$ mvn install -DskipTests -Dmaven.javadoc.skip
...
[INFO] Reactor Summary:
[INFO]
[INFO] tez ............................................... SUCCESS [1.518s]
[INFO] tez-api ........................................... SUCCESS [8.890s]
[INFO] tez-common ........................................ SUCCESS [0.725s]
[INFO] tez-runtime-internals ............................. SUCCESS [2.529s]
[INFO] tez-runtime-library ............................... SUCCESS [5.100s]
[INFO] tez-mapreduce ..................................... SUCCESS [3.666s]
[INFO] tez-mapreduce-examples ............................ SUCCESS [2.692s]
[INFO] tez-dag ........................................... SUCCESS [13.943s]
[INFO] tez-tests ......................................... SUCCESS [1.691s]
[INFO] tez-dist .......................................... SUCCESS [14.370s]
[INFO] Tez ............................................... SUCCESS [0.245s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 55.791s
[INFO] Finished at: Tue Jun 17 17:33:45 CST 2014
[INFO] Final Memory: 35M/151M
[INFO] ------------------------------------------------------------------------
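Since the build relies on protoc, it can help to check its version before running mvn. The sketch below assumes that `protoc --version` prints a line such as `libprotoc 2.5.0`, and that 2.5.0 is the wanted version (inferred from the protobuf-java-2.5.0.jar that appears in the jar listing later in this post, not stated by the build itself). The function name is hypothetical.

```shell
# Succeed if a `protoc --version` line (e.g. "libprotoc 2.5.0") reports the wanted version.
# $1: the version line, $2: the expected version string.
protoc_version_is() {
  [ "${1#libprotoc }" = "$2" ]
}

# Usage (requires protoc on PATH):
#   protoc_version_is "$(protoc --version)" 2.5.0 || echo "unexpected protoc version" >&2
```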

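Uploading the Tez jars to HDFS

To keep things simple, I uploaded the Tez jars straight to the development cluster for testing. A local cluster should work much the same way.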

Administrator@winseliu /cygdrive/e/local/libs/big/tez-0.4.0-incubating
$ cd tez-dist/

Administrator@winseliu /cygdrive/e/local/libs/big/tez-0.4.0-incubating/tez-dist
$ cd target/

Administrator@winseliu /cygdrive/e/local/libs/big/tez-0.4.0-incubating/tez-dist/target
$ export HADOOP_USER_NAME=hadoop

Administrator@winseliu /cygdrive/e/local/libs/big/tez-0.4.0-incubating/tez-dist/target
$ hadoop dfs -put tez-0.4.0-incubating/tez-0.4.0-incubating/ hdfs://umcc97-44:9000/apps/ 

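Configuring the cluster

First look at the cluster's existing classpath. It already includes the etc/hadoop directory, so I put tez-site.xml directly there. I also copied the Tez jars (tez-lib) to share/hadoop/tez and added that directory to the HADOOP_CLASSPATH environment variable.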

[hadoop@umcc97-79 hadoop]$ hadoop classpath
/home/hadoop/hadoop-2.2.0/etc/hadoop:/home/hadoop/hadoop-2.2.0/share/hadoop/common/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/common/*:/home/hadoop/hadoop-2.2.0/share/hadoop/hdfs:/home/hadoop/hadoop-2.2.0/share/hadoop/hdfs/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/hdfs/*:/home/hadoop/hadoop-2.2.0/share/hadoop/yarn/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/yarn/*:/home/hadoop/hadoop-2.2.0/share/hadoop/mapreduce/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/mapreduce/*:/home/hadoop/hadoop-2.2.0/contrib/capacity-scheduler/*.jar

# used for map/reduce
[hadoop@umcc97-79 hadoop]$ cat tez-site.xml 
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
  <name>tez.lib.uris</name>
  <value>${fs.default.name}/apps/tez-0.4.0-incubating,${fs.default.name}/apps/tez-0.4.0-incubating/lib/</value>
</property>
</configuration>

[hadoop@umcc97-79 hadoop]$ cd ~/hadoop-2.2.0/share/hadoop/tez/
[hadoop@umcc97-79 tez]$ ll
total 9616
-rw-r--r-- 1 hadoop hadoop  303139 Jun 17 17:33 avro-1.7.4.jar
-rw-r--r-- 1 hadoop hadoop   41123 Jun 17 17:33 commons-cli-1.2.jar
-rw-r--r-- 1 hadoop hadoop  610259 Jun 17 17:33 commons-collections4-4.0.jar
-rw-r--r-- 1 hadoop hadoop 1648200 Jun 17 17:33 guava-11.0.2.jar
-rw-r--r-- 1 hadoop hadoop  710492 Jun 17 17:33 guice-3.0.jar
-rw-r--r-- 1 hadoop hadoop  656365 Jun 17 17:33 hadoop-mapreduce-client-common-2.2.0.jar
-rw-r--r-- 1 hadoop hadoop 1455001 Jun 17 17:33 hadoop-mapreduce-client-core-2.2.0.jar
-rw-r--r-- 1 hadoop hadoop   21537 Jun 17 17:33 hadoop-mapreduce-client-shuffle-2.2.0.jar
-rw-r--r-- 1 hadoop hadoop   81743 Jun 17 17:33 jettison-1.3.4.jar
-rw-r--r-- 1 hadoop hadoop  533455 Jun 17 17:33 protobuf-java-2.5.0.jar
-rw-r--r-- 1 hadoop hadoop  995968 Jun 17 17:33 snappy-java-1.0.4.1.jar
-rw-r--r-- 1 hadoop hadoop  749917 Jun 17 17:33 tez-api-0.4.0-incubating.jar
-rw-r--r-- 1 hadoop hadoop   34049 Jun 17 17:33 tez-common-0.4.0-incubating.jar
-rw-r--r-- 1 hadoop hadoop  970987 Jun 17 17:33 tez-dag-0.4.0-incubating.jar
-rw-r--r-- 1 hadoop hadoop  246409 Jun 17 17:33 tez-mapreduce-0.4.0-incubating.jar
-rw-r--r-- 1 hadoop hadoop  199934 Jun 17 17:33 tez-mapreduce-examples-0.4.0-incubating.jar
-rw-r--r-- 1 hadoop hadoop  114692 Jun 17 17:33 tez-runtime-internals-0.4.0-incubating.jar
-rw-r--r-- 1 hadoop hadoop  352177 Jun 17 17:33 tez-runtime-library-0.4.0-incubating.jar
-rw-r--r-- 1 hadoop hadoop    6845 Jun 17 17:33 tez-tests-0.4.0-incubating.jar

# MR configuration, used when the client submits jobs
[hadoop@umcc97-79 hadoop]$ grep HADOOP_CLASSPATH hadoop-env.sh
export HADOOP_CLASSPATH=${HADOOP_HOME}/share/hadoop/tez/*:${HADOOP_HOME}/share/hadoop/tez/lib/*:$HADOOP_CLASSPATH

[hadoop@umcc97-79 hadoop]$ sed -n 19,23p mapred-site.xml
<configuration>
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn-tez</value>
</property>

Syncing and restarting YARN

for h in `cat hadoop-2.2.0/etc/hadoop/slaves ` ; do 
  rsync -vaz --exclude=logs --exclude=pid --exclude=tmp  hadoop-2.2.0 $h:~/ ; 
done

# sync to the secondary namenode
rsync -vaz --exclude=logs --exclude=pid --exclude=tmp  hadoop-2.2.0 umcc97-44:~/

Testing

[hadoop@umcc97-79 ~]$ hadoop classpath
/home/hadoop/hadoop-2.2.0/etc/hadoop:/home/hadoop/hadoop-2.2.0/share/hadoop/common/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/common/*:/home/hadoop/hadoop-2.2.0/share/hadoop/hdfs:/home/hadoop/hadoop-2.2.0/share/hadoop/hdfs/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/hdfs/*:/home/hadoop/hadoop-2.2.0/share/hadoop/yarn/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/yarn/*:/home/hadoop/hadoop-2.2.0/share/hadoop/mapreduce/lib/*:/home/hadoop/hadoop-2.2.0/share/hadoop/mapreduce/*:/home/hadoop/hadoop-2.2.0/share/hadoop/tez/*:/home/hadoop/hadoop-2.2.0/share/hadoop/tez/lib/*:/home/hadoop/hadoop-2.2.0/contrib/capacity-scheduler/*.jar

[hadoop@umcc97-79 ~]$ cd hadoop-2.2.0/share/hadoop/mapreduce/
[hadoop@umcc97-79 mapreduce]$ hadoop jar hadoop-mapreduce-client-jobclient-2.2.0-tests.jar sleep -mt 1 -rt 1 -m 1 -r 1

cd hadoop-2.2.0/share/hadoop/tez/

hadoop fs -put ~/hadoop-2.2.0/logs/yarn-hadoop-resourcemanager-umcc97-79.* /hello/in
hadoop fs -rmr /hello/out
hadoop jar tez-mapreduce-examples-0.4.0-incubating.jar orderedwordcount  /hello/in /hello/out
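The first command in the test above confirms by eye that the Tez directories made it onto the classpath. A minimal sketch of the same check for use in a script follows; the function name is hypothetical, and it only inspects the string that `hadoop classpath` prints.

```shell
# Succeed if a colon-separated classpath string contains a .../share/hadoop/tez/* entry.
# The pattern is quoted, so the asterisk is matched literally.
classpath_has_tez() {
  case ":$1:" in
    *"/share/hadoop/tez/*:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage (requires hadoop on PATH):
#   classpath_has_tez "$(hadoop classpath)" || echo "Tez jars are not on the classpath" >&2
```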

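Rolling back: set the environment variable temporarily when needed

After switching to Tez, hive-0.12.0 stopped working. Other colleagues need Hive, so I had to revert all of the configuration. [To upgrade Hive, see "Using Tez with hive-0.13".]

Leave the configuration file set to yarn. To use Tez, just pass the configuration parameter when submitting the job.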

export HADOOP_CLASSPATH=${HADOOP_HOME}/share/hadoop/tez/*:${HADOOP_HOME}/share/hadoop/tez/lib/*:$HADOOP_CLASSPATH
hadoop jar hadoop-2.2.0/share/hadoop/tez/tez-mapreduce-examples-0.4.0-incubating.jar orderedwordcount \
  -Dmapreduce.framework.name=yarn-tez  /hello/in /hello/out
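The per-job override above can be wrapped so that the classpath change never leaks into the calling shell. This is a sketch under assumptions: `with_tez` is a hypothetical helper name, and HADOOP_HOME is assumed to point at the hadoop-2.2.0 directory used throughout this post.

```shell
# Run one command with the Tez jars prepended to HADOOP_CLASSPATH.
# The function body is a subshell, so the export is gone once the command returns.
with_tez() (
  tez_home="${HADOOP_HOME:?HADOOP_HOME must be set}/share/hadoop/tez"
  export HADOOP_CLASSPATH="${tez_home}/*:${tez_home}/lib/*${HADOOP_CLASSPATH:+:$HADOOP_CLASSPATH}"
  "$@"
)

# Usage (requires a working cluster):
#   with_tez hadoop jar "$HADOOP_HOME/share/hadoop/tez/tez-mapreduce-examples-0.4.0-incubating.jar" \
#     orderedwordcount -Dmapreduce.framework.name=yarn-tez /hello/in /hello/out
```

Jobs started without the wrapper keep using the plain yarn setting, which is what Hive needs here.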

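org.apache.tez.mapreduce.examples.OrderedWordCount not only computes the word counts but also sorts the result by count.

Open question: I have not yet worked out how job history is handled for Tez jobs. Starting the historyserver seems to have no effect?

Version 0.6 already has a UI.

Ongoing updates

I had planned to build tez-0.6 and drop it straight into hive-0.13, but hit a wall: hive-0.13 does not support it!

Before building Tez to integrate it with Hive, download the Hive source first and check which Tez version its pom.xml actually uses. Only then build Tez.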

apache-hive-1.1.0-src.tar.gz/pom.xml
    <tez.version>0.5.2</tez.version>
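A minimal sketch of that check, assuming the pom.xml declares the version on a single line as in the excerpt above. The function name is hypothetical, and the path in the usage line is a placeholder for wherever the Hive source was unpacked.

```shell
# Print the Tez version declared in a Hive pom.xml.
# Assumes the property sits on one line, e.g. <tez.version>0.5.2</tez.version>.
tez_version_from_pom() {
  sed -n 's:.*<tez\.version>\(.*\)</tez\.version>.*:\1:p' "$1" | head -n 1
}

# Usage:
#   tez_version_from_pom path/to/hive-src/pom.xml
```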

Building tez-0.6 against hadoop-2.2:

E:\local\opt\bigdata\apache-tez-0.6.0-src>mvn  package -Dhadoop.version=2.2.0 -DskipTests -Dmaven.javadoc.skip=true -DskipATS

vi tez-dist/pom.xml
<profile>
      <id>hadoop26</id>
      <activation>
        <activeByDefault>false</activeByDefault>
      </activation>

–END