Installation and Deployment (8): Installing, Deploying, and Using Hive + Sqoop

Hive + Sqoop Installation

hadoop 2.7.2

spark 2.0.0

zookeeper 3.4.8

kafka 0.10.0.0

hbase 1.2.2

jdk1.8.0_101

ubuntu 14.04.4 x64



References:
http://blog.csdn.net/yinedent/article/details/48275407
http://blog.csdn.net/suijiarui/article/details/51137316


I. Hive 2.1.0
1. Download
https://mirrors.tuna.tsinghua.edu.cn/apache/hive/stable-2/
https://mirrors.tuna.tsinghua.edu.cn/apache/hive/stable-2/apache-hive-2.1.0-bin.tar.gz


2. Extract
root@py-server:/server# tar xvzf apache-hive-2.1.0-bin.tar.gz 
root@py-server:/server# mv apache-hive-2.1.0-bin/ hive


3. Environment variables
vi ~/.bashrc
export HIVE_HOME=/server/hive
export PATH=$PATH:$HIVE_HOME/bin
source ~/.bashrc
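To confirm the variables took effect in the current shell:

echo $HIVE_HOME
which hive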


4. Configuration
4.1 Copy the template config files
root@py-server:/server/hive/conf# cp hive-exec-log4j2.properties.template hive-exec-log4j2.properties
root@py-server:/server/hive/conf# cp hive-log4j2.properties.template hive-log4j2.properties
root@py-server:/server/hive/conf# cp hive-env.sh.template hive-env.sh
root@py-server:/server/hive/conf# cp hive-default.xml.template hive-site.xml
root@py-server:/server/hive/conf# ll
total 504
drwxr-xr-x 2 root root    4096 Aug 12 15:03 ./
drwxr-xr-x 9 root root    4096 Aug 12 14:40 ../
-rw-r--r-- 1 root staff   1596 Jun  3 18:43 beeline-log4j2.properties.template
-rw-r--r-- 1 root staff 225729 Jun 17 08:03 hive-default.xml.template
-rw-r--r-- 1 root root    2378 Aug 12 15:03 hive-env.sh
-rw-r--r-- 1 root staff   2378 Jun  3 18:43 hive-env.sh.template
-rw-r--r-- 1 root root    2299 Aug 12 15:02 hive-exec-log4j2.properties
-rw-r--r-- 1 root staff   2299 Jun  3 18:43 hive-exec-log4j2.properties.template
-rw-r--r-- 1 root root    2950 Aug 12 15:02 hive-log4j2.properties
-rw-r--r-- 1 root staff   2950 Jun  3 18:43 hive-log4j2.properties.template
-rw-r--r-- 1 root root  225729 Aug 12 15:03 hive-site.xml
-rw-r--r-- 1 root staff   2049 Jun 10 17:00 ivysettings.xml
-rw-r--r-- 1 root staff   2768 Jun  3 18:43 llap-cli-log4j2.properties.template
-rw-r--r-- 1 root staff   4241 Jun  3 18:43 llap-daemon-log4j2.properties.template
-rw-r--r-- 1 root staff   2662 Jun  9 02:47 parquet-logging.properties


4.2 Edit hive-env.sh
# HADOOP_HOME=${bin}/../../hadoop
HADOOP_HOME=/server/hadoop
# Hive Configuration Directory can be controlled by:
# export HIVE_CONF_DIR=
HIVE_CONF_DIR=/server/hive/conf


4.3.1 Edit hive-site.xml
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>fm1106</value>
    <description>password to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://py-server:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=utf8&amp;useSSL=true</value>
    <description>
      JDBC connect string for a JDBC metastore.
      To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
      For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
    </description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description>Username to use against metastore database</description>
  </property>
  <property>
    <name>hive.zookeeper.quorum</name>
    <value>py-server,py-11,py-12,py-13,py-14</value>
    <description>
      List of ZooKeeper servers to talk to. This is needed for: 
      1. Read/write locks - when hive.lock.manager is set to 
      org.apache.hadoop.hive.ql.lockmgr.zookeeper.ZooKeeperHiveLockManager, 
      2. When HiveServer2 supports service discovery via Zookeeper.
      3. For delegation token storage if zookeeper store is used, if
      hive.cluster.delegation.token.store.zookeeper.connectString is not set
      4. LLAP daemon registry service
    </description>
  </property>
【The following can be left unchanged:  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://py-server:9083</value>
    <description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
  </property>
  】
  <property>
    <name>hive.exec.scratchdir</name>
    <value>/server/tmp/hive</value>
    <description>HDFS root scratch dir for Hive jobs which gets created with write all (733) permission. For each connecting user, an HDFS scratch dir: ${hive.exec.scratchdir}/&lt;username&gt; is created, with ${hive.scratch.dir.permission}.</description>
  </property>
  <property>
    <name>hive.exec.local.scratchdir</name>
    <value>/server/tmp/hive</value>
    <description>Local scratch space for Hive jobs</description>
  </property>
  <property>
    <name>hive.downloaded.resources.dir</name>
    <value>/server/tmp/hive</value>
    <description>Temporary local directory for added resources in the remote file system.</description>
  </property>
Note: Thrift is used for remote access to the metastore. If you have not set up a remote (Thrift) metastore service, do not change this property, or you will get errors.
Also set useSSL=true in the connection URL, otherwise you will get the warning: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.

Note:
If you get an SSL error, modify hive-site.xml as follows:
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://py-server:3306/hive?useSSL=false</value>
    <description>
      JDBC connect string for a JDBC metastore.
      To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
      For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
    </description>
  </property>
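Since hive-site.xml is XML, a literal & in the JDBC URL must be escaped as &amp;. A quick way to catch such mistakes (assuming xmllint from libxml2 is installed):

xmllint --noout $HIVE_HOME/conf/hive-site.xml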


4.3.1.2 Create the scratch directories and grant permissions:
mkdir -p /server/tmp/hive
chmod -R 775 /server/tmp/hive
hadoop fs -mkdir -p /tmp
hadoop fs -mkdir -p /user/hive/warehouse
hadoop fs -chmod g+w /tmp
hadoop fs -chmod g+w /user/hive/warehouse
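To verify the directories and permissions on HDFS, something like:

hadoop fs -ls -d /tmp /user/hive/warehouse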
References:
https://chu888chu888.gitbooks.io/hadoopstudy/content/Content/8/chapter0807.html
http://www.aboutyun.com/thread-10937-1-1.html
http://blog.csdn.net/suijiarui/article/details/51137316


4.3.2 Add the MySQL JDBC jar
If you use MySQL as the metastore, put the MySQL JDBC connector jar into ${HIVE_HOME}/lib:
cp ${JAVA_HOME}/lib/mysql-connector-java-5.1.39-bin.jar $HIVE_HOME/lib


4.3.3 Grant privileges
MySQL must allow remote login; if your MySQL and Hive are on the same server, you also need to grant local login:
root@py-server:/server# mysql -uroot -p
mysql> GRANT ALL PRIVILEGES ON *.* TO 'root'@'localhost' IDENTIFIED BY 'fm1106' WITH GRANT OPTION;
Query OK, 0 rows affected, 1 warning (0.36 sec)


mysql> GRANT ALL PRIVILEGES ON *.* TO 'root'@'py-server' IDENTIFIED BY 'fm1106' WITH GRANT OPTION;
Query OK, 0 rows affected, 1 warning (0.00 sec)


mysql> GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY 'fm1106' WITH GRANT OPTION;
Query OK, 0 rows affected, 1 warning (0.00 sec)


mysql> flush privileges;
Query OK, 0 rows affected (0.25 sec)
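To confirm the grants work, a quick remote-login test from any node that can reach py-server:

mysql -h py-server -uroot -p -e 'show databases;'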


4.4 Create directories
If you changed the log directory, create that as well, e.g. mkdir logs.
root@py-server:/server# hadoop fs -mkdir /user/hive
root@py-server:/server# hadoop fs -mkdir /user/hive/warehouse


4.5 Replace the ZooKeeper jar
root@py-server:/server# cp /server/zookeeper/zookeeper-3.4.6.jar $HBASE_HOME/lib
root@py-server:/server# cp /server/zookeeper/zookeeper-3.4.6.jar $HIVE_HOME/lib
If one is already there, there is no need to copy.
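To check whether a ZooKeeper jar is already present:

ls $HIVE_HOME/lib/zookeeper*.jar $HBASE_HOME/lib/zookeeper*.jar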


4.6 Create the metastore database
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| gfdata             |
| hive               |
| mysql              |
| performance_schema |
| stockdata          |
| sys                |
+--------------------+
7 rows in set (0.05 sec)
If the hive database already exists, there is no need to run create database hive;


4.7 Start Hive
4.7.1
Initialize the metastore schema before first use:
root@py-server:/server/tmp# schematool -initSchema -dbType mysql
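If initialization succeeded, schematool can report the schema version as a sanity check:

root@py-server:/server/tmp# schematool -dbType mysql -info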
4.7.2
Start the Hive metastore service first, in the background:
root@py-server:/server/tmp# hive --service metastore&
[1] 10609
root@py-server:/server/tmp# Starting Hive Metastore Server
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/server/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/server/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
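To confirm the metastore is up, check for the RunJar process and a listener on port 9083 (the port from hive.metastore.uris above):

jps
netstat -nltp | grep 9083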


4.7.3 Start the Hive CLI
root@py-server:/server/tmp# hive


4.7.4 Queries
hive> show tables;
OK
Time taken: 1.219 seconds
hive> show databases;
OK
default
Time taken: 0.013 seconds, Fetched: 1 row(s)
hive> 
4.7.5 Create a database
hive>  create database testdb;
OK
Time taken: 0.303 seconds
hive>  show databases;
OK
default
testdb
Time taken: 0.011 seconds, Fetched: 2 row(s)
4.7.6 Create a table
Reference: http://blog.itpub.net/26143577/viewspace-720092/
hive> create table test_hive2 (id int,id2 int,name string)  row format delimited fields terminated by '\t';
OK
Time taken: 0.601 seconds
4.7.7 Load a text file
References:
http://blog.csdn.net/dst1213/article/details/51419072
http://blog.csdn.net/yinedent/article/details/48275407
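A minimal sketch of the load itself, assuming a hypothetical tab-delimited local file /server/tmp/test_hive2.txt matching the test_hive2 table from 4.7.6:

hive> -- /server/tmp/test_hive2.txt is a hypothetical sample file
hive> load data local inpath '/server/tmp/test_hive2.txt' into table test_hive2;
hive> select * from test_hive2;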




II. Sqoop
Note that 1.99.7 is not compatible with 1.4.6 and is not feature complete; it is not intended for production deployment.
For production, follow that recommendation and use 1.4.6; if you want to install 1.99.7, see my earlier post.
http://apache.fayea.com/sqoop/1.4.6/
http://apache.fayea.com/sqoop/1.4.6/sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz


1. Extract
root@py-server:/server# tar xvzf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz 
root@py-server:/server# mv sqoop-1.4.6.bin__hadoop-2.0.4-alpha/ sqoop


2. Environment variables
vi ~/.bashrc
export SQOOP_HOME=/server/sqoop
export PATH=$PATH:$SQOOP_HOME/bin
source ~/.bashrc
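A quick check that Sqoop is now on the PATH (the HCatalog/Accumulo warnings shown in the test output below are expected and harmless):

sqoop version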


3. Configuration
cd $SQOOP_HOME/conf
cp sqoop-env-template.sh sqoop-env.sh
If the following components are installed, set the corresponding variables:
#Set path to where bin/hadoop is available
export HADOOP_COMMON_HOME=/server/hadoop


#Set path to where hadoop-*-core.jar is available
export HADOOP_MAPRED_HOME=/server/hadoop


#set the path to where bin/hbase is available
export HBASE_HOME=/server/hbase


#Set the path to where bin/hive is available
export HIVE_HOME=/server/hive


#Set the path for where zookeper config dir is
export ZOOCFGDIR=/server/zookeeper/conf


4. MySQL JDBC jar
cp ${JAVA_HOME}/lib/mysql-connector-java-5.1.39-bin.jar $SQOOP_HOME/lib


5. Environment variables
vi ~/.bashrc
export CLASSPATH=$CLASSPATH:$SQOOP_HOME/lib
source ~/.bashrc


6. Test
6.1 List databases
root@py-server:/server/zookeeper/conf# sqoop list-databases --connect jdbc:mysql://py-server:3306/?useSSL=false --username root -P
Output:
root@py-server:/server/zookeeper/conf# sqoop list-databases --connect jdbc:mysql://py-server:3306/?useSSL=false --username root -P
Warning: /server/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /server/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
16/08/12 17:50:15 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
Enter password: 
16/08/12 17:50:20 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
information_schema
gfdata
hive
mysql
performance_schema
stockdata
sys
root@py-server:/server/zookeeper/conf# 
Adding useSSL=false suppresses the pile of SSL warnings; it can also be omitted:
sqoop list-databases --connect jdbc:mysql://py-server:3306/ --username root -P
7. Usage
7.1 List tables
sqoop list-tables --connect jdbc:mysql://py-server:3306/gfdata?useSSL=false --username root --password abc
Output:
root@py-server:/server/zookeeper/conf# sqoop list-tables --connect jdbc:mysql://py-server:3306/gfdata?useSSL=false --username root --password fm1106
Warning: /server/sqoop/../hcatalog does not exist! HCatalog jobs will fail.
Please set $HCAT_HOME to the root of your HCatalog installation.
Warning: /server/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
16/08/12 17:54:09 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6
16/08/12 17:54:09 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
16/08/12 17:54:09 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
000010
000011
000030
000059
000065
000420


7.2 Copy a MySQL table's structure into Hive (e.g., MySQL test.t1 into Hive as test.mysql_t1; the command below does this for gfdata.000010):
sqoop create-hive-table --connect jdbc:mysql://py-server:3306/gfdata?useSSL=false --table 000010 --username root --password 123456 --hive-table gfdata.mysql_000010
hive> show tables;
OK
mysql_000010
mysql_603608
Time taken: 0.015 seconds, Fetched: 1 row(s)


Note: this command can be executed repeatedly without error.


7.3 Import MySQL table data into Hive
# Create the database:
hive> create database gfdata;
OK
Time taken: 0.861 seconds
hive> show databases;
OK
default
gfdata
testdb
Time taken: 0.132 seconds, Fetched: 3 row(s)


# Append data
Get the table structure via step 7.2, or create the table yourself: hive> create table mysql_603608 ...;


root@py-server:/server/zookeeper/conf# sqoop import --connect jdbc:mysql://py-server:3306/gfdata?useSSL=false --username root --password fm1106 --table 603608 --hive-import --hive-table gfdata.mysql_603608
Output:
16/08/12 21:45:28 INFO hive.HiveImport: Export directory is contains the _SUCCESS file only, removing the directory.
hive> select * from mysql_603608;
OK
2016-02-18 11.51 13.86 11.51 13.86 58617819801
2016-02-19 15.27 15.27 15.27 15.27 42024652212
Note: Hive does not print column names by default.
Reference: http://blog.csdn.net/qiaochao911/article/details/9035225
Solution: set hive.cli.print.header=true; (see the appendix at the end of this post)


# Overwrite data
hive>create database test;
hive>use test;
hive>create table test ...
The above can be replaced by step 7.2.
sqoop import --connect jdbc:mysql://py-server:3306/test?useSSL=false --username root --password 123456 --table t1 --hive-import --hive-overwrite --hive-table test.mysql_t1
Note: if the MySQL table has no primary key, you must add the --autoreset-to-one-mapper option; see the sketch below.
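A sketch of such an import, where t_nopk is a hypothetical table without a primary key:

# t_nopk is hypothetical; --autoreset-to-one-mapper falls back to one mapper when there is no split column
sqoop import --connect jdbc:mysql://py-server:3306/test?useSSL=false --username root -P --table t_nopk --autoreset-to-one-mapper --hive-import --hive-table test.mysql_t_nopk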


7.4 Export Hive table data into MySQL
Same as exporting HDFS data to MySQL; note that --table must be an empty table, created in MySQL beforehand.
step 1
mysql> create database gftest;
Query OK, 1 row affected (0.00 sec)


step 2
mysql> create table s000010 (date DATE PRIMARY KEY NOT NULL, open Double, high Double, low Double, close Double, volume Bigint, amount Bigint);
Query OK, 0 rows affected (0.19 sec)
mysql> show tables;
+------------------+
| Tables_in_gftest |
+------------------+
| s000010          |
+------------------+
1 row in set (0.00 sec)




step 3
root@py-server:/server/zookeeper/conf# hadoop fs -ls /user/hive/warehouse
Found 2 items
drwxrwxr-x   - root supergroup          0 2016-08-12 21:08 /user/hive/warehouse/gfdata.db
drwxrwxr-x   - root supergroup          0 2016-08-12 17:26 /user/hive/warehouse/test_hive2
root@py-14:~# hadoop fs -ls /user/hive/warehouse/gfdata.db
Found 1 items
drwxrwxr-x   - root supergroup          0 2016-08-12 21:08 /user/hive/warehouse/gfdata.db/mysql_000010


root@py-server:/server/zookeeper/conf# sqoop export --connect jdbc:mysql://py-server:3306/hive?useSSL=false --username root --password s123456 --table s603608 --export-dir /user/hive/warehouse/gfdata.db/mysql_603608  --fields-terminated-by '\001'
【Note】: you must add --fields-terminated-by '\001', otherwise you will get Error: java.io.IOException: Can't export data, please check failed map task logs
15/12/02 02:01:13 INFO mapreduce.ExportJobBase: Exported 0 records.
15/12/02 02:01:13 ERROR tool.ExportTool: Error during export: Export job failed!
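To check which delimiter the Hive data files actually use before exporting, dump a few raw bytes; Hive's default separator \001 shows up as 001 in the od output:

hadoop fs -cat /user/hive/warehouse/gfdata.db/mysql_603608/* | head -c 200 | od -c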




References:
http://blog.csdn.net/dst1213/article/details/51419072
http://blog.csdn.net/yinedent/article/details/48275407
http://blog.csdn.net/wzy0623/article/details/50921702






#################################################
By default, Hive queries do not display column names.
Reference: http://blog.csdn.net/qiaochao911/article/details/9035225
When a table has many columns, it is often hard to tell which value belongs to which column, which is inconvenient for day-to-day debugging and problem-finding. At colleagues' request, the author of the referenced post read the Hive CLI source and made some small changes, adding column-header printing and a row-to-column (vertical) display feature.


Before enabling the row-to-column display:


hive>
>
> select * from example_table where dt='2012-03-31-02' limit 2;
OK
1333133185 0cf49387a23d9cec25da3d76d6988546 3CD5E9A1721861AE6688260ED26206C2 guanwang 1.1 3d3b0a5eca816ba47fc270967953f881 192.168.1.2.13331317500.0 NA 0 31/Mar/2012:02:46:44 +080 222.71.121.111 2012-03-31-02
1333133632 0cf49387a23d9cec25da3d76d6988546 3CD5E9A1721861AE6688260ED26206C2 10002 1.1 e4eec776b973366be21518b709486f3c 110.6.100.57.1332909301867.6 NA 0 31/Mar/2012:02:54:16 +080 110.6.74.219 2012-03-31-02
Time taken: 0.62 seconds
After enabling the row-to-column display:


set hive.cli.print.header=true; // print column headers
set hive.cli.print.row.to.vertical=true; // enable row-to-column display; requires header printing to be enabled
set hive.cli.print.row.to.vertical.num=1; // number of columns displayed per line
> select * from example_table where pt='2012-03-31-02' limit 2;
OK
datetime col_1 col_2 channel version pcs cookie trac new time ip
datetime=1333133185
col_1=0cf49387a23d9cec25da3d76d6988546
col_2=3CD5E9A1721861AE6688260ED26206C2
channel=test_name1
version=1.1
pcs=3d3b0a5eca816ba47fc270967953f881
cookie=192.168.1.2.13331317500.0
trac=NA
new=0
time=31/Mar/2012:02:46:44 +080
ip=222.71.121.111
-------------------------Gorgeous-split-line-----------------------
datetime=1333133632
col_1=0cf49387a23d9cec25da3d76d6988546
col_2=3CD5E9A1721861AE6688260ED26206C2
channel=test_name2
version=1.1
pcs=e4eec776b973366be21518b709486f3c
cookie=110.6.100.57.1332909301867.6
trac=NA
new=0
time=31/Mar/2012:02:54:16 +080
ip=110.6.74.219
--------------------------Gorgeous-split-line-----------------------
Time taken: 0.799 seconds
With the row-to-column display enabled, each row is printed one column per line, with the column name in front of each value, which makes troubleshooting much easier.