Home | 簡體中文 | 繁體中文 | 雜文 | 打賞(Donations) | ITEYE 博客 | OSChina 博客 | Facebook | Linkedin | 知乎專欄 | Search | Email

第 20 章 Apache Hive

目錄

20.1. 安裝 Apache Hive
20.1.1. MySQL
20.1.2. Hadoop
20.1.3. Hive
20.1.4. 啟動 Hive
20.1.5. 訪問 Hive
20.1.6. 配置 hiveserver2
20.2. beeline
20.3. 管理 Hive
20.3.1. 表管理
20.3.1.1. 創建表
20.3.1.2. 顯示表
20.3.1.3. 刪除表
20.3.1.4. 查看表結構
20.3.1.5. 為表增加欄位
20.3.1.6. 修改表名稱
20.3.1.7. 使用已有表結構創建新表
20.3.2. 分區表
20.3.2.1. 創建分區表
20.3.2.2. 顯示分區情況
20.3.2.3. 增加分區
20.3.2.4. 向分區表導入數據
20.3.3. 視圖管理
20.3.3.1. 創建視圖
20.3.3.2. 刪除視圖
20.3.4. 數據管理
20.3.4.1. 從文本檔案導入數據
20.3.4.2. 從其他表查詢數據並創建新表
20.3.4.3. 從其他表查詢數據然後插入指定表中
20.3.4.4. 從現有表中查詢數據然後插入到新的分區表中
20.3.5. HDFS與本地檔案系統管理
20.3.5.1. HDFS 目錄遷移
20.3.5.2. 導出表數據到本地檔案
20.3.5.3. 導出到HDFS
20.3.5.4.
20.4. HiveQL - Hive查詢語言
20.4.1. JOIN 連接查詢
20.4.2. 子查詢
20.5. FAQ
20.5.1. adoop.security.authorize.AuthorizationException): User: hadoop is not allowed to impersonate anonymous

Hive是基于Hadoop構建的一套數據倉庫分析系統,它提供了豐富的SQL查詢方式來分析存儲在Hadoop 分散式檔案系統中的數據。其在Hadoop的架構體系中承擔了一個SQL解析的過程,它提供了對外的入口來獲取用戶的指令然後對指令進行分析,解析出一個MapReduce程序組成可執行計劃,並按照該計劃生成對應的MapReduce任務提交給Hadoop集群處理,獲取最終的結果。

20.1. 安裝 Apache Hive

安裝 Apache Hive 需要 Hadoop和MySQL,這裡假設你已經懂得如何安裝Hadoop和MySQL,所以一下將採用Netkiller OSCM一件安裝腳本來初始化Hadoop和MySQL,如果需要詳細的安裝步驟請參考筆者的相關文章。

20.1.1. MySQL

預設情況下, Hive 使用內嵌的 Derby 資料庫保存元數據, 通常生產環境會使用 MySQL 來存放 Hive 元數據。

使用下面腳本一鍵安裝MySQL 5.7 安裝後會顯示mysql的初始密碼,是所有初始密碼登陸後修改為你的需要密碼

			
curl -s https://raw.githubusercontent.com/oscm/shell/master/database/mysql/5.7/mysql.server.sh | bash

2016-02-16T08:22:58.253030Z 1 [Note] A temporary password is generated for root@localhost: sd%%my.Ak7Ma
			
			

安裝 MySQL JDBC 連接庫。

			
curl -s https://raw.githubusercontent.com/oscm/shell/master/database/mysql/5.7/mysql-connector-java.sh | bash
			
			

創建一個 hive 資料庫用來存儲 Hive 元數據,且資料庫訪問的用戶名和密碼都為 hive。

			
mysql> CREATE DATABASE hive; 
Query OK, 1 row affected (0.03 sec)
			
			

創建用戶hive並授權訪問hive資料庫

			
mysql> CREATE USER 'hive'@'localhost' IDENTIFIED BY 'hive';
Query OK, 0 rows affected (0.04 sec)

mysql> GRANT ALL ON hive.* TO 'hive'@'localhost' IDENTIFIED BY 'hive';
Query OK, 0 rows affected (0.01 sec)

mysql> GRANT ALL ON hive.* TO 'hive'@'%' IDENTIFIED BY 'hive';
Query OK, 0 rows affected (0.00 sec)

mysql> FLUSH PRIVILEGES;
Query OK, 0 rows affected (0.00 sec)

mysql> quit;
Bye
			
			

20.1.2. Hadoop

安裝 Hadoop 採用單機模式

			
curl -s https://raw.githubusercontent.com/oscm/shell/master/distributed/hadoop/hadoop-2.8.0.sh | bash
curl -s https://raw.githubusercontent.com/oscm/shell/master/distributed/hadoop/single.sh | bash
curl -s https://raw.githubusercontent.com/oscm/shell/master/distributed/hadoop/startup.sh | bash 
			
			

20.1.3. Hive

可以從 Apache 鏡像站點中下載最新穩定版的 apache-hive-2.1.1-bin.tar.gz

			
cd /usr/local/src
wget http://mirrors.hust.edu.cn/apache/hive/stable-2/apache-hive-2.1.1-bin.tar.gz

tar zxf apache-hive-2.1.1-bin.tar.gz
mv apache-hive-2.1.1-bin /srv/apache-hive-2.1.1
ln -s /srv/apache-hive-2.1.1/ /srv/apache-hive
chown hadoop:hadoop -R /srv/apache-hive-2.1.1
			
			

			
cat > /srv/apache-hive/conf/hive-env.sh <<'EOF'
export JAVA_HOME=/srv/java
export HADOOP_HOME=/srv/apache-hadoop
export HBASE_HOME=/srv/apache-hbase
export HIVE_HOME=/srv/apache-hive
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin
EOF

cat >> ~/.bash_profile <<'EOF'
export JAVA_HOME=/srv/java
export HADOOP_HOME=/srv/apache-hadoop
export HBASE_HOME=/srv/apache-hbase
export HIVE_HOME=/srv/apache-hive
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin
EOF

source ~/.bash_profile
			
			

安裝JDBC驅動

			
[root@localhost apache-hive]# ln -s  /usr/share/java/mysql-connector-java.jar /srv/apache-hive/lib/
[root@localhost apache-hive]# ll /srv/apache-hive/lib/mysql-connector-java.jar 
lrwxrwxrwx 1 root root 40 Jun 29 01:59 /srv/apache-hive/lib/mysql-connector-java.jar -> /usr/share/java/mysql-connector-java.jar
			
			

修改 hive-site.xml 配置檔案,配置工作目錄

			
<property>
    <name>system:java.io.tmpdir</name>
    <value>/tmp/hive</value>
</property>
<property>
    <name>system:user.name</name>
    <value>hadoop</value>
</property>

  <property>
    <name>hive.querylog.location</name>
    <value>/tmp/live/hadoop</value>
    <description>Location of Hive run time structured log file</description>
  </property>
  <property>
    <name>hive.exec.local.scratchdir</name>
    <value>/tmp/hive</value>
    <description>Local scratch space for Hive jobs</description>
  </property>
  <property>
    <name>hive.downloaded.resources.dir</name>
    <value>/tmp/hive/${hive.session.id}_resources</value>
    <description>Temporary local directory for added resources in the remote file system.</description>
  </property>

  <property>
    <name>hive.querylog.location</name>
    <value>/user/hive/log</value>
    <description>Location of Hive run time structured log file</description>
  </property>
			
			

把預設的 Derby 修改為 MySQL 需要在該檔案中配置 MySQL 資料庫連接信息。

			
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
    <description>
      JDBC connect string for a JDBC metastore.
      To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
      For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
    </description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>APP</value>
    <description>Username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>mine</value>
    <description>password to use against metastore database</description>
  </property>
			
			

將上面配置項 value 改為下面的配置

			
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true&amp;characterEncoding=UTF-8&amp;useSSL=false</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
  </property>
			
			

20.1.4. 啟動 Hive

啟動 Hive 前你必須做兩件事,一是創建HDFS目錄,二是初始化 MySQL 資料庫。

為 Hive 創建 HDFS 工作目錄並給它們賦相應的權限。

			
[root@localhost ~]$ su - hadoop
[hadoop@localhost ~]$ /srv/apache-hadoop/bin/hdfs dfs -mkdir -p /user/hive/warehouse
[hadoop@localhost ~]$ /srv/apache-hadoop/bin/hdfs dfs -mkdir -p /tmp/hive
[hadoop@localhost ~]$ /srv/apache-hadoop/bin/hdfs dfs -chmod g+w /user/hive/warehouse
[hadoop@localhost ~]$ /srv/apache-hadoop/bin/hdfs dfs -chmod 777 /tmp/hive
			
			

初始化 MySQL 資料庫

			
[hadoop@localhost ~]$ /srv/apache-hive/bin/schematool -dbType mysql -initSchema
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/srv/apache-hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/srv/apache-hadoop-2.8.0/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL:	 jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true&characterEncoding=UTF-8&useSSL=false
Metastore Connection Driver :	 com.mysql.jdbc.Driver
Metastore connection User:	 hive
Starting metastore schema initialization to 2.1.0
Initialization script hive-schema-2.1.0.mysql.sql
Initialization script completed
schemaTool completed
			
			

如果使用內嵌資料庫 derby 請使用下面命令初始化

schematool -initSchema -dbType derby
			

20.1.5. 訪問 Hive

啟動 Hadoop

			
[hadoop@localhost ~]$ /srv/apache-hadoop/sbin/start-all.sh 
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /srv/apache-hadoop-2.8.0/logs/hadoop-hadoop-namenode-localhost.localdomain.out
localhost: starting datanode, logging to /srv/apache-hadoop-2.8.0/logs/hadoop-hadoop-datanode-localhost.localdomain.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /srv/apache-hadoop-2.8.0/logs/hadoop-hadoop-secondarynamenode-localhost.localdomain.out
starting yarn daemons
starting resourcemanager, logging to /srv/apache-hadoop-2.8.0/logs/yarn-hadoop-resourcemanager-localhost.localdomain.out
localhost: starting nodemanager, logging to /srv/apache-hadoop-2.8.0/logs/yarn-hadoop-nodemanager-localhost.localdomain.out
			
			

進入 Hive 然後輸入 show databases; 測試安裝是否正常。

			
[hadoop@localhost conf]$ /srv/apache-hive/bin/hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/srv/apache-hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/srv/apache-hadoop-2.8.0/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in file:/srv/apache-hive-2.1.1/conf/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> show databases;
OK
default
Time taken: 0.264 seconds, Fetched: 1 row(s)
hive>			
			
			

至此 Apache Hive 已經安裝配置完成!

嘗試執行下面的SQL命令測試Hive是否正常

create database test;
use test;
create table test_table (id int,name string) row format delimited fields terminated by ',' stored as textfile
insert into test_table values (1, 'Neo');
select * from test_table;
			

20.1.6. 配置 hiveserver2

hiveserver2 提供遠程訪問 Hive 服務,用戶可以通過IP地址和連接埠號連接到Hive,類似mysql client

			
[hadoop@localhost ~]$ /srv/apache-hive/bin/hiveserver2 
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/srv/apache-hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/srv/apache-hadoop-2.8.0/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
			
			

檢查連接埠

[hadoop@localhost bin]$ ss -lnt | grep 10000
LISTEN     0      50           *:10000                    *:* 
			

測試 beeline 是否可以正常進入

			
[hadoop@localhost ~]$ /srv/apache-hive/bin/beeline -u jdbc:hive2://
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/srv/apache-hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/srv/apache-hadoop-2.8.0/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Connecting to jdbc:hive2://
17/06/29 22:01:16 [main]: WARN session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
Connected to: Apache Hive (version 2.1.1)
Driver: Hive JDBC (version 2.1.1)
17/06/29 22:01:16 [main]: WARN jdbc.HiveConnection: Request to set autoCommit to false; Hive does not support autoCommit=false.
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 2.1.1 by Apache Hive
0: jdbc:hive2://> show databases;
OK
+----------------+--+
| database_name  |
+----------------+--+
| default        |
+----------------+--+
1 row selected (1.318 seconds)
			
			

如果是生產環境啟動請使用下面的方法

			
[hadoop@localhost ~]$ /srv/apache-hive/bin/hive --service hiveserver2 &
[1] 20375
[hadoop@localhost ~]$ SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/srv/apache-hive-2.1.1/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/srv/apache-hadoop-2.8.0/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]