博客
关于我
强烈建议你试试无所不能的chatGPT,快点击我
Hive分区表简介
阅读量:6912 次
发布时间:2019-06-27

本文共 17228 字,大约阅读时间需要 57 分钟。

简介:

如果一个表中数据很多,我们查询时就很慢,耗费大量时间,如果要查询其中部分数据该怎么办呢,这时我们引入分区的概念。
Hive中的分区表分为两种:静态分区和动态分区。

1.静态分区:

  • 可以根据PARTITIONED BY创建分区表,一个表可以拥有一个或者多个分区,每个分区以文件夹的形式单独存在表文件夹的目录下。
  • 分区是以字段的形式在表结构中存在,通过describe table命令可以查看到字段存在,但是该字段不存放实际的数据内容,仅仅是分区的表示。
  • 分区建表分为2种,一种是单分区,也就是说在表文件夹目录下只有一级文件夹目录。另外一种是多分区,表文件夹下出现多文件夹嵌套模式。

单级分区表演示:

# 单分区表创建hive> create table order_partition(    > ordernumber string,    > eventtime string    > )    > partitioned by (event_month string)    > row format delimited fields terminated by '\t';OKTime taken: 0.82 seconds# 将order.txt 文件中的数据加载到order_partition表中hive> load data local inpath '/home/hadoop/order.txt' overwrite into table order_partition partition (event_month='2014-05');Loading data to table default.order_partition partition (event_month=2014-05)Partition default.order_partition{event_month=2014-05} stats: [numFiles=1, numRows=0, totalSize=208, rawDataSize=0]OKTime taken: 1.749 seconds# 查看order_partition分区数据hive> select * from order_partition where event_month='2014-05';OK10703007267488  2014-05-01 06:01:12.334+01      2014-0510101043505096  2014-05-01 07:28:12.342+01      2014-0510103043509747  2014-05-01 07:50:12.33+01       2014-0510103043501575  2014-05-01 09:27:12.33+01       2014-0510104043514061  2014-05-01 09:03:12.324+01      2014-05Time taken: 0.208 seconds, Fetched: 5 row(s)# 在元数据MySQL中查看mysql> select * from partitions;+---------+-------------+------------------+---------------------+-------+--------+| PART_ID | CREATE_TIME | LAST_ACCESS_TIME | PART_NAME           | SD_ID | TBL_ID |+---------+-------------+------------------+---------------------+-------+--------+|       1 |  1530498328 |                0 | event_month=2014-05 |    32 |     31 |+---------+-------------+------------------+---------------------+-------+--------+1 row in set (0.00 sec)mysql>  select * from partition_key_vals;+---------+--------------+-------------+| PART_ID | PART_KEY_VAL | INTEGER_IDX |+---------+--------------+-------------+|       1 | 2014-05      |           0 |+---------+--------------+-------------+1 row in set (0.00 sec)# HDFS中查看目录[hadoop@hadoop000 ~]$ hadoop fs -ls /user/hive/warehouse/order_partition/         Found 1 itemsdrwxr-xr-x   - hadoop supergroup          0 2018-07-02 10:29 /user/hive/warehouse/order_partition/event_month=2014-05

注:使用hadoop shell 加载数据也能加载数据,下面进行演示:

创建分区,也就是说在HDFS文件夹目录下会有一个分区目录,那么我们是不是直接可以在HDFS上创建一个目录,再把数据加载进去呢?

# 创建目录并上传文件[hadoop@hadoop000 ~]$ hadoop fs -mkdir -p /user/hive/warehouse/order_partition/event_month=2014-06[hadoop@hadoop000 ~]$ hadoop fs -put /home/hadoop/order.txt /user/hive/warehouse/order_partition/event_month=2014-06[hadoop@hadoop000 ~]$ hadoop fs -ls /user/hive/warehouse/order_partition/                                           Found 2 itemsdrwxr-xr-x   - hadoop supergroup          0 2018-07-02 10:29 /user/hive/warehouse/order_partition/event_month=2014-05drwxr-xr-x   - hadoop supergroup          0 2018-07-02 10:54 /user/hive/warehouse/order_partition/event_month=2014-06# 发现分区表中没有数据hive> select * from order_partition where event_month='2014-06';OKTime taken: 0.21 seconds# 原因是我们将文件上传到了hdfs,hdfs是有了数据,但hive中的元数据中还没有,执行如下命令更新hive> msck repair table order_partition;OKPartitions not in metastore:    order_partition:event_month=2014-06Repair: Added partition to metastore order_partition:event_month=2014-06Time taken: 0.178 seconds, Fetched: 2 row(s)# 再次查看分区数据hive> select * from order_partition where event_month='2014-06';OK10703007267488  2014-05-01 06:01:12.334+01      2014-0610101043505096  2014-05-01 07:28:12.342+01      2014-0610103043509747  2014-05-01 07:50:12.33+01       2014-0610103043501575  2014-05-01 09:27:12.33+01       2014-0610104043514061  2014-05-01 09:03:12.324+01      2014-06Time taken: 0.257 seconds, Fetched: 5 row(s)# 查看表分区hive> show partitions order_partition;OKevent_month=2014-05event_month=2014-06Time taken: 0.164 seconds, Fetched: 2 row(s)

注意: msck repair table命令执行后Hive会检测如果HDFS目录下存在 但表的metastore中不存在的partition元信息,更新到metastore中。如果有一张表已经存放好几年了,用这个命令去执行的话 半天都反应不了,所以这个命令太暴力了,生产中不推荐使用。可以用Add partition来添加分区。

[hadoop@hadoop000 ~]$ hadoop fs -mkdir -p /user/hive/warehouse/order_partition/event_month=2014-07[hadoop@hadoop000 ~]$ hadoop fs -put /home/hadoop/order.txt /user/hive/warehouse/order_partition/event_month=2014-07[hadoop@hadoop000 ~]$ hadoop fs -ls /user/hive/warehouse/order_partition/Found 3 itemsdrwxr-xr-x   - hadoop supergroup          0 2018-07-02 10:29 /user/hive/warehouse/order_partition/event_month=2014-05drwxr-xr-x   - hadoop supergroup          0 2018-07-02 10:54 /user/hive/warehouse/order_partition/event_month=2014-06drwxr-xr-x   - hadoop supergroup          0 2018-07-02 11:09 /user/hive/warehouse/order_partition/event_month=2014-07# 查看新的分区hive> select * from order_partition where event_month='2014-07';OKTime taken: 0.188 seconds# 添加分区hive> ALTER TABLE order_partition ADD IF NOT EXISTS PARTITION (event_month='2014-07');OKTime taken: 0.22 seconds# 再次查看hive> select * from order_partition where event_month='2014-07';OK10703007267488  2014-05-01 06:01:12.334+01      2014-0710101043505096  2014-05-01 07:28:12.342+01      2014-0710103043509747  2014-05-01 07:50:12.33+01       2014-0710103043501575  2014-05-01 09:27:12.33+01       2014-0710104043514061  2014-05-01 09:03:12.324+01      2014-07Time taken: 0.206 seconds, Fetched: 5 row(s)hive> show partitions order_partition;OKevent_month=2014-05event_month=2014-06event_month=2014-07Time taken: 0.151 seconds, Fetched: 3 row(s)

多级分区表演示:

# 创建多级分区表hive> create table order_mulit_partition(    > ordernumber string,    > eventtime string    > )    > partitioned by (event_month string,event_day string)    > row format delimited fields terminated by '\t';OKTime taken: 0.133 seconds# 加载数据hive>  load data local inpath '/home/hadoop/order.txt' overwrite into table order_mulit_partition partition (event_month='2014-05',event_day=01);# 查看分区hive> select * from order_mulit_partition where event_month='2014-05' and event_day='01';OK10703007267488  2014-05-01 06:01:12.334+01      2014-05 0110101043505096  2014-05-01 07:28:12.342+01      2014-05 0110103043509747  2014-05-01 07:50:12.33+01       2014-05 0110103043501575  2014-05-01 09:27:12.33+01       2014-05 0110104043514061  2014-05-01 09:03:12.324+01      2014-05 01hive> show partitions order_mulit_partition;OKevent_month=2014-05/event_day=01Time taken: 0.158 seconds, Fetched: 1 row(s)# HDFS中多级分区的目录结构[hadoop@hadoop000 ~]$ hadoop fs -ls /user/hive/warehouse/order_mulit_partition/event_month=2014-05Found 1 itemsdrwxr-xr-x   - hadoop supergroup          0 2018-07-02 11:17 /user/hive/warehouse/order_mulit_partition/event_month=2014-05/event_day=01

总结:单级分区在HDFS上文件目录为单级;多分区在HDFS上文件目录为多级。

2.动态分区:

参考:

  • 先看看官方为我们解释的什么是动态分区:
    Static Partition (SP) columns 静态分区;
    Dynamic Partition (DP) columns 动态分区。
DP columns are specified the same way as it is for SP columns – in the partition clause. The only difference is that DP columns do not have values, while SP columns do. In the partition clause, we need to specify all partitioning columns, even if all of them are DP columns.In INSERT ... SELECT ... queries, the dynamic partition columns must be specified last among the columns in the SELECT statement and in the same order in which they appear in the PARTITION() clause.

简单总结下区别:

  • 1.DP列的指定方式与SP列相同 - 在分区子句中( Partition关键字后面),唯一的区别是,DP列没有值,而SP列有值( Partition关键字后面只有key没有value)

  • 2.在INSERT … SELECT …查询中,必须在SELECT语句中的列中最后指定动态分区列,并按PARTITION()子句中出现的顺序进行排列

  • 3.所有DP列 - 只允许在非严格模式下使用。 在严格模式下,我们应该抛出一个错误

  • 4.如果动态分区和静态分区一起使用,必须是动态分区的字段在前,静态分区的字段在后。

下面举几个例子进行演示:

注意:为了演示动态分区与静态分区的区别 并且对比出静态分区的繁琐,我们先对静态分区进行操作 之后再演示动态分区。

# 创建员工静态分区表hive> CREATE TABLE emp_static_partition (    > empno int,    > ename string,    > job string,    > mgr int,    > hiredate string,    > salary double,    > comm double    > )    > PARTITIONED BY (deptno int)    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';OKTime taken: 0.198 seconds# 将emp表里的数据插入静态分区hive> select * from emp;OK7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    207499    ALLEN   SALESMAN        7698    1981-2-20       1600.0  300.0   307521    WARD    SALESMAN        7698    1981-2-22       1250.0  500.0   307566    JONES   MANAGER 7839    1981-4-2        2975.0  NULL    207654    MARTIN  SALESMAN        7698    1981-9-28       1250.0  1400.0  307698    BLAKE   MANAGER 7839    1981-5-1        2850.0  NULL    307782    CLARK   MANAGER 7839    1981-6-9        2450.0  NULL    107788    SCOTT   ANALYST 7566    1987-4-19       3000.0  NULL    207839    KING    PRESIDENT       NULL    1981-11-17      5000.0  NULL    107844    TURNER  SALESMAN        7698    1981-9-8        1500.0  0.0     307876    ADAMS   CLERK   7788    1987-5-23       1100.0  NULL    207900    JAMES   CLERK   7698    1981-12-3       950.0   NULL    307902    FORD    ANALYST 7566    1981-12-3       3000.0  NULL    207934    MILLER  CLERK   7782    1982-1-23       1300.0  NULL    10Time taken: 0.164 seconds, Fetched: 14 row(s)# 每个分区都要写一条insert语句hive> insert into table emp_static_partition partition(deptno=10)    > select empno,ename ,job ,mgr ,hiredate ,salary ,comm from emp where deptno=10;Query ID = hadoop_20180702100505_f0566585-06b2-4c53-910a-b6a58791fc2dTotal jobs = 3Launching Job 1 out of 3...OKTime taken: 15.265 secondshive> insert into table emp_static_partition partition(deptno=20)    > select empno,ename ,job ,mgr ,hiredate ,salary ,comm from emp where deptno=20;Query ID = hadoop_20180702100505_f0566585-06b2-4c53-910a-b6a58791fc2dTotal jobs = 3Launching Job 1 out of 3...OKTime taken: 18.527 secondshive> insert into table emp_static_partition partition(deptno=30)    > select empno,ename ,job ,mgr ,hiredate ,salary ,comm from emp where deptno=30;Query ID = hadoop_20180702100505_f0566585-06b2-4c53-910a-b6a58791fc2dTotal jobs = 3Launching Job 1 out of 3...OKTime taken: 14.062 seconds# 查看各分区hive> select * from emp_static_partition  where deptno='10';OK7782    CLARK   MANAGER 7839    1981-6-9        2450.0  NULL    107839    KING    PRESIDENT       NULL    1981-11-17      5000.0  NULL    107934    MILLER  CLERK   7782    1982-1-23       1300.0  NULL    10Time taken: 0.219 seconds, Fetched: 3 row(s)hive> select * from emp_static_partition  where deptno='20';OK7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    207566    JONES   MANAGER 7839    1981-4-2        2975.0  NULL    207788    SCOTT   ANALYST 7566    1987-4-19       3000.0  NULL    207876    ADAMS   CLERK   7788    1987-5-23       1100.0  NULL    207902    FORD    ANALYST 7566    1981-12-3       3000.0  NULL    20Time taken: 0.197 seconds, Fetched: 5 row(s)hive> select * from emp_static_partition  where deptno='30';OK7499    ALLEN   SALESMAN        7698    1981-2-20       1600.0  300.0   307521    WARD    SALESMAN        7698    1981-2-22       1250.0  500.0   307654    MARTIN  SALESMAN        7698    1981-9-28       1250.0  1400.0  307698    BLAKE   MANAGER 7839    1981-5-1        2850.0  NULL    307844    TURNER  SALESMAN        7698    1981-9-8        1500.0  0.0     307900    JAMES   CLERK   7698    1981-12-3       950.0   NULL    30Time taken: 0.181 seconds, Fetched: 6 row(s)

静态分区表有一个非常致命的缺点,每次分区的插入都要单独写insert语句。

下面利用动态分区进行演示

演示前先进行设置:hive 中默认是静态分区,想要使用动态分区,需要设置如下参数,可以使用临时设置,你也可以写在配置文件(hive-site.xml)里,永久生效。临时配置如下

set hive.exec.dynamic.partition=true;   --开启动态分区 默认为false,不开启set hive.exec.dynamic.partition.mode=nonstrict; --指定动态分区模式,默认为strict,即必须指定至少一个分区为静态分区,nonstrict模式表示允许所有的分区字段都可以使用动态分区
# 创建员工动态分区表,分区字段为deptnohive> CREATE TABLE emp_dynamic_partition (    > empno int,    > ename string,    > job string,    > mgr int,    > hiredate string,    > salary double,    > comm double    > )    > PARTITIONED BY (deptno int)    > ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';OKTime taken: 0.165 seconds# insert一条语句搞定hive> insert into table emp_dynamic_partition partition(deptno)    > select empno,ename ,job ,mgr ,hiredate ,salary ,comm, deptno from emp;Query ID = hadoop_20180702100505_f0566585-06b2-4c53-910a-b6a58791fc2dTotal jobs = 3Launching Job 1 out of 3...OKTime taken: 17.982 seconds# 查看各分区hive> show partitions emp_dynamic_partition;OKdeptno=10deptno=20deptno=30Time taken: 0.176 seconds, Fetched: 3 row(s)hive> select * from emp_dynamic_partition where deptno='10';OK7782    CLARK   MANAGER 7839    1981-6-9        2450.0  NULL    107839    KING    PRESIDENT       NULL    1981-11-17      5000.0  NULL    107934    MILLER  CLERK   7782    1982-1-23       1300.0  NULL    10Time taken: 2.662 seconds, Fetched: 3 row(s)hive> select * from emp_dynamic_partition where deptno='20';OK7369    SMITH   CLERK   7902    1980-12-17      800.0   NULL    207566    JONES   MANAGER 7839    1981-4-2        2975.0  NULL    207788    SCOTT   ANALYST 7566    1987-4-19       3000.0  NULL    207876    ADAMS   CLERK   7788    1987-5-23       1100.0  NULL    207902    FORD    ANALYST 7566    1981-12-3       3000.0  NULL    20Time taken: 0.178 seconds, Fetched: 5 row(s)hive> select * from emp_dynamic_partition where deptno='30';OK7499    ALLEN   SALESMAN        7698    1981-2-20       1600.0  300.0   307521    WARD    SALESMAN        7698    1981-2-22       1250.0  500.0   307654    MARTIN  SALESMAN        7698    1981-9-28       1250.0  1400.0  307698    BLAKE   MANAGER 7839    1981-5-1        2850.0  NULL    307844    TURNER  SALESMAN        7698    1981-9-8        1500.0  0.0     307900    JAMES   CLERK   7698    1981-12-3       950.0   NULL    30Time taken: 0.146 seconds, Fetched: 6 row(s)
  • 查看HDFS上文件目录结构
    [hadoop@hadoop000 ~]$ hadoop fs -ls /user/hive/warehouseFound 6 itemsdrwxr-xr-x   - hadoop supergroup          0 2018-06-24 15:38 /user/hive/warehouse/empdrwxr-xr-x   - hadoop supergroup          0 2018-07-02 13:55 /user/hive/warehouse/emp_dynamic_partitiondrwxr-xr-x   - hadoop supergroup          0 2018-07-02 13:50 /user/hive/warehouse/emp_static_partitiondrwxr-xr-x   - hadoop supergroup          0 2018-07-02 11:17 /user/hive/warehouse/order_mulit_partitiondrwxr-xr-x   - hadoop supergroup          0 2018-07-02 11:09 /user/hive/warehouse/order_partitiondrwxr-xr-x   - hadoop supergroup          0 2018-06-24 15:35 /user/hive/warehouse/stu[hadoop@hadoop000 ~]$ hadoop fs -ls /user/hive/warehouse/emp_static_partitionFound 3 itemsdrwxr-xr-x   - hadoop supergroup          0 2018-07-02 13:47 /user/hive/warehouse/emp_static_partition/deptno=10drwxr-xr-x   - hadoop supergroup          0 2018-07-02 13:50 /user/hive/warehouse/emp_static_partition/deptno=20drwxr-xr-x   - hadoop supergroup          0 2018-07-02 13:51 /user/hive/warehouse/emp_static_partition/deptno=30[hadoop@hadoop000 ~]$ hadoop fs -ls /user/hive/warehouse/emp_dynamic_partitionFound 3 itemsdrwxr-xr-x   - hadoop supergroup          0 2018-07-02 13:55 /user/hive/warehouse/emp_dynamic_partition/deptno=10drwxr-xr-x   - hadoop supergroup          0 2018-07-02 13:55 /user/hive/warehouse/emp_dynamic_partition/deptno=20drwxr-xr-x   - hadoop supergroup          0 2018-07-02 13:55 /user/hive/warehouse/emp_dynamic_partition/deptno=30

补充:两种分区还可以混合使用 下面做简要了解:

  • mixed SP & DP columns(混合使用动态分区和静态分区)
hive> create table student(    > id int,    > name string,    > tel string,    > age int    > )    > row format delimited fields terminated by '\t';OKTime taken: 0.125 secondshive> insert into student values(1,'zhangsan','18311111111',20),(2,'lisi','18222222222',30),(3,'wangwu','15733333333',40);Query ID = hadoop_20180702100505_f0566585-06b2-4c53-910a-b6a58791fc2dTotal jobs = 3Launching Job 1 out of 3...OKTime taken: 15.375 secondshive> select * from student;OK1       zhangsan        18311111111     202       lisi    18222222222     303       wangwu  15733333333     40Time taken: 0.106 seconds, Fetched: 3 row(s)# 创建混合分区表hive> create table stu_mixed_partition(    > id int,    > name string,    > tel string    > )    > partitioned by (ds string,age int)    > row format delimited fields terminated by '\t';OKTime taken: 0.171 seconds# 插入数据hive> insert into stu_mixed_partition partition(ds='2010-03-03',age)    > select id,name,tel,age from student;Query ID = hadoop_20180702100505_f0566585-06b2-4c53-910a-b6a58791fc2dTotal jobs = 3Launching Job 1 out of 3...OKTime taken: 18.887 seconds# 查看分区hive> show partitions stu_mixed_partition;OKds=2010-03-03/age=20ds=2010-03-03/age=30ds=2010-03-03/age=40hive> select * from stu_mixed_partition where ds='2010-03-03' and age=20;OK1       zhangsan        18311111111     2010-03-03      20Time taken: 0.184 seconds, Fetched: 1 row(s)hive> select * from stu_mixed_partition where ds='2010-03-03' and age=30;OK2       lisi    18222222222     2010-03-03      30Time taken: 0.188 seconds, Fetched: 1 row(s)hive> select * from stu_mixed_partition where ds='2010-03-03' and age=40;OK3       wangwu  15733333333     2010-03-03      40Time taken: 0.186 seconds, Fetched: 1 row(s)# 查看HDFS目录[hadoop@oradb3 ~]$  hadoop fs -ls /user/hive/warehouse/stu_mixed_partition/ds=2010-03-03Found 3 itemsdrwxr-xr-x   - hadoop supergroup          0 2018-07-02 14:10 /user/hive/warehouse/stu_mixed_partition/ds=2010-03-03/age=20drwxr-xr-x   - hadoop supergroup          0 2018-07-02 14:10 /user/hive/warehouse/stu_mixed_partition/ds=2010-03-03/age=30drwxr-xr-x   - hadoop supergroup          0 2018-07-02 14:10 /user/hive/warehouse/stu_mixed_partition/ds=2010-03-03/age=40

转载于:https://blog.51cto.com/10814168/2135046

你可能感兴趣的文章
shell实例
查看>>
OSPF-概述
查看>>
IdCardUtils 工具类 (×××)
查看>>
selenium chrome driver和 IE driver 常见问题(1)
查看>>
射频卡机井灌溉控制器特点
查看>>
solidity智能合约[49]-安全-溢出***
查看>>
Centos时间同步问题
查看>>
Spring Cloud - 关于Feign的常见问题总结
查看>>
百晓生详解nginx(下)nginx在实际生产环境中的应用(该帖连载更新)
查看>>
解决大数据难题 阿里云MaxCompute获科技大奖
查看>>
爬虫笔记3:requests库使用
查看>>
Linux 基础知识(七)
查看>>
创建私有CA
查看>>
(最简单)红米手机3S 3X的usb调试模式在哪里打开的流程
查看>>
如何将PDF压缩有什么简单的方法
查看>>
Java数据结构和算法(五)——队列
查看>>
jQuery
查看>>
Java实现爬虫示例
查看>>
JNA 基础篇<二> 结构体
查看>>
电脑上怎么将PDF图纸文件快速转换成DWG格式CAD图纸?
查看>>