Deploying the Test Environment
Hive Environment Deployment
Hive is deployed locally in Docker:

sudo docker run -d --name hadoop-master -P -h hadoop-master teradatalabs/cdh5-hive

Connect to Hive with Beeline for testing
Beeline connects successfully:
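Since the container publishes its exposed ports with `-P` (random host ports), the mapped ports and a shell inside the container can be found with standard Docker commands. A minimal sketch, assuming the container name from the run command above:

```shell
# List the host ports Docker assigned to the container's exposed ports
# (HiveServer2 listens on 10000 inside the container).
sudo docker port hadoop-master

# Open a shell inside the container to run hive / beeline directly.
sudo docker exec -it hadoop-master bash
```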
# hive --service beeline
Beeline version 1.1.0-cdh5.9.1 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000
scan complete in 3ms
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000:
Enter password for jdbc:hive2://localhost:10000:
Connected to: Apache Hive (version 1.1.0-cdh5.9.1)
Driver: Hive JDBC (version 1.1.0-cdh5.9.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000>

After connecting successfully, a driver JAR must be added before Hive can operate on ES data.
Download the driver: elasticsearch-hadoop-6.3.2.jar
0: jdbc:hive2://localhost:10000> ADD JAR /home/iteblog/elasticsearch-hadoop-6.3.2.jar;
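Note that the path passed to ADD JAR must exist inside the Hive container's filesystem, not on the host. A minimal sketch for copying the downloaded JAR in, assuming the container name from the run command above:

```shell
# Copy the connector JAR from the host into the running Hive container,
# so that ADD JAR /home/iteblog/elasticsearch-hadoop-6.3.2.jar can find it.
sudo docker cp elasticsearch-hadoop-6.3.2.jar hadoop-master:/home/iteblog/
```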
ES Environment Deployment
ES is deployed on an Alibaba Cloud host, 47.106.151.207.

# ES
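The original ES startup command survives only as a comment. A minimal sketch of a single-node Docker deployment, assuming the official 6.3.2 image and mounting the elasticsearch.yml shown in the next section (image tag and mount path are assumptions):

```shell
# Run a single-node Elasticsearch 6.3.2 in Docker, mounting the custom
# config file over the image's default elasticsearch.yml.
sudo docker run -d --name elasticsearch \
  -p 9200:9200 -p 9300:9300 \
  -v "$PWD/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml" \
  docker.elastic.co/elasticsearch/elasticsearch:6.3.2
```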
ES Deployment Pitfalls
Single-node deployment configuration file elasticsearch.yml:
cluster.name: "docker-cluster"
network.host: 0.0.0.0
http.cors.enabled: true
http.cors.allow-origin: "*"
# minimum_master_nodes need to be explicitly set when bound on a public IP
# set to 1 to allow single node clusters
# Details: https://github.com/elastic/elasticsearch/pull/17288
discovery.zen.minimum_master_nodes: 1

ES-Head cannot create an index
Inspecting the browser's network requests revealed the error: {"error":"Content-Type header [application/x-www-form-urlencoded] is not supported","status":406}
Detailed fix
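The 406 occurs because Elasticsearch 6.x enforces strict Content-Type checking, while older ES-Head builds send application/x-www-form-urlencoded. As a quick check that the cluster itself is fine, the same kind of request succeeds once the JSON header is set explicitly (a sketch; the index name is illustrative):

```shell
# Creating an index works when an explicit JSON Content-Type is sent,
# confirming the 406 comes from the missing header, not the cluster.
curl -X PUT -H 'Content-Type: application/json' \
     http://47.106.151.207:9200/test-index
```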
HIVE Reads Data from ES
Store data in ES
First, refer to the ES Quick Start guide.
1. Create an index
Create an Elasticsearch index named iteblog with type user; user has four fields: regtime, uid, mobile, and username.
PUT http://47.106.151.207:9200/iteblog
{ ... }
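The body of this PUT request survives only as an opening brace. A hypothetical mapping consistent with the regtime/uid/mobile/username schema described above (the field types are assumptions):

```shell
# Hypothetical mapping body for the iteblog index; the original body was
# lost in extraction, so types are assumed from the schema description.
MAPPING='{
  "mappings": {
    "user": {
      "properties": {
        "regtime":  { "type": "keyword" },
        "uid":      { "type": "long" },
        "mobile":   { "type": "keyword" },
        "username": { "type": "keyword" }
      }
    }
  }
}'

# Sanity-check that the body is valid JSON before sending it.
echo "$MAPPING" | python3 -m json.tool > /dev/null && echo "body ok"

# Send it to the cluster (uncomment when the host is reachable):
# curl -s -X PUT -H 'Content-Type: application/json' \
#      -d "$MAPPING" http://47.106.151.207:9200/iteblog
```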
2. Insert test data into ES
POST http://47.106.151.207:9200/iteblog/user
{ ... }
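The body of this POST also survives only as an opening brace, but its contents can be reconstructed from the row the Hive query later reads back (regtime 20190412, uid 1, mobile 13310667877, username vin):

```shell
# Test document reconstructed from the row Hive later returns.
DOC='{
  "regtime": "20190412",
  "uid": 1,
  "mobile": "13310667877",
  "username": "vin"
}'

# Validate the body locally before sending.
echo "$DOC" | python3 -m json.tool > /dev/null && echo "body ok"

# Send it (uncomment when the host is reachable):
# curl -s -X POST -H 'Content-Type: application/json' \
#      -d "$DOC" http://47.106.151.207:9200/iteblog/user
```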
3. Create the Hive table
The CREATE TABLE statement is as follows:
CREATE EXTERNAL TABLE `user`(
register_time string,
user_id int,
mobile string,
username string)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
'es.resource' = 'iteblog/user',
'es.nodes' = '47.106.151.207',
'es.port' = '9200',
'es.nodes.wan.only' = 'true',
'es.mapping.names' = 'register_time:regtime,user_id:uid'
);

The es.mapping.names property maps the Hive columns register_time and user_id to the ES fields regtime and uid. Test whether the ES data can be read:
0: jdbc:hive2://localhost:10000> select * from user;
INFO : Compiling command(queryId=hive_20190412034747_e895d44d-ee7c-4363-bbeb-41b287734595): select * from user
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:user.register_time, type:string, comment:null), FieldSchema(name:user.user_id, type:int, comment:null), FieldSchema(name:user.mobile, type:string, comment:null), FieldSchema(name:user.username, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20190412034747_e895d44d-ee7c-4363-bbeb-41b287734595); Time taken: 0.602 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20190412034747_e895d44d-ee7c-4363-bbeb-41b287734595): select * from user
INFO : Completed executing command(queryId=hive_20190412034747_e895d44d-ee7c-4363-bbeb-41b287734595); Time taken: 0.003 seconds
INFO : OK
+---------------------+---------------+--------------+----------------+--+
| user.register_time | user.user_id | user.mobile | user.username |
+---------------------+---------------+--------------+----------------+--+
| 20190412 | 1 | 13310667877 | vin |
+---------------------+---------------+--------------+----------------+--+
1 row selected (1.987 seconds)
The ES data is read successfully.
HIVE Writes Data to ES
Next, construct a row in Hive and insert it into the external table user created above, to see whether the data shows up in ES.
Writing to the ES-backed external table:
0: jdbc:hive2://localhost:10000> insert into table user values('20190411', 2, '13356583817', 'xiaoming')
INFO : Compiling command(queryId=hive_20190412045959_9a8ef2eb-da0c-42d8-9a64-72ebe7dd9701): insert into table user values('20190411', 2, '13356583817', 'xiaoming')
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_col0, type:string, comment:null), FieldSchema(name:_col1, type:int, comment:null), FieldSchema(name:_col2, type:string, comment:null), FieldSchema(name:_col3, type:string, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20190412045959_9a8ef2eb-da0c-42d8-9a64-72ebe7dd9701); Time taken: 0.207 seconds
INFO : Concurrency mode is disabled, not creating a lock manager
INFO : Executing command(queryId=hive_20190412045959_9a8ef2eb-da0c-42d8-9a64-72ebe7dd9701): insert into table user values('20190411', 2, '13356583817', 'xiaoming')
INFO : Query ID = hive_20190412045959_9a8ef2eb-da0c-42d8-9a64-72ebe7dd9701
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-0:MAPRED] in serial mode
INFO : Number of reduce tasks is set to 0 since there's no reduce operator
INFO : number of splits:1
INFO : Submitting tokens for job: job_1554983306126_0002
INFO : The url to track the job: http://hadoop-master:8088/proxy/application_1554983306126_0002/
INFO : Starting Job = job_1554983306126_0002, Tracking URL = http://hadoop-master:8088/proxy/application_1554983306126_0002/
INFO : Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1554983306126_0002
INFO : Hadoop job information for Stage-0: number of mappers: 1; number of reducers: 0
INFO : 2019-04-12 04:59:26,997 Stage-0 map = 0%, reduce = 0%
INFO : 2019-04-12 04:59:35,953 Stage-0 map = 100%, reduce = 0%, Cumulative CPU 2.38 sec
INFO : MapReduce Total cumulative CPU time: 2 seconds 380 msec
INFO : Ended Job = job_1554983306126_0002
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-0: Map: 1 Cumulative CPU: 2.38 sec HDFS Read: 3950 HDFS Write: 0 SUCCESS
INFO : Total MapReduce CPU Time Spent: 2 seconds 380 msec
INFO : Completed executing command(queryId=hive_20190412045959_9a8ef2eb-da0c-42d8-9a64-72ebe7dd9701); Time taken: 21.019 seconds
INFO : OK
No rows affected (21.239 seconds)

Confirm that the new document made it into ES:
GET http://47.106.151.207:9200/_search
The response shows the new data was written successfully.
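Instead of scanning the whole cluster with _search, the newly inserted row can also be checked directly with a URI search (a sketch using the same host and index):

```shell
# Look up the freshly inserted document by username.
curl -s 'http://47.106.151.207:9200/iteblog/user/_search?q=username:xiaoming'
```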
Further Testing
Insert write test
Update write test
Delete write test
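For the update test, elasticsearch-hadoop can be pointed at a document-id column so that repeated Hive inserts upsert the same ES document instead of appending duplicates. A sketch of the extra table properties, run through beeline (`es.write.operation` and `es.mapping.id` are standard elasticsearch-hadoop options; the table name is illustrative):

```shell
# Create an upsert-capable external table: rows with the same uid
# overwrite the existing ES document instead of creating a new one.
beeline -u jdbc:hive2://localhost:10000 -e "
CREATE EXTERNAL TABLE user_upsert(
  register_time string,
  user_id int,
  mobile string,
  username string)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES(
  'es.resource' = 'iteblog/user',
  'es.nodes' = '47.106.151.207',
  'es.port' = '9200',
  'es.nodes.wan.only' = 'true',
  'es.write.operation' = 'upsert',
  'es.mapping.id' = 'user_id',
  'es.mapping.names' = 'register_time:regtime,user_id:uid'
);"
```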
References
[1] All-in Docker! Connecting to Hive with Hue
[2] Simple Docker deployment of ElasticSearch
[3] Simple Docker-based ElasticSearch cluster setup
[4] Writing data to ElasticSearch via Hive
[5] ES Quick Start
[6] How to add third-party JARs to Hive