-
Notifications
You must be signed in to change notification settings - Fork 3
/
loading-data-hive.html
270 lines (223 loc) · 10.3 KB
/
loading-data-hive.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
<!DOCTYPE html>
<html lang="en" itemscope itemtype="http://schema.org/Article">
<head>
<title>Loading data into Hive</title>
<meta charset="utf-8">
<meta property="og:title" content="Loading data into Hive">
<meta property="og:site_name" content="Modesto Mas | Blog">
<meta property="og:image" content="https://mmas.github.io/images/profile.jpg">
<meta property="og:image:width" content="200">
<meta property="og:image:height" content="200">
<meta property="og:url" content="https://mmas.github.io/loading-data-hive">
<meta property="og:locale" content="en_GB">
<meta name="twitter:image" content="https://mmas.github.io/images/profile.jpg">
<meta name="twitter:url" content="https://mmas.github.io/loading-data-hive">
<meta name="twitter:card" content="summary">
<meta name="twitter:domain" content="mmas.github.io">
<meta name="twitter:title" content="Loading data into Hive">
<meta name="description" content="Let's practise with different ways to load data into Apache Hive and optimization concepts. Hive tables Create table with different data types: CREATE...">
<meta name="twitter:description" content="Let's practise with different ways to load data into Apache Hive and optimization concepts. Hive tables Create table with different data types: CREATE...">
<meta property="og:description" content="Let's practise with different ways to load data into Apache Hive and optimization concepts. Hive tables Create table with different data types: CREATE...">
<meta name="keywords" content="data-warehousing,hadoop,hive">
<meta property="og:type" content="blog">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta property="og:type" content="article">
<meta property="article:author" content="https://github.com/mmas">
<meta property="article:section" content="data-warehousing">
<meta property="article:tag" content="data-warehousing,hadoop,hive">
<meta property="article:published_time" content="2015-10-15">
<meta property="article:modified_time" content="2015-10-15">
<link rel="stylesheet" type="text/css" href="/css/main.css">
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
CommonHTML: {
scale: 93,
showMathMenu: false
},
tex2jax: {
"inlineMath": [["$","$"], ["\\(","\\)"]]
}
});
</script>
<script type="text/javascript" async src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-MML-AM_CHTML"></script>
</head>
<body class="entry-detail">
<header>
<div>
<img src="https://mmas.github.io/images/profile.jpg">
<a class="brand" href="/">Modesto Mas</a>
<span>Data/Python/DevOps Engineer</span>
<nav>
<ul>
<li><a href="/tags">Tags</a></li>
<li><a href="https://github.com/mmas/mmas.github.io/issues" target="_blank">Issues</a></li>
</ul>
</nav>
</div>
</header>
<section id="content" role="main">
<article>
<header>
<h1><a href="/loading-data-hive">Loading data into Hive</a></h1>
<time datetime="2015-10-15">Oct 15, 2015</time>
<a class="tag" href="/tags?tag=data-warehousing">data-warehousing</a>
<a class="tag" href="/tags?tag=hadoop">hadoop</a>
<a class="tag" href="/tags?tag=hive">hive</a>
</header>
<aside id="article-nav"></aside>
<section class="body">
<p>Let's practise with different ways to load data into <a target="_blank" href="http://hive.apache.org/">Apache Hive</a> and optimization concepts.</p>
<h2>Hive tables</h2>
<p>Create table with different data types:</p>
<pre><code class="hiveql">CREATE TABLE users (
id STRING,
name STRING,
email ARRAY<STRING>,
roles STRUCT<editor:BOOLEAN, sales:BOOLEAN, admin:BOOLEAN>,
settings MAP<STRING, STRING>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';
</code></pre>
<pre><code class="hiveql">DESC users;
</code></pre>
<pre><code class="stdout">id string
name string
email array<string>
roles struct<editor:boolean,sales:boolean,admin:boolean>
settings map<string,string>
</code></pre>
<p>The table will be stored in the Hive warehouse (defined in <code>${HIVE_HOME}/conf/hive-site.xml</code>, by default <code>/user/hive/warehouse/</code>) in HDFS. It is possible to create an external table and put the data in HDFS. An external table will be created later.</p>
<h2>Load data</h2>
<p>We are going to put some data in our database.</p>
<p>Take a look to the sample data:</p>
<pre><code class="bash">% cat /opt/data/test/users.txt
</code></pre>
<pre><code class="stdout">eo90133cf9ql|john_doe|[email protected],[email protected]|true,true,false|theme:dark,font-size:16
sh1243ihn93n|johnsmith|[email protected]|false,true,false|theme:light
aa9871kjn3l1|bob|[email protected],[email protected]|false,false,true|font-size:12
hh2342o2nkj4|alice|[email protected]|true,false,true|theme:solarized
</code></pre>
<p>Load data from local system:</p>
<pre><code class="hiveql">LOAD DATA LOCAL INPATH '/opt/data/data/users.txt'
OVERWRITE INTO TABLE users;
</code></pre>
<p>Load data from HDFS:</p>
<pre><code class="bash">% hadoop fs -put /opt/data/test/user.txt input/
</code></pre>
<pre><code class="hiveql">LOAD DATA INPATH 'input/users.txt'
OVERWRITE INTO TABLE users;
</code></pre>
<h2>Hive partitions</h2>
<p>In order to improve the performance, we can implement partitions of the data in Hive. In this case, we'll create a table with partitions columns according to a <code>day</code> field. Only the required partitions will be queried</p>
<p>First, create external table with the raw data to load the data using <code>INSERT</code> instead of <code>LOAD</code>:</p>
<pre><code class="hiveql">CREATE EXTERNAL TABLE history_raw (
user_id STRING,
datetime TIMESTAMP,
ip STRING,
browser STRING,
os STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/root/input/history_raw.txt';
</code></pre>
<p>Put the text file with the raw data into HDFS:</p>
<pre><code class="bash">% cat /opt/data/test/history.txt
</code></pre>
<pre><code class="stdout">eo90133cf9ql,2015-10-01 00:03:20,123.456.77.88,firefox,linux,20151001
eo90133cf9ql,2015-10-01 01:08:56,123.456.77.88,firefox,linux,20151001
eo90133cf9ql,2015-10-02 02:30:45,123.456.77.88,firefox,linux,20151002
sh1243ihn93n,2015-10-02 11:21:50,121.956.23.88,safari,mac,20151002
eo90133cf9ql,2015-10-10 15:02:11,133.555.23.88,firefox,android,20151010
aa9871kjn3l1,2015-10-10 18:20:43,155.215.23.88,chrome,windows,20151010
eo90133cf9ql,2015-10-11 12:18:09,123.456.23.88,firefox,android,20151011
eo90133cf9ql,2015-10-12 12:34:34,123.456.23.88,firefox,android,20151012
hh2342o2nkj4,2015-10-15 15:02:11,133.555.23.88,safari,ios,20151015
sh1243ihn93n,2015-10-15 21:21:21,121.956.23.88,safari,mac,20151015
</code></pre>
<pre><code class="bash">% hadoop fs -put /opt/data/test/history.txt input/history_raw.txt
</code></pre>
<p>Create table with partition:</p>
<pre><code class="hiveql">CREATE TABLE history (
user_id STRING,
datetime TIMESTAMP,
ip STRING,
browser STRING,
os STRING)
PARTITIONED BY (day STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
</code></pre>
<pre><code class="hiveql">DESC history;
</code></pre>
<pre><code class="stdout">user_id string
datetime timestamp
ip string
browser string
os string
day string
# Partition Information
# col_name data_type comment
day string
</code></pre>
<p>Create a partition:</p>
<pre><code class="hiveql">ALTER TABLE history
ADD PARTITION (day='20151015');
SHOW PARTITIONS history;
</code></pre>
<pre><code class="stdout">day=20151015
</code></pre>
<p>To load local data into partition table we can use <code>LOAD</code> or <code>INSERT</code>, but we can filter easily the data with <code>INSERT</code> from the raw table to put the fields in the proper partition.</p>
<pre><code class="hiveql">INSERT OVERWRITE TABLE history PARTITION (day='20151015')
SELECT * FROM history_raw
WHERE substr(datetime, 0, 10) = '2015-10-15';
</code></pre>
<p>A folder per partition will be created in the hive warehouse:</p>
<pre><code class="bash">% hadoop fs -ls /user/hive/warehouse/history
</code></pre>
<pre><code class="stdout">... /user/hive/warehouse/history/day=20151015
</code></pre>
<h2>Hive buckets</h2>
<p>Following with the optimization in Hive, bucketing is a technique for segment the files into different clusters in HDFS.</p>
<p>Create clustered table:</p>
<pre><code class="hiveql">CREATE TABLE history_buckets (
user_id STRING,
datetime TIMESTAMP,
ip STRING,
browser STRING,
os STRING)
CLUSTERED BY (user_id) INTO 10 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
</code></pre>
<p>Set the parameters to limit the reducers to the number of clusters:</p>
<pre><code class="hiveql">set hive.enforce.bucketing = true;
set hive.exec.reducers.max = 10;
</code></pre>
<p>Since <code>LOAD</code> doesn't verify the data we need to use <code>INSERT</code> for bucketing:</p>
<pre><code class="hiveql">INSERT OVERWRITE TABLE history_buckets
SELECT * FROM history_raw;
</code></pre>
<p>In HDFS, a file will be created for each cluster:</p>
<pre><code class="bash">% hadoop fs -ls /user/hive/warehouse/history_buckets
</code></pre>
<pre><code class="stdout">... 2015-10-15 12:14 /user/hive/warehouse/history_buckets/000000_0
... 2015-10-15 12:14 /user/hive/warehouse/history_buckets/000001_0
... 2015-10-15 12:14 /user/hive/warehouse/history_buckets/000002_0
... 2015-10-15 12:14 /user/hive/warehouse/history_buckets/000003_0
... 2015-10-15 12:14 /user/hive/warehouse/history_buckets/000004_0
... 2015-10-15 12:14 /user/hive/warehouse/history_buckets/000005_0
... 2015-10-15 12:14 /user/hive/warehouse/history_buckets/000006_0
... 2015-10-15 12:14 /user/hive/warehouse/history_buckets/000007_0
... 2015-10-15 12:14 /user/hive/warehouse/history_buckets/000008_0
... 2015-10-15 12:14 /user/hive/warehouse/history_buckets/000009_0
</code></pre>
</section>
</article>
</section>
<footer></footer>
<script type="text/javascript" src="/js/article.js"></script>
</body>
</html>