<!DOCTYPE html>
<html lang="en" itemscope itemtype="http://schema.org/Article">
<head>
<title>Data analysis with Apache Pig. A practical introduction</title>
<meta charset="utf-8">
<meta property="og:title" content="Data analysis with Apache Pig. A practical introduction">
<meta property="og:site_name" content="Modesto Mas | Blog">
<meta property="og:image" content="https://mmas.github.io/images/profile.jpg">
<meta property="og:image:width" content="200">
<meta property="og:image:height" content="200">
<meta property="og:url" content="https://mmas.github.io/data-analysis-apache-pig-practical-introduction">
<meta property="og:locale" content="en_GB">
<meta name="twitter:image" content="https://mmas.github.io/images/profile.jpg">
<meta name="twitter:url" content="https://mmas.github.io/data-analysis-apache-pig-practical-introduction">
<meta name="twitter:card" content="summary">
<meta name="twitter:domain" content="mmas.github.io">
<meta name="twitter:title" content="Data analysis with Apache Pig. A practical introduction">
<meta name="description" content="Apache Pig is a platform for analyzing large datasets. With Pig you have a higher level of abstraction than in MapReduce, so you can deal with richer...">
<meta name="twitter:description" content="Apache Pig is a platform for analyzing large datasets. With Pig you have a higher level of abstraction than in MapReduce, so you can deal with richer...">
<meta property="og:description" content="Apache Pig is a platform for analyzing large datasets. With Pig you have a higher level of abstraction than in MapReduce, so you can deal with richer...">
<meta name="keywords" content="data-analysis,hadoop,pig">
<meta property="og:type" content="blog">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta property="og:type" content="article">
<meta property="article:author" content="https://github.com/mmas">
<meta property="article:section" content="data-analysis">
<meta property="article:tag" content="data-analysis,hadoop,pig">
<meta property="article:published_time" content="2015-09-12">
<meta property="article:modified_time" content="2015-09-12">
<link rel="stylesheet" type="text/css" href="/css/main.css">
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
CommonHTML: {
scale: 93,
showMathMenu: false
},
tex2jax: {
"inlineMath": [["$","$"], ["\\(","\\)"]]
}
});
</script>
<script type="text/javascript" async src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-MML-AM_CHTML"></script>
</head>
<body class="entry-detail">
<header>
<div>
<img src="https://mmas.github.io/images/profile.jpg">
<a class="brand" href="/">Modesto Mas</a>
<span>Data/Python/DevOps Engineer</span>
<nav>
<ul>
<li><a href="/tags">Tags</a></li>
<li><a href="https://github.com/mmas/mmas.github.io/issues" target="_blank">Issues</a></li>
</ul>
</nav>
</div>
</header>
<section id="content" role="main">
<article>
<header>
<h1><a href="/data-analysis-apache-pig-practical-introduction">Data analysis with Apache Pig. A practical introduction</a></h1>
<time datetime="2015-09-12">Sep 12, 2015</time>
<a class="tag" href="/tags?tag=data-analysis">data-analysis</a>
<a class="tag" href="/tags?tag=hadoop">hadoop</a>
<a class="tag" href="/tags?tag=pig">pig</a>
</header>
<aside id="article-nav"></aside>
<section class="body">
<p><a target="_blank" href="http://pig.apache.org/">Apache Pig</a> is a platform for analyzing large datasets. Pig offers a higher level of abstraction than raw MapReduce, so you can work with richer data structures, generally at the cost of some performance.</p>
<p>Pig can run in local mode (a single JVM using the local filesystem) or in MapReduce mode, which translates scripts into MapReduce jobs and runs them on a <a target="_blank" href="http://hadoop.apache.org/">Hadoop</a> cluster. The following examples work the same way in either mode.</p>
<h2>Simple example with CSV records</h2>
<p>This example uses the NCDC weather datasets of hourly precipitation at American weather stations (<a target="_blank" href="http://www.ncdc.noaa.gov/orders/qclcd/">http://www.ncdc.noaa.gov/orders/qclcd/</a>) to compute the total precipitation per station and the total precipitation per date for a single station.</p>
<p>Read the data (use your own paths to the piggybank JAR and the dataset):</p>
<pre>-- use CSVExcelStorage from piggybank to read the CSV, skipping the header
REGISTER '/usr/local/pig/contrib/piggybank/java/piggybank.jar';
records = LOAD '/input/199804hpd.txt'
USING org.apache.pig.piggybank.storage.CSVExcelStorage(',','NO_MULTILINE','UNIX','SKIP_INPUT_HEADER')
AS (wban:chararray, date:chararray, time:chararray, hp:float);
</pre>
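<p>Before analyzing, it can help to confirm that the schema and the first few rows look as expected. A minimal sketch, assuming the records relation loaded above; the alias first_records is a hypothetical name used only for illustration:</p>
<pre>-- print the schema declared in the AS clause
DESCRIBE records;
-- inspect a handful of rows instead of dumping the whole dataset
first_records = LIMIT records 5;
DUMP first_records;
</pre>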
<p>Analyze the data:</p>
<pre>-- precipitations registered in each weather station
grouped_records = GROUP records BY wban;
sum_records = FOREACH grouped_records GENERATE group, SUM(records.hp);
DESCRIBE sum_records;
DUMP sum_records;
-- daily precipitations registered in weather station 93808
filtered_records = FILTER records BY wban == '93808';
grouped_records = GROUP filtered_records BY date;
sum_records = FOREACH grouped_records GENERATE group, SUM(filtered_records.hp);
DESCRIBE sum_records;
DUMP sum_records;
</pre>
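<p>The aggregated fields above are left unnamed, which makes them awkward to reference in later statements. As a sketch of one possible refinement (not part of the original analysis), the totals can be named with AS and then sorted. It assumes the records relation loaded earlier; the aliases grouped_by_station, station_totals, sorted_totals and top_stations and the field total_hp are hypothetical names:</p>
<pre>-- total precipitation per station, with named output fields
grouped_by_station = GROUP records BY wban;
station_totals = FOREACH grouped_by_station GENERATE
group AS wban,
SUM(records.hp) AS total_hp;
-- sort by total precipitation and keep the ten largest totals
sorted_totals = ORDER station_totals BY total_hp DESC;
top_stations = LIMIT sorted_totals 10;
DUMP top_stations;
</pre>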
<h2>More complex dataset structure</h2>
<p>Read a fixed-width dataset with many columns of weather information and extract only the desired columns:</p>
<pre>-- read file without schema
records = LOAD '/input/199803dailyavg.txt';
-- skip the first row (header) and treat each remaining row as a single chararray field
records = STREAM records THROUGH `tail -n +2` AS (row:chararray);
-- extract the desired fields by slicing the row at fixed character positions
records = FOREACH records GENERATE
SUBSTRING(row, 0, 5) AS wban,
SUBSTRING(row, 6, 12) AS year_month,
(float)TRIM(SUBSTRING(row, 15, 20)) AS max_temp,
(float)TRIM(SUBSTRING(row, 27, 31)) AS min_temp;
DESCRIBE records;
DUMP records;
</pre>
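<p>Substrings that cannot be parsed as numbers (for example, blank or missing-value columns) should come out of the cast as nulls rather than failing the script, so it may be worth filtering them out before further analysis. A sketch, assuming the records relation built above; the aliases valid_records and temp_ranges and the field temp_range are hypothetical names:</p>
<pre>-- keep only rows where both temperatures were parsed successfully
valid_records = FILTER records BY max_temp IS NOT NULL AND min_temp IS NOT NULL;
-- difference between maximum and minimum temperature for each row
temp_ranges = FOREACH valid_records GENERATE
wban, year_month, (max_temp - min_temp) AS temp_range;
DESCRIBE temp_ranges;
DUMP temp_ranges;
</pre>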
</section>
</article>
</section>
<footer></footer>
<script type="text/javascript" src="/js/article.js"></script>
</body>
</html>