
<!DOCTYPE html>
<html lang="en" itemscope itemtype="http://schema.org/Article">
  <head>
    <title>Data analysis with Apache Pig. A practical introduction</title>
    

    <meta charset="utf-8">
    <meta property="og:title" content="Data analysis with Apache Pig. A practical introduction">
    <meta property="og:site_name" content="Modesto Mas | Blog">
    <meta property="og:image" content="https://mmas.github.io/images/profile.jpg">
    <meta property="og:image:width" content="200">
    <meta property="og:image:height" content="200">
    <meta property="og:url" content="https://mmas.github.io/data-analysis-apache-pig-practical-introduction">
    <meta property="og:locale" content="en_GB">
    <meta name="twitter:image" content="https://mmas.github.io/images/profile.jpg">
    <meta name="twitter:url" content="https://mmas.github.io/data-analysis-apache-pig-practical-introduction">
    <meta name="twitter:card" content="summary">
    <meta name="twitter:domain" content="mmas.github.io">
    <meta name="twitter:title" content="Data analysis with Apache Pig. A practical introduction">
    <meta name="description" content="Apache Pig is a platform for analyzing large datasets. With Pig you have a higher level of abstraction than in MapReduce, so you can deal with richer...">
    <meta name="twitter:description" content="Apache Pig is a platform for analyzing large datasets. With Pig you have a higher level of abstraction than in MapReduce, so you can deal with richer...">
    <meta property="og:description" content="Apache Pig is a platform for analyzing large datasets. With Pig you have a higher level of abstraction than in MapReduce, so you can deal with richer...">
    <meta name="keywords" content="data-analysis,hadoop,pig">
    <meta property="og:type" content="blog">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    
<meta property="og:type" content="article">
<meta property="article:author" content="https://github.com/mmas">
<meta property="article:section" content="data-analysis">
<meta property="article:tag" content="data-analysis,hadoop,pig">
<meta property="article:published_time" content="2015-09-12">
<meta property="article:modified_time" content="2015-09-12">

    <link rel="stylesheet" type="text/css" href="/css/main.css">
    <script type="text/x-mathjax-config">
    MathJax.Hub.Config({
        CommonHTML: {
            scale: 93,
            showMathMenu: false
        },
        tex2jax: {
            "inlineMath": [["$","$"], ["\\(","\\)"]]
        }
    });
    </script>
    <script type="text/javascript" async src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-MML-AM_CHTML"></script>
  </head>
  <body class="entry-detail">

    
    <header>
  <div>
    <img src="https://mmas.github.io/images/profile.jpg">
    <a class="brand" href="/">Modesto Mas</a>
    <span>Data/Python/DevOps Engineer</span>
    <nav>
      <ul>
        <li><a href="/tags">Tags</a></li>
        <li><a href="https://github.com/mmas/mmas.github.io/issues" target="_blank">Issues</a></li>
      </ul>
    </nav>
  </div>
</header>
    

    <section id="content" role="main">
    

<article>
  
<header>
  <h1><a href="/data-analysis-apache-pig-practical-introduction">Data analysis with Apache Pig. A practical introduction</a></h1>
  <time datetime="2015-09-12">Sep 12, 2015</time>
  
  <a class="tag" href="/tags?tag=data-analysis">data-analysis</a>
  
  <a class="tag" href="/tags?tag=hadoop">hadoop</a>
  
  <a class="tag" href="/tags?tag=pig">pig</a>
  
</header>

  <aside id="article-nav"></aside>
  <section class="body">
    <p><a target="_blank" href="http://pig.apache.org/">Apache Pig</a> is a platform for analyzing large datasets. Its language, Pig Latin, offers a higher level of abstraction than hand-written MapReduce, so you can work with richer data structures, generally at the cost of some performance.</p>
<p>Pig can run in local mode (a single JVM using the local filesystem) or in MapReduce mode, where it translates scripts into MapReduce jobs and runs them on a <a target="_blank" href="http://hadoop.apache.org/">Hadoop</a> cluster. The following examples work in either mode.</p>
<h2>Simple example with CSV records</h2>
<p>Using the NCDC datasets of hourly precipitation at American weather stations (<a target="_blank" href="http://www.ncdc.noaa.gov/orders/qclcd/">http://www.ncdc.noaa.gov/orders/qclcd/</a>), compute the total precipitation per station and the total precipitation per date for a single station.</p>
<p>Read the data (use your own path to the piggybank file and dataset):</p>
<pre>-- use CSVExcelStorage from piggybank to read the CSV file, skipping the header row
REGISTER '/usr/local/pig/contrib/piggybank/java/piggybank.jar';
records = LOAD '/input/199804hpd.txt'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',','NO_MULTILINE','UNIX','SKIP_INPUT_HEADER')
    AS (wban:chararray, date:chararray, time:chararray, hp:float);
</pre>
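
<p>Before aggregating, it can help to check that the schema and the values were read as expected. The following is a minimal sketch that continues from the <code>records</code> relation loaded above; the alias names (<code>sample_records</code>, <code>null_hp</code> and so on) are illustrative and not part of the original script:</p>
<pre>-- print the schema declared in the AS clause
DESCRIBE records;

-- look at a handful of rows instead of dumping the whole file
sample_records = LIMIT records 5;
DUMP sample_records;

-- values that cannot be converted to float are loaded as null;
-- count those rows to get an idea of the data quality
null_hp = FILTER records BY hp IS NULL;
null_hp_all = GROUP null_hp ALL;
null_hp_count = FOREACH null_hp_all GENERATE COUNT_STAR(null_hp);
DUMP null_hp_count;
</pre>
<p>If no row has a null <code>hp</code>, the last <code>DUMP</code> prints nothing, because grouping an empty relation produces no output.</p>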

<p>Analyze the data:</p>
<pre>-- total precipitation registered at each weather station
grouped_records = GROUP records BY wban;
sum_records = FOREACH grouped_records GENERATE group, SUM(records.hp);
DESCRIBE sum_records;
DUMP sum_records;

-- daily precipitation registered at weather station 93808
filtered_records = FILTER records BY wban == '93808';
grouped_records = GROUP filtered_records BY date;
sum_records = FOREACH grouped_records GENERATE group, SUM(filtered_records.hp);
DESCRIBE sum_records;
DUMP sum_records;
</pre>
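
<p>The aggregated fields above are unnamed, which makes them awkward to reuse in later statements. As a sketch of one possible follow-up, the per-station totals can be given explicit names, sorted and written to disk. The output path <code>/output/station_totals</code> and the alias names are assumptions chosen for illustration:</p>
<pre>-- same aggregation as above, with named output fields
grouped_by_station = GROUP records BY wban;
station_totals = FOREACH grouped_by_station GENERATE
    group AS wban,
    SUM(records.hp) AS total_hp;

-- stations with the highest totals first
ordered_totals = ORDER station_totals BY total_hp DESC;
top_stations = LIMIT ordered_totals 10;
DUMP top_stations;

-- write the full result as comma-separated text
-- (hypothetical path; the output directory must not exist yet)
STORE ordered_totals INTO '/output/station_totals' USING PigStorage(',');
</pre>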

<h2>More complex dataset structure</h2>
<p>Read a fixed-width dataset with many columns of weather information and extract only the columns of interest:</p>
<pre>-- read file without schema
records = LOAD '/input/199803dailyavg.txt';

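-- skip the first row (header) and expose each remaining line as a single chararray field
-- (the command runs in each task, so this assumes the file is read as a single input split)
records = STREAM records THROUGH `tail -n +2` AS (row:chararray);

-- extract the desired fields by slicing each row at fixed character positions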
records = FOREACH records GENERATE
    SUBSTRING(row, 0, 5) AS wban,
    SUBSTRING(row, 6, 12) AS year_month,
    (float)TRIM(SUBSTRING(row, 15, 20)) AS max_temp,
    (float)TRIM(SUBSTRING(row, 27, 31)) AS min_temp;

DESCRIBE records;
DUMP records;
</pre>
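<p>Once the columns are extracted, they can be analyzed like any other relation. The sketch below continues from the <code>records</code> relation above and computes the highest and lowest temperature per station, assuming each station appears on more than one row; the alias and field names are illustrative:</p>
<pre>-- drop rows where either temperature could not be parsed as a float
valid_records = FILTER records BY max_temp IS NOT NULL AND min_temp IS NOT NULL;

-- temperature extremes per weather station
grouped_by_station = GROUP valid_records BY wban;
temp_range = FOREACH grouped_by_station GENERATE
    group AS wban,
    MAX(valid_records.max_temp) AS highest_temp,
    MIN(valid_records.min_temp) AS lowest_temp;

DESCRIBE temp_range;
DUMP temp_range;
</pre>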
  </section>
</article>


    </section>
    <footer></footer>

    
<script type="text/javascript" src="/js/article.js"></script>

  </body>
</html>