hive-udaf-maxrow is a simple user-defined aggregate function (UDAF) for Hive.
The maxrow()
aggregate function is similar to the built-in max()
function,
but it allows you to refer to additional columns in the maximal row.
For example, given the following data in a Hive table:
id | ts | somedata |
---|---|---|
1 | 2 | data-1,2 |
1 | 3 | data-1,3 |
1 | 4 | data-1,4 |
2 | 5 | data-2,5 |
2 | 3 | data-2,3 |
2 | 4 | data-2,4 |
3 | 6 | data-3,6 |
3 | 1 | data-3,1 |
3 | 4 | data-3,4 |
You can query this table using the maxrow()
function:
hive> ADD JAR hive-udaf-maxrow.jar;
hive> CREATE TEMPORARY FUNCTION maxrow AS 'com.scribd.hive.udaf.GenericUDAFMaxRow';
hive> SELECT id, maxrow(ts, somedata) FROM sometable GROUP BY id;
id | maxrow |
---|---|
1 | {"col0":4,"col1":"data-1,4"} |
2 | {"col0":5,"col1":"data-2,5"} |
3 | {"col0":6,"col1":"data-3,6"} |
While maxrow()
looks only at its first parameter ("ts
" in this case) to compute
the maximum value, it carries along any additional values ("somedata
" in this
case).
Since maxrow()
returns a "struct" value (see below), you can parse the result
with Hive's built-in "dot" notation. For example:
hive> SELECT id, m.col0 as ts, m.col1 as somedata FROM (
SELECT id, maxrow(ts, somedata) as m FROM sometable GROUP BY id
) s;
id | ts | somedata |
---|---|---|
1 | 4 | data-1,4 |
2 | 5 | data-2,5 |
3 | 6 | data-3,6 |
As can be seen from the example above, there are a couple of limitations due to how Hive UDAFs work:
- A UDAF can only output a value for a single column. Therefore,
maxrow()
returns a complex-valued "struct" object. - Hive does not provide the UDAF with the name of the columns that are being
passed as input to the UDAF. Therefore,
maxrow()
generates simple names such as "col0", "col1", etc.
To build hive-udaf-maxrow, you need to specify the location of your Hadoop and
Hive jar files using the HADOOP_HOME
and HIVE_HOME
environment variables.
The build classpath will include all of the jar files in these directories and
and their lib/
subdirectories. For example:
> HADOOP_HOME=/path/to/hadoop HIVE_HOME=/path/to/hive ant
A successful build will create the dist/hive-udaf-maxrow.jar
file. You can
add this jar file to your Hive session using the ADD JAR
command shown above.