Apache Pig Integration
Pig integration may be divided into two parts: a StoreFunc as a means to generate Phoenix-encoded data through Pig, and a Loader which enables Phoenix-encoded data to be read by Pig.
##Pig StoreFunc
The StoreFunc allows users to write data in Phoenix-encoded format to HBase tables using Pig scripts. This is a nice way to bulk upload data from a MapReduce job in parallel to a Phoenix table in HBase. All you need to specify is the endpoint address, HBase table name and a batch size. For example:
    A = load 'testdata' as (a:chararray, b:chararray, c:chararray, d:chararray, e:datetime);
    STORE A into 'hbase://CORE.ENTITY_HISTORY' using
        org.apache.phoenix.pig.PhoenixHBaseStorage('localhost','-batchSize 5000');
The above reads the file 'testdata' and writes its rows to the table “CORE.ENTITY_HISTORY” in an HBase instance running on localhost. The first argument to this StoreFunc is the server, and the second argument is the batch size for upserts via Phoenix. The batch size is limited by how many rows you can hold in memory. A good default is 1000 rows, but if your rows are wide, you may want to decrease it.
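To make the link between batch size and memory concrete, here is a minimal Java sketch of buffer-and-flush batching. It is not the PhoenixHBaseStorage implementation; `BatchingSketch`, `Sink`, and `writeInBatches` are hypothetical names used only for illustration, and rows are plain strings. The point is that at most `batchSize` rows sit in the buffer at any time, which is why wide rows call for a smaller batch.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchingSketch {

    /** Hypothetical stand-in for whatever commits one batch of upserts. */
    interface Sink {
        void commit(List<String> rows);
    }

    /**
     * Buffers rows and hands them to the sink every batchSize rows.
     * Returns the number of commits issued.
     */
    static int writeInBatches(Iterable<String> rows, int batchSize, Sink sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<String> buffer = new ArrayList<>(batchSize);
        int commits = 0;
        for (String row : rows) {
            buffer.add(row);
            if (buffer.size() == batchSize) {
                sink.commit(new ArrayList<>(buffer)); // never more than batchSize rows in memory
                buffer.clear();
                commits++;
            }
        }
        if (!buffer.isEmpty()) {
            sink.commit(new ArrayList<>(buffer)); // flush the final partial batch
            commits++;
        }
        return commits;
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            rows.add("row-" + i);
        }
        int commits = writeInBatches(rows, 1000, batch -> { /* discard in this sketch */ });
        System.out.println("commits: " + commits);
    }
}
```

With 2500 rows and a batch size of 1000, the sketch issues two full batches and one partial batch.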
Note that Pig types must be in sync with the target Phoenix data types. This StoreFunc makes a best effort to cast values based on the input Pig types and the target Phoenix data types, but it is recommended to provide an appropriate schema.
###Gotchas
It is advised that the upsert operation be idempotent. That is, re-upserting the same data should not cause any inconsistencies. This matters when a Pig job fails while writing to a Phoenix table. There is no notion of rollback (due to the lack of transactions in HBase), so retrying the upsert with PhoenixHBaseStorage must result in the same data in the HBase table.
For example, assume we are writing records n1 through n10 to HBase. If the job fails in the middle of this process, we are left in an inconsistent state where n1 through n7 made it to the Phoenix table but n8 through n10 did not. If we retry the same operation, n1 through n7 are re-upserted and n8 through n10 are upserted for the first time. Because the upsert is idempotent, the table ends up with the same rows a single successful run would have produced.
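The retry scenario can be sketched in plain Java. This is only a simulation under stated assumptions: a `TreeMap` stands in for the Phoenix table, `upsert` and `writeAll` are hypothetical helpers, and the "failure" is simply an early return after a chosen number of rows. It shows that when each write is keyed by the row key and overwrites any existing value, rerunning the whole job after a partial write converges to the intended state.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class IdempotentUpsertSketch {

    /** Upsert keyed by row key: insert if absent, overwrite if present. */
    static void upsert(Map<String, String> table, String rowKey, String value) {
        table.put(rowKey, value);
    }

    /**
     * Writes records in order and returns how many were written.
     * Stops after failAfter rows to simulate a job dying mid-write;
     * pass a negative failAfter to write everything.
     */
    static int writeAll(Map<String, String> table, List<String[]> records, int failAfter) {
        int written = 0;
        for (String[] record : records) {
            if (failAfter >= 0 && written == failAfter) {
                return written; // simulated failure: earlier rows are not rolled back
            }
            upsert(table, record[0], record[1]);
            written++;
        }
        return written;
    }

    /** Builds the records n1 through n10 from the example in the text. */
    static List<String[]> sampleRecords() {
        List<String[]> records = new ArrayList<>();
        for (int i = 1; i <= 10; i++) {
            records.add(new String[] {"n" + i, "value-" + i});
        }
        return records;
    }

    public static void main(String[] args) {
        Map<String, String> table = new TreeMap<>();
        List<String[]> records = sampleRecords();

        writeAll(table, records, 7); // n1..n7 land, n8..n10 are missing
        System.out.println("after failed run: " + table.size() + " rows");

        writeAll(table, records, -1); // retry the whole job
        System.out.println("after retry: " + table.size() + " rows");
    }
}
```

The same reasoning would not hold for a write that is not keyed this way, for instance one that appends or increments on every run, which is why idempotence is the property to check before relying on retries.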
##Pig Loader
A Pig data loader is not yet implemented, but there is work in progress tracked by this JIRA.