Identify overhead in Haivvreo performance versus unencoded data #16

dkarvounis · 2012-04-04T16:20:09Z

I find that queries on Avro data in Hive consistently take 3 to 4 times longer than on the same data in CSV format. As I understand, Haivvreo/Avro should be faster.

I have ensured that:
- The number of mappers/reducers is the same in both cases.
- The performance difference persists whether the Avro data is compressed (Deflate) or uncompressed.
- The performance difference persists whether the Avro data consists of many small files or one large file.

Do benchmarks exist comparing queries using Haivvreo on Avro data versus Hive queries on unencoded data? If Haivvreo takes longer, the cause of the overhead should be identified. Should Avro be faster?

jghoman · 2012-04-04T16:59:42Z

3x-4x is not great, but not terribly surprising. The benchmarks we did were informal and along the lines of 'is this fast enough for our purposes', and it was. That being said, there's lots of room for improvement but very little time to get it done... I'm already quite indebted in terms of promises to get things done on the code.

dkarvounis · 2012-04-11T15:51:38Z

I've identified one speedup particular to my situation. My tables consist of data from multiple dumps, where the schema of each incoming Avro file has a string identifying that dump in the "doc" field. The variation in this doc field caused the reader and writer schemata to evaluate to unequal most of the time in AvroDeserializer.deserialize(), and so reencoding was forced even when schemata were otherwise equivalent. I tried a more relaxed schema equality method (ignoring doc), and this sped up queries ~30% on data with otherwise equivalent schemata. I understand that one's expectations of .equals() come down to one's particular situation and you'd probably want to use the .equals() provided by the API by default, but I thought I'd make this scenario known.
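To make the idea concrete, here is a minimal sketch of a doc-insensitive schema comparison. It is not the change made to AvroDeserializer: it is written in Python rather than Java, it works on the schema JSON text directly instead of Avro's Schema objects, and the names `strip_doc` and `equal_ignoring_doc` as well as the sample schemas are made up for illustration.

```python
import json


def strip_doc(node):
    """Return a copy of a parsed Avro schema with every "doc" attribute removed."""
    if isinstance(node, dict):
        return {
            # "default" values are data, not schema, so they are copied as-is
            # (a default map could legitimately contain a key called "doc").
            key: (value if key == "default" else strip_doc(value))
            for key, value in node.items()
            if key != "doc"
        }
    if isinstance(node, list):
        # Covers field lists and unions; element order is preserved.
        return [strip_doc(item) for item in node]
    return node


def equal_ignoring_doc(schema_json_a, schema_json_b):
    """Compare two schema JSON strings, ignoring differences in "doc" only."""
    return strip_doc(json.loads(schema_json_a)) == strip_doc(json.loads(schema_json_b))


# Hypothetical schemas: same structure, different dump identifier in "doc".
dump_a = '{"type": "record", "name": "Event", "doc": "dump A", "fields": [{"name": "id", "type": "long"}]}'
dump_b = '{"type": "record", "name": "Event", "doc": "dump B", "fields": [{"name": "id", "type": "long"}]}'
# Same as dump_a except that the field type differs.
dump_c = '{"type": "record", "name": "Event", "doc": "dump A", "fields": [{"name": "id", "type": "string"}]}'

print(equal_ignoring_doc(dump_a, dump_b))  # True
print(equal_ignoring_doc(dump_a, dump_c))  # False
```

Field order still matters in this comparison, since fields are compared as lists, so two schemas that really need re-encoding are still reported as different.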

jghoman · 2012-04-11T18:20:19Z

That's a good optimization. With Hive 8 I added a hook to allow external SerDes to populate the comments field, so we'll actually be able to see the Avro comments from Hive. I need to update Haivvreo to use this new API.
