I find that queries on Avro data in Hive consistently take 3 to 4 times longer than on the same data in CSV format. As I understand it, Haivvreo/Avro should be faster.
I have ensured that:
- The number of mappers/reducers is the same in both cases.
- The performance difference persists whether the Avro data is compressed (Deflate) or uncompressed.
- The performance difference persists whether the Avro data consists of many small files or one large file.
Do benchmarks exist comparing queries using Haivvreo on Avro data versus Hive queries on unencoded data? If Haivvreo takes longer, the cause of the overhead should be identified. Should Avro be faster?
3x-4x is not great, but not terribly surprising. The benchmarks we did were informal and along the lines of 'is this fast enough for our purposes', and it was. That being said, there's lots of room for improvement but very little time to get it done... I'm already quite indebted in terms of promises to get things done on the code.
I've identified one speedup particular to my situation. My tables consist of data from multiple dumps, where the schema of each incoming Avro file has a string identifying that dump in the "doc" field. The variation in this doc field caused the reader and writer schemata to evaluate to unequal most of the time in AvroDeserializer.deserialize(), and so reencoding was forced even when schemata were otherwise equivalent. I tried a more relaxed schema equality method (ignoring doc), and this sped up queries ~30% on data with otherwise equivalent schemata. I understand that one's expectations of .equals() come down to one's particular situation and you'd probably want to use the .equals() provided by the API by default, but I thought I'd make this scenario known.
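To make the relaxed comparison concrete, here is a minimal sketch of the idea, not the actual patch. It works on the JSON form of the schemas rather than on Avro's Java `Schema` objects, and the names `strip_doc` and `schemas_equal_ignoring_doc` are hypothetical. It assumes that "equivalent" means identical once every "doc" attribute is dropped; values under "default" are left alone, since a default is data and may legitimately contain a "doc" key.

```python
import json


def strip_doc(node):
    """Return a copy of a parsed Avro schema with every "doc" attribute removed.

    Values under "default" are copied verbatim because they are field data,
    not schema attributes.
    """
    if isinstance(node, dict):
        return {
            key: (value if key == "default" else strip_doc(value))
            for key, value in node.items()
            if key != "doc"
        }
    if isinstance(node, list):
        return [strip_doc(item) for item in node]
    return node


def schemas_equal_ignoring_doc(schema_json_a, schema_json_b):
    """Compare two Avro schemas (JSON strings), ignoring "doc" attributes."""
    return strip_doc(json.loads(schema_json_a)) == strip_doc(json.loads(schema_json_b))


def make_schema(doc, id_type):
    # Hypothetical record schema; only the doc string and one field type vary.
    return json.dumps({
        "type": "record",
        "name": "Event",
        "doc": doc,
        "fields": [
            {"name": "id", "type": id_type},
            {"name": "payload", "type": "string", "doc": "raw payload"},
        ],
    })


dump_a = make_schema("dump 2011-01", "long")
dump_b = make_schema("dump 2011-02", "long")   # differs only in doc
dump_c = make_schema("dump 2011-01", "string")  # differs in a field type

print(schemas_equal_ignoring_doc(dump_a, dump_b))
print(schemas_equal_ignoring_doc(dump_a, dump_c))
```

In the deserializer, a check like this would decide whether the reader and writer schemata can be treated as the same, so re-encoding is skipped only when nothing but documentation differs. Caching the result per pair of schemata would avoid repeating the comparison for every record.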
That's a good optimization. With Hive 8 I added a hook to allow external serdes to populate the comments field, so we'll actually be able to see the Avro comments from Hive. I need to update Haivvreo to use this new API.