Identify overhead in Haivvreo performance versus unencoded data #16

Open
dkarvounis opened this issue Apr 4, 2012 · 3 comments

@dkarvounis

I find that queries on Avro data in Hive consistently take 3 to 4 times longer than on the same data in CSV format. As I understand it, Haivvreo/Avro should be faster.

I have ensured that:
- The number of mappers/reducers is the same in both cases.
- The performance difference persists whether the Avro data is compressed (Deflate) or uncompressed.
- The performance difference persists whether the Avro data consists of many small files or one large file.

Do benchmarks exist comparing queries using Haivvreo on Avro data versus Hive queries on unencoded data? If Haivvreo takes longer, the cause of the overhead should be identified. Should Avro be faster?

@jghoman
Owner

jghoman commented Apr 4, 2012

3x-4x is not great, but not terribly surprising. The benchmarks we did were informal and along the lines of 'is this fast enough for our purposes', and it was. That being said, there's lots of room for improvement but very little time to get it done... I'm already quite indebted in terms of promises to get things done on the code.

@dkarvounis
Author

I've identified one speedup particular to my situation. My tables consist of data from multiple dumps, and the schema of each incoming Avro file carries a string identifying that dump in its "doc" field. Because this doc field varies, the reader and writer schemata compared as unequal most of the time in AvroDeserializer.deserialize(), which forced re-encoding even when the schemata were otherwise equivalent. I tried a more relaxed schema equality check that ignores doc, and it sped up queries by ~30% on data with otherwise equivalent schemata. I understand that what one expects of .equals() depends on one's particular situation, and that you'd probably want to use the .equals() provided by the API by default, but I thought I'd make this scenario known.
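
For readers who want to see the idea concretely, here is a minimal sketch of a doc-insensitive schema comparison. It is not the change made inside Haivvreo (which is Java and is not shown in this thread); it works on the JSON text of two Avro schemas and uses only the Python standard library. The function names `strip_doc` and `schemas_equal_ignoring_doc`, the `Event` record, and the dump identifiers are all hypothetical, chosen for illustration.

```python
import json


def strip_doc(node):
    """Return a copy of a parsed Avro schema with "doc" attributes removed.

    Values under "default" are copied untouched, since a default for a
    map or record field may legitimately contain a key named "doc".
    """
    if isinstance(node, dict):
        cleaned = {}
        for key, value in node.items():
            if key == "doc":
                continue  # drop documentation strings only
            cleaned[key] = value if key == "default" else strip_doc(value)
        return cleaned
    if isinstance(node, list):
        return [strip_doc(item) for item in node]
    return node  # primitive type names, numbers, booleans, None


def schemas_equal_ignoring_doc(writer_json, reader_json):
    """Compare two Avro schema JSON strings, ignoring "doc" attributes."""
    return strip_doc(json.loads(writer_json)) == strip_doc(json.loads(reader_json))


# Hypothetical schemas from two dumps that differ only in their doc strings.
writer = (
    '{"type": "record", "name": "Event", "doc": "dump-A",'
    ' "fields": [{"name": "id", "type": "long", "doc": "row id"}]}'
)
reader = (
    '{"type": "record", "name": "Event", "doc": "dump-B",'
    ' "fields": [{"name": "id", "type": "long"}]}'
)
print(schemas_equal_ignoring_doc(writer, reader))
```

When a check like this reports the schemata as equal, the deserializer could in principle skip the re-encoding step described above. Whether that is safe depends on nothing else in the pipeline relying on the doc strings.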

@jghoman
Owner

jghoman commented Apr 11, 2012

That's a good optimization. With Hive 8 I added a hook to allow external serdes to populate the comments field, so we'll actually be able to see the Avro comments from Hive. I need to update Haivvreo to use this new API.
