No Performance Gain over Base Avro #6
Comments
We're glad you found our package as a way to improve the performance of your code, and we hope it will eventually work in your scenario too. Unfortunately, some parts of your example implementation are unclear, so it is hard to investigate this case without better insight into your real program and schemas (I can't tell which errors are in your code and which were introduced when pasting it). A similar routine (DataFileReader, FastSpecificDatumReader, 1M LargeAndFlat records) gets a 2x boost in my tests, so it is indeed very likely that your code is falling back to the default implementation. Could you carefully verify that you don't have any error logs (especially at the start of processing) that could hint at what is not working? |
I've fixed the typos in the original post, but thought it best to include the complete class for processing byte messages off a JMS queue:
The only difference between using fastserde and pure Avro is replacing the Avro SpecificDatumReader with the fastserde FastSpecificDatumReader, and passing the FastSpecificDatumReader instance in place of the SpecificDatumReader instance created above.
The HfReadData class referenced in the code was generated by Avro using the schema. I can see no errors in my logs during startup nor do I see any exceptions at runtime. |
Do you have to reuse the HfReadData object? For now, we don't support object reuse (as stated in the "Limitations" paragraph of the readme), so I think that may be the issue. You can verify whether a plain "dataFileReader.next()" loop is indeed faster. |
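One way to run that check is to time the decode step on its own, outside the JMS and application code. The following is a minimal sketch using only the Java standard library; `DecodeTimer`, `timeLastRound`, and `messagesPerSecond` are hypothetical names, not part of Avro or avro-fastserde, and the decoder is passed in as a function so that either reader can be wrapped.

```java
import java.util.List;
import java.util.function.Function;

/** Hypothetical helper (not part of Avro or avro-fastserde): times a decoder in isolation. */
final class DecodeTimer {

    /**
     * Runs the decoder over every payload, `rounds` times in total.
     * Earlier rounds act as warm-up; only the last round's elapsed nanoseconds are returned.
     */
    static <T> long timeLastRound(List<byte[]> payloads, Function<byte[], T> decoder, int rounds) {
        if (rounds < 1) {
            throw new IllegalArgumentException("rounds must be >= 1");
        }
        long elapsed = 0;
        long decoded = 0;
        for (int r = 0; r < rounds; r++) {
            long start = System.nanoTime();
            for (byte[] payload : payloads) {
                // Count non-null results so the decoded objects are actually used.
                if (decoder.apply(payload) != null) {
                    decoded++;
                }
            }
            elapsed = System.nanoTime() - start;
        }
        if (decoded != (long) rounds * payloads.size()) {
            throw new IllegalStateException("decoder returned null for at least one payload");
        }
        return elapsed;
    }

    /** Converts a message count and an elapsed time in nanoseconds to messages per second. */
    static double messagesPerSecond(long messages, long elapsedNanos) {
        return messages * 1_000_000_000.0 / elapsedNanos;
    }
}
```

Assuming the raw messages have been collected into a list, calling `timeLastRound` once with a function wrapping the SpecificDatumReader and once with one wrapping the FastSpecificDatumReader gives two directly comparable figures. For scale, `messagesPerSecond(100_000, 300_000_000_000L)` is about 333, which is the 100,000 messages in roughly 5 minutes reported in the original post.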
Hi, We’ve had great success with avro-fastserde in our own project. This is a truly fantastic piece of software. Thank you for that! We have decided to invest in it and add some additional performance enhancements. Among those is support for object re-use. Our fork of this project is available here, if you’d like to take a look: https://github.com/linkedin/avro-util -F |
Hi, Glad to hear that you find this library useful! I briefly looked over your branch, and the enhancements you made are really worth incorporating into this repository. Thank you for adding object reuse. We haven't provided it so far because we didn't really need that feature and I wasn't sure whether it would spoil the overall performance gain. Would you mind opening a pull request? |
Sure, we can look into providing a PR. We have shied away from doing so so far because we were not sure whether our changes were useful and appropriate for you. In particular:
cc @gaojieliu LMK what you think. -F |
Thank you for the detailed answer and your willingness to provide the PR. However, upon further consideration, I decided to prepare the pull request myself, based partially on your changes. To be more specific, I will certainly add support for the reuse parameter, but I won't add support for legacy Avro versions, because we don't need it and I think the majority of potential users don't need it either. I will also consider adding an option to switch caching of generated classes on and off. From our perspective this is a side project, so it may take a while for us to introduce these changes, but they will appear nevertheless. Anyway, thank you for sharing your fork and investing your time in this project! |
Oh, I just saw this by chance. Not sure why I didn't get the notification for it... anyway. BTW, we have a fix for the object re-use which is not yet merged in the other repo, here: linkedin/avro-util#10 That code (including the PR) runs in production and we have validated that object re-use works fully as expected. -F |
I am using Avro to parse messages generated by another vendor (a single fixed schema) and found that the parsing performance is not great. I was looking for ways to improve performance when I came across the avro-fastserde library.
My initial attempt at using the library was not successful: I was able to parse messages, but the performance was identical to the base Avro implementation. I was hoping you might be able to provide some additional insight into what I might be doing wrong, as the documentation does not provide a complete working example.
In my case I have used the Avro schema provided by the vendor to generate a SpecificDatumParser and all supporting classes. The code used to parse the messages looks something like this...
where the parseAvroMessage() method parses the message into various objects before passing them on to the application for processing. Note: the JSON schema is moderately simple, consisting of a single record with an array of 1..n sub-records. The parser method combines sets of sub-records into single objects. This method consumes minimal CPU, as verified by taking Flight Recordings with Java Mission Control.
Here is the FastSerde implementation...
As noted above, I get the exact same throughput regardless of whether I use FastSerde or base Avro. On my test setup it takes about 5 minutes to process 100,000 messages, and flight recordings from each test run show slight differences but are otherwise more or less the same.
You note that the FastSpecificDatumReader will fall back to the SpecificDatumReader if the specific classes are not available. I suspect this is what is happening in my case (hence the identical performance). I feel like I have done something wrong, but I am not sure what.