Faster JSON parser #112
Wow, that's a big difference. I like how your first two benchmarks isolate downloading vs parsing. If it helps, here are the results from my home desktop; I have fiber internet with a ~5 year old processor. I have the same 7.2x jump for parsing as you. Unit: microseconds
expr min lq mean median uq max neval cld
statusquo 170548.243 178850.316 182609.5076 182405.6315 186942.899 193960.856 30 c
dl 168463.594 176699.861 181490.2084 181860.3195 185492.156 195547.737 30 c
jsonlite 2786.174 3237.758 4953.9265 5831.6465 5977.747 6232.820 30 b
RcppJson 349.556 478.450 684.4094 740.9170 895.923 1018.478 30 a
RcppJson_file 294.919 548.481 635.1559 642.5110 771.255 884.151 30 a
RcppJson_id 256.885 531.879 553.2711 558.8955 597.987 795.705 30 a

Even though a 7-10x jump is nice, I'm not sure it will be noticed by the user. The real bottleneck is downloading the file (i.e., 0.3 sec for you and 0.18 sec for me); the parsing duration is a small fraction of the downloading duration.

This is probably more trouble than it's worth, but I'm thinking aloud: are the two packages essentially interchangeable? I mean, do they accept the same parameter (i.e., a url) and spit out a nested list with the exact same structure? If so, could the dataverse package use RcppSimdJson when it's available? (RcppSimdJson could be a Suggests; this approach is explained well in the "Guarding the use of a suggested package" section of the 2nd edition of R Packages.)

I was a little concerned that RcppSimdJson is not easily deployed, but the library has had only three dependencies in the past two years. I see the current minimum requirements of jsonlite are almost nothing (R won't even work without the methods package). The suggested dependencies are almost identical to dataverse's; the sf package is the only real addition.
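The "guarding the use of a suggested package" approach mentioned above can be sketched roughly as follows. This is illustrative only: `parse_metadata()` is a hypothetical helper, not an existing dataverse function, and the fallback to jsonlite is an assumption about how the package would be wired up.

```r
# Sketch of the "guarded Suggests" pattern from R Packages (2e):
# use RcppSimdJson when it is installed, fall back to jsonlite otherwise.
# parse_metadata() is a hypothetical helper, not part of dataverse.
parse_metadata <- function(json) {
  if (requireNamespace("RcppSimdJson", quietly = TRUE)) {
    RcppSimdJson::fparse(json)
  } else {
    jsonlite::fromJSON(json)
  }
}
```

Because the check happens at call time, users without RcppSimdJson lose nothing; those who have it installed get the faster parser transparently.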
Overall, I think we can start with the parallel Suggests, but given that download, as opposed to parsing, is the real bottleneck, it is not a high priority.

Re: interchangeability: they are not identical (their …).

Re: dependencies: I thought we want to depend on packages which in turn have fewer dependencies themselves? Right that jsonlite has no real dependencies. RcppSimdJson is also minimal, but it does rely on Rcpp.

Re: the bottleneck: yes, maybe we tackle the download first.
I think yyjsonr is actually faster these days (although it could be challenging to get identical results due to how data.frame etc. are simplified).
A very cheap way to avoid repeatedly paying the download cost is to 'memoise' the download function and then use the memoised version in its place. Invoking the memoised function performs the download on the first call and returns the cached value on subsequent calls. This works well when the URL always returns the same value, as I think it does for (most?) calls in the client; it would not work well for, e.g., live stock quote updates, where the memoised value would always be the first value.

For code that looks like dataverse-client-r/R/get_file_metadata.R, lines 51 to 53 in 8fe5184, it would make sense to refactor this so the code chunk is itself a memoised function: checking for status and extracting the content is a one-time operation for each URL. The type, location, and properties (maximum size, expiration time) of the cache are controlled by the cache argument to memoise(). If this seems interesting I could prepare a pull request.
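The memoisation idea above can be illustrated with a hand-rolled, base-R version of what memoise::memoise() provides (the real package also handles argument hashing and pluggable cachem caches); `slow_fetch` is a stand-in for the actual download step, not dataverse code:

```r
# Hand-rolled sketch of memoisation: cache results per key in a local
# environment so the expensive step runs only once per URL.
memoise_simple <- function(fn) {
  cache <- new.env(parent = emptyenv())
  function(key) {
    if (!exists(key, envir = cache)) {
      assign(key, fn(key), envir = cache)
    }
    get(key, envir = cache)
  }
}

calls <- 0L
slow_fetch <- function(url) {        # stand-in for the real download
  calls <<- calls + 1L
  paste("content of", url)
}
fetch <- memoise_simple(slow_fetch)

fetch("https://example.org/api")     # performs the "download"
fetch("https://example.org/api")     # served from cache; calls is still 1
```

The same caveat from the comment applies: this is only safe when the URL's response is stable for the session.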
@mtmorgan awesome that you're willing to dive into the code. As you may have seen, we are looking for a new maintainer: #21 (comment)
Does yyjsonr depend on the caching/memoise? Or are those two things independent? @beniaminogreen had started trying to allow caching in #135. Other than that note, a pull request sounds great.
yyjson would be a jsonlite replacement, independent of memoize; I agree with the above that parsing JSON per se is probably not that limiting in terms of performance. Thanks for pointing to issue 135; probably I would suggest generalizing caching by working at a lower level (the GET request), so will explore that in light of 135. |
For dataset retrieval, we download and parse JSON metadata multiple times. For example, in get_dataframe_by_name, get_fileid.character would first find the dataset id via dataverse-client-r/R/utils.R, line 29 in 4775a92, and then fetch the list of ids for each file in the dataset at dataverse-client-r/R/get_dataset.R, line 101 in 4775a92.

It turns out the time this takes is non-trivial. Most of the time is taken up by loading the JSON from the URL; a small remaining fraction (< 1%) is due to parsing the JSON file. We could make a minor improvement in speed by switching to a faster parser, RcppSimdJson (https://github.com/eddelbuettel/rcppsimdjson), which is about 2-10x faster in my tests, per below. The current jsonlite::fromJSON seems to be optimal for data science pipelines where we deal with data, but here we are only interested in bits of metadata. An even faster fix is to download the metadata only once. Switching packages will require changes in at least 20 places where jsonlite is used.
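As a rough illustration of the kind of comparison behind these numbers (not the original reprex), the two parsers could be timed on a small stand-in metadata string like the one below; the JSON content is invented for the example, and the timing only runs when both packages happen to be installed:

```r
# Hypothetical parser comparison on a small stand-in metadata string;
# skipped entirely if either parser is missing.
json <- '{"status":"OK","data":{"id":1,"files":[{"label":"a.tab"}]}}'
if (requireNamespace("jsonlite", quietly = TRUE) &&
    requireNamespace("RcppSimdJson", quietly = TRUE)) {
  t_jsonlite <- system.time(for (i in 1:1000) jsonlite::fromJSON(json))["elapsed"]
  t_simd     <- system.time(for (i in 1:1000) RcppSimdJson::fparse(json))["elapsed"]
  cat("jsonlite:", t_jsonlite, "s  RcppSimdJson:", t_simd, "s\n")
}
```

For real measurements the microbenchmark package, as used in the results above, gives more reliable distributions than a single system.time() loop.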
Created on 2022-01-05 by the reprex package (v2.0.1)