# [discussion] (de)serialize data in "C"? #60
Hi - I've been meaning to reply to this, but am a little confused about what is missing. When we've been storing data from R into Redis, we typically either store a string that will come back as a string, or serialised binary data (usually in rds format). Are you imagining/requiring some storage format where the internals of the object are available within redis? This should be fairly "easy" to do - the relevant code is here: https://github.com/richfitz/redux/blob/master/src/conversions.c#L187-L190 - so it would just be a case of adding something to change serialisation there, and again on the reverse. If you are able to cook up a proof-of-concept that shows the problem clearly, that would make a good starting point.
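The rds round-trip described above can be sketched without Redis at all, since it is just `serialize()`/`unserialize()` on a raw vector (a minimal sketch; redux's exported `object_to_bin()`/`bin_to_object()` helpers wrap the same calls, and the raw vector is what would be `SET` into Redis):

```r
# Serialise an R object to a raw vector (rds format) -- this is the payload
# that would be stored in Redis and come back byte-for-byte identical.
bin <- serialize(mtcars, connection = NULL)
head(bin)               # the rds header bytes
out <- unserialize(bin) # no string pool involved at any point
stopifnot(identical(out, mtcars))
```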
It might be under-informed, tbh, or a knee-jerk based on other things I'm doing. I believe that data is typically stored in redis/valkey as json. If that data is passed over-the-wire to R as a string, that's one thing. But it's another if that string is materialised in R-space as a string and only then handed to a JSON parser.

Perhaps this is just me needing feel-good confirmation, then: when serialized data comes back from redis and is deserialized, we do not pay the price of the string pool - is that correct? (I admit that I might have been up late and tired when I first drafted this issue; I should have come back to revisit or edit it.)
Right - I think I'm starting to understand. How much of an issue this is depends a lot on how one is using Redis/redux.

If you're in control of the whole pipeline and you are using R for everything at both ends, then you can serialise to binary and put that into redis directly. If you do that, there is no cost to pay for dealing with strings and nothing involves json at all. This is our typical use (see for example rrq).

If you are using redux to interface with an application where someone else is serialising data into json, and you want to deserialise that data on egress into R, then you might benefit from deserialising into R within the C code. Unfortunately, doing that involves all of the usual headaches of deserialising JSON to R (though that's thankfully less terrible than going the other way). If a small subset is handled (e.g. a data.frame stored by row or by column) that's theoretically fine, but you can imagine that this path leads to rewriting all of jsonlite by hand. It does not offer a C API, I believe.

If you want to work up a proof of concept and find out what the headline speed gain could possibly be, that would be great. Alternatively, it's possible that something in Redis' json API might allow you to slice and dice your data before sending it over the wire, which would feel like the ideal solution, potentially.
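As a pointer on that last idea, server-side slicing can be sketched with the RedisJSON module's path syntax (a hedged sketch: it assumes the server has the RedisJSON module loaded, the key name `doc` is made up, and redux just passes the command list through verbatim):

```r
r <- redux::hiredis()
# Store a JSON document server-side, then fetch only one field of it,
# so only that slice crosses the wire as a (much smaller) string.
r$command(list("JSON.SET", "doc", "$", '{"mpg": [21.0, 22.8], "cyl": [6, 4]}'))
r$command(list("JSON.GET", "doc", "$.mpg"))
```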
You hit it on the head: sharing the data with non-R clients. I wasn't aware of the JSON API within redis, though frankly that does not really help much, since our use is to grab "the whole thing" (data partitioning will be handled before storage). Even if we do subset in-redis, we'd still pay a price, albeit a slightly smaller one. Yes, I was thinking of something akin to deserialising within the C code. (This is further challenged by the difference that R deserialization is rather straightforward with no "options"; we cannot say the same for json objects.)
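The "options" point can be made concrete: the same JSON bytes can legitimately deserialize to different R shapes depending on parser settings, so any C-level deserializer would have to pick one behaviour (a sketch using jsonlite, which is only one of several JSON packages):

```r
library(jsonlite)
j <- '[{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]'
str(fromJSON(j))                             # simplified to a 2-row data.frame
str(fromJSON(j, simplifyVector = FALSE))     # a list of two named lists
# By contrast, unserialize() has exactly one answer for a given rds payload.
```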
I've done some more testing; thank you for your patience in this discussion. The primary point of my issue is that I want to transfer large-ish data from a redis topic of some sort (hash, time-series, whatever) into an R-friendly object, as efficiently as possible. The clear problem child here is R's string pool, where retrieving a large string incurs a larger cost than in most languages. However, binary objects and connections seem to work just fine. For instance:

### msgpack

```r
obj <- RcppMsgPack::msgpack_pack(mtcars[1:3,])
R$SET("quux42", obj)
# [Redis: OK]
R$GET("quux42") |>
  RcppMsgPack::msgpack_unpack() |>
  # klunky, fragile, only works because we know the data is rectangular
  with(setNames(as.data.frame(lapply(value, unlist)), unlist(key)))
#    mpg cyl disp  hp drat    wt  qsec vs am gear carb
# 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
# 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
# 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
```

But with json-encoded data, whether stored as string or raw, it is always returned as a string:

```r
obj <- jsonlite::toJSON(mtcars[4:6,])
R$SET("quux42", obj)
# [Redis: OK]
R$GET("quux42")
# [1] "[{\"mpg\":21.4,\"cyl\":6,\"disp\":258,\"hp\":110,\"drat\":3.08,\"wt\":3.215,\"qsec\":19.44,\"vs\":1,\"am\":0,\"gear\":3,\"carb\":1,\"_row\":\"Hornet 4 Drive\"},{\"mpg\":18.7,\"cyl\":8,\"disp\":360,\"hp\":175,\"drat\":3.15,\"wt\":3.44,\"qsec\":17.02,\"vs\":0,\"am\":0,\"gear\":3,\"carb\":2,\"_row\":\"Hornet Sportabout\"},{\"mpg\":18.1,\"cyl\":6,\"disp\":225,\"hp\":105,\"drat\":2.76,\"wt\":3.46,\"qsec\":20.22,\"vs\":1,\"am\":0,\"gear\":3,\"carb\":1,\"_row\":\"Valiant\"}]"

obj <- charToRaw(jsonlite::toJSON(mtcars[7:9,]))
head(obj)
# [1] 5b 7b 22 6d 70 67
R$SET("quux42", obj)
# [Redis: OK]
R$GET("quux42")
# [1] "[{\"mpg\":14.3,\"cyl\":8,\"disp\":360,\"hp\":245,\"drat\":3.21,\"wt\":3.57,\"qsec\":15.84,\"vs\":0,\"am\":0,\"gear\":3,\"carb\":4,\"_row\":\"Duster 360\"},{\"mpg\":24.4,\"cyl\":4,\"disp\":146.7,\"hp\":62,\"drat\":3.69,\"wt\":3.19,\"qsec\":20,\"vs\":1,\"am\":0,\"gear\":4,\"carb\":2,\"_row\":\"Merc 240D\"},{\"mpg\":22.8,\"cyl\":4,\"disp\":140.8,\"hp\":95,\"drat\":3.92,\"wt\":3.15,\"qsec\":22.9,\"vs\":1,\"am\":0,\"gear\":4,\"carb\":2,\"_row\":\"Merc 230\"}]"
```

Is there a way to force the reply to come back as raw?
Not at present - here's the heuristic: https://github.com/richfitz/redux/blob/master/src/conversions.c#L228-L255. This gets called from a bunch of places as we build a list of redis replies (not everything is as simple as a single reply). At this point it's an interface issue, and one I don't have a strong idea about. However, it does seem much simpler than trying to do the deserialisation in C!
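For readers without the C source to hand, one plausible shape for such a string-vs-raw decision is sketched below (illustrative only, not redux's actual heuristic; `reply_to_r` is a made-up name):

```r
# Illustrative only: hand the caller a string when the bytes can safely live
# in an R character vector, and fall back to raw otherwise (embedded NUL
# bytes, for instance, cannot appear in an R string).
reply_to_r <- function(bytes) {
  if (any(bytes == as.raw(0))) bytes else rawToChar(bytes)
}

reply_to_r(charToRaw("hello"))           # "hello" (a character scalar)
reply_to_r(as.raw(c(0x01, 0x00, 0x02)))  # stays raw
```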
I was thinking of something deliberate, such as `R$GET("quux42", as = "raw")` or `R$GET("quux42", asraw = TRUE)`. The former allows for future expansion; the latter is single-purpose-simple. I admit that I don't know offhand what other redis verbs could benefit from this; I'm sure it's "some" at least.
Can you have a look at #61, which adds control over conversion at the lower-level interface? You should be able to see the real-world performance impact of this in your application, and the description shows how to use it.
Okay, three (hopefully useful) points on your branch.

First, defaulting to […].

Second, the higher-level `R$command` does not accept the new conversion argument:

```r
R <- redux::hiredis(port=16379) # as before
R$command(list("GET", "key"), "raw")
# Error in R$command(list("GET", "key"), "raw") : unused argument ("raw")
R$command
# function (cmd)
# {
#     redis_command(ptr, cmd)
# }
# <bytecode: 0x6435764515b8>
# <environment: 0x6435870ee9e8>
```

I'm able to get at it directly by using the internals:

```r
R2ptr <- redux:::redis_connect_tcp("localhost", 16379) # is there a better way to get at this from R above?
redux:::redis_command(R2ptr, list("GET", "key"), "raw") # works as documented
```

Third, some quick benchmarks on (for me) representative data: frames that vary from 500-1400 rows with 74 columns (11 string, 1 POSIXt, the remainder int/float).
Functions:

```r
R2 <- redux::hiredis(port=16379)
R2ptr <- redux:::redis_connect_tcp("localhost", 16379)

r2frame <- function(key) {
  obj <- R2$GET(key)
  redux::bin_to_object(obj)
}
json2frame <- function(key) {
  obj <- R2$GET(key)
  jsonlite::fromJSON(obj)
}
msgpack2frame <- function(key) {
  obj <- R2$GET(key)
  RcppMsgPack::msgpack_unpack(obj, simplify=TRUE) |>
    lapply(function(z) if (is.list(z) && length(z) > 0 && all(sapply(z, is.null))) rep(NA, length(z)) else z) |>
    as.data.frame()
}
r2frame_raw <- function(key) {
  obj <- redux:::redis_command(R2ptr, list("GET", key), "raw")
  redux::bin_to_object(obj)
}
json2frame_raw <- function(key) {
  obj <- redux:::redis_command(R2ptr, list("GET", key), "raw")
  jsonlite::fromJSON(rawConnection(obj))
}
msgpack2frame_raw <- function(key) {
  obj <- redux:::redis_command(R2ptr, list("GET", key), "raw")
  RcppMsgPack::msgpack_unpack(obj, simplify=TRUE) |>
    # if all(is.na(z)) is true for any column z, then msgpack returns `list(NULL, NULL, ...)`;
    # for all other columns including mixed-NA, it returns vectors;
    # this (hasty) code fixes that
    lapply(function(z) if (is.list(z) && length(z) > 0 && all(sapply(z, is.null))) rep(NA, length(z)) else z) |>
    as.data.frame()
}
```
```r
bench::mark(
  r = r2frame("r/22/14"),
  r_raw = r2frame_raw("r/22/14"),
  json = json2frame("json/22/14"),
  json_raw = json2frame_raw("json/22/14"),
  msgpack = msgpack2frame("msgpack/22/14"),
  msgpack_raw = msgpack2frame_raw("msgpack/22/14"),
  check = FALSE, min_iterations = 10
)
```

### redux-1.1.4
```
# A tibble: 3 × 13
  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory                  time             gc
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>                  <list>           <list>
1 r          895.53µs   1.02ms      966.    1.62MB     0      484     0      501ms <NULL> <Rprofmem [79 × 3]>     <bench_tm [484]> <tibble [484 × 3]>
2 json         87.2ms  92.95ms      10.8    5.11MB     0       10     0      925ms <NULL> <Rprofmem [3,002 × 3]>  <bench_tm [10]>  <tibble [10 × 3]>
3 msgpack      4.23ms   4.63ms     211.     1.66MB     6.60    96     3      454ms <NULL> <Rprofmem [127 × 3]>    <bench_tm [99]>  <tibble [99 × 3]>
```
### redux-1.1.5 (gh-60 branch)
```
# A tibble: 6 × 13
  expression       min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory                  time             gc
  <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>                  <list>           <list>
1 r           914.29µs   1.01ms      973.    1.62MB     0      487     0      501ms <NULL> <Rprofmem [79 × 3]>     <bench_tm [487]> <tibble [487 × 3]>
2 r_raw        868.6µs    1.4ms      747.    1.62MB     0      374     0      501ms <NULL> <Rprofmem [79 × 3]>     <bench_tm [374]> <tibble [374 × 3]>
3 json         84.65ms  89.11ms      11.3    5.11MB     0       10     0      889ms <NULL> <Rprofmem [3,002 × 3]>  <bench_tm [10]>  <tibble [10 × 3]>
4 json_raw     85.36ms  87.02ms      11.5   11.37MB     1.28     9     1      783ms <NULL> <Rprofmem [3,073 × 3]>  <bench_tm [10]>  <tibble [10 × 3]>
5 msgpack       4.15ms   4.34ms     227.     1.66MB     4.16   109     2      480ms <NULL> <Rprofmem [127 × 3]>    <bench_tm [111]> <tibble [111 × 3]>
6 msgpack_raw    4.2ms   4.35ms     225.     1.66MB     6.38   106     3      470ms <NULL> <Rprofmem [127 × 3]>    <bench_tm [109]> <tibble [109 × 3]>
```

For those benchmarks, I am very surprised that the memory consumption of `json_raw` is more than double that of `json` (11.37MB vs 5.11MB). Either way, your change is a POC for the change, though until I find out why `json_raw` allocates so much more, I can't say much beyond that.

For the record: R-4.3.3 in emacs/ess on ubuntu-24.04 on linux-6.8.0, 64GB of RAM.
I haven't studied your code, so a naive question: is there a chance the underlying code pulls a string into R and then treats it as raw for the user?
No, not within redux - the relevant line is conversions.c line 255 (at 5d7211f). This is the same codepath that actual binary data would go through, and that definitely does not get converted into a string.
Okay ... then it's one of the other things I don't understand.
The notion that redis is storing strings is fine, but R is unusual among languages in that strings can be particularly punishing. When retrieving larger objects (e.g., a 1000-row frame), pulling the JSON (or however it is stringified, depending on the creation mechanism) over as a string, bringing it into R memory, and then deserializing from that string can be much less efficient than it strictly needs to be.
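The string-vs-raw distinction can be sketched in a few lines (a minimal sketch; the payload size is illustrative, not from the real application):

```r
n <- 1e6
s <- strrep("x", n)    # a 1e6-character string: created via R's string cache
r <- charToRaw(s)      # the same bytes as a plain raw vector, no interning

length(r)              # 1000000
nchar(s)               # 1000000
# Both hold identical bytes; the character version additionally passes through
# R's global string cache on creation, which is the overhead discussed here.
# A raw vector is just bytes and skips that machinery entirely.
stopifnot(identical(rawToChar(r), s))
```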
What are your thoughts on including (which means writing from scratch, I believe) inline (de)serialization of data?
In my use-case, we have a rather large cache-in-redis of relatively large amounts of data. The efficiency of in-memory caching of large objects is not my point here (a partner company is hosting and pushing data to their redis in a cloud). The long-term storage is an arrow datamart, but many other (non-R) apps are using redis as a cache. The total dataset is in the millions of rows, but each redis object is a 300-1000 row (70+ column) frame. Just deserializing takes an extra 60MB (300-row frame) above what is actually used once deserialized, and all apps load hundreds of thousands of these frames at once, so 60MB will add up. (R's global string pool.) (For reference, the `toJSON(dat)` strings are between 465K-1553K characters. Not huge, but thousands of these add up.)

Clearly this doesn't need to support every serialization mechanism that can work with redis, but some industry standards might: perhaps R's native format and JSON.
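One possible shape for supporting a couple of formats is sketched below (a hypothetical sketch: `payload_to_r` and its `format` argument are made-up names, and the JSON branch leans on jsonlite rather than anything built into redux):

```r
# Hypothetical: deserialize a raw Redis payload according to a declared format.
payload_to_r <- function(bytes, format = c("rds", "json")) {
  format <- match.arg(format)
  switch(format,
    rds  = unserialize(bytes),
    # For json we still materialise one big string here; doing this step
    # down in C is exactly what this issue is asking about.
    json = jsonlite::fromJSON(rawToChar(bytes)))
}

payload_to_r(serialize(mtcars, NULL), "rds")   # round-trips the frame
```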