Problem with using FLAGS.TABLE to access table #119
is there a particular reason you want this change / architecture?
sybil is written as a single-invocation binary for a given operation
instead of a server process, so it uses these flags instead of passing
configuration blobs around. you'll find that this change might work for
ingestion but won't work for digestion or querying.

there are some pros and cons here. memory management is easier, for one.
similarly, a crash or error will only affect one operation. a downside is
that all data is on disk.

if you want a server-like process, an embedded process, or a library
instead of a unix-style model, the easiest thing is to wrap sybil calls
inside a process.

i have a long-standing task to try and make sybil into a server process
with this data-in-ram architecture. i think many code paths can be shared,
but it wouldn't make sense to turn sybil (as it stands) into a server;
instead there would be two binaries with shared functions.

there was previous work 2-3 years ago from someone trying to migrate the
architecture, but i wasn't satisfied with the resulting code. i think now i
have a better idea of how it should look. please feel free to poke me on
irc or email if you want more specifics or to do a voice chat.
i'd also gladly accept a library for sybil that wraps sybil cli calls so it can be used in the way you describe.
the main problem with your approach is if auto digest is initiated; otherwise i think it is fine (but you can use a flag to disable auto digest).
My architecture right now is as follows:
For this real-time ingestion use case, it makes more sense to expose the ingest/digest API as a library. Making it a library call instead of a CLI call can be more lightweight in terms of system resources, and data has to be in a file before it can be ingested via the CLI. However, I totally get your point of
and the benefit of
Do you have pointers as to how to wrap sybil calls inside a process, and at the same time avoid writing to disk before calling sybil?
you don't need to write to disk before sending to sybil - you can stream over stdin to the sybil proc (as per the standard unix process model). in general, batching data into sybil is better than reading one record at a time, so it's fine to make a buffer of 1k, 2k, 5k or 10k records in memory (or on disk) before flushing to sybil - it just depends on how realtime you want your ingestion. kafka (or scribe) definitely makes sense and is how the perfpipe_ stuff works.

there's another person who built a simple sybil wrapper for ingesting jaeger events: https://github.com/gouthamve/redbull/blob/master/pkg/redbull/sybil.go, but they had one more caveat: they built a "virtual table" - they have one large table and partition it by hour. this was because sybil was not optimized at the time for very large tables (500+mm records per table) and we were seeing digestion slow down as more records were added. this is now fixed, and digestion time does not go up as much as the table grows larger. If that example file isn't helpful, I can put together a smaller demo that is neater / easier to use - their use case was a little over-complicated.

Really, mostly you just need to use golang's "os/exec" package for calling out to sybil. in terms of API, i would try designing the API and library you want, then filling it out with the sybil calls. If you want me to help design what the API might look like, I'd be happy to spend time on it with you. An example API might be like:
and all the actual calls to the sybil proc would be hidden inside the SybilTable class. I think building a general-purpose API will be helpful to other people who have similar use cases, so I'm glad to invest in it.

One last thing: I'm not sure what the overhead of spawning a new sybil process is. On the one hand, making a new proc can take 15+ ms, but if you are batching records, it won't be as big of a deal because it's happening infrequently per table (once or less per second, potentially).
I will continue investigating both directions: 1) a compiled golang wrapper around the sybil proc and 2) an interpreted wrapper.
In `src/lib/table_ingest.go`, inside the function `func (cb *SaveBlockChunkCB) CB(digestname string, records RecordList)`, `FLAGS.TABLE` is used to access the target table. But using `FLAGS.TABLE` causes a problem when I try to use sybil as a library instead of from the command line. If we have several goroutines, and each goroutine calls ingestion on a different table, we simply cannot guarantee that they access the table they want. In my local test, when I tried to ingest data into table A, some block files were created inside `/db/` instead of `/db/A/`, and the final number of records stored in table A was less than the number of records I ingested. After I changed the implementation in the following way, this issue was gone: instead of using `FLAGS.TABLE`, we can store the corresponding table name inside `struct SaveBlockChunkCB`, and inside `CB()` we use `t := GetTable(cb.table)` to access the table.

Let me know if I'm missing anything. If this is a real problem and the solution sounds viable, I will put up a pull request.