
Problem with using FLAGS.TABLE to access table #119

Open
wang502 opened this issue Apr 8, 2020 · 6 comments

@wang502

wang502 commented Apr 8, 2020

In src/lib/table_ingest.go, inside the function `func (cb *SaveBlockChunkCB) CB(digestname string, records RecordList)`, FLAGS.TABLE is used to access the target table.

But using FLAGS.TABLE causes a problem when I try to use sybil as a library instead of from the command line. If several goroutines each call ingestion on a different table, we simply cannot guarantee that each one accesses the table it wants, since they all read the same global flag. In my local test, when I tried to ingest data into table A, some block files were created inside /db/ instead of /db/A/, and the final number of records stored in table A was less than the number of records I ingested. After I changed the implementation in the following way, the issue was gone.

So instead of using FLAGS.TABLE, maybe we can store the corresponding table name inside the struct SaveBlockChunkCB, and inside CB(), use t := GetTable(cb.table) to access the table.
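A minimal, self-contained sketch of the proposed change. The `CB` signature and `GetTable` name come from the issue; `RecordList`, `Table`, and the function bodies here are stub stand-ins, not sybil's real implementations:

```go
package main

import "fmt"

// Stub stand-ins for sybil's real types in src/lib; illustrative only.
type RecordList []string
type Table struct{ Name string }

// GetTable stands in for sybil's table lookup.
func GetTable(name string) *Table { return &Table{Name: name} }

// SaveBlockChunkCB carries the table name it was created for, so CB()
// no longer needs to read the global FLAGS.TABLE.
type SaveBlockChunkCB struct {
	table string
}

func (cb *SaveBlockChunkCB) CB(digestname string, records RecordList) {
	t := GetTable(cb.table) // per-callback table instead of a global flag
	fmt.Printf("saving %d records to table %s\n", len(records), t.Name)
}

func main() {
	// Two concurrent ingesters no longer race on one global flag.
	a := &SaveBlockChunkCB{table: "A"}
	b := &SaveBlockChunkCB{table: "B"}
	a.CB("digest-1", RecordList{"r1", "r2"})
	b.CB("digest-2", RecordList{"r3"})
}
```

Because each callback owns its table name, goroutines ingesting into different tables can no longer clobber each other through shared global state.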

Let me know if I'm missing anything. If this is a real problem and the solution sounds viable, I will put up a pull request.

@okayzed
Collaborator

okayzed commented Apr 8, 2020 via email

@okayzed
Collaborator

okayzed commented Apr 8, 2020

I'd also gladly accept a library for sybil that wraps sybil CLI calls so it can be used in the way you describe.

@okayzed
Collaborator

okayzed commented Apr 9, 2020

The main problem with your approach is if auto digest is initiated; otherwise I think it is fine (and you can use a flag to disable auto digest).

@wang502
Author

wang502 commented Apr 10, 2020

My architecture right now is as follows:

  1. Event data is stored in kafka.
  2. My server keeps pulling messages from kafka and distributes them to different goroutines; each goroutine handles ingestion for one sybil table.
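The fan-out in step 2 could be sketched like this (a minimal sketch: the `Message` type, table names, and channel sizes are placeholders, and the kafka poll loop is stubbed out with a slice):

```go
package main

import (
	"fmt"
	"sync"
)

// Message is a placeholder for a decoded kafka event.
type Message struct {
	Table string
	Body  string
}

func main() {
	tables := []string{"A", "B"}

	// One channel and one ingesting goroutine per sybil table.
	chans := make(map[string]chan Message)
	var wg sync.WaitGroup
	for _, t := range tables {
		ch := make(chan Message, 100)
		chans[t] = ch
		wg.Add(1)
		go func(table string, ch chan Message) {
			defer wg.Done()
			for msg := range ch {
				// In the real server this would batch records and
				// call sybil ingestion for `table`.
				fmt.Printf("table %s ingesting %q\n", table, msg.Body)
			}
		}(t, ch)
	}

	// Stand-in for the kafka poll loop: route each message to its table.
	msgs := []Message{{"A", "e1"}, {"B", "e2"}, {"A", "e3"}}
	for _, m := range msgs {
		chans[m.Table] <- m
	}
	for _, ch := range chans {
		close(ch)
	}
	wg.Wait()
}
```

Routing each table's messages through its own channel keeps per-table ingestion ordered while still processing tables concurrently.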

For this real-time ingestion use case, does it make more sense to expose the ingest/digest API as a library? Making it a library call instead of a CLI call is more lightweight in terms of system resources, and data has to be written to a file before it can be ingested via the CLI.

However I totally get your point of

wrap sybil calls inside a process

and the benefit of

have two binaries with shared functions

Do you have pointers as to how to wrap sybil calls inside a process? And at the same time avoid writing to disk before calling sybil?

@okayzed
Collaborator

okayzed commented Apr 10, 2020

You don't need to write to disk before sending to sybil; you can stream over stdin to the sybil proc (as per the standard unix process model).

In general, batching data into sybil is better than writing one record at a time, so it's fine to build a buffer of 1k, 2k, 5k or 10k records in memory (or on disk) before flushing to sybil; it just depends on how realtime you want your ingestion to be. Kafka (or scribe) definitely makes sense and is how the perfpipe_ stuff works.

There's another person who built a simple sybil wrapper for ingesting jaeger events: https://github.com/gouthamve/redbull/blob/master/pkg/redbull/sybil.go. They had one extra caveat, though: they built a "virtual table" by keeping one large table and partitioning it by hour. This was because sybil was not optimized at the time for very large tables (500+mm records per table) and we were seeing digestion slow down as more records were added. This is now fixed, and digestion time no longer grows as much as the table does. If that example file isn't helpful, I can put together a smaller demo that is neater and easier to use, since their use case was a little over-complicated. Really, you mostly just need golang's "os/exec" package for calling out to sybil.

In terms of API, I would try designing the API and library you want, then filling it out with the sybil calls. If you want me to help design what the API might look like, I'd be happy to spend time on it with you.

An example API might be like:

table := SybilTable{my_table} // initializer
table.AppendRecords(...) // appends records in memory
...
...
table.Flush()  // actually makes the sybil proc call

and all the actual calls to the sybil proc would be hidden inside the SybilTable struct.

I think building a general-purpose API will be helpful to other people who have a similar use case to yours, so I'm glad to invest in it.

One last thing: I'm not sure what the overhead of spawning a new sybil process is. On the one hand, making a new proc can take 15+ ms, but if you are batching records it won't be as big of a deal, because it happens infrequently per table (once per second or less, potentially).

@okayzed
Collaborator

okayzed commented Apr 11, 2020

After looking at the amount of effort it takes to write a golang wrapper, I think it's kind of painful (but still doable). I would recommend writing a python or other-language wrapper (one that reads off kafka and ships to sybil), because it will be much simpler. I can help down that path, too.

I will continue investigating both directions: 1) a compiled golang wrapper around the sybil proc, and 2) an interpreted wrapper.
