If reduce() is a monoid operation that actually reduces space, you can run a reduce() locally on each mapper before writing out to the global map() location.
CORRAL_LOCAL_REDUCE=0 by default?
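A minimal sketch of what such a local reduce could look like for WordCount. `localReduce` and its signature are hypothetical names made up for illustration, not part of corral's API, and the sketch assumes every emitted value is 1 and that the combine function is associative.

```go
package main

import "fmt"

// localReduce is a hypothetical helper, not part of corral.
// It folds the (word, 1) pairs produced by a single mapper into one
// value per key, so fewer pairs would have to be written to shared storage.
// combine is assumed to be associative (the monoid operation).
func localReduce(words []string, combine func(a, b int) int) map[string]int {
	acc := make(map[string]int)
	for _, w := range words {
		if cur, ok := acc[w]; ok {
			acc[w] = combine(cur, 1)
		} else {
			acc[w] = 1
		}
	}
	return acc
}

func main() {
	words := []string{"a", "b", "a", "c", "a", "b"}
	sum := func(a, b int) int { return a + b }
	out := localReduce(words, sum)
	fmt.Println(len(words), "pairs emitted,", len(out), "pairs after the local reduce")
	fmt.Println(out)
}
```

The saving only shows up when keys repeat within one mapper's input, which is why an opt-in setting seems reasonable.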
The input split function needs to be safe in some domains like WordCount, where you don't want to split in the middle of a word. I'd support passing a simple tokenizer, where mappers would read an overlap of K bytes.
For really nasty grammars you can't do it in parallel. For context-free grammars you can do Valiant '75 parsing via parallel matrix multiplication, and for simpler grammars like parenthesis matching you just need https://en.wikipedia.org/wiki/All_nearest_smaller_values .
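To make the parenthesis-matching case concrete, here is a small sketch of the reduction to all nearest smaller values. It assumes a balanced string of `(` and `)` and uses the simple sequential stack algorithm; a parallel ANSV routine could be swapped in for `nearestSmallerLeft`. Both function names are made up for this example.

```go
package main

import "fmt"

// nearestSmallerLeft returns, for each index i, the largest j < i with
// a[j] < a[i], or -1 if there is none (sequential stack-based ANSV).
func nearestSmallerLeft(a []int) []int {
	res := make([]int, len(a))
	var stack []int // indices whose values are strictly increasing
	for i, v := range a {
		for len(stack) > 0 && a[stack[len(stack)-1]] >= v {
			stack = stack[:len(stack)-1]
		}
		if len(stack) == 0 {
			res[i] = -1
		} else {
			res[i] = stack[len(stack)-1]
		}
		stack = append(stack, i)
	}
	return res
}

// matchParens maps the index of each ')' to the index of its matching '('.
// Assumption: s is balanced and contains only '(' and ')'.
func matchParens(s string) map[int]int {
	// depth[i] is the nesting depth just before character i.
	depth := make([]int, len(s))
	d := 0
	for i := 0; i < len(s); i++ {
		depth[i] = d
		if s[i] == '(' {
			d++
		} else {
			d--
		}
	}
	// For a ')' at i, the nearest position to the left with a strictly
	// smaller depth is its matching '('.
	left := nearestSmallerLeft(depth)
	match := make(map[int]int)
	for i := 0; i < len(s); i++ {
		if s[i] == ')' {
			match[i] = left[i]
		}
	}
	return match
}

func main() {
	fmt.Println(matchParens("(()())"))
}
```

The depth array is a prefix sum, which also parallelizes, so both steps are the kind of work that can be spread across splits.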
I agree that this is a good idea, but I think this is best left up to the corefs.FileSystem implementation -- i.e. the filesystem can maintain its own buffer, if necessary for performance reasons. For example, the S3 client already does this as it batches multipart writes.
If I'm understanding correctly, corral already does this. Mappers are allowed to read past their byte limit until they reach the end of their last record (source). I also created "Allow arbitrary record split functions" (#3) to track adding arbitrary split functions.
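For readers unfamiliar with the convention, here is a rough sketch of record-aligned split reading for newline-delimited input. This is not corral's actual code: `recordsForSplit` is a hypothetical helper, and it assumes a record belongs to the split that contains its first byte.

```go
package main

import (
	"bytes"
	"fmt"
)

// recordsForSplit returns the newline-delimited records owned by the byte
// range [start, end) of data. Hypothetical helper for illustration only.
// Assumed convention: a record belongs to the split holding its first byte.
func recordsForSplit(data []byte, start, end int) []string {
	pos := start
	// A split that begins mid-record skips ahead to the next record start;
	// the previous split is responsible for finishing that record.
	if start > 0 && data[start-1] != '\n' {
		i := bytes.IndexByte(data[start:], '\n')
		if i < 0 {
			return nil
		}
		pos = start + i + 1
	}
	var out []string
	for pos < end && pos < len(data) {
		i := bytes.IndexByte(data[pos:], '\n')
		if i < 0 {
			// Last record has no trailing newline.
			out = append(out, string(data[pos:]))
			break
		}
		// This may read past end in order to finish the record.
		out = append(out, string(data[pos:pos+i]))
		pos += i + 1
	}
	return out
}

func main() {
	data := []byte("alpha\nbeta\ngamma\n")
	// The boundary at byte 8 falls in the middle of "beta".
	fmt.Println(recordsForSplit(data, 0, 8))
	fmt.Println(recordsForSplit(data, 8, len(data)))
}
```

With this rule every record is read exactly once, without the splits needing to coordinate or agree on an overlap size.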
Nice project! I may judiciously emulate this in C++ with Apache Arrow support :)
corral/emitter.go, line 92 in 2c8445a
CORRAL_MAP_BUFFER_SIZE=0 by default?