Buffered writes, local memory reductions, safer input splits #1

chadbrewbaker · 2018-05-07T20:50:00Z

Nice project! I may judiciously emulate this in C++ with Apache Arrow support :)

To avoid write thrashing, you may want to add an optional buffer to batch writes:

corral/emitter.go

Line 92 in 2c8445a

CORRAL_MAP_BUFFER_SIZE=0 by default?

If reduce() is a monoid that actually reduces space, you can run reduce() locally before writing out to the global map() location.

CORRAL_LOCAL_REDUCE=0 by default?

The input split function needs to be safe in some domains like WordCount, where you don't want to split in the middle of a word. I'd support passing a simple tokenizer, where mappers would read an overlap of K bytes.

For really nasty grammars you can't do it in parallel; for context-free grammars you can do Valiant '75 parsing via parallel matrix multiplication; and for simpler grammars like parenthesis matching you just need https://en.wikipedia.org/wiki/All_nearest_smaller_values .

bcongdon · 2018-05-17T14:20:21Z

Hey, thanks for your interest!

1. I agree that this is a good idea, but I think this is best left up to the corefs.FileSystem implementation -- i.e. the filesystem can maintain its own buffer, if necessary for performance reasons. For example, the S3 client already does this as it batches multipart writes.
2. I created "Add a post-Map "Combiner" step" #2 to track this.
3. If I'm understanding correctly, corral already does this. Mappers are allowed to read past their byte limit until they read to the end of their last record (source). I also created "Allow arbitrary record split functions" #3 to track adding arbitrary split functions.

bcongdon closed this as completed Aug 22, 2018
