Buffered writes, local memory reductions, safer input splits #1

Closed
chadbrewbaker opened this issue May 7, 2018 · 1 comment

chadbrewbaker commented May 7, 2018

Nice project! I may judiciously emulate this in C++ with Apache Arrow support :)

  1. To avoid write thrashing, you may want to add an optional buffer to batch writes:

CORRAL_MAP_BUFFER_SIZE=0 by default?
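
A minimal sketch of what such a buffer could look like, assuming mapper output goes through a plain `io.Writer`. This is not corral's code: `countingWriter`, `emitAll`, and `emitBuffered` are hypothetical names, and `bufSize` only stands in for the proposed `CORRAL_MAP_BUFFER_SIZE` setting.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
)

// countingWriter is a hypothetical sink that counts how many Write calls reach it.
type countingWriter struct {
	buf    bytes.Buffer
	writes int
}

func (c *countingWriter) Write(p []byte) (int, error) {
	c.writes++
	return c.buf.Write(p)
}

// emitAll writes each record followed by a newline to w.
func emitAll(w io.Writer, records []string) error {
	for _, r := range records {
		if _, err := io.WriteString(w, r+"\n"); err != nil {
			return err
		}
	}
	return nil
}

// emitBuffered batches writes through a bufio.Writer when bufSize > 0.
// bufSize stands in for the proposed CORRAL_MAP_BUFFER_SIZE (0 = no buffering).
func emitBuffered(sink io.Writer, records []string, bufSize int) error {
	if bufSize <= 0 {
		return emitAll(sink, records)
	}
	bw := bufio.NewWriterSize(sink, bufSize)
	if err := emitAll(bw, records); err != nil {
		return err
	}
	// Flush pushes whatever is still buffered down to the sink.
	return bw.Flush()
}

func main() {
	records := []string{"a\t1", "b\t1", "a\t1", "c\t1"}
	unbuffered, buffered := &countingWriter{}, &countingWriter{}
	_ = emitBuffered(unbuffered, records, 0)
	_ = emitBuffered(buffered, records, 4096)
	fmt.Println("unbuffered writes:", unbuffered.writes)
	fmt.Println("buffered writes:", buffered.writes)
}
```

With the default of 0 the behavior is unchanged, so the buffer stays opt-in.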

  2. If reduce() is a monoid that actually reduces space, you can do a local reduce() before writing out to the global map() location.

CORRAL_LOCAL_REDUCE=0 by default?

  3. The input split function needs to be safe in some domains, like WordCount, where you don't want to split in the middle of a word. I'd support passing a simple tokenizer where mappers would read an overlap of K bytes.

For really nasty grammars you can't do it in parallel. For context-free grammars you can do Valiant '75 parsing via parallel matrix multiplication, and for simpler grammars like parenthesis matching you just need https://en.wikipedia.org/wiki/All_nearest_smaller_values .

@bcongdon
Owner

Hey, thanks for your interest!

  1. I agree that this is a good idea, but I think this is best left up to the corefs.FileSystem implementation -- i.e. the filesystem can maintain its own buffer, if necessary for performance reasons. For example, the S3 client already does this as it batches multipart writes.
  2. I created "Add a post-Map 'Combiner' step" (#2) to track this.
  3. If I'm understanding correctly, corral already does this. Mappers are allowed to read past their byte limit until they reach the end of their last record (source). I also created "Allow arbitrary record split functions" (#3) to track adding arbitrary split functions.
