Word count not counting the whole data #11
Comments
This project is not actively being maintained. That being said, what types of request errors are you seeing in the logs? I'm not sure that worker retry was ever implemented, so that may be the cause of the undercounting.
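To illustrate the kind of worker retry mentioned here, this is a minimal sketch in Python. It is not this project's code: `ThrottledError` and `invoke_with_retry` are hypothetical names, and the backoff parameters are arbitrary. The point is only that a throttled task is re-run instead of being dropped silently, which is what would otherwise cause an undercount.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a "too many requests" failure."""


def invoke_with_retry(task, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call task() until it succeeds, retrying only on ThrottledError."""
    for attempt in range(max_attempts):
        try:
            return task()
        except ThrottledError:
            if attempt == max_attempts - 1:
                # Out of attempts: surface the failure rather than lose the work.
                raise
            # Exponential backoff with jitter so retries do not arrive in lockstep.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
```

If the final attempt also fails, the exception propagates, so a dropped input split shows up as an error rather than as a smaller total.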
Thanks! The errors are similar to:
|
It may be worth trying to configure "maxConcurrency" (ref) to a lower value. It looks like the default is 500, which may be too high, in retrospect. As for |
I'm running the word count program over an 86 GB dataset. The data is UTF-8, already sanitized with newlines and spaces. I already know that the total is around 29,000M words, but the resulting output of the word count program sums to just 86M words. Also, the logs are full of "too many requests" errors.
How can I debug why the program is not reading the whole input? Is it caused by those "too many requests" errors? Is there any workaround? Thanks
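One way to approach both the debugging question and the workaround is sketched below in Python. It is a generic illustration, not this project's API: `run_bounded` is a hypothetical helper, and its `max_concurrency` parameter only mirrors the idea of capping in-flight work. It runs tasks with a fixed concurrency limit and records every failure, so the number of input splits that never completed can be compared against the shortfall in the total.

```python
from concurrent.futures import ThreadPoolExecutor


def run_bounded(tasks, max_concurrency=50):
    """Run zero-argument callables with at most max_concurrency in flight.

    Returns (results, failures), where failures is a list of
    (task_index, exception) pairs for tasks that raised.
    """
    results, failures = [], []
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        futures = [pool.submit(task) for task in tasks]
        for index, future in enumerate(futures):
            try:
                results.append(future.result())
            except Exception as exc:
                # Keep the failure so dropped splits can be counted and re-run.
                failures.append((index, exc))
    return results, failures


# Usage: each task counts the words in one chunk of input.
chunks = ["a b c", "d e", "f"]
counts, failed = run_bounded(
    [lambda c=c: len(c.split()) for c in chunks], max_concurrency=2
)
total_words = sum(counts)
```

A non-empty `failed` list would point to dropped splits as the source of a low total; lowering the limit trades throughput for fewer throttled requests.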