-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support process substitution to read keys from std::in #21
Comments
Hi all, Bash pipes and redirects are now allowed using the option |
Thanks Roberto, this is amazing! More generally, I'm a bit confused about the usage of the expression "external memory" that pthash makes. There is "external memory" in the sense that the work memory you need for mph building is also on disk; and there is "external memory" in the sense that you use You can have either of these or both, as it is now correctly captured by the separate CLI flags |
Hi @zacchiro, Thank you again Roberto! |
Hi @zacchiro , What you say about external memory is right, and I thank you for your feedback. I removed the --mmap parameter in this version (but it wasn't in the documentation, right?), as when reading from huge regular files, mmap is more efficient than other options. We came up with this "agreement" because process substitution is a feature of bash, zsh and ksh (and possibly others, I don't know) that isn't POSIX and shouldn't be used in portable code. |
Hi @zacchiro,
just merged into `master`. See now the examples in the README here: https://github.com/jermp/pthash#reading-keys-from-standard-input.
Thank you again Roberto!
Awesome, thanks to both of you!
I confirm it works for me with `--external -i -` ; I've already tried a
full run of it on my "small" dataset (no issues) and I'm currently
running it on my big dataset.
|
Upon benchmarking, it looks like Interestingly enough, this was not the case in the previous implementation even when not using |
Hi @zacchiro , Thank you again for your feedback. The outcomes of my tests are:
It means that most of the time is spent in inter-process communication when using pipe, because there are no implementation differences from our side in the three tested cases (there would be only in external memory where |
I tried the same kind of tests in external memory and I found that:
The problem of istream (and process substitution) is that it will not be possible to seek at the beginning of the "file" to perform check and lookup tests after the construction. |
I agree with your analysis about IPC being the bottleneck here.
So the ideal combination for my use case would be process substitution + external memory, (because my dataset is too big for process substitution + internal memory).
But the combination in question is currently not supported because if the input file is not "-" the build tool will try to reread it again, right? Or am I missing something?
Cheers
--
Sent from my mobile phone. Please excuse my brevity and top-posting.
|
Why not trying first
We only read it again for check and lookup. On my Mac, I'm not able to use mmap with process substitution. |
I fear we are talking past each other at this point, due to the high amount of details/variables floating around.
I have tried this. It works. But the reading phase is too slow. It's like 3-4 hours vs 1-1.5 hours in my benchmark.
That is "normal", because process substitution gives you a non regular file (a FIFO), which cannot be mmap-ed. |
Yes, exactly.
You're right, there is no reason we should stick to mmap. Indeed if you use my past commit where the option --mmap existed and you remove the copy constructor/assignment from the class |
As suggested by @zacchiro,
we should support process substitution to read keys from
std::in
.Once done, we can specify, e.g.,
-i < (cat file)
or-i < (zcat file.gz)
, etc.@roberto-trani suggested to use the same trick as used in
grep
(https://man7.org/linux/man-pages/man1/grep.1.html)where
The only caveat is that: reading from std::cin prevents from iterating twice on the stream, hence we can only build the
function but not check its correctness nor banchmark lookup (because this requires iterating again over the keys from
the stream).
The text was updated successfully, but these errors were encountered: