Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support process substitution to read keys from std::in #21

Closed
jermp opened this issue Nov 30, 2023 · 12 comments · Fixed by #22
Closed

Support process substitution to read keys from std::in #21

jermp opened this issue Nov 30, 2023 · 12 comments · Fixed by #22
Assignees
Labels
enhancement New feature or request

Comments

@jermp
Copy link
Owner

jermp commented Nov 30, 2023

As suggested by @zacchiro,
we should support process substitution to read keys from std::in.

Once done, we can specify, e.g., -i < (cat file) or -i < (zcat file.gz), etc.

@roberto-trani suggested to use the same trick as used in grep (https://man7.org/linux/man-pages/man1/grep.1.html)
where

     -f FILE, --file=FILE

          Obtain patterns from FILE, one per line.  If this option
          is used multiple times or is combined with the -e
          (--regexp) option, search for all patterns given.  The
          empty file contains zero patterns, and therefore matches
          nothing.  If FILE is - , read patterns from standard
          input.

The only caveat is that: reading from std::cin prevents from iterating twice on the stream, hence we can only build the
function but not check its correctness nor banchmark lookup (because this requires iterating again over the keys from
the stream).

@roberto-trani
Copy link
Collaborator

Hi all,
I created the branch 21-support-process-substitution-to-read-keys-from-stdin to support read from standard input.

Bash pipes and redirects are now allowed using the option -i -.
At the same time, process substitution works in internal memory but not in external memory because it is not possible to open the non-regular file using mmap.

@zacchiro
Copy link

zacchiro commented Dec 1, 2023

Thanks Roberto, this is amazing!

More generally, I'm a bit confused about the usage of the expression "external memory" that pthash makes.

There is "external memory" in the sense that the work memory you need for mph building is also on disk; and there is "external memory" in the sense that you use mmap() on the original input file rather than sequential/streaming reading.

You can have either of these or both, as it is now correctly captured by the separate CLI flags --external (for the former concern) and --mmap (for the latter).
I was initially confused by this while reading pthash doc, maybe it's worth a clarification (but I have no concrete suggestions at the moment, sorry).

@jermp jermp linked a pull request Dec 1, 2023 that will close this issue
@jermp jermp closed this as completed in #22 Dec 1, 2023
@jermp
Copy link
Owner Author

jermp commented Dec 1, 2023

Hi @zacchiro,
just merged into master. See now the examples in the README here: https://github.com/jermp/pthash#reading-keys-from-standard-input.

Thank you again Roberto!

@roberto-trani
Copy link
Collaborator

roberto-trani commented Dec 1, 2023

Hi @zacchiro ,

What you say about external memory is right, and I thank you for your feedback.

I removed the --mmap parameter in this version (but it wasn't in the documentation, right?), as when reading from huge regular files, mmap is more efficient than other options.
Instead, when using non regular files, piping is the right way to go.
This decision simplified the development and the code.

We came up with this "agreement" because process substitution is a feature of bash, zsh and ksh (and possibly others, I don't know) that isn't POSIX and shouldn't be used in portable code.
Therefore, the only edge case when using --external is with process substitution, but in that case you can use pipe (e.g., zcat file.gz | ./buidl .... -i -).

@zacchiro
Copy link

zacchiro commented Dec 1, 2023 via email

@zacchiro
Copy link

zacchiro commented Dec 1, 2023

Upon benchmarking, it looks like cat foo.txt | ./build -i - is about 3x slower than the equivalent ./build -i foo.txt, which makes it unusable on very large datasets (e.g., ~1.6 TiB like in my "big" dataset case).
Note that here there is no fault of the decompression overhead, as I'm comparing plain file versus cat and not zcat.

Interestingly enough, this was not the case in the previous implementation even when not using --mmap, when it briefly existed. So I don't think this slowdown is due to an mmap-read versus non-mmap-read, but rather to some stream handling that is different between -i - and -i filename.

@roberto-trani
Copy link
Collaborator

roberto-trani commented Dec 2, 2023

Hi @zacchiro ,

Thank you again for your feedback.
I did a similar benchmark on 50M strings using the internal memory construction, where the two versions have the same implementation.
I will report the whole time (measured using time) and the total construction time.

The outcomes of my tests are:

  • ./build -i filename (regular file) employed 25 seconds in total and 20 seconds for construction
  • .build -i <(cat filename) (process substitution) employed 25 seconds in total and 20 seconds for construction
  • cat filename | .build -i - (pipe) employed 45 seconds in total and 20 seconds for construction

It means that most of the time is spent in inter-process communication when using pipe, because there are no implementation differences from our side in the three tested cases (there would be only in external memory where -i filename employs a mmap).

@roberto-trani
Copy link
Collaborator

I tried the same kind of tests in external memory and I found that:

  • on my laptop, there is no significant difference between mmap and istream on 50M strings;
  • using istream, I achieved the same results I had in internal memory (with -i filename and -i <(cat filename) equally fast and pipe much slower than previous approaches).

The problem of istream (and process substitution) is that it will not be possible to seek at the beginning of the "file" to perform check and lookup tests after the construction.

@zacchiro
Copy link

zacchiro commented Dec 2, 2023 via email

@jermp
Copy link
Owner Author

jermp commented Dec 2, 2023

So the ideal combination for my use case would be process substitution + external memory,

Why not trying first zstdcat file.zstd | ./build (...)? If IO or IPC is the problem here...there is not much we can do about it, except for supporting iteration over zstd compressed files. But this is not something PTHash nor its build tool should support natively. Ideally, users write their own iterator and pass it to the build method of the PTHash class.

But the combination in question is currently not supported because if the input file is not "-" the build tool will try to reread it again, right?

We only read it again for check and lookup. On my Mac, I'm not able to use mmap with process substitution.

@zacchiro
Copy link

zacchiro commented Dec 2, 2023

I fear we are talking past each other at this point, due to the high amount of details/variables floating around.
But just to recap:

Why not trying first zstdcat file.zstd | ./build (...)?

I have tried this. It works. But the reading phase is too slow. It's like 3-4 hours vs 1-1.5 hours in my benchmark.

We only read it again for check and lookup. On my Mac, I'm not able to use mmap with process substitution.

That is "normal", because process substitution gives you a non regular file (a FIFO), which cannot be mmap-ed.
What I still don't understand is why ./build cannot read that file without mmapping when --external is given.
If the stream is effectively ready only once, that combination (process substitution + --external + no-mmap) would tick all the boxes of my use case.

@jermp
Copy link
Owner Author

jermp commented Dec 2, 2023

That is "normal", because process substitution gives you a non regular file (a FIFO), which cannot be mmap-ed.

Yes, exactly.

What I still don't understand is why ./build cannot read that file without mmapping when --external is given.

You're right, there is no reason we should stick to mmap. Indeed if you use my past commit where the option --mmap existed and you remove the copy constructor/assignment from the class lines_iterator_wrapper, it should work for your case. I'll check this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants