Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use winnow instead of nom #9

Draft
wants to merge 4 commits into
base: master
Choose a base branch
from
Draft

Conversation

spoutn1k
Copy link

Changes the parser combinator library to use the more actively maintained winnow fork of nom. The changes in winnow allow for lesser nesting, and better error reporting - this PR aims to enable those features for the nodeset parsing.

@fdiakh
Copy link
Owner

fdiakh commented Sep 19, 2024

Hi,

Thank you for contributing !

I'm not against switching to winnow, especially if it improves error reporting which is not great at the moment. However please also consider performance while refining this PR: I ran a test in which I feed a file containing a1b1 a2b1 ... a1000b100 to ns fold (through stdin): on my machine the time went from 70ms to 1.9s by switching to this branch. Do you see the same performance regression ?

Thanks !

@spoutn1k
Copy link
Author

Hey ! Thank you for a great library !

I am noticing it as well. It looks like most of the time lost is spent copying stuff in vecs - most likely an oversight in my implementation. Flamegraph attached for that ..a1000b100 test.

flamegraph

@spoutn1k
Copy link
Author

Found the issue. An error that needed a string made me copy a subslice of the input, and the framework just ran with it. Removing this temporarily for refactoring, the performance drops back to the same range:

nodeset> cat all.list | /dev/shm/cargo/release/ns fold
a[1-1000]b[1-100]
nodeset> hyperfine "cat all.list | /dev/shm/cargo/release/ns fold"
Benchmark 1: cat all.list | /dev/shm/cargo/release/ns fold
  Time (mean ± σ):      53.2 ms ±   1.9 ms    [User: 46.6 ms, System: 7.3 ms]
  Range (min … max):    49.2 ms …  58.5 ms    53 runs

@spoutn1k
Copy link
Author

So there is a entire parser for nodelists, with precedence climbing and stateful parsing, but reading the clustershell configuration is done using sh and sed ? I wanted to implement errors for undefined groups and I came upon this. Is it subject to change ? Or is it a design choice and would rather keep it that way ?

@fdiakh
Copy link
Owner

fdiakh commented Sep 25, 2024

Mmmh, I'm not sure what you mean exactly ?

Clustershell uses a mix of ini and yaml files which we parse with Rust crates.

The only use of sed I can see in this repo is in the configuration files that are used as input for some tests.

Clustershell allows you to configure dynamic group sources for which groups and their members are generated dynamically by calling commands that the user provides in configuration files. By default clustershell ships with a configuration file defining a dynamic group source that calls sed to find group definitions in a file, as you can see here: https://github.com/cea-hpc/clustershell/blob/3ab6cbd043ee5f451be25a46f6abea162761b0a0/conf/groups.conf

As part of the tests we validate that the default configuration file of clustershell is interpreted correctly.

@spoutn1k
Copy link
Author

That is exactly it. I had no idea the default group configuration from the /etc/clustershell/groups.d/local.cfg was parsed by a sed call internally. It made it kind of unreliable to differentiate between an empty and non-existing group.

nodeset> strace 2>&1 -ff -s8192 /dev/shm/cargo/release/ns list @oss | grep sed                   ~/Projects/nodeset::master
read(3, "# ClusterShell node groups main configuration file\n#\n# Please see `man 5 groups.conf` and\n# http://clustershell.
[pid 15631] execve("/bin/sh", ["/bin/sh", "-c", "[ -f /etc/clustershell/groups ] && f=/etc/clustershell/groups || f=/etc/clu
[pid 15631] newfstatat(AT_FDCWD, "/home/jb/.cargo/bin/sed", 0x7ffe481224b0, 0) = -1 ENOENT (No such file or directory)
[pid 15631] newfstatat(AT_FDCWD, "/home/jb/.local/bin/sed", 0x7ffe481224b0, 0) = -1 ENOENT (No such file or directory)
[pid 15631] newfstatat(AT_FDCWD, "/usr/local/bin/sed", 0x7ffe481224b0, 0) = -1 ENOENT (No such file or directory)
[pid 15631] newfstatat(AT_FDCWD, "/usr/bin/sed", {st_mode=S_IFREG|0755, st_size=108912, ...}, 0) = 0
[pid 15631] newfstatat(AT_FDCWD, "/usr/bin/sed", {st_mode=S_IFREG|0755, st_size=108912, ...}, 0) = 0
[pid 15631] access("/usr/bin/sed", X_OK) = 0
[pid 15631] newfstatat(AT_FDCWD, "/usr/bin/sed", {st_mode=S_IFREG|0755, st_size=108912, ...}, 0) = 0
[pid 15631] access("/usr/bin/sed", R_OK) = 0
[pid 15631] execve("/usr/bin/sed", ["sed", "-n", "s/^oss:\\(.*\\)/\\1/p", "/etc/clustershell/groups.d/local.cfg"], 0x5f0c9a

Makes a lot more sense with your explanation of why I was seeing sed calls being made. Not convinced this is good practice from clustershell ...

@fdiakh
Copy link
Owner

fdiakh commented Sep 29, 2024

Clustershell has been able parse static group source definition in yaml files for a while now (see the autodirs configuration), but I guess they are keeping the old format by default for backwards compatibility. There is indeed a risk that new users may keep using the less efficient method...

Speaking of backwards compatibility, I think the current behaviour for non-existing group/sources was implemented to match clustershell (only return an error for non-existing sources). Do you see a strong rationale for changing this ?

@spoutn1k
Copy link
Author

spoutn1k commented Oct 1, 2024

No I do not. My rationale was to go bottom-up for errors, as I wanted to make sure parse errors include more information, and I though the error was missing and not omitted by choice. If you are telling me this is the latter, it will be reverted.

This is going really slow because I am spread thin at the moment, sorry.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants