Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

debug flatpak test failure #20

Closed
wants to merge 42 commits into from
Closed

debug flatpak test failure #20

wants to merge 42 commits into from

Conversation

@martinpitt
Copy link
Owner Author

martinpitt commented Dec 6, 2023

This reliably reproduces in an interactive tmate session, but not in a local vm-run ubuntu-2204 VM which replicates the setup (which is a bit tricky due to having to install a non-ancient node via n).

xvfb-run dbus-run-session containers/flatpak/test/test-browser fails as well (the CI run doesn't get that far), and I can reproduce that locally. So queuing that for later, and looking at the xvfb-run dbus-run-session containers/flatpak/test/test-ssh failure.

I can take out the first part of the test (before "Check if we can start via pure DBus activation"), this keeps the behaviour (both local success and GH hang/timeout). This removes some noise and speeds up iteration.

I added the following debug patch:

--- a/src/client/cockpit-client
+++ b/src/client/cockpit-client
@@ -232,9 +232,11 @@ class CockpitClient(Gtk.Application):
         self.activate()
 
     def quit_action(self, *args):
+        print("XXXX quit", args)
         self.quit()
 
     def open_path(self, action, parameter, *args):
+        print("XXXX open_path", action, parameter, args)
         CockpitClientWindow(self, self.uri + parameter.get_string()).present()
 
     def do_activate(self):

Locally, with | ts it shows that the gapplication quit call is sometimes instant, and sometimes already slow. When it's fast, it looks like this:

Dec 06 08:09:42 + gapplication action org.cockpit_project.CockpitClient quit
Dec 06 08:09:43 + flatpak ps
Dec 06 08:09:43 XXXX open_path <Gio.SimpleAction object at 0x7febbc584e00 (GSimpleAction at 0x561089752530)> '=cockpit-client-test' (None,)
Dec 06 08:09:43 XXXX quit (<Gio.SimpleAction object at 0x7febbc51a400 (GSimpleAction at 0x561089734cc0)>, None, None)

and when it's slow, it looks like this:

Dec 06 07:57:05 + gapplication action org.cockpit_project.CockpitClient quit
Dec 06 07:57:21 XXXX open_path <Gio.SimpleAction object at 0x7fd32553ac80 (GSimpleAction at 0x55f07b3eb040)> '=cockpit-client-test' (None,)
Dec 06 07:57:21 XXXX quit (<Gio.SimpleAction object at 0x7fd32553ac80 (GSimpleAction at 0x55f07b5d64a0)>, None, None)
Dec 06 07:57:21 + flatpak ps

it always times out on GH, without even reaching the callbacks:

Dec 06 07:42:08 + gapplication action org.cockpit_project.CockpitClient quit
Dec 06 07:42:33 error sending ActivateAction message to application: Timeout was reached

So that "XXX open_path" callback is confusing -- that shouldn't happen after "quit", and the previous gapplication open-path already hat its own callback. However, it seems to happen in the fast case as well.

I don't see a way to change the 25s default D-Bus timeout from outside (env variable or gapplication CLI option).

@martinpitt martinpitt force-pushed the debug-flatpak branch 3 times, most recently from bcbf97a to be2761d Compare December 6, 2023 10:33
@martinpitt
Copy link
Owner Author

Meh, now it changed to time out even earlier, in the first part of the test.

@martinpitt
Copy link
Owner Author

martinpitt commented Dec 6, 2023

In my interactive machine, at least the -ssh test works again if I revert cockpit-project#19689 -- oddly, that PR did not have a flatpak-test run ❓

Reverting it definitively changes the failure mode, so it is related. But it's not hunky-dory: The tests still hang a lot, and just fails differently.

@martinpitt martinpitt force-pushed the debug-flatpak branch 2 times, most recently from ddb9a4e to 3768fed Compare December 6, 2023 12:57
dependabot bot and others added 7 commits December 7, 2023 07:53
Bumps the patternfly group with 5 updates:

| Package | From | To |
| --- | --- | --- |
| [@patternfly/react-core](https://github.com/patternfly/patternfly-react) | `5.1.1` | `5.1.2` |
| [@patternfly/react-icons](https://github.com/patternfly/patternfly-react) | `5.1.1` | `5.1.2` |
| @patternfly/react-styles | `5.1.1` | `5.1.2` |
| [@patternfly/react-table](https://github.com/patternfly/patternfly-react) | `5.1.1` | `5.1.2` |
| [@patternfly/react-tokens](https://github.com/patternfly/patternfly-react) | `5.1.1` | `5.1.2` |

Updates `@patternfly/react-core` from 5.1.1 to 5.1.2
- [Release notes](https://github.com/patternfly/patternfly-react/releases)
- [Commits](https://github.com/patternfly/patternfly-react/compare/@patternfly/[email protected]...@patternfly/[email protected])

Updates `@patternfly/react-icons` from 5.1.1 to 5.1.2
- [Release notes](https://github.com/patternfly/patternfly-react/releases)
- [Commits](https://github.com/patternfly/patternfly-react/compare/@patternfly/[email protected]...@patternfly/[email protected])

Updates `@patternfly/react-styles` from 5.1.1 to 5.1.2

Updates `@patternfly/react-table` from 5.1.1 to 5.1.2
- [Release notes](https://github.com/patternfly/patternfly-react/releases)
- [Commits](https://github.com/patternfly/patternfly-react/compare/@patternfly/[email protected]...@patternfly/[email protected])

Updates `@patternfly/react-tokens` from 5.1.1 to 5.1.2
- [Release notes](https://github.com/patternfly/patternfly-react/releases)
- [Commits](https://github.com/patternfly/patternfly-react/compare/@patternfly/[email protected]...@patternfly/[email protected])

---
updated-dependencies:
- dependency-name: "@patternfly/react-core"
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: patternfly
- dependency-name: "@patternfly/react-icons"
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: patternfly
- dependency-name: "@patternfly/react-styles"
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: patternfly
- dependency-name: "@patternfly/react-table"
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: patternfly
- dependency-name: "@patternfly/react-tokens"
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: patternfly
...

Signed-off-by: dependabot[bot] <[email protected]>
In the case that we have a filesystem mounted by a user without an fstab
entry, Cockpit warns us that this won't be re-mounted after a restart.
The option is given to unmount the filesystem which leads to an error as
it uses the `old_dir` the fstab mountpoint which is undefined. As the
`mounted-no-config` condition can happen without an fstab entry and the
`no-mount-on-boot` only with an fstab entry explicitly provide the
directory to unmount to do_unmount. With this behaviour we don't fall
back to other when we shouldn't but raise an error which should never
happen (if everything is well covered).
Commit 747e7c2 renamed send_message() to send_json(). Relax the
pattern to get along with either (so that this also works for older
cockpit versions in e.g. cockpit-machines tests).
flake8 6.0 (current in Fedora 39 and our tasks container) complains
about these. Our unit tests still run version 5.0.4 which does not yet
see this.
Add a new Browser.get_pf_progress_value() utility method for this.
When the CPU is busy, the "Usage" card shows a warning icon next to the
progress bar meter, which shifts the layout.

Fixes cockpit-project#19614
When the bridge shuts down in an orderly fashion from an EOF on stdin,
we signal the endpoints to exit, wait for them to do so, and only close
the connection after that's done.

When the connection closes suddenly, such as from a BrokenPipeError, we
don't do this waiting around.  The result is that the endpoints are
never signalled to clean themselves up, and (crucially) we don't wait
for them to do so.

Let's move the communicate() method from the protocol level to the
router so that we can make it aware of endpoints.  Then we make two
changes:

 - if we didn't already call eof_received() then make sure it happens on
   disconnect of the transport.  This is what is responsible for asking
   the endpoints to close.

 - from the end of communicate(), unconditionally wait for all endpoints
   to have closed down.  This way we'll never get endpoints still
   running after the event loop has closed.

We accomplish that by adding a bit of redundant state: an asyncio.Event
which can be awaited, and tracks if we have no endpoints left.

We also drop calling .shutdown() on all of the routing rules.  This is
no longer necessary, as we're signalling all of the endpoints directly.

A minor tweak is needed to a test which was using communicate() on a
subclass of CockpitProtocolServer — it can derive from Router now.

This should remove all of the warnings we've been seeing about "Event
loop is closed" and "Task was destroyed but it is pending".

Closes cockpit-project#18355
@martinpitt
Copy link
Owner Author

martinpitt commented Dec 7, 2023

With COCKPIT_DEBUG=all and adding prints to cockpit-client, I locally see that the initial open takes a long time:

Dec 07 13:58:17 + gapplication action org.cockpit_project.CockpitClient open-path "=rupert@cockpit-client-test"
Dec 07 13:58:32 + test -f cockpit-client-test-rupert-cockpit-client-test-22

during that time there are no actual callback calls in cockpit-client, and it also times out waiting for it to make the SSH connection. The journal also doesn't say anything for that time.

When adding G_MESSAGES_DBUG=all there is some detail:

Dec 07 14:19:41 + gapplication action org.cockpit_project.CockpitClient open-path "=rupert@cockpit-client-test"
Dec 07 14:20:06 error sending ActivateAction message to application: Timeout was reached
Dec 07 14:20:06 + rm -rf /home/admin/.ssh
Dec 07 14:20:06 ** (flatpak-spawn:49): DEBUG: 14:20:06.160: Session bus connection closed, quitting
Dec 07 14:20:06 (process:209521): dconf-DEBUG: 14:20:06.160: D-Bus connection closed, invalidating cache: Underlying GIOStream returned 0 bytes on an async read
Dec 07 14:20:06 (/usr/libexec/flatpak-portal:209562): flatpak-DEBUG: 14:20:06.160: Name lost
Dec 07 14:20:06 ** (flatpak-spawn:29): DEBUG: 14:20:06.161: Session bus connection closed, quitting
Dec 07 14:20:06 (WebKitWebProcess:2): Gdk-DEBUG: 14:20:06.168: WebKitWebProcess: Fatal IO error 11 (Resource temporarily unavailable) on X server :99.0.
Dec 07 14:20:06 
Dec 07 14:20:06 (WebKitWebProcess:2): Gdk-DEBUG: 14:20:06.168: WebKitWebProcess: Fatal IO error 11 (Resource temporarily unavailable) on X server :99.0.
Dec 07 14:20:06 
Dec 07 14:20:06 (process:15): GLib-GIO-DEBUG: 14:20:06.191: GSocketClient: Starting new address enumeration
Dec 07 14:20:06 (process:15): GLib-GIO-DEBUG: 14:20:06.191: GSocketClient: Address enumeration succeeded
Dec 07 14:20:06 (/usr/libexec/flatpak-portal:209562): flatpak-DEBUG: 14:20:06.191: Client Pid 209584 died
Dec 07 14:20:06 (process:15): GLib-GIO-DEBUG: 14:20:06.191: GSocketClient: Starting TCP connection attempt
Dec 07 14:20:06 (/usr/libexec/flatpak-portal:209562): flatpak-DEBUG: 14:20:06.191: Client Pid 209612 died
Dec 07 14:20:06 (process:15): GLib-GIO-DEBUG: 14:20:06.191: GSocketClient: TCP connection successful
Dec 07 14:20:06 (process:15): GLib-GIO-DEBUG: 14:20:06.191: GSocketClient: Starting application layer connection
Dec 07 14:20:06 (process:15): GLib-GIO-DEBUG: 14:20:06.191: GSocketClient: Connection successful!

This really feels like a flatpak bug at this point..

martinpitt and others added 13 commits December 7, 2023 15:38
The latter is always present, even for rows that have no real usage
displayed, just to get the spacing right.
This call was introduced in commit 483027a, but this API only exists
since glib 2.72, while current RHEL 9.3 still has
2.68. This causes client to crash with an AttributeError.

Provide a fallback implementation.

https://issues.redhat.com/browse/RHEL-15020
…g a host

This ought to happen somewhere between the shell and the bridge, but is
difficult to fix. In the meantime, explicitly kill existing ssh
connetions to the target host when removing a host from the shell.
See cockpit-project#19672

This is going to be important for testing more remote host specification
variants, which will happen in the next commit.
…" field

When both are given, the explicitly given "User name:" field should win,
and the `user@` part of the connection string should get ignored. This
matches what ssh does with `ssh -l someone user@host`, which will
connect as `someone`.

generate_connection_string() gets along fine with an undefined `user` argument,
so we can always call it to remove the special case.
This fixes adding a remote host with address `ssh://hostname` and
a non-empty "User name" field.

Fixes cockpit-project#17635

Co-Authored-By: Martin Pitt <[email protected]>
Otherwise the restart of kdump.service migth sometimes not regenerate
the initrd since the old initrd has the same timestamp as the new
kdump.conf.

Fixes cockpit-project#19714
We removed support for `.min.` filenames because we thought that nobody
was using it anymore, but there are still external packages that make
use of it.

Add it back again, but only for the case where the non-minified version
isn't present.  The minified version can always be explicitly requested
via its full filename (minus `.gz`, if applicable).

Add a bit of debugging output and some test cases to make sure we handle
the combinations properly.

Fixes https://issues.redhat.com/browse/RHEL-18861
Make testMountingHelp more re-usable by using one variable for the
filesystem card header identifier.
Running just "make distcheck" somehow misses regressions in
test/static-code.
martinpitt and others added 5 commits December 11, 2023 22:14
Ubuntu ≥ 23.04 (mantic) integrates netplan into NetworkManager much more
tightly. The primary config is now in netplan. That particular check
just concerns our internal error handling of the networking page, and
it's enough to run it on one OS. Skip this on all Ubuntus, so that we
won't trip over this again once we move ubuntu-2204 to -2404.
While device major number 9 is mdraid, device mapper is *not* defined to
have major number 253. That was just happenstance, but it's wrong both
ways: On Ubuntu 23.04's kernel, devmapper devices sometimes have 252,
and virtio_blk devices sometimes have 253. That lead to "real" disk
drives to get ignored and thus the metrics page does not show any disk
I/O for them.

However, all device-mapper devices are always called "dm-*" in the
kernel, and naming happens through symlinks. So we can identify them by
name instead of having to rummage arounnd in /sys (which is expensive
for every sample).

[1] https://www.kernel.org/doc/Documentation/admin-guide/devices.txt
Bumps the esbuild group with 2 updates: [esbuild](https://github.com/evanw/esbuild) and [esbuild-wasm](https://github.com/evanw/esbuild).

Updates `esbuild` from 0.19.8 to 0.19.9
- [Release notes](https://github.com/evanw/esbuild/releases)
- [Changelog](https://github.com/evanw/esbuild/blob/main/CHANGELOG.md)
- [Commits](evanw/esbuild@v0.19.8...v0.19.9)

Updates `esbuild-wasm` from 0.19.8 to 0.19.9
- [Release notes](https://github.com/evanw/esbuild/releases)
- [Changelog](https://github.com/evanw/esbuild/blob/main/CHANGELOG.md)
- [Commits](evanw/esbuild@v0.19.8...v0.19.9)

---
updated-dependencies:
- dependency-name: esbuild
  dependency-type: direct:development
  update-type: version-update:semver-patch
  dependency-group: esbuild
- dependency-name: esbuild-wasm
  dependency-type: direct:development
  update-type: version-update:semver-patch
  dependency-group: esbuild
...

Signed-off-by: dependabot[bot] <[email protected]>
For visual consistency, and to avoid a table layout bug when the first
column of all rows is empty.

Only the "other" entries didn't have one yet.

Fixes: cockpit-project#19736
@martinpitt
Copy link
Owner Author

martinpitt commented Dec 12, 2023

Interesting! The flatpak test does still work on the rhel-9.3 branch, e.g. in cockpit-project#19740 - log; so it's not a spurious env change in github, so (1) we can bisect, and (2) have a comparison how broken it was before (not at all, the whole test runs in 10s and there are no hangs).

In a local ubuntu-22.04 VM I also get some hangs, though, so it's not very well suited to investigate this 🤯

martinpitt and others added 13 commits December 12, 2023 16:30
Also move the "show the RUN label" command further up where it belongs.

Fixes cockpit-project#19701
At no point pages should have requested an explicit locale specific file
like `test.de.js`. This has always been handled internally by the C
bridge's packages loader, depending on the currently set locale.

Since commit 1f34f76, the Python bridge handles `po.js`
specially, and the locale specific files only happen for them. This
doesn't belong into the "generic package file" lookup documentation, so
just drop that.

Also the ordering of `.min` lookups was wrong: older Python bridge
versions didn't do this at all, and commit 5d00faa pinned down the
ordering to prefer `test.js{,.gz}` over `test.min.js{,.gz}`.

Finally, there is no defined preference between `.gz` and plain.

Fixes cockpit-project#19720
This test started to fail yesterday and corrupt the VM. The dialog fails
(for an unknown reason), `vgroup0` stays busy and becomes unremovable.

This doesn't happen in a local `tmt` run, in TF runs against cockpit
itself (just in cross-project tests), nor in our fedora-rawhide CI
image, and we currently don't get enough debug/journal information to
say what's going on.

Skip the test on rawhide for the time being, so that we can improve our
test machinery and investigate this without causing failure and noise to
stratis, selinux, and udisks.
Most UI elements are already disabled/not editable, just these two were
left.

This creates a bit of noise in the pixel test due to the extra
`<Privilege>` wrapper, adjust the reference.

Thanks to @g-pape for the initial patch!

Fixes cockpit-project#19242
This fixes running a test multiple times on current ubuntu-stable with
NM's new netplan backend.
This reverts commit 0d76020.
This reverts commit c5aa7f6.
@martinpitt
Copy link
Owner Author

martinpitt commented Dec 13, 2023

As this is so difficult to reproduce locally, I am now doing some bisecting. The test started to fail around Dec 5, so I reverted these:

@martinpitt martinpitt force-pushed the debug-flatpak branch 3 times, most recently from 1e1f4ed to 5f1b3c2 Compare December 13, 2023 11:45
@martinpitt
Copy link
Owner Author

Curiously this has fixed itself.

@martinpitt martinpitt closed this Jan 2, 2024
@martinpitt martinpitt deleted the debug-flatpak branch January 2, 2024 13:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants