debug flatpak test failure #20

martinpitt · 2023-12-06T07:19:00Z

https://github.com/cockpit-project/cockpit/actions/runs/7101988137/job/19335024307?pr=19698

martinpitt · 2023-12-06T08:05:05Z

This reliably reproduces in an interactive tmate session, but not in a local vm-run ubuntu-2204 VM which replicates the setup (which is a bit tricky due to having to install a non-ancient node via n).

xvfb-run dbus-run-session containers/flatpak/test/test-browser fails as well (the CI run doesn't get that far), and I can reproduce that locally. So queuing that for later, and looking at the xvfb-run dbus-run-session containers/flatpak/test/test-ssh failure.

I can take out the first part of the test (before "Check if we can start via pure DBus activation"), this keeps the behaviour (both local success and GH hang/timeout). This removes some noise and speeds up iteration.

I added the following debug patch:

--- a/src/client/cockpit-client
+++ b/src/client/cockpit-client
@@ -232,9 +232,11 @@ class CockpitClient(Gtk.Application):
         self.activate()
 
     def quit_action(self, *args):
+        print("XXXX quit", args)
         self.quit()
 
     def open_path(self, action, parameter, *args):
+        print("XXXX open_path", action, parameter, args)
         CockpitClientWindow(self, self.uri + parameter.get_string()).present()
 
     def do_activate(self):

Locally, with | ts it shows that the gapplication quit call is sometimes instant, and sometimes already slow. When it's fast, it looks like this:

Dec 06 08:09:42 + gapplication action org.cockpit_project.CockpitClient quit
Dec 06 08:09:43 + flatpak ps
Dec 06 08:09:43 XXXX open_path <Gio.SimpleAction object at 0x7febbc584e00 (GSimpleAction at 0x561089752530)> '=cockpit-client-test' (None,)
Dec 06 08:09:43 XXXX quit (<Gio.SimpleAction object at 0x7febbc51a400 (GSimpleAction at 0x561089734cc0)>, None, None)

and when it's slow, it looks like this:

Dec 06 07:57:05 + gapplication action org.cockpit_project.CockpitClient quit
Dec 06 07:57:21 XXXX open_path <Gio.SimpleAction object at 0x7fd32553ac80 (GSimpleAction at 0x55f07b3eb040)> '=cockpit-client-test' (None,)
Dec 06 07:57:21 XXXX quit (<Gio.SimpleAction object at 0x7fd32553ac80 (GSimpleAction at 0x55f07b5d64a0)>, None, None)
Dec 06 07:57:21 + flatpak ps

it always times out on GH, without even reaching the callbacks:

Dec 06 07:42:08 + gapplication action org.cockpit_project.CockpitClient quit
Dec 06 07:42:33 error sending ActivateAction message to application: Timeout was reached

So that "XXX open_path" callback is confusing -- that shouldn't happen after "quit", and the previous gapplication open-path already hat its own callback. However, it seems to happen in the fast case as well.

I don't see a way to change the 25s default D-Bus timeout from outside (env variable or gapplication CLI option).

martinpitt · 2023-12-06T10:44:34Z

Meh, now it changed to time out even earlier, in the first part of the test.

martinpitt · 2023-12-06T12:46:27Z

In my interactive machine, at least the -ssh test works again if I revert cockpit-project#19689 -- oddly, that PR did not have a flatpak-test run ❓

Reverting it definitively changes the failure mode, so it is related. But it's not hunky-dory: The tests still hang a lot, and just fails differently.

Bumps the patternfly group with 5 updates: | Package | From | To | | --- | --- | --- | | [@patternfly/react-core](https://github.com/patternfly/patternfly-react) | `5.1.1` | `5.1.2` | | [@patternfly/react-icons](https://github.com/patternfly/patternfly-react) | `5.1.1` | `5.1.2` | | @patternfly/react-styles | `5.1.1` | `5.1.2` | | [@patternfly/react-table](https://github.com/patternfly/patternfly-react) | `5.1.1` | `5.1.2` | | [@patternfly/react-tokens](https://github.com/patternfly/patternfly-react) | `5.1.1` | `5.1.2` | Updates `@patternfly/react-core` from 5.1.1 to 5.1.2 - [Release notes](https://github.com/patternfly/patternfly-react/releases) - [Commits](https://github.com/patternfly/patternfly-react/compare/@patternfly/[email protected]...@patternfly/[email protected]) Updates `@patternfly/react-icons` from 5.1.1 to 5.1.2 - [Release notes](https://github.com/patternfly/patternfly-react/releases) - [Commits](https://github.com/patternfly/patternfly-react/compare/@patternfly/[email protected]...@patternfly/[email protected]) Updates `@patternfly/react-styles` from 5.1.1 to 5.1.2 Updates `@patternfly/react-table` from 5.1.1 to 5.1.2 - [Release notes](https://github.com/patternfly/patternfly-react/releases) - [Commits](https://github.com/patternfly/patternfly-react/compare/@patternfly/[email protected]...@patternfly/[email protected]) Updates `@patternfly/react-tokens` from 5.1.1 to 5.1.2 - [Release notes](https://github.com/patternfly/patternfly-react/releases) - [Commits](https://github.com/patternfly/patternfly-react/compare/@patternfly/[email protected]...@patternfly/[email protected]) --- updated-dependencies: - dependency-name: "@patternfly/react-core" dependency-type: direct:production update-type: version-update:semver-patch dependency-group: patternfly - dependency-name: "@patternfly/react-icons" dependency-type: direct:production update-type: version-update:semver-patch dependency-group: patternfly - dependency-name: "@patternfly/react-styles" dependency-type: direct:production update-type: version-update:semver-patch dependency-group: patternfly - dependency-name: "@patternfly/react-table" dependency-type: direct:production update-type: version-update:semver-patch dependency-group: patternfly - dependency-name: "@patternfly/react-tokens" dependency-type: direct:production update-type: version-update:semver-patch dependency-group: patternfly ... Signed-off-by: dependabot[bot] <[email protected]>

In the case that we have a filesystem mounted by a user without an fstab entry, Cockpit warns us that this won't be re-mounted after a restart. The option is given to unmount the filesystem which leads to an error as it uses the `old_dir` the fstab mountpoint which is undefined. As the `mounted-no-config` condition can happen without an fstab entry and the `no-mount-on-boot` only with an fstab entry explicitly provide the directory to unmount to do_unmount. With this behaviour we don't fall back to other when we shouldn't but raise an error which should never happen (if everything is well covered).

Commit 747e7c2 renamed send_message() to send_json(). Relax the pattern to get along with either (so that this also works for older cockpit versions in e.g. cockpit-machines tests).

flake8 6.0 (current in Fedora 39 and our tasks container) complains about these. Our unit tests still run version 5.0.4 which does not yet see this.

Add a new Browser.get_pf_progress_value() utility method for this.

When the CPU is busy, the "Usage" card shows a warning icon next to the progress bar meter, which shifts the layout. Fixes cockpit-project#19614

When the bridge shuts down in an orderly fashion from an EOF on stdin, we signal the endpoints to exit, wait for them to do so, and only close the connection after that's done. When the connection closes suddenly, such as from a BrokenPipeError, we don't do this waiting around. The result is that the endpoints are never signalled to clean themselves up, and (crucially) we don't wait for them to do so. Let's move the communicate() method from the protocol level to the router so that we can make it aware of endpoints. Then we make two changes: - if we didn't already call eof_received() then make sure it happens on disconnect of the transport. This is what is responsible for asking the endpoints to close. - from the end of communicate(), unconditionally wait for all endpoints to have closed down. This way we'll never get endpoints still running after the event loop has closed. We accomplish that by adding a bit of redundant state: an asyncio.Event which can be awaited, and tracks if we have no endpoints left. We also drop calling .shutdown() on all of the routing rules. This is no longer necessary, as we're signalling all of the endpoints directly. A minor tweak is needed to a test which was using communicate() on a subclass of CockpitProtocolServer — it can derive from Router now. This should remove all of the warnings we've been seeing about "Event loop is closed" and "Task was destroyed but it is pending". Closes cockpit-project#18355

martinpitt · 2023-12-07T14:00:33Z

With COCKPIT_DEBUG=all and adding prints to cockpit-client, I locally see that the initial open takes a long time:

Dec 07 13:58:17 + gapplication action org.cockpit_project.CockpitClient open-path "=rupert@cockpit-client-test"
Dec 07 13:58:32 + test -f cockpit-client-test-rupert-cockpit-client-test-22

during that time there are no actual callback calls in cockpit-client, and it also times out waiting for it to make the SSH connection. The journal also doesn't say anything for that time.

When adding G_MESSAGES_DBUG=all there is some detail:

Dec 07 14:19:41 + gapplication action org.cockpit_project.CockpitClient open-path "=rupert@cockpit-client-test"
Dec 07 14:20:06 error sending ActivateAction message to application: Timeout was reached
Dec 07 14:20:06 + rm -rf /home/admin/.ssh
Dec 07 14:20:06 ** (flatpak-spawn:49): DEBUG: 14:20:06.160: Session bus connection closed, quitting
Dec 07 14:20:06 (process:209521): dconf-DEBUG: 14:20:06.160: D-Bus connection closed, invalidating cache: Underlying GIOStream returned 0 bytes on an async read
Dec 07 14:20:06 (/usr/libexec/flatpak-portal:209562): flatpak-DEBUG: 14:20:06.160: Name lost
Dec 07 14:20:06 ** (flatpak-spawn:29): DEBUG: 14:20:06.161: Session bus connection closed, quitting
Dec 07 14:20:06 (WebKitWebProcess:2): Gdk-DEBUG: 14:20:06.168: WebKitWebProcess: Fatal IO error 11 (Resource temporarily unavailable) on X server :99.0.
Dec 07 14:20:06 
Dec 07 14:20:06 (WebKitWebProcess:2): Gdk-DEBUG: 14:20:06.168: WebKitWebProcess: Fatal IO error 11 (Resource temporarily unavailable) on X server :99.0.
Dec 07 14:20:06 
Dec 07 14:20:06 (process:15): GLib-GIO-DEBUG: 14:20:06.191: GSocketClient: Starting new address enumeration
Dec 07 14:20:06 (process:15): GLib-GIO-DEBUG: 14:20:06.191: GSocketClient: Address enumeration succeeded
Dec 07 14:20:06 (/usr/libexec/flatpak-portal:209562): flatpak-DEBUG: 14:20:06.191: Client Pid 209584 died
Dec 07 14:20:06 (process:15): GLib-GIO-DEBUG: 14:20:06.191: GSocketClient: Starting TCP connection attempt
Dec 07 14:20:06 (/usr/libexec/flatpak-portal:209562): flatpak-DEBUG: 14:20:06.191: Client Pid 209612 died
Dec 07 14:20:06 (process:15): GLib-GIO-DEBUG: 14:20:06.191: GSocketClient: TCP connection successful
Dec 07 14:20:06 (process:15): GLib-GIO-DEBUG: 14:20:06.191: GSocketClient: Starting application layer connection
Dec 07 14:20:06 (process:15): GLib-GIO-DEBUG: 14:20:06.191: GSocketClient: Connection successful!

This really feels like a flatpak bug at this point..

Commit c7c54d4 fixed that. Fixes cockpit-project#18386

The latter is always present, even for rows that have no real usage displayed, just to get the spacing right.

https://issues.redhat.com/browse/RHEL-18520

This call was introduced in commit 483027a, but this API only exists since glib 2.72, while current RHEL 9.3 still has 2.68. This causes client to crash with an AttributeError. Provide a fallback implementation. https://issues.redhat.com/browse/RHEL-15020

…g a host This ought to happen somewhere between the shell and the bridge, but is difficult to fix. In the meantime, explicitly kill existing ssh connetions to the target host when removing a host from the shell. See cockpit-project#19672 This is going to be important for testing more remote host specification variants, which will happen in the next commit.

…" field When both are given, the explicitly given "User name:" field should win, and the `user@` part of the connection string should get ignored. This matches what ssh does with `ssh -l someone user@host`, which will connect as `someone`. generate_connection_string() gets along fine with an undefined `user` argument, so we can always call it to remove the special case.

This fixes adding a remote host with address `ssh://hostname` and a non-empty "User name" field. Fixes cockpit-project#17635 Co-Authored-By: Martin Pitt <[email protected]>

Otherwise the restart of kdump.service migth sometimes not regenerate the initrd since the old initrd has the same timestamp as the new kdump.conf. Fixes cockpit-project#19714

We removed support for `.min.` filenames because we thought that nobody was using it anymore, but there are still external packages that make use of it. Add it back again, but only for the case where the non-minified version isn't present. The minified version can always be explicitly requested via its full filename (minus `.gz`, if applicable). Add a bit of debugging output and some test cases to make sure we handle the combinations properly. Fixes https://issues.redhat.com/browse/RHEL-18861

Make testMountingHelp more re-usable by using one variable for the filesystem card header identifier.

Running just "make distcheck" somehow misses regressions in test/static-code.

It is broken in testing.

Ubuntu ≥ 23.04 (mantic) integrates netplan into NetworkManager much more tightly. The primary config is now in netplan. That particular check just concerns our internal error handling of the networking page, and it's enough to run it on one OS. Skip this on all Ubuntus, so that we won't trip over this again once we move ubuntu-2204 to -2404.

While device major number 9 is mdraid, device mapper is *not* defined to have major number 253. That was just happenstance, but it's wrong both ways: On Ubuntu 23.04's kernel, devmapper devices sometimes have 252, and virtio_blk devices sometimes have 253. That lead to "real" disk drives to get ignored and thus the metrics page does not show any disk I/O for them. However, all device-mapper devices are always called "dm-*" in the kernel, and naming happens through symlinks. So we can identify them by name instead of having to rummage arounnd in /sys (which is expensive for every sample). [1] https://www.kernel.org/doc/Documentation/admin-guide/devices.txt

Bumps the esbuild group with 2 updates: [esbuild](https://github.com/evanw/esbuild) and [esbuild-wasm](https://github.com/evanw/esbuild). Updates `esbuild` from 0.19.8 to 0.19.9 - [Release notes](https://github.com/evanw/esbuild/releases) - [Changelog](https://github.com/evanw/esbuild/blob/main/CHANGELOG.md) - [Commits](evanw/esbuild@v0.19.8...v0.19.9) Updates `esbuild-wasm` from 0.19.8 to 0.19.9 - [Release notes](https://github.com/evanw/esbuild/releases) - [Changelog](https://github.com/evanw/esbuild/blob/main/CHANGELOG.md) - [Commits](evanw/esbuild@v0.19.8...v0.19.9) --- updated-dependencies: - dependency-name: esbuild dependency-type: direct:development update-type: version-update:semver-patch dependency-group: esbuild - dependency-name: esbuild-wasm dependency-type: direct:development update-type: version-update:semver-patch dependency-group: esbuild ... Signed-off-by: dependabot[bot] <[email protected]>

For visual consistency, and to avoid a table layout bug when the first column of all rows is empty. Only the "other" entries didn't have one yet. Fixes: cockpit-project#19736

martinpitt · 2023-12-12T13:55:28Z

Interesting! The flatpak test does still work on the rhel-9.3 branch, e.g. in cockpit-project#19740 - log; so it's not a spurious env change in github, so (1) we can bisect, and (2) have a comparison how broken it was before (not at all, the whole test runs in 10s and there are no hangs).

In a local ubuntu-22.04 VM I also get some hangs, though, so it's not very well suited to investigate this 🤯

Also move the "show the RUN label" command further up where it belongs. Fixes cockpit-project#19701

At no point pages should have requested an explicit locale specific file like `test.de.js`. This has always been handled internally by the C bridge's packages loader, depending on the currently set locale. Since commit 1f34f76, the Python bridge handles `po.js` specially, and the locale specific files only happen for them. This doesn't belong into the "generic package file" lookup documentation, so just drop that. Also the ordering of `.min` lookups was wrong: older Python bridge versions didn't do this at all, and commit 5d00faa pinned down the ordering to prefer `test.js{,.gz}` over `test.min.js{,.gz}`. Finally, there is no defined preference between `.gz` and plain. Fixes cockpit-project#19720

Closes cockpit-project#19742

This test started to fail yesterday and corrupt the VM. The dialog fails (for an unknown reason), `vgroup0` stays busy and becomes unremovable. This doesn't happen in a local `tmt` run, in TF runs against cockpit itself (just in cross-project tests), nor in our fedora-rawhide CI image, and we currently don't get enough debug/journal information to say what's going on. Skip the test on rawhide for the time being, so that we can improve our test machinery and investigate this without causing failure and noise to stratis, selinux, and udisks.

@g-pape

Most UI elements are already disabled/not editable, just these two were left. This creates a bit of noise in the pixel test due to the extra `<Privilege>` wrapper, adjust the reference. Thanks to @g-pape for the initial patch! Fixes cockpit-project#19242

This fixes running a test multiple times on current ubuntu-stable with NM's new netplan backend.

This reverts commit 0d76020.

This reverts commit 1ea2ddc.

This reverts commit c5aa7f6.

martinpitt · 2023-12-13T11:14:11Z

As this is so difficult to reproduce locally, I am now doing some bisecting. The test started to fail around Dec 5, so I reverted these:

src/ back to 36877c3: passed test-ssh and "just" failed in test-browser
src/ back to 82c59a0: ssh test times out
Reverting commits c7c54d4 and e71bf74: surprisingly, still broken
Additionally reverting commits 7656e7b and 82c59a0: still broken; these are now all commits between the range of the first two points that touch src/. There are no changes in containers/ other than a trivial whitespace fix. So it now smells like this isn't actually triggered by a change in src/ or containers, but that contradicts the first result.
Reverting node_modules to rhel-9.3 version: different failure mode similar to debug flatpak test failure #20 (comment), so it does have an impact.

martinpitt · 2024-01-02T13:14:14Z

Curiously this has fixed itself.

martinpitt force-pushed the debug-flatpak branch 3 times, most recently from bcbf97a to be2761d Compare December 6, 2023 10:33

martinpitt force-pushed the debug-flatpak branch from be2761d to 6011868 Compare December 6, 2023 12:26

martinpitt force-pushed the debug-flatpak branch 2 times, most recently from ddb9a4e to 3768fed Compare December 6, 2023 12:57

dependabot bot and others added 7 commits December 7, 2023 07:53

test: Adjust method name in cockpit-project#18355 message ignoring

66bb5a9

Commit 747e7c2 renamed send_message() to send_json(). Relax the pattern to get along with either (so that this also works for older cockpit versions in e.g. cockpit-machines tests).

Fix flake8 whitespace in f-strings

41b31e6

flake8 6.0 (current in Fedora 39 and our tasks container) complains about these. Our unit tests still run version 5.0.4 which does not yet see this.

test: Factorize getting PF <ProgressBar> value

62ca5e7

Add a new Browser.get_pf_progress_value() utility method for this.

test: Fix "busy CPU" pixel flake on overview page

ce28cb4

When the CPU is busy, the "Usage" card shows a warning icon next to the progress bar meter, which shifts the layout. Fixes cockpit-project#19614

martinpitt and others added 13 commits December 7, 2023 15:38

test: Drop "Task was destroyed but it is pending" ignored traceback

a87ae6d

Commit c7c54d4 fixed that. Fixes cockpit-project#18386

test: Wait for real usage, not just a element with a ".usage-bar" class

c6f5e84

The latter is always present, even for rows that have no real usage displayed, just to get the spacing right.

bridge: Add back menu/tool names to cockpit-bridge --packages

7656e7b

https://issues.redhat.com/browse/RHEL-18520

test: Drop unused machine parameter in machine_remove()

d03b12d

shell: Trim ssh:// from connection address

6547000

This fixes adding a remote host with address `ssh://hostname` and a non-empty "User name" field. Fixes cockpit-project#17635 Co-Authored-By: Martin Pitt <[email protected]>

test: Delay writing of new kdump.conf

f93c27d

Otherwise the restart of kdump.service migth sometimes not regenerate the initrd since the old initrd has the same timestamp as the new kdump.conf. Fixes cockpit-project#19714

test: storaged: use a variable for "ext4 filesystem"

f5966d2

Make testMountingHelp more re-usable by using one variable for the filesystem card header identifier.

unit-tests: Run "make check" when refreshing the container

95eda94

Running just "make distcheck" somehow misses regressions in test/static-code.

unit-tests: Install python3-flake8 from unstable

75ceb58

It is broken in testing.

martinpitt and others added 5 commits December 11, 2023 22:14

test: Use defined constants instead of repeated values in TestWireGuard

9613e51

storage: Give all top-level rows an icon

114af86

For visual consistency, and to avoid a table layout bug when the first column of all rows is empty. Only the "other" entries didn't have one yet. Fixes: cockpit-project#19736

martinpitt and others added 13 commits December 12, 2023 16:30

containers/ws: Show how to append extra options to RUN label

6f6871c

Also move the "show the RUN label" command further up where it belongs. Fixes cockpit-project#19701

po: Update from Fedora Weblate

6eb6036

Closes cockpit-project#19742

test: Restore netplan configuration/state in NetworkCase

d39a2fb

This fixes running a test multiple times on current ubuntu-stable with NM's new netplan backend.

XXX disable unrelated tests

5459281

tmate

ba2950a

Revert "tmate"

d4f81dc

This reverts commit 0d76020.

cockpit-client debug messages

edf823b

Revert "cockpit-client debug messages"

c90f569

This reverts commit 1ea2ddc.

XXX test

bfd9f8c

Revert "XXX test"

13ab1f0

This reverts commit c5aa7f6.

martinpitt force-pushed the debug-flatpak branch from 3768fed to 55327ee Compare December 13, 2023 11:11

martinpitt force-pushed the debug-flatpak branch 3 times, most recently from 1e1f4ed to 5f1b3c2 Compare December 13, 2023 11:45

XXX revert node_modules to rhel-9.3

197f175

martinpitt force-pushed the debug-flatpak branch from 5f1b3c2 to 197f175 Compare December 13, 2023 11:49

XXX revert src to rhel-9.3

7cdcf12

martinpitt closed this Jan 2, 2024

martinpitt deleted the debug-flatpak branch January 2, 2024 13:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

debug flatpak test failure #20

debug flatpak test failure #20

martinpitt commented Dec 6, 2023

martinpitt commented Dec 6, 2023 •

edited

Loading

martinpitt commented Dec 6, 2023

martinpitt commented Dec 6, 2023 •

edited

Loading

martinpitt commented Dec 7, 2023 •

edited

Loading

martinpitt commented Dec 12, 2023 •

edited

Loading

martinpitt commented Dec 13, 2023 •

edited

Loading

martinpitt commented Jan 2, 2024

debug flatpak test failure #20

debug flatpak test failure #20

Conversation

martinpitt commented Dec 6, 2023

martinpitt commented Dec 6, 2023 • edited Loading

martinpitt commented Dec 6, 2023

martinpitt commented Dec 6, 2023 • edited Loading

martinpitt commented Dec 7, 2023 • edited Loading

martinpitt commented Dec 12, 2023 • edited Loading

martinpitt commented Dec 13, 2023 • edited Loading

martinpitt commented Jan 2, 2024

martinpitt commented Dec 6, 2023 •

edited

Loading

martinpitt commented Dec 6, 2023 •

edited

Loading

martinpitt commented Dec 7, 2023 •

edited

Loading

martinpitt commented Dec 12, 2023 •

edited

Loading

martinpitt commented Dec 13, 2023 •

edited

Loading