Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge master into feature/update-backwards-compat #5887

Merged
merged 69 commits into from
Jul 30, 2024

Conversation

gangj
Copy link
Contributor

@gangj gangj commented Jul 29, 2024

No description provided.

minglumlu and others added 30 commits July 9, 2024 12:06
Prior to this commit, the xapi on the coordinator host records the
'Unix.gettimeofday' as the timestamps of the heartbeat with other pool
supporter hosts. When the system clock is updated with a huge jump
forward, the timestamps would be far back into the past. This would
cause the xapi assumes that the hosts are offline as long time no
heartbeats.

In this commit, the timestamps are changed to get from a monotonic
clock. In this way, the system clock changes will not impact the
heartbeats' timestamps any more.

Additionally, Host_metrics.last_updated is only set when the object is
created. It's useless in check_host_liveness at all.

Signed-off-by: Ming Lu <[email protected]>
Xapi will send alert when the user is running on a Corosync 2 cluster
and prompting them to upgrade. This should only happen on XS 9 and is
behind the corosync3 feature flag.

XenCenter can take advantage of this alert message, but should have its
own warning message to tell the user why/how to perform the upgrade.

Signed-off-by: Vincent Liu <[email protected]>
By default message-switch calls are serialized for backwards compatibility reasons in the Lwt and Async modes.
(We tried enabling parallel actions by default but got some hard to debug failures in the CI).

This causes very long VM start times when multiple VBDs are plugged/unplugged concurrently: the operations are seen concurrently by message-switch,
but xapi-storage-script only retrieves and dispatches them sequentially, so any opportunity for parallel execution is lost.
Even though the actions themselves only take seconds, due to serialization, a VM start may take minutes.

Enable parallel processing explicitly here (instead of for all message-switch clients).
SMAPIv3 should expect to be called concurrently (on different hosts even), so in theory this change should be safe
and improve performance, but there are some known bugs in SMAPIv3 plugins currently.

So introduce a config file flag 'concurrent' for now, that defaults to false,
but that can be turned to 'true' for testing purposes.
When all SMAPIv3 concurrency bugs are believed to be fixed we can flip the default,
and eventually remove this flag once no more bugs are reported.
The configuration value is done as a global to simplify integrating intot he Lwt port,
instead of changing a lot of functions to thread through an argument.

This doesn't change the behaviour of xapi-storage-script in its default configuration.

Signed-off-by: Edwin Török <[email protected]>
CP-49634: Add alerting for Corosync upgrade
If we want to reproduce a failure we need to know the exact commandline that was used.
Longer than 80 chars is not a problem, this is a logfile, and a truncated line is worse than a long line.

Signed-off-by: Edwin Török <[email protected]>
Non-running VMs' metrics are stored in the coordinator. When the coordinator is
asked about the metrics try to unarchive them instead of failing while trying
to fetch the coordinator's IP address.

This needs to force the HTTP method of the query to be POST

Also returns a Service Unavailable when the host is marked as Broken.

Signed-off-by: Pau Ruiz Safont <[email protected]>
Forces users to use an address, instead of being implicit, this avoid the
underlying cause for the issue fixed in the previous commit: it allowed a
coordinator to call Pool_role.get_master_address, which always fails.

Signed-off-by: Pau Ruiz Safont <[email protected]>
This makes the selection of the action obvious, previously the two booleans
made it hazy to understand the decision, and was part of the error why the
coordinator tried to get the coordinator address from the pool_role file (and
failed badly)

Signed-off-by: Pau Ruiz Safont <[email protected]>
Currently a List.assoc is used, which raises an unhandled exception.

Signed-off-by: Pau Ruiz Safont <[email protected]>
Shell quoting changes in xen-api.git 65f152d
broke the dry-run functionality, as by quoting parameters in the way it was done
it meant the space separation was not properly handled to separate parameters
etc.

Signed-off-by: Alex Brett <[email protected]>
The xe-[backup,restore]-metadata scripts have cleanup logic designed to ensure
we do not leave any vbd objects etc behind.

This logic calls `vbd-unplug` with a 20s timeout, and then (regardless of the
result) allows up to 10s for any device specified in the VBD to disappear - if
it does not, it does not trigger a `vbd-destroy`.

This logic fails in the case where a VDI is attached to a VM running on the same
host, as the `device` field in the new VBD will be populated with the backend
device for the running VM. In this case, the `vbd-unplug` fails immediately (as
the vbd is not plugged because the original `vbd-plug` attempt fails as the VDI
is in use), but then we sit waiting for 10s for the device to disappear (which
is obviously does not), and then fail to trigger a `vbd-destroy`, leaving the
VBD behind.

Fix this by simply removing the wait for the device to disappear and always
attempting a `vbd-destroy`, as I am not aware of any situation where this
additional 10s wait will give any benefit given current behaviours.

Signed-off-by: Alex Brett <[email protected]>
This patch makes the feature use the debugfs utility, part of e2fsprogs. This
makes the system as a whole a heck of a lot better, if only because it won't be
able to parse XFS, ReiserFS or any of the many plugins of libfsimage.

Signed-off-by: Alejandro Vallejo <[email protected]>
Add a new option `-o` to xe-restore-metadata, which is used to distinguish
whether to allow use of legacy backup VDIs, or enforce only use of the new
format VDIs with known UUIDs.

Also modify xe-restore-metadata such that it no longer stops searching the
candidate list if only one VDI is found, but instead identifies all possible
backup VDIs. If more than one is found, and you are doing anything other than
listing the VDIs, the script will abort. This is to cover the case where a
malicious legacy format VDI is present - we will detect it and the expected
'real' backup VDI.

Modify xe-backup-metadata to always expect to use the deterministic UUID to
identify the VDI to add backups to, do not rely on the
`other-config:ctxs-pool-backup` property for identification in any way.

This is XSA-459 / CVE-2024-31144

Signed-off-by: Alex Brett <[email protected]>
- Quote a parameter
- Adjust how we check the returncode of some function calls to satifsy shellcheck
- Disable the warnings where we are explicitly relying on string splitting

Signed-off-by: Alex Brett <[email protected]>
This parameter was added in 7f1d315, but the
changes to always use the new metadata VDIs with known UUIDs means it is no
longer required, so remove it.

Signed-off-by: Alex Brett <[email protected]>
* Remove redundant parameter wiping

Removes ineffectual parameter wiping introduced by 6e24ca4.

Signed-off-by: Colin James <[email protected]>
…Unix/Lwt

It was a mix of Lwt and Unix code, which means that if the Unix call blocks the entire Lwt code blocks too.
This was only installed by the specfile in a -devel package.

`message-cli tail --follow` can be used to debug the IDL protocol instead.

Signed-off-by: Edwin Török <[email protected]>
Also add missing dependencies added recently

Signed-off-by: Pau Ruiz Safont <[email protected]>
Its implementation was identical, except for the use of time_limited_read in Threadext,
but the semantics is identical.

Use one well tested implementation instead of duplicating code.

One less function to convert to epoll.

Signed-off-by: Edwin Török <[email protected]>
[epoll] CP-47536: Drop posix_channel and channel_helper: unused and a mix of …
CA-395512: process SMAPIv3 API calls concurrently
edwintorok and others added 24 commits July 23, 2024 23:30
Reduces the amount of work the build has to do if we don't need to preprocess
everything, but only the few modules that actually using [@@deriving].

Signed-off-by: Edwin Török <[email protected]>
This signature is completely unused.
We could instead generate a client.mli, but that is more complicated,
currently the client.mli it'd generate wouldn't be polymorphic enough.

Signed-off-by: Edwin Török <[email protected]>
Server.ml takes a while to compile, but most unit tests don't actually need it.
Reorganize Api_server into Api_server+Api_server_common, where the latter
suffices for unit tests.

'dune runtest' times are improved:
```
hyperfine --min-runs 2 'dune clean; dune runtest --cache=disabled' 'cd ../scm-prev; dune clean; dune runtest --cache=disabled'
Benchmark 1: dune clean; dune runtest --cache=disabled
  Time (mean ± σ):     103.491 s ±  1.596 s    [User: 374.464 s, System: 125.957 s]
  Range (min … max):   102.363 s … 104.620 s    2 runs

Benchmark 2: cd ../scm-prev; dune clean; dune runtest --cache=disabled
  Time (mean ± σ):     114.158 s ±  2.980 s    [User: 380.638 s, System: 134.558 s]
  Range (min … max):   112.051 s … 116.266 s    2 runs

Summary
  dune clean; dune runtest --cache=disabled ran
    1.10 ± 0.03 times faster than cd ../scm-prev; dune clean; dune runtest --cache=disabled
```

Signed-off-by: Edwin Török <[email protected]>
`sexpr` is now fully thread safe without having to use locks,
doesn't need to depend on threadext.

`gen_api_main` can use the external `uuidm` module directly,
without waiting for internal one to be built.

`dune-build-info` is only needed by xapi_version.

`xapi-stdext-unix` is not needed in `xapi-idl`

The sexplib ppx runtime also doesn't need to be linked in some libraries that do not use it anymore,
and where it is used it'll be automatically linked.

Signed-off-by: Edwin Török <[email protected]>
Xapi_version depends on Build_info which can change on every commit.
It is better to remove it from the dependencies of gen_api_main,
especially that gen_api_main is on the critical path for discovering more
dependencies.

The 'xapi_user_agent' constant got moved to Xapi_version.

Signed-off-by: Edwin Török <[email protected]>
It'll be slower, but it can run a lot earlier in the build process.
Compiling the datamodels takes time, but compiling them for bytecode is faster.

Signed-off-by: Edwin Török <[email protected]>
Introduce a _minimal library, so we can start compiling server.ml earlier.

Build time reduced from 80s to:
```
Benchmark 1: dune clean; dune build --cache=disabled
  Time (mean ± σ):     67.081 s ±  0.190 s    [User: 326.847 s, System: 103.668 s
  Range (min … max):   66.946 s … 67.215 s    2 runs
```

Signed-off-by: Edwin Török <[email protected]>
Remove most sleeps, and reduce test duration from 5 seconds to 1.
(If we do want to run a performance test we can increase these again)

Run just a basic test for 0.1 seconds instead of a performance test for 5s by
default.
(can still be tweaked by overriding SECS)

Signed-off-by: Edwin Török <[email protected]>
Instead of every 5s. Speeds up testing, and may speed up startup somewhat.
And a connection try once every 0.5s won't create a lot of load on the system.
(If needed we could implement some form of exponential backoff here).

```
  Benchmark 1: dune clean; dune runtest --cache=disabled
  Time (mean ± σ):     97.642 s ±  0.933 s    [User: 354.132 s, System: 113.436 s]
  Range (min … max):   96.982 s … 98.302 s    2 runsi
```
Signed-off-by: Edwin Török <[email protected]>
These are errors in dune 3.15 and don't seem to be problematic

Signed-off-by: Pau Ruiz Safont <[email protected]>
Signed-off-by: Pau Ruiz Safont <[email protected]>
Also make dune generate the opam metadata

Signed-off-by: Pau Ruiz Safont <[email protected]>
Only tests need it to generate crypto keys, but it's needed to create the
serial when signing certificates.

Signed-off-by: Pau Ruiz Safont <[email protected]>
edwintorok and others added 3 commits July 29, 2024 10:10
We want to run these from 'quicktest', so make them available as libraries,
and add a _run.ml that would run them separately, just as before.
(running separately in the CI is better, because it can be parallelize)

No functional change.

Signed-off-by: Edwin Török <[email protected]>
Quicktest runs in Dom0, and the existing quickcheck tests run in the CI.
Some of these test as much the OCaml code, as its interaction with the system (e.g. behaviour of system calls).
So it is better to run these tests both in the CI and in Dom0.

We run these in long mode, to explore more randomly generated scenarios.

The seed can be controlled with QCHECK_SEED environment variable.
Similar to @StressTest it uses a random seed, instead of a fixed seed.

Signed-off-by: Edwin Török <[email protected]>
@gangj gangj merged commit e93b2a5 into feature/update-backwards-compat Jul 30, 2024
28 of 29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants