Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[QUIC] Disable QUIC tests on CLR stress tests #96392

Merged
merged 2 commits into from
Jan 4, 2024

Conversation

ManickaP
Copy link
Member

@ManickaP ManickaP commented Jan 2, 2024

Fixes #86565

I'm not sure about disabling the tests completely on Arm32, any opinions @wfurt @rzikm @CarnaViire

@ghost
Copy link

ghost commented Jan 2, 2024

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

Fixes #86565

I'm not sure about disabling the tests completely on Arm32, any opinions @wfurt @rzikm @CarnaViire

Author: ManickaP
Assignees: -
Labels:

area-System.Net.Quic

Milestone: -

@@ -10,6 +10,7 @@ namespace System.Net.Quic.Tests
{
[Collection(nameof(DisableParallelization))]
[ConditionalClass(typeof(QuicTestBase), nameof(QuicTestBase.IsSupported))]
[SkipOnCoreClr("MsQuic is timing sensitive and flaky under stress", ~RuntimeTestModes.RegularRun)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no way to make the tests resilient against timing issues? What specifically is the problem that leads to the flakiness?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We suspect #55979 as root cause. There is logic in MsQuic to watch load and start refusing connection if CPU utilization is too high. For normal tests runs we disabled parallel runs to avoid it but the stress may be still be too much.

To start with, we may disable stress only on ARM. It seems like it failed on ARM64 as well but it is hard to tell much as most of the links leads to nowhere now ;(

@ManickaP
Copy link
Member Author

ManickaP commented Jan 3, 2024

I did a bit digging in Kusto:

  • jit stress re-runs/failures are only on Arm32v7 --> I can narrow down the exclusion
  • the most re-runs/failures are on libraries-pgo Arm32v7, but the last one was on 22.10.
  • out of ~4700 re-runs/failures ~4500 are Arm32v7
  • there's no way to know how much passes we got on Arm32v7 so no idea about the percentage

Query:

AzureDevOpsTests
| where TestName has "System.Net.Quic" and Repository == "dotnet/runtime" and Message has "The connection timed out from inactivity"
| summarize count(), max(RunCompleted) by TestRunName, BuildDefinitionName

@CarnaViire
Copy link
Member

Thanks @ManickaP.

the most re-runs/failures are on libraries-pgo Arm32v7, but the last one was on 22.10.

I wonder if that was #91757 (it was mitigated around Nov 13)

I think we should disable QUIC tests on jitstress runs on Arm32 and Arm64 (because Arm64 was also reported there) but keep them for other platforms.

I'd vote for not disabling Arm32 completely yet, if we don't see failures in our normal runs since Nov 13...

@ManickaP
Copy link
Member Author

ManickaP commented Jan 3, 2024

So I don't know how to combine SkipOnCoreClr and PlatformDetection in AND manner (i.e. to not exclude all Arm32 test runs). The only way I can see now is to copy the logic from https://github.com/dotnet/arcade/blob/main/src/Microsoft.DotNet.XUnitExtensions/src/CoreClrConfigurationDetection.cs and use ConditionalClass with my own condition for ClrStress run on Arm32.

Is there an easier way @ViktorHofer, @akoeplinger, @BruceForstall? Is this even worth the effort @wfurt @rzikm @CarnaViire?

@ManickaP
Copy link
Member Author

ManickaP commented Jan 3, 2024

because Arm64 was also reported there

Interestingly, there are no such records in Kusto.

@BruceForstall
Copy link
Member

Is there an easier way...

I'm certainly no expert in libraries test exclusions, but I do believe there is no "AND" functionality, as you've found.

@BruceForstall
Copy link
Member

btw, thank you for working on this!

@wfurt
Copy link
Member

wfurt commented Jan 3, 2024

because Arm64 was also reported there

Interestingly, there are no such records in Kusto.

there may be other occasional failures on other platforms. I think it would be OK to exclude arm32 since that is was majority and wait. We may need to create new static variable for the condition - especially if we want to be more selective.

@ManickaP
Copy link
Member Author

ManickaP commented Jan 4, 2024

/azp list

Copy link

CI/CD Pipelines for this repository:

@ManickaP
Copy link
Member Author

ManickaP commented Jan 4, 2024

/azp run runtime-coreclr libraries-jitstress

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@ManickaP
Copy link
Member Author

ManickaP commented Jan 4, 2024

/azp run runtime-coreclr libraries-pgo

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@ManickaP
Copy link
Member Author

ManickaP commented Jan 4, 2024

I manually checked the AzDO pipeline:

  • PR (Arm32 stress is skipped, Arm64 stress still ran)
    image
  • main (baseline)
    image

And failures are unrelated. Can I get any final reviews?

Copy link
Member

@CarnaViire CarnaViire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@ManickaP ManickaP merged commit baeb298 into dotnet:main Jan 4, 2024
142 of 149 checks passed
@ManickaP ManickaP deleted the mapichov/quic-clrstress-disable branch January 4, 2024 14:57
@akoeplinger
Copy link
Member

So I don't know how to combine SkipOnCoreClr and PlatformDetection in AND manner (i.e. to not exclude all Arm32 test runs). The only way I can see now is to copy the logic from dotnet/arcade@main/src/Microsoft.DotNet.XUnitExtensions/src/CoreClrConfigurationDetection.cs and use ConditionalClass with my own condition for ClrStress run on Arm32.

You don't need to copy the logic since CoreClrConfigurationDetection.IsJitStress is public so you can use it in a custom property in PlatformDetection which is used as the condition in [ConditionalClass].

@github-actions github-actions bot locked and limited conversation to collaborators Feb 5, 2024
@karelz karelz added this to the 9.0.0 milestone May 14, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Test failure System.Net.Quic.Tests.QuicStreamTests.ReadWrite_Random_Success
7 participants