Tests fail on ARM runner #450

Closed
danielhollas opened this issue May 8, 2024 · 9 comments

@danielhollas
Contributor

danielhollas commented May 8, 2024

The test_create_conda_environment test seems to be failing quite often on the ARM build. I am not sure what is happening.

@unkcpz can you investigate if you can reproduce this locally? I am seeing this both on main and in #439.

See e.g. https://github.com/aiidalab/aiidalab-docker-stack/actions/runs/9006100849/job/24744034987


aiidalab_exec = <function aiidalab_exec.<locals>.execute at 0x10696aa20>
nb_user = 'jovyan'

    def test_create_conda_environment(aiidalab_exec, nb_user):
>       output = aiidalab_exec("conda create -y -n tmp", user=nb_user).strip()

tests/test_base.py:35: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/conftest.py:73: in execute
    out = docker_compose.execute(command, **kwargs)
../../../../.venv/aiidalab-runner/lib/python3.11/site-packages/pytest_docker/plugin.py:140: in execute
    return execute(command)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

command = 'docker compose -f "stack/docker-compose.full-stack.yml" -p "pytest36426" exec -T --user=jovyan aiidalab conda create -y -n tmp'
success_codes = (0,)

    def execute(command: str, success_codes: Iterable[int] = (0,)) -> Union[bytes, Any]:
        """Run a shell command."""
        try:
            output = subprocess.check_output(command, stderr=subprocess.STDOUT, shell=True)
            status = 0
        except subprocess.CalledProcessError as error:
            output = error.output or b""
            status = error.returncode
            command = error.cmd
    
        if status not in success_codes:
>           raise Exception(
                'Command {} returned {}: """{}""".'.format(command, status, output.decode("utf-8"))
            )
E           Exception: Command docker compose -f "stack/docker-compose.full-stack.yml" -p "pytest36426" exec -T --user=jovyan aiidalab conda create -y -n tmp returned 137: """Collecting package metadata (current_repodata.json): ...working... done
E           Solving environment: ...working... done
E           """.

../../../../.venv/aiidalab-runner/lib/python3.11/site-packages/pytest_docker/plugin.py:35: Exception
=========================== short test summary info ============================
FAILED tests/test_base.py::test_create_conda_environment - Exception: Command docker compose -f "stack/docker-compose.full-stack.yml" -p "pytest36426" exec -T --user=jovyan aiidalab conda create -y -n tmp returned 137: """Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done
""".
@danielhollas
Contributor Author

Maybe memory issue? https://www.google.com/search?client=firefox-b-lm&q=command+return+137

This is currently blocking release. Maybe some memory needs to be freed on the Mac runner?
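
For reference, a status above 128 conventionally means the process was terminated by a signal, with the signal number being the status minus 128. Under that convention 137 corresponds to signal 9 (SIGKILL), which is what the kernel's out-of-memory killer sends, although SIGKILL can also come from other sources. A minimal sketch of the decoding, assuming a POSIX system (Linux or macOS); the helper name describe_exit_status is made up for illustration:

```python
import signal


def describe_exit_status(status: int) -> str:
    """Explain a shell-style exit status, decoding the 128 + N signal convention."""
    if status == 0:
        return "success"
    if status > 128:
        # By convention, 128 + N means "terminated by signal N".
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"unknown signal {signum}"
        return f"terminated by {name}"
    return f"exited with error code {status}"


print(describe_exit_status(137))  # terminated by SIGKILL
```

This only identifies the signal. It does not show who sent it, so a memory problem remains a hypothesis until the runner or container memory limits are checked.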

@unkcpz
Member

unkcpz commented May 8, 2024

> This is currently blocking release. Maybe some memory needs to be freed on the Mac runner?

I guess so, but I don't know how to do that. I will check on my laptop.

@danielhollas
Contributor Author

Just a note that this is currently still blocking release, ARM tests are failing consistently now.

danielhollas changed the title from "Flaky test" to "Flaky tests on ARM" May 9, 2024
@unkcpz
Member

unkcpz commented May 9, 2024

It is more of a qeapp test in arm64 rather than the pure architecture issue. So I will bring up again that having the qeapp integration tests here may not be a good idea.
I feel the same as you: I am not comfortable with a release of the aiidalab docker stack that breaks the qeapp. But in the end the problem has usually originated on the QeApp side rather than here. The fixes and changes usually have to be made downstream, and not much can be done from this repo.
We have encountered this problem twice:

  1. When we wanted to move to aiida-core==2.5 with pydantic v2. In the end the changes were made in qeapp, which switched to ipyoptimade to support pydantic v2.
  2. The problem happening now. I think it is a dependency issue in qeapp (it seems to be the compilation of pymatgen) that makes the arm64 installation fail the full-stack test here.

Logically, the full-stack is the upstream of qeapp. It makes less sense to have failing tests block the change from docker stack.

@danielhollas
Contributor Author

> It is more of a qeapp test in arm64 rather than the pure architecture issue.

This is not true though, qeapp integration tests are not the only ones that are failing now, see e.g. https://github.com/aiidalab/aiidalab-docker-stack/actions/runs/9005715615/job/24742373269

I don't think the tests are at fault here, it's an issue with the ARM64 runner. @mbercx could you try restarting the machine?
(or ideally, investigate what is happening there. Is there enough free RAM?)

> Logically, the full-stack is the upstream of qeapp. It makes less sense to have failing tests block the change from docker stack.

Yes, those tests should not block a release, which is why I have separated them into a separate CI job in #439 (which was the original design). If this job fails, it will not block the others.

@unkcpz
Member

unkcpz commented May 9, 2024

> Yes, those tests should not block a release, which is why I have separated them into a separate CI job in #439 (which was the original design). If this job fails, it will not block the others.

It is nice that the CI job is decoupled. But the publish job still depends on test-arm64, see here:

publish-ghcr:
  needs: [build, test-amd64, test-arm64]

If I understand correctly, this means that when we make a new release, the image will not be pushed to the registries, since the publish job will be blocked by the failed test. Am I missing something?

@danielhollas
Contributor Author

The test-arm64 tests do not include the integration tests. Those are run separately in test-integration, using the -m integration pytest marker.
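
To illustrate the gating being discussed: in GitHub Actions a job is skipped by default when any job listed in its needs fails or is skipped. The sketch below models that rule in plain Python. Only the needs list of publish-ghcr comes from the snippet quoted earlier in this thread; the other entries in the graph, and the helper name is_blocked, are assumptions made up for illustration and may not match the real workflow.

```python
# Job graph. The "publish-ghcr" entry is taken from the workflow snippet quoted
# in this thread; every other entry is an assumption for illustration only.
NEEDS = {
    "build": [],
    "test-amd64": ["build"],
    "test-arm64": ["build"],
    "test-integration": ["build"],
    "publish-ghcr": ["build", "test-amd64", "test-arm64"],
}


def is_blocked(job, failed, needs=NEEDS):
    """Return True if `job` would be skipped because a direct or transitive need failed."""
    return any(dep in failed or is_blocked(dep, failed, needs) for dep in needs[job])


# A failing integration job does not block publishing, because it is not in `needs`.
print(is_blocked("publish-ghcr", {"test-integration"}))  # False
# A failing test-arm64 job does block publishing.
print(is_blocked("publish-ghcr", {"test-arm64"}))  # True
```

Under these assumptions, a failure confined to test-integration leaves the publish job free to run, while the ARM failures seen in test-arm64 would still hold it back.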

danielhollas changed the title from "Flaky tests on ARM" to "Tests fail on ARM runner" May 9, 2024
@mbercx
Member

mbercx commented May 10, 2024

> I don't think the tests are at fault here, it's an issue with the ARM64 runner. @mbercx could you try restarting the machine?
> (or ideally, investigate what is happening there. Is there enough free RAM?)

I'm currently on holiday until the 21st, so I won't be able to look into this anytime soon. I doubt there is a memory issue on my workstation though. I'm not running anything there and it has 128 GB. @unkcpz should also have access to the ARM runner.

@danielhollas
Contributor Author

Closing since we're moving away from the self-hosted runner for now.

danielhollas closed this as not planned May 27, 2024