Unable to run without CONFIG_NUMA=y in Linux kernel #37

Open
tofurky opened this issue Jun 16, 2023 · 9 comments
@tofurky
tofurky commented Jun 16, 2023

I was successfully using the stress-test mode of y-cruncher (v0.7.10.9513-dynamic, Ubuntu 23.04, Ryzen 7950X) until I compiled a stripped down kernel. I then saw the following error:

Exception Encountered: InvalidParametersException

In Function: core_to_node()

Invalid core id: 14

Recompiling with CONFIG_NUMA=y fixed the issue. I'm not sure whether there's a way to enumerate CPUs (and allocate memory to threads) that doesn't depend on NUMA being enabled in the kernel. This may be a non-starter, but I figured it's worth documenting, even if only in the form of this issue.
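
As a rough illustration of NUMA-independent CPU enumeration (not how y-cruncher does it; the helper name `usable_cpus` is made up for this sketch), the scheduler affinity mask can be queried without touching any NUMA memory-policy call:

```python
import os

def usable_cpus():
    """Return the sorted logical CPU ids this process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: wraps sched_getaffinity(2), which is a scheduler call and
        # not part of the NUMA memory-policy API (get_mempolicy and friends).
        return sorted(os.sched_getaffinity(0))
    # Fallback for other platforms. Assumption: ids are simply 0..N-1.
    return list(range(os.cpu_count() or 1))

if __name__ == "__main__":
    print(usable_cpus())
```

This only answers "which cores exist and can I run on them"; it says nothing about which memory is local to which core, which is the part that needs NUMA information.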

Without attaching the full strace, I did spot this:

...
[pid  2633] get_mempolicy(0x7ffce39a7110, 0x59b16fc6ed10, 8, NULL, 0) = -1 ENOSYS (Function not implemented)
[pid  2633] openat(AT_FDCWD, "/sys/devices/system/cpu/online", O_RDONLY|O_CLOEXEC) = 3
...

A peek at the man page for get_mempolicy shows it to be NUMA-related: "retrieve NUMA memory policy for a thread".

Example config that caused the failure without NUMA:

{
    Action : "StressTest"
    StressTest : {
        AllocateLocally : "true"
        LogicalCores : [14]
        TotalMemory : 8589934592
        SecondsPerTest : 60 
        SecondsTotal : 300
        StopOnError : "true"
        Tests : ["FFT"]
    }
}
@Mysticial
Owner

Hmm... Nowhere in the program does it call get_mempolicy() directly. So it must be through some other system call.

Looking at the code, yeah it's not easy to strip out all the NUMA stuff or have it cleanly "fall through" when the topology data is missing or incomplete. This kind of thing is also really hard to test anyway. So it's doable, but too niche and too difficult to test.

Part of the code here is trying to query the NUMA node of each core to know where to place the data and how to optimize the way the memory is allocated and committed.

What does your /sys/devices/system/cpu/online look like?

@tofurky
Author

tofurky commented Jun 16, 2023

What does your /sys/devices/system/cpu/online look like?

For both NUMA and non-NUMA kernels, it shows 0-31.
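
For reference, `0-31` is the kernel's "cpulist" notation: comma-separated ids and inclusive ranges. A small sketch of expanding it (the helper name `parse_cpulist` is hypothetical, and the less common stride form such as `0-10:2` is deliberately not handled):

```python
def parse_cpulist(text):
    """Expand a cpulist string such as "0-3,8,10-11" into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty file or trailing comma
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.extend(range(int(lo), int(hi) + 1))  # ranges are inclusive
        else:
            cpus.append(int(part))
    return cpus

if __name__ == "__main__":
    with open("/sys/devices/system/cpu/online") as f:
        print(parse_cpulist(f.read()))
```

Note that this file lists cores only. It carries no node information on either kernel, so it cannot be used to tell the NUMA and non-NUMA cases apart.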

@Mysticial
Owner

Is the NUMA information missing from that file?

@tofurky
Author

tofurky commented Jun 16, 2023

There's nothing other than 0-31 in that file.

matt@aquos:~$ cat /sys/devices/system/cpu/online
0-31
matt@aquos:~$ dmesg |grep NUMA
[    0.000913] No NUMA configuration found
matt@aquos:~$ grep NUMA /boot/config-6.3.8aquos+ 
CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
# CONFIG_NUMA_BALANCING is not set
CONFIG_NUMA=y
# CONFIG_AMD_NUMA is not set
CONFIG_X86_64_ACPI_NUMA=y
# CONFIG_NUMA_EMU is not set
CONFIG_ACPI_NUMA=y
CONFIG_USE_PERCPU_NUMA_NODE_ID=y

@Mysticial
Owner

Then it's been a while since I've looked at the Linux stuff. Looking at it again, the files that matter are:

  • /proc/cpuinfo
  • /proc/zoneinfo
  • /sys/devices/system/node/

There might be more.

@tofurky
Author

tofurky commented Jun 16, 2023

cpuinfo and zoneinfo are more or less identical, but /sys/devices/system/node/ is missing without NUMA.

Mysticial added the bug label Dec 3, 2023

@Mysticial
Owner

Might be worth checking to see if this still repros in the latest version. I made some changes in this area to suppress some of the errors and hopefully let it fall through better.

I doubt it will work yet, but it may make it further.

@tofurky
Author

tofurky commented Dec 13, 2023

Sorry for the delay in responding. I rebuilt my kernel with CONFIG_NUMA=n again, and I was able to run things (e.g. calculate pi) after hitting Enter at this prompt:

Parsing Core -> Handle Mappings...
    Cores:  0-31 

Parsing NUMA -> Core Mappings...

Unable to read or parse "/sys/devices/system/node/".
Thread and node affinities may not function correctly.

Press ENTER to continue . . .

Using the stress.cfg shown in the initial issue leads to this (core 2, for example):

matt@aquos:~/aquos/y-cruncher v0.8.3.9532-dynamic$ ./y-cruncher config stress_running.cfg
y-cruncher v0.8.3 Build 9532

Detecting Environment...

Hardware Features:
(*) Indicates it is used explicitly by y-cruncher.

CPU Vendor:
    AMD         = Yes
    Intel       = No

...

Auto-Selecting: 22-ZN4 ~ Kizuna

/home/matt/aquos/y-cruncher v0.8.3.9532-dynamic/Binaries/22-ZN4 ~ Kizuna


Launching y-cruncher...
================================================================



Insufficient permissions to set thread priority. Please retry as root.

Further messages for this warning will be suppressed.

Checking processor/OS features...

Required Features:
    x64, ABM, BMI1, BMI2, ADX,
    SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2,
    AVX, FMA3, AVX2,
    AVX512-(F/CD/VL/BW/DQ/IFMA/VBMI/GFNI)



Parsing Core -> Handle Mappings...
    Cores:  0-31 

Parsing NUMA -> Core Mappings...

Unable to read or parse "/sys/devices/system/node/".
Thread and node affinities may not function correctly.

Press ENTER to continue . . .



Component Stress Tester

  1   Logical Cores:      1
  2   Memory:             8.00 GiB  ( 8.00 GiB per thread )
  3   NUMA Mode:          Local - Memory allocated from local thread.
  4/5 Time Limit:         60 seconds per test / 60 seconds total
  6   Stop on Error:      Enabled

 7/8  Enable All Tests / Disable All Tests
 9/10 Load/Save Configuration File

  #   Tag - Test Name               Mem/Thread   Component        CPU------Mem
 11   BKT - Basecase + Karatsuba      Disabled   Scalar Integer    -|--------
 12   BBP - BBP Digit Extraction      Disabled   AVX512 Float      |---------
 13   SFT - Small In-Cache FFT        Disabled   AVX512 Float      -|--------
 14   FFT - Fast Fourier Transform     312 MiB   AVX512 Float      ---------|
 15   N63 - Classic NTT (v2)          Disabled   AVX512 Integer    -----|----
 16   VT3 - Vector Transform (v3)     Disabled   AVX512 Integer    ------|---

  0   Start Stress-Testing!

Allocating Memory...


Exception Encountered: InvalidParametersException

In Function: core_to_node()

Invalid core id: 2



Press ENTER to continue . . .

@Mysticial
Owner

Looks like it did indeed get a bit further. lol
