mOS for HPC v1.0 User Guide
mOS for HPC combines a lightweight kernel (LWK) with Linux. Resources, e.g., CPUs and physical memory blocks, are either managed by the LWK or by the Linux kernel. The process of giving resources to the LWK, thereby taking them away from Linux management, is called designation. Resources that have been designated for the LWK are still visible from Linux but are now under LWK control. For example, the Linux kernel, when directed to do so, can perform I/O using memory designated for the LWK, but the LWK decides how that memory is managed. LWK resource designation can be done at boot time, or later using the lwkctl command.
Giving some or all of the designated LWK resources to an mOS process is called reservation and is done at process launch time using a utility called yod (see below). The third stage, allocation, happens when a running process requests a resource, e.g., through calls like mmap() or sched_setaffinity(). A process can only allocate resources that have been reserved for it at launch time and that were designated as LWK resources before that.
Earlier versions of mOS sent most system calls to Linux CPUs for processing. Only about a dozen calls that directly affected LWK resources were handled locally by the LWK. Since version v0.8, system calls are no longer forwarded to Linux CPUs. The LWK has access to all of the Linux kernel code, and we found that running that code locally is more efficient and does not increase OS jitter on LWK cores. Linux CPUs are still needed to boot the system, handle interrupts and timer ticks, and occasionally serve as utility CPUs for hosting application or runtime utility threads. A runtime system or a user can mark a given thread as a utility thread that may be moved to a Linux CPU if not enough LWK CPUs are available to execute the primary computation. Utility threads are usually not part of the high-performance computation; rather, they assist by monitoring or ensuring that progress is made in the background. As such, utility threads usually do not need much compute power, but they tend to introduce noise into a system. Therefore, it is often best not to let them execute on the compute CPUs in use by the application. In many situations this can be accomplished using yod options. For more advanced management of utility threads, see mOS for HPC Utility Thread API.
The lwkctl command can be used to display the LWK partition information. This includes the list of LWK CPUs, LWK memory and utility CPUs.
To see the output in human-readable format, use:
lwkctl -s
To see the output in raw format, use:
lwkctl -s -r
For further details regarding usage refer to the lwkctl man page on a compute node where mOS for HPC is installed.
Applications are run under mOS for HPC through a launcher command called yod. Any program not launched with yod will simply run on Linux. This document discusses how to use yod in conjunction with mpirun, but does not discuss job schedulers.
The yod utility of mOS is the fundamental mechanism for reserving LWK resources for a process being spawned by the native job launcher. The syntax is:
yod yod-arguments program program-arguments
One of yod's principal jobs is to reserve LWK resources (CPUs, GPUs, memory) for the process being spawned. yod supports a simple syntax whereby a fraction of the resources that have been designated for the LWK are reserved. This is useful when launching multiple MPI ranks per node. In such cases, the general pattern looks like this:
mpirun -ppn N mpirun-args yod -R 1/N yod-args program program-args
This reserves for each MPI rank an equal portion of the designated LWK resources.
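For example (a sketch; ./my_app is a placeholder application name), launching four ranks per node so that each rank reserves one quarter of the designated LWK resources:
mpirun -ppn 4 yod -R 1/4 ./my_app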
Please consult the mpirun man page for further information on mpirun-args.
The following are the supported yod options. This information can also be found in the yod man page.
Option | Description |
---|---|
--resources, -R <fraction, all, MPI, file:map_file> | Reserves a portion of the LWK resources. If specified as a fraction, then the corresponding number of LWK cores and GPU devices are reserved, as well as an equal portion of the designated LWK memory. A fraction may be specified in floating point format or as a rational number M/N, where M and N are integers. If MPI is specified then MPI environment variables are used to determine the fractional amount of resources. If file:map_file is specified, then a mapping file is used to specify LWK CPU and/or memory and/or number of utility threads per MPI rank. See RESOURCE MAP FILES for details. If all is specified, all designated LWK resources are reserved. This option may not be combined with any of the --cpus, --cores, --gpus or --mem options. |
--cpus, -c <list, mask, all> | Reserves the specified CPUs for the command being launched. If all is specified, all available LWK CPUs are reserved. Otherwise, the argument describes the LWK CPUs to be reserved, in either list or mask format. |
--cores, -C <number, fraction, all, MPI> | Reserves the specified number of LWK cores for the command being launched. If all is specified, all available LWK cores are reserved. Otherwise, the argument specifies either a number of cores to be reserved, or a fraction of the overall LWK cores designated for mOS use. A fraction may be specified in floating point format or as a rational number M/N, where M and N are integers. Fractions must be in the interval (0, 1]. If MPI is specified then MPI environment variables are used to determine the fractional amount of core resources. |
The following options control reservation of LWK memory. If specified, then LWK CPUs must also be specified via either the --cpus or the --cores option. | |
--mem, -M <size, fraction, all, MPI> | Reserves the specified amount of memory for the command being launched. Size is an integer or decimal number, optionally augmented with K, M or G to indicate units of kilobytes, megabytes or gigabytes, respectively. A fraction reserves memory as a portion of the overall amount of memory designated for LWK use. Fractions must be in the interval (0.0, 1.0) and may be specified either in floating point format or as a rational number M/N, where M and N are integers (M <= N). If all is specified, then all available LWK memory is reserved. If MPI is specified then MPI environment variables are used to determine the fractional amount of memory resources. |
The following options control the reservation of GPU devices. If these options are not specified and if the ZE_AFFINITY_MASK environment variable is set, the ZE_AFFINITY_MASK will control the available GPU devices. If neither of these options nor the --resources/-R option is specified and the ZE_AFFINITY_MASK is not set, yod will reserve all available GPU devices. | |
--gpus, -G <number, fraction, all> | Reserves the specified number of GPU devices for the command being launched. If all is specified, all available GPU devices are reserved. Otherwise, the argument specifies either a number of GPU devices to be reserved, or a fraction of the total GPU devices available for mOS use. A fraction may be specified in floating point format or as a rational number M/N, where M and N are integers. Fractions must be in the interval (0, 1]. If MPI is specified then MPI environment variables are used to determine the fractional amount of GPU resources. If this option is specified, then LWK CPUs must also be specified via either the --cpus or the --cores option, along with memory. This option and the --gpu-tiles option are mutually exclusive. |
--gpu-tiles, -g <number, fraction, all> | Reserves the specified number of GPU tiles for the command being launched. If all is specified, all available GPU tiles are reserved. Otherwise, the argument specifies either a number of GPU tiles to be reserved, or a fraction of the total GPU tiles available for mOS use. A fraction may be specified in floating point format or as a rational number M/N, where M and N are integers. Fractions must be in the interval (0, 1]. If MPI is specified then MPI environment variables are used to determine the fractional amount of GPU resources. If this option is specified, then LWK CPUs must also be specified via either the --cpus or the --cores option along with memory. This option and the --gpus option are mutually exclusive. |
Additional options | |
--util_threads, -u <number> | Specify number of threads to be identified as utility threads within the process being launched. If a value is specified, the kernel will heuristically identify that number of threads as utility threads and provide special placement and behaviors to those threads. If no value is specified, the kernel will make no heuristic identification of utility threads. If the number of utility threads is specified in the -R file:map_file option then that value overrides the value specified here. |
--resource_algorithm <numa, simple, random> | Controls the selection and layout of CPUs relative to the overall set of designated LWK CPUs. See RESOURCE ALGORITHMS for additional information. The default is numa. |
--memory-preference, -p <preference> | States preferences of types of memory to use for various kinds of allocations. See MEMORY PREFERENCES for additional information. |
--layout <description> | Provides CPU (hardware thread) ordering suggestions to the mOS scheduler. See THREAD LAYOUT for additional information. |
--rank-layout <compact, scatter[:stride], disabled> | Provides a hint to lay out ranks in a prescribed order. A compact layout will place adjacent ranks near each other, from a NUMA perspective. A scatter layout will interleave ranks using a stride; if not specified, the stride will be the number of NUMA domains for CPUs. Disabling the layout will not prescribe any specific layout of the ranks with respect to NUMA domains. This option is a hint and requires additional support from the underlying MPI launch mechanism. |
--brk-clear-length size | For non-negative size values, size defines the number of bytes to clear (zero) at the beginning of the expanded region when the brk system call expands the data segment. For negative size values, the entire expanded region will be cleared. The default behavior is to clear 4K. The size argument may be specified in integer or symbolic format (4K, 2M, 1G, etc.). |
--mosview <lwk, all> | Sets the mOS view of process being launched. If lwk is specified, the process will see only LWK global resources but not Linux resources. If all is specified, the process will see both LWK and Linux resources. The default is all. |
--maxpage scope:maxpage | Sets the largest page size that can be used for a virtual memory region. scope can be dbss for .data/.bss area, heap for brk area, anon_private for private anonymous area, tstack for thread stacks, stack for process stack or all for all LWK memory backed areas. maxpage can be either 4k, 2m, or 1g. Setting for multiple virtual memory regions can be specified by using separator '/' between settings of each virtual memory region. One can specify a different maxpage for each region. If setting for a virtual memory region is not specified then by default the largest page size supported by the hardware is used for that region, unless interleaving is active. If interleaving is active, then the default maximum page size is 2m. |
--pagefault scope:level | Sets the page faulting level that can be used for a virtual memory region. scope can be dbss for .data/.bss area, heap for brk area, anon_private for private anonymous area, tstack for thread stacks, stack for process stack or all for all LWK memory backed areas. level can be either nofault or onefault. Setting for multiple virtual memory regions can be specified by using separator '/' between settings of each virtual memory region. One can specify a different level for each region. If setting for a virtual memory region is not specified then a default setting of nofault is applied. |
--mempolicy scope:type | Sets the memory policy type that can be used for a virtual memory region. scope can be dbss for .data/.bss area, heap for brk area, anon_private for private anonymous area, tstack for thread stacks, stack for process stack or all for all LWK memory backed areas. type can be either normal, random, interleave, or interleave_random. Setting for multiple virtual memory regions can be specified by using separator '/' between settings of each virtual memory region. One can specify a different type for each region. If the setting for a virtual memory region is not specified, then a default setting of interleave is applied if there is more than one NUMA domain reserved for a memory type; otherwise, the default setting is normal. |
--dry-run | Do not actually reserve resources and launch. |
--verbose, -v <number> | Controls the verbosity of yod. number is an integer between 0 and 9. |
--help, -h | Prints a terse version of this documentation. |
--option, -o kernel-option[=value] | Passes a kernel-option to the mOS kernel. In addition to the previously described options in this table, yod contains several experimental kernel options. These may be elevated to be fully supported or may disappear in future releases. The kernel options are described in the following table. |
Kernel Option | Description |
---|---|
lwkmem-report=<level> | Generates a report of memory usage and writes it to the kernel console when the process exits. The level value controls the amount of detail in the report. |
lwkmem-vmr-disable=<vmr> | Disables the use of LWK memory for the indicated VMR. Multiple comma-separated VMRs can be specified. Supported vmr specifications: dbss, heap, anon_private, tstack. |
lwkmem-pma-cache=<page_type:num_pages> | Specifies the number of cached physical pages to maintain for each of the page sizes. The default is 512 4k and 2m pages, and 4 1g pages. Multiple comma-separated arguments can be specified. |
lwkmem-pma=<physical_mem_alloc> | Specifies the physical memory allocator to be used. Currently the only physical_mem_alloc implemented is the buddy allocator. |
lwksched-disable-setaffinity=<errno> | Do not perform any action as a result of the sched_setaffinity system call if the target is an mOS thread. On return from the system call, set the returned errno to the value provided. This option is useful for debugging. There are no plans to make it officially supported. |
lwksched-enable-rr=<time_quantum> | Modifies the time quantum for round-robin dispatching or disables round-robin dispatching. By default, when more than one thread is executing on a CPU, round-robin dispatching is automatically enabled with a 100ms time quantum. Each thread executes for up to <time_quantum> milliseconds before being preempted by another thread of equal priority. The minimum supported time quantum is 10ms. If a value of 0 is specified, automatic round-robin dispatching is disabled. By default, no timer tick occurs if only one LWK thread is runnable on an LWK CPU. |
lwksched-stats=<level> | Outputs counters to the kernel log at process exit. Data detail is controlled by <level>. A value of 1 generates an entry for every mOS CPU that had more than one mOS thread committed to run on it. A value of 2 adds a summary record for the exiting mOS process. A value of 3 adds records for all CPUs in the process and a process summary record for the exiting process, regardless of commitment levels. The content and format of the output are highly dependent on the current implementation of the mOS scheduler and are therefore likely to change in future releases. |
util-threshold=<X:Y> | The X value is the maximum number of LWK CPUs that can be used for hosting utility threads. The Y value is the maximum number of utility threads allowed to be assigned to any one LWK CPU. For example, a value of 0:0 prevents any utility threads from being placed on an LWK CPU and forces all utility threads onto the Linux CPUs defined as the syscall target CPUs. A value of -1:1 allows any number of LWK CPUs to hold utility threads, but assigns at most one utility thread to each LWK CPU. The default behavior is X = -1, Y = 1. The UTI API is a more controlled approach to placing utility threads. |
idle-control=<mechanism,boundary> | mechanism selects the fast-path idle/dispatch mechanism used by the idle task. boundary sets the boundary within which the fast dispatch mechanism is deployed; beyond this boundary, the CPU will request deep sleep. |
cmci-control=<threshold,poll> | Controls the corrected machine check interrupt (CMCI) threshold and polling on the CPUs owned by the mOS process. In all cases, when the mOS process ends, the machine check banks owned by the associated CPUs are checked and any pending machine check events are logged, even if the threshold has not been reached. Any disabled CMCIs are re-enabled, any modified thresholds are restored, and polling is re-enabled. If a threshold is requested that the hardware does not support, an info message is written to the console log, at most once per boot. |
enable-balancer=<type,param1,param2,param3> | type can be either 'push' or 'pull'; if neither is specified, the default behavior is pull. This option can be useful in environments where CPU over-commitment cannot be avoided and the run-times are not able to intelligently place threads. |
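The following invocations are sketches only, combining several of the options documented above; ./my_app is a placeholder application name and the values are illustrative:
# Reserve half of the LWK cores and memory, plus one GPU tile:
yod --cores 1/2 --mem 1/2 --gpu-tiles 1 ./my_app
# Reserve specific LWK CPUs and 4 gigabytes of LWK memory, and request scheduler statistics:
yod --cpus 2-9 --mem 4G -o lwksched-stats=2 ./my_app
# Cap heap pages at 2m and interleave all LWK-backed regions:
yod -R 1/4 --maxpage heap:2m --mempolicy all:interleave ./my_app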
The first logical CPU is CPU 0. The second is CPU 1. And so on. CPU masks in yod are hexadecimal literals specified in little endian order. The least significant bit corresponds to CPU 0 and so on. Masks must begin with either "0x" or "0X". CPU lists are CPU numbers or ranges of numbers separated by commas. For example, the list '0-2,8' is equivalent to mask 0x107.
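As an illustration (./my_app is a placeholder), the following two commands reserve the same LWK CPUs, once in list form and once as the equivalent mask:
yod --cpus 0-2,8 --mem 1G ./my_app
yod --cpus 0x107 --mem 1G ./my_app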
The --cpus form of LWK CPU reservation is explicit in that it specifically identifies the CPUs to be reserved. Other forms are less explicit; in these cases, yod uses the --resource_algorithm specification to reserve and select CPUs and memory. The numa resource algorithm attempts to reserve LWK cores and memory that are near each other in the NUMA sense. The simple resource algorithm reserves LWK cores from the available pool in ascending order; memory is reserved from NUMA domains in ascending order. The random resource algorithm reserves LWK cores randomly from the available pool.
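For instance (a sketch; ./my_app is a placeholder), the following reserves a quarter of the LWK cores and memory but selects them in simple ascending order rather than by NUMA proximity:
yod --cores 1/4 --mem 1/4 --resource_algorithm simple ./my_app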
The --layout <description> option may be used to suggest how software threads are assigned to CPUs (hardware threads) once specific CPUs have been reserved for the process being launched. The <description> argument may be specified as scatter, compact, or a permutation of the dimensions node, tile, core, and cpu.
The scatter option spreads threads out as much as possible within the reserved LWK CPUs. It is equivalent to node,tile,core,cpu and thus will attempt to spread out across nodes before repeating tiles, spread out across tiles before repeating cores, and so on. This is the default.
The compact option is the opposite of scatter and is equivalent to cpu,core,tile,node. It will select CPUs (hardware threads) on a core before moving to another core. Likewise, it will use all cores on a tile before expanding to another tile, and so on.
Other permutations of node, tile, core, and cpu may be passed to specify the sort order of the CPUs.
The node, tile, core, and cpu terms may also be augmented with a :<count> suffix, which prefers using only that many of the described entities before moving on. For example, cpu:1 constructs a layout that uses the first CPU in all reserved cores before using the 2nd and subsequent CPUs in any reserved core. Thus cpu:1,core,tile,node is compact from a node, tile, and core perspective, but will initially consume one CPU per reserved core before binding threads to the remaining CPUs of the reserved cores.
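As a sketch (./my_app is a placeholder), the following reserves half of the designated LWK resources and suggests a layout that consumes one CPU per reserved core before using the remaining hardware threads:
yod -R 1/2 --layout cpu:1,core,tile,node ./my_app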
Preferences have the form scope[:size]:order. The scope term identifies a virtual memory region and can be dbss for .data/.bss area, heap for brk area, anon_private for anonymous mmap area, tstack for thread stacks or all for all LWK memory backed areas.
The order term lists types of memory in order of preference. This is a comma delimited list of hbm, dram, and nvram. The default ordering is hbm,dram,nvram. If not all types of memory are explicitly stated, the list is implicitly completed with missing types from this default order.
The size term, if present, applies the preference to allocations larger than or equal to the specified size. If not specified, size is implicitly 1.
Multiple preferences are separated with a '/' character.
If no preference is specified, the default behavior is all:1:hbm,dram,nvram. Any preferences specified are relative to this default and are applied in order from left to right.
Example:
yod -p all:dram/anon_private:65536:hbm
Gives precedence to DRAM for all memory allocations, except private, anonymous mmaps of 64K or larger. These mmaps will first attempt to be satisfied with high bandwidth memory.
The file: variant of the --resources option may be used to map CPU, memory, and number of utility threads per MPI rank. The file contains lines of the form:
<local-rank> <resource-spec>...
Where <local-rank> is either an integer identifying the Nth rank on the node or the wildcard character '*'. The <resource-spec> can identify CPUs, cores, memory, number of utility threads, and/or resource option. The wildcard line is optional. It matches all ranks and should be the last line in the file. Comments are allowed and start with the '#' character.
This option requires that the MPI_LOCALRANKID environment variable is set to identify the rank's ordinal within the node.
Example:
# The first rank on the node will use 1/4 of the designated resources:
0 -R 1/4
# The second rank on the node will use CPU 9 and 1 gigabyte of memory:
1 -c 9 -M 1G
# All other ranks use 1 core and 1/8 of the designated memory:
* --cores 1 --mem 1/8
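A sketch of launching with such a map file (the file name my_map_file, the ranks-per-node count, and ./my_app are placeholders; the MPI launcher is expected to set MPI_LOCALRANKID):
mpirun -ppn 4 yod -R file:my_map_file ./my_app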
YOD ENVIRONMENT VARIABLES
YOD_VERBOSE may be used to control the verbosity of yod. Specifying --verbose on the command line takes precedence over this environment variable.
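For example, assuming YOD_VERBOSE accepts the same 0-9 values as --verbose (./my_app is a placeholder):
YOD_VERBOSE=4 yod -R all ./my_app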
The mOS kernel reserves unique CPU and memory resources for each process/rank within a node and assigns threads to the CPU resources owned (reserved) by the process. For this reason, it is advisable to prevent MPI implementations and high-speed fabric software from pinning processes and threads to specific CPUs, since that would likely conflict with yod's management of resources. For example, Intel MPI will by default attempt to pin ranks to specific CPUs; this must be disabled by setting I_MPI_PIN=off. Another example is the Intel(R) Omni-Path Fabric, which attempts to place a worker thread on a specific CPU; to disable this behavior, set HFI_NO_CPUAFFINITY=1. Set the appropriate environment variables for your particular environment/hardware.
In addition, if you are running in an OpenMP environment and you want mOS to place the OpenMP threads, set KMP_AFFINITY=none. Alternatively, you can allow OpenMP to place threads; mOS will respect OpenMP thread placement.
If left to its own defaults, OneCCL may attempt to affinitize worker threads onto CPUs outside of the LWK CPUs that have been reserved for the current process. This results in an error being reported by the runtime, and the application is terminated. The same problem occurs if the user supplies an affinity list containing CPUs outside of the reserved LWK CPUs. If no CCL_WORKER_AFFINITY list is provided, yod creates one prior to executing the application, assigning the CCL worker threads to the last 4 CPUs in the sequence of reserved LWK CPUs. If a valid affinity specification is provided by the caller, no change is made. The environment variable that controls this is CCL_WORKER_AFFINITY.
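A sketch of a job environment setup that follows the guidance above for an Intel MPI + OpenMP run (./my_app and the rank count are placeholders):
# Let yod/mOS manage CPU placement instead of the MPI and fabric layers:
export I_MPI_PIN=off
export HFI_NO_CPUAFFINITY=1
# Let mOS place the OpenMP threads:
export KMP_AFFINITY=none
mpirun -ppn 2 yod -R 1/2 ./my_app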
The XPMEM implementation of mOS is derived from the open source XPMEM implementation at https://gitlab.com/hjelmn/xpmem . It is compatible with the open source XPMEM implementation with respect to user APIs. The user API description can be found either in the open source XPMEM implementation at the link above or in the mOS XPMEM installation header files listed in the table below. In addition, a few fixes were made to the user space XPMEM library. Users can pick up these changes by rebuilding/relinking their applications with the mOS user space XPMEM library (see the example after the table below).
XPMEM component | Path to installation on mOS |
---|---|
Shared library | /usr/lib/libxpmem.so |
Header files | /usr/include/xpmem/ |
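A minimal sketch of relinking an application against the mOS XPMEM library using the paths above (the compiler invocation and file names are placeholders):
# Relink against the mOS XPMEM user-space library (libxpmem.so in /usr/lib):
cc -o my_app my_app.c -lxpmem
# If the application includes the XPMEM header directly, the include path may need to be added:
cc -I/usr/include/xpmem -o my_app my_app.c -lxpmem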
The XPMEM kernel module is loaded during kernel boot and is ready to use after a successful boot; no additional steps are needed from the user.
The mOS XPMEM implementation supports huge pages for the attached virtual memory. This use of huge pages for an attachment is subject to the conditions described below. Terminology: an owner process is any user process that shares its virtual address space to be consumed by others; a non-owner is any other user process that attaches the owner's shared address space into its own virtual address space and then accesses the shared memory. The tables below list the situations affecting huge page usage in the mOS XPMEM implementation and provide recommendations for handling them.
Owner virtual memory condition | Attached memory behaviors and recommendations |
---|---|
Usage of huge pages in the owner page table itself. | Map and share large LWK memory in the owner process. For example, for an XPMEM share, mmap() large areas (>2 MB) using the MAP_PRIVATE and MAP_ANONYMOUS flags, or use brk(). |
Remapping of huge pages is not supported for Linux memory. For an LWK process, the memory for data/bss, brk, and private anonymous mmaps is allocated out of LWK memory; the rest of the process memory is allocated from Linux memory (e.g., the text area, file-backed mmaps, etc.). The huge pages supported by mOS XPMEM in the non-owner apply only to the corresponding LWK memory in the owner process. | Avoid XPMEM sharing of Linux memory in the owner address space if large memory needs to be shared; the expectation is that an LWK process will use more LWK memory than Linux memory. |
The alignment of the start address of the shared segment. | Create a shared segment with a virtual start address aligned to a huge page boundary. Typically, when large memory is mapped through LWKMEM, the mapped address is already huge page aligned. |
The length of the shared segment. | To use huge pages, the length needs to be at least 2MB. |
Holes in the virtual address space covered by the shared segment. | It is recommended that the non-owner attach to the owner's shared address space after it has been mapped by the owner, e.g., using mmap, mremap, or brk. Creating an XPMEM share over owner virtual address space that is not yet mapped (a hole) is still supported; this recommendation simply states that the owner regions being attached need to be fully mapped first. |
Recreation of memory maps with larger sizes that could potentially result in using higher-order huge pages. | The alignment of an XPMEM attachment in the non-owner depends largely on the corresponding owner address space at the time of attachment. If that address space changes, i.e., a previously existing map is unmapped and a new, larger map is created, it is recommended to detach the existing XPMEM attachment and create a new one to ensure that the attachment is aligned to the newly allocated huge page size in the owner. |

Non-owner virtual memory actions | Recommendations |
---|---|
The length of the XPMEM attachment used. | It needs to be at least 2MB. |
A fixed start virtual address used for the attachment. | It is recommended that the application not use a fixed start address (MAP_FIXED) for an attachment so that the kernel can choose the best huge page alignment for that attachment. |
The offset of the attachment. | Offsets that are not multiples of the huge page size can result in attaching to an unaligned virtual memory start in the owner address space, which in turn forces the remap to use smaller pages if the resulting start/end address falls in the middle of a huge page. |
Some components of the mOS LWK are instrumented with Linux ftrace support.
Enabling/disabling trace events and dumping the trace buffer requires root permissions.
To enable tracing, write a '1' to the individual event's control file or to the global control file:
# To see all of the supported events:
$ ls /sys/kernel/debug/tracing/events/mos
# To enable just the "mos_clone_cpu_assign" event:
$ echo 1 > /sys/kernel/debug/tracing/events/mos/mos_clone_cpu_assign/enable
# To enable all mOS events:
$ echo 1 > /sys/kernel/debug/tracing/events/mos/enable
After you run, you can dump the trace buffer:
$ cat /sys/kernel/debug/tracing/trace
Tracing real workloads can easily overflow the trace ring, resulting in loss of data earlier in the run. This can be worked around easily by routing the ftrace pipe into a file prior to initiating the workload:
$ cat /sys/kernel/debug/tracing/trace_pipe | tee my.ftrace
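When finished, tracing can be turned off again by writing a '0' to the same control files (standard ftrace behavior):
$ echo 0 > /sys/kernel/debug/tracing/events/mos/enable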