mOS for HPC v0.9 User's Guide
mOS for HPC combines a lightweight kernel (LWK) with Linux. Resources, e.g., CPUs and physical memory blocks, are either managed by the LWK or by the Linux kernel. The process of giving resources to the LWK, thereby taking them away from Linux management, is called designation. Resources that have been designated for the LWK are still visible from Linux but are now under LWK control. For example, the Linux kernel, when directed to do so, can perform I/O using memory designated for the LWK, but the LWK decides how that memory is managed. LWK resource designation can be done at boot time, or later using the lwkctl command.
Giving some or all the designated LWK resources to an mOS process is called reservation and is done at process launch time using a utility called yod (see below). The third stage, allocation, happens when a running process requests a resource; e.g., through calls like mmap() or sched_setaffinity(). A process can only allocate resources that have been reserved for it at launch time, and designated as an LWK resource before that.
Earlier versions of mOS sent most system calls to Linux CPUs for processing; only about a dozen calls that directly affected LWK resources were handled locally by the LWK. Since version v0.8, system calls are no longer forwarded to Linux CPUs. The LWK has access to all of the Linux kernel code, and we found that running that code locally is more efficient and does not increase OS jitter on LWK cores. Linux CPUs are still needed to boot the system, handle interrupts and timer ticks, and occasionally serve as utility CPUs. A runtime system or a user can mark a given thread as a utility thread that may be moved to a Linux CPU if not enough idle LWK CPUs are available. Utility threads are usually not part of the high-performance computation; rather, they assist by monitoring or by ensuring that progress is made in the background. As such, utility threads usually do not need a lot of compute power, but they have a tendency to introduce noise into a system. Therefore, it is often best not to let them execute on the compute CPUs in use by the application. See the Utility Thread Application Programmer's Interface section further down.
The lwkctl command can be used to display the LWK partition information. This includes the list of LWK CPUs, LWK memory and utility CPUs.
To see the output in human-readable format, use:
lwkctl -s
To see the output in raw format, use:
lwkctl -s -r
For further details regarding usage refer to the lwkctl man page on a compute node where mOS for HPC is installed.
Applications are run under mOS for HPC through the use of a launcher command called yod. Any program not launched with yod will simply run on Linux. This document discusses how to use yod in conjunction with mpirun, but does not discuss job schedulers.
The yod utility of mOS is the fundamental mechanism for spawning LWK processes. The syntax is:
yod yod-arguments program program-arguments
One of yod's principal jobs is to reserve LWK resources (CPUs, memory) for the process being spawned. yod supports a simple syntax whereby a fraction of the resources that have been designated for the LWK are reserved. This is useful when launching multiple MPI ranks per node. In such cases, the general pattern looks like this:
mpirun -ppn N mpirun-args yod -R 1/N yod-args program program-args
This reserves for each MPI rank an equal portion of the designated LWK resources.
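For example, to launch four ranks per node with each rank reserving a quarter of the designated LWK resources (the program name ./app is illustrative):
mpirun -ppn 4 yod -R 1/4 ./app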
Please consult the yod man page for a more thorough description of the yod arguments. Please consult the mpirun man page for further information on mpirun-args.
In addition to the arguments documented in the yod man page, there are some experimental options. They can easily change or even disappear in future releases. Some of the experimental options are described in the table below. All of these are passed to yod via the --opt option.
Option | Arguments | Description | Additional Notes |
---|---|---|---|
lwkmem-report | <level> | Generates a report of memory usage and writes it to the kernel console when the process exits. Levels: 0 = default, no reporting; 1 = physical memory report; 2 = physical and virtual memory report. | XPMEM statistics are included in the virtual memory report. |
lwkmem-vmr-disable | <vmr> | Disables the use of LWK memory for the indicated VMR. Multiple comma-separated VMRs can be specified. Supported VMR specifications: dbss, heap, anon_private, tstack. | |
lwkmem-pma-cache | <page_type:num_pages> | Specifies the number of cached physical pages to maintain for each page size. The default is 512 4k pages, 512 2m pages, and 4 1g pages. Multiple comma-separated arguments can be specified. | |
lwkmem-pma | <physical_mem_alloc> | Specifies the physical memory allocator to be used. Currently the only physical memory allocator implemented is the 'buddy' allocator. | |
lwksched-disable-setaffinity | <errno> | Does not perform any action as a result of the sched_setaffinity system call if the target is an mOS thread. On return from the system call, sets the returned errno to the value provided. | Examples: (1) --opt lwksched-disable-setaffinity=0 will no-op the system call. (2) --opt lwksched-disable-setaffinity=38 will cause the system call to fail with ENOSYS. This option is useful for debugging; there are no plans to make it officially supported. |
lwksched-enable-rr | <time_quantum> | Modifies the time quantum for round-robin dispatching or disables round-robin dispatching. By default, when more than one thread is executing on a CPU, round-robin dispatching is automatically enabled with a 100 ms time quantum. Each thread executes for up to <time_quantum> milliseconds before being preempted by another thread of equal priority. The minimum supported time quantum is 10 ms. If a value of 0 is specified, automatic round-robin dispatching is disabled. | By default, no timer tick occurs if only one LWK thread is runnable on an LWK CPU. |
lwksched-stats | <level> | Outputs counters to the kernel log at process exit. Detail is controlled by <level>: 1 generates an entry for every mOS CPU that had more than one mOS thread committed to run on it; 2 adds a summary record for the exiting mOS process; 3 adds records for all CPUs in the process plus a process summary record, regardless of commitment levels. Fields: PID: the TGID of the process, which can be used to visually group the CPUs that belong to a specific process. CPUID: CPU corresponding to the data being displayed. THREADS: number of threads within the process (main thread plus pthreads). CPUS: number of CPUs reserved for use by this process. MAX_COMMIT: high-water mark of the number of mOS threads assigned to run on this CPU. MAX_RUNNING: high-water mark of the number of tasks enqueued on the mOS run queue, including kernel tasks. GUEST_DISPATCH: number of times a non-mOS thread (kernel thread) was dispatched on this CPU. TIMER_POP: number of timer interrupts, typically the result of a POSIX timer expiring or of round-robin dispatching if enabled through lwksched-enable-rr. SETAFFINITY: number of sched_setaffinity system calls executed by this CPU. UTIL-CPU: indicator that this CPU has been designated a utility CPU meant to run utility threads such as the OMP monitor and the PSM progress threads. | This option is useful for debugging. The content and format of the output depend heavily on the current implementation of the mOS scheduler and are therefore likely to change in future releases. |
util-threshold | <X:Y> | X is the maximum number of LWK CPUs that can be used for hosting utility threads. Y is the maximum number of utility threads allowed to be assigned to any one LWK CPU. Examples: a value of "0:0" prevents any utility threads from being placed on an LWK CPU and forces all utility threads onto the Linux CPUs defined as syscall target CPUs. A value of "-1:1" allows any number of LWK CPUs to host utility threads, but at most one utility thread is assigned to each LWK CPU. | The default behavior is X = -1, Y = 1. The UTI API is the preferred approach to controlling utility thread placement. |
idle-control | <MECHANISM,BOUNDARY> | MECHANISM is the fast-path idle/dispatch mechanism used by the idle task. BOUNDARY is the boundary within which the fast dispatch mechanism is deployed; beyond this boundary, the CPU requests deep sleep. | The default <MECHANISM,BOUNDARY> is <mwait,reserved>. |
cmci-control | <threshold,poll> | Controls the corrected machine check interrupt (CMCI) threshold and machine check polling behavior for the CPUs owned by the process. | In all cases, when the mOS process ends, the machine check banks owned by the associated CPUs are checked and any pending machine check events are logged, even if the threshold has not been reached. Any disabled CMCIs are re-enabled, any modified thresholds are restored, and polling is re-enabled. If a threshold is requested that the hardware does not support, an informational message is written to the console log; one message is written per boot. |
enable-balancer | <type,param1,param2,param3> | Enables the scheduler's load balancer. The type can be either 'push' or 'pull'. | Can be useful in environments where CPU over-commitment cannot be avoided and the run-times are not able to place threads intelligently. If the enable-balancer option is specified with no parameters, the "pull" balancer is the default and uses its default parameter values. If parameter values are not specified, the default values are used. |
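As an illustration of the --opt syntax, the following hypothetical launch requests a physical memory report when the application exits, following the name=value pattern shown in the table (the program name ./app is illustrative):
yod --opt lwkmem-report=1 ./app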
The mOS kernel reserves unique CPU and memory resources for each process/rank within a node and assigns threads to the CPU resources owned (reserved) by that process. For this reason, it is advisable to set the runtime-specific environment variables below so that the runtimes do not interfere with this mOS behavior.
Name | Value | Description |
---|---|---|
I_MPI_PIN | off | Disables process pinning in Intel MPI. Without this set, Intel MPI gets confused by isolated CPUs (including mOS LWK CPUs) and may attempt to assign ranks to cores not controlled by mOS; symptoms include core dumps from the pmi_proxy (HYDRA). When pinning is disabled via I_MPI_PIN=off, processes forked by the pmi_proxy inherit the affinity mask of the proxy, which is what mOS' yod expects. |
I_MPI_FABRICS | shm:tmi or shm:ofi | For use on clusters with Intel(R) Omni-Path Fabric. Selects shared memory for intra-node communication. For 2018 Intel MPI editions, the recommended setting for inter-node communication is the Tag Matching Interface (shm:tmi). Starting with the 2019 Intel MPI editions, the recommended setting is OpenFabrics Interfaces (shm:ofi). See https://software.intel.com/en-us/mpi-developer-guide-linux-selecting-fabrics for additional information. |
I_MPI_TMI_PROVIDER | psm2 | For use on clusters with Intel(R) Omni-Path Fabric. Selects the PSM2 provider for the TMI fabric. This setting is recommended only for 2018 editions of Intel MPI; it has been deprecated or removed in 2019 editions. |
I_MPI_FALLBACK | 0 | Use only the specified communication fabric(s). |
PSM2_RCVTHREAD | 0 or 1 | When set to 0, disables the PSM2 progress thread. If not disabled, the PSM2 run-time creates an additional thread within each process; this thread can interfere with mOS process/thread placement and reduce performance. Some application environments require the progress thread in order to make forward progress; in those environments, the existence of the PSM2 progress thread must be made known to the mOS kernel through the yod --util_threads option (see the yod man page for a more detailed description). Some 2019 Intel MPI editions require that this thread be enabled. |
PSM2_MQ_RNDV_HFI_WINDOW | 4194304 | For use on clusters with Intel(R) Omni-Path Fabric. |
PSM2_MQ_EAGER_SDMA_SZ | 65536 | For use on clusters with Intel(R) Omni-Path Fabric. |
PSM2_MQ_RNDV_HFI_THRESH | 200000 | For use on clusters with Intel(R) Omni-Path Fabric. |
KMP_AFFINITY | none | Does not bind OpenMP threads to CPU resources, allowing the mOS kernel to choose from reserved CPU resources. If the operating system supports affinity, the compiler uses the OpenMP thread affinity interface to determine machine topology. |
HFI_NO_CPUAFFINITY | 1 | For use on clusters with Intel(R) Omni-Path Fabric. Disables affinitization of the PSM2 progress thread. |
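As an illustration only (the appropriate settings depend on the cluster and the MPI edition in use), these variables are typically exported before the launch. The example below assumes a 2019 Intel MPI edition on an Omni-Path cluster and a hypothetical application ./app:
export I_MPI_PIN=off
export I_MPI_FABRICS=shm:ofi
export I_MPI_FALLBACK=0
export KMP_AFFINITY=none
mpirun -ppn 2 yod -R 1/2 ./app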
The UTIlity thread API (UTI API) has been developed by RIKEN Advanced Institute for Computational Science - Japan, Sandia National Laboratories, and Intel Corp. with feedback from Fujitsu and Argonne National Laboratory.
The UTI API allows run-times and applications to control the placement of the threads which are not the primary computational threads within the application.
The API:
- Keeps these extra threads from interfering with computational threads.
- Allows grouping and placing of utility threads across the ranks within a node to maximize performance.
- Does not require the caller to have detailed knowledge of the system topology or the scheduler. Allows the kernel to provide intelligent placement and scheduling behavior.
- Does not require the caller to be aware of other potentially conflicting run-time or application thread placement actions. CPU selection is managed globally across the node by mOS.
- Header file /usr/include/uti.h contains the function and macro declarations.
- #include <uti.h>
- Library /usr/lib/libmos.so contains the mOS implementation of the UTI API. Link using the following:
- "-lmos'
The programmer can provide behavior and location hints to the kernel. The kernel then uses its knowledge of the system topology and the available scheduling facilities to intelligently place and run the utility thread. The scheduler can optimize its scheduling actions for the following behaviors: CPU-intensive threads (e.g., constant polling), high or low scheduling priority, threads that block or yield infrequently, or threads that expect to run on a dedicated CPU. The scheduler can also optimize placement considering the L1/L2/L3/NUMA domain, a specific Linux CPU, a lightweight kernel CPU, or CPUs that handle fabric interrupts.
There are various ways of specifying a location:
- Explicit NUMA domain
- Supply a bit mask containing NUMA domains.
- Location relative to the caller of the API.
- Same L1, L2, L3, or NUMA domain
- Different L1, L2, L3, or NUMA domain
- Location relative to other utility threads specifying a common key.
- Allows grouping of utility threads used across ranks within the node.
- Used in conjunction with a specification of "Same L1, L2, L3, or NUMA domain"
- Type of CPU
- Can be used in conjunction with the above location specifications
- FWK - Linux CPU running under the Linux scheduler
- LWK - lightweight kernel controlled CPU
- Fabric Interrupt handling CPU
This example shows the required sequence of operations to place utility threads on Linux CPUs running under the same L2 cache.
- Run-time agrees on a unique key value to use across ranks within a node.
- Each rank creates a utility thread and specifies:
- The same location key value.
- Request Same L2
- Request FWK CPU type
- When the first utility thread is created, mOS will pick an appropriate Linux CPU and L2 cache.
- All subsequent utility threads created with the same key will be placed on Linux CPUs and share the same L2 cache.
- The mOS kernel will assign the utility threads balanced across the available CPUs that satisfy the location requested.
The UTI attribute object is an opaque object that contains the behavior and location information to be used by the kernel when a pthread is created. The definitions of the fields within the object are OS-specific and purposely hidden from the user interface. This object is treated similarly to the pthread_attr object within the pthread library. It is passed to the uti_pthread_create() interface, along with the standard arguments passed to pthread_create(). The libmos.so library contains the functions used to prepare the attribute object for use.
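Based on the usage example later in this section, the creation interface presumably mirrors pthread_create() with one additional trailing parameter for the UTI attribute object:
- int uti_pthread_create(pthread_t *thread, const pthread_attr_t *attr, void *(*start_routine)(void *), void *arg, uti_attr_t *uti_attr);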
The following function is provided for initializing the attribute object before use:
- int uti_attr_init(uti_attr_t *attr);
The following function is provided to destroy the attribute object:
- int uti_attr_destroy(uti_attr_t *attr);
This is the list of library functions used to set behaviors in the attribute object:
- int uti_attr_cpu_intensive(uti_attr_t *attr);
- CPU intensive thread, e.g. constant polling
- int uti_attr_high_priority(uti_attr_t *attr);
- Expects high scheduling priority
- int uti_attr_low_priority(uti_attr_t *attr);
- Expects low scheduling priority
- int uti_attr_non_cooperative(uti_attr_t *attr);
- Does not play nice with others; infrequent yields and/or blocks
- int uti_attr_exclusive_cpu(uti_attr_t *attr);
- Expects to run on a dedicated CPU
This is the list of library functions used to set location in the attribute object:
- int uti_attr_numa_set(uti_attr_t *attr, unsigned long *nodemask, unsigned long maxnodes);
- int uti_attr_same_numa_domain(uti_attr_t *attr);
- int uti_attr_different_numa_domain(uti_attr_t *attr);
- int uti_attr_same_l1(uti_attr_t *attr);
- int uti_attr_different_l1(uti_attr_t *attr);
- int uti_attr_same_l2(uti_attr_t *attr);
- int uti_attr_different_l2(uti_attr_t *attr);
- int uti_attr_same_l3(uti_attr_t *attr);
- int uti_attr_different_l3(uti_attr_t *attr);
- int uti_attr_prefer_lwk(uti_attr_t *attr);
- int uti_attr_prefer_fwk(uti_attr_t *attr);
- int uti_attr_fabric_intr_affinity(uti_attr_t *attr);
- int uti_attr_location_key(uti_attr_t *attr, unsigned long key);
The uti_pthread_create interface will return EINVAL if conflicting or invalid specifications are provided in the UTI attributes. For example, EINVAL will be returned if 'Same L2' and 'Different L2' are both requested. In these cases, no thread is created. In other situations, when there is no obvious conflict, the thread is created even if the requested location or behavior could not be satisfied. Location and behavior results can be determined using the interfaces listed below; the return value is 1 for true and 0 for false. Setting pthread attributes should be done with caution, since they will override the actions/results provided by the UTI attributes.
- int uti_result_different_numa_domain(uti_attr_t *attr);
- int uti_result_same_l1(uti_attr_t *attr);
- int uti_result_different_l1(uti_attr_t *attr);
- int uti_result_same_l2(uti_attr_t *attr);
- int uti_result_different_l2(uti_attr_t *attr);
- int uti_result_same_l3(uti_attr_t *attr);
- int uti_result_different_l3(uti_attr_t *attr);
- int uti_result_prefer_lwk(uti_attr_t *attr);
- int uti_result_prefer_fwk(uti_attr_t *attr);
- int uti_result_fabric_intr_affinity(uti_attr_t *attr);
- int uti_result_exclusive_cpu(uti_attr_t *attr);
- int uti_result_cpu_intensive(uti_attr_t *attr);
- int uti_result_high_priority(uti_attr_t *attr);
- int uti_result_low_priority(uti_attr_t *attr);
- int uti_result_non_cooperative(uti_attr_t *attr);
- int uti_result_location(uti_attr_t *attr);
- int uti_result_behavior(uti_attr_t *attr);
- int uti_result(uti_attr_t *attr);
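For example, a caller that wants to distinguish whether the location request or the behavior hints went unsatisfied can check the coarse-grained result functions after uti_pthread_create() returns. A minimal fragment (the creation call itself and the uti_attr variable are shown in the full example below):
if (!uti_result_location(&uti_attr))
    fprintf(stderr, "UTI location request not satisfied\n");
if (!uti_result_behavior(&uti_attr))
    fprintf(stderr, "UTI behavior hints not satisfied\n");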
Note: if your application could be running concurrently with another application using the UTI API, you may need to generate a location key that does not mistakenly match the key in the other application. This example simply uses a statically defined key value.
#include <uti.h>
pthread_attr_t p_attr;
uti_attr_t uti_attr;
int ret;
..
/* Initialize the attribute objects */
if ((ret = pthread_attr_init(&p_attr)) ||
(ret = uti_attr_init(&uti_attr)))
goto uti_exit;
/* Request to put the thread on the same L2 as other utility threads.
* Also indicate that the thread repeatedly monitors a device.
*/
if ((ret = uti_attr_same_l2(&uti_attr)) ||
(ret = uti_attr_location_key(&uti_attr, 123456)) ||
(ret = uti_attr_cpu_intensive(&uti_attr)))
goto uti_exit;
/* Create the utility thread */
if ((ret = uti_pthread_create(idp, &p_attr, thread_start, thread_info, &uti_attr)))
goto uti_exit;
/* Did the system accept our location and behavior request? */
if (!uti_result(&uti_attr))
printf("Warning: utility thread attributes not honored.\n");
if ((ret = uti_attr_destroy(&uti_attr)))
goto uti_exit;
..
uti_exit:
Interactions between pthread_attr and uti_attr
Avoid the use of pthread_attr_setaffinity_np when specifying a location with the uti_attr object. The pthread_attr_setaffinity_np directive is prioritized over the uti_attr location requests. If valid CPUs are specified, this action may alter the placement directives requested by the UTI attributes object. If invalid CPUs are provided, this will result in the uti_pthread_create interface returning EINVAL with no utility thread created.
Avoid the use of pthread_attr_setschedparam and pthread_attr_setschedpolicy when specifying a behavior within the uti_attr object. These attributes are prioritized over the uti_attr behavior requests, and their use may alter the actions that would have been taken based on the uti_attr behavior hints. A policy or param that is invalid for an mOS process will result in the uti_pthread_create interface returning EINVAL with no utility thread created.
The XPMEM implementation in mOS is derived from the open source XPMEM implementation at https://gitlab.com/hjelmn/xpmem and is compatible with it with respect to user APIs. The user API description can be found either in the open source XPMEM implementation at the link above or in the mOS XPMEM installation header files mentioned in the table below. In addition, a few fixes were made to the user space XPMEM library; users can pick up these changes by re-building/linking their applications against the mOS user space XPMEM library.
XPMEM component | Path to installation on mOS |
---|---|
Shared library | /usr/lib/libxpmem.so |
Header files | /usr/include/xpmem/ |
The XPMEM kernel module is loaded during kernel boot and is ready to use after a successful boot, without any additional steps by the user.
The mOS XPMEM implementation supports huge pages for the attached virtual memory. This usage of huge pages for an attachment is subject to the constraints listed below. Terminology: an owner process is any user process that shares its virtual address space to be consumed by others; a non-owner is any other user process that attaches the owner's shared address space into its own virtual address space and then accesses the shared memory. The tables below list the constraints for huge page usage in the mOS XPMEM implementation and provide recommendations for handling them.
Constraints due to owner virtual memory | Recommendations |
---|---|
Usage of huge pages in the owner page table itself. | Map and share large LWK memory in the owner process, e.g., for an XPMEM share, mmap() large areas (>2 MB) using the MAP_PRIVATE and MAP_ANONYMOUS flags, or use brk(). |
Remapping of huge pages is not supported for Linux memory. For an LWK process, the memory for data/bss, brk, and private anonymous mmaps is allocated out of LWK memory; the rest of the process memory (e.g., the text area, file-backed mmaps) is allocated from Linux memory. Huge pages are supported by mOS XPMEM in the non-owner only for the corresponding LWK memory in the owner process. | Avoid XPMEM sharing of Linux memory in the owner address space if large memory needs to be XPMEM shared. The expectation is that an LWK process uses more LWK memory than Linux memory. |
The alignment of the start address of the shared segment. | Create a shared segment with a virtual start address aligned to a huge page boundary. Typically, when large memory is mapped through LWKMEM, the mapped address is already huge-page aligned. |
The length of the shared segment. | Needs to be at least 2 MB or 4 MB, depending on the smallest huge TLB size supported by the hardware. |
Holes in the virtual address space covered by the shared segment. | It is recommended that the non-owner attach to the owner's shared address space after it is mapped by the owner, e.g., using mmap, mremap, or brk. Creating an XPMEM share over owner virtual address space that is not mapped yet (a hole) is still supported; this recommendation simply states that the owner regions being attached to should be fully mapped first. |
Recreation of memory maps with larger sizes that could potentially result in using higher-order huge pages. | The alignment of an XPMEM attachment in the non-owner depends largely on the corresponding owner address space at the time of attachment. If the corresponding owner address space changes, i.e., a previously existing map is unmapped and a new, larger map is created, it is recommended to detach the existing XPMEM attachment and create a new attachment, to ensure that the attachment is aligned to the newly allocated huge page size in the owner. |

Constraints due to non-owner virtual memory | Recommendations |
---|---|
The length of the XPMEM attachment used. | Needs to be at least 2 MB or 4 MB, depending on the smallest huge TLB size supported by the hardware. |
Fixed start virtual address used for the attachment. | It is recommended that the application not use a fixed start address (MAP_FIXED) for an attachment, so that the kernel can choose the best huge page alignment for the attachment. |
Offset of the attachment. | Offsets that are not multiples of the huge page size can result in attaching to an unaligned virtual memory start in the owner address space, which in turn forces the remap to use smaller pages if the resulting start/end address falls in the middle of a huge page. |
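Putting the recommendations above together, the sketch below shows an owner sharing a large anonymous mapping and a non-owner attaching to it, using the user-space XPMEM API from the open source implementation. Error handling is abbreviated, and the include path, permit value, and sizes are illustrative:
#include <sys/mman.h>
#include <xpmem.h>   /* headers installed under /usr/include/xpmem/ */

#define SHARE_SIZE (64UL << 20)   /* 64 MB: > 2 MB, so huge pages can be used */

/* Owner: map a large private anonymous region (LWK memory for an mOS process)
 * and export it. The returned segid is passed to the non-owner out of band. */
xpmem_segid_t share_region(void **buf_out)
{
    void *buf = mmap(NULL, SHARE_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    *buf_out = buf;
    return xpmem_make(buf, SHARE_SIZE, XPMEM_PERMIT_MODE, (void *)0666);
}

/* Non-owner: attach the whole segment. No fixed (MAP_FIXED-style) address is
 * requested, so the kernel is free to pick a huge-page-aligned attachment,
 * and the offset is kept at 0, a multiple of the huge page size. */
void *attach_region(xpmem_segid_t segid)
{
    struct xpmem_addr addr;
    xpmem_apid_t apid = xpmem_get(segid, XPMEM_RDWR, XPMEM_PERMIT_MODE,
                                  (void *)0666);
    if (apid < 0)
        return NULL;
    addr.apid = apid;
    addr.offset = 0;
    return xpmem_attach(addr, SHARE_SIZE, NULL);
}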
The yod option 'lwkmem-xpmem-stats' captures mOS XPMEM statistics about huge page remapping. When this option is used, each LWK process writes its statistics to the dmesg log at the end of the run. It can be used while running an LWK application to see whether the application ran into one of the constraints above.
Ex: yod -o lwkmem-xpmem-stats <application> <application args>
Some components of the mOS LWK are instrumented with Linux' ftrace support.
Enabling/disabling trace events and dumping the trace buffer requires root permissions.
To enable tracing, write a '1' to the individual event's control file or to the global control file:
# To see all of the supported events:
$ ls /sys/kernel/debug/tracing/events/mos
# To enable just the "mos_clone_cpu_assign" event:
$ echo 1 > /sys/kernel/debug/tracing/events/mos/mos_clone_cpu_assign/enable
# To enable all mOS events:
$ echo 1 > /sys/kernel/debug/tracing/events/mos/enable
After you run, you can dump the trace buffer:
$ cat /sys/kernel/debug/tracing/trace
Tracing real workloads can easily overflow the trace ring, resulting in loss of data earlier in the run. This can be worked around easily by routing the ftrace pipe into a file prior to initiating the workload:
$ cat /sys/kernel/debug/tracing/trace_pipe | tee my.ftrace