
mOS for HPC v1.0 User Guide


Node resource management

mOS for HPC combines a lightweight kernel (LWK) with Linux. Resources, e.g., CPUs and physical memory blocks, are either managed by the LWK or by the Linux kernel. The process of giving resources to the LWK, thereby taking them away from Linux management, is called designation. Resources that have been designated for the LWK are still visible from Linux but are now under LWK control. For example, the Linux kernel, when directed to do so, can perform I/O using memory designated for the LWK, but the LWK decides how that memory is managed. LWK resource designation can be done at boot time, or later using the lwkctl command.

Giving some or all the designated LWK resources to an mOS process is called reservation and is done at process launch time using a utility called yod (see below). The third stage, allocation, happens when a running process requests a resource; e.g., through calls like mmap() or sched_setaffinity(). A process can only allocate resources that have been reserved for it at launch time, and designated as an LWK resource before that.

Earlier versions of mOS sent most system calls to Linux CPUs for processing; only about a dozen calls that directly affected LWK resources were handled locally by the LWK. Since version v0.8, system calls are no longer forwarded to Linux CPUs. The LWK has access to all of the Linux kernel code, and we found that running that code locally is more efficient and does not increase OS jitter on LWK cores. Linux CPUs are still needed to boot the system, handle interrupts and timer ticks, and occasionally serve as utility CPUs for hosting application or runtime utility threads.

A runtime system or a user can mark a given thread as a utility thread that may get moved to a Linux CPU if not enough LWK CPUs are available to execute the primary computation. Utility threads are usually not part of the high-performance computation; rather, they assist by monitoring or ensuring that progress is made in the background. As such, utility threads usually do not need much compute power, but they tend to introduce noise into a system. Therefore, it is often best not to let them execute on the compute CPUs in use by the application. In many situations this can be accomplished using yod options, as sketched below. For more advanced management of utility threads, see mOS for HPC Utility Thread API.
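For instance, a minimal sketch of steering utility threads with yod, using the --util_threads option described later in this guide (the application name ./app, the resource fraction, and the thread count are placeholders):

 # Hypothetical example: reserve 1/4 of the LWK resources and let the kernel
 # heuristically treat up to 2 threads of the process as utility threads
 yod -R 1/4 --util_threads 2 ./app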

Checking CPUs and memory designated to the lightweight kernel

The lwkctl command can be used to display the LWK partition information. This includes the list of LWK CPUs, LWK memory and utility CPUs. 

To see the output in human-readable format, use:

lwkctl -s

To see the output in raw format, use:

lwkctl -s -r

For further details regarding usage refer to the lwkctl man page on a compute node where mOS for HPC is installed.

Launching applications on mOS for HPC

Applications are run under mOS for HPC through a launcher command called yod. Any program not launched with yod will simply run on Linux. This document discusses how to use yod in conjunction with mpirun, but does not discuss job schedulers.

Launching processes with yod

The yod utility of mOS is the fundamental mechanism for reserving LWK resources for a process being spawned by the native job launcher.  The syntax is:

yod yod-arguments program program-arguments

One of yod's principal jobs is to reserve LWK resources (CPUs, GPUs, memory) for the process being spawned.  yod supports a simple syntax whereby a fraction of the resources that have been designated for the LWK are reserved.  This is useful when launching multiple MPI ranks per node.  In such cases, the general pattern looks like this:

mpirun -ppn N mpirun-args yod -R 1/N yod-args program program-args

This reserves for each MPI rank an equal portion of the designated LWK resources.
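For example, a concrete instance of this pattern for four ranks per node might look like the following (the application name ./app is a placeholder):

 # Hypothetical example: 4 MPI ranks per node, each rank reserving 1/4 of the
 # designated LWK CPUs, GPUs, and memory
 mpirun -ppn 4 yod -R 1/4 ./app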

Please consult the mpirun man page for further information on mpirun-args.

The following are the supported yod options. This information can also be found in the yod man page.

Option Description
--resources, -R <fraction, all, MPI, file:map_file> Reserves a portion of the LWK resources. If specified as a fraction, then the corresponding number of LWK cores and GPU devices are reserved, as well as an equal portion of the designated LWK memory. A fraction may be specified in floating point format or as a rational number M/N, where M and N are integers. If MPI is specified, then MPI environment variables are used to determine the fractional amount of resources. If file:map_file is specified, then a mapping file is used to specify LWK CPU and/or memory and/or number of utility threads per MPI rank. See RESOURCE MAP FILES for details. If all is specified, all designated LWK resources are reserved. This option may not be combined with any of the --cpus, --cores, --gpus or --mem options.
--cpus, -c <list, mask, all> Reserves the specified CPUs for the command being launched. If all is specified, all available LWK CPUs are reserved. Otherwise, the argument describes a list of LWK CPUs to be reserved and is in either list or mask format.
--cores, -C <number, fraction, all, MPI> Reserves the specified number of LWK cores for the command being launched.  If all is specified, all  available LWK cores are reserved.  Otherwise, the argument specifies either a number of cores to be reserved, or a fraction of the overall LWK cores designated for mOS use.  A fraction may be specified in floating point format or as a rational number M/N, where M and N are integers.   Fractions must be in the interval (0, 1].  If MPI is specified then MPI environment variables are used to determine the fractional amount of core resources.
 The following options control reservation of LWK memory.  If specified, then LWK CPUs must also be specified via either  the  --cpus  or the --cores option.
--mem, -M <size, fraction, all, MPI> Reserve the specified amount of memory for the command being launched. Size is an integer or decimal number, optionally augmented with K, M or G to indicate units of kilobytes, megabytes or gigabytes, respectively. A fraction reserves memory as a portion of the overall amount of memory designated for LWK use. Fractions must be in the interval (0.0, 1.0) and may be specified either in floating point format or as a rational number M/N, where M and N are integers (M <= N). If all is specified, then all available LWK memory is reserved. If MPI is specified then MPI environment variables are used to determine the fractional amount of memory resources.
The following options control the reservation of GPU devices. If these options are not specified and if the ZE_AFFINITY_MASK environment variable  is set, the ZE_AFFINITY_MASK will control the available GPU devices. If neither of these options nor the --resources/-R option is specified and the ZE_AFFINITY_MASK is not set, yod will reserve all available GPU devices.
--gpus, -G <number, fraction, all> Reserves the specified number of GPU devices for the command being launched. If all is specified, all available GPU devices are reserved. Otherwise, the argument specifies either a number of GPU devices to be reserved or a fraction of the total GPU devices available for mOS use. A fraction may be specified in floating point format or as a rational number M/N, where M and N are integers. Fractions must be in the interval (0, 1]. If MPI is specified then MPI environment variables are used to determine the fractional amount of GPU resources. If this option is specified, then LWK CPUs must also be specified via either the --cpus or the --cores option along with memory. This option and the --gpu-tiles option are mutually exclusive.
--gpu-tiles, -g <number, fraction, all> Reserves  the  specified  number  of  GPU  tiles for the command being launched. If all is specified, all available GPU tiles are reserved.  Otherwise, the argument specifies either a number of GPU tiles to be reserved, or a fraction of the  total  GPU  tiles available for mOS use.  A fraction may be specified in floating point format or as a rational number M/N, where M and N are integers.  Fractions must be in the interval (0, 1].  If MPI is specified then MPI environment variables are used  to  determine  the fractional  amount  of GPU resources.  If this option is specified, then LWK CPUs must also be specified via either the --cpus or the --cores option along with memory. This option and the --gpus option are mutually exclusive.
Additional options
--util_threads, -u <number> Specify number of threads to be identified as utility threads within the process being launched. If a value  is specified,  the kernel will heuristically identify that number of threads as utility threads and provide special placement and behaviors to those threads. If no value is specified, the kernel will make no heuristic identification of utility threads. If the number of  utility threads is specified in the -R file:map_file option then that value overrides the value specified here.
--resource_algorithm <numa, simple, random> Controls  the selection and layout of CPUs relative to the overall set of designated LWK CPUs.  See RESOURCE ALGORITHMS for additional information.  The default is numa.
--memory-preference, -p <preference> States preferences of types of memory to use for various kinds of allocations.  See MEMORY PREFERENCES  for  additional  information.
--layout <description> Provides CPU (hardware thread) ordering suggestions to the mOS scheduler.  See THREAD LAYOUT for additional information.
--rank-layout <compact, scatter[:stride], disabled> Provides  a hint to lay out ranks in a prescribed order.  A compact layout will place adjacent ranks near each other, from a NUMA perspective.  A scatter layout will interleave ranks using a stride; if not specified, the stride will  be  the  number  of  NUMA domains  for  CPUs.  Disabling the layout will not prescribe any specific layout of the ranks with respect to NUMA domains.  This option is a hint and requires additional support from the underlying MPI launch mechanism.
--brk-clear-length size For non-negative size values, size defines the number of bytes to clear (zero) at the beginning of the expanded region when the brk system call expands the data segment.  For negative size values, the entire expanded region will be cleared.  The default behavior is to clear 4K.  The size argument may be specified in integer or symbolic format (4K, 2M, 1G, etc.).
--mosview <lwk, all> Sets the mOS view of process being launched.  If lwk is specified, the process will see only LWK global resources but not Linux resources.  If all is specified, the process will see both LWK and Linux resources.  The default is all.
--maxpage scope:maxpage Sets the largest page size that can be used for a virtual memory region.  scope can be dbss for .data/.bss area, heap for brk area, anon_private for private anonymous area, tstack for thread stacks, stack for process stack or all for all LWK memory backed areas.  maxpage can be either 4k, 2m, or 1g. Setting for multiple virtual memory regions can be specified by using separator '/'  between settings of each virtual memory region. One can specify a different maxpage for each region.  If setting for a virtual memory region is not specified then by default the largest page size supported by the hardware is used for that region, unless interleaving is active. If interleaving is active, then the default maximum page size is 2m.
--pagefault scope:level Sets the page faulting level that can be used for a virtual memory region. scope can be dbss for .data/.bss area, heap for brk area, anon_private for private anonymous area, tstack for thread stacks, stack for process stack or all for all LWK memory backed areas.  level can be either nofault or onefault. Setting for multiple virtual memory regions can be specified by using  separator '/'  between settings of each virtual memory region. One can specify a different level for each region.  If setting for a virtual memory region is not specified then a default setting of nofault is applied.
--mempolicy scope:type Sets the memory policy type that can be used for a virtual memory region. scope can be dbss for .data/.bss area, heap for brk area, anon_private for private anonymous area, tstack for thread stacks, stack for process stack or all for all LWK memory backed areas. type can be either normal, random, interleave, or interleave_random. Settings for multiple virtual memory regions can be specified by using the separator '/' between the settings of each virtual memory region. One can specify a different type for each region. If no setting is specified for a virtual memory region, a default of interleave is applied if more than one NUMA domain is reserved for a memory type; otherwise the default setting is normal.
--dry-run Do not actually reserve resources and launch.
--verbose, -v <number> Controls the verbosity of yod. number is an integer between 0 and 9.
--help, -h Prints a terse version of this documentation.
--option, -o kernel-option[=value] Passes  a kernel-option to the mOS kernel. In addition to the previously described options in this table, yod contains several experimental kernel options. These may be elevated to be fully supported or may disappear in future releases.  The kernel options are described in the following table. 



Kernel Option Description
lwkmem-report=<level> Generates a report of memory usage and writes it to the kernel console when the process exits. Values for level:
  • 0: default, no reporting
  • 1: physical memory report
  • 2: physical and virtual memory report
XPMEM statistics are included in the virtual memory report.
lwkmem-vmr-disable=<vmr> Disables the use of LWK memory for the indicated VMR. Multiple comma-separated VMRs can be specified. Supported vmr specifications: dbss, heap, anon_private, tstack

lwkmem-pma-cache=<page_type:num_pages> Specify the number of cached physical pages to maintain for each of the page sizes. The default is 512 4k and 2m pages, and 4 1g pages. Multiple comma-separated arguments can be specified.
lwkmem-pma=<physical_mem_alloc> Specify the physical memory allocator to be used. Currently the only physical_mem_alloc implemented is the buddy allocator.
lwksched-disable-setaffinity=<errno> Do not perform any action as a result of the sched_setaffinity system call if the target is an mOS thread. On return from the system call, errno is set to the value provided.

Examples:

  • --opt lwksched-disable-setaffinity=0 will no-op the system call.
  • --opt lwksched-disable-setaffinity=38 will cause the system call to fail with ENOSYS

This option is useful for debugging.  There are no plans to make it officially supported.

lwksched-enable-rr=<time_quantum> Modify the time quantum for round robin dispatching or disable round robin dispatching. By default, when more than one thread is executing on a CPU, round robin dispatching is automatically enabled with a 100ms time quantum. Each thread will execute up to the value of  <time_quantum> milliseconds before being preempted by another thread of equal priority. The minimum supported time quantum is 10ms. If a value of 0 is specified, automatic round robin dispatching will be disabled.  By default, no timer tick will occur if only one LWK thread is runnable on an LWK CPU.
lwksched-stats=<level> Output counters to the kernel log at the time of process exit. Data detail is controlled by <level>. A value of 1 will generate an entry for every mOS CPU that had more than one mOS thread committed to run on it. A value of 2 will add a summary record for the exiting mOS process. A value of 3 will add records for all CPUs in the process and a process summary record for the exiting process regardless of commitment levels. Information provided:
  • PID: the TGID of the process. This can be used to visually group the CPUs that belong to a specific process
  • CPUID: CPU corresponding to the data being displayed
  • THREADS: number of threads within the process (main thread plus pthreads)
  • CPUS: number of CPUs reserved for use by this process
  • MAX_COMMIT: high water mark of the number of mOS threads assigned to run on this CPU
  • CPU MAX_RUNNING: high water mark of the number of tasks enqueued to the mOS run queue, including kernel tasks.
  • GUEST_DISPATCH: number of times a non-mOS thread (kernel thread) was dispatched on this CPU.
  • TIMER_POP: the number of timer interrupts. Typically this would be as a result of a POSIX timer expiring or RR dispatching, if enabled through the option lwksched-enable-rr
  • SETAFFINITY: The number of sched_setaffinity system calls executed by this CPU.
  • UTIL-CPU: indicator that this CPU has been designated as a utility CPU meant to run utility threads such as the OMP monitor and the PSM Progress threads.

The content and format of the output are highly dependent on the current implementation of the mOS scheduler and therefore are likely to change in future releases.

util-threshold=<X:Y> The X value indicates the maximum number of LWK CPUs that can be used for hosting utility threads. The Y value represents the maximum number of utility threads allowed to be assigned to any one LWK CPU. Some examples:
  • A value of 0:0 will prevent any utility threads from being placed on an LWK CPU and force all utility threads to be placed on the Linux CPUs that are defined to be the syscall target CPUs.
  • A value of -1:1 will allow any number of LWK CPUs to host utility threads; however, a maximum of one utility thread will be assigned to each LWK CPU.
The default behavior is X = -1, Y = 1. The UTI API provides a more controlled approach to placing utility threads.
idle-control=<mechanism,boundary>

mechanism is the fast-path idle/dispatch mechanism used by the idle task. The allowed values are:

  • mwait - mwait instruction will be used for both fast dispatch and deep sleep situations.
  • halt - halt instruction in combination with an IPI sent by the waking thread will be used for fast dispatch. mwait will be used for deep sleep.
  • poll - polling will be used for fast dispatch, mwait will be used to request deep sleep.

boundary is the boundary where the fast dispatch mechanism will be deployed. Beyond this boundary, the CPU will request deep sleep. The allowed values are:

  • none - Entry to the idle task will always result in a request for deep sleep.
  • committed - A CPU that has a committed thread to run on it will use the fast dispatch mechanism. A CPU that is reserved by a process but does not have a thread assigned to run on it by the scheduler will request deep sleep using mwait.
  • reserved - A CPU reserved by the process will use the fast dispatch mechanism. A CPU that is not reserved will request deep sleep using mwait.
  • online - All LWK CPUs will use the fast dispatch mechanism. Deep sleep will never be requested.
The default <mechanism,boundary> is: <mwait,reserved>
cmci-control=<threshold, poll>

threshold:

  • The number of correctable machine check events that must occur in a machine check bank before a correctable machine check interrupt is delivered to an LWK CPU that is hosting an mOS process. The value specified must be less than 32768.  A value of 0 disables interrupt delivery to all LWK CPUs hosting the mOS process.
  • disable-cmci: same as threshold = 0
  • max-threshold: (default) Set threshold to the maximum allowed value. This will typically be 32767 on newer hardware platforms. 

poll:

  • disable-poll: Polling will be disabled on the LWK CPUs that are hosting the mOS process. This is the default behavior.
  • enable-poll: Polling will be enabled and operate with the same interval, behavior, and control as in the Linux OS

In all cases, when the mOS process ends, the machine check banks owned by the associated CPUs will be checked, and any pending machine check events will be logged, even if the threshold has not been reached. Also, any disabled CMCIs will be re-enabled, any modified thresholds will be restored, and polling will be re-enabled.

If a threshold is requested and the hardware does not support that threshold value, an info message will be written to the console log. There will be one message written per boot.

enable-balancer=<type, param1, param2, param3>

type: Can be either 'push' or 'pull'. If not specified, the default is pull.

  • push: This type of balancing action is initiated during the normal round robin scheduler timer tick.  A thread may be pushed to an idle or lightly loaded CPU.
    • param1: Consider a push only when the load delta relative to the target candidate is greater than this value. Default: 12
    • param2: Consider a push only when the overcommitted CPU has executed for more than this time without blocking or yielding. Default: 20ms
    • param3: The number of milliseconds before a CPU can be the target of another push. Default: 10ms
  • pull: This type of balancing action occurs when a CPU goes into the idle state or if it has been in the idle state for a specified period of time. This idle CPU will look for the busiest CPU and potentially pull a thread from that CPU.
    • param1: idle timer tick frequency. Default: 100ms.
    • param2: overcommit threshold. Default: 20ms.
    • param3: idle time before first pull action. Default: 0ms.

This can be useful in environments where CPU over-commitment cannot be avoided and the runtimes are not able to place threads intelligently.
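As a sketch of passing one of these kernel options on the yod command line (the application name ./app is a placeholder):

 # Hypothetical example: request a physical memory usage report in the kernel
 # console when the process exits
 yod -R all -o lwkmem-report=1 ./app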

CPU MASKS AND LISTS

The first logical CPU is CPU 0. The second is CPU 1. And so on. CPU masks in yod are hexadecimal literals specified in little endian order. The least significant bit corresponds to CPU 0 and so on. Masks must begin with either "0x" or "0X". CPU lists are CPU numbers or ranges of numbers separated by commas. For example, the list '0-2,8' is equivalent to mask 0x107.
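As a sketch, the same set of CPUs can be reserved either as a list or as a mask (the application name ./app is a placeholder):

 # Hypothetical example: reserve CPUs 0, 1, 2, and 8, first as a list, then as a mask
 yod --cpus 0-2,8 ./app
 yod --cpus 0x107 ./app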

RESOURCE ALGORITHMS

The --cpus form of LWK CPU reservation is explicit in that it specifically identifies the CPUs to be reserved. Other forms are less explicit and in these cases, yod uses the --resource_algorithm specification to reserve and select CPUs and memory. The numa resource algorithm attempts to reserve LWK cores and memory that are near each other in the NUMA sense. The simple resource algorithm reserves LWK cores from the available pool, in ascending order. Memory is reserved from NUMA domains in ascending order. The random CPU algorithm reserves LWK cores randomly from the available pool.
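For instance, a minimal sketch selecting the simple algorithm instead of the default numa algorithm (the fractions and the application name ./app are placeholders):

 # Hypothetical example: reserve half of the LWK cores and memory, selecting
 # cores in ascending order rather than by NUMA locality
 yod --cores 1/2 --mem 1/2 --resource_algorithm simple ./app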

THREAD LAYOUT

The --layout <description> option may be used to suggest how software threads are assigned to CPUs (hardware threads) once specific CPUs have been reserved for the process being launched. The <description> argument may be specified as scatter, compact, or a permutation of the dimensions node, tile, core, and cpu.

The scatter option spreads threads out as much as possible within the reserved LWK CPUs. It is equivalent to node,tile,core,cpu and thus will attempt to spread out across nodes before repeating tiles, spread out across tiles before repeating cores, and so on. This is the default.

The compact option is the opposite of scatter and is equivalent to cpu,core,tile,node. It will select CPUs (hardware threads) on a core before moving to another core. Likewise, it will use all cores on a tile before expanding to another tile, and so on.

Other permutations of node, tile, core, and cpu may be passed to specify the sort order of the CPUs. 

The node, tile, core, and cpu terms may also be augmented with a :<count> suffix, which limits how many of the described entities are preferred before moving on. For example, cpu:1 will construct a layout that uses the first CPU in all reserved cores before using the 2nd and subsequent CPUs in any reserved core. Thus cpu:1,core,tile,node is compact from a node, tile, and core perspective, but will initially consume one CPU per reserved core before binding threads to the remaining CPUs of the reserved cores.
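As a sketch of combining a reservation with a layout hint (the fraction and the application name ./app are placeholders):

 # Hypothetical example: reserve half of the LWK resources and pack threads
 # onto as few cores/tiles as possible
 yod -R 1/2 --layout compact ./app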

MEMORY PREFERENCES

Preferences have the form scope[:size]:order. The scope term identifies a virtual memory region and can be dbss for .data/.bss area, heap for brk area, anon_private for anonymous mmap area, tstack for thread stacks or all for all LWK memory backed areas.

The order term lists types of memory in order of preference. This is a comma delimited list of hbm, dram, and nvram. The default ordering is hbm,dram,nvram. If not all types of memory are explicitly stated, the list is implicitly completed with missing types from this default order. 

The size term, if present, applies the preference to allocations larger than or equal to the specified size. If not specified, size is implicitly 1.

Multiple preferences are separated with a '/' character. 

If no preference is specified, the default behavior is all:1:hbm,dram,nvram. Any preferences specified are relative to this default and are applied in order from left to right. 

Example: 

yod -p all:dram/anon_private:65536:hbm

Gives precedence to DRAM for all memory allocations, except private, anonymous mmaps of 64K or larger. These mmaps will first attempt to be satisfied with high bandwidth memory.

RESOURCE MAP FILES

The file: variant of the --resources option may be used to map CPU, memory, and number of utility threads per MPI rank. The file contains lines of the form:

<local-rank> <resource-spec>...

Where <local-rank> is either an integer identifying the Nth rank on the node or the wildcard character '*'. The <resource-spec> can identify  CPUs, cores, memory, number of utility threads, and/or resource option. The wildcard line is optional. It matches all ranks and should be the last line in the file. Comments are allowed and start with the '#' character.

This option requires that the MPI_LOCALRANKID environment variable is set to identify the rank's ordinal within the node.

Example:

 # The first rank on the node will use 1/4 of the designated resources:

 0 -R 1/4

 # The second rank on the node will use CPU 9 and 1 gigabyte of memory:

 1 -c 9 -M 1G

 # All other ranks use 1 core and 1/8 of the designated memory:

 * --cores 1 --mem 1/8
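Assuming the three example lines above are saved to a map file (the file name rank.map, the rank count, and the application name ./app are placeholders), it could be used like this:

 # Hypothetical example: launch 8 ranks per node, each picking up its resources
 # from the map file
 mpirun -ppn 8 yod -R file:rank.map ./app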

YOD ENVIRONMENT VARIABLES

YOD_VERBOSE may be used to control the verbosity. Specifying --verbose on the command line takes precedence over this environment variable.
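A minimal sketch, assuming YOD_VERBOSE accepts the same 0-9 range as the --verbose option (the application name ./app is a placeholder):

 # Hypothetical example: raise yod's verbosity for every launch in this shell
 export YOD_VERBOSE=3
 yod -R all ./app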

Recommended environment variables

The mOS kernel will reserve unique CPU and memory resources for each process/rank within a node and will assign threads to the CPU resources owned (reserved) by the process. For these reasons, it is advisable to prevent MPI implementations and high-speed fabric implementations from placing processes and threads on specific CPUs; that behavior is likely to conflict with yod's management of resources. For example, Intel MPI will by default attempt to pin ranks to specific CPUs. This must be disabled by setting I_MPI_PIN=off. Another example is the Intel(R) Omni-Path Fabric, which will attempt to place a worker thread on a specific CPU. To disable this behavior, set HFI_NO_CPUAFFINITY=1. Set the appropriate environment variables for your particular environment/hardware.

In addition, if you are running in an OpenMP environment and want mOS to place the OpenMP threads, set KMP_AFFINITY=none. Alternatively, you can allow OpenMP to place threads; mOS will respect OpenMP thread placement.

If left to its own defaults, OneCCL may attempt to affinitize worker threads onto CPUs outside of the LWK CPUs that have been reserved for the current process. This will result in an error being reported by the runtime and the application being terminated. The same problem will occur if the user supplies an affinity list containing CPUs outside of the reserved LWK CPUs. The environment variable that controls this is CCL_WORKER_AFFINITY. If no CCL_WORKER_AFFINITY list is provided, yod will create one prior to executing the application, assigning the CCL worker threads to the last 4 CPUs in the sequence of reserved LWK CPUs. If a valid affinity specification is provided by the caller, no change will be made.
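As a sketch of the environment discussed above (which variables apply depends on your MPI, fabric, and runtime; the values shown are the ones recommended in this section):

 # Hypothetical example environment for an Intel MPI + Omni-Path + OpenMP setup
 export I_MPI_PIN=off           # keep Intel MPI from pinning ranks to CPUs
 export HFI_NO_CPUAFFINITY=1    # keep the Omni-Path worker thread off a fixed CPU
 export KMP_AFFINITY=none       # let mOS place the OpenMP threads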


Shared Memory with XPMEM

The XPMEM implementation in mOS is derived from the open source XPMEM implementation at https://gitlab.com/hjelmn/xpmem. It is compatible with the open source XPMEM implementation with respect to user APIs. The user API description can be found either in the open source XPMEM implementation at the link above or in the mOS XPMEM installation header files listed in the table below. In addition, a few fixes were made to the user space XPMEM library. Users can pick up these changes by re-building/linking their applications with the mOS user space XPMEM library.

XPMEM component Path to installation on mOS
Shared library /usr/lib/libxpmem.so
Header files /usr/include/xpmem/

The XPMEM kernel module is loaded during kernel boot and is ready to use after a successful boot, without any additional steps needed by the user.
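A quick way to sanity-check the installation from a compute node, assuming the standard /dev/xpmem device node created by the XPMEM driver (the device path is an assumption; the library and header paths are from the table above):

 # Hypothetical check: the device node, library, and headers should all be present
 ls -l /dev/xpmem /usr/lib/libxpmem.so /usr/include/xpmem/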

Hugepage usage

The mOS XPMEM implementation supports huge pages for the attached virtual memory. This usage of huge pages for an attachment is subject to the conditions described below. Terminology: an owner process is any user process that shares its virtual address space to be consumed by others; a non-owner is any other user process that attaches the owner's shared address space into its own virtual address space and then accesses the shared memory. The lists below describe the situations for huge page usage in the mOS XPMEM implementation and provide recommendations for handling them.


Owner virtual memory conditions and recommendations:

  • Usage of huge pages in the owner page table itself: Map and share large LWK memory in the owner process. For example, for an XPMEM share, mmap() large areas (>2 MB) using the MAP_PRIVATE | MAP_ANONYMOUS flags, or use brk().
  • Remapping of huge pages is not supported for Linux memory: For an LWK process, the memory for data/bss, brk, and private anonymous mmaps is allocated out of LWK memory; the rest of the process memory is allocated from Linux memory (e.g., the text area, file-backed mmaps, etc.). The huge pages supported by mOS XPMEM in the non-owner apply only to the corresponding LWK memory in the owner process. Avoid XPMEM sharing of Linux memory in the owner address space if large memory needs to be XPMEM shared. The expectation is that an LWK process will use more LWK memory than Linux memory.
  • The alignment of the start address of the shared segment: Create a shared segment with a virtual start address aligned to a huge page boundary. Typically, when large memory is mapped through LWKMEM, the mapped address is already huge page aligned.
  • The length of the shared segment: To use huge pages, the length needs to be at least 2 MB.
  • Holes in the virtual address space covered by the shared segment: It is recommended that the non-owner attach to the owner's shared address space after it has been mapped by the owner, e.g., using mmap, mremap, or brk. Creating an XPMEM share over owner virtual address space that is not mapped yet (a hole) is still supported; this recommendation simply states that the owner regions being attached should be fully mapped first.
  • Recreation of memory maps with larger sizes that could potentially use higher-order huge pages: The alignment of an XPMEM attachment in the non-owner largely depends on the corresponding owner address space at the time of attachment. If the owner address space changes, i.e., a previously existing map is unmapped and a new, larger map is created, it is recommended to detach the existing XPMEM attachment and create a new attachment to ensure that it is aligned to the newly allocated huge page size in the owner.

Non-owner virtual memory actions and recommendations:

  • The length of the XPMEM attachment used: It needs to be at least 2 MB.
  • A fixed start virtual address used for the attachment: It is recommended that the application not use a fixed start address (MAP_FIXED) for an attachment so that the kernel can choose the best huge page alignment for the attachment.
  • The offset of the attachment: Offsets that are not multiples of the huge page size can result in attaching to an unaligned virtual memory start in the owner address space, which in turn forces the remap to use smaller pages if the resulting start/end address falls in the middle of a huge page.


Tracing mOS Code

Some components of the mOS LWK are instrumented with the Linux ftrace support.

Enabling/disabling trace events and dumping the trace buffer requires root permissions.

To enable tracing, write a '1' to the individual event's control file or to the global control file:


   # To see all of the supported events:

   $ ls /sys/kernel/debug/tracing/events/mos

   # To enable just the "mos_clone_cpu_assign" event:

   $ echo 1 > /sys/kernel/debug/tracing/events/mos/mos_clone_cpu_assign/enable 

   # To enable all mOS events:

   $ echo 1 > /sys/kernel/debug/tracing/events/mos/enable

After you run, you can dump the trace buffer:

   $ cat /sys/kernel/debug/tracing/trace

Tracing real workloads can easily overflow the trace ring, resulting in loss of data earlier in the run.  This can be worked around easily by routing the ftrace pipe into a file prior to initiating the workload:

   $ cat /sys/kernel/debug/tracing/trace_pipe | tee my.ftrace
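When you are done, tracing can be turned off and the buffer cleared using the standard ftrace control files (a sketch; the paths mirror the enable commands above):

   # Disable all mOS events and clear the trace buffer
   $ echo 0 > /sys/kernel/debug/tracing/events/mos/enable
   $ echo > /sys/kernel/debug/tracing/trace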


