This is the third part of the chapter that describes timers and time management in the Linux kernel. In the previous part we stopped at the clocksource framework. We started to consider this framework because it is closely related to the special counters provided by the Linux kernel. One of these counters, which we already saw in the first part of this chapter, is jiffies. As I already wrote in the first part of this chapter, we will consider time management step by step during the Linux kernel initialization. The previous step was the call of the:
register_refined_jiffies(CLOCK_TICK_RATE);
function, which is defined in the kernel/time/jiffies.c source code file and initializes the refined_jiffies clock source for us. Recall that this function is called from the setup_arch function, which is defined in the arch/x86/kernel/setup.c source code file and performs architecture-specific (x86_64 in our case) initialization. Look at the implementation of setup_arch and you will note that the call of register_refined_jiffies is the last step before the setup_arch function finishes its work.
Many x86_64-specific things are already configured by the time setup_arch finishes. For example, some early interrupt handlers are already able to handle interrupts, memory space is reserved for the initrd, DMI is scanned, the Linux kernel log buffer is set up (which means that the printk function is able to work), e820 is parsed so the Linux kernel already knows about available memory, and many other architecture-specific things are done (if you are interested, you can read more about the setup_arch function and the Linux kernel initialization process in the second chapter of this book).
Now setup_arch has finished its work and we can get back to the generic Linux kernel code. Recall that the setup_arch function was called from the start_kernel function, which is defined in the init/main.c source code file, so we return to this function. You can see that many different functions are called right after setup_arch inside start_kernel, but since our chapter is devoted to timers and time management, we will skip all code that is not related to this topic. The first function related to time management in the Linux kernel is:
tick_init();
in start_kernel. The tick_init function is defined in the kernel/time/tick-common.c source code file and does two things:
- Initialization of tick broadcast framework related data structures;
- Initialization of full tickless mode related data structures.
We have not seen anything related to the tick broadcast framework in this book yet, and we do not know anything about tickless mode in the Linux kernel. So, the main point of this part is to look at these concepts and to understand what they are.
First of all, let's look at the implementation of the tick_init function. As I already wrote, this function is defined in the kernel/time/tick-common.c source code file and consists of two calls to the following functions:
void __init tick_init(void)
{
tick_broadcast_init();
tick_nohz_init();
}
As you can understand from the paragraph's title, we are interested only in the tick_broadcast_init function for now. This function is defined in the kernel/time/tick-broadcast.c source code file and initializes the tick broadcast framework related data structures. Before we look at the implementation of the tick_broadcast_init function and try to understand what it does, we need to know about the tick broadcast framework.
The main point of a central processor is to execute programs. But sometimes a processor may be in a special state when it is not being used by any program. This special state is called idle. When the processor has nothing to execute, the Linux kernel launches the idle task. We already saw a little about this in the last part of the Linux kernel initialization process chapter. When the Linux kernel finishes all initialization steps in the start_kernel function from the init/main.c source code file, it calls the rest_init function from the same source code file. The main point of this function is to launch the kernel init thread and the kthreadd thread, to call the schedule function to start task scheduling, and to go to sleep by calling the cpu_idle_loop function, which is defined in the kernel/sched/idle.c source code file.
The cpu_idle_loop function is an infinite loop which checks the need for rescheduling on each iteration. After the scheduler finds something to execute, the idle process finishes its work and control is moved to a new runnable task with the call of the schedule_preempt_disabled function:
static void cpu_idle_loop(void)
{
while (1) {
while (!need_resched()) {
...
...
...
/* the main idle function */
cpuidle_idle_call();
}
...
...
...
schedule_preempt_disabled();
}
}
Of course, we will not consider the full implementation of the cpu_idle_loop function and the details of the idle state in this part, because it is not related to our topic. But there is one interesting point for us. We know that a processor can execute only one task at a time. How does the Linux kernel decide to reschedule and stop the idle process if the processor executes the infinite loop in cpu_idle_loop? The answer is system timer interrupts. When an interrupt occurs, the processor stops the idle thread and transfers control to an interrupt handler. After the system timer interrupt has been handled, need_resched can return true; in that case the Linux kernel stops the idle process and transfers control to a runnable task. But handling system timer interrupts is not efficient for power management: if a processor is in the idle state, there is little point in sending it a system timer interrupt.
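To make the interplay between the idle loop and the timer interrupt more concrete, here is a small userspace model. It is only a sketch: the toy_cpu structure and the toy_* functions are hypothetical names invented for this illustration, and the "interrupt" is an ordinary function call instead of a real hardware event.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-CPU state; not a kernel structure. */
struct toy_cpu {
    bool need_resched;     /* set by the timer "interrupt" */
    int  ticks_seen;       /* timer interrupts handled so far */
    int  work_arrives_at;  /* tick number at which a task becomes runnable */
};

/* Models the system timer interrupt handler: it runs on every tick and
 * marks the CPU for rescheduling once a task has become runnable. */
static void toy_timer_interrupt(struct toy_cpu *cpu)
{
    cpu->ticks_seen++;
    if (cpu->ticks_seen >= cpu->work_arrives_at)
        cpu->need_resched = true;
}

/* Models the inner loop of cpu_idle_loop: stay idle until need_resched
 * is set. Returns how many ticks were handled while idle. */
static int toy_idle_loop(struct toy_cpu *cpu)
{
    assert(cpu->work_arrives_at > 0);

    while (!cpu->need_resched) {
        /* The real kernel calls cpuidle_idle_call() here and the CPU
         * is woken up by an interrupt; we simply fire the tick. */
        toy_timer_interrupt(cpu);
    }

    /* The scheduler would now pick the runnable task. */
    cpu->need_resched = false;
    return cpu->ticks_seen;
}
```

If work arrives at the third tick, toy_idle_loop handles three ticks, and the first two of them woke the processor up for nothing. These are exactly the interrupts that the tickless modes described in this part try to avoid.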
The classic behaviour is selected by the CONFIG_HZ_PERIODIC kernel configuration option, which tells the kernel to handle every interrupt of the system timer. To solve this problem, the Linux kernel provides two additional ways of managing scheduling-clock interrupts:
The first is to omit scheduling-clock ticks on idle processors. To enable this behaviour in the Linux kernel, we need to enable the CONFIG_NO_HZ_IDLE kernel configuration option. This option allows the Linux kernel to avoid sending timer interrupts to idle processors. In this case periodic timer interrupts are replaced with on-demand interrupts. This mode is called dyntick-idle mode. But if the kernel does not handle interrupts of a system timer, how can the kernel decide if the system has nothing to do?
Whenever the idle task is selected to run, the periodic tick is disabled with the call of the tick_nohz_idle_enter function, which is defined in the kernel/time/tick-sched.c source code file, and enabled again with the call of the tick_nohz_idle_exit function. There is a special concept in the Linux kernel called clock event devices, which are used to schedule the next interrupt. This concept provides an API for devices that can deliver interrupts at a specific time in the future, and such a device is represented by the clock_event_device structure in the Linux kernel. We will not dive into the implementation of the clock_event_device structure now; we will see it in the next part of this chapter. But there is one interesting point for us right now.
The second way is to omit scheduling-clock ticks on processors that are either in the idle state or that have only one runnable task, in other words on busy processors as well. We can enable this feature with the CONFIG_NO_HZ_FULL kernel configuration option, and it allows the kernel to reduce the number of timer interrupts significantly.
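The three ways of managing scheduling-clock ticks can be summarized as a single decision: does a processor with a given number of runnable tasks need the periodic tick? The following sketch encodes only the rules described above. The enum and the function are hypothetical names for illustration, and the real kernel takes more conditions into account.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names that mirror the three configuration options. */
enum toy_tick_policy {
    TOY_HZ_PERIODIC,  /* CONFIG_HZ_PERIODIC: always tick           */
    TOY_NO_HZ_IDLE,   /* CONFIG_NO_HZ_IDLE: no tick on idle CPUs    */
    TOY_NO_HZ_FULL,   /* CONFIG_NO_HZ_FULL: no tick on idle CPUs or
                         on CPUs with a single runnable task        */
};

/* Returns true if, under the simplified rules of this part, a CPU with
 * nr_running runnable tasks still needs the scheduling-clock tick. */
static bool toy_tick_needed(enum toy_tick_policy policy, unsigned int nr_running)
{
    assert(policy <= TOY_NO_HZ_FULL);

    switch (policy) {
    case TOY_HZ_PERIODIC:
        return true;
    case TOY_NO_HZ_IDLE:
        return nr_running > 0;
    case TOY_NO_HZ_FULL:
        return nr_running > 1;
    }
    return true; /* not reached for valid policies; keep the tick */
}
```

For example, a processor with exactly one runnable task keeps its tick in dyntick-idle mode but may drop it in full tickless mode.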
Besides spinning in cpu_idle_loop, an idle processor can be in a sleeping state. The Linux kernel provides the special cpuidle framework. The main point of this framework is to put an idle processor into sleeping states. The set of these states is called C-states. But how will a processor be woken up if its local timer is disabled? The Linux kernel provides the tick broadcast framework for this. The main point of this framework is to assign a timer which is not affected by the C-states. This timer will wake the sleeping processor.
Now, after some theory, we can return to the implementation of our function. Let's recall that the tick_init function just calls the following two functions:
void __init tick_init(void)
{
tick_broadcast_init();
tick_nohz_init();
}
Let's consider the first function. The tick_broadcast_init function is defined in the kernel/time/tick-broadcast.c source code file and initializes the tick broadcast framework related data structures. Let's look at the implementation of the tick_broadcast_init function:
void __init tick_broadcast_init(void)
{
zalloc_cpumask_var(&tick_broadcast_mask, GFP_NOWAIT);
zalloc_cpumask_var(&tick_broadcast_on, GFP_NOWAIT);
zalloc_cpumask_var(&tmpmask, GFP_NOWAIT);
#ifdef CONFIG_TICK_ONESHOT
zalloc_cpumask_var(&tick_broadcast_oneshot_mask, GFP_NOWAIT);
zalloc_cpumask_var(&tick_broadcast_pending_mask, GFP_NOWAIT);
zalloc_cpumask_var(&tick_broadcast_force_mask, GFP_NOWAIT);
#endif
}
As we can see, the tick_broadcast_init function allocates different cpumasks with the help of the zalloc_cpumask_var function. The zalloc_cpumask_var function is defined in the lib/cpumask.c source code file and simply calls alloc_cpumask_var with the __GFP_ZERO flag added:
bool zalloc_cpumask_var(cpumask_var_t *mask, gfp_t flags)
{
return alloc_cpumask_var(mask, flags | __GFP_ZERO);
}
Ultimately, the memory for the given cpumask is allocated with the given flags with the help of the kmalloc_node function:
*mask = kmalloc_node(cpumask_size(), flags, node);
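To get a feeling for what a zero-initialized cpumask is, we can model it in userspace as an array of unsigned long words with one bit per processor. This is only a sketch: the toy_* names and the TOY_NR_CPUS limit are assumptions made for this example, and calloc plays the role that kmalloc_node with __GFP_ZERO plays in the kernel.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

#define TOY_NR_CPUS       128  /* hypothetical CPU limit for this sketch */
#define TOY_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define TOY_MASK_LONGS    ((TOY_NR_CPUS + TOY_BITS_PER_LONG - 1) / TOY_BITS_PER_LONG)

typedef unsigned long *toy_cpumask_t;

/* Userspace analogue of zalloc_cpumask_var: allocate a zeroed mask.
 * Returns false if the allocation fails. */
static bool toy_zalloc_cpumask(toy_cpumask_t *mask)
{
    *mask = calloc(TOY_MASK_LONGS, sizeof(unsigned long));
    return *mask != NULL;
}

/* Set the bit that corresponds to the given processor. */
static void toy_cpumask_set_cpu(unsigned int cpu, toy_cpumask_t mask)
{
    assert(cpu < TOY_NR_CPUS);
    mask[cpu / TOY_BITS_PER_LONG] |= 1UL << (cpu % TOY_BITS_PER_LONG);
}

/* Check whether the bit of the given processor is set. */
static bool toy_cpumask_test_cpu(unsigned int cpu, toy_cpumask_t mask)
{
    assert(cpu < TOY_NR_CPUS);
    return (mask[cpu / TOY_BITS_PER_LONG] >> (cpu % TOY_BITS_PER_LONG)) & 1UL;
}

/* A mask is empty when no processor bit is set in any word. */
static bool toy_cpumask_empty(toy_cpumask_t mask)
{
    for (size_t i = 0; i < TOY_MASK_LONGS; i++)
        if (mask[i])
            return false;
    return true;
}
```

A freshly allocated mask is empty, and after toy_cpumask_set_cpu(70, mask) only processor 70 is reported as set. The caller is responsible for releasing the mask with free.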
Now let's look at the cpumasks that are initialized in the tick_broadcast_init function. As we can see, the tick_broadcast_init function initializes six cpumasks, and the initialization of the last three cpumasks depends on the CONFIG_TICK_ONESHOT kernel configuration option.
The first three cpumasks are:
- tick_broadcast_mask - the bitmap which represents the list of processors that are in a sleeping mode;
- tick_broadcast_on - the bitmap that stores the numbers of processors which are in a periodic broadcast state;
- tmpmask - a bitmap for temporary usage.
As we already know, the next three cpumasks depend on the CONFIG_TICK_ONESHOT kernel configuration option. Actually, each clock event device can be in one of two modes:
- periodic - clock event devices that support periodic events;
- oneshot - clock event devices that are capable of issuing events that happen only once.
The Linux kernel defines two masks for such clock event devices in the include/linux/clockchips.h header file:
#define CLOCK_EVT_FEAT_PERIODIC 0x000001
#define CLOCK_EVT_FEAT_ONESHOT 0x000002
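These two values are single bits, so a device can advertise both capabilities at once and each of them can be tested with a bitwise AND. A minimal sketch, assuming a reduced hypothetical device structure (the real clock_event_device has many more fields):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Values copied from the two defines quoted above. */
#define CLOCK_EVT_FEAT_PERIODIC 0x000001
#define CLOCK_EVT_FEAT_ONESHOT  0x000002

/* Hypothetical, reduced stand-in for struct clock_event_device. */
struct toy_clock_event_device {
    const char  *name;
    unsigned int features;
};

/* True if the device can issue periodic events. */
static bool toy_supports_periodic(const struct toy_clock_event_device *dev)
{
    assert(dev != NULL);
    return (dev->features & CLOCK_EVT_FEAT_PERIODIC) != 0;
}

/* True if the device can issue events that happen only once. */
static bool toy_supports_oneshot(const struct toy_clock_event_device *dev)
{
    assert(dev != NULL);
    return (dev->features & CLOCK_EVT_FEAT_ONESHOT) != 0;
}
```

A device whose features field is CLOCK_EVT_FEAT_PERIODIC | CLOCK_EVT_FEAT_ONESHOT passes both checks, while a device with only CLOCK_EVT_FEAT_PERIODIC passes just the first one.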
So, the last three cpumasks are:
- tick_broadcast_oneshot_mask - stores the numbers of processors that must be notified;
- tick_broadcast_pending_mask - stores the numbers of processors that have a pending broadcast;
- tick_broadcast_force_mask - stores the numbers of processors with enforced broadcast.
We have initialized six cpumasks in the tick broadcast framework, and now we can proceed to the implementation of this framework.
Hardware may provide several clock source devices. When a processor sleeps and its local timer is stopped, there must be an additional clock source device that handles the awakening of the processor. The Linux kernel uses these special clock source devices, which can raise an interrupt at a specified time. We already know that such timers are called clock events devices in the Linux kernel. Besides clock events devices, each processor in the system has its own local timer, which is programmed to issue an interrupt at the time of the next deferred task. These timers can also be programmed to do a periodic job, like updating jiffies. These timers are represented by the tick_device structure in the Linux kernel. This structure is defined in the kernel/time/tick-sched.h header file and looks like this:
struct tick_device {
struct clock_event_device *evtdev;
enum tick_device_mode mode;
};
Note that the tick_device structure contains two fields. The first field, evtdev, is a pointer to the clock_event_device structure, which is defined in the include/linux/clockchips.h header file and represents the descriptor of a clock event device. A clock event device allows the kernel to register an event that will happen in the future. As I already wrote, we will not consider the clock_event_device structure and related API in this part, but we will see it in the next part.
The second field of the tick_device structure represents the mode of the tick_device. As we already know, the mode can be one of the following:
enum tick_device_mode {
TICKDEV_MODE_PERIODIC,
TICKDEV_MODE_ONESHOT,
};
Each clock events device in the system registers itself by calling the clockevents_register_device function or the clockevents_config_and_register function during the initialization process of the Linux kernel. During the registration of a new clock events device, the Linux kernel calls the tick_check_new_device function, which is defined in the kernel/time/tick-common.c source code file and checks whether the given clock events device should be used by the Linux kernel. After all checks, the tick_check_new_device function executes a call of the:
tick_install_broadcast_device(newdev);
function, which checks whether the given clock event device can be a broadcast device and installs it if so. Let's look at the implementation of the tick_install_broadcast_device function:
void tick_install_broadcast_device(struct clock_event_device *dev)
{
struct clock_event_device *cur = tick_broadcast_device.evtdev;
if (!tick_check_broadcast_device(cur, dev))
return;
if (!try_module_get(dev->owner))
return;
clockevents_exchange_device(cur, dev);
if (cur)
cur->event_handler = clockevents_handle_noop;
tick_broadcast_device.evtdev = dev;
if (!cpumask_empty(tick_broadcast_mask))
tick_broadcast_start_periodic(dev);
if (dev->features & CLOCK_EVT_FEAT_ONESHOT)
tick_clock_notify();
}
First of all we get the current clock event device from tick_broadcast_device. The tick_broadcast_device is defined in the kernel/time/tick-broadcast.c source code file:
static struct tick_device tick_broadcast_device;
and represents an external clock device that keeps track of events for a processor. The first step after we get the current clock device is the call of the tick_check_broadcast_device function, which checks whether a given clock events device can be used as a broadcast device. The main point of the tick_check_broadcast_device function is to check the value of the features field of the given clock events device. As the name suggests, the features field contains the features of a clock event device. The available values are defined in the include/linux/clockchips.h header file, for example CLOCK_EVT_FEAT_PERIODIC, which marks a clock events device that supports periodic events. So, the tick_check_broadcast_device function checks the features flags for CLOCK_EVT_FEAT_ONESHOT, CLOCK_EVT_FEAT_DUMMY and other flags and returns false if these flags make the given clock events device unsuitable as a broadcast device. Otherwise, the tick_check_broadcast_device function compares the ratings of the given clock event device and the current clock event device and reports whether the given device is the better one.
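The shape of this check can be sketched as follows. Please treat it as a simplified model and not as the kernel implementation: the reduced device structure, the TOY_FEAT_DUMMY value, the oneshot_required parameter and the exact set of rejected flags are assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CLOCK_EVT_FEAT_PERIODIC 0x000001  /* quoted earlier in this part */
#define CLOCK_EVT_FEAT_ONESHOT  0x000002  /* quoted earlier in this part */
#define TOY_FEAT_DUMMY          0x000010  /* assumed value, illustration only */

/* Hypothetical, reduced stand-in for struct clock_event_device. */
struct toy_clock_event_device {
    unsigned int features;
    int          rating;
};

/* Returns true if newdev should replace cur as the broadcast device.
 * cur may be NULL when no broadcast device is installed yet. */
static bool toy_check_broadcast_device(const struct toy_clock_event_device *cur,
                                       const struct toy_clock_event_device *newdev,
                                       bool oneshot_required)
{
    assert(newdev != NULL);

    /* A dummy device cannot wake anybody up. */
    if (newdev->features & TOY_FEAT_DUMMY)
        return false;

    /* Assumption: if broadcasting must work in one-shot mode,
     * the candidate has to support one-shot events. */
    if (oneshot_required && !(newdev->features & CLOCK_EVT_FEAT_ONESHOT))
        return false;

    /* Otherwise the device with the higher rating wins. */
    return cur == NULL || newdev->rating > cur->rating;
}
```

With this model, a candidate with rating 50 replaces an installed device with rating 40, but a dummy device is rejected regardless of its rating.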
After the tick_check_broadcast_device function, we can see the call of the try_module_get function, which takes a reference on the module that owns the clock events device. We need to do this to be sure that the module which provides the given clock events device cannot be unloaded while the device is in use. The next step is the call of the clockevents_exchange_device function, which is defined in the kernel/time/clockevents.c source code file and releases the old clock events device. After that, the functional handler of the previous device is replaced with a dummy handler.
In the last step of the tick_install_broadcast_device function we check whether tick_broadcast_mask is not empty and, if so, start the given clock events device in periodic mode with the call of the tick_broadcast_start_periodic function:
if (!cpumask_empty(tick_broadcast_mask))
tick_broadcast_start_periodic(dev);
if (dev->features & CLOCK_EVT_FEAT_ONESHOT)
tick_clock_notify();
The tick_broadcast_mask is filled in the tick_device_uses_broadcast function, which checks a clock events device during its registration:
int cpu = smp_processor_id();
int tick_device_uses_broadcast(struct clock_event_device *dev, int cpu)
{
...
...
...
if (!tick_device_is_functional(dev)) {
...
cpumask_set_cpu(cpu, tick_broadcast_mask);
...
}
...
...
...
}
You can read more about the smp_processor_id macro in the fourth part of the Linux kernel initialization process chapter.
The tick_broadcast_start_periodic function checks the given clock event device and calls the tick_setup_periodic function:
static void tick_broadcast_start_periodic(struct clock_event_device *bc)
{
if (bc)
tick_setup_periodic(bc, 1);
}
which is defined in the kernel/time/tick-common.c source code file and sets the broadcast handler for the given clock event device by calling the following function:
tick_set_periodic_handler(dev, broadcast);
This function checks the second parameter, which represents the broadcast state (on or off), and sets the event handler depending on its value:
void tick_set_periodic_handler(struct clock_event_device *dev, int broadcast)
{
if (!broadcast)
dev->event_handler = tick_handle_periodic;
else
dev->event_handler = tick_handle_periodic_broadcast;
}
When a clock event device issues an interrupt, dev->event_handler is called. For example, let's look at the interrupt handler of the high precision event timer, which is located in the arch/x86/kernel/hpet.c source code file:
static irqreturn_t hpet_interrupt_handler(int irq, void *data)
{
struct hpet_dev *dev = (struct hpet_dev *)data;
struct clock_event_device *hevt = &dev->evt;
if (!hevt->event_handler) {
printk(KERN_INFO "Spurious HPET timer interrupt on HPET timer %d\n",
dev->num);
return IRQ_HANDLED;
}
hevt->event_handler(hevt);
return IRQ_HANDLED;
}
The hpet_interrupt_handler gets the irq-specific data and checks the event handler of the clock event device. Recall that we just set it in the tick_set_periodic_handler function. So the tick_handle_periodic_broadcast function will be called at the end of the high precision event timer interrupt handler.
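The pattern we have just seen, where one function installs a handler and the interrupt handler later calls it through a pointer, can be reproduced in a few lines of plain C. The following is a userspace sketch with hypothetical toy_* names; the counters stand in for the real work of the handlers.

```c
#include <assert.h>
#include <stddef.h>

struct toy_event_device;
typedef void (*toy_event_handler_t)(struct toy_event_device *dev);

/* Hypothetical, reduced stand-in for a clock event device. */
struct toy_event_device {
    toy_event_handler_t event_handler;
    int local_ticks;      /* bumped by the plain periodic handler  */
    int broadcast_ticks;  /* bumped by the broadcast handler       */
    int spurious;         /* interrupts that found no handler set  */
};

static void toy_handle_periodic(struct toy_event_device *dev)
{
    dev->local_ticks++;
}

static void toy_handle_periodic_broadcast(struct toy_event_device *dev)
{
    dev->broadcast_ticks++;
}

/* Same decision as tick_set_periodic_handler: pick the handler
 * according to the broadcast flag. */
static void toy_set_periodic_handler(struct toy_event_device *dev, int broadcast)
{
    if (!broadcast)
        dev->event_handler = toy_handle_periodic;
    else
        dev->event_handler = toy_handle_periodic_broadcast;
}

/* Same shape as hpet_interrupt_handler: report a spurious interrupt
 * if no handler is installed, otherwise call the handler. */
static void toy_interrupt(struct toy_event_device *dev)
{
    assert(dev != NULL);

    if (!dev->event_handler) {
        dev->spurious++;
        return;
    }
    dev->event_handler(dev);
}
```

An interrupt that arrives before toy_set_periodic_handler was called is counted as spurious. After toy_set_periodic_handler(dev, 1), every interrupt ends up in the broadcast handler.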
The tick_handle_periodic_broadcast function calls the
bc_local = tick_do_periodic_broadcast();
function, which stores the numbers of processors that have asked to be woken up in the temporary cpumask and calls the tick_do_broadcast function:
cpumask_and(tmpmask, cpu_online_mask, tick_broadcast_mask);
return tick_do_broadcast(tmpmask);
The tick_do_broadcast function calls the broadcast function of the given clock events device, which sends an IPI to the given set of processors. At the end we can call the event handler of the given tick_device:
if (bc_local)
td->evtdev->event_handler(td->evtdev);
which is actually the interrupt handler of the local timer of a processor. After this, the processor wakes up. That is all about the tick broadcast framework in the Linux kernel. We have skipped some aspects of this framework, for example the reprogramming of a clock event device and broadcast with a oneshot timer. But the Linux kernel is very big and it is not realistic to cover all aspects of it. I think it will be interesting for you to dive into it yourself.
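The periodic broadcast path described above boils down to a bitmask intersection followed by a loop over the resulting set. Here is a userspace sketch of that idea with 64-bit integers instead of cpumasks. The toy_* names are hypothetical, the IPI is modelled by a counter, and the special handling of the calling processor is an assumption based on the bc_local value shown above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_CPUS 8  /* hypothetical number of CPUs in this model */

/* How many "IPIs" each CPU has received. */
static int toy_ipi_count[TOY_NR_CPUS];

/* Wake every CPU that is both online and waiting for a broadcast.
 * Returns true if the calling CPU itself is in that set, so that the
 * caller can run its own local handler instead of receiving an IPI. */
static bool toy_do_periodic_broadcast(uint64_t online, uint64_t broadcast,
                                      unsigned int this_cpu)
{
    uint64_t targets = online & broadcast;  /* like cpumask_and() */
    bool local = false;

    assert(this_cpu < TOY_NR_CPUS);

    if (targets & (UINT64_C(1) << this_cpu)) {
        targets &= ~(UINT64_C(1) << this_cpu);
        local = true;
    }

    for (unsigned int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
        if (targets & (UINT64_C(1) << cpu))
            toy_ipi_count[cpu]++;  /* stands in for sending an IPI */

    return local;
}
```

With online processors 0-3 (0x0F) and broadcast requested by processors 1, 2 and 4 (0x16), only processors 1 and 2 are targeted, because processor 4 is not online.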
If you remember, we started this part with the call of the tick_init function. We have just considered the tick_broadcast_init function and the related theory, but the tick_init function contains another function call: tick_nohz_init. Let's look at the implementation of this function.
We already saw some information about the dyntick concept in this part, and we know that this concept allows the kernel to disable system timer interrupts in the idle state. The tick_nohz_init function initializes the different data structures which are related to this concept. This function is defined in the kernel/time/tick-sched.c source code file and starts with a check of the value of the tick_nohz_full_running variable, which represents the state of the tickless mode for the idle state and for the state when system timer interrupts are disabled while a processor has only one runnable task:
if (!tick_nohz_full_running) {
if (tick_nohz_init_all() < 0)
return;
}
If this mode is not running, we call the tick_nohz_init_all function, which is defined in the same source code file, and check its result. The tick_nohz_init_all function tries to allocate the tick_nohz_full_mask with the call of alloc_cpumask_var. The tick_nohz_full_mask stores the numbers of processors that have full NO_HZ enabled. Note that all of this happens only when the CONFIG_NO_HZ_FULL_ALL kernel configuration option is enabled; otherwise the function just returns -1. After successful allocation of the tick_nohz_full_mask, we set all bits in it, set tick_nohz_full_running to true and return the result to the tick_nohz_init function:
static int tick_nohz_init_all(void)
{
int err = -1;
#ifdef CONFIG_NO_HZ_FULL_ALL
if (!alloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL)) {
WARN(1, "NO_HZ: Can't allocate full dynticks cpumask\n");
return err;
}
err = 0;
cpumask_setall(tick_nohz_full_mask);
tick_nohz_full_running = true;
#endif
return err;
}
In the next step we try to allocate memory for the housekeeping_mask:
if (!alloc_cpumask_var(&housekeeping_mask, GFP_KERNEL)) {
WARN(1, "NO_HZ: Can't allocate not-full dynticks cpumask\n");
cpumask_clear(tick_nohz_full_mask);
tick_nohz_full_running = false;
return;
}
This cpumask stores the numbers of processors used for housekeeping; in other words, we need at least one processor that is not in NO_HZ mode, because it does timekeeping and similar work. After this we check the result of the architecture-specific arch_irq_work_has_interrupt function. This function checks whether the given architecture is able to send inter-processor interrupts. We need to check this because the system timer of a processor is disabled in NO_HZ mode, so there must be at least one online processor which can send an inter-processor interrupt to wake up a processor in this mode. For x86_64 this function is defined in the arch/x86/include/asm/irq_work.h header file and just checks whether the processor has an APIC according to CPUID:
static inline bool arch_irq_work_has_interrupt(void)
{
return cpu_has_apic;
}
If a processor does not have an APIC, the Linux kernel prints a warning message, clears the tick_nohz_full_mask cpumask, copies the numbers of all possible processors in the system to the housekeeping_mask and resets the value of the tick_nohz_full_running variable:
if (!arch_irq_work_has_interrupt()) {
pr_warning("NO_HZ: Can't run full dynticks because arch doesn't "
"support irq work self-IPIs\n");
cpumask_clear(tick_nohz_full_mask);
cpumask_copy(housekeeping_mask, cpu_possible_mask);
tick_nohz_full_running = false;
return;
}
After this step, we get the number of the current processor with the call of smp_processor_id and check whether this processor is in the tick_nohz_full_mask. If the tick_nohz_full_mask contains the current processor, we clear the appropriate bit in the tick_nohz_full_mask:
cpu = smp_processor_id();
if (cpumask_test_cpu(cpu, tick_nohz_full_mask)) {
pr_warning("NO_HZ: Clearing %d from nohz_full range for timekeeping\n", cpu);
cpumask_clear_cpu(cpu, tick_nohz_full_mask);
}
We do this because this processor will be used for timekeeping. After this step we put into the housekeeping_mask the numbers of all processors that are in the cpu_possible_mask and not in the tick_nohz_full_mask:
cpumask_andnot(housekeeping_mask,
cpu_possible_mask, tick_nohz_full_mask);
After this operation, the housekeeping_mask contains all processors of the system that are not in full NO_HZ mode, including the processor used for timekeeping. In the last step of the tick_nohz_init function, we go through all processors that are set in the tick_nohz_full_mask and call the following function for each processor:
for_each_cpu(cpu, tick_nohz_full_mask)
context_tracking_cpu_set(cpu);
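The last three steps of tick_nohz_init (removing the timekeeping processor from the full NO_HZ set, computing the housekeeping set, and walking over the full NO_HZ set) can be modelled with ordinary 64-bit integers. This is a sketch under stated assumptions: the toy_nohz_state structure and the toy_nohz_setup function are hypothetical, and a boolean array stands in for the context_tracking.active percpu variable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_CPUS 8  /* hypothetical number of CPUs in this model */

struct toy_nohz_state {
    uint64_t nohz_full;                     /* CPUs that may run without the tick */
    uint64_t housekeeping;                  /* CPUs that keep the tick            */
    bool     tracking_active[TOY_NR_CPUS];  /* stand-in for context_tracking.active */
};

static void toy_nohz_setup(struct toy_nohz_state *st, uint64_t possible,
                           uint64_t requested_full, unsigned int this_cpu)
{
    assert(this_cpu < TOY_NR_CPUS);

    /* The current CPU does timekeeping, so it must keep its tick. */
    st->nohz_full = requested_full & ~(UINT64_C(1) << this_cpu);

    /* Like cpumask_andnot(): possible CPUs that are not full NO_HZ. */
    st->housekeeping = possible & ~st->nohz_full;

    /* Like the for_each_cpu() loop: mark every full NO_HZ CPU. */
    for (unsigned int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
        st->tracking_active[cpu] = (st->nohz_full >> cpu) & 1;
}
```

For four possible processors (0x0F), all of them requested as full NO_HZ, and processor 0 as the current one, the full NO_HZ set becomes 0x0E and the housekeeping set becomes 0x01.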
The context_tracking_cpu_set function is defined in the kernel/context_tracking.c source code file, and the main point of this function is to set the context_tracking.active percpu variable to true. When the active field is set to true for a certain processor, the Linux kernel context tracking subsystem tracks the transitions of this processor between user and kernel mode.
That's all. This is the end of the tick_nohz_init function. After this, the NO_HZ related data structures are initialized. We have not seen the API of the NO_HZ mode yet, but we will see it soon.
This is the end of the third part of the chapter that describes timers and time management in the Linux kernel. In the previous part we got acquainted with the clocksource concept in the Linux kernel, which represents a framework for managing different clock sources in an interrupt and hardware characteristics independent way. In this part we continued to look at the Linux kernel initialization process in the time management context and got acquainted with two concepts that are new for us: the tick broadcast framework and tickless mode. The first concept helps the Linux kernel to deal with processors which are in deep sleep, and the second concept represents the mode in which the kernel may work to improve power management of idle processors.
In the next part we will continue to dive into timer management in the Linux kernel and will see a concept that is new for us: timers.
If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.
Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.