Chapter 6: Physical Page Allocation

Physical pages are managed using the 'binary buddy allocator', originally devised by Knowlton and further described by Knuth (TAocP Volume 1, Section 2.5, pg. 442) and has shown to be extremely fast compared to other allocators (see David G. Korn and Kiem-Phong Bo. In search of a better malloc In Proceedings of the Summer 1985 USENIX Conference, pages 489–506, Portland, OR, 1985.)
This is an allocation scheme that combines a power-or-two allocator with free buffer coalescing.
The concept behind a binary buddy allocator is quite simple:

Memory is broken up into large blocks of pages where each block is a power of 2 number of pages.
If a block of the desired size is not available a large block is broken up in half and the two blocks are considered 'buddies' to each other - one half is used for the allocation, the other is free.
The blocks are continuously halved as necessary until a block of the desired size is available.
When a block is freed the buddy is examined, and the two are coalesced if it is free.

6.1 Managing Free Blocks

The allocator maintains blocks of free pages where each block is a power-of-two number of pages.
The exponent of the power of 2 is referred to as its order.
An array of free_area_t structs are maintained for each order each of which point to a linked list of blocks of pages that are free, e.g.:

order    free_area_t             Free page blocks
       zone->free_area
     |---------------|-|
|    |       0       | | -> [] -> [] -> ...       2^0 page-sized blocks
|    |---------------|-|
|    |       1       | |
|    |---------------|-|
|    |       2       | |
|    |---------------|-|
|    |       3       | |
|    |---------------|-|
|    |       4       | | -> [==] -> [==] -> ...   2^4 page-sized blocks
|    |---------------|-|
|    |       5       | |
|    |---------------|-|
|    |       6       | |
|    |---------------|-|
|    |       7       | |
|    |---------------|-|
|    |       8       | |
|    |---------------|-|
|    |       9       | | -> [==========] -> ...   2^(MAX_ORDER - 1) page-sized blocks
|    |---------------|-|
|    |   MAX_ORDER   | | (exclusive bound)
v    |---------------|-|

MAX_ORDER is currently set at 10 unless otherwise specified by CONFIG_FORCE_MAX_ZONEORDER. This eliminates the chance that a larger block will be split to satisfy a request where a smaller block would suffice.
The page blocks are maintained on a linear linked list using struct page->list.
Each zone has a free_area_t array field of free_area[MAX_ORDER].
Looking at free_area_t itself:

typedef struct free_area_struct {
        struct list_head        free_list;
        unsigned long           *map;
} free_area_t;

free_list is a linked list of free page blocks, map is a bitmap representing the state of a pair of buddies.
Linux saves memory by using one bit instead of two to represent each pair of buddies - each time a buddy is allocated or freed, the bit representing the pair of buddies is toggled so that the bit is 0 if the pair of pages are both free or both full and 1 if only one buddy is in use.
To toggle the correct bit, the macro MARK_USED() is used:

#define MARK_USED(index, order, area) \
        __change_bit((index) >> (1+(order)), (area)->map)

index is the index of the page within the global mem_map array.
The correct bit to flip is determined by shifting index right by 1 + order. LS - Need to understand this better.
__change_bit() ultimately uses the btc 'bit test and complement' op-code.

6.2 Allocating Pages

Let's have a look at the Linux API for the allocation of page frames:

alloc_page() - Allocats a single page and returns a struct page.
alloc_pages() - Allocates 2^order pages and returns a struct page.
get_free_page() - Allocates a single page, zeroes it and returns a virtual address (deprecated name, macro wrapper for get_zeroed_page().)
__get_free_page() - Allocates a single page and returns a virtual address.
__get_free_pages() - Allocates 2^order pages and returns a virtual address.
__get_dma_pages() - Allocates 2^order pages from ZONE_DMA and returns a struct page.

Each of these functions take an integer gfp_mask parameter which is a set of flags that determine how the allocator will behave (discussed further in 6.4.)
They all ultimately use the core function __alloc_pages(), the reason these are provided is so the correct node and zone will be used. Different users will require different zones - some device drivers will need ZONE_DMA, disk buffers will need ZONE_NORMAL and callers shouldn't have to be aware of which node is being used.
Allocations are always for a specified order - 0 when only a single page is required.
If a free block can't be found of the requested order, a higher order block is split into two buddies - one is allocated and the other is placed on the free list for the lower order.
For example, a process requesting a 2^2 block when one is not available, nor 2^3, but a 2^4 block is:

order    free_area_t             Free page blocks
       zone->free_area
     |---------------|-|
|    |       0       | |            Requesting Process
|    |---------------|-|                    ^
|    |       1       | |                    |
|    |---------------|-|                    |
|    |       2       | | <---------------\  |
|    |---------------|-|                 |  |
|    |       3       | | <----------\  [==|==]
|    |---------------|-|            |     ^
|    |       4       | | ---\       |     |
|    |---------------|-|    |    [=====|=====]
|    |       5       | |    |          ^
|    |---------------|-|    |          |
|    |       6       | |    \--->[===========]
|    |---------------|-|
|    |       7       | |
|    |---------------|-|
|    |       8       | |
|    |---------------|-|
|    |       9       | |
|    |---------------|-|
|    |   MAX_ORDER   | |
v    |---------------|-|

When a block is later freed, the buddy will be checked. If both are free, they are merged to form a higher order block and placed on the higher free list where its buddy is checked and so on.
If the buddy is not free, the freed block will be placed on the free list at the current order.
During these operations interrupts have to be disabled to prevent an interrupt handler manipulating the lists while a process has them in an inconsistent state - this is achieved by using an interrupt safe spinlock.
The next decision that needs to be made is which memory node or pg_data_t to use. Linux uses a node-local allocation policy - it aims to use the memory bank associated with the CPU running the page-allocating process.
The function _alloc_pages() differs depending on the system type - UMA _alloc_pages() and NUMA _alloc_pages().
Regardless of what API is used, __alloc_pages() is the heart of the allocator (note this function is never called directly.), it:

Examines the selected zone and checks whether it is suitable to allocate from based on the number of available pages.
If the zone is not available, the allocator may fall back to other zones. The order of zones to fall back on is decided at boot time by build_zonelists(), though generally ZONE_HIGHMEM falls back to ZONE_NORMAL which in turn will fall back to ZONE_DMA.
If the number of free pages reaches the zone's pages_low watermark, it will wake kswapd to begin freeing up pages from zones, and if memory is extremely tight, it will do the work of kswapd itself.
After the zone has been decided on, the function rmqueue() is called to allocate the block of pages or split higher level blocks if one of the appropriate size is not available.

6.3 Free Pages

The API for freeing of pages is a lot simpler and exists to help remember the order of the block to free - this is one disadvantage of a buddy allocator - the caller has to remember the size of hte original allocation.
Let's take a look at the API:

__free_pages() - Given a struct page, Frees 2^order pages from the given page.
__free_page() - Given a struct page, frees it.
free_page() - Given a virtual address, frees a page from it.

The principle function is __free_pages_ok(), which should not be called directly, rather __free_pages() performs simple checks first.
When a buddy is freed linux tries to coalesce the buddies together immediately, if possible. This is not optimal, because in the worst-case scenario there will be many coalitions followed by the immediate splitting of the same blocks.
To detect whether the buddies can be merged, linux checks the bit corresponding to the affected pair of buddies in free_area->map.
Recall that we track buddy status using a single bit - 0 if both are set or free, 1 if one is set and the other not. Since we just freed a buddy, we know that a 0 implies both are free, so we can go ahead and merge.
We can calculate the address of the buddy relatively easily - since the allocations are always in blocks of size 2^k, the address of the block or at least its offset within the zone_mem_map will also be a power of 2^k - this means that the binary representation of this address will always have at least k zeros to the right of the address.
The buddy will have the kth bit flipped. Here imask provides a mask for just that bit:

mask = (~0 << k)
imask = 1+~mask = -mask

After the buddy is merged, it is removed from the free list and the newly coalesced pair moves to the next higher order to see if it may also be merged with its buddy there.
Quoting from Knuth (TAocP Volume 1, Section 2.5, pg. 442):

The key fact underlying the practical usefulness of this method is that if we know the address of a block (the memory location of its first word), and if we also know the size of that block, we know the address of its buddy. For example, the buddy of the block of size 16 beginning in binary location 101110010110000 is a block starting in binary location 101110010100000. To see why this must be true, we first observe that as the algorithm proceeds, the address of a block of size 2^k is a multiple of 2^k. In other words, the address in binary notation has at least k zeros at the right. This observation is easily justified by induction: If it is true for all blocks of size 2^(k+1), it is certainly true when such a block is halved. Therefore a block of size, say, 32 has an address of the form xx. . . x00000 (where the x’s represent either 0 or 1); if it is split, the newly formed buddy blocks have the addresses xx. . . x00000 and xx. . . x10000.

Considering an example of order 2 allocations - addresses will look like 0101100, 0111100, 1000000, etc. - we are allocating 2^2 pages at a time. If we need to split an address, the first of the two buddies will be the existing address, the second half way through, i.e. address + 2^1, so 0111100 will split to 0111100 and 0111110 - and the condition holds true.

6.4 Get Free Page (GFP) Flags

A concept that is maintained throughout the VM is the Get Free Page (GFP) flags. These determine how the allocator and kswapd will behave for the allocation and freeing of pages.
For example, an interrupt handler is not allowed to sleep, so it will not set __GFP_WAIT which indicates the caller may sleep.
There are 3 sets of GFP flags all set in include/linux/mm.h. The first set are zone modifiers:

__GFP_DMA - Allocate from ZONE_DMA if possible.
__GFP_HIGHMEM - Allocate from ZONE_HIGHMEM if possible.
GFP_DMA - Alias for __GFP_DMA.

There isn't a zone modifier for ZONE_NORMAL, as this is the default (the zone modifier flag is used as an offset within an array and ZONE_NORMAL is offset 0.)
The second set are action modifiers - they change the behaviour of the VM and what the calling process may do:

__GFP_WAIT - The caller is not high priority and can sleep or reschedule.
__GFP_HIGH - Used by a high priority or kernel process (LS - the book says it's unused, but I see it in the 4.5 kernel so I'm not going to mark it as unused here in case 2.4.22 uses it too.)
__GFP_IO - The caller can perform low-level I/O. The main effect in 2.4.22 is to determine whether try_to_free_buffers() can flush buffers. It is also used by at least one journaled filesystem.
__GFP_HIGHIO - Determines that I/O can be performed on pages mapped into high memory. Only used in try_to_free_buffers().
__GFP_FS - The caller can make calls to the filesystem layer - this is provided when the caller is filesystem-related (e.g. the buffer cache), and wants to avoid recursively calling itself.

These low-level flags on their own are too primitive to be easily used as-is, and it is often difficult to know the appropriate combination of these to use, which brings us to the third set - high-level combinations of the low-level action modifiers. These are more digestible forms of the primitives:

#define GFP_NOHIGHIO    (__GFP_HIGH | __GFP_WAIT | __GFP_IO)
#define GFP_NOIO        (__GFP_HIGH | __GFP_WAIT)
#define GFP_NOFS        (__GFP_HIGH | __GFP_WAIT | __GFP_IO | __GFP_HIGHIO)
#define GFP_ATOMIC      (__GFP_HIGH)
#define GFP_USER        (             __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS)
#define GFP_HIGHUSER    (             __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS | __GFP_HIGHMEM)
#define GFP_KERNEL      (__GFP_HIGH | __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS)
#define GFP_NFS         (__GFP_HIGH | __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS)
#define GFP_KSWAPD      (             __GFP_WAIT | __GFP_IO | __GFP_HIGHIO | __GFP_FS)

Considering each of these:

GFP_NOHIGHIO - This is only used in one place - alloc_bounce_page() - during the creation of a bounce buffer for I/O in high memory.
GFP_NOIO - This is used by callers who are already performing I/O, for example when the loopback device is trying to get a page for a buffer head, it uses this flag to make sure it doesn't introduce more I/O. In fact, it seems like the flag was introduced specifically to avoid a deadlock in the loopback device :)
GFP_NOFS - This issued only by the buffer cache and filesystems to make sure sure they do not recursively call themselves by accident.
GFP_ATOMIC - Used whenever the caller cannot sleep and must be serviced if at all possible. Any interrupt handler that requires memory must use this flag to avoid sleeping or performing I/O. Many subsystems use this system during init, e.g. buffer_init() or inode_init().
GFP_USER - Defunct - this is a flag of historic interest - in the 2.2.x series memory was assigned low, medium, or high priority - if memory was tight, a request with GFP_USER (low priority) would fail whereas others would keep trying. Now it has no impact and is treated no differently than GFP_KERNEL.
GFP_HIGHUSER - The allocator should allocate from ZONE_HIGHMEM if possible - used when the page is allocated on behalf of a user process.
GFP_KERNEL - This is the most liberal of the combined flags - the caller is free to do whatever they want.
GFP_NFS - Defunct - in the 2.0.x series this flag would determine what the reserved page size was - usually this was 20, if this flag was set it'd be 5. Now this flag is identical to GFP_KERNEL.
GFP_KSWAPD - Defunct - Effectively the same as GFP_KERNEL, modulo __GFP_HIGH.

6.5 Process Flags

A process may also set flags in the struct task_struct which affect allocator behaviour.
Considering the flags that affect the VM:

PF_MEMALLOC - Set by OOM Killer - flags the process as a memory allocator. This is set by kswapd and is set for any process that is about to be killed by the OOM killer which is discussed in detail in chapter 13. It indicates to the buddy allocator to ignore zone watermarks and assign pages if at all possible.
PF_MEMDIE - Set by OOM Killer - effectively functions the same as PF_MEMALLOC.
PF_FREE_PAGES - This is set when the buddy allocator calls try_to_free_pages() itself to indicate that free pages should be reserved for the calling process in __free_pages_ok() instead of returning them to the free lists.

6.6 Avoiding Fragmentation

Any allocator needs to be careful to avoid both internal and external fragmentation.
External fragmentation is the ability to service a request because the available memory exists only in small blocks.
Internal fragmentation is where space is wasted when a large block had to be assigned to service a small request.
In Linux, external fragmentation is not a serious problem because large requests for contiguous pages are rare and usually vmalloc() suffices to serve the request (see chapter 7 for more details on this) - the list of free blocks ensures that large blocks do not have to be split unnecessarily.
Internal fragmentation is the biggest failing of the binary buddy system. Although it is expected to be in the region of 28% (LS - what does this measure actually mean?), it has been shown it can be in the region of 60%, in comparison to just 1% with the first-fit allocator, and additionally it's been shown that variations of the buddy system does not help the situation significantly.
Linux addresses this by using a 'slab allocator' to carve pages into small blocks of memory for allocation. This is discussed in further detail in chapter 8.
With the combination of the binary buddy allocator and the slab allocator, the kernel can keep the amount of wasted memory due to internal fragmentation to a minimum.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

6.md

6.md

Chapter 6: Physical Page Allocation

6.1 Managing Free Blocks

6.2 Allocating Pages

6.3 Free Pages

6.4 Get Free Page (GFP) Flags

6.5 Process Flags

6.6 Avoiding Fragmentation

Files

6.md

Latest commit

History

6.md

File metadata and controls

Chapter 6: Physical Page Allocation

6.1 Managing Free Blocks

6.2 Allocating Pages

6.3 Free Pages

6.4 Get Free Page (GFP) Flags

6.5 Process Flags

6.6 Avoiding Fragmentation