Travis CI User edited this page Mar 16, 2019 · 41 revisions

Memory Allocators

Memory, memory everywhere, but not an allocation to be made - a really fragmented heap

Introduction

Memory allocation is very important! Allocating and de-allocating heap memory is one of the most common operations in any application. At the system level, the heap is a contiguous series of addresses that the program can expand or contract and use as it sees fit. In POSIX, the end of this region is called the system break, and we use sbrk to move it. Most programs don’t interact with this call directly; they use a memory allocation system around it that handles chunking up memory and keeping track of which memory is allocated and which is freed.

We will mainly be looking at how the C standard library does memory allocation and the C API for it. Just know that there are other ways of dividing up memory, such as mmap, and other allocation schemes and implementations, such as jemalloc.

C Memory Allocation Functions

  • malloc(size_t bytes) is a C library call used to reserve a contiguous block of memory. Unlike stack memory, the memory remains allocated until free is called with the same pointer. If malloc cannot reserve any more memory then it returns NULL. Robust programs should check the return value: if your code assumes malloc succeeds and it does not, then your program will likely crash (segfault) when it tries to write to address 0. Also, malloc does not guarantee zeroed memory, for performance reasons – check your code to make sure that you are not using uninitialized values.
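A minimal sketch of both points together (xmalloc_zeroed is an invented helper name, not a standard function): check the return value, and zero the block yourself if you need zeroed memory.

```c
#include <stdlib.h>
#include <string.h>

// Sketch: a wrapper that checks malloc's result and zeroes the block,
// since malloc itself guarantees neither.
void *xmalloc_zeroed(size_t bytes) {
    void *p = malloc(bytes);
    if (p == NULL)
        return NULL;          // out of memory: let the caller decide
    memset(p, 0, bytes);      // don't rely on uninitialized contents
    return p;
}
```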

  • realloc(void *space, size_t bytes) allows you to resize an existing memory allocation that was previously allocated on the heap (via malloc, calloc, or realloc). The most common use of realloc is to resize memory used to hold an array of values. There are two gotchas with realloc. One, a new pointer may be returned. Two, it can fail. A naive but readable version of realloc is suggested below with sample usage.

    void * realloc(void * ptr, size_t newsize) {
      // Simple implementation always reserves more memory
      // and has no error checking
      void *result = malloc(newsize); 
      size_t oldsize =  ... //(depends on allocator's internal data structure)
      if (ptr) memcpy(result, ptr, newsize < oldsize ? newsize : oldsize);
      free(ptr);
      return result;
    }
    
    int main() {
      int *array = malloc(sizeof(int) * 2);
      array[0] = 10; array[1] = 20;
      // Oops, need a bigger array - so use realloc..
      array = realloc(array, 3 * sizeof(int));
      array[2] = 30;
      free(array);
      return 0;
    }

    The above code is not robust. If realloc fails, then the original array is leaked. Robust code checks the return value and only reassigns the original pointer if it is not NULL.
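    The safe pattern can be sketched as follows (grow_int_array is an invented helper name): use a temporary pointer so that the original allocation survives a failed resize.

```c
#include <stdlib.h>

// Sketch: resize an int array safely. Returns the (possibly moved)
// array on success; frees the old block and returns NULL on failure.
int *grow_int_array(int *array, size_t new_count) {
    int *tmp = realloc(array, new_count * sizeof(int));
    if (tmp == NULL) {
        free(array);      // the original block is still valid; release it
        return NULL;
    }
    return tmp;           // only use the new pointer on success
}
```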

  • calloc(size_t nmemb, size_t size) initializes the memory contents to zero and takes two arguments: the number of items and the size in bytes of each item. Unlike the naive implementation below, a correct calloc must also check that nmemb * size does not overflow. Programmers often use calloc rather than explicitly calling memset after malloc to set the memory contents to zero. Note that calloc(x,y) is identical to calloc(y,x), but you should follow the conventions of the manual. A naive implementation of calloc is below.

    void *calloc(size_t n, size_t size) {
      size_t total = n * size; // Does not check for overflow!
      void *result = malloc(total);

      if (!result) return NULL;

      // If these are new memory pages just allocated from the system
      // by calling sbrk, they will already be zero, so zeroing them
      // again is unnecessary - but a naive version does it anyway.
      return memset(result, 0, total);
    }
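    The missing overflow guard can be sketched like this (checked_calloc is an invented name): if n * size would wrap around, a conforming calloc must fail rather than silently under-allocate.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Sketch: calloc with the overflow check the naive version omits.
void *checked_calloc(size_t n, size_t size) {
    if (size != 0 && n > SIZE_MAX / size)
        return NULL;                 // n * size would overflow size_t
    void *result = malloc(n * size);
    if (result) memset(result, 0, n * size);
    return result;
}
```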
  • free takes a pointer to the start of a piece of memory and makes it available for re-use by subsequent calls to the other allocation functions. This is important because we don’t want our process to consume an ever-growing amount of memory. Once we are done using memory, we release it with free. A simple usage is below.

    int *ptr = malloc(sizeof(*ptr));
    do_something(ptr);
    free(ptr);

    Using a piece of memory after it is freed is undefined behavior (a use-after-free bug).

Heaps and sbrk

The heap is part of the process memory and it does not have a fixed size. Heap memory allocation is performed by the C library when you call malloc (calloc, realloc) and free. By calling sbrk the C library can increase the size of the heap as your program demands more heap memory. As the heap and stack (one for each thread) need to grow, we put them at opposite ends of the address space. So for typical architectures the heap will grow upwards and the stack grows downwards.

In truth: modern operating system memory allocators no longer need sbrk - instead they can request independent regions of virtual memory and maintain multiple memory regions. For example, gibibyte-sized requests may be placed in a different memory region than small allocation requests. However, this detail is unwanted complexity: the problems of fragmentation and allocating memory efficiently still apply, so we will ignore this implementation nicety here and write as if the heap is a single region. If we write a multi-threaded program (more about that later) we will need multiple stacks (one per thread), but there’s only ever one heap.

Programs don’t typically need to call brk or sbrk (though calling sbrk(0) can be interesting because it tells you where your heap currently ends). Instead, programs use malloc, calloc, realloc, and free, which are part of the C library. The internal implementation of these functions may call sbrk when additional heap memory is required.

void *top_of_heap = sbrk(0);
malloc(16384);
void *top_of_heap2 = sbrk(0);
printf("The top of heap went from %p to %p \n", top_of_heap, top_of_heap2);
// Example output: The top of heap went from 0x4000 to 0xa000

An important fact we glossed over earlier: memory newly obtained from the operating system must be zeroed out. If the operating system did not zero out the contents of physical RAM, one process could learn about the memory of another process that had previously used it. This would be a security leak. Unfortunately, this means that for malloc requests made before any memory has been freed - and for simple programs, which end up using memory newly reserved from the system - the memory is often zero. Programmers then mistakenly write C programs that assume malloc’d memory will always be zero.

char *ptr = malloc(300);
// Contents are probably zero because we got brand new memory,
// so beginner programs appear to work!
// strcpy(ptr, "Some data"); // work with the data
free(ptr);
// later
char *ptr2 = malloc(300); // Contents might now contain earlier data and are probably not zero

Intro to Allocating

Let’s try to write malloc. Here is a first attempt – the naive version.

void *malloc(size_t size) {
// Ask the system for more bytes by extending the heap space.
// sbrk returns (void *) -1 on failure
   void *p = sbrk(size);
   if (p == (void *) -1) return NULL; // No space left
   return p;
}
void free(void *ptr) { /* Do nothing */ }

Above is the simplest implementation of malloc; there are a few drawbacks though.

  • System calls are slow (compared to library calls). We should reserve a large amount of memory and only occasionally ask for more from the system.

  • No reuse of freed memory. Our program never re-uses heap memory - it just keeps asking for a bigger heap.

If this allocator were used in a typical program, the process would quickly exhaust all available memory. Instead, we need an allocator that can efficiently use heap space and only ask for more memory when necessary. That said, there are programs that can use this type of allocator: consider a video game allocating objects to load the next scene. It is considerably faster to do the above and throw the entire block of memory away afterwards than it is to apply the placement strategies that follow.

Placement Strategies

During program execution, memory is allocated and de-allocated (freed), so there will be gaps (holes) in the heap memory that can be re-used for future memory requests. The memory allocator needs to keep track of which parts of the heap are currently allocated and which parts are available. Suppose our current heap size is 64KiB, though not all of it is in use because some earlier malloc’d memory has already been freed by the program:

16KiB   10KiB       1KiB   1KiB        30KiB   4KiB        2KiB
free    allocated   free   allocated   free    allocated   free

If a new malloc request for 2KiB is executed (malloc(2048)), where should malloc reserve the memory? It could use the last 2KiB hole, which happens to be the perfect size! Or it could split one of the other two free holes. These choices represent different placement strategies. Whichever hole is chosen, the allocator will need to split the hole into two: the newly allocated space (which will be returned to the program) and a smaller hole (if there is spare space left over). A best-fit (perfect-fit) strategy finds the smallest hole that is of sufficient size (at least 2KiB):

16KiB   10KiB       1KiB   1KiB        30KiB   4KiB        2KiB
free    allocated   free   allocated   free    allocated   HERE!

A worst-fit strategy finds the largest hole that is of sufficient size (so break the 30KiB hole into two):

16KiB   10KiB       1KiB   1KiB        2KiB    28KiB   4KiB        2KiB
free    allocated   free   allocated   HERE!   free    allocated   free

A first-fit strategy finds the first available hole that is of sufficient size (break the 16KiB hole into two):

2KiB    14KiB   10KiB       1KiB   1KiB        30KiB   4KiB        2KiB
HERE!   free    allocated   free   allocated   free    allocated   free

To introduce another concept: external fragmentation means that even though we have enough free memory in the heap overall, it may be divided up in a way that prevents us from handing out the full amount. In the example below, of the 64KiB of heap memory, 15KiB is allocated and 49KiB is free. However, the largest available block is only 30KiB because our available unallocated heap memory is fragmented into smaller pieces.

16KiB   10KiB       1KiB   1KiB        30KiB   4KiB        2KiB
free    allocated   free   allocated   free    allocated   free

Placement Strategy pros and cons

The challenges of writing a heap allocator are

  • Need to minimize fragmentation (i.e. maximize memory utilization)

  • Need high performance

  • Fiddly implementation (lots of pointer manipulation using linked lists and pointer arithmetic)

  • Both fragmentation and performance depend on the application’s allocation profile, which can be evaluated but not predicted. In practice, under specific usage conditions, a special-purpose allocator can often out-perform a general-purpose implementation.

  • The allocator doesn’t know the program’s memory allocation requests in advance. Even if we knew them, this is the Knapsack problem, which is known to be NP-hard!

Different strategies affect the fragmentation of heap memory in non-obvious ways, which are only discovered by mathematical analysis or careful simulations under real-world conditions (for example, simulating the memory allocation requests of a database or webserver). For example, best-fit at first glance appears to be an excellent strategy. However, if we cannot find a perfectly-sized hole, then this placement creates many tiny unusable holes, leading to high fragmentation. It also requires a scan of all possible holes. Worst-fit targets the largest unallocated space, so it is a poor choice if large allocations will be required. Worst-fit also has to scan the entire heap.

First-fit has the advantage that it does not evaluate all possible placements and is therefore faster. In practice, first-fit and next-fit are common placement strategies. Hybrid approaches and many other alternatives exist.

Memory Allocator Tutorial

A memory allocator needs to keep track of which bytes are currently allocated and which are available for use. This page introduces the implementation and conceptual details of building an allocator - that is, the actual code that implements malloc and free.

Conceptually we are thinking about creating linked lists and lists of blocks! We are writing integers and pointers into memory that we already control, so we can later consistently hop from one address to the next. This internal information represents some overhead. Even if we had requested 1024 KiB of contiguous memory from the system, we will not be able to provide all of it to the running program.

We can think of our heap memory as a list of blocks where each block is either allocated or unallocated. Rather than storing an explicit list of pointers we store information about the block’s size as part of the block. Thus there is conceptually a list of free blocks, but it is implicit in the form of block size information that we store as part of each block.

We can navigate from one block to the next just by adding the block’s size. For example, if you have a pointer p that points to the start of a block, then the next block will be at ((char *)p) + *(size_t *)p, if you are storing the size of the blocks in bytes. The cast to char * ensures that pointer arithmetic is calculated in bytes. The cast to size_t * ensures the memory at p is read as a size value, and would be necessary if p were a void * or char * type.
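A minimal sketch of this hop, assuming each block begins with a size_t size tag (next_block is an invented helper name):

```c
#include <stddef.h>

// Sketch: advance from one block to the next in an implicit list,
// assuming the first size_t of each block holds its total size in bytes.
void *next_block(void *p) {
    size_t block_size = *(size_t *)p;   // read the size tag at the block start
    return (char *)p + block_size;      // pointer arithmetic in bytes
}
```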

The calling program never sees these values; they are internal to the implementation of the memory allocator. As an example, suppose your allocator is asked to reserve 80 bytes (malloc(80)) and requires 8 bytes of internal header data. The allocator would need to find an unallocated space of at least 88 bytes. After updating the heap data it would return a pointer to the block. However, the returned pointer does not point to the start of the block, because that’s where the internal size data is stored! Instead, we would return the start of the block + 8 bytes. In the implementation, remember that pointer arithmetic depends on type. For example, p += 8 adds 8 * sizeof(*p) to the address, not necessarily 8 bytes!
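The header offset can be sketched as follows (payload_of and block_of are invented names; the single size_t header is an assumption for illustration):

```c
#include <stddef.h>

// Sketch, assuming each block begins with one size_t of header data:
// the pointer handed to the caller skips past the header.
void *payload_of(void *block) {
    return (size_t *)block + 1;   // +1 advances by sizeof(size_t) bytes
}

// And free must step back over the header to find the real block start.
void *block_of(void *payload) {
    return (size_t *)payload - 1;
}
```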

Implementing malloc

The simplest implementation uses first fit: Start at the first block, assuming it exists, and iterate until a block that represents unallocated space of sufficient size is found, or we’ve checked all the blocks. If no suitable block is found, it’s time to call sbrk() again to sufficiently extend the size of the heap. A fast implementation might extend it a significant amount so that we will not need to request more heap memory in the near future.

When a free block is found, it may be larger than the space we need. If so, we will create two entries in our implicit list: the first entry is the allocated block, the second entry is the remaining space. There are two simple ways to note whether a block is in use or available. The first is to store it as a byte in the header information along with the size; the second is to encode it as the lowest bit of the size! Block size information would then be limited to even values:

// Assumes p is a reasonable pointer type, e.g. 'size_t *'.
isallocated = (*p) & 1;
realsize = (*p) & ~1;  // mask out the lowest bit
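Putting the pieces together, a first-fit search over the implicit list can be sketched like this (find_first_fit is an invented name; the assumptions are a size_t tag at the start of each block holding its total size in bytes, with the lowest bit set when allocated):

```c
#include <stddef.h>

#define HEADER sizeof(size_t)

// Sketch: scan the implicit list for the first free block big enough
// to hold the request plus its header. Returns NULL if none fits.
static void *find_first_fit(void *heap_start, void *heap_end, size_t needed) {
    char *p = (char *)heap_start;
    while (p < (char *)heap_end) {
        size_t tag = *(size_t *)p;
        size_t block_size = tag & ~(size_t)1;  // mask out the allocated bit
        int allocated = tag & 1;
        if (!allocated && block_size >= needed + HEADER)
            return p;                          // found a sufficient free block
        p += block_size;                       // hop to the next block
    }
    return NULL;                               // no fit: time to call sbrk again
}
```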

Alignment and rounding up considerations

Many architectures expect multi-byte primitives to be aligned to some power of two (2, 4, 8, 16, etc.). For example, it’s common to require 4-byte types to be aligned to 4-byte boundaries and 8-byte types to 8-byte boundaries. If multi-byte primitives are not stored on a reasonable boundary (for example, starting at an odd address), then performance can be significantly impacted because the access may require two memory read requests instead of one. On some architectures the penalty is even greater: the program will crash with a bus error. Most of you experienced this in your architecture classes if there was no memory protection.

As malloc does not know how the user will use the allocated memory, the pointer returned to the program needs to be aligned for the worst case, which is architecture dependent.

From the glibc documentation, the glibc malloc uses the following heuristic:

“The block that malloc gives you is guaranteed to be aligned so that it can hold any type of data. On GNU systems, the address is always a multiple of eight on most systems, and a multiple of 16 on 64-bit systems.” So, for example, if you need to calculate how many 16-byte units are required, don’t forget to round up.

int s = (requested_bytes + tag_overhead_bytes + 15) / 16;

The additional constant ensures incomplete units are rounded up. Note, real code is more likely to use symbolic sizes, e.g. sizeof(x) - 1, rather than hard-coding the numerical constant 15.
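A worked instance of the formula, assuming an 8-byte tag and 16-byte units (units_needed is an invented helper): requesting 80 bytes needs (80 + 8 + 15) / 16 = 6 sixteen-byte units, i.e. 96 bytes.

```c
#include <stddef.h>

// Sketch: how many 16-byte units a request occupies, tag included.
// Integer division truncates, so the +15 rounds incomplete units up.
size_t units_needed(size_t requested_bytes, size_t tag_overhead_bytes) {
    return (requested_bytes + tag_overhead_bytes + 15) / 16;
}
```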

Here’s a great article on memory alignment, if you are further interested

Another side effect is internal fragmentation, which happens when the block you give the program is larger than its allocation size. Say we have a free block of size 16B (not including metadata). If the program allocates 7 bytes, we may want to round up to 16B and just return the entire block. This gets sinister when you implement coalescing and splitting (next section). If you implement neither, then you may end up returning a block of size 64B for a 7B allocation! That allocation carries a lot of overhead, which is what we are trying to avoid.

Implementing free

When free is called we need to re-apply the offset to get back to the ‘real’ start of the block – to where we stored the size information. A naive implementation would simply mark the block as unused. If we are storing the block allocation status in the lowest size bit, then we just need to clear the bit:

*p = (*p) & ~1; // Clear lowest bit 

However, we have a bit more work to do: if the current block and the next block (if it exists) are both free, we need to coalesce them into a single block. Similarly, we need to check the previous block: if it exists and represents unallocated memory, then we need to coalesce the blocks into a single large block.

To coalesce a free block with a previous free block, we also need to find the previous block, so we store the block’s size at the end of the block, too. These are called “boundary tags” (ref Knuth73) - Knuth’s solution to coalescing in both directions. As the blocks are contiguous, the end of one block sits right next to the start of the next block, so the current block (apart from the first one) can look a few bytes further back to look up the size of the previous block. With this information you can now jump backwards!
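The backwards jump can be sketched as follows (previous_block is an invented name; the assumption is that each block stores its size in a size_t footer as its last word):

```c
#include <stddef.h>

// Sketch: with boundary tags, the word just before the current block's
// header is the previous block's footer, i.e. the previous block's size.
void *previous_block(void *p) {
    size_t prev_size = *((size_t *)p - 1);  // read the neighboring footer
    return (char *)p - prev_size;           // jump backwards by that many bytes
}
```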

Performance

With the above description it’s possible to build a memory allocator. Its main advantage is simplicity - at least, simplicity compared to other allocators! Allocating memory is a worst-case linear-time operation (search the linked list for a sufficiently large free block). De-allocation is constant time (no more than three blocks will need to be coalesced into a single block). Using this allocator it is possible to experiment with different placement strategies. For example, you could start searching from where you last freed a block, or where you last allocated from. If you do store pointers to blocks, you need to be very careful that they always remain valid, particularly when you malloc, free, calloc, realloc, coalesce, split, etc.

Explicit Free Lists Allocators

Better performance can be achieved by implementing an explicit doubly-linked list of free nodes. In that case, we can immediately traverse to the next free block and the previous free block. This can halve the search time, because the linked list only includes unallocated blocks. A second advantage is that we now have some control over the ordering of the linked list. For example, when a block is free’d, we could choose to insert it into the beginning of the linked list rather than always between its neighbors. This is discussed below.

Where do we store the pointers of our linked list? A simple trick is to realize that the block itself is not being used and store the next and previous pointers as part of the block (though now you have to ensure that the free blocks are always sufficiently large to hold two pointers). We still need to implement Boundary Tags, so we can correctly free blocks and coalesce them with their two neighbors. Consequently, explicit free lists require more code and complexity. With explicit linked lists a fast and simple ‘Find-First’ algorithm is used to find the first sufficiently large link. However, since the link order can be modified, this corresponds to different placement strategies. For example if the links are maintained from largest to smallest, then this produces a ‘Worst-Fit’ placement strategy.
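A sketch of the trick, assuming free blocks are always large enough to hold two pointers (free_node and push_front are invented names):

```c
#include <stddef.h>

// Sketch: the next/prev pointers live inside the free block's unused
// payload, so the explicit free list costs no extra memory.
typedef struct free_node {
    struct free_node *prev;
    struct free_node *next;
} free_node;

// LIFO insertion: push a newly freed block onto the head of the list.
free_node *push_front(free_node *head, free_node *block) {
    block->prev = NULL;
    block->next = head;
    if (head) head->prev = block;
    return block;   // the block becomes the new head
}
```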

Explicit linked list insertion policy

The newly free’d block can be inserted easily into two possible positions: at the beginning of the list, or in address order (by using the boundary tags to first find the neighbors). Inserting at the beginning creates a LIFO (last-in, first-out) policy: the most recently free’d spaces will be reused. Studies suggest fragmentation is worse than with address order.

Inserting in address order (“address-ordered policy”) inserts free’d blocks so that the blocks are visited in increasing address order. This policy requires more time to free a block, because the boundary tags (size data) must be used to find the next and previous unallocated blocks. However, there is less fragmentation.

Case study: Buddy Allocator (an example of a segregated list)

A segregated allocator is one that divides the heap into different areas that are handled by different sub-allocators, depending on the size of the allocation request. Sizes are grouped into classes (e.g. powers of two), each size class is handled by a different sub-allocator, and each maintains its own free list.

A well known allocator of this type is the buddy allocator (Rangan, Raman, and Ramanujam 1999 P. 85). We’ll discuss the binary buddy allocator which splits allocation into blocks of size 2^n (n = 1, 2, 3, …) times some base unit number of bytes, but others also exist (e.g. Fibonacci split). The basic concept is simple: If there are no free blocks of size 2^n, go to the next level and steal that block and split it into two. If two neighboring blocks of the same size become unallocated, they can be coalesced back together into a single large block of twice the size.

Buddy allocators are fast because the neighboring blocks to coalesce with can be calculated from the free’d block’s address, rather than traversing the size tags. Ultimate performance often requires a small amount of assembler code to use a specialized CPU instruction to find the lowest non-zero bit.

The main disadvantage of the buddy allocator is that it suffers from internal fragmentation, because allocations are rounded up to the nearest block size. For example, a 68-byte allocation requires a 128-byte block.
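The rounding that causes this can be sketched as follows (next_pow2 is an invented helper; a real buddy allocator would also enforce a minimum block size):

```c
#include <stddef.h>

// Sketch: a binary buddy allocator rounds every request up to the next
// power of two - the source of its internal fragmentation.
size_t next_pow2(size_t x) {
    size_t p = 1;
    while (p < x)
        p <<= 1;     // double until we reach or pass x
    return p;
}
```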

Further Reading

There are many other allocation schemes - for example, the slab allocator, one of the allocators used internally by the Linux kernel. See the man pages!

Topics

  • Best Fit

  • Worst Fit

  • First Fit

  • Buddy Allocator

  • Internal Fragmentation

  • External Fragmentation

  • sbrk

  • Natural Alignment

  • Boundary Tag

  • Coalescing

  • Splitting

  • Slab Allocation/Memory Pool

Questions/Exercises

  • What is Internal Fragmentation? When does it become an issue?

  • What is External Fragmentation? When does it become an issue?

  • What is a Best Fit placement strategy? How is it with External Fragmentation? Time Complexity?

  • What is a Worst Fit placement strategy? Is it any better with External Fragmentation? Time Complexity?

  • What is the First Fit Placement strategy? It’s a little bit better with Fragmentation, right? Expected Time Complexity?

  • Let’s say that we are using a buddy allocator with a new slab of 64KiB. How does it go about allocating 1.5KiB?

  • When does the 5 line sbrk implementation of malloc have a use?

  • What is natural alignment?

  • What is Coalescing/Splitting? How do they increase/decrease fragmentation? When can you coalesce or split?

  • How do boundary tags work? How can they be used to coalesce or split?

Rangan, C.P., V. Raman, and R. Ramanujam. 1999. Foundations of Software Technology and Theoretical Computer Science: 19th Conference, Chennai, India, December 13-15, 1999 Proceedings. FOUNDATIONS of Computer software Technology and Theoretical Computer Science. Springer. https://books.google.com/books?id=0uHME7EfjQEC.
