-
Notifications
You must be signed in to change notification settings - Fork 85
Malloc
- Memory Allocators
Memory memory everywhere but not a allocation to be made - A really fragmented heap
Memory allocation is very important! Allocating and de-allocating heap
memory is one of the most common operations in any application. The heap
at the system level is contiguous series of addresses that the program
can expand or contract and use as its own accord. In POSIX, this is
called the system break. We use sbrk
to move the system break. Most
programs don’t interact directly with this call, they use a memory
allocation system around it to handle chunking up and keeping track of
which memory is allocated and which is free’d.
We will mainly be looking into how the c standard library does memory
allocations and that c-api for it. Just know that there are other ways
of dividing up memory like with mmap
or other allocation schemes and
methods like jemalloc
.
-
malloc(size_t bytes)
is a C library call and is used to reserve a contiguous block of memory. Unlike stack memory, the memory remains allocated untilfree
is called with the same pointer. Ifmalloc
fails to reserve any more memory then it returnsNULL
. Robust programs should check the return value. If your code assumesmalloc
succeeds and it does not, then your program will likely crash (segfault) when it tries to write to address 0. Also, malloc may not zero out memory because of performance – check your code to make sure that you are not using uninitialized values. -
realloc(void *space, size_t bytes)
allows you to resize an existing memory allocation that was previously allocated on the heap (via malloc, calloc, or realloc). The most common use of realloc is to resize memory used to hold an array of values. There are two gotchas with realloc. One, a new pointer may be returned. Two, it can fail. A naive but readable version of realloc is suggested below with sample usage.void * realloc(void * ptr, size_t newsize) { // Simple implementation always reserves more memory // and has no error checking void *result = malloc(newsize); size_t oldsize = ... //(depends on allocator's internal data structure) if (ptr) memcpy(result, ptr, newsize < oldsize ? newsize : oldsize); free(ptr); return result; } int main() { // 1 int *array = malloc(sizeof(int) * 2); array[0] = 10; array[1] = 20; // Ooops need a bigger array - so use realloc.. array = realloc(array, 3 * sizeof(int)); array[2] = 30; }
The above code is not robust. If
realloc
fails then the array is a memory leak. Robust code checks for the return value and only reassigns the original pointer if not null. -
calloc(size_t nmemb, size_t size)
initializes memory contents to zero and also takes two arguments: the number of items and the size in bytes of each item. An advanced discussion of these limitations is in this article. Programmers often usecalloc
rather than explicitly callingmemset
aftermalloc
, to set the memory contents to zero. Notecalloc(x,y)
is identical tocalloc(y,x)
, but you should follow the conventions of the manual. A naive implementation of calloc is below.void *calloc(size_t n, size_t size) { size_t total = n * size; // Does not check for overflow! void *result = malloc(total); if (!result) return NULL; // If we're using new memory pages // just allocated from the system by calling sbrk // then they will be zero so zero-ing out is unnecessary, return memset(result, 0, total); }
-
free
takes a pointer to the start of a piece of memory and makes it available for use in the subsequent calls to the other allocation functions. This is important because we don’t want every process in our address space to take an enormous amount of memory. Once we are done using memory, we stop using it with free. A simple usage is below.int *ptr = malloc(sizeof(*ptr)); do_something(ptr); free(ptr);
If you use a piece of memory after it is freed - that is undefined behavior.
The heap is part of the process memory and it does not have a fixed
size. Heap memory allocation is performed by the C library when you call
malloc
(calloc
, realloc
) and free
. By calling sbrk
the C
library can increase the size of the heap as your program demands more
heap memory. As the heap and stack (one for each thread) need to grow,
we put them at opposite ends of the address space. So for typical
architectures the heap will grow upwards and the stack grows downwards.
Truthiness: Modern operating system memory allocators no longer need
sbrk
- instead they can request independent regions of virtual memory
and maintain multiple memory regions. For example, gibibyte requests may
be placed in a different memory region than small allocation requests.
However this detail is an unwanted complexity. The problems of
fragmentation and allocating memory efficiently still apply, so we will
ignore this implementation nicety here and will write as if the heap is
a single region. If we write a multi-threaded program (more about that
later) we will need multiple stacks (one per thread) but there’s only
ever one heap.
Programs don’t need to call brk or sbrk typically (though calling
sbrk(0)
can be interesting because it tells you where your heap
currently ends). Instead programs use malloc,calloc,realloc
and free
which are part of the C library. The internal implementation of these
functions may call sbrk
when additional heap memory is required.
void *top_of_heap = sbrk(0);
malloc(16384);
void *top_of_heap2 = sbrk(0);
printf("The top of heap went from %p to %p \n", top_of_heap, top_of_heap2);
// Example output: The top of heap went from 0x4000 to 0xa000
An important fact we just glossed over earlier is memory that was newly
obtained by the operating system must be zeroed out. If the operating
system did not zero out contents of physical RAM it might be possible
for one process to learn about the memory of another process that had
previously used the memory. This would be a security leak. Unfortunately
this means that for malloc
requests before any memory has been freed
and simple programs (which end up using newly reserved memory from the
system) the memory is often zero. Then programmers mistaken write C
programs that assume malloc’d memory will always be zero.
char* ptr = malloc(300);
// contents is probably zero because we get brand new memory
// so beginner programs appear to work!
// strcpy(ptr, "Some data"); // work with the data
free(ptr);
// later
char *ptr2 = malloc(300); // Contents might now contain existing data and is probably not zero
Let’s try to write Malloc. Here is a first attempt at it – the naive version.
void* malloc(size_t size)
// Ask the system for more bytes by extending the heap space.
// sbrk Returns -1 on failure
void *p = sbrk(size);
if(p == (void *) -1) return NULL; // No space left
return p;
}
void free() {/* Do nothing */}
Above is the simplest implementation of malloc, there are a few drawbacks though.
-
System calls are slow (compared to library calls). We should reserve a large amount of memory and only occasionally ask for more from the system.
-
No reuse of freed memory. Our program never re-uses heap memory - it just keeps asking for a bigger heap.
If this allocator was used in a typical program, the process would quickly exhaust all available memory. Instead we need an allocator that can efficiently use heap space and only ask for more memory when necessary. There are programs that may use this type of allocator. Consider a video game allocating objects to load the next scene. It is considerably faster to do the above and just throw the entire block of memory away than it is to do the following placement strategies.
During program execution, memory is allocated and de-allocated (freed), so there will be gaps (holes) in the heap memory that can be re-used for future memory requests. The memory allocator needs to keep track of which parts of the heap are currently allocated and which are parts are available. Suppose our current heap size is 64K, though not all of it is in use because some earlier malloc’d memory has already been freed by the program:
16KiB | 10KiB | 1KiB | 1KiB | 30KiB | 4KiB | 2KiB |
---|---|---|---|---|---|---|
free | allocated | free | allocated | free | allocated | free |
If a new malloc request for 2KiB is executed (malloc(2048)
), where
should malloc
reserve the memory? It could use the last 2KiB hole,
which happens to be the perfect size! Or it could split one of the
other two free holes. These choices represent different placement
strategies. Whichever hole is chosen, the allocator will need to split
the hole into two: The newly allocated space (which will be returned to
the program) and a smaller hole (if there is spare space left over). A
perfect-fit strategy finds the smallest hole that is of sufficient size
(at least 2KiB):
16KiB | 10KiB | 1KiB | 1KiB | 30KiB | 4KiB | 2KiB |
---|---|---|---|---|---|---|
free | allocated | free | allocated | free | allocated | HERE! |
A worst-fit strategy finds the largest hole that is of sufficient size (so break the 30KiB hole into two):
16KiB | 10KiB | 1KiB | 1KiB | 2KiB | 28KiB | 4KiB | 2KiB |
---|---|---|---|---|---|---|---|
free | allocated | free | allocated | HERE! |
free | allocated | free |
A first-fit strategy finds the first available hole that is of sufficient size (break the 16KiB hole into two):
2KiB | 14KiB | 10KiB | 1KiB | 1KiB | 30KiB | 4KiB | 2KiB |
---|---|---|---|---|---|---|---|
HERE! |
free | allocated | free | allocated | free | allocated | free |
To introduce another concept, external fragmentation is that even though we have enough memory in the heap, it may be divided up in a way that wear are not able to give the full amount. In the example below, of the 64KiB of heap memory, 17KiB is allocated, and 47KiB is free. However the largest available block is only 30KiB because our available unallocated heap memory is fragmented into smaller pieces.
16KiB | 10KiB | 1KiB | 1KiB | 30KiB | 4KiB | 2KiB |
---|---|---|---|---|---|---|
free | allocated | free | allocated | free | allocated | free |
The challenges of writing a heap allocator are
-
Need to minimize fragmentation (i.e. maximize memory utilization)
-
Need high performance
-
Fiddly implementation (lots of pointer manipulation using linked lists and pointer arithmetic)
-
Both fragmentation and performance depend on the application allocation profile, which can be evaluated but not predicted and in practice, under-specific usage conditions, a special-purpose allocator can often out-perform a general purpose implementation.
-
The allocator doesn’t know the program’s memory allocation requests in advance. Even if we did, this is the Knapsack problem which is known to be NP hard!
Different strategies affect the fragmentation of heap memory in non-obvious ways, which only are discovered by mathematical analysis or careful simulations under real-world conditions (for example simulating the memory allocation requests of a database or webserver). For example, best-fit at first glance appears to be an excellent strategy. However, if we can not find a perfectly-sized hole then this placement creates many tiny unusable holes, leading to high fragmentation. It also requires a scan of all possible holes. Since Worst-fit targets the largest unallocated space, it is a poor choice if large allocations are required. Worst fit also has to scan the entire heap
First fit has the advantage that it will not evaluate all possible placements and therefore be faster. In practice first-fit and next-fit are often common placement strategy. Hybrid approaches and many other alternatives exist.
A memory allocator needs to keep track of which bytes are currently
allocated and which are available for use. This page introduces the
implementation and conceptual details of building an allocator, or the
actual code that implements malloc
and free
.
Conceptually we are thinking about creating linked lists and lists of blocks! We are writing integers and pointers into memory that we already control, so we can later consistently hop from one address to the next. This internal information represents some overhead. Even if we had requested 1024 KiB of contiguous memory from the system, we will not be able to provide all of it to the running program.
We can think of our heap memory as a list of blocks where each block is either allocated or unallocated. Rather than storing an explicit list of pointers we store information about the block’s size as part of the block. Thus there is conceptually a list of free blocks, but it is implicit in the form of block size information that we store as part of each block.
We could navigate from one block to the next block just by adding the
block’s size. For example if you have a pointer p
that points to the
start of a block, then next_block
with be at
((char )p) + (size_t ) p
, if you are storing the size of the
blocks in bytes. The cast to char
ensures that pointer arithmetic is
calculated in bytes. The cast to size_t
ensures the memory at p
is
read as a size value and would be necessarily if p
was a void
or
char
type.
The calling program never sees these values; they are internal to the
implementation of the memory allocator. As an example, suppose your
allocator is asked to reserve 80 bytes (malloc(80)
) and requires 8
bytes of internal header data. The allocator would need to find an
unallocated space of at least 88 bytes. After updating the heap data it
would return a pointer to the block. However, the returned pointer does
not point to the start of the block because that’s where the internal
size data is stored! Instead we would return the start of the block + 8
bytes. In the implementation, remember that pointer arithmetic depends
on type. For example, p += 8
adds 8 sizeof(p)
, not necessarily 8
bytes!
The simplest implementation uses first fit: Start at the first block,
assuming it exists, and iterate until a block that represents
unallocated space of sufficient size is found, or we’ve checked all the
blocks. If no suitable block is found, it’s time to call sbrk()
again
to sufficiently extend the size of the heap. A fast implementation might
extend it a significant amount so that we will not need to request more
heap memory in the near future.
When a free block is found, it may be larger than the space we need. If so, we will create two entries in our implicit list. The first entry is the allocated block, the second entry is the remaining space. There are two simple ways to note if a block is in use or available. The first is to store it as a byte in the header information along with the size and the second to encode it as the lowest bit in the size! Thus block size information would be limited to only even values:
// Assumes p is a reasonable pointer type, e.g. 'size_t *'.
isallocated = (*p) & 1;
realsize = (*p) & ~1; // mask out the lowest bit
Many architectures expect multi-byte primitives to be aligned to some multiple of 2 (4, 16, etc). For example, it’s common to require 4-byte types to be aligned to 4-byte boundaries and 8-byte types on 8-byte boundaries. If multi-byte primitives are not stored on a reasonable boundary for example starting at an odd address then the performance can be significantly impacted because it may require two memory read requests instead of one. On some architectures the penalty is even greater - the program will crash with a bus error. Most of you experienced this in your architecture classes if there was no memory protection.
As malloc
does not know how the user will use the allocated memory,
the pointer returned to the program needs to be aligned for the worst
case, which is architecture dependent.
From glibc documentation, the glibc malloc
uses the following
heuristic
The block that malloc gives you is guaranteed to be aligned so that it can hold any type of data. On GNU systems, the address is always a multiple of eight on most systems, and a multiple of 16 on 64-bit systems." For example, if you need to calculate how many 16 byte units are required, don’t forget to round up.
int s = (requested_bytes + tag_overhead_bytes + 15) / 16
The additional constant ensures incomplete units are rounded up. Note,
real code is more likely to symbol sizes e.g. sizeof(x) - 1
, rather
than coding numerical constant 15.
Here’s a great article on memory alignment, if you are further interested
Another added effect is could be internal fragmentation happens when the block you give them is larger than their allocation size. Let’s say that we have a free block of size 16B (not including metadata). If they allocate 7 bytes, you may want to round up to 16B and just return the entire block. This gets very sinister when you implementing coalescing and splitting (next section). If you don’t implement either, then you may end up returning a block of size 64B for a 7B allocation! There is a lot of overhead for that allocation which is what we are trying to avoid.
When free
is called we need to re-apply the offset to get back to the
‘real’ start of the block – to where we stored the size information. A
naive implementation would simply mark the block as unused. If we are
storing the block allocation status in the lowest size bit, then we just
need to clear the bit:
*p = (*p) & ~1; // Clear lowest bit
However, we have a bit more work to do: If the current block and the next block (if it exists) are both free we need to coalesce these blocks into a single block. Similarly, we also need to check the previous block, too. If that exists and represents an unallocated memory, then we need to coalesce the blocks into a single large block.
To be able to coalesce a free block with a previous free block we will also need to find the previous block, so we store the block’s size at the end of the block, too. These are called “boundary tags” (ref Knuth73). These are Knuth’s solution to the coalescing problem both ways. As the blocks are contiguous, the end of one blocks sits right next to the start of the next block. So the current block (apart from the first one) can look a few bytes further back to lookup the size of the previous block. With this information you can now jump backwards!
With the above description it’s possible to build a memory allocator. It’s main advantage is simplicity - at least simple compared to other allocators! Allocating memory is a worst-case linear time operation – search linked lists for a sufficiently large free block. De-allocation is constant time (no more than 3 blocks will need to be coalesced into a single block). Using this allocator it is possible to experiment with different placement strategies. For example, you could start searching from where you last free’d a block, or where you last allocated from. If you do store pointers to blocks, you need to be very careful that they always remain valid Particularly when you malloc, free, calloc, realloc, coalesce, split, etc.
Better performance can be achieved by implementing an explicit doubly-linked list of free nodes. In that case, we can immediately traverse to the next free block and the previous free block. This can halve the search time, because the linked list only includes unallocated blocks. A second advantage is that we now have some control over the ordering of the linked list. For example, when a block is free’d, we could choose to insert it into the beginning of the linked list rather than always between its neighbors. This is discussed below.
Where do we store the pointers of our linked list? A simple trick is to realize that the block itself is not being used and store the next and previous pointers as part of the block (though now you have to ensure that the free blocks are always sufficiently large to hold two pointers). We still need to implement Boundary Tags, so we can correctly free blocks and coalesce them with their two neighbors. Consequently, explicit free lists require more code and complexity. With explicit linked lists a fast and simple ‘Find-First’ algorithm is used to find the first sufficiently large link. However, since the link order can be modified, this corresponds to different placement strategies. For example if the links are maintained from largest to smallest, then this produces a ‘Worst-Fit’ placement strategy.
The newly free’d block can be inserted easily into two possible positions: at the beginning or in address order (by using the boundary tags to first find the neighbors). Inserting at the beginning creates a LIFO (last-in, first-out) policy: The most recently free’d spaces will be reused. Studies suggest fragmentation is worse than using address order.
Inserting in address order (“Address ordered policy”) inserts free’d blocks so that the blocks are visited in increasing address order. This policy required more time to free a block because the boundary tags (size data) must be used to find the next and previous unallocated blocks. However, there is less fragmentation.
A segregated allocator is one that divides the heap into different areas that are handled by different sub-allocators dependent on the size of the allocation request. Sizes are grouped into classes (e.g. powers of two) and each size is handled by a different sub-allocator and each size maintains its own free list.
A well known allocator of this type is the buddy allocator (Rangan, Raman, and Ramanujam 1999 P. 85). We’ll discuss the binary buddy allocator which splits allocation into blocks of size 2^n (n = 1, 2, 3, …) times some base unit number of bytes, but others also exist (e.g. Fibonacci split). The basic concept is simple: If there are no free blocks of size 2^n, go to the next level and steal that block and split it into two. If two neighboring blocks of the same size become unallocated, they can be coalesced back together into a single large block of twice the size.
Buddy allocators are fast because the neighboring blocks to coalesce with can be calculated from the free’d block’s address, rather than traversing the size tags. Ultimate performance often requires a small amount of assembler code to use a specialized CPU instruction to find the lowest non-zero bit.
The main disadvantage of the Buddy allocator is that they suffer from internal fragmentation, because allocations are rounded up to the nearest block size. For example, a 68-byte allocation will require a 128-byte block.
There are many other allocation schemes. One of three allocators used internally by the Linux Kernel. See the man page!
-
SLUB (wikipedia)
-
See Foundations of Software Technology and Theoretical Computer Science 1999 proceedings (Google books,page 85)
-
Best Fit
-
Worst Fit
-
First Fit
-
Buddy Allocator
-
Internal Fragmentation
-
External Fragmentation
-
sbrk
-
Natural Alignment
-
Boundary Tag
-
Coalescing
-
Splitting
-
Slab Allocation/Memory Pool
-
What is Internal Fragmentation? When does it become an issue?
-
What is External Fragmentation? When does it become an issue?
-
What is a Best Fit placement strategy? How is it with External Fragmentation? Time Complexity?
-
What is a Worst Fit placement strategy? Is it any better with External Fragmentation? Time Complexity?
-
What is the First Fit Placement strategy? It’s a little bit better with Fragmentation, right? Expected Time Complexity?
-
Let’s say that we are using a buddy allocator with a new slab of 64kb. How does it go about allocating 1.5kb?
-
When does the 5 line
sbrk
implementation of malloc have a use? -
What is natural alignment?
-
What is Coalescing/Splitting? How do they increase/decrease fragmentation? When can you coalesce or split?
-
How do boundary tags work? How can they be used to coalesce or split?
Rangan, C.P., V. Raman, and R. Ramanujam. 1999. Foundations of Software Technology and Theoretical Computer Science: 19th Conference, Chennai, India, December 13-15, 1999 Proceedings. FOUNDATIONS of Computer software Technology and Theoretical Computer Science. Springer. https://books.google.com/books?id=0uHME7EfjQEC.
Please do not edit this wiki
This content is licensed under the coursebook licensing scheme. If you find any typos. Please file an issue or make a PR. Thank you!