dlmalloc: new and faster allocator #54

mikrosk · 2023-03-22T12:03:01Z

Doug Lea's allocator has been discovered by Eero Tamminen during our ScummVM profiling session. It is used in quite a few libc variants and most importantly: it's freaking fast! How much faster? Obviously with applications doing a lot of allocations, typically C++'s vector & list operations.

Few examples (% of CPU time spent in malloc/free before and after):

Gabriel Knight (demo): During startup: 16% -> 0.8%; During game play: 24% -> 1.5%
Ms. Astro Chicken: 30% -> 1.7%; Full Throttle intro: ~6% -> ~1%.
Christmas card 1988: 15.72% -> 1.61%

Basically, allocation/release overhead is now much closer to what you would see on typical PC. Me and Eero have tested it quite a lot, not to mention that it is a well known project.

I have only two points to raise:

should we be using sbrk or __sbrk in malloc.c ? Previous implementation used the latter.
is replacing end = (char*)(CALL_MORECORE(0)); (basically sbrk(0)) with end = br + asize; the right choice?

The allocator supports also non-contiguous memory model via sbrk (however it is also possible to use the contiguous one when setting _stksize properly). The trouble is that our sbrk doesn't support sbrk(0) in the non-contiguous model because it may be not clear what to return in such case. By default, such call shall return the current pointer where the next allocation would occur. I've been thinking that maybe we could fake it by storing the last allocated pointer + size (basically same as I do in malloc.c) and return that value in case of sbrk(0) is called.

But IMHO it is fine as it is now, and it is also faster. :)

The PR is split into two commits so it's easy to see what I have changed.

mikrosk · 2023-03-22T12:06:05Z

Btw another bonus is that memalign and posix_memalign now work as expected, in stark comparison to their previous implementation (with memory leaks...).

th-otto · 2023-03-22T13:54:24Z

There are a few objections that i have.

it might be faster, but at the cost of substantially larger code size (1k vs 12k)
it has to implement some user-settable minimum chunk size to allocate from the OS (that MINCHUNK & MAXCHUNK variables). Otherwise you quickly run into the FOLDR100 problem.
The MINCHUNK & MAXCHUNK variables (as well as some other symbols) must not be removed, since that will give linking errors (yes i know they don't appear in public headers, but they are referenced in other packages)
reference to abort should be avoided. That pulls in fclose() and as as consequence, lots of other stuff.
The file should be split to one function per source file, as before. Very few programs will need posix_memalign

conclusion:
If you really need that, than it should be an alternative implementation, not completely replace the old implementation. So there should be a way to give the user a choice, maybe some #define before any system header files are included.

mikrosk · 2023-03-22T14:12:16Z

it might be faster, but at the cost of substantially larger code size (1k vs 12k)

Indeed. Perhaps this is partially caused by the one file implementation.

it has to implement some user-settable minimum chunk size to allocate from the OS (that MINCHUNK & MAXCHUNK variables). Otherwise you quickly run into the FOLDR100 problem.

What is this? I've grep'ed the original source, I don't see anything with such names in there?

reference to abort should be avoided. That pulls in fclose() and as as consequence, lots of other stuff.

Good catch!

The file should be split to one function per source file, as before. Very few programs will need posix_memalign

Is it even possible? The main functions (malloc, free, ...) use an awful lot of those macros and defines, let alone static functions. It looks quite error prone to dissect it (you know how it is, one forgotten static variable in a header file and you've got suddenly two copies instead of shared variable).

If you really need that, than it should be an alternative implementation, not completely replace the old implementation. So there should be a way to give the user a choice, maybe some #define before any system header files are included.

Wouldn't make more sense to move (if it's not already there) the current allocator to libcmini and let this one for "heavy duty" ? That would be more aligned with philosophy of both projects.

mikrosk · 2023-03-22T14:16:21Z

Ah, I see. You meant MINHUNK, not MINCHUNK. But I still don't get it, how is this used externally? It is a static variable in a C file after all? Or you meant that the function mallocChunkSize should be present? That's easy, dlmalloc provides a similar define (even with the same default size) so it's easy to export the same function from there as well.

Btw I see that mintlib's allocator is copied in libcmini so I still like the idea to make mintlib's one the heavier one.

th-otto · 2023-03-22T14:39:30Z

Yes, sorry for the typo. MINHUNK, MAXHUNK & mallocChunkSizemust remain.

You have to look at the original code to understand what they do. Basically they are used to determine the size that is obtained using GEMDOS Malloc. In ORCS for example i set MINHUNK such that no more than about 1000 Malloc calls are made to the OS, even if all available memory is requested. IIIR some other program do similar things. Remember that plain TOS (including EmuTOS) fails to allocate anything after about 300 Malloc calls, if you don't run FOLDR100.PRG.

Is it even possible?

I only took a brief look so far, but of course that is possible. If there are some variables that are supposed to be static, but will then be referenced from several modules, just rename them with some unique prefix, including "__"

Wouldn't make more sense to move (if it's not already there) the current allocator to libcmini and let this one for "heavy >duty" ? That would be more aligned with philosophy of both projects.

Mintlib is already quite bloated, one of the reasons why users don't like using it. We should not make that worse.

Until now, we got quite well with the old implementation (ok, to be fair, maybe just because noone took the effort to profile it yet). But it can't be that bad, there are lots of other applications that allocate memory in small chunks. Your current need in ScummVM to allocate & free lots of blocks in each display frame seems rather special to me.

mikrosk · 2023-03-22T15:00:36Z

But it can't be that bad, there are lots of other applications that allocate memory in small chunks. Your current need in ScummVM to allocate & free lots of blocks in each display frame seems rather special to me.

It's true it is rather special for typical Atari usage (where the programmer is taught to be very careful about system resources) but I bet it is quite common in any C++ based application. Basically every app has some kind of (main)loop and if it's C++, there's a pretty good chance some vector or list is used. I wouldn't be surprised if zView, OpenTTD (and other SDL-based projects) or maybe even gcc/g++ would show significant speedup.

The trouble with the #ifdef approach (to select old and new malloc) is that I can't imagine how this would work with new and delete without changing the actual C++ source code (i.e. defining a custom allocator within new) ? Or is it possible to achieve this with some weak_alias hackery?

th-otto · 2023-03-22T15:14:45Z

I wouldn't be surprised if zView, OpenTTD (and other SDL-based projects) or maybe even gcc/g++ would show significant speedup.

IIRC neither zView nor SDL use c++ code.

(i.e. defining a custom allocator within new) ? Or is it possible to achieve this with some weak_alias hackery?

Theoretically that should work, by putting a default allocator in the library. Might only be a problem with the link-order, since libc is later in the chain, so you may have to place it in libstdc++ instead, but you don't need to change the code otherwise.

Might also be worth to take a look at other projects. IIRC that allocator is already in use as an alternative rather than replacement.

mikrosk · 2023-03-22T15:59:34Z

zView - take a look at libfreetype :), SDL itself not but take a look at OpenTTD for instance: https://github.com/OpenTTD/OpenTTD/tree/master/src ... heaps of C++ code as well.

Oh man, that sounds pretty terrible. It's easier just to maintain a separate mintlib for my purposes, it's a question of one -L switch.

th-otto · 2023-03-22T16:22:10Z

Well, its just my opinion. Aren't there any others?

mikrosk · 2023-03-22T18:02:37Z

Isn't possible to do something like (with defined __malloc and __dlmalloc functions in mintlib):

#ifdef USE_DLMALLOC
weak_alias(__dlmalloc, malloc)
#else
weak_alias(__malloc, malloc)
#endif

in some header file? So new would automatically pick whatever got aliased.

th-otto · 2023-03-22T18:21:29Z

I don't think that this will work, because

new references operator new resp. operator new[], not malloc
malloc will only be defined in libc, which is later in the link chain, so operator new will be found in libcstd++ first
IIRC you can only define weak aliases for symbols already seen when compiling, but you cannot assume that c++ source include <stdlib.h>
generally you cannot assume that they directly include any mintlib headers at all, they might instead include etc.

That are things that have to be tried.

xdelatour · 2023-03-23T11:33:32Z

Maybe I'm wrong but XaAES seems to be able to use different versions of memcpy/memset (xcclib) depending on a constant (the version number of GCC). Isn't possible to do similar thing with malloc/mfree?

th-otto · 2023-03-23T12:10:27Z

That only seems to be a relict for old (2.95?) gcc versions, and the actual code is identical. Also, for XaAES things are quite different, because most standard library functions are redirected through the function table that the kernel exports.

This is not supported in mintlib. In fact, even sbrk(0) isn't.

mikrosk force-pushed the dlmalloc branch from 9a56957 to 716fca2 Compare April 14, 2023 20:31

mikrosk force-pushed the dlmalloc branch from 716fca2 to 05c903f Compare August 12, 2023 02:04

mikrosk force-pushed the dlmalloc branch 2 times, most recently from 0c9e4f5 to 9ec6a67 Compare August 30, 2023 04:35

mikrosk force-pushed the dlmalloc branch 3 times, most recently from 14b7111 to f98b048 Compare September 28, 2023 18:31

mikrosk force-pushed the dlmalloc branch 5 times, most recently from 55f6a05 to 1fc9c13 Compare October 25, 2023 20:46

mikrosk force-pushed the dlmalloc branch from 1fc9c13 to 74dc604 Compare January 1, 2024 17:34

mikrosk added 4 commits March 10, 2024 22:18

Initial commit

d6f9b14

Adapt to mintlib

4c2aafc

Install into /opt/mintlib-dlmalloc

f626810

Update dlmalloc to latest version

35c1333

mikrosk added 2 commits March 10, 2024 22:18

Use sysconf(_SC_PAGE_SIZE)

1ce6572

Avoid calling sbrk() with negative values

584b4d9

This is not supported in mintlib. In fact, even sbrk(0) isn't.

mikrosk force-pushed the dlmalloc branch from 74dc604 to 584b4d9 Compare March 10, 2024 21:19

th-otto force-pushed the master branch 5 times, most recently from b9a4f13 to 519e232 Compare April 29, 2024 09:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dlmalloc: new and faster allocator #54

dlmalloc: new and faster allocator #54

mikrosk commented Mar 22, 2023 •

edited

Loading

mikrosk commented Mar 22, 2023

th-otto commented Mar 22, 2023

mikrosk commented Mar 22, 2023

mikrosk commented Mar 22, 2023 •

edited

Loading

th-otto commented Mar 22, 2023

mikrosk commented Mar 22, 2023

th-otto commented Mar 22, 2023

mikrosk commented Mar 22, 2023

th-otto commented Mar 22, 2023

mikrosk commented Mar 22, 2023

th-otto commented Mar 22, 2023

xdelatour commented Mar 23, 2023

th-otto commented Mar 23, 2023

dlmalloc: new and faster allocator #54

Are you sure you want to change the base?

dlmalloc: new and faster allocator #54

Conversation

mikrosk commented Mar 22, 2023 • edited Loading

mikrosk commented Mar 22, 2023

th-otto commented Mar 22, 2023

mikrosk commented Mar 22, 2023

mikrosk commented Mar 22, 2023 • edited Loading

th-otto commented Mar 22, 2023

mikrosk commented Mar 22, 2023

th-otto commented Mar 22, 2023

mikrosk commented Mar 22, 2023

th-otto commented Mar 22, 2023

mikrosk commented Mar 22, 2023

th-otto commented Mar 22, 2023

xdelatour commented Mar 23, 2023

th-otto commented Mar 23, 2023

mikrosk commented Mar 22, 2023 •

edited

Loading

mikrosk commented Mar 22, 2023 •

edited

Loading