-
Notifications
You must be signed in to change notification settings - Fork 1.2k
LMDB freelist illustrated guide
Illustrations used in this guide are auto-generated from the actual LMDB databases created by some test code. Therefore, they reflect the behaviour of the actual LMDB code currently embedded into turbo-geth
via lmdb-go
repository here: https://github.com/ledgerwatch/lmdb-go. To understand which illustration is which, this guide will give its name, and will also show the test code that generates the database for each illustration. The illustrations can be reproduced by running:
make hack
./build/bin/hack --action defrag
This creates the .dot
and .png
files in the current directory.
Test code does nothing:
func nothing(kv ethdb.KV, _ ethdb.Tx) (bool, error) {
return true, nil
}
Here is what we see:
Each LMDB database has a meta page (actually it has two meta pages, but any reader or writer always chooses the one that was the last updated). Meta page has lots of interesting information, but the most important pieces for this guide are two "page pointers": FREE_DBI
and MAIN_DBI
. By page pointers here we mean a 64-bit page ID, which also happens to be the offset of that page in the data.md
file, divided by the page size (4096 bytes). This means that the page with ID 0 has offset 0 * 4096 = 0
from the beginning of the file (this happens to be the first meta page), the page with ID 1 has offset 4096 (this is the second meta page), page with ID 2 has offset 2 * 4096 = 9192
and so on.
Q: How do readers and writers choose which of the two meta pages to use? One of the fields within a meta page is the ID of the last transaction that has updated that meta page. Transactions with even IDs update the meta page 0, and transactions with odd IDs update the meta page 1. So readers and writers look at both meta pages and simply choose the one that has the higher transaction ID.
Test code that generates it:
func generate2(tx ethdb.Tx, entries int) error {
c := tx.Cursor("t")
defer c.Close()
for i := 0; i < entries; i++ {
k := fmt.Sprintf("%05d", i)
if err := c.Append([]byte(k), []byte("very_short_value")); err != nil {
return err
}
}
return nil
}
This function takes parameter entries
, which in this example is 2
to create two entries. And here is the result:
The test code snippet above gives an impression that there is only DB transaction happening. You should note, however, that the code does not create table t
, it starts using it (by making a cursor for it) assuming that it is already created. This table was created by some "convenience" wrappers and it was transaction with ID 1000001
. Transaction that is shown in the snippet is the one with ID 1000002
.
Q: Why do transaction IDs in an empty database start with 1000001 and not with 1? This is a recent change that we have made to our version of LMDB. We have noticed that the behaviour of LMDB regarding free space is dependent (surprisingly) on the numerical value of transaction ID, and the lower the ID, the more problematic behaviour becomes for large amount of free space (this will be explained later, because it is actually one of the main reasons this guide is being written in the first place). So a simple fix is just to start with a higher transaction ID, and value 1 million is round, so that people will notice straight away it is not random, and it is high enough that we should not see the problem we were seeing for the amount of free space up to 2Tb.
First of all, lets list all the pages that we can see (or not see) on the picture. We know that pages 0 and 1 are meta pages, and are always present. Page 2 is mentioned in a freelist record, inside the blue box. To make illustrations more clear, we use blue boxes to show pages belonging to FREE_DBI
, which is a sub-database storing the collection of so-called freelists. On the picture above, the FREE_DBI
sub-database resides on one page, page 5 (this is the value of the "page pointer" FREE_DBI). It contains one freelist, currently associated with the transaction ID 1000002
. And this freelist has only one element, which is page 2. The fact that page 2 is on the freelist, means that in subsequent transactions, page 2 can be recycled, i.e. filled up with some other data and/or structures.
Q: If a freelist is associated with some transaction ID, does it mean that that transaction has created this freelist (meaning that it has freed the pages mentioned in this freelist)? In many cases yes, but not always. It will be shown later that sometimes freelist associated with transaction IDs that did not create them, or indeed with transactions that never happened (for example, a freelist can be associated with a transaction ID 999997
, which never happens, because we start with 1000001
). This will be explained in more details later.
We continue listing all the pages. Page 3 is the "root page" of the sub-database for the table t
. This can be seen from the yellow box. To make illustrations more clear, we use yellow boxes to show pages belonging to MAIN_DBI
, which is a sub-database storing collection of more sub-databases we call "tables". Pages for data and structure belonging to the tables are shown in brown (for branches) and green (for leaves). So page 3 is actually that green box with two entries in it. It belongs to the table t
. And finally, page 4 is the yellow box containing the only entry inside MAIN_DBI
. You can see that the yellow box is page 4 by looking at the "page pointer" MAIN_DBI
coming from meta page.
In this small example, neither MAIN_DBI
nor table t
are large enough to require any so-called "branch pages". All information fits into a single page, which here is 4096 (LMDB default). In the leaf page of the table t
, one can see keys and values separated by colons.
Test code snippet:
func generate3(_ ethdb.KV, tx ethdb.Tx) (bool, error) {
for i := 0; i < 61; i++ {
k := fmt.Sprintf("table_%05d", i)
if err := tx.(ethdb.BucketMigrator).CreateBucket(k); err != nil {
return false, err
}
}
return true, nil
}
This code creates 61 empty tables within a single writeable transaction. Why 61? Because this is just enough not to fit on a single page, so that we can see how MAIN_DBI
begins to have branch pages. Here is how it looks:
To keep illustrations small, we use some "folding" of repeated data, here shown as ...53 more records here...
. This will allows us to create examples that are large enough to have certain features (like branch pages), without overburdening the illustrations.
Branch pages of the MAIN_DBI
sub-database are shown as purple. We can see such a branch page which has ID 4 (it is on the MAIN_DBI
"page pointer"). One may notice that there is no FREE_DBI
sub-database here, because there are no freelists. There was just one single transaction (we did not need table t
pre-created like before), and it could not have freed anything up, because it started with an empty database.
It is the same code as we saw previously:
func generate2(tx ethdb.Tx, entries int) error {
c := tx.Cursor("t")
defer c.Close()
for i := 0; i < entries; i++ {
k := fmt.Sprintf("%05d", i)
if err := c.Append([]byte(k), []byte("very_short_value")); err != nil {
return err
}
}
return nil
}
But this time, we pass value 200
as entries
. And here is the result:
As in the example with one table and two records, we see one freelist, exactly the same as before, and for the same reason (table t
used to be empty, so that the page containing pointer to it has to be replaced with a new page).
Since we now have 200 entries in the table t
, and they do not fit into a single page anymore, we observe a branch page (brown box), as well as two leaf pages (green boxes). Nothing much more interesting here.
Here is the test code that generates it:
func generate4(_ ethdb.KV, tx ethdb.Tx) (bool, error) {
c := tx.Cursor("t")
defer c.Close()
if err := c.Append([]byte("k1"), []byte("very_short_value")); err != nil {
return false, err
}
if err := c.Append([]byte("k2"), []byte(strings.Repeat("long_value_", 1000))); err != nil {
return false, err
}
return true, nil
}
The value for the second key ends up being 11 * 1000 = 11000
bytes long, which clearly does not fit on a single page, but would fit on three pages. This is how it looks like indeed:
When confronted with such large values, LMDB places 64-bit number instead, which is interpreted as the first page ID of where the actual value resides. These pages where the actual value resides are called "overflow pages". LMDB insists that overflow pages belonging to the same value have consecutive page IDs, i.e. are stored together in the data.md
file. We suspect that this is done to avoid "jumping around" the database file looking for the pieces of some value. In this example, the first overflow page ID is 4 (looking at OVERFLOW(4)
), which means that pages with IDs 4, 5, and 6 contain that very long value. This matches us with the rest of the picture, because page 7 is the yellow box, page 8 is the blue box, page 2 is free (it is on the freelist), and page 3 is the first green box.
We are now coming closer to the main motivation for writing this guide. Overflow pages present serious challenges when combined with the specific way the freelists are handled in LMDB.
Q: If overflow pages present a serious challenge, would it make sense to try to avoid them? Yes, this is what we will definitely attempt to do. Wish we knew all this much earlier.
Here is the test code:
func generate5(_ ethdb.KV, tx ethdb.Tx) (bool, error) {
c := tx.CursorDupSort("t")
defer c.Close()
if err := c.AppendDup([]byte("key1"), []byte("value11")); err != nil {
return false, err
}
if err := c.AppendDup([]byte("key1"), []byte("value12")); err != nil {
return false, err
}
if err := c.AppendDup([]byte("key1"), []byte("value13")); err != nil {
return false, err
}
if err := c.AppendDup([]byte("key2"), []byte("value21")); err != nil {
return false, err
}
if err := c.AppendDup([]byte("key2"), []byte("value22")); err != nil {
return false, err
}
if err := c.AppendDup([]byte("key3"), []byte("value31")); err != nil {
return false, err
}
return true, nil
}
It appears that the DupSort
feature of LMDB has been inherited from its predecessor, BerkleyDB, in somewhat more restricted form. This feature allows having multiple values per key (therefore Dup
). In BerkleyDB, values for such "duplicated" keys could be stored either unsorted or sorted, but in LMDB they are always stored as sorted. We suspect that the main motivation for bringing this feature to LMDB is simpler implementation of secondary indices. If you use a table for mapping Key => (Attr1, Attr2, Attr3)
, and then you want to have secondary index Attr2 => Key
, it is very likely that for many values of Attr2
there will be more than one matching Key
s. So you can use DupSort
feature to store such index. LMDB will not automatically keep secondary indices consistent with the "primary" table though.
The picture simply demonstrates a key with a single value (key3
), a duplicated key with two values (key2
), and a duplicated key with three values (key1
). All three reside on the same page.
There is, however, one restriction about DupSort
tables that make them relevant to our challenges with freelists and overflow pages. The size of any value associated with a key in such a table may not exceed 512 bytes. This means that for DupSort
tables, overflow pages never occur. This might actually be the reason for introducing such restriction - overflow pages might have been too complicated to combine with large number of values for the same duplicated key, as shown in the next example.
Test code:
func generate6(_ ethdb.KV, tx ethdb.Tx) (bool, error) {
c := tx.CursorDupSort("t")
defer c.Close()
for i := 0; i < 1000; i++ {
v := fmt.Sprintf("dupval_%05d", i)
if err := c.AppendDup([]byte("key1"), []byte(v)); err != nil {
return false, err
}
}
if err := c.AppendDup([]byte("key2"), []byte("value21")); err != nil {
return false, err
}
if err := c.AppendDup([]byte("key2"), []byte("value22")); err != nil {
return false, err
}
if err := c.AppendDup([]byte("key3"), []byte("value31")); err != nil {
return false, err
}
return true, nil
}
The difference between this one and the previous illustration is that key1
has 1000
values instead of three. This number of values does not fit in a single page, so this demonstrates what LMDB does in such cases:
It creates a sub-database for every such key. In this example, the sub-database for key1
has one branch page and a few leaf nodes. It is important to note that sub-database are only creates when values for one key do not fit in a single page. In this sense, sub-database is similar to the overflow pages. However, the crucial difference is that, since every individual value cannot be more than 512 bytes long, LMDB does not require for any pages of such sub-databases to be laid out on the consecutive pages. This makes DupSort sub-databases "exempt" from the challenges with the freelists that very long values have (due to overflow pages).
Test code is the sequence of two operations, first to generate 1000 records, using the same code as before, but called with entries = 1000
:
func generate2(tx ethdb.Tx, entries int) error {
c := tx.Cursor("t")
defer c.Close()
for i := 0; i < entries; i++ {
k := fmt.Sprintf("%05d", i)
if err := c.Append([]byte(k), []byte("very_short_value")); err != nil {
return err
}
}
return nil
}
and the second - to clear the table:
func dropT(_ ethdb.KV, tx ethdb.Tx) (bool, error) {
if err := tx.(ethdb.BucketMigrator).ClearBucket("t"); err != nil {
return false, err
}
return true, nil
}
So in this example we have two pictures, one before the clearing out:
and another - after the clearing out:
The first picture looks similar to what we have seen before (but this time 1000 records instead of 200 records), but the second one displays a larger freelist than we have seen before. The syntax is similar to how you would specify pages when you print a document, there are intervals and individual pages separated by commas. In this example, 15,13-2(12)
means that there are 13 free pages in this list, with IDs 15, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2
. First thing to notice is that pages are listed in the reverse order. This is how LMDB sorts them (we don't know why). Second thing to notice is that after each interval, like 13-2
there is the size of that interval in parenthesis, like (12)
. The number in parenthesis becomes important when the freelist gets used for recycling pages. If we, for example, would require a consecutive run of 13 or 14 pages (for overflowing values), we would not be able to use this freelist, but if we only required 5 consecutive pages, then we could take pages 13, 12, 11, 10, 9
, and reduce the remaining freelist to 15,8-2(7)
.
Where did the page 14 go? It is the one which is represented by the yellow box, now containing info about the empty table t
.
As before, the test code is the sequence of two operations:
func generate7(_ ethdb.KV, tx ethdb.Tx) (bool, error) {
c1 := tx.Cursor("t1")
defer c1.Close()
c2 := tx.Cursor("t2")
defer c2.Close()
for i := 0; i < 1000; i++ {
k := fmt.Sprintf("%05d", i)
if err := c1.Append([]byte(k), []byte("very_short_value")); err != nil {
return false, err
}
if err := c2.Append([]byte(k), []byte("very_short_value")); err != nil {
return false, err
}
}
return true, nil
}
func dropT1(_ ethdb.KV, tx ethdb.Tx) (bool, error) {
if err := tx.(ethdb.BucketMigrator).ClearBucket("t1"); err != nil {
return false, err
}
return true, nil
}
And here are the two pictures, one before the clearing, and another - after the clearing:
If we look closely into the test code that inserts 1000 record into two tables, we notice that instead of inserting 1000 into one, and then inserting 1000 into another one, we insert each key/value pair into both tables, and then go to the next pair. This is deliberate. We know that two tables will be residing in different sub-databases, so they will be allocated their own pages. When there is nothing in the freelist, new pages are allocated by extending the database, and allocated page IDs will monotonically increase: 4, 5, 6, 7, ...
. If we were to insert 1000 records into one table first, LMDB would likely to allocate consecutive run of pages to it, for example 4-13
or something. If we, however, insert records in both, one pair at a time, it is more likely that one table will most get odd page IDs, and another one - even page IDs. This is what we want - because we would like to create a so-called "fragmented" freelist, where there aren't many long consecutive runs of page IDs.
Now if one looks at the second picture, after the table t1
is cleared, it is obvious that we almost succeeded. Most pages previously occupied by the table t1
is now in the freelist, and the freelist looks fragmented, containing mostly odd page IDs (with some exceptions, like 22
and 6
).
This example is based on the previous one, but the test code is the sequence of three operations instead of two. Here is the test code of the third operation:
func dropT2(_ ethdb.KV, tx ethdb.Tx) (bool, error) {
if err := tx.(ethdb.BucketMigrator).ClearBucket("t2"); err != nil {
return false, err
}
return true, nil
}
The two first pictures in this example are identical to the ones in the previous example, therefore here is the third picture showing the database after the table t2
is also cleared out:
Here we can see that because clearing out tables t1
and t2
happened in two separate transactions, we have two freelists. And, as designed, these freelists are fragmented.
Now we are starting to work with larger-scale examples. The test code is the sequence of three operations. First operation creates 100 tables:
func generate8(_ ethdb.KV, tx ethdb.Tx) (bool, error) {
for i := 0; i < 100; i++ {
k := fmt.Sprintf("table_%05d", i)
if err := tx.(ethdb.BucketMigrator).CreateBucket(k); err != nil {
return false, err
}
}
return false, nil
}
Second operation populates them with 1000 records each (note that we use the same approach as before to make sure the tables do not occupy consecutive runs of pages):
func generate9(tx ethdb.Tx, entries int) error {
var cs []ethdb.Cursor
for i := 0; i < 100; i++ {
k := fmt.Sprintf("table_%05d", i)
c := tx.Cursor(k)
defer c.Close()
cs = append(cs, c)
}
for i := 0; i < entries; i++ {
k := fmt.Sprintf("%08d", i)
for _, c := range cs {
if err := c.Append([]byte(k), []byte("very_short_value")); err != nil {
return err
}
}
}
return nil
}
The third operation removes half of the tables, all of them with the even numbers in their names. Removal is not performed in one single transaction, instead each table is removed in a separate transaction:
func dropGradually(kv ethdb.KV, tx ethdb.Tx) (bool, error) {
tx.Rollback()
for i := 0; i < 100; i += 2 {
k := fmt.Sprintf("table_%05d", i)
if err := kv.Update(context.Background(), func(tx1 ethdb.Tx) error {
return tx1.(ethdb.BucketMigrator).DropBucket(k)
}); err != nil {
return false, err
}
}
return true, nil
}
The two following picture shows the state of the database after the creation and population of the 100 tables with 1000 records each:
The small freelist with 3 free pages on it occurred because the second operation (populating tables with records) replaced these 3 pages in the MAIN_DBI
that used to contain the records for the empty tables.
The next picture shows what it looks like after half of the tables were removed, each table in its own transaction:
Because each table was dropped in a separate transaction, we see quite a few freelists instead of just one large one. However, one may notice that even though we removed 50 tables, there are fewer than 50 freelists (there are about 39). Where is the other freelists go? They were "consumed". The free pages in those freelists were allocated to the subsequent transactions. Notice that the freelist associated with the Tx ID 1000015
has only 3 items in it. The others were already recycled for constant re-writing of the pages of the MAIN_DBI
.
But we still ended up with a fair number of freelists, and these freelists are quite fragmented. Because these freelists are quite large and they would obstruct the illustration too much, they are shortened and ellipsis ...
are added at the end. The important characteristics of these freelist is displayed though - maximum gap. One can think of this as how large is the run of consecutive page IDs that can be allowed from this freelist. In our example, the max gap is 2 for most of the freelist - so we again succeeded in making them fragmented.
Here is the similar example to the previous one, with two differences. Firstly, we populate each of 100 tables with 10 thousand instead of 1000 records. Secondly, we drop all the tables in one single transaction, instead of one by one. Here is only the test code for the third operation (dropping all tables at once):
func dropAll(_ ethdb.KV, tx ethdb.Tx) (bool, error) {
for i := 0; i < 100; i++ {
k := fmt.Sprintf("table_%05d", i)
if err := tx.(ethdb.BucketMigrator).DropBucket(k); err != nil {
return false, err
}
}
return true, nil
}
We will not show pictures for the intermediate stages, only for what happens in the end:
Because all of the removals happened within a single transaction, all freed pages were ended up clamped into a single freelist. And this freelist is so large, it does not fit into a single 4096 page.
Q: How many items does a freelist need to have not to fit into a single page? Most pages have headers, which are 16 bytes long. So out of 4096 bytes, only 4080 are useable by data. Each data node requires 2 byte "node pointer" (these pointers are at the beginning of the page, in the order of sorted keys, right after the header, whereas the actual node data are at the end of the page, in arbitrary order). Further 4 bytes is spent on the data size, 2 bytes on the node flags, and another 2 bytes - on the key size. The key of the freelist records is always 8 byte long (transaction ID), and the actual list of 8-byte pageIDs is always pre-pended by the list size (another 8 bytes). In total, there are 2+4+2+2+8+8=26
bytes are overhead of the node. So only 4080-26=4054
bytes are left for the actual list of page IDs. It can contain maximum of 506 page IDs. LMDB code currently estimates it to 509, because it does not take into account node overhead we calculated above.
This is almost a repeat of one of the previous examples, except instead of 10k records per table here we have 300k records per table. As before, we remove half them "gradually", each table in a separate transaction.
We only show the "end result" picture:
There are couple of interesting things to notice here. Firstly, although removing a table with 300k records would have definitely resulted in freeing more than 509 pages, and therefore, in more than 1 overflow page per freelist, this did not happen here. Only 1 overflow page per freelist. Secondly, a lot of freelist records are associated with transaction IDs smaller than 1000001
. As we mentioned earlier, this may happen, and this is an example of how it happens. Both of these interesting things are the consequences of the fix that we have applied to LMDB. The fix was to not allocate transaction IDs lower than 1000001
to the new writeable transactions. How does this help?
When LMDB is confronted with a very large freelist that needs to be persisted into the FREE_DBI
sub-database (blue boxes), it tries to split it up into the chunks that would ideally fit into one page. At the same time, it wants to associate each freelist records with the transaction IDs. This is important for maintaining the "MVCC" (Multi Version Concurrency Control) property. LMDB sees the transaction IDs allocated to the existing readers of the database, and it ensures that no writers can disrupt their view of the database. "Disrupting their view of the database" means potentially modifying the pages that the reader may still find (observe). But if there are no active readers for any transaction IDs lower than X
, then any specific transaction ID < X
loses its meaning, and those those IDs can be used arbitrarily, they won't affect any readers. Knowing that, LMDB tries to "spread" freelist to as many "past" transaction IDs as it can. If X=10
, for example, there there are only 9 possible Tx IDs to spread things to. So, if LMDB has a very large freelist (say, 900k items), spreading them across 9 records will still result in very large records, and extensive use of overflow pages. If however, X = 1000010
instead (after our fix), then those 900k items can be "comfortably" spread such that each records has at most 509 items.
This is why in the picture above one sees those two interesting things: freelists do not use more than 1 overflow page each, and transaction IDs lower than 1000001
are being used.
func change1(tx ethdb.Tx) (bool, error) {
c := tx.Cursor("t")
defer c.Close()
for i := 0; i < 1000; i++ {
k := fmt.Sprintf("%05d", i)
if err := c.Put([]byte(k), []byte("another_short_value_1")); err != nil {
return false, err
}
}
return true, nil
}
func change2(tx ethdb.Tx) (bool, error) {
c := tx.Cursor("t")
defer c.Close()
for i := 0; i < 1000; i++ {
k := fmt.Sprintf("%05d", i)
if err := c.Put([]byte(k), []byte("another_short_value_2")); err != nil {
return false, err
}
}
return true, nil
}
func change3(tx ethdb.Tx) (bool, error) {
c := tx.Cursor("t")
defer c.Close()
for i := 0; i < 1000; i++ {
k := fmt.Sprintf("%05d", i)
if err := c.Put([]byte(k), []byte("another_short_value_3")); err != nil {
return false, err
}
}
return true, nil
}