-
Notifications
You must be signed in to change notification settings - Fork 1.2k
LMDB freelist illustrated guide
Illustrations used in this guide are auto-generated from the actual LMDB databases created by some test code. Therefore, they reflect the behaviour of the actual LMDB code currently embedded into turbo-geth
via lmdb-go
repository here: https://github.com/ledgerwatch/lmdb-go. To understand which illustration is which, this guide will give its name, and will also show the test code that generates the database for each illustration. The illustrations can be reproduced by running:
make hack
./build/bin/hack --action defrag
This creates the .dot
and .png
files in the current directory.
Test code does nothing:
func nothing(kv ethdb.KV, _ ethdb.Tx) (bool, error) {
return true, nil
}
Here is what we see:
Each LMDB database has a meta page (actually it has two meta pages, but any reader or writer always chooses the one that was the last updated). Meta page has lots of interesting information, but the most important pieces for this guide are two "page pointers": FREE_DBI
and MAIN_DBI
. By page pointers here we mean a 64-bit page ID, which also happens to be the offset of that page in the data.md
file, divided by the page size (4096 bytes). This means that the page with ID 0 has offset 0 * 4096 = 0
from the beginning of the file (this happens to be the first meta page), the page with ID 1 has offset 4096 (this is the second meta page), page with ID 2 has offset 2 * 4096 = 9192
and so on.
Q: How do readers and writers choose which of the two meta pages to use? One of the fields within a meta page is the ID of the last transaction that has updated that meta page. Transactions with even IDs update the meta page 0, and transactions with odd IDs update the meta page 1. So readers and writers look at both meta pages and simply choose the one that has the higher transaction ID.
Test code that generates it:
func generate2(tx ethdb.Tx, entries int) error {
c := tx.Cursor("t")
defer c.Close()
for i := 0; i < entries; i++ {
k := fmt.Sprintf("%05d", i)
if err := c.Append([]byte(k), []byte("very_short_value")); err != nil {
return err
}
}
return nil
}
This function takes parameter entries
, which in this example is 2
to create two entries. And here is the result:
The test code snippet above gives an impression that there is only DB transaction happening. You should note, however, that the code does not create table t
, it starts using it (by making a cursor for it) assuming that it is already created. This table was created by some "convenience" wrappers and it was transaction with ID 1000001
. Transaction that is shown in the snippet is the one with ID 1000002
.
Q: Why do transaction IDs in an empty database start with 1000001 and not with 1? This is a recent change that we have made to our version of LMDB. We have noticed that the behaviour of LMDB regarding free space is dependent (surprisingly) on the numerical value of transaction ID, and the lower the ID, the more problematic behaviour becomes for large amount of free space (this will be explained later, because it is actually one of the main reasons this guide is being written in the first place). So a simple fix is just to start with a higher transaction ID, and value 1 million is round, so that people will notice straight away it is not random, and it is high enough that we should not see the problem we were seeing for the amount of free space up to 2Tb.
First of all, lets list all the pages that we can see (or not see) on the picture. We know that pages 0 and 1 are meta pages, and are always present. Page 2 is mentioned in a freelist record, inside the blue box. To make illustrations more clear, we use blue boxes to show pages belonging to FREE_DBI
, which is a sub-database storing the collection of so-called freelists. On the picture above, the FREE_DBI
sub-database resides on one page, page 5 (this is the value of the "page pointer" FREE_DBI). It contains one freelist, currently associated with the transaction ID 1000002
. And this freelist has only one element, which is page 2. The fact that page 2 is on the freelist, means that in subsequent transactions, page 2 can be recycled, i.e. filled up with some other data and/or structures.
Q: If a freelist is associated with some transaction ID, does it mean that that transaction has created this freelist (meaning that it has freed the pages mentioned in this freelist)? In many cases yes, but not always. It will be shown later that sometimes freelist associated with transaction IDs that did not create them, or indeed with transactions that never happened (for example, a freelist can be associated with a transaction ID 999997
, which never happens, because we start with 1000001
). This will be explained in more details later.
We continue listing all the pages. Page 3 is the "root page" of the sub-database for the table t
. This can be seen from the yellow box. To make illustrations more clear, we use yellow boxes to show pages belonging to MAIN_DBI
, which is a sub-database storing collection of more sub-databases we call "tables". Pages for data and structure belonging to the tables are shown in brown (for branches) and green (for leaves). So page 3 is actually that green box with two entries in it. It belongs to the table t
. And finally, page 4 is the yellow box containing the only entry inside MAIN_DBI
. You can see that the yellow box is page 4 by looking at the "page pointer" MAIN_DBI
coming from meta page.
In this small example, neither MAIN_DBI
nor table t
are large enough to require any so-called "branch pages". All information fits into a single page, which here is 4096 (LMDB default). In the leaf page of the table t
, one can see keys and values separated by colons.
Test code snippet:
func generate3(_ ethdb.KV, tx ethdb.Tx) (bool, error) {
for i := 0; i < 61; i++ {
k := fmt.Sprintf("table_%05d", i)
if err := tx.(ethdb.BucketMigrator).CreateBucket(k); err != nil {
return false, err
}
}
return true, nil
}
This code creates 61 empty tables within a single writeable transaction. Why 61? Because this is just enough not to fit on a single page, so that we can see how MAIN_DBI
begins to have branch pages. Here is how it looks:
To keep illustrations small, we use some "folding" of repeated data, here shown as ...53 more records here...
. This will allows us to create examples that are large enough to have certain features (like branch pages), without overburdening the illustrations.
Branch pages of the MAIN_DBI
sub-database are shown as purple. We can see such a branch page which has ID 4 (it is on the MAIN_DBI
"page pointer"). One may notice that there is no FREE_DBI
sub-database here, because there are no freelists. There was just one single transaction (we did not need table t
pre-created like before), and it could not have freed anything up, because it started with an empty database.
It is the same code as we saw previously:
func generate2(tx ethdb.Tx, entries int) error {
c := tx.Cursor("t")
defer c.Close()
for i := 0; i < entries; i++ {
k := fmt.Sprintf("%05d", i)
if err := c.Append([]byte(k), []byte("very_short_value")); err != nil {
return err
}
}
return nil
}
But this time, we pass value 200
as entries
. And here is the result:
As in the example with one table and two records, we see one freelist, exactly the same as before, and for the same reason (table t
used to be empty, so that the page containing pointer to it has to be replaced with a new page).
Since we now have 200 entries in the table t
, and they do not fit into a single page anymore, we observe a branch page (brown box), as well as two leaf pages (green boxes). Nothing much more interesting here.
Here is the test code that generates it:
func generate4(_ ethdb.KV, tx ethdb.Tx) (bool, error) {
c := tx.Cursor("t")
defer c.Close()
if err := c.Append([]byte("k1"), []byte("very_short_value")); err != nil {
return false, err
}
if err := c.Append([]byte("k2"), []byte(strings.Repeat("long_value_", 1000))); err != nil {
return false, err
}
return true, nil
}
The value for the second key ends up being 11 * 1000 = 11000
bytes long, which clearly does not fit on a single page, but would fit on three pages. This is how it looks like indeed:
When confronted with such large values, LMDB places 64-bit number instead, which is interpreted as the first page ID of where the actual value resides. These pages where the actual value resides are called "overflow pages". LMDB insists that overflow pages belonging to the same value have consecutive page IDs, i.e. are stored together in the data.md
file. We suspect that this is done to avoid "jumping around" the database file looking for the pieces of some value. In this example, the first overflow page ID is 4 (looking at OVERFLOW(4)
), which means that pages with IDs 4, 5, and 6 contain that very long value. This matches us with the rest of the picture, because page 7 is the yellow box, page 8 is the blue box, page 2 is free (it is on the freelist), and page 3 is the first green box.
We are now coming closer to the main motivation for writing this guide. Overflow pages present serious challenges when combined with the specific way the freelists are handled in LMDB.
Q: If overflow pages present a serious challenge, would it make sense to try to avoid them? Yes, this is what we will definitely attempt to do. Wish we knew all this much earlier.
Here is the test code:
func generate5(_ ethdb.KV, tx ethdb.Tx) (bool, error) {
c := tx.CursorDupSort("t")
defer c.Close()
if err := c.AppendDup([]byte("key1"), []byte("value11")); err != nil {
return false, err
}
if err := c.AppendDup([]byte("key1"), []byte("value12")); err != nil {
return false, err
}
if err := c.AppendDup([]byte("key1"), []byte("value13")); err != nil {
return false, err
}
if err := c.AppendDup([]byte("key2"), []byte("value21")); err != nil {
return false, err
}
if err := c.AppendDup([]byte("key2"), []byte("value22")); err != nil {
return false, err
}
if err := c.AppendDup([]byte("key3"), []byte("value31")); err != nil {
return false, err
}
return true, nil
}
It appears that the DupSort
feature of LMDB has been inherited from its predecessor, BerkleyDB, in somewhat more restricted form. This feature allows having multiple values per key (therefore Dup
). In BerkleyDB, values for such "duplicated" keys could be stored either unsorted or sorted, but in LMDB they are always stored as sorted. We suspect that the main motivation for bringing this feature to LMDB is simpler implementation of secondary indices. If you use a table for mapping Key => (Attr1, Attr2, Attr3)
, and then you want to have secondary index Attr2 => Key
, it is very likely that for many values of Attr2
there will be more than one matching Key
s. So you can use DupSort
feature to store such index. LMDB will not automatically keep secondary indices consistent with the "primary" table though.
The picture simply demonstrates a key with a single value (key3
), a duplicated key with two values (key2
), and a duplicated key with three values (key1
). All three reside on the same page.
There is, however, one restriction about DupSort
tables that make them relevant to our challenges with freelists and overflow pages. The size of any value associated with a key in such a table may not exceed 512 bytes. This means that for DupSort
tables, overflow pages never occur. This might actually be the reason for introducing such restriction - overflow pages might have been too complicated to combine with large number of values for the same duplicated key, as shown in the next example.
Test code:
func generate6(_ ethdb.KV, tx ethdb.Tx) (bool, error) {
c := tx.CursorDupSort("t")
defer c.Close()
for i := 0; i < 1000; i++ {
v := fmt.Sprintf("dupval_%05d", i)
if err := c.AppendDup([]byte("key1"), []byte(v)); err != nil {
return false, err
}
}
if err := c.AppendDup([]byte("key2"), []byte("value21")); err != nil {
return false, err
}
if err := c.AppendDup([]byte("key2"), []byte("value22")); err != nil {
return false, err
}
if err := c.AppendDup([]byte("key3"), []byte("value31")); err != nil {
return false, err
}
return true, nil
}
The difference between this one and the previous illustration is that key1
has 1000
values instead of three. This number of values does not fit in a single page, so this demonstrates what LMDB does in such cases:
It creates a sub-database for every such key. In this example, the sub-database for key1
has one branch page and a few leaf nodes. It is important to note that sub-database are only creates when values for one key do not fit in a single page. In this sense, sub-database is similar to the overflow pages. However, the crucial difference is that, since every individual value cannot be more than 512 bytes long, LMDB does not require for any pages of such sub-databases to be laid out on the consecutive pages. This makes DupSort sub-databases "exempt" from the challenges with the freelists that very long values have (due to overflow pages).
Test code is the sequence of two operations, first to generate 1000 records, using the same code as before, but called with entries = 1000
:
func generate2(tx ethdb.Tx, entries int) error {
c := tx.Cursor("t")
defer c.Close()
for i := 0; i < entries; i++ {
k := fmt.Sprintf("%05d", i)
if err := c.Append([]byte(k), []byte("very_short_value")); err != nil {
return err
}
}
return nil
}
and the second - to clear the table:
func dropT(_ ethdb.KV, tx ethdb.Tx) (bool, error) {
if err := tx.(ethdb.BucketMigrator).ClearBucket("t"); err != nil {
return false, err
}
return true, nil
}
So in this example we have two pictures, one before the clearing out:
and another - after the clearing out:
The first picture looks similar to what we have seen before (but this time 1000 records instead of 200 records), but the second one displays a larger freelist than we have seen before. The syntax is similar to how you would specify pages when you print a document, there are intervals and individual pages separated by commas. In this example, 15,13-2(12)
means that there are 13 free pages in this list, with IDs 15, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2
. First thing to notice is that pages are listed in the reverse order. This is how LMDB sorts them (we don't know why). Second thing to notice is that after each interval, like 13-2
there is the size of that interval in parenthesis, like (12)
. The number in parenthesis becomes important when the freelist gets used for recycling pages. If we, for example, would require a consecutive run of 13 or 14 pages (for overflowing values), we would not be able to use this freelist, but if we only required 5 consecutive pages, then we could take pages 13, 12, 11, 10, 9
, and reduce the remaining freelist to 15,8-2(7)
.
Where did the page 14 go? It is the one which is represented by the yellow box, now containing info about the empty table t
.
As before, the test code is the sequence of two operations:
func generate7(_ ethdb.KV, tx ethdb.Tx) (bool, error) {
c1 := tx.Cursor("t1")
defer c1.Close()
c2 := tx.Cursor("t2")
defer c2.Close()
for i := 0; i < 1000; i++ {
k := fmt.Sprintf("%05d", i)
if err := c1.Append([]byte(k), []byte("very_short_value")); err != nil {
return false, err
}
if err := c2.Append([]byte(k), []byte("very_short_value")); err != nil {
return false, err
}
}
return true, nil
}
func dropT1(_ ethdb.KV, tx ethdb.Tx) (bool, error) {
if err := tx.(ethdb.BucketMigrator).ClearBucket("t1"); err != nil {
return false, err
}
return true, nil
}
And here are the two pictures, one before the clearing, and another - after the clearing: