Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add sqlite3 as another possible backing store for the dbm module #100414

Closed
rhettinger opened this issue Dec 21, 2022 · 10 comments
Closed

Add sqlite3 as another possible backing store for the dbm module #100414

rhettinger opened this issue Dec 21, 2022 · 10 comments
Assignees
Labels
stdlib Python modules in the Lib dir topic-sqlite3 type-feature A feature request or enhancement

Comments

@rhettinger
Copy link
Contributor

rhettinger commented Dec 21, 2022

Right now we support ndbm and gnu.dbm which might or might not be part of a given build. The fallback is the super slow dumbdbm. Not the sqlite3 is part of the standard build, we can do better.

The module docstring says:

Future versions may change the order in which implementations are
tested for existence, and add interfaces to other dbm-like
implementations.

The future is now. Let's provide a fast, stable, robust, always available alternative.

This can be done will pure python calls to the existing sqlite3 module, or there can be a C extension that calls the Sqlite3 C API directly.

This would automatically be available to the shelve module, giving us a high quality, persistent key-value store.

Linked PRs

@rhettinger rhettinger added type-feature A feature request or enhancement 3.12 bugs and security fixes labels Dec 21, 2022
@erlend-aasland
Copy link
Contributor

See also #48033

@erlend-aasland
Copy link
Contributor

IIRC, I've got a fairly up-to-date branch for this, based on your previous efforts, Raymond.

@erlend-aasland
Copy link
Contributor

@rhettinger, would you like to continue the discussion here or on #48033? There is no use in keeping two identical issues open; let's close one of them.

@presidento
Copy link

presidento commented Dec 31, 2022

Here are my 50 cents about this.

TL;DR

  • dbm.dumb is not slower than SqliteDict
  • dbm.dumb is not fail safe, so try to avoid using it until it is fixed
  • SQLite is a good alternative for persistent dict backend, especially with JSON encoding

In details

  • The dbm.dumb is not so slow. According to my measurements the existing SqliteDict can be as fast ast dbm.dumb, but only with some tweaks (using WAL mode). The measurements with code are in this pull request: Reorganize the code of benchmark Dobatymo/lmdb-python-dbm#2
  • If high speed is needed, they can use vedis / unqlite or lmdbm.
  • Using SQLite backend can be very-very helpful if I would like to view or change the database manually. So if the speed of SqliteDict is enough, I'd prefer use that with JSON encoding (not pickle).
  • SQLite may be not the best choice storing binary data. Although it could be a good option if we can change shelve module to allow to use json encoding also, and not only have the option to set pickle protocol version.
  • pysos is also pretty fast and have the ability to review / change the database if the size of the database is small enough to load into an editor. (For example using it as a persistent state backend for small amount of data.)

The biggest problem with dbm.dumb (at least for me) is that it is not fail safe, see the example below. Moreover, it is not mentioned in the documentation. If you kill the Python process, the database will be corrupted every time. (This is because the key file is updated only on adding and deleting keys, and on close.)

import dbm.dumb
import shelve
import itertools

database = dbm.dumb.open("test", "c")
shelf = shelve.Shelf(database)

if "key" in shelf:
    print("The stored value is: " + shelf["key"])
    input("Press ENTER to continue")
    del shelf["key"]

try:
    for index in itertools.count():
        print(f"Persist value {index} to database")
        shelf["key"] = f"value {index} " + ("x" * index)
except:
    print("Exiting gracefully. Try killing the Python process.")

Performance measurements

Sorry. in Windows I do not have dbm.gnu.
Copied from Dobatymo/lmdb-python-dbm#2 (comment)
The numbers are seconds required to do the mentioned amount of work.

write

items lmdbm vedis unqlite rocksdict dbm.gnu semidbm pysos dbm.dumb sqlite-wal sqlite-autocommit sqlite-batch dummypickle dummyjson
10 0.015 0.000 0.000 0.031 - 0 0 0 0.015 0.015 0 0.015 0
100 0.141 0.000 0.000 0.047 - 0 0 0.031 0.031 0.109 0.094 0.156 0.172
1000 1.609 0.000 0.016 0.062 - 0.016 0.031 0.328 0.328 1.015 0.937 1.781 4.250
10000 15.359 0.062 0.062 0.172 - 0.140 0.250 3.219 3.125 - 17.922 19.453 95.188 -
100000 - 0.719 0.672 1.390 - 1.562 2.687 - 31.172 - - - -
1000000 - 9.234 9.140 16.594 - 17.484 27.390 - - - - - -

batch

items lmdbm vedis unqlite rocksdict dbm.gnu semidbm pysos dbm.dumb sqlite-wal sqlite-autocommit sqlite-batch dummypickle dummyjson
10 0.000 0.000 0.000 - - - - 0 0.016 0 0 0 0
100 0.000 0.000 0.000 - - - - 0.032 0.031 0.078 0 0 0
1000 0.000 0.015 0.000 - - - - 0.312 0.109 0.719 0.109 0 0
10000 0.031 0.047 0.046 - - - - 3.140 1.046 - 16.156 1.047 0 0.047
100000 0.422 0.609 0.594 - - - - - 10.875 - 10.985 0.375 2.859
1000000 4.188 8.500 8.406 - - - - - - - - 41.760 -

read

items lmdbm vedis unqlite rocksdict dbm.gnu semidbm pysos dbm.dumb sqlite-wal sqlite-autocommit sqlite-batch dummypickle dummyjson
10 0.000 0.000 0.000 0.031 - 0 0 0 0 0 0 0 0
100 0.000 0.000 0.000 0.031 - 0 0 0 0.031 0.016 0.031 0 0
1000 0.000 0.000 0.000 0.046 - 0.015 0 0.078 0.218 0.297 0.282 0 0
10000 0.047 0.047 0.047 0.093 - 0.109 0.141 0.765 2.079 2.890 3.013 0.031 0.015
100000 0.610 0.641 0.640 0.735 - 1.422 1.797 - 23.797 - 29.375 0.187 0.218
1000000 6.297 8.609 8.765 20.391 - 15.891 20.485 - - - - 2.141 -

combined

items lmdbm vedis unqlite rocksdict dbm.gnu semidbm pysos dbm.dumb sqlite-wal sqlite-autocommit sqlite-batch dummypickle dummyjson
10 0.156 0.000 0.015 0.047 - 0 0 0.078 0.235 0.390 0.406 0.156 0.156
100 0.156 0.015 0.000 0.047 - 0.016 0 0.078 0.235 0.390 0.390 0.156 0.203
1000 0.156 0.000 0.000 0.047 - 0.015 0.015 0.094 0.235 0.406 0.390 0.188 0.672
10000 0.156 0.015 0.015 0.047 - 0.015 0.031 0.203 0.234 0.469 0.406 0.594 5.141
100000 0.171 0.000 0.000 0.047 - 0.140 0.250 - 0.266 - 0.484 5.547 49.750
1000000 0.172 0.015 0.016 0.047 - 1.297 2.407 - - - - 90.718 -

@rhettinger
Copy link
Contributor Author

rhettinger commented Dec 31, 2022

@rhettinger, would you like to continue the discussion here

If you want to run with this, do whatever makes sense for you. The other issue is mostly stale and likely needs a fresh start. A new first draft is likely just an afternoon project. If you don't want to run with it, let me know and I'll work on it.

I don't know that there needs to be much discussion because we have collections.abc.MutableMapping to structure the implementation and because the APIs for shelve and dbm are already established. This is a clearly desirable buildout that addresses the known issues with dumbdbm. With Sqlite3, we get down to a single file in a standard format, we get ACID guarantees, concurrency and compaction, and we get a dbm that will be present in every build.

If you're interested, here's some code that can be used as a starting point:

import sqlite3
from contextlib import suppress
from collections.abc import MutableMapping

BUILD_TABLE = "CREATE TABLE Dict (key TEXT NOT NULL, value BLOB NOT NULL, PRIMARY KEY (key));"
GET_SIZE = "SELECT COUNT (*) FROM Dict;"
LOOKUP_KEY = "SELECT value FROM Dict WHERE key = ?;"
STORE_KV = "REPLACE INTO Dict (key, value) VALUES (?, ?);"
DELETE_KEY = "DELETE FROM Dict WHERE key = ?;"
ITER_KEYS = "SELECT key FROM Dict;"

class SQLiteDict(MutableMapping):

    def __init__(self, dbname=':memory:'):
        self.conn = sqlite3.connect(dbname, isolation_level=None) 
        self.conn.execute('PRAGMA journal_mode = wal') 
        with suppress(sqlite3.OperationalError):
            self.conn.execute(BUILD_TABLE)

    def __len__(self):
        return self.conn.execute(GET_SIZE).fetchone()[0]

    def __getitem__(self, key):
        value_tuple = self.conn.execute(LOOKUP_KEY, (key,)).fetchone()
        if value_tuple is None:
            raise KeyError(key)
        return value_tuple[0]

    def __setitem__(self, key, value):
        self.conn.execute(STORE_KV, (key, value));
        self.conn.commit()

    def __delitem__(self, key):
        if key not in self:
            raise KeyError(key)
        self.conn.execute(DELETE_KEY, (key,));
        self.conn.commit()

    def __iter__(self):
        curs = self.conn.execute(ITER_KEYS)
        while (value_tuple := curs.fetchone()):
            yield value_tuple[0]
        curs.close()

    def close(self):
        self.conn.close()
        self.conn = None

    def __enter__(self):
        return self

    def __exit__(self, *args):
        self.close()

Other thoughts:

  • If we wanted, we could implement dict-like tracking of key insertion order but the API doesn't seem to require it.
  • The idea is to leave the above class as a simple as possible and to let a separate DB_Sqllite class build around it to provide readonly logic and anything else specific to the dbm API. That would nicely decouple the core dictionary logic from the needs of dbm.
  • To make this support concurrency, the mutating methods need to be pause and wait if the db is locked.

@erlend-aasland
Copy link
Contributor

If you want to run with this, do whatever makes sense for you. The other issue is mostly stale and likely needs a fresh start.

I prefer closing the old issue; we definitely need a fresh start here. I'll take care of it.

A new first draft is likely just an afternoon project. If you don't want to run with it, let me know and I'll work on it.

I'll give it a shot; it'll be a nice chance for me to get to know the shelve and dbm modules better.

If you're interested, here's some code that can be used as a starting point:

Thanks!

@erlend-aasland erlend-aasland added 3.13 bugs and security fixes and removed 3.12 bugs and security fixes labels Jun 7, 2023
@iritkatriel iritkatriel added the stdlib Python modules in the Lib dir label Nov 28, 2023
@erlend-aasland erlend-aasland removed the 3.13 bugs and security fixes label Jan 22, 2024
@erlend-aasland erlend-aasland self-assigned this Jan 22, 2024
erlend-aasland added a commit to erlend-aasland/cpython that referenced this issue Jan 23, 2024
@erlend-aasland
Copy link
Contributor

Finally got back to this; I added @rhettinger's patch, amended it so the generic test_dbm test suite passed, and created PR #114481. Go nuts.

@erlend-aasland

This comment was marked as resolved.

@github-project-automation github-project-automation bot moved this from In Progress to Done in sqlite3 issues Feb 14, 2024
fsc-eriker pushed a commit to fsc-eriker/cpython that referenced this issue Feb 14, 2024
Co-authored-by: Raymond Hettinger <[email protected]>
Co-authored-by: Serhiy Storchaka <[email protected]>
Co-authored-by: Mariusz Felisiak <[email protected]>
fsc-eriker pushed a commit to fsc-eriker/cpython that referenced this issue Feb 14, 2024
@AdamWill
Copy link
Contributor

This seems to have broken the beaker library: bbangert/beaker#242 . Its test suite, which passes fine on Python 3.12, fails on Python 3.13 in dbm-based tests, with errors about sqlite3 thread violations. Isn't it a bit bad if code using dbm which worked fine with the previous default backend fails with the new one?

@AdamWill
Copy link
Contributor

of course, I guess it's possible beaker is just not handling threads correctly, but the other backends weren't smart enough to notice, and beaker's tests aren't good enough to catch it either. I'm not sure which is the case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stdlib Python modules in the Lib dir topic-sqlite3 type-feature A feature request or enhancement
Projects
Status: Done
Development

No branches or pull requests

6 participants