Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dbm.sqlite proof of concept #48033

Closed
smontanaro opened this issue Sep 4, 2008 · 56 comments
Closed

dbm.sqlite proof of concept #48033

smontanaro opened this issue Sep 4, 2008 · 56 comments
Labels
stdlib Python modules in the Lib dir topic-sqlite3 type-feature A feature request or enhancement

Comments

@smontanaro
Copy link
Contributor

BPO 3783
Nosy @rhettinger, @doko42, @jcea, @pitrou, @merwok, @florentx, @erlend-aasland
Files
  • sq_dict.py: an alternate sqlite dbm-like interface v.5
  • dbsqlite.py: Fixed len
  • dbdict.py: Non-synchronous DB based on a dict subclass.
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2008-09-04.23:57:30.571>
    labels = ['type-feature', 'library']
    title = 'dbm.sqlite proof of concept'
    updated_at = <Date 2021-07-28.20:11:08.117>
    user = 'https://github.com/smontanaro'

    bugs.python.org fields:

    activity = <Date 2021-07-28.20:11:08.117>
    actor = 'erlendaasland'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2008-09-04.23:57:30.571>
    creator = 'skip.montanaro'
    dependencies = []
    files = ['11602', '12933', '12939']
    hgrepos = []
    issue_num = 3783
    keywords = ['patch']
    message_count = 51.0
    messages = ['72556', '72565', '72566', '72567', '72568', '72569', '72570', '72587', '72595', '72661', '72677', '72719', '72725', '72729', '73012', '73013', '73018', '73020', '73025', '73026', '73027', '73028', '73032', '73034', '73046', '73053', '73054', '73067', '73068', '73070', '73072', '73074', '73077', '73080', '73775', '73800', '79050', '80802', '80803', '80808', '80817', '81109', '81110', '81113', '81155', '81483', '85717', '95492', '95804', '97473', '248835']
    nosy_count = 11.0
    nosy_names = ['rhettinger', 'doko', 'jcea', 'ghaering', 'pitrou', 'erno', 'eric.araujo', 'gregburd', 'flox', 'rute', 'erlendaasland']
    pr_nums = []
    priority = 'low'
    resolution = None
    stage = 'needs patch'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue3783'
    versions = ['Python 3.4']

    @smontanaro
    Copy link
    Contributor Author

    Based on recent discussions about ridding Python of bsddb I decided to
    see how hard it would be to implement a barebones dbm.sqlite module.
    Turns out, not very hard.

    No docs. No test cases. Caveat emptor. But I think it can serve as
    at least a proof of concept, maybe as the basis for a new module in
    3.1.

    @smontanaro smontanaro added type-feature A feature request or enhancement stdlib Python modules in the Lib dir labels Sep 4, 2008
    @smontanaro
    Copy link
    Contributor Author

    Attaching corrected module.

    @smontanaro
    Copy link
    Contributor Author

    Attaching test cases based on dumbdbm tests.

    @smontanaro
    Copy link
    Contributor Author

    Another slight revision to the module.

    @smontanaro
    Copy link
    Contributor Author

    Trivial doc diffs against 3.0b3 doc.

    @smontanaro
    Copy link
    Contributor Author

    Another tweak - add values()

    @smontanaro
    Copy link
    Contributor Author

    Updated test cases

    @pitrou
    Copy link
    Member

    pitrou commented Sep 5, 2008

    It would be more efficient to base keys() on iterkeys() than the
    reverse, IMO.
    Other than that, why not open a branch or at least upload full-fledged
    patch files? :)

    @smontanaro
    Copy link
    Contributor Author

    Antoine> It would be more efficient to base keys() on iterkeys() than the
    Antoine> reverse, IMO.

    True. I was just modifying the dumbdbm implementation.

    Antoine> Other than that, why not open a branch or at least upload
    Antoine> full-fledged patch files? :)
    

    Well, I only intended to create the initial proof of concept, but then last
    night I couldn't sleep so I cobbled the test module and docs from the
    existing stuff ... ;-)

    When I get a couple minutes free I'll try and condense it into a more
    suitable form. Probably not until the weekend though.

    Skip

    @smontanaro
    Copy link
    Contributor Author

    OK, I made a sandbox project out of it:

    svn+ssh://[email protected]/sandbox/trunk/dbm_sqlite

    Hack away!

    @smontanaro smontanaro self-assigned this Sep 6, 2008
    @rhettinger
    Copy link
    Contributor

    I would like to see something like this go into 3.0 so that shelves
    don't become useless for Windows users.

    @josiahcarlson
    Copy link
    Mannequin

    josiahcarlson mannequin commented Sep 6, 2008

    Here's an alternate version with most of bsddb's interface intact.

    @gpshead
    Copy link
    Member

    gpshead commented Sep 7, 2008

    sq_dict review:

    have sqlite quote/escape self._mtn before using it with a python %s
    substitution. or pass it into the sql query function as a positional ?
    parameter like you do for keys and values. (avoid sql injection)

    raise a TypeError rather than a ValueError when you don't like the key
    or value type.

    also, to test the type, isinstance(val, str) is better than using type(val).

    @josiahcarlson
    Copy link
    Mannequin

    josiahcarlson mannequin commented Sep 7, 2008

    I tried passing the db name as a parameter with '?', it doesn't always
    work. Also, there shouldn't be any SQL injection issues here unless
    someone designed their system wrong (if a third party is allowed to pass
    the name of a db table into the open/create function, then they can do
    much worse than mangle or hide data in a sqlite database).

    With regards to isinstance being better than type; it's only better if
    you want to support subclasses. When writing the module, I had no
    interest in supporting subclasses (though supporting both str and buffer
    in 2.x, and bytes and memoryview in 3.x seems reasonable).

    @ghaering
    Copy link
    Mannequin

    ghaering mannequin commented Sep 11, 2008

    I like Skip's version better, because it's closer to the dbm
    "specification" instead of trying to mimic bsddb (first, last, etc.).
    I'd like to keep such things out.

    I've made a few changes to the sandbox project which I will check in
    later today. The most important change is support for a "fast mode",
    which doesn't commit changes until you call the synch() method. synch()
    is also called on close().

    Perhaps we should do automatic commits every n (like 1000) changes, too?

    What's all this ORDER BY in both your implementations about? The dbm
    "spec" says nothing about keys being ordered AFAIC. Can we get rid of these?

    @ghaering
    Copy link
    Mannequin

    ghaering mannequin commented Sep 11, 2008

    One question about Josiah's _check_value(). SQLite is less crippled than
    other simplistic databases and also supports integers, reals and blobs
    in addition to strings.

    Shouldn't we make this accessible to users? Or is compatibility with
    other dbm implementations more important?

    @smontanaro
    Copy link
    Contributor Author

    Gerhard> What's all this ORDER BY in both your implementations about?
    Gerhard> The dbm "spec" says nothing about keys being ordered AFAIC. Can
    Gerhard> we get rid of these?

    I'd like to guarantee that zip(db.keys(), db.values() == db.items().

    Skip

    @pitrou
    Copy link
    Member

    pitrou commented Sep 11, 2008

    Le jeudi 11 septembre 2008 à 12:46 +0000, Skip Montanaro a écrit :

    Gerhard> What's all this ORDER BY in both your implementations about?
    Gerhard> The dbm "spec" says nothing about keys being ordered AFAIC. Can
    Gerhard> we get rid of these?

    I'd like to guarantee that zip(db.keys(), db.values() == db.items().

    It doesn't sound very useful, and it may hurt performance on big tables.

    @ghaering
    Copy link
    Mannequin

    ghaering mannequin commented Sep 11, 2008

    I'd like to guarantee that zip(db.keys(), db.values() == db.items().

    Ok. If that isn't guaranteed elsewhere just drop it here?

    FWIW that will also work without the ORDER BY, because you're getting
    the rows back in the same ORDER. Something cheaper would also be "ORDER
    BY ROWID". I still propose to just do without the ORDER BY.

    @smontanaro
    Copy link
    Contributor Author

    > I'd like to guarantee that zip(db.keys(), db.values() == db.items().

    Antoine> It doesn't sound very useful, and it may hurt performance on
    Antoine> big tables.
    

    Actually, I think Python guarantees (for dicts at least - other mappings
    should probably follow suit) that if you call keys() then call values()
    without making any changes to the dict that their orders match, e.g., that

    zip(d.keys(), d.values()) == d.items()
    

    Skip

    @smontanaro
    Copy link
    Contributor Author

    Gerhard> FWIW that will also work without the ORDER BY, because you're
    Gerhard> getting the rows back in the same ORDER. Something cheaper
    Gerhard> would also be "ORDER BY ROWID". I still propose to just do
    Gerhard> without the ORDER BY.

    As long as SQLite guarantees that the ordering is identical, then sure, dump
    the ORDER BY clause.

    Skip

    @ghaering
    Copy link
    Mannequin

    ghaering mannequin commented Sep 11, 2008

    Skip Montanaro wrote:

    Skip Montanaro <[email protected]> added the comment:

    Gerhard> FWIW that will also work without the ORDER BY, because you're
    Gerhard> getting the rows back in the same ORDER. Something cheaper
    Gerhard> would also be "ORDER BY ROWID". I still propose to just do
    Gerhard> without the ORDER BY.

    As long as SQLite guarantees that the ordering is identical, then sure, dump
    the ORDER BY clause.

    It doesn't guarantee it, but the implementation behaves like this.

    -- Gerhard

    @pitrou
    Copy link
    Member

    pitrou commented Sep 11, 2008

    Le jeudi 11 septembre 2008 à 13:48 +0000, Skip Montanaro a écrit :

    Actually, I think Python guarantees (for dicts at least - other mappings
    should probably follow suit) that if you call keys() then call values()
    without making any changes to the dict that their orders match, e.g., that

    zip(d.keys(), d.values()) == d.items()
    

    Perhaps. I've never written any code that relies this, though, and it
    doesn't sound like an useful guarantee since you can just use the
    items() method anyway. It probably dates back to an era when list
    comprehensions didn't exist, and extracting keys or values from the
    items list required several lines of code and costly method calls.

    Also, the point is that Python dicts can make that guarantee without
    being any slower. It may not be the same for an RDBMS backend. Why?
    Because, depending on the backend, index and data can be stored in
    separate areas with different storage layouts (e.g. keys are in a B tree
    while values are just dumped sequentially). If you only ask for
    unordered keys, they will be read in optimal (sequential) index order,
    and if you only ask for unordered values, they will be read in optimal
    (sequential) data order, which is not the same. This is true for e.g.
    MySQL.

    (also, IMO this discussion proves that the module shouldn't be included
    in Python 3.0. It's too young, its API hasn't even settled down)

    @pitrou
    Copy link
    Member

    pitrou commented Sep 11, 2008

    I might add that calling keys() then values() is suboptimal, because it
    will issue two SQL queries while calling items() will issue just one.

    @josiahcarlson
    Copy link
    Mannequin

    josiahcarlson mannequin commented Sep 11, 2008

    I like Skip's version better, because it's closer to the dbm
    "specification" instead of trying to mimic bsddb (first, last, etc.).
    I'd like to keep such things out.

    dbm.sqlite is meant as a potential replacement of dbm.bsddb. Since
    people do use the extra methods (.first(), .last(), etc.), not having
    them could lead to breakage.

    Separating them out into a subclass (regular open doesn't have it, but
    btopen does), along with all of the other order guarantees (the ORDER BY
    clauses in the SQL statements), could keep it fast for people who don't
    care about ordering, and keep it consistent for those who do care about
    ordering.

    Attached you will find an updated version.

    @smontanaro
    Copy link
    Contributor Author

    > As long as SQLite guarantees that the ordering is identical, then
    >> sure, dump the ORDER BY clause.

    Gerhard> It doesn't guarantee it, but the implementation behaves like
    Gerhard> this.
    

    If the behavior isn't guaranteed, I think you need to retain ORDER BY.

    Skip

    @smontanaro
    Copy link
    Contributor Author

    Antoine> I might add that calling keys() then values() is suboptimal,
    Antoine> because it will issue two SQL queries while calling items()
    Antoine> will issue just one.

    Well, sure, but heaven only knows what an application programmer will do...

    S

    @rhettinger rhettinger self-assigned this Jan 30, 2009
    @rhettinger
    Copy link
    Contributor

    Here's an updated patch (it's also in the sandbox):

    • Added a sync() method to support shelves.
    • Removed commits on granular sets and gets.
    • Optimized __len__ and __contains__.

    @pitrou
    Copy link
    Member

    pitrou commented Feb 3, 2009

    I think issuing 'SELECT MAX(ROWID)' to compute the length of the table
    is not correct if some rows get deleted in the table.
    I've found a thread about it here:
    http://osdir.com/ml/db.sqlite.general/2004-03/msg00329.html
    In that thread someone suggested caching the length in another table and
    updating it through a trigger each time the main table is modified.

    @rhettinger
    Copy link
    Contributor

    That's a bummer. Changing this method to __bool__ and then setting
    __len__ back to "count(*)".

    @rhettinger
    Copy link
    Contributor

    FWIW, I put an alternative in the sandbox /dbm_sqlite/alt/dbdict.py and
    am attaching a copy here. The idea is to emulate gdbm's fast mode and
    delay all writes until closing. That lets us subclass from dict and get
    high-speed lookups, sets, and deletions. Freeing ourselves from an DB
    also gets us a choice of ultra-portable file formats (json, csv, pickle,
    eval).

    @rhettinger
    Copy link
    Contributor

    Unassigning. The code works but no one seems to be pushing for or
    caring about inclusion in Py3.1.

    If commits are delayed, then you might as well adopt the dbdict.py
    approach instead (reading the file in once at the beginning, operating
    directly on a dict subclass, and atomically writing it out at the end).

    @rhettinger rhettinger removed their assignment Feb 9, 2009
    @doko42
    Copy link
    Member

    doko42 commented Apr 7, 2009

    is there any chance for inclusion in 3.1?

    @rute
    Copy link
    Mannequin

    rute mannequin commented Nov 19, 2009

    By utilizing triggers on inserts and deletes it is possible to
    keep track of the size and speed up __len__ by 10 x.

    SQL:

    CREATE TABLE IF NOT EXISTS info
    (key TEXT UNIQUE NOT NULL,
    value INTEGER NOT NULL);

    INSERT OR IGNORE INTO info (key,value) VALUES ('size',0);

    CREATE TABLE IF NOT EXISTS shelf
    (key TEXT UNIQUE NOT NULL,
    value TEXT NOT NULL);

    CREATE TRIGGER IF NOT EXISTS insert_shelf
    AFTER INSERT ON shelf
    BEGIN
    UPDATE info SET value = value + 1 WHERE key = 'size';
    END;

    CREATE TRIGGER IF NOT EXISTS delete_shelf
    AFTER DELETE ON shelf
    BEGIN
    UPDATE info SET value = value - 1 WHERE key = 'size';
    END;

    On my laptop this increase the speed of 'len' about 10x

    I have a slightly modified version of dbsqlite.py for
    running on python 2.5 utilizing the triggers for
    keep track of the size:

    http://dpaste.com/hold/122439/

    @pitrou
    Copy link
    Member

    pitrou commented Nov 29, 2009

    It would be nice to try to advance this at PyCon, or at another time.

    @rute
    Copy link
    Mannequin

    rute mannequin commented Jan 9, 2010

    Multi threading:

    According to http://www.sqlite.org/cvstrac/wiki?p=MultiThreading
    we need to keep a connection for each thread to support multi threaded
    access to the database.

    I came across this when deploying an application in a multi threaded
    environment. Solution is to keep the connection in the thread local-data.

    Also note that a memory database is not shared between threads and
    a hairy workaround with a separate working thread with queues for access
    is needed. A memory database could perhaps be disallowed as the dbm is file only?

    import threading
    
    class SQLhash(collections.MutableMapping):
        def __init__(self, filename=':memory:', flags='r', mode=None):
            self.__filename = filename
            self.__local = threading.local()
            
            MAKE_SHELF = 'CREATE TABLE IF NOT EXISTS shelf (key TEXT PRIMARY KEY, value TEXT NOT NULL)'
            self.conn.execute(MAKE_SHELF)
            self.conn.commit()
        
        @property
        def conn(self):
            try:
                conn = self.__local.conn
            except AttributeError:
                conn = self.__local.conn = sqlite3.connect(self.__filename)
                self.conn.text_factory = bytes
                
            return conn

    @rute rute mannequin added type-bug An unexpected behavior, bug, or error and removed type-feature A feature request or enhancement labels Jan 9, 2010
    @pitrou pitrou added type-feature A feature request or enhancement and removed type-bug An unexpected behavior, bug, or error labels Mar 17, 2010
    @ghaering
    Copy link
    Mannequin

    ghaering mannequin commented Aug 19, 2015

    This wiki page is out of date. It appears that SQlite is now threadsafe by default: http://www.sqlite.org/threadsafe.html

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @erlend-aasland
    Copy link
    Contributor

    I would like to try and resurrect this issue using @rhettinger's last submitted code (on bpo). @ambv almost mentioned shelve and sqlite3 in the same sentence in one of his developer-in-residence blog posts last year, so it may be worth it to try and get this up and running. (If not, it will at least be a fun exercise.)

    @presidento
    Copy link

    @erlend-aasland do you have any knews about this?

    @erlend-aasland
    Copy link
    Contributor

    @erlend-aasland do you have any knews about this?

    Nope, and I've been busy with $work. Raymond has created a duplicate issue, though: see #100414

    @erlend-aasland
    Copy link
    Contributor

    Raymond and I prefer to close this (stale) issue in favour of #100414; if we're going to do this, I think we can benefit from a fresh start.

    Closing this issue as stale and superseded by #100414; please re-open if you disagree.

    @erlend-aasland erlend-aasland closed this as not planned Won't fix, can't repro, duplicate, stale Jan 2, 2023
    @erlend-aasland erlend-aasland moved this from Done to Discarded in sqlite3 issues Jan 2, 2023
    @erlend-aasland
    Copy link
    Contributor

    Just a heads-up; the gh-100414 now has a PR based on a recent patch Raymond posted. Feel free to review it:

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir topic-sqlite3 type-feature A feature request or enhancement
    Projects
    Status: Discarded
    Development

    No branches or pull requests

    7 participants