Storing large links in postgres. #650
Googling this error message explains where it is from: you are attempting to store a link with an outgoing set that has more than 338 atoms in it (in fact, it seems to have 554 atoms in it). I'm assuming this is from a wikipedia parse, and so it's some "sentence" that is 554 words long. This suggests the sentence splitter is misbehaving.
We should probably alter the postgres backend to check the size of the outgoing set, and also check the node name length, and throw an error if they are too long.
Pull request #651 adds the check. Now, instead of getting a cryptic SQL error, you will get an error message that has more of an explanation about the problem. It doesn't fix the problem, but gives you something to think about.
Thanks, yes indeed that's from a wikipedia parse. I'll have a look at the sentence splitter as well.
Well, the pull request does a throw. If you put a print in there, you can print the atom that caused the failure.
@leungmanhin Can you give me an easy way to reproduce this problem? Is the problem only that an outgoing set is too large? i.e. a Link has too many atoms in it?
@inflector Yes, the problem is that an outgoing set is too large; it can be reproduced by doing:
This may be useful for creating a database:
Oof dah. So:
reports 8210 atoms in the outgoing set. It begins like this:
and it's the LgOr that has the 8210 disjuncts in it. I'm not sure what to do.
OK, here's the deal. When the SQL backend was written 8 years ago, it originally stored each element in the outgoing set as its own row in a table, the "edges" table. It became immediately apparent that this was excessively verbose and inefficient: for example, instead of doing 1 query to get a link, you had to do N queries (or alternately, 1 query which returned N rows, and you then had to process each row). To avoid this bloat and overhead, the outgoing set was stored as a list, as part of the link atom. Worked great, until today :-)

The question is: what to do about it? Well, the old "edges" code is still there, it's if-def'd out. It's surely bit-rotted and probably won't compile any more.

(a) We could have a check so that the edges code gets used only if a link has an outgoing set larger than N=300.

I know enough SQL to know that (c) is possible, but I don't know enough to make (c) work well. Options (b) and probably (c) both have a huge drawback: one of the important queries is "given this outgoing set, find me the atom that contains it". This is a must-have query, it's fundamental. Option (a) allows this query, but it does not allow safety.

So, today, there is a UNIQUE INDEX on the table that prevents inserting two atoms with the same outgoing set. Before this index, such inserts were theoretically impossible, but somehow managed to happen anyway (maybe a threading issue? a race condition??). In order to get safety with option (a), we would have to do atomic transactions whenever we used the overflow path. Also, the UNIQUE INDEX makes link lookup fast -- it's not clear to me how to index the edge set so that we can quickly look up links.
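To make the two layouts concrete, here is a minimal sketch; the table and column names are illustrative, not the actual schema used by the backend:

```sql
-- Current style: the whole outgoing set is stored inline as an array
-- column on the link's row, with a UNIQUE index that both forbids
-- duplicate links and makes "find the link with this outgoing set"
-- a single indexed lookup.
CREATE TABLE Atoms (
    uuid     BIGINT PRIMARY KEY,
    type     SMALLINT NOT NULL,
    outgoing BIGINT[]          -- NULL for nodes
);
CREATE UNIQUE INDEX atoms_type_out ON Atoms (type, outgoing);

-- Old "edges" style: one row per outgoing-set member. Reconstructing
-- a link takes N rows instead of one, and no single UNIQUE constraint
-- can forbid two links with the same outgoing set.
CREATE TABLE Edges (
    src_uuid BIGINT NOT NULL,  -- the link
    dst_uuid BIGINT NOT NULL,  -- one member of its outgoing set
    pos      INT    NOT NULL,  -- position within the outgoing set
    PRIMARY KEY (src_uuid, pos)
);
```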
This might be a case, maybe, where having a backend to a proper graph database would be better. By "proper", I mean one which doesn't use REST and java hijinx, paying a huge performance penalty. i.e. we'd need to have a graph database with an API that is on par with postgres, rather than a few orders of magnitude slower. Of course, exploring this option is also very labor-intensive.
From my read of the code and your comments, it looks like the indices are there mainly to give the UNIQUE constraints good performance. For example, the recent comment you wrote for this bug hints at the UNIQUE constraint being the real issue:
So, this proposes using a hash of the outgoing set as a proxy value over which one could maintain the UNIQUE constraint. This would have the side effect of being much faster if one used a relatively fast hash here and added a column for that hash with the UNIQUE constraint. With this approach, one could use arrays of arbitrary size and still get the unique constraint.

How much do you need a guarantee on the hash uniqueness? Do you want something cryptographic, or nearly so, like MD5 or SHA-1? We could test for the typical cases, but I'd bet they're pretty fast compared with the overhead of a postgres index.

@linas You mentioned the index speeding up link lookup, by which I assume you mean AtomStorage::getIncomingSet, which does a: SELECT * FROM Atoms WHERE outgoing @> h->_uuid and would definitely benefit from the index. AtomStorage::getLink would as well, but the same index on the hashed outgoing set used for the UNIQUE constraint would also work in this case. So the question is: how often is the call to fetch-incoming-set used? I see a couple of references in the scheme code, but it's not clear to me how central to the algorithm these calls are. If they are indeed performance-critical, we'd be better off having the atoms loaded into the AtomSpace, I'd think.

What is the typical and expected use case for the persistence? It appears to be to support atoms which are useful and want to be retained, as part of the analogue to long-term memory for the AtomSpace. If this is the case, then it would be easier and more performant to mark the atoms with a flag and then persist them in batches asynchronously as part of a single transaction, to reduce transaction overhead, rather than one at a time.

If we're going to look at a redesign, we might as well do the one we'll want in two years. How big a factor is speed and performance for our use cases -- writing the atoms of wikipedia parses, for example? Or loading them up for another testing run? How often does this happen?

There are a few performance issues I've noticed. For example, the persist code is using ODBC drivers. That might have made sense before the AtomSpace was tied to Postgres, but once you're supporting one database, you might as well go fast and close to the metal. ODBC queries are much slower for the types of queries that I see here in the AtomSpace persist code. You can use the C drivers for postgres and commands like PQprepare and PQexecPrepared, which completely eliminate the query-parsing overhead, since one can exec a prepared query with different parameters as many times as one wants.

@leungmanhin How much does performance matter in your typical use case? Are you spending a lot of time waiting for writes or loads from the postgres store?
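A minimal sketch of what that hash column might look like, with illustrative names (the real schema differs, and collision handling is discussed further down the thread):

```sql
-- Illustrative only: add a 64-bit hash of the outgoing set and move
-- the uniqueness guarantee (and the lookup index) onto it, so the
-- variable-length array never has to be indexed directly.
ALTER TABLE Atoms ADD COLUMN out_hash BIGINT;
CREATE UNIQUE INDEX atoms_type_hash ON Atoms (type, out_hash);

-- getLink-style lookup: compute the hash of the outgoing set in C++,
-- fetch the candidate row, then compare the outgoing arrays in code
-- to rule out a (rare) hash collision.
SELECT * FROM Atoms WHERE type = $1 AND out_hash = $2;
```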
Woof. Lots of questions.
For use-case (a), it's enough to store the hash only (if we can avoid hash
Planned use:
Anyway, the stores are already asynchronous. The backend has 4 threads
Yes, we can ditch ODBC. I kind-of doubt that it's a bottleneck here, but I
Performance matters. The three mentioned above are the only "typical" ones --linas
@inflector Currently my usage is more about retaining atoms, and I just load them back into the atomspace when I need them. But as Linas mentioned above, if the atomspace is big, it's a lot cheaper/faster to keep them in postgres, so I think performance will matter in the near future.
Yeah, I've been through quite a few of those wave-of-the-future cycles. Some of them take, but far more of them are dead ends.

The reason I'm asking so many questions is that Ben wanted me to take a look at this bug since he figured you'd be busy doing other things, this is preventing @leungmanhin from finishing his work, and I think he knew I had a database background. Database internals was my sole focus in the late 80s and early 90s. It's been 15 years, but I've even had deep architectural discussions with Tom Lane and a few of the other long-time PostgreSQL team members, back when I was using PostgreSQL to store price data for realtime data feeds and wanted higher transaction performance. I got Tom to investigate one issue, and while he didn't take my suggestion, which my testing indicated would have resulted in a 20x improvement of write-transaction throughput for small writes with AIO-capable filesystems, he did manage to improve those write transactions by 60% in a few hours' work looking at the change I suggested. So, I know the core internals of PostgreSQL fairly well, or at least, I once did. And I doubt they've changed too much, given Tom's conservative personality.

Since preserving the performance of getIncomingSet is important, the row-per-atom outgoing-set representation is the best way to go. It may sound like it will be a big performance hit, but that's only if you go through ODBC and don't use the postgres calls for doing simultaneous inserts (as described here: http://stackoverflow.com/questions/19516645/how-to-insert-a-single-row-in-the-parent-table-and-then-multiple-rows-in-the-chi). Databases, and PostgreSQL especially, are optimized for lots of small rows. That was originally the bottleneck for the most common use case, the ubiquitous master-detail (aka invoice) representation. So inserts and selects for lots of small records are optimized pretty well.

I'm betting that you'd get equivalent or better performance than our existing array mechanism by dropping ODBC and moving to a single-statement insert (as described above). Even with ODBC, you can still implement multi-row inserts; you just can't do both the atom insert and the outgoing-set insert in one call. At least I couldn't find a way to do it.

But there's only one way to find out, and that's to do it. That's the one truth that always holds with complex database access, especially over a network. Test and see. Since you've got the code already working, the first step is to just use that code and see how it performs. Then move to multi-row inserts with ODBC. Then, if that isn't fast enough, move to the single-insert made possible by going to PostgreSQL directly. One PostgreSQL user recently found libpq to be 14X faster than ODBC with inserts. See: http://www.postgresql.org/message-id/9CE034E149417949A58AA9A4FA7E1C5584AF2B48@ORSMSX109.amr.corp.intel.com

Even with a row per atom in PostgreSQL, there is a limit to the size of the outgoing set that can be saved all in one go, based on the two-byte integer number of parameters in a prepared statement (i.e. a 32K limit), which translates to 16K in the outgoing set. This limit can be worked around by chunking the writes into 16K pieces in a loop. For the most common case, it will be one statement; for 20K in the outgoing set, it would be two inserts.
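For reference, a minimal sketch of the single-statement parent/child insert that the stackoverflow link describes, using the illustrative table names from earlier and assuming PostgreSQL 9.4+ for WITH ORDINALITY:

```sql
-- Illustrative only: insert the link row and all of its outgoing-set
-- rows in a single round trip, using a data-modifying CTE.
WITH new_link AS (
    INSERT INTO Atoms (uuid, type) VALUES ($1, $2)
    RETURNING uuid
)
INSERT INTO Edges (src_uuid, dst_uuid, pos)
SELECT new_link.uuid, d.dst, d.pos
FROM new_link,
     unnest($3::bigint[]) WITH ORDINALITY AS d(dst, pos);
```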
You are welcome to do a complete re-write. In order to maintain compatibility for at least a little while, I suggest making a copy of the existing code and creating a driver-version-two in a parallel directory. I have existing language datasets, as does Rohit, so the old code has to continue working as-is.

There is also a different bug that needs to be solved: the LG disjunct set is being represented incorrectly. Instead of having an LgOr of 8210 disjuncts, there should instead just be 8210 disjuncts linked to the given word. That way, new disjuncts can be added, and old ones removed, without re-writing the whole list. I'm creating a new bug report for that issue.
I opened bug opencog/opencog#2050 to describe the LG fix. It might be easier to fix that than to redesign the SQL backend. In some ways, it is also a much more serious issue: the LG disjuncts are represented in a fatally incorrect way.

Side-comment: if someone is creating data that overflows the current SQL design, they are almost surely mis-using the atomspace, i.e. are representing their data incorrectly. Which is the case here ... Basically, it's kind of crazy or incorrect to have an outgoing set that is larger than 5 or 10. An outgoing set of size 20 is insanely, crazily large, and hundreds ... thousands ... is just wrong. Motto of the day: the atomspace is a "DNF engine" -- let's use it as it's meant to be used!
Another side-comment: the description in opencog/opencog#2050 illustrates a "typical" query for
Also a side note, the
OK. I am also thinking that "good" data designs should not have SetLinks/ListLinks with more than 10 or 20 items in them. There are several reasons for this:
So, I'm thinking: instead of using and storing SetLinks/ListLinks, use MemberLinks instead. This solves both problems:
As a side effect, all outgoing sets will be tiny, thus avoiding the SQL bug!
Okay, I've finally got to the point where I've got a performance comparison using the new
The test data was 70% nodes, with the remaining 30% (300,000) links distributed as follows:
The tests were conducted using the exact same schema, but in one case the outgoing links were hashed using MurmurHash2 into a 64-bit hash and stored with an associated 32-bit differentiator to be used in the event of collisions. MurmurHash2 is very fast and exhibits very good randomness for the types of data we're using here. NOTE: I have not implemented the collision handling, but this test generated zero collisions. The test does have all the
Here are the relevant table
Here are the performance measurements for both the current implementation, where they are stored (as currently) in the
These numbers are both using the current ODBC drivers, so they only compare the difference between the two approaches and not between ODBC and PostgreSQL native. So each link write is two SQL statements, an insert into the
So it looks like this new approach is significantly faster than the current approach for writes. However, I have not yet tested

The one open issue is how to handle collisions. The easiest and most performant way to handle them is to store the hash-bucket differentiator (defaulting to 0) in the links themselves. When storing a new link atom, a collision could arise for two reasons: by far the most common reason will be that the link (Type, Outgoing Set) already exists in the database; the very uncommon case will be a collision of the hash algorithm for different underlying (Type, Outgoing Set) pairs. When an identical hash value is detected, the outgoing sets will have to be compared in code. In the event of a non-duplicate outgoing-set collision, the maximum bucket differentiator for the given hash value will be incremented by one and that number stored in the
When atoms are loaded, this bucket differentiator can be read from the

Assuming everyone concurs and no one sees a problem with adding a new data member to links, I'll proceed with implementing this collision-handling algorithm.
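A rough sketch of how that constraint and collision check might look in SQL, again with illustrative names, extending the hash-column sketch earlier in the thread:

```sql
-- Illustrative only: the uniqueness guarantee covers the hash plus a
-- small differentiator, so two different outgoing sets that happen to
-- hash to the same value can still coexist.
ALTER TABLE Atoms ADD COLUMN differentiator INT NOT NULL DEFAULT 0;
CREATE UNIQUE INDEX atoms_type_hash_diff
    ON Atoms (type, out_hash, differentiator);

-- On a hash match, compare the outgoing sets in code; if they differ
-- (a true hash collision), assign the next free differentiator:
SELECT COALESCE(MAX(differentiator), -1) + 1
FROM Atoms WHERE type = $1 AND out_hash = $2;
```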
There have been huge changes to the SQL backend; see #1096 #1095 #1093 #1091 #1081 #1079 #1076 #1075 #1074 #1073 #1070. The upshot is that there is now native postgres support, and a variety of other bugs have been fixed. However, the "edges" table has been lost, and needs to be re-created to solve this particular issue.
The correct long-term fix for this is probably #1502
@linas, do you mean that the query you mentioned is required to avoid writing duplicate links into the database? In other words, to incrementally add atoms to the database?
It would probably be best to close this issue, and open a new one with the same title. There's too much stale discussion in here.
I don't understand the question.
Ok, thanks @linas, I see. I hadn't used this persistence functionality before, which is why I asked such an obvious question.
Ah, you're welcome. It's great for when your dataset doesn't fit in RAM, and also for when you want to have multiple machines work on the same dataset (since network distribution works "automatically").
Closing, obsoleted by issue #3018 -- in short, the RocksStorageNode supports SetLinks of arbitrary size. The PostgresStorageNode is old and flaky and slow; it would get smaller, simpler & faster simply by cut-n-pasting the RocksStorageNode code and then tweaking it to use Postgres instead of Rocks as the backend.
I noticed there were errors sometimes when doing sql-store, and they look similar to this one:
I may be able to provide more info on how to reproduce it if I see this error again...