Feature/fold delta's: only insert and delete effective inserts and deletions. #7

Open
wants to merge 19 commits into
base: feature/consistency-with-credo


ajuvercr commented Aug 19, 2021

Analyse incoming quad changes for effective inserts and deletions.

An effective insert is a quad that would be inserted (is not yet present in the triplestore).
An effective deletion is a quad that would be deleted (is present in the triplestore).
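The two definitions above amount to a set difference and a set intersection against the current store contents. A minimal sketch, assuming quads are modelled as plain tuples and the triplestore as an in-memory set (both are illustrative stand-ins, not how mu-auth represents them; `Quad` and `effective_changes` are hypothetical names):

```python
from typing import NamedTuple, Set, Tuple


class Quad(NamedTuple):
    """Hypothetical quad representation: a triple plus the graph it lives in."""
    graph: str
    subject: str
    predicate: str
    obj: str


def effective_changes(
    store: Set[Quad], inserts: Set[Quad], deletes: Set[Quad]
) -> Tuple[Set[Quad], Set[Quad]]:
    """Split requested changes into the ones that would alter the store."""
    # An insert is effective only if the quad is not yet present.
    effective_inserts = set(inserts) - store
    # A deletion is effective only if the quad is currently present.
    effective_deletes = set(deletes) & store
    return effective_inserts, effective_deletes
```

Only the two returned sets need to be sent to the triplestore; everything else in the request is a no-op.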

Multiple quad change requests are folded together and flushed either by the first SELECT query that follows or by a flush timeout.
With this change, delta-notifier can also forward these effective changes in the "v0.0.2" format.
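Folding several change requests can be pictured as keeping, per quad, only the last operation that touched it. The PR description does not spell out the exact fold rules, so the sketch below is an assumption: it treats each request as a `(deletes, inserts)` pair applied in that order, and `fold` is a hypothetical name.

```python
def fold(requests):
    """Fold (deletes, inserts) pairs, in arrival order, into net pending sets.

    Assumption: within one request deletes are applied before inserts,
    and for any given quad the last operation wins.
    """
    pending_inserts, pending_deletes = set(), set()
    for deletes, inserts in requests:
        for quad in deletes:
            # A later delete cancels an earlier pending insert.
            pending_inserts.discard(quad)
            pending_deletes.add(quad)
        for quad in inserts:
            # A later insert cancels an earlier pending delete.
            pending_deletes.discard(quad)
            pending_inserts.add(quad)
    return pending_inserts, pending_deletes
```

With a "delete everything, then re-create everything" update, the unchanged properties end up only in the pending inserts, where the effectiveness check can later drop them because they are already present in the store.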

To determine which changes are effective, a single CONSTRUCT query is built as follows:

CONSTRUCT {
  ?s ?p ?o
} WHERE {
  VALUES (?s ?p ?o) {
    (<some/subject> <some/predicate> <some/object>)
    (<some/subject2> <some/predicate2> <some/object2>)
  }
  ?s ?p ?o.
}
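A query of this shape can be generated from the pending triples. The helper below is only a sketch: `construct_query` is a hypothetical name, and it assumes every term is an IRI (literals and blank nodes would need their own serialisation and escaping).

```python
def construct_query(triples):
    """Build a CONSTRUCT query returning the given triples that exist in the store.

    `triples` is an iterable of (subject, predicate, object) IRI strings.
    Assumption: all three positions are IRIs, so each is wrapped in <...>.
    """
    rows = "\n".join(f"    (<{s}> <{p}> <{o}>)" for s, p, o in triples)
    return (
        "CONSTRUCT {\n"
        "  ?s ?p ?o\n"
        "} WHERE {\n"
        "  VALUES (?s ?p ?o) {\n"
        f"{rows}\n"
        "  }\n"
        "  ?s ?p ?o.\n"
        "}"
    )


print(construct_query([
    ("some/subject", "some/predicate", "some/object"),
    ("some/subject2", "some/predicate2", "some/object2"),
]))
```

Every triple in the result is present in the store, so a requested insert of it is not effective and a requested deletion of it is.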

The Virtuoso instance currently in use returned an error when GRAPH information was added to this query, even though SPARQL 1.1 supports it.

This creates a problem: a triple can be present in one graph but not in another. To handle this edge case, the presence of such quads is determined with ASK queries.
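The per-quad check could look roughly like this. The sketch assumes IRI-only terms and takes the query executor as a parameter; `ask_query`, `present_quads` and `run_ask` are hypothetical names, not part of mu-auth.

```python
def ask_query(graph, s, p, o):
    """Build an ASK query testing whether one quad exists in a specific graph."""
    return f"ASK {{ GRAPH <{graph}> {{ <{s}> <{p}> <{o}> }} }}"


def present_quads(quads, run_ask):
    """Return the subset of `quads` that the store reports as present.

    `run_ask` is any callable that sends a query string to the triplestore
    and returns True or False; it is injected so the sketch stays store-agnostic.
    """
    return {quad for quad in quads if run_ask(ask_query(*quad))}
```

This costs one query per ambiguous quad, which is why it makes sense only for the triples whose graph membership the CONSTRUCT result cannot settle.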


ajuvercr commented Aug 19, 2021

TODO:

  • update README
  • add timer to flush
  • add ENV variable to make timeout duration variable

ajuvercr changed the title from "WIP Feature/fold delta's: only insert and delete effective inserts and deletions." to "Feature/fold delta's: only insert and delete effective inserts and deletions." on Aug 19, 2021

ajuvercr commented Aug 23, 2021

TODO:

  • vigorous testing
  • benchmark performance differences between old and new implementation


ajuvercr commented Aug 25, 2021

Benchmark results

Benchmarking this change is challenging because the use cases of mu-auth are very diverse.
The expected ratio between read and write queries is not set in stone.

I tried a benchmark in which a simple object with around 12 fields is created and then 2 of its properties are changed in the same fashion as mu-cl-resources does it: all properties are deleted, not only the changed ones, and all properties are created again.
Next, the same object is read 4 times and then deleted.

The cache flushes deltas to the triplestore in only two notable situations: during a read query (all deltas have to be flushed to guarantee a correct read) or before a read query arrives, because of a timeout (set with the QUAD_CHANGE_CACHE_TIMEOUT env variable).
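The two flush triggers described above can be sketched as a small cache with an explicit clock. This is an illustration under assumptions, not the mu-auth implementation: `QuadChangeCache`, `tick` and `before_read` are hypothetical names, and the timeout is passed in by the caller (in the real service it would come from QUAD_CHANGE_CACHE_TIMEOUT, whose unit is not stated here).

```python
import time


class QuadChangeCache:
    """Buffers quad changes and flushes them on a read or after a timeout."""

    def __init__(self, flush_fn, timeout, clock=time.monotonic):
        self.flush_fn = flush_fn      # receives the list of buffered changes
        self.timeout = timeout        # same unit as the values clock() returns
        self.clock = clock            # injectable so behaviour can be tested
        self.pending = []
        self.first_pending_at = None

    def add_change(self, change):
        # Start the timeout window when the first change is buffered.
        if not self.pending:
            self.first_pending_at = self.clock()
        self.pending.append(change)

    def tick(self):
        # Best case: called periodically, flushes while no read is waiting.
        if self.pending and self.clock() - self.first_pending_at >= self.timeout:
            self.flush()

    def before_read(self):
        # Worst case: a read arrives first and has to pay for the flush.
        if self.pending:
            self.flush()

    def flush(self):
        changes, self.pending = self.pending, []
        self.first_pending_at = None
        self.flush_fn(changes)
```

A lower timeout makes the best case more likely, at the price of folding fewer changes together per flush.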

Results

Single entity manipulation

First, the flush to the triplestore is executed during a read query (worst case).

New implementation:
            mean      median    std       min       max
total time  17.575    15.000    9.235     10.000    45.000
insert      14.800    16.000    3.187     11.000    18.000
update      16.267    16.000    2.516     12.000    20.000
delete      12.600    12.000    1.625     11.000    15.000
select      21.467    13.000    13.754    10.000    45.000

Old implementation:
            mean      median    std       min       max
total time  27.425    26.000    13.472    10.000    54.000
insert      27.400    26.000    3.826     25.000    35.000
update      42.467    41.000    5.162     35.000    54.000
delete      26.600    27.000    1.625     24.000    29.000
select      12.667    13.000    2.700     10.000    21.000

Next, the flush is executed because of a timeout, so not during any read query (best case).

New implementation:
            mean      median    std       min       max
total time  15.950    16.000    4.283     9.000     27.000
insert      13.200    16.000    2.993     9.000     17.000
update      18.867    19.000    3.364     11.000    25.000
delete      13.000    15.000    2.608     9.000     16.000
select      14.933    15.000    4.171     10.000    27.000

Old implementation:
            mean      median    std       min       max
total time  22.725    26.500    8.729     9.000     35.000
insert      29.000    29.000    1.897     26.000    31.000
update      30.333    30.000    3.300     25.000    35.000
delete      25.400    26.000    2.154     22.000    28.000
select      12.133    12.000    1.996     9.000     16.000

Multiple entity manipulations

The previous section only covered the manipulation of a single entity, which is probably not representative.
Here the same benchmark is started multiple times with a slight offset, each run manipulating a different entity.

The big flush happens with the first read query, but the many read queries that follow bring down the average.

New implementation:
            mean      median    std       min       max
total time  23.031    18.000    17.912    8.000     114.000
insert      11.350    12.500    2.056     8.000     16.000
update      17.817    18.000    3.806     11.000    26.000
delete      17.350    18.500    3.745     11.000    24.000
select      34.033    28.500    25.110    8.000     114.000

Old implementation:
            mean      median    std       min       max
total time  22.288    27.000    9.401     8.000     45.000
insert      24.500    24.500    3.486     18.000    31.000
update      31.183    31.000    4.209     24.000    45.000
delete      26.050    25.000    3.612     19.000    34.000
select      11.400    12.000    2.354     8.000     19.000

Conclusion

In all cases the median total time is lower with the new implementation. The price is that one read query, the one that triggers the flush, takes considerably longer than normal. If the actual use case allows it, the timeout can be set fairly low, so that the deltas are executed while no request is open.

Trivia

  • You can change the cache behavior with a request header (a README update is coming). The best cache behavior is select, which will soon be the default. The other options are constructs, construct_and_asks and only_asks. These define how the mechanism determines whether a quad is present in the triplestore.
  • The measurements cover the entire SPARQL query, including the HTTP round trip to mu-auth.
  • Each measurement uses 5 samples.
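For reference, the five statistics reported per column can be computed from a list of timings as below. This is a sketch only: `summarize` is a hypothetical name, the sample values are made up for illustration, and the use of the population standard deviation is an assumption, since the benchmark harness is not shown.

```python
import statistics


def summarize(samples):
    """Return mean, median, std, min and max for a list of timings."""
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        # Assumption: population standard deviation (pstdev), not sample stdev.
        "std": statistics.pstdev(samples),
        "min": min(samples),
        "max": max(samples),
    }


# Illustrative timings, not taken from the benchmark above.
print(summarize([10, 12, 13, 15, 45]))
```

With a single slow sample, as in this made-up list, the mean moves far more than the median, which is why the conclusion compares medians.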
