High availability? #5147

fkleedorfer · 2024-10-07T09:01:00Z

fkleedorfer
Oct 7, 2024

There is a HA solution for Jena Is there anything comparable for RDF4J?

(Alternatively, is there appetite for creating it?)

kenwenzel · 2024-10-07T09:11:01Z

kenwenzel
Oct 7, 2024
Collaborator

Not yet, but I would propose to create something based on Apache Ratis.

5 replies

fkleedorfer Oct 9, 2024
Author

Very interesting. Want to start working on a prototype?

kenwenzel Oct 9, 2024
Collaborator

Not directly, but it is somehow on my to-do list.
We could start with the FileStore example and use the NativeStore or LmdbStore for persistent storage.
I'm not yet sure if we should support distributed transactions or if it is enough to rely on SnapshotSailStore and only replicate committed changes via RAFT. The latter would only work if we have one master and multiple replicas.
A real distributed store would also be interesting.

fkleedorfer Oct 10, 2024
Author

By distributed transactions you mean that multiple server instances participate in one transaction? Not sure about the cost benefit ratio there.

How about the next easier thing: have a high throughput, data-less rdf4j server as a load balancer that routes to the HA instances. Open transactions are always routed to the same instance. We replicate writes to all HA instances, which will be consistent eventually. The hard part would be to avoid read-read,-write-write inconsistencies, which maybe we can do by causing the raft algo to refuse the second commit.

Then there is the question how we would replicate a transaction. Is there a way to obtain added and removed quads in a commit? If so, (even for "update where" queries) that's the thing we should use.

fkleedorfer Oct 10, 2024
Author

Also, we should allow SHACL validation in the original commit but suppress it on replicated commits

hmottestad Oct 16, 2024
Maintainer

Leader election can be very tricky. Might be easier to build a solution with a single master node and replicated read nodes.

Clients can decide if they want to read from a quorum or just a single node. If reading from a single node they could be reading stale data. Reads from quorum would not actually read the data from multiple nodes, just have a vote to see which snapshot is the latest and then read that from a single node. Writes would always have to be quorum, as in that a write is only complete once it's been replicated to more than half the read replicas.

Raft would be good to keep track of which nodes are part of the cluster. Once a node is kicked out of the cluster it can not rejoin before the cluster is restarted.

This would support:

single writer node
multiple read nodes
redundancy for reads, as long as more than half the nodes are up
no redundancy for writes, once the write node is down the entire cluster would need to be restarted

hmottestad · 2024-10-16T16:05:35Z

hmottestad
Oct 16, 2024
Maintainer

I would like to start another thread here to talk about implementing support for Jena's RDF Delta Patch Log Server instead of actually developing a HA system from scratch.

What makes the Jena setup HA is that the patch log server runs in HA with multiple servers.

We could create a Sail that interfaces with the patch log server to track which patch version we are on, fetch any newer patches and apply them to the underlying store, and also sync back changes to the patch log server before a transaction is allowed to finish committing.

The NotifyingSail would allow us to track changes in a transaction, and we can override the begin, prepare and commit methods to add the interaction with the patch log server.

An important simplification that the patch log server allows for is that a transaction starts at a patch version, and needs to be commit at the same version. This means that a high replicated write load would end up with a lot of cancelled transactions.

6 replies

hmottestad Oct 17, 2024
Maintainer

That sounds interesting. We don't have an implementation of the RDF Patch format, so we would anyways have to implement that.

I've never used Ratis, so I don't know how much code we would have to write ourselves. For Jena's RDF Delta Patch Log Server we would need to implement the interaction layer with the server, which uses http. Persistence, concurrency control and HA is already part of the server.

Performance is a major factor. Jena's RDF Delta Patch Log Server uses http for transport and ZooKeeper for HA. I can see that Ratis is very performance focused, but I couldn't find anything about how it compares to ZooKeeper.

kenwenzel Oct 17, 2024
Collaborator

I also do not yet completely undestand the state machine implementation. Especially, the logic for storing and loading snapshots is not clear to me.

The counter example uses snapshots:
https://github.com/apache/ratis/blob/master/ratis-examples/src/main/java/org/apache/ratis/examples/counter/server/CounterStateMachine.java

The file store example does not:
https://github.com/apache/ratis/blob/master/ratis-examples/src/main/java/org/apache/ratis/examples/counter/server/CounterStateMachine.java

But in the end it is maybe enough to just send the RDF patches as binary data and apply it to the local RDF4J repository.

benherber Oct 17, 2024

From what I understand having toyed a bit with Ratis is that it follows the ideology of a lot of other RAFT libraries in other languages; handling leader elections, heartbeats, all the protocol things (of course you can plug your own implementations in for specific parts). They seem to have a billion and one configurations to tune exactly how this behaves (e.g., do we want full 'linearizable' behavior or can we loosen that requirement a bit).

The real parts to implement afaik are 1. Creating a format for serialized log entires (i.e., what is the state machine); 2. how to 'apply' log entries to deterministically recreate a snapshot of the data corresponding to a given log entry (i.e., run the state machine); and 3. how to create and store a snapshot corresponding to a given log entry such that the replicated log can be compacted.

For a bit more clarity on the protocol part, I found this visualization rather useful: http://thesecretlivesofdata.com/raft/

benherber Oct 17, 2024

I guess my question would be, to better define 'High-Availability', what guarantees are thought to be appropriate for such a deployment of RDF4J? Is it okay to be purely eventual or causally consistent or is something more strict required. Those are two very different problems to solve as far as I have seen.

benherber Oct 17, 2024

Performance is a major factor. Jena's RDF Delta Patch Log Server uses http for transport and ZooKeeper for HA. I can see that Ratis is very performance focused, but I couldn't find anything about how it compares to ZooKeeper.

Ratis and Zookeeper are fairly different in what they are trying to do. RAFT is a trying to come to a consensus on what records are in a WAL for the purposes of applying operations in a deterministic order. It may be more comparable to compare Zookeeper's consensus algorithm ZAB to RAFT as they aim to achieve the same end goal. Zookeeper uses this to implement its features like coordination, synchronization primitives, etc. There is this really interesting article from Confluent on them changing their backend to use their own RAFT variant (K-RAFT) instead of Zookeeper.

Another big thing Ratis likes to tout is that it is not an external service like Zookeeper or Kafka, but rather an embedded library in the server itself which of course comes with a series of benefits and drawbacks.

afs · 2024-10-17T13:39:36Z

afs
Oct 17, 2024

(Sorry if I'm butting in here)

Just to be clear; RDF Delta isn't part of Apache Jena. There's no technical barrier but RDF Delta is significantly more user-support-intensive. Not surprising - it is affected by the deployment environment.

Delta could do with a refresh, even a V2. Apache Zookeeper takes quite a lot of effect to deploy and operate. It can not store the patches (size limits).

Apache Ratis for the system lock and metadata would be a good choice.

I'm not clear whether storing bulk data in Ratis for a long-lived deployment is a good idea or not. This might depend if the design is to keep patches "for a long time" (c.f. incremental backup, rebuild new triplestores from a long-term data snapshot) or just until the triplestores have all taken the updates. RDF Delta does not track the state of the front ends.

If patches are outside Ratis, for storing the patches, there are options:

There is a lot to be said in favour of small deployments using a "safe" filesystem for patch storage. The advantage for users is that the deployment is simple to operate. SPARQL is continuously available for query; yes, there is there is a pause on updates if the patch server needs to be bumped. It starts up very fast (1-2s). Only loosing the patches is catastrophic; a blank patch server can discover log state (once) and that takes a little longer to startup.

For a more complex deployment, Apache Pulsar looks interesting. It is designed to be a distributed log. It supports migrating stored data to different storage hierarchies and also ageing off patches. Apache Kafka nowadays can be used in a log-like manner if old patches are migrated away from expensive broker storage.

This still needs Ratis to coordinate writing the log consistently.

Current RDF Delta can use blob stores - that is another operational cost. Pulsar should be a more integrated solution as well as being cloud-neutral.

1 reply

kenwenzel Oct 17, 2024
Collaborator

Hi @afs,

thank you for joining the discussion. I also thought about using something like Kafka - and Pulsar looks very interesting.

As far as i understand, Ratis logs can be pruned if they are committed to the state machine. If each state machine has an internal RDF store (TDB or the RDF4J stores) then managing the logs should be not a problem.
In this setup we have the possibility to distribute read requests among all RAFT peers. The downside is that we loose the option to store the log on cheap storage systems like S3.

afs · 2024-10-20T19:26:24Z

afs
Oct 20, 2024

Having a machine Ratis in the Ratis membership group could be writing out to the long term log. The members don't have to have the same functionality. Delta minimised implementation on the triplestore process because Zookeeper isn't a library.

@kenwenzel What are your thoughts on recover after a failure?

2 replies

kenwenzel Oct 21, 2024
Collaborator

I'm not sure if I understand the architecture right: We have a RAFT group with multiple peers (>= 3) where each is managing an internal RDF database (Here the deltas that led to the content of the RDF database are lost over time if the RAFT log is pruned.). Some peers also write the log entries to a long term storage (Jena's delta log) to keep the full historical state.
Is this correct?

If a failure occurs on a node then I would reconstruct its state by taking the contents of the RDF database from the leader in combination with the (automatically) synchronized RAFT log entries. For transferring the database contents an additional mechanism has to be developed.
Could you imagine that this is working?

afs Oct 27, 2024

Could you imagine that this is working?

Yes - with sufficient care during the transfer is linked to the exact RAFT state so it will catchup when it goes live because updates may still be happening (e.g. the cluster was 4, and 1 machine lost).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

High availability? #5147

{{title}}

Replies: 4 comments 14 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

High availability? #5147

fkleedorfer Oct 7, 2024

Replies: 4 comments · 14 replies

kenwenzel Oct 7, 2024 Collaborator

fkleedorfer Oct 9, 2024 Author

kenwenzel Oct 9, 2024 Collaborator

fkleedorfer Oct 10, 2024 Author

fkleedorfer Oct 10, 2024 Author

hmottestad Oct 16, 2024 Maintainer

hmottestad Oct 16, 2024 Maintainer

hmottestad Oct 17, 2024 Maintainer

kenwenzel Oct 17, 2024 Collaborator

benherber Oct 17, 2024

benherber Oct 17, 2024

benherber Oct 17, 2024

afs Oct 17, 2024

kenwenzel Oct 17, 2024 Collaborator

afs Oct 20, 2024

kenwenzel Oct 21, 2024 Collaborator

afs Oct 27, 2024

fkleedorfer
Oct 7, 2024

Replies: 4 comments 14 replies

kenwenzel
Oct 7, 2024
Collaborator

fkleedorfer Oct 9, 2024
Author

kenwenzel Oct 9, 2024
Collaborator

fkleedorfer Oct 10, 2024
Author

fkleedorfer Oct 10, 2024
Author

hmottestad Oct 16, 2024
Maintainer

hmottestad
Oct 16, 2024
Maintainer

hmottestad Oct 17, 2024
Maintainer

kenwenzel Oct 17, 2024
Collaborator

afs
Oct 17, 2024

kenwenzel Oct 17, 2024
Collaborator

afs
Oct 20, 2024

kenwenzel Oct 21, 2024
Collaborator