Replies: 5 comments 6 replies
-
I agree that sync is not needed right now. That said, I'm fond of this vision of Neume as a network of nodes consuming and contributing to a shared dataset. For this, set reconciliation should work nicely.
-
3rd Proposal
Above, I have mentioned two alternative distribution routes. Here is another one. Instead of block numbers, we use timestamps: each track will have a timestamp. This method is only for distribution and not for syncing, as different nodes will assign different timestamps to the same track. Think of this method as a way to replicate the database of a node. In practice, neume can run a node and the Riff app can be the consumer. Using the above method they will be able to create a replica of neume's DB.
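To make the idea concrete, here is a minimal sketch in TypeScript. The field and function names (`crawledAt`, `tracksSince`) are assumptions for illustration, not neume's actual schema or API:

```ts
// Illustrative sketch of timestamp-based replication.
type Track = { id: string; crawledAt: number; data: unknown };

// Hypothetical endpoint on the serving node: everything the node
// crawled or updated after the given (node-local) timestamp.
declare function tracksSince(since: number): Promise<Track[]>;

// The consumer (e.g. the Riff app) keeps a cursor and polls periodically.
async function replicate(lastSeen: number): Promise<number> {
  const changed = await tracksSince(lastSeen);
  for (const track of changed) {
    // upsert `track` into the consumer's own copy of the database here
    lastSeen = Math.max(lastSeen, track.crawledAt);
  }
  return lastSeen;
}
```

Because the timestamp is assigned by the serving node, the cursor is only meaningful against that one node, which is why this works for replication but not for syncing arbitrary nodes.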
-
Off the top of my head, here is what you could do if all the records you want to replicate are NFTs.
-
We previously lightly touched on arweave/bundlr as a potential option for distributing data, and it has stayed in my head as something that could be a good option. Added to that, 0xtranqui just brought it up again when we were chatting in the PA forum. Furthermore, I have been diving deeper into the docs that Lens has written re: its "L3" solution, Momoka, and how it uses Arweave & Bundlr for Data Availability (DA). I may be way off here, but isn't there a potential approach where neume nodes run the crawler (to their own specified parameters), output the results to Arweave via Bundlr, and then communicate this out to other nodes with the details of what is there and where?
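If it helps the discussion, here is a rough sketch in TypeScript of what such an announcement ("what is there and where") could contain. Every field name here is my own assumption; none of this comes from neume, Bundlr, or Momoka:

```ts
// Hypothetical shape of the message a node broadcasts after uploading a
// crawl result to Arweave via Bundlr.
interface CrawlAnnouncement {
  arweaveTxId: string;   // transaction that holds the uploaded crawl output
  contracts: string[];   // which contracts/addresses this crawl covered
  fromBlock: number;     // block range covered by the upload
  toBlock: number;
  uploadedAt: number;    // node-local timestamp of the upload
}
```

Other nodes could then decide, per announcement, whether the parameters overlap with what they care about and fetch the data from Arweave directly.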
-
I am sorry but I have missed this conversation. I hope to have looked at it and responded by Monday next week.
-
This is one of the most difficult problems I have been facing with respect to neume. In version 1 of neume we dumped our crawled data as JSON. A consumer of neume had to read through the complete dump and check for insertions/updates. This was inefficient and not scalable. In version 2 we implemented a way to ask neume for what's new, so dumps were no longer required. Since we could ask for what's new, consumption and syncing between two nodes was also easy: the other node just had to ask for what's new. Now that we are integrating the Lens protocol and making the crawler more generic, asking for what's new is no longer trivial.
How do we ask for what's new in version 2?
We maintain a log mapping block numbers to the IDs that changed at that block number.
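For illustration, a minimal sketch of such a log in TypeScript (the IDs and shape are illustrative, not neume's actual schema):

```ts
// Block number -> IDs inserted or updated at that block.
const changeLog: Record<number, string[]> = {
  5000: ["id1", "id2"], // id1 and id2 inserted at block 5000
  5001: ["id1"],        // id1 updated at block 5001
};

// "What's new since block N?" becomes a scan over the log.
function changesSince(fromBlock: number): Array<{ block: number; id: string }> {
  return Object.keys(changeLog)
    .map(Number)
    .filter((block) => block >= fromBlock)
    .sort((a, b) => a - b)
    .flatMap((block) => changeLog[block].map((id) => ({ block, id })));
}
```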
Here, `id1` and `id2` were inserted at block number 5000 and `id1` was updated at block number 5001. Our database stores the complete history. Therefore, we can ask it for `id1` at block number 5000 and it will return the value corresponding to `id1` without the update at block 5001.

A consumer of neume can ask for changes since block 5000, assuming it has already consumed changes till 5000. The consumer will get the above list of `id1` and `id2`. The consumer will then ask for the value corresponding to `id1` at 5000, apply that value to its database, ask for `id2` at 5000, then again `id1` at 5001, and apply all these values.
Challenges
Suppose the consumer or the other node started syncing from block 0. It kept asking for what's new till block number x and is now synced up to block x. It will now ask for what's new since block x, but what if there has been a change before block x? The consumer won't be able to know about that.
The above-mentioned scenario doesn't occur with version 2 of our crawler. However, it can happen now because we are planning to provide wallet addresses to crawl as an input to the crawler. So, if the crawler starts with a list of addresses, crawls till block x, updates the list of addresses, restarts the crawl from 0, and crawls till x again, there will now be new changes that happened before x, and a consumer asking for what's new since x will not be able to get them.
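Spelled out with the same illustrative log shape as above (addresses and IDs are made up):

```ts
// Run 1: the crawler is configured with addresses [A] and crawls till x = 5001.
let changeLog: Record<number, string[]> = {
  5000: ["id1", "id2"],
  5001: ["id1"],
};
// A consumer syncs everything and remembers lastBlock = 5001.

// Run 2: address B is added, the crawler restarts from block 0 and crawls
// till x again. B's tracks were minted long ago, so they land at old blocks.
changeLog = {
  3000: ["id3"],        // new data, but *before* the consumer's lastBlock
  5000: ["id1", "id2"],
  5001: ["id1"],
};

// The consumer asks "what's new since 5001?" and gets nothing it hasn't
// already seen, so it never learns about id3.
```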
Potential Solutions
To solve the above-mentioned scenario we need some mechanism that compares the data available with the consumer and with my node and then calculates a diff. This is essentially set reconciliation.
An inefficient method for set reconciliation is going through all the data available with the node and asking for data that is out of date or unavailable. This will take O(n), where n is the number of rows in our database. The consumer can check the hash of each row to know if its data is out of date.
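A rough sketch of that O(n) walk, assuming the node can expose per-row hashes and row lookups (both endpoint names below are hypothetical):

```ts
// Naive set reconciliation: compare every row hash and fetch what differs.
type RowHash = { id: string; hash: string };

interface RemoteNode {
  listRowHashes(): Promise<RowHash[]>;  // hypothetical endpoint
  getRow(id: string): Promise<unknown>; // hypothetical endpoint
}

async function reconcile(
  remote: RemoteNode,
  local: Map<string, { hash: string; value: unknown }>,
): Promise<void> {
  // O(n) in the number of rows the remote node holds.
  for (const { id, hash } of await remote.listRowHashes()) {
    const mine = local.get(id);
    if (!mine || mine.hash !== hash) {
      // Missing or out of date locally: fetch the full row.
      local.set(id, { hash, value: await remote.getRow(id) });
    }
  }
}
```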
An efficient method described on this GitHub page takes O(log n) but requires our rows to be converted into numbers. This would be ideal to have but I haven't been able to adapt it to our case.
Any help is appreciated here.
Summarizing the problem
Essentially, we want to sync two nodes, but since nodes starting with different inputs can arrive at different outputs, syncing them efficiently is problematic, where "efficient" means anything less than O(n). In version 2 all the nodes shared the same input, hence we didn't have this problem.
Is sync important?
I want to pose this question before we nerd out on solutions. In version 1 we didn't have syncing, but in version 2 we did. That capability was a side effect of solving the distribution problem. So, in my opinion, we can remove the sync feature if we have a good enough solution for distribution. Feel free to weigh in here.
Alternative ways to distribute
We can ask everyone to run their own node if they want real-time data. Others can run the inefficient (O(n)) method of going through all rows, checking which rows are new or out of date, and then consuming those rows. Since this is inefficient, it can be done once a day.
Another method is the API route. We can launch our own API, similar to api.lens.dev, for people to consume data directly from our DB. A con of this approach is that we will have to run a server heavy enough to answer all API requests.