Replies: 5 comments 6 replies
-
I agree that sync is not needed right now. That said, I'm fond of this vision of Neume as a network of nodes consuming and contributing to a shared dataset. For this, set reconciliation should work nicely.
-
3rd Proposal
Above, I have mentioned two alternative distribution routes. Here is another one. Instead of block numbers, we use timestamps: each track will have a timestamp. This method is only for distribution and not for syncing, as different nodes will assign different timestamps to the same track. Think of this method as a way to replicate the database of a node. In practice, neume can run a node and the Riff app can be the consumer. Using the above method they will be able to create a replica of neume's DB.
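To make the idea concrete, here is a minimal sketch in TypeScript. The field and function names (`crawledAt`, `tracksSince`) are assumptions for illustration, not neume's actual schema or API:

```ts
// Illustrative sketch of timestamp-based replication.
type Track = { id: string; crawledAt: number; data: unknown };

// Hypothetical endpoint on the serving node: everything the node
// crawled or updated after the given (node-local) timestamp.
declare function tracksSince(since: number): Promise<Track[]>;

// The consumer (e.g. the Riff app) keeps a cursor and polls periodically.
async function replicate(lastSeen: number): Promise<number> {
  const changed = await tracksSince(lastSeen);
  for (const track of changed) {
    // upsert `track` into the consumer's own copy of the database here
    lastSeen = Math.max(lastSeen, track.crawledAt);
  }
  return lastSeen;
}
```

Because the timestamp is assigned by the serving node, the cursor is only meaningful against that one node, which is why this works for replication but not for syncing arbitrary nodes.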
-
Off the top of my head, here is what you could do if all the records you want to replicate are NFTs.
-
We previously lightly touched on arweave/bundlr as a potential option for distributing data, and it has stayed in my head as something that could be a good option. Added to that, 0xtranqui just brought it up again when we were chatting in the PA forum. Furthermore, I have been diving deeper into the docs that Lens has written re: its "L3" solution, Momoka, and how it uses Arweave & Bundlr for Data Availability (DA). I may be way off here, but isn't there a potential approach where neume nodes run the crawler (to their own specified parameters), output the results to Arweave via Bundlr, and then communicate this out to other nodes with the details of what is there and where?
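If it helps the discussion, here is a rough sketch in TypeScript of what such an announcement ("what is there and where") could contain. Every field name here is my own assumption; none of this comes from neume, Bundlr, or Momoka:

```ts
// Hypothetical shape of the message a node broadcasts after uploading a
// crawl result to Arweave via Bundlr.
interface CrawlAnnouncement {
  arweaveTxId: string;   // transaction that holds the uploaded crawl output
  contracts: string[];   // which contracts/addresses this crawl covered
  fromBlock: number;     // block range covered by the upload
  toBlock: number;
  uploadedAt: number;    // node-local timestamp of the upload
}
```

Other nodes could then decide, per announcement, whether the parameters overlap with what they care about and fetch the data from Arweave directly.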
-
I am sorry but I have missed this conversation. I hope to have looked at it and responded by Monday next week.
-
This is one of the most difficult problems I have been facing with respect to neume. In version 1 of neume we dumped our crawled data as JSON. A consumer of neume had to read through the complete dump and check for insertions/updates. This was inefficient and not scalable. In version 2 we implemented a way to ask neume for what's new, so dumps were no longer required. Since we could ask for what's new, consumption and syncing between two nodes was also easy: the other node just had to ask for what's new. Now that we are integrating the Lens protocol and making the crawler more generic, asking for what's new is no longer trivial.
How do we ask for what's new in version 2?
We maintain a log mapping block numbers to the IDs that changed at that block number.
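For illustration, a minimal sketch of such a log in TypeScript (the IDs and shape are illustrative, not neume's actual schema):

```ts
// Block number -> IDs inserted or updated at that block.
const changeLog: Record<number, string[]> = {
  5000: ["id1", "id2"], // id1 and id2 inserted at block 5000
  5001: ["id1"],        // id1 updated at block 5001
};

// "What's new since block N?" becomes a scan over the log.
function changesSince(fromBlock: number): Array<{ block: number; id: string }> {
  return Object.keys(changeLog)
    .map(Number)
    .filter((block) => block >= fromBlock)
    .sort((a, b) => a - b)
    .flatMap((block) => changeLog[block].map((id) => ({ block, id })));
}
```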
Here, `id1` and `id2` were inserted at block number 5000 and `id1` was updated at block number 5001. Our database stores the complete history. Therefore, we can ask it for `id1` at block number 5000 and it will return the value corresponding to `id1` without the update at block 5001.

A consumer of neume can ask for changes since block 5000, assuming it has already consumed changes till 5000. The consumer will get the above list of `id1` and `id2`. The consumer will then ask for the value corresponding to `id1` at 5000, apply that value to its database, ask for `id2` at 5000, then again `id1` at 5001, and apply all these values.
Challenges
Suppose the consumer or the other node started syncing from block 0. It kept asking for what's new till block number x and is now synced up to block x. It will now ask for what's new since block x, but what if there has been a change before block x? The consumer won't be able to know about that.
The above-mentioned scenario doesn't occur with version 2 of our crawler. However, it can happen now because we are planning to provide wallet addresses to crawl as an input to the crawler. So, if the crawler starts with a list of addresses, crawls till block x, updates the list of addresses, restarts the crawl from 0, and crawls till x again, there will now be new changes that happened before x, and a consumer asking for what's new since x will not be able to get them.
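Spelled out with the same illustrative log shape as above (addresses and IDs are made up):

```ts
// Run 1: the crawler is configured with addresses [A] and crawls till x = 5001.
let changeLog: Record<number, string[]> = {
  5000: ["id1", "id2"],
  5001: ["id1"],
};
// A consumer syncs everything and remembers lastBlock = 5001.

// Run 2: address B is added, the crawler restarts from block 0 and crawls
// till x again. B's tracks were minted long ago, so they land at old blocks.
changeLog = {
  3000: ["id3"],        // new data, but *before* the consumer's lastBlock
  5000: ["id1", "id2"],
  5001: ["id1"],
};

// The consumer asks "what's new since 5001?" and gets nothing it hasn't
// already seen, so it never learns about id3.
```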
Potential Solutions
To solve the above-mentioned scenario we need some mechanism that compares the data available with the consumer and with my node and then calculates a diff. This is essentially set reconciliation.
An inefficient method for set reconciliation is going through all the data available with the node and asking for data that is out of date or unavailable. This will take O(n), where n is the number of rows in our database. The consumer can check the hash of each row to know if its data is out of date.
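A rough sketch of that O(n) walk, assuming the node can expose per-row hashes and row lookups (both endpoint names below are hypothetical):

```ts
// Naive set reconciliation: compare every row hash and fetch what differs.
type RowHash = { id: string; hash: string };

interface RemoteNode {
  listRowHashes(): Promise<RowHash[]>;  // hypothetical endpoint
  getRow(id: string): Promise<unknown>; // hypothetical endpoint
}

async function reconcile(
  remote: RemoteNode,
  local: Map<string, { hash: string; value: unknown }>,
): Promise<void> {
  // O(n) in the number of rows the remote node holds.
  for (const { id, hash } of await remote.listRowHashes()) {
    const mine = local.get(id);
    if (!mine || mine.hash !== hash) {
      // Missing or out of date locally: fetch the full row.
      local.set(id, { hash, value: await remote.getRow(id) });
    }
  }
}
```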
An efficient method described on this GitHub page takes O(log n) but requires our rows to be converted into numbers. This would be ideal to have but I haven't been able to adapt it to our case.
Any help is appreciated here.
Summarizing the problem
Essentially, we want to sync two nodes, but since nodes starting with different inputs can arrive at different outputs, syncing them efficiently is problematic, where "efficient" means anything less than O(n). In version 2 all the nodes shared the same input, hence we didn't have this problem.
Is sync important?
I want to pose this question before we nerd out on solutions. In version 1 we didn't have syncing, but in version 2 we did. That capability was a side effect of solving the distribution problem. So, in my opinion, we can remove the sync feature if we have a good enough solution for distribution. Feel free to weigh in here.
Alternative ways to distribute
We can ask everyone to run their own node if they want real-time data. Others can run the inefficient (O(n)) method of going through all rows, checking which rows are new or out of date, and then consuming those rows. Since this is inefficient, it can be done once a day.
Another method is the API route. We can launch our own API, similar to api.lens.dev, for people to consume data directly from our DB. A con of this approach is that we will have to run a server heavy enough to answer all API requests.