Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Calculate and compare CRC when writing and reading ledger snapshots #1319

Merged
merged 1 commit into from
Dec 10, 2024

Conversation

geo2a
Copy link
Contributor

@geo2a geo2a commented Nov 21, 2024

Fixes #892

Integration into cardano-node: IntersectMBO/cardano-node#6047. This uses the branch geo2a/issue-892-checksum-snaphot-file-release-ouroboros-consensus-0.21.0.0-backport which is the backport of this PR onto the most resent release of the ouroboros-consensus package.

In this PR, we change the reading and writing disk snapshots of ledger state. When a snapshot is taken and written to disk, an additional file with the .checksum extension is written alongside it. The checksum file contains a string that represent the CRC32 checksum of the snapshot.

The checksum is calculated incrementally, alongside writing the snapshot to disk. When a snapshot is read from dist, the checksum is again calculated and compared to the tracked one. If the checksum is different, readSnaphot returns the ReadSnapshotDataCorruption error, indicating data corruption.

The checksum is calculated incrementally, alongside reading a writing the data. On write, we use the hPutAllCRC function from fs-sim, and on read we modify the readIncremental function to compute the checksum as data is read.

To enable seamless integration into cardano-node, we make the check optional. When initialising the ledger state from a snapshot in initLedgerDB, we issue a warning in case the checksum file is missing for a snapshot, but do not fail as in case of invalid snapshots.

The db-analyser tool ignores the checksum files by default when reading the snapshots. We add --disk-snapshot-checksum flag to enabled the check. When writing a snapshot to disk, e.g. as a result of the --store-ledger analysis, db-analyser will always write
calculate the checksum and write it into the snapshot's .checksum file.

Tests

There state machine test in Test.Ouroboros.Storage.LedgerDB.OnDisk is relevant to this feature, and has caught a number of silly mistakes in the process of its implementation, for example forgetting to delete a checksum file when the snapshot is deleted.

The model in the test does not track checksums, and I do not think it can (or should) be augmented to do that. Howerver, the Snap and Restore events are now parameterised by the checksum flag, and the values for the flag are randomised when generating these events. This leads to testing the following properties:

  • this feature is backwards-compatible, i.e. the Restore events will always lead to restoring from a snapshot, even if Snap events do not write checksum files (i.e. their flag is NoDoDiskSnapshotChecksum ~= False).
  • If the interpretation of the DoDiskSnapshotChecksum flag changes in the code base and becomes strict, i.e. hard fail if the checksum file is missing, this test will discover that.

Effects on Performance:

Running db-analyser to read a ledger snapshot and store the snapshot of the state at the next slot shows a difference of 2 seconds on my machine. See a comment below for the logs.

To precisely evaluate the effects, we need a micro-benchmark of the reading and writing of snapshots with and without the checksum calculation.

@geo2a geo2a force-pushed the 892-checksum-snaphot-file branch from 41891a0 to e449818 Compare November 21, 2024 13:38
@geo2a geo2a force-pushed the 892-checksum-snaphot-file branch 3 times, most recently from b001c90 to a599027 Compare November 28, 2024 08:16
@geo2a geo2a force-pushed the 892-checksum-snaphot-file branch 5 times, most recently from 31b892a to 5d39721 Compare November 29, 2024 11:19
@geo2a
Copy link
Contributor Author

geo2a commented Nov 29, 2024

I've compared the performance of reading and then writing a snapshot using db-analyser. The feature branch seems to be 2s slower, but it is not clear if it is significant.

I would like to spend some more time looking at the microbenchmarks that we currently have and maybe adding one for reading-writing ledger state snapshots.

Logs of `db-analyser` with timings

Feature 101.325841s

time $(cabal list-bin db-analyser) --db ~/Workspace/IOG/cardano-node-data/mainnet/state/mainnet-db cardano --config ~/Workspace/IOG/cardano-node-data/mainnet/mainnet-config.json --analyse-from 112526498 --store-ledger 112526499 --verbose
[0.459957s] TraceImmutableDBEvent (ChunkValidationEvent (StartedValidatingChunk 5285 5285))
[0.869950s] TraceImmutableDBEvent (ChunkValidationEvent (ValidatedChunk 5285 5285))
[0.871115s] TraceImmutableDBEvent (ValidatedLastLocation 5285 (Tip {tipSlotNo = SlotNo 114172812, tipIsEBB = IsNotEBB, tipBlockNo = BlockNo 9827240, tipHash = fcb4668176a65b29cf64c31a0bf7c4d4558f61d17774c4f9c3c1366435be8a7f}))
[59.829891s] Started StoreLedgerStateAt (SlotNo 112526499) LedgerReapply
[60.150611s] TraceImmutableDBEvent (TraceCacheEvent (TraceCurrentChunkHit 5285 0))
[60.150929s] TraceImmutableDBEvent (TraceCacheEvent (TraceCurrentChunkHit 5285 0))
[60.151203s] TraceImmutableDBEvent (TraceCacheEvent (TracePastChunkMiss 5209 0))
[60.179937s] TraceImmutableDBEvent (TraceCacheEvent (TracePastChunkHit 5209 1))
[60.180243s] TraceImmutableDBEvent (TraceCacheEvent (TracePastChunkHit 5209 1))
[101.325411s] Snapshot stored at SlotNo 112526500
[101.325568s] Snapshot was created at SlotNo 112526500 because there was no block forged at requested SlotNo 112526499
[101.325841s] Done
ImmutableDB tip: At (Block {blockPointSlot = SlotNo 114172812, blockPointHash = fcb4668176a65b29cf64c31a0bf7c4d4558f61d17774c4f9c3c1366435be8a7f})
[101.325945s] TraceImmutableDBEvent DBClosed
$(cabal list-bin db-analyser) --db  cardano --config  --analyse-from 11252649  102.92s user 6.79s system 107% cpu 1:41.95 total

Main 99.685827s

 time $(cabal list-bin db-analyser) --db ~/Workspace/IOG/cardano-node-data/mainnet/state/mainnet-db cardano --config ~/Workspace/IOG/cardano-node-data/mainnet/mainnet-config.json --analyse-from 112526498 --store-ledger 112526499 --verbose
[0.466244s] TraceImmutableDBEvent (ChunkValidationEvent (StartedValidatingChunk 5285 5285))
[0.879898s] TraceImmutableDBEvent (ChunkValidationEvent (ValidatedChunk 5285 5285))
[0.881057s] TraceImmutableDBEvent (ValidatedLastLocation 5285 (Tip {tipSlotNo = SlotNo 114172812, tipIsEBB = IsNotEBB, tipBlockNo = BlockNo 9827240, tipHash = fcb4668176a65b29cf64c31a0bf7c4d4558f61d17774c4f9c3c1366435be8a7f}))
[57.514297s] Started StoreLedgerStateAt (SlotNo 112526499) LedgerReapply
[57.837578s] TraceImmutableDBEvent (TraceCacheEvent (TraceCurrentChunkHit 5285 0))
[57.837830s] TraceImmutableDBEvent (TraceCacheEvent (TraceCurrentChunkHit 5285 0))
[57.838031s] TraceImmutableDBEvent (TraceCacheEvent (TracePastChunkMiss 5209 0))
[57.858723s] TraceImmutableDBEvent (TraceCacheEvent (TracePastChunkHit 5209 1))
[57.858963s] TraceImmutableDBEvent (TraceCacheEvent (TracePastChunkHit 5209 1))
[99.685293s] Snapshot stored at SlotNo 112526500
[99.685447s] Snapshot was created at SlotNo 112526500 because there was no block forged at requested SlotNo 112526499
[99.685721s] Done
ImmutableDB tip: At (Block {blockPointSlot = SlotNo 114172812, blockPointHash = fcb4668176a65b29cf64c31a0bf7c4d4558f61d17774c4f9c3c1366435be8a7f})
[99.685827s] TraceImmutableDBEvent DBClosed
$(cabal list-bin db-analyser) --db  cardano --config  --analyse-from 11252649  102.60s user 6.13s system 108% cpu 1:40.31 total

@geo2a geo2a force-pushed the 892-checksum-snaphot-file branch from e9ae82c to 4286fdd Compare December 3, 2024 07:53
@geo2a geo2a force-pushed the 892-checksum-snaphot-file branch 2 times, most recently from b524f64 to c8c082e Compare December 4, 2024 13:51
@geo2a geo2a marked this pull request as ready for review December 4, 2024 13:52
Copy link
Contributor

@jasagredo jasagredo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it looks good in general, but we should propagate this configuration option to the node and expose it on the JSON/YAML config file.

@geo2a geo2a force-pushed the 892-checksum-snaphot-file branch from c8c082e to 584d4d4 Compare December 5, 2024 08:49
@geo2a geo2a force-pushed the 892-checksum-snaphot-file branch from f27ca6d to 15ec8ee Compare December 5, 2024 14:11
@geo2a geo2a force-pushed the 892-checksum-snaphot-file branch 4 times, most recently from 1ec6cdc to dcf27db Compare December 10, 2024 09:19
@geo2a geo2a force-pushed the 892-checksum-snaphot-file branch from 5df819b to 4200b4a Compare December 10, 2024 13:20
@geo2a geo2a enabled auto-merge December 10, 2024 13:20
- Allow skipping snapshot checksum check
- Generalise `Test/Ouroboros/Storage/LedgerDB/OnDisk.hs`
- Restrict `Ord` instance for `DiskSnapshot` to `dsNumber`
  - Use the `Ord` instance in `listSnapshots`
- Connect snapshot checksum with the node interface:
  - Add `Flag "DoDiskSnapshotChecksum"` to `DiskPolicyArgs`
  - Expose `(No)DoDiskSnapshotChecksum` in Ouroboros.Consensus.Node
  - Re-export `Flag` from `DiskPolicy`
- `db-analyser` changes:
  - use checksums by default as that's what `cardano-node` will do
  - Allow independently disabling checksum on reading and writing
@geo2a geo2a force-pushed the 892-checksum-snaphot-file branch from 4200b4a to 2eef543 Compare December 10, 2024 14:35
@geo2a geo2a added this pull request to the merge queue Dec 10, 2024
Merged via the queue into main with commit 24e48cc Dec 10, 2024
13 checks passed
@geo2a geo2a deleted the 892-checksum-snaphot-file branch December 10, 2024 14:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEAT] - checksum when deserializing the ledger snapshot file
4 participants