This repository is a playground for the eqlabs recruitment exercise on the Rolling Hash Algorithm.
It is a fork of palucki/yardiff, used as a boilerplate base, then simplified and refactored with these improvements:
- Generalized the IO stream so it also accepts string streams (see the sketch just below this list).
- Ability to clean up the stream buffer so the IO object can be reused.
- Replaced the Qt-based MD4 strong checksum with an independent, header-only MD5 implementation.
- Warning on rolling checksum collisions.
- Skipping unchanged segments to get a hunk-view-like printout.
- Reporting block coordinates and compared chunks.
- Alignment of matched reference and modified segments.
- Reporting the edit operations of a difference as addition/removal/substitution details.
- Producing metadata alongside the terminal printout.
- Built-in test.
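For illustration of the first point: accepting any `std::istream` means one code path serves both real files and in-memory test strings. A minimal sketch of the idea, with hypothetical names (`read_block` is not the repository's actual interface):

```cpp
#include <fstream>
#include <iostream>
#include <sstream>
#include <vector>

// Hypothetical reader: anything deriving from std::istream works,
// so the same code path serves files and in-memory test strings.
std::vector<char> read_block(std::istream& in, std::size_t block_size) {
    std::vector<char> block(block_size);
    in.read(block.data(), static_cast<std::streamsize>(block_size));
    block.resize(static_cast<std::size_t>(in.gcount()));
    return block;
}

int main() {
    std::ifstream file("a.txt", std::ios::binary);   // real file input
    std::istringstream text("built-in test input");  // string stream input
    auto b1 = read_block(file, 16);
    auto b2 = read_block(text, 16);
    std::cout << b1.size() << " " << b2.size() << "\n";
}
```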
You are there.
Compiler: GNU GCC 11.3.0 (Ubuntu 11.3.0-1ubuntu1~22.04)
Project | Where to get | Notes
---|---|---
RapidFuzz | maxbachmann/rapidfuzz-cpp | Included as an amalgamated header. Needed for Levenshtein edit ops to report add/remove/substitution details.
Hash | Chocobo1/Hash | Included md5 header. Needed for a fast blockwise decision before the rolling sequence.
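To show how the two headers slot in, here is a hedged sketch. It assumes the amalgamated RapidFuzz header exposes `rapidfuzz::levenshtein_editops` and the Hash header exposes `Chocobo1::MD5` with the `addData`/`finalize`/`toString` chain; it is an illustration, not the repository's code:

```cpp
#include <iostream>
#include <string>

#include "md5.h"                     // Chocobo1/Hash, header-only MD5
#include "rapidfuzz_amalgamated.hpp" // maxbachmann/rapidfuzz-cpp

int main() {
    std::string ref = "Hashing function gets the data";
    std::string mod = "Bashing function gets the dat";

    // Strong checksum: a fast equality decision per block.
    std::string h1 = Chocobo1::MD5().addData(ref.data(), ref.size()).finalize().toString();
    std::string h2 = Chocobo1::MD5().addData(mod.data(), mod.size()).finalize().toString();

    if (h1 != h2) {
        // Edit ops: one entry per differing position, reported as
        // Replace (substitution), Insert (addition) or Delete (removal).
        for (const auto& op : rapidfuzz::levenshtein_editops(ref, mod)) {
            const char* kind = op.type == rapidfuzz::EditType::Replace ? "substitution"
                             : op.type == rapidfuzz::EditType::Insert  ? "addition"
                                                                       : "removal";
            std::cout << kind << " at ref[" << op.src_pos
                      << "] / mod[" << op.dest_pos << "]\n";
        }
    }
}
```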
A qmake .pro file and a generic Makefile are provided; use whichever suits your taste.
Here is the typical qmake-based build-and-install procedure:

```
cd buildDir
qmake path/to/root/pro/file.pro CONFIG+=release CONFIG+=THIS_IS_A_SWITCH "ARGUMENT=THIS_IS_A_VALUE"
make
make install
```
Remember to clear buildDir before recompiling. In particular, you must pass the `.*` filename pattern to `rm` to delete the hidden `.qmake.super` file, which caches old qmake flags.
```
.
├── main.cpp                  | Everything is here.
├── Makefile
├── md5.h
├── rapidfuzz_amalgamated.hpp
├── README.MD
├── rolling-hash.pro
└── test_material.hpp         | Reference and modified test input strings. Just inputs.

0 directories, 7 files
```
- Open two file handles, A and B, in `main()`.
- Walk block by block over B and store rolling and strong checksums in `SignatureCalculator::calculate()`.
- Shift a block-sized window over A with a hop size of 1; compare the stored strong checksums first, and fall back to B's rolling hashes on mismatch, in `DeltaCalculator::calculate()`.
- Keep the bytes of A between matched segments as the modification range. We are still in `DeltaCalculator::calculate()`.
- Locate and align the corresponding raw data chunk of B by looking at the segment of A in `DeltaCalculator::align_cross_diff()`.
- Produce a Levenshtein-based modification report in terms of add/remove/replace in `DeltaCalculator::get_levenshtein_distance_ops_report()`.
- Return hunk metadata and commit the information to the print message, finally leaving `DeltaCalculator::calculate()`.
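The property the window shift relies on is that the weak checksum updates in O(1) as the window slides by one byte. Below is a minimal Adler-32-style sketch of such a rolling update; it illustrates the technique and is not necessarily the exact checksum this repository uses:

```cpp
#include <cstdint>
#include <iostream>
#include <string>

// Adler-32-style weak checksum over a window of fixed size `n`.
// a = sum of the bytes, b = sum of the running sums.
struct RollingChecksum {
    std::uint32_t a = 0, b = 0;
    std::size_t n = 0;

    void init(const unsigned char* data, std::size_t len) {
        a = b = 0; n = len;
        for (std::size_t i = 0; i < len; ++i) {
            a += data[i];
            b += static_cast<std::uint32_t>(len - i) * data[i];
        }
    }
    // Slide the window one byte: drop `out`, append `in`, in O(1).
    void roll(unsigned char out, unsigned char in) {
        a += in;  a -= out;
        b += a;   b -= static_cast<std::uint32_t>(n) * out;
    }
    std::uint32_t digest() const { return (b << 16) | (a & 0xffff); }
};

int main() {
    std::string s = "abcdefgh";
    const std::size_t n = 4;
    RollingChecksum rolled, fresh;
    rolled.init(reinterpret_cast<const unsigned char*>(s.data()), n);
    for (std::size_t i = 1; i + n <= s.size(); ++i) {
        rolled.roll(static_cast<unsigned char>(s[i - 1]),
                    static_cast<unsigned char>(s[i + n - 1]));
        // Recomputing from scratch must agree with the rolled value.
        fresh.init(reinterpret_cast<const unsigned char*>(s.data() + i), n);
        std::cout << std::hex << rolled.digest()
                  << (rolled.digest() == fresh.digest() ? " == " : " != ")
                  << fresh.digest() << "\n";
    }
}
```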
Just run it as is. The built-in test always runs, even if the a/b paths are omitted.
Here is what to expect:

Reference | Output | Modified
---|---|---
Rolling Hash Algorithm _Spec v4 (2021-03-09)_ Make a rolling hash based file diffing algorithm. When comparing original and an updated version of an input, it should return a description ("delta") which can be used to upgrade an original version of the file into the new file. The description contains the chunks which: - Can be reused from the original file - have been added or modified and thus would need to be synchronized The real-world use case for this type of construct could be a distributed file storage system. This reduces the need for bandwidth and storage. If many people have the same file stored on Dropbox, for example, there's no need to upload it again. A library that does a similar thing is [rdiff](https://linux.die.net/man/1/rdiff). You don't need to fulfill the patch part of the API, only signature and delta. ## Requirements - Hashing function gets the data as a parameter. Separate possible filesystem operations. - Chunk size can be fixed or dynamic, but must be split to at least two chunks on any sufficiently sized data. - Should be able to recognize changes between chunks. Only the exact differing locations should be added to the delta. - Well-written unit tests function well in describing the operation, no UI necessary. ## Checklist 1. Input/output operations are separated from the calculations 2. detects chunk changes and/or additions 3. detects chunk removals 4. detects additions between chunks with shifted original chunk | Signature: 48 blocks | aRolling Hash Algorithm _Spec v4 (2021-03-09)_ Make a rolling hash based file diffing algorithm. When comparing original and an updated version of an input, it should return a description ("delta") which can be used to upgrade an original version of the file into the new file. The description contains the chunks which: - Can be reused from the original file - have been added or modified and thus would need to be synchronized The real-world use case for this type of construct could be a distributed file storage system. This reduces the need for bandwidth and storage. If many people have the same file stored on Dropbox, for example, there's no need to upload it again. A library that does a similar thing is [rdiff](https://linux.die.net/man/1/rdiff). You don't need to fulfill the patch part of the API, only signature and delta. ## Requirements - Bashing function gets the dat as a parameter. Separate possible filesystem operations. - Chunk size can be fixed or dynamic, but must be split to at least two chunks on any sufficiently sized data. - Should be able to recognize changes between chunks. Only the exact differing locations should be added to the delta. - Well-written unit tests function welll in describing the operation, no UI necessary. ## Checklist 1. Input/output operations are separated from the calculations 2. detects chunk changes and/or additions 3. detects chunk removals 4. detects additions between chunks with shifted original chunk
As free as its dependencies.