This repository is a playground for the eqlabs recruitment exercise on the Rolling Hash Algorithm.
It is a fork of palucki/yardiff, used as a boilerplate base, then simplified and refactored with these improvements:
- Generalized the IO stream so it also accepts string streams (see the sketch just below this list).
- Ability to clean up the stream buffer so the IO object can be reused.
- Replaced the Qt-based MD4 strong checksum with an independent, header-only MD5 implementation.
- Warning on rolling checksum collisions.
- Skipping unchanged segments to get a hunk-view-like printout.
- Reporting block coordinates and compared chunks.
- Alignment of matched reference and modified segments.
- Reporting the edit operations of a difference as addition/removal/substitution details.
- Producing metadata alongside the terminal printout.
- Built-in test.
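For illustration of the first point: accepting any `std::istream` means one code path serves both real files and in-memory test strings. A minimal sketch of the idea, with hypothetical names (`read_block` is not the repository's actual interface):

```cpp
#include <fstream>
#include <iostream>
#include <sstream>
#include <vector>

// Hypothetical reader: anything deriving from std::istream works,
// so the same code path serves files and in-memory test strings.
std::vector<char> read_block(std::istream& in, std::size_t block_size) {
    std::vector<char> block(block_size);
    in.read(block.data(), static_cast<std::streamsize>(block_size));
    block.resize(static_cast<std::size_t>(in.gcount()));
    return block;
}

int main() {
    std::ifstream file("a.txt", std::ios::binary);   // real file input
    std::istringstream text("built-in test input");  // string stream input
    auto b1 = read_block(file, 16);
    auto b2 = read_block(text, 16);
    std::cout << b1.size() << " " << b2.size() << "\n";
}
```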
You are there.
Compiler: GNU GCC 11.3.0 (Ubuntu 11.3.0-1ubuntu1~22.04)
Project | Where to get | Notes
---|---|---
RapidFuzz | maxbachmann/rapidfuzz-cpp | Included as an amalgamated header. Needed for Levenshtein edit ops to report add/remove/substitution details.
Hash | Chocobo1/Hash | Included md5 header. Needed for a fast blockwise decision before the rolling sequence.
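To show how the two headers slot in, here is a hedged sketch. It assumes the amalgamated RapidFuzz header exposes `rapidfuzz::levenshtein_editops` and the Hash header exposes `Chocobo1::MD5` with the `addData`/`finalize`/`toString` chain; it is an illustration, not the repository's code:

```cpp
#include <iostream>
#include <string>

#include "md5.h"                     // Chocobo1/Hash, header-only MD5
#include "rapidfuzz_amalgamated.hpp" // maxbachmann/rapidfuzz-cpp

int main() {
    std::string ref = "Hashing function gets the data";
    std::string mod = "Bashing function gets the dat";

    // Strong checksum: a fast equality decision per block.
    std::string h1 = Chocobo1::MD5().addData(ref.data(), ref.size()).finalize().toString();
    std::string h2 = Chocobo1::MD5().addData(mod.data(), mod.size()).finalize().toString();

    if (h1 != h2) {
        // Edit ops: one entry per differing position, reported as
        // Replace (substitution), Insert (addition) or Delete (removal).
        for (const auto& op : rapidfuzz::levenshtein_editops(ref, mod)) {
            const char* kind = op.type == rapidfuzz::EditType::Replace ? "substitution"
                             : op.type == rapidfuzz::EditType::Insert  ? "addition"
                                                                       : "removal";
            std::cout << kind << " at ref[" << op.src_pos
                      << "] / mod[" << op.dest_pos << "]\n";
        }
    }
}
```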
A qmake .pro file and a generic Makefile are provided; use whichever suits your taste.
Here is the typical qmake-based build-and-install procedure:

```
cd buildDir
qmake path/to/root/pro/file.pro CONFIG+=release CONFIG+=THIS_IS_A_SWITCH "ARGUMENT=THIS_IS_A_VALUE"
make
make install
```
Remember to clear buildDir before recompiling. In particular, you must pass the `.*` filename pattern to `rm` to delete the hidden `.qmake.super` file, which caches old qmake flags.
```
.
├── main.cpp                  | Everything is here.
├── Makefile
├── md5.h
├── rapidfuzz_amalgamated.hpp
├── README.MD
├── rolling-hash.pro
└── test_material.hpp         | Reference and modified test input strings. Just inputs.

0 directories, 7 files
```
- Open two file handles, A and B, in `main()`.
- Walk block by block over B and store rolling and strong checksums in `SignatureCalculator::calculate()`.
- Shift a block-sized window over A with a hop size of 1; compare the stored strong checksums first, and fall back to B's rolling hashes on mismatch, in `DeltaCalculator::calculate()`.
- Keep the bytes of A between matched segments as the modification range. We are still in `DeltaCalculator::calculate()`.
- Locate and align the corresponding raw data chunk of B by looking at the segment of A in `DeltaCalculator::align_cross_diff()`.
- Produce a Levenshtein-based modification report in terms of add/remove/replace in `DeltaCalculator::get_levenshtein_distance_ops_report()`.
- Return hunk metadata and commit the information to the print message, finally leaving `DeltaCalculator::calculate()`.
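The property the window shift relies on is that the weak checksum updates in O(1) as the window slides by one byte. Below is a minimal Adler-32-style sketch of such a rolling update; it illustrates the technique and is not necessarily the exact checksum this repository uses:

```cpp
#include <cstdint>
#include <iostream>
#include <string>

// Adler-32-style weak checksum over a window of fixed size `n`.
// a = sum of the bytes, b = sum of the running sums.
struct RollingChecksum {
    std::uint32_t a = 0, b = 0;
    std::size_t n = 0;

    void init(const unsigned char* data, std::size_t len) {
        a = b = 0; n = len;
        for (std::size_t i = 0; i < len; ++i) {
            a += data[i];
            b += static_cast<std::uint32_t>(len - i) * data[i];
        }
    }
    // Slide the window one byte: drop `out`, append `in`, in O(1).
    void roll(unsigned char out, unsigned char in) {
        a += in;  a -= out;
        b += a;   b -= static_cast<std::uint32_t>(n) * out;
    }
    std::uint32_t digest() const { return (b << 16) | (a & 0xffff); }
};

int main() {
    std::string s = "abcdefgh";
    const std::size_t n = 4;
    RollingChecksum rolled, fresh;
    rolled.init(reinterpret_cast<const unsigned char*>(s.data()), n);
    for (std::size_t i = 1; i + n <= s.size(); ++i) {
        rolled.roll(static_cast<unsigned char>(s[i - 1]),
                    static_cast<unsigned char>(s[i + n - 1]));
        // Recomputing from scratch must agree with the rolled value.
        fresh.init(reinterpret_cast<const unsigned char*>(s.data() + i), n);
        std::cout << std::hex << rolled.digest()
                  << (rolled.digest() == fresh.digest() ? " == " : " != ")
                  << fresh.digest() << "\n";
    }
}
```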
Just run it as is. The built-in test always runs, even if the a/b paths are omitted.
Here is what to expect:

Reference | Output | Modified
---|---|---
Rolling Hash Algorithm _Spec v4 (2021-03-09)_ Make a rolling hash based file diffing algorithm. When comparing original and an updated version of an input, it should return a description ("delta") which can be used to upgrade an original version of the file into the new file. The description contains the chunks which: - Can be reused from the original file - have been added or modified and thus would need to be synchronized The real-world use case for this type of construct could be a distributed file storage system. This reduces the need for bandwidth and storage. If many people have the same file stored on Dropbox, for example, there's no need to upload it again. A library that does a similar thing is [rdiff](https://linux.die.net/man/1/rdiff). You don't need to fulfill the patch part of the API, only signature and delta. ## Requirements - Hashing function gets the data as a parameter. Separate possible filesystem operations. - Chunk size can be fixed or dynamic, but must be split to at least two chunks on any sufficiently sized data. - Should be able to recognize changes between chunks. Only the exact differing locations should be added to the delta. - Well-written unit tests function well in describing the operation, no UI necessary. ## Checklist 1. Input/output operations are separated from the calculations 2. detects chunk changes and/or additions 3. detects chunk removals 4. detects additions between chunks with shifted original chunk | Signature: 48 blocks | aRolling Hash Algorithm _Spec v4 (2021-03-09)_ Make a rolling hash based file diffing algorithm. When comparing original and an updated version of an input, it should return a description ("delta") which can be used to upgrade an original version of the file into the new file. The description contains the chunks which: - Can be reused from the original file - have been added or modified and thus would need to be synchronized The real-world use case for this type of construct could be a distributed file storage system. This reduces the need for bandwidth and storage. If many people have the same file stored on Dropbox, for example, there's no need to upload it again. A library that does a similar thing is [rdiff](https://linux.die.net/man/1/rdiff). You don't need to fulfill the patch part of the API, only signature and delta. ## Requirements - Bashing function gets the dat as a parameter. Separate possible filesystem operations. - Chunk size can be fixed or dynamic, but must be split to at least two chunks on any sufficiently sized data. - Should be able to recognize changes between chunks. Only the exact differing locations should be added to the delta. - Well-written unit tests function welll in describing the operation, no UI necessary. ## Checklist 1. Input/output operations are separated from the calculations 2. detects chunk changes and/or additions 3. detects chunk removals 4. detects additions between chunks with shifted original chunk
As free as its dependencies.