Commit
1 parent f416fa0, commit 7ff5be3
Showing 8 changed files with 189 additions and 22 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,12 +1,88 @@ | ||
# FastCDC-Go | ||
|
||
A Go implementation of the FastCDC content-defined chunking algorithm. | ||
FastCDC-Go is a Go library implementing the [FastCDC](#references) content-defined chunking algorithm. | ||
|
||
Install: | ||
``` | ||
go get -u github.com/iotafs/fastcdc-go | ||
``` | ||
|
||
## Example | ||
|
||
```go | ||
import ( | ||
"bytes" | ||
"fmt" | ||
"log" | ||
"math/rand" | ||
"io" | ||
|
||
"github.com/iotafs/fastcdc-go" | ||
) | ||
|
||
opts := fastcdc.Options{ | ||
MinSize: 256 * 1024, | ||
AverageSize: 1 * 1024 * 1024, | ||
MaxSize: 4 * 1024 * 1024, | ||
} | ||
|
||
data := make([]byte, 10 * 1024 * 1024) | ||
rand.Read(data) | ||
chunker, _ := fastcdc.NewChunker(bytes.NewReader(data), opts) | ||
|
||
for { | ||
chunk, err := chunker.Next() | ||
if err == io.EOF { | ||
break | ||
} | ||
if err != nil { | ||
log.Fatal(err) | ||
} | ||
|
||
fmt.Printf("%x %d\n", chunk.Data[:10], chunk.Length) | ||
} | ||
``` | ||
|
||
## Command line tool | ||
|
||
This package also includes a useful CLI for testing the chunking output. Install it by running: | ||
|
||
``` | ||
go install ./cmd/fastcdc | ||
``` | ||
|
||
Example: | ||
```bash | ||
# Outputs the position and size of each chunk to stdout | ||
fastcdc -csv -file random.txt | ||
``` | ||
|
||
## Performance | ||
|
||
Chunking throughput on an Intel i5 7200U exceeds 1 GiB/s. Compared to [`restic/chunker`](https://github.com/restic/chunker), another CDC library for Go, FastCDC-Go is about 2.9 times faster. | ||
|
||
Benchmark ([code](https://gist.github.com/eadanfahey/ce2ba2733028e2b3b62a479ba2b9f62a)): | ||
``` | ||
BenchmarkRestic-4 23384482467 ns/op 448.41 MB/s 8943320 B/op 15 allocs/op | ||
BenchmarkFastCDC-4 8080957045 ns/op 1297.59 MB/s 16777336 B/op 3 allocs/op | ||
``` | ||
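The headline numbers can be checked against the benchmark output above. The following is a minimal sketch, assuming the `MB/s` column uses 10^6 bytes per megabyte, as `go test -bench` reports it; the helper names `speedup` and `mbToGiB` are illustrative and not part of this package.

```go
package main

import "fmt"

// speedup returns how many times faster throughput a is than throughput b.
// Both values must use the same unit.
func speedup(a, b float64) float64 {
	return a / b
}

// mbToGiB converts a throughput in MB/s (10^6 bytes) to GiB/s (2^30 bytes).
func mbToGiB(mbPerSec float64) float64 {
	const gib = 1 << 30
	return mbPerSec * 1e6 / float64(gib)
}

func main() {
	// Throughputs in MB/s, copied from the benchmark output above.
	const restic, fastcdc = 448.41, 1297.59

	fmt.Printf("speedup: %.2fx\n", speedup(fastcdc, restic)) // speedup: 2.89x
	fmt.Printf("fastcdc: %.2f GiB/s\n", mbToGiB(fastcdc))    // fastcdc: 1.21 GiB/s
}
```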
|
||
## Normalization | ||
|
||
A key feature of FastCDC is chunk size normalization. Normalization helps to improve the distribution of chunk sizes, increasing the number of chunks close to the target average size and reducing the number of chunks clipped by the maximum chunk size, as compared to the [Rabin-based](https://en.wikipedia.org/wiki/Rabin_fingerprint) chunking algorithm used in `restic/chunker`. | ||
|
||
The histograms below show the chunk size distribution for `fastcdc-go` and `restic/chunker` on 1 GiB of random data, each with average chunk size 1 MiB, minimum chunk size 256 KiB and maximum chunk size 4 MiB. The normalization level for `fastcdc-go` is set to 2. | ||
|
||
![](./img/fastcdcgo_norm2_dist.png) ![](./img/restic_dist.png) | ||
|
||
Compared to `restic/chunker`, the distribution of `fastcdc-go` is less skewed (standard deviation 345 KiB vs. 964 KiB). | ||
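The idea behind normalization can be sketched in a few lines of Go. This is an illustration of the technique described in the FastCDC paper, not this library's implementation: before the target average size a stricter mask (more bits) makes a cut less likely, and after it a looser mask (fewer bits) makes a cut more likely. The gear table seeded from `math/rand`, the low-bit masks, and the helper names `masks`, `cutPoint`, `split` and `stddev` are all assumptions made for this example. It also assumes `minSize < avgSize < maxSize` and a power-of-two `avgSize`.

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
	"math/rand"
)

// gear maps each byte value to a pseudo-random 64-bit number for the rolling hash.
// Assumption: generated from a fixed seed here; real implementations embed a fixed table.
var gear = func() (t [256]uint64) {
	r := rand.New(rand.NewSource(1))
	for i := range t {
		t[i] = r.Uint64()
	}
	return t
}()

// masks returns the strict and loose cut masks for a power-of-two average
// size. Each normalization level adds one bit to the strict mask and removes
// one from the loose mask; level 0 makes both masks equal (no normalization).
func masks(avgSize int, level uint) (strict, loose uint64) {
	b := uint(bits.Len(uint(avgSize)) - 1) // log2(avgSize)
	return uint64(1)<<(b+level) - 1, uint64(1)<<(b-level) - 1
}

// cutPoint returns the length of the next chunk at the start of data.
// Hashing starts at minSize, so no cut is made before that offset.
func cutPoint(data []byte, minSize, avgSize, maxSize int, level uint) int {
	if len(data) <= minSize {
		return len(data)
	}
	if len(data) < maxSize {
		maxSize = len(data)
	}
	strict, loose := masks(avgSize, level)
	var fp uint64
	for i := minSize; i < maxSize; i++ {
		fp = (fp << 1) + gear[data[i]]
		mask := strict // before the average size: harder to cut
		if i >= avgSize {
			mask = loose // past the average size: easier to cut
		}
		if fp&mask == 0 {
			return i + 1
		}
	}
	return maxSize
}

// split returns the sizes of all chunks in data.
func split(data []byte, minSize, avgSize, maxSize int, level uint) []int {
	var sizes []int
	for len(data) > 0 {
		n := cutPoint(data, minSize, avgSize, maxSize, level)
		sizes = append(sizes, n)
		data = data[n:]
	}
	return sizes
}

// stddev returns the population standard deviation of the chunk sizes.
func stddev(sizes []int) float64 {
	var sum, sq float64
	for _, s := range sizes {
		sum += float64(s)
	}
	mean := sum / float64(len(sizes))
	for _, s := range sizes {
		d := float64(s) - mean
		sq += d * d
	}
	return math.Sqrt(sq / float64(len(sizes)))
}

func main() {
	data := make([]byte, 8<<20)
	rand.New(rand.NewSource(42)).Read(data)
	// Print the spread without normalization (level 0) and with level 2.
	for _, level := range []uint{0, 2} {
		sizes := split(data, 2<<10, 8<<10, 32<<10, level)
		fmt.Printf("level %d: %d chunks, stddev %.0f bytes\n", level, len(sizes), stddev(sizes))
	}
}
```

Running the sketch prints the chunk count and standard deviation for both levels, which gives a rough feel for the effect the histograms above illustrate. The sizes used here are much smaller than the 1 MiB average in the README so that the example runs quickly.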
|
||
## License | ||
|
||
FastCDC-Go is licensed under the Apache 2.0 License. See [LICENSE](./LICENSE) for details. | ||
|
||
## References | ||
|
||
- Xia, Wen, et al. "Fastcdc: a fast and efficient content-defined chunking approach for data deduplication." 2016 USENIX Annual Technical Conference | ||
- Xia, Wen, et al. "Fastcdc: a fast and efficient content-defined chunking approach for data deduplication." 2016 USENIX Annual Technical Conference | ||
[pdf](https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
/* | ||
Package fastcdc is a Go implementation of the FastCDC content-defined chunking algorithm. | ||
See https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf for details. | ||
*/ | ||
package fastcdc |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,53 @@ | ||
package fastcdc_test | ||
|
||
import ( | ||
"bytes" | ||
"crypto/md5" | ||
"fmt" | ||
"io" | ||
"log" | ||
"math/rand" | ||
|
||
"github.com/iotafs/fastcdc-go" | ||
) | ||
|
||
func Example_basic() { | ||
|
||
data := make([]byte, 10*1024*1024) | ||
rand.Seed(4542) | ||
rand.Read(data) | ||
rd := bytes.NewReader(data) | ||
|
||
chunker, err := fastcdc.NewChunker(rd, fastcdc.Options{ | ||
AverageSize: 1024 * 1024, // target 1 MiB average chunk size | ||
}) | ||
if err != nil { | ||
log.Fatal(err) | ||
} | ||
|
||
fmt.Printf("%-32s %s\n", "CHECKSUM", "CHUNK SIZE") | ||
|
||
for { | ||
chunk, err := chunker.Next() | ||
if err == io.EOF { | ||
break | ||
} | ||
if err != nil { | ||
log.Fatal(err) | ||
} | ||
|
||
fmt.Printf("%x %d\n", md5.Sum(chunk.Data), chunk.Length) | ||
} | ||
|
||
// Output: | ||
// CHECKSUM                         CHUNK SIZE | ||
// d5bb40f862d68f4c3a2682e6d433f0d7 1788060 | ||
// 113a0aa2023d7dce6a2aac1f807b5bd2 1117240 | ||
// 5b9147b10d4fe6f96282da481ce848ca 1180487 | ||
// dcc4644befb599fa644635b0c5a1ea2c 1655501 | ||
// 224db3de422ad0dd2c840e3e24e0cb03 363172 | ||
// e071658eccda587789f1dabb6f773851 1227750 | ||
// 215868103f0b4ea7f715e179e5b9a6c7 1451026 | ||
// 21e65e40970ec22f5b13ddf60493b746 1150129 | ||
// b8209a1dbef955ef64636af796450252 552395 | ||
} |
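The per-chunk checksums printed by the example are the usual building block for deduplication, the use case named in the FastCDC paper's title. The following is a minimal sketch of that idea using only the standard library. The `dedupe` helper is hypothetical and not part of this package, and it takes ready-made chunks rather than calling the chunker.

```go
package main

import (
	"crypto/md5"
	"fmt"
)

// dedupe counts the distinct chunks and the bytes that would be saved by
// storing each distinct chunk only once, keyed by its MD5 checksum.
func dedupe(chunks [][]byte) (unique, savedBytes int) {
	seen := make(map[[md5.Size]byte]struct{})
	for _, c := range chunks {
		sum := md5.Sum(c)
		if _, ok := seen[sum]; ok {
			savedBytes += len(c) // duplicate: no need to store it again
			continue
		}
		seen[sum] = struct{}{}
		unique++
	}
	return unique, savedBytes
}

func main() {
	chunks := [][]byte{[]byte("alpha"), []byte("beta"), []byte("alpha")}
	unique, saved := dedupe(chunks)
	fmt.Println(unique, saved) // 2 5
}
```

MD5 is used here only because the example above uses it. A real deduplicating store would more likely use a collision-resistant hash such as SHA-256.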