The Verifiable Remote File Storage (VRFS) service checks the consistency of remotely stored files, uploaded as part of a fileset and then downloaded individually.
Files are verified as untampered on the client side, based on the retrieved file's hash, the known Merkle Tree root hash of the corresponding fileset, and the Merkle Tree proof provided by the service for each file.
The Merkle Tree root and proofs are computed and stored by the VRFS service once the fileset has been uploaded to the File Storage service.
Refer to the architecture section for more info about the chosen protocol.
This is the mono-repository for the Go-based implementation of 2 backend services and 1 client CLI.
3 main components are implemented:
- The VRFS API service - The core component of this protocol, exposing a gRPC API
- A basic File Storage service exposing a gRPC API to batch upload files and download them individually. It also supports the retrieval of its stored files' hash.
- A client CLI to execute the 2 main upload and download operations, along with a MerkleTree root computation and the proof-based file hash verifications
The gRPC protocol is used for optimal client-server and server-to-server communications.
During upload and download operations, file data is streamed and a maximum chunk size can be specified. Large files are supported, and interruptions during those transfers are properly handled.
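As an illustration of the streaming upload, the sketch below reads a file in fixed-size chunks and sends each chunk over a client-side gRPC stream. The generated stubs (pb.FileServiceClient, pb.UploadRequest) and the import path are assumptions, not this repository's exact API:

```go
package main

import (
	"context"
	"io"
	"os"

	// Hypothetical generated gRPC stubs, for illustration only.
	pb "github.com/ja88a/vrfs-go-merkletree/libs/rpcapi"
)

// uploadFile streams one file to the File Storage service in chunks of chunkSize bytes.
func uploadFile(ctx context.Context, client pb.FileServiceClient, path string, chunkSize int) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	stream, err := client.Upload(ctx) // client-side streaming RPC
	if err != nil {
		return err
	}
	buf := make([]byte, chunkSize)
	for {
		n, readErr := f.Read(buf)
		if n > 0 {
			// Send one chunk; the server reassembles the file on its side.
			if err := stream.Send(&pb.UploadRequest{Data: buf[:n]}); err != nil {
				return err
			}
		}
		if readErr == io.EOF {
			break
		}
		if readErr != nil {
			return readErr
		}
	}
	_, err = stream.CloseAndRecv() // final server acknowledgement
	return err
}
```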
A complete Docker Compose setup enables an easy local build & run of the 2 server modules, along with a Redis DB acting as a distributed KV cache:
# Standard Docker Compose launch
$ docker compose up
# OR using a few custom config options
$ make docker-compose-up
# Quick clean-up (removal) of the local Docker containers & images for this project
$ make docker-cleanup
The Docker images have their build time optimized through caching, and a production layer enables running clean instances.
Refer to the respective Dockerfiles:
- VRFS API server:
./vrfs-api/DockerFile
- File Storage server:
./vrfs-fs/DockerFile
Note there the monorepo-specific dependency management requirements when replace directives are used in the respective go.mod files.
Distroless images are used for generating lightweight Docker containers.

Alternatively, run the servers locally:
# Run locally the File Storage server
$ go run ./vrfs-fs
# Run locally the VRFS server
$ go run ./vrfs-api
List the available client CLI parameters:
$ go run ./client -h
File Upload & Verify protocol: Upload all files of a local directory to the remote file storage server:
# Upload a fileset with default service endpoints
$ go run ./client -action upload -updir ./fs-playground/forupload/catyclops
# Or by specifying the service endpoints and a max chunk size
$ go run ./client -action upload -updir ./fs-playground/forupload/catyclops \
-api vrfs-api:50051 \
-fs vrfs-fs:9000 \
-chunk 1024
Download a file locally from the VRFS API & File Storage services and have it verified:
# Download & verify file command, by specifying an alternative download directory
$ go run ./client -action download \
-fileset fs-10B..7E21 \
-index 5 \
-downdir ./fs-playground/downloaded
Demo scripts for running client commands are available in the Makefile. A default playground directory fs-playground with sample files is provided for testing file uploads and downloads.
# Start the backend services
$ make docker-compose-up
# Upload local files remotely
$ make demo-run-upload
# Download a file
$ make demo-run-download
# Build exec Client CLI
$ go build -o ./dist/vrfs-client ./client
# Build exec VRFS Server
$ go build -o ./dist/vrfs-server ./vrfs-api
# Build exec VRFS FileServer
$ go build -o ./dist/vrfs-fs ./vrfs-fs
The client CLI is configurable via the command parameters it exposes. Its settings and default values are defined in the client's main.go.
The VRFS & FS server configurations rely on their dedicated yaml config files available in config; those parameters can be overridden via optional .env files or via runtime environment variables.
Refer to the cleanenv solution and its integration in the utility module libs/config.
All config settings come with default values to enable an out-of-the-box experience, and an easier dev one!
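As an illustration, cleanenv typically merges the YAML file with environment variable overrides in one call; the sketch below assumes hypothetical field names and a hypothetical config file path, not the repository's actual schema:

```go
package main

import (
	"log"

	"github.com/ilyakaznacheev/cleanenv"
)

// ServerConfig is a hypothetical settings struct; the actual schema lives in libs/config.
type ServerConfig struct {
	Host string `yaml:"host" env:"VRFS_HOST" env-default:"0.0.0.0"`
	Port int    `yaml:"port" env:"VRFS_PORT" env-default:"50051"`
}

func main() {
	var cfg ServerConfig
	// Reads the YAML file first, then applies environment variable overrides.
	if err := cleanenv.ReadConfig("config/vrfs-api.yml", &cfg); err != nil {
		log.Fatalf("config load failed: %v", err)
	}
	log.Printf("listening on %s:%d", cfg.Host, cfg.Port)
}
```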
Pre-requisites:
- Go dev framework >= v1.18 - install
- Protocol buffer compiler >= v3 - install
- Protobuf gRPC Go plugins to generate the client SDK and server API stubs - install
- Docker for running the backend services in virtual containers - install
- make, for benefiting from the available Makefile commands
Versions used while developing:
- Go :
go1.21.4 linux/amd64
- protoc :
libprotoc 3.12.4
- Docker :
24.0.7, build afdd53b
A Go workspace is used for handling the different modules that make up this monorepo.
Refer to the workspace config file: go.work.
Adding a new module to the workspace:
$ mkdir moduleX
moduleX/$ go mod init github.com/ja88a/vrfs-go-merkletree/moduleX
$ go work use ./moduleX
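For reference, the resulting go.work file simply lists every module directory in use; a minimal example (the exact module list here is illustrative):

```
go 1.21

use (
	./client
	./libs/config
	./vrfs-api
	./vrfs-fs
	./moduleX
)
```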
The overall implemented protocol for uploading local files to the remote file storage service, downloading them, and having the files verified relies on the generation of a Merkle Tree root and proofs for checking the leaf values (here, the file hashes):
Key principles:
- The VRFS service handles the creds/access to the [external] FS service
- Files are directly uploaded to & downloaded from the FS service
- VRFS retrieves the file hashes from the FS server, for building its Merkle Tree and storing the corresponding proofs
Design motivations:
- Separation of concerns: storing the filesets (File Storage server) vs. handling the files' verification process (VRFS API)
  - Independence from the underlying file storage service, i.e. the actual FS can be replaced by a 3rd party file storage solution
  - The corresponding micro-service cloud hosting instances can be customized per their core business requirements, i.e. their computation, bandwidth and memory needs
- Minimized bandwidth consumption: limited file transfers
  - With this design option, the file storage service exposing an API for retrieving the file bucket's hashes is a key requirement, since it avoids the VRFS API having to upload or download the files itself
  - A 3rd party solution would probably not support the provision of file hashes; integrating an IPFS CID might be an option
The considered alternative system architectures are reported in the diagram VRFS design options.
This protocol results in:
- 6 steps to remotely store a fileset - 1 client command: 1 VRFS API request & n file upload requests + 1 VRFS->FS API request
- 3 steps to retrieve & verify a file - 1 client command: 1 VRFS API request & 1 file download request
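To illustrate the verification step, the sketch below uses the txaty/go-merkletree library (credited at the end of this document) to build a tree from file hashes and check one leaf against the root with its proof; the fileHash data block type and the dummy hash values are simplified assumptions:

```go
package main

import (
	"fmt"
	"log"

	mt "github.com/txaty/go-merkletree"
)

// fileHash wraps a file's hash bytes as a Merkle Tree leaf (simplified assumption).
type fileHash struct{ data []byte }

func (f *fileHash) Serialize() ([]byte, error) { return f.data, nil }

func main() {
	// Leaf values: the fileset's file hashes (dummy values here).
	blocks := []mt.DataBlock{
		&fileHash{data: []byte("hash-of-file-0")},
		&fileHash{data: []byte("hash-of-file-1")},
		&fileHash{data: []byte("hash-of-file-2")},
	}
	// Build the tree; the default config also generates the proofs.
	tree, err := mt.New(nil, blocks)
	if err != nil {
		log.Fatal(err)
	}
	// Client side: verify one file hash against the known root and its proof.
	ok, err := mt.Verify(blocks[1], tree.Proofs[1], tree.Root, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("file verified:", ok)
}
```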
Overview of the VRFS service main components:
Overview of a considered scalable solution to be deployed:
The depicted files' upload, download & verification protocol is implemented and finalized.
The serialization of the Merkle Tree proofs might be further optimized: proofs are persisted in the DB once a fileset upload is verified/confirmed, then retrieved and communicated to clients on every file download info request, for later verification.
A persistence layer for the VRFS service is implemented via a distributed memory cache solution, using Redis.
An additional DB ORM integration could be required; a NoSQL DB such as MongoDB could do the job.
The computation models and their settings for the backbone Merkle Tree are to be further refined and benchmarked, per the integration use case(s) and the corresponding optimization requirements for ad-hoc computation, storage and transport.
For computing the file hashes, which constitute the Merkle Tree leaf values, the SHA2-256 hashing function is used (NIST; a 64-character hex string for every hash). Alternative file hashing functions might be considered to adapt and/or optimize the computation runtime. Note that the client and the FS server must use the same hashing function on files, since both build a Merkle Tree out of the file hashes.
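A minimal sketch of this leaf-value computation, streaming a file through the standard library's SHA-256 (not the repository's exact helper; the sample path is illustrative):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"log"
	"os"
)

// hashFile streams a file through SHA-256 and returns the 64-character hex digest.
func hashFile(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := sha256.New()
	// io.Copy streams the file in chunks, so large files are not loaded into memory.
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	digest, err := hashFile("fs-playground/sample.txt") // example path
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(digest)
}
```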
No user authentication mechanism has been implemented to protect the access to the service APIs.
Candidate client authentication options (see the interceptor sketch after this list):
- API Keys support
- ECDSA signature - a digital signature made from an externally owned account
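API key support could, for instance, take the form of a gRPC unary server interceptor checking a metadata header; a hypothetical sketch, where the header name and key store are assumptions:

```go
package main

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/metadata"
	"google.golang.org/grpc/status"
)

// apiKeyInterceptor rejects calls lacking a valid "x-api-key" metadata entry (hypothetical header).
func apiKeyInterceptor(validKeys map[string]bool) grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req any, info *grpc.UnaryServerInfo,
		handler grpc.UnaryHandler) (any, error) {
		md, ok := metadata.FromIncomingContext(ctx)
		if !ok || len(md.Get("x-api-key")) == 0 || !validKeys[md.Get("x-api-key")[0]] {
			return nil, status.Error(codes.Unauthenticated, "missing or invalid API key")
		}
		return handler(ctx, req)
	}
}

// Usage: grpc.NewServer(grpc.UnaryInterceptor(apiKeyInterceptor(keys)))
```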
External services acting as load balancers and an API Gateway should be integrated in order to deal with:
- Load balancing, API requests routing & APIs versioning - Ex.: AWS ALB
- External communications encryption support & private subnets management
- User authentication & permissions - Ex.: AWS Cognito, Keycloak
The services' JSON logs should be reported to a remote log watcher in order to review them, monitor the services and/or automatically trigger runtime alerts.
A monitoring infrastructure is to be added, with servers reporting their usage & performance statistics.
The integration of a Prometheus-like time-series events database should be considered so that each server reports its stats.
Complementary tools such as a monitoring dashboard, e.g. Grafana, and runtime alerts management, e.g. Kibana, should be considered for production.
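A minimal sketch of what such server-side instrumentation could look like, using the prometheus/client_golang library (the metric name and port are illustrative assumptions):

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// uploadsTotal counts processed fileset uploads (illustrative metric name).
var uploadsTotal = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "vrfs_fileset_uploads_total",
	Help: "Total number of fileset uploads handled by the VRFS API.",
})

func main() {
	prometheus.MustRegister(uploadsTotal)
	// Expose the /metrics endpoint for the Prometheus scraper.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```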
Production gRPC communications should rely on TLS encryption over HTTP/2.
X.509 certificates are to be deployed at the servers' level, with secure connections initiated by the clients.
Refer to the grpc.WithTransportCredentials dial option.
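On the client side, this typically means loading the server's certificate and dialing with transport credentials; a minimal sketch, assuming illustrative file paths and endpoint:

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

func main() {
	// Load the server's X.509 certificate (path is an assumption).
	creds, err := credentials.NewClientTLSFromFile("certs/server-ca.pem", "")
	if err != nil {
		log.Fatalf("failed to load TLS credentials: %v", err)
	}
	// Dial the VRFS API over an encrypted connection instead of an insecure one.
	conn, err := grpc.Dial("vrfs-api:50051", grpc.WithTransportCredentials(creds))
	if err != nil {
		log.Fatalf("failed to dial: %v", err)
	}
	defer conn.Close()
}
```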
A basic versioning mechanism is currently implemented for managing the Protobuf-based gRPC stubs integrated by the client and the servers.
The Docker images are tagged using the semver CLI tool.
The services handle a version number through their configuration file.
The integration of the semver solution is to be pushed further.
Automated unit and integration tests are not addressed in this repository.
E2E tests and continuous integration/deployment flows should be implemented.
A Cobra-like integration should be considered if the client CLI is given priority.
A logs manager on the client side might also be integrated for prettier outputs.
The Merkle Tree support in this VRFS solution was originally developed by Tommy TIAN in March 2023.
This Go library is distributed under the MIT license and its code repository is available at https://github.com/txaty/go-merkletree.
Minor packaging changes have been made; custom config and fileset-specific utilities extend the original implementation.
The GNU Affero General Public License version 3 (AGPL v3, 2007) applies to this mono-repository and all of its software modules.
Refer to the dedicated license files.