Commit 49025a4

updating docs for first release

will-rowe committed Jun 19, 2020
1 parent dccd1a4 commit 49025a4

Showing 17 changed files with 241 additions and 78 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -1,5 +1,5 @@
<div align="center">
<img src="https://raw.githubusercontent.com/will-rowe/stark/master/docs/stark-logo-with-text.png" alt="stark-logo" width="250">
<img src="https://raw.githubusercontent.com/will-rowe/stark/master/docs/img/stark-logo-with-text.png" alt="stark-logo" width="250">
<h3>Sequence Transmission And Record Keeping</h3>
<hr>
<a href="https://travis-ci.org/will-rowe/stark"><img src="https://travis-ci.org/will-rowe/stark.svg?branch=master" alt="travis"></a>
@@ -35,7 +35,7 @@ The easiest way to install is using Go (v1.14):
```sh
export GO111MODULE=on
release=0.0.0
go get -v github.com/will-rowe/stark/...@$(release)
go get -v github.com/will-rowe/stark/...@${release}
```

### Usage
92 changes: 77 additions & 15 deletions docs/about.md
@@ -1,28 +1,90 @@
# About

## Overview
`stark` is an [InterPlanetary File System](https://ipfs.io/)-backed database for recording and distributing sequencing experiments. It is both an application and a Go package for running and interacting with `stark databases`.

**stark** is an IPFS-backed database for recording and distributing sequencing data. It is both an application and a Go package for running and interacting with **stark databases**. Features include:

- snapshot, sync and share entire databases over the IPFS
- use PubSub messaging to share and collect data records as they are created
- track record history and rollback revisions (rollback feature WIP)
- attach and sync files to records (WIP)
- encrypt record fields
- submit databases to [pinata](https://pinata.cloud/) pinning service for easy backup and distribution
- submit database snapshots to [pinata](https://pinata.cloud/) pinning service for persistence and distribution

### The database
***

- **stark databases** track, update and share sequence `records`
- a database is aliased by a `project name` which groups the `records`
- `projects` and `records` are DAG nodes in the [IPFS](https://ipfs.io/)
- DAG `links` are created between `records` and the `projects` that use them
- `records` and `projects` are pointed to by `content identifiers (CIDs)`
- the `CIDs` change when the content they point to is altered, so databases track them locally using `keys`
- databases are re-opened and shared using the `project` `CID` (termed a `snapshot`)
## Aims

### Records
- `stark databases` can store, update and share sequencing experiments
- each sequencing experiment is documented via an updateable `record`
- a database and the contained `records` are **distributed**, **immutable** and **versioned**
- `records` can be used by multiple databases
- `records` can be updated but previous versions are still accessible
- likewise, previous versions of the database can be retrieved
- a database is referenced by a `project`
- database instances for the same `project` can communicate and pass `records`
- a `project` remains accessible even if the node which produced it goes offline

- `records` are a data structure used to represent a Nanopore sequencing run (but can be hijacked and extended to be more generic or to represent Samples and Libraries)
- `records` are defined in [protobuf](https://developers.google.com/protocol-buffers) format (which is compiled with Go bindings using [this makefile](./schema/Makefile))
- currently, `records` are serialised to JSON for IPFS transactions
## How it works

### Documenting sequencing experiments

Sequencing experiments are encoded as `records`.

> This is currently in its infancy and only basic information can be encoded.

A sequencing experiment could be a biological sample, a sequencing library or a sequencer run. The idea is that a `record` can encode any of these. We can add links between `records` to show relationships, such as multiple sequencing runs linked to a single library.

The `record` data is structured and serialised using a [protobuf](https://developers.google.com/protocol-buffers) schema. This means that `records` are **language-neutral**, **platform-neutral** and **extensible**. It also means that we can wrap the database up in a [gRPC server](https://grpc.io/docs/what-is-grpc/introduction/) and use protobuf messages to send and receive `records` to and from the database.
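As a rough sketch of what this means in practice, here is how a `record` might be marshalled to JSON for an IPFS transaction. Note that the `Record` struct below is a hand-rolled stand-in with assumed fields, not the actual protobuf-generated type:

```go
// A minimal, illustrative sketch: the real record type is generated from the
// protobuf schema; the fields here are assumptions for demonstration only.
package main

import (
	"encoding/json"
	"fmt"
)

// Record stands in for the protobuf-generated type.
type Record struct {
	UUID  string `json:"uuid"`
	Alias string `json:"alias"`
}

func main() {
	r := Record{UUID: "0001", Alias: "my first sample"}

	// serialise the record to JSON, as is currently done for IPFS transactions
	data, err := json.Marshal(&r)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(data)) // {"uuid":"0001","alias":"my first sample"}
}
```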

Some features that are currently being worked on:

* linking records to files (e.g. fast5 files)
* attaching process requests to records (e.g. basecall)


### Storing records

The heavy-lifting is done by the [InterPlanetary File System](https://ipfs.io/). Check out the [what is IPFS](https://docs.ipfs.io/concepts/what-is-ipfs) pages for more information on the concepts behind IPFS. Behind the scenes IPFS is doing the following:

- a `record` is broken into blocks, which are then formatted as [IPLD nodes](https://ipld.io/)
- each node is hashed by its contents and used to build a [Merkle DAG](https://docs.ipfs.io/concepts/merkle-dag/)
- the hash of each node is called a content identifier (`CID`)
- the `CID` of the Merkle DAG root node is used to address the whole `record`

Merkle DAGs are constructed from leaves first, i.e. child nodes are constructed before parent nodes. This makes every node in the DAG the root of a subgraph that is contained by its parent node(s). This property makes the graph immutable, as a change in a leaf node will be propagated throughout the DAG and alter the base `CID`, effectively producing a new Merkle DAG. It also means that membership queries are 'cheap' and only require a few hashes to compute ([see this example with bananas](https://miro.medium.com/max/1400/0*lR_IMzUjQUJgXq5A.png) from Consensys).
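Here is a toy sketch of that propagation. This is not how IPFS actually chunks or encodes blocks; it only illustrates the idea of hashing children into parents:

```go
// Toy illustration: a parent hash is computed over its child hashes, so
// editing one leaf produces a new root while unchanged leaves are shared.
package main

import (
	"crypto/sha256"
	"fmt"
)

// hash returns a truncated hex digest, standing in for a CID.
func hash(data string) string {
	return fmt.Sprintf("%x", sha256.Sum256([]byte(data)))[:8]
}

func main() {
	leafA := hash("block 1")
	leafB := hash("block 2")
	root := hash(leafA + leafB) // parent built from child hashes

	leafB2 := hash("block 2 (edited)")
	root2 := hash(leafA + leafB2)

	fmt.Println(root, root2) // different roots: the edit propagated upwards
	fmt.Println(leafA)       // leafA is unchanged and shared by both DAGs
}
```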

Here is a simplified example of how a `record` is represented in the IPFS and how altering the `record` itself changes the base `CID` which is used to identify it:

![dag-eg](img/dag/mdag-eg.gif)

You can see that by changing a value in the `record`, one of the underlying data blocks changes and this change propagates through the Merkle DAG as `CIDs` are updated to reflect the change in content. We now have a new base `CID` to address this new graph, which still represents our `record`.

However, you can also see that not all nodes in the DAG are altered. This is useful as similar `records` can share subgraphs (a feature that IPFS exploits to collect blocks from different sources and authenticate them quickly).


### Tracking records

To track `records` in stark databases, we use a specially formatted IPFS node which uses the [unixfs data format](https://github.com/ipfs/go-unixfs/blob/master/pb/unixfs.proto) and basically serves as a directory in the IPFS. This node is the `project` node and acts as a parent node which we link `record` DAGs to.

Each link between `project` and `record` is labelled with the `CID` of the `record` and the `record` name (which is used as a database key).

For example:

![project-layout](img/dag/mdag-3.png)

The figure above shows that record A is linked to Project 1, record B is linked to both Project 1 and 2, whilst record C is linked to Project 2. Records A and B share a subgraph.

You can also see in the figure that each `project` node has a `CID` (which begins with **Qm**<sup>&</sup>). The `project` `CID` is changed when links are added or removed, or when any `records` are changed.

> <sup>&</sup> as an aside, see [here](https://proto.school/#/anatomy-of-a-cid) for more information on interpreting CIDs

This means that `stark` only needs to keep track of the `project CID`. It can use this to gather up all the links and place them in a runtime key-value store, where the key is the link name given by the user and the value is the `record CID`. By using a key-value store to track `record CIDs`, the project's DAG links only need to be checked once each time the database is opened.

The `project CID` is the database `snapshot` - each time the database contents change, the `snapshot` changes. We can roll-back and fast-forward the database state using the `snapshot`.
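A sketch of that runtime key-value store is shown below. The types are illustrative assumptions, not stark's actual internals:

```go
// Sketch: rebuild an alias -> CID lookup once from the project node's links,
// so record lookups don't need to walk the DAG again while the DB is open.
package main

import "fmt"

// link mirrors a merkledag link: a name plus the CID it points to.
type link struct {
	Name string
	CID  string
}

func main() {
	// links gathered from the project (snapshot) node on open;
	// the names and CIDs here are made up for illustration
	projectLinks := []link{
		{Name: "sample-1", CID: "QmRecordCID1"},
		{Name: "sample-2", CID: "QmRecordCID2"},
	}

	// build the runtime store once per database open
	store := make(map[string]string)
	for _, l := range projectLinks {
		store[l.Name] = l.CID
	}

	fmt.Println(store["sample-1"]) // QmRecordCID1
}
```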

### Sharing records

As well as sharing an entire database by giving the `CID` of the `project` node, a stark database can also communicate in real-time via the IPFS [PubSub](https://blog.ipfs.io/25-pubsub/) messaging service.

This can be used both to let others know what `records` are being added and to tell one database instance to collect `records` from another producer with the same `project` name.
118 changes: 111 additions & 7 deletions docs/app.md
@@ -1,9 +1,113 @@
# STARK as an app
# Using stark as an app

## Commands
## Overview

- `stark init <project>` - Create a new database for a project.
- `stark open <project> options...` - Open a database for a project.
- `stark add <key>` - Add a key to an open database.
- `stark get <key>` - Get a record from an open database.
- `stark dump` - Dump the metadata from an open database.
The app is used to manage stark databases. Here are a few key points:

* a database must be open in order to run `add`/`get`/`dump`
* only one database can be open at a time
* an open database is interfaced by a [gRPC](https://grpc.io/docs/what-is-grpc/introduction/) server
* `records` are passed to and from the database via protobuf messages
* `records` are added and retrieved using a `key`, which is the `record's` `alias` field
* to keep track of projects locally, `stark` has a config file which stores the most recent database `snapshot` (default location: `~/.stark.json`)

## Subcommands

- `stark open <project>` - Open a database for a `project`.
- `stark add` - Add a `record` to an open database.
- `stark get <key>` - Get a `record` from an open database.
- `stark dump` - Dump the current metadata from an open database.

***

### Open

To open a database, just use the `open` subcommand:

```sh
stark open my-project
```

This will check the stark config file to see if a database has been opened for this `project` before:

- if the `project` is found it will recover the most recent `snapshot CID` for this `project` and then collect all the `record` links in a key-value store
- if the `project` is not found, `stark` will open a new database and add it to the config for next time
- it's easiest to open the database in one terminal window and then run `add` and `get` in another
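A sketch of this open-or-create decision is shown below. All names here are illustrative assumptions, not stark's actual API:

```go
// Sketch: resume a known project from its most recent snapshot, otherwise
// start a new database and record the project in the config for next time.
package main

import "fmt"

// config maps project names to their most recent snapshot CIDs.
type config map[string]string

func open(cfg config, project string) {
	if snapshot, ok := cfg[project]; ok && snapshot != "" {
		fmt.Printf("resuming %s from snapshot %s\n", project, snapshot)
		// stark would now collect the record links into its key-value store
		return
	}
	fmt.Printf("no snapshot found, opening a new database for %s\n", project)
	cfg[project] = "" // a snapshot CID will be stored here as the DB changes
}

func main() {
	cfg := config{"my-project": "QmSnapshotCID..."}
	open(cfg, "my-project")
	open(cfg, "another-project")
}
```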

#### Flags

`--withListen`

- tells the database to listen for `records` being added to other database instances for the same `project`
- for instance, if I had a database open for **metagenomics-project-101** and a collaborator also had a database open with this `project` name, my database instance could pull in all `records` that my collaborator was adding to their database (provided they were using the `--withAnnounce` flag)
- this works best if `--withPeers` is used to connect the two databases directly

`--withAnnounce`

- this is used to announce `records` as they are added to the database
- announced `records` can be picked up by databases that are listening (via the `--withListen` flag)
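For example, a listen/announce pairing between two database instances might look like this (the project name is illustrative):

```sh
# collaborator's terminal: announce records as they are added
stark open metagenomics-project-101 --withAnnounce

# your terminal: listen for records announced by other instances
stark open metagenomics-project-101 --withListen
```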

`--withEncrypt`

- encrypts `record` fields when adding a `record` to the database
- this flag must also be used to get encrypted `records`
- if you try to `get` an encrypted `record` without this flag, the `get` will fail
- to provide the encryption password, use the `STARK_DB_PASSWORD` environment variable
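For example (the passphrase here is a placeholder):

```sh
# provide the passphrase used to encrypt/decrypt record fields
export STARK_DB_PASSWORD=my-secret-passphrase

# open the database with field encryption enabled
stark open my-project --withEncrypt
```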

`--withPinata <int>`

`--withPeers <string>`

***

### Add

To add a `record` to an open database:

```sh
cat record.json | stark add
```

- the `record` must follow the schema or the `add` will fail
- the `record` alias is used as the database `key`, which is needed for `record` retrieval
- if no STDIN or file is provided, the `add` subcommand will collect the `record` interactively using a user prompt (this is a WIP)
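For example, a minimal `record` might be added like this (whether an alias-only `record` satisfies the schema is an assumption here):

```sh
# the alias field becomes the database key for later retrieval
echo '{"alias": "sample-1"}' > record.json
cat record.json | stark add
```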

#### Flags

`--useProto`

- tells the database that the `record` being added is in protobuf format, not JSON

`--inputFile <string>`

- use this flag to provide the `record` via file instead of STDIN or interactively

***

### Get

To get a `record` from an open database:

```sh
stark get <key>
```

#### Flags

`--humanReadable`

- use this flag to print the `record` as human readable text

`--useProto`

- tells the database to return the `record` in protobuf format, not JSON

***

### Dump

To dump the current metadata from an open database:

```sh
stark dump
```
44 changes: 37 additions & 7 deletions docs/faq.md
@@ -1,9 +1,39 @@
# FAQ

- each instance of a database is linked to a project; re-opening a database with the same project name will edit that database
- the `OpenDB` and `NewRecord` constructor functions use functional options to set struct values - this is in an effort to keep the API stable (see [here](https://dave.cheney.net/2014/10/17/functional-options-for-friendly-apis))
- if a record is retrieved from the database and updated, you need to then re-add it to the database. In other words, a **stark database** only records the most recent version of a record committed to the IPFS
- records have a history, which can be used to roll back changes to other versions of the record that entered the IPFS
- even though the schema is in protobuf, most of the time it is marshalled to JSON to pass data around
- Record methods are not threadsafe - the database passes around copies of Records so this isn't much of an issue atm. The idea is that users of the library will end up turning Record data into something more usable and won't operate on them after initial Set/Gets
- Encryption is a WIP, currently only a Record's UUID will be encrypted as a proof of functionality. Encrypted Records are decrypted on retrieval, but this will fail if the database instance requesting them doesn't have the correct password.
## can anyone access my data?

Yes. Anything put into a stark database will be available on the public IPFS network. However, you could configure your own private network if you wanted to.

stark does have an option to encrypt record fields. The record itself will still be accessible on the public network but it will require a passphrase to decrypt the fields.

Note: Encryption is a WIP. Currently only a Record's UUID will be encrypted as a proof of functionality. Encrypted Records are decrypted on retrieval, but this will fail if the database instance requesting them doesn't have the correct password.

## is my data persistent?

Yes. stark pins the records it adds by default, which means that the node you are using to run the database should not delete them during garbage collection (see [here](https://docs.ipfs.io/concepts/persistence/)). If no other nodes request the records you add to a database on your node, the records will only exist on your node. This runs the risk that they could be lost (e.g. if you wipe your node's storage).

To be safe, and to speed up sharing, you can use the `withPinata` option to use the [pinata API](https://pinata.cloud/) and pin your data to their nodes (**requires a Pinata account**).

You can deactivate pinning (`withNoPinning`), which means records added to a stark database can be collected by the IPFS garbage collector, although this sort of defeats the point.
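If you want to check what your node has pinned, you can use the standard IPFS CLI, e.g.:

```sh
# list recursive pins on your node and look for a record's CID
ipfs pin ls --type recursive | grep <record cid>
```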

## can I use the IPFS tool to interact with records and projects?

Yes. Here are some examples:

* to list records in a project:

```sh
ipfs ls <project cid>
```

* to get a record from a project:

```sh
ipfs dag get <project cid>/<record alias>
```

* to get a field from a record:

```sh
ipfs dag get <project cid>/<record alias>/<field>
```
Binary file added docs/img/dag/mdag-1.png
Binary file added docs/img/dag/mdag-2.png
Binary file added docs/img/dag/mdag-3.png
Binary file added docs/img/dag/mdag-eg.gif
File renamed without changes.
Binary file added docs/img/raw/stark-projects.gvdesign
Binary file not shown.
Binary file added docs/img/raw/stark-records.gvdesign
Binary file not shown.
File renamed without changes
File renamed without changes
File renamed without changes
4 changes: 2 additions & 2 deletions docs/index.md
@@ -1,5 +1,5 @@
<div align="center">
<img src="stark-logo-no-badge.png?raw=true?" alt="stark-logo" width="250">
<img src="img/stark-logo-no-badge.png?raw=true?" alt="stark-logo" width="250">
</div>

---
@@ -9,7 +9,7 @@ Welcome to the documentation for STARK.
# Contents

- [About](./about.md)
- [Installation](./installing.md)
- [Installation](./installation.md)
- [STARK as an app](./app.md)
- [STARK as a package](./package.md)
- [FAQ](./faq.md)
2 changes: 1 addition & 1 deletion docs/installation.md
@@ -17,7 +17,7 @@ Both the app and the package can be installed at the same time using the Go tool
```sh
export GO111MODULE=on
release=0.0.1
go get -v github.com/will-rowe/stark/...@$(release)
go get -v github.com/will-rowe/stark/...@${release}
```

- To install the latest master:
55 changes: 11 additions & 44 deletions docs/package.md
@@ -1,51 +1,18 @@
# Using STARK as a package
# Using stark as a package

View the [Go Documentation](https://pkg.go.dev/github.com/will-rowe/stark) site for the complete **stark** package documentation.

This page will document some examples.

## Usage example

```go
// This basic program will create a new database, add a record to it and then retrieve a copy of that record.
package main

import (
	"fmt"

	"github.com/will-rowe/stark"
)

func main() {

	// init a starkDB
	db, dbCloser, err := stark.OpenDB(stark.SetProject("my project"))
	if err != nil {
		panic(err)
	}

	// defer the database closer
	defer dbCloser()

	// create a record
	record, err := stark.NewRecord(stark.SetAlias("my first sample"))
	if err != nil {
		panic(err)
	}

	// add record to starkDB
	err = db.Set("lookupKey", record)
	if err != nil {
		panic(err)
	}

	// retrieve record from the starkDB
	retrievedSample, err := db.Get("lookupKey")
	if err != nil {
		panic(err)
	}
	fmt.Println(retrievedSample.GetAlias())
}
```

## Examples

TODO

For now, have a look at the app code to get some ideas.

## Notes

- each instance of a database is linked to a project; re-opening a database with the same project name will edit that database
- the `OpenDB` and `NewRecord` constructor functions use functional options to set struct values - this is in an effort to keep the API stable (see [here](https://dave.cheney.net/2014/10/17/functional-options-for-friendly-apis))
- if a record is retrieved from the database and updated, you need to then re-add it to the database. In other words, a **stark database** only records the most recent version of a record committed to the IPFS
- records have a history, which can be used to roll back changes to other versions of the record that entered the IPFS
- even though the schema is in protobuf, most of the time it is marshalled to JSON to pass data around
- Record methods are not threadsafe - the database passes around copies of Records so this isn't much of an issue atm. The idea is that users of the library will end up turning Record data into something more usable and won't operate on them after initial Set/Gets
