Merge pull request #236 from jnywong/gcp-object-storage
Add documentation on managing GCP object storage
jnywong authored Jun 25, 2024
2 parents 2131244 + 2fb6ec7 commit 272e724
Showing 5 changed files with 417 additions and 113 deletions.
2 changes: 1 addition & 1 deletion user/topics/data/index.md
@@ -19,7 +19,7 @@ For more information, see the sections below.
filesystem
git
sharing
cloud
object-storage/index
```

## References and attribution
78 changes: 78 additions & 0 deletions user/topics/data/object-storage/index.md
@@ -0,0 +1,78 @@
# Cloud Object Storage

This section gives an overview of storing data in the cloud, as well as links to how-to guides for using specific tools to manage your cloud data:

```{toctree}
:maxdepth: 1
working-with-object-storage
manage-object-storage-aws
manage-object-storage-gcp
```

## Overview

Your hub lives in the cloud, so the preferred way to store data is [object storage](https://aws.amazon.com/what-is-cloud-object-storage/), such as Amazon S3 or Google Cloud Storage. Cloud object storage is essentially a key/value storage system:
the keys are strings, the values are bytes of data, and data is read and written using HTTP calls.

The performance characteristics of object storage are very different from those of file storage.
On one hand, each individual read or write to object storage has a high overhead (10–100 milliseconds), since it has to go over the network. On the other hand, object storage "scales out" nearly infinitely, meaning that we can make hundreds, thousands, or millions of concurrent read/write requests. *This makes object storage well suited for distributed data analytics.* However, data analysis software must be adapted to take advantage of these properties.

## Scratch versus persistent buckets on a 2i2c hub

Bucket
: A *bucket* is a container for objects.

Object
: An *object* is a file and any metadata that describes that file.

(object-storage:env-var-scratch)=
### Scratch buckets

[Scratch buckets](https://infrastructure.2i2c.org/topic/features/#scratch-buckets-on-object-storage) are designed for storage of *temporary* files, e.g. intermediate results.

:::{tip}
Any data in a scratch bucket is deleted after 7 days.

**Do not use scratch buckets to permanently store critical data.**
:::

Check the name of your scratch bucket by opening a Terminal in your hub and running the command

```bash
$ echo $SCRATCH_BUCKET
s3://2i2c-aws-us-scratch-showcase/<username>
```
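
Because `$SCRATCH_BUCKET` already includes your per-user prefix, you can build full object keys by appending a path to it. A minimal sketch, using a placeholder bucket value (on a real hub `$SCRATCH_BUCKET` is already set for you, so you would skip the first assignment):

```bash
# Sketch: compose an object key under your scratch prefix.
# Placeholder value for illustration only; a real hub sets this variable.
SCRATCH_BUCKET="s3://2i2c-aws-us-scratch-showcase/jovyan"
OBJECT_PATH="$SCRATCH_BUCKET/results/run-01.nc"
echo "$OBJECT_PATH"
```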

(object-storage:env-var-persistent)=
### Persistent buckets

[Persistent buckets](https://infrastructure.2i2c.org/topic/features/#persistent-buckets-on-object-storage) are designed for storing data that is used throughout the lifetime of a project; unlike scratch buckets, their data is not purged after a set number of days.

Check the name of your persistent bucket by opening a Terminal in your hub and running the command

```bash
$ echo $PERSISTENT_BUCKET
s3://2i2c-aws-us-persistent-showcase/<username>
```
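
To see which kinds of bucket your hub provides, you can check whether each environment variable is set. A sketch, with placeholder values (on a real hub the platform sets these variables, so you would skip the first two lines):

```bash
# Sketch: report which bucket variables are configured.
# Placeholder setup for illustration; a real hub sets these for you.
SCRATCH_BUCKET="s3://2i2c-aws-us-scratch-showcase/jovyan"
unset PERSISTENT_BUCKET

for var in SCRATCH_BUCKET PERSISTENT_BUCKET; do
  # POSIX-safe indirect expansion of the variable named in $var
  eval "val=\${$var:-}"
  if [ -n "$val" ]; then
    echo "$var is set to $val"
  else
    echo "$var is not set"
  fi
done
```

If a variable you expect is not set, see the FAQs in the AWS how-to guide for what to do next.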

## Storage costs

See [2i2c Infrastructure Guide – What exactly do cloud providers charge us for?](https://infrastructure.2i2c.org/topic/billing/chargeable-resources/#object-storage) for a detailed overview of cloud object storage costs.
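
As a rough illustration of how these costs accrue, here is a back-of-envelope estimate for storing 500 GiB, assuming a hypothetical rate of $0.023 per GiB-month (check your provider's current pricing page for real figures):

```bash
# Sketch: back-of-envelope monthly storage cost.
# The rate is an assumed example, not your hub's actual price.
awk -v gib=500 -v rate=0.023 'BEGIN { printf "USD %.2f/month\n", gib * rate }'
```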

```{tip}
It is the responsibility of hub admins and hub users to delete objects in `$PERSISTENT_BUCKET` when they are no longer needed, to minimize cloud billing costs. Hub champions are responsible for managing storage costs and the objects stored in `$PERSISTENT_BUCKET`.
```

```{tip}
Every file you download from the hub to another machine incurs a **heavy data egress cost**. Consider carefully whether you need to download large datasets from the hub; alternatively, post-process and compress files where possible. Hub champions are responsible for costs incurred by data egress.
```

## Access permissions

A common set of credentials is used for accessing storage buckets.

```{tip}
Hub users can access each other's objects stored in scratch or persistent bucket storage, and may accidentally modify or delete them.
```

It is possible to configure read-only access for objects stored in cloud storage on your hub, though this is not a standard feature of our hubs. Please consult {doc}`2i2c support<../../../../support>` to discuss enabling this feature.
Original file line number Diff line number Diff line change
@@ -1,65 +1,20 @@
(object-storage-aws)=
# How-to manage S3 cloud object storage with AWS CLI

This instructional guide shows you how to upload files to AWS S3 cloud object storage for your hub. In this example, we cover the difference between scratch versus persistent buckets and some basic AWS CLI commands for managing S3 objects within cloud object storage for your hub.
This instructional guide shows you how to upload files from your hub to AWS S3 cloud object storage. In this example, we cover some basic AWS CLI commands for managing S3 objects within cloud object storage for your hub.

```{admonition} Who is this guide for?
:class: note
Some community hubs running on AWS infrastructure have scratch and/or persistent S3 storage buckets already configured. This documentation is intended for hub champions that run a hub with this feature enabled.
Some community hubs running on AWS infrastructure have scratch and/or persistent S3 storage buckets already configured. This documentation is intended for users with a hub that has this feature enabled.
```

```{contents}
:depth: 2
:local:
```

## Scratch versus persistent buckets on a 2i2c hub

Bucket
: A *bucket* is a container for objects.

Object
: An *object* is a file and any metadata that describes that file.

(object-storage-aws:env-var-scratch)=
### Scratch buckets

[Scratch buckets](https://infrastructure.2i2c.org/topic/features/#scratch-buckets-on-object-storage) are designed for storage of *temporary* files, e.g. intermediate results. Objects stored in a scratch bucket are purged after 7 days.

Check the name of your scratch bucket by opening a Terminal in your hub and running the command

```bash
$ echo $SCRATCH_BUCKET
s3://2i2c-aws-us-scratch-showcase/<username>
```

(object-storage-aws:env-var-persistent)=
### Persistent buckets

[Persistent buckets](https://infrastructure.2i2c.org/topic/features/#persistent-buckets-on-object-storage) are designed for storing data that is used throughout the lifetime of a project; unlike scratch buckets, their data is not purged after a set number of days.

Check the name of your persistent bucket by opening a Terminal in your hub and running the command

```bash
$ echo $PERSISTENT_BUCKET
s3://2i2c-aws-us-persistent-showcase/<username>
```

## Storage costs

See [2i2c Infrastructure Guide – What exactly do cloud providers charge us for?](https://infrastructure.2i2c.org/topic/billing/chargeable-resources/#object-storage) for a detailed overview of cloud object storage costs.

:::{warning}
It is the responsibility of the hub admin and hub users to delete objects in `$PERSISTENT_BUCKET` when no longer needed to minimize cloud billing costs. 2i2c takes no responsibility for managing storage costs and objects stored in `$PERSISTENT_BUCKET`.
:::

## File permissions

By default there are no permission controls to prevent hub users from accessing each other's objects stored in scratch or persistent bucket storage.

It is possible to configure read-only access for objects stored in cloud storage on your hub. Please consult {doc}`2i2c support<../../../support>` to enable this feature.

## Basic AWS CLI commands in the Terminal

In the Terminal, check that the AWS CLI commands are available in your image with
@@ -85,9 +40,9 @@ The following examples are for managing objects in a scratch bucket using the `$
### List prefixes within an S3 bucket

Prefix
: There is no concept of "folders" in flat cloud object storage and every object is instead indexed with a key-value pair. Prefixes are a string of characters at the beginning of the object key name used to organize objects in a similar way to folders.
: There is no concept of "folders" in flat cloud object storage and every object is instead indexed with a key-value pair. Prefixes are a string of characters at the beginning of the object key name used to organize objects in a similar way to folders.

Storage buckets on a 2i2c hub are organized into prefixes named after a hub user's username. To list the prefixes of users that have stored files in cloud object storage, use the command
Storage buckets on a 2i2c hub are organized into prefixes named after a hub user's username. To list the prefixes of users that have stored files in cloud object storage, use the command

```bash
$ aws s3 ls $SCRATCH_BUCKET
@@ -114,16 +69,16 @@ aws s3 ls $SCRATCH_BUCKET/
Note the trailing slash `/` after `$SCRATCH_BUCKET` compared to the command specified in {ref}`List prefixes within an S3 bucket<object-storage:list-prefixes>`.
:::

### Upload and download files to and from a bucket
### Copy files on the hub to and from a bucket

Upload a file to your prefix in the scratch bucket with the command
Copy a file on the hub to your prefix in the scratch bucket with the command

```bash
$ aws s3 cp <filepath> $SCRATCH_BUCKET/
upload: ./<filepath> to s3://2i2c-aws-us-scratch-showcase/<username>/<filepath>
```

and download a file from your prefix in the scratch bucket with the command
and copy a file from your prefix in the scratch bucket with the command

```bash
$ aws s3 cp $SCRATCH_BUCKET/<source_filepath> <target_filepath>
@@ -145,40 +100,40 @@ Consult the [AWS Docs – Use high-level (s3) commands with the AWS CLI](https:/

## FAQs

- *How does a hub champion determine if our hub is running on AWS or not?*
- *How do I know if our hub is running on AWS or not?*

Check out our [list of running hubs](https://infrastructure.2i2c.org/reference/hubs/) to see which cloud provider your hub is running on.

- *How does a hub champion determine if a scratch and/or persistent bucket is already available?*
- *How do I determine if a scratch and/or persistent bucket is already available?*

Check whether the environment variables for each bucket are set. See {ref}`<object-storage-aws:env-var-scratch>` and {ref}`<object-storage-aws:env-var-persistent>`
Check whether the environment variables for each bucket are set. See {ref}`Scratch buckets<object-storage:env-var-scratch>` and {ref}`Persistent buckets<object-storage:env-var-persistent>`

- *If S3 buckets are supposed to be available but the environment variables for AWS credentials are not defined, what should the hub champion do?*
- *If S3 buckets are supposed to be available but the environment variables for AWS credentials are not defined, what should I do?*

If environment variables for the relevant AWS credentials for your hub are not defined, then you may encounter the following error

```bash
An error occurred (AccessDenied) when calling the AssumeRoleWithWebIdentity operation: Not authorized to perform sts:AssumeRoleWithWebIdentity.
```

Please open a {doc}`2i2c support<../../../support>` ticket with us to resolve this issue.
Please contact your hub champion so that they can open a {doc}`2i2c support<../../../../support>` ticket with us to resolve this issue on your behalf.

- *If S3 bucket are not set up but we want them for our community what should the hub champion do?*
- *If S3 buckets are not set up but I want them for my community, what should I do?*

This feature is not enabled by default since there are extra cloud costs associated with providing S3 object storage. Please open a {doc}`2i2c support<../../../support>` ticket with us to request this feature for your hub.
This feature is not enabled by default since there are extra cloud costs associated with providing S3 object storage. Please speak to your hub champion, who can then open a {doc}`2i2c support<../../../../support>` ticket with us to request this feature for your hub.

- *Is our S3 bucket accessible outside of the hub so I can upload files from elsewhere?*

    Yes, this requires configuring AWS credentials on your machine; however, we currently do not have documentation for this. Please contact {doc}`2i2c support<../../../support>` for guidance.
    Yes, this requires configuring AWS credentials on your machine; however, we currently do not have documentation for this. Please contact {doc}`2i2c support<../../../../support>` for guidance.

- *Is our S3 bucket accessible outside of the hub so users can download files to elsewhere?*

The same answer to the question above applies in this instance.

- *Will 2i2c create additional, new S3 buckets for our community?*

Please contact {doc}`2i2c support<../../../support>` to discuss this option.
    Please contact your hub champion, who can liaise with {doc}`2i2c support<../../../../support>` to discuss this option.

- *If a community hub is running on GCP or Azure and we have object storage, what are our options?*
- *If our hub is running on GCP or Azure and we have object storage, what are our options?*

Check out our {doc}`Cloud Object Storage<../../user/topics/data/cloud>` user topic guide in the first instance.
Check out our resources listed in the {doc}`Cloud Object Storage<index>` user topic guide.