\section{Motivation}
\label{sec:motivation}
The need for a wide-area storage kernel is motivated by experience with two application domains:
scientific computing and software-as-a-service (SaaS). In both domains, users
write programs that interact with data in multiple existing services while
keeping those programs compatible with service changes. No layer of indirection
exists between these applications and the services, so developers must
constantly re-implement common functionality and continuously patch it as
services change.

In scientific computing, sites generate and host petabytes of data and make it
available via multiple legacy storage systems. These systems are organized
into inter-site federations to facilitate collaboration, but each site defines
its own curation policies in terms of its storage systems' behaviors. Due
to policy complexity (including legal regimes) and the sheer size of the data,
it is undesirable to host authoritative copies with third-party cloud storage
providers. Moreover, the scientists at a site generate that site's data and
expect to access it frequently, so keeping the data on-site improves
performance for them.
Nevertheless, the data must be transferred between sites, since experiments
rely on bespoke resources, such as supercomputers, that cannot be moved to
where the data is.

Scientific workflows in systems like Agave~\cite{agave} and iPlant~\cite{iplant}
implement cross-component storage features of their own in
order to overcome differences in services and sites. For example, a
gene-sequence alignment job must know how to avoid stale data and how to
ensure that confidential results will not be readable by unauthorized off-site
requestors. This is unfortunate, because entangling the science logic with the
storage logic makes the workflow vulnerable to service changes. However, there
is no clear alternative, since sites and storage services have no insight into
the workflow's intentions.

Some common storage functionality can be implemented as an intra-site storage
kernel using archival systems like iRODS~\cite{irods}, but no
coherent solution exists for doing so uniformly across sites. Science clouds
attempt to do so, but they only defer the problem to workflows that run across
multiple science clouds.

A similar entanglement problem exists for programs that access data in multiple
SaaS providers. For example, programs like Slack and Onename that read user
data from SaaS providers must re-implement access-control logic to keep user
data private. Programs like Stripple that synchronize data across providers
must retry failed uploads and deletions to preserve replica consistency.
SaaS API changes are a looming threat to these applications' stability, and
when changes do happen, they impose a constant maintenance burden on the
applications' developers.

The status quo is sufficiently unfavorable that an ecosystem of decentralized,
blockchain-powered applications (``dapps'') is coming online to compete with SaaS
providers. Their main selling point is that they partially overcome these
problems by treating state on a user's devices as authoritative, instead of
state spread across multiple data silos. However, handling storage concerns
like data persistence, access control, and authentication is still left to
users.

\subsection{Contributions}
Both domains suffer from the same underlying problem. Because storage features
are tied to services, and because there is no common storage kernel for
wide-area applications, each program must independently implement any storage
features not directly offered by the service. This leads to duplicated effort
and bugs, and requires every program to be patched whenever a single service
changes.

Waskern solves this problem by implementing storage features as a set of
execution contexts that work together to run application-specified storage
features independently of providers. This gives applications a stable yet
custom storage API while isolating them from service changes. A developer's
task is no longer to patch each application to work with a service, but
instead to maintain a ``driver'' that makes a service work with many
applications.

The main contribution of this work is to construct this wide-area storage
kernel in a cost-effective manner. We show how to combine existing services
while minimizing the work a developer must do to support a new feature or
provider. We also show how to operate this kernel with minimal administrative
overhead. By doing so, we recover a set of features similar to those provided
by conventional OS kernels:

\textbf{Programmable}. The kernel runs user-given storage logic for the user's
I/O flows. This logic exists and runs independently of both services and
applications.

\textbf{Dynamic}. The kernel is re-programmable at runtime, and changes take
effect atomically with respect to the application.

\textbf{Extensible}. Each user may independently control how I/O flows are
processed for their data.

\textbf{Uniform}. The kernel is scalable and available to all application
endpoints. It spans multiple administrative and trust domains.

\textbf{Universal}. Storage functionality may be stateful and Turing-complete,
and may run asynchronously from the application.

Waskern is a proof-of-concept implementation of this kernel. It not only
allows us to combine existing storage services, but also enables storage
functionality to be decomposed into orthogonal, reusable features that
compose functionally in a pipeline fashion (akin to how UNIX programs
compose). New functionality can reuse existing functionality, and the cost of
adding new functionality is amortized across all dependent applications.
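
To make the pipeline analogy concrete, the sketch below illustrates how
independently written storage features might be chained over a single I/O
request. The feature names and the \texttt{compose} helper are hypothetical
illustrations chosen for this example; they are not Waskern's actual API.

\begin{verbatim}
# Hypothetical sketch: storage features as composable stages over an
# I/O request, chained left to right like a UNIX pipeline. All names
# here are illustrative and do not reflect Waskern's real interfaces.

def deduplicate(request):
    # Skip the write if an identical copy is already stored.
    if request.get("already_stored"):
        request["skip"] = True
    return request

def encrypt(request):
    # Stand-in transformation; a real feature would use a proper cipher.
    request["body"] = bytes(reversed(request["body"]))
    return request

def replicate(request):
    # Tag the request for replication to multiple back-end services.
    request["replicas"] = ["service-a", "service-b"]
    return request

def compose(*features):
    # Build a pipeline that threads a request through each feature in turn.
    def pipeline(request):
        for feature in features:
            request = feature(request)
        return request
    return pipeline

write_path = compose(deduplicate, encrypt, replicate)
result = write_path({"body": b"sample data", "already_stored": False})
\end{verbatim}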