From 82dc7e46359a53637eab7065a95e7433b728110e Mon Sep 17 00:00:00 2001
From: Aidan Oldershaw
Date: Tue, 4 May 2021 16:32:23 -0400
Subject: [PATCH] alternative input/output approaches for prototypes

Signed-off-by: Aidan Oldershaw
---
 037-prototypes/proposal.md | 317 +++++++++++++++++++++++++++++++++++++
 1 file changed, 317 insertions(+)

diff --git a/037-prototypes/proposal.md b/037-prototypes/proposal.md
index 7cbd0599..31a3eade 100644
--- a/037-prototypes/proposal.md
+++ b/037-prototypes/proposal.md
@@ -599,8 +599,325 @@ prototypes with special pipeline semantics and step syntax. These

* Is this terminology unrelatable?

### Inputs/Outputs

An open question is how best to provide inputs and outputs to prototypes,
particularly via the `run` step. The mockup of the `run` step in the [Pipeline
Usage](#pipeline-usage) section above suggested these could be configured via
`inputs`/`input_mapping` and `outputs`/`output_mapping`, but this approach has
some downsides:

* Stutter when `run.params` references the inputs, as you need to specify the
  artifact name twice, e.g.

  ```yaml
  run: build
  type: go
  inputs: [my-repo]
  params:
    package: my-repo/cmd/my-cmd
  output_mapping: {binary: my-repo-binary}
  ```

  * In this case, the stutter could be avoided by using an `input_mapping` to
    a name specific to the prototype. However, this only works when the
    prototype takes a fixed set of inputs with a fixed set of names; it breaks
    down when the prototype takes a variable list of inputs, for instance.
* It's awkward to define a set of outputs that depends on the inputs. For
  instance, imagine a `go` prototype that can compile multiple packages
  simultaneously and emit an output for each one. Under this input/output
  approach, that may look something like:

  ```yaml
  run: build
  type: go
  inputs: [repo1, repo2]
  params:
    packages:
      cmd1-binary: repo1/cmd/cmd1
      cmd2-binary: repo1/cmd/cmd2
      cmd3-binary: repo2/cmd/cmd3
  outputs:
  - cmd1-binary
  - cmd2-binary
  - cmd3-binary
  ```

  or

  ```yaml
  run: build
  type: go
  inputs: [repo1, repo2]
  params:
    packages:
    - repo1/cmd/cmd1
    - repo1/cmd/cmd2
    - repo2/cmd/cmd3
  output_mapping:
    binary1: cmd1-binary
    binary2: cmd2-binary
    binary3: cmd3-binary
  ```

  In the first case, the prototype defines a pseudo-`output_mapping` in its
  config, which requires repetition when defining the set of `outputs`. In the
  second case, the `outputs` repetition is gone, but the prototype has to
  invent a naming scheme for the outputs (in this case, suffixing a fixed name
  with the 1-based index of the package). Both approaches are fairly awkward
  to work with.

Here are some alternative approaches that have been considered:

#### Option 1a - dynamic input/output config

The prototype's `info` response would include the required
inputs/outputs(/caches?) based on the request object (i.e. `run.params`).

For instance, with the following `run` step:

```yaml
run: some-message
type: some-prototype
params:
  files: [some-artifact-1/some-file, some-artifact-2/some-other-file]
  other_config: here
  output_as: some-output
```

...the prototype may emit the following config, ascribing special meaning to
`files` and `output_as`:

```yaml
inputs:
- name: some-artifact-1
- name: some-artifact-2

outputs:
- name: some-output
```

...and Concourse will mount these artifacts appropriately.
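To make the flow concrete, here is a minimal sketch of the prototype-side
handler for the example above, written in Go. It assumes, purely for
illustration, that the per-message `info` request arrives as JSON on stdin and
the config is written to stdout; the actual wire protocol is defined by the
main Prototypes RFC, and the `BuildParams`/`InfoResponse` shapes and field
names here are hypothetical:

```go
package main

// Sketch of option 1a from the prototype's side: derive the inputs/outputs
// config from the request params. All type and field names are illustrative
// assumptions, not a settled interface.

import (
	"encoding/json"
	"os"
	"strings"
)

// BuildParams mirrors the `run.params` from the pipeline mockup above.
type BuildParams struct {
	Files       []string `json:"files"`
	OtherConfig string   `json:"other_config"`
	OutputAs    string   `json:"output_as"`
}

// ArtifactConfig names a single input or output for Concourse to mount.
type ArtifactConfig struct {
	Name string `json:"name"`
}

// InfoResponse is the hypothetical dynamic inputs/outputs config.
type InfoResponse struct {
	Inputs  []ArtifactConfig `json:"inputs"`
	Outputs []ArtifactConfig `json:"outputs"`
}

func main() {
	var params BuildParams
	if err := json.NewDecoder(os.Stdin).Decode(&params); err != nil {
		panic(err)
	}

	var info InfoResponse
	seen := map[string]bool{}
	for _, file := range params.Files {
		// "some-artifact-1/some-file" -> input "some-artifact-1"
		artifact := strings.SplitN(file, "/", 2)[0]
		if !seen[artifact] {
			seen[artifact] = true
			info.Inputs = append(info.Inputs, ArtifactConfig{Name: artifact})
		}
	}
	if params.OutputAs != "" {
		info.Outputs = append(info.Outputs, ArtifactConfig{Name: params.OutputAs})
	}

	if err := json.NewEncoder(os.Stdout).Encode(info); err != nil {
		panic(err)
	}
}
```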
**Pros**

* No new concepts
  * If you're familiar with `task` configs, this is effectively just a more
    flexible version of the same concept
  * No new pipeline syntax/semantics
* Prototype details can be encapsulated behind config
  * e.g. with the [oci-build-task], if you want to persist the build cache,
    you need to specify: `caches: [{path: cache}]`
  * With an approach like this, you could just specify: `cache: true` (i.e.
    you don't need to know where the cache is)

**Cons**

* Requires inventing a naming scheme when the set of outputs is dynamic based
  on the inputs
* Has performance implications. Ignoring caching, `run` steps will need to
  spin up two containers (one for making the `info` request, and one for
  running the prototype message)
  * Caching is possible when the configuration is fixed, but with variable
    interpolation it may not work so well
* More burden on prototype authors, as each message now needs two handlers -
  one for executing the message, and one for generating the inputs/outputs.

#### Option 1b - [JSON schema] based dynamic input/output config

Similar to 1a in that we rely on the prototype to instruct Concourse on what
the inputs/outputs are. However, rather than the `info` response providing the
inputs/outputs config, it instead gives a [JSON schema] for each message, with
some special semantics for defining inputs/outputs.

e.g. a schema for the `some-message` example from option 1a:

```json
{
  "type": "object",
  "properties": {
    "files": {
      "type": "array",
      "items": {
        "type": "string",
        "concourse:input": {
          "name": "((name))"
        }
      }
    },
    "other_config": {
      "type": "string"
    },
    "output_as": {
      "type": "string",
      "concourse:output": {
        "name": "((name))"
      }
    }
  },
  "required": ["files"]
}
```

This isn't fully fleshed out, but the key concept is these `concourse:*`
keywords, which let you say: this element in the object is an input/output.
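For illustration, here is a rough sketch (in Go) of the kind of schema walk
Concourse could perform to derive inputs/outputs from a step's `params`. The
handling of the `concourse:*` keywords - in particular, reading `((name))` as
"derive the artifact name from the matched value" - is an assumption on my
part; as noted above, these semantics are deliberately left open:

```go
package main

// Walk a message's JSON schema and the step's params in lockstep, collecting
// artifact names wherever the schema carries a concourse:* keyword. The
// interpretation of ((name)) here is one possible reading, not settled.

import (
	"encoding/json"
	"fmt"
	"strings"
)

type schema map[string]interface{}

// collect recurses through objects and arrays, appending input/output names.
func collect(node schema, value interface{}, inputs, outputs *[]string) {
	if _, ok := node["concourse:input"]; ok {
		if s, ok := value.(string); ok {
			// Assumed: the input name is the first path segment of the value.
			*inputs = append(*inputs, strings.SplitN(s, "/", 2)[0])
		}
	}
	if _, ok := node["concourse:output"]; ok {
		if s, ok := value.(string); ok {
			*outputs = append(*outputs, s)
		}
	}
	if items, ok := node["items"].(map[string]interface{}); ok {
		if vals, ok := value.([]interface{}); ok {
			for _, v := range vals {
				collect(schema(items), v, inputs, outputs)
			}
		}
	}
	if props, ok := node["properties"].(map[string]interface{}); ok {
		if obj, ok := value.(map[string]interface{}); ok {
			for name, sub := range props {
				collect(schema(sub.(map[string]interface{})), obj[name], inputs, outputs)
			}
		}
	}
}

func main() {
	// The schema and params from the example above, inlined for brevity.
	var node schema
	json.Unmarshal([]byte(`{
		"type": "object",
		"properties": {
			"files": {"type": "array", "items": {"type": "string", "concourse:input": {"name": "((name))"}}},
			"other_config": {"type": "string"},
			"output_as": {"type": "string", "concourse:output": {"name": "((name))"}}
		}
	}`), &node)

	var params interface{}
	json.Unmarshal([]byte(`{
		"files": ["some-artifact-1/some-file", "some-artifact-2/some-other-file"],
		"other_config": "here",
		"output_as": "some-output"
	}`), &params)

	var inputs, outputs []string
	collect(node, params, &inputs, &outputs)
	fmt.Println(inputs, outputs) // [some-artifact-1 some-artifact-2] [some-output]
}
```

A nice property of this shape is that the same walk that collects
inputs/outputs can also validate the params against the schema, so validation
comes essentially for free.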
**Pros**

* Performance implications from 1a disappear - much easier to cache
* All inputs/outputs need to be mentioned in the config, so nothing is
  implicit
  * Can also be viewed as a Con - it's more verbose, and closer to the
    original `inputs`/`outputs` approach
  * Unlike the `inputs`/`input_mapping`, `outputs`/`output_mapping` approach,
    however, this approach requires less repetition, as the inputs/outputs
    only need to be defined in one place
* Gives a way for Concourse to easily validate input to a prototype, and for
  IDEs to provide more useful auto-complete suggestions - related to
  https://github.com/concourse/concourse/issues/481
  * The IDE aspect may not be practical, as getting the JSON schema requires
    making the `info` request against a Docker image.

**Cons**

* Makes it much more of a burden to write prototypes, as each message *needs*
  a JSON schema
* Probably too restrictive - requires Concourse to support specific custom
  keywords in the JSON schema. If you wanted to define a map of inputs to
  paths, for instance, we'd need to provide a special keyword for that (as
  `concourse:input` alone isn't enough)

#### Option 2a - emit outputs in response objects

This option implies two things:

1. A way for pipelines to explicitly specify what inputs should be provided
   to a prototype (whereas options 1a/b had the prototype telling Concourse
   what inputs should be provided)
2. A way for prototypes to provide output artifacts as part of their message
   responses (whereas option 1a included them in the `info` response, i.e.
   not at "runtime" w.r.t. running the message)

The response objects could look something like:

```json
{
  "object": {
    "some_data": 123,
    "some_output": {"artifact": "./path"}
  }
}
```

Concourse will interpret `{"artifact": "./path"}` as an output, where `./path`
is relative to some path that's mounted by default to all prototypes. *This
means that multiple outputs may share an output volume, and differ only by the
path within that volume.*

Since prototypes can emit multiple response objects, this also means you can
have *streams* of outputs sharing a name that are identified by some other
field(s), e.g. the above artifact could be identified as
`some_output{some_data: 123}` (or something along those lines). That gives you
the option of aggregating/filtering the stream of outputs by a subset of the
fields that identify them - a similar idea has been fleshed out for the
`across` step in
https://github.com/concourse/rfcs/pull/29#discussion_r619863020.
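As a sketch of what emitting such a stream might look like from the
prototype's side, consider a `go`-style prototype that compiles several
packages and emits one response object per binary. This assumes the request
arrives on stdin with a `response_path` field (as in the request example shown
later in this section), and treats the response file as a stream of JSON
objects; that framing, and all names below, are illustrative:

```go
package main

// Option 2a from the prototype's side: one response object per compiled
// binary, where `binary` carries the artifact reference and `package`
// identifies the object within the stream of outputs.

import (
	"encoding/json"
	"os"
	"path/filepath"
)

type request struct {
	Object struct {
		Packages []string `json:"packages"`
	} `json:"object"`
	ResponsePath string `json:"response_path"`
}

func main() {
	var req request
	if err := json.NewDecoder(os.Stdin).Decode(&req); err != nil {
		panic(err)
	}

	out, err := os.Create(req.ResponsePath)
	if err != nil {
		panic(err)
	}
	defer out.Close()

	enc := json.NewEncoder(out)
	for _, pkg := range req.Object.Packages {
		// ...compile pkg to ./binaries/<name> here (elided)...
		name := filepath.Base(pkg)

		enc.Encode(map[string]interface{}{
			"object": map[string]interface{}{
				"package": pkg,
				"binary":  map[string]string{"artifact": "./binaries/" + name},
			},
		})
	}
}
```

Note that the prototype never has to invent output names up-front: each
binary is just `binary{package: ...}` within the stream.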
As for providing inputs, we can use an approach similar to `put.inputs:
detect`, but more explicit (from the pipeline's perspective). e.g.

```yaml
run: some-message
type: some-prototype
params:
  files: [@some-artifact-1/some-file, @some-artifact-2/some-other-file]
  other_config: here
```

Here, `@` is a special syntax that points to an artifact name and both *mounts
it to the container* and *resolves to an absolute path to the artifact*. In
this example, the prototype would receive something like:

```json
{
  "object": {
    "files": ["/tmp/build/some-artifact-1/some-file", "/tmp/build/some-artifact-2/some-other-file"],
    "other_config": "here"
  },
  "response_path": "..."
}
```

Interestingly, data and artifacts are all co-located, which raises the
question: do we need a separate namespace for artifacts like we have now, or
can they be treated as "just vars" and share the local vars namespace?

In order to use emitted artifacts within the pipeline, you can "set" them in
the local namespace:

```yaml
run: some-message
type: some-prototype
params:
  files: [@some-artifact-1/some-file, @some-artifact-2/some-other-file]
  other_config: here
set_artifacts: # or, if we do collapse the namespaces, this could be `set_vars`
  some_output: my-output # some awkwardness around _ vs - here
```

**Pros**

* Removes the burden on prototype authors to define what inputs/outputs are
  required up-front
* More flexible - if the set of outputs depends on things only determinable
  at runtime, you can express that here
* Output streams let you do interesting filtering (example in
  https://github.com/concourse/rfcs/pull/29#discussion_r619863020), and avoid
  the issue of needing to invent a naming scheme for sets of common outputs.

**Cons**

* New pipeline syntax/concepts to learn
* Can't mount inputs at specific paths - if your prototype requires a specific
  filesystem layout, it needs to shuffle the inputs around
  * e.g. [oci-build-task] may depend on inputs being at certain paths relative
    to the `Dockerfile`
* Since outputs can appear anywhere in the response object, different
  prototypes may provide different ways to interact with outputs, rather than
  having a single flat namespace of outputs
  * Can also be viewed as a Pro, but I feel like having a consistent way of
    referring to artifacts within a pipeline is beneficial

**Questions**

* Is merging the concepts of vars and artifacts confusing/unintuitive?

#### Option 2b - emit outputs adjacent to the response object

This approach uses the same `@` syntax and semantics as 2a - the main
difference is that it explicitly differentiates between outputs and data by
emitting outputs in the message response, but not in the response object. e.g.

```json
{
  "object": {
    "some_data": 123
  },
  "outputs": [
    {"name": "some_output", "path": "./path/to/output"}
  ]
}
```

**Pros**

Like 2a, but also:

* Consistent with the existing notion of outputs (i.e. a flat namespace of
  artifacts that can be referenced by name)
  * Still allows for filtering the stream down by the corresponding `object`

**Cons**

Like 2a, but also:

* Can't unify the concepts of vars and artifacts

[rfc-1]: https://github.com/concourse/rfcs/pull/1
[rfc-1-comment]: https://github.com/concourse/rfcs/pull/1#issuecomment-477749314
[rfc-24]: https://github.com/concourse/rfcs/pull/24
[rfc-38]: https://github.com/concourse/rfcs/pull/38
[oci-build-task]: https://github.com/vito/oci-build-task
[JSON schema]: https://json-schema.org/