automatically propagate task values to environment variables #458
Replies: 3 comments 5 replies
-
Interesting proposal, though one thing that immediately popped into my head was the potential for unintended or unknown side effects. Some programs look at environment variables to automatically set certain settings. If someone is unaware that a given environment variable is meaningful to the program they're using, they might end up running the program with different settings than intended, potentially without ever becoming aware of it. This might just be an edge case, though, and one could argue you should know how to run the tools you're putting in your workflow, so it may not be a big issue.
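The concern above can be sketched in a few lines of shell (the tool and the `THREADS` variable here are hypothetical, just to illustrate a program that silently honors an environment variable):

```shell
# Hypothetical tool that silently honors an environment variable.
mytool() {
    # assume the tool reads THREADS from the environment when it is set
    echo "running with ${THREADS:-1} threads"
}

mytool            # the user expects the default
THREADS=8 mytool  # a propagated task input could change behavior without the user noticing
```

If a task input named `threads` were auto-propagated into the environment, the second invocation's behavior could occur without the user ever opting in.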
-
@mlin this is a really interesting idea and would definitely simplify a lot of very simple tasks by not requiring any sort of interpolation. Using environment variables seems like a really elegant way of passing values to a task.

To @DavyCats' point, I think we could simply use a few conventions that a lot of people will be familiar with for environment variables. First, everything should probably be prefixed with a well-documented string. Secondly, we might want to keep environment variables in all caps; in bash the general convention seems to be full caps for globally available vars and lowercase for local vars.

We can also draw on examples of existing software for defining how complex types should work. I personally write a lot of applications in Spring, which allows you to define very complex configuration (normally written in YAML or JSON) as environment variables, and we could mimic that. For example, say we have the following variables.
Then environment values would be assigned like the following:
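As one sketch of such a convention (the `WDL_` prefix and the variable names below are hypothetical, loosely mimicking Spring's relaxed binding of nested config keys to environment variables):

```shell
# Hypothetical WDL declarations:
#   Int threads = 4
#   String sample_name = "NA12878"
#   Reference ref = { fasta: "hg38.fa" }   (a struct)
#
# Spring-style mapping: a documented prefix, upper case throughout,
# nested fields joined with underscores.
export WDL_THREADS=4
export WDL_SAMPLE_NAME=NA12878
export WDL_REF_FASTA=hg38.fa

echo "$WDL_THREADS $WDL_SAMPLE_NAME $WDL_REF_FASTA"
```

A documented prefix also answers the side-effect concern above: existing tools are unlikely to read variables under a reserved prefix.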
-
Our application libraries follow 12-factor best practices, and accept (and in some cases even require) environment variables as core configuration. As @mlin called out, most of the workarounds for the lack of first-class env var support in WDL are error-prone due to quoting issues or shell interpretation. I agree that the WDL spec should have first-class support for environment variables, but I have concerns about some of the suggestions in this thread:
-
WDL `~{}` interpolations have a couple of disadvantages. A potential solution to both problems is to prefer passing inputs to the task command script as environment variables, instead of textually interpolating them. If the WDL engine sets the task inputs in the container environment, then the command script can use standard bash rules for handling them safely (which has pitfalls of its own, but at least they're well-known ones that ShellCheck can point out), and this may also ease the learning curve for new WDLers.
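The quoting difference can be sketched in shell (simulated here with a plain shell variable; `bam` is a hypothetical input whose value contains a space):

```shell
# Hypothetical input value; under the env-var scheme the engine would set this.
bam='my sample.bam'

# Interpolation-style: the raw value ends up in the script text, so an
# unquoted occurrence word-splits into extra arguments.
set -- samtools index $bam
interpolated_argc=$#

# Env-var style: the value stays out of the script text; standard bash
# quoting keeps it a single argument, and ShellCheck flags the unquoted form.
set -- samtools index "$bam"
quoted_argc=$#

echo "$interpolated_argc $quoted_argc"
```

The unquoted form yields four arguments instead of three, which is exactly the class of bug ShellCheck (SC2086) warns about.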
In chanzuckerberg/miniwdl#503 I prototyped a task runtime setting `autoEnv: true`, which causes the task's WDL value declarations to implicitly propagate into the environment of the command script. For example, if the WDL task has an input `File bam`, then in the command script, `$bam` will refer to the localized filename. So too for `String`, `Int`, etc. values. This prototype is meant as a discussion starting point.

One open question is how to deal with WDL's compound types, like arrays and structs. In this prototype I've chosen to punt on this by making the environment propagation apply only to "atomic" value types. For compound types, one could introduce an auxiliary value like e.g.
```wdl
File filenames = write_lines(file_array)
```
and then consume `$filenames` in the command. This is open to discussion, of course -- a possible alternative is to load them into the environment as JSON text. (However, it's conceivable that a compound WDL data structure could be too large to fit in an environment variable, or into whatever API call sets up the container environment.)
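For illustration, here is a sketch of how a task command might consume such an auxiliary `$filenames` value (the setup and paths are hypothetical; in practice the engine would localize the files and set the variable):

```shell
# Hypothetical setup: simulate the engine materializing write_lines(file_array)
# to a file and exposing its path via the environment as $filenames.
filenames=$(mktemp)
printf '%s\n' /inputs/a.bam /inputs/b.bam /inputs/c.bam > "$filenames"
export filenames

# The task command then reads the paths back one per line.
count=0
while IFS= read -r path; do
    count=$((count + 1))
done < "$filenames"
echo "processed $count files"
```

This keeps arbitrarily large arrays out of the environment itself, sidestepping the size limits mentioned above.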