
Document example variables needed to run a Reactors container #5

Open
mwvaughn opened this issue Sep 2, 2020 · 31 comments
Labels: brainstorm (Ideas in the works)

Comments


mwvaughn commented Sep 2, 2020

No description provided.

mwvaughn changed the title from "Document example sets of variables needed to run a Reactors container" to "Document example variables needed to run a Reactors container" on Sep 2, 2020

mwvaughn commented Sep 2, 2020

The following variables are always passed into a Reactors container when run on the Abaco platform (a short sketch of reading them from Python follows the list):

  • _abaco_api_server - the API server for the current instance of Tapis
  • _abaco_access_token - an OAuth Bearer token for Tapis access, generated by the Abaco platform
  • _abaco_Content-Type - the content type of the message passed to Abaco
  • _abaco_execution_id - the public identifier of the specific instantiation of an Actor
  • _abaco_username - the Tapis username requesting the specific instance of an Actor
  • _abaco_actor_dbid - the internal database identifier for the Actor, usually a combination of the Tapis tenant and the actor ID
  • _abaco_actor_id - the public identifier for the Actor
  • _abaco_actor_state - a serialized Python dict for persisting state between instantiations of an Actor
  • _abaco_worker_id - the public identifier for the Abaco worker executing the present instantiation
  • _abaco_container_repo - the container repo name for the current Actor
  • _abaco_actor_name - the public name of the current Actor
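
For reference, here is a minimal Python sketch (not part of the SDK) of how a container process can read a few of these variables; the example values printed are hypothetical:

import os

# Abaco injects these as plain environment variables into the container.
api_server = os.environ.get("_abaco_api_server")
access_token = os.environ.get("_abaco_access_token")
execution_id = os.environ.get("_abaco_execution_id")
actor_id = os.environ.get("_abaco_actor_id")

# The raw message body arrives in MSG alongside the _abaco_* variables.
raw_message = os.environ.get("MSG", "")

print("Execution {0} of actor {1} against {2}".format(execution_id, actor_id, api_server))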


ethho commented Sep 2, 2020

ec50-stab-audit-rx

This is an actor that audits file outputs of a Tapis app. A Tapis job UUID in the reactor context is necessary and sufficient for usage.

Message Schema

  • This reactor expects, but does not enforce, an empty JSON message ({}).

Expected Variables

  • tapis_jobId: Tapis job UUID of the upstream Tapis job, which called back to this reactor's message inbox.


mwvaughn commented Sep 3, 2020

ec50-stab-audit-rx

This is an actor that audits file outputs of a Tapis app. A Tapis job UUID in the reactor context is necessary and sufficient for usage.

Message Schema

  • This reactor expects, but does not enforce, an empty JSON message ({}).

Expected Variables

  • tapis_jobId: Tapis job UUID of the upstream Tapis job, which called back to this reactor's message inbox.

Where is the requirement for tapis_jobId defined for this actor?


mwvaughn commented Sep 3, 2020

As we go, we also need to list the common variables defined in secrets.json in the Reactor project repositories.


ethho commented Sep 3, 2020

Where is the requirement for tapis_jobId defined for this actor?

This var is required at the line in this permalink: https://gitlab.sd2e.org/psap/ec50-stab-audit-rx/blob/master/reactor.py#L128, which basically says assert 'tapis_jobId' in r.context with better reporting.
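
For illustration only, a hedged sketch of that kind of guard (this is not the actual reactor.py code; r is assumed to be a Reactor instance and on_failure its error handler):

# Hypothetical equivalent of "assert 'tapis_jobId' in r.context with better reporting"
if 'tapis_jobId' not in r.context:
    r.on_failure("tapis_jobId was not found in the actor context; "
                 "it should be supplied by the upstream Tapis job callback")
tapis_job_id = r.context['tapis_jobId']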

As we're going, we also need to list common variables defined in secrets.json in the reactor project repositories.

Does there exist a canonical way to do this? e.g. secrets.example.json?


mwvaughn commented Sep 8, 2020

Here's a stab at the baseline context.jsonschema, over which we'd merge the one provided with the Reactor. I'm still thinking over the ins and outs of multiple message schemas and how we'd go about validation. I also think we can add a bit more constraint to some of the string fields, even if we don't have a formal, enforceable regex pattern for them.

{
	"$schema": "http://json-schema.org/draft-07/schema#",
	"$id": "AbacoBaseContext",
	"title": "Baseline Abaco Context",
	"type": "object",
	"properties": {
		"MSG": {
			"type": "string",
			"description": "Message received by the Actor"
		},
		"_abaco_Content-Type": {
			"type": "string",
			"description": "Content type of the message passed to Abaco"
		},
		"_abaco_api_server": {
			"type": "string",
			"description": "API server for an instance of Tapis",
			"format": "uri"
		},
		"_abaco_access_token": {
			"type": "string",
			"description": "Oauth Bearer token for Tapis access, generated by Abaco"
		},
		"_abaco_execution_id": {
			"type": "string",
			"description": "Public identifier of the current Actor Execution"
		},
		"_abaco_username": {
			"type": "string",
			"description": "Tapis username who is requesting the execution of the Actor",
			"pattern": "^[a-z][0-9a-z]{2,7}",
			"examples": [
				"vaughn",
				"tg840555",
				"sd2etest1"
			]
		},
		"_abaco_actor_dbid": {
			"type": "string",
			"description": "Internal identifier for the Actor"
		},
		"_abaco_actor_id": {
			"type": "string",
			"description": "Public identifier for the Actor",
			"examples": [
				"e5QKEW8L0BeZ4",
				"6rgbzrjRKoBDk"
			]
		},
		"_abaco_actor_state": {
			"type": "object",
			"description": "Serialized object for persisting state between Actor Executions"
		},
		"_abaco_worker_id": {
			"type": "string",
			"description": "Public identifier for the Abaco worker handling the current Execution"
		},
		"_abaco_container_repo": {
			"type": "string",
			"description": "Linux container repo for the current Actor",
			"examples": [
				"tacc/tacbobot",
				"tacobot",
				"tacobot:latest",
				"tacobot:600a1af",
				"tacc/tacobot:600a1af",
				"index.docker.io/tacc/tacobot:600a1af"
			]
		},
		"_abaco_actor_name": {
			"type": "string",
			"description": "Public name of the current Actor"
		},
		"x-nonce": {
			"type": "string",
			"description": "An Abaco nonce (API key)"
		}
	},
	"required": ["MSG"],
	"additionalProperties": true
}
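
For reference, a minimal sketch of exercising this baseline schema with the jsonschema package; the file name and the example values are assumptions, and nothing here is SDK behavior yet:

import json

import jsonschema

# Assumes the schema above has been saved locally as AbacoBaseContext.json
with open("AbacoBaseContext.json") as f:
    base_schema = json.load(f)

# A context assembled from the environment variables Abaco injects (values are made up)
example_context = {
    "MSG": "{}",
    "_abaco_api_server": "https://api.tapis.example",
    "_abaco_username": "vaughn",
    "_abaco_actor_id": "e5QKEW8L0BeZ4",
}

# Raises jsonschema.ValidationError if, for example, MSG is missing
# or a declared field has the wrong type.
jsonschema.validate(instance=example_context, schema=base_schema)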


mwvaughn commented Sep 9, 2020

Here's an interesting case we have not yet considered. If a Reactor relies on environment variables set at the container image level, we need to document those as well. Here's an example Dockerfile for one of the pipeline managers...

FROM sd2e/reactors:python3-edge

ARG DATACATALOG_BRANCH=2_2
# RUN pip uninstall --yes datacatalog || true
RUN pip3 install --upgrade --no-cache-dir \
    git+https://github.com/SD2E/python-datacatalog.git@${DATACATALOG_BRANCH}

COPY schemas /schemas

WORKDIR /mnt/ephemeral-01

ENV CATALOG_ADMIN_TOKEN_KEY=etMn7wzAhn9ErWcpzh8EK75St2CUsHHK54nA
ENV CATALOG_ADMIN_TOKEN_LIFETIME=3600
ENV CATALOG_RECORDS_SOURCE=jobs-indexer
ENV CATALOG_STORAGE_SYSTEM=data-sd2e-community

These CATALOG_ variables are used by the datacatalog Python package. The variable CATALOG_ADMIN_TOKEN_KEY is particularly interesting - it is the seed for the hash function used to generate database record-level administrative tokens. If this value is not set correctly, other actors that also use datacatalog will reject the tokens in the messages sent to them. The idea there was to allow us to have coordinated, parallel sets of pipeline manager actors for multiple projects or alternative use cases.

Along a similar theme, the CATALOG_RECORDS_SOURCE sets the value for _properties.source in any MongoDB records created by this actor.
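
To make concrete why those image-level ENV defaults matter, here is a generic Python illustration of the pattern such settings modules typically follow; this is not the actual datacatalog source, and the defaults shown simply mirror the ENV lines above:

import os

# Hypothetical settings-style lookups; names mirror the Dockerfile ENV lines.
ADMIN_TOKEN_KEY = os.environ.get("CATALOG_ADMIN_TOKEN_KEY")
ADMIN_TOKEN_LIFETIME = int(os.environ.get("CATALOG_ADMIN_TOKEN_LIFETIME", "3600"))
RECORDS_SOURCE = os.environ.get("CATALOG_RECORDS_SOURCE", "jobs-indexer")
STORAGE_SYSTEM = os.environ.get("CATALOG_STORAGE_SYSTEM", "data-sd2e-community")

# Two actors will only accept each other's record-level tokens if they were
# built (or run) with the same CATALOG_ADMIN_TOKEN_KEY value.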


ethho commented Sep 9, 2020

Here's an interesting case we have not yet considered. If a Reactor relies on environment variables set at the container image level, we need to document those as well.

This touches on the broader question of what subset of the environment the context.jsonschema should capture. For instance, one might be interested in including CATALOG_ADMIN_TOKEN_KEY, but certainly not other vars that are inherent in the image (LS_COLORS, LD_LIBRARY_PATH, etc.). I propose a tentative definition for context.jsonschema: a built image and a context.jsonschema should be sufficient to reproduce the Abaco runtime environment locally. Based on this definition, I would leave envs that are defined in the Dockerfile out of the context.jsonschema, since they are already documented and built as part of the image.

What do you all think of this definition? It certainly has drawbacks; for instance, it means that config.yml fields would be duplicated in the schema. EDIT: on second thought, we don't strictly need to include these fields in the schema, thanks to "additionalProperties": true


mwvaughn commented Sep 9, 2020

Here's a stab at documenting a context.jsonschema for https://github.com/SD2E/pipelinejobs-manager. The context document below is sufficient to define which variables can be set to execute the actor in "parameters" mode. We still need to document the individual JSON message schemas. I am leaning against including them in context.jsonschema right now, in favor of having them in files stored in the source repo ./schemas directory, copied to /schemas in the container.

{
	"$schema": "http://json-schema.org/draft-07/schema#",
	"$id": "PipelineJobsIndexer",
	"title": "Context for PipelineJobs Indexer",
	"type": "object",
	"properties": {
		"uuid": {
			"type": "string",
			"description": "UUID of job to manage",
			"pattern": "^(uri:urn:)?107[0-9a-f]{5}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
			"examples": [
				"107acb4d-1f93-553a-9e1c-b7b58125bc89"
			]
		},
		"name": {
			"type": "string",
			"description": "Event name to send to job <uuid>",
			"enum": [
				"index",
				"indexed"
			]
		},
		"level": {
			"type": "string",
			"description": "Data processing level (when action == index)",
			"enum": [
				"0",
				"1",
				"2",
				"3",
				"Reference",
				"User",
				"Unknown"
			]
		},
		"token": {
			"type": "string",
			"description": "Alphanumeric authorization token issued and validated by Python datacatalog"
		}
	},
	"required": [],
	"additionalProperties": true
}

Note: Updated to remove the Data Catalog env variables as per Ethan's comment below.


mwvaughn commented Sep 9, 2020

I think this is the right approach. Here's why: if we open the Pandora's box of Dockerfile-level env variables, we might have to include the ones for any dependency library, and the number of vars even for Python datacatalog alone is pretty large.

Here is the dir() on the Datacatalog settings module:

['ABACO_HASHIDS_SALT', 'ADJ_ANIMAL_DATE_FORMAT', 'ADJ_ANIMAL_DELIM', 'ADJ_ANIMAL_LENGTH', 'ADJ_ANIMAL_WORDS', 'ADMIN_TOKEN_KEY', 'ADMIN_TOKEN_LIFETIME', 'ADMIN_TOKEN_SECRET', 'DATE_FORMAT', 'DEBUG_MODE', 'DNS_DOMAIN', 'FILE_ID_PREFIX', 'JOBS_TOKEN_SALT', 'LOG_FIXITY_ERRORS', 'LOG_LEVEL', 'LOG_UPDATES', 'LOG_VERBOSE', 'MAX_INDEX_FILTERS', 'MAX_INDEX_PATTERNS', 'MOCKUUIDS_SALT', 'MONGODB_AUTHN', 'MONGODB_AUTH_DATABASE', 'MONGODB_DATABASE', 'MONGODB_HOST', 'MONGODB_PASSWORD', 'MONGODB_PORT', 'MONGODB_REPLICA_SET', 'MONGODB_USERNAME', 'MONGO_DELETE_FIELD', 'PIPELINES_TOKEN_SALT', 'PJS_CACHE_DIR', 'PJS_CACHE_MAX_AGE', 'PRODUCTS_ROOT', 'PRODUCTS_SYSTEM', 'PRODUCTS_VERSION', 'RECORDS_SOURCE', 'RECORD_PROPERTIES_SOURCE', 'REFERENCES_ROOT', 'REFERENCES_SYSTEM', 'REFERENCES_VERSION', 'SALT_LENGTH', 'SCHEMA_BASEURL', 'SCHEMA_NO_COMMENT', 'SCHEMA_REFERENCE', 'STORAGE_MANAGED_VERSION', 'STORAGE_SYSTEM', 'TACC_API_SERVER', 'TACC_JUPYTER_SERVER', 'TACC_MANAGER_ACCOUNT', 'TACC_PRIMARY_STORAGE_SYSTEM', 'TACC_PROJECT_GROUP', 'TACC_PROJECT_ID', 'TACC_PROJECT_NAME', 'TACC_TENANT', 'TACC_TENANT_POC_EMAIL', 'TAPIS_ANY_AUTHENTICATED_USERNAME', 'TAPIS_UNAUTHENTICATED_USERNAME', 'TOKEN_LENGTH', 'TYPEDUUID_HASHIDS_SALT', 'UNICODE_PATHS', 'UPLOADS_ROOT', 'UPLOADS_SYSTEM', 'UPLOADS_VERSION', 'UUID_MOCK_NAMESPACE', 'UUID_NAMESPACE', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', 'all_settings', 'array_from_string', 'callbacks', 'crypt', 'debug', 'fix_assets_path', 'helpers', 'identifiers', 'int_or_none', 'jsonschema', 'mongo', 'organization', 'os', 'parse_boolean', 'set_from_string']

That's a lot of potential vars to deal with if we included even a few of them in the context.jsonschema. Also, we don't really override these variables at runtime (even though you could).


ethho commented Sep 9, 2020

We still need to document the individual JSON message schemas. I am leaning against including them in context.jsonschema right now, in favor of having them in files stored in the source repo ./schemas directory, copied to /schemas in the container.

Agreed. I can't think of a compelling reason to include either the config.yml or the message schemas in the context.jsonschema for the time being.


mwvaughn commented Sep 9, 2020

The default context.jsonschema needs to include LOCALONLY, which is used to set Reactor.local.

See https://github.com/TACC-Cloud/python-reactors/blob/main/src/reactors_sdk/utils.py#L226
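
As a rough approximation (the authoritative parsing lives in utils.py at the permalink above; this sketch is not a copy of it), LOCALONLY is just another environment variable interpreted as a boolean:

import os

# Approximation only: treat any truthy LOCALONLY value (e.g. LOCALONLY=1)
# as a request to run locally without contacting the Abaco/Tapis APIs,
# which is what Reactor.local reflects.
LOCALONLY = os.environ.get("LOCALONLY", "0").strip().lower() in ("1", "true", "yes")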


ethho commented Sep 10, 2020

An example context.jsonschema for the above-mentioned actor ec50-audit-rx (https://gitlab.sd2e.org/psap/ec50-stab-audit-rx). The schema performs an "outer merge" with the AbacoBaseContext schema above using the allOf keyword.

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "ec50-audit-rx-context",
  "title": "Reactor context for ec50-audit-rx",
  "description": "Reactor context for ec50-audit-rx",
  "allOf": [
    {"#ref": "AbacoBaseContext.json#"},
    {
      "type": "object",
      "additionalProperties": true,
      "required": ["tapis_jobId"],
      "properties": {
      	"tapis_jobId": {
          "type": "string",
          "description": "A Tapis job UUID",
          "pattern": "^[0-9a-z]{8}-[0-9a-z]{4}-[0-9a-z]{4}-[0-9a-z]{4}-[0-9a-z]{12}-007$"
        }
      }
    }
  ]
}
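
To show how a merged schema like this might be exercised without resolving the $ref over the network, here is a hedged Python sketch that inlines the base schema before validating an example context; the file names and the inlining step are assumptions, not SDK behavior:

import json

import jsonschema

with open("AbacoBaseContext.json") as f:
    base = json.load(f)
with open("ec50-audit-rx-context.json") as f:
    reactor_schema = json.load(f)

# Swap the {"$ref": "AbacoBaseContext.json#"} entry for the loaded base schema
# so the allOf can be evaluated entirely in memory.
reactor_schema["allOf"][0] = base

example_context = {
    "MSG": "{}",
    "tapis_jobId": "0f1f44fa-e4bb-4f03-95e5-1dc2d0b641cc-007",
}

# Passes only if both the base context constraints and the tapis_jobId
# requirement/pattern are satisfied.
jsonschema.validate(instance=example_context, schema=reactor_schema)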

As we noted before, I think that the typical end user would find it difficult to write these schemas from scratch, especially if we adopt schema merging.


mwvaughn commented Sep 10, 2020

Agreed. I want to keep this streamlined towards only what a developer needs to do to define variables. One thought I had was to bundle AbacoBaseContext with the Python package and perform a merge on it using the contents of context.jsonschema, similar to what you've done here. The key difference is that we would do this on demand inside a validate operation or when we are creating the usage response.

I am working on branch schema_validation from today's develop to implement a couple test approaches to this.


ethho commented Sep 10, 2020

One thought I had was to bundle AbacoBaseContext with the Python package and perform a merge on it using the contents of context.jsonschema, similar to what you've done here. The key difference is that we would do this on demand inside a validate operation or when we are creating the usage response.

Good idea, I agree that the simplicity/usability benefit here outweighs the decrease in descriptiveness.

A corollary of the above: we could provide a static asset (let's call it context_union.jsonschema for the sake of argument) that joins AbacoBaseContext with the more developer-friendly context.jsonschema. I imagine this schema would be included in the Python package or created on tapis actors init.

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "context_union.jsonschema",
  "allOf": [
    {"#ref": "AbacoBaseContext.json#"},
    {"#ref": "AnotherBaseContext.json#"},
    {"#ref": "context.json#"}
  ]
}

I'm not in love with this idea; it veers pretty close to overengineering, but it's still worth putting on paper.


shwetagopaul92 commented Sep 14, 2020

An example context.jsonschema for FCS-ETL Reactors: https://gitlab.sd2e.org/sd2program/fcs-etl-reactor/.

  • uri is presented in a JSON message conforming to the AbacoMessageAgavePath schema

{
	"$schema": "http://json-schema.org/draft-04/schema#",
	"$id": "FcsEtlReactor",
	"title": "Context for FCS-ETL Reactors",
	"type": "object",
	"properties": {
		"uri": {
			"type": "string",
			"format": "uri",
			"description": "agave:// samples.json",
			"pattern": "^agave:\\/\\/.*\\.json"
		}
	},
	"required": ["uri"],
	"additionalProperties": true
}

ethho added the "brainstorm (Ideas in the works)" label on Sep 16, 2020
@mwvaughn

Here is a cool implication of defining a Reactor's parameter environment using JSONschema. Via a JavaScript library called SchemaForm, it is possible to generate a Bootstrap 3 forms interface. This could, in turn, be used to render a web UI for submitting a message to a Reactor. Attached below is the forms interface for the pipeline jobs indexer documented above.

[Screenshot attached: the SchemaForm-rendered interface for the PipelineJobs Indexer ("Screen Shot 2020-09-17 at 3 21 26 PM")]

@mwvaughn

To automatically support multiple schema files, including context.jsonschema, I think we need to tweak the base image a little. Here's one proposed solution:

FROM sd2e/python3:ubuntu17

ARG SCRATCH=/mnt/ephemeral-01

RUN mkdir -p ${SCRATCH} && \
    chmod a+rwx ${SCRATCH} && \
    chmod g+rwxs ${SCRATCH}

RUN mkdir /schemas

RUN pip install git+git://github.com/TACC-Cloud/python-reactors.git@develop#egg=reactors

# Onbuild support
ONBUILD ADD requirements.txt /tmp/requirements.txt
ONBUILD RUN pip3 install -r /tmp/requirements.txt
ONBUILD ADD reactor.py /
ONBUILD ADD config.yml /
ONBUILD ADD *.jsonschema /schemas/

# Close out making absolutely sure that work directory is set
WORKDIR ${SCRATCH}

CMD ["python3", "/reactor.py"]

The significant change is that we would now put all schema files in /schemas. The schema verification/classification code can handle this (and in fact expects it). One annoying aspect of this approach is that context.jsonschema ends up in the same location as user-defined message schemas. We can make sure that it's ignored by the classification function, but it feels messy to do it this way.

Another approach is to require context.jsonschema to be at the top level of the project, and any message schemas to be in a schemas directory. The ONBUILD for this would look like:

...
ONBUILD ADD config.yml /
ONBUILD COPY context.jsonschema /context.jsonschema
ONBUILD COPY schemas /schemas
...

This would clearly and cleanly separate schemas intended for the validation/classification workflow from the context-validation workflow.

Thoughts on which is preferable?

@mwvaughn

I will say that in either case, we will need to provide an explicit path to migrate a "V1" Reactors project to a "V2" project, but I expect this will be a small lift.


ethho commented Sep 23, 2020

One annoying aspect of this approach is that context.jsonschema is in the same location as user-defined message schemas.

Thoughts on which is preferable?

As a reactor developer, I would actually categorize context.jsonschema as a "user-defined schema," in that it should be exposed to the user to the same extent as message.jsonschema.

By this principle, I'm in favor of putting all *.jsonschemas (except the BaseContext.jsonschema and AbacoUnionContext.jsonschema) in /schemas, like you have in the above Dockerfile. But, as you point out, this leads us to the antipattern where schemas are enforced at different scopes (Reactor().context vs Reactor().context.message_dict), but stored in the same directory.

A corollary of the first approach (putting all schemas in /schemas): tapis actors init could generate the following minimal context.jsonschema:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "context",
  "title": "Template reactor context",
  "description": "Template reactor context",
  "message_dict": {"#ref": "message.jsonschema"}
}

...as well as an empty ({}) message.jsonschema. Then, only enforce the context.jsonschema in the package logic; the message schema is enforced (and thereby required) IFF it is linked in the context.jsonschema. I can't think of an elegant way to avoid the "scope antipattern," but I think linking the message schema in a user-exposed (but Tapis-templated) context schema makes this behavior discoverable, explicit, and mutable.

I believe this would work with the current (750a8f9) behavior of reactors.context.find_context_schema_files, which only globs on files named context.jsonschema.

I will say that in either case, we will need to provide an explicit path to migrate a "V1" Reactors project to a "V2" project, but I expect this will be a small lift.

Using the above approach, I can think of two explicit paths for migrating existing reactors to "V2":

  1. Copy the minimal context.jsonschema to every reactor repository. This is pretty clean, and a minimal lift, as you say.
  2. We could include some fault-tolerant logic in the package that provides the minimal context schema if it cannot find a /schemas/context.jsonschema, but this is pretty messy and not worth the time saved IMO.

@mwvaughn

More thoughts (this is worth discussing so we don't have to unwind bad choices later)

We support validation of an incoming JSON message via validate_message(mes, schema) but just as importantly, we support classification of that message by determining which (from a list of schemas) a message validates against. The way we accomplish this is to validate the message against all schemas discovered by find_schema_files(), returning the $id of each matching schema.

To be specific: If I have an actor with context.jsonschema, message1.jsonschema and message2.jsonschema in the /schemas directory, when we call classify_message(mes...), it will be checked against all three schemas, with one of those checks being almost guaranteed to fail.

Without specific logic in find_schema_files() or classify_message to omit context.jsonschema, the act of classification will always perform one futile message validation against context.jsonschema. On the other hand, if we add code to intentionally skip the filename context.jsonschema during classification, it means that an end user cannot use the perfectly valid message schema file name context.jsonschema.
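
One possible shape for that skip logic, purely as a hedged sketch (the real find_schema_files()/classify_message() live in the package and may differ; the filename filter shown here is the option being debated, not current behavior):

import json
import os
from glob import glob

import jsonschema

def classify_message(message, schema_dir="/schemas"):
    """Return the $id of every schema the message validates against,
    skipping the reserved context.jsonschema filename."""
    matches = []
    for path in sorted(glob(os.path.join(schema_dir, "*.jsonschema"))):
        if os.path.basename(path) == "context.jsonschema":
            continue  # reserved for context validation, not message classification
        with open(path) as f:
            schema = json.load(f)
        try:
            jsonschema.validate(instance=message, schema=schema)
            matches.append(schema.get("$id", path))
        except jsonschema.ValidationError:
            pass
    return matches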

@mwvaughn

Let's now consider specification and validation of the actor's context. We're supporting two main use cases with this:

  1. documenting what variables are needed in a user environment to run the code
  2. enabling validation at run-time of the specified environment variables

We have not yet implemented Reactor.validate_context(), the logical counterpart of Reactor.validate_message(). Consistent with my understanding of the use cases above, we would only validate the environment variables present in a context, not computed slots like message_dict. For the message itself, we would only check that the variable MSG was present (by means of the BaseContext schema).

I understand it is possible to validate the entire contents of the context dict by using a schema that incorporates a message_dict property, but this gets complicated when we have multiple possible values for the message JSONschema - at minimum, we'd have to use a oneOf instead of just a simple $ref to message.jsonschema.

I think I would prefer validation of the context ignore the contents of the message except to ensure that it can be constructed from MSG.

@mwvaughn

Let's try an exercise by mocking up the usage inside a Reactor for three cases:

Actor that uses only its context for parameterization

from reactors import Reactor
rx = Reactor()
# interprets context.jsonschema
rx.validate_context()
print(rx.context['env-we-know-exists'])

Actor that accepts a single message schema

from reactors import Reactor
rx = Reactor()
# interprets context.jsonschema
rx.validate_context()
# interprets /schemas/message.jsonschema 
rx.validate_message()
# rx is going to have a `.message` property in next release
print(rx.message['key-we-know-exists'])

Actor that accepts multiple message schemas

from reactors import Reactor
rx = Reactor()
# interprets context.jsonschema
rx.validate_context()
# interprets contents of /schemas
# will perform a futile classification attempt on /schemas/context.jsonschema
schema_id = rx.classify_message()
print('This message is schema {0}'.format(schema_id))

An observation: It seems like validate_context() is something we want to do by default. We could add it to the __init__ process for the Reactor class, allowing the developer to opt out of context validation via something like Reactor(validate_context=False).

@mwvaughn

For the sake of demonstrating HOW to implement context validation to developers, bundling a trivial, empty context.jsonschema with the default Tapis actors project makes sense. But, it's not strictly necessary. The AbacoUnionContext schema we can generate inside context.py gives us baseline context validation. Specifically, it guarantees:

  1. MSG is present
  2. If an API server URL is passed, it is a valid URL
  3. If a TACC username is set it is at least lexically valid as a UNIX username
  4. All other Abaco context variables are enforced to be string types
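
A hedged sketch of how context.py could assemble that AbacoUnionContext on demand; the function name and the trimmed base schema below are placeholders, not the shipped implementation:

import jsonschema

def build_union_schema(base_schema, user_schema=None):
    """Combine the packaged AbacoBaseContext with an optional
    developer-supplied context.jsonschema via allOf."""
    union = {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "$id": "AbacoUnionContext",
        "allOf": [base_schema],
    }
    if user_schema:
        union["allOf"].append(user_schema)
    return union

# Stand-in for the full AbacoBaseContext documented earlier in this thread
trimmed_base = {
    "type": "object",
    "required": ["MSG"],
    "properties": {"MSG": {"type": "string"}},
}

# Guarantee 1 above: MSG must be present
jsonschema.validate(instance={"MSG": "{}"}, schema=build_union_schema(trimmed_base))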


ethho commented Sep 24, 2020

For the sake of demonstrating HOW to implement context validation to developers, bundling a trivial, empty context.jsonschema with the default Tapis actors project makes sense. But, it's not strictly necessary.

I see your point here; there is little sense in making the context schema a strict requirement. This is a good case in support of the second option (context.jsonschema at the container root, and /schemas for message validation only). Maybe we could just explicitly put context schemas in /context_schemas and message schemas in /message_schemas - just an idea.

Actor that uses only its context for parameterization

Actor that accepts a single message schema

Actor that accepts multiple message schemas

Another observation/question that arises while reading this: shouldn't we also support the following use case?

Actor that accepts multiple context schemas, but ≤ 1 message schemas

from reactors import Reactor
rx = Reactor()
rx.validate_message()
schema_id = rx.classify_context()
print('This context abides by schema {0}'.format(schema_id))

@mwvaughn

Multiple contexts... it stands to reason, doesn't it? This makes a strong argument for physically separating context schema files from message schema files.

@mwvaughn

@eho-tacc I'm considering how to accomplish this while maintaining backwards compatibility with V1 project structures. Specifically, I would like to reason through how to build a Reactors Docker image.

I would like to keep things very simple for the basic case, which would involve having a single default context and message schema. If a developer wants to add more schemas, they can manually add COPY commands to their Dockerfile.

At the same time, I think we might want to clearly mark which schemas are the default. This allows code in the Reactors package to know that these schemas have a tiny bit of precedence over any others. Maybe they are evaluated first, or maybe they are all that get checked by default when we call validate() from the reactor - I'm just riffing.

The Dockerfile for the reactors image would look like so:

...
ONBUILD COPY context.jsonschema /context_schemas/default_context.jsonschema
ONBUILD COPY message.jsonschema /message_schemas/default_message.jsonschema
...

This would mandate presence of a minimum viable context and message schema (the second of which we already enforce).


ethho commented Sep 30, 2020

At the same time, I think we might want to clearly mark which schemas are the default. This allows code in the Reactors package to know that these schemas have a tiny bit of precedence over any others. Maybe they are evaluated first, or maybe they are all that get checked by default when we call validate() from the reactor - I'm just riffing.

I like the second route: using /context_schemas/default_context.jsonschema as the default schema if one does not pass a path to validate().

Another idea that came to mind: we could call Reactor.validate from the Reactor.__init__ constructor. This removes a single line of boilerplate validation. I don't imagine there will be many use cases that skip validation; in that case, I suppose one could simply pass kwargs: rx = Reactor(context_schema=None, message_schema=None).


mwvaughn commented Sep 30, 2020

Am I right you're advocating for one-shot validation, wherein we validate message_dict as part of the context? I had been thinking about keeping them separate, validating only presence of the MSG variable in the context. This is because in my mind, the context we are validating or classifying only refers to variables explicitly set by Abaco when running the container. The message dict, on the other hand, is a computed value that comes from parsing the MSG string using the Python AST, failing over to an attempt to use json.loads(). I'm also thinking about cases where the message may be entirely and intentionally empty, as is the case when Abaco is processing a binary stream.

To your other point: I have been thinking about calling validate() at Reactor.__init__ time as well (in fact, my original design for Reactor had this in it). My continuing concern is that if a Reactor object fails to initialize, we lose access to Reactor.on_success and Reactor.on_failure as well as Reactor.loggers. Losing access to the loggers means losing access to Kibana on SD2 and probably some other logging PaaS for the TACC tenant (e.g. Loggly), which breaks distributed tracing when an invalid message or context is encountered.

To get around this, I was thinking we could add a helper Reactor.validate(context_schema=<default>, message_schema=<default>, validate_context=True, validate_message=True, permissive=True). This would wrap default calls to Reactor.validate_context() and Reactor.validate_message() in a single call. It keeps one line of boilerplate, but it prevents having to add explicit calls to two validation functions, so maybe that is a win in the end.
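
Purely to make that proposal concrete, a hedged sketch of what the wrapper could look like; validate_context()/validate_message() are the methods discussed above, while the attribute names and the permissive handling are assumptions:

# Hypothetical method that would live on the Reactor class; not the shipped implementation.
def validate(self, context_schema=None, message_schema=None,
             validate_context=True, validate_message=True, permissive=True):
    """One-call wrapper around validate_context()/validate_message()."""
    try:
        if validate_context:
            self.validate_context(context_schema)
        if validate_message:
            self.validate_message(message_schema)
        return True
    except Exception as exc:
        if not permissive:
            raise
        # Logger attribute name assumed; the point is to keep logging (and thus
        # distributed tracing via Kibana/Loggly) alive instead of dying outright.
        self.logger.warning("Validation failed: %s", exc)
        return False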


ethho commented Sep 30, 2020

Am I right you're advocating for one-shot validation, wherein we validate message_dict as part of the context? I had been thinking about keeping them separate

After last week's discussion, I agree with you that the context and message schemas should be kept separate (and not linked), so no, I am no longer advocating for this.

Concerning your point about the loggers: agreed that __init__ should return without errors. A single line of boilerplate is not worth the price of broken logging.

@mwvaughn

Great. The sum of all this discussion is enough for me to finish a demo branch I had started work on to demonstrate some of these concepts! I'm really optimistic now!
