
Document example variables needed to run a Reactors container #5

Open
mwvaughn opened this issue Sep 2, 2020 · 31 comments
Labels: brainstorm (Ideas in the works)

Comments


mwvaughn commented Sep 2, 2020

No description provided.

mwvaughn changed the title from "Document example sets of variables needed to run a Reactors container" to "Document example variables needed to run a Reactors container" on Sep 2, 2020

mwvaughn commented Sep 2, 2020

The following variables are always passed into a Reactors container when run on the Abaco platform (a short sketch of reading them from Python follows the list):

  • _abaco_api_server - the API server for the current instance of Tapis
  • _abaco_access_token - an OAuth Bearer token for Tapis access, generated by the Abaco platform
  • _abaco_Content-Type - the content type of the message passed to Abaco
  • _abaco_execution_id - the public identifier of the specific instantiation of an Actor
  • _abaco_username - the Tapis username requesting the specific instance of an Actor
  • _abaco_actor_dbid - the internal database identifier for the Actor, usually a combination of the Tapis tenant and the actor ID
  • _abaco_actor_id - the public identifier for the Actor
  • _abaco_actor_state - a serialized Python dict for persisting state between instantiations of an Actor
  • _abaco_worker_id - the public identifier for the Abaco worker executing the present instantiation
  • _abaco_container_repo - the container repo name for the current Actor
  • _abaco_actor_name - the public name of the current Actor
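
For reference, here is a minimal Python sketch (not part of the SDK) of how a container process can read a few of these variables; the example values printed are hypothetical:

import os

# Abaco injects these as plain environment variables into the container.
api_server = os.environ.get("_abaco_api_server")
access_token = os.environ.get("_abaco_access_token")
execution_id = os.environ.get("_abaco_execution_id")
actor_id = os.environ.get("_abaco_actor_id")

# The raw message body arrives in MSG alongside the _abaco_* variables.
raw_message = os.environ.get("MSG", "")

print("Execution {0} of actor {1} against {2}".format(execution_id, actor_id, api_server))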


ethho commented Sep 2, 2020

ec50-stab-audit-rx

This is an actor that audits file outputs of a Tapis app. A Tapis job UUID in the reactor context is necessary and sufficient for usage.

Message Schema

  • This reactor expects, but does not enforce, an empty JSON message ({}).

Expected Variables

  • tapis_jobId: Tapis job UUID of the upstream Tapis job, which called back to this reactor's message inbox.


mwvaughn commented Sep 3, 2020

ec50-stab-audit-rx

This is an actor that audits file outputs of a Tapis app. A Tapis job UUID in the reactor context is necessary and sufficient for usage.

Message Schema

  • This reactor expects, but does not enforce, an empty JSON message ({}).

Expected Variables

  • tapis_jobId: Tapis job UUID of the upstream Tapis job, which called back to this reactor's message inbox.

Where is the requirement for tapis_jobId defined for this actor?


mwvaughn commented Sep 3, 2020

As we go, we also need to list the common variables defined in secrets.json in the Reactor project repositories.


ethho commented Sep 3, 2020

Where is the requirement for tapis_jobId defined for this actor?

This var is required at the line in this permalink: https://gitlab.sd2e.org/psap/ec50-stab-audit-rx/blob/master/reactor.py#L128, which basically says assert 'tapis_jobId' in r.context with better reporting.
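
For illustration only, a hedged sketch of that kind of guard (this is not the actual reactor.py code; r is assumed to be a Reactor instance and on_failure its error handler):

# Hypothetical equivalent of "assert 'tapis_jobId' in r.context with better reporting"
if 'tapis_jobId' not in r.context:
    r.on_failure("tapis_jobId was not found in the actor context; "
                 "it should be supplied by the upstream Tapis job callback")
tapis_job_id = r.context['tapis_jobId']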

As we're going, we also need to list common variables defined in secrets.json in the reactor project repositories.

Does there exist a canonical way to do this? e.g. secrets.example.json?


mwvaughn commented Sep 8, 2020

Here's a stab at the baseline context.jsonschema, over which we'd merge the one provided with the Reactor. I'm still thinking over the ins and outs of multiple message schemas and how we'd go about validation. I also think we can add a bit more constraint to some of the string fields, even if we don't have a formal, enforceable regex pattern for them.

{
	"$schema": "http://json-schema.org/draft-07/schema#",
	"$id": "AbacoBaseContext",
	"title": "Baseline Abaco Context",
	"type": "object",
	"properties": {
		"MSG": {
			"type": "string",
			"description": "Message received by the Actor"
		},
		"_abaco_Content-Type": {
			"type": "string",
			"description": "Content type of the message passed to Abaco"
		},
		"_abaco_api_server": {
			"type": "string",
			"description": "API server for an instance of Tapis",
			"format": "uri"
		},
		"_abaco_access_token": {
			"type": "string",
			"description": "Oauth Bearer token for Tapis access, generated by Abaco"
		},
		"_abaco_execution_id": {
			"type": "string",
			"description": "Public identifier of the current Actor Execution"
		},
		"_abaco_username": {
			"type": "string",
			"description": "Tapis username who is requesting the execution of the Actor",
			"pattern": "^[a-z][0-9a-z]{2,7}",
			"examples": [
				"vaughn",
				"tg840555",
				"sd2etest1"
			]
		},
		"_abaco_actor_dbid": {
			"type": "string",
			"description": "Internal identifier for the Actor"
		},
		"_abaco_actor_id": {
			"type": "string",
			"description": "Public identifier for the Actor",
			"examples": [
				"e5QKEW8L0BeZ4",
				"6rgbzrjRKoBDk"
			]
		},
		"_abaco_actor_state": {
			"type": "object",
			"description": "Serialized object for persisting state between Actor Executions"
		},
		"_abaco_worker_id": {
			"type": "string",
			"description": "Public identifier for the Abaco worker handling the current Execution"
		},
		"_abaco_container_repo": {
			"type": "string",
			"description": "Linux container repo for the current Actor",
			"examples": [
				"tacc/tacbobot",
				"tacobot",
				"tacobot:latest",
				"tacobot:600a1af",
				"tacc/tacobot:600a1af",
				"index.docker.io/tacc/tacobot:600a1af"
			]
		},
		"_abaco_actor_name": {
			"type": "string",
			"description": "Public name of the current Actor"
		},
		"x-nonce": {
			"type": "string",
			"description": "An Abaco nonce (API key)"
		}
	},
	"required": ["MSG"],
	"additionalProperties": true
}
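
For reference, a minimal sketch of exercising this baseline schema with the jsonschema package; the file name and the example values are assumptions, and nothing here is SDK behavior yet:

import json

import jsonschema

# Assumes the schema above has been saved locally as AbacoBaseContext.json
with open("AbacoBaseContext.json") as f:
    base_schema = json.load(f)

# A context assembled from the environment variables Abaco injects (values are made up)
example_context = {
    "MSG": "{}",
    "_abaco_api_server": "https://api.tapis.example",
    "_abaco_username": "vaughn",
    "_abaco_actor_id": "e5QKEW8L0BeZ4",
}

# Raises jsonschema.ValidationError if, for example, MSG is missing
# or a declared field has the wrong type.
jsonschema.validate(instance=example_context, schema=base_schema)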


mwvaughn commented Sep 9, 2020

Here's an interesting case we have not yet considered. If a Reactor relies on environment variables set at the container image level, we need to document those as well. Here's an example Dockerfile for one of the pipeline managers...

FROM sd2e/reactors:python3-edge

ARG DATACATALOG_BRANCH=2_2
# RUN pip uninstall --yes datacatalog || true
RUN pip3 install --upgrade --no-cache-dir \
    git+https://github.com/SD2E/python-datacatalog.git@${DATACATALOG_BRANCH}

COPY schemas /schemas

WORKDIR /mnt/ephemeral-01

ENV CATALOG_ADMIN_TOKEN_KEY=etMn7wzAhn9ErWcpzh8EK75St2CUsHHK54nA
ENV CATALOG_ADMIN_TOKEN_LIFETIME=3600
ENV CATALOG_RECORDS_SOURCE=jobs-indexer
ENV CATALOG_STORAGE_SYSTEM=data-sd2e-community

These CATALOG_ variables are used by the datacatalog Python package. The variable CATALOG_ADMIN_TOKEN_KEY is particularly interesting - it is the seed for the hash function used to generate database record-level administrative tokens. If this value is not set correctly, other actors that also use datacatalog will reject the tokens in the messages sent to them. The idea there was to allow us to have coordinated, parallel sets of pipeline manager actors for multiple projects or alternative use cases.

Along a similar theme, the CATALOG_RECORDS_SOURCE sets the value for _properties.source in any MongoDB records created by this actor.
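
To make concrete why those image-level ENV defaults matter, here is a generic Python illustration of the pattern such settings modules typically follow; this is not the actual datacatalog source, and the defaults shown simply mirror the ENV lines above:

import os

# Hypothetical settings-style lookups; names mirror the Dockerfile ENV lines.
ADMIN_TOKEN_KEY = os.environ.get("CATALOG_ADMIN_TOKEN_KEY")
ADMIN_TOKEN_LIFETIME = int(os.environ.get("CATALOG_ADMIN_TOKEN_LIFETIME", "3600"))
RECORDS_SOURCE = os.environ.get("CATALOG_RECORDS_SOURCE", "jobs-indexer")
STORAGE_SYSTEM = os.environ.get("CATALOG_STORAGE_SYSTEM", "data-sd2e-community")

# Two actors will only accept each other's record-level tokens if they were
# built (or run) with the same CATALOG_ADMIN_TOKEN_KEY value.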


ethho commented Sep 9, 2020

Here's an interesting case we have not yet considered. If a Reactor relies on environment variables set at the container image level, we need to document those as well.

This touches on the broader question of what subset of the environment the context.jsonschema should capture. For instance, one might be interested in including CATALOG_ADMIN_TOKEN_KEY, but certainly not other vars that are inherent in the image (LS_COLORS, LD_LIBRARY_PATH, etc.). I propose a tentative definition for context.jsonschema: a built image and a context.jsonschema should be sufficient to reproduce the Abaco runtime environment locally. Based on this definition, I would leave envs that are defined in the Dockerfile out of the context.jsonschema, since they are already documented and built as part of the image.

What do you all think of this definition? It certainly has drawbacks; for instance, it means that config.yml fields would be duplicated in the schema. EDIT: on second thought, we don't strictly need to include these fields in the schema, thanks to "additionalProperties": true


mwvaughn commented Sep 9, 2020

Here's a stab at documenting a context.jsonschema for https://github.com/SD2E/pipelinejobs-manager. The context document below is sufficient to define which variables can be set to execute the actor in "parameters" mode. We still need to document the individual JSON message schemas. I am leaning against including them in context.jsonschema right now, in favor of having them in files stored in the source repo ./schemas directory, copied to /schemas in the container.

{
	"$schema": "http://json-schema.org/draft-07/schema#",
	"$id": "PipelineJobsIndexer",
	"title": "Context for PipelineJobs Indexer",
	"type": "object",
	"properties": {
		"uuid": {
			"type": "string",
			"description": "UUID of job to manage",
			"pattern": "^(uri:urn:)?107[0-9a-f]{5}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
			"examples": [
				"107acb4d-1f93-553a-9e1c-b7b58125bc89"
			]
		},
		"name": {
			"type": "string",
			"description": "Event name to send to job <uuid>",
			"enum": [
				"index",
				"indexed"
			]
		},
		"level": {
			"type": "string",
			"description": "Data processing level (when action == index)",
			"enum": [
				"0",
				"1",
				"2",
				"3",
				"Reference",
				"User",
				"Unknown"
			]
		},
		"token": {
			"type": "string",
			"description": "Alphanumeric authorization token issued and validated by Python datacatalog"
		}
	},
	"required": [],
	"additionalProperties": true
}

Note: Updated to remove the Data Catalog env variables as per Ethan's comment below.


mwvaughn commented Sep 9, 2020

I think this is the right approach. Here's why: if we open the Pandora's box of Dockerfile-level env variables, we might have to include the ones for any dependency library, and the number of vars even for Python datacatalog alone is pretty large.

Here is the dir() on the Datacatalog settings module:

['ABACO_HASHIDS_SALT', 'ADJ_ANIMAL_DATE_FORMAT', 'ADJ_ANIMAL_DELIM', 'ADJ_ANIMAL_LENGTH', 'ADJ_ANIMAL_WORDS', 'ADMIN_TOKEN_KEY', 'ADMIN_TOKEN_LIFETIME', 'ADMIN_TOKEN_SECRET', 'DATE_FORMAT', 'DEBUG_MODE', 'DNS_DOMAIN', 'FILE_ID_PREFIX', 'JOBS_TOKEN_SALT', 'LOG_FIXITY_ERRORS', 'LOG_LEVEL', 'LOG_UPDATES', 'LOG_VERBOSE', 'MAX_INDEX_FILTERS', 'MAX_INDEX_PATTERNS', 'MOCKUUIDS_SALT', 'MONGODB_AUTHN', 'MONGODB_AUTH_DATABASE', 'MONGODB_DATABASE', 'MONGODB_HOST', 'MONGODB_PASSWORD', 'MONGODB_PORT', 'MONGODB_REPLICA_SET', 'MONGODB_USERNAME', 'MONGO_DELETE_FIELD', 'PIPELINES_TOKEN_SALT', 'PJS_CACHE_DIR', 'PJS_CACHE_MAX_AGE', 'PRODUCTS_ROOT', 'PRODUCTS_SYSTEM', 'PRODUCTS_VERSION', 'RECORDS_SOURCE', 'RECORD_PROPERTIES_SOURCE', 'REFERENCES_ROOT', 'REFERENCES_SYSTEM', 'REFERENCES_VERSION', 'SALT_LENGTH', 'SCHEMA_BASEURL', 'SCHEMA_NO_COMMENT', 'SCHEMA_REFERENCE', 'STORAGE_MANAGED_VERSION', 'STORAGE_SYSTEM', 'TACC_API_SERVER', 'TACC_JUPYTER_SERVER', 'TACC_MANAGER_ACCOUNT', 'TACC_PRIMARY_STORAGE_SYSTEM', 'TACC_PROJECT_GROUP', 'TACC_PROJECT_ID', 'TACC_PROJECT_NAME', 'TACC_TENANT', 'TACC_TENANT_POC_EMAIL', 'TAPIS_ANY_AUTHENTICATED_USERNAME', 'TAPIS_UNAUTHENTICATED_USERNAME', 'TOKEN_LENGTH', 'TYPEDUUID_HASHIDS_SALT', 'UNICODE_PATHS', 'UPLOADS_ROOT', 'UPLOADS_SYSTEM', 'UPLOADS_VERSION', 'UUID_MOCK_NAMESPACE', 'UUID_NAMESPACE', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', 'all_settings', 'array_from_string', 'callbacks', 'crypt', 'debug', 'fix_assets_path', 'helpers', 'identifiers', 'int_or_none', 'jsonschema', 'mongo', 'organization', 'os', 'parse_boolean', 'set_from_string']

That's a lot of potential vars to deal with if we included even a few of them in the context.jsonschema. Also, we don't really override these variables at runtime (even though you could).


ethho commented Sep 9, 2020

We still need to document the individual JSON message schemas. I am leaning against including them in context.jsonschema right now, in favor of having them in files stored in the source repo ./schemas directory, copied to /schemas in the container.

Agreed. I can't think of a compelling reason to include either the config.yml or the message schemas in the context.jsonschema for the time being.


mwvaughn commented Sep 9, 2020

The default context.jsonschema needs to include LOCALONLY, which is used to set Reactor.local.

See https://github.com/TACC-Cloud/python-reactors/blob/main/src/reactors_sdk/utils.py#L226
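
As a rough approximation (the authoritative parsing lives in utils.py at the permalink above; this sketch is not a copy of it), LOCALONLY is just another environment variable interpreted as a boolean:

import os

# Approximation only: treat any truthy LOCALONLY value (e.g. LOCALONLY=1)
# as a request to run locally without contacting the Abaco/Tapis APIs,
# which is what Reactor.local reflects.
LOCALONLY = os.environ.get("LOCALONLY", "0").strip().lower() in ("1", "true", "yes")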


ethho commented Sep 10, 2020

An example context.jsonschema for the above-mentioned actor ec50-audit-rx (https://gitlab.sd2e.org/psap/ec50-stab-audit-rx). The schema performs an "outer merge" with the AbacoBaseContext schema above using the allOf keyword.

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "ec50-audit-rx-context",
  "title": "Reactor context for ec50-audit-rx",
  "description": "Reactor context for ec50-audit-rx",
  "allOf": [
    {"#ref": "AbacoBaseContext.json#"},
    {
      "type": "object",
      "additionalProperties": true,
      "required": ["tapis_jobId"],
      "properties": {
      	"tapis_jobId": {
          "type": "string",
          "description": "A Tapis job UUID",
          "pattern": "^[0-9a-z]{8}-[0-9a-z]{4}-[0-9a-z]{4}-[0-9a-z]{4}-[0-9a-z]{12}-007$"
        }
      }
    }
  ]
}
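
To show how a merged schema like this might be exercised without resolving the $ref over the network, here is a hedged Python sketch that inlines the base schema before validating an example context; the file names and the inlining step are assumptions, not SDK behavior:

import json

import jsonschema

with open("AbacoBaseContext.json") as f:
    base = json.load(f)
with open("ec50-audit-rx-context.json") as f:
    reactor_schema = json.load(f)

# Swap the {"$ref": "AbacoBaseContext.json#"} entry for the loaded base schema
# so the allOf can be evaluated entirely in memory.
reactor_schema["allOf"][0] = base

example_context = {
    "MSG": "{}",
    "tapis_jobId": "0f1f44fa-e4bb-4f03-95e5-1dc2d0b641cc-007",
}

# Passes only if both the base context constraints and the tapis_jobId
# requirement/pattern are satisfied.
jsonschema.validate(instance=example_context, schema=reactor_schema)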

As we noted before, I think that the typical end user would find it difficult to write these schemas from scratch, especially if we adopt schema merging.


mwvaughn commented Sep 10, 2020

Agreed. I want to keep this streamlined towards only what a developer needs to do to define variables. One thought I had was to bundle AbacoBaseContext with the Python package and perform a merge on it using the contents of context.jsonschema, similar to what you've done here. The key difference is that we would do this on demand inside a validate operation or when we are creating the usage response.

I am working on branch schema_validation from today's develop to implement a couple test approaches to this.


ethho commented Sep 10, 2020

One thought I had was to bundle AbacoBaseContext with the Python package and perform a merge on it using the contents of context.jsonschema, similar to what you've done here. The key difference is that we would do this on demand inside a validate operation or when we are creating the usage response.

Good idea, I agree that the simplicity/usability benefit here outweighs the decrease in descriptiveness.

A corollary of the above: we could provide a static asset (let's call it context_union.jsonschema for the sake of argument) that joins AbacoBaseContext with the more developer-friendly context.jsonschema. I imagine this schema would be included in the Python package or created on tapis actors init.

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "context_union.jsonschema",
  "allOf": [
    {"#ref": "AbacoBaseContext.json#"},
    {"#ref": "AnotherBaseContext.json#"},
    {"#ref": "context.json#"}
  ]
}

I'm not in love with this idea; it veers pretty close to overengineering, but it's still worth putting on paper.


shwetagopaul92 commented Sep 14, 2020

An example context.jsonschema for FCS-ETL Reactors: https://gitlab.sd2e.org/sd2program/fcs-etl-reactor/.

  • uri is presented in a JSON message conforming to the AbacoMessageAgavePath schema

{
	"$schema": "http://json-schema.org/draft-04/schema#",
	"$id": "FcsEtlReactor",
	"title": "Context for FCS-ETL Reactors",
	"type": "object",
	"properties": {
		"uri": {
			"type": "string",
			"format": "uri",
			"description": "agave:// samples.json",
			"pattern": "^agave:\\/\\/.*\\.json"
		}
	},
	"required": ["uri"],
	"additionalProperties": true
}

ethho added the "brainstorm (Ideas in the works)" label on Sep 16, 2020
@mwvaughn

Here is a cool implication of defining a Reactor's parameter environment using JSONschema. Via a JavaScript library called SchemaForm, it is possible to generate a Bootstrap 3 forms interface. This could, in turn, be used to render a web UI for submitting a message to a Reactor. Attached below is the forms interface for the pipeline jobs indexer documented above.

[Screenshot attached: the SchemaForm-rendered interface for the PipelineJobs Indexer ("Screen Shot 2020-09-17 at 3 21 26 PM")]

@mwvaughn

To automatically support multiple schema files, including context.jsonschema, I think we need to tweak the base image a little. Here's one proposed solution:

FROM sd2e/python3:ubuntu17

ARG SCRATCH=/mnt/ephemeral-01

RUN mkdir -p ${SCRATCH} && \
    chmod a+rwx ${SCRATCH} && \
    chmod g+rwxs ${SCRATCH}

RUN mkdir /schemas

RUN pip install git+git://github.com/TACC-Cloud/python-reactors.git@develop#egg=reactors

# Onbuild support
ONBUILD ADD requirements.txt /tmp/requirements.txt
ONBUILD RUN pip3 install -r /tmp/requirements.txt
ONBUILD ADD reactor.py /
ONBUILD ADD config.yml /
ONBUILD ADD *.jsonschema /schemas/

# Close out making absolutely sure that work directory is set
WORKDIR ${SCRATCH}

CMD ["python3", "/reactor.py"]

The significant change is that we would now put all schema files in /schemas. The schema verification/classification code can handle this (and in fact expects it). One annoying aspect of this approach is that context.jsonschema ends up in the same location as user-defined message schemas. We can make sure that it's ignored by the classification function, but it feels messy to do it this way.

Another approach is to require context.jsonschema to be at the top level of the project, and any message schemas to be in a schemas directory. The ONBUILD for this would look like:

...
ONBUILD ADD config.yml /
ONBUILD COPY context.jsonschema /context.jsonschema
ONBUILD COPY schemas /schemas
...

This would clearly and cleanly separate schemas intended for the validation/classification workflow from the context-validation workflow.

Thoughts on which is preferable?

@mwvaughn

I will say that in either case, we will need to provide an explicit path to migrate a "V1" Reactors project to a "V2" project, but I expect this will be a small lift.


ethho commented Sep 23, 2020

One annoying aspect of this approach is that context.jsonschema is in the same location as user-defined message schemas.

Thoughts on which is preferable?

As a reactor developer, I would actually categorize context.jsonschema as a "user-defined schema," in that it should be exposed to the user to the same extent as message.jsonschema.

By this principle, I'm in favor of putting all *.jsonschemas (except the BaseContext.jsonschema and AbacoUnionContext.jsonschema) in /schemas, like you have in the above Dockerfile. But, as you point out, this leads us to the antipattern where schemas are enforced at different scopes (Reactor().context vs Reactor().context.message_dict), but stored in the same directory.

A corollary of the first approach (putting all schemas in /schemas): tapis actors init could generate the following minimal context.jsonschema:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "$id": "context",
  "title": "Template reactor context",
  "description": "Template reactor context",
  "message_dict": {"#ref": "message.jsonschema"}
}

...as well as an empty ({}) message.jsonschema. Then, only enforce the context.jsonschema in the package logic; the message schema is enforced (and thereby required) IFF it is linked in the context.jsonschema. I can't think of an elegant way to avoid the "scope antipattern," but I think linking the message schema in a user-exposed (but Tapis-templated) context schema makes this behavior discoverable, explicit, and mutable.

I believe this would work with the current (750a8f9) behavior of reactors.context.find_context_schema_files, which only globs on files named context.jsonschema.

I will say that in either case, we will need to provide an explicit path to migrate a "V1" Reactors project to a "V2" project, but I expect this will be a small lift.

Using the above approach, I can think of two explicit paths for migrating existing reactors to "V2":

  1. Copy the minimal context.jsonschema to every reactor repository. This is pretty clean, and a minimal lift, as you say.
  2. We could include some fault-tolerant logic in the package that provides the minimal context schema if it cannot find a /schemas/context.jsonschema, but this is pretty messy and not worth the time saved IMO.

@mwvaughn

More thoughts (this is worth discussing so we don't have to unwind bad choices later)

We support validation of an incoming JSON message via validate_message(mes, schema) but just as importantly, we support classification of that message by determining which (from a list of schemas) a message validates against. The way we accomplish this is to validate the message against all schemas discovered by find_schema_files(), returning the $id of each matching schema.

To be specific: If I have an actor with context.jsonschema, message1.jsonschema and message2.jsonschema in the /schemas directory, when we call classify_message(mes...), it will be checked against all three schemas, with one of those checks being almost guaranteed to fail.

Without specific logic in find_schema_files() or classify_message to omit context.jsonschema, the act of classification will always perform one futile message validation against context.jsonschema. On the other hand, if we add code to intentionally skip the filename context.jsonschema during classification, it means that an end user cannot use the perfectly valid message schema file name context.jsonschema.
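
One possible shape for that skip logic, purely as a hedged sketch (the real find_schema_files()/classify_message() live in the package and may differ; the filename filter shown here is the option being debated, not current behavior):

import json
import os
from glob import glob

import jsonschema

def classify_message(message, schema_dir="/schemas"):
    """Return the $id of every schema the message validates against,
    skipping the reserved context.jsonschema filename."""
    matches = []
    for path in sorted(glob(os.path.join(schema_dir, "*.jsonschema"))):
        if os.path.basename(path) == "context.jsonschema":
            continue  # reserved for context validation, not message classification
        with open(path) as f:
            schema = json.load(f)
        try:
            jsonschema.validate(instance=message, schema=schema)
            matches.append(schema.get("$id", path))
        except jsonschema.ValidationError:
            pass
    return matches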

@mwvaughn

Let's now consider specification and validation of the actor's context. We're supporting two main use cases with this:

  1. documenting what variables are needed in a user environment to run the code
  2. enabling validation at run-time of the specified environment variables

We have not yet implemented Reactor.validate_context(), the logical counterpart of Reactor.validate_message(). Consistent with my understanding of the use cases above, we would only validate the environment variables present in a context, not computed slots like message_dict. For the message itself, we would only check that the variable MSG was present (by means of the BaseContext schema).

I understand it is possible to validate the entire contents of the context dict by using a schema that incorporates a message_dict property, but this gets complicated when we have multiple possible values for the message JSONschema - at minimum, we'd have to use a oneOf instead of just a simple $ref to message.jsonschema.

I think I would prefer validation of the context ignore the contents of the message except to ensure that it can be constructed from MSG.

@mwvaughn

Let's try an exercise by mocking up the usage inside a Reactor for three cases:

Actor that uses only its context for parameterization

from reactors import Reactor
rx = Reactor()
# interprets context.jsonschema
rx.validate_context()
print(rx.context['env-we-know-exists'])

Actor that accepts a single message schema

from reactors import Reactor
rx = Reactor()
# interprets context.jsonschema
rx.validate_context()
# interprets /schemas/message.jsonschema 
rx.validate_message()
# rx is going to have a `.message` property in next release
print(rx.message['key-we-know-exists'])

Actor that accepts multiple message schemas

from reactors import Reactor
rx = Reactor()
# interprets context.jsonschema
rx.validate_context()
# interprets contents of /schemas
# will perform a futile classification attempt on /schemas/context.jsonschema
schema_id = rx.classify_message()
print('This message is schema {0}'.format(schema_id))

An observation: It seems like validate_context() is something we want to do by default. We could add it to the __init__ process for the Reactor class, allowing the developer to opt out of context validation via something like Reactor(validate_context=False).

@mwvaughn

For the sake of demonstrating HOW to implement context validation to developers, bundling a trivial, empty context.jsonschema with the default Tapis actors project makes sense. But, it's not strictly necessary. The AbacoUnionContext schema we can generate inside context.py gives us baseline context validation. Specifically, it guarantees:

  1. MSG is present
  2. If an API server URL is passed, it is a valid URL
  3. If a TACC username is set it is at least lexically valid as a UNIX username
  4. All other Abaco context variables are enforced to be string types
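
A hedged sketch of how context.py could assemble that AbacoUnionContext on demand; the function name and the trimmed base schema below are placeholders, not the shipped implementation:

import jsonschema

def build_union_schema(base_schema, user_schema=None):
    """Combine the packaged AbacoBaseContext with an optional
    developer-supplied context.jsonschema via allOf."""
    union = {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "$id": "AbacoUnionContext",
        "allOf": [base_schema],
    }
    if user_schema:
        union["allOf"].append(user_schema)
    return union

# Stand-in for the full AbacoBaseContext documented earlier in this thread
trimmed_base = {
    "type": "object",
    "required": ["MSG"],
    "properties": {"MSG": {"type": "string"}},
}

# Guarantee 1 above: MSG must be present
jsonschema.validate(instance={"MSG": "{}"}, schema=build_union_schema(trimmed_base))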


ethho commented Sep 24, 2020

For the sake of demonstrating HOW to implement context validation to developers, bundling a trivial, empty context.jsonschema with the default Tapis actors project makes sense. But, it's not strictly necessary.

I see your point here; there is little sense in making the context schema a strict requirement. This is a good case in support of the second option (context.jsonschema at the container root, and /schemas for message validation only). Maybe we could just explicitly put context schemas in /context_schemas and message schemas in /message_schemas - just an idea.

Actor that uses only its context for parameterization

Actor that accepts a single message schema

Actor that accepts multiple message schemas

Another observation/question that arises while reading this: shouldn't we also support the following use case?

Actor that accepts multiple context schemas, but ≤ 1 message schemas

from reactors import Reactor
rx = Reactor()
rx.validate_message()
schema_id = rx.classify_context()
print('This context abides by schema {0}'.format(schema_id))

@mwvaughn

Multiple contexts... it stands to reason, doesn't it? This makes a strong argument for physically separating context schema files from message schema files.

@mwvaughn

@eho-tacc I'm considering how to accomplish this while maintaining backwards compatibility with V1 project structures. Specifically, I would like to reason through how to build a Reactors Docker image.

I would like to keep things very simple for the basic case, which would involve having a single default context and message schema. If a developer wants to add more schemas, they can manually add COPY commands to their Dockerfile.

At the same time, I think we might want to clearly mark which schemas are the default. This allows code in the Reactors package to know that these schemas have a tiny bit of precedence over any others. Maybe they are evaluated first, or maybe they are all that get checked by default when we call validate() from the reactor - I'm just riffing.

The Dockerfile for the reactors image would look like so:

...
ONBUILD COPY context.jsonschema /context_schemas/default_context.jsonschema
ONBUILD COPY message.jsonschema /message_schemas/default_message.jsonschema
...

This would mandate presence of a minimum viable context and message schema (the second of which we already enforce).


ethho commented Sep 30, 2020

At the same time, I think we might want to clearly mark which schemas are the default. This allows code in the Reactors package to know that these schemas have a tiny bit of precedence over any others. Maybe they are evaluated first, or maybe they are all that get checked by default when we call validate() from the reactor - I'm just riffing.

I like the second route: using /context_schemas/default_context.jsonschema as the default schema if one does not pass a path to validate().

Another idea that came to mind: we could call Reactor.validate from the Reactor.__init__ constructor. This removes a single line of boilerplate validation. I don't imagine there will be many use cases that skip validation; in that case, I suppose one could simply pass kwargs: rx = Reactor(context_schema=None, message_schema=None).


mwvaughn commented Sep 30, 2020

Am I right you're advocating for one-shot validation, wherein we validate message_dict as part of the context? I had been thinking about keeping them separate, validating only presence of the MSG variable in the context. This is because in my mind, the context we are validating or classifying only refers to variables explicitly set by Abaco when running the container. The message dict, on the other hand, is a computed value that comes from parsing the MSG string using the Python AST, failing over to an attempt to use json.loads(). I'm also thinking about cases where the message may be entirely and intentionally empty, as is the case when Abaco is processing a binary stream.

To your other point: I have been thinking about calling validate() at Reactor.__init__ time as well (in fact, my original design for Reactor had this in it). My continuing concern is that if a Reactor object fails to initialize, we lose access to Reactor.on_success and Reactor.on_failure as well as Reactor.loggers. Losing access to the loggers means losing access to Kibana on SD2 and probably some other logging PaaS for the TACC tenant (e.g. Loggly), which breaks distributed tracing when an invalid message or context is encountered.

To get around this, I was thinking we could add a helper Reactor.validate(context_schema=<default>, message_schema=<default>, validate_context=True, validate_message=True, permissive=True). This would wrap default calls to Reactor.validate_context() and Reactor.validate_message() in a single call. It keeps one line of boilerplate, but it prevents having to add explicit calls to two validation functions, so maybe that is a win in the end.
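
Purely to make that proposal concrete, a hedged sketch of what the wrapper could look like; validate_context()/validate_message() are the methods discussed above, while the attribute names and the permissive handling are assumptions:

# Hypothetical method that would live on the Reactor class; not the shipped implementation.
def validate(self, context_schema=None, message_schema=None,
             validate_context=True, validate_message=True, permissive=True):
    """One-call wrapper around validate_context()/validate_message()."""
    try:
        if validate_context:
            self.validate_context(context_schema)
        if validate_message:
            self.validate_message(message_schema)
        return True
    except Exception as exc:
        if not permissive:
            raise
        # Logger attribute name assumed; the point is to keep logging (and thus
        # distributed tracing via Kibana/Loggly) alive instead of dying outright.
        self.logger.warning("Validation failed: %s", exc)
        return False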


ethho commented Sep 30, 2020

Am I right you're advocating for one-shot validation, wherein we validate message_dict as part of the context? I had been thinking about keeping them separate

After last week's discussion, I agree with you that the context and message schemas should be kept separate (and not linked), so no, I am no longer advocating for this.

Concerning your point about the loggers: agreed that __init__ should return without errors. A single line of boilerplate is not worth the price of broken logging.

@mwvaughn

Great. The sum of all this discussion is enough for me to finish a demo branch I had started work on to demonstrate some of these concepts! I'm really optimistic now!
