Document example variables needed to run a Reactors container #5
The following variables are always passed into a Reactors container when run on the Abaco platform:
Where is the requirement for
As we go, we also need to list common variables defined in
This var is required at the line in this permalink: https://gitlab.sd2e.org/psap/ec50-stab-audit-rx/blob/master/reactor.py#L128, which basically says
Does there exist a canonical way to do this? e.g.
Here's a stab at a baseline context.jsonschema over which we'd merge the one provided with the Reactor. I'm still thinking through the ins and outs of multiple message schemas and how we'd go about validation. I also think we can add a bit more constraint to some of the string fields, even if we don't have a formal, enforceable regex pattern for them.

```json
{
"$schema": "http://json-schema.org/draft-07/schema#",
"$id": "AbacoBaseContext",
"title": "Baseline Abaco Context",
"type": "object",
"properties": {
"MSG": {
"type": "string",
"description": "Message received by the Actor"
},
"_abaco_Content-Type": {
"type": "string",
"description": "Content type of the message passed to Abaco"
},
"_abaco_api_server": {
"type": "string",
"description": "API server for an instance of Tapis",
"format": "uri"
},
"_abaco_access_token": {
"type": "string",
"description": "Oauth Bearer token for Tapis access, generated by Abaco"
},
"_abaco_execution_id": {
"type": "string",
"description": "Public identifier of the current Actor Execution"
},
"_abaco_username": {
"type": "string",
"description": "Tapis username who is requesting the execution of the Actor",
"pattern": "^[a-z][0-9a-z]{2,7}",
"examples": [
"vaughn",
"tg840555",
"sd2etest1"
]
},
"_abaco_actor_dbid": {
"type": "string",
"description": "Internal identifier for the Actor"
},
"_abaco_actor_id": {
"type": "string",
"description": "Public identifier for the Actor",
"examples": [
"e5QKEW8L0BeZ4",
"6rgbzrjRKoBDk"
]
},
"_abaco_actor_state": {
"type": "object",
"description": "Serialized object for persisting state between Actor Executions"
},
"_abaco_worker_id": {
"type": "string",
"description": "Public identifier for the Abaco worker handling the current Execution"
},
"_abaco_container_repo": {
"type": "string",
"description": "Linux container repo for the current Actor",
"examples": [
"tacc/tacbobot",
"tacobot",
"tacobot:latest",
"tacobot:600a1af",
"tacc/tacobot:600a1af",
"index.docker.io/tacc/tacobot:600a1af"
]
},
"_abaco_actor_name": {
"type": "string",
"description": "Public name of the current Actor"
},
"x-nonce": {
"type": "string",
"description": "An Abaco nonce (API key)"
}
},
"required": ["MSG"],
"additionalProperties": true
}
```
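As a quick illustration of how this baseline schema could be exercised inside a running container, here is a minimal sketch using the third-party jsonschema package. The AbacoBaseContext.json file name and the idea of validating os.environ directly are assumptions for illustration, not part of the Reactors SDK.

```python
# Minimal sketch: check a container's runtime context against the baseline
# Abaco schema. Assumes the schema above is saved as AbacoBaseContext.json
# next to this script and that the third-party jsonschema package is installed.
import json
import os

import jsonschema

with open("AbacoBaseContext.json") as schema_file:
    base_schema = json.load(schema_file)

# Abaco injects the context as environment variables, so start from os.environ.
context = dict(os.environ)

# _abaco_actor_state is described above as a serialized object; decode it so
# it can satisfy "type": "object" during validation.
if "_abaco_actor_state" in context:
    try:
        context["_abaco_actor_state"] = json.loads(context["_abaco_actor_state"])
    except ValueError:
        pass  # leave as-is and let validation report the problem

try:
    jsonschema.validate(instance=context, schema=base_schema)
    print("Context satisfies the baseline Abaco schema")
except jsonschema.exceptions.ValidationError as err:
    print("Context validation failed: {0}".format(err.message))
```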
Here's an interesting case we have not yet considered. If a Reactor relies on environment variables set at the container image level, we need to document those as well. Here's an example Dockerfile for one of the pipeline managers...
These

Along a similar theme, the
This touches on the broader question: what subset of the environment should the

What do you all think of this definition? It certainly has drawbacks; for instance, it means that
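Purely as an illustration of the kind of rule being discussed, here is a sketch that builds the context from variables the context schema names explicitly plus anything carrying the _abaco_ prefix, ignoring the rest of the container environment. The function name and the exact rule are assumptions, not the original proposal.

```python
# Illustrative only: build a Reactor context from the environment by keeping
# variables named in the context schema plus any _abaco_-prefixed variables.
import os


def build_context(schema, environ=None):
    environ = dict(os.environ) if environ is None else environ
    named = set(schema.get("properties", {}).keys())
    return {
        key: value
        for key, value in environ.items()
        if key in named or key.startswith("_abaco_")
    }
```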
Here's a stab at documenting a context.jsonschema for https://github.com/SD2E/pipelinejobs-manager. The context document below is sufficient to define which variables can be set to execute the actor in "parameters" mode. We still need to document the individual JSON message schemas. I am leaning against including them in context.jsonschema right now, in favor of having them in files stored in the source repo.

```json
{
"$schema": "http://json-schema.org/draft-07/schema#",
"$id": "PipelineJobsIndexer",
"title": "Context for PipelineJobs Indexer",
"type": "object",
"properties": {
"uuid": {
"type": "string",
"description": "UUID of job to manage",
"pattern": "^(uri:urn:)?107[0-9a-f]{5}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
"examples": [
"107acb4d-1f93-553a-9e1c-b7b58125bc89"
]
},
"name": {
"type": "string",
"description": "Event name to send to job <uuid>",
"enum": [
"index",
"indexed"
]
},
"level": {
"type": "string",
"description": "Data processing level (when action == index)",
"enum": [
"0",
"1",
"2",
"3",
"Reference",
"User",
"Unknown"
]
},
"token": {
"type": "string",
"description": "Alphanumeric authorization token issued and validated by Python datacatalog"
}
},
"required": [],
"additionalProperties": true
}
```

Note: Updated to remove the Data Catalog env variables as per Ethan's comment below.
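To make "parameters" mode concrete, here is a hedged sketch of a context instance that should satisfy the schema above. The uuid comes from the schema's own examples; the remaining values (and the context.jsonschema file name) are invented for illustration.

```python
# Hypothetical "parameters"-mode context for the PipelineJobs Indexer.
import json

import jsonschema

example_context = {
    "uuid": "107acb4d-1f93-553a-9e1c-b7b58125bc89",  # from the schema's examples
    "name": "index",
    "level": "1",
    "token": "0123456789abcdef",  # made-up placeholder token
}

# Assumes the schema above has been saved locally as context.jsonschema.
with open("context.jsonschema") as schema_file:
    indexer_schema = json.load(schema_file)

jsonschema.validate(example_context, indexer_schema)
print("Example context is valid for the PipelineJobs Indexer schema")
```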
I think this is the right approach, and here's why: if we open the Pandora's box of Dockerfile-level env variables, we might have to include the ones for any dependency library, and the number of vars even for Python datacatalog is pretty large. Here is the
That's a lot of potential vars to deal with if we included even a few of them in the context.jsonschema. Also, we don't really override these variables at runtime (even though you could).
Agreed. I can't think of a compelling reason to include either the
The default

See https://github.com/TACC-Cloud/python-reactors/blob/main/src/reactors_sdk/utils.py#L226
An example context.jsonschema:

```json
{
"$schema": "http://json-schema.org/draft-07/schema#",
"$id": "ec50-audit-rx-context",
"title": "Reactor context for ec50-audit-rx",
"description": "Reactor context for ec50-audit-rx",
"allOf": [
{"#ref": "AbacoBaseContext.json#"},
{
"type": "object",
"additionalProperties": true,
"required": ["tapis_jobId"],
"properties": {
"tapis_jobId": {
"type": "string",
"description": "A Tapis job UUID",
"pattern": "^[0-9a-z]{8}-[0-9a-z]{4}-[0-9a-z]{4}-[0-9a-z]{4}-[0-9a-z]{12}-007$"
}
}
}
]
}
```

As we noted before, I think the typical end user would find it difficult to write these schemas from scratch, especially if we adopt schema merging.
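For completeness, here is a sketch of how the $ref to the baseline schema could be resolved when validating a context against this merged document. The local file names, and the use of jsonschema's RefResolver (deprecated in recent releases in favor of the referencing package), are assumptions for illustration.

```python
# Sketch: validate the runtime environment against the ec50-audit-rx schema,
# resolving its "$ref" to the baseline schema from a local file.
import json
import os

import jsonschema

with open("AbacoBaseContext.json") as f:
    base_schema = json.load(f)
with open("ec50-audit-rx-context.jsonschema") as f:
    reactor_schema = json.load(f)

# Map the reference target used above to the loaded document so the validator
# can follow "AbacoBaseContext.json#".
resolver = jsonschema.RefResolver(
    base_uri="",
    referrer=reactor_schema,
    store={"AbacoBaseContext.json": base_schema},
)

jsonschema.validate(dict(os.environ), reactor_schema, resolver=resolver)
```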
Agreed. I want to keep this streamlined toward only what a developer needs to do to define variables.

One thought I had was to bundle

I am working on branch
Good idea; I agree that the simplicity/usability benefit here outweighs the decrease in descriptiveness. A corollary of the above: we could provide a static asset (let's call it context_union.jsonschema):

```json
{
"$schema": "http://json-schema.org/draft-07/schema#",
"$id": "context_union.jsonschema",
"allOf": [
{"#ref": "AbacoBaseContext.json#"},
{"#ref": "AnotherBaseContext.json#"},
{"#ref": "context.json#"}
]
}
```

I'm not in love with this idea; it veers pretty close to overengineering, but it's still worth putting on paper.
An example context.jsonschema for FCS-ETL Reactors: https://gitlab.sd2e.org/sd2program/fcs-etl-reactor/.
Here is a cool implication of defining a Reactor's parameter environment using JSON Schema. Via a JavaScript library called SchemaForm, it is possible to generate a Bootstrap 3 forms interface. This could, in turn, be used to render a web UI for submitting a message to a Reactor. Attached below is the forms interface for the PipelineJobs Indexer documented above.
To automatically support multiple schema files, including
The significant change is that we would now put all schema files in

Another approach is to require
This would clearly and cleanly separate schemas intended for the validation/classification workflow from the context-validation workflow. Thoughts on which is preferable?
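To make the two layouts easier to compare, here is a hedged sketch of what automatic schema discovery could look like. The schemas/ directory name and the context_*/message_* naming convention are placeholders for illustration only, not a settled layout.

```python
# Hypothetical sketch of automatic schema discovery for a Reactors project.
import glob
import json
import os


def discover_schemas(schema_dir="schemas"):
    """Load every *.jsonschema file and split context from message schemas."""
    context_schemas, message_schemas = {}, {}
    for path in sorted(glob.glob(os.path.join(schema_dir, "*.jsonschema"))):
        with open(path) as schema_file:
            schema = json.load(schema_file)
        name = os.path.basename(path)
        if name.startswith("context"):
            context_schemas[name] = schema
        else:
            message_schemas[name] = schema
    return context_schemas, message_schemas
```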
I will say that in either case, we will need to provide an explicit path to migrate a "V1" Reactors project to a "V2" project, but I expect this will be a small lift.
As a reactor developer, I would actually categorize

By this principle, I'm in favor of putting all

A corollary of the first approach (putting all schemas in

```json
{
"$schema": "http://json-schema.org/draft-07/schema#",
"$id": "context",
"title": "Template reactor context",
"description": "Template reactor context",
"message_dict": {"#ref": "message.jsonschema"}
}
```

...as well as an empty (

I believe this would work with the current (750a8f9) behavior of
Using the above approach, I can think of two explicit paths for migrating existing reactors to "V2":
More thoughts (this is worth discussing so we don't have to unwind bad choices later):

We support validation of an incoming JSON message via

To be specific: if I have an actor with

Without specific logic in
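To illustrate the classification idea at play here, a generic sketch: try each registered message schema in turn and report the first one that validates. The function name and the first-match policy are assumptions for illustration, not SDK behavior.

```python
# Generic first-match classification over several message schemas.
import jsonschema


def classify_message(message, schemas):
    """Return the $id of the first schema that the message validates against."""
    for schema in schemas:
        try:
            jsonschema.validate(message, schema)
            return schema.get("$id")
        except jsonschema.exceptions.ValidationError:
            continue
    return None
```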
Let's now consider specification and validation of the actor's context. We're supporting two main use cases with this:
We have not yet implemented

I understand it is possible to validate the entire contents of the

I think I would prefer that validation of the context ignore the contents of the message, except to ensure that it can be constructed from
Let's try an exercise by mocking up the usage inside a Reactor for three cases (a sketch covering all three follows below):
- Actor that uses only its context for parameterization
- Actor that accepts a single message schema
- Actor that accepts multiple message schemas
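Here is a hedged sketch of what those three mockups might look like. Apart from Reactor() and validate_message(), which appear elsewhere in this thread, the method and attribute names (validate_context, context, classify_message) are assumptions against a hypothetical SDK surface, not shipped API.

```python
# Hypothetical mockups for the three cases; names other than Reactor() and
# validate_message() are assumptions, not confirmed SDK methods.
from reactors import Reactor

# 1. Actor parameterized only by its context (no message schema)
rx = Reactor()
rx.validate_context()              # assumed: check env vars against context.jsonschema
job_uuid = rx.context.get("uuid")  # assumed: read a parameter from the context

# 2. Actor that accepts a single message schema
rx = Reactor()
rx.validate_message()              # check the incoming message against message.jsonschema

# 3. Actor that accepts multiple message schemas
rx = Reactor()
schema_id = rx.classify_message()  # assumed: report which schema the message matched
print("Message abides by schema {0}".format(schema_id))
```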
An observation: it seems like
For the sake of demonstrating to developers HOW to implement context validation, bundling a trivial, empty context.jsonschema with the default Tapis actors project makes sense. But it's not strictly necessary. The
I see your point here; there is little sense in making the context schema a strict requirement. This is a good case in support of the second option (
Another observation/question that arises while reading this: shouldn't we also support the following use case? Actor that accepts multiple context schemas, but ≤ 1 message schema:

```python
from reactors import Reactor
rx = Reactor()
rx.validate_message()
schema_id = rx.classify_context()
print('This context abides by schema {0}'.format(schema_id))
```
Multiple contexts... it stands to reason, doesn't it? This makes a strong argument for physically separating context schema files from message schema files.
@eho-tacc I'm considering how to accomplish this while maintaining backwards compatibility with V1 project structures. Specifically, I would like to reason through how to build a Reactors Docker image. I would like to keep things very simple for the basic case, which would involve having a single default context and message schema. If a developer wants to add more schemas, they can manually add

At the same time, I think we might want to clearly mark which schemas are the default. This allows code in the Reactors package to know that these schemas have a tiny bit of precedence over any others. Maybe they are evaluated first, or maybe they are all that get checked by default when we call

The Dockerfile for the
This would mandate the presence of a minimum viable context and message schema (the second of which we already enforce).
I like the second route: using

Another idea that came to mind: we could call
Am I right that you're advocating for one-shot validation, wherein we validate

To your other point: I have been thinking about calling

To get around this, I was thinking we could add a helper
After last week's discussion, I agree with you that the context and message schemas should be kept separate (and not linked), so no, I am no longer advocating for this. Concerning your point about the loggers: agreed that
Great. The sum of all this discussion is enough for me to finish a demo branch I had started in order to demonstrate some of these concepts. I'm really optimistic now!