Feat/monitor data status #83

Merged · 5 commits · May 8, 2024
Changes from 4 commits
README.md (18 changes: 13 additions & 5 deletions)

@@ -4,13 +4,20 @@ Jobs and microservices for routine maintenance and reporting.

## Testing and development

Jobs should preferably be developed to run on the platforms below, in order of preference.
The differences between platforms are very minor; as long as a job is containerized (i.e. includes a Dockerfile), it can be taken anywhere else.
The order below mainly reflects security, availability, and cost-tracking considerations.
- Sage's Service Catalog
- GitHub Actions
- Some other platform

### General contribution flow for a **new** job

1. Create a branch off `main` with prefix `feat/`.
2. Create a new directory for the job/service and put the script(s), Dockerfile, and (recommended) a job-specific README there.
3. Add a workflow file to build an image (copy and adapt from current `.github/workflows`); see the sketch after this list.
    - Change `on.paths` so that the Docker build is triggered specifically for the job
    - In the very last step, update `context` to point to the new job directory
4. (Optional) Add `[pre-build]` to your final commit message if you want to provide a test image for reviewers in the PR.
5. Make a PR against `main` and add a reviewer.
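
A minimal sketch of such a workflow, assuming a hypothetical job directory `my-new-job`; the action versions, image name, and registry details here are illustrative and should be copied from the existing workflows rather than from this sketch:

```
name: Build my-new-job image

on:
  push:
    # Trigger the build only when files for this job change
    paths:
      - 'my-new-job/**'

permissions:
  contents: read
  packages: write

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      # The very last step: `context` points at the new job directory
      - uses: docker/build-push-action@v5
        with:
          context: ./my-new-job
          push: true
          tags: ghcr.io/nf-osi/jobs-my-new-job:latest
```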

@@ -36,4 +43,5 @@ SLACK=https://hooks.slack.com/services/xxxxxxxxxxxxxxxxxxxxxxxxxxxx
3. Run the containerized job:
`docker run --env-file envfile ghcr.io/nf-osi/jobs-some-job`

Depending on the job, there may be additional commands to run. Refer to the job's README.

monitor-data-status/Dockerfile (8 changes: 8 additions & 0 deletions)

@@ -0,0 +1,8 @@
# Base image with the Synapse Python client preinstalled
FROM sagebionetworks/synapsepythonclient:v4.2.0

WORKDIR /app

COPY update_data_status.py /app/update_data_status.py

# Run the data status script on container start
ENTRYPOINT ["python3", "/app/update_data_status.py"]
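
Because the script is the image's `ENTRYPOINT`, runtime flags such as `--dry` or `--update_df test.csv` can be appended directly to the `docker run` command, as the testing commands in the README below demonstrate.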

monitor-data-status/README.md (28 changes: 28 additions & 0 deletions)

@@ -0,0 +1,28 @@
## Monitor Data Status

This is a scheduled job that checks projects with data status "Data Pending" for their first file contribution and, once one is found, changes the project data status to "Under Embargo".
Files may not be annotated immediately upon upload; see also "monitor annotations", which is concerned with annotations on files.
This job is intended to be run via the Service Catalog.

### Secrets and env vars

- `SYNAPSE_AUTH_TOKEN`: (Required) Needs edit access in order to update the projects' annotations. When run via the Service Catalog, it is delivered inside the `SCHEDULED_JOB_SECRETS` JSON (see Testing below).


### Run params

- `--dry`: By default the job updates projects in Synapse; pass `--dry` to print the modified metadata without storing it.
- `--update_df`: Path to a CSV that directly specifies which projects should be updated, instead of querying the project view + file view. This is used for testing. The CSV should have columns `projectId` and `N`.


### Testing

- Build the image with e.g. `docker build --no-cache -t ghcr.io/nf-osi/jobs-monitor-data-status .` (or pull the current/pre-built image if available)
- Set up an `envfile-monitor-ds` with contents like below:

```
SCHEDULED_JOB_SECRETS={"SYNAPSE_AUTH_TOKEN":"xxxxxxxxxxxxxxxxxxxxxxxxxxxx"}
```

- To run with `--dry`: `docker run --env-file envfile-monitor-ds ghcr.io/nf-osi/jobs-monitor-data-status --dry`
- To run without `--dry` but with test data: `sudo docker run --env-file envfile-monitor-ds --mount type=bind,source=$(pwd)/test.csv,target=/app/test.csv ghcr.io/nf-osi/jobs-monitor-data-status --update_df test.csv`
monitor-data-status/test.csv (3 changes: 3 additions & 0 deletions)

@@ -0,0 +1,3 @@
projectId,N
syn22410511,5
syn26462036,10
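
Each row pairs a `projectId` with a file count `N`; when the file is passed via `--update_df`, the job iterates over these rows directly in place of the project view and file view queries.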
monitor-data-status/update_data_status.py (52 changes: 52 additions & 0 deletions)

@@ -0,0 +1,52 @@
import synapseclient
import pandas as pd
import os
import json
import argparse

# Constants for the views
FILE_VIEW_ID = "syn16858331"
PROJECT_VIEW_ID = "syn52677631"

def main(dry_run, update_df):
    syn = synapseclient.Synapse()
    secrets = json.loads(os.getenv("SCHEDULED_JOB_SECRETS"))
    auth_token = secrets["SYNAPSE_AUTH_TOKEN"]
    syn.login(authToken=auth_token)

    if update_df:
        print("Using manually specified data CSV...")
        fileview_df = pd.read_csv(update_df)
    else:
        # Fetch the project view data
        pending_projects_df = syn.tableQuery(f"SELECT id FROM {PROJECT_VIEW_ID} WHERE dataStatus='Data Pending'").asDataFrame()
        ids = tuple(pending_projects_df['id'])
        QUERY_IDS = f"({', '.join(repr(item) for item in ids)})"

        # Fetch the file view data
        # We'll need to filter out files created by nf-osi service or staff
        # who tend to upload data sharing plans & other administrative files
        print(f"Checking production table for {len(ids)} projects with status 'Data Pending'...")
        fileview_df = syn.tableQuery(f"SELECT projectId,count(*) as N FROM {FILE_VIEW_ID} WHERE type='file' and createdBy not in ('3421893', '3459953', '3434950', '3342573') and projectId in {QUERY_IDS} group by projectId").asDataFrame()
> **Contributor:** Should we maintain a table on the portal backend project to keep the createdBy principal ids in? Then we could also use that table for reporting (instead of the one listed in the readme here https://github.com/nf-osi/internal-reports)

> **Contributor (Author):** Yeah, we probably need to start a table as a source of truth for who should be excluded for these sorts of things. I think we can do that first and then go back to revise these two + others.
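
A minimal sketch of that table-driven exclusion, assuming a hypothetical Synapse table (placeholder ID `syn00000000`) with a single `principalId` column; the table ID and column name are illustrative, not an existing resource:

```
# Hypothetical central table of principal IDs to exclude (staff / service accounts).
# syn00000000 and the column name `principalId` are placeholders.
EXCLUDED_USERS_TABLE_ID = "syn00000000"

def get_excluded_ids(syn):
    """Fetch the exclusion list from the central Synapse table."""
    df = syn.tableQuery(f"SELECT principalId FROM {EXCLUDED_USERS_TABLE_ID}").asDataFrame()
    return [str(i) for i in df["principalId"]]

# The hardcoded list in the query above could then be built dynamically:
# excluded = ", ".join(repr(i) for i in get_excluded_ids(syn))
# ... f"... createdBy not in ({excluded}) ..."
```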


print(f"Found {len(fileview_df.index)} that qualifying for transition:")
print(fileview_df)

for idx, p in fileview_df.iterrows():
project_to_update = syn.get(p['projectId'])
print(f"Project {project_to_update['name']} has seen its first contribution of {p['N']} file(s)!")
project_to_update['dataStatus'] = ['Under Embargo']
if dry_run:
print("Modified project meta (not stored):")
print(project_to_update)
else:
syn.store(project_to_update)
print(f"Project {project_to_update['name']} had dataStatus changed to 'Under Embargo'")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--dry", action="store_true", help="Print project with modified metadata but do not store.")
    parser.add_argument("--update_df", type=str, help="Path to csv of projects to update.")
    args = parser.parse_args()
    main(dry_run=args.dry, update_df=args.update_df)