Feat/monitor data status #83

Merged · 5 commits · May 8, 2024
Changes from 4 commits
README.md (18 changes: 13 additions & 5 deletions)

@@ -4,13 +4,20 @@ Jobs and microservices for routine maintenance and reporting.

## Testing and development

Jobs should preferably be developed to run on the platforms below, in order of preference.
The differences between platforms are very minor; as long as a job is containerized (i.e. includes a Dockerfile), it can be taken anywhere else.
The order below mainly reflects security, availability, and cost-tracking considerations.
- Sage's Service Catalog
- GitHub Actions
- Some other platform

### General contribution flow for a **new** job

1. Create a branch off `main` with prefix `feat/`.
2. Create a new directory for the job/service and put the script(s), Dockerfile, and (recommended) a job-specific README there.
3. Add a workflow file to build an image (copy and adapt from current `.github/workflows`); see the sketch after this list.
    - Change `on.paths` so that the Docker build is triggered specifically for the job
    - In the very last step, update `context` to point to the new job directory
4. (Optional) Add `[pre-build]` to your final commit message if you want to provide a test image for reviewers in the PR.
5. Make a PR against `main` and add a reviewer.
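
A minimal sketch of such a workflow, assuming a hypothetical job directory `my-new-job`; the action versions, image name, and registry details here are illustrative and should be copied from the existing workflows rather than from this sketch:

```
name: Build my-new-job image

on:
  push:
    # Trigger the build only when files for this job change
    paths:
      - 'my-new-job/**'

permissions:
  contents: read
  packages: write

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      # The very last step: `context` points at the new job directory
      - uses: docker/build-push-action@v5
        with:
          context: ./my-new-job
          push: true
          tags: ghcr.io/nf-osi/jobs-my-new-job:latest
```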

@@ -36,4 +43,5 @@ SLACK=https://hooks.slack.com/services/xxxxxxxxxxxxxxxxxxxxxxxxxxxx
3. Run the containerized job:
`docker run --env-file envfile ghcr.io/nf-osi/jobs-some-job`

Depending on the job, there may be additional commands to run. Refer to the job's README.

monitor-data-status/Dockerfile (8 changes: 8 additions & 0 deletions)

@@ -0,0 +1,8 @@
# Base image with the Synapse Python client preinstalled
FROM sagebionetworks/synapsepythonclient:v4.2.0

WORKDIR /app

COPY update_data_status.py /app/update_data_status.py

# Run the data status script on container start
ENTRYPOINT ["python3", "/app/update_data_status.py"]
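
Because the script is the image's `ENTRYPOINT`, runtime flags such as `--dry` or `--update_df test.csv` can be appended directly to the `docker run` command, as the testing commands in the README below demonstrate.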

monitor-data-status/README.md (28 changes: 28 additions & 0 deletions)

@@ -0,0 +1,28 @@
## Monitor Data Status

This is a scheduled job that checks projects with data status "Data Pending" for their first file contribution and, once one is found, changes the project data status to "Under Embargo".
Files may not be annotated immediately upon upload; see also "monitor annotations", which is concerned with annotations on files.
This job is intended to be run via the Service Catalog.

### Secrets and env vars

- `SYNAPSE_AUTH_TOKEN`: (Required) Needs edit access in order to update the projects' annotations. When run via the Service Catalog, it is delivered inside the `SCHEDULED_JOB_SECRETS` JSON (see Testing below).


### Run params

- `--dry`: By default the job updates projects in Synapse; pass `--dry` to print the modified metadata without storing it.
- `--update_df`: Path to a CSV that directly specifies which projects should be updated, instead of querying the project view + file view. This is used for testing. The CSV should have columns `projectId` and `N`.


### Testing

- Build the image with e.g. `docker build --no-cache -t ghcr.io/nf-osi/jobs-monitor-data-status .` (or pull the current/pre-built image if available)
- Set up an `envfile-monitor-ds` with contents like below:

```
SCHEDULED_JOB_SECRETS={"SYNAPSE_AUTH_TOKEN":"xxxxxxxxxxxxxxxxxxxxxxxxxxxx"}
```

- To run with `--dry`: `docker run --env-file envfile-monitor-ds ghcr.io/nf-osi/jobs-monitor-data-status --dry`
- To run without `--dry` but with test data: `sudo docker run --env-file envfile-monitor-ds --mount type=bind,source=$(pwd)/test.csv,target=/app/test.csv ghcr.io/nf-osi/jobs-monitor-data-status --update_df test.csv`
monitor-data-status/test.csv (3 changes: 3 additions & 0 deletions)

@@ -0,0 +1,3 @@
projectId,N
syn22410511,5
syn26462036,10
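
Each row pairs a `projectId` with a file count `N`; when the file is passed via `--update_df`, the job iterates over these rows directly in place of the project view and file view queries.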
monitor-data-status/update_data_status.py (52 changes: 52 additions & 0 deletions)

@@ -0,0 +1,52 @@
import synapseclient
import pandas as pd
import os
import json
import argparse

# Constants for the views
FILE_VIEW_ID = "syn16858331"
PROJECT_VIEW_ID = "syn52677631"

def main(dry_run, update_df):
    syn = synapseclient.Synapse()
    secrets = json.loads(os.getenv("SCHEDULED_JOB_SECRETS"))
    auth_token = secrets["SYNAPSE_AUTH_TOKEN"]
    syn.login(authToken=auth_token)

    if update_df:
        print("Using manually specified data CSV...")
        fileview_df = pd.read_csv(update_df)
    else:
        # Fetch the project view data
        pending_projects_df = syn.tableQuery(f"SELECT id FROM {PROJECT_VIEW_ID} WHERE dataStatus='Data Pending'").asDataFrame()
        ids = tuple(pending_projects_df['id'])
        QUERY_IDS = f"({', '.join(repr(item) for item in ids)})"

        # Fetch the file view data
        # We'll need to filter out files created by nf-osi service or staff
        # who tend to upload data sharing plans & other administrative files
        print(f"Checking production table for {len(ids)} projects with status 'Data Pending'...")
        fileview_df = syn.tableQuery(f"SELECT projectId,count(*) as N FROM {FILE_VIEW_ID} WHERE type='file' and createdBy not in ('3421893', '3459953', '3434950', '3342573') and projectId in {QUERY_IDS} group by projectId").asDataFrame()
> **Contributor:** Should we maintain a table on the portal backend project to keep the createdBy principal ids in? Then we could also use that table for reporting (instead of the one listed in the readme here https://github.com/nf-osi/internal-reports)

> **Contributor (Author):** Yeah, we probably need to start a table as a source of truth for who should be excluded for these sorts of things. I think we can do that first and then go back to revise these two + others.
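
A minimal sketch of that table-driven exclusion, assuming a hypothetical Synapse table (placeholder ID `syn00000000`) with a single `principalId` column; the table ID and column name are illustrative, not an existing resource:

```
# Hypothetical central table of principal IDs to exclude (staff / service accounts).
# syn00000000 and the column name `principalId` are placeholders.
EXCLUDED_USERS_TABLE_ID = "syn00000000"

def get_excluded_ids(syn):
    """Fetch the exclusion list from the central Synapse table."""
    df = syn.tableQuery(f"SELECT principalId FROM {EXCLUDED_USERS_TABLE_ID}").asDataFrame()
    return [str(i) for i in df["principalId"]]

# The hardcoded list in the query above could then be built dynamically:
# excluded = ", ".join(repr(i) for i in get_excluded_ids(syn))
# ... f"... createdBy not in ({excluded}) ..."
```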


print(f"Found {len(fileview_df.index)} that qualifying for transition:")
print(fileview_df)

for idx, p in fileview_df.iterrows():
project_to_update = syn.get(p['projectId'])
print(f"Project {project_to_update['name']} has seen its first contribution of {p['N']} file(s)!")
project_to_update['dataStatus'] = ['Under Embargo']
if dry_run:
print("Modified project meta (not stored):")
print(project_to_update)
else:
syn.store(project_to_update)
print(f"Project {project_to_update['name']} had dataStatus changed to 'Under Embargo'")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--dry", action="store_true", help="Print project with modified metadata but do not store.")
    parser.add_argument("--update_df", type=str, help="Path to csv of projects to update.")
    args = parser.parse_args()
    main(dry_run=args.dry, update_df=args.update_df)