
Commit

Release/v0.4.30 (#169)
bhayden53 authored Nov 22, 2021
1 parent 297b17e commit f582c9a
Showing 70 changed files with 1,290 additions and 300 deletions.
8 changes: 8 additions & 0 deletions .github/actions/release/action.yml
@@ -0,0 +1,8 @@
name: 'Tag or Release'
description: 'tag or release from a given branch'

runs:
  using: "composite"
  steps:
    - run: ${{ github.action_path }}/tag_or_release.sh
      shell: bash
16 changes: 16 additions & 0 deletions .github/actions/release/tag_or_release.sh
@@ -0,0 +1,16 @@
git config --local user.email "41898282+github-actions[bot]@users.noreply.github.com"
git config --local user.name "github-actions[bot]"

if [[ "$tag_or_release" == "tag" ]]; then
git checkout ${source_branch}
git tag -f "${name}" -m "tagged ${source_branch} to ${name} via manual github action"
# will fail if tag already exists; intentional
git push origin ${name}

elif [[ "$tag_or_release" == "release" ]]; then
echo ${token} | gh auth login --with-token
gh release create ${name} -F changelog.md --target ${source_branch} --title ${name}

else
echo "bad input"
fi
12 changes: 6 additions & 6 deletions .github/workflows/checks.yml
@@ -7,10 +7,10 @@ jobs:
name: run flake8
runs-on: ubuntu-18.04
steps:
- name: set up python 3.8
- name: set up python 3.7
uses: actions/setup-python@v2
with:
python-version: 3.8
python-version: 3.7

- name: checkout code
uses: actions/checkout@v2
@@ -25,10 +25,10 @@ jobs:
name: run black
runs-on: ubuntu-18.04
steps:
- name: set up python 3.8
- name: set up python 3.7
uses: actions/setup-python@v2
with:
python-version: 3.8
python-version: 3.7

- name: checkout code
uses: actions/checkout@v2
@@ -43,10 +43,10 @@ jobs:
name: run bandit
runs-on: ubuntu-18.04
steps:
- name: set up python 3.8
- name: set up python 3.7
uses: actions/setup-python@v2
with:
python-version: 3.8
python-version: 3.7

- name: checkout code
uses: actions/checkout@v2
33 changes: 33 additions & 0 deletions .github/workflows/release.yml
@@ -0,0 +1,33 @@
name: tag-or-release
on:
  workflow_dispatch:
    inputs:
      tag_or_release:
        description: 'must be string of either "tag" or "release"'
        required: true
        default: 'tag'
      name:
        description: 'the tag or release name, i.e. v1.0.0'
        required: true
      source_branch:
        description: 'the branch to tag or release'
        required: true
        default: "main"

jobs:
  tag_or_release:
    runs-on: ubuntu-latest
    name: tag or release the given branch with the given name
    steps:
      - name: checkout
        uses: actions/checkout@v2
        with:
          fetch-depth: 0
      - name: release
        id: release
        uses: ./.github/actions/release
        env:
          tag_or_release: "${{ github.event.inputs.tag_or_release }}"
          name: "${{ github.event.inputs.name }}"
          source_branch: "${{ github.event.inputs.source_branch }}"
          token: "${{ secrets.GITHUB_TOKEN }}"
73 changes: 65 additions & 8 deletions README.md
@@ -308,7 +308,12 @@ These are distinguished as either "retryable" or "not retryable".

3. Job Cancelled - An operator cancelled the job so it should not be retried.

4. Other errors - No automatic rescue
4. Job timeout - The job exceeded its maximum permitted/modeled compute time. This can be manually
   rescued by putting the string 'timeout_scale: 1.25' into a rescue message,
   which will trigger reprocessing the job with 25% extra time. Any positive floating
   point scale should work, including fractional scales < 1.0.

5. Other errors - No automatic rescue


Blackboard
@@ -322,23 +327,75 @@ In our current implementation, the blackboard lambda runs on a 7 minute schedule
To speed up this sync, we could capture Batch state change events in CloudWatch and send them to lambda for ingestion into a dynamodb, which could then be either hooked up directly to the GUIs, or replicated in a more efficient fashion to the on-prem databases.
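As an illustration only of what that improvement could look like (the rule name and lambda ARN below are hypothetical, not existing resources), an EventBridge/CloudWatch Events rule that forwards Batch job state changes to a lambda could be wired up roughly like this:

```python
# Hypothetical sketch: route AWS Batch state-change events to an ingest lambda.
import json
import boto3

events = boto3.client("events")

# Match every Batch job state transition in the account.
events.put_rule(
    Name="calcloud-batch-state-change",  # hypothetical rule name
    EventPattern=json.dumps({
        "source": ["aws.batch"],
        "detail-type": ["Batch Job State Change"],
    }),
    State="ENABLED",
)

# Point the rule at a (hypothetical) lambda that writes each event to DynamoDB.
events.put_targets(
    Rule="calcloud-batch-state-change",
    Targets=[{
        "Id": "blackboard-ingest",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:blackboard-ingest",
    }],
)
```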


Job Memory Model
Job Memory Models
================

Overview
--------

... describe key aspects of the ML here such as features and network layout ...
Pre-trained artificial neural networks are implemented in the pipeline to predict job resource requirements for HST data. All three network architectures are built using the Keras functional API from the TensorFlow library.

1. Memory Classifier
A 1D Convolutional Neural Network (CNN) performs multi-class classification on 8 features to predict which of 4 possible "memory bins" is the most appropriate for a given dataset. An estimated probability score is assigned to each of the four possible target classes, i.e. memory bins, represented by an integer from 0 to 3. An illustrative sketch of the network layout appears after this list. The memory size thresholds are categorized as follows:

- `0: < 2GB`
- `1: <= 8GB`
- `2: <= 16GB`
- `3: < 64GB`

2. Memory Regressor
A 1D-CNN performs regression to estimate how much memory (in gigabytes) a given dataset will require for processing. This prediction is not used directly by the pipeline because AWS compute doesn't require an exact number (hence the bin classification). We retain this model for the purpose of additional analysis of the datasets and their evolving characteristics.

3. Wallclock Regressor
A 1D-CNN performs regression to estimate the job's execution time in wallclock seconds. AWS Batch requires a minimum threshold of 60 seconds to be set on each job, although many jobs take less than one minute to complete. The predicted value from this model is used by JobSubmit to set a maximum execution time within which the job must complete; once that limit is reached, the job is killed regardless of whether or not it has finished.
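The exact layer configurations and tuned hyperparameters live in the training code; purely as an illustration of the Keras functional API pattern described above, a 1D-CNN classifier and regressor for 8 input features might be sketched as follows (layer sizes, kernel widths, and optimizer settings here are assumptions, not the production values):

```python
# Hypothetical sketch only -- not the production CALCLOUD architectures.
from tensorflow import keras
from tensorflow.keras import layers

def build_memory_classifier(n_features=8, n_bins=4):
    # Features arrive as a flat vector; add a channels axis for Conv1D.
    inputs = keras.Input(shape=(n_features, 1), name="features")
    x = layers.Conv1D(32, kernel_size=3, activation="relu")(inputs)
    x = layers.Conv1D(64, kernel_size=3, activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)
    # Softmax probabilities over the 4 memory bins (0-3).
    outputs = layers.Dense(n_bins, activation="softmax", name="memory_bin")(x)
    model = keras.Model(inputs, outputs, name="memory_classifier")
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

def build_wallclock_regressor(n_features=8):
    # Same convolutional front end, single linear output for wallclock seconds.
    inputs = keras.Input(shape=(n_features, 1), name="features")
    x = layers.Conv1D(32, kernel_size=3, activation="relu")(inputs)
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(1, activation="linear", name="wallclock_seconds")(x)
    model = keras.Model(inputs, outputs, name="wallclock_regressor")
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model
```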

Job Submission
JobPredict
--------------

Describe model execution, lambda, image, etc.
The JobPredict lambda is invoked by JobSubmit to determine resource allocation needs pertaining to memory and execution time. Upon invocation, a container is created on the fly from a Docker image stored in the caldp ECR. The container then loads the pre-trained models along with their learned parameters (e.g. weights) from saved Keras files.

The models' inputs are scraped from a text file in the calcloud-processing s3 bucket (`control/ipppssoot/MemoryModelFeatures.txt`) and converted into a numpy array. An additional preprocessing step applies a Yeo-Johnson power transform to the first two indices of the array (`n_files`, `total_mb`) using pre-calculated statistical values (mean, standard deviation and lambdas) representative of the entire training data "population". This transformation constrains those values to a narrow range of about 5 units (-2 to 3) - see Model Training (below) for more details.
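Purely as an illustration (the helper name, feature values, and statistics below are made-up placeholders, not the stored production values), the transform step amounts to something like:

```python
# Hypothetical sketch of the JobPredict input transform (statistics are placeholders).
import numpy as np
from scipy.stats import yeojohnson

def transform_features(x, lambdas, means, stds):
    """Yeo-Johnson transform n_files and total_mb (indices 0 and 1), then standardize them."""
    x = np.asarray(x, dtype=float).copy()
    for i in range(2):
        transformed = yeojohnson(np.array([x[i]]), lmbda=lambdas[i])[0]
        x[i] = (transformed - means[i]) / stds[i]
    # The models expect a 2D batch of shape (1, n_features).
    return x.reshape(1, -1)

# Example call with made-up feature values and population statistics:
features = [12, 1543.2, 1, 0, 0, 1, 0, 28]
X = transform_features(features, lambdas=[-1.05, -0.02], means=[0.9, 1.3], stds=[0.96, 1.15])
```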

The resulting 2D array of transformed inputs is then fed into the models, which generate predictions for the minimum memory size and wallclock (execution) time requirements. Predicted outputs are formatted into JSON and returned to the JobSubmit lambda, which uses them to acquire the compute resources necessary for completing calibration processing of that particular ipppssoot's data.
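A minimal sketch of that predict-and-return step, assuming hypothetical model file names and response keys (the real lambda's layout may differ):

```python
# Hypothetical sketch of the JobPredict prediction step (paths and key names are placeholders).
import numpy as np
from tensorflow import keras

def predict_resources(X):
    """X is the (1, n_features) array of transformed inputs."""
    clf = keras.models.load_model("models/memory_classifier.h5")    # memory bin classifier
    reg = keras.models.load_model("models/wallclock_regressor.h5")  # wallclock regressor

    # Conv1D models expect a channels axis: (batch, n_features, 1).
    X = X.reshape(X.shape[0], X.shape[1], 1)

    bin_probs = clf.predict(X)[0]
    memory_bin = int(np.argmax(bin_probs))
    wallclock = max(60, int(reg.predict(X)[0][0]))  # AWS Batch enforces a 60-second minimum

    # Returned to JobSubmit as JSON-serializable output.
    return {"memory_bin": memory_bin, "wallclock_seconds": wallclock}
```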


Model Ingest
------------

When a job finishes successfully, its status message (in s3) changes to `processed-$ipppssoot.trigger`, and the `model-ingest` lambda is automatically triggered. Similar to the JobPredict lambda, the job's inputs/features are scraped from the control file in s3, along with the actual measured values for memory usage and wallclock time as recorded in the s3 output log files (`process_metrics.txt`, `preview_metrics.txt`). The latter serve as ground-truth targets for training the models. The features and targets are combined into a Python dictionary, which is then formatted into a DynamoDB-compatible JSON object and ingested into the `calcloud-model` DynamoDB table for inclusion in the next model training iteration.
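A minimal sketch of that ingest step, assuming a boto3 Table resource and placeholder attribute names (the real item schema may differ):

```python
# Hypothetical sketch of the model-ingest write (attribute and key names are placeholders).
from decimal import Decimal
import boto3

def ingest_job(ipppssoot, features, targets, table_name="calcloud-model"):
    """Combine scraped features and measured targets into one DynamoDB item."""
    table = boto3.resource("dynamodb").Table(table_name)
    item = {"ipst": ipppssoot}  # "ipst" as the partition key is an assumption
    # The DynamoDB resource layer rejects floats, so numbers are stored as Decimal.
    for key, value in {**features, **targets}.items():
        item[key] = Decimal(str(value)) if isinstance(value, (int, float)) else value
    table.put_item(Item=item)

ingest_job(
    "idio03010",  # example ipppssoot
    features={"n_files": 12, "total_mb": 1543.2},
    targets={"memory_gb": 5.1, "wallclock_s": 842},
)
```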


Model Training
--------------

Describe training, scraping, architecture
Keeping the models performant requires periodic retraining with the latest available data. Unless revisions are otherwise deemed necessary, the overall architecture and tuned hyperparameters of each network are rebuilt from scratch using the Keras functional API, then trained and validated on all available data. Model training iterations are submitted manually via AWS Batch, which fires up a Docker container from the `training` image stored in the CALDP Elastic Container Registry (ECR) and runs through the entire training process as a standalone job, separate from the main calcloud processing runs (a sketch of the core steps appears after the list below):

Dashboard?
----------
1. Download training data from DynamoDB table
2. Preprocess (calculate statistics and re-run the PowerTransform on `n_files` and `total_mb`)
3. Build and compile models using Keras Functional API
4. Split data into train and test (validation) sets
5. Run batch training for each model
6. Calculate metrics and scores for evaluation
7. Save and upload models, training results, and training data CSV backup file to s3
8. (optional) Run KFOLD cross-validation (10 splits)
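Steps 4-7 reduce to a fairly standard Keras workflow; below is a minimal sketch under assumed names, epoch counts, and bucket/key layout (none of which are taken from the production training code):

```python
# Hypothetical sketch of steps 4-7 of a training iteration (names and paths are placeholders).
import boto3
from sklearn.model_selection import train_test_split

def train_and_upload(model, X, y, name, bucket="calcloud-modeling"):
    # 4. Split data into train and test (validation) sets.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # 5. Run batch training for the model.
    model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=60, batch_size=32)
    # 6. Calculate evaluation scores on the held-out set.
    scores = model.evaluate(X_test, y_test, verbose=0)
    # 7. Save the trained model locally and upload the artifact to s3.
    local_path = f"{name}.h5"
    model.save(local_path)
    boto3.client("s3").upload_file(local_path, bucket, f"models/{name}.h5")
    return scores
```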


Calcloud ML Dashboard
---------------------

Analyze model performance, compare training iterations and explore statistical attributes of the continually evolving dataset with an interactive dashboard built specifically for Calcloud's prediction and classification models. The dashboard is maintained in a separate repository which can be found here: [CALCLOUD-ML-DASHBOARD](https://github.com/alphasentaurii/calcloud-ml-dashboard.git).


Migrating Data Across Environments
----------------------------------

In some cases, there may be a need to migrate existing data from the DynamoDB table of one environment into that of another (e.g. DDB-Test to DDB-Ops, DDB-Ops to DDB-Sandbox, etc.). Included in this repo are two helper scripts (located in the calcloud/scripts folder) to simplify this process:

- `dynamo_scrape.py` downloads data from the source DDB table and saves it to a local .csv file
- `dynamo_import.py` ingests data from a local .csv file into the destination DDB table

```bash
$ cd calcloud/scripts
$ python dynamo_scrape.py -t $SOURCE_TABLE_NAME -k latest.csv
$ python dynamo_import.py -t $DESTINATION_TABLE_NAME -k latest.csv
```
168 changes: 168 additions & 0 deletions ami_rotation/ami_rotation_userdata.sh
@@ -0,0 +1,168 @@
Content-Type: multipart/mixed; boundary="==BOUNDARY=="
MIME-Version: 1.0

--==BOUNDARY==
MIME-Version: 1.0
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash -ex
exec &> >(while read line; do echo "$(date +'%Y-%m-%dT%H.%M.%S%z') $line" >> /var/log/user-data.log; done;)
# ensures instance will shutdown even if we don't reach the end
shutdown -h +20
log_stream="`date +'%Y-%m-%dT%H.%M.%S%z'`"
sleep 5

cat << EOF > /home/ec2-user/log_listener.py
import boto3
import time
import sys
from datetime import datetime
client = boto3.client('logs')
log_group = sys.argv[1]
log_stream = sys.argv[2]
pushed_lines = []
while True:
    response = client.describe_log_streams(
        logGroupName=log_group,
        logStreamNamePrefix=log_stream
    )
    try:
        nextToken = response['logStreams'][0]['uploadSequenceToken']
    except KeyError:
        nextToken = None
    with open("/var/log/user-data.log", 'r') as f:
        lines = f.readlines()
    new_lines = []
    for line in lines:
        if line in pushed_lines:
            continue
        timestamp = line.split(" ")[0].strip()
        try:
            dt = datetime.strptime(timestamp, "%Y-%m-%dT%H.%M.%S%z")
            dt_ts = int(dt.timestamp())*1000 #milliseconds
            if nextToken is None:
                response = client.put_log_events(
                    logGroupName = log_group,
                    logStreamName = log_stream,
                    logEvents = [
                        {
                            'timestamp': dt_ts,
                            'message': line
                        }
                    ]
                )
                nextToken = response['nextSequenceToken']
            else:
                response = client.put_log_events(
                    logGroupName = log_group,
                    logStreamName = log_stream,
                    logEvents = [
                        {
                            'timestamp': dt_ts,
                            'message': line
                        }
                    ],
                    sequenceToken=nextToken
                )
                nextToken = response['nextSequenceToken']
        except Exception as e:
            # print(e)
            continue
        pushed_lines.append(line)
        time.sleep(0.21) #AWS throttles at 5 calls/second
    time.sleep(2)
EOF

echo BEGIN
pwd
date '+%Y-%m-%d %H:%M:%S'

yum install -y -q gcc libpng-devel libjpeg-devel unzip yum-utils
yum update -y -q && yum upgrade -q
cd /home/ec2-user
curl -s "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip -qq awscliv2.zip
./aws/install --update
curl -s "https://s3.amazonaws.com/session-manager-downloads/plugin/latest/ubuntu_64bit/session-manager-plugin.deb" -o "session-manager-plugin.deb"
mkdir /home/ec2-user/.aws
yum-config-manager --add-repo https://rpm.releases.hashicorp.com/AmazonLinux/hashicorp.repo
yum install terraform-0.15.4-1 -y -q
yum install git -y -q
chown -R ec2-user:ec2-user /home/ec2-user/

echo "export REQUESTS_CA_BUNDLE=/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem" >> /home/ec2-user/.bashrc
echo "export CURL_CA_BUNDLE=/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem" >> /home/ec2-user/.bashrc
mkdir -p /usr/lib/ssl
mkdir -p /etc/ssl/certs
mkdir -p /etc/pki/ca-trust/extracted/pem
ln -s /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem /etc/ssl/certs/ca-certificates.crt
ln -s /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem /usr/lib/ssl/cert.pem

yum install python3 -y -q

sudo -i -u ec2-user bash << EOF
mkdir ~/bin ~/tmp
cd ~/tmp
curl -s -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.34.0/install.sh | bash
bash ~/.nvm/nvm.sh
source ~/.bashrc
nvm install node
npm config set registry http://registry.npmjs.org/
npm config set cafile /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
npm install -g [email protected]
python3 -m pip install -q --upgrade pip && python3 -m pip install boto3 -q
cd ~
rm -rf ~/tmp
EOF

chown -R ec2-user:ec2-user /home/ec2-user/

echo "export ADMIN_ARN=${admin_arn}" >> /home/ec2-user/.bashrc
echo "export AWS_DEFAULT_REGION=us-east-1" >> /home/ec2-user/.bashrc
echo "export aws_env=${environment}" >> /home/ec2-user/.bashrc

# get cloudwatch logging going
sudo -i -u ec2-user bash << EOF
cd /home/ec2-user
source .bashrc
aws logs create-log-stream --log-group-name "${log_group}" --log-stream-name $log_stream
python3 /home/ec2-user/log_listener.py "${log_group}" $log_stream &
EOF

# calcloud checkout, need right tag
cd /home/ec2-user
mkdir ami_rotate && cd ami_rotate
git clone https://github.com/spacetelescope/calcloud.git
cd calcloud
git remote set-url origin DISABLED --push
git fetch
git fetch --all --tags && git checkout tags/v${calcloud_ver} && cd ..
git_exit_status=$?
if [[ $git_exit_status -ne 0 ]]; then
    # try without the v
    cd calcloud && git fetch --all --tags && git checkout tags/${calcloud_ver} && cd ..
    git_exit_status=$?
fi
if [[ $git_exit_status -ne 0 ]]; then
    echo "could not checkout ${calcloud_ver}; exiting"
    exit 1
fi

sudo -i -u ec2-user bash << EOF
cd /home/ec2-user
source .bashrc
cd ami_rotate/calcloud/terraform
./deploy_ami_rotate.sh
EOF

sleep 120 #let logs catch up

shutdown -h now

--==BOUNDARY==--
23 changes: 23 additions & 0 deletions ami_rotation/parse_image_json.py
@@ -0,0 +1,23 @@
import sys
import json
from collections import OrderedDict
from datetime import datetime

response = json.loads(str(sys.argv[1]))
images = response["Images"]
image_name_filter = sys.argv[2]

stsciLinux2Ami = {}
for image in images:
    creationDate = image["CreationDate"]
    imageId = image["ImageId"]
    name = image["Name"]
    # Only look at particular AMIs
    if name.startswith(image_name_filter):
        stsciLinux2Ami.update({creationDate: imageId})
# Order the list most recent date first
orderedAmi = OrderedDict(
    sorted(stsciLinux2Ami.items(), key=lambda x: datetime.strptime(x[0], "%Y-%m-%dT%H:%M:%S.%f%z"), reverse=True)
)
# Print first element in the ordered dict
print(list(orderedAmi.values())[0])