
Commit

Release/v0.4.30 (#169)
bhayden53 authored Nov 22, 2021
1 parent 297b17e commit f582c9a
Showing 70 changed files with 1,290 additions and 300 deletions.
8 changes: 8 additions & 0 deletions .github/actions/release/action.yml
@@ -0,0 +1,8 @@
name: 'Tag or Release'
description: 'tag or release from a given branch'

runs:
  using: "composite"
  steps:
    - run: ${{ github.action_path }}/tag_or_release.sh
      shell: bash
16 changes: 16 additions & 0 deletions .github/actions/release/tag_or_release.sh
@@ -0,0 +1,16 @@
git config --local user.email "41898282+github-actions[bot]@users.noreply.github.com"
git config --local user.name "github-actions[bot]"

if [[ "$tag_or_release" == "tag" ]]; then
git checkout ${source_branch}
git tag -f "${name}" -m "tagged ${source_branch} to ${name} via manual github action"
# will fail if tag already exists; intentional
git push origin ${name}

elif [[ "$tag_or_release" == "release" ]]; then
echo ${token} | gh auth login --with-token
gh release create ${name} -F changelog.md --target ${source_branch} --title ${name}

else
echo "bad input"
fi
12 changes: 6 additions & 6 deletions .github/workflows/checks.yml
@@ -7,10 +7,10 @@ jobs:
name: run flake8
runs-on: ubuntu-18.04
steps:
- name: set up python 3.8
- name: set up python 3.7
uses: actions/setup-python@v2
with:
python-version: 3.8
python-version: 3.7

- name: checkout code
uses: actions/checkout@v2
@@ -25,10 +25,10 @@ jobs:
name: run black
runs-on: ubuntu-18.04
steps:
- name: set up python 3.8
- name: set up python 3.7
uses: actions/setup-python@v2
with:
python-version: 3.8
python-version: 3.7

- name: checkout code
uses: actions/checkout@v2
@@ -43,10 +43,10 @@ jobs:
name: run bandit
runs-on: ubuntu-18.04
steps:
- name: set up python 3.8
- name: set up python 3.7
uses: actions/setup-python@v2
with:
python-version: 3.8
python-version: 3.7

- name: checkout code
uses: actions/checkout@v2
33 changes: 33 additions & 0 deletions .github/workflows/release.yml
@@ -0,0 +1,33 @@
name: tag-or-release
on:
  workflow_dispatch:
    inputs:
      tag_or_release:
        description: 'must be string of either "tag" or "release"'
        required: true
        default: 'tag'
      name:
        description: 'the tag or release name, i.e. v1.0.0'
        required: true
      source_branch:
        description: 'the branch to tag or release'
        required: true
        default: "main"

jobs:
  tag_or_release:
    runs-on: ubuntu-latest
    name: tag or release the given branch with the given name
    steps:
      - name: checkout
        uses: actions/checkout@v2
        with:
          fetch-depth: 0
      - name: release
        id: release
        uses: ./.github/actions/release
        env:
          tag_or_release: "${{ github.event.inputs.tag_or_release }}"
          name: "${{ github.event.inputs.name }}"
          source_branch: "${{ github.event.inputs.source_branch }}"
          token: "${{ secrets.GITHUB_TOKEN }}"
73 changes: 65 additions & 8 deletions README.md
@@ -308,7 +308,12 @@ These are distinguished as either "retryable" or "not retryable".

3. Job Cancelled - An operator cancelled the job so it should not be retried.

4. Other errors - No automatic rescue
4. Job timeout - The job exceeded its maximum permitted/modeled compute time. This can be manually
   rescued by putting the string 'timeout_scale: 1.25' into a rescue message,
   which will trigger reprocessing the job with 25% extra time. Any positive floating
   point scale should work, including fractional scales < 1.0.

5. Other errors - No automatic rescue


Blackboard
@@ -322,23 +327,75 @@ In our current implementation, the blackboard lambda runs on a 7 minute schedule
To speed up this sync, we could capture Batch state change events in CloudWatch and send them to lambda for ingestion into a dynamodb, which could then be either hooked up directly to the GUIs, or replicated in a more efficient fashion to the on-prem databases.
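As an illustration only of what that improvement could look like (the rule name and lambda ARN below are hypothetical, not existing resources), an EventBridge/CloudWatch Events rule that forwards Batch job state changes to a lambda could be wired up roughly like this:

```python
# Hypothetical sketch: route AWS Batch state-change events to an ingest lambda.
import json
import boto3

events = boto3.client("events")

# Match every Batch job state transition in the account.
events.put_rule(
    Name="calcloud-batch-state-change",  # hypothetical rule name
    EventPattern=json.dumps({
        "source": ["aws.batch"],
        "detail-type": ["Batch Job State Change"],
    }),
    State="ENABLED",
)

# Point the rule at a (hypothetical) lambda that writes each event to DynamoDB.
events.put_targets(
    Rule="calcloud-batch-state-change",
    Targets=[{
        "Id": "blackboard-ingest",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:blackboard-ingest",
    }],
)
```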


Job Memory Model
Job Memory Models
================

Overview
--------

... describe key aspects of the ML here such as features and network layout ...
Pre-trained artificial neural networks are implemented in the pipeline to predict job resource requirements for HST data. All three network architectures are built using the Keras functional API from the TensorFlow library.

1. Memory Classifier
A 1D Convolutional Neural Network (CNN) performs multi-class classification on 8 features to predict which of 4 possible "memory bins" is the most appropriate for a given dataset. An estimated probability score is assigned to each of the four possible target classes, i.e. memory bins, represented by an integer from 0 to 3. An illustrative sketch of the network layout appears after this list. The memory size thresholds are categorized as follows:

- `0: < 2GB`
- `1: <= 8GB`
- `2: <= 16GB`
- `3: < 64GB`

2. Memory Regressor
A 1D-CNN performs regression to estimate how much memory (in gigabytes) a given dataset will require for processing. This prediction is not used directly by the pipeline because AWS compute doesn't require an exact number (hence the bin classification). We retain this model for the purpose of additional analysis of the datasets and their evolving characteristics.

3. Wallclock Regressor
A 1D-CNN performs regression to estimate the job's execution time in wallclock seconds. AWS Batch requires a minimum threshold of 60 seconds to be set on each job, although many jobs take less than one minute to complete. The predicted value from this model is used by JobSubmit to set a maximum execution time within which the job must complete; once that limit is reached, the job is killed regardless of whether or not it has finished.
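The exact layer configurations and tuned hyperparameters live in the training code; purely as an illustration of the Keras functional API pattern described above, a 1D-CNN classifier and regressor for 8 input features might be sketched as follows (layer sizes, kernel widths, and optimizer settings here are assumptions, not the production values):

```python
# Hypothetical sketch only -- not the production CALCLOUD architectures.
from tensorflow import keras
from tensorflow.keras import layers

def build_memory_classifier(n_features=8, n_bins=4):
    # Features arrive as a flat vector; add a channels axis for Conv1D.
    inputs = keras.Input(shape=(n_features, 1), name="features")
    x = layers.Conv1D(32, kernel_size=3, activation="relu")(inputs)
    x = layers.Conv1D(64, kernel_size=3, activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)
    # Softmax probabilities over the 4 memory bins (0-3).
    outputs = layers.Dense(n_bins, activation="softmax", name="memory_bin")(x)
    model = keras.Model(inputs, outputs, name="memory_classifier")
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

def build_wallclock_regressor(n_features=8):
    # Same convolutional front end, single linear output for wallclock seconds.
    inputs = keras.Input(shape=(n_features, 1), name="features")
    x = layers.Conv1D(32, kernel_size=3, activation="relu")(inputs)
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(1, activation="linear", name="wallclock_seconds")(x)
    model = keras.Model(inputs, outputs, name="wallclock_regressor")
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model
```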

Job Submission
JobPredict
--------------

Describe model execution, lambda, image, etc.
The JobPredict lambda is invoked by JobSubmit to determine resource allocation needs pertaining to memory and execution time. Upon invocation, a container is created on the fly from a Docker image stored in the caldp ECR. The container then loads the pre-trained models along with their learned parameters (e.g. weights) from saved Keras files.

The models' inputs are scraped from a text file in the calcloud-processing s3 bucket (`control/ipppssoot/MemoryModelFeatures.txt`) and converted into a numpy array. An additional preprocessing step applies a Yeo-Johnson power transform to the first two indices of the array (`n_files`, `total_mb`) using pre-calculated statistical values (mean, standard deviation and lambdas) representative of the entire training data "population". This transformation constrains those values to a narrow range of about 5 units (-2 to 3) - see Model Training (below) for more details.
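Purely as an illustration (the helper name, feature values, and statistics below are made-up placeholders, not the stored production values), the transform step amounts to something like:

```python
# Hypothetical sketch of the JobPredict input transform (statistics are placeholders).
import numpy as np
from scipy.stats import yeojohnson

def transform_features(x, lambdas, means, stds):
    """Yeo-Johnson transform n_files and total_mb (indices 0 and 1), then standardize them."""
    x = np.asarray(x, dtype=float).copy()
    for i in range(2):
        transformed = yeojohnson(np.array([x[i]]), lmbda=lambdas[i])[0]
        x[i] = (transformed - means[i]) / stds[i]
    # The models expect a 2D batch of shape (1, n_features).
    return x.reshape(1, -1)

# Example call with made-up feature values and population statistics:
features = [12, 1543.2, 1, 0, 0, 1, 0, 28]
X = transform_features(features, lambdas=[-1.05, -0.02], means=[0.9, 1.3], stds=[0.96, 1.15])
```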

The resulting 2D array of transformed inputs is then fed into the models, which generate predictions for the minimum memory size and wallclock (execution) time requirements. Predicted outputs are formatted into JSON and returned to the JobSubmit lambda, which uses them to acquire the compute resources necessary for completing calibration processing of that particular ipppssoot's data.
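A minimal sketch of that predict-and-return step, assuming hypothetical model file names and response keys (the real lambda's layout may differ):

```python
# Hypothetical sketch of the JobPredict prediction step (paths and key names are placeholders).
import numpy as np
from tensorflow import keras

def predict_resources(X):
    """X is the (1, n_features) array of transformed inputs."""
    clf = keras.models.load_model("models/memory_classifier.h5")    # memory bin classifier
    reg = keras.models.load_model("models/wallclock_regressor.h5")  # wallclock regressor

    # Conv1D models expect a channels axis: (batch, n_features, 1).
    X = X.reshape(X.shape[0], X.shape[1], 1)

    bin_probs = clf.predict(X)[0]
    memory_bin = int(np.argmax(bin_probs))
    wallclock = max(60, int(reg.predict(X)[0][0]))  # AWS Batch enforces a 60-second minimum

    # Returned to JobSubmit as JSON-serializable output.
    return {"memory_bin": memory_bin, "wallclock_seconds": wallclock}
```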


Model Ingest
------------

When a job finishes successfully, its status message (in s3) changes to `processed-$ipppssoot.trigger`, and the `model-ingest` lambda is automatically triggered. Similar to the JobPredict lambda, the job's inputs/features are scraped from the control file in s3, along with the actual measured values for memory usage and wallclock time as recorded in the s3 output log files (`process_metrics.txt`, `preview_metrics.txt`). The latter serve as ground-truth targets for training the models. The features and targets are combined into a Python dictionary, which is then formatted into a DynamoDB-compatible JSON object and ingested into the `calcloud-model` DynamoDB table for inclusion in the next model training iteration.
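A minimal sketch of that ingest step, assuming a boto3 Table resource and placeholder attribute names (the real item schema may differ):

```python
# Hypothetical sketch of the model-ingest write (attribute and key names are placeholders).
from decimal import Decimal
import boto3

def ingest_job(ipppssoot, features, targets, table_name="calcloud-model"):
    """Combine scraped features and measured targets into one DynamoDB item."""
    table = boto3.resource("dynamodb").Table(table_name)
    item = {"ipst": ipppssoot}  # "ipst" as the partition key is an assumption
    # The DynamoDB resource layer rejects floats, so numbers are stored as Decimal.
    for key, value in {**features, **targets}.items():
        item[key] = Decimal(str(value)) if isinstance(value, (int, float)) else value
    table.put_item(Item=item)

ingest_job(
    "idio03010",  # example ipppssoot
    features={"n_files": 12, "total_mb": 1543.2},
    targets={"memory_gb": 5.1, "wallclock_s": 842},
)
```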


Model Training
--------------

Describe training, scraping, architecture
Keeping the models performant requires periodic retraining with the latest available data. Unless revisions are otherwise deemed necessary, the overall architecture and tuned hyperparameters of each network are rebuilt from scratch using the Keras functional API, then trained and validated on all available data. Model training iterations are submitted manually via AWS Batch, which fires up a Docker container from the `training` image stored in the CALDP Elastic Container Registry (ECR) and runs through the entire training process as a standalone job, separate from the main calcloud processing runs (a sketch of the core steps appears after the list below):

Dashboard?
----------
1. Download training data from DynamoDB table
2. Preprocess (calculate statistics and re-run the PowerTransform on `n_files` and `total_mb`)
3. Build and compile models using Keras Functional API
4. Split data into train and test (validation) sets
5. Run batch training for each model
6. Calculate metrics and scores for evaluation
7. Save and upload models, training results, and training data CSV backup file to s3
8. (optional) Run KFOLD cross-validation (10 splits)
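Steps 4-7 reduce to a fairly standard Keras workflow; below is a minimal sketch under assumed names, epoch counts, and bucket/key layout (none of which are taken from the production training code):

```python
# Hypothetical sketch of steps 4-7 of a training iteration (names and paths are placeholders).
import boto3
from sklearn.model_selection import train_test_split

def train_and_upload(model, X, y, name, bucket="calcloud-modeling"):
    # 4. Split data into train and test (validation) sets.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # 5. Run batch training for the model.
    model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=60, batch_size=32)
    # 6. Calculate evaluation scores on the held-out set.
    scores = model.evaluate(X_test, y_test, verbose=0)
    # 7. Save the trained model locally and upload the artifact to s3.
    local_path = f"{name}.h5"
    model.save(local_path)
    boto3.client("s3").upload_file(local_path, bucket, f"models/{name}.h5")
    return scores
```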


Calcloud ML Dashboard
---------------------

Analyze model performance, compare training iterations and explore statistical attributes of the continually evolving dataset with an interactive dashboard built specifically for Calcloud's prediction and classification models. The dashboard is maintained in a separate repository which can be found here: [CALCLOUD-ML-DASHBOARD](https://github.com/alphasentaurii/calcloud-ml-dashboard.git).


Migrating Data Across Environments
----------------------------------

In some cases, there may be a need to migrate existing data from the DynamoDB table of one environment into that of another (e.g. DDB-Test to DDB-Ops, DDB-Ops to DDB-Sandbox, etc.). Included in this repo are two helper scripts (located in the calcloud/scripts folder) to simplify this process:

- `dynamo_scrape.py` downloads data from the source DDB table and saves it to a local .csv file
- `dynamo_import.py` ingests data from a local .csv file into the destination DDB table

```bash
$ cd calcloud/scripts
$ python dynamo_scrape.py -t $SOURCE_TABLE_NAME -k latest.csv
$ python dynamo_import.py -t $DESTINATION_TABLE_NAME -k latest.csv
```
168 changes: 168 additions & 0 deletions ami_rotation/ami_rotation_userdata.sh
@@ -0,0 +1,168 @@
Content-Type: multipart/mixed; boundary="==BOUNDARY=="
MIME-Version: 1.0

--==BOUNDARY==
MIME-Version: 1.0
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash -ex
exec &> >(while read line; do echo "$(date +'%Y-%m-%dT%H.%M.%S%z') $line" >> /var/log/user-data.log; done;)
# ensures instance will shutdown even if we don't reach the end
shutdown -h +20
log_stream="`date +'%Y-%m-%dT%H.%M.%S%z'`"
sleep 5

cat << EOF > /home/ec2-user/log_listener.py
import boto3
import time
import sys
from datetime import datetime
client = boto3.client('logs')
log_group = sys.argv[1]
log_stream = sys.argv[2]
pushed_lines = []
while True:
    response = client.describe_log_streams(
        logGroupName=log_group,
        logStreamNamePrefix=log_stream
    )
    try:
        nextToken = response['logStreams'][0]['uploadSequenceToken']
    except KeyError:
        nextToken = None
    with open("/var/log/user-data.log", 'r') as f:
        lines = f.readlines()
    new_lines = []
    for line in lines:
        if line in pushed_lines:
            continue
        timestamp = line.split(" ")[0].strip()
        try:
            dt = datetime.strptime(timestamp, "%Y-%m-%dT%H.%M.%S%z")
            dt_ts = int(dt.timestamp())*1000 #milliseconds
            if nextToken is None:
                response = client.put_log_events(
                    logGroupName = log_group,
                    logStreamName = log_stream,
                    logEvents = [
                        {
                            'timestamp': dt_ts,
                            'message': line
                        }
                    ]
                )
                nextToken = response['nextSequenceToken']
            else:
                response = client.put_log_events(
                    logGroupName = log_group,
                    logStreamName = log_stream,
                    logEvents = [
                        {
                            'timestamp': dt_ts,
                            'message': line
                        }
                    ],
                    sequenceToken=nextToken
                )
                nextToken = response['nextSequenceToken']
        except Exception as e:
            # print(e)
            continue
        pushed_lines.append(line)
        time.sleep(0.21) #AWS throttles at 5 calls/second
    time.sleep(2)
EOF

echo BEGIN
pwd
date '+%Y-%m-%d %H:%M:%S'

yum install -y -q gcc libpng-devel libjpeg-devel unzip yum-utils
yum update -y -q && yum upgrade -q
cd /home/ec2-user
curl -s "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip -qq awscliv2.zip
./aws/install --update
curl -s "https://s3.amazonaws.com/session-manager-downloads/plugin/latest/ubuntu_64bit/session-manager-plugin.deb" -o "session-manager-plugin.deb"
mkdir /home/ec2-user/.aws
yum-config-manager --add-repo https://rpm.releases.hashicorp.com/AmazonLinux/hashicorp.repo
yum install terraform-0.15.4-1 -y -q
yum install git -y -q
chown -R ec2-user:ec2-user /home/ec2-user/

echo "export REQUESTS_CA_BUNDLE=/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem" >> /home/ec2-user/.bashrc
echo "export CURL_CA_BUNDLE=/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem" >> /home/ec2-user/.bashrc
mkdir -p /usr/lib/ssl
mkdir -p /etc/ssl/certs
mkdir -p /etc/pki/ca-trust/extracted/pem
ln -s /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem /etc/ssl/certs/ca-certificates.crt
ln -s /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem /usr/lib/ssl/cert.pem

yum install python3 -y -q

sudo -i -u ec2-user bash << EOF
mkdir ~/bin ~/tmp
cd ~/tmp
curl -s -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.34.0/install.sh | bash
bash ~/.nvm/nvm.sh
source ~/.bashrc
nvm install node
npm config set registry http://registry.npmjs.org/
npm config set cafile /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
npm install -g [email protected]
python3 -m pip install -q --upgrade pip && python3 -m pip install boto3 -q
cd ~
rm -rf ~/tmp
EOF

chown -R ec2-user:ec2-user /home/ec2-user/

echo "export ADMIN_ARN=${admin_arn}" >> /home/ec2-user/.bashrc
echo "export AWS_DEFAULT_REGION=us-east-1" >> /home/ec2-user/.bashrc
echo "export aws_env=${environment}" >> /home/ec2-user/.bashrc

# get cloudwatch logging going
sudo -i -u ec2-user bash << EOF
cd /home/ec2-user
source .bashrc
aws logs create-log-stream --log-group-name "${log_group}" --log-stream-name $log_stream
python3 /home/ec2-user/log_listener.py "${log_group}" $log_stream &
EOF

# calcloud checkout, need right tag
cd /home/ec2-user
mkdir ami_rotate && cd ami_rotate
git clone https://github.com/spacetelescope/calcloud.git
cd calcloud
git remote set-url origin DISABLED --push
git fetch
git fetch --all --tags && git checkout tags/v${calcloud_ver} && cd ..
git_exit_status=$?
if [[ $git_exit_status -ne 0 ]]; then
    # try without the v
    cd calcloud && git fetch --all --tags && git checkout tags/${calcloud_ver} && cd ..
    git_exit_status=$?
fi
if [[ $git_exit_status -ne 0 ]]; then
    echo "could not checkout ${calcloud_ver}; exiting"
    exit 1
fi

sudo -i -u ec2-user bash << EOF
cd /home/ec2-user
source .bashrc
cd ami_rotate/calcloud/terraform
./deploy_ami_rotate.sh
EOF

sleep 120 #let logs catch up

shutdown -h now

--==BOUNDARY==--
23 changes: 23 additions & 0 deletions ami_rotation/parse_image_json.py
@@ -0,0 +1,23 @@
import sys
import json
from collections import OrderedDict
from datetime import datetime

response = json.loads(str(sys.argv[1]))
images = response["Images"]
image_name_filter = sys.argv[2]

stsciLinux2Ami = {}
for image in images:
    creationDate = image["CreationDate"]
    imageId = image["ImageId"]
    name = image["Name"]
    # Only look at particular AMIs
    if name.startswith(image_name_filter):
        stsciLinux2Ami.update({creationDate: imageId})
# Order the list most recent date first
orderedAmi = OrderedDict(
    sorted(stsciLinux2Ami.items(), key=lambda x: datetime.strptime(x[0], "%Y-%m-%dT%H:%M:%S.%f%z"), reverse=True)
)
# Print first element in the ordered dict
print(list(orderedAmi.values())[0])