zoom 20201127
20201127 16:00
Nigel, Stelios, Dave
Nigel has developed a working distributed ML example using a random forest classifier.
- Good astrometric solutions via ML Random Forest classifier.
- Stored in GitHub as notebook 2FRPC4BFS
- Works for a 10% data set, fails with an out-of-disc-space error on the full size dataset.
This is a good example of the kind of thing our users will want to do. It complements the HDBSCAN example because this one is distributed whereas HDBSCAN runs on a single node.
Two ways to take this further:
- Deploy a larger cluster enabling Nigel to work with the full size data set.
- Experiment with scaling cluster size (cpu, memory, disc) to find the minimum resources needed to work with the full dataset, completing the process in roughly 10 minutes.
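One way to record the scaling experiment is sketched below in Python. It is only an illustration: the `ClusterRun` record, its field names and the `smallest_passing_run` helper are hypothetical, and ranking "minimum resources" by total cores, then total memory, then total disc is an assumption, not something agreed in the meeting.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ClusterRun:
    """One timed run of the notebook on a given cluster size (hypothetical record)."""
    workers: int
    cores_per_worker: int
    memory_gb_per_worker: int
    disc_gb_per_worker: int
    # None means the run did not complete (for example, out of disc space).
    runtime_minutes: Optional[float]

    @property
    def total_cores(self) -> int:
        return self.workers * self.cores_per_worker

    @property
    def total_memory_gb(self) -> int:
        return self.workers * self.memory_gb_per_worker

    @property
    def total_disc_gb(self) -> int:
        return self.workers * self.disc_gb_per_worker


def smallest_passing_run(
    runs: Iterable[ClusterRun], target_minutes: float = 10.0
) -> Optional[ClusterRun]:
    """Return the cheapest run that completed within the target time.

    "Cheapest" is ranked by total cores, then memory, then disc; this
    ordering is an assumption chosen for the sketch.
    """
    passing = [
        r for r in runs
        if r.runtime_minutes is not None and r.runtime_minutes <= target_minutes
    ]
    if not passing:
        return None
    return min(
        passing,
        key=lambda r: (r.total_cores, r.total_memory_gb, r.total_disc_gb),
    )
```

Each deployment tried on the cluster would be entered as one `ClusterRun`, with failed runs kept in the list so the record shows which sizes were too small.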
Immediate priority is accessible disc space for Nigel to use to import and process new datasets.
- Blocked by issue #227 "Ceph shares not visible from Openstack "test" project".
- Fixed by PR #228 "20201117 zrq hadoop yarn".
- This is blocking work to prepare for EDR3 which will be available next Thursday.
Current systems rely on manually entered usernames and passwords in the Zeppelin config. Next stage of work is to integrate Zeppelin and Drupal user accounts to provide on-demand account creation with editable properties.
- Integrate Zeppelin and Drupal user accounts
- Integrate Drupal and IRIS IAM OAuth accounts
Targets for public release:
- Suggestion from Nick Walton - run a small invite-only workshop in Q1 2021, working interactively with users to solve issues as they develop their notebooks.
- Suggestion from Nigel - public release at the National Astronomy Meeting in July 2021.
Nigel reported on a meeting with a colleague from ESAC, where a Java library has been developed that can read Gaia GBIN files into Spark, making the bulk Gaia data available to ML algorithms. It was developed for the Gaia validation team on a bare metal Spark deployment.
Tasks for the next week:
- stv - Merge PR #225 and #228 to bring separate copies of Ansible Zeppelin-Hadoop-Yarn deployment together into shared version.
- stv - Delete everything from gaia_dev Openstack project and deploy a large enough system for Nigel to work with, including SSH access to shared directory for importing new data.
- stv - Delete everything from gaia_prod Openstack project and use that to experiment with notebook 2FRPC4BFS to find minimum resources needed to handle full dataset in ~10min.
- nch - Waiting for new cluster on gaia_dev to be able to import additional data sets and prepare for Gaia EDR3.
- zrq - Using gaia_test to experiment with integrating Zeppelin and Drupal user accounts and IRIS IAM OAuth.