Merge pull request #121 from jgaglione/master
updated jgaglion postdoc page
rct225 authored Oct 1, 2024
2 parents 5221ca8 + f27d8d6 commit 81cfe51
Showing 1 changed file with 18 additions and 2 deletions.
20 changes: 18 additions & 2 deletions pages/postdocs/jgaglione.md
@@ -7,8 +7,8 @@ postdoc-name: Jethro Gaglione
title: Post-doctoral researcher
active: True
dates:
-  start: 2023-10-01
-  end: 2024-09-30
+  start: 2024-01-01
+  end: 2024-12-31
photo: /assets/images/team/Jethro-Gaglione.jpg
institution: Vanderbilt University
e-mail: [email protected]
@@ -26,7 +26,23 @@ presentations:
meeting: <Production Group Meeting>
meetingurl: <https://indico.cern.ch/event/1400420/>

- title: "Machine Learning Training Facility at Vanderbilt - A Prototype for Efficient and Reproducible ML Training"
date: "July 19, 2024"
url: <https://indico.cern.ch/event/1438068/>
meeting: <Fast ML Co-processor Meeting>
meetingurl: <https://indico.cern.ch/event/1438068/>

current_status: >
<br>
<b>2024 Q3</b>
<br>
This quarter, we made significant progress integrating b-hive, the btag POG ML training framework, into an MLflow project that can be submitted to the Machine Learning Training Facility (MLTF). This work is very close to being merged, at which point it will be the first production CMS ML workflow integrated with the MLTF.
Work on hardware capabilities continues to be delayed by issues with the firmware provided by the manufacturer. Engineers were unable to diagnose the issue remotely, so Vanderbilt shipped the hardware back for hands-on inspection. The inspection was successful: the engineers found a subtle bug at the PCI-E layer and flashed updated firmware to fix it. As of this writing, the hardware is being shipped back to Vanderbilt with the manufacturer's assurance that it is fixed. This will, of course, push back hardware-related milestones.

Vanderbilt developers have completed a first draft of the MLflow “gateway” server, which provides REST-based job submission infrastructure (similar to CMS’ CRAB functionality). This will allow automated submission of training tasks (e.g. for CI/CD) via REST, as well as CLI-based job submission using an MLflow plugin that users can install into their environments. The functionality is currently basic, with the API stubbed out, but token-based authentication is enabled so the service can be accessed securely. The next step is to implement the missing functions so that the service can be opened to alpha users in Q4.
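
As a rough illustration of what token-authenticated REST submission could look like from a CI/CD script, here is a minimal sketch using only the Python standard library. The gateway API is not documented here, so the `/jobs` path, the payload fields (`project_uri`, `parameters`), and the example URLs are all hypothetical placeholders, not the actual MLTF interface. The sketch only builds the request and does not send it.

```python
import json
import urllib.request


def build_submission_request(gateway_url, token, project_uri, parameters):
    """Build (but do not send) a POST request for a training job.

    Assumptions: the "/jobs" path and the payload field names are
    hypothetical; the real gateway API may differ entirely.
    """
    payload = {"project_uri": project_uri, "parameters": parameters}
    return urllib.request.Request(
        url=gateway_url.rstrip("/") + "/jobs",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            # Token-based authentication sent as a standard bearer token.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values for illustration only.
request = build_submission_request(
    gateway_url="https://gateway.example.org/api",
    token="EXAMPLE_TOKEN",
    project_uri="https://example.org/my-mlflow-project.git",
    parameters={"epochs": "10"},
)
```

A real client would pass `request` to `urllib.request.urlopen` and inspect the response; that step is left out because the response format is not known.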



<br>
<b>2024 Q2</b>