Add MLflow log_model option #1544

Merged
merged 51 commits into from
Nov 1, 2024

@nancyhung (Contributor) commented Sep 24, 2024

Context

To support customers with sensitive storage network configurations, we have to use the log_model API. This causes duplicate artifact uploads, which is inefficient, so we will roll it out only to customers who require it.

This PR contains the first of two changes:

  1. When saving the final HF checkpoint, use log_model instead of uploading to MLflow artifacts.
  • Functionally, a user can still find their HF checkpoint files in UC if they want to download the model weights and serve them somewhere else.
  • Instead of calling save_model and register_model and then uploading to UC directly via the remote uploader downloader object, this change simplifies the control logic by using the mlflow.log_model function. This function is also critical for supporting secure training requirements, such as customer firewalls or private endpoints. Logging a model to MLflow performs the steps needed to save and register a model for deployment.
  • This change only affects the logic for saving the final HF checkpoint. All other logic remains the same.
  2. In a follow-up PR, we'll modify the intermediate checkpointing logic to also use log_model, but without registering the model. That way, a user can still manually register their intermediate checkpoints for evaluation.
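
As a rough illustration of the single-call flow described above, here is a minimal sketch (not the code in this PR). It assumes the `mlflow.transformers.log_model` flavor API with its `transformers_model`, `artifact_path`, and `registered_model_name` parameters; parameter names may differ across MLflow versions. The helper names `log_final_checkpoint` and `make_recording_mlflow`, the artifact path, and the model name are hypothetical. The MLflow module is passed in so the sketch can be dry-run with a recording stub instead of a live tracking server.

```python
from types import SimpleNamespace
from typing import Any, Dict, List, Optional


def log_final_checkpoint(
    mlflow_module: Any,
    transformers_model: Any,
    artifact_path: str,
    registered_model_name: Optional[str] = None,
) -> None:
    """Save and (optionally) register a HF checkpoint with one log_model call.

    Hypothetical helper: this replaces a separate save_model, upload, and
    register_model sequence with a single call into MLflow.
    """
    mlflow_module.transformers.log_model(
        transformers_model=transformers_model,
        artifact_path=artifact_path,
        # When a name is given, MLflow also registers the logged model.
        registered_model_name=registered_model_name,
    )


def make_recording_mlflow(calls: List[Dict[str, Any]]) -> Any:
    """Build a stand-in for the mlflow module that only records calls."""

    def log_model(**kwargs: Any) -> None:
        calls.append(kwargs)

    return SimpleNamespace(transformers=SimpleNamespace(log_model=log_model))


# Dry run with the stub; placeholder strings stand in for real model objects.
calls: List[Dict[str, Any]] = []
log_final_checkpoint(
    make_recording_mlflow(calls),
    transformers_model={"model": "<model>", "tokenizer": "<tokenizer>"},
    artifact_path="final_model_save_path",
    registered_model_name="catalog.schema.my_model",
)
```

With a real setup, one would pass the imported `mlflow` module and call the helper inside an active run. Leaving `registered_model_name` as `None` logs the model without registering it, which is the behavior the follow-up PR describes for intermediate checkpoints.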

Testing

When incorporating this in MAPI, we should enable final_register_only so that the model is uploaded only through the log_model logic, rather than also uploading a duplicate copy to MLflow artifacts. All tests were run in AWS staging.
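
One way to picture the effect of final_register_only is the small decision sketch below. It is an assumption about the intended behavior, not the callback's actual code: the function name `plan_checkpoint_actions`, its arguments other than `final_register_only`, and the action labels are all hypothetical.

```python
from typing import List, Optional


def plan_checkpoint_actions(
    is_final_checkpoint: bool,
    final_register_only: bool,
    registered_model_name: Optional[str],
) -> List[str]:
    """Return the (hypothetical) actions taken for one checkpoint save."""
    # Assumption: only the final checkpoint is logged and registered.
    will_log_model = is_final_checkpoint and registered_model_name is not None

    actions: List[str] = []
    # Skip the regular upload only when log_model will already handle the
    # final checkpoint, which avoids the duplicate copy.
    if not (final_register_only and will_log_model):
        actions.append("upload_to_save_folder")
    if will_log_model:
        actions.append("log_model")
    return actions


final_only = plan_checkpoint_actions(True, True, "catalog.schema.my_model")
final_both = plan_checkpoint_actions(True, False, "catalog.schema.my_model")
intermediate = plan_checkpoint_actions(False, True, "catalog.schema.my_model")
```

Under these assumptions, intermediate checkpoints keep the existing upload path, and only the final checkpoint switches to the log_model path when final_register_only is enabled.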

Works for older models
[Databricks staging] Llama3 8b
Run: llama3-log-model-xusOti
Llama3 8b was successfully deployed here: https://e2-dogfood.staging.cloud.databricks.com/ml/endpoints/test-log-model?o=6051921418418893.

Works for newest models with extra security
[MCT] Llama3.2 1b
Run: llama3-log-model-4eJUKo
Experiment: https://dbc-559ffd80-2bfc.cloud.databricks.com/ml/experiments/2854093459220376?viewStateShareKey=55a332dc80d7200b6a6301d8f0163155ce9aac54d21436c9d292f0745e0bff05
Endpoint: https://dbc-559ffd80-2bfc.cloud.databricks.com/ml/endpoints/testfinetuning?o=7395834863327820

[MCT] Llama3.1 405b
Run: 405b-register-1-xB3dOx

Tested that mlflow.log_model registers the model in a private link workspace
Registered model: https://adb-1622130341351604.4.azuredatabricks.net/explore/data/models/rkg-ft/default/llamatest?o=1622130341351604
Example of a model stuck in pending: [image]

Log model also worked: [image]

…cleans up the code a little and prevents us from having forked logic in Composer to fetch by run_id
@nancyhung nancyhung requested a review from a team as a code owner September 24, 2024 01:03
@dakinggg (Collaborator) left a comment

What testing have you done? We need to make sure everything shows up properly end to end.

llmfoundry/callbacks/hf_checkpointer.py (review thread outdated, resolved)
@nancyhung closed this Sep 25, 2024
@nancyhung reopened this Oct 1, 2024
@nancyhung changed the title from "Add MLflow log_model option" to "[WIP] Add MLflow log_model option" Oct 4, 2024
@nancyhung reopened this Nov 1, 2024
@nancyhung requested a review from dakinggg November 1, 2024 19:13
@dakinggg merged commit 2ce6296 into main Nov 1, 2024
9 checks passed
dakinggg added a commit that referenced this pull request Nov 1, 2024
dakinggg added a commit that referenced this pull request Nov 1, 2024
3 participants