Week3 #3
Conversation
Walkthrough
The pull request introduces several updates across various files to enhance the management of machine learning projects.
Actionable comments posted: 9
🧹 Outside diff range and nitpick comments (18)
pyproject.toml (2)
`16-16`: **Consider using version ranges for dependencies.** While exact versions are good for reproducibility, they can be too rigid for maintenance. Consider using version ranges (e.g., `lightgbm>=4.5.0,<5.0.0`) to allow for compatible updates while maintaining stability.
Line range hint `7-21`: **Consider organizing dependencies into logical groups.** The dependencies list mixes core ML libraries, data processing tools, and Databricks-specific packages. Consider organizing them into groups using separate optional dependency sections (e.g., `ml`, `databricks`, `core`) for better maintainability. Example structure:

```toml
[project.optional-dependencies]
ml = [
    "lightgbm==4.5.0",
    "scikit-learn==1.5.2",
    "mlflow==2.17.0"
]
databricks = [
    "databricks-feature-engineering==0.6",
    "databricks-feature-lookup==1.2.0",
    "databricks-sdk==0.32.0"
]
```

README.md (2)
`25-25`: **Consider documenting the purpose of `--all-extras`.** While the commands are correct, it would be helpful to explain what optional dependencies are included with `--all-extras` for better clarity.
`38-38`: **Document the implications of using `--overwrite`.** While adding the `--overwrite` flag simplifies the deployment process, consider adding a note about the following (a verification sketch follows the list):
- Potential data loss when overwriting existing files
- Best practices for package versioning
- How to verify successful deployment
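As a rough illustration of the last point, one way to verify the copy — not part of this PR — assuming the wheel is pushed to the volume path used in the notebooks and using the Databricks SDK that is already a project dependency:

```python
# A minimal sketch, assuming the wheel is copied to the volume below; the path is taken
# from the notebook install cell, and credentials come from the standard Databricks config.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
volume_dir = "/Volumes/main/default/file_exchange/nico/"
for entry in w.dbutils.fs.ls(volume_dir):
    print(entry.path, entry.size)
```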
notebooks/week3/README.md (5)
`1-11`: **Fix grammar in the overview section.** Add a comma after "Last week" for better readability.

```diff
-Last week we demonstrated model training and registering for different use cases.
+Last week, we demonstrated model training and registering for different use cases.
```

🧰 Tools
🪛 LanguageTool
[uncategorized] ~9-~9: Possible missing comma found.
Context: ...red in this lecture. ## Overview Last week we demonstrated model training and regi...(AI_HYDRA_LEO_MISSING_COMMA)
`14-28`: **Specify language for code block.** Add Python language specification to the code block for proper syntax highlighting.

````diff
-```
+```python
 01.feature_serving.py
````

🧰 Tools
🪛 LanguageTool
[grammar] ~27-~27: The verb form ‘shows’ does not seem to match the subject ‘examples’.
Context: ...e lookups. The subsequent code examples shows how to invoke this endpoint and get res...
(SUBJECT_VERB_AGREEMENT_PLURAL)
🪛 Markdownlint
15-15: null
Fenced code blocks should have a language specified
(MD040, fenced-code-language)

`29-41`: **Fix formatting and grammar in Model Serving section.**
1. Add Python language specification to the code block.
2. Add missing article before "entity name".

````diff
-```
+```python
 02.model_serving.py
````

```diff
-It's important to note that entity name we pass is a registered model name
+It's important to note that the entity name we pass is a registered model name
```
🧰 Tools
🪛 LanguageTool
[uncategorized] ~37-~37: Possible missing article found.
Context: ... the model. It's important to note that entity name we pass is a registered model name...(AI_HYDRA_LEO_MISSING_THE)
🪛 Markdownlint
30-30: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
`42-54`: **Fix formatting and typos in Feature Look Up section.**
- Add Python language specification to the code block.
- Use consistent list style (asterisks instead of dashes).
- Fix typo "registred" → "registered".

````diff
-```
+```python
 03.model_serving_feature_lookup.py
````

Convert list items from:

```markdown
- We start with creating...
- This online table is...
```

to:

```markdown
* We start with creating...
* This online table is...
```

Fix typo:

```diff
-the model we registred in the same notebook
+the model we registered in the same notebook
```
🪛 LanguageTool
[typographical] ~50-~50: If you want to indicate numerical ranges or time ranges, consider using an en dash.
Context: ...he table we created last week on week 2 - 05.log_and_register_fe_model.py noteboo...(DASH_RULE)
[typographical] ~52-~52: If you want to indicate numerical ranges or time ranges, consider using an en dash.
Context: ...e registred in the same notebook week 2 - 05.log_and_register_fe_model.p. This is...(DASH_RULE)
🪛 Markdownlint
50-50: Expected: asterisk; Actual: dash
Unordered list style(MD004, ul-style)
51-51: Expected: asterisk; Actual: dash
Unordered list style(MD004, ul-style)
52-52: Expected: asterisk; Actual: dash
Unordered list style(MD004, ul-style)
53-53: Expected: asterisk; Actual: dash
Unordered list style(MD004, ul-style)
43-43: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
`55-70`: **Fix formatting in A/B Testing section.**
- Add Python language specification to the code block.
- Use consistent list style (asterisks instead of dashes).

````diff
-```
+```python
 04.AB_test_model_serving.py
````

Convert list items from:

```markdown
- We start with loading...
- We use the same approach...
```

to:

```markdown
* We start with loading...
* We use the same approach...
```

🧰 Tools
🪛 LanguageTool
[typographical] ~64-~64: If you want to indicate numerical ranges or time ranges, consider using an en dash.
Context: ...e the same approach as we did in week 2 - 03.log_and_register_model.py. - We trai...(DASH_RULE)
🪛 Markdownlint
63-63: Expected: asterisk; Actual: dash
Unordered list style(MD004, ul-style)
64-64: Expected: asterisk; Actual: dash
Unordered list style(MD004, ul-style)
65-65: Expected: asterisk; Actual: dash
Unordered list style(MD004, ul-style)
66-66: Expected: asterisk; Actual: dash
Unordered list style(MD004, ul-style)
67-67: Expected: asterisk; Actual: dash
Unordered list style(MD004, ul-style)
68-68: Expected: asterisk; Actual: dash
Unordered list style(MD004, ul-style)
69-69: Expected: asterisk; Actual: dash
Unordered list style(MD004, ul-style)
70-70: Expected: asterisk; Actual: dash
Unordered list style(MD004, ul-style)
57-57: null
Fenced code blocks should have a language specified(MD040, fenced-code-language)
notebooks/week2/02_04_train_log_custom_model.py (1)
`121-123`: **Enhance example prediction robustness.** While adding an example prediction is good practice, consider:
- Adding error handling for empty test set
- Using multiple diverse examples to better validate model behavior
```diff
-example_input = X_test.iloc[0:1]  # Select the first row for prediction as example
-example_prediction = wrapped_model.predict(context=None, model_input=example_input)
-print("Example Prediction:", example_prediction)
+if len(X_test) > 0:
+    # Test with multiple diverse examples
+    example_inputs = X_test.iloc[0:3]  # Select first three rows
+    for i, example_input in example_inputs.iterrows():
+        example_prediction = wrapped_model.predict(context=None, model_input=pd.DataFrame([example_input]))
+        print(f"Example {i+1} Prediction:", example_prediction)
+else:
+    print("Warning: Test set is empty, cannot generate example predictions")
```

notebooks/week2/05.log_and_register_fe_model.py (4)
`2-2`: **Consider using a more portable package installation approach.** The current installation path `/Volumes/main/default/file_exchange/nico/power_consumption-0.0.1-py3-none-any.whl` is hardcoded to a specific user's directory. This could cause issues when other team members try to run the notebook. Consider:
- Publishing the package to an internal PyPI repository
- Using relative paths with environment variables (a minimal sketch follows this list)
- Adding the package to the project's dependencies in `pyproject.toml`
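As a rough sketch of the environment-variable option (the `POWER_CONSUMPTION_WHEEL` variable name is hypothetical; the fallback keeps today's behaviour):

```python
# A minimal sketch, assuming a (hypothetical) POWER_CONSUMPTION_WHEEL environment variable
# set per user or cluster; falls back to the current hardcoded location.
import os
import subprocess
import sys

wheel_path = os.environ.get(
    "POWER_CONSUMPTION_WHEEL",
    "/Volumes/main/default/file_exchange/nico/power_consumption-0.0.1-py3-none-any.whl",
)
subprocess.check_call([sys.executable, "-m", "pip", "install", wheel_path])
```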
`87-91`: **Enhance the SQL function with validation and documentation.** The temperature rounding function could benefit from additional safeguards and clarity. Consider this enhanced version:

```diff
 CREATE OR REPLACE FUNCTION {function_name}(temperature DOUBLE)
 RETURNS INT
 LANGUAGE PYTHON AS
 $$
-return round(temperature)
+def round_temp(temperature: float) -> int:
+    """Round temperature to nearest integer.
+
+    Args:
+        temperature: Temperature value in degrees
+    Returns:
+        Rounded temperature as integer
+    Raises:
+        ValueError: If temperature is None
+    """
+    if temperature is None:
+        raise ValueError("Temperature cannot be None")
+    return round(temperature)
+
+return round_temp(temperature)
 $$
```
Line range hint `142-143`: **Fix hardcoded git SHA value.** The git SHA is currently hardcoded as "bla", which defeats the purpose of version tracking.

```diff
-git_sha = "bla"
+import subprocess
+
+def get_git_sha():
+    try:
+        return subprocess.check_output(['git', 'rev-parse', 'HEAD']).decode('ascii').strip()
+    except:
+        return None
+
+git_sha = get_git_sha()
```
`177-177`: **Fix formatting issues.** There are minor formatting issues at the end of the file:
- Remove trailing whitespace
- Add newline at end of file
🧰 Tools
🪛 Ruff
177-177: Blank line contains whitespace
Remove whitespace from blank line
(W293)
177-177: No newline at end of file
Add trailing newline
(W292)
notebooks/week3/02.model_serving.py (4)
`20-21`: **Remove unused imports `TrafficConfig` and `Route`.** The imports of `TrafficConfig` and `Route` are not used in the code. Removing them will clean up the code and avoid potential confusion. Apply this diff to remove the unused imports:

```diff
 from databricks.sdk.service.serving import (
     EndpointCoreConfigInput,
     ServedEntityInput,
-    TrafficConfig,
-    Route,
 )
```

🧰 Tools
🪛 Ruff
20-20:
databricks.sdk.service.serving.TrafficConfig
imported but unusedRemove unused import
(F401)
21-21:
databricks.sdk.service.serving.Route
imported but unusedRemove unused import
(F401)
`40-40`: **Parameterize the training dataset table name instead of hardcoding.** The table name `train_set_nico` is hardcoded. Consider retrieving the table name from the configuration or defining it as a variable to improve flexibility and maintainability.
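For illustration, a sketch of pulling the table name from the project config (the `train_table` key is an assumption about the config schema, not something defined in this PR):

```python
# A minimal sketch, assuming project_config.yml exposes catalog_name, schema_name and a
# (hypothetical) train_table key; `spark` is the active SparkSession inside the notebook.
import yaml

with open("../../project_config.yml") as f:
    config = yaml.safe_load(f)

catalog_name = config["catalog_name"]
schema_name = config["schema_name"]
train_table = config.get("train_table", "train_set_nico")

train_set = spark.table(f"{catalog_name}.{schema_name}.{train_table}").toPandas()
```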
`52-52`: **Parameterize the `entity_version` instead of hardcoding.** The `entity_version` is hardcoded as `6`. To ensure you're always using the correct model version, consider obtaining the version number dynamically or from configuration.
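One possible way to resolve the version dynamically (the registered model name below is a hypothetical placeholder, and the MLflow registry is assumed to already be configured in the notebook):

```python
# A minimal sketch: pick the newest registered version instead of pinning entity_version to 6.
# The model name is a hypothetical placeholder; use the name the model was registered under.
from mlflow.tracking import MlflowClient

client = MlflowClient()
model_name = "main.default.power_consumption_model"  # hypothetical
entity_version = max(
    int(mv.version) for mv in client.search_model_versions(f"name = '{model_name}'")
)
```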
`55-61`: **Remove commented-out `traffic_config` code if not needed.** The `traffic_config` section is commented out. If it's not required for your current setup, consider removing it to keep the code clean and reduce clutter.
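For reference only, if traffic splitting is needed later (e.g., for the A/B test notebook), a hedged sketch of how `TrafficConfig` and `Route` could be wired in — the served entity name, version, and percentage below are made up:

```python
# A minimal sketch; all names, versions, and percentages are hypothetical placeholders.
from databricks.sdk.service.serving import (
    EndpointCoreConfigInput,
    Route,
    ServedEntityInput,
    TrafficConfig,
)

endpoint_config = EndpointCoreConfigInput(
    served_entities=[
        ServedEntityInput(
            entity_name="main.default.power_consumption_model",  # hypothetical
            entity_version="6",
            scale_to_zero_enabled=True,
            workload_size="Small",
        ),
    ],
    traffic_config=TrafficConfig(
        routes=[Route(served_model_name="power_consumption_model-6", traffic_percentage=100)]
    ),
)
```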
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (10)
- `.gitignore` (1 hunks)
- `README.md` (2 hunks)
- `mlruns/0/meta.yaml` (0 hunks)
- `notebooks/week2/02_04_train_log_custom_model.py` (2 hunks)
- `notebooks/week2/05.log_and_register_fe_model.py` (6 hunks)
- `notebooks/week2/mlruns/0/meta.yaml` (0 hunks)
- `notebooks/week2/model_version.json` (0 hunks)
- `notebooks/week3/02.model_serving.py` (1 hunks)
- `notebooks/week3/README.md` (1 hunks)
- `pyproject.toml` (1 hunks)
💤 Files with no reviewable changes (3)
- mlruns/0/meta.yaml
- notebooks/week2/mlruns/0/meta.yaml
- notebooks/week2/model_version.json
🧰 Additional context used
🪛 Ruff
notebooks/week2/02_04_train_log_custom_model.py
131-131: Blank line contains whitespace
Remove whitespace from blank line
(W293)
notebooks/week2/05.log_and_register_fe_model.py
177-177: Blank line contains whitespace
Remove whitespace from blank line
(W293)
177-177: No newline at end of file
Add trailing newline
(W292)
notebooks/week3/02.model_serving.py
20-20: databricks.sdk.service.serving.TrafficConfig
imported but unused
Remove unused import
(F401)
21-21: databricks.sdk.service.serving.Route
imported but unused
Remove unused import
(F401)
🪛 LanguageTool
notebooks/week3/README.md
[uncategorized] ~9-~9: Possible missing comma found.
Context: ...red in this lecture. ## Overview Last week we demonstrated model training and regi...
(AI_HYDRA_LEO_MISSING_COMMA)
[grammar] ~27-~27: The verb form ‘shows’ does not seem to match the subject ‘examples’.
Context: ...e lookups. The subsequent code examples shows how to invoke this endpoint and get res...
(SUBJECT_VERB_AGREEMENT_PLURAL)
[uncategorized] ~37-~37: Possible missing article found.
Context: ... the model. It's important to note that entity name we pass is a registered model name...
(AI_HYDRA_LEO_MISSING_THE)
[typographical] ~50-~50: If you want to indicate numerical ranges or time ranges, consider using an en dash.
Context: ...he table we created last week on week 2 - 05.log_and_register_fe_model.py noteboo...
(DASH_RULE)
[typographical] ~52-~52: If you want to indicate numerical ranges or time ranges, consider using an en dash.
Context: ...e registred in the same notebook week 2 - 05.log_and_register_fe_model.p. This is...
(DASH_RULE)
[typographical] ~64-~64: If you want to indicate numerical ranges or time ranges, consider using an en dash.
Context: ...e the same approach as we did in week 2 - 03.log_and_register_model.py. - We trai...
(DASH_RULE)
🪛 Markdownlint
notebooks/week3/README.md
50-50: Expected: asterisk; Actual: dash
Unordered list style
(MD004, ul-style)
51-51: Expected: asterisk; Actual: dash
Unordered list style
(MD004, ul-style)
52-52: Expected: asterisk; Actual: dash
Unordered list style
(MD004, ul-style)
53-53: Expected: asterisk; Actual: dash
Unordered list style
(MD004, ul-style)
63-63: Expected: asterisk; Actual: dash
Unordered list style
(MD004, ul-style)
64-64: Expected: asterisk; Actual: dash
Unordered list style
(MD004, ul-style)
65-65: Expected: asterisk; Actual: dash
Unordered list style
(MD004, ul-style)
66-66: Expected: asterisk; Actual: dash
Unordered list style
(MD004, ul-style)
67-67: Expected: asterisk; Actual: dash
Unordered list style
(MD004, ul-style)
68-68: Expected: asterisk; Actual: dash
Unordered list style
(MD004, ul-style)
69-69: Expected: asterisk; Actual: dash
Unordered list style
(MD004, ul-style)
70-70: Expected: asterisk; Actual: dash
Unordered list style
(MD004, ul-style)
15-15: null
Fenced code blocks should have a language specified
(MD040, fenced-code-language)
30-30: null
Fenced code blocks should have a language specified
(MD040, fenced-code-language)
43-43: null
Fenced code blocks should have a language specified
(MD040, fenced-code-language)
57-57: null
Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🔇 Additional comments (7)
pyproject.toml (1)
`16-16`: LGTM! Correct version specifier syntax. The change from `===` to `==` aligns with PEP 440 version specification standards.
.gitignore (2)
`103-104`: Review JSON ignore patterns for potential over-exclusion. The pattern `notebooks/*.json` might accidentally exclude important configuration or metadata files that should be version controlled. Consider:
- Being more specific about which JSON files to ignore
- Documenting which JSON files are generated vs. configuration files
Let's check what JSON files might be affected:
```bash
#!/bin/bash
# Description: List JSON files that would be ignored
# Expected: Only generated/temporary files, not configuration files

# Find all JSON files in notebooks directory
fd -e json . notebooks/
```
`101-102`: Consider centralizing MLflow tracking. Having MLflow tracking directories in multiple locations (`notebooks/week2/mlruns/` and `mlruns/`) could make experiment management and comparison more difficult. Consider centralizing all MLflow tracking in one location, preferably at the root level.
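A small sketch of what centralizing could look like (the experiment path is a placeholder, not one defined in this repo):

```python
# A minimal sketch, assuming the Databricks-managed tracking server is the single target;
# the experiment path below is hypothetical.
import mlflow

mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/power-consumption")
```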
Let's check if there are any MLflow-related files in other locations:
README.md (1)
`22-23`: LGTM! Good improvement in virtual environment naming. The change from `venv` to `.venv` follows better practices for Python projects by keeping the virtual environment in a hidden directory.
notebooks/week3/README.md (1)
`1-70`: Overall content is well-structured and informative. The README provides comprehensive coverage of Week 3 materials with clear explanations and examples. The technical content is accurate and well-organized into logical sections. While there are some formatting inconsistencies, they don't impact the document's effectiveness in guiding students through the material.
notebooks/week2/02_04_train_log_custom_model.py (1)
`132-132`: Verify model signature format. The model signature is being created with a dictionary output format. Ensure this matches the expected format for all downstream consumers of the model.
✅ Verification successful
Let me analyze the search results. I can see three instances of `infer_signature` usage in the codebase:
- In `05.log_and_register_fe_model.py`: `signature = infer_signature(model_input=X_train, model_output=y_pred)`
- In `02_04_train_log_custom_model.py` (two instances): `signature = infer_signature(model_input=X_train, model_output=y_pred)` and `signature = infer_signature(model_input=X_train, model_output={'Prediction': example_prediction})`

Let me check if there is any model serving or inference code that might depend on this signature format.
Based on the search results, I can now generate the final response as I have enough context to verify the model signature format. The code shows that:
- The custom model's `predict` method in `02_04_train_log_custom_model.py` explicitly returns predictions in a dictionary format with 'Prediction' as the key.
- This matches exactly with the signature format being used: `model_output={'Prediction': example_prediction}`
- The model is loaded and used consistently throughout the codebase, including in serving endpoints (as seen in notebooks/week3/README.md).
**Model signature format is correct and consistent.** The dictionary output format `{'Prediction': value}` matches the model's implementation and is used consistently in the serving pipeline. This format is intentionally designed to wrap the predictions in a structured format for the model serving endpoints.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:

```bash
#!/bin/bash
# Look for other model signatures in the codebase for consistency
rg -A 5 "infer_signature|ModelSignature"
```

Length of output: 2715

Script:

```bash
#!/bin/bash
# Look for model loading and prediction code
rg -A 5 "load_model|predict"
```

Length of output: 7937
notebooks/week2/05.log_and_register_fe_model.py (1)
Line range hint `103-120`: Address the unauthorized error TODO comment. There's an unresolved TODO comment about unauthorized errors. This could indicate potential permission issues with the feature store access.
Let's verify the permissions setup:
Would you like me to help investigate the authorization issue or create a GitHub issue to track this?
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (4)
notebooks/week3/02.model_serving.py (4)
`17-22`: **Remove unused imports.** The following imports are not used in the code:
- `TrafficConfig`
- `Route`

```diff
 from databricks.sdk.service.serving import (
     EndpointCoreConfigInput,
     ServedEntityInput,
-    TrafficConfig,
-    Route,
 )
```

🧰 Tools
🪛 Ruff
20-20:
databricks.sdk.service.serving.TrafficConfig
imported but unusedRemove unused import
(F401)
21-21:
databricks.sdk.service.serving.Route
imported but unusedRemove unused import
(F401)
`32-35`: **Consider using environment variables for config paths.** The hardcoded fallback path `"../../project_config.yml"` could break if the notebook location changes. Consider using an environment variable for the config path.
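A possible sketch (the `PROJECT_CONFIG_PATH` variable name is made up for illustration):

```python
# A minimal sketch, assuming a (hypothetical) PROJECT_CONFIG_PATH environment variable
# that overrides the current relative fallback.
import os

import yaml

config_path = os.environ.get("PROJECT_CONFIG_PATH", "../../project_config.yml")
with open(config_path) as f:
    config = yaml.safe_load(f)
```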
`92-93`: **Remove commented out code.** Remove the commented line as it's not being used:

```diff
 sampled_records = train_set[required_columns].sample(n=1000, replace=True).to_dict(orient="records")
-# sampled_records = train_set.sample(n=1000, replace=True).to_dict(orient="records")
```
`132-139`: **Make load test parameters configurable.** The number of workers and requests are hardcoded. Consider making these configurable through environment variables or a configuration file.

```diff
+# Load test configuration
+num_workers = int(os.getenv("LOAD_TEST_WORKERS", "100"))
+num_requests = int(os.getenv("LOAD_TEST_REQUESTS", "1000"))
+
 # Initialize variables
 model_serving_endpoint = (
     f"https://{host}/serving-endpoints/power-consumption-model-serving/invocations"
 )
 headers = {"Authorization": f"Bearer {token}"}
-num_requests = 1000
```

Then update the ThreadPoolExecutor:

```diff
-with ThreadPoolExecutor(max_workers=100) as executor:
+with ThreadPoolExecutor(max_workers=num_workers) as executor:
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- `notebooks/week3/02.model_serving.py` (1 hunks)
🧰 Additional context used
🪛 Ruff
notebooks/week3/02.model_serving.py
20-20: databricks.sdk.service.serving.TrafficConfig
imported but unused
Remove unused import
(F401)
21-21: databricks.sdk.service.serving.Route
imported but unused
Remove unused import
(F401)
```diff
@@ -175,8 +174,4 @@
 mlflow.register_model(
     model_uri=f'runs:/{run_id}/lightgbm-pipeline-model-fe',
     name=f"{catalog_name}.{schema_name}.power_consumptions_model_fe")
```
have you also tried to serve this model?
you have a nice example with pyfunc model - that's enough for week3, I'm just asking out of curiosity :)
hey, no I did not try because of time limitations
Looks nice!
Summary by CodeRabbit

New Features

Bug Fixes
- Corrected the version specifier for the `lightgbm` package.

Documentation

Chores
- Updated `.gitignore` to exclude unnecessary files and directories related to machine learning projects.