
week2 updates #2

Merged (3 commits) on Nov 4, 2024

Conversation


@NiGroUni NiGroUni commented Oct 25, 2024

Unfortunately, creating the training set with databricks.feature_engineering led to a 401 Unauthorized error, as Maria mentioned in the lecture. I do not know how to solve this, as we cannot work in Databricks due to company restrictions.

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced README with clearer setup instructions for local and cluster package installations.
    • Introduced new metadata files (meta.yaml and model_version.json) for MLflow experiments and model versioning.
    • New scripts for dataset preparation and machine learning workflows using LightGBM and MLflow.
    • Added a PriceModel class for structured model training and evaluation.
    • New utility function to adjust predictions.
  • Bug Fixes

    • Streamlined configuration parameters in project_config.yml for consistency.
  • Documentation

    • Updated project configuration and README for improved clarity.
  • Chores

    • Updated dependency management in pyproject.toml for precise version control.

@NiGroUni NiGroUni requested a review from a team as a code owner October 25, 2024 14:48

coderabbitai bot commented Oct 25, 2024

Walkthrough

This pull request introduces several significant changes across multiple files, primarily enhancing the setup instructions in README.md, adding metadata files for MLflow experiments and model versions, and implementing new functionalities in various Python scripts for data processing and model training. Key additions include a new DataProcessor class, a PriceModel for machine learning, and configuration management through ProjectConfig. The dependency management in pyproject.toml has also been tightened with specific version pinning, while the project_config.yml has been updated for consistency in naming conventions.

Changes

File Change Summary
README.md Updated installation instructions for src package, added cluster installation command, and streamlined the upload section.
mlruns/0/meta.yaml New metadata file created for MLflow experiment with attributes like artifact_location, creation_time, experiment_id, lifecycle_stage, and name.
model_version.json New JSON file created for model version metadata including fields like _name, _version, _creation_time, _user_id, _status, and others.
notebooks/week2/01.prepare_dataset_n.py Introduced a script for dataset preparation using Spark, including configuration loading, data processing, and saving processed datasets.
notebooks/week2/02_04_train_log_custom_model.py Implemented a machine learning workflow with MLflow and LightGBM, including model training, logging, and registration.
notebooks/week2/05.log_and_register_fe_model.py Implemented a feature engineering workflow with Delta table creation, model training, and logging of metrics to MLflow.
notebooks/week2/mlruns/0/meta.yaml New YAML configuration file for ML run metadata including artifact_location, creation_time, experiment_id, lifecycle_stage, and name.
notebooks/week2/model_version.json New JSON file defining a model version for a power consumption model with metadata like model name, version, creation time, and status.
project_config.yml Updated configuration parameters for consistency, including renaming and adding new entries for id_col and mlflow_experiment_name.
pyproject.toml Updated dependencies with specific version pinning and modified optional dependencies for controlled management.
src/power_consumption/config.py New ProjectConfig class created for managing project configurations, including methods for loading from YAML files.
src/power_consumption/data_processor.py Major updates to DataProcessor class, modifying methods for data loading and processing, and introducing a new save_to_catalog method.
src/power_consumption/model.py New PriceModel class introduced for price prediction using a RandomForestRegressor, with methods for training, predicting, evaluating, and retrieving feature importance.
src/power_consumption/utils.py New adjust_predictions function added to modify prediction values by a scale factor.

Poem

In the garden of code, we hop and play,
New scripts and models brighten the day.
With features and functions, we dance with glee,
In the world of data, oh so carefree!
Hooray for the changes, let’s give a cheer,
For every new line brings progress near! 🐇✨



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 26

🧹 Outside diff range and nitpick comments (16)
src/power_consumption/utils.py (1)

1-3: Consider adding unit tests.

This utility function is likely critical for your ML pipeline's output. Ensure it's thoroughly tested with various input types and edge cases.

Would you like me to help create comprehensive unit tests for this function? I can generate test cases covering:

  • Different input types (float, list, numpy array)
  • Edge cases (empty arrays, zero values)
  • Error conditions (invalid scale factors)
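As a sketch, such tests might look like the following. It assumes the current implementation of adjust_predictions shown later in this review (predictions * scale_factor, with a default of 1.3); in the repo these would live in a tests/ module as pytest functions.

```python
# Sketch of the suggested tests, assuming:
#   def adjust_predictions(predictions, scale_factor=1.3):
#       return predictions * scale_factor
import math

import numpy as np


def adjust_predictions(predictions, scale_factor=1.3):
    return predictions * scale_factor


# Scalar input
assert math.isclose(adjust_predictions(10.0), 13.0)

# NumPy array input with an explicit scale factor
np.testing.assert_allclose(
    adjust_predictions(np.array([1.0, 2.0]), scale_factor=2.0), [2.0, 4.0]
)

# Edge cases: empty array and zero values
assert adjust_predictions(np.array([])).size == 0
np.testing.assert_allclose(adjust_predictions(np.array([0.0, 5.0])), [0.0, 6.5])
```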
project_config.yml (2)

18-18: Add newline at end of file.

The YAML file is missing a newline character at the end of the file. While this doesn't affect functionality, it's a common convention that helps with file concatenation and git diffs.

 mlflow_experiment_name: /Shared/mlops_course_nico
+
🧰 Tools
🪛 yamllint

[error] 18-18: no new line character at the end of file

(new-line-at-end-of-file)


Line range hint 4-7: Document ML hyperparameters.

Consider adding comments to explain the choice of hyperparameters:

  • Why learning_rate = 0.01?
  • Why n_estimators = 1000?
  • Why max_depth = 6?

This will help future maintainers understand the reasoning behind these specific values.

 parameters:
+  # Smaller learning rate for better convergence
   learning_rate: 0.01
+  # High number of trees for robust ensemble
   n_estimators: 1000
+  # Moderate depth to prevent overfitting
   max_depth: 6
pyproject.toml (2)

Line range hint 26-30: Resolve version conflict in databricks-sdk.

There's a potential version conflict:

  • Main dependencies: databricks-sdk==0.32.0
  • Optional dependencies: databricks-sdk>=0.32.0, <0.33

While they don't strictly conflict, it's better to align them exactly to prevent any subtle compatibility issues.

 dev = ["databricks-connect==15.3.*",
-       "databricks-sdk>=0.32.0, <0.33",
+       "databricks-sdk==0.32.0",
        "ipykernel>=6.29.5, <7",
        "pip>=24.2",
        "pre-commit"]

7-22: Authentication troubleshooting advice.

Regarding the 401 Unauthorized error:

  1. Ensure you have proper Databricks credentials configured
  2. If you can't access Databricks due to company restrictions, consider:
    • Setting up a local development environment
    • Using mock data for development
    • Implementing feature engineering locally without Databricks dependencies

Would you like assistance in setting up a local development environment that doesn't require Databricks access?
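On the last point, the Feature Store lookup can be approximated locally with a plain pandas join, which keeps development unblocked while the 401 issue is resolved. The table and column names below (id, temperature, humidity, power_consumption) are invented for illustration and are not taken from the project:

```python
import pandas as pd

# Base table with the label, keyed by the lookup column
base = pd.DataFrame(
    {"id": [1, 2, 3], "power_consumption": [210.0, 180.5, 195.2]}
)

# Local stand-in for a feature table that fe.create_training_set
# would normally look up in the Feature Store
features = pd.DataFrame(
    {"id": [1, 2, 3], "temperature": [13.2, 15.1, 9.8], "humidity": [70, 62, 81]}
)

# Equivalent of the feature lookup: a left join on the lookup key
training_set = base.merge(features, on="id", how="left")
```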

README.md (2)

Line range hint 30-37: Add authentication troubleshooting section

Given the 401 Unauthorized Error you're encountering, it would be helpful to add a troubleshooting section about Databricks authentication. The current instructions assume successful authentication but don't address common issues.

Add this section after the upload example:

### Troubleshooting Authentication

If you encounter a 401 Unauthorized Error:

1. Verify your Databricks authentication:
   ```bash
   databricks auth describe
   ```
2. If your token is invalid, generate a new one in the Databricks UI: User Settings → Access Tokens
3. Configure the token:
   ```bash
   databricks configure --token
   ```

For users with company restrictions, please consult your IT department about:

  • Required permissions for feature store access
  • Any proxy settings needed for authentication

Line range hint 34-36: Update example to use a generic path

The upload example also contains a hardcoded username which should be replaced with a placeholder.

-databricks fs cp dist\power_consumption-0.0.1-py3-none-any.whl dbfs:/Volumes/main/default/file_exchange/nico
+databricks fs cp dist\power_consumption-0.0.1-py3-none-any.whl dbfs:/Volumes/main/default/file_exchange/<your-username>
src/power_consumption/model.py (2)

23-27: Consider logging evaluation metrics for better visibility

While the evaluate method returns the MSE and R-squared scores, adding logging statements can help track model performance during training and debugging.

Example addition:

import logging

def evaluate(self, X_test, y_test):
    y_pred = self.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    logging.info(f"Evaluation results - MSE: {mse}, R2 Score: {r2}")
    return mse, r2

5-32: Add docstrings to the class and its methods for clarity

Including docstrings will enhance code readability and maintainability by providing clear documentation of the class's purpose and how each method functions.

Example:

class PriceModel:
    """Machine learning model for price prediction using a Random Forest regressor."""

    def __init__(self, preprocessor, config):
        """
        Initialize the PriceModel with a preprocessor and configuration.

        Parameters:
        preprocessor : Transformer object
            The preprocessing pipeline to prepare the data.
        config : dict
            Configuration dictionary containing model parameters.
        """
        ...

    def train(self, X_train, y_train):
        """
        Fit the model to the training data.

        Parameters:
        X_train : array-like
            Training feature data.
        y_train : array-like
            Training target data.
        """
        ...

    def predict(self, X):
        """
        Generate predictions for the input data.

        Parameters:
        X : array-like
            Input feature data.

        Returns:
        array-like
            Predicted target values.
        """
        ...

    def evaluate(self, X_test, y_test):
        """
        Evaluate the model's performance on the test data.

        Parameters:
        X_test : array-like
            Test feature data.
        y_test : array-like
            True target values for the test set.

        Returns:
        tuple
            Mean squared error and R-squared score.
        """
        ...

    def get_feature_importance(self):
        """
        Retrieve the feature importances and corresponding feature names.

        Returns:
        tuple
            Feature importances and feature names.
        """
        ...
src/power_consumption/data_processor.py (2)

20-20: Remove print statement from constructor

Printing self.df.head() in the constructor can clutter the output and is not recommended for production code.

Apply this diff to remove the print statement:

- print(self.df.head())

25-38: Optimize data preprocessing using Spark DataFrames

Processing data with pandas DataFrames may not scale well for large datasets. Consider using Spark DataFrame operations for preprocessing to improve performance and scalability.

Refactor the preprocess method to use Spark operations:

- def preprocess(self):
-     # Handle numeric features
-     num_features = self.config.num_features
-     for col in num_features:
-         self.df[col] = pd.to_numeric(self.df[col], errors='coerce')
-     # Fill missing values with mean or default values
-     self.df.fillna(0, inplace=True)
-     # Extract target and relevant features
-     target = self.config.target
-     id_col = self.config.id_col
-     relevant_columns = num_features + [target] + [id_col]
-     self.df = self.df[relevant_columns]
+ def preprocess(self):
+     # Handle numeric features and fill missing values using Spark
+     num_features = self.config.num_features
+     for col in num_features:
+         self.df = self.df.withColumn(col, self.df[col].cast("double"))
+         self.df = self.df.fillna({col: 0})
+     # Extract target and relevant features
+     target = self.config.target
+     id_col = self.config.id_col
+     relevant_columns = num_features + [target, id_col]
+     self.df = self.df.select(relevant_columns)
notebooks/week2/02_04_train_log_custom_model.py (3)

26-30: Catch specific exceptions instead of a broad Exception

In the try-except block, you are catching all exceptions with except Exception. It's better practice to catch specific exceptions to avoid masking unexpected errors. For example, if you're expecting a FileNotFoundError, catch it explicitly.

Apply this diff:

try:
    config = ProjectConfig.from_yaml(config_path="project_config.yml")
- except Exception:
+ except FileNotFoundError:
    config = ProjectConfig.from_yaml(config_path="../../project_config.yml")

63-63: Replace hardcoded git_sha with the actual Git commit SHA

The git_sha variable is currently hardcoded as "bla". It's recommended to dynamically retrieve the actual Git commit SHA to ensure accurate tracking and reproducibility in MLflow.

You can retrieve the Git SHA using the subprocess module:

import subprocess
git_sha = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode("ascii").strip()

Ensure that Git is available in the execution environment when using this approach.


97-101: Remove commented-out code or provide an explanation

The code for logging the sklearn model is commented out. If it's no longer needed, consider removing it. If it's intentionally left for future reference, add a comment explaining why it's commented out.

If not needed, remove the commented code:

- # mlflow.sklearn.log_model(
- #     sk_model=pipeline,
- #     artifact_path="lightgbm-pipeline-model",
- #     signature=signature
- # )
notebooks/week2/05.log_and_register_fe_model.py (2)

143-143: Replace placeholder git_sha with the actual Git SHA.

The git_sha variable is currently set to "bla". Using the actual Git commit SHA improves traceability in MLflow experiments.

Consider updating git_sha dynamically:

-git_sha = "bla"
+import subprocess
+git_sha = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode("utf-8").strip()

Ensure that the script is run in an environment where Git is available.


145-147: Secure handling of MLflow run tags.

When starting an MLflow run, you're manually setting tags for branch and git_sha. Ensure that sensitive information is not exposed in these tags.

If the branch name or other tags might contain sensitive data, consider reviewing them before logging.
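One way to review tags programmatically is a small redaction helper run over the tag dict before mlflow.start_run. This is a hypothetical helper, not part of the project; the pattern should be adapted to your organisation's rules:

```python
import re

# Naive credential heuristic; tune for your environment (hypothetical helper)
SENSITIVE = re.compile(r"(token|secret|password|key)", re.IGNORECASE)


def sanitize_tags(tags: dict) -> dict:
    """Mask tag values whose name or value looks like it carries credentials."""
    clean = {}
    for name, value in tags.items():
        if SENSITIVE.search(name) or SENSITIVE.search(str(value)):
            clean[name] = "<redacted>"
        else:
            clean[name] = value
    return clean


tags = sanitize_tags({"branch": "week2", "git_sha": "abc123", "api_token": "dapi-xyz"})
```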

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 554b874 and 19d3577.

📒 Files selected for processing (14)
  • README.md (1 hunks)
  • mlruns/0/meta.yaml (1 hunks)
  • model_version.json (1 hunks)
  • notebooks/week2/01.prepare_dataset_n.py (1 hunks)
  • notebooks/week2/02_04_train_log_custom_model.py (1 hunks)
  • notebooks/week2/05.log_and_register_fe_model.py (1 hunks)
  • notebooks/week2/mlruns/0/meta.yaml (1 hunks)
  • notebooks/week2/model_version.json (1 hunks)
  • project_config.yml (2 hunks)
  • pyproject.toml (1 hunks)
  • src/power_consumption/config.py (1 hunks)
  • src/power_consumption/data_processor.py (2 hunks)
  • src/power_consumption/model.py (1 hunks)
  • src/power_consumption/utils.py (1 hunks)
✅ Files skipped from review due to trivial changes (4)
  • mlruns/0/meta.yaml
  • model_version.json
  • notebooks/week2/mlruns/0/meta.yaml
  • notebooks/week2/model_version.json
🧰 Additional context used
🪛 yamllint
project_config.yml

[error] 18-18: no new line character at the end of file

(new-line-at-end-of-file)

🔇 Additional comments (8)
src/power_consumption/utils.py (1)

1-3: Verify integration with ML pipeline.

Since this function adjusts model predictions, it's important to understand where and how it fits in your prediction pipeline.

Let's check how this function is being used:

#!/bin/bash
# Search for references to adjust_predictions
rg "adjust_predictions" --type py -A 5

Consider adding logging to track when predictions are adjusted, which could be valuable for monitoring and debugging:

+import logging
+
+logger = logging.getLogger(__name__)
+
 def adjust_predictions(predictions, scale_factor=1.3):
+    logger.info(f"Adjusting predictions with scale_factor={scale_factor}")
     return predictions * scale_factor
project_config.yml (2)

1-2: Verify sandbox database permissions.

The configuration is pointing to a sandbox catalog and schema. Given the 401 Unauthorized error you're encountering, please verify:

  1. You have been granted access to the sandbox catalog
  2. You have appropriate permissions on the sb_adan schema

If you're unable to access Databricks directly, you may need to:

  1. Request access from your Databricks workspace admin
  2. Use a different catalog/schema where you have permissions
  3. Consider setting up a local development environment with sample data for testing
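For the third option, a small synthetic dataset is usually enough to exercise the pipeline locally. The schema below is a guess at the power-consumption use case (column names are placeholders, not the real ones):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500

# Hypothetical schema loosely matching the use case; replace with the
# real columns from project_config.yml when known
sample = pd.DataFrame(
    {
        "id": np.arange(n),
        "temperature": rng.normal(18, 5, n),
        "humidity": rng.uniform(30, 90, n),
        "power_consumption": rng.normal(200, 25, n),
    }
)
```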

18-18: Review MLflow experiment path permissions.

The MLflow experiment path /Shared/mlops_course_nico might require specific access rights. Since you're experiencing authorization issues:

  1. Ensure you have write permissions to the /Shared directory
  2. Consider using a personal workspace path like /Users/<your-email>/mlops_course_nico instead

If you continue to experience issues:

  1. Check if you can create experiments in your personal workspace
  2. Verify MLflow is properly configured in your environment
  3. Consider using local MLflow tracking for development
🧰 Tools
🪛 yamllint

[error] 18-18: no new line character at the end of file

(new-line-at-end-of-file)

src/power_consumption/config.py (1)

1-3: LGTM! Imports are appropriate and secure.

Good choice using yaml.safe_load for secure YAML parsing and Pydantic for data validation.
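For readers following along, a minimal sketch of what such a class might look like. The field names are inferred from the project_config.yml keys mentioned elsewhere in this review; the real class may differ:

```python
import yaml
from pydantic import BaseModel


class ProjectConfig(BaseModel):
    # Field names inferred from config keys referenced in this review;
    # treat them as assumptions, not the actual schema
    catalog_name: str
    schema_name: str
    id_col: str
    target: str
    num_features: list
    parameters: dict
    mlflow_experiment_name: str

    @classmethod
    def from_yaml(cls, config_path: str) -> "ProjectConfig":
        # safe_load refuses arbitrary Python object construction
        with open(config_path) as f:
            return cls(**yaml.safe_load(f))
```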

notebooks/week2/01.prepare_dataset_n.py (1)

1-7: Add error handling for Spark session creation and verify authentication configuration.

Given the 401 Unauthorized Error mentioned in the PR, ensure proper authentication is configured. Consider adding error handling for Spark session creation and explicitly setting authentication parameters if needed.

Example error handling:

-spark = SparkSession.builder.getOrCreate()
+try:
+    spark = SparkSession.builder.getOrCreate()
+except Exception as e:
+    print(f"Failed to create Spark session: {str(e)}")
+    raise
src/power_consumption/data_processor.py (1)

29-29: Verify conversion of columns to numeric types

Using pd.to_numeric with errors='coerce' will replace non-numeric values with NaN. Ensure that this behavior is intended and that subsequent missing value handling is appropriate.

notebooks/week2/05.log_and_register_fe_model.py (2)

175-177: Verify model registration parameters.

When registering the model with MLflow, ensure that the model URI and name are correctly specified. The current model URI relies on run_id, which should be valid.

Run the following script to confirm that the model artifact exists at the specified URI:

#!/bin/bash
# Description: Verify that the model artifact exists in MLflow.

# Expected result: The artifact path should list the model.

# List artifacts for the run
mlflow artifacts list --run-id "$run_id" --artifact-path "lightgbm-pipeline-model-fe"

Ensure that mlflow CLI is installed and configured with your Databricks tracking URI.


104-105: ⚠️ Potential issue

Investigate the 401 Unauthorized Error during feature engineering.

The TODO comment mentions an unauthorized error when using fe.create_training_set. This error typically occurs due to insufficient permissions or misconfigured authentication.

Please ensure that:

  • You have the necessary permissions to use the Feature Store and Feature Engineering services.
  • Your workspace's feature store is properly configured.
  • Your authentication credentials are correctly set up.

Would you like assistance in troubleshooting this issue?

Run the following script to check your entitlements:

Replace $DATABRICKS_TOKEN with your Databricks personal access token and $DATABRICKS_INSTANCE with your workspace URL (e.g., adb-1234567890123456.7.azuredatabricks.net).
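The script itself is elided above. A minimal stand-in could query the SCIM `Me` endpoint, which returns the calling user's entitlements and group memberships; treat this as a sketch and verify the endpoint version against your workspace:

```python
import urllib.request


def current_user_request(instance: str, token: str) -> urllib.request.Request:
    """Build a GET against the SCIM `Me` endpoint, which returns the
    calling user's profile, entitlements, and groups."""
    return urllib.request.Request(
        f"https://{instance}/api/2.0/preview/scim/v2/Me",
        headers={"Authorization": f"Bearer {token}"},
    )


# To actually run the check (requires network access to your workspace):
#   resp = urllib.request.urlopen(current_user_request(instance, token))
#   print(resp.read().decode())
```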


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Outside diff range and nitpick comments (2)
notebooks/week2/02_04_train_log_custom_model.py (2)

45-49: Add data validation checks

Consider adding validation checks for the input data before training:

  • Verify that there are no missing values
  • Check for expected data types
  • Validate value ranges for numerical features

Would you like me to provide an example implementation of these validation checks?
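As a sketch, such checks could be a fail-fast helper run before pipeline.fit. The column names and bounds here are placeholders; the real rules should mirror project_config.yml:

```python
import pandas as pd


def validate_training_data(df: pd.DataFrame, num_features: list, target: str) -> None:
    """Fail fast on the problems listed above (illustrative bounds)."""
    # No missing values in features or target
    missing = df[num_features + [target]].isna().sum()
    if missing.any():
        raise ValueError(f"Missing values found:\n{missing[missing > 0]}")

    # Expected dtypes for numeric features
    non_numeric = [c for c in num_features if not pd.api.types.is_numeric_dtype(df[c])]
    if non_numeric:
        raise TypeError(f"Expected numeric dtype for: {non_numeric}")

    # Example range check: power consumption should not be negative
    if (df[target] < 0).any():
        raise ValueError("Target contains negative values")
```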


103-116: Add docstrings and type hints

The PowerConsumptionModelWrapper class lacks proper documentation. Consider adding:

  • Class-level docstring explaining the purpose
  • Method docstrings with parameters and return types
  • Type hints for better code maintainability
 class PowerConsumptionModelWrapper(mlflow.pyfunc.PythonModel):
+    """Wrapper for power consumption model that handles prediction adjustments.
+    
+    This wrapper ensures predictions are properly adjusted before being returned.
+    """
     
-    def __init__(self, model):
+    def __init__(self, model: Pipeline) -> None:
+        """Initialize the wrapper with a trained model.
+        
+        Args:
+            model: Trained scikit-learn pipeline
+        """
         self.model = model
         
-    def predict(self, context, model_input):
+    def predict(self, context, model_input: pd.DataFrame) -> dict:
+        """Generate predictions for the input data.
+        
+        Args:
+            context: MLflow context (unused)
+            model_input: Input features as a pandas DataFrame
+            
+        Returns:
+            dict: Adjusted predictions
+            
+        Raises:
+            ValueError: If input is not a pandas DataFrame
+        """
🧰 Tools
🪛 Ruff

104-104: Blank line contains whitespace

Remove whitespace from blank line

(W293)


107-107: Blank line contains whitespace

Remove whitespace from blank line

(W293)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 19d3577 and 45ba01c.

📒 Files selected for processing (2)
  • notebooks/week2/02_04_train_log_custom_model.py (1 hunks)
  • pyproject.toml (1 hunks)
🧰 Additional context used
🪛 Ruff
notebooks/week2/02_04_train_log_custom_model.py

3-3: numpy imported but unused

Remove unused import: numpy

(F401)


10-10: mlflow.utils.environment._mlflow_conda_env imported but unused

Remove unused import: mlflow.utils.environment._mlflow_conda_env

(F401)


15-15: Redefinition of unused mlflow from line 2

Remove definition: mlflow

(F811)


16-16: Redefinition of unused infer_signature from line 6

Remove definition: infer_signature

(F811)


94-94: Blank line contains whitespace

Remove whitespace from blank line

(W293)


104-104: Blank line contains whitespace

Remove whitespace from blank line

(W293)


107-107: Blank line contains whitespace

Remove whitespace from blank line

(W293)


130-130: Blank line contains whitespace

Remove whitespace from blank line

(W293)


156-156: Trailing whitespace

Remove trailing whitespace

(W291)


157-157: Blank line contains whitespace

Remove whitespace from blank line

(W293)


164-164: Found useless expression. Either assign it to a variable or remove it.

(B018)

🔇 Additional comments (3)
pyproject.toml (2)

14-14: Verify compatibility with pyarrow 14.0.1

The pyarrow version has been downgraded to 14.0.1. This version might not be fully compatible with your current pandas 2.2.3 and other dependencies.

Let's check for potential compatibility issues:

✅ Verification successful

pyarrow 14.0.1 is compatible with pandas 2.2.3

The downgrade to pyarrow 14.0.1 is safe as it exceeds pandas' minimum requirement of pyarrow>=10.0.1. No compatibility issues are expected.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check compatibility between pyarrow and pandas versions
curl -s "https://pypi.org/pypi/pandas/2.2.3/json" | jq -r '.info.requires_dist[]' | grep -i "pyarrow"

Length of output: 239


26-26: Consider implications of databricks-connect downgrade

The databricks-connect version has been downgraded from >=15.4.1 to ==15.3.4. While this might be an attempt to resolve the 401 error, older versions could introduce other compatibility issues.

Let's check the changelog for breaking changes:

notebooks/week2/02_04_train_log_custom_model.py (1)

122-122: Clarify TODO comment

The comment "TODO: add code again" is vague. Please specify what code needs to be added and why.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (4)
notebooks/week2/02_04_train_log_custom_model.py (4)

93-98: Remove commented-out code to maintain code cleanliness

The block of code from lines 93 to 97 is commented out. If it's no longer needed, consider removing it to keep the codebase clean and maintainable.

- # mlflow.sklearn.log_model(
- #     sk_model=pipeline,
- #     artifact_path="lightgbm-pipeline-model",
- #     signature=signature
- # )

124-126: Evaluate the necessity of multiple MLflow runs

Starting a new MLflow run immediately after the previous one may not be necessary unless you're logging separate experiments or stages.

Consider consolidating the runs if they represent a single experimental flow:

- with mlflow.start_run(tags={"branch": "week2",
-                             "git_sha": f"{git_sha}"}) as run:
+ # Continue using the existing MLflow run for logging
🧰 Tools
🪛 Ruff

126-126: Blank line contains whitespace

Remove whitespace from blank line

(W293)


139-144: Ensure consistent model naming conventions

The model_name includes dots, which may cause confusion or conflicts in some systems. Verify that this naming convention aligns with your organization's standards.

Consider using underscores or hyphens:

- model_name = f"{catalog_name}.{schema_name}.power_consumption_model_pyfunc"
+ model_name = f"{catalog_name}_{schema_name}_power_consumption_model_pyfunc"

160-160: Remove unnecessary expression to clean up the script

The standalone model expression on line 160 has no effect and can be removed.

- model
🧰 Tools
🪛 Ruff

160-160: Found useless expression. Either assign it to a variable or remove it.

(B018)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 45ba01c and 17ab690.

📒 Files selected for processing (2)
  • notebooks/week2/01.prepare_dataset_n.py (1 hunks)
  • notebooks/week2/02_04_train_log_custom_model.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • notebooks/week2/01.prepare_dataset_n.py
🧰 Additional context used
🪛 Ruff
notebooks/week2/02_04_train_log_custom_model.py

7-7: numpy imported but unused

Remove unused import: numpy

(F401)


92-92: Blank line contains whitespace

Remove whitespace from blank line

(W293)


102-102: Blank line contains whitespace

Remove whitespace from blank line

(W293)


105-105: Blank line contains whitespace

Remove whitespace from blank line

(W293)


126-126: Blank line contains whitespace

Remove whitespace from blank line

(W293)


152-152: Trailing whitespace

Remove trailing whitespace

(W291)


153-153: Blank line contains whitespace

Remove whitespace from blank line

(W293)


160-160: Found useless expression. Either assign it to a variable or remove it.

(B018)

🔇 Additional comments (3)
notebooks/week2/02_04_train_log_custom_model.py (3)

119-119: Confirm that the correct model is wrapped

Ensure that the pipeline passed to PowerConsumptionModelWrapper is the trained model intended for deployment.

You can verify that pipeline has been trained:

#!/bin/bash
# Description: Check that 'pipeline' is the trained model.

# Look for the pipeline fitting step
rg 'pipeline.fit'

# Ensure no reassignments occur after training
rg 'pipeline\s*=.*'

129-132: ⚠️ Potential issue

Provide actual samples to infer_signature for accurate schema inference

Passing empty lists to infer_signature may result in an inaccurate model schema. Use representative samples of your input and output data.

Apply this diff:

mlflow.pyfunc.log_model(
    python_model=wrapped_model,
    artifact_path="pyfunc-power-consumption-model",
-   signature=infer_signature(model_input=[], model_output=[])
+   signature=infer_signature(model_input=X_test, model_output=y_pred)
)

Likely invalid or redundant comment.


151-152: 🛠️ Refactor suggestion

Improve the model version alias for clarity

The alias "the_best_model" is generic. Using a more descriptive alias enhances readability and version management.

Consider including version numbers or environment:

- model_version_alias = "the_best_model"
+ model_version_alias = "production_v1"
client.set_registered_model_alias(model_name, model_version_alias, "1")

Likely invalid or redundant comment.

🧰 Tools
🪛 Ruff

152-152: Trailing whitespace

Remove trailing whitespace

(W291)


@mvechtomova mvechtomova left a comment


I will approve to unblock you. Please resolve the points I mentioned.

- Install src package with `uv pip install -e .`
+ Install src package locally with `uv pip install -e .`

Install src package on cluster in notebook with `pip install dbfs:/Volumes/main/default/file_exchange/nico/power_consumption-0.0.1-py3-none-any.whl`


It does not install it on the cluster that way, just in your notebook env.

Collaborator Author


@mvechtomova I know. But when I run it with Databricks Connect, I need it locally, not on the cluster, right? Because the code executes locally except for the Spark code.


Maybe a good idea to put it in .gitignore; it is a bug in the feature engineering package that causes the creation of this folder.

try:
    config = ProjectConfig.from_yaml(config_path="project_config.yml")
except Exception:
    config = ProjectConfig.from_yaml(config_path="../../project_config.yml")


Why do you need this logic? It does not matter that much once we move into DABs, just curious.

Collaborator Author


I had issues with the path, depending on how I run the script. With this, it is executable from the root path but also from the notebook path.
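An alternative that avoids the try/except is to walk up from the working directory until the config is found — a minimal sketch (the find_config helper is hypothetical; only the project_config.yml filename comes from this PR):

```python
from pathlib import Path

def find_config(name: str = "project_config.yml") -> Path:
    """Walk up from the current working directory until the config file
    is found, so the script runs from the repo root or a notebook dir."""
    for base in [Path.cwd(), *Path.cwd().parents]:
        candidate = base / name
        if candidate.exists():
            return candidate
    raise FileNotFoundError(f"{name} not found above {Path.cwd()}")
```

With this, ProjectConfig.from_yaml(config_path=str(find_config())) works from the repo root as well as from the notebook directory.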

input_bindings={"Temperature": "RoundedTemp"},
),
],
# exclude_columns=["bla"]


You need to exclude columns (like timestamp), because otherwise the serving endpoint will expect that column to be there.

Collaborator Author

@NiGroUni NiGroUni Nov 4, 2024


I only have the id column and the features in the feature table, so there are no columns to exclude in my opinion.
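For reference, exclusion is configured when the training set is created, via the exclude_columns argument of FeatureEngineeringClient.create_training_set — a hedged sketch, with illustrative label and column names (not taken from this PR):

```python
training_set = fe.create_training_set(
    df=train_df,                      # Spark DataFrame with id, label, timestamp
    feature_lookups=feature_lookups,  # the lookups / feature functions above
    label="power_consumption",        # illustrative label name
    exclude_columns=["timestamp"],    # illustrative; excluded columns are not
                                      # part of the logged model's input schema
)
```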

training_df = training_set.load_df().toPandas()

# Split features and target
X_train = training_df[num_features]


I noticed that temperature_rounded (from the Feature Function) is not part of num_features, so it will not be used for training.

Collaborator Author


fixed here b9c253b
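The fix boils down to appending the Feature Function output to the training columns — a stand-in sketch with fabricated data (column names follow the PR where known; the values are invented):

```python
import pandas as pd

# Stand-in for training_set.load_df().toPandas(); values are fabricated
training_df = pd.DataFrame({
    "Temperature": [21.3, 19.8, 23.1],
    "Humidity": [55.0, 60.2, 48.7],
    "temperature_rounded": [21.0, 20.0, 23.0],  # Feature Function output
    "power_consumption": [3012.5, 2890.1, 3150.0],
})

num_features = ["Temperature", "Humidity"]
# Append the engineered column so the model is actually trained on it
feature_cols = num_features + ["temperature_rounded"]

X_train = training_df[feature_cols]
y_train = training_df["power_consumption"]
print(list(X_train.columns))
```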

@NiGroUni
Collaborator Author

NiGroUni commented Nov 4, 2024

I will merge. Will fix in week3 PR

@NiGroUni NiGroUni merged commit c49e1aa into main Nov 4, 2024
2 checks passed