week2 updates #2
Conversation
Walkthrough

This pull request introduces several significant changes across multiple files, primarily enhancing the setup instructions in the README.
Actionable comments posted: 26
🧹 Outside diff range and nitpick comments (16)
src/power_consumption/utils.py (1)
`1-3`: **Consider adding unit tests.**

This utility function is likely critical for your ML pipeline's output. Ensure it's thoroughly tested with various input types and edge cases.
Would you like me to help create comprehensive unit tests for this function? I can generate test cases covering:
- Different input types (float, list, numpy array)
- Edge cases (empty arrays, zero values)
- Error conditions (invalid scale factors)
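A minimal pytest sketch along these lines (assuming the package is installed locally with `uv pip install -e .`, and that `adjust_predictions(predictions, scale_factor=1.3)` simply multiplies its input by the scale factor, as shown later in this review) could look like:

```python
import numpy as np
import pytest

from power_consumption.utils import adjust_predictions


def test_scalar_and_array_inputs():
    # A float and a numpy array should both scale by the default factor of 1.3
    assert adjust_predictions(10.0) == pytest.approx(13.0)
    np.testing.assert_allclose(adjust_predictions(np.array([1.0, 2.0])), [1.3, 2.6])


def test_zero_values_and_empty_array():
    # Edge cases: zeros stay zero, empty arrays stay empty
    assert adjust_predictions(0.0) == 0.0
    assert adjust_predictions(np.array([])).size == 0


def test_custom_scale_factor():
    # An explicit scale factor overrides the default
    assert adjust_predictions(2.0, scale_factor=0.5) == pytest.approx(1.0)
```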
project_config.yml (2)
`18-18`: **Add newline at end of file.**

The YAML file is missing a newline character at the end of the file. While this doesn't affect functionality, it's a common convention that helps with file concatenation and git diffs.

```diff
 mlflow_experiment_name: /Shared/mlops_course_nico
+
```
🧰 Tools
🪛 yamllint
[error] 18-18: no new line character at the end of file
(new-line-at-end-of-file)
Line range hint `4-7`: **Document ML hyperparameters.**

Consider adding comments to explain the choice of hyperparameters:
- Why learning_rate = 0.01?
- Why n_estimators = 1000?
- Why max_depth = 6?
This will help future maintainers understand the reasoning behind these specific values.
```diff
 parameters:
+  # Smaller learning rate for better convergence
   learning_rate: 0.01
+  # High number of trees for robust ensemble
   n_estimators: 1000
+  # Moderate depth to prevent overfitting
   max_depth: 6
```

pyproject.toml (2)
Line range hint `26-30`: **Resolve version conflict in databricks-sdk.**

There's a potential version conflict:
- Main dependencies: databricks-sdk==0.32.0
- Optional dependencies: databricks-sdk>=0.32.0, <0.33
While they don't strictly conflict, it's better to align them exactly to prevent any subtle compatibility issues.
```diff
 dev = ["databricks-connect==15.3.*",
-       "databricks-sdk>=0.32.0, <0.33",
+       "databricks-sdk==0.32.0",
        "ipykernel>=6.29.5, <7",
        "pip>=24.2",
        "pre-commit"]
```
`7-22`: **Authentication troubleshooting advice.**

Regarding the 401 Unauthorized error:
- Ensure you have proper Databricks credentials configured
- If you can't access Databricks due to company restrictions, consider:
- Setting up a local development environment
- Using mock data for development
- Implementing feature engineering locally without Databricks dependencies
Would you like assistance in setting up a local development environment that doesn't require Databricks access?
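Purely as an illustration of the mock-data route (nothing below is taken from the PR; the column names are placeholders), a small sketch that builds a synthetic pandas DataFrame so the preprocessing and model code can be exercised without any Databricks connection:

```python
import numpy as np
import pandas as pd

# Placeholder column names for illustration only; the real ones come from project_config.yml
num_features = ["feature_1", "feature_2"]
target = "power_consumption"
id_col = "id"

rng = np.random.default_rng(42)
mock_df = pd.DataFrame(
    {
        id_col: range(100),
        target: rng.normal(loc=100.0, scale=10.0, size=100),
        **{col: rng.normal(size=100) for col in num_features},
    }
)

# The mock frame can now be passed to the pandas-based preprocessing and model code locally
print(mock_df.head())
```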
README.md (2)
Line range hint `30-37`: **Add authentication troubleshooting section**

Given the 401 Unauthorized Error you're encountering, it would be helpful to add a troubleshooting section about Databricks authentication. The current instructions assume successful authentication but don't address common issues.
Add this section after the upload example:
### Troubleshooting Authentication

If you encounter a 401 Unauthorized Error:

1. Verify your Databricks access token:
   ```bash
   databricks auth validate
   ```
2. If invalid, generate a new token in the Databricks UI: User Settings → Access Tokens
3. Configure the token:
   ```bash
   databricks auth configure --token
   ```

For users with company restrictions, please consult your IT department about:

- Required permissions for feature store access
- Any proxy settings needed for authentication

---

Line range hint `34-36`: **Update example to use a generic path**

The upload example also contains a hardcoded username which should be replaced with a placeholder.

```diff
-databricks fs cp dist\power_consumption-0.0.1-py3-none-any.whl dbfs:/Volumes/main/default/file_exchange/nico
+databricks fs cp dist\power_consumption-0.0.1-py3-none-any.whl dbfs:/Volumes/main/default/file_exchange/<your-username>
```
src/power_consumption/model.py (2)
`23-27`: **Consider logging evaluation metrics for better visibility**

While the `evaluate` method returns the MSE and R-squared scores, adding logging statements can help track model performance during training and debugging.

Example addition:

```python
import logging

def evaluate(self, X_test, y_test):
    y_pred = self.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    logging.info(f"Evaluation results - MSE: {mse}, R2 Score: {r2}")
    return mse, r2
```
`5-32`: **Add docstrings to the class and its methods for clarity**

Including docstrings will enhance code readability and maintainability by providing clear documentation of the class's purpose and how each method functions.

Example:

```python
class PriceModel:
    """Machine learning model for price prediction using a Random Forest regressor."""

    def __init__(self, preprocessor, config):
        """
        Initialize the PriceModel with a preprocessor and configuration.

        Parameters:
            preprocessor : Transformer object
                The preprocessing pipeline to prepare the data.
            config : dict
                Configuration dictionary containing model parameters.
        """
        ...

    def train(self, X_train, y_train):
        """
        Fit the model to the training data.

        Parameters:
            X_train : array-like
                Training feature data.
            y_train : array-like
                Training target data.
        """
        ...

    def predict(self, X):
        """
        Generate predictions for the input data.

        Parameters:
            X : array-like
                Input feature data.

        Returns:
            array-like
                Predicted target values.
        """
        ...

    def evaluate(self, X_test, y_test):
        """
        Evaluate the model's performance on the test data.

        Parameters:
            X_test : array-like
                Test feature data.
            y_test : array-like
                True target values for the test set.

        Returns:
            tuple
                Mean squared error and R-squared score.
        """
        ...

    def get_feature_importance(self):
        """
        Retrieve the feature importances and corresponding feature names.

        Returns:
            tuple
                Feature importances and feature names.
        """
        ...
```

src/power_consumption/data_processor.py (2)
`20-20`: **Remove print statement from constructor**

Printing `self.df.head()` in the constructor can clutter the output and is not recommended for production code.

Apply this diff to remove the print statement:

```diff
-        print(self.df.head())
```
`25-38`: **Optimize data preprocessing using Spark DataFrames**

Processing data with pandas DataFrames may not scale well for large datasets. Consider using Spark DataFrame operations for preprocessing to improve performance and scalability.

Refactor the `preprocess` method to use Spark operations:

```diff
-    def preprocess(self):
-        # Handle numeric features
-        num_features = self.config.num_features
-        for col in num_features:
-            self.df[col] = pd.to_numeric(self.df[col], errors='coerce')
-        # Fill missing values with mean or default values
-        self.df.fillna(0, inplace=True)
-        # Extract target and relevant features
-        target = self.config.target
-        id_col = self.config.id_col
-        relevant_columns = num_features + [target] + [id_col]
-        self.df = self.df[relevant_columns]
+    def preprocess(self):
+        # Handle numeric features and fill missing values using Spark
+        num_features = self.config.num_features
+        for col in num_features:
+            self.df = self.df.withColumn(col, self.df[col].cast("double"))
+            self.df = self.df.fillna({col: 0})
+        # Extract target and relevant features
+        target = self.config.target
+        id_col = self.config.id_col
+        relevant_columns = num_features + [target, id_col]
+        self.df = self.df.select(relevant_columns)
```

notebooks/week2/02_04_train_log_custom_model.py (3)
`26-30`: **Catch specific exceptions instead of a broad `Exception`**

In the `try-except` block, you are catching all exceptions with `except Exception`. It's better practice to catch specific exceptions to avoid masking unexpected errors. For example, if you're expecting a `FileNotFoundError`, catch it explicitly.

Apply this diff:

```diff
 try:
     config = ProjectConfig.from_yaml(config_path="project_config.yml")
-except Exception:
+except FileNotFoundError:
     config = ProjectConfig.from_yaml(config_path="../../project_config.yml")
```
`63-63`: **Replace hardcoded `git_sha` with the actual Git commit SHA**

The `git_sha` variable is currently hardcoded as `"bla"`. It's recommended to dynamically retrieve the actual Git commit SHA to ensure accurate tracking and reproducibility in MLflow.

You can retrieve the Git SHA using the `subprocess` module:

```python
import subprocess

git_sha = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode("ascii").strip()
```

Ensure that Git is available in the execution environment when using this approach.
`97-101`: **Remove commented-out code or provide an explanation**

The code for logging the sklearn model is commented out. If it's no longer needed, consider removing it. If it's intentionally left for future reference, add a comment explaining why it's commented out.

If not needed, remove the commented code:

```diff
-# mlflow.sklearn.log_model(
-#     sk_model=pipeline,
-#     artifact_path="lightgbm-pipeline-model",
-#     signature=signature
-# )
```

notebooks/week2/05.log_and_register_fe_model.py (2)
`143-143`: **Replace placeholder `git_sha` with the actual Git SHA.**

The `git_sha` variable is currently set to `"bla"`. Using the actual Git commit SHA improves traceability in MLflow experiments.

Consider updating `git_sha` dynamically:

```diff
-git_sha = "bla"
+import subprocess
+git_sha = subprocess.check_output(["git", "rev-parse", "HEAD"]).decode("utf-8").strip()
```

Ensure that the script is run in an environment where Git is available.
`145-147`: **Secure handling of MLflow run tags.**

When starting an MLflow run, you're manually setting tags for `branch` and `git_sha`. Ensure that sensitive information is not exposed in these tags.

If the `branch` name or other tags might contain sensitive data, consider reviewing them before logging.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (14)
- README.md (1 hunks)
- mlruns/0/meta.yaml (1 hunks)
- model_version.json (1 hunks)
- notebooks/week2/01.prepare_dataset_n.py (1 hunks)
- notebooks/week2/02_04_train_log_custom_model.py (1 hunks)
- notebooks/week2/05.log_and_register_fe_model.py (1 hunks)
- notebooks/week2/mlruns/0/meta.yaml (1 hunks)
- notebooks/week2/model_version.json (1 hunks)
- project_config.yml (2 hunks)
- pyproject.toml (1 hunks)
- src/power_consumption/config.py (1 hunks)
- src/power_consumption/data_processor.py (2 hunks)
- src/power_consumption/model.py (1 hunks)
- src/power_consumption/utils.py (1 hunks)
✅ Files skipped from review due to trivial changes (4)
- mlruns/0/meta.yaml
- model_version.json
- notebooks/week2/mlruns/0/meta.yaml
- notebooks/week2/model_version.json
🧰 Additional context used
🪛 yamllint
project_config.yml
[error] 18-18: no new line character at the end of file
(new-line-at-end-of-file)
🔇 Additional comments (8)
src/power_consumption/utils.py (1)
`1-3`: **Verify integration with ML pipeline.**

Since this function adjusts model predictions, it's important to understand where and how it fits in your prediction pipeline.

Let's check how this function is being used:

```bash
#!/bin/bash
# Search for references to adjust_predictions
rg "adjust_predictions" --type py -A 5
```

Consider adding logging to track when predictions are adjusted, which could be valuable for monitoring and debugging:

```diff
+import logging
+
+logger = logging.getLogger(__name__)
+
 def adjust_predictions(predictions, scale_factor=1.3):
+    logger.info(f"Adjusting predictions with scale_factor={scale_factor}")
     return predictions * scale_factor
```

project_config.yml (2)
`1-2`: **Verify sandbox database permissions.**

The configuration is pointing to a sandbox catalog and schema. Given the 401 Unauthorized error you're encountering, please verify:

- You have been granted access to the `sandbox` catalog
- You have appropriate permissions on the `sb_adan` schema

If you're unable to access Databricks directly, you may need to:

- Request access from your Databricks workspace admin
- Use a different catalog/schema where you have permissions
- Consider setting up a local development environment with sample data for testing
`18-18`: **Review MLflow experiment path permissions.**

The MLflow experiment path `/Shared/mlops_course_nico` might require specific access rights. Since you're experiencing authorization issues:

- Ensure you have write permissions to the `/Shared` directory
- Consider using a personal workspace path like `/Users/<your-email>/mlops_course_nico` instead

If you continue to experience issues:

- Check if you can create experiments in your personal workspace
- Verify MLflow is properly configured in your environment
- Consider using local MLflow tracking for development
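On that last point, a minimal sketch of switching to local MLflow tracking during development (the tracking path and experiment name below are illustrative, not taken from the PR):

```python
import mlflow

# Track runs in a local ./mlruns directory instead of the Databricks workspace
mlflow.set_tracking_uri("file:./mlruns")
mlflow.set_experiment("mlops_course_local")  # illustrative experiment name

with mlflow.start_run(tags={"branch": "week2"}):
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_metric("mse", 0.0)
```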
🧰 Tools
🪛 yamllint
[error] 18-18: no new line character at the end of file
(new-line-at-end-of-file)
src/power_consumption/config.py (1)
`1-3`: **LGTM! Imports are appropriate and secure.**

Good choice using `yaml.safe_load` for secure YAML parsing and Pydantic for data validation.
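As a rough sketch of the pattern being praised here (this is not the actual contents of config.py; the field names are inferred from how the config is used elsewhere in this review and should be treated as assumptions):

```python
import yaml
from pydantic import BaseModel


class ProjectConfig(BaseModel):
    # Field names inferred from usage elsewhere in this PR
    catalog_name: str
    schema_name: str
    num_features: list[str]
    target: str
    id_col: str
    parameters: dict

    @classmethod
    def from_yaml(cls, config_path: str) -> "ProjectConfig":
        # yaml.safe_load refuses to construct arbitrary Python objects
        with open(config_path) as f:
            return cls(**yaml.safe_load(f))
```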
1-7
: Add error handling for Spark session creation and verify authentication configuration.Given the 401 Unauthorized Error mentioned in the PR, ensure proper authentication is configured. Consider adding error handling for Spark session creation and explicitly setting authentication parameters if needed.
Example error handling:
-spark = SparkSession.builder.getOrCreate() +try: + spark = SparkSession.builder.getOrCreate() +except Exception as e: + print(f"Failed to create Spark session: {str(e)}") + raisesrc/power_consumption/data_processor.py (1)
`29-29`: **Verify conversion of columns to numeric types**

Using `pd.to_numeric` with `errors='coerce'` will replace non-numeric values with `NaN`. Ensure that this behavior is intended and that subsequent missing value handling is appropriate.
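For reference, a quick illustration of the `errors='coerce'` behavior:

```python
import pandas as pd

s = pd.Series(["1.5", "2", "not_a_number"])
print(pd.to_numeric(s, errors="coerce"))
# 0    1.5
# 1    2.0
# 2    NaN
# dtype: float64
```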
notebooks/week2/05.log_and_register_fe_model.py (2)

`175-177`: **Verify model registration parameters.**

When registering the model with MLflow, ensure that the model URI and name are correctly specified. The current model URI relies on `run_id`, which should be valid.

Run the following script to confirm that the model artifact exists at the specified URI:

```bash
#!/bin/bash
# Description: Verify that the model artifact exists in MLflow.
# Expected result: The artifact path should list the model.

# List artifacts for the run
mlflow artifacts list --run-id "$run_id" --artifact-path "lightgbm-pipeline-model-fe"
```

Ensure that the `mlflow` CLI is installed and configured with your Databricks tracking URI.
`104-105`: ⚠️ **Potential issue: Investigate the 401 Unauthorized Error during feature engineering.**

The TODO comment mentions an unauthorized error when using `fe.create_training_set`. This error typically occurs due to insufficient permissions or misconfigured authentication.

Please ensure that:
- You have the necessary permissions to use the Feature Store and Feature Engineering services.
- Your workspace's feature store is properly configured.
- Your authentication credentials are correctly set up.
Would you like assistance in troubleshooting this issue?
Run the following script to check your entitlements:

Replace `$DATABRICKS_TOKEN` with your Databricks personal access token and `$DATABRICKS_INSTANCE` with your workspace URL (e.g., `adb-1234567890123456.7.azuredatabricks.net`).
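The script itself is collapsed in this view; purely as an assumption about what such a check could look like, a minimal Python sketch that inspects the current user's entitlements via the Databricks SCIM `Me` endpoint might be:

```python
import json
import os
import urllib.request

# Assumed environment variables, mirroring the placeholders mentioned above
token = os.environ["DATABRICKS_TOKEN"]
instance = os.environ["DATABRICKS_INSTANCE"]  # e.g. adb-1234567890123456.7.azuredatabricks.net

req = urllib.request.Request(
    f"https://{instance}/api/2.0/preview/scim/v2/Me",
    headers={"Authorization": f"Bearer {token}"},
)
with urllib.request.urlopen(req) as resp:
    me = json.load(resp)

# Print only the entitlements block of the current user
print(json.dumps(me.get("entitlements", []), indent=2))
```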
Actionable comments posted: 5
🧹 Outside diff range and nitpick comments (2)
notebooks/week2/02_04_train_log_custom_model.py (2)
`45-49`: **Add data validation checks**

Consider adding validation checks for the input data before training:
- Verify that there are no missing values
- Check for expected data types
- Validate value ranges for numerical features
Would you like me to provide an example implementation of these validation checks?
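A rough sketch of what such checks might look like (the function name and the non-negativity check are placeholders, not taken from the PR):

```python
import pandas as pd


def validate_training_data(df: pd.DataFrame, num_features: list[str], target: str) -> None:
    """Basic sanity checks before training; raises ValueError on problems."""
    required = num_features + [target]

    # No missing values in the columns used for training
    missing = df[required].isna().sum()
    if missing.any():
        raise ValueError(f"Missing values found:\n{missing[missing > 0]}")

    # Expected data types: all training columns should be numeric
    non_numeric = [c for c in required if not pd.api.types.is_numeric_dtype(df[c])]
    if non_numeric:
        raise ValueError(f"Non-numeric columns: {non_numeric}")

    # Value ranges: power consumption should not be negative (illustrative check)
    if (df[target] < 0).any():
        raise ValueError(f"Negative values found in target column '{target}'")
```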
`103-116`: **Add docstrings and type hints**

The `PowerConsumptionModelWrapper` class lacks proper documentation. Consider adding:
- Class-level docstring explaining the purpose
- Method docstrings with parameters and return types
- Type hints for better code maintainability
```diff
 class PowerConsumptionModelWrapper(mlflow.pyfunc.PythonModel):
+    """Wrapper for power consumption model that handles prediction adjustments.
+
+    This wrapper ensures predictions are properly adjusted before being returned.
+    """
-    def __init__(self, model):
+    def __init__(self, model: Pipeline) -> None:
+        """Initialize the wrapper with a trained model.
+
+        Args:
+            model: Trained scikit-learn pipeline
+        """
         self.model = model
-    def predict(self, context, model_input):
+    def predict(self, context, model_input: pd.DataFrame) -> dict:
+        """Generate predictions for the input data.
+
+        Args:
+            context: MLflow context (unused)
+            model_input: Input features as a pandas DataFrame
+
+        Returns:
+            dict: Adjusted predictions
+
+        Raises:
+            ValueError: If input is not a pandas DataFrame
+        """
```

🧰 Tools
🪛 Ruff
104-104: Blank line contains whitespace
Remove whitespace from blank line
(W293)
107-107: Blank line contains whitespace
Remove whitespace from blank line
(W293)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- notebooks/week2/02_04_train_log_custom_model.py (1 hunks)
- pyproject.toml (1 hunks)
🧰 Additional context used
🪛 Ruff
notebooks/week2/02_04_train_log_custom_model.py
3-3: `numpy` imported but unused
Remove unused import: `numpy`
(F401)
10-10: `mlflow.utils.environment._mlflow_conda_env` imported but unused
Remove unused import: `mlflow.utils.environment._mlflow_conda_env`
(F401)
15-15: Redefinition of unused `mlflow` from line 2
Remove definition: `mlflow`
(F811)
16-16: Redefinition of unused `infer_signature` from line 6
Remove definition: `infer_signature`
(F811)
94-94: Blank line contains whitespace
Remove whitespace from blank line
(W293)
104-104: Blank line contains whitespace
Remove whitespace from blank line
(W293)
107-107: Blank line contains whitespace
Remove whitespace from blank line
(W293)
130-130: Blank line contains whitespace
Remove whitespace from blank line
(W293)
156-156: Trailing whitespace
Remove trailing whitespace
(W291)
157-157: Blank line contains whitespace
Remove whitespace from blank line
(W293)
164-164: Found useless expression. Either assign it to a variable or remove it.
(B018)
🔇 Additional comments (3)
pyproject.toml (2)
`14-14`: **Verify compatibility with pyarrow 14.0.1**

The pyarrow version has been downgraded to 14.0.1. This version might not be fully compatible with your current pandas 2.2.3 and other dependencies.
Let's check for potential compatibility issues:
✅ Verification successful
pyarrow 14.0.1 is compatible with pandas 2.2.3
The downgrade to pyarrow 14.0.1 is safe as it exceeds pandas' minimum requirement of pyarrow>=10.0.1. No compatibility issues are expected.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Check compatibility between pyarrow and pandas versions
curl -s "https://pypi.org/pypi/pandas/2.2.3/json" | jq -r '.info.requires_dist[]' | grep -i "pyarrow"
```

Length of output: 239
`26-26`: **Consider implications of databricks-connect downgrade**

The databricks-connect version has been downgraded from `>=15.4.1` to `==15.3.4`. While this might be an attempt to resolve the 401 error, older versions could introduce other compatibility issues.

Let's check the changelog for breaking changes:
notebooks/week2/02_04_train_log_custom_model.py (1)
`122-122`: **Clarify TODO comment**

The comment "TODO: add code again" is vague. Please specify what code needs to be added and why.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (4)
notebooks/week2/02_04_train_log_custom_model.py (4)
`93-98`: **Remove commented-out code to maintain code cleanliness**

The block of code from lines 93 to 97 is commented out. If it's no longer needed, consider removing it to keep the codebase clean and maintainable.

```diff
-# mlflow.sklearn.log_model(
-#     sk_model=pipeline,
-#     artifact_path="lightgbm-pipeline-model",
-#     signature=signature
-# )
```
`124-126`: **Evaluate the necessity of multiple MLflow runs**

Starting a new MLflow run immediately after the previous one may not be necessary unless you're logging separate experiments or stages.

Consider consolidating the runs if they represent a single experimental flow:

```diff
-with mlflow.start_run(tags={"branch": "week2",
-                            "git_sha": f"{git_sha}"}) as run:
+# Continue using the existing MLflow run for logging
```

🧰 Tools
🪛 Ruff
126-126: Blank line contains whitespace
Remove whitespace from blank line
(W293)
`139-144`: **Ensure consistent model naming conventions**

The `model_name` includes dots, which may cause confusion or conflicts in some systems. Verify that this naming convention aligns with your organization's standards.

Consider using underscores or hyphens:

```diff
-model_name = f"{catalog_name}.{schema_name}.power_consumption_model_pyfunc"
+model_name = f"{catalog_name}_{schema_name}_power_consumption_model_pyfunc"
```
`160-160`: **Remove unnecessary expression to clean up the script**

The standalone `model` expression on line 160 has no effect and can be removed.

```diff
-model
```
🧰 Tools
🪛 Ruff
160-160: Found useless expression. Either assign it to a variable or remove it.
(B018)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- notebooks/week2/01.prepare_dataset_n.py (1 hunks)
- notebooks/week2/02_04_train_log_custom_model.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- notebooks/week2/01.prepare_dataset_n.py
🧰 Additional context used
🪛 Ruff
notebooks/week2/02_04_train_log_custom_model.py
7-7: `numpy` imported but unused
Remove unused import: `numpy`
(F401)
92-92: Blank line contains whitespace
Remove whitespace from blank line
(W293)
102-102: Blank line contains whitespace
Remove whitespace from blank line
(W293)
105-105: Blank line contains whitespace
Remove whitespace from blank line
(W293)
126-126: Blank line contains whitespace
Remove whitespace from blank line
(W293)
152-152: Trailing whitespace
Remove trailing whitespace
(W291)
153-153: Blank line contains whitespace
Remove whitespace from blank line
(W293)
160-160: Found useless expression. Either assign it to a variable or remove it.
(B018)
🔇 Additional comments (3)
notebooks/week2/02_04_train_log_custom_model.py (3)
`119-119`: **Confirm that the correct model is wrapped**

Ensure that the `pipeline` passed to `PowerConsumptionModelWrapper` is the trained model intended for deployment.

You can verify that `pipeline` has been trained:

```bash
#!/bin/bash
# Description: Check that 'pipeline' is the trained model.

# Look for the pipeline fitting step
rg 'pipeline.fit'

# Ensure no reassignments occur after training
rg 'pipeline\s*=.*'
```
`129-132`: ⚠️ **Potential issue: Provide actual samples to `infer_signature` for accurate schema inference**

Passing empty lists to `infer_signature` may result in an inaccurate model schema. Use representative samples of your input and output data.

Apply this diff:

```diff
 mlflow.pyfunc.log_model(
     python_model=wrapped_model,
     artifact_path="pyfunc-power-consumption-model",
-    signature=infer_signature(model_input=[], model_output=[])
+    signature=infer_signature(model_input=X_test, model_output=y_pred)
 )
```

Likely invalid or redundant comment.
`151-152`: 🛠️ **Refactor suggestion: Improve the model version alias for clarity**

The alias "the_best_model" is generic. Using a more descriptive alias enhances readability and version management.

Consider including version numbers or environment:

```diff
-model_version_alias = "the_best_model"
+model_version_alias = "production_v1"
 client.set_registered_model_alias(model_name, model_version_alias, "1")
```

Likely invalid or redundant comment.
🧰 Tools
🪛 Ruff
152-152: Trailing whitespace
Remove trailing whitespace
(W291)
I will approve to unblock you. Please resolve the points I mentioned.
```diff
-Install src package with `uv pip install -e .`
+Install src package locally with `uv pip install -e .`
+
+Install src package on cluster in notebook with `pip install dbfs:/Volumes/main/default/file_exchange/nico/power_consumption-0.0.1-py3-none-any.whl`
```
it does not install it in the cluster that way, just in your notebook env.
@mvechtomova I know. But when I run it with databricks connect I need it locally and not in my cluster right? Because the code gets executed locally except for the spark code.
Maybe a good idea to put it in .gitignore; it is a bug from the feature engineering package that causes the creation of this folder.
```python
try:
    config = ProjectConfig.from_yaml(config_path="project_config.yml")
except Exception:
    config = ProjectConfig.from_yaml(config_path="../../project_config.yml")
```
Why do you need this logic? It does not matter that much when we move into DABs, just curious.
I had issues with the path, depending on how I run the script. With this it is executable from the root path but also from the notebook path
```python
            input_bindings={"Temperature": "RoundedTemp"},
        ),
    ],
    # exclude_columns=["bla"]
```
You need to exclude columns (like timestamp), because otherwise the serving endpoint will expect that column to be there.
I only have the id column and the features in the feature table so there are no columns to exclude in my opinion
```python
training_df = training_set.load_df().toPandas()

# Split features and target
X_train = training_df[num_features]
```
I noticed that temperature_rounded (from the Feature Function) is not part of num_features, so it will not be used for training.
fixed here b9c253b
I will merge. Will fix in week3 PR.
Unfortunately, creating the training set with databricks.feature_engineering led to a 401 Unauthorized Error, like Maria mentioned in the lecture. I do not know how to solve this, as we cannot work in Databricks due to company restrictions.
Summary by CodeRabbit

Release Notes

New Features

- New metadata files (`meta.yaml` and `model_version.json`) for MLflow experiments and model versioning.
- New `PriceModel` class for structured model training and evaluation.

Bug Fixes

- Updated `project_config.yml` for consistency.

Documentation

Chores

- Pinned dependency versions in `pyproject.toml` for precise version control.