Added DropNullColumn transformer to remove columns that contain only nulls #1115

Merged on Nov 18, 2024 (64 commits)
Commits
bee630f
Adding code for DropNull
rcap107 Oct 17, 2024
ccc9a02
Fixed line
rcap107 Oct 17, 2024
d3f9c90
renamed script
rcap107 Oct 17, 2024
9d42b95
Added new common functions for drop and is_all_null
rcap107 Oct 17, 2024
f249982
Fixed code
rcap107 Oct 17, 2024
a1caf39
Added test for dropcol
rcap107 Oct 17, 2024
b0e3235
Removing dev script
rcap107 Oct 17, 2024
90be825
Update skrub/tests/test_dropnulls.py
rcap107 Oct 21, 2024
55764a8
Renamed file
rcap107 Oct 21, 2024
c8fdaaa
Renamed file
rcap107 Oct 21, 2024
0cdc0bd
Formatting
rcap107 Oct 21, 2024
34c0095
Merge branch 'drop_null_columns' of https://github.com/rcap107/skrub …
rcap107 Oct 21, 2024
430c8e3
Rename file
rcap107 Oct 21, 2024
80bd408
Added docstrings
rcap107 Oct 21, 2024
e2ca33f
Fixing imports and refactoring names
rcap107 Oct 21, 2024
4dbba09
Formatting
rcap107 Oct 21, 2024
7d6f8ce
Updated changelog.
rcap107 Oct 21, 2024
4771d18
Formatting
rcap107 Oct 21, 2024
f0b521a
Removing function because it was not needed
rcap107 Oct 21, 2024
ea9893b
Updated test
rcap107 Oct 21, 2024
c73db7e
Merge branch 'main' into drop_null_columns
rcap107 Oct 21, 2024
09cf9c7
Improving tests
rcap107 Oct 21, 2024
4e4f255
Merge branch 'drop_null_columns' of https://github.com/rcap107/skrub …
rcap107 Oct 21, 2024
754e2ef
Updated test
rcap107 Oct 22, 2024
acafac6
Merge remote-tracking branch 'main_repo/main' into drop_null_columns
rcap107 Oct 22, 2024
4b0aa1c
Fixed is_all_null based on comments
rcap107 Oct 22, 2024
35f8909
Renaming files for consistency
rcap107 Oct 22, 2024
b4e419f
Removing init
rcap107 Oct 22, 2024
75f1110
Moving DropNullColumn after CleanNullStrings
rcap107 Oct 22, 2024
e499dc1
Moved check on drop from transform to fit_transform
rcap107 Oct 22, 2024
c296829
Fixed changelog
rcap107 Oct 22, 2024
ee6b7b5
Moved tests and improved coverage
rcap107 Oct 22, 2024
92210b7
Moved tv test to the proper file
rcap107 Oct 24, 2024
4cad44a
Updated test to make it make sense
rcap107 Oct 24, 2024
836a636
Improving comment
rcap107 Oct 24, 2024
4ec95d6
Improving comment
rcap107 Oct 24, 2024
3c25b84
Removed unneeded code
rcap107 Oct 24, 2024
8638516
Changed default value to True
rcap107 Oct 24, 2024
e70f513
Formatting
rcap107 Oct 24, 2024
6083567
Added back code that should have been there in the first place
rcap107 Oct 24, 2024
a543044
Changed the default parameter
rcap107 Oct 24, 2024
92f5430
Changed to use df interface
rcap107 Oct 24, 2024
24b18ba
Merge remote-tracking branch 'main_repo/main' into drop_null_columns
rcap107 Oct 24, 2024
62ef9d6
Fixed docstring.
rcap107 Oct 24, 2024
53cb8bd
Update skrub/_drop_null_column.py
jeromedockes Oct 24, 2024
7af96ca
Renaming transformer to DropColumnIfNull.
rcap107 Oct 25, 2024
11908b3
Merge branch 'drop_null_columns' of https://github.com/rcap107/skrub …
rcap107 Oct 25, 2024
2499a37
Update skrub/_dataframe/tests/test_common.py
rcap107 Oct 29, 2024
548b792
Removed a coverage file
rcap107 Oct 29, 2024
58feaed
Fix formatting of docstring
rcap107 Oct 29, 2024
5a6539c
Formatting
rcap107 Oct 29, 2024
36c46d4
Whoops
rcap107 Oct 29, 2024
98b6c10
Altering the code to add different options and changing the default
rcap107 Nov 8, 2024
399954a
Improvements to formatting and docstring.
rcap107 Nov 8, 2024
32ca7a0
Adding error checking
rcap107 Nov 8, 2024
b311317
Updated documentation
rcap107 Nov 8, 2024
a04fb50
Fixed tests
rcap107 Nov 8, 2024
7b635ef
Changing exception
rcap107 Nov 8, 2024
43a61d4
Revert "Changing exception"
rcap107 Nov 18, 2024
c48a63d
Revert "Fixed tests"
rcap107 Nov 18, 2024
5704ebf
Revert "Updated documentation"
rcap107 Nov 18, 2024
ab5af46
Revert "Adding error checking"
rcap107 Nov 18, 2024
3f69bde
Revert "Improvements to formatting and docstring."
rcap107 Nov 18, 2024
801d745
Revert "Altering the code to add different options and changing the d…
rcap107 Nov 18, 2024
70 changes: 68 additions & 2 deletions skrub/tests/test_drop_column_if_null.py
@@ -3,6 +3,7 @@

 from skrub import _dataframe as sbd
 from skrub._drop_column_if_null import DropColumnIfNull
+from skrub._on_each_column import RejectColumn


 @pytest.fixture
@@ -39,9 +40,9 @@ def drop_null_table(df_module):
     )


-def test_single_column(drop_null_table, df_module):
+def test_single_column_drop(drop_null_table, df_module):
     """Check that null columns are dropped and non-null columns are kept."""
-    dn = DropColumnIfNull()
+    dn = DropColumnIfNull(null_column_strategy="drop")
     assert dn.fit_transform(sbd.col(drop_null_table, "value_nan")) == []
     assert dn.fit_transform(sbd.col(drop_null_table, "value_null")) == []
     assert dn.fit_transform(sbd.col(drop_null_table, "mixed_null")) == []
@@ -60,3 +61,68 @@ def test_single_column(drop_null_table, df_module):
         dn.fit_transform(sbd.col(drop_null_table, "value_almost_null")),
         df_module.make_column("value_almost_null", ["almost", None, None]),
     )
+
+
+def test_single_column_keep(drop_null_table, df_module):
+    """Check that all columns are kept."""
+    dn = DropColumnIfNull(null_column_strategy="keep")
+
+    df_module.assert_column_equal(
+        dn.fit_transform(sbd.col(drop_null_table, "idx")),
+        df_module.make_column("idx", [1, 2, 3]),
+    )
+
+    df_module.assert_column_equal(
+        dn.fit_transform(sbd.col(drop_null_table, "value_null")),
+        df_module.make_column(
+            "value_null",
+            [
+                None,
+                None,
+                None,
+            ],
+        ),
+    )
+
+    df_module.assert_column_equal(
+        dn.fit_transform(sbd.col(drop_null_table, "value_nan")),
+        df_module.make_column(
+            "value_nan",
+            [
+                np.nan,
+                np.nan,
+                np.nan,
+            ],
+        ),
+    )
+
+    df_module.assert_column_equal(
+        dn.fit_transform(sbd.col(drop_null_table, "mixed_null")),
+        df_module.make_column("mixed_null", [None, np.nan, None]),
+    )
+
+    df_module.assert_column_equal(
+        dn.fit_transform(sbd.col(drop_null_table, "value_almost_nan")),
+        df_module.make_column("value_almost_nan", [2.5, np.nan, np.nan]),
+    )
+
+    df_module.assert_column_equal(
+        dn.fit_transform(sbd.col(drop_null_table, "value_almost_null")),
+        df_module.make_column("value_almost_null", ["almost", None, None]),
+    )
+
+
+def test_single_column_raise(drop_null_table, df_module):
+    """Check that an exception is raised if a null column is detected."""
+    dn = DropColumnIfNull(null_column_strategy="raise")
+    with pytest.raises(RejectColumn):
+        dn.fit_transform(sbd.col(drop_null_table, "value_nan"))
+    with pytest.raises(RejectColumn):
+        dn.fit_transform(sbd.col(drop_null_table, "value_null"))
+    with pytest.raises(RejectColumn):
+        dn.fit_transform(sbd.col(drop_null_table, "mixed_null"))
+
+
+def test_incorrect_argument():
+    with pytest.raises(ValueError):
+        DropColumnIfNull(null_column_strategy="wrong value")
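For readers who want the behavior these tests pin down in one place, here is a minimal sketch of the three strategies. It is not skrub's implementation: columns are plain Python lists, `handle_null_column` is a hypothetical helper, and the local `RejectColumn` class is only a stand-in for the exception imported in the diff. The only things taken from the tests are the option names ("drop", "keep", "raise"), the empty-list result for a dropped column, and the `ValueError` on an unknown option.

```python
import math

VALID_STRATEGIES = ("drop", "keep", "raise")


class RejectColumn(Exception):
    """Local stand-in for the exception used in the tests (assumption)."""


def is_null(value):
    """Treat None and float NaN as null."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def handle_null_column(name, values, null_column_strategy="drop"):
    """Hypothetical helper mirroring the behavior the tests check.

    Returns [] when an all-null column is dropped, otherwise returns
    the values unchanged.
    """
    if null_column_strategy not in VALID_STRATEGIES:
        raise ValueError(
            f"null_column_strategy must be one of {VALID_STRATEGIES}, "
            f"got {null_column_strategy!r}"
        )
    only_nulls = all(is_null(v) for v in values)
    if not only_nulls or null_column_strategy == "keep":
        return values
    if null_column_strategy == "raise":
        raise RejectColumn(f"Column {name} contains only null values.")
    return []  # "drop": an empty list signals that the column is removed
```

A column with at least one non-null entry, such as `["almost", None, None]`, comes back unchanged under every valid strategy, which matches the `value_almost_null` assertions above.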
11 changes: 3 additions & 8 deletions skrub/tests/test_table_vectorizer.py
@@ -19,7 +19,6 @@
 from skrub._datetime_encoder import DatetimeEncoder
 from skrub._gap_encoder import GapEncoder
 from skrub._minhash_encoder import MinHashEncoder
-from skrub._on_each_column import RejectColumn
 from skrub._table_vectorizer import TableVectorizer

 MSG_PANDAS_DEPRECATED_WARNING = "Skip deprecation warning"
@@ -531,7 +530,7 @@ def test_changing_types(X_train, X_test, expected_X_out):
         # only extract the total seconds
         datetime=DatetimeEncoder(resolution=None),
         # True by default
Contributor Author:
I set this to false to keep the original behavior with no DropNullColumns. Given that the default value is True, should I change the test so that the "default behavior" is what is tested here?

Member:
I think it's ok the way you did it

-        null_column_strategy=False,
+        null_column_strategy="keep",
     )

     table_vec.fit(X_train)
@@ -766,7 +765,7 @@ def test_drop_null_column():
     """Check that all null columns are dropped, and no more."""
     # Don't drop null columns
     X = _get_missing_values_dataframe()
-    tv = TableVectorizer(null_column_strategy="ignore")
+    tv = TableVectorizer(null_column_strategy="keep")
     transformed = tv.fit_transform(X)

     assert sbd.shape(transformed) == sbd.shape(X)
@@ -778,11 +777,7 @@ def test_drop_null_column():

     # Raise exception if a null column is found
     with pytest.raises(
Contributor Author:
This test is still failing because the TableVectorizer is not raising the correct exception and I don't know how to make it do that.

Member:
Here, raise a ValueError instead of RejectColumn.

Member:
RejectColumn is a way to signal to the TableVectorizer "I'm not the right transformer for this column, don't apply me here".

Contributor Author:
Gotcha, fixed

-        RejectColumn, match="Column all_null contains only null values."
+        ValueError,
     ):
         tv = TableVectorizer(null_column_strategy="raise")
         transformed = tv.fit_transform(X)
-
-    # # Raise an exception if an unknown parameter is found
-    # tv = TableVectorizer(null_column_strategy="wrong_parameter")
-    # transformed = tv.fit_transform(X)
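The review thread in this hunk distinguishes between an exception that means "this transformer does not apply to this column" and one that reports a genuine error. The toy dispatcher below sketches why that distinction could matter. It is not skrub's code: `RejectColumn` here is a local stand-in class, and `apply_first_accepting`, `reject_if_null`, `fail_if_null`, and `passthrough` are hypothetical names. It assumes a dispatcher that catches the rejection signal and moves on to the next candidate, so a rejection never reaches the caller while a `ValueError` does.

```python
class RejectColumn(Exception):
    """Stand-in for a 'this transformer does not apply here' signal."""


def all_null(values):
    return all(v is None for v in values)


def reject_if_null(values):
    # Signals "not my column": the dispatcher below treats this as a skip.
    if all_null(values):
        raise RejectColumn("column contains only null values")
    return values


def fail_if_null(values):
    # Signals a real error: the dispatcher does not catch ValueError.
    if all_null(values):
        raise ValueError("column contains only null values")
    return values


def passthrough(values):
    return values


def apply_first_accepting(values, transformers):
    """Apply the first transformer that does not reject the column."""
    for transform in transformers:
        try:
            return transform(values)
        except RejectColumn:
            continue  # rejection is swallowed; try the next candidate
    return values


column = [None, None, None]

# The rejection is swallowed, so the caller never sees an exception.
swallowed = apply_first_accepting(column, [reject_if_null, passthrough])

# A ValueError is not caught by the dispatcher and reaches the caller.
try:
    apply_first_accepting(column, [fail_if_null, passthrough])
    propagated = False
except ValueError:
    propagated = True
```

Under these assumptions, a test written with `pytest.raises(RejectColumn)` around the dispatcher would fail because nothing escapes it, while `pytest.raises(ValueError)` would pass, which is consistent with the change made in the hunk above.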