Bug Fix: #60343 Construction of Series / Index fails from dict keys when "str" dtype is specified explicitly #60383

tasfia8 · 2024-11-21T03:17:03Z

This PR fixes BUG (string): construction of Series / Index fails from dict keys when "str" dtype is specified explicitly #60343. @jorisvandenbossche
The default behaviour (pd.Index(d.keys())) worked correctly, but explicitly setting dtype="str" raised a ValueError. The issue stemmed from dict_keys not being converted to a proper array-like structure before being passed to StringDtype, which couldn't handle such inputs.
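
To illustrate why dict_keys needs converting first, here is a minimal standard-library sketch (it does not touch pandas; the variable names are only illustrative). It shows that a dict_keys object is a set-like view rather than a sequence, exposes no __array__ method, and cannot be indexed by position, while list() turns it into an ordinary list:

```python
from collections.abc import KeysView, Sequence

d = {"a": 1, "b": 2}
keys = d.keys()

# dict_keys is a set-like view over the dict, not a sequence.
is_keys_view = isinstance(keys, KeysView)
is_sequence = isinstance(keys, Sequence)

# It does not expose the __array__ protocol used by array-likes.
has_array_protocol = hasattr(keys, "__array__")

# Positional indexing is not supported on a view.
try:
    keys[0]
    indexable = True
except TypeError:
    indexable = False

# Converting to a list gives a plain, indexable sequence.
as_list = list(keys)

print(is_keys_view, is_sequence, has_array_protocol, indexable, as_list)
```

Whether these properties are exactly what tripped up StringDtype is not stated in this thread; the sketch only shows what kind of object the constructor receives.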

To fix the issue:

A KeysView check was introduced to identify dict_keys inputs and preprocess them before they are passed to pandas internals. The keys are now converted to a list for compatibility.
Updated the logic in Index and sanitize_array to map dtype="str" to StringDtype(storage="python"), and updated check_array_indexer to allow empty boolean indexers for StringArray.
Added a new test, test_index_from_dict_keys_with_dtype, to ensure that:
Default inference (pd.Index(d.keys())) works.
Explicit dtype="str" works, resulting in string[python].
Updated the existing tests test_is_object and test_empty_fancy to handle the new behaviours introduced by the fix.

After the fix, both the default (pd.Index(d.keys())) and explicit (pd.Index(d.keys(), dtype="str")) cases work.

tasfia8 · 2024-11-21T06:54:49Z

Hi, I'm a student contributing to this PR and am in a bit of a time crunch due to finals. For my school project, my task is to get the PR merged as quickly as possible with the help and guidance of maintainers. I was able to fix the bug, but I am a bit stuck on how to fix the checks, especially the unit test ones. Could @jorisvandenbossche or anyone else help? I tried to fix the pre-commit check (using ruff lint fix), but every time I fixed a formatting issue, running pre-commit reverted the file to its state before my fix.

For the Doc build and upload check (it was giving an error for every declaration of ipython that didn't have import pandas as pd), I manually inserted it but don't know if there is an easy way.

jorisvandenbossche · 2024-11-23T09:38:29Z

> For the Doc build and upload check (it was giving an error for every declaration of ipython that didn't have import pandas as pd), I manually inserted it but don't know if there is an easy way.

That should normally not have been needed. Did you get those errors locally? (In that case, maybe something was wrong with the setup.)
On the CI build I see that there is an error specifically in the doc/source/getting_started/comparison/includes/nth_word.rst file.

jorisvandenbossche

Thanks for working on this!

I added a few comments. It seems you have made more changes than I think are needed to fix the issue. I would try to focus the PR a bit more (fixing only the Index or only the Series constructor would also be fine).

jorisvandenbossche · 2024-11-23T09:39:53Z

pandas/core/dtypes/common.py

+    return (
+        isinstance(arr_or_dtype, np.dtype)
+        and arr_or_dtype == "object"
+        or isinstance(arr_or_dtype, StringDtype)


We don't want to change the meaning of is_object_dtype to also include StringDtype. What was the reason you needed this change?

jorisvandenbossche · 2024-11-23T09:42:37Z

pandas/tests/frame/test_query_eval.py

-        df = DataFrame(
-            {
-                "A": range(3),
-                "B": range(3),
-                "C": range(3)
-            }
-        ).rename(columns={"B": "A"})
+        df = DataFrame({"A": range(3), "B": range(3), "C": range(3)}).rename(
+            columns={"B": "A"}
+        )


It seems you included some unintended formatting changes. Maybe some setting in the IDE you are using conflicts with the formatting defaults in pandas?
I would recommend setting up the pre-commit hook, which will ensure the code is formatted correctly when committing (see https://pandas.pydata.org/docs/development/contributing_codebase.html#pre-commit)

jorisvandenbossche · 2024-11-23T09:47:00Z

pandas/core/construction.py

+        if isinstance(data, KeysView):
+            data = list(data)


I think this is the critical part that is indeed fixing the issue (for Series(..) at least), and so this is a good change.

I would just make the if check more generic. While the example was using dict.keys(), there are other iterables (e.g. dict.values()) that will have the same problem, and we should try to fix it for all of them.

Looking at the logic just below for the non-ExtensionDtype cases, there is an if hasattr(data, "__array__") check, and after that the final else also does data = list(data). So maybe the check above could be if not hasattr(data, "__array__") instead of if isinstance(data, KeysView).
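
As a rough standalone sketch of that suggestion (not the actual pandas code), the check could look like the following. The helper name to_list_unless_array and the FakeArray stand-in class are hypothetical and exist only for this illustration; the only part taken from the comment above is the hasattr(data, "__array__") test followed by list(data).

```python
def to_list_unless_array(data):
    """Hypothetical helper: materialize any iterable that is not array-like."""
    if not hasattr(data, "__array__"):
        # Covers dict.keys(), dict.values(), generators, sets, and so on.
        return list(data)
    # Objects implementing __array__ are passed through untouched.
    return data


class FakeArray:
    """Stand-in for an array-like; __array__ is never called in this sketch."""

    def __array__(self, dtype=None, copy=None):
        raise NotImplementedError("illustration only")


d = {"a": 1, "b": 2}
from_keys = to_list_unless_array(d.keys())
from_values = to_list_unless_array(d.values())
from_generator = to_list_unless_array(ch for ch in "ab")

arr = FakeArray()
passed_through = to_list_unless_array(arr)

print(from_keys, from_values, from_generator, passed_through is arr)
```

The appeal of the generic form is that it does not need to enumerate view types one by one: anything without __array__ takes the same list(data) path.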

tasfia8 added 9 commits November 18, 2024 22:47

Add .gitignore to ignore unnecessary files

eb70c39

def test_index_from_dict_keys_with_dtype() passes but other 4 test ca…

8264636

…ses fail.

Testcase test_constructor_casting(self, index) passes

5cb0c64

test_empty_fancy passes

37468b5

Bug fixed 60343 successfully

051e7b2

Bug fixed 60343 - styledcode

8d8fec0

clean up

734c384

Merge remote-tracking branch 'upstream/main' into bug60343v1

4dd2612

Ready for PR

a061863

tasfia8 mentioned this pull request Nov 21, 2024

BUG (string): contruction of Series / Index fails from dict keys when "str" dtype is specified explicitly #60343

Open

tasfia8 added 9 commits November 20, 2024 22:30

Styled comment

6f54206

fixed minor issue

3b977ab

Fixed hooks

e66e307

apply minorfix

e13cf26

fixed linting

18adba3

linting

10311c3

minor lint

85c0efb

minorfix

c71cc03

ipython build imports in rst files

e30dea1

tasfia8 added 2 commits November 21, 2024 02:20

fixed some precommit checks

95e7730

Passed another unit check

28fd9e9

jorisvandenbossche added this to the 2.3 milestone Nov 23, 2024

jorisvandenbossche added the Strings (String extension data type and string data) and Constructors (Series/DataFrame/Index/pd.array Constructors) labels Nov 23, 2024

jorisvandenbossche reviewed Nov 23, 2024
