pyarrow.lib.ArrowTypeError: Expected bytes, got a 'dict' object #452
Is this writing to an existing table? Could you share the schema of the destination table?
Hi, @tswast. In fact, I was only able to upload the data after applying json.dumps() to the columns that contain list or dict values.
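A minimal sketch of that workaround (the project ID, table ID, and column names here are placeholders, not from the original report):

```python
import json

import pandas as pd
import pandas_gbq

df = pd.DataFrame(
    {
        "my_int64": [1, 2, 3],
        # Dict values in a column trigger the ArrowTypeError when loaded directly.
        "my_struct": [{"test": "str1"}, {"test": "str2"}, {"test": "str3"}],
    }
)

# Serialize the dict column to JSON strings so it uploads as a plain STRING.
df["my_struct"] = df["my_struct"].apply(json.dumps)

pandas_gbq.to_gbq(df, "dataset.table", project_id="project-id", if_exists="replace")
```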
Any updates on this? I am getting the same error. The strange thing is that the code works fine locally and in Compute Engine, but fails in Cloud Run (even though the same service account is used for both).
Ah, that probably explains it. Currently, I believe we can avoid this problem with #339, where instead of pandas-gbq creating the table, we create the table as part of the load job.
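For reference, a rough sketch of that approach using the google-cloud-bigquery client directly, letting the load job create the table from an explicit schema (project and table IDs are placeholders; a recent client version with pyarrow support is assumed):

```python
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client(project="project-id")  # placeholder project ID

df = pd.DataFrame(
    {
        "my_int64": [1, 2, 3],
        "my_struct": [{"test": "str1"}, {"test": "str2"}, {"test": "str3"}],
    }
)

# Declare the STRUCT column explicitly; the load job then creates the
# destination table with this schema instead of pandas-gbq pre-creating it.
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("my_int64", "INTEGER"),
        bigquery.SchemaField(
            "my_struct",
            "RECORD",
            fields=[bigquery.SchemaField("test", "STRING")],
        ),
    ],
    write_disposition="WRITE_TRUNCATE",
)

job = client.load_table_from_dataframe(df, "dataset.table", job_config=job_config)
job.result()  # wait for the load job to finish
```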
Has there been any progress on this issue? I am seeing the same error message. Could you elaborate on "I believe we can avoid this problem with #339 where instead of pandas-gbq creating the table, we create the table as part of the load job"? I am seeing the same issue even with an already-created table and if_exists='replace'.
The work-around that helped me to successfully load my table was casting the dataframe column to `string`. As an example GCP Cloud Function:

```python
import pandas as pd
import pandas_gbq

def gbq_write(request):
    # TODO: Set project_id to your Google Cloud Platform project ID.
    project_id = "project-id"
    # TODO: Set table_id to the full destination table ID (including the dataset ID).
    table_id = "dataset.table"

    df = pd.DataFrame(
        {
            "my_string": ["a", "b", "c"],
            "my_int64": [1, 2, 3],
            "my_float64": [4.0, 5.0, 6.0],
            "my_bool1": [True, False, True],
            "my_dates": pd.date_range("now", periods=3),
            "my_struct": [{"test": "str1"}, {"test": "str2"}, {"test": "str3"}],
        }
    )
    pandas_gbq.to_gbq(df, table_id, project_id=project_id, if_exists="replace")
    return "Successfully Written"
```

This produces the error mentioned in this thread (pyarrow.lib.ArrowTypeError: Expected bytes, got a 'dict' object).
To apply the column cast, I added a single line and ended up with:

```python
import pandas as pd
import pandas_gbq

def gbq_write(request):
    # TODO: Set project_id to your Google Cloud Platform project ID.
    project_id = "project-id"
    # TODO: Set table_id to the full destination table ID (including the dataset ID).
    table_id = "dataset.table"

    df = pd.DataFrame(
        {
            "my_string": ["a", "b", "c"],
            "my_int64": [1, 2, 3],
            "my_float64": [4.0, 5.0, 6.0],
            "my_bool1": [True, False, True],
            "my_dates": pd.date_range("now", periods=3),
            "my_struct": [{"test": "str1"}, {"test": "str2"}, {"test": "str3"}],
        }
    )
    # Column conversion added before loading the table
    df["my_struct"] = df["my_struct"].astype("string")
    pandas_gbq.to_gbq(df, table_id, project_id=project_id, if_exists="replace")
    return "Successfully Written"
```

This successfully loads the table into BigQuery, with my_struct stored as a STRING column.
If you need my_struct to be an actual struct, consider:

```sql
SELECT
  *
  # retrieve value from struct
  , json_value(my_struct, '$.test') AS test
  # recreate struct using value for each row
  , struct(json_value(my_struct, '$.test') AS test) AS my_created_struct
FROM `project-id.dataset.table`
ORDER BY my_int64
```
The reason the fix in #339 didn't work is that pandas-gbq isn't using the load job to actually create the table, so it doesn't benefit from the same logic that google-cloud-bigquery has. In #814 I'm taking the opposite approach and moving some logic from google-cloud-bigquery to pandas-gbq, as part of an effort to make pandas-gbq the canonical location for all BigQuery + pandas logic and reduce redundancy across our suite of libraries. Thanks everyone for your contributions and clear test cases. These have influenced the test cases I made in #814.
Hi!
There is a problem when trying to load a column of list (array) or dictionary (json) type into a table using pandas-gbq (which uses pyarrow), even though the BigQuery documentation says that structured types such as array and json are supported. As a result, the load fails with pyarrow.lib.ArrowTypeError: Expected bytes, got a 'dict' object.
Can anyone help with this, please?