pyarrow.lib.ArrowTypeError: Expected bytes, got a 'dict' object #452
Is this writing to an existing table? Could you share the schema of the destination table?
Hi, @tswast. In fact, I was only able to upload the data after applying json.dumps() to the columns that contain list or dict values.
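A minimal sketch of that workaround (the project ID, table ID, and column names here are placeholders, not from the original report):

```python
import json

import pandas as pd
import pandas_gbq

df = pd.DataFrame(
    {
        "my_int64": [1, 2, 3],
        # Dict values in a column trigger the ArrowTypeError when loaded directly.
        "my_struct": [{"test": "str1"}, {"test": "str2"}, {"test": "str3"}],
    }
)

# Serialize the dict column to JSON strings so it uploads as a plain STRING.
df["my_struct"] = df["my_struct"].apply(json.dumps)

pandas_gbq.to_gbq(df, "dataset.table", project_id="project-id", if_exists="replace")
```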
Any updates on this? I am getting the same error. The strange thing is that the code works fine locally and in Compute Engine, but fails in Cloud Run (even though the same service account is used for both).
Ah, that probably explains it. Currently, I believe we can avoid this problem with #339, where instead of pandas-gbq creating the table, we create the table as part of the load job.
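For reference, a rough sketch of that approach using the google-cloud-bigquery client directly, letting the load job create the table from an explicit schema (project and table IDs are placeholders; a recent client version with pyarrow support is assumed):

```python
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client(project="project-id")  # placeholder project ID

df = pd.DataFrame(
    {
        "my_int64": [1, 2, 3],
        "my_struct": [{"test": "str1"}, {"test": "str2"}, {"test": "str3"}],
    }
)

# Declare the STRUCT column explicitly; the load job then creates the
# destination table with this schema instead of pandas-gbq pre-creating it.
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("my_int64", "INTEGER"),
        bigquery.SchemaField(
            "my_struct",
            "RECORD",
            fields=[bigquery.SchemaField("test", "STRING")],
        ),
    ],
    write_disposition="WRITE_TRUNCATE",
)

job = client.load_table_from_dataframe(df, "dataset.table", job_config=job_config)
job.result()  # wait for the load job to finish
```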
Has there been any progress on this issue? I am seeing the same error message. Could you elaborate on "I believe we can avoid this problem with #339 where instead of pandas-gbq creating the table, we create the table as part of the load job"? I am seeing the same issue even with an already-created table and if_exists='replace'.
The work-around that helped me to successfully load my table was casting the dataframe column to `string`. As an example GCP Cloud Function:

```python
import pandas as pd
import pandas_gbq

def gbq_write(request):
    # TODO: Set project_id to your Google Cloud Platform project ID.
    project_id = "project-id"
    # TODO: Set table_id to the full destination table ID (including the dataset ID).
    table_id = "dataset.table"

    df = pd.DataFrame(
        {
            "my_string": ["a", "b", "c"],
            "my_int64": [1, 2, 3],
            "my_float64": [4.0, 5.0, 6.0],
            "my_bool1": [True, False, True],
            "my_dates": pd.date_range("now", periods=3),
            "my_struct": [{"test": "str1"}, {"test": "str2"}, {"test": "str3"}],
        }
    )
    pandas_gbq.to_gbq(df, table_id, project_id=project_id, if_exists="replace")
    return "Successfully Written"
```

This produces the error mentioned in this thread (pyarrow.lib.ArrowTypeError: Expected bytes, got a 'dict' object).
To apply the column cast, I added a single line and ended up with:

```python
import pandas as pd
import pandas_gbq

def gbq_write(request):
    # TODO: Set project_id to your Google Cloud Platform project ID.
    project_id = "project-id"
    # TODO: Set table_id to the full destination table ID (including the dataset ID).
    table_id = "dataset.table"

    df = pd.DataFrame(
        {
            "my_string": ["a", "b", "c"],
            "my_int64": [1, 2, 3],
            "my_float64": [4.0, 5.0, 6.0],
            "my_bool1": [True, False, True],
            "my_dates": pd.date_range("now", periods=3),
            "my_struct": [{"test": "str1"}, {"test": "str2"}, {"test": "str3"}],
        }
    )
    # Column conversion added before loading the table
    df["my_struct"] = df["my_struct"].astype("string")
    pandas_gbq.to_gbq(df, table_id, project_id=project_id, if_exists="replace")
    return "Successfully Written"
```

This successfully loads the table into BigQuery, with my_struct stored as a STRING column.
If you need my_struct to be an actual struct, consider:

```sql
SELECT
  *
  # retrieve value from struct
  , json_value(my_struct, '$.test') AS test
  # recreate struct using value for each row
  , struct(json_value(my_struct, '$.test') AS test) AS my_created_struct
FROM `project-id.dataset.table`
ORDER BY my_int64
```
The reason the fix in #339 didn't work is that pandas-gbq isn't using the load job to actually create the table, so it doesn't benefit from the same logic that google-cloud-bigquery has. In #814 I'm taking the opposite approach and moving some logic from google-cloud-bigquery to pandas-gbq, as part of an effort to make pandas-gbq the canonical location for all BigQuery + pandas logic and reduce redundancy across our suite of libraries. Thanks everyone for your contributions and clear test cases. These have influenced the test cases I made in #814.
Hi!
There is a problem when trying to load a column of list (array) or dictionary (json) type into a table using pandas-gbq (which uses pyarrow), even though the BigQuery documentation says that structured types such as array and json are supported. As a result, the load fails with pyarrow.lib.ArrowTypeError: Expected bytes, got a 'dict' object.
Can anyone help with this, please?