
Couldn't upload Session Result DataFrame into Google Spreadsheet when DataFrame is too big #15

semio opened this issue Jun 15, 2023 · 1 comment


semio commented Jun 15, 2023

Hi @motin, I created a CLI tool for running the evaluation, caching results locally, and uploading them to the Google spreadsheet. It's mostly the same process as in the run_evaluation.py notebook, but somehow I couldn't upload the session results to the Google spreadsheet.

# sheet.session_results.replace_data(session_df)

When I enable this line (or use append_data(session_df) instead), it results in an error:

File ~/src/work/gapminder/experiments/gapminder-ai/automation-api/lib/gsheets/gsheets_worksheet_editor.py:90, in GsheetsWorksheetEditor.replace_data(self, df)
     88 def replace_data(self, df: pd.DataFrame) -> None:
     89     self.data.df = df
---> 90     export_df = self.data.export()
     91     gspread_dataframe.set_with_dataframe(self.worksheet, export_df, resize=True)

File ~/src/work/gapminder/experiments/gapminder-ai/automation-api/lib/gsheets/gsheets_worksheet_data.py:94, in GsheetsWorksheetData.export(self)
     93 def export(self) -> pa.typing.DataFrame[DfSchemaModel]:
---> 94     return self.restore_current_row_numbers_in_formulas(self.df).rename(
     95         columns=self.attributes_to_columns_map
     96     )

File ~/src/work/gapminder/experiments/gapminder-ai/automation-api/lib/gsheets/gsheets_worksheet_data.py:88, in GsheetsWorksheetData.restore_current_row_numbers_in_formulas(self, df)
     82             row[col_name] = value.replace(
     83                 self.row_number_placeholder_in_formulas,
     84                 str(row["__row_number"]),
     85             )
     86     return row
---> 88 replaced_df = df.apply(
     89     restore_current_row_numbers, axis=1, result_type="broadcast"
     90 )
     91 return replaced_df.drop(columns=["__row_number"])

File ~/.pyvenv/gapm39/lib/python3.9/site-packages/pandas/core/frame.py:9568, in DataFrame.apply(self, func, axis, raw, result_type, args, **kwargs)
   9557 from pandas.core.apply import frame_apply
   9559 op = frame_apply(
   9560     self,
   9561     func=func,
   (...)
   9566     kwargs=kwargs,
   9567 )
-> 9568 return op.apply().__finalize__(self, method="apply")

File ~/.pyvenv/gapm39/lib/python3.9/site-packages/pandas/core/apply.py:754, in FrameApply.apply(self)
    752 # broadcasting
    753 if self.result_type == "broadcast":
--> 754     return self.apply_broadcast(self.obj)
    756 # one axis empty
    757 elif not all(self.obj.shape):

File ~/.pyvenv/gapm39/lib/python3.9/site-packages/pandas/core/apply.py:1013, in FrameColumnApply.apply_broadcast(self, target)
   1012 def apply_broadcast(self, target: DataFrame) -> DataFrame:
-> 1013     result = super().apply_broadcast(target.T)
   1014     return result.T

File ~/.pyvenv/gapm39/lib/python3.9/site-packages/pandas/core/apply.py:870, in FrameApply.apply_broadcast(self, target)
    867 result_compare = target.shape[0]
    869 for i, col in enumerate(target.columns):
--> 870     res = self.f(target[col])
    871     ares = np.asarray(res).ndim
    873     # must be a scalar or 1d

File ~/src/work/gapminder/experiments/gapminder-ai/automation-api/lib/gsheets/gsheets_worksheet_data.py:80, in GsheetsWorksheetData.restore_current_row_numbers_in_formulas.<locals>.restore_current_row_numbers(row)
     78 if col_name == "__row_number":
     79     continue
---> 80 value = row[col_name]
     81 if isinstance(value, str) and len(value) > 0 and value[0] == "=":
     82     row[col_name] = value.replace(
     83         self.row_number_placeholder_in_formulas,
     84         str(row["__row_number"]),
     85     )

File ~/.pyvenv/gapm39/lib/python3.9/site-packages/pandas/core/frame.py:3807, in DataFrame.__getitem__(self, key)
   3805 if self.columns.nlevels > 1:
   3806     return self._getitem_multilevel(key)
-> 3807 indexer = self.columns.get_loc(key)
   3808 if is_integer(indexer):
   3809     indexer = [indexer]

File ~/.pyvenv/gapm39/lib/python3.9/site-packages/pandas/core/indexes/base.py:3804, in Index.get_loc(self, key, method, tolerance)
   3802     return self._engine.get_loc(casted_key)
   3803 except KeyError as err:
-> 3804     raise KeyError(key) from err
   3805 except TypeError:
   3806     # If we have a listlike key, _check_indexing_error will raise
   3807     #  InvalidIndexError. Otherwise we fall through and re-raise
   3808     #  the TypeError.
   3809     self._check_indexing_error(key)

KeyError: 'session_id'

But I checked session_df: it passes SessionResultsDf validation and the dtypes look correct.

Then I found that it is possibly caused by the size of session_df, because if I upload only a few rows from session_df, it works without issue: 1990 rows worked, but 3980 rows did not.

I suggest that we check whether the DataFrame to be uploaded is too large and raise a more meaningful error message instead of a bare KeyError.
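As a sketch of that suggestion (check_upload_size and the 2000-row threshold are hypothetical, not from this repo; the actual safe limit would need to be measured):

```python
import pandas as pd

def check_upload_size(df: pd.DataFrame, max_rows: int = 2000) -> None:
    """Fail fast with a clear message instead of a cryptic KeyError downstream."""
    if len(df) > max_rows:
        raise ValueError(
            f"DataFrame has {len(df)} rows, which exceeds the supported "
            f"maximum of {max_rows}; upload it in smaller batches."
        )
```

Something like this could run at the top of replace_data / append_data before any export logic.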

Besides, 3980 rows does not seem like a very big number. Is it possible to make it work with larger DataFrames? For our usage, I think we are looking at 200 questions × 3 models × 10 rounds × 10 prompts = 60,000 rows.
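In the meantime, a workaround sketch for larger frames: split the DataFrame row-wise and upload it in batches (iter_chunks is a hypothetical helper; the commented usage assumes append_data accepts partial frames, which would need checking):

```python
import pandas as pd

def iter_chunks(df: pd.DataFrame, chunk_size: int = 1000):
    """Yield successive row-wise slices of df, each at most chunk_size rows."""
    for start in range(0, len(df), chunk_size):
        yield df.iloc[start:start + chunk_size]

# Hypothetical usage against the sheet object from this repo:
# for chunk in iter_chunks(session_df, chunk_size=1000):
#     sheet.session_results.append_data(chunk)
```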


semio commented Jun 15, 2023

Here are some files and a script you can test with:

from lib.pilot.helpers import read_ai_eval_spreadsheet
import glob
import pandas as pd
# NOTE: SessionResultsDf is used below but was not imported in the original
# snippet; import it from wherever it is defined in this repo.

sheet = read_ai_eval_spreadsheet()

# download the CSV files above into a folder, then concatenate them
session_df = pd.concat([pd.read_csv(x) for x in glob.glob('./*csv')])
session_df = SessionResultsDf.validate(session_df)

sheet.session_results.replace_data(session_df)
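One possible explanation, not verified against your data: pd.concat without ignore_index=True keeps each CSV's original 0..N index, so the combined frame has duplicate index labels. DataFrame.apply(..., result_type="broadcast") transposes the frame internally, and selecting a duplicated label there yields a DataFrame instead of a Series, so a row["session_id"]-style lookup fails with exactly this kind of KeyError. That would also explain why one file (1990 rows) works and two files (3980 rows) do not. A minimal, repo-independent demonstration (column names here are made up):

```python
import pandas as pd

# Two "files" concatenated without ignore_index -> duplicate index labels 0 and 1
part = pd.DataFrame({"session_id": ["s1", "s2"], "score": [1, 2]})
df = pd.concat([part, part])
assert not df.index.is_unique

# Mimics restore_current_row_numbers: read a named column value from each row
def touch_row(row):
    return row["session_id"]

try:
    df.apply(touch_row, axis=1, result_type="broadcast")
    failed = False
except KeyError:
    failed = True
print("KeyError with duplicate index:", failed)

# Resetting the index avoids the problem
fixed = pd.concat([part, part], ignore_index=True)
result = fixed.apply(touch_row, axis=1, result_type="broadcast")
print(result.shape)
```

If this is the cause, adding ignore_index=True to the concat in the script above (or a df.reset_index(drop=True) before upload) might sidestep it, independently of the size check.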
