Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Update parameter.add_data() functionality #112

Draft
wants to merge 34 commits into
base: main
Choose a base branch
from

Conversation

glatterf42
Copy link
Member

As discovered in #108, we need a different implementation of parameter.add_data(). Ideally, we would like to make use of the JSONB type in-DB to avoid parsing the JSON field. However, with my current implementation, even that (which is only possible on postgres in the first place) is not at all faster than the pandas implementation (i.e. parsing the field to python, using pandas, writing it back).
I've tested this with a parameter with 1 indexset with 1000 values and with 60 indexsets with 1000 values (though the parameter was limited to 10000 values and units there): pandas is faster. Some numbers from my most recent test tun:

--------------------------------------------------------------------------------------- benchmark: 3 tests ---------------------------------------------------------------------------------------
Name (time in ms)                   Min                 Max                Mean            StdDev              Median               IQR            Outliers      OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_add_data[sqlite]           79.8390 (1.0)       79.8390 (1.0)       79.8390 (1.0)      0.0000 (1.0)       79.8390 (1.0)      0.0000 (1.0)           0;0  12.5252 (1.0)           1           1
test_add_data[postgres]        128.1381 (1.60)     128.1381 (1.60)     128.1381 (1.60)     0.0000 (1.0)      128.1381 (1.60)     0.0000 (1.0)           0;0   7.8041 (0.62)          1           1
test_add_data_json[sqlite]     276.1958 (3.46)     276.1958 (3.46)     276.1958 (3.46)     0.0000 (1.0)      276.1958 (3.46)     0.0000 (1.0)           0;0   3.6206 (0.29)          1           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

You'll notice that test_add_data_json[postgres] is missing: this test currently fails because it struggles to read in the column.names at one point. Hardcoding them does work, so it's doable, and you can test locally: it's still slower than pandas.

In general, _json indicated the variants working directly in-DB, while those without suffix are the pandas versions.

We might want to consider something like this if we want to use temporary tables in postgres.

Finally, for the 10000-values case, the in-DB workflow was not just slower: it didn't manage at all. My system did still work, but the test case wouldn't finish in more than 30 minutes. Pandas, on the other hand, took 2.8 seconds max (mean of 2.5).
So please, @danielhuppmann, let me know if using pandas is okay or if we should find a way to utilize JSONB (which would be nice for sure).
And if we go for JSONB, @meksor, please help me figure out the proper way to use it, i.e. one that's actually faster than pandas.

@glatterf42 glatterf42 added the enhancement New feature or request label Aug 26, 2024
@glatterf42 glatterf42 self-assigned this Aug 26, 2024
@danielhuppmann
Copy link
Member

Thanks @glatterf42! I didn't have a chance to look at the code in detail, but if pulling all data and doing the comparison in memory is faster, that sounds like a good approach to me.

glatterf42 and others added 19 commits September 30, 2024 10:40
* Covers:
* run__id, data, name, uniqueness of name together with run__id
* Adapts tests since default order of columns changes
* Make Column generic enough for multiple parents
* Introduce optimization.Parameter
* Add tests for add_data
* Enable remaining parameter tests (#86)
* Enable remaining parameter tests
* Include optimization parameter api layer (#89)
* Bump several dependency versions
* Let api/column handle both tables and parameters
* Make api-layer tests pass
* Include optimization parameter core layer (#90)
* Enable parameter core layer and test it
* Fix things after rebase
* Ensure all intended changes survive the rebase
* Adapt data validation function for parameters
* Allow tests to pass again
@glatterf42 glatterf42 force-pushed the include/optimization-parameter branch from acc783f to a9e5ac8 Compare September 30, 2024 09:17
@glatterf42
Copy link
Member Author

glatterf42 commented Sep 30, 2024

A while ago, Volker asked me to also benchmark upserting a parameter with roughly ten indexsets, but a million values. This gives a similar picture to before, though it may be a little worrying (I still have to look into it further): the only test that finished within an hour was the one using pandas and sqlite:

----------------------------------------------- benchmark: 1 tests -----------------------------------------------
Name (time in s)              Min      Max     Mean  StdDev   Median     IQR  Outliers     OPS  Rounds  Iterations
------------------------------------------------------------------------------------------------------------------
test_add_data[sqlite]     96.3460  96.3460  96.3460  0.0000  96.3460  0.0000       0;0  0.0104       1           1
------------------------------------------------------------------------------------------------------------------

I'm not sure if 96 seconds is fine for us for this kind of operation, I'll benchmark the same operation with ixmp tomorrow.

When running on postgres, the benchmark produced this error:

self = <psycopg.Cursor [closed] [IDLE] (host=localhost user=postgres database=test) at 0x7f000bcd8290>
query = 'UPDATE optimization_parameter SET data=%(data)s::JSONB WHERE optimization_parameter.id = %(optimization_parameter_id)s::INTEGER'
params = {'data': Jsonb({'Indexset 0': [0, 0, 0, 0, 0, 0, 0 ... (128780213 chars)), 'optimization_parameter_id': 1}

    def execute(
        self,
        query: Query,
        params: Optional[Params] = None,
        *,
        prepare: Optional[bool] = None,
        binary: Optional[bool] = None,
    ) -> Self:
        """
        Execute a query or command to the database.
        """
        try:
            with self._conn.lock:
                self._conn.wait(
                    self._execute_gen(query, params, prepare=prepare, binary=binary)
                )
        except e._NO_TRACEBACK as ex:
>           raise ex.with_traceback(None)
E           sqlalchemy.exc.OperationalError: (psycopg.errors.ProgramLimitExceeded) total size of jsonb object elements exceeds the maximum of 268435455 bytes
E           CONTEXT:  unnamed portal parameter $1 = '...'
E           [SQL: UPDATE optimization_parameter SET data=%(data)s::JSONB WHERE optimization_parameter.id = %(optimization_parameter_id)s::INTEGER]
E           [parameters: {'data': Jsonb({'Indexset 0': [0, 0, 0, 0, 0, 0, 0 ... (128780213 chars)), 'optimization_parameter_id': 1}]
E           (Background on this error at: https://sqlalche.me/e/20/e3q8)

.venv/lib/python3.12/site-packages/psycopg/cursor.py:732: OperationalError

Meanwhile, when using the in-DB _json functionality, the sqlite test wouldn't finish within an hour (so that I couldn't even start the one on postgres, though I doubt it would fare much better).

@glatterf42
Copy link
Member Author

Running the same benchmark (upserting a million values to one parameter with 12 indexsets) in ixmp takes just 53.48 seconds, so the current pandas-implementation in ixmp4 would actually constitute a downgrade, it seems.
Hence, I should invest more time into making the new implementation more efficient. Since I tried that before and wasn't too successful, I'd appreciate any help or guidance you can give, @meksor.

glatterf42 and others added 3 commits October 1, 2024 09:00
* Covers:
* run__id, data, name, uniqueness of name together with run__id
* Adapts tests since default order of columns changes
* Make Column generic enough for multiple parents
* Introduce optimization.Parameter
* Add tests for add_data
* Enable remaining parameter tests (#86)
* Enable remaining parameter tests
* Include optimization parameter api layer (#89)
* Bump several dependency versions
* Let api/column handle both tables and parameters
* Make api-layer tests pass
* Include optimization parameter core layer (#90)
* Enable parameter core layer and test it
* Fix things after rebase
* Ensure all intended changes survive the rebase
* Adapt data validation function for parameters
* Allow tests to pass again
@glatterf42 glatterf42 force-pushed the enh/add-data-functionality branch from 460f63c to c6b29b9 Compare October 1, 2024 07:31
Base automatically changed from include/optimization-parameter to main October 3, 2024 08:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants