-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(raw): add arrow support in backend #2083
base: dev
Are you sure you want to change the base?
Conversation
@@ -59,6 +60,13 @@ | |||
".json": ("application/json", "utf-8"), | |||
} | |||
|
|||
|
|||
class MATRIX_FORMAT(EnumIgnoreCase): | |||
JSON = "json" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Selon les cas d'usage dans l'application, vous pouvez avoir "json-split" (cas général) ou "json-index" (cas des Table Form du scenario builder, par exemple).
Exemple :
>>> import pandas as pd
>>> df = pd.DataFrame(data=[[1, 2, 3.14], [4, 5, 6.18]], index=["00:00", "01:00"], columns=["TS-1", "TS-2", "TS-3"])
>>> df.to_json(orient="split")
'{"columns":["TS-1","TS-2","TS-3"],"index":["00:00","01:00"],"data":[[1,2,3.14],[4,5,6.18]]}'
>>> df.to_json(orient="index")
'{"00:00":{"TS-1":1,"TS-2":2,"TS-3":3.14},"01:00":{"TS-1":4,"TS-2":5,"TS-3":6.18}}'
antarest/study/storage/rawstudy/model/filesystem/matrix/matrix.py
Outdated
Show resolved
Hide resolved
antarest/study/storage/rawstudy/model/filesystem/matrix/input_series_matrix.py
Show resolved
Hide resolved
else: | ||
real_format = "json" if formatted else "bytes" | ||
|
||
output = study_service.get(uuid, path, depth=depth, format=real_format, params=parameters) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a global point of view, I think it's weird to push the formatting down all the call stack to serialize the matrix there.
Usually, application manipulate "objects" in memory, and then we serialize it according to the requested format only when we need it.
So here for example, I think it would make more sense to retrieve a DataFrame
object, and serialize it here.
Do you think it's possible ? Maybe in another PR because it has a bigger impact on the codebase ?
In particular, we still want matrices to be formatted as JSON when getting a whole tree from the file tree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's possible and I agree it's a better way to do it. I tried inside this PR but it's too big of a work so we should tackle it inside another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remember that, some binary files in the study directory are not DataFrame at all : *.ico
, XML files, user-defined files stored in the user
directory…
@@ -174,33 +155,63 @@ def test_get_study( | |||
|
|||
# If we ask for a matrix, we should have a JSON content if formatted is True | |||
rel_path = "/input/links/de/fr" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably test it with an output matrix too, since they are handled by different code paths
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've duplicated the input tests to do the output matrix ones. The code of the test is a bit more complicated but it doesn't bother me.
b0b2ee2
to
bad58a7
Compare
850cff5
to
9ac959f
Compare
There are few things to note when using the
arrow
format:1- For output files the index is now inside the dataframe as the column "Index".
2- The columns names are now in string (forced by pyarrow)