Tighten up data/metadata mismatch section #8

Merged: 2 commits, Dec 11, 2024

Conversation

mbtaylor
Contributor

@mbtaylor mbtaylor commented Dec 6, 2024

In particular, mention (without recommending) alternative approaches to VOTable/parquet column mapping in complex cases, and make explicit the reader behaviour that should defend against encountering such alternatives.

@mbtaylor
Contributor Author

mbtaylor commented Dec 6, 2024

I don't know whether to add here some explicit text about how parquet types are mapped to VOTable types in straightforward cases, something like:

Columns containing scalars and 1-d arrays of booleans, unsigned bytes, signed 16/32/64-bit integers, 32/64-bit floats, and strings map straightforwardly to VOTable.
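As a rough illustration of the straightforward cases listed above, the mapping can be written as a lookup table. This is only a sketch: the keys are hypothetical type names chosen for readability (not identifiers from any parquet library), `votable_datatype` is an invented helper, and mapping strings to `char` rather than `unicodeChar` is an assumption.

```python
# Hypothetical lookup table: keys are illustrative element type names,
# values are the corresponding VOTable datatype attribute values.
STRAIGHTFORWARD_TYPES = {
    "bool": "boolean",
    "uint8": "unsignedByte",
    "int16": "short",
    "int32": "int",
    "int64": "long",
    "float32": "float",
    "float64": "double",
    "string": "char",  # assumption: "unicodeChar" may suit some data better
}


def votable_datatype(element_type):
    """Return the VOTable datatype for a scalar or 1-d array element type,
    or None when the type has no straightforward counterpart."""
    return STRAIGHTFORWARD_TYPES.get(element_type)
```

Types outside the table, such as `uint16` or `int8`, deliberately return `None` here, since the discussion leaves their handling open.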

That in turn would raise the question of what you do with types that are somewhat but not quite like those ones:

  • unsigned 16/32/64-bit and signed 8-bit integers
  • multi-dimensional arrays

In both of those cases I don't know what's best. For the signed/unsigned issue the right answer may be a DALI-defined xtype, but that's not in place right now. For multi-d arrays the most VOTable-friendly thing to do would be to store them in parquet as 1-d and use FIELD/@arraysize attributes to reshape them, but then they don't look multi-d in parquet.
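The 1-d storage option for multi-dimensional arrays can be sketched as follows. This is an illustration only, not something the thread settles on: `reshape_by_arraysize` is a hypothetical helper, it handles only fixed-size `arraysize` values such as `"2x3"` (no `*`), and it assumes the VOTable convention that the first dimension varies fastest.

```python
from math import prod


def reshape_by_arraysize(flat, arraysize):
    """Nest a flat sequence according to a fixed-size arraysize string.

    Assumes the first listed dimension varies fastest, so "2x3" yields
    3 groups of 2 elements each.
    """
    dims = [int(d) for d in arraysize.split("x")]
    if prod(dims) != len(flat):
        raise ValueError(
            f"arraysize {arraysize!r} needs {prod(dims)} elements, got {len(flat)}"
        )
    nested = list(flat)
    # Group by each dimension in turn, fastest-varying first; the last
    # dimension is simply the length of the outermost list.
    for size in dims[:-1]:
        nested = [nested[i:i + size] for i in range(0, len(nested), size)]
    return nested
```

With this layout the parquet column itself stays a plain 1-d array, which is exactly the drawback noted above: the multi-dimensional shape is visible only to readers that consult the VOTable metadata.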

Given that I expect a good majority of tables won't run into either issue, and that implementation experience may be required to get this right, is it best to:

  1. just not mention it here
  2. mention the issues but admit we don't have a good answer
  3. try to come up with something prescriptive

I'm inclined to go with (1) or (2) but others might disagree.

@fxpineau

fxpineau commented Dec 6, 2024

VOParquet may serve at least 2 purposes:
A - add VO semantic metadata to a parquet file;
B - be another possible format for VO data.
B includes A.
B means the possibility to read such data from a general purpose VO tool such as TOPCAT.

From my understanding (I do not remember whether it is explicitly mentioned or not), the current version of the document covers case A.
And in case A, we don't really care about type mismatches and we can go with (1).

If we want to cover case B, I think that we should mention at least (2).
And possibly be prescriptive (3), stating that the logical types of a VOParquet file must be restricted to VOTable datatypes.
In case of incompatible types, it is up to the client to decide what to do: fail, ignore columns having VO-incompatible types, ...
By the way, what is your current strategy in TOPCAT when a datatype is incompatible with the StarTable model?

Then, to go further, it raises the question of VOTable as a pivot format.
From my point of view, the current version of VOTable is not satisfactory: it should at least support unsigned 16/32/64-bit integers and signed 8-bit integers. The xtype seems to be a good solution to serve as a decorator that can be ignored when reading/writing those types in Java, in FITS, ... (but still be taken into account when performing operations on those types, e.g. in Java, and added to FITS metadata).

I think that the VO lacks a description of tabular data and metadata which would be file format agnostic, i.e. decoupled from the question of how we serialize/deserialize those data/metadata in various file formats.

@mbtaylor
Contributor Author

mbtaylor commented Dec 6, 2024

I've written the document from the point of view of A. There is some feature creep into B (e.g. adding RESPONSEFORMAT=parquet in DALI) but I feel like trying to solve all the problems associated with that is too much at this stage.

In TOPCAT if I see parquet columns that I can't make into StarTable columns I just ignore those columns and write a Warning through the logging system.
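The ignore-and-warn behaviour described above might look roughly like this. It is a Python sketch of the idea only, not TOPCAT's actual (Java) code: the logger name, the `MAPPABLE` set, the type names, and `usable_columns` are all hypothetical.

```python
import logging

logger = logging.getLogger("voparquet.sketch")  # hypothetical logger name

# Illustrative names for element types assumed to map cleanly to VOTable.
MAPPABLE = {
    "bool", "uint8", "int16", "int32", "int64",
    "float32", "float64", "string",
}


def usable_columns(columns):
    """Given a dict of column name -> type name, return the names of the
    columns that can be used, logging a warning for each one skipped."""
    kept = []
    for name, type_name in columns.items():
        if type_name in MAPPABLE:
            kept.append(name)
        else:
            logger.warning(
                "Ignoring column %r: unsupported type %s", name, type_name
            )
    return kept
```

A stricter client could raise an exception instead of logging, which corresponds to the "fail" option mentioned earlier in the thread.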

@mbtaylor
Contributor Author

I've added some text that explicitly lists those column types that can be mapped to VOTable in a straightforward way. Since this comes before the paragraph about what to do for types that can't be straightforwardly mapped, I think it covers (1) and a nod towards (2).

@mbtaylor mbtaylor requested a review from fxpineau December 10, 2024 09:49
@mbtaylor mbtaylor merged commit 4bc9263 into ivoa:main Dec 11, 2024
@mbtaylor mbtaylor deleted the mismatch branch December 11, 2024 09:09