adding datetime-like dtypes to ndarray #270

CagtayFabry · 2020-07-01T11:21:23Z

In the light of discussions around version 2.0 of the asdf-standard (and the version bump of all schemas) I would be interested to hear some opinions about extending the supported dtypes of ndarray. Specifically I am interested in adding support for datetime and timedelta like dtypes directly to the ndarray schema.

I am aware of the existing time/time-1.1.0 schema, which, while versatile and complex, seems to be rather specific to astropy use cases in some regards. I think working with POSIX/unix datetimes with high (ns) precision is common in many scientific applications.

Currently core/ndarray-1.0.0 supports the basic (u)int, float and complex dtypes defined in the schema:

asdf-standard/schemas/stsci.edu/asdf/core/ndarray-1.0.0.yaml, lines 190 to 191 in 29d3410:

enum: [int8, uint8, int16, uint16, int32, uint32, int64, uint64,
       float32, float64, complex64, complex128, bool8]

The asdf python library handles the corresponding numpy mappings here:

_datatype_names = {
    'int8'       : 'i1',
    'int16'      : 'i2',
    'int32'      : 'i4',
    'int64'      : 'i8',
    'uint8'      : 'u1',
    'uint16'     : 'u2',
    'uint32'     : 'u4',
    'uint64'     : 'u8',
    'float32'    : 'f4',
    'float64'    : 'f8',
    'complex64'  : 'c8',
    'complex128' : 'c16',
    'bool8'      : 'b1'
}

Under the hood, numpy datetime arrays are basically just 64-bit integers interpreted as POSIX timestamps or timedeltas.
Unfortunately we cannot store these in an ndarray directly without casting back to integer:

import numpy as np
import asdf

tree = {"times": np.arange(0, 3, dtype="datetime64[ns]")}
with asdf.AsdfFile(tree) as ff:
    ff.write_to("datetimes.asdf")

>>> ValueError: cannot include dtype 'M' in a buffer

This makes handling of numpy datetime arrays somewhat irritating (I noticed this when working with pandas and xarray objects in asdf).
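For reference, the cast-to-integer workaround mentioned above can be sketched with numpy alone. This is only an illustration of the round trip, not how the asdf library handles it; the step that actually writes the integer array to a file is left as a comment, and the unit ("ns") would have to be recorded separately by the user.

```python
import numpy as np

# Times as nanoseconds since the POSIX epoch.
times = np.arange(0, 3, dtype="datetime64[ns]")

# Reinterpret the same 8-byte values as plain integers (no copy is made).
as_int = times.view("int64")

# ... store `as_int` in the ASDF tree and keep track of the unit ("ns") ...

# After reading, reinterpret the integers as datetimes again.
restored = as_int.view("datetime64[ns]")
```

The round trip is lossless, but the information that the integers are timestamps lives outside the array, which is the inconvenience described here.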

I think natively supporting numpy datetime dtypes would simplify a lot of things when using asdf with other libraries that make use of numpy's datetime dtypes, thus possibly expanding asdf to be used more widely (at least throughout the python/scipy ecosystem).

In principle, supporting more dtypes should be as easy as extending the standard schema and plugin lists for the asdf-standard schema as well as the python mapping (it seems to work, but I have not looked into it in detail):

enum: [int8, uint8, int16, uint16, int32, uint32, int64, uint64,
       float32, float64, complex64, complex128, bool8, "timedelta64[ns]", "datetime64[ns]"]

_datatype_names = {
    'int8'       : 'i1',
    'int16'      : 'i2',
    'int32'      : 'i4',
    'int64'      : 'i8',
    'uint8'      : 'u1',
    'uint16'     : 'u2',
    'uint32'     : 'u4',
    'uint64'     : 'u8',
    'float32'    : 'f4',
    'float64'    : 'f8',
    'complex64'  : 'c8',
    'complex128' : 'c16',
    'bool8'      : 'b1',
    'timedelta64[ns]': 'm8[ns]',
    'datetime64[ns]': 'M8[ns]'
}
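As a quick sanity check of the two proposed entries, the following sketch confirms that numpy resolves the short codes to the long names used in the extended enum. The `proposed` dict is just a local copy of the two new lines above, not part of the asdf library.

```python
import numpy as np

# The two entries proposed above (names as used in the extended enum).
proposed = {
    "timedelta64[ns]": "m8[ns]",
    "datetime64[ns]": "M8[ns]",
}

for name, code in proposed.items():
    dtype = np.dtype(code)
    # numpy reports the long name for the short code; both are 8 bytes wide.
    print(name, "->", dtype.name, dtype.itemsize)
```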

Of course, one issue with adding dtypes to the core ndarray schema is that all libraries implementing the asdf-standard (asdf-cpp?) would have to add support for these specific datetime dtypes. Honestly, I am not aware of how many asdf implementations there are for other languages and how difficult this would be to implement (probably not as easy as with python/numpy).

Another option could be to somehow allow an extension to add support for specific dtypes to ndarray. However I don't know if this can be done in the current implementation of the asdf-standard.


eslavich · 2020-07-01T12:37:19Z

I like this idea. I don't see this as being an undue burden on other languages, since they're free to deserialize the timestamps into a regular integer array. ASDF implementations would need to "remember" that the type was timestamp, but there are already other properties of ndarrays that need to be tracked and handled.

Some alternative implementation ideas:

Introduce an optional subtype field to ndarray and include timestamp_ns and timedelta_ns as options, so that users can efficiently store small timedeltas using containers less than 64 bits.
Create new tags for ndarrays that are to be interpreted as timestamps or timedeltas, something like http://asdf-format.org/schemas/ndarray_timedelta. The schema would just be a simple $ref to the ndarray schema (only possible if we remove the tag property from core schemas, as suggested in Remove tag property from root object of core schemas #269)
Is it possible to serialize these as a quantity-1.1.0, which is just an ndarray + a unit? We could add a config option to the library that causes quantity with unit "ns" to be deserialized into numpy.ndarray with dtype ... well, that wouldn't work, would it, since we couldn't distinguish between timedelta and datetime.
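To make the first idea concrete, here is a rough sketch of how a reader might apply an optional subtype to an integer array stored in a narrow container. The `SUBTYPES` mapping and the `apply_subtype` helper are hypothetical names made up for this illustration; only the subtype values `timestamp_ns` and `timedelta_ns` come from the suggestion above.

```python
import numpy as np

# Hypothetical mapping from the proposed subtype values to numpy dtypes.
SUBTYPES = {
    "timestamp_ns": "datetime64[ns]",
    "timedelta_ns": "timedelta64[ns]",
}

def apply_subtype(data, subtype=None):
    """Interpret a stored integer array according to an optional subtype."""
    if subtype is None:
        return data
    # numpy's time dtypes are always 64 bits wide, so widen narrow storage first.
    return data.astype("int64").astype(SUBTYPES[subtype])

# Small timedeltas stored in a 16-bit container (6 bytes instead of 24).
stored = np.array([0, 250, 1000], dtype="int16")
deltas = apply_subtype(stored, "timedelta_ns")
```

An implementation in another language could ignore the subtype and simply hand back the integer array, which matches the point about not burdening other languages.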

CagtayFabry · 2020-07-01T13:42:12Z

I also thought about something like subtype (basically how we handle numpy datetime dtypes in our own classes currently). It might prevent bloating the supported dtypes and end up more flexible (similar to how big/little endian encoding is handled now).

When introducing http://asdf-format.org/schemas/ndarray_timedelta I could see this leading to cases of

anyOf:
  - tag: http://asdf-format.org/schemas/ndarray
  - tag: http://asdf-format.org/schemas/ndarray_timedelta

in other schemas where both cases should be allowed. Could this be prevented? (same for using quantity)

eslavich · 2020-07-01T14:13:56Z

That's a good point, subtype is more convenient for that sort of thing. We could add another custom validator to allow schema authors to restrict the subtype value.

It is possible to create http://asdf-format.org/schemas/ndarray_all for easy access to that anyOf structure, but that seems more complicated than subtype.

eslavich · 2020-07-01T14:14:22Z

@perrygreenfield do you have any thoughts on this one?

CagtayFabry · 2020-07-01T14:25:23Z

I guess implementing subtype should also make it very easy to define http://asdf-format.org/schemas/ndarray_timedelta using allOf without implementing a validator (it could also be defined in custom extensions if necessary).

CagtayFabry · 2020-11-10T10:31:38Z

quick reminder that

  tag: http://asdf-format.org/schemas/ndarray*

should also be possible now

braingram · 2023-11-21T14:03:31Z

Thanks for mentioning this issue! I'll read through it and start taking a look.

braingram · 2023-11-21T21:59:55Z

I spent some time looking into this today. One complication (that I don't yet have a solution for) is that the associated unit for a np.datetime64 can have a number of possible values, which means that the datatype will need to encode not only datetime64 but also the unit needed to interpret the bytes of a datetime64 array. Take the following example:

>> dt0 = np.datetime64(0xFFFF, ("s", 42))
>> dt0
numpy.datetime64('1970-02-01T20:34:30','42s')
>> dt0.tobytes()
b'\xff\xff\x00\x00\x00\x00\x00\x00'
>> dt1 = np.datetime64(0xFFFF, "D")
>> dt1
numpy.datetime64('2149-06-06')
>> dt1.tobytes()
b'\xff\xff\x00\x00\x00\x00\x00\x00'
>> dt0 == dt1
False
>> dt0.tobytes() == dt1.tobytes()
True

Conversion to a 'standard' unit will mean that some valid datetime64 values that use non-standard units will be unusable as the different units have different ranges.
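One way to picture what "encoding the unit" would involve is the sketch below, which splits an array into raw integers plus unit metadata and rebuilds it afterwards. The `encode` and `decode` helpers and the metadata keys are hypothetical; `np.datetime_data` is the numpy function that exposes the unit and multiplier of a datetime dtype.

```python
import numpy as np

def encode(arr):
    """Split a datetime64 array into raw integers plus unit metadata."""
    unit, count = np.datetime_data(arr.dtype)
    return arr.view("int64"), {"unit": unit, "count": count}

def decode(raw, meta):
    """Rebuild the datetime64 array from raw integers and unit metadata."""
    return raw.view(f"datetime64[{meta['count']}{meta['unit']}]")

# Same stored integer, two different units (as in the example above).
dt0 = np.array([0xFFFF], dtype="datetime64[42s]")
dt1 = np.array([0xFFFF], dtype="datetime64[D]")
raw0, meta0 = encode(dt0)
raw1, meta1 = encode(dt1)
```

Here `raw0` and `raw1` hold identical integers, and only the metadata tells the two arrays apart. Keeping the original unit avoids the range problem that conversion to a 'standard' unit would introduce.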

CagtayFabry · 2023-11-21T22:43:09Z

True, but that would have to be stored in the dtype information of the asdf file anyway, as I think there is no simple datetime64 dtype without any unit (please correct me if I'm wrong).
To be fair, my initial example only listed the 'timedelta64[ns]': 'm8[ns]' and 'datetime64[ns]': 'M8[ns]' pairs; I didn't consider the different timescales back then.

for u in ["as", "fs", "ps", "ns", "us", "ms", "s", "m", "h", "D", "W", "M", "Y"]:
    dtype = np.datetime64(0xFFFF, u).dtype
    print(repr(dtype) + " : " + str(dtype))

dtype('<M8[as]') : datetime64[as]
dtype('<M8[fs]') : datetime64[fs]
dtype('<M8[ps]') : datetime64[ps]
dtype('<M8[ns]') : datetime64[ns]
dtype('<M8[us]') : datetime64[us]
dtype('<M8[ms]') : datetime64[ms]
dtype('<M8[s]') : datetime64[s]
dtype('<M8[m]') : datetime64[m]
dtype('<M8[h]') : datetime64[h]
dtype('<M8[D]') : datetime64[D]
dtype('<M8[W]') : datetime64[W]
dtype('<M8[M]') : datetime64[M]
dtype('<M8[Y]') : datetime64[Y]

Of course, it seems impractical to cover every possible "custom" datetime dtype like np.datetime64(0xFFFF, ("s", 42)). Frankly, I have no insight into where this functionality is used.

braingram · 2023-12-01T22:22:07Z

I wanted to update this with something more substantial at this point, but unfortunately all I can say is I'm still looking into this.

I tried implementing this via an extension and things were complicated by the extension needing to follow every asdf standard version (like the NDArrayConverter in asdf). This seems like too much of a burden to put on an extension (as it needs to effectively take over control of all ndarrays).

Do you have an example of code that works around this limitation (perhaps by converting datetime64 to an int32)? I'm curious to see how much difficulty this issue produces.

The datetime64 and timedelta64 datatypes seem a little out of place in numpy. For example, I was unable to find the unit and increment via any dtype attribute and had to rely on np.datetime_data. I have yet to sort out how these might fit into one of the ndarray, time or quantity schemas.

eslavich added the ASDF Standard 2.0.0 label Jul 1, 2020

CagtayFabry mentioned this issue Aug 1, 2021: [asdf] Do not copy array data BAMWelDX/weldx#456 (merged)

CagtayFabry mentioned this issue Nov 21, 2023: Support float16 datatype #410 (open)
