Load hdf5 dataset > 2GB #13

Open
JiangXL opened this issue Apr 16, 2020 · 6 comments · May be fixed by #17

@JiangXL

JiangXL commented Apr 16, 2020

Dear team,

Could you add support for loading datasets larger than 2 GB? HDF5 is typically chosen for much larger data, and it performs better than stacked TIFF.

Thanks!!!

@imagejan
Member

I don't know whether this component is still actively developed. You may also want to consider alternatives such as BDV-HDF5 or n5.

For what it's worth, here's a Groovy script I once created to read large HDF5 files in chunks of 2GB:

#@ File (label = "HDF5 File Input", style = "extensions:h5/hdf5") h5file
#@ Boolean (label = "Automatic Chunk Size") autoChunkSize
#@ String (visibility = MESSAGE, persist = false, value = "If checked, the following value is ignored") msg
#@ Integer (label = "Chunk Size (number of time points)", min = 1, value = 1000) chunk
#@output imgs
#@ LogService log

import ch.systemsx.cisd.hdf5.HDF5Factory
import ch.systemsx.cisd.hdf5.HDF5DataClass
import ij.ImagePlus
import ij.process.ShortProcessor
import net.imglib2.img.array.ArrayImgs

reader = HDF5Factory.openForReading(h5file)

info = reader.getDataSetInformation("/images")

log.info("Dataset found: $info")

// Make sure we have uint16
assert(!info.getTypeInformation().isSigned()) // u
assert(info.getTypeInformation().getDataClass() == HDF5DataClass.INTEGER) // int
assert(info.getTypeInformation().getElementSize() == 2) // 16

// Make sure we have 3 dimensions (tyx)
dims = info.getDimensions()
assert(dims.length == 3)

// automatically determine optimal chunk size (see the worked example after this script)
final twoGiga = 2l * 1024 * 1024 * 1024
optimalChunkSize = twoGiga / (16/8) / dims[2] / dims[1]
log.info("Optimal chunk size: $optimalChunkSize")
if (autoChunkSize) {
	chunk = optimalChunkSize as int
}
log.info("Using chunk size $chunk")

numberOfChunks = ((dims[0] / chunk) as int) + 1
log.info("Creating $numberOfChunks chunks in total")

imgs = []
numberOfChunks.times { index ->
	log.info("Reading chunk ${index+1}")
	shortArray = reader.uint16().readMDArrayBlock("/images", [chunk, dims[1], dims[2]] as int[], [index, 0, 0] as long[])
	// Create ArrayImg from MDShortArray
	aDims = []
	shortArray.dimensions().each { d ->
		aDims << d
	}
	imgs << ArrayImgs.unsignedShorts(shortArray.getAsFlatArray(), aDims.reverse() as long[])
}

// Close HDF5 File
reader.close()
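
For reference, here is the arithmetic behind the chunk-size formula in the script (the 1024x1024 frame size is just an illustrative assumption; the point is that each chunk is budgeted at 2 GiB of uint16 pixels):

// worked example of the chunk-size formula above (sketch only; frame size is hypothetical)
long twoGiga = 2L * 1024 * 1024 * 1024             // 2 GiB byte budget per chunk
int width = 1024, height = 1024                    // example frame dimensions
def framesPerChunk = twoGiga / 2 / width / height  // 2 bytes per uint16 pixel
assert framesPerChunk == 1024                      // about 1024 time points per chunk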

@JiangXL
Author

JiangXL commented Apr 24, 2020

Thanks for your code!
For now I'm using BigTIFF for data analysis (in Julia) and visualization (in Fiji), but HDF5 is still better and natively supported in many programming environments.

@ExtraE113 linked a pull request on Jul 23, 2022 that will close this issue
@MarkRivers

The limit is not actually 2 GB; it is 2G (2^31) array elements. This plugin can load HDF5 files of 32-bit floats with dimensions 1024x1024x2047, which is nearly 8 GB, but it cannot load files with dimensions 1024x1024x2048 in any data type, i.e. 8-bit, 16-bit, or 32-bit.
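
As a quick sanity check of those numbers (a Groovy sketch, not the plugin's code): a single Java array is indexed by int, so it can never hold more than Integer.MAX_VALUE (2^31 - 1) elements, whatever the element size:

// element-count arithmetic behind the limit described above
long fits     = 1024L * 1024 * 2047   // 2,146,435,072 elements: fits in one Java array
long tooLarge = 1024L * 1024 * 2048   // 2,147,483,648 elements: exceeds 2^31 - 1
assert fits     <= Integer.MAX_VALUE
assert tooLarge >  Integer.MAX_VALUE  // fails for 8-, 16-, and 32-bit data alike
println "1024x1024x2047 of 32-bit floats: ${fits * 4 / (1024 * 1024 * 1024)} GiB"  // ~8 GiB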

@MarkRivers

Note that the HDF5 plugin below will open large HDF5 datasets fine if "virtual stack" is selected in the dialog box:
https://github.com/paulscherrerinstitute/ch.psi.imagej.hdf5

However, for many applications virtual stacks are not what is needed, because they are read-only. For example, I have an 8 GB signed-integer HDF5 dataset, so I need to read it into a real stack and apply a calibration so that the display correctly shows signed integers. This works fine when I read the data from a netCDF-3 file, but the native Java HDF5 reader plugin fails when the number of array elements is 2^31 or greater. This is a serious and rather silly limitation these days, when 128 GB of RAM costs less than $1,000.
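
For reference, a minimal sketch of that workflow in ImageJ (the dimensions are hypothetical, and in practice the pixel data would come from the file): build a regular in-memory stack and attach ImageJ's signed-16-bit calibration so the unsigned pixel buffer is displayed as signed values:

import ij.ImagePlus
import ij.ImageStack
import ij.process.ShortProcessor

int w = 1024, h = 1024, nSlices = 8               // hypothetical dimensions
def stack = new ImageStack(w, h)
nSlices.times { i ->
    // in practice each plane would be filled from the HDF5/netCDF reader;
    // the signed values sit unchanged in the (unsigned) short pixel buffer
    stack.addSlice("slice ${i + 1}", new ShortProcessor(w, h))
}
def imp = new ImagePlus("signed data", stack)
// map the raw 0..65535 pixel values to -32768..32767 for display and measurements
imp.getCalibration().setSigned16BitCalibration()
imp.show()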

@mkitti

mkitti commented Dec 5, 2022

We can and should fix this by employing imglib2. BigDataViewer and/or n5-viewer can read large HDF5 datasets without difficulty. This is on my todo list after having updated JHDF5 to 19.04.01.
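
For illustration, here is a rough Groovy sketch of that direction (the "/images" dataset name and tyx layout follow the script earlier in this thread; the file path and one-time-point-per-cell layout are assumptions). imglib2's CellImg backs the image with many small arrays, so the total element count can exceed 2^31 as long as each cell stays under the single-array limit:

import ch.systemsx.cisd.hdf5.HDF5Factory
import net.imglib2.img.array.ArrayImgs
import net.imglib2.img.cell.CellImgFactory
import net.imglib2.type.numeric.integer.UnsignedShortType
import net.imglib2.view.Views

def reader = HDF5Factory.openForReading(new File("/path/to/large.h5"))  // hypothetical path
def dims = reader.getDataSetInformation("/images").getDimensions()      // [t, y, x]
def imgDims = [dims[2], dims[1], dims[0]] as long[]                     // imglib2 order: x, y, t

// one cell per time point, so each backing array holds only x*y elements
def img = new CellImgFactory<UnsignedShortType>(new UnsignedShortType(),
        dims[2] as int, dims[1] as int, 1).create(imgDims)

// copy the data one time point at a time via JHDF5 block reads
for (long t = 0; t < dims[0]; t++) {
    def block = reader.uint16().readMDArrayBlock("/images",
            [1, dims[1] as int, dims[2] as int] as int[], [t, 0, 0] as long[])
    def src = ArrayImgs.unsignedShorts(block.getAsFlatArray(), dims[2], dims[1], 1L)
    def dst = Views.interval(img, [0, 0, t] as long[], [dims[2] - 1, dims[1] - 1, t] as long[])
    def srcCursor = Views.flatIterable(src).cursor()
    Views.flatIterable(dst).each { it.set(srcCursor.next().get()) }
}
reader.close()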

@mkitti

mkitti commented Dec 5, 2022

@MarkRivers if you have a few minutes for a chat, could you contact me at [email protected]?
