pwiz mzMLb v0.6
Pre-release
IMPORTANT
This repository is a fork of ProteoWizard, so the precompiled Windows executables available below include proprietary vendor libraries. To use them you must agree to the licensing terms detailed at http://proteowizard.sourceforge.net/downloads.shtml
mzMLb file format implemented in ProteoWizard
A NetCDF4 compliant HDF5 file format for mass spectrometry data, implemented in ProteoWizard version 3_0_11349 (8 Sep 2017). mzML 1.1.0 is utilised to encode all metadata, which is then encapsulated in HDF5 to enable both compression and fast random access plus native storage of floating point data. MS Numpress compression is also supported.
This package implements an mzMLb (mzML binary) reader/writer together with some simple truncation and prediction filters that rival MS Numpress compression but without the complexity. To achieve this, the ProteoWizard mzML code was extended (https://github.com/biospi/pwiz-mzmlb/tree/master/pwiz/pwiz/data/msdata) and HDF5-specific code was added in a new Connection_mzMLb class situated in https://github.com/biospi/pwiz-mzmlb/tree/master/pwiz/pwiz/data/msdata/mzmlb. Below we describe the mzMLb format and the new msconvert arguments to use it.
Note on building for Windows
Prerequisites: For Windows, you must use Visual Studio 2013 (the free Community Edition is ok); 2015 and 2017 are not currently supported. You will also need to have installed the Thermo MSFileReader v3.0SP2 (no later) to build with Thermo raw file support.
There is a problem with inconsistent line endings in the pwiz source code. If you are using Windows you need to install 'Git for Windows' with the setting to not alter line endings, otherwise a few tests will fail! (Altering line endings appears to break mzML v1.0 compatibility, although I did not debug this fully.) Also, FileSystemTest will fail unless you make C:/ writable by 'Users'.
The mzMLb format
All mzMLb files must include an HDF5 dataset mzML with a fixed-length string attribute version. The version currently supported is:
mzMLb 0.6
The mzML document has its base64 data and any index removed and is then stored in the HDF5 mzML dataset as a character array. You can use the new mzmlbcat utility to output all or part of this data.
The spectrum and chromatogram indexes are replaced with HDF5 datasets. mzML_spectrumIndex and mzML_chromatogramIndex are 1D arrays of 64-bit integers replicating the file pointer offsets, except that there is an extra offset at the end of each array representing one past the end position of the last spectrum/chromatogram. mzML_spectrumIndex_idRef and mzML_chromatogramIndex_idRef are 1D character arrays containing all the id references as null-terminated strings concatenated together. The optional spotID attributes can be contained in datasets mzML_spectrumIndex_spotID and mzML_chromatogramIndex_spotID, while the optional scanTime attributes can be contained in a floating point dataset mzML_spectrumIndex_scanTime similarly.
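The index layout described above can be illustrated with a minimal sketch. It uses plain in-memory Python objects as stand-ins for the mzML, mzML_spectrumIndex and mzML_spectrumIndex_idRef datasets; the helper names `split_id_refs` and `spectrum_xml` and the toy XML are hypothetical and are not part of pwiz-mzmlb, and real files would be read through an HDF5 library instead.

```python
from array import array


def split_id_refs(id_ref_bytes):
    """Split a buffer of concatenated null-terminated strings into a list."""
    parts = id_ref_bytes.split(b"\x00")
    # The terminator of the last string leaves one empty trailing element.
    if parts and parts[-1] == b"":
        parts.pop()
    return [p.decode("utf-8") for p in parts]


def spectrum_xml(mzml_bytes, offsets, id_refs, wanted_id):
    """Return the XML text of one spectrum using the offset index."""
    i = id_refs.index(wanted_id)
    # The index holds one more offset than there are spectra, so
    # offsets[i + 1] is valid even for the last spectrum.
    return mzml_bytes[offsets[i]:offsets[i + 1]].decode("utf-8")


# Toy data (an assumption for illustration, not real mzML).
header = b"<spectrumList>"
first = b'<spectrum index="0" id="scan=1"></spectrum>'
second = b'<spectrum index="1" id="scan=2"></spectrum>'
mzml_bytes = header + first + second + b"</spectrumList>"

start = len(header)
offsets = array("q", [start, start + len(first), start + len(first) + len(second)])
id_refs = split_id_refs(b"scan=1\x00scan=2\x00")

print(spectrum_xml(mzml_bytes, offsets, id_refs, "scan=2"))
```

Because of the extra trailing offset, the byte range of every entry is simply `offsets[i]` to `offsets[i + 1]`, with no special case for the last spectrum.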
The mzML base64-encoded binary data is moved into one or more HDF5 datasets. Floating point binary data (i.e. all non-Numpress compressed <binaryDataArray> elements) is stored as native HDF5 floating point datasets, while Numpress data is stored as a non-base64 encoded bytestream with HDF5 data type OPAQUE. The mzML is modified slightly to specify this linkage to external data in the same way as is done in imzML. The XML is hence valid mzML.
Version 0.6 is the latest version and is designated the 'Release Candidate' for production use.
msconvert arguments
--mzMLb
Convert input to mzMLb format.
--mzMLbCompressionLevel=[0-9]
Use either no compression (0) or GZIP compression at strength 1 to 9. Compression is applied to the mzML and all binary HDF5 datasets. Specifying --zlib or -z instead uses the default compression strength of 4.
--mzMLbChunkSize=[4096-]
Defines the chunk size, in bytes, to use for the mzML and all binary HDF5 datasets. A smaller chunk size improves random access speed at the expense of compression efficiency. The default is 1048576 (1 MiB chunks).
--mzTruncation=[0-]
--intenTruncation=[0-]
Perform lossy compression by removing the last n bits of mantissa from floating point data before storage. The default is 0 (no removal). Set to -1 to truncate to integers.
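To make the effect of mantissa truncation concrete, here is a minimal Python sketch. It assumes 64-bit IEEE 754 values and assumes that truncation simply zeroes the low bits (rather than rounding); the function name `truncate_mantissa` is hypothetical and this is not the pwiz-mzmlb implementation.

```python
import struct


def truncate_mantissa(value, n_bits):
    """Zero the n_bits least significant mantissa bits of a 64-bit float."""
    if not 0 <= n_bits <= 52:
        raise ValueError("a 64-bit float has 52 explicit mantissa bits")
    # Reinterpret the float as an unsigned 64-bit integer.
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    # Mask with the low n_bits cleared, limited to 64 bits.
    mask = ~((1 << n_bits) - 1) & 0xFFFFFFFFFFFFFFFF
    (truncated,) = struct.unpack("<d", struct.pack("<Q", bits & mask))
    return truncated


mz = 445.120025
approx = truncate_mantissa(mz, 19)
# Relative error is bounded by 2**(n_bits - 52) for normal 64-bit floats.
print(mz, approx, abs(mz - approx) / mz)
```

The zeroed low bits form long runs of identical bytes, which is what lets a general-purpose compressor such as gzip shrink the data further.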
--mzDelta
--intenDelta
--mzLinear
--intenLinear
Store mz/rt or intensity values after delta or linear prediction. Predictive encoding of mz/rt values may lead to moderate improvements in gzip compression, or further improvements after floating point precision loss.
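The idea behind the two prediction modes can be sketched as follows. This is an illustration under assumptions: delta prediction is taken to mean storing the difference from the previous value, and linear prediction is taken to mean storing the difference from a straight-line extrapolation of the previous two values. The exact predictors used by pwiz-mzmlb are not stated here, and all function names are hypothetical.

```python
def delta_encode(values):
    """Store each value minus its predecessor (the first is kept as-is)."""
    out, prev = [], 0.0
    for v in values:
        out.append(v - prev)
        prev = v
    return out


def delta_decode(residuals):
    """Invert delta_encode by accumulating the residuals."""
    out, prev = [], 0.0
    for r in residuals:
        prev = prev + r
        out.append(prev)
    return out


def _linear_prediction(history):
    """Extrapolate the next value from up to two previous values."""
    if not history:
        return 0.0
    if len(history) == 1:
        return history[-1]
    return 2 * history[-1] - history[-2]


def linear_encode(values):
    """Store each value minus its linear extrapolation."""
    return [v - _linear_prediction(values[:i]) for i, v in enumerate(values)]


def linear_decode(residuals):
    """Invert linear_encode by re-running the same predictor."""
    out = []
    for r in residuals:
        out.append(_linear_prediction(out) + r)
    return out


mz = [100.0, 100.5, 101.0, 101.5, 102.25]
print(delta_encode(mz))   # [100.0, 0.5, 0.5, 0.5, 0.75]
print(linear_encode(mz))  # [100.0, 0.5, 0.0, 0.0, 0.25]
```

On evenly spaced values the linear residuals collapse to zero, which compresses well. The toy values here are exactly representable in binary, so the round trip is exact; with general floating point data the subtractions can round, so a lossless implementation would need more care than this sketch takes.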
Examples
For lossless compression just use:
msconvert --mzMLb -z <input_file>
For our recommended compression with error not exceeding that of default Numpress use:
msconvert --mzMLb -z --mzLinear --mzTruncation=19 --intenTruncation=7 <input_file>
Todo
See https://github.com/biospi/mzmlb/issues.
Acknowledgements
Developed by Andrew Dowsey and Andris Jankevics of the biospi team with funding from BBSRC BB/M024954/1 and MRC MR/L011093/1.