
pwiz mzMLb v0.6

Pre-release
@awd97 awd97 released this 28 Feb 12:47
· 3 commits to mzMLb since this release

IMPORTANT

This repository is a fork of ProteoWizard, so the precompiled Windows executables available below include proprietary vendor libraries. To use them you should agree to the licensing terms detailed at http://proteowizard.sourceforge.net/downloads.shtml

mzMLb file format implemented in ProteoWizard

A NetCDF4-compliant HDF5 file format for mass spectrometry data, implemented in ProteoWizard version 3_0_11349 (8 Sep 2017). mzML 1.1.0 is utilised to encode all metadata, which is then encapsulated in HDF5 to enable compression, fast random access and native storage of floating-point data. MS Numpress compression is also supported.

This package implements an mzMLb (mzML binary) reader/writer together with some simple truncation and prediction filters that rival MS Numpress compression but without the complexity. To achieve this, the ProteoWizard mzML code was extended (https://github.com/biospi/pwiz-mzmlb/tree/master/pwiz/pwiz/data/msdata) and HDF5-specific code added in a new Connection_mzMLb class situated in https://github.com/biospi/pwiz-mzmlb/tree/master/pwiz/pwiz/data/msdata/mzmlb. Below we describe the mzMLb format and the new msconvert arguments to use it.

Note on building for Windows

Prerequisites: For Windows, you must use Visual Studio 2013 (the free Community Edition is OK); 2015 and 2017 are not currently supported. To build with Thermo raw file support you will also need to have installed Thermo MSFileReader v3.0SP2 (not a later version).

There is a problem with inconsistent line endings in the pwiz source code. If you are using Windows you need to install 'Git for Windows' with the setting that leaves line endings unaltered, otherwise a few tests will fail! (Altering line endings appears to break mzML v1.0 compatibility, although I did not debug this fully.) Also, FileSystemTest will fail unless you make C:/ writeable by 'Users'.

The mzMLb format

All mzMLb files must include an HDF5 dataset mzML with a fixed-length string attribute version. The version currently supported is:


mzMLb 0.6

The mzML document has its base64 data and any index removed and is then stored in the HDF5 mzML dataset as a character array. You can use the new mzmlbcat utility to output all or part of this data.

The spectrum and chromatogram indexes are replaced with HDF5 datasets. mzML_spectrumIndex and mzML_chromatogramIndex are 1D arrays of 64-bit integers replicating the file pointer offsets, except that there is an extra offset at the end of each array representing one past the end position of the last spectrum/chromatogram. mzML_spectrumIndex_idRef and mzML_chromatogramIndex_idRef are 1D character arrays containing all the id references as null-terminated strings concatenated together. The optional spotID attributes can be stored in datasets mzML_spectrumIndex_spotID and mzML_chromatogramIndex_spotID, and the optional scanTime attributes can similarly be stored in a floating-point dataset mzML_spectrumIndex_scanTime.
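The index layout described above can be illustrated with a minimal Python sketch. It does not read a real mzMLb file: the byte strings and the list of offsets are hypothetical stand-ins for the mzML, mzML_spectrumIndex and mzML_spectrumIndex_idRef datasets, and the function names are ours. It only shows why the extra final offset is convenient (entry i spans offsets[i] to offsets[i + 1]) and how concatenated null-terminated id references can be split.

```python
def split_id_refs(id_ref_bytes):
    """Split null-terminated strings that were concatenated together."""
    parts = id_ref_bytes.split(b"\x00")
    # The final terminator leaves one empty trailing piece; drop it.
    if parts and parts[-1] == b"":
        parts.pop()
    return [p.decode("utf-8") for p in parts]


def fetch_entry(mzml, offsets, i):
    """Return entry i of the mzML character array.

    Because the index holds N + 1 offsets for N entries, the end of the
    last entry is found the same way as the end of every other entry.
    """
    return mzml[offsets[i]:offsets[i + 1]]


# Hypothetical miniature stand-ins for the HDF5 datasets.
entries = [b'<spectrum id="scan=1"/>', b'<spectrum id="scan=2"/>']
mzml = b"".join(entries)

spectrum_index = [0]
for entry in entries:
    spectrum_index.append(spectrum_index[-1] + len(entry))

spectrum_index_id_ref = b"scan=1\x00scan=2\x00"

ids = split_id_refs(spectrum_index_id_ref)
second = fetch_entry(mzml, spectrum_index, ids.index("scan=2"))
```

Here `second` holds the XML of the spectrum whose id is scan=2, located without scanning the rest of the document.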

The mzML base64 encoded binary data is moved into one or more HDF5 datasets. Floating point binary data (i.e. all non-Numpress compressed <BinaryDataArray>) is stored as native HDF5 floating point datasets, while Numpress data is stored as a non-base64 encoded bytestream with HDF5 data type OPAQUE. The mzML is modified slightly to specify this linkage to external data in the same way as is done in imzML. The XML is hence valid mzML.

Version 0.6 is the latest version and is defined as the 'Release Candidate' for production use.

msconvert arguments

--mzMLb

Convert input to mzMLb format.


--mzMLbCompressionLevel=[0-9]

Selects either no compression (0) or a GZIP compression strength from 1 to 9. Compression is applied to the mzML and all binary HDF5 datasets. Specifying --zlib or -z instead will use the default compression strength of 4.


--mzMLbChunkSize=[4096-]

Defines the chunk size, in bytes, to use for the mzML and all binary HDF5 datasets. A smaller chunk size improves random access speed at the expense of compression efficiency. The default is 1048576 (1 MiB chunks).
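The trade-off can be illustrated without HDF5. The sketch below is only a model, not how msconvert writes files: it uses Python's zlib module to compress each chunk independently, which is how the HDF5 gzip filter treats chunks, and uses synthetic repetitive data. The function names and the test data are ours.

```python
import zlib


def compress_in_chunks(data, chunk_size, level=4):
    """Compress each chunk_size slice of data as an independent zlib stream."""
    return [
        zlib.compress(data[start:start + chunk_size], level)
        for start in range(0, len(data), chunk_size)
    ]


def read_byte(chunks, chunk_size, position):
    """Random access: decompress only the chunk that holds `position`."""
    chunk = zlib.decompress(chunks[position // chunk_size])
    return chunk[position % chunk_size]


# Synthetic, highly repetitive 1 MiB payload (an assumption for illustration).
data = bytes(range(256)) * 4096

small_chunks = compress_in_chunks(data, 4096)      # minimum chunk size
large_chunks = compress_in_chunks(data, 1048576)   # default chunk size

small_total = sum(len(c) for c in small_chunks)
large_total = sum(len(c) for c in large_chunks)
```

With 4096-byte chunks a single read decompresses only 4 KiB, but the compressor cannot exploit repetition across chunk boundaries, so `small_total` comes out larger than `large_total`.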


--mzTruncation=[0-] --intenTruncation=[0-]

Perform lossy compression by removing the last n bits of mantissa from floating point data before storage. The default is 0 (no removal). Set to -1 to truncate to integers.
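To make the effect of truncation concrete, here is a minimal Python sketch. It assumes 64-bit IEEE 754 doubles and assumes that "removing" bits means zeroing them; the actual filter in this package may differ in detail (for example by rounding, or by operating on 32-bit floats). The function name is ours. Zeroed trailing bits form long runs of identical bytes, which is what lets the subsequent gzip stage compress better.

```python
import struct


def truncate_mantissa(value, n_bits):
    """Zero the lowest n_bits of the 52-bit mantissa of a double.

    Assumption: truncation is modelled as zeroing bits, with no rounding.
    """
    if not 0 <= n_bits <= 52:
        raise ValueError("n_bits must be between 0 and 52")
    # Reinterpret the double as a 64-bit unsigned integer.
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    mask = ~((1 << n_bits) - 1) & 0xFFFFFFFFFFFFFFFF
    (truncated,) = struct.unpack("<d", struct.pack("<Q", bits & mask))
    return truncated


mz = 445.34567
mz_truncated = truncate_mantissa(mz, 19)
relative_error = abs(mz_truncated - mz) / mz
```

Under these assumptions, zeroing n bits of a normal double bounds the relative error by 2^(n-52), so n = 19 keeps the relative error below 2^-33.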


--mzDelta --intenDelta --mzLinear --intenLinear

Store mz/rt or intensity values after delta or linear prediction. Predictive encoding of mz/rt values may lead to moderate improvements in gzip compression, or further improvements after floating point precision loss.
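The idea behind the two predictors can be sketched as follows. This is an illustration under assumptions, not the code used in this package: delta prediction is taken to mean storing the difference from the previous value, and linear prediction to mean storing the residual from the extrapolation 2*x[i-1] - x[i-2]. The function names are ours. Note too that with arbitrary floating-point values the arithmetic shown here can introduce rounding error, so a lossless implementation needs more care than this sketch takes.

```python
def delta_encode(values):
    """Store each value as the difference from its predecessor."""
    residuals, previous = [], 0.0
    for v in values:
        residuals.append(v - previous)
        previous = v
    return residuals


def delta_decode(residuals):
    """Invert delta_encode with a running sum."""
    values, previous = [], 0.0
    for r in residuals:
        previous = previous + r
        values.append(previous)
    return values


def _linear_prediction(history):
    """Extrapolate the next value from the last two (assumed predictor)."""
    if not history:
        return 0.0
    if len(history) == 1:
        return history[0]
    return 2 * history[-1] - history[-2]


def linear_encode(values):
    """Store each value as the residual from linear extrapolation."""
    return [v - _linear_prediction(values[:i]) for i, v in enumerate(values)]


def linear_decode(residuals):
    """Invert linear_encode by re-running the same predictor."""
    values = []
    for r in residuals:
        values.append(_linear_prediction(values) + r)
    return values


# Evenly spaced values with one irregular step (exactly representable).
mz = [100.0, 100.5, 101.0, 101.5, 102.25]
delta_residuals = delta_encode(mz)
linear_residuals = linear_encode(mz)
```

For smoothly spaced m/z values the linear residuals cluster near zero, giving the compressor more repeated bytes to work with than the raw values or the plain deltas.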

Examples

For lossless compression just use:

msconvert --mzMLb -z <input_file>

For our recommended compression with error not exceeding that of default Numpress use:

msconvert --mzMLb -z --mzLinear --mzTruncation=19 --intenTruncation=7 <input_file>

Todo

See https://github.com/biospi/mzmlb/issues.

Acknowledgements

Developed by Andrew Dowsey and Andris Jankevics of the biospi team with funding from BBSRC BB/M024954/1 and MRC MR/L011093/1.