mzMLb

mzMLb is a standard rich HDF5-based format for raw mass spectrometry data storage. It is based upon the mzML XML file format, reusing a subset of its features. This module provides MzMLbLoader, a RandomAccessScanSource implementation. The original design for mzMLb is described in [Bhamber].

The parser is based on pyteomics.mzmlb. It requires h5py to be installed for reading, and hdf5plugin to use the faster, non-zlib-based compressors.

References

Bhamber: Bhamber, R. S., Jankevics, A., Deutsch, E. W., Jones, A. R., & Dowsey, A. W. (2021). MzMLb: A Future-Proof Raw Mass Spectrometry Data Format Based on Standards-Compliant mzML and Optimized for Speed and Storage Requirements. Journal of Proteome Research, 20(1), 172–183. https://doi.org/10.1021/acs.jproteome.0c00192

class ms_deisotope.data_source.mzmlb.MzMLbLoader(source_file, use_index=True, decode_binary=True, index_file=None, **kwargs)

Bases: ms_deisotope.data_source.scan.loader.ScanIterator[ms_deisotope.data_source.scan.loader.DataPtrType, ms_deisotope.data_source.scan.loader.ScanType]

Reads scans from HUPO-PSI mzMLb HDF5 files. Provides both iterative and random access.
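As a rough illustration of iterative access, the sketch below tallies scans per MS level. It assumes ms_deisotope and h5py are installed, that the reader supports the normal Python iterator protocol, and that each scan exposes an integer ms_level attribute (an assumption, not stated in this section). The helper names count_scans_by_level and main, and the path example.mzMLb, are hypothetical.

```python
from collections import Counter


def count_scans_by_level(reader):
    """Tally scans per MS level using ungrouped iteration."""
    # grouped=False asks for single scans rather than ScanBunch objects.
    reader.make_iterator(grouped=False)
    counts = Counter()
    for scan in reader:
        # `ms_level` is assumed to be an integer attribute of each scan.
        counts[scan.ms_level] += 1
    return dict(counts)


def main(path="example.mzMLb"):  # hypothetical path
    # Imported lazily so the helper above can be used without the library.
    from ms_deisotope.data_source.mzmlb import MzMLbLoader

    reader = MzMLbLoader(path)
    try:
        print(count_scans_by_level(reader))
    finally:
        # Release the underlying HDF5 reader.
        reader.close()
```

Calling main() with the path to a real mzMLb file would print a dictionary mapping MS level to scan count.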

source_file

Path to file to read from.

Type: str

source

Underlying scan data source

Type: pyteomics.mzmlb.MzMLb

close(): Close the underlying reader.

data_processing() → List[ms_deisotope.data_source.metadata.data_transformation.DataProcessingInformation]

Describe any preprocessing steps applied to the data described by this instance.

Returns
Return type: list of DataProcessingInformation

property decode_binary: bool: Whether or not to eagerly decode binary data arrays

file_description()

Read the file metadata and provenance from the <fileDescription> tag if it is present.

Returns: The description of the file’s contents and its sources
Return type: FileInformation

find_next_ms1(start_index: int) → Optional[ms_deisotope.data_source.scan.loader.ScanType]

Locate the MS1 scan following start_index, iterating forwards through scans until either the last scan is reached or an MS1 scan is found.

Returns
Return type: ScanBase or None if not found

find_previous_ms1(start_index: int) → Optional[ms_deisotope.data_source.scan.loader.ScanType]

Locate the MS1 scan preceding start_index, iterating backwards through scans until either the first scan is reached or an MS1 scan is found.

Returns
Return type: ScanBase or None if not found
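The two search directions can be sketched over a plain in-memory list. This is a simplified stand-in, not the library's implementation: it assumes the scan at start_index itself is excluded from the search, and it uses a hypothetical DemoScan record with an ms_level field in place of real scan objects.

```python
from collections import namedtuple

# Hypothetical stand-in for a scan; real scans carry far more state.
DemoScan = namedtuple("DemoScan", ["id", "ms_level"])


def find_next_ms1(scans, start_index):
    """Walk forwards from start_index until an MS1 scan or the end."""
    for i in range(start_index + 1, len(scans)):
        if scans[i].ms_level == 1:
            return scans[i]
    return None  # reached the last scan without finding an MS1


def find_previous_ms1(scans, start_index):
    """Walk backwards from start_index until an MS1 scan or the start."""
    for i in range(start_index - 1, -1, -1):
        if scans[i].ms_level == 1:
            return scans[i]
    return None  # reached the first scan without finding an MS1


scans = [
    DemoScan("s0", 1),
    DemoScan("s1", 2),
    DemoScan("s2", 2),
    DemoScan("s3", 1),
    DemoScan("s4", 2),
]
precursor_survey = find_previous_ms1(scans, 2)
following_survey = find_next_ms1(scans, 2)
```

Both searches are linear in the distance to the nearest MS1 scan, so they are cheap when MS1 scans are interleaved regularly.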

get_scan_by_id(scan_id)

Retrieve the scan object for the specified scan id.

If the scan object is still bound and in memory somewhere, a reference to that same object will be returned. Otherwise, a new object will be created.

Parameters: scan_id (str) – The unique scan id value to be retrieved
Returns
Return type: Scan

get_scan_by_index(index)

Retrieve the scan object for the specified scan index.

This internally calls get_scan_by_id() which will use its cache.

Parameters: index (int) – The index to get the scan for
Returns
Return type: Scan

get_scan_by_time(time)

Retrieve the scan object for the specified scan time.

This internally calls get_scan_by_id() which will use its cache.

Parameters: time (float) – The time to get the nearest scan from
Returns
Return type: Scan

property has_fast_random_access

Check whether the underlying data stream supports fast random access or not.

Even if the file format supports random access, it may be impractical due to overhead in parsing the underlying data stream, e.g. calling gzip.GzipFile.seek() can force the file to be decompressed from the beginning of the file on each call. This property can be used to signal to the caller whether or not it should use a different strategy.

Returns: One of DefinitelyNotFastRandomAccess, MaybeFastRandomAccess, or DefinitelyFastRandomAccess. The first is falsy; the latter two evaluate to True.
Return type: Constant
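Because the returned constant can be used in a boolean context, a caller can branch on it directly. The sketch below shows one way to pick a strategy: direct lookups when random access is fast, a single sequential pass otherwise. The function name fetch_scans is hypothetical, and the id attribute on scans is an assumption not stated in this section.

```python
def fetch_scans(reader, wanted_ids):
    """Fetch scans by id, choosing a strategy from has_fast_random_access."""
    if reader.has_fast_random_access:
        # Truthy: random access is (probably) cheap, so seek directly.
        return [reader.get_scan_by_id(scan_id) for scan_id in wanted_ids]

    # Falsy: avoid repeated seeks and make one sequential pass instead.
    wanted = set(wanted_ids)
    found = {}
    reader.reset()
    reader.make_iterator(grouped=False)
    for scan in reader:
        # `id` is assumed to hold the scan's unique identifier.
        if scan.id in wanted:
            found[scan.id] = scan
            if len(found) == len(wanted):
                break
    # Preserve the caller's requested order, skipping ids never seen.
    return [found[scan_id] for scan_id in wanted_ids if scan_id in found]
```

The sequential branch reads each scan at most once, which matters for streams where every seek restarts decompression.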

has_ms1_scans()

Checks if this ScanDataSource contains MS1 spectra.

Returns: A boolean value if the presence of MS1 scans is known for certain, or None if it cannot be determined because metadata is missing.
Return type: bool or None

has_msn_scans()

Checks if this ScanDataSource contains MSn spectra.

Returns: A boolean value if the presence of MSn scans is known for certain, or None if it cannot be determined because metadata is missing.
Return type: bool or None
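Since both checks can return None, a plain truthiness test would conflate "absent" with "unknown". A minimal sketch of handling the three states explicitly follows; the helper names tri_state and describe_contents are hypothetical.

```python
def tri_state(value):
    """Map a bool-or-None answer onto a readable label."""
    if value is None:
        # Metadata was missing, so presence could not be determined.
        return "unknown"
    return "present" if value else "absent"


def describe_contents(reader):
    """Summarize which spectrum levels a reader reports."""
    return {
        "MS1": tri_state(reader.has_ms1_scans()),
        "MSn": tri_state(reader.has_msn_scans()),
    }
```

A caller that needs a definite answer when the result is "unknown" would have to inspect the scans themselves.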

property index

The byte offset index used to achieve fast random access.

Maps scan IDs to byte offsets; the order of the entries reflects the order in which the scans reside in the file.

Returns
Return type: pyteomics.xml.ByteEncodingOrderedDict

initialize_scan_cache()

Initialize a cache which keeps track of which Scan objects are still in memory using a weakref.WeakValueDictionary.

When a scan is requested, if the scan object is found in the cache, the existing object is returned rather than re-read from disk.
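The caching behaviour described here can be sketched with the standard library alone. The classes below are simplified stand-ins (DemoScan and CachingReader are hypothetical names), not the loader's actual code; they only show how a weakref.WeakValueDictionary returns the same object while it is alive and forgets it once nothing else references it.

```python
import weakref


class DemoScan:
    """Stand-in for a scan object; only here to make the sketch runnable."""

    def __init__(self, scan_id):
        self.id = scan_id


class CachingReader:
    """Toy reader that counts how often it has to 'read from disk'."""

    def __init__(self):
        self.reads = 0
        # Entries vanish automatically once the scan is garbage collected.
        self.scan_cache = weakref.WeakValueDictionary()

    def _read_from_disk(self, scan_id):
        self.reads += 1
        return DemoScan(scan_id)

    def get_scan_by_id(self, scan_id):
        scan = self.scan_cache.get(scan_id)
        if scan is None:
            # Cache miss: build a new object and remember it weakly.
            scan = self._read_from_disk(scan_id)
            self.scan_cache[scan_id] = scan
        return scan
```

Because the cache holds only weak references, it never keeps a scan alive on its own, so memory use is bounded by what the caller still holds.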

instrument_configuration() → List[ms_deisotope.data_source.metadata.instrument_components.InstrumentInformation]

Read the instrument configuration settings from the <instrumentConfigurationList> tag.

Returns: A list of different instrument states that scans may be acquired under
Return type: list of InstrumentInformation

make_iterator(iterator=None, grouped=None, **kwargs) → ms_deisotope.data_source.scan.loader.ScanIterator

Configure the ScanIterator’s behavior, selecting its iteration strategy over either its default iterator or the provided iterator argument.

Parameters

iterator (Iterator, optional) – The iterator to manipulate. If missing, the default iterator will be used.
grouped (bool, optional) – Whether the iterator should be grouped and produce ScanBunch objects or single Scan objects. If None is passed, has_ms1_scans() will be used instead. Defaults to None.

next()

Advance the iterator, fetching the next ScanBunch or ScanBase depending upon iteration strategy.

Returns
Return type: ScanBunch or ScanBase

classmethod prebuild_byte_offset_file(path)

A stub method. MzMLb does not require an external index.

Parameters: path (str or file-like) – The path to the file to index, or a file-like object with a name attribute.

reset()

Reset the object, clearing out any existing state.

This resets the underlying file iterator, then calls make_iterator(), and clears the scan cache.

samples() → List[ms_deisotope.data_source.metadata.sample.Sample]

Describe the sample(s) used to generate the mass spectrometry data contained in this file.

Returns
Return type: list of Sample

property scan_cache: A weakref.WeakValueDictionary mapping used to retrieve scans from memory if available before re-reading them from disk.

software_list() → List[ms_deisotope.data_source.metadata.software.Software]

Describe any software used on the data described by this instance.

Returns
Return type: list of Software

property source: The file parser that this reader consumes.

property source_file_name: Optional[str]

Return the name of the file that backs this data source, if available.

Returns
Return type: str or None

start_from_scan(scan_id=None, rt=None, index=None, require_ms1=True, grouped=True, **kwargs)

Reconstruct an iterator which will start from the scan matching one of scan_id, rt, or index. Only one may be provided.

After invoking this method, the iterator this object wraps will be changed to begin yielding scan bunches (or single scans if grouped is False).

This method will trigger several random-access operations, making it prohibitively expensive for normally compressed files.

Parameters

scan_id (str, optional) – Start from the scan with the specified id.
rt (float, optional) – Start from the scan nearest to the specified time (in minutes) in the run. If no exact match is found, the nearest scan time is used, rounded up.
index (int, optional) – Start from the scan with the specified index.
require_ms1 (bool, optional) – Whether the iterator must start from an MS1 scan. True by default.
grouped (bool, optional) – Whether the iterator should yield scan bunches or single scans. True by default.
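One common use of these parameters is collecting every scan inside a retention-time window. The sketch below is an illustration under assumptions: the reader is iterated directly after the call, each scan is assumed to expose a scan_time attribute in minutes (not stated in this section), and the helper name scans_in_window is hypothetical.

```python
def scans_in_window(reader, start_rt, end_rt):
    """Collect single scans whose time lies in [start_rt, end_rt] minutes."""
    # Reposition the reader; grouped=False yields single scans, and
    # require_ms1=False lets iteration begin at any MS level.
    reader.start_from_scan(rt=start_rt, require_ms1=False, grouped=False)
    selected = []
    for scan in reader:
        # `scan_time` is assumed to be the scan's time in minutes.
        if scan.scan_time > end_rt:
            break  # scans are time-ordered, so nothing later can match
        if scan.scan_time < start_rt:
            continue  # the nearest-scan lookup may land just before the window
        selected.append(scan)
    return selected
```

Calling reset() afterwards returns the reader to its initial state if later code expects iteration from the first scan.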

property time

An indexer facade that lets you index and slice by scan time.

Returns
Return type: TimeIndex