Writing mzML Documents

mzML is a standard, richly annotated XML format for storing raw mass spectrometry data. Refer to psidev.info for the detailed specification of the format and the structure of mzML files.

In addition to mzML, there is a wrapping format called indexedmzML which adds an extra layer to the XML document, including pre-computed byte offsets for each <spectrum> and <chromatogram> element.
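To make the idea of pre-computed byte offsets concrete, here is a minimal sketch that builds such an index for a toy document and uses it to jump straight to one element. The `build_offset_index` helper and the toy document are illustrative assumptions, not part of psims or of the indexedmzML schema.

```python
import re

# A tiny stand-in for an mzML body; real documents carry far more structure.
document = (
    b'<mzML>\n'
    b'  <spectrum id="index=0"></spectrum>\n'
    b'  <spectrum id="index=1"></spectrum>\n'
    b'  <chromatogram id="TIC"></chromatogram>\n'
    b'</mzML>\n'
)


def build_offset_index(data):
    """Map (tag name, id) to the byte offset of the element's opening tag."""
    pattern = re.compile(rb'<(spectrum|chromatogram) id="([^"]+)"')
    index = {}
    for match in pattern.finditer(data):
        kind = match.group(1).decode("ascii")
        entity_id = match.group(2).decode("utf-8")
        index[(kind, entity_id)] = match.start()
    return index


offsets = build_offset_index(document)

# With the offset in hand, a reader can seek directly to the element
# instead of parsing everything that precedes it.
start = offsets[("spectrum", "index=1")]
print(document[start:].split(b"\n", 1)[0])
```

A reader of an indexed file would typically apply the same idea with `file.seek(offset)` on the open file rather than slicing an in-memory buffer.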

To write mzML without an index, use PlainMzMLWriter; for indexedmzML, use IndexedMzMLWriter. Because so many tools rely on the index, IndexedMzMLWriter is exported under the alias MzMLWriter. The two classes share the same interface, but IndexedMzMLWriter does slightly more work while writing and when finishing the document. You can alter the indexing behavior via IndexedMzMLWriter.index_builder or through inheritance.

class psims.mzml.writer.IndexedMzMLWriter(outfile, close=None, vocabularies=None, missing_reference_is_error=False, vocabulary_resolver=None, id=None, accession=None, **kwargs)

A high level API for generating indexed mzML XML files from simple Python objects.

This class depends heavily on lxml’s incremental file writing API which in turn depends heavily on context managers. Almost all logic is handled inside a context manager and in the context of a particular document. Since all operations assume that they have access to a universal identity map for each element in the document, that map is centralized in this class.

MzMLWriter inherits from ComponentDispatcher, giving it a context attribute and access to all Component objects pre-bound to that context with attribute-access notation.

chromatogram_count

A count of the number of chromatograms written

Type: int

spectrum_count

A count of the number of spectra written

Type: int

index_builder

A writing stream that automatically tokenizes and records byte offsets for specific XML tags.

Type: IndexingStream
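As a rough illustration of what a stream that records byte offsets while writing might look like, the sketch below wraps a binary stream and notes where indexed tags begin. `OffsetRecordingStream` is a hypothetical class, not the psims IndexingStream. It assumes each opening tag arrives whole within a single write call, a simplification that a real tokenizing stream would not rely on.

```python
import io
import re


class OffsetRecordingStream:
    """Hypothetical wrapper: forwards writes and notes where indexed tags begin."""

    TAG = re.compile(rb'<(spectrum|chromatogram) id="([^"]+)"')

    def __init__(self, stream):
        self.stream = stream
        self.position = 0  # total bytes written so far
        self.offsets = {"spectrum": {}, "chromatogram": {}}

    def write(self, data):
        # Offsets are relative to the whole stream, not just this chunk.
        # Simplification: a tag split across two writes would be missed.
        for match in self.TAG.finditer(data):
            kind = match.group(1).decode("ascii")
            entity_id = match.group(2).decode("utf-8")
            self.offsets[kind][entity_id] = self.position + match.start()
        self.position += len(data)
        return self.stream.write(data)


buffer = io.BytesIO()
out = OffsetRecordingStream(buffer)
out.write(b'<run>\n')
out.write(b'<spectrum id="index=0"></spectrum>\n')
out.write(b'<chromatogram id="TIC"></chromatogram>\n')
print(out.offsets)
```

Recording offsets at write time avoids a second pass over the finished file, which is why the index can be emitted as soon as the document body is closed.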

__enter__(): Begins writing, opening the top-level tag.

__exit__(exc_type, exc_value, traceback): Closes the top-level tag, the XML formatter, and the file itself.
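The context-manager protocol described here can be sketched with a toy writer. `TinyDocumentWriter` is a hypothetical stand-in that only mirrors the begin/end shape of the documented methods. It writes an XML declaration and a bare `<mzML>` tag, and it does not reproduce what psims actually emits.

```python
import io


class TinyDocumentWriter:
    """Hypothetical writer: not the psims class, just the same begin/end shape."""

    def __init__(self, stream, close=False):
        self.stream = stream
        self.close = close

    def begin(self):
        # Start the document before any element is written.
        self.stream.write(b'<?xml version="1.0" encoding="utf-8"?>\n')

    def end(self):
        # Flush, and close the stream only if the caller asked for it.
        self.stream.flush()
        if self.close:
            self.stream.close()

    def __enter__(self):
        self.begin()
        self.stream.write(b'<mzML>\n')  # open the top-level tag
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.stream.write(b'</mzML>\n')  # close the top-level tag
        self.end()
        return False  # never swallow exceptions


buffer = io.BytesIO()
with TinyDocumentWriter(buffer) as writer:
    writer.stream.write(b'  <run></run>\n')

text = buffer.getvalue().decode("utf-8")
print(text)
```

Putting the open and close steps in `__enter__` and `__exit__` means the top-level tag is closed even if an exception interrupts the body of the `with` block.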

__getattr__(name)

Provide access to an automatically parameterized version of all ComponentBase types which use this instance’s context.

Parameters: name (str) – Component Name
Returns: A partially parameterized instance constructor for the ComponentBase type requested.
Return type: ReprBorrowingPartial
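The attribute-access dispatch described above can be approximated with `functools.partial`. In this sketch `Dispatcher`, `Context`, and the toy `Software` component are hypothetical names chosen for illustration, and `partial` stands in for the ReprBorrowingPartial that the real method returns.

```python
from functools import partial


class Context(dict):
    """Hypothetical identity map shared by every component of one document."""


class Software:
    """Hypothetical component; real components take richer arguments."""

    def __init__(self, id, version=None, context=None):
        self.id = id
        self.version = version
        self.context = context


class Dispatcher:
    # Registry of component types reachable by attribute name.
    components = {"Software": Software}

    def __init__(self):
        self.context = Context()

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        try:
            component = self.components[name]
        except KeyError:
            raise AttributeError(name)
        # Pre-bind this document's context so callers never pass it by hand.
        return partial(component, context=self.context)


writer = Dispatcher()
software = writer.Software("example-software", version="1.0")
print(software.context is writer.context)
```

Binding the context once at lookup time keeps every component constructed through the writer attached to the same identity map.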

begin(): Writes the doctype and starts the low-level writing machinery.

controlled_vocabularies()

Write out the <cvList> element and all its children, including both this format’s default controlled vocabularies and those passed as arguments to this method.

This method requires writing to have begun.

data_processing_list(data_processing)

Writes the <dataProcessingList> section of the document.

Note

List and descriptions of data processing applied to this data

Parameters: data_processing (list) – A list or other iterable of dict or DataProcessing-like objects

element(element_name, **kwargs)

Construct and immediately open a subclass instance of TagBase with the given tag name. All other arguments are forwarded to the TagBase constructor.

Parameters

element_name (str) – The name of the tag type to create
*args – Arbitrary arguments for the tag
**kwargs – Key word arguments for the tag

See also

element()

end(exc_type=None, exc_value=None, traceback=None): Ends the XML document, and flushes and closes the file if appropriate.

file_description(file_contents=None, source_files=None, contacts=None)

Writes the <fileDescription> section of the document.

If file_contents contains a nativeID term, and native_id_format has not been set explicitly, that ID format will be used for this document.

Note

Information pertaining to the entire mzML file (i.e. not specific to any part of the data set) is stored here.

Parameters

file_contents (list, optional) – A list or other iterable of str, dict, or *Param-types which will be placed in the <fileContent> element.
source_files (list) – A list or other iterable of dict or SourceFile-like objects to be placed in the <sourceFileList> element

format(*args, **kwargs): This method is deprecated. Previously, the serialization process did not indent the XML in-place and the lxml pretty printer had to be invoked separately. With the addition of XMLFormattingStreamWriter, the XML stream is formatted in-place as it is being streamed to file.

instrument_configuration_list(instrument_configurations)

Writes the <instrumentConfigurationList> section of the document.

Note

List and descriptions of instrument configurations. At least one instrument configuration MUST be specified, even if it is only to specify that the instrument is unknown. In that case, the “instrument model” term is used to indicate the unknown instrument in the instrumentConfiguration.

Parameters: instrument_configurations (list) – A list or other iterable of dict or InstrumentConfiguration-like objects

property native_id_format

The nativeID format of the spectra to assume for this data file.

This is used to determine how to convert an integer into a spectrum’s id. Defaults to MS:1000774: “multiple peak list nativeID format” which has a pattern of index=<number>.

This attribute has no effect on spectrum id values specified as strings already formatted.

Note

If not explicitly specified, but a term naming an ID format is passed as a parameter in file contents, that will be used. The ID format from source files will not be used.

Return type: NativeIDParser
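The conversion rule described for native_id_format, where integers are formatted and already formatted strings pass through unchanged, might look roughly like the sketch below. `format_spectrum_id` is a hypothetical helper, not the NativeIDParser API. The default `index={}` template mirrors the `index=<number>` pattern mentioned above, while the `scan={}` template in the usage lines is purely illustrative.

```python
def format_spectrum_id(value, pattern="index={}"):
    """Turn an integer into a nativeID string; leave formatted strings untouched."""
    if isinstance(value, int):
        return pattern.format(value)
    return value


print(format_spectrum_id(5))
print(format_spectrum_id("controllerType=0 controllerNumber=1 scan=12"))
print(format_spectrum_id(7, pattern="scan={}"))
```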

precursor_builder(mz=None, intensity=None, charge=None, spectrum_reference=None, activation=None, isolation_window_args=None, params=None, intensity_unit='number of detector counts', scan_id=None, external_spectrum_id=None, source_file_reference=None, isolation_window=None)

Create a PrecursorBuilder, an object to help populate the precursor information data structure.

The helper object should be used to incrementally populate the precursor information passed to spectrum() or write_spectrum()’s precursor_information argument.

Parameters

mz (float, optional) – The m/z of the first selected ion
intensity (float, optional) – The intensity of the first selected ion
charge (int, optional) – The charge state of the first selected ion
spectrum_reference (str, optional) – The id of the precursor <spectrum> for this precursor, mapped through the document context.
activation (dict or list, optional) – Parameters forwarded to PrecursorBuilder.activation(). This should be a dictionary with a key “params” holding a list of CVParam coerce-able values, with additional optional keys naming other CVParam coerce-able values. If a list is passed, it will be wrapped in such a dictionary, e.g. {"params": activation}.
isolation_window_args (tuple, list, or dict, optional) – Parameters forwarded to PrecursorBuilder.isolation_window(). A tuple or list of three values is converted into a dict of the correct structure. The expected keys are “lower”, the lower m/z offset, “target”, the center m/z, and “upper”, the upper m/z offset. You may also pass this argument as isolation_window.
params (list, optional) – The cv- and user-params of the first selected ion, in addition to mz, intensity, charge.
intensity_unit (str) – The intensity unit of the first selected ion, to be specified with intensity
scan_id (str, optional) – An alias for spectrum_reference
external_spectrum_id (str, optional) – The externalSpectrumID attribute of the precursor
source_file_reference (str, optional) – The sourceFileRef attribute of the precursor

Return type: PrecursorBuilder
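The argument coercions described for activation and isolation_window_args can be sketched as two small normalizers. Both function names are hypothetical. The tuple order (lower, target, upper) is an assumption taken from the order in which the keys are listed above.

```python
def normalize_isolation_window(window):
    """Coerce a three-value tuple or list into the keyed dict form."""
    if window is None:
        return None
    if isinstance(window, dict):
        return dict(window)
    lower, target, upper = window  # assumed order: lower, target, upper
    return {"lower": lower, "target": target, "upper": upper}


def normalize_activation(activation):
    """Wrap a bare list of parameters as {"params": [...]}."""
    if activation is None:
        return None
    if isinstance(activation, dict):
        return activation
    return {"params": list(activation)}


window = normalize_isolation_window((1.0, 445.34, 1.0))
activation = normalize_activation(
    ["beam-type collision-induced dissociation", {"collision energy": 25}]
)
print(window)
print(activation)
```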

prepare_precursor_information(mz=None, intensity=None, charge=None, spectrum_reference=None, activation=None, isolation_window_args=None, params=None, intensity_unit='number of detector counts', scan_id=None, external_spectrum_id=None, source_file_reference=None, **kwargs)

Prepare a Precursor element from disparate data structures.

Parameters

mz (float, optional) – The m/z of the first selected ion
intensity (float, optional) – The intensity of the first selected ion
charge (int, optional) – The charge state of the first selected ion
spectrum_reference (str, optional) – The id of the precursor <spectrum> for this precursor
activation (list, optional) – A list of parameters describing the ion activation method used.
isolation_window_args (tuple, list, or dict, optional) – Parameters forwarded to PrecursorBuilder.isolation_window(). A tuple or list is converted into a dict of the correct structure. This argument may also be passed as isolation_window.
params (list, optional) – The cvParams of the first selected ion
intensity_unit (str) – The intensity unit of the first selected ion
scan_id (str, optional) – An alias for spectrum_reference
external_spectrum_id (str, optional) – The externalSpectrumID attribute of the precursor
source_file_reference (str, optional) – The sourceFileRef attribute of the precursor

Return type: Precursor

reference_param_group_list(groups)

Writes the <referenceableParamGroupList> section of the document.

Parameters: groups (list) – A list or other iterable of dict or ReferenceableParamGroup-like objects

register(entity_type, id)

Pre-declare an entity in the document context. Ensures that a reference look up will be satisfied.

Parameters

entity_type (str) – An entity type, either a tag name or a component name
id (int) – The unique id number for the thing registered

Returns: The constructed reference id
Return type: str
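The purpose of pre-declaring an entity is easier to see with a toy identity map. In the sketch below, `DocumentContext`, its `resolve` method, and the `SPECTRUM_3`-style reference format are all hypothetical. The way `missing_reference_is_error` is interpreted here is an assumption based on the constructor argument's name, not on the psims source.

```python
class DocumentContext:
    """Hypothetical identity map keyed by entity type, then by id."""

    def __init__(self, missing_reference_is_error=False):
        self.missing_reference_is_error = missing_reference_is_error
        self.entities = {}

    def register(self, entity_type, id):
        # Made-up reference scheme, for illustration only.
        reference = "{}_{}".format(entity_type.upper(), id)
        self.entities.setdefault(entity_type, {})[id] = reference
        return reference

    def resolve(self, entity_type, id):
        try:
            return self.entities[entity_type][id]
        except KeyError:
            if self.missing_reference_is_error:
                raise KeyError("no {} with id {!r}".format(entity_type, id))
            return str(id)  # lenient mode: fall back to the raw id


context = DocumentContext()
reference = context.register("spectrum", 3)

# A later element can now refer to spectrum 3 before it has been written.
print(context.resolve("spectrum", 3))
```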

run(id=None, instrument_configuration=None, source_file=None, start_time=None, sample=None)

Begins the <run> section of the document, describing a single sample run.

Parameters

id (str, optional) – The unique identifier for this element
instrument_configuration (str, optional) – The id string for the default InstrumentConfiguration for this sample
source_file (str, optional) – The id string for the source file used to produce this data
start_time (str, optional) – A string encoding the date and time the sample was acquired
sample (str, optional) – The id string for the sample used to produce this data

Return type: RunSection

sample_list(samples)

Writes the <sampleList> section of the document.

Parameters: samples (list) – A list or other iterable of dict or Sample-like objects

software_list(software_list: Iterable[Union[psims.mzml.components.Software, Mapping]])

Writes the <softwareList> section of the document.

Note

List and descriptions of software used to acquire and/or process the data in this mzML file

Parameters: software_list (list) – A list or other iterable of dict or Software-like objects

spectrum(mz_array: Optional[numpy.ndarray] = None, intensity_array: Optional[numpy.ndarray] = None, charge_array: Optional[numpy.ndarray] = None, id: Optional[str] = None, polarity='positive scan', centroided=True, precursor_information=None, scan_start_time=None, params=None, compression='zlib', encoding=None, other_arrays=None, scan_params=None, scan_window_list=None, instrument_configuration_id=None, intensity_unit='number of detector counts') → psims.mzml.components.Spectrum

Create a new Spectrum instance to be written.

This method does not immediately write and close the spectrum element, leaving it open for modification and embedding.

Parameters

mz_array (np.ndarray of floats) – The m/z array of the spectrum
intensity_array (np.ndarray of floats) – The intensity array of the spectrum
charge_array (np.ndarray, optional) – The charge state array of the spectrum
id (str) – The native ID of the spectrum.
polarity (str or int, optional) – The polarity of the spectrum. If an integer, the sign of the integer is used, otherwise it is interpreted as a cvParam
centroided (bool, optional) – Whether the spectrum has been discretized by peak picking (True) or is a continuous profile (False). Defaults to True.
precursor_information (dict or PrecursorBuilder, optional) – The precursor ion description. Will be passed to _prepare_precursor_list(). The structure of this object should either be formatted as arguments to precursor_builder(), or a PrecursorBuilder instance populated with information.
scan_start_time (float, optional) – The scan start time, in minutes
params (list, optional) – The parameters of the spectrum
compression (str, optional) – The compression type name to use. Defaults to COMPRESSION_ZLIB.
encoding (dict, optional) – A mapping from array name to NumPy data types.
other_arrays (list, optional) – An iterable pairing array names with additional data arrays. Array names may be strings, Mapping objects that define a CVParam or UserParam, or such parameter objects themselves. Use the latter two forms when defining arrays with units.
scan_params (list, optional) – A list of cvParams for the scan of this spectrum
scan_window_list (list, optional) – A list of scan windows specified as pairs of m/z intervals
instrument_configuration_id (str, optional) – The id of the instrumentConfiguration to associate with this spectrum if not the default one.

Return type: Spectrum
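The compression and encoding arguments govern how each array is stored. In mzML, binary arrays are base64-encoded, optionally after zlib compression. The sketch below shows that round trip using only the standard library and assuming 64-bit little-endian floats. `encode_array` and `decode_array` are hypothetical helpers. psims performs this step internally on NumPy arrays.

```python
import base64
import struct
import zlib


def encode_array(values, compress=True):
    """Pack floats as little-endian float64, optionally zlib-compress, then base64."""
    raw = struct.pack("<{}d".format(len(values)), *values)
    if compress:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_array(text, compressed=True):
    """Reverse encode_array: base64-decode, decompress, unpack float64 values."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    return list(struct.unpack("<{}d".format(len(raw) // 8), raw))


mz_values = [100.0, 200.5, 300.25]
encoded = encode_array(mz_values)
decoded = decode_array(encoded)
print(decoded)
```

Choosing a narrower type such as 32-bit floats for the intensity array trades precision for file size, which is the kind of per-array decision the encoding mapping exists to express.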