psims

Writing mzML Documents

mzML is a standard, rich XML format for storing raw mass spectrometry data. Refer to psidev.info for the detailed specification of the format and structure of mzML files.

In addition to mzML, there is a wrapping format called indexedmzML which adds an extra layer to the XML document, including pre-computed byte offsets for each <spectrum> and <chromatogram> element.

To write mzML without an index, use PlainMzMLWriter; for indexedmzML, use IndexedMzMLWriter. Because so many tools rely on the index, IndexedMzMLWriter is exported under the alias MzMLWriter. The two classes share the same interface, but IndexedMzMLWriter has slightly more complex behavior when writing and when finishing the document. You can alter the indexing behavior via IndexedMzMLWriter.index_builder or through inheritance.

class psims.mzml.writer.IndexedMzMLWriter(outfile, close=None, vocabularies=None, missing_reference_is_error=False, vocabulary_resolver=None, id=None, accession=None, **kwargs)

A high level API for generating indexed mzML XML files from simple Python objects.

This class depends heavily on lxml’s incremental file writing API which in turn depends heavily on context managers. Almost all logic is handled inside a context manager and in the context of a particular document. Since all operations assume that they have access to a universal identity map for each element in the document, that map is centralized in this class.

MzMLWriter inherits from ComponentDispatcher, giving it a context attribute and access to all Component objects pre-bound to that context with attribute-access notation.

chromatogram_count

A count of the number of chromatograms written

Type

int

spectrum_count

A count of the number of spectra written

Type

int

index_builder

A writing stream that automatically tokenizes and records byte offsets for specific XML tags.

Type

IndexingStream
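
The idea of a stream that records byte offsets while writing can be sketched with the standard library alone. This is a hypothetical illustration, not the IndexingStream implementation: the class name OffsetRecordingStream, its attributes, and the regular expression are all assumptions, and the sketch assumes every opening tag arrives whole within a single write() call.

```python
import io
import re


class OffsetRecordingStream:
    """Wrap a binary stream and record where indexed tags start.

    Hypothetical illustration only; this is not psims's IndexingStream.
    """

    # Match the opening of a <spectrum ...> or <chromatogram ...> tag
    # and capture the value of its id attribute.
    _pattern = re.compile(rb'<(spectrum|chromatogram) [^>]*?id="([^"]*)"')

    def __init__(self, stream):
        self.stream = stream
        self.position = 0  # number of bytes written so far
        self.offsets = {"spectrum": {}, "chromatogram": {}}

    def write(self, data: bytes) -> int:
        # Assumption: an opening tag never straddles two write() calls.
        for match in self._pattern.finditer(data):
            kind = match.group(1).decode("ascii")
            entity_id = match.group(2).decode("utf-8")
            self.offsets[kind][entity_id] = self.position + match.start()
        self.position += len(data)
        return self.stream.write(data)


buffer = io.BytesIO()
out = OffsetRecordingStream(buffer)
out.write(b'<?xml version="1.0"?>\n<run id="r1">\n')
out.write(b'<spectrum index="0" id="index=0"></spectrum>\n')
out.write(b'<spectrum index="1" id="index=1"></spectrum>\n')
out.write(b'</run>\n')

document = buffer.getvalue()
print(out.offsets["spectrum"])
```

Each recorded offset points at the `<` that opens the element, which is what lets a reader seek directly to one spectrum without parsing the whole file.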

__enter__()

Begins writing, opening the top-level tag

__exit__(exc_type, exc_value, traceback)

Closes the top-level tag, the XML formatter, and the file itself.

__getattr__(name)

Provide access to an automatically parameterized version of all ComponentBase types which use this instance’s context.

Parameters

name (str) – Component Name

Returns

A partially parameterized instance constructor for the ComponentBase type requested.

Return type

ReprBorrowingPartial
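
The attribute-access dispatch described above can be illustrated with a small, self-contained sketch. Everything here is a stand-in: Tag, Dispatcher, and the component_types registry are hypothetical names, and functools.partial is used in place of ReprBorrowingPartial, whose details are not shown in this section.

```python
from functools import partial


class Tag:
    """Hypothetical stand-in for a component type."""

    def __init__(self, name, context=None, **attrs):
        self.name = name
        self.context = context
        self.attrs = attrs


class Dispatcher:
    """Hypothetical stand-in for a writer that owns a shared context."""

    # Registry mapping component names to their types.
    component_types = {"Software": Tag, "Sample": Tag}

    def __init__(self):
        self.context = {}  # shared identity map for the document

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup fails.
        try:
            component_type = self.component_types[name]
        except KeyError:
            raise AttributeError(name) from None
        # Pre-bind this instance's context so callers never pass it themselves.
        return partial(component_type, name, context=self.context)


writer = Dispatcher()
software = writer.Software(id="example-software", version="0.0.0")
print(software.name, software.attrs)
```

The design choice is that every component constructed through the writer shares one context, so references between elements can be resolved against a single map.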

begin()

Writes the doctype and starts the low-level writing machinery

controlled_vocabularies()

Write out the <cvList> element and all its children, including both this format’s default controlled vocabularies and those passed as arguments to this method.

This method requires writing to have begun.

data_processing_list(data_processing)

Writes the <dataProcessingList> section of the document.

Note

List and descriptions of data processing applied to this data

Parameters

data_processing (list) – A list or other iterable of dict or DataProcessing-like objects

element(element_name, **kwargs)

Construct and immediately open a subclass instance of TagBase with the given tag name. All other arguments are forwarded to the TagBase constructor.

Parameters
  • element_name (str) – The name of the tag type to create

  • *args – Arbitrary arguments for the tag

  • **kwargs – Key word arguments for the tag

See also

element()

end(exc_type=None, exc_value=None, traceback=None)

Ends the XML document, and flushes and closes the file if appropriate.

file_description(file_contents=None, source_files=None, contacts=None)

Writes the <fileDescription> section of the document.

If file_contents contains a nativeID term, and native_id_format has not been set explicitly, that ID format will be used for this document.

Note

Information pertaining to the entire mzML file (i.e. not specific to any part of the data set) is stored here.

Parameters
  • file_contents (list, optional) – A list or other iterable of str, dict, or *Param-types which will be placed in the <fileContent> element.

  • source_files (list) – A list or other iterable of dict or SourceFile-like objects to be placed in the <sourceFileList> element

format(*args, **kwargs)

This method is deprecated. Previously, the serialization process did not indent the XML in-place and the lxml pretty printer had to be invoked separately. With the addition of XMLFormattingStreamWriter, the XML stream is formatted in-place as it is being streamed to file.

instrument_configuration_list(instrument_configurations)

Writes the <instrumentConfigurationList> section of the document.

Note

List and descriptions of instrument configurations. At least one instrument configuration MUST be specified, even if it is only to specify that the instrument is unknown. In that case, the “instrument model” term is used to indicate the unknown instrument in the instrumentConfiguration.

Parameters

instrument_configurations (list) – A list or other iterable of dict or InstrumentConfiguration-like objects

property native_id_format

The nativeID format of the spectra to assume for this data file.

This is used to determine how to convert an integer into a spectrum’s id. Defaults to MS:1000774: “multiple peak list nativeID format” which has a pattern of index=<number>.

This attribute has no effect on spectrum id values specified as strings already formatted.

Note

If not explicitly specified, but a term naming an ID format is passed as a parameter in file contents, that will be used. The ID format from source files will not be used.

Returns

Return type

NativeIDParser
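
The behavior described for native_id_format can be sketched as two small functions. Both are hypothetical helpers, not part of the library: format_spectrum_id shows integers being formatted with the index=<number> pattern while strings pass through unchanged, and choose_native_id_format shows the stated precedence (explicit setting, then a term in file contents, then the default). Detecting a format term by the suffix "nativeID format" is an assumption made for this sketch.

```python
DEFAULT_FORMAT = "multiple peak list nativeID format"


def format_spectrum_id(value, pattern="index={}"):
    """Hypothetical helper: turn an integer into a native ID string.

    Strings are treated as already formatted and returned unchanged.
    """
    if isinstance(value, int):
        return pattern.format(value)
    return value


def choose_native_id_format(explicit=None, file_contents=()):
    """Hypothetical helper: pick the ID format using the documented precedence."""
    if explicit is not None:
        return explicit
    for term in file_contents:
        # Assumption: a term names an ID format if it ends with this suffix.
        if isinstance(term, str) and term.endswith("nativeID format"):
            return term
    return DEFAULT_FORMAT


print(format_spectrum_id(5))
print(format_spectrum_id("scan=12"))
print(choose_native_id_format(file_contents=["MSn spectrum", "Thermo nativeID format"]))
```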

precursor_builder(mz=None, intensity=None, charge=None, spectrum_reference=None, activation=None, isolation_window_args=None, params=None, intensity_unit='number of detector counts', scan_id=None, external_spectrum_id=None, source_file_reference=None, isolation_window=None)

Create a PrecursorBuilder, an object to help populate the precursor information data structure.

The helper object should be used to incrementally populate the precursor information passed to spectrum() or write_spectrum()’s precursor_information argument.

Parameters
  • mz (float, optional) – The m/z of the first selected ion

  • intensity (float, optional) – The intensity of the first selected ion

  • charge (int, optional) – The charge state of the first selected ion

  • spectrum_reference (str, optional) – The id of the precursor <spectrum> for this precursor, mapped through the document context.

  • activation (dict or list, optional) – Parameters forwarded to PrecursorBuilder.activation(). This should be a dictionary with a key “params” and a list of CVParam coerce-able values, with additional optional keys naming other CVParam coerce-able values. If a list is passed, it will be wrapped in one, e.g. {"params": activation}.

  • isolation_window_args (tuple, list, or dict, optional) – Parameters forwarded to PrecursorBuilder.isolation_window(). A tuple or list of three values is converted into a dict of the correct structure. The expected keys are “lower”, the lower m/z offset; “target”, the center m/z; and “upper”, the upper m/z offset. You may also pass this argument as isolation_window.

  • params (list, optional) – The cv- and user-params of the first selected ion, in addition to mz, intensity, charge.

  • intensity_unit (str) – The intensity unit of the first selected ion, to be specified with intensity

  • scan_id (str, optional) – An alias for spectrum_reference

  • external_spectrum_id (str, optional) – The externalSpectrumID attribute of the precursor

  • source_file_reference (str, optional) – The sourceFileRef attribute of the precursor

Returns

Return type

PrecursorBuilder
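
The argument coercions described for activation and isolation_window_args can be sketched as plain functions. These helpers are hypothetical and only mirror the documented behavior; in particular, the (lower, target, upper) ordering of a three-value sequence is an assumption of this sketch, since the text above names the keys but not the positional order.

```python
def normalize_activation(activation):
    """Wrap a bare list of parameters as {"params": [...]}; pass dicts through."""
    if isinstance(activation, (list, tuple)):
        return {"params": list(activation)}
    return activation


def normalize_isolation_window(window):
    """Convert a three-value sequence into a dict keyed by lower, target, upper.

    Assumption: the sequence is ordered (lower, target, upper).
    """
    if isinstance(window, (list, tuple)):
        lower, target, upper = window
        return {"lower": lower, "target": target, "upper": upper}
    return window


# A dict shaped like the keyword arguments of precursor_builder().
precursor_information = {
    "mz": 445.34,
    "charge": 2,
    "activation": normalize_activation(
        ["collision-induced dissociation", {"collision energy": 35.0}]
    ),
    "isolation_window_args": normalize_isolation_window((1.0, 445.34, 1.0)),
}
print(precursor_information)
```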

prepare_precursor_information(mz=None, intensity=None, charge=None, spectrum_reference=None, activation=None, isolation_window_args=None, params=None, intensity_unit='number of detector counts', scan_id=None, external_spectrum_id=None, source_file_reference=None, **kwargs)

Prepare a Precursor element from disparate data structures.

Parameters
  • mz (float, optional) – The m/z of the first selected ion

  • intensity (float, optional) – The intensity of the first selected ion

  • charge (int, optional) – The charge state of the first selected ion

  • spectrum_reference (str, optional) – The id of the precursor <spectrum> for this precursor

  • activation (list, optional) – A list of parameters describing the ion activation method used.

  • isolation_window_args (tuple, list, or dict, optional) – Parameters forwarded to PrecursorBuilder.isolation_window(). A tuple or list of values is converted into a dict of the correct structure. This argument may also be passed as isolation_window.

  • params (list, optional) – The cvParams of the first selected ion

  • intensity_unit (str) – The intensity unit of the first selected ion

  • scan_id (str, optional) – An alias for spectrum_reference

  • external_spectrum_id (str, optional) – The externalSpectrumID attribute of the precursor

  • source_file_reference (str, optional) – The sourceFileRef attribute of the precursor

Returns

Return type

Precursor

reference_param_group_list(groups)

Writes the <referenceableParamGroupList> section of the document.

Parameters

groups (list) – A list or other iterable of dict or ReferenceableParamGroup-like objects

register(entity_type, id)

Pre-declare an entity in the document context. Ensures that a reference look up will be satisfied.

Parameters
  • entity_type (str) – An entity type, either a tag name or a component name

  • id (int) – The unique id number for the thing registered

Returns

The constructed reference id

Return type

str
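
Pre-declaring an entity so that a later reference lookup succeeds can be pictured with a minimal identity map. This is a sketch under stated assumptions, not the library's document context: the DocumentContext class here, the reference string format, and the lenient fallback when missing_reference_is_error is False are all invented for illustration.

```python
class DocumentContext:
    """Hypothetical identity map from (entity type, id) to a reference string."""

    def __init__(self, missing_reference_is_error=False):
        self.missing_reference_is_error = missing_reference_is_error
        self.references = {}

    def register(self, entity_type, id):
        # Assumption: the reference format below is invented for this sketch.
        reference = f"{entity_type.upper()}_{id}"
        self.references[(entity_type, id)] = reference
        return reference

    def lookup(self, entity_type, id):
        try:
            return self.references[(entity_type, id)]
        except KeyError:
            if self.missing_reference_is_error:
                raise
            # Assumption: lenient mode falls back to the raw id as text.
            return str(id)


context = DocumentContext(missing_reference_is_error=True)
reference = context.register("Software", 1)
# The lookup succeeds because the entity was registered before it was referenced.
print(context.lookup("Software", 1) == reference)
```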

run(id=None, instrument_configuration=None, source_file=None, start_time=None, sample=None)

Begins the <run> section of the document, describing a single sample run.

Parameters
  • id (str, optional) – The unique identifier for this element

  • instrument_configuration (str, optional) – The id string for the default InstrumentConfiguration for this sample

  • source_file (str, optional) – The id string for the source file used to produce this data

  • start_time (str, optional) – A string encoding the date and time the sample was acquired

  • sample (str, optional) – The id string for the sample used to produce this data

Returns

Return type

RunSection

sample_list(samples)

Writes the <sampleList> section of the document

Parameters

samples (list) – A list or other iterable of dict or Sample-like objects

software_list(software_list: Iterable[Union[psims.mzml.components.Software, Mapping]])

Writes the <softwareList> section of the document.

Note

List and descriptions of software used to acquire and/or process the data in this mzML file

Parameters

software_list (list) – A list or other iterable of dict or Software-like objects

spectrum(mz_array: Optional[numpy.ndarray] = None, intensity_array: Optional[numpy.ndarray] = None, charge_array: Optional[numpy.ndarray] = None, id: Optional[str] = None, polarity='positive scan', centroided=True, precursor_information=None, scan_start_time=None, params=None, compression='zlib', encoding=None, other_arrays=None, scan_params=None, scan_window_list=None, instrument_configuration_id=None, intensity_unit='number of detector counts') psims.mzml.components.Spectrum

Create a new Spectrum instance to be written.

This method does not immediately write and close the spectrum element, leaving it open for modification and embedding.

Parameters
  • mz_array (np.ndarray of floats) – The m/z array of the spectrum

  • intensity_array (np.ndarray of floats) – The intensity array of the spectrum

  • charge_array (np.ndarray, optional) – The charge state array of the spectrum, optional.

  • id (str) – The native ID of the spectrum.

  • polarity (str or int, optional) – The polarity of the spectrum. If an integer, the sign of the integer is used, otherwise it is interpreted as a cvParam

  • centroided (bool, optional) – Whether the spectrum has been discretized by peak picking (True) or is continuous (False). Defaults to True.

  • precursor_information (dict or PrecursorBuilder, optional) – The precursor ion description. Will be passed to _prepare_precursor_list(). The structure of this object should either be formatted as arguments to precursor_builder(), or a PrecursorBuilder instance populated with information.

  • scan_start_time (float, optional) – The scan start time, in minutes

  • params (list, optional) – The parameters of the spectrum

  • compression (str, optional) – The compression type name to use. Defaults to COMPRESSION_ZLIB.

  • encoding (dict, optional) – A mapping from array name to NumPy data types.

  • other_arrays (list, optional) – An iterable of array names to additional data arrays. Array names may either be strings, Mapping objects that define CVParam or UserParam, or such parameter objects themselves. Use the latter two methods when defining arrays with units.

  • scan_params (list, optional) – A list of cvParams for the scan of this spectrum

  • scan_window_list (list, optional) – A list of scan windows specified as pairs of m/z intervals

  • instrument_configuration_id (str, optional) – The id of the instrumentConfiguration to associate with this spectrum if not the default one.

Returns

Return type

Spectrum

See also

write_spectrum(), chromatogram(), write_chromatogram()
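
The polarity rule stated in the parameter list (an integer contributes only its sign, anything else is treated as a cvParam) can be sketched as follows. The helper name polarity_term is hypothetical, and raising ValueError for zero is an assumption of this sketch; the text above does not say how zero is handled.

```python
def polarity_term(polarity):
    """Hypothetical helper: map a polarity argument to a term name."""
    if isinstance(polarity, int):
        if polarity > 0:
            return "positive scan"
        if polarity < 0:
            return "negative scan"
        # Assumption: zero has no sign, so this sketch rejects it.
        raise ValueError("polarity integer must be non-zero")
    # Non-integers are assumed to already name a cvParam.
    return polarity


print(polarity_term(1))
print(polarity_term(-1))
print(polarity_term("negative scan"))
```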

validate()

Attempt to perform XSD validation on the XML document this writer wrote

Returns

  • bool – Whether or not the document was valid

  • lxml.etree.XMLSchema – The schema object where errors are logged

Raises

TypeError – Raised when the file cannot be recovered from the writer object

write(*args, **kwargs)

Either write a complete XML sub-tree or add free text to the file stream

Parameters

arg (str or lxml.etree.Element) – The entity to be written out.

write_spectrum(mz_array=None, intensity_array=None, charge_array=None, id=None, polarity='positive scan', centroided=True, precursor_information=None, scan_start_time=None, params=None, compression='zlib', encoding=None, other_arrays=None, scan_params=None, scan_window_list=None, instrument_configuration_id=None, intensity_unit='number of detector counts')

Write a Spectrum with the provided data.

To create a spectrum element but not immediately close it off, see the spectrum() method.

Parameters
  • mz_array (np.ndarray of floats) – The m/z array of the spectrum

  • intensity_array (np.ndarray of floats) – The intensity array of the spectrum

  • charge_array (np.ndarray, optional) – The charge state array of the spectrum, optional.

  • id (str) – The native ID of the spectrum.

  • polarity (str or int, optional) – The polarity of the spectrum. If an integer, the sign of the integer is used, otherwise it is interpreted as a cvParam

  • centroided (bool, optional) – Whether the spectrum has been discretized by peak picking (True) or is continuous (False). Defaults to True.

  • precursor_information (dict or PrecursorBuilder, optional) – The precursor ion description. Will be passed to _prepare_precursor_list(). The structure of this object should either be formatted as arguments to precursor_builder(), or a PrecursorBuilder instance populated with information.

  • scan_start_time (float, optional) – The scan start time, in minutes

  • params (list, optional) – The parameters of the spectrum

  • compression (str, optional) – The compression type name to use. Defaults to COMPRESSION_ZLIB.

  • encoding (dict, optional) – A mapping from array name to NumPy data types.

  • other_arrays (list, optional) – An iterable of array names to additional data arrays. Array names may either be strings, Mapping objects that define CVParam or UserParam, or such parameter objects themselves. Use the latter two methods when defining arrays with units.

  • scan_params (list, optional) – A list of cvParams for the scan of this spectrum

  • scan_window_list (list, optional) – A list of scan windows specified as pairs of m/z intervals

  • instrument_configuration_id (str, optional) – The id of the instrumentConfiguration to associate with this spectrum if not the default one.

See also

spectrum()
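
To make the compression and encoding parameters concrete, the sketch below shows how a numeric array ends up as text inside an mzML binary data array: values are packed as little-endian floats, optionally zlib-compressed, and then base64-encoded. It uses only the standard library rather than NumPy, and the function names encode_array and decode_array are hypothetical; this illustrates the storage scheme and is not the writer's own code.

```python
import base64
import struct
import zlib


def encode_array(values, compress=True):
    """Pack floats as little-endian 64-bit values, optionally compress, then base64."""
    raw = struct.pack(f"<{len(values)}d", *values)
    if compress:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_array(text, compress=True):
    """Reverse encode_array: base64-decode, optionally decompress, then unpack."""
    raw = base64.b64decode(text)
    if compress:
        raw = zlib.decompress(raw)
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))


mz_values = [100.0, 200.5, 300.25]
encoded = encode_array(mz_values)
print(encoded)
print(decode_array(encoded))
```

The compress flag must match on both sides, which is why mzML records the compression type and numeric precision as parameters next to each encoded array.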

psims.mzml.writer.compression_map
The compression methods available:
