psims

Usage Examples

Writing mzML

In this example, we’ll walk through the steps involved in writing an mzML file containing simulated data.

Writing an mzML file with psims is done using the MzMLWriter class.

from psims.mzml import MzMLWriter

To begin, you create the writer with a file path, or a previously created file-like object:

writer = MzMLWriter("path/to/write.mzML")

Before any content can be written out, the writer must start the document, which can be done by either using it as a context manager, or by calling it’s begin() method.

with writer:
    writer.controlled_vocabularies()
    ...

The above example uses the context manager syntax, and immediately writes the controlled vocabulary list to the document. This starts the standard-compliance state-machine, which checks to make sure that a document proceeds through each section in the expected order, without skipping required sections. The remainder of these code samples will take place within this context manager.

The next step is to describe the contents of the file to be written. This includes both provenance information (where the data came from) and classification of the types of spectra contained (MS1 and/or MSn spectra, centroid vs. profile spectra). In this example, we’ll generate some data de novo, and so won’t have any <sourceFile> elements.

writer.file_description([ # the list of file contents terms
    "MS1 spectrum",
    "MSn spectrum",
    "centroid spectrum"
])

If a <sourceFile> is needed, it can be built using

sf = writer.SourceFile("file://path/to/the", "file_name.ext", id="RAW1", params=params)
# If you need to checksum a local file
sf.params.append(sf.checksum('sha-1'))

The next required list is <softwareList>. Replace the names here with your own tools if needed:

writer.software_list([
    {"id": "psims-writer",
        "version": "0.1.2",
        "params": [
        "python-psims",
    ]}
])

The instrument configuration list is a bit more involved to specify. For example, an ESI FT-ICR system would be described by:

source = writer.Source(1, ["electrospray ionization", "electrospray inlet"])
analyzer = writer.Analyzer(2, [
        ""fourier transform ion cyclotron resonance mass spectrometer"
    ])
detector = writer.Detector(3, ["inductive detector"])
config = writer.InstrumentConfiguration(id="IC1", component_list=[source, analyzer, detector],
                                        params=["LTQ-FT"])
writer.instrument_configuration_list([config])

Multiple configurations may be specified.

The <dataProcessing> instructions can be as intricate as needed. We’ll throw in some extra terms for demonstration purposes. Remember, if you used a different software ID above in the <softwareList> section, to use that here too instead of "psims-writer".

methods = []

methods.append(
    writer.ProcessingMethod(
        order=1, sofware_reference="psims-writer", params=[
            "Gaussian smoothing",
            "median baseline reduction",
            "MS:1000035", # peak picking
            "Conversion to mzML"
        ]))
processing = writer.DataProcessing(methods, id='DP1')
writer.data_processing_list([processing])

Now, we’re nearly ready to start writing spectra. To start, we open the run and spectrum list:

import numpy as np

with writer.run(id=1, instrument_configuration='IC1'):
    # we will write 3,000 spectra
    with writer.spectrum_list(count=1e3):
        ...

We write all spectra within the inner context manager of the spectrum list, or instead of using the context manager notation, call the begin() and end() methods.

A spectrum is a collection of many complex details. We’ll use NumPy arrays to represent the data arrays. This file contains centroided spectra as previously specified in the file description.

We’ll assume there’s a function that generates a centroided spectrum, called get_next_arrays to produce those arrays.

i = 0
while i < 1e3:
    i += 1
    ms1_mzs, ms1_intensities = get_next_arrays()
    ms1_spectrum_id = "index=%d" % i
    scan_time = 0.23 * i

    writer.write_spectrum(
        ms1_mzs, ms1_intensities, id=ms1_spectrum_id, centroided=True,
        scan_start_time=scan_time, scan_window_list=[(0, 2000.0)],
        params=[{"ms level": 1}, {"total ion current": ms1_intensities.sum()}])

    for j in range(3):
        i += 1
        msn_mzs, msn_intensities = get_next_arrays()
        msn_spectrum_id = "index=%d" % i
        scan_time = 0.23 * i

        k = np.random.randint(len(ms1_mzs))
        precursor_info = {
            "mz": ms1_mzs[k], 'intensity': ms1_intensities[k],
            "spectrum_reference": ms1_spectrum_id,
            "activation": ["HCD", {"collision energy": 25.0}],
            "isolation_window": (1.0, ms1_mzs[k], 1.0)
        }


        writer.write_spectrum(
            msn_mzs, msn_intensities, id=msn_spectrum_id, centroided=True,
            scan_start_time=scan_time, scan_window_list=[(0, 2000.0)],
            precursor_information=precursor_info,
            params=[{"ms level": 2}, {"total ion current": msn_intensities.sum()}])

Similar code can be used for writing chromatograms, see write_chromatogram().

Once the document is finished, we leave the top-level context manager.

The Complicated Bits

Precursor Information

Precursor ions, dissociation/activation information, and isolation windows are really important when dealing with tandem mass spectra. The <precursor> element is pretty complex though, with three distinct sub-elements, <selectedIon>, <isolationWindow>, and activation, each of which contains a list of cvParams with varying requirements. precursor_builder() provides you with an object you can use to incrementally build up the value to provide.

precursor = writer.precursor_builder()
si = precursor.selected_ion(mz=256.05)
si.charge = 2
act = precursor.activation()
act.add_param("collision-induced dissociation")
act.add_param({"collision energy": 25.0})

writer.write_spectrum(..., precursor_information=precursor)

This translates to the following XML:

<precursorList count="1">
  <precursor>
    <selectedIonList count="1">
    <selectedIon>
      <cvParam cvRef="PSI-MS" accession="MS:1000744" name="selected ion m/z" value="256.05" unitCvRef="PSI-MS" unitAccession="MS:1000040" unitName="m/z"/>
      <cvParam cvRef="PSI-MS" accession="MS:1000042" name="peak intensity" value="0.0" unitCvRef="PSI-MS" unitAccession="MS:1000131" unitName="number of detector counts"/>
      <cvParam cvRef="PSI-MS" accession="MS:1000041" name="charge state" value="2"/>
    </selectedIon>
    </selectedIonList>
    <activation>
    <cvParam cvRef="PSI-MS" accession="MS:1000133" name="collision-induced dissociation" value=""/>
    <cvParam cvRef="PSI-MS" accession="MS:1000045" name="collision energy" value="25.0" unitCvRef="UO" unitAccession="UO:0000266" unitName="electronvolt"/>
    </activation>
  </precursor>
</precursorList>

Alternatively, you can pass these arguments to prepare_precursor_information() to go through the same value translation. This will be done automatically when a dict was passed as the precursor_information argument to spectrum()/write_spectrum().

Binary Data Array Encoding and Compression

It’s arguable that the binary data arrays account for the majority of the space used by mzML files, and are an important piece of content. These numerical arrays are encoded using specific precision types and compression schemes. The standard practice is to encode m/z arrays using 64-bit floating point numbers and everything else using 32-bit floating point numbers or integers where appropriate.

To configure the encoding of a particular array, when calling spectrum(), chromatogram(), you may pass an encoding dict parameter which maps array type to encoding data type. By default, this will map m/z array to 64-bit float and charge array to 32-bit integer and all other arrays will map to 32-bit float.

The mzML standard specifies a binary data array may be compressed too. The compression argument controls which method is used. The default method is zlib compression, see the compression_map attribute of the psims.mzml.writer module. zlib is a lossless compression scheme, but can be slow, while the MS-Numpress compression methods are able to reduce the size of the data by altering the binary composition prior to base64 encoding at the cost of losing some precision. MS-Numpress methods also require the pynumpress library be installed.