mzML Transformation Stream¶
Given a file stream from an mzML file, psims.transform.mzml.MzMLTransformer
will copy it to a new stream, applying a user-provided transformation function to modify
each spectrum en route. It can also optionally sort the spectra by “scan start time”.
Transforming mzML Files¶
Often we start with an mzML file that we want to manipulate or change, but we don’t want to write the code to explicitly unpack and re-pack it.
The MzMLTransformer class lets you wrap an input file-like object over an mzML file and an output file-like object
to receive the manipulated mzML file, along with a transformation function to modify spectra, and have it do the
rest of the work. It uses pyteomics.mzml to do the parsing internally.
Transformation Function Semantics¶
The transformation function receives a dict object representing the spectrum as parsed by pyteomics.mzml.
It is expected to return either the modified dictionary or None, in which case the spectrum is not written out.
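A minimal sketch of such a function is shown here. The name keep_rich_ms1 and the min_peaks threshold are illustrative assumptions, not part of psims; the "ms level" and "m/z array" keys are the ones pyteomics uses in parsed spectra.

```python
def keep_rich_ms1(spectrum, min_peaks=10):
    """Return the spectrum to keep it, or None to have it skipped."""
    # Drop anything that is not an MS1 scan.
    if spectrum["ms level"] > 1:
        return None
    # Drop sparse spectra (min_peaks is a hypothetical threshold).
    if len(spectrum["m/z array"]) < min_peaks:
        return None
    # Returning the (possibly modified) dict keeps the spectrum.
    return spectrum
```

Because the function only touches a plain dict, it can be checked on hand-built dictionaries before it is passed to the transformer.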
You are free to modify existing keys in the spectrum dictionary, but new keys that are
intended to be recognized as either <cvParam /> or <userParam /> elements must
be instances of pyteomics.auxiliary.cvstr, or otherwise have an “accession” attribute, to be picked up.
Alternatively, the converter will make an effort to coerce keys whose values are scalars, or dicts
which look like parameters (having at least a “name” or “accession” key).
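As a rough illustration of marking a new key, the sketch below stores a summed intensity under a cvstr key. The function name add_total_ion_current is hypothetical, and the fallback class exists only so the snippet runs where pyteomics is not installed; it mimics the “accession” attribute and nothing more.

```python
try:
    from pyteomics.auxiliary import cvstr
except ImportError:
    # Stand-in for illustration only, NOT the real pyteomics class:
    # a str subclass carrying the "accession" attribute described above.
    class cvstr(str):
        def __new__(cls, value, accession=None, unit_accession=None):
            inst = str.__new__(cls, value)
            inst.accession = accession
            inst.unit_accession = unit_accession
            return inst

# MS:1000285 is the PSI-MS accession for "total ion current".
TIC = cvstr("total ion current", accession="MS:1000285")


def add_total_ion_current(spectrum):
    """Store the summed intensity under a key that carries an accession."""
    spectrum[TIC] = float(sum(spectrum["intensity array"]))
    return spectrum
```

Note that a spectrum parsed from a file may already contain this key, in which case the assignment simply overwrites its value.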
You can also inherit from MzMLTransformer and override format_spectrum()
to modify the spectrum before or after conversion (letting you directly append to the “params” key of the
converted spectrum and avoid needing to mark new params with cvstr). Additionally, you
can override any of the other format_ methods to customize how other elements are converted.
Usage and Examples¶
In its simplest form, we would use the MzMLTransformer like so:

from psims.transform.mzml import MzMLTransformer, cvstr

def transform_drop_ms2(spectrum):
    # Returning None tells the transformer to skip this spectrum
    if spectrum['ms level'] > 1:
        return None
    return spectrum

with open("input.mzML", 'rb') as in_stream, open("ms1_only.mzML", 'wb') as out_stream:
    MzMLTransformer(in_stream, out_stream, transform_drop_ms2).write()

- class psims.transform.mzml.MzMLTransformer(input_stream, output_stream, transform=None, transform_description=None, sort_by_scan_time=False)
Reads an mzML file stream from input_stream, copying its metadata to output_stream, and then copies its spectra, applying transform to each spectrum object as it goes.
If sort_by_scan_time is True, then prior to writing spectra, a first pass will be made over the mzML file and the spectra will be written out ordered by MS:1000016|scan start time.
- input_stream
  A byte stream from an mzML format data buffer
  Type: file-like
- output_stream
  A writable binary stream to copy the contents of input_stream into
  Type: file-like
- transform
  A function to call on each spectrum, passed as a dict object as read by pyteomics.mzml.MzML. A spectrum will be skipped if this function returns None.
  Type: Callable, optional
- transform_description
  A description of the transformation to include in the written metadata
  Type: str
- Parameters
  input_stream (path or file-like) – A byte stream from an mzML format data buffer
  output_stream (path or file-like) – A writable binary stream to copy the contents of input_stream into
  transform (Callable, optional) – A function to call on each spectrum, passed as a dict object as read by pyteomics.mzml.MzML.
  transform_description (str) – A description of the transformation to include in the written metadata
  sort_by_scan_time (bool) – Whether or not to sort spectra by scan time prior to writing
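The keyword parameters above might be combined as in the following sketch. The function names and the 10 to 20 window are illustrative assumptions; the window is in whatever unit the file records for scan start time. The psims import is placed inside the function so the filter can be tried on plain dicts first.

```python
def keep_time_window(spectrum, start=10.0, end=20.0):
    """Keep spectra whose scan start time lies in [start, end]."""
    # pyteomics nests the scan start time under scanList -> scan -> [0].
    scan_time = spectrum["scanList"]["scan"][0]["scan start time"]
    if start <= scan_time <= end:
        return spectrum
    return None


def write_time_window(in_path, out_path):
    """Copy in_path to out_path, filtered and ordered by scan start time."""
    # Lazy import: only needed when a file is actually converted.
    from psims.transform.mzml import MzMLTransformer

    with open(in_path, "rb") as in_stream, open(out_path, "wb") as out_stream:
        MzMLTransformer(
            in_stream,
            out_stream,
            transform=keep_time_window,
            transform_description="Keep spectra with scan start time in [10, 20]",
            sort_by_scan_time=True,
        ).write()
```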
MzMLb Translation¶
psims can also translate mzML into mzMLb automatically using a variant of MzMLTransformer
called MzMLToMzMLb. It works identically to MzMLTransformer, though it can accept additional arguments
to control the HDF5 block size and compression.
- class psims.transform.mzml.MzMLToMzMLb(input_stream, output_stream, transform=None, transform_description=None, sort_by_scan_time=False, **hdf5args)
Convert an mzML document into an mzMLb file, with an optional transformation along the way.
- Parameters
  input_stream (path or file-like) – A byte stream from an mzML format data buffer
  output_stream (path or file-like) – A writable binary stream to copy the contents of input_stream into
  transform (Callable, optional) – A function to call on each spectrum, passed as a dict object as read by pyteomics.mzml.MzML.
  transform_description (str) – A description of the transformation to include in the written metadata
  sort_by_scan_time (bool) – Whether or not to sort spectra by scan time prior to writing
  h5_compression (str, optional) – The name of the HDF5 compression method to use. Defaults to psims.mzmlb.writer.DEFAULT_COMPRESSOR
  h5_compression_opts (tuple or int, optional) – The configuration options for the selected compressor. For “gzip”, this is a single integer setting the compression level, while Blosc takes a tuple of integers.
  h5_blocksize (int, optional) – The size of the compression blocks used when building the HDF5 file. Smaller blocks improve random access speed at the expense of compression efficiency and space. Defaults to 2 ** 20 (1 MiB).

#!/usr/bin/env python
import sys
from psims.transform.mzml import MzMLToMzMLb

inpath = sys.argv[1]
outpath = sys.argv[2]
try:
    # Optional third argument selects the HDF5 compressor
    compression = sys.argv[3]
except IndexError:
    compression = "blosc"

with open(inpath, 'rb') as instream:
    MzMLToMzMLb(instream, outpath, h5_compression=compression).write()
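
The other HDF5 arguments could be supplied in the same way. The sketch below assumes gzip compression; the helper names, the default level of 4, and the 2 ** 18 block size are illustrative choices rather than recommendations from psims.

```python
def gzip_hdf5_args(level=4, blocksize=2 ** 18):
    """Build the keyword arguments for a gzip-compressed mzMLb file."""
    # gzip in HDF5 accepts integer levels from 0 (none) to 9 (maximum).
    if not 0 <= level <= 9:
        raise ValueError("gzip level must be between 0 and 9")
    return {
        "h5_compression": "gzip",
        "h5_compression_opts": level,
        # Smaller blocks favor random access over compression efficiency.
        "h5_blocksize": blocksize,
    }


def convert_with_gzip(in_path, out_path, **settings):
    """Convert an mzML file to mzMLb using the gzip settings above."""
    # Lazy import: only needed when a file is actually converted.
    from psims.transform.mzml import MzMLToMzMLb

    with open(in_path, "rb") as instream:
        MzMLToMzMLb(instream, out_path, **gzip_hdf5_args(**settings)).write()
```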