LC-MS/MS Data Preprocessing and Deconvolution

Convert raw mass spectral data files into deisotoped neutral mass peak lists written to a new mzML [Martens2011] file. For tandem mass spectra, recalculate precursor ion monoisotopic peaks.

This task is computationally intensive and uses several cooperating worker processes to share the work.

glycresoft mzml preprocess

Convert raw mass spectral data into deisotoped neutral mass peak lists written to mzML.

glycresoft mzml preprocess [OPTIONS] MS_FILE OUTFILE_PATH

Options

-a, --averagine <averagine>

Averagine model to use for MS1 scans. Either a name or formula [default: glycan] (May specify more than once)

-an, --msn-averagine <averagine>

Averagine model to use for MS^n scans. Either a name or formula [default: peptide]

-s, --start-time <float>

Scan time to begin processing at in minutes [default: 0.0]

-e, --end-time <float>

Scan time to stop processing at in minutes [default: inf]

-c, --maximum-charge <int>

Highest absolute charge state to consider [default: 8]

-n, --name <string>

Name for the sample run to be stored. Defaults to the base name of the input mzML file

-t, --score-threshold <float>

Minimum score to accept an isotopic pattern fit in an MS1 scan [default: 20.0]

-tn, --msn-score-threshold <float>

Minimum score to accept an isotopic pattern fit in an MS^n scan [default: 10.0]

-m, --missed-peaks <int>

Number of missing peaks to permit before an isotopic fit is discarded [default: 3]

-mn, --msn-missed-peaks <int>

Number of missing peaks to permit before an isotopic fit is discarded in an MS^n scan [default: 1]

-p, --processes <int>

Number of worker processes to use. Defaults to 4 or the number of CPUs, whichever is lower [default: 4]

-b, --background-reduction <float>

Background reduction factor. Larger values more aggressively remove low abundance signal in MS1 scans. [default: 5.0]

-bn, --msn-background-reduction <float>

Background reduction factor. Larger values more aggressively remove low abundance signal in MS^n scans. [default: 0.0]

-r, --transform <choice>

Scan transformations to apply to MS1 scans. (May specify more than once)

Choices: extreme_scale_limiter; fticr_baseline; gaussian_smooth; linear_resampling; magnitude_boost; mean_below_mean; median; one_percent_of_max; over_10; over_100; savitsky_golay; tenth_percent_of_max; zero_fill

-rn, --msn-transform <choice>

Scan transformations to apply to MS^n scans. (May specify more than once)

Choices: extreme_scale_limiter; fticr_baseline; gaussian_smooth; linear_resampling; magnitude_boost; mean_below_mean; median; one_percent_of_max; over_10; over_100; savitsky_golay; tenth_percent_of_max; zero_fill

-v, --extract-only-tandem-envelopes

Only work on regions that will be chosen for MS/MS [default: False]

-g, --ms1-averaging <int>

The number of MS1 scans before and after the current MS1 scan to average when picking peaks. [default: 0]

--ignore-msn

Ignore MS^n scans [default: False]

-snr, --signal-to-noise-threshold <float>

Signal-to-noise ratio threshold to apply when filtering peaks [default: 1.0]

-mo, --mass-offset <float>

Shift peak masses by the given amount [default: 0.0]

Arguments

MS_FILE

Required argument <path> Path to a mass spectral data file in one of the supported formats

OUTFILE_PATH

Required argument <path> Path to write the processed output to

Usage example

glycresoft-cli mzml preprocess -a permethylated-glycan -t 20 -p 6 \
    -s 5.0 -e 60.0 "path/to/input" "path/to/output.mzML"

Averagine Models

Argument type for <averagine>. The model selected influences how isotopic patterns are estimated for an arbitrary mass. The value of this parameter may be a builtin model name or a formula.

For a more complete discussion of how “averagine” isotopic models work, see [Senko1995].

Builtin Models

Model Name            Formula
heparin               H10C6S0.5O5.5N0.5
peptide               H7.8C4.9S0.042O1.5N1.4
heparan-sulfate       H11C6S1.3O9N0.67
permethylated-glycan  H22C12O5.2N0.5
glycan                H12C7O5.2N0.5
glycopeptide          H16C11S0.021O6.5N1.7
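
As a rough illustration of how a model is used, the averagine approach scales the model formula until its mass matches the mass of interest, and the scaled composition is then used to estimate an isotopic pattern. The sketch below covers only the scaling step and is not GlycReSoft's own implementation. The helper names parse_formula, formula_mass, and scale_to_mass are hypothetical, and the element masses are standard monoisotopic values rounded to eight decimals.

```python
import re

# Monoisotopic masses (Da) of the elements used by the built-in models.
MONOISOTOPIC_MASS = {
    "H": 1.00782503,
    "C": 12.0,
    "N": 14.00307401,
    "O": 15.99491462,
    "S": 31.97207100,
}


def parse_formula(formula):
    """Parse a formula such as 'H12C7O5.2N0.5' into {element: count}."""
    composition = {}
    for element, count in re.findall(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?", formula):
        # A missing count means one atom; counts may be fractional.
        composition[element] = composition.get(element, 0.0) + float(count or 1)
    return composition


def formula_mass(composition):
    """Monoisotopic mass of a (possibly fractional) composition."""
    return sum(MONOISOTOPIC_MASS[el] * n for el, n in composition.items())


def scale_to_mass(composition, neutral_mass):
    """Scale every element count so the composition weighs neutral_mass."""
    factor = neutral_mass / formula_mass(composition)
    return {el: n * factor for el, n in composition.items()}


# Hypothetical query: the 'glycan' model scaled to a 2000 Da neutral mass.
glycan = parse_formula("H12C7O5.2N0.5")
scaled = scale_to_mass(glycan, 2000.0)
```

The scaled composition keeps the element ratios of the model, which is why the choice of model matters: a peptide-like and a glycan-like composition of the same mass predict different isotopic patterns.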

Supported File Formats

MS_FILE may be in mzML or mzXML format.

Signal Filters

Prior to picking peaks, the raw mass spectral signal may be filtered in a number of ways. By default, a local noise reduction filter is applied, modulated by the -b option for MS1 scans and the -bn option for MS^n scans. Other filters may be set using -r and -rn:

  1. mean_below_mean - Remove all points below a threshold computed in two steps: take the mean intensity of all unfiltered points of this scan, then take the mean of the points that fall below it
  2. median - Remove all points below the median intensity of this scan
  3. one_percent_of_max - Remove all points with intensity less than 1% of the maximum intensity point of this scan
  4. fticr_baseline - Apply the same background reduction algorithm used by -b and -bn
  5. savitsky_golay - Apply Savitzky-Golay smoothing on the intensities of this scan
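
The three threshold filters in the list above can be sketched in a few lines. This is an illustration of the descriptions, not the tool's code: a scan is assumed to be a non-empty list of (m/z, intensity) tuples, the function names are hypothetical, and points exactly equal to the threshold are assumed to be kept.

```python
from statistics import mean, median


def mean_below_mean(points):
    """Drop points below the mean of the below-average intensities."""
    overall = mean(i for _, i in points)
    lower = [i for _, i in points if i < overall]
    # A flat scan has no point below the overall mean, so nothing is dropped.
    threshold = mean(lower) if lower else overall
    return [(mz, i) for mz, i in points if i >= threshold]


def median_filter(points):
    """Drop points below the median intensity of the scan."""
    threshold = median(i for _, i in points)
    return [(mz, i) for mz, i in points if i >= threshold]


def one_percent_of_max(points):
    """Drop points with less than 1% of the most intense point."""
    threshold = max(i for _, i in points) / 100.0
    return [(mz, i) for mz, i in points if i >= threshold]


# Made-up scan: the overall mean is 23.2 and the mean of the points below it is 4.0.
scan = [(100.0, 1.0), (200.0, 2.0), (300.0, 3.0), (400.0, 10.0), (500.0, 100.0)]
kept = mean_below_mean(scan)
```

On this made-up scan, mean_below_mean keeps only the two most intense points, median_filter keeps three, and one_percent_of_max keeps all five, which shows how differently the filters behave when one peak dominates.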

Output Information

The resulting mzML file from this tool attempts to preserve as much metadata as possible from the source data file, and records its own metadata in the appropriate sections of the document.

Each scan has a standard set of cvParam entries covering scan polarity, peak mode, and MS level. In addition to the normal m/z array and intensity array entries, each scan also includes the standardized charge array, as well as two non-standard arrays, deconvolution score array and isotopic envelopes array. The deconvolution score array is just the result of the goodness-of-fit function used to evaluate the isotopic envelopes resulting in the reported peaks. The isotopic envelopes array is more complex, as it encodes the set of isotopic peaks used to fit each reported peak, and does not have a one-to-one relationship with other arrays.
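
Before any of these arrays can be used, the text of each binary element has to be decoded. mzML stores binary arrays as base64 text holding little-endian numbers, optionally zlib-compressed. The sketch below uses only the Python standard library. The function name decode_binary_array and its arguments are hypothetical, and in a real file the compression and precision are declared by the cvParam entries of each array rather than passed in by hand.

```python
import base64
import struct
import zlib


def decode_binary_array(text, compressed=True, typecode="f"):
    """Decode the text of a binary element into a list of numbers.

    typecode "f" is a 32-bit float and "d" is a 64-bit float.
    """
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    size = struct.calcsize("<" + typecode)
    count = len(raw) // size
    # "<" selects little-endian byte order with no padding.
    return list(struct.unpack("<%d%s" % (count, typecode), raw[:count * size]))


# Round trip a small made-up float32 array to check the decoding.
values = [0.0, 0.0, 500.25, 1000.0]
encoded = base64.b64encode(zlib.compress(struct.pack("<4f", *values)))
decoded = decode_binary_array(encoded)
```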

To unpack the isotopic envelopes array after decoding, we use the following logic:

from collections import namedtuple

# Minimal stand-ins for the container types, assumed here so the
# example runs on its own
EnvelopePair = namedtuple("EnvelopePair", ["mz", "intensity"])
Envelope = tuple


def decode_envelopes(array):
    '''
    Arguments
    ---------
    array: float32 array

    Returns
    -------
    list of Envelope
    '''
    envelope_list = []
    current_envelope = []
    i = 0
    n = len(array)
    while i < n:
        # fetch the next two values
        mz = array[i]
        intensity = array[i + 1]
        i += 2

        # if both numbers are zero, this denotes the beginning
        # of a new envelope
        if mz == 0 and intensity == 0:
            # store the envelope collected so far, if any
            if current_envelope:
                envelope_list.append(Envelope(current_envelope))
            current_envelope = []
        # otherwise add the current point to the existing envelope
        else:
            current_envelope.append(EnvelopePair(mz, intensity))
    # the last envelope has no marker after it, so store it here
    envelope_list.append(Envelope(current_envelope))
    return envelope_list
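
To make the layout concrete, here is a sketch of the inverse operation. It is inferred from the decoding logic above rather than taken from the tool's source, and the function name encode_envelopes and the sample values are made up. Each envelope is written as a (0, 0) marker followed by its (m/z, intensity) pairs, so the flat array is longer than the peak arrays and cannot be indexed peak by peak.

```python
def encode_envelopes(envelopes):
    """Flatten envelopes into [0, 0, mz, intensity, mz, intensity, 0, 0, ...]."""
    flat = []
    for envelope in envelopes:
        # A pair of zeros marks the start of each envelope.
        flat.extend((0.0, 0.0))
        for mz, intensity in envelope:
            flat.extend((mz, intensity))
    return flat


# Two made-up envelopes: three isotopic peaks, then two.
envelopes = [
    [(500.25, 1000.0), (500.75, 600.0), (501.25, 200.0)],
    [(612.5, 300.0), (613.0, 150.0)],
]
flat = encode_envelopes(envelopes)
```

Two reported peaks produce fourteen values here, which illustrates why the isotopic envelopes array does not line up one-to-one with the m/z, intensity, and charge arrays.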

Bibliography

[Senko1995] Senko, M. W., Beu, S. C., & McLafferty, F. W. (1995). Determination of monoisotopic masses and ion populations for large biomolecules from resolved isotopic distributions. Journal of the American Society for Mass Spectrometry, 6(4), 229–233. https://doi.org/10.1016/1044-0305(95)00017-8
[Martens2011] Martens, L., Chambers, M., Sturm, M., Kessner, D., Levander, F., Shofstahl, J., … Deutsch, E. W. (2011). mzML–a community standard for mass spectrometry data. Molecular & Cellular Proteomics, 10(1), R110.000133. https://doi.org/10.1074/mcp.R110.000133