LC-MS/MS Data Preprocessing and Deconvolution
Convert raw mass spectral data files into deisotoped neutral mass peak lists written to a new mzML [Martens2011] file. For tandem mass spectra, recalculate precursor ion monoisotopic peaks.
This task is computationally intensive and uses several cooperating worker processes to share the work.
glycresoft mzml preprocess
Convert raw mass spectral data into deisotoped neutral mass peak lists written to mzML.
glycresoft mzml preprocess [OPTIONS] MS_FILE OUTFILE_PATH
Options

-a, --averagine <averagine>
    Averagine model to use for MS1 scans. Either a name or formula [default: glycan] (May specify more than once)

-an, --msn-averagine <averagine>
    Averagine model to use for MS^n scans. Either a name or formula [default: peptide]

-s, --start-time <float>
    Scan time to begin processing at in minutes [default: 0.0]

-e, --end-time <float>
    Scan time to stop processing at in minutes [default: inf]

-c, --maximum-charge <int>
    Highest absolute charge state to consider [default: 8]

-n, --name <string>
    Name for the sample run to be stored. Defaults to the base name of the input mzML file

-t, --score-threshold <float>
    Minimum score to accept an isotopic pattern fit in an MS1 scan [default: 20.0]

-tn, --msn-score-threshold <float>
    Minimum score to accept an isotopic pattern fit in an MS^n scan [default: 10.0]

-m, --missed-peaks <int>
    Number of missing peaks to permit before an isotopic fit is discarded [default: 3]

-mn, --msn-missed-peaks <int>
    Number of missing peaks to permit before an isotopic fit is discarded in an MS^n scan [default: 1]

-p, --processes <int>
    Number of worker processes to use. Defaults to 4 or the number of CPUs, whichever is lower [default: 4]

-b, --background-reduction <float>
    Background reduction factor. Larger values more aggressively remove low abundance signal in MS1 scans. [default: 5.0]

-bn, --msn-background-reduction <float>
    Background reduction factor. Larger values more aggressively remove low abundance signal in MS^n scans. [default: 0.0]

-r, --transform <choice>
    Scan transformations to apply to MS1 scans. (May specify more than once)
    Choices: [extreme_scale_limiter; fticr_baseline; gaussian_smooth; linear_resampling; magnitude_boost; mean_below_mean; median; one_percent_of_max; over_10; over_100; savitsky_golay; tenth_percent_of_max; zero_fill]

-rn, --msn-transform <choice>
    Scan transformations to apply to MS^n scans. (May specify more than once)
    Choices: [extreme_scale_limiter; fticr_baseline; gaussian_smooth; linear_resampling; magnitude_boost; mean_below_mean; median; one_percent_of_max; over_10; over_100; savitsky_golay; tenth_percent_of_max; zero_fill]

-v, --extract-only-tandem-envelopes
    Only work on regions that will be chosen for MS/MS [default: False]

-g, --ms1-averaging <int>
    The number of MS1 scans before and after the current MS1 scan to average when picking peaks. [default: 0]

--ignore-msn
    Ignore MS^n scans [default: False]

-snr, --signal-to-noise-threshold <float>
    Signal-to-noise ratio threshold to apply when filtering peaks [default: 1.0]

-mo, --mass-offset <float>
    Shift peak masses by the given amount [default: 0.0]
Arguments

MS_FILE
    Required argument <path>. Path to a mass spectral data file in one of the supported formats

OUTFILE_PATH
    Required argument <path>. Path to write the processed output to
Usage example
glycresoft-cli mzml preprocess -a permethylated-glycan -t 20 -p 6 \
-s 5.0 -e 60.0 "path/to/input" "path/to/output.mzML"
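When several files need the same settings, the command line can be assembled programmatically. The following is a minimal sketch, not part of the tool itself: `build_preprocess_command` is a hypothetical helper, and the executable name is a parameter because this page shows both `glycresoft` and `glycresoft-cli`. Only flags documented above are used.

```python
def build_preprocess_command(ms_file, outfile, averagine="glycan",
                             score_threshold=20.0, processes=4,
                             start_time=None, end_time=None,
                             executable="glycresoft"):
    """Build an argument list for the preprocess command (hypothetical helper)."""
    cmd = [executable, "mzml", "preprocess",
           "-a", averagine,
           "-t", str(score_threshold),
           "-p", str(processes)]
    # The time window is optional; omit the flags to use the documented defaults
    if start_time is not None:
        cmd += ["-s", str(start_time)]
    if end_time is not None:
        cmd += ["-e", str(end_time)]
    # Positional arguments come last: MS_FILE then OUTFILE_PATH
    cmd += [ms_file, outfile]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; passing a list rather than a single string avoids shell quoting problems with paths that contain spaces.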
Averagine Models
Argument type for <averagine>. The model selected influences how isotopic patterns are estimated for an arbitrary mass. The value of this parameter may be a builtin model name or a formula.
For a more complete discussion of how “averagine” isotopic models work, see [Senko1995].
Builtin Models
Model Name | Formula |
---|---|
heparin | H10C6S0.5O5.5N0.5 |
peptide | H7.8C4.9S0.042O1.5N1.4 |
heparan-sulfate | H11C6S1.3O9N0.67 |
permethylated-glycan | H22C12O5.2N0.5 |
glycan | H12C7O5.2N0.5 |
glycopeptide | H16C11S0.021O6.5N1.7 |
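To illustrate how a formula like those in the table can stand in for an arbitrary mass, the sketch below parses a formula string, computes the monoisotopic mass of one averagine unit, and scales the unit to a target neutral mass. This is an illustration of the general averagine idea only, not GlycReSoft's implementation: the helper names are hypothetical, element counts are simply rounded, and the usual follow-up steps (adjusting hydrogens to match the target mass, then computing the theoretical isotopic pattern) are omitted.

```python
import re

# Monoisotopic masses (Da) of the elements appearing in the builtin models
MONOISOTOPIC_MASS = {
    "H": 1.00782503207,
    "C": 12.0,
    "N": 14.0030740048,
    "O": 15.99491461956,
    "S": 31.97207100,
}


def parse_formula(formula):
    """Parse a formula such as 'H12C7O5.2N0.5' into {element: count}."""
    composition = {}
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        # An element with no explicit count is taken to appear once
        composition[element] = composition.get(element, 0.0) + (float(count) if count else 1.0)
    return composition


def unit_mass(composition):
    """Monoisotopic mass of one averagine unit."""
    return sum(MONOISOTOPIC_MASS[el] * n for el, n in composition.items())


def scale_to_mass(composition, target_mass):
    """Scale the averagine unit to a target mass, rounding to whole atoms."""
    scale = target_mass / unit_mass(composition)
    return {el: int(round(n * scale)) for el, n in composition.items()}
```

For example, `scale_to_mass(parse_formula("H12C7O5.2N0.5"), 1862.69)` approximates the elemental composition of a glycan-like molecule of that mass by roughly ten averagine units.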
Supported File Formats
MS_FILE may be in mzML or mzXML format.
Signal Filters
Prior to picking peaks, the raw mass spectral signal may be filtered in a number of ways. By default, a local noise reduction filter is applied, modulated by the -b and -bn options for MS1 and MS^n scans respectively. Other filters may be set using -r and -rn:
- mean_below_mean: Remove all points below the mean of all points below the mean of all unfiltered points of this scan
- median: Remove all points below the median intensity of this scan
- one_percent_of_max: Remove all points with intensity less than 1% of the maximum intensity point of this scan
- fticr_baseline: Apply the same background reduction algorithm used by -b and -bn
- savitsky_golay: Apply Savitzky-Golay smoothing on the intensities of this scan
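The three threshold-style filters above can be sketched directly from their descriptions. The following is an illustrative reading of those descriptions, not the tool's own code: a scan is assumed to be a list of `(mz, intensity)` tuples, the function names simply mirror the filter names, and points exactly at the threshold are kept.

```python
from statistics import mean, median


def _keep_at_or_above(points, threshold):
    """Keep (mz, intensity) points whose intensity is not below the threshold."""
    return [(mz, inten) for mz, inten in points if inten >= threshold]


def mean_below_mean(points):
    """Threshold at the mean of all intensities that fall below the overall mean."""
    if not points:
        return []
    intensities = [inten for _, inten in points]
    overall_mean = mean(intensities)
    lower = [x for x in intensities if x < overall_mean]
    # If every point has the same intensity, nothing lies below the mean
    threshold = mean(lower) if lower else overall_mean
    return _keep_at_or_above(points, threshold)


def median_filter(points):
    """Threshold at the median intensity of the scan."""
    if not points:
        return []
    return _keep_at_or_above(points, median(inten for _, inten in points))


def one_percent_of_max(points):
    """Threshold at 1% of the most intense point of the scan."""
    if not points:
        return []
    return _keep_at_or_above(points, 0.01 * max(inten for _, inten in points))
```

Because each function takes and returns the same list-of-tuples shape, several filters can be chained in the order they are given, much as repeated -r options are applied to a scan.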
Output Information
The resulting mzML file from this tool attempts to preserve as much metadata as possible from the source data file, and records its own metadata in the appropriate sections of the document.
Each scan has a standard set of cvParam entries covering scan polarity, peak mode, and MS level. In addition to the normal m/z array and intensity array entries, each scan also includes the standardized charge array, as well as two non-standard arrays, deconvolution score array and isotopic envelopes array. The deconvolution score array holds the result of the goodness-of-fit function used to evaluate the isotopic envelopes behind the reported peaks. The isotopic envelopes array is more complex, as it encodes the set of isotopic peaks used to fit each reported peak, and does not have a one-to-one relationship with the other arrays.
To unpack the isotopic envelopes array after decoding, we use the following logic:
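```python
# Envelope and EnvelopePair are the container types for decoded envelopes;
# importing them from ms_deisotope.peak_set is an assumption about the installed library.
from ms_deisotope.peak_set import Envelope, EnvelopePair


def decode_envelopes(array):
    '''
    Unpack a flat, decoded isotopic envelopes array into a list of envelopes.

    Arguments
    ---------
    array: float32 array
        Flat sequence of (m/z, intensity) pairs; a (0, 0) pair marks the
        beginning of a new envelope.
    '''
    envelope_list = []
    current_envelope = []
    i = 0
    n = len(array)
    while i < n:
        # fetch the next two values
        mz = array[i]
        intensity = array[i + 1]
        i += 2
        # if both numbers are zero, this denotes the beginning
        # of a new envelope
        if mz == 0 and intensity == 0:
            # close the previous envelope, if it holds any points
            if current_envelope:
                envelope_list.append(Envelope(current_envelope))
            current_envelope = []
        # otherwise add the current point to the existing envelope
        else:
            current_envelope.append(EnvelopePair(mz, intensity))
    # the last envelope has no trailing marker, so append it explicitly
    envelope_list.append(Envelope(current_envelope))
    return envelope_list
```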
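To make the layout of the isotopic envelopes array concrete, here is a self-contained round-trip sketch that runs without any mass spectrometry library. It makes several assumptions: `EnvelopePair` is a plain namedtuple stand-in, each envelope is represented as a list of pairs, and `encode_envelopes` is a hypothetical inverse written from the decoding rule above (a `(0, 0)` pair precedes each envelope) rather than the writer the tool actually uses. Unlike the function above, this variant skips empty envelopes.

```python
from collections import namedtuple

# Stand-in for the library's pair type (assumption for illustration)
EnvelopePair = namedtuple("EnvelopePair", ["mz", "intensity"])


def encode_envelopes(envelopes):
    """Flatten envelopes into one list, writing a (0, 0) marker before each."""
    flat = []
    for envelope in envelopes:
        flat.extend((0.0, 0.0))  # start-of-envelope marker
        for mz, intensity in envelope:
            flat.extend((mz, intensity))
    return flat


def decode_envelopes(array):
    """Rebuild the list of envelopes from the flat (m/z, intensity) sequence."""
    envelopes = []
    current = []
    # Walk the array two values at a time
    for i in range(0, len(array) - 1, 2):
        mz, intensity = array[i], array[i + 1]
        if mz == 0 and intensity == 0:
            # A marker closes the envelope collected so far, if any
            if current:
                envelopes.append(current)
            current = []
        else:
            current.append(EnvelopePair(mz, intensity))
    # The final envelope is not followed by a marker
    if current:
        envelopes.append(current)
    return envelopes
```

Decoding the output of `encode_envelopes` returns one list of pairs per input envelope, which shows why the flat array's length is unrelated to the number of reported peaks: each peak contributes one marker plus a variable number of isotopic points.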
Bibliography
[Senko1995] Senko, M. W., Beu, S. C., & McLafferty, F. W. (1995). Determination of monoisotopic masses and ion populations for large biomolecules from resolved isotopic distributions. Journal of the American Society for Mass Spectrometry, 6(4), 229–233. https://doi.org/10.1016/1044-0305(95)00017-8
[Martens2011] Martens, L., Chambers, M., Sturm, M., Kessner, D., Levander, F., Shofstahl, J., … Deutsch, E. W. (2011). mzML–a community standard for mass spectrometry data. Molecular & Cellular Proteomics : MCP, 10(1), R110.000133. https://doi.org/10.1074/mcp.R110.000133