Searching a Processed Sample with a Glycopeptide Database


The end goal of these tools is to identify glycopeptides from experimental data. Once you have constructed a glycopeptide database and deconvoluted an LC-MS/MS data file, you are ready to do just that.

Traditional Database Search¶

glycresoft analyze search-glycopeptide¶

Identify glycopeptide sequences from processed LC-MS/MS data. This algorithm requires a fully materialized cross-product database (the default) and, by default, uses a reverse-peptide decoy evaluated on the total score.

For a search algorithm that applies separate FDR control to the peptide and the glycan, see Multi-component Database Search.
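As a rough illustration of what evaluating a decoy "on the total score" means, the sketch below applies the textbook target-decoy ratio to a single score per match. It is not glycresoft's estimator; the function name `estimate_fdr` and the scores are hypothetical and exist only for this example.

```python
from typing import Sequence


def estimate_fdr(target_scores: Sequence[float],
                 decoy_scores: Sequence[float],
                 threshold: float) -> float:
    """Textbook target-decoy FDR estimate at a score threshold.

    Assumption: one total score per spectrum match, with target and
    decoy matches scored the same way.
    """
    n_targets = sum(1 for s in target_scores if s >= threshold)
    n_decoys = sum(1 for s in decoy_scores if s >= threshold)
    if n_targets == 0:
        return 0.0
    # The ratio can exceed 1 when decoys outnumber targets; cap it.
    return min(1.0, n_decoys / n_targets)


# Hypothetical total scores, for illustration only.
targets = [42.0, 37.5, 30.1, 22.8, 15.2, 9.7]
decoys = [18.4, 11.0, 8.3, 5.1]

for cutoff in (10.0, 20.0):
    print(cutoff, estimate_fdr(targets, decoys, cutoff))
```

Raising the score cutoff removes decoy matches faster than target matches, which is why a stricter -q value yields fewer but more reliable identifications.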

glycresoft analyze search-glycopeptide [OPTIONS] DATABASE_CONNECTION
                                       SAMPLE_PATH HYPOTHESIS_IDENTIFIER

Options

-m, --mass-error-tolerance <relative mass error>¶: Mass accuracy constraint for matching MS^1 ions, given as a relative error (1e-05 is 10 parts-per-million). [default: 1e-05]

-mn, --msn-mass-error-tolerance <relative mass error>¶: Mass accuracy constraint for matching MS^n ions, given as a relative error (2e-05 is 20 parts-per-million). [default: 2e-05]

-g, --grouping-error-tolerance <relative mass error>¶: Mass accuracy constraint for grouping chromatograms, given as a relative error (1.5e-05 is 15 parts-per-million). [default: 1.5e-05]

-n, --analysis-name <string>¶: Name for analysis to be performed.

-q, --psm-fdr-threshold <float>¶: Maximum FDR allowed when filtering GPSMs to select identified glycopeptides [default: 0.05]

-s, --tandem-scoring-model <choice or>¶: Select a scoring function to use for evaluating glycopeptide-spectrum matches [default: coverage_weighted_binomial]

Choices: [binomial; simple; coverage_weighted_binomial; log_intensity; penalized_log_intensty; peptide_only_cw_binomial; log_intensity_v3; log_intensity_reweighted]

-x, --oxonium-threshold <float>¶: Minimum HexNAc-derived oxonium ion abundance ratio to filter MS/MS scans. [default: 0.05]

-a, --adduct <string>¶: Adducts to consider. Specify name or formula, and a multiplicity. (May specify more than once)

-f, --use-peptide-mass-filter¶: Filter putative spectrum matches by estimating the peptide backbone mass from the precursor mass and stub glycopeptide signature ions [default: False]

-p, --processes <int>¶: Number of worker processes to use. Defaults to 4 or the number of CPUs, whichever is lower [default: 4]

--export <choice>¶: Export command to run after the search is complete (May specify more than once)

Choices: [csv; html; psm-csv]

-o, --output-path <path>¶: Path to write resulting analysis to. [required]

-w, --workload-size <int>¶: Number of spectra to process at once [default: 500]

--save-intermediate-results <path>¶: Save intermediate spectrum matches to a file

--maximum-mass <float>¶: [default: inf]

-D, --decoy-database-connection <string>¶: Provide an alternative hypothesis to draw decoy glycopeptides from instead of the simpler reversed-peptide decoy. This is especially necessary when the stub peptide+Y ions account for a large fraction of MS2 signal.

-G, --permute-decoy-glycan-fragments¶: Whether or not to permute decoy glycopeptides’ peptide+Y ions. The intact mass, peptide, and peptide+Y1 ions are unchanged. [default: False]

--isotope-probing-range <int>¶: The maximum number of isotopic peak errors to allow when searching for untrusted precursor masses [default: 3]

-R, --rare-signatures¶: Look for rare signature ions when scoring glycan oxonium signature [default: False]

--retention-time-modeling, --no-retention-time-modeling¶: Whether or not to model relative retention time to correct for common glycan composition errors. [default: True]
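The --isotope-probing-range option can be pictured as trying several candidate precursor masses per scan. The sketch below assumes that an "isotopic peak error" means the reported mass sits on a heavier isotopic peak than the true monoisotopic one, so each candidate subtracts a whole number of 13C-12C mass differences. The function name, the inclusive range, and the direction of the correction are assumptions for illustration, not glycresoft's implementation.

```python
from typing import List

# Approximate 13C - 12C mass difference in Da (assumed spacing between
# adjacent isotopic peaks of a neutral mass).
NEUTRON_OFFSET = 1.0033548


def candidate_precursor_masses(reported_mass: float,
                               isotope_probing_range: int = 3) -> List[float]:
    """List the masses to try when the reported precursor mass is untrusted.

    Assumption: errors only go toward heavier peaks, so candidates are
    the reported mass minus 0..isotope_probing_range isotopic spacings.
    """
    if isotope_probing_range < 0:
        raise ValueError("isotope_probing_range must be non-negative")
    return [reported_mass - k * NEUTRON_OFFSET
            for k in range(isotope_probing_range + 1)]


for mass in candidate_precursor_masses(4500.0):
    print(f"{mass:.4f}")
```

Each extra step adds another candidate mass to query per scan, so a larger range trades search time for tolerance to mis-assigned monoisotopic peaks.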

Arguments

DATABASE_CONNECTION¶: Required argument <databaseconnectionparam> A connection URI for a database, or a path on the file system

SAMPLE_PATH¶: Required argument <path> The path to the deconvoluted sample file

HYPOTHESIS_IDENTIFIER¶: Required argument <string> The ID number or name of the glycopeptide hypothesis to use

Usage Example¶

$ glycresoft analyze search-glycopeptide -m 5e-6 -mn 1e-5 fasta-glycopeptides.db path/to/processed/sample.mzML 1 \
     -o "agp-glycopeptides-in-sample.db"
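The tolerances in the command above are relative errors rather than ppm numbers: -m 5e-6 corresponds to 5 ppm and -mn 1e-5 to 10 ppm. The following sketch shows the conversion and the resulting absolute mass window; the helper names are hypothetical and not part of glycresoft.

```python
from typing import Tuple


def to_ppm(relative_error: float) -> float:
    """Convert a relative mass error (e.g. 1e-05) to parts-per-million."""
    return relative_error * 1e6


def mass_window(mass: float, relative_error: float) -> Tuple[float, float]:
    """Return the (low, high) mass bounds accepted at a relative tolerance."""
    delta = mass * relative_error
    return mass - delta, mass + delta


for flag, value in (("-m", 5e-6), ("-mn", 1e-5)):
    low, high = mass_window(3000.0, value)
    print(f"{flag} {value:g}: {to_ppm(value):.1f} ppm, "
          f"window at 3000 Da = [{low:.4f}, {high:.4f}]")
```

Because the tolerance is relative, the absolute window widens with mass: the same setting admits a larger error in Da for a 6000 Da precursor than for a 3000 Da one.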

Multi-component Database Search¶

glycresoft analyze search-glycopeptide-multipart¶

Search preprocessed data for glycopeptide sequences scored for both peptide and glycan components.

This search strategy requires an explicit decoy database, such as one created with the --reverse flag of glycopeptide-fa.

glycresoft analyze search-glycopeptide-multipart [OPTIONS] DATABASE_CONNECTION
                                                 DECOY_DATABASE_CONNECTION
                                                 SAMPLE_PATH

Options

-T, --target-hypothesis-identifier <int>¶: The ID number or name of the target glycopeptide hypothesis to use [default: 1]

-D, --decoy-hypothesis-identifier <int>¶: The ID number or name of the decoy glycopeptide hypothesis to use [default: 1]

-M, --memory-database-index¶: Whether to load the entire peptide database into memory during spectrum mapping. Uses more memory but substantially accelerates the process [default: False]

-m, --mass-error-tolerance <relative mass error>¶: Mass accuracy constraint for matching MS^1 ions, given as a relative error (1e-05 is 10 parts-per-million). [default: 1e-05]

-mn, --msn-mass-error-tolerance <relative mass error>¶: Mass accuracy constraint for matching MS^n ions, given as a relative error (2e-05 is 20 parts-per-million). [default: 2e-05]

-g, --grouping-error-tolerance <relative mass error>¶: Mass accuracy constraint for grouping chromatograms, given as a relative error (1.5e-05 is 15 parts-per-million). [default: 1.5e-05]

-n, --analysis-name <string>¶: Name for analysis to be performed.

-q, --psm-fdr-threshold <float>¶: Maximum FDR allowed when filtering GPSMs to select identified glycopeptides [default: 0.05]

-f, --fdr-estimation-strategy <choice>¶: The FDR estimation strategy to use. The joint estimate uses both peptide and glycan scores, peptide uses only peptide scores, glycan uses only glycan scores, and any uses the smallest FDR of the joint, peptide, and glycan estimates. [default: joint]

Choices: [joint; peptide; glycan; any]

-s, --tandem-scoring-model <choice or>¶: Select a scoring function to use for evaluating glycopeptide-spectrum matches [default: log_intensity]

Choices: [log_intensity; simple; penalized_log_intensty; log_intensity_v3]

-y, --glycan-score-threshold <float>¶: The minimum glycan score required to consider a peptide mass [default: 1.0]

-a, --adduct <string>¶: Adducts to consider. Specify name or formula, and a multiplicity. (May specify more than once)

-p, --processes <int>¶: Number of worker processes to use. Defaults to 4 or the number of CPUs, whichever is lower [default: 4]

--export <choice>¶: Export command to run after the search is complete (May specify more than once)

Choices: [csv; html; psm-csv]

-o, --output-path <path>¶: Path to write resulting analysis to. [required]

-w, --workload-size <int>¶: Number of spectra to process at once [default: 100]

-R, --rare-signatures¶: Look for rare signature ions when scoring glycan oxonium signature [default: False]

--retention-time-modeling, --no-retention-time-modeling¶: Whether or not to model relative retention time to correct for common glycan composition errors. [default: True]

--isotope-probing-range <int>¶: The maximum number of isotopic peak errors to allow when searching for untrusted precursor masses [default: 3]

-S, --glycoproteome-smoothing-model <path>¶: Path to a glycoproteome site-specific glycome model

-x, --oxonium-threshold <float>¶: Minimum HexNAc-derived oxonium ion abundance ratio to filter MS/MS scans. [default: 0.05]

-P, --peptide-masses-per-scan <int>¶: The maximum number of peptide masses to consider per scan [default: 60]
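The four --fdr-estimation-strategy choices described in the options above can be read as a selection among three per-match FDR estimates. The sketch below is a minimal reading of that description, assuming each match already carries a joint, a peptide, and a glycan estimate; the `FDREstimates` class and `select_fdr` function are hypothetical names, not glycresoft's API.

```python
from dataclasses import dataclass


@dataclass
class FDREstimates:
    """Hypothetical container for the three estimates of one match."""
    joint: float
    peptide: float
    glycan: float


def select_fdr(estimates: FDREstimates, strategy: str = "joint") -> float:
    """Pick the FDR value that the chosen strategy would filter on."""
    if strategy == "any":
        # "any" uses the smallest of the three estimates.
        return min(estimates.joint, estimates.peptide, estimates.glycan)
    if strategy in ("joint", "peptide", "glycan"):
        return getattr(estimates, strategy)
    raise ValueError(f"unknown strategy: {strategy!r}")


match = FDREstimates(joint=0.04, peptide=0.01, glycan=0.08)
threshold = 0.05  # the -q default
for strategy in ("joint", "peptide", "glycan", "any"):
    value = select_fdr(match, strategy)
    print(strategy, value, "kept" if value <= threshold else "filtered")
```

Under this reading, "any" is the most permissive choice, since a match passes as soon as one of the three estimates is below the threshold.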

Arguments

DATABASE_CONNECTION¶: Required argument <databaseconnectionparam> A connection URI for a database, or a path on the file system

DECOY_DATABASE_CONNECTION¶: Required argument <databaseconnectionparam> A connection URI for a database, or a path on the file system

SAMPLE_PATH¶: Required argument <path> The path to the deconvoluted sample file

Usage Example¶

Please see the SCE Tutorial for example usage.

Memory Consumption and Workload Size¶

Searches make extensive use of caching and work-sharing to keep enormous databases tractable. If you run out of memory during a search, consider lowering the -w (--workload-size) parameter so that fewer spectra are processed at once.
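To see why a smaller -w value reduces peak memory, consider a generic batching loop: only one batch of spectra, plus whatever is cached for it, needs to be held at a time. This is a general illustration of the idea, not glycresoft's internal scheduler; the `batches` helper is hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batches(items: Iterable[T], workload_size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `workload_size` items."""
    if workload_size < 1:
        raise ValueError("workload_size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, workload_size))
        if not chunk:
            return
        yield chunk


# Stand-in for scan identifiers read lazily from a processed sample.
scan_ids = (f"scan={i}" for i in range(1, 8))
for batch in batches(scan_ids, workload_size=3):
    print(len(batch), batch)
```

Smaller batches mean more rounds of work and less reuse between spectra within a round, so the setting trades speed for a lower memory ceiling.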

Build a Glycosite Network Smoothing Model¶

glycresoft analyze fit-glycoproteome-smoothing-model¶

glycresoft analyze fit-glycoproteome-smoothing-model [OPTIONS]

Options

-p, --processes <int>¶: Number of worker processes to use. Defaults to 4 or the number of CPUs, whichever is lower [default: 4]

-i, --analysis-path <string>¶: [required] (May specify more than once)

-o, --output-path <path>¶: [required]

-q, --fdr-threshold <float>¶: The FDR threshold to apply when selecting identified glycopeptides [default: 0.05]

-P, --glycopeptide-hypothesis <tuple>¶

-g, --glycan-hypothesis <tuple>¶

-u, --unobserved-penalty-scale <float>¶: A penalty to scale unobserved-but-suggested glycans by. Defaults to 1.0, no penalty. [default: 1.0]

-a, --smoothing-limit <float>¶: An upper bound on the network smoothness to use when estimating the posterior probability. [default: 0.2]

-r, --require-multiple-observations, --no-require-multiple-observations¶: Require a glycan/glycosite combination be observed in multiple samples to treat it as real. [default: False]

-w, --network-path <path>¶: The path to a text file defining the glycan network and its neighborhoods, as produced by glycresoft build-hypothesis glycan-network. If omitted, the default human N-glycan network will be used with the glycans defined in -g.

Adducts¶

Unlike the glycan search tool, the glycopeptide search tool does not apply combinatorial expansion of adducts. It will not mix mass shifts of different types together, so if both Ammonium 2 and Na1H-1 1 are specified, the algorithm will only search for 0, 1, or 2 Ammonium shifts or 0 or 1 Na1H-1 shifts, never both on the same ion. This keeps the search space tractable, and in tested datasets most multiply adducted ion species are low in abundance.
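The difference between the two expansion schemes can be sketched by enumerating the adduct states each would search for the Ammonium 2 and Na1H-1 1 example. The functions below are illustrative only and do not reflect glycresoft's internal representation of mass shifts.

```python
from itertools import product
from typing import Dict, List, Tuple

AdductState = Tuple[Tuple[str, int], ...]


def non_combinatorial(adducts: Dict[str, int]) -> List[AdductState]:
    """Unmodified form plus 1..n copies of a single adduct type at a time."""
    states: List[AdductState] = [()]
    for name, max_count in adducts.items():
        for count in range(1, max_count + 1):
            states.append(((name, count),))
    return states


def combinatorial(adducts: Dict[str, int]) -> List[AdductState]:
    """Every mixture of counts across all adduct types, for comparison."""
    names = list(adducts)
    states: List[AdductState] = []
    for counts in product(*(range(adducts[n] + 1) for n in names)):
        states.append(tuple((n, c) for n, c in zip(names, counts) if c > 0))
    return states


requested = {"Ammonium": 2, "Na1H-1": 1}
print(len(non_combinatorial(requested)), non_combinatorial(requested))
print(len(combinatorial(requested)), combinatorial(requested))
```

With these two adducts the non-combinatorial scheme yields four states against six for the full expansion, and the gap grows multiplicatively as more adduct types are added.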
