psims

Usage Examples

Writing mzIdentML

In this example, we’ll walk through the steps involved in writing an mzIdentML file for a fictional search engine, given a tabular results file and some information about the search parameters.

from psims.mzid import MzIdentMLWriter

To begin, you create the writer with a file path, or a previously created file-like object:

writer = MzIdentMLWriter("path/to/write.mzid")
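
If you prefer to manage the file object yourself, e.g. to write compressed output, a file-like object works just as well. A minimal sketch, assuming the writer only needs a writable binary stream and that you close the stream yourself afterwards:

import gzip

# any writable binary stream should do; here, a gzip-compressed target
stream = gzip.open("path/to/write.mzid.gz", 'wb')
writer = MzIdentMLWriter(stream)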

This step is identical to the mzML example. Before any content can be written out, the writer must start the document, which can be done either by using it as a context manager or by calling its begin() method.

with writer:
    writer.controlled_vocabularies()
    ...

The above example uses the context manager syntax, and immediately writes the controlled vocabulary list to the document. This starts the standard-compliance state-machine, which checks to make sure that a document proceeds through each section in the expected order, without skipping required sections. The remainder of these code samples will take place within this context manager.
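
If the context manager form does not suit your control flow, the same steps can be written out explicitly. A minimal sketch, assuming an end() method mirrors what exiting the with block does (begin() is mentioned above; end() is an assumption, so check it against your psims version):

writer.begin()
writer.controlled_vocabularies()
# ... write the rest of the document ...
writer.end()  # assumed counterpart of begin(); exiting the with block does this for you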

The next step involves registering the provenance of the document, naming the software that generated the content and, optionally, the person and/or organization responsible for executing that software (or the analysis). We can specify the software either with dict objects or by directly instantiating AnalysisSoftware instances (pass a list to register multiple software tools).

software = {
    "name": "my search tool", # Dynamically translated into a cvParam if the name maps to a
                              # a term in the PSI-MS controlled vocabulary, otherwise
                              # into `MS:1000799` with the name as the value.
    "id": 1, # If you have only one analysis software, you may omit this
    "version": "v0.1.0",
    "uri": "https://my.tool.site/",
}
writer.provenance(software=software)
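
To also record the responsible person and organization mentioned above, something like the following sketch may work; the keyword names (owner, organization) and the Person/Organization fields shown are assumptions here, so check them against your psims version:

researcher = {
    "id": "PERSON_DOE",  # hypothetical identifiers
    "first_name": "Jane",
    "last_name": "Doe",
    "affiliation": "ORG_EXAMPLE_LAB",
}
lab = {
    "id": "ORG_EXAMPLE_LAB",
    "name": "Example Lab",
}
writer.provenance(software=software, owner=researcher, organization=lab)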

Before we begin writing the actual information content of the file, we may need to do a bit of preparation first. The mzIdentML schema involves making references to entities that are, by definition, written later than they are referenced, but psims does not like to reference things it doesn’t yet know about. register() can be used to pre-declare any type of entity so psims knows about it before it is used. Alternatively, you can simply ignore the warning messages psims generates when you reference an unknown identifier. You can also invoke register() by accessing a component type as an attribute on an instance of MzIdentMLWriter, e.g. writer.SpectraData.register("spectra_data_1").

Note

Registered identifiers should be unique strings, not integers, as they do not go through the normal identifier conversion process!
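
For example, both of the following forms should pre-declare a spectra data entity; the string-based form is a sketch that assumes register() takes the component type name first, followed by the identifier:

# attribute form, as described above
writer.SpectraData.register("spectra_data_1")

# assumed string-based equivalent
writer.register("SpectraData", "spectra_data_1")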

The next step is to write the sequence collection by opening the sequence_collection() context manager and writing out the database sequences. We’ll assume that we searched a FASTA database stored in "search_database.fasta" with UniProt deflines and that peptides are connected to their proteins via their accession numbers (protein.description['id']).

from pyteomics import fasta, proforma

writer.SearchDatabase.register("search_db_1")

with writer.sequence_collection():
    for protein in fasta.UniProt("search_database.fasta"):
        writer.write_db_sequence(
            protein.description['id'],
            protein.sequence,
            id=protein.description['id'],
            name=protein.description['entry'],
            search_database_id="search_db_1",
            params=[{"protein description": "{entry} {name}".format(**protein.description)}])
    ...

We’ll next generate the peptide list, still inside the sequence_collection() context. We assume that the peptide spectrum matches are stored in "peptides.csv" and that the sequences are represented in ProForma 2 notation.
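
The exact layout of "peptides.csv" is up to our imaginary search engine; the snippets below only assume the columns used in the code, so a hypothetical file (with illustrative values) might look like:

peptide,protein,peptide_start,peptide_end,scan_id,ms_data_file,theoretical_mz,precursor_mz,precursor_charge,raw-score,q-value
PEPTIDEK,P12345,10,17,index=501,/path/to/data1.mgf,464.7347,464.7352,2,42.5,0.001

Note that ms_data_file must hold the same path used to build the spectra data mapping near the end of this example, and that modified residues would appear in the peptide column in ProForma notation, e.g. N[Deamidated].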

import csv
import os

with open("peptides.csv", 'rt') as peptide_fh:
    peptides_seen = dict()
    for psm in csv.DictReader(peptide_fh):
        peptide_seq = psm['peptide']
        protein_acc = psm['protein']

        key = f"{protein_acc}_$_{peptide_seq}"
        if key in peptides_seen:
            continue
        peptides_seen[key] = (psm['peptide_start'], psm['peptide_end'])

        peptide_seq = proforma.ProForma.parse(peptide_seq)
        unmodified_seq = ''.join([pos[0] for pos in peptide_seq])
        modifications = []
        for i, (aa, mods) in enumerate(peptide_seq, 1):
            for mod in (mods or []):  # mods is None at unmodified positions
                if mod.type in (proforma.TagTypeEnum.generic, proforma.TagTypeEnum.unimod):
                    modifications.append({
                        "location": i,
                        "name": mod.name,
                        "monoisotopic_mass_delta": mod.mass,
                    })
                else:
                    print(f"... Skipping tag {mod}, don't know how to convert to UNIMOD modification")
        writer.write_peptide(
            unmodified_seq,
            key,
            modifications)
...

After the peptides are written, we must write out the <PeptideEvidence /> elements that link each peptide back to the protein it was found in.

for peptide_key, (peptide_start, peptide_end) in peptides_seen.items():
    protein_id = peptide_key.split("_$_")[0]

    writer.write_peptide_evidence(
        peptide_key,
        protein_id,
        id=f"{peptide_key}_EVIDENCE",
        start_position=peptide_start,
        end_position=peptide_end)
...
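
If the search database mixes in decoy sequences, mzIdentML can flag them on each <PeptideEvidence />. A hedged sketch, assuming an is_decoy keyword argument and a hypothetical "DECOY_" accession prefix:

writer.write_peptide_evidence(
    peptide_key,
    protein_id,
    id=f"{peptide_key}_EVIDENCE",
    start_position=peptide_start,
    end_position=peptide_end,
    is_decoy=protein_id.startswith("DECOY_"))  # is_decoy and the prefix are assumptions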

Now, we can begin to define the analysis workflow used. Let’s imagine our search engine searched three MGF files, with identifiers mgf_1 - mgf_3, which we’ll pre-register now. We open the <AnalysisCollection> element and declare a single <SpectrumIdentification> workflow using those MGF files and the search database we registered earlier. We’ll also pre-declare the spectrum identification list spectra_identified_list_1, where we’ll write the identified spectra, and declare that the workflow took its parameters from a to-be-defined list of parameters in spectrum_identification_params_1.

writer.SpectraData.register("mgf_1")
writer.SpectraData.register("mgf_2")
writer.SpectraData.register("mgf_3")

writer.SpectrumIdentificationList.register('spectra_identified_list_1')
writer.SpectrumIdentificationProtocol.register('spectrum_identification_params_1')

with writer.analysis_collection():
    writer.SpectrumIdentification(
        spectra_data_ids_used=["mgf_1", "mgf_2", "mgf_3"],
        search_database_ids_used=["search_db_1"],
        spectrum_identification_list_id='spectra_identified_list_1',
        spectrum_identification_protocol_id='spectrum_identification_params_1').write()

Next, we’ll actually declare those parameters. The spectrum_identification_protocol() method is a wrapper for writing <SpectrumIdentificationProtocol> that does some extra work to interpret parameters for you. Here is where we list things like mass tolerances, search modification rules, and any other options that your search engine has that you want to share with the reader.

We’ll say we searched our database using trypsin cleavage rules and a constant or “fixed” modification of “Carbamidomethyl” on all cysteines as is common in shotgun proteomics, but for some bizarre reason we chose to also consider a variable “Deamidation” on any asparagines. We could also specify these modifications by their UNIMOD accession numbers, or define custom rules here as well.
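
For instance, here is a hedged equivalent of the Carbamidomethyl rule used below, naming the modification by its UNIMOD accession instead (UNIMOD:4 is Carbamidomethyl), assuming psims resolves accessions the same way it resolves names:

carbamidomethyl_rule = {
    'fixed': True,
    'mass_delta': 57.021465,
    'params': ['UNIMOD:4'],  # UNIMOD:4 == Carbamidomethyl
    'residues': ['C'],
}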

The parent_tolerance and fragment_tolerance list the mass accuracy error tolerances for MS1 and MSn spectra respectively, using symmetric upper and lower bounds, but different units (parts-per-million vs. Da). We’ll also specify that we considered monoisotopic masses in the additional_search_params list, along with a custom parameter “frobnication level”, to indicate that we tuned the results “very carefully”.

with writer.analysis_protocol_collection():
    writer.spectrum_identification_protocol(
        enzymes=[{'missed_cleavages': 1, 'name': 'trypsin', 'id': 1}],
        modification_params=[
                    {'fixed': True,
                     'mass_delta': 57.021465,
                     'params': ['Carbamidomethyl'],
                     'residues': ['C']},
                    {'fixed': False,
                     'mass_delta': 0.984,
                     'params': ['Deamidation'],
                     'residues': ['N']}
        ],
        parent_tolerance=(10, 10, "parts per million"),
        fragment_tolerance=(0.02, 0.02, "dalton"),
        additional_search_params=[
            "parent mass type mono",
            "fragment mass type mono",
            {"frobnication level": "high"} # a custom paramater that will map to a userparam
        ],
        id="spectrum_identification_params_1"
    )
...

We’re nearly done. The next thing to do is to actually point to the local files that we searched, namely the sequence database file, the source file for these results, and the spectra data files.

with writer.data_collection():
    source_file = {
        'file_format': 'tab delimited text format',
        'id': 1,
        'location': f"file://{os.path.realpath('peptides.csv')}"
    }

    search_database = {
        'file_format': 'fasta format',
        'id': 'search_db_1',
        'location': f'file://{os.path.realpath("search_database.fasta")}',
        'name': 'Uniprot Proteins'}

    spectra_data = [
        {'file_format': 'Mascot MGF format',
         'id': f'mgf_{i}',
         'location': f"file://{os.path.realpath('data' + str(i) + '.mgf')}",
         'spectrum_id_format': 'multiple peak list nativeID format'}
        for i in range(1, 4)
    ]

    writer.inputs(
        source_file, search_database, spectra_data
    )

    ...

Finally, we can write out those peptide spectrum matches we wanted to report! Now, we can exploit the mappings and string-encoding rules we used earlier to connect PSMs back to their peptides and their proteins.

We’ll read back through our "peptides.csv" file, reconstruct the peptide ID as we defined it earlier, and use a few convenient columns to fill in spectrum match properties. If we were using a search engine with one or more registered score terms, we could specify them here as params of the spectrum identification items. Since we’re using an imaginary one, we use the userParam “imagine-raw-score” for our score statistic, and then the generic, search-engine-agnostic term “PSM-level q-value” for our FDR q-value statistic.
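
For illustration, had this been a Comet-style search, the raw score could map to the registered CV term "Comet:xcorr" instead of a userParam. A hedged sketch (the CSV column name remains made up):

params=[
    {"Comet:xcorr": psm['raw-score']},      # a registered, engine-specific score term
    {"PSM-level q-value": psm['q-value']},  # the engine-agnostic q-value term
]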

data_file_to_id = {os.path.realpath('data' + str(i) + '.mgf'): f"mgf_{i}"
                   for i in range(1, 4)}

with writer.analysis_data():
    with writer.spectrum_identification_list(id='spectra_identified_list_1'):
        with open("peptides.csv", 'rt') as peptide_fh:
            for i, psm in enumerate(csv.DictReader(peptide_fh)):
                peptide_seq = psm['peptide']
                protein_acc = psm['protein']
                scan_id = psm['scan_id']
                source_file = psm['ms_data_file']
                peptide_id = f"{protein_acc}_$_{peptide_seq}"

                with writer.spectrum_identification_result(
                        spectrum_id=scan_id,
                        id=f'SIR_{i}',
                        spectra_data_id=data_file_to_id[source_file]):

                    writer.write_spectrum_identification_item(
                        calculated_mass_to_charge=float(psm['theoretical_mz']),
                        charge_state=int(psm['precursor_charge']),
                        experimental_mass_to_charge=float(psm['precursor_mz']),
                        id=f'SII_{i}',
                        peptide_id=peptide_id,
                        peptide_evidence_id=f'{peptide_id}_EVIDENCE',
                        params=[
                            {
                                "imagine-raw-score": psm['raw-score'],
                            },
                            {
                                "PSM-level q-value": psm['q-value'],
                            }
                        ]
                    )

And with that, we’re done! Let all the context managers close, and the file will be written out and closed correctly.