| mitocheck's PTMs Documentation | |||
|
The mtcPTM database is a repository of post-translational modifications (PTMs) in human and mouse proteins that also aims to preserve and present the experimental evidence that led to the identification of each modification. The database is publicly available online as part of the MitoCheck web site. |
|
The mtcPTM web site contains data retrieved from the literature, protein annotations and other databases, as well as data produced by the MitoCheck project. It therefore handles quite different data sets, for which the amount of available information varies. For example, modifications retrieved from the literature and from protein annotations are usually recorded as individual residues, and the experimental information from which the modification was deduced can only be recovered by reading the original publication. In contrast, storing not only the positions of the modified residues but also the experimental information that led to their identification provides a more complete picture and allows results from different experimental conditions and/or laboratories to be compared. Our aim is to store and present proteomic data obtained as a set of (un)modified peptides determined by mass spectrometry (MS), preserving as much as possible the context in which the experiments were undertaken. Thus, to every set of MS data we assign a hierarchical data structure summarising its experimental information. This hierarchy comprises: (i) data source (i.e. a research group or project); (ii) experimental category (a label describing a set of experiments undertaken with a common aim); and (iii) individual experiments. For example, according to this data structure, two experiments undertaken by MitoCheck to determine the differential phosphorylation state of a protein along the cell cycle would receive the common labels "MitoCheck" (i) and "timing" (ii), and a specific label (iii), e.g. either "interphase" or "mitosis". Storing information in a form as close as possible to the raw data not only makes cross-comparisons with other experiments and sources possible but also allows changes in the data to be tracked and updated automatically, for example those that originate from variations in genome assemblies and gene builds over time. |
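The three-level labelling described above can be pictured with a small sketch. This is only an illustration of the hierarchy, not the database's actual schema; the class name `ExperimentLabel` and the helper `same_category` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentLabel:
    """Hypothetical record for the three-level hierarchy (not the real schema)."""
    source: str      # (i) data source, e.g. a research group or project
    category: str    # (ii) experimental category shared by related experiments
    experiment: str  # (iii) individual experiment


def same_category(a: ExperimentLabel, b: ExperimentLabel) -> bool:
    """True when two experiments share levels (i) and (ii) and can be compared side by side."""
    return (a.source, a.category) == (b.source, b.category)


# The cell-cycle example from the text: common labels, different specific labels.
interphase = ExperimentLabel("MitoCheck", "timing", "interphase")
mitosis = ExperimentLabel("MitoCheck", "timing", "mitosis")
```

Here `same_category(interphase, mitosis)` holds because the two experiments differ only at level (iii).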
|
3- How the MS peptides are handled
MS usually produces a number of peptide sequences that have to be assigned to proteins. A peptide can match more than one protein from the same gene if the gene has several splicing variants. A peptide can also match proteins from different genes. When the MS experiment provides high sequence coverage for each protein, the assignment can be unambiguous. However, when the sequence coverage is low, as is sometimes the case for high-throughput data, it can be difficult to distinguish between splicing variants and/or cross-matches. In the mtcPTM database, all the peptides from a single experiment are searched against Ensembl proteins, resulting in the assignment of each peptide to one or more proteins. The peptides are then grouped according to the matched proteins in order to find a minimal list of proteins that could explain all the matches (Nesvizhskii & Aebersold, 2005). After this, every matched protein is described as being part of the minimal list or not. We define a protein to be part of the minimal list if at least one of its peptides is unique or if it presents a unique combination of peptides that is not a subset of the peptide set of another protein. All matches, regardless of whether they are part of the minimal list or not, are stored in the database. Figure 1 depicts how peptide matches are dealt with. Gene ENSG00000171467, which has three possible transcripts, has been matched by a number of peptides in one experiment. Of the three transcripts, ENSP00000354964 is the one with the highest number of peptide matches, even though none of the peptides is unique to this protein. ENSP00000354964 is therefore considered to be part of the minimal list (peptides highlighted in red). However, the peptides could also be explained by the existence of the two transcripts that were not included in the minimal list.
Even though more information would be needed to confirm either scenario, the raw data are kept so that users can draw their own conclusions. At the bottom of Figure 1, a number of peptides matching proteins from other genes are also shown. Some of these proteins matched other peptides and were therefore part of the minimal list (in red), whereas others did not (in green), and their assignment could therefore be considered spurious. Clicking on each protein name redirects the user to a web page in which the peptides and identified modifications are highlighted, in more detail, on the individual sequences. |
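The membership rule for the minimal list can be sketched as follows. This is a literal reading of the definition given above, not the code used to build mtcPTM; the function name, the protein and peptide labels, and the treatment of proteins with identical peptide sets (counted here as not unique) are assumptions made for illustration.

```python
def minimal_list(matches):
    """Return the proteins that belong to the minimal list.

    matches: dict mapping a protein identifier to the set of peptides matching it.
    """
    # Count how many proteins each peptide matches.
    hits = {}
    for peptides in matches.values():
        for pep in peptides:
            hits[pep] = hits.get(pep, 0) + 1

    selected = set()
    for protein, peptides in matches.items():
        # Rule 1: at least one peptide matches this protein only.
        has_unique = any(hits[pep] == 1 for pep in peptides)
        # Rule 2: the peptide set is not contained in another protein's set.
        # Assumption: an identical set also counts as "contained".
        contained = any(
            other != protein and peptides <= other_peps
            for other, other_peps in matches.items()
        )
        if has_unique or not contained:
            selected.add(protein)
    return selected


# Hypothetical labels: protA has no unique peptide, yet its set is not a
# subset of any other, so it is kept; protB and protC are subsets of protA.
matches = {
    "protA": {"pep1", "pep2", "pep3"},
    "protB": {"pep1", "pep2"},
    "protC": {"pep3"},
    "protD": {"pep4"},
}
selected = minimal_list(matches)  # {"protA", "protD"}
```

As in the Figure 1 example, the peptides of protA could equally be explained by protB and protC together, which is why all matches are worth keeping alongside the minimal list.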
|
The Protein View provides a summary of the modifications stored in the mtcPTM database as well as general information about the number and type of sequence and structural domains present in the protein and the frequency of residues flanking the modified sites. |
|
4.1- Graphical comparison of experiments
The graphical display depicts all the matching peptides stored in the database, grouped by the individual experiments in which they were observed (Figure 2). First, a complete summary of all the modifications from all experiments is shown at the top. In this panel, modifications are colour-coded according to whether they were fully resolved by any experiment. Thus, confidently determined positions are shown in red, uncertain positions in orange, and positions that have been retrieved from the literature or other sources and are still awaiting confirmation in grey. The peptide maps for each experiment are then shown underneath, arranged so that related experiments are placed together for easy comparison. The colour coding is the same as before, except that grey is now used to highlight residues that are known to be modified but were not found in a given experiment. |
|
Further information about the experimental MS data of the peptides can be retrieved by clicking the lines representing the peptides. These peptide pages include details about the sequence of the peptide, experimental data (such as the protease and software used), whether the peptide is unique to a gene/protein, its position in the full-length protein and whether it differs from the Ensembl sequences (see examples of a perfect match, as well as peptides with insertions and conservative or non-conservative substitutions with respect to the Ensembl sequences). The peptide pages also complement the information provided in the Gene View (Figure 1), as the correct assignment of promiscuous peptides to a given protein/gene can be investigated further, for example by analysing which of the several matches to a peptide have flanking regions consistent with the digestion pattern expected for the protease used in the experiment. This is useful because the current version of the matching procedure described above does not penalise protein assignments in which the peptide flanks do not agree with the expected digestion pattern (see an example of two protein assignments for the same peptide, one in agreement and one in disagreement with the expected flanking residues). |
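A flank check of this kind can be sketched as below. The text does not name the protease, so trypsin is assumed here purely as an example, using the commonly quoted rule that it cleaves after K or R unless the next residue is P. The function name and the test sequence are hypothetical.

```python
def flanks_agree_with_trypsin(protein, start, end):
    """Check both ends of the peptide protein[start:end] (0-based, end exclusive).

    Assumed rule (trypsin): cleavage occurs C-terminal to K or R,
    except when the following residue is P. Protein termini always agree.
    """
    peptide = protein[start:end]
    # N-terminal side: the residue before the peptide must be a cleavage site.
    n_ok = start == 0 or (protein[start - 1] in "KR" and peptide[0] != "P")
    # C-terminal side: the last peptide residue must be a cleavage site.
    c_ok = end == len(protein) or (peptide[-1] in "KR" and protein[end] != "P")
    return n_ok and c_ok


# Hypothetical sequence used only to exercise the rule.
protein = "MAGKSTPEPRLLDKPA"
consistent = flanks_agree_with_trypsin(protein, 4, 10)     # "STPEPR": both flanks agree
inconsistent = flanks_agree_with_trypsin(protein, 10, 14)  # "LLDK" is followed by P
```

Applied to every protein a promiscuous peptide matches, a check like this separates assignments whose flanks fit the expected digestion pattern from those that do not.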
|
4.2- Structural modelling of modified sites
The database also contains structural models for fragments of proteins containing modified sites. These models have been built by homology modelling to PDB co-ordinates. The matching procedure is very stringent and only accepts matches with high sequence similarity (at least 30% sequence identity and without a large number of gaps in the pairwise alignment) that cover the whole length of the residues with atomic co-ordinates in the PDB entries. Only proteins that were part of the minimal list were considered for model building. The co-ordinates of the models are provided as RasMol scripts in which the secondary structure elements of the proteins are coloured progressively from the N-terminus to the C-terminus and the modified residues are represented as white spacefilled molecules (Figure 3). These RasMol scripts are provided as compressed files. Once downloaded and uncompressed, they can be launched in RasMol by typing "script filename" at the RasMol> command prompt. For further help with the display, the user can refer to the RasMol documentation and/or email us. |
|
Moreover, for each structural model, two sequence alignments are provided. The first is the pairwise alignment between the modelled Ensembl sequence and the PDB template. It is important to examine this alignment carefully to assess, for example, whether the side-chain conformations of the residues of interest are likely to be correctly modelled. This matters not only when the overall sequence identity to the template is low but also when the regions containing the modified residues were modelled from scratch because they fell into areas for which experimental density was not available in the crystals (see an example of the latter here). Here are examples of Ensembl sequences modelled from templates with relatively low sequence identity but in which the conformation of the modified residues could have been correctly modelled because the templates contain identical or similar residues at those positions. In any case, the structural information should be helpful in providing hints about the potential local effect of the modifications on the protein domains. The second alignment is a multiple alignment of sequences related to the modelled structure, and it provides evolutionary information about the level of conservation at each position of the model (see an example here). These alignments comprise sequences, or regions of sequences, that are at least 30% identical along their whole length to the sequence of the modelled domain. The alignments are non-redundant in the sense that groups of highly similar sequences (over 90% identity) were reduced to a single representative member. Only positions that are ungapped with respect to the modelled Ensembl sequence are shown. Further details about the automatic procedure used to generate these multiple alignments can be found elsewhere. |
|
4.3- Conservation of modified sites
The web site also provides full-length multiple alignments for each protein entry with respect to its orthologs and paralogs from various organisms. The alignments can be retrieved in either HTML or FASTA format. In the HTML version, the colouring is identical to that of the alignments built from structural domains (see an example here). In the main protein page, this information is summarised in a table that gives the degree of conservation of every modified residue. The degree of conservation is calculated in three different ways. The first column gives the percentage of residues identical to the modified residue in the query. The second column gives the percentage of residues similar to that in the query when the position involves either a serine or a threonine. The last column gives the percentage of phosphorylatable residues (i.e. any S, T or Y) found at the equivalent position in all sequences from the multiple alignment. The sequences used for each alignment were the UniProt entries matched by the protein query using BLAST, such that the match covered at least 70% of the query length and the lengths of query and subject did not differ by more than 30%. The full-length sequences of the selected matches were then aligned using CLUSTAL with default parameters, and those with less than 30% sequence identity to the initial query were removed. The whole process was automatic, and thus the alignments could perhaps be improved by manual editing in some instances. Also, the stringent parameters employed for sequence selection could have resulted in alignments in which some known homologs are missing. In addition to full-length alignments for every protein entry, the web site also provides combined alignments containing orthologs/paralogs with experimentally determined phosphorylations.
These alignments are coloured as before, with the difference that the only highlighted residues in the Ensembl entries (i.e. the reference sequences for which experimental data are available) are those known to be modified. These alignments make it possible not only to compare the experimental data available for human and mouse orthologs (see an example here) but also to investigate the patterns of conservation between paralogs, as shown here. The combined alignments were built by merging those initial full-length alignments that share any sequences. Only sequences with at least 40% sequence identity to the reference Ensembl entry were considered. |
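The three conservation values in the summary table can be illustrated for a single alignment column. This is a sketch under stated assumptions rather than the site's own calculation: the function name is hypothetical, gaps are counted as non-matching, the query itself is excluded from the column, and for a query residue other than S or T the second value falls back to plain identity, since the text only defines similarity for serine and threonine.

```python
def conservation(query_residue, column):
    """Return (identical %, similar %, phosphorylatable %) for one alignment column.

    column: residues found at the equivalent position in the other aligned
    sequences, one character each, with "-" for a gap.
    """
    n = len(column)
    if n == 0:
        return (0.0, 0.0, 0.0)
    identical = sum(res == query_residue for res in column)
    if query_residue in "ST":
        # Serine and threonine are treated as similar to each other.
        similar = sum(res in "ST" for res in column)
    else:
        # Assumption: no similarity class is defined for other residues.
        similar = identical
    phosphorylatable = sum(res in "STY" for res in column)
    return tuple(100.0 * count / n for count in (identical, similar, phosphorylatable))


# A modified serine aligned against four other sequences.
values = conservation("S", ["S", "T", "Y", "A"])  # (25.0, 50.0, 75.0)
```

The three numbers can only grow from left to right for a serine or threonine query, since each measure relaxes the previous one.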
|
Additionally, the web site provides a number of tables summarising the boundaries of each protein domain and the sequence fragments centred at the modified positions. In the latter, the information source is also indicated, as well as relevant publications, if any. Finally, the last picture at the bottom of the Protein View page provides a graphical summary of the amino-acid frequencies in the regions flanking the modified residues (Figure 4). This summary is made of two tables, one with the frequencies for each amino acid and a second one with the frequencies for amino-acid classes, such as positively charged residues. In some cases, these simple heat maps can provide hints about the characteristics of the preferred motifs around the modified positions and thus about which kinase could be involved (Schwartz & Gygi, 2005). On the other hand, weak heat maps could indicate a rather flexible recognition by a kinase or the presence of sites phosphorylated by kinases with different specificities. |
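The per-position frequencies behind such a heat map can be computed along these lines. The sketch assumes that all sequence fragments have the same length and are centred on the modified residue; the function name and the example fragments are hypothetical, and only the per-amino-acid table is shown (a class table would follow from mapping residues to classes first).

```python
from collections import Counter


def flank_frequencies(windows):
    """Return one {residue: frequency} dict per window position.

    windows: equal-length sequence fragments centred on the modified residue.
    """
    length = len(windows[0])
    table = []
    for position in range(length):
        counts = Counter(window[position] for window in windows)
        total = sum(counts.values())
        table.append({residue: count / total for residue, count in counts.items()})
    return table


# Hypothetical fragments of length 5; index 2 is the modified residue.
windows = ["RRSVA", "KRSLE", "RKTPG"]
table = flank_frequencies(windows)
# table[0] holds the frequencies two residues before the site (R twice, K once).
```

A position dominated by one residue, such as a basic residue upstream of the site, shows up as a strong cell in the heat map, whereas evenly spread frequencies give the weak maps mentioned above.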
|
The protein/gene of interest can be searched for in mtcPTM by its FASTA sequence, its Ensembl ID or keywords from its description. Alternatively, a direct link to the lists of all available experimental data sets is also provided. Before visiting the Protein View pages, it is recommended to inspect the summary of all peptide assignments by experiment in order to evaluate which proteins are more likely to be represented by the identified peptides. |
|
J.L. Jimenez, B. Hegemann, J.R.A. Hutchins, J.M. Peters and R. Durbin A systematic comparative and structural analysis of protein phosphorylation sites based on the mtcPTM database Genome Biology 2007 [under revision] [supplementary material] |