List of protein prediction tools

Protein structure prediction is the inference of the three-dimensional structure of a protein from its amino acid sequence, that is, the prediction of its secondary and tertiary structure from its primary structure. Structure prediction is different from the inverse problem of protein design. It is one of the most important goals pursued by computational biology, and it is important in medicine and biotechnology. Structural alignment attempts to establish homology between two or more polymer structures based on their shape and three-dimensional conformation.

This process is usually applied to protein tertiary structures but can also be used for large RNA molecules. In contrast to simple structural superposition, where at least some equivalent residues of the two structures are known, structural alignment requires no a priori knowledge of equivalent positions. Structural alignment is a valuable tool for the comparison of proteins with low sequence similarity, where evolutionary relationships between proteins cannot be easily detected by standard sequence alignment techniques.

Structural alignment can therefore be used to imply evolutionary relationships between proteins that share very little common sequence. However, the results should be treated with caution as evidence for shared evolutionary ancestry, because of the possible confounding effects of convergent evolution, by which multiple unrelated amino acid sequences converge on a common tertiary structure.

Structural bioinformatics is the branch of bioinformatics that is related to the analysis and prediction of the three-dimensional structure of biological macromolecules such as proteins, RNA, and DNA. The term structural has the same meaning as in structural biology, and structural bioinformatics can be seen as a part of computational structural biology. The main objective of structural bioinformatics is the creation of new methods of analysing and manipulating biological macromolecular data in order to solve problems in biology and generate new knowledge.

BioJava is an open-source software project dedicated to providing Java tools for processing biological data. BioJava supports a wide range of data, from DNA and protein sequences up to the level of 3D protein structures.

The BioJava libraries are useful for automating many daily and mundane bioinformatics tasks, such as parsing a Protein Data Bank (PDB) file or interacting with Jmol. This application programming interface (API) provides various file parsers, data models and algorithms to facilitate working with the standard data formats, and it enables rapid application development and analysis.

Biomolecular structure is the intricate folded, three-dimensional shape that is formed by a molecule of protein, DNA, or RNA, and that is important to its function. The structure of these molecules may be considered at any of several length scales ranging from the level of individual atoms to the relationships among entire protein subunits. This useful distinction among scales is often expressed as a decomposition of molecular structure into four levels: primary, secondary, tertiary, and quaternary.

The scaffold for this multiscale organization of the molecule arises at the secondary level, where the fundamental structural elements are the molecule's various hydrogen bonds. This leads to several recognizable domains of protein structure and nucleic acid structure, including such secondary-structure features as alpha helices and beta sheets for proteins, and hairpin loops, bulges, and internal loops for nucleic acids. Nucleic acid structure prediction is a computational method to determine secondary and tertiary nucleic acid structure from its sequence.

Secondary structure can be predicted from one or several nucleic acid sequences. Tertiary structure can be predicted from the sequence, or by comparative modeling.

This is a list of computer programs that are predominantly used for molecular mechanics calculations. Nucleic acid design is the process of generating a set of nucleic acid base sequences that will associate into a desired conformation. It is necessary because there are many possible sequences of nucleic acid strands that will fold into a given secondary structure, but many of these sequences will have undesired additional interactions which must be avoided.

In addition, there are many tertiary structure considerations which affect the choice of a secondary structure for a given design. Nucleic acid secondary structure is the basepairing interactions within a single nucleic acid polymer or between two polymers.


N-glycosylation and O-glycosylation are the two major types of glycosylation and have important roles in maintaining protein conformation and activity. Glycosylation plays a major role in many important biological processes, such as cell adhesion, cell-cell and cell-matrix interactions, molecular trafficking, receptor activation, protein solubility, protein folding, signal transduction, protein degradation, and intracellular protein trafficking and secretion (9–).

SUMOylation takes place via SUMO (83), which has a three-dimensional structure similar to that of ubiquitin and has been discovered in a wide range of eukaryotic organisms. SUMOylation can occur on lysine residues in both the cytoplasm and the nucleus. The SUMO family has three isoforms in mammals, four isoforms in humans, two isoforms in yeasts and eight isoforms in plants (1).

In this reaction, SUMO is covalently linked to a lysine residue in the substrate protein via three enzymes, namely the activating (E1), conjugating (E2) and ligase (E3) enzymes.

SUMOylation plays a major role in many basic cellular processes, such as transcription control, chromatin organization, accumulation of macromolecules in cells, regulation of gene expression and signal transduction (89). It is also necessary for the conservation of genome integrity.

An important class of PTMs, called lipidation, involves the covalent attachment of lipids to proteins.

The first report of the covalent modification of proteins with lipids dates back to [...]. These PTMs take place via a great variety of lipids, such as octanoic acid, myristic acid, palmitic acid, palmitoleic acid, stearic acid and cholesterol. Myristoylation, palmitoylation and prenylation can be considered the three main types of lipid modification (95). Palmitoylation is described in this subsection, and the other two are described in the subsequent subsections.

Palmitoylation is the covalent attachment of fatty acids, such as palmitic acid, to Cys, Gly, Ser, Thr and Lys residues (6). S-palmitoylation is the reversible covalent addition of a 16-carbon fatty acid chain, palmitate, to a cysteine via a thioester linkage (Figure 3H). Palmitoyl-CoA, the lipid substrate, is attached to the target protein by a palmitoyl acyltransferase (PAT) and removed by acyl protein thioesterases. S-palmitoylation occurs mostly in eukaryotic cells and plays critical roles in many biological processes, including regulation of protein function, protein-protein interaction, membrane-protein association, neuronal development, signal transduction, apoptosis and mitosis (98–).

N-myristoylation was discovered by Alastair Aitken in bovine brain. Although myristoylation is often referred to as a PTM, it usually occurs co-translationally. This modification is irreversible and occurs mainly on cytoplasmic eukaryotic proteins.

Myristoylation has been reported in some integral membrane proteins as well. Myristoylation occurs in approximately 0.[...]. In myristoylation, after removal of the initiating Met, a 14-carbon saturated fatty acid, called myristic acid, is attached to the N-terminal glycine residue via a covalent bond (Figure 3I). Myristoylation occurs more frequently on Gly and less frequently on Lys residues (6).

Proteins that undergo this PTM play critical roles in regulating cellular structure and many biological processes, such as stabilizing protein structure, maturation, signaling, extracellular communication, metabolism and regulation of the catalytic activity of enzymes.

The first study on prenylation was done by Yuji Kamiya et al.

Prenylation is another important lipid-based PTM, which occurs after translation as an irreversible covalent linkage, mainly in the cytosol. This reaction occurs on cysteine residues near the carboxyl-terminal end of the substrate protein. Prenylation has two main forms: farnesylation and geranylgeranylation. These two forms involve the addition of two different types of isoprenoids to cysteine residues: farnesyl pyrophosphate (15-carbon) and geranylgeranyl pyrophosphate (20-carbon), respectively.

In prenylated proteins, one can find a consensus motif at the C-terminus: CAAX, where C is cysteine, A is an aliphatic amino acid and X is any amino acid. Prenylation is known as a crucial physiological process that facilitates many cellular processes, such as protein-protein interactions, endocytosis regulation, cell growth, differentiation, proliferation and protein trafficking. Observations have shown that disruption of this modification plays crucial roles in the pathogenesis of cancer, cardiovascular and cerebrovascular disorders, bone diseases, progeria, metabolic diseases and neurodegenerative diseases.

Sulfation was first discovered by Bruno Bettelheim in bovine fibrinopeptide B. Residues Tyr, Cys, and Ser have been identified as target residues for prenylated proteins (6).
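The CAAX consensus motif of prenylated proteins can be checked with a simple pattern match. The following is a minimal sketch rather than a prenylation predictor: the function name is hypothetical, the example sequences are made up, and "aliphatic" is assumed here to mean A, V, I or L (other definitions exist).

```python
import re

# Assumption: the aliphatic positions accept only A, V, I or L.
# C, then two aliphatic residues, then any residue, anchored at the C-terminus.
CAAX_PATTERN = re.compile(r"C[AVIL]{2}[A-Z]$")


def has_caax_motif(sequence: str) -> bool:
    """Return True if the last four residues of the sequence match C-A-A-X."""
    return CAAX_PATTERN.search(sequence.upper()) is not None


# Made-up toy sequences, for illustration only.
print(has_caax_motif("MKTACVLS"))   # C-terminal CVLS
print(has_caax_motif("MKTACVLSA"))  # motif is not at the C-terminus
```

A match only indicates that the consensus motif is present; it says nothing about whether the protein is actually prenylated.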

Often, the target residue of this PTM is tyrosine, and the modification happens in the trans-Golgi network. N-sulfation or O-sulfation involves the addition of a negatively charged sulfate group, via nitrogen or oxygen, to an exposed tyrosine residue on the target protein. Currently, protein tyrosine sulfation (PTS) is observed mainly in secreted and transmembrane proteins in multicellular eukaryotes and has not yet been observed in nuclear and cytoplasmic proteins. Tyrosylprotein sulfotransferases (TPSTs) govern the transfer of an activated sulfate from 3'-phosphoadenosine 5'-phosphosulfate to tyrosine residues within acidic motifs of polypeptides (Figure 3K). Recently, it has been observed that PTS has vital roles in many biological processes, such as protein-protein interactions, leukocyte rolling on endothelial cells, visual functions and viral entry into cells.

PTMs have a vital role in almost all biological processes and fine-tune numerous molecular functions.

Therefore, the footprints of disruption in PTMs can be seen in many diseases. The network linking PTMs to diseases and biological processes contains 97 diseases and biological processes.

Figure: Involvement of PTMs in diseases and biological processes.

Besides, one can see that cancer is also one of the most affected diseases. Consistent with this observation, the biological processes related to cancer are among the high-degree nodes (signaling, DNA repair, control of replication and apoptosis). Processes related to apoptosis, protein-protein interaction, signaling, cell cycle control, chromatin assembly, organization and stability, DNA repair, protein degradation, protein trafficking and targeting, regulation of gene expression and transcription control are the other high-degree biological processes.

Moreover, ubiquitylation, prenylation, glycosylation, S-palmitoylation and SUMOylation are the PTMs most involved in diseases.

On the other hand, the PTMs with the highest number of interactions with biological processes are phosphorylation, ubiquitylation, methylation, acetylation and SUMOylation. Taken together, we can conclude that disruption of the pathways of these five PTMs has a great impact on the normal functioning of the cell and, as a result, on the organism.

Owing to the considerable cost and difficulty of experimental methods for identifying PTMs, many computational methods have recently been developed for predicting PTMs. Almost all of these methods need a set of experimentally validated PTMs to build a prediction model.

Therefore, the availability of valid public databases of PTMs is the first step toward this end. A variety of such public databases can easily be used by the scientific community for developing computational methods (17). According to the scope and diversity of the covered PTMs, these databanks can be classified into two main groups: general databases and specific databases.

The general databases contain different types of PTMs, regardless of target residue and organisms. These databases provide a broad scope of information for various PTMs. The current public PTM databases are greatly different in the number of stored modified proteins, the number of modified sites and the number of covered PTM types. Figure 5 shows a bubble chart of main PTM databases according to these three parameters.

As is evident from the figure, owing to the extensive number of studies on phosphorylation, the specific databases are mainly focused on phosphorylation. From this point of view, glycosylation is the second most studied PTM. In the following, the five largest databases are described briefly. Table 1 summarizes the current main public PTM databases.

Figure 5. Bubble chart for PTM databases. The chart was drawn based on three parameters for the databases: the number of stored modified proteins, the number of modified sites and the number of covered PTM types.

This database is the largest in terms of both the number of recorded proteins and the number of stored PTM types (Figure 5). However, most of its data are extracted from human, mouse and rat.

Generally speaking, any computational method for predicting a specific type of PTM has four main steps: data gathering, feature extraction, learning the predictor and performance assessment.

These steps are shown schematically in Figure 6. In the following, each step is described in detail, together with its related challenges and problems.

Figure 6. A schematic flowchart of how a predictor for PTM prediction works. (A) Data collection and dataset creation. (B) Feature selection. (C) Creating training and testing models. (D) Evaluation of the performance of the models.

The first step of a PTM prediction method is gathering the data of proteins that undergo the PTM of interest, in order to assemble a valid dataset (Figure 6A). The final dataset must include both positive samples (polypeptide sequences with a target residue that has undergone the PTM) and negative samples (polypeptide sequences with a target residue that has not been affected by the PTM), so that a machine learning algorithm can be trained to predict PTMs.

Positive data selection: almost all studies use the aforementioned databases, such as dbPTM or UniProt, to gather the positive samples. Negative data selection: selecting the negative dataset is the most challenging part of the data gathering step.

There are three main strategies for selecting the negative dataset. In the first strategy, a random set of proteins, equal in number to the positive set, is selected. Then, those occurrences of the target residue that did not undergo the PTM are considered as the negative samples.

The second strategy works like the first, but the negative dataset is constructed only from proteins in which, based on experimental evidence, none of the target residues has undergone that specific PTM.

The third strategy examines only the proteins that are included in the positive dataset. In this case, those occurrences of the target residue that have not undergone PTM are considered as the negative samples.
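The third strategy can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the 0-based position convention and the toy sequence are hypothetical, and real pipelines would read the annotated positions from a database such as dbPTM or UniProt.

```python
def negative_sites(sequence, positive_positions, target_residue):
    """Return 0-based positions of the target residue that are NOT annotated as modified.

    sequence           -- protein sequence from the positive dataset
    positive_positions -- 0-based positions with experimentally validated PTMs
    target_residue     -- one-letter code of the residue the PTM acts on
    """
    positives = set(positive_positions)
    return [
        i for i, residue in enumerate(sequence)
        if residue == target_residue and i not in positives
    ]


# Toy example: lysine at position 1 is annotated as modified,
# so the remaining lysines become negative samples.
print(negative_sites("MKSAKTKK", [1], "K"))
```

Sites obtained this way are only presumed negatives: a residue without an annotation may simply not have been studied yet.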

This step varies from study to study. CD-HIT is the main tool used to detect similar sample sequences. Regardless of the strategy used for negative data selection, in almost all cases the filtered datasets are imbalanced: the negative dataset is larger than the positive dataset to varying extents, sometimes by orders of magnitude.
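Imbalance of this kind is commonly handled by balancing the dataset before training. As a minimal sketch, assuming random undersampling of the negative set (one option among several; the text does not prescribe a particular method), with a hypothetical function name and toy data:

```python
import random


def undersample(positives, negatives, seed=0):
    """Randomly reduce the negative set to the size of the positive set.

    Assumption: plain random undersampling; studies may use other balancing schemes.
    """
    positives = list(positives)
    negatives = list(negatives)
    if len(negatives) <= len(positives):
        return positives, negatives
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    return positives, rng.sample(negatives, len(positives))


pos, neg = undersample(["pep1", "pep2"], ["n1", "n2", "n3", "n4", "n5"])
print(len(pos), len(neg))
```

Undersampling discards data, so some studies repeat it with several seeds and average the results.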

Because imbalanced datasets can introduce biases in the learning phase when a very specialized learning method is not used (which is usually the case), a dataset balancing step is required prior to feature extraction and learning a classification model.

In the feature extraction step, the positive and negative samples (protein sequences) are encoded, according to various biological properties, into numerical feature vectors that are used to learn the final predictor (classifier).

For this encoding, a sliding window is first used to partition all proteins into polypeptides of length W, such that the target residue of the PTM of interest is placed at the center of each polypeptide (Figure 6B). There is no agreement on the size of W, and various sizes have been used in different studies. Roughly speaking, W varies from 11 to [...]. Some studies select an optimized size for W through a trial-and-error approach. Finally, according to appropriate biological descriptors, such as amino acid composition, di-peptide composition, similarity score to known motifs and physicochemical properties, each polypeptide of length W is encoded as a numerical feature vector.
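The windowing and encoding described above can be illustrated as follows. This is a sketch under assumptions, not the procedure of any particular study: the function names are hypothetical, padding with "X" at the sequence ends is an assumed convention, and amino acid composition is used as the single descriptor.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def extract_window(sequence, position, w, pad="X"):
    """Return the length-w peptide centred on the 0-based target position.

    Assumption: w is odd and windows that overrun the ends are padded with `pad`.
    """
    if w % 2 == 0:
        raise ValueError("w must be odd so that the target residue is centred")
    half = w // 2
    padded = pad * half + sequence + pad * half
    return padded[position:position + w]


def amino_acid_composition(peptide):
    """Encode a peptide as 20 relative frequencies, ignoring padding characters."""
    counts = [peptide.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


window = extract_window("MKSAKTKK", position=4, w=5)
print(window)
print(amino_acid_composition(window))
```

Other descriptors mentioned in the text, such as di-peptide composition or physicochemical properties, would be appended to the same feature vector.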

After feature extraction, the data are ready for training a classifier that predicts the PTM for a given protein of interest (Figure 6C). A variety of classifiers can be trained. At this step, a suitable classifier is selected based on the performance of different classifiers and the knowledge of the experts involved in the study. After parameter optimization, the classifier is trained on a subset of the assembled dataset (called the training dataset), and the predictor is then ready to be assessed and compared with the current state-of-the-art methods.

In some studies, an additional process, named feature selection, is carried out prior to building the final predictor.

A standard and widely used procedure for assessing the performance of a given classifier is k-fold cross validation (Figure 6D).

In this process, the available dataset is randomly partitioned into k equal-sized disjoint subsets. In each round, one subset is held out as the test set and the classifier is trained on the remaining k-1 subsets. This process is repeated k times in such a way that every subset is used exactly once as the test set. Finally, the average performance over all k test sets is reported.
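The partitioning step of k-fold cross validation can be sketched with the standard library alone. The function name is hypothetical, and when the number of samples is not divisible by k the folds differ in size by at most one sample.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # random partition, reproducible via seed
    folds = [indices[i::k] for i in range(k)]  # k disjoint subsets
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_indices(n_samples=10, k=5):
    print(sorted(test), len(train))
```

Each sample index appears in exactly one test set, which is what allows the k test scores to be averaged into a single estimate.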

The most common values for k in PTM prediction studies are 5 and 10. Although some studies have used large values of k, large values lead to less accurate estimates of the generalization power of the classifier and of the test error rate. The performance measures can all be calculated from the four basic elements of the confusion matrix (Table 2). For the definitions of these performance measures, refer to the cited references.
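As an illustration of how such measures follow from the four confusion-matrix elements, the sketch below computes sensitivity, specificity, accuracy and the Matthews correlation coefficient (MCC). The choice of these four measures is an assumption made for the example, since the original list of measures is not reproduced in the text; the function name and counts are hypothetical.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Compute common performance measures from confusion-matrix counts."""
    total = tp + tn + fp + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,  # true positive rate
        "specificity": tn / (tn + fp) if tn + fp else 0.0,  # true negative rate
        "accuracy": (tp + tn) / total if total else 0.0,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


print(confusion_metrics(tp=40, tn=45, fp=5, fn=10))
```

On imbalanced test sets accuracy alone can be misleading, which is one reason MCC is often reported alongside it.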

In addition to the aforementioned measures, the ROC curve and the area under the ROC curve are also two major performance evaluation measures.

There are some important flaws in performance comparison based on k-fold cross validation, which can lead to biased conclusions. As mentioned above, the data are randomly partitioned into k distinct folds (subsets) in a k-fold CV procedure.

Therefore, the results of two methods are comparable only if the training and test data of all k folds are identical for both methods. However, many studies compare their k-fold CV results without satisfying this condition. Another common flaw is using the same data for parameter tuning and feature selection as for performance evaluation. In such a situation, the performance of the predictor is overestimated, and the classifier will perform poorly on unseen samples.
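One way to satisfy the comparability condition is to generate the folds once and reuse them for every method. The sketch below assumes each method is wrapped in a function that trains on the training part of a fold and returns a score on the test part; all names and the two toy "methods" are hypothetical.

```python
import random


def make_folds(n_samples, k, seed=0):
    """Create k disjoint folds once, so every method sees the same partition."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def compare_on_same_folds(methods, samples, labels, folds):
    """Return the mean test score of each method over identical folds."""
    results = {}
    for name, evaluate in methods.items():
        scores = []
        for i, test_idx in enumerate(folds):
            train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
            scores.append(evaluate(
                [samples[t] for t in train_idx], [labels[t] for t in train_idx],
                [samples[t] for t in test_idx], [labels[t] for t in test_idx],
            ))
        results[name] = sum(scores) / len(scores)
    return results


# Toy stand-ins for real predictors: each returns accuracy on the test fold.
def always_positive(train_x, train_y, test_x, test_y):
    return sum(1 for y in test_y if y == 1) / len(test_y)


def always_negative(train_x, train_y, test_x, test_y):
    return sum(1 for y in test_y if y == 0) / len(test_y)


samples = list(range(10))
labels = [1, 0] * 5
folds = make_folds(len(samples), k=5, seed=42)
results = compare_on_same_folds(
    {"always_positive": always_positive, "always_negative": always_negative},
    samples, labels, folds,
)
print(results)
```

The second flaw is avoided in the same spirit: parameter tuning and feature selection should use only the training part of each fold, never the held-out test part.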

When enough data are available for a PTM, which is usually the case except for newly discovered PTMs, some studies carry out an independent test experiment. In this experiment, a dataset of positive and negative samples that has not been used in any of the previous steps is assembled (or a benchmark dataset is used) as independent test data, and the performance of the classifier is evaluated again on this dataset.

Usually, the performance on an independent test set is lower than that of k-fold CV and is a better estimate of the real-world performance of a method. To show the strength of the proposed methods on real-world biological problems, some studies apply their trained models to a set of biologically important proteins that have recently been studied, to indicate that their method can effectively detect newly reported and experimentally validated PTMs.

Considering the high cost of experimental identification of PTMs, many computational methods have been proposed in recent years for the prediction of PTMs. Many of these methods have been introduced as publicly accessible tools. Figure 7 provides a comprehensive list of these tools. In addition to the PTM prediction tools, Nickchi et al. introduced PEIMAN for enrichment analysis of PTMs. PEIMAN can take two distinct lists of proteins, integrate the enrichment results and provide a list of terms that are highly enriched in both protein sets.

Figure 7. Online PTM prediction tools.

PTMs are chemical modifications of a protein after its translation and have a wide range of effects on the function and structure of the target proteins.

These processes occur on almost all proteins, and many domains within proteins are modified on multiple amino acids by diverse modifications. The function of a modified protein is often strongly affected by these modifications that play important roles in a myriad of cellular processes. There is strong evidence that shows that disruptions in PTMs can lead to various diseases. Hence, increased knowledge about the potential PTMs of a target protein may increase our understanding of the molecular processes in which it takes part.

High-throughput experimental methods for the discovery of PTMs are very labor-intensive and time-consuming. Thus, there is an urgent need for prediction methods and powerful tools to predict PTMs. There is a considerable amount of PTM data available from various publicly accessible databanks, which are valuable resources for mining patterns to train new models for PTM prediction.

In recent years, many computational methods have been developed for this purpose. However, there are some common weaknesses in assessing these methods, and so it seems that such methods should be evaluated more critically.

Considering the diversity of PTMs and the new PTMs that are reported every couple of years on the one hand, and the advancement of machine learning algorithms on the other, we can conclude that this field will attract more attention in the future.

The authors would like to thank Mohammad Hossein Afsharinia for his help with preparing the graphics and Saber Mohammadi for his help with editing the manuscript. The authors also thank the anonymous reviewers for their very constructive comments.

Shahin Ramazi and Javad Zahiri (Tehran, Iran). Database (Oxford). Published online Apr 7. Published by Oxford University Press.

Abstract: Posttranslational modifications (PTMs) refer to amino acid side chain modification in some proteins after their biosynthesis.


