Documentation

The ProteoCast Server allows you to rapidly and easily explore how and to what extent missense mutations affect protein function. ProteoCast relies on GEMME, a highly efficient, unsupervised, and fully interpretable variant effect predictor (Laine et al., 2019). By focusing on how protein residues segregate along the topology of evolutionary trees, GEMME has proven instrumental for studying protein stability, function, and disease mechanisms in several studies -- see, for instance, Tsuboyama et al., 2023. It has been extensively tested against over 2.5M experimental measurements (Abakarova et al., 2023; Laine et al., 2019) and is among the top-performing methods on the widely adopted ProteinGym benchmark.

Would you like to know more or collaborate, or have you encountered a problem with the server? Please feel free to contact us!

If you would like to provide feedback, you are welcome to do so via this Google form.

Input

Please provide a multiple sequence alignment in FASTA or A3M format. The protein of interest should be the first sequence in the alignment and should not contain any gaps. We recommend generating the alignment with the highly efficient MMseqs2-based protocol implemented in ColabFold for an optimal balance between speed and accuracy. With this protocol, the input alignment should contain at least a couple hundred sequences to obtain reliable predictions (Abakarova et al., 2023).
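The input requirements above can be checked locally before uploading. The following is a minimal sketch, not part of ProteoCast: the function names `read_alignment` and `check_alignment` are hypothetical, and the default threshold of 200 sequences is only one reading of "a couple hundred".

```python
def read_alignment(text):
    """Parse FASTA/A3M text into a list of (header, sequence) pairs."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and '#' comment lines (some A3M files start with one).
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_alignment(records, min_sequences=200):
    """Return a list of human-readable problems (empty if none were found)."""
    if not records:
        return ["alignment is empty"]
    problems = []
    query = records[0][1]  # the protein of interest must come first
    if "-" in query or "." in query:
        problems.append("query (first sequence) contains gaps")
    if len(records) < min_sequences:  # assumed threshold, see lead-in
        problems.append(
            f"only {len(records)} sequences; at least {min_sequences} suggested"
        )
    return problems
```

An empty list returned by `check_alignment` means that none of these two checks failed; it does not guarantee that the server will accept the file.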

Optionally, you can map the results onto a 3D structure. The most straightforward ways to use this functionality are detailed below:

If the query sequence corresponds exactly to a UniProt entry, you can indicate the UniProt identifier and ProteoCast will automatically retrieve the corresponding 3D model from the AlphaFold Database.
If the query sequence does not match a UniProt entry, you may generate a 3D model along with the alignment using ColabFold and upload it directly.

Alternatively, you may provide an experimental structure from the Protein Data Bank, or a custom 3D model. In that case, please make sure that the sequence is exactly the same as the query. Any user-defined 3D structure or model should be in PDB format and contain only one protein chain.
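The two constraints on user-provided structures (a single protein chain, and a sequence identical to the query) can be verified beforehand. The sketch below is an illustration only: `chain_sequences` and `structure_matches_query` are hypothetical helpers, and the sequence is read from the CA atoms of ATOM records in the first model, which is an assumption about how the comparison is made. For experimental structures with unresolved residues, the sequence read this way will be shorter than the query.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def chain_sequences(pdb_text):
    """Map chain ID -> one-letter sequence, using CA atoms of ATOM records."""
    seqs, seen = {}, set()
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):  # only read the first model
            break
        if not line.startswith("ATOM") or line[12:16].strip() != "CA":
            continue
        chain = line[21]
        key = (chain, line[22:27])  # residue number + insertion code
        if key in seen:  # skip alternate locations of the same residue
            continue
        seen.add(key)
        # Non-standard residue names are reported as 'X'.
        seqs[chain] = seqs.get(chain, "") + THREE_TO_ONE.get(line[17:20], "X")
    return seqs


def structure_matches_query(pdb_text, query):
    """True if the file holds exactly one chain whose sequence equals the query."""
    seqs = chain_sequences(pdb_text)
    return len(seqs) == 1 and next(iter(seqs.values())) == query
```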

Output

Please check our example result page.

Mutational landscape

The predicted mutational landscape is displayed as two interactive heatmaps, where each square corresponds to a given amino acid substitution at a given position. The heatmap dimension is thus 20 by the length of the query protein sequence. The heatmap RAW SCORES contains the numerical estimates predicted by GEMME. The darker the color, the more negative the score, and thus the stronger the predicted effect. The heatmap VARIANT CLASSES indicates whether ProteoCast considers each mutation neutral (blue), uncertain (pink) or impactful (red). For pre-computed results on Drosophila melanogaster, we optionally provide a third heatmap, SNPs, with the raw scores in grey tones and known SNPs highlighted in color: blue for population polymorphisms and red for lethal mutations.

When you hover the mouse cursor over the representation, a contextual window specifies the mutation and gives some information about it (predicted raw score, class...). You may zoom in on a particular region of interest.

The horizontal bar at the bottom of the heatmap reflects the confidence in the predictions. Dark blue indicates reliable predictions, and white unreliable ones. We consider predictions as unreliable when evolutionary information derived from the input alignment is too scarce.

Input alignment quality assessment

This plot gives an overview of the input alignment, where each horizontal line depicts a sequence and its color indicates its similarity with the query. Gaps correspond to white interruptions. The black curve reports the percentage of sequences that have an amino acid (as opposed to a gap) at each position of the query sequence. This representation is identical to the one used in ColabFold and we adapted the code from there.
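The two quantities shown in this plot can be sketched as follows. This is not the ColabFold or ProteoCast code: the function names are hypothetical, the sequences are assumed to be already aligned to the query with equal lengths (A3M insertions removed), and measuring similarity as the fraction of query positions with an identical residue is an assumption.

```python
def coverage_per_position(aligned):
    """Percentage of sequences with a residue (not a gap) at each query position."""
    n = len(aligned)
    length = len(aligned[0])
    return [
        100.0 * sum(seq[i] != "-" for seq in aligned) / n
        for i in range(length)
    ]


def identity_to_query(aligned):
    """Fraction of query positions where each sequence carries the query residue."""
    query = aligned[0]  # the query is the first sequence of the alignment
    return [
        sum(a == b for a, b in zip(query, seq)) / len(query)
        for seq in aligned
    ]
```

For example, with `aligned = ["MKV", "M-V", "-KA"]`, the last position is covered by all three sequences, while the first two positions are each covered by two of them.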

Predicted score distributions for variant classification

This plot gives an overview of the predicted score distribution, fitted with a mixture of three Gaussians. To minimize biases, low-confidence predictions are excluded. At one end of the spectrum, mutations close to zero extending down to the median of the middle Gaussian are classified as neutral. At the other end, mutations with very negative scores are classified as impactful, until they are more likely to belong to the middle Gaussian than the leftmost one. Mutations falling in between are categorised as uncertain. The two vertical lines indicate the boundaries of the three classes.
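To make the classification rule concrete, here is a minimal sketch that derives the two boundaries from an already fitted three-component mixture. It rests on several assumptions not stated above: the components are given as hypothetical `(weight, mean, sd)` tuples, "more likely to belong to" is read as a comparison of weighted densities, the crossing point is searched by bisection between the two leftmost means, and scores exactly on a boundary go to the less severe class. The median of a Gaussian equals its mean.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a Gaussian with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def class_boundaries(components):
    """components: three (weight, mean, sd) tuples.

    Returns (impactful_below, neutral_above).
    """
    left, middle, _ = sorted(components, key=lambda c: c[1])

    def diff(x):
        # Positive where the leftmost component is the more likely one.
        return (left[0] * normal_pdf(x, left[1], left[2])
                - middle[0] * normal_pdf(x, middle[1], middle[2]))

    lo, hi = left[1], middle[1]
    if diff(lo) <= 0 or diff(hi) >= 0:
        raise ValueError("no crossing between the two leftmost means")
    for _ in range(100):  # bisection on the sign change of diff
        mid = (lo + hi) / 2
        if diff(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2, middle[1]


def classify(score, impactful_below, neutral_above):
    """Assign a score to one of the three classes."""
    if score >= neutral_above:
        return "neutral"
    if score < impactful_below:
        return "impactful"
    return "uncertain"
```

With three equally weighted components of unit standard deviation centred at -6, -3 and -0.5, the boundaries fall at -4.5 and -3.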

Segmented mutational sensitivity profile

This interactive plot shows the per-residue mutational sensitivity, defined as the average GEMME score over the 19 possible substitutions. The values are scaled between 0 and 1. Residues highly sensitive to mutations will have a value close to 1, and residues highly tolerant a value close to 0. The profile is segmented using the FPOP algorithm, which identifies changepoints in the signal through 'functional pruning'. Informally, each detected changepoint signifies a shift in the mean. This representation allows for emphasizing protein segments under stronger or weaker selective pressure than their surrounding background. Purple: the segment mean is higher than the two neighbouring segments. Red: the segment mean is higher than one neighbour and lower than the other one. A pLDDT track is added on top of the mutational sensitivity profile to improve interpretability -- for instance, toward the identification of putative binding and regulatory sites in unstructured regions. The four classes for pLDDT are those defined in the AlphaFold Database, orange: very low, yellow: low, light blue: medium, dark blue: high.
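The sensitivity profile and the segment colouring can be illustrated with a short sketch. It does not implement FPOP: the changepoints are taken as given. The names are hypothetical, the min-max scaling of the per-residue mean is an assumption about how values are brought between 0 and 1, and leaving the first and last segments uncoloured (they have a single neighbour) is also an assumption.

```python
def mutational_sensitivity(scores_by_position):
    """scores_by_position: one list of 19 substitution scores per residue.

    Scores are assumed to be <= 0, more negative meaning a stronger effect.
    """
    means = [sum(s) / len(s) for s in scores_by_position]
    lo, hi = min(means), max(means)
    if hi == lo:
        return [0.0] * len(means)
    # Most negative mean -> 1 (sensitive), least negative -> 0 (tolerant).
    return [(hi - m) / (hi - lo) for m in means]


def segment_colours(profile, changepoints):
    """changepoints: sorted start indices of every segment after the first.

    Returns a list of (start, end, mean, colour), end being exclusive.
    """
    bounds = [0] + list(changepoints) + [len(profile)]
    segs = [(a, b, sum(profile[a:b]) / (b - a))
            for a, b in zip(bounds, bounds[1:])]
    out = []
    for i, (a, b, m) in enumerate(segs):
        colour = None
        if 0 < i < len(segs) - 1:  # needs two neighbouring segments
            left, right = segs[i - 1][2], segs[i + 1][2]
            if m > left and m > right:
                colour = "purple"
            elif left < m < right or right < m < left:
                colour = "red"
        out.append((a, b, m, colour))
    return out
```

Segments whose mean is lower than both neighbours receive no colour here, since the description above only defines purple and red.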

Mapping of predictions on the 3D structure

This interactive Molstar plugin allows you to explore the localisation of any mutation of interest: whether it is on the surface of the protein or within the core, whether it is part of a well-defined secondary structure or an unstructured loop, etc. You can choose between different colour regimes that reflect the original B-factor values (pLDDT for AlphaFold models) or the per-residue mutational sensitivity. For improved clarity, we also provide a binary classification of residues into tolerant or sensitive to mutations. A residue is sensitive when more than half of its 19 possible substitutions are impactful. On each representation, you may visualise the segments located in unstructured regions (very low or low pLDDT) and whose sensitivity stands out against their surroundings. They are colored in purple or red, following the color scheme used for the segmented profile above. They are displayed using a cartoon-like representation in which the protein backbone is shown like a clay or putty structure. The size of the tube is proportional to the property being examined (pLDDT, mutational sensitivity or residue class). In the figure below, we can clearly see that the segment containing S146 has higher mutational sensitivity than its surroundings.
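The binary residue classification described above amounts to a majority rule over the substitutions at each position. A minimal sketch, with a hypothetical function name and class labels written as plain strings, might look like this; how low-confidence positions are treated is not covered.

```python
def sensitive_residues(classes_by_position):
    """classes_by_position: for each residue, the class labels of its substitutions.

    Returns one boolean per residue: True if more than half are 'impactful'.
    """
    return [
        sum(label == "impactful" for label in labels) > len(labels) / 2
        for labels in classes_by_position
    ]
```

With 19 substitutions per residue, a position therefore needs at least 10 impactful substitutions to be flagged as sensitive.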

Browser compatibility

The server's JavaScript code uses ECMAScript 2015 (ES6) features; therefore, it needs Chrome 58+, Edge 15+, Firefox 54+, Safari 10+, or Opera 55+. We have tested this server using the following browsers:

OS	Version	Chrome	Firefox	Microsoft Edge	Safari
Linux	Ubuntu 22.04	131.0.6778.139	133.0.3	n/a	n/a
MacOS	Mojave	87.0.4280.88	83.0	n/a	14.0
Windows	Windows 10 Home	87.0.4280.88	84.0	87.0.664.60	n/a