Article
https://doi.org/10.1038/s41467-022-34032-y

Protein language models trained on multiple sequence alignments learn phylogenetic relationships

Umberto Lupo 1,2, Damiano Sgarbossa 1,2 & Anne-Florence Bitbol 1,2

1 Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland. 2 SIB Swiss Institute of Bioinformatics, CH-1015 Lausanne, Switzerland. e-mail: umberto.lupo@epfl.ch; anne-florence.bitbol@epfl.ch

Received: 8 April 2022. Accepted: 7 October 2022.

Self-supervised neural language models with attention have recently been applied to biological sequence data, advancing structure, function and mutational effect prediction. Some protein language models, including MSA Transformer and AlphaFold's EvoFormer, take multiple sequence alignments (MSAs) of evolutionarily related proteins as inputs. Simple combinations of MSA Transformer's row attentions have led to state-of-the-art unsupervised structural contact prediction. We demonstrate that similarly simple, and universal, combinations of MSA Transformer's column attentions strongly correlate with Hamming distances between sequences in MSAs. Therefore, MSA-based language models encode detailed phylogenetic relationships. We further show that these models can separate coevolutionary signals encoding functional and structural constraints from phylogenetic correlations reflecting historical contingency. To assess this, we generate synthetic MSAs, either without or with phylogeny, from Potts models trained on natural MSAs. We find that unsupervised contact prediction is substantially more resilient to phylogenetic noise when using MSA Transformer versus inferred Potts models.

The explosion of available biological sequence data has led to multiple computational approaches aiming to infer three-dimensional structure, biological function, fitness, and evolutionary history of proteins from sequence data1,2. Recently, self-supervised deep learning models based on natural language processing methods, especially attention3 and transformers4, have been trained on large ensembles of protein sequences by means of the masked language modeling objective of filling in masked amino acids in a sequence, given the surrounding ones5–10. These models, which capture long-range dependencies, learn rich representations of protein sequences, and can be employed for multiple tasks. In particular, they can predict structural contacts from single sequences in an unsupervised way7, presumably by transferring knowledge from their large training set11. Neural network architectures based on attention are also employed in the Evoformer blocks in AlphaFold12, as well as in RoseTTAFold13 and RGN214, and they contributed to the recent breakthrough in the supervised prediction of protein structure.

Protein sequences can be classified into families of homologous proteins, which descend from an ancestral protein and share a similar structure and function. Analyzing multiple sequence alignments (MSAs) of homologous proteins thus provides substantial information about functional and structural constraints. The statistics of MSA columns, representing amino-acid sites, allow the identification of functional residues that are conserved during evolution, and correlations of amino-acid usage between columns contain key information about functional sectors and structural contacts15–18. Indeed, through the course of evolution, contacting amino acids need to maintain their physico-chemical complementarity, which leads to correlated amino-acid usages at these sites: this is known as coevolution.
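As a concrete illustration of these column statistics, the short sketch below computes single-site amino-acid frequencies and connected two-site correlations from a toy alignment. The alignment, alphabet ordering and helper names are illustrative choices and are not taken from the paper.

```python
# Illustrative sketch (not from the paper): empirical column statistics of an MSA.
# Single-site frequencies measure conservation; connected two-site correlations
# measure covariation between columns (the coevolution signal discussed above).
import numpy as np

AA = "-ACDEFGHIKLMNPQRSTVWY"              # gap character plus the 20 amino acids
AA_INDEX = {a: k for k, a in enumerate(AA)}

def one_hot(msa_rows):
    """Encode an MSA (list of equal-length strings) as an M x L x q array."""
    M, L, q = len(msa_rows), len(msa_rows[0]), len(AA)
    X = np.zeros((M, L, q))
    for i, seq in enumerate(msa_rows):
        for j, a in enumerate(seq):
            X[i, j, AA_INDEX.get(a, 0)] = 1.0
    return X

def column_statistics(msa_rows):
    """One-body frequencies f_j(a) and connected correlations
    C_jk(a, b) = f_jk(a, b) - f_j(a) f_k(b)."""
    X = one_hot(msa_rows)                          # (M, L, q)
    M = X.shape[0]
    f1 = X.mean(axis=0)                            # (L, q)
    f2 = np.einsum("mja,mkb->jkab", X, X) / M      # (L, L, q, q)
    return f1, f2 - np.einsum("ja,kb->jkab", f1, f1)

# Toy usage: a Frobenius norm over amino-acid pairs gives a simple
# covariation score for each pair of columns (j, k).
toy_msa = ["ACDE-", "ACDE-", "AKDEF", "AKDQF"]
f1, C = column_statistics(toy_msa)
pair_scores = np.linalg.norm(C, axis=(2, 3))
```

Large entries of `pair_scores` flag column pairs with correlated amino-acid usage, the raw signal that coevolution-based contact prediction exploits.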
Potts models, also known as Direct Coupling Analysis (DCA), are pairwise maximum entropy models trained to match the empirical one- and two-body frequencies of amino acids observed in the columns of an MSA of homologous proteins2,19–26. They capture the coevolution of contacting amino acids, and provided state-of-the-art unsupervised predictions of structural contacts before the advent of protein language models. Note that coevolutionary signal also aids supervised contact prediction27.

While most protein language neural networks take individual amino-acid sequences as inputs, some others have been trained to perform inference from MSAs of evolutionarily related sequences. This second class of networks includes MSA Transformer28 and the Evoformer blocks in AlphaFold12, both of which interleave row (i.e. per-sequence) attention with column (i.e. per-site) attention. Such an architecture is conceptually extremely attractive because it can incorporate coevolution in the framework of deep learning models using attention. In the case of MSA Transformer, simple combinations of the model's row attention heads have led to state-of-the-art unsupervised structural contact prediction, outperforming both language models trained on individual sequences and Potts models28. Beyond structure prediction, MSA Transformer is also able to predict mutational effects29,30 and to capture fitness landscapes31. In addition to coevolutionary signal caused by structural and functional constraints, MSAs feature correlations that directly stem from the common ancestry of homologous proteins, i.e. from phylogeny. Does MSA Transformer learn to identify phylogenetic relationships between sequences, which are a key aspect of the MSA data structure?

Here, we show that simple, and universal, combinations of MSA Transformer's column attention heads, computed on a given MSA, strongly correlate with the Hamming distances between sequences in that MSA. This demonstrates that MSA Transformer encodes detailed phylogenetic relationships. Is MSA Transformer able to separate coevolutionary signals encoding functional and structural constraints from phylogenetic correlations arising from historical contingency? To address this question, we generate controlled synthetic MSAs from Potts models trained on natural MSAs, either without or with phylogeny. For this, we perform Metropolis Monte Carlo sampling under the Potts Hamiltonians, either at equilibrium or along phylogenetic trees inferred from the natural MSAs. Using the top Potts model couplings as proxies for structural contacts, we demonstrate that unsupervised contact prediction via MSA Transformer is substantially more resilient to phylogenetic noise than contact prediction using inferred Potts models.
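The equilibrium ("without phylogeny") case of this procedure amounts to standard single-site Metropolis Monte Carlo sampling under a Potts Hamiltonian. The self-contained sketch below is a minimal illustration with random placeholder fields and couplings; it is not the authors' code, and the sizes, temperature and step counts are arbitrary choices, whereas the paper uses Potts models inferred from natural MSAs.

```python
# Hedged sketch: equilibrium Metropolis sampling from a Potts model with
# energy E(s) = -sum_j h_j(s_j) - sum_{j<k} J_jk(s_j, s_k).
# The fields h and couplings J below are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
L, q = 50, 21                                  # alignment length, alphabet size (20 aa + gap)
h = rng.normal(scale=0.1, size=(L, q))         # placeholder fields
J = rng.normal(scale=0.05, size=(L, L, q, q))  # placeholder couplings
J = (J + J.transpose(1, 0, 3, 2)) / 2          # enforce J_jk(a, b) = J_kj(b, a)

def delta_energy(seq, j, new_a):
    """Energy change when site j of 'seq' is mutated to state new_a."""
    old_a = seq[j]
    ks = np.arange(L)
    # couplings of site j with every other site (the k = j self term is removed)
    new_coupl = J[j, ks, new_a, seq].sum() - J[j, j, new_a, old_a]
    old_coupl = J[j, ks, old_a, seq].sum() - J[j, j, old_a, old_a]
    return -(h[j, new_a] - h[j, old_a]) - (new_coupl - old_coupl)

def metropolis_sample(n_steps=20_000, T=1.0):
    """One approximate equilibrium sample after n_steps single-site Metropolis moves."""
    seq = rng.integers(q, size=L)
    for _ in range(n_steps):
        j, new_a = rng.integers(L), rng.integers(q)
        if rng.random() < np.exp(-delta_energy(seq, j, new_a) / T):
            seq[j] = new_a
    return seq

# A synthetic "MSA without phylogeny" of 20 independent equilibrium samples:
synthetic_msa = np.stack([metropolis_sample() for _ in range(20)])
```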
Results

Column attention heads capture Hamming distances in separate MSAs
We first considered separately each of 15 different Pfam seed MSAs (see "Methods – Datasets" and Supplementary Table 1), corresponding to distinct protein families, and asked whether MSA Transformer has learned to encode phylogenetic relationships between sequences in its attention layers. To test this, we split each MSA randomly into a training and a test set, and train a logistic model [Eqs. (5) and (6)] based on the column-wise means of MSA Transformer's column attention heads on all pairwise Hamming distances in the training set—see Fig. 1 for a schematic, and "Methods – Supervised prediction of Hamming distances" for details. Figure 2 and Table 1 show the results of fitting these specialized logistic models.

For all alignments considered, large regression coefficients concentrate in early layers in the network, and single out some specific heads consistently across different MSAs—see Fig. 2, first and second columns, for results on four example MSAs. These logistic models reproduce the Hamming distances in the training set very well, and successfully predict those in the test set—see Fig. 2, third and fourth columns, for results on four example MSAs. Note that the block structures visible in the Hamming distance matrices, and well reproduced by our models, come from the phylogenetic ordering of sequences in our seed MSAs, see "Methods – Datasets". Quantitatively, the coefficients of determination (R²) computed on the test sets are above 0.84 in all the MSAs studied—see Table 1.

A striking result from our analysis is that the regression coefficients appear to be similar across MSAs—see Fig. 2, first column. To quantify this, we computed the Pearson correlations between the regression coefficients learnt on the larger seed MSAs. Figure 3 demonstrates that regression coefficients are indeed highly correlated across these MSAs.

Fig. 1 | MSA Transformer: column attentions and Hamming distances. a MSA Transformer is trained using the masked language modeling objective of filling in randomly masked residue positions in MSAs. For each residue position in an input MSA, it assigns attention scores to all residue positions in the same row (sequence) and column (site) in the MSA. These computations are performed by 12 independent row/column attention heads in each of 12 successive layers of the network. b Our approach for Hamming distance matrix prediction from the column attentions computed by the trained MSA Transformer model, using a natural MSA as input. For each i = 1, …, M, j = 0, …, L and l = 1, …, 12, the embedding vector x_ij^(l) is the i-th row of the matrix X_j^(l) defined in "Methods – MSA Transformer and column attention", and the column attentions are computed according to Eqs. (2) and (3).

Fig. 2 | Fitting logistic models to predict Hamming distances separately in each MSA. The column-wise means of MSA Transformer's column attention heads are used to predict normalised Hamming distances as probabilities in a logistic model. Each MSA is randomly split into a training set comprising 70% of its sequences and a test set composed of the remaining sequences. For each MSA, a logistic model is trained on all pairwise distances in the training set. Regression coefficients are shown for each layer and attention head (first column), as well as their absolute values averaged over heads for each layer (second column). For four example MSAs, ground truth Hamming distances are shown in the upper triangle (blue) and predicted Hamming distances in the lower triangle and diagonal (green), for the training and test sets (third and fourth columns). Darker shades correspond to larger Hamming distances.
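To make the per-MSA procedure concrete, the hedged sketch below outlines one way to assemble its ingredients with the publicly released MSA Transformer from the fair-esm package: column attentions averaged over MSA columns as pair features, normalised pairwise Hamming distances as targets, and a sigmoid model fitted by cross-entropy as a simple stand-in for the paper's Eqs. (5) and (6). The column-attention tensor shape is an assumption to verify against the installed version, and all helper names are ours.

```python
# Hedged sketch, not the authors' code: MSA Transformer column attentions as
# features for predicting normalised Hamming distances between MSA sequences.
import itertools
import numpy as np
import torch
import esm

model, alphabet = esm.pretrained.esm_msa1b_t12_100M_UR50S()
model.eval()
batch_converter = alphabet.get_batch_converter()

def column_attention_features(msa):
    """msa: list of (name, aligned_sequence) pairs.
    Returns an (M, M, 144) array: 12 layers x 12 heads of column attentions,
    each averaged over MSA columns (column-wise means)."""
    _, _, tokens = batch_converter([msa])            # (1, M, L + 1) token tensor
    with torch.no_grad():
        out = model(tokens, need_head_weights=True)
    col_attn = out["col_attentions"][0].numpy()      # assumed shape (12, 12, L + 1, M, M)
    feat = col_attn.mean(axis=2)                     # average over columns -> (12, 12, M, M)
    return feat.reshape(144, *feat.shape[-2:]).transpose(1, 2, 0)

def hamming_matrix(seqs):
    """Normalised Hamming distances between aligned sequences of equal length."""
    arr = np.array([list(s) for s in seqs])
    return (arr[:, None, :] != arr[None, :, :]).mean(axis=-1)

def fit_fractional_logistic(features, dists, train_idx, n_iter=2000, lr=0.05):
    """Fit sigmoid(w.x + b) to normalised Hamming distances (targets in [0, 1])
    by minimising binary cross-entropy over all training-set pairs."""
    pairs = list(itertools.combinations(train_idx, 2))
    X = torch.tensor(np.array([features[i, j] for i, j in pairs]), dtype=torch.float32)
    y = torch.tensor(np.array([dists[i, j] for i, j in pairs]), dtype=torch.float32)
    w = torch.zeros(X.shape[1], requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([w, b], lr=lr)
    for _ in range(n_iter):
        opt.zero_grad()
        loss = torch.nn.functional.binary_cross_entropy(torch.sigmoid(X @ w + b), y)
        loss.backward()
        opt.step()
    return w.detach().numpy(), float(b)
```

A coefficient vector fitted in this way, one per MSA, is the kind of object whose cross-family similarity is quantified in Fig. 3.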
Table 1 | Quality of fit for logistic models trained to predict Hamming distances separately in each MSA

Family     R²
PF00004    0.97
PF00005    0.99
PF00041    0.98
PF00072    0.99
PF00076    0.98
PF00096    0.94
PF00153    0.95
PF00271    0.94
PF00397    0.84
PF00512    0.94
PF00595    0.98
PF01535    0.86
PF02518    0.92
PF07679    0.99
PF13354    0.99

R² coefficients of determination are shown for the predictions by each fitted model on the associated test set, see Fig. 2.

MSA Transformer learns a universal representation of Hamming distances
Given the substantial similarities between our models trained separately on different MSAs, we next asked whether a common model across MSAs could capture Hamming distances within generic MSAs. To address this question, we trained a single logistic model, based on the column-wise means of MSA Transformer's column attention heads, on all pairwise distances within each of the first 12 of our seed MSAs. We assessed its ability to predict Hamming distances in the remaining 3 seed MSAs, which thus correspond to entirely different Pfam families from those in the training set. Figure 4 shows the coefficients of this regression (first and second panels), as well as comparisons between predictions and ground truth values for the Hamming distances within the three test MSAs (last three panels). We observe that large regression coefficients again concentrate in the early layers of the model, but somewhat less than in individual models. Furthermore, the common model captures well the main features of the Hamming distance matrices in test MSAs.

In Supplementary Table 2, we quantify the quality of fit for this model on all our MSAs. In all cases, we find very high Pearson correlation between the predicted distances and the ground truth Hamming distances. Furthermore, the median value of the R² coefficient of determination is 0.6, confirming the good quality of fit. In the three shortest and the two shallowest MSAs, the model performs below this median, while all MSAs for which R² is above median have depth M ≥ 52 and length L ≥ 67. We also compute, for each MSA, the slope of the linear fit when regressing the ground truth Hamming distances on the distances predicted by the model. MSA depth is highly correlated with the value of this slope (Pearson r ≈ 0.95). This bias may be explained by the under-representation in the training set of Hamming distances and attention values from shallower MSAs, as their number is quadratic in MSA depth.
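The quantities reported in this paragraph are straightforward to compute. The sketch below, with illustrative names, takes the flattened ground-truth and predicted pairwise distances of one held-out MSA and returns the Pearson correlation, the R² coefficient of determination, and the slope of the ground-truth-versus-prediction linear fit.

```python
# Illustrative sketch of the evaluation quantities for the single cross-family model.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import r2_score

def evaluate_held_out_msa(true_d, pred_d):
    """true_d, pred_d: flattened pairwise Hamming distances for one test MSA."""
    r = pearsonr(true_d, pred_d)[0]
    r2 = r2_score(true_d, pred_d)
    slope, _ = np.polyfit(pred_d, true_d, deg=1)   # slope whose value tracks MSA depth
    return {"pearson": r, "r2": r2, "slope": slope}
```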
Ref. 28 showed that some column attention matrices, summed along one of their dimensions, correlate with phylogenetic sequence weights (see "Methods – Supervised prediction of Hamming distances"). This indicates that the model is, in part, attending to maximally diverse sequences. Our study demonstrates that MSA Transformer actually learns pairwise phylogenetic relationships between sequences, beyond these aggregate phylogenetic sequence weights. It also suggests an additional mechanism by which the model may be attending to these relationships, focusing on similarity instead of diversity. Indeed, while our regression coefficients with positive sign in Fig. 4 are associated with (average) attentions that are positively correlated with the Hamming distances, we also find several coefficients with large negative values. They indicate the existence of important negative correlations: in those heads, the model is actually attending to pairs of similar sequences. Besides, comparing our Figs. 2, 4 with Fig. 5 in ref. 28 shows that different attention heads are important in our study versus in the analysis of ref. 28. Specifically, here we find that the fifth attention head in the first layer of the network is associated with the largest positive regression coefficient, while the sixth one was most important there. Moreover, still focusing on the first layer of the network, the other most prominent heads here were not significant there. MSA Transformer's ability to focus on similarity may also explain why its performance at predicting mutational effects can decrease significantly when using MSAs which include a duplicate of the query sequence (see ref. 29, Sec. 5.1).

Fig. 3 | Pearson correlations between regression coefficients in larger MSAs. Sufficiently deep (≥ 100 sequences) and long (≥ 30 residues) MSAs are considered (mean/min/max Pearson correlations: 0.80/0.69/0.87).

Fig. 4 | Fitting a single logistic model to predict Hamming distances. Our collection of 15 MSAs is split into a training set comprising 12 of them and a test set composed of the remaining 3. A logistic regression is trained on all pairwise distances within each MSA in the training set. Regression coefficients (first panel) and their absolute values averaged over heads for each layer (second panel) are shown as in Fig. 2. For the three test MSAs, ground truth Hamming distances are shown in the upper triangle (blue) and predicted Hamming distances in the lower triangle and diagonal (green), also as in Fig. 2 (last three panels). We further report the R² coefficients of determination for the regressions on these test MSAs—see also Supplementary Table 2.

Fig. 5 | Correlations from coevolution and from phylogeny in MSAs. a Natural selection on structure and function leads to correlations between residue positions in MSAs (coevolution). b Potts models, also known as DCA, aim to capture these correlations in their pairwise couplings. c Historical contingency can lead to correlations even in the absence of structural or functional constraints.
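Relatedly, the introduction and Fig. 5c refer to synthetic MSAs generated along phylogenetic trees. The self-contained sketch below illustrates the idea on a toy balanced binary tree with a neutral (constant-energy) acceptance rule, so that any inter-column correlations in the resulting alignment arise purely from shared ancestry. The tree shape, branch length (number of accepted mutations) and energy function are placeholders rather than the inferred trees and fitted Potts Hamiltonians used in the paper; plugging in a Potts energy difference such as the one sketched earlier would yield MSAs combining coevolutionary and phylogenetic correlations.

```python
# Hedged sketch relating to Fig. 5c: evolve sequences along a toy binary tree
# so that leaf sequences are correlated purely through common ancestry.
import numpy as np

rng = np.random.default_rng(1)
L, q = 50, 21

def null_energy_change(seq, j, new_a):
    """Placeholder: no selective constraints (every mutation is neutral)."""
    return 0.0

def mutate_along_branch(seq, delta_energy=null_energy_change, n_accepted=10, T=1.0):
    """Apply single-site Metropolis moves until n_accepted of them are accepted."""
    seq, accepted = seq.copy(), 0
    while accepted < n_accepted:
        j, new_a = rng.integers(L), rng.integers(q)
        if rng.random() < np.exp(-delta_energy(seq, j, new_a) / T):
            seq[j], accepted = new_a, accepted + 1
    return seq

def leaves_of_binary_tree(root_seq, n_generations=5, **kwargs):
    """Duplicate each sequence at every generation and mutate along each branch;
    returns the 2**n_generations leaf sequences as a synthetic MSA."""
    current = [root_seq]
    for _ in range(n_generations):
        current = [mutate_along_branch(s, **kwargs) for s in current for _ in range(2)]
    return np.stack(current)

# With the neutral placeholder energy, correlations between columns of this MSA
# stem purely from phylogeny (historical contingency), as in Fig. 5c.
msa_with_phylogeny = leaves_of_binary_tree(rng.integers(q, size=L))
```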