Article
https://doi.org/10.1038/s41467-022-34032-y

Protein language models trained on multiple sequence alignments learn phylogenetic relationships

Umberto Lupo 1,2, Damiano Sgarbossa 1,2 & Anne-Florence Bitbol 1,2

1 Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland. 2 SIB Swiss Institute of Bioinformatics, CH-1015 Lausanne, Switzerland. e-mail: umberto.lupo@epfl.ch; anne-florence.bitbol@epfl.ch

Received: 8 April 2022. Accepted: 7 October 2022.

Self-supervised neural language models with attention have recently been applied to biological sequence data, advancing structure, function and mutational effect prediction. Some protein language models, including MSA Transformer and AlphaFold's EvoFormer, take multiple sequence alignments (MSAs) of evolutionarily related proteins as inputs. Simple combinations of MSA Transformer's row attentions have led to state-of-the-art unsupervised structural contact prediction. We demonstrate that similarly simple, and universal, combinations of MSA Transformer's column attentions strongly correlate with Hamming distances between sequences in MSAs. Therefore, MSA-based language models encode detailed phylogenetic relationships. We further show that these models can separate coevolutionary signals encoding functional and structural constraints from phylogenetic correlations reflecting historical contingency. To assess this, we generate synthetic MSAs, either without or with phylogeny, from Potts models trained on natural MSAs. We find that unsupervised contact prediction is substantially more resilient to phylogenetic noise when using MSA Transformer versus inferred Potts models.

The explosion of available biological sequence data has led to multiple computational approaches aiming to infer three-dimensional structure, biological function, fitness, and evolutionary history of proteins from sequence data1,2. Recently, self-supervised deep learning models based on natural language processing methods, especially attention3 and transformers4, have been trained on large ensembles of protein sequences by means of the masked language modeling objective of filling in masked amino acids in a sequence, given the surrounding ones5–10. These models, which capture long-range dependencies, learn rich representations of protein sequences, and can be employed for multiple tasks. In particular, they can predict structural contacts from single sequences in an unsupervised way7, presumably by transferring knowledge from their large training set11. Neural network architectures based on attention are also employed in the Evoformer blocks in AlphaFold12, as well as in RoseTTAFold13 and RGN214, and they contributed to the recent breakthrough in the supervised prediction of protein structure.

Protein sequences can be classified into families of homologous proteins, which descend from an ancestral protein and share a similar structure and function. Analyzing multiple sequence alignments (MSAs) of homologous proteins thus provides substantial information about functional and structural constraints. The statistics of MSA columns, representing amino-acid sites, allow the identification of functional residues that are conserved during evolution, and correlations of amino-acid usage between columns contain key information about functional sectors and structural contacts15–18. Indeed, through the course of evolution, contacting amino acids need to maintain their physico-chemical complementarity, which leads to correlated amino-acid usages at these sites: this is known as coevolution.
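As a concrete illustration of these column statistics, the short sketch below computes single-site amino-acid frequencies and connected two-site correlations from a toy alignment. The alignment, alphabet ordering and helper names are illustrative choices and are not taken from the paper.

```python
# Illustrative sketch (not from the paper): empirical column statistics of an MSA.
# Single-site frequencies measure conservation; connected two-site correlations
# measure covariation between columns (the coevolution signal discussed above).
import numpy as np

AA = "-ACDEFGHIKLMNPQRSTVWY"              # gap character plus the 20 amino acids
AA_INDEX = {a: k for k, a in enumerate(AA)}

def one_hot(msa_rows):
    """Encode an MSA (list of equal-length strings) as an M x L x q array."""
    M, L, q = len(msa_rows), len(msa_rows[0]), len(AA)
    X = np.zeros((M, L, q))
    for i, seq in enumerate(msa_rows):
        for j, a in enumerate(seq):
            X[i, j, AA_INDEX.get(a, 0)] = 1.0
    return X

def column_statistics(msa_rows):
    """One-body frequencies f_j(a) and connected correlations
    C_jk(a, b) = f_jk(a, b) - f_j(a) f_k(b)."""
    X = one_hot(msa_rows)                          # (M, L, q)
    M = X.shape[0]
    f1 = X.mean(axis=0)                            # (L, q)
    f2 = np.einsum("mja,mkb->jkab", X, X) / M      # (L, L, q, q)
    return f1, f2 - np.einsum("ja,kb->jkab", f1, f1)

# Toy usage: a Frobenius norm over amino-acid pairs gives a simple
# covariation score for each pair of columns (j, k).
toy_msa = ["ACDE-", "ACDE-", "AKDEF", "AKDQF"]
f1, C = column_statistics(toy_msa)
pair_scores = np.linalg.norm(C, axis=(2, 3))
```

Large entries of `pair_scores` flag column pairs with correlated amino-acid usage, the raw signal that coevolution-based contact prediction exploits.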
Potts models, also known as Direct Coupling Analysis (DCA), are pairwise maximum entropy models trained to match the empirical one- and two-body frequencies of amino acids observed in the columns of an MSA of homologous proteins2,19–26. They capture the coevolution of contacting amino acids, and provided state-of-the-art unsupervised predictions of structural contacts before the advent of protein language models. Note that coevolutionary signal also aids supervised contact prediction27.

While most protein language neural networks take individual amino-acid sequences as inputs, some others have been trained to perform inference from MSAs of evolutionarily related sequences. This second class of networks includes MSA Transformer28 and the Evoformer blocks in AlphaFold12, both of which interleave row (i.e. per-sequence) attention with column (i.e. per-site) attention. Such an architecture is conceptually extremely attractive because it can incorporate coevolution in the framework of deep learning models using attention. In the case of MSA Transformer, simple combinations of the model's row attention heads have led to state-of-the-art unsupervised structural contact prediction, outperforming both language models trained on individual sequences and Potts models28. Beyond structure prediction, MSA Transformer is also able to predict mutational effects29,30 and to capture fitness landscapes31. In addition to coevolutionary signal caused by structural and functional constraints, MSAs feature correlations that directly stem from the common ancestry of homologous proteins, i.e. from phylogeny. Does MSA Transformer learn to identify phylogenetic relationships between sequences, which are a key aspect of the MSA data structure?

Here, we show that simple, and universal, combinations of MSA Transformer's column attention heads, computed on a given MSA, strongly correlate with the Hamming distances between sequences in that MSA. This demonstrates that MSA Transformer encodes detailed phylogenetic relationships. Is MSA Transformer able to separate coevolutionary signals encoding functional and structural constraints from phylogenetic correlations arising from historical contingency? To address this question, we generate controlled synthetic MSAs from Potts models trained on natural MSAs, either without or with phylogeny. For this, we perform Metropolis Monte Carlo sampling under the Potts Hamiltonians, either at equilibrium or along phylogenetic trees inferred from the natural MSAs. Using the top Potts model couplings as proxies for structural contacts, we demonstrate that unsupervised contact prediction via MSA Transformer is substantially more resilient to phylogenetic noise than contact prediction using inferred Potts models.
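The equilibrium ("without phylogeny") case of this procedure amounts to standard single-site Metropolis Monte Carlo sampling under a Potts Hamiltonian. The self-contained sketch below is a minimal illustration with random placeholder fields and couplings; it is not the authors' code, and the sizes, temperature and step counts are arbitrary choices, whereas the paper uses Potts models inferred from natural MSAs.

```python
# Hedged sketch: equilibrium Metropolis sampling from a Potts model with
# energy E(s) = -sum_j h_j(s_j) - sum_{j<k} J_jk(s_j, s_k).
# The fields h and couplings J below are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
L, q = 50, 21                                  # alignment length, alphabet size (20 aa + gap)
h = rng.normal(scale=0.1, size=(L, q))         # placeholder fields
J = rng.normal(scale=0.05, size=(L, L, q, q))  # placeholder couplings
J = (J + J.transpose(1, 0, 3, 2)) / 2          # enforce J_jk(a, b) = J_kj(b, a)

def delta_energy(seq, j, new_a):
    """Energy change when site j of 'seq' is mutated to state new_a."""
    old_a = seq[j]
    ks = np.arange(L)
    # couplings of site j with every other site (the k = j self term is removed)
    new_coupl = J[j, ks, new_a, seq].sum() - J[j, j, new_a, old_a]
    old_coupl = J[j, ks, old_a, seq].sum() - J[j, j, old_a, old_a]
    return -(h[j, new_a] - h[j, old_a]) - (new_coupl - old_coupl)

def metropolis_sample(n_steps=20_000, T=1.0):
    """One approximate equilibrium sample after n_steps single-site Metropolis moves."""
    seq = rng.integers(q, size=L)
    for _ in range(n_steps):
        j, new_a = rng.integers(L), rng.integers(q)
        if rng.random() < np.exp(-delta_energy(seq, j, new_a) / T):
            seq[j] = new_a
    return seq

# A synthetic "MSA without phylogeny" of 20 independent equilibrium samples:
synthetic_msa = np.stack([metropolis_sample() for _ in range(20)])
```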
Results

Column attention heads capture Hamming distances in separate MSAs
We first considered separately each of 15 different Pfam seed MSAs (see "Methods – Datasets" and Supplementary Table 1), corresponding to distinct protein families, and asked whether MSA Transformer has learned to encode phylogenetic relationships between sequences in its attention layers. To test this, we split each MSA randomly into a training and a test set, and train a logistic model [Eqs. (5) and (6)] based on the column-wise means of MSA Transformer's column attention heads on all pairwise Hamming distances in the training set—see Fig. 1 for a schematic, and "Methods – Supervised prediction of Hamming distances" for details. Figure 2 and Table 1 show the results of fitting these specialized logistic models.

For all alignments considered, large regression coefficients concentrate in early layers in the network, and single out some specific heads consistently across different MSAs—see Fig. 2, first and second columns, for results on four example MSAs. These logistic models reproduce the Hamming distances in the training set very well, and successfully predict those in the test set—see Fig. 2, third and fourth columns, for results on four example MSAs. Note that the block structures visible in the Hamming distance matrices, and well reproduced by our models, come from the phylogenetic ordering of sequences in our seed MSAs, see "Methods – Datasets". Quantitatively, the coefficients of determination (R²) computed on the test sets are above 0.84 in all the MSAs studied—see Table 1.

A striking result from our analysis is that the regression coefficients appear to be similar across MSAs—see Fig. 2, first column. To quantify this, we computed the Pearson correlations between the regression coefficients learnt on the larger seed MSAs. Figure 3 demonstrates that regression coefficients are indeed highly correlated across these MSAs.

Fig. 1 | MSA Transformer: column attentions and Hamming distances. a MSA Transformer is trained using the masked language modeling objective of filling in randomly masked residue positions in MSAs. For each residue position in an input MSA, it assigns attention scores to all residue positions in the same row (sequence) and column (site) in the MSA. These computations are performed by 12 independent row/column attention heads in each of 12 successive layers of the network. b Our approach for Hamming distance matrix prediction from the column attentions computed by the trained MSA Transformer model, using a natural MSA as input. For each i = 1, …, M, j = 0, …, L and l = 1, …, 12, the embedding vector x_ij^(l) is the i-th row of the matrix X_j^(l) defined in "Methods – MSA Transformer and column attention", and the column attentions are computed according to Eqs. (2) and (3).

Fig. 2 | Fitting logistic models to predict Hamming distances separately in each MSA. The column-wise means of MSA Transformer's column attention heads are used to predict normalised Hamming distances as probabilities in a logistic model. Each MSA is randomly split into a training set comprising 70% of its sequences and a test set composed of the remaining sequences. For each MSA, a logistic model is trained on all pairwise distances in the training set. Regression coefficients are shown for each layer and attention head (first column), as well as their absolute values averaged over heads for each layer (second column). For four example MSAs, ground truth Hamming distances are shown in the upper triangle (blue) and predicted Hamming distances in the lower triangle and diagonal (green), for the training and test sets (third and fourth columns). Darker shades correspond to larger Hamming distances.
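To make the per-MSA procedure concrete, the hedged sketch below outlines one way to assemble its ingredients with the publicly released MSA Transformer from the fair-esm package: column attentions averaged over MSA columns as pair features, normalised pairwise Hamming distances as targets, and a sigmoid model fitted by cross-entropy as a simple stand-in for the paper's Eqs. (5) and (6). The column-attention tensor shape is an assumption to verify against the installed version, and all helper names are ours.

```python
# Hedged sketch, not the authors' code: MSA Transformer column attentions as
# features for predicting normalised Hamming distances between MSA sequences.
import itertools
import numpy as np
import torch
import esm

model, alphabet = esm.pretrained.esm_msa1b_t12_100M_UR50S()
model.eval()
batch_converter = alphabet.get_batch_converter()

def column_attention_features(msa):
    """msa: list of (name, aligned_sequence) pairs.
    Returns an (M, M, 144) array: 12 layers x 12 heads of column attentions,
    each averaged over MSA columns (column-wise means)."""
    _, _, tokens = batch_converter([msa])            # (1, M, L + 1) token tensor
    with torch.no_grad():
        out = model(tokens, need_head_weights=True)
    col_attn = out["col_attentions"][0].numpy()      # assumed shape (12, 12, L + 1, M, M)
    feat = col_attn.mean(axis=2)                     # average over columns -> (12, 12, M, M)
    return feat.reshape(144, *feat.shape[-2:]).transpose(1, 2, 0)

def hamming_matrix(seqs):
    """Normalised Hamming distances between aligned sequences of equal length."""
    arr = np.array([list(s) for s in seqs])
    return (arr[:, None, :] != arr[None, :, :]).mean(axis=-1)

def fit_fractional_logistic(features, dists, train_idx, n_iter=2000, lr=0.05):
    """Fit sigmoid(w.x + b) to normalised Hamming distances (targets in [0, 1])
    by minimising binary cross-entropy over all training-set pairs."""
    pairs = list(itertools.combinations(train_idx, 2))
    X = torch.tensor(np.array([features[i, j] for i, j in pairs]), dtype=torch.float32)
    y = torch.tensor(np.array([dists[i, j] for i, j in pairs]), dtype=torch.float32)
    w = torch.zeros(X.shape[1], requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([w, b], lr=lr)
    for _ in range(n_iter):
        opt.zero_grad()
        loss = torch.nn.functional.binary_cross_entropy(torch.sigmoid(X @ w + b), y)
        loss.backward()
        opt.step()
    return w.detach().numpy(), float(b)
```

A coefficient vector fitted in this way, one per MSA, is the kind of object whose cross-family similarity is quantified in Fig. 3.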
Table 1 | Quality of fit for logistic models trained to predict Hamming distances separately in each MSA

Family     R²
PF00004    0.97
PF00005    0.99
PF00041    0.98
PF00072    0.99
PF00076    0.98
PF00096    0.94
PF00153    0.95
PF00271    0.94
PF00397    0.84
PF00512    0.94
PF00595    0.98
PF01535    0.86
PF02518    0.92
PF07679    0.99
PF13354    0.99

R² coefficients of determination are shown for the predictions by each fitted model on the associated test set, see Fig. 2.

MSA Transformer learns a universal representation of Hamming distances
Given the substantial similarities between our models trained separately on different MSAs, we next asked whether a common model across MSAs could capture Hamming distances within generic MSAs. To address this question, we trained a single logistic model, based on the column-wise means of MSA Transformer's column attention heads, on all pairwise distances within each of the first 12 of our seed MSAs. We assessed its ability to predict Hamming distances in the remaining 3 seed MSAs, which thus correspond to entirely different Pfam families from those in the training set. Figure 4 shows the coefficients of this regression (first and second panels), as well as comparisons between predictions and ground truth values for the Hamming distances within the three test MSAs (last three panels). We observe that large regression coefficients again concentrate in the early layers of the model, but somewhat less than in individual models. Furthermore, the common model captures well the main features of the Hamming distance matrices in test MSAs.

In Supplementary Table 2, we quantify the quality of fit for this model on all our MSAs. In all cases, we find very high Pearson correlation between the predicted distances and the ground truth Hamming distances. Furthermore, the median value of the R² coefficient of determination is 0.6, confirming the good quality of fit. In the three shortest and the two shallowest MSAs, the model performs below this median, while all MSAs for which R² is above median have depth M ≥ 52 and length L ≥ 67. We also compute, for each MSA, the slope of the linear fit when regressing the ground truth Hamming distances on the distances predicted by the model. MSA depth is highly correlated with the value of this slope (Pearson r ≈ 0.95). This bias may be explained by the under-representation in the training set of Hamming distances and attention values from shallower MSAs, as their number is quadratic in MSA depth.
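The quantities reported in this paragraph are straightforward to compute. The sketch below, with illustrative names, takes the flattened ground-truth and predicted pairwise distances of one held-out MSA and returns the Pearson correlation, the R² coefficient of determination, and the slope of the ground-truth-versus-prediction linear fit.

```python
# Illustrative sketch of the evaluation quantities for the single cross-family model.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import r2_score

def evaluate_held_out_msa(true_d, pred_d):
    """true_d, pred_d: flattened pairwise Hamming distances for one test MSA."""
    r = pearsonr(true_d, pred_d)[0]
    r2 = r2_score(true_d, pred_d)
    slope, _ = np.polyfit(pred_d, true_d, deg=1)   # slope whose value tracks MSA depth
    return {"pearson": r, "r2": r2, "slope": slope}
```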
Ref. 28 showed that some column attention matrices, summed along one of their dimensions, correlate with phylogenetic sequence weights (see "Methods – Supervised prediction of Hamming distances"). This indicates that the model is, in part, attending to maximally diverse sequences. Our study demonstrates that MSA Transformer actually learns pairwise phylogenetic relationships between sequences, beyond these aggregate phylogenetic sequence weights. It also suggests an additional mechanism by which the model may be attending to these relationships, focusing on similarity instead of diversity. Indeed, while our regression coefficients with positive sign in Fig. 4 are associated with (average) attentions that are positively correlated with the Hamming distances, we also find several coefficients with large negative values. They indicate the existence of important negative correlations: in those heads, the model is actually attending to pairs of similar sequences. Besides, comparing our Figs. 2, 4 with Fig. 5 in ref. 28 shows that different attention heads are important in our study versus in the analysis of ref. 28. Specifically, here we find that the fifth attention head in the first layer of the network is associated with the largest positive regression coefficient, while the sixth one was most important there. Moreover, still focusing on the first layer of the network, the other most prominent heads here were not significant there. MSA Transformer's ability to focus on similarity may also explain why its performance at predicting mutational effects can decrease significantly when using MSAs which include a duplicate of the query sequence (see ref. 29, Sec. 5.1).

Fig. 3 | Pearson correlations between regression coefficients in larger MSAs. Sufficiently deep (≥ 100 sequences) and long (≥ 30 residues) MSAs are considered (mean/min/max Pearson correlations: 0.80/0.69/0.87).

Fig. 4 | Fitting a single logistic model to predict Hamming distances. Our collection of 15 MSAs is split into a training set comprising 12 of them and a test set composed of the remaining 3. A logistic regression is trained on all pairwise distances within each MSA in the training set. Regression coefficients (first panel) and their absolute values averaged over heads for each layer (second panel) are shown as in Fig. 2. For the three test MSAs, ground truth Hamming distances are shown in the upper triangle (blue) and predicted Hamming distances in the lower triangle and diagonal (green), also as in Fig. 2 (last three panels). We further report the R² coefficients of determination for the regressions on these test MSAs—see also Supplementary Table 2.

Fig. 5 | Correlations from coevolution and from phylogeny in MSAs. a Natural selection on structure and function leads to correlations between residue positions in MSAs (coevolution). b Potts models, also known as DCA, aim to capture these correlations in their pairwise couplings. c Historical contingency can lead to correlations even in the absence of structural or functional constraints.
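Relatedly, the introduction and Fig. 5c refer to synthetic MSAs generated along phylogenetic trees. The self-contained sketch below illustrates the idea on a toy balanced binary tree with a neutral (constant-energy) acceptance rule, so that any inter-column correlations in the resulting alignment arise purely from shared ancestry. The tree shape, branch length (number of accepted mutations) and energy function are placeholders rather than the inferred trees and fitted Potts Hamiltonians used in the paper; plugging in a Potts energy difference such as the one sketched earlier would yield MSAs combining coevolutionary and phylogenetic correlations.

```python
# Hedged sketch relating to Fig. 5c: evolve sequences along a toy binary tree
# so that leaf sequences are correlated purely through common ancestry.
import numpy as np

rng = np.random.default_rng(1)
L, q = 50, 21

def null_energy_change(seq, j, new_a):
    """Placeholder: no selective constraints (every mutation is neutral)."""
    return 0.0

def mutate_along_branch(seq, delta_energy=null_energy_change, n_accepted=10, T=1.0):
    """Apply single-site Metropolis moves until n_accepted of them are accepted."""
    seq, accepted = seq.copy(), 0
    while accepted < n_accepted:
        j, new_a = rng.integers(L), rng.integers(q)
        if rng.random() < np.exp(-delta_energy(seq, j, new_a) / T):
            seq[j], accepted = new_a, accepted + 1
    return seq

def leaves_of_binary_tree(root_seq, n_generations=5, **kwargs):
    """Duplicate each sequence at every generation and mutate along each branch;
    returns the 2**n_generations leaf sequences as a synthetic MSA."""
    current = [root_seq]
    for _ in range(n_generations):
        current = [mutate_along_branch(s, **kwargs) for s in current for _ in range(2)]
    return np.stack(current)

# With the neutral placeholder energy, correlations between columns of this MSA
# stem purely from phylogeny (historical contingency), as in Fig. 5c.
msa_with_phylogeny = leaves_of_binary_tree(rng.integers(q, size=L))
```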