Who wrote this book? A challenge for e-commerce

Béranger Dumont, Simona Maggio, Ghiles Sidi Said & Quoc-Tien Au
Rakuten Institute of Technology Paris
{beranger.dumont,simona.maggio}@rakuten.com, {ts-ghiles.sidisaid,quoctien.au}@rakuten.com

Proceedings of the 2019 EMNLP Workshop W-NUT: The 5th Workshop on Noisy User-generated Text, pages 121-125, Hong Kong, Nov 4, 2019. (c) 2019 Association for Computational Linguistics

Abstract

Modern e-commerce catalogs contain millions of references, associated with textual and visual information that is of paramount importance for the products to be found via search or browsing. Of particular significance is the book category, where the author name(s) field poses a significant challenge. Indeed, books written by a given author might be listed with different authors' names due to abbreviations, spelling variants and mistakes, among others. To solve this problem at scale, we design a composite system involving open data sources for books, as well as deep learning components, such as approximate match with Siamese networks and name correction with sequence-to-sequence networks. We evaluate this approach on product data from the e-commerce website Rakuten France, and find that the top proposal of the system is the normalized author name with 72% accuracy.

1 Introduction

Unlike brick-and-mortar stores, e-commerce websites can list hundreds of millions of products, with thousands of new products entering their catalogs every day. The availability and the reliability of the information on the products, or product data, are crucial for the products to be found by the users via textual or visual search, or using faceted navigation.

Books constitute a prominent part of many large e-commerce catalogs. Relevant book properties include: title, author(s), format, edition, and publication date, among others. In this work, we focus on the names of book authors, as they are found to be extremely relevant to the user and are commonly used in search queries on e-commerce websites, but suffer from considerable variability and noise. To the best of our knowledge, there is no large-scale public dataset for books that captures the variability arising on e-commerce marketplaces from user-generated input. Thus, in this work we use product data from Rakuten France[1] (RFR).

The variability and noise are evident in the RFR dataset. For example, books written by F. Scott Fitzgerald are also listed with the following author names: "Francis Scott Fitzgerald" (full name), "Fitzgerald, F. Scott" (inversion of the first and last name), "Fitzgerald" (last name only), "F. Scott Fitgerald" (misspelling of the last name), "F SCOTT FITZGERALD" (capitalization and different typographical conventions), as well as several combinations of those variations. The variability of the possible spellings for an author's name is very hard to capture using rules, even more so for names which are not primarily written in the Latin alphabet (such as Arabic or Asian names), for names containing titles (such as "Dr." or "Pr."), and for pen names which may not follow the usual conventions. This motivated us to explore automated techniques for normalizing the authors' names to their best known ("canonical") spellings.

Fortunately, a wealth of open databases exists for books, making it possible to match a significant fraction of the books listed in e-commerce catalogs. While not always clean and unambiguous, this information is extremely valuable and enables us to build datasets of name variants, used to train machine learning systems to normalize authors' names. To this end, in addition to the match with open databases, we will explore two different approaches: approximate match with known authors' names using Siamese neural networks, and direct correction of the provided author's name using sequence-to-sequence learning with neural networks. Then, an additional machine learning component is used to rank the results.

The rest of the paper is organized as follows: we present the data from RFR and from the open databases in Section 2, before turning to the experimental setup for the overall system and for each of its components in Section 3. Finally, we give results in Section 4, we present related works in Section 5, and conclude in Section 6.

[1] https://fr.shopping.rakuten.com

2 Book data

2.1 Rakuten France data

The RFR dataset contains 12 million book references[2]. The most relevant product data for normalization is:

• ISBN[3] in 10-digit or 13-digit format;
• product title, which includes the book title, often supplemented with extra information in free text;
• author(s) of the book as the input catalog name provided by the seller.

In particular, the ISBN is a worldwide unique identifier for books, which makes it a prime candidate for unambiguous matching with external sources. In this dataset, an ISBN is present for about 70% of the books. Among the books with no ISBN, 30% are ancient books which are not expected to be associated with an ISBN.

[2] The RFR dataset is publicly available at https://rit.rakuten.co.jp/data_release.
[3] International Standard Book Number, see https://www.isbn-international.org

Figure 1: Overview of the system for normalizing author names. Each component is detailed in Section 3.

2.2 External bibliographic resources

There is no central authority providing consistent information on books associated with an ISBN. However, there is a wealth of bibliographic resources and open databases for books. In order to retrieve the author name(s) associated with the books in the RFR dataset, we perform ISBN matching using public APIs on eight of them, listed in Table 1 along with the fraction of found ISBNs from this dataset. We find the sources to be highly complementary and that 75% of the books with an ISBN are matched with at least one source. The match via ISBN on external bibliographic resources is the first component of the system depicted in Fig. 1.

Table 1: Performance of the external bibliographic resources used for matching books on RFR via ISBN.

Source        URL                % of ISBNs
Open Library  openlibrary.org    24.9%
ISBNdb        isbndb.com         36.3%
Goodreads     www.goodreads.com  64.7%
Google Books  books.google.com   51.2%
OCLC          www.oclc.org       52.2%
BnF           www.bnf.fr          7.4%
Sudoc         www.sudoc.abes.fr  29.0%
Babelio       www.babelio.com     7.9%

2.3 Dataset of name entities

In order to train and evaluate machine learning systems to match or correct authors' names, a dataset of name entities containing the different surface forms (or variants) of authors' names is required. The entities should reflect as well as possible the variability that can be found in the RFR dataset, as was illustrated in the case of F. Scott Fitzgerald in Section 1.

For each entity, a canonical name should be elected and correspond to the name that should be preferred for the purpose of e-commerce. Instead of setting these gold spellings by following some predefined rules (i.e. family name in the first position, initial of first name, etc.), for e-commerce applications it is more appropriate that the displayed author names have the most popular spellings among readers. In agreement with Rakuten catalog analysts, we set the most popular spelling of an author name as the one found on Wikipedia[4] or DBpedia (Lehmann et al., 2015).

While Wikipedia seems more pertinent to select canonical names matching e-commerce user expectations, specialized librarian data services, such as the Library of Congress Name Authority[5], could be used in future research to enrich the dataset of name entities.

Name entities are collected in three distinct ways:

1. ISBN matching: for each book, the different author names found via ISBN search on external sources and the RFR author name field build up an entity. The canonical form is the one that is matched with Wikipedia or DBpedia; else the one provided by the greatest number of sources.

2. Matching of Rakuten authors: we build entities using fuzzy search on the author name field on DBpedia and consider the DBpedia value to be canonical. We limit the number of false positives in fuzzy search by tokenizing both names, and keeping only the names where at least one token from the name on RFR is approximately found in the external resource (Levenshtein distance < 2).

3. Name variants: DBpedia, BnF, and JRC-Names (Steinberger et al., 2011; Maud et al., 2016) directly provide data about people (not limited to book authors) and their name variants.

[4] https://www.wikipedia.org
[5] id.loc.gov/authorities/names.html

As an example, by using the wikiPageRedirects field in DBpedia we can build a large entity for the canonical name "Anton Tchekhov", containing "Anton Tchechov", "Anton Pavlovič Čechov", "Checkhov", "Anton Chekov", and many more.

After creating the name entity dataset, we normalize all names to latin-1. We obtain about 750,000 entities, for a total of 2.1 million names.

2.4 Annotated Rakuten France data

In order to evaluate the overall system, we need product data from RFR for which the canonical author name has been carefully annotated and can be considered as the ground truth. To this end, we have considered a subset of 1000 books from the RFR dataset, discarding books written by more than one author for simplicity[6]. We find that 467 books have a canonical author name that differs from RFR's original (unnormalized) author name. Also, 310 do not have an ISBN or do not match on any of the bibliographic resources listed in Section 2.2. Among them, 208 books have a canonical name that differs from the input catalog name provided by the seller.

[6] The annotated RFR dataset is publicly available at https://rit.rakuten.co.jp/data_release.

3 Experimental setup

The overview of the system can be found in Fig. 1. Its first component, the matching via ISBN against external databases, has already been presented in Section 2.2. In the rest of this section, we will shed light on the three machine learning components of the system.

3.1 Siamese approximate name matching

We want to learn a mapping that assigns a similarity score to a pair of author names such that name variants of the same entity will have high similarity, and names that belong to different entities will have low similarity. Once learned, this mapping will enable us to assign an entity to any given name.

To this end, we might use a classical string metric such as the Levenshtein distance or the n-gram distance (Kondrak, 2005). However, those are not specific to people's names, and might return a large distance (low similarity) in cases such as the inversion between first name and last name or the abbreviation of the first name to an initial. Thus, we want to use the dataset of name entities to learn a specialized notion of similarity; this is known as distance metric learning (Kulis et al., 2013).

To this end, we use a pair of neural networks with shared weights, or Siamese neural network (Bromley et al., 1994). Each network is a recurrent neural network (RNN) composed of a character-level embedding layer with 256 units, a bidirectional long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) with 2 × 128 units, and a dense layer with 256 units. Each network takes a name as input and outputs a representation; the two representations are then compared using cosine similarity, with a target value equal to 1 for name variants of the same entity, and to 0 otherwise. We preprocess the input by representing all characters in ASCII and lowercase. We consider a sequence length of 32 using zero padding.

The Siamese network is trained with contrastive loss (Hadsell et al., 2006) in order to push the similarity towards 1 for similar pairs, and below a certain margin (that we set to 0) for dissimilar pairs. The optimization is done using Adam (Kingma and Ba, 2014), with a learning rate of 10⁻³ and a gradient clipping value of 5. We use batches of 512 samples, consider a negative to positive pairs ratio of 4:1, and randomly generate new negative pairs at every epoch.

At test time, we search for the canonical name whose representation is closest to that of the query, using only the high-quality name entities from DBpedia, BnF, and JRC-Names. To this end, we do approximate nearest neighbor search using Annoy[7].

[7] https://github.com/spotify/annoy

3.2 Name correction with seq2seq networks

We use a generative model to correct and normalize authors' names directly. The dataset of name entities is again employed to train a sequence-to-sequence (seq2seq) model (Sutskever et al., 2014) to produce the canonical form of a name from one of its variants. The dataset is further augmented by including additional variants where the first name is abbreviated to an initial.

The seq2seq model is an encoder-decoder using RNNs, with a character embedding layer, as in the case of the Siamese network. The encoder is a bi-directional LSTM with 2 × 256 units, while the decoder is a plain LSTM with 512 units connected to a softmax layer that computes a probability distribution over the characters.

The training is performed by minimizing the categorical cross-entropy loss, using teacher forcing (Williams and Zipser, 1989). The optimization setting is identical to that of the Siamese network, with batches of 1024 samples. For inference, we collect the 10 output sequences with highest probability using beam search.

3.3 Ranking of the proposals

For any given book with an ISBN and an author's name, all three techniques shown in Fig. 1 provide one or several candidate canonical names. As we aim at providing an automated tool to enhance the quality of the book products, the final system should provide a ranked list of candidates with a calibrated confidence level. For this purpose we train a logistic regression to estimate the probability that a proposal is the canonical form for an author's name. This information is then used as a confidence score to rank the different candidate names returned by the three normalization approaches.

Specifically, we represent a proposal with a set of 12 features: 11 indicating whether it is found in the bibliographic sources, generated from the seq2seq model, matched with the Siamese network or equal to the input name, and one last feature corresponding to the cosine distance between the representation of the proposal and that of the input name. The selected features reflect that the confidence of the global system should increase with (i) the consensus among the different sources, and (ii) the similarity of the candidate to the input name.

For this component we use the annotated dataset introduced in Section 2.4, splitting the books between training and test sets with a ratio of 50%:50%, generating a total of 11185 proposals.

4 Results

The three machine learning components discussed in the previous section have been individually evaluated on their specific task. Furthermore, the final system has been evaluated in terms of correctly normalized book authors in a real case scenario.

Siamese approximate name matching. We evaluate the Siamese network on a held-out test set, and compare it to an n-gram distance, by checking that the nearest neighbor of a name variant is the canonical name of the entity to which it belongs. We find an accuracy of 79.8% for the Siamese network, against 71.1% for the n-gram baseline with n = 3. We have also checked metrics when introducing a threshold distance above which we consider that no matching entity is found, and found systematic improvement over the baseline. In the final system, we set the threshold to infinity.

Siamese networks are more effective than simpler rule-based approaches; more specifically, they perform better than the n-gram baseline on the following cases:

• Vittorio Hugo → Victor Hugo: capturing name variants in different languages;
• Bill Shakespeare → William Shakespeare: capturing common nicknames.

Name correction with seq2seq networks. Similarly to the previous approach, the seq2seq network is evaluated on a held-out test set by checking that one of the generated name variants is the canonical name of the entity to which it belongs. As expected, name normalization using a seq2seq network gives poorer performance than approximate matching within a dataset of known authors, but constitutes a complementary approach that is useful in the case of formatting issues or incomplete names. This approach alone reaches a top-10 accuracy of 42% on the entire test set, 26% on a test set containing only names with initials, and 53% on a test set containing only minor spelling mistakes.

Some examples where seq2seq performs better than the other methods are as follows:

• V. Hugo → Victor Hugo: first name prediction for authors we don't have in the canonical database;
• Vicor Hugo → Victor Hugo: misspelling correction for authors we don't have in the canonical database.

Ranking of the proposals. With a decision threshold of p = 0.5, the trained classifier has an accuracy of 93% for both positive and negative candidates in the test set. The coefficients of the estimator reveal the importance of the features and, thus, of the related components. The three most important contributions are the match with the Siamese network, the match via ISBN in Babelio, and the similarity with the input catalog name, confirming the relevance of a multi-approach design choice.

Global system. In order to reflect the actual use of the global system on e-commerce catalog data, the final evaluation is performed at the book level, by considering all the proposals provided by the different components for a given book. The metric used is the top-k accuracy on the ranked list of proposals for each book; results are summarized in Table 2. We find that 72% of the books have the author's name normalized by the highest ranked proposal. Excluding from the evaluation books where the ground truth for the author's name equals the catalog value, this accuracy drops to 49%. In the case of books without ISBN or that do not match on any of the bibliographic resources, thus relying on machine learning-based components only, we find that 50% of the books are normalized by the top proposal. Finally, for the combination of the above two restrictions, we find a top-1 accuracy of 35%.

Table 2: Global system top-k accuracy at the book level.

Type of books         #samples  acc@1  acc@3
all                        500    72%    85%
unnorm. input author       235    49%    67%
no ISBN match              151    50%    64%
unnorm. + no ISBN          109    35%    49%

5 Related works

There is a long line of work on author name disambiguation for the case of bibliographic citation records (Hussain and Asghar, 2017). While related, this problem differs from the one of book authors. Indeed, unlike most books, research publications usually have several authors, each of them having published papers with other researchers. The relationships among authors, which can be represented as a graph, may be used to help disambiguate the bibliographic citations.

Named entity linking (Shen et al., 2015), where one aims at determining the identity of entities (such as a person's name) mentioned in text, is another related problem. The crucial difference with the disambiguation of book authors is that entity linking systems leverage the context of the named entity mention to link unambiguously to an entity in a pre-populated knowledge base.

The conformity of truth in web resources is also a related problem, addressed in the literature by algorithms such as TruthFinder (Yin et al., 2008). Similarly, the proposed global model in which we combine sources learns to some extent the level of trust of the different
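For illustration, the token-level fuzzy filter described above (keep a DBpedia candidate only if at least one token of the RFR name is approximately found in it, with Levenshtein distance < 2) can be sketched as follows; the function names and the exact tokenization are ours, not the authors':

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def plausible_match(rfr_name: str, candidate: str, max_dist: int = 1) -> bool:
    """Keep a fuzzy-search candidate only if at least one token of the
    RFR name is approximately found among the candidate's tokens
    (Levenshtein distance < 2, i.e. at most 1)."""
    cand_tokens = candidate.lower().split()
    return any(
        min(levenshtein(tok, ct) for ct in cand_tokens) <= max_dist
        for tok in rfr_name.lower().split()
    )
```

Under this rule, a misspelled "F. Scott Fitgerald" is still accepted against "Francis Scott Fitzgerald" (one edit away on the last name), while unrelated names are rejected.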
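The preprocessing and training objective of Section 3.1 can be made concrete with a small sketch. This is our own plain-Python reading of the setup (ASCII folding, lowercasing, zero padding to length 32, and a contrastive loss on cosine similarity with margin 0), not the authors' code:

```python
import unicodedata


def preprocess(name: str, max_len: int = 32) -> list:
    """Fold to ASCII, lowercase, then truncate/zero-pad to a fixed
    sequence length, as described for the Siamese network input."""
    folded = (unicodedata.normalize("NFKD", name)
              .encode("ascii", "ignore").decode().lower())
    codes = [ord(c) for c in folded[:max_len]]
    return codes + [0] * (max_len - len(codes))


def contrastive_loss(sim: float, is_same_entity: bool,
                     margin: float = 0.0) -> float:
    """One reading of the contrastive loss on a cosine similarity in
    [-1, 1]: variant pairs are pulled towards similarity 1; negative
    pairs are penalized whenever their similarity exceeds the margin
    (set to 0 in the paper)."""
    if is_same_entity:
        return (1.0 - sim) ** 2
    return max(0.0, sim - margin) ** 2
```

Note how the margin of 0 leaves well-separated negative pairs (negative cosine similarity) with zero loss, so the optimizer focuses on confusable name pairs.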
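The book-level top-k accuracy used for the global evaluation is straightforward to state in code; the data shapes below are illustrative:

```python
def top_k_accuracy(ranked_proposals, ground_truths, k):
    """Fraction of books whose annotated canonical name appears among
    the k highest-ranked proposals returned by the system."""
    hits = sum(
        truth in ranked[:k]
        for ranked, truth in zip(ranked_proposals, ground_truths)
    )
    return hits / len(ground_truths)
```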