Who wrote this book? A challenge for e-commerce

Béranger Dumont, Simona Maggio, Ghiles Sidi Said & Quoc-Tien Au
Rakuten Institute of Technology Paris
{beranger.dumont,simona.maggio}@rakuten.com, {ts-ghiles.sidisaid,quoctien.au}@rakuten.com

Proceedings of the 2019 EMNLP Workshop W-NUT: The 5th Workshop on Noisy User-generated Text, pages 121-125, Hong Kong, Nov 4, 2019. (c) 2019 Association for Computational Linguistics

Abstract

Modern e-commerce catalogs contain millions of references, associated with textual and visual information that is of paramount importance for the products to be found via search or browsing. Of particular significance is the book category, where the author name(s) field poses a significant challenge. Indeed, books written by a given author might be listed with different authors' names due to abbreviations, spelling variants and mistakes, among others. To solve this problem at scale, we design a composite system involving open data sources for books, as well as deep learning components, such as approximate match with Siamese networks and name correction with sequence-to-sequence networks. We evaluate this approach on product data from the e-commerce website Rakuten France, and find that the top proposal of the system is the normalized author name with 72% accuracy.

1 Introduction

Unlike brick-and-mortar stores, e-commerce websites can list hundreds of millions of products, with thousands of new products entering their catalogs every day. The availability and the reliability of the information on the products, or product data, are crucial for the products to be found by the users via textual or visual search, or using faceted navigation.

Books constitute a prominent part of many large e-commerce catalogs. Relevant book properties include: title, author(s), format, edition, and publication date, among others. In this work, we focus on the names of book authors, as they are found to be extremely relevant to the user and are commonly used in search queries on e-commerce websites, but suffer from considerable variability and noise. To the best of our knowledge, there is no large-scale public dataset for books that captures the variability arising on e-commerce marketplaces from user-generated input. Thus, in this work we use product data from Rakuten France[1] (RFR).

The variability and noise are evident in the RFR dataset. For example, books written by F. Scott Fitzgerald are also listed with the following author names: "Francis Scott Fitzgerald" (full name), "Fitzgerald, F. Scott" (inversion of the first and last name), "Fitzgerald" (last name only), "F. Scott Fitgerald" (misspelling of the last name), "F SCOTT FITZGERALD" (capitalization and different typographical conventions), as well as several combinations of those variations. The variability of the possible spellings for an author's name is very hard to capture using rules, even more so for names which are not primarily written in the Latin alphabet (such as Arabic or Asian names), for names containing titles (such as "Dr." or "Pr."), and for pen names which may not follow the usual conventions. This motivated us to explore automated techniques for normalizing the authors' names to their best known ("canonical") spellings.

Fortunately, a wealth of open databases exists for books, making it possible to match a significant fraction of the books listed in e-commerce catalogs. While not always clean and unambiguous, this information is extremely valuable and enables us to build datasets of name variants, used to train machine learning systems to normalize authors' names. To this end, in addition to the match with open databases, we will explore two different approaches: approximate match with known authors' names using Siamese neural networks, and direct correction of the provided author's name using sequence-to-sequence learning with neural networks. Then, an additional machine learning component is used to rank the results.

The rest of the paper is organized as follows: we present the data from RFR and from the open databases in Section 2, before turning to the experimental setup for the overall system and for each of its components in Section 3. Finally, we give results in Section 4, we present related works in Section 5, and conclude in Section 6.

[1] https://fr.shopping.rakuten.com

2 Book data

2.1 Rakuten France data

The RFR dataset contains 12 million book references[2]. The most relevant product data for normalization is:

• ISBN[3] in 10-digit or 13-digit format;
• product title, which includes the book title, often supplemented with extra information in free text;
• author(s) of the book as the input catalog name provided by the seller.

In particular, the ISBN is a worldwide unique identifier for books, which makes it a prime candidate for unambiguous matching with external sources. In this dataset, an ISBN is present for about 70% of the books. Among the books with no ISBN, 30% are ancient books which are not expected to be associated with an ISBN.

[2] The RFR dataset is publicly available at https://rit.rakuten.co.jp/data_release.
[3] International Standard Book Number, see https://www.isbn-international.org

Figure 1: Overview of the system for normalizing author names. Each component is detailed in Section 3.

2.2 External bibliographic resources

There is no central authority providing consistent information on books associated with an ISBN. However, there is a wealth of bibliographic resources and open databases for books. In order to retrieve the author name(s) associated with the books in the RFR dataset, we perform ISBN matching using public APIs on eight of them, listed in Table 1 along with the fraction of found ISBNs from this dataset. We find the sources to be highly complementary and that 75% of the books with an ISBN are matched with at least one source. The match via ISBN on external bibliographic resources is the first component of the system depicted in Fig. 1.

Table 1: Performance of the external bibliographic resources used for matching books on RFR via ISBN.

Source        URL                % of ISBNs
Open Library  openlibrary.org    24.9%
ISBNdb        isbndb.com         36.3%
Goodreads     www.goodreads.com  64.7%
Google Books  books.google.com   51.2%
OCLC          www.oclc.org       52.2%
BnF           www.bnf.fr          7.4%
Sudoc         www.sudoc.abes.fr  29.0%
Babelio       www.babelio.com     7.9%

2.3 Dataset of name entities

In order to train and evaluate machine learning systems to match or correct authors' names, a dataset of name entities containing the different surface forms (or variants) of authors' names is required. The entities should reflect as well as possible the variability that can be found in the RFR dataset, as was illustrated in the case of F. Scott Fitzgerald in Section 1.

For each entity, a canonical name should be elected and correspond to the name that should be preferred for the purpose of e-commerce. Instead of setting these gold spellings by following some predefined rules (i.e. family name in the first position, initial of first name, etc.), for e-commerce applications it is more appropriate that the displayed author names have the most popular spellings among readers. In agreement with Rakuten catalog analysts, we set the most popular spelling of an author name as the one found on Wikipedia[4] or DBpedia (Lehmann et al., 2015).

While Wikipedia seems more pertinent to select canonical names matching e-commerce user expectations, specialized librarian data services, such as the Library of Congress Name Authority[5], could be used in future research to enrich the dataset of name entities.

Name entities are collected in three distinct ways:

1. ISBN matching: for each book, the different author names found via ISBN search on external sources and the RFR author name field build up an entity. The canonical form is the one that is matched with Wikipedia or DBpedia; else the one provided by the greatest number of sources.

2. Matching of Rakuten authors: we build entities using fuzzy search on the author name field on DBpedia and consider the DBpedia value to be canonical. We limit the number of false positives in fuzzy search by tokenizing both names, and keeping only the names where at least one token from the name on RFR is approximately found in the external resource (Levenshtein distance < 2).

3. Name variants: DBpedia, BnF, and JRC-Names (Steinberger et al., 2011; Maud et al., 2016) directly provide data about people (not limited to book authors) and their name variants.

[4] https://www.wikipedia.org
[5] id.loc.gov/authorities/names.html

As an example, by using the wikiPageRedirects field in DBpedia we can build a large entity for the canonical name "Anton Tchekhov", containing "Anton Tchechov", "Anton Pavlovič Čechov", "Checkhov", "Anton Chekov", and many more.

After creating the name entity dataset, we normalize all names to latin-1. We obtain about 750,000 entities, for a total of 2.1 million names.

2.4 Annotated Rakuten France data

In order to evaluate the overall system, we need product data from RFR for which the canonical author name has been carefully annotated and can be considered as the ground truth. To this end, we have considered a subset of 1000 books from the RFR dataset, discarding books written by more than one author for simplicity[6]. We find that 467 books have a canonical author name that differs from RFR's original (unnormalized) author name. Also, 310 do not have an ISBN or do not match on any of the bibliographic resources listed in Section 2.2. Among them, 208 books have a canonical name that differs from the input catalog name provided by the seller.

[6] The annotated RFR dataset is publicly available at https://rit.rakuten.co.jp/data_release.

3 Experimental setup

The overview of the system can be found in Fig. 1. Its first component, the matching via ISBN against external databases, has already been presented in Section 2.2. In the rest of this section, we will shed light on the three machine learning components of the system.

3.1 Siamese approximate name matching

We want to learn a mapping that assigns a similarity score to a pair of author names such that name variants of the same entity will have high similarity, and names that belong to different entities will have low similarity. Once learned, this mapping will enable us to assign an entity to any given name.

To this end, we might use a classical string metric such as the Levenshtein distance or the n-gram distance (Kondrak, 2005). However, those are not specific to people's names, and might return a large distance (low similarity) in cases such as the inversion between first name and last name or the abbreviation of the first name to an initial. Thus, we want to use the dataset of name entities to learn a specialized notion of similarity; this is known as distance metric learning (Kulis et al., 2013).

To this end, we use a pair of neural networks with shared weights, or Siamese neural network (Bromley et al., 1994). Each network is a recurrent neural network (RNN) composed of a character-level embedding layer with 256 units, a bidirectional long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) with 2 × 128 units, and a dense layer with 256 units. Each network takes a name as input and outputs a representation; the two representations are then compared using cosine similarity, with a target value equal to 1 for name variants of the same entity, and to 0 otherwise. We preprocess the input by representing all characters in ASCII and lowercase. We consider a sequence length of 32 using zero padding.

The Siamese network is trained with contrastive loss (Hadsell et al., 2006) in order to push the similarity towards 1 for similar pairs, and below a certain margin (that we set to 0) for dissimilar pairs. The optimization is done using Adam (Kingma and Ba, 2014), with a learning rate of 10⁻³ and a gradient clipping value of 5. We use batches of 512 samples, consider a negative to positive pairs ratio of 4:1, and randomly generate new negative pairs at every epoch.

At test time, we search for the canonical name whose representation is closest to that of the query, using only the high-quality name entities from DBpedia, BnF, and JRC-Names. To this end, we do approximate nearest neighbor search using Annoy[7].

[7] https://github.com/spotify/annoy

3.2 Name correction with seq2seq networks

We use a generative model to correct and normalize authors' names directly. The dataset of name entities is again employed to train a sequence-to-sequence (seq2seq) model (Sutskever et al., 2014) to produce the canonical form of a name from one of its variants. The dataset is further augmented by including additional variants where the first name is abbreviated to an initial.

The seq2seq model is an encoder-decoder using RNNs, with a character embedding layer, as in the case of the Siamese network. The encoder is a bi-directional LSTM with 2 × 256 units, while the decoder is a plain LSTM with 512 units connected to a softmax layer that computes a probability distribution over the characters.

The training is performed by minimizing the categorical cross-entropy loss, using teacher forcing (Williams and Zipser, 1989). The optimization setting is identical to that of the Siamese network, with batches of 1024 samples. For inference, we collect the 10 output sequences with highest probability using beam search.

3.3 Ranking of the proposals

For any given book with an ISBN and an author's name, all three techniques shown in Fig. 1 provide one or several candidate canonical names. As we aim at providing an automated tool to enhance the quality of the book products, the final system should provide a ranked list of candidates with a calibrated confidence level. For this purpose we train a logistic regression to estimate the probability that a proposal is the canonical form for an author's name. This information is then used as a confidence score to rank the different candidate names returned by the three normalization approaches.

Specifically, we represent a proposal with a set of 12 features: 11 indicating whether it is found in the bibliographic sources, generated from the seq2seq model, matched with the Siamese network or equal to the input name, and one last feature corresponding to the cosine distance between the representation of the proposal and that of the input name. The selected features reflect that the confidence of the global system should increase with (i) the consensus among the different sources, and (ii) the similarity of the candidate to the input name.

For this component we use the annotated dataset introduced in Section 2.4, splitting the books between training and test sets with a ratio of 50%:50%, generating a total of 11185 proposals.

4 Results

The three machine learning components discussed in the previous section have been individually evaluated on their specific task. Furthermore, the final system has been evaluated in terms of correctly normalized book authors in a real case scenario.

Siamese approximate name matching. We evaluate the Siamese network on a held-out test set, and compare it to an n-gram distance, by checking that the nearest neighbor of a name variant is the canonical name of the entity to which it belongs. We find an accuracy of 79.8% for the Siamese network, against 71.1% for the n-gram baseline with n = 3. We have also checked metrics when introducing a threshold distance above which we consider that no matching entity is found, and found systematic improvement over the baseline. In the final system, we set the threshold to infinity.

Siamese networks are more effective than simpler rule-based approaches; more specifically, they perform better than the n-gram baseline on the following cases:

• Vittorio Hugo → Victor Hugo: capturing name variants in different languages;
• Bill Shakespeare → William Shakespeare: capturing common nicknames.

Name correction with seq2seq networks. Similarly to the previous approach, the seq2seq network is evaluated on a held-out test set by checking that one of the generated name variants is the canonical name of the entity to which it belongs. As expected, name normalization using a seq2seq network gives poorer performance than approximate matching within a dataset of known authors, but constitutes a complementary approach that is useful in the case of formatting issues or incomplete names. This approach alone reaches a top-10 accuracy of 42% on the entire test set, 26% on a test set containing only names with initials, and 53% on a test set containing only minor spelling mistakes.

Some examples where seq2seq performs better than the other methods are as follows:

• V. Hugo → Victor Hugo: first name prediction for authors we don't have in the canonical database;
• Vicor Hugo → Victor Hugo: misspelling correction for authors we don't have in the canonical database.

Ranking of the proposals. With a decision threshold of p = 0.5, the trained classifier has an accuracy of 93% for both positive and negative candidates in the test set. The coefficients of the estimator reveal the importance of the features and, thus, of the related components. The three most important contributions are the match with the Siamese network, the match via ISBN in Babelio, and the similarity with the input catalog name, confirming the relevance of a multi-approach design choice.

Global system. In order to reflect the actual use of the global system on e-commerce catalog data, the final evaluation is performed at the book level, by considering all the proposals provided by the different components for a given book. The metric used is the top-k accuracy on the ranked list of proposals for each book; results are summarized in Table 2. We find that 72% of the books have the author's name normalized by the highest ranked proposal. Excluding from the evaluation books where the ground truth for the author's name equals the catalog value, this accuracy drops to 49%. In the case of books without ISBN or that do not match on any of the bibliographic resources, thus relying on machine learning-based components only, we find that 50% of the books are normalized by the top proposal. Finally, for the combination of the above two restrictions, we find a top-1 accuracy of 35%.

Table 2: Global system top-k accuracy at the book level.

Type of books         #samples  acc@1  acc@3
all                        500    72%    85%
unnorm. input author       235    49%    67%
no ISBN match              151    50%    64%
unnorm. + no ISBN          109    35%    49%

5 Related works

There is a long line of work on author name disambiguation for the case of bibliographic citation records (Hussain and Asghar, 2017). While related, this problem differs from the one of book authors. Indeed, unlike most books, research publications usually have several authors, each of them having published papers with other researchers. The relationships among authors, which can be represented as a graph, may be used to help disambiguate the bibliographic citations.

Named entity linking (Shen et al., 2015), where one aims at determining the identity of entities (such as a person's name) mentioned in text, is another related problem. The crucial difference with the disambiguation of book authors is that entity linking systems leverage the context of the named entity mention to link unambiguously to an entity in a pre-populated knowledge base.

The conformity of truth in web resources is also a related problem, addressed in the literature by algorithms such as TruthFinder (Yin et al., 2008). Similarly, the proposed global model in which we combine sources learns to some extent the level of trust of the different
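For illustration, the token-level fuzzy filter described above (keep a DBpedia candidate only if at least one token of the RFR name is approximately found in it, with Levenshtein distance < 2) can be sketched as follows; the function names and the exact tokenization are ours, not the authors':

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def plausible_match(rfr_name: str, candidate: str, max_dist: int = 1) -> bool:
    """Keep a fuzzy-search candidate only if at least one token of the
    RFR name is approximately found among the candidate's tokens
    (Levenshtein distance < 2, i.e. at most 1)."""
    cand_tokens = candidate.lower().split()
    return any(
        min(levenshtein(tok, ct) for ct in cand_tokens) <= max_dist
        for tok in rfr_name.lower().split()
    )
```

Under this rule, a misspelled "F. Scott Fitgerald" is still accepted against "Francis Scott Fitzgerald" (one edit away on the last name), while unrelated names are rejected.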
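The preprocessing and training objective of Section 3.1 can be made concrete with a small sketch. This is our own plain-Python reading of the setup (ASCII folding, lowercasing, zero padding to length 32, and a contrastive loss on cosine similarity with margin 0), not the authors' code:

```python
import unicodedata


def preprocess(name: str, max_len: int = 32) -> list:
    """Fold to ASCII, lowercase, then truncate/zero-pad to a fixed
    sequence length, as described for the Siamese network input."""
    folded = (unicodedata.normalize("NFKD", name)
              .encode("ascii", "ignore").decode().lower())
    codes = [ord(c) for c in folded[:max_len]]
    return codes + [0] * (max_len - len(codes))


def contrastive_loss(sim: float, is_same_entity: bool,
                     margin: float = 0.0) -> float:
    """One reading of the contrastive loss on a cosine similarity in
    [-1, 1]: variant pairs are pulled towards similarity 1; negative
    pairs are penalized whenever their similarity exceeds the margin
    (set to 0 in the paper)."""
    if is_same_entity:
        return (1.0 - sim) ** 2
    return max(0.0, sim - margin) ** 2
```

Note how the margin of 0 leaves well-separated negative pairs (negative cosine similarity) with zero loss, so the optimizer focuses on confusable name pairs.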
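The book-level top-k accuracy used for the global evaluation is straightforward to state in code; the data shapes below are illustrative:

```python
def top_k_accuracy(ranked_proposals, ground_truths, k):
    """Fraction of books whose annotated canonical name appears among
    the k highest-ranked proposals returned by the system."""
    hits = sum(
        truth in ranked[:k]
        for ranked, truth in zip(ranked_proposals, ground_truths)
    )
    return hits / len(ground_truths)
```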