Who wrote this book? A challenge for e-commerce
Béranger Dumont, Simona Maggio, Ghiles Sidi Said & Quoc-Tien Au
Rakuten Institute of Technology Paris
{beranger.dumont,simona.maggio}@rakuten.com,
{ts-ghiles.sidisaid,quoctien.au}@rakuten.com
Proceedings of the 2019 EMNLP Workshop W-NUT: The 5th Workshop on Noisy User-generated Text, pages 121–125, Hong Kong, Nov 4, 2019. © 2019 Association for Computational Linguistics

Abstract

Modern e-commerce catalogs contain millions of references, associated with textual and visual information that is of paramount importance for the products to be found via search or browsing. Of particular significance is the book category, where the author name(s) field poses a significant challenge. Indeed, books written by a given author might be listed with different authors’ names due to abbreviations, spelling variants and mistakes, among others. To solve this problem at scale, we design a composite system involving open data sources for books, as well as deep learning components, such as approximate match with Siamese networks and name correction with sequence-to-sequence networks. We evaluate this approach on product data from the e-commerce website Rakuten France, and find that the top proposal of the system is the normalized author name with 72% accuracy.

1 Introduction

Unlike brick-and-mortar stores, e-commerce websites can list hundreds of millions of products, with thousands of new products entering their catalogs every day. The availability and the reliability of the information on the products, or product data, is crucial for the products to be found by the users via textual or visual search, or using faceted navigation.

Books constitute a prominent part of many large e-commerce catalogs. Relevant book properties include: title, author(s), format, edition, and publication date, among others. In this work, we focus on the names of book authors, as they are found to be extremely relevant to the user and are commonly used in search queries on e-commerce websites, but suffer from considerable variability and noise. To the best of our knowledge, there is no large-scale public dataset for books that captures the variability arising on e-commerce marketplaces from user-generated input. Thus, in this work we use product data from Rakuten France (RFR).¹

¹ https://fr.shopping.rakuten.com

The variability and noise are evident in the RFR dataset. For example, books written by F. Scott Fitzgerald are also listed with the following author names: “Francis Scott Fitzgerald” (full name), “Fitzgerald, F. Scott” (inversion of the first and last name), “Fitzgerald” (last name only), “F. Scott Fitgerald” (misspelling of the last name), “F SCOTT FITZGERALD” (capitalization and different typographical conventions), as well as several combinations of those variations. The variability of the possible spellings for an author’s name is very hard to capture using rules, even more so for names which are not primarily written in the Latin alphabet (such as Arabic or Asian names), for names containing titles (such as “Dr.” or “Pr.”), and for pen names which may not follow the usual conventions. This motivated us to explore automated techniques for normalizing the authors’ names to their best known (“canonical”) spellings.

Fortunately, a wealth of open databases exists for books, making it possible to match a significant fraction of the books listed in e-commerce catalogs. While not always clean and unambiguous, this information is extremely valuable and enables us to build datasets of name variants, used to train machine learning systems to normalize authors’ names. To this end, in addition to the match with open databases, we explore two different approaches: approximate match with known authors’ names using Siamese neural networks, and direct correction of the provided author’s name using sequence-to-sequence learning with neural networks. Then, an additional machine learning component is used to rank the results.

The rest of the paper is organized as follows: we present the data from RFR and from the open databases in Section 2, before turning to the experimental setup for the overall system and for each of its components in Section 3. Finally, we give results in Section 4, present related works in Section 5, and conclude in Section 6.

2 Book data

2.1 Rakuten France data

The RFR dataset contains 12 million book references.² The most relevant product data for normalization is:

² The RFR dataset is publicly available at https://rit.rakuten.co.jp/data_release.

• ISBN³ in 10 digit or 13 digit format;
• product title, which includes the book title, often supplemented with extra information in free text;
• author(s) of the book as the input catalog name provided by the seller.

³ International Standard Book Number, see https://www.isbn-international.org

In particular, the ISBN is a worldwide unique identifier for books, which makes it a prime candidate for unambiguous matching with external sources. In this dataset, an ISBN is present for about 70% of the books. Among the books with no ISBN, 30% are ancient books which are not expected to be associated with an ISBN.

2.2 External bibliographic resources

There is no central authority providing consistent information on books associated with an ISBN. However, there is a wealth of bibliographic resources and open databases for books. In order to retrieve the author’s name(s) associated with the books in the RFR dataset, we perform ISBN matching using public APIs on eight of them, listed in Table 1 along with the fraction of found ISBNs from this dataset. We find the sources to be highly complementary and that 75% of the books with an ISBN are matched with at least one source. The match via ISBN on external bibliographic resources is the first component of the system depicted in Fig. 1.

Table 1: Performance of the external bibliographic resources used for matching books on RFR via ISBN.

Source         URL                  % of ISBNs
OpenLibrary    openlibrary.org      24.9%
ISBNdb         isbndb.com           36.3%
Goodreads      www.goodreads.com    64.7%
Google Books   books.google.com     51.2%
OCLC           www.oclc.org         52.2%
BnF            www.bnf.fr           7.4%
Sudoc          www.sudoc.abes.fr    29.0%
Babelio        www.babelio.com      7.9%

[Figure 1 diagram. Inputs: the book’s ISBN and the catalog name (“F. S. Fitzgerald”). Components: match with open bibliographic sources; Siamese approximate name matching; seq2seq name correction. Example proposals: “Fitzgerald, F. Scott”, “F. Scott Fitzgerald”, “Frank S. Fitzgerald”. A ranking step outputs “F. Scott Fitzgerald”.]
Figure 1: Overview of the system for normalizing author names. Each component is detailed in Section 3.

2.3 Dataset of name entities

In order to train and evaluate machine learning systems to match or correct authors’ names, a dataset of name entities containing the different surface forms (or variants) of authors’ names is required. The entities should reflect as well as possible the variability that can be found in the RFR dataset, as was illustrated in the case of F. Scott Fitzgerald in Section 1.

For each entity, a canonical name should be elected and correspond to the name that should be preferred for the purpose of e-commerce. Instead of setting these gold spellings by following some predefined rules (e.g. family name in the first position, initial of first name), for e-commerce applications it is more appropriate that the displayed authors’ names have the most popular spellings among readers. In agreement with Rakuten catalog analysts we set the most popular spelling of an author name as the one found on Wikipedia⁴ or DBpedia (Lehmann et al., 2015).

⁴ https://www.wikipedia.org

While Wikipedia seems more pertinent to select canonical names matching the e-commerce user expectations, specialized librarian data services, such as the Library of Congress Name Authority⁵, could be used in future research to enrich the dataset of name entities.

⁵ id.loc.gov/authorities/names.html

Name entities are collected in three distinct ways:

1. ISBN matching: for each book the different author names found via ISBN search on external sources and the RFR author name field build up an entity. The canonical form is the one that is matched with Wikipedia or DBpedia; else the one provided by the greatest number of sources.

2. Matching of Rakuten authors: we build entities using fuzzy search on the author name field on DBpedia and consider the DBpedia value to be canonical. We limit the number of false positives in fuzzy search by tokenizing both names, and keeping only the names where at least one token from the name on RFR is approximately found in the external resource (Levenshtein distance < 2).

3. Name variants: DBpedia, BnF, and JRC-names (Steinberger et al., 2011; Maud et al., 2016) directly provide data about people (not limited to book authors) and their name variants.

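The token-level filter described in item 2 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names (`levenshtein`, `tokenize`, `shares_approximate_token`), the lowercase tokenization on non-word characters, and the example names are assumptions. Only the rule itself (at least one token pair with Levenshtein distance < 2) comes from the text.

```python
import re


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def tokenize(name):
    # Assumption: lowercase, then split on anything that is not a word character.
    return [token for token in re.split(r"\W+", name.lower()) if token]


def shares_approximate_token(rfr_name, external_name, max_distance=1):
    """True if some token of rfr_name is within max_distance edits of some
    token of external_name (max_distance=1 encodes "distance < 2")."""
    return any(
        levenshtein(rfr_token, ext_token) <= max_distance
        for rfr_token in tokenize(rfr_name)
        for ext_token in tokenize(external_name)
    )


if __name__ == "__main__":
    print(shares_approximate_token("F. Scott Fitgerald", "Francis Scott Fitzgerald"))
    print(shares_approximate_token("Victor Hugo", "Jane Austen"))
```

Under this rule the misspelled “F. Scott Fitgerald” is kept as a candidate match for “Francis Scott Fitzgerald”, while “Victor Hugo” and “Jane Austen” share no approximately equal token. One caveat of the literal rule: single-letter tokens such as initials are within distance 1 of any other single letter, so a practical filter may need to ignore very short tokens. That refinement is not described in the text.
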
As an example, by using the wikiPageRedirects field in DBpedia we can build a large entity for the canonical name “Anton Tchekhov”, containing “Anton Tchechov”, “Anton Pavlovic Chechov”, “Checkhov”, “Anton Chekov”, and many more.

After creating the name entity dataset, we normalize all names to latin-1. We obtain about 750,000 entities, for a total of 2.1 million names.

2.4 Annotated Rakuten France data

In order to evaluate the overall system, we need product data from RFR for which the canonical author name has been carefully annotated and can be considered as the ground truth. To this end, we have considered a subset of 1000 books from the RFR dataset, discarding books written by more than one author for simplicity.⁶ We find that 467 books have a canonical author name that differs from RFR’s original (unnormalized) author name. Also, 310 do not have an ISBN or do not match on any of the bibliographic resources listed in Section 2.2. Among them, 208 books have a canonical name that differs from the input catalog name provided by the seller.

⁶ The annotated RFR dataset is publicly available at https://rit.rakuten.co.jp/data_release.

3 Experimental setup

The overview of the system can be found in Fig. 1. Its first component, the matching via ISBN against external databases, has already been presented in Section 2.2. In the rest of this section, we will shed light on the three machine learning components of the system.

3.1 Siamese approximate name matching

We want to learn a mapping that assigns a similarity score to a pair of author names such that name variants of the same entity will have high similarity, and names that belong to different entities will have low similarity. Once learned, this mapping will enable us to assign an entity to any given name.

To this end, we might use a classical string metric such as the Levenshtein distance or the n-gram distance (Kondrak, 2005). However, those are not specific to people’s names, and might return a large distance (low similarity) in cases such as the inversion between first name and last name or the abbreviation of the first name to an initial. Thus, we want to use the dataset of name entities to learn a specialized notion of similarity; this is known as distance metric learning (Kulis et al., 2013).

To this purpose, we use a pair of neural networks with shared weights, or Siamese neural network (Bromley et al., 1994). Each network is a recurrent neural network (RNN) composed of a character-level embedding layer with 256 units, a bidirectional long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) with 2 × 128 units, and a dense layer with 256 units. Each network takes a name as input and outputs a representation; the two representations are then compared using cosine similarity with a target value equal to 1 for name variants of the same entity, and to 0 otherwise. We preprocess the input by representing all characters in ASCII and lowercase. We consider a sequence length of 32 using zero padding.

The Siamese network is trained with contrastive loss (Hadsell et al., 2006) in order to push the similarity towards 1 for similar pairs, and below a certain margin (that we set to 0) for dissimilar pairs. The optimization is done using Adam (Kingma and Ba, 2014), with a learning rate of 10⁻³ and a gradient clipping value of 5. We use batches of 512 samples, consider a negative to positive pairs ratio of 4:1, and randomly generate new negative pairs at every epoch.

At test time, we search for the canonical name whose representation is closest to that of the query, using only the high-quality name entities from DBpedia, BnF, and JRC-names. To this end, we do approximate nearest neighbor search using Annoy.⁷

⁷ https://github.com/spotify/annoy

3.2 Name correction with seq2seq networks

We use a generative model to correct and normalize authors’ names directly. The dataset of name entities is again employed to train a sequence-to-sequence (seq2seq) model (Sutskever et al., 2014) to produce the canonical form of a name from one of its variants. The dataset is further augmented by including additional variants where the first name is abbreviated to an initial.

The seq2seq model is an encoder-decoder using RNNs, with a character embedding layer, as in the case of the Siamese network. The encoder is a bidirectional LSTM with 2 × 256 units, while the decoder is a plain LSTM with 512 units connected to a softmax layer that computes a probability distribution over the characters.

The training is performed by minimizing the categorical cross-entropy loss, using teacher forcing (Williams and Zipser, 1989). The optimization setting is identical to that of the Siamese network, with batches of 1024 samples. For inference, we collect the 10 output sequences with highest probability using beam search.

3.3 Ranking of the proposals

For any given book with an ISBN and an author’s name, all three techniques shown in Fig. 1 provide one or several candidate canonical names. As we aim at providing an automated tool to enhance the quality of the book products, the final system should provide a ranked list of candidates with a calibrated confidence level. For this purpose we train a logistic regression to estimate the probability that a proposal is the canonical form for an author’s name. This information is then used as a confidence score to rank the different candidate names returned by the three normalization approaches.

Specifically, we represent a proposal with a set of 12 features: 11 indicating whether it is found in the bibliographic sources, generated from the seq2seq model, matched with the Siamese network or equal to the input name, and one last feature corresponding to the cosine
distance between the representation of the proposal and that of the input name. The selected features reflect that the confidence of the global system should increase with (i) the consensus among the different sources, and (ii) the similarity of the candidate to the input name.

For this component we use the annotated dataset introduced in Section 2.4, splitting the books between training and test sets, with a ratio of 50% : 50%, generating a total of 11,185 proposals.

4 Results

The three machine learning components discussed in the previous section have been individually evaluated on their specific task. Furthermore, the final system has been evaluated in terms of correctly normalized book authors in a real-case scenario.

Siamese approximate name matching. We evaluate the Siamese network on a held-out test set, and compare it to an n-gram distance, by checking that the nearest neighbor of a name variant is the canonical name of the entity to which it belongs. We find an accuracy of 79.8% for the Siamese network, against 71.1% for the n-gram baseline with n = 3. We have also checked metrics when introducing a threshold distance above which we consider that no matching entity is found, and found systematic improvement over the baseline. In the final system, we set the threshold to infinity.

Siamese networks are more effective than simpler rule-based approaches and, more specifically, they perform better than the n-gram baseline in the following cases:

• Vittorio Hugo → Victor Hugo: capturing name variants in different languages;
• Bill Shakespeare → William Shakespeare: capturing common nicknames.

Name correction with seq2seq networks. Similarly to the previous approach, the seq2seq network is evaluated on a held-out test set by checking that one of the generated name variants is the canonical name of the entity to which it belongs. As expected, name normalization using the seq2seq network performs worse than approximate matching within a dataset of known authors, but constitutes a complementary approach that is useful in case of formatting issues or incomplete names. This approach alone reaches a top-10 accuracy of 42% on the entire test set, 26% on a test set containing only names with initials, and 53% on a test set containing only minor spelling mistakes.

Some examples where seq2seq performs better than the other methods are as follows:

• V. Hugo → Victor Hugo: first name prediction for authors we don’t have in the canonical database;
• Vicor Hugo → Victor Hugo: misspelling correction for authors we don’t have in the canonical database.

Ranking of the proposals. With a decision threshold of p = 0.5, the trained classifier has an accuracy of 93% for both positive and negative candidates in the test set. The coefficients of the estimator reveal the importance of the features and, thus, of the related components. The three most important contributions are the match with the Siamese network, the match via ISBN in Babelio, and the similarity with the input catalog name, confirming the relevance of a multi-approach design choice.

Global system. In order to reflect the actual use of the global system on e-commerce catalog data, the final evaluation is performed at the book level, by considering all the proposals provided by the different components for a given book. The metric used is the top-k accuracy on the ranked list of proposals for each book; results are summarized in Table 2. We find that 72% of the books have the author’s name normalized by the highest ranked proposal. Excluding from the evaluation books where the ground truth for the author’s name equals the catalog value, this accuracy drops to 49%. In the case of books without ISBN or that do not match on any of the bibliographic resources, thus relying on machine learning-based components only, we find that 50% of the books are normalized by the top proposal. Finally, for the combination of the above two restrictions, we find a top-1 accuracy of 35%.

Table 2: Global system top-k accuracy at the book level.

Type of books          #samples   acc@1   acc@3
all                    500        72%     85%
unnorm. input author   235        49%     67%
no ISBN match          151        50%     64%
unnorm. + no ISBN      109        35%     49%

5 Related works

There is a long line of work on author name disambiguation for the case of bibliographic citation records (Hussain and Asghar, 2017). While related, this problem differs from that of book authors. Indeed, unlike most books, research publications usually have several authors, each of them having published papers with other researchers. The relationships among authors, which can be represented as a graph, may be used to help disambiguate the bibliographic citations.

Named entity linking (Shen et al., 2015), where one aims at determining the identity of entities (such as a person’s name) mentioned in text, is another related problem. The crucial difference from the disambiguation of book authors is that entity linking systems leverage the context of the named entity mention to link unambiguously to an entity in a pre-populated knowledge base.

The conformity of truth in web resources is also a related problem, addressed in the literature by TruthFinder (Yin et al., 2008) algorithms. Similarly, the proposed global model in which we combine sources learns to some extent the level of trust of the different