Minimal Loss Hashing for Compact Binary Codes

Mohammad Norouzi  norouzi@cs.toronto.edu
David J. Fleet  fleet@cs.toronto.edu
Department of Computer Science, University of Toronto, Canada

Abstract

We propose a method for learning similarity-preserving hash functions that map high-dimensional data onto binary codes. The formulation is based on structured prediction with latent variables and a hinge-like loss function. It is efficient to train for large datasets, scales well to large code lengths, and outperforms state-of-the-art methods.

1. Introduction

Approximate nearest neighbor (ANN) search in large datasets is used widely. In computer vision, for example, it has been used for content-based retrieval (Jégou et al., 2008), object recognition (Lowe, 2004), and human pose estimation (Shakhnarovich et al., 2003). A common approach, particularly well suited to high-dimensional data, is to use similarity-preserving hash functions, where similar inputs are mapped to nearby binary codes. One can preserve Euclidean distances, e.g., with Locality-Sensitive Hashing (LSH) (Indyk & Motwani, 1998), or one might want to preserve the similarity of discrete class labels, or real-valued parameters associated with training exemplars.

Compact binary codes are particularly useful for ANN. If the nearest neighbors of a point are within a small hypercube in the Hamming space, then ANN search can be performed in sublinear time, treating binary codes as hash keys. Even for an exhaustive, linear scan through the database, binary codes enable very fast search. Compact binary codes also allow one to store large databases in memory.

We formulate the learning of compact binary codes in terms of structured prediction with latent variables, using a new class of loss functions designed for hashing. The task is to find a hash function that maps high-dimensional inputs, x ∈ R^p, onto binary codes, h ∈ H ≡ {0,1}^q, which preserves some notion of similarity. The canonical approach assumes centered (mean-subtracted) inputs, linear projection, and binary quantization. Such hash functions, parameterized by W ∈ R^{q×p}, are given by

    b(x; w) = thr(Wx),    (1)

where w ≡ vec(W), and the i-th bit of the vector thr(Wx) is 1 iff the i-th element of (Wx) is positive. In other words, the i-th row of W determines the i-th bit of the hash function in terms of a hyperplane in the input space; 0 is assigned to points on one side of the hyperplane, and 1 to points on the other side.¹

Random projections are used in LSH (Indyk & Motwani, 1998; Charikar, 2002) and related methods (Raginsky & Lazebnik, 2009). They are dataset independent, make no prior assumption about the data distribution, and come with theoretical guarantees that specific metrics (e.g., cosine similarity) are increasingly well preserved in Hamming space as the code length increases. But they require large code lengths for good retrieval accuracy, and they are not applicable to general similarity measures, like human ratings.

Recent work has focused on learning compact codes. Two early approaches showed how one might preserve semantically meaningful data abstractions (Shakhnarovich et al., 2003; Salakhutdinov & Hinton, 2009). Multilayer neural networks worked well for document retrieval (Salakhutdinov & Hinton, 2009) and for large image corpora (Torralba et al., 2008). However, these methods typically require large training sets, long learning times, and relatively slow indexing, and they exhibit poorer performance than more recent methods (Kulis & Darrell, 2009).

Continuous relaxations have been used to avoid optimization with discontinuous mappings to binary codes. Spectral Hashing (SH) aims to preserve Euclidean distance with an eigenvector formulation (Weiss et al., 2008). The resulting projection directions can be interpreted in terms of the principal directions of the data. Empirically, SH works well for relatively small codes, but often underperforms LSH for longer code lengths. Wang et al. (2010) extended SH to incorporate discrete supervision where sets of similar and dissimilar training points are available. Like Wang et al., Lin et al. (2010) propose an algorithm that learns binary hash functions, one bit at a time, based on similarity labels. In contrast, our method is not sequential; it optimizes all the code bits simultaneously.

Binary reconstructive embedding (BRE) (Kulis & Darrell, 2009) uses a loss function that penalizes the difference between Euclidean distance in the input space and the Hamming distance between binary codes:

    ℓ_bre(m_ij, d_ij) = ( (1/q) m_ij − (1/2) d_ij )².    (2)

Here, d_ij is the Euclidean distance between two inputs of unit length, and m_ij is the Hamming distance between their corresponding q-bit binary codes. The hash function is found by minimizing empirical loss, i.e., the sum of the pairwise loss, ℓ_bre, over training pairs. Experiments on several datasets demonstrated improved precision of BRE over SH and LSH. One limitation of BRE is the high storage cost required for training, making large datasets impractical.

In this paper we provide a new formulation for learning binary hash functions, based on structural SVMs with latent variables (Yu & Joachims, 2009) and an effective online learning algorithm. We also introduce a type of loss function specifically designed for hashing, taking Hamming distance and binary quantization into account. The resulting approach is shown to significantly outperform state-of-the-art methods.

2. Formulation

Our goal is to learn a binary hash function that preserves similarity between pairs of training exemplars. We assume mappings from R^p to H given by (1). The quality of a mapping is determined by a loss function L : H × H × {0,1} → R that assigns a cost to a pair of binary codes and a similarity label. For binary codes h, g ∈ H, and a label s ∈ {0,1}, the loss function L(h, g, s) measures how compatible h and g are with s. For example, when s = 1, the loss assigns a small cost if h and g are nearby codes, and a large cost otherwise. Ultimately, to learn w, we minimize (regularized) empirical loss over training pairs:

    L(w) = Σ_{(i,j)∈S} L(b(x_i; w), b(x_j; w), s_ij).    (3)

The loss function we advocate is specific to learning binary hash functions, and bears some similarity to the hinge loss used in SVMs. It includes a hyper-parameter ρ, which is a threshold in the Hamming space that differentiates neighbors from non-neighbors. This is important for learning hash keys, since we will want similar training points to map to binary codes that differ by no more than ρ bits; non-neighbors should map to codes no closer than ρ bits.

Let ||h − g||_H denote the Hamming distance between codes h, g ∈ H. Our hinge-like loss function, denoted ℓ_ρ, depends on ||h − g||_H and not on the individual codes, i.e., L(h, g, s) = ℓ_ρ(||h − g||_H, s):

    ℓ_ρ(m, s) = { max(m − ρ + 1, 0)      for s = 1
                { λ max(ρ − m + 1, 0)    for s = 0    (4)

where m ≡ ||h − g||_H, and λ is a loss hyper-parameter that controls the ratio of the slopes of the penalties incurred when similar points are too far apart versus when dissimilar points are too close. Linear penalties are useful as they are robust to outliers (e.g., in contrast to the quadratic penalty in ℓ_bre). Further, note that when similar points are sufficiently close, or dissimilar points are sufficiently distant, our loss does not impose any penalty.

----
¹ One can add an offset from the origin, but we find the gain is marginal. Nonlinear projections are also possible, but in this paper we concentrate on linear projections.

Appearing in Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011. Copyright 2011 by the author(s)/owner(s).
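To make the hash function (1) and the hinge-like loss (4) concrete, the following Python/NumPy sketch (our own illustration, not part of the paper; the function names are ours) computes binary codes and evaluates ℓ_ρ for a pair of codes at Hamming distance m:

```python
import numpy as np

def b_hash(X, W):
    """Eq. (1): thr(WX). Columns of X are inputs; returns one q-bit code per column."""
    return (W @ X > 0).astype(np.uint8)

def loss_rho(m, s, rho, lam):
    """Eq. (4): hinge-like loss on Hamming distance m with similarity label s."""
    if s == 1:
        return max(m - rho + 1, 0)        # similar pair penalized beyond rho-1 bits
    return lam * max(rho - m + 1, 0)      # dissimilar pair penalized within rho+1 bits
```

Note that, exactly as the text states, loss_rho(1, 1, rho=2, lam=0.5) and loss_rho(3, 0, rho=2, lam=0.5) both return 0: sufficiently close similar pairs and sufficiently distant dissimilar pairs incur no penalty.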
In what follows, D comprises N centered, p-dimensional training points {x_i}_{i=1}^N, and S is the set of pairs for which similarity labels exist. The binary similarity labels are given by {s_ij}_{(i,j)∈S}, where x_i and x_j are similar when s_ij = 1, and dissimilar when s_ij = 0. To preserve a specific metric (e.g., Euclidean distance) one can use binary similarity labels obtained by thresholding pairwise distances. Alternatively, one can use weakly supervised data for which each training point is associated with a set of neighbors (similar exemplars) and non-neighbors (dissimilar exemplars). This is useful, for example, for preserving similarity based on semantic content.

3. Bound on Empirical Loss

The empirical loss in (3) is discontinuous and typically non-convex, making optimization difficult. As a consequence, rather than directly minimizing empirical loss, we instead formulate, and minimize, a piecewise linear upper bound on empirical loss. Our bound is inspired by a bound used, for similar reasons, in structural SVMs with latent variables (Yu & Joachims, 2009).

We first re-express the hash function in (1) as a form of structured prediction:

    b(x; w) = argmax_{h∈H} h^T W x    (5)
            = argmax_{h∈H} w^T ψ(x, h),    (6)

where ψ(x, h) ≡ vec(h x^T). Here, w^T ψ(x, h) acts as a scoring function that determines the relevance of input-code pairs, based on a weighted sum of features in the joint feature vector ψ(x, h). Other forms of ψ(·,·) are possible, leading to other hash functions.

To motivate our upper bound on empirical loss, we begin with a short review of the bound commonly used for structural SVMs (Taskar et al., 2003; Tsochantaridis et al., 2004).

3.1. Structural SVM

In structural SVMs (SSVM), given input-output training pairs {(x_i, y_i*)}_{i=1}^N, one aims to learn a mapping from inputs to outputs in terms of a parameterized scoring function f(x, y; w):

    ŷ = argmax_y f(x, y; w).    (7)

Given a loss function on the output domain, L(·,·), the SSVM with margin-rescaling introduces a margin violation (slack) variable for each training pair, and minimizes the sum of the slack variables. For a pair (x, y*), slack is defined as max_y [L(y, y*) + f(x, y; w)] − f(x, y*; w). Importantly, the slack variables provide an upper bound on the loss for the predictor ŷ; i.e.,

    L(ŷ, y*) ≤ max_y [L(y, y*) + f(x, y; w)] − f(x, ŷ; w)    (8)
             ≤ max_y [L(y, y*) + f(x, y; w)] − f(x, y*; w).    (9)

To see the inequality in (8), note that, if the first term on the RHS of (8) is maximized by y = ŷ, then the f terms cancel, and (8) becomes an equality. Otherwise, the optimal value of the max term must be larger than when y = ŷ, which yields the inequality. The second inequality (9) follows straightforwardly from the definition of ŷ in (7); i.e., f(x, ŷ; w) ≥ f(x, y; w) for all y. The bound in (9) is piecewise linear, convex in w, and easier to optimize than the empirical loss.

3.2. Convex-concave bound for hashing

The difference between learning hash functions and the SSVM is that the binary codes for our training data are not known a priori. But note that the tighter bound in (8) uses y* only in the loss term, which is useful for hash function learning, because suitable loss functions for hashing, such as (4), do not require ground-truth labels. The bound (8) is piecewise linear, convex-concave (a sum of convex and concave terms), and is the basis for SSVMs with latent variables (Yu & Joachims, 2009). Below we formulate a similar bound for learning binary hash functions.

Our upper bound on the loss function L, given a pair of inputs, x_i and x_j, a supervisory label s_ij, and the parameters of the hash function w, has the form

    L(b(x_i; w), b(x_j; w), s_ij)
      ≤ max_{g_i, g_j ∈ H} [ L(g_i, g_j, s_ij) + g_i^T W x_i + g_j^T W x_j ]
        − max_{h_i ∈ H} h_i^T W x_i − max_{h_j ∈ H} h_j^T W x_j.    (10)

The proof for (10) is similar to that for (8) above. It follows from (5) that the second and third terms on the RHS of (10) are maximized by h_i = b(x_i; w) and h_j = b(x_j; w). If the first term were maximized by g_i = b(x_i; w) and g_j = b(x_j; w), then the inequality in (10) would become an equality. For all other values of g_i and g_j that maximize the first term, the RHS can only increase, hence the inequality. The bound holds for ℓ_ρ, ℓ_bre, and any similar loss function, with binary labels s_ij or real-valued labels d_ij.

We formulate the optimization for the weights w of the hashing function in terms of minimization of the following convex-concave upper bound on empirical loss:

    Σ_{(i,j)∈S} [ max_{g_i, g_j ∈ H} ( L(g_i, g_j, s_ij) + g_i^T W x_i + g_j^T W x_j )
                  − max_{h_i ∈ H} h_i^T W x_i − max_{h_j ∈ H} h_j^T W x_j ].    (11)

4. Optimization

Minimizing (11) to find w entails the maximization of three terms for each pair (i, j) ∈ S. The second and third terms are trivially maximized directly by the hash function (5). Maximizing the first term is, however, not trivial; it is similar to loss-adjusted inference in SSVMs. The next section describes an efficient algorithm for finding the exact solution of loss-adjusted inference for hash function learning.
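The second and third max terms in (10) and (11) decompose over bits: since h^T Wx = Σ_k h_[k] (Wx)_[k], the maximizing code sets bit k to 1 exactly when (Wx)_[k] > 0, i.e., h = b(x; w) = thr(Wx), which is why the hash function (5) maximizes them "trivially". A minimal sanity check of this decomposition (our own sketch, not from the paper) compares the closed form against brute-force enumeration over H for small q:

```python
import itertools
import numpy as np

def max_score(Wx):
    """max over h in {0,1}^q of h^T (Wx), solved bitwise: keep only positive entries."""
    return np.maximum(Wx, 0.0).sum()

# Brute-force check on a random 4-bit example.
rng = np.random.default_rng(0)
Wx = rng.normal(size=4)
brute = max(np.dot(h, Wx) for h in itertools.product([0, 1], repeat=4))
assert np.isclose(brute, max_score(Wx))
```

The same bitwise separability is what the loss-adjusted inference of Sec. 4.1 exploits, except that there the loss term couples the two codes through their Hamming distance.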
4.1. Binary hashing loss-adjusted inference

We solve loss-adjusted inference for general loss functions of the form L(h, g, s) = ℓ(||h − g||_H, s). This applies to both ℓ_bre and ℓ_ρ. The loss-adjusted inference is to find the pair of binary codes given by

    ( b̃(x_i; x_j, w), b̃(x_j; x_i, w) ) =
        argmax_{g_i, g_j ∈ H} [ ℓ(||g_i − g_j||_H, s_ij) + g_i^T W x_i + g_j^T W x_j ].    (12)

Before solving (12) in general, first consider the specific case in which we restrict the Hamming distance between g_i and g_j to be m, i.e., ||g_i − g_j||_H = m. For q-bit codes, m is an integer between 0 and q. When ||g_i − g_j||_H = m, the loss in (12) depends on m but not on the specific bit sequences g_i and g_j. Thus, instead of (12), we can now solve

    ℓ(m, s_ij) + max_{g_i, g_j ∈ H} [ g_i^T W x_i + g_j^T W x_j ]
    s.t. ||g_i − g_j||_H = m.    (13)

The key to finding the two codes that solve (13) is to decide which of the m bits in the two codes should be different. Let v_[k] denote the k-th element of a vector v. We can compute the joint contribution of the k-th bits of g_i and g_j to [g_i^T W x_i + g_j^T W x_j] by

    ζ_k(g_i[k], g_j[k]) = g_i[k] (W x_i)_[k] + g_j[k] (W x_j)_[k],

and these contributions can be computed for the four possible states of the k-th bits independently. To this end,

    ∆_k = max( ζ_k(1,0), ζ_k(0,1) ) − max( ζ_k(0,0), ζ_k(1,1) )

represents how much is gained by setting the bits g_i[k] and g_j[k] to be different rather than the same. Because g_i and g_j differ in exactly m bits, the solution to (13) is obtained by setting the m bits with the m largest ∆_k's to be different; all other bits in the two codes should be the same. When g_i[k] and g_j[k] must be different, they are found by comparing ζ_k(1,0) and ζ_k(0,1); otherwise, they are determined by the larger of ζ_k(0,0) and ζ_k(1,1). We then solve (13) for all m, noting that we only compute ∆_k for each bit, 1 ≤ k ≤ q, once.

To solve (12) it suffices to find the m that provides the largest value for the objective function in (13). We first sort the ∆_k's once, and for each value of m we compare the sum of the m largest ∆_k's plus ℓ(m, s_ij), choosing the m that achieves the highest score. Afterwards, we determine the values of the bits according to their contributions as described above.

Given the values of W x_i and W x_j, this loss-adjusted inference algorithm takes time O(q log q). Other than sorting the ∆_k's, all steps are linear in q, which makes the inference efficient and scalable to large code lengths. The computation of W x_i can be done once per point, even though each point is used in many pairs.

4.2. Perceptron-like learning

In Sec. 3.2, we formulated a convex-concave bound (11) on empirical loss. In Sec. 4.1 we described how the value of the bound can be computed at a given W. Now consider optimizing the objective, i.e., lowering the bound. A standard technique for minimizing such objectives is the concave-convex procedure (Yuille & Rangarajan, 2003). Applying this method to our problem, we would iteratively impute the missing data (the binary codes b(x_i; w)) and optimize the convex term (the loss-adjusted terms in (11)). However, our preliminary experiments showed that this procedure is slow and not very effective for learning hash functions.

Alternatively, following the structured perceptron (Collins, 2002) and recent work of McAllester et al. (2010), we considered a stochastic gradient-based approach, using an iterative, perceptron-like learning rule. At iteration t, let the current weight vector be w^t, and let the new training pair be (x_t, x'_t) with supervisory signal s_t. We update the parameters according to the following learning rule:

    w^{t+1} = w^t + η [ ψ(x_t, b(x_t; w^t)) + ψ(x'_t, b(x'_t; w^t))
                        − ψ(x_t, b̃(x_t; x'_t, w^t)) − ψ(x'_t, b̃(x'_t; x_t, w^t)) ]    (14)

where η is the learning rate, ψ(x, h) = vec(h x^T), and b̃(x_t; x'_t, w^t) and b̃(x'_t; x_t, w^t) are provided by the loss-adjusted inference above. This learning rule has been effective in our experiments.

One interpretation of this update rule is that it follows the noisy gradient descent direction of our convex-concave objective. To see this more clearly, we rewrite the objective (11) as

    Σ_{(i,j)∈S} [ L_ij + w^T ψ(x_i, b̃(x_i; x_j, w)) + w^T ψ(x_j, b̃(x_j; x_i, w))
                  − w^T ψ(x_i, b(x_i; w)) − w^T ψ(x_j, b(x_j; w)) ].    (15)

The loss-adjusted inference (12) yields b̃(x_i; x_j, w) and b̃(x_j; x_i, w). Evaluating the loss function for these two binary codes gives L_ij (which no longer depends on w). Taking the negative gradient of the objective (15) with respect to w, we get the exact learning rule of (14). Note, however, that this objective is piecewise linear, due to the max operations, and thus not differentiable at isolated points. While the theoretical properties of this update rule should be explored further (e.g., see McAllester et al. (2010)), we empirically verified that the update rule lowers the upper bound and converges to a local minimum. For example, Fig. 1 plots the empirical loss and the bound, computed over 10^5 training pairs, as a function of the iteration number.

5. Implementation details

We initialize W using LSH; i.e., the entries of W are sampled (IID) from a normal density N(0,1), and each
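The loss-adjusted inference of Sec. 4.1 can be sketched in a few lines. The following is our own Python rendering under the assumption of the ℓ_ρ loss (4); variable names are ours, and the function takes the precomputed projections W x_i and W x_j as inputs:

```python
import numpy as np

def loss_adjusted_inference(Wxi, Wxj, s, rho, lam):
    """Sketch of Sec. 4.1: exact solution of Eq. (12) for the loss ell_rho.

    Wxi, Wxj: length-q arrays holding W x_i and W x_j.
    Returns q-bit codes (g_i, g_j) maximizing
    ell_rho(||g_i - g_j||_H, s) + g_i^T W x_i + g_j^T W x_j.
    """
    q = len(Wxi)

    def ell(m):  # Eq. (4)
        return max(m - rho + 1, 0) if s == 1 else lam * max(rho - m + 1, 0)

    # zeta_k for the four bit states; Delta_k = gain of making bit k differ.
    same = np.maximum(0.0, Wxi + Wxj)       # max(zeta_k(0,0), zeta_k(1,1))
    diff = np.maximum(Wxi, Wxj)             # max(zeta_k(1,0), zeta_k(0,1))
    delta = diff - same
    order = np.argsort(-delta)              # bits sorted by decreasing Delta_k

    # Best m: score(m) = ell(m) + sum of m largest Delta_k (plus a constant).
    prefix = np.concatenate(([0.0], np.cumsum(delta[order])))
    m_best = max(range(q + 1), key=lambda m: ell(m) + prefix[m])

    gi, gj = np.zeros(q, np.uint8), np.zeros(q, np.uint8)
    differ = np.zeros(q, bool)
    differ[order[:m_best]] = True
    for k in range(q):
        if differ[k]:                       # compare zeta_k(1,0) vs zeta_k(0,1)
            if Wxi[k] >= Wxj[k]:
                gi[k] = 1
            else:
                gj[k] = 1
        elif Wxi[k] + Wxj[k] > 0:           # zeta_k(1,1) beats zeta_k(0,0)
            gi[k] = gj[k] = 1
    return gi, gj
```

Sorting the ∆_k's is the O(q log q) step; everything else is linear in q, matching the complexity stated in Sec. 4.1. For small q, the result can be verified against brute-force enumeration over all pairs of codes.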