Minimal Loss Hashing for Compact Binary Codes

Mohammad Norouzi  norouzi@cs.toronto.edu
David J. Fleet  fleet@cs.toronto.edu
Department of Computer Science, University of Toronto, Canada

Abstract

We propose a method for learning similarity-preserving hash functions that map high-dimensional data onto binary codes. The formulation is based on structured prediction with latent variables and a hinge-like loss function. It is efficient to train for large datasets, scales well to large code lengths, and outperforms state-of-the-art methods.

1. Introduction

Approximate nearest neighbor (ANN) search in large datasets is used widely. In computer vision, for example, it has been used for content-based retrieval (Jégou et al., 2008), object recognition (Lowe, 2004), and human pose estimation (Shakhnarovich et al., 2003). A common approach, particularly well suited to high-dimensional data, is to use similarity-preserving hash functions, where similar inputs are mapped to nearby binary codes. One can preserve Euclidean distances, e.g., with Locality-Sensitive Hashing (LSH) (Indyk & Motwani, 1998), or one might want to preserve the similarity of discrete class labels, or real-valued parameters associated with training exemplars.

Compact binary codes are particularly useful for ANN. If the nearest neighbors of a point are within a small hypercube in the Hamming space, then ANN search can be performed in sublinear time, treating binary codes as hash keys. Even for an exhaustive, linear scan through the database, binary codes enable very fast search. Compact binary codes also allow one to store large databases in memory.

We formulate the learning of compact binary codes in terms of structured prediction with latent variables, using a new class of loss functions designed for hashing. The task is to find a hash function that maps high-dimensional inputs, x ∈ R^p, onto binary codes, h ∈ H ≡ {0,1}^q, which preserves some notion of similarity. The canonical approach assumes centered (mean-subtracted) inputs, linear projection, and binary quantization. Such hash functions, parameterized by W ∈ R^{q×p}, are given by

    b(x; w) = thr(Wx),    (1)

where w ≡ vec(W), and the i-th bit of the vector thr(Wx) is 1 iff the i-th element of (Wx) is positive. In other words, the i-th row of W determines the i-th bit of the hash function in terms of a hyperplane in the input space; 0 is assigned to points on one side of the hyperplane, and 1 to points on the other side.¹

Random projections are used in LSH (Indyk & Motwani, 1998; Charikar, 2002) and related methods (Raginsky & Lazebnik, 2009). They are dataset independent, make no prior assumption about the data distribution, and come with theoretical guarantees that specific metrics (e.g., cosine similarity) are increasingly well preserved in Hamming space as the code length increases. But they require large code lengths for good retrieval accuracy, and they are not applicable to general similarity measures, like human ratings.

Recent work has focused on learning compact codes. Two early approaches showed how one might preserve semantically meaningful data abstractions (Shakhnarovich et al., 2003; Salakhutdinov & Hinton, 2009). Multilayer neural networks worked well for document retrieval (Salakhutdinov & Hinton, 2009) and for large image corpora (Torralba et al., 2008). However, these methods typically require large training sets, long learning times, and relatively slow indexing, and they exhibit poorer performance than more recent methods (Kulis & Darrell, 2009).

Continuous relaxations have been used to avoid optimization with discontinuous mappings to binary codes. Spectral Hashing (SH) aims to preserve Euclidean distance with an eigenvector formulation (Weiss et al., 2008). The resulting projection directions can be interpreted in terms of the principal directions of the data. Empirically, SH works well for relatively small codes, but often underperforms LSH for longer code lengths. Wang et al. (2010) extended SH to incorporate discrete supervision where sets of similar and dissimilar training points are available. Like Wang et al., Lin et al. (2010) propose an algorithm that learns binary hash functions, one bit at a time, based on similarity labels. In contrast, our method is not sequential; it optimizes all the code bits simultaneously.

Binary reconstructive embedding (BRE) (Kulis & Darrell, 2009) uses a loss function that penalizes the difference between Euclidean distance in the input space and the Hamming distance between binary codes:

    ℓ_bre(m_ij, d_ij) = ( (1/q) m_ij − (1/2) d_ij )².    (2)

Here, d_ij is the Euclidean distance between two inputs of unit length, and m_ij is the Hamming distance between their corresponding q-bit binary codes. The hash function is found by minimizing empirical loss, i.e., the sum of the pairwise loss, ℓ_bre, over training pairs. Experiments on several datasets demonstrated improved precision of BRE over SH and LSH. One limitation of BRE is the high storage cost required for training, making large datasets impractical.

In this paper we provide a new formulation for learning binary hash functions, based on structural SVMs with latent variables (Yu & Joachims, 2009) and an effective online learning algorithm. We also introduce a type of loss function specifically designed for hashing, taking Hamming distance and binary quantization into account. The resulting approach is shown to significantly outperform state-of-the-art methods.

2. Formulation

Our goal is to learn a binary hash function that preserves similarity between pairs of training exemplars. We assume mappings from R^p to H given by (1). The quality of a mapping is determined by a loss function L : H × H × {0,1} → R that assigns a cost to a pair of binary codes and a similarity label. For binary codes h, g ∈ H, and a label s ∈ {0,1}, the loss function L(h, g, s) measures how compatible h and g are with s. For example, when s = 1, the loss assigns a small cost if h and g are nearby codes, and a large cost otherwise. Ultimately, to learn w, we minimize (regularized) empirical loss over training pairs:

    L(w) = Σ_{(i,j)∈S} L(b(x_i; w), b(x_j; w), s_ij).    (3)

The loss function we advocate is specific to learning binary hash functions, and bears some similarity to the hinge loss used in SVMs. It includes a hyper-parameter ρ, which is a threshold in the Hamming space that differentiates neighbors from non-neighbors. This is important for learning hash keys, since we will want similar training points to map to binary codes that differ by no more than ρ bits; non-neighbors should map to codes no closer than ρ bits.

Let ||h − g||_H denote the Hamming distance between codes h, g ∈ H. Our hinge-like loss function, denoted ℓ_ρ, depends on ||h − g||_H and not on the individual codes, i.e., L(h, g, s) = ℓ_ρ(||h − g||_H, s):

    ℓ_ρ(m, s) = { max(m − ρ + 1, 0)      for s = 1
                { λ max(ρ − m + 1, 0)    for s = 0    (4)

where m ≡ ||h − g||_H, and λ is a loss hyper-parameter that controls the ratio of the slopes of the penalties incurred when similar points are too far apart versus when dissimilar points are too close. Linear penalties are useful as they are robust to outliers (e.g., in contrast to the quadratic penalty in ℓ_bre). Further, note that when similar points are sufficiently close, or dissimilar points are sufficiently distant, our loss does not impose any penalty.

----
¹ One can add an offset from the origin, but we find the gain is marginal. Nonlinear projections are also possible, but in this paper we concentrate on linear projections.

Appearing in Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011. Copyright 2011 by the author(s)/owner(s).
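To make the hash function (1) and the hinge-like loss (4) concrete, the following Python/NumPy sketch (our own illustration, not part of the paper; the function names are ours) computes binary codes and evaluates ℓ_ρ for a pair of codes at Hamming distance m:

```python
import numpy as np

def b_hash(X, W):
    """Eq. (1): thr(WX). Columns of X are inputs; returns one q-bit code per column."""
    return (W @ X > 0).astype(np.uint8)

def loss_rho(m, s, rho, lam):
    """Eq. (4): hinge-like loss on Hamming distance m with similarity label s."""
    if s == 1:
        return max(m - rho + 1, 0)        # similar pair penalized beyond rho-1 bits
    return lam * max(rho - m + 1, 0)      # dissimilar pair penalized within rho+1 bits
```

Note that, exactly as the text states, loss_rho(1, 1, rho=2, lam=0.5) and loss_rho(3, 0, rho=2, lam=0.5) both return 0: sufficiently close similar pairs and sufficiently distant dissimilar pairs incur no penalty.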
In what follows, D comprises N centered, p-dimensional training points {x_i}_{i=1}^N, and S is the set of pairs for which similarity labels exist. The binary similarity labels are given by {s_ij}_{(i,j)∈S}, where x_i and x_j are similar when s_ij = 1, and dissimilar when s_ij = 0. To preserve a specific metric (e.g., Euclidean distance) one can use binary similarity labels obtained by thresholding pairwise distances. Alternatively, one can use weakly supervised data for which each training point is associated with a set of neighbors (similar exemplars) and non-neighbors (dissimilar exemplars). This is useful, for example, for preserving similarity based on semantic content.

3. Bound on Empirical Loss

The empirical loss in (3) is discontinuous and typically non-convex, making optimization difficult. As a consequence, rather than directly minimizing empirical loss, we instead formulate, and minimize, a piecewise linear upper bound on empirical loss. Our bound is inspired by a bound used, for similar reasons, in structural SVMs with latent variables (Yu & Joachims, 2009).

We first re-express the hash function in (1) as a form of structured prediction:

    b(x; w) = argmax_{h∈H} h^T W x    (5)
            = argmax_{h∈H} w^T ψ(x, h),    (6)

where ψ(x, h) ≡ vec(h x^T). Here, w^T ψ(x, h) acts as a scoring function that determines the relevance of input-code pairs, based on a weighted sum of features in the joint feature vector ψ(x, h). Other forms of ψ(·,·) are possible, leading to other hash functions.

To motivate our upper bound on empirical loss, we begin with a short review of the bound commonly used for structural SVMs (Taskar et al., 2003; Tsochantaridis et al., 2004).

3.1. Structural SVM

In structural SVMs (SSVM), given input-output training pairs {(x_i, y_i*)}_{i=1}^N, one aims to learn a mapping from inputs to outputs in terms of a parameterized scoring function f(x, y; w):

    ŷ = argmax_y f(x, y; w).    (7)

Given a loss function on the output domain, L(·,·), the SSVM with margin-rescaling introduces a margin violation (slack) variable for each training pair, and minimizes the sum of the slack variables. For a pair (x, y*), slack is defined as max_y [L(y, y*) + f(x, y; w)] − f(x, y*; w). Importantly, the slack variables provide an upper bound on the loss for the predictor ŷ; i.e.,

    L(ŷ, y*) ≤ max_y [L(y, y*) + f(x, y; w)] − f(x, ŷ; w)    (8)
             ≤ max_y [L(y, y*) + f(x, y; w)] − f(x, y*; w).    (9)

To see the inequality in (8), note that, if the first term on the RHS of (8) is maximized by y = ŷ, then the f terms cancel, and (8) becomes an equality. Otherwise, the optimal value of the max term must be larger than when y = ŷ, which yields the inequality. The second inequality (9) follows straightforwardly from the definition of ŷ in (7); i.e., f(x, ŷ; w) ≥ f(x, y; w) for all y. The bound in (9) is piecewise linear, convex in w, and easier to optimize than the empirical loss.

3.2. Convex-concave bound for hashing

The difference between learning hash functions and the SSVM is that the binary codes for our training data are not known a priori. But note that the tighter bound in (8) uses y* only in the loss term, which is useful for hash function learning, because suitable loss functions for hashing, such as (4), do not require ground-truth labels. The bound (8) is piecewise linear, convex-concave (a sum of convex and concave terms), and is the basis for SSVMs with latent variables (Yu & Joachims, 2009). Below we formulate a similar bound for learning binary hash functions.

Our upper bound on the loss function L, given a pair of inputs, x_i and x_j, a supervisory label s_ij, and the parameters of the hash function w, has the form

    L(b(x_i; w), b(x_j; w), s_ij)
      ≤ max_{g_i, g_j ∈ H} [ L(g_i, g_j, s_ij) + g_i^T W x_i + g_j^T W x_j ]
        − max_{h_i ∈ H} h_i^T W x_i − max_{h_j ∈ H} h_j^T W x_j.    (10)

The proof for (10) is similar to that for (8) above. It follows from (5) that the second and third terms on the RHS of (10) are maximized by h_i = b(x_i; w) and h_j = b(x_j; w). If the first term were maximized by g_i = b(x_i; w) and g_j = b(x_j; w), then the inequality in (10) would become an equality. For all other values of g_i and g_j that maximize the first term, the RHS can only increase, hence the inequality. The bound holds for ℓ_ρ, ℓ_bre, and any similar loss function, with binary labels s_ij or real-valued labels d_ij.

We formulate the optimization for the weights w of the hashing function in terms of minimization of the following convex-concave upper bound on empirical loss:

    Σ_{(i,j)∈S} [ max_{g_i, g_j ∈ H} ( L(g_i, g_j, s_ij) + g_i^T W x_i + g_j^T W x_j )
                  − max_{h_i ∈ H} h_i^T W x_i − max_{h_j ∈ H} h_j^T W x_j ].    (11)

4. Optimization

Minimizing (11) to find w entails the maximization of three terms for each pair (i, j) ∈ S. The second and third terms are trivially maximized directly by the hash function (5). Maximizing the first term is, however, not trivial; it is similar to loss-adjusted inference in SSVMs. The next section describes an efficient algorithm for finding the exact solution of loss-adjusted inference for hash function learning.
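The second and third max terms in (10) and (11) decompose over bits: since h^T Wx = Σ_k h_[k] (Wx)_[k], the maximizing code sets bit k to 1 exactly when (Wx)_[k] > 0, i.e., h = b(x; w) = thr(Wx), which is why the hash function (5) maximizes them "trivially". A minimal sanity check of this decomposition (our own sketch, not from the paper) compares the closed form against brute-force enumeration over H for small q:

```python
import itertools
import numpy as np

def max_score(Wx):
    """max over h in {0,1}^q of h^T (Wx), solved bitwise: keep only positive entries."""
    return np.maximum(Wx, 0.0).sum()

# Brute-force check on a random 4-bit example.
rng = np.random.default_rng(0)
Wx = rng.normal(size=4)
brute = max(np.dot(h, Wx) for h in itertools.product([0, 1], repeat=4))
assert np.isclose(brute, max_score(Wx))
```

The same bitwise separability is what the loss-adjusted inference of Sec. 4.1 exploits, except that there the loss term couples the two codes through their Hamming distance.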
4.1. Binary hashing loss-adjusted inference

We solve loss-adjusted inference for general loss functions of the form L(h, g, s) = ℓ(||h − g||_H, s). This applies to both ℓ_bre and ℓ_ρ. The loss-adjusted inference is to find the pair of binary codes given by

    ( b̃(x_i; x_j, w), b̃(x_j; x_i, w) ) =
        argmax_{g_i, g_j ∈ H} [ ℓ(||g_i − g_j||_H, s_ij) + g_i^T W x_i + g_j^T W x_j ].    (12)

Before solving (12) in general, first consider the specific case in which we restrict the Hamming distance between g_i and g_j to be m, i.e., ||g_i − g_j||_H = m. For q-bit codes, m is an integer between 0 and q. When ||g_i − g_j||_H = m, the loss in (12) depends on m but not on the specific bit sequences g_i and g_j. Thus, instead of (12), we can now solve

    ℓ(m, s_ij) + max_{g_i, g_j ∈ H} [ g_i^T W x_i + g_j^T W x_j ]
    s.t. ||g_i − g_j||_H = m.    (13)

The key to finding the two codes that solve (13) is to decide which of the m bits in the two codes should be different. Let v_[k] denote the k-th element of a vector v. We can compute the joint contribution of the k-th bits of g_i and g_j to [g_i^T W x_i + g_j^T W x_j] by

    ζ_k(g_i[k], g_j[k]) = g_i[k] (W x_i)_[k] + g_j[k] (W x_j)_[k],

and these contributions can be computed for the four possible states of the k-th bits independently. To this end,

    ∆_k = max( ζ_k(1,0), ζ_k(0,1) ) − max( ζ_k(0,0), ζ_k(1,1) )

represents how much is gained by setting the bits g_i[k] and g_j[k] to be different rather than the same. Because g_i and g_j differ in exactly m bits, the solution to (13) is obtained by setting the m bits with the m largest ∆_k's to be different; all other bits in the two codes should be the same. When g_i[k] and g_j[k] must be different, they are found by comparing ζ_k(1,0) and ζ_k(0,1); otherwise, they are determined by the larger of ζ_k(0,0) and ζ_k(1,1). We then solve (13) for all m, noting that we only compute ∆_k for each bit, 1 ≤ k ≤ q, once.

To solve (12) it suffices to find the m that provides the largest value for the objective function in (13). We first sort the ∆_k's once, and for each value of m we compare the sum of the m largest ∆_k's plus ℓ(m, s_ij), choosing the m that achieves the highest score. Afterwards, we determine the values of the bits according to their contributions as described above.

Given the values of W x_i and W x_j, this loss-adjusted inference algorithm takes time O(q log q). Other than sorting the ∆_k's, all steps are linear in q, which makes the inference efficient and scalable to large code lengths. The computation of W x_i can be done once per point, even though each point is used in many pairs.

4.2. Perceptron-like learning

In Sec. 3.2, we formulated a convex-concave bound (11) on empirical loss. In Sec. 4.1 we described how the value of the bound can be computed at a given W. Now consider optimizing the objective, i.e., lowering the bound. A standard technique for minimizing such objectives is the concave-convex procedure (Yuille & Rangarajan, 2003). Applying this method to our problem, we would iteratively impute the missing data (the binary codes b(x_i; w)) and optimize the convex term (the loss-adjusted terms in (11)). However, our preliminary experiments showed that this procedure is slow and not very effective for learning hash functions.

Alternatively, following the structured perceptron (Collins, 2002) and recent work of McAllester et al. (2010), we considered a stochastic gradient-based approach, using an iterative, perceptron-like learning rule. At iteration t, let the current weight vector be w^t, and let the new training pair be (x_t, x'_t) with supervisory signal s_t. We update the parameters according to the following learning rule:

    w^{t+1} = w^t + η [ ψ(x_t, b(x_t; w^t)) + ψ(x'_t, b(x'_t; w^t))
                        − ψ(x_t, b̃(x_t; x'_t, w^t)) − ψ(x'_t, b̃(x'_t; x_t, w^t)) ]    (14)

where η is the learning rate, ψ(x, h) = vec(h x^T), and b̃(x_t; x'_t, w^t) and b̃(x'_t; x_t, w^t) are provided by the loss-adjusted inference above. This learning rule has been effective in our experiments.

One interpretation of this update rule is that it follows the noisy gradient descent direction of our convex-concave objective. To see this more clearly, we rewrite the objective (11) as

    Σ_{(i,j)∈S} [ L_ij + w^T ψ(x_i, b̃(x_i; x_j, w)) + w^T ψ(x_j, b̃(x_j; x_i, w))
                  − w^T ψ(x_i, b(x_i; w)) − w^T ψ(x_j, b(x_j; w)) ].    (15)

The loss-adjusted inference (12) yields b̃(x_i; x_j, w) and b̃(x_j; x_i, w). Evaluating the loss function for these two binary codes gives L_ij (which no longer depends on w). Taking the negative gradient of the objective (15) with respect to w, we get the exact learning rule of (14). Note, however, that this objective is piecewise linear, due to the max operations, and thus not differentiable at isolated points. While the theoretical properties of this update rule should be explored further (e.g., see McAllester et al. (2010)), we empirically verified that the update rule lowers the upper bound and converges to a local minimum. For example, Fig. 1 plots the empirical loss and the bound, computed over 10^5 training pairs, as a function of the iteration number.

5. Implementation details

We initialize W using LSH; i.e., the entries of W are sampled (IID) from a normal density N(0,1), and each
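The loss-adjusted inference of Sec. 4.1 can be sketched in a few lines. The following is our own Python rendering under the assumption of the ℓ_ρ loss (4); variable names are ours, and the function takes the precomputed projections W x_i and W x_j as inputs:

```python
import numpy as np

def loss_adjusted_inference(Wxi, Wxj, s, rho, lam):
    """Sketch of Sec. 4.1: exact solution of Eq. (12) for the loss ell_rho.

    Wxi, Wxj: length-q arrays holding W x_i and W x_j.
    Returns q-bit codes (g_i, g_j) maximizing
    ell_rho(||g_i - g_j||_H, s) + g_i^T W x_i + g_j^T W x_j.
    """
    q = len(Wxi)

    def ell(m):  # Eq. (4)
        return max(m - rho + 1, 0) if s == 1 else lam * max(rho - m + 1, 0)

    # zeta_k for the four bit states; Delta_k = gain of making bit k differ.
    same = np.maximum(0.0, Wxi + Wxj)       # max(zeta_k(0,0), zeta_k(1,1))
    diff = np.maximum(Wxi, Wxj)             # max(zeta_k(1,0), zeta_k(0,1))
    delta = diff - same
    order = np.argsort(-delta)              # bits sorted by decreasing Delta_k

    # Best m: score(m) = ell(m) + sum of m largest Delta_k (plus a constant).
    prefix = np.concatenate(([0.0], np.cumsum(delta[order])))
    m_best = max(range(q + 1), key=lambda m: ell(m) + prefix[m])

    gi, gj = np.zeros(q, np.uint8), np.zeros(q, np.uint8)
    differ = np.zeros(q, bool)
    differ[order[:m_best]] = True
    for k in range(q):
        if differ[k]:                       # compare zeta_k(1,0) vs zeta_k(0,1)
            if Wxi[k] >= Wxj[k]:
                gi[k] = 1
            else:
                gj[k] = 1
        elif Wxi[k] + Wxj[k] > 0:           # zeta_k(1,1) beats zeta_k(0,0)
            gi[k] = gj[k] = 1
    return gi, gj
```

Sorting the ∆_k's is the O(q log q) step; everything else is linear in q, matching the complexity stated in Sec. 4.1. For small q, the result can be verified against brute-force enumeration over all pairs of codes.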