Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17)

Learning Feature Engineering for Classification

Fatemeh Nargesian¹, Horst Samulowitz², Udayan Khurana², Elias B. Khalil³, Deepak Turaga²
¹University of Toronto, ²IBM Research, ³Georgia Institute of Technology
fnargesian@cs.toronto.edu, {samulowitz, ukhurana}@us.ibm.com, lyes@gatech.edu, turaga@us.ibm.com

Abstract

Feature engineering is the task of improving predictive modelling performance on a dataset by transforming its feature space. Existing approaches to automate this process rely on either transformed feature space exploration through evaluation-guided search, or explicit expansion of datasets with all transformed features followed by feature selection. Such approaches incur high computational costs in runtime and/or memory. We present a novel technique, called Learning Feature Engineering (LFE), for automating feature engineering in classification tasks. LFE is based on learning the effectiveness of applying a transformation (e.g., arithmetic or aggregate operators) on numerical features, from past feature engineering experiences. Given a new dataset, LFE recommends a set of useful transformations to be applied on features without relying on model evaluation or explicit feature expansion and selection. Using a collection of datasets, we train a set of neural networks, which aim at predicting the transformation that impacts classification performance positively. Our empirical results show that LFE outperforms other feature engineering approaches for an overwhelming majority (89%) of the datasets from various sources while incurring a substantially lower computational cost.
1 Introduction

Feature engineering is a central task in data preparation for machine learning. It is the practice of constructing suitable features from given features that lead to improved predictive performance. Feature engineering involves the application of transformation functions such as arithmetic and aggregate operators on given features to generate new ones. Transformations help scale a feature or convert a non-linear relation between a feature and a target class into a linear relation, which is easier to learn, as the sketch below illustrates.
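As a concrete, synthetic illustration of this point (our own sketch, not an excerpt from the paper's experiments; the data-generating process and the use of scikit-learn are assumptions), the class boundary x1 · x2 > 1 is non-linear in the raw feature space but becomes a hyperplane after a log transformation, since it is equivalent to log(x1) + log(x2) > 0:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: heavy-tailed positive features, class determined by x1*x2 > 1.
# The boundary is a hyperbola in raw space, a line in log space.
rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(2000, 2))
y = (X[:, 0] * X[:, 1] > 1).astype(int)

raw_acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
log_acc = cross_val_score(LogisticRegression(max_iter=1000), np.log(X), y, cv=5).mean()
print(f"raw features: {raw_acc:.2f}, log-transformed: {log_acc:.2f}")
```

A linear classifier on the log-transformed features recovers the boundary almost exactly, while on the raw features it can only approximate the hyperbola.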
Feature engineering is usually conducted by a data scientist relying on her domain expertise and iterative trial and error and model evaluation. To perform automated feature engineering, some existing approaches adopt guided search in feature space using heuristic feature quality measures (such as information gain) and other surrogate measures of performance [Markovitch and Rosenstein, 2002; Fan et al., 2010]. Others perform greedy feature construction and selection based on model evaluation [Dor and Reich, 2012; Khurana et al., 2016]. Kanter et al. proposed the Data Science Machine (DSM), which considers the feature engineering problem as feature selection on the space of novel features. DSM relies on exhaustively enumerating all possible features that can be constructed from a dataset, given sequences generated from a set of transformations, then performing feature selection on the augmented dataset [Kanter and Veeramachaneni, 2015]. Evaluation-based and exhaustive feature enumeration and selection approaches result in high time and memory costs and may lead to overfitting due to the brute-force generation of features. Moreover, although deep neural networks (DNNs) allow useful meta-features to be learned automatically [Bengio et al., 2013], the learned features are not always interpretable and DNNs are not effective learners in various application domains.

In this paper, we propose LFE (Learning Feature Engineering), a novel meta-learning approach to automatically perform interpretable feature engineering for classification, based on learning from past feature engineering experiences. By generalizing the impact of different transformations on the performance of a large number of datasets, LFE learns useful patterns between features, transforms and target that improve learning accuracy. We show that generalizing such patterns across thousands of features from hundreds of datasets can be used to successfully predict suitable transformations for features in new datasets without actually applying the transformations or performing the time-consuming model building and validation tasks. LFE takes as input a dataset and recommends a set of paradigms for constructing new useful features. Each paradigm consists of a transformation and an ordered list of features on which the transformation is suitable.

At the core of LFE is a set of Multi-Layer Perceptron (MLP) classifiers, each corresponding to a transformation. Given a set of features and class labels, the classifier predicts whether the transformation can derive a more useful feature than the input features. LFE considers the notion of feature and class relevance in the context of a transformation as the measure of the usefulness of a pattern of feature value and class label distributions, and transformation.

Different datasets contain different feature sizes and different value ranges. One key challenge in generalizing across different datasets is to convert feature values and their class labels into a fixed-size feature vector representation that can be fed into LFE classifiers. To characterize datasets, hand-crafted meta-features, fixed-size stratified sampling, neural networks and hashing methods have been used for different tasks [Michie et al., 1994; Kalousis, 2002; Feurer et al., 2015; Weinberger et al., 2009]. However, these representations do not directly capture the correlation between feature values and class labels. To capture such correlations, LFE constructs a stack of fixed-size representations of feature values per target class. We use Quantile Data Sketch to represent the feature values of each class. Quantiles have been used as a fixed-size space representation that achieves a reasonably accurate approximation to the distribution function induced by the data being sketched [Greenwald and Khanna, 2001].

LFE presents a computationally efficient and effective alternative to other automated feature engineering approaches by recommending suitable transformations for features in a dataset. To showcase the capabilities of LFE, we trained LFE on 85K features, extracted from 900 datasets, for 10 unary transformations and on 122K feature pairs for 4 binary transformations, for two models: Random Forest and Logistic Regression. The transformations are listed in Table 1. We empirically compare LFE with a suite of feature engineering approaches proposed in the literature or applied in practice (such as the Data Science Machine, evaluation-based approaches, random selection of transformations, and always applying the transformation that is most popular in the training data) on a subset of 50 datasets from the UCI repository [Lichman, 2013], OpenML [Vanschoren et al., 2014] and other sources. Our experiments show that, of the datasets that demonstrated any improvement through feature engineering, LFE was the most effective in 89% of the cases. As shown in Figure 2, similar results were observed for the LFE trained with Logistic Regression. Moreover, LFE runs in significantly less time than the other approaches. This also enables interaction with a practitioner, since it recommends transformations on features in a short amount of time.

2 Related Work

The FICUS algorithm [Markovitch and Rosenstein, 2002] takes as input a dataset and a set of transformations, and performs a beam search over the space of possible features. FICUS's search for better features is guided by heuristic measures based on information gain in a decision tree, and other surrogate measures of performance. The more recent FCTree [Fan et al., 2010] also uses a decision tree to partition the data, using original or constructed features as splitting points. The FEADIS algorithm [Dor and Reich, 2012] relies on a combination of random feature generation and feature selection. FEADIS adds constructed features greedily, and as such requires many expensive performance evaluations. The Deep Feature Synthesis component of the Data Science Machine (DSM) [Kanter and Veeramachaneni, 2015] relies on exhaustively enumerating all possible new features, then performing feature selection over the augmented dataset. Cognito [Khurana et al., 2016] recommends a series of transformations based on a greedy, hierarchical heuristic search. Cognito and DSM focus on sequences of feature transformations, which is outside the scope of this paper. In contrast to these approaches, we do not require expensive classifier evaluations to measure the impact of a transformation. ExploreKit [Katz et al., 2016] also generates all possible candidate features, but uses a learned ranking function to sort them. ExploreKit requires generating all possible candidate features when engineering features for a new (test) dataset; as such, the results reported for ExploreKit were obtained with a time limit of three days per dataset. In contrast, LFE can generate effective features within seconds, on average.

Several machine learning methods perform feature extraction or learning indirectly. While they do not explicitly work on input features and transformations, they generate new features as a means of solving another problem [Storcheus et al., 2015]. Methods of this type include dimensionality reduction, kernel methods and deep learning. Kernel algorithms such as SVM [Shawe-Taylor and Cristianini, 2004] can be seen as embedded methods, where the learning and the (implicit) feature generation are performed jointly. This is in contrast to our setting, where feature engineering is a preprocessing step. Deep neural networks learn useful features automatically [Bengio et al., 2013] and have shown remarkable successes on video, image and speech data. However, in some domains feature engineering is still required. Moreover, features derived by neural networks are often not interpretable, which is an important factor in certain application domains such as healthcare [Che et al., 2015].

3 Automated Feature Engineering Problem

Consider a dataset, D, with features, F = {f_1, ..., f_n}, a target class, κ, a set of transformations, T = {T_1, ..., T_m}, and a classification task, L. The feature engineering problem is to find the q best paradigms for constructing new features such that appending the new features to D maximizes the accuracy of L. Each paradigm consists of a candidate transformation T_c ∈ T of arity r, an ordered list of features [f_i, ..., f_{i+r-1}] and a usefulness score.

For a dataset with n features and u unary transformations, O(u × n) new features can be constructed. With b binary transformations, there are O(b × P(n, 2)) new possible features, where P(n, 2) is the number of 2-permutations of n features. Given a fixed set of transformations, the number of new features and their combinations to explore for an exact solution grows exponentially. Hence, we make the case that mere enumeration and trial by model training and testing is not a computationally practical option, and a scalable solution to the problem must avoid this computational bottleneck. LFE reduces the complexity of feature space exploration by providing an efficient approach for finding a particularly "good" transformation for a given set of features. Therefore, given n features, LFE performs O(n) transformation predictions for unary transformations and O(P(n, 2)) predictions for binary ones; the sketch below puts these counts in perspective.
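The following back-of-the-envelope sketch (ours; the transformation counts match the 10 unary and 4 binary transformations used in our experiments, while n = 100 is an arbitrary assumption) contrasts the two regimes. The key difference is not only the count: each brute-force candidate requires a full model train/evaluate cycle, whereas each LFE prediction is a cheap MLP forward pass.

```python
# Candidate-space size for brute-force enumeration vs. LFE's prediction count.
N_FEATURES = 100       # n: arbitrary assumption for illustration
UNARY, BINARY = 10, 4  # u, b: transformation counts from the experiments

def two_permutations(n):
    return n * (n - 1)  # P(n, 2)

# Brute force: one model evaluation per constructed candidate feature.
candidates = UNARY * N_FEATURES + BINARY * two_permutations(N_FEATURES)
print(f"model evaluations (brute force): {candidates}")  # 40600

# LFE: one prediction per feature / ordered feature pair, i.e. O(n) + O(P(n,2))
# forward passes, regardless of how many transformations are considered each time.
predictions = N_FEATURES + two_permutations(N_FEATURES)
print(f"MLP predictions (LFE): {predictions}")            # 10000
```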
In order to assess the relative impact of adding new features across different techniques, we add as many new features as there are originally in the data. For unary transformations, LFE predicts the most suitable transformation for each feature. For binary and higher-arity transformations, LFE considers a random sample of all combinations of features, finds the paradigm for each combination and selects the top-k most useful ones. In the following section, we describe how LFE learns and predicts useful transformations for features.

4 Transformation Recommendation

LFE models the problem of predicting a useful r-ary transformation T_c ∈ T_r (where T_r is the set of r-ary transformations in T) for a given list of features [f_1, ..., f_r] as a multi-class classification problem, where the input is a representation of the features, R_[f_1,...,f_r], and the output classes are the transformations in T_r. LFE takes a one-vs-rest approach [Rifkin and Klautau, 2004]. Each transformation is modelled as a Multi-Layer Perceptron (MLP) binary classifier with a real-valued confidence score as output. Recommending an r-ary transformation for r features involves applying all |T_r| MLPs to R_[f_1,...,f_r]. If the highest confidence score obtained from the classifiers is above a given threshold, LFE recommends the corresponding transformation to be applied to the features. Let g_k(R_[f_1,...,f_r]) be the confidence score of the MLP corresponding to transformation T_k, and let γ be the threshold for confidence scores, which we determined empirically. LFE recommends the transformation T_c for features [f_1, ..., f_r] as follows:

    c = argmax_k g_k(R_[f_1,...,f_r])

    recommend = T_c, if g_c(R_[f_1,...,f_r]) > γ;  none, otherwise.    (1)
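A minimal sketch of this recommendation rule follows. It assumes per-transformation binary classifiers exposing a scikit-learn-style predict_proba; the dictionary layout and the threshold value are illustrative placeholders, since the paper determines γ empirically.

```python
GAMMA = 0.8  # placeholder threshold; the paper sets gamma empirically

def recommend(mlps, feature_repr):
    """Eq. (1): one-vs-rest transformation recommendation.

    mlps: dict mapping a transformation name to a trained binary classifier
          with predict_proba (e.g., sklearn.neural_network.MLPClassifier).
    feature_repr: the fixed-size vector R_[f1,...,fr] for the input features.
    Returns the best transformation name, or None if no score exceeds gamma.
    """
    scores = {t: clf.predict_proba([feature_repr])[0, 1]  # g_k(R)
              for t, clf in mlps.items()}
    c = max(scores, key=scores.get)                       # c = argmax_k g_k(R)
    return c if scores[c] > GAMMA else None
```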
In the following sections, we describe how numerical features are represented to serve as the input of the transformation MLPs. Next, we explain the process of collecting past feature engineering knowledge as training samples to train the MLPs.

4.1 Feature-Class Representation

Transformations are used to reveal and improve significant correlation or discriminative information between features and class labels. The more pronounced this correlation, the higher the chance that a model can achieve better predictive performance. Each LFE classifier learns the patterns of feature-class distributions for which the corresponding transformation has been effective in improving feature-class correlation. LFE represents a feature f in a dataset with k classes as follows:

    R_f = [Q_f^(1); Q_f^(2); ...; Q_f^(k)]    (2)

where Q_f^(i) is a fixed-size representation of the values in f that are associated with class i. We call this representation the Quantile Sketch Array. Below, we describe how the feature values associated with class i are translated into the representation Q_f^(i), which is meant to capture the distribution of those feature values.

Neural networks have been successful in learning representations of image and speech data [Bengio et al., 2013]. Such approaches usually start with raw features and learn a Probability Distribution Function (PDF) in an n-dimensional space [Garrido and Juste, 1998]. However, it is not clear how existing representation learning and PDF learning approaches may be applied in the context of raw numerical data. The main challenge is the high variability in the size and the range of feature values (e.g., from 10 to millions of values). In our setting, features are data points represented with varying numbers of dimensions (the number of distinct feature values). Hence, Random Projection [Bingham and Mannila, 2001] for dimensionality reduction is not applicable. Although Recurrent Neural Networks can deal with varying input sizes, we aim at determining a fixed-size representation that captures the correlation between features and target classes.

Previous approaches have used hand-crafted meta-features, including information-theoretic and statistical meta-features, to represent datasets [Michie et al., 1994; Kalousis, 2002; Feurer et al., 2015]. Such meta-features aim at modelling the distribution of values in datasets. Performing fixed-size sampling of feature values is another approach to representing distributions. Samples extracted from features and classes are required to reflect the distribution of values in both the feature and the target classes. While stratified sampling solves this issue for one feature, it becomes more complex for multiple features with high correlation [Skinner et al., 1994]. Feature hashing has been used to represent features of type "string" as feature vectors [Weinberger et al., 2009]. Although feature hashing can be generalized to numerical values, it is not straightforward to choose distance-preserving hash functions that map values within a small range to the same hash value. In Section 5, we empirically show how LFE transformation classifiers perform using each of the representations described above.

The Quantile Sketch Array (QSA) uses a quantile data sketch [Wang et al., 2013] to represent the feature values associated with a class label. QSA is a non-parametric representation that enables characterizing the approximate Probability Distribution Function of values. QSA is closely related to the familiar concept of a histogram, where data is summarized into a small number of buckets. Naumann et al. used a quantile representation of numerical features to perform feature classification [Naumann et al., 2002]. We apply the exact brute-force approach to computing the quantile sketch of numerical values [Wang et al., 2013], described as follows.

Let V_k be the bag of values in feature f that belong to the training data points with label c_k, and let Q_f^(k) be the quantile sketch of V_k. First, we scale these values to a predefined range [lb, ub]. Generating Q_f^(k) involves bucketing all values in V_k into a set of bins. Given a fixed number of bins, r, the range [lb, ub] is partitioned into r disjoint bins of uniform width w = (ub − lb) / r, namely {b_0, ..., b_{r−1}}, where bin b_j spans the range [lb + j·w, lb + (j+1)·w). The function B(v_l) = b_j associates a value v_l in V_k with its bin b_j, and the function P(b_j) returns the number of feature values bucketed in b_j. Finally, I(b_j) = P(b_j) / Σ_{0≤m<r} P(b_m) is the normalized value of bin b_j in Q_f^(k).
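The construction above is straightforward to implement. The sketch below is ours, with assumed details the paper leaves open at this point: bin count r = 10, scaling range [lb, ub] = [0, 1], and per-class min-max scaling.

```python
import numpy as np

def quantile_sketch(values, r=10, lb=0.0, ub=1.0):
    """Q_f^(k): normalized equi-width bin counts I(b_j) for the bag of values
    of one feature restricted to one class. Assumes `values` is non-empty."""
    v = np.asarray(values, dtype=float)
    # Scale values into the predefined range [lb, ub].
    vmin, vmax = v.min(), v.max()
    scaled = np.full_like(v, lb) if vmax == vmin \
        else lb + (v - vmin) * (ub - lb) / (vmax - vmin)
    w = (ub - lb) / r                                          # uniform bin width
    idx = np.minimum(((scaled - lb) / w).astype(int), r - 1)   # B(v_l) = b_j
    counts = np.bincount(idx, minlength=r)                     # P(b_j)
    return counts / counts.sum()                               # I(b_j)

def quantile_sketch_array(values, labels, r=10):
    """R_f (Eq. 2): the per-class sketches Q_f^(1), ..., Q_f^(k), stacked."""
    values, labels = np.asarray(values, dtype=float), np.asarray(labels)
    return np.concatenate(
        [quantile_sketch(values[labels == c], r) for c in np.unique(labels)])
```

For instance, quantile_sketch_array(f, y, r=10) on a binary-class dataset yields a fixed 20-dimensional vector regardless of how many rows or distinct values the feature f has, which is exactly the property the LFE classifiers require.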
4.2 Training

Transformations are applied on features with a particular range of values (e.g., log is applied on numerical ...