Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17)

Learning Feature Engineering for Classification

Fatemeh Nargesian¹, Horst Samulowitz², Udayan Khurana², Elias B. Khalil³, Deepak Turaga²
¹University of Toronto, ²IBM Research, ³Georgia Institute of Technology
fnargesian@cs.toronto.edu, {samulowitz, ukhurana}@us.ibm.com, lyes@gatech.edu, turaga@us.ibm.com

Abstract

Feature engineering is the task of improving predictive modelling performance on a dataset by transforming its feature space. Existing approaches to automate this process rely on either transformed feature space exploration through evaluation-guided search, or explicit expansion of datasets with all transformed features followed by feature selection. Such approaches incur high computational costs in runtime and/or memory. We present a novel technique, called Learning Feature Engineering (LFE), for automating feature engineering in classification tasks. LFE is based on learning the effectiveness of applying a transformation (e.g., arithmetic or aggregate operators) on numerical features, from past feature engineering experiences. Given a new dataset, LFE recommends a set of useful transformations to be applied on features without relying on model evaluation or explicit feature expansion and selection. Using a collection of datasets, we train a set of neural networks, which aim at predicting the transformation that impacts classification performance positively. Our empirical results show that LFE outperforms other feature engineering approaches for an overwhelming majority (89%) of the datasets from various sources while incurring a substantially lower computational cost.
1 Introduction

Feature engineering is a central task in data preparation for machine learning. It is the practice of constructing suitable features from given features that lead to improved predictive performance. Feature engineering involves the application of transformation functions such as arithmetic and aggregate operators on given features to generate new ones. Transformations help scale a feature or convert a non-linear relation between a feature and a target class into a linear relation, which is easier to learn, as the sketch below illustrates.
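As a concrete, synthetic illustration of this point (our own sketch, not an excerpt from the paper's experiments; the data-generating process and the use of scikit-learn are assumptions), the class boundary x1 · x2 > 1 is non-linear in the raw feature space but becomes a hyperplane after a log transformation, since it is equivalent to log(x1) + log(x2) > 0:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: heavy-tailed positive features, class determined by x1*x2 > 1.
# The boundary is a hyperbola in raw space, a line in log space.
rng = np.random.default_rng(0)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(2000, 2))
y = (X[:, 0] * X[:, 1] > 1).astype(int)

raw_acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
log_acc = cross_val_score(LogisticRegression(max_iter=1000), np.log(X), y, cv=5).mean()
print(f"raw features: {raw_acc:.2f}, log-transformed: {log_acc:.2f}")
```

A linear classifier on the log-transformed features recovers the boundary almost exactly, while on the raw features it can only approximate the hyperbola.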
Feature engineering is usually conducted by a data scientist relying on her domain expertise and iterative trial and error and model evaluation. To perform automated feature engineering, some existing approaches adopt guided search in feature space using heuristic feature quality measures (such as information gain) and other surrogate measures of performance [Markovitch and Rosenstein, 2002; Fan et al., 2010]. Others perform greedy feature construction and selection based on model evaluation [Dor and Reich, 2012; Khurana et al., 2016]. Kanter et al. proposed the Data Science Machine (DSM), which considers the feature engineering problem as feature selection on the space of novel features. DSM relies on exhaustively enumerating all possible features that can be constructed from a dataset, given sequences generated from a set of transformations, then performing feature selection on the augmented dataset [Kanter and Veeramachaneni, 2015]. Evaluation-based and exhaustive feature enumeration and selection approaches result in high time and memory costs and may lead to overfitting due to the brute-force generation of features. Moreover, although deep neural networks (DNNs) allow useful meta-features to be learned automatically [Bengio et al., 2013], the learned features are not always interpretable and DNNs are not effective learners in various application domains.

In this paper, we propose LFE (Learning Feature Engineering), a novel meta-learning approach to automatically perform interpretable feature engineering for classification, based on learning from past feature engineering experiences. By generalizing the impact of different transformations on the performance of a large number of datasets, LFE learns useful patterns between features, transforms and target that improve learning accuracy. We show that generalizing such patterns across thousands of features from hundreds of datasets can be used to successfully predict suitable transformations for features in new datasets without actually applying the transformations or performing the time-consuming model building and validation tasks. LFE takes as input a dataset and recommends a set of paradigms for constructing new useful features. Each paradigm consists of a transformation and an ordered list of features on which the transformation is suitable.

At the core of LFE is a set of Multi-Layer Perceptron (MLP) classifiers, each corresponding to a transformation. Given a set of features and class labels, the classifier predicts whether the transformation can derive a more useful feature than the input features. LFE considers the notion of feature and class relevance in the context of a transformation as the measure of the usefulness of a pattern of feature value and class label distributions, and transformation.

Different datasets contain different feature sizes and different value ranges. One key challenge in generalizing across different datasets is to convert feature values and their class labels into a fixed-size feature vector representation that can be fed into LFE classifiers. To characterize datasets, hand-crafted meta-features, fixed-size stratified sampling, neural networks and hashing methods have been used for different tasks [Michie et al., 1994; Kalousis, 2002; Feurer et al., 2015; Weinberger et al., 2009]. However, these representations do not directly capture the correlation between feature values and class labels. To capture such correlations, LFE constructs a stack of fixed-size representations of feature values per target class. We use Quantile Data Sketch to represent the feature values of each class. Quantiles have been used as a fixed-size space representation that achieves a reasonably accurate approximation to the distribution function induced by the data being sketched [Greenwald and Khanna, 2001].

LFE presents a computationally efficient and effective alternative to other automated feature engineering approaches by recommending suitable transformations for features in a dataset. To showcase the capabilities of LFE, we trained LFE on 85K features, extracted from 900 datasets, for 10 unary transformations and on 122K feature pairs for 4 binary transformations, for two models: Random Forest and Logistic Regression. The transformations are listed in Table 1. We empirically compare LFE with a suite of feature engineering approaches proposed in the literature or applied in practice (such as the Data Science Machine, evaluation-based approaches, random selection of transformations, and always applying the transformation that is most popular in the training data) on a subset of 50 datasets from the UCI repository [Lichman, 2013], OpenML [Vanschoren et al., 2014] and other sources. Our experiments show that, of the datasets that demonstrated any improvement through feature engineering, LFE was the most effective in 89% of the cases. As shown in Figure 2, similar results were observed for the LFE trained with Logistic Regression. Moreover, LFE runs in significantly less time than the other approaches. This also enables interaction with a practitioner, since it recommends transformations on features in a short amount of time.

2 Related Work

The FICUS algorithm [Markovitch and Rosenstein, 2002] takes as input a dataset and a set of transformations, and performs a beam search over the space of possible features. FICUS's search for better features is guided by heuristic measures based on information gain in a decision tree, and other surrogate measures of performance. The more recent FCTree [Fan et al., 2010] also uses a decision tree to partition the data, using original or constructed features as splitting points. The FEADIS algorithm [Dor and Reich, 2012] relies on a combination of random feature generation and feature selection. FEADIS adds constructed features greedily, and as such requires many expensive performance evaluations. The Deep Feature Synthesis component of the Data Science Machine (DSM) [Kanter and Veeramachaneni, 2015] relies on exhaustively enumerating all possible new features, then performing feature selection over the augmented dataset. Cognito [Khurana et al., 2016] recommends a series of transformations based on a greedy, hierarchical heuristic search. Cognito and DSM focus on sequences of feature transformations, which is outside the scope of this paper. In contrast to these approaches, we do not require expensive classifier evaluations to measure the impact of a transformation. ExploreKit [Katz et al., 2016] also generates all possible candidate features, but uses a learned ranking function to sort them. ExploreKit requires generating all possible candidate features when engineering features for a new (test) dataset; as such, the results reported for ExploreKit were obtained with a time limit of three days per dataset. In contrast, LFE can generate effective features within seconds, on average.

Several machine learning methods perform feature extraction or learning indirectly. While they do not explicitly work on input features and transformations, they generate new features as a means of solving another problem [Storcheus et al., 2015]. Methods of this type include dimensionality reduction, kernel methods and deep learning. Kernel algorithms such as SVM [Shawe-Taylor and Cristianini, 2004] can be seen as embedded methods, where the learning and the (implicit) feature generation are performed jointly. This is in contrast to our setting, where feature engineering is a preprocessing step. Deep neural networks learn useful features automatically [Bengio et al., 2013] and have shown remarkable successes on video, image and speech data. However, in some domains feature engineering is still required. Moreover, features derived by neural networks are often not interpretable, which is an important factor in certain application domains such as healthcare [Che et al., 2015].

3 Automated Feature Engineering Problem

Consider a dataset, D, with features, F = {f_1, ..., f_n}, a target class, κ, a set of transformations, T = {T_1, ..., T_m}, and a classification task, L. The feature engineering problem is to find the q best paradigms for constructing new features such that appending the new features to D maximizes the accuracy of L. Each paradigm consists of a candidate transformation T_c ∈ T of arity r, an ordered list of features [f_i, ..., f_{i+r-1}] and a usefulness score.

For a dataset with n features and u unary transformations, O(u × n) new features can be constructed. With b binary transformations, there are O(b × P(n, 2)) new possible features, where P(n, 2) is the number of 2-permutations of n features. Given a fixed set of transformations, the number of new features and their combinations to explore for an exact solution grows exponentially. Hence, we make the case that mere enumeration and trial by model training and testing is not a computationally practical option, and a scalable solution to the problem must avoid this computational bottleneck. LFE reduces the complexity of feature space exploration by providing an efficient approach for finding a particularly "good" transformation for a given set of features. Therefore, given n features, LFE performs O(n) transformation predictions for unary transformations and O(P(n, 2)) predictions for binary ones; the sketch below puts these counts in perspective.
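The following back-of-the-envelope sketch (ours; the transformation counts match the 10 unary and 4 binary transformations used in our experiments, while n = 100 is an arbitrary assumption) contrasts the two regimes. The key difference is not only the count: each brute-force candidate requires a full model train/evaluate cycle, whereas each LFE prediction is a cheap MLP forward pass.

```python
# Candidate-space size for brute-force enumeration vs. LFE's prediction count.
N_FEATURES = 100       # n: arbitrary assumption for illustration
UNARY, BINARY = 10, 4  # u, b: transformation counts from the experiments

def two_permutations(n):
    return n * (n - 1)  # P(n, 2)

# Brute force: one model evaluation per constructed candidate feature.
candidates = UNARY * N_FEATURES + BINARY * two_permutations(N_FEATURES)
print(f"model evaluations (brute force): {candidates}")  # 40600

# LFE: one prediction per feature / ordered feature pair, i.e. O(n) + O(P(n,2))
# forward passes, regardless of how many transformations are considered each time.
predictions = N_FEATURES + two_permutations(N_FEATURES)
print(f"MLP predictions (LFE): {predictions}")            # 10000
```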
In order to assess the relative impact of adding new features across different techniques, we add as many new features as there are originally in the data. For unary transformations, LFE predicts the most suitable transformation for each feature. For binary and higher-arity transformations, LFE considers a random sample of all combinations of features, finds the paradigm for each combination and selects the top-k most useful ones. In the following section, we describe how LFE learns and predicts useful transformations for features.

4 Transformation Recommendation

LFE models the problem of predicting a useful r-ary transformation T_c ∈ T_r (where T_r is the set of r-ary transformations in T) for a given list of features [f_1, ..., f_r] as a multi-class classification problem, where the input is a representation of the features, R_[f_1,...,f_r], and the output classes are the transformations in T_r. LFE takes a one-vs-rest approach [Rifkin and Klautau, 2004]. Each transformation is modelled as a Multi-Layer Perceptron (MLP) binary classifier with a real-valued confidence score as output. Recommending an r-ary transformation for r features involves applying all |T_r| MLPs to R_[f_1,...,f_r]. If the highest confidence score obtained from the classifiers is above a given threshold, LFE recommends the corresponding transformation to be applied to the features. Let g_k(R_[f_1,...,f_r]) be the confidence score of the MLP corresponding to transformation T_k, and let γ be the threshold for confidence scores, which we determined empirically. LFE recommends the transformation T_c for features [f_1, ..., f_r] as follows:

    c = argmax_k g_k(R_[f_1,...,f_r])

    recommend = T_c, if g_c(R_[f_1,...,f_r]) > γ;  none, otherwise.    (1)
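A minimal sketch of this recommendation rule follows. It assumes per-transformation binary classifiers exposing a scikit-learn-style predict_proba; the dictionary layout and the threshold value are illustrative placeholders, since the paper determines γ empirically.

```python
GAMMA = 0.8  # placeholder threshold; the paper sets gamma empirically

def recommend(mlps, feature_repr):
    """Eq. (1): one-vs-rest transformation recommendation.

    mlps: dict mapping a transformation name to a trained binary classifier
          with predict_proba (e.g., sklearn.neural_network.MLPClassifier).
    feature_repr: the fixed-size vector R_[f1,...,fr] for the input features.
    Returns the best transformation name, or None if no score exceeds gamma.
    """
    scores = {t: clf.predict_proba([feature_repr])[0, 1]  # g_k(R)
              for t, clf in mlps.items()}
    c = max(scores, key=scores.get)                       # c = argmax_k g_k(R)
    return c if scores[c] > GAMMA else None
```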
In the following sections, we describe how numerical features are represented to serve as the input of the transformation MLPs. Next, we explain the process of collecting past feature engineering knowledge as training samples to train the MLPs.

4.1 Feature-Class Representation

Transformations are used to reveal and improve significant correlation or discriminative information between features and class labels. The more pronounced this correlation, the higher the chance that a model can achieve better predictive performance. Each LFE classifier learns the patterns of feature-class distributions for which the corresponding transformation has been effective in improving feature-class correlation. LFE represents a feature f in a dataset with k classes as follows:

    R_f = [Q_f^(1); Q_f^(2); ...; Q_f^(k)]    (2)

where Q_f^(i) is a fixed-size representation of the values in f that are associated with class i. We call this representation the Quantile Sketch Array. Below, we describe how the feature values associated with class i are translated into the representation Q_f^(i), which is meant to capture the distribution of those feature values.

Neural networks have been successful in learning representations of image and speech data [Bengio et al., 2013]. Such approaches usually start with raw features and learn a Probability Distribution Function (PDF) in an n-dimensional space [Garrido and Juste, 1998]. However, it is not clear how existing representation learning and PDF learning approaches may be applied in the context of raw numerical data. The main challenge is the high variability in the size and the range of feature values (e.g., from 10 to millions of values). In our setting, features are data points represented with varying numbers of dimensions (the number of distinct feature values). Hence, Random Projection [Bingham and Mannila, 2001] for dimensionality reduction is not applicable. Although Recurrent Neural Networks can deal with varying input sizes, we aim at determining a fixed-size representation that captures the correlation between features and target classes.

Previous approaches have used hand-crafted meta-features, including information-theoretic and statistical meta-features, to represent datasets [Michie et al., 1994; Kalousis, 2002; Feurer et al., 2015]. Such meta-features aim at modelling the distribution of values in datasets. Performing fixed-size sampling of feature values is another approach to representing distributions. Samples extracted from features and classes are required to reflect the distribution of values in both the feature and the target classes. While stratified sampling solves this issue for one feature, it becomes more complex for multiple features with high correlation [Skinner et al., 1994]. Feature hashing has been used to represent features of type "string" as feature vectors [Weinberger et al., 2009]. Although feature hashing can be generalized to numerical values, it is not straightforward to choose distance-preserving hash functions that map values within a small range to the same hash value. In Section 5, we empirically show how LFE transformation classifiers perform using each of the representations described above.

The Quantile Sketch Array (QSA) uses a quantile data sketch [Wang et al., 2013] to represent the feature values associated with a class label. QSA is a non-parametric representation that enables characterizing the approximate Probability Distribution Function of values. QSA is closely related to the familiar concept of a histogram, where data is summarized into a small number of buckets. Naumann et al. used a quantile representation of numerical features to perform feature classification [Naumann et al., 2002]. We apply the exact brute-force approach to computing the quantile sketch of numerical values [Wang et al., 2013], described as follows.

Let V_k be the bag of values in feature f that belong to the training data points with label c_k, and let Q_f^(k) be the quantile sketch of V_k. First, we scale these values to a predefined range [lb, ub]. Generating Q_f^(k) involves bucketing all values in V_k into a set of bins. Given a fixed number of bins, r, the range [lb, ub] is partitioned into r disjoint bins of uniform width w = (ub − lb) / r, namely {b_0, ..., b_{r−1}}, where bin b_j spans the range [lb + j·w, lb + (j+1)·w). The function B(v_l) = b_j associates a value v_l in V_k with its bin b_j, and the function P(b_j) returns the number of feature values bucketed in b_j. Finally, I(b_j) = P(b_j) / Σ_{0≤m<r} P(b_m) is the normalized value of bin b_j in Q_f^(k).
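The construction above is straightforward to implement. The sketch below is ours, with assumed details the paper leaves open at this point: bin count r = 10, scaling range [lb, ub] = [0, 1], and per-class min-max scaling.

```python
import numpy as np

def quantile_sketch(values, r=10, lb=0.0, ub=1.0):
    """Q_f^(k): normalized equi-width bin counts I(b_j) for the bag of values
    of one feature restricted to one class. Assumes `values` is non-empty."""
    v = np.asarray(values, dtype=float)
    # Scale values into the predefined range [lb, ub].
    vmin, vmax = v.min(), v.max()
    scaled = np.full_like(v, lb) if vmax == vmin \
        else lb + (v - vmin) * (ub - lb) / (vmax - vmin)
    w = (ub - lb) / r                                          # uniform bin width
    idx = np.minimum(((scaled - lb) / w).astype(int), r - 1)   # B(v_l) = b_j
    counts = np.bincount(idx, minlength=r)                     # P(b_j)
    return counts / counts.sum()                               # I(b_j)

def quantile_sketch_array(values, labels, r=10):
    """R_f (Eq. 2): the per-class sketches Q_f^(1), ..., Q_f^(k), stacked."""
    values, labels = np.asarray(values, dtype=float), np.asarray(labels)
    return np.concatenate(
        [quantile_sketch(values[labels == c], r) for c in np.unique(labels)])
```

For instance, quantile_sketch_array(f, y, r=10) on a binary-class dataset yields a fixed 20-dimensional vector regardless of how many rows or distinct values the feature f has, which is exactly the property the LFE classifiers require.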
4.2 Training

Transformations are applied on features with a particular range of values (e.g., log is applied on numerical ...