Impact of Ensemble Machine Learning Methods on Handling Missing Data

Ernest Perkowski
University of Twente
P.O. Box 217, 7500 AE Enschede, The Netherlands
e.perkowski@student.utwente.nl

ABSTRACT
Missing values are a common problem in data from various sources. When building machine learning classifiers, incomplete data creates a risk of drawing invalid conclusions and producing biased models. This can have a tremendous impact on many business sectors, or even on human lives. Ensemble methods are meta-algorithms that combine weak base estimators into stronger classifiers. Ensemble learning can make use of both ML and non-ML techniques, and this approach has proved to yield better predictions in many use cases. This research examines various usages of ensemble methods for handling missing data. Moreover, the impact of using ensemble learning is explored, given various levels of test data artificially generated based on the missing at random (MAR) mechanism.

Keywords
Data Cleaning, Data Cleansing, Missing Data, Machine Learning, ML, Ensemble, Bagging, Boosting, AdaBoost

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
33rd Twente Student Conference on IT, July 3rd, 2020, Enschede, The Netherlands.
Copyright 2020, University of Twente, Faculty of Electrical Engineering, Mathematics and Computer Science.

1. INTRODUCTION
Data cleaning is a tedious and time-consuming process that aims at the discovery and removal of erroneous, incomplete, inconsistent, and many other types of noise in order to improve the quality of the data [9]. It is believed that this step of data processing takes most of the time needed for data analysis [15]. In order to use predictive models to search for insights, the data should be complete. This is often not the case, as missing values are a common problem introducing bias that impacts the models trained on them. Biased data leads to biased models. The seriousness of this problem depends partly on how much data is missing, the pattern of the data missingness, and its underlying mechanism. There are three main ways to cope with incomplete data. The first, and the least effective [19], is removing the rows with null values. The second includes various imputation techniques such as ad-hoc mean or median substitution, which are considered traditional; more advanced solutions from this category are multiple imputation, maximum likelihood, and expectation maximization [1]. The third focuses on predictive machine learning models, which tend to yield good results [2].

Due to the popularity of the problem, there is extensive research on the various approaches to handling missing values. The main focus of this paper is to examine different ensemble learning techniques, their application, and their performance impact on handling missing data. In particular, the following questions will be explored:

RQ1 What is the state of the art of ensemble methods used for handling missing data?
RQ2 What is the impact of using ensemble machine learning methods, in terms of model fit, on various test data sample sizes?

To answer the above-mentioned questions, a literature review is conducted and some of the ensemble methods used by other researchers are described. Then, a number of experiments are conducted on two separate datasets. The missing values are introduced using a generative process described further in this paper. Some of the most common ML algorithms for solving regression and classification problems are trained and used to predict the previously generated missing values. The percentage of data missingness ranges from 1% to 100% relative to the test data size.

This paper is divided into the following sections. In the Background section, an explanation of key concepts and methods from ensemble learning and missing data mechanisms is given. Related Work describes the discoveries made by researchers working on missing value imputation together with ensembles. This is followed by a discussion of the Methodology and Results of the conducted experiments, aiming to discover the impact of using ML ensemble models on various levels of missing data.

2. BACKGROUND
2.1 Ensemble methods
The core idea of ensemble decision making is present in our daily lives: we seek others' ideas about a problem and then evaluate a few different opinions in order to draw the most optimal conclusions. Ensemble learning aims to improve ML performance by combining a collection of weak classifiers into a single stronger classifier [4], [22]. Thereafter, a new instance is classified by voting on the decision, or by averaging in the case of regression. Below, an explanation of the ensemble methods used later in the experiments is given.

2.1.1 Bagging
Bagging, also called bootstrap aggregating, was introduced in 1996 by Breiman [3]. This method is used for improving unstable estimations or classification problems. Bagging is a variance reduction technique for given base learners, such as decision trees, or for variable selection methods used in linear model fitting. Bagging generates additional data for training from the original dataset, using combinations with repetitions to create multisets with the same data structure as the original set.

Figure 3: Graphical representation of MAR [17]. X represents variables completely observed, Y a partly missing variable, Z a component that causes missingness unrelated to X and Y, and R the missingness.
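As an illustration of the procedure described above, bagging decision trees can be sketched with scikit-learn. The data and parameter choices below are hypothetical, not the setup used in this paper's experiments:

```python
# Illustrative sketch of bagging (hypothetical data, not the paper's setup).
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data stands in for a real dataset.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 50 trees, each fitted on a bootstrap sample (sampling with replacement)
# drawn from the training data, then combined by averaging.
bagging = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50,
                           bootstrap=True, random_state=0)
bagging.fit(X_train, y_train)
print(f"bagging R^2 on held-out data: {bagging.score(X_test, y_test):.3f}")
```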
Figure 1: Graphical representation of Bagging: n bootstrap samples are drawn from the initial dataset, a weak learner is fitted on each bootstrap sample, and the fitted learners are combined into the ensemble model.

2.1.2 Boosting (AdaBoost)
Boosting is a similar approach to Bagging. The core idea is to build a family of models that will later be aggregated to compose a stronger learner, capable of better performance. The main difference between Bagging and Boosting is the sequence in which the tasks are performed. In Bagging, the models are fitted in parallel and independently, while in Boosting this is done sequentially and each next model depends on the models fitted in previous steps. At every step, more focus is directed at the observations that were poorly handled by the previous model, which results in a strong classifier with lower bias. AdaBoost is a modified Boosting algorithm: it keeps track of, and updates, the weights attached to each of the training set observations. The weight determines the observations to focus on.

Figure 2: Graphical representation of AdaBoost: the observation weights start equal and are updated after each sequentially fitted learner.

2.2 Missing Data Mechanisms
When thinking about data, it is important to make a distinction between the different types of missing data randomness. They are crucial to keep in mind, as they determine which statistical treatments of the missing data can be effectively applied. We can distinguish between three main mechanisms [16]:

2.2.1 MAR
Data missing at random (MAR) refers to a collection where instances with and without missing values have a systematic relationship [7]. This can be simply explained with an example from medical data. In an emergency, there is a tendency for some details to be omitted when filling in a medical form, compared to a scheduled appointment with the doctor. In the former situation, time is critical and the patient might not be able to provide all the required details, which yields a relationship.

2.2.2 MCAR
Data missing completely at random (MCAR) represents the variables that are completely unrelated either to the values of the specific variable, or to other measured variables. Compared to MAR, it is more restrictive, as there is no correlation between the missing data. Such a mechanism often occurs in real-world situations [7]. For example, students can obtain MCAR exam results due to unforeseen circumstances that cause the mechanism, e.g. a family situation, a funeral, or illness.

Figure 4: Graphical representation of MCAR [17].

2.2.3 MNAR
When the data missingness is neither MAR nor MCAR but still systematic, it is referred to as data missing not at random (MNAR). In this mechanism, there is a relationship between the missing variable and its values [1]. Suppose there are students that experience test anxiety and have missing test scores due to the fact that they could not carry on with the exam.

Figure 5: Graphical representation of MNAR [17].

3. RELATED WORK
Missing data handling techniques have been studied extensively in the literature. The most well known include various types of imputation (e.g., [6], [8], [5]). This review will focus primarily on ensemble learning approaches to handling missing data.

In one of the first studies in this field, Opitz et al. performed extensive research on over 20 datasets, using both neural networks and decision trees as classifiers for ensemble methods. As a result, it was found that in the majority of cases Bagging offers more accurate predictions than an individual classifier, while in some it is much less accurate than Boosting [12]. This, amongst others, laid the ground for future research by showing the capabilities of such techniques.
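Comparisons of this kind can be reproduced in miniature with scikit-learn. The sketch below fits a single decision tree, a bagged ensemble, and AdaBoost on the same split; the synthetic data and default hyperparameters are this sketch's assumptions, not a reconstruction of the cited study:

```python
# Miniature single-tree vs. Bagging vs. AdaBoost comparison on synthetic
# data; an illustration only, not a reconstruction of the cited study.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

models = {
    # Bagging fits its trees in parallel on independent bootstrap samples.
    "single tree": DecisionTreeClassifier(random_state=1),
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                                 random_state=1),
    # AdaBoost fits learners sequentially, reweighting the observations
    # that the previous learners misclassified.
    "adaboost": AdaBoostClassifier(n_estimators=50, random_state=1),
}
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```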
Twala et al. proposed an ensemble of Bayesian Multiple Imputation (BAMI) and Nearest Neighbour Single Imputation (NNSI). The separate results of both algorithms are fed to decision trees and further evaluated. It was discovered that such a combination improved the accuracy compared to the baseline imputation methods (BAMI and NNSI) [19]. Shortly after, another study on ensembles followed. This time the objective was to compare 7 various missing data handling techniques (MDT), as well as an ensemble of two MDTs. All of the techniques used in the study are non-ML, and the results show that an ensemble of expectation maximization multiple imputation (EMMI) together with C4.5 [14] yields superior performance compared to the individual MDTs [21]. Later, Twala and Cartwright proposed a novel approach based on bootstrap sampling, where incomplete data is split into subsamples and fed into a decision tree classifier. The resulting ensemble consists only of decorrelated decision trees and uses them as input to make a decision [20]. The authors concluded by explaining that the proposed strategy can potentially improve prediction accuracy, especially if used in combination with multiple imputation.

Lu et al. conducted a study where Bagging and Boosting are used for continuous data imputation purposes. The study compares KNN and logistic regression to the earlier mentioned ensemble methods and finds that the more sophisticated approach underestimates variance compared to the true data, but to a significantly lower degree than the individual regressors [10].

A different approach, based on a random subspace for a multiple imputation method, was proposed by Nanni et al. Their idea is to put the missing values into different clusters of random data and calculate their value using the mean of the cluster or its center. This technique requires several iterations on the random subspace to create an ensemble. The authors compare several ensemble and classifier systems on various medical datasets and show that the proposed approach outperforms other existing techniques of missing data handling on numerous datasets, and that its performance does not drop for data missingness of up to 30% [11]. Tran et al. used a combination of multiple imputation and ensemble learning to build a diverse ensemble of classifiers, which was then used for predicting the incomplete data. The study focused on random forest as a regression method and compared the accuracy to other single imputation methods such as hot deck and KNN-based imputation. From the results it is clear that the ensemble of multivariate imputation by chained equations utilising the earlier mentioned regression methods yields the best accuracy [18].

As outlined in the literature review, typical solutions to the missing data problem include various imputation algorithms, which estimate the missing variable based on other observed values of that variable. Due to the sensitivity of individual imputation techniques to significant errors in estimation, especially for high-dimensional datasets, ensemble methods have been employed.

4. METHODOLOGY
This section describes the complete project setup and the steps required to generate missing values, train the models, and test and measure the performance of the individual ML algorithms as well as the ensemble learners.

As outlined in the Background section, the missing data mechanism is an important aspect of data imputation. For this study, MAR has been selected as the relevant type of data missingness, due to its wide occurrence in real-life datasets. The datasets chosen for this research are both small and large, and contain a mix of numerical and categorical variables. The data does not have any missing values, as it was crucial to have total control over the whole datasets.

The experiments were conducted in the Python programming language, using the PyCharm environment [13]. The data was processed and handled using the Pandas library, and the graphs were visualized using the Matplotlib library. To apply the machine learning models to the data, scikit-learn was used.

4.1 Data
To carry out the experiment, the following datasets were used:

• Avocado Prices (retrieved from https://www.kaggle.com/neuromusic/avocado-prices)
• Heart Disease (retrieved from https://www.kaggle.com/ronitf/heart-disease-uci)

These datasets were chosen to provide different perspectives on the results obtained from the experiments. Avocado Prices contains longitudinal data on avocado sales; the dataset is quite large, as it consists of around 19000 rows. On the other hand, the Heart Disease dataset has only 303 rows, and the data comes from the healthcare domain.

4.1.1 Data Preprocessing
To ensure that the results are correct, the data was scaled before any computations, data splitting or model fitting. This is an essential step for ML algorithms that base their predictions on distances between data points. To avoid any features dominating over others when calculating the distances, the data needs to be scaled, as some features have a higher value range than others. This was done using the scikit-learn StandardScaler function, which transforms the features so that their distributions have a mean value of 0 and a standard deviation of 1. The standardization function can be defined as follows:

    z = (x − µ) / σ

4.1.2 Generating MAR Data
To evaluate the performance of the algorithms on predicting the MAR data, a process of introducing empty values into a complete dataset has been created.

First, a target attribute is selected and split from the rest of the data. To simulate that the missing value is only dependent on the data observed, a weight matrix W is defined. Its dimensions are based on the dimensions of the target attribute matrix. The matrix W is filled with an artificially created float-type variable. This variable may not be correlated with any variable present in the dataset other than the target attribute, in order to meet the MAR mechanism requirements.
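One common way to realise a MAR-style mechanism, with missingness of the target driven only by an observed covariate and random weights drawn from a uniform distribution, can be sketched as follows. The probability model and variable names here are this sketch's assumptions, not the paper's exact implementation:

```python
# Sketch of a MAR-style generative process: the probability that `target`
# is missing depends only on an observed covariate, never on the missing
# values themselves. An interpretation of the procedure described in the
# text; the exact probability model is this sketch's assumption.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "observed": rng.normal(size=1000),  # fully observed variable (X)
    "target": rng.normal(size=1000),    # variable that will receive NaNs (Y)
})

def introduce_mar(data, target, driver, missing_rate):
    """Blank out `target` with a probability that grows with `driver`."""
    out = data.copy()
    # Missingness probability increases with the rank of the observed
    # driver variable, so it depends on observed data only.
    prob = out[driver].rank(pct=True).to_numpy()
    prob = prob / prob.mean() * missing_rate  # rescale to the desired rate
    weights = rng.uniform(0.0, 1.0, size=len(out))  # the random weights W
    out.loc[weights < prob, target] = np.nan
    return out

df_mar = introduce_mar(df, target="target", driver="observed", missing_rate=0.3)
print(f"fraction of `target` missing: {df_mar['target'].isna().mean():.2f}")
```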
The matrix W is filled with values randomly drawn from a uniform distribution over [0, 1) and assigned a positive correlation to the probability of missingness of the target attribute. These steps ensure that the MNAR mechanism is not achieved by mistake, as the target attribute is not given a correlation to any other variable in the dataset. Having the matrix W and the probability of missingness, we can introduce missing values into the target attribute by comparing the randomly generated weight with the conditional missing probability. This process is repeated several times in order to allow for the generation of high percentages of missing data. The amount of NaN values is determined by a threshold value which is assigned before the algorithm is run.

4.1.3 Train/Test Split
To maintain a stable amount of data for the training set, while changing the amount of missing values in each iteration of model training and testing, two subsets of data were created. From the entire collection, half of the rows were sampled to be used for the training set. The remainder served as a base set for generating missing data. In each iteration, a new percentage of missing values was introduced, and the rows containing null values were selected and used as the testing set to evaluate the score of the algorithms.

4.2 Algorithms
Several ML models have been chosen to conduct the experiments. They have been divided into two categories, where the usage depends on the variable type. These models were selected because they are the most widely used for regression and classification problems.

Table 1: Models selected
    Numerical variables              Categorical variables
    Linear Regression                Logistic Regression
    Bayesian Ridge Regression        Perceptron
    Decision Tree Regressor          Decision Tree Classifier
    K-Nearest Neighbors Regressor    K-Nearest Neighbors Classifier

4.2.1 Model Evaluation
The setup of the experiments needed to tackle both a regression and a classification problem. For this reason, in each dataset one categorical and one numerical attribute was selected. To perform the predictive modelling, an appropriate algorithm was used depending on the type of variable. The performance of the ensemble methods and the individual ML models was assessed using precision and recall for categorical variables. Numerical values, on the other hand, were assessed using the scikit-learn r2 score function.

4.2.2 Hyperparameter Tuning
Decision Tree and KNN algorithm performance is highly impacted by the parameters used for fitting the models. To ensure that the models are trained with optimal parameters, GridSearchCV from scikit-learn has been used for the evaluation. GridSearchCV takes an array of possible parameters and tests the performance of the model with each combination of parameters. Based on the scores, it returns the most optimal combination. The hyperparameters used in GridSearchCV are listed below.

Table 2: Decision Tree hyperparameters
    Parameter            Values
    criterion            gini, entropy
    splitter             best, random
    max depth            2, 3, ..., (training samples) − 1
    min samples split    2, 3, ..., 12

Table 3: KNN hyperparameters
    Parameter      Values
    n neighbors    2, 3, ..., 12
    weight         uniform, distance
    algorithm      auto, ball tree, kd tree, brute
    leaf size      12, 20, ..., 100
    p              1, 2, ..., 10

4.2.3 Ensemble Methods
Bagging and AdaBoost are the two ensemble methods used to conduct this study. The functions were applied from the scikit-learn library and used with default parameters. Since one of the ML algorithms evaluated is the Decision Tree, Random Forest has been added to this study as well. RF is a bagging method which creates an ensemble of decision trees with large depths; moreover, the algorithm makes use of random feature selection subspaces for more robust models. The implementation of AdaBoost for KNNClassifier was not possible using scikit-learn.

5. RESULTS
The objective of the experiments was to discover the significance of the ensemble model fit gain, when evaluated on missing data prediction, compared to the individual ML algorithms, with the effects of different proportions of missing data when classifying new instances further evaluated. The graphs visualize the model fit performance score for the different ML models and ensembles. The plots express the score obtained by a specific algorithm using either R2 (for the numerical variable) or the f1 score (for the categorical variable). Each of the graphs contains a legend explaining which color symbolizes a specific algorithm. Some of the visualizations showing the most significant impact of the ensemble methods can be seen below, while the remaining graphs can be found in the Appendix section.

Of the selected algorithms, all performed well (more than 90% score on average) in the classification problem on the Avocado dataset. Furthermore, in the Avocado dataset, we can see high performance (around 80% R2 score) of the Decision Tree Regressor, with the ensembles yielding an improvement of over 10-15% compared to the base learner (see Figure 6). KNN Regression scores similarly to the Decision Tree and its AdaBoost ensemble, while Bagging slightly improved the results, giving on average a 2-3% increase (see Figure 7). Both Linear Regression and Bayesian Ridge score very similarly to each other (around 58% on average, see Appendix), with Bagging giving almost exactly the same results as the base learner, and AdaBoost scoring significantly lower than the base learner.

Figure 6: Decision Tree Regressor and its ensembles (Bagging, AdaBoost, Random Forest) on the 'AveragePrice' attribute from the Avocado dataset. The y-axis shows the R squared score; the x-axis shows the amount of missing data relative to the training data size (0-100%).
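As a reference for the tuning step described in Section 4.2.2, a grid search over the decision tree parameters of Table 2 can be sketched as follows. The data is synthetic, and the max depth range is truncated to keep the sketch fast; this is an illustration, not the paper's exact run:

```python
# Sketch of hyperparameter tuning with GridSearchCV (synthetic data; the
# grid mirrors Table 2, with a truncated max_depth range for brevity).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_grid = {
    "criterion": ["gini", "entropy"],
    "splitter": ["best", "random"],
    "max_depth": list(range(2, 11)),          # Table 2: up to (samples) - 1
    "min_samples_split": list(range(2, 13)),  # Table 2: 2, 3, ..., 12
}
# Every combination is fitted and scored with 5-fold cross-validation;
# the best-scoring combination is kept.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print("best parameters:", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
```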