Impact of Ensemble Machine Learning Methods on Handling Missing Data

Ernest Perkowski
University of Twente
P.O. Box 217, 7500 AE Enschede, The Netherlands
e.perkowski@student.utwente.nl

ABSTRACT
Missing values are a common problem in data from various sources. When building machine learning classifiers, incomplete data creates a risk of drawing invalid conclusions and producing biased models. This can have a tremendous impact on many business sectors, or even on human lives. Ensemble methods are meta-algorithms that combine weak base estimators into stronger classifiers. Ensemble learning can make use of both ML and non-ML techniques, and this approach has proved to yield better predictions in many use cases. This research examines various usages of ensemble methods for handling missing data. Moreover, the impact of using ensemble learning is explored, given various levels of test data artificially generated based on the missing at random (MAR) mechanism.

Keywords
Data Cleaning, Data Cleansing, Missing Data, Machine Learning, ML, Ensemble, Bagging, Boosting, AdaBoost

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
33rd Twente Student Conference on IT, July 3rd, 2020, Enschede, The Netherlands.
Copyright 2020, University of Twente, Faculty of Electrical Engineering, Mathematics and Computer Science.

1. INTRODUCTION
Data cleaning is a tedious and time-consuming process that aims at the discovery and removal of erroneous, incomplete, inconsistent, and many other types of noise in order to improve the quality of the data [9]. It is believed that this step of data processing takes most of the time needed for data analysis [15]. In order to use predictive models to search for insights, the data should be complete. This is often not the case, as missing values are a common problem introducing bias that impacts the models trained on them. Biased data leads to biased models. The seriousness of this problem depends partly on how much data is missing, the pattern of the data missingness, and its underlying mechanism. There are three main ways to cope with incomplete data. The first, and the least effective [19], is removing the rows with null values. The second includes various imputation techniques such as ad-hoc mean or median substitution, which are considered traditional; more advanced solutions from this category are multiple imputation, maximum likelihood, and expectation maximization [1]. The third focuses on predictive machine learning models, which tend to yield good results [2].

Due to the popularity of the problem, there is extensive research on the various approaches to handling missing values. The main focus of this paper is to examine different ensemble learning techniques, their application, and their performance impact on handling missing data. In particular, the following questions will be explored:

RQ1 What is the state of the art of ensemble methods used for handling missing data?
RQ2 What is the impact of using ensemble machine learning methods, in terms of model fit, on various test data sample sizes?

To answer the above-mentioned questions, a literature review is conducted and some of the ensemble methods used by other researchers are described. Then, a number of experiments are conducted on two separate datasets. The missing values are introduced using a generative process described further in this paper. Some of the most common ML algorithms for solving regression and classification problems are trained and used to predict the previously generated missing values. The percentage of data missingness ranges from 1% to 100% relative to the test data size.

This paper is divided into the following sections. In the Background section, an explanation of key concepts and methods from ensemble learning and missing data mechanisms is given. Related Work describes the discoveries made by researchers working on missing value imputation together with ensembles. This is followed by a discussion of the Methodology and Results of the conducted experiments, aiming to discover the impact of using ML ensemble models on various levels of missing data.

2. BACKGROUND
2.1 Ensemble methods
The core idea of ensemble decision making is present in our daily lives: we seek others' ideas about a problem and then evaluate a few different opinions in order to draw the most optimal conclusions. Ensemble learning aims to improve ML performance by combining a collection of weak classifiers into a single stronger classifier [4], [22]. Thereafter, a new instance is classified by voting on the decision, or by averaging in the case of regression. Below, an explanation of the ensemble methods used later in the experiments is given.

2.1.1 Bagging
Bagging, also called bootstrap aggregating, was introduced in 1996 by Breiman [3]. This method is used for improving unstable estimations or classification problems. Bagging is a variance reduction technique for given base learners, such as decision trees, or for variable selection methods used in linear model fitting. Bagging generates additional data for training from the original dataset, using combinations with repetitions to create multisets with the same data structure as the original set.

Figure 3: Graphical representation of MAR [17]. X represents variables completely observed, Y a partly missing variable, Z a component that causes missingness unrelated to X and Y, and R the missingness.
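As an illustration of the procedure described above, bagging decision trees can be sketched with scikit-learn. The data and parameter choices below are hypothetical, not the setup used in this paper's experiments:

```python
# Illustrative sketch of bagging (hypothetical data, not the paper's setup).
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data stands in for a real dataset.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 50 trees, each fitted on a bootstrap sample (sampling with replacement)
# drawn from the training data, then combined by averaging.
bagging = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50,
                           bootstrap=True, random_state=0)
bagging.fit(X_train, y_train)
print(f"bagging R^2 on held-out data: {bagging.score(X_test, y_test):.3f}")
```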
Figure 1: Graphical representation of Bagging: n bootstrap samples are drawn from the initial dataset, a weak learner is fitted on each bootstrap sample, and the fitted learners are combined into the ensemble model.

2.1.2 Boosting (AdaBoost)
Boosting is a similar approach to Bagging. The core idea is to build a family of models that will later be aggregated to compose a stronger learner, capable of better performance. The main difference between Bagging and Boosting is the sequence in which the tasks are performed. In Bagging, the models are fitted in parallel and independently, while in Boosting this is done sequentially and each next model depends on the models fitted in previous steps. At every step, more focus is directed at the observations that were poorly handled by the previous model, which results in a strong classifier with lower bias. AdaBoost is a modified Boosting algorithm: it keeps track of, and updates, the weights attached to each of the training set observations. The weight determines the observations to focus on.

Figure 2: Graphical representation of AdaBoost: the observation weights start equal and are updated after each sequentially fitted learner.

2.2 Missing Data Mechanisms
When thinking about data, it is important to make a distinction between the different types of missing data randomness. They are crucial to keep in mind, as they determine which statistical treatments of the missing data can be effectively applied. We can distinguish between three main mechanisms [16]:

2.2.1 MAR
Data missing at random (MAR) refers to a collection where instances with and without missing values have a systematic relationship [7]. This can be simply explained with an example from medical data. In an emergency, there is a tendency for some details to be omitted when filling in a medical form, compared to a scheduled appointment with the doctor. In the former situation, time is critical and the patient might not be able to provide all the required details, which yields a relationship.

2.2.2 MCAR
Data missing completely at random (MCAR) represents the variables that are completely unrelated either to the values of the specific variable, or to other measured variables. Compared to MAR, it is more restrictive, as there is no correlation between the missing data. Such a mechanism often occurs in real-world situations [7]. For example, students can obtain MCAR exam results due to unforeseen circumstances that cause the mechanism, e.g. a family situation, a funeral, or illness.

Figure 4: Graphical representation of MCAR [17].

2.2.3 MNAR
When the data missingness is neither MAR nor MCAR but still systematic, it is referred to as data missing not at random (MNAR). In this mechanism, there is a relationship between the missing variable and its values [1]. Suppose there are students that experience test anxiety and have missing test scores due to the fact that they could not carry on with the exam.

Figure 5: Graphical representation of MNAR [17].

3. RELATED WORK
Missing data handling techniques have been studied extensively in the literature. The most well known include various types of imputation (e.g., [6], [8], [5]). This review will focus primarily on ensemble learning approaches to handling missing data.

In one of the first studies in this field, Opitz et al. performed extensive research on over 20 datasets, using both neural networks and decision trees as classifiers for ensemble methods. As a result, it was found that in the majority of cases Bagging offers more accurate predictions than an individual classifier, while in some it is much less accurate than Boosting [12]. This, amongst others, laid the ground for future research by showing the capabilities of such techniques.
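Comparisons of this kind can be reproduced in miniature with scikit-learn. The sketch below fits a single decision tree, a bagged ensemble, and AdaBoost on the same split; the synthetic data and default hyperparameters are this sketch's assumptions, not a reconstruction of the cited study:

```python
# Miniature single-tree vs. Bagging vs. AdaBoost comparison on synthetic
# data; an illustration only, not a reconstruction of the cited study.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

models = {
    # Bagging fits its trees in parallel on independent bootstrap samples.
    "single tree": DecisionTreeClassifier(random_state=1),
    "bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                                 random_state=1),
    # AdaBoost fits learners sequentially, reweighting the observations
    # that the previous learners misclassified.
    "adaboost": AdaBoostClassifier(n_estimators=50, random_state=1),
}
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```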
Twala et al. proposed an ensemble of Bayesian Multiple Imputation (BAMI) and Nearest Neighbour Single Imputation (NNSI). The separate results of both algorithms are fed to decision trees and further evaluated. It was discovered that such a combination improved the accuracy compared to the baseline imputation methods (BAMI and NNSI) [19]. Shortly after, another study on ensembles followed. This time the objective was to compare 7 various missing data handling techniques (MDT), as well as an ensemble of two MDTs. All of the techniques used in the study are non-ML, and the results show that an ensemble of expectation maximization multiple imputation (EMMI) together with C4.5 [14] yields superior performance compared to the individual MDTs [21]. Later, Twala and Cartwright proposed a novel approach based on bootstrap sampling, where incomplete data is split into subsamples and fed into a decision tree classifier. The resulting ensemble consists only of decorrelated decision trees and uses them as input to make a decision [20]. The authors concluded by explaining that the proposed strategy can potentially improve prediction accuracy, especially if used in combination with multiple imputation.

Lu et al. conducted a study where Bagging and Boosting are used for continuous data imputation purposes. The study compares KNN and logistic regression to the earlier mentioned ensemble methods and finds that the more sophisticated approach underestimates variance compared to the true data, but to a significantly lower degree than the individual regressors [10].

A different approach, based on a random subspace for a multiple imputation method, was proposed by Nanni et al. Their idea is to put the missing values into different clusters of random data and calculate their value using the mean of the cluster or its center. This technique requires several iterations on the random subspace to create an ensemble. The authors compare several ensemble and classifier systems on various medical datasets and show that the proposed approach outperforms other existing techniques of missing data handling on numerous datasets, and that its performance does not drop for data missingness of up to 30% [11]. Tran et al. used a combination of multiple imputation and ensemble learning to build a diverse ensemble of classifiers, which was then used for predicting the incomplete data. The study focused on random forest as a regression method and compared the accuracy to other single imputation methods such as hot deck and KNN-based imputation. From the results it is clear that the ensemble of multivariate imputation by chained equations utilising the earlier mentioned regression methods yields the best accuracy [18].

As outlined in the literature review, typical solutions to the missing data problem include various imputation algorithms, which estimate the missing variable based on other observed values of that variable. Due to the sensitivity of individual imputation techniques to significant errors in estimation, especially for high-dimensional datasets, ensemble methods have been employed.

4. METHODOLOGY
This section describes the complete project setup and the steps required to generate missing values, train the models, and test and measure the performance of the individual ML algorithms as well as the ensemble learners.

As outlined in the Background section, the missing data mechanism is an important aspect of data imputation. For this study, MAR has been selected as the relevant type of data missingness, due to its wide occurrence in real-life datasets. The datasets chosen for this research are both small and large, and contain a mix of numerical and categorical variables. The data does not have any missing values, as it was crucial to have total control over the whole datasets.

The experiments were conducted in the Python programming language, using the PyCharm environment [13]. The data was processed and handled using the Pandas library, and the graphs were visualized using the Matplotlib library. To apply the machine learning models to the data, scikit-learn was used.

4.1 Data
To carry out the experiment, the following datasets were used:

• Avocado Prices (retrieved from https://www.kaggle.com/neuromusic/avocado-prices)
• Heart Disease (retrieved from https://www.kaggle.com/ronitf/heart-disease-uci)

These datasets were chosen to provide different perspectives on the results obtained from the experiments. Avocado Prices contains longitudinal data on avocado sales; the dataset is quite large, as it consists of around 19000 rows. On the other hand, the Heart Disease dataset has only 303 rows, and the data comes from the healthcare domain.

4.1.1 Data Preprocessing
To ensure that the results are correct, the data was scaled before any computations, data splitting or model fitting. This is an essential step for ML algorithms that base their predictions on distances between data points. To avoid any features dominating over others when calculating the distances, the data needs to be scaled, as some features have a higher value range than others. This was done using the scikit-learn StandardScaler function, which transforms the features so that their distributions have a mean value of 0 and a standard deviation of 1. The standardization function can be defined as follows:

    z = (x − µ) / σ

4.1.2 Generating MAR Data
To evaluate the performance of the algorithms on predicting the MAR data, a process of introducing empty values into a complete dataset has been created.

First, a target attribute is selected and split from the rest of the data. To simulate that the missing value is only dependent on the data observed, a weight matrix W is defined. Its dimensions are based on the dimensions of the target attribute matrix. The matrix W is filled with an artificially created float-type variable. This variable may not be correlated with any variable present in the dataset other than the target attribute, in order to meet the MAR mechanism requirements.
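One common way to realise a MAR-style mechanism, with missingness of the target driven only by an observed covariate and random weights drawn from a uniform distribution, can be sketched as follows. The probability model and variable names here are this sketch's assumptions, not the paper's exact implementation:

```python
# Sketch of a MAR-style generative process: the probability that `target`
# is missing depends only on an observed covariate, never on the missing
# values themselves. An interpretation of the procedure described in the
# text; the exact probability model is this sketch's assumption.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "observed": rng.normal(size=1000),  # fully observed variable (X)
    "target": rng.normal(size=1000),    # variable that will receive NaNs (Y)
})

def introduce_mar(data, target, driver, missing_rate):
    """Blank out `target` with a probability that grows with `driver`."""
    out = data.copy()
    # Missingness probability increases with the rank of the observed
    # driver variable, so it depends on observed data only.
    prob = out[driver].rank(pct=True).to_numpy()
    prob = prob / prob.mean() * missing_rate  # rescale to the desired rate
    weights = rng.uniform(0.0, 1.0, size=len(out))  # the random weights W
    out.loc[weights < prob, target] = np.nan
    return out

df_mar = introduce_mar(df, target="target", driver="observed", missing_rate=0.3)
print(f"fraction of `target` missing: {df_mar['target'].isna().mean():.2f}")
```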
The matrix W is filled with values randomly drawn from a uniform distribution over [0, 1) and assigned a positive correlation to the probability of missingness of the target attribute. These steps ensure that the MNAR mechanism is not achieved by mistake, as the target attribute is not given a correlation to any other variable in the dataset. Having the matrix W and the probability of missingness, we can introduce missing values into the target attribute by comparing the randomly generated weight with the conditional missing probability. This process is repeated several times in order to allow for the generation of high percentages of missing data. The amount of NaN values is determined by a threshold value which is assigned before the algorithm is run.

4.1.3 Train/Test Split
To maintain a stable amount of data for the training set, while changing the amount of missing values in each iteration of model training and testing, two subsets of data were created. From the entire collection, half of the rows were sampled to be used for the training set. The remainder served as a base set for generating missing data. In each iteration, a new percentage of missing values was introduced, and the rows containing null values were selected and used as the testing set to evaluate the score of the algorithms.

4.2 Algorithms
Several ML models have been chosen to conduct the experiments. They have been divided into two categories, where the usage depends on the variable type. These models were selected because they are the most widely used for regression and classification problems.

Table 1: Models selected
    Numerical variables              Categorical variables
    Linear Regression                Logistic Regression
    Bayesian Ridge Regression        Perceptron
    Decision Tree Regressor          Decision Tree Classifier
    K-Nearest Neighbors Regressor    K-Nearest Neighbors Classifier

4.2.1 Model Evaluation
The setup of the experiments needed to tackle both a regression and a classification problem. For this reason, in each dataset one categorical and one numerical attribute was selected. To perform the predictive modelling, an appropriate algorithm was used depending on the type of variable. The performance of the ensemble methods and the individual ML models was assessed using precision and recall for categorical variables. Numerical values, on the other hand, were assessed using the scikit-learn r2 score function.

4.2.2 Hyperparameter Tuning
Decision Tree and KNN algorithm performance is highly impacted by the parameters used for fitting the models. To ensure that the models are trained with optimal parameters, GridSearchCV from scikit-learn has been used for the evaluation. GridSearchCV takes an array of possible parameters and tests the performance of the model with each combination of parameters. Based on the scores, it returns the most optimal combination. The hyperparameters used in GridSearchCV are listed below.

Table 2: Decision Tree hyperparameters
    Parameter            Values
    criterion            gini, entropy
    splitter             best, random
    max depth            2, 3, ..., (training samples) − 1
    min samples split    2, 3, ..., 12

Table 3: KNN hyperparameters
    Parameter      Values
    n neighbors    2, 3, ..., 12
    weight         uniform, distance
    algorithm      auto, ball tree, kd tree, brute
    leaf size      12, 20, ..., 100
    p              1, 2, ..., 10

4.2.3 Ensemble Methods
Bagging and AdaBoost are the two ensemble methods used to conduct this study. The functions were applied from the scikit-learn library and used with default parameters. Since one of the ML algorithms evaluated is the Decision Tree, Random Forest has been added to this study as well. RF is a bagging method which creates an ensemble of decision trees with large depths; moreover, the algorithm makes use of random feature selection subspaces for more robust models. The implementation of AdaBoost for KNNClassifier was not possible using scikit-learn.

5. RESULTS
The objective of the experiments was to discover the significance of the ensemble model fit gain, when evaluated on missing data prediction, compared to the individual ML algorithms, with the effects of different proportions of missing data when classifying new instances further evaluated. The graphs visualize the model fit performance score for the different ML models and ensembles. The plots express the score obtained by a specific algorithm using either R2 (for the numerical variable) or the f1 score (for the categorical variable). Each of the graphs contains a legend explaining which color symbolizes a specific algorithm. Some of the visualizations showing the most significant impact of the ensemble methods can be seen below, while the remaining graphs can be found in the Appendix section.

Of the selected algorithms, all performed well (more than 90% score on average) in the classification problem on the Avocado dataset. Furthermore, in the Avocado dataset, we can see high performance (around 80% R2 score) of the Decision Tree Regressor, with the ensembles yielding an improvement of over 10-15% compared to the base learner (see Figure 6). KNN Regression scores similarly to the Decision Tree and its AdaBoost ensemble, while Bagging slightly improved the results, giving on average a 2-3% increase (see Figure 7). Both Linear Regression and Bayesian Ridge score very similarly to each other (around 58% on average, see Appendix), with Bagging giving almost exactly the same results as the base learner, and AdaBoost scoring significantly lower than the base learner.

Figure 6: Decision Tree Regressor and its ensembles (Bagging, AdaBoost, Random Forest) on the 'AveragePrice' attribute from the Avocado dataset. The y-axis shows the R squared score; the x-axis shows the amount of missing data relative to the training data size (0-100%).
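As a reference for the tuning step described in Section 4.2.2, a grid search over the decision tree parameters of Table 2 can be sketched as follows. The data is synthetic, and the max depth range is truncated to keep the sketch fast; this is an illustration, not the paper's exact run:

```python
# Sketch of hyperparameter tuning with GridSearchCV (synthetic data; the
# grid mirrors Table 2, with a truncated max_depth range for brevity).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

param_grid = {
    "criterion": ["gini", "entropy"],
    "splitter": ["best", "random"],
    "max_depth": list(range(2, 11)),          # Table 2: up to (samples) - 1
    "min_samples_split": list(range(2, 13)),  # Table 2: 2, 3, ..., 12
}
# Every combination is fitted and scored with 5-fold cross-validation;
# the best-scoring combination is kept.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print("best parameters:", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
```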