151x Filetype PDF File size 3.20 MB Source: itlab.uta.edu
Data Mining What is clustering Cluster Analysis: Basic Concepts and Algorithms Clustering: the process of grouping a set of objects into classes of similar objects Lecture Notes for Chapter 7 Documents within a cluster should be similar. Documents from different clusters should be dissimilar. The commonest form of unsupervised learning Introduction to Data Mining Unsupervised learning = learning from raw data, as opposed to by supervised data where a classification of examples/samples is given/known Tan, Steinbach, Kumar A common and important task that finds many applications in IR + and other places Other sources © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 2 Intra- and inter-cluster distance A data set with clear cluster structure Finding groups of objects such that the objects in a group will be similar (or related) to one another and different How would you from (or unrelated to) the objects in other groups design an algorithm Inter-cluster for finding the three Intra-cluster distances are clusters in this case? distances are maximized minimized Humans do it effortlessly 3 © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Historic application of clustering Applications Location of new stores Pizza delivery locations Distribution centers (e.g., Amazon, …) ATM machines Location of artilleries in combat Need to be careful about distance metric used If you end up picking a place on the other side of the river with only one bridge, it may not be a wise decision Placement of artillery. Hills and other obstacles are not taken care of in Euclidian distance! 6 Applications of Cluster Analysis What is not Cluster Analysis? Understanding Discovered Clusters Industry Group Supervised classification Applied-Matl-DOWN,Bay-Network-Down,3-COM-DOWN, 1 Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN, – Group related documents DSC-Comm-DOWN,INTEL-DOWN,LSI-Logic-DOWN, Technology1-DOWN – Have class label information Micron-Tech-DOWN,Texas-Inst-Down,Tellabs-Inc-Down, Natl-Semiconduct-DOWN,Oracl-DOWN,SGI-DOWN, for browsing, Sun-DOWN Apple-Comp-DOWN,Autodesk-DOWN,DEC-DOWN, 2 ADV-Micro-Device-DOWN,Andrew-Corp-DOWN, – group genes and proteins Computer-Assoc-DOWN,Circuit-City-DOWN, Technology2-DOWN Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Simple segmentation that have similar Motorola-DOWN,Microsoft-DOWN,Scientific-Atl-DOWN Fannie-Mae-DOWN,Fed-Home-Loan-DOWN, – Dividing students into different registration groups functionality, or 3 MBNA-Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN Baker-Hughes-UP,Dresser-Inds-UP,Halliburton-HLD-UP, alphabetically, by last name, zip code, age, … 4 Louisiana-Land-UP,Phillips-Petro-UP,Unocal-UP, Oil-UP – group stocks with similar Schlumberger-UP price fluctuations Results of a query – Groupings are a result of an external specification Summarization – Reduce the size of large Graph partitioning data sets – Some mutual relevance and synergy, but areas are not Clustering precipitation identical in Australia © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Notion of a Cluster can be Ambiguous Types of Clusterings A clustering is a set of clusters Important distinction between hierarchical and partitional sets of clusters How many clusters? Six Clusters Partitional Clustering – A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset Hierarchical clustering Two Clusters Four Clusters – A set of nested clusters organized as a hierarchical tree © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Partitional Clustering Hierarchical Clustering p1 p3 p4 p2 p1 p2 p3 p4 Traditional Hierarchical Clustering Traditional Dendrogram p1 p3 p4 p2 p1 p2 p3 p4 Original Points A Partitional Clustering Non-traditional Hierarchical Clustering Non-traditional Dendrogram © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Other Distinctions Between Sets of Clusters Types of Clusters Exclusive versus non-exclusive Well-separated clusters – In non-exclusive clusterings, points may belong to multiple clusters. – Can represent multiple classes or ‘border’ points Center-based clusters Fuzzy versus non-fuzzy – In fuzzy clustering, a point belongs to every cluster with some Contiguous clusters weight between 0 and 1 – Weights must sum to 1 – Probabilistic clustering has similar characteristics Density-based clusters Partial versus complete – In some cases, we only want to cluster some of the data Property or Conceptual Heterogeneous versus homogeneous – Cluster of widely different sizes, shapes, and densities Described by an Objective Function © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Types of Clusters: Well-Separated Types of Clusters: Center-Based Well-Separated Clusters: Center-based – A cluster is a set of points such that any point in a cluster is – A cluster is a set of objects such that an object in a cluster is closer (or more similar) to every other point in the cluster than closer (more similar) to the “center” of a cluster, than to the to any point not in the cluster. center of any other cluster – The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster – Medoid is an actual point where as centroid need not be. 3 well-separated clusters 4 center-based clusters © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#›
no reviews yet
Please Login to review.