Data Mining Notes 179198 | Chap8 Basic Cluster Analysis(2)

Partial capture of text on file.
                                                                             Data Mining                                                                                   What is clustering
                                                 Cluster Analysis: Basic Concepts 
                                                                         and Algorithms                                                                                     Clustering: the process of grouping a set of objects into 
                                                                                                                                                                                classes of similar objects
                                                          Lecture Notes for Chapter 7                                                                                          Documents within a cluster should be similar.
                                                                                                                                                                               Documents from different clusters should be dissimilar.
                                                                                                                                                                            The commonest form of unsupervised learning
                                                           Introduction to Data Mining                                                                                         Unsupervised learning = learning from raw data, as opposed to 
                                                                                          by                                                                                       supervised data where a classification of examples/samples is 
                                                                                                                                                                                   given/known
                                                                    Tan, Steinbach, Kumar                                                                                      A common and important task that finds many applications in IR 
                                                                                           +                                                                                       and other places
                                                                             Other sources
                                    © Tan,Steinbach, Kumar               Introduction to Data Mining                           4/18/2004               1                2
                                                                                                                                                                             Intra- and inter-cluster distance
                                    A data set with clear cluster structure                                                                                                    Finding groups of objects such that the objects in a group 
                                                                                                                                                                                  will be similar (or related) to one another and different 
                                                                                                            How would you                                                         from (or unrelated to) the objects in other groups
                                                                                                            design an algorithm                                                                                                                                  Inter-cluster 
                                                                                                            for finding the three                                                            Intra-cluster                                                      distances are 
                                                                                                            clusters in this case?                                                          distances are                                                         maximized
                                                                                                                                                                                              minimized
                                                                                                         Humans do it effortlessly
                                3                                                                                                                                           © Tan,Steinbach, Kumar               Introduction to Data Mining                           4/18/2004               ‹#›
                                                                                                                                                                                                                                                                                                                                   Historic application of clustering                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Applications
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Location of new stores
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           Pizza delivery locations
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           Distribution centers (e.g., Amazon, …)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           ATM machines
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           Location of artilleries in combat
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Need to be careful about distance metric used
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           If you end up picking a place on the other side of the river with 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       only one bridge, it may not be a wise decision
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           Placement of artillery. Hills and other obstacles are not taken care 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       of in Euclidian distance!
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         6
                                                                                                                                                                                                                                                                        Applications of Cluster Analysis                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     What is not Cluster Analysis?
                                                                                                                                                                                                                                                                            Understanding                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   Discovered Clusters                                                                                                                                                                                                Industry Group                                                                                                                                                                                                                                                                         Supervised classification
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         Applied-Matl-DOWN,Bay-Network-Down,3-COM-DOWN,                                                                                                                                                                                                                                                                                         
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          1                                                                Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,                                                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                                                                                       – Group related documents                                                                                                                                                                                                                                                                                                                                                                                DSC-Comm-DOWN,INTEL-DOWN,LSI-Logic-DOWN,                                                                                                                                                                                                                                 Technology1-DOWN                                                                                                                                                                                                                                                                                                                         – Have class label information
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Micron-Tech-DOWN,Texas-Inst-Down,Tellabs-Inc-Down, 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Natl-Semiconduct-DOWN,Oracl-DOWN,SGI-DOWN, 
                                                                                                                                                                                                                                                                                                                                                        for browsing,                                                                                                                                                                                                                                                                                                                                                                                                                                                         Sun-DOWN 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Apple-Comp-DOWN,Autodesk-DOWN,DEC-DOWN,                                                                                                                                                                                                                                                                                    
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          2                                                                ADV-Micro-Device-DOWN,Andrew-Corp-DOWN,                                                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                                                                                                       – group genes and proteins                                                                                                                                                                                                                                                                                                                                                                                                    Computer-Assoc-DOWN,Circuit-City-DOWN,                                                                                                                                                                                                              Technology2-DOWN 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN,                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Simple segmentation
                                                                                                                                                                                                                                                                                                                                                        that have similar                                                                                                                                                                                                                                                                                                                                                   Motorola-DOWN,Microsoft-DOWN,Scientific-Atl-DOWN 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Fannie-Mae-DOWN,Fed-Home-Loan-DOWN,                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         – Dividing students into different registration groups 
                                                                                                                                                                                                                                                                                                                                                        functionality, or                                                                                                                                                                                                                                                                                                 3                                                                          MBNA-Corp-DOWN,Morgan-Stanley-DOWN                                                                                                                                                                                                                            Financial-DOWN 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              Baker-Hughes-UP,Dresser-Inds-UP,Halliburton-HLD-UP,                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             alphabetically, by last name, zip code, age, …
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          4                                                                    Louisiana-Land-UP,Phillips-Petro-UP,Unocal-UP,                                                                                                                                                                                                                                                Oil-UP 
                                                                                                                                                                                                                                                                                                                       – group stocks with similar                                                                                                                                                                                                                                                                                                                                                                                                                                                                 Schlumberger-UP 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                                                        price fluctuations                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             Results of a query
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  – Groupings are a result of an external specification
                                                                                                                                                                                                                                                                            Summarization
                                                                                                                                                                                                                                                                                                                       – Reduce the size of large                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Graph partitioning
                                                                                                                                                                                                                                                                                                                                                        data sets                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 – Some mutual relevance and synergy, but areas are not 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Clustering precipitation                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            identical
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          in Australia
                                                                                                                                                                                                                                                               © Tan,Steinbach, Kumar                                                                                                                                                                                                                                                  Introduction to Data Mining                                                                                                                                                                                                                                                                                                                                                               4/18/2004               ‹#›                                                                                                                                                                                                                                                                                                         © Tan,Steinbach, Kumar                                                                                                                                                                                                                                                  Introduction to Data Mining                                                                                                                                                                                                                                                                                                                                                              4/18/2004               ‹#›
                                      Notion of a Cluster can be Ambiguous                                                                                                   Types of Clusterings
                                                                                                                                                                                  A clustering is a set of clusters
                                                                                                                                                                                  Important distinction between hierarchical and 
                                                                                                                                                                                      partitional sets of clusters 
                                              How many clusters?                                          Six Clusters                                                            Partitional Clustering
                                                                                                                                                                                       – A division of data objects into non-overlapping subsets 
                                                                                                                                                                                           (clusters) such that each data object is in exactly one subset
                                                                                                                                                                                  Hierarchical clustering
                                              Two Clusters                                                Four Clusters                                                                – A set of nested clusters organized as a hierarchical tree 
                                    © Tan,Steinbach, Kumar               Introduction to Data Mining                           4/18/2004               ‹#›                  © Tan,Steinbach, Kumar               Introduction to Data Mining                           4/18/2004               ‹#›
                                      Partitional Clustering                                                                                                                 Hierarchical Clustering
                                                                                                                                                                                         p1 
                                                                                                                                                                                                       p3  p4 
                                                                                                                                                                                               p2 
                                                                                                                                                                                                                                                  p1 p2 p3 p4
                                                                                                                                                                                     Traditional Hierarchical Clustering                 Traditional Dendrogram
                                                                                                                                                                                          p1
                                                                                                                                                                                                        p3   p4
                                                                                                                                                                                                p2
                                                                                                                                                                                                                                                  p1 p2         p3 p4
                                              Original Points                                              A Partitional  Clustering                                                 Non-traditional Hierarchical Clustering             Non-traditional Dendrogram
                                    © Tan,Steinbach, Kumar               Introduction to Data Mining                           4/18/2004               ‹#›                  © Tan,Steinbach, Kumar               Introduction to Data Mining                           4/18/2004               ‹#›
                                             Other Distinctions Between Sets of Clusters                                                                                                                       Types of Clusters
                                                  Exclusive versus non-exclusive                                                                                                                                Well-separated clusters
                                                        – In non-exclusive clusterings, points may belong to multiple 
                                                             clusters.
                                                        – Can represent multiple classes or ‘border’ points                                                                                                      Center-based clusters
                                                  Fuzzy versus non-fuzzy
                                                        – In fuzzy clustering, a point belongs to every cluster with some                                                                                        Contiguous clusters
                                                             weight between 0 and 1
                                                        – Weights must sum to 1
                                                        – Probabilistic clustering has similar characteristics                                                                                                   Density-based clusters
                                                  Partial versus complete
                                                        – In some cases, we only want to cluster some of the data                                                                                                Property or Conceptual
                                                  Heterogeneous versus homogeneous
                                                        – Cluster of widely different sizes, shapes, and densities                                                                                               Described by an Objective Function
                                           © Tan,Steinbach, Kumar                       Introduction to Data Mining                                    4/18/2004               ‹#›                            © Tan,Steinbach, Kumar                      Introduction to Data Mining                                     4/18/2004               ‹#›
                                             Types of Clusters: Well-Separated                                                                                                                                 Types of Clusters: Center-Based
                                                  Well-Separated Clusters:                                                                                                                                          Center-based
                                                        – A cluster is a set of points such that any point in a cluster is                                                                                                 – A cluster is a set of objects such that an object in a cluster is 
                                                             closer (or more similar) to every other point in the cluster than                                                                                                  closer (more similar) to the “center” of a cluster, than to the 
                                                             to any point not in the cluster.                                                                                                                                   center of any other cluster  
                                                                                                                                                                                                                           – The center of a cluster is often a centroid, the average of all 
                                                                                                                                                                                                                                the points in the cluster, or a medoid, the most “representative” 
                                                                                                                                                                                                                                point of a cluster 
                                                                                                                                                                                                                           – Medoid is an actual point where as centroid need not be.
                                                                                       3 well-separated clusters                                                                                                                                         4 center-based clusters
                                           © Tan,Steinbach, Kumar                       Introduction to Data Mining                                    4/18/2004               ‹#›                            © Tan,Steinbach, Kumar                      Introduction to Data Mining                                     4/18/2004               ‹#›
The words contained in this file might help you see if this file matches what you are looking for:

...Data mining what is clustering cluster analysis basic concepts and algorithms the process of grouping a set objects into classes similar lecture notes for chapter documents within should be from different clusters dissimilar commonest form unsupervised learning introduction to raw as opposed by supervised where classification examples samples given known tan steinbach kumar common important task that finds many applications in ir other places sources intra inter distance with clear structure finding groups such group will or related one another how would you unrelated design an algorithm three distances are this case maximized minimized humans do it effortlessly historic application location new stores pizza delivery locations...
Related files

Share

Help

Related files

Share

Share to social media

Help

Login Area