Slides related to:
Data Mining: Concepts and Techniques
Chapters 1 and 2: Introduction and Data Preprocessing
Jiawei Han and Micheline Kamber
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber. All rights reserved.

Why Data Mining?
- The Explosive Growth of Data: from terabytes to petabytes
  - Data collection and data availability
    - Automated data collection tools, database systems, Web, computerized society
  - Major sources of abundant data
    - Business: Web, e-commerce, transactions, stocks, …
    - Science: Remote sensing, bioinformatics, scientific simulation, …
    - Society and everyone: news, digital cameras, YouTube
- We are drowning in data, but starving for knowledge!
- “Necessity is the mother of invention”: Data mining, the automated analysis of massive data sets

Ex. 1: Market Analysis and Management
- Where does the data come from? Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
- Target marketing
  - Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
  - Determine customer purchasing patterns over time
- Cross-market analysis: Find associations/co-relations between product sales, and predict based on such associations
- Customer profiling: What types of customers buy what products (clustering or classification)
- Customer requirement analysis
  - Identify the best products for different groups of customers
  - Predict what factors will attract new customers
- Provision of summary information
  - Multidimensional summary reports
  - Statistical summary information (data central tendency and variation)

Ex. 2: Corporate Analysis & Risk Management
- Finance planning and asset evaluation
  - cash flow analysis and prediction
  - contingent claim analysis to evaluate assets
  - cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
- Resource planning
  - summarize and compare the resources and spending
- Competition
  - monitor competitors and market directions
  - group customers into classes and a class-based pricing procedure
  - set pricing strategy in a highly competitive market

Ex. 3: Fraud Detection & Mining Unusual Patterns
- Approaches: Clustering & model construction for frauds, outlier analysis
- Applications: Health care, retail, credit card service, telecomm.
  - Auto insurance: ring of collisions
  - Money laundering: suspicious monetary transactions
  - Medical insurance
    - Professional patients, ring of doctors, and ring of references
    - Unnecessary or correlated screening tests
  - Telecommunications: phone-call fraud
    - Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm
  - Retail industry
    - Analysts estimate that 38% of retail shrink is due to dishonest employees
  - Anti-terrorism

Evolution of Database Technology
- 1960s:
  - Data collection, database creation, IMS and network DBMS
- 1970s:
  - Relational data model, relational DBMS implementation
- 1980s:
  - Advanced data models (extended-relational, OO, deductive, etc.)
  - Application-oriented DBMS (spatial, temporal, multimedia, etc.)
- 1990s:
  - Data mining, data warehousing, multimedia databases, and Web databases
- 2000s:
  - Stream data management and mining
  - Data mining and its applications
  - Web technology (XML, data integration) and global information systems

What Is Data Mining?
- Data mining (knowledge discovery from data)
  - Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
  - Data mining: a misnomer?
- Alternative names
  - Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
- Watch out: Is everything “data mining”?
  - Simple search and query processing
  - (Deductive) expert systems

Knowledge Discovery (KDD) Process
- Data mining: core of the knowledge discovery process
[Figure: KDD process. Labels as extracted, top to bottom: Pattern evaluation and presentation; Data Mining; Task-relevant Data; Data Warehouse; Selection and transformation; Data Cleaning; Data Integration; Databases]

Why Data Preprocessing?
- Data in the real world is dirty
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    - e.g., occupation=“ ”
  - noisy: containing errors or outliers
    - e.g., Salary=“-10”
  - inconsistent: containing discrepancies in codes or names
    - e.g., Age=“42” Birthdate=“03/07/1997”
    - e.g., Was rating “1,2,3”, now rating “A, B, C”
    - e.g., discrepancy between duplicate records

Why Is Data Dirty?
- Incomplete data may come from
  - “Not applicable” data value when collected
  - Different considerations between the time when the data was collected and when it is analyzed
  - Human/hardware/software problems
- Noisy data (incorrect values) may come from
  - Faulty data collection instruments
  - Human or computer error at data entry
  - Errors in data transmission
- Inconsistent data may come from
  - Different data sources
  - Functional dependency violation (e.g., modify some linked data)
- Duplicate records also need data cleaning

Why Is Data Preprocessing Important?
- No quality data, no quality mining results!
  - Quality decisions must be based on quality data
    - e.g., duplicate or missing data may cause incorrect or even misleading statistics
  - Data warehouse needs consistent integration of quality data
- Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse

Forms of Data Preprocessing
[Figure: forms of data preprocessing; no labels survive in the extracted text]

Architecture: Typical Data Mining System
[Figure: layered architecture. Labels as extracted, top to bottom: Graphical User Interface; Pattern Evaluation; Data Mining Engine; Knowledge-Base (beside the evaluation and engine layers); Database or Data Warehouse Server; data cleaning, integration, and selection; Database, Data Warehouse, World-Wide Web, Other Info Repositories]

Why Not Traditional Data Analysis?
- Tremendous amount of data
  - Algorithms must be highly scalable to handle large amounts of data
- High-dimensionality of data
  - Micro-array may have tens of thousands of dimensions
- High complexity of data
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data
  - Structure data, graphs, social networks and multi-linked data
  - Heterogeneous databases and legacy databases
  - Spatial, spatiotemporal, multimedia, text and Web data
- New and sophisticated applications

Data Mining: Classification Schemes
- General functionality
  - Descriptive data mining
  - Predictive data mining
- Different views lead to different classifications
  - Data view: Kinds of data to be mined
  - Knowledge view: Kinds of knowledge to be discovered
  - Method view: Kinds of techniques utilized
  - Application view: Kinds of applications adapted

Data Mining: on what kinds of data?
- Database-oriented data sets and applications
  - Relational database, data warehouse, transactional database
- Advanced data sets and advanced applications
  - Object-relational databases
  - Time-series data, temporal data, sequence data (incl. bio-sequences)
  - Spatial data and spatiotemporal data
  - Text databases and Multimedia databases
  - Data streams and sensor data
  - The World-Wide Web
  - Heterogeneous databases and legacy databases

Data Mining – what kinds of patterns?
- Concept/class description:
  - Characterization: summarizing the data of the class under study in general terms
    - E.g. Characteristics of customers spending more than 10000 SEK per year
  - Discrimination: comparing target class with other (contrasting) classes
    - E.g. Compare the characteristics of products that had a sales increase to products that had a sales decrease last year

Data Mining – what kinds of patterns?
- Frequent patterns, association, correlations
  - Frequent itemset
  - Frequent sequential pattern
  - Frequent structured pattern
  - E.g. buy(X, “Diaper”) → buy(X, “Beer”) [support=0.5%, confidence=75%]
    - confidence: if X buys a diaper, then there is a 75% chance that X buys beer
    - support: of all transactions under consideration, 0.5% showed that diaper and beer were bought together
  - E.g. Age(X, “20..29”) and income(X, “20k..29k”) → buys(X, “cd-player”) [support=2%, confidence=60%]

Data Mining – what kinds of patterns?
- Classification and prediction
  - Construct models (functions) that describe and distinguish classes or concepts for future prediction. The derived model is based on analyzing training data – data whose class labels are known.
    - E.g., classify countries based on (climate), or classify cars based on (gas mileage)
  - Predict some unknown or missing numerical values

Data Mining – what kinds of patterns?
- Cluster analysis
  - Class label is unknown: Group data to form new classes, e.g., cluster customers to find target groups for marketing
  - Maximizing intra-class similarity & minimizing interclass similarity
- Outlier analysis
  - Outlier: Data object that does not comply with the general behavior of the data
  - Noise or exception? Useful in fraud detection, rare events analysis
- Trend and evolution analysis
  - Trend and deviation

Are All the “Discovered” Patterns Interesting?
- Data mining may generate thousands of patterns: Not all of them are interesting
  - Suggested approach: Human-centered, query-based, focused mining
- Interestingness measures
  - A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
- Objective vs. subjective interestingness measures
  - Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  - Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.

Find All and Only Interesting Patterns?
- Find all the interesting patterns: Completeness
  - Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns?
  - Heuristic vs. exhaustive search
  - Association vs. classification vs. clustering
- Search for only interesting patterns: An optimization problem
  - Can a data mining system find only the interesting patterns?
  - Approaches
    - First generate all the patterns and then filter out the uninteresting ones
    - Generate only the interesting patterns: mining query optimization

Data Mining – what techniques used?
[Figure: Data Mining at the center, surrounded by contributing disciplines: Database Technology, Statistics, Machine Learning, Visualization, Pattern Recognition, Algorithm, Other Disciplines]

Top-10 Most Popular DM Algorithms: 18 Identified Candidates (I)
- Classification
  - #1. C4.5: Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
  - #2. CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, 1984.
  - #3. K Nearest Neighbours (kNN): Hastie, T. and Tibshirani, R. 1996. Discriminant Adaptive Nearest Neighbor Classification. TPAMI. 18(6)
  - #4. Naive Bayes: Hand, D.J., Yu, K., 2001. Idiot's Bayes: Not So Stupid After All? Internat. Statist. Rev. 69, 385-398.
- Statistical Learning
  - #5. SVM: Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. Springer-Verlag.
  - #6. EM: McLachlan, G. and Peel, D. (2000). Finite Mixture Models. J. Wiley, New York.
- Association Analysis
  - #7. Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In VLDB '94.
  - #8. FP-Tree: Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In SIGMOD '00.
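
Worked sketch: support and confidence
The Diaper/Beer rule in the frequent-patterns slide is annotated with a support and a confidence value. As a rough illustration of how those two objective measures are usually computed, here is a minimal Python sketch. The basket data, the function names `support` and `confidence`, and the item names are all made up for this example and are not from the slides; the toy data is chosen so that the confidence comes out at 75%, while its support is far higher than the 0.5% quoted on the slide.

```python
from typing import FrozenSet, List

Basket = FrozenSet[str]


def support(transactions: List[Basket], itemset: Basket) -> float:
    """Fraction of all transactions that contain every item in `itemset`."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)  # subset test
    return hits / len(transactions)


def confidence(transactions: List[Basket],
               antecedent: Basket, consequent: Basket) -> float:
    """Of the transactions containing `antecedent`, the fraction that
    also contain `consequent`: support(A and B) / support(A)."""
    base = support(transactions, antecedent)
    if base == 0.0:
        return 0.0
    return support(transactions, antecedent | consequent) / base


# Hypothetical toy data: 8 baskets, 4 with diaper, 3 with diaper and beer.
baskets: List[Basket] = [
    frozenset({"diaper", "beer", "milk"}),
    frozenset({"diaper", "beer"}),
    frozenset({"diaper", "beer", "bread"}),
    frozenset({"diaper", "milk"}),
    frozenset({"milk", "bread"}),
    frozenset({"beer"}),
    frozenset({"bread"}),
    frozenset({"milk"}),
]

rule_support = support(baskets, frozenset({"diaper", "beer"}))
rule_confidence = confidence(baskets, frozenset({"diaper"}), frozenset({"beer"}))
print(f"support={rule_support:.3f}, confidence={rule_confidence:.2f}")
```

Support is measured against all transactions, while confidence is measured only against those that contain the antecedent, which is why a rule can have low support and high confidence at the same time, as on the slide.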