 146x       Filetype PDF       File size 0.13 MB       Source: pdfs.semanticscholar.org


File: Data Mining Jiawei Han 91358 | 55b5522e1a2bb1c1a4f4cbd9e179e5fc6b92

Filetype PDF | Posted on 16 Sep 2022
Partial capture of text on file.
Slides related to:
Data Mining: Concepts and Techniques
Chapters 1 and 2: Introduction and Data Preprocessing

Jiawei Han and Micheline Kamber
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber. All rights reserved.

Why Data Mining?
• The Explosive Growth of Data: from terabytes to petabytes
    • Data collection and data availability
        • Automated data collection tools, database systems, Web, computerized society
    • Major sources of abundant data
        • Business: Web, e-commerce, transactions, stocks, …
        • Science: Remote sensing, bioinformatics, scientific simulation, …
        • Society and everyone: news, digital cameras, YouTube
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”: data mining, the automated analysis of massive data sets
Ex. 1: Market Analysis and Management
• Where does the data come from? Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
• Target marketing
    • Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
    • Determine customer purchasing patterns over time
• Cross-market analysis: find associations/correlations between product sales, and predict based on such associations
• Customer profiling: what types of customers buy what products (clustering or classification)
• Customer requirement analysis
    • Identify the best products for different groups of customers
    • Predict what factors will attract new customers
• Provision of summary information
    • Multidimensional summary reports
    • Statistical summary information (data central tendency and variation)

Ex. 2: Corporate Analysis & Risk Management
• Finance planning and asset evaluation
    • cash flow analysis and prediction
    • contingent claim analysis to evaluate assets
    • cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
• Resource planning
    • summarize and compare the resources and spending
• Competition
    • monitor competitors and market directions
    • group customers into classes and a class-based pricing procedure
    • set pricing strategy in a highly competitive market
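The "statistical summary information (data central tendency and variation)" item in Ex. 1 can be illustrated with a minimal sketch using Python's standard `statistics` module. The spending figures below are hypothetical and only serve to show how the measures behave.

```python
import statistics

# Hypothetical yearly spending per customer (illustrative values only).
spending = [1200.0, 950.0, 1430.0, 800.0, 5200.0, 1100.0]

summary = {
    # Central tendency
    "mean": statistics.mean(spending),
    "median": statistics.median(spending),
    # Variation
    "stdev": statistics.stdev(spending),        # sample standard deviation
    "range": max(spending) - min(spending),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The single large value (5200) pulls the mean well above the median, which is one reason a summary report usually shows both.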
Ex. 3: Fraud Detection & Mining Unusual Patterns
• Approaches: Clustering & model construction for frauds, outlier analysis
• Applications: Health care, retail, credit card service, telecomm.
    • Auto insurance: ring of collisions
    • Money laundering: suspicious monetary transactions
    • Medical insurance
        • Professional patients, ring of doctors, and ring of references
        • Unnecessary or correlated screening tests
    • Telecommunications: phone-call fraud
        • Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm
    • Retail industry
        • Analysts estimate that 38% of retail shrink is due to dishonest employees
    • Anti-terrorism

Evolution of Database Technology
• 1960s:
    • Data collection, database creation, IMS and network DBMS
• 1970s:
    • Relational data model, relational DBMS implementation
• 1980s:
    • Advanced data models (extended-relational, OO, deductive, etc.)
    • Application-oriented DBMS (spatial, temporal, multimedia, etc.)
• 1990s:
    • Data mining, data warehousing, multimedia databases, and Web databases
• 2000s:
    • Stream data management and mining
    • Data mining and its applications
    • Web technology (XML, data integration) and global information systems
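The phone-call fraud item in Ex. 3 talks about analyzing patterns that deviate from an expected norm. One simple way to make that concrete, offered here only as a sketch, is a z-score rule on call duration. The slides do not prescribe this technique; the function name `flag_unusual_calls`, the threshold of 2 standard deviations, and the sample durations are all assumptions.

```python
import statistics

def flag_unusual_calls(durations, threshold=2.0):
    """Return indices of calls whose duration lies more than `threshold`
    population standard deviations away from the mean (a z-score rule)."""
    mean = statistics.mean(durations)
    stdev = statistics.pstdev(durations)
    if stdev == 0:
        return []  # all calls identical: nothing deviates
    return [i for i, d in enumerate(durations)
            if abs(d - mean) / stdev > threshold]

# Hypothetical call durations in minutes for one subscriber.
durations = [3, 4, 5, 4, 3, 5, 4, 120, 4, 3]
unusual = flag_unusual_calls(durations)
print(unusual)
```

A real phone-call model would also use the destination and the time of day or week listed on the slide; duration alone is used here to keep the sketch short.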
What Is Data Mining?
• Data mining (knowledge discovery from data)
    • Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data
    • Data mining: a misnomer?
• Alternative names
    • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
• Watch out: Is everything “data mining”?
    • Simple search and query processing
    • (Deductive) expert systems

Knowledge Discovery (KDD) Process
• Data mining: core of knowledge discovery process
[Figure: KDD process flow, from bottom to top: Databases; Data Cleaning and Data Integration; Data Warehouse; Selection and transformation; Task-relevant Data; Data Mining; Pattern evaluation and presentation]
Why Data Preprocessing?
• Data in the real world is dirty
    • incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
        • e.g., occupation=“ ”
    • noisy: containing errors or outliers
        • e.g., Salary=“-10”
    • inconsistent: containing discrepancies in codes or names
        • e.g., Age=“42” Birthdate=“03/07/1997”
        • e.g., Was rating “1,2,3”, now rating “A, B, C”
        • e.g., discrepancy between duplicate records

Why Is Data Dirty?
• Incomplete data may come from
    • “Not applicable” data value when collected
    • Different considerations between the time when the data was collected and when it is analyzed.
    • Human/hardware/software problems
• Noisy data (incorrect values) may come from
    • Faulty data collection instruments
    • Human or computer error at data entry
    • Errors in data transmission
• Inconsistent data may come from
    • Different data sources
    • Functional dependency violation (e.g., modify some linked data)
• Duplicate records also need data cleaning
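The three kinds of dirty data named in "Why Data Preprocessing?" (incomplete, noisy, inconsistent) can each be checked mechanically. The sketch below reuses the slide's own example values. The field names, the MM/DD/YYYY reading of the birthdate, and the fixed reference date are assumptions made for illustration.

```python
from datetime import date, datetime

def find_problems(record, today=date(2006, 1, 1)):
    """Return a list of data-quality problems found in one record.
    `today` is an assumed reference date for computing age."""
    problems = []

    # Incomplete: a missing or blank attribute value.
    if not record.get("occupation", "").strip():
        problems.append("incomplete: occupation is blank")

    # Noisy: a value outside the plausible range.
    if float(record["salary"]) < 0:
        problems.append("noisy: salary is negative")

    # Inconsistent: the stated age disagrees with the birthdate.
    # Assumes the birthdate is written as MM/DD/YYYY.
    born = datetime.strptime(record["birthdate"], "%m/%d/%Y").date()
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    age_from_birthdate = today.year - born.year - (0 if had_birthday else 1)
    if int(record["age"]) != age_from_birthdate:
        problems.append("inconsistent: age does not match birthdate")

    return problems

# The slide's example values combined into one record.
record = {"occupation": " ", "salary": "-10",
          "age": "42", "birthdate": "03/07/1997"}
problems = find_problems(record)

# A hypothetical record that passes all three checks.
clean = {"occupation": "engineer", "salary": "52000",
         "age": "42", "birthdate": "03/07/1963"}

for p in problems:
    print(p)
```

Checks like these only detect problems; deciding how to repair them (fill in, discard, or reconcile) is a separate cleaning step.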
Why Is Data Preprocessing Important?
• No quality data, no quality mining results!
    • Quality decisions must be based on quality data
        • e.g., duplicate or missing data may cause incorrect or even misleading statistics.
    • Data warehouse needs consistent integration of quality data
• Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse

Forms of Data Preprocessing
[Figure]
Architecture: Typical Data Mining System
[Figure: layered architecture, from top to bottom: Graphical User Interface; Pattern Evaluation; Data Mining Engine; Database or Data Warehouse Server; data cleaning, integration, and selection; data sources (Database, Data Warehouse, World-Wide Web, Other Info Repositories). A Knowledge-Base sits alongside Pattern Evaluation and the Data Mining Engine.]

Why Not Traditional Data Analysis?
• Tremendous amount of data
    • Algorithms must be highly scalable to handle large amounts of data
• High-dimensionality of data
    • Micro-array may have tens of thousands of dimensions
• High complexity of data
    • Data streams and sensor data
    • Time-series data, temporal data, sequence data
    • Structure data, graphs, social networks and multi-linked data
    • Heterogeneous databases and legacy databases
    • Spatial, spatiotemporal, multimedia, text and Web data
• New and sophisticated applications
Data Mining: Classification Schemes
• General functionality
    • Descriptive data mining
    • Predictive data mining
• Different views lead to different classifications
    • Data view: Kinds of data to be mined
    • Knowledge view: Kinds of knowledge to be discovered
    • Method view: Kinds of techniques utilized
    • Application view: Kinds of applications adapted

Data Mining: on what kinds of data?
• Database-oriented data sets and applications
    • Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
    • Object-relational databases
    • Time-series data, temporal data, sequence data (incl. bio-sequences)
    • Spatial data and spatiotemporal data
    • Text databases and Multimedia databases
    • Data streams and sensor data
    • The World-Wide Web
    • Heterogeneous databases and legacy databases
Data Mining – what kinds of patterns?
• Concept/class description:
    • Characterization: summarizing the data of the class under study in general terms
        • E.g. Characteristics of customers spending more than 10000 SEK per year
    • Discrimination: comparing target class with other (contrasting) classes
        • E.g. Compare the characteristics of products that had a sales increase to products that had a sales decrease last year

Data Mining – what kinds of patterns?
• Frequent patterns, association, correlations
    • Frequent itemset
    • Frequent sequential pattern
    • Frequent structured pattern
    • E.g. buys(X, “Diaper”) → buys(X, “Beer”) [support=0.5%, confidence=75%]
        confidence: if X buys a diaper, then there is a 75% chance that X buys beer
        support: of all transactions under consideration, 0.5% showed that diaper and beer were bought together
    • E.g. age(X, “20..29”) and income(X, “20k..29k”) → buys(X, “cd-player”) [support=2%, confidence=60%]
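The support and confidence figures attached to the diaper and beer rule follow directly from transaction counts: support is the fraction of all transactions that contain both sides of the rule, and confidence is the fraction of transactions containing the left-hand side that also contain the right-hand side. A minimal sketch, using a small hypothetical transaction list (the helper names are illustrative, not from the slides):

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(transactions, itemset):
    """Fraction of all transactions that contain `itemset`."""
    return count(transactions, itemset) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that
    also contain `consequent`."""
    n_antecedent = count(transactions, antecedent)
    n_both = count(transactions, set(antecedent) | set(consequent))
    return n_both / n_antecedent if n_antecedent else 0.0

# Hypothetical market-basket data.
transactions = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
    {"diaper", "beer", "bread"},
]

s = support(transactions, {"diaper", "beer"})        # 3 of 5 transactions
c = confidence(transactions, {"diaper"}, {"beer"})   # 3 of 4 diaper transactions
print(f"support={s:.0%}, confidence={c:.0%}")
```

In this toy data the rule has much higher support than the 0.5% on the slide, simply because the list is tiny; real transaction sets are far sparser.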
Data Mining – what kinds of patterns?
• Classification and prediction
    • Construct models (functions) that describe and distinguish classes or concepts for future prediction. The derived model is based on analyzing training data, i.e., data whose class labels are known.
        • E.g., classify countries based on (climate), or classify cars based on (gas mileage)
    • Predict some unknown or missing numerical values

Data Mining – what kinds of patterns?
• Cluster analysis
    • Class label is unknown: Group data to form new classes, e.g., cluster customers to find target groups for marketing
    • Maximizing intra-class similarity & minimizing interclass similarity
• Outlier analysis
    • Outlier: Data object that does not comply with the general behavior of the data
    • Noise or exception? Useful in fraud detection, rare events analysis
• Trend and evolution analysis
    • Trend and deviation
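The cluster analysis idea of grouping unlabeled data so that members of a group are similar to each other and dissimilar to other groups can be sketched with k-means in one dimension. The slides do not name a specific clustering algorithm; k-means is used here only as a familiar example, and the spending values and starting centers are hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """Lloyd's k-means algorithm in one dimension, starting from the
    given initial centers. Returns the final centers and clusters."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its members
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical monthly spending of six customers, with no class labels.
spend = [100, 120, 110, 900, 950, 1000]
centers, clusters = kmeans_1d(spend, centers=[100, 1000])
print(centers, clusters)
```

The two resulting groups could then be treated as new classes, for example as target groups for marketing.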
Are All the “Discovered” Patterns Interesting?
• Data mining may generate thousands of patterns: Not all of them are interesting
    • Suggested approach: Human-centered, query-based, focused mining
• Interestingness measures
    • A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
• Objective vs. subjective interestingness measures
    • Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
    • Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.

Find All and Only Interesting Patterns?
• Find all the interesting patterns: Completeness
    • Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns?
    • Heuristic vs. exhaustive search
    • Association vs. classification vs. clustering
• Search for only interesting patterns: An optimization problem
    • Can a data mining system find only the interesting patterns?
    • Approaches
        • First generate all the patterns and then filter out the uninteresting ones
        • Generate only the interesting patterns: mining query optimization
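The first approach listed above, generating all patterns and then filtering out the uninteresting ones, can be sketched with objective measures as the filter. The first two rules below carry the support and confidence values quoted earlier in these slides; the third rule and both thresholds are hypothetical.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    description: str
    support: float      # fraction of transactions containing both sides
    confidence: float   # fraction of left-hand-side transactions with the right-hand side

def filter_interesting(rules, min_support, min_confidence):
    """Keep only rules that meet both objective thresholds."""
    return [r for r in rules
            if r.support >= min_support and r.confidence >= min_confidence]

rules = [
    Rule("diaper -> beer", 0.005, 0.75),
    Rule("age 20..29 and income 20k..29k -> cd-player", 0.02, 0.60),
    Rule("bread -> candles", 0.001, 0.10),  # hypothetical weak rule
]

# Thresholds are assumptions chosen for illustration.
kept = filter_interesting(rules, min_support=0.004, min_confidence=0.5)
for r in kept:
    print(r.description)
```

Subjective measures such as unexpectedness or actionability depend on the user's beliefs and cannot be reduced to a threshold in the same way.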
Data Mining – what techniques used?
[Figure: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Visualization, Pattern Recognition, Algorithm, and Other Disciplines]

Top-10 Most Popular DM Algorithms: 18 Identified Candidates (I)
• Classification
    • #1. C4.5: Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
    • #2. CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, 1984.
    • #3. K Nearest Neighbours (kNN): Hastie, T. and Tibshirani, R. 1996. Discriminant Adaptive Nearest Neighbor Classification. TPAMI. 18(6)
    • #4. Naive Bayes: Hand, D.J., Yu, K., 2001. Idiot's Bayes: Not So Stupid After All? Internat. Statist. Rev. 69, 385-398.
• Statistical Learning
    • #5. SVM: Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. Springer-Verlag.
    • #6. EM: McLachlan, G. and Peel, D. (2000). Finite Mixture Models. J. Wiley, New York.
• Association Analysis
    • #7. Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In VLDB '94.
    • #8. FP-Tree: Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In SIGMOD '00.