Introduction to Data Mining
Dr. Hany Saleeb

Why Data Mining? Potential Applications
- Direct marketing: identify which prospects should be included in a mailing list
- Market segmentation: identify common characteristics of customers who buy the same products
- Market basket analysis: identify which products are likely to be bought together
- Insurance claims analysis: discover patterns of fraudulent transactions and compare current transactions against those patterns

What Is Data Mining?
- Combination of AI and statistical analysis to discover information that is “hidden” in the data
  - associations (e.g. linking purchase of pizza with beer)
  - sequences (e.g. tying events together: marriage and purchase of furniture)
  - classifications (e.g. recognizing patterns such as the attributes of employees who are most likely to quit)
  - forecasting (e.g. predicting buying habits of customers based on past patterns)
- Expert systems or small ML/statistical programs

What can data mining do?
- Classification
  - Classify credit applicants as low, medium, or high risk
  - Classify insurance claims as normal or suspicious
- Estimation
  - Estimate the probability of a direct mailing response
  - Estimate the lifetime value of a customer
- Prediction
  - Predict which customers will leave within six months
  - Predict the size of the balance that will be transferred by a credit card prospect

What can data mining do? (cont’d)
- Association
  - Find out which items customers are likely to buy together
  - Find out what books to recommend to Amazon.com users
- Clustering
  - Difference from classification: classes are unknown!

Market Analysis and Management
- Where are the data sources for analysis?
  - Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
- Target marketing
  - Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
- Determine customer purchasing patterns over time
  - Conversion of a single to a joint bank account: marriage, etc.
- Cross-market analysis
  - Associations/correlations between product sales
  - Prediction based on the association information

Data Mining: Confluence of Multiple Disciplines
- Database technology
- Machine learning
- Information science
- Statistics
- Visualization
- Other disciplines

Data Mining: On What Kind of Data?
- Relational databases
- Data warehouses
- Transactional databases
- Advanced DB and information repositories
  - Object-oriented and object-relational databases
  - Spatial databases
  - Time-series data and temporal data
  - Text databases and multimedia databases
  - Heterogeneous and legacy databases
  - WWW

Data Mining Process
[Diagram labels: learning; collecting relevant data; model building; understanding of business; problem identification; business strategy and evaluation; action]

Requirements/challenges in Data Mining
- User interface
- Mining methodology
- Performance
- Data source
- Social and security

Requirements/challenges in Data Mining (2)
User interface
- Data visualization
  - Understandability and interpretation of results
  - Information representation and rendering
  - Screen real estate
- Interactivity
  - Manipulation of mined knowledge
  - Focus and refine mining tasks
  - Focus and refine mining results

Requirements/challenges in Data Mining (3)
Mining methodology
- Mining different kinds of knowledge in databases
- Interactive mining of knowledge at multiple levels of abstraction
- Incorporation of background knowledge
- Query languages
- Expression and visualization of results
- Handling noise and incomplete data
- Pattern evaluation

Requirements/challenges in Data Mining (4)
Performance
- Efficiency and scalability of data mining algorithms
  - Linear algorithms needed
- Parallel and distributed methods
- Incremental methods
- Divide and conquer?

Requirements/challenges in Data Mining (5)
Data source
- Diversity of data types
  - Handling complex types of data
  - Mining information from heterogeneous databases or information repositories
  - Can we expect a DM algorithm to do well on all types of data?
- Data glut
  - Are we collecting the right data for the right answer?
  - Distinguish between important and unimportant data

Requirements/challenges in Data Mining (6)
Social and security
- Social impact
  - Private and sensitive data are gathered and mined without the individual’s knowledge and/or consent
  - Appropriate use and distribution of discovered knowledge
- Regulations
  - Need for privacy and DM policies

Data Mining Tools

Summary
- Knowing one’s business is critical; technologies are coming together to support data mining.
- Data mining is the process and result of knowledge production, knowledge discovery and knowledge management.
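
Illustration: association (market basket analysis)
The association task in these slides (finding items that customers are likely to buy together, such as pizza and beer) can be illustrated with a minimal sketch in plain Python. The baskets, the helper name pair_rules, and the thresholds are all made up for illustration and do not come from the slides. The sketch counts item pairs by brute force; practical tools typically use a dedicated algorithm such as Apriori instead.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Return (antecedent, consequent, support, confidence) for item pairs.

    Hypothetical helper for illustration only: brute-force pair counting,
    not an optimized association-rule algorithm.
    """
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # ignore duplicate items within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of baskets containing both items
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]  # share of x-baskets that also hold y
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return sorted(rules)


# Made-up example baskets (not real data).
baskets = [
    ["pizza", "beer"],
    ["pizza", "beer", "chips"],
    ["pizza", "soda"],
    ["beer", "chips"],
]

for x, y, support, confidence in pair_rules(baskets):
    print(f"{x} -> {y}: support={support:.2f}, confidence={confidence:.2f}")
```

Support measures how often a pair appears across all baskets, and confidence measures how often the second item appears given the first. Both thresholds are arbitrary choices here and would need tuning on real data.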