Data Mining Research and Applications

Workshop on Cyberinfrastructure For Environmental Research and Education
October 31, 2002

Steve Tanner
Information Technology and Systems Center
University of Alabama in Huntsville
firstname.lastname@example.org
256.824.5143
www.itsc.uah.edu

Key Questions:
• What is the most effective approach to developing an integrated framework and plan for an interdisciplinary environmental cyberinfrastructure?
• What organizational structure is needed to provide long-term support for data storage, access, model development, and services for a global clientele of researchers, educators, policy makers, and citizens?
• How will effective interagency and public-private partnerships be formed to provide financial support for such an extensive and costly system?
• How can communication and coordination among computer scientists and environmental researchers and educators be enhanced to develop this innovative, powerful, and accessible infrastructure?

Data Mining
• Data Mining is an interdisciplinary field drawing from areas such as statistics, machine learning, pattern recognition and others
• Automated discovery of patterns, anomalies, etc. from vast observational and model data sets
• Derived knowledge for decision making, predictions and disaster response

ADaM – Algorithm Development and Mining System
datamining.itsc.uah.edu

Techniques used for Data Mining
• Clustering Techniques
  – K Means
  – Isodata
  – Maximum
• Pattern Recognition
  – Bayes Classifier
  – Minimum Distance Classifier
• Image Analysis
  – Boundary Detection
  – Cooccurrence Matrix
  – Dilation and Erosion
  – Histogram Operations
  – Polygon Circumscript
  – Spatial Filtering
  – Texture Operations
• Genetic Algorithms
• Neural Networks
• Etc.
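K Means is the first clustering technique in the list above. As a rough illustration only, the sketch below runs the standard assign-then-update loop (Lloyd's algorithm) on a handful of 2-D points. It is not ADaM code: the function name k_means, the choice to seed the centroids from the first k points, and the sample data are all assumptions made for this example.

```python
from math import dist


def k_means(points, k, iterations=20):
    """Cluster points (tuples of floats) into k groups with Lloyd's algorithm.

    Assumption for this sketch: the first k points seed the centroids,
    which keeps the example deterministic.
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # centroids stopped moving, so the loop has converged
        centroids = new_centroids
    return labels, centroids


# Hypothetical 2-D samples forming two well-separated groups.
samples = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.1), (0.9, 1.1), (9.2, 8.9)]
labels, centroids = k_means(samples, k=2)
print(labels)
```

With real imagery the points would be per-pixel feature vectors rather than hand-picked coordinates, and the seeding strategy would matter much more than it does here.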
Data Mining systems usually involve a toolbox of many different techniques and a means for combining them.

Typical Everyday Encounters with Data Mining
• Google
  – Complex algorithm sequence to decide order
• Amazon.com
  – Additional purchase suggestions
• Credit Card Fraud
  – Event notification of odd usage
Most current Data Mining applications are text based. Text provides an easily readable source of heterogeneous data. Mining of scientific data sets is more complex.

User Perspective and Data Perspective of the Data Mining Process
[Figure: diagram with the labels Data, Information, Knowledge, Decision; Data Stores, Calibration & Navigation, Dataset Specific Algorithms, Domain Specific Algorithms, Preprocessing, Transformation, Analysis; Value and Volume; User Perspective and Data Perspective.]

Data Mining
• Provides automation of the analysis process
• Can be used for dimensionality reduction when manual examination of data is impossible
• Can have limitations
  – May not utilize domain knowledge
  – May be difficult to prove validity of the results
  – There may not be a physical basis
• Should be viewed as a complementary tool and not a replacement for scientific analysis

Scientific Analysis
• Harnesses human analysis capabilities
  – Highly creative
• Based on theory and hypothesis formulation
• Physical basis is normally used for algorithms
• Drawing insights about the underlying phenomena
• Rapidly widening gap between data collection capabilities and the ability to analyze data
• Potential of vast amounts of data to be unused

Similarity between Data Mining and Scientific Analysis Process

Mining Environments
• Mining Framework (ADaM)
  – Complete System (Client and Engine)
  – Mining Engine (User provides its own client)
  – Application Specific Mining Systems
  – Operations Tool Kit
  – Stand Alone Mining Algorithms
  – Data Fusion
• Distributed/Federated Mining
  – Distributed services
  – Distributed data
  – Chaining using Interchange Technologies
• On-board Mining (EVE)
  – Real time and distributed mining
  – Processing environment constraints

Using the Mining Framework: Focusing on the information in data

The ADaM Processing Model
Raw Data, Input, Translated Data, Preprocessing, Preprocessed Data, Analysis, Patterns/Models, Output, Results
• Input
  – PIP-2
  – SSM/I Pathfinder
  – SSM/I TDR
  – SSM/I NESDIS Lvl 1B
  – SSM/I MSFC Brightness Temp
  – US Rain
  – Landsat
  – ASCII Grass
  – Vectors (ASCII Text)
  – HDF
  – HDF-EOS
  – GIF
  – Intergraph Raster
  – Others...
• Preprocessing
  – Selection and Sampling: Subsetting, Subsampling, Select by Value, Coincidence Search
  – Grid Manipulation: Grid Creation, Bin Aggregate, Bin Select, Grid Aggregate, Grid Select, Find Holes
  – Image Processing: Cropping, Inversion, Thresholding
  – Others...
• Analysis
  – Clustering: K Means, Isodata, Maximum
  – Pattern Recognition: Bayes Classifier, Min. Dist. Classifier
  – Image Analysis: Boundary Detection, Cooccurrence Matrix, Dilation and Erosion, Histogram Operations, Polygon Circumscript, Spatial Filtering, Texture Operations
  – Genetic Algorithms
  – Neural Networks
  – Others...
• Output
  – GIF Images
  – HDF Raster Images
  – HDF Scientific Data Sets
  – HDF-EOS
  – Polygons (ASCII, DXF)
  – SSM/I MSFC Brightness Temp
  – TIFF Images
  – GeoTIFF
  – Others...

Iterative Nature of the Data Mining Process
[Figure: cycle with the labels Data, Cleaning and Integration, Selection and Transformation, Preprocessing, Mining, Discovery, Evaluation and Presentation, Knowledge.]

Distributed/Federated Mining: Meshing data and algorithms to generate knowledge

ADaM: Mining Environment for Scientific Data
• The system provides knowledge discovery, feature detection and content-based searching for data values, as well as for metadata.
• Contains over 120 different operations
• Operations vary from specialized science data-set specific algorithms to various digital image processing techniques, processing modules for automatic pattern recognition, machine perception, neural networks, genetic algorithms and others

Classification Based on Texture Features and Edge Density
• Science Rationale: Man-made changes to land use cause changes in weather patterns, especially cumulus clouds
• Comparison based on
  – Accuracy of detection
  – Amount of time required to classify
• Cumulus cloud fields have a very characteristic texture signature in the GOES visible imagery

Parallel Version of Cloud Extraction
• GOES images can be used to recognize cumulus cloud fields
• Cumulus clouds are small and do not show up well in 4 km resolution IR channels
• Detection of cumulus cloud fields in GOES can be accomplished by using texture features or edge detectors
• Three edge detection filters are used together to detect cumulus clouds, which lends itself to implementation on a parallel cluster
[Figure: processing diagram with the labels GOES Image, Sobel Horizontal Filter, Sobel Vertical Filter, Laplacian Filter, Energy Computation (four boxes), Classifier, Cloud Image; example GOES Image and Cumulus Cloud Mask.]

Automated Data Analysis for Boundary Detection and Quantification
• Analysis of polar cap auroras in large volumes of spacecraft UV images
• Science Rationale: Indicators to predict geomagnetic storms, which can
  – Damage satellites
  – Disrupt radio connection
• Developing different mining algorithms to detect and quantify the polar cap boundary
[Figure: Polar Cap Boundary.]

Detecting Signatures
• Science Rationale: Mesocyclone signatures in radar data are indicators of tornadic activity
• Developing an algorithm based on wind velocity shear signatures
  – Improve accuracy and reduce false alarm rates

Genetic Subtyping Using Hierarchical Clustering
• Biologists are interested in comparing DNA sequences to see how closely related they are to one another
• Phylogenetic trees are constructed by performing hierarchical clustering on DNA sequences using genetic distance as a distance measure
• Such trees show which organisms are most likely to share common ancestors, and may provide information about how various subtypes of organisms evolved
• This information is useful when studying disease-causing organisms such as viruses and bacteria, because genetically similar types should behave in similar ways

Mining on Data Ingest: Tropical Cyclone Detection
[Figure: diagram with the labels Advanced Microwave Sounding Unit (AMSU-A) Data, Calibration / Limb Correction / Converted to Tb, Mining Environment, Result, Data Archive, Knowledge Base, Further Analysis, Hurricane Floyd.]
Mining Plan:
• Water cover mask to eliminate land
• Laplacian filter to compute temperature gradients
• Science Algorithm to estimate wind speed
• Contiguous regions with wind speeds above a desired threshold identified
• Additional test to eliminate false positives
• Maximum wind speed and location produced
Results are placed on the web, made available to the National Hurricane Center and the Joint Typhoon Warning Center, and stored for further analysis.
pm-esip.msfc.nasa.gov/

Multiple Mining Environments: Passive Microwave ESIP Information System
[Figure: architecture diagram with the labels Web Interfaces & Applications, Visualization & Exploration, Temperature Trends, Data Ordering, AMSU-A Images, AMSU Product Generation, STT Application, FTP, Cyclone Winds; ADaM-based Processing, Input, Process, Subset/Grid/Format, Output, PM-ESIP Catalog, Order Staging, ADaM Servers, Custom Processing; Data Ingest & Processing, AMSU-A Ingest, TMI Ingest and Product Generation; Distributed Data Stores, AMSU-A, TMI, SSM/I, SSM/T2.]

Interoperability: Accessing Heterogeneous Data
[Figure, The Problem: Data Format 1, Data Format 2, Data Format 3, Format Converter, Reader 1, Reader 2, Application.]
[Figure, The Solution: Data Format 1, Data Format 2, Data Format 3, each with an ESML File, ESML Library, Application.]
• Science data comes in:
  – Different formats, types and structures
  – Different states of processing (raw, calibrated, derived, modeled or interpreted)
  – Enormous volumes
• Heterogeneity leads to data usability problems
• One approach: Standard data formats
  – Difficult to implement and enforce
  – Can't anticipate all needs
  – Some data can't be modeled or is lost in translation
• The cost of converting legacy data
• A better approach: Interchange Technologies
• Earth Science Markup Language

Chained Image Processing Services
[Figure: service diagram with the labels WMS (Java/Windows), Format Chained Services (Perl/Linux), Resample (Perl/C – Linux), GeoCrop (Perl/Linux), Draw Image (PERL/C – Linux), Reader (Java/C+ Windows), ESML Lib, Data Files, Data Streams, Knowledge Base.]
Service chaining integrates modules, or services, developed on distributed platforms and in different languages into a single processing solution.

ESML Data Integration using Web Mapping Services
[Figure: map layers labeled Countries, Coastlines, Globe, AMSU-A Channel 01, AMSU-A, MCS Events, Cyclone Events, ITSC Knowledge Base.]
AMSU-A data overlaid with MCS and Cyclone events for September 2000, merged with world boundaries from Globe.

Fused Displays from Multiple Servers
Analysis: Correlate MCSs and cyclones with atmospheric temperatures for September 2000.
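The service-chaining idea described under Chained Image Processing Services can be sketched in a few lines. This is a deliberately simplified, in-process illustration: the services on the slide run on different platforms and in different languages, whereas every function here (read_grid, resample, geocrop, draw_image, chain) is a hypothetical Python stand-in named after a box in the diagram, and the 4x4 grid is made-up data.

```python
from functools import reduce


def read_grid(source):
    """Reader stand-in: parse whitespace-separated rows of numbers."""
    return [[float(v) for v in line.split()] for line in source.strip().splitlines()]


def resample(grid, step=2):
    """Resample stand-in: keep every step-th row and column."""
    return [row[::step] for row in grid[::step]]


def geocrop(grid, top=0, left=0, height=2, width=2):
    """GeoCrop stand-in: cut a rectangular window out of the grid."""
    return [row[left:left + width] for row in grid[top:top + height]]


def draw_image(grid, threshold=5.0):
    """Draw Image stand-in: render cells at or above threshold as '#'."""
    return "\n".join(
        "".join("#" if v >= threshold else "." for v in row) for row in grid
    )


def chain(*services):
    """Compose services so each one consumes the previous one's output."""
    return lambda data: reduce(lambda value, service: service(value), services, data)


# Hypothetical input standing in for a data file.
raw = """
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
"""
pipeline = chain(read_grid, resample, geocrop, draw_image)
image = pipeline(raw)
print(image)
```

The point of the design is that each stage only needs to agree with its neighbours on the data it exchanges, so a stage can be replaced or moved to another machine without touching the rest of the chain.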
Concept Hierarchy for Data Mining and Fusion
[Figure: multi-level mining concept with the labels Mining, Decision Support, Event A, Event B, Feature Set I, Feature I, Feature II, Feature III, Feature X, Feature Y, Model and Observation Data.]

On-Board Real-Time Processing
Sensor Control/Targeting

EVE – Environment for On-board Processing
www.itsc.uah.edu/eve
• Anomaly detection
• Data Mining
• Autonomous Decision Making
• Immediate response
• Direct satellite to Earth delivery of results

A Reconfigurable Web of Interacting Sensors
[Figure: sensor web with the labels Communications, Weather, Satellite Constellations, Military, Ground Network.]

Example Plan: Threshold events in AMSU-A Streaming Data
[Figure: EVE.]

Data Integration and Mining: From Global Information to Local Knowledge
• Emergency Response
• Precision Agriculture
• Urban Environments
• Weather Prediction
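The example plan named above, thresholding events in AMSU-A streaming data, gives no detail about how EVE implements it. The sketch below shows one plausible reading under stated assumptions: readings arrive one at a time, an event is a run of consecutive readings above a threshold, and the detector keeps only constant state so it could run in a constrained on-board environment. The function name threshold_events, the threshold value and the readings are hypothetical.

```python
def threshold_events(stream, threshold):
    """Yield (start_index, end_index, peak_value) for each run above threshold.

    Only the current run is held in memory, so the input can be an
    unbounded iterator rather than a stored data set.
    """
    start = None
    peak = None
    index = -1
    for index, value in enumerate(stream):
        if value > threshold:
            if start is None:
                start, peak = index, value  # a new event begins
            else:
                peak = max(peak, value)  # the current event continues
        elif start is not None:
            yield (start, index - 1, peak)  # the event ended on the previous reading
            start = peak = None
    if start is not None:
        yield (start, index, peak)  # the stream ended during an event


# Made-up readings, not real AMSU-A values.
readings = [210.0, 215.5, 262.3, 270.1, 240.0, 261.0, 199.9]
events = list(threshold_events(iter(readings), threshold=260.0))
print(events)
```

Because the detector is a generator, each event is available as soon as it closes, which matches the slide's emphasis on immediate response.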