Data Mining
Research and Applications
Workshop on Cyberinfrastructure
For Environmental Research and Education
October 31, 2002
Steve Tanner
Information Technology and Systems Center
University of Alabama in Huntsville
[email protected]
256.824.5143
www.itsc.uah.edu
Key Questions:

What is the most effective approach to developing an
integrated framework and plan for an interdisciplinary
environmental cyberinfrastructure?

What organizational structure is needed to provide long-term
support for data storage, access, model development, and
services for a global clientele of researchers, educators, policy
makers, and citizens?

How will effective interagency and public-private partnerships
be formed to provide financial support for such an extensive
and costly system?

How can communication and coordination among computer
scientists and environmental researchers and educators be
enhanced to develop this innovative, powerful, and accessible
infrastructure?
Data Mining
• Data Mining is an interdisciplinary field drawing from areas such as statistics, machine learning, pattern recognition and others
• Automated discovery of patterns, anomalies, etc. from vast observational and model data sets
• Derived knowledge for decision making, predictions and disaster response
• ADaM – Algorithm Development and Mining System
datamining.itsc.uah.edu
Techniques used for Data Mining
• Clustering Techniques
– K Means
– Isodata
– Maximum
• Pattern Recognition
– Bayes Classifier
– Minimum Distance Classifier
• Image Analysis
– Boundary Detection
– Cooccurrence Matrix
– Dilation and Erosion
– Histogram Operations
– Polygon Circumscript
– Spatial Filtering
– Texture Operations
• Genetic Algorithms
• Neural Networks
• Etc.
Data Mining systems usually involve a toolbox of many different techniques and a means for combining them
Typical Everyday Encounters with Data Mining
• Google
– Complex algorithm sequence to decide order
• Amazon.Com
– Additional purchase suggestions
• Credit Card Fraud
– Event notification of odd usage
Most current Data Mining applications are text based. Text
provides an easily readable source of heterogeneous data.
Mining of scientific data sets is more complex.
User Perspective and Data Perspective of the Data Mining Process
[Figure: the data mining process from the User Perspective and the Data Perspective. Labels: Data Stores; Data; Calibration & Navigation; Preprocessing; Dataset Specific Algorithms; Domain Specific Algorithms; Transformation; Information; Analysis; Knowledge; Decision; Volume; Value]
Similarity between Data Mining and Scientific Analysis Process
Scientific Analysis
• Harnesses human analysis capabilities
– Highly creative
• Based on theory and hypothesis formulation
– Physical basis is normally used for algorithms
• Drawing insights about the underlying phenomena
• Rapidly widening gap between data collection capabilities and the ability to analyze data
• Potential of vast amounts of data to be unused
Data Mining
• Provides automation of the analysis process
• Can be used for dimensionality reduction when manual examination of data is impossible
• Can have limitations
– May not utilize domain knowledge
– May be difficult to prove validity of the results
– There may not be a physical basis
• Should be viewed as a complementary tool and not a replacement for scientific analysis
Mining Environments
• Mining Framework (ADaM)
– Complete System (Client and Engine)
– Mining Engine (User provides its own client)
– Application Specific Mining Systems
– Operations Tool Kit
– Stand Alone Mining Algorithms
– Data Fusion
• Distributed/Federated Mining
– Distributed services
– Distributed data
– Chaining using Interchange Technologies
• On-board Mining (EVE)
– Real time and distributed mining
– Processing environment constraints
Using the Mining Framework:
Focusing on the information in data
The ADaM Processing Model
Raw Data → Input → Translated Data → Preprocessing → Preprocessed Data → Analysis → Patterns/Models → Output → Results
• Input: PIP-2, SSM/I Pathfinder, SSM/I TDR, SSM/I NESDIS Lvl 1B, SSM/I MSFC Brightness Temp, US Rain, Landsat, ASCII Grass, Vectors (ASCII Text), HDF, HDF-EOS, GIF, Intergraph Raster, Others...
• Preprocessing
– Selection and Sampling: Subsetting, Subsampling, Select by Value, Coincidence Search
– Grid Manipulation: Grid Creation, Bin Aggregate, Bin Select, Grid Aggregate, Grid Select, Find Holes
– Image Processing: Cropping, Inversion, Thresholding
– Others...
• Analysis
– Clustering: K Means, Isodata, Maximum
– Pattern Recognition: Bayes Classifier, Min. Dist. Classifier
– Image Analysis: Boundary Detection, Cooccurrence Matrix, Dilation and Erosion, Histogram Operations, Polygon Circumscript, Spatial Filtering, Texture Operations
– Genetic Algorithms
– Neural Networks
– Others…
• Output: GIF Images, HDF Raster Images, HDF Scientific Data Sets, HDF-EOS, Polygons (ASCII, DXF), SSM/I MSFC Brightness Temp, TIFF Images, GeoTIFF, Others...
Iterative Nature of the Data Mining Process
[Figure labels: DATA; CLEANING and INTEGRATION; PREPROCESSING; SELECTION and TRANSFORMATION; MINING; DISCOVERY; EVALUATION and PRESENTATION; KNOWLEDGE]
Distributed/Federated Mining:
Meshing data and algorithms to
generate knowledge
ADaM : Mining Environment for
Scientific Data
• The system provides knowledge discovery, feature detection and content-based searching for data values, as well as for metadata.
• Contains over 120 different operations
• Operations vary from specialized science data-set specific algorithms to various digital image processing techniques, processing modules for automatic pattern recognition, machine perception, neural networks, genetic algorithms and others
Classification Based on Texture Features and Edge Density
• Science Rationale: Man-made changes to land use cause changes in weather patterns, especially cumulus clouds
• Cumulus cloud fields have a very characteristic texture signature in the GOES visible imagery
• Comparison based on
– Accuracy of detection
– Amount of time required to classify
Parallel Version of Cloud Extraction
• GOES images can be used to recognize cumulus cloud fields
• Cumulus clouds are small and do not show up well in 4km resolution IR channels
• Detection of cumulus cloud fields in GOES can be accomplished by using texture features or edge detectors
[Figure: GOES Image → Sobel Horizontal Filter, Sobel Vertical Filter, Laplacian Filter → Energy Computation (×4) → Classifier → Cloud Image. Panels: GOES Image; Cumulus Cloud Mask]
Three edge detection filters are used together to detect cumulus clouds, which lends itself to implementation on a parallel cluster
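The filter-and-energy idea described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the kernels are the standard 3x3 Sobel and Laplacian kernels, while the window size, the summed-energy rule and the names `filter3x3`, `local_energy` and `cumulus_mask` are hypothetical and are not taken from ADaM.

```python
# Standard 3x3 kernels for the three edge detectors named on the slide.
SOBEL_H = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
SOBEL_V = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]


def clamp(i, n):
    """Clamp index i into [0, n-1] (replicates border pixels)."""
    return max(0, min(n - 1, i))


def filter3x3(image, kernel):
    """Cross-correlate a 2-D list with a 3x3 kernel, replicating borders."""
    rows, cols = len(image), len(image[0])
    return [[sum(kernel[dr + 1][dc + 1]
                 * image[clamp(r + dr, rows)][clamp(c + dc, cols)]
                 for dr in (-1, 0, 1) for dc in (-1, 0, 1))
             for c in range(cols)] for r in range(rows)]


def local_energy(response, half=2):
    """Mean squared response over a (2*half+1)^2 window around each pixel."""
    rows, cols = len(response), len(response[0])
    size = (2 * half + 1) ** 2
    span = range(-half, half + 1)
    return [[sum(response[clamp(r + dr, rows)][clamp(c + dc, cols)] ** 2
                 for dr in span for dc in span) / size
             for c in range(cols)] for r in range(rows)]


def cumulus_mask(image, threshold):
    """Toy classifier: True where the summed edge energy exceeds threshold."""
    # The three filter branches are independent, so each could be computed
    # on a separate node of a cluster before the classifier combines them.
    energies = [local_energy(filter3x3(image, k))
                for k in (SOBEL_H, SOBEL_V, LAPLACIAN)]
    rows, cols = len(image), len(image[0])
    return [[sum(e[r][c] for e in energies) > threshold
             for c in range(cols)] for r in range(rows)]
```

A uniform image produces no edge energy and therefore an empty mask, while a finely textured patch is flagged. A real classifier would be trained on labelled cloud imagery rather than use a fixed threshold.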
Automated Data Analysis for Boundary Detection and Quantification
• Analysis of polar cap auroras in large volumes of spacecraft UV images
• Science Rationale: Indicators to predict geomagnetic storm
– Damage satellites
– Disrupt radio connection
• Developing different mining algorithms to detect and quantify polar cap boundary
[Figure: Polar Cap Boundary]
Detecting Signatures
• Science Rationale: Mesocyclone signatures in radar data are indicators of tornadic activity
• Developing an algorithm based on wind velocity shear signatures
– Improve accuracy and reduce false alarm rates
Genetic Subtyping Using Hierarchical Clustering
• Biologists are interested in comparing DNA sequences to see how closely related they are to one another
• Phylogenetic trees are constructed by performing hierarchical clustering on DNA sequences using genetic distance as a distance measure
• Such trees show which organisms are most likely to share common ancestors, and may provide information about how various subtypes of organisms evolved
• This information is useful when studying disease-causing organisms such as viruses and bacteria, because genetically similar types should behave in similar ways
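As a rough illustration of the clustering step described above, the sketch below builds a tree by average-linkage agglomerative clustering. It assumes the sequences are already aligned to equal length, and it uses the fraction of differing sites (p-distance) as a simple stand-in for genetic distance. The function names and the sample sequences are made up for the example.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def leaves(tree):
    """Return the leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def hierarchical_cluster(seqs):
    """Average-linkage agglomerative clustering.

    seqs maps a name to an aligned sequence. Returns a nested tuple in
    which the innermost pairs were merged first.
    """
    def linkage(t1, t2):
        pairs = [(x, y) for x in leaves(t1) for y in leaves(t2)]
        return sum(p_distance(seqs[x], seqs[y]) for x, y in pairs) / len(pairs)

    clusters = sorted(seqs)  # start with every sequence in its own cluster
    while len(clusters) > 1:
        # Merge the two clusters with the smallest average distance.
        t1, t2 = min(combinations(clusters, 2), key=lambda p: linkage(*p))
        clusters = [c for c in clusters if c not in (t1, t2)]
        clusters.append((t1, t2))
    return clusters[0]


# Made-up sample data: A/B and C/D each differ at a single site.
sample = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TTGTCCGA", "D": "TTGTCCGT"}
tree = hierarchical_cluster(sample)
```

With the sample data the two closely related pairs are merged first and then joined at the root. Production phylogenetics would use a proper substitution model and a dedicated library rather than this toy distance.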
Mining on Data Ingest:
Tropical Cyclone Detection
[Figure labels: Advanced Microwave Sounding Unit (AMSU-A) Data; Calibration/Limb Correction/Converted to Tb]
Mining Plan:
• Water cover mask to eliminate land
• Laplacian filter to compute temperature
gradients
• Science Algorithm to estimate wind speed
• Contiguous regions with wind speeds
above a desired threshold identified
• Additional test to eliminate false positives
• Maximum wind speed and location
produced
[Figure labels: Mining Environment; Result; Hurricane Floyd; Data Archive; Knowledge Base; Further Analysis]
Results are placed on the web, made available to
National Hurricane Center & Joint Typhoon Warning Center,
and stored for further analysis
pm-esip.msfc.nasa.gov/
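The six steps of the mining plan above can be sketched as a single function. The slides do not give the science algorithm or the false-positive test, so the sketch substitutes clearly hypothetical stand-ins: wind speed is taken as `scale` times the absolute Laplacian, and regions smaller than `min_pixels` are discarded. None of the names come from ADaM.

```python
from collections import deque


def laplacian(grid, r, c):
    """4-neighbour Laplacian with replicated borders."""
    rows, cols = len(grid), len(grid[0])

    def at(i, j):
        return grid[max(0, min(rows - 1, i))][max(0, min(cols - 1, j))]

    return (at(r - 1, c) + at(r + 1, c) + at(r, c - 1) + at(r, c + 1)
            - 4 * grid[r][c])


def detect_cyclones(tb, water, wind_threshold, scale=1.0, min_pixels=2):
    """Return [(max_wind, (row, col)), ...], one entry per detected region.

    tb: 2-D list of brightness temperatures; water: 2-D list of booleans.
    scale and min_pixels are hypothetical stand-ins for the science
    algorithm and the false-positive test, which the slides do not specify.
    """
    rows, cols = len(tb), len(tb[0])
    # Steps 1-3: mask land, compute the gradient, estimate wind speed.
    wind = [[scale * abs(laplacian(tb, r, c)) if water[r][c] else 0.0
             for c in range(cols)] for r in range(rows)]
    seen = [[False] * cols for _ in range(rows)]
    results = []
    for r0 in range(rows):
        for c0 in range(cols):
            if seen[r0][c0] or wind[r0][c0] <= wind_threshold:
                continue
            # Step 4: flood-fill one contiguous region above the threshold.
            region, queue = [], deque([(r0, c0)])
            seen[r0][c0] = True
            while queue:
                r, c = queue.popleft()
                region.append((r, c))
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and not seen[nr][nc]
                            and wind[nr][nc] > wind_threshold):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            # Step 5: discard regions too small to be credible.
            if len(region) < min_pixels:
                continue
            # Step 6: report the maximum wind speed and where it occurs.
            peak = max(region, key=lambda p: wind[p[0]][p[1]])
            results.append((wind[peak[0]][peak[1]], peak))
    return results
```

Called on a grid with one warm anomaly over water, the function returns a single region whose peak sits on the anomaly. The same grid flagged entirely as land yields no detections.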
Multiple Mining Environments:
Passive Microwave ESIP Information System
[Figure: PM-ESIP system architecture. Labels: Web Interfaces & Applications; Visualization & Exploration; Temperature Trends; Data Ordering; AMSU-A Images; AMSU Product Generation; STT Application; FTP; Cyclone Winds; ADaM-based Processing; Input; PM-ESIP Catalog; Process; Subset/Grid/Format; Output; Order Staging; ADaM Servers; Custom Processing; AMSU-A Ingest; TMI; TMI Ingest and Product Generation; Data Ingest & Processing; AMSU-A; SSM/I; SSM/T2; Distributed Data Stores]
Interoperability: Accessing
Heterogeneous Data
The Problem
[Figure: DATA FORMAT 1, DATA FORMAT 2, DATA FORMAT 3 → FORMAT CONVERTER, READER 1, READER 2 → APPLICATION]
The Solution
[Figure: DATA FORMAT 1, DATA FORMAT 2, DATA FORMAT 3, each with an ESML FILE → ESML LIBRARY → APPLICATION]
• Science data comes in:
– Different formats, types and structures
– Different states of processing (raw, calibrated, derived, modeled or interpreted)
– Enormous volumes
• Heterogeneity leads to data usability problems
• One approach: Standard data formats
– Difficult to implement and enforce
– Can’t anticipate all needs
– Some data can’t be modeled or is lost in translation
• The cost of converting legacy data
• A better approach: Interchange Technologies
• Earth Science Markup Language
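The interchange idea above (describe each format externally and let one library read them all) can be illustrated with a small sketch. This is not ESML syntax or the ESML library API. The `DESCRIPTIONS` dictionary is a hypothetical stand-in for per-format description files, and `read_records` plays the role of a generic reader built on Python's standard `struct` module.

```python
import struct

# Hypothetical descriptions, standing in for per-format description files.
# Each one tells the generic reader how a record is laid out on disk.
DESCRIPTIONS = {
    "format_1": {"byte_order": "<",
                 "fields": [("lat", "f"), ("lon", "f"), ("tb", "h")]},
    "format_2": {"byte_order": ">",
                 "fields": [("tb", "d"), ("lon", "d"), ("lat", "d")]},
}


def read_records(raw, description):
    """Decode raw bytes into dicts using only the external description."""
    layout = description["byte_order"] + "".join(
        code for _, code in description["fields"])
    names = [name for name, _ in description["fields"]]
    return [dict(zip(names, values))
            for values in struct.iter_unpack(layout, raw)]


# Two files with different byte order, field order and types...
raw_1 = struct.pack("<ffh", 34.5, -86.5, 250)
raw_2 = struct.pack(">ddd", 250.0, -86.5, 34.5)

# ...are read by the same code, so the application sees one interface.
records_1 = read_records(raw_1, DESCRIPTIONS["format_1"])
records_2 = read_records(raw_2, DESCRIPTIONS["format_2"])
```

The data files are never converted. Supporting a new format means writing a new description rather than a new reader.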
Chained Image Processing Services
[Figure labels: WMS (Java/Windows); Chained Services: Format (Perl/Linux), Resample (Perl/C – Linux), GeoCrop (Perl/Linux), Draw Image (Perl/C – Linux); Reader (Java/C+ Windows); ESML Lib; ESML; Data; Data Files; Data Streams; Knowledge Base]
Service Chaining is used to integrate modules (or services) developed on distributed platforms and in different languages into a single processing solution.
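The chaining idea can be sketched as simple composition, where each service's output becomes the next service's input. In the system shown on the slide the services run on different platforms and are invoked remotely. The local Python functions below are hypothetical stand-ins that only borrow the service names.

```python
from functools import reduce


def chain(*services):
    """Compose services so each one's output feeds the next one's input."""
    return lambda data: reduce(lambda acc, service: service(acc),
                               services, data)


# Hypothetical stand-ins; on the slide each step is a separate service.
def resample(grid, step=2):
    """Keep every step-th row and column."""
    return [row[::step] for row in grid[::step]]


def geocrop(grid, rows=slice(0, 2), cols=slice(0, 2)):
    """Cut out a rectangular window."""
    return [row[cols] for row in grid[rows]]


def draw_image(grid):
    """Render the grid as text, one line per row."""
    return "\n".join(" ".join(str(v) for v in row) for row in grid)


pipeline = chain(resample, geocrop, draw_image)
grid = [[r * 10 + c for c in range(6)] for r in range(6)]
rendered = pipeline(grid)
```

Because the chain only relies on each step accepting the previous step's output, a step can be swapped for a remote call without changing the rest of the pipeline.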
Data Integration using Web
Mapping Services
[Figure: Fused Displays from Multiple Servers. Labels: Countries; Coastlines; Globe; AMSU-A Channel 01; AMSU-A; MCS Events; Cyclone Events; ITSC Knowledge Base]
AMSU-A data overlaid with MCS and Cyclone events for September 2000, merged with world boundaries from Globe.
Analysis: Correlate MCSs and cyclones with atmospheric temperatures for September 2000.
Concept Hierarchy for Data Mining and Fusion
[Figure labels: MULTI-LEVEL MINING; CONCEPT MINING; DECISION SUPPORT; EVENT A; EVENT B; FEATURE SET I; FEATURE I; FEATURE II; FEATURE III; FEATURE X; FEATURE Y; Model and Observation Data]
On-Board Real-Time Processing
Sensor Control/Targeting
EVE – Environment for On-board Processing
www.itsc.uah.edu/eve
• Anomaly detection
• Data Mining
• Autonomous Decision Making
• Immediate response
• Direct satellite to Earth delivery of results
A Reconfigurable Web of Interacting Sensors
[Figure labels: Communications; Weather; Military; Satellite Constellations; Ground Network (×3)]
Example Plan: Threshold events
in AMSU-A Streaming Data
EVE
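A threshold-event plan of this kind can be sketched as a generator that scans samples as they arrive and reports each run of values above the threshold. This is only an illustration. The function name and the shape of the reported events are assumptions and do not describe the EVE plan format.

```python
def threshold_events(stream, threshold):
    """Yield (start_index, end_index, peak) for each run above threshold.

    Works on any iterable, so samples can be processed as they arrive
    without holding the whole stream in memory.
    """
    start = peak = None
    index = -1
    for index, value in enumerate(stream):
        if value > threshold:
            if start is None:
                start, peak = index, value
            else:
                peak = max(peak, value)
        elif start is not None:
            yield (start, index - 1, peak)
            start = peak = None
    if start is not None:  # stream ended while still above threshold
        yield (start, index, peak)


events = list(threshold_events([1, 5, 7, 2, 1, 9, 8], 4))
```

Each event carries only a start index, an end index and a peak value, which fits the on-board setting where results rather than raw data are sent to the ground.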
Data Integration and Mining:
From Global Information to Local Knowledge
• Emergency Response
• Precision Agriculture
• Urban Environments
• Weather Prediction