Computational Discovery of
Communicable Scientific Models
Pat Langley
Center for the Study of Language and Information
Stanford University, Stanford, California
http://cll.stanford.edu/~langley
Thanks to N. Asgharbeygi, K. Arrigo, S. Bay, S. Dzeroski, J. Sanchez, O. Shiran,
and L. Todorovski for their contributions to this research, which is funded by a grant
from the National Science Foundation.
Data Mining vs. Scientific Discovery
There exist two computational paradigms for discovering explicit
knowledge from data:
• Data mining generates knowledge cast as decision trees,
  logical rules, or other notations invented by AI researchers;
• Computational scientific discovery instead uses equations,
  structural models, reaction pathways, or other formalisms
  invented by scientists and engineers.
Both approaches draw on heuristic search to find regularities in
data, but they differ considerably in their emphases.
Lesson 1
Traditional notations from machine learning are not communicated
easily to domain scientists.
Ecosystem model
NPPc = Σ_month max(E · IPAR, 0)
E = 0.56 · T1 · T2 · W
T1 = 0.8 + 0.02 · Topt − 0.0005 · Topt^2
T2 = 1.18 / [(1 + e^(0.2 · (Topt − Tempc − 10))) · (1 + e^(0.3 · (Tempc − Topt − 10)))]
W = 0.5 + 0.5 · EET / PET
PET = 1.6 · (10 · Tempc / AHI)^A · PET-TW-M   if Tempc > 0
PET = 0                                        if Tempc < 0
A = 0.00000068 · AHI^3 − 0.000077 · AHI^2 + 0.018 · AHI + 0.49
IPAR = 0.5 · FPAR-FAS · Monthly-Solar · Sol-Conver
FPAR-FAS = min[(SR-FAS − 1.08) / SR(UMD-VEG), 0.95]
SR-FAS = (Mon-FAS-NDVI + 1000) / (Mon-FAS-NDVI − 1000)
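The efficiency part of the ecosystem model can be sketched in Python as follows. This is only an illustration: it assumes Topt is squared in T1 and that the e terms in T2 are exponentials, and the function names and sample inputs are hypothetical rather than taken from the slides.

```python
import math

def t1(topt):
    """Temperature stress factor T1 from the optimal temperature."""
    return 0.8 + 0.02 * topt - 0.0005 * topt ** 2

def t2(topt, tempc):
    """Temperature stress factor T2 from optimal and observed temperature."""
    low = 1 + math.exp(0.2 * (topt - tempc - 10))
    high = 1 + math.exp(0.3 * (tempc - topt - 10))
    return 1.18 / (low * high)

def w(eet, pet):
    """Water stress factor W from estimated and potential evapotranspiration."""
    return 0.5 + 0.5 * eet / pet

def efficiency(topt, tempc, eet, pet):
    """Light-use efficiency E = 0.56 * T1 * T2 * W."""
    return 0.56 * t1(topt) * t2(topt, tempc) * w(eet, pet)

def nppc(monthly_e, monthly_ipar):
    """Sum max(E * IPAR, 0) over months."""
    return sum(max(e * ipar, 0.0) for e, ipar in zip(monthly_e, monthly_ipar))

# Hypothetical inputs, chosen only to exercise the functions.
e_example = efficiency(topt=20.0, tempc=20.0, eet=5.0, pet=10.0)
npp_example = nppc([e_example, -0.1], [100.0, 100.0])  # second month clips to 0
```

Each intermediate quantity (T1, T2, W) stays a named function, so the code keeps the same structure a domain scientist would read off the equations.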
Gene regulation model
[Figure: the gene regulation model drawn as a signed network over Light, NBLR, NBLA, RR, Health, PBS, DFR, Photo, psbA1, psbA2, and cpcB, with links labeled + or -.]
Lesson 2
Scientists often have initial models that should influence the
discovery process.
[Figure: Discovery takes Observations and an Initial model of the gene regulation network and produces a Revised model. Both models are signed networks over Light, NBLR, NBLA, RR, Health, PBS, DFR, Photo, psbA1, psbA2, and cpcB; links are labeled + or -, and some links are marked ×.]
Lesson 3
Scientific data are often rare and difficult to obtain rather than
being plentiful.
Ecosystem model
  Number of variables          8
  Number of equations         11
  Number of parameters        20
  Number of samples          303
Gene regulation model
  Number of variables          9
  Number of initial links     11
  Number of possible links    70
  Number of samples           20
Lesson 4
Scientists want models that move beyond description to provide
explanations of their data.
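[Figure: both models drawn as graphs.
 Ecosystem model: a dependency graph over NPPc, E, IPAR, e_max, T1, T2, W, Topt, Tempc, A, AHI, PET, PETTWM, EET, SOLAR, FPAR, SR, NDVI, and VEG.
 Gene regulation model: a signed network over Light, NBLR, NBLA, RR, Health, PBS, DFR, Photo, psbA1, psbA2, and cpcB, with links labeled + or -.]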
Lesson 5
Scientists want computational assistance rather than automated
discovery systems.
[Figure: Discovery takes Observations and an Initial model of the gene regulation network and produces a Revised model. Both models are signed networks over Light, NBLR, NBLA, RR, Health, PBS, DFR, Photo, psbA1, psbA2, and cpcB; links are labeled + or -, and some links are marked ×.]
The Nature of Systems Science
Disciplines like Earth science and computational biology differ
from traditional fields in that they:
• focus on synthesis rather than analysis in their operation;
• rely on computer modeling as one of their central methods;
• develop system-level models with many variables and relations;
• require that models make contact with known mechanisms.
However, existing methods for computational scientific discovery
were not designed with systems science in mind.
Time Series from the Ross Sea Ecosystem
Inductive Process Modeling
Our approach is to design and implement computational methods
for inductive process modeling, which:
• represent scientific models as sets of quantitative processes;
• use these models to predict and explain observational data;
• search a space of process models to find good candidates;
• utilize background knowledge to constrain this search.
This framework has great potential both for modeling scientific
reasoning and aiding practicing scientists.
Existing Formalisms Are Inadequate
regression trees
  [Figure: a regression tree with tests B>6, C>0, and C>4 and leaf values 14.3, 18.7, 11.5, and 16.9]
hidden Markov models
  [Figure: a state-transition diagram with transition probabilities 0.7, 0.3, and 1.0]
Horn clause programs
  gcd(X,X,X).
  gcd(X,Y,D) :- X<Y, Z is Y-X, gcd(X,Z,D).
  gcd(X,Y,D) :- Y<X, gcd(Y,X,D).
systems of equations
  d[ice_mass,t] = −(18 · heat) / 6.02
  d[water_mass,t] = (18 · heat) / 6.02
A Process Model for an Aquatic Ecosystem
model AquaticEcosystem
  variables: phyto, zoo, nitro, residue
  observables: phyto, nitro
  process phyto_loss
    equations: d[phyto,t,1] = −0.307 · phyto
               d[residue,t,1] = 0.307 · phyto
  process zoo_loss
    equations: d[zoo,t,1] = −0.251 · zoo
               d[residue,t,1] = 0.251 · zoo
  process zoo_phyto_grazing
    equations: d[zoo,t,1] = 0.615 · 0.495 · zoo
               d[residue,t,1] = 0.385 · 0.495 · zoo
               d[phyto,t,1] = −0.495 · zoo
  process nitro_uptake
    conditions: nitro > 0
    equations: d[phyto,t,1] = 0.411 · phyto
               d[nitro,t,1] = −0.098 · 0.411 · phyto
  process nitro_remineralization
    equations: d[nitro,t,1] = 0.005 · residue
               d[residue,t,1] = −0.005 · residue
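To show how such a model yields predictions, here is a minimal Python sketch that sums each process's contribution to the first derivatives and steps forward with Euler integration. It assumes that the loss terms are negative and that zoo_loss adds 0.251 · zoo to residue. The initial values, step size, and the choice of Euler integration are illustrative assumptions, not details from the slides.

```python
def derivatives(state):
    """Sum every active process's contribution to each variable's derivative."""
    phyto, zoo = state["phyto"], state["zoo"]
    nitro, residue = state["nitro"], state["residue"]
    d = {"phyto": 0.0, "zoo": 0.0, "nitro": 0.0, "residue": 0.0}
    # phyto_loss
    d["phyto"] -= 0.307 * phyto
    d["residue"] += 0.307 * phyto
    # zoo_loss (residue term assumed to be 0.251 * zoo)
    d["zoo"] -= 0.251 * zoo
    d["residue"] += 0.251 * zoo
    # zoo_phyto_grazing
    d["zoo"] += 0.615 * 0.495 * zoo
    d["residue"] += 0.385 * 0.495 * zoo
    d["phyto"] -= 0.495 * zoo
    # nitro_uptake is active only while its condition holds
    if nitro > 0:
        d["phyto"] += 0.411 * phyto
        d["nitro"] -= 0.098 * 0.411 * phyto
    # nitro_remineralization
    d["nitro"] += 0.005 * residue
    d["residue"] -= 0.005 * residue
    return d

def simulate(state, steps, dt=0.1):
    """Euler integration; returns the list of states visited."""
    trajectory = [dict(state)]
    for _ in range(steps):
        rates = derivatives(state)
        state = {name: state[name] + dt * rates[name] for name in state}
        trajectory.append(dict(state))
    return trajectory

# Hypothetical initial conditions, for illustration only.
start = {"phyto": 1.0, "zoo": 0.1, "nitro": 1.0, "residue": 0.0}
trajectory = simulate(start, steps=50, dt=0.1)
```

Because each process adds its own terms, removing or adding a process changes only a few lines, which is the modularity that makes structural search practical.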
Advantages of Quantitative Process Models
Process models offer scientists a promising framework because:
• they embed quantitative relations within qualitative structure;
• they refer to notations and mechanisms familiar to experts;
• they provide dynamical predictions of changes over time;
• they offer causal and explanatory accounts of phenomena;
• they retain the modularity needed for induction/abduction.
Quantitative process models provide an important alternative to
formalisms currently used in computational discovery.
Challenges of Inductive Process Modeling
Process model induction differs from typical learning tasks in that:
• process models characterize behavior of dynamical systems;
• variables are continuous but can have discontinuous behavior;
• observations are not independently and identically distributed;
• models may contain unobservable processes and variables;
• multiple processes can interact to produce complex behavior.
Compensating factors include a focus on deterministic systems and
the availability of background knowledge.
Encoding Background Knowledge
To constrain candidate models, we can utilize available background
knowledge about the domain.
Previous work has encoded background knowledge in terms of:
• Horn clause programs (e.g., Towell & Shavlik, 1990)
• context-free grammars (e.g., Dzeroski & Todorovski, 1997)
• prior probability distributions (e.g., Friedman et al., 2000)
However, none of these notations is familiar to domain scientists,
which suggests the need for another approach.
Generic Processes as Background Knowledge
We cast background knowledge as generic processes that specify:
• the variables involved in a process and their types;
• the parameters appearing in a process and their ranges;
• the forms of conditions on the process; and
• the forms of associated equations and their parameters.
Generic processes are building blocks from which one can compose
a specific process model.
Generic Processes for Aquatic Ecosystems
generic process exponential_loss
  variables: S{species}, D{detritus}
  parameters: α [0, 1]
  equations: d[S,t,1] = −1 · α · S
             d[D,t,1] = α · S
generic process remineralization
  variables: N{nutrient}, D{detritus}
  parameters: π [0, 1]
  equations: d[N,t,1] = π · D
             d[D,t,1] = −1 · π · D
generic process grazing
  variables: S1{species}, S2{species}, D{detritus}
  parameters: ρ [0, 1], γ [0, 1]
  equations: d[S1,t,1] = γ · ρ · S1
             d[D,t,1] = (1 − γ) · ρ · S1
             d[S2,t,1] = −1 · ρ · S1
generic process constant_inflow
  variables: N{nutrient}
  parameters: ν [0, 1]
  equations: d[N,t,1] = ν
generic process nutrient_uptake
  variables: S{species}, N{nutrient}
  parameters: τ [0, ∞], β [0, 1], μ [0, 1]
  conditions: N > τ
  equations: d[S,t,1] = μ · S
             d[N,t,1] = −1 · β · μ · S
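One possible way to hold a generic process in code is a small record of typed roles and parameter ranges, with a check performed at instantiation time. The sketch below is an assumption about representation, not the authors' implementation; the class name, the `instantiate` method, and the parameter name `alpha` are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GenericProcess:
    name: str
    variables: dict   # role -> required type, e.g. {"S": "species"}
    parameters: dict  # parameter name -> (low, high) range

    def instantiate(self, bindings, types, values):
        """Bind roles to typed variables and parameters to in-range values."""
        for role, required in self.variables.items():
            if types[bindings[role]] != required:
                raise ValueError(f"{bindings[role]} is not a {required}")
        for param, (low, high) in self.parameters.items():
            if not low <= values[param] <= high:
                raise ValueError(f"{param} outside [{low}, {high}]")
        return {"process": self.name,
                "bindings": dict(bindings),
                "values": dict(values)}

# Variable types for the aquatic ecosystem example.
types = {"phyto": "species", "zoo": "species",
         "nitro": "nutrient", "residue": "detritus"}

exponential_loss = GenericProcess(
    name="exponential_loss",
    variables={"S": "species", "D": "detritus"},
    parameters={"alpha": (0.0, 1.0)},  # hypothetical name for the rate
)

# Specific process phyto_loss: S = phyto, D = residue, rate 0.307.
phyto_loss = exponential_loss.instantiate(
    {"S": "phyto", "D": "residue"}, types, {"alpha": 0.307})
```

Rejecting ill-typed bindings at this point is what lets type constraints prune the space of candidate models before any parameter fitting takes place.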
Inducing Process Models
[Figure: training data and generic processes feed Induction, which produces a process model.]

generic processes
  process exponential_growth
    variables: P {population}
    equations: d[P,t] = [0, 1, ∞] · P
  process logistic_growth
    variables: P {population}
    equations: d[P,t] = [0, 1, ∞] · P · (1 − P / [0, 1, ∞])
  process constant_inflow
    variables: I {inorganic_nutrient}
    equations: d[I,t] = [0, 1, ∞]
  process consumption
    variables: P1 {population}, P2 {population}, nutrient_P2
    equations: d[P1,t] = [0, 1, ∞] · P1 · nutrient_P2,
               d[P2,t] = −[0, 1, ∞] · P1 · nutrient_P2
  process no_saturation
    variables: P {number}, nutrient_P {number}
    equations: nutrient_P = P
  process saturation
    variables: P {number}, nutrient_P {number}
    equations: nutrient_P = P / (P + [0, 1, ∞])

process model
  model AquaticEcosystem
    variables: nitro, phyto, zoo, nutrient_nitro, nutrient_phyto
    observables: nitro, phyto, zoo
    process phyto_exponential_growth
      equations: d[phyto,t] = 0.1 · phyto
    process zoo_logistic_growth
      equations: d[zoo,t] = 0.1 · zoo · (1 − zoo / 1.5)
    process phyto_nitro_consumption
      equations: d[nitro,t] = −1 · phyto · nutrient_nitro,
                 d[phyto,t] = 1 · phyto · nutrient_nitro
    process phyto_nitro_no_saturation
      equations: nutrient_nitro = nitro
    process zoo_phyto_consumption
      equations: d[phyto,t] = −1 · zoo · nutrient_phyto,
                 d[zoo,t] = 1 · zoo · nutrient_phyto
    process zoo_phyto_saturation
      equations: nutrient_phyto = phyto / (phyto + 0.5)
A Method for Process Model Construction
The IPM algorithm constructs explanatory models from generic
components in four stages:
1. Find all ways to instantiate known generic processes with
   specific variables, subject to type constraints;
2. Combine instantiated processes into candidate generic models,
   subject to additional constraints (e.g., number of processes);
3. For each generic model, search through parameter space to
   find good coefficients;
4. Return the parameterized model with the best overall score.
Our typical evaluation metric is squared error, but we have also
explored other measures of explanatory adequacy.
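The first two stages amount to a typed enumeration followed by a bounded combination step. The Python sketch below illustrates that idea under simplifying assumptions: a generic process is reduced to a mapping from roles to types, and the only structural constraint is a maximum number of processes. The function names are hypothetical.

```python
from itertools import combinations, permutations

def instantiations(generic, types):
    """Stage 1: bind each generic process's roles to variables of matching type."""
    found = []
    for name, roles in generic.items():
        role_names = list(roles)
        for combo in permutations(types, len(role_names)):
            if all(types[var] == roles[role]
                   for role, var in zip(role_names, combo)):
                found.append((name, combo))
    return found

def candidate_structures(instances, max_processes):
    """Stage 2: combine instantiated processes into candidate model structures."""
    for size in range(1, max_processes + 1):
        yield from combinations(instances, size)

# Roles and types for three generic processes from the aquatic example.
generic = {
    "exponential_loss": {"S": "species", "D": "detritus"},
    "grazing": {"S1": "species", "S2": "species", "D": "detritus"},
    "remineralization": {"N": "nutrient", "D": "detritus"},
}
types = {"phyto": "species", "zoo": "species",
         "nitro": "nutrient", "residue": "detritus"}

instances = instantiations(generic, types)
structures = list(candidate_structures(instances, max_processes=2))
```

With four variables and three generic processes, this yields five instantiated processes and fifteen structures of at most two processes. The count grows combinatorially as the limit rises, so tight constraints on structure matter.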
Estimating Parameters in Process Models
To estimate the parameters for each generic model structure, the
IPM algorithm:
1. Selects random initial values that fall within the ranges
   specified in the generic processes;
2. Improves these parameters using the Levenberg-Marquardt
   method until it reaches a local optimum;
3. Generates new candidate values through random jumps along
   dimensions of the parameter vector and continues the search;
4. Restarts the search from a new random initial point if no
   improvement occurs after N jumps.
This multi-level method gives reasonable fits to time-series data
from a number of domains, but it is computationally intensive.
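The control structure of this search (random start, local improvement, random jumps, restart) can be sketched in Python as follows. To stay self-contained, the sketch swaps Levenberg-Marquardt for a simple shrinking-step coordinate search, so it shows only the multi-level loop, not the authors' optimizer. It also assumes finite parameter ranges, and every function name and default value is hypothetical.

```python
import math
import random

def squared_error(params, data, predict):
    """Sum of squared differences between predictions and observations."""
    return sum((predict(params, t) - y) ** 2 for t, y in data)

def local_search(params, ranges, score, step=0.1, tol=1e-6):
    """Stand-in for Levenberg-Marquardt: coordinate steps that shrink on failure."""
    params = list(params)
    best = score(params)
    while step > tol:
        improved = False
        for i, (low, high) in enumerate(ranges):
            for delta in (step, -step):
                trial = list(params)
                trial[i] = min(high, max(low, trial[i] + delta))
                trial_score = score(trial)
                if trial_score < best:
                    params, best, improved = trial, trial_score, True
        if not improved:
            step /= 2
    return params, best

def estimate(ranges, score, restarts=5, max_jumps=10, seed=0):
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(restarts):
        # Step 1: random initial values inside the allowed ranges.
        params = [rng.uniform(low, high) for low, high in ranges]
        # Step 2: improve to a local optimum.
        params, current = local_search(params, ranges, score)
        failures = 0
        while failures < max_jumps:
            # Step 3: random jump along one dimension, then continue the search.
            jumped = list(params)
            i = rng.randrange(len(ranges))
            jumped[i] = rng.uniform(*ranges[i])
            candidate, cand_score = local_search(jumped, ranges, score)
            if cand_score < current:
                params, current, failures = candidate, cand_score, 0
            else:
                failures += 1
        # Step 4: after max_jumps failures, restart from a new random point.
        if current < best_score:
            best_params, best_score = params, current
    return best_params, best_score

# Synthetic data from exponential decay with rate 0.307, for illustration only.
data = [(t, math.exp(-0.307 * t)) for t in range(10)]
predict = lambda p, t: math.exp(-p[0] * t)
fitted, error = estimate([(0.0, 1.0)], lambda p: squared_error(p, data, predict))
```

The nested loops make the cost visible: every jump triggers a full local search, and every restart repeats the whole jump sequence.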
Observations from the Ross Sea
Results on Training Data from Ross Sea
Results on Test Data from Ross Sea
Results on a Protist Ecosystem
Results on Ringkøbing Fjord
Results on Biochemical Kinetics
[Figure legend: observed trajectories, predicted trajectories]
Interfacing with Scientists
Because few scientists want to be replaced, we are developing an
interactive environment, PROMETHEUS, that lets users:
• specify a quantitative process model of the target system;
• display and edit the model’s structure and details graphically;
• simulate the model’s behavior over time and situations;
• compare the model’s predicted behavior to observations;
• invoke a revision module in response to detected anomalies.
The environment offers computational assistance in forming and
evaluating models but lets the user retain control.
Viewing a Process Model Graphically
Indicating Processes to Consider Adding
Specifying Data and Search Parameters
Inspecting Revised Process Models
Intellectual Influences
Our approach to computational discovery incorporates ideas from
many traditions:
• computational scientific discovery (e.g., Langley et al., 1983);
• theory revision in machine learning (e.g., Towell, 1991);
• qualitative physics and simulation (e.g., Forbus, 1984);
• languages for scientific simulation (e.g., STELLA, MATLAB);
• interactive tools for data analysis (e.g., Shneiderman, 2001).
Our work combines, in novel ways, insights from machine learning,
AI, programming languages, and human-computer interaction.
Contributions of the Research
In summary, our work on computational scientific discovery has, in
responding to various challenges, produced:
• a new formalism for representing scientific process models;
• a computational method for simulating these models’ behavior;
• an encoding for background knowledge as generic processes;
• an algorithm for inducing process models from time-series data;
• an interactive environment for model construction/utilization.
We have demonstrated this approach to model creation on domains
from Earth science, microbiology, and engineering.
Some Recent Extensions
In recent work, we have extended our approach to incorporate:
• heuristic beam search through the space of process models;
• hierarchical generic processes that further constrain search;
• an ensemble-like method that mitigates overfitting effects;
• metrics for explanatory adequacy based on trajectory shapes.
Inductive process modeling has great potential to speed progress
in systems science and engineering.
End of Presentation