```CSC 599: Computational
Scientific Discovery
Lecture 3: Data, Data
Structures, and Reasoning
in Science
Outline
Computers Helping Scientists
Data in Science


Data
Catalogs
Data structures for scientific knowledge


Equations
Symbolic knowledge
Simulations and scientific reasoning



Do simulations
Architectures for sci. reasoning
QSIM
Next time:

Probabilistic reason and graphical models
Computers cooperating with
Scientists
Play to the strengths of computers




Fast
Accurate (we hope!)
Don't get bored
Good for simulations of complex systems!
Avoiding their weaknesses

No “common sense”



Garbage In, Garbage out
Calculation bugs
Time to develop/maintain code

Use someone else's program, if exists
Don't Just Code-It-Up!
Do you really need a computer?



Analytical solutions are cooler
Integrate! (If you can)
Ordinary differential equation

System of eqns & derivatives w.r.t. one independ var




Laplace and z-transformations
Perturbation expansions and discrete time eqns
Partial differential equations

Another variable can change



Eg. time
For fluids: pressure, temperature, etc.
Separation of variables
See Gershenfeld “The Nature of Mathematical
Modeling”
Data in Science
You make a measurement, should record?


Value (numeric or conceptual)
Precision


For measured numbers: Mean and std deviation
For concepts: set identity
“car” vs. “Toyota” vs. “Prius”

Domain

System constraints




Instruments constraints




Range definition limits
System extent limits
Saturation limits
Range definition limits
Detection limits
Reliable limits
Data constraints

Observed min/max w/in system extent and detect limits
Data in Science (2)
You make a measurement, should record?


Units and Dimensions
When and Where



With what equipment


Eg. camera model and film type
By whom?



Speed of light expected to be observer-independent
Count of robins? First bloom of dandelions?
Yesterday's Chinese “guest stars” are today's
supernovas
Why?


What are you trying to look for?
Might hint at what your data do not show
Laboratories: Where the Data
Comes From


Astronomical Observatories
Particle Physics Accelerators



CERN
FermiLab
Biology Labs

Human Genome Project
Issues

Distributed observations


Human Genome Project
Collaborative observations
Catalogs
What is (and is not) Recorded
Distributed Nature of Catalogs
Synonyms in Catalogs
Missing and Error Values in Catalogs
Permanence of Catalogs
What is (and is not) recorded
Not record “outside of bounds”

Space


Time (maybe Money!)


Outside of area of interest
Budget ran out, no more money for sensing
Magnitude (strength)
“Outside of bounds” may be a “soft” concept

USGS records earthquake
All Mw > 7
 Most M
w > 6
 Has M
w ≈ 5 and below close to seismometers
Q: Is it true that the only small earthquakes are
close to seismometers?

Distributed Nature of Catalogs
Distributed databases

Hopefully designed properly


Redundancy
Reliably online up-time
Central repository

Centralizes data available from other dbs


May use




Example: some biochemical dbs
inconsistent terms,
different missing value conventions, etc
May be out-of-date relative to source dbs
Format may be “least common denominator” of
more specialized formats
Synonyms and Different
Meanings for a Given Word
Concepts

ASCII or UNICODE string
Consider the term “robin”

In N. America:



In Europe



“American robin”
“Turdus migratorius
“European robin”
“Erithacus rubecula”



“thrush family bird”
“songbird”
“bird”
Ontologies can help (if used)
Missing and Errors Values
Missing values often coded as sentinel values


There might be more than one!
Esp. for “required” fields


Clerks were required to get customer telephone #
Most common tele number? (111) 111-1111
Errors values


Probably a bigger problem before computers
Computers can solve some problems:



Sanity rules
Parity, checksum (corruption), nonce (security)



NASA launched satellite to look at atmospheric ozone
Satellite reported very low O3 levels by south pole
Data was not believed, BUT WAS TRUE (Ozone hole)
Permanence of Catalog
How long does the media last?

Is the data viewed as temporary or worth
archiving?





Backup policy?
Organization that holds the data




Is more data generated than can be stored?
Proprietary?
Subject to privacy issues?
Tech savvy?
Have money?
Well-organized in general?
Redundant sites?

Banks do this, but they have \$ and are protecting \$
The Importance of Prediction
Scientists do a lot of things:

Prediction underlies much
Scientific assertions



Numbers: Equations
Concepts: Rules, decision trees, production sys
Probabilities: Graphical models

Next time!
Equations
“Strongest” form of knowledge
Defines relationships between quantities
F = ma
 Force and acceleration acting in same direction
 Mass is scalar
For N variables: N-1 knowns -> 1 unknown
Scalar or Vector
F = ma
Equations, cont'd
Are they always generalizations?
The Drake Equation:
N = R* * fp * ne * fl * fi * fc * L
Where:
N = Number of civilizations emitting radio transmissions
R* = rate of Sun-like star formation
fp = fraction of stars with planets
ne = average number of inhabitable planets
fl,fi,fc = probability of life, intelligence and civilization
L = average duration of a civilization
So far N = 0, is this equation useful?
Do we know any of these terms well?
What if N was non-zero, would the eqn be useful?
Computing with symbols
Rules

If dog(X) and wet(X) then smelly(X)
Decision trees
Production systems

Like rules, but special memory to hold what is
true
Simulations
Dimensions





Deterministic or stochastic?
Discrete or continuous?
Linear or non-linear?
Batch or real-time?
Phases of development
1. Real system to logical abstraction
2. Appropriate datastructs/algorithms/math technique
3. Implement algorithm as program
4. Validate implementation
Real System to Logical
Abstraction
1. Define the problem



What are the objectives?
What question(s) would you like to answer?
What do you anticipate answering in the future?
2. When in doubt, leave detail out

You may not need it


“A theory should be as simple as it needs to be, and
no simpler” A. Einstein
You can always add it in future
3. Spend the time turning vague statement of
goals into better defined one(s)


Revisit as necessary
(That's just good software engineering!)
Appropriate Datastructure/
Algorithm/Math Technique
Setting up the model








Go simple -> complex
Iterate
Develop large models modularly
Model only what is necessary
Make (and state) assumptions and hypotheses
State constraints
Alternate between top-down & bottom up
Best data structure/algorithm?

Data Structures and algorithms class
Best math technique?

Numerical methods class
Validation
Define some “test cases” before hand



Define (input,output) pairings
Should range in complexity
Should be checked:



By scientists, or
By well-known (e.g. “textbook”) cases, or
By hand (and then double-checked by someone else)
Just good software engineering!
Implementation
Procedural languages
Object oriented languages
Artificial Intelligence Languages
Production systems
Simulation languages
Procedural Languages
Examples

Fortran, C
Fast


Compilers map language to machine code easily
Large body of libraries for scientific


Among first programming languages
Have to think in machine terms, not domain
terms
Object Oriented Languages
Examples

C++, Java, C#, (Eiffel, Smalltalk)
Object Oriented Nature



Classes can represent scientific objects
Methods can represent scientific transitions
C++ can link with C libraries
Still may be a hassle to program objects
Artificial Intelligence Languages
Examples

Prolog, Lisp, Scheme
Prolog: Predicate Logic (Horn Clause) based
Facts:
sentiveNose(sniffer).
dog(fido). wet(fido).
Rules:
smelly(X) :- dog(X), wet(X).
offendedBy(Y,X) :- smelly(X), sensitiveNose(Y).
Possible queries:
Is sniffer offended by fido? offendedBy(sniffer,fido).
Does fido offend anyone? offendedBy(A,fido).
Is sniffer offended by anyone? offendedBy(sniffer,B).
Is anyone offended by anyone else? offendedBy(A,B).
Artificial Intelligence Langs (2)
Lisp

It's all functions operating on lists:
(defun computeProperties (object newObj)
(if (and (getProperties object dog)
(getProperties object wet)
)
)
)
Artificial Intelligence Langs (3)



Encourage thinking in domain space
Prolog: modular rules
Lisp: functions operating on lists good for
symbolic processing

Prolog: not good for floating pt numbers


Exact matching not compatible with float pt rounding
Lisp:

Handling loops and variables somewhat contrived
Production Systems
Examples

SOAR, OPS5, Mycin?
Models human thought
Production sys = Rules + Working Memory

Rules = if-then statements
if dog(X) and wet(X) then smelly(X)
if smelly(X) and sensitiveNose(Y) then offendedBy(Y,X)

Working memory = models human memory
dog(fido), wet(fido), sensitiveNose(sniffer)

Computation
1st round: compute smelly(fido)
2nd round: compute offendedBy(sniffer,fido)
Production Systems, cont'd


Productions are inherently modular
Can have constraints on working memory elements
(“wme's”)



Better models human cognitive limitations
RETE algorithm matches wmes and rules efficiently
Rules = laws, WMEs = state
When rules conflict, what happens?





Architecture has to do something
OPS5: More specific rule wins
SOAR: Create problem space to resolve issue
MYCIN: Rules w/certainty factors (weighted vote)
Is what the architecture does what you want?
Simulations Languages
Examples

QSIM, SPICE (analog circuits), Scilab
Qualitative simulations



Very precise, not very general:
d2x/dt2(t) = -9.8 m/sec2, x(t0) = 2m, v(t0) = 0 m/sec
Less precise, more general
d2x/dt2(t) = -g, x(t0) = x0, v(t0) = v0
Even less precise, even more general
dx/dt = M- (monotonically dec. fnc),
x(t0) = 0 .. x0 .. infinity
QSIM (1)
Values
Either landmark values
0, MAX_CAPACITY, t0, inf
Or ranges between landmark values
Not full but not empty:
Time before t1 but after t0:
Variables:
(0,MAX_CAPACITY),
(t0,t1)
Have both a value and derivative
<0,inc>
<(0,MAX_CAPACITY),inc>
<MAX_CAPACITY,std>
Functions:
Either monotonically increasing or decreasing
M+ , M -
Derivatives: either increasing, decreasing or steady:
inc, dec, std.
QSIM (2): Classic Example
U-Tube: Two tubes connected by pipe

Both have

Some initial fluid level
amtA: 0..AMAX..inf

amtB: 0..BMAX..inf
Pressure dependent upon level
pressureA = M+(amtA)
pressureB = M+(amtB)
QSIM (3)
Other constraints:

Total fluid only in A & B, and is constant:
amtA + amtB = total

constant(total)
Flow depends on pressure difference:
pAB = pressureA – pressureB
flowAB = M+(pAB)
d(amtB)/dt = flowAB
d(amtA)/dt = -flowAB

Knowledge of quantities:
total:
amtA:
pressureA:
pAB:
flowAB:

0..inf
0..AMAX..inf
0..inf
-inf..0..inf
-inf..0..inf
amtB:
0..BMAX..inf
pressureB: 0..inf
Correspondence between values
pressureA & amtA:
pressureB & amtB:
flowAB & pAB:
(0,0), (inf,inf)
(0,0), (inf,inf)
(-inf, -inf), (0,0), (inf,inf)
QSIM (4)
Note correspondence w/ordinary diffy eqn
d(amtB)/dt = f(g(amtA) – h(amtB))
f,g,h  M+
Qualitative state is dynamic
Imagine filling tank A (ignore tank B)
Need to represent amtA and its derivative:
amtA(t):
t 0:
(t0,t1):
t 1:
<0,inc>
<(0,AMAX),inc>
<AMAX,std>
QSIM (5): Predicting behavior
1. Give initial (at least some) conditions:
“Tank A full”
t = t0: amtA = <AMAX,?>
“Tank B empty”
t = t0: amtB = <0,?>
QSIM (6):
2. Correspondence propagation on init state
amtB = 0
therefore
pressureB = 0
amtA = AMAX
therefore
pressureA =
(0,inf)
amtA=AMAX && amtB = 0
therefore
total = (0,inf)
pressureA = (0,inf) and pressureB = 0
therefore
pAB = (0,inf)
constant(total)
therefore
d(total)/dt = std
pAB = (0,inf)
therefore
flowAB = (0,inf)
flowAB = (0,inf)
therefore
d(amtA)/dt = dec
d(amtB)/dt = inc
d(amtA)/dt = dec therefore d(pressureA)/dt = dec.
d(amtB)/dt = dec therefore d(pressureB)/dt = dec.
QSIM (7)
(Init state correspondence propagate, cont'd)
d(pressureA)/dt = dec && d(pressureB)/dt = inc
therefore
d(pAB)/dt = dec
Last propagation
d(pAB)/dt = dec
therefore
So we have at t = t0:
amtA
amtB
pAB
Total
d(flowAB)/dt = dec
= <AMAX,dec> pressureA = <(0,inf),dec>
= <0,inc>
pressureB = <0,inc>
= <(0,inf), dec> flowAB
= <(0,inf), dec>
= <(0,inf), std>
QSIM (8)
3. Predicting the next state:
What events can make the next state?
1. A variable can reach a limit or landmark
2. A variable may move off a landmark
3. A variable may start or stop moving
6 variables moving to limits/landmarks
(theoretically): 46 - 1 = 4095 possibilities
(actual): constraints & correspondences reduce choices
t = (t0,t1):
amtA
amtB
pAB
Total
= <(0,AMAX),dec>
= <(0,BMAX),inc>
= <(0,inf), dec>
= <(0,inf), std>
pressureA = <(0,inf),dec>
pressureB = <(0,inf),inc>
flowAB
= <(0,inf), dec>
QSIM (9)
What happens after?
t = (t0,t1):
amtA = <(0,AMAX),dec> pressureA = <(0,inf),dec>
amtB = <(0,BMAX),inc>
pressureB = <(0,inf),inc>
pAB
= <(0,inf), dec>
flowAB
= <(0,inf),
dec>
total
= <(0,inf), std>
Well:
amtA is decreasing toward 0 (won't get there)
amtB is increasing toward BMAX.
flowAB is decreasing toward 0.
So either:
1. flowAB gets to 0 AND amtB gets to BMAX, or
2. flowAB gets to 0 BEFORE amtB gets to BMAX, or
3. amtB gets to BMAX BEFORE flowAB gets to 0
(Tank B overflows)
QSIM (10)
1. flowAB gets to 0 AND amtB gets to BMAX:
t = t1(a):
amtA
amtB
pAB
Total
= <(0,AMAX),std>
= <BMAX,std>
= <0,std>
= <(0,inf), std>
pressureA = <(0,inf),std>
pressureB = <(0,inf),std>
flowAB
= <0,std>
3. amtB gets to BMAX BEFORE flowAB gets to 0
(Tank B overflows)
t = t1(c):
amtA
amtB
pAB
Total
= <(0,AMAX),dec>
= <BMAX,inc>
= <(0,inf),dec>
= <(0,inf),std>
pressureA = <(0,inf),dec>
pressureB = <(0,inf),inc>
flowAB
= <(0,inf),dec>
QSIM (11)
2. flowAB gets to 0 BEFORE amtB gets to BMAX
t = t1(b):
amtA
amtB
pAB
Total
= <(0,AMAX),std>
= <(0,BMAX),std>
= <0,std>
= <(0,inf), std>
pressureA = <(0,inf),std>
pressureB = <(0,inf),std>
flowAB
= <0,std>
amtB's level becomes a new landmark so:
amtA:
pressureA:
amtB:
pressureB:
pAB:
flowAB:
total:
0..a0..AMAX..inf
0..p0..inf
0..a1..BMAX..inf
0..p1..inf
-inf..0..inf
-inf..0..inf
0..to0..inf
QSIM (12)
Output: graph of state transitions:
QSIM (13)
QSIM algorithm:
1. Init queue w/QState(t0), complete its description
2. If queue empty or exceed resource limit then stop,
otherwise pop state S from queue
3. For each var vi in S get successors from table
4. Determine all successor states consistent w/restrictions

If S is time-interval, delete copies of itself
5. For each successor Si, assert:
successor(S,Si)
predecessor(Si,S)
6. Apply global filters on successors
Don't add inconsistent, quiescent, cycles, transition or t = inf
states
8. Go to 2
Next Time
Probabilistic reasoning

Graphical Models
Machine Learning and CSD
```