Extending Database Management Systems by Developing New Database Operators
Paul J. Wagner
University of Wisconsin – Eau Claire
Messages
Current relational query languages do not scale up well to support us in the development of complex queries on newer data domains
New relational database operators are needed to help us generate such queries
We can develop a framework for adding new operators by analyzing the shortcomings of our current operators
Definitions
Question – an English (or other natural language) statement of the desired data
Query – the statement of the problem in a relational query language
Operator – a module representing a single conceptual task to be carried out on relational data; can be primitive (e.g. filtering rows from a relation) or non-primitive (e.g. joining two tables, SQL select)
Background
Database world in the 1970s and 1980s
Set-oriented data (e.g. employees, bank accounts, airline schedules)
Relatively well understood
Relational set operators were sufficient
Database world in the 1990s and 2000s
More complex data (e.g. spatial and temporal data, protein sequences)
Not well understood
Relational set operators are insufficient
Relational Query Languages
SQL isn’t the only relational query language

SQL is a transform-oriented language that
implements a variety of atomic relational operations
Other query language options

Relational calculus (descriptive)
“state the defining characteristics of the result” – C.J. Date

Relational algebra (prescriptive)
State the process that gets you to the desired result
Operations: select, project, times (Cartesian product), join,
union, intersect, minus
Relational Algebra Operations
select (U) – filters rows

Note: RA select != SQL select
project (U) – filters columns
times (B) – all combinations of the rows in two relations (even if they don’t make sense)
join (B) – a macro, involving a sequence of:
1) times
2) select: keep the combinations that make sense and that meet the criteria of the particular question
3) (optionally) project: remove the duplicated key column and any other columns that aren’t part of the question
union (B), intersect (B), minus (B) – basic set operations
B = binary, U = unary
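A minimal Python sketch of the operations above, assuming relations are held as lists of dictionaries. The function names, the "C."/"A." column prefixes and the sample rows are illustrative choices, not part of any DBMS. It shows join built as a macro: times, then select, then project.

```python
from itertools import product

def select(rel, pred):
    """RA select: keep the rows that satisfy the predicate."""
    return [row for row in rel if pred(row)]

def project(rel, cols):
    """RA project: keep only the named columns and drop duplicate rows."""
    seen, out = set(), []
    for row in rel:
        key = tuple(row[c] for c in cols)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(cols, key)))
    return out

def times(left, right, lname, rname):
    """RA times: pair every left row with every right row.
    Columns are prefixed with a relation name so they stay distinct."""
    return [
        {**{f"{lname}.{k}": v for k, v in l.items()},
         **{f"{rname}.{k}": v for k, v in r.items()}}
        for l, r in product(left, right)
    ]

# Illustrative sample rows (assumed for this sketch).
creatures = [{"CID": 1, "Name": "Alice"}, {"CID": 2, "Name": "Bob"},
             {"CID": 3, "Name": "Carl"}]
achievements = [{"CID": 1, "SCode": "F"}, {"CID": 2, "SCode": "S"},
                {"CID": 2, "SCode": "F"}, {"CID": 3, "SCode": "S"}]

# join as a macro: times, then select, then project
pairs = times(creatures, achievements, "C", "A")               # every combination
sensible = select(pairs, lambda r: r["C.CID"] == r["A.CID"])   # pairs that make sense
floaters = select(sensible, lambda r: r["A.SCode"] == "F")     # criteria of the question
joined = project(floaters, ["C.CID", "C.Name"])                # drop the other columns
```

Keeping each operation as its own small function is what lets the intermediate results (`pairs`, `sensible`, `floaters`) be inspected separately.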
Times Example
Creatures
CID  Name
1    Alice
2    Bob
3    Carl
Achievements
CID  SCode
1    F
2    S
2    F
3    S
Creatures TIMES Achievements = Creature-Achievement Pairs
CID  Name   CID  SCode
1    Alice  1    F
1    Alice  2    S
1    Alice  2    F
1    Alice  3    S
2    Bob    1    F
2    Bob    2    S
2    Bob    2    F
2    Bob    3    S
3    Carl   1    F
3    Carl   2    S
3    Carl   2    F
3    Carl   3    S
Quiz, Question 1:
Given an Achievements table with columns CreatureID and SkillCode, what SQL statement retrieves each creature that floats? (assume F means “floats”)
Achievements
CreatureID  SkillCode
1           F
2           S
2           F
3           S
5           C
SELECT CreatureID FROM Achievements
WHERE SkillCode = 'F';
Quiz, Question 2:
Given an Achievements table with columns CreatureID and SkillCode, what SQL statement retrieves each creature that floats or swims (S means “swims”)?
Achievements
CreatureID  SkillCode
1           F
2           S
2           F
3           S
5           C
SELECT CreatureID FROM Achievements
WHERE SkillCode = 'F' OR SkillCode = 'S';
Quiz, Question 3:
Given an Achievements table with columns CreatureID and SkillCode, what SQL statement retrieves each creature that floats and swims?
Achievements
CreatureID  SkillCode
1           F
2           S
2           F
3           S
5           C
SELECT CreatureID FROM Achievements
WHERE SkillCode = 'F' AND SkillCode = 'S'; ???
Problems Emerge With SQL Select
What are the issues here?

A SQL where clause is evaluated against one row at a time
We ask questions about the entire data set
SQL is monolithic
One SQL statement can contain many atomic relational operations
E.g. the select statement for a join in SQL actually contains projection, times and selection in relational algebra
E.g. the SQL where clause contains meaningful join criteria as well as “business logic”
SQL starts to break down as the queries become more complex
It becomes harder both to write the syntax and to know that the results are correct
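The row-at-a-time issue can be seen by running the three quiz queries against the sample Achievements rows. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the helper name `ids` is an assumption made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Achievements (CreatureID INTEGER, SkillCode TEXT)")
conn.executemany(
    "INSERT INTO Achievements VALUES (?, ?)",
    [(1, "F"), (2, "S"), (2, "F"), (3, "S"), (5, "C")],
)

def ids(sql):
    """Run a query and return the first column of every row, sorted."""
    return sorted(row[0] for row in conn.execute(sql))

# Question 1: creatures that float.
q1 = ids("SELECT CreatureID FROM Achievements WHERE SkillCode = 'F'")

# Question 2: creatures that float or swim (creature 2 matches on two rows).
q2 = ids("SELECT CreatureID FROM Achievements "
         "WHERE SkillCode = 'F' OR SkillCode = 'S'")

# Question 3, naive attempt: no single row has both codes, so nothing matches.
q3 = ids("SELECT CreatureID FROM Achievements "
         "WHERE SkillCode = 'F' AND SkillCode = 'S'")
```

Here `q3` is empty even though creature 2 both floats and swims, because the condition is tested against each row separately.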
How Do We Answer Question 3?
A Few Possibilities:
Intersect the creatures that float with the creatures that swim:
SELECT CreatureID FROM Achievements
WHERE SkillCode = 'F'
INTERSECT
SELECT CreatureID FROM Achievements
WHERE SkillCode = 'S';
Join Achievements to itself:
SELECT A1.CreatureID
FROM Achievements A1, Achievements A2
WHERE A1.CreatureID = A2.CreatureID
AND A1.SkillCode = 'F'
AND A2.SkillCode = 'S';
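Both possibilities can be checked against the sample Achievements rows. This is a small sketch using Python's built-in sqlite3 module; the variable names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Achievements (CreatureID INTEGER, SkillCode TEXT)")
conn.executemany(
    "INSERT INTO Achievements VALUES (?, ?)",
    [(1, "F"), (2, "S"), (2, "F"), (3, "S"), (5, "C")],
)

intersect_sql = """
SELECT CreatureID FROM Achievements WHERE SkillCode = 'F'
INTERSECT
SELECT CreatureID FROM Achievements WHERE SkillCode = 'S'
"""

self_join_sql = """
SELECT A1.CreatureID
FROM Achievements A1, Achievements A2
WHERE A1.CreatureID = A2.CreatureID
  AND A1.SkillCode = 'F'
  AND A2.SkillCode = 'S'
"""

# Each query should name only the creature holding both skills.
by_intersect = [row[0] for row in conn.execute(intersect_sql)]
by_self_join = [row[0] for row in conn.execute(self_join_sql)]
```

Each approach needs one more INTERSECT branch or one more table alias for every additional skill.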
Quiz, Question 4:
Given an Achievements table with columns CreatureID and SkillCode and a table LifeguardSkills with a list of desired skills, what SQL statement retrieves each creature that has achieved all of the Lifeguard skills?
Achievements
CreatureID  SkillCode
1           F
2           S
2           F
3           S
5           C
….          ….
LifeguardSkills
SkillCode
F
S
R
….
Umm….. Can I leave now?
Problems
Prior techniques don’t scale up for an
arbitrarily large number of desired criteria


Don’t want to have to specify N intersect
operations
Don’t want to join N tables
Resulting queries have problems


Time consuming and repetitious to generate
Inefficient to execute
How Do We Answer Question 4?
Relational Algebra also has the (binary) Divide operator

Does what we want (divide Achievements by LifeguardSkills)
In SQL, we need to create a macro:
1) Find all possible creature/lifeguard-skill pairs (Creatures times LifeguardSkills)
Gives us the “if everyone was a lifeguard” achievements
Note: we need a separate Creatures relation
Why can’t we just generate Creatures by projecting CreatureID from Achievements?
2) Find the difference between step 1 and the Achievements relation
Gives us the “non-achieved Lifeguard achievements”
3) Project the CreatureID from step 2
Gives us the “creatures who haven’t achieved all LifeguardSkills”
4) Find the difference between Creatures and step 3
Gives us the “creatures who have achieved all LifeguardSkills”
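The four steps can be sketched with Python's built-in sqlite3 module, one statement per step. The intermediate table names (AllPairs, Missing, Lacking) are hypothetical, and the sample data assumes creatures 1 to 5 and only the skills F and S in LifeguardSkills.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Creatures (CreatureID INTEGER);
CREATE TABLE Achievements (CreatureID INTEGER, SkillCode TEXT);
CREATE TABLE LifeguardSkills (SkillCode TEXT);
INSERT INTO Creatures VALUES (1), (2), (3), (4), (5);
INSERT INTO Achievements VALUES (1,'F'), (2,'S'), (2,'F'), (3,'S'), (5,'C');
INSERT INTO LifeguardSkills VALUES ('F'), ('S');

-- Step 1: every creature paired with every lifeguard skill (times).
CREATE TABLE AllPairs AS
  SELECT c.CreatureID AS CreatureID, l.SkillCode AS SkillCode
  FROM Creatures c CROSS JOIN LifeguardSkills l;

-- Step 2: pairs that were never achieved (difference).
CREATE TABLE Missing AS
  SELECT CreatureID, SkillCode FROM AllPairs
  EXCEPT
  SELECT CreatureID, SkillCode FROM Achievements;

-- Step 3: creatures lacking at least one lifeguard skill (project).
CREATE TABLE Lacking AS
  SELECT DISTINCT CreatureID FROM Missing;
""")

# Step 4: creatures that lack nothing (difference).
qualified = [row[0] for row in conn.execute("""
  SELECT CreatureID FROM Creatures
  EXCEPT
  SELECT CreatureID FROM Lacking
""")]
```

With this sample data only creature 2 holds both F and S, so it is the only creature left after step 4.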
Question 4, Reflection
How many relations are needed for our SQL macro?
Creatures
CreatureID
1
2
3
4
….
Achievements
CreatureID  SkillCode
1           F
2           S
2           F
3           S
5           C
….          ….
LifeguardSkills
SkillCode
F
S
R
….
Question 4, Reflection (cont.)
Why are more relations needed for the macro
than for relational divide?


We’re providing for the possibility that there are some
creatures who have no achievements
Not really needed now, but later….
Is question 4 looking for creatures that have
exactly the LifeguardSkills or those with exactly
or more than those skills?

Are there any other possible associations we might
be interested in?
Quiz, Question 5:
Find each creature/job pair where the creature has achieved exactly or more than the skills for that job
Note: we’re generalizing the last question to match multiple jobs
Creatures
CreatureID
1
2
3
4
….
CreatureSkills
CreatureID  SCode
1           F
2           S
2           F
3           S
5           C
….          ….
Jobs
JobName
Lifeguard
Developer
Manager
Slacker
….
JobSkills
JobName    SCode
Lifeguard  F
Lifeguard  S
Developer  D
Developer  C
Manager    O
….         ….
How To Answer Question 5?
No operator in relational algebra

Possible as a complex macro, but many
operations
Hundreds of lines of SQL code
Let’s think about this question some
more…
Matching Relations
We need four relations to answer this question
Target (e.g. Creatures) – the ‘candidate’ relation
Target-Detail (e.g. Creature-Skills, or Achievements) – combination of candidate plus achieved detail
Pattern-Detail (e.g. Job-Skills) – combination of what target is matched against plus detail
Pattern (e.g. Jobs) – what the target is matched against
Possible Set Associations
(DEMONS-ZA)
Exactly: target detail (TD) is the same as pattern detail (PD) for a given target/pattern pair
More than: TD is a proper superset of a non-empty PD (all of PD plus extra detail)
Different than: TD and PD share no detail, but each has detail
Overlapping: TD and PD share some detail, and each also has detail the other lacks
None: TD empty, PD has detail
Some: TD is a non-empty proper subset of PD
Zero: TD empty, PD empty
Any: TD has detail, PD empty
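The eight associations can be written as a small classifier over two sets. This sketch assumes the associations are mutually exclusive, so "More than" and "Some" are read as proper superset and proper subset. The function name `classify` is illustrative.

```python
def classify(td, pd):
    """Return the DEMONS-ZA letter for target detail td and pattern detail pd."""
    if not td:
        return "N" if pd else "Z"     # None / Zero: the target has no detail
    if not pd:
        return "A"                    # Any: only the target has detail
    if td == pd:
        return "E"                    # Exactly
    if td > pd:
        return "M"                    # More than: proper superset
    if td < pd:
        return "S"                    # Some: proper subset
    return "O" if td & pd else "D"    # Overlapping / Different than

lifeguard = {"F", "S"}
letters = [
    classify({"F", "S"}, lifeguard),        # same skills
    classify({"F", "S", "C"}, lifeguard),   # all of them plus one more
    classify({"F"}, lifeguard),             # only part of them
    classify({"F", "C"}, lifeguard),        # some shared, some not
    classify({"C"}, lifeguard),             # nothing shared
    classify(set(), lifeguard),             # target has no skills
    classify({"C"}, set()),                 # pattern has no skills
    classify(set(), set()),                 # neither has skills
]
```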
Combinations of Set Associations
There are 2^8 − 1 = 255 possible non-empty combinations of the DEMONS-ZA set associations
All are potentially interesting
Some that commonly arise:
EM = at least that many (universal quantifier)
SOME = at least one (existential quantifier)
NZ = none (no TD values)
ESZ = in (the TD set has no values that are not in the PD set)
Set HAS Operator
Developed by John Carlis, University of Minnesota, late 1980s
HAS <Qual.1> <Qual.2> T-Rel TD-Rel PD-Rel P-Rel
E.g. HAS ‘EMZA’ Creatures Creature-Skills Job-Skills Jobs
Qualifier1: a DEMONS-ZA string, or a counts expression, which includes a relational expression involving one or more of 3 counts:
TD values in PD (qualifying count)
TD values not in PD (non-qualifying count)
PD values not in TD (missing count)
NOTE: can derive each DEMONS-ZA letter from the 3 counts
Qualifier2: ‘Exact’ matching of TD values to PD values (default), or ‘Range’ matching where PD is specified as a range of values
Originally implemented in LISP, Scheme
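A rough Python sketch of the idea, not the original LISP or Scheme implementation: it derives a DEMONS-ZA letter from the three counts and returns the target/pattern pairs whose letter appears in the qualifier string. Only a DEMONS-ZA string qualifier with ‘Exact’ matching is covered. The function names and sample rows are assumptions, including a Slacker job with no required skills.

```python
def counts(td, pd):
    """Return (qualifying, non_qualifying, missing) counts for two detail sets."""
    return len(td & pd), len(td - pd), len(pd - td)

def letter(qualifying, non_qualifying, missing):
    """Derive the single DEMONS-ZA letter implied by the three counts."""
    if qualifying == 0:
        if non_qualifying == 0:
            return "Z" if missing == 0 else "N"
        return "A" if missing == 0 else "D"
    if non_qualifying == 0:
        return "E" if missing == 0 else "S"
    return "M" if missing == 0 else "O"

def has(qualifier, targets, target_detail, pattern_detail, patterns):
    """Return (target, pattern) pairs whose association letter is in qualifier."""
    td = {t: set() for t in targets}
    for t, d in target_detail:
        td.setdefault(t, set()).add(d)
    pd = {p: set() for p in patterns}
    for p, d in pattern_detail:
        pd.setdefault(p, set()).add(d)
    return [(t, p) for t in targets for p in patterns
            if letter(*counts(td[t], pd[p])) in qualifier]

# Illustrative sample relations (assumed for this sketch).
creatures = [1, 2, 3, 4]
creature_skills = [(1, "F"), (2, "S"), (2, "F"), (3, "S")]
job_skills = [("Lifeguard", "F"), ("Lifeguard", "S"),
              ("Developer", "D"), ("Developer", "C")]
jobs = ["Lifeguard", "Developer", "Slacker"]

at_least = has("EM", creatures, creature_skills, job_skills, jobs)
with_empty_patterns = has("EMZA", creatures, creature_skills, job_skills, jobs)
```

With ‘EM’ only creature 2 qualifies as a Lifeguard. Adding Z and A also admits every creature for a job that requires no skills.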
MATCH Operator
Developed by Jim Held, University of Minnesota,
~1990
Set HAS plus

Supports multiple detail properties
E.g. could match on skills, traits, and age

Supports hierarchical structure for patterns
E.g. a job could have multiple sub-jobs

Supports hierarchical structure for criteria
E.g. a skill could have sub-skills
Demonstrated usefulness with medical expert
systems
Implemented in LISP
Bag HAS Operator
Developed by Paul Wagner, ~2000
Set HAS extended to support bags (multisets) of skills
Extended Set HAS in a different direction than MATCH
Need 5 relations (Target, Target-Detail, Detail, Pattern-Detail, Pattern)
Target-Detail relation extended to contain count of detail present
Pattern-Detail relation extended to contain count of detail required
Developed another qualifier to represent possible combinations of counts
Demonstrated usefulness in several domains



Academic records
Sports event qualification
Limited protein sequence matching
Implemented in relational algebra (built on top of Oracle)
and PL/SQL
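To illustrate what counts of detail add, the sketch below uses Python's collections.Counter as a bag and computes multiset versions of the qualifying, non-qualifying and missing counts. This is one plausible reading for illustration only, not the qualifier developed for Bag HAS. The academic-records data and the names `bag_counts` and `meets` are hypothetical.

```python
from collections import Counter

def bag_counts(present, required):
    """Multiset versions of the qualifying, non-qualifying and missing counts."""
    qualifying = sum((present & required).values())      # present and required
    non_qualifying = sum((present - required).values())  # present beyond what is required
    missing = sum((required - present).values())         # required but not present
    return qualifying, non_qualifying, missing

def meets(present, required):
    """True when every required detail is present at least the required number of times."""
    return bag_counts(present, required)[2] == 0

# Hypothetical example: courses completed per subject vs. courses required.
degree = Counter({"math": 2, "science": 2})
student_a = Counter({"math": 3, "science": 1})
student_b = Counter({"math": 2, "science": 3})

a_counts = bag_counts(student_a, degree)
a_ok = meets(student_a, degree)
b_ok = meets(student_b, degree)
```

Student A has taken science, so a set-based check would accept it, but the bag check reports one science course still missing.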
What’s Next?
Sequence HAS

Bag HAS extended with positional matching
Sequence MATCH

MATCH extended with positional matching
Generalized Sequence HAS/MATCH
Usefulness

Support DBMS implementations of many currently external sequence matching tools
E.g. BLAST, FASTA for protein sequences
Other types of sequences (temporal, positional)
Issues/Alternatives For This Approach
Issues

Operators themselves become more complex to use
More relations
More qualifiers

How far to extend languages like SQL?
Current extensions support objects, procedural functionality
Alternatives

Packages based on data types
Contain support for types, operations on those types
Issue – only support that type, not generalized matching
Conclusions (Messages Revisited)
Current relational query languages do not
support the development of complex queries on
newer data domains
New relational database operators are needed
to help us generate such queries
We have developed a framework for adding new
operators by analyzing the shortcomings of our
current operators, and are using it to develop
new database operators that can help meet
today’s data-driven software development needs