Subject Mediation for Integrated Access to
Heterogeneous Information Sources
ADBIS’2001
L. A. Kalinichenko
Institute of Informatics Problems
Russian Academy of Sciences
Laboratory for compositional
information systems development
Various forms of compositions are studied, e.g.:
• Interoperable compositions of pre-existing components for IS design;
• Compositions of heterogeneous information collections;
• Workflow compositions;
• Type compositions in database operations over object collections;
• Heterogeneous mediators compositions.
Web site of the group: http://www.ipi.ac.ru/synthesis/
Talk outline
• Subject Domain Mediation
• Mediators’ Projects: Brief Overview
• Query Planning Methods
• Infrastructure of the mediator aiming at semantic interoperability of collections
• Summary
Subject Domain Mediation
Outline:
• Objectives of information integration
• The mediator’s concept
• Mediator’s classes
• Consolidation of a mediator
• Advantages of the subject domain mediation approach
Web Search Engines
1 billion Web pages.
Search engines remain the main mechanism for accessing pages, through keyword queries.
There are dozens of general-purpose search engines and thousands of specialized engines
(regional, thematic, corporate).
The following kinds of general purpose Web search engines can be distinguished:
• basic engines: AltaVista, HotBot, Infoseek, Lycos, WebCrawler, Yahoo, Rambler,
Яndex, etc.
• portals: Skworm, Proteus, Instantseek, etc.
• metasearch engines: SavvySearch, Inference Find, ProFusion, etc.
• metasearch utilities: Copernic, BeeLine, SearchPad, etc.
“Metasearch” engines send a request to several search engines and compose a
combined response. It is assumed that such a response is more likely to contain relevant
information.
Precision of search is very low (uncontrolled use of terms for indexing and search). This is
the unavoidable price of simple “registration” of home pages for the whole Web.
What is the required level of information
integration/dissemination
• Just putting information on the Web (creating a homepage, a Web site)
• Inserting a description of a resource into a suitable Digital Library (e.g., into NCSTRL,
the Networked Computer Science Technical Report Library, a collection of institutional
and archival CS research reports and papers)
• Using subject gateways for easier access to networked information resources in a
defined subject area. Subject gateways work as intermediaries
• Applying a community-oriented digital library (a collection of documents built by a
community of users which aims at observing or studying a phenomenon (e.g., in a context
of a certain area)).
• Using heterogeneous multidatabase systems.
• Applying subject mediators to support representation and access to various subject
domains. Mediators should provide modelling facilities and methods for conversion of
unorganized, nonsystematic population of collections registered by different
collection providers into a well-structured set of sources supported by the integrated
uniform specifications. Metainformation. Systematic registration of collections.
What and for what is to be integrated
What kind of information is to be supported
structured, object, semi-structured, textual, multimedia
What kind of metainformation is needed
thesauri, classifiers, vocabularies, ontologies, schema definitions (data, objects, functions,
workflows)
What to disseminate:
1. A document (paper) as a whole using additional document description
2. A document in XML
3. Content of a document
What to retrieve
1. To discover individual resources (Web pages, documents, papers)
2. To retrieve information relevant to a specific query contained in a collection of
resources
3. To retrieve information as workflows, methods and/or data and use various
compositions of those
4. To provide for interoperability of the information sources in process of problem solving:
-technical level of interoperability
-semantic interoperability
Digital repositories of knowledge
Digital repositories of knowledge in certain areas can be implemented, like:
Digital Earth, Digital Sky, Digital Bio, Digital Law, Digital Art, Digital
Music. Examples of Microsoft TerraServer, Multi-Terabyte Astronomy
Archives are widely known.
An example: DigiTerra (an Environmental Digital Library, Rutgers). Its
objective is to support continuous land monitoring, fire detection, water and air
quality testing, and urban planning, as well as research and instructional
activities in related areas of science. The vast array of environmental data collected
in DigiTerra should include images from a variety of satellites,
ground data from continuously monitoring weather stations, and maps, reports and data
sets from federal, state and local government agencies, and should serve a diverse user
community.
The Mediator Concept
The mediator architecture (Wiederhold, 1992) deals with the problem of integration
of heterogeneous information. The sources are "heterogeneous" on many levels:
• data model and types of data used;
• the underlying data units (salaries could be stored on a per-hour or per-month
basis);
• behavior of objects involved;
• the underlying concepts. A payroll database may not regard a retiree as an
employee, while the benefits department does. Conversely, the payroll
department may include consultants in the concept "employee" while the
benefits department does not;
• the schema: the schema to which the information conforms may not be fixed in advance.
Examples of such "semi-structured" information include that found in XML
documents, repositories used in the Human Genome Project, and Lotus Notes.
A mediator provides a uniform query interface to the multiple data sources,
thereby freeing the user from having to locate the relevant sources, query each one
in isolation, and manually combine the information from the different sources.
Mediation approaches
• integration of information from pre-selected sources according to predefined
information needs. A procedural approach (TSIMMIS, Squirrel, WHIPS) integrates
information from sources through ad-hoc procedures. When
information needs or sources change, a new mediator has to be generated. This
is known as the Global as View (GAV) approach.
• integration of information from arbitrary sources according to predefined
information needs. A declarative approach (Carnot, SIMS, Information
Manifold, Infomaster) is used: mediators contain mechanisms to rewrite queries
according to source descriptions. A rewritten query should be contained in the
original query. This is known as the Local as View (LAV) approach.
• combined LAV and GAV approaches (GLAV)
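The difference between the two approaches can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two toy sources, the global relation painting(title, painter, repository) and the function names are hypothetical and are not taken from any of the systems named above. In the GAV part the global relation is a fixed view over named sources; in the LAV part each source carries its own description in terms of the global relation, and the mediator selects the sources whose description is consistent with the query.

```python
from typing import Dict, Set, Tuple

Row = Tuple[str, str]  # (title, painter)

# Two hypothetical sources with toy content.
SOURCES: Dict[str, Set[Row]] = {
    "louvre": {("Mona Lisa", "Leonardo da Vinci")},
    "uffizi": {("Primavera", "Sandro Botticelli")},
}


def painting_gav() -> Set[Tuple[str, str, str]]:
    """GAV: the global relation painting(title, painter, repository)
    is defined as a fixed view over the pre-selected sources."""
    return ({(t, p, "Louvre") for t, p in SOURCES["louvre"]}
            | {(t, p, "Uffizi") for t, p in SOURCES["uffizi"]})


# LAV: each source is described as a view over the global relation:
#   source(t, p) is contained in painting(t, p, r) & r = <constant>
LAV_DESCRIPTIONS: Dict[str, str] = {"louvre": "Louvre", "uffizi": "Uffizi"}


def answer_lav(repository: str) -> Set[Row]:
    """Answer painting(t, p, repository) using only the sources whose
    description is consistent with the constraint of the query."""
    result: Set[Row] = set()
    for source, described_repository in LAV_DESCRIPTIONS.items():
        if described_repository == repository:
            result |= SOURCES[source]
    return result


uffizi_rows = answer_lav("Uffizi")
all_rows = painting_gav()
```

In this sketch, adding a source under LAV means adding one entry to LAV_DESCRIPTIONS, while under GAV the body of painting_gav has to be edited.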
Mediator Definition as Subject Metainformation
Consolidation
For the mediator's scalability, two separate phases of the mediator's functioning are
distinguished: the consolidation phase and the operational phase.
In the consolidation phase the efforts of the scientific community are focused on
defining the mediator's subject by declaring its metainformation. It is assumed that
leading researchers are involved in this process. The metainformation defined in the
consolidation phase is assumed to be stable for a certain period of time, during which it can
only be extended. Well-known, representative collections of information in the subject
domain are used during the process of metainformation definition. The metainformation
created in the consolidation phase constitutes the federated level of the mediator.
During the operational phase arbitrary information collections can be registered at the
mediator, expressed in terms of the federated level. Registration is
autonomous and can be done by collection providers independently of each other. Users of
the mediator know only the metainformation defining the mediator’s subject and formulate
their queries in terms of that subject. For each query the mediator decides which
registered collections are relevant.
Subject Mediator. Cultural Heritage Collections.
Federated Level Metainformation
[Figure: type diagram of the federated-level metainformation.
Types: Heritage_Entity, Painting, Sculpture, Antiquities, Person, Creator, Collector, Owner, Repository, Museum, Gallery, Exhibition.
Attributes shown: created_by*, date*, narrative*, identifier*, relation*, …, place_of_origin, history_period, content, origin_history, in_collection, owned_by, digital_form, ...
Type Text with operations: contains, near, within, follows, …
Thesauri: Cultural Heritage, History, Jurisdiction]
Subject Mediator. Cultural Heritage Collections.
Collections Registration
Federated Level Metainformation
Local into Federated Level Mapping
[Figure: local schemas of the registered collections and their mapping into the federated level.
Collections: CIMI Profile of Z39.50, Louvre Museum Web Site, Uffizi Museum Web Site.
Local element names shown: museum_object, created_by*, date_collected*, description*, object_id*, relation*, …, content_general, collection, mrObject, creator_c, nationality, works, department, name, description, sections, author, name, nationality, works, artist, name, biography, paint_list, canvas, title, painter, date, history, description, to_image]
Local Views in Terms of Federated Classes
creator_c(c/Creator_Creator_Info[name, nationality, date_of_birth, date_of_death,
    works/{set_of: Heritage_Entity_Museum_Object}]) ⊆
  creator(c[name, nationality, date_of_birth, date_of_death, works])

author(a/Creator_Author[name/fname, nationality,
    works/{set_of: Heritage_Entity_Work}]) ⊆
  creator(name, nationality, works(w)) & ∃ c, s (
    repository(c/Collection[contains(s/Section)]) &
    repository.name = ‘Louvre’ & in(w, s.contains))

artist(a/Creator_Artist[name, nationality, general_info/Text_Textual,
    works/{set_of: Painting_Canvas}]) ⊆
  creator(a[name, nationality, general_info, works]) &
  repository(n/name, collection) & n = ‘Uffizi’ &
  ∃ col/Collection (¬ isempty(intersect(collection(col/Collection).contains, works)))
Subject Mediator. Cultural Heritage Collections.
Query Planning
User query:
Find digital images of Italian Renaissance paintings
containing a drawing of Madonna with a child
{i/Image | ∃ p/Painting, d/Digital_Entity, re/Rendition (creator(nationality,
works(p.digital_form(d).rendition(re).resource(i/Image))) & nationality = ‘Italy’ &
p.content.contains(‘Madonna with a child’) & p.history_period = ‘Renaissance’)}

Mediator Query Planner (the thesaurus extension may add ‘Virgin Mary’, ‘Mother of God’):

CIMI:
{i/Image | ∃ o/Heritage_Entity_Museum_Object, d/Digital_Object, re/Rendition
(creator_c(nationality, works(o)) & nationality = ‘Italy’ &
o.history_period = ‘Renaissance’ &
o.content.contains(‘Madonna with a child OR …’) &
in(i, o.digital_object(d).rendition(re).resource))}

Louvre:
{i/Image | ∃ w/Heritage_Entity_Work
(author(nationality, works(w)) & nationality = ‘Italy’ &
w.history_period = ‘Renaissance’ &
w.description.contains(‘Madonna with a child OR …’) & in(i, w.to_image))}

Uffizi:
{i/Image | ∃ r/Collection_Room, p/Painting_Canvas
(artist(nationality, paint_list(p), room_list(r)) & in(p, r.paint_list) &
nationality = ‘Italy’ & p.history_period = ‘Renaissance’ &
p.description.contains(‘Madonna with a child OR …’) & in(i, p.to_image))}
Advantages of subject domain mediation
1. Subject mediation makes it possible to achieve semantic integration of heterogeneous
information collections.
2. Users need to know only the subject definitions, which contain concepts, structures and
methods as defined by the community.
3. Information providers can disseminate their information for integration independently
of each other and at any time. To disseminate it they register their information at
the subject mediator. Users need not know anything about the registration activity.
4. The contexts of autonomous information collections, the data models and languages they use,
and their implementation platforms are completely independent of the mediator and its
consolidated metainformation definitions.
5. By querying the subject definitions, users have integrated access to all information
registered at the mediator up to the moment of the query.
6. Mediators form a recursive structure: each mediator can be registered at another
mediator. Thus, multiple subjects can be semantically integrated by defining mediators of
a higher level.
7. Personalization, providing convenient views for specific groups of users, can be built
on top of the subject definitions. This process is independent of the existing collections and
their registration.
Disadvantages of subject mediation
1. Providing a subject definition requires that the scientific community has reached a
proper level of maturity and organization (e.g., are the research and
development groups in the area sufficiently open, collaborative and motivated?). Subject
consolidation is a collective, organized effort of the community.
2. The process of registration is not easy and requires specific supporting tools.
Mediator’s Recursion
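[Figure: collections register at a mediator, which receives queries and returns data from the collections; the mediator can in turn register at another mediator (as a collection), which receives queries and returns data from the registered mediator.]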
Mediators’ Projects: Brief Overview
Outline:
• TSIMMIS (Stanford)
• Information Manifold (Univ. of Washington)
• GARLIC (IBM)
• InfoSleuth (MCC)
• XML as a middleware model
TSIMMIS (The Stanford-IBM Manager of
Multiple Information Sources)
In TSIMMIS mediators are built above a GIVEN set of sources with wrappers that export
OEM self-describing objects.
OEM (Object Exchange Model) is used as a unifying data model. The mediators
considered provide integrated OEM views of the underlying information (e.g., if a
relational source is considered, it is exported as a set of OEM objects.)
Mediators are specified in MSL (Mediator Specification Language), a logic-based,
object-oriented view definition language targeted to OEM.
Variables in MSL may refer only to existing sets. In the absence of negation MSL can be
viewed as a variant of Datalog. A query consists of rules that use <object-id label value>
patterns. To describe a mediator in MSL, one gives logical rules that define the OEM
objects that the mediator makes available in a view.
Wrappers are specified in WSL, an extension of MSL that allows the description
of source contents and querying capabilities.
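The idea of self-describing <object-id label value> objects can be sketched in Python as follows. This is only an illustration under stated assumptions: the class OEM, the helper find and the sample identifiers (&r, &p1, ...) are hypothetical and do not reproduce TSIMMIS syntax or MSL semantics. The sketch shows one relational row exported as a nested object and a simple search by label.

```python
from dataclasses import dataclass
from typing import Iterator, List, Union


@dataclass
class OEM:
    """A self-describing object: <object-id, label, value>."""
    oid: str
    label: str
    value: Union[str, int, List["OEM"]]  # atomic value or set of sub-objects


def find(obj: OEM, label: str) -> Iterator[OEM]:
    """Yield every object with the given label at or below obj."""
    if obj.label == label:
        yield obj
    if isinstance(obj.value, list):
        for child in obj.value:
            yield from find(child, label)


# A relational row (name='Botticelli', nationality='Italy') exported
# by a hypothetical wrapper as a nested OEM object.
row = OEM("&p1", "painter", [
    OEM("&n1", "name", "Botticelli"),
    OEM("&c1", "nationality", "Italy"),
])
root = OEM("&r", "painters", [row])

names = [o.value for o in find(root, "name")]
```

Because every object carries its own label, a consumer needs no external schema to interpret the exported data.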
Information Manifold
In the Information Manifold a reasoning phase is required for realizing which sources
have the data of interest, unlike TSIMMIS where view expansion is all that is needed for
finding what data each source must contribute.
The user interacts with a uniform interface in the form of a set of global relations (the
mediated schema) used in formulating queries. The actual data is stored in external
source relations. To answer queries, a mapping between the relations in the mediated
schema and the source relations must be specified. A method to specify these mappings is
to describe each source relation as the result of a conjunctive query (i.e., a single Horn
rule) over the relations in the mediated schema.
Given a user query formulated in terms of the relations in the mediated schema, the
system must translate it into a query that mentions only the source relations and is a
maximally-contained plan. A maximally-contained rather than an equivalent plan is sought
because the collection of available data sources may not contain all the
information needed to answer a query.
The Information Manifold provides uniform access to structured information sources on
the WWW.
Source Query Capabilities Representations in
Mediation Frameworks
Sources express their capabilities in mediation systems through a variety of
mechanisms: query templates, capability records, and simple capability-description grammars.
Data sources with different and limited query capabilities are usually
accessed either by writing rich functional wrappers for the more
primitive sources, or by dealing with all sources at a ''lowest common
denominator''. In another approach, a mediator ensures that sources
receive only queries they can handle, while still taking advantage of all the query
power of each source.
Wrappers reflect the actual query capabilities of the underlying data sources,
while the mediator has a general mechanism for interpreting those capabilities
and forming execution strategies for queries. Capabilities-Based Rewriters
(CBR) are basic mechanisms of the mediators to develop a plan for a query
taking into account capabilities of the sources.
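One aspect of capability-aware planning, ordering sub-queries so that every source receives the input bindings it requires, can be sketched as below. This is an illustrative simplification, not the CBR algorithm of any particular system: the capability record format (source name, required inputs, outputs), the function order_subqueries and the sample source names are all assumptions of this sketch.

```python
from typing import List, Optional, Set, Tuple

# (source name, attributes that must be bound on input, attributes returned)
Subquery = Tuple[str, Set[str], Set[str]]


def order_subqueries(subqueries: List[Subquery],
                     bound: Set[str]) -> Optional[List[str]]:
    """Greedily pick sub-queries whose required inputs are already bound.

    Returns an executable order of source names, or None when some
    source can never be given the bindings it requires.
    """
    bound = set(bound)            # do not modify the caller's set
    pending = list(subqueries)
    order: List[str] = []
    while pending:
        ready = next((s for s in pending if s[1] <= bound), None)
        if ready is None:
            return None           # no remaining source can be called
        pending.remove(ready)
        order.append(ready[0])
        bound |= ready[2]         # outputs become available bindings
    return order


subqueries: List[Subquery] = [
    ("uffizi_works", {"painter"}, {"title"}),
    ("cimi_creators", {"nationality"}, {"painter"}),
]
plan = order_subqueries(subqueries, {"nationality"})
no_plan = order_subqueries(subqueries, set())
```

Here the hypothetical source uffizi_works can only be queried once a painter is known, so the planner calls cimi_creators first.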
The GARLIC Approach (IBM Almaden)
Heterogeneous and multimedia information systems are the main objectives.
Multimedia systems typically support only specific data types, for example document retrieval
through various forms of text indexing and search, spatial searches in GIS, and image
processing (QBIC, Photobook). One well-known solution is Illustra's DataBlades for
different data types.
Garlic differs in that there is no intention to store everything in one repository; the focus is on distribution, heterogeneity and integration of heterogeneous sources.
The concept of interface conformance (interface in the sense of ODMG-93) leads to an
interface lattice based on subtyping.
Garlic exploits a specific wrapper technology based on source capability specification.
Source capabilities are coded by the programmer within the corresponding wrapper.
They remain unknown to the optimizer.
InfoSleuth: semantic integration of information
in open and dynamic environments
InfoSleuth integrates several technological developments to support mediated
interoperation of data and services over information networks:
• Agent Technology. Specialized agents that represent the users, the
information resources, and the system itself cooperate to address the
requirements of the users. The resulting decentralization of capabilities is
the key to system scalability and extensibility.
• Domain models (ontologies). Give a concise, uniform and declarative
description of semantic information independent of the underlying models.
• Information Brokerage. Specialized information agents match information
needs (specified in terms of some ontology) with currently available
sources. So requests can be routed to the relevant sources.
• Internet computing. Java and Java Applets enable deployment of agents at
any source of information regardless of its location or platform.
YAT: XML as a middleware model
YAT offers an XML-oriented algebra with optimization properties, combined with a definition
of source query capabilities, wrapping of more structured query languages (e.g., OQL), and a new
optimization technique for XML-based integration systems.
Other semistructured/XML systems are TSIMMIS (query templates are used to describe
source capabilities) and MIX. However, defining all possible queries according to a
schema is not feasible with such templates.
YAT operational model and algebra. XML data (like objects) can be arbitrarily nested. A
technique similar to that of OO algebras is adopted. To an arbitrary XML structure an operator Bind is
applied, whose function is to extract relevant information and produce a Tab structure
(comparable to a non-1NF relation). To these Tab structures classical operators like Join,
Select, Project, etc. can be applied.
Bind operator: takes an input tree and a given filter (a tree with distinct variables). It produces a table that
contains the variable bindings resulting from the pattern matching. It is expensive to
evaluate, but it can be rewritten into simpler operations.
Tree operator: applied to Tab structures; returns a collection of trees conforming to
some input pattern.
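The behaviour of a Bind-like operator can be sketched in Python. The sketch is a toy under explicit assumptions: trees are modelled as nested dicts and lists rather than XML, filter variables are strings starting with "$", and the function bind is a hypothetical name; it does not reproduce the YAT algebra or its optimizations. It matches a filter against a nested structure and returns a table (list of rows) of variable bindings.

```python
from itertools import product
from typing import Any, Dict, List


def bind(tree: Any, flt: Any) -> List[Dict[str, Any]]:
    """Match filter flt against tree; return one row per successful match."""
    # A variable leaf binds to whatever subtree is found at this position.
    if isinstance(flt, str) and flt.startswith("$"):
        return [{flt: tree}]
    # Repeated children: match the filter against every element.
    if isinstance(tree, list):
        rows: List[Dict[str, Any]] = []
        for item in tree:
            rows.extend(bind(item, flt))
        return rows
    if isinstance(flt, dict):
        if not isinstance(tree, dict):
            return []
        per_key = []
        for key, sub_filter in flt.items():
            if key not in tree:
                return []
            per_key.append(bind(tree[key], sub_filter))
        # Combine the bindings of sibling sub-filters (cartesian product).
        rows = []
        for combination in product(*per_key):
            row: Dict[str, Any] = {}
            for part in combination:
                row.update(part)
            rows.append(row)
        return rows
    # A constant leaf has to match exactly and binds nothing.
    return [{}] if tree == flt else []


doc = {"artist": [
    {"name": "Botticelli",
     "works": [{"title": "Primavera"}, {"title": "Birth of Venus"}]},
    {"name": "Giotto", "works": [{"title": "Ognissanti Madonna"}]},
]}
flt = {"artist": {"name": "$n", "works": {"title": "$t"}}}
table = bind(doc, flt)
```

The resulting table has one row per (artist, work) pair, so classical operators such as selection or join can then be applied to it.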
Query Planning Methods for Mediators of
Heterogeneous Information Sources
Outline:
• Query Planning for LAV approach
• Query Containment Techniques
• Wrapper generation
Representation of Information Sources
Formally, the contents of an information source are described by a pair (or set of
pairs) of the form (v, rv), where v is a class name with mv state attributes and rv
is a formula of the form:
rv = ∃U p1(U1) & … & pn(Un)
The formula rv has mv distinguished variables. The pi's are any of the classes on
the federated level. The class name v is a new name describing an information
source. This means that the source can be asked a query of the form v(Z) (or any
partial instantiation of it), and returns instances with mv state attributes that
satisfy the following implication:
∀Z (v(Z) => rv(Z))
Simplified source capability model (input bindings, output, selections):
R1(Y1, ... , Yk):- R(X1, ... , Xm), 1 = a1, ... , n = an, = Y1, ... , k = Yk, 1, ... , h
Sound and Relevant Query Plans
A simplified query Q to the mediator can be represented as a conjunction:
Q(Y) : ∃X p1(X1) & … & pn(Xn);
X, X1, … , Xn are tuples of variables or constants and the pi's are any of the classes on
the federated level. The answer to the query is the set of bindings that can be obtained for
the variables in Y.
Given a query of the form above, the query processor generates a set of conjunctive plans
for answering Q(Y) as formulae of the form:
Q(Y) : ∃U v1(U1) & … & vk(Uk) & Cp
where each of the vi's is a class name associated with an information source, and Cp is a
conjunction of atoms of order relations. Note that the distinguished variables in the plan
are the same as the ones in the query. Given a conjunctive plan P, the descriptions of the
information sources imply that the following constraints hold on the answers it produces
(recall that rvi is the formula describing the constraints on the instances found in vi):
ConP : rv1(U1) & … & rvk(Uk) & Cp
Definition: A conjunctive plan P is sound if all the answers it produces are
guaranteed to be answers to the query, i.e., if the following entailment holds:
∀Y (ConP) => ∃X p1(X1) & … & pn(Xn)
Several conjunctive plans to answer a query are required because the information
sources are not complete.
Definition: A conjunctive plan P is relevant to a query Q(Y) : ∃X p1(X1) & … &
pn(Xn) if the sentence ∃Y,X (ConP & p1(X1) & … & pn(Xn)) is satisfiable.
Plan Generation
First step: separately for each subgoal in the query, compute which information sources
are relevant to it and collect such sources into the respective bucket. An information source
is relevant to a subgoal g if the description of the source contains a subgoal g1 that can be
unified with g such that, after the unification, the constraints in the query and the
constraints in the source description are mutually satisfiable.
‘Satisfiable’ means that the conjunction of built-in atoms should be satisfiable and there
are no two subgoals C(x) and D(x) where C and D are disjoint classes. ‘Mutually
satisfiable’ means that if C(Q) and C(U) are the conjunction of constraint subgoals in
query and source, then C(Q) & C(U) should be satisfiable.
Second step: candidate conjunctive plans are constructed by choosing one relevant source
for every subgoal in the query, and each plan is checked for soundness and relevance.
Specifically, every conjunctive plan Q1 of the form
Q1(Y) : (∃U) v1(U1) & … & vn(Un)
is considered, where vi(Ui) has been deemed relevant to subgoal pi in the query. Each such conjunctive
plan is checked to be (1) relevant, (2) sound (if it is not a sound plan, it is
checked whether it can be made sound by adding conjuncts of order predicates), and (3)
minimal (i.e., we cannot remove a subgoal from the plan and still obtain a sound plan).
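The two steps can be sketched in Python under strong simplifying assumptions: a source is treated as relevant to a subgoal when its description merely mentions the same federated class, so unification, satisfiability of constraints, and the soundness and minimality checks are all omitted. The source names, the class names and the functions buckets and candidate_plans are hypothetical.

```python
from itertools import product
from typing import Dict, List, Set, Tuple

# Query subgoals, given as federated class names.
query: List[str] = ["creator", "painting"]

# Hypothetical source descriptions: the federated classes mentioned
# in the formula rv of each source.
sources: Dict[str, Set[str]] = {
    "cimi": {"creator", "painting"},
    "louvre_authors": {"creator"},
    "uffizi_canvases": {"painting"},
}


def buckets(query: List[str],
            sources: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Step 1: for each subgoal, collect the sources relevant to it."""
    return {goal: sorted(name for name, classes in sources.items()
                         if goal in classes)
            for goal in query}


def candidate_plans(query: List[str],
                    sources: Dict[str, Set[str]]) -> List[Tuple[str, ...]]:
    """Step 2: choose one source per subgoal; every combination is a
    candidate plan that still has to be checked for soundness."""
    filled = buckets(query, sources)
    return [tuple(choice)
            for choice in product(*(filled[goal] for goal in query))]


plans = candidate_plans(query, sources)
```

The number of candidate plans is the product of the bucket sizes, which is why the subsequent containment checks dominate the cost.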
Usually these properties are checked using algorithms for containment of
conjunctive queries. The algorithm should guarantee to produce only sound and
relevant plans.
Does the algorithm produce all the necessary conjunctive plans? The
answer is based on the close relationship between the problem of finding
conjunctive plans and the problem of answering queries using materialized
views.
Although the cost of checking minimality and soundness of a conjunctive plan is
exponential, it is exponential only in the size of the query, which tends to be
small, and not in the number of information sources or the size of their contents.
Query Containment Algorithms
• Basic techniques (e.g., QinP (Ullman): containment of conjunctive queries in
logical recursions; negation in conjunctive queries by Chan)
• Extensions:
– Containment for queries with complex objects. Typing constraints and
integrity constraints for object DB schemas
– Relative containment
– Conjunctive queries with regular expressions
– Query containment under constraints
– Bag containment of conjunctive queries
• Alternative techniques:
– Counter machines to study query containment
– Verification of knowledge bases
– Description Logics
Containment of Conjunctive Queries in Logical
Recursions (QinP)
An algorithm testing whether a conjunctive query is contained in the
relation defined by a logic program.
Given are a conjunctive query Q, represented as
H :- G1 & … & Gk
and a logic program P.
To decide whether Q ⊆ P:
1) Assign to every variable in Q a unique constant ("freeze" Q).
2) Form an EDB relation from the frozen subgoals of Q.
3) Evaluate P (bottom-up) over this EDB, obtaining a DB relation.
4) If the frozen head H of Q is contained in DB, then Q ⊆ P.
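The containment test can be sketched in Python for negation-free Datalog programs. The sketch relies on several assumptions of its own: atoms are tuples (predicate, arg1, ...), variables are strings starting with an upper-case letter, frozen constants are formed by prefixing the variable name with "_", evaluation is the naive bottom-up fixpoint, and the final test checks whether the frozen head of Q is derived. The names match, evaluate, freeze and contained_in are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Atom = Tuple[str, ...]            # (predicate, arg1, arg2, ...)
Rule = Tuple[Atom, List[Atom]]    # (head, body)


def is_var(term: str) -> bool:
    # Convention of this sketch: variables start with an upper-case letter.
    return term[:1].isupper()


def match(atom: Atom, fact: Atom,
          env: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Extend env so that atom equals fact, or return None."""
    if atom[0] != fact[0] or len(atom) != len(fact):
        return None
    env = dict(env)
    for term, value in zip(atom[1:], fact[1:]):
        if is_var(term):
            if env.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return env


def evaluate(program: List[Rule], edb: Set[Atom]) -> Set[Atom]:
    """Naive bottom-up evaluation of the program up to a fixpoint."""
    db = set(edb)
    changed = True
    while changed:
        changed = False
        for head, body in program:
            envs: List[Dict[str, str]] = [{}]
            for atom in body:
                next_envs = []
                for env in envs:
                    for fact in db:
                        extended = match(atom, fact, env)
                        if extended is not None:
                            next_envs.append(extended)
                envs = next_envs
            for env in envs:
                derived = (head[0],) + tuple(env.get(t, t) for t in head[1:])
                if derived not in db:
                    db.add(derived)
                    changed = True
    return db


def freeze(atom: Atom) -> Atom:
    """Replace every variable by a unique constant (steps 1 and 2)."""
    return (atom[0],) + tuple("_" + t if is_var(t) else t for t in atom[1:])


def contained_in(query: Rule, program: List[Rule]) -> bool:
    head, body = query
    edb = {freeze(a) for a in body}
    return freeze(head) in evaluate(program, edb)   # steps 3 and 4


# P: transitive closure of edge.
P: List[Rule] = [
    (("path", "X", "Y"), [("edge", "X", "Y")]),
    (("path", "X", "Y"), [("edge", "X", "Z"), ("path", "Z", "Y")]),
]
two_steps: Rule = (("path", "A", "C"), [("edge", "A", "B"), ("edge", "B", "C")])
not_a_path: Rule = (("path", "A", "C"), [("edge", "A", "B"), ("edge", "C", "B")])
```

With these definitions contained_in(two_steps, P) holds, because a two-edge chain is a path, while contained_in(not_a_path, P) does not.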
A Query Converter for Wrappers Toolkit
In TSIMMIS the query converter is a part of the wrapper implementation toolkit.
The logic-based, OEM-oriented query language MSL is used.
Source capabilities are defined with templates in a Query Description and
Translation Language (QDTL). Each template can be associated with an action
that generates the commands for the underlying source.
The converter will process:
• Directly supported queries. These are queries that syntactically match a
template.
• Logically supported queries. These are queries that produce the same
results as a directly supported query. The notion of logical equivalence is
used to detect queries that fall in this class.
• Indirectly supported queries. These are queries that can be executed in
two steps: first a directly supported query is executed, and then a filter is
applied to the results of the first step.
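The two-step execution of an indirectly supported query can be sketched as follows. The sketch does not use QDTL or MSL: templates are modelled simply as sets of attributes on which a hypothetical source accepts equality conditions, and the names TEMPLATES, convert and run are assumptions of this illustration.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

# Hypothetical capability description: each template lists the attributes
# that a native query to the source must constrain by equality.
TEMPLATES: List[FrozenSet[str]] = [
    frozenset({"painter"}),
    frozenset({"painter", "date"}),
]

Conditions = Dict[str, str]


def convert(conditions: Conditions) -> Optional[Tuple[Conditions, Conditions]]:
    """Split a query into a directly supported query and a filter.

    Returns (native_conditions, filter_conditions), or None when no
    template is covered by the query.
    """
    usable = [t for t in TEMPLATES if t <= set(conditions)]
    if not usable:
        return None
    best = max(usable, key=len)     # push as many conditions as possible
    native = {a: conditions[a] for a in best}
    residual = {a: v for a, v in conditions.items() if a not in best}
    return native, residual


def run(rows: List[Conditions], conditions: Conditions) -> List[Conditions]:
    """Execute the native part (simulated here), then apply the filter."""
    converted = convert(conditions)
    if converted is None:
        raise ValueError("query is not supported by the source")
    native, residual = converted
    fetched = [r for r in rows
               if all(r.get(a) == v for a, v in native.items())]
    return [r for r in fetched
            if all(r.get(a) == v for a, v in residual.items())]


rows = [
    {"painter": "Botticelli", "title": "Primavera", "date": "1482"},
    {"painter": "Botticelli", "title": "Birth of Venus", "date": "1485"},
]
split = convert({"painter": "Botticelli", "title": "Primavera"})
result = run(rows, {"painter": "Botticelli", "title": "Primavera"})
```

Choosing the template that covers the most conditions keeps the filter small, which is the intuition behind preferring a maximal supporting query.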
Detection of maximal supporting query
and of a filter
A query qs is a maximal supporting query of query q with respect to a capability
description D if qs is directly supported by D, qs indirectly supports q, and there is no
directly supported query qs’ that indirectly supports q, is subsumed by qs, and is not
logically equivalent to qs. There may be more than one maximal supporting query for a
given query.
The capability description D is expressed as a (possibly recursive) Datalog program.
The problem of determining whether a description D supports query Q is the same as the
problem of determining whether program P(D) contains (subsumes) query Q and whether a
corresponding filter query exists. A supporting query is found in two steps:
1. find a subsuming query, and
2. find the corresponding filter.
The approach is based on an extension of Ullman's query containment algorithm (X-QinP).
The basic algorithm gives only a yes/no answer to the containment question;
the extension also finds the actual maximal supporting queries and the
native query constituents for the underlying source.
Known modifications of query rewriting
algorithms using views
1. Conjunctive queries
2. Source templates with binding patterns
3. Recursive queries
4. Views in description logic
5. Rewriting for semistructured data. Regular expressions rewriting,
navigational plans
6. Boolean queries rewriting
7. Queries with union and aggregation
8. Type inferencing
9. Object fusion
10. Scalable technique
Infrastructure of the mediator aiming at semantic
interoperability of collections
Outline:
• Heterogeneity of the mediator
• Canonical information model
• Mediator’s metadata
• Information extraction framework
• Collection registration at a mediator as a process of
compositional development
Heterogeneous information models
absorbed by the canonical model
Canonical Model: Core and Extensions (is_refined_by)
Models absorbed by the canonical model:
• Semistructured Data Models (OEM, ADM, OQL-doc)
• Component Models (IDL, CDL, BOF)
• Object & Heterogeneous DB Models (ODL, SQL3, Garlic)
• Document Object Model
• Knowledge Base Representations (OKBC, Ontolingua)
• Metadata for DL (Dublin Core, Warwick, Starts, Z.39.50)
• Metadata Expressible in Meta Models (MOF, RDF)
• Unstructured Data (vocabularies, thesauri)
• Workflow Models
Canonical Model Entities
[Figure: entities of the canonical model and their relationships.
Entities: Metaclass, Class, Type, Collection, Object, World, Abstract Value, Frame.
Relationship labels: instance_of, supertype, superclass, type, instance type.
Annotation: becomes an object.]
Canonical Information Model
A set of the canonical model facilities used for the uniform representation of the
information resources includes the following:
• Frame representation facilities. Frames are treated as a special kind of abstract values
introduced mostly for the description of concepts and of terminological and weakly-structured
information. All specifications in the canonical model have the form of frames that become a
part of the metabase.
• Unifying type system. The type system includes a universal constructor of arbitrary abstract
data types as well as a comprehensive collection of built-in types.
• Class representation. Classes represent sets of homogeneous entities
of an application domain. Class instances (objects) have specific types.
• Multiactivity (workflow) representation. Multiactivities are used for the specification and
implementation of interconnected and interdependent application activities, and for the
specification of declarative assertions and concurrent megaprograms over the information
resources.
• Facilities for expressing logical formulae. A multisorted object calculus (typed
first-order language) is used for querying the integrated set of digital collections as well as
for the specification of constraints and behaviour.
Mediator’s Metadata Layering
[Figure: layering of the mediator's metadata.
Levels: Personalized DL Level; Interoperable (Federated) Level; Local Level; Real Collection Level.
Metadata shown across the levels: Views, Subschemas, Specific Vocabulary, Subject Classification Hierarchy & Context (metaclass hierarchy & ontological definitions), Common Thesauri, Federated Schema, Core and Extension parts, and local Ontology, Schema, Vocabulary and Thesauri.
Real collections: Structured Collection, Semistructured Collection, Unstructured Collection.]
Information Extraction Framework
[Figure: architecture of the information extraction framework.
Components: Personalized DL (two instances), Personalization Facilities, Canonical GUI, Graphical Query Facilities, Information Extraction Facilities, Query Engine, Outcome Presentation, Localization Facilities, Mediator’s DBMS (object-relational DBMS) with metadata repository.
Functions listed in the figure:
• canonical mediator’s query language
• best relevant collection identification
• query decomposition
• query planning and monitoring
• ranking
• merging
• aggregation
• summarization
Connections: Java / CORBA, IIOP, http, Z39.50.
Wrappers and local collections: XML wrapper (XML data system), Z39.50 wrapper (Z39.50 server), information retrieval system wrapper (information retrieval system), SRS wrapper (molecular biology data banks).]
Metainformation Repository
[Figure: class diagram of the metainformation repository.
Classes: Value, Frame, Slot, Module, Type, Schema, ADT, Class, Attribute, View, Concept, Metaclass, Function, Simulating, CEntity, Reduct, CompType, Category.
Association labels: slots, instances, instType, instInstType, simulatings.]
Collection Registration Framework
The framework facilities are intended to support functions of collection
contextualizing:
• constructing mapping of a collection data model and metadata into
the canonical ones;
• representation of the new metainformation in terms of the
federated mediator's level;
• inferring from the collection the required information for the
federated level;
• semi-automatic construction of a collection wrapper;
• connecting the wrapper to the interoperation environment (e.g.,
CORBA).
Contextualization of Ontology
• mapping of local ontological context to that of the mediator
– by names and relationships
– by natural language description
– applying structural integration to concept specifications
– introducing new concepts over existing ones
• contextualization through structural correlation
– establishing weak ontological relevance of specification elements
by applying analysis of intercontext concept relationships
– establishing tight ontological relevance of specification elements
by introducing a subsumption relationship between concepts
Correlation of Ontological Concepts
• evaluation of descriptor weights
$$ W_{Xk} = \frac{f_k \log \frac{N}{n_k}}{\sqrt{\sum_{i \in V_X} \left( f_i \log \frac{N}{n_i} \right)^2}} $$
• establishing intercontext relationships between concepts
$$ sim(X,Y) = \frac{\sum_{k \in V_X \cap V_Y} W_{Xk} \cdot W_{Yk}}{\sqrt{\sum_{k \in V_X} (W_{Xk})^2 \cdot \sum_{k \in V_Y} (W_{Yk})^2}} $$
$$ r(X,Y) = \frac{\sum_{k \in V_X \cap V_Y} \min(W_{Xk}, W_{Yk})}{\sum_{k \in V_X} (W_{Xk})^2} $$
$$ r(Y,X) = \frac{\sum_{k \in V_X \cap V_Y} \min(W_{Xk}, W_{Yk})}{\sum_{k \in V_Y} (W_{Yk})^2} $$
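The descriptor weight and the sim(X, Y) measure on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions: the slide does not define its symbols, so the code assumes that f_k is the frequency of descriptor k in the description of concept X, n_k is the number of concept descriptions containing k, N is the total number of descriptions, and V_X is the set of descriptors of X. The function names and the sample numbers are hypothetical.

```python
import math


def descriptor_weights(freqs, doc_counts, n_total):
    """Normalized weights: f_k*log(N/n_k) divided by the Euclidean norm
    of all such raw values for the concept (assumed reading of the slide)."""
    raw = {k: f * math.log(n_total / doc_counts[k]) for k, f in freqs.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    if norm == 0.0:
        return {k: 0.0 for k in raw}
    return {k: v / norm for k, v in raw.items()}


def sim(wx, wy):
    """Cosine-style similarity over the descriptors shared by X and Y."""
    shared = wx.keys() & wy.keys()
    numerator = sum(wx[k] * wy[k] for k in shared)
    denominator = math.sqrt(
        sum(v * v for v in wx.values()) * sum(v * v for v in wy.values())
    )
    return numerator / denominator if denominator else 0.0


# Hypothetical sample data: 10 concept descriptions in total.
doc_counts = {"painting": 2, "artist": 5, "museum": 10}
w_x = descriptor_weights({"painting": 3, "artist": 1}, doc_counts, 10)
w_y = descriptor_weights({"painting": 1, "artist": 2, "museum": 4}, doc_counts, 10)
similarity = sim(w_x, w_y)
```

A descriptor that occurs in every description (here "museum") gets weight zero, so it does not contribute to the similarity of two concepts.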
Ontological Metainformation
[UML class diagram of the ontological metainformation. Classes: Class, ADT, Category, Concept, ConceptRel, PositiveRel, NarrowRel, PartRel, RelativeRel, ConceptWeight, Descriptor. Attributes: Concept (code: string, definition: string, wordClass: string); ConceptRel (strength: float = 1); ConceptWeight and Descriptor (name: string, weight: float, frequency: float). Labeled association ends: type, collection, category, concept, foreign, fromConcept, toConcept, fromRelation, toRelation, weightOf, weights, descriptorOf, descriptors.]
Process of an Information Source Registration
For each source class the following steps (of the compositional development process) are required [LNCS 2151]:
1. Relevant federated classes identification
   • Find the federated classes that can be used ontologically to define the source class extent in terms of federated classes. Several federated classes may correspond to one source class, their instance types covering different reducts of the instance type of the source class. Conversely, several source classes may correspond to one federated class.
2. Most common reducts construction
   For the instance type of each identified federated class:
   • Construct the most common reduct of this federated instance type and the source class instance type, so that the federated instance type is (partially) concretized. The most common reduct may also include additional attributes corresponding to those federated type attributes that can be derived from the source type instances.
   • For each attribute type of the common reduct, construct a concretizing type, a concretizing function, or a combination of the two (this step is applied recursively).
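The reduct construction step can be illustrated with a strongly simplified Python sketch. It treats a type as a plain set of attribute names and a concretizing function as an ordinary callable, which is an assumption made for illustration only; the actual process works on full type specifications and their refinement. The class `Concretization`, the helper names, the `width`/`height` source attributes and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Set, Tuple


@dataclass
class Concretization:
    """How one federated attribute is derived from source attributes."""
    source_attrs: Tuple[str, ...]
    convert: Callable[..., Any]


def common_reduct(federated_attrs: Set[str], source_attrs: Set[str],
                  mapping: Dict[str, Concretization]) -> Dict[str, Concretization]:
    """Keep the federated attributes that the source type can support."""
    return {
        fed: conc for fed, conc in mapping.items()
        if fed in federated_attrs and set(conc.source_attrs) <= source_attrs
    }


def concretize(instance: Dict[str, Any],
               reduct: Dict[str, Concretization]) -> Dict[str, Any]:
    """Express a source instance in terms of the federated reduct."""
    return {
        fed: conc.convert(*(instance[a] for a in conc.source_attrs))
        for fed, conc in reduct.items()
    }


painting_attrs = {"title", "date_of_origin", "created_by", "dimensions"}
canvas_attrs = {"title", "painter", "date", "description", "to_image"}
mapping = {
    "title": Concretization(("title",), lambda t: t),
    "date_of_origin": Concretization(("date",), lambda d: d),
    "created_by": Concretization(("painter",), lambda name: {"name": name}),
    # Not derivable: the source type has no such attributes (hypothetical names).
    "dimensions": Concretization(("width", "height"), lambda w, h: [w, h]),
}
reduct = common_reduct(painting_attrs, canvas_attrs, mapping)
canvas = {"title": "Example title", "painter": "Example painter",
          "date": "1500", "description": "", "to_image": None}
federated_view = concretize(canvas, reduct)
```

In this sketch `dimensions` drops out of the reduct because no source attribute supports it, while `created_by` is supported through a concretizing function over `painter`.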
Process of an Information Source Registration
For each source class the following steps are required:
3. Partial source view construction
   • For each relevant federated class, construct a partial source view expressing, in terms of the federated class, the constraints that must be satisfied by the values of the respective most common reducts of the source class instances. This yields partial views over all relevant federated classes.
4. Partial views composition
   • Construct the composition of the source type most common reducts obtained for the instance types of all federated classes involved.
   • Construct the source view as a composition of the partial views obtained above. This is an expression of a materialized view of the information source in terms of federated classes. The instance type of this view is determined by the composition of most common reducts constructed above.
Subject Mediator. Cultural Heritage Collections.
[Figure: registration of cultural heritage collections at the subject mediator. Panels: Collections Registration; Federated Level Metainformation; Local into Federated Level Mapping. Sources with their local classes and attributes:
• CIMI Profile of Z39.50: museum_object (created_by*, date_collected*, description*, object_id*, relation*, …); content_general; collection; mrObject; creator_c (nationality, works)
• Louvre Museum Web Site: department (name, description, sections); author (name, nationality, works)
• Uffizi Museum Web Site: artist (name, biography, paint_list); canvas (title, painter, date, history, description, to_image)]
Local Views in Terms of Federated Classes
creator_c(c/Creator_Creator_Info[name, nationality, date_of_birth, date_of_death,
    works/{set_of: Heritage_Entity_Museum_Object}])
  ⊆ creator(c[name, nationality, date_of_birth, date_of_death, works])

author(a/Creator_Author[name/fname, nationality, works/{set_of: Heritage_Entity_Work}])
  ⊆ creator(name, nationality, works(w)) &
    ∃ c,s (repository(c/Collection[contains(s/Section)]) &
    repository.name = 'Louvre' & in(w, s.contains))

artist(a/Creator_Artist[name, nationality, general_info/Text_Textual,
    works/{set_of: Painting_Canvas}])
  ⊆ creator(a[name, nationality, general_info, works]) &
    repository(n/name, collection) & n = 'Uffizi' &
    ∃ col/Collection (¬ isempty(intersect(collection(col/Collection).contains, works)))
Specifications of Types of the Uffizi Site Schema
[UML class diagram of the Uffizi site schema. Classes and attributes: Repository (name: string); Artist (name: string, biography: Textual, culture: Textual); Canvas (title: Textual, painter: string, date: time, description: Textual); Room (room_no: string, room_name: Textual); Image. Labeled associations, most of them {ordered}: authors, paint_list, contains, room_list, to_image.]
Specifications of Types of the Federated Schema
[UML class diagram of the federated schema. Classes and attributes: Person (name: string, nationality: string, date_of_birth: time, date_of_death: time, residence: Address); Repository (name: string, place: Address, description: Text); Collection (name: Text, location: Address, description: Text); Entity (title: Text, date: time, narrative: Text); Creator (culture_race: string, general_info: Text); Heritage_Entity (place_of_origin: Address, date_of_origin: time, content: Text); Painting (dimensions: {sequence; type_of_element: integer}); Antiquities (type_spicemen: Text, archeology: Text); Digital_Entity. Labeled associations: collections, in_repository, in_collection, created_by, works, contains, digital_form.]
Most Common Reduct (Example)
{CR_Painting_Canvas;
  in: c_reduct;
  metaslot
    of: Canvas;
    taking: {title, painter, date, description, to_image};
    reduct: R_Painting_Canvas
  end;
  simulating: {
    R_Painting_Canvas.title ~ CR_Painting_Canvas.title;
    R_Painting_Canvas.created_by ~ CR_Painting_Canvas.get_created_by;
    R_Painting_Canvas.date_of_origin ~ CR_Painting_Canvas.date;
    ... };
  get_created_by: {in: function;
    params: {+ext/CR_Painting_Canvas, -returns/Creator};
    predicative: {ex c/Canvas ((c/CR_Painting_Canvas = ext) &
      ex a/Artist ((c.painter = a.name) &
        returns = a/CR_Creator_Artist))}}
  ...
}
Partial Source View Construction (Example)
The formula expressing the local class canvas in terms of the federated class painting is defined as:
canvas(p/CR_Painting_Canvas) ⊆ painting(p/R_Painting_Canvas) &
  p.in_collection.in_repository = 'Uffizi'
The specification of the class containing this formula (in fact, a local-as-view class) is:
{v_canvas_painting;
  in: class;
  class_section: {
    key: invariant, {unique; {title}};
    lav: invariant, {subseteq(v_canvas_painting(p),
      painting(p/R_Painting_Canvas) &
      p.in_collection.in_repository = 'Uffizi')}
  };
  instance_section: CR_Painting_Canvas
}
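The two invariants of this view class (the unique key and the local-as-view inclusion) can be read operationally, as in the following Python sketch. It is an illustration under assumptions, not the mediator's mechanism: in a mediator the federated extent is virtual, while here it is a small in-memory list. Instances are plain dictionaries, the path p.in_collection.in_repository is flattened into a hypothetical `repository` field, and the function names and sample data are made up for the example.

```python
def check_unique(instances, key):
    """Key invariant: no two instances share the same key value."""
    keys = [key(i) for i in instances]
    return len(keys) == len(set(keys))


def check_lav(view_instances, federated_extent, condition, key):
    """Inclusion invariant: every view instance has a counterpart (same key)
    among the federated instances that satisfy the condition."""
    allowed = {key(p) for p in federated_extent if condition(p)}
    return all(key(v) in allowed for v in view_instances)


def by_title(instance):
    return instance["title"]


def in_uffizi(painting):
    return painting["repository"] == "Uffizi"


# Hypothetical sample data.
paintings = [
    {"title": "Canvas A", "repository": "Uffizi"},
    {"title": "Canvas B", "repository": "Uffizi"},
    {"title": "Canvas C", "repository": "Louvre"},
]
v_canvas_painting = [{"title": "Canvas A"}, {"title": "Canvas B"}]

key_ok = check_unique(v_canvas_painting, by_title)
lav_ok = check_lav(v_canvas_painting, paintings, in_uffizi, by_title)
```

The inclusion is one-directional: the view may contain fewer instances than the constrained federated class, but never an instance outside it.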
Source View Composition (Example)
A final formula for the local class canvas in terms of the federated classes painting and creator is:
canvas(p/CR_Painting_Creator_Canvas) ⊆
  painting(p/R_Painting_Canvas) & p.in_collection.in_repository = 'Uffizi' &
  creator(c/R_Creator_Canvas) & ∃ w/Painting (in(w, c.works) &
  w.in_collection.in_repository = 'Uffizi')
The complete definition of the source view is as follows:
{v_canvas;
  in: class;
  class_section: {
    key: invariant, {unique; {title}};
    lav: invariant, {subseteq(v_canvas,
      painting(p/R_Painting_Canvas) &
      p.in_collection.in_repository = 'Uffizi' &
      creator(c/R_Creator_Canvas) & ex w/Painting
      (in(w, c.works) & w.in_collection.in_repository = 'Uffizi'))}
  };
  instance_section: CR_Painting_Creator_Canvas
}
CR_Painting_Creator_Canvas = CR_Painting_Canvas ⊔ CR_Creator_Canvas
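The composition step can be pictured with a small Python sketch: each partial view carries the attributes of its reduct and a constraint, and the composed view takes the union of the attributes and the conjunction of the constraints. This is a simplification assumed for illustration; the real composition operates on type specifications. `PartialView`, `compose`, the attribute set chosen for CR_Creator_Canvas and the flattened fields `repository` and `creator_works_in` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, FrozenSet


@dataclass(frozen=True)
class PartialView:
    reduct_attrs: FrozenSet[str]
    constraint: Callable[[Dict[str, Any]], bool]


def compose(a: PartialView, b: PartialView) -> PartialView:
    """Union of reduct attributes, conjunction of constraints."""
    return PartialView(
        a.reduct_attrs | b.reduct_attrs,
        lambda inst: a.constraint(inst) and b.constraint(inst),
    )


# Partial view over painting: the painting is held in the Uffizi.
painting_view = PartialView(
    frozenset({"title", "painter", "date", "description", "to_image"}),
    lambda inst: inst["repository"] == "Uffizi",
)
# Partial view over creator: the creator has some work held in the Uffizi.
creator_view = PartialView(
    frozenset({"painter"}),
    lambda inst: "Uffizi" in inst["creator_works_in"],
)
source_view = compose(painting_view, creator_view)

accepted = source_view.constraint(
    {"repository": "Uffizi", "creator_works_in": ["Uffizi", "Louvre"]}
)
rejected = source_view.constraint(
    {"repository": "Louvre", "creator_works_in": ["Uffizi"]}
)
```

An instance belongs to the composed view only if it passes every partial constraint, which mirrors the conjunction in the final formula.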
Structure of the Collection Registration Tool
[Diagram: the Collection Registration Tool covers reconciliation of the source collection context with the mediator metadata; most common reduct identification; construction of source class specifications as views over federated classes; and wrapper generation, which produces wrapper code. It works with the metainformation repository kept in the mediator’s DBMS (Oracle 8i) and with the B-Toolkit (B-AMN).]
Summary
• Subject domain mediation has good prospects for integrating heterogeneous information sources as professional communities form around the Internet
• The ‘local as view’ approach looks promising for worlds of multiple, dynamically changing sources (in content and availability), and it also supports the mediator’s scalability
• Widely known mediator projects and related research have contributed a lot to mediator definition, query planning, source capability description and wrapper generation
• Many serious gaps remain, e.g.: mostly relational models have been studied; conjunctive queries have been supported; thesauri and ontologies have not been sufficiently involved; query containment has been studied for precise queries (querying textual, multimedia, object and semistructured data may require reconsideration); the problem of source view registration for the LAV approach has not been studied; mediator composition problems have not been investigated
• Therefore, the area looks fruitful for research, experimentation and development.