Content Integration for E-Business
Joe Hellerstein
New Generation of e-Business on the Internet
• Companies moving beyond marketing, storefronts
• Attempting to do operations on the Internet
– supply chain
– customer relationships
• In a cross-enterprise environment
• Requires cross-enterprise content integration
– catalog integration is the procurement instance of this problem
Content Integration
• Content integration across enterprises
– Not the “in-house” data warehousing problem
– Not the Enterprise App Integration (EAI) problem
• “Operational” data must be integrated
– As opposed to historical (trend) data
– E.g. pricing, availability, supply chain
• Structured and unstructured data
– Not just relational or XML queries
– Not just text search
– A combination of the two: logic meets statistics
The “Butterfly”
• Everybody’s favorite picture c. 1/2000:
• In question (6/2001): how many butterflies there are, and who owns them
– Not a startup opportunity (Transora vs. Chemdex)
– Perhaps one of the wings is smaller than the other (HomeDepot)
Road Map
• Scenarios & Terminology
• Characteristics and Challenges of Content Integration
• Research Evangelism
Some Scenarios for Content Integration
• Catalog Management: Integration and Syndication
– “MRO” (Maintenance, Repair and Operations) a la Grainger
– Thousands of suppliers, run by a “content manager”
• Availability and Pricing
– Travel industry
– Necessitates live, cross-enterprise querying
• Supply Chain Management
– E.g. auto industry
– Increase in production requires the entire supply chain (“the cows”)
– Contractual information along with catalog and availability
Marketing: The EcoSystem and its Terminology
• Enterprise Application Integration (EAI): App Glue
– Imperative, message-oriented programming (scripting languages)
– Transactional networking (persistent queues)
– Gateways to popular packaged apps
– Vendors: WebMethods, BEA, CrossWorlds, Netfish, MQSeries, etc.
• Data Integration: Warehousing and associated processes
– Intra-enterprise, for “business intelligence” (historical trends)
– Vendors: Informatica, Ascential, DBMS vendors
• Content Management: Tools for content creation
– Web page and graphic design
– Versioning and configuration management
– Vendors: Vignette, Interwoven, etc.
Road Map
• Setting
• Scenarios & Terminology
• Characteristics and Challenges of Content Integration
– Content Access, Mapping and Transformation
– Query Processing
• Research Evangelism
Content Integration: Characteristics and Challenges
• New integration challenges for e-business
– Data-centric (not app-centric)
• Two main thrusts
– Content Access, Mapping and Transformation
– Query Processing
Content Access: Relationships with Providers
• Varying relationships with content providers
– Direct DBMS access (typically in-house)
– Direct access to federated apps (SAP, etc.)
– Gateway vendors a la Merant, NEON, Attunity, etc.
– Arm’s-length relationships
– HTML screen scraping
– XML messaging
Relationships evolve over time!
MySimon example
Content Mapping
• Syntactic and semantic integration
– Formatting/normalization is one piece of the puzzle
– XML, HTML, Relational, etc.
– Semantics is much harder
– E.g. “price”. E.g. “delivery”.
• Semantics gate the process
– A “content manager” must own the transformation task
– Ease of use critical
– Home Depot has 60,000 suppliers!
– Standards can help a bit (e.g. UDDI)
– But graphical tools are the name of the game
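The gap between syntactic and semantic mapping can be made concrete with “price”. What follows is a minimal sketch, assuming two hypothetical suppliers whose feeds both carry a price but mean different things by it (one quotes US cents per case of 12, the other dollars per unit). The supplier names, field names, pack sizes, and the `to_unit_price_usd` helper are illustrative assumptions, not taken from any real feed or from the Cohera Workbench.

```python
from decimal import Decimal

# Hypothetical mapping rules that a content manager might record per supplier.
# Each rule states how to interpret that supplier's price field.
MAPPINGS = {
    # supplier_a quotes US cents per case of 12
    "supplier_a": {"field": "price", "scale": Decimal("0.01"), "units_per_price": 12},
    # supplier_b quotes US dollars per single unit
    "supplier_b": {"field": "unit_cost", "scale": Decimal("1"), "units_per_price": 1},
}

def to_unit_price_usd(supplier, record):
    """Normalize a supplier record's price to US dollars per single unit."""
    rule = MAPPINGS[supplier]
    raw = Decimal(str(record[rule["field"]]))
    unit_price = raw * rule["scale"] / rule["units_per_price"]
    return unit_price.quantize(Decimal("0.01"))

price_a = to_unit_price_usd("supplier_a", {"sku": "INK-1", "price": 2388})
price_b = to_unit_price_usd("supplier_b", {"sku": "INK-1", "unit_cost": "1.99"})
print(price_a, price_b)
```

Renaming `unit_cost` to `price` is the syntactic part; knowing that one number is cents per case is the semantic part, and only a person who understands the supplier can supply that rule.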
Cohera Workbench
Schemas and Taxonomies
• Cross-enterprise = multiple schemas
– Even if standards prevail (very optimistic)
– Early e-catalog systems were locked into one schema
– Great for service companies, e.g. Requisite
– Tools are sounding the death knell
• Taxonomies are critical
– Natural for browsing, especially with dirty data
– “Black Ink”, “India ink”, “fountain pen ink, black”
– One taxonomy per vertical market, plus standards like UNSPSC
– Office Supplies->Ink and lead refills->India ink
– Taxonomy as data: query it, browse it, etc.
• Integration task includes taxonomy integration!
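To illustrate placing dirty product descriptions into a taxonomy, here is a minimal sketch that ranks taxonomy nodes by token overlap with a description. The three-node taxonomy, the `tokens` and `suggest` helpers, and the Jaccard scoring are all assumptions chosen for brevity; this is not how any particular product does it.

```python
import re

# A tiny hypothetical taxonomy, stored as data: one tuple per root-to-leaf path.
TAXONOMY = [
    ("Office Supplies", "Ink and lead refills", "India ink"),
    ("Office Supplies", "Ink and lead refills", "Pen refills"),
    ("Office Supplies", "Writing instruments", "Fountain pens"),
]

def tokens(text):
    """Lowercase word set with very crude plural stripping (trailing 's' removed)."""
    return {t.rstrip("s") for t in re.findall(r"[a-z]+", text.lower())}

def suggest(description, top_k=2):
    """Return up to top_k (score, path) hints, best first, by Jaccard token overlap."""
    desc = tokens(description)
    scored = []
    for path in TAXONOMY:
        node = tokens(" ".join(path))
        score = len(desc & node) / len(desc | node)
        scored.append((score, path))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(round(s, 2), " -> ".join(p)) for s, p in scored[:top_k] if s > 0]

for text in ["Black Ink", "India ink", "fountain pen ink, black"]:
    print(text, "=>", suggest(text))
```

With this toy scoring, “India ink” lands on the India ink node, but “fountain pen ink, black” ranks the fountain pen node first, which is wrong. The output is a ranked list of hints for a person to confirm or override, not a classification.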
Themes in Content Access and Mapping
• Scalability in human terms
• “Content managers”, not geeks
• The name of the game: semi-automatic tools
– Statistical (“fuzzy”) techniques to provide hints (not silver bullets)
– Integrated into graphical programming-by-example interfaces
– Problem domains:
– Wrapper generation
– Data cleaning
– Schema mapping
– Taxonomy mapping
– Syndication
• One of the key “systems” challenges today
Road Map
• Scenarios & Terminology
• Characteristics and Challenges of Content Integration
• Research Evangelism
Query Processing Issues
• Content to be integrated is increasingly “uncacheable”
– Arm’s-length accessibility
– Business rules, not data
– E.g. custom content throughout the dataflow
– Volatile information
– E.g. Availability
• Yet a great deal of content is cacheable and slowly changing
• Upshot: need a combined technology
– Prefetch/Cache/Replicate when possible
– Query live when impossible
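The “cache when possible, query live when impossible” upshot can be sketched as a per-attribute policy. This is a minimal sketch under stated assumptions: `HybridSource`, its `fetch_live` callback (a stand-in for a remote query), and the per-attribute TTL table are hypothetical names invented for illustration.

```python
import time

class HybridSource:
    """Serve cacheable attributes from a local copy; fetch volatile ones live."""

    def __init__(self, fetch_live, ttl_seconds, clock=time.monotonic):
        self.fetch_live = fetch_live    # callable(key, attribute) -> value
        self.ttl_seconds = ttl_seconds  # attribute -> max age in seconds; None or absent = uncacheable
        self.clock = clock
        self.cache = {}                 # (key, attribute) -> (value, fetched_at)

    def get(self, key, attribute):
        ttl = self.ttl_seconds.get(attribute)
        if ttl is None:
            # Volatile or rule-driven content: always ask the provider.
            return self.fetch_live(key, attribute)
        now = self.clock()
        hit = self.cache.get((key, attribute))
        if hit is not None and now - hit[1] <= ttl:
            return hit[0]
        # Cache miss or stale copy: fall through to a live query and refresh.
        value = self.fetch_live(key, attribute)
        self.cache[(key, attribute)] = (value, now)
        return value

live_calls = []

def demo_fetch(key, attribute):
    live_calls.append((key, attribute))
    return f"{attribute} of {key}"

source = HybridSource(demo_fetch, {"description": 3600, "availability": None})
source.get("sku-1", "description")
source.get("sku-1", "description")   # served from cache
source.get("sku-1", "availability")
source.get("sku-1", "availability")  # always live
print(len(live_calls), "live calls")
```

The point of the design is that the caller issues the same `get` either way; whether the answer came from a copy or from the provider is a policy decision, not an API difference.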
Federated Query Processing
• DBMS community must shed our materialization myopia!
– ETL/Warehousing was inelegant and limited
– What do we do on a “cache miss”??
– Should be no distinction between materialized views and queries!
• Federated Query Processing
– Query across multiple sources
– Choose among multiple replicas, materialized views
– Consider staleness
• This is the natural extension of the modern database vision
– Cohera uses Mariposa’s economic model to do this
– Decouples optimization, cost estimation, storage and processing
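Choosing among replicas and materialized views while considering staleness can be sketched as bid selection. This is only loosely in the spirit of an economic model; it is not Mariposa’s or Cohera’s actual protocol. The `Bid` type, its fields, and `choose_bid` are hypothetical names for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Bid:
    site: str
    price: float        # what the site charges to answer, in abstract cost units
    staleness_s: float  # age of the site's copy in seconds; 0 for the live source

def choose_bid(bids, max_staleness_s):
    """Pick the cheapest bid whose copy is fresh enough, or None if none qualifies."""
    acceptable = [b for b in bids if b.staleness_s <= max_staleness_s]
    if not acceptable:
        return None
    return min(acceptable, key=lambda b: b.price)

bids = [
    Bid("warehouse-copy", price=1.0, staleness_s=86_400),
    Bid("regional-replica", price=3.0, staleness_s=60),
    Bid("supplier-live", price=10.0, staleness_s=0),
]

# A catalog browse tolerates day-old data; an availability check does not.
print(choose_bid(bids, max_staleness_s=100_000).site)
print(choose_bid(bids, max_staleness_s=300).site)
print(choose_bid(bids, max_staleness_s=0).site)
```

Because each site prices its own work, the chooser needs no global cost model, which is one way to read “decouples optimization, cost estimation, storage and processing”.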
Standard Queries Required
• Hand-coded queries are brittle: you want ad-hoc
– Don’t buy a handful of beans
• Need support for standard query languages
– SQL and XPath today
– SQL/XQuery tomorrow
• Everybody knows this!
– Part of industrial religion
– Oracle on one side
– Dotcoms on the other side
– You might get by claiming to be “XML compliant”
– But most people have cottoned on by now
IR capabilities need to be in the engine
• The best-integrated data will still be noisy (product names, etc.)
• Text search on taxonomies, names, descriptions
• Still no good integration of DBMS and IR engines
– Storage (compression huge in IR)
– Index concurrency (many updates per doc in IR)
– Query optimization challenges
• Note: this is not semi-structured querying!
– Integration of logic + statistics is the real model/query challenge
– Plus HCI issues
– Unify: “query”, “browse”, “mine”, “rank”
• Cohera integrates AltaVista into the engine & optimizer
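One way to picture “logic + statistics” in a single query is a ranked text match composed with an ordinary structured predicate. The sketch below is a toy under stated assumptions: the three-row `PRODUCTS` table, the in-memory inverted index, and the TF-IDF-style score are illustrative choices, and none of it reflects how Cohera or AltaVista work internally.

```python
import math
import re
from collections import Counter, defaultdict

PRODUCTS = [
    {"id": 1, "name": "India ink, black, 1 oz", "price": 4.50, "in_stock": True},
    {"id": 2, "name": "Fountain pen ink, black", "price": 7.25, "in_stock": False},
    {"id": 3, "name": "Black toner cartridge", "price": 59.00, "in_stock": True},
]

def terms(text):
    return re.findall(r"[a-z0-9]+", text.lower())

# Inverted index over product names: term -> {row position: term frequency}.
INDEX = defaultdict(dict)
for pos, row in enumerate(PRODUCTS):
    for term, tf in Counter(terms(row["name"])).items():
        INDEX[term][pos] = tf

def text_scan(query):
    """Yield (score, row) for rows matching any query term (the statistical part)."""
    scores = defaultdict(float)
    for term in terms(query):
        postings = INDEX.get(term, {})
        if postings:
            idf = math.log(1 + len(PRODUCTS) / len(postings))
            for pos, tf in postings.items():
                scores[pos] += tf * idf
    for pos, score in scores.items():
        yield score, PRODUCTS[pos]

def search(query, predicate):
    """Filter text matches with a boolean predicate (the logical part), then rank."""
    matches = [(score, row) for score, row in text_scan(query) if predicate(row)]
    matches.sort(key=lambda m: m[0], reverse=True)
    return [row["id"] for _, row in matches]

print(search("black ink", lambda r: r["in_stock"]))
print(search("black ink", lambda r: r["in_stock"] and r["price"] < 20))
```

Here the text match behaves like an index scan that also emits a sort key, so it composes with selection like any other operator. An optimizer that understood both could choose whether to apply the predicate or the text scan first.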
Core Systems Issues Remain Important
• Availability, Scalability, Load Balancing
– All critically important in the B2B space
– Availability: you don’t even control the components! Outage = news.
– Scalability: MRO wants to grow up to very big installations
– Load Balancing: need to respect SLAs, etc.
• Need adaptive, load-balancing, federated QP
– 100s to 1000s of “sites”
– Replication is key to availability, but optimizer must understand it
– Cohera’s economic model adapts for each query
– Other models being studied (see DE Bulletin 6/2000)
– Compile-time, centralized optimizers (R*, et al.) will break
Query Processing: Themes
• Logic + Statistics
• Adaptivity to changing performance, load, failures
• Optimizer Scalability
So What Really Matters Today?
• Cohera sells because…
– Customers need the content integration workbench today
– They are in integration pain!
– Comes in multiple guises (e-catalog, supplier enablement, etc.)
– Smart tools start cutting the pain immediately
– Customers want an open, standard solution
– Plain old SQL and relational schemas (vs. Requisite, e.g.)
– XML “in the bottom”, “out the top” for messaging/integration
– Customers want federated querying…tomorrow
– For today, they’ll settle for a centralized solution
– Want the flexibility to grow in that direction
– Federated query engine works fine centralized
– The converse clearly not true
Road Map
• Scenarios & Terminology
• Characteristics and Challenges of Content Integration
• Research Evangelism
Research Evangelism
• Semi-Automatic Tools
– Statistical + logical techniques, with a user in the loop
– E.g. Potter’s Wheel [Raman/Hellerstein, VLDB ’01]
– schema integration algebra
– interactive visualization
– programming-by-example
– statistical inferencing for discrepancies and domain detection
– A new class of “systems” work!
– “Tools”/“Apps” must be part of our agenda
– Many systems challenges here, especially on the stat/HCI side
– Architectural elegance, API design, extensibility, scalability
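As a rough illustration of statistical inferencing for discrepancies and domain detection, the sketch below guesses a column’s dominant value “shape” and flags values that deviate from it. It is far simpler than Potter’s Wheel and is not its algorithm; the `shape` and `flag_discrepancies` helpers and the majority-vote rule are assumptions made for illustration.

```python
from collections import Counter

def shape(value):
    """Abstract a string: digit runs become '9', letter runs 'a', other chars kept."""
    out = []
    for ch in value:
        if ch.isdigit():
            sym = "9"
        elif ch.isalpha():
            sym = "a"
        else:
            sym = ch
        # Collapse consecutive digits or letters into one symbol; keep punctuation as-is.
        if not out or out[-1] != sym or sym not in "9a":
            out.append(sym)
    return "".join(out)

def flag_discrepancies(column):
    """Return the most common shape and the values that do not match it."""
    shapes = Counter(shape(v) for v in column)
    dominant, _ = shapes.most_common(1)[0]
    return dominant, [v for v in column if shape(v) != dominant]

prices = ["12.50", "8.99", "USD 14", "3.00", "call"]
dominant, suspects = flag_discrepancies(prices)
print(dominant, suspects)
```

In an interactive tool, the flagged values would be shown to the user, who then decides whether to transform them, drop them, or declare a second valid domain.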
Research Evangelism, Cont.
• Adaptive Query Processing
– Critical to the federated B2B space
– Unpredictable world, you don’t control the components
– Also critical to the ubiquitous computing space
– Sensors are the next challenge
– Who’s the DBA of your housepaint? The freeway lines?
– Economic optimization (Mariposa) is one model
– Finer-Grained adaptivity possible (Eddies, SIGMOD 2K)
– See for examples, ideas, SW
Research Evangelism, Cont.
• Tired of research on relational? Choose wisely!
– One big direction here is to integrate IR
– Another is to abandon languages in favor of interfaces
– query+browse+mine: semi-automatic GUIs again!
• XML is critical to business, but under control
– We’re doing fine in this space, thank you
– XQuery will push (merge with?) SQL
– The end result will resemble things you’ve seen before
• But text search is eating our lunch!
– Intellectual impact in the last decade?
– Industrial impact in the last decade?
– Text search is mostly “just” an access method + a sort metric
– Integrate into our composable algebras and architectures!
– Teach it in our undergrad classes
• Content Integration is a new, challenging industrial space
– Cohera provides the first complete solution
– Access with varying relationships, formats
– Support for multiple schemas and taxonomies
– Support for custom syndication
– Support for distributed data, both cacheable and uncacheable
– Ad hoc querying
– Fuzzy & structured search
– Availability, Scalability, Load Balancing
– Smart graphical tools for content managers
– A fertile area for research as well
– Join the fun!
