Content Integration for E-Business
Joe Hellerstein
New Generation of e-Business on the Internet
• Companies moving beyond marketing, storefronts
• Attempting to do operations on the Internet
– procurement
– supply chain
– customer relationships
– etc.
• In a cross-enterprise environment
• Requires cross-enterprise content integration
– catalog integration is the procurement instance of this problem
Content Integration
• Content integration across enterprises
– Not the “in-house” data warehousing problem
– Not the Enterprise App Integration (EAI) problem
• “Operational” data must be integrated
– As opposed to historical (trend) data
– E.g. pricing, availability, supply chain
• Structured and unstructured data
– Not just relational or XML queries
– Not just text search
– A combination of the two: logic meets statistics
The “Butterfly”
• Everybody’s favorite picture c. 1/2000:
[Figure: butterfly diagram, with the Marketplace in the middle and Suppliers and Buyers as the two wings]
• The open question (6/2001): how many butterflies there are, and who owns them
– Not a startup opportunity (Transora vs. Chemdex)
– Perhaps one of the wings is smaller than the other (HomeDepot)
Road Map
• Setting
• Scenarios & Terminology
• Characteristics and Challenges of Content Integration
• Research Evangelism
Some Scenarios for Content Integration
• Catalog Management: Integration and Syndication
– “MRO” (Maintenance, Repair and Operations) a la Grainger
– Thousands of suppliers, run by a “content manager”
• Availability and Pricing
– Travel industry
– Necessitates live, cross-enterprise querying
• Supply Chain Management
– E.g. auto industry
– Increase in production requires the entire supply chain (“the cows”)
– Contractual information along with catalog and availability
Marketing: The EcoSystem and its Terminology
• Enterprise Application Integration (EAI): App Glue
– Imperative, message-oriented programming (scripting languages)
– Transactional networking (persistent queues)
– Gateways to popular packaged apps
– Vendors: WebMethods, BEA, CrossWorlds, Netfish, MQseries, etc.
• Data Integration: Warehousing and associated processes
– Intra-enterprise, for “business intelligence” (historical trends)
– Vendors: Informatica, Ascential, DBMS vendors
• Content Management: Tools for content creation
– Web page and graphic design
– Versioning and configuration management
– Vendors: Vignette, Interwoven, etc.
Road Map
• Setting
• Scenarios & Terminology
• Characteristics and Challenges of Content Integration
– Content Access, Mapping and Transformation
– Query Processing
• Research Evangelism
Content Integration: Characteristics and Challenges
• New integration challenges for e-business
– cross-enterprise
– operational
– data-centric (not app-centric)
– structured/unstructured
• Two main thrusts
– Content Access, Mapping and Transformation
– Query Processing
Content Access: Relationships with Providers
• Varying relationships with content providers
– Direct DBMS access (typically in-house)
– Direct access to federated apps (SAP, etc.)
– Gateway vendors a la Merant, NEON, Attunity, etc.
– Arm’s-length relationships
– HTML screen scraping
– XML messaging
• Relationships evolve over time!
– MySimon example
Content Mapping
• Syntactic and semantic integration
– Formatting/normalization is one piece of the puzzle
– XML, HTML, Relational, etc.
– Semantics is much harder
– E.g. “price”. E.g. “delivery”.
• Semantics gate the process
– A “content manager” must own the transformation task
– Ease of use critical
– Home Depot has 60,000 suppliers!
– Standards can help a bit (e.g. UDDI)
– But graphical tools are the name of the game
Cohera Workbench
Schemas and Taxonomies
• Cross-enterprise = multiple schemas
– Even if standards prevail (very optimistic)
– Early e-catalog systems were locked into one schema
– Great for service companies, e.g. Requisite
– Tools are sounding the death knell
• Taxonomies are critical
– Natural for browsing, especially with dirty data
– “Black Ink”, “India ink”, “fountain pen ink, black”
– Taxonomy per vertical markets, plus standards like UNSPSC
– Office Supplies->Ink and lead refills->India ink
– Taxonomy as data: query it, browse it, etc.
• Integration task includes taxonomy integration!
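A minimal sketch of the kind of statistical hint a taxonomy-mapping tool might offer a content manager. It assumes a plain word-overlap (Jaccard) score; the function names and every taxonomy path except the India ink path from the slide are hypothetical, and nothing here is claimed to be how any particular product works. The score only ranks candidates; a person still confirms the mapping.

```python
import re

# Hypothetical taxonomy paths; only the India ink path appears on the slide.
TAXONOMY = [
    "Office Supplies->Ink and lead refills->India ink",
    "Office Supplies->Ink and lead refills->Pen refills",
    "Office Supplies->Paper products->Notebooks",
]


def tokens(text):
    """Return the set of lower-cased alphabetic words in text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def suggest(description, taxonomy=TAXONOMY, top=2):
    """Rank taxonomy paths by Jaccard word overlap with a supplier description.

    Returns up to `top` (score, path) pairs, best first. These are hints
    for a human reviewer, not an automatic classification.
    """
    d = tokens(description)
    scored = []
    for path in taxonomy:
        p = tokens(path)
        union = d | p
        score = len(d & p) / len(union) if union else 0.0
        scored.append((score, path))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top]
```

With dirty descriptions such as "fountain pen ink, black" the top hint can be a neighbouring category rather than the right one, which is why the suggestion belongs inside an interactive tool rather than a batch job.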
Themes in Content Access and Mapping
• Scalability in human terms
• “Content managers”, not geeks
• The name of the game: semi-automatic tools
– Statistical (“fuzzy”) techniques to provide hints (not silver bullets)
– Integrated into graphical programming-by-example interfaces
– Problem domains:
– Wrapper generation
– Data cleaning
– Schema mapping
– Taxonomy mapping
– Syndication
• One of the key “systems” challenges today
Road Map
• Setting
• Scenarios & Terminology
• Characteristics and Challenges of Content Integration
• Research Evangelism
Query Processing Issues
• Content to be integrated is increasingly “uncacheable”
– Arm’s-length accessibility
– Business rules, not data
– E.g. custom content throughout the dataflow
– Volatile information
– E.g. Availability
• Yet a great deal of content is cacheable and slowly changing
• Upshot: need a combined technology
– Prefetch/Cache/Replicate when possible
– Query live when impossible
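The "cache when possible, query live when impossible" policy can be sketched as follows. This is an illustrative toy under stated assumptions: the class name `ContentCache`, the `cacheable` flag, and the time-to-live policy are all hypothetical, and `fetch` stands in for whatever call reaches the remote provider.

```python
import time


class ContentCache:
    """Serve slowly changing content from a local copy; always go to the
    source for content flagged as uncacheable (e.g. availability)."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self.fetch = fetch            # callable(key) -> value, hits the provider
        self.ttl_seconds = ttl_seconds
        self.clock = clock            # injectable so tests can control time
        self._entries = {}            # key -> (time_stored, value)

    def get(self, key, cacheable=True):
        if not cacheable:
            # Volatile or rule-driven content: query live every time.
            return self.fetch(key)
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            return entry[1]           # fresh enough: answer from the cache
        value = self.fetch(key)       # cache miss or expired: refetch
        self._entries[key] = (now, value)
        return value
```

A product description might be read with `cache.get("sku-123/description")`, while stock levels would use `cache.get("sku-123/availability", cacheable=False)`; both keys are made up for illustration.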
Federated Query Processing
• DBMS community must shed our materialization myopia!
– ETL/Warehousing was inelegant and limited
– What do we do on a “cache miss”??
– Should be no distinction between materialized views and queries!
• Federated Query Processing
– Query across multiple sources
– Choose among multiple replicas, materialized views
– Consider staleness
• This is the natural extension of the modern database vision
– Cohera uses Mariposa’s economic model to do this
– Decouples optimization, cost estimation, storage and processing
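As a rough illustration of choosing among replicas and materialized views while considering staleness, the sketch below picks the cheapest source that is fresh enough for the query. It is a deliberately simplified stand-in, not Mariposa's or Cohera's actual economic protocol; the `Source` fields, the cost numbers, and the function name are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Source:
    name: str
    cost: float          # hypothetical cost estimate quoted by the site
    staleness_s: float   # age of the copy in seconds; 0 for the live source


def choose_source(sources: Iterable[Source],
                  max_staleness_s: float) -> Optional[Source]:
    """Return the cheapest source whose copy is fresh enough, or None."""
    fresh_enough = [s for s in sources if s.staleness_s <= max_staleness_s]
    return min(fresh_enough, key=lambda s: s.cost, default=None)
```

A query that tolerates hour-old data can be answered from a cheap materialized view, while a query that needs current data is routed to the live source at higher cost; the same selection logic covers both cases.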
Standard Queries Required
• Hand-coded queries are brittle: you want ad-hoc
– Don’t buy a handful of beans
• Need support for standard query languages
– SQL and XPath today
– SQL/XQuery tomorrow
• Everybody knows this!
– Part of industrial religion
– Oracle on one side
– Dotcoms on the other side
– You might get by claiming to be “XML compliant”
– But most people have cottoned on by now
IR capabilities need to be in the engine
• The best-integrated data will still be noisy (product names, etc.)
• Text search on taxonomies, names, descriptions
• Still no good integration of DBMS and IR engines
– Storage (compression huge in IR)
– Index concurrency (many updates per doc in IR)
– Query optimization challenges
• Note: this is not semi-structured querying!
– Integration of logic + statistics is the real model/query challenge
– Plus HCI issues
– Unify: “query”, “browse”, “mine”, “rank”
• Cohera integrates AltaVista into the engine & optimizer
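To make "logic + statistics" concrete, here is a toy query that combines an exact structured predicate (a price bound) with a text-relevance ranking over noisy product names. The scoring function, the sample rows, and all names are hypothetical; this is not AltaVista's ranking or Cohera's engine, only a sketch of the two styles of querying meeting in one operator pipeline.

```python
import re

# Hypothetical catalog rows with noisy, supplier-written names.
PRODUCTS = [
    {"sku": "A1", "name": "India ink, black, 1 oz", "price": 4.50},
    {"sku": "B2", "name": "Fountain pen ink, black", "price": 7.25},
    {"sku": "C3", "name": "Black toner cartridge", "price": 62.00},
]


def words(text):
    """Lower-cased alphanumeric words of text."""
    return re.findall(r"[a-z0-9]+", text.lower())


def text_score(query, text):
    """Fraction of distinct query words that occur in text (0.0 to 1.0)."""
    q, t = set(words(query)), set(words(text))
    return len(q & t) / len(q) if q else 0.0


def search(rows, query, max_price):
    """Filter by a structured predicate, then rank by text relevance."""
    candidates = [r for r in rows if r["price"] <= max_price]   # logic
    scored = [(text_score(query, r["name"]), r) for r in candidates]
    matches = [(s, r) for s, r in scored if s > 0]
    # Stable sort: rows with equal scores keep their catalog order.
    matches.sort(key=lambda pair: pair[0], reverse=True)        # statistics
    return [r["sku"] for _, r in matches]
```

An optimizer that understands both halves could decide whether to apply the price filter or the text index first, which is the kind of choice that is lost when the IR engine sits outside the DBMS.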
Core Systems Issues Remain Important
• Availability, Scalability, Load Balancing
– All critically important in the B2B space
– Availability: you don’t even control the components! Outage=news.
– Scalability: MRO wants to grow up to very big installations
– Load Balancing: need to respect SLAs, etc.
• Need adaptive, load balancing, federated QP
– 100s to 1000’s of “sites”
– Replication is key to availability, but optimizer must understand it
– Cohera’s economic model adapts for each query
– Other models being studied (see DE Bulletin 6/2000)
– Compile-time, centralized optimizers (R*, et al) will break
Query Processing: Themes
• Standards
• Logic + Statistics
• Adaptivity to changing performance, load, failures
• Optimizer Scalability
So What Really Matters Today?
• Cohera sells because…
– Customers need the content integration workbench today
– They are in integration pain!
– Comes in multiple guises (e-catalog, supplier enablement, etc.)
– Smart tools start cutting the pain immediately
– Customers want an open, standard solution
– Plain old SQL and relational schemas (vs. Requisite, e.g.)
– XML “in the bottom”, “out the top” for messaging/integration
– Customers want federated querying…tomorrow
– For today, they’ll settle for a centralized solution
– Want the flexibility to grow in that direction
– Federated query engine works fine centralized
– The converse clearly not true
Road Map
• Setting
• Scenarios & Terminology
• Characteristics and Challenges of Content Integration
• Research Evangelism
Research Evangelism
• Semi-Automatic Tools
– Statistical + logical techniques, with a user in the loop
– E.g. Potter’s Wheel [Raman/Hellerstein, VLDB ’01], http://control.cs.berkeley.edu
– schema integration algebra
– interactive visualization
– programming-by-example
– statistical inferencing for discrepancies and domain detection
– A new class of “systems” work!
– “Tools”/“Apps” must be part of our agenda
– Many systems challenges here, especially on the stat/HCI side
– Architectural elegance, API design, extensibility, scalability, etc.
Research Evangelism, Cont.
• Adaptive Query Processing
– Critical to the federated B2B space
– Unpredictable world, you don’t control the components
– Also critical to the ubiquitous computing space
– Sensors are the next challenge
– Who’s the DBA of your housepaint? The freeway lines?
– Economic optimization (Mariposa) is one model
– Finer-Grained adaptivity possible (Eddies, SIGMOD 2K)
– See http://telegraph.cs.berkeley.edu for examples, ideas, SW
Research Evangelism, Cont.
• Tired of research on relational? Choose wisely!
– One big direction here is to integrate IR
– Another is to abandon languages in favor of interfaces
– query+browse+mine: semi-automatic GUIs again!
• XML is critical to business, but under control
– We’re doing fine in this space, thank you
– XQuery will push (merge with?) SQL
– The end-result will resemble things you’ve seen before
• But text search is eating our lunch!
– Intellectual impact in the last decade?
– Industrial impact in the last decade?
– Text search is mostly “just” an access method + a sort metric
– Integrate into our composable algebras and architectures!
– Teach it in our undergrad classes
Summary
• Content Integration is a new, challenging industrial space
– Cohera provides the first complete solution
– Access with varying relationships, formats
– Support for multiple schemas and taxonomies
– Support for custom syndication
– Support for distributed data, both cacheable and uncacheable
– Ad hoc querying
– Fuzzy & structured search
– Availability, Scalability, Load Balancing
– Smart graphical tools for content managers
– A fertile area for research as well
– Join the fun!