Content Integration for E-Business
Joe Hellerstein

New Generation of e-Business on the Internet
• Companies moving beyond marketing, storefronts
• Attempting to do operations on the Internet
  – procurement
  – supply chain
  – customer relationships
  – etc.
• In a cross-enterprise environment
• Requires cross-enterprise content integration
  – catalog integration is the procurement instance of this problem

Content Integration
• Content integration across enterprises
  – Not the “in-house” data warehousing problem
  – Not the Enterprise App Integration (EAI) problem
• “Operational” data must be integrated
  – As opposed to historical (trend) data
  – E.g. pricing, availability, supply chain
• Structured and unstructured data
  – Not just relational or XML queries
  – Not just text search
  – A combination of the two: logic meets statistics

The “Butterfly”
• Everybody’s favorite picture c. 1/2000:
  [Figure labels: Marketplace, Suppliers, Buyers]
• At question (6/2001) is how many butterflies, who owns them
  – Not a startup opportunity (Transora vs. Chemdex)
  – Perhaps one of the wings is smaller than the other (Home Depot)

Road Map
• Setting
• Scenarios & Terminology
• Characteristics and Challenges of Content Integration
• Research Evangelism

Some Scenarios for Content Integration
• Catalog Management: Integration and Syndication
  – “MRO” (Maintenance, Repair and Operations) a la Grainger
  – Thousands of suppliers, run by a “content manager”
• Availability and Pricing
  – Travel industry
  – Necessitates live, cross-enterprise querying
• Supply Chain Management
  – E.g. auto industry
  – Increase in production requires the entire supply chain (“the cows”)
  – Contractual information along with catalog and availability

Marketing: The EcoSystem and its Terminology
• Enterprise Application Integration (EAI): App Glue
  – Imperative, message-oriented programming (scripting languages)
  – Transactional networking (persistent queues)
  – Gateways to popular packaged apps
  – Vendors: WebMethods, BEA, CrossWorlds, Netfish, MQSeries, etc.
• Data Integration: Warehousing and associated processes
  – Intra-enterprise, for “business intelligence” (historical trends)
  – Vendors: Informatica, Ascential, DBMS vendors
• Content Management: Tools for content creation
  – Web page and graphic design
  – Versioning and configuration management
  – Vendors: Vignette, Interwoven, etc.

Road Map
• Setting
• Scenarios & Terminology
• Characteristics and Challenges of Content Integration
  – Content Access, Mapping and Transformation
  – Query Processing
• Research Evangelism

Content Integration: Characteristics and Challenges
• New integration challenges for e-business
  – cross-enterprise
  – operational
  – data-centric (not app-centric)
  – structured/unstructured
• Two main thrusts
  – Content Access, Mapping and Transformation
  – Query Processing

Content Access: Relationships with Providers
• Varying relationships with content providers
  – Direct DBMS access (typically in-house)
  – Direct access to federated apps (SAP, etc.)
  – Gateway vendors a la Merant, NEON, Attunity, etc.
  – Arm’s-length relationships
  – HTML screen scraping
  – XML messaging
• Relationships evolve over time!
  – MySimon example

Content Mapping
• Syntactic and semantic integration
  – Formatting/normalization is one piece of the puzzle
  – XML, HTML, Relational, etc.
  – Semantics is much harder
  – E.g. “price”. E.g. “delivery”.
• Semantics gate the process
  – A “content manager” must own the transformation task
  – Ease of use critical
  – Home Depot has 60,000 suppliers!
  – Standards can help a bit (e.g. UDDI)
  – But graphical tools are the name of the game

Cohera Workbench

Schemas and Taxonomies
• Cross-enterprise = multiple schemas
  – Even if standards prevail (very optimistic)
  – Early e-catalog systems were locked into one schema
  – Great for service companies, e.g. Requisite
  – Tools are sounding the death knell
• Taxonomies are critical
  – Natural for browsing, especially with dirty data
  – “Black Ink”, “India ink”, “fountain pen ink, black”
  – Taxonomy per vertical market, plus standards like UNSPSC
  – Office Supplies->Ink and lead refills->India ink
  – Taxonomy as data: query it, browse it, etc.
• Integration task includes taxonomy integration!

Themes in Content Access and Mapping
• Scalability in human terms
• “Content managers”, not geeks
• The name of the game: semi-automatic tools
  – Statistical (“fuzzy”) techniques to provide hints (not silver bullets)
  – Integrated into graphical programming-by-example interfaces
  – Problem domains:
  – Wrapper generation
  – Data cleaning
  – Schema mapping
  – Taxonomy mapping
  – Syndication
• One of the key “systems” challenges today

Road Map
• Setting
• Scenarios & Terminology
• Characteristics and Challenges of Content Integration
• Research Evangelism

Query Processing Issues
• Content to be integrated is increasingly “uncacheable”
  – Arm’s-length accessibility
  – Business rules, not data
  – E.g. custom content throughout the dataflow
  – Volatile information
  – E.g. Availability
• Yet a great deal of content is cacheable and slowly changing
• Upshot: need a combined technology
  – Prefetch/Cache/Replicate when possible
  – Query live when impossible

Federated Query Processing
• DBMS community must shed our materialization myopia!
  – ETL/Warehousing was inelegant and limited
  – What do we do on a “cache miss”??
  – Should be no distinction between materialized views and queries!
• Federated Query Processing
  – Query across multiple sources
  – Choose among multiple replicas, materialized views
  – Consider staleness
• This is the natural extension of the modern database vision
  – Cohera uses Mariposa’s economic model to do this
  – Decouples optimization, cost estimation, storage and processing

Standard Queries Required
• Hand-coded queries are brittle: you want ad-hoc
  – Don’t buy a handful of beans
• Need support for standard query languages
  – SQL and XPath today
  – SQL/XQuery tomorrow
• Everybody knows this!
  – Part of industrial religion
  – Oracle on one side
  – Dotcoms on the other side
  – You might get by claiming to be “XML compliant”
  – But most people have cottoned on by now

IR capabilities need to be in the engine
• The best-integrated data will still be noisy (product names, etc)
• Text search on taxonomies, names, descriptions
• Still no good integration of DBMS and IR engines
  – Storage (compression huge in IR)
  – Index concurrency (many updates per doc in IR)
  – Query optimization challenges
• Note: this is not semi-structured querying!
  – Integration of logic + statistics is the real model/query challenge
  – Plus HCI issues
  – Unify: “query”, “browse”, “mine”, “rank”
• Cohera integrates AltaVista into the engine & optimizer

Core Systems Issues Remain Important
• Availability, Scalability, Load Balancing
  – All critically important in the B2B space
  – Availability: you don’t even control the components! Outage=news.
  – Scalability: MRO wants to grow up to very big installations
  – Load Balancing: need to respect SLAs, etc.
• Need adaptive, load-balancing, federated QP
  – 100s to 1000s of “sites”
  – Replication is key to availability, but optimizer must understand it
  – Cohera’s economic model adapts for each query
  – Other models being studied (see DE Bulletin 6/2000)
  – Compile-time, centralized optimizers (R* et al.) will break

Query Processing: Themes
• Standards
• Logic + Statistics
• Adaptivity to changing performance, load, failures
• Optimizer Scalability

So What Really Matters Today?
• Cohera sells because…
  – Customers need the content integration workbench today
  – They are in integration pain!
  – Comes in multiple guises (e-catalog, supplier enablement, etc.)
  – Smart tools start cutting the pain immediately
  – Customers want an open, standard solution
  – Plain old SQL and relational schemas (vs. Requisite, e.g.)
  – XML “in the bottom”, “out the top” for messaging/integration
  – Customers want federated querying…tomorrow
  – For today, they’ll settle for a centralized solution
  – Want the flexibility to grow in that direction
  – Federated query engine works fine centralized
  – The converse clearly not true

Road Map
• Setting
• Scenarios & Terminology
• Characteristics and Challenges of Content Integration
• Research Evangelism

Research Evangelism
• Semi-Automatic Tools
  – Statistical + logical techniques, with a user in the loop
  – E.g. Potter’s Wheel [Raman/Hellerstein, VLDB ‘01] http://control.cs.berkeley.edu
  – schema integration algebra
  – interactive visualization
  – programming-by-example
  – statistical inferencing for discrepancies and domain detection
  – A new class of “systems” work!
  – “Tools”/“Apps” must be part of our agenda
  – Many systems challenges here, especially on the stat/HCI side
  – Architectural elegance, API design, extensibility, scalability, etc.

Research Evangelism, Cont.
• Adaptive Query Processing
  – Critical to the federated B2B space
  – Unpredictable world, you don’t control the components
  – Also critical to the ubiquitous computing space
  – Sensors are the next challenge
  – Who’s the DBA of your housepaint? The freeway lines?
  – Economic optimization (Mariposa) is one model
  – Finer-Grained adaptivity possible (Eddies, SIGMOD 2K)
  – See http://telegraph.cs.berkeley.edu for examples, ideas, SW

Research Evangelism, Cont.
• Tired of research on relational? Choose wisely!
  – One big direction here is to integrate IR
  – Another is to abandon languages in favor of interfaces
  – query+browse+mine: semi-automatic GUIs again!
• XML is critical to business, but under control
  – We’re doing fine in this space, thank you
  – XQuery will push (merge with?) SQL
  – The end-result will resemble things you’ve seen before
• But text search is eating our lunch!
  – Intellectual impact in the last decade?
  – Industrial impact in the last decade?
  – Text search is mostly “just” an access method + a sort metric
  – Integrate into our composable algebras and architectures!
  – Teach it in our undergrad classes

Summary
• Content Integration is a new, challenging industrial space
  – Cohera provides the first complete solution
  – Access with varying relationships, formats
  – Support for multiple schemas and taxonomies
  – Support for custom syndication
  – Support for distributed data, both cacheable and uncacheable
  – Ad hoc querying
  – Fuzzy & structured search
  – Availability, Scalability, Load Balancing
  – Smart graphical tools for content managers
  – A fertile area for research as well
  – Join the fun!
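
Sketch: cacheable vs. uncacheable content
The "cache when possible, query live when impossible" idea from the query processing slides can be illustrated with a small Python sketch. This is only an illustration under stated assumptions, not Cohera's or Mariposa's design: the names HybridSource, fetch_live and max_staleness are hypothetical, and a real federated engine would also choose among replicas and weigh cost, which this sketch ignores.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class CachedValue:
    value: object
    fetched_at: float  # timestamp from the same clock used for lookups


class HybridSource:
    """Hypothetical wrapper: serve cached content while it is fresh enough,
    otherwise go to the remote provider ("query live")."""

    def __init__(self, fetch_live: Callable[[str], object],
                 clock: Callable[[], float] = time.monotonic):
        self._fetch_live = fetch_live   # assumed callable that reaches the provider
        self._clock = clock             # injectable so staleness can be tested
        self._cache: Dict[str, CachedValue] = {}
        self.live_calls = 0             # how often we had to query live

    def get(self, key: str, max_staleness: float) -> object:
        """Return the value for key, at most max_staleness seconds old.

        max_staleness == 0 marks the content as uncacheable (e.g. availability):
        every request is answered by a live query.
        """
        now = self._clock()
        entry = self._cache.get(key)
        if (entry is not None and max_staleness > 0
                and now - entry.fetched_at <= max_staleness):
            return entry.value          # cache hit, fresh enough
        value = self._fetch_live(key)   # "cache miss": query the source live
        self.live_calls += 1
        self._cache[key] = CachedValue(value, now)
        return value
```

A caller might ask for a slowly changing product description with a tolerance of an hour, `source.get("description:1234", 3600)`, and for volatile availability with no tolerance at all, `source.get("stock:1234", 0)`; both keys are made up for the example. Treating staleness as a per-request parameter keeps a single access path for cached and live content.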