Database Systems
Carlos Ordonez

What is “Database systems” research?
• Input? Large data sets, large files, relational tables
• How? Fast external algorithms; RAM-efficient data structures at two storage levels
• Efficiency? Desirable O(n) I/O
• Hardware? Small computer, single server, parallel DBMS server, parallel cluster; 1 disk, RAID
• Infrastructure? DBMS, parallel system
• Boring? Theory+programming

Database systems research today
• Transaction processing? Done
• Efficient querying? Done
• Fast external algorithms? Only for simple tasks
• Parallel computation? Shared-nothing DBMS is well proven, but there are still many challenges (big data)
• Exploiting new hardware? Difficult, low level
• Analyzing? Most difficult: data mining, statistics
• Future? Big data

DB Systems involves Core CS research: Theory+Programming
• Theory we use:
– Time complexity, I/O cost models
– Large data structures, especially external
– Relational model is here to stay
– Multivariate statistics, machine learning, discrete math
– Numerical methods: linear algebra, optimization
– Compilers: parsing/compiling/optimizing code; recursion
• Programming (even some hacking):
– Systems in a broad sense
– Languages: C, C++ (efficiency, pointers, legacy systems code); Java, C# mainly for portability
– Numerical libraries like LAPACK, OS thread libraries
– DBMS: SQL, UDFs, API with C, C++, C#

Research topics
• GOAL: Integrating statistical and machine learning algorithms with a DBMS (external algorithms, queries, UDFs)
• Difference from machine learning algorithms: size, external algorithms (small RAM), queries, low-level optimization, generally simpler models
• Main topics by students:
– Zhibo Chen: OLAP cubes, parametric statistical tests, cube ops on flash memory
– Mario Navas, Naveen Mohanam: Singular Value Decomposition for PCA and ML Factor Analysis, data summarization on multicore CPUs
– Carlos Garcia-Alvarado: keyword search across docs and db, ranking, query recommendation
– Sasi Pitchaimalai: Bayesian classification, multithreaded summarization
– Wellington Cabrera: stochastic search variable selection on high-dimensional data, SVD on high-d data
– David Matusevich: Hybrid EM and MCMC mixture models on large data sets, database transformations for data mining

Representative problems
• OLAP cubes
• Bayesian classification
• Finding predictive association rules
• Clustering, PCA and regression

Why is our database systems research “cool”?
• Theory+Programming
• Optimization, O(f(n)), systems (external data structures, discrete math, compiler, OS)
• Goes from hardware-level work (multi-core, cache memory) to high-level query optimization in SQL
• Database systems techniques are used in search engines like Google and Yahoo (and vice versa)
• DBMS technology is used everywhere

Why join DBMS group?
• Balance between theory (math) and programming
• We target “DB systems” conferences (ACM SIGMOD) and “IR/DM” conferences (ACM CIKM: IR+DB+DM)
• Mature and stable CS research area
• Job/internship: many opportunities in DBMS and search engines; job security in any large company
• Visit my web page, DBLP. Google “Ordonez SQL”
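
Illustration (not from the slides)
The “Fast external algorithms”, “Desirable O(n) I/O” and “data summarization” points above can be made concrete with a small sketch. It assumes a numeric CSV input, makes a single pass over it in fixed-size chunks so that only one chunk is in RAM at a time, and keeps three running summaries: the row count, the per-column sums and the per-column sums of squares. Means and variances are then derived from those summaries without rereading the data. The names `read_chunks`, `summarize` and `mean_and_variance`, and the chunk size, are hypothetical choices for this example, not the group's actual code.

```python
import csv
import io
from typing import Iterable, Iterator, List, Tuple


def read_chunks(rows: Iterable[List[str]], chunk_size: int) -> Iterator[List[List[float]]]:
    """Yield lists of at most chunk_size numeric rows, so RAM use stays bounded."""
    chunk: List[List[float]] = []
    for row in rows:
        chunk.append([float(v) for v in row])
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # final, possibly smaller, chunk
        yield chunk


def summarize(rows: Iterable[List[str]], d: int,
              chunk_size: int = 1000) -> Tuple[int, List[float], List[float]]:
    """One pass over the rows: every row is read exactly once (O(n) I/O).

    Returns (n, lin, sq): row count, per-column sums, per-column sums of squares.
    """
    n = 0
    lin = [0.0] * d
    sq = [0.0] * d
    for chunk in read_chunks(rows, chunk_size):
        for point in chunk:
            n += 1
            for j in range(d):
                lin[j] += point[j]
                sq[j] += point[j] * point[j]
    return n, lin, sq


def mean_and_variance(n: int, lin: List[float],
                      sq: List[float]) -> Tuple[List[float], List[float]]:
    """Derive per-column mean and population variance from the summaries alone."""
    means = [s / n for s in lin]
    variances = [q / n - m * m for q, m in zip(sq, means)]
    return means, variances


if __name__ == "__main__":
    # io.StringIO stands in for a large file on disk (assumption for the demo).
    data = io.StringIO("1,2\n3,4\n5,6\n")
    n, lin, sq = summarize(csv.reader(data), d=2, chunk_size=2)
    print(n, lin, sq)
    print(mean_and_variance(n, lin, sq))
```

The summaries are additive, so the same idea carries over to a SQL aggregate query (`COUNT`, `SUM(x)`, `SUM(x*x)`) or to partial results computed per thread and then added together. The formula `sq/n - mean^2` is the simplest form and can lose precision when the variance is small relative to the mean.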
