Database Systems
Carlos Ordonez
What is “Database systems” research?
• Input? large data sets, large files, relational tables
• How? Fast external algorithms; RAM-efficient
data structures at two storage levels
• Efficiency? Ideally O(n) I/O
• Hardware? Small computer, single server, parallel
DBMS server, parallel cluster; 1 disk, RAID
• Infrastructure? DBMS, parallel system
• Boring? Theory+programming
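
A minimal sketch of the "O(n) I/O, RAM-efficient" idea above, assuming the input is a text stream with one number per line (the function name `scan_mean_variance` and the input layout are illustrative assumptions, not from the slides). It reads each record once and keeps only three running totals in memory:

```python
import io


def scan_mean_variance(stream):
    """One sequential pass over a stream of numbers, one per line.

    Only three running totals are kept in RAM, so memory use does not
    grow with the input size and each record is read exactly once.
    """
    n = 0
    total = 0.0
    total_sq = 0.0
    for line in stream:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        x = float(line)
        n += 1
        total += x
        total_sq += x * x
    if n == 0:
        raise ValueError("empty input")
    mean = total / n
    variance = total_sq / n - mean * mean  # population variance
    return n, mean, variance


if __name__ == "__main__":
    # A small in-memory stream stands in for a large file on disk.
    print(scan_mean_variance(io.StringIO("1\n2\n3\n4\n")))
```

The sum-of-squares formula is chosen here for brevity; on data with a large mean and small spread it can lose precision, and a more careful update rule may be preferable.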
Database systems research today
• Transaction processing? Done
• Efficient querying? Done
• Fast external algorithms? Simple tasks
• Parallel computation? Well-proven shared-nothing DBMS, but still many challenges (big data)
• Exploiting new hardware? Difficult, low level
• Analyzing? Most difficult: data mining, statistics
• Future? Big data
DB Systems involves Core CS research:
Theory we use:
– Time complexity, I/O cost models
– Large data structures; especially external
– Relational model is here to stay
– Multivariate statistics, machine learning, discrete math
– Numerical methods: linear algebra, optimization
– Compilers: parsing/compiling/optimizing code; recursion
Programming (even some hacking):
– Systems in a broad sense
– Languages: C, C++; efficiency, pointers, legacy systems code; Java, C# mainly for
– Numerical libraries like LAPACK, OS thread libraries
• UDFs
• API with C, C++, C#
Research topics
GOAL: Integrating statistical and machine learning algorithms with a DBMS
(external algorithms, queries, UDFs)
Difference from machine learning algorithms: data size, external algorithms (small
RAM), queries, low-level optimization, generally simpler models
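
A hedged sketch of what "integrating a statistical computation with a DBMS through queries and UDFs" can look like. SQLite (via Python's standard `sqlite3` module) stands in for the DBMS here purely as an assumption for illustration; the aggregate name `var_pop`, the table `t`, and the helper `variance_by_group` are hypothetical:

```python
import sqlite3


class VarPop:
    """User-defined aggregate: population variance computed in one scan."""

    def __init__(self):
        self.n = 0
        self.s = 0.0   # running sum
        self.q = 0.0   # running sum of squares

    def step(self, x):
        # Called once per row by the database engine.
        if x is None:
            return
        self.n += 1
        self.s += x
        self.q += x * x

    def finalize(self):
        # Called once per group after the last row.
        if self.n == 0:
            return None
        mean = self.s / self.n
        return self.q / self.n - mean * mean


def variance_by_group(rows):
    """Load (group, value) rows and run the aggregate inside a SQL query."""
    conn = sqlite3.connect(":memory:")
    conn.create_aggregate("var_pop", 1, VarPop)
    conn.execute("CREATE TABLE t (g TEXT, x REAL)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT g, COUNT(*), AVG(x), var_pop(x) FROM t GROUP BY g ORDER BY g"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    print(variance_by_group([("a", 1.0), ("a", 3.0), ("b", 2.0)]))
```

The point of the design is that the statistic is evaluated where the data lives: the engine drives the scan and grouping, and the UDF only maintains a small summary per group.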
Main topics by students:
– Zhibo Chen: OLAP cubes, parametric statistical tests, cube ops on flash
– Mario Navas, Naveen Mohanam: Singular Value Decomposition for PCA
and ML Factor Analysis, data summarization on multicore CPUs
– Carlos Garcia-Alvarado: keyword search across docs and db, ranking,
query recommendation
– Sasi Pitchaimalai: Bayesian classification, multithreaded summarization
– Wellington Cabrera: stochastic search variable selection on high-dimensional
data, SVD on high-dimensional data
– David Matusevich: Hybrid EM and MCMC mixture models on large data
sets, database transformations for data mining
Representative problems
• OLAP cubes
• Bayesian classification
• Finding predictive association rules
• Clustering, PCA, and regression
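
As an illustration of the first problem listed above, here is a minimal sketch of an OLAP cube computed as one GROUP BY query per subset of dimensions. SQLite is again only an assumed stand-in DBMS, and the function `cube` and its parameters are hypothetical. With d dimensions this issues 2^d aggregation queries, which is one reason cube computation is costly on large tables:

```python
import sqlite3
from itertools import combinations


def cube(conn, table, dims, measure):
    """Return {dimension subset: {dimension values: SUM(measure)}}.

    Table and column names are interpolated into the SQL text, so they
    must come from trusted code, never from user input.
    """
    result = {}
    for k in range(len(dims) + 1):
        for subset in combinations(dims, k):
            if subset:
                cols = ", ".join(subset)
                sql = (
                    f"SELECT {cols}, SUM({measure}) "
                    f"FROM {table} GROUP BY {cols}"
                )
            else:
                # Empty subset: the grand total over the whole table.
                sql = f"SELECT SUM({measure}) FROM {table}"
            cells = {}
            for row in conn.execute(sql):
                cells[tuple(row[:-1])] = row[-1]
            result[subset] = cells
    return result


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [("east", "a", 10), ("east", "b", 5), ("west", "a", 7)],
    )
    for dims, cells in cube(conn, "sales", ["region", "product"], "amount").items():
        print(dims, cells)
```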
Why is our database systems research “cool”?
• Theory+Programming
• Optimization, O(f(n)), systems (external data
structures, discrete math, compiler, OS)
• Goes from hardware-level stuff (multi-core, cache
memory), to high-level query optimization in SQL
• Database systems techniques are used in search
engines like Google and Yahoo (and vice-versa)
• DBMS technology used everywhere
Why join DBMS group?
• Balance between theory (math) and programming
• We target “DB systems” conferences (ACM SIGMOD)
and “IR/DM” conferences (ACM CIKM: IR+DB+DM)
• Mature and stable CS research area
• Jobs/internships: many opportunities in DBMS and
search engines; job security at any large company
• Visit my web page, DBLP. Google “Ordonez SQL”

DMTUTORIAL - University of Houston