Restore:
Reusing results of mapreduce jobs
Jun Fan
Outline
Introduction to MapReduce
Overview of ReStore
—
Implementation Details
—
Experiments
—
www.nordridesign.com
LOGO
LOGO
Introduction to MapReduce
MapReduce and its implementations such
as Hadoop are common in Facebook,
Yahoo, and Google as well as smaller
companies
 Use high-level query languages such as Pig
to express their complex analysis tasks.
 Translate queries into workflows of
MapReduce jobs, output is stored in the
distributed file system

www.nordridesign.com
LOGO
Introduction to MapReduce
www.nordridesign.com
LOGO
Introduction to MapReduce
High-level language translate an query
into physical operators(Join, Select)
 Embed all physical operators into mapper
and reducer stages
 Compiler generates code for each
MapReduce job and passes it to the
MapReduce system
 ReStore extends this dataflow to reuse
the output of physical operators

www.nordridesign.com
LOGO
Overview ReStore
ReStore improves the performance of
workflows by storing the intermediate
results and reusing them
 Enable queries submitted at different
times to share results
 Built on top of dataflow language
processors

www.nordridesign.com
LOGO
Overview ReStore
Reuse job outputs previously stored
 Store the outputs of executed jobs for
future reuse
 Create more reuse opportunities by
storing the outputs of sub-jobs
 Selects the outputs of jobs to
 Rewrites a MapReduce job and submits it
to the MapReduce system to be
executed.

www.nordridesign.com
LOGO
Overview ReStore
www.nordridesign.com
LOGO
Overview ReStore
www.nordridesign.com
LOGO
Overview ReStore
www.nordridesign.com
LOGO
Types of Result Reuse
www.nordridesign.com
LOGO
Types of Result Reuse

If all dependant jobs of Join are stored in
the system, Ttotal (Join) = ET(Jobn)

Parts of the query execution plan are
stored in the system
www.nordridesign.com
LOGO
Types of Result Reuse
www.nordridesign.com
LOGO
ReStore System Architecture
Input is a workflow of MapReduce jobs
generated by a dataflow system
 Outputs are:

◦ A modified MapReduce workflow that
exploits prior executed jobs stored by
ReStore
◦ A new set of job outputs to store in the
distributed file system.
www.nordridesign.com
LOGO
ReStore System Architecture

Repository to manage the stored
MapReduce job outputs:
◦ The physical query execution plan of the
MapReduce job (Input, output, operators)
◦ The filename of the output in the distributed
file system,
◦ Statistics about the MapReduce job and the
frequency of use of this output by different
workflows. (size of input and output,
execution time)
www.nordridesign.com
LOGO
ReStore System Architecture
www.nordridesign.com
LOGO
Plan Matcher and Rewritere
Goal is to find physical plans in the
repository that can be used to rewrite
the input workflow
 Before a job is matched against the
repository, all other jobs that it depends
on have to be matched and rewritten to
use the job outputs stored in the
repository

www.nordridesign.com
LOGO
Plan Matcher and Rewritere
www.nordridesign.com
LOGO
Plan Matcher and Rewritere

The flow:
◦ Scan sequentially through the physical plans in
the repository
◦ Rewrite it to use the matched physical plan in
the repository
◦ After rewriting, a new sequential scan through
the repository is started
◦ If a scan does not find any matches, ReStore
proceeds to matching the next MapReduce
job in the workflow
www.nordridesign.com
LOGO
Plan Matcher and Rewritere

Two operators are equivalent if:
◦ Their inputs are pipelined from operators
that are equivalent or from thesame data sets
◦ They perform functions that produce the
same output data
www.nordridesign.com
LOGO
Plan Matcher and Rewritere
ReStore uses the first match that it finds
in the repository
 Rules to order the physical plans in the
repository

◦ Plan A is preferred to plan B if all the
operators in plan B have equivalent operators
in plan A
◦ The ratio between the size of the input data
and output data and the execution time of the
MapReduce job
www.nordridesign.com
LOGO
The Repository

Can we treat all possible sub-jobs as
candidates? NO!!!
◦ Require a substantial amount of storage the
overhead of storing all
◦ The intermediate data would considerably
slow down the execution of the input
MapReduce job
www.nordridesign.com
LOGO
The Repository

Two heuristics for choosing candidate
sub-jobs:
◦ Conservative Heuristic: Use the outputs of
operators that are known to reduce their
input size (Project, Filter)
◦ Aggressive Heuristic: Use the outputs of
operators that are known to be expensive
(Project,Filter, Join, Group)
www.nordridesign.com
LOGO
The Repository
www.nordridesign.com
LOGO
The Repository

Rules to keep a candidate job in the
repository:
◦ The size of its output data is smaller than the
size of its input data
◦ There will be a reduction in execution time
for workflows reusing this job
www.nordridesign.com
LOGO
The Repository

Rules to evict a candidate job in the
repository:
◦ Evict a job from the repository if it has not
been reused within a window of time
◦ Evict a job from the repository if one or
more of its inputs is deleted or modified
www.nordridesign.com
THANK YOU
Questions!
Descargar

PowerPoint Template