Harvard’s PASS Takes on
The Provenance Challenge
September 13, 2006
Margo Seltzer
Harvard University
Division of Engineering and Applied Sciences
Reminder: What is PASS?
• Storage systems (e.g., file systems) in which
provenance is a first class entity.
• Provenance:
– is generated and maintained as transparently as possible.
– can be indexed and queried.
– will be created from objects imported from non-PASS sources.
– is maintained in the presence of deletes, copies,
renames, etc.
Collecting Provenance
% sort a > b
open b (W)
exec “sort a”
open a (R)
read a
write b
close a
close b
argv=“sort a”
Inode cache
To file system
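The system-call trace above can be read as a recipe for provenance records. The following is a minimal sketch of that idea, not the PASS kernel code: the function name, the event tuples, and the (descendant, ancestor) edge format are all assumptions made for illustration.

```python
# Hypothetical model: one process, observed as a list of (syscall, argument)
# events. An edge (x, y) means "x was derived from y".
def provenance_from_trace(proc, events):
    edges = []
    argv = None
    for syscall, arg in events:
        if syscall == "exec":
            argv = arg                      # remember the command line
        elif syscall == "read" and (proc, arg) not in edges:
            edges.append((proc, arg))       # the process depends on what it read
        elif syscall == "write" and (arg, proc) not in edges:
            edges.append((arg, proc))       # the file depends on its writer
    return edges, argv

# The trace from the slide for: sort a > b
trace = [
    ("open", "b"), ("exec", "sort a"), ("open", "a"),
    ("read", "a"), ("write", "b"), ("close", "a"), ("close", "b"),
]
edges, argv = provenance_from_trace("sort", trace)
print(edges, argv)
```

Under these assumptions, b ends up depending on the sort process, which in turn depends on a.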
Things to Keep in Mind
• Our focus is provenance collection, not query.
• We collect provenance of everything.
• Provenance collection is done in the
operating system.
• Queries are simply queries against the
database maintained by the kernel.
• Our kernel database is Berkeley DB.
Results Summary
• Workflow: we ran the shell script
– Dropped in all the programs and simply ran them
on Linux.
– Chose not to run the slicer, because the license
worried us.
• Query: command-line query tool: nq
– Successfully ran all queries
– Generated a lot more output than you really want.
– Strategy is to keep everything and provide pruning
to let users see what they want.
Query Tool: NQ
• General form:
• SELECTION: select FIELD … from
nameof(FIELD-NAME), typeof(FIELD-NAME)
SEARCH: ancestors FILE*, descendents FILE*, everything
FILTER: depth NUM, anchor EXPR, hide TYPE, where EXPR
OUTPUT-TYPE: report, report html, table
EXPR: existing, nonexisting, EXPR op EXPR
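To make the SEARCH and FILTER parts concrete, here is a rough sketch of what an ancestors or descendents search with a depth limit could look like over a small in-memory graph. It is illustrative only: the `search` function and the dictionary layout are assumptions, and the real tool queries the kernel's database rather than a Python dictionary.

```python
from collections import deque

def search(inputs_of, start, direction="ancestors", depth=None):
    """Breadth-first walk; returns {object: distance from start}.

    inputs_of maps each object to the objects it was derived from.
    """
    if direction == "descendents":
        # Invert the edges so we can walk from inputs to outputs.
        graph = {}
        for node, parents in inputs_of.items():
            for parent in parents:
                graph.setdefault(parent, []).append(node)
    else:
        graph = inputs_of
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth is not None and seen[node] >= depth:
            continue                        # depth limit reached on this path
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return seen

# Toy graph for: sort a > b
inputs_of = {"b": ["sort"], "sort": ["a"]}
print(search(inputs_of, "b"))
print(search(inputs_of, "a", direction="descendents", depth=1))
```

The starting object is included in the result, which matches the Q1 report listing atlas-x.gif itself first.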
Q1: Provenance of Graphic X
• nq 'ancestors atlas-x.gif report'
922.0 [passfile; challenge/atlas-x.gif] version 1
type: passfile
name: challenge/atlas-x.gif
input: 922.2 [proc; pid 2937; /usr/local/bin/convert] version 0
annotation: dim=x
annotation: run=base
annotation: studyModality=mindreading
• And 4806 other objects…
• Results: QUERIES\q1.html
Q2: Q1 excluding everything prior to softmean
• Query:
nq 'ancestors atlas-x.gif anchor
(type == "proc" &&
name == "AIR5.2.5/bin/softmean")'
• Result: essentially a subset of Q1
– “only” 148 objects identified
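One way to picture the anchor filter is as a traversal that stops at objects matching the anchor expression. The sketch below assumes exactly that semantics (the anchor object is kept, its own ancestors are not); the slides do not spell this out. The graph, with resliced1.img and resliced2.img as softmean inputs, is a simplified, hypothetical stand-in for the real workflow.

```python
def ancestors_until(inputs_of, start, is_anchor):
    """Depth-first ancestor walk that does not look past anchor objects."""
    seen = []
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.append(node)
        if is_anchor(node):
            continue                # keep the anchor, prune everything before it
        stack.extend(inputs_of.get(node, []))
    return seen

# Hypothetical, heavily simplified slice of the challenge workflow.
inputs_of = {
    "atlas-x.gif": ["convert"],
    "convert": ["atlas.img"],
    "atlas.img": ["softmean"],
    "softmean": ["resliced1.img", "resliced2.img"],
}
pruned = ancestors_until(inputs_of, "atlas-x.gif", lambda n: n == "softmean")
full = ancestors_until(inputs_of, "atlas-x.gif", lambda n: False)
print(pruned)
print(len(full) - len(pruned), "objects pruned")
```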
Q3: Q2 w/stages
• We did not create annotations to map to
stages, so this query degenerates to the
same one as Query 2.
Q4: align_warp w/specific
parameter values
• Query:
nq 'everything where
basename == "align_warp" &&
concat(argv) ~ "*-m 12*" &&
freezetime ~ "*Mon*"'
• Results:
– We did our run on Monday
– Returns 8 instances:
• Four from the main workload
• Four from the variant workload used in Query 7
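The where clause in this query combines an exact match with two pattern matches. A small sketch of that filter follows, assuming that `~` is a shell-style glob match and that concat(argv) joins the arguments with spaces; both are guesses about nq. The records and their field values are hypothetical.

```python
from fnmatch import fnmatchcase

def matches_q4(record):
    """basename == "align_warp" && concat(argv) ~ "*-m 12*" && freezetime ~ "*Mon*"."""
    return (
        record["basename"] == "align_warp"
        and fnmatchcase(" ".join(record["argv"]), "*-m 12*")
        and fnmatchcase(record["freezetime"], "*Mon*")
    )

# Hypothetical process records; only the field names come from the query.
records = [
    {"id": 1, "basename": "align_warp",
     "argv": ["align_warp", "in.img", "ref.img", "out.warp", "-m", "12", "-q"],
     "freezetime": "Mon 10:00:00"},
    {"id": 2, "basename": "align_warp",
     "argv": ["align_warp", "in.img", "ref.img", "out.warp", "-m", "6"],
     "freezetime": "Mon 10:00:05"},
    {"id": 3, "basename": "reslice",
     "argv": ["reslice", "out.warp", "resliced"],
     "freezetime": "Mon 10:00:09"},
    {"id": 4, "basename": "align_warp",
     "argv": ["align_warp", "in.img", "ref.img", "out.warp", "-m", "12"],
     "freezetime": "Tue 09:00:00"},
]
hits = [r["id"] for r in records if matches_q4(r)]
print(hits)
```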
Q5: images with max=4095
• Two alternate approaches:
– Three phase solution
• Create list of header files that are ancestors of align_warp
• Pass list of files to scanheader; “grep max=4095”
• Find all the descendants of the headers
– Annotation approach
• Run scanheader on all headers
• Make results of scanheader annotations
• Query on the annotations
– We used the first approach
Q5 Continued
• Create list of files to query
ALIGN_WARPS=`$NQ $NQOPTS 'select ident from everything where
type == "proc" &&
basename == "align_warp" table'`
`$NQ $NQOPTS 'select name from ancestors
{ '"$ALIGN_WARPS"' } depth where
basename ~ "*.hdr" table'`
• Call scanheader on everything returned
above, selecting those files where max=4095
Q5 Continued
• Query on the list returned above
nq 'descendents { anatomy1.hdr
anatomy2.hdr anatomy3.hdr
anatomy4.hdr }
where basename ~ "atlas*.gif" ||
basename ~ "atlas*.jpg"'
• Results
Q6: images produced by
softmean with a particular
align_warp parameter
• Three stage query:
– Find align_warp processes
ALIGN_WARPS=`nq 'select ident from everything where type == "proc" &&
basename == "align_warp" && concat(argv) ~ "*-m 12*" table'`
– Find appropriate softmean processes
SOFTMEANS=`nq 'select ident from descendents { '"$ALIGN_WARPS"' }
where type == "proc" && basename == "softmean" table'`
– Find images produced by softmean processes
nq 'select name from descendents { '"$SOFTMEANS"' } depth 1
where type == "passfile" && basename ~ "*.img" report'
• Results:
940.0 [passfile; challenge/q7/atlas.img] version 1
name: challenge/q7/atlas.img
917.0 [passfile; challenge/atlas.img] version 1
name: challenge/atlas.img
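The three stages chain together by feeding each stage's identifiers into the next search. The sketch below mimics that chaining on a toy graph. The graph, the object names other than atlas.img, and the `descendents` helper are hypothetical; only the stage structure follows the queries above.

```python
def descendents(outputs_of, starts, depth=None):
    """All objects reachable from starts (inclusive), optionally depth-limited."""
    frontier, found, level = set(starts), set(starts), 0
    while frontier and (depth is None or level < depth):
        frontier = {c for n in frontier for c in outputs_of.get(n, [])} - found
        found |= frontier
        level += 1
    return found

# Hypothetical graph: each object maps to the objects derived from it.
outputs_of = {
    "align_warp#1": ["warp1.warp"],
    "warp1.warp": ["reslice#1"],
    "reslice#1": ["resliced1.img"],
    "resliced1.img": ["softmean#1"],
    "softmean#1": ["atlas.img", "atlas.hdr"],
}
procs = {"align_warp#1", "reslice#1", "softmean#1"}

# Stage 1: assume the matching align_warp processes are already known.
align_warps = {"align_warp#1"}
# Stage 2: softmean processes downstream of them.
softmeans = {n for n in descendents(outputs_of, align_warps)
             if n in procs and n.startswith("softmean")}
# Stage 3: .img files written directly by those processes (depth 1).
images = {n for n in descendents(outputs_of, softmeans, depth=1)
          if n not in procs and n.endswith(".img")}
print(sorted(images))
```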
Q7: Difference between
original and new workflow
• We use standard diff of textual output
nq 'ancestors atlas-x.gif report' > q7-a.tmp
nq 'ancestors q7/atlas-x.jpg report' > q7-b.tmp
diff -u q7-a.tmp q7-b.tmp
• Result:
-922.0 [passfile; challenge/atlas-x.gif] version 1
+945.0 [passfile; challenge/q7/atlas-x.jpg] version 1
 type: passfile
-name: challenge/atlas-x.gif
-input: 922.2 [proc; pid 2937; /usr/local/bin/convert] version 0
+name: challenge/q7/atlas-x.jpg
+input: 945.2 [proc; pid 2961; /usr/bin/pnmtojpeg] version 0
 annotation: dim=x
-annotation: run=base
-annotation: studyModality=mindreading
+annotation: run=q7
+annotation: studyModality=visual
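The same comparison can be sketched without the command-line diff. The example below uses Python's standard difflib on a few report lines taken from the two runs. It is an illustration of the textual-diff approach, not part of the PASS tooling, and the line lists are abbreviated.

```python
import difflib

# Abbreviated report lines for the base run and the Q7 variant.
base = [
    "922.0 [passfile; challenge/atlas-x.gif] version 1",
    "type: passfile",
    "annotation: dim=x",
    "annotation: run=base",
]
variant = [
    "945.0 [passfile; challenge/q7/atlas-x.jpg] version 1",
    "type: passfile",
    "annotation: dim=x",
    "annotation: run=q7",
]

diff = list(difflib.unified_diff(
    base, variant, fromfile="q7-a.tmp", tofile="q7-b.tmp", lineterm=""))

# Keep only added and removed lines, skipping the ---/+++ file headers.
changed = [line for line in diff
           if line.startswith(("+", "-"))
           and not line.startswith(("+++", "---"))]
print("\n".join(changed))
```

Because the comparison is purely textual, differing object numbers and pids show up as changes even when the underlying steps are the same.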
Q8: Find UChicago align_warp
• Three stage query:
– Find everything annotated with UChicago
• INPUTS=`nq 'select ident from everything where
$center == "UChicago" table'`
– Find the align_warp processes that take those UChicago objects as direct inputs
• WARPS=`nq 'select ident from descendents {
'"$INPUTS"' } depth 1 where type == "proc" &&
basename == "align_warp" table'`
– Now, find all the outputs of those processes
• nq 'descendents { '"$WARPS"' } anchor type ==
"passfile" where type == "passfile" report'
Q8 Continued
• Results
930.0 [passfile; challenge/q7/warp3.warp] version 1
type: passfile
name: challenge/q7/warp3.warp
929.0 [passfile; challenge/q7/warp2.warp] version 1
type: passfile
name: challenge/q7/warp2.warp
907.0 [passfile; challenge/warp3.warp] version 1
type: passfile
name: challenge/warp3.warp
906.0 [passfile; challenge/warp2.warp] version 1
type: passfile
name: challenge/warp2.warp
Q9: Find user annotations for
objects where some annotations
have a given value
• Setup
– We added annotations to all six output images
– We annotated one set of outputs with studyModality=visual and the
other with studyModality=mindreading.
• Query
nq 'select annotations from everything where (basename ~
"atlas*.gif" ||
basename ~ "atlas*.jpg") &&
($studyModality == "speech" ||
$studyModality == "visual" ||
$studyModality == "audio")'
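In this query, `$studyModality` refers to an annotation rather than a built-in field. A short sketch of the selection logic follows, assuming annotations behave like a per-object key/value map and `~` is a glob match; both are assumptions about nq. The two objects and their annotations are taken from the reports shown in these slides.

```python
from fnmatch import fnmatchcase

objects = [
    {"name": "challenge/atlas-x.gif",
     "annotations": {"dim": "x", "run": "base", "studyModality": "mindreading"}},
    {"name": "challenge/q7/atlas-x.jpg",
     "annotations": {"dim": "x", "run": "q7", "studyModality": "visual"}},
]

def matches_q9(obj):
    """Atlas graphic whose studyModality is speech, visual, or audio."""
    basename = obj["name"].rsplit("/", 1)[-1]
    is_atlas = (fnmatchcase(basename, "atlas*.gif")
                or fnmatchcase(basename, "atlas*.jpg"))
    modality = obj["annotations"].get("studyModality")
    return is_atlas and modality in {"speech", "visual", "audio"}

# "select annotations": report every annotation of each matching object.
selected = {o["name"]: o["annotations"] for o in objects if matches_q9(o)}
print(selected)
```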
Q9 Continued
• Results
947.0 [passfile; challenge/q7/atlas-z.jpg] version 1
annotation: dim=z
annotation: run=q7
annotation: studyModality=visual
946.0 [passfile; challenge/q7/atlas-y.jpg] version 1
annotation: dim=y
annotation: run=q7
annotation: studyModality=visual
945.0 [passfile; challenge/q7/atlas-x.jpg] version 1
annotation: dim=x
annotation: run=q7
annotation: studyModality=visual
• We have the data
• We are not UI people
• Output is remarkably complete
– Sometimes makes it difficult to extract the
information you want.
• Output is BIG if you ask for everything, but …
• … you can ask for everything and get it

PASS: Provenance-Aware Storage Systems