(def functor BeeSpace v3)
Core BSv3 Features
• Personalized Collections
– All functions operate on virtual collections.
• Gene Analysis Functions
– Gene Annotation, Summarization, etc.
• Topic Exploration
– Evolve/extract/expand/map/compare topics.
Challenges
New problems:
• Modify all functions to operate on virtual collections.
• Build Intelligent Gene Retrieval functionality.
• DB-supported apps.
• Optimize & parallelize implementations.
• Mod Indexing strategies.
Constraint: 5 month timeline
Old problems:
• Better tokenization needed.
• Teaming and code sharing.
• Upgrade of Lemur & Indri.
• Structured Queries.
• Multiple languages; diverse skill sets.
Big Hill, Little Time
Accomplishments
• Gene-focus tokenization scheme implemented.
• Intelligent Gene Retrieval function near
completion.
• Optimized & parallelized (EM) Theme
Clustering.
• Developed DB infrastructure for application
support: multiple DBs, tables, DAO access.
• Developed Common Library (4K lines C++) for
sharing across applications.
Accomplishments cont…
• Upgraded Lemur/Indri and normalized
indexing.
• Developed (6) Collection operations and
Boolean Query support.
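A minimal sketch of how collection operations could back Boolean queries. It assumes a virtual collection is a sorted, duplicate-free list of document ids; the names Collection, coll_and, coll_or and coll_not are hypothetical and are not taken from the BSv3 code base.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Assumption: a virtual collection is a sorted, duplicate-free list of
// document ids. This is an illustrative stand-in, not the BSv3 structure.
using Collection = std::vector<int>;

// The std::set_* algorithms require sorted input, so check it up front.
bool is_valid(const Collection& c) {
    return std::is_sorted(c.begin(), c.end()) &&
           std::adjacent_find(c.begin(), c.end()) == c.end();
}

// Boolean AND: documents present in both collections.
Collection coll_and(const Collection& a, const Collection& b) {
    assert(is_valid(a) && is_valid(b));
    Collection out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}

// Boolean OR: documents present in either collection.
Collection coll_or(const Collection& a, const Collection& b) {
    assert(is_valid(a) && is_valid(b));
    Collection out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::back_inserter(out));
    return out;
}

// Boolean AND NOT: documents in a that are absent from b.
Collection coll_not(const Collection& a, const Collection& b) {
    assert(is_valid(a) && is_valid(b));
    Collection out;
    std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
                        std::back_inserter(out));
    return out;
}
```

Because every operation returns another Collection, a query such as (A AND B) AND NOT C can be written as nested calls, e.g. coll_not(coll_and(A, B), C).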
Development Life Cycle
[Figure: system development life cycle (waterfall).]
• Phases: Requirements Definition, Analysis, Design, Implementation, Testing, Deployment.
• Deliverables: Project Goals; Conceptual Models; Logical/Physical Models; Software Libraries & Applications; Scaffolding & Testing Software; Deployment & Backup; Documentation & Req’s Tracking.
Logical View
• Application Layer
– BeeSpace Navigator
• Query & Analysis Layer
– Search Engine, Gene Analysis, Topic Exploration, Clustering
– Common Software Library
• Data & Knowledge Layer
– XML, Full Text Indexes, RDBMS
• Data Processing Layer
– Tokenization, Named Entity Recognition, Classifiers, Indexers
Mining & Analysis of EAR Graphs
• Need to be able to quickly analyze and mine
knowledge nets and EAR graphs with ad-hoc
operations. User-driven exploration via 4GL
query language.
• Many data models already fit within an
Entity/Attribute/Relationship (EAR) model. Very
flexible design.
• 4GL approach to operations: increases target
audience and allows for query optimization:
select src, sum(wt) from edge group by src order by src;
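A minimal sketch of what such a query computes in the C++ core layer, assuming an edge table with columns src, trg and wt and assuming the grouping is on the source column so that the selected column and the grouping key agree. The Edge struct and the function name sum_weight_by_src are hypothetical.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical row of the "edge" table: source node, target node, weight.
struct Edge {
    std::string src;
    std::string trg;
    double wt;
};

// Equivalent of: select src, sum(wt) from edge group by src order by src;
// std::map keeps its keys sorted, which provides the "order by src" part.
std::map<std::string, double> sum_weight_by_src(const std::vector<Edge>& edges) {
    std::map<std::string, double> totals;
    for (const Edge& e : edges) {
        totals[e.src] += e.wt;  // operator[] starts a new key at 0.0
    }
    return totals;
}
```

A declarative query leaves the choice of execution plan to an optimizer, whereas this loop fixes one plan (a single scan with an ordered map).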
Sample EAR Data Model
[Figure: sample EAR graph with parts labeled Entity, Relation, and Attribute. Values shown in the figure:]
• Topic: 11; Collection: fly; Weight: 0.10; Joint: 0.11; MI: 0.023
• Class: gene; Year: 2001; Term: amfor; Prior: 0.07
• Doc: pmid:123; Length: 133; DFreq: 4
• Class: behavior; Year: 1995; Prior: 0.12; Term: forage
• Doc: pmid:444; Length: 401; DFreq: 83
• Weight: 0.10; Joint: 0.03; Topic: 23; MI: 0.012; Collection: bee
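One way to hold an EAR model in memory, as a rough sketch: entities and relations both carry a free-form attribute map, which is what makes the design flexible. All type and member names here (Attributes, Entity, Relation, EarGraph, add_entity, relate, find) are hypothetical, and attribute values are stored as strings purely for simplicity.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Attribute name -> value, e.g. "Class" -> "gene". Strings keep the sketch simple.
using Attributes = std::map<std::string, std::string>;

struct Entity {
    Attributes attrs;
};

// A relation links two entities (by index) and has attributes of its own.
struct Relation {
    std::size_t src;
    std::size_t trg;
    Attributes attrs;
};

struct EarGraph {
    std::vector<Entity> entities;
    std::vector<Relation> relations;

    // Adds an entity and returns its index.
    std::size_t add_entity(const Attributes& attrs) {
        entities.push_back(Entity{attrs});
        return entities.size() - 1;
    }

    // Adds a relation between two existing entities.
    void relate(std::size_t src, std::size_t trg, const Attributes& attrs) {
        relations.push_back(Relation{src, trg, attrs});
    }

    // Returns the indices of all entities whose attribute `key` equals `value`.
    std::vector<std::size_t> find(const std::string& key,
                                  const std::string& value) const {
        std::vector<std::size_t> hits;
        for (std::size_t i = 0; i < entities.size(); ++i) {
            auto it = entities[i].attrs.find(key);
            if (it != entities[i].attrs.end() && it->second == value) {
                hits.push_back(i);
            }
        }
        return hits;
    }
};
```

New attribute names can be introduced without changing any type, at the cost of losing compile-time checks on attribute names and value types.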
Approach
• Adopt a layered system (stack) approach: Data
layer, core software layer, interpreter, GUI.
Possibility for administration client.
• Data Layer: currently implemented on top of
RDBMS. Achieves flexibility, outreach/reuse, and
is often cluster-compatible.
• Core Software Layer: C++ STL implementation
with functional programming paradigm.
• Best-of-Worlds Effect: combines salient features
of relational modeling with functional power and
adaptive-object modeling methodology.
Motivation #1
(define (deriv exp var)
  (cond ((constant? exp) 0)
        ((variable? exp) (if (same-variable? exp var) 1 0))
        ((sum? exp) (make-sum (deriv (addend exp) var)
                              (deriv (augend exp) var)))
        ….))
Ref: “Structure and Interpretation of Computer
Programs”, Abelson et al.
Motivation #2
• C++ STL:
struct myFunctor : std::unary_function<int, double>
{ double operator()(int x) const { return 2.0*x; } };
std::for_each(list.begin(), list.end(), myFunctor());
std::set_intersection(a.begin(), a.end(),
b.begin(), b.end(), result.begin());
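A self-contained sketch of the functor idiom from this slide. It uses plain structs with operator() rather than an adaptor base class (std::unary_function was deprecated in C++11 and removed in C++17). The names Doubler, Accumulate, apply_doubler and sum_doubled are hypothetical.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Stateless functor: maps an int to twice its value as a double.
struct Doubler {
    double operator()(int x) const { return 2.0 * x; }
};

// Stateful functor: accumulates a running total of the values it sees.
struct Accumulate {
    double total = 0.0;
    void operator()(double v) { total += v; }
};

// Applies Doubler to every element and collects the results.
std::vector<double> apply_doubler(const std::vector<int>& xs) {
    std::vector<double> out;
    out.reserve(xs.size());
    std::transform(xs.begin(), xs.end(), std::back_inserter(out), Doubler());
    return out;
}

// std::for_each returns the functor it was given, so the accumulated
// state can be read back after the traversal.
double sum_doubled(const std::vector<int>& xs) {
    std::vector<double> doubled = apply_doubler(xs);
    Accumulate acc = std::for_each(doubled.begin(), doubled.end(), Accumulate());
    return acc.total;
}
```

std::transform is used where a result sequence is wanted; std::for_each fits when the functor is applied only for its side effects or accumulated state.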
Applications
• Concept switching.
• Theme extraction.
• Theme expansion, shrinking, morphing.
• Path finding; net flow analysis.
• Support for propagation nets, belief nets.
• Clustering, clique finding, etc.
• *** Not just standard CS/statistical algorithms, but utilizing semantic information and user directives.
Modeling a Concept Space
• Alternative definitions:
$F = \{ f : CS \to CS \}$
Case 1: powerset, $CS := 2^{C}$: implies F is a monoid w.r.t. function composition (*).
$\mathrm{coset}(g) := \{ f * g \mid f \in F \}$
F […] CS; k […] j […] Z […] f^k CS […] f^j
Case 2: finite vector space: F is R^n (choose generalized velocities).
$F = L(CS, CS)$
Case 3: random vectors: F can be modeled w/ functions over r.v.s.
F ~ f(x_1, …, x_n; […])
Theme => point; Region => set of subsets / compact set / distribution => need to capture variances ~ N(mu, sigma).
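For Case 1, the monoid structure can be sketched directly: operators are functions from subsets of concepts to subsets of concepts, composition is the monoid operation, and the identity function is the unit. This is only an illustration under the assumption that concepts are integer ids; Concept, Subset, Op, identity, compose, expand_with and shrink_by are all hypothetical names.

```cpp
#include <functional>
#include <set>

using Concept = int;                // assumption: concepts are integer ids
using Subset  = std::set<Concept>;  // one element of the powerset CS = 2^C
using Op      = std::function<Subset(const Subset&)>;  // f : CS -> CS

// Unit of the monoid: leaves every subset unchanged.
Op identity() {
    return [](const Subset& s) { return s; };
}

// Monoid operation: (f * g)(s) = f(g(s)).
Op compose(Op f, Op g) {
    return [f, g](const Subset& s) { return f(g(s)); };
}

// Example operator: adds concept c to a subset.
Op expand_with(Concept c) {
    return [c](const Subset& s) {
        Subset r = s;
        r.insert(c);
        return r;
    };
}

// Example operator: removes concept c from a subset.
Op shrink_by(Concept c) {
    return [c](const Subset& s) {
        Subset r = s;
        r.erase(c);
        return r;
    };
}
```

Composition of functions is associative and identity() acts as a two-sided unit, which is all a monoid requires; no inverse needs to exist (shrink_by cannot, in general, be undone).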
Other ideas…
• EM ~ separation: w independent of d given c => matrix factorization. Golub discusses Jacobi iterations (parallelizable).
• Cons inference Markov net w/ Gibbs over max. cliques.
• What are min# of operators needed for mining of EAR…
$[\,p(w_i, d_j)\,]_{i=1..n,\; j=1..m} = [\,p(w_i \mid c_k)\,] \cdot \mathrm{diag}(p(c_k)) \cdot [\,p(d_j \mid c_k)\,]^{t}$
$[\,p(w_i, w_j)\,]_{i,j=1..n} = [\,p(w_i \mid c_k)\,] \cdot \mathrm{diag}(p(c_k)) \cdot [\,p(w_j \mid c_k)\,]^{t}$
$p(c_i) := g(c_i) / \sum_{j} g(c_j)$
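The factorization idea in the first bullet can be sketched numerically: if w and d are independent given c, the joint matrix is the product of the word-given-class matrix, a diagonal matrix of class priors, and the transposed document-given-class matrix. The layout (W[i][k] = p(w_i | c_k), D[j][k] = p(d_j | c_k), pc[k] = p(c_k)) and the function names are assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// p(c_i) := g(c_i) / sum_j g(c_j): turns nonnegative scores into priors.
std::vector<double> normalize(const std::vector<double>& g) {
    double total = 0.0;
    for (double v : g) total += v;
    assert(total > 0.0);  // at least one score must be positive
    std::vector<double> p;
    p.reserve(g.size());
    for (double v : g) p.push_back(v / total);
    return p;
}

// Builds P = W * diag(pc) * D^t, i.e.
// P[i][j] = sum_k p(w_i | c_k) * p(c_k) * p(d_j | c_k).
Matrix joint_from_factors(const Matrix& W,
                          const std::vector<double>& pc,
                          const Matrix& D) {
    Matrix P(W.size(), std::vector<double>(D.size(), 0.0));
    for (std::size_t i = 0; i < W.size(); ++i) {
        for (std::size_t j = 0; j < D.size(); ++j) {
            for (std::size_t k = 0; k < pc.size(); ++k) {
                P[i][j] += W[i][k] * pc[k] * D[j][k];
            }
        }
    }
    return P;
}
```

When each column of W and D sums to 1 and pc sums to 1, the entries of P sum to 1 as well, which is a quick sanity check on any fitted factors. The rows of P are independent of one another, so the outer loop is a natural unit for parallelization.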