Evaluating Condor for Enterprise Use:
A UBS Case Study
Gregg Cooke, IT Technical Council
April 26, 2006
Context: Why UBS Uses Grids
Tests: What Did We Look At?
Results: Strengths & Limitations
The Context: Grids in an Investment Bank
Grids at UBS
What do we mean by “grid”?
 Specifically, when we say “grid” we mean a computational cluster
– Condor fits the definition closely
 Other terminology:
– Condor “Job cluster” = UBS “Job”
– Condor “Virtual machine” = UBS “Engine” or “Node”
– Condor “Central Manager” = UBS “Broker” or “Manager”
Grids at UBS
Why do we use grids?
 Complex, long-running calculations include:
– Monte Carlo simulations of risk exposure
– Black-Scholes option valuations on portfolios of stock options
– Valuation of complicated “exotic” financial instruments
 Speed of computation directly correlates to volume of sales
 Accuracy of risk exposure calculation directly correlates to reserve cash
 Calculations constructed by quantitative analysts (“quants”)
– Write code that’s easy to change, not code that’s particularly efficient or parallelized
Current Grid Environment at UBS
How do we build & run our grids?
 10 separate production grids totaling 3000+ engines
– All separate grids…some 60-engine, some 2000-engine
– 1 million tasks per day
 Wide variety of platforms, languages, architectures
– C/C++, C#, Java on Windows or Linux
– Service-oriented vs. batch-oriented, embarrassingly parallel vs. workflow
– Rarely any greenfield development
 Dedicated deployment & operations teams (“GSD”)
– Straddle the development / operations worlds
– Focused on meeting business SLAs
– Strong drivers of what grid platform we use
Typical UBS Grid Environment
[Diagram: the Trader Desktop application sends a job specification to the Manager and receives job status back; the Manager distributes task assignments to Engine-1 through Engine-N and relays task input and task results. Callouts note the roles involved: quants write the calculations; developers build & test the application, using the quant code and partnering with GSD; GSD makes the app meet its SLAs and faces off with the business; the trader is part of the business.]
The Tests: Function, not Performance
How to Test Condor?
Feasibility Study: is Condor suitable for use within our enterprise?
 No performance tests…instead:
 Determine the functional limits of Condor
 Determine how Condor integrates with existing enterprise systems
 Port one or more projects to use Condor and measure:
– Porting effort
– Opportunities for new functionality (and cost of lost functionality)
– Operational impact
The Tests
We tested the following aspects of Condor:
 Scheduling capabilities
– Various combinations of Requirements, Rank, Start, Suspend, etc. rules (illustrative sketch below)
 Administrative capabilities
– Features of the command-line tools and common admin practices
 Interaction model
– Integrating Condor with an app: APIs, SOAP interface, command line interface
 Robustness and resilience
– Failover options, long-term stability, task retry, realtime reconfiguration, etc.
 Usability
– Impact to the user when a Condor engine is installed on their desktop
 And…scheduling latency…
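To make this concrete, here is an illustrative sketch of the kinds of expressions the scheduling tests exercised; the attribute thresholds below are made-up examples, not our actual policies:

    # Submit-file side: where the job may run, and which machines it prefers
    Requirements = (OpSys == "WINDOWS") && (Memory >= 2048)
    Rank         = KFlops

    # Machine-config side: when a desktop slot starts and suspends grid work
    START   = (KeyboardIdle > 15 * $(MINUTE)) && (LoadAvg < 0.3)
    SUSPEND = (KeyboardIdle < $(MINUTE))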
Scheduling Latency
Definition: the interval between the initial request and when the first engine
starts working on your task
 Applications may be designed with a given scheduling latency in mind
– We can control how long our code takes…we cannot control the scheduling latency
– Redevelopment is often a major undertaking
 We were expecting a very short (100msec) deterministic scheduling latency
– Condor’s is much longer (1min or more) and nondeterministic
– Condor does have an alternative (COD) but it changes the expected behavior of the grid
 Impact on testing: new set of questions!
– “Does Condor’s scheduling latency present a problem for our applications?”
– “Do we have applications that were not developed with assumptions about the scheduling latency?”
– “Are there other aspects of Condor’s performance that offset the scheduling latency?”
– “Can we measure the performance of our applications on Condor without regard to scheduling latency?”
The Result: Condor as a Functional Benchmark
What We Love About Condor
Too many to list…here are the top four:
 Incredibly powerful expression-based scheduling policy
 No-impact desktop cycle scavenging
 Easy reconfiguration
 Anything that can be run from a command line can be a task
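As a sketch of that last point (the program name and arguments here are hypothetical), a small submit description file is enough to farm an existing command-line program out as grid tasks:

    # mc_risk.sub -- hypothetical example
    universe    = vanilla
    executable  = mc_risk.exe
    arguments   = --paths 100000 --seed $(Process)
    output      = out/mc_$(Process).out
    error       = out/mc_$(Process).err
    log         = mc_risk.log
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    queue 50

Running condor_submit mc_risk.sub queues 50 tasks; the $(Process) macro gives each task its own seed and output files.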
But, Condor has limits too…
What Condor Needs to Better Support UBS
We found issues in four key areas:
 Administrative interface
 Code deployment
 Scheduling latency
 Job submission APIs
Important: remember that these conclusions are only relevant to UBS!
This is only what we found, based on our context…your mileage may vary
Administration Interface
Our conclusions:
 What we expected:
– A nice GUI admin console similar to others our operations personnel are familiar with
 What we found:
– A rich command-line administration interface, but no GUI (examples below)
 Our conclusion:
– At UBS, Condor will not be used by operations teams that cannot accept a command-line
admin interface
– These are usually Windows teams…Unix teams don’t seem to have as much bias
 What this means for the Condor community:
– A GUI admin console will make Condor more acceptable to enterprise users
– Web-based is best
– Doesn’t have to be fancy…just needs to be point & click (and stable, of course)
– Work being done at Indiana University on a Condor portal is a start
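For context, this is the style of command-line administration we mean (the host name is illustrative):

    condor_status                                   # slot states across the pool
    condor_status -constraint 'Activity == "Busy"'  # only slots currently running tasks
    condor_q -global                                # queued and running jobs from every submit node
    condor_off -peaceful enginehost42               # drain a machine: finish current work, accept no more
    condor_reconfig                                 # push configuration changes without restarting daemons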
Code Deployment
Our conclusions:
 What we expected:
– Automatic task code deployment done once and refreshed automatically when the grid
system senses a change in a central repository
 What we found:
– Automatic task code deployment every time a job is submitted (sketch below)
 Our conclusion:
– At UBS, Condor causes problems for applications with large (15 MB+) task code and short tasks,
because the network transmission time inflates the job completion time
 What this means for the Condor community:
– To make Condor more acceptable to enterprise users, task code should be cached at the
engines and only refreshed when it changes
– Fortunately, this is being worked on by the Condor Project!
– We’ve watched commercial grid vendors implement this…it is not an easy feature!
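A sketch of the issue, with hypothetical file names: because the task code is listed as transferred input, it travels to the engines with every submission, whether or not it has changed:

    # Hypothetical submit file: short tasks with a large code payload
    universe              = vanilla
    executable            = run_task.bat
    transfer_input_files  = pricing_lib.dll, market_data.cfg   # ~15 MB shipped each time
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT
    queue 500

When the task itself runs for only seconds, moving that payload to each engine dominates the job's wall-clock time.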
Scheduling Latency
Our conclusions:
 What we expected:
– Negligibly small latency that’s deterministic enough for us to predict job completion times
 What we found:
– Latencies that depend on configuration settings and the complexity of the ClassAds
 Our conclusion:
– At UBS, Condor cannot be used for tasks that take less than about 3.5 minutes to complete, or
where the total job completion time must be easily predictable (illustration below)
– However, even though our highest-value applications require short, deterministic scheduling
latencies, there are many more lower-value applications that aren’t sensitive to scheduling latency
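A rough illustration of the effect, assuming the roughly one-minute latency we observed:

    30-second task:   60 s / (60 s + 30 s)  ≈ 67% of wall-clock time spent waiting to start
    3.5-minute task:  60 s / (60 s + 210 s) ≈ 22% of wall-clock time spent waiting to start

The latency is the same in both cases; only longer tasks can amortize it.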
Application Programmer’s Interface
Our conclusions:
 What we expected:
– Nice, well-designed APIs for all our favorite languages
 What we found:
– A command-line interface and a maturing SOAP interface (example below)
 Our conclusion:
– Once the SOAP interface matures, UBS programmers will be more amenable to using Condor
 What this means for the Condor community:
– Full-speed ahead on the SOAP interface!
– Make sure all of the functionality available in the command-line interface is available in the
SOAP interface
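Until the SOAP interface matures, the practical interaction model from application code is to shell out to the tools, along these lines (the file names are hypothetical):

    condor_submit pricing_job.sub            # queue the job; prints the new cluster id
    condor_wait -wait 3600 pricing_job.log   # block until the job's tasks finish (1-hour timeout)
    condor_q -constraint 'JobStatus == 2'    # or poll for running jobs instead of blocking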
Condor at UBS
We will continue to use Condor for:
 Teaching new teams how to grid their applications
– Condor is an excellent exploration and learning environment
– Has already accelerated at least one team
 A functional benchmark for all things grid
– Condor is a crucible where new and innovative grid ideas get tried and refined
– Many of these features will prove valuable for commercial vendors to embrace
– Check-pointing & task migration
– Expression-based scheduling policy
– User-centric cycle scavenging
 Non-critical batch-oriented applications with standalone or SOAP-enabled service code,
with operations teams that don’t mind a command-line administration interface
– There are lots and lots of non-critical batch-oriented apps with standalone services
– There are not a lot of operations teams that will tolerate a command line interface…
