Introduction to the Microsoft Biology Foundation (MBF)
Microsoft Biology Initiative Module 02
• Introduction to MBF
▫ What is MBF?
▫ Common usage scenarios
• MBF Architecture
▫ Sequences, alphabets and symbols
▫ Parsers and formatters
▫ Introduction to algorithms
• MBF Starter Project
▫ Creating a new C# project
• MBF Source Code
▫ Building the source
▫ Testing with nUnit
What is MBF?
• Microsoft Biology Foundation (MBF) is a bioinformatics toolkit
▫ built on top of the .NET Framework 4.0
▫ open source under the Ms-PL license
▫ foundation upon which other tools can be built
• Provides various components useful for biological analysis
▫ parsers to read and write common bioinformatics formats
▫ support for DNA, RNA and protein sequences
▫ algorithm framework for analysis and transformation
▫ web connector framework for web-service interaction
What is MBF intended to do?
• Primarily focused on genomics
▫ reusable data structures to represent sequences + symbols
▫ I/O framework to load/save sequences
▫ algorithm framework to process loaded sequences
• Provides an alternative to other biology frameworks
▫ similar concepts to BioJava or BioPerl
▫ takes advantage of Microsoft developer tools and .NET
▫ will evolve as Microsoft and other contributors add features
• Designed to manipulate large data sets
▫ in-memory compression of sequence data
▫ data virtualization for sequences larger than memory
▫ scalable algorithms that take advantage of multiple cores
MBF Design Goals
• Extensibility was a primary goal
▫ core concepts mapped as interfaces and abstract base classes (ABCs)
▫ can easily provide alternative implementations or add any
missing features you need
• Language Neutral
▫ built on top of .NET – use any supported language
▫ supports dynamic languages such as IronPython
• Designed and implemented using best practices
▫ commented source code provided so nothing is a black box
▫ algorithms all cite publications
• Interoperability
▫ code can be run on several mainstream platforms
MBF vs. your application
• MBF is not an application in itself
▫ it does not provide any visualization of the data being managed
▫ it provides the basis for visualizations to be built on top of
[Figure: Your Application layered above the .NET Framework 4.0]
Creating Applications with MBF
• MBF allows you to work with your data however you need
[Figure: application types, including NT Service and Forms & …]
Example: Sequence Assembler
• sequence data is loaded from a FASTA file and assembled using MBF
• drawn as nucleotide symbols and graphics using WPF
Deploying your applications
• Possible to target non-Windows platforms[1]
▫ using Silverlight / Mono / Moonlight
Getting MBF
• MBF is available as an open source, free download
▫ the Downloads section lists the most recent build
MBF Licensing
• MBF is licensed under Ms-PL
▫ allows you to take the code and use it in academic or commercial
Installing MBF
• Official releases are packaged as Setup Files
▫ include all the pre-built assemblies you can use immediately
▫ installs full .NET 4.0 framework if not already installed
• Several other tools available
▫ Sequence Assembler sample
▫ Excel add-in
Installing MBF
• Run the Setup program to install the MBF elements
▫ only supports Windows installs, other platforms require manual
installation using source code
▫ creates %Program Files%\Microsoft Biology
MBF Installation
• Installer creates several files and directories
▫ \Doc directory holds documentation files
▫ \Addin directory contains optional algorithms
▫ \Sdk directory contains samples and additional documentation
▫ Bio.dll is core MBF assembly
▫ WebServiceHandlers.dll provides web service capabilities
• Several documents supplied with installation (in \Doc)
▫ even more available from
• Two documents are required reading before you begin
▫ start with the MBF_Overview.docx
▫ then read the Programming_Guide.docx
• BioDotNet.chm help file provides API reference
▫ installed with SDK (full install)
Download the source code
• Source code is also available online at CodePlex
▫ can download as a pre-packaged .ZIP file[1]
• Can also apply to contribute to the framework[2]
▫ provides TFS credentials to get access to repository
Architecture: Namespaces
• Sequences
• Alphabets
• Alignments
• Genomic Intervals
• Phylogeny
• GenBank
• Translation
• Alignment
• Sequence Assembly
• ClustalW
• BioHPC
MBF Core Types
• Bio namespace holds core types and base interfaces
• Valid symbols are defined in terms of an alphabet
▫ determines allowed characters and meaning
• Access standard alphabets through Instance properties
▫ supplies standard alphabets for DNA, RNA and protein sets
▫ can also access them using Alphabets static class
• Or create custom alphabets if necessary
▫ by implementing the IAlphabet interface
var dnaAlphabet = DnaAlphabet.Instance;
var dnaAlphabet2 = Alphabets.DNA;
// These two statements retrieve the same alphabet
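To make the idea of an alphabet concrete, here is a minimal standalone sketch that does not use the MBF API. The type SimpleAlphabet and all of its members are hypothetical names invented for illustration; MBF's own IAlphabet carries more information than a set of allowed characters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration only: not part of MBF.
public sealed class SimpleAlphabet
{
    private readonly HashSet<char> symbols;

    public string Name { get; }

    public SimpleAlphabet(string name, string allowedSymbols)
    {
        Name = name;
        // Store symbols in upper case so lookups are case-insensitive.
        symbols = new HashSet<char>(allowedSymbols.ToUpperInvariant());
    }

    // True if the character is one of the allowed symbols.
    public bool Contains(char symbol) =>
        symbols.Contains(char.ToUpperInvariant(symbol));

    // True if every character of the text belongs to the alphabet.
    public bool IsValid(string text) => text.All(Contains);

    // Shared instances, mirroring the idea of a static Instance property.
    public static readonly SimpleAlphabet Dna = new SimpleAlphabet("DNA", "ACGT-");
    public static readonly SimpleAlphabet Rna = new SimpleAlphabet("RNA", "ACGU-");
}
```

With this sketch, SimpleAlphabet.Dna.IsValid("AGCT") is true, while SimpleAlphabet.Dna.IsValid("AGCU") is false because U is not a DNA symbol.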
Sequence Items
• ISequenceItem defines a single symbol
▫ supplies name, attributes and character used to represent symbol
▫ most common form is Nucleotide
var dnaAlphabet = DnaAlphabet.Instance;
ISequenceItem dnaG = dnaAlphabet.LookupBySymbol("G");
Console.WriteLine("{0}: {1} {2} {3}, {4}, {5}",
    dnaG.Name, dnaG.IsAmbiguous, dnaG.IsGap,
    dnaG.IsTermination, dnaG.Symbol, dnaG.Value);
// Output: Guanine: False False False, G, 0
Representing Sequences
• ISequence interface represents ordered list of sequence items
▫ store data relevant to DNA, RNA and Amino Acid structures
▫ can work with sequence as a list of items, or as a string
public interface ISequence : IList<ISequenceItem>
{
    string ID { get; }
    string DisplayID { get; }
    IAlphabet Alphabet { get; }
    object Documentation { get; set; }
    MoleculeType MoleculeType { get; }
    string ToString();
}
Sequence Implementations
• Several ISequence implementations in the framework
▫ each optimized for a specific purpose, most common is Sequence
Sequence Type
▫ Standard implementation for managing a sequence.
▫ Maintains original source sequence along with changes.
▫ Stores sequence items and quality score (Sanger, Solexa, Illumina).
▫ Sequence composed of fragments of sequences.
▫ Sequence composed of discontinuous fragments from a longer sequence. Useful if you only want to work with portions of a long sequence.
▫ Sequence of metadata only – no symbols. Useful if all you want to parse is the additional information associated with the sequence vs. the sequence data itself.
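The quality-score formats mentioned above differ mainly in how a score is encoded as a text character. The sketch below is a standalone illustration rather than MBF code: QualityScores and its members are hypothetical names. It covers only the two Phred-based encodings (Sanger uses an ASCII offset of 33, Illumina 1.3 to 1.7 uses an offset of 64); Solexa scores use a different scale and are not handled here.

```csharp
using System;
using System.Linq;

// Hypothetical illustration only: not part of MBF.
public static class QualityScores
{
    public const int SangerOffset = 33;    // Phred+33
    public const int IlluminaOffset = 64;  // Phred+64 (Illumina 1.3 to 1.7)

    // Converts each quality character to its numeric Phred score.
    public static int[] Decode(string qualityLine, int offset) =>
        qualityLine.Select(c => c - offset).ToArray();

    // Probability that the base call is wrong: p = 10^(-Q/10).
    public static double ErrorProbability(int phred) =>
        Math.Pow(10.0, -phred / 10.0);
}
```

For example, the Sanger-encoded character 'I' (ASCII 73) decodes to a Phred score of 40, which corresponds to an error probability of 0.0001.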
Creating new sequences
• Sequence type is most basic ISequence implementation
▫ created as read-only by default
▫ insert Nucleotide items to populate
ISequence sequence = new Sequence(Alphabets.DNA, "AGCT");
sequence.IsReadOnly = false;
sequence.Add(new Nucleotide('-',"Gap"));
Working with string-based data
• Common to work with sequences as strings
▫ provides a readable representation of the data
▫ lose some information (gaps, terminators, etc.)
▫ not efficient for larger sequences
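void ProcessSequence(ISequence sequence)
{
    string data = sequence.ToString();
    // Reverse() on a string is a LINQ extension (requires using System.Linq)
    string reverse = new string(data.Reverse().ToArray());
    foreach (char symbol in reverse) { /* process each character */ }
}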
Working with sequence-based data
• Better to work with real ISequenceItem data
▫ maintains full identity
▫ helper properties on ISequence perform common tasks[1]
void ProcessSequence(ISequence sequence)
{
    ISequence reverse = sequence.Reverse;
    foreach (ISequenceItem symbol in reverse) { /* process each item */ }
}
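The kind of work such helper properties do can be sketched without MBF. The standalone example below operates on plain strings; DnaStrings and its methods are hypothetical names, and the complement table covers only the four unambiguous bases plus the gap symbol.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration only: not part of MBF.
public static class DnaStrings
{
    private static readonly Dictionary<char, char> Complements =
        new Dictionary<char, char>
        {
            { 'A', 'T' }, { 'T', 'A' }, { 'C', 'G' }, { 'G', 'C' }, { '-', '-' }
        };

    // Returns the symbols in reverse order.
    public static string Reversed(string dna) =>
        new string(dna.Reverse().ToArray());

    // Returns the complementary strand (upper case); throws
    // KeyNotFoundException for symbols outside the table.
    public static string Complement(string dna) =>
        new string(dna.Select(c => Complements[char.ToUpperInvariant(c)]).ToArray());

    // Reverses first, then complements.
    public static string ReverseComplement(string dna) =>
        Complement(Reversed(dna));
}
```

DnaStrings.Reversed("AGCT") gives "TCGA", and DnaStrings.ReverseComplement("AAGC") gives "GCTT".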
Reading and writing sequences
• Most common way to obtain a sequence is through a parser
▫ loads sequence data from some persistent storage
▫ tied to a specific format
▫ can load one or more sequences together
▫ can support metadata and statistics for sequence
• Once loaded, sequence can be processed
▫ through methods of ISequence, or by algorithms
• Finally, sequences are saved using formatters
▫ writes collection of ISequence objects to persistent storage
• MBF has several available parsers and formatters[1]
▫ contained in the Bio.IO namespace
▫ designed for extensibility – to support your formats
Loading sequences with parsers
• Several supplied parsers load common bio sequence formats
▫ FastA, FastQ, GenBank, Gff
• All sequence parsers implement ISequenceParser
▫ provides consistent interface to parsing data
▫ supports loading data from files and streams (more on this later)
public interface ISequenceParser : IParser
{
    IList<ISequence> Parse(string filename);
    IList<ISequence> Parse(string filename, bool isReadOnly);
    ISequence ParseOne(string filename);
    ISequence ParseOne(string filename, bool isReadOnly);
}
Loading data from specific formats
• If file format is known, specific parser can be used to load data
▫ easiest and least error-prone way to load data
private IList<ISequence> LoadSequence(string filename)
{
    FastaParser parser = new FastaParser();
    IList<ISequence> data = parser.Parse(filename, true);
    return data;
}
second parameter indicates to open in read-only mode
for performance – indicating change tracking is not
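For readers unfamiliar with the FastA format, the following standalone sketch shows roughly what a parser has to do: a header line starts with '>' and the residue lines that follow belong to that record. MiniFasta is a hypothetical name, and this is far simpler than MBF's FastaParser, which builds ISequence objects and validates symbols against an alphabet.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical illustration only: not part of MBF.
public static class MiniFasta
{
    // Reads (id, residues) pairs from FastA-formatted text.
    public static List<KeyValuePair<string, string>> Parse(TextReader reader)
    {
        var records = new List<KeyValuePair<string, string>>();
        string id = null;
        var residues = new StringBuilder();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            line = line.Trim();
            if (line.Length == 0) continue;      // skip blank lines
            if (line[0] == '>')
            {
                // A new header closes the previous record, if any.
                if (id != null)
                    records.Add(new KeyValuePair<string, string>(id, residues.ToString()));
                id = line.Substring(1).Trim();
                residues.Clear();
            }
            else if (id != null)
            {
                residues.Append(line);           // residue lines are concatenated
            }
        }
        if (id != null)
            records.Add(new KeyValuePair<string, string>(id, residues.ToString()));
        return records;
    }
}
```

A file can be read with MiniFasta.Parse(File.OpenText(path)); each returned pair holds the header text as Key and the joined residues as Value.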
Handling multiple file formats
• SequenceParsers class manages built-in parser types
▫ can use FindParserByFile method to locate proper parser at
private IList<ISequence> LoadSequence(string filename)
{
    ISequenceParser parser =
        SequenceParsers.FindParserByFile(filename);
    if (parser == null)
        return null;
    IList<ISequence> data = parser.Parse(filename, true);
    return data;
}
FindParserByFile returns null if file could not be identified[1]
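One plausible way to identify a format from a file name is a lookup on the file extension. The sketch below is standalone and makes no claim about how FindParserByFile is actually implemented; FormatLookup, FindFormatByFile and the extension table are assumptions chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical illustration only: not part of MBF.
public static class FormatLookup
{
    // Extension-to-format table; the extensions listed are common
    // conventions and are an assumption of this sketch.
    private static readonly Dictionary<string, string> FormatsByExtension =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { ".fa", "FastA" }, { ".fasta", "FastA" },
            { ".fq", "FastQ" }, { ".fastq", "FastQ" },
            { ".gb", "GenBank" }, { ".gbk", "GenBank" },
            { ".gff", "Gff" }
        };

    // Returns the format name, or null when the extension is not recognized.
    // The filename must not be null.
    public static string FindFormatByFile(string filename)
    {
        string extension = Path.GetExtension(filename);
        string format;
        return FormatsByExtension.TryGetValue(extension, out format) ? format : null;
    }
}
```

As with the method described above, the caller has to handle a null result when the file type is not recognized.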
Interrogating the parser list
• SequenceParsers also provides enumerable list of parsers
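private IList<ISequence> TryLoadSequence(string filename)
{
    IList<ISequenceParser> parsers = SequenceParsers.All;
    foreach (var parser in parsers)
    {
        // assumes a parser throws when the file is not in its format
        try { return parser.Parse(filename, true); }
        catch (Exception) { /* try the next parser */ }
    }
    return null;
}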
Saving sequences back to files
• Formatters take sequences and persist them
▫ same formats supported: FastA, FastQ, GenBank and Gff
• Abstracted by ISequenceFormatter interface
▫ supports file-based and stream-based writing (more on this later)
public interface ISequenceFormatter : IFormatter
{
    void Format(ICollection<ISequence> sequences, string filename);
    void Format(ISequence sequence, string filename);
    string FormatString(ISequence sequence);
}
Saving a sequence
• SequenceFormatters provides list of available formatters
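void SaveSequence(string filename, IList<ISequence> seqList)
{
    // FindFormatterByFile is assumed here as the counterpart of SequenceParsers.FindParserByFile
    ISequenceFormatter formatter =
        SequenceFormatters.FindFormatterByFile(filename);
    if (formatter != null)
        formatter.Format(seqList, filename);
}
void SaveFastASequence(string fname, IList<ISequence> seqList)
{
    SequenceFormatters.Fasta.Format(seqList, fname);
}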
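To show what a formatter produces, here is a standalone sketch that writes one record in FastA layout: a '>' header line followed by the residues wrapped to a fixed width. MiniFastaWriter is a hypothetical name, and the default width of 60 is an assumption of this sketch, not a value taken from MBF.

```csharp
using System;
using System.IO;

// Hypothetical illustration only: not part of MBF.
public static class MiniFastaWriter
{
    // Writes a header line and then the residues, lineWidth characters per line.
    public static void Write(TextWriter writer, string id, string residues, int lineWidth = 60)
    {
        if (lineWidth <= 0) throw new ArgumentOutOfRangeException("lineWidth");
        writer.WriteLine(">" + id);
        for (int start = 0; start < residues.Length; start += lineWidth)
        {
            // The last line may be shorter than lineWidth.
            int length = Math.Min(lineWidth, residues.Length - start);
            writer.WriteLine(residues.Substring(start, length));
        }
    }
}
```

Because it accepts any TextWriter, the same method can write to a file (File.CreateText) or to an in-memory StringWriter.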
Running algorithms on Sequences
• MBF provides a small collection of popular algorithms
▫ alignment, translation, assembly, …
▫ designed specifically to plug in new algorithms
• Bio.Algorithms is where all the algorithmic code is located
▫ each algorithm is given unique namespace
▫ algorithms can also be supplied in separate assemblies that MBF locates and provides access to at runtime[1]
Using the algorithm classes
• Algorithms generally
▫ take one or more ISequence elements as input and return one or
more ISequence elements as output
• Algorithms come in two forms
▫ static methods – to run simple algorithm on a single sequence
▫ instance classes – to run algorithms on 1+ sequences
using Bio.Algorithms.Translation;
ISequence DNAtoRNA(ISequence dnaSequence)
{
    ISequence rnaSequence = Transcription.Transcribe(dnaSequence);
    return rnaSequence;
}
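As a point of comparison, transcription on a plain string can be sketched in one line. This standalone example uses the coding-strand convention, where the RNA matches the DNA with T replaced by U. MiniTranscription is a hypothetical name, and the slides do not say which convention Transcription.Transcribe follows, so treat this as an illustration of the concept only.

```csharp
using System;

// Hypothetical illustration only: not part of MBF.
public static class MiniTranscription
{
    // Replaces thymine (T) with uracil (U); the result is upper case and
    // all other characters are left unchanged.
    public static string Transcribe(string dna) =>
        dna.ToUpperInvariant().Replace('T', 'U');
}
```

MiniTranscription.Transcribe("agct") returns "AGCU".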
Using MBF in your applications
• Using MBF is as simple as adding a reference to Bio.dll
▫ can then begin consuming available types
▫ convenient to add assembly to project – to ensure it is available
▫ supplied distribution requires the full .NET 4.0 framework install
MBF Starter Project [Step 1]
• If you are starting fresh, you can use the MBF Starter Template
▫ added to Visual Studio 2010 when you installed MBF
▫ requires .NET Framework 4 and the Visual C# project type to be selected
▫ select MBF Application from the project types
MBF Starter Project [Step 2]
• Select options you want to use from MBF in your new app
▫ each checkbox will add standard methods for you to utilize
▫ e.g. simple textfile logging
▫ … click Finish to generate
MBF Starter Project [Step 3]
• Add your code to the Main method
▫ call the supplied methods to get / save and manipulate the
class Program
{
    static void Main(string[] args)
    {
        // TODO: Your Code Goes Here
    }

    // Generated helper methods (signatures only, bodies omitted):

    // Exports a given sequence to a file in FastA format
    static void ExportFastA(ISequence sequence, string filename);
    // Parses a FastA file which has one or more sequences.
    static IList<ISequence> ParseFastA(string filename);
    // Write a given string to the application log.
    static void WriteLog(string matter);
    // Method to align two sequences using NeedlemanWunschAligner.
    public static IList<IPairwiseSequenceAlignment> AlignSequences(
        ISequence referenceSequence, ISequence querySequence)
}
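The generated AlignSequences method relies on the Needleman-Wunsch algorithm, which computes a global alignment by dynamic programming. The standalone sketch below computes only the optimal alignment score for two strings. MiniNeedlemanWunsch is a hypothetical name and the scoring values (match +1, mismatch -1, gap -2) are assumptions; it is not MBF's NeedlemanWunschAligner, which also returns the aligned sequences.

```csharp
using System;

// Hypothetical illustration only: not part of MBF.
public static class MiniNeedlemanWunsch
{
    // Returns the best global alignment score of a against b.
    public static int Score(string a, string b, int match = 1, int mismatch = -1, int gap = -2)
    {
        // score[i, j] = best score for aligning a[0..i) with b[0..j)
        var score = new int[a.Length + 1, b.Length + 1];

        // Aligning a prefix against nothing costs one gap per symbol.
        for (int i = 1; i <= a.Length; i++) score[i, 0] = i * gap;
        for (int j = 1; j <= b.Length; j++) score[0, j] = j * gap;

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int diagonal = score[i - 1, j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch);
                int up = score[i - 1, j] + gap;    // gap in b
                int left = score[i, j - 1] + gap;  // gap in a
                score[i, j] = Math.Max(diagonal, Math.Max(up, left));
            }
        }
        return score[a.Length, b.Length];
    }
}
```

With the default scores, aligning "AGCT" with itself gives 4, and aligning "AGCT" with "AGT" gives 1 (three matches and one gap).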
Contributing back to MBF
• MBF is an open source project
▫ Microsoft wants your ideas, contributions and feedback
• Useful to download the source code
▫ read through to get good coding ideas
▫ to extend or repurpose
• Consider contributing changes / features back to the project
▫ read MBF_Onboarding.doc or download the guide from
• All contributions must be released under Ms-PL license
Examining the Source Code
• Can retrieve source code from TFS repository[1]
▫ or as self-contained .zip file
• Solution MBI.sln contains all projects
▫ Bio – MBF core library
▫ Bio.Workflow
▫ WebServiceHandlers
▫ Unit tests
▫ Sample SDK code
Unit Testing
• MBF includes suite of unit tests for all components
▫ uses nUnit, included with distribution
▫ test cases are in separate unit-test assemblies
▫ any contributions to codebase must include unit tests
Running unit tests
• Running unit tests involves three steps
1. Execute nUnit.exe from public/ext/nunit/bin/net-2.0
2. Open the Bio.Tests.dll assembly with nUnit
3. Click the Run button to execute the unit test
Writing your own unit tests
• Unit tests are just blocks of code written to test other code
▫ tests assumptions, edge cases, error cases, and functionality
• nUnit makes writing test cases easy
▫ uses .NET attributes to signal intent
▫ Assert class provides helper methods to test assertions
[TestFixture]   // identifies this as a unit test class
public class ReverseTests
{
    [Test]      // identifies this as a unit test method
    public void TestReverse()
    {
        Sequence sequence = new Sequence(Alphabets.DNA, "AGCT");
        string reverse = sequence.Reverse.ToString();
        // verifies the two strings are equal
        Assert.AreEqual("TCGA", reverse);
    }
}
Summary
• MBF framework is used to build bioinformatics applications
▫ open source
▫ highly extensible
▫ scalable
▫ flexible data architecture
• Based on .NET
▫ language agnostic
▫ allows any application style
▫ can use all the power and flexibility of .NET
• Sequences are the core concept in the framework
▫ contain sequence items [symbols]
▫ based on alphabets
▫ read/written using parsers/formatters
▫ passed as arguments and returned from algorithms