Introduction to the Microsoft Biology Foundation (MBF)
Microsoft Biology Initiative Module 02
Agenda
• Introduction to MBF
▫ What is MBF?
▫ Common usage scenarios
• MBF Architecture
▫ Sequences, alphabets and symbols
▫ Parsers and formatters
▫ Introduction to algorithms
• MBF Starter Project
▫ Creating a new C# project
• MBF Source Code
▫ Building the source
▫ Testing with nUnit
What is MBF?
• Microsoft Biology Foundation (MBF) is a bioinformatics toolkit
▫ built on top of the .NET Framework 4.0
▫ open source under the Ms-PL license
▫ foundation upon which other tools can be built
• Provides various components useful for biological analysis
▫ parsers to read and write common bioinformatics formats
▫ support for DNA, RNA and protein sequences
▫ algorithm framework for analysis and transformation
▫ web connector framework for web-service interaction
What is MBF intended to do?
• Primarily focused on genomics
▫ reusable data structures to represent sequences + symbols
▫ I/O framework to load/save sequences
▫ algorithm framework to process loaded sequences
• Provides an alternative to other biology frameworks
▫ similar concepts to BioJava or BioPerl
▫ takes advantage of Microsoft developer tools and .NET
▫ will evolve as Microsoft and other contributors add features
• Designed to manipulate large data sets
▫ in-memory compression of sequence data
▫ data virtualization for sequences larger than memory
▫ scalable algorithms that take advantage of multiple cores
MBF Design Goals
• Extensibility was a primary goal
▫ core concepts mapped as interfaces and abstract base classes
▫ can easily provide alternative implementations or add any
missing features you need
• Language Neutral
▫ built on top of .NET – use any supported language
▫ supports dynamic languages such as IronPython
• Designed and implemented using best practices
▫ commented source code provided so nothing is a black box
▫ algorithms all cite publications
• Interoperability
▫ code can be run on several mainstream platforms
MBF vs. your application
• MBF is not an application in itself
▫ it does not provide any visualization of the data being managed
▫ it provides the basis for visualizations to be built on top of
[Diagram: layered stack, top to bottom: Your Application, MBF, .NET Framework 4.0]
Creating Applications with MBF
• MBF allows you to work with your data however you need
[Diagram: application types that can be built on MBF: Console (text), NT Service, Windows Forms and WPF, Azure, ASP.NET / WCF, Silverlight]
Example: Sequence Assembler
[Screenshot: sequence data is loaded from a FASTA file and assembled using MBF, then drawn as nucleotide symbols and graphics using WPF]
Deploying your applications
• Possible to target non-Windows platforms[1]
▫ using Silverlight / Mono / Moonlight
Getting MBF
• MBF is available as an open source, free download
▫ http://mbf.codeplex.com
▫ the Downloads section lists the most recent build
MBF Licensing
• MBF is licensed under Ms-PL
▫ http://msdn.microsoft.com/en-us/library/cc707818.aspx
▫ allows you to take the code and use it in academic or commercial
products
Installing MBF
• Official releases are packaged as Setup Files
▫ include all the pre-built assemblies you can use immediately
▫ installs full .NET 4.0 framework if not already installed
• Several other tools available
▫ Sequence Assembler sample
▫ Excel add-in (http://bioexcel.codeplex.com/)
Installing MBF
• Run the Setup program to install the MBF elements
▫ only supports Windows installs, other platforms require manual
installation using source code
▫ creates %Program Files%\Microsoft Biology Initiative\1.0\MBF
MBF Installation
• Installer creates several files and directories
▫ \Doc directory holds documentation files
▫ \Addin directory contains optional algorithms
▫ \Sdk directory contains samples and additional documentation
▫ Bio.dll is core MBF assembly
▫ WebServiceHandlers.dll provides web service capabilities
Documentation
• Several documents are supplied with the installation (in the \Doc directory)
▫ even more available from http://mbf.codeplex.com/documentation
• Two documents are required reading before you begin
▫ start with MBF_Overview.docx
▫ then read MBF_Programming_Guide.docx
• BioDotNet.chm help file provides API reference
▫ installed with SDK (full install)
[Figure: document icons for MBF_Overview.docx, MBF_Programming_Guide.docx and BioDotNet.chm]
Download the source code
• Source code is also available online at CodePlex
▫ can download as a pre-packaged .ZIP file[1]
• Can also apply to contribute to the framework[2]
▫ provides TFS credentials to get access to repository
Architecture: Namespaces
• Bio
▫ Sequences
▫ Alphabets
▫ Alignments
▫ Genomic Intervals
▫ Phylogeny
• Bio.IO
▫ FASTA / FASTQ
▫ GenBank
▫ NEXUS
▫ …
• Bio.Algorithms
▫ Translation
▫ Alignment
▫ Sequence Assembly
▫ …
• Bio.Web
▫ BLAST
▫ ClustalW
▫ BioHPC
▫ …
MBF Core Types
• Bio namespace holds core types and base interfaces
[Diagram: core types in the Bio namespace: Alphabets, DnaAlphabet, ISequence, Sequence, ISequenceItem, Nucleotide]
Alphabets
• Valid symbols are defined in terms of an alphabet
▫ determines allowed characters and meaning
• Access standard alphabets through Instance properties
▫ supplies standard alphabets for DNA, RNA and protein sets
▫ can also access them using Alphabets static class
• Or create custom alphabets if necessary
▫ by implementing the IAlphabet interface
var dnaAlphabet = DnaAlphabet.Instance;
...
var dnaAlphabet2 = Alphabets.DNA;
These two statements retrieve the same alphabet
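To make the role of an alphabet concrete, here is a minimal standalone sketch in plain C#. It does not use MBF's IAlphabet interface, whose members are not listed in this module; the SimpleAlphabet type and its Contains and IsValid methods are hypothetical names chosen for illustration, and the symbol set is an assumption. It shows only the core idea that an alphabet defines which characters a sequence may contain.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for an alphabet; this is NOT MBF's IAlphabet.
public sealed class SimpleAlphabet
{
    private readonly HashSet<char> symbols;

    public SimpleAlphabet(string name, string allowedSymbols)
    {
        Name = name;
        // Store symbols in upper case so lookups are case-insensitive.
        symbols = new HashSet<char>(allowedSymbols.ToUpperInvariant());
    }

    public string Name { get; private set; }

    // True when the character is one of the alphabet's symbols.
    public bool Contains(char symbol)
    {
        return symbols.Contains(char.ToUpperInvariant(symbol));
    }

    // True when every character of the text belongs to the alphabet.
    public bool IsValid(string text)
    {
        foreach (char c in text)
        {
            if (!Contains(c)) return false;
        }
        return true;
    }
}

public static class AlphabetDemo
{
    public static void Main()
    {
        // Assumed symbol set: the four unambiguous DNA bases plus a gap.
        var dna = new SimpleAlphabet("DNA", "ACGT-");
        Console.WriteLine(dna.IsValid("AGCT")); // True
        Console.WriteLine(dna.IsValid("AGCU")); // False: U is not in the set
    }
}
```

A real alphabet also carries the meaning of each symbol (name, ambiguity, gap status), which this sketch leaves out.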
Sequence Items
• ISequenceItem defines a single symbol
▫ supplies name, attributes and character used to represent symbol
▫ most common form is Nucleotide
var dnaAlphabet = DnaAlphabet.Instance;
ISequenceItem dnaG = dnaAlphabet.LookupBySymbol("G");
Console.WriteLine("{0}: {1} {2} {3}, {4}, {5}",
dnaG.Name, dnaG.IsAmbiguous, dnaG.IsGap,
dnaG.IsTermination, dnaG.Symbol, dnaG.Value);
Guanine: False False False, G, 0
Representing Sequences
• ISequence interface represents ordered list of sequence items
▫ store data relevant to DNA, RNA and Amino Acid structures
▫ can work with sequence as a list of items, or as a string
public interface ISequence : IList<ISequenceItem>
{
string ID { get; }
string DisplayID { get; }
IAlphabet Alphabet { get; }
object Documentation { get; set; }
MoleculeType MoleculeType { get; }
...
string ToString();
}
Sequence Implementations
• Several ISequence implementations in the framework
▫ each optimized for a specific purpose, most common is Sequence
Sequence Type | Description
Sequence | Standard implementation for managing a sequence.
DerivedSequence | Maintains original source sequence along with changes.
QualitativeSequence | Stores sequence items and quality score (Sanger, Solexa, Illumina).
SegmentedSequence | Sequence composed of fragments of sequences.
SparseSequence | Sequence composed of discontinuous fragments from a longer sequence. Useful if you only want to work with portions of a long sequence.
VirtualSequence | Sequence of metadata only – no symbols. Useful if all you want to parse is the additional information associated with the sequence vs. the sequence data itself.
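The quality-score formats named for QualitativeSequence differ mainly in how a score is packed into a character. As a rough standalone sketch (plain C#, not MBF's API; QualityDecoder, QualityEncoding and DecodePhred are hypothetical names), the code below decodes the two Phred-based encodings: Sanger, which stores the score plus 33, and Illumina 1.3+, which stores the score plus 64. Solexa also uses offset 64 but on a different score scale, so it is deliberately left out.

```csharp
using System;

// Hypothetical names for illustration; not part of MBF.
public enum QualityEncoding { Sanger, Illumina13 }

public static class QualityDecoder
{
    // Converts a FASTQ quality string into Phred scores.
    public static int[] DecodePhred(string quality, QualityEncoding encoding)
    {
        int offset = encoding == QualityEncoding.Sanger ? 33 : 64;
        var scores = new int[quality.Length];
        for (int i = 0; i < quality.Length; i++)
        {
            int score = quality[i] - offset;
            if (score < 0)
            {
                throw new ArgumentException("Character below the encoding offset.", "quality");
            }
            scores[i] = score;
        }
        return scores;
    }

    // Phred definition: Q = -10 * log10(P), so P = 10^(-Q/10).
    public static double ErrorProbability(int phred)
    {
        return Math.Pow(10.0, -phred / 10.0);
    }
}

public static class QualityDemo
{
    public static void Main()
    {
        int[] scores = QualityDecoder.DecodePhred("I5!", QualityEncoding.Sanger);
        Console.WriteLine(string.Join(" ", scores)); // 40 20 0
    }
}
```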
Creating new sequences
• Sequence type is most basic ISequence implementation
▫ created as read-only by default
▫ insert Nucleotide items to populate
ISequence sequence = new Sequence(Alphabets.DNA, "AGCT");
...
sequence.IsReadOnly = false;
sequence.Add(DnaAlphabet.Instance.AC);
sequence.Add(new Nucleotide('-',"Gap"));
...
sequence.RemoveAt(0);
Console.WriteLine(sequence);
GCTM-
Working with string-based data
• Common to work with sequences as strings
▫ provides a readable representation of the data
▫ lose some information (gaps, terminators, etc.)
▫ not efficient for larger sequences
void ProcessSequence(ISequence sequence)
{
string data = sequence.ToString();
string reverse = new string(data.Reverse().ToArray());
foreach (char symbol in reverse)
{
...
}
}
Working with sequence-based data
• Better to work with real ISequenceItem data
▫ maintains full identity
▫ helper properties on ISequence perform common tasks[1]
void ProcessSequence(ISequence sequence)
{
ISequence reverse = sequence.Reverse;
foreach (ISequenceItem symbol in reverse)
{
...
}
}
Reading and writing sequences
• Most common way to obtain a sequence is through a parser
▫ loads sequence data from some persistent storage
▫ tied to a specific format
▫ can load one or more sequences together
▫ can support metadata and statistics for sequence
• Once loaded, sequence can be processed
▫ through methods of ISequence, or by algorithms
• Finally, sequences are saved using formatters
▫ writes collection of ISequence objects to persistent storage
• MBF has several available parsers and formatters[1]
▫ contained in the Bio.IO namespace
▫ designed for extensibility – to support your formats
Loading sequences with parsers
• Several supplied parsers load common bio sequence formats
▫ FastA, FastQ, GenBank, Gff
• All sequence parsers implement ISequenceParser
▫ provides consistent interface to parsing data
▫ supports loading data from files and streams (more on this later)
public interface ISequenceParser : IParser
{
IList<ISequence> Parse(string filename);
IList<ISequence> Parse(string filename, bool isReadOnly);
...
ISequence ParseOne(string filename);
ISequence ParseOne(string filename, bool isReadOnly);
...
}
Loading data from specific formats
• If file format is known, specific parser can be used to load data
▫ easiest and least error-prone way to load data
private IList<ISequence> LoadSequence(string filename)
{
FastaParser parser = new FastaParser();
IList<ISequence> data = parser.Parse(filename,true);
return data;
}
The second parameter opens the data in read-only mode for performance, indicating that change tracking is not necessary.
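For readers unfamiliar with the format, the following standalone sketch shows roughly what a FASTA parser has to do. It is plain C# and not MBF's FastaParser; the FastaSketch class and its Parse method are hypothetical, and records are returned as simple header/residue pairs rather than ISequence objects. In FASTA, a line starting with '>' begins a new record and the following lines hold its residues.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical illustration of FASTA parsing; not MBF's FastaParser.
public static class FastaSketch
{
    // Returns one (header, residues) pair per FASTA record.
    public static List<KeyValuePair<string, string>> Parse(TextReader reader)
    {
        var records = new List<KeyValuePair<string, string>>();
        string header = null;
        var residues = new StringBuilder();
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            line = line.Trim();
            if (line.Length == 0) continue;          // skip blank lines
            if (line[0] == '>')
            {
                // A new header closes the previous record, if any.
                if (header != null)
                {
                    records.Add(new KeyValuePair<string, string>(header, residues.ToString()));
                }
                header = line.Substring(1);
                residues.Length = 0;
            }
            else if (header != null)
            {
                residues.Append(line);               // residues may span several lines
            }
        }
        if (header != null)
        {
            records.Add(new KeyValuePair<string, string>(header, residues.ToString()));
        }
        return records;
    }

    public static void Main()
    {
        string text = ">seq1 example\nAGCT\nAGCT\n>seq2\nTTGA\n";
        foreach (var record in Parse(new StringReader(text)))
        {
            Console.WriteLine("{0}: {1}", record.Key, record.Value);
        }
        // seq1 example: AGCTAGCT
        // seq2: TTGA
    }
}
```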
Handling multiple file formats
• SequenceParsers class manages built-in parser types
▫ can use FindParserByFile method to locate proper parser at
runtime
private IList<ISequence> LoadSequence(string filename)
{
ISequenceParser parser =
SequenceParsers.FindParserByFile(filename);
if (parser == null)
return null;
IList<ISequence> data = parser.Parse(filename,true);
return data;
}
FindParserByFile returns null if file could not be identified[1]
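One plausible way to identify a format is by file extension. Whether FindParserByFile works this way internally is an assumption, and the extension table below is illustrative rather than MBF's actual list; FormatLookupSketch and FindFormatByFile are hypothetical names. The sketch mirrors the documented behaviour of returning null when a file cannot be identified.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical illustration; not MBF's SequenceParsers class.
public static class FormatLookupSketch
{
    // Illustrative extension table (an assumption, not taken from MBF).
    private static readonly Dictionary<string, string> FormatsByExtension =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { ".fa", "FastA" },
            { ".fasta", "FastA" },
            { ".fq", "FastQ" },
            { ".fastq", "FastQ" },
            { ".gbk", "GenBank" },
            { ".gff", "Gff" }
        };

    // Returns the format name, or null when the extension is not recognised.
    public static string FindFormatByFile(string filename)
    {
        string extension = Path.GetExtension(filename);
        if (string.IsNullOrEmpty(extension)) return null;

        string format;
        return FormatsByExtension.TryGetValue(extension, out format) ? format : null;
    }

    public static void Main()
    {
        Console.WriteLine(FindFormatByFile("reads.FASTQ"));              // FastQ
        Console.WriteLine(FindFormatByFile("notes.txt") ?? "(unknown)"); // (unknown)
    }
}
```

Callers should always check for null before using the result, as the LoadSequence example does.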
Interrogating the parser list
• SequenceParsers also provides enumerable list of parsers
private IList<ISequence> TryLoadSequence(string filename)
{
IList<ISequenceParser> parsers = SequenceParsers.All;
foreach (var parser in parsers)
{
try
{
return parser.Parse(filename, true);
}
catch
{
// This parser could not read the file; fall through and try the next one.
}
}
return null;
}
Saving sequences back to files
• Formatters take sequences and persist them
▫ same formats supported: FastA, FastQ, GenBank and Gff
• Abstracted by ISequenceFormatter interface
▫ supports file-based and stream-based writing (more on this later)
public interface ISequenceFormatter : IFormatter
{
void Format(ICollection<ISequence> sequences, string filename);
void Format(ISequence sequence, string filename);
string FormatString(ISequence sequence);
}
Saving a sequence
• SequenceFormatters provides list of available formatters
void SaveSequence(string filename, IList<ISequence> seqList)
{
ISequenceFormatter formatter =
SequenceFormatters.FindFormatterByFile(filename);
if (formatter != null)
{
formatter.Format(seqList, filename);
}
}
void SaveFastASequence(string fname, IList<ISequence> seqList)
{
SequenceFormatters.Fasta.Format(seqList, fname);
}
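As a counterpart to parsing, this standalone sketch shows the kind of work a FASTA formatter performs: write a '>' header line, then the residues wrapped to a fixed width. It is plain C# and not MBF's formatter; FastaWriterSketch, Write and the lineWidth parameter are hypothetical, and the wrap width used in the demo is arbitrary.

```csharp
using System;
using System.IO;

// Hypothetical illustration of FASTA output; not MBF's formatter.
public static class FastaWriterSketch
{
    // Writes one record: a '>' header line followed by wrapped residue lines.
    public static void Write(TextWriter writer, string id, string residues, int lineWidth)
    {
        if (lineWidth <= 0) throw new ArgumentOutOfRangeException("lineWidth");

        writer.WriteLine(">" + id);
        for (int start = 0; start < residues.Length; start += lineWidth)
        {
            int length = Math.Min(lineWidth, residues.Length - start);
            writer.WriteLine(residues.Substring(start, length));
        }
    }

    public static void Main()
    {
        var writer = new StringWriter();
        writer.NewLine = "\n";
        Write(writer, "seq1", "AGCTAGCTAG", 4);
        Console.Write(writer.ToString());
        // >seq1
        // AGCT
        // AGCT
        // AG
    }
}
```

Writing to a TextWriter rather than a file name keeps the sketch usable with both files and in-memory streams.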
Running algorithms on Sequences
• MBF provides a small collection of popular algorithms
▫ alignment, translation, assembly, …
▫ designed specifically to plug in new algorithms
• Bio.Algorithms is where all the algorithmic code is located
▫ each algorithm is given a unique namespace
▫ algorithms can also be supplied in separate assemblies that MBF locates and provides access to at runtime[1]
…
Using the algorithm classes
• Algorithms generally
▫ take one or more ISequence elements as input and return one or
more ISequence elements as output
• Algorithms come in two forms
▫ static methods – to run simple algorithm on a single sequence
▫ instance classes – to run algorithms on 1+ sequences
using Bio.Algorithms.Translation;
...
ISequence DNAtoRNA(ISequence dnaSequence)
{
ISequence rnaSequence = Transcription.Transcribe(dnaSequence);
return rnaSequence;
}
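To illustrate what a transcription step computes, here is a standalone sketch on plain strings. It is not MBF's Transcription class: TranscriptionSketch and TranscribeCodingStrand are hypothetical names, and the sketch assumes the coding-strand convention, in which the RNA carries the same bases as the DNA with U in place of T. MBF's Transcribe method may follow a different strand convention, so check its documentation before relying on this behaviour.

```csharp
using System;
using System.Text;

// Hypothetical illustration; not MBF's Transcription.Transcribe.
public static class TranscriptionSketch
{
    // Coding-strand convention (an assumption): replace T with U, keep everything else.
    public static string TranscribeCodingStrand(string dna)
    {
        var rna = new StringBuilder(dna.Length);
        foreach (char c in dna)
        {
            switch (c)
            {
                case 'T': rna.Append('U'); break;
                case 't': rna.Append('u'); break;
                default: rna.Append(c); break; // other bases and gaps pass through unchanged
            }
        }
        return rna.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(TranscribeCodingStrand("AGCT")); // AGCU
    }
}
```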
Using MBF in your applications
• Using MBF is as simple as adding a reference to Bio.dll
▫ can then begin consuming available types
▫ convenient to add assembly to project – to ensure it is available
▫ supplied distribution requires the full .NET 4.0 framework install
MBF Starter Project [Step 1]
• If you are starting fresh, you can use the MBF Starter Template
▫ added to Visual Studio 2010 when you installed MBF
▫ requires .NET Framework 4 and the Visual C# project type to be selected
▫ select MBF Console Application from the project types
MBF Starter Project [Step 2]
• Select options you want to use from MBF in your new app
▫ each checkbox will add standard methods for you to utilize
▫ one option provides a simple text-file logging capability
▫ click Finish to generate the project
MBF Starter Project [Step 3]
• Add your code to the Main method
▫ call the supplied methods to get / save and manipulate the
sequences
class Program
{
static void Main(string[] args)
{
// TODO: Your Code Goes Here
}
// Exports a given sequence to a file in FastA format
static void ExportFastA(ISequence sequence, string filename);
// Parses a FastA file which has one or more sequences.
static IList<ISequence> ParseFastA(string filename);
// Write a given string to the application log.
static void WriteLog(string matter);
// Method to align two sequences using NeedlemanWunschAligner.
public static IList<IPairwiseSequenceAlignment> AlignSequences(
ISequence referenceSequence, ISequence querySequence);
}
Contributing back to MBF
• MBF is an open source project
▫ Microsoft wants your ideas, contributions and feedback
• Useful to download the source code
▫ read through to get good coding ideas
▫ extend or repurpose it
• Consider contributing changes / features back to the project
▫ read MBF_Onboarding.doc or download the guide from
http://mbf.codeplex.com/Project/Download/FileDownload.aspx?DownloadId=112159
• All contributions must be released under Ms-PL license
Examining the Source Code
• Can retrieve source code from TFS repository[1]
▫ or as self-contained .zip file
• Solution MBI.sln contains all projects
▫ Bio – MBF core library
▫ Bio.Workflow
▫ WebServiceHandlers
▫ Unit tests
▫ Sample SDK code
Unit Testing
• MBF includes suite of unit tests for all components
▫ uses nUnit (www.nunit.org), included with distribution
▫ test cases are in separate unit-test assemblies
▫ any contributions to codebase must include unit tests
Running unit tests
• Running unit tests involves three steps
1. Execute nUnit.exe from public/ext/nunit/bin/net-2.0
2. Open the Bio.Tests.dll assembly with nUnit
3. Click the Run button to execute the unit tests
Writing your own unit tests
• Unit tests are just blocks of code written to test other code
▫ tests assumptions, edge cases, error cases, and functionality
• nUnit makes writing test cases easy
▫ uses .NET attributes to signal intent
▫ Assert class provides helper methods to test assertions
using Bio;
using NUnit.Framework;
...
[TestFixture] // identifies this as a unit test class
public class ReverseTests
{
[Test] // identifies this as a unit test method
public void TestReverse()
{
Sequence sequence = new Sequence(Alphabets.DNA, "AGCT");
string reverse = sequence.Reverse.ToString();
Assert.AreEqual("TCGA", reverse); // verifies the two strings are equal
}
}
Summary
• MBF framework is used to build bioinformatics applications
▫ open source
▫ highly extensible
▫ scalable
▫ flexible data architecture
• Based on .NET
▫ language agnostic
▫ allows any application style
▫ can use all the power and flexibility of .NET
• Sequences are the core concept in the framework
▫ contain sequence items [symbols]
▫ based on alphabets
▫ read/written using parsers/formatters
▫ passed as arguments and returned from algorithms