Computational Biology & Bioinformatics: Molecular Biology Primer

What is Bioinformatics?
Bioinformatics is an interdisciplinary field concerned with:
• Biology: the study of information content, structure, and processes in biological systems
• The use of computer science (informatics) to address problems in the life sciences, which includes the creation of databases on genomes, proteins, and metabolic pathways and mining them for knowledge
• Computer science: the development of efficient algorithms that solve biological problems

[Diagram labels: bioinformatics, biology, informatics, information, mathematics]

Bioinformatics = Biological Data + Computer Calculations

Where does Bioinformatics come from?

A little Biology: Evolutionary Tree of Life
[Figure: evolutionary tree of life. Source: Njsas.org]

Animal Cell (Eukaryotic)
[Figure: animal cell. Source: Faculty.southwest.tn.edu]

Bacterial Cell (Prokaryote)
[Figure: bacterial cell. Source: Uccs.edu]

Cells & DNA
[Figure: cells and DNA. Source: Ogm-info.com]

DNA and Chromosome
[Figure: DNA and chromosome. Source: wikipedia]

Four types of nucleotide bases in DNA
Note that A pairs with T, and G pairs with C.

Primary Structure of DNA
• Unbranched polymer
• Sequence of nucleotide bases
• Double stranded

atgaatcgta ggggtttgaa cgctggcaat acgatgactt ctcaagcgaa
cattgacgac ggcagctgga aggcggtctc cgagggcgga ……

Building Blocks of Biological Systems: nucleotides and amino acids
• DNA (nucleotides, 4 types): information carrier/encoder.
• RNA: bridge from DNA to protein.
• Protein (amino acids, 20 types): action molecules.

Processes
• Replication of DNA
• Transcription of gene (DNA) to messenger RNA (mRNA)
• Translation of mRNA into proteins
• Folding of proteins into 3D form
• Biochemical or structural functions of proteins

DNA → (transcription) → RNA → (translation) → Protein
(access excellence resource center)

© 1999 The International Herpes Management Forum, all rights reserved.

Translation: Universal Genetic Code
• Translation from nucleotide code to amino acid code.
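The translation step named above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes the standard genetic code, reads the DNA coding strand directly in frame 1 (skipping the explicit mRNA step), and stops at the first stop codon. The names `CODON_TABLE` and `translate` are made up for this sketch. The input is the sequence fragment from the "Primary Structure of DNA" slide.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G base order.
# "*" marks a stop codon. (Assumption: the standard code is the one meant.)
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA coding strand, frame 1, until the first stop codon."""
    dna = "".join(dna.split()).upper()  # drop the spaces used for display
    protein = []
    for i in range(0, len(dna) - 2, 3):  # step through complete codons only
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


fragment = ("atgaatcgta ggggtttgaa cgctggcaat acgatgactt ctcaagcgaa "
            "cattgacgac ggcagctgga aggcggtctc cgagggcgga")
print(translate(fragment))
```

Each group of three nucleotides maps to one amino acid, so the 90 bases of the fragment give a 30-residue peptide.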
atgaatcgta ggggtttgaa cgctggcaat acgatgactt ctcaagcgaa
cattgacgac ggcagctgga aggcggtctc cgagggcgga ……

MNRRGLNAGNTMTSQANIDDGSWKAVSEGG …

Genetic Code
[Figure: genetic code table. Source: uccs.edu]

Building Blocks of Biological Systems: nucleotides and amino acids
• DNA (nucleotides, 4 types): information carrier/encoder.
• RNA: bridge from DNA to protein.
• Protein (amino acids, 20 types): action molecules.

Sequence of Amino Acids: Protein
• Unbranched polymer
• Peptide backbone
• Twenty side chain types
• 3D structure the key

[Figure labels: Amino Acid, Polypeptide Chain]

From genes to proteins and their function

Gene
> DNA sequence
AATTCATGAAAATCGTATACTGGTCTGGTACCGGCAACAC
TGAGAAAATGGCAGAGCTCATCGCTAAAGGTATCATCGAA
TCTGGTAAAGACGTCAACACCATCAACGTGTCTGACGTTA
ACATCGATGAACTGCTGAACGAAGATATCCTGATCCTGGG
TTGCTCTGCCATGGGCGATGAAGTTCTCGAGGAAAGCGAA
TTTGAACCGTTCATCGAAGAGATCTCTACCAAAATCTCTG
GTAAGAAGGTTGCGCTGTTCGGTTCTTACGGTTGGGGCGA
CGGTAAGTGGATGCGTGACTTCGAAGAACGTATGAACGGC
TACGGTTGCGTTGTTGTTGAGACCCCGCTGATCGTTCAGA
ACGAGCCGGACGAAGCTGAGCAGGACTGCATCGAATTTGG
TAAGAAGATCGCGAACATCTAGTAGA

Function
> Protein sequence
MKIVYWSGTGNTEKMAELIAKGIIESGKDVNTINVS
DVNIDELLNEDILILGCSAMGDEVLEESEFEPFIEEIS
TKISGKKVALFGSYGWGDGKWMRDFEERMNGYG
CVVVETPLIVQNEPDEAEQDCIEFGKKIANI

Languages of Protein and DNA
In real life

What do bioinformaticians study?

Example: Comparison and Similarity
• What is the function of these structures?
• What is the function of this sequence?
• What is the function of this motif?
– the fold provides a scaffold, which can be decorated in different ways by different sequences to confer different functions
– knowing the fold & function allows us to rationalise how the structure effects its function at the molecular level
• Compare proteins with similar sequences and understand what the similarities and differences mean.
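One common way to compare two sequences, as the last bullet asks, is a global alignment followed by a percent-identity count. The sketch below uses the Needleman-Wunsch dynamic programming algorithm, which the slides do not name; it is offered only as one possible illustration. The scoring values (`match=1`, `mismatch=-1`, `gap=-2`) and the function names are assumptions chosen for this example. Real protein comparisons usually score substitutions with a matrix such as BLOSUM62 instead of a flat match/mismatch score.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1], b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


def percent_identity(aligned_a, aligned_b):
    """Share of alignment columns where both sequences have the same residue."""
    same = sum(1 for p, q in zip(aligned_a, aligned_b) if p == q and p != "-")
    return 100.0 * same / len(aligned_a)


# Toy sequences, invented for illustration.
s, x, y = needleman_wunsch("ACGT", "AGT")
print(s, x, y, percent_identity(x, y))
```

High identity between two proteins hints at shared ancestry and often shared function, while the columns that differ point to where the function may have diverged.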
Genomes sequenced
[Figure: timeline of sequenced genomes]
• 1995: First bacterial genomes sequenced (H. influenzae and M. genitalium)
• 1996: The yeast genome
• 1997: E. coli K12
• 1998: C. elegans
• 1999: Full sequence of chr. 22
• 2000: A. thaliana; D. melanogaster genome & chr. 21
• 2001: Human draft
• 2002: Mouse, Ciona, Rice, Fugu, Anopheles
• 2003 to 2005: Human finished, Rat, Chicken, Chimpanzee, Xenopus, Zebrafish

Growth in Data and Databases
[Chart: sequences (millions) and number of databases (estimated) by year, 1980 to 2007]

Whole genome comparisons: Gene order in genomes
The ~1000 genes on Mouse Chromosome 16 map to Human Chromosomes 3, 8, 12, 16, 21, and 22.
[Figure: Mouse Chromosome 16]

Comparative Genomics
[Figure: Helicobacter pylori J99 compared with Helicobacter pylori 26695]

Phylogenetic tree
[Figure: phylogenetic tree with HRV strains as leaves, e.g. HRV10, HRV100, HRV66, HRV77, ..., HRV22, HRV64, HRV94]

Reconstructing phylogenetic trees
• Graph based and Optimization Methods

Protein structure prediction and modeling
• Predict the 3-dimensional structure of a protein from its primary sequence

MNIFEMLRID EGLRLKIYKD TEGYYTIGIG
HLLTKSPSLN AAKSELDKAI GRNCNGVITK
DEAEKLFNQD VDAAVRGILR NAKLKPVYDS
LDAVRRCALI NMVFQMGETG VAGFTNSLRM
LQQKRWDEAA VNLAKSRWYN QTPNRAKRVI
TTFRTGTWDA YKNL

→ ?

Computer Aided Drug Design
• Understanding how structures bind other molecules (function)
• Designing inhibitors
• Docking, structure modeling

Protein docking
Given 2 biological molecules (one of them a protein), determine whether they interact.
– Protein-protein docking
– Protein-ligand docking
• Efficiently represent the docking surface and identify regions of interest.
• Match corresponding surfaces to optimize binding sites.

Optimization methods for docking problems
• Genetic Algorithm
• Linear optimization
• Nonlinear optimization
• Direct search methods
• ..

Drug Lead Screening & Docking
[Figure: candidate molecule and binding site, marked "?"]
Complementarity
- Shape
- Chemical
- Electrostatic

Molecular Graphs and Graph similarity
• A molecular structure can be interpreted as a mathematical graph where each atom is a node and each bond is an edge.
• Such a representation allows for the mathematical processing of molecular structures using graph theory.

Microarray: Measuring Gene Expression
Idea: measure the amount of mRNA to see which genes are being expressed in (used by) the cell. Measuring protein would be more direct, but is currently harder.

Hybridization, RNA, cDNA

Microarray: The Process
Chemistry Basics:
• Surface chemistry is used to attach the probe molecules to the glass substrate.
• Chemical reactions are used to attach the fluorescent dyes to the target molecules.
• Probe and target hybridise to form a double helix.
[Figure labels: Labelled targets in solution, Heteroduplexes, Probes on array, Hybridisation]
[Figure labels: RNA sample 1 + green label, RNA sample 2 + red label, the array, scanner]

Tumors and Microarray
Tumor gene profiles and Microarray data
Image portrays gene expression profiles showing differences between different tumors.
Tumors:
• MD (medulloblastoma)
• Mglio (malignant glioma)
• Rhab (rhabdoid)
• PNET (primitive neuroectodermal tumor)
• Ncer: normal cerebella

Image Processing for Microarray
Resolution
• standard 10 µm [currently, max 5 µm]
• 100 µm spot on chip = 10 pixels in diameter
Image format
• TIFF (tagged image file format), 16 bit (65,536 levels of grey)
• 1 cm x 1 cm image at 16 bit = 2 MB of data (at 10 µm resolution: 1000 x 1000 pixels x 2 bytes per pixel)

What is a genetic network?
Gene networks are usually represented as directed graphs where the nodes are the genes and the edges represent regulation. Networks summarize a limited set of relationships among a subset of genes, including both positive and negative feedback loops.
Jenssen et al. 2001

Construction of a Simple Network
Clustering
Brazhnik et al.
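The directed-graph view of a gene network described above can be sketched as follows. This is only an illustration under stated assumptions: the gene names are hypothetical, each edge carries a sign (+1 for activation, -1 for repression), and a feedback loop is called positive or negative according to the product of the signs along the cycle. The function name `find_feedback_loops` is made up for this sketch and does not come from the cited works.

```python
def find_feedback_loops(edges):
    """Enumerate simple cycles in a signed directed graph.

    edges maps (regulator, target) -> +1 (activation) or -1 (repression).
    Returns a list of (genes_on_cycle, sign), where sign is the product
    of the edge signs around the cycle.
    """
    graph = {}
    for (src, dst), sign in edges.items():
        graph.setdefault(src, []).append((dst, sign))
        graph.setdefault(dst, [])

    loops = []

    def dfs(start, node, path, sign):
        for nxt, s in graph[node]:
            if nxt == start:
                loops.append((path[:], sign * s))   # closed a cycle
            elif nxt not in path and nxt > start:
                # Only visit genes "larger" than start, so every cycle is
                # reported once, rooted at its alphabetically smallest gene.
                path.append(nxt)
                dfs(start, nxt, path, sign * s)
                path.pop()

    for start in sorted(graph):
        dfs(start, start, [start], 1)
    return loops


# Hypothetical three-gene network, invented for illustration.
network = {
    ("geneA", "geneB"): +1,   # A activates B
    ("geneB", "geneC"): +1,   # B activates C
    ("geneC", "geneA"): -1,   # C represses A
    ("geneB", "geneA"): +1,   # B activates A
}
for genes, sign in find_feedback_loops(network):
    kind = "positive" if sign > 0 else "negative"
    print(" -> ".join(genes), "is a", kind, "feedback loop")
```

The exhaustive cycle search is fine for the small subsets of genes such networks usually summarize; larger graphs would call for a dedicated graph library.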