Informatics for Molecular
Biologists
Ansuman Chattopadhyay, PhD
Head, Molecular Biology Information Service
Falk Library,
Health Sciences Library System
University of Pittsburgh
Molecular Biology Information Service
Falk Library of Health Sciences
Health Sciences Library System
University of Pittsburgh
200 Scaife Hall
Desoto and Terrace Streets
Pittsburgh, PA 15261
Topics
• Searching tools
– Internet
– PubMed
• NCBI developed bioinformatics tools
– Entrez Gene
• Structure visualization tools
– Cn3D
• Genome Browsers
– UCSC genome browsers
– NCBI Map viewer
Information search space
• Biomedical literature
databases
• Molecular databases
• Organism whole genome
sequences
Literature database
• NCBI PubMed
– contains over 15 million citations dating back
to the mid-1950s.
Search:
“apoptosis”: 130,476
“breast cancer”: 160,055
“p53”: 42,418
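Counts like these can also be retrieved programmatically. The following is a minimal sketch, assuming NCBI's E-utilities ESearch endpoint with its `db`, `term`, `rettype`, and `retmode` parameters; the helper names `pubmed_count_url` and `parse_count` are hypothetical, and the network call itself is left to the reader.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities ESearch.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_count_url(term):
    """Build an ESearch URL that asks only for the number of PubMed hits."""
    params = {"db": "pubmed", "term": term, "rettype": "count", "retmode": "json"}
    return EUTILS_ESEARCH + "?" + urlencode(params)


def parse_count(response_text):
    """Extract the hit count from an ESearch JSON response body."""
    return int(json.loads(response_text)["esearchresult"]["count"])


if __name__ == "__main__":
    for term in ("apoptosis", "breast cancer", "p53"):
        print(term, "->", pubmed_count_url(term))
```

Fetching each URL (for example with `urllib.request.urlopen`) and passing the response body to `parse_count` should give the current citation count, which will be larger than the figures on this slide.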
Molecular databases
[Chart: number of Articles and Databases per year, 1996 to 2004; vertical axis 0 to 600]
Organisms whole genome
sequences
http://www.genomesonline.org/
Internet for Biologists
• Google Vs Clusty
– Google: Single ranked list of search results
– Clusty: Search results categorized into topical
clusters
Vivísimo's clustering technology creates topical
categories on-the-fly from the search results, using
terms in the title, snippet, and any other available
textual description in the search results themselves
Google Vs Clusty
• Search Example: Pittsburgh
– Google
– Clusty
Clusty
Clusters help you see your
search results by topic, so
you can zero in on exactly
what you’re looking for
or discover unexpected
relationships between items.
Search examples for Clusty
• SNP
• BLAST
• Lupus
Web 2.0
• Website bookmark and tagging tool
– Del.icio.us: a social bookmarking web service for storing, sharing, and
discovering web bookmarks.
Web 2.0
• Connotea: http://www.connotea.org/
Medline searching tool
• PubMed vs ClusterMed
Search example: macular degeneration, cell cycle, p53
Molecular databases
• DNA Sequence Databases and Analysis Tools
• Enzymes and Pathways
• Gene Mutations, Genetic Variations and Diseases
• Genomics Databases and Analysis Tools
• Immunological Databases and Tools
• Microarray, SAGE, and other Gene Expression
• Organelle Databases
• Other Databases and Tools (Literature Mining, Lab Protocols, Medical Topics, and
others)
• Plant Databases
• Protein Sequence Databases and Analysis Tools
• Proteomics Resources
• RNA Databases and Analysis Tools
• Structure Databases and Analysis Tools
HSLS OBRC
• http://www.hsls.pitt.edu/guides/genetics/obrc/
Types of databases
– By level of curation:
• Archival
– GenBank, GenPept, ssSNP
• Curated
– RefSeq, Swiss-Prot, RefSNP
Types of databases
– Archival data
• repository of information
• redundant; might have many sequence records for
the same gene, each from a different lab
• submitters maintain editorial control over their
records:
what goes in is what comes out
• no controlled vocabulary
• variation in annotation of biological features
Example: GenBank record
GenBank
• archival database of nucleotide sequences from
>130,000 organisms
• records annotated with coding region (CDS)
features also include amino acid translations
• each record represents the work of a single lab
• redundant; can have many sequence records for
a single gene
International Nucleotide Sequence Database
Collaboration
Types of databases
RefSeq
• Curated data
– non-redundant; one record for each gene, or
each splice variant
– each record is intended to present an
encapsulation of the current understanding of
a gene or protein, similar to a review article
– records contain value-added information that
has been added by one or more experts
RefSeq
• Database of reference sequences
• Curated
• Non-redundant; one record for each gene, or each splice variant, from each
organism represented
• A representative GenBank record is used as the source for a RefSeq record
• Value-added information is added by one or more experts
• Each record is intended to present an encapsulation of the current
understanding of a gene or protein, similar to a review article
• Variety of accession number prefixes (NM_, NP_, etc.) and status codes
(provisional, reviewed, etc.). More about those in later slides.
• The RefSeq database includes genomic DNA, mRNA, and protein sequences,
so it organizes information according to the central dogma of
biology
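The accession prefixes mentioned above indicate the molecule type of a RefSeq record. A small sketch of how they might be decoded follows; the table covers only the most common prefixes, and the function name `refseq_molecule_type` is hypothetical.

```python
# Common RefSeq accession prefixes and the molecule type they denote.
# This is a partial table for illustration, not the complete list.
REFSEQ_PREFIXES = {
    "NC_": "genomic (complete genomic molecule)",
    "NG_": "genomic region",
    "NM_": "mRNA",
    "NR_": "non-coding RNA",
    "NP_": "protein",
    "XM_": "predicted mRNA (model)",
    "XR_": "predicted non-coding RNA (model)",
    "XP_": "predicted protein (model)",
}


def refseq_molecule_type(accession):
    """Return the molecule type implied by a RefSeq accession prefix."""
    prefix = accession[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "unknown or not a RefSeq accession")


if __name__ == "__main__":
    for acc in ("NM_000546", "NP_000537.2", "NP_247556"):
        print(acc, "->", refseq_molecule_type(acc))
```

The NM_ to NP_ pairing mirrors the central dogma organization noted above: each curated mRNA record has a corresponding protein record.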
RefSeq
Searching GenBank
• Find messenger RNA
sequence for Human
epidermal growth factor (EGF)
gene.
Databases developers
• NCBI
• EBI
Neighbors and Hard Links
[Diagram: Entrez databases (PubMed abstracts, 3-D Structure, Taxonomy, Genomes, Nucleotide sequences, Protein sequences) joined by hard links; neighbors are computed by word weight, VAST, BLAST, and phylogeny.]
Source: NCBI
NCBI Tools
Entrez Gene
NCBI’s database for gene-centric information.
It focuses on organisms whose genomes are:
• completely sequenced
• supported by an active research community that contributes
gene-specific information
• scheduled for intense sequence analysis
– Total taxa: 4,246; total genes: 2,843,587
• 160,000 organisms in the nucleotide sequence database
(GenBank)
Entrez Gene
• each record represents a single gene from a given organism
Gene record includes:
– a unique identifier or GeneID assigned by NCBI
– a preferred symbol
– and any one or more of:
– sequence information
– map information
– official nomenclature from an authority list
– alternate gene symbols
– summary of gene/protein function
– published references that provide additional information on
function
– expression
– homology data
– and more
Gene / Protein
• Exon-intron structure
• Chromosomal localization
• Genomic sequence
• Homologous sequences
• Amino acid sequence
• mRNA sequence
• Expression profile
• 3D structure
• Disease
• SNP
• Interacting partners
Searching Entrez Gene
Entrez Gene
Find:
• gene symbols and aliases
• sequences: genomic, mRNA, protein
• intron-exon architecture
• genomic context: neighboring and antisense
genes
• Interacting partners
• associated gene ontology terms: function,
cellular component and biological process
Entrez Gene record
Query: BRCA1
Search Tips:
Query text box: BRCA1
Limits:
• To limit your search to a specific field, select “Gene name” from the drop-down menu
• Limit by taxonomy: select “Homo sapiens”
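The same limits can be written directly into the query text using Entrez field tags. The sketch below assumes the `[Gene Name]` and `[Organism]` field tags of the Gene database and the `https://www.ncbi.nlm.nih.gov/gene/?term=` search URL; the helper names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed web search address for the Entrez Gene database.
GENE_SEARCH_BASE = "https://www.ncbi.nlm.nih.gov/gene/"


def entrez_gene_query(symbol, organism):
    """Combine a gene symbol and an organism into a fielded Entrez query."""
    return f"{symbol}[Gene Name] AND {organism}[Organism]"


def entrez_gene_url(symbol, organism):
    """Build a browser URL that runs the fielded query against Entrez Gene."""
    return GENE_SEARCH_BASE + "?" + urlencode({"term": entrez_gene_query(symbol, organism)})


if __name__ == "__main__":
    print(entrez_gene_query("BRCA1", "Homo sapiens"))
    print(entrez_gene_url("BRCA1", "Homo sapiens"))
```

Typing the fielded query into the search box should be equivalent to setting the two limits by hand.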
Name and aliases
Chromosomal location
Source: NCBI
Entrez Gene: sequences and genomic context
mRNA Seq
Genomic Seq
Sequences: mRNA, Genomic, Protein
Protein Seq
Transcription and alternative splicing
Alternative splicing: http://www.exonhit.com/UserFiles/Image/epissage.swf
Entrez Gene: intron-exon architectures
Tips: Change Display to “Gene Table” from “Summary”
mRNA Seq
Genomic Seq
Protein Seq
Gene Ontology
– Controlled vocabulary tagging
• Function
• Biological Processes
• Cellular Component
Entrez Gene : Gene Ontology
Homologous sequences
Entrez Gene: Homologous sequence
Tips: change Display settings from “Summary”
to “Alignment score”
to “Multiple Alignment”
Single nucleotide polymorphisms
Single nucleotide polymorphisms (SNPs) are DNA sequence
variations that occur when a single nucleotide (A, T, C, or G)
in the genome sequence is altered.
For example, a SNP might change
the DNA sequence AAGGCTAA to ATGGCTAA
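The example above can be checked with a few lines of code. This is a minimal sketch that compares two equal-length sequences position by position; the function name `find_snps` is hypothetical, and real SNP calling works on aligned reads rather than two bare strings.

```python
def find_snps(reference, variant):
    """Return (position, reference base, variant base) for each difference.

    Positions are 1-based. Both sequences must have the same length,
    since this sketch handles substitutions only, not insertions or deletions.
    """
    if len(reference) != len(variant):
        raise ValueError("sequences must be the same length")
    return [
        (i + 1, ref_base, var_base)
        for i, (ref_base, var_base) in enumerate(zip(reference, variant))
        if ref_base != var_base
    ]


if __name__ == "__main__":
    print(find_snps("AAGGCTAA", "ATGGCTAA"))  # [(2, 'A', 'T')]
```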
SNPs
Coding SNPs
Entrez Gene: SNPs
Protein Info: HPRD
Protein Info: HPRD
Entrez Gene: Links
Entrez Gene: Linkout
Seq to Entrez gene: UCSC BLAT
Query Seq: SGLTPEEFMLVYKFARKHHITLTNLITEE
BLAT to Entrez Gene
Hands-On Exercise Question
Find chromosomal location of your gene
of interest.
How many exons have been reported for
your gene?
What are its neighboring genes?
Query sequence:
IHYNYMCNSSCMGGMNRRPILTII
Exercise:
Find the protein sequence for rat leptin.
BLAT this sequence vs. the human
genome to find the human homolog.
Look for SNPs in the coding region of
this gene. Are there any?
Sequence alignment
• Pairwise alignment
• Multiple alignment
Pairwise alignment
• Global
– Needleman-Wunsch (1970)
• Local
– Smith-Waterman (1981)
– Lipman and Pearson / FASTA (1985)
– Basic Local Alignment Search Tool (BLAST, 1990)
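The difference between global and local alignment is easiest to see in the scoring recurrences. The sketch below computes only the optimal scores (no traceback), using a simple match/mismatch/linear-gap scheme whose values are assumptions chosen for illustration; real tools use substitution matrices and affine gap penalties.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch: best score for aligning all of a with all of b."""
    prev = [j * gap for j in range(len(b) + 1)]  # first row: leading gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # first column: leading gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman: best score over all pairs of substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets an alignment restart anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


if __name__ == "__main__":
    print(global_score("AAGGCTAA", "ATGGCTAA"))
    print(local_score("TTTGGCTAACCC", "AGGCTAAG"))
```

The only structural differences are the initialization (gap penalties versus zeros) and the floor at zero in the local version. FASTA and BLAST are heuristics that approximate the local score much faster by first looking for short exact or near-exact word matches.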
BLAST
To find homologous sequence for a sequence of interest
by searching sequence databases:
Nucleotide:
TTGGATTATTTGGGGATAATAATGAAGATAGCAA
TTATCTCAGGGAAAGGAGGAGTAGGAAAATCTTC
TATTTCAACATCCTTAGCTAAGCTGTTTTCAAAAG
AGTTTAATATTGTAGCATTAGATTGTGATGTTGAT
Protein:
MSVMYKKILYPTDFSETAEIALKHVKAFKTLKAEEVILLHVIDER
EIKKRDIFSLLLGVAGLNKSVEEFENELKNKLTEEAKNKMENIK
KELEDVGFKVKDIIVVGIPHEEIVKIAEDEGVDIIIMGSHGKTNLKEILLG
BLAST
• Find statistically significant matches, based on
sequence similarity, to a protein or nucleotide
sequence of interest.
• Obtain information on the inferred function of the gene or
protein.
• Find conserved domains in your sequence of interest
that are common to many sequences.
• Compare two known sequences for similarity.
What you can do with BLAST
• Find homologous sequences in all combinations
(DNA/Protein) of query and database.
– DNA vs. DNA
– DNA translation vs. Protein
– Protein vs. Protein
– Protein vs. DNA translation
– DNA translation vs. DNA translation
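Each of the five combinations above corresponds to one of the standard BLAST programs. The lookup below is a small sketch of that mapping; the program names are the standard ones, while the function `choose_blast_program` and its string labels are hypothetical.

```python
# (query type, database type) -> BLAST program
BLAST_PROGRAMS = {
    ("dna", "dna"): "blastn",
    ("translated dna", "protein"): "blastx",
    ("protein", "protein"): "blastp",
    ("protein", "translated dna"): "tblastn",
    ("translated dna", "translated dna"): "tblastx",
}


def choose_blast_program(query_type, database_type):
    """Pick the BLAST program for a given query/database combination."""
    key = (query_type.lower(), database_type.lower())
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"no BLAST program for {query_type} vs. {database_type}")
    return BLAST_PROGRAMS[key]


if __name__ == "__main__":
    for (query, database), program in BLAST_PROGRAMS.items():
        print(f"{query} vs. {database}: {program}")
```

For example, searching a protein query against a nucleotide database that is translated in all six reading frames uses tblastn.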
BLAST exercise
• Find homologous
sequences for
uncharacterized
archaebacterial protein,
NP_247556, from
Methanococcus jannaschii
BLAST search
Sort by E values
Descriptions of hits
2 x 10^-65
Sequence description
Link to Entrez
The display cutoff (100 hits) overrides
the E value cutoff (10)
BLAST search
• Orthologs from closely related species will
have the highest scores and lowest E values
– Often E = 10^-30 to 10^-100
• Closely related homologs with highly
conserved function and structure will
have high scores
– Often E = 10^-15 to 10^-50
• Distantly related homologs may be
hard to identify
– Less than E = 10^-4
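To see why high scores go with tiny E values, it helps to look at how the two are related. The sketch below uses the textbook approximation E = m × n × 2^(-S'), where S' is the bit score and m and n are the query and database lengths; this ignores the effective-length corrections that BLAST applies, and the function name `expect_value` is hypothetical.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E value: expected number of chance hits scoring this well.

    Uses E = m * n * 2**(-S') with raw (uncorrected) lengths, so results
    will differ somewhat from the values BLAST itself reports.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Assumed example numbers: a 300-residue query against 1e9 residues.
    for score in (40, 80, 160, 250):
        print(score, expect_value(score, 300, 10**9))
```

Each additional bit of score halves the E value, and a larger database raises it, which is why the same alignment can look less significant when searched against a bigger database.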
Protein domains
• Wikipedia
SH2 (Src homology 2) domains: signal transduction; involved in recognition
of phosphorylated tyrosine (pTyr). SH2 domains typically bind pTyr-containing
ligands via two surface pockets, a pTyr pocket and a hydrophobic binding pocket,
allowing proteins with SH2 domains to localize to tyrosine-phosphorylated sites.
Searching CDD
• CDD SEARCH
Query sequence:
BLink
• BLink displays the graphical output of pre-computed
blastp results against the protein non-redundant (nr)
database. This graphical output includes:
– Alignment of up to 200 BLAST hits on the query sequence
– Best Hits to each organism
– List of known protein domains in the query sequence
– Filter hits by selecting the BLAST cutoff score
– Distribution of hits by taxonomic grouping
– Display of similar sequences with known 3D structure
– Filter hits by database and/or by taxonomic grouping
– Display a taxonomic tree of all organisms with similar sequences
Access: Link out from NCBI protein records
Link to TP53 BLink: http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=NP_000537.2&dopt=gp
Protein structure
Protein Data Bank (PDB)
• international database of 3-D biological macromolecular structures
• accepts direct submissions of structure data
• maintained by a nonprofit organization, the Research Collaboratory
for Structural Bioinformatics (RCSB), associated with Rutgers
University, San Diego Supercomputer Center, and the Biotechnology
Division of the National Institute of Standards and Technology
• contains molecular structures of proteins and nucleic acids, primarily
structures derived experimentally by X-ray crystallography and
NMR
• also includes some theoretical models, though they are not
encouraged.
3D structure viewing software
• NCBI Cn3D
The Cn3D home page includes a link in the blue sidebar for instructions
on installing Cn3D, which is available for PC, Mac, and Unix.
• FirstGlance in Jmol
A simple tool for macromolecular visualization.
Cn3D
• View the 3-dimensional structure for 1TUP and practice
using some of the Cn3D features that allow you to:
– spin the structure using your mouse
– use the control+left mouse button combination to zoom in and
out of the structure
– use the shift+left mouse button combination to move the
structure across the viewing window
– use the Style menu to render the structure in different ways (e.g.,
worms, space fill, ball and stick, ...)
– use the Style menu to color the structure in different ways (e.g.,
secondary structure, domain, ...)
– use the Style/Edit Global Style to label every 20th amino acid
What is it?
A genome browser is a computer
program that displays gene maps,
lets you browse chromosomes, and
aligns genes or gene models
with ESTs, contigs, and other features.
Genome Sequence Project Time Line
1976: RNA Bacteriophage MS2
1995: Haemophilus influenzae
2003: Human genome reference sequence
2005: 265 genomes;
21 archaeal, 211 bacterial, 33 eukaryotic
http://www.genomesonline.org/
Genome Browsers
• NCBI MAP Viewer
• EBI Ensembl
• UCSC Genome
Browser