Bioinformatics & experimental practice in proteomics.
Perfection (in design) is achieved not when there is nothing more to add, but rather when there is nothing more to take away. [1]
[1] The Cathedral and the Bazaar, Eric Steven Raymond
parameters & ontologies
[Diagram: MIAME and MIAME/Plant. Labels: experiment, sample, hybridization, normalization, data, array]
[Figure: UML class diagram of a proteomics experiment data model. Classes shown: Experiment, Sample, SampleOrigin, Organism, TaggingProcess, Analyte, OtherAnalyte, AnalyteProcessingStep, OtherAnalyteProcessingStep, ChemicalTreatment, TreatedAnalyte, Fraction, MSMSFraction, Column, MobilePhaseComponent, GradientStep, PercentX, AssayDataPoint, Gel, Gel1D, Gel2D, DiGEGel, GelItem, DiGEGelItem, RelatedGelItem, Band, Spot, BoundaryPoint, MassSpecMachine, MassSpecExperiment, IonSource, Electrospray, MALDI, OtherIonisation, mzAnalysis, OthermzAnalysis, ToF, Quadrupole, Hexapole, IonTrap, CollisionCell, Detection, PeakList, Peak, ListProcessing, ChromatogramPoint, PeakSpecificChromatogramIntegration, DBSearch, DBSearchParameters, PeptideHit, ProteinHit, Protein, TandemSequenceData, OntologyEntry.
Each class carries named attributes, for example MassSpecMachine (manufacturer, model_name, software_version), MassSpecExperiment (description, parameters_file), PeakList (list_type, description, mass_value_type) and Peak (m_to_z, abundance, multiplicity). Associations are marked with multiplicities (1, 0..1, 1..n, *) and some are {ordered}.]
C. Taylor, et al. Nature Biotechnology 21, 247 - 254 (2003)
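As a minimal sketch of how part of this model might look in code, the Python dataclasses below mirror four classes from the diagram (Peak, PeakList, MassSpecMachine, MassSpecExperiment) using the attribute names shown there. The field types, the optional multiplicity, the machine and peak_lists fields that stand in for the diagram's associations, and the base_peak helper are all assumptions made for illustration; they are not part of the published model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Peak:
    m_to_z: float
    abundance: float
    multiplicity: Optional[int] = None  # assumption: treated as optional here


@dataclass
class PeakList:
    list_type: str
    description: str
    mass_value_type: str
    # The diagram marks the peaks of a list as {ordered}; a Python list keeps order.
    peaks: List[Peak] = field(default_factory=list)

    def add_peak(self, peak: Peak) -> None:
        """Append a peak, preserving insertion order."""
        self.peaks.append(peak)

    def base_peak(self) -> Optional[Peak]:
        """Hypothetical helper: the most abundant peak, or None if the list is empty."""
        return max(self.peaks, key=lambda p: p.abundance, default=None)


@dataclass
class MassSpecMachine:
    manufacturer: str
    model_name: str
    software_version: str


@dataclass
class MassSpecExperiment:
    description: str
    parameters_file: str
    machine: MassSpecMachine  # assumed name for the link to the instrument
    peak_lists: List[PeakList] = field(default_factory=list)  # assumed name
```

Usage: build a PeakList, call add_peak for each measured peak, and attach the list to a MassSpecExperiment. Keeping the classes as plain data holders makes them easy to serialise to whichever XML or relational form a repository chooses.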
Sample Preparation Technologies
[Diagram. Headings: Affinity Depletion/Enrichment; Chromatography; Chemical Labeling; Chemical Modification. Labels: Antibody; Lectin; Chemical; IEX (SCX, AEC); SEC; RP; cICAT; iTRAQ; 18O; Enzymatic; Chemoenzymatic; Chemistry; Proteins; Peptides]
Global Sample Preparation Workflow
[Workflow diagram. Labels: Immunodepletion (bound, unbound); SEC; IEX; Enzymatic Digestion; cICAT; iTRAQ; Streptavidin Column; SCX; RP HPLC; MS]
Global Mass Spectrometry Workflow
[Workflow diagram. Labels: 1D, 2D HPLC; split flow; Spotting Robot; ESI/MS/MS (QSTAR, QTRAP, Q-TOF, LCQ-Deca): MS survey, MS/MS; MALDI/MS/MS (DE STR, TOF/TOF, vMALDI-LTQ): MS survey, MS/MS; Bioinformatics]
Targeted MS-based Platforms for Glycoproteins
[Workflow diagram. Labels: Immunodepletion; Lectin Affinity; Enzymatic Digestion; IEX; SCX; RP HPLC; 4000 QTRAP (MS survey, precursor ion; MS/MS); Bioinformatics]
Targeted MS-based Platforms for Phosphoproteins
[Workflow diagram 1. Labels: anti-pS,T,Y; IEX; Enzymatic Digestion; TiO2; IMAC; RP HPLC; 4000 QTRAP (MS survey, precursor ion/neutral loss; MS/MS); Bioinformatics]
[Workflow diagram 2. Labels: Immunodepletion; anti-pS,T,Y; pS,T to Aec conversion; Enzymatic Digestion; SCX; Spotting Robot; RP HPLC; vMALDI-LTQ (MS survey; MS/MS); Bioinformatics]
Bioinformatics
[Diagram. Labels: Data generation (UCSF/Buck/LBNL); Data (mzXML); Experiment details (FuGE); Platform-independent analysis (UBC); Quality, Quantity, Identity; Statistical analysis (LBNL); Storage archive (LBNL); Pattern & trends (HTML); Views (HTML)]
1. What proteins are present?
- IDENTITY
2. How much of each protein?
- QUANTITY
3. How reliable are the results?
- QUALITY
What is the desired output?
1. Study design and sample generation
2. Separations and sample handling
3. Column chromatography
4. Capillary electrophoresis
5. Mass spectrometry
6. Informatics for mass spectrometry
7. Gel electrophoresis
8. Gel image informatics
9. Molecular Interaction Experiments
10. Statistical Analysis of Data
The Minimum Information About a Proteomics Experiment (MIAPE)
“The problem of legacy data sets will be significant in scale
and difficult to address. Clearly, a lack of annotation does
not mean that a data set is without worth …, so the
following principles should be applied when re-annotating
such legacy data:
1. The data set should be re-annotated as fully as possible,
with reference to the appropriate MIAPE modules; the data
set should then be flagged as legacy, and an indication given
of where the reporting requirements have not been met
(e.g. a summary of missing items).
2. Data and metadata should never be created to
supplement the real data in a file. The only allowable
additions are those that serve to indicate the absence of
real data ….”
http://psidev.sourceforge.net/miape/MIAPE_Parent_3.1.pdf
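The two principles quoted above can be sketched in a few lines of Python: flag the record as legacy, summarise which required items are missing, and never invent values to fill the gaps. This is only an illustration under stated assumptions: the function name flag_legacy, the "legacy" and "missing_items" keys, and the idea of passing the required field names as a plain list are all hypothetical; the real reporting requirements are defined by the individual MIAPE modules.

```python
from typing import Any, Dict, Iterable, List


def flag_legacy(record: Dict[str, Any], required: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of `record` marked as legacy, with a summary of missing items.

    `required` is a caller-supplied list of field names (an assumption; it
    stands in for whatever a MIAPE module actually requires).
    """
    # A field counts as missing when it is absent, None, or an empty string.
    missing: List[str] = [
        name for name in required if record.get(name) in (None, "")
    ]
    annotated = dict(record)  # existing values are copied unchanged
    annotated["legacy"] = True
    # The only additions indicate the absence of data; nothing is made up.
    annotated["missing_items"] = missing
    return annotated
```

Usage: flag_legacy({"manufacturer": "Acme", "model_name": ""}, ["manufacturer", "model_name", "software_version"]) returns the original values plus legacy set to True and missing_items listing model_name and software_version. The input record is left untouched.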
Protein Sequence Collections (2001)
Collection      Annotations
PIR             Good (public)
SWISS-PROT      Good (private)
GenPept         Some (public)
TREMBL          Some (public)
NR              Good (public)
OWL             n/a (public)
dbEST           n/a (public)
HGP             progressing (both)
YGP             Good (both)
Genomes/Unigene collections
"Biologists would rather share their toothbrush
than share a gene name," says Michael Ashburner,
... "Gene nomenclature is beyond redemption."
“Without the umbrella of HUPO, hopes for
standardization in proteomics would have been
bleak, with researchers being more inclined to
use their rivals' toothbrushes than their
protocols.”
Quotes from Nature editorials
GPMDB design
[Diagram: search servers (ENSEMBL, UNIGENE, boutique), database servers, a master repository and a public repository, with overlapping callouts:]
• public information repository
• query interfaces (keyword, sequence, mass, accession)
• publicly available sequence assignment servers (search engine sites)
• multiple locations
• central repository
• specialized sequence collections
• import from servers
• accept MS/MS data in multiple formats (MGF, mzXML, mzData, DTA)
• servers used in daily research
• not publicly accessible
• user interface and data analysis software
• responsible for routine data processing tasks
• all software available as open source code
• information for bioinformatics research
• public site used directly for on-line analysis
• multiple sites
• user interface simplifies query process
Practical systems lead to complications
XIAPE repository deployment
Minimum user interface?
1. What does homologue mean if you only have a bunch
of peptides?
2. How do you resolve privacy issues?
3. What data formats should be allowed, both for input
and output?
4. Which computer operating systems should be
supported? Which computer languages should be
used?
5. How much detail about each experiment has to be
recorded to make the data useful?
Decisions needed to create a repository
[Diagram: a set of Peptides bracketed against three categories: Best protein sequence, Independent homologue, Dependent homologue]
What does homologue mean if you only have a bunch of peptides?
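One way to make the question concrete is sketched below in Python. It rests entirely on assumed definitions that the slide does not state: the "best" protein is taken to be the one matched by the most peptides, a "dependent" homologue is a protein whose matched peptides are all also matched to the best protein, and an "independent" homologue shares at least one peptide with the best protein but also has peptide evidence of its own. The function name and the input shape (accession mapped to a set of peptide sequences) are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def classify_homologues(
    hits: Dict[str, Set[str]]
) -> Tuple[str, List[str], List[str]]:
    """Split proteins into (best, dependent homologues, independent homologues).

    `hits` maps a protein accession to the set of peptides matched to it and
    must not be empty. The definitions used are assumptions for illustration.
    """
    # Best protein: most matched peptides; ties broken by accession so the
    # result is deterministic.
    best = min(hits, key=lambda acc: (-len(hits[acc]), acc))
    best_peptides = hits[best]

    dependent: List[str] = []
    independent: List[str] = []
    for acc in sorted(hits):
        if acc == best:
            continue
        if not (hits[acc] & best_peptides):
            continue  # shares no peptide with the best protein: not a homologue here
        if hits[acc] <= best_peptides:
            dependent.append(acc)    # no evidence beyond the best protein's peptides
        else:
            independent.append(acc)  # has at least one peptide of its own
    return best, dependent, independent
```

The point of the sketch is that with peptide evidence alone, "homologue" reduces to a statement about shared peptide sets rather than about sequence similarity, and a repository has to choose and document some such rule.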
Repository           RDB     XML (Input)    XML (Archive)
PRIDE (EBI)          MySQL   PRIDE XML      PRIDE XML, mzData
PeptideAtlas (ISB)   MySQL   mzXML          pepXML, mzXML
GPMDB (UBC+RU)       MySQL   bioML, GAML    bioML, GAML
Current proteomics repositories
1. mzXML
2. mzData
3. analysisXML
4. PRIDE XML
5. protXML
6. pepXML
7. bioML
8. GAML
9. MI XML
10. Mascot Search Results XML
XML to the rescue?
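Part of why XML alone does not settle the matter is that the spectra themselves are usually carried as encoded binary inside the XML. In mzXML, for example, the peaks element holds base64 text that decodes to network-order (big-endian) floats, interleaved as m/z and intensity pairs at 32-bit or 64-bit precision. The Python sketch below decodes and encodes such a payload using only the standard library. It assumes uncompressed data and ignores every other part of the file; the function names are hypothetical.

```python
import base64
import struct
from typing import Iterable, List, Tuple

_CODES = {32: "f", 64: "d"}  # struct format codes for the two precisions


def decode_peaks(encoded: str, precision: int = 32) -> List[Tuple[float, float]]:
    """Decode base64 text into a list of (m/z, intensity) pairs."""
    raw = base64.b64decode(encoded)
    width = precision // 8
    if len(raw) % (2 * width):
        raise ValueError("payload is not a whole number of (m/z, intensity) pairs")
    count = len(raw) // width
    # ">" selects big-endian byte order, i.e. network order.
    values = struct.unpack(f">{count}{_CODES[precision]}", raw)
    return list(zip(values[0::2], values[1::2]))


def encode_peaks(peaks: Iterable[Tuple[float, float]], precision: int = 32) -> str:
    """Inverse of decode_peaks: pack the pairs and return base64 text."""
    flat = [value for pair in peaks for value in pair]
    packed = struct.pack(f">{len(flat)}{_CODES[precision]}", *flat)
    return base64.b64encode(packed).decode("ascii")
```

Usage: decode_peaks(text, precision=64) for a 64-bit payload. Each of the other formats in the list makes its own choices about encoding and layout, so a repository that accepts several of them needs one such reader per format.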
The Semantic Web to the Rescue?
Functional Genomics Experiment (FuGE) object model
[Figure: UML class diagram of the proteomics experiment data model (Experiment, Sample, Analyte, Gel, MassSpecMachine, MassSpecExperiment, PeakList, DBSearch, PeptideHit, Organism, SampleOrigin, TaggingProcess and related classes), repeated in part from the earlier slide]
FUGE protocol model
FUGE sequence model
FUGE ontology model
Google to the rescue?