Using Electronic Medical Records Systems for Clinical Research: Benefits and Challenges
Prakash M. Nadkarni
 Availability of clinical, financial and administrative data in electronic form
 Using EMR Software for research operations
 Using EMR Data for research? Suitability of care-oriented data to clinical research needs.
 EMRs queried directly to answer research
EMR/Clinical Research Information System (CRIS) Differences: Research Subjects
Subjects are not necessarily “patients”.
 Personal Health Information may be
 Not all screened subjects are enrolled.
 Simultaneous or sequential enrollment
 Eligibility Criteria
EMR/CRIS Differences: The Study
Events/Visits and Study Calendar: Specific evaluations or interventions are done at
specific time points ("events") relative to the start of the study.
 All patients are not enrolled at the same
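A minimal sketch of a per-subject study calendar, assuming each event is stored as a day offset from that subject's own study start. The event names, the offsets and the `schedule_for` helper are illustrative assumptions, not taken from any particular CRIS.

```python
from datetime import date, timedelta

# Hypothetical protocol: event name -> days after the subject's study start.
STUDY_CALENDAR = {
    "Baseline": 0,
    "Week 2 visit": 14,
    "Week 8 visit": 56,
}

def schedule_for(start: date) -> dict:
    """Return the target date of every protocol event for one subject."""
    return {name: start + timedelta(days=offset)
            for name, offset in STUDY_CALENDAR.items()}

# Two subjects who start on different dates get different calendars,
# even though both follow the same protocol.
first = schedule_for(date(2024, 1, 10))
second = schedule_for(date(2024, 3, 1))
print(first["Week 2 visit"], second["Week 2 visit"])
```

Storing offsets rather than absolute dates keeps one protocol definition valid for every subject, whenever that subject starts.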
EMR/CRIS Differences: Electronic Data Capture (EDC)
CRIS EDC is Far More Structured and Fine-grained – textual comments are only a last
 CRISs may need to Support Real-Time
Self-reporting of Subject Data
 CRIS EDC may not always be Real-Time.
 Quality Control considerations dictate
many workflow steps.
EMR/CRIS Differences: Trans-Institutional Scope
For trans-institutional scope, Web
technology is virtually mandated.
 Site restriction in Multi-Site studies – end-users and investigators access only their
own site’s patients.
 Trans-National Issues: Software
Localization/ Globalization – same
software, different language/layout.
EMR/CRIS Differences: User Roles
CRISs support differential access to studies
 Most users of a CRIS are unaware of the other studies in the same database.
 Some users have read-only access to the data;
some only view reports.
 Only certain users may be allowed to enter data in
particular forms, or even view certain "blinded"
 Data analysts typically do not need to access PHI.
However, in multi-institutional studies, they are
not typically site-restricted (see later)
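The access rules on this slide and the previous one (per-study access, site restriction, no PHI for analysts) could be combined along the following lines. This is only a sketch: the `User` and `Subject` classes, the role names and the sample records are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class User:
    role: str                    # hypothetical roles: "coordinator", "analyst"
    studies: frozenset           # studies this user has been granted
    site: Optional[str] = None   # None means the user is not site-restricted

@dataclass(frozen=True)
class Subject:
    subject_id: str
    study: str
    site: str
    name: str                    # PHI
    weight_kg: float             # a research measurement

def visible_rows(user, subjects):
    """Return only the rows and fields this user is allowed to see."""
    rows = []
    for s in subjects:
        if s.study not in user.studies:
            continue             # other studies in the database stay invisible
        if user.site is not None and s.site != user.site:
            continue             # site restriction in multi-site studies
        row = {"subject_id": s.subject_id, "site": s.site,
               "weight_kg": s.weight_kg}
        if user.role != "analyst":
            row["name"] = s.name  # analysts do not receive PHI
        rows.append(row)
    return rows

SUBJECTS = [
    Subject("S-001", "STUDY-A", "Site1", "Ann Example", 70.5),
    Subject("S-002", "STUDY-A", "Site2", "Bob Example", 82.0),
    Subject("S-003", "STUDY-B", "Site1", "Cy Example", 64.2),
]
coordinator = User("coordinator", frozenset({"STUDY-A"}), site="Site1")
analyst = User("analyst", frozenset({"STUDY-A"}))
print(visible_rows(coordinator, SUBJECTS))
print(visible_rows(analyst, SUBJECTS))
```

In this sketch the coordinator sees one subject with the name included, while the analyst sees both STUDY-A sites but no names.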
EMR/CRIS Differences: Summary
EMRs are intended to primarily support
patient care, not research. CRISs are
specifically designed for research protocols.
 May inter-operate with CRISs.
 Sub-systems: Laboratory, Pharmacy, Scheduling
 EMR *may* be used with structured EDC for
intra-institutional studies if the only alternative is
paper, or if data-entry would otherwise be duplicated.
Claims by any EMR vendor that their systems
are CRIS-capable should be viewed
EMR Data for Research:
The Nature of Electronic EMR Data
 Significant dependence on narrative text,
which is often the gold standard for clinical
 Using administrative/billing data as a
surrogate for clinical data
 Miscoding, variations in coding
Using EMR Data for Research
Primarily hypothesis suggestion/generation
rather than confirmation
 Sample size may be too small to achieve statistical significance
 Most data mining tests only show association,
which does not prove causation.
 Selection of patients matching complex criteria:
sample size projections for a planned study (a
strength of I2B2 – no IRB approval needed
because only anonymized data is returned).
Medical Natural Language Processing
NLP is concerned with extraction of meaningful
information from human language input.
 Ultimate goal is to transform unstructured text
into a structured form.
 Most NLP applications are targeted toward
specific goals – e.g., identification of
medications, adverse drug events.
 NLP is not 100% accurate
Medical NLP 101: Symbolic/Rule-based approaches
Linguistic / symbolic NLP approaches
employ hand-crafted grammar rules to
parse text into units of speech (symbols),
which are then processed further.
 Still used successfully for limited problems.
 This approach does not always scale
 Labor-intensive, ambiguous parses, poor results with telegraphic text.
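As a rough illustration of a hand-crafted rule for one limited problem, the sketch below uses a single regular expression to pull medication names and doses out of a note. The three-drug lexicon, the pattern and the sample note are assumptions made for the example. A real symbolic system would use a much larger grammar.

```python
import re

# Tiny illustrative lexicon; a real system would draw on a drug vocabulary.
DRUGS = ("aspirin", "metformin", "lisinopril")

# Hand-crafted rule: drug name, then a numeric dose, then a unit.
PATTERN = re.compile(
    r"\b(?P<drug>" + "|".join(DRUGS) + r")\s+"
    r"(?P<dose>\d+(?:\.\d+)?)\s*(?P<unit>mg|mcg|g)\b",
    re.IGNORECASE,
)

def extract_medications(text):
    """Return (drug, dose, unit) tuples for every match of the rule."""
    return [(m["drug"].lower(), float(m["dose"]), m["unit"].lower())
            for m in PATTERN.finditer(text)]

note = "Pt on Metformin 500 mg bid; started lisinopril 10mg. ASA 81 daily."
print(extract_medications(note))
```

The rule finds metformin and lisinopril but misses "ASA 81", which has an abbreviation and no unit. That is the kind of telegraphic text on which hand-written rules tend to do poorly.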
Medical NLP 101: Statistical NLP
Relies on large bodies of text annotated with
the correct answers by humans.
 Utilizes probabilistic methods for prediction
 The larger and more representative the
training data, the better the results will be.
 Approaches include Support Vector Machines
(SVMs), Hidden Markov Models (HMMs), and
Conditional Random Fields (CRFs).
Medical NLP 101: Subproblems
NLP software typically works as a pipeline of
modules: Modules for Low-level tasks
precede those for high-level tasks
 Low Level Tasks
 Segmentation: sentence and word boundary detection, problem-specific boundary detection
 Part of speech tagging
 Morphological decomposition of compound
 Aggregation – identification of phrases
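The pipeline idea can be sketched with two low-level modules, sentence segmentation followed by word segmentation, each feeding the next. The regular expressions here are deliberately naive assumptions. They would mis-split at abbreviations such as "Dr. Smith", which is one reason problem-specific boundary detection is listed above.

```python
import re

def split_sentences(text):
    """Naive rule: a sentence ends at . ! or ? followed by whitespace
    and an uppercase letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]

def tokenize(sentence):
    """Split into numbers, words (optionally hyphenated) and punctuation."""
    return re.findall(r"\d+(?:\.\d+)?|\w+(?:-\w+)*|[^\w\s]", sentence)

def pipeline(text):
    """Low-level modules run in order; later modules consume their output."""
    return [tokenize(s) for s in split_sentences(text)]

text = "Patient denies chest pain. BP 120/80 today."
for tokens in pipeline(text):
    print(tokens)
```

Higher-level modules (for example concept recognition or negation detection) would take these token lists as input rather than the raw text.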
Medical NLP 101: Sub-problems (2)
High-level tasks
 Spelling and grammatical error correction
 Named Entity Recognition – including medical
concept recognition
 Word/abbreviation disambiguation
 Negation and uncertainty identification
 Relationship extraction
 Temporal inferencing
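To make one of these high-level tasks concrete, here is a much-simplified sketch of negation identification: a concept counts as negated if it appears within a few tokens after a trigger word. The trigger list, the window size and the function name are assumptions for illustration. This is not any published algorithm, and it ignores uncertainty and scope-terminating words.

```python
import re

NEGATION_TRIGGERS = {"no", "not", "denies", "without"}
WINDOW = 4  # tokens after a trigger that are treated as negated

def negated_concepts(sentence, concepts):
    """Return the subset of `concepts` that falls inside a negation window."""
    tokens = re.findall(r"\w+", sentence.lower())
    negated = set()
    for i, token in enumerate(tokens):
        if token not in NEGATION_TRIGGERS:
            continue
        scope = tokens[i + 1 : i + 1 + WINDOW]
        for concept in concepts:
            words = concept.lower().split()
            # The concept's words must appear contiguously inside the scope.
            for j in range(len(scope) - len(words) + 1):
                if scope[j : j + len(words)] == words:
                    negated.add(concept)
    return negated

sentence = "Patient denies chest pain but reports nausea."
print(negated_concepts(sentence, ["chest pain", "nausea"]))
```

Here "chest pain" is flagged and "nausea" is not, but only because "nausea" happens to fall outside the four-token window. A more careful rule set would stop the scope at "but".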
Medical NLP: Practical Issues
Change of Workflow and Introduction of
Structure can eliminate a difficult problem.
 Code Reuse to avoid reinventing wheels.
 General vs. Specific Solutions
 Tools Need Commoditization
Querying EMR Data:
Technological Considerations
A database cannot be simultaneously
designed for rapid query as well as
efficient interactive, multi-user updates.
 EMR database designs are transaction-oriented.
 EMRs are optimized for "Patient/Entity
Centric", not "Attribute-Centric" queries
Data Warehousing 101
Principle: Operating on a separate read-only
copy of the data on separate hardware yields
better query performance.
 Structural tweaks include adding extra and precomputation of aggregate values.
 Special types of indexes (bitmap indexes) yield
improved query performance.
 “Star schemas” characterize most warehouse
 Farmers vs. Explorers (Inmon)
"Virtual" integration ("federation")
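The precomputation of aggregate values mentioned on this slide can be sketched with an in-memory SQLite database standing in for the warehouse copy. The table names, columns and sample rows are assumptions for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE lab_result (patient_id INTEGER, test TEXT, value REAL);
INSERT INTO lab_result VALUES
  (1, 'glucose', 90), (1, 'glucose', 110),
  (2, 'glucose', 150), (2, 'hba1c', 7.5);

-- Precomputed aggregate table, rebuilt whenever the warehouse is reloaded.
CREATE TABLE lab_summary AS
  SELECT test, COUNT(*) AS n, AVG(value) AS mean_value
  FROM lab_result
  GROUP BY test;
""")

# Queries read the small summary table instead of scanning every result row.
summary = {test: (n, mean) for test, n, mean in
           con.execute("SELECT test, n, mean_value FROM lab_summary")}
print(summary)
```

Because the warehouse copy is read-only between loads, the summary cannot drift out of date while queries run against it, which is what makes this kind of tweak safe there and awkward in a live transactional system.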
Data Warehousing: Practical
After warehouse, need for creation of
custom reports may increase rather than
 The critical requirement for effective ad
hoc query is a comprehensive
understanding of the data. This is
generally a full-time effort.
Special Considerations: Querying
of Clinical Data
Both EMRs and large-scale CRISs typically
store clinical data in Entity-Attribute-Value
(EAV) form
 100,000s of clinical parameters exist across all medical domains.
 The vast majority of parameters will be
inapplicable for a particular subject/patient.
 EAV is a triple: Entity=Patient+point in time,
Attribute=Parameter, Value=value of that
 EPIC Flowsheet data uses EAV.
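A minimal sketch of the EAV idea, assuming the entity is a (patient id, time point) pair as described above. The parameter names and values are invented. Only parameters actually recorded for a patient get a row, so the many inapplicable ones cost nothing.

```python
from collections import defaultdict

# Each row is (entity, attribute, value); entity = (patient id, time point).
eav_rows = [
    ((101, "2024-01-05"), "systolic_bp", 128),
    ((101, "2024-01-05"), "heart_rate", 72),
    ((102, "2024-01-06"), "serum_potassium", 4.1),
]

def pivot(rows):
    """Entity-centric view: one sparse dict of attributes per entity."""
    table = defaultdict(dict)
    for entity, attribute, value in rows:
        table[entity][attribute] = value
    return dict(table)

def entities_with(rows, attribute, predicate):
    """Attribute-centric query: entities whose value satisfies `predicate`."""
    return [entity for entity, attr, value in rows
            if attr == attribute and predicate(value)]

by_entity = pivot(eav_rows)
print(by_entity[(101, "2024-01-05")])
print(entities_with(eav_rows, "systolic_bp", lambda v: v > 120))
```

The pivot corresponds to the patient-centric view that EMRs are optimized for. The attribute-centric query has to look across all entities, which is the kind of access pattern research queries tend to need.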
The mere presence of structure does not
solve all problems
 Synonyms in narrative text are unavoidable and must be reduced to the same concept. Controlled
medical vocabularies (UMLS) help.
 UMLS is not a panacea
 Institutions will therefore evolve their internal
controlled vocabularies.
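Reducing synonyms to one concept can be sketched as a lookup against a local controlled vocabulary. The concept codes below are invented for illustration and are not UMLS identifiers. The `normalize` helper is likewise an assumption.

```python
# Hypothetical local vocabulary: term -> locally assigned concept code.
SYNONYMS = {
    "heart attack": "LOCAL:0001",
    "myocardial infarction": "LOCAL:0001",
    "high blood pressure": "LOCAL:0002",
    "hypertension": "LOCAL:0002",
}

def normalize(term):
    """Map a term to its concept code, ignoring case and extra spaces.
    Returns None when the term is not in the vocabulary."""
    key = " ".join(term.lower().split())
    return SYNONYMS.get(key)

print(normalize("Heart  Attack"), normalize("myocardial infarction"))
print(normalize("dyspepsia"))
```

A plain lookup like this cannot resolve ambiguous abbreviations or terms missing from the vocabulary, which is part of why a controlled vocabulary alone is not a panacea.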
Standardization Considerations
Standardizing your definitions
 2nd Law of Thermodynamics
 Poor definition quality becomes a problem if
pooled-data (or meta-) analysis is intended.
 Features of certain systems predispose to
disorder. (Learn As You Go, separate
definitions databases.)
 Even the best system is not immune – path of
least resistance.
 Consistent definition is difficult to achieve
after the fact – Deming.
EMR use as the basis for research
Conflicting evidence regarding EMR
benefit still appears.
 A *well designed* EMR may benefit.
 Electronic Alerting Systems themselves
may not improve care, unless EMRs also
reduce workload through automatic
 Review vendor-supplied templates
Conclusions: Future EMR Evolution
EMRs fully supporting CRIS capability are
unlikely to evolve.
 No software should attempt to do everything
 Differences in storage-engine capabilities
 Jack-of-all-trades approach (doing everything in a
mediocre manner) is not viable.
 Difficult (or impossible) to devise a logically consistent user-interface metaphor that
applies to diverse unrelated features.
 Example of Microsoft Office.
Inter-operation (1)
Co-existing and Inter-operating best-of-breed packages offer the best usability and
 CRISs, Genomic/Proteomic Data Management
 There may be minimal data duplication, e.g.,
EMRs may pull in very limited summary
information on critical genetic data for selected
patients, so that it is immediately visible.
Inter-operation (2)
 Bulk import of laboratory parameters, to avoid duplicate data entry
 Automatic grading of laboratory-based adverse
events (oncology studies) – Richesson et al.
 Use for scheduling research subject visits
 Pharmacy subsystem for drug dispensation
 EMR for primary EDC in intra-institutional studies
if the only alternative is paper, or if data-entry
would otherwise be duplicated.
• EMR/Specialized EMR
• Picture-archiving systems
Inter-operation (3)
• Application Programming Interfaces (APIs)
 All large packages – CRISs, EMRs, ‘Omics – require APIs to make inter-operation efficient
 APIs are vendor-specific. Inter-operation standards (e.g., the HL7 Virtual Medical Record)
have not received much traction.
 Currently, many vendors set unreasonable
financial and other barriers to use of their APIs
(e.g., official certification, withholding of
 EMRs lag in the software industry’s trend toward
Further reading
Richesson and Andrews, Clinical Research Informatics, 2012 (Springer)
Jurafsky and Martin: Natural Language Processing
Manning and Schuetze: Foundations of Statistical Natural Language
Nadkarni, Ohno-Machado and Chapman: Natural Language Processing: An Introduction. Journal of the American Medical Informatics Association 2011.
Data Warehousing
Larry Greenfield. The Data Warehousing Information Center.
Kimball, Reeves, Ross and Thornthwaite. The Data Warehouse Lifecycle
Toolkit : Expert Methods for Designing, Developing, and Deploying Data
Warehouses. Wiley, 1998.
