Appraisal of a systematic review using
a checklist: Notes and Example 1
Effect of fibre, antispasmodics, and peppermint oil in the
treatment of irritable bowel syndrome: systematic review
and meta-analysis.
Ford AC, Talley NJ, Spiegel BMR, et al. BMJ 2008;337:a2313
• At Clinical Evidence we principally assess systematic reviews (SRs) of
randomised controlled trials (RCTs) that pool data
• This presentation uses the critical appraisal checklist for SRs to examine
one such review and illustrate the principles involved
• It also expands on the issues involved when assessing SRs of RCTs in
• There are SRs that include data from RCTs and observational studies, or
include data from observational studies alone
• However, the content of this presentation relates solely to SRs of RCTs
with meta-analysis
Does the SR perform and report a
comprehensive and reproducible literature search?
• The SR should clearly state what search it has performed
• Sometimes extensive detail is presented such as the terms searched / the
search string used. Other times, less detail is reported
• However, you should be able to confirm the search is comprehensive and be
reasonably able to reproduce it should you wish to do so
• Key questions include:
– Are the search dates reported (from start date to finish date)?
– Has it searched the appropriate databases (has just one database been
interrogated or different databases)?
– Have other methods of identifying studies also been employed (e.g., searching
– Were the studies identified by the search systematically assessed?
– Are there specific exclusions (e.g., language restrictions)?
In this study: does the SR perform and report
a comprehensive and reproducible literature search?
• Look in the Methods section. The review describes the databases interrogated as
well as the actual terms used in the search
We searched the medical literature using Medline (1950 to April 2008), Embase (1980 to April
2008), and the Cochrane controlled trials register (2007).
We identified studies on irritable bowel syndrome using the terms “irritable bowel syndrome”
and “functional diseases, colon” (both as medical subject heading and free text terms), and “IBS,
spastic colon, irritable colon”, and “functional adj5 bowel” (as free text terms). These were
combined using the set operator AND with studies identified with the terms: “dietary fibre”,
“cereals”, “psyllium”, “sterculia”, “karaya gum”, “parasympatholytics”, “scopolamine”,
“trimebutine”, “muscarinic antagonists”, “butylscopolammonium bromide” (both as medical
subject headings and free text terms), and the following free text terms: “bulking agent”,
“psyllium fibre”, “fibre”, “husk”, “bran”, “ispaghula”, “wheat bran”, “spasmolytics”, “spasmolytic
agents”, “antispasmodics”, “mebeverine”, “alverine”, “pinaverium bromide”, “otilonium bromide”,
“cimetropium bromide”, “hyoscine butyl bromide”, “butylscopolamine”, “peppermint oil”, and
In this study: does the SR perform and report a
comprehensive and reproducible literature search?
• The SR also reports on restrictions, additional searches, and
the assessment of the studies identified
No language restrictions were applied. The lead reviewer evaluated the abstracts of
papers identified by the initial search for appropriateness to the study question.
Potentially relevant papers were obtained and evaluated in detail. Foreign language
papers were translated when required. We hand searched abstract books of
conference proceedings between 2001 and 2007 to identify potentially eligible
studies. The reference lists of all identified relevant studies were used to carry out a
recursive search of the literature. Two reviewers independently assessed articles
using predesigned eligibility forms, according to eligibility criteria defined
prospectively. Any disagreement between investigators was resolved by consensus.
In this study: does the SR perform and report a comprehensive
and reproducible literature search?
• It also presents a flow diagram of studies identified by the literature search
Does the SR formulate a clearly focused question?
• The SR should define the question that it is trying to answer before the
systematic search is performed
• This is needed in order to ensure that the search will answer the question
• If the question is not clearly defined before the search, or the search is
broad and not clearly defined, this may result in “data dredging” where a
mass of trial data is tested post hoc in order to find significant results and
• Also, if it is not focused a priori, the search may not be able to answer a
specific question of interest because it does not encompass all the trial
data needed to answer that question
In this study: does the SR formulate a clearly
focused question?
• Yes, look in the Abstract for the concise question
Objective To determine the effect of fibre, antispasmodics, and peppermint oil in the
treatment of irritable bowel syndrome.
Design Systematic review and meta-analysis of randomised controlled trials.
• The Introduction gives the rationale and background
Results of randomised controlled trials are conflicting, and many have been
underpowered to detect a difference between active treatment and control
intervention. Systematic reviews have also come to different conclusions about the
efficacy of the three treatments in irritable bowel syndrome. As a result confusion
exists as to the roles of these agents, with current management guidelines for irritable
bowel syndrome making varying recommendations.
We carried out a systematic review and meta-analysis to determine the effect of fibre,
antispasmodics, and peppermint oil in the treatment of irritable bowel syndrome.
Does the methods section explicitly state the
basis for the inclusion or exclusion of primary studies?
• It is vitally important to know why studies have been included or excluded
• Adhering to explicit criteria avoids bias, where a study might be
included or excluded just because an author does or doesn’t like it
• It is also necessary to allow the search to be reproducible
• What studies are included or excluded affects the final results
• Explicitly reporting such criteria also allows the reader to assess
whether these criteria are reasonable, and what population group
or circumstances the results of the review may be applicable to
In this study: is the basis for the inclusion or
exclusion of primary studies described?
• Yes, look in Methods section
We considered randomised controlled trials of adults (>16 years) with a diagnosis of
irritable bowel syndrome based on a clinician’s opinion or that met specific diagnostic
criteria (Manning, Kruis score, Rome I, II, or III), combined with the results of
investigations to exclude organic disease if trial investigators thought this necessary.
The studies had to compare fibre, antispasmodics, and peppermint oil with placebo or
no treatment. Participants were required to be followed up for at least one week, and
studies had to report either a global assessment of cure or improvement of
symptoms, or cure or improvement of abdominal pain, after treatment. This was
preferably as reported by the patient, but could be documented by a doctor. If studies
included patients with other functional gastrointestinal disorders, then we excluded
these patients from our analyses if trial reporting allowed this, but if this was not
possible we excluded the studies from the meta-analysis. We also considered as
eligible for inclusion the first period of cross over randomised controlled trials. To
allow steady state plasma concentrations of the agents to be achieved we considered
one week as the minimum duration of treatment.
In this study: is the basis for the inclusion or
exclusion of primary studies described?
A particular issue of this subject area is the difficulty in making the diagnosis of
irritable bowel syndrome, and hence, what criteria the RCTs used to include
The diagnostic criteria employed by the review are reported on the previous slide.
In addition, this is further explored in the Discussion
Most trials were done before the Rome committee published their recommendations
for the design of randomised controlled trials of therapies in functional
gastrointestinal disorders. Only five of the included studies used the Rome criteria to
define the presence of irritable bowel syndrome, although only nine were published
after the first Rome classification was proposed in 1990, and only two used a
validated outcome measure to define improvement in symptoms after treatment.
However, many of the included trials met some of the other suggested
methodological criteria, such as presence of double blinding and a minimum duration
of therapy of 8 to 12 weeks. We preferentially extracted patient reported
improvement in symptoms of irritable bowel syndrome or abdominal pain whenever
trial reporting allowed this, which is also in line with these recommendations.
Does the SR report data from primary RCTs
(e.g., size, interventions used, results from
individual RCTs)?
• Is the size of each included RCT reported? Sometimes the results of a
meta-analysis are dominated by one large RCT
• Are the interventions used in the RCTs defined? Even though RCTs
may be comparing, say, the same drug, dosages and regimens may
vary widely, and may sometimes account for differing trial results
• Are results from individual RCTs reported? Is there consistency of
effects between RCTs or do their results differ widely?
• Watch out for:
– Where primary studies are not described at all — issues such as what was
actually done in the studies and the setting / population included are
important information and limit who the results may be applicable to
– Where absolute numbers are not reported — although size is no guarantee
of quality or accuracy of a final result, you may view a result based on 100
people differently from one based on 4000 people
In this study: does the SR report data from
primary RCTs?
Yes, look at tables 1, 2, and 3
For each primary study, it reports the country the RCT took place in, the diagnostic
criteria employed, the outcome reported, the sample size of the study, the exact
treatment regimen used, and the duration of treatment given
The review also reports results for individual RCTs in the final meta-analysis
Did you notice in tables 1, 2, and 3 how few studies were done in primary care?
They were undertaken mainly in secondary and tertiary care. We may speculate
these might be selected people who are slightly different from people with
irritable bowel syndrome seen in primary care. This could be an important
limitation in the available data
Did you also notice the duration of treatment in tables 1, 2, and 3? Many trials
were short term, some only lasting a few weeks. Irritable bowel syndrome is a
chronic relapsing and remitting condition. This could also be an important
limitation in the available data
Reporting such data from primary studies allows us to identify important caveats
for when we come to interpret the results and decide how generalisable they are
Does the SR assess the methodological
quality of primary studies, and take these
into account?
People sometimes get the review quality and the included study quality mixed up
Just because a review includes poor-quality studies doesn’t mean it is a
poor-quality review
You can have a good-quality review of a subject area in which only poor-quality
studies are available
Alternatively, you can have a poor-quality review of a subject area with good-quality studies
The methodological quality of the included studies may affect what weight you
may wish to put on the results. A review should discuss this issue, and may present
a sensitivity analysis (for example, an analysis just including high-quality studies to
see if this differs from the analysis of all available studies)
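A sensitivity analysis of this kind can be sketched as follows. This is a minimal illustration only: the trial counts are hypothetical, the pooling is a simple inverse-variance (fixed effect) average on the log scale, and the threshold of a Jadad score of 4 or more for "high quality" is an assumption made for the example.

```python
import math

# Hypothetical trials (not from the review):
# (events_treatment, n_treatment, events_control, n_control, jadad_score)
trials = [
    (10, 50, 20, 50, 5),
    (15, 60, 24, 60, 4),
    (5, 40, 18, 40, 2),
    (8, 30, 16, 30, 1),
]

def pooled_rr(rows):
    """Inverse-variance (fixed effect) pooled relative risk, computed on the log scale."""
    num = den = 0.0
    for a, n1, c, n2, _ in rows:
        log_rr = math.log((a / n1) / (c / n2))
        var = 1 / a - 1 / n1 + 1 / c - 1 / n2  # variance of the log relative risk
        num += log_rr / var
        den += 1 / var
    return math.exp(num / den)

all_rr = pooled_rr(trials)
# Assumed threshold: treat Jadad >= 4 as "high quality"
high_quality_rr = pooled_rr([t for t in trials if t[4] >= 4])

print(f"All trials:      RR = {all_rr:.2f}")
print(f"Jadad >= 4 only: RR = {high_quality_rr:.2f}")
```

If the two pooled estimates differ noticeably, as they do with these made-up numbers, the lower-quality trials may be influencing the overall result.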
Does the SR assess the methodological quality of
primary studies, and take these into account?
Things to watch out for are:
The inclusion of unpublished data (i.e., data from abstracts or data not published
at all). Some independent SRs may include unpublished data and write to authors
to obtain more details. Sometimes, industry sponsored or authored reviews may
include completely unpublished data with no methodological assessment. Beware
of terms such as ‘data on file’ which are not subject to external scrutiny
Unpublished data are a complex issue and there are suggestions that trials that find
no effect are less likely to be published (publication bias). Hence, including some
unpublished data may be very informative. Alternatively, some unpublished data
may be misleading. See how the review has handled these data, and whether
what it has done or included seems reasonable and free from bias
Some poor-quality reviews may report methodological parameters very sparsely. It
can be difficult, say, to be sure that all the trials included in an analysis are RCTs.
Don’t take anything for granted. If it doesn’t state that they are RCTs, don’t assume
that they are. Sometimes imprecise terms such as ‘trials’ or ‘clinical trials’ are used
in the text. If in doubt, look at the abstracts of studies included in the analysis on
PubMed as a first step
In this study: does the SR assess the methodological
quality of primary studies, and take these into account?
• Yes, look in tables 1, 2, and 3 where the Jadad score is reported for all
the individual RCTs
• The Jadad score is a well recognised and often reported quality score
• See the Study Quality section in the review where details are given of
how the Jadad score was assessed and calculated
• A review may use other overall categorisations of methodological
quality. However, it should clearly explain what they are and how they
have been calculated
• Some reviews may simply report on individual elements of methods
without an overall score (e.g., randomisation, blinding, etc). Either way
of reporting is acceptable
Two reviewers independently assessed study quality according to the Jadad
scale. This records whether a study is described as randomised and double
blind, the methods for generation of the allocation schedule and double
blinding, and whether there is a description of dropouts during the trial.
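The elements listed in this excerpt map onto the usual 0 to 5 scoring of the Jadad scale. The sketch below shows one way that scoring could be written down. The function name and argument names are hypothetical, and it is not the reviewers' own scoring form.

```python
def jadad_score(randomised, randomisation_method, double_blind,
                blinding_method, dropouts_described):
    """Return a Jadad score.

    randomisation_method and blinding_method take one of:
    "appropriate", "inappropriate", or None (method not described).
    """
    adjust = {"appropriate": 1, "inappropriate": -1, None: 0}
    score = 0
    if randomised:
        # 1 point for being described as randomised, +1/-1 for the method
        score += 1 + adjust[randomisation_method]
    if double_blind:
        # 1 point for being described as double blind, +1/-1 for the method
        score += 1 + adjust[blinding_method]
    if dropouts_described:
        # 1 point for a description of withdrawals and dropouts
        score += 1
    return score

# Hypothetical trial: randomised with an appropriate method, double blind
# with the method not described, dropouts reported
print(jadad_score(True, "appropriate", True, None, True))
```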
Meta-analysis: does the SR combine primary studies appropriately?
Pooling data increases the power of an analysis, and may demonstrate a significant
effect with an intervention that was not apparent when looking at the results of
smaller individual studies
However, a key question is when is it reasonable to combine data from different studies
and when is it not
There may be clinical heterogeneity between studies. This could be in the intervention
used, the population studied, or the outcomes measured, amongst others
For example, it may be reasonable, say, to combine the results of different studies using
different beta-blockers to lower blood pressure, whereas one would not combine these
data with studies using relaxation therapy to lower blood pressure, even though the
aim was the same (to lower blood pressure)
It can be a difficult decision whether it is acceptable to combine trial data, for example,
when combining data from trials of multidisciplinary care, where the actual components
of care may vary widely between different RCTs. You need to consider
this issue. It may require a judgement on your part
It should also be clear what data from the RCTs have been used in the analysis. A
systematic review may, say, combine data on an intention-to-treat basis, whereas the
original study may have reported a per-protocol or completer analysis
In this study: does the SR combine
primary studies appropriately?
Yes, the study would seem to combine results appropriately
It combines the RCTs looking at fibre and antispasmodics as groups and peppermint oil as an
individual agent. It might be argued that different types of fibre or different antispasmodics
may or may not have different effects. However, it reports that it also planned a sensitivity
analysis a priori according to the type of fibre or antispasmodic used. Hence, it reports an
overall analysis for fibre and antispasmodics as groups, as well as individual analyses by
the exact agent used
Look at the Data Extraction section. It gives clear information on what data were used from
each study, how these data were extracted, and what assumptions were made
Two reviewers independently extracted data on to an Excel spreadsheet (XP professional; Microsoft,
Redmond, WA) as dichotomous outcomes (persistent or unimproved global symptoms of irritable
bowel syndrome, or persistent or unimproved abdominal pain). In addition we extracted the
following clinical data for each trial: setting (primary, secondary, or tertiary care), number of centres,
country, dose and duration of treatment, total number of adverse events reported, definition of
irritable bowel syndrome used, primary outcome measure used to define improvement in symptoms
or cure after treatment, method of generation of the randomisation schedule, method for allocation
concealment, level of blinding, proportion of female patients, subtype of irritable bowel syndrome
according to predominant stool pattern, and duration of follow-up. Data were extracted as intention
to treat analyses where all dropouts are assumed to be treatment failures, whenever this was
allowed by trial reporting. If this was not clear from the original article then we carried out an
analysis on all patients with reported evaluable data.
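The intention-to-treat rule described in this excerpt (all dropouts counted as treatment failures) can be illustrated with a small sketch. The function name and the numbers are hypothetical and are not taken from any included trial.

```python
def itt_failures(randomised, completers, unimproved_completers):
    """Convert completer data to an intention-to-treat count of failures.

    Every participant who dropped out is assumed to be a treatment
    failure (symptoms persistent or unimproved).
    """
    dropouts = randomised - completers
    return unimproved_completers + dropouts, randomised

# Hypothetical arm: 50 randomised, 44 completed, 20 completers unimproved
events, total = itt_failures(50, 44, 20)
print(f"Completer analysis: 20/44 = {20 / 44:.2f} unimproved")
print(f"Intention to treat: {events}/{total} = {events / total:.2f} unimproved")
```

The same arm gives a different proportion depending on which analysis is used, which is why a review should state which figures it extracted.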
Meta-analysis: does the SR state how results
are combined statistically?
• Different statistical tests and assumptions can give different results.
Hence, it is important to report what methods have been used
• For a non-statistician, it can be difficult to interpret the detail of what has
been undertaken. For example, whether one statistical test is more
appropriate than another, or whether it is reasonable to make specific
statistical assumptions
• However, the review should clearly state the statistical methods used
• In practice, the amount of detail supplied varies widely between reviews
• Watch out for reviews that don’t give any detail at all on this
In this study: does the SR state how the
results are combined statistically?
Yes, look at Data Synthesis and Statistical Analysis
The review clearly states what has been done, and also lists the statistical packages used
We pooled data using a random effects model to give a more conservative estimate of the effect of individual
treatments, allowing for any heterogeneity between studies. The effects of different interventions were
expressed as a relative risk (95% confidence interval) of global symptoms of irritable bowel syndrome or
abdominal pain persisting with fibre, antispasmodics, or peppermint oil compared with placebo or no
treatment. For rare outcomes, such as adverse events, when no patients in one or both treatment arms had
the outcome of interest in a single study, we added 0.5 to all four cells for the purposes of the analysis. From
the reciprocal of the risk difference from the meta-analysis we calculated the number needed to treat and
95% confidence intervals. We used the I² statistic, with a cut-off point of 25%, to assess heterogeneity
between studies and the χ² test with a P value <0.10 to define a significant degree of heterogeneity.
If adverse events were statistically significantly increased with active treatment we calculated the number
needed to harm and a 95% confidence interval using the formula: number needed to harm=1/(1–relative
risk)×control adverse event rate.
We used Review Manager version 4.2.8 (Nordic Cochrane Centre, Copenhagen, Denmark) and StatsDirect
version 2.4.4 (Sale, Cheshire, England) to generate forest plots of pooled relative risks and risk differences for
primary and secondary outcomes with 95% confidence intervals. We used the Egger and Begg tests to assess
funnel plots for evidence of publication bias.
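The pooling steps described in this excerpt can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' Review Manager calculation: it assumes the DerSimonian-Laird estimator for the random effects model (the excerpt does not name the estimator), it uses hypothetical trial counts, and it adds 0.5 to all four cells only when an arm has no events.

```python
import math

def log_rr_and_var(a, n1, c, n2):
    """Log relative risk and its variance for one trial.

    a/n1 = events/total on treatment, c/n2 = events/total on control.
    Adds 0.5 to all four cells of the 2x2 table if either arm has no events.
    """
    b, d = n1 - a, n2 - c
    if a == 0 or c == 0:
        a, b, c, d = a + 0.5, b + 0.5, c + 0.5, d + 0.5
        n1, n2 = a + b, c + d
    log_rr = math.log((a / n1) / (c / n2))
    var = 1 / a - 1 / n1 + 1 / c - 1 / n2
    return log_rr, var

def random_effects_rr(trials):
    """Pooled relative risk with 95% CI (DerSimonian-Laird, assumed here)."""
    ys, vs = zip(*(log_rr_and_var(*t) for t in trials))
    w = [1 / v for v in vs]
    fixed = sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, ys))  # Cochran's Q
    df = len(trials) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0  # between-trial variance
    w_star = [1 / (v + tau2) for v in vs]
    pooled = sum(wi * yi for wi, yi in zip(w_star, ys)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return tuple(math.exp(x) for x in (pooled, pooled - 1.96 * se, pooled + 1.96 * se))

# Hypothetical trials: (events_treatment, n_treatment, events_control, n_control)
trials = [(10, 50, 20, 50), (15, 60, 24, 60), (5, 40, 18, 40), (0, 30, 4, 30)]
rr, lower, upper = random_effects_rr(trials)
print(f"Pooled RR = {rr:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

When the between-trial variance is estimated as zero, the random effects result matches the fixed effect result. Otherwise the interval is wider, which is the sense in which the excerpt calls the random effects estimate more conservative.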
Meta-analysis: does the SR report absolute numbers as well
as appropriate summary statistics?
• Each summary statistic has its own strengths and weaknesses
• There is evidence that people may interpret results differently if the same
results are presented as different summary statistics
• Hence, it is also important to know absolute numbers
• Watch out if the review doesn’t report absolute numbers
• Apart from not knowing how many people are included in any analysis,
you can’t judge what figures have been extracted (for example, did the
review use intention-to-treat figures or not; did it include all data from all
RCTs or subgroup data)
• Absolute numbers also help to put the summary statistic in context and
may help you gauge what this result actually means in clinical practice
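The point about summary statistics and absolute numbers can be made concrete with a short sketch. The counts below are hypothetical and chosen so that two very different baseline risks give the same relative risk.

```python
def summarise(events_treat, n_treat, events_ctrl, n_ctrl):
    """Express one 2x2 result as several common summary statistics."""
    risk_t = events_treat / n_treat
    risk_c = events_ctrl / n_ctrl
    risk_difference = risk_c - risk_t
    return {
        "relative_risk": risk_t / risk_c,
        "relative_risk_reduction": 1 - risk_t / risk_c,
        "absolute_risk_reduction": risk_difference,
        # Number needed to treat is the reciprocal of the risk difference
        "nnt": 1 / risk_difference,
    }

# Hypothetical results with the same relative risk but different baseline risks
common = summarise(30, 100, 40, 100)    # 30% vs 40% with the outcome
rare = summarise(3, 1000, 4, 1000)      # 0.3% vs 0.4% with the outcome

for label, s in (("common outcome", common), ("rare outcome", rare)):
    print(f"{label}: RR = {s['relative_risk']:.2f}, "
          f"ARR = {s['absolute_risk_reduction']:.3f}, NNT = {s['nnt']:.0f}")
```

Both results could be reported as "a 25% reduction in risk", yet the number needed to treat differs a hundredfold. Only the absolute numbers reveal this.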
In this study: does the SR report absolute numbers as well
as appropriate summary statistics?
Yes, it does report absolute numbers. Look at this excerpt of figure 2 and how clear
the reporting is with regard to what data have been used
You could go back to each of these RCTs and check yourself whether you agree
with these extracted figures
In this study: does the SR report absolute numbers as
well as appropriate summary statistics?
In Data Synthesis and Statistical Analysis (slide 20) the review reported that it had calculated
relative risks. This is a commonly reported statistic for categorical data (e.g., yes / no; present
/ absent)
In this excerpt from figure 2, it states that the numerators are people with either global
symptoms or abdominal pain unimproved or persistent after treatment
The relative risk is an appropriate summary statistic to use in this case
Fig 2 Forest plot of randomised controlled trials of fibre versus placebo or low fibre diet in
irritable bowel syndrome. Events are number of patients with either global symptoms of
irritable bowel syndrome or abdominal pain unimproved or persistent after treatment
In this study: does the SR report absolute numbers as well as
appropriate summary statistics?
In looking at the technicalities of the numerical data and how results have
been calculated, it may be easy to forget that what is being measured is also of
prime importance, that is, the outcome of interest
On a practical point, any analysis in a systematic review is limited by the
outcomes actually measured and reported in the included RCTs
In some subject areas, outcomes may be well defined e.g., mortality,
subsequent MI
Sometimes composite outcomes are reported (e.g., mortality and morbidity
combined). This increases the statistical power of an analysis. This may or may
not be appropriate. For example, was this analysis specified a priori or were
results combined post hoc in order to achieve a significant result?
For each subject area, you need to form an opinion on whether the outcome
reported (either single or composite) is appropriate
At Clinical Evidence we have always reported primarily on clinical outcomes,
that is, outcomes which matter to people. We try not to report on laboratory
or proxy outcomes wherever possible
In this study: does the SR report absolute
numbers as well as appropriate summary statistics?
In irritable bowel syndrome, we have already noted that diagnosis is an issue
The authors of the review have chosen to report the effects of interventions on a composite
outcome of global symptoms or abdominal pain
Do you think this is reasonable? How was this measured? Should they have reported on, say, pain
alone, or given the nature of the disease, is this a reasonable outcome which is of clinical use? This
involves a value judgement on your part
In practice, whether an outcome is reasonable is often not a straight “yes / no” answer. It may often
be a “yes, but….”
Given the possible broad nature of this outcome, the authors have reported what criteria the RCTs
actually used to define symptom improvement — see table 3 as an example below
Column headings: Diagnostic criteria for irritable bowel syndrome; Criteria to define symptom improvement after therapy; Dose of peppermint oil; Duration of
Denmark; Secondary care; Clinical diagnosis and; Patient reported improvement in global symptoms; 200 mg three times daily; 4 weeks
Secondary care; Clinical diagnosis and; Patient reported improvement in abdominal pain; 187 mg three or four times daily; 1 month
Capanni; Italy; Secondary care; Rome II; 2 capsules three times daily; 3 months
Cappello; Italy; Secondary care; Rome II and; Improvement in global symptoms assessed by validated; ≥50% improvement from baseline in overall irritable bowel syndrome symptom score using questionnaire data; 225 mg twice; 4 weeks
Does the SR discuss the reasons for any variations /
heterogeneity between individual RCTs?
Whenever you combine data from different RCTs there is going to be some heterogeneity.
The question is what degree of heterogeneity is acceptable
We have already mentioned clinical heterogeneity. When results are numerically combined, a
statistical test of heterogeneity should be reported
Statistical heterogeneity is often reported as a chi-squared test (χ²) with a P value and/or the
I² statistic
If there is a high degree of heterogeneity among RCTs, this suggests there may be something
different among the RCTs, and that their results should not be combined
If there is statistical heterogeneity among RCTs in an analysis, you should expect the review
to comment on the reasons for this. Often the review will exclude trials that account for the
heterogeneity and recalculate the analysis, and may also report other sensitivity analyses. If a
review does exclude trials, it should have a good underlying reason for doing so, other than
that their results are different
If a forest plot is presented, you can visually examine the spread of the results
In practice, some reviews report significant heterogeneity tests in the results tables, but
don’t mention this in the results text or allude to this further. Watch out for this!
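For readers who want to see where these heterogeneity statistics come from, the sketch below computes Cochran's Q (the chi-squared heterogeneity statistic) and I² from trial effect estimates. The log relative risks and variances are hypothetical, and the P value for Q is left out because it needs a chi-squared distribution function that is not in the Python standard library.

```python
def cochran_q_and_i2(effects, variances):
    """Cochran's Q, its degrees of freedom, and I² (as a percentage).

    effects are per-trial estimates on an additive scale (e.g. log relative
    risks); variances are their squared standard errors.
    """
    weights = [1 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I² is the share of variation in excess of that expected by chance
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, df, i2

# Hypothetical log relative risks from three trials with equal precision
q, df, i2 = cochran_q_and_i2([-0.7, -0.1, -1.2], [0.05, 0.05, 0.05])
print(f"Q = {q:.2f} on {df} degrees of freedom, I² = {i2:.1f}%")
```

With these made-up inputs I² is well above the 25% cut-off used in the review discussed here, so a reader would expect the review to comment on the reasons.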
In this study: does the SR discuss the reasons for any
variations / heterogeneity between individual RCTs?
Look in Data Synthesis and Statistical Analysis where the review outlines the tests of
heterogeneity it is going to perform
The review was going to analyse “antispasmodics” as a group. This is a grouping of agents by
treatment effect rather than a grouping based on drug structure
We might, therefore, speculate that effects may vary by the individual agents used. Read this
section again where the review outlines a priori that it is going to do a subgroup analysis by each
individual agent. A similar procedure is specified for the “fibre” analysis
In the Results section under the Antispasmodics subheading the review reports a sensitivity
analysis of the overall result. It also reports I² results for the individual antispasmodics analysis.
Read where it discusses the relative strength of the evidence on the individual agents. It further
alludes to this issue in the Discussion section
In fact, the review found statistical heterogeneity in a number of the analyses. See how it
discusses the issue of heterogeneity in the Discussion section. It also finds some evidence of
publication bias
These issues may affect what weight you may wish to put on the results. Hence, interpretation
may be complex. However, it is important that such issues are identified and the limitations of
any analyses are discussed
Beware of reviews which don’t report on or discuss the limitations of their analyses
In this study: does the SR discuss the reasons
for any variations / heterogeneity between
individual RCTs?
Look at the overall forest plot for all antispasmodics. You can see visually how
the results vary by individual agent and RCT
It should be remembered that some of these individual analyses are based on small
numbers, often from only one RCT
You can also easily see how the amount of available evidence varies widely between different
In this study: does the SR report on the clinical
relevance / importance of the results?
You should expect a review to report on the limitations of its analysis. This is often reported
in the Discussion section, as it is in this review
You should also expect the review to discuss its results in the context of previously reported
evidence or guidelines. That is, how they may differ, agree, or contradict previous studies or
practice. Again, this is usually reported in the Discussion section as it is in this review
Some test statistics are more difficult to interpret clinically than others. For example,
standardised mean differences (SMDs) and effect sizes have no units specified. Hence, if
these are reported, it is difficult to know what the results actually mean in clinical terms
Similarly, if there is an improvement in pain of, say, 8 points on a 100-point visual analogue scale (VAS), at what
point does the change become clinically important?
Although this particular review didn’t report on these types of data, if a review does, it should
give some guidance as to what represents an important clinical effect
In interpreting any result, it is important to remember that statistical significance and clinical
importance are not synonymous. For example, a very large study may find that an
intervention significantly reduces systolic blood pressure by 0.7 mmHg. The question is then
one of interpretation. In terms of the individual, is this clinically important?
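The blood pressure example can be worked through numerically. In the sketch below the standard deviation, the number of participants per arm, and the threshold for a clinically important difference are all hypothetical values chosen for illustration, and the test is a simple two-sample z test.

```python
import math

def two_sample_z(mean_difference, sd, n_per_arm):
    """z statistic and two-sided P value for a difference in means.

    Assumes equal standard deviation and equal numbers in both arms.
    """
    se = sd * math.sqrt(2 / n_per_arm)
    z = mean_difference / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided P from the normal distribution
    return z, p

# Hypothetical very large trial: 0.7 mmHg difference, SD 15 mmHg, 20,000 per arm
z, p = two_sample_z(0.7, 15, 20_000)
clinically_important = 5.0  # assumed threshold in mmHg, for illustration only

print(f"z = {z:.2f}, P = {p:.1e}")
print("Statistically significant:", p < 0.05)
print("Clinically important:", 0.7 >= clinically_important)
```

With enough participants, even a very small difference yields a small P value. Whether that difference matters to an individual remains a separate judgement.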
Some final thoughts
At Clinical Evidence we examine systematic reviews every day. Just because a study is a
systematic review doesn’t necessarily mean it is of good methodological quality. There can be
a wide variation between reviews. However, increasingly, most are of reasonable quality
A key principle is transparency. You shouldn’t need to guess or make assumptions about what
a review has done
Pay close attention to the inclusion and exclusion criteria. Are these reasonable? These
directly affect the final result as well as who the results may, or may not be, generalisable to
In some subject areas (e.g., cardiovascular) there may be multiple large studies whereas in
others (e.g., surgery) evidence may be scarce. Any review can only report on what is
available, and has to work within the limitations of the available data. It should, however,
explicitly discuss any such limitations, and the effect of these on the robustness of its
Increasingly, reviews are being published online, which allows for the extensive reporting of
data (for example, included study details, methods assessment, etc). Reviews published in
print journals may have more constraints in terms of space, although additional web tables
are increasingly used. Nonetheless, there should always be a minimum level of reporting that
allows the general reader to reasonably assess what has been done
“BMJ Publishing Group Limited (“BMJ Group”) 2011. All rights reserved.”
