Initial Data Analysis
Central Tendency
Outline

What is ‘central tendency’?
Classic measures
  Mean, Median, Mode
What’s an ‘average’?
Properties of statistics
  Sufficiency
  Efficiency
  Bias
  Resistance
Resistant measures
Measures of Central Tendency


While distributions provide an overall picture of
a data set, it is sometimes desirable to represent
some property of the entire data set using a single
statistic.
The first descriptive statistics we will discuss are
those used to indicate where the ‘center’ of the
distribution lies.



The expected value
It does not have to be a value that occurs in the data set itself
There are different measures of central tendency,
each with its own advantages and disadvantages
The Mode

The mode is simply the value of the relevant variable that
occurs most often (i.e., has the highest frequency) in the
sample.

Note that if you have plotted a frequency histogram, you can
often identify the mode simply by finding the value with the
highest bar.

However, that will not work when the values were grouped into
intervals before plotting the histogram (you can still use the
histogram to identify the modal group, just not the modal
value).

The mode is probably best suited to nominal data.
Mode

Advantages




Very quick and easy to determine
Is an actual value of the data
Not affected by extreme scores
Disadvantages



Sometimes not very informative (e.g. cigarettes smoked in
a day)
Can change dramatically from sample to sample
Might be more than one (which is more representative?)
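As a small illustration of the points above, here is a sketch of finding the mode from a frequency count using Python's standard library. The cigarette counts and the second sample are made-up values chosen only for illustration; `multimode` (Python 3.8+) returns every value tied for the highest frequency, which is the "more than one mode" case.

```python
from collections import Counter
from statistics import multimode

# Hypothetical data: cigarettes smoked per day by 10 people (not from the lecture).
cigarettes = [0, 0, 0, 0, 0, 0, 5, 10, 20, 20]

# Frequency table: value -> number of occurrences.
freq = Counter(cigarettes)

# The mode is the value with the highest frequency.
mode_value, mode_count = freq.most_common(1)[0]
print(mode_value, mode_count)  # 0 6  (the mode says little about the smokers)

# A sample can have several modes; multimode returns all of them.
bimodal = [1, 1, 2, 3, 3]
modes = multimode(bimodal)
print(modes)  # [1, 3]
```

Here the mode of 0 is an actual data value and is untouched by the heavy smokers, but it tells us nothing about how much the smokers smoke.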
The Median



The median is the point corresponding to the score that lies in
the middle of the distribution (i.e., there are as many data
points above the median as there are below the median).
To find the median, the data points must first be sorted into
either ascending or descending numerical order.
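The position of the median value can then be calculated using
the following formula:
$\text{Median Location} = \frac{N + 1}{2}$
When N is even, the location falls halfway between two scores, and
the median is taken as the average of those two scores.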
Median

Advantage:


Resistant to outliers
Disadvantage:



May not be so informative:
(1, 1, 2, 2, 2, 2, 5, 6, 9, 9, 10 )
Does the value of 2 really represent this sample as a
whole very well?
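The median-location rule can be sketched in Python as follows. The helper name `median_by_location` is made up for this illustration; the scores are the ones from the example above, and the result is cross-checked against the standard library's `statistics.median`.

```python
import math
from statistics import median

def median_by_location(values):
    """Return (location, median) using the (N + 1) / 2 location rule."""
    ordered = sorted(values)            # the data must be sorted first
    n = len(ordered)
    location = (n + 1) / 2              # 1-based position; ends in .5 when N is even
    lower = ordered[math.floor(location) - 1]
    upper = ordered[math.ceil(location) - 1]
    # For odd N, lower and upper are the same score; for even N they are
    # the two middle scores, which are averaged.
    return location, (lower + upper) / 2

scores = [1, 1, 2, 2, 2, 2, 5, 6, 9, 9, 10]
location, med = median_by_location(scores)
print(location, med)  # 6.0 2.0

even_location, even_med = median_by_location([1, 2, 3, 4])
print(even_location, even_med)  # 2.5 2.5
```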
The Mean

The most commonly used measure of central
tendency is called the mean (denoted $\bar{X}$ for
a sample, and µ for a population).

The mean is the same as what many of us call
the ‘average’, and it is calculated in the
following manner:
$\bar{X} = \frac{\sum X}{N}$
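A minimal sketch of that calculation, the sum of the scores divided by the number of scores N. The function name and the three scores are hypothetical, chosen so that the mean is not itself one of the observed values.

```python
def arithmetic_mean(values):
    """Sum of the scores divided by the number of scores, N."""
    return sum(values) / len(values)

sample = [2, 3, 7]          # hypothetical scores
x_bar = arithmetic_mean(sample)
print(x_bar)                # 4.0, a value that does not occur in the sample
```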
Mode vs. Median vs. Mean

When there is only one mode and the distribution
is fairly symmetrical, the three measures (as
well as others to be discussed) will have
similar values.

However, when the underlying distribution is
not symmetrical, the three measures of central
tendency can be quite different.
Some Visual Demos

Here is a demonstration that allows you to change a
frequency histogram while simultaneously noting the
effects of those changes on the mean versus the
median.

As you use the demo, you should fairly easily be
able to think about how these changes are also
affecting the mode.

Note that in a skewed distribution the order typically runs
Mode, Median, then Mean in the direction the tail is pointing.
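As a rough numerical stand-in for the visual demo, the sketch below compares the three measures on two made-up samples (both are assumptions for illustration, not data from the lecture): one symmetrical and one with a long tail toward high values.

```python
from statistics import mean, median, mode

# Hypothetical symmetrical sample: all three measures agree.
symmetric = [1, 2, 3, 3, 3, 4, 5]
sym_mode, sym_median, sym_mean = mode(symmetric), median(symmetric), mean(symmetric)
print(sym_mode, sym_median, sym_mean)        # 3 3 3

# Hypothetical right-skewed sample: the tail points toward high values.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]
skew_mode = mode(right_skewed)
skew_median = median(right_skewed)
skew_mean = mean(right_skewed)
print(skew_mode, skew_median, skew_mean)     # 2 3.0 4.6
```

In the skewed sample the measures line up as mode, median, mean in the direction of the tail. This ordering is typical of skewed data rather than guaranteed for every sample.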
What’s an average?



We’ve been referring to the mean without qualification, but in
fact there are many types of averages, and that is only one.
The mean we typically use is the arithmetic mean.
Along with the geometric mean and the harmonic mean, it is one of
the Pythagorean means.

For positive values, the arithmetic mean is greater than or equal to the
geometric mean, which is greater than or equal to the harmonic mean.
The geometric mean of n values is found by multiplying them all together and
taking the nth root of the product.
The harmonic mean is the reciprocal of the
arithmetic mean of the reciprocals of all the values of the
variable in question.
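The three Pythagorean means can be sketched with Python's `statistics` module (`geometric_mean` requires Python 3.8+). The three values are hypothetical; the hand calculations follow the verbal definitions above and are compared against the library functions.

```python
import math
from statistics import mean, geometric_mean, harmonic_mean

values = [1, 2, 4]   # hypothetical positive values

am = mean(values)             # (1 + 2 + 4) / 3
gm = geometric_mean(values)   # cube root of 1 * 2 * 4
hm = harmonic_mean(values)    # 3 / (1/1 + 1/2 + 1/4)

# The same quantities straight from the definitions.
gm_by_hand = math.prod(values) ** (1 / len(values))
hm_by_hand = 1 / mean([1 / v for v in values])

print(am, gm, hm)
# For positive values: arithmetic >= geometric >= harmonic.
print(am >= gm >= hm)  # True
```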
More means

The geometric mean is particularly appropriate for
data that change exponentially


E.g. Human population over a period of time
The harmonic mean is good for things like rates and
ratios, where an arithmetic mean would actually be
incorrect. It also appears in ANOVA: with
unequal sample sizes, by far the most
common procedure uses the harmonic mean of
the sample sizes

As a result, an unbalanced design will have less statistical
power, because the harmonic mean is pulled toward
the smallest sample size
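A short sketch of these three uses, with all numbers invented for illustration: yearly growth factors averaged with the geometric mean, two speeds over the same distance averaged with the harmonic mean, and the harmonic mean of unequal group sizes compared with their arithmetic mean.

```python
import math
from statistics import geometric_mean, harmonic_mean, mean

# Hypothetical yearly growth factors: +10%, +20%, +30%.
growth = [1.10, 1.20, 1.30]
avg_growth = geometric_mean(growth)
total_growth = math.prod(growth)
# Compounding the geometric mean over 3 years reproduces the total growth.
print(math.isclose(avg_growth ** 3, total_growth))  # True

# Hypothetical trip: the same distance out at 30 km/h and back at 60 km/h.
avg_speed = harmonic_mean([30, 60])
print(avg_speed)  # 40.0, not the arithmetic mean of 45

# Hypothetical group sizes in an unbalanced design.
group_sizes = [10, 20, 60]
n_harmonic = harmonic_mean(group_sizes)
n_arithmetic = mean(group_sizes)
print(n_harmonic, n_arithmetic)  # 18.0 30
```

The harmonic mean of the group sizes (18) sits much closer to the smallest group (10) than the arithmetic mean (30) does, which is the sense in which an unbalanced design behaves like a smaller study.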
More means


Weighted averages
Sometimes we will want to weight a measure of
some variable by the values of some other variable


E.g. If each person gets a score on several items and we
want an average of the total score for each person across
the items, we might weight them by 1/variance to give the
more consistent scorers more importance in the
calculation
The arithmetic mean is a weighted average in which
all weights are equal (e.g., all weights = 1).
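A weighted average can be sketched as below. The helper `weighted_mean`, the scores, and the variances are all hypothetical; the 1/variance weighting follows the example in the text, and equal weights are included to show that they give back the ordinary arithmetic mean.

```python
import math

def weighted_mean(values, weights):
    """Sum of weight * value, divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical scores with hypothetical variances (lower variance = more consistent).
scores = [4.0, 6.0, 8.0]
variances = [1.0, 4.0, 4.0]

# Weight each score by 1 / variance so the most consistent one counts most.
weights = [1 / v for v in variances]
precision_weighted = weighted_mean(scores, weights)
print(precision_weighted)  # 5.0, pulled toward the low-variance score of 4

# With equal weights the result is the ordinary arithmetic mean.
equal_weighted = weighted_mean(scores, [1, 1, 1])
print(equal_weighted)      # 6.0
```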
Properties of a Statistic: Sampling Distribution


In order to examine the properties
of a statistic we often want to take
repeated samples from some
population of data and calculate
the relevant statistic on each
sample.
We can then look at the
distribution of the statistic across
these samples and ask a variety of
questions about it.
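The repeated-sampling idea can be sketched as a small simulation. Everything here is an assumption made for illustration: the population is 1,000 simulated scores, the sample size is 25, and 500 samples are drawn; the seed is fixed only so the run is reproducible.

```python
import random
from statistics import mean, stdev

rng = random.Random(42)   # fixed seed so the sketch is reproducible

# Hypothetical population of 1,000 scores.
population = [rng.gauss(50, 10) for _ in range(1000)]

def sampling_distribution(statistic, population, n, reps, rng):
    """Compute `statistic` on `reps` random samples of size `n`."""
    return [statistic(rng.sample(population, n)) for _ in range(reps)]

sample_means = sampling_distribution(mean, population, n=25, reps=500, rng=rng)

# Questions we can now ask about the statistic:
center = mean(sample_means)    # where is its distribution centered?
spread = stdev(sample_means)   # how much does it vary from sample to sample?
print(round(center, 2), round(spread, 2))
```

Swapping `mean` for another function, such as `statistics.median`, gives the sampling distribution of that statistic instead.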
Properties of a Statistic

Sufficiency

A sufficient statistic is one that makes use of all of the information in
the sample to estimate its corresponding parameter


For example, this property makes the mean more attractive as a measure
of central tendency compared to the mode or median.
Unbiasedness

A statistic is said to be an unbiased estimator if its expected value
(i.e., the mean of the statistic across repeated samples, such as the
mean of a large number of sample means) is equal to the
population parameter it is estimating.

As one can see using the resampling procedure, the mean can be shown
to be an unbiased estimator.
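Unbiasedness of the sample mean can be checked exactly on a tiny example. Note that this sketch swaps random resampling for exhaustive enumeration: it lists every possible sample of size 2 (drawn with replacement) from a made-up four-value population, so the average of the sample means is the expected value itself rather than an approximation.

```python
import math
from itertools import product
from statistics import mean

population = [2, 4, 6, 12]   # hypothetical tiny population
mu = mean(population)        # population mean

# Every possible sample of size 2 drawn with replacement (4 * 4 = 16 samples).
all_samples = list(product(population, repeat=2))
sample_means = [mean(s) for s in all_samples]

# The expected value of the sample mean is the mean of all those sample means.
expected_value = mean(sample_means)
print(mu, expected_value)
print(math.isclose(expected_value, mu))  # True: the sample mean is unbiased for mu
```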
Properties of a Statistic

Efficiency

The efficiency of a statistic is reflected in the variance that is observed
when one examines the statistic over independently chosen samples



The square root of that variance is the statistic's standard error
The smaller the variance, the more efficient the statistic is said to be
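Efficiency can be illustrated by simulation. The sketch below assumes normally distributed data (an assumption of this example, not a claim from the lecture), draws 2,000 independent samples of size 25, and compares the variance of the sample mean with the variance of the sample median. The function name `statistic_variance` is made up for the illustration.

```python
import random
from statistics import mean, median, variance

rng = random.Random(0)   # fixed seed so the sketch is reproducible

def statistic_variance(statistic, n, reps):
    """Variance of `statistic` across `reps` independent normal samples of size `n`."""
    estimates = [
        statistic([rng.gauss(0, 1) for _ in range(n)])
        for _ in range(reps)
    ]
    return variance(estimates)

var_mean = statistic_variance(mean, n=25, reps=2000)
var_median = statistic_variance(median, n=25, reps=2000)

# Standard errors are the square roots of these variances.
print(round(var_mean ** 0.5, 3), round(var_median ** 0.5, 3))
print(var_mean < var_median)  # for normal data the mean is the more efficient
```

With heavy-tailed or contaminated data the comparison can go the other way, which is part of the motivation for the resistant measures discussed next.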
Resistance



The resistance of an estimator refers to the degree to which the
estimate is affected by extreme values, i.e. outliers
For a resistant estimator, small changes in the data result in only small changes in the estimate
Finite-sample breakdown point


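Measure of resistance to contamination
The smallest proportion of observations that, when altered sufficiently,
can render the statistic arbitrarily large or small



Median = about 1/2 (roughly n/2 of the n observations must be altered)
Trimmed mean = whatever the trimming amount is (as a proportion of each end)
Mean = 1/n (a single altered observation is enough)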
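The contrast in breakdown points can be seen by altering observations in a small sample. The five scores are hypothetical, and 10**6 stands in for "arbitrarily large".

```python
from statistics import mean, median

clean = [8, 10, 12, 14, 16]                  # hypothetical scores, n = 5
print(mean(clean), median(clean))            # 12 12

# Alter a single observation (a proportion of 1/n): the mean breaks down.
one_bad = [8, 10, 12, 14, 10**6]
mean_one_bad = mean(one_bad)
median_one_bad = median(one_bad)
print(mean_one_bad, median_one_bad)          # 200008.8 12

# The median only breaks down once more than half the scores are altered.
three_bad = [8, 10, 10**6, 10**6, 10**6]
median_three_bad = median(three_bad)
print(median_three_bad)                      # 1000000
```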
Resistant measures of central tendency

Trimmed mean




Created by “trimming” some percentage of the
high and low ends of the data
The median is actually a trimmed estimate (the limiting case, in which everything but the middle is trimmed away)
Winsorized mean
M-estimators


Extreme values are given less weight than those closer to
the center of the distribution.
May be more robust than mean or median for certain
types of “funky” data
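A minimal sketch of the first two resistant measures, under stated assumptions: `proportion` is the share of scores handled at each end, the function names are invented for this example, and the data are hypothetical. Trimming drops the extreme scores, while Winsorizing replaces them with the nearest retained score. M-estimators are not shown here.

```python
from statistics import mean, median

def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the scores from EACH end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)       # scores dropped per tail
    return mean(ordered[k:len(ordered) - k])

def winsorized_mean(values, proportion):
    """Mean after replacing `proportion` of the scores at EACH end
    with the nearest score that is kept."""
    ordered = sorted(values)
    n = len(ordered)
    k = int(n * proportion)                  # scores replaced per tail
    low, high = ordered[k], ordered[n - k - 1]
    clipped = [min(max(x, low), high) for x in ordered]
    return mean(clipped)

data = [2, 4, 5, 6, 7, 8, 9, 10, 12, 47]     # hypothetical scores, one extreme value

print(mean(data))                    # 11
print(trimmed_mean(data, 0.10))      # 7.625
print(winsorized_mean(data, 0.10))   # 7.7
# Trimming 40% from each end leaves only the two middle scores: the median.
print(trimmed_mean(data, 0.40), median(data))  # 7.5 7.5
```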
Practical Example



Administer the BDI to 10 randomly selected UNT students
Eight of the students score 25 or less; two score greater than 45.
Scores: 8, 12, 6, 16, 10, 20, 22, 25, 47, 55



Median = 18
Mean = 22.1
Which is more accurate regarding generalization to the ‘typical
UNT student’? One that includes:






Two people that perhaps reversed their ratings on the items?
A score that was miskeyed (using the number pad they hit a 4 instead
of 1 leading to a score of 47)?
Two people who do not have English as their native language?
Two people that did not answer honestly?
Two people that are actually clinically depressed?
One that is clinically depressed, one that just ‘wants to be different’?
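The median and mean quoted for the ten BDI scores can be checked directly, along with what happens when the two high scores are set aside. Dropping them is done here only to show how strongly each statistic depends on those two scores; whether they should be dropped is exactly the question the list above raises.

```python
from statistics import mean, median

bdi = [8, 12, 6, 16, 10, 20, 22, 25, 47, 55]

bdi_median = median(bdi)
bdi_mean = mean(bdi)
print(bdi_median, bdi_mean)          # 18.0 22.1

# The same statistics without the two scores above 45.
without_high = [score for score in bdi if score <= 45]
reduced_median = median(without_high)
reduced_mean = mean(without_high)
print(reduced_median, reduced_mean)  # 14.0 14.875
```

The mean moves by about 7 points and the median by 4, which is the resistance difference discussed earlier showing up in a concrete sample.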
Practical Example

While many think of outliers as representing the ‘complexity of human
nature’, the issue revolves more around inadequate data collection (too little information to
determine why the score is what it is) and a problematic description of the population




E.g. my definition of the typical UNT student, if such a thing could be said to
exist at all, is not one who is on suicide watch
However, the previous problem most likely represents an attempt to
generalize to something that doesn’t exist.
Better populations to try to represent: UNT Texans, UNT Psych grad
students, UNT international students, UNT students who have visited C & T
in the last semester (in which case those scores would probably not be outliers), etc.
Application to current events: Do you really think there is a ‘middle
America’, a ‘female vote’, etc. to which the presidential candidates are
trying to appeal? There are demographics, very specific ones, yes, but
those labels do little to capture the specifics.
Summary




Favoritism for the arithmetic mean is the result of familiarity
only, and until you came to this course you would have been
hard-pressed to explain your preference outside of arguments
from authority
The AM is to be valued for some properties it has relative to
other measures (sufficiency, efficiency, unbiasedness), and also
rejected on the same basis (it has the least resistance)
In many cases it’s entirely inappropriate to use the AM, as it
would give a distorted view of central tendency
Which statistics you use to represent your data should be
considered as much as the measures themselves.
Central Tendency - University of North Texas