Measures of Variability

Key Questions

  • Answer:

    In the formula for the population standard deviation, you divide by the population size $N$, whereas in the formula for the sample standard deviation, you divide by $n-1$ (the sample size minus one).

    Explanation:

    If $\mu$ is the mean of the population, the formula for the population standard deviation of the population data $x_{1}, x_{2}, x_{3}, \ldots, x_{N}$ is

    $$\sigma=\sqrt{\frac{\sum_{k=1}^{N}(x_{k}-\mu)^{2}}{N}}.$$

    If $\bar{x}$ is the mean of a sample, the formula for the sample standard deviation of the sample data $x_{1}, x_{2}, x_{3}, \ldots, x_{n}$ is

    $$s=\sqrt{\frac{\sum_{k=1}^{n}(x_{k}-\bar{x})^{2}}{n-1}}.$$
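As a rough illustration of the two formulas, the following Python sketch computes both on a small data set. The function names `population_sd` and `sample_sd` and the data values are made up for this example.

```python
import math


def population_sd(data):
    """Treat `data` as the whole population: divide by N."""
    n = len(data)
    mu = sum(data) / n
    return math.sqrt(sum((x - mu) ** 2 for x in data) / n)


def sample_sd(data):
    """Treat `data` as a sample: divide by n - 1."""
    n = len(data)
    if n < 2:
        raise ValueError("the sample standard deviation needs at least two values")
    x_bar = sum(data) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in data) / (n - 1))


# Made-up illustrative values (mean 5, squared deviations sum to 32).
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(population_sd(data))  # sqrt(32 / 8)
print(sample_sd(data))      # sqrt(32 / 7)
```

For this data the population formula gives sqrt(32/8) = 2, while the sample formula gives sqrt(32/7), about 2.14. Dividing by the smaller number always makes the sample value slightly larger.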

    The reason for dividing by $n-1$ is somewhat technical. Doing so makes the sample variance $s^{2}$ a so-called unbiased estimator of the population variance $\sigma^{2}$. In effect, if the population is very large and you draw many random samples of the same size $n$ from it, the average of the resulting values of $s^{2}$ will be very close to $\sigma^{2}$ (and, from a theoretical perspective, the expected value of $s^{2}$ as a "random variable" is exactly $\sigma^{2}$).
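The repeated-sampling idea can be checked with a small simulation. This is only a sketch under stated assumptions: the "population" is modeled as a normal distribution with standard deviation 10 (so the true variance is 100), the sample size is 5, and the helper names, seed, and number of trials are arbitrary choices for illustration.

```python
import random


def variance(sample, ddof):
    """Sum of squared deviations from the sample mean, divided by (n - ddof)."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / (n - ddof)


def average_variance_estimates(sigma=10.0, n=5, trials=20000, seed=0):
    """Average both variance estimates over many random samples of size n."""
    rng = random.Random(seed)
    total_n_minus_1 = 0.0
    total_n = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        total_n_minus_1 += variance(sample, ddof=1)  # divide by n - 1
        total_n += variance(sample, ddof=0)          # divide by n
    return total_n_minus_1 / trials, total_n / trials


unbiased_avg, biased_avg = average_variance_estimates()
print(unbiased_avg)  # should land near the true variance, 100
print(biased_avg)    # should land near 100 * (n - 1) / n = 80
```

Under these assumptions, the average of the values computed with $n-1$ should settle close to 100, while dividing by $n$ underestimates the variance by a factor of $(n-1)/n$.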

    The technicalities of why this is true involve a lot of algebra with summations and are usually not worth the time for beginning students.

  • The standard deviation is the most widely used measure.

    The range simply gives the difference between the lowest and highest values, so a few extreme values can change it drastically.

    The standard deviation $\sigma$ tells you where most of the values will be: in a normal distribution, about 68% of all values lie within one standard deviation of the mean $\mu$, and about 95% lie within two standard deviations of the mean.

    Example:
    You have a filling machine that fills kilogram bags of sugar. It will not fill exactly 1000 g every time; the standard deviation is 10 g.
    Then, if the fill weights are roughly normally distributed around 1000 g, you know that about 68% of bags are between 990 g and 1010 g, and about 95% are between 980 g and 1020 g, a total span of 20 g or 40 g respectively.
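These percentages can be checked with Python's standard library. The sketch assumes the fill weights follow a normal distribution with mean 1000 g and standard deviation 10 g. The helper `share_between` is a name introduced here for illustration.

```python
from statistics import NormalDist

# Assumed model of the filling machine: normal, mean 1000 g, SD 10 g.
fill = NormalDist(mu=1000, sigma=10)


def share_between(dist, low, high):
    """Fraction of values expected to fall between low and high."""
    return dist.cdf(high) - dist.cdf(low)


within_one_sd = share_between(fill, 990, 1010)
within_two_sd = share_between(fill, 980, 1020)

print(f"{within_one_sd:.1%}")  # 68.3%
print(f"{within_two_sd:.1%}")  # 95.4%
```

The commonly quoted 68% and 95% are rounded versions of these values.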

    Every now and again a bag will be far over-filled (say 1100 g), and sometimes a bag will end up empty (0 g), so the range will be 1100 g in total.

    You may decide which of the two gives a better idea of the spread in this distribution.
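To make the comparison concrete, here is a hedged sketch that simulates the filling machine. The bag weights are randomly generated, not real measurements: 998 bags drawn from an assumed normal distribution (mean 1000 g, SD 10 g), plus one over-filled bag of 1100 g and one empty bag. The seed and the helper `value_range` are illustrative choices.

```python
import random
import statistics

rng = random.Random(1)

# Simulated, not measured: 998 ordinary bags plus the two extreme bags.
typical = [rng.gauss(1000, 10) for _ in range(998)]
with_extremes = typical + [1100.0, 0.0]


def value_range(data):
    """Difference between the highest and lowest value."""
    return max(data) - min(data)


for label, bags in (("typical bags only", typical),
                    ("with the 0 g and 1100 g bags", with_extremes)):
    print(label)
    print("  range:", round(value_range(bags)), "g")
    print("  SD:   ", round(statistics.stdev(bags), 1), "g")
```

In a run like this, the two extreme bags push the range to 1100 g, many times its value for the ordinary bags. The standard deviation also grows, since it is not immune to outliers, but by a much smaller factor.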

  • SD: it gives you a numerical value for the variation of the data.
    Range: it gives you the difference between the maximum and minimum values of the data.

    Mean: a single value that represents the average of the data. It is not representative in asymmetrical distributions, and it is influenced by outliers.
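A quick sketch of how one outlier moves the mean, using hypothetical fill weights invented for this illustration:

```python
from statistics import mean

# Hypothetical fill weights in grams, invented for illustration.
bags = [990.0, 1000.0, 1010.0, 1000.0, 1000.0]
bags_with_outlier = bags + [0.0]  # add one empty bag

print(mean(bags))               # 1000.0
print(mean(bags_with_outlier))  # 5000 / 6, about 833.3
```

A single empty bag pulls the mean down to a value that no actual bag in the list comes close to.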

Questions