Mathematical expectation and dispersion of a random variable. Residual variance

In many cases it is necessary to introduce another numerical characteristic that measures the degree of scattering (spread) of the values taken by a random variable ξ around its mathematical expectation.

Definition. The variance (dispersion) of a random variable ξ is the number

$D\xi = M(\xi - M\xi)^2$. (1)

In other words, the variance is the mathematical expectation of the squared deviation of a random variable from its mean value.

The number

$\sigma_\xi = \sqrt{D\xi}$ (2)

is called the mean square deviation (standard deviation) of the quantity ξ.

If the variance characterizes the average size of the squared deviation of ξ from Mξ, then the number $\sigma_\xi$ can be regarded as an average characteristic of the deviation itself, more precisely, of the magnitude |ξ − Mξ|.

The following two properties of dispersion follow from definition (1).

1. The variance of a constant is zero. This agrees with the intuitive meaning of variance as a “measure of scatter”.

Indeed, if ξ = C, then Mξ = C, and therefore $D\xi = M(C - C)^2 = M0 = 0$.

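2. When a random variable ξ is multiplied by a constant C, its variance is multiplied by $C^2$:

$D(C\xi) = C^2 D\xi$. (3)

Indeed,

$D(C\xi) = M(C\xi - M(C\xi))^2 = M(C\xi - C \cdot M\xi)^2 = M(C^2 (\xi - M\xi)^2) = C^2 M(\xi - M\xi)^2 = C^2 D\xi$.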

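3. The following formula for calculating the variance holds:

$D\xi = M(\xi^2) - (M\xi)^2$. (4)

The proof of this formula follows from the properties of the mathematical expectation. We have:

$D\xi = M(\xi - M\xi)^2 = M(\xi^2 - 2\xi \cdot M\xi + (M\xi)^2) = M(\xi^2) - 2 M\xi \cdot M\xi + (M\xi)^2 = M(\xi^2) - (M\xi)^2$.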

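4. If the quantities $\xi_1$ and $\xi_2$ are independent, then the variance of their sum is equal to the sum of their variances:

$D(\xi_1 + \xi_2) = D\xi_1 + D\xi_2$. (5)

Proof. We use the properties of the mathematical expectation. Let $M\xi_1 = m_1$ and $M\xi_2 = m_2$; then $M(\xi_1 + \xi_2) = m_1 + m_2$ and

$D(\xi_1 + \xi_2) = M((\xi_1 - m_1) + (\xi_2 - m_2))^2 = M(\xi_1 - m_1)^2 + 2M((\xi_1 - m_1)(\xi_2 - m_2)) + M(\xi_2 - m_2)^2 = D\xi_1 + D\xi_2$,

since, by independence, $M((\xi_1 - m_1)(\xi_2 - m_2)) = M(\xi_1 - m_1) \cdot M(\xi_2 - m_2) = 0$.

Formula (5) has been proven.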

Since the variance of a random variable is, by definition, the mathematical expectation of the quantity $(\xi - m)^2$, where m = Mξ, the formulas obtained in §7 of Chapter II can be used to calculate it.

Thus, if ξ is a discrete random variable with the distribution law

x_1 x_2 ...
p_1 p_2 ...

then we have:

$D\xi = \sum_k (x_k - m)^2 p_k$. (7)

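If ξ is a continuous random variable with distribution density p(x), then we get:

$D\xi = M(\xi - m)^2 = \int_{-\infty}^{\infty} (x - m)^2 p(x)\,dx$. (8)

If formula (4) is used to calculate the variance, other formulas can be obtained, namely:

$D\xi = \sum_k x_k^2 p_k - m^2$, (9)

if the quantity ξ is discrete, and

$D\xi = M(\xi^2) - m^2 = \int_{-\infty}^{\infty} x^2 p(x)\,dx - m^2$, (10)

if ξ is distributed with density p(x).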
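As a quick check of the two discrete formulas, the following sketch computes the variance once from the definition, formula (7), and once from the moment form, formula (9). The distribution (a fair six-sided die) and the helper name `discrete_variance` are assumptions chosen for illustration; they do not come from the text.

```python
from fractions import Fraction


def discrete_variance(values, probs):
    """Return the variance computed by formula (7) and by formula (9)."""
    m = sum(x * p for x, p in zip(values, probs))                # mean M(xi)
    by_definition = sum((x - m) ** 2 * p for x, p in zip(values, probs))
    by_moments = sum(x * x * p for x, p in zip(values, probs)) - m ** 2
    return by_definition, by_moments


# Hypothetical distribution: a fair six-sided die (exact arithmetic with fractions).
values = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6

v7, v9 = discrete_variance(values, probs)
print(v7, v9)  # both forms give the same exact fraction
```

Exact fractions are used so that the two formulas can be compared with `==` instead of a floating-point tolerance.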

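Example 1. Let the quantity ξ be uniformly distributed on the segment [a, b]. Then p(x) = 1/(b − a) on [a, b] and Mξ = (a + b)/2, so formula (10) gives:

$D\xi = \frac{1}{b - a} \int_a^b x^2\,dx - \left(\frac{a + b}{2}\right)^2 = \frac{a^2 + ab + b^2}{3} - \frac{(a + b)^2}{4} = \frac{(b - a)^2}{12}$.

It can be shown that the variance of a random variable distributed according to the normal law with density

$p(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - m)^2}{2\sigma^2}}$, (11)

is equal to $\sigma^2$.

This clarifies the meaning of the parameter σ in the density expression (11) for the normal law: σ is the standard deviation of the quantity ξ.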
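The uniform case can be checked numerically. The sketch below approximates the integrals in formula (10) with a midpoint rule and compares the result with the known closed form (b − a)²/12 for the uniform law. The endpoints a = 2, b = 5, the step count and the function name are illustrative assumptions.

```python
def uniform_variance_numeric(a, b, steps=100_000):
    """Approximate the variance of the uniform law on [a, b] via formula (10)."""
    h = (b - a) / steps
    density = 1.0 / (b - a)          # p(x) on [a, b]
    first = 0.0                      # approximates the integral of x * p(x)
    second = 0.0                     # approximates the integral of x^2 * p(x)
    for k in range(steps):
        x = a + (k + 0.5) * h        # midpoint of the k-th subinterval
        first += x * density * h
        second += x * x * density * h
    return second - first ** 2


a, b = 2.0, 5.0                      # hypothetical endpoints
numeric = uniform_variance_numeric(a, b)
closed_form = (b - a) ** 2 / 12
print(numeric, closed_form)
```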

Example 2. Find the variance of a random variable ξ distributed according to the binomial law.

Solution. Using the representation of ξ in the form

$\xi = \xi_1 + \xi_2 + \dots + \xi_n$ (see Example 2, §7, Chapter II) and applying the formula for adding the variances of independent quantities, we get

$D\xi = D\xi_1 + D\xi_2 + \dots + D\xi_n$.

The variance of each of the quantities $\xi_i$ (i = 1, 2, ..., n) is calculated directly:

$D\xi_i = M(\xi_i^2) - (M\xi_i)^2 = 0^2 \cdot q + 1^2 \cdot p - p^2 = p(1 - p) = pq$.

Finally we get

$D\xi = npq$, where q = 1 − p.
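The result npq can be cross-checked by computing the variance directly from the binomial probabilities. This is only a sketch: the parameters n = 10, p = 0.3 and the function name are assumptions made for the illustration.

```python
from math import comb


def binomial_variance(n, p):
    """Variance of a binomial variable computed directly from its probabilities."""
    q = 1 - p
    pmf = [comb(n, k) * p ** k * q ** (n - k) for k in range(n + 1)]
    mean = sum(k * pk for k, pk in enumerate(pmf))
    return sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))


n, p = 10, 0.3                       # hypothetical parameters
direct = binomial_variance(n, p)
closed_form = n * p * (1 - p)        # npq
print(direct, closed_form)
```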

Let's calculate the variance and standard deviation of a sample in MS EXCEL. We will also calculate the variance of a random variable whose distribution is known.

Let's first consider the variance, then the standard deviation.

Sample variance

Sample variance characterizes the spread of the values in the array relative to the mean.

All 3 formulas are mathematically equivalent.

From the first formula it is clear that the sample variance is the sum of the squared deviations of each value in the array from the mean, divided by the sample size minus 1.

In MS EXCEL 2007 and earlier versions, the sample variance is calculated with the VAR() function, i.e. VARiance. Starting with MS EXCEL 2010, it is recommended to use its analogue VAR.S(), i.e. Sample VARiance. In addition, starting with MS EXCEL 2010 there is the function VAR.P(), i.e. Population VARiance, which calculates the variance of a population. The whole difference comes down to the denominator: instead of n-1 as in VAR.S(), VAR.P() has just n in the denominator. Before MS EXCEL 2010, the VARP() function was used to calculate the population variance.

The sample variance can also be calculated directly using the formulas below:
=DEVSQ(Sample)/(COUNT(Sample)-1)
=(SUMSQ(Sample)-COUNT(Sample)*AVERAGE(Sample)^2)/(COUNT(Sample)-1) – usual formula
=SUM((Sample-AVERAGE(Sample))^2)/(COUNT(Sample)-1) – array formula
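For readers working outside a spreadsheet, the same calculations can be mirrored in Python. The sketch below computes the sample variance from the sum of squared deviations, from the sum of squares minus n times the squared mean, and with the standard-library function `statistics.variance`, which also divides by n − 1. The sample values are invented for the illustration.

```python
import statistics

sample = [2, 4, 4, 5, 7, 8]          # hypothetical sample
n = len(sample)
mean = sum(sample) / n

# Sum of squared deviations divided by n - 1.
by_deviations = sum((x - mean) ** 2 for x in sample) / (n - 1)

# Sum of squares minus n * mean^2, divided by n - 1.
by_sum_of_squares = (sum(x * x for x in sample) - n * mean ** 2) / (n - 1)

# Standard-library sample variance (denominator n - 1).
by_library = statistics.variance(sample)

print(by_deviations, by_sum_of_squares, by_library)
```

The second form is convenient for hand calculation but can lose precision in floating point when the mean is large relative to the spread; the deviation form is usually the safer choice.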

The sample variance equals 0 only if all values are equal to each other and, accordingly, equal to the mean. The larger the variance, the greater the spread of the values in the array.

The sample variance is a point estimate of the variance of the distribution of the random variable from which the sample was drawn. The construction of confidence intervals for the variance is covered in a separate article.

Variance of a random variable

To calculate the variance of a random variable, you need to know its distribution.

The variance of a random variable X is often denoted Var(X). The variance equals the expected value of the squared deviation from the mean E(X): $Var(X) = E[(X - E(X))^2]$

If the random variable has a discrete distribution, the variance is calculated by the formula:

$Var(X) = \sum_i (x_i - \mu)^2 p(x_i)$

where $x_i$ is a value that the random variable can take, μ is the mean (the expected value of the random variable), and $p(x_i)$ is the probability that the random variable takes the value $x_i$.

If the random variable has a continuous distribution with density p(x), the variance is calculated by the formula:

$Var(X) = \int (x - \mu)^2 p(x)\,dx$

The dimension of the variance is the square of the unit of measurement of the original values. For example, if the values in the sample are part weights (in kg), the variance is measured in kg^2. This can be difficult to interpret, so the spread of values is often characterized by the square root of the variance, the standard deviation.

Some properties of the variance:

Var(X+a)=Var(X), where X is a random variable and a is a constant.

$Var(aX) = a^2 Var(X)$

$Var(X) = E[(X - E(X))^2] = E[X^2 - 2 X E(X) + (E(X))^2] = E(X^2) - E(2 X E(X)) + (E(X))^2 = E(X^2) - 2 E(X) E(X) + (E(X))^2 = E(X^2) - (E(X))^2$

This property of the variance is used in the article about linear regression.

Var(X+Y)=Var(X) + Var(Y) + 2*Cov(X;Y), where X and Y are random variables and Cov(X;Y) is the covariance of these random variables.

If the random variables are independent, their covariance is 0, and therefore Var(X+Y)=Var(X)+Var(Y).

Let us show that for independent quantities Var(X-Y)=Var(X+Y). Indeed, $Var(X - Y) = Var(X + (-Y)) = Var(X) + Var(-Y) = Var(X) + (-1)^2 Var(Y) = Var(X) + Var(Y) = Var(X + Y)$.
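The properties listed above also hold exactly for the population variance and covariance of a finite data set, which makes them easy to verify numerically. In this sketch the data, the constant and the helper names `mean`, `var` and `cov` are all assumptions made for the illustration.

```python
from math import isclose


def mean(xs):
    return sum(xs) / len(xs)


def var(xs):
    """Population variance: mean squared deviation from the mean."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def cov(xs, ys):
    """Population covariance of two equally long data sets."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


x = [2.0, 4.0, 4.0, 7.0, 9.0]        # hypothetical data
y = [1.0, 3.0, 2.0, 8.0, 5.0]
a = 3.0

shift_ok = isclose(var([v + a for v in x]), var(x))                   # Var(X+a) = Var(X)
scale_ok = isclose(var([a * v for v in x]), a ** 2 * var(x))          # Var(aX) = a^2 Var(X)
moments_ok = isclose(var(x), mean([v * v for v in x]) - mean(x) ** 2) # E(X^2) - (E(X))^2
sum_ok = isclose(var([p + q for p, q in zip(x, y)]),
                 var(x) + var(y) + 2 * cov(x, y))                     # Var(X+Y)
print(shift_ok, scale_ok, moments_ok, sum_ok)
```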

Sample standard deviation

The sample standard deviation is a measure of how widely the values in a sample are scattered relative to their mean.

By definition, the standard deviation equals the square root of the variance:

$s = \sqrt{s^2} = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n - 1}}$

The standard deviation does not take into account the magnitude of the values in the sample, only the degree of their scatter around the mean. To illustrate this, consider an example.

Let's calculate the standard deviation for 2 samples: (1; 5; 9) and (1001; 1005; 1009). In both cases, s=4. Obviously, the ratio of the standard deviation to the array values differs significantly between the samples. For such cases the coefficient of variation (Coefficient of Variation, CV) is used: the ratio of the standard deviation to the arithmetic mean, expressed as a percentage.
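The two samples from this example can be compared in a few lines of Python. The sketch uses the standard-library `statistics.stdev` (denominator n − 1); the helper name `coefficient_of_variation` is an assumption.

```python
import statistics


def coefficient_of_variation(sample):
    """Sample standard deviation divided by the mean, as a percentage."""
    return statistics.stdev(sample) / statistics.mean(sample) * 100


small = [1, 5, 9]
large = [1001, 1005, 1009]

for sample in (small, large):
    print(statistics.stdev(sample), round(coefficient_of_variation(sample), 2))
```

Both samples have the same standard deviation, while the coefficient of variation is far smaller for the second sample because its mean is much larger.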

In MS EXCEL 2007 and earlier versions, the sample standard deviation is calculated with the function =STDEV(), i.e. STandard DEViation. Starting with MS EXCEL 2010, it is recommended to use its analogue =STDEV.S(), i.e. Sample STandard DEViation.

In addition, starting with MS EXCEL 2010 there is the function STDEV.P(), i.e. Population STandard DEViation, which calculates the standard deviation of a population. The whole difference comes down to the denominator: instead of n-1 as in STDEV.S(), STDEV.P() has just n in the denominator.

The standard deviation can also be calculated directly using the formulas below:
=SQRT(DEVSQ(Sample)/(COUNT(Sample)-1))
=SQRT((SUMSQ(Sample)-COUNT(Sample)*AVERAGE(Sample)^2)/(COUNT(Sample)-1))

Other measures of scatter

The DEVSQ() function calculates the sum of squared deviations of values from their mean. This function returns the same result as the formula =VAR.P(Sample)*COUNT(Sample), where Sample is a reference to a range containing an array of sample values. Calculations in the DEVSQ() function are made according to the formula:

$\sum_{i=1}^{n} (x_i - \bar{x})^2$

The AVEDEV() function is also a measure of the spread of a data set. It calculates the average of the absolute deviations of values from the mean. This function returns the same result as the formula =SUMPRODUCT(ABS(Sample-AVERAGE(Sample)))/COUNT(Sample), where Sample is a reference to a range containing an array of sample values.

Calculations in the AVEDEV() function are made according to the formula:

$\frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}|$
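The two measures described above, the sum of squared deviations and the mean absolute deviation (DEVSQ and AVEDEV in English-language Excel), are simple enough to sketch in Python. The function names `devsq` and `avedev` are hypothetical helpers written for this illustration, not library calls, and the sample values are invented.

```python
def devsq(sample):
    """Sum of squared deviations of the values from their mean."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample)


def avedev(sample):
    """Average of the absolute deviations of the values from their mean."""
    m = sum(sample) / len(sample)
    return sum(abs(x - m) for x in sample) / len(sample)


sample = [2, 4, 4, 5, 7, 8]          # hypothetical sample
population_variance = devsq(sample) / len(sample)

print(devsq(sample), avedev(sample), population_variance * len(sample))
```

The last printed value illustrates the relationship stated in the text: the population variance multiplied by the number of values gives back the sum of squared deviations.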

The variance of a random variable is a measure of the spread of its values. Low variance means that the values are clustered close together; high variance indicates a wide spread of values. The concept of variance is used throughout statistics. For example, by comparing the variance of two groups (such as male and female patients), you can test the significance of a variable. Variance is also used when building statistical models, since low variance can be a sign that you are overfitting the values.

Steps

Calculating sample variance

  1. Record the sample values. In most cases, statisticians only have access to samples of specific populations. For example, statisticians do not, as a rule, analyze the maintenance costs of all cars in Russia; they analyze a random sample of several thousand cars. Such a sample helps estimate the average cost per car, but the resulting value will most likely not coincide exactly with the true one.

    • For example, let's analyze the number of buns sold in a cafe over 6 days, taken in random order. The sample looks like this: 17, 15, 23, 7, 9, 13. This is a sample, not a population, because we do not have data on buns sold for each day the cafe is open.
    • If you are given a population rather than a sample of values, continue to the next section.
  2. Write down a formula to calculate sample variance. Variance is a measure of the spread of the values of a quantity. The closer the variance is to zero, the more tightly the values are grouped together. When working with a sample of values, use the following formula to calculate variance:

    • $s^2 = \sum (x_i - \bar{x})^2 / (n - 1)$
    • $s^2$ – the variance. Variance is measured in square units.
    • $x_i$ – each value in the sample.
    • ∑ – the summation sign. That is, from each value $x_i$ you need to subtract x̅, square the difference, and then add up the results.
    • x̅ – the sample mean.
    • n – the number of values in the sample.
  3. Calculate the sample mean. It is denoted as x̅. The sample mean is calculated as a simple arithmetic mean: add up all the values ​​in the sample, and then divide the result by the number of values ​​in the sample.

    • In our example, add the values ​​in the sample: 15 + 17 + 23 + 7 + 9 + 13 = 84
      Now divide the result by the number of values ​​in the sample (in our example there are 6): 84 ÷ 6 = 14.
      Sample mean x̅ = 14.
    • The sample mean is the central value around which the values ​​in the sample are distributed. If the values ​​in the sample cluster around the sample mean, then the variance is small; otherwise the variance is large.
  4. Subtract the sample mean from each value in the sample. Now calculate the difference $x_i - \bar{x}$, where $x_i$ is each value in the sample. Each result shows how far a particular value lies from the sample mean.

    • In our example:
      $x_1 - \bar{x}$ = 17 - 14 = 3
      $x_2 - \bar{x}$ = 15 - 14 = 1
      $x_3 - \bar{x}$ = 23 - 14 = 9
      $x_4 - \bar{x}$ = 7 - 14 = -7
      $x_5 - \bar{x}$ = 9 - 14 = -5
      $x_6 - \bar{x}$ = 13 - 14 = -1
    • The results are easy to check, since their sum must equal zero. This follows from the definition of the mean: the negative differences (distances from the mean down to smaller values) are exactly offset by the positive differences (distances from the mean up to larger values).
  5. Square each result. As noted above, the sum of the differences $x_i - \bar{x}$ must equal zero. This means that the average deviation is always zero, which gives no idea of the spread of the values. To solve this problem, square each difference $x_i - \bar{x}$. You will then get only non-negative numbers, whose sum is zero only when all the values are identical.

    • In our example:
      $(x_1 - \bar{x})^2 = 3^2 = 9$
      $(x_2 - \bar{x})^2 = 1^2 = 1$
      $9^2 = 81$
      $(-7)^2 = 49$
      $(-5)^2 = 25$
      $(-1)^2 = 1$
    • You have found the squared difference $(x_i - \bar{x})^2$ for each value in the sample.
  6. Calculate the sum of the squared differences. That is, find the part of the formula written as $\sum (x_i - \bar{x})^2$. Here the sign ∑ means the sum of the squared differences over all values $x_i$ in the sample. You have already found the squared difference $(x_i - \bar{x})^2$ for each value $x_i$ in the sample; now just add these squares.

    • In our example: 9 + 1 + 81 + 49 + 25 + 1 = 166 .
  7. Divide the result by n - 1, where n is the number of values in the sample. Some time ago, statisticians simply divided the result by n when calculating sample variance; this gives the mean of the squared deviations, which describes the variance of the given sample itself. But remember that any sample is only a small part of the population. If you take another sample and perform the same calculations, you will get a different result. As it turns out, dividing by n - 1 (rather than just n) gives a better estimate of the population variance, which is what you are interested in. Division by n - 1 has become standard, so it is included in the formula for sample variance.

    • In our example, the sample includes 6 values, that is, n = 6.
      Sample variance = $s^2 = \frac{166}{6 - 1}$ = 33.2
  8. The difference between variance and standard deviation. Note that the formula contains an exponent, so the variance is measured in squared units of the quantity being analyzed. Such a quantity is sometimes hard to work with; in those cases use the standard deviation, which equals the square root of the variance. That is why the sample variance is denoted $s^2$ and the sample standard deviation $s$.

    • In our example, the standard deviation of the sample is: s = √33.2 = 5.76.
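The steps above can be followed line by line in Python using the bun-sales sample from the example. The variable names are chosen for this illustration only.

```python
from math import sqrt

sample = [17, 15, 23, 7, 9, 13]                    # buns sold on 6 days

n = len(sample)
mean = sum(sample) / n                             # step 3: sample mean
deviations = [x - mean for x in sample]            # step 4: differences from the mean
squares = [d ** 2 for d in deviations]             # step 5: squared differences
sum_of_squares = sum(squares)                      # step 6: sum of the squares
sample_variance = sum_of_squares / (n - 1)         # step 7: divide by n - 1
sample_std = sqrt(sample_variance)                 # step 8: standard deviation

print(mean, deviations, sum_of_squares, sample_variance, round(sample_std, 2))
```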

    Calculating Population Variance

    1. Analyze some set of values. The population includes all values of the quantity under consideration. For example, if you are studying the age of residents of the Leningrad region, then the population includes the ages of all residents of this region. When working with a population, it is recommended to create a table and enter the population values into it. Consider the following example:

      • In a certain room there are 6 aquariums. Each aquarium contains the following number of fish:
        $x_1 = 5$
        $x_2 = 5$
        $x_3 = 8$
        $x_4 = 12$
        $x_5 = 15$
        $x_6 = 18$
    2. Write down a formula to calculate the population variance. Since the population includes all values of the quantity, the formula below gives the exact value of the population variance. To distinguish population variance from sample variance (which is only an estimate), statisticians use different symbols:

      • $\sigma^2 = \sum (x_i - \mu)^2 / n$
      • $\sigma^2$ – the population variance (read as “sigma squared”). Variance is measured in square units.
      • $x_i$ – each value in the population.
      • Σ – the summation sign. That is, from each value $x_i$ you need to subtract μ, square the difference, and then add up the results.
      • μ – the population mean.
      • n – the number of values in the population.
    3. Calculate the population mean. When working with a population, its mean is denoted as μ (mu). The population mean is calculated as a simple arithmetic mean: add up all the values ​​in the population, and then divide the result by the number of values ​​in the population.

      • Keep in mind that averages are not always calculated as the arithmetic mean.
      • In our example, the population mean: μ = $\frac{5 + 5 + 8 + 12 + 15 + 18}{6}$ = 10.5
    4. Subtract the population mean from each value in the population. The closer the difference is to zero, the closer the particular value is to the population mean. Find the difference between each value in the population and its mean, and you will get a first idea of the distribution of values.

      • In our example:
        $x_1 - \mu$ = 5 - 10.5 = -5.5
        $x_2 - \mu$ = 5 - 10.5 = -5.5
        $x_3 - \mu$ = 8 - 10.5 = -2.5
        $x_4 - \mu$ = 12 - 10.5 = 1.5
        $x_5 - \mu$ = 15 - 10.5 = 4.5
        $x_6 - \mu$ = 18 - 10.5 = 7.5
    5. Square each result obtained. The differences will be both positive and negative; if these values are plotted on a number line, they will lie to the right and left of the population mean. They cannot be used directly to calculate variance, since positive and negative numbers cancel each other out. So square each difference to get only non-negative numbers.

      • In our example:
        $(x_i - \mu)^2$ for each population value (from i = 1 to i = 6):
        $(-5.5)^2$ = 30.25
        $(-5.5)^2$ = 30.25
        $(-2.5)^2$ = 6.25
        $(1.5)^2$ = 2.25
        $(4.5)^2$ = 20.25
        $(7.5)^2$ = 56.25
      • To calculate the average of the results obtained, find their sum and divide it by n: $((x_1 - \mu)^2 + (x_2 - \mu)^2 + ... + (x_n - \mu)^2) / n$, where $x_n$ is the last value in the population.
      • Now let's write the above explanation in summation notation: $\sum (x_i - \mu)^2 / n$. This is the formula for calculating the population variance.
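The population calculation for the aquarium example can be sketched in Python as follows. The cross-check uses the standard-library `statistics.pvariance`, which divides by n; the variable names are illustrative.

```python
import statistics

fish = [5, 5, 8, 12, 15, 18]                       # fish counts in the 6 aquariums

n = len(fish)
mu = sum(fish) / n                                 # population mean
squared = [(x - mu) ** 2 for x in fish]            # squared differences from the mean
population_variance = sum(squared) / n             # divide by n, not n - 1

print(mu, squared, population_variance)
print(statistics.pvariance(fish))                  # standard-library cross-check
```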

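Variance in statistics is defined as the mean of the squared deviations of the individual values of a characteristic from the arithmetic mean. Depending on the initial data, it is determined using the simple or the weighted variance formula:

1. Simple variance (for ungrouped data) is calculated using the formula:

$\sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n}$

2. Weighted variance (for variation series):

$\sigma^2 = \frac{\sum (x_i - \bar{x})^2 n_i}{\sum n_i}$

where $n_i$ is the frequency (the number of times the value $x_i$ occurs); in the simple formula n is the number of observations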

An example of finding variance


Example 1. The following data are available for a group of 20 correspondence-department students. It is necessary to construct an interval distribution series of the characteristic, calculate its mean value and study its variance.

Let's build an interval grouping. Let's determine the width of the interval using the formula:

$h = \frac{X_{max} - X_{min}}{n}$

where X max is the maximum value of the grouping characteristic;
X min – minimum value of the grouping characteristic;
n – number of intervals.

We take n=5. The step is: h = (192 - 159)/ 5 = 6.6

Let's create an interval grouping

For further calculations, we will build an auxiliary table:

X'i is the midpoint of the interval (for example, the midpoint of the interval 159 – 165.6 is 162.3).

We determine the average height of the students using the weighted arithmetic mean formula:

$\bar{x} = \frac{\sum x'_i n_i}{\sum n_i}$

Let's determine the variance using the formula:

$\sigma^2 = \frac{\sum (x'_i - \bar{x})^2 n_i}{\sum n_i}$

The variance formula can be transformed as follows:

$\sigma^2 = \overline{x^2} - (\bar{x})^2$

From this formula it follows that the variance is equal to the difference between the mean of the squares of the values and the square of the mean.

For variation series with equal intervals, the variance can be calculated by the method of moments, using the second property of variance (dividing all values by the interval width). This calculation is less labor-intensive and uses the following formula:

$\sigma^2 = i^2 (m_2 - m_1^2)$

where i is the interval width;
A is a conventional zero, for which it is convenient to use the midpoint of the interval with the highest frequency;
$m_1 = \frac{\sum \frac{x'_i - A}{i} n_i}{\sum n_i}$ is the first-order moment, and $m_1^2$ is its square;
$m_2 = \frac{\sum \left(\frac{x'_i - A}{i}\right)^2 n_i}{\sum n_i}$ is the second-order moment
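A small numerical sketch shows that the method of moments gives the same value as the weighted variance formula. The interval midpoints and frequencies below are invented for the illustration (they are not the data of the example above), and the names `A` and `width` follow the notation of the text.

```python
from math import isclose

# Hypothetical variation series with equal intervals: midpoints and frequencies.
midpoints = [10, 20, 30, 40, 50]
freqs = [1, 3, 5, 4, 2]

total = sum(freqs)
width = 10                                         # interval width i
A = 30                                             # conventional zero: midpoint with the highest frequency

# Weighted variance computed directly.
mean = sum(x * f for x, f in zip(midpoints, freqs)) / total
direct = sum((x - mean) ** 2 * f for x, f in zip(midpoints, freqs)) / total

# Method of moments: work with the coded values (x - A) / i.
coded = [(x - A) / width for x in midpoints]
m1 = sum(u * f for u, f in zip(coded, freqs)) / total          # first-order moment
m2 = sum(u ** 2 * f for u, f in zip(coded, freqs)) / total     # second-order moment
by_moments = width ** 2 * (m2 - m1 ** 2)

print(direct, by_moments)
```

The coded values are small integers (here −2, −1, 0, 1, 2), which is what made the method attractive for hand calculation.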

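The variance of an alternative characteristic (if a characteristic in a statistical population varies so that there are only two mutually exclusive options, such variability is called alternative) can be calculated using the formula:

$\sigma^2 = p q$

where p is the proportion of units that have the characteristic and q is the proportion that do not. Substituting q = 1 - p into this formula, we obtain:

$\sigma^2 = p (1 - p)$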

Types of variance

Total variance measures the variation of a characteristic across the entire population as a whole under the influence of all factors that cause this variation. It is equal to the mean square of the deviations of individual values ​​of a characteristic x from the overall mean value of x and can be defined as simple variance or weighted variance.

Within-group variance characterizes random variation, i.e. the part of the variation that is due to the influence of unaccounted factors and does not depend on the factor characteristic that forms the basis of the grouping. It is equal to the mean square of the deviations of the individual values of the characteristic within a group from the arithmetic mean of that group and can be calculated as a simple or as a weighted variance.

Thus, within-group variance measures the variation of a characteristic within a group and is determined by the formula:

$\sigma_i^2 = \frac{\sum (x - \bar{x}_i)^2}{n_i}$

where $\bar{x}_i$ is the group average;
$n_i$ is the number of units in the group.

For example, the within-group variances that need to be determined when studying the influence of workers' qualifications on labor productivity in a workshop show the variation in output in each group caused by all possible factors (technical condition of equipment, availability of tools and materials, age of workers, intensity of labor, etc.), except for differences in qualification category (within a group, all workers have the same qualifications).

The average of the within-group variances reflects random variation, i.e. the part of the variation that occurred under the influence of all factors other than the grouping factor. It is calculated using the formula:

$\overline{\sigma^2} = \frac{\sum \sigma_i^2 n_i}{\sum n_i}$

Between-group variance characterizes the systematic variation of the resulting characteristic, which is due to the influence of the factor characteristic that forms the basis of the grouping. It is equal to the mean square of the deviations of the group means from the overall mean. Between-group variance is calculated using the formula:

$\delta^2 = \frac{\sum (\bar{x}_i - \bar{x})^2 n_i}{\sum n_i}$

The rule for adding variance in statistics

According to the rule of adding variances, the total variance is equal to the sum of the average of the within-group variances and the between-group variance:

$\sigma^2 = \overline{\sigma^2} + \delta^2$

The meaning of this rule is that the total variance that arises under the influence of all factors is equal to the sum of the variances that arise under the influence of all other factors and the variance that arises due to the grouping factor.

Using the formula for adding variances, you can determine the third unknown variance from two known variances, and also judge the strength of the influence of the grouping characteristic.
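The addition rule can be verified on a small data set. In this sketch the two groups and their values are invented for the illustration, and `pvar` is a hypothetical helper for the simple (population) variance.

```python
from math import isclose


def pvar(values):
    """Simple variance: mean squared deviation from the mean."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / len(values)


# Hypothetical data split into two groups by some grouping factor.
groups = {"group_1": [4, 6, 8], "group_2": [10, 12, 14, 16]}

all_values = [x for g in groups.values() for x in g]
count = len(all_values)
overall_mean = sum(all_values) / count

total = pvar(all_values)
# Average of the within-group variances, weighted by group size.
within = sum(pvar(g) * len(g) for g in groups.values()) / count
# Between-group variance: squared deviations of group means from the overall mean.
between = sum((sum(g) / len(g) - overall_mean) ** 2 * len(g)
              for g in groups.values()) / count

print(total, within, between)
```

Knowing any two of the three quantities is enough to recover the third, and the share of the between-group variance in the total indicates how strongly the grouping factor influences the result.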

Dispersion properties

1. If all values ​​of a characteristic are reduced (increased) by the same constant amount, then the dispersion will not change.
2. If all values ​​of a characteristic are reduced (increased) by the same number of times n, then the variance will correspondingly decrease (increase) by n^2 times.

Among the many indicators that are used in statistics, it is necessary to highlight the calculation of variance. It should be noted that performing this calculation manually is a rather tedious task. Fortunately, Excel has functions that allow you to automate the calculation procedure. Let's find out the algorithm for working with these tools.

Dispersion is an indicator of variation, which is the average square of deviations from the mathematical expectation. Thus, it expresses the spread of numbers around the average value. Calculation of variance can be carried out both for the general population and for the sample.

Method 1: calculation based on the population

To calculate this indicator in Excel for the general population, use the function VAR.P. The syntax of this expression is as follows:

=VAR.P(Number1;Number2;…)

In total, from 1 to 255 arguments can be used. The arguments can be either numeric values ​​or references to the cells in which they are contained.

Let's see how to calculate this value for a range with numeric data.


Method 2: calculation by sample

Unlike the calculation for a population, the denominator in the calculation for a sample is not the total number of values but one less. This is done to correct the bias of the estimate. Excel takes this nuance into account in a special function designed for this type of calculation, VAR.S. Its syntax is represented by the following formula:

=VAR.S(Number1;Number2;…)

The number of arguments, as in the previous function, can also range from 1 to 255.


As you can see, the Excel program can greatly facilitate the calculation of variance. This statistic can be calculated by the application, either from the population or from the sample. In this case, all user actions actually come down to specifying the range of numbers to be processed, and Excel does the main work itself. Of course, this will save a significant amount of user time.
