How to find the variance in a series of numbers. Mathematical expectation of a discrete random variable

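For grouped data, the residual variance is the average of the within-group variances:

$$\bar{\sigma}^2 = \frac{\sum_j \sigma_j^2 n_j}{\sum_j n_j}$$

where $\sigma_j^2$ is the within-group variance of the $j$-th group and $n_j$ is the number of units in that group.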

For ungrouped data, the residual variance is a measure of the accuracy of the approximation, i.e. of how closely the regression line fits the original data:
where y(t) is the forecast from the trend equation; y_t is the original time series; n is the number of points; p is the number of coefficients of the regression equation (the number of explanatory variables).
In this case it is called the unbiased estimate of the variance.

Example #1. The distribution of workers of three enterprises of one association by tariff categories is characterized by the following data:

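| Worker's wage category | Number of workers, enterprise 1 | Number of workers, enterprise 2 | Number of workers, enterprise 3 |
|---|---|---|---|
| 1 | 50 | 20 | 40 |
| 2 | 100 | 80 | 60 |
| 3 | 150 | 150 | 200 |
| 4 | 350 | 300 | 400 |
| 5 | 200 | 150 | 250 |
| 6 | 150 | 100 | 150 |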

Determine:
1. the variance for each enterprise (within-group variance);
2. the average of the within-group variances;
3. the intergroup variance;
4. the total variance.

Solution.
Before solving the problem, it is necessary to determine which feature is the resulting one and which is the factor. In this example, the resulting feature is "Tariff category", and the factor feature is "Number (name) of the enterprise".
Then we have three groups (enterprises), for each of which the group mean and the within-group variance must be calculated:


| Enterprise | Group mean | Within-group variance |
|---|---|---|
| 1 | 4 | 1.8 |
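The group means and within-group variances for all three enterprises can be recomputed from the table in the problem statement. The following is a minimal sketch in Python; the helper names `weighted_mean` and `weighted_variance` are illustrative and not part of the original solution. It also checks that the total variance equals the average within-group variance plus the intergroup variance.

```python
# Number of workers per tariff category (1..6), copied from the problem table.
categories = [1, 2, 3, 4, 5, 6]
workers = {
    "enterprise 1": [50, 100, 150, 350, 200, 150],
    "enterprise 2": [20, 80, 150, 300, 150, 100],
    "enterprise 3": [40, 60, 200, 400, 250, 150],
}

def weighted_mean(values, freqs):
    """Arithmetic mean of values weighted by their frequencies."""
    return sum(x * f for x, f in zip(values, freqs)) / sum(freqs)

def weighted_variance(values, freqs):
    """Mean squared deviation from the weighted mean (denominator: sum of freqs)."""
    m = weighted_mean(values, freqs)
    return sum((x - m) ** 2 * f for x, f in zip(values, freqs)) / sum(freqs)

sizes = {k: sum(f) for k, f in workers.items()}
means = {k: weighted_mean(categories, f) for k, f in workers.items()}
variances = {k: weighted_variance(categories, f) for k, f in workers.items()}
total_n = sum(sizes.values())

# Average of the within-group variances, weighted by group size.
avg_within = sum(variances[k] * sizes[k] for k in workers) / total_n

# Intergroup variance: spread of the group means around the overall mean.
overall_mean = sum(means[k] * sizes[k] for k in workers) / total_n
between = sum((means[k] - overall_mean) ** 2 * sizes[k] for k in workers) / total_n

# Total variance computed directly from the pooled frequencies.
pooled = [sum(column) for column in zip(*workers.values())]
total = weighted_variance(categories, pooled)

for k in workers:
    print(k, round(means[k], 3), round(variances[k], 3))
print(round(avg_within, 3), round(between, 3), round(total, 3))
```

For enterprise 1 the sketch reproduces the group mean of 4 and the within-group variance of 1.8 shown in the table.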

The average of the within-group variances (the residual variance) is calculated by the formula:

$$\bar{\sigma}^2 = \frac{\sum_j \sigma_j^2 n_j}{\sum_j n_j}$$

Here it is approximately 1.6, and the intergroup variance is approximately 0.
The total variance is then: s^2 = 1.6 + 0 = 1.6.
The total variance can also be calculated using one of the following two formulas:

When solving practical problems, one often has to deal with a feature that takes only two alternative values. In this case one speaks not of the weight of a particular value of the feature but of its share in the population. If the proportion of population units that have the feature under study is denoted by "p", and the proportion of those that do not by "q", then the variance can be calculated by the formula:
s^2 = p × q
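The formula can be checked against the ordinary definition of variance by coding the feature as 1 (present) and 0 (absent). This is a small Python sketch; the counts (30 units out of 100) are hypothetical and serve only as an illustration.

```python
def alternative_variance(with_feature, total):
    """Variance of a two-valued (alternative) feature: p * q."""
    p = with_feature / total   # share of units that have the feature
    q = 1 - p                  # share of units that do not
    return p * q

# Hypothetical population: 30 units have the feature, 70 do not.
values = [1] * 30 + [0] * 70
mean = sum(values) / len(values)
direct = sum((v - mean) ** 2 for v in values) / len(values)

print(alternative_variance(30, 100), direct)
```

Both numbers agree, because for a 0/1 feature the mean equals p and the mean squared deviation reduces to p(1 − p).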

Example #2. Using the data on the output of the six workers of a brigade, determine the intergroup variance and evaluate the impact of the work shift on their labor productivity, given that the total variance is 12.2.

| Worker No. | Output in the first shift, pcs. | Output in the second shift, pcs. |
|---|---|---|
| 1 | 18 | 13 |
| 2 | 19 | 14 |
| 3 | 22 | 15 |
| 4 | 20 | 17 |
| 5 | 24 | 16 |
| 6 | 23 | 15 |

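Solution. Initial data

| X | f1 | f2 | f3 | f4 | f5 | f6 | Total |
|---|---|---|---|---|---|---|---|
| 1 | 18 | 19 | 22 | 20 | 24 | 23 | 126 |
| 2 | 13 | 14 | 15 | 17 | 16 | 15 | 90 |
| Total | 31 | 33 | 37 | 37 | 40 | 38 | |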

Then we have 6 groups, for each of which the group mean and the within-group variance must be calculated.
1. Find the mean value of each group.







2. Find the mean square of each group.







We summarize the results of the calculation in a table:

| Group number | Group mean | Within-group variance |
|---|---|---|
| 1 | 1.42 | 0.24 |
| 2 | 1.42 | 0.24 |
| 3 | 1.41 | 0.24 |
| 4 | 1.46 | 0.25 |
| 5 | 1.4 | 0.24 |
| 6 | 1.39 | 0.24 |

3. Intragroup variance characterizes the change (variation) of the studied (resulting) trait within the group under the influence of all factors, except for the factor underlying the grouping:
We calculate the average of the intragroup dispersions using the formula:


4. Intergroup variance characterizes the change (variation) of the studied (resulting) trait under the influence of a factor (factorial trait) underlying the grouping.
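Intergroup variance is defined as:

$$\delta^2 = \frac{\sum_j (\bar{x}_j - \bar{x})^2 n_j}{\sum_j n_j}$$

where $\bar{x}_j$ is the mean of the $j$-th group, $n_j$ is the size of that group, and $\bar{x}$ is the overall mean.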


Then

Total variance characterizes the change (variation) of the studied (resulting) trait under the influence of all factors (factorial traits) without exception. By the condition of the problem, it is equal to 12.2.
The empirical correlation ratio measures how much of the total variation of the resulting feature is caused by the factor under study. It is the square root of the ratio of the factor (intergroup) variance to the total variance:

$$\eta = \sqrt{\frac{\delta^2}{\sigma^2}}$$

We determine the empirical correlation relation:

Relationships between features can be weak or strong (close). They are assessed on the Chaddock scale:
0.1 < η < 0.3: weak;
0.3 < η < 0.5: moderate;
0.5 < η < 0.7: noticeable;
0.7 < η < 0.9: high;
0.9 < η < 1: very high.
In our example, the relationship between the feature Y and the factor X is weak.
Coefficient of determination.

Let's determine the coefficient of determination:

Thus, 0.67% of the variation is due to differences between the groups, and the remainder (about 99.3%) is due to other factors.
Conclusion: in this case the workers' output does not depend on the particular shift worked, i.e. the influence of the work shift on their labor productivity is insignificant, and the variation is due to other factors.

Example #3. Based on the data on the average wage and the squared deviations from its value for two groups of workers, find the total variance by applying the variance addition rule:

Solution:
Average of within-group variances

Intergroup dispersion is defined as:


The total variance will be: 480 + 13824 = 14304
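The variance addition rule and the measures of closeness described earlier can be applied to these figures. The sketch below is in Python and assumes that 480 is the average of the within-group variances and 13824 is the intergroup variance, which is the order in which the solution lists them. The function `chaddock` and its label for values below 0.1 are illustrative choices, not part of the original text.

```python
import math

def chaddock(eta):
    """Verbal label for a correlation ratio on the Chaddock scale."""
    if eta < 0.1:
        return "below the scale"  # label chosen here; the scale starts at 0.1
    if eta < 0.3:
        return "weak"
    if eta < 0.5:
        return "moderate"
    if eta < 0.7:
        return "noticeable"
    if eta < 0.9:
        return "high"
    return "very high"

avg_within = 480      # average of the within-group variances (assumed role)
between = 13824       # intergroup variance (assumed role)

total = avg_within + between          # variance addition rule
eta_squared = between / total         # coefficient of determination
eta = math.sqrt(eta_squared)          # empirical correlation ratio

print(total, round(eta_squared, 4), round(eta, 4), chaddock(eta))
```

Under these assumptions almost all of the variation is explained by the grouping factor.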

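Variance in statistics is the mean of the squared deviations of the individual values of the feature from the arithmetic mean. Depending on the initial data, it is determined by the simple or the weighted variance formula:

1. Simple variance (for ungrouped data) is calculated by the formula:

$$\sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n}$$

2. Weighted variance (for a variation series):

$$\sigma^2 = \frac{\sum (x_i - \bar{x})^2 n_i}{\sum n_i}$$

where n is the number of observations and $n_i$ is the frequency (the number of times the value $x_i$ occurs)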
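The two formulas give the same result when the variation series is simply the ungrouped data with repeated values collapsed into frequencies. A minimal Python sketch with hypothetical data:

```python
def simple_variance(xs):
    """Simple variance of ungrouped data (denominator n)."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def weighted_variance(values, freqs):
    """Weighted variance of a variation series (values with frequencies)."""
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(values, freqs)) / n
    return sum((x - mean) ** 2 * f for x, f in zip(values, freqs)) / n

# Hypothetical data: the value 4 occurs twice.
ungrouped = [2, 4, 4, 5]
values, freqs = [2, 4, 5], [1, 2, 1]

print(simple_variance(ungrouped), weighted_variance(values, freqs))
```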

An example of finding the variance

This page describes a standard example of finding the variance.

Example 1. We have the following data for a group of 20 correspondence students. It is necessary to build an interval series of the feature distribution, calculate the mean value of the feature, and study its variance.

Let's build an interval grouping. We determine the width of the interval by the formula:

$$h = \frac{X_{\max} - X_{\min}}{n}$$

where X max is the maximum value of the grouping feature;
X min is the minimum value of the grouping feature;
n is the number of intervals.

We take n = 5. The step is: h = (192 - 159) / 5 = 6.6

Let's make an interval grouping

For further calculations, we will build an auxiliary table:

X'i is the middle of the interval (for example, the middle of the interval from 159 to 165.6 is 162.3).
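The interval boundaries and midpoints follow directly from the minimum, the maximum, and the number of intervals given above. A short Python sketch (the variable names are illustrative):

```python
x_min, x_max, n_intervals = 159, 192, 5

# Width of one interval.
h = (x_max - x_min) / n_intervals

# Lower and upper boundary of each interval, then its midpoint.
bounds = [(x_min + k * h, x_min + (k + 1) * h) for k in range(n_intervals)]
midpoints = [(low + high) / 2 for low, high in bounds]

for (low, high), mid in zip(bounds, midpoints):
    print(f"{low:.1f} - {high:.1f}: midpoint {mid:.1f}")
```

The first interval runs from 159 to 165.6 with midpoint 162.3, matching the example in the text.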

The average height of the students is determined by the weighted arithmetic mean formula:

$$\bar{x} = \frac{\sum x'_i n_i}{\sum n_i}$$

We determine the variance by the formula:

$$\sigma^2 = \frac{\sum (x'_i - \bar{x})^2 n_i}{\sum n_i}$$

The variance formula can be transformed as follows:

$$\sigma^2 = \overline{x^2} - \bar{x}^2$$

From this formula it follows that the variance is the difference between the mean of the squares of the values and the square of the mean.

For variation series with equal intervals, the variance can be calculated by the method of moments, using the second property of the variance (dividing all values by the width of the interval). Calculating the variance by the method of moments with the following formula is less laborious:

$$\sigma^2 = i^2 \left(m_2 - m_1^2\right)$$

where i is the width of the interval;
A is the conditional zero, for which it is convenient to take the middle of the interval with the highest frequency;
$m_1^2$ is the square of the first-order moment, $m_1 = \frac{\sum \frac{x - A}{i}\, n}{\sum n}$;
$m_2$ is the second-order moment, $m_2 = \frac{\sum \left(\frac{x - A}{i}\right)^2 n}{\sum n}$.
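The method of moments can be compared with the direct weighted formula on a small interval series. The Python sketch below uses the interval midpoints from the example (width 6.6), but the frequencies are hypothetical, because the original grouping table is not available here.

```python
def variance_by_moments(midpoints, freqs, width, a):
    """Variance via conditional moments: width^2 * (m2 - m1^2)."""
    n = sum(freqs)
    u = [(x - a) / width for x in midpoints]          # coded values
    m1 = sum(ui * f for ui, f in zip(u, freqs)) / n   # first-order moment
    m2 = sum(ui ** 2 * f for ui, f in zip(u, freqs)) / n  # second-order moment
    return width ** 2 * (m2 - m1 ** 2)

def weighted_variance(values, freqs):
    """Direct weighted variance for comparison."""
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(values, freqs)) / n
    return sum((x - mean) ** 2 * f for x, f in zip(values, freqs)) / n

midpoints = [162.3, 168.9, 175.5, 182.1, 188.7]
freqs = [2, 5, 8, 4, 1]   # hypothetical frequencies, 20 students in total
a = 175.5                 # conditional zero: midpoint with the highest frequency

by_moments = variance_by_moments(midpoints, freqs, 6.6, a)
direct = weighted_variance(midpoints, freqs)
print(round(by_moments, 4), round(direct, 4))
```

The two results coincide up to floating-point rounding; the coded values are small integers, which is what makes the manual calculation less laborious.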

The variance of an alternative feature (if in the statistical population the feature changes in such a way that there are only two mutually exclusive options, such variability is called alternative) can be calculated by the formula:

$$\sigma^2 = p \cdot q$$

Substituting q = 1 - p into this variance formula, we get:

$$\sigma^2 = p\,(1 - p)$$

Types of dispersion

The total variance measures the variation of a feature over the entire population under the influence of all the factors that cause this variation. It is equal to the mean square of the deviations of the individual values of the feature x from the overall mean $\bar{x}$ and can be calculated as a simple or a weighted variance.

The within-group variance characterizes random variation, i.e. the part of the variation that is due to unaccounted-for factors and does not depend on the factor feature underlying the grouping. It is equal to the mean square of the deviations of the individual values of the feature within a group from the arithmetic mean of that group and can be calculated as a simple or a weighted variance.

Thus, the within-group variance measures the variation of a feature within a group and is determined by the formula:

$$\sigma_i^2 = \frac{\sum (x - \bar{x}_i)^2}{n_i}$$

where $\bar{x}_i$ is the group mean;
$n_i$ is the number of units in the group.

For example, the within-group variances that need to be determined when studying the influence of workers' qualifications on labor productivity in a workshop show the variation in output in each group caused by all possible factors (technical condition of equipment, availability of tools and materials, age of workers, labor intensity, etc.) except for differences in the qualification category (within a group, all workers have the same qualification).

The average of the within-group variances reflects the random variation, i.e. the part of the variation that occurred under the influence of all factors other than the grouping factor. It is calculated by the formula:

$$\bar{\sigma}^2 = \frac{\sum_i \sigma_i^2 n_i}{\sum_i n_i}$$

The intergroup variance characterizes the systematic variation of the resulting feature, which is due to the influence of the factor feature underlying the grouping. It is equal to the mean square of the deviations of the group means from the overall mean. The intergroup variance is calculated by the formula:

$$\delta^2 = \frac{\sum_i (\bar{x}_i - \bar{x})^2 n_i}{\sum_i n_i}$$

Variance addition rule in statistics

According to the variance addition rule, the total variance is equal to the sum of the average of the within-group variances and the intergroup variance:

$$\sigma^2 = \bar{\sigma}^2 + \delta^2$$

The meaning of this rule is that the total variance, which arises under the influence of all factors, is equal to the sum of the variance that arises under the influence of all other factors and the variance that arises due to the grouping factor.

Using the variance addition formula, the third, unknown variance can be determined from two known ones, and the strength of the influence of the grouping feature can be judged.

Dispersion Properties

1. If all the values of the feature are reduced (increased) by the same constant, the variance does not change.
2. If all the values of the feature are divided (multiplied) by the same factor n, the variance decreases (increases) by a factor of n^2.
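Both properties are easy to confirm numerically. A minimal Python sketch with hypothetical data, using `statistics.pvariance` from the standard library (population variance, denominator n):

```python
from statistics import pvariance

data = [3, 7, 8, 12]                 # hypothetical values
shifted = [x + 10 for x in data]     # every value increased by a constant
scaled = [x * 3 for x in data]       # every value multiplied by n = 3

base = pvariance(data)
print(base, pvariance(shifted), pvariance(scaled))
```

The shifted data have the same variance as the original, and the scaled data have a variance 3^2 = 9 times larger.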

Among the many indicators that are used in statistics, it is necessary to highlight the calculation of variance. It should be noted that manually performing this calculation is a rather tedious task. Fortunately, there are functions in Excel that allow you to automate the calculation procedure. Let's find out the algorithm for working with these tools.

Variance is a measure of variation: the mean of the squared deviations from the mathematical expectation. Thus, it expresses the spread of the numbers about the mean. The variance can be calculated both for the general population and for a sample.

Method 1: calculation on the general population

To calculate this indicator in Excel for the general population, the VAR.P function is used (ДИСП.Г in the Russian-language version). The syntax of this expression is as follows:

VAR.P(Number1;Number2;…)

In total, from 1 to 255 arguments can be used. Arguments can be either numeric values or references to the cells that contain them.

Let's see how to calculate this value for a range of numeric data.


Method 2: sample calculation

In contrast to the calculation for the general population, in the calculation for a sample the denominator is not the total number of values but one less. This corrects the bias of the estimate. Excel accounts for this in a separate function designed for this type of calculation, VAR.S (ДИСП.В in the Russian-language version). Its syntax is as follows:

VAR.S(Number1;Number2;…)

The number of arguments, as in the previous function, can also range from 1 to 255.
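For readers working outside Excel, the same distinction exists in Python's standard library: `statistics.pvariance` divides by n (population) and `statistics.variance` divides by n − 1 (sample). The sketch below uses hypothetical data and is only an analogy to the two Excel functions, not a description of how Excel computes them.

```python
from statistics import pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values
n = len(data)

population_var = pvariance(data)   # denominator n
sample_var = variance(data)        # denominator n - 1

print(population_var, sample_var)
# The two differ only by the factor n / (n - 1).
print(sample_var * (n - 1) / n)
```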


As you can see, the Excel program is able to greatly facilitate the calculation of the variance. This statistic can be calculated by the application for both the population and the sample. In this case, all user actions are actually reduced only to specifying the range of numbers to be processed, and Excel does the main work itself. Of course, this will save a significant amount of time for users.

Let's calculate the variance and standard deviation of a sample in MS EXCEL. We also calculate the variance of a random variable whose distribution is known.

First we consider the variance, then the standard deviation.

Sample variance

The sample variance characterizes the spread of values in the array relative to the mean.

All 3 formulas are mathematically equivalent.

It can be seen from the first formula that the sample variance is the sum of the squared deviations of each value in the array from the mean, divided by the sample size minus 1:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

To calculate the sample variance, the VAR() function is used (ДИСП() in the Russian-language version of MS EXCEL). Since MS EXCEL 2010, it is recommended to use its analogue VAR.S() (ДИСП.В()), i.e. Sample VARiance. In addition, starting from MS EXCEL 2010 there is a VAR.P() function (ДИСП.Г()), i.e. Population VARiance, which calculates the variance of a population. The whole difference comes down to the denominator: instead of n-1 as in VAR.S(), VAR.P() has just n in the denominator. Prior to MS EXCEL 2010, the VARP() function was used to calculate the population variance.

Sample variance
=DEVSQ(Sample)/(COUNT(Sample)-1)
=(SUMSQ(Sample)-COUNT(Sample)*AVERAGE(Sample)^2)/(COUNT(Sample)-1) (the usual formula)
=SUM((Sample-AVERAGE(Sample))^2)/(COUNT(Sample)-1) (an array formula)
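The equivalence of these expressions can be checked outside the spreadsheet. The Python sketch below mirrors the sum-of-squared-deviations form and the sum-of-squares form, and compares both with `statistics.variance`; the sample values are hypothetical.

```python
from statistics import variance

sample = [4.0, 7.0, 13.0, 16.0]   # hypothetical sample
n = len(sample)
mean = sum(sample) / n

# Analogue of DEVSQ(Sample)/(COUNT(Sample)-1)
v_devsq = sum((x - mean) ** 2 for x in sample) / (n - 1)

# Analogue of (SUMSQ(Sample)-COUNT(Sample)*AVERAGE(Sample)^2)/(COUNT(Sample)-1)
v_sumsq = (sum(x * x for x in sample) - n * mean ** 2) / (n - 1)

# Library implementation of the sample variance
v_library = variance(sample)

print(v_devsq, v_sumsq, v_library)
```

The sum-of-squares form is convenient by hand but can lose precision in floating point when the values are large relative to their spread.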

The sample variance is equal to 0 only if all values are equal to each other and, accordingly, equal to the mean. Usually, the larger the variance, the greater the spread of values in the array.

The sample variance is a point estimate of the variance of the distribution of the random variable from which the sample was drawn. The construction of confidence intervals for the variance is covered in a separate article.

Variance of a random variable

To calculate the variance of a random variable, you need to know its distribution.

The variance of a random variable X is often denoted Var(X). The variance is equal to the expected value of the squared deviation from the mean E(X): Var(X) = E[(X-E(X))^2]

If the random variable has a discrete distribution, the variance is calculated by the formula:

$$\mathrm{Var}(X) = \sum_i (x_i - \mu)^2\, p(x_i)$$

where x_i is a value that the random variable can take, μ is the mean (the expected value of the random variable), and p(x) is the probability that the random variable takes the value x.
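As an illustration of the discrete formula, the sketch below computes the variance of a fair six-sided die in Python. The die is an example chosen here, not taken from the original text; `fractions.Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction

def discrete_variance(distribution):
    """Variance of a discrete random variable given as {value: probability}."""
    mu = sum(x * p for x, p in distribution.items())
    return sum((x - mu) ** 2 * p for x, p in distribution.items())

# Fair die: each face 1..6 has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

mu = sum(x * p for x, p in die.items())
var = discrete_variance(die)
mean_of_squares = sum(x ** 2 * p for x, p in die.items())

print(mu, var, mean_of_squares - mu ** 2)
```

The last printed value uses the identity Var(X) = E(X^2) − (E(X))^2 and matches the direct calculation.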

If the random variable has a continuous distribution, the variance is calculated by the formula:

$$\mathrm{Var}(X) = \int (x - \mu)^2\, p(x)\, dx$$

where p(x) is the probability density.

The dimension of the variance corresponds to the square of the unit of measurement of the original values. For example, if the values in the sample are measurements of the weight of a part (in kg), the dimension of the variance is kg^2. This can be difficult to interpret, so to characterize the spread of values a quantity equal to the square root of the variance is used: the standard deviation.

Some properties of the variance:

Var(X+a) = Var(X), where X is a random variable and a is a constant.

Var(aX) = a^2 Var(X)

Var(X) = E[(X-E(X))^2] = E[X^2 - 2*X*E(X) + (E(X))^2] = E(X^2) - E(2*X*E(X)) + (E(X))^2 = E(X^2) - 2*E(X)*E(X) + (E(X))^2 = E(X^2) - (E(X))^2

This property of the variance is used in the article about linear regression.

Var(X+Y) = Var(X) + Var(Y) + 2*Cov(X;Y), where X and Y are random variables and Cov(X;Y) is the covariance of these random variables.

If the random variables are independent, their covariance is 0, and hence Var(X+Y) = Var(X) + Var(Y). This property of the variance is used in derivations.

Let us show that for independent random variables Var(X-Y) = Var(X+Y). Indeed, Var(X-Y) = Var(X+(-Y)) = Var(X) + Var(-Y) = Var(X) + (-1)^2 Var(Y) = Var(X) + Var(Y) = Var(X+Y).
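The sum and difference identities also hold exactly for paired data when the variance and covariance are computed with the same denominator n. The Python sketch below checks this on hypothetical paired values; `pvar` and `pcov` are illustrative helper names.

```python
def mean(values):
    return sum(values) / len(values)

def pvar(values):
    """Population variance (denominator n)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def pcov(a, b):
    """Population covariance of paired values (denominator n)."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

x = [1, 2, 4, 7]   # hypothetical paired observations
y = [3, 1, 5, 2]

sums = [a + b for a, b in zip(x, y)]
diffs = [a - b for a, b in zip(x, y)]

print(pvar(sums), pvar(x) + pvar(y) + 2 * pcov(x, y))
print(pvar(diffs), pvar(x) + pvar(y) - 2 * pcov(x, y))
```

When the covariance is zero the two right-hand sides coincide, which is the statement Var(X-Y) = Var(X+Y) for independent variables.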

Sample standard deviation

The sample standard deviation is a measure of how widely the values in the sample are scattered relative to their mean.

By definition, the standard deviation equals the square root of the variance:

$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

The standard deviation does not take into account the magnitude of the values in the sample, only the degree to which the values are scattered around their mean. Let's take an example to illustrate this.

Let's calculate the standard deviation for 2 samples: (1; 5; 9) and (1001; 1005; 1009). In both cases, s=4. Obviously, the ratio of the standard deviation to the values of the array differs significantly between the samples. For such cases the coefficient of variation (CV) is used: the ratio of the standard deviation to the arithmetic mean, expressed as a percentage.
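The two samples from the text can be compared in a few lines of Python using `statistics.stdev` (sample standard deviation, denominator n − 1). The dictionary name `results` is illustrative.

```python
from statistics import mean, stdev

samples = {"small": [1, 5, 9], "large": [1001, 1005, 1009]}

results = {}
for name, values in samples.items():
    s = stdev(values)                 # sample standard deviation
    cv = s / mean(values) * 100       # coefficient of variation, in percent
    results[name] = (s, cv)
    print(name, s, round(cv, 3))
```

Both samples have s = 4, but the coefficient of variation is 80% for the first sample and under 0.4% for the second.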

In MS EXCEL 2007 and earlier versions, the sample standard deviation is calculated with the function =STDEV() (СТАНДОТКЛОН() in the Russian-language version), i.e. STandard DEViation. Since MS EXCEL 2010, it is recommended to use its analogue =STDEV.S() (СТАНДОТКЛОН.В()), i.e. Sample STandard DEViation.

In addition, starting from MS EXCEL 2010 there is a function STDEV.P() (СТАНДОТКЛОН.Г()), i.e. Population STandard DEViation, which calculates the standard deviation of a population. The whole difference comes down to the denominator: instead of n-1 as in STDEV.S(), STDEV.P() has just n in the denominator.

The standard deviation can also be calculated directly from the formulas below (see example file):
=SQRT(DEVSQ(Sample)/(COUNT(Sample)-1))
=SQRT((SUMSQ(Sample)-COUNT(Sample)*AVERAGE(Sample)^2)/(COUNT(Sample)-1))

Other dispersion measures

The DEVSQ() function (КВАДРОТКЛ() in the Russian-language version) calculates the sum of squared deviations of values from their mean. This function returns the same result as the formula =VAR.P(Sample)*COUNT(Sample), where Sample is a reference to a range containing an array of sample values. Calculations in the DEVSQ() function are made according to the formula:

$$\sum_{i=1}^{n}(x_i - \bar{x})^2$$

The AVEDEV() function (СРОТКЛ() in the Russian-language version) is also a measure of the scatter of a data set. It calculates the average of the absolute deviations of values from the mean. This function returns the same result as the formula =SUMPRODUCT(ABS(Sample-AVERAGE(Sample)))/COUNT(Sample), where Sample is a reference to a range containing an array of sample values.

Calculations in the AVEDEV() function are made according to the formula:

$$\frac{1}{n}\sum_{i=1}^{n}\left|x_i - \bar{x}\right|$$
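Both measures are simple to reproduce in Python. In the sketch below, `devsq` and `avedev` are hypothetical helper functions named after the Excel functions they imitate; the sample (1; 5; 9) is reused from the standard deviation example.

```python
from statistics import pvariance

def devsq(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def avedev(values):
    """Average of the absolute deviations from the mean."""
    m = sum(values) / len(values)
    return sum(abs(v - m) for v in values) / len(values)

sample = [1, 5, 9]

print(devsq(sample), pvariance(sample) * len(sample))
print(avedev(sample))
```

The first line prints the same number twice, mirroring the equality with the population variance multiplied by the count.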

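Conversely, if $f$ is a function that is non-negative almost everywhere and such that $\int_{\mathbb{R}^n} f(x)\,dx = 1$, then there is an absolutely continuous probability measure $\mathbb{P}$ on $\mathbb{R}^n$ such that $f$ is its density.

    Change of measure in the Lebesgue integral:

$$\int_{\mathbb{R}^n} \varphi(x)\,\mathbb{P}(dx) = \int_{\mathbb{R}^n} \varphi(x)\, f(x)\,dx,$$

where $\varphi$ is any Borel function integrable with respect to the probability measure $\mathbb{P}$.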
