Mean square. Calculating Standard Deviation in Microsoft Excel

The standard deviation is one of the main tools of statistical analysis. It measures how widely the values are spread around their mean, and it can be calculated either for a sample or for the whole population. Let's learn how to use the standard deviation formula in Excel.

Let's first define what the standard deviation is. It is the square root of the arithmetic mean of the squared differences between each value in the series and their arithmetic mean. The same indicator is also called the root-mean-square deviation; the two names are completely equivalent.

In Excel, of course, the user does not have to calculate this by hand, since the program does all the work. Let's learn how to calculate the standard deviation in Excel.

Calculation in Excel

You can calculate this value in Excel using two dedicated functions: STDEV.S (for a sample) and STDEV.P (for the whole population). They work on the same principle and differ only in the divisor: the sum of squared deviations is divided by the number of values minus one for a sample and by the number of values for a population. Both can be called in three ways, which we will discuss below.
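To cross-check Excel's results outside the spreadsheet, a minimal Python sketch may help. It assumes that the standard `statistics` module is an acceptable stand-in: `stdev` divides by the number of values minus one, like Excel's STDEV.S, and `pstdev` divides by the number of values, like Excel's STDEV.P. The data values are hypothetical.

```python
import statistics

# Hypothetical data, e.g. the contents of five worksheet cells
values = [2, 3, 4, 7, 9]

# Sample standard deviation (divisor: count - 1), counterpart of STDEV.S
sample_sd = statistics.stdev(values)

# Population standard deviation (divisor: count), counterpart of STDEV.P
population_sd = statistics.pstdev(values)

print(f"sample: {sample_sd:.4f}, population: {population_sd:.4f}")
```

The sample value is always slightly larger than the population value for the same data.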

Method 1: Function Wizard


Method 2: Formulas tab


Method 3: Entering the formula manually

There is also a way that does not require opening the arguments window at all: enter the formula manually, for example =STDEV.S(A1:A10) or =STDEV.P(A1:A10), where A1:A10 stands for the range that holds your data.


As you can see, calculating the standard deviation in Excel is very simple. The user only needs to enter the numbers themselves or references to the cells that contain them. The program performs all the calculations. It is much harder to understand what the calculated indicator means and how the result can be applied in practice, but that belongs to statistics rather than to learning how to work with software.

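The simple geometric mean is calculated by the formula:
$$\bar{x}_g = \sqrt[N]{x_1 \cdot x_2 \cdots x_N}$$
where $N$ is the number of values.

Weighted geometric mean

The weighted geometric mean is calculated by the formula:
$$\bar{x}_g = \sqrt[\sum n_i]{\prod x_i^{n_i}}$$
where $n_i$ is the frequency (weight) of the value $x_i$.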
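As a rough sketch, assuming the standard textbook definitions of the simple and weighted geometric mean, both can be computed in Python as follows. The function names and the sample growth factors are hypothetical.

```python
import math

def geometric_mean(values):
    """Simple geometric mean: the n-th root of the product of n positive values."""
    return math.prod(values) ** (1 / len(values))

def weighted_geometric_mean(values, weights):
    """Weighted geometric mean, computed through logarithms:
    exp(sum(w * ln x) / sum(w))."""
    total_weight = sum(weights)
    log_sum = sum(w * math.log(x) for x, w in zip(values, weights))
    return math.exp(log_sum / total_weight)

# Hypothetical growth factors for three periods
growth = [1.10, 1.20, 1.05]
print(geometric_mean(growth))
# With equal weights the weighted form reduces to the simple one
print(weighted_geometric_mean(growth, [1, 1, 1]))
```

Working through logarithms in the weighted version avoids raising large products to fractional powers.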

The root mean square is used to determine the average diameters of wheels and pipes and the average sides of squares.

Root-mean-square values are also used to calculate some indicators, such as the coefficient of variation, which characterizes the rhythm of output. Here, the standard deviation from the planned output for a certain period is determined by the following formula:

These values accurately characterize the change in economic indicators relative to their base value, taken as an average.

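Simple root mean square

The simple root mean square is calculated by the formula:
$$\bar{x}_q = \sqrt{\frac{\sum x_i^2}{N}}$$
where $N$ is the number of values.

Weighted root mean square

The weighted root mean square is:
$$\bar{x}_q = \sqrt{\frac{\sum x_i^2 n_i}{\sum n_i}}$$
where $n_i$ is the frequency of the value $x_i$.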
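A minimal Python sketch of both forms, assuming the usual definitions (square root of the mean of the squares, with frequencies as weights in the grouped case). The function names and the sample data are hypothetical.

```python
import math

def root_mean_square(values):
    """Simple root mean square: sqrt(sum(x^2) / count)."""
    return math.sqrt(sum(x * x for x in values) / len(values))

def weighted_root_mean_square(values, frequencies):
    """Weighted root mean square: sqrt(sum(x^2 * f) / sum(f))."""
    total = sum(frequencies)
    return math.sqrt(sum(f * x * x for x, f in zip(values, frequencies)) / total)

# Hypothetical sides of three square plots, in metres
sides = [3, 4, 5]
average_side = root_mean_square(sides)

# Replacing every side with the average keeps the total area unchanged
print(len(sides) * average_side ** 2, sum(s * s for s in sides))
```

The check at the end illustrates why this mean suits square units: three plots with the average side cover the same total area as the original three.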

Absolute measures of variation include:

- range of variation

- mean linear deviation

- variance (dispersion)

- standard deviation

Range of variation (R)

The range of variation is the difference between the maximum and minimum values of the attribute: $R = x_{max} - x_{min}$.

It shows the limits within which the value of the attribute varies in the studied population.

Example. The work experience of five applicants in their previous job is 2, 3, 4, 7 and 9 years. Solution: range of variation = 9 - 2 = 7 years.

To summarize the differences among the values of the attribute, average measures of variation are calculated from the deviations of the individual values from the arithmetic mean.

The deviations from the mean always sum to zero (the zero property of the mean). To keep this sum from vanishing, one either ignores the signs of the deviations, that is, takes their absolute values, or squares the deviations.

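Mean linear deviation and standard deviation

The mean linear deviation is the arithmetic mean of the absolute deviations of the individual values of the attribute from the mean.

The simple mean linear deviation:
$$\bar{d} = \frac{\sum |x_i - \bar{x}|}{N}$$
where $N$ is the number of values.

Example. The work experience of five applicants in their previous job is 2, 3, 4, 7 and 9 years.

In our example: $\bar{x} = (2+3+4+7+9)/5 = 5$ years; $\bar{d} = (|2-5| + |3-5| + |4-5| + |7-5| + |9-5|)/5 = 12/5 = 2.4$ years;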

Answer: 2.4 years.
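The same answer can be checked with a few lines of Python. This is only a sketch of the definition given in the text; the function name is hypothetical.

```python
def mean_linear_deviation(values):
    """Arithmetic mean of the absolute deviations from the mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)

# Work experience of the five applicants, in years
experience = [2, 3, 4, 7, 9]
print(mean_linear_deviation(experience))  # 2.4
```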

The weighted mean linear deviation applies to grouped data:
$$\bar{d} = \frac{\sum |x_i - \bar{x}| \, n_i}{\sum n_i}$$
where $n_i$ is the frequency of the value $x_i$.

Because it relies on the convention of discarding the signs of the deviations, the mean linear deviation is used relatively rarely in practice (for example, to characterize how evenly contractual deliveries are fulfilled, or in product quality analysis that takes the technological features of production into account).

Standard deviation

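The most refined measure of variation is the root-mean-square deviation, also called the standard deviation. The standard deviation ($\sigma$) is equal to the square root of the mean of the squared deviations of the individual values of the attribute from the arithmetic mean.

The simple standard deviation:
$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{N}}$$
where $N$ is the number of values.

The weighted standard deviation is applied to grouped data:
$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2 n_i}{\sum n_i}}$$
where $n_i$ is the frequency of the value $x_i$.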
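For grouped data, the weighted calculation can be sketched in Python as below. It assumes that each distinct value comes with a frequency and that the population form (division by the total frequency) is intended. The function name and the data are hypothetical.

```python
import math

def weighted_std(values, frequencies):
    """Standard deviation of grouped data: each value counts `frequency` times."""
    total = sum(frequencies)
    mean = sum(x * f for x, f in zip(values, frequencies)) / total
    variance = sum(f * (x - mean) ** 2 for x, f in zip(values, frequencies)) / total
    return math.sqrt(variance)

# Hypothetical grouped data: the value 4 occurs twice, 2 and 6 once each
values = [2, 4, 6]
frequencies = [1, 2, 1]
print(weighted_std(values, frequencies))
```

The result equals the simple standard deviation of the expanded series 2, 4, 4, 6.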

Under a normal distribution, the standard deviation and the mean linear deviation are related as follows: $\sigma \approx 1.25\,\bar{d}$.

The standard deviation, being the main absolute measure of variation, is used in determining the ordinates of the normal distribution curve, in calculations related to organizing sample observation and establishing the accuracy of sample characteristics, and in assessing the limits of variation of an attribute in a homogeneous population.

Instruction

Suppose there are several numbers that characterize homogeneous quantities, for example the results of measurements, weighings or statistical observations. All the quantities must be expressed in the same unit of measurement. To find the standard deviation, do the following.

Determine the arithmetic mean of all the numbers: add up all the numbers and divide the sum by their count.

Find the deviation of each number from the mean: subtract the mean from each number.

Determine the variance (scatter) of the numbers: add up the squares of the deviations found earlier and divide the resulting sum by the count of the numbers.

Take the square root of the variance. The result is the standard deviation.

Example. There are seven patients in the ward with temperatures of 34, 35, 36, 37, 38, 39 and 40 degrees Celsius.

It is required to determine the standard deviation of the temperature from the mean.
Solution:
Average temperature "in the ward": (34+35+36+37+38+39+40)/7 = 37 ºC;

Deviations of the temperatures from the average (in this case, the normal value): 34-37, 35-37, 36-37, 37-37, 38-37, 39-37, 40-37, which gives -3, -2, -1, 0, 1, 2, 3 (ºC);
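The remaining steps of this example can be sketched in Python. The sketch assumes the population form of the variance, that is, the sum of squared deviations divided by the number of values.

```python
import math

temperatures = [34, 35, 36, 37, 38, 39, 40]

# Step 1: arithmetic mean
mean = sum(temperatures) / len(temperatures)            # 37.0

# Step 2: deviations from the mean
deviations = [t - mean for t in temperatures]           # -3 ... 3

# Step 3: variance = mean of the squared deviations (28 / 7)
variance = sum(d ** 2 for d in deviations) / len(temperatures)

# Step 4: standard deviation = square root of the variance
std_dev = math.sqrt(variance)

print(mean, variance, std_dev)  # 37.0 4.0 2.0
```

So the temperatures in the ward deviate from the average by 2 ºC in the root-mean-square sense.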

Divide the sum of the numbers obtained earlier by their count. For accuracy, it is better to use a calculator. The result of the division is the arithmetic mean of the values.

Pay close attention to every stage of the calculation, since an error in even one step leads to an incorrect final result. Check your calculations at each stage. The arithmetic mean has the same unit as the values being averaged: if you determine average attendance, the result is expressed in persons.

This method of calculation is used in mathematical and statistical work; in other fields, such as computer science, the mean may be computed by a different algorithm. The arithmetic mean is a very rough indicator: it summarizes a single factor or attribute. For an in-depth analysis, many factors must be taken into account, and more general measures are calculated for this purpose.

The arithmetic mean is one of the measures of central tendency and is widely used in mathematics and statistical calculations. Finding the arithmetic mean of several values is very simple, but each task has its own nuances, which you need to know in order to calculate correctly.


How to find the arithmetic mean

To find the arithmetic mean of an array of numbers, start by determining the algebraic sum of its values. For example, if the array contains the numbers 23, 43, 10, 74 and 34, their algebraic sum is 184. In writing, the arithmetic mean is denoted by the letter μ (mu) or x̄ (x with a bar). Next, divide the algebraic sum by the number of values in the array. In this example there are five numbers, so the arithmetic mean is 184/5 = 36.8.
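The two steps from this example look like this in Python. This is a minimal sketch; the last line uses the standard `statistics.mean` function as a cross-check.

```python
import statistics

numbers = [23, 43, 10, 74, 34]

total = sum(numbers)           # algebraic sum: 184
mean = total / len(numbers)    # 184 / 5 = 36.8

print(total, mean)
print(statistics.mean(numbers))
```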

Features of working with negative numbers

If the array contains negative numbers, the arithmetic mean is found by the same algorithm. The procedure differs only when the calculation is done in a programming environment or when the task sets additional conditions. In these cases, finding the arithmetic mean of numbers with different signs comes down to three steps:

1. Finding the overall arithmetic mean by the standard method;
2. Finding the arithmetic mean of the negative numbers;
3. Finding the arithmetic mean of the positive numbers.

The result of each step is written down, separated by commas.
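The three steps can be sketched as one small Python function. The name `three_means` and the sample data are hypothetical, and the sketch assumes that zeros count toward neither the negative nor the positive group.

```python
def three_means(numbers):
    """Return (overall mean, mean of negatives, mean of positives).

    A group with no members yields None instead of a mean.
    """
    def mean(xs):
        return sum(xs) / len(xs) if xs else None

    negatives = [x for x in numbers if x < 0]
    positives = [x for x in numbers if x > 0]
    return mean(numbers), mean(negatives), mean(positives)

overall, negative_mean, positive_mean = three_means([-4, -2, 3, 5, 8])
print(overall, negative_mean, positive_mean, sep=", ")
```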

Common and decimal fractions

If the array consists of decimal fractions, the arithmetic mean is calculated in the same way as for integers, and the result is rounded to the precision that the problem requires.

When working with common fractions, first reduce them to a common denominator. The numerator of the answer is the sum of the numerators of the converted fractions, and its denominator is the common denominator multiplied by the number of values in the array.
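As an illustration with hypothetical values, Python's standard `fractions` module keeps such calculations exact, so the manual rule can be checked against it.

```python
from fractions import Fraction

# Hypothetical array of common fractions: 1/2, 1/3 and 1/6
values = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]

# Exact arithmetic mean
mean = sum(values) / len(values)

# Manual rule: common denominator 6 gives numerators 3, 2 and 1;
# the denominator of the answer is 6 times the number of values
manual = Fraction(3 + 2 + 1, 6 * len(values))

print(mean, manual)  # 1/3 1/3
```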

The root mean square is used when, in replacing the individual values of an attribute with an average value, the sum of the squares of the original values must remain unchanged.

Its main area of use is measuring the degree of fluctuation of the individual values of an attribute around the arithmetic mean (the standard deviation). In addition, the root mean square is used when the average value of an attribute expressed in square or cubic units has to be calculated (for example, the average size of square plots or the average diameters of pipes and trunks).

The root mean square is calculated in two forms:

- simple:
$$\bar{x}_q = \sqrt{\frac{\sum x_i^2}{N}}$$

- weighted:
$$\bar{x}_q = \sqrt{\frac{\sum x_i^2 n_i}{\sum n_i}} \qquad (4.22)$$

where $N$ is the number of values and $n_i$ is the frequency of the value $x_i$.

All power means differ from one another in the value of the exponent. The higher the exponent, the larger the value of the mean:
$$\bar{x}_{harm} \le \bar{x}_{geom} \le \bar{x}_{arith} \le \bar{x}_{q}$$

This property of power means is called the majorant property of means.

Thus, the choice of the type of average has a significant effect on its numerical value. In each case the type of average is determined by analyzing the population under study and the content of the phenomenon. A power mean is chosen correctly if its logical formula, i.e. the socio-economic content of the averaged attribute, does not change at any stage of the calculation.

Structural averages are a special kind of average. They are used to study the internal structure of the distribution series of attribute values. They include the mode and the median.

The mode and median characterize the value of the attribute of a statistical unit that occupies a certain position in the variation series.

Mode (Mo) is the most common value of the attribute in the population. The mode is widely used in statistical practice for studying consumer demand, recording prices, etc.

Median (Me) is the value of the attribute of the statistical unit that lies in the middle of the ranked series and divides the population into two parts equal in number.

For discrete variation series, Mo and Me are selected in accordance with their definitions: the mode is the value of the attribute with the highest frequency $n_i$; for an odd population size, the position of the median is determined by its number $N_{Me} = (N+1)/2$, where $N$ is the size of the statistical population. For a series of even length, the median is equal to the average of the two values in the middle of the series.

The median is used as the most reliable indicator of the typical value of a heterogeneous population, since it is insensitive to extreme values of the attribute, which may differ significantly from the bulk of its values. In addition, the median finds practical application due to a special mathematical property: the sum of the absolute deviations of the values from the median is minimal, $\sum |x_i - Me| = \min$.

Consider how the mode and median are determined in the following example.

A distribution series of the workers of a site by skill level is given. The data are shown in Table 4.4.

Table 4.4 - Distribution of site workers by skill level (columns: skill category, frequency $n_i$, accumulated frequency $N_i$)

The mode is selected by the maximum frequency: at $n_{max} = 14$, Mo = 4, i.e. the 4th category is the most common. To find the median Me, the central units are determined from $(N+1)/2$. With $N = 50$ this gives 25.5, so the central units are the 25th and 26th. The group into which these units fall is determined from the accumulated frequencies. It is the 4th group, in which the attribute value is 4. Thus Me = 4, which means that half of the workers have a category no higher than 4 and the other half no lower than 4.
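Here is a small Python sketch of this procedure. The frequencies below are hypothetical: they were chosen only so that they agree with the figures quoted in the text (50 workers, maximum frequency 14 in the 4th category) and should not be read as the actual contents of Table 4.4.

```python
import statistics
from itertools import accumulate

# Hypothetical frequencies: skill category -> number of workers
frequencies = {1: 3, 2: 7, 3: 10, 4: 14, 5: 10, 6: 6}

# Accumulated frequencies, used to locate the central units
cumulative = list(accumulate(frequencies.values()))

# Expand the grouped data into a ranked series of 50 individual values
series = [category for category, n in frequencies.items() for _ in range(n)]

mode = statistics.mode(series)      # category with the highest frequency
median = statistics.median(series)  # average of the 25th and 26th units

print(cumulative)
print(mode, median)
```

With these numbers the accumulated frequency reaches 20 after the 3rd category and 34 after the 4th, so both the 25th and the 26th unit fall into category 4.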

In interval series, the values of Mo and Me are calculated in a more complex way.

The mode is determined as follows:

The interval containing the mode is identified by the maximum frequency. It is called the modal interval.

Within the modal interval, the mode is calculated by the formula:
$$Mo = x_{Mo} + a_{Mo} \cdot \frac{n_{Mo} - n_{Mo-1}}{(n_{Mo} - n_{Mo-1}) + (n_{Mo} - n_{Mo+1})}$$

where
$x_{Mo}$ is the lower limit of the modal interval;

$a_{Mo}$ is the width of the modal interval;

$n_{Mo}$, $n_{Mo-1}$, $n_{Mo+1}$ are, respectively, the frequencies of the modal, premodal (preceding the modal) and postmodal (following the modal) intervals.
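A minimal Python sketch of the interval mode, assuming the standard textbook formula: the lower limit plus the interval width times the ratio of the frequency excess over the premodal interval to the sum of the excesses over both neighbours. The function name and the neighbouring frequencies (11 and 7) are hypothetical.

```python
def interval_mode(lower, width, n_modal, n_prev, n_next):
    """Mode inside the modal interval of a grouped (interval) series.

    lower   - lower limit of the modal interval
    width   - width of the modal interval
    n_modal - frequency of the modal interval
    n_prev  - frequency of the premodal interval
    n_next  - frequency of the postmodal interval
    """
    excess_prev = n_modal - n_prev
    excess_next = n_modal - n_next
    return lower + width * excess_prev / (excess_prev + excess_next)

# Modal interval 12-16 with frequency 13; neighbours 11 and 7 are made up
print(interval_mode(lower=12, width=4, n_modal=13, n_prev=11, n_next=7))  # 13.0
```

Because the premodal frequency is closer to the modal one than the postmodal frequency is, the mode lands nearer the lower boundary of the interval.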

The following approach is used to calculate the median in interval series:

Based on the accumulated frequencies, the median interval is found.

The median interval is the interval containing the central unit.

Within the median interval, the value of Me is determined by the formula:
$$Me = x_{Me} + a_{Me} \cdot \frac{N/2 - N_{Me-1}}{n_{Me}} \qquad (4.25)$$

where
$x_{Me}$ is the lower limit of the median interval;

$a_{Me}$ is the width of the median interval;

$N$ is the size of the statistical population;

$N_{Me-1}$ is the accumulated frequency of the pre-median interval;

$n_{Me}$ is the frequency of the median interval.
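The interval median can be sketched in Python in the same way, assuming the standard formula: the lower limit plus the interval width times the share of the interval needed to reach half of the population. The function name and the frequencies in the example are hypothetical; they are chosen so that 25 of 50 units are accumulated by the end of the 8-12 interval.

```python
def interval_median(lower, width, total, cumulative_before, n_median):
    """Median inside the median interval of a grouped (interval) series.

    lower             - lower limit of the median interval
    width             - width of the median interval
    total             - size of the population (N)
    cumulative_before - accumulated frequency before the median interval
    n_median          - frequency of the median interval
    """
    return lower + width * (total / 2 - cumulative_before) / n_median

# Hypothetical series of 50 units: 14 accumulated before the 8-12 interval,
# 11 units inside it, 13 units in the following 12-16 interval
print(interval_median(8, 4, 50, 14, 11))   # 12.0
print(interval_median(12, 4, 50, 25, 13))  # 12.0
```

When exactly half of the population is accumulated at an interval boundary, both neighbouring intervals give the same median.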

Let us consider the calculation of the mode and median for an interval distribution series, using the distribution of workers by length of service as an example (Table 4.5).

Table 4.5 - Distribution of site workers by length of service (columns: interval, interval width $a_i$, frequency $n_i$, accumulated frequency $N_i$)

Calculation of Mo:

The maximum frequency is $n_{max} = 13$. It corresponds to the fourth group, so the interval with boundaries of 12–16 years is modal.

The mode is calculated by the formula:

Most often, workers have about 13 years of work experience.

The mode does not lie in the middle of the modal interval. It is shifted toward its lower boundary because of the structure of this distribution series (the frequency of the premodal interval is much higher than that of the postmodal interval).

Median Calculation:

The median interval is determined from the cumulative frequencies. It must contain the 25th and 26th statistical units, which fall into different groups, the 3rd and the 4th. Either group can be used to find Me. We carry out the calculation for the 3rd group:

The same value of Me is obtained when it is calculated for the 4th group:

With a double center, Me always lies at the boundary between the intervals containing the central units. The computed value of Me shows that the first 25 workers have less than 12 years of work experience and the remaining 25 have more than 12 years.

The mode can be determined graphically from the distribution polygon in discrete series and from the histogram in interval series; the median is determined from the cumulative curve (cumulate).

To find the mode in the interval series, the right vertex of the modal rectangle must be connected to the upper right corner of the previous rectangle, and the left vertex to the upper left corner of the next rectangle. The abscissa of the point of intersection of these lines will be the distribution mode.

To determine the median, the height of the largest ordinate of the cumulate, corresponding to the total population, is divided in half. A straight line is drawn through the obtained point, parallel to the abscissa axis, until it intersects with the cumulate. The abscissa of the intersection point is the median.

Besides Mo and Me, other structural characteristics, quantiles, can be determined in a variation series. Quantiles are intended for a deeper study of the structure of the distribution series. A quantile is the value of an attribute that occupies a certain place in the population ordered by this attribute. The following types of quantiles are distinguished:

- quartiles: attribute values dividing the ordered population into 4 equal parts;

- deciles: attribute values dividing the population into 10 equal parts;

- percentiles: attribute values dividing the population into 100 equal parts.
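As an illustration, Python's standard `statistics.quantiles` function returns the cut points for any number of equal parts. The data are hypothetical, and the function's default interpolation method ("exclusive") is only one of several conventions, so other tools may give slightly different cut points.

```python
import statistics

# Hypothetical ranked series of 11 values
data = list(range(1, 12))

quartiles = statistics.quantiles(data, n=4)      # 3 cut points
deciles = statistics.quantiles(data, n=10)       # 9 cut points
percentiles = statistics.quantiles(data, n=100)  # 99 cut points

print(quartiles)  # [3.0, 6.0, 9.0]
print(len(deciles), len(percentiles))
```

The second quartile coincides with the median of the series.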

Thus, three indicators can be used to characterize the position of the center of a distribution series: the mean value of the attribute, the mode, and the median.

When choosing the type and form of a specific indicator of the distribution center, it is necessary to proceed from the following recommendations:

For stable socio-economic processes, the arithmetic mean is used as the indicator of the center. Such processes are characterized by symmetric distributions, in which

$\bar{x} = Me = Mo$;

For unstable processes, the position of the distribution center is characterized by Mo or Me. For asymmetric processes, the preferred characteristic of the distribution center is the median, since it lies between the arithmetic mean and the mode.

It should be noted that this calculation of the variance has a drawback: it is biased, i.e. its mathematical expectation is not equal to the true value of the variance. Still, things are not so bad. As the sample size increases, the estimate approaches its theoretical counterpart, i.e. it is asymptotically unbiased. Therefore, for large samples the formula above can be used.

It is useful to translate the language of symbols into the language of words. The variance turns out to be the average square of the deviations. That is, the mean is calculated first, then the difference between each original value and the mean is taken, squared, summed, and divided by the number of values in the population. The difference between an individual value and the mean reflects the size of the deviation. It is squared so that all deviations become non-negative and positive and negative deviations do not cancel each other out when summed. Then, given the squared deviations, we simply calculate their arithmetic mean. Average, square, deviations: the deviations are squared and then averaged. The answer lies in just three words.

However, the variance is not used in its pure form the way, for example, the arithmetic mean or an index is. It is rather an auxiliary, intermediate indicator needed for other kinds of statistical analysis. It does not even have a sensible unit of measure: judging by the formula, it is the square of the unit of the original data. As the saying goes, you can't figure it out without a drink.


To bring the variance back to reality, that is, to use it for more down-to-earth purposes, the square root is taken. The result is the so-called root-mean-square deviation, also known as the "standard deviation" or "sigma" (after the Greek letter). The standard deviation formula is:

To obtain this indicator for a sample, use the formula:

As with the variance, there is a slightly different way of calculating it. But as the sample grows, the difference disappears.
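How quickly the difference disappears can be sketched in Python. The sketch assumes that the "slightly different" option means dividing the sum of squared deviations by n - 1 instead of n; in that case the ratio of the two results depends only on the sample size n. The function name is hypothetical.

```python
import math

def sample_to_population_ratio(n):
    """Ratio of the SD with divisor n - 1 to the SD with divisor n."""
    return math.sqrt(n / (n - 1))

for n in (5, 50, 5000):
    print(n, round(sample_to_population_ratio(n), 4))
```

For five observations the two results differ by almost 12%, for fifty by about 1%, and for five thousand the difference is negligible.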

The standard deviation obviously also characterizes the spread of the data, but now (unlike the variance) it can be compared with the original data, since both have the same units of measurement (this is clear from the calculation formula). On its own, however, this indicator is not very intuitive, since it is the product of many intermediate steps (deviation, squaring, sum, average, root). Nevertheless, one can already work with the standard deviation directly, because its properties are well studied. For example, there is the three-sigma rule, which states that for normally distributed data about 997 values out of 1000 lie within ±3 sigma of the arithmetic mean. The standard deviation, as a measure of uncertainty, also takes part in many statistical calculations. It is used to establish the accuracy of various estimates and forecasts. If the variation is very large, the standard deviation will also be large, so the forecast will be inaccurate, which shows up, for example, in very wide confidence intervals.
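The three-sigma figure can be verified with Python's standard library. This is a sketch that assumes the data follow a normal distribution, which is the condition under which the rule holds.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of a normal population lying within 3 standard deviations of the mean
share_within_3_sigma = standard_normal.cdf(3) - standard_normal.cdf(-3)

print(f"{share_within_3_sigma:.4f}")  # 0.9973
```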

The coefficient of variation

The standard deviation gives an absolute estimate of the spread. To understand how large the spread is relative to the values themselves (i.e., regardless of their scale), a relative indicator is required. This indicator is called the coefficient of variation and is calculated as the ratio of the standard deviation to the arithmetic mean: $V = \sigma / \bar{x}$.

The coefficient of variation is expressed as a percentage (if multiplied by 100%). With this indicator you can compare a wide variety of phenomena, regardless of their scale and units of measurement. This is what makes the coefficient of variation so popular.
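A minimal Python sketch of the coefficient of variation, assuming the population standard deviation is meant. The function name is hypothetical, and the data are the work-experience figures used earlier in the text.

```python
import statistics

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean, in percent."""
    return statistics.pstdev(values) / statistics.mean(values) * 100

experience = [2, 3, 4, 7, 9]  # years
print(f"{coefficient_of_variation(experience):.1f}%")  # 52.2%
```

Because the result is a pure percentage, it can be compared directly with the same indicator for data measured in other units.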

In statistics it is customary to consider a population homogeneous if the coefficient of variation is below 33% and heterogeneous if it is above 33%. It's hard for me to comment on this. I don't know who defined it this way or why, but it is treated as an axiom.

I feel that I got carried away with dry theory and need to offer something visual. On the other hand, all measures of variation describe roughly the same thing; they are just calculated differently. So it is hard to dazzle with a variety of examples: only the values of the indicators can differ, not their essence. So let's compare how the values of different measures of variation differ for the same data set. Let's take the example used for calculating the mean linear deviation. Here is the original data:

And a reminder chart.

Based on these data, we calculate various indicators of variation.

The mean is the usual arithmetic mean.

The range of variation is the difference between the maximum and minimum:

The average linear deviation is calculated by the formula:

Standard deviation:

We summarize the calculation in a table.

As you can see, the mean linear deviation and the standard deviation give similar values for the degree of data variation. The variance is sigma squared, so it will always be a relatively large number that, in itself, says nothing. The range of variation is the difference between the extremes and can say a lot.

Let's sum up some results.

Variation of an indicator reflects the variability of a process or phenomenon. Its degree can be measured using several indicators.

1. Range of variation - the difference between the maximum and the minimum. It reflects the range of possible values.
2. Mean linear deviation - the average of the absolute deviations of all values of the analyzed population from their mean.
3. Variance - the average square of the deviations.
4. Standard deviation - the square root of the variance (of the mean squared deviation).
5. Coefficient of variation - the most universal indicator, reflecting the degree of dispersion of the values regardless of their scale and units of measurement. It is expressed as a percentage and can be used to compare the variation of different processes and phenomena.

Thus, in statistical analysis there is a system of indicators reflecting the homogeneity of phenomena and the stability of processes. Often, variation indicators do not have independent meaning and are used for further data analysis (calculation of confidence intervals