Factor analysis and analysis of variance in Excel with calculation automation. One-way analysis of variance (ANOVA)

To analyze the variability of a trait under the influence of controlled variables, analysis of variance (ANOVA) is used.

To study the relationships between values, factor analysis is used. Let us consider these analytical tools in more detail: factor analysis, one-way ANOVA, and two-way ANOVA for assessing variability.

ANOVA in Excel

Conditionally, the goal of analysis of variance can be formulated as follows: to isolate from the total variability of a parameter three particular kinds of variability:

  • variability determined by the action of each of the studied values;
  • variability dictated by the relationships between the studied values;
  • random variability dictated by all unaccounted-for circumstances.

In Microsoft Excel, analysis of variance can be performed using the "Data Analysis" tool (tab "Data" - "Analysis"). This is a spreadsheet add-in. If the add-in is not available, open "Excel Options" and enable the Analysis ToolPak.

Work begins with the design of the table. Rules:

  1. Each column should contain the values of one factor under study.
  2. Arrange the columns in ascending/descending order of the parameter under study.

Consider the analysis of variance in Excel using an example.

The company's psychologist used a special technique to analyze the behavior strategies of employees in a conflict situation. It is assumed that behavior is influenced by the level of education (1 - secondary, 2 - specialized secondary, 3 - higher education).

Enter data into an Excel spreadsheet:


The significant parameter is highlighted in yellow. Since the between-groups p-value is greater than the significance level of 0.05, the Fisher test statistic cannot be considered significant. Consequently, behavior in a conflict situation does not depend on the level of education.
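For readers who want to reproduce this outside of Excel, here is a minimal R sketch of the same one-way layout. The scores are made up for illustration (the article's actual data are only in the screenshot); aov() performs the same computation as Excel's "Anova: Single Factor" tool:

# Hypothetical behavior scores for three education groups
# (illustrative values, not the article's screenshot data):
secondary  <- c(5, 7, 6, 4, 6)
vocational <- c(6, 5, 7, 6, 5)
higher     <- c(5, 6, 4, 7, 6)

scores <- data.frame(
  score = c(secondary, vocational, higher),
  edu   = factor(rep(c("secondary", "vocational", "higher"), each = 5))
)

# Same decomposition as Excel's "Anova: Single Factor":
summary(aov(score ~ edu, data = scores))
# If Pr(>F) is greater than 0.05, education has no significant effect.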



Factor analysis in Excel: an example

Factor analysis is a multivariate analysis of the relationships between the values of variables. Using this method, you can solve several important tasks:

  • describe the measured object comprehensively yet compactly;
  • identify hidden variables that determine the presence of linear statistical correlations;
  • classify variables (determine the relationships between them);
  • reduce the number of required variables.

Consider an example of factor analysis. Suppose we know the sales of certain goods for the last 4 months. It is necessary to analyze which items are in demand and which are not.



Now you can clearly see which products' sales provide the main growth.
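The screenshot with the sales table is not reproduced here, so below is a hedged R sketch of the same idea with made-up numbers: compute each product's growth over the period and its share of the total growth to see which items drive it:

# Hypothetical sales of three products in month 1 and month 4:
sales <- data.frame(
  product = c("A", "B", "C"),
  month1  = c(100, 200, 50),
  month4  = c(180, 210, 40)
)
sales$growth <- sales$month4 - sales$month1                       # absolute growth
sales$share  <- round(100 * sales$growth / sum(sales$growth), 1)  # % of total growth
sales[order(-sales$growth), ]                                     # main growth drivers first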

Two-way analysis of variance in Excel

Two-way ANOVA shows how two factors affect the change in the value of a random variable. Consider two-way analysis of variance in Excel using an example.

Task. A group of men and women were presented with sounds of different volumes: 1 - 10 dB, 2 - 30 dB, 3 - 50 dB. Response time was recorded in milliseconds. It is necessary to determine whether gender affects the response time and whether loudness affects it.

Exercise. First-year students were surveyed to identify the activities to which they devote their free time. Check whether the distributions of verbal and non-verbal preferences of the students differ.

Solution (carried out with a calculator).
Finding group averages:

N      P1     P2
1      12     17
2      18     19
3      23     25
4      10     7
5      15     17
mean   15.6   17

Let p denote the number of levels of the factor (p = 2). The number of measurements at each level is the same and equal to q = 5.
The last row contains the group means for each level of the factor.
The overall mean can be obtained as the arithmetic mean of the group means:

x̄ = (x̄_1 + x̄_2 + ... + x̄_p) / p    (1)
The spread of the group means about the overall mean is affected both by changes in the level of the factor under consideration and by random factors.
In order to take the influence of this factor into account, the total sample variance is divided into two parts: the factor (between-group) variance s²_f and the residual variance s²_res.
In order to take these components into account, the total sum of squared deviations of the observations from the overall mean is calculated first:

R_total = Σ_j Σ_i (x_ij - x̄)²

and then the factor sum of squared deviations of the group means from the overall mean, which characterizes the influence of this factor:

R_f = q · Σ_j (x̄_j - x̄)²
The last expression is obtained by replacing each observation in the expression for R_total with the group mean for the given factor level.
The residual sum of squared deviations is obtained as the difference:

R_res = R_total - R_f
To determine the total sample variance, R_total is divided by the number of measurements pq:

S²_total = R_total / (pq)

and to obtain the unbiased total sample variance, this expression is multiplied by pq/(pq - 1):

s²_total = R_total / (pq - 1)
Accordingly, for the unbiased factor sample variance:

s²_f = R_f / (p - 1)

where p - 1 is the number of degrees of freedom of the unbiased factor sample variance. Similarly, the unbiased residual sample variance is s²_res = R_res / (p(q - 1)).
In order to assess the influence of the factor on changes in the parameter under consideration, the value

f_obs = s²_f / s²_res

is calculated.
Since the ratio of the two sample variances s²_f and s²_res follows the Fisher-Snedecor distribution, the resulting value f_obs is compared with the critical value f_cr of that distribution corresponding to the chosen significance level α.

If f_obs > f_cr, then the factor has a significant impact and should be taken into account; otherwise its effect is insignificant and can be neglected.
The following formulas can also be used to calculate R_total and R_f:

R_total = Σ_j Σ_i x_ij² - pq · x̄²    (4)

R_f = q · Σ_j x̄_j² - pq · x̄²    (5)
We find the overall mean by formula (1): x̄ = (15.6 + 17) / 2 = 16.3.
To calculate R_total using formula (4), we compile a table of squared values:
N    P1²    P2²
1    144    289
2    324    361
3    529    625
4    100    49
5    225    289
Σ    1322   1613

R_total is then calculated by formula (4):

R_total = 1322 + 1613 - 10 · 16.3² = 278.1
We find R_f by formula (5):
R_f = 5 · (15.6² + 17²) - 10 · 16.3² = 4.9
We get R_res: R_res = R_total - R_f = 278.1 - 4.9 = 273.2
We determine the factor and residual variances:

s²_f = R_f / (p - 1) = 4.9 / 1 = 4.9

s²_res = R_res / (p(q - 1)) = 273.2 / 8 = 34.15
If the mean values of the random variable are the same in the individual samples, then the estimates of the factor and residual variances are both unbiased estimates of the population variance and differ insignificantly.
In that case a comparison of these variance estimates by Fisher's criterion should show that there is no reason to reject the null hypothesis of the equality of the factor and residual variances.
Here the estimate of the factor variance is less than the estimate of the residual variance, so we can immediately assert the validity of the null hypothesis of equal mathematical expectations across the layers of the sample.
In other words, in this example the factor does not significantly affect the random variable.
Let us nevertheless check the null hypothesis H_0 of equal mean values of x formally. Find f_obs:

f_obs = s²_f / s²_res = 4.9 / 34.15 = 0.14
For the significance level α = 0.05 and degrees of freedom 1 and 8, we find f_cr from the Fisher-Snedecor distribution table:
f_cr(0.05; 1; 8) = 5.32
Since f_obs < f_cr, we reject the hypothesis that the factor significantly influences the results of the experiments.
In other words, the distributions of verbal and non-verbal preferences of the students do not differ significantly.
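As a cross-check, here is a minimal R sketch that reproduces this worked example using formulas (4) and (5); the variable names are mine, and qf() stands in for the printed Fisher-Snedecor table:

p1 <- c(12, 18, 23, 10, 15)  # verbal preferences
p2 <- c(17, 19, 25, 7, 17)   # non-verbal preferences
x  <- c(p1, p2)
n  <- length(x)                                           # pq = 10

R_total <- sum(x^2) - n * mean(x)^2                       # formula (4): 278.1
R_f     <- 5 * (mean(p1)^2 + mean(p2)^2) - n * mean(x)^2  # formula (5): 4.9
R_res   <- R_total - R_f                                  # 273.2

f_obs <- (R_f / 1) / (R_res / 8)                          # 4.9 / 34.15 = 0.14
f_cr  <- qf(1 - 0.05, df1 = 1, df2 = 8)                   # 5.32, as in the table
f_obs < f_cr                                              # TRUE: effect not significant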

Exercise. The plant has four lines for the production of facing tiles. 10 tiles were randomly selected from each line during the shift and their thickness (mm) was measured. Deviations from the nominal size are given in the table. At the significance level α = 0.05, it is required to establish whether the production of high-quality tiles depends on the production line (factor A).

Exercise. At the significance level α = 0.05, investigate the effect of paint color on the service life of the coating.

Example #1. 13 tests were performed, of which 4 were at the first level of the factor, 4 were at the second, 3 were at the third and 2 were at the fourth. Using the method of analysis of variance at a significance level of 0.05, check the null hypothesis about the equality of group means. It is assumed that the samples are taken from normal populations with the same variances. The test results are shown in the table.

Solution:
Finding group averages:

N     P1     P2     P3     P4
1     1.38   1.41   1.32   1.31
2     1.38   1.42   1.33   1.33
3     1.42   1.44   1.34   -
4     1.42   1.45   -      -
Σ     5.6    5.72   3.99   2.64
mean  1.4    1.43   1.33   1.32

Let p denote the number of levels of the factor (p = 4). The numbers of measurements at the levels are 4, 4, 3, 2.
The last row contains the group means for each level of the factor.
The overall mean is calculated as the weighted mean of the group means:

x̄ = Σ n_j · x̄_j / n = (5.6 + 5.72 + 3.99 + 2.64) / 13 = 17.95 / 13 ≈ 1.381
To calculate S_total using formula (4), we compile a table of squared values:

N    P1²    P2²    P3²    P4²
1    1.9    1.99   1.74   1.72
2    1.9    2.02   1.77   1.77
3    2.02   2.07   1.8    -
4    2.02   2.1    -      -
Σ    7.84   8.18   5.31   3.49

The total sum of squared deviations is found by formula (4):

S_total = Σ_j Σ_i x_ij² - n · x̄² = 24.814 - 24.785 = 0.0293
We find S_f by formula (5):

S_f = Σ n_j · x̄_j² - n · x̄² = (4 · 1.4² + 4 · 1.43² + 3 · 1.33² + 2 · 1.32²) - 24.785 = 0.0263
We get S_res: S_res = S_total - S_f = 0.0293 - 0.0263 = 0.003
We determine the factor variance:

s²_f = S_f / (p - 1) = 0.0263 / 3 = 0.0088

and the residual variance:

s²_res = S_res / (n - p) = 0.003 / 9 = 0.00033
If the mean values of the random variable were the same in the individual samples, the estimates of the factor and residual variances would be unbiased estimates of the population variance and would differ insignificantly.
A comparison of these estimates by Fisher's criterion would then give no reason to reject the null hypothesis of the equality of the factor and residual variances.
Here the estimate of the factor variance is much greater than the estimate of the residual variance, so we can immediately expect the null hypothesis of equal mathematical expectations across the layers of the sample to be false.
In other words, in this example the factor has a significant impact on the random variable.
Let us check the null hypothesis H_0 of equal mean values of x formally. Find f_obs:

f_obs = s²_f / s²_res = 0.0088 / 0.00033 ≈ 26.3
For the significance level α = 0.05 and degrees of freedom 3 and 9 (the residual degrees of freedom equal n - p = 13 - 4 = 9), we find f_cr from the Fisher-Snedecor distribution table.
f_cr(0.05; 3; 9) = 3.86
Since f_obs > f_cr, we reject the null hypothesis of equal group means: the factor has a significant influence on the results of the experiments. In other words, the group means as a whole differ significantly.
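The same conclusion can be checked in R with aov(), which handles the unequal group sizes automatically and uses the residual degrees of freedom n - p = 9:

x   <- c(1.38, 1.38, 1.42, 1.42,  # level 1
         1.41, 1.42, 1.44, 1.45,  # level 2
         1.32, 1.33, 1.34,        # level 3
         1.31, 1.33)              # level 4
lvl <- factor(rep(1:4, times = c(4, 4, 3, 2)))

summary(aov(x ~ lvl))
# F is about 26 with Pr(>F) far below 0.05,
# so the group means differ significantly.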

Example #2. The school has 5 sixth-grade classes. The psychologist's task is to determine whether the average level of situational anxiety is the same across the classes. The test data are given in the table. At the significance level α = 0.05, check the assumption that mean situational anxiety does not differ between classes.

Example #3. To study the value X, 4 tests were performed at each of the five levels of factor F. The test results are given in the table. Find out whether the influence of factor F on X is significant. Take α = 0.05. It is assumed that the samples are taken from normal populations with the same variances.

Example #4. Suppose that three groups of students, 10 people each, participated in a pedagogical experiment. The groups used different teaching methods: the first - traditional (F1), the second - based on computer technology (F2), the third - a method making wide use of independent-work tasks (F3). Knowledge was assessed on a ten-point scale.
It is required to process the obtained exam data and conclude whether the influence of the teaching method is significant, taking α = 0.05 as the significance level.
The exam results are given in the table, where F_j is the factor level and x_ij is the score of the i-th student under method F_j.


Example #5. The results of competitive variety testing of crops are shown (yield in centners per hectare, c/ha). Each variety was tested in four plots. Use analysis of variance to study the effect of the variety on the yield. Establish the significance of the factor's influence (the share of between-group variation in the total variation) and the significance of the experimental results at a significance level of 0.05. A solution sketch in R follows the table.
Yields in variety testing plots

Variety    Yield by replication, c/ha
           Rep 1    Rep 2    Rep 3    Rep 4
1          42.4     37.4     40.7     38.2
2          52.5     50.1     53.8     50.7
3          52.3     53.0     51.4     53.6
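A possible solution sketch in R (my own layout of the table above; eta squared gives the share of between-group variation in the total variation requested by the exercise):

trial <- data.frame(
  c_ha    = c(42.4, 37.4, 40.7, 38.2,   # variety 1
              52.5, 50.1, 53.8, 50.7,   # variety 2
              52.3, 53.0, 51.4, 53.6),  # variety 3
  variety = factor(rep(1:3, each = 4))
)

fit <- summary(aov(c_ha ~ variety, data = trial))
fit
# Share of between-group variation in the total variation (eta squared):
ss <- fit[[1]][["Sum Sq"]]
ss[1] / sum(ss)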

ANOVA is a set of statistical methods designed to test hypotheses about the relationship between certain features and the studied factors that do not have a quantitative description, as well as to establish the degree of influence of the factors and their interaction. In specialized literature it is often called ANOVA (from the English name Analysis of Variance). The method was first developed by R. Fisher in 1925.

Types and criteria for analysis of variance

This method is used to investigate the relationship between qualitative (nominal) features and a quantitative (continuous) variable. In essence, it tests the hypothesis of equality of the arithmetic means of several samples, so it can be considered a parametric criterion for comparing the centers of several samples at once. If this method is applied to two samples, the results of the analysis of variance are identical to the results of Student's t-test. However, unlike other criteria, this analysis allows the problem to be studied in more detail.

Analysis of variance in statistics is based on the following law: the sum of squared deviations of the combined sample equals the sum of squared within-group deviations plus the sum of squared between-group deviations. The study uses Fisher's test to establish the significance of the difference between the between-group and within-group variances. Necessary prerequisites for this are normality of the distributions and homoscedasticity (equality of variances) of the samples. A distinction is made between one-way (single-factor) and multivariate (multifactor) analysis of variance. The first considers the dependence of the value under study on one attribute, the second on many attributes at once, and also makes it possible to identify the relationships between them.

Factors

Factors are controlled circumstances that affect the final result. A factor level (or method of processing) is a value that characterizes a specific manifestation of that factor. These values are usually given on a nominal or ordinal measurement scale. Output values are often measured on quantitative or ordinal scales, and the problem then arises of grouping the output data into series of observations that correspond to approximately the same numerical values. If the number of groups is taken too large, the number of observations in them may be insufficient to obtain reliable results. If it is taken too small, essential features of the influence on the system may be lost. The specific way of grouping the data depends on the volume and nature of the variation of the values. The number and size of intervals in univariate analysis are most often determined by the principle of equal intervals or the principle of equal frequencies.

Tasks of analysis of variance

So, there are cases when two or more samples need to be compared. It is then advisable to use analysis of variance. The name of the method indicates that conclusions are drawn by studying the components of the variance. The essence of the study is that the overall change in the indicator is divided into components corresponding to the action of each individual factor. Consider a number of problems that a typical analysis of variance solves.

Example 1

The workshop has a number of automatic machine tools that produce a specific part. The size of each part is a random variable that depends on the setup of each machine and on random deviations arising during the manufacturing process. From measurements of the dimensions of the parts, it is necessary to determine whether the machines are set up in the same way.

Example 2

During the manufacture of an electrical apparatus, various types of insulating paper are used: capacitor, electrical, etc. The apparatus can be impregnated with various substances: epoxy resin, varnish, ML-2 resin, etc. Air inclusions can be eliminated under vacuum, at elevated pressure, or by heating. Impregnation can be done by immersion in varnish, under a continuous stream of varnish, etc. The electrical apparatus as a whole is potted with a certain compound, of which there are several options. Quality indicators are the dielectric strength of the insulation, the overheating temperature of the winding in operating mode, and a number of others. During the development of the technological process for manufacturing the devices, it is necessary to determine how each of the listed factors affects the performance of the device.

Example 3

The trolleybus depot serves several trolleybus routes. Trolleybuses of various types operate on them, and 125 conductors collect fares. The depot management is interested in the following questions: how to compare the economic performance (revenue) of each conductor given the different routes and different types of trolleybuses? How to determine the economic feasibility of running trolleybuses of a certain type on a particular route? How to establish reasonable requirements for the revenue a conductor brings in on each route with various types of trolleybuses?

The task of choosing a method is to obtain maximum information about the impact of each factor on the final result, and to determine the numerical characteristics of such an impact and their reliability, at minimal cost and in the shortest possible time. Methods of analysis of variance allow such problems to be solved.

Univariate analysis

The study aims to assess the magnitude of the impact of a particular factor on the response being analyzed. Another task of univariate analysis may be to compare two or more factors with each other in order to determine the difference in their influence on the response. If the null hypothesis is rejected, the next step is to quantify the effect and build confidence intervals for the obtained characteristics. When the null hypothesis cannot be rejected, it is usually accepted and a conclusion about the nature of the influence is drawn.

The Kruskal-Wallis rank test is a non-parametric analogue of one-way analysis of variance. It was developed by the American mathematician William Kruskal and economist W. Allen Wallis in 1952. This test checks the null hypothesis that the effects of the influence on the studied samples are equal when the underlying distributions are unknown. The number of samples must be more than two.
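In R this test is available as kruskal.test(); a minimal sketch with made-up samples:

# Three made-up samples (more than two, as the test requires):
g1 <- c(2.1, 3.4, 2.8, 3.0)
g2 <- c(4.0, 3.9, 4.4, 4.1)
g3 <- c(2.9, 3.1, 3.3, 2.7)
kruskal.test(list(g1, g2, g3))
# A small p-value means the hypothesis of equal effects is rejected.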

The Jonckheere (Jonckheere-Terpstra) test was proposed independently by the Dutch mathematician T. J. Terpstra in 1952 and the British psychologist A. R. Jonckheere in 1954. It is used when it is known in advance that the available groups of results are ordered by increasing influence of the factor under study, measured on an ordinal scale.

The Bartlett M test, proposed by the British statistician Maurice Stevenson Bartlett in 1937, is used to test the null hypothesis of equal variances of several normal general populations from which the studied samples are taken; in general the samples may have different sizes (each sample size must be at least four).

The Cochran G test, developed by the American William Gemmell Cochran in 1941, is used to test the null hypothesis of equal variances of normal populations for independent samples of equal size.

The Levene test, proposed by the American mathematician Howard Levene in 1960, is an alternative to the Bartlett test in conditions where there is no certainty that the samples under study follow a normal distribution.

In 1974, American statisticians Morton B. Brown and Alan B. Forsythe proposed a test (the Brown-Forsythe test) that differs somewhat from the Levene test.
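In R, these variance-homogeneity tests look roughly as follows (bartlett.test() is in base R; the Levene and Brown-Forsythe tests live in the add-on package car, so those lines are shown commented out):

set.seed(1)
# Three made-up groups with increasing spread:
x <- c(rnorm(10, sd = 1.0), rnorm(10, sd = 1.5), rnorm(10, sd = 2.0))
g <- factor(rep(1:3, each = 10))

bartlett.test(x ~ g)  # Bartlett's M test (assumes normality)

# With the 'car' package installed:
# car::leveneTest(x ~ g, center = mean)    # original Levene test
# car::leveneTest(x ~ g, center = median)  # Brown-Forsythe variant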

Two-way analysis

Two-way analysis of variance is used for related, normally distributed samples. In practice, complex tables of this method are often used, in particular those in which each cell contains a set of data (repeated measurements) corresponding to fixed level values. If the assumptions required for two-way analysis of variance are not met, the non-parametric rank test of Friedman (Friedman, Kendall and Smith), developed by the American economist Milton Friedman at the end of the 1930s, is used. This test does not depend on the type of distribution.

It is only assumed that the distributions of the quantities are identical and continuous and that the quantities themselves are independent of one another. When testing the null hypothesis, the output data are presented as a rectangular matrix in which the rows correspond to the levels of factor B and the columns to the levels of factor A. Each cell of the table (block) may contain the results of measurements of parameters on one object or on a group of objects with constant values of the levels of both factors. In this case, the corresponding data are presented as the mean values of a certain parameter over all measurements or objects of the sample under study. To apply the test, it is necessary to move from the direct measurement results to their ranks. Ranking is carried out for each row separately, that is, the values are ordered within each fixed row.
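In base R this is friedman.test(); a minimal sketch with a made-up matrix where blocks are rows and treatment levels are columns:

# 5 blocks (rows) x 3 treatment levels (columns), made-up values:
m <- matrix(c(3.1, 2.9, 3.4,
              4.0, 3.8, 4.4,
              2.5, 2.4, 2.9,
              3.6, 3.5, 3.9,
              3.0, 2.8, 3.3),
            nrow = 5, byrow = TRUE)
friedman.test(m)  # ranking is performed within each row, as described above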

The Page test (L-test), proposed by the American statistician E. B. Page in 1963, is designed to test the null hypothesis against ordered alternatives. For large samples, a normal approximation of the Page statistic is used: when the corresponding null hypothesis holds, the statistic follows the standard normal distribution. When rows of the source table contain tied values, average ranks must be used; the accuracy of the conclusions then deteriorates as the number of such ties grows.

The Cochran Q test, proposed by W. Cochran in 1937, is used when groups of homogeneous subjects are exposed to more than two influences and two responses are possible: conditionally negative (0) and conditionally positive (1). The null hypothesis is the equality of the influence effects. Two-way analysis of variance makes it possible to establish the existence of treatment effects, but not to determine for which columns the effect exists. To solve that problem, Scheffé's method of multiple comparisons for related samples is used.

Multivariate analysis

The problem of multivariate analysis of variance arises when it is necessary to determine the influence of two or more factors on a certain random variable. The study assumes one dependent random variable, measured on an interval or ratio scale, and several independent variables, each expressed on a nominal or ordinal scale. Analysis of variance is a well-developed branch of mathematical statistics with many variants. The concept of the study is common to both univariate and multivariate versions: the total variance is divided into components corresponding to a certain grouping of the data, and each grouping of the data has its own model. Here we consider only the main provisions needed for understanding and practical use of its most common variants.

Factorial analysis of variance requires careful attention to the collection and presentation of the input data, and especially to the interpretation of the results. In contrast to the one-factor case, whose results can be conditionally placed in a certain sequence, the results of the two-factor case require a more complex presentation. The situation becomes even more difficult when there are three, four or more factors; because of this, a model rarely includes more than three or four conditions. Examples include the occurrence of resonance at a certain combination of capacitance and inductance in an electric circuit; the manifestation of a chemical reaction with a certain set of elements from which the system is built; and the occurrence of anomalous effects in complex systems under a certain coincidence of circumstances. The presence of interaction can radically change the model of the system and sometimes lead to a rethinking of the nature of the phenomena the experimenter is dealing with.

Multivariate analysis of variance with repeated experiments

Measurement data can often be grouped not by two but by more factors. For example, if we consider the analysis of variance of the service life of trolleybus tires taking into account two circumstances (the manufacturer and the route on which the tires are used), we can single out the season during which the tires are used (winter versus summer operation) as a separate condition. As a result, we have a three-factor problem.

With more conditions, the approach is the same as in two-way analysis. In all cases, one tries to simplify the model. The interaction of two factors does not appear very often, and triple interactions occur only in exceptional cases. Include those interactions for which there is prior information and good reason to take them into account in the model. The process of isolating individual factors and accounting for them is relatively simple, so there is often a temptation to include more circumstances. One should not get carried away with this: the more conditions, the less reliable the model becomes and the greater the chance of error. A model that includes a large number of independent variables becomes difficult to interpret and inconvenient for practical use.

General idea of analysis of variance

Analysis of variance in statistics is a method of analyzing observational results that depend on various concurrent circumstances and of assessing their influence. A controlled variable that corresponds to the method of influencing the object of study and takes a certain value in a certain period of time is called a factor. Factors can be qualitative or quantitative. Levels of quantitative factors take values on a numerical scale; examples are temperature, pressing pressure, and amount of substance. Qualitative factors are different substances, different technological methods, apparatuses, fillers; their levels correspond to a nominal scale.

Qualitative factors also include the type of packaging material and the storage conditions of the dosage form. It is also rational to include the degree of grinding of raw materials and the fractional composition of granules, which have quantitative values but are difficult to regulate on a quantitative scale. The number of qualitative factors depends on the type of dosage form and on the physical and technological properties of the medicinal substances. For example, tablets can be obtained from crystalline substances by direct compression; in this case it is sufficient to select the glidants and lubricants.

Examples of quality factors for different types of dosage forms

  • Tinctures. Extractant composition, type of extractor, raw-material preparation method, production method, filtration method.
  • Extracts (liquid, thick, dry). Extractant composition, extraction method, type of installation, method of removing the extractant and ballast substances.
  • Tablets. Composition of excipients, fillers, disintegrants, binders, glidants and lubricants; method of obtaining tablets; type of technological equipment; type of shell and its components, film formers, pigments, dyes, plasticizers, solvents.
  • Injection solutions. Type of solvent, filtration method, nature of stabilizers and preservatives, sterilization conditions, method of filling ampoules.
  • Suppositories. Composition of the suppository base, method of obtaining suppositories, fillers, packaging.
  • Ointments. Composition of the base, structural components, method of preparing the ointment, type of equipment, packaging.
  • Capsules. Type of shell material, method of obtaining capsules, type of plasticizer, preservative, dye.
  • Liniments. Production method, composition, type of equipment, type of emulsifier.
  • Suspensions. Type of solvent, type of stabilizer, dispersion method.

Examples of quality factors and their levels studied in the tablet manufacturing process

  • Disintegrant. Potato starch, white clay, a mixture of sodium bicarbonate with citric acid, basic magnesium carbonate.
  • Binding solution. Water, starch paste, sugar syrup, methylcellulose solution, hydroxypropyl methylcellulose solution, polyvinylpyrrolidone solution, polyvinyl alcohol solution.
  • Glidant. Aerosil, starch, talc.
  • Filler. Sugar, glucose, lactose, sodium chloride, calcium phosphate.
  • Lubricant. Stearic acid, polyethylene glycol, paraffin.

Models of analysis of variance in studying the level of competitiveness of the state

One of the most important criteria for assessing the condition of a state, used to evaluate the level of its welfare and socio-economic development, is competitiveness, that is, the set of properties inherent in the national economy that determine the ability of the state to compete with other countries. Having determined the place and role of the state in the world market, it is possible to establish a clear strategy for ensuring economic security on an international scale, since this is the key to positive relations between Russia and all players in the world market: investors, creditors, and state governments.

To compare the level of competitiveness of states, countries are ranked using composite indices that include various weighted indicators. These indices are based on key factors affecting the economic, political, and other aspects of the situation. The set of models for studying the competitiveness of the state involves methods of multivariate statistical analysis (in particular, analysis of variance, econometric modeling, and decision-making methods) and includes the following main stages:

  1. Formation of a system of indicators-indicators.
  2. Evaluation and forecasting of indicators of the competitiveness of the state.
  3. Comparison of indicators-indicators of competitiveness of states.

Now let us consider the content of the models at each stage of this complex.

At the first stage, using expert-study methods, a justified set of economic indicators for assessing the competitiveness of the state is formed, taking into account the specifics of its development, on the basis of international ratings and data from statistical departments reflecting the state of the system as a whole and of its processes. The choice of these indicators is justified by the need to select those that, from a practical point of view, most fully determine the level of the state, its investment attractiveness, and the possibility of relative localization of existing potential and actual threats.

The main indicators of the international rating systems are the following indices:

  1. Global Competitiveness (GCC).
  2. Economic freedom (IES).
  3. Human Development (HDI).
  4. Perceptions of Corruption (CPI).
  5. Internal and external threats (IVZZ).
  6. Potential for International Influence (IPIP).

The second stage provides for the assessment and forecasting of state-competitiveness indicators according to international ratings for the 139 states of the world studied.

The third stage provides for a comparison of the competitiveness conditions of states using methods of correlation and regression analysis.

Using the results of the study, it is possible to determine the nature of the processes as a whole and for individual components of state competitiveness, and to test hypotheses about the influence of factors and their relationships at the appropriate significance level.

Implementing the proposed set of models will make it possible not only to assess the current level of competitiveness and investment attractiveness of states, but also to analyze management shortcomings, prevent wrong decisions, and prevent the development of a crisis in the state.

Analysis of variance is a statistical method for assessing the relationship between factor and resulting attributes in different, randomly selected groups, based on determining the differences (diversity) in the attribute values. It is based on analyzing the deviations of all units of the studied population from the arithmetic mean. The measure of deviation is the variance (D), the mean square of the deviations. Deviations caused by a factor attribute are compared with the magnitude of deviations caused by random circumstances. If the deviations caused by the factor attribute are more significant than the random deviations, the factor is considered to have a significant impact on the resulting attribute.

To calculate the variance, the deviation of each variant (each registered numerical value of the attribute) from the arithmetic mean is squared; this removes the negative signs. The squared deviations are then summed and divided by the number of observations, i.e. averaged. The result is the value of the variance.
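In R this calculation is a one-liner; note that var() divides by n - 1 (the sample variance), while the description above divides by n:

x <- c(4, 7, 5, 9, 5)
dev2 <- (x - mean(x))^2  # squared deviations: the negative signs disappear
mean(dev2)               # variance as described above (divide by n): 3.2
var(x)                   # sample variance in R (divides by n - 1): 4.0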

An important methodological point in applying analysis of variance is the correct formation of the sample. Depending on the goal and objectives, sample groups can be formed randomly and independently of each other (control and experimental groups for studying some indicator, for example, the effect of high blood pressure on the development of stroke). Such samples are called independent.

Often, the results of exposure to factors are studied in the same sample group (for example, in the same patients) before and after the exposure (treatment, prevention, rehabilitation measures); such samples are called dependent.

Analysis of variance in which the influence of one factor is checked is called one-way (univariate) analysis. When the influence of more than one factor is studied, multivariate (multifactor) analysis of variance is used.

Factor attributes are those that influence the phenomenon under study.

Resulting attributes are those that change under the influence of factor attributes.

Conditions for the use of analysis of variance:

The task of the study is to determine the strength of the influence of one factor (up to 3) on the result, or to determine the strength of the combined influence of several factors (gender and age, physical activity and nutrition, etc.).

The studied factors should be independent of (unrelated to) each other. For example, one cannot study the combined effect of work experience and age, or of children's height and weight, on the morbidity of the population.

Groups for the study are selected randomly (random selection). The organization of a dispersion complex implementing the principle of random selection of variants is called randomization (from the English random), i.e. chosen at random.

Both quantitative and qualitative (attributive) features can be used.

When conducting a one-way analysis of variance, the following are necessary conditions of application:

1. The normality of the distribution of the analyzed groups or the correspondence of the sample groups to general populations with a normal distribution.

2. Independence (non-connectedness) of the distribution of observations in groups.

3. Presence of frequency (recurrence) of observations.

First, a null hypothesis is formulated, that is, it is assumed that the factors under study do not have any effect on the values ​​of the resulting attribute and the resulting differences are random.

Then we determine what is the probability of obtaining the observed (or stronger) differences, provided that the null hypothesis is true.

If this probability is small, then we reject the null hypothesis and conclude that the results of the study are statistically significant. This does not yet mean that the effect of the studied factors has been proven (this is primarily a matter of research planning), but it is still unlikely that the result is due to chance.

When all the conditions for applying analysis of variance are met, the decomposition of the total variance looks mathematically like this:

D_total = D_fact + D_res,

where D_total is the total variance of the observed values (variants), characterized by the spread of the variants about the overall mean. It measures the variation of the trait in the entire population under the influence of all the factors that caused this variation. The overall diversity is made up of between-group and within-group diversity;

D_fact is the factor (between-group) variance, characterized by the difference of the means in each group; it depends on the influence of the studied factor by which each group is differentiated. For example, in groups with different etiological factors of the clinical course of pneumonia, the average length of hospital stay (bed-days) is not the same: between-group diversity is observed;

D_res is the residual (within-group) variance, which characterizes the dispersion of the variants within the groups. It reflects random variation, i.e. the part of the variation that occurs under the influence of unspecified factors and does not depend on the factor attribute underlying the grouping. The variation of the trait under study depends on the strength of the influence of unaccounted random factors, both organized (set by the researcher) and random (unknown) ones.

Therefore, the total variation (variance) is composed of the variation caused by organized (given) factors, called the factor variation, and that of unorganized factors, i.e. the residual variation (random, unknown).

For a sample of size n, the sample variance is calculated as the sum of squared deviations from the sample mean divided by n - 1 (the sample size minus one). Thus, for a fixed sample size n, the variance is a function of the sum of squares (of deviations), denoted for brevity SS (from the English Sum of Squares). In what follows we often omit the word "sample", understanding that we are considering a sample variance or an estimate of the variance. Analysis of variance is based on dividing the variance into parts or components. Consider the following dataset (the original table is not reproduced here; values consistent with the numbers below are, for example, the groups {1, 2, 3} and {5, 6, 7}):

The means of the two groups differ significantly (2 and 6, respectively). The sum of squared deviations within each group equals 2; adding them gives 4. If we now repeat these calculations ignoring group membership, that is, calculate SS based on the overall mean of the two samples combined, we get 28. In other words, the variance (sum of squares) based on within-group variability yields much smaller values than the variance calculated from the total variability (relative to the overall mean). The reason is obviously the significant difference between the means, and this difference between the means explains the existing difference between the sums of squares.

          SS     df    MS     F      p
Effect    24.0   1     24.0   24.0   .008
Error     4.0    4     1.0

As can be seen from the table, the total sum of squares SS = 28 is divided into components: the sum of squares due to within-group variability (2 + 2 = 4; see the second row of the table) and the sum of squares due to the difference between the group means (28 - (2 + 2) = 24; see the first row of the table). Note that MS in this table is the mean square, equal to SS divided by the number of degrees of freedom (df).

In the simple example above, you could immediately calculate the t-test for independent samples; its results, of course, coincide with the results of the analysis of variance.
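A short R sketch with the example values above shows both the decomposition and the t-test equivalence:

g1 <- c(1, 2, 3)  # mean 2, within-group SS = 2
g2 <- c(5, 6, 7)  # mean 6, within-group SS = 2
y   <- c(g1, g2)
grp <- factor(rep(1:2, each = 3))

sum((g1 - mean(g1))^2) + sum((g2 - mean(g2))^2)  # within-group SS: 4
sum((y - mean(y))^2)                             # total SS: 28

summary(aov(y ~ grp))             # Effect SS = 24, Error SS = 4, p = .008
t.test(g1, g2, var.equal = TRUE)  # same p-value; here F = t^2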

However, situations where a phenomenon is completely described by one variable are extremely rare. For example, if we are trying to learn how to grow large tomatoes, we should consider factors related to the genetic structure of the plants, soil type, light, temperature, etc. Thus, in a typical experiment one has to deal with a large number of factors. The main reason why ANOVA is preferable to comparing two samples at different factor levels with a series of t-tests is that ANOVA is significantly more efficient and, for small samples, more informative.

Suppose that in the two-sample analysis example discussed above we add another factor, such as Gender. Let each group now consist of 3 men and 3 women. The plan of this experiment can be presented in the form of a table (the original table is not reproduced here; a consistent reconstruction is given in the R sketch at the end of this section):

Before doing the calculations, you can see that in this example, the total variance has at least three sources:

1) random error (intragroup variance),

2) variability associated with belonging to the experimental group

3) variability due to the sex of the objects of observation.

Note that there is another possible source of variability, the interaction of factors, which we will discuss later. What happens if we do not include gender as a factor in the analysis and calculate the usual t-test? If we calculate the sums of squares ignoring gender (i.e., combining objects of different sexes into one group when calculating the within-group variance, thus obtaining a sum of squares of SS = 10 for each group and a total within-group sum of squares of SS = 10 + 10 = 20), we get a larger within-group variance than in the more accurate analysis with additional subgrouping by gender (where the within-subgroup sums of squares equal 2 each, and the total within-group sum of squares is SS = 2 + 2 + 2 + 2 = 8).

So, with the introduction of the additional factor gender, the residual variance decreased. This is because the male mean is smaller than the female mean, and this difference in means increases the total within-group variability when gender is not taken into account. Controlling the error variance increases the sensitivity (power) of the test.

This example shows another advantage of analysis of variance over the usual two-sample t-test: analysis of variance allows each factor to be studied while controlling the values of the other factors. This is, in fact, the main reason for its greater statistical power (smaller sample sizes are needed to obtain meaningful results). For this reason, analysis of variance even on small samples gives statistically more significant results than a simple t-test.
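A hedged reconstruction of this two-factor example in R: the values below are my own, chosen only to match the sums of squares quoted above (within-subgroup SS of 2 in each of the four cells, SS of 10 per group when gender is ignored), not the original table:

dat <- data.frame(
  y      = c(1, 2, 3,  3, 4, 5,   # group 1: males, then females
             5, 6, 7,  7, 8, 9),  # group 2: males, then females
  group  = factor(rep(1:2, each = 6)),
  gender = factor(rep(rep(c("M", "F"), each = 3), times = 2))
)

summary(aov(y ~ group, data = dat))           # residual SS = 20 (gender ignored)
summary(aov(y ~ group + gender, data = dat))  # residual SS drops to 8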

Student's t-test is designed to compare only two populations. However, it is often misused for pairwise comparison of a larger number of groups (Fig. 1), which causes the so-called multiple comparisons effect (Glantz 1999, pp. 101-104). We will talk about this effect and how to deal with it later. In this post I will describe the principles of one-way analysis of variance, designed precisely for the simultaneous comparison of the mean values of two or more groups. The principles of ANOVA (analysis of variance) were developed in the 1920s by Sir Ronald Aylmer Fisher, "a genius who almost single-handedly laid the foundations of modern statistics" (Hald 1998).

The question may arise: why is a method used for comparing mean values called analysis of variance? The point is that when establishing the difference between mean values, we are actually comparing the variances of the analyzed populations. However, first things first...

Formulation of the problem

The example below is taken from the book by Maindonald & Braun (2010). Weight data are available for tomato plants (whole-plant weight, in kg) grown for 2 months under three different experimental conditions (trt, from treatment): in water (water), in a medium with added fertilizer (nutrient), and in a medium with added fertilizer and the herbicide 2,4-D (nutrient+24D):

# Create a table with data:
tomato <- data.frame(
  weight = c(1.5, 1.9, 1.3, 1.5, 2.4, 1.5,    # water
             1.5, 1.2, 1.2, 2.1, 2.9, 1.6,    # nutrient
             1.9, 1.6, 0.8, 1.15, 0.9, 1.6),  # nutrient+24D
  trt = rep(c("Water", "Nutrient", "Nutrient+24D"), c(6, 6, 6))
)

# View the result:
tomato
   weight          trt
1    1.50        Water
2    1.90        Water
3    1.30        Water
4    1.50        Water
5    2.40        Water
6    1.50        Water
7    1.50     Nutrient
8    1.20     Nutrient
9    1.20     Nutrient
10   2.10     Nutrient
11   2.90     Nutrient
12   1.60     Nutrient
13   1.90 Nutrient+24D
14   1.60 Nutrient+24D
15   0.80 Nutrient+24D
16   1.15 Nutrient+24D
17   0.90 Nutrient+24D
18   1.60 Nutrient+24D


The variable trt is a factor with three levels. For a more visual comparison of the experimental conditions later on, we will make the "Water" level the base (reference) level, i.e. the level with which R will compare all the other levels. This can be done with the relevel() function:
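# In R >= 4.0 the character column must first be converted to a factor:
tomato$trt <- factor(tomato$trt)
tomato$trt <- relevel(tomato$trt, ref = "Water")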


To better understand the properties of the available data, we visualize them (see the sketch below). The null hypothesis in this example states that the observed differences between the group means are insignificant and are caused by the influence of random factors (i.e., in fact, all the obtained plant-weight measurements come from one normally distributed general population).
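One simple visualization option (my choice; the original figure is not reproduced here) is a boxplot of weight by treatment:

boxplot(weight ~ trt, data = tomato,
        ylab = "Weight, kg", xlab = "Treatment")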

We emphasize once again that the considered example corresponds to the case of one-way analysis of variance: we study the effect of one factor - growing conditions (with three levels: Water, Nutrient and Nutrient+24D) - on the response variable of interest, the weight of the plants.

Unfortunately, the researcher almost never has the opportunity to study the entire general population. How then can we know whether the above null hypothesis is true, given only the sample data? We can phrase this question differently: what is the probability of obtaining the observed differences between the group means by drawing random samples from one normally distributed population? To answer it, we need a statistical test that quantitatively characterizes the magnitude of the differences between the compared groups.