Permissible limits of the mean approximation error. Assessing the statistical reliability of regression modeling results using the Fisher F-criterion

5. Using the F-criterion, it was found that the obtained paired regression equation as a whole is statistically insignificant and inadequately describes the studied relationship between the monthly pension y and the subsistence minimum x.

6. An econometric model of multiple linear regression was formed, linking the net income of a conditional firm y with capital turnover x1 and capital employed x2.

7. By calculating the elasticity coefficients, it was shown that when capital turnover changes by 1%, the firm's net income changes by 0.0008%, and when capital employed changes by 1%, the firm's net income changes by 0.56%.

8. Using the t-test, the statistical significance of the regression coefficients was assessed. It was found that the explanatory variable x1 is statistically insignificant and can be excluded from the regression equation, while the explanatory variable x2 is statistically significant.

9. Using the F-criterion, it was found that the obtained multiple regression equation as a whole is statistically significant and adequately describes the studied relationship between the net income of the conditional firm y, capital turnover x1 and capital employed x2.

10. The average error of approximation of the statistical data by the multiple linear regression equation was calculated and amounted to 29.8%. It was shown which observation in the statistical database causes this error to exceed the permissible value.

14. Building a paired regression model without using EXCEL.

Using the statistical material given in Table 3.5, it is necessary to:

2. Evaluate the tightness of the connection using indicators of correlation and determination.

3. Using the coefficient of elasticity, determine the degree of connection between the factor attribute and the resultant one.

4. Determine the average approximation error.

5. Evaluate the statistical reliability of the modeling results using the Fisher F-test.

Table 3.5. Initial data.

Columns: (1) the share of cash income directed to increasing savings in deposits, loans, certificates and the purchase of foreign currency, in the total amount of average per capita cash income, %; (2) the average monthly accrued wages, c.u.

Rows (regions): Kaluga, Kostroma, Orlovskaya, Ryazan, Smolensk.

To determine the unknown parameters b0 and b1 of the paired linear regression equation, we use the standard system of normal equations, which has the form

n·b0 + b1·Σx = Σy,
b0·Σx + b1·Σx² = Σxy.        (3.7)

To solve this system, it is first necessary to determine the values of Σx² and Σxy. These values are computed from the table of initial data, supplementing it with the appropriate columns (Table 3.6).
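As an illustration, here is a minimal Python sketch of the same computation: the sums Σx, Σy, Σx² and Σxy are accumulated and the normal equations (3.7) are solved for b0 and b1. The numeric data are placeholders, not the values of Tables 3.5–3.6.

```python
# Paired linear regression by solving the normal equations (3.7).
# The data below are illustrative placeholders, not the values of Table 3.5.
x = [150.0, 170.0, 190.0, 210.0, 230.0]   # explanatory variable (e.g. wages, c.u.)
y = [4.2, 3.9, 4.5, 3.1, 3.6]             # resulting variable (e.g. savings share, %)

n = len(x)
Sx = sum(x)
Sy = sum(y)
Sxx = sum(xi * xi for xi in x)
Sxy = sum(xi * yi for xi, yi in zip(x, y))

# n*b0 + Sx*b1 = Sy ;  Sx*b0 + Sxx*b1 = Sxy  =>  closed-form solution:
b1 = (n * Sxy - Sx * Sy) / (n * Sxx - Sx * Sx)
b0 = (Sy - b1 * Sx) / n
print(f"y = {b0:.4f} + {b1:.6f} * x")
```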

Table 3.6. For the calculation of the regression coefficients.

Then system (3.7) takes the form

Expressing b0 from the first equation and substituting the resulting expression into the second equation, we get:

Performing term-by-term multiplication and expanding the brackets, we get:

Finally, the paired linear regression equation, relating the share of the population's monetary income directed to increasing savings, y, to the average monthly accrued wages, x, has the form:

Now that the paired linear regression equation has been constructed, we determine the linear correlation coefficient from the relation

rxy = b·(σx / σy),        (3.9)

where σx and σy are the standard deviations of the corresponding variables.

To calculate the linear correlation coefficient from dependence (3.9), we will perform intermediate calculations.

Substituting the values ​​of the found parameters into expression (3.9), we obtain


The obtained value of the linear correlation coefficient indicates the presence of a weak inverse statistical relationship between the share of the population's monetary income aimed at increasing savings y and the average monthly accrued wages x.

The coefficient of determination is r² = 0.096, which means that only 9.6% of the variation of y is explained by the regression on the explanatory variable x. Accordingly, the value 1 − r², equal to 90.4%, characterizes the share of the variance of y caused by the influence of all the other explanatory factors not taken into account in the econometric model.

The coefficient of elasticity is equal to

Consequently, when the average monthly accrued wage changes by 1%, the share of the population's cash income directed to increasing savings changes by about 1% in the opposite direction: as wages grow, this share decreases. This conclusion contradicts common sense and can only be explained by the inadequacy of the constructed mathematical model.

Let us calculate the average approximation error.

Table 3.7. Calculation of the average approximation error.

The obtained value exceeds the permissible limit of (12…15)%, which indicates a significant average deviation of the calculated data from the actual data on which the econometric model is built.
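For reference, a minimal sketch of the calculation carried out in Table 3.7, the average approximation error A = (100%/n)·Σ|y − ŷ|/|y| (the actual and calculated values below are placeholders):

```python
# Average approximation error A = (100/n) * sum(|y_i - y_hat_i| / |y_i|), in percent.
y     = [4.2, 3.9, 4.5, 3.1, 3.6]   # actual values (placeholders)
y_hat = [3.8, 3.7, 3.6, 3.5, 3.4]   # values calculated from the regression equation

A = 100.0 / len(y) * sum(abs(yi - fi) / abs(yi) for yi, fi in zip(y, y_hat))
print(f"A = {A:.1f}%")   # values above roughly (12...15)% signal a poor fit
```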

The reliability of the statistical modeling is assessed on the basis of Fisher's F-criterion. The theoretical (calculated) value of the Fisher criterion Fcalc is determined as the ratio of the factorial and residual variances, each calculated per one degree of freedom, according to the formula

Fcalc = (r² / (1 − r²)) · (n − m − 1) / m,

where n is the number of observations;

m is the number of explanatory variables (for the considered example, m = 1).

The critical value Fcrit is determined from statistical tables and for the significance level α = 0.05 is equal to 10.13. Since Fcalc < Fcrit, the null hypothesis cannot be rejected, and the regression equation as a whole is recognized as statistically insignificant.
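The same check can be sketched in a few lines of Python; scipy merely replaces the statistical tables used in the text, and n is taken as 5, matching the five regions listed in Table 3.5 (and the tabulated Fcrit = 10.13 for 1 and 3 degrees of freedom):

```python
from scipy.stats import f

n, m = 5, 1        # number of observations and of explanatory variables
r2 = 0.096         # coefficient of determination obtained above

F_calc = (r2 / (1.0 - r2)) * (n - m - 1) / m
F_crit = f.ppf(1 - 0.05, dfn=m, dfd=n - m - 1)   # about 10.13 for (1, 3) degrees of freedom
print(F_calc, F_crit, "significant" if F_calc > F_crit else "insignificant")
```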

15. Building a multiple regression model without using EXCEL.

Using the statistical material given in Table 3.8, you must:

1. Build a linear multiple regression equation, explain the economic meaning of its parameters.

2. Give a comparative assessment of the closeness of the relationship between the factors and the resulting feature using the average (general) elasticity coefficients.

3. Assess the statistical significance of the regression coefficients using the t-test, and test the null hypothesis that the equation is insignificant using the F-test.

4. Evaluate the quality of the equation by determining the average approximation error.

Table 3.8. Initial data.

Net income, mln USD

Capital turnover, mln USD

Capital employed, mln USD

To determine the unknown parameters b0, b1, b2 of the multiple linear regression equation, we use the standard system of normal equations, which has the form

n·b0 + b1·Σx1 + b2·Σx2 = Σy,
b0·Σx1 + b1·Σx1² + b2·Σx1x2 = Σx1y,
b0·Σx2 + b1·Σx1x2 + b2·Σx2² = Σx2y.        (3.11)
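In code, system (3.11) can be assembled from the column sums and solved numerically; the sketch below uses illustrative data (not those of Table 3.9), and numpy's solver performs the same Gaussian elimination that is carried out by hand further on.

```python
import numpy as np

# Illustrative data: y = net income, x1 = capital turnover, x2 = capital employed.
y  = np.array([  6.6,   3.0,   6.5,   3.3,   0.1,   3.6,   1.5,   5.5,   2.4,   3.0])
x1 = np.array([375.0, 281.0, 283.0, 301.0, 302.0, 304.0, 305.0, 557.0, 510.0, 488.0])
x2 = np.array([ 83.0,  88.0, 132.0, 145.0, 120.0, 150.0, 185.0, 234.0, 245.0, 200.0])

n = len(y)
# Coefficient matrix and right-hand side of system (3.11).
A = np.array([[n,        x1.sum(),       x2.sum()],
              [x1.sum(), (x1**2).sum(),  (x1*x2).sum()],
              [x2.sum(), (x1*x2).sum(),  (x2**2).sum()]])
rhs = np.array([y.sum(), (x1*y).sum(), (x2*y).sum()])

b0, b1, b2 = np.linalg.solve(A, rhs)   # Gaussian elimination under the hood
print(f"y = {b0:.3f} + {b1:.5f}*x1 + {b2:.5f}*x2")
```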

To solve this system, it is first necessary to determine the values of Σx1², Σx2², Σx1y, Σx2y and Σx1x2. These values are computed from the table of initial data, supplementing it with the appropriate columns (Table 3.9).

Table 3.9. For the calculation of the regression coefficients.

Then system (3.11) takes the form

To solve this system, we use the Gauss method, which consists in the successive elimination of unknowns: we divide the first equation of the system by 10, then multiply the resulting equation by 370.6 and subtract it from the second equation of the system, and then multiply the resulting equation by 158.20 and subtract it from the third equation of the system. Repeating this algorithm for the transformed second and third equations of the system, we obtain:


After transformation we have:

Then, finally, the dependence of net income on capital turnover and capital employed in the form of a linear multiple regression equation has the form:

From the resulting econometric equation it can be seen that with an increase in capital employed net income increases, while with an increase in capital turnover net income decreases. In addition, the larger the regression coefficient, the greater the influence of the explanatory variable on the dependent variable. In this example, the regression coefficient b2 of capital employed is greater than the coefficient b1 of capital turnover; therefore, capital employed has a much greater impact on net income than capital turnover. To quantify this conclusion, we determine the partial elasticity coefficients.

The analysis of the obtained results also shows that the used capital has a greater impact on net income. So, in particular, with an increase in capital employed by 1%, net income increases by 1.17%. At the same time, with an increase in capital turnover by 1%, net income decreases by 0.5%.
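The partial elasticity coefficients quoted here are computed as Ej = bj·x̄j/ȳ. A minimal sketch, in which the regression coefficients and the mean values are placeholders rather than the numbers of Table 3.9:

```python
# Partial elasticity: E_j = b_j * mean(x_j) / mean(y).
b1, b2 = -0.005, 0.025                    # illustrative regression coefficients
x1_mean, x2_mean, y_mean = 380.0, 160.0, 3.4   # illustrative mean values

E1 = b1 * x1_mean / y_mean
E2 = b2 * x2_mean / y_mean
print(f"E1 = {E1:.2f}: net income changes by {E1:.2f}% per 1% change in capital turnover")
print(f"E2 = {E2:.2f}: net income changes by {E2:.2f}% per 1% change in capital employed")
```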

The theoretical (calculated) value of the Fisher criterion Fcalc is compared with the critical value. The critical value Fcrit is determined from statistical tables and for the significance level α = 0.05 is equal to 4.74. Since Fcalc > Fcrit, the null hypothesis is rejected, and the resulting regression equation is considered statistically significant.

The assessment of the statistical significance of the regression coefficients by the t-criterion reduces to comparing the numerical values of these coefficients with the magnitudes of their random errors, according to the relation:

The working formula for calculating the theoretical value of the t-statistic is:

(3.13)

where the pair correlation coefficients and the multiple correlation coefficient are calculated from the dependencies:

Then the theoretical (calculated) values ​​of t-statistics are respectively equal to:

Since the critical value of the t-statistic, determined from statistical tables for the significance level α = 0.05 and equal to tcrit = 2.36, is greater in absolute value than tb1 = −1.798, the null hypothesis is not rejected: the explanatory variable x1 is statistically insignificant and can be excluded from the regression equation. Conversely, for the second regression coefficient |tb2| > tcrit (3.3 > 2.36), and the explanatory variable x2 is statistically significant.
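A sketch of this comparison, with scipy standing in for the statistical tables and the t-statistics taken as quoted above:

```python
from scipy.stats import t

n, m = 10, 2
t_b1, t_b2 = -1.798, 3.3                     # calculated t-statistics from the example
t_crit = t.ppf(1 - 0.05 / 2, df=n - m - 1)   # two-sided, 7 degrees of freedom -> about 2.36

for name, t_calc in (("x1", t_b1), ("x2", t_b2)):
    verdict = "significant" if abs(t_calc) > t_crit else "insignificant"
    print(f"{name}: |t| = {abs(t_calc):.3f}, t_crit = {t_crit:.2f} -> {verdict}")
```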

Let's calculate the average approximation error.

Table 3.10. Calculation of the average approximation error.

Then the average approximation error is equal to

The obtained value does not exceed the allowable limit equal to (12…15)%.

16. History of the development of the theory of measurements

At first, the theory of measurements (TI) developed as a theory of psychophysical measurements. In his post-war publications, the American psychologist S.S. Stevens focused on measurement scales. In the second half of the 20th century, the scope of TI expanded rapidly. One of the volumes of the "Encyclopedia of Psychological Sciences" published in the USA in the 1950s was called "Psychological Measurements". The compilers of that publication extended the scope of TI from psychophysics to psychology in general. In the article "Fundamentals of the theory of measurements" in that collection, the presentation proceeded at an abstract mathematical level, without reference to any specific field of application. The emphasis was placed on "homomorphisms of empirical systems with relations into numerical ones" (there is no need to go into these mathematical terms here), and the mathematical complexity of the presentation increased in comparison with the works of S.S. Stevens.

In one of the first domestic articles on TI (late 1960s), it was established that the points assigned by experts when evaluating objects of expertise are, as a rule, measured on an ordinal scale. The works that appeared in the early 1970s led to a significant expansion of the area of application of TI: it was applied to pedagogical qualimetry (measuring the quality of students' knowledge), in system studies, in various problems of the theory of expert assessments, for aggregating product quality indicators, in sociological studies, and so on.

Two main problems were put forward in TI: establishing the type of scale on which specific data are measured, and the search for data analysis algorithms whose result does not change under any allowable transformation of the scale (i.e., is invariant with respect to such a transformation). Ordinal scales in geography are the Beaufort wind scale ("calm", "weak wind", "moderate wind", etc.) and the scale of earthquake intensity. Obviously, it cannot be argued that an earthquake of intensity 2 (a lamp swayed under the ceiling) is exactly 5 times weaker than an earthquake of intensity 10 (complete destruction of everything on the surface of the earth).

In medicine, ordinal scales are the scale of stages of hypertension (according to Myasnikov), the scale of degrees of heart failure (according to Strazhesko-Vasilenko-Lang), the scale of severity of coronary insufficiency (according to Fogelson), etc. All these scales are built according to the scheme: the disease has not been detected; the first stage of the disease; the second stage; the third stage, and so on. Sometimes stages 1a, 1b, etc. are distinguished. Each stage has a medical characteristic peculiar to it alone. When describing disability groups, the numbers are used in the reverse order: the most severe is the first disability group, then the second, and the mildest is the third.

The house numbers are also measured in an ordinal scale - they show the order in which the houses are along the street. Volume numbers in a writer's collected works or case numbers in an enterprise's archive are usually associated with the chronological order in which they were created.

When assessing the quality of products and services, ordinal scales are popular in the so-called qualimetry (literal translation - quality measurement). Namely, a unit of output is assessed as good or bad. In a more thorough analysis, a scale with three gradations is used: there are significant defects - there are only minor defects - there are no defects. Sometimes four gradations are used: there are critical defects (making it impossible to use) - there are significant defects - only minor defects are present - there are no defects. The product grade has a similar meaning - the highest grade, the first grade, the second grade, ...

When assessing environmental impacts, the first, most generalized assessment is usually ordinal, for example: the natural environment is stable - the natural environment is oppressed (degrading). The environmental-medical scale is similar: there is no pronounced impact on people's health - a negative impact on health is noted.

The ordinal scale is also used in other areas. In econometrics, these are primarily various methods of expert assessments.

All measurement scales are divided into two groups: scales of qualitative attributes and scales of quantitative attributes. The ordinal scale and the scale of names (nominal scale) are the main scales of qualitative attributes; therefore, in many specific areas the results of qualitative analysis can be considered as measurements on these scales. Scales of quantitative attributes are the scales of intervals, ratios, differences, and the absolute scale. On the interval scale one measures, for example, the value of potential energy or the coordinate of a point on a straight line. In these cases, neither a natural reference point nor a natural unit of measurement can be indicated on the scale: the researcher must set the reference point and choose the unit of measurement himself. Admissible transformations in the interval scale are increasing linear transformations, i.e. linear functions. The Celsius and Fahrenheit temperature scales are related by precisely such a relationship: °C = 5/9·(°F − 32), where °C is the temperature (in degrees) on the Celsius scale and °F is the temperature on the Fahrenheit scale.

Of the quantitative scales, the most common in science and practice are ratio scales. They have a natural reference point, zero (the absence of the quantity), but no natural unit of measurement. Most physical quantities are measured on a ratio scale: body mass, length, charge, as well as prices in the economy. Admissible transformations in the ratio scale are similarity transformations (changes of scale only), in other words, increasing linear transformations without an intercept, such as converting prices from one currency to another at a fixed rate. Suppose we compare the economic efficiency of two investment projects using prices in rubles, and let the first project be better than the second. Now let us switch to China's currency, the yuan, using a fixed exchange rate. Obviously, the first project should again turn out to be more profitable than the second. However, calculation algorithms do not automatically ensure that this condition is fulfilled, and it is necessary to check it. The results of such a check for average values are described below.

In the scale of differences there is a natural unit of measurement, but there is no natural reference point. Time is measured on a scale of differences, if the year (or day - from noon to noon) is taken as a natural unit of measurement, and on a scale of intervals in the general case. At the present level of knowledge, a natural reference point cannot be specified. Different authors calculate the date of the creation of the world in different ways, as well as the moment of the Nativity of Christ.

For the absolute scale only, the measurement results are numbers in the usual sense of the word, such as the number of people in a room. For an absolute scale, only the identity transformation is allowed.

In the process of development of the corresponding field of knowledge, the type of scale may change. So, at first the temperature was measured on an ordinal scale (colder - warmer). Then - on the interval scale (Celsius, Fahrenheit, Reaumur). Finally, after the discovery of absolute zero, temperature can be considered measured on a ratio scale (the Kelvin scale). It should be noted that sometimes there are disagreements among specialists as to which scales should be used to consider certain real quantities as measured. In other words, the measurement process includes the definition of the type of scale (together with the rationale for choosing a particular type of scale). In addition to the six main types of scales listed, other scales are sometimes used.

17. Invariant algorithms and mean values.

Let us formulate the main requirement for data analysis algorithms in TI: conclusions drawn on the basis of data measured on a scale of a certain type should not change with an acceptable transformation of the measurement scale of these data. In other words, the conclusions must be invariant with respect to the allowed scale transformations.

Thus, one of the main goals of the theory of measurements is to combat the subjectivity of the researcher in assigning numerical values to real objects. Distances can be measured in arshins, meters, microns, miles, parsecs and other units of measurement; mass (weight) in pounds, kilograms and other units; prices for goods and services can be quoted in yuan, rubles, tenge, hryvnia, lats, kroons, marks, US dollars and other currencies (at specified conversion rates). Let us emphasize a very important, albeit quite obvious, circumstance: the choice of units of measurement depends on the researcher, i.e. it is subjective. Statistical inferences can be adequate to reality only when they do not depend on which unit of measurement the researcher prefers, i.e. when they are invariant under admissible transformations of the scale. Of the many algorithms for econometric data analysis, only a few satisfy this condition. Let us show this using the comparison of average values as an example.

Let X1, X2, ..., Xn be a sample of size n. The arithmetic mean is used most often. Its use is so commonplace that the second word of the term is often omitted, and one speaks of the average salary, average income and other averages for specific economic data, meaning by "average" the arithmetic mean. This tradition can lead to erroneous conclusions. Let us show this by the example of calculating the average wages (average income) of the employees of a conditional enterprise. Out of 100 employees, only 5 have wages exceeding the arithmetic mean, and the wages of the remaining 95 are significantly less than it. The reason is obvious: the salary of one person, the general director, exceeds the salaries of the 95 workers, low-skilled and highly skilled workers, engineers and office staff. The situation resembles the well-known story about a hospital with 10 patients: 9 of them have a temperature of 40°C, and one has already passed away and lies in the morgue at 0°C. Meanwhile, the average temperature over the hospital is 36°C: couldn't be better!

Thus, the arithmetic mean can be used only for fairly homogeneous populations (without large outliers in one direction or the other). Which averages, then, should be used to describe wages? It is quite natural to use the median: the arithmetic mean of the 50th and 51st salaries when they are arranged in non-decreasing order. First come the salaries of 40 low-skilled workers, and then, from the 41st to the 70th employee, the wages of highly skilled workers. Consequently, the median falls precisely on them and is equal to 200. For 50 employees the salary does not exceed 200, and for the other 50 it is at least 200, so the median shows the "center" around which the bulk of the studied values are grouped. Another average is the mode, the most frequently occurring value; in the case under consideration this is the wage of the low-skilled workers, i.e. 100. Thus, to describe the salaries we have three averages: the mode (100 units), the median (200 units) and the arithmetic mean (400 units).
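A short sketch reproducing these three averages; the pay levels for engineers, managers and the director are assumptions chosen only so that the mode, median and arithmetic mean come out at 100, 200 and 400, as in the text:

```python
from statistics import mean, median, mode

# 40 low-skilled workers at 100, 30 highly skilled at 200, then 25 engineers/office staff,
# 4 managers and one general director (the last three pay levels are made up).
salaries = [100] * 40 + [200] * 30 + [320] * 25 + [500] * 4 + [20000]

print(mode(salaries), median(salaries), mean(salaries))   # mode 100, median 200, mean 400
```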

For the distributions of income and wages observed in real life, the same pattern is true: the mode is less than the median, and the median is less than the arithmetic mean.

Why are averages used in economics? Usually, in order to replace a set of numbers with a single number, to compare the sets using averages. Let, for example, Y 1 , Y 2 ,..., Y n be a set of experts' assessments "given" to one object of expertise (for example, one of the options for the company's strategic development), Z 1, Z 2 ,..., Z n - the second (another variant of such development). How can these aggregates be compared? Obviously, the easiest way is by averages.

How are averages calculated? Various types of averages are known: the arithmetic mean, the median, the mode, the geometric mean, the harmonic mean, the root mean square. Recall that the general concept of an average value was introduced by the French mathematician of the first half of the 19th century, Academician Augustin Cauchy. It is as follows: an average value is any function Ф(X1, X2, ..., Xn) such that, for all possible values of the arguments, the value of this function is not less than the minimum of the numbers X1, X2, ..., Xn and not more than the maximum of these numbers. All the types of averages listed above are Cauchy means.

Under an admissible scale transformation, the value of the mean obviously changes. But the conclusions about which population has the larger average and which the smaller one should not change (in accordance with the requirement of invariance of conclusions, adopted as the main requirement in TI). Let us formulate the corresponding mathematical problem: to find the form of the average values whose comparison result is stable with respect to admissible scale transformations.

Let Ф(X1, X2, ..., Xn) be a Cauchy mean, and let the average for the first population be less than the average for the second population. Then, according to TI, for the result of comparing the means to be stable, it is necessary that for any admissible transformation g from the group of admissible transformations of the corresponding scale, the average of the transformed values of the first population is also less than the average of the transformed values of the second population. Moreover, this condition must hold for any two collections Y1, Y2, ..., Yn and Z1, Z2, ..., Zn and, we recall, for any admissible transformation. Averages satisfying this condition will be called admissible (in the corresponding scale). According to TI, only such averages may be used in the analysis of expert opinions and other data measured on the scale under consideration.

With the help of mathematical theory, developed in the 1970s, it is possible to describe the form of admissible means in the main scales. It is clear that for data measured in the scale of names, only the mode is suitable as an average.

18. Average values ​​in an ordinal scale

Let's consider the processing of expert opinions measured in an ordinal scale. The following assertion is true.

Theorem 1. Of all the Cauchy means, only the members of the variational series (order statistics) are admissible averages in the ordinal scale.

Theorem 1 is valid under the condition that the mean Ф(X1, X2, ..., Xn) is a continuous (in the totality of the variables) and symmetric function. The latter means that when the arguments are permuted, the value of the function Ф(X1, X2, ..., Xn) does not change. This condition is quite natural, because we are finding the average value of a totality (a set), not of a sequence, and a set does not change depending on the order in which we list its elements.

According to Theorem 1, for data measured on an ordinal scale one can use, in particular, the median as the average (for an odd sample size). For an even sample size, one of the two central members of the variational series should be used: as they are sometimes called, the left median or the right median. The mode can also be used, since it is always a member of the variational series. But one can never calculate the arithmetic mean, the geometric mean, etc.
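The point of Theorem 1 can be illustrated with a small sketch: an admissible (strictly increasing) relabelling of ordinal scores can reverse the comparison of arithmetic means, while the comparison of medians is preserved. The scores and the transformation below are made up for the illustration.

```python
from statistics import mean, median

Y = [1, 1, 5]    # expert scores for the first object (ordinal scale)
Z = [2, 3, 3]    # expert scores for the second object

g = {1: 1, 2: 2, 3: 3, 4: 4, 5: 100}   # a strictly increasing (admissible) relabelling

print(mean(Y) < mean(Z), mean(g[v] for v in Y) < mean(g[v] for v in Z))
# True False: the ordering of the arithmetic means flips under g
print(median(Y) < median(Z), median(g[v] for v in Y) < median(g[v] for v in Z))
# True True: the medians compare the same way before and after the transformation
```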

The following theorem is true.

Theorem 2. Let Y1, Y2, ..., Ym be independent identically distributed random variables with distribution function F(x), and Z1, Z2, ..., Zn be independent identically distributed random variables with distribution function H(x); let the samples Y1, Y2, ..., Ym and Z1, Z2, ..., Zn be independent of each other and M[Y] > M[Z]. In order for the probability of the event {the arithmetic mean of g(Y1), ..., g(Ym) is greater than the arithmetic mean of g(Z1), ..., g(Zn)} to tend to 1 as min(m, n) → ∞ for any strictly increasing continuous function g satisfying a boundedness (growth) condition, it is necessary and sufficient that the inequality F(x) ≤ H(x) holds for all x, and that there exists a number x0 for which F(x0) < H(x0).

Note. The boundedness condition on g is purely intra-mathematical. In fact, the function g is an arbitrary admissible transformation in the ordinal scale.

According to Theorem 2, the arithmetic mean can also be used on an ordinal scale if samples from two distributions that satisfy the inequality given in the theorem are compared. Simply put, one of the distribution functions must always lie above the other. Distribution functions cannot intersect, they are only allowed to touch each other. This condition is satisfied, for example, if the distribution functions differ only in the shift:

F(x) = H(x + ∆)

for some ∆.

The last condition is satisfied if two values ​​of a certain quantity are measured using the same measuring instrument, in which the distribution of errors does not change when moving from measuring one value of the quantity under consideration to measuring another.

Kolmogorov averages

A generalization of several of the averages listed above is the Kolmogorov mean. For numbers X1, X2, ..., Xn, the Kolmogorov mean is calculated by the formula

G((F(X1) + F(X2) + ... + F(Xn)) / n),

where F is a strictly monotonic function (i.e. strictly increasing or strictly decreasing),

G is the inverse function of F.

Among the Kolmogorov means there are many familiar averages. Thus, if F(x) = x, the Kolmogorov mean is the arithmetic mean; if F(x) = ln x, the geometric mean; if F(x) = 1/x, the harmonic mean; if F(x) = x², the root mean square, and so on. The Kolmogorov mean is a special case of the Cauchy mean. On the other hand, such popular averages as the median and the mode cannot be represented as Kolmogorov means. The following assertions are proved in the monograph.
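A small sketch of the Kolmogorov mean, with F supplied together with its inverse G; the choices of F below reproduce the arithmetic, geometric, harmonic and quadratic means just mentioned:

```python
import math

def kolmogorov_mean(xs, F, G):
    """G of the arithmetic mean of F(x_i)."""
    return G(sum(F(x) for x in xs) / len(xs))

xs = [1.0, 2.0, 4.0]
print(kolmogorov_mean(xs, lambda x: x,       lambda s: s))        # arithmetic mean  ~2.333
print(kolmogorov_mean(xs, math.log,          math.exp))           # geometric mean    2.0
print(kolmogorov_mean(xs, lambda x: 1.0 / x, lambda s: 1.0 / s))  # harmonic mean    ~1.714
print(kolmogorov_mean(xs, lambda x: x * x,   math.sqrt))          # root mean square ~2.646
```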

Theorem 3. If certain intra-mathematical regularity conditions hold, then in the interval scale, of all the Kolmogorov means, only the arithmetic mean is admissible. Thus, the geometric mean or the root mean square of temperatures (in degrees Celsius) or of distances is meaningless. The arithmetic mean should be used as the average; the median or the mode can also be used.

Theorem 4. If certain intra-mathematical regularity conditions hold, then in the ratio scale, of all the Kolmogorov means, only the power means, with F(x) = x^c, and the geometric mean are admissible.

Comment. The geometric mean is the limit of the power means as c → 0.

Are there Kolmogorov means that should not be used in the ratio scale? Of course there are, for example F(x) = e^x.

Similarly to average values, other statistical characteristics can be studied: indicators of spread, of association, of distance, etc. It is easy to show, for example, that the correlation coefficient does not change under any admissible transformation in the interval scale, as does the ratio of variances; the variance does not change in the scale of differences, and the coefficient of variation does not change in the ratio scale, etc.

The above results on averages are widely used not only in economics, management, the theory of expert assessments and sociology, but also in engineering, for example for analyzing methods of aggregating sensor readings in the automated process control systems (APCS) of blast furnaces. TI is of great applied importance in problems of standardization and quality management, in particular in qualimetry, where interesting theoretical results have been obtained. Thus, for example, any change in the weighting coefficients of individual product quality indicators leads to a change in the ordering of products by the weighted average (this theorem was proved by Prof. V.V. Podinovsky). The above brief information about TI and its methods therefore links, in a certain sense, economics, sociology and the engineering sciences and provides an adequate apparatus for solving complex problems that previously did not yield to effective analysis; moreover, it opens the way to building realistic models and solving forecasting problems.

22. Paired Linear Regression

Let us now turn to a more detailed study of the simplest case of pairwise linear regression. Linear regression is described by the simplest functional dependence in the form of a straight line equation and is characterized by a transparent interpretation of the model parameters (equation coefficients). The right side of the equation allows you to obtain the theoretical (calculated) values ​​of the resulting (explained) variable from the given values ​​of the regressor (explanatory variable). These values ​​are sometimes also called predictive (in the same sense), i.e. obtained from theoretical formulas. However, when putting forward a hypothesis about the nature of the dependence, the coefficients of the equation still remain unknown. Generally speaking, obtaining approximate values ​​of these coefficients is possible by various methods.

But the most important and widespread of them is the method of least squares (LSM). It is based (as already explained) on the requirement to minimize the sum of squared deviations of the actual values ​​of the resulting feature from the calculated (theoretical) ones. Instead of theoretical values ​​(to obtain them), the right-hand sides of the regression equation are substituted into the sum of squared deviations, and then the partial derivatives of this function are found (the sum of squared deviations of the actual values ​​of the effective feature from the theoretical ones). These partial derivatives are taken not with respect to the variables x and y, but with respect to the parameters a and b. Partial derivatives are equated to zero and after simple but cumbersome transformations, a system of normal equations is obtained to determine the parameters. Coefficient with variable x, i.e. b is called the regression coefficient, it shows the average change in the result with a change in the factor by one unit. The parameter a may not have an economic interpretation, especially if the sign of this coefficient is negative.

Pairwise linear regression is used to study the consumption function. The regression coefficient in the consumption function is used to calculate the multiplier. Almost always, the regression equation is supplemented with an indicator of the tightness of the connection. For the simplest case of linear regression, this indicator of the tightness of the relationship is the linear correlation coefficient. But since the linear correlation coefficient characterizes the closeness of the relationship of features in a linear form, the proximity of the absolute value of the linear correlation coefficient to zero does not yet serve as an indicator of the absence of a relationship between the features.

With a different choice of the model specification, and consequently of the type of dependence, the actual relationship may turn out to be quite close, with a correlation index close to unity. The quality of the fit of the linear function, however, is determined using the square of the linear correlation coefficient, the coefficient of determination. It characterizes the proportion of the variance of the resulting attribute y explained by the regression in the total variance of the resulting attribute. The value that complements the coefficient of determination to 1 characterizes the proportion of the variance caused by the influence of other factors not taken into account in the model (the residual variance).

Paired regression is represented by a relationship between two variables y and x of the form y = f(x) + ε,

where y is the dependent variable (outcome feature), and x is the independent variable (explanatory variable, or feature factor). There is linear regression and non-linear regression. Linear regression is described by an equation of the form:

y = a + bx + ε.

Nonlinear regression, in turn, can be non-linear with respect to the explanatory variables included in the analysis, but linear with respect to the estimated parameters. Or maybe the regression is non-linear in terms of the estimated parameters. As examples of a regression that is non-linear in the explanatory variables, but linear in the estimated parameters, one can indicate polynomial dependencies of various degrees (polynomials) and an equilateral hyperbola.

Regression that is non-linear in the estimated parameters includes a power-law dependence with respect to the parameter (the parameter is in the exponent), an exponential dependence in which the parameter is the base of the power, and an exponential dependence in which the entire linear combination of parameters is in the exponent. Note that in all three of these cases the random component (random residual) ε enters the right-hand side of the equation as a factor, not as a term, i.e. multiplicatively. The average deviation of the calculated values of the resulting feature from the actual ones is characterized by the average approximation error. It is expressed as a percentage and should not exceed 7-8%. This average approximation error is simply the percentage average of the relative magnitudes of the differences between the actual and calculated values.

Of great importance is the average coefficient of elasticity, which serves as an important characteristic of many economic phenomena and processes. It is calculated as the product of the value of the derivative of this functional dependence by the ratio of the average value x to the average value y. The elasticity coefficient shows how many percent, on average, the result y will change from its average value when the factor x changes by 1% from its (factor x) average value.

With paired regression and with multiple regression (when there are many factors) and with residual variance, the tasks of analysis of variance are closely related. Analysis of variance examines the variance of the dependent variable. In this case, the total sum of squared deviations is divided into two parts. The first term is the sum of squared deviations due to regression, or explained (factorial). The second term is the residual sum of squared deviations not explained by factorial regression.

The share of the variance explained by the regression in the total variance of the resulting feature y is characterized by the coefficient (index) of determination, which is nothing more than the ratio of the sum of squared deviations due to regression to the total sum of squared deviations (the first term to the entire sum).

When the model parameters (coefficients of unknowns) are determined using the least squares method, then, in essence, some random variables are found (in the process of obtaining estimates). Of particular importance is the estimation of the regression coefficient, which is some special form of a random variable. The properties of this random variable depend on the properties of the remainder term in the equation (in the model). Let us consider the explanatory variable x as a non-random exogenous variable for a paired linear regression model. It just means that the values ​​of the variable x in all observations can be considered predetermined and have nothing to do with the dependence under study. Thus, the actual value of the explained variable consists of two components: a non-random component and a random component (residual term).

On the other hand, the regression coefficient determined by the method of least squares (OLS) is equal to the quotient of the covariance of the variables x and y divided by the variance of the variable x. Therefore it also contains a random component: the covariance depends on the values of the variable y, and the values of y depend on the values of the random residual term ε. Further, it is easy to show that the covariance of the variables x and y is equal to the product of the regression coefficient β and the variance of the variable x, plus the covariance of the variables x and ε. Thus, the estimate of the regression coefficient β is equal to this unknown regression coefficient itself plus the quotient of the covariance of the variables x and ε divided by the variance of the variable x. That is, the estimate of the regression coefficient b obtained from any sample is represented as the sum of two terms: a constant equal to the true value of the coefficient β (beta), and a random component that depends on the covariance of the variables x and ε.
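This decomposition is easy to verify numerically; in the simulation sketch below (all numbers are generated, not taken from the text) the OLS slope b = cov(x, y)/var(x) coincides exactly with β + cov(x, ε)/var(x):

```python
import numpy as np

rng = np.random.default_rng(0)
beta0, beta = 2.0, 0.5                    # "true" coefficients of the model
x = rng.uniform(0.0, 10.0, size=200)      # regressor values, treated as fixed
eps = rng.normal(0.0, 1.0, size=200)      # random residual term
y = beta0 + beta * x + eps

var_x = x.var()
cov_xy = np.cov(x, y, bias=True)[0, 1]
cov_xe = np.cov(x, eps, bias=True)[0, 1]

b = cov_xy / var_x                        # OLS estimate of the slope
print(b, beta + cov_xe / var_x)           # the two numbers coincide
```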

23. Mathematical conditions of Gauss-Markov and their application.

For a regression analysis based on ordinary least squares to give the best results, the random term must satisfy the four Gauss-Markov conditions.

The mathematical expectation of the random term is zero, i.e. it is unbiased. If the regression equation includes a constant term, then it is natural to consider this requirement fulfilled, since it is the constant term that should absorb any systematic trend in the values of the variable y, which, on the contrary, the explanatory variables of the regression equation should not contain.

The variance of the random term is constant for all observations.

The covariance of the values ​​of random variables forming the sample must be equal to zero, i.e. there is no systematic relationship between the values ​​of the random term in any two specific observations. Random members must be independent of each other.

The distribution law of the random term must be independent of the explanatory variables.

Moreover, in many applications the explanatory variables are not stochastic, i.e. they do not have a random component. The value of any independent variable in each observation is considered exogenous, completely determined by external causes not taken into account in the regression equation.

Together with the indicated Gauss-Markov conditions, it is also assumed that the random term has a normal distribution. It is valid under very broad conditions and is based on the so-called central limit theorem (CLT). The essence of this theorem is that if a random variable is the general result of the interaction of a large number of other random variables, none of which has a predominant influence on the behavior of this general result, then such a resulting random variable will be described by an approximately normal distribution. This closeness to the normal distribution allows us to use the normal distribution and, in a sense, its generalization, the Student distribution, which differs noticeably from the normal distribution mainly on the so-called “tails”, i.e. for small values ​​of the sample size. It is also important that if the random term is normally distributed, then the regression coefficients will also be distributed according to the normal law.
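A short Monte Carlo sketch of this remark: when the random term is normal and the Gauss-Markov conditions hold, the OLS slope estimates scatter around the true value and their distribution looks approximately normal (everything below is simulated):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 30)    # exogenous regressor, kept fixed across replications
true_b = 1.5
estimates = []
for _ in range(2000):
    y = 4.0 + true_b * x + rng.normal(0.0, 2.0, size=x.size)  # normal, homoscedastic errors
    estimates.append(np.cov(x, y, bias=True)[0, 1] / x.var())

estimates = np.array(estimates)
print(estimates.mean(), estimates.std())  # the mean is close to 1.5 (unbiasedness)
```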

The estimated regression curve (regression equation) allows solving the problem of the so-called point forecast. In such calculations, some value of x outside the studied observation interval is taken and substituted into the right-hand side of the regression equation (the extrapolation procedure). Since the estimates of the regression coefficients are already known, it is possible to calculate the value of the explained variable y corresponding to the chosen value of x. Naturally, in accordance with the meaning of prediction (forecasting), the calculations are carried out forward, into the region of future values.

However, since the coefficients were determined with a certain error, it is not the point estimate (point forecast) for the effective feature that is of interest, but the knowledge of the limits within which the values ​​of the productive feature corresponding to the taken value of the factor x will lie with a certain probability.

To do this, the value of the standard error (standard deviation) is calculated. It can be obtained, in the spirit of what has just been said, as follows. The expression of the free term a through the average values is substituted into the linear regression equation. It then turns out that the standard error depends on the error of the mean of the resulting factor y and, additively, on the error of the regression coefficient b. Simply put, the square of this standard error is equal to the sum of the squared error of the mean of y and the product of the squared error of the regression coefficient by the squared deviation of the factor x from its mean. Further, the first term, according to the laws of statistics, is equal to the quotient of the variance of the general population divided by the size (volume) of the sample.

Instead of the unknown variance, the sample variance is used as an estimate. Accordingly, the error of the regression coefficient is defined as the quotient of dividing the sample variance by the variance of the x factor. You can get the value of the standard error (standard deviation) and other considerations, more independent of the linear regression model. For this, the concept of average error and marginal error and the relationship between them are used.

But even after obtaining the standard error, the question remains about the boundaries within which the predicted value will lie. In other words, about the interval of measurement error, in the natural assumption in many cases that the middle of this interval is given by the calculated (average) value of the effective factor y. Here the central limit theorem comes to the rescue, which just indicates with what probability the unknown value is within this confidence interval.

In essence, the standard error formula, regardless of how and in what form it is obtained, characterizes the error in the position of the regression line. The value of the standard error reaches a minimum when the value of the factor x coincides with the average value of the factor.
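A sketch of the formula just described: the squared standard error of the predicted mean value of y at a point x0 is s²·(1/n + (x0 − x̄)²/Σ(x − x̄)²), which is smallest when x0 = x̄. The data below are simulated for the illustration.

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 10.0, 25)
y = 1.0 + 0.8 * x + rng.normal(0.0, 1.0, x.size)

n = x.size
b = np.cov(x, y, bias=True)[0, 1] / x.var()
a = y.mean() - b * x.mean()
s2 = ((y - (a + b * x)) ** 2).sum() / (n - 2)    # residual variance per degree of freedom

x0 = 7.0                                         # point at which the forecast is made
se = np.sqrt(s2 * (1.0 / n + (x0 - x.mean()) ** 2 / ((x - x.mean()) ** 2).sum()))
y0 = a + b * x0
t_crit = t.ppf(0.975, df=n - 2)
print(y0 - t_crit * se, y0 + t_crit * se)        # interval for the mean of y at x0
```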

24. Statistical testing of hypotheses and evaluation of the significance of linear regression by the Fisher criterion.

After the linear regression equation has been found, the significance of both the equation as a whole and of its individual parameters is assessed. The significance of the regression equation as a whole can be assessed using various criteria; the use of Fisher's F-criterion is quite common and effective. In this case, the null hypothesis H0 is put forward that the regression coefficient is equal to zero, i.e. b = 0, and hence the factor x has no effect on the result y. The direct calculation of the F-criterion is preceded by an analysis of variance. The central place in it is occupied by the decomposition of the total sum of squared deviations of the variable y from its mean value into two parts, the "explained" and the "unexplained":

The total sum of the squared deviations of the individual values ​​of the effective feature y from the average value y is caused by the influence of many factors.

We conditionally divide the entire set of causes into two groups: the studied factor x and other factors. If the factor does not affect the result, then the regression line on the graph is parallel to the x-axis and ŷ = ȳ. Then the entire dispersion of the resulting attribute is due to the influence of other factors, and the total sum of squared deviations coincides with the residual sum. If other factors do not affect the result, then y is functionally related to x and the residual sum of squares is zero. In this case, the sum of squared deviations explained by the regression coincides with the total sum of squares. Since not all points of the correlation field lie on the regression line, their scatter always arises both from the influence of the factor x, i.e. the regression of y on x, and from the action of other causes (unexplained variation). The suitability of the regression line for prediction depends on how much of the total variation of the trait y is accounted for by the explained variation.

Obviously, if the sum of squared deviations due to the regression is greater than the residual sum of squares, then the regression equation is statistically significant and the factor x has a significant impact on the result. This is equivalent to the coefficient of determination approaching unity. Any sum of squared deviations is related to the number of degrees of freedom, i.e. the number of independent variations of a feature. The number of degrees of freedom is related to the number of population units n and to the number of constants determined from them. In relation to the problem under study, the number of degrees of freedom should show how many independent deviations out of the n possible [(y1 − ȳ), (y2 − ȳ), ..., (yn − ȳ)] are required to form a given sum of squares. Thus, for the total sum of squares ∑(y − ȳ)², (n − 1) independent deviations are required, since in a population of n units, after calculating the average level, only (n − 1) deviations vary freely. When calculating the explained, or factorial, sum of squares ∑(ŷ − ȳ)², the theoretical (calculated) values of the effective feature ŷ found from the regression line ŷ(x) = a + bx are used.

Let us now return to the expansion of the total sum of squared deviations of the effective factor from the average of this value. This sum contains two parts already defined above: the sum of squared deviations, explained by the regression, and another sum, which is called the residual sum of squared deviations. This decomposition is related to the analysis of variance, which directly answers the fundamental question: how to evaluate the significance of the regression equation as a whole and its individual parameters? It also largely determines the meaning of this question. To assess the significance of the regression equation as a whole, the Fisher test (F-test) is used. According to the approach proposed by Fisher, a null hypothesis is put forward: the regression coefficient is equal to zero, i.e. value b=0. This means that the factor X has no effect on the result Y.

Recall that almost always the points obtained as a result of a statistical study do not lie exactly on the regression line. They are scattered, being removed more or less far from the regression line. This scattering is due to the influence of other factors, other than the explanatory factor X, that are not taken into account in the regression equation. When calculating the explained, or factorial sum of squared deviations, the theoretical values ​​of the resulting attribute found along the regression line are used.

For a given set of values ​​of the variables Y and X, the calculated value of the average value of Y in linear regression is a function of only one parameter - the regression coefficient. In accordance with this, the factorial sum of squared deviations has the number of degrees of freedom equal to 1. And the number of degrees of freedom of the residual sum of squared deviations in linear regression is n-2.

Therefore, dividing each sum of squared deviations in the original decomposition by its number of degrees of freedom, we obtain mean squared deviations (variances per one degree of freedom). Dividing the factorial variance per one degree of freedom by the residual variance per one degree of freedom, we obtain the criterion for testing the null hypothesis, the so-called F-ratio, or the criterion of the same name. If the null hypothesis is true, the factorial and residual variances simply turn out to be equal to each other.
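A sketch of this decomposition and of the resulting F-ratio on simulated data:

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 10.0, 20)
y = 3.0 + 0.6 * x + rng.normal(0.0, 1.5, x.size)

n = x.size
b = np.cov(x, y, bias=True)[0, 1] / x.var()
a = y.mean() - b * x.mean()
y_hat = a + b * x

ss_total = ((y - y.mean()) ** 2).sum()
ss_factor = ((y_hat - y.mean()) ** 2).sum()        # explained by the regression
ss_resid = ((y - y_hat) ** 2).sum()                # unexplained (residual)
print(np.isclose(ss_total, ss_factor + ss_resid))  # the decomposition holds

F_calc = (ss_factor / 1) / (ss_resid / (n - 2))    # variances per one degree of freedom
print(F_calc > f.ppf(0.95, dfn=1, dfd=n - 2))      # compare with the critical value
```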

To reject the null hypothesis, i.e. accepting the opposite hypothesis, which expresses the fact of the significance (presence) of the dependence under study, and not just a random coincidence of factors simulating a dependence that does not actually exist, it is necessary to use tables of critical values ​​of the indicated ratio. The tables determine the critical (threshold) value of the Fisher criterion. It is also called theoretical. Then, by comparing it with the corresponding empirical (actual) value of the criterion calculated from the observational data, it is checked whether the actual value of the ratio exceeds the critical value from the tables.

In more detail, this is done as follows. A significance level (the admissible probability of the null hypothesis) is chosen and the critical value of the F-criterion is found from the tables: the maximum value at which the divergence of the variances per one degree of freedom can still be random. The calculated value of the F-ratio is then recognized as reliable (i.e., as expressing a real difference between the factorial and residual variances) if it is greater than the tabular value. In that case the null hypothesis is rejected (it is not true that there are no signs of a relationship), and we conclude instead that the relationship exists and is significant (non-random).

If the value of the ratio is less than the tabular value, then the probability of the null hypothesis is higher than the specified level (which was chosen initially) and the null hypothesis cannot be rejected without a noticeable danger of obtaining an incorrect conclusion about the presence of a connection. Accordingly, the regression equation is considered to be insignificant.

The very value of the F-criterion is associated with the coefficient of determination. In addition to assessing the significance of the regression equation as a whole, the significance of individual parameters of the regression equation is also evaluated. In this case, the standard error of the regression coefficient is determined using the empirical actual standard deviation and the empirical variance per one degree of freedom. After that, Student's distribution is used to test the significance of the regression coefficient for calculating its confidence intervals.

The assessment of the significance of the regression and correlation coefficients using Student's t-test is performed by comparing the values of these quantities with their standard errors. The errors of the linear regression parameters and of the correlation coefficient are determined by the following formulas:

where S is the root mean square residual sample deviation,

r xy is the correlation coefficient.

Accordingly, the value of the standard error predicted by the regression line is given by the formula:

The ratios of the regression and correlation coefficients to their standard errors form the so-called t-statistics, and comparing the corresponding tabular (critical) value with the actual value makes it possible to accept or reject the null hypothesis. Further, to construct the confidence interval, the marginal error for each indicator is found as the product of the tabular value of the t-statistic and the average random error of the corresponding indicator; in fact, this was essentially stated just above in a slightly different way. The bounds of the confidence intervals are then obtained by subtracting the corresponding marginal error from the coefficient (the lower bound) and adding it (the upper bound).
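A sketch of these calculations under the usual textbook formulas m_b = sqrt(s²/Σ(x − x̄)²) for the regression coefficient and m_r = sqrt((1 − r²)/(n − 2)) for the correlation coefficient (simulated data; these standard expressions are assumed to be the formulas referred to above):

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(4)
x = rng.uniform(0.0, 10.0, 18)
y = 2.0 + 0.7 * x + rng.normal(0.0, 1.0, x.size)

n = x.size
b = np.cov(x, y, bias=True)[0, 1] / x.var()
a = y.mean() - b * x.mean()
s2 = ((y - (a + b * x)) ** 2).sum() / (n - 2)    # residual variance per degree of freedom

m_b = np.sqrt(s2 / ((x - x.mean()) ** 2).sum())  # standard error of the regression coefficient
r = np.corrcoef(x, y)[0, 1]
m_r = np.sqrt((1 - r ** 2) / (n - 2))            # standard error of the correlation coefficient

t_crit = t.ppf(0.975, df=n - 2)                  # tabular value of Student's t
print("t_b =", b / m_b, "interval:", (b - t_crit * m_b, b + t_crit * m_b))
print("t_r =", r / m_r)
```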

In linear regression, ∑(ŷx − ȳ)² = b²·∑(x − x̄)². It is easy to verify this by referring to the formula for the linear correlation coefficient: r²xy = b²·σ²x / σ²y,

where σ²y is the total variance of the attribute y;

σ²x is the variance of the attribute x, so that b²·σ²x is the part of the variance of y due to the factor x. Accordingly, the sum of squared deviations due to the linear regression is:

∑(ŷx − ȳ)² = b²·∑(x − x̄)².

Since, for a given set of observations of x and y, the factorial sum of squares in linear regression depends only on one constant, the regression coefficient b, this sum of squares has one degree of freedom. Consider the content of the calculated value of the attribute, ŷx. The value of ŷx is determined by the linear regression equation ŷx = a + bx.

The parameter a can be expressed as a = ȳ − b·x̄. Substituting this expression for a into the linear model, we get: ŷx = ȳ − b·x̄ + b·x = ȳ + b(x − x̄).

With a given set of variables y and x, the calculated value y x in linear regression is a function of only one parameter - the regression coefficient. Accordingly, the factorial sum of squared deviations has a number of degrees of freedom equal to 1.

There is an equality between the number of degrees of freedom of the total, factorial and residual sums of squares. The number of degrees of freedom of the residual sum of squares in linear regression is (n-2). The number of degrees of freedom for the total sum of squares is determined by the number of units, and since we use the average calculated from the sample data, we lose one degree of freedom, i.e. (n-1). So, we have two equalities: for the sums and for the number of degrees of freedom. And this, in turn, brings us back to comparable dispersions per one degree of freedom, the ratio of which gives the Fisher criterion.

25. Estimation of the significance of individual parameters of the regression equation and coefficients according to Student's criterion.

27. Linear and non-linear regression and methods of their study.

Linear regression and the methods of its study and evaluation would not be so important if, in addition to this very important, but still the simplest case, we did not use them to obtain a tool for analyzing more complex nonlinear dependencies. Nonlinear regressions can be divided into two essentially different classes. The first and simpler is the class of non-linear dependencies, in which there is non-linearity with respect to the explanatory variables, but which remain linear in terms of the parameters included in them and to be estimated. This includes polynomials of varying degrees and an equilateral hyperbola.

Such a regression, non-linear in the explanatory variables, can easily be reduced by a simple change (replacement) of variables to the usual linear regression in the new variables. The estimation of the parameters in this case is therefore performed by ordinary least squares, since the dependences are linear in the parameters. An important role in the economy is played, for example, by the non-linear dependence described by an equilateral hyperbola:

Its parameters are well estimated by OLS, and this dependence itself characterizes the relationship of the unit costs of raw materials, fuel and materials with the volume of output, of the time of circulation of goods, and of all these factors with the value of the turnover. For example, the Phillips curve characterizes the non-linear relationship between the unemployment rate and the rate of wage growth.

The situation is completely different with a regression that is non-linear in terms of the estimated parameters, for example, represented by a power function, in which the degree itself (its indicator) is a parameter, or depends on the parameter. It can also be an exponential function, where the base of the degree is a parameter and an exponential function, in which, again, the indicator contains a parameter or a combination of parameters. This class, in turn, is divided into two subclasses: one includes externally non-linear, but essentially internally linear. In this case, you can bring the model to a linear form using transformations. However, if the model is intrinsically non-linear, then it cannot be reduced to a linear function.

Thus, only intrinsically non-linear models are considered truly non-linear in regression analysis. All others, reducible to linear form by transformations, are not regarded as such, and it is they that are most often considered in econometric studies. At the same time, this does not mean that essentially non-linear dependences cannot be studied in econometrics. If the model is internally non-linear in the parameters, then iterative procedures are used to estimate the parameters, whose success depends on the form of the equation and on the features of the iterative method used.

Let us return to the dependences reduced to linear ones. Suppose a dependence is non-linear both in the parameters and in the variables, for example the power function y = a·x^β, in which the exponent β (beta) is a parameter.

Obviously, such a relation is easily converted into a linear equation by simply taking logarithms: ln y = ln a + β·ln x.

After introducing new variables denoting the logarithms, a linear equation is obtained. The regression estimation procedure then consists in calculating the new variables for each observation by taking the logarithms of the original values. The regression dependence for the new variables is then estimated. To return to the original variables, one should take the antilogarithm, i.e. return from the logarithms to the quantities themselves (after all, a logarithm is an exponent). The case of exponential functions can be treated similarly.
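A short sketch of this procedure for the power model y = a·x^β (hypothetical data): the logarithms of both variables are regressed, and the antilogarithm of the intercept returns the parameter a.

# Sketch: estimating y = a * x**beta by taking logarithms and applying OLS.
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])          # hypothetical data
y = np.array([2.1, 2.9, 4.2, 5.8, 8.3])

beta, ln_a = np.polyfit(np.log(x), np.log(y), 1)  # linear regression of ln(y) on ln(x)
a = np.exp(ln_a)                                  # antilogarithm returns the original scale
print(f"y = {a:.2f} * x^{beta:.2f}")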

For an essentially non-linear regression the usual estimation procedure cannot be applied, since the corresponding dependence cannot be converted to a linear one. The general scheme of actions in this case is as follows (a code sketch is given after the list):

1. Some plausible initial parameter values ​​are accepted;

2. Calculate the predicted Y values ​​from the actual X values ​​using these parameter values;

3. Calculate the residuals for all observations in the sample and then the sum of the squares of the residuals;

4. Small changes are made to one or more parameter estimates;

5. New predicted Y values, residuals and sum of squared residuals are calculated;

6. If the sum of squared residuals is less than before, then the new parameter estimates are better than the old ones and should be used as a new starting point;

7. Steps 4, 5 and 6 are repeated until it is no longer possible to make changes in the parameter estimates that would reduce the sum of squared residuals;

8. It is concluded that the sum of squared residuals has been minimized, and the final parameter estimates are the (non-linear) least squares estimates.
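A minimal sketch of such an iterative search for an intrinsically non-linear model y = a·(1 − exp(−b·x)), using an off-the-shelf optimizer that automates steps 1-8 (the data, the functional form and the starting values are all hypothetical):

# Sketch: iterative least-squares estimation of an intrinsically non-linear model.
import numpy as np
from scipy.optimize import least_squares

x = np.array([1.0, 2.0, 3.0, 5.0, 8.0, 12.0])    # hypothetical observations
y = np.array([1.9, 3.2, 4.1, 5.0, 5.6, 5.9])

def residuals(params):
    a, b = params
    return y - a * (1.0 - np.exp(-b * x))        # step 3: residuals for the current estimates

start = np.array([5.0, 0.5])                     # step 1: plausible initial parameter values
result = least_squares(residuals, start)         # steps 2-7: repeated small adjustments
print(result.x, (result.fun ** 2).sum())         # step 8: final estimates and the minimized sum of squares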

Among the non-linear functions that can be reduced to a linear form, the power function is widely used in econometrics. The parameter b in it has a clear interpretation: it is the coefficient of elasticity. In models that are non-linear in the estimated parameters but reducible to a linear form, OLS is applied to the transformed equations. The practical application of the logarithm, and accordingly of the exponent, is possible when the resulting attribute does not take negative values. Among the functions that use the logarithm of the resulting attribute, power-law dependences prevail in econometrics (supply and demand curves, production functions, development curves, the relationship between the labour intensity of products and the scale of production, the dependence of GNI on the level of employment, Engel curves).

28. Inverse model and its use

Sometimes the so-called inverse model is used, which is internally non-linear; in it, unlike the equilateral hyperbola, it is not the explanatory variable but the resulting attribute Y that is transformed. The inverse model therefore turns out to be internally non-linear, and the LSM requirement is satisfied not for the actual values of the resulting attribute Y but for their reciprocals. The study of correlation for non-linear regression deserves special attention. In the general case, a second-degree parabola, like polynomials of higher order, takes the form of a multiple regression equation when linearized. If a regression equation that is non-linear with respect to the explanatory variable takes the form of a linear paired regression equation when linearized, then a linear correlation coefficient can be used to assess the tightness of the relationship.

If the transformation of the regression equation into a linear form is associated with a dependent variable (resulting feature), then the linear correlation coefficient for the transformed feature values ​​gives only an approximate estimate of the relationship and does not numerically coincide with the correlation index. It should be borne in mind that when calculating the correlation index, the sums of the squared deviations of the effective feature Y are used, and not their logarithms. The assessment of the significance of the correlation index is performed in the same way as the assessment of the reliability (significance) of the correlation coefficient. The correlation index itself, as well as the determination index, is used to test the significance of the overall non-linear regression equation by Fisher's F-test.

Note that the ability to build non-linear models, both by reducing them to a linear form, and by using non-linear regression, on the one hand, increases the universality of regression analysis. On the other hand, it significantly complicates the tasks of the researcher. If we restrict ourselves to pairwise regression analysis, then we can plot Y and X observations as a scatterplot. Often several different non-linear functions approximate the observations if they lie on some curve. But in the case of multiple regression analysis, such a graph cannot be built.

When considering alternative models with the same definition of the dependent variable, the selection procedure is relatively simple. You can evaluate the regression based on all possible functions imaginable and select the function that best explains the changes in the dependent variable. It is clear that when a linear function explains about 64% of the variance in y, and a hyperbolic one 99.9%, the latter should obviously be chosen. But when different models use different functional forms, the problem of choosing a model becomes much more complicated.

29. Use of the Box-Cox test.

More generally, when considering alternative models with the same definition of the dependent variable, the choice is simple. It is most reasonable to estimate the regression on all conceivable functions and stop at the one that best explains the changes in the dependent variable. If the coefficient of determination measures in one case the proportion of the variance explained by the regression and in the other case the proportion of the explained variance of the logarithm of the dependent variable, and these values differ markedly, the choice is made without difficulty. It is another matter when these values for the two models are very close; then the problem of choice becomes much more complicated.

Then the standard procedure in the form of the Box-Cox test should be applied. If it is only necessary to compare models that use the resulting attribute and its logarithm as variants of the dependent variable, a variant known as the Zarembka procedure is used. It proposes a scaling of the Y observations that allows direct comparison of the residual sums of squares (RSS) in the linear and logarithmic models. The procedure includes the following steps (a code sketch is given after the list):

The geometric mean of the Y values in the sample is calculated; it coincides with the exponential of the arithmetic mean of the logarithms of Y;

    Observations Y are recalculated in such a way that they are divided by the value obtained at the first step;

The regression is estimated for the linear model using the scaled Y values instead of the original ones, and for the logarithmic model using the logarithm of the scaled Y values. The residual sums of squares of the two regressions are now comparable, and therefore the model with the smaller sum of squared deviations provides the better fit to the true dependence of the observed values;

To check whether one of the models provides a significantly better fit, one can use the statistic equal to half the number of observations multiplied by the logarithm of the ratio of the RSS values in the scaled regressions, taking the absolute value of this quantity.
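A minimal sketch of the scaling procedure (all data are hypothetical; rss() here is an auxiliary helper, not part of any library):

# Sketch: Zarembka scaling for comparing a linear and a logarithmic specification.
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)          # hypothetical factor
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8, 8.2, 8.9, 10.1, 11.0])  # hypothetical result

g = np.exp(np.log(y).mean())            # step 1: geometric mean of y
y_star = y / g                          # step 2: rescaled observations

def rss(target):                        # residual sum of squares of an OLS fit on x
    b, a = np.polyfit(x, target, 1)
    return ((target - (a + b * x)) ** 2).sum()

rss_lin = rss(y_star)                   # step 3: linear model on the scaled y
rss_log = rss(np.log(y_star))           #         logarithmic model on the scaled y
stat = abs(len(y) / 2 * np.log(rss_lin / rss_log))   # step 4: comparison statistic
print(rss_lin, rss_log, stat)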

30. Concepts of intercorrelation and multicollinearity of factors.

34. Fundamentals of the least squares method (OLS) and the validity of its application.

Let us now turn to the fundamentals of the least squares method, the validity of its application (including problems of multiple regression) and the most important properties of the estimates obtained with it. Let us start with the fact that, along with the analytical dependence on the right-hand side of the regression equation, the random term also plays an important role. This random component is an unobservable quantity. The statistical tests of regression parameters and correlation measures are themselves based on unverifiable assumptions about the distribution of this random component of the multiple regression. These assumptions are only preliminary. Only after the regression equation has been constructed is it checked whether the random residuals (the empirical analogues of the random component) have the properties assumed a priori. In essence, when the model parameters are estimated, the differences between the theoretical and actual values of the resulting attribute are calculated in order to evaluate the random component itself. It is important to keep in mind that this is only a sample realization of the unknown residual of the given equation.

The regression coefficients obtained from the system of normal equations are sample estimates of the strength of the connection. It is clear that they are of practical importance only when they are unbiased. Recall that in this case the mean of the residuals is equal to zero, or, what is the same, the mean of the estimate is equal to the estimated parameter itself. Then the residuals will not accumulate with a large number of sample estimates, and the found regression parameter itself can be considered as an average of a large number of unbiased estimates.

In addition, estimates should have the smallest variance, i.e. be effective, and then it becomes possible to move from practically unsuitable point estimates to interval estimation. Finally, confidence intervals are applicable with a high degree of efficiency when the probability of obtaining an estimate at a given distance from the true (unknown) value of a parameter is close to one. Such estimates are called consistent and the consistency property is characterized by an increase in their accuracy with an increase in the sample size.

However, the consistency condition is not satisfied automatically; it depends essentially on the fulfilment of the following two important requirements. First, the residuals themselves must be purely stochastic, i.e. all clearly functional dependences must be included in the analytical component of the multiple regression, and in addition the values of the residuals must be distributed independently of each other for different observations (no autocorrelation of the residuals). The second, no less important requirement is that the variance of each deviation (residual) be the same for all values of the variables X (homoscedasticity). That is, homoscedasticity is expressed by the constancy of the variance for all observations: σ²(ε_i) = σ² = const.

On the contrary, heteroscedasticity consists in the violation of such constancy of variance for different observations. In this case, the a priori (before observations) probability of obtaining strongly deviated values ​​with different theoretical distributions of the random term for different observations in the sample will be relatively high.

Autocorrelation of residuals, or the presence of a correlation between the residuals of current and previous (subsequent) observations, is seen from the value of the usual linear correlation coefficient. If it is significantly different from zero, then the residuals are autocorrelated and, therefore, the probability density function (distribution of residuals) depends on the observation point and on the distribution of residual values ​​at other observation points. It is convenient to determine the autocorrelation of the residuals from the available statistical information in the presence of an ordering of observations by the X factor. The absence of autocorrelation of the residuals ensures the consistency and efficiency of the estimates of the regression coefficients.
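A minimal sketch of the check described above (the residuals are hypothetical): the ordinary linear correlation coefficient between the residuals and the same residuals shifted by one observation.

# Sketch: first-order autocorrelation of residuals via the ordinary correlation coefficient.
import numpy as np

e = np.array([0.5, 0.8, 0.3, -0.2, -0.6, -0.4, 0.1, 0.7, 0.9, 0.2])  # hypothetical residuals

r1 = np.corrcoef(e[1:], e[:-1])[0, 1]   # correlation of e_t with e_(t-1)
print(r1)                               # a value far from zero suggests autocorrelation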

35. Homoscedasticity and heteroscedasticity, autocorrelation of residuals, the generalized least squares method (GLS).

The sameness of the variances of the residuals for all values of the variables X, or homoscedasticity, is also absolutely necessary for obtaining consistent estimates of the regression parameters by OLS. Violation of the homoscedasticity condition is called heteroscedasticity. By itself it does not bias the estimates of the regression coefficients; heteroscedasticity mainly reduces the efficiency of these estimates. In this case it becomes especially problematic to use the usual formula for the standard error of the regression coefficient, whose use assumes a single variance of the residuals for any values of the factor. As for the unbiasedness of the estimates of the regression coefficients, it depends primarily on the independence of the residuals and the values of the factors themselves.

A rather visual, though not rigorous and somewhat skill-demanding, way of testing for homoscedasticity is a graphical study of how the residuals depend on the calculated (theoretical) values of the resulting attribute, or of the corresponding correlation fields. Analytical methods for detecting and assessing heteroscedasticity are more rigorous. When heteroscedasticity is substantial, it is advisable to use generalized least squares (GLS) instead of OLS.

In addition to the requirements for multiple regression arising from the application of the least squares, it is also necessary to comply with the conditions for the variables included in the model. These, first of all, include the requirements regarding the number of model factors for a given volume of observations (1 to 7). Otherwise, the regression parameters will be statistically insignificant. From the point of view of the effectiveness of the application of the corresponding numerical methods in the implementation of the least squares method, it is necessary that the number of observations exceed the number of estimated parameters (in the system of equations, the number of equations is greater than the number of variables being sought).

The most significant achievement of econometrics is the considerable development of the methods for estimating the unknown parameters themselves and the improvement of the criteria for detecting the statistical significance of the effects under consideration. In this regard, the impossibility or inexpediency of using traditional OLS because of heteroscedasticity, manifested to one degree or another, led to the development of the generalized least squares method (GLS). In fact, the model is corrected at the same time, its specification is changed and the initial data are transformed in order to ensure unbiasedness, efficiency and consistency of the estimates of the regression coefficients.

It is assumed that the mean of the residuals is zero, but their variance is no longer constant: it is proportional to values K_i, where the K_i are proportionality coefficients that differ for different values of the factor x. Thus it is these coefficients (the values K_i) that characterize the heterogeneity of the variance. Naturally, the variance itself, which enters as a common factor of these proportionality coefficients, is assumed to be unknown.

The original model, after these coefficients are introduced into the multiple regression equation, remains heteroscedastic (more precisely, its residuals do). Let these residuals (errors) be non-autocorrelated. Let us introduce new variables obtained by dividing the original model variables, recorded in the i-th observation, by the square root of the proportionality coefficient K_i. We then obtain a new equation in the transformed variables, in which the residuals are already homoscedastic. The new variables are simply weighted versions of the old (original) variables.

Therefore, the estimation of the parameters of the new equation obtained in this way, with homoscedastic residuals, reduces to weighted least squares (in essence, this is GLS). When deviations from the means are used instead of the regression variables themselves, the expression for the regression coefficient acquires a simple and standardized (uniform) form, differing between OLS and GLS only by the correction factor 1/K in the numerator and denominator of the fraction that gives the regression coefficient.
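A minimal sketch of the weighted estimation just described, under the assumption K_i = x_i (hypothetical data): every observation of the model y = a + b·x is divided by the square root of K_i, and OLS is applied to the transformed variables.

# Sketch: weighted least squares as a simple form of GLS when Var(e_i) is proportional to K_i = x_i.
import numpy as np

x = np.array([10.0, 20.0, 35.0, 50.0, 80.0, 120.0])    # hypothetical factor
y = np.array([25.0, 44.0, 70.0, 95.0, 160.0, 230.0])   # hypothetical result

w = 1.0 / np.sqrt(x)                      # weights 1/sqrt(K_i), with K_i = x_i by assumption
# Transformed equation: y_i/sqrt(K_i) = a * (1/sqrt(K_i)) + b * (x_i/sqrt(K_i)) + u_i
X = np.column_stack([w, x * w])           # columns for the transformed intercept and slope
coef, *_ = np.linalg.lstsq(X, y * w, rcond=None)
a, b = coef
print(f"y = {a:.2f} + {b:.2f}*x  (weighted estimates)")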

It should be borne in mind that the parameters of the transformed (corrected) model depend essentially on what concept is taken as the basis for the proportionality coefficients K_i. It is often assumed that the residuals are simply proportional to the values of the factor. The model takes its simplest form when the hypothesis is accepted that the errors are proportional to the values of the last factor in order. GLS then gives greater weight, in determining the regression parameters, to observations with smaller values of the transformed variables than standard OLS applied to the original variables would. But these new variables already acquire a different economic content.

The hypothesis that the residuals are proportional to the value of the factor may well have a real justification. Suppose an insufficiently homogeneous set of data is processed, for example one including both large and small enterprises. Then large values of the factor may correspond both to a large variance of the resulting attribute and to a large variance of the residuals. Further, the use of GLS and the corresponding transition to relative values not only reduces the variation of the factor but also reduces the error variance. Thus the simplest case of accounting for and correcting heteroscedasticity in regression models is realized through GLS.

The above approach to implementing GLS in the form of weighted least squares is quite practical: it is simple to implement and has a transparent economic interpretation. Of course, it is not the most general approach, and mathematical statistics, which serves as the theoretical basis of econometrics, offers a much more rigorous method that implements GLS in the most general form. It requires knowledge of the covariance matrix of the error vector (the column of residuals). In practical situations this matrix is usually unknown and cannot be found as such. Therefore, generally speaking, one has to estimate the required matrix somehow in order to use this estimate in the corresponding formulas instead of the matrix itself. The described implementation of GLS represents one such estimate. It is sometimes called feasible generalized least squares.

It should also be borne in mind that the coefficient of determination cannot serve as a satisfactory measure of the quality of fit when GLS is used. Note also that the method of standard errors in White's form (the so-called heteroscedasticity-consistent standard errors), applied when estimation is still carried out by OLS, has sufficient generality. This method is applicable provided that the covariance matrix of the error vector is diagonal. If there is autocorrelation of the residuals (errors), i.e. there are non-zero elements in the covariance matrix outside the main diagonal as well, then the more general method of standard errors in the Newey-West form should be used. It carries a significant restriction: non-zero elements, apart from those on the main diagonal, appear only on neighbouring diagonals separated from the main one by no more than a certain amount.

From what has been said it is clear that one must be able to check the data for heteroscedasticity. The following tests serve this purpose. They test the main hypothesis of equality of the variances of the residuals against the alternative hypothesis of inequality of these variances. In addition, a priori structural assumptions about the nature of the heteroscedasticity are used. The Goldfeld-Quandt test, as a rule, uses the assumption that the error (residual) variance depends directly on the value of some independent variable. The scheme of the test is as follows (a code sketch is given after this paragraph). First, the data are ordered by the independent variable for which heteroscedasticity is suspected. A number of central observations is then excluded from this ordered data set, where "a number" means about a quarter (25%) of the total number of observations. Next, two independent regressions are run, one on the first group of the remaining observations and one on the last group, and the two corresponding residual sums of squares are computed. Finally, their ratio forms Fisher's F-statistic; if the hypothesis under test is true, this ratio indeed follows the Fisher distribution with the corresponding degrees of freedom. A large value of this statistic then means that the tested hypothesis must be rejected. Without the step of excluding the central observations, the power of the test decreases.
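A sketch of the Goldfeld-Quandt scheme done "by hand" on artificially generated heteroscedastic data (everything here is hypothetical, including the quarter of central observations that is dropped):

# Sketch: Goldfeld-Quandt test with a manually computed F-ratio of group residual sums of squares.
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(1, 100, 40))
y = 2 + 0.5 * x + rng.normal(0, 0.05 * x)        # error spread grows with x (assumed)

def rss(xs, ys):                                 # residual sum of squares of an OLS fit
    b, a = np.polyfit(xs, ys, 1)
    return ((ys - (a + b * xs)) ** 2).sum()

n = len(x)
c = n // 4                                       # drop roughly a quarter of the central observations
k = (n - c) // 2                                 # size of each remaining group
F = rss(x[-k:], y[-k:]) / rss(x[:k], y[:k])      # larger-variance group over the smaller one
F_crit = f.ppf(0.95, dfn=k - 2, dfd=k - 2)
print(F, F_crit, F > F_crit)                     # a large F means rejecting homoscedasticity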

The Breusch-Pagan test is used when it is assumed a priori that the variances of the residuals depend on some additional variables. First, the ordinary (standard) regression is run and the vector of residuals is obtained. Then an estimate of the error variance is constructed. Next, a regression is run of the squared residuals, divided by the empirical variance (the variance estimate), on the suspected variables. For this regression the explained part of the variation is found, and half of this explained sum of squares is taken as the test statistic. If the null hypothesis is true (there is no heteroscedasticity), this quantity has a chi-square distribution. If, on the contrary, the test reveals heteroscedasticity, the original model is transformed by dividing the components of the residual vector by the corresponding components of the vector of observed independent variables. A code sketch is given below.
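A sketch of the steps listed above on artificial data (all values hypothetical; the statistic ESS/2 is compared with the chi-square distribution with one degree of freedom, since one explanatory variable is suspected):

# Sketch: Breusch-Pagan test computed step by step.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
x = rng.uniform(1, 50, 30)
y = 3 + 1.2 * x + rng.normal(0, 0.2 * x)         # heteroscedastic errors (assumed)

b, a = np.polyfit(x, y, 1)                       # ordinary (standard) regression
e = y - (a + b * x)                              # vector of residuals
u = e ** 2 / (e ** 2).mean()                     # squared residuals divided by the variance estimate
b2, a2 = np.polyfit(x, u, 1)                     # regression of u on the suspected variable
ess = (((a2 + b2 * x) - u.mean()) ** 2).sum()    # explained part of the variation
stat = ess / 2
print(stat, chi2.ppf(0.95, df=1), stat > chi2.ppf(0.95, df=1))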

36. The method of standard errors in White's form.

We can draw the following conclusions. The use of GLS in the presence of heteroscedasticity reduces to minimizing the sum of weighted squared deviations. The use of feasible GLS requires a number of observations considerably exceeding the number of estimated parameters. The most favourable case for the use of GLS is when the error (residual) is proportional to one of the independent variables and the resulting estimates are consistent. If, nevertheless, in a model with heteroscedasticity it is necessary to use standard OLS rather than GLS, then to obtain consistent inference one can use standard errors in the White or Newey-West form.

When analysing time series, it is often necessary to take into account the statistical dependence of observations at different points in time. In this case the assumption of uncorrelated errors is not satisfied. Consider a simple model in which the errors form a first-order autoregressive process: ε_t = ρ·ε_(t-1) + u_t. Here the errors satisfy a simple recurrence relation whose right-hand side contains, as one term, a sequence of independent normally distributed random variables u_t with zero mean and constant variance, and, as the second term, the product of the parameter ρ (the autoregression coefficient) and the value of the residual at the previous moment of time. The sequence of error values (residuals) itself forms a stationary random process. A stationary random process is characterized by the constancy of its characteristics over time, in particular the mean and the variance. In this case the covariance matrix of interest to us (its terms) can easily be written out using powers of the parameter ρ.

Estimation of the autoregressive model with a known parameter is performed by GLS. In this case it suffices to reduce the original model, by a simple transformation, to a model whose errors satisfy the conditions of the standard regression model. The situation in which the autoregression parameter is known occurs very rarely, so in general the estimation has to be performed with an unknown autoregressive parameter. Three estimation procedures are used most often: the Cochrane-Orcutt method, the Hildreth-Lu procedure and Durbin's method.

In general, the following conclusions hold. Time series analysis requires a correction of ordinary least squares, since the errors in this case are, as a rule, correlated. Often these errors form a first-order stationary autoregressive process. OLS estimates under first-order autoregression are unbiased and consistent but inefficient. With a known autoregression coefficient, the problem reduces to simple transformations (corrections) of the original system and then to the application of standard least squares. If, as is more often the case, the autoregressive coefficient is unknown, there are several feasible GLS procedures, which consist in estimating the unknown parameter (coefficient), after which the same transformations are applied as in the previous case of a known parameter.
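A minimal sketch of a Cochrane-Orcutt type iteration (hypothetical data with AR(1) errors): the autoregression coefficient is estimated from the residuals, the variables are quasi-differenced, and the regression is re-estimated.

# Sketch: Cochrane-Orcutt style correction for first-order autocorrelation of the errors.
import numpy as np

rng = np.random.default_rng(2)
n = 60
x = np.linspace(1, 30, n)
e = np.zeros(n)
for t in range(1, n):                          # AR(1) errors with rho = 0.7 (assumed)
    e[t] = 0.7 * e[t - 1] + rng.normal(0, 1)
y = 5 + 0.8 * x + e

b, a = np.polyfit(x, y, 1)                     # ordinary OLS estimates as a starting point
for _ in range(5):                             # iterate until rho stabilizes
    res = y - (a + b * x)                      # residuals of the original model
    rho = np.corrcoef(res[1:], res[:-1])[0, 1] # estimate of the autoregression coefficient
    y_star = y[1:] - rho * y[:-1]              # quasi-differenced variables
    x_star = x[1:] - rho * x[:-1]
    b, a_star = np.polyfit(x_star, y_star, 1)  # OLS on the transformed model
    a = a_star / (1 - rho)                     # recover the intercept of the original model
print(a, b, rho)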

37. The concept of the Breusch-Pagan test and the Goldfeld-Quandt test

Ministry of Agriculture of the Russian Federation

Federal state budget educational

institution of higher professional education

"Perm State Agricultural Academy

named after academician D.N. Pryanishnikov"

Department of Finance, Credit and Economic Analysis

Control work on the discipline "Econometrics" Option - 10


1. Approximation error and its definition

2. Analytical method of time series alignment and the functions used in this process

3. Practical part

Task 1

Task 2

List of used literature

1. Approximation error and its definition.

The average approximation error is the average deviation of the calculated data from the actual data. It is defined in percent and taken modulo.

The actual values of the resulting attribute differ from the theoretical ones. The smaller this difference, the closer the theoretical values fit the empirical data, and the better the quality of the model. The deviation of the actual from the calculated value of the resulting attribute for each observation is an approximation error. The number of such errors corresponds to the volume of the population. In some cases the approximation error may be zero. For comparison, the deviations are expressed as a percentage of the actual values.

Since it can be both positive and negative, it is customary to determine the approximation errors for each observation as a percentage modulo. Deviations can be considered as an absolute approximation error, and as a relative approximation error. In order to have a general judgment about the quality of the model from the relative deviations for each observation, the average approximation error is determined as the simple arithmetic mean.

The average approximation error is calculated by the formula: A = (1/n) · Σ |(y_i − ŷ_i) / y_i| · 100%.

Another definition of the average approximation error is also possible:

If A ≤ 10-12%, then we can talk about the good quality of the model.
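A minimal sketch of the calculation (the actual and fitted values here are hypothetical):

# Sketch: average approximation error as the mean absolute percentage deviation.
import numpy as np

y = np.array([133.0, 148.0, 134.0, 154.0, 162.0])       # hypothetical actual values
y_hat = np.array([128.5, 141.2, 139.8, 146.1, 158.3])   # hypothetical calculated values

A = np.mean(np.abs((y - y_hat) / y)) * 100              # percent, taken modulo
print(f"A = {A:.1f}%")                                  # A <= 10-12% indicates good model quality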

2. Analytical method of time series alignment and the functions used in this process.

A more advanced technique for identifying the main development trend in a time series is analytical alignment. When the general trend is studied by the method of analytical alignment, it is assumed that changes in the levels of the series can be expressed by certain mathematical functions with varying degrees of approximation accuracy. The type of equation is determined by the nature of the dynamics of the development of the particular phenomenon. In practice, the form of the function y = f(t) is chosen from the existing time series and its parameters are found, after which the behaviour of the deviations from the trend is analysed. The following relationships are used most often in alignment: linear, parabolic and exponential. In many cases, modelling time series with polynomials or an exponential function does not give satisfactory results, since the time series contains noticeable periodic fluctuations around the general trend. In such cases harmonic analysis (harmonics of the Fourier series) should be used, since it determines the law by which the values of the levels of the series can be predicted accurately.

The purpose of the analytical alignment of the dynamic series is to determine the analytical or graphical dependence y=f(t). The function y=f(t) is chosen in such a way that it gives a meaningful explanation of the process under study. These may be different functions.

Systems of equations of the form y=f(t) for estimating the parameters of polynomials by LSM


Graphical representation of n-order polynomials

1. If the change in the levels of a series is characterized by a uniform increase (decrease) in the levels, when the absolute chain increments are close in magnitude, the development trend is characterized by a straight line equation.

2. If, as a result of the analysis of the type of trend of dynamics, a curvilinear dependence is established, with approximately constant acceleration, then the shape of the trend is expressed by a second-order parabola equation.

3. If the growth of the levels of a series of dynamics occurs exponentially, i.e. chain growth factors are more or less constant, the alignment of the dynamics series is carried out according to the exponential function.

After choosing the type of equation, it is necessary to define the parameters of the equation. The most common way to determine the parameters of an equation is the method of least squares, in which the minimum point of the sum of squared deviations between theoretical (adjusted according to the chosen equation) and empirical levels is taken as a solution.

Alignment along a straight line (determination of the trend line) has the expression: y_t = a0 + a1·t,

where t is the symbol of time,

and a0 and a1 are the parameters of the desired line.

The parameters of the straight line are found by solving the system of equations: Σy = n·a0 + a1·Σt; Σt·y = a0·Σt + a1·Σt².

The system of equations is simplified if the values ​​of t are chosen so that their sum equals Σt = 0, i.e., the origin of time is moved to the middle of the period under consideration. If before the transfer of the reference point t = 1, 2, 3, 4…, then after the transfer:

if the number of levels in the series is odd t = -4 -3 -2 -1 0 +1 +2 +3 +4

if the number of levels in the series is even t = -7 -5 -3 -1 +1 +3 +5 +7

Thus, ∑t to an odd power will always be equal to zero.
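A minimal sketch of straight-line alignment with the origin of time moved to the middle of the series (the levels are hypothetical, an odd number of them): with Σt = 0 the system decouples, so a0 = Σy/n and a1 = Σt·y/Σt².

# Sketch: analytical alignment by the straight line y_t = a0 + a1*t with centred time.
import numpy as np

y = np.array([210.0, 225.0, 234.0, 248.0, 261.0, 270.0, 284.0])  # hypothetical levels of the series
n = len(y)
t = np.arange(n) - n // 2            # -3 ... +3, so that sum(t) = 0

a0 = y.sum() / n                     # a0 = sum(y) / n
a1 = (t * y).sum() / (t ** 2).sum()  # a1 = sum(t*y) / sum(t^2)
print(a0, a1, a0 + a1 * t)           # parameters and the aligned (trend) levels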

Similarly, the parameters of the parabola of the 2nd order are found from the solution of the system of equations:

Alignment by the average absolute growth or the average growth coefficient uses, respectively, y_t = Y0 + Δ·t or y_t = Y0·K^t, where:

Δ is the average absolute growth;

K is the average growth coefficient;

Y0 is the initial level of the series;

Yn is the final level of the series;

t is the ordinal number of the level, starting from zero.

After constructing the regression equation, an assessment of its reliability is carried out. The significance of the selected regression equation, equation parameters and correlation coefficient should be assessed by applying critical evaluation methods:

Fisher's F-test and Student's t-test; in this case the calculated values of the criteria are compared with the tabulated (critical) ones at a given level of significance and number of degrees of freedom. If F_fact > F_theor, the regression equation is adequate.

n is the number of observations (levels of the series), m is the number of parameters of the regression equation (model).

Checking the adequacy of the regression equation (the quality of the model as a whole) is carried out using the average approximation error, the value of which should not exceed 10-12% (recommended).

For the territories of the region, data are given for 200X.

Region number; Average per capita subsistence minimum per day for one able-bodied person, rub., x; Average daily salary, rub., y
1 78 133
2 82 148
3 87 134
4 79 154
5 89 162
6 106 195
7 67 139
8 88 158
9 73 152
10 87 162
11 76 159
12 115 173

Exercise:

1. Build a correlation field and formulate a hypothesis about the form of the connection.

2. Calculate the parameters of the linear regression equation

4. Using the average (general) coefficient of elasticity, give a comparative assessment of the strength of the relationship between the factor and the result.

7. Calculate the predicted value of the result if the predicted value of the factor increases by 10% from its average level. Determine the confidence interval of the prediction for the significance level .

Solution:

Let's solve this problem using Excel.

1. Comparing the available data on x and y, for example by ranking them in ascending order of the factor x, one can observe a direct relationship between the attributes: an increase in the per capita subsistence minimum increases the average daily wage. Based on this, it can be assumed that the relationship between the attributes is direct and can be described by the equation of a straight line. The same conclusion is confirmed by graphical analysis.

To build a correlation field, you can use the Excel application package. Enter the initial data in the sequence: first x, then y.

Select the area of ​​cells containing the data.

Then choose: Insert / Scatter / Scatter with markers as shown in figure 1.

Figure 1 Correlation field construction

An analysis of the correlation field shows the presence of a dependence close to a straight line, since the points are located almost in a straight line.

2. To calculate the parameters of the linear regression equation, use the built-in statistical function LINEST.

For this:

1) Open an existing file containing the data to be analyzed;
2) Select an area of ​​empty cells 5×2 (5 rows, 2 columns) to display the results of regression statistics.
3) Activate Function Wizard: in the main menu, select Formulas / Insert Function.
4) In the Category window, select Statistical; in the function window, select LINEST. Click the OK button, as shown in Figure 2;

Figure 2 Function Wizard Dialog Box

5) Fill in the function arguments:

Known y values

Known x values

Constant - a logical value that indicates the presence or absence of a free term in the equation; if Constant = 1, then the free term is calculated in the usual way, if Constant = 0, then the free term is 0;

Statistics - a logical value that indicates whether to display additional information on the regression analysis or not. If Statistics = 1, then additional information is displayed; if Statistics = 0, then only the estimates of the equation parameters are displayed.

Click on the button OK;

Figure 3 LINEST Arguments Dialog Box

6) The first element of the final table will appear in the upper left cell of the selected area. To expand the entire table, press the F2 key and then the keyboard shortcut Ctrl+Shift+Enter.

Additional regression statistics will be output in the order shown in the following schema:

Value of coefficient b | Value of coefficient a
Standard error of b | Standard error of a
Coefficient of determination R² | Standard error of y
F-statistic | Number of degrees of freedom
Regression sum of squares | Residual sum of squares

Figure 4 The result of calculating the LINEST function

We got the regression equation:

We conclude: With an increase in the per capita subsistence minimum by 1 rub. the average daily wage increases by an average of 0.92 rubles.

This means that 52% of the variation in wages (y) is explained by the variation of the factor x - the average per capita subsistence minimum, and 48% - by the action of other factors not included in the model.

From the calculated coefficient of determination, the correlation coefficient can be found: r = √R² = √0.52 ≈ 0.72.

The relationship is rated as close.
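As a cross-check outside Excel, the same estimates can be recomputed directly from the twelve observations of the task (a sketch; the output should agree approximately with the slope of 0.92, r ≈ 0.72 and R² ≈ 0.52 obtained above):

# Sketch: reproducing the LINEST estimates for the task data by ordinary least squares.
import numpy as np

x = np.array([78, 82, 87, 79, 89, 106, 67, 88, 73, 87, 76, 115], dtype=float)            # subsistence minimum
y = np.array([133, 148, 134, 154, 162, 195, 139, 158, 152, 162, 159, 173], dtype=float)  # daily wage

b, a = np.polyfit(x, y, 1)           # slope and intercept of the paired regression
r = np.corrcoef(x, y)[0, 1]          # linear correlation coefficient
print(f"y = {a:.2f} + {b:.2f}*x, r = {r:.2f}, R^2 = {r ** 2:.2f}")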

4. Using the average (general) coefficient of elasticity, we determine the strength of the influence of the factor on the result.

For the straight-line equation, the average (general) elasticity coefficient is determined by the formula E = b · x̄ / ȳ.

We find the average values ​​by selecting the area of ​​cells with x values, and select Formulas / AutoSum / Average, and do the same with the values ​​of y.

Figure 5 Calculation of mean values ​​of a function and argument

Thus, if the average per capita subsistence minimum changes by 1% from its average value, the average daily wage will change by an average of 0.51%.
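The same calculation outside Excel (a sketch using the task data and the slope estimated from it):

# Sketch: average (general) elasticity coefficient E = b * mean(x) / mean(y) for the straight line.
import numpy as np

x = np.array([78, 82, 87, 79, 89, 106, 67, 88, 73, 87, 76, 115], dtype=float)
y = np.array([133, 148, 134, 154, 162, 195, 139, 158, 152, 162, 159, 173], dtype=float)

b, _ = np.polyfit(x, y, 1)
E = b * x.mean() / y.mean()
print(round(E, 2))                  # about 0.51: a 1% change in x changes y by about 0.51%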

Using the data analysis tool Regression you can obtain:
- results of regression statistics,
- results of dispersion analysis,
- results of confidence intervals,
- residuals and regression line fit charts,
- residuals and normal probability.

The procedure is as follows:

1) Check access to the Analysis ToolPak (Analysis package). In the main menu, select in sequence: File / Options / Add-ins.

2) In the Manage drop-down list, select Excel Add-ins and press the Go button.

3) In the Add-ins window, check the Analysis ToolPak box, and then click the OK button.

If the Analysis ToolPak is missing from the Available add-ins list, press the Browse button to locate it.

If you receive a message stating that the analysis pack is not installed on your computer, click Yes to install it.

4) In the main menu, select in sequence: Data / Data Analysis / Analysis Tools / Regression, and then click the button OK.

5) Fill in the data entry and output options dialog box:

Input Y range - the range containing the data of the resulting attribute;

Input X range - the range containing the data of the factor attribute;

Labels - a flag that indicates whether the first row contains column names or not;

Constant is zero - a flag indicating the presence or absence of a free term in the equation;

Output range - it is enough to indicate the upper left cell of the future range;

6) New worksheet - you can set an arbitrary name for the new sheet.

Then press the button OK.

Figure 6 Dialog box for entering parameters of the Regression tool

The results of the regression analysis for the problem data are shown in Figure 7.

Figure 7 The result of applying the regression tool

5. Let us estimate the quality of the equations using the average approximation error. Let's use the results of the regression analysis presented in Figure 8.

Figure 8 The result of applying the regression tool "Residual Inference"

Let us compile a new table as shown in Figure 9. In column C we calculate the relative approximation error for each observation using the formula A_i = |(y_i − ŷ_i) / y_i| · 100%.

Figure 9 Calculation of the average approximation error

The average approximation error is then calculated as the simple arithmetic mean of these relative errors: A = (1/n) · Σ A_i.

The quality of the constructed model is assessed as good, since it does not exceed 8 - 10%.

6. From the table with regression statistics (Figure 4), we write out the actual value of Fisher's F-test:

Since the actual value of the F-criterion exceeds the tabulated one at the 5% significance level, we conclude that the regression equation is significant (the relationship is proven).

8. We will evaluate the statistical significance of the regression parameters using Student's t-statistics and by calculating the confidence interval for each of the indicators.

We put forward the hypothesis H 0 about a statistically insignificant difference of indicators from zero:


The tabulated value of the t-criterion is taken for the number of degrees of freedom n − 2 = 12 − 2 = 10.

Figure 7 has the actual values ​​of the t-statistic:

The t-test for the correlation coefficient can be calculated in two ways:

Method I:

where the quantity in the denominator is the random error of the correlation coefficient.

We take the data for calculation from the table in Figure 7.

Method II:

The actual values of the t-statistic exceed the tabulated value:

Therefore, the hypothesis H 0 is rejected, that is, the regression parameters and the correlation coefficient are not randomly different from zero, but are statistically significant.

The confidence interval for parameter a is defined as

For parameter a, the 95% bounds, as shown in Figure 7, were:

The confidence interval for the regression coefficient is defined as

For the regression coefficient b, the 95% bounds as shown in Figure 7 were:

An analysis of the upper and lower bounds of the confidence intervals leads to the conclusion that, with probability 0.95, the parameters a and b, lying within the indicated boundaries, do not take zero values, i.e. they are statistically significant and significantly different from zero.

7. The obtained estimates of the regression equation allow us to use it for forecasting. If the forecast value of the subsistence minimum is:

Then the predicted value of the average daily wage will be:

We calculate the forecast error using the formula:

where

We also calculate the variance using the Excel application package. For this:

1) Activate Function Wizard: in the main menu, select Formulas / Insert Function.

3) Fill in the range containing the numerical data of the factor characteristic. Click OK.

Figure 10 Variance calculation

Get the variance value

To calculate the residual variance per one degree of freedom, we use the results of the analysis of variance as shown in Figure 7.

Confidence intervals for predicting individual values of y at the forecast value of the factor with a probability of 0.95 are determined by the expression:

The interval is quite wide, primarily due to the small volume of observations. In general, the fulfilled forecast of the average monthly salary turned out to be reliable.
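A sketch of the forecast calculation outside Excel, using the standard expression for the error of an individual forecast (the 10% increase of the factor over its mean follows the task statement; the exact numbers produced here are only an illustration):

# Sketch: point forecast and 95% prediction interval for x_p = 1.1 * mean(x).
import numpy as np
from scipy.stats import t

x = np.array([78, 82, 87, 79, 89, 106, 67, 88, 73, 87, 76, 115], dtype=float)
y = np.array([133, 148, 134, 154, 162, 195, 139, 158, 152, 162, 159, 173], dtype=float)
n = len(x)

b, a = np.polyfit(x, y, 1)
resid = y - (a + b * x)
s2 = (resid ** 2).sum() / (n - 2)                 # residual variance per degree of freedom

x_p = 1.10 * x.mean()                             # forecast value of the factor (+10%)
y_p = a + b * x_p                                 # point forecast of the wage
se = np.sqrt(s2 * (1 + 1 / n + (x_p - x.mean()) ** 2 / ((x - x.mean()) ** 2).sum()))
t_crit = t.ppf(0.975, df=n - 2)                   # two-sided 95% level
print(y_p, y_p - t_crit * se, y_p + t_crit * se)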

The problem statement is taken from: Workshop on Econometrics: study guide / I.I. Eliseeva, S.V. Kurysheva, N.M. Gordeenko et al.; ed. by I.I. Eliseeva. Moscow: Finance and Statistics, 2003. 192 p.

For a general assessment of the quality of the constructed econometric model, such characteristics as the coefficient of determination, the correlation index and the average relative approximation error are determined, and the significance of the regression equation is checked using Fisher's F-criterion. The listed characteristics are quite universal and can be applied both to linear and non-linear models and to models with two or more factor variables. The decisive role in calculating all the listed quality characteristics is played by the series of residuals ε_i, which is calculated by subtracting the values y_pi computed from the model equation from the actual (observed) values of the studied attribute y_i.

The coefficient of determination R² = 1 − Σ ε_i² / Σ (y_i − ȳ)²

shows what proportion of the change in the studied trait is taken into account in the model. In other words, the coefficient of determination shows what part of the change in the variable under study can be calculated based on changes in the factor variables included in the model using the selected type of function that links the factor variables and the feature under study in the model equation.

Determination coefficient R2 can take values ​​from 0 to 1. The closer the coefficient of determination R2 to unity, the better the quality of the model.

The correlation index can easily be calculated from the coefficient of determination: R = √R².

Correlation index R characterizes the tightness of the type of relationship chosen when building the model between the factors taken into account in the model and the variable under study. In the case of linear pair regression, its absolute value coincides with the pair correlation coefficient r(x, y), which we considered earlier, and characterizes the tightness of the linear relationship between x and y. The values ​​of the correlation index, obviously, also lie in the range from 0 to 1. The closer the value R to unity, the more closely the selected type of function links the factor variables and the trait under study, the better the quality of the model.

The average relative approximation error E_rel.av. = (1/n) · Σ |ε_i / y_i| · 100%, (2.11)

is expressed as a percentage and characterizes the accuracy of the model. The acceptable accuracy of the model in solving practical problems can be determined from considerations of economic feasibility, taking the specific situation into account. A widely used criterion is that the accuracy is considered satisfactory if the average relative error is less than 15%. If E_rel.av. is less than 5%, the model is said to have high accuracy. Models with unsatisfactory accuracy, i.e. with E_rel.av. greater than 15%, are not recommended for analysis and forecasting.

The Fisher F-test is used to evaluate the significance of the regression equation. The calculated value of the F-criterion is determined from the ratio:

F = (R² / (1 − R²)) · ((n − m − 1) / m). (2.12)

The critical value of the F-criterion is determined from tables at a given significance level α and the given degrees of freedom (Excel's built-in function for the F-distribution can be used). Here, as before, m is the number of factors taken into account in the model and n is the number of observations. If the calculated value is greater than the critical value, the model equation is recognized as significant. The larger the calculated value of the F-criterion, the better the quality of the model.

Let us determine the quality characteristics of the linear model we have constructed for Example 1. Let's use the data of Table 2. Determination coefficient:

Therefore, within the linear model, 90.1% of the change in sales volume is explained by the change in air temperature.

Correlation index


The value of the correlation index in the case of a paired linear model, as we can see, is indeed modulo equal to the correlation coefficient between the corresponding variables (sales volume and temperature). Since the obtained value is close enough to one, we can conclude that there is a close linear relationship between the variable under study (sales volume) and the factor variable (temperature).

Fisher F-test

The critical value F_cr at α = 0.1, ν1 = 1, ν2 = 7 − 1 − 1 = 5 is 4.06. The calculated value of the F-criterion is larger than the tabulated one; therefore, the model equation is significant.

Average relative approximation error

The built linear pair regression model has unsatisfactory accuracy (>15%), and it is not recommended to use it for analysis and forecasting.

As a result, despite the fact that most of the statistical characteristics meet the criteria for them, the linear paired regression model is not suitable for predicting sales volume depending on air temperature. The non-linear nature of the relationship between these variables according to the observational data is quite clearly visible in Fig.1. The analysis carried out confirmed this.