Multiple linear correlation. Correlation coefficients

The essence of causal forecasting methods is to establish a mathematical relationship between the resulting (dependent) variable and the factor (independent) variables.

A necessary condition for applying causal forecasting methods is the availability of a large amount of data. If the relationships between the variables can be described mathematically correctly, the accuracy of the causal forecast will be quite high.
Causal forecasting methods include:


  • multivariate regression models,

  • simulation modeling.
The most common causal forecasting methods are multivariate regression models.

1.4.1 Multivariate regression models

A multivariate regression model is an equation with multiple independent variables.

To build a multivariate regression model, various functions can be used; the most common are the linear and power dependences:

y = b0 + b1x1 + b2x2 + … + bnxn (linear),

y = b0 · x1^b1 · x2^b2 · … · xn^bn (power).

In the linear model, the parameters (b1, b2, …, bn) are interpreted as the effect of a unit change in each independent variable on the predicted value, with all other independent variables held constant.

In the power model, the parameters are elasticity coefficients. They show by how many percent the result (y) will change on average when the corresponding factor changes by 1%, while the other factors remain unchanged. The least squares method is also used to calculate the parameters of multiple regression equations.
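To make the two functional forms concrete, here is a minimal sketch in R with synthetic (made-up) data; the variable names y, x1, x2 are illustrative only. The power model is fitted in its log-log form, so its coefficients can be read directly as elasticity estimates:

set.seed(1)
x1 <- runif(50, 1, 10); x2 <- runif(50, 1, 10)
y  <- 2 + 1.5 * x1 + 0.8 * x2 + rnorm(50)   # synthetic data from a linear process

lin <- lm(y ~ x1 + x2)                 # linear model: y = b0 + b1*x1 + b2*x2
pow <- lm(log(y) ~ log(x1) + log(x2))  # power model in log-log form:
                                       # coefficients estimate elasticities
coef(lin); coef(pow)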

When building regression models, the quality of the data plays a decisive role. Data collection creates the foundation for forecasts, so there are a number of requirements and rules that must be observed when collecting data.


  1. Firstly, the data must be observable, i.e. obtained as a result of measurement rather than calculation.

  2. Secondly, duplicate and strongly differing observations must be excluded from the data array. The more non-repeating data and the more homogeneous the population, the better the equation will be. Strongly differing values are observations that do not fit into the general series. For example, suppose wage data for workers consists of four- and five-digit numbers (7,000, 10,000, 15,000), but one six-digit number (250,000) is found; this is obviously an error (a screening sketch in R follows this list).

  3. The third rule (requirement) is a fairly large amount of data. Statisticians disagree on how much data is needed to build a good equation. According to some, the number of observations must be 4-6 times the number of factors; others claim it must be at least 10 times the number of factors, so that the law of large numbers, acting in full force, ensures the effective cancellation of random deviations from the regular nature of the relationship.
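As a rough illustration of the second rule, the following R snippet screens a small, hypothetical wage sample for duplicates and strongly differing values; the 3-MAD cutoff is one possible convention, not a prescription from the text:

wages <- c(7000, 10000, 15000, 12000, 10000, 250000)  # 10000 repeats; 250000 is suspect
wages <- unique(wages)               # drop duplicate observations
med <- median(wages)
s   <- mad(wages)                    # robust measure of spread
wages[abs(wages - med) <= 3 * s]     # keeps 7000..15000, drops 250000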

Building a multivariate regression model in MS Excel
In Excel spreadsheets, it is possible to build only a linear multivariate regression model:

y = a + b1x1 + b2x2 + … + bnxn. (1.19)

To do this, select "Data Analysis", and then, in the window that appears, the "Regression" tool.


Figure 1.45 - Dialog box of the "Regression" tool
In the window that appears, you need to fill in a number of fields, including:


  • Input interval Y is the range of data, in a single column, containing the values of the resulting variable Y.

  • Input interval X is the range of data containing the values of the factor variables.

If the first row or first column of the input interval contains headings, then the "Labels" box must be checked.

By default, a 95% confidence level is applied. If you want to set a different level, check the "Confidence Level" box and enter the desired level in the field next to it.

Checkbox "Constant Zero" needs to be checked only if you want to get the regression equation without intercept a, so that the regression line passes through the origins.
The output of calculation results can be organized in 3 ways:


  • in a cell range of the current worksheet (for this, in the "Output Range" field, define the upper left cell of the range where the calculation results will be displayed);

  • on a new worksheet (you can enter the desired name of this sheet in the field next to it);

  • in a new workbook.

Checkboxes "Remains" and "Standardized Remains" orders them to be included in the output range.
To plot the residuals for each independent variable, check the box Residual Graph.Remains otherwise known as prediction errors. They are defined as the difference between actual and predicted Y values.
Interpreting Residual Plots
There should be no pattern in the residual plots. If a pattern is visible, it means that the model omits some factor, unknown to us but acting systematically, for which there is no data.

When the "Line Fit Plots" box is checked, a series of charts will be displayed showing how well the theoretical regression line fits the observed, i.e. actual, data.

Interpreting the Line Fit Plots
In Excel, on the line fit plots, red dots indicate the theoretical values of Y, and blue dots the initial data. If the red dots overlap well with the blue dots, this visually indicates a successful regression equation.
A necessary step in forecasting based on multivariate regression models is assessing the statistical significance of the regression equation, i.e. the suitability of the constructed equation for use in forecasting. To solve this problem, MS Excel calculates a number of coefficients, namely:


  1. Multiple correlation coefficient (R)

It characterizes the tightness and direction of the relationship between the resulting variable and several factor variables. For a two-factor dependence, the multiple correlation coefficient is calculated by the formula:

R = sqrt[(r²yx1 + r²yx2 − 2·ryx1·ryx2·rx1x2) / (1 − r²x1x2)], (1.20)

where ryx1 and ryx2 are the paired correlation coefficients between y and each factor, and rx1x2 is the paired correlation coefficient between the factors.


  2. Multiple coefficient of determination (R²)

R² is the proportion of the variation of the resulting variable y explained by the factors included in the model; the rest of the variation depends on factors not included in the model. R² can take values from 0 to 1. If R² is close to 1, the quality of the model is high. This indicator is especially useful for comparing several models and choosing the best one.


  3. Normalized coefficient of determination (adjusted R²)

The indicator R² has a drawback: large values of the coefficient of determination can be achieved simply because the number of observations is small. The normalized R² provides information about the value that could be expected in another, much larger data set.

The normalized coefficient of determination is calculated by the formula:

R²adj = 1 − (1 − R²) · (n − 1) / (n − m − 1), (1.21)

where R²adj is the normalized (adjusted) multiple coefficient of determination,

R² is the multiple coefficient of determination,

n is the volume of the population (the number of observations),

m is the number of factor variables.


  4. The standard error of the regression indicates the approximate size of the prediction error. It is used as the main quantity for measuring the quality of the estimated model and is calculated by the formula:

Se = sqrt[Σ(yi − ŷi)² / (n − m − 1)], (1.22)

where Σ(yi − ŷi)² is the sum of the squared residuals (ŷi are the theoretical, i.e. predicted, values),

n − m − 1 is the number of degrees of freedom of the residuals.
That is, the standard error of the regression is the square root of the residual sum of squares per degree of freedom.
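Before turning to the Excel output below, here is a sketch in R that computes all four indicators by hand on synthetic data, following formulas (1.20)-(1.22); the data and the factor count m = 2 are made up for illustration:

set.seed(2)
n  <- 20
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1.8 + 0.9 * x1 + 0.1 * x2 + rnorm(n, sd = 0.6)
fit <- lm(y ~ x1 + x2)

m   <- 2                                     # number of factor variables
rss <- sum(resid(fit)^2)                     # residual sum of squares
tss <- sum((y - mean(y))^2)                  # total sum of squares
R2  <- 1 - rss / tss                         # coefficient of determination
R   <- sqrt(R2)                              # multiple correlation coefficient
R2a <- 1 - (1 - R2) * (n - 1) / (n - m - 1)  # normalized R², formula (1.21)
Se  <- sqrt(rss / (n - m - 1))               # standard error, formula (1.22)
c(R = R, R2 = R2, R2.adj = R2a, Se = Se)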


RESULTS

Regression statistics
  Multiple R             0.973101
  R-square               0.946926
  Normalized R-square    0.940682
  Standard error         0.59867
  Observations           20

Analysis of variance
               df    SS          MS          F           Significance F
  Regression    2    108.7071    54.35355    151.6535    1.45E-11
  Residual     17    6.092905    0.358406
  Total        19    114.8

               Coefficients  Standard error  t-statistic  P-value   Lower 95%  Upper 95%  Lower 95.0%  Upper 95.0%
  Y-intercept  1.835307      0.471065        3.89608      0.001162  0.841445   2.829169   0.841445     2.829169
  x1           0.945948      0.212576        4.449917     0.000351  0.49745    1.394446   0.49745      1.394446
  x2           0.085618      0.060483        1.415561     0.174964  -0.04199   0.213227   -0.04199     0.213227

The analysis of variance method consists in decomposing the total sum of squared deviations of the variable y from its average into two parts:

  1. explained by the regression (or factorial),

  2. residual:

Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)², (1.23)

where ȳ is the mean of y and ŷi are the theoretical values.
The suitability of the regression model for prediction depends on what part of the total variation of the trait y is accounted for by the variation explained by the regression. Obviously, if the sum of squared deviations explained by the regression is greater than the residual sum, the regression equation is statistically significant. This is equivalent to the coefficient of determination approaching unity.
Designations in the "Analysis of variance" table:
The second column of the table is called df and contains the number of degrees of freedom: for the total variance df = n − 1, for the factor variance (the variance explained by the regression) df = m, and for the residual variance df = n − m − 1,

where n is the number of observations,

m is the number of factor variables of the model.
The third column of the table is called SS and contains the sums of squared deviations. The total sum of squared deviations is determined by the formula:

Σ(yi − ȳ)², (1.24)

the residual sum of squares:

Σ(yi − ŷi)², (1.25)

and the factor sum of squares:

Σ(ŷi − ȳ)². (1.26)

The fourth column is called MS and contains the mean values of the squared deviations, determined by the formula:

MS = SS / df. (1.27)

With the help of Fisher's F-criterion, the statistical significance of the coefficient of determination of the regression equation is assessed. For this, a null hypothesis is put forward which states that there is no relationship between the resulting and factor variables. This is possible only if all parameters of the multiple linear regression equation and the correlation coefficient are equal to zero.

To test this hypothesis, it is necessary to calculate the actual value of Fisher's F-criterion and compare it with the table value. The actual value of the F-criterion is calculated by the formula:

F_fact = [R² / (1 − R²)] · [(n − m − 1) / m]. (1.28)

The table value is selected from special statistical tables by:

  • the given significance level (α) and

  • the numbers of degrees of freedom (df1 = m, df2 = n − m − 1).

In MS Excel, the table value of the F-criterion can be determined using the function =FINV(probability; degrees_of_freedom1; degrees_of_freedom2).

For example: =FINV(0.05; df1; df2).
The significance level is chosen to correspond to the confidence level at which the parameters of the regression model were calculated; the default is 95% (α = 0.05).

If F_fact > F_table, then the null hypothesis is rejected and the regression equation is recognized as statistically significant. In the case of particularly important forecasts, it is recommended to increase the table value of the F-criterion by 4 times, that is, to check the condition F_fact > 4·F_table.
In our example, F_fact = 151.65 and F_table = 3.59.
The calculated value significantly exceeds the table value. This means that the coefficient of determination is significantly different from zero, so the hypothesis of the absence of a regression dependence should be rejected.
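The same check can be reproduced in R from the figures in the output above (n = 20 observations, m = 2 factors); this is a verification sketch, not part of the Excel workflow:

R2 <- 0.946926; n <- 20; m <- 2
F.fact  <- R2 / (1 - R2) * (n - m - 1) / m     # formula (1.28): about 151.65
F.table <- qf(0.95, df1 = m, df2 = n - m - 1)  # critical value at alpha = 0.05: about 3.59
F.fact > F.table                               # TRUE, the equation is significant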
Now let's evaluate the significance of the regression coefficients using Student's t-criterion. It allows us to determine which of the factor variables (x) have the greatest impact on the resulting variable (y).

Standard errors of the coefficients are usually denoted by S_b; the subscript indicates the parameter of the regression equation for which the error is calculated.

They are calculated by the formula:

S_bi = (σy / σxi) · sqrt[(1 − R²y) / (1 − R²xi)] · 1 / sqrt(n − m − 1), (1.29)

where σy is the RMS (standard deviation) of the resulting variable,

σxi is the RMS of the factor xi,

R²y is the coefficient of determination of the multiple regression equation,

R²xi is the coefficient of determination for the dependence of the factor xi on all the other factors of the equation,

n − m − 1 is the number of degrees of freedom for the residual sum of squared deviations.
In MS Excel, standard errors are calculated automatically (they are located in the 3rd column of the 3rd table).
The actual value of Student's t-criterion in MS Excel is located in the 4th column of the 3rd table and is called the t-statistic:

(4th column) = (2nd column) / (3rd column)

t-statistic = Coefficient / Standard error
The table value of Student's t-criterion depends on the accepted significance level (usually 0.1, 0.05, or 0.01) and the number of degrees of freedom n − m − 1,

where n is the number of population units,

m is the number of factors in the equation.
In MS Excel, the table value of Student's criterion can be determined using the function:

=TINV(probability; number of degrees of freedom)

For example: =TINV(0.05; 17)
If t_fact > t_table, then the coefficient of the regression equation is considered statistically significant (reliable), and the corresponding factor can be included in the model and used for forecasting.
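As a sketch, the t-test for the factor x2 from the output above can be reproduced in R (df = n − m − 1 = 17):

t.fact  <- 0.085618 / 0.060483        # coefficient / standard error = 1.4156
t.table <- qt(1 - 0.05 / 2, df = 17)  # two-sided critical value at alpha = 0.05: about 2.11
abs(t.fact) > t.table                 # FALSE, so x2 is not statistically significant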

1.4.2 Monte Carlo simulation method

The simulation method got its name in honor of the city of Monte Carlo in the Principality of Monaco, one of the smallest countries in the world, situated on the Mediterranean coast near the border between France and Italy.

The Monte Carlo simulation method involves generating random values in accordance with given constraints. When starting simulation modeling, it is first of all necessary to develop an economic-mathematical model (EMM) of the predicted indicator, reflecting the relationships between the factor variables as well as the degree and nature of their influence on the result. Since, in modern market conditions, a subject of economic relations is simultaneously affected by many factors of different nature and direction, and the degree of their influence is not deterministic, it is necessary to divide the EMM variables into two groups: stochastic and deterministic.

Next, you should determine the type of probability distribution for each stochastic variable and the corresponding input parameters, and simulate the values of the stochastic variables using the MS Excel random number generator or other software tools.

The "random number generation" tool is available to users of MS Excel 2007 after activating the add-in Analysis package. The order of activation of the add-on is described above (see page 10, fig. 1.5-1.8). To run the simulation in the menu DATA item must be selected "Data analysis", in the dialog box that appears, select a tool from the list "Random Number Generation" and click OK.

Figure 1.46 - Data analysis menu interface
In the dialog box that appears, you must select the type of probability distribution for each stochastic variable and set the appropriate input parameters.

Figure 1.47 - Random number generator dialog box
This stage is one of the most difficult, so when performing it, it is necessary to draw on the knowledge and experience of experts. The type of probability distribution can also be selected on the basis of available statistical information. In practice, the normal, triangular, and uniform distributions are used most often.

The normal distribution (the Moivre-Gauss-Laplace law) assumes that the values of the predicted parameter gravitate towards the mean. Values that differ significantly from the mean, i.e. located in the "tails" of the distribution, have a low probability.

The triangular distribution is a derivative of the normal distribution and assumes a linearly increasing density as the value approaches the mean.

The uniform distribution is used when all values of the variable indicator have the same probability of realization.

If a variable is important but it is impossible to choose a distribution law for it, it can be treated as having a discrete distribution. The types of probability distributions listed above require the input parameters presented in Table 1.11.
Table 1.11 - Input parameters of the main types of probability distributions

  Type of probability distribution    Input parameters
  1 Normal distribution               mean; standard deviation
  2 Triangular distribution           mean
  3 Uniform distribution              limits of the possible range of values
  4 Discrete distribution             specific values of the variable and their corresponding probabilities

As a result of a series of experiments, a distribution of the values of the stochastic variables will be obtained, on the basis of which the value of the predicted indicator should be calculated.

The next necessary step is an economic-statistical analysis of the simulation results, in which it is recommended to calculate the following statistical characteristics:

  • mean;

  • standard deviation;

  • variance;

  • minimum and maximum values;

  • range of variation;

  • skewness coefficient;

  • kurtosis.
The above indicators can be used to test the hypothesis of a normal distribution. If the hypothesis is confirmed, the "three sigma" rule can be used to make an interval forecast. The three sigma rule states that if a random variable X follows the normal distribution law with parameters a and σ, its values almost certainly lie in the interval (a − 3σ, a + 3σ), that is, P(|X − a| < 3σ) ≈ 0.997. To improve clarity and simplify interpretation, it is advisable to build a histogram.
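The whole procedure can be sketched in R on a deliberately simple, hypothetical EMM (profit = price × volume − fixed cost); the distributions and their parameters here are assumptions for illustration:

set.seed(42)
N      <- 10000                            # number of simulation experiments
price  <- rnorm(N, mean = 100, sd = 5)     # stochastic variable, normal distribution
volume <- runif(N, min = 800, max = 1200)  # stochastic variable, uniform distribution
cost   <- 70000                            # deterministic variable
profit <- price * volume - cost            # the economic-mathematical model

m <- mean(profit); s <- sd(profit)
c(mean = m, sd = s, min = min(profit), max = max(profit))
c(lower = m - 3 * s, upper = m + 3 * s)    # interval forecast by the three sigma rule
hist(profit)                               # histogram as in Figure 1.48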


Figure 1.48 - Histogram of predicted indicator values

The implementation of these stages will make it possible to obtain a probabilistic estimate of the values ​​of the predicted indicator (interval forecast).

Today, everyone who is at least a little interested in data mining has probably heard about simple linear regression. It has already been written about on Habré, and Andrew Ng also covered it in detail in his well-known machine learning course. Linear regression is one of the basic and simplest methods of machine learning, but methods for assessing the quality of a constructed model are mentioned very rarely. In this article, I will try to correct this annoying omission a little, using the example of parsing the results of the summary.lm() function in the R language. In doing so, I will try to provide the necessary formulas, so all calculations can easily be programmed in any other language. This article is intended for those who have heard that it is possible to build a linear regression but have not come across the statistical procedures for assessing its quality.

Linear regression model

So, let there be several independent random variables X1, X2, ..., Xn (predictors) and a variable Y that depends on them (it is assumed that all the necessary transformations of the predictors have already been made). Moreover, we assume that the dependence is linear and the errors are normally distributed, i.e.

Y = b0 + b1·X1 + … + bn·Xn + ε,  ε ~ N(0, σ²I),

where I is a square identity matrix of size k x k (k is the number of observations, defined below).

So, we have data consisting of k observations of the values of Y and Xi, and we want to estimate the coefficients. The standard way to find coefficient estimates is the least squares method, and the analytical solution obtained by applying this method looks like this:

b̂ = (XᵀX)⁻¹ Xᵀ y,

where b̂ (b with a cap) is the vector of coefficient estimates, y is the vector of values of the dependent variable, and X is a matrix of size k x (n+1) (n is the number of predictors, k is the number of observations), in which the first column consists of ones, the second of the values of the first predictor, the third of the second, and so on, with the rows corresponding to the observations.
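As a sketch, the analytical solution can be reproduced in R on toy data and compared against lm(); the data-generating coefficients (2, 1, -0.5) are arbitrary:

set.seed(3)
k <- 30
X <- cbind(1, rnorm(k), rnorm(k))              # column of ones plus two predictors
y <- as.vector(X %*% c(2, 1, -0.5) + rnorm(k))

b.hat <- solve(t(X) %*% X) %*% t(X) %*% y      # (X'X)^{-1} X'y
cbind(b.hat, coef(lm(y ~ X[, 2] + X[, 3])))    # the two columns match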

The summary.lm() function and evaluation of the results

Now consider an example of building a linear regression model in the R language:
> library(faraway)
> lm1 <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data = gala)
> summary(lm1)

Call:
lm(formula = Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data = gala)

Residuals:
     Min       1Q   Median       3Q      Max
-111.679  -34.898   -7.862   33.460  182.584

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.068221  19.154198   0.369 0.715351
Area        -0.023938   0.022422  -1.068 0.296318
Elevation    0.319465   0.053663   5.953 3.82e-06 ***
Nearest      0.009144   1.054136   0.009 0.993151
Scruz       -0.240524   0.215402  -1.117 0.275208
Adjacent    -0.074805   0.017700  -4.226 0.000297 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 60.98 on 24 degrees of freedom
Multiple R-squared: 0.7658,  Adjusted R-squared: 0.7171
F-statistic: 15.7 on 5 and 24 DF,  p-value: 6.838e-07
The gala table contains data on the 30 Galapagos Islands. We will consider a model in which Species, the number of different plant species on an island, depends linearly on several other variables.

Consider the output of the summary.lm() function.
First comes a line that recalls how the model was built.
Then comes information about the distribution of residuals: minimum, first quartile, median, third quartile, maximum. At this point, it would be useful not only to look at some quantiles of the residuals, but also to check them for normality, for example, using the Shapiro-Wilk test.
Next - the most interesting - information about the coefficients. A little theory is needed here.
First we write the following results:

b̂ ~ N(b, σ²(XᵀX)⁻¹),

σ̂² = ‖ε̂‖² / (k − n − 1),

where σ̂² (sigma squared with a cap) is an unbiased estimator of the real σ². Here b is the real vector of coefficients, and ε̂ (epsilon with a cap) is the vector of residuals obtained when the least squares estimates are taken as the coefficients. That is, under the assumption that the errors are normally distributed, the vector of coefficients is itself normally distributed around its real value, and its variance can be estimated without bias. This means that we can test the hypothesis that a coefficient equals zero, and thereby check the significance of the predictors, i.e. whether the value of Xi really strongly affects the quality of the constructed model.
To test this hypothesis we need the following statistic, which has Student's distribution if the real value of the coefficient bi is 0:

t = b̂i / ŝ(b̂i) ~ t(k − n − 1),

where ŝ(b̂i) = σ̂ · sqrt[((XᵀX)⁻¹)ii] (the i-th diagonal element) is the standard error of the coefficient estimate, and t(k − n − 1) is Student's distribution with k − n − 1 degrees of freedom.

We are now ready to continue parsing the output of the summary.lm() function.
So, next come the coefficient estimates obtained by the least squares method, their standard errors, the values of the t-statistic, and the p-values for it. Typically, the p-value is compared with some sufficiently small pre-selected threshold, such as 0.05 or 0.01. If the p-value is less than the threshold, the hypothesis is rejected; if it is greater, unfortunately, nothing concrete can be said. Let me remind you that in this case, since Student's distribution is symmetric about 0, the p-value equals 1 − F(|t|) + F(−|t|), where F is the distribution function of Student's distribution with k − n − 1 degrees of freedom. Also, R kindly marks with asterisks the significant coefficients, those for which the p-value is sufficiently small, i.e. those that are very unlikely to be 0. The line Signif. codes contains the decoding of the asterisks: three asterisks mean a p-value from 0 to 0.001, two mean from 0.001 to 0.01, and so on; if there is no mark, the p-value is greater than 0.1.
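For instance, the Elevation line can be checked by hand; a quick sketch continuing the session above:

t  <- 0.319465 / 0.053663   # Estimate / Std. Error = 5.953
df <- 24                    # k - n - 1 = 30 - 5 - 1
2 * pt(-abs(t), df)         # two-sided p-value, about 3.8e-06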

In our example, we can say with great certainty that the Elevation and Adjacent predictors really do affect the Species value, while nothing definite can be said about the rest of the predictors. Usually, in such cases, predictors are removed one at a time while watching how other model indicators change, for example BIC or Adjusted R-squared, which will be discussed later.

The value of Residual standard error corresponds to a simple estimate of sigma with a cap, and the degrees of freedom are calculated as k-n-1.

And now the most important statistics, which are worth looking at first of all: R-squared and Adjusted R-squared:

R² = 1 − Σ(yi − ŷi)² / Σ(yi − ȳ)²,

R²adj = 1 − (1 − R²) · (k − 1) / (k − n − 1),

where yi are the real values of Y in each observation, ŷi are the values predicted by the model, and ȳ is the average of all the real values yi.

Let's start with the R-squared statistic, or, as it is sometimes called, the coefficient of determination. It shows how the conditional variance of the model differs from the variance of the real values of Y. If this coefficient is close to 1, the conditional variance of the model is quite small, and it is very likely that the model fits the data well. If the R-squared coefficient is much lower, for example less than 0.5, then, with a high degree of confidence, the model does not reflect the real state of affairs.

However, the R-squared statistic has one serious drawback: as the number of predictors increases, this statistic can only increase. Therefore, it may seem that a model with more predictors is better than a model with fewer, even if all the new predictors have no effect on the dependent variable. Here we can recall the principle of Occam's razor: following it, it is worth getting rid of unnecessary predictors where possible, as the model becomes simpler and more understandable. For these purposes, the adjusted R-squared statistic was invented. It is the regular R-squared, but with a penalty for a large number of predictors. The main idea: if new independent variables make a large contribution to the quality of the model, the value of this statistic increases; if not, it decreases.
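Continuing the session above, both statistics for lm1 can be recomputed directly from the residuals (k = 30 observations, n = 5 predictors):

rss <- sum(resid(lm1)^2)
tss <- sum((gala$Species - mean(gala$Species))^2)
r2  <- 1 - rss / tss                    # Multiple R-squared: 0.7658
1 - (1 - r2) * (30 - 1) / (30 - 5 - 1)  # Adjusted R-squared: 0.7171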

For example, consider the same model as before, but now instead of five predictors, we will leave two:
> lm2 <- lm(Species ~ Elevation + Adjacent, data = gala)
> summary(lm2)

Call:
lm(formula = Species ~ Elevation + Adjacent, data = gala)

Residuals:
    Min      1Q  Median      3Q     Max
-103.41  -34.33  -11.43   22.57  203.65

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.43287   15.02469   0.095 0.924727
Elevation    0.27657    0.03176   8.707 2.53e-09 ***
Adjacent    -0.06889    0.01549  -4.447 0.000134 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 60.86 on 27 degrees of freedom
Multiple R-squared: 0.7376,  Adjusted R-squared: 0.7181
F-statistic: 37.94 on 2 and 27 DF,  p-value: 1.434e-08
As you can see, the value of the R-square statistic has decreased, but the value of the adjusted R-square even increased slightly.

Now let's test the hypothesis that all the coefficients of the predictors are equal to zero, that is, the hypothesis of whether the value of Y depends linearly on the values of Xi at all. To do this, we can use the following statistic, which, if the hypothesis that all coefficients are equal to zero is true, has the Fisher distribution with n and k − n − 1 degrees of freedom:

F = [R² / (1 − R²)] · [(k − n − 1) / n].
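A quick sketch of this check for lm1, using the figures from the summary above (n = 5, k = 30):

r2 <- 0.7658
f  <- r2 / (1 - r2) * (30 - 5 - 1) / 5  # F-statistic: about 15.7
1 - pf(f, df1 = 5, df2 = 24)            # p-value: about 6.8e-07, reject the hypothesis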

The multiple correlation coefficient is used as a measure of the degree of closeness of the statistical relationship between the resulting indicator (dependent variable) y and the set of explanatory (independent) variables; in other words, it estimates the closeness of the joint influence of the factors on the result.

The multiple correlation coefficient can be calculated from a number of formulas, including:

    using the matrix of paired correlation coefficients:

R(yx1...xm) = sqrt(1 − |Δr| / |Δr11|), (3.18)

where |Δr| is the determinant of the matrix of paired correlation coefficients, including y:

      | 1      ryx1    ryx2   ...  ryxm  |
Δr =  | rx1y   1       rx1x2  ...  rx1xm |
      | ...    ...     ...    ...  ...   |
      | rxmy   rxmx1   rxmx2  ...  1     |

and |Δr11| is the determinant of the interfactor correlation matrix:

        | 1      rx1x2  ...  rx1xm |
Δr11 =  | rx2x1  1      ...  rx2xm |
        | ...    ...    ...  ...   |
        | rxmx1  rxmx2  ...  1     |

    using the standardized regression coefficients:

R(yx1...xm) = sqrt(β1·ryx1 + β2·ryx2 + … + βm·ryxm). (3.19)

For a model in which there are two independent variables, formula (3.18) simplifies to:

R(yx1x2) = sqrt[(r²yx1 + r²yx2 − 2·ryx1·ryx2·rx1x2) / (1 − r²x1x2)]. (3.20)
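A one-line sketch of formula (3.20) in R, with assumed pairwise correlations purely for illustration:

r.yx1 <- 0.85; r.yx2 <- 0.60; r.x1x2 <- 0.45   # hypothetical paired correlations
sqrt((r.yx1^2 + r.yx2^2 - 2 * r.yx1 * r.yx2 * r.x1x2) / (1 - r.x1x2^2))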

The square of the multiple correlation coefficient is the coefficient of determination R². As in the case of pairwise regression, R² indicates the quality of the regression model and reflects the share of the total variation of the resulting trait y explained by the regression function f(x) (see 2.4). In addition, the coefficient of determination can be found by the formula:

R² = 1 − Σ(yi − ŷi)² / Σ(yi − ȳ)². (3.21)

However, using R² in the case of multiple regression is not entirely correct, since the coefficient of determination increases whenever regressors are added to the model. This happens because the residual variance decreases as additional variables are introduced. And if the number of factors approaches the number of observations, the residual variance becomes zero, and the multiple correlation coefficient, and hence the coefficient of determination, approach unity, although in reality the relationship between the factors and the result and the explanatory power of the regression equation may be much lower.

In order to obtain an adequate assessment of how well the variation of the resulting trait is explained by the variation of several factor traits, the adjusted coefficient of determination is applied:

R²adj = 1 − (1 − R²) · (n − 1) / (n − m − 1). (3.22)

The adjusted coefficient of determination is always less than R². Moreover, unlike R², which is always positive, R²adj can also take negative values.

Example (continuation of example 1). Let's calculate the multiple correlation coefficient according to formula (3.20): R = 0.8601.

The value of the multiple correlation coefficient, equal to 0.8601, indicates a strong relationship between the cost of transportation, the weight of the cargo, and the distance over which it is transported.

The coefficient of determination is R² = 0.7399.

The adjusted coefficient of determination, calculated by formula (3.22), is R²adj = 0.7092.

Note that the value of the adjusted coefficient of determination differs from the value of the coefficient of determination.

Thus, 70.9% of the variation in the dependent variable (transportation cost) is explained by the variation in the independent variables (cargo weight and transportation distance). The remaining 29.1% of the variation in the dependent variable is explained by factors not taken into account in the model.

The value of the adjusted coefficient of determination is quite large, therefore, we were able to take into account in the model the most significant factors that determine the cost of transportation. 

Regression analysis is a statistical research method that allows you to show the dependence of a parameter on one or more independent variables. In the pre-computer era, its use was quite difficult, especially for large amounts of data. Today, having learned how to build a regression in Excel, you can solve complex statistical problems in just a couple of minutes. Below are specific examples from the field of economics.

Types of regression

The concept itself was introduced into mathematics in 1886. A regression can be:

  • linear;
  • parabolic;
  • power;
  • exponential;
  • hyperbolic;
  • logarithmic.

Example 1

Consider the problem of determining the dependence of the number of team members who quit on the average salary at 6 industrial enterprises.

Task. At six enterprises, the average monthly salary and the number of employees who left of their own free will were analyzed. In tabular form, the data consist of the salary levels (30,000; 35,000; 40,000; 45,000; 50,000; 55,000; 60,000 rubles) and, for each, the corresponding number of people who left.

For the problem of determining the dependence of the number of workers who quit on the average salary at 6 enterprises, the regression model has the form of the equation Y = a0 + a1x1 + … + akxk, where xi are the influencing variables, ai are the regression coefficients, and k is the number of factors.

For this task, Y is the number of employees who left, and the influencing factor is the salary, which we denote by X.

Using the capabilities of the Excel spreadsheet

Regression analysis in Excel can be performed by applying built-in functions to the available tabular data. However, for these purposes it is better to use the very useful "Analysis ToolPak" add-in. To activate it you need to:

  • go from the "File" tab to the "Options" section;
  • in the window that opens, select the line "Add-ins";
  • click on the "Go" button located at the bottom, to the right of the "Manage" line;
  • check the box next to the name "Analysis ToolPak" and confirm your actions by clicking "OK".

If everything is done correctly, the desired button will appear on the right side of the Data tab, located above the Excel worksheet.

Regression analysis in Excel

Now that we have at hand all the necessary virtual tools for performing econometric calculations, we can begin to solve our problem. For this:

  • click on the "Data Analysis" button;
  • in the window that opens, click on the "Regression" button;
  • in the tab that appears, enter the range of values for Y (the number of employees who quit) and for X (their salaries);
  • confirm the actions by pressing the "OK" button.

As a result, the program will automatically populate a new sheet of the spreadsheet with the regression analysis data. Note that Excel lets you manually set a preferred location for this output: for example, the same sheet where the Y and X values are, or even a new workbook specifically designed to store such data.

Analyzing the regression results for R-squared

The data obtained when processing the considered example in Excel look like this:

First of all, you should pay attention to the R-squared value. It is the coefficient of determination. In this example, R-squared = 0.755 (75.5%), i.e., the calculated parameters of the model explain the relationship between the considered parameters by 75.5%. The higher the value of the coefficient of determination, the more applicable the chosen model is for the particular task. It is believed that the model correctly describes the real situation when the R-squared value is above 0.8. If R-squared < 0.5, then such a regression analysis in Excel cannot be considered reasonable.

Analyzing the coefficients

The number 64.1428 shows what the value of Y will be if all the variables xi in the model under consideration are set to zero. In other words, it can be argued that the value of the analyzed parameter is also influenced by other factors not described in the specific model.

The next coefficient, -0.16285, located in cell B18, shows the weight of the influence of variable X on Y. It means that, within the model under consideration, the average monthly salary affects the number of people who quit with a weight of -0.16285, i.e. the degree of its influence is quite small. The "-" sign indicates that the coefficient is negative. This is to be expected, since everyone knows that the higher the salary at an enterprise, the fewer people express a desire to terminate the employment contract or quit.

Multiple regression

This term refers to a relationship equation with several independent variables of the form:

y = f(x1, x2, …, xm) + ε,

where y is the effective feature (dependent variable), and x1, x2, …, xm are the factor features (independent variables).

Parameter Estimation

For multiple regression (MR), parameter estimation is carried out using the method of least squares (OLS). For linear equations of the form Y = a + b1x1 + … + bmxm + ε, we construct a system of normal equations (see below).

To understand the principle of the method, consider the two-factor case. Then we have a situation described by the system:

Σy     = n·a + b1·Σx1 + b2·Σx2,
Σy·x1  = a·Σx1 + b1·Σx1² + b2·Σx1x2,
Σy·x2  = a·Σx2 + b1·Σx1x2 + b2·Σx2².

From here we get:

bi = βi · (σy / σxi),

where σ is the standard deviation of the corresponding feature indicated in the index.

OLS is also applicable to the MR equation on a standardized scale. In this case, we get the equation:

ty = β1·tx1 + β2·tx2 + … + βm·txm,

where ty, tx1, …, txm are standardized variables, for which the mean is 0 and the standard deviation is 1, and βi are the standardized regression coefficients.

Please note that all βi in this case are normalized and centered, so their comparison with each other is considered correct and admissible. In addition, it is customary to filter factors out by discarding those with the smallest values of βi.

Problem using a linear regression equation

Suppose there is a table of the price dynamics of a particular product N during the last 8 months. It is necessary to decide on the advisability of purchasing a batch of it at a price of 1850 rubles/t.

Month number    Price of item N
1               1750 rubles per ton
2               1755 rubles per ton
3               1767 rubles per ton
4               1760 rubles per ton
5               1770 rubles per ton
6               1790 rubles per ton
7               1810 rubles per ton
8               1840 rubles per ton

To solve this problem in the Excel spreadsheet, you need to use the "Data Analysis" tool already known from the example above. Next, select the "Regression" section and set the parameters. Remember that the "Input interval Y" field must contain the range of values of the dependent variable (in this case, the price of the product in specific months of the year), and the "Input interval X" field the range for the independent variable (the month number). Confirm the action by clicking "OK". On a new sheet (if that was specified), we get the regression data.

Based on them, we build a linear equation of the form y = ax + b, where the parameters a and b are the coefficients of the row named after the month number and of the "Y-intercept" row from the sheet with the regression results. Thus, the linear regression equation (LE) for problem 3 is written as:

Price of product N = 11.714 * month number + 1727.54,

or in algebraic notation:

y = 11.714x + 1727.54
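As a sketch, the same equation can be reproduced in R from the table above, and the trend extrapolated one month ahead (the month numbering 1-8 follows the table):

price <- c(1750, 1755, 1767, 1760, 1770, 1790, 1810, 1840)
month <- 1:8
fit <- lm(price ~ month)
coef(fit)                            # about 1727.54 (intercept) and 11.714 (slope)
predict(fit, data.frame(month = 9))  # forecast for month 9: about 1833 rubles/t

Since the trend forecast for the next month (about 1833 rubles/t) is below the offered 1850 rubles/t, this extrapolation is one input into the purchase decision.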

Analysis of the results

To decide whether the resulting linear regression equation is adequate, the multiple correlation coefficient (MCC) and the coefficient of determination are used, as well as Fisher's test and Student's test. In the Excel table with the regression results, they appear under the names Multiple R, R-square, F-statistic, and t-statistic, respectively.

The MCC R makes it possible to assess the tightness of the probabilistic relationship between the independent and dependent variables. Its high value indicates a fairly strong relationship between the variables "month number" and "price of product N in rubles per 1 ton". However, the nature of this relationship remains unknown.

The coefficient of determination R² is a numerical characteristic of the share of the total scatter that is explained, i.e. it shows what part of the experimental data (the values of the dependent variable) corresponds to the linear regression equation. In the problem under consideration, this value is 84.8%, i.e., the statistical data are described by the obtained regression equation with a high degree of accuracy.

The F-statistic, also called Fisher's test, is used to assess the significance of the linear relationship, refuting or confirming the hypothesis of its existence.

The t-statistic (Student's criterion) helps to evaluate the significance of the coefficient of the unknown and of the free (intercept) term of the linear relationship. If the value of the t-criterion > t_cr, then the hypothesis of the insignificance of the corresponding term of the linear equation is rejected.

In the problem under consideration, for the free term, it was obtained using the Excel tools that t = 169.20903 and p = 2.89E-12; such a p-value means that the hypothesis of the insignificance of the free term must be rejected. For the coefficient of the unknown, t = 5.79405 and p = 0.001158; in other words, the probability of observing such a coefficient by chance, were it actually insignificant, is about 0.12%.
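The quoted slope figures can be verified with a short sketch in R (8 observations, 1 factor, so df = 6):

2 * pt(-abs(5.79405), df = 6)   # two-sided p-value: about 0.00116, matching 0.001158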

Thus, it can be argued that the resulting linear regression equation is adequate.

The problem of the expediency of buying a block of shares

Multiple regression in Excel is performed using the same "Data Analysis" tool. Let's consider a specific applied problem.

The management of the NNN company must decide on the advisability of purchasing a 20% stake in MMM JSC. The cost of the stake (SP) is 70 million US dollars. NNN specialists have collected data on similar transactions. It was decided to evaluate the value of the block of shares from parameters, expressed in millions of US dollars, such as:

  • accounts payable (VK);
  • annual turnover (VO);
  • accounts receivable (VD);
  • cost of fixed assets (SOF).

In addition, the parameter of the enterprise's payroll arrears (VZP), in thousands of US dollars, is used.

Solution using the Excel spreadsheet

First of all, you need to create a table of initial data. It looks like this:

Then:

  • call up the "Data Analysis" window;
  • select the "Regression" section;
  • in the "Input interval Y" box, enter the range of values of the dependent variable from column G;
  • click on the icon with the red arrow to the right of the "Input interval X" window and select the range of all the values from columns B, C, D, F on the sheet.

Select "New Worksheet" and click "OK".

We obtain the regression analysis for the given problem.

Examination of the results and conclusions

"Collecting" the rounded data presented above on the Excel worksheet, we obtain the regression equation:

SP = 0.103*SOF + 0.541*VO - 0.031*VK + 0.405*VD + 0.691*VZP - 265.844.

In the more familiar mathematical form, it can be written as:

y = 0.103*x1 + 0.541*x2 - 0.031*x3 + 0.405*x4 + 0.691*x5 - 265.844
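A minimal sketch wrapping this equation as an R function; the input values for MMM itself are not reproduced here, so the arguments in the example call are placeholders:

sp <- function(sof, vo, vk, vd, vzp) {
  # estimated value of the stake, million US dollars
  0.103 * sof + 0.541 * vo - 0.031 * vk + 0.405 * vd + 0.691 * vzp - 265.844
}
# example call with placeholder inputs: sp(sof = 100, vo = 200, vk = 50, vd = 80, vzp = 300)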

Data for JSC "MMM" are presented in the table:

Substituting them into the regression equation, they get a figure of 64.72 million US dollars. This means that the shares of JSC MMM should not be purchased, since their value of 70 million US dollars is rather overstated.

As you can see, the use of the Excel spreadsheet and the regression equation made it possible to make an informed decision regarding the feasibility of a very specific transaction.

Now you know what regression is. The examples in Excel discussed above will help you solve practical problems from the field of econometrics.