Linear regression method in statistics. Regression in Excel: equation, examples

Regression analysis is a statistical research method that allows you to show the dependence of a parameter on one or more independent variables. In the pre-computer era, its use was quite difficult, especially when it came to large amounts of data. Today, having learned how to build a regression in Excel, you can solve complex statistical problems in just a couple of minutes. Below are specific examples from the field of economics.

Types of regression

The concept itself was introduced into mathematics in 1886. Regression can be:

  • linear;
  • parabolic;
  • power;
  • exponential;
  • hyperbolic;
  • logarithmic.

Example 1

Consider the problem of determining how the number of employees who quit depends on the average salary at six industrial enterprises.

A task. At six enterprises, we analyzed the average monthly salary and the number of employees who left of their own free will. In tabular form we have:

Salary levels: 30000, 35000, 40000, 45000, 50000, 55000 and 60000 rubles; for each level, the table gives the corresponding number of people who left (the counts appear in the source as an image).

For the problem of determining the dependence of the number of employees who quit on the average salary, the regression model has the form of the equation Y = a_0 + a_1 x_1 + … + a_k x_k, where x_i are the influencing variables, a_i are the regression coefficients, and k is the number of factors.

For this task, Y is the indicator of employees who left, and the influencing factor is the salary, which we denote by X.
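As a sketch of how such a one-factor model can be fitted outside Excel, here is a least-squares fit in Python. The salary values are taken from the example; the leaver counts are hypothetical, since the Y column of the source table is not reproduced in the text.

```python
# Sketch of fitting the one-factor model Y = a0 + a1*X by least squares.
# The salary values are from the example; the leaver counts are HYPOTHETICAL,
# since the Y column of the source table is not reproduced in the text.

def linear_fit(xs, ys):
    """Return (a0, a1) minimizing the sum of squared residuals."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    a1 = sxy / sxx
    a0 = my - a1 * mx
    return a0, a1

salaries = [30000, 35000, 40000, 45000, 50000, 55000, 60000]  # X, rubles
leavers = [60, 35, 20, 20, 15, 15, 5]                         # Y, hypothetical
a0, a1 = linear_fit(salaries, leavers)
print(f"Y = {a0:.2f} + {a1:.5f} * X")
```

With data of this shape the fitted slope comes out negative, matching the intuition discussed later: higher salaries, fewer quitters.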

Using the capabilities of the spreadsheet "Excel"

Regression analysis in Excel can be run with built-in functions applied to the available tabular data. However, for these purposes it is better to use the very useful "Analysis ToolPak" add-in. To activate it you need to:

  • go from the "File" tab to the "Options" section;
  • in the window that opens, select the "Add-ins" line;
  • click the "Go" button located at the bottom, to the right of the "Manage" drop-down;
  • check the box next to "Analysis ToolPak" and confirm by clicking "OK".

If everything is done correctly, the desired button will appear on the right side of the Data tab, located above the Excel worksheet.


Now that we have at hand all the necessary virtual tools for performing econometric calculations, we can begin to solve our problem. For this:

  • click on the "Data Analysis" button;
  • in the window that opens, click on the "Regression" button;
  • in the tab that appears, enter the range of values ​​for Y (the number of employees who quit) and for X (their salaries);
  • We confirm our actions by pressing the "Ok" button.

As a result, the program will automatically place the regression analysis output on a new worksheet. Note: Excel also lets you choose the output location manually; for example, it can be the same sheet that holds the Y and X values, or even a new workbook specifically intended for storing such data.

Analysis of regression results for R-square

In Excel, the data obtained during the processing of the data of the considered example looks like this:

First of all, pay attention to the R-square value: the coefficient of determination. In this example, R-square = 0.755 (75.5%), i.e., the calculated parameters of the model explain 75.5% of the relationship between the parameters considered. The higher the coefficient of determination, the better the chosen model fits the particular task. A model is considered to describe the real situation well when R-square is above 0.8; if R-square < 0.5, the regression analysis cannot be considered sound.

Ratio Analysis

The number 64.1428 shows what the value of Y will be if all the variables x_i in the model under consideration are set to zero. In other words, it can be argued that the analyzed parameter is also influenced by other factors not described in this particular model.

The next coefficient, -0.16285, located in cell B18, shows the weight of the influence of variable X on Y: within this model, the average monthly salary affects the number of quitters with a weight of -0.16285, i.e., its influence is quite small. The "-" sign indicates a negative coefficient. This is to be expected: the higher the salary at an enterprise, the fewer people want to terminate their employment contract and quit.

Multiple regression

This term refers to a relationship equation with several independent variables of the form:

y = f(x_1, x_2, …, x_m) + ε, where y is the effective feature (the dependent variable) and x_1, x_2, …, x_m are the factor variables (the independent variables).

Parameter Estimation

For multiple regression (MR) it is carried out using the method of least squares (OLS). For linear equations of the form Y = a + b_1 x_1 + … + b_m x_m + ε, we construct a system of normal equations (see below).

To understand the principle of the method, consider the two-factor case, described by y = a + b_1 x_1 + b_2 x_2 + ε.

From here we get b_i = β_i (σ_y / σ_xi),

where σ is the standard deviation of the corresponding feature, indicated by the subscript.

OLS is also applicable to the MR equation on a standardized scale. In this case we get the equation

t_y = β_1 t_x1 + … + β_m t_xm + ε,

where t_y, t_x1, …, t_xm are standardized variables with mean 0 and standard deviation 1, and β_i are the standardized regression coefficients.

Note that all β_i in this case are normalized and centered, so comparing them with each other is considered correct and admissible. In addition, it is customary to screen factors, discarding those with the smallest values of β_i.
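A minimal illustration of standardized coefficients, using made-up data (not from the article): for a single factor, the standardized slope β equals the correlation coefficient, and the raw slope is recovered as b = β·σ_y/σ_x.

```python
# Illustration of standardized (beta) coefficients with made-up data.
# For one factor the standardized slope equals the correlation coefficient,
# and the raw slope is recovered as b = beta * sigma_y / sigma_x.

def mean(v):
    return sum(v) / len(v)

def std(v):
    m = mean(v)
    return (sum((x - m) ** 2 for x in v) / len(v)) ** 0.5

def standardize(v):
    m, s = mean(v), std(v)
    return [(x - m) / s for x in v]

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

tx, ty = standardize(x), standardize(y)
beta = mean([a * b for a, b in zip(tx, ty)])  # slope of t_y on t_x
b = beta * std(y) / std(x)                    # back to the raw scale
print(beta, b)
```

Because the β_i are dimensionless, it is legitimate to compare them across factors, which is exactly the screening rule described above.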

Problem using linear regression equation

Suppose there is a table of the price dynamics of a particular product N during the last 8 months. It is necessary to make a decision on the advisability of purchasing its batch at a price of 1850 rubles/t.

Month number and the price of item N (the month-name column of the source table is omitted here):

  • month 1: 1750 rubles per ton;
  • month 2: 1755 rubles per ton;
  • month 3: 1767 rubles per ton;
  • month 4: 1760 rubles per ton;
  • month 5: 1770 rubles per ton;
  • month 6: 1790 rubles per ton;
  • month 7: 1810 rubles per ton;
  • month 8: 1840 rubles per ton.

To solve this problem in Excel, use the Data Analysis tool already familiar from the example above. Select the "Regression" section and set the parameters. Remember that the "Input Y Range" field must receive the range of the dependent variable (here, the price of the product in specific months), and the "Input X Range" field the range of the independent variable (the month number). Confirm by clicking "OK". On a new sheet (if so specified), we get the regression output.

Based on it, we build a linear equation of the form y = ax + b, where the coefficients a and b are taken from the regression output: a from the row labeled with the month number and b from the "Y-intercept" row. Thus, the linear regression equation for this problem is written as:

Product price N = 11.714* month number + 1727.54.

or in algebraic notation

y = 11.714 x + 1727.54
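This fit can be verified outside Excel: an ordinary least-squares calculation on the eight monthly prices reproduces the same coefficients and also gives a forecast for month 9.

```python
# Verifying the fit outside Excel: ordinary least squares on the eight
# monthly prices reproduces the coefficients quoted in the text.

def ols(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

months = list(range(1, 9))
prices = [1750, 1755, 1767, 1760, 1770, 1790, 1810, 1840]
a, b = ols(months, prices)
print(f"y = {a:.3f}x + {b:.2f}")          # y = 11.714x + 1727.54
print(f"month 9 forecast: {a * 9 + b:.2f} rubles/t")
```

The trend value for month 9 is about 1832.96 rubles/t, which is relevant when judging the proposed purchase price of 1850 rubles/t.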

Analysis of results

To decide whether the resulting linear regression equation is adequate, multiple correlation coefficients (MCC) and determination coefficients are used, as well as Fisher's test and Student's test. In the Excel table with regression results, they appear under the names of multiple R, R-square, F-statistic and t-statistic, respectively.

The MCC R makes it possible to assess the strength of the probabilistic relationship between the independent and dependent variables. Its high value indicates a fairly strong relationship between the variables "month number" and "price of item N in rubles per ton". However, the nature of this relationship remains unknown.

The coefficient of determination R^2 is a numerical characteristic of the share of the total scatter of the experimental data, i.e. of the values of the dependent variable, that the linear regression equation accounts for. In the problem under consideration this value is 84.8%, i.e. the statistical data are described with high accuracy by the obtained equation.

F-statistics, also called Fisher's test, is used to assess the significance of a linear relationship, refuting or confirming the hypothesis of its existence.

The t-statistic (Student's criterion) helps to evaluate the significance of the slope coefficient and of the free (intercept) term of a linear relationship. If the value of the t-statistic > t_cr, then the hypothesis of the insignificance of the corresponding term of the linear equation is rejected.

In the problem under consideration, Excel gives t = 169.20903 and p = 2.89E-12 for the free term, i.e. the probability of erroneously rejecting the hypothesis of its insignificance is essentially zero. For the slope coefficient, t = 5.79405 and p = 0.001158; in other words, the probability of erroneously rejecting the hypothesis that this coefficient is insignificant is 0.12%.
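These t-statistics can be recomputed by hand from the same eight observations, using the standard OLS formulas for the standard errors of the slope and intercept.

```python
# Recomputing the t-statistics quoted above from the same eight observations,
# using the standard OLS formulas for the coefficient standard errors.

months = list(range(1, 9))
prices = [1750, 1755, 1767, 1760, 1770, 1790, 1810, 1840]
n = len(months)
mx = sum(months) / n
my = sum(prices) / n
sxx = sum((x - mx) ** 2 for x in months)
a = sum((x - mx) * (y - my) for x, y in zip(months, prices)) / sxx  # slope
b = my - a * mx                                                     # intercept
sse = sum((y - (a * x + b)) ** 2 for x, y in zip(months, prices))
s_e = (sse / (n - 2)) ** 0.5                  # residual standard error
se_slope = s_e / sxx ** 0.5
se_inter = s_e * (1 / n + mx ** 2 / sxx) ** 0.5
print(f"t(intercept) = {b / se_inter:.5f}")   # ~169.209
print(f"t(slope)     = {a / se_slope:.5f}")   # ~5.794
```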

Thus, it can be argued that the resulting linear regression equation is adequate.

The problem of the expediency of buying a block of shares

Multiple regression in Excel is performed using the same Data Analysis tool. Consider a specific applied problem.

The management of NNN must decide on the advisability of purchasing a 20% stake in MMM SA. The cost of the package (SP) is 70 million US dollars. NNN specialists collected data on similar transactions. It was decided to evaluate the value of the block of shares according to parameters, expressed in millions of US dollars, such as:

  • accounts payable (VK);
  • annual turnover (VO);
  • accounts receivable (VD);
  • cost of fixed assets (SOF).

In addition, the parameter payroll arrears of the enterprise (VZP), in thousands of US dollars, is used.

Solution using Excel spreadsheet

First of all, you need to create a table of the initial data. After that:

  • call up the "Data Analysis" window;
  • select the "Regression" section;
  • in the "Input Y Range" box, enter the range of values of the dependent variable from column G;
  • click the red-arrow icon to the right of the "Input X Range" box and select the range of all values from columns B, C, D, F on the sheet.

Select "New Worksheet" and click "OK".

Get the regression analysis for the given problem.

Examination of the results and conclusions

We "assemble" the regression equation from the rounded data presented above on the Excel worksheet:

SP = 0.103*SOF + 0.541*VO - 0.031*VK + 0.405*VD + 0.691*VZP - 265.844.

In a more familiar mathematical form, it can be written as:

y = 0.103*x1 + 0.541*x2 - 0.031*x3 +0.405*x4 +0.691*x5 - 265.844
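For convenience, the fitted equation can be wrapped in a reusable function. The coefficients are those given above; the sample inputs below are hypothetical, since the table with MMM's actual figures is not reproduced here.

```python
# The fitted equation as a reusable function. Coefficients are from the text;
# the sample inputs below are HYPOTHETICAL, since the table with MMM's actual
# figures is not reproduced here.

def estimate_sp(sof, vo, vk, vd, vzp):
    """Estimated value of the stake (SP), millions of USD."""
    return (0.103 * sof + 0.541 * vo - 0.031 * vk
            + 0.405 * vd + 0.691 * vzp - 265.844)

# Hypothetical inputs: SOF, VO, VK, VD in million USD; VZP in thousand USD.
print(round(estimate_sp(sof=120, vo=450, vk=55, vd=80, vzp=40), 2))
```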

Data for JSC "MMM" are presented in the table:

Substituting MMM's figures into the regression equation yields an estimate of 64.72 million US dollars. This means the shares of JSC MMM should not be purchased, since their asking price of 70 million US dollars is rather overstated.

As you can see, the use of the Excel spreadsheet and the regression equation made it possible to make an informed decision regarding the feasibility of a very specific transaction.

Now you know what regression is. The examples in Excel discussed above will help you solve practical problems from the field of econometrics.

Regression analysis underlies most econometric models, including cost estimation models. A valuation model can be built this way if the number of analogues (comparable objects) and the number of cost factors (comparison elements) relate as n > (5-10)·k, i.e. there should be 5-10 times more analogues than cost factors. The same requirement on the ratio of the amount of data to the number of factors applies to other tasks: establishing a relationship between the cost and consumer parameters of an object; justifying the procedure for calculating corrective indices; clarifying price trends; establishing a relationship between wear and changes in influencing factors; obtaining dependencies for calculating cost standards, etc. Meeting this requirement is necessary to reduce the probability of working with a data sample that does not satisfy the requirement of normal distribution of random variables.

The regression relationship reflects only the average tendency of the resulting variable (for example, cost) to change with one or more factor variables (for example, location, number of rooms, area, floor, etc.). This distinguishes a regression relationship from a functional one, in which the value of the resulting variable is strictly determined for a given value of the factor variables.

The presence of a regression relationship f between the resulting variable y and the factor variables x_1, …, x_k indicates that the relationship is determined not only by the influence of the selected factor variables, but also by variables some of which are entirely unknown while others cannot be assessed and taken into account:

y = f(x_1, …, x_k) + ε.

The influence of the unaccounted-for variables is represented by the second term of this equation, ε, which is called the approximation error.

There are the following types of regression dependencies:

  • paired regression: the relationship between two variables (the resulting one and a factor);
  • multiple regression: the dependence of one resulting variable on two or more factor variables included in the study.

The main task of regression analysis is to quantify the closeness of the relationship between variables (in paired regression) and multiple variables (in multiple regression). The tightness of the relationship is quantified by the correlation coefficient.

The use of regression analysis allows you to establish the regularity of the influence of the main factors (hedonic characteristics) on the indicator under study, both in their totality and each of them individually. With the help of regression analysis, as a method of mathematical statistics, it is possible, firstly, to find and describe the form of the analytical dependence of the resulting (desired) variable on the factorial ones and, secondly, to estimate the closeness of this dependence.

By solving the first problem, a mathematical regression model is obtained, with the help of which the desired indicator is then calculated for given factor values. The solution of the second problem makes it possible to establish the reliability of the calculated result.

Thus, regression analysis can be defined as a set of formal (mathematical) procedures designed to measure the closeness, direction and analytical expression of the form of the relationship between the resulting and factor variables, i.e. the output of such an analysis should be a structurally and quantitatively defined statistical model of the form:

where y̅ is the average value of the resulting variable (the desired indicator, for example cost, rent, or capitalization rate) over its n observations; x_i is the value of the i-th factor variable (the i-th cost factor); k is the number of factor variables.

The function f(x_1, …, x_k) describing the dependence of the resulting variable on the factor variables is called the regression equation (function). The term "regression" (from Latin regressio, retreat, return to something) is associated with one of the specific tasks solved at the stage when the method was being formed; it no longer reflects the entire essence of the method but continues to be used.

Regression analysis generally includes the following steps:

  • forming a sample of homogeneous objects and collecting initial information about them;
  • selecting the main factors influencing the resulting variable;
  • checking the sample for normality using the χ² or binomial criterion;
  • adopting a hypothesis about the form of the relationship;
  • mathematical data processing;
  • obtaining a regression model;
  • assessing its statistical indicators;
  • verification calculations using the regression model;
  • analysis of the results.

The specified sequence of operations takes place in the study of both a pair relationship between a factor variable and one resulting variable, and a multiple relationship between the resulting variable and several factor variables.

The use of regression analysis imposes certain requirements on the initial information:

  • the statistical sample of objects should be homogeneous in functional and design-technological terms;
  • it should be sufficiently large;
  • the cost indicator under study, the resulting variable (price, cost, expenses), must be reduced to the same calculation conditions for all objects in the sample;
  • the factor variables must be measured accurately enough;
  • the factor variables must be independent or minimally dependent.

The requirements of homogeneity and of a large sample conflict: the more strictly objects are selected for homogeneity, the smaller the resulting sample; conversely, enlarging the sample requires including objects that are not very similar to each other.

After the data are collected for a group of homogeneous objects, they are analyzed to establish the form of the relationship between the resulting and factor variables in the form of a theoretical regression line. The process of finding a theoretical regression line consists in a reasonable choice of an approximating curve and calculation of the coefficients of its equation. The regression line is a smooth curve (in a particular case, a straight line) that describes with the help of a mathematical function the general trend of the dependence under study and smoothes irregular, random outliers from the influence of side factors.

To describe paired regression dependencies in valuation tasks, the following functions are most often used:

  • linear: y = a_0 + a_1 x + ε;
  • power: y = a_0 x^a_1 + ε;
  • exponential: y = a_0 a_1^x + ε;
  • linear-exponential: y = a_0 + a_1 a_2^x + ε.

Here ε is the approximation error due to the action of unaccounted-for random factors.

In these functions, y is the resulting variable; x is the factor variable (factor); a_0, a_1, a_2 are the parameters of the regression model (regression coefficients).

The linear-exponential model belongs to the class of so-called hybrid models of the form shown in the source, where x_i (i = 1, …, l) are the values of the factors and b_i (i = 0, …, l) are the coefficients of the regression equation.

In this equation, the components A, B and Z correspond to the cost of individual components of the asset being valued, for example, the cost of a land plot and the cost of improvements, and the parameter Q is common. It is designed to adjust the value of all components of the asset being valued for a common influence factor, such as location.

Factors that appear in the exponents of the corresponding coefficients are binary variables (0 or 1). Factors at the base of an exponent are discrete or continuous variables. Factors whose coefficients enter by multiplication are also continuous or discrete.

The specification is carried out, as a rule, using an empirical approach and includes two stages:

  • plotting the points of the regression field on a graph;
  • graphical (visual) analysis of the type of a possible approximating curve.

The type of regression curve is not always immediately selectable. To determine it, the points of the regression field are first plotted on the graph according to the initial data. Then a line is visually drawn along the position of the points, trying to find out the qualitative pattern of the connection: uniform growth or uniform decrease, growth (decrease) with an increase (decrease) in the rate of dynamics, a smooth approach to a certain level.

This empirical approach is complemented by logical analysis, starting from already known ideas about the economic and physical nature of the factors under study and their mutual influence.

For example, it is known that the dependences of the resulting variables, economic indicators such as prices and rent, on a number of factor variables, price-forming factors such as distance from the center of the settlement and area, are non-linear and can be described quite strictly by a power, exponential or quadratic function. But with small ranges of the factors, acceptable results can also be obtained using a linear function.

If it is still impossible to immediately make a confident choice of any one function, then two or three functions are selected, their parameters are calculated, and then, using the appropriate criteria for the tightness of the connection, the function is finally selected.

In regression theory, the process of finding the shape of the curve is called model specification, and finding its coefficients is called model calibration.

If it is found that the resulting variable y depends on several factor variables x_1, x_2, …, x_k, then a multiple regression model is built. Usually three forms of multiple relationship are used:

  • linear: y = a_0 + a_1 x_1 + a_2 x_2 + … + a_k x_k;
  • exponential: y = a_0 a_1^x_1 a_2^x_2 … a_k^x_k;
  • power: y = a_0 x_1^a_1 x_2^a_2 … x_k^a_k;

or combinations thereof.

The power and exponential functions are more universal, since they approximate non-linear relationships, which make up the majority of the dependencies studied in valuation. They can be applied both when evaluating objects by the statistical modeling method for mass valuation and in the direct comparison method in individual valuation when establishing correction factors.

At the calibration stage, the parameters of the regression model are calculated by the method of least squares, the essence of which is that the sum of the squared deviations of the calculated values of the resulting variable ŷ_i, i.e. the values calculated from the selected relationship equation, from the actual values y_i must be minimal:

Q = Σ (ŷ_i - y_i)² → min.

The values ŷ_i and y_i are known, so Q is a function only of the coefficients of the equation. To find the minimum of Q, we take the partial derivatives of Q with respect to the coefficients of the equation and equate them to zero:

As a result, we obtain a system of normal equations, the number of which is equal to the number of determined coefficients of the desired regression equation.

Suppose we need to find the coefficients of the linear equation y = a_0 + a_1 x. The sum of squared deviations is

Q = Σ_{i=1..n} (a_0 + a_1 x_i - y_i)².

Differentiating the function Q with respect to the unknown coefficients a_0 and a_1 and equating the partial derivatives to zero, after transformations we get:

n a_0 + a_1 Σ x_i = Σ y_i,
a_0 Σ x_i + a_1 Σ x_i² = Σ x_i y_i,

where n is the number of original actual values y_i (the number of analogues).
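For the paired case this system is only 2×2 and can be solved directly, for example by Cramer's rule; the data below are illustrative.

```python
# For the paired case the system of normal equations is 2x2 and can be
# solved directly, e.g. by Cramer's rule. Illustrative data.

xs = [1, 2, 3, 4]
ys = [3.1, 4.9, 7.2, 8.8]
n = len(xs)
sx = sum(xs)
sy = sum(ys)
sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

# n*a0      + a1*sum(x)   = sum(y)
# a0*sum(x) + a1*sum(x^2) = sum(x*y)
det = n * sxx - sx * sx
a0 = (sy * sxx - sx * sxy) / det
a1 = (n * sxy - sx * sy) / det
print(a0, a1)
```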

The above procedure for calculating the coefficients of the regression equation is also applicable to nonlinear dependencies if they can be linearized, i.e. brought to a linear form by a change of variables. Power and exponential functions acquire a linear form after taking logarithms and making the corresponding change of variables. For example, a power function after taking logarithms takes the form ln y = ln a_0 + a_1 ln x. After the change of variables Y = ln y, A_0 = ln a_0 and X = ln x, we get the linear function

Y = A_0 + a_1 X,

whose coefficients are found as described above.
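A sketch of this linearization in Python: noise-free data are generated from a power function, logs are taken, a straight line is fitted, and the original parameters are recovered.

```python
# Linearizing a power function y = a0 * x**a1: take logs, fit a line in
# (ln x, ln y), then exponentiate the intercept. Noise-free synthetic data,
# so the original parameters are recovered exactly.

import math

a0_true, a1_true = 2.5, 1.7
xs = [1, 2, 3, 4, 5]
ys = [a0_true * x ** a1_true for x in xs]

lx = [math.log(x) for x in xs]
ly = [math.log(y) for y in ys]
n = len(lx)
mx, my = sum(lx) / n, sum(ly) / n
a1 = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
      / sum((u - mx) ** 2 for u in lx))
a0 = math.exp(my - a1 * mx)
print(a0, a1)
```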

The least squares method is also used to calculate the coefficients of a multiple regression model. Thus, the system of normal equations for a linear function with two variables x_1 and x_2, after a series of transformations, looks like this:

n a_0 + a_1 Σ x_1 + a_2 Σ x_2 = Σ y,
a_0 Σ x_1 + a_1 Σ x_1² + a_2 Σ x_1 x_2 = Σ x_1 y,
a_0 Σ x_2 + a_1 Σ x_1 x_2 + a_2 Σ x_2² = Σ x_2 y.

Usually this system of equations is solved using linear algebra methods. A multiple exponential function is brought to a linear form by taking logarithms and changing variables in the same way as a paired exponential function.

When using hybrid models, multiple regression coefficients are found using numerical procedures of the method of successive approximations.

To make the final choice from several regression equations, it is necessary to check each equation for the tightness of the connection, which is measured by the correlation coefficient, variance and coefficient of variation. For evaluation, you can also use the criteria of Student and Fisher. The greater the tightness of the connection reveals the curve, the more preferable it is, all other things being equal.

If the task is to establish the dependence of a cost indicator on cost factors, the desire to take into account as many influencing factors as possible, and thereby build a more accurate multiple regression model, is understandable. However, two objective limitations hinder expanding the number of factors. First, building a multiple regression model requires a much larger sample of objects than building a paired model: it is generally accepted that the number of objects in the sample should exceed the number of factors k by at least 5-10 times. It follows that building a model with three influencing factors requires a sample of roughly 20 objects with different sets of factor values. Second, the factors selected for the model must be sufficiently independent of each other in their influence on the cost indicator. This is not easy to ensure, since the sample usually combines objects belonging to the same family, in which many factors change regularly from object to object.

The quality of regression models is usually tested using the following statistics.

Standard deviation of the regression equation error (estimation error):

S_e = sqrt( Σ (y_i - ŷ_i)² / (n - k - 1) ),

where n is the sample size (the number of analogues); k is the number of factors (cost factors); y_i is the actual value of the resulting variable (for example, cost); ŷ_i is the calculated value of the resulting variable. The numerator, Σ (y_i - ŷ_i)², is the error unexplained by the regression equation (Fig. 3.2).

This indicator is also called the standard error of estimation (RMS error). In the figure, the dots show the individual sample values, the horizontal line shows the mean value of the sample, and the inclined dash-dotted line is the regression line.


Fig. 3.2.

The standard deviation of the estimation error measures the deviation of the actual values y_i from the corresponding calculated values ŷ_i obtained with the regression model. If the sample on which the model is built follows the normal distribution law, it can be argued that 68% of the real values of y lie within ŷ ± S_e of the regression line and 95% within ŷ ± 2S_e. This indicator is convenient because the units of S_e match the units of y, so it can be used to state the accuracy of a valuation result. For example, a certificate of value can state that the market value V obtained with the regression model lies, with 95% probability, in the range from (V - 2S_e) to (V + 2S_e).

Coefficient of variation of the resulting variable:

var = (S_e / y̅) · 100%,

where y̅ is the mean value of the resulting variable (Fig. 3.2).

In regression analysis, the coefficient of variation var is the standard deviation of the error expressed as a percentage of the mean of the resulting variable. It can serve as a criterion of the predictive quality of the resulting regression model: the smaller var, the higher the predictive quality of the model. Using the coefficient of variation is preferable to using S_e because it is a relative indicator. In practice it can be recommended not to use a model whose coefficient of variation exceeds 33%, since in that case it cannot be said that the sample follows the normal distribution law.

Coefficient of determination (squared multiple correlation coefficient):

R² = Σ (ŷ_i - y̅)² / Σ (y_i - y̅)².

This indicator is used to analyze the overall quality of the resulting regression model. It indicates what percentage of the variation in the resulting variable is due to the influence of all factor variables included in the model. The determination coefficient always lies in the range from zero to one. The closer the value of the coefficient of determination to unity, the better the model describes the original data series. The coefficient of determination can be represented in another way:

R² = 1 - Σ (y_i - ŷ_i)² / Σ (y_i - y̅)².

Here Σ (ŷ_i - y̅)² is the error explained by the regression model, and Σ (y_i - ŷ_i)² is the error unexplained by the model. From an economic point of view, this criterion makes it possible to judge what percentage of the price variation the regression equation explains.

An exact acceptance threshold for R² cannot be specified for all cases; both the sample size and the meaningful interpretation of the equation must be taken into account. As a rule, when studying data on objects of the same type obtained at roughly the same time, R² does not exceed 0.6-0.7. If all prediction errors are zero, i.e. when the relationship between the resulting and factor variables is functional, then R² = 1.

Adjusted coefficient of determination:

R²_adj = 1 - (1 - R²) (n - 1) / (n - k - 1).

The need for the adjustment is explained by the fact that as the number of factors k grows, the ordinary coefficient of determination almost always increases while the number of degrees of freedom (n - k - 1) decreases. The adjustment always reduces the value of R², since (n - 1) > (n - k - 1). As a result, R²_adj can even become negative; this means that R² was close to zero before the adjustment and that the share of the variance of y explained by the regression equation is very small.

Of two variants of a regression model that differ in the adjusted coefficient of determination but are equally good by the other quality criteria, the variant with the larger adjusted coefficient of determination is preferable. The coefficient of determination is not adjusted if (n - k) / k > 20.
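Both coefficients are easy to compute directly; here they are evaluated for the one-factor price model from the earlier Excel example (n = 8, k = 1), using the adjustment formula above.

```python
# R-squared and adjusted R-squared for the one-factor price model (n=8, k=1),
# with the adjustment formula R2_adj = 1 - (1 - R2)*(n - 1)/(n - k - 1).

xs = list(range(1, 9))
ys = [1750, 1755, 1767, 1760, 1770, 1790, 1810, 1840]
n, k = len(xs), 1
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx
sse = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))  # unexplained error
sst = sum((y - my) ** 2 for y in ys)                       # total scatter
r2 = 1 - sse / sst
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(r2, 3), round(r2_adj, 3))
```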

Fisher criterion:

F = (R² / k) / ((1 - R²) / (n - k - 1)).

This criterion is used to assess the significance of the coefficient of determination. The residual sum of squares measures the prediction error of the regression against the known cost values y_i. Comparing it with the regression sum of squares shows how many times better the regression predicts the result than the mean of y does. There is a table of critical values F_cr of the Fisher criterion depending on the degrees of freedom of the numerator, ν₁ = k, of the denominator, ν₂ = n - k - 1, and the significance level α. If the calculated value F is greater than the table value, the hypothesis of the insignificance of the coefficient of determination, i.e. of a discrepancy between the relationships embedded in the regression equation and the really existing ones, is rejected with probability p = 1 - α.

Average approximation error (average percentage deviation) is calculated as the average relative difference, in percent, between the actual and calculated values of the resulting variable:

δ = (100% / n) Σ |(y_i - ŷ_i) / y_i|.

The lower this indicator, the better the predictive quality of the model. A value of no more than 7% indicates high model accuracy; δ > 15% indicates unsatisfactory accuracy.
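For the same one-factor price model used earlier, the average approximation error comes out well under the 7% threshold.

```python
# Average approximation error for the price model: the mean relative
# deviation, in percent, of actual values from model predictions.

xs = list(range(1, 9))
ys = [1750, 1755, 1767, 1760, 1770, 1790, 1810, 1840]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx
delta = 100 / n * sum(abs(y - (a * x + b)) / y for x, y in zip(xs, ys))
print(f"average approximation error: {delta:.2f}%")
```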

Standard error of a regression coefficient:

S_aj = S_e · sqrt( ((XᵀX)⁻¹)_jj ),

where ((XᵀX)⁻¹)_jj is the j-th diagonal element of the matrix (XᵀX)⁻¹; k is the number of factors; X is the matrix of factor variable values; Xᵀ is the transposed matrix of factor variable values; (XᵀX)⁻¹ is the matrix inverse of XᵀX.

The smaller this error is for a regression coefficient, the more reliable the estimate of that coefficient.

Student's test (t-statistic):

This criterion allows you to measure the degree of reliability (significance) of the relationship captured by a given regression coefficient. If the calculated value tᵢ is greater than the table value

t_cr, where ν = n − k − 1 is the number of degrees of freedom, then the hypothesis that this coefficient is statistically insignificant is rejected with a probability of (100 − α)%. There are special tables of the t-distribution that make it possible to determine the critical value of the criterion for a given significance level α and number of degrees of freedom ν. The most commonly used value of α is 5%.

Multicollinearity, i.e. mutual correlation between the factor variables, leads to the need to be content with a limited number of them. If it is not taken into account, the result can be an illogical regression model. To avoid the negative effect of multicollinearity, before building a multiple regression model, pairwise correlation coefficients r_{xᵢxⱼ} between the selected variables xᵢ and xⱼ are calculated.

Here the mean of the product xᵢxⱼ is the mean value of the product of the two factor variables;

x̄ᵢ · x̄ⱼ is the product of the mean values of the two factor variables;

σ²_{xᵢ} is the estimate of the variance of the factor variable xᵢ.

Two variables are considered intercorrelated (i.e., collinear) if their pairwise correlation coefficient is strictly greater than 0.8 in absolute value. In this case one of these variables should be excluded from consideration.
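A minimal sketch of this screening step, assuming hypothetical factor data (the variable names x1, x2, x3 are invented for illustration):

```python
import numpy as np

# Hypothetical factor variables; names and values are illustrative only
factors = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.1, 3.9, 6.2, 8.0, 9.9],   # nearly proportional to x1
    "x3": [5.0, 3.0, 6.0, 2.0, 4.0],
}

names = list(factors)
data = np.array([factors[n] for n in names])
corr = np.corrcoef(data)               # pairwise correlation matrix

# Flag pairs whose |r| exceeds the 0.8 collinearity threshold
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if abs(corr[i, j]) > 0.8:
            print(names[i], names[j], round(corr[i, j], 3))
```

Here only the pair (x1, x2) is flagged, so one of the two would be dropped before fitting the multiple regression.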

In order to expand the possibilities of economic analysis of the resulting regression models, average elasticity coefficients are used, determined by the formula:

Eᵢ = aᵢ · (x̄ᵢ / ȳ),

where x̄ᵢ is the mean value of the corresponding factor variable;

ȳ is the mean value of the resulting variable; aᵢ is the regression coefficient for the corresponding factor variable.

The elasticity coefficient shows by how many percent the value of the resulting variable changes on average when the factor variable changes by 1%, i.e. how the resulting variable reacts to a change in the factor variable. For example, how the price per square meter of an apartment changes with distance from the city center.
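A one-line sketch of the formula, with invented numbers for the apartment-price example (the values are assumptions, not data from this article):

```python
def elasticity(a_i, x_mean, y_mean):
    """Average elasticity coefficient E_i = a_i * x_mean / y_mean."""
    return a_i * x_mean / y_mean

# Hypothetical: price falls by 2 (thousand) per km from the center;
# mean distance 10 km, mean price 100 (thousand) per sq. m
print(elasticity(-2.0, 10.0, 100.0))  # -0.2: a 1% increase in distance
                                      # lowers the price by about 0.2%
```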

Useful from the point of view of analyzing the significance of a particular regression coefficient is the estimate of the partial coefficient of determination:

Here σ̂²_y is the estimate of the variance of the resulting

variable. This coefficient shows what percentage of the variation of the resulting variable is explained by the variation of the i-th factor variable included in the regression equation.

  • Hedonic characteristics are understood as the characteristics of an object that reflect its useful (valuable) properties from the point of view of buyers and sellers.

Regression and correlation analysis - statistical research methods. These are the most common ways to show the dependence of a parameter on one or more independent variables.

Below, using concrete practical examples, we will consider these two very popular analyzes among economists. We will also give an example of obtaining results when they are combined.

Regression Analysis in Excel

Shows the influence of some values (independent, explanatory) on the dependent variable. For example, how the number of economically active population depends on the number of enterprises, wages, and other parameters. Or: how foreign investments, energy prices, etc. affect the level of GDP.

The result of the analysis allows you to prioritize. And based on the main factors, to predict, plan the development of priority areas, make management decisions.

Regression happens:

  • linear (y = a + bx);
  • parabolic (y = a + bx + cx^2);
  • exponential (y = a * exp(bx));
  • power (y = a * x^b);
  • hyperbolic (y = b/x + a);
  • logarithmic (y = b * ln(x) + a);
  • exponential with base b (y = a * b^x).

Consider the example of building a regression model in Excel and interpreting the results. Let's take a linear type of regression.

A task. At 6 enterprises, the average monthly salary and the number of employees who left were analyzed. It is necessary to determine the dependence of the number of retired employees on the average salary.

The linear regression model has the following form:

Y = a₀ + a₁x₁ + … + aₖxₖ.

Where a are the regression coefficients, x are the influencing variables, and k is the number of factors.

In our example, Y is the indicator of quit workers. The influencing factor is wages (x).

Excel has built-in functions that can be used to calculate the parameters of a linear regression model. But the Analysis ToolPak add-in will do it faster.

Activate a powerful analytical tool:

Once activated, the add-on will be available under the Data tab.

Now we will deal directly with the regression analysis.



First of all, we pay attention to the R-square and coefficients.

R-square is the coefficient of determination. In our example, it is 0.755, or 75.5%. This means that the calculated parameters of the model explain the relationship between the studied parameters by 75.5%. The higher the coefficient of determination, the better the model. Good - above 0.8. Poor - less than 0.5 (such an analysis can hardly be considered reasonable). In our example - "not bad".

The coefficient 64.1428 shows what Y will be if all the variables in the model under consideration are equal to 0. That is, other factors that are not described in the model also affect the value of the analyzed parameter.

The coefficient -0.16285 shows the weight of variable X on Y. That is, within this model the average monthly salary affects the number of quitters with a weight of -0.16285 (a small degree of influence). The “-” sign indicates a negative impact: the higher the salary, the fewer people quit. Which is fair.
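The same kind of fit can be reproduced outside Excel. The sketch below uses hypothetical salary/attrition numbers (the article does not list the y values), so its coefficients will not match the 64.1428 and -0.16285 reported above:

```python
import numpy as np

# Hypothetical data in the spirit of the task: average monthly
# salary (x) and number of employees who quit (y) at 6 enterprises
salary = np.array([30000, 35000, 40000, 45000, 50000, 55000], dtype=float)
quit_count = np.array([60, 35, 20, 20, 15, 15], dtype=float)

# Fit y = a1*x + a0 by least squares, as the Regression tool does
a1, a0 = np.polyfit(salary, quit_count, 1)

predicted = a0 + a1 * salary
ss_res = np.sum((quit_count - predicted) ** 2)
ss_tot = np.sum((quit_count - quit_count.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"intercept={a0:.4f}, slope={a1:.6f}, R^2={r_squared:.3f}")
```

As in the worked example, the slope comes out negative (higher salary, fewer quitters), and R² measures how much of the variation the model explains.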



Correlation analysis in Excel

Correlation analysis helps to establish whether there is a relationship between indicators in one or two samples. For example, between the operating time of the machine and the cost of repairs, the price of equipment and the duration of operation, the height and weight of children, etc.

If there is a relationship, then whether an increase in one parameter leads to an increase (positive correlation) or a decrease (negative) in the other. Correlation analysis helps the analyst determine whether the value of one indicator can predict the possible value of another.

The correlation coefficient is denoted r. Varies from +1 to -1. The classification of correlations for different areas will be different. When the coefficient value is 0, there is no linear relationship between the samples.

Consider how to use Excel to find the correlation coefficient.

The CORREL function is used to find the paired coefficients.

Task: Determine if there is a relationship between the operating time of a lathe and the cost of its maintenance.

Put the cursor in any cell and press the fx button.

  1. In the "Statistical" category, select the CORREL function.
  2. Argument "Array 1" - the first range of values - the operating time of the machine: A2:A14.
  3. Argument "Array 2" - the second range of values ​​- the cost of repairs: B2:B14. Click OK.
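For comparison, the paired coefficient that CORREL returns can be computed with NumPy; the machine-hours and repair-cost values below are invented for illustration (13 points, matching a range like A2:A14):

```python
import numpy as np

# Hypothetical machine data: operating hours and repair cost
hours = [10, 15, 22, 28, 33, 40, 46, 50, 55, 61, 70, 74, 78]
cost = [5.0, 5.5, 6.1, 7.0, 7.2, 8.1, 8.5, 9.0, 9.4, 10.2, 11.0, 11.3, 12.1]

# np.corrcoef returns the correlation matrix; element [0, 1]
# is the paired coefficient that CORREL would compute
r = np.corrcoef(hours, cost)[0, 1]
print(round(r, 4))
```

With these nearly linear data the coefficient comes out close to 1, i.e. a strong positive relationship.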

To determine the type of connection, you need to look at the absolute value of the coefficient (each field of activity has its own scale).

For correlation analysis of several parameters (more than 2), it is more convenient to use "Data Analysis" (the "Analysis ToolPak" add-in). In the list, select "Correlation" and designate the input array. That's all.

The resulting coefficients will be displayed in the correlation matrix. Like this one:

Correlation-regression analysis

In practice, these two techniques are often used together.

Example:


Now the regression analysis data is visible.

What is regression?

Consider two continuous variables x = (x₁, x₂, …, xₙ), y = (y₁, y₂, …, yₙ).

Let's place the points on a 2D scatterplot and say that we have a linear relationship if the data are approximated by a straight line.

If we assume that y depends on x, and that the changes in y are caused by changes in x, we can define a regression line (the regression of y on x) that best describes the straight-line relationship between these two variables.

The statistical use of the word "regression" comes from a phenomenon known as regression to the mean, attributed to Sir Francis Galton (1889).

He showed that while tall fathers tend to have tall sons, the average height of sons is smaller than that of their tall fathers. The average height of sons "regressed" and "moved back" to the average height of all fathers in the population. Thus, on average, tall fathers have shorter (but still tall) sons, and short fathers have taller (but still rather short) sons.

regression line

The mathematical equation that estimates a simple (pairwise) linear regression line is:

Y = a + bx

x is called the independent variable, or predictor.

Y is the dependent, or response, variable. This is the value we expect for y (on average) if we know the value of x, i.e. it is the predicted value of y.

  • a is the free term (intercept) of the estimated line; it is the value of Y when x = 0 (Fig. 1).
  • b is the slope, or gradient, of the estimated line; it is the amount by which Y increases on average if we increase x by one unit.
  • a and b are called the regression coefficients of the estimated line, although this term is often used only for b.

Pairwise linear regression can be extended to include more than one independent variable; in this case it is known as multiple regression.

Fig. 1. Linear regression line showing the intercept a and the slope b (the amount by which Y increases when x increases by one unit).

Least square method

We perform regression analysis using a sample of observations, where a and b are sample estimates of the true (population) parameters α and β, which determine the linear regression line in the population.

The simplest method for determining the coefficients a and b is the least squares method (OLS).

The fit is evaluated by considering the residuals (the vertical distance of each point from the line, e.g. residual = observed y − predicted y, Fig. 2).

The line of best fit is chosen so that the sum of the squares of the residuals is minimal.

Fig. 2. Linear regression line with the residuals shown (vertical dotted lines) for each point.
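The least squares estimates have a simple closed form; a minimal sketch, not tied to any data from this article:

```python
import numpy as np

def ols_line(x, y):
    """Least-squares estimates of intercept a and slope b for y = a + b*x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()
    return a, b

# Points lying exactly on y = 2 + 3x recover a = 2, b = 3
a, b = ols_line([0, 1, 2, 3], [2, 5, 8, 11])
print(a, b)  # 2.0 3.0
```

For such an exact-fit sample every residual is zero, so the sum of squared residuals is already minimal.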

Linear Regression Assumptions

So, for each observed value, the residual is equal to the difference between that observed value and the corresponding predicted value. Each residual can be positive or negative.

You can use residuals to test the following assumptions behind linear regression:

  • The residuals are normally distributed with zero mean;
  • The residuals have the same variance for all values of x (constant variance);
  • The relationship between x and y is linear.

If the assumptions of linearity, normality, and/or constant variance are questionable, we can transform x or y and calculate a new regression line for which these assumptions are satisfied (e.g., use a logarithmic transformation).

Abnormal values ​​(outliers) and points of influence

An observation is "influential" if omitting it changes one or more of the model parameter estimates (i.e., the slope or the intercept).

An outlier (an observation that contradicts most of the values ​​in the dataset) can be an "influential" observation and can be well detected visually when looking at a 2D scatterplot or a plot of residuals.

For both outliers and "influential" observations (points), models are fitted both with and without them, and attention is paid to the change in the estimates (the regression coefficients).

When doing an analysis, do not automatically discard outliers or influence points, as simply ignoring them can affect the results. Always study the causes of these outliers and analyze them.

Linear regression hypothesis

When constructing a linear regression, the null hypothesis is checked that the general slope of the regression line β is equal to zero.

If the slope of the line is zero, there is no linear relationship between x and y: changes in x do not affect y.

To test the null hypothesis that the true slope is zero, you can use the following algorithm:

Calculate the test statistic t = b / SE(b), which follows a t-distribution with n − 2 degrees of freedom, where the standard error of the slope is

SE(b) = s_res / √(Σ(xᵢ − x̄)²),

and s²_res is the estimate of the variance of the residuals.

Usually, if the attained significance level is p < 0.05, the null hypothesis is rejected.

The 95% confidence interval for the slope is

b ± t₀.₀₅ × SE(b),

where t₀.₀₅ is the percentage point of the t-distribution with n − 2 degrees of freedom that gives the probability of a two-tailed test of 0.05. This is the interval that contains the population slope with a probability of 95%.

For large samples, we can approximate the critical value with 1.96 (that is, the test statistic tends to be normally distributed).
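The algorithm above can be sketched end to end; the observations are invented, and the critical value 2.776 is the tabulated two-tailed t for 4 degrees of freedom at α = 0.05:

```python
import math

# Hypothetical paired observations (n = 6, so n - 2 = 4 df)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 4.3, 5.9, 8.2, 9.8, 12.1]

n = len(x)
xm = sum(x) / n
ym = sum(y) / n
sxx = sum((xi - xm) ** 2 for xi in x)
b = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y)) / sxx
a = ym - b * xm

# Residual variance with n - 2 degrees of freedom
s2 = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
se_b = math.sqrt(s2 / sxx)
t = b / se_b

# 95% confidence interval for the slope
t_crit = 2.776
print(f"b={b:.3f}, SE(b)={se_b:.4f}, t={t:.1f}, "
      f"CI=({b - t_crit * se_b:.3f}, {b + t_crit * se_b:.3f})")
```

Since t greatly exceeds the critical value and the interval does not contain zero, the null hypothesis of zero slope would be rejected for these data.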

Evaluation of the quality of linear regression: coefficient of determination R 2

Because of the linear relationship between x and y, we expect that y changes as x changes, and we call this the variation that is due to, or explained by, the regression. The residual variation should be as small as possible.

If so, most of the variation in y will be explained by the regression, and the points will lie close to the regression line, i.e. the line fits the data well.

The proportion of the total variance of y that is explained by the regression is called the coefficient of determination, usually expressed as a percentage and denoted R² (in paired linear regression it equals r², the square of the correlation coefficient); it allows a subjective assessment of the quality of the regression equation.

The difference 100% − R² is the percentage of variance that cannot be explained by the regression.

Since there is no formal test to evaluate R², we are forced to rely on subjective judgment to determine the quality of fit of the regression line.
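A short sketch verifying, on invented data, that R² in paired regression equals the squared correlation coefficient:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.8, 8.3, 9.9])

# polyfit returns [slope, intercept]
b, a = np.polyfit(x, y, 1)
pred = a + b * x
r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

# In paired regression R^2 equals the squared correlation coefficient
r = np.corrcoef(x, y)[0, 1]
print(round(r2, 6) == round(r ** 2, 6))
```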

Applying a Regression Line to a Forecast

You can use a regression line to predict a value of y from a value of x within the observed range (never extrapolate beyond these limits).

We predict the mean of y for observables that have a certain value of x by substituting that value into the equation of the regression line.

So, if we predict y as ŷ = a + bx, we use this predicted value and its standard error to estimate the confidence interval for the true population mean.

Repeating this procedure for different values of x allows you to build confidence limits for the line. This is the band or region that contains the true line, for example with a 95% confidence level.

Simple regression plans

Simple regression designs contain one continuous predictor. If there are 3 cases with predictor values P of, say, 7, 4 and 9, and the design includes a first-order effect of P, then the design matrix X will be

    1  7
    1  4
    1  9

and the regression equation using P for X1 looks like

Y = b0 + b1·P

If a simple regression design contains a higher-order effect of P, such as a quadratic effect, then the values in column X1 of the design matrix are raised to the second power:

    1  49
    1  16
    1  81

and the equation takes the form

Y = b0 + b1·P²

Sigma-restricted and overparameterized coding methods do not apply to simple regression designs and other designs containing only continuous predictors (because there are simply no categorical predictors). Regardless of the coding method chosen, the values of the continuous variables are raised to the appropriate power and used as the values of the X variables. No recoding is performed. In addition, when describing regression designs, you can omit the design matrix X and work only with the regression equation.
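A sketch of building such design matrices for the P = 7, 4, 9 example (the response values are invented):

```python
import numpy as np

p = np.array([7.0, 4.0, 9.0])
y = np.array([20.0, 14.0, 24.0])  # hypothetical responses (here y = 2P + 6)

# First-order design matrix: a column of ones plus P itself
X = np.column_stack([np.ones_like(p), p])

# For a quadratic effect the predictor column is squared instead
X_quad = np.column_stack([np.ones_like(p), p ** 2])  # shown for contrast

# Solve Y = b0 + b1*P by least squares
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(X.astype(int).tolist(), b.round(3).tolist())
```

Because the invented responses lie exactly on y = 2P + 6, the solver recovers b0 = 6 and b1 = 2.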

Example: Simple Regression Analysis

This example uses the data provided in the table:

Fig. 3. Table of initial data.

The data is based on a comparison of the 1960 and 1970 censuses in 30 randomly selected counties. County names are represented as observation names. Information regarding each variable is presented below:

Fig. 4. Variable specification table.

Research objective

For this example, we will analyze the predictors of the poverty rate, i.e. of the percentage of families that are below the poverty line. Therefore, we will treat variable 3 (Pt_Poor) as the dependent variable.

One can put forward the hypothesis that population change and the percentage of families below the poverty line are related. It seems reasonable to expect that poverty leads to an outflow of population, hence there would be a negative correlation between the percentage of people below the poverty line and population change. Therefore, we will treat variable 1 (Pop_Chng) as the predictor variable.

View Results

Regression coefficients

Fig. 5. Regression coefficients of Pt_Poor on Pop_Chng.

At the intersection of the Pop_Chng row and the Param. column is the non-standardized coefficient for the regression of Pt_Poor on Pop_Chng: -0.40374. This means that for every unit decrease in population change there is an increase in the poverty rate of .40374. The upper and lower (default) 95% confidence limits for this non-standardized coefficient do not include zero, so the regression coefficient is significant at the p < .05 level. Note that the standardized coefficient, which for simple regression designs is also the Pearson correlation coefficient, equals -.65: for every standard-deviation decrease in population change, the poverty rate increases by .65 of a standard deviation.

Distribution of variables

Correlation coefficients can become significantly overestimated or underestimated if there are large outliers in the data. Let us examine the distribution of the dependent variable Pt_Poor by county. To do this, we will build a histogram of the Pt_Poor variable.

Fig. 6. Histogram of the Pt_Poor variable.

As you can see, the distribution of this variable differs markedly from the normal distribution. However, although two counties (the two right-hand columns) have a higher percentage of families below the poverty line than expected under a normal distribution, they appear to be "within the range."

Fig. 7. Histogram of the Pt_Poor variable.

This judgment is somewhat subjective. The rule of thumb is that outliers should be taken into account if an observation (or observations) does not fall within the interval mean ± 3 standard deviations. In this case, it is worth repeating the analysis with and without the outliers to make sure that they do not have a serious effect on the correlation between the variables.

Scatterplot

If there is an a priori hypothesis about the relationship between the given variables, it is useful to check it with the corresponding scatterplot.

Fig. 8. Scatterplot.

The scatterplot shows a clear negative correlation (-.65) between the two variables. It also shows the 95% confidence interval for the regression line, i.e., with 95% probability the regression line passes between the two dashed curves.

Significance criteria

Fig. 9. Table containing the significance criteria.

The test for the Pop_Chng regression coefficient confirms that Pop_Chng is strongly related to Pt_Poor, p < .001.

Outcome

This example showed how to analyze a simple regression plan. An interpretation of non-standardized and standardized regression coefficients was also presented. The importance of studying the response distribution of the dependent variable is discussed, and a technique for determining the direction and strength of the relationship between the predictor and the dependent variable is demonstrated.