Regression analysis example. Regression analysis

The main feature of regression analysis is that it can be used to obtain specific information about the form and nature of the relationship between the variables under study.

The sequence of stages of regression analysis

Let us briefly consider the stages of regression analysis.

Task formulation. At this stage, preliminary hypotheses about the dependence of the studied phenomena are formed.

Definition of dependent and independent (explanatory) variables.

Collection of statistical data. Data must be collected for each of the variables included in the regression model.

Formulation of a hypothesis about the form of connection (simple or multiple, linear or non-linear).

Definition regression functions (consists in the calculation of the numerical values of the parameters of the regression equation)

Evaluation of the accuracy of regression analysis.

Interpretation of the obtained results. The results of the regression analysis are compared with preliminary hypotheses. The correctness and plausibility of the obtained results are evaluated.

Prediction of unknown values of the dependent variable.

With the help of regression analysis, it is possible to solve the problem of forecasting and classification. Predictive values are calculated by substituting the values of the explanatory variables into the regression equation. The classification problem is solved in this way: the regression line divides the entire set of objects into two classes, and the part of the set where the value of the function is greater than zero belongs to one class, and the part where it is less than zero belongs to another class.

Tasks of regression analysis

Consider the main tasks of regression analysis: establishing the form of dependence, determining regression functions, an estimate of the unknown values of the dependent variable.

Establishing the form of dependence.

The nature and form of the relationship between variables can form the following types of regression:

positive linear regression (expressed as a uniform growth of the function);

positive uniformly accelerating regression;

positive uniformly increasing regression;

negative linear regression (expressed as a uniform drop in function);

negative uniformly accelerated decreasing regression;

negative uniformly decreasing regression.

However, the varieties described are usually not found in pure form, but in combination with each other. In this case, one speaks of combined forms of regression.

Definition of the regression function.

The second task is to determine the effect on the dependent variable of the main factors or causes, all other things being equal, and subject to the exclusion of the impact on the dependent variable of random elements. regression function defined as a mathematical equation of one type or another.

Estimation of unknown values of the dependent variable.

The solution of this problem is reduced to solving a problem of one of the following types:

Estimation of the values of the dependent variable within the considered interval of the initial data, i.e. missing values; this solves the problem of interpolation.

Estimating the future values of the dependent variable, i.e. finding values outside the given interval of the initial data; this solves the problem of extrapolation.

Both problems are solved by substituting the found estimates of the parameters of the values of the independent variables into the regression equation. The result of solving the equation is an estimate of the value of the target (dependent) variable.

Let's look at some of the assumptions that regression analysis relies on.

Linearity assumption, i.e. it is assumed that the relationship between the variables under consideration is linear. So, in this example, we built a scatterplot and were able to see a clear linear relationship. If, on the scatterplot of variables, we see a clear absence of a linear relationship, i.e. there is a non-linear relationship, non-linear methods of analysis should be used.

Normality Assumption leftovers. It assumes that the distribution of the difference between predicted and observed values is normal. To visually determine the nature of the distribution, you can use histograms leftovers.

When using regression analysis, one should take into account its main limitation. It consists in the fact that regression analysis allows you to detect only dependencies, and not the relationships that underlie these dependencies.

Regression analysis makes it possible to assess the degree of association between variables by calculating the expected value of a variable based on several known values.

Regression equation.

The regression equation looks like this: Y=a+b*X

Using this equation, the variable Y is expressed in terms of the constant a and the slope of the line (or slope) b multiplied by the value of the variable X. The constant a is also called the intercept, and the slope is the regression coefficient or B-factor.

In most cases (if not always) there is a certain scatter of observations about the regression line.

Remainder is the deviation of an individual point (observation) from the regression line (predicted value).

To solve the problem of regression analysis in MS Excel, select from the menu Service"Analysis Package" and the Regression analysis tool. Specify the X and Y input intervals. The Y input interval is the range of dependent data being analyzed and must include one column. The input interval X is the range of independent data to be analyzed. The number of input ranges must not exceed 16.

At the output of the procedure in the output range, we get the report given in table 8.3a-8.3v.

RESULTS

Table 8.3a. Regression statistics
Regression statistics
Multiple R
R-square
Normalized R-square
standard error
Observations

First, consider the upper part of the calculations presented in table 8.3a, - regression statistics.

Value R-square, also called the measure of certainty, characterizes the quality of the resulting regression line. This quality is expressed by the degree of correspondence between the original data and the regression model (calculated data). The measure of certainty is always within the interval .

In most cases, the value R-square is between these values, called extreme, i.e. between zero and one.

If the value R-square close to unity, this means that the constructed model explains almost all the variability of the corresponding variables. Conversely, the value R-square, close to zero, means poor quality of the constructed model.

In our example, the measure of certainty is 0.99673, which indicates a very good fit of the regression line to the original data.

plural R - coefficient of multiple correlation R - expresses the degree of dependence of independent variables (X) and dependent variable (Y).

Multiple R equal to the square root of the coefficient of determination, this value takes values in the range from zero to one.

In simple linear regression analysis plural R equal to the Pearson correlation coefficient. Really, plural R in our case, it is equal to the Pearson correlation coefficient from the previous example (0.998364).

Table 8.3b. Regression coefficients
	Odds	standard error	t-statistic
Y-intersection
Variable X 1
* A truncated version of the calculations is given

Now consider the middle part of the calculations presented in table 8.3b. Here, the regression coefficient b (2.305454545) and the offset along the y-axis are given, i.e. constant a (2.694545455).

Based on the calculations, we can write the regression equation as follows:

Y= x*2.305454545+2.694545455

The direction of the relationship between the variables is determined based on the signs (negative or positive) of the regression coefficients (coefficient b).

If the sign of the regression coefficient is positive, the relationship between the dependent variable and the independent variable will be positive. In our case, the sign of the regression coefficient is positive, therefore, the relationship is also positive.

If the sign of the regression coefficient is negative, the relationship between the dependent variable and the independent variable is negative (inverse).

AT table 8.3c. output results are presented leftovers. In order for these results to appear in the report, it is necessary to activate the "Residuals" checkbox when launching the "Regression" tool.

REMAINING WITHDRAWAL

Table 8.3c. Remains
Observation	Predicted Y	Remains	Standard balances

Using this part of the report, we can see the deviations of each point from the constructed regression line. Greatest absolute value remainder in our case - 0.778, the smallest - 0.043. For a better interpretation of these data, we will use the graph of the original data and the constructed regression line presented in Fig. rice. 8.3. As you can see, the regression line is quite accurately "fitted" to the values of the original data.

It should be taken into account that the example under consideration is quite simple and it is far from always possible to qualitatively construct a linear regression line.

Rice. 8.3. Initial data and regression line

The problem of estimating unknown future values of the dependent variable based on the known values of the independent variable remained unconsidered, i.e. forecasting task.

Having a regression equation, the forecasting problem is reduced to solving the equation Y= x*2.305454545+2.694545455 with known values of x. The results of predicting the dependent variable Y six steps ahead are presented in table 8.4.

Table 8.4. Y variable prediction results
	Y(predicted)

Thus, as a result of using regression analysis in the Microsoft Excel package, we:

built a regression equation;

established the form of dependence and the direction of the relationship between the variables - a positive linear regression, which is expressed in a uniform growth of the function;

established the direction of the relationship between the variables;

assessed the quality of the resulting regression line;

were able to see the deviations of the calculated data from the data of the original set;

predicted the future values of the dependent variable.

If a regression function is defined, interpreted and justified, and the assessment of the accuracy of the regression analysis meets the requirements, we can assume that the constructed model and predictive values are sufficiently reliable.

The predicted values obtained in this way are the average values that can be expected.

In this paper, we reviewed the main characteristics descriptive statistics and among them such concepts as mean,median,maximum,minimum and other characteristics of data variation.

There was also a brief discussion of the concept emissions. The considered characteristics refer to the so-called exploratory data analysis, its conclusions may not apply to the general population, but only to a data sample. Exploratory data analysis is used to draw primary conclusions and form hypotheses about the population.

The basics of correlation and regression analysis, their tasks and possibilities of practical use were also considered.

Regression analysis is a method of establishing an analytical expression of a stochastic relationship between the studied features. The regression equation shows how, on average, changes at when changing any of x i , and looks like:

where y - dependent variable (it is always one);

X i - independent variables (factors) (there may be several of them).

If there is only one independent variable, this is a simple regression analysis. If there are several P 2), then such an analysis is called multivariate.

In the course of regression analysis, two main tasks are solved:

construction of the regression equation, i.e. finding the type of relationship between the result indicator and independent factors x 1 , x 2 , …, x n .

assessment of the significance of the resulting equation, i.e. determination of how much the selected factor features explain the variation of the feature y.

Regression analysis is used mainly for planning, as well as for the development of a regulatory framework.

Unlike correlation analysis, which only answers the question of whether there is a relationship between the analyzed features, regression analysis also gives its formalized expression. In addition, if the correlation analysis studies any relationship of factors, then the regression analysis studies one-sided dependence, i.e. a connection showing how a change in factor signs affects the resultant sign.

Regression analysis is one of the most developed methods of mathematical statistics. Strictly speaking, the implementation of regression analysis requires the fulfillment of a number of special requirements (in particular, x l ,x 2 ,...,x n ;y must be independent, normally distributed random variables with constant variances). In real life, strict compliance with the requirements of regression and correlation analysis is very rare, but both of these methods are very common in economic research. Dependencies in the economy can be not only direct, but also inverse and non-linear. A regression model can be built in the presence of any dependence, however, in multivariate analysis, only linear models of the form are used:

The construction of the regression equation is carried out, as a rule, by the method of least squares, the essence of which is to minimize the sum of squared deviations of the actual values of the resulting attribute from its calculated values, i.e.:

where t - number of observations;

j =a+b 1 x 1 j +b 2 x 2 j + ... + b n X n j - calculated value of the result factor.

Regression coefficients are recommended to be determined using analytical packages for a personal computer or a special financial calculator. In the simplest case, the regression coefficients of a one-factor linear regression equation of the form y = a + bx can be found using the formulas:

cluster analysis

Cluster analysis is one of the methods of multivariate analysis, designed for grouping (clustering) a population, the elements of which are characterized by many features. The values of each of the features serve as the coordinates of each unit of the studied population in the multidimensional space of features. Each observation, characterized by the values of several indicators, can be represented as a point in the space of these indicators, the values of which are considered as coordinates in a multidimensional space. Distance between points R and q With k coordinates is defined as:

The main criterion for clustering is that the differences between clusters should be more significant than between observations assigned to the same cluster, i.e. in a multidimensional space, the inequality must be observed:

where r 1, 2 - distance between clusters 1 and 2.

As well as the regression analysis procedures, the clustering procedure is quite laborious, it is advisable to perform it on a computer.

During their studies, students very often encounter a variety of equations. One of them - the regression equation - is considered in this article. This type of equation is used specifically to describe the characteristics of the relationship between mathematical parameters. This type of equality is used in statistics and econometrics.

Definition of regression

In mathematics, regression is understood as a certain quantity that describes the dependence of the average value of a data set on the values of another quantity. The regression equation shows, as a function of a particular feature, the average value of another feature. The regression function has the form of a simple equation y \u003d x, in which y acts as a dependent variable, and x is an independent variable (feature factor). In fact, the regression is expressed as y = f (x).

What are the types of relationships between variables

In general, two opposite types of relationship are distinguished: correlation and regression.

The first is characterized by equality of conditional variables. In this case, it is not known for certain which variable depends on the other.

If there is no equality between the variables and the conditions say which variable is explanatory and which is dependent, then we can talk about the presence of a connection of the second type. In order to build a linear regression equation, it will be necessary to find out what type of relationship is observed.

Types of regressions

To date, there are 7 different types of regression: hyperbolic, linear, multiple, nonlinear, pairwise, inverse, logarithmically linear.

Hyperbolic, linear and logarithmic

The linear regression equation is used in statistics to clearly explain the parameters of the equation. It looks like y = c + m * x + E. The hyperbolic equation has the form of a regular hyperbola y \u003d c + m / x + E. The logarithmically linear equation expresses the relationship using the logarithmic function: In y \u003d In c + m * In x + In E.

Multiple and non-linear

Two more complex types of regression are multiple and non-linear. The multiple regression equation is expressed by the function y \u003d f (x 1, x 2 ... x c) + E. In this situation, y is the dependent variable and x is the explanatory variable. The variable E is stochastic and includes the influence of other factors in the equation. The non-linear regression equation is a bit inconsistent. On the one hand, with respect to the indicators taken into account, it is not linear, and on the other hand, in the role of assessing indicators, it is linear.

Inverse and Pairwise Regressions

An inverse is a kind of function that needs to be converted to a linear form. In the most traditional application programs, it has the form of a function y \u003d 1 / c + m * x + E. The paired regression equation shows the relationship between the data as a function of y = f(x) + E. Just like the other equations, y depends on x and E is a stochastic parameter.

The concept of correlation

This is an indicator that demonstrates the existence of a relationship between two phenomena or processes. The strength of the relationship is expressed as a correlation coefficient. Its value fluctuates within the interval [-1;+1]. A negative indicator indicates the presence of feedback, a positive indicator indicates a direct one. If the coefficient takes a value equal to 0, then there is no relationship. The closer the value is to 1 - the stronger the relationship between the parameters, the closer to 0 - the weaker.

Methods

Correlation parametric methods can estimate the tightness of the relationship. They are used on the basis of distribution estimates to study parameters that obey the normal distribution law.

The parameters of the linear regression equation are necessary to identify the type of dependence, the function of the regression equation and evaluate the indicators of the chosen relationship formula. The correlation field is used as a method for identifying a relationship. To do this, all existing data must be represented graphically. In a rectangular two-dimensional coordinate system, all known data must be plotted. This is how the correlation field is formed. The value of the describing factor is marked along the abscissa, while the values of the dependent factor are marked along the ordinate. If there is a functional relationship between the parameters, they line up in the form of a line.

If the correlation coefficient of such data is less than 30%, we can talk about the almost complete absence of a connection. If it is between 30% and 70%, then this indicates the presence of links of medium closeness. A 100% indicator is evidence of a functional connection.

A non-linear regression equation, just like a linear one, must be supplemented with a correlation index (R).

Correlation for Multiple Regression

The coefficient of determination is an indicator of the square of the multiple correlation. He speaks about the tightness of the relationship of the presented set of indicators with the trait under study. It can also talk about the nature of the influence of parameters on the result. The multiple regression equation is evaluated using this indicator.

In order to calculate the multiple correlation index, it is necessary to calculate its index.

Least square method

This method is a way of estimating regression factors. Its essence lies in minimizing the sum of squared deviations obtained due to the dependence of the factor on the function.

A paired linear regression equation can be estimated using such a method. This type of equations is used in case of detection between the indicators of a paired linear relationship.

Equation Options

Each parameter of the linear regression function has a specific meaning. The paired linear regression equation contains two parameters: c and m. The parameter t shows the average change in the final indicator of the function y, subject to a decrease (increase) in the variable x by one conventional unit. If the variable x is zero, then the function is equal to the parameter c. If the variable x is not zero, then the factor c does not make economic sense. The only influence on the function is the sign in front of the factor c. If there is a minus, then we can say about a slow change in the result compared to the factor. If there is a plus, then this indicates an accelerated change in the result.

Each parameter that changes the value of the regression equation can be expressed in terms of an equation. For example, the factor c has the form c = y - mx.

Grouped data

There are such conditions of the task in which all information is grouped according to the attribute x, but at the same time, for a certain group, the corresponding average values of the dependent indicator are indicated. In this case, the average values characterize how the indicator depends on x. Thus, the grouped information helps to find the regression equation. It is used as a relationship analysis. However, this method has its drawbacks. Unfortunately, averages are often subject to external fluctuations. These fluctuations are not a reflection of the patterns of the relationship, they just mask its "noise". Averages show patterns of relationship much worse than a linear regression equation. However, they can be used as a basis for finding an equation. By multiplying the size of a particular population by the corresponding average, you can get the sum of y within the group. Next, you need to knock out all the received amounts and find the final indicator y. It is a little more difficult to make calculations with the sum indicator xy. In the event that the intervals are small, we can conditionally take the indicator x for all units (within the group) the same. Multiply it with the sum of y to find the sum of the products of x and y. Further, all the sums are knocked together and the total sum xy is obtained.

Multiple Pair Equation Regression: Assessing the Importance of a Relationship

As discussed earlier, multiple regression has a function of the form y \u003d f (x 1, x 2, ..., x m) + E. Most often, such an equation is used to solve the problem of supply and demand for goods, interest income on repurchased shares, studying the causes and type of production cost function. It is also actively used in a wide variety of macroeconomic studies and calculations, but at the level of microeconomics, such an equation is used a little less often.

The main task of multiple regression is to build a data model containing a huge amount of information in order to further determine what effect each of the factors has individually and in their totality on the indicator to be modeled and its coefficients. The regression equation can take on a variety of values. In this case, two types of functions are usually used to assess the relationship: linear and nonlinear.

A linear function is depicted in the form of such a relationship: y \u003d a 0 + a 1 x 1 + a 2 x 2, + ... + a m x m. In this case, a2, a m , are considered to be the coefficients of "pure" regression. They are necessary to characterize the average change in the parameter y with a change (decrease or increase) in each corresponding parameter x by one unit, with the condition of a stable value of other indicators.

Nonlinear equations have, for example, the form of a power function y=ax 1 b1 x 2 b2 ...x m bm . In this case, the indicators b 1, b 2 ..... b m - are called elasticity coefficients, they demonstrate how the result will change (by how much%) with an increase (decrease) in the corresponding indicator x by 1% and with a stable indicator of other factors.

What factors should be considered when building a multiple regression

In order to correctly construct a multiple regression, it is necessary to find out which factors should be paid special attention to.

It is necessary to have some understanding of the nature of the relationship between economic factors and the modeled. The factors to be included must meet the following criteria:

Must be measurable. In order to use a factor describing the quality of an object, in any case, it should be given a quantitative form.
There should be no factor intercorrelation, or functional relationship. Such actions most often lead to irreversible consequences - the system of ordinary equations becomes unconditioned, and this entails its unreliability and fuzzy estimates.
In the case of a huge correlation indicator, there is no way to find out the isolated influence of factors on the final result of the indicator, therefore, the coefficients become uninterpretable.

Construction Methods

There are a huge number of methods and ways to explain how you can choose the factors for the equation. However, all these methods are based on the selection of coefficients using the correlation index. Among them are:

Exclusion method.
Turn on method.
Stepwise regression analysis.

The first method involves sifting out all coefficients from the aggregate set. The second method involves the introduction of many additional factors. Well, the third is the elimination of factors that were previously applied to the equation. Each of these methods has the right to exist. They have their pros and cons, but they can solve the issue of screening out unnecessary indicators in their own way. As a rule, the results obtained by each individual method are quite close.

Methods of multivariate analysis

Such methods for determining factors are based on the consideration of individual combinations of interrelated features. These include discriminant analysis, pattern recognition, principal component analysis, and cluster analysis. In addition, there is also factor analysis, however, it appeared as a result of the development of the component method. All of them are applied in certain circumstances, under certain conditions and factors.

The main goal of regression analysis consists in determining the analytical form of the relationship, in which the change in the resultant attribute is due to the influence of one or more factor signs, and the set of all other factors that also affect the resultant attribute is taken as constant and average values.
Tasks of regression analysis:
a) Establishing the form of dependence. Regarding the nature and form of the relationship between phenomena, there are positive linear and non-linear and negative linear and non-linear regression.
b) Definition of the regression function in the form of a mathematical equation of one type or another and establishing the influence of explanatory variables on the dependent variable.
c) Estimation of unknown values of the dependent variable. Using the regression function, you can reproduce the values of the dependent variable within the interval of given values of the explanatory variables (i.e., solve the interpolation problem) or evaluate the course of the process outside the specified interval (i.e., solve the extrapolation problem). The result is an estimate of the value of the dependent variable.

Pair regression - the equation of the relationship of two variables y and x: y=f(x), where y is the dependent variable (resultant sign); x - independent, explanatory variable (feature-factor).

There are linear and non-linear regressions.
Linear regression: y = a + bx + ε
Nonlinear regressions are divided into two classes: regressions that are non-linear with respect to the explanatory variables included in the analysis, but linear with respect to the estimated parameters, and regressions that are non-linear with respect to the estimated parameters.
Regressions that are non-linear in explanatory variables:

Regressions that are non-linear in the estimated parameters:

power y=a x b ε
exponential y=a b x ε
exponential y=e a+b x ε

The construction of the regression equation is reduced to estimating its parameters. To estimate the parameters of regressions that are linear in parameters, the method of least squares (LSM) is used. LSM makes it possible to obtain such estimates of parameters under which the sum of the squared deviations of the actual values of the effective feature y from the theoretical values y x is minimal, i.e.

.
For linear and nonlinear equations reducible to linear, the following system is solved for a and b:

You can use ready-made formulas that follow from this system:

The closeness of the connection between the studied phenomena is estimated by the linear pair correlation coefficient r xy for linear regression (-1≤r xy ≤1):

and correlation index p xy - for non-linear regression (0≤p xy ≤1):

An assessment of the quality of the constructed model will be given by the coefficient (index) of determination, as well as the average approximation error.
The average approximation error is the average deviation of the calculated values from the actual ones:

.
Permissible limit of values A - no more than 8-10%.
The average coefficient of elasticity E shows how many percent on average the result y will change from its average value when the factor x changes by 1% from its average value:
.

The task of analysis of variance is to analyze the variance of the dependent variable:
∑(y-y )²=∑(y x -y )²+∑(y-y x)²
where ∑(y-y)² is the total sum of squared deviations;
∑(y x -y)² - sum of squared deviations due to regression ("explained" or "factorial");
∑(y-y x)² - residual sum of squared deviations.
The share of the variance explained by regression in the total variance of the effective feature y is characterized by the coefficient (index) of determination R2:

The coefficient of determination is the square of the coefficient or correlation index.

F-test - evaluation of the quality of the regression equation - consists in testing the hypothesis But about the statistical insignificance of the regression equation and the indicator of closeness of connection. For this, a comparison of the actual F fact and the critical (tabular) F table of the values of the Fisher F-criterion is performed. F fact is determined from the ratio of the values of the factorial and residual variances calculated for one degree of freedom:
,
where n is the number of population units; m is the number of parameters for variables x.
F table is the maximum possible value of the criterion under the influence of random factors for given degrees of freedom and significance level a. Significance level a - the probability of rejecting the correct hypothesis, provided that it is true. Usually a is taken equal to 0.05 or 0.01.
If F table< F факт, то Н о - гипотеза о случайной природе оцениваемых характеристик отклоняется и признается их статистическая значимость и надежность. Если F табл >F is a fact, then the hypothesis H about is not rejected and the statistical insignificance, the unreliability of the regression equation is recognized.
To assess the statistical significance of the regression and correlation coefficients, Student's t-test and confidence intervals for each of the indicators are calculated. A hypothesis H about the random nature of the indicators is put forward, i.e. about their insignificant difference from zero. The assessment of the significance of the regression and correlation coefficients using the Student's t-test is carried out by comparing their values with the magnitude of the random error:
; ; .
Random errors of linear regression parameters and correlation coefficient are determined by the formulas:

Comparing the actual and critical (tabular) values of t-statistics - t tabl and t fact - we accept or reject the hypothesis H o.
The relationship between Fisher's F-test and Student's t-statistics is expressed by the equality

If t table< t факт то H o отклоняется, т.е. a , b и r xy не случайно отличаются от нуля и сформировались под влиянием систематически действующего фактора х. Если t табл >t the fact that the hypothesis H about is not rejected and the random nature of the formation of a, b or r xy is recognized.
To calculate the confidence interval, we determine the marginal error D for each indicator:
Δ a =t table m a , Δ b =t table m b .
The formulas for calculating confidence intervals are as follows:
γ a \u003d aΔ a; γ a \u003d a-Δ a; γ a =a+Δa
γ b = bΔ b ; γ b = b-Δ b ; γb =b+Δb
If zero falls within the boundaries of the confidence interval, i.e. If the lower limit is negative and the upper limit is positive, then the estimated parameter is assumed to be zero, since it cannot simultaneously take on both positive and negative values.
The forecast value y p is determined by substituting the corresponding (forecast) value x p into the regression equation y x =a+b·x . The average standard error of the forecast m y x is calculated:
,
where
and the confidence interval of the forecast is built:
γ y x =y p Δ y p ; γ y x min=y p -Δ y p ; γ y x max=y p +Δ y p
where Δ y x =t table ·m y x .

Solution example

Task number 1. For seven territories of the Ural region For 199X, the values of two signs are known.
Table 1.

Required: 1. To characterize the dependence of y on x, calculate the parameters of the following functions:
a) linear;
b) power law (previously it is necessary to perform the procedure of linearization of variables by taking the logarithm of both parts);
c) demonstrative;
d) equilateral hyperbola (you also need to figure out how to pre-linearize this model).
2. Evaluate each model through the average approximation error A and Fisher's F-test.

Solution (Option #1)

To calculate the parameters a and b of the linear regression y=a+b·x (the calculation can be done using a calculator).
solve the system of normal equations with respect to a and b:

Based on the initial data, we calculate ∑y, ∑x, ∑y x, ∑x², ∑y²:

	y	x	yx	x2	y2	y x	y-y x	Ai
l	68,8	45,1	3102,88	2034,01	4733,44	61,3	7,5	10,9
2	61,2	59,0	3610,80	3481,00	3745,44	56,5	4,7	7,7
3	59,9	57,2	3426,28	3271,84	3588,01	57,1	2,8	4,7
4	56,7	61,8	3504,06	3819,24	3214,89	55,5	1,2	2,1
5	55,0	58,8	3234,00	3457,44	3025,00	56,5	-1,5	2,7
6	54,3	47,2	2562,96	2227,84	2948,49	60,5	-6,2	11,4
7	49,3	55,2	2721,36	3047,04	2430,49	57,8	-8,5	17,2
Total	405,2	384,3	22162,34	21338,41	23685,76	405,2	0,0	56,7
Wed value (Total/n)	57,89 y	54,90 x	3166,05 x y	3048,34 x²	3383,68 y²	X	X	8,1
s	5,74	5,86	X	X	X	X	X	X
s2	32,92	34,34	X	X	X	X	X	X

a=y -b x = 57.89+0.35 54.9 ≈ 76.88

Regression equation: y= 76,88 - 0,35X. With an increase in the average daily wage by 1 rub. the share of spending on the purchase of food products is reduced by an average of 0.35% points.
Calculate the linear coefficient of pair correlation:

Communication is moderate, reverse.
Let's determine the coefficient of determination: r² xy =(-0.35)=0.127
The 12.7% variation in the result is explained by the variation in the x factor. Substituting the actual values into the regression equation X, we determine the theoretical (calculated) values of y x . Let us find the value of the average approximation error A :

On average, the calculated values deviate from the actual ones by 8.1%.
Let's calculate the F-criterion:

The obtained value indicates the need to accept the hypothesis H 0 about the random nature of the revealed dependence and the statistical insignificance of the parameters of the equation and the indicator of closeness of connection.
1b. The construction of the power model y=a x b is preceded by the procedure of linearization of variables. In the example, linearization is done by taking the logarithm of both sides of the equation:
lg y=lg a + b lg x
Y=C+b Y
where Y=lg(y), X=lg(x), C=lg(a).

For calculations, we use the data in Table. 1.3.
Table 1.3

	Y	X	YX	Y2	x2	y x	y-y x	(y-yx)²	Ai
1	1,8376	1,6542	3,0398	3,3768	2,7364	61,0	7,8	60,8	11,3
2	1,7868	1,7709	3,1642	3,1927	3,1361	56,3	4,9	24,0	8,0
3	1,7774	1,7574	3,1236	3,1592	3,0885	56,8	3,1	9,6	5,2
4	1,7536	1,7910	3,1407	3,0751	3,2077	55,5	1,2	1,4	2,1
5	1,7404	1,7694	3,0795	3,0290	3,1308	56,3	-1,3	1,7	2,4
6	1,7348	1,6739	2,9039	3,0095	2,8019	60,2	-5,9	34,8	10,9
7	1,6928	1,7419	2,9487	2,8656	3,0342	57,4	-8,1	65,6	16,4
Total	12,3234	12,1587	21,4003	21,7078	21,1355	403,5	1,7	197,9	56,3
Mean	1,7605	1,7370	3,0572	3,1011	3,0194	X	X	28,27	8,0
σ	0,0425	0,0484	X	X	X	X	X	X	X
σ2	0,0018	0,0023	X	X	X	X	X	X	X

Calculate C and b:

C=Y -b X = 1.7605+0.298 1.7370 = 2.278126
We get a linear equation: Y=2.278-0.298 X
After potentiating it, we get: y=10 2.278 x -0.298
Substituting in this equation the actual values X, we get the theoretical values of the result. Based on them, we calculate the indicators: the tightness of the connection - the correlation index p xy and the average approximation error A .

The characteristics of the power model indicate that it describes the relationship somewhat better than the linear function.

1c. The construction of the equation of the exponential curve y \u003d a b x is preceded by the procedure for linearizing the variables when taking the logarithm of both parts of the equation:
lg y=lg a + x lg b
Y=C+B x
For calculations, we use the table data.

	Y	x	Yx	Y2	x2	y x	y-y x	(y-yx)²	Ai
1	1,8376	45,1	82,8758	3,3768	2034,01	60,7	8,1	65,61	11,8
2	1,7868	59,0	105,4212	3,1927	3481,00	56,4	4,8	23,04	7,8
3	1,7774	57,2	101,6673	3,1592	3271,84	56,9	3,0	9,00	5,0
4	1,7536	61,8	108,3725	3,0751	3819,24	55,5	1,2	1,44	2,1
5	1,7404	58,8	102,3355	3,0290	3457,44	56,4	-1,4	1,96	2,5
6	1,7348	47,2	81,8826	3,0095	2227,84	60,0	-5,7	32,49	10,5
7	1,6928	55,2	93,4426	2,8656	3047,04	57,5	-8,2	67,24	16,6
Total	12,3234	384,3	675,9974	21,7078	21338,41	403,4	-1,8	200,78	56,3
Wed zn.	1,7605	54,9	96,5711	3,1011	3048,34	X	X	28,68	8,0
σ	0,0425	5,86	X	X	X	X	X	X	X
σ2	0,0018	34,339	X	X	X	X	X	X	X

The values of the regression parameters A and AT amounted to:

A=Y -B x = 1.7605+0.0023 54.9 = 1.887
A linear equation is obtained: Y=1.887-0.0023x. We potentiate the resulting equation and write it in the usual form:
y x =10 1.887 10 -0.0023x = 77.1 0.9947 x
We estimate the tightness of the relationship through the correlation index p xy:

3588,01 56,9 3,0 9,00 5,0 4 56,7 0,0162 0,9175 0,000262 3214,89 55,5 1,2 1,44 2,1 5 55 0,0170 0,9354 0,000289 3025,00 56,4 -1,4 1,96 2,5 6 54,3 0,0212 1,1504 0,000449 2948,49 60,8 -6,5 42,25 12,0 7 49,3 0,0181 0,8931 0,000328 2430,49 57,5 -8,2 67,24 16,6 Total405,2 0,1291 7,5064 0,002413 23685,76 405,2 0,0 194,90 56,5 Mean57,9 0,0184 1,0723 0,000345 3383,68 XX27,84 8,1 σ 5,74 0,002145 XXXXXXX σ232,9476 0,000005 XX

Regression analysis is a statistical research method that allows you to show the dependence of a parameter on one or more independent variables. In the pre-computer era, its use was quite difficult, especially when it came to large amounts of data. Today, having learned how to build a regression in Excel, you can solve complex statistical problems in just a couple of minutes. Below are specific examples from the field of economics.

Types of regression

The concept itself was introduced into mathematics in 1886. Regression happens:

linear;
parabolic;
power;
exponential;
hyperbolic;
demonstrative;
logarithmic.

Example 1

Consider the problem of determining the dependence of the number of retired team members on the average salary at 6 industrial enterprises.

A task. At six enterprises, we analyzed the average monthly salary and the number of employees who left of their own free will. In tabular form we have:


		The number of people who left	Salary
			30000 rubles
			35000 rubles
			40000 rubles
			45000 rubles
			50000 rubles
			55000 rubles
			60000 rubles

For the problem of determining the dependence of the number of retired workers on the average salary at 6 enterprises, the regression model has the form of the equation Y = a 0 + a 1 x 1 +…+a k x k , where x i are the influencing variables, a i are the regression coefficients, a k is the number of factors.

For this task, Y is the indicator of employees who left, and the influencing factor is the salary, which we denote by X.

Using the capabilities of the spreadsheet "Excel"

Regression analysis in Excel must be preceded by the application of built-in functions to the available tabular data. However, for these purposes, it is better to use the very useful add-in "Analysis Toolkit". To activate it you need:

from the "File" tab, go to the "Options" section;
in the window that opens, select the line "Add-ons";
click on the "Go" button located at the bottom, to the right of the "Management" line;
check the box next to the name "Analysis Package" and confirm your actions by clicking "OK".

If everything is done correctly, the desired button will appear on the right side of the Data tab, located above the Excel worksheet.

in Excel

Now that we have at hand all the necessary virtual tools for performing econometric calculations, we can begin to solve our problem. For this:

click on the "Data Analysis" button;
in the window that opens, click on the "Regression" button;
in the tab that appears, enter the range of values for Y (the number of employees who quit) and for X (their salaries);
We confirm our actions by pressing the "Ok" button.

As a result, the program will automatically populate a new sheet of the spreadsheet with regression analysis data. Note! Excel has the ability to manually set the location you prefer for this purpose. For example, it could be the same sheet where the Y and X values are, or even a new workbook specifically designed to store such data.

Analysis of regression results for R-square

In Excel, the data obtained during the processing of the data of the considered example looks like this:

First of all, you should pay attention to the value of the R-square. It is the coefficient of determination. In this example, R-square = 0.755 (75.5%), i.e., the calculated parameters of the model explain the relationship between the considered parameters by 75.5%. The higher the value of the coefficient of determination, the more applicable the chosen model for a particular task. It is believed that it correctly describes the real situation with an R-squared value above 0.8. If R-squared<0,5, то такой анализа регрессии в Excel нельзя считать резонным.

Ratio Analysis

The number 64.1428 shows what the value of Y will be if all the variables xi in the model we are considering are set to zero. In other words, it can be argued that the value of the analyzed parameter is also influenced by other factors that are not described in a particular model.

The next coefficient -0.16285, located in cell B18, shows the weight of the influence of variable X on Y. This means that the average monthly salary of employees within the model under consideration affects the number of quitters with a weight of -0.16285, i.e. the degree of its influence at all small. The "-" sign indicates that the coefficient has a negative value. This is obvious, since everyone knows that the higher the salary at the enterprise, the less people express a desire to terminate the employment contract or quit.

Multiple regression

This term refers to a connection equation with several independent variables of the form:

y \u003d f (x 1 + x 2 + ... x m) + ε, where y is the effective feature (dependent variable), and x 1 , x 2 , ... x m are the factor factors (independent variables).

Parameter Estimation

For multiple regression (MR) it is carried out using the method of least squares (OLS). For linear equations of the form Y = a + b 1 x 1 +…+b m x m + ε, we construct a system of normal equations (see below)

To understand the principle of the method, consider the two-factor case. Then we have a situation described by the formula

From here we get:

where σ is the variance of the corresponding feature reflected in the index.

LSM is applicable to the MP equation on a standardizable scale. In this case, we get the equation:

where t y , t x 1, … t xm are standardized variables for which the mean values are 0; β i are the standardized regression coefficients, and the standard deviation is 1.

Please note that all β i in this case are set as normalized and centralized, so their comparison with each other is considered correct and admissible. In addition, it is customary to filter out factors, discarding those with the smallest values of βi.

Problem using linear regression equation

Suppose there is a table of the price dynamics of a particular product N during the last 8 months. It is necessary to make a decision on the advisability of purchasing its batch at a price of 1850 rubles/t.


month number	month name	price of item N
		1750 rubles per ton
		1755 rubles per ton
		1767 rubles per ton
		1760 rubles per ton
		1770 rubles per ton
		1790 rubles per ton
		1810 rubles per ton
		1840 rubles per ton

To solve this problem in the Excel spreadsheet, you need to use the Data Analysis tool already known from the above example. Next, select the "Regression" section and set the parameters. It must be remembered that in the "Input interval Y" field, a range of values for the dependent variable (in this case, the price of a product in specific months of the year) must be entered, and in the "Input interval X" - for the independent variable (month number). Confirm the action by clicking "Ok". On a new sheet (if it was indicated so), we get data for regression.

Based on them, we build a linear equation of the form y=ax+b, where the parameters a and b are the coefficients of the row with the name of the month number and the coefficients and the “Y-intersection” row from the sheet with the results of the regression analysis. Thus, the linear regression equation (LE) for problem 3 is written as:

Product price N = 11.714* month number + 1727.54.

or in algebraic notation

y = 11.714 x + 1727.54

Analysis of results

To decide whether the resulting linear regression equation is adequate, multiple correlation coefficients (MCC) and determination coefficients are used, as well as Fisher's test and Student's test. In the Excel table with regression results, they appear under the names of multiple R, R-square, F-statistic and t-statistic, respectively.

KMC R makes it possible to assess the tightness of the probabilistic relationship between the independent and dependent variables. Its high value indicates a fairly strong relationship between the variables "Number of the month" and "Price of goods N in rubles per 1 ton". However, the nature of this relationship remains unknown.

The square of the coefficient of determination R 2 (RI) is a numerical characteristic of the share of the total scatter and shows the scatter of which part of the experimental data, i.e. values of the dependent variable corresponds to the linear regression equation. In the problem under consideration, this value is equal to 84.8%, i.e., the statistical data are described with a high degree of accuracy by the obtained SD.

F-statistics, also called Fisher's test, is used to assess the significance of a linear relationship, refuting or confirming the hypothesis of its existence.

(Student's criterion) helps to evaluate the significance of the coefficient with an unknown or free term of a linear relationship. If the value of the t-criterion > t cr, then the hypothesis of the insignificance of the free term of the linear equation is rejected.

In the problem under consideration for the free member, using the Excel tools, it was obtained that t = 169.20903, and p = 2.89E-12, i.e., we have a zero probability that the correct hypothesis about the insignificance of the free member will be rejected. For the coefficient at unknown t=5.79405, and p=0.001158. In other words, the probability that the correct hypothesis about the insignificance of the coefficient for the unknown will be rejected is 0.12%.

Thus, it can be argued that the resulting linear regression equation is adequate.

The problem of the expediency of buying a block of shares

Multiple regression in Excel is performed using the same Data Analysis tool. Consider a specific applied problem.

The management of NNN must make a decision on the advisability of purchasing a 20% stake in MMM SA. The cost of the package (JV) is 70 million US dollars. NNN specialists collected data on similar transactions. It was decided to evaluate the value of the block of shares according to such parameters, expressed in millions of US dollars, as:

accounts payable (VK);
annual turnover (VO);
accounts receivable (VD);
cost of fixed assets (SOF).

In addition, the parameter payroll arrears of the enterprise (V3 P) in thousands of US dollars is used.

Solution using Excel spreadsheet

First of all, you need to create a table of initial data. It looks like this:

call the "Data Analysis" window;
select the "Regression" section;
in the box "Input interval Y" enter the range of values of dependent variables from column G;
click on the icon with a red arrow to the right of the "Input interval X" window and select the range of all values from columns B, C, D, F on the sheet.

Select "New Worksheet" and click "Ok".

Get the regression analysis for the given problem.

Examination of the results and conclusions

“We collect” from the rounded data presented above on the Excel spreadsheet sheet, the regression equation:

SP \u003d 0.103 * SOF + 0.541 * VO - 0.031 * VK + 0.405 * VD + 0.691 * VZP - 265.844.

In a more familiar mathematical form, it can be written as:

y = 0.103*x1 + 0.541*x2 - 0.031*x3 +0.405*x4 +0.691*x5 - 265.844

Data for JSC "MMM" are presented in the table:

Substituting them into the regression equation, they get a figure of 64.72 million US dollars. This means that the shares of JSC MMM should not be purchased, since their value of 70 million US dollars is rather overstated.

As you can see, the use of the Excel spreadsheet and the regression equation made it possible to make an informed decision regarding the feasibility of a very specific transaction.

Now you know what regression is. The examples in Excel discussed above will help you solve practical problems from the field of econometrics.

Portal for the student. Self-training