Empirical regression coefficients. Fundamentals of Linear Regression

When there is a correlation between a factor attribute and a resultant attribute, physicians often need to determine by how much the value of one attribute changes when the other changes by one unit of measurement, whether generally accepted or set by the researcher.

For example, how will the body weight of first-grade schoolchildren (girls or boys) change if their height increases by 1 cm? The regression analysis method is used to answer such questions.

Most often, the regression analysis method is used to develop normative scales and standards for physical development.

  1. Definition of regression. Regression is a function that allows one to determine, from the average value of one attribute, the average value of another attribute correlated with the first.

    For this purpose, the regression coefficient and a number of other parameters are used. For example, one can calculate the average number of colds at given values of the mean monthly air temperature in the autumn-winter period.

  2. Definition of the regression coefficient. The regression coefficient is the absolute value by which the value of one attribute changes on average when another attribute associated with it changes by a specified unit of measurement.
  3. Regression coefficient formula:
    R_y/x = r_xy × (σ_y / σ_x),
    where R_y/x is the regression coefficient;
    r_xy is the correlation coefficient between attributes x and y;
    σ_y and σ_x are the standard deviations of attributes y and x.

    In our example:
    r_xy = −0.96 (correlation coefficient between air temperature and the number of colds);
    σ_x = 4.6 (standard deviation of air temperature in the autumn-winter period);
    σ_y = 8.65 (standard deviation of the number of infectious colds).
    Thus, R_y/x = −0.96 × (8.65 / 4.6) ≈ −1.8, i.e., when the mean monthly air temperature (x) decreases by 1 degree, the average number of infectious colds (y) in the autumn-winter period increases by 1.8 cases. (A code sketch working through this example follows the list below.)

  4. Regression equation: y = M_y + R_y/x × (x − M_x),
    where y is the average value of the attribute to be determined when the average value of the other attribute (x) changes;
    x is the known average value of the other attribute;
    R_y/x is the regression coefficient;
    M_x and M_y are the known average values of attributes x and y.

    For example, the average number of infectious colds (y) can be determined without special measurements for any mean monthly air temperature (x). So, if x = −9°, R_y/x = −1.8 cases per degree, M_x = −7°, and M_y = 20 cases, then y = 20 + (−1.8) × (−9 − (−7)) = 20 + 3.6 = 23.6 cases.
    This equation applies when the relationship between the two attributes (x and y) is linear.

  5. Purpose of the regression equation. The regression equation is used to plot the regression line, which in turn allows one to determine, without special measurements, the average value (y) of one attribute for any value (x) of the other. From these data a graph, the regression line, is built; it can be used to determine the average number of colds at any value of mean monthly temperature within the range between the calculated values of the number of colds.
  6. Regression sigma (formula): σ_Ry/x = σ_y × √(1 − r_xy²),
    where σ_Ry/x is the sigma (standard deviation) of the regression;
    σ_y is the standard deviation of attribute y;
    r_xy is the correlation coefficient between attributes x and y.

    So, if σ_y (the standard deviation of the number of colds) is 8.65 and r_xy (the correlation coefficient between the number of colds (y) and the mean monthly air temperature (x)) is −0.96, then σ_Ry/x = 8.65 × √(1 − (−0.96)²) = 8.65 × 0.28 = 2.42 cases.

  7. Purpose of the regression sigma. It characterizes the measure of variability of the resultant attribute (y).

    For example, it characterizes the variability of the number of colds at a given value of the mean monthly air temperature in the autumn-winter period. Thus, at an air temperature of x1 = −6°, the average number of colds can range from 15.78 to 20.62 cases.
    At x2 = −9°, the average number of colds can range from 21.18 to 26.02 cases, and so on.

    The regression sigma is used to construct a regression scale, which reflects the deviation of the values of the resultant attribute from its average value plotted on the regression line.

  8. Data required to calculate and plot the regression scale:
    • regression coefficient, R_y/x;
    • regression equation, y = M_y + R_y/x × (x − M_x);
    • regression sigma, σ_Ry/x.
  9. Sequence of calculations and graphical construction of the regression scale.
    • Determine the regression coefficient using the formula (see paragraph 3). For example, determine by how much body weight changes on average (at a certain age, depending on sex) when average height changes by 1 cm.
    • Using the regression equation (see paragraph 4), determine the average body weight (y1, y2, y3 …)* for given height values (x1, x2, x3 …).
      ________________
      * The value of y should be calculated for at least three known values of x.

      The average values of body weight and height (M_x and M_y) for the given age and sex are assumed known.

    • Calculate the regression sigma from the known values of σ_y and r_xy by substituting them into the formula (see paragraph 6).
    • Based on the known values x1, x2, x3, the corresponding average values y1, y2, y3, and the smallest (y − σ_Ry/x) and largest (y + σ_Ry/x) values of y, construct the regression scale.

      For a graphical representation, first mark the values x1, x2, x3 on the x-axis and the corresponding values y1, y2, y3 on the y-axis, i.e., build the regression line, for example, the dependence of body weight (y) on height (x).

      Then, at the corresponding points y1, y2, y3, mark the numerical values of the regression sigma, i.e., find on the graph the smallest and largest values of y1, y2, y3.

  10. Practical use of the regression scale. Normative scales and standards are developed, in particular for physical development, and the standard scale can be used to give an individual assessment of children's development. Physical development is assessed as harmonious if, for example, at a certain height the child's body weight lies within one regression sigma of the calculated average body weight (y) for that height (x): y ± 1 σ_Ry/x.

    Physical development is considered disharmonious in body weight if the child's body weight for a given height lies within the second regression sigma: y ± 2 σ_Ry/x.

    Physical development is sharply disharmonious, through either excess or insufficient body weight, if the body weight for a given height lies within the third regression sigma: y ± 3 σ_Ry/x.
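To make points 3 to 7 concrete, here is a minimal Python sketch (not part of the original lecture; the variable and function names are ours) that reproduces the colds example with the values given above:

```python
import math

# Values from the lecture's example (point 3):
r_xy = -0.96                  # correlation: temperature (x) vs. colds (y)
sigma_x, sigma_y = 4.6, 8.65  # standard deviations of x and y
M_x, M_y = -7.0, 20.0         # mean temperature (deg) and mean number of colds

# Point 3: regression coefficient, rounded as in the lecture
R_yx = round(r_xy * sigma_y / sigma_x, 1)             # -1.8 cases per degree

# Point 6: regression sigma
sigma_R = round(sigma_y * math.sqrt(1 - r_xy**2), 2)  # 2.42 cases

# Point 4: regression equation y = M_y + R_y/x * (x - M_x)
def predict(x):
    return M_y + R_yx * (x - M_x)

for x in (-6, -9):
    y = predict(x)
    print(f"x = {x} deg: y = {y:.1f} cases, "
          f"range {y - sigma_R:.2f} to {y + sigma_R:.2f}")
# x = -6 deg: y = 18.2 cases, range 15.78 to 20.62   (point 7)
# x = -9 deg: y = 23.6 cases, range 21.18 to 26.02
```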

According to the results of a statistical study of the physical development of 5-year-old boys, it is known that their average height (x) is 109 cm, and their average body weight (y) is 19 kg. The correlation coefficient between height and body weight is +0.9, standard deviations are presented in the table.

Required:

  • calculate the regression coefficient;
  • using the regression equation, determine what the expected body weight of 5-year-old boys will be with a height equal to x1 = 100 cm, x2 = 110 cm, x3 = 120 cm;
  • calculate the regression sigma, build a regression scale, present the results of its solution graphically;
  • draw the appropriate conclusions.

The condition of the problem and the results of its solution are presented in the summary table.

Table 1

Conditions of the problem:
  Height (x): M = 109 cm, σ = ±4.4 cm
  Body weight (y): M = 19 kg, σ = ±0.8 kg
  Correlation coefficient: r_xy = +0.9

Results of the solution:
  Regression coefficient: R_y/x = 0.16 kg per cm
  Regression sigma: σ_Ry/x = ±0.35 kg

  Regression scale (expected body weight, in kg):
  x = 100 cm: y = 17.56; y − σ_Ry/x = 17.21; y + σ_Ry/x = 17.91
  x = 110 cm: y = 19.16; y − σ_Ry/x = 18.81; y + σ_Ry/x = 19.51
  x = 120 cm: y = 20.76; y − σ_Ry/x = 20.41; y + σ_Ry/x = 21.11

Solution. The calculations are summarized in Table 1.

Conclusion. Thus, within the range of the calculated body-weight values, the regression scale makes it possible to determine body weight for any other height value and to assess a child's individual development. To do this, erect a perpendicular from the height value to the regression line.
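As an illustration (not part of the original solution), the following Python sketch reproduces the computations behind Table 1; the coefficients are rounded to two decimal places, as in the table:

```python
import math

# Given: 5-year-old boys, mean height 109 cm, mean weight 19 kg,
# sigma_x = 4.4 cm, sigma_y = 0.8 kg, r_xy = +0.9
M_x, M_y = 109.0, 19.0
sigma_x, sigma_y = 4.4, 0.8
r_xy = 0.9

R_yx = round(r_xy * sigma_y / sigma_x, 2)             # 0.16 kg per cm
sigma_R = round(sigma_y * math.sqrt(1 - r_xy**2), 2)  # 0.35 kg

print(f"R_y/x = {R_yx} kg/cm, sigma_R = {sigma_R} kg")
for x in (100, 110, 120):
    y = M_y + R_yx * (x - M_x)            # regression equation
    print(f"height {x} cm: expected weight {y:.2f} kg "
          f"({y - sigma_R:.2f} to {y + sigma_R:.2f})")
# Reproduces the 17.56 / 19.16 / 20.76 kg values and ranges of Table 1.
```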


The study of correlation dependences is based on examining relationships between variables in which the values of one variable, taken as the dependent variable, change "on average" depending on the values taken by another variable, considered as a cause in relation to the dependent variable. This cause acts through a complex interplay of various factors, so the manifestation of the pattern is obscured by the influence of chance. By calculating the average values of the resultant attribute for a given group of values of the factor attribute, the influence of chance is partly eliminated; by calculating the parameters of the theoretical relationship line, it is eliminated further, and an unambiguous (in form) change of y with a change of the factor x is obtained.

To study stochastic relationships, the method of comparing two parallel series, the method of analytical groupings, correlation analysis, regression analysis, and some nonparametric methods are widely used. In general, the task of statistics in the study of relationships is not only to quantify their presence, direction, and strength, but also to determine the form (the analytical expression) of the influence of the factor attributes on the resultant one. Methods of correlation and regression analysis are used to solve the latter.

CHAPTER 1. REGRESSION EQUATION: THEORETICAL FOUNDATIONS

1.1. Regression equation: essence and types of functions

Regression (Latin regressio, reverse movement, transition from more complex forms of development to less complex ones) is one of the basic concepts of probability theory and mathematical statistics, expressing the dependence of the average value of a random variable on the values of another random variable or of several random variables. The concept was introduced by Francis Galton in 1886.

The theoretical regression line is the line around which the points of the correlation field are grouped and which indicates the main direction, the main trend of the relationship.

The theoretical regression line should reflect the change in the average values of the resultant attribute y as the values of the factor attribute x change, provided that all other causes, random with respect to the factor x, cancel one another out. Therefore, this line should be drawn so that the sum of the deviations of the points of the correlation field from the corresponding points of the theoretical regression line equals zero, and the sum of the squares of these deviations is minimal.

The regression equation y = f(x) is a formula for the statistical relationship between the variables.

A straight line in the plane (in a space of two dimensions) is given by the equation y = a + b·x. In more detail: the variable y can be expressed in terms of a constant (a) and a slope (b) multiplied by the variable x. The constant is sometimes also called the intercept, and the slope the regression coefficient, or b-coefficient.

An important step in regression analysis is to determine the type of function, which characterizes the relationship between features. The main basis should be a meaningful analysis of the nature of the dependence under study, its mechanism. At the same time, it is far from always possible to theoretically substantiate the form of connection of each of the factors with the performance indicator, since the studied socio-economic phenomena are very complex and the factors that form their level are closely intertwined and interact with each other. Therefore, on the basis of a theoretical analysis, the most general conclusions can often be drawn regarding the direction of the relationship, the possibility of its change in the population under study, the legitimacy of using a linear relationship, the possible presence of extreme values, etc. A necessary addition to such assumptions should be the analysis of specific factual data.

An approximate idea of the relationship line can be obtained from the empirical regression line. The empirical regression line is usually a broken line with more or less pronounced kinks, because the influence of other, unaccounted-for factors affecting the variation of the resultant attribute is not fully canceled out in the averages when the number of observations is insufficient. Therefore, the empirical line can be used to select and justify the type of theoretical curve, provided the number of observations is sufficiently large.

One of the elements of specific studies is the comparison of various dependence equations, based on quality criteria for how well competing models approximate the empirical data. The following types of functions are most often used to characterize relationships between economic indicators:

1. Linear: y = a + b·x

2. Hyperbolic: y = a + b/x

3. Exponential: y = a·b^x

4. Parabolic: y = a + b·x + c·x²

5. Power: y = a·x^b

6. Logarithmic: y = a + b·ln x

7. Logistic: y = c / (1 + a·e^(−b·x))

A model with one explained (dependent) variable and one explanatory variable is a paired regression model. If two or more explanatory (factor) variables are used, one speaks of a multiple regression model. In that case linear, hyperbolic, exponential, power, and other types of functions connecting these variables can be chosen.

To find the parameters a and b of the regression equation, the least squares method is used. In the least squares method, the function that best fits the empirical data is taken to be the one for which the sum of the squared deviations of the empirical points from the theoretical regression line is the minimum value.

The criterion of the least squares method can be written as follows:

S = Σ (y_i − (a + b·x_i))² → min.

Therefore, the application of the least squares method to determine the parameters a and b of the straight line that best fits the empirical data reduces to an extremum problem.
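As a sketch of how this extremum problem is solved in practice (our illustration, with made-up data): setting the partial derivatives of S with respect to a and b to zero yields the familiar closed-form estimates.

```python
def ols_fit(xs, ys):
    """Closed-form least squares estimates for y = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx              # slope: the regression coefficient
    a = y_bar - b * x_bar      # intercept: the line passes through (x_bar, y_bar)
    return a, b

xs = [1, 2, 3, 4, 5]           # made-up data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = ols_fit(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(a, b)                    # fitted intercept and slope
print(sum(residuals))          # approximately 0, as property 4 below states
```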

The following conclusions can be drawn about the least squares estimates:

1. The least squares estimates are functions of the sample data, which makes them easy to calculate.

2. The least squares estimates are point estimates of the theoretical regression coefficients.

3. The empirical regression line necessarily passes through the point (x̄, ȳ).

4. The empirical regression equation is constructed so that the sum of the deviations of the observations from the fitted line equals zero: Σ (y_i − ŷ_i) = 0.

A graphical representation of the empirical and theoretical relationship lines is shown in Figure 1.


The parameter b in the equation is the regression coefficient. With a direct relationship, the regression coefficient is positive; with an inverse relationship, it is negative. The regression coefficient shows by how much, on average, the value of the resultant attribute y changes when the factor attribute x changes by one. Geometrically, the regression coefficient is the slope of the straight line depicting the correlation equation relative to the x-axis (for the equation y = a + b·x, b = tan α, where α is the angle of inclination).

The branch of multivariate statistical analysis devoted to recovering dependences is called regression analysis. The term "linear regression analysis" is used when the function under consideration depends linearly on the estimated parameters (the dependence on the independent variables can be arbitrary). The theory of estimating the unknown parameters is well developed precisely in the case of linear regression analysis; if there is no linearity and it is impossible to pass to a linear problem, then, as a rule, good properties should not be expected of the estimates.

If correlation analysis characterizes the strength of the relationship between two variables, regression analysis serves to determine the type of this relationship and makes it possible to predict the value of one (dependent) variable from the value of another (independent) variable. To perform linear regression analysis, the dependent variable must have an interval (or at least ordinal) scale. Binary logistic regression, in turn, reveals the dependence of a dichotomous variable on some other variable measured on any scale; the same application conditions hold for probit analysis. If the dependent variable is categorical but has more than two categories, multinomial logistic regression is the appropriate method, while nonlinear relationships between interval-scale variables are handled by nonlinear regression.

REGRESSION COEFFICIENT

The regression coefficient is one of the characteristics of the relationship between a dependent variable y and an independent variable x. It shows by how many units the value of y changes when x changes by one unit of its measurement. Geometrically, it is the slope of the line y(x).


In the previous notes, the focus was often on a single numerical variable, such as mutual fund returns, Web page load time, or soft drink consumption. In this and the following notes, we consider methods for predicting the values of a numerical variable from the values of one or more other numerical variables.

The material is illustrated with a running example: forecasting sales volume for a clothing store. The Sunflowers chain of discount clothing stores has been expanding steadily for 25 years. However, the company currently has no systematic approach to selecting new outlets: the location of a new store is chosen on subjective grounds, such as favorable rental terms or the manager's idea of an ideal location. Imagine that you are the head of the Special Projects and Planning Department and have been tasked with developing a strategic plan for opening new stores. The plan should contain a forecast of annual sales for newly opened stores. You believe that selling space is directly related to revenue and want to factor that into your decision-making process. How do you develop a statistical model that predicts annual sales from the size of a new store?

Typically, regression analysis is used to predict the values of a variable. Its goal is to develop a statistical model that predicts the values of the dependent variable, or response, from the values of at least one independent, or explanatory, variable. In this note we consider simple linear regression, a statistical method that predicts the values of a dependent variable Y from the values of an independent variable X. The following notes describe the multiple regression model, designed to predict the values of the dependent variable Y from the values of several independent variables (X1, X2, …, Xk).


Types of regression models

The Durbin-Watson statistic D is related to the autocorrelation coefficient of the residuals by D ≈ 2(1 − ρ1), where ρ1 is the autocorrelation coefficient: if ρ1 = 0 (no autocorrelation), D ≈ 2; if ρ1 ≈ 1 (positive autocorrelation), D ≈ 0; if ρ1 = −1 (negative autocorrelation), D ≈ 4.

In practice, the Durbin-Watson criterion is applied by comparing the value D with the critical theoretical values d_L and d_U for a given number of observations n, number of independent variables of the model k (for simple linear regression k = 1), and significance level α. If D < d_L, the hypothesis of independence of the random deviations is rejected (hence, there is positive autocorrelation); if D > d_U, the hypothesis is not rejected (that is, there is no autocorrelation); if d_L < D < d_U, there is not enough evidence to decide. When the calculated value of D exceeds 2, it is the expression (4 − D), rather than D itself, that is compared with d_L and d_U.

To calculate the Durbin-Watson statistic in Excel, we turn to the bottom table in Fig. 14, Residual Output. The numerator in expression (10) is calculated with the function =SUMXMY2(array1, array2), and the denominator with =SUMSQ(array) (Fig. 16).

Fig. 16. Formulas for calculating the Durbin-Watson statistic

In our example, D = 0.883. The main question is: what value of the Durbin-Watson statistic should be considered small enough to conclude that positive autocorrelation exists? The value of D must be compared with the critical values d_L and d_U, which depend on the number of observations n and the significance level α (Fig. 17).

Fig. 17. Critical values of the Durbin-Watson statistic (table fragment)

Thus, in the problem of the sales volume of the store delivering goods to homes, there is one independent variable (k = 1), 15 observations (n = 15), and a significance level α = 0.05. Consequently, d_L = 1.08 and d_U = 1.36. Since D = 0.883 < d_L = 1.08, there is positive autocorrelation between the residuals and the least squares method cannot be applied.
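A short Python sketch (ours, with hypothetical residual values) showing how the Durbin-Watson statistic is assembled from the same numerator and denominator as the Excel formulas above:

```python
def durbin_watson(residuals):
    """D = sum of squared successive differences / sum of squared residuals."""
    num = sum((e1 - e0) ** 2 for e0, e1 in zip(residuals, residuals[1:]))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Hypothetical residuals with a slow, wave-like drift (positive autocorrelation):
res = [1.2, 1.0, 0.7, 0.3, -0.1, -0.5, -0.8, -0.6, -0.2,
       0.4, 0.9, 1.1, 0.8, 0.5, 0.1]
print(durbin_watson(res))   # well below 2, suggesting positive autocorrelation
```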

Testing Hypotheses about Slope and Correlation Coefficient

The regression above was used solely for forecasting: the least squares method served to determine the regression coefficients and to predict the value of the variable Y for a given value of the variable X. We also considered the standard error of the estimate and the coefficient of determination. If the residual analysis confirms that the applicability conditions of the least squares method are not violated and the simple linear regression model is adequate, it can be argued, on the basis of the sample data, that a linear relationship exists between the variables in the population.

Application of the t-test for the slope. By testing whether the population slope β1 equals zero, one can determine whether there is a statistically significant relationship between the variables X and Y. If this hypothesis is rejected, it can be argued that there is a linear relationship between X and Y. The null and alternative hypotheses are formulated as follows: H0: β1 = 0 (no linear relationship), H1: β1 ≠ 0 (there is a linear relationship). By definition, the t-statistic equals the difference between the sample slope and the hypothetical population slope, divided by the standard error of the slope estimate:

(11) t = (b1 − β1) / S_b1

where b1 is the slope of the regression line based on the sample data, β1 is the hypothetical slope of the population line, and S_b1 is the standard error of the slope estimate. The test statistic t has a t-distribution with n − 2 degrees of freedom.

Let us check whether there is a statistically significant relationship between store size and annual sales at α = 0.05. The t-test results are displayed along with other parameters when the Analysis ToolPak is used (the Regression option). The full results of the Analysis ToolPak are shown in Fig. 4, and a fragment related to the t-statistic in Fig. 18.

Fig. 18. Results of applying the t-test

Since the number of stores is n = 14 (see Fig. 3), the critical values of the t-statistic at significance level α = 0.05 can be found as follows: t_L = T.INV(0.025, 12) = −2.1788, where 0.025 is half the significance level and 12 = n − 2; t_U = T.INV(0.975, 12) = +2.1788.

Since the t-statistic = 10.64 > t_U = 2.1788 (Fig. 19), the null hypothesis H0 is rejected. Likewise, the p-value for t = 10.6411, calculated by the formula =1-T.DIST(D3, 12, TRUE), is approximately zero, so H0 is rejected again. The fact that the p-value is almost zero means that if there were no real linear relationship between store size and annual sales, it would be almost impossible to detect one with linear regression. Therefore, there is a statistically significant linear relationship between average annual store sales and store size.

Fig. 19. Testing the hypothesis about the population slope at a significance level of 0.05 with 12 degrees of freedom
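The same t-test can be sketched outside Excel. The following Python fragment assumes the sample quantities quoted above (b1 = 1.670, S_b1 = 0.157, n = 14) and uses scipy only for the t-distribution:

```python
from scipy import stats

b1, s_b1, n = 1.670, 0.157, 14   # sample slope, its standard error, sample size
df = n - 2
t_stat = (b1 - 0) / s_b1                    # H0: beta1 = 0
t_crit = stats.t.ppf(0.975, df)             # two-tailed critical value, ~2.1788
p_value = 2 * stats.t.sf(abs(t_stat), df)   # two-tailed p-value
print(t_stat, t_crit, p_value)              # ~10.64 > 2.1788, p ~ 0 -> reject H0
```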

Application of the F-test for the slope. An alternative approach to testing hypotheses about the slope of a simple linear regression is to use the F-test. Recall that the F-test is used to examine the ratio of two variances. When testing the slope hypothesis, the measure of random errors is the error variance (the sum of squared errors divided by the number of degrees of freedom), so the F-test uses the ratio of the variance explained by the regression (i.e., SSR divided by the number of independent variables k) to the error variance (MSE = S²_YX).

By definition, the F-statistic equals the mean square due to regression (MSR) divided by the error variance (MSE): F = MSR / MSE, where MSR = SSR / k, MSE = SSE / (n − k − 1), and k is the number of independent variables in the regression model. The test statistic F has an F-distribution with k and n − k − 1 degrees of freedom.

For a given significance level α, the decision rule is formulated as follows: if F > F_U, the null hypothesis is rejected; otherwise it is not rejected. The results, presented as an analysis-of-variance summary table, are shown in Fig. 20.

Fig. 20. Analysis-of-variance table for testing the hypothesis of the statistical significance of the regression coefficient

Like the t-test, the F-test is displayed in the table when the Analysis ToolPak is used (the Regression option). The full results of the Analysis ToolPak are shown in Fig. 4, and a fragment related to the F-statistic in Fig. 21.

Fig. 21. Results of applying the F-test, obtained using the Excel Analysis ToolPak

The F-statistic is 113.23 and the p-value is close to zero (cell Significance F). If the significance level α is 0.05, the critical value of the F-distribution with one and 12 degrees of freedom can be obtained from the formula F_U = F.INV(0.95, 1, 12) = 4.7472 (Fig. 22). Since F = 113.23 > F_U = 4.7472 and the p-value is close to 0 < 0.05, the null hypothesis H0 is rejected, i.e., the size of a store is closely related to its annual sales volume.

Fig. 22. Testing the hypothesis about the population slope at a significance level of 0.05 with one and 12 degrees of freedom
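A hedged Python sketch of the F-test follows. The sums of squares SSR and SSE below are illustrative stand-ins (not the book's raw ANOVA values), scaled so that F matches the example's 113.23 and SSR/(SSR+SSE) matches r² = 0.904:

```python
from scipy import stats

ssr, sse = 113.23, 12.0            # illustrative sums of squares
k, n = 1, 14
msr = ssr / k                      # mean square due to regression
mse = sse / (n - k - 1)            # error variance
f_stat = msr / mse
f_crit = stats.f.ppf(0.95, k, n - k - 1)    # ~4.7472
p_value = stats.f.sf(f_stat, k, n - k - 1)
print(f_stat, f_crit, p_value)     # 113.23 > 4.7472, p ~ 0 -> reject H0
```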

Confidence interval containing the slope β1. To test the hypothesis of a linear relationship between the variables, one can also construct a confidence interval containing the slope β1 and check whether the hypothetical value β1 = 0 falls within it. The center of the confidence interval containing the slope β1 is the sample slope b1, and its boundaries are b1 ± t_{n−2}·S_b1.

As shown in Fig. 18, b1 = +1.670, n = 14, S_b1 = 0.157, and t12 = T.INV(0.975, 12) = 2.1788. Consequently, b1 ± t_{n−2}·S_b1 = +1.670 ± 2.1788 × 0.157 = +1.670 ± 0.342, i.e., +1.328 ≤ β1 ≤ +2.012. Thus, with probability 0.95, the population slope lies between +1.328 and +2.012 (i.e., between $1,328,000 and $2,012,000 per 1,000 sq. feet). Since these values are greater than zero, there is a statistically significant linear relationship between annual sales and store area; if the confidence interval contained zero, there would be no relationship between the variables. The confidence interval also means that every additional 1,000 sq. feet of store area increases average sales by $1,328,000 to $2,012,000.
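The same interval in Python, under the values assumed above:

```python
from scipy import stats

b1, s_b1, n = 1.670, 0.157, 14
t = stats.t.ppf(0.975, n - 2)          # 2.1788 for 12 degrees of freedom
lo, hi = b1 - t * s_b1, b1 + t * s_b1
print(f"{lo:.3f} .. {hi:.3f}")         # ~1.328 .. 2.012; 0 lies outside
```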

Using the t-test for the correlation coefficient. Earlier, the correlation coefficient r was introduced as a measure of the relationship between two numerical variables; it can be used to establish whether there is a statistically significant relationship between two variables. Let us denote the population correlation coefficient of the two variables by ρ. The null and alternative hypotheses are formulated as follows: H0: ρ = 0 (no correlation), H1: ρ ≠ 0 (there is a correlation). The existence of a correlation is tested with the statistic

(12) t = (r − ρ) / √((1 − r²) / (n − 2)),

where r = +√r² if b1 > 0, and r = −√r² if b1 < 0. The test statistic t has a t-distribution with n − 2 degrees of freedom.

In the problem of the Sunflowers store chain, r² = 0.904 and b1 = +1.670 (see Fig. 4). Since b1 > 0, the correlation coefficient between annual sales and store size is r = +√0.904 = +0.951. Let us test the null hypothesis that there is no correlation between these variables using the t-statistic:

At significance level α = 0.05, the null hypothesis should be rejected, since t = 10.64 > 2.1788. Thus, it can be argued that there is a statistically significant relationship between annual sales and store size.

When discussing inferences about the population slope, confidence intervals and hypothesis tests are interchangeable tools. However, calculating a confidence interval for the correlation coefficient turns out to be more difficult, since the form of the sampling distribution of the statistic r depends on the true correlation coefficient.
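A small Python check of the correlation t-test, using r = 0.951 and n = 14 from the example:

```python
import math
from scipy import stats

r, n = 0.951, 14
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)   # formula (12) with rho = 0
p_value = 2 * stats.t.sf(abs(t_stat), n - 2)
print(t_stat, p_value)   # ~10.6, p ~ 0 -> reject H0: rho = 0
```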

Estimation of mathematical expectation and prediction of individual values

This section discusses methods for estimating the mathematical expectation of the response Y and for predicting individual values of Y at given values of the variable X.

Constructing a confidence interval. In Example 2 (see the section The Least Squares Method above), the regression equation made it possible to predict the value of the variable Y given a value of X. In the problem of choosing a location for a retail outlet, the average annual sales in a store with an area of 4,000 sq. feet were equal to 7.644 million dollars. However, this is a point estimate of the population's mathematical expectation. Earlier, the confidence interval was proposed for estimating the mathematical expectation of a population; similarly, one can introduce a confidence interval for the mathematical expectation of the response at a given value of the variable X:

(13) Ŷi ± t_{n−2}·S_YX·√h_i, where h_i = 1/n + (X_i − X̄)² / SSX, SSX = Σ (X_i − X̄)², Ŷi = b0 + b1·X_i is the predicted value of the variable Y at X = X_i, S_YX is the mean square error, n is the sample size, X_i is the given value of the variable X, and µ_{Y|X=X_i} is the mathematical expectation of the variable Y at X = X_i.

Analysis of formula (13) shows that the width of the confidence interval depends on several factors. At a given significance level, an increase in the amplitude of fluctuations around the regression line, measured by the mean square error, widens the interval. On the other hand, as expected, an increase in sample size narrows it. In addition, the width of the interval changes with the values of X_i: if the value of Y is predicted for values of X close to the mean X̄, the confidence interval is narrower than when predicting the response for values far from the mean.

Suppose that, when choosing a location for a store, we want to construct a 95% confidence interval for the average annual sales of all stores with an area of 4,000 sq. feet:

Therefore, the average annual sales volume of all stores with an area of 4,000 sq. feet lies, with 95% probability, between 6.971 and 8.317 million dollars.
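Formula (13) can be sketched in Python as follows. The helper below is ours; b0, b1, S_YX, and the x-values must come from a fitted regression. The inputs here are illustrative placeholders (the note's full data set is not reproduced in this text), except that b0 = 0.964 follows from Ŷ = 7.644 at x0 = 4 with b1 = 1.670:

```python
import math
from scipy import stats

def mean_response_ci(xs, b0, b1, s_yx, x0, alpha=0.05):
    """Confidence interval for the mean response E[Y | X = x0], per formula (13)."""
    n = len(xs)
    x_bar = sum(xs) / n
    ssx = sum((x - x_bar) ** 2 for x in xs)   # SSX = sum of (x_i - x_bar)^2
    h = 1 / n + (x0 - x_bar) ** 2 / ssx       # distance-from-the-mean term h_i
    y_hat = b0 + b1 * x0                      # point prediction
    half = stats.t.ppf(1 - alpha / 2, n - 2) * s_yx * math.sqrt(h)
    return y_hat - half, y_hat + half

# Placeholder store areas (thousands of sq. feet) and placeholder S_YX:
xs = [1.7, 3.6, 2.8, 5.6, 1.3, 2.2, 1.3, 1.1, 3.2, 1.5, 5.2, 4.6, 5.8, 3.0]
print(mean_response_ci(xs, b0=0.964, b1=1.670, s_yx=0.97, x0=4.0))
```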

Computing the confidence interval for a predicted value. In addition to the confidence interval for the mathematical expectation of the response at a given value of X, it is often necessary to know the confidence interval for a predicted value. Although the formula for this interval is very similar to formula (13), it contains a predicted value rather than an estimate of a parameter. The interval for the predicted response Y at X = X_i, for a specific value X_i, is determined by the formula: Ŷi ± t_{n−2}·S_YX·√(1 + h_i).

Suppose that, when choosing a location for a retail outlet, we want to construct a 95% confidence interval for the predicted annual sales of a store with an area of 4,000 sq. feet:

Therefore, the predicted annual sales volume of a 4,000 sq. feet store lies, with 95% probability, between 5.433 and 9.854 million dollars. As you can see, the confidence interval for a predicted response value is much wider than the confidence interval for its mathematical expectation, because the variability in predicting individual values is much greater than in estimating the mathematical expectation.
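The corresponding sketch for the prediction interval differs from the previous one only by the extra 1 under the square root (individual values vary more than the mean); the same placeholder inputs apply:

```python
import math
from scipy import stats

def prediction_interval(xs, b0, b1, s_yx, x0, alpha=0.05):
    """Interval for an individual response Y at X = x0: note the 1 + h term."""
    n = len(xs)
    x_bar = sum(xs) / n
    ssx = sum((x - x_bar) ** 2 for x in xs)
    h = 1 / n + (x0 - x_bar) ** 2 / ssx
    y_hat = b0 + b1 * x0
    half = stats.t.ppf(1 - alpha / 2, n - 2) * s_yx * math.sqrt(1 + h)
    return y_hat - half, y_hat + half

xs = [1.7, 3.6, 2.8, 5.6, 1.3, 2.2, 1.3, 1.1, 3.2, 1.5, 5.2, 4.6, 5.8, 3.0]
# Noticeably wider than the mean-response interval computed above:
print(prediction_interval(xs, b0=0.964, b1=1.670, s_yx=0.97, x0=4.0))
```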

Pitfalls and ethical issues associated with the use of regression

Difficulties associated with regression analysis:

  • Ignoring the conditions of applicability of the method of least squares.
  • An erroneous estimate of the conditions for applicability of the method of least squares.
  • Wrong choice of alternative methods in violation of the conditions of applicability of the least squares method.
  • Application of regression analysis without in-depth knowledge of the subject of study.
  • Extrapolation of the regression beyond the range of the explanatory variable.
  • Confusion between statistical and causal relationships.

The widespread use of spreadsheets and statistical software has eliminated the computational problems that once prevented the use of regression analysis. However, this has meant that regression analysis is now used by people who lack sufficient qualifications and knowledge. How can they know about alternative methods if many of them have no idea at all of the applicability conditions of the least squares method, let alone how to check whether those conditions hold?

The researcher should not get carried away with number crunching, that is, calculating the intercept, the slope, and the coefficient of determination. Deeper knowledge is needed. Let us illustrate this with a classic textbook example. Anscombe showed that all four data sets shown in Fig. 23 have the same regression parameters (Fig. 24).

Fig. 23. Four artificial data sets

Fig. 24. Regression analysis of the four artificial data sets, performed with the Analysis ToolPak

So, from the point of view of regression analysis, all these data sets are completely identical. If the analysis ended there, we would lose a lot of useful information. This is evidenced by the scatter plots (Fig. 25) and residual plots (Fig. 26) constructed for these data sets.

Fig. 25. Scatter plots for the four data sets

Scatter plots and residual plots show that these data differ from one another. The only set distributed along a straight line is set A; the plot of residuals calculated from set A shows no pattern. The same cannot be said of sets B, C, and D. The scatter plot for set B shows a pronounced quadratic pattern, a conclusion confirmed by its parabolic residual plot. The scatter plot and residual plot for set C show an outlier; in this situation the outlier should be excluded from the data set and the analysis repeated. The technique for detecting and eliminating outliers from observations is called influence analysis; after the outlier is removed, the re-estimated model may look completely different. The scatter plot for set D illustrates an unusual situation in which the empirical model depends heavily on a single response (X8 = 19, Y8 = 12.5); such regression models must be calculated especially carefully. Scatter and residual plots are thus an essential tool of regression analysis and should be an integral part of it; without them, regression analysis is not credible.

Fig. 26. Residual plots for the four data sets
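Since Anscombe's exact data values are not reproduced in this text, here is a self-contained Python sketch with synthetic data that shows the same symptom described for set B: a straight line fitted to curved data leaves a clearly parabolic residual pattern.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 30)
y = 1.0 + 0.5 * x + 0.3 * (x - 5) ** 2       # truly quadratic relationship
b, a = np.polyfit(x, y, 1)                   # force a linear fit anyway
residuals = y - (a + b * x)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 3))
ax1.scatter(x, y)
ax1.plot(x, a + b * x, "r")                  # misleadingly plausible line
ax1.set_title("Linear fit to curved data")
ax2.scatter(x, residuals)
ax2.axhline(0, color="gray")
ax2.set_title("Residuals: parabolic pattern")
plt.show()
```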

How to avoid pitfalls in regression analysis:

  • Always begin the analysis of a possible relationship between the variables X and Y with a scatter plot.
  • Before interpreting the results of a regression analysis, check the conditions for its applicability.
  • Plot the residuals versus the independent variable. This makes it possible to determine how well the empirical model fits the observations and to detect violations of the constancy of variance.
  • Use histograms, stem and leaf plots, box plots, and normal distribution plots to test the assumption of a normal distribution of errors.
  • If the applicability conditions of the least squares method are not met, use alternative methods (for example, quadratic or multiple regression models).
  • If the applicability conditions of the least squares method are met, it is necessary to test the hypothesis about the statistical significance of the regression coefficients and construct confidence intervals containing the mathematical expectation and the predicted response value.
  • Avoid predicting values ​​of the dependent variable outside the range of the independent variable.
  • Keep in mind that statistical dependencies are not always causal. Remember that correlation between variables does not mean that there is a causal relationship between them.

Summary. As shown in the flowchart (Fig. 27), this note describes the simple linear regression model, the conditions for its applicability, and ways to test those conditions. The t-test for the statistical significance of the regression slope was considered, and a regression model was used to predict values of the dependent variable. The running example, choosing a location for a retail outlet, examined the dependence of annual sales on store area; this information allows a better choice of store location and a forecast of its annual sales. The following notes continue the discussion of regression analysis and take up multiple regression models.

Fig. 27. Flowchart of the note

Based on materials from: Levin et al. Statistics for Managers. Moscow: Williams, 2004, pp. 792-872.
