Negative dependence in correlation analysis. Coursework: Correlation Analysis

Any law of nature or social development can be represented by a description of a set of relationships. If these dependencies are stochastic and the analysis is carried out on a sample from the general population, this area of research belongs to the statistical study of dependencies, which includes correlation, regression, variance, and covariance analysis, as well as the analysis of contingency tables. Such a study centers on two questions:

    Is there a relationship between the studied variables?

    How to measure the closeness of connections?

The general scheme of the relationship between parameters in a statistical study is shown in Fig. 1.

In the figure, S is a model of the real object under study. Explanatory (independent, factorial) variables describe the conditions under which the object functions. Random factors are factors whose influence is difficult to take into account or is neglected for the time being. The resulting (dependent, explained) variables characterize the result of the object's functioning.

The choice of the method of analysis of the relationship is carried out taking into account the nature of the analyzed variables.

Correlation analysis is a method of processing statistical data that consists in studying the relationship between variables.

The goal of correlation analysis is to provide some information about one variable with the help of another variable. When this goal can be achieved, the variables are said to be correlated. Correlation reflects only the linear dependence of quantities, not their functional connectedness. For example, if we calculate the correlation coefficient between the values A = sin(x) and B = cos(x), it will be close to zero: the correlation detects no linear relationship between the quantities, even though B is functionally tied to A.
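A minimal numerical check of this point (a sketch; the sampling grid is arbitrary, and any uniform grid over a whole period behaves the same way):

```python
import numpy as np

# Pearson correlation between sin(x) and cos(x) over one full period.
# B is functionally determined by A, yet the linear correlation is ~0.
x = np.linspace(0, 2 * np.pi, 1000)
a, b = np.sin(x), np.cos(x)
r = np.corrcoef(a, b)[0, 1]
print(f"r = {r:.4f}")  # close to 0: no *linear* relationship detected
```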

When studying correlation, graphical and analytical approaches are used.

Graphical analysis begins with the construction of a correlation field. The correlation field (or scatterplot) is a graphical representation of the relationship between the measurement results of two features. To build it, the initial data are plotted on a graph, displaying each pair of values (xi, yi) as a point with coordinates xi and yi in a rectangular coordinate system.
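As an illustration, a correlation field can be drawn roughly as follows (the "height" and "weight" numbers here are invented for the example):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical paired measurements (xi, yi)
rng = np.random.default_rng(0)
x = rng.normal(170, 10, 100)                   # e.g., height
y = 0.9 * (x - 170) + rng.normal(70, 5, 100)   # e.g., weight, loosely tied to height

plt.scatter(x, y, s=15)
plt.xlabel("feature X")
plt.ylabel("feature Y")
plt.title("Correlation field (scatterplot)")
plt.show()
```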

Visual analysis of the correlation field allows us to make an assumption about the form and direction of the relationship between the two studied indicators. According to the form of the relationship, correlation dependences are usually divided into linear (see Fig. 1) and non-linear (see Fig. 2). With a linear dependence, the envelope of the correlation field is close to an ellipse. The linear relationship of two random variables is that when one random variable increases, the other random variable tends to increase (or decrease) according to a linear law.

The direction of the relationship is positive if an increase in the value of one attribute leads to an increase in the value of the second (see Fig. 3) and negative if an increase in the value of one attribute leads to a decrease in the value of the second (see Fig. 4).

Dependencies that have only positive or only negative directions are called monotonic.

CORRELATION ANALYSIS is a set of methods for assessing the relationship between random phenomena and events based on the mathematical theory of correlation. It uses the simplest characteristics, requiring a minimum of calculations. The term "correlation" is often identified with the concepts of "relationship" and "interdependence", but they are not equivalent. Correlation is only one of the types of relationship between features, one that is manifested on average and is linear. If there is an unambiguous relationship between two quantities, the relationship is called functional: one of the quantities (the cause) uniquely determines the value of the other (the consequence). Functional dependence is a particular expression of random (probabilistic, stochastic) dependence, in which the connection appears not for every pair of values of the two quantities but only on average.

Correlation analysis is used in the study of two or more random variables in order to identify the two most important quantitative characteristics: the mathematical equation of the relationship between these quantities and an assessment of the closeness of the relationship between them. The initial data for determining these characteristics are synchronous observation results (measurements, experiments), i.e., statistical data on the features whose relationship is being studied, obtained simultaneously from experience. The initial data can be given in the form of tables with records of the observation results or their equivalent representations on magnetic tape, punched tape, or punched cards.

Correlation analysis has found wide application in medicine and biology for determining the closeness and the equations of the relationship between various features, for example, between the results of analyses of clinical signs or of special examinations performed on healthy or sick people (see Correlation of functions of an organism). The results of correlation analysis are used to make objective forecasts of diseases and to assess the patient's condition and the course of the disease (see Forecasting). A priori, from the results of theoretical biological and medical studies alone, it is difficult or impossible to predict how the studied traits are related. To answer this question, an observation or a special experiment is carried out.

Two-dimensional correlation analysis is used in the processing of experimental data on the manifestation of any two signs.

CORRELATION TABLE. Note: the table shows the intervals of features X and Y, as well as the frequencies of their joint occurrence (in the body of the table), calculated from the results of morphometric analysis of the microvasculature of the bulboconjunctival region, where Y is the diameter of a venule and X is the diameter of an arteriole (in micrometers).

Each result of the experiment is a random variable, and objective patterns appear only in the entire set of measurement results. Therefore, conclusions are drawn from the results of processing the entire set of experimental data, not from individual values, which are random. To reduce the influence of random events, the initial data are combined into groups, which is achieved by compiling a correlation table (see table). Such a table contains the intervals (or their midpoints) of the values of the two features, Y and X, as well as the frequencies of occurrence of the X and Y values in the corresponding intervals. These frequencies, calculated from the results of the experiment, are a practical estimate of the probability of the joint occurrence of X and Y values in a particular interval.

The construction of a correlation table is the first step in processing the initial information. The construction of correlation tables and their further complete processing are carried out quickly on universal or specialized computers (see Electronic computer). From the grouped data of the correlation table, the empirical characteristics of the equation and of the tightness of the connection are calculated. To determine the equation of the relationship between Y and X, the average values of feature Y are calculated in each interval of feature X; this yields for each i-th interval a value Yxi, and connecting these values across all i intervals gives an empirical regression line characterizing the form of the relationship of feature Y to feature X on average: the graph of the function Yx = f(x).

If the relationship between features Y and X were unambiguous, the relationship equation would be sufficient for solving practical and theoretical problems, since it could always be used to determine the value of Y given the value of X. In practice, the relationship between Y and X is not unambiguous; it is random, and one value of X corresponds to a number of values of Y. Therefore, another characteristic is needed, one that measures the strength, or closeness, of the relationship between Y and X. Such characteristics are the dispersion (correlation) ratio ηyx and the correlation coefficient ryx. The former characterizes the tightness of the connection between Y and X for an arbitrary function f, while ryx is used only when f is a linear function.

The values of ηyx and ryx are also easily determined from the correlation table. The calculation is usually carried out in the following order: the average values of both features, X̄ and Ȳ, are determined, then their standard deviations σx and σy, and then ηyx according to the formula:

ηyx = sqrt( Σi mi(x)·(Yxi - Ȳ)² / (n·σy²) ),

and ryx according to the formula:

ryx = ( Σi Σj mij·Xcpi·Ycpj - n·X̄·Ȳ ) / (n·σx·σy),

where n is the total number of experiments, Xcpi is the average value of X in the i-th interval, Ycpj is the average value of Y in the j-th interval, Yxi is the average value of Y within the i-th interval of X, k and l are the numbers of intervals of features X and Y respectively (the sums run over i = 1, ..., k and j = 1, ..., l), mi(x) is the frequency (number) of Xcpi values, and mij is the joint frequency of Xcpi and Ycpj. Quantitative characteristics of the accuracy of determining ηyx and ryx are their standard deviations:

ση = (1 - ηyx²) / sqrt(n), σr = (1 - ryx²) / sqrt(n).
The values of the coefficient ηyx lie between zero and one (0 ≤ ηyx ≤ 1). If ηyx = 0 (fig., a), features Y and X are unrelated, i.e., the regression Yx = f(x) reveals no connection between Y and X; at ηyx = 1 there is an unambiguous relationship between Y and X (fig., b, zh). For ηyx < 1, feature Y is only partially determined by feature X, and additional features must be studied to increase the reliability of determining Y (fig., g, d, e, i).

The value of the coefficient r lies between -1 and +1 (-1 ≤ ryx ≤ +1).
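A sketch of these grouped-data calculations (the interval midpoints and the frequency matrix below are invented; only the layout of a correlation table is assumed):

```python
import numpy as np

# Correlation table: a k x l matrix of joint frequencies m[i, j]
# over interval midpoints xc[i] (feature X) and yc[j] (feature Y).
xc = np.array([10.0, 20.0, 30.0, 40.0])   # midpoints of X intervals
yc = np.array([15.0, 25.0, 35.0])         # midpoints of Y intervals
m = np.array([[8, 2, 0],
              [4, 9, 3],
              [1, 6, 8],
              [0, 3, 6]])                 # joint frequencies m_ij
n = m.sum()
mx, my = m.sum(axis=1), m.sum(axis=0)     # marginal frequencies

x_mean = (mx * xc).sum() / n
y_mean = (my * yc).sum() / n
sx = np.sqrt((mx * (xc - x_mean) ** 2).sum() / n)
sy = np.sqrt((my * (yc - y_mean) ** 2).sum() / n)

# conditional means of Y within each X interval -> empirical regression line
y_cond = (m * yc).sum(axis=1) / mx
# dispersion (correlation) ratio eta_yx: between-group spread over total spread
eta_yx = np.sqrt((mx * (y_cond - y_mean) ** 2).sum() / (n * sy ** 2))
# correlation coefficient r_yx from the grouped data
r_yx = ((m * np.outer(xc, yc)).sum() / n - x_mean * y_mean) / (sx * sy)
print(f"eta_yx = {eta_yx:.3f}, r_yx = {r_yx:.3f}")
```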

Multivariate correlation analysis is the determination of the equation and the tightness of the connection in cases where the number of studied features is more than two. Thus, if Y is a complex feature and its outcome depends on the appearance of a set of features X1, X2, ..., Xn, then the following should be determined from the experimental data: a) the equation of the relationship between Y and the set X1, X2, ..., Xn, i.e., Yx1x2...xn = F(x1, x2, ..., xn); b) the tightness of the connection between Y and the set X1, X2, ..., Xn.

Preliminary processing of the observation results in multivariate correlation analysis consists in determining, for each pair of features, the dispersion ratios ηyxi (i = 1, 2, ..., n) and ηxixj (i ≠ j), the correlation coefficients ryxi and rxixj, and the paired regressions Yxi = fi(xi). These data are then used to determine the multiple regression equation Yx1x2...xn = F(x1, x2, ..., xn), the multiple dispersion ratio ηyx1x2...xn, and the multiple correlation coefficient Ryx1x2...xn. The multiple regression equation makes it possible to determine the value of feature Y from the set of values X1, X2, ..., Xn; i.e., given this equation, the values of Y can be predicted from specific values of the set (for example, from the results of analysis by features X1, X2, ..., Xn). The value ηyx1x2...xn is used as a characteristic of the tightness of the connection between Y and the set of features X1, X2, ..., Xn for an arbitrary function F, and Ryx1x2...xn for the case when F is linear. The coefficients ηyx1x2...xn and Ryx1x2...xn take values between zero and one. Including additional features in a multivariate correlation analysis brings the values of ηyx1x2...xn and Ryx1x2...xn closer to unity and thus improves the accuracy of predicting feature Y from the multiple regression equation.

As an example, consider the results of a paired correlation analysis, as well as the multiple regression equation and the multiple correlation coefficient, for the features: Y, stable pseudoparesis; X1, lateralization of the motor defect in the limbs on the right; X2, the same in the limbs on the left; X3, vegetative crises. The dispersion ratios and pair correlation coefficients for them are, respectively, ηyx1 = 0.429, ηyx2 = 0.616, ηyx3 = -0.334, and ryx1 = 0.320, ryx2 = 0.586, ryx3 = -0.325. The multiple linear regression equation obtained from these data is Yx1x2x3 = 0.638 x1 + 0.839 x2 - 0.195 x3, and the multiple correlation coefficient is Ryx1x2x3 = 0.721. The example shows that, from the X1, X2, and X3 data, stable pseudoparesis can be predicted with accuracy sufficient for practice.
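The raw data behind these figures are not given in the article, so here is only a hedged sketch of how a multiple regression equation and the multiple correlation coefficient R can be obtained, using synthetic data:

```python
import numpy as np

# Synthetic data: three predictors and an outcome loosely tied to them.
rng = np.random.default_rng(1)
n = 100
X = rng.normal(size=(n, 3))   # features X1, X2, X3
y = 0.6 * X[:, 0] + 0.8 * X[:, 1] - 0.2 * X[:, 2] + rng.normal(scale=0.5, size=n)

# least-squares fit of the multiple linear regression (with intercept)
A = np.column_stack([X, np.ones(n)])
b, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ b

# multiple correlation coefficient R = corr(observed y, predicted y)
R = np.corrcoef(y, y_hat)[0, 1]
print("coefficients:", np.round(b[:3], 3), " R =", round(R, 3))
```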

The methods of correlation analysis also make it possible to obtain dynamic characteristics. In this case, the studied features (e.g., ECG, EEG, etc.) are considered as random functions Y(t) and X(t). From the results of observing these functions, the same two most important characteristics are determined: a) an estimate of the connection operator (mathematical equation) between Y(t) and X(t); b) an assessment of the closeness of the connection between them. The dispersion and correlation functions of the random functions Y(t) and X(t) are taken as characteristics of the tightness of the connection. These functions generalize the dispersion ratios and correlation coefficients. Thus, the normalized mutual dispersion function ηyx(t) is, for each fixed value of t, the dispersion ratio between the values of the features Y(t) and X(t). Similarly, the normalized cross-correlation function Ryx(t) is, for each fixed value of t, the correlation coefficient between the features Y(t) and X(t). The characteristic of a linear relationship for the same studied quantity at different points in time is called autocorrelation.
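A sketch of a normalized cross-correlation at a given lag (synthetic signals; applied to one and the same signal, the same computation gives the autocorrelation):

```python
import numpy as np

def cross_corr(x, y, lag=0):
    """Normalized cross-correlation between x[t] and y[t + lag]."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    if lag > 0:
        x, y = x[:-lag], y[lag:]
    elif lag < 0:
        x, y = x[-lag:], y[:len(y) + lag]
    x = x - x.mean()
    y = y - y.mean()
    return (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

t = np.arange(500)
x = np.sin(0.1 * t)
y = np.sin(0.1 * (t - 20))       # the same signal delayed by 20 samples
print(cross_corr(x, y, lag=20))  # ~1.0 at the matching lag
```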

Correlation analysis is also one of the methods for solving the identification problem, which is widely used in obtaining mathematical models and in the automation of biomedical research and treatment.


V. N. Reibman, N. S. Reibman.

The use of statistical methods in the processing of psychological research materials provides a great opportunity to extract useful information from experimental data. One of the most common statistical methods is correlation analysis.

The term "correlation" was first used by the French paleontologist J. Cuvier, who deduced the "law of correlation of parts and organs of animals" (this law allows you to restore the appearance of the whole animal from the found parts of the body). This term was introduced into statistics by the English biologist and statistician F. Galton (not just “connection” - relation, and "as if a connection" - corelation).

Correlation analysis is the testing of hypotheses about relationships between variables using correlation coefficients: two-dimensional descriptive statistics that provide a quantitative measure of the relationship (joint variability) of two variables. Thus, it is a set of methods for detecting correlations between random variables or features.

Correlation analysis for two random variables includes:

  • building a correlation field and compiling a correlation table;
  • calculation of sample correlation coefficients and correlation ratios;
  • testing the statistical hypothesis of the significance of the relationship.

The main purpose of correlation analysis is to identify the relationship between two or more variables under study, which is considered as a joint coordinated change in the two characteristics under study. This variability has three main characteristics: shape, direction and strength.

The form of the correlation can be linear or non-linear. A linear form is more convenient for identifying and interpreting a correlation. For a linear correlation, two main directions can be distinguished: positive ("direct connection") and negative ("inverse connection").

The strength of the connection directly indicates how pronounced the joint variability of the studied variables is. In psychology, the functional interconnection of phenomena can be empirically revealed only as a probabilistic connection of the corresponding features. A visual representation of the nature of the probabilistic relationship is given by a scatter diagram - a graph whose axes correspond to the values ​​of two variables, and each subject is a point.

Correlation coefficients are used as a numerical characteristic of a probabilistic relationship; their values vary in the range from -1 to +1. After the calculations, the researcher, as a rule, selects only the strongest correlations, which are then interpreted (Table 1).

The criterion for selecting “sufficiently strong” correlations can be either the absolute value of the correlation coefficient itself (from 0.7 to 1) or the relative value of this coefficient, determined by the level of statistical significance (from 0.01 to 0.1), depending on sample size. In small samples, for further interpretation, it is more correct to select strong correlations based on the level of statistical significance. For studies that are conducted on large samples, it is better to use the absolute values ​​of the correlation coefficients.

Thus, the task of correlation analysis is reduced to establishing the direction (positive or negative) and the form (linear, non-linear) of the relationship between varying features, measuring its tightness, and, finally, checking the significance level of the obtained correlation coefficients.

Currently, many different correlation coefficients have been developed. The most used are r-Pearson, r-Spearman, and τ-Kendall. Modern computer statistical programs offer exactly these three coefficients in the "Correlations" menu, while methods for comparing groups are offered for solving other research problems.

The choice of method for calculating the correlation coefficient depends on the type of scale to which the variables belong (Table 2).

For variables on an interval scale and with a normal distribution, the Pearson correlation coefficient (product-moment correlation) is used. If at least one of the two variables has an ordinal scale or is not normally distributed, Spearman's rank correlation or Kendall's τ is used. If one of the two variables is dichotomous, point-biserial correlation can be used (this possibility is not available in the statistical computer program SPSS; the calculation of rank correlation can be used instead). If both variables are dichotomous, a four-field correlation is used (SPSS calculates this type of correlation on the basis of distance and similarity measures). The calculation of the correlation coefficient between two non-dichotomous variables is possible only if the relationship between them is linear (unidirectional). If the relationship is, for example, U-shaped (ambiguous), the correlation coefficient is not suitable as a measure of the strength of the connection: its value tends to zero.
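As an illustration of the three coefficients named above, computed with scipy's standard functions on synthetic data:

```python
import numpy as np
from scipy import stats

# Synthetic paired sample with a moderate linear relationship.
rng = np.random.default_rng(2)
x = rng.normal(size=60)
y = 0.5 * x + rng.normal(scale=0.8, size=60)

r, p_r = stats.pearsonr(x, y)        # both variables metric and roughly normal
rho, p_rho = stats.spearmanr(x, y)   # at least one ordinal / non-normal
tau, p_tau = stats.kendalltau(x, y)  # pairwise-comparison alternative
print(f"Pearson r={r:.2f}, Spearman rho={rho:.2f}, Kendall tau={tau:.2f}")
```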

Thus, the conditions for applying the correlation coefficients will be as follows:

  • variables measured in a quantitative (rank, metric) scale on the same sample of objects;
  • the relationship between variables is monotonic.

The main statistical hypothesis tested by correlation analysis is non-directional and asserts that the correlation in the general population equals zero: H0: rxy = 0. If it is rejected, the alternative hypothesis H1: rxy ≠ 0 is accepted, about the presence of a positive or negative correlation depending on the sign of the calculated correlation coefficient.

Meaningful conclusions are drawn from the acceptance or rejection of the hypotheses. If, according to the results of statistical testing, H0: rxy = 0 is not rejected at level α, the meaningful conclusion is that no relationship between X and Y has been found. If H0: rxy = 0 is rejected at level α, a positive (negative) relationship has been found between X and Y. However, the interpretation of the revealed correlations should be approached with caution. From a scientific point of view, simply establishing a relationship between two variables does not imply the existence of a causal relationship. Moreover, the presence of a correlation does not establish a sequence relationship between cause and effect. It simply indicates that two variables are more related to each other than would be expected by coincidence. Nevertheless, with caution, the use of correlation methods in the study of causal relationships is fully justified. Categorical phrases such as "variable X is the reason for the increase in indicator Y" should be avoided. Such statements should be formulated as assumptions, which should be strictly substantiated theoretically.

A detailed description of the mathematical procedure for each correlation coefficient is given in textbooks on mathematical statistics and in other sources. We will restrict ourselves to describing the possibility of using these coefficients depending on the type of measurement scale.

Correlation of Metric Variables

To study the relationship of two metric variables measured on the same sample, the Pearson correlation coefficient r is used. The coefficient itself characterizes only the presence of a linear relationship between the features, usually denoted by the symbols X and Y. The linear correlation coefficient is a parametric method, and its correct application is possible only if the measurement results are presented on an interval scale and the distribution of values in the analyzed variables deviates from normal only to a small extent. There are many situations in which its use is appropriate: for example, establishing a connection between a student's intellect and academic performance, between mood and success in getting out of a problem situation, or between income level and temperament.

The Pearson coefficient is widely used in psychology and pedagogy. For example, in the works of I. Ya. Kaplunovich and P. D. Rabinovich, M. P. Nuzhdina, the calculation of the Pearson linear correlation coefficient was used to confirm the hypotheses put forward.

When processing data "manually", it is necessary to calculate the correlation coefficient and then determine the p-level of significance (to simplify verification, tables of critical values of rxy compiled for this criterion are used). The value of Pearson's linear correlation coefficient cannot exceed +1 or be less than -1. These two numbers, +1 and -1, are the limits of the correlation coefficient. When the calculation yields a value greater than +1 or less than -1, a calculation error has occurred.

When calculating on a computer, the statistical program (SPSS, Statistica) accompanies the calculated correlation coefficient with a more accurate p-level value.

For a statistical decision on accepting or rejecting H0, α = 0.05 is usually set, and for a large number of observations (100 or more) α = 0.01. If p ≤ α, H0 is rejected, and a meaningful conclusion is made that a statistically significant relationship (positive or negative, depending on the sign of the correlation) has been found between the studied variables. When p > α, H0 is not rejected, and the meaningful conclusion is limited to the statement that no statistically significant relationship was found.
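A short sketch of this decision rule (synthetic data; α = 0.05 as above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(size=40)
y = 0.4 * x + rng.normal(scale=1.0, size=40)

alpha = 0.05
r, p = stats.pearsonr(x, y)
if p <= alpha:
    print(f"H0 rejected: statistically significant correlation, r = {r:.2f}")
else:
    print("H0 not rejected: no statistically significant relationship found")
```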

If no connection is found, but there is reason to believe that the connection actually exists, you should check the possible reasons for the unreliability of the connection.

Non-linearity of the relationship. To check for it, analyze the two-dimensional scatterplot. If the relationship is non-linear but monotonic, switch to rank correlations. If the relationship is not monotonic, divide the sample into parts in which the relationship is monotonic and calculate the correlations separately for each part, or divide the sample into contrasting groups and then compare them by the level of expression of the trait.

The presence of outliers and pronounced asymmetry in the distribution of one or both features. To check for this, examine the histograms of the frequency distributions of both features. If there are outliers or asymmetry, exclude the outliers or switch to rank correlations.

Sample heterogeneity (analyze the two-dimensional scatterplot). Try to divide the sample into parts in which the relationship may have different directions.

If the relationship is statistically significant, then before making a meaningful conclusion, it is necessary to exclude the possibility of a false correlation:

  • connection due to outliers. If there are outliers, go to rank correlations or exclude outliers;
  • the relationship is due to the influence of the third variable. If there is a similar phenomenon, it is necessary to calculate the correlation not only for the entire sample, but also for each group separately. If the "third" variable is metric, calculate the partial correlation.

The partial correlation coefficient rxy·z is calculated when it is necessary to test the assumption that the relationship between two variables X and Y does not depend on the influence of a third variable Z. Very often, two variables correlate with each other only because both change in concert under the influence of a third variable. In other words, there is in fact no connection between the corresponding properties, but it appears as a statistical relationship under the influence of a common cause. For example, age may be a common cause of the variability of two variables when studying the relationship of various psychological characteristics in a group of mixed ages. When interpreting a partial correlation in terms of causality, one should be careful: if Z correlates with X and with Y, and the partial correlation rxy·z is close to zero, it does not necessarily follow that Z is the common cause of X and Y.
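A minimal sketch of the first-order partial correlation, computed directly from pairwise Pearson coefficients via the standard formula (not an SPSS procedure; the data are synthetic, with z playing the role of the common cause):

```python
import numpy as np

def partial_corr(x, y, z):
    """First-order partial correlation r_xy.z from pairwise Pearson r."""
    rxy = np.corrcoef(x, y)[0, 1]
    rxz = np.corrcoef(x, z)[0, 1]
    ryz = np.corrcoef(y, z)[0, 1]
    return (rxy - rxz * ryz) / np.sqrt((1 - rxz**2) * (1 - ryz**2))

# z (e.g., age) drives both x and y, inducing a spurious x-y correlation.
rng = np.random.default_rng(4)
z = rng.normal(size=200)
x = z + rng.normal(scale=1.0, size=200)
y = z + rng.normal(scale=1.0, size=200)
print(np.corrcoef(x, y)[0, 1])  # noticeably positive
print(partial_corr(x, y, z))    # near zero once z is controlled for
```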

Correlation of rank variables

If the Pearson coefficient r is not applicable to the quantitative data, then, after preliminary ranking, Spearman's r or Kendall's τ can be applied to test the hypothesis about the relationship of the two variables. For example, in I. A. Lavochkin's study of the psychophysical characteristics of musically gifted adolescents, the Spearman criterion was used.

For the correct calculation of both coefficients (Spearman and Kendall), the results of measurements must be presented in a scale of ranks or intervals. There are no fundamental differences between these criteria, but it is generally accepted that the Kendall coefficient is more “meaningful”, since it analyzes the relationships between variables more fully and in detail, sorting through all possible correspondences between pairs of values. Spearman's coefficient more accurately takes into account the quantitative degree of association between variables.

Spearman's rank correlation coefficient is a non-parametric analog of the classical Pearson correlation coefficient, but its calculation uses not the distribution parameters of the compared variables (arithmetic mean and variance) but ranks. For example, it may be necessary to determine the relationship between the rank assessments of personality traits included in a person's idea of his "I am real" and "I am ideal".

The Spearman coefficient is widely used in psychological research. For example, in the work of Yu. V. Bushov and N. N. Nesmelova it was used to study the dependence of the accuracy of estimating and reproducing the duration of sound signals on the individual characteristics of a person.

Since this coefficient is an analog of Pearson's r, using it to test hypotheses is similar to using Pearson's r: the tested statistical hypothesis, the procedure for making a statistical decision, and the formulation of a meaningful conclusion are the same. In computer programs (SPSS, Statistica), the significance levels for the same coefficients r-Pearson and r-Spearman always match.

The advantage of Spearman's r over Pearson's r is its greater sensitivity to the relationship. We use it in the following cases:

  • the presence of a significant deviation in the distribution of at least one variable from the normal form (skewness, outliers);
  • the appearance of a curvilinear (monotonic) connection.

The restrictions on applying Spearman's coefficient are as follows:

  • for each variable at least 5 observations;
  • the coefficient with a large number of identical ranks in one or both variables gives a coarsened value.

The rank correlation coefficient τ-Kendall is an independent original method based on counting the pairs of values in two samples that show the same or different trends (an increase or decrease in values). This ratio is also called the concordance coefficient. The main idea of the method is that the direction of the connection can be judged by comparing the subjects in pairs: if for a pair of subjects the change in X coincides in direction with the change in Y, this indicates a positive relationship; if it does not coincide, a negative relationship. The method is used, for example, in studying the personal qualities that are of decisive importance for family well-being. In this method, one variable is represented as a monotonic sequence (for example, the husbands' data) in ascending order of magnitude; the other variable (for example, the wives' data) is assigned the corresponding ranking places. The number of inversions (violations of monotonicity compared with the first row) is used in the formula for the correlation coefficient.

When computing τ-Kendall "manually", the data are first ordered by the variable X. Then, for each subject, one counts how many times his rank in Y is less than the rank of the subjects below him. The result is recorded in the "Matches" column. The sum of all values in the "Matches" column is P, the total number of matches, which is substituted into the formula for the Kendall coefficient. The formula itself is computationally simpler, but as the sample grows the volume of calculations, in contrast to Spearman's r, increases not proportionally but quadratically with the number of pairs: for N = 12 it is necessary to examine 66 pairs of subjects, and for N = 48 already 1128 pairs, i.e., the amount of calculation increases by more than 17 times. When calculating on a computer in a statistical program (SPSS, Statistica), the Kendall coefficient is calculated similarly to the coefficients r-Spearman and r-Pearson, and the calculated coefficient τ-Kendall is accompanied by a more accurate p-level value.

Applying the Kendall coefficient is preferred if there are outliers in the original data.
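A sketch of the pairwise-comparison idea on a tiny invented sample, with the manual count checked against scipy:

```python
from itertools import combinations
from scipy import stats

# Count pairs whose changes in X and Y agree (concordant, P) or
# disagree (discordant, Q); tau = (P - Q) / (n(n-1)/2).
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]

P = Q = 0
for i, j in combinations(range(len(x)), 2):
    s = (x[i] - x[j]) * (y[i] - y[j])
    if s > 0:
        P += 1
    elif s < 0:
        Q += 1

n_pairs = len(x) * (len(x) - 1) / 2
tau_manual = (P - Q) / n_pairs
tau_scipy, _ = stats.kendalltau(x, y)
print(tau_manual, tau_scipy)  # identical when there are no tied ranks
```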

A feature of rank correlation coefficients is that the maximum rank correlations (+1, -1) do not necessarily correspond to strictly direct or inversely proportional relationships between the original variables X and Y: a monotonic functional relationship between them is sufficient. Rank correlations reach their maximum absolute value if a larger value of one variable always corresponds to a larger value of the other (+1), or if a larger value of one variable always corresponds to a smaller value of the other and vice versa (-1).

The statistical hypothesis to be tested, the procedure for making a statistical decision, and the formulation of a meaningful conclusion are the same as in the case of Spearman's r or Pearson's r.

If a statistically significant relationship is not found but there is reason to believe that a relationship really exists, you should first switch from Spearman's r to Kendall's τ (or vice versa), and then check the possible reasons for the unreliability of the connection:

  • non-linearity of the relationship: examine the two-dimensional scatterplot. If the relationship is not monotonic, divide the sample into parts in which the relationship is monotonic, or divide the sample into contrasting groups and then compare them by the level of expression of the trait;
  • sample heterogeneity: examine the two-dimensional scatterplot and try to divide the sample into parts in which the relationship may have different directions.

If the connection is statistically significant, then before making a meaningful conclusion, it is necessary to exclude the possibility of a false correlation (by analogy with metric correlation coefficients).

Correlation of dichotomous variables

When comparing two variables measured on a dichotomous scale, the measure of correlation is the so-called coefficient φ, which is the correlation coefficient for dichotomous data.

The value of the coefficient φ lies between +1 and -1. It can be both positive and negative, characterizing the direction of the connection between two dichotomously measured features. However, the interpretation of φ may raise specific problems. The dichotomous data included in the scheme for calculating the coefficient φ do not look like a two-dimensional normal surface; therefore, it is incorrect to assume that the interpreted values rxy = 0.60 and φ = 0.60 mean the same thing. The coefficient φ can be calculated by the coding method, as well as by means of the so-called four-field table, or contingency table.

To apply the correlation coefficient φ, the following conditions must be met:

  • the traits being compared should be measured on a dichotomous scale;
  • the number of varying features in the compared variables X and Y should be the same.

This type of correlation is calculated in the SPSS computer program based on the definition of distance measures and similarity measures. Some statistical procedures, such as factor analysis, cluster analysis, multivariate scaling, are based on the application of these measures, and sometimes they themselves provide additional possibilities for calculating similarity measures.
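Outside SPSS, the coefficient can also be computed directly from a four-field table; a sketch with invented counts (this is equivalent to Pearson's r on 0/1-coded data):

```python
import numpy as np

def phi(table):
    """Phi coefficient from a 2x2 contingency table [[a, b], [c, d]]."""
    (a, b), (c, d) = np.asarray(table, float)
    return (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))

table = [[30, 10],
         [15, 45]]
print(round(phi(table), 3))  # ~0.49 for these hypothetical counts
```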

When one variable is measured on a dichotomous scale (variable X) and the other on an interval or ratio scale (variable Y), the biserial correlation coefficient is used, for example, when testing hypotheses about the effect of a child's gender on height and weight. This coefficient varies in the range from -1 to +1, but its sign does not matter for the interpretation of the results. For its use, the following conditions must be met:

  • the compared features should be measured on different scales: one, X, on a dichotomous scale; the other, Y, on an interval or ratio scale;
  • variable Y has a normal distribution law;
  • the number of varying features in the compared variables X and Y should be the same.

If variable X is measured on a dichotomous scale and variable Y on a rank scale, the rank-biserial correlation coefficient can be used; it is closely related to Kendall's τ and uses the concepts of coincidence and inversion in its definition. The interpretation of the results is the same.

Conducting correlation analysis with the SPSS and Statistica computer programs is a simple and convenient operation. To do this, after calling the Bivariate Correlations dialog box (Analyze > Correlate > Bivariate…), move the variables under study to the Variables field and select the method by which the correlation between the variables will be detected. The resulting output file contains a square table (Correlations) for each calculated criterion. Each cell of the table contains the value of the correlation coefficient itself (Correlation Coefficient), the statistical significance of the calculated coefficient (Sig.), and the number of subjects.

The heading and side columns of the resulting correlation table contain the names of the variables. The diagonal (upper left - lower right corner) of the table consists of units, since the correlation of any variable with itself is maximum. The table is symmetrical about this diagonal. If the checkbox "Mark significant correlations" is checked in the program, then statistically significant coefficients will be marked in the final correlation table: at the level of 0.05 and less - with one asterisk (*), and at the level of 0.01 - with two asterisks (**).

So, to summarize: the main purpose of correlation analysis is to identify the relationship between variables. The measure of connection is the correlation coefficients, the choice of which directly depends on the type of scale in which the variables are measured, the number of varying features in the compared variables, and the distribution of variables. The presence of a correlation between two variables does not mean that there is a causal relationship between them. Although correlation does not directly indicate causality, it can be a clue to the causes. On its basis, hypotheses can be formed. In some cases, the lack of correlation has a deeper effect on the hypothesis of causation. Zero correlation of two variables may indicate that there is no influence of one variable on the other.

COURSE WORK

Topic: Correlation analysis

Introduction

1. Correlation analysis

1.1 The concept of correlation

1.2 General classification of correlations

1.3 Correlation fields and the purpose of their construction

1.4 Stages of correlation analysis

1.5 Correlation coefficients

1.6 Normalized Bravais-Pearson correlation coefficient

1.7 Spearman's rank correlation coefficient

1.8 Basic properties of correlation coefficients

1.9 Checking the significance of correlation coefficients

1.10 Critical values ​​of the pair correlation coefficient

2. Planning a multivariate experiment

2.1 Condition of the problem

2.2 Determination of the center of the plan (main level) and the level of variation of factors

2.3 Building a planning matrix

2.4 Checking the homogeneity of the dispersion and the equal accuracy of measurements in different series

2.5 Coefficients of the regression equation

2.6 Reproducibility dispersion

2.7 Checking the significance of the coefficients of the regression equation

2.8 Checking the adequacy of the regression equation

Conclusion

Bibliography

INTRODUCTION

Experiment planning is a mathematical and statistical discipline that studies methods for the rational organization of experimental research, from the optimal choice of the studied factors and the determination of the actual plan of the experiment in accordance with its purpose to methods for analyzing the results. The foundations of experiment planning were laid by the work of the English statistician R. Fisher (1935), who showed that rational experiment planning gives no less significant a gain in the accuracy of estimates than optimal processing of the measurement results. The modern theory of experiment planning emerged in the 1960s. Its methods are closely related to the theory of approximation of functions and to mathematical programming. Optimal plans have been constructed and their properties investigated for a wide class of models.

Experiment planning is the choice of an experiment plan that meets the specified requirements, a set of actions aimed at developing an experimentation strategy (from obtaining a priori information to obtaining a workable mathematical model or determining optimal conditions). This is a purposeful control of the experiment, implemented in conditions of incomplete knowledge of the mechanism of the phenomenon under study.

In the process of measurements, subsequent data processing, as well as formalization of the results in the form of a mathematical model, errors occur and part of the information contained in the original data is lost. The use of experiment planning methods makes it possible to determine the error of the mathematical model and judge its adequacy. If the accuracy of the model is insufficient, then the use of experiment planning methods makes it possible to modernize the mathematical model with additional experiments without losing previous information and at minimal cost.

The purpose of experiment planning is to find such conditions and rules for conducting experiments under which it is possible to obtain reliable and trustworthy information about the object with the least expenditure of labor, and to present this information in a compact and convenient form with a quantitative assessment of accuracy.

Among the main planning methods used at different stages of the study, the following are used:

Planning a screening experiment, the main meaning of which is the selection of a group of significant factors from the totality of factors that are subject to further detailed study;

Designing an experiment for analysis of variance, i.e. drawing up plans for objects with qualitative factors;

Planning a regression experiment that allows you to obtain regression models (polynomial and others);

Planning an extreme experiment, in which the main task is the experimental optimization of the object of study;

Planning in the study of dynamic processes, etc.

The purpose of studying the discipline is to prepare students for production and technical activities in the specialty using the methods of planning theory and modern information technologies.

Objectives of the discipline: the study of modern methods of planning, organizing and optimizing scientific and industrial experiments, conducting experiments and processing the results.

1. CORRELATION ANALYSIS

1.1 The concept of correlation

The researcher is often interested in how two or more variables are related to each other in one or more of the studied samples. For example, can height affect a person's weight, or can pressure affect product quality?

This kind of relationship between variables is called correlation, or correlation dependence. A correlation is a consistent change in two features, reflecting the fact that the variability of one feature is in line with the variability of the other.

It is known, for example, that on average there is a positive relationship between people's height and weight: the greater the height, the greater the weight. However, there are exceptions, when relatively short people are overweight and, conversely, tall asthenics are light. The reason for such exceptions is that each biological, physiological, or psychological trait is determined by the influence of many factors: environmental, genetic, social, ecological, and so on.

Correlation relationships are probabilistic in nature and can be studied only on representative samples by the methods of mathematical statistics. The terms correlation and correlation dependence are often used interchangeably. Dependence implies influence; connection means any coordinated changes, which can be explained by hundreds of reasons. Correlations cannot be taken as evidence of a causal relationship; they only indicate that changes in one feature are, as a rule, accompanied by certain changes in another.

Correlation dependence refers to the changes that the values of one feature introduce into the probability of occurrence of different values of another feature.

The task of correlation analysis is reduced to establishing the direction (positive or negative) and the form (linear, non-linear) of the relationship between varying features, measuring its tightness, and, finally, checking the level of significance of the obtained correlation coefficients.

Correlations differ in form, direction, and degree (strength).

The form of the correlation can be rectilinear or curvilinear. For example, the relationship between the number of training sessions on a simulator and the number of correctly solved problems in the control session can be rectilinear. The relationship between the level of motivation and the effectiveness of performing a task, for example, can be curvilinear (Figure 1). As motivation increases, the effectiveness of the task first increases, then the optimal level of motivation is reached, corresponding to the maximum effectiveness; a further increase in motivation is accompanied by a decrease in effectiveness.

Figure 1 - The relationship between the effectiveness of problem solving and the strength of the motivational tendency

In direction, the correlation can be positive ("direct") and negative ("reverse"). With a positive straight-line correlation, higher values ​​of one attribute correspond to higher values ​​of another, and lower values ​​of one attribute correspond to low values ​​of another (Figure 2). With a negative correlation, the ratios are reversed (Figure 3). With a positive correlation, the correlation coefficient has a positive sign, with a negative correlation - a negative sign.

Figure 2 - Direct correlation

Figure 3 - Inverse correlation


Figure 4 - No correlation

The degree, strength or tightness of the correlation is determined by the value of the correlation coefficient. The strength of the connection does not depend on its direction and is determined by the absolute value of the correlation coefficient.

1.2 General classification of correlations

Depending on the correlation coefficient, the following correlations are distinguished:

Strong, or close, with correlation coefficient r > 0.70;

Medium (at 0.50 < r < 0.69);

Moderate (at 0.30 < r < 0.49);

Weak (at 0.20 < r < 0.29);

Very weak (at r < 0.19).

1.3 Correlation fields and the purpose of their construction

Correlation is studied on the basis of experimental data, which are the measured values (xi, yi) of two features. If there are few experimental data, the two-dimensional empirical distribution is represented as a double series of xi and yi values. In this case, the correlation between the features can be described in different ways. The correspondence between an argument and a function can be given by a table, a formula, a graph, etc.

Correlation analysis, like other statistical methods, is based on the use of probabilistic models that describe the behavior of the studied features in a certain general population, from which the experimental values xi and yi are obtained. When the correlation between quantitative features whose values can be accurately measured in units of metric scales (meters, seconds, kilograms, etc.) is investigated, the model of a two-dimensional normally distributed general population is very often adopted. Such a model displays the relationship between the variables xi and yi graphically as a locus of points in a rectangular coordinate system. This graphical dependence is also called a scatterplot or correlation field.
This model of the two-dimensional normal distribution (correlation field) allows a visual graphical interpretation of the correlation coefficient, because the joint distribution depends on five parameters: μx and μy, the mean values (mathematical expectations); σx and σy, the standard deviations of the random variables X and Y; and ρ, the correlation coefficient, which is a measure of the relationship between the random variables X and Y.
If ρ = 0, the values xi, yi obtained from a two-dimensional normal population lie on the graph in x, y coordinates within an area bounded by a circle (Figure 5, a). In this case, there is no correlation between the random variables X and Y, and they are called uncorrelated. For a two-dimensional normal distribution, uncorrelatedness simultaneously means independence of the random variables X and Y.
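A quick way to see this is to sample from a two-dimensional normal population with a chosen ρ (a sketch; the seed and sample size are arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

# With rho = 0 the point cloud is circular; with rho near +/-1 it
# stretches along a line (the envelope approaches an ellipse).
rng = np.random.default_rng(5)
for rho in (0.0, 0.8):
    cov = [[1.0, rho], [rho, 1.0]]
    pts = rng.multivariate_normal(mean=[0, 0], cov=cov, size=500)
    plt.figure()
    plt.scatter(pts[:, 0], pts[:, 1], s=10)
    plt.title(f"rho = {rho}")
plt.show()
```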

The concept of relationship is quite common in psychological research. A psychologist has to operate with it when it becomes necessary to compare the measurements of two or more indicators of signs or phenomena in order to draw any conclusions.

The nature of the relationship between the studied phenomena can be unambiguous, i.e., such that a certain value of one feature corresponds to a clear and definite value of another. For example, in the pattern-search subtest of tests of mental functions, the number of "raw" points scored is determined by the formula:

Xi = ((Stz - Soz) / (Stz + Spz)) · Sbc,

where Xi is the value of the variant, Stz is the number of a priori specified patterns (correspondences) in the subtest, Soz is the number of matches erroneously indicated by the subjects, Spz is the number of unspecified (missed) matches, and Sbc is the number of all words viewed by the subjects in the test.

Such a relationship is called functional: here one indicator is a function of another, which is an argument in relation to the first.

However, a clear-cut relationship is not always found. More often one has to deal with a situation in which one value of a feature can correspond to several values ​​of another. These values ​​vary within more or less defined boundaries. This type of relationship is called correlation or correlative.

Several types of correlation expressions are used. Thus, to express the relationship between features whose values vary quantitatively, the following are used: tabulation followed by calculation of the pair correlation coefficient, the coefficients of multiple and partial correlation, the coefficient of multiple determination, and the correlation ratio.

If it is necessary to study the relationship between features whose variation is of a qualitative nature (the results of projective methods of personality research, studies using the Semantic Differential method, studies using open scales, etc.), then the qualitative alternative correlation coefficient (tetrachoric indicator), Pearson's χ² criterion, and the contingency coefficients of Pearson and Chuprov are used.

Special methods are used to determine a qualitative-quantitative correlation, i.e., a correlation in which one feature varies qualitatively and the other quantitatively.

The correlation coefficient (the term was first introduced by F. Galton in 1888) is an indicator of the strength of the relationship between two compared sample variants. Whatever formula is used to calculate it, its value ranges from -1 to +1. In the case of a complete positive correlation this coefficient equals plus 1, and in the case of a complete negative correlation, minus 1. Graphically, this corresponds to a straight line passing through the points of each pair of data values.

If the values of the variants do not line up on a straight line but form a "cloud", the absolute value of the correlation coefficient becomes less than one and approaches zero as the cloud becomes rounder. If the correlation coefficient is 0, the two variants are completely independent of each other.

Any calculated (empirical) value of the correlation coefficient must be checked for reliability (statistical significance) against the appropriate tables of critical values of the correlation coefficient. If the empirical value is less than or equal to the tabulated value for the 5 percent level (P = 0.05), the correlation is not significant. If the calculated value of the correlation coefficient is greater than the tabulated value for P = 0.01, the correlation is statistically significant.

When the significance level falls between these values (0.05 > P > 0.01), in practice one speaks of the correlation being significant at P = 0.05.

The Bravais-Pearson correlation coefficient (r) is a parametric indicator proposed in 1896, whose calculation compares the arithmetic means and mean square values of the variants. It is calculated by the following formula (which may look different in different authors):

r = (Σ XiXi1 - n·X̄·X̄1) / ((n - 1)·Qx·Qx1),

where Σ XiXi1 is the sum of the products of the values of the pairwise compared variants, n is the number of compared pairs, X̄ and X̄1 are the arithmetic means of the variants Xi and Xi1, respectively, and Qx and Qx1 are the standard deviations of the distributions of x and x1.
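A sketch checking the reconstructed formula against numpy's built-in coefficient (synthetic data; the sample standard deviations with ddof=1 correspond to the n - 1 in the denominator):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=30)
x1 = 0.7 * x + rng.normal(scale=0.7, size=30)

n = len(x)
# Bravais-Pearson formula as written above
r = (np.sum(x * x1) - n * x.mean() * x1.mean()) / (
    (n - 1) * x.std(ddof=1) * x1.std(ddof=1))
print(r, np.corrcoef(x, x1)[0, 1])  # the two values agree
```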

The Spearman rank correlation coefficient Rs (rank correlation coefficient, Spearman coefficient) is the simplest form of the correlation coefficient; it measures the relationship between the ranks (places) of the same variants on different features, without taking the values themselves into account. Here the relationship is more qualitative than quantitative.

This non-parametric test is usually used when conclusions need to be drawn not so much about the intervals between the data as about their ranks, and also when the distribution curves are extremely asymmetric and do not allow the use of parametric tests such as the Bravais-Pearson correlation coefficient (in these cases it may be necessary to convert quantitative data into ordinal data). If the coefficient Rs is close to +1, the two rows of the sample ranked by the given features practically coincide; if it is close to -1, one can speak of a complete inverse relationship.

As with the Bravais-Pearson correlation coefficient, it is more convenient to present the calculation of Rs in tabular form.
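Before tabulating by hand, the underlying formula Rs = 1 - 6·Σd² / (n(n² - 1)), where d is the difference of ranks in each pair, can be sketched as follows (invented values; this simple form assumes no tied ranks):

```python
import numpy as np
from scipy import stats

x = np.array([3.1, 1.2, 5.4, 2.8, 4.9, 6.0])
y = np.array([2.5, 1.1, 4.8, 3.3, 5.2, 5.9])

rx = stats.rankdata(x)
ry = stats.rankdata(y)
d = rx - ry
n = len(x)
rs = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))
print(rs, stats.spearmanr(x, y)[0])  # identical when there are no ties
```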

Regression generalizes the concept of a functional relationship to the case of a stochastic (probabilistic) nature of the relationship between the values of the variants. The purpose of solving regression problems is to estimate the value of a continuous output variable from the values of the input variables.