Spearman correlation analysis

In cases where the measurements of the studied characteristics are made on an ordinal scale, or the form of the relationship differs from linear, the study of the relationship between two random variables is carried out using rank correlation coefficients. Consider Spearman's rank correlation coefficient. To calculate it, the sample values must be ranked (ordered). Ranking is the grouping of experimental data in a certain order, either ascending or descending.

The ranking operation is carried out according to the following algorithm:

1. A lower value is assigned a lower rank. The lowest value is assigned a rank equal to 1. The highest value is assigned a rank corresponding to the number of ranked values. For example, if n = 7, then the highest value receives rank 7, except in the cases provided for by the second rule.

2. If several values are equal, they are assigned a rank equal to the average of the ranks they would have received had they not been equal. As an example, consider an ascending sample of 7 elements: 22, 23, 25, 25, 25, 28, 30. The values 22 and 23 occur once each, so their ranks are R22 = 1 and R23 = 2. The value 25 occurs 3 times. If these values did not repeat, their ranks would be 3, 4, and 5; therefore their rank R25 equals the arithmetic mean of 3, 4, and 5: R25 = (3 + 4 + 5)/3 = 4. The values 28 and 30 do not repeat, so their ranks are R28 = 6 and R30 = 7. Finally, we have the following correspondence: 22 → 1, 23 → 2, 25 → 4, 25 → 4, 25 → 4, 28 → 6, 30 → 7.

3. The total sum of the ranks must match the calculated sum, which is determined by the formula:

$$\sum R_i = \frac{n(n+1)}{2}$$

where n is the total number of ranked values.

A discrepancy between the actual and calculated sums of ranks indicates an error made in calculating or summing the ranks. In this case, you need to find and fix the error.
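A minimal Python sketch of these ranking rules, using the sample from rule 2 (`scipy.stats.rankdata` implements exactly this average-rank convention for ties):

```python
from scipy.stats import rankdata

sample = [22, 23, 25, 25, 25, 28, 30]

# Average ranks for ties: the three 25s share rank (3 + 4 + 5) / 3 = 4.
ranks = rankdata(sample, method="average")
print(ranks)  # [1. 2. 4. 4. 4. 6. 7.]

# Checksum from rule 3: the rank total must equal n * (n + 1) / 2.
n = len(sample)
assert ranks.sum() == n * (n + 1) / 2  # 7 * 8 / 2 = 28
```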

Spearman's rank correlation is a method that allows you to determine the strength and direction of the relationship between two features or two feature hierarchies. The use of the rank correlation coefficient has a number of limitations:

  • a) The expected correlation should be monotonic.
  • b) The size of each sample must be greater than or equal to 5. To determine the upper limit of the sample size, tables of critical values are used (Table 3 of the Appendix). The maximum value of n in the table is 40.
  • c) During the analysis, it is likely that a large number of identical ranks will occur; in this case, a correction must be applied. The most favorable case is when both studied samples represent two sequences of non-matching values.

To conduct a correlation analysis, the researcher must have two samples that can be ranked, for example:

  • two features measured in the same group of subjects;
  • two individual feature hierarchies identified in two subjects for the same set of features;
  • two group hierarchies of features;
  • an individual and a group hierarchy of features.

We begin the calculation with ranking the studied indicators separately for each of the signs.

Let us analyze the case with two features measured in the same group of subjects. First, the individual values obtained by the different subjects on the first feature are ranked, and then the individual values on the second feature. If lower ranks on one indicator correspond to lower ranks on the other, and higher ranks on one correspond to higher ranks on the other, the two features are positively related. If higher ranks on one indicator correspond to lower ranks on the other, the two features are negatively related. To find rs, we determine the differences between the ranks (d) for each subject. The smaller the differences between the ranks, the closer the rank correlation coefficient rs will be to +1. If there is no relationship, there will be no correspondence between the ranks, and rs will be close to zero. The greater the differences between the subjects' ranks on the two variables, the closer the value of rs will be to −1. Thus, the Spearman rank correlation coefficient is a measure of any monotonic relationship between the two characteristics under study.

Consider the case with two individual feature hierarchies identified in two subjects for the same set of features. In this situation, the individual values obtained by each of the two subjects on a certain set of features are ranked. The feature with the lowest value is assigned the first rank; the feature with the next higher value, the second rank, and so on. Care should be taken to ensure that all features are measured in the same units. For example, it is impossible to rank indicators expressed in points of different "price", since it is impossible to determine which factor takes first place in severity until all the values are brought to a single scale. If features that have low ranks in one of the subjects also have low ranks in the other, and vice versa, then the individual hierarchies are positively related.

In the case of two group hierarchies of features, the group mean values obtained in two groups of subjects are ranked on the same set of features for the studied groups. Next, we follow the algorithm given for the previous cases.

Let us analyze the case with an individual and a group hierarchy of features. Here, the individual values of one subject and the group mean values are ranked separately on the same set of features; the group means are computed with this subject excluded, since his individual hierarchy will be compared with the group hierarchy. Rank correlation makes it possible to assess the degree of consistency between the individual and group hierarchies of features.

Let us consider how the significance of the correlation coefficient is determined in the cases listed above. In the case of two features, it will be determined by the sample size. In the case of two individual feature hierarchies, the significance depends on the number of features included in the hierarchy. In the last two cases, the significance is determined by the number of traits studied, and not by the size of the groups. Thus, the significance of rs in all cases is determined by the number of ranked values ​​n.

When testing the statistical significance of rs, tables of critical values ​​of the rank correlation coefficient are used, compiled for different numbers of ranked values ​​and different levels of significance. If the absolute value of rs reaches a critical value or exceeds it, then the correlation is significant.

When considering the first option (a case with two features measured in the same group of subjects), the following hypotheses are possible.

H0: The correlation between variables x and y is not different from zero.

H1: The correlation between variables x and y is significantly different from zero.

If we work with any of the three remaining cases, then we need to put forward another pair of hypotheses:

H0: The correlation between the x and y hierarchies does not differ from zero.

H1: The correlation between the x and y hierarchies differs significantly from zero.

The sequence of actions in calculating the Spearman rank correlation coefficient rs is as follows.

  • - Determine which two features or two feature hierarchies will participate in the matching as x and y variables.
  • - Rank the values ​​of the variable x, assigning rank 1 to the smallest value, according to the ranking rules. Place the ranks in the first column of the table in order of the numbers of the subjects or signs.
  • - Rank the values ​​of the variable y. Place the ranks in the second column of the table in order of the numbers of the subjects or signs.
  • - Calculate the differences d between the ranks x and y for each row of the table. The results are placed in the next column of the table.
  • - Calculate the squared differences (d2). Place the obtained values ​​in the fourth column of the table.
  • - Calculate the sum of the squared differences, Σd².
  • - If identical ranks occur, calculate the corrections:

$$T_x = \frac{\sum (t_x^3 - t_x)}{12}, \qquad T_y = \frac{\sum (t_y^3 - t_y)}{12}$$

where tx is the size of each group of equal ranks in sample x;

ty is the size of each group of equal ranks in sample y.

Calculate the rank correlation coefficient depending on the presence or absence of identical ranks. In the absence of identical ranks, the rank correlation coefficient rs is calculated using the formula:

$$r_s = 1 - \frac{6\sum d^2}{n(n^2-1)}$$

In the presence of identical ranks, the rank correlation coefficient rs is calculated using the formula:

$$r_s = 1 - 6\,\frac{\sum d^2 + T_x + T_y}{n(n^2-1)}$$

where Σd² is the sum of the squared differences between the ranks;

Tx and Ty - corrections for the same ranks;

n is the number of subjects or features that participated in the ranking.

Determine the critical values ​​of rs from table 3 of the Appendix, for a given number of subjects n. A significant difference from zero of the correlation coefficient will be observed provided that rs is not less than the critical value.
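A sketch of this whole sequence of steps in Python, with made-up data (the tie corrections follow the T = Σ(t³ − t)/12 convention above; for real work, `scipy.stats.spearmanr` also returns a p-value directly):

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

def spearman_rs(x, y):
    """rs following the steps above, with the T = sum(t^3 - t)/12 tie corrections."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    rx, ry = rankdata(x), rankdata(y)      # rank both variables (average ranks for ties)
    d2 = ((rx - ry) ** 2).sum()            # sum of squared rank differences

    def T(r):
        _, t = np.unique(r, return_counts=True)  # size t of each group of equal ranks
        return np.sum(t ** 3 - t) / 12.0

    return 1 - 6 * (d2 + T(rx) + T(ry)) / (n * (n ** 2 - 1))

x = [5, 1, 2, 4, 3, 6, 8, 7]
y = [6, 2, 1, 3, 4, 5, 7, 8]
print(spearman_rs(x, y))        # 0.905; no ties here, so this equals scipy's value
print(spearmanr(x, y)[0])
```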

Correlation analysis is a method that allows you to detect relationships between a certain number of random variables. The purpose of correlation analysis is to provide an estimate of the strength of the connections between random variables or features that characterize certain real processes.


37. Spearman's rank correlation coefficient.


http://psystat.at.ua/publ/1-1-0-33

Spearman's rank correlation coefficient is used when:
- the variables are measured on a ranking scale;
- the distribution of the data differs markedly from normal, or is unknown;
- the samples are small (n < 30).

The interpretation of Spearman's rank correlation coefficient does not differ from that of Pearson's coefficient, but its meaning is somewhat different. To understand the difference between these methods and logically substantiate their areas of application, let us compare their formulas.

Pearson correlation coefficient:

$$r_{xy} = \frac{\sum (x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum (x_i - \bar x)^2 \sum (y_i - \bar y)^2}}$$

Spearman's correlation coefficient:

$$\rho = 1 - \frac{6\sum d_i^2}{n(n^2-1)}$$

As you can see, the formulas differ significantly.

The Pearson correlation formula uses the arithmetic mean and standard deviation of the correlated series, while the Spearman formula does not. Thus, in order to obtain an adequate result according to the Pearson formula, it is necessary that the correlated series be close to the normal distribution (the mean and standard deviation are normal distribution parameters). For the Spearman formula, this is not relevant.

An element of Pearson's formula is the standardization of each series, i.e. the conversion of its values to z-scores: $z = \frac{x - \bar x}{\sigma_x}$.

As you can see, the conversion of variables to the Z-scale is present in the Pearson correlation coefficient formula. Accordingly, for the Pearson coefficient, the scale of the data is absolutely irrelevant: for example, we can correlate two variables, one of which has a min. = 0 and max. = 1, and the second min. = 100 and max. = 1000. No matter how different the range of values ​​is, they will all be converted to standard z-values ​​with the same scale.

There is no such normalization in the Spearman coefficient, so

A MANDATORY CONDITION FOR USING THE SPEARMAN COEFFICIENT IS THAT THE RANGES OF THE TWO VARIABLES BE EQUAL.

Before applying the Spearman coefficient to data series with different ranges, it is necessary to rank them. Ranking leads to the values of these series acquiring the same minimum of 1 (the minimum rank) and a maximum equal to the number of values (the maximum, last rank equals N, i.e. the number of cases in the sample).
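A small sketch of this difference, with made-up series (names and scales are illustrative): Pearson's r is unchanged by linear rescaling because z-standardization is built into it, while ranking is what maps two series with different ranges onto the common scale 1..N:

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(0)
a = rng.normal(size=100)                               # range roughly [-3, 3]
b = 100 + 900 * (a + rng.normal(scale=0.5, size=100))  # an entirely different scale

# Pearson is invariant to linear rescaling: z-standardization is built in.
print(np.corrcoef(a, b)[0, 1])
print(np.corrcoef(10 * a + 5, b)[0, 1])   # identical value

# Ranking gives both series the same min = 1 and max = N.
ra, rb = rankdata(a), rankdata(b)
print(ra.min(), ra.max(), rb.min(), rb.max())   # 1.0 100.0 1.0 100.0
```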

In what cases it is possible to do without ranking

These are cases when the data are originally on a ranking scale: for example, the Rokeach value orientations test.

These are also cases when the number of possible values is small and the sample has a fixed minimum and maximum. For example, in the semantic differential, minimum = 1 and maximum = 7.

An example of calculating the Spearman rank correlation coefficient

Rokeach's value orientations test was carried out on two samples X and Y. Task: to find out how close the value hierarchies of these samples are (literally, how similar they are).

The resulting value r = 0.747 is checked against the table of critical values. According to the table, at N = 18 the obtained value is significant at the level p ≤ 0.005.
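A quick check of this result in Python (a sketch: the table lookup is replaced by the equivalent Student's t-test with df = N − 2):

```python
import math
from scipy.stats import t as t_dist

r, N = 0.747, 18
t_emp = r * math.sqrt(N - 2) / math.sqrt(1 - r ** 2)   # empirical t statistic
p_two_sided = 2 * t_dist.sf(abs(t_emp), df=N - 2)
print(t_emp, p_two_sided)   # t ~ 4.49, p ~ 0.0004, consistent with p <= 0.005
```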

Rank correlation coefficients according to Spearman and Kendall

For variables belonging to the ordinal scale, or for variables that do not follow a normal distribution (as well as for variables belonging to the interval scale), Spearman's rank correlation is calculated instead of the Pearson coefficient. To do this, the individual values of the variables are assigned ranks, which are then processed using the appropriate formulas. To obtain a rank correlation, uncheck the default Pearson correlation check box in the Bivariate Correlations... dialog box and activate the Spearman correlation calculation instead. This calculation gives the following results: the rank correlation coefficients are very close to the corresponding Pearson coefficients (the original variables have a normal distribution).

titkova-matmetody.pdf p. 45

Spearman's rank correlation method allows you to determine the tightness (strength) and direction of the correlation between two features or two profiles (hierarchies) of features.

To calculate the rank correlation, it is necessary to have two series of values that can be ranked. These series of values can be:

1) two features measured in the same group of subjects;

2) two individual feature hierarchies identified in two subjects for the same set of features;

3) two group hierarchies of features;

4) an individual and a group feature hierarchy.

First, the indicators are ranked separately for each of the features. As a rule, a lower feature value is assigned a lower rank.

In the first case (two features), the individual values on the first feature obtained by the different subjects are ranked, and then the individual values on the second feature.

If two features are positively related, then subjects with low ranks on one of them will have low ranks on the other, and subjects with high ranks on one of the features will also have high ranks on the other feature. To compute rs, the differences (d) between the ranks obtained by each subject on the two features must be determined. These d values are then transformed in a certain way and subtracted from 1. The smaller the differences between the ranks, the larger rs will be: the closer it will be to +1.

If there is no correlation, all the ranks will be mixed and show no correspondence. The formula is designed so that in this case rs will be close to 0.

In the case of a negative correlation, subjects' low ranks on one feature will correspond to high ranks on the other feature, and vice versa. The greater the mismatch between the subjects' ranks on the two variables, the closer rs is to −1.

In the second case (two individual profiles), the individual values obtained by each of the two subjects on a certain set of features (the same for both of them) are ranked. The feature with the lowest value receives the first rank; the feature with the next higher value, the second rank, and so on. Obviously, all the features must be measured in the same units, otherwise ranking is impossible. For example, it is impossible to rank the indicators of the Cattell Personality Questionnaire (16PF) if they are expressed in "raw" scores, since the ranges of values differ across the factors: from 0 to 13, from 0 to 20, and from 0 to 26. We cannot say which factor takes first place in severity until all the values are brought to a single scale (most often the sten scale).

If the individual hierarchies of two subjects are positively related, then features having low ranks for one of them will have low ranks for the other, and vice versa. For example, if for one subject factor E (dominance) has the lowest rank, then it should have a low rank for the other subject as well; if one subject's factor C (emotional stability) has the highest rank, then the other subject must also give this factor a high rank, and so on.

In the third case (two group profiles), the group mean values obtained in two groups of subjects on a certain set of features, identical for the two groups, are ranked. The line of reasoning is then the same as in the previous two cases.

In the fourth case (individual and group profiles), the individual values of the subject and the group mean values on the same set of features are ranked separately; the group means are obtained, as a rule, with this individual subject excluded, so that he does not participate in the group mean profile with which his individual profile will be compared. Rank correlation allows you to check how consistent the individual and group profiles are.

In all four cases, the significance of the obtained correlation coefficient is determined by the number of ranked values N. In the first case, this number coincides with the sample size n. In the second case, the number of observations is the number of features constituting the hierarchy. In the third and fourth cases, N is likewise the number of matched features, not the number of subjects in the groups. Detailed explanations are given in the examples. If the absolute value of rs reaches a critical value or exceeds it, the correlation is significant.

Hypotheses.

There are two possible pairs of hypotheses. The first refers to case 1, the second to the other three cases.

The first pair of hypotheses

H0: The correlation between variables A and B does not differ from zero.

H1: The correlation between variables A and B differs significantly from zero.

The second pair of hypotheses

H0: The correlation between hierarchies A and B does not differ from zero.

H1: The correlation between hierarchies A and B differs significantly from zero.

Limitations of the rank correlation coefficient

1. At least 5 observations must be supplied for each variable. The upper limit of the sample size is determined by the available tables of critical values.

2. With a large number of identical ranks on one or both of the matched variables, Spearman's rank correlation coefficient rs gives coarsened values. Ideally, both correlated series should be two sequences of non-matching values. If this condition is not met, a correction for identical ranks must be made.

Spearman's rank correlation coefficient is calculated by the formula:

$$r_s = 1 - \frac{6\sum d^2}{n(n^2-1)}$$

If both compared ranked series contain groups of identical ranks, then before calculating the rank correlation coefficient it is necessary to introduce corrections for identical ranks, Ta and Tb:

$$T_a = \frac{\sum (a^3 - a)}{12}, \qquad T_b = \frac{\sum (b^3 - b)}{12},$$

where a is the size of each group of identical ranks in rank series A, and b is the size of each group of identical ranks in rank series B.

In this case, to calculate the empirical value of rs, use the formula:

$$r_s = 1 - 6\,\frac{\sum d^2 + T_a + T_b}{n(n^2-1)}$$

38. Point-biserial correlation coefficient.

For correlation in general, see question no. 36, p. 56 (64).

harchenko-korranaliz.pdf

Let the variable X be measured on a metric (interval or ratio) scale, and the variable Y on a dichotomous one. The point-biserial correlation coefficient rpb is calculated by the formula:

$$r_{pb} = \frac{\bar x_1 - \bar x_0}{s_x}\sqrt{\frac{n_1 n_0}{n(n-1)}}$$

Here $\bar x_1$ is the mean of X over objects with the value "one" on Y;

$\bar x_0$ is the mean of X over objects with the value "zero" on Y;

$s_x$ is the standard deviation of all values of X;

$n_1$ is the number of objects with "one" on Y, and $n_0$ is the number of objects with "zero" on Y;

$n = n_1 + n_0$ is the sample size.

The point-biserial correlation coefficient can also be calculated using other equivalent expressions involving the overall mean $\bar x$ of the variable X.

The point-biserial correlation coefficient rpb varies from −1 to +1. Its value is zero when the objects with a one on Y have a mean X equal to the mean X of the objects with a zero on Y.

Testing the significance hypothesis for the point-biserial correlation coefficient consists in testing the null hypothesis H0 that the population correlation coefficient equals zero (ρ = 0), which is carried out using Student's t-test. The empirical value

$$t = \frac{r_{pb}\sqrt{n-2}}{\sqrt{1 - r_{pb}^2}}$$

is compared with the critical values $t_\alpha(df)$ for df = n − 2 degrees of freedom.

If the condition |t| ≤ $t_\alpha(df)$ is met, the null hypothesis ρ = 0 is not rejected. The point-biserial correlation coefficient differs significantly from zero if the empirical value |t| falls into the critical region, that is, if |t| > $t_\alpha(n-2)$. The reliability of the relationship calculated using the point-biserial correlation coefficient rpb can also be determined using the χ² criterion for df = 2.
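A minimal sketch of this coefficient and its t-test in Python (the data are made up for illustration; `scipy.stats.pointbiserialr` computes the same quantity as Pearson's r between the 0/1 variable and X):

```python
import numpy as np
from scipy.stats import pointbiserialr

# Made-up data: X on an interval scale, Y dichotomous (0/1).
x = np.array([12.0, 15.0, 9.0, 14.0, 11.0, 18.0, 16.0, 8.0, 13.0, 17.0])
y = np.array([1,    1,    0,   1,    0,    1,    1,    0,   0,    1])

x1, x0 = x[y == 1].mean(), x[y == 0].mean()   # group means of X
n1, n0 = (y == 1).sum(), (y == 0).sum()
n = n1 + n0
s_x = x.std(ddof=1)                           # sd of all X values

r_pb = (x1 - x0) / s_x * np.sqrt(n1 * n0 / (n * (n - 1)))
print(r_pb)

r_scipy, p = pointbiserialr(y, x)             # same value, plus the t-test p-value
print(r_scipy, p)
```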

Point-biserial correlation

A subsequent modification of the product-moment correlation coefficient is the point-biserial r. This statistic shows the relationship between two variables, one of which is assumed to be continuous and normally distributed, while the other is discrete in the exact sense of the word. The point-biserial correlation coefficient is denoted r_pbis. Since in r_pbis the dichotomy reflects the true nature of the discrete variable, rather than being artificial as in the case of r_bis, its sign is determined arbitrarily. Therefore, for all practical purposes, r_pbis is considered in the range from 0.00 to +1.00.

There is also the case when two variables are considered continuous and normally distributed, but both are artificially dichotomized, as in biserial correlation. To assess the relationship between such variables, the tetrachoric correlation coefficient r_tet is used, which was also derived by Pearson. The main (exact) formulas and procedures for calculating r_tet are quite complex; therefore, in practice, approximations of r_tet obtained on the basis of shortened procedures and tables are used.

/online/dictionary/dictionary.php?term=511

POINT-BISERIAL CORRELATION COEFFICIENT: the correlation coefficient between two variables, one of which is measured on a dichotomous scale and the other on an interval scale. It is used in classical and modern testology as an indicator of the quality of a test item: its reliability-consistency with the overall test score.

To correlate variables measured on a dichotomous and an interval scale, the point-biserial correlation coefficient is used.

The point-biserial correlation coefficient is a method of correlation analysis for the relation of two variables, one of which is measured on a nominal scale and takes only 2 values (for example, men/women, correct answer/incorrect answer, feature present/feature absent), and the other on a ratio or interval scale. The formula for calculating the point-biserial correlation coefficient:

$$r_{pb} = \frac{m_1 - m_0}{\sigma_x}\sqrt{\frac{n_1 n_0}{n(n-1)}}$$

where:
m1 and m0 are the mean values of X for the objects with 1 and 0 on Y, respectively;
σx is the standard deviation of all values of X;
n1 and n0 are the numbers of X values with 1 and 0 on Y, respectively;
n is the total number of pairs of values.

Most often, this type of correlation coefficient is used to calculate the relationship of test items with a summary scale. This is one type of validation check.

39. Rank-biserial correlation coefficient.

For correlation in general, see question no. 36, p. 56 (64).

harchenko-korranaliz.pdf p. 28

The rank-biserial correlation coefficient is used when one of the variables (X) is presented on an ordinal scale and the other (Y) on a dichotomous one. It is calculated by the formula

$$r_{rb} = \frac{2(\bar R_1 - \bar R_0)}{n}.$$

Here $\bar R_1$ is the average rank of objects having a one on Y; $\bar R_0$ is the average rank of objects with a zero on Y; n is the sample size.

Significance testing for the rank-biserial correlation coefficient is carried out in the same way as for the point-biserial correlation coefficient, using Student's t-test with rpb replaced by rrb in the formulas.

When one variable is measured on a dichotomous scale (variable X) and the other on a rank scale (variable Y), the rank-biserial correlation coefficient is used. Recall that the variable X, measured on a dichotomous scale, takes only two values (codes), 0 and 1. Let us emphasize: although this coefficient varies in the range from −1 to +1, its sign does not matter for interpreting the results. This is another exception to the general rule.

The calculation of this coefficient is made according to the formula:

$$r_{rb} = \frac{2(\bar X_1 - \bar X_0)}{n}$$

where $\bar X_1$ is the average rank over those elements of the variable Y that correspond to code (feature) 1 in the variable X;

$\bar X_0$ is the average rank over those elements of the variable Y that correspond to code (feature) 0 in the variable X;

n is the total number of elements in the variable X.

To apply the rank-biserial correlation coefficient, the following conditions must be met:

1. The variables being compared must be measured on different scales: one, X, on a dichotomous scale; the other, Y, on a rank scale.

2. The number of varying features in the compared variables X and Y must be the same.

3. To assess the reliability of the rank-biserial correlation coefficient, use formula (11.9) and the table of critical values for Student's test with k = n − 2.

http://psystat.at.ua/publ/drugie_vidy_koehfficienta_korreljacii/1-1-0-38

Cases where one of the variables is presented on a dichotomous scale and the other on a rank (ordinal) scale require the use of the rank-biserial correlation coefficient:

$$r_{rb} = \frac{2}{n}(m_1 - m_0)$$

where:
n is the number of measurement objects;
m1 and m0 are the average ranks of objects with 1 or 0 on the second variable.

This coefficient is also used when checking the validity of tests.
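A minimal sketch of this coefficient under the definitions above (the data are hypothetical; ranks are assigned to Y and then averaged within the two X groups):

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical data: x dichotomous (0/1), y on an ordinal scale.
x = np.array([1, 1, 0, 1, 0, 0, 0, 1])
y = np.array([7, 9, 3, 8, 4, 2, 6, 5])

ranks = rankdata(y)                 # rank the ordinal variable
m1 = ranks[x == 1].mean()           # average rank where x = 1
m0 = ranks[x == 0].mean()           # average rank where x = 0
n = len(y)

r_rb = 2 * (m1 - m0) / n            # r_rb = 2(m1 - m0)/n
print(r_rb)                         # 0.875 for this data
```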

40. Linear correlation coefficient.

About correlation in general (and about linear correlation in particular), see question No. 36, p. 56 (64).

THE r-PEARSON CORRELATION COEFFICIENT

r-Pearson (Pearson r) is used to study the relationship between two metric variables measured on the same sample. There are many situations in which its use is appropriate. Does intelligence affect performance in the senior university years? Is the size of an employee's salary related to his goodwill towards colleagues? Does a student's mood affect the success of solving a complex arithmetic problem? To answer such questions, the researcher must measure the two indicators of interest for each member of the sample. The data for studying the relationship are then tabulated, as in the example below.

EXAMPLE 6.1

The table shows an example of the initial measurement data for two indicators of intelligence (verbal and non-verbal) in 20 students of the 8th grade.

The relationship between these variables can be depicted using a scatter diagram (see Figure 6.3). The diagram shows that there is some relationship between the measured indicators: the greater the value of verbal intelligence, the (mainly) the greater the value of non-verbal intelligence.

Before giving the formula for the correlation coefficient, let us try to trace the logic of its derivation using the data of Example 6.1. The position of each i-th point (the subject with number i) on the scatter diagram relative to the other points (Fig. 6.3) can be given by the magnitudes and signs of the deviations of the corresponding variable values from their means: $(x_i - M_x)$ and $(y_i - M_y)$. If the signs of these deviations coincide, this speaks in favor of a positive relationship (larger values of x correspond to larger values of y, or smaller values of x correspond to smaller values of y).

For subject No. 1, the deviations from the mean on both x and y are positive, while for subject No. 3 both deviations are negative. Consequently, both data points indicate a positive relationship between the studied traits. Conversely, if the signs of the deviations from the mean on x and y differ, this indicates a negative relationship between the traits. Thus, for subject No. 4 the deviation from the mean on x is negative while on y it is positive, and for subject No. 9 it is the reverse.

Thus, if the product of deviations $(x_i - M_x)(y_i - M_y)$ is positive, the data of the i-th subject indicate a direct (positive) relationship, and if it is negative, an inverse (negative) relationship. Accordingly, if x and y are mostly directly proportional, most of the products of deviations will be positive, and if they are inversely related, most of the products will be negative. Therefore, the sum of all products of deviations for a given sample can serve as a general indicator of the strength and direction of the relationship:

$$\sum_{i=1}^{N}(x_i - M_x)(y_i - M_y)$$

With a directly proportional relationship between the variables, this value is large and positive: for most subjects the deviations coincide in sign (large values of one variable correspond to large values of the other, and vice versa). If x and y are inversely related, then for most subjects larger values of one variable correspond to smaller values of the other; the signs of the products will be negative, and the sum of the products as a whole will also be large in absolute value but negative in sign. If there is no systematic relationship between the variables, the positive terms (products of deviations) are balanced by negative ones, and the sum of all products of deviations is close to zero.

So that the sum of the products does not depend on the sample size, it is enough to average it. But we are interested in the measure of the relationship not as a population parameter, but as a computable estimate of it, a statistic. Therefore, as in the variance formula, we divide the sum of products of deviations not by N but by N − 1. The result is a measure of association widely used in physics and the technical sciences, called the covariance:

$$cov_{xy} = \frac{\sum_{i=1}^{N}(x_i - M_x)(y_i - M_y)}{N-1}$$


In psychology, unlike physics, most variables are measured on arbitrary scales, since psychologists are interested not in the absolute value of a trait but in the relative position of subjects within a group. Moreover, covariance is very sensitive to the scale (dispersion) on which the features are measured. To make the measure of association independent of the units of measurement of either attribute, it is enough to divide the covariance by the corresponding standard deviations. This is how the Pearson correlation coefficient formula was obtained:

$$r_{xy} = \frac{cov_{xy}}{\sigma_x \sigma_y}$$

or, after substituting the expressions for $\sigma_x$ and $\sigma_y$:

$$r_{xy} = \frac{\sum (x_i - M_x)(y_i - M_y)}{\sqrt{\sum (x_i - M_x)^2 \sum (y_i - M_y)^2}}$$
If the values of both variables are converted to z-values using the formula

$$z = \frac{x - M_x}{\sigma_x}$$

then the r-Pearson correlation coefficient formula looks simpler:

$$r_{xy} = \frac{\sum z_x z_y}{N-1}$$
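A sketch tracing this derivation numerically on made-up data: the sum of deviation products, the covariance (division by N − 1), and the normalized coefficient are computed step by step and checked against `np.corrcoef`:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(50, 10, size=30)
y = 0.8 * x + rng.normal(0, 6, size=30)
N = len(x)

dev_x, dev_y = x - x.mean(), y - y.mean()
sum_products = (dev_x * dev_y).sum()            # sign/size show direction/strength
cov_xy = sum_products / (N - 1)                 # the covariance
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))    # divide out the scales

# Equivalent: sum of z-score products divided by N - 1.
zx = dev_x / x.std(ddof=1)
zy = dev_y / y.std(ddof=1)
r_z = (zx * zy).sum() / (N - 1)

print(r, r_z, np.corrcoef(x, y)[0, 1])          # all three agree
```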

/dict/sociology/article/soc/soc-0525.htm

LINEAR CORRELATION is a statistical, non-causal linear relationship between two quantitative variables x and y. It is measured using the linear correlation coefficient (Pearson's r), which is the covariance divided by the product of the standard deviations of both variables:

$$r_{xy} = \frac{s_{xy}}{s_x s_y},$$

where $s_{xy}$ is the covariance between the variables x and y;

$s_x$, $s_y$ are the standard deviations of the variables x and y;

$x_i$, $y_i$ are the values of the variables x and y for the object number i;

$\bar x$, $\bar y$ are the arithmetic means of the variables x and y.

Pearson's coefficient r can take values from the interval [−1; +1]. The value r = 0 means there is no linear relationship between the variables x and y (but does not rule out a nonlinear statistical relationship). Positive coefficient values (r > 0) indicate a direct linear relationship; the closer the value is to +1, the stronger the direct statistical relationship. Negative values (r < 0) indicate an inverse linear relationship; the closer the value is to −1, the stronger the inverse relationship. Values r = ±1 mean the presence of a complete linear relationship, direct or inverse. In the case of a complete relationship, all points with coordinates $(x_i, y_i)$ lie on the straight line y = a + bx.

Pearson's linear correlation coefficient is also used to measure the tightness of the relationship in the linear pairwise regression model.

41. Correlation matrix and correlation graph.

For correlation in general, see question no. 36, p. 56 (64).

Correlation matrix. Often, correlation analysis includes the study of the relationships not of two but of many variables measured on a quantitative scale on a single sample. In this case, correlations are calculated for each pair of this set of variables. The calculations are usually carried out on a computer, and the result is a correlation matrix.

A correlation matrix is the result of calculating correlations of the same type for each pair from a set of P variables measured on a quantitative scale on one sample.

EXAMPLE

Assume that we are studying the relationships between 5 variables (v1, v2, ..., v5; P = 5), measured on a sample of N = 30 people. Below are the table of initial data and the correlation matrix.

Initial data:

Correlation matrix:

It is easy to see that the correlation matrix is square, symmetric with respect to the main diagonal (since $r_{ij} = r_{ji}$), with ones on the main diagonal (since $r_{ii} = 1$).

The correlation matrix is square: the number of rows and columns equals the number of variables. It is symmetric with respect to the main diagonal, since the correlation of x with y equals the correlation of y with x. Ones lie on its main diagonal, since the correlation of a feature with itself equals one. Consequently, not all elements of the correlation matrix are subject to analysis, but only those above or below the main diagonal.

The number of correlation coefficients to be analyzed when studying the relationships of P features is determined by the formula P(P − 1)/2. In the example above, the number of such correlation coefficients is 5(5 − 1)/2 = 10.

The main task of analyzing a correlation matrix is revealing the structure of the interrelations of a set of features. This allows visual analysis of correlation pleiades, graphic images of the structure of statistically significant connections, if there are not very many such connections (up to 10-15). Another way is to use multivariate methods: multiple regression, factor, or cluster analysis (see the section "Multivariate methods..."). Using factor or cluster analysis, it is possible to identify groupings of variables that are more closely related to each other than to other variables. A combination of these methods is also very effective, for example when there are many features and they are not homogeneous.
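A minimal pandas sketch of such a matrix (five hypothetical variables v1..v5 and N = 30, as in the example above; the matrix comes out square and symmetric with ones on the diagonal, and there are P(P − 1)/2 = 10 distinct off-diagonal coefficients):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
data = pd.DataFrame(rng.normal(size=(30, 5)),
                    columns=["v1", "v2", "v3", "v4", "v5"])

R = data.corr()          # Pearson by default; method="spearman" is also available
print(R.round(2))        # square, symmetric, ones on the main diagonal

P = R.shape[0]
print(P * (P - 1) // 2)  # 10 coefficients to analyze
```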

Comparison of correlations is an additional task of analyzing the correlation matrix, and it has two variants. If it is necessary to compare the correlations in one of the rows of the correlation matrix (for one of the variables), the comparison method for dependent samples is applied (pp. 148-149). When comparing same-name correlations calculated for different samples, the comparison method for independent samples is used (pp. 147-148).

Methods for comparing correlations in the diagonals of a correlation matrix (for assessing the stationarity of a random process) and for comparing several correlation matrices obtained for different samples (for testing their homogeneity) are time-consuming and beyond the scope of this book. You can get acquainted with these methods in the book by G. V. Sukhodolsky.

The problem of statistical significance of correlations. The problem is that the procedure for testing a statistical hypothesis assumes a single test carried out on a single sample. If the same method is applied many times, even to different variables, the probability of obtaining a result purely by chance increases. In general, if we repeat the same hypothesis-testing method k times with respect to different variables or samples, then with a set value of α we are guaranteed to obtain confirmation of the hypothesis in about α·k of the cases.

Let us assume that the correlation matrix for 15 variables is analyzed, that is, 15(15 − 1)/2 = 105 correlation coefficients are calculated. To test the hypotheses, the level α = 0.05 is set. Testing the hypothesis 105 times, we will get its confirmation about five times (!) regardless of whether the connection actually exists. Knowing this and having obtained, say, 15 "statistically significant" correlation coefficients, can we tell which of them were obtained by chance and which reflect a real relationship?

Strictly speaking, in order to make a statistical decision, it is necessary to reduce the level α by as many times as the number of hypotheses being tested. But this is hardly advisable, since the probability of ignoring a really existing connection (making a type II error) increases unpredictably.

The correlation matrix alone is not a sufficient basis for statistical conclusions about the individual correlation coefficients included in it!

There is only one really convincing way to solve this problem: divide the sample randomly into two parts and take into account only those correlations that are statistically significant in both parts of the sample. An alternative may be the use of multivariate methods (factor, cluster, or multiple regression analysis) for selecting and subsequently interpreting groups of statistically significantly related variables.

The problem of missing values. If there are missing values in the data, two options for calculating the correlation matrix are possible: (a) listwise deletion of values (exclude cases listwise); (b) pairwise deletion of values (exclude cases pairwise). With listwise deletion, the entire row is removed for any object (subject) that has at least one missing value on one of the variables. This method leads to a "correct" correlation matrix in the sense that all coefficients are calculated from the same set of objects. However, if the missing values are randomly distributed across the variables, this method can lead to no objects remaining in the data set (each row containing at least one missing value). To avoid this situation, the other method, pairwise deletion, is used. It considers only the gaps in each selected pair of variable columns and ignores gaps in the other variables; the correlation for a pair of variables is calculated over the objects where neither value is missing. In many situations, especially when the number of gaps is relatively small (say, 10%) and the gaps are distributed fairly randomly, this method does not lead to serious errors. However, sometimes this is not the case. For example, a systematic bias (shift) of the estimate can be "hidden" in the systematic placement of the gaps, which causes correlation coefficients built on different subsets (for example, on different subgroups of objects) to differ. Another problem with a correlation matrix calculated with pairwise gap removal arises when this matrix is used in other types of analysis (for example, multiple regression or factor analysis). They assume that a "correct" correlation matrix is used, with a certain level of consistency and "correspondence" of the various coefficients. The use of a matrix with "bad" (biased) estimates leads to the program either being unable to analyze such a matrix, or producing erroneous results. Therefore, if the pairwise method of eliminating missing data is used, it is necessary to check whether there are systematic patterns in the distribution of the gaps.

If the pairwise elimination of missing data does not lead to any systematic shift in the means and variances (standard deviations), then these statistics will be similar to those calculated with listwise deletion. If there is a significant difference, there is reason to suspect a shift in the estimates. For example, if the mean (or standard deviation) of the values of variable A that were used in computing its correlation with variable B is much lower than the mean (or standard deviation) of the values of variable A that were used in computing its correlation with variable C, then there is every reason to expect that these two correlations (A-B and A-C) are based on different subsets of data. There will be a shift in the correlations caused by the non-random placement of the gaps in the values of the variables.
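A sketch of the two missing-data options in pandas, on hypothetical data with gaps (`DataFrame.corr` does pairwise deletion by design, while listwise deletion corresponds to calling `dropna()` first; comparing the two matrices is a quick way to spot gap-induced shifts):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "B": [2.0, np.nan, 3.0, 5.0, 4.0, 7.0],
    "C": [1.0, 2.0, 2.5, np.nan, 5.0, 6.5],
})

pairwise = df.corr()            # each pair uses all rows complete for that pair
listwise = df.dropna().corr()   # only rows complete on every variable

print(pairwise.round(3))
print(listwise.round(3))
# Noticeably different entries hint that the gaps are not randomly placed.
```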

Analysis of correlation pleiades. After solving the problem of the statistical significance of the elements of the correlation matrix, statistically significant correlations can be represented graphically in the form of a correlation pleiad or pleiades. A correlation pleiad is a figure consisting of vertices and the lines connecting them. The vertices correspond to the features and are usually denoted by numbers, the numbers of the variables. The lines correspond to statistically significant relationships and graphically express the sign, and sometimes the p-level of significance, of the relationship.

The correlation pleiad can reflect all the statistically significant relationships of the correlation matrix (it is then sometimes called a correlation graph) or only a meaningfully selected part of them (for example, those corresponding to one factor according to the results of factor analysis).

EXAMPLE OF CONSTRUCTING A CORRELATION PLEIAD



Spearman's rank correlation coefficient is a quantitative assessment of the statistical relationship between phenomena, used in nonparametric methods.

The indicator shows how the observed sum of squared differences between the ranks differs from the case of no relationship.

Service assignment. With this online calculator you can:

    • calculate Spearman's rank correlation coefficient;
    • calculate the confidence interval for the coefficient and evaluate its significance;

Spearman's rank correlation coefficient is one of the indicators of the closeness of a relationship. The qualitative characteristic of the closeness of the relationship for the rank correlation coefficient, as for the other correlation coefficients, can be assessed using the Chaddock scale.


Application area. The rank correlation coefficient is used to assess the quality of the relationship between two data sets. In addition, its statistical significance is used when testing data for heteroscedasticity.

Example. Given a data sample of the observed variables X and Y:

1. compile a ranking table;
2. find Spearman's rank correlation coefficient and test its significance at the 2α level;
3. assess the nature of the dependence.

Solution. Assign ranks to the feature Y and the factor X.
X  Y  rank X (dx)  rank Y (dy)
    28 21 1 1
    30 25 2 2
    36 29 4 3
    40 31 5 4
    30 32 3 5
    46 34 6 6
    56 35 8 7
    54 38 7 8
    60 39 10 9
    56 41 9 10
    60 42 11 11
    68 44 12 12
    70 46 13 13
    76 50 14 14

    Rank matrix.
rank X (dx)  rank Y (dy)  (dx − dy)²
    1 1 0
    2 2 0
    4 3 1
    5 4 1
    3 5 4
    6 6 0
    8 7 1
    7 8 1
    10 9 1
    9 10 1
    11 11 0
    12 12 0
    13 13 0
    14 14 0
    105 105 10

Checking the correctness of the matrix based on the calculation of the checksum:

$$\sum R = \frac{(1 + n)n}{2} = \frac{(1 + 14)\cdot 14}{2} = 105$$

The sums over the columns of the matrix are equal to each other and to the checksum, which means that the matrix is composed correctly.
Using the formula, we calculate Spearman's rank correlation coefficient:

$$r_s = 1 - \frac{6\sum d^2}{n(n^2-1)} = 1 - \frac{6 \cdot 10}{14(14^2-1)} = 0.978$$

The relationship between the trait Y and the factor X is strong and direct.
Significance of Spearman's rank correlation coefficient

In order to test, at significance level α, the null hypothesis that the population Spearman rank correlation coefficient equals zero against the competing hypothesis Hi: p ≠ 0, it is necessary to calculate the critical point:

$$T_{kp} = t(\alpha, k)\sqrt{\frac{1-\rho^2}{n-2}}$$

where n is the sample size; ρ is the sample Spearman rank correlation coefficient; t(α, k) is the critical point of the two-sided critical region, found from the table of critical points of Student's distribution for significance level α and degrees of freedom k = n − 2.

If |ρ| < T_kp, there are no grounds to reject the null hypothesis: the rank correlation between the qualitative features is not significant. If |ρ| > T_kp, the null hypothesis is rejected: there is a significant rank correlation between the qualitative features.

From Student's table we find t(α/2, k) = t(0.1/2; 12) = 1.782, so T_kp = 1.782 · √((1 − 0.978²)/12) ≈ 0.11.

Since T_kp < ρ, we reject the hypothesis that Spearman's rank correlation coefficient equals zero. In other words, the rank correlation coefficient is statistically significant, and the rank correlation between the scores on the two tests is significant.
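The same worked example in a few lines of Python (a sketch; note a subtlety: the hand computation above assigned consecutive ranks to the repeated X values 30, 56, and 60, whereas `scipy.stats.spearmanr` averages tied ranks, so its result differs slightly from 0.978):

```python
from scipy.stats import spearmanr

x = [28, 30, 36, 40, 30, 46, 56, 54, 60, 56, 60, 68, 70, 76]
y = [21, 25, 29, 31, 32, 34, 35, 38, 39, 41, 42, 44, 46, 50]

rho, p = spearmanr(x, y)
print(rho)  # ~0.96 with average ranks for the tied X values
print(p)    # very small: the correlation is clearly significant

# The hand value from the table, ignoring the ties:
print(1 - 6 * 10 / (14 * (14 ** 2 - 1)))  # 0.978
```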

In practice, Spearman's rank correlation coefficient (ρ) is often used to determine the closeness of the relationship between two features. The values of each feature are ranked in ascending order (from 1 to n), and then the difference (d) between the ranks corresponding to each observation is determined.

    Example #1. The relationship between the volume of industrial production and investments in fixed capital in 10 regions of one of the federal districts of the Russian Federation in 2003 is characterized by the following data.
Calculate Spearman's and Kendall's rank correlation coefficients. Check their significance at α = 0.05. Formulate a conclusion about the relationship between the volume of industrial production and investment in fixed capital in the regions of the Russian Federation under consideration.

Assign ranks to the feature Y and the factor X. Find the sum of the squared rank differences d².

Using the calculator, we calculate Spearman's rank correlation coefficient:

X  Y  rank X (dx)  rank Y (dy)  (dx − dy)²
    1.3 300 1 2 1
    1.8 1335 2 12 100
    2.4 250 3 1 4
    3.4 946 4 8 16
    4.8 670 5 7 4
    5.1 400 6 4 4
    6.3 380 7 3 16
    7.5 450 8 5 9
    7.8 500 9 6 9
    17.5 1582 10 16 36
    18.3 1216 11 9 4
    22.5 1435 12 14 4
    24.9 1445 13 15 4
    25.8 1820 14 19 25
    28.5 1246 15 10 25
    33.4 1435 16 14 4
    42.4 1800 17 18 1
    45 1360 18 13 25
    50.4 1256 19 11 64
    54.8 1700 20 17 9
Σd² = 364

Estimation of Spearman's rank correlation coefficient:

$$r_s = 1 - \frac{6\sum d^2}{n(n^2-1)} = 1 - \frac{6 \cdot 364}{20(20^2-1)} = 0.726$$

The relationship between the feature Y and the factor X is strong and direct.
According to Student's table we find T_table = t(18; 0.05) = 1.734.
Since T_obs > T_table, we reject the hypothesis that the rank correlation coefficient is zero. In other words, Spearman's rank correlation coefficient is statistically significant.

Interval estimate for the rank correlation coefficient (confidence interval)

The confidence interval for Spearman's rank correlation coefficient is ρ ∈ (0.5431; 0.9095).
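A sketch verifying Example #1 directly from the two columns of the table (`spearmanr` averages the tied Y value 1435, whereas the printed rank column does not, so the result lands near, but not exactly at, the hand value of 0.726):

```python
from scipy.stats import spearmanr

x = [1.3, 1.8, 2.4, 3.4, 4.8, 5.1, 6.3, 7.5, 7.8, 17.5,
     18.3, 22.5, 24.9, 25.8, 28.5, 33.4, 42.4, 45, 50.4, 54.8]
y = [300, 1335, 250, 946, 670, 400, 380, 450, 500, 1582,
     1216, 1435, 1445, 1820, 1246, 1435, 1800, 1360, 1256, 1700]

rho, p = spearmanr(x, y)
print(rho, p)   # rho in the region of 0.7; p well below 0.05, so significant
```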

    Example #2. Initial data.

X Y
5 4
    3 4
    1 3
    3 1
    6 6
    2 2
Since the matrix contains tied ranks (the same rank number) in the 1st row, we re-form them. The ranks are re-formed without changing their importance, that is, the corresponding relations (greater than, less than, equal to) must be preserved between the rank numbers. It is also not recommended to set any rank above 1 or below the value equal to the number of parameters (here n = 6). The re-formation of the ranks is shown in the table.
Seat number in ordered row  Factor position by expert assessment  New rank
    1 1 1
    2 2 2
    3 3 3.5
    4 3 3.5
    5 5 5
    6 6 6
Since the matrix also contains tied ranks in the 2nd row, we re-form them in the same way. The re-formation of the ranks is shown in the table.
Seat number in ordered row  Factor position by expert assessment  New rank
    1 1 1
    2 2 2
    3 3 3
    4 4 4.5
    5 4 4.5
    6 6 6
    Rank matrix.
rank X (dx)  rank Y (dy)  (dx − dy)²
    5 4.5 0.25
    3.5 4.5 1
    1 3 4
    3.5 1 6.25
    6 6 0
    2 2 0
    21 21 11.5
Since among the values of the features x and y there are several identical ones, i.e. tied ranks are formed, the Spearman coefficient in this case is calculated as:

$$r_s = 1 - 6\,\frac{\sum d^2 + A + B}{n(n^2-1)}$$

where

$$A = \sum_j \frac{A_j^3 - A_j}{12}, \qquad B = \sum_k \frac{B_k^3 - B_k}{12}$$

j are the numbers of the tied groups, in order, for the feature x; A_j is the number of identical ranks in the j-th group on x;
k are the numbers of the tied groups, in order, for the feature y; B_k is the number of identical ranks in the k-th group on y.

A = (2³ − 2)/12 = 0.5
B = (2³ − 2)/12 = 0.5
D = A + B = 0.5 + 0.5 = 1

$$r_s = 1 - 6\,\frac{11.5 + 1}{6(6^2-1)} = 1 - \frac{75}{210} = 0.643$$

    The relationship between feature Y and factor X is moderate and direct.
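The same Example #2 in Python (a sketch: `rankdata` produces the averaged "new ranks" 3.5/3.5 and 4.5/4.5 directly, and the corrections A and B follow the Σ(t³ − t)/12 rule used throughout this section):

```python
import numpy as np
from scipy.stats import rankdata

x = np.array([5, 3, 1, 3, 6, 2])
y = np.array([4, 4, 3, 1, 6, 2])

rx, ry = rankdata(x), rankdata(y)   # [5. 3.5 1. 3.5 6. 2.] and [4.5 4.5 3. 1. 6. 2.]
d2 = ((rx - ry) ** 2).sum()         # 11.5, matching the rank matrix above

def correction(r):
    # sum((t^3 - t) / 12) over every group of t identical ranks
    _, t = np.unique(r, return_counts=True)
    return np.sum(t ** 3 - t) / 12.0

A, B = correction(rx), correction(ry)   # 0.5 and 0.5
n = len(x)
rs = 1 - 6 * (d2 + A + B) / (n * (n ** 2 - 1))
print(rs)   # ~0.643: a moderate, direct relationship
```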