Standard confidence interval. Confidence interval

Objective: to teach students algorithms for calculating confidence intervals of statistical parameters.

During statistical data processing, the calculated arithmetic mean, coefficient of variation, correlation coefficient, difference criteria and other point statistics should receive quantitative confidence limits, which indicate possible fluctuations of the indicator up and down within the confidence interval.

Example 3.1. The distribution of calcium in the blood serum of monkeys, as previously established, is characterized by the following sample statistics: $\bar{x}$ = 11.94 mg%; $s_{\bar{x}}$ = 0.127 mg%; n = 100. It is required to determine the confidence interval for the general mean ($\mu$) with confidence probability P = 0.95.

With a given probability, the general mean lies in the interval

$\bar{x} - t s_{\bar{x}} \le \mu \le \bar{x} + t s_{\bar{x}}$,

where $\bar{x}$ is the sample arithmetic mean; t is Student's criterion; $s_{\bar{x}}$ is the error of the arithmetic mean.

From the table "Values of Student's criterion" we find the value of t for a confidence level of 0.95 and the number of degrees of freedom k = 100 − 1 = 99. It is equal to 1.982. Together with the values of the arithmetic mean and its statistical error, we substitute it into the formula:

$11.94 - 1.982 \cdot 0.127 \le \mu \le 11.94 + 1.982 \cdot 0.127$, or $11.69 \le \mu \le 12.19$.

Thus, with a probability of 95%, it can be argued that the general average of this normal distribution is between 11.69 and 12.19 mg%.
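
The arithmetic of Example 3.1 can be checked with a few lines of Python. This is only a sketch: the function name `mean_confidence_interval` is hypothetical, and the critical value 1.982 is simply the tabulated value quoted in the text, passed in as a number rather than computed.

```python
def mean_confidence_interval(mean, std_error, t_crit):
    """Return (lower, upper) limits: mean -/+ t_crit * std_error."""
    margin = t_crit * std_error
    return mean - margin, mean + margin

# Values from Example 3.1; t_crit = 1.982 is the table value quoted in the text.
lower, upper = mean_confidence_interval(11.94, 0.127, 1.982)
print(f"{lower:.2f} .. {upper:.2f} mg%")
```

Passing the critical value as an argument keeps the sketch independent of any particular statistical table or library.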

Example 3.2. Determine the boundaries of the 95% confidence interval for the general variance ($\sigma^2$) of the distribution of calcium in the blood of monkeys, if it is known that $s^2$ = 1.60 and n = 100.

To solve the problem, you can use the following formula:

$s^2 - t s_{s^2} \le \sigma^2 \le s^2 + t s_{s^2}$,

where $s_{s^2}$ is the statistical error of the variance.

Find the error of the sample variance. The value used here, 0.11, corresponds to $s_{s^2} = s^2 / \sqrt{2n} = 1.60 / \sqrt{200}$. The value of the t-criterion for a confidence probability of 0.95 and k = 100 − 1 = 99 degrees of freedom is known from the previous example.

Substituting into the formula, we get:

$1.60 - 1.982 \cdot 0.11 \le \sigma^2 \le 1.60 + 1.982 \cdot 0.11$, or $1.38 \le \sigma^2 \le 1.82$.

A more accurate confidence interval for the general variance can be constructed using Pearson's $\chi^2$ (chi-square) criterion. Critical points for this criterion are given in a special table. When the $\chi^2$ criterion is used to construct a confidence interval, a two-sided significance level is applied: one level, $\alpha_1$, for the lower bound and another, $\alpha_2$, for the upper bound. For example, for a confidence level P = 0.99 the values $\alpha_1$ = 0.010 and $\alpha_2$ = 0.990 are taken. From the table of critical values of $\chi^2$, for these levels and k = 100 − 1 = 99 degrees of freedom, we find $\chi^2_{\alpha_1}$ = 135.80 and $\chi^2_{\alpha_2}$ = 70.06. Note that levels of 0.010 and 0.990 leave 1% in each tail, so the interval built from them has 98% two-sided coverage; exactly 99% coverage corresponds to the levels 0.005 and 0.995.

To find the confidence limits of the general variance using $\chi^2$, we use the formulas: for the lower bound $(n-1)s^2 / \chi^2_{\alpha_1}$, for the upper bound $(n-1)s^2 / \chi^2_{\alpha_2}$. Substituting the data of the problem and the critical values found: $99 \cdot 1.60 / 135.80 = 1.17$; $99 \cdot 1.60 / 70.06 = 2.26$. Thus, at the stated confidence level P = 0.99 (99%), the general variance lies in the range from 1.17 to 2.26 (mg%)² inclusive.
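
The chi-square limits can be reproduced as follows. This is a minimal sketch assuming the limits have the form $(n-1)s^2/\chi^2$, which is what the numbers 1.17 and 2.26 imply; the function name is hypothetical, and the two critical values are the table values quoted in the example, not computed here.

```python
def variance_interval_chi2(sample_var, n, chi2_for_lower, chi2_for_upper):
    """Limits (n-1)*s^2/chi2; the larger critical value gives the lower limit."""
    k = n - 1  # degrees of freedom
    return k * sample_var / chi2_for_lower, k * sample_var / chi2_for_upper

# Critical values 135.80 and 70.06 are taken from the text as given.
var_low, var_high = variance_interval_chi2(1.60, 100, 135.80, 70.06)
print(f"{var_low:.2f} .. {var_high:.2f}")
```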

Example 3.3 . Among the 1000 wheat seeds from the lot that arrived at the elevator, 120 seeds infected with ergot were found. It is necessary to determine the probable boundaries of the total proportion of infected seeds in a given batch of wheat.

Confidence limits for the general share for all its possible values ​​should be determined by the formula:

,

where n is the number of observations; m is the absolute number in one of the groups; t is the normalized deviation.

The sample proportion of infected seeds is p = 120/1000 = 0.12, or 12%. For a confidence level P = 95%, the normalized deviation (Student's t-criterion for k = ∞) is t = 1.960.

We substitute the available data into the formula:

Hence, the boundaries of the confidence interval are 0.122 − 0.041 = 0.081, or 8.1%, and 0.122 + 0.041 = 0.163, or 16.3%.

Thus, with a confidence level of 95%, it can be stated that the total proportion of infected seeds is between 8.1 and 16.3%.

Example 3.4 . The coefficient of variation, which characterizes the variation of calcium (mg%) in the blood serum of monkeys, was equal to 10.6%. Sample size n= 100. It is necessary to determine the boundaries of the 95% confidence interval for the general parameter CV.

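Confidence limits for the general coefficient of variation CV are determined by the following formulas:

$CV_{\min} = \dfrac{CV}{1 + K}$ and $CV_{\max} = \dfrac{CV}{1 - K}$, where K is an intermediate value calculated by the formula $K = t\sqrt{\dfrac{1 + 2\,CV^2}{2(n - 1)}}$, with CV expressed as a fraction.

Knowing that for a confidence level P = 95% the normalized deviation (Student's t-criterion for k = ∞) is t = 1.960, we first calculate the value K:

$K = 1.960\sqrt{\dfrac{1 + 2 \cdot 0.106^2}{2 \cdot 99}} \approx 0.141$.

$CV_{\min} = \dfrac{10.6}{1 + 0.141} = 9.3\%$

$CV_{\max} = \dfrac{10.6}{1 - 0.141} = 12.3\%$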

Thus, the general coefficient of variation with a confidence probability of 95% lies in the range from 9.3 to 12.3%. With repeated samples, the coefficient of variation will not exceed 12.3% and will not fall below 9.3% in 95 cases out of 100.
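
A short numerical check of Example 3.4 is sketched below. It rests on an assumption: the limits are taken as CV/(1 + K) and CV/(1 − K) with K = t·sqrt((1 + 2·CV²)/(2(n − 1))) and CV as a fraction, a form that reproduces the stated limits of 9.3% and 12.3%. The function name is hypothetical.

```python
import math

def cv_confidence_interval(cv_percent, n, t_crit):
    """Limits CV/(1+K) and CV/(1-K); the form of K is an assumption (see lead-in)."""
    cv = cv_percent / 100.0  # coefficient of variation as a fraction
    k = t_crit * math.sqrt((1 + 2 * cv ** 2) / (2 * (n - 1)))
    return cv_percent / (1 + k), cv_percent / (1 - k)

cv_low, cv_high = cv_confidence_interval(10.6, 100, 1.960)
print(f"{cv_low:.1f}% .. {cv_high:.1f}%")
```

The interval is not symmetric about 10.6%, because K enters the denominators rather than being added and subtracted.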

Questions for self-control:

Tasks for independent solution.

1. The average percentage of fat in milk for lactation of cows of Kholmogory crosses was as follows: 3.4; 3.6; 3.2; 3.1; 2.9; 3.7; 3.2; 3.6; 4.0; 3.4; 4.1; 3.8; 3.4; 4.0; 3.3; 3.7; 3.5; 3.6; 3.4; 3.8. Set confidence intervals for the overall mean at a 95% confidence level (20 points).

2. On 400 plants of hybrid rye, the first flowers appeared on average 70.5 days after sowing. The standard deviation was 6.9 days. Determine the error of the mean and confidence intervals for the population mean and variance at a significance level W= 0.05 and W= 0.01 (25 points).

3. When studying the length of the leaves of 502 specimens of garden strawberries, the following data were obtained: $\bar{x}$ = 7.86 cm; σ = 1.32 cm; $s_{\bar{x}}$ = ±0.06 cm. Determine the confidence intervals for the arithmetic mean of the population with significance levels of 0.01, 0.02 and 0.05 (25 points).

4. When examining 150 adult men, the average height was 167 cm, and σ = 6 cm. What are the limits of the general mean and general variance with a confidence probability of 0.99 and 0.95? (25 points).

5. The distribution of calcium in the blood serum of monkeys is characterized by the following sample statistics: $\bar{x}$ = 11.94 mg%, σ = 1.27, n = 100. Construct a 95% confidence interval for the population mean of this distribution. Calculate the coefficient of variation (25 points).

6. The total nitrogen content in the blood plasma of albino rats at the age of 37 and 180 days was studied. Results are expressed in grams per 100 cm³ of plasma. At the age of 37 days, 9 rats had: 0.98; 0.83; 0.99; 0.86; 0.90; 0.81; 0.94; 0.92; 0.87. At the age of 180 days, 8 rats had: 1.20; 1.18; 1.33; 1.21; 1.20; 1.07; 1.13; 1.12. Set confidence intervals for the difference with a confidence level of 0.95 (50 points).

7. Determine the boundaries of the 95% confidence interval for the general variance of the distribution of calcium (mg%) in the blood serum of monkeys, if for this distribution the sample size n = 100 and the statistical error of the sample variance $s_{\sigma^2}$ = 1.60 (40 points).

8. Determine the boundaries of the 95% confidence interval for the general variance of the distribution of 40 spikelets of wheat by length ($\sigma^2$ = 40.87 mm²) (25 points).

9. Smoking is considered the main factor predisposing to obstructive pulmonary disease. Passive smoking is not considered such a factor. Scientists questioned the safety of passive smoking and examined the airway in non-smokers, passive and active smokers. To characterize the state of the respiratory tract, we took one of the indicators of the function of external respiration - the maximum volumetric velocity of the middle of exhalation. A decrease in this indicator is a sign of impaired airway patency. Survey data are shown in the table.

Table columns: group; number of examined; maximum mid-expiratory flow rate, l/s; standard deviation.

Non-smokers:
  work in a non-smoking area
  work in a smoke-filled room

Smokers:
  smoking a small number of cigarettes
  smoking an average number of cigarettes
  smoking a large number of cigarettes

From the table, find the 95% confidence intervals for the general mean and general variance for each of the groups. What are the differences between the groups? Present the results graphically (25 points).

10. Determine the boundaries of the 95% and 99% confidence intervals for the general variance of the number of piglets in 64 farrowings, if the statistical error of the sample variance $s_{\sigma^2}$ = 8.25 (30 points).

11. It is known that the average weight of rabbits is 2.1 kg. Determine the boundaries of the 95% and 99% confidence intervals for the general mean and variance when n= 30, σ = 0.56 kg (25 points).

12. In 100 ears, the grain content of the ear (X), the ear length (Y) and the mass of grain in the ear (Z) were measured. Find confidence intervals for the general mean and variance for P1 = 0.95, P2 = 0.99, P3 = 0.999 if $\bar{x}$ = 19, $\bar{y}$ = 6.766 cm, $\bar{z}$ = 0.554 g; $\sigma_x^2$ = 29.153, $\sigma_y^2$ = 2.111, $\sigma_z^2$ = 0.064 (25 points).

13. In 100 randomly selected ears of winter wheat, the number of spikelets was counted. The sample was characterized by the following statistics: $\bar{x}$ = 15 spikelets and σ = 2.28 spikelets. Determine the accuracy with which the average result is obtained ( ) and construct the confidence interval for the general mean and variance at 95% and 99% confidence levels (30 points).

14. The number of ribs on the shells of a fossil mollusk Orthambonites calligramma:

It is known that n = 19, σ = 4.25. Determine the boundaries of the confidence interval for the general mean and general variance at a significance level W = 0.01 (25 points).

15. To determine milk yields on a commercial dairy farm, the productivity of 15 cows was recorded daily. According to the data for the year, each cow gave on average the following amount of milk per day (l): 22; 19; 25; 20; 27; 17; 30; 21; 18; 24; 26; 23; 25; 20; 24. Construct confidence intervals for the general variance and the arithmetic mean. Can we expect the average annual milk yield per cow to be 10,000 liters? (50 points).

16. In order to determine the average wheat yield for the farm, mowing was carried out on sample plots of 1, 3, 2, 5, 2, 6, 1, 3, 2, 11 and 2 ha. The yield (c/ha) from the plots was 39.4; 38; 35.8; 40; 35; 42.7; 39.3; 41.6; 33; 42; 29 respectively. Plot confidence intervals for the general variance and the arithmetic mean. Is it possible to expect that the average yield for the agricultural enterprise will be 42 c/ha? (50 points).

In statistics, there are two types of estimates: point and interval. A point estimate is a single sample statistic that is used to estimate a population parameter. For example, the sample mean $\bar{X}$ is a point estimate of the population mean μ, and the sample variance S² is a point estimate of the population variance σ². The sample mean is an unbiased estimate of the population expectation: it is called unbiased because the mean of all sample means (with the same sample size n) is equal to the mathematical expectation of the general population.

For the sample variance S² to be an unbiased estimator of the population variance σ², the denominator of the sample variance must be n − 1 rather than n. In other words, the population variance is the average of all possible sample variances.
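
The statement about the denominator n − 1 can be verified exactly on a toy case. The sketch below uses a hypothetical four-value population and assumes samples of size 2 drawn with replacement, so that every ordered sample is equally likely; it averages the sample variances over all possible samples.

```python
import itertools
import math
import statistics

population = [1, 2, 3, 4]  # hypothetical toy population
n = 2                      # sample size; sampling WITH replacement is assumed

samples = list(itertools.product(population, repeat=n))  # all 16 ordered samples
# statistics.variance divides by n - 1, statistics.pvariance divides by n.
mean_s2_unbiased = statistics.fmean([statistics.variance(s) for s in samples])
mean_s2_biased = statistics.fmean([statistics.pvariance(s) for s in samples])
sigma2 = statistics.pvariance(population)  # population variance

print(mean_s2_unbiased, mean_s2_biased, sigma2)
```

With the n − 1 denominator the average of the sample variances coincides with the population variance, while the n denominator underestimates it by the factor (n − 1)/n.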

When estimating population parameters, it should be kept in mind that sample statistics such as $\bar{X}$ depend on the specific sample. To take this into account, an interval estimate of the mathematical expectation of the general population is obtained by analyzing the distribution of sample means. The constructed interval is characterized by a certain confidence level, which is the probability that the true parameter of the general population is estimated correctly. Similar confidence intervals can be used to estimate the proportion of a trait p and the main distributed mass of the general population.


Construction of a confidence interval for the mathematical expectation of the general population with a known standard deviation

Building a confidence interval for the proportion of a trait in the general population

In this section, the concept of a confidence interval is extended to categorical data. This makes it possible to estimate the proportion of a trait in the general population, p, from the sample proportion pS = X/n. As mentioned, if the values np and n(1 − p) exceed 5, the binomial distribution can be approximated by the normal one. Therefore, to estimate the proportion of a trait in the general population p, one can construct an interval whose confidence level is (1 − α) × 100%:

$p_S - Z\sqrt{\dfrac{p_S(1 - p_S)}{n}} \le p \le p_S + Z\sqrt{\dfrac{p_S(1 - p_S)}{n}}$

where pS is the sample proportion of the trait, equal to X/n, i.e. the number of successes divided by the sample size; p is the proportion of the trait in the general population; Z is the critical value of the standardized normal distribution; n is the sample size.

Example 3. Suppose that a sample of 100 invoices completed during the last month is drawn from the information system, and that 10 of these invoices are incorrect. Thus, pS = 10/100 = 0.1. The 95% confidence level corresponds to the critical value Z = 1.96.

$0.1 \pm 1.96\sqrt{\dfrac{0.1 \cdot 0.9}{100}} = 0.1 \pm 0.0588$, that is, $0.0412 \le p \le 0.1588$.

Thus, there is a 95% chance that between 4.12% and 15.88% of invoices contain errors.
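
The invoice example can be reproduced with the Python standard library. The sketch assumes the normal approximation pS ± Z·sqrt(pS(1 − pS)/n) described in this section; the function name is hypothetical, and Z is obtained from `statistics.NormalDist` instead of a table.

```python
import math
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence=0.95):
    """Normal-approximation interval for a population proportion."""
    p_s = successes / n
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_s * (1 - p_s) / n)
    return p_s - margin, p_s + margin

p_low, p_high = proportion_confidence_interval(10, 100)
print(f"{p_low:.4f} .. {p_high:.4f}")
```

The approximation is only reasonable when np and n(1 − p) are large enough, as the text notes.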

For a given sample size, the confidence interval containing the proportion of the trait in the general population seems to be wider than for a continuous random variable. This is because measurements of a continuous random variable contain more information than measurements of categorical data. In other words, categorical data that takes only two values ​​contain insufficient information to estimate the parameters of their distribution.

Calculation of estimates drawn from a finite population

Estimation of mathematical expectation. The finite population correction factor (fpc) is used to reduce the standard error by a factor of $\sqrt{\dfrac{N - n}{N - 1}}$. When calculating confidence intervals for estimates of population parameters, the correction factor is applied in situations where samples are drawn without replacement. Thus, the confidence interval for the mathematical expectation with a confidence level of (1 − α) × 100% is calculated by the formula:

$\bar{X} \pm t_{n-1}\dfrac{S}{\sqrt{n}}\sqrt{\dfrac{N - n}{N - 1}}$

Example 4. To illustrate the application of a correction factor for a finite population, let us return to the problem of calculating the confidence interval for the average invoice amount discussed in Example 3 above. Suppose that a company issues 5,000 invoices per month, and $\bar{X}$ = 110.27 USD, S = 28.95 USD, N = 5000, n = 100, α = 0.05, t99 = 1.9842. According to formula (6) we get:

$110.27 \pm 1.9842 \cdot \dfrac{28.95}{\sqrt{100}}\sqrt{\dfrac{5000 - 100}{5000 - 1}} = 110.27 \pm 5.69$, that is, $104.58 \le \mu \le 115.96$.
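
The finite population correction of Example 4 can be sketched in code as follows. The function name is hypothetical, the interval form mean ± t·(S/√n)·sqrt((N − n)/(N − 1)) is assumed, and the critical value 1.9842 is the one quoted in the example.

```python
import math

def mean_ci_finite_population(mean, s, n, population_size, t_crit):
    """Interval for the mean with the finite population correction applied."""
    fpc = math.sqrt((population_size - n) / (population_size - 1))
    margin = t_crit * s / math.sqrt(n) * fpc
    return mean - margin, mean + margin

# Data of Example 4; t_crit = 1.9842 is taken from the text.
inv_low, inv_high = mean_ci_finite_population(110.27, 28.95, 100, 5000, 1.9842)
print(f"{inv_low:.2f} .. {inv_high:.2f} USD")
```

Because the sample is only 2% of the population here, the correction factor is close to 1 and narrows the interval only slightly.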

Estimation of the proportion of a trait. When sampling without replacement, the confidence interval for the proportion of the trait with a confidence level of (1 − α) × 100% is calculated by the formula:

$p_S \pm Z\sqrt{\dfrac{p_S(1 - p_S)}{n}}\sqrt{\dfrac{N - n}{N - 1}}$

Confidence intervals and ethical issues

When sampling a population and formulating statistical inferences, ethical problems often arise. The main one is how the confidence intervals and point estimates of sample statistics agree. Publishing point estimates without specifying the appropriate confidence intervals (usually at 95% confidence levels) and the sample size from which they are derived can be misleading. This may give the user the impression that a point estimate is exactly what he needs to predict the properties of the entire population. Thus, it is necessary to understand that in any research, not point, but interval estimates should be put at the forefront. In addition, special attention should be paid to the correct choice of sample sizes.

Most often, the objects of statistical manipulations are the results of sociological surveys of the population on various political issues. At the same time, the results of the survey are placed on the front pages of newspapers, and the sampling error and the methodology of statistical analysis are printed somewhere in the middle. To prove the validity of the obtained point estimates, it is necessary to indicate the sample size on the basis of which they were obtained, the boundaries of the confidence interval and its significance level.


Materials from the book Levin et al. Statistics for managers are used. - M.: Williams, 2004. - p. 448–462

The central limit theorem states that, given a sufficiently large sample size, the sampling distribution of means can be approximated by a normal distribution. This property does not depend on the type of population distribution.

In the previous subsections, we considered the question of estimating an unknown parameter a by a single number. Such an estimate is called a "point" estimate. In a number of problems, it is required not only to find a suitable numerical value for the parameter a, but also to evaluate its accuracy and reliability. One needs to know what errors can result from replacing the parameter a by its point estimate $\tilde a$, and with what degree of confidence one can expect that these errors will not go beyond known limits.

Problems of this kind are especially relevant for a small number of observations, when the point estimate $\tilde a$ is largely random and an approximate replacement of a by $\tilde a$ can lead to serious errors.

To give an idea of the accuracy and reliability of the estimate $\tilde a$, mathematical statistics uses so-called confidence intervals and confidence probabilities.

Suppose that for the parameter a an unbiased estimate $\tilde a$ has been obtained from experiment. We want to estimate the possible error in this case. Let us assign some sufficiently large probability p (for example, p = 0.9, 0.95, or 0.99) such that an event with probability p can be considered practically certain, and find a value of ε for which

$P(|\tilde a - a| < \varepsilon) = p$. (14.3.1)

Then the range of practically possible values of the error that arises when a is replaced by $\tilde a$ is ±ε; errors larger in absolute value will appear only with the small probability α = 1 − p. Let us rewrite (14.3.1) as:

$P(\tilde a - \varepsilon < a < \tilde a + \varepsilon) = p$. (14.3.2)

Equality (14.3.2) means that with probability p the unknown value of the parameter a falls within the interval

$I_p = (\tilde a - \varepsilon;\ \tilde a + \varepsilon)$.

One circumstance should be noted here. Previously, we repeatedly considered the probability of a random variable falling into a given non-random interval. Here the situation is different: the value a is not random, whereas the interval $I_p$ is. Its position on the x-axis, determined by its center $\tilde a$, is random; in general, the length of the interval, 2ε, is also random, since the value of ε is calculated, as a rule, from experimental data. Therefore, in this case it is better to interpret p not as the probability that the point a "falls" into the interval $I_p$, but as the probability that the random interval $I_p$ covers the point a (Fig. 14.3.1).

Fig. 14.3.1

The probability p is called the confidence probability (confidence level), and the interval $I_p$ is called the confidence interval. The boundaries of the interval, $a_1 = \tilde a - \varepsilon$ and $a_2 = \tilde a + \varepsilon$, are called the confidence limits.
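
The reading of p as the probability that a random interval covers a fixed point can be illustrated by simulation. The sketch below is only an illustration under stated assumptions: normally distributed observations with a known standard deviation (so that the half-width ε is fixed and only the center of the interval is random), hypothetical values a = 5 and σ = 2, and a fixed random seed.

```python
import random
from statistics import NormalDist, fmean

def coverage_frequency(a=5.0, sigma=2.0, n=10, p=0.9, trials=5000, seed=1):
    """Fraction of simulated intervals (estimate - eps, estimate + eps) that cover a."""
    rng = random.Random(seed)
    t_p = NormalDist().inv_cdf((1 + p) / 2)
    eps = t_p * sigma / n ** 0.5  # half-width; fixed because sigma is assumed known
    hits = 0
    for _ in range(trials):
        estimate = fmean([rng.gauss(a, sigma) for _ in range(n)])
        if estimate - eps < a < estimate + eps:
            hits += 1
    return hits / trials

freq = coverage_frequency()
print(f"observed coverage: {freq:.3f}")
```

The point a never moves; it is the interval that changes from trial to trial, and the observed frequency of coverage should be close to p.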

Let us give one more interpretation of the concept of a confidence interval: it can be considered as the interval of values of the parameter a that are compatible with the experimental data and do not contradict them. Indeed, if we agree to consider an event with probability α = 1 − p practically impossible, then those values of the parameter a for which $|\tilde a - a| > \varepsilon$ must be recognized as contradicting the experimental data, and those for which $|\tilde a - a| < \varepsilon$ as compatible with them.

Suppose that for the parameter a there is an unbiased estimate $\tilde a$. If we knew the distribution law of $\tilde a$, finding the confidence interval would be quite simple: it would be enough to find a value of ε for which

$P(|\tilde a - a| < \varepsilon) = p$.

The difficulty is that the distribution law of the estimate $\tilde a$ depends on the distribution law of the quantity X and, consequently, on its unknown parameters (in particular, on the parameter a itself).

To get around this difficulty, one can apply the following rough approximation: replace the unknown parameters in the expression for ε with their point estimates. With a relatively large number of experiments n (about 20...30), this technique usually gives results of satisfactory accuracy.

As an example, consider the problem of the confidence interval for the mathematical expectation.

Suppose n independent experiments have been performed on a random variable X whose characteristics, the mathematical expectation m and the variance D, are unknown. For these parameters, the following estimates were obtained:

$\tilde m = \dfrac{1}{n}\sum_{i=1}^{n} X_i; \qquad \tilde D = \dfrac{\sum_{i=1}^{n}(X_i - \tilde m)^2}{n - 1}$. (14.3.4)

It is required to construct the confidence interval $I_p$, corresponding to the confidence probability p, for the mathematical expectation m of the quantity X.

In solving this problem, we use the fact that the quantity $\tilde m$ is a sum of n independent identically distributed random variables $X_i$ and, according to the central limit theorem, for sufficiently large n its distribution law is close to normal. In practice, even with a relatively small number of terms (of the order of 10...20), the distribution law of the sum can be considered approximately normal. We will assume that $\tilde m$ is distributed according to the normal law. The characteristics of this law, the mathematical expectation and the variance, are equal to m and D/n respectively (see chapter 13 subsection 13.3). Let us assume that the value D is known, and find a value $\varepsilon_p$ for which

$P(|\tilde m - m| < \varepsilon_p) = p$. (14.3.5)

Applying formula (6.3.5) of Chapter 6, we express the probability on the left side of (14.3.5) in terms of the normal distribution function:

$P(|\tilde m - m| < \varepsilon_p) = 2\Phi^*\left(\dfrac{\varepsilon_p}{\sigma_{\tilde m}}\right) - 1$,

where $\sigma_{\tilde m} = \sqrt{D/n}$ is the standard deviation of the estimate $\tilde m$.

From the equation

$2\Phi^*\left(\dfrac{\varepsilon_p}{\sigma_{\tilde m}}\right) - 1 = p$

we find the value of $\varepsilon_p$:

$\varepsilon_p = \sigma_{\tilde m}\,\arg\Phi^*\left(\dfrac{1 + p}{2}\right)$, (14.3.7)

where $\arg\Phi^*(x)$ is the inverse function of $\Phi^*(x)$, i.e. the value of the argument for which the normal distribution function is equal to x.

The variance D, through which the value $\sigma_{\tilde m}$ is expressed, is not known exactly; as its approximate value, one can use the estimate $\tilde D$ (14.3.4) and put approximately:

$\sigma_{\tilde m} = \sqrt{\tilde D / n}$.

Thus, the problem of constructing a confidence interval is approximately solved; the interval is

$I_p = (\tilde m - \varepsilon_p;\ \tilde m + \varepsilon_p)$,

where $\varepsilon_p$ is defined by formula (14.3.7).

To avoid reverse interpolation in the tables of the function $\Phi^*(x)$ when calculating $\varepsilon_p$, it is convenient to compile a special table (Table 14.3.1), which lists the values of the quantity

$t_p = \arg\Phi^*\left(\dfrac{1 + p}{2}\right)$

depending on p. For the normal law, the value $t_p$ determines the number of standard deviations that must be laid off to the right and left of the center of dispersion so that the probability of falling into the resulting region is equal to p.

In terms of $t_p$, the confidence interval is expressed as:

$I_p = (\tilde m - t_p\sigma_{\tilde m};\ \tilde m + t_p\sigma_{\tilde m})$.

Table 14.3.1
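
Values of the kind such a table lists can be recomputed directly, since $t_p$ is the standard normal quantile of order (1 + p)/2. The sketch below uses `statistics.NormalDist` from the Python standard library; the function names and the numbers passed to `approx_mean_interval` are hypothetical and serve only as an illustration.

```python
import math
from statistics import NormalDist

def t_p(p):
    """Number of standard deviations enclosing probability p under the normal law."""
    return NormalDist().inv_cdf((1 + p) / 2)

def approx_mean_interval(m_est, d_est, n, p):
    """Approximate interval m_est -/+ t_p * sqrt(d_est / n)."""
    eps = t_p(p) * math.sqrt(d_est / n)
    return m_est - eps, m_est + eps

for level in (0.80, 0.90, 0.95, 0.99):
    print(f"p = {level:.2f}   t_p = {t_p(level):.3f}")

# Hypothetical estimates, for illustration only.
m_low, m_high = approx_mean_interval(m_est=50.0, d_est=4.0, n=25, p=0.95)
```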

Example 1. 20 experiments were carried out on the quantity X; the results are shown in Table 14.3.2.

Table 14.3.2

It is required to find an estimate $\tilde m$ for the mathematical expectation m of the quantity X and to construct a confidence interval corresponding to the confidence level p = 0.8.

Solution. We have:

Choosing for the origin n: = 10, according to the third formula (14.2.14) we find the unbiased estimate D :

According to the table 14.3.1 we find

Confidence limits:

Confidence interval:

The values of the parameter m lying in this interval are compatible with the experimental data given in Table 14.3.2.

In a similar way, a confidence interval can be constructed for the variance.

Suppose n independent experiments have been performed on a random variable X with unknown parameters m and D, and for the variance D the unbiased estimate has been obtained:

$\tilde D = \dfrac{\sum_{i=1}^{n}(X_i - \tilde m)^2}{n - 1}$, (14.3.11)

where $\tilde m = \dfrac{1}{n}\sum_{i=1}^{n} X_i$.

It is required to construct an approximate confidence interval for the variance.

From formula (14.3.11) it can be seen that the value $\tilde D$ is a sum of n random variables of the form $\dfrac{(X_i - \tilde m)^2}{n - 1}$. These variables are not independent, since each of them includes the quantity $\tilde m$, which depends on all the others. However, it can be shown that as n increases the distribution law of their sum also approaches the normal law. In practice, for n = 20...30 it can already be considered normal.

Let us assume that this is so, and find the characteristics of this law: the mathematical expectation and the variance. Since the estimate $\tilde D$ is unbiased, $M[\tilde D] = D$.

Computing the variance $D[\tilde D]$ involves relatively complex calculations, so we give its expression without derivation:

$D[\tilde D] = \dfrac{1}{n}\mu_4 - \dfrac{n - 3}{n(n - 1)}D^2$, (14.3.12)

where $\mu_4$ is the fourth central moment of the quantity X.

To use this expression, one needs to substitute into it the values of $\mu_4$ and D (at least approximate ones). Instead of D, one can use its estimate $\tilde D$. In principle, the fourth central moment can also be replaced by its estimate, for example, by a value of the form

$\tilde\mu_4 = \dfrac{\sum_{i=1}^{n}(X_i - \tilde m)^4}{n}$,

but such a replacement gives extremely low accuracy, since in general, with a limited number of experiments, high-order moments are determined with large errors. In practice, however, the form of the distribution law of the quantity X is often known in advance and only its parameters are unknown. Then one can try to express $\mu_4$ in terms of D.

Let us take the most common case, when X is distributed according to the normal law. Then its fourth central moment is expressed in terms of the variance (see Chapter 6, Subsection 6.2):

$\mu_4 = 3D^2$,

and formula (14.3.12) gives

$D[\tilde D] = \dfrac{3}{n}D^2 - \dfrac{n - 3}{n(n - 1)}D^2$, or $D[\tilde D] = \dfrac{2}{n - 1}D^2$. (14.3.14)

Replacing the unknown D in (14.3.14) by its estimate $\tilde D$, we get $D[\tilde D] \approx \dfrac{2}{n - 1}\tilde D^2$, whence

$\sigma_{\tilde D} = \sqrt{\dfrac{2}{n - 1}}\,\tilde D$. (14.3.16)

The moment $\mu_4$ can also be expressed in terms of D in some other cases, when the distribution of X is not normal but its form is known. For example, for the law of uniform density (see Chapter 5) we have:

$\mu_4 = \dfrac{(\beta - \alpha)^4}{80}; \qquad D = \dfrac{(\beta - \alpha)^2}{12}$,

where (α, β) is the interval on which the law is given.

Consequently,

$\mu_4 = 1.8D^2$.

According to formula (14.3.12) we get $D[\tilde D] = \dfrac{0.8n + 1.2}{n(n - 1)}D^2$, from which we find approximately

$\sigma_{\tilde D} \approx \sqrt{\dfrac{0.8n + 1.2}{n(n - 1)}}\,\tilde D$.

In cases where the form of the distribution law of X is unknown, it is still recommended to use formula (14.3.16) when estimating $\sigma_{\tilde D}$, unless there are special grounds for believing that this law differs greatly from the normal one (has a noticeable positive or negative kurtosis).

If an approximate value of $\sigma_{\tilde D}$ has been obtained in one way or another, a confidence interval for the variance can be constructed in the same way as for the mathematical expectation:

$I_p = (\tilde D - t_p\sigma_{\tilde D};\ \tilde D + t_p\sigma_{\tilde D})$, (14.3.18)

where the value $t_p$, depending on the given probability p, is found in Table 14.3.1.
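
The approximate interval for the variance can be sketched in code for the normal case. The sketch assumes the standard deviation of the estimate is sqrt(2/(n − 1)) times the estimated variance and that $t_p$ is the normal quantile of order (1 + p)/2. The function name and the input values (estimated variance 2.0, n = 51) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math
from statistics import NormalDist

def approx_variance_interval(d_est, n, p):
    """Approximate interval d_est -/+ t_p * sigma_D for a near-normal quantity."""
    t_p = NormalDist().inv_cdf((1 + p) / 2)
    sigma_d = math.sqrt(2 / (n - 1)) * d_est  # normal-law case assumed
    return d_est - t_p * sigma_d, d_est + t_p * sigma_d

# Hypothetical inputs: sqrt(2/50) * 2.0 = 0.4, so the half-width is about 1.96 * 0.4.
d_low, d_high = approx_variance_interval(d_est=2.0, n=51, p=0.95)
print(f"{d_low:.3f} .. {d_high:.3f}")
```

For small n the lower limit of this symmetric interval can come close to zero or even become negative, which is one reason an exact method based on the chi-square distribution is preferable when normality can be assumed.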

Example 2. Find an approximate 80% confidence interval for the variance of the random variable X under the conditions of Example 1, if it is known that X is distributed according to a law close to normal.

Solution. The value of $t_p$ remains the same as in Table 14.3.1:

According to the formula (14.3.16)

According to the formula (14.3.18) we find the confidence interval:

The corresponding range of values ​​of the standard deviation: (0.21; 0.29).

14.4. Exact methods for constructing confidence intervals for the parameters of a random variable distributed according to the normal law

In the previous subsection, we considered roughly approximate methods for constructing confidence intervals for the mean and variance. Here we give an idea of the exact methods for solving the same problem. We emphasize that to find the confidence intervals exactly, it is absolutely necessary to know in advance the form of the distribution law of the quantity X, whereas this is not necessary for the approximate methods.

The idea of exact methods for constructing confidence intervals is as follows. Any confidence interval is found from a condition expressing the probability that certain inequalities hold, which include the estimate $\tilde a$ of interest to us. The distribution law of the estimate $\tilde a$ in the general case depends on the unknown parameters of the quantity X. However, it is sometimes possible to pass in the inequalities from the random variable $\tilde a$ to some other function of the observed values $X_1, X_2, \ldots, X_n$ whose distribution law does not depend on the unknown parameters, but only on the number of experiments n and on the form of the distribution law of X. Random variables of this kind play a large role in mathematical statistics; they have been studied in most detail for the case of a normal distribution of X.

For example, it has been proved that under a normal distribution of the quantity X the random variable

$T = \sqrt{n}\,\dfrac{\tilde m - m}{\sqrt{\tilde D}}$ (14.4.1)

obeys the so-called Student's distribution law with n − 1 degrees of freedom; the density of this law has the form

$S_{n-1}(t) = \dfrac{\Gamma\left(\frac{n}{2}\right)}{\sqrt{(n - 1)\pi}\,\Gamma\left(\frac{n - 1}{2}\right)}\left(1 + \dfrac{t^2}{n - 1}\right)^{-n/2}$, (14.4.2)

where Γ(x) is the gamma function:

$\Gamma(x) = \int_0^\infty u^{x - 1}e^{-u}\,du$.

It has also been proved that the random variable

$V = \dfrac{(n - 1)\tilde D}{D}$ (14.4.3)

has a $\chi^2$ distribution with n − 1 degrees of freedom (see chapter 7), the density of which is expressed by the formula

$k_{n-1}(v) = \dfrac{1}{2^{\frac{n-1}{2}}\,\Gamma\left(\frac{n-1}{2}\right)}\,v^{\frac{n-1}{2} - 1}e^{-v/2}$ for v > 0, and $k_{n-1}(v) = 0$ for v < 0. (14.4.4)

Without dwelling on the derivations of distributions (14.4.2) and (14.4.4), we will show how they can be applied to constructing confidence intervals for the parameters m and D.

Suppose n independent experiments have been performed on a random variable X distributed according to the normal law with unknown parameters m and σ. For these parameters, the estimates

$\tilde m = \dfrac{1}{n}\sum_{i=1}^{n} X_i; \qquad \tilde D = \dfrac{\sum_{i=1}^{n}(X_i - \tilde m)^2}{n - 1}$

have been obtained. It is required to construct confidence intervals for both parameters corresponding to the confidence probability p.

Let us first construct a confidence interval for the mathematical expectation. It is natural to take this interval symmetric with respect to $\tilde m$; denote by $\varepsilon_p$ half the length of the interval. The value $\varepsilon_p$ must be chosen so that the condition

$P(|\tilde m - m| < \varepsilon_p) = p$ (14.4.5)

is satisfied.

Let us try to pass, on the left side of equality (14.4.5), from the random variable $\tilde m$ to the random variable T distributed according to Student's law. To do this, we multiply both sides of the inequality $|\tilde m - m| < \varepsilon_p$ by the positive value $\sqrt{n}/\sqrt{\tilde D}$:

$\dfrac{\sqrt{n}\,|\tilde m - m|}{\sqrt{\tilde D}} < \dfrac{\varepsilon_p}{\sqrt{\tilde D / n}}$,

or, using the notation (14.4.1),

$|T| < \dfrac{\varepsilon_p}{\sqrt{\tilde D / n}}$.

Let us find a number $t_p$ such that $P(|T| < t_p) = p$. The value $t_p$ can be found from the condition

$\int_{-t_p}^{t_p} S_{n-1}(t)\,dt = p$. (14.4.8)

It can be seen from formula (14.4.2) that $S_{n-1}(t)$ is an even function, so (14.4.8) gives

$2\int_0^{t_p} S_{n-1}(t)\,dt = p$. (14.4.9)

Equality (14.4.9) determines the value $t_p$ depending on p. If a table of values of the integral $2\int_0^x S_{n-1}(t)\,dt$ is available, the value $t_p$ can be found by reverse interpolation in the table. However, it is more convenient to compile a table of values of $t_p$ in advance. Such a table is given in the Appendix (Table 5). It shows the values of $t_p$ depending on the confidence probability p and the number of degrees of freedom n − 1. Having determined $t_p$ from Table 5 and putting

$\varepsilon_p = t_p\sqrt{\tilde D / n}$,

we find half the width of the confidence interval $I_p$ and the interval itself:

$I_p = \left(\tilde m - t_p\sqrt{\tilde D / n};\ \tilde m + t_p\sqrt{\tilde D / n}\right)$.
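
When no table of $t_p$ is at hand, condition (14.4.9) can be solved numerically. The sketch below is one possible approach, not the method of the text: it integrates the Student density by Simpson's rule and finds $t_p$ by bisection, using only the standard library. All function names are hypothetical, and k stands for the number of degrees of freedom n − 1.

```python
import math

def student_pdf(t, k):
    """Density of Student's distribution with k degrees of freedom."""
    c = math.gamma((k + 1) / 2) / (math.sqrt(k * math.pi) * math.gamma(k / 2))
    return c * (1 + t * t / k) ** (-(k + 1) / 2)

def central_probability(t, k, steps=2000):
    """P(|T| < t): twice the Simpson-rule integral of the even density over [0, t]."""
    h = t / steps
    total = student_pdf(0.0, k) + student_pdf(t, k)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * student_pdf(i * h, k)
    return 2 * total * h / 3

def t_critical(p, k):
    """Solve central_probability(t, k) = p for t by bisection on [0, 100]."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if central_probability(mid, k) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

t_19 = t_critical(0.8, 19)
print(f"t_p for p = 0.8, 19 degrees of freedom: {t_19:.3f}")
```

For p = 0.8 and 19 degrees of freedom this agrees with the tabulated value 1.328 used in Example 2 below.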

Example 1. 5 independent experiments were performed on a random variable X, normally distributed with unknown parameters m and σ. The results of the experiments are given in Table 14.4.1.

Table 14.4.1

Find an estimate $\tilde m$ for the mathematical expectation and construct a 90% confidence interval $I_p$ for it (i.e., the interval corresponding to the confidence probability p = 0.9).

Solution. We have:

From Table 5 of the Appendix, for n − 1 = 4 and p = 0.9 we find $t_p$, whence

The confidence interval will be

Example 2. For the conditions of Example 1 of Subsection 14.3, assuming that X is normally distributed, find the exact confidence interval.

Solution. From Table 5 of the Appendix, for n − 1 = 19 and p = 0.8 we find $t_p$ = 1.328; hence

Comparing with the solution of Example 1 of Subsection 14.3 ($\varepsilon_p$ = 0.072), we see that the discrepancy is very small. If we keep accuracy to the second decimal place, the confidence intervals found by the exact and approximate methods coincide:

Let's move on to constructing a confidence interval for the variance. Consider the unbiased variance estimate

and express the random variable D through the value V(14.4.3) having distribution x 2 (14.4.4):

Knowing the distribution law of the quantity V, it is possible to find the interval / (1 ) in which it falls with a given probability p.

distribution law k n _ x (v) the value of I 7 has the form shown in fig. 14.4.1.

Rice. 14.4.1

The question arises: how should the interval $I_p$ be chosen? If the distribution law of V were symmetric (like the normal law or Student's distribution), it would be natural to take $I_p$ symmetric about the mathematical expectation. Here, however, the law $k_{n-1}(v)$ is asymmetric. Let us agree to choose $I_p$ so that the probabilities of V falling outside the interval to the right and to the left (the shaded areas in Fig. 14.4.1) are the same and equal to $\alpha/2 = (1 - p)/2$.

To construct an interval $I_p$ with this property, we use Table 4 of the appendix: it lists numbers $\chi^2$ such that the probability $P(V > \chi^2)$ takes a given value

for a quantity V having the $\chi^2$ distribution with r degrees of freedom. In our case r = n - 1. Fix r = n - 1 and find in the corresponding row of Table 4 two values of $\chi^2$: one corresponding to the probability $\alpha/2$, the other to the probability $1 - \alpha/2$. Denote these values $\chi_1^2$ and $\chi_2^2$. The interval $I_p$ has $\chi_2^2$ as its left end and $\chi_1^2$ as its right end.

Now we find the required confidence interval for the variance, with boundaries $D_1$ and $D_2$, which covers the point D with probability p:

Let us construct an interval $(D_1, D_2)$ that covers the point D if and only if the quantity V falls into the interval $I_p$. Let us show that the interval

satisfies this condition. Indeed, the inequalities are equivalent to the inequalities

and these inequalities hold with probability p. Thus the confidence interval for the variance has been found; it is expressed by formula (14.4.13).
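For reference, the equivalence used in this argument can be sketched in standard notation. The sketch assumes that $V = (n-1)\tilde{D}/D$, where $\tilde{D}$ is the unbiased variance estimate, and that $\chi_1^2 > \chi_2^2$ are the table values for the upper-tail probabilities $(1-p)/2$ and $(1+p)/2$; these symbol names are assumptions of this sketch, not a quotation of formula (14.4.13).

```latex
\chi_2^2 < \frac{(n-1)\tilde{D}}{D} < \chi_1^2
\quad\Longleftrightarrow\quad
\frac{(n-1)\tilde{D}}{\chi_1^2} < D < \frac{(n-1)\tilde{D}}{\chi_2^2}
```

Because the larger table value sits in the denominator of the lower bound, the resulting interval for D is not symmetric about $\tilde{D}$.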

Example 3. Find the confidence interval for the variance under the conditions of example 2 of subsection 14.3, if it is known that the value X distributed normally.

Solution. According to Table 4 of the appendix, for r = n - 1 = 19 we find

According to formula (14.4.13), the confidence interval for the variance is

Corresponding interval for standard deviation: (0.21; 0.32). This interval only slightly exceeds the interval (0.21; 0.29) obtained in Example 2 of Subsection 14.3 by the approximate method.

  • Figure 14.3.1 considers a confidence interval that is symmetric about a. In general, as we will see later, this is not necessary.

Estimation of confidence intervals

Learning objectives

Statistics deals with the following two main tasks:

    We have some estimate based on sample data and we want to make some probabilistic statement about where the true value of the parameter being estimated is.

    We have a specific hypothesis that needs to be tested based on sample data.

In this topic, we consider the first problem. We also introduce the definition of a confidence interval.

A confidence interval is an interval that is built around the estimated value of a parameter and shows where the true value of the estimated parameter lies with an a priori given probability.

After studying the material on this topic, you:

    learn what the confidence interval of an estimate is;

    learn to classify statistical problems;

    master the technique of constructing confidence intervals, both using statistical formulas and using software tools;

    learn to determine the required sample sizes to achieve certain parameters of accuracy of statistical estimates.

Distributions of sample characteristics

T-distribution

As discussed above, the distribution of the random variable $(\bar{X} - \mu)/(\sigma/\sqrt{n})$ is close to the standardized normal distribution with parameters 0 and 1. Since we do not know the value of σ, we replace it with an estimate s. The quantity $(\bar{X} - \mu)/(s/\sqrt{n})$ has a different distribution, namely the t-distribution, or Student's distribution, which is determined by the parameter n - 1 (the number of degrees of freedom). This distribution is close to the normal distribution (the larger n, the closer the two distributions).

Fig. 95 shows Student's distribution with 30 degrees of freedom. As you can see, it is very close to the normal distribution.

Similar to the functions NORMDIST and NORMINV for working with the normal distribution, there are functions for working with the t-distribution: STUDIST (TDIST) and STUDRASPBR (TINV). An example of the use of these functions can be found in the STUDRIST.XLS file (template and solution) and in Fig. 96.

Distributions of other characteristics

As we already know, to determine the accuracy of the estimate of the expectation we need the t-distribution. To estimate other parameters, such as the variance, other distributions are required. Two of them are the F-distribution and the $\chi^2$-distribution.

Confidence interval for the mean

Confidence interval is an interval that is built around the estimated value of the parameter and shows where the true value of the estimated parameter lies with a priori given probability.

The construction of a confidence interval for the mean value occurs in the following way:
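In standard notation the interval can be sketched as follows, with $\bar{x}$ the sample mean, $s$ the sample standard deviation, $n$ the sample size and $t_{cr}$ the two-sided critical value of the t-distribution with $n - 1$ degrees of freedom (these symbol names are assumed here for illustration):

```latex
\bar{x} \pm t_{cr}\,\frac{s}{\sqrt{n}}
```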

Example

The fast food restaurant plans to expand its assortment with a new type of sandwich. In order to estimate the demand for it, the manager plans to randomly select 40 visitors from among those who have already tried it and ask them to rate the new product on a scale from 1 to 10. The manager wants to estimate the expected number of points that the new product will receive and construct a 95% confidence interval for this estimate. How can this be done? (See file SANDWICH1.XLS (template and solution).)

Solution

This problem can be solved with software tools. The results are presented in Fig. 97.

Confidence interval for the total value

Sometimes, according to sample data, it is required to estimate not the mathematical expectation, but the total sum of values. For example, in a situation with an auditor, it may be of interest to estimate not the average value of an invoice, but the sum of all invoices.

Let N be the total number of elements, n the sample size, $T_s$ the sum of the values in the sample, and T' the estimate of the sum over the entire population. Then $T' = N \cdot T_s / n = N\bar{x}$, and the confidence interval is calculated by the formula $T' \pm t_{cr} \cdot N s / \sqrt{n}$, where s is the sample estimate of the standard deviation and $\bar{x}$ is the sample estimate of the mean.

Example

Let's say a tax office wants to estimate the total amount of tax refunds for 10,000 taxpayers. Each taxpayer either receives a refund or pays additional taxes. Find the 95% confidence interval for the refund amount, assuming a sample size of 500 people (see file REFUND AMOUNT.XLS (template and solution)).

Solution

There is no special procedure in StatPro for this case; however, the bounds can be obtained from the bounds for the mean using the above formulas (Fig. 98).

Confidence interval for proportion

Let p be the expected proportion of customers, and let $\hat{p}$ be an estimate of this proportion obtained from a sample of size n. It can be shown that for sufficiently large n the distribution of the estimate is close to normal with mean p and standard deviation $\sqrt{p(1-p)/n}$. The standard error of the estimate in this case is expressed as $SE = \sqrt{\hat{p}(1-\hat{p})/n}$, and the confidence interval as $\hat{p} \pm z_{cr} \cdot SE$.

Example

The fast food restaurant plans to expand its assortment with a new type of sandwich. In order to estimate the demand for it, the manager randomly selected 40 visitors from among those who had already tried it and asked them to rate the new product on a scale from 1 to 10. The manager wants to estimate the expected proportion of customers who rate the new product at least 6 points (he expects these customers to be the consumers of the new product).

Solution

First, we create a new column containing 1 if the customer's score was at least 6 points and 0 otherwise (see the SANDWICH2.XLS file (template and solution)).

Method 1

Counting the number of ones, we estimate the proportion and then apply the formulas.

The value of $z_{cr}$ is taken from normal distribution tables (for example, 1.96 for a 95% confidence interval).

Using this approach and the specific data to construct a 95% interval, we obtain the following results (Fig. 99). The critical value $z_{cr}$ is 1.96. The standard error of the estimate is 0.077. The lower limit of the confidence interval is 0.475 and the upper limit is 0.775. Thus, the manager can state with 95% confidence that the percentage of customers who rate the new product at 6 points or more lies between 47.5 and 77.5.
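Method 1 can be sketched in a few lines of Python. This is an illustrative sketch rather than the spreadsheet from the original: the function name `proportion_ci` is hypothetical, and the count of 25 favourable ratings out of 40 is an assumption inferred from the reported bounds (their midpoint is 0.625).

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = successes / n
    # Two-sided critical value z_cr (about 1.96 for a 95% interval)
    z_cr = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    std_err = sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z_cr * std_err, p_hat + z_cr * std_err, std_err


# ASSUMPTION: 25 of 40 visitors gave at least 6 points (inferred, not stated)
lower, upper, std_err = proportion_ci(25, 40)
print(round(std_err, 3), round(lower, 3), round(upper, 3))
```

With the assumed count, the rounded standard error and bounds agree with the figures quoted above.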

Method 2

This problem can be solved using standard StatPro tools. To do this, it suffices to note that the share in this case coincides with the average value of the Type column. Next apply StatPro/Statistical Inference/One-Sample Analysis to build a confidence interval for the mean value (expectation estimate) for the Type column. The results obtained in this case will be very close to the result of the 1st method (Fig. 99).

Confidence interval for standard deviation

s is used as an estimate of the standard deviation (the formula is given in section 1). The distribution of s is described through the chi-squared distribution: the quantity $(n-1)s^2/\sigma^2$ follows a chi-squared distribution which, like the t-distribution, has n - 1 degrees of freedom. There are special functions for working with this distribution: CHI2DIST (CHIDIST) and CHI2OBR (CHIINV).

The confidence interval in this case is no longer symmetric. A schematic of its boundaries is shown in Fig. 100.

Example

The machine should produce parts with a diameter of 10 cm. However, due to various circumstances, errors occur. The quality controller is concerned about two things: first, the average diameter should be 10 cm; second, even if it is, large deviations will cause many parts to be rejected. Every day he takes a sample of 50 parts (see file QUALITY CONTROL.XLS (template and solution)). What conclusions can be drawn from such a sample?

Solution

We construct 95% confidence intervals for the mean and for the standard deviation using StatPro/Statistical Inference/One-Sample Analysis (Fig. 101).

Further, assuming a normal distribution of diameters, we calculate the proportion of defective products, setting a maximum deviation of 0.065. Using the capabilities of the lookup table (the case of two parameters), we construct the dependence of the percentage of rejects on the mean value and standard deviation (Fig. 102).

Confidence interval for the difference of two means

This is one of the most important applications of statistical methods. Examples of such situations:

    A clothing store manager would like to know how much more or less the average female shopper spends in the store than a male.

    The two airlines fly similar routes. A consumer organization would like to compare the difference between the average expected flight delay times for both airlines.

    The company sends out coupons for certain types of goods in one city and does not send out in another. Managers want to compare the average purchases of these items over the next two months.

    A car dealer often deals with married couples at presentations. To understand their personal reactions to the presentation, couples are often interviewed separately. The manager wants to evaluate the difference in ratings given by men and women.

Case of independent samples

The mean difference will have a t-distribution with n 1 + n 2 - 2 degrees of freedom. The confidence interval for μ 1 - μ 2 is expressed by the ratio:
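Under the equal-variance assumption implied by the $n_1 + n_2 - 2$ degrees of freedom, the interval is commonly written with a pooled standard deviation $s_p$. The notation below is an assumed standard form, not a quotation of the original formula; $s_1$ and $s_2$ denote the two sample standard deviations.

```latex
(\bar{x}_1 - \bar{x}_2) \pm t_{cr}\, s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}},
\qquad
s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}
```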

This problem can be solved not only by the above formulas, but also by standard StatPro tools. To do this, it is enough to apply

Confidence interval for difference between proportions

Let $p_1$ and $p_2$ be the mathematical expectations of the proportions, and let $\hat{p}_1$ and $\hat{p}_2$ be their sample estimates built on samples of sizes $n_1$ and $n_2$, respectively. Then $\hat{p}_1 - \hat{p}_2$ is an estimate of the difference $p_1 - p_2$, and the confidence interval for this difference is expressed as:

$(\hat{p}_1 - \hat{p}_2) \pm z_{cr} \cdot SE$

Here $z_{cr}$ is the critical value taken from normal distribution tables (for example, 1.96 for a 95% confidence interval).

The standard error of the estimate is expressed in this case by the relation:

$SE = \sqrt{\hat{p}_1(1 - \hat{p}_1)/n_1 + \hat{p}_2(1 - \hat{p}_2)/n_2}$.

Example

In preparation for a big sale, the store undertook the following marketing research. The top 300 buyers were selected and randomly divided into two groups of 150 members each. All of the selected buyers were sent invitations to participate in the sale, but a coupon giving the right to a 5% discount was attached only for members of the first group. During the sale, the purchases of all 300 selected buyers were recorded. How can a manager interpret the results and make a judgment about the effectiveness of couponing? (See COUPONS.XLS file (template and solution)).

Solution

In our case, out of 150 customers who received a discount coupon, 55 made a purchase during the sale, while among the 150 who did not receive a coupon only 35 made a purchase (Fig. 103). The sample proportions are therefore 0.3667 and 0.2333, and the sample difference between them is 0.1333. For a 95% confidence interval, the normal distribution table gives $z_{cr}$ = 1.96. The standard error of the sample difference is 0.0524. Finally, the lower limit of the 95% confidence interval is 0.0307 and the upper limit is 0.2359. These results can be interpreted as follows: for every 100 customers who receive a discount coupon, we can expect from 3 to 23 new customers. However, this conclusion in itself does not mean that coupons are effective, because by providing a discount we lose profit. Let us demonstrate this on specific data. Suppose that the average purchase amount is 400 rubles, of which 50 rubles is the store's profit. Then the expected profit per 100 customers who did not receive a coupon is:

50 × 0.2333 × 100 = 1166.50 rubles.

Similar calculations for 100 buyers who received a coupon give:

30 × 0.3667 × 100 = 1100.10 rubles.

The average profit falls to 30 rubles because buyers who received a coupon use the 5% discount and, on average, make a purchase of 380 rubles.

Thus, the final conclusion indicates the inefficiency of using such coupons in this particular situation.
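The interval and the profit comparison in this example can be reproduced with a short Python sketch. The function name `diff_proportions_ci` is hypothetical; the counts, the profit figures and the 95% level are taken from the example. Unrounded proportions are used, so the profit values differ slightly from the rounded ones in the text.

```python
from math import sqrt
from statistics import NormalDist


def diff_proportions_ci(x1, n1, x2, n2, confidence=0.95):
    """Normal-approximation interval for the difference p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    z_cr = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Unpooled standard error of the difference between two proportions
    std_err = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z_cr * std_err, diff + z_cr * std_err, std_err


lower, upper, std_err = diff_proportions_ci(55, 150, 35, 150)
print(round(std_err, 4), round(lower, 4), round(upper, 4))

# Expected profit per 100 customers, in rubles:
# 50 rubles per purchase without a coupon, 30 rubles with one
profit_no_coupon = 50 * (35 / 150) * 100
profit_coupon = 30 * (55 / 150) * 100
print(round(profit_no_coupon, 2), round(profit_coupon, 2))
```

The sketch separates the statistical question (is the difference in proportions positive?) from the economic one (does the extra volume pay for the discount?), which is the point of the example.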

Comment. This problem can also be solved using standard StatPro tools. To do this, it suffices to reduce it to the problem of estimating the difference between two means, and then apply StatPro/Statistical Inference/Two-Sample Analysis to build a confidence interval for the difference between the two mean values.

Confidence interval control

The length of the confidence interval depends on the following conditions:

    the data themselves (standard deviation);

    significance level;

    sample size.

Sample size for estimating the mean

Let us first consider the problem in the general case. Denote the required half-length of the confidence interval by B (Fig. 104). We know that the confidence interval for the mean of a random variable X is expressed as $\bar{X} \pm t_{cr} \cdot SE$, where $SE = s/\sqrt{n}$. Setting

$B = t_{cr} \cdot s/\sqrt{n}$

and solving for n, we get $n = (t_{cr} \cdot s / B)^2$.

Unfortunately, we do not know the exact value of the variance of the random variable X. In addition, we do not know the value of $t_{cr}$, since it depends on n through the number of degrees of freedom. In this situation we can proceed as follows. Instead of the standard deviation s, we use an estimate $\sigma_{est}$ obtained from whatever realizations of the random variable are available. Instead of $t_{cr}$, we use the value $z_{cr}$ for the normal distribution. This is quite acceptable, since the density functions of the normal and t-distributions are very close (except for small n). Thus, the desired formula takes the form:

$n = (z_{cr} \cdot \sigma_{est} / B)^2$.

Since the formula generally gives non-integer results, the result rounded up is taken as the desired sample size.
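A minimal Python sketch of this rule, $n = (z_{cr} \cdot \sigma_{est} / B)^2$ rounded up, is given below. The function name is hypothetical, and the value `sigma_est=1.5` is a placeholder chosen only for illustration; it is not an estimate taken from the text.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_mean(sigma_est, half_width, confidence=0.95):
    """Smallest n with z_cr * sigma_est / sqrt(n) <= half_width."""
    z_cr = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Round up, because the raw formula is generally not an integer
    return ceil((z_cr * sigma_est / half_width) ** 2)


# ASSUMPTION: sigma_est = 1.5 is a hypothetical prior estimate
n_required = sample_size_for_mean(sigma_est=1.5, half_width=0.3)
print(n_required)
```

Because the required n grows with the square of $\sigma_{est}/B$, halving the target half-length roughly quadruples the sample size.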

Example

The fast food restaurant plans to expand its assortment with a new type of sandwich. In order to estimate the demand for it, the manager plans to randomly select a number of visitors from among those who have already tried it and ask them to rate the new product on a scale from 1 to 10. The manager wants to estimate the expected number of points that the new product will receive and construct a 95% confidence interval for that estimate. However, he wants the half-length of the confidence interval not to exceed 0.3. How many visitors does he need to poll?

The formula for determining the sample size when estimating a proportion is as follows:

$n = z_{cr}^2 \cdot p_{est}(1 - p_{est}) / B^2$

Here $p_{est}$ is an estimate of the proportion p, and B is the given half-length of the confidence interval. A conservative (overestimated) value of n is obtained by using $p_{est}$ = 0.5. In this case, the half-length of the confidence interval will not exceed the given value B for any true value of p.

Example

Let the manager from the previous example plan to estimate the proportion of customers who prefer a new type of product. He wants to construct a 90% confidence interval whose half length is less than or equal to 0.05. How many clients should be randomly sampled?

Solution

In our case, $z_{cr}$ = 1.645. Using the conservative value $p_{est}$ = 0.5, the required sample size is $n = 1.645^2 \cdot 0.5 \cdot 0.5 / 0.05^2 \approx 270.6$, which rounds up to 271.

If the manager had reason to believe that p is, for example, about 0.3, then substituting this value into the above formula would give a smaller sample size, namely 228.
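The two cases of this example can be checked with a short Python sketch, assuming the usual normal-approximation rule $n = z_{cr}^2 \cdot p_{est}(1 - p_{est}) / B^2$ rounded up. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_proportion(p_est, half_width, confidence=0.90):
    """Sample size giving a half-length of at most half_width."""
    z_cr = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z_cr ** 2 * p_est * (1 - p_est) / half_width ** 2)


# Conservative choice p_est = 0.5 versus a prior guess of 0.3
n_conservative = sample_size_for_proportion(0.5, 0.05)
n_with_prior = sample_size_for_proportion(0.3, 0.05)
print(n_conservative, n_with_prior)
```

The product $p_{est}(1 - p_{est})$ is largest at 0.5, which is why that choice can only overestimate the required sample size.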

The formula for determining the sample sizes when estimating the difference between two means is written as:

$n_1 = n_2 = 2 (z_{cr} \cdot \sigma_{est} / B)^2$.

Example

Some computer company has a customer service center. Recently, the number of customer complaints about the poor quality of service has increased. The service center mainly employs two types of employees: those with little experience, but who have completed special training courses, and those with extensive practical experience, but who have not completed special courses. The company wants to analyze customer complaints over the past six months and compare their average numbers per each of the two groups of employees. It is assumed that the numbers in the samples for both groups will be the same. How many employees must be included in the sample to get a 95% interval with a half length of no more than 2?

Solution

Here $\sigma_{est}$ is an estimate of the standard deviation of both random variables, under the assumption that the two are close. Thus, in our problem we need to obtain this estimate somehow. This can be done, for example, as follows. Looking at customer complaint data for the past six months, a manager may notice that there are generally between 6 and 36 complaints per employee. Knowing that for a normal distribution virtually all values lie no more than three standard deviations from the mean, he can reasonably assume that:

$6\sigma_{est} = 36 - 6 = 30$, whence $\sigma_{est}$ = 5.

Substituting this value into the formula $n_1 = n_2 = 2 (z_{cr} \cdot \sigma_{est} / B)^2$, we get $2 (1.96 \cdot 5 / 2)^2 \approx 48.02$, that is, 49 employees in each group.
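As a rough check, the calculation can be sketched in Python under the assumption that both groups share the same standard deviation and the same size, so that the half-length is $z_{cr} \cdot \sigma_{est} \sqrt{2/n}$. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_diff_means(sigma_est, half_width, confidence=0.95):
    """Per-group n for the difference of two means, equal group sizes."""
    z_cr = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # The variance of the difference is 2 * sigma^2 / n, hence the factor 2
    return ceil(2 * (z_cr * sigma_est / half_width) ** 2)


# sigma_est = 5 comes from the range rule (36 - 6) / 6 used in the example
n_per_group = sample_size_diff_means(sigma_est=5, half_width=2)
print(n_per_group)
```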

The formula for determining the sample size when estimating the difference between two proportions is:

$n_1 = n_2 = z_{cr}^2 [p_{1,est}(1 - p_{1,est}) + p_{2,est}(1 - p_{2,est})] / B^2$

Example

Some company has two factories for the production of similar products. The manager of a company wants to compare the defect rates of both factories. According to available information, the rejection rate at both factories is from 3 to 5%. It is supposed to build a 99% confidence interval with a half length of no more than 0.005 (or 0.5%). How many products should be selected from each factory?

Solution

Here $p_{1,est}$ and $p_{2,est}$ are estimates of the two unknown proportions of rejects at the first and second factories. If we set $p_{1,est} = p_{2,est} = 0.5$, we obtain an overestimated value of n. But since in our case we have some a priori information about these proportions, we take their upper estimate, namely 0.05. We get
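The calculation can be sketched in Python, assuming the normal-approximation rule $n = z_{cr}^2 [p_{1,est}(1 - p_{1,est}) + p_{2,est}(1 - p_{2,est})] / B^2$ with equal sample sizes. The function name is hypothetical, and the critical value is computed exactly rather than read from a table, so the result can differ by a few units from a hand calculation with a rounded $z_{cr}$.

```python
from math import ceil
from statistics import NormalDist


def sample_size_diff_proportions(p1_est, p2_est, half_width, confidence=0.99):
    """Per-factory n for the difference of two proportions."""
    z_cr = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    spread = p1_est * (1 - p1_est) + p2_est * (1 - p2_est)
    return ceil(z_cr ** 2 * spread / half_width ** 2)


# Upper prior estimate of 5% rejects at both factories, half-length 0.005
n_per_factory = sample_size_diff_proportions(0.05, 0.05, 0.005)
# For comparison: the conservative choice of 0.5 for both proportions
n_conservative = sample_size_diff_proportions(0.5, 0.5, 0.005)
print(n_per_factory, n_conservative)
```

The comparison shows how much the a priori information reduces the required sample relative to the conservative choice.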

When some population parameters are estimated from sample data, it is useful to provide not only a point estimate of the parameter, but also a confidence interval that shows where the exact value of the parameter being estimated may lie.

In this chapter, we also got acquainted with quantitative relationships that allow us to build such intervals for various parameters; learned ways to control the length of the confidence interval.

We also note that the problem of estimating the sample size (experiment planning problem) can be solved using standard StatPro tools, namely StatPro/Statistical Inference/Sample Size Selection.

The mind is not only in knowledge, but also in the ability to apply knowledge in practice. (Aristotle)

Confidence intervals

General overview

Taking a sample from the population, we will obtain a point estimate of the parameter of interest to us and calculate the standard error in order to indicate the accuracy of the estimate.

However, in most cases the standard error by itself is of limited use. It is much more useful to combine this measure of precision with an interval estimate for the population parameter.

This can be done by using knowledge of the theoretical probability distribution of the sample statistic (parameter) in order to calculate a confidence interval (CI) for the parameter.

In general, the confidence interval extends the estimate in both directions by some multiple of the standard error (of the given parameter); the two values (confidence limits) that define the interval are usually separated by a comma and enclosed in parentheses.

Confidence interval for mean

Using the normal distribution

The sample mean has a normal distribution if the sample size is large, so knowledge of the normal distribution can be applied when considering the sample mean.

In particular, 95% of the distribution of the sample means is within 1.96 standard deviations (SD) of the population mean.

When we have only one sample, we call this standard deviation the standard error of the mean (SEM) and calculate the 95% confidence interval for the mean as follows:

$\bar{x} \pm 1.96 \cdot SEM$

If the experiment is repeated many times, the interval will contain the true population mean 95% of the time.

This interval is usually interpreted as the range of values within which the true population mean (general mean) lies with 95% confidence.

Although it is not quite strict (the population mean is a fixed value and therefore cannot have a probability related to it) to interpret the confidence interval in this way, it is conceptually easier to understand.

Using the t-distribution

You can use the normal distribution if you know the value of the variance in the population. Also, when the sample size is small, the sample mean follows a normal distribution if the data underlying the population are normally distributed.

If the data underlying the population are not normally distributed and/or the general variance (population variance) is unknown, the sample mean obeys Student's t-distribution.

Calculate the 95% confidence interval for the population mean as follows:

$\bar{x} \pm t_{0.05} \cdot SEM$

where $t_{0.05}$ is the percentage point (percentile) of Student's t-distribution with (n - 1) degrees of freedom that gives a two-tailed probability of 0.05.

In general, it provides a wider interval than when using a normal distribution, because it takes into account the additional uncertainty that is introduced by estimating the population standard deviation and/or due to the small sample size.

When the sample size is large (of the order of 100 or more), the difference between the two distributions (Student's t and normal) is negligible. However, the t-distribution is always used when calculating confidence intervals, even if the sample size is large.

Usually 95% CI is indicated. Other confidence intervals can be calculated, such as 99% CI for the mean.

To obtain it, multiply the standard error not by the tabulated value of the t-distribution that corresponds to a two-tailed probability of 0.05, but by the value that corresponds to a two-tailed probability of 0.01. This gives a wider confidence interval than in the 95% case, because it reflects increased confidence that the interval does indeed include the population mean.

Confidence interval for proportion

The sampling distribution of a proportion is binomial. However, if the sample size n is reasonably large, the sampling distribution of the proportion is approximately normal with mean $\pi$, the population proportion.

We estimate $\pi$ by the sample proportion p = r/n (where r is the number of individuals in the sample with the characteristic of interest), and its standard error is estimated by:

$SE(p) = \sqrt{p(1 - p)/n}$

The 95% confidence interval for the proportion is estimated by:

$p \pm 1.96 \cdot SE(p)$

If the sample size is small (usually when np or n(1 - p) is less than 5), the binomial distribution must be used to calculate exact confidence intervals.

Note that if p is expressed as a percentage, then (1 - p) is replaced by (100 - p).

Interpretation of confidence intervals

When interpreting the confidence interval, we are interested in the following questions:

How wide is the confidence interval?

A wide confidence interval indicates that the estimate is imprecise; a narrow one indicates a precise estimate.

The width of the confidence interval depends on the size of the standard error, which in turn depends on the sample size and, for a numerical variable, on the variability of the data. Studies of a small data set with highly variable values therefore give wider confidence intervals than studies of a large data set with less variable values.

Does the CI include any values of particular interest?

You can check whether a likely value for a population parameter falls within the confidence interval. If it does, the results are consistent with this likely value. If not, it is unlikely (for a 95% confidence interval, the chance is at most 5%) that the parameter has this value.