The sample may be. An example of a non-representative sample

Interval estimation of event probability. Formulas for calculating the number of samples in the case of a random selection method.

To estimate the probability of an event of interest, we use the sampling method: we carry out n independent experiments, in each of which event A may or may not occur (the probability p of event A occurring is the same in every experiment). The relative frequency p* of occurrences of event A in the series of n trials is then taken as a point estimate of the probability p that A occurs in a single trial. The value p* is called the sample share of occurrences of event A, and p is called the general share.

By a corollary of the central limit theorem (the de Moivre-Laplace theorem), for a large sample size the relative frequency of an event can be considered normally distributed with parameters M(p*) = p and $\sigma(p^*) = \sqrt{p(1-p)/n}$.

Therefore, for n > 30, the confidence interval for the general share can be built using the formulas:

$$p^* - \varepsilon < p < p^* + \varepsilon, \qquad \varepsilon = u_{cr}\sqrt{\frac{p^*(1-p^*)}{n}},$$

where u_cr is found from the tables of the Laplace function for the given confidence probability γ: 2Φ(u_cr) = γ.

With a small sample size, n ≤ 30, the marginal error ε is determined from the Student distribution table:

$$\varepsilon = t_{cr}\sqrt{\frac{p^*(1-p^*)}{n}},$$

where t_cr = t(k; α), with k = n - 1 degrees of freedom and probability α = 1 - γ (two-sided area).

These formulas are valid if the selection was random and repeated (with replacement, or from an infinite general population); otherwise a correction for non-repeated selection is needed (see the table).

Average sampling error for the general share

| Population | Infinite | Finite, of size N |
|---|---|---|
| Selection type | Repeated | Non-repeated |
| Average sampling error | $\sqrt{\frac{p^*(1-p^*)}{n}}$ | $\sqrt{\frac{p^*(1-p^*)}{n}\left(1-\frac{n}{N}\right)}$ |

Formulas for calculating the sample size with a proper random selection method

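| Selection method | Sample size for the mean | Sample size for the share |
|---|---|---|
| Repeated | $n = \frac{t^2\sigma^2}{\varepsilon^2}$ | $n = \frac{t^2 w(1-w)}{\varepsilon^2}$ |
| Non-repeated | $n = \frac{t^2\sigma^2 N}{\varepsilon^2 N + t^2\sigma^2}$ | $n = \frac{t^2 w(1-w) N}{\varepsilon^2 N + t^2 w(1-w)}$ |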

Problems about the general share

The question "Does the confidence interval cover a given value p0?" can be answered by testing the statistical hypothesis H0: p = p0. It is assumed that the experiments follow the Bernoulli trial scheme (they are independent, and the probability p of event A is constant). From a sample of size n, the relative frequency of event A is determined: p* = m/n, where m is the number of occurrences of event A in the series of n trials. To test the hypothesis H0, statistics are used that, for a sufficiently large sample size, have a standard normal distribution (Table 1).
Table 1 - Hypotheses about the general share

| Hypothesis | H0: p = p0 | H0: p1 = p2 |
|---|---|---|
| Assumptions | Bernoulli trial scheme | Bernoulli trial scheme |
| Sample estimates | $p^* = m/n$ | $p_1^* = m_1/n_1$, $p_2^* = m_2/n_2$, $p^* = \frac{m_1+m_2}{n_1+n_2}$ |
| Statistic K | $K = \frac{p^* - p_0}{\sqrt{p_0(1-p_0)/n}}$ | $K = \frac{p_1^* - p_2^*}{\sqrt{p^*(1-p^*)\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}}$ |
| Distribution of K | Standard normal N(0,1) | Standard normal N(0,1) |

Example #1. Using random sampling with replacement, the company's management surveyed 900 of its employees. There were 270 women among the respondents. Construct a confidence interval that covers the true share of women in the whole staff of the firm with probability 0.95.
Solution. By condition, the sample share of women is p* = 270/900 = 0.3 (the relative frequency of women among all respondents). Since the selection is repeated and the sample size is large (n = 900), the marginal sampling error is determined by the formula

$$\varepsilon = u_{cr}\sqrt{\frac{p^*(1-p^*)}{n}}.$$

The value of u_cr is found from the table of the Laplace function from the relation 2Φ(u_cr) = γ, i.e. Φ(u_cr) = 0.95/2 = 0.475. The Laplace function (Appendix 1) takes the value 0.475 at u_cr = 1.96. Therefore the marginal error is ε = 1.96·√(0.3·0.7/900) ≈ 0.03, and the desired confidence interval is
(p* - ε, p* + ε) = (0.3 - 0.03; 0.3 + 0.03) = (0.27; 0.33)
So, with probability 0.95, the share of women in the whole staff of the firm lies between 0.27 and 0.33.
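
The calculation in Example #1 can be checked numerically. The following is a minimal sketch, assuming repeated sampling and the normal approximation; the function name `proportion_ci` is illustrative and not part of the original text. The critical value is taken as the (1 + γ)/2 quantile of the standard normal distribution, which is equivalent to solving 2Φ(u_cr) = γ with the Laplace function.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(m, n, gamma):
    """Large-sample confidence interval for a general share (repeated sampling)."""
    p_hat = m / n                                   # sample share p*
    # 2*Phi(u_cr) = gamma for the Laplace function is the same as taking
    # the (1 + gamma)/2 quantile of the standard normal distribution.
    u_cr = NormalDist().inv_cdf((1 + gamma) / 2)
    eps = u_cr * sqrt(p_hat * (1 - p_hat) / n)      # marginal error
    return p_hat - eps, p_hat + eps, eps

low, high, eps = proportion_ci(m=270, n=900, gamma=0.95)
print(f"eps = {eps:.3f}, interval = ({low:.2f}; {high:.2f})")
# prints: eps = 0.030, interval = (0.27; 0.33)
```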

Example #2. The car park owner considers a day "lucky" if the car park is more than 80% full. During the year, 40 inspections of the car park were carried out, of which 24 were "lucky". With probability 0.98, find the confidence interval for the true share of "lucky" days during the year.
Solution. The sample share of "lucky" days is p* = m/n = 24/40 = 0.6.
From the table of the Laplace function we find the value of u_cr for the given confidence level γ = 0.98:
Φ(u_cr) = γ/2 = 0.49, so u_cr = 2.33.
Considering the selection non-repeated (no two inspections were carried out on the same day), we find the marginal error:
$$\varepsilon = u_{cr}\sqrt{\frac{p^*(1-p^*)}{n}\left(1-\frac{n}{N}\right)},$$
where n = 40 and N = 365 (days). Hence
$$\varepsilon = 2.33\sqrt{\frac{0.6\cdot 0.4}{40}\left(1-\frac{40}{365}\right)} \approx 0.17,$$
and the confidence interval for the general share is (p* - ε, p* + ε) = (0.6 - 0.17; 0.6 + 0.17) = (0.43; 0.77)
With probability 0.98, the share of "lucky" days during the year can be expected to lie between 0.43 and 0.77.
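
A minimal sketch of the same calculation with the correction for non-repeated selection, assuming the factor (1 - n/N) under the square root; the function name `proportion_ci_finite` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci_finite(m, n, N, gamma):
    """Confidence interval for a share, non-repeated selection from N units."""
    p_hat = m / n
    u_cr = NormalDist().inv_cdf((1 + gamma) / 2)
    # Finite-population correction: the error shrinks as n approaches N.
    eps = u_cr * sqrt(p_hat * (1 - p_hat) / n * (1 - n / N))
    return p_hat - eps, p_hat + eps, eps

low, high, eps = proportion_ci_finite(m=24, n=40, N=365, gamma=0.98)
print(f"eps = {eps:.2f}, interval = ({low:.2f}; {high:.2f})")
# prints: eps = 0.17, interval = (0.43; 0.77)
```

Without the correction factor the error would be slightly larger (about 0.18), which shows how little the correction matters while n is small compared with N.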

Example #3. After checking 2500 items in a batch, 400 items were found to be of the highest grade and the remaining n - m were not. How many items must be checked to determine the share of the highest grade with an accuracy of 0.01 and a confidence of 95%?
Solution. We use the sample-size formula for repeated selection:

$$n = \frac{t^2 w(1-w)}{\varepsilon^2}.$$

Φ(t) = γ/2 = 0.95/2 = 0.475, and according to the Laplace table this value corresponds to t = 1.96.
The sample share is w = 400/2500 = 0.16 and the sampling error is ε = 0.01, so n = 1.96²·0.16·0.84/0.01² ≈ 5163.1; that is, about 5164 items must be checked.
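
The sample-size calculation can be sketched as follows. The function name `sample_size_for_share` and the optional `N` argument are illustrative assumptions; the non-repeated branch uses the standard finite-population form n = t²w(1-w)N / (ε²N + t²w(1-w)). Because the code uses the exact normal quantile (about 1.95996) rather than the tabulated 1.96, the rounded-up result may differ by one unit from a hand calculation.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_share(w, eps, gamma, N=None):
    """Required sample size for estimating a share with marginal error eps."""
    t = NormalDist().inv_cdf((1 + gamma) / 2)
    core = t ** 2 * w * (1 - w)
    if N is None:
        n = core / eps ** 2                    # repeated selection
    else:
        n = core * N / (eps ** 2 * N + core)   # non-repeated selection
    return ceil(n)                             # round up to a whole unit

n_repeated = sample_size_for_share(w=0.16, eps=0.01, gamma=0.95)
n_finite = sample_size_for_share(w=0.16, eps=0.01, gamma=0.95, N=100_000)
print(n_repeated, n_finite)
```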

Example #4. A batch of products is accepted if the probability that a product meets the standard is at least 0.97. Among 200 randomly selected products of the tested batch, 193 were found to meet the standard. Can the batch be accepted at the significance level α = 0.02?
Solution. We formulate the null and alternative hypotheses.
H0: p = p0 = 0.97 - the unknown general share p is equal to the specified value p0 = 0.97. In terms of the problem: the probability that a part from the tested batch meets the standard is 0.97, i.e. the batch can be accepted.
H1: p < 0.97 - the probability that a part from the tested batch meets the standard is less than 0.97, i.e. the batch cannot be accepted. With this alternative hypothesis the critical region is left-sided.
The observed value of the statistic K (Table 1) is calculated for the given values p0 = 0.97, n = 200, m = 193:

$$K_{obs} = \frac{m/n - p_0}{\sqrt{p_0(1-p_0)/n}} = \frac{0.965 - 0.97}{\sqrt{0.97\cdot 0.03/200}} \approx -0.415.$$

The critical value is found from the table of the Laplace function from the equality

$$\Phi(K_{cr}) = \frac{1 - 2\alpha}{2}.$$

By condition α = 0.02, hence Φ(K_cr) = 0.48 and K_cr = 2.05. The critical region is left-sided, i.e. it is the interval (-∞; -K_cr) = (-∞; -2.05). The observed value K_obs = -0.415 does not belong to the critical region; therefore, at this significance level there is no reason to reject the null hypothesis. The batch of products can be accepted.
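
The left-sided test of Example #4 can be sketched in a few lines. The function name `share_test_left` is hypothetical; the critical value is computed as the (1 - α) quantile of the standard normal distribution, which corresponds to Φ(K_cr) = (1 - 2α)/2 for the Laplace function.

```python
from math import sqrt
from statistics import NormalDist

def share_test_left(m, n, p0, alpha):
    """Test H0: p = p0 against H1: p < p0 using the normal approximation."""
    k_obs = (m / n - p0) / sqrt(p0 * (1 - p0) / n)
    k_cr = NormalDist().inv_cdf(1 - alpha)   # positive critical value
    reject = k_obs < -k_cr                   # left-sided critical region
    return k_obs, k_cr, reject

k_obs, k_cr, reject = share_test_left(m=193, n=200, p0=0.97, alpha=0.02)
print(f"K_obs = {k_obs:.3f}, K_cr = {k_cr:.2f}, reject H0: {reject}")
# prints: K_obs = -0.415, K_cr = 2.05, reject H0: False
```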

Example #5. Two factories produce the same type of parts. To assess their quality, samples were taken from the products of these factories and the following results were obtained. Among 200 selected products of the first factory, 20 were defective, and among 300 products of the second factory, 15 were defective.
At a significance level of 0.025, find out whether there is a significant difference in the quality of the parts manufactured by these factories.
Solution. We test H0: p1 = p2 against the two-sided alternative H1: p1 ≠ p2. The sample shares are p1* = 20/200 = 0.1 and p2* = 15/300 = 0.05, and the pooled share is p* = (20 + 15)/(200 + 300) = 0.07. The observed value of the statistic (Table 1) is K_obs = (0.1 - 0.05)/√(0.07·0.93·(1/200 + 1/300)) ≈ 2.15.
By condition α = 0.025, hence Φ(K_cr) = (1 - α)/2 = 0.4875 and K_cr = 2.24. With a two-sided alternative, the region of admissible values has the form (-2.24; 2.24). The observed value K_obs = 2.15 falls within this interval, i.e. at this significance level there is no reason to reject the null hypothesis: the data do not show a significant difference in the quality of the two factories' products.
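
The comparison of two sample shares in Example #5 can be sketched as follows, assuming the pooled-share statistic for H0: p1 = p2 under the Bernoulli scheme; the function name `two_share_test` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def two_share_test(m1, n1, m2, n2, alpha):
    """Two-sided test of H0: p1 = p2 using the pooled sample share."""
    p1, p2 = m1 / n1, m2 / n2
    p_pool = (m1 + m2) / (n1 + n2)
    k_obs = (p1 - p2) / sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    k_cr = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    return k_obs, k_cr, abs(k_obs) > k_cr

k_obs, k_cr, reject = two_share_test(20, 200, 15, 300, alpha=0.025)
print(f"K_obs = {k_obs:.2f}, K_cr = {k_cr:.2f}, reject H0: {reject}")
# prints: K_obs = 2.15, K_cr = 2.24, reject H0: False
```

Note how close K_obs is to the critical value: at the more common level α = 0.05 (K_cr ≈ 1.96) the same data would lead to rejecting H0.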

Plan:

1. Problems of mathematical statistics.

2. Sample types.

3. Selection methods.

4. Statistical distribution of the sample.

5. Empirical distribution function.

6. Polygon and histogram.

7. Numerical characteristics of the variation series.

8. Statistical estimates of distribution parameters.

9. Interval estimates of distribution parameters.

1. Tasks and methods of mathematical statistics

Mathematical statistics is a branch of mathematics devoted to methods of collecting, analyzing and processing statistical observations for scientific and practical purposes.

Suppose we need to study a set of homogeneous objects with respect to some qualitative or quantitative feature that characterizes them. For example, for a batch of parts, whether a part meets the standard can serve as a qualitative feature, and the controlled dimension of the part can serve as a quantitative feature.

Sometimes a complete survey is carried out, i.e. every object is examined with respect to the feature of interest. In practice a complete survey is rarely used. For example, if the population contains a very large number of objects, a complete survey is physically impossible. If examining an object destroys it or is very costly, a complete survey makes no sense. In such cases a limited number of objects (a sample) is randomly selected from the entire population and studied.

The main task of mathematical statistics is to study the entire population from sample data, depending on the goal at hand: that is, to study the probabilistic properties of the population (the distribution law, numerical characteristics, etc.) in order to make managerial decisions under uncertainty.

2. Sample types

The general population is the set of objects from which the sample is drawn.

The sample population (sample) is a set of randomly selected objects.

The size of a population is the number of objects in it. The size of the general population is denoted N, and the size of the sample n.

Example:

If out of 1000 parts 100 parts are selected for examination, then the volume of the general population N = 1000, and the sample size n = 100.

Sampling can be done in two ways: after an object has been selected and observed, it may or may not be returned to the general population. Accordingly, samples are divided into repeated and non-repeated.

A repeated sample is one in which the selected object is returned to the general population before the next one is selected.

A non-repeated sample is one in which the selected object is not returned to the general population.

In practice, non-repeated random selection is usually used.

For the sample data to support confident conclusions about the feature of interest in the general population, the objects in the sample must represent the population correctly: the sample must reproduce the proportions of the population. In other words, the sample must be representative.

By virtue of the law of large numbers, it can be argued that the sample will be representative if it is carried out randomly.

If the size of the general population is large enough, and the sample is only a small part of this population, then the distinction between repeated and non-repeated samples is erased; in the limiting case, when an infinite general population is considered, and the sample has a finite size, this difference disappears.

Example:

The American magazine Literary Digest used statistical methods to forecast the outcome of the upcoming 1936 US presidential election. The candidates were F. D. Roosevelt and A. M. Landon. Telephone directories were taken as the source for the general population of Americans under study. From these, 4 million addresses were randomly selected, and the editors of the magazine sent postcards to them asking recipients to state their attitude towards the presidential candidates. After processing the results of the poll, the magazine published a forecast that Landon would win the upcoming election by a large margin. The forecast was wrong: Roosevelt won.
This can be seen as an example of a non-representative sample. In the United States in the first half of the twentieth century, only the wealthy part of the population, which supported Landon's views, had telephones.

3. Selection methods

In practice, various methods of selection are used, which can be divided into 2 types:

1. Selection that does not require dividing the general population into parts: a) simple random non-repeated selection; b) simple random repeated selection.

2. Selection in which the general population is divided into parts: a) typical selection; b) mechanical selection; c) serial selection.

Simple random selection is selection in which objects are drawn one at a time, at random, from the entire general population.

Typical selection is selection in which objects are drawn not from the entire general population but from each of its "typical" parts. For example, if a part is manufactured on several machines, the selection is made not from the whole set of parts produced by all machines but from the output of each machine separately. Such selection is used when the feature under study varies noticeably between the "typical" parts of the general population.

Mechanical selection is selection in which the general population is "mechanically" divided into as many groups as there are objects to be included in the sample, and one object is selected from each group. For example, to select 20% of the parts made by a machine, every 5th part is selected; to select 5% of the parts, every 20th, and so on. Sometimes such selection does not ensure a representative sample: if every 20th turned roller is selected and the cutter is replaced immediately after each selection, then all the selected rollers will have been turned with blunt cutters.

Serial selection is selection in which objects are drawn from the general population not one at a time but in "series", each of which is surveyed completely. For example, if products are manufactured by a large group of automatic machines, the output of only a few machines is examined completely.

In practice, combined selection is often used, in which the above methods are combined.
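
The selection methods above can be illustrated with a short sketch. Everything here is a made-up illustration: the population of part numbers 1..100, the split into two machines, the series of 10 consecutive parts, and the fixed seed are assumptions chosen only to make the example reproducible.

```python
import random

rng = random.Random(0)                 # arbitrary fixed seed for repeatability
population = list(range(1, 101))       # hypothetical part numbers 1..100

# Simple random selection
repeated = rng.choices(population, k=10)      # with replacement
non_repeated = rng.sample(population, k=10)   # without replacement

# Mechanical selection: a 20% sample takes one part from every group of 5
step = 5
start = rng.randrange(step)            # random position inside the first group
mechanical = population[start::step]

# Typical selection: sample separately within each "typical" part
machines = {"machine_1": population[:50], "machine_2": population[50:]}
typical = [x for parts in machines.values() for x in rng.sample(parts, k=5)]

# Serial selection: choose whole series at random and examine them completely
series = [population[i:i + 10] for i in range(0, len(population), 10)]
serial = [x for chosen in rng.sample(series, k=2) for x in chosen]

print(len(repeated), len(non_repeated), len(mechanical), len(typical), len(serial))
# prints: 10 10 20 10 20
```

Only the repeated sample can contain the same part more than once; all the other schemes shown here draw each object at most once.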

4. Statistical distribution of the sample

Let a sample be drawn from the general population, in which the value x_1 was observed n_1 times, x_2 was observed n_2 times, ..., x_k was observed n_k times. Then n = n_1 + n_2 + ... + n_k is the sample size. The observed values x_i are called variants, and the sequence of variants written in ascending order is called a variation series. The numbers of observations n_i are called frequencies (absolute frequencies), and their ratios to the sample size, w_i = n_i/n, are called relative frequencies or statistical probabilities.

If the number of variants is large, or the sample comes from a continuous general population, the variation series is compiled not from individual point values but from intervals of values. Such a series is called an interval series. The intervals must be of equal length.

The statistical distribution of the sample is the list of variants and their corresponding frequencies or relative frequencies.

The statistical distribution can also be specified as a sequence of intervals and their corresponding frequencies (the sum of the frequencies of the values that fall into each interval).

The point variation series of frequencies can be represented by a table:

| x_i | x_1 | x_2 | … | x_k |
|---|---|---|---|---|
| n_i | n_1 | n_2 | … | n_k |

Similarly, one can represent a point variation series of relative frequencies w_i = n_i/n, whose sum equals 1.

Example:

The number of letters in some text X turned out to be 1000. The first vowel encountered was "я", the second "и", the third "а", the fourth "ю". Then came the letters "о", "е", "у", "э", "ы".

Writing down the positions these letters occupy in the Russian alphabet, we have, respectively: 33, 10, 1, 32, 16, 6, 21, 31, 29.

Arranging these numbers in ascending order, we get the variation series: 1, 6, 10, 16, 21, 29, 31, 32, 33.

The frequencies with which the letters appear in the text are: "а" - 75, "е" - 87, "и" - 75, "о" - 110, "у" - 25, "ы" - 8, "э" - 3, "ю" - 7, "я" - 22.

We compose the point variation series of frequencies:

| x_i | 1 | 6 | 10 | 16 | 21 | 29 | 31 | 32 | 33 |
|---|---|---|---|---|---|---|---|---|---|
| n_i | 75 | 87 | 75 | 110 | 25 | 8 | 3 | 7 | 22 |

Example:

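A frequency distribution of a sample of size n = 20 is given.

Construct the point variation series of relative frequencies.

| x_i | 2 | 6 | 12 |
|---|---|---|---|
| n_i | 3 | 10 | 7 |

Solution:

Find the relative frequencies: w_1 = 3/20 = 0.15, w_2 = 10/20 = 0.5, w_3 = 7/20 = 0.35.

| x_i | 2 | 6 | 12 |
|---|---|---|---|
| w_i | 0.15 | 0.5 | 0.35 |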
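
As a small sketch, the relative frequencies in this example can be computed directly from the frequency table; the variable names are illustrative.

```python
# Frequency distribution from the example: variants and their frequencies
variants = [2, 6, 12]
freqs = [3, 10, 7]

n = sum(freqs)                                  # sample size
rel_freqs = [n_i / n for n_i in freqs]          # w_i = n_i / n

for x_i, w_i in zip(variants, rel_freqs):
    print(x_i, w_i)

# The relative frequencies of a complete distribution add up to 1
assert abs(sum(rel_freqs) - 1.0) < 1e-12
```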

When constructing an interval distribution, there are rules for choosing the number of intervals or the width of each interval. The criterion is a trade-off: as the number of intervals increases, the representation of the data improves, but the amount of data and the processing time grow. The difference x_max - x_min between the largest and smallest variants is called the range of the sample.

To find the number of intervals k, the empirical Sturges formula is usually applied (with rounding to the nearest convenient integer): k = 1 + 3.322 lg n, where lg is the base-10 logarithm.

Accordingly, the width h of each interval can be calculated using the formula: h = (x_max - x_min)/k.
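
A minimal sketch of the Sturges rule, assuming rounding to the nearest integer and an interval width equal to the sample range divided by k; the function names are illustrative.

```python
from math import log10

def sturges_intervals(n):
    """Number of intervals k = 1 + 3.322 * lg(n), rounded to the nearest integer."""
    return round(1 + 3.322 * log10(n))

def interval_width(x_min, x_max, n):
    """Width h of each interval: the sample range divided by k."""
    return (x_max - x_min) / sturges_intervals(n)

for n in (20, 50, 100, 1000):
    print(n, sturges_intervals(n))
# prints: 20 5, 50 7, 100 8, 1000 11 (one pair per line)
```

The number of intervals grows only logarithmically: a fifty-fold increase in sample size (from 20 to 1000) roughly doubles k.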

5. Empirical distribution function

Consider a sample from the general population. Let the statistical frequency distribution of the quantitative feature X be known. We introduce the notation: n_x is the number of observations in which a feature value less than x was observed; n is the total number of observations (the sample size). The relative frequency of the event X < x is n_x/n. If x changes, the relative frequency also changes, i.e. n_x/n is a function of x. Because it is found empirically, it is called empirical.

The empirical distribution function (sample distribution function) is the function F*(x) that gives, for each x, the relative frequency of the event X < x:

$$F^*(x) = \frac{n_x}{n},$$

where n_x is the number of variants less than x,

n is the sample size.

In contrast to the empirical distribution function of the sample, the distribution function F(x) of the general population is called the theoretical distribution function.

The difference between the empirical and theoretical distribution functions is that the theoretical function F(x) gives the probability of the event X < x, while the empirical function F*(x) gives the relative frequency of the same event. By the law of large numbers, the relative frequency F*(x) of the event X < x tends in probability to the probability F(x) of this event. That is, for large n, F*(x) and F(x) differ little from each other.

Thus it is advisable to use the empirical distribution function of the sample as an approximation of the theoretical (integral) distribution function of the general population.

F*(x) has all the properties of F(x).

1. The values of F*(x) belong to the interval [0; 1].

2. F*(x) is a non-decreasing function.

3. If x_1 is the smallest variant, then F*(x) = 0 for x ≤ x_1; if x_k is the largest variant, then F*(x) = 1 for x > x_k.

That is, F*(x) serves to estimate F(x).

If the sample is given by a variation series, the empirical function has the form:

$$F^*(x) = \begin{cases} 0, & x \le x_1,\\ n_1/n, & x_1 < x \le x_2,\\ (n_1+n_2)/n, & x_2 < x \le x_3,\\ \dots \\ 1, & x > x_k. \end{cases}$$

The graph of the empirical function is called the cumulative curve (cumulate).

Example:

Construct the empirical function for the given sample distribution.

| x_i | 2 | 6 | 10 |
|---|---|---|---|
| n_i | 12 | 18 | 30 |

Solution:

Sample size n = 12 + 18 + 30 = 60. The smallest variant is 2, so F*(x) = 0 for x ≤ 2. The event X < 6 (i.e. x_1 = 2) was observed 12 times, so F*(x) = 12/60 = 0.2 for 2 < x ≤ 6. The event X < 10 (i.e. x_1 = 2, x_2 = 6) was observed 12 + 18 = 30 times, so F*(x) = 30/60 = 0.5 for 6 < x ≤ 10. Since x = 10 is the largest variant, F*(x) = 1 for x > 10. The desired empirical function has the form:

$$F^*(x) = \begin{cases} 0, & x \le 2,\\ 0.2, & 2 < x \le 6,\\ 0.5, & 6 < x \le 10,\\ 1, & x > 10. \end{cases}$$

Cumulative curve: a step graph of F*(x) with jumps at x = 2, 6 and 10.

The cumulative curve makes it possible to read information off the graph, for example to determine the number of observations in which the feature value was less than 6, or not less than 6. Since F*(6) = 0.2, the number of observations in which the feature value was less than 6 is 0.2·n = 0.2·60 = 12. The number of observations in which the feature value was not less than 6 is (1 - 0.2)·n = 0.8·60 = 48.
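
The empirical distribution function can be sketched as a small closure. This assumes the definition with a strict inequality (relative frequency of the event X < x) and uses the distribution x = 2, 6, 10 with frequencies 12, 18, 30 from the example; the function name `empirical_cdf` is illustrative.

```python
def empirical_cdf(variants, freqs):
    """Return F*(x): the relative frequency of the event X < x."""
    n = sum(freqs)

    def F(x):
        # Add up the frequencies of all variants strictly less than x
        return sum(n_i for x_i, n_i in zip(variants, freqs) if x_i < x) / n

    return F

F = empirical_cdf([2, 6, 10], [12, 18, 30])
for x in (2, 6, 10, 11):
    print(x, F(x))
# prints: 2 0.0, 6 0.2, 10 0.5, 11 1.0 (one pair per line)

n = 60
print(round(F(6) * n), round((1 - F(6)) * n))   # observations < 6 and >= 6
# prints: 12 48
```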

If an interval variation series is given, then to compile the empirical distribution function, the midpoints of the intervals are found and the empirical distribution function is obtained from them similarly to the point variation series.

6. Polygon and histogram

For clarity, various graphs of the statistical distribution are built: polygons and histograms.

A frequency polygon is a broken line whose segments connect the points (x_1; n_1), (x_2; n_2), …, (x_k; n_k), where x_i are the variants and n_i are the corresponding frequencies.

A polygon of relative frequencies is a broken line whose segments connect the points (x_1; w_1), (x_2; w_2), …, (x_k; w_k), where x_i are the variants and w_i are the corresponding relative frequencies.

Example:

Plot the polygon of relative frequencies for the given sample distribution:

Solution:

For a continuous feature it is advisable to build a histogram: the interval containing all observed values of the feature is divided into several partial intervals of length h, and for each partial interval the value n_i is found, the sum of the frequencies of the variants that fall into the i-th interval. (For example, a person's height or weight is a continuous feature.)

A frequency histogram is a stepped figure consisting of rectangles whose bases are the partial intervals of length h and whose heights are equal to the ratio n_i/h (the frequency density).

The area of the i-th partial rectangle is h·(n_i/h) = n_i, the sum of the frequencies of the variants in the i-th interval; hence the area of the frequency histogram equals the sum of all frequencies, i.e. the sample size.

Example:

The results of voltage measurements (in volts) in an electrical network are given. Compose a variation series and build a polygon and a frequency histogram if the voltage values are as follows: 227, 215, 230, 232, 223, 220, 228, 222, 221, 226, 226, 215, 218, 220, 216, 220, 225, 212, 217, 220.

Solution:

Let us create a variation series. We have n = 20, x_min = 212, x_max = 232.

We use the Sturges formula to calculate the number of intervals: k = 1 + 3.322 lg 20 ≈ 5.32, so k = 5 and h = (232 - 212)/5 = 4.

The interval variation series of frequencies has the form:

| Interval | Frequency n_i | Frequency density n_i/h |
|---|---|---|
| 212-216 | 3 | 0.75 |
| 216-220 | 3 | 0.75 |
| 220-224 | 7 | 1.75 |
| 224-228 | 4 | 1 |
| 228-232 | 3 | 0.75 |
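
The grouping of the voltage data into intervals can be reproduced with a short sketch. It assumes left-closed intervals (a value equal to a boundary goes into the interval on its right), with the maximum value placed in the last interval; this convention is an assumption that matches the frequency densities listed for this example.

```python
from math import log10

data = [227, 215, 230, 232, 223, 220, 228, 222, 221, 226,
        226, 215, 218, 220, 216, 220, 225, 212, 217, 220]

n = len(data)
k = round(1 + 3.322 * log10(n))          # Sturges formula
x_min, x_max = min(data), max(data)
h = (x_max - x_min) / k                  # interval width

counts = [0] * k
for x in data:
    # Left-closed intervals; the maximum is assigned to the last interval
    i = min(int((x - x_min) // h), k - 1)
    counts[i] += 1

density = [c / h for c in counts]        # heights of the histogram bars
for j in range(k):
    left = x_min + j * h
    print(f"{left:.0f}-{left + h:.0f}: n_i = {counts[j]}, n_i/h = {density[j]}")
```

Running it gives the frequencies 3, 3, 7, 4, 3 and the densities 0.75, 0.75, 1.75, 1.0, 0.75; the bar areas h·(n_i/h) add up to the sample size 20.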

Let's build a histogram of frequencies:

Let's construct a polygon of frequencies by first finding the midpoints of the intervals:


A histogram of relative frequencies is a stepped figure consisting of rectangles whose bases are the partial intervals of length h and whose heights are equal to the ratio w_i/h (the relative frequency density).

The area of the i-th partial rectangle is equal to the relative frequency of the variants that fall into the i-th interval. Hence the area of the histogram of relative frequencies equals the sum of all relative frequencies, i.e. one.

7. Numerical characteristics of the variation series

Consider the main characteristics of the general and sample populations.

The general mean $\bar{x}_G$ is the arithmetic mean of the feature values in the general population.

If the feature values x_1, x_2, x_3, …, x_N of a general population of size N are all different, then:

$$\bar{x}_G = \frac{x_1 + x_2 + \dots + x_N}{N}.$$

If the feature values x_1, x_2, …, x_k have frequencies N_1, N_2, …, N_k with N_1 + N_2 + … + N_k = N, then

$$\bar{x}_G = \frac{x_1 N_1 + x_2 N_2 + \dots + x_k N_k}{N}.$$

The sample mean $\bar{x}_S$ is the arithmetic mean of the feature values in the sample.

If the feature values x_1, x_2, …, x_k have frequencies n_1, n_2, …, n_k with n_1 + n_2 + … + n_k = n, then

$$\bar{x}_S = \frac{x_1 n_1 + x_2 n_2 + \dots + x_k n_k}{n}.$$

Example:

Calculate the sample mean for the sample: x_1 = 51.12; x_2 = 51.07; x_3 = 52.95; x_4 = 52.93; x_5 = 51.1; x_6 = 52.98; x_7 = 52.29; x_8 = 51.23; x_9 = 51.07; x_10 = 51.04.

Solution: the sum of the ten values is 517.78, so the sample mean is 517.78/10 = 51.778.

The general variance D_G is the arithmetic mean of the squared deviations of the values of the feature X in the general population from the general mean.

If the feature values x_1, x_2, x_3, …, x_N of a general population of size N are all different, then:

$$D_G = \frac{\sum_{i=1}^{N} (x_i - \bar{x}_G)^2}{N}.$$

If the feature values x_1, x_2, …, x_k have frequencies N_1, N_2, …, N_k with N_1 + N_2 + … + N_k = N, then

$$D_G = \frac{\sum_{i=1}^{k} N_i (x_i - \bar{x}_G)^2}{N}.$$

The general standard deviation is the square root of the general variance: $\sigma_G = \sqrt{D_G}$.

The sample variance D_S is the arithmetic mean of the squared deviations of the observed feature values from the sample mean.

If the feature values x_1, x_2, x_3, ..., x_n of a sample of size n are all different, then:

$$D_S = \frac{\sum_{i=1}^{n} (x_i - \bar{x}_S)^2}{n}.$$

If the feature values x_1, x_2, …, x_k have frequencies n_1, n_2, …, n_k with n_1 + n_2 + … + n_k = n, then

$$D_S = \frac{\sum_{i=1}^{k} n_i (x_i - \bar{x}_S)^2}{n}.$$

The sample standard deviation is the square root of the sample variance: $\sigma_S = \sqrt{D_S}$.


Example:

The sampling set is given by the distribution table. Find the sample variance.


Solution:

Theorem: The variance is equal to the difference between the mean of the squares of the feature values and the square of the mean: $D = \overline{x^2} - (\bar{x})^2$.
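
The theorem can be checked numerically. The sketch below reuses the ten values from the sample-mean example earlier in this section and compares the variance computed by definition with the "mean of squares minus square of the mean" form; the variable names are illustrative.

```python
from math import sqrt
from statistics import fmean

sample = [51.12, 51.07, 52.95, 52.93, 51.1, 52.98, 52.29, 51.23, 51.07, 51.04]

mean_x = fmean(sample)                                   # sample mean
var_by_definition = fmean((x - mean_x) ** 2 for x in sample)
var_by_theorem = fmean(x * x for x in sample) - mean_x ** 2
std_dev = sqrt(var_by_definition)                        # sample standard deviation

print(round(mean_x, 3))
# prints: 51.778
print(abs(var_by_definition - var_by_theorem) < 1e-9)
# prints: True
```

In floating-point arithmetic the second form subtracts two large, nearly equal numbers, so the definition-based form is usually preferred when the values are large relative to their spread.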

Example:

Find the variance for this distribution.



Solution:

8. Statistical estimates of distribution parameters

Let the general population be studied by some sample. In this case, it is possible to obtain only an approximate value of the unknown parameter Q, which serves as its estimate. It is obvious that estimates can vary from one sample to another.

A statistical estimate Q* of an unknown parameter Q of the theoretical distribution is a function f of the observed sample values. The task of statistical estimation of unknown parameters is to construct, from the available observations, a function that gives the most accurate approximate values of the true parameter values, which are unknown to the researcher.

Statistical estimates are divided into point and interval estimates, depending on how they are given (a number or an interval).

A point estimate is a statistical estimate of the parameter Q of the theoretical distribution that is determined by a single value Q* = f(x_1, x_2, ..., x_n), where x_1, x_2, ..., x_n are the results of empirical observations of the quantitative feature X in some sample.

Such parameter estimates obtained from different samples usually differ from one another. The absolute difference |Q* - Q| is called the sampling (estimation) error.

For statistical estimates to give reliable results about the estimated parameters, they must be unbiased, efficient and consistent.

A point estimate whose mathematical expectation is equal (not equal) to the estimated parameter is called unbiased (biased). For an unbiased estimate, M(Q*) = Q.

The difference M(Q*) - Q is called the bias, or systematic error. For unbiased estimates the systematic error is 0.

An efficient estimate Q* is one that, for a given sample size n, has the smallest possible variance: D(Q*) = D_min (n = const). An efficient estimator has the smallest spread among unbiased and consistent estimators.

A consistent estimate Q* is one that, as n → ∞, tends in probability to the estimated parameter Q, i.e. as the sample size n grows, the estimate tends in probability to the true value of the parameter Q.

The consistency requirement is consistent with the law of large numbers: the more initial information about the object under study, the more accurate the result. If the sample size is small, then the point estimate of the parameter can lead to serious errors.

Any sample (of size n) can be thought of as an ordered set x_1, x_2, ..., x_n of independent, identically distributed random variables.

Sample means for different samples of size n from the same population will differ. That is, the sample mean can be regarded as a random variable, which means that we can speak of the distribution of the sample mean and of its numerical characteristics.

The sample mean satisfies all the requirements imposed on statistical estimates, i.e. gives an unbiased, efficient, and consistent estimate of the population mean.

It can be proved that $M(D_S) = \frac{n-1}{n} D_G$. Thus the sample variance is a biased estimate of the general variance and underestimates it; with a small sample size it produces a systematic error. To obtain an unbiased, consistent estimate, it suffices to take the quantity $s^2 = \frac{n}{n-1} D_S$, which is called the corrected variance, i.e. $s^2 = \frac{\sum_i n_i (x_i - \bar{x}_S)^2}{n-1}$.

In practice, the corrected variance is used to estimate the general variance when n < 30. Otherwise (n > 30) the difference between D_S and s² is hardly noticeable, so for large n the bias can be neglected.

One can also prove that the relative frequency n_i/n is an unbiased and consistent estimate of the probability P(X = x_i). The empirical distribution function F*(x) is an unbiased and consistent estimate of the theoretical distribution function F(x) = P(X < x).
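
The bias of the sample variance can be demonstrated exactly on a tiny example. The sketch below assumes a made-up population of four values and enumerates all equally likely ordered samples of size n = 3 drawn with replacement; averaging over them gives the mathematical expectation of each estimate.

```python
from itertools import product
from statistics import fmean, pvariance, variance

population = [1, 2, 3, 6]        # hypothetical general population
n = 3                            # sample size

general_var = pvariance(population)            # general variance (divide by N)

# All 4**3 equally likely ordered samples with replacement
samples = list(product(population, repeat=n))

mean_biased = fmean(pvariance(s) for s in samples)     # divides by n
mean_corrected = fmean(variance(s) for s in samples)   # divides by n - 1

print(general_var, mean_biased, mean_corrected)
```

The average of the uncorrected sample variance comes out as (n - 1)/n times the general variance (here 2/3 of 3.5), while the average of the corrected variance equals the general variance, which is what unbiasedness means.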

Example:

Find the unbiased estimates of the mean and variance from the sample table.

x i
n i

Solution:

Sample size n=20.

The unbiased estimate of the mathematical expectation is the sample mean.


To calculate the unbiased estimate of the variance, we first find the sample variance:

Now let's find the unbiased estimate:

9. Interval estimates of distribution parameters

An interval estimate is a statistical estimate determined by two numerical values, the ends of an interval.

The number δ > 0 for which |Q - Q*| < δ characterizes the accuracy of the interval estimate.

A confidence interval is an interval (Q* - δ; Q* + δ) that covers the unknown value of the parameter Q with a given probability γ. The complement of the confidence interval in the set of all possible values of the parameter Q is called the critical region. If the critical region lies on only one side of the confidence interval, the confidence interval is called one-sided: left-sided if the critical region lies only on the left, and right-sided if it lies only on the right. Otherwise the confidence interval is called two-sided.

The reliability, or confidence level, of the estimate of Q (by means of Q*) is the probability γ with which the inequality |Q - Q*| < δ holds.

Most often the confidence probability is set in advance (0.95; 0.99; 0.999), and it is required to be close to one.

The probability α = 1 - γ is called the probability of error, or the significance level.

Let |Q - Q*| < δ; then Q* - δ < Q < Q* + δ. This means that, with probability γ, the true value of the parameter Q belongs to the interval (Q* - δ; Q* + δ). The smaller the deviation δ, the more accurate the estimate.

The boundaries (ends) of the confidence interval are called confidence boundaries, or critical boundaries.

The values of the boundaries of the confidence interval depend on the distribution law of the estimate Q*.

The deviation δ, half the width of the confidence interval, is called the accuracy of the estimate.

Methods for constructing confidence intervals were first developed by the American statistician J. Neyman. The accuracy of the estimate δ, the confidence probability γ and the sample size n are interconnected; knowing the specific values of any two of them, one can always calculate the third.

Finding the confidence interval for estimating the mathematical expectation of a normal distribution if the standard deviation is known.

Let a sample be drawn from a normally distributed general population. Let the general standard deviation σ be known, while the mathematical expectation a of the theoretical distribution (M(X) = a) is unknown.

The following formula is valid:

$$P\left(|\bar{x}_S - a| < \delta\right) = 2\Phi\left(\frac{\delta\sqrt{n}}{\sigma}\right) = 2\Phi(t) = \gamma, \qquad \delta = \frac{t\sigma}{\sqrt{n}}.$$

That is, for a specified deviation δ one can find the probability with which the unknown general mean belongs to the interval (x̄_S - δ; x̄_S + δ), and vice versa. The formula shows that as the sample size increases at a fixed confidence probability, δ decreases, i.e. the accuracy of the estimate increases. As the reliability (confidence probability) increases, δ increases, i.e. the accuracy of the estimate decreases.

Example:

As a result of the tests, the following values were obtained: -25, 34, -20, 10, 21. It is known that they follow the normal distribution law with a standard deviation of 2. Find the estimate a* of the mathematical expectation a. Construct a 90% confidence interval for it.

Solution:

Let us find the unbiased estimate: a* = x̄_S = (-25 + 34 - 20 + 10 + 21)/5 = 4.

Then 2Φ(t) = 0.9, so Φ(t) = 0.45 and, from the table of the Laplace function, t = 1.645. Hence δ = tσ/√n = 1.645·2/√5 ≈ 1.47.

The confidence interval for a has the form: 4 - 1.47 < a < 4 + 1.47, or 2.53 < a < 5.47.
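
A minimal sketch of this calculation, assuming a normally distributed population with known σ; the function name `mean_ci_known_sigma` is illustrative. The value of t is taken as the (1 + γ)/2 quantile of the standard normal distribution, which is equivalent to 2Φ(t) = γ for the Laplace function.

```python
from math import sqrt
from statistics import NormalDist, mean

def mean_ci_known_sigma(sample, sigma, gamma):
    """Confidence interval for the mean of a normal population, sigma known."""
    a_hat = mean(sample)                           # unbiased point estimate
    t = NormalDist().inv_cdf((1 + gamma) / 2)
    delta = t * sigma / sqrt(len(sample))          # accuracy of the estimate
    return a_hat, delta

a_hat, delta = mean_ci_known_sigma([-25, 34, -20, 10, 21], sigma=2, gamma=0.9)
print(f"{a_hat - delta:.2f} < a < {a_hat + delta:.2f}")
# prints: 2.53 < a < 5.47
```

Raising the confidence level widens the interval: with `gamma=0.99` the same data give a half-width of about 2.30 instead of 1.47.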

Finding the confidence interval for estimating the mathematical expectation of a normal distribution if the standard deviation is unknown.

Let it be known that the general population is normally distributed, with a and σ unknown. The accuracy δ of the confidence interval that covers the true value of the parameter a with reliability γ is in this case calculated by the formula:

$\delta = \frac{t_\gamma s}{\sqrt{n}}$, where n is the sample size, s is the corrected sample standard deviation, and t_γ is Student's coefficient (found from the given values of n and γ in the table "Critical points of Student's distribution").

Example:

As a result of the tests, the following values were obtained: -35, -32, -26, -35, -30, -17. It is known that they obey the law of normal distribution. Find the confidence interval for the population mean a with a confidence level of 0.9.

Solution:

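Let's find the unbiased estimate of the mean:

$\bar{x} = \frac{-35 - 32 - 26 - 35 - 30 - 17}{6} = \frac{-175}{6} \approx -29.2$

Let's find the corrected sample standard deviation:

$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \approx 6.85$

Then, with $t_\gamma = 2.01$ for γ = 0.9 and k = n - 1 = 5,

$\delta = \frac{t_\gamma s}{\sqrt{n}} = \frac{2.01 \cdot 6.85}{\sqrt{6}} \approx 5.62$

The confidence interval will take the form (-29.2 - 5.62; -29.2 + 5.62), or (-34.82; -23.58).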
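The same steps can be sketched in Python as a rough check. The function name `mean_ci_unknown_sigma` is hypothetical. Because the Python standard library has no Student quantile function, the critical value is passed in as a parameter; the value 2.01 used below is an assumed table value for γ = 0.9 and k = 5 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

def mean_ci_unknown_sigma(values, t_crit):
    """Mean, corrected standard deviation and accuracy when sigma is unknown."""
    n = len(values)
    x_bar = mean(values)
    s = stdev(values)  # corrected standard deviation (divides by n - 1)
    delta = t_crit * s / sqrt(n)
    return x_bar, s, delta

# t_crit = 2.01 is an assumed table value (gamma = 0.9, k = 5), not computed here
x_bar, s, delta = mean_ci_unknown_sigma([-35, -32, -26, -35, -30, -17], t_crit=2.01)
print(round(x_bar, 2), round(s, 2), round(delta, 2))
print((round(x_bar - delta, 2), round(x_bar + delta, 2)))
```

The printed bounds may differ in the second decimal place from a hand calculation that rounds the mean to -29.2 before subtracting.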

Finding the confidence interval for the variance and standard deviation of a normal distribution

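Let a random sample of size n < 30 be taken from a general population of values distributed according to the normal law, and let the sample variances be calculated for it: the biased $D_B$ and the corrected $s^2$. Then, to find interval estimates with a given reliability γ for the general variance D and the general standard deviation σ, the following formulas are used:

$$\frac{\sqrt{n D_B}}{\chi_2} < \sigma < \frac{\sqrt{n D_B}}{\chi_1}$$

or,

$$\frac{s\sqrt{n - 1}}{\chi_2} < \sigma < \frac{s\sqrt{n - 1}}{\chi_1}$$

The values $\chi_1^2 < \chi_2^2$ are found using the table of critical points of the Pearson ($\chi^2$) distribution with k = n - 1 degrees of freedom; each cuts off a probability of (1 - γ)/2 in its tail.

The confidence interval for the variance is found from these inequalities by squaring all parts of the inequality.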

Example:

The quality of 15 bolts was checked. Assuming that the error in their manufacture follows the normal distribution law and that the sample standard deviation is equal to 5 mm, determine with reliability γ = 0.95 the confidence interval for the unknown parameter σ.

We represent the boundaries of the interval as a double inequality:
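One way to obtain such a double inequality is sketched below, assuming the standard chi-square interval $n D_B / \chi_2^2 < \sigma^2 < n D_B / \chi_1^2$, where $D_B$ is the biased sample variance (so that $n D_B = (n - 1)s^2$). The function name `sigma_ci_chi2` is hypothetical, and the two critical values are assumed table values for k = 14 degrees of freedom and γ = 0.95; a different table or rounding would shift the bounds slightly.

```python
from math import sqrt

def sigma_ci_chi2(n, biased_sd, chi2_low, chi2_high):
    """Interval for sigma from the biased sample standard deviation.

    chi2_low and chi2_high are the lower and upper critical points of the
    chi-square distribution with n - 1 degrees of freedom.
    """
    total = n * biased_sd ** 2  # n * D_B, equal to (n - 1) * s^2
    return sqrt(total / chi2_high), sqrt(total / chi2_low)

# Assumed table values for k = 14, gamma = 0.95: 5.63 and 26.12
low, high = sigma_ci_chi2(n=15, biased_sd=5.0, chi2_low=5.63, chi2_high=26.12)
print(round(low, 2), round(high, 2))
```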

The ends of the two-sided confidence interval for the variance can be determined without performing arithmetic operations for a given level of confidence and sample size using the corresponding table (Bounds of confidence intervals for the variance depending on the number of degrees of freedom and reliability). To do this, the ends of the interval obtained from the table are multiplied by the corrected variance s 2.

Example:

Let's solve the previous problem in a different way.

Solution:

Let's find the corrected variance from the sample standard deviation of 5 mm:

$s^2 = \frac{n}{n - 1} \cdot 5^2 = \frac{15}{14} \cdot 25 \approx 26.79$

According to the table "Bounds of confidence intervals for the variance depending on the number of degrees of freedom and reliability", we find the boundaries of the confidence interval for the variance at k = 14 and γ = 0.95: lower limit 0.513 and upper limit 2.354.

Multiply the obtained bounds by $s^2$ and extract the root (because we need a confidence interval not for the variance, but for the standard deviation):

$\sqrt{0.513 \cdot 26.79} < \sigma < \sqrt{2.354 \cdot 26.79}$, or 3.71 < σ < 7.94.

As can be seen from the examples, the bounds of the confidence interval depend on the method of its construction: the methods give close but different results.

For samples of sufficiently large size (n>30), the boundaries of the confidence interval for the general standard deviation can be determined by the formula $s(1 - q) < \sigma < s(1 + q)$, where q = q(γ; n) is a number that is tabulated and given in the corresponding reference table.

If q > 1 (so that 1 - q < 0), the formula takes the form: $0 < \sigma < s(1 + q)$

Example:

Let's solve the previous problem in the third way.

Solution:

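The corrected standard deviation was found previously: s = 5.17. From the table, q(0.95; 15) = 0.46.

Then, since q < 1, the interval is $s(1 - q) < \sigma < s(1 + q)$: 5.17 · (1 - 0.46) < σ < 5.17 · (1 + 0.46), or 2.79 < σ < 7.55.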
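To compare the table-based constructions for the bolt example side by side, here is a small sketch. The function names are hypothetical; the multipliers 0.513 and 2.354 and the value q = 0.46 are the table values quoted in the text, and the interval form s(1 - q) < σ < s(1 + q) is assumed for the q-based method.

```python
from math import sqrt

def sigma_ci_from_multipliers(s2, lower_mult, upper_mult):
    """Multiply the corrected variance by tabulated bounds, then take roots."""
    return sqrt(lower_mult * s2), sqrt(upper_mult * s2)

def sigma_ci_from_q(s, q):
    """Interval s(1 - q) < sigma < s(1 + q); the lower bound is 0 when q > 1."""
    return max(0.0, s * (1 - q)), s * (1 + q)

s2 = 15 / 14 * 5 ** 2  # corrected variance from the biased value 5^2
by_table = sigma_ci_from_multipliers(s2, 0.513, 2.354)
by_q = sigma_ci_from_q(5.17, 0.46)  # s rounded to 5.17 as in the text

print([round(b, 2) for b in by_table])
print([round(b, 2) for b in by_q])
```

Both intervals contain the sample value of about 5 mm, but their bounds differ, which illustrates that the result depends on the construction method.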

It is often necessary to analyze a particular social phenomenon and obtain information about it. Such tasks arise constantly in statistics and statistical research. Examining a social phenomenon in full is often impossible. For example, how can one find out the opinion of all residents of a certain city on some issue? Asking absolutely everyone is laborious and practically impossible. In such cases, we need a sample. Almost all research and analysis is based on this concept.

What is a sample

When analyzing a particular social phenomenon, it is necessary to obtain information about it. If we take any study, we can see that not every unit of the totality of the object of study is subject to research and analysis. Only a certain part of this totality is taken into account. This process is sampling: when only certain units from the set are examined.

Of course, much depends on the type of sample, but there are also basic rules. The main one says that selection from the population must be random: the units to be used should not be chosen on the basis of any criterion. Roughly speaking, if a sample is drawn from the population of a certain city and only men are selected, the study will contain an error, because the selection was made by gender rather than at random. Almost all sampling methods are based on this rule.

Sampling rules

In order for the selected set to reflect the main qualities of the whole phenomenon, it must be built according to specific laws, where the main attention should be paid to the following categories:

  • sample (sample population);
  • general population;
  • representativeness;
  • representativeness error;
  • population unit;
  • sampling methods.

Features of selective observation and sampling are as follows:

  1. All the results obtained are based on mathematical laws and rules; that is, if the study is conducted correctly and the calculations are correct, the results will not be distorted by subjective factors.
  2. It makes it possible to get a result much faster and with fewer resources, by studying not the entire array of events but only a part of them.
  3. It can be used to study a variety of objects: from specific questions, such as the age or gender of the group of interest, to public opinion or the level of material well-being of the population.

Selective observation

Selective observation is a statistical observation in which not the entire population under study is examined, but only a part of it, selected in a certain way, and the results obtained for this part are extended to the entire population. This part is called the sample population (the sample). It is the only way to study a very large object of study.

Selective observation is used when it is possible, or necessary, to study only a small group of units instead of the whole population. For example, when studying the ratio of men to women in the world, selective observation will be used. For obvious reasons, it is impossible to count every inhabitant of our planet.

But if the same study concerns not all the inhabitants of the Earth but a certain class 2 "A" in a particular school of a certain city, selective observation can be dispensed with. After all, it is quite possible to examine the entire object of study: count the boys and girls in the class, and that will be the ratio.

Sample and population

It's actually not as difficult as it sounds. In any object of study there are two sets: the general population and the sample population. All units belong to the general population; the sample consists of those units of the general population that were selected for study. If everything is done correctly, the selected part will be a scaled-down model of the entire (general) population.

The general population comes in only two varieties: definite and indefinite, depending on whether the total number of its units is known. If the population is definite, sampling is easier, because it is known what percentage of the total number of units will be sampled.

This point matters in research. Suppose, for example, that we need to determine the percentage of low-quality confectionery products at a particular plant, and that the population has already been defined: it is known for certain that the plant produces 1000 confectionery products per year. If we draw a sample of 100 random products from this thousand and send them for examination, the error will be minimal. Roughly speaking, 10% of all products were examined, and based on the results, taking the representativeness error into account, we can judge the quality of all products.

But if a sample of 100 confectionery products is drawn from an indefinite general population that actually contained, say, 1 million units, then the result of the sample and of the study itself will be unreliable and inaccurate. Feel the difference? That is why knowing the size of the general population is in most cases extremely important and strongly affects the result of the study.

Population representativeness

Now for one of the most important questions: what should the sample be? This is the key point of the study. At this stage, it is necessary to calculate the sample size and select units from the general population into it. The sample has been selected correctly if the features and characteristics of the general population are preserved in it. This property is called representativeness.

In other words, if the selected part retains the same tendencies and characteristics as the whole population under study, the sample is called representative. But a representative sample cannot be drawn from every population: there are objects of research for which a sample simply cannot be representative. This is where the concept of representativeness error comes from. It is discussed in more detail below.

How to make a selection

So, in order to maximize representativeness, there are three basic sampling rules:


Representativeness error

The main characteristic of the quality of a sample is its representativeness error: the discrepancy between the indicators obtained by selective observation and those obtained by continuous (complete) observation. According to the size of the error, representativeness is classified as reliable, ordinary or approximate, which corresponds to deviations of up to 3%, from 3 to 10% and from 10 to 20%, respectively. In statistics, however, it is desirable that the error not exceed 5-6%; otherwise there is reason to speak of insufficient representativeness of the sample. Many factors are taken into account when calculating the representativeness error and its effect on the sample:

  1. The probability with which an accurate result is to be obtained.
  2. Number of sampling units. As mentioned earlier, the smaller the number of units in the sample, the greater the representativeness error will be, and vice versa.
  3. Homogeneity of the study population. The more heterogeneous the population, the greater the representativeness error will be. The ability of a population to be representative depends on the homogeneity of all its constituent units.
  4. The method of selecting units for the sample population.

In specific studies, the percentage error of the mean is usually set by the researcher himself, based on the observation program and according to data from previous studies. As a rule, the maximum sampling error (error of representativeness) within 3-5% is considered acceptable.

More is not always better

It is also worth remembering that the main thing in organizing selective observation is to bring its volume to an acceptable minimum. At the same time, one should not strive to excessively reduce the sampling error limits, since this can lead to an unjustified increase in the amount of sample data and, consequently, to an increase in the cost of sampling.

At the same time, the size of the representativeness error should not be excessively increased. After all, in this case, although there will be a decrease in the sample size, this will lead to a deterioration in the reliability of the results obtained.
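The trade-off between the error limit and the sample size can be made concrete with the usual sample-size formula for a share under repeated random selection, $n = t^2 w(1 - w) / \varepsilon^2$. The sketch below is illustrative only: the function name `sample_size_for_share` is hypothetical, and the default w = 0.5 is an assumption that gives the largest (most cautious) sample size.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_share(eps, gamma=0.95, w=0.5):
    """Sample size needed so that the margin of error for a share is at most eps."""
    # Two-sided normal critical value for confidence probability gamma
    t = NormalDist().inv_cdf((1 + gamma) / 2)
    return ceil(t ** 2 * w * (1 - w) / eps ** 2)

for eps in (0.05, 0.03, 0.01):
    print(eps, sample_size_for_share(eps))
```

Because the error limit enters the formula squared, halving it roughly quadruples the required sample, which is why an excessively small error limit quickly becomes expensive.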

What questions are usually asked by the researcher?

Any research, if carried out, is for some purpose and to obtain some results. When conducting a sample survey, as a rule, the initial questions are:


Methods for selecting research units in the sample

Not every sample is representative. Sometimes the same characteristic is expressed differently in the whole and in its part. To meet the requirements of representativeness, it is advisable to use different sampling methods, and the choice of method depends on the specific circumstances. These sampling methods include:

  • random selection;
  • mechanical selection;
  • typical selection;
  • serial (nested) selection.

Random selection is a procedure for selecting population units at random, so that every unit of the general population has an equal probability of being included in the sample. It is advisable only when the population is homogeneous and has a small number of distinguishing features; otherwise some characteristic features risk not being reflected in the sample. The principles of random selection underlie all other sampling methods.
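A minimal sketch of random selection without repetition, using the standard library. The function name `simple_random_sample` and the numbered population are illustrative assumptions; a complete list of population units is required.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct units so that every unit has the same chance of selection."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(population, n)

# Hypothetical population: units numbered 0..999
population = list(range(1000))
sample = simple_random_sample(population, 50, seed=42)
print(len(sample), len(set(sample)))
```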

In mechanical selection, units are selected at a fixed interval. If it is necessary to form a sample of specific crimes, one can pull every 5th, 10th or 15th card from all the statistical record cards of registered crimes, depending on their total number and the required sample size. The disadvantage of this method is that a complete list of the population units is needed before selection, the list then has to be ordered, and only after that can units be sampled at a fixed interval. This method takes a lot of time, so it is not often used.
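Mechanical selection can be sketched as follows, assuming the units are already arranged in an ordered list. The function name `systematic_sample` and the list of card numbers are hypothetical; the starting point is chosen at random within the first interval.

```python
import random

def systematic_sample(ordered_units, k, seed=None):
    """Take every k-th unit from an ordered list, starting at a random offset."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # random position within the first k units
    return ordered_units[start::k]

# Hypothetical card file of 100 record cards, every 5th card is pulled
cards = list(range(1, 101))
sample = systematic_sample(cards, k=5, seed=1)
print(len(sample))
```

With 100 cards and an interval of 5, the sample always contains 20 cards, whatever the starting offset.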

Typical (regionalized) selection is a type of sample in which the general population is divided into homogeneous groups according to a certain attribute. Sometimes researchers use other terms instead of "groups": "districts" and "zones". Then, from each group, a certain number of units is randomly selected in proportion to the share of the group in the total population. A typical selection is often carried out in several stages.
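A single-stage version of typical selection with proportional allocation might look like the sketch below. The names `stratified_sample` and `stratum_of`, and the two-district population, are illustrative assumptions. Note that rounding the group quotas can make the total differ slightly from the requested size.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, n, seed=None):
    """Random selection within groups, proportional to each group's share."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for members in strata.values():
        # Proportional allocation; rounding may shift the total by a unit or two
        quota = round(n * len(members) / len(units))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample

# Hypothetical population: 600 units in one district, 400 in another
units = [("north", i) for i in range(600)] + [("south", i) for i in range(400)]
sample = stratified_sample(units, stratum_of=lambda u: u[0], n=50, seed=7)
print(sum(1 for u in sample if u[0] == "north"), sum(1 for u in sample if u[0] == "south"))
```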

Serial sampling is a method in which units are selected in groups (series), and all units of a selected group (series) are examined. Its advantage is that it is sometimes harder to select individual units than whole series, for example, when studying persons serving a sentence. Within the selected districts or zones, all units without exception are studied, for example, all persons serving sentences in a particular institution.
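Serial selection can be sketched in the same style: whole series are drawn at random, and every unit inside a drawn series is kept. The function name `cluster_sample` and the institutions in the example are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Pick whole series at random and keep all units of each picked series."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # random choice of series names
    return {name: list(clusters[name]) for name in chosen}

# Hypothetical series: institutions with lists of unit identifiers
institutions = {
    "institution_1": [101, 102, 103],
    "institution_2": [201, 202],
    "institution_3": [301, 302, 303, 304],
}
picked = cluster_sample(institutions, n_clusters=2, seed=0)
print(sorted(picked))
```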

A sample is the part of the objects of the general population that is selected for study in order to draw a conclusion about the entire population. For a conclusion obtained from the sample to be extended to the entire population, the sample must be representative.

Sample representativeness

Representativeness is the property of a sample to correctly reflect the general population. The same sample may be representative of one population and not of another.
Example:

A sample consisting entirely of Muscovites who own a car does not represent the entire population of Moscow.

The sample of Russian enterprises with up to 100 employees does not represent all enterprises in Russia.

The sample of Muscovites making purchases in the market does not represent the purchasing behavior of all Muscovites.

At the same time, these samples (subject to other conditions) can perfectly represent Muscovite car owners, small and medium-sized Russian enterprises and buyers making purchases in the markets, respectively.

It is important to understand that sample representativeness and sampling error are different phenomena. Representativeness, unlike error, does not depend on sample size.

No matter how much we increase the number of surveyed Muscovites-car owners, we will not be able to represent all Muscovites with this sample.

Sampling error (confidence interval)

Sampling error is the deviation of the results obtained by sample observation from the true data of the general population.

There are two types of sampling error: statistical and systematic. The statistical error depends on the sample size. The larger the sample size, the lower it is.

Example:
For a simple random sample of 400 units, the maximum statistical error (with 95% confidence) is 5%; for a sample of 600 units it is 4%; for a sample of 1100 units it is 3%.
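These figures can be reproduced with the normal approximation for a share, taking the worst case w = 0.5. The function name `max_margin_of_error` is hypothetical, and the sketch assumes repeated random selection from a large population.

```python
from math import sqrt
from statistics import NormalDist

def max_margin_of_error(n, gamma=0.95):
    """Largest margin of error for a share, reached when the share equals 0.5."""
    t = NormalDist().inv_cdf((1 + gamma) / 2)  # two-sided normal critical value
    return t * sqrt(0.5 * 0.5 / n)

for n in (400, 600, 1100):
    print(n, round(100 * max_margin_of_error(n), 1), "%")
```

Rounded to whole percentage points, the results agree with the 5%, 4% and 3% quoted in the example.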

The systematic error depends on various factors that have a constant impact on the study and bias the results of the study in a certain direction.

Example:
  • Any probability sample underestimates the proportion of high-income people who lead an active lifestyle, because such people are much harder to find in any particular place (for example, at home).
  • Some respondents refuse to answer the questions of the questionnaire (the share of "refuseniks" in Moscow ranges from 50% to 80% across different surveys).

In some cases, when the true distributions are known, the bias can be corrected by introducing quotas or reweighting the data, but in most real studies even estimating it can be quite problematic.
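Reweighting can be illustrated with a small sketch. All numbers below are hypothetical and chosen only for illustration: a sample in which high-income respondents are under-represented, and population shares that are assumed to be known. The function names are also hypothetical.

```python
def reweighting_factors(sample_counts, population_shares):
    """Weight for each group: its population share divided by its sample share."""
    n = sum(sample_counts.values())
    return {g: population_shares[g] / (sample_counts[g] / n) for g in sample_counts}

def weighted_share(sample_counts, positive_counts, weights):
    """Share of positive answers after each group is scaled by its weight."""
    numerator = sum(weights[g] * positive_counts[g] for g in sample_counts)
    denominator = sum(weights[g] * sample_counts[g] for g in sample_counts)
    return numerator / denominator

# Hypothetical data: 400 respondents, high-income group under-represented
sample_counts = {"low_income": 300, "high_income": 100}
positive_counts = {"low_income": 150, "high_income": 80}
population_shares = {"low_income": 0.6, "high_income": 0.4}  # assumed known

weights = reweighting_factors(sample_counts, population_shares)
raw = sum(positive_counts.values()) / sum(sample_counts.values())
adjusted = weighted_share(sample_counts, positive_counts, weights)
print(weights, round(raw, 3), round(adjusted, 3))
```

The correction only works to the extent that the population shares are actually known, which is the condition stated above.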

Sample types

Samples are divided into two types:

  • probability (probabilistic);
  • non-probability.

Probability samples

1.1 Random sampling (simple random selection)

Such a sample assumes that the general population is homogeneous, that all elements are equally accessible, and that a complete list of all elements is available. Elements are usually selected using a table of random numbers.

1.2 Mechanical (systematic) sampling

A kind of random sample in which the elements are ordered by some attribute (alphabetical order, phone number, date of birth, etc.). The first element is selected at random, and then every k-th element is selected until the sample contains n elements. The size of the general population is then N = n*k.

1.3 Stratified (zoned) sampling

It is used in case of heterogeneity of the general population. The general population is divided into groups (strata). In each stratum, selection is carried out randomly or mechanically.

1.4 Serial (nested or clustered) sampling

In serial sampling, the units of selection are not individual objects but groups (clusters or nests). Groups are selected at random, and all objects within the selected groups are surveyed.

Non-probability samples

The selection in such a sample is carried out not according to the principles of chance, but according to subjective criteria - accessibility, typicality, equal representation, etc.

2.1 Quota sampling

First, a certain number of groups of objects is defined (for example, men aged 20-30, 31-45 and 46-60; persons with an income of up to 30 thousand rubles, from 30 to 60 thousand rubles, and more than 60 thousand rubles). For each group, the number of objects to be surveyed is specified. This number is most often set either in proportion to the previously known share of the group in the general population or equal for every group. Within the groups, objects are selected randomly. Quota samples are used quite often in marketing research.

2.2 Snowball method

The sample is constructed as follows. Each respondent, starting with the first, is asked to provide contacts of friends, colleagues and acquaintances who fit the selection conditions and could take part in the study. Thus, with the exception of the first step, the sample is formed with the participation of the objects of study themselves. The method is often used when it is necessary to find and interview hard-to-reach groups of respondents (for example, respondents with a high income, respondents belonging to the same professional group, respondents who share a hobby or interest, etc.).

2.3 Spontaneous sampling

The most accessible respondents are polled. Typical examples of spontaneous sampling are surveys in newspapers and magazines, questionnaires given to respondents for self-completion, and most Internet surveys. The size and composition of spontaneous samples are not known in advance and are determined by only one parameter: the activity of the respondents.

2.4 Sample of typical cases

Units of the general population are selected that have an average (typical) value of the attribute. This raises the problem of choosing a feature and determining its typical value.

Implementation of the research plan

This stage, we recall, includes collecting the information and analyzing it. Implementing a marketing research plan typically requires the most effort and is the source of the greatest error.

When collecting statistical data, a number of shortcomings and problems arise:

firstly, some respondents may not be in the agreed place and they have to be contacted again or replaced;

secondly, some respondents may be uncooperative or give biased, knowingly false answers.

Thanks to modern computing and telecommunication technologies, data collection methods are developing and improving.

Some firms conduct surveys from a single center. Professional interviewers sit in offices and dial random phone numbers. When a call is answered, the interviewer asks the person who picked up to answer a few questions. The questions are read from the computer screen, and the respondent's answers are typed in on the keyboard. This method eliminates the need for formatting and encoding the data and reduces the number of errors.