What is sampling in statistics? Problems on the population proportion

General population

The totality of objects of observation (people, households, enterprises, settlements, etc.) that share a certain set of characteristics (gender, age, income, headcount, turnover, etc.) and are bounded in space and time.

Population examples:

  • All residents of Moscow (10.6 million people according to the 2002 census)
  • Muscovite men (4.9 million according to the 2002 census)
  • Russian legal entities (2.2 million at the beginning of 2005)
  • Retail outlets selling food products (20 thousand at the beginning of 2008), etc.

Sample (Sample population)

Part of the objects from the population selected for study in order to draw a conclusion about the entire population. In order for the conclusion obtained by studying the sample to be extended to the entire population, the sample must have the property of being representative.

Sample representativeness

The property of the sample to correctly reflect the general population. The same sample may or may not be representative of different populations.
Example:

  • A sample consisting entirely of Muscovites who own a car does not represent the entire population of Moscow.
  • The sample of Russian enterprises with up to 100 employees does not represent all enterprises in Russia.
  • The sample of Muscovites making purchases in the market does not represent the purchasing behavior of all Muscovites.

At the same time, these samples (subject to other conditions) can perfectly represent Muscovite car owners, small and medium-sized Russian enterprises and buyers making purchases in the markets, respectively.
It is important to understand that sample representativeness and sampling error are different phenomena. Representativeness, unlike error, does not depend on sample size.
Example:
No matter how much we increase the number of surveyed Muscovites-car owners, we will not be able to represent all Muscovites with this sample.

Sampling error (confidence interval)

The deviation of the results obtained with the help of sample observation from the true data of the general population.
There are two types of sampling error: statistical and systematic. The statistical error depends on the sample size. The larger the sample size, the lower it is.
Example:
For a simple random sample of 400 units, the maximum statistical error (at 95% confidence) is about 5%; for a sample of 600 units, about 4%; for a sample of 1100 units, about 3%.
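The figures in this example can be reproduced with a short Python sketch. It assumes the standard worst-case formula for a proportion under simple random sampling (p = 0.5, z = 1.96 for 95% confidence); the function name is illustrative and not taken from the source.

```python
import math

def max_margin_of_error(n, z=1.96):
    """Worst-case margin of error for a proportion (p = 0.5), simple random sample."""
    return z * math.sqrt(0.25 / n)

# Sample sizes quoted in the text
for n in (400, 600, 1100):
    print(n, f"{max_margin_of_error(n):.1%}")
```

The error shrinks with the square root of the sample size, which is why going from 5% to 3% requires almost three times as many respondents.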
The systematic error depends on various factors that have a constant impact on the study and bias the results of the study in a certain direction.
Example:

  • Any probability sample underestimates the share of people with high incomes who lead an active lifestyle. Such people are much harder to find in any particular place (for example, at home).
  • Respondents who refuse to answer questions (the share of refusals in Moscow ranges from 50% to 80%, depending on the survey).

In some cases, when true distributions are known, bias can be leveled out by introducing quotas or reweighting the data, but in most real studies, even estimating it can be quite problematic.

Sample types

Samples are divided into two types:

  • probability
  • non-probability

1. Probability samples
1.1 Random sampling (simple random selection)
Such a sample assumes a homogeneous general population, equal accessibility of all its elements, and a complete list of all elements. Elements are usually selected using a table of random numbers.
1.2 Mechanical (systematic) sampling
A variety of random sampling in which the population is first ordered by some attribute (alphabetical order, phone number, date of birth, etc.). The first element is selected at random, and then every k-th element is taken, i.e. the selection step is k. For a sample of size n, the size of the general population is then N = n*k.
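A minimal Python sketch of mechanical (systematic) selection follows. The population of 100 numbered units and the step k = 5 are illustrative assumptions, as is the function name.

```python
import random

def systematic_sample(items, k, rng=random):
    """Pick a random start among the first k items, then take every k-th item."""
    start = rng.randrange(k)
    return items[start::k]

population = list(range(1, 101))             # N = 100 ordered units (hypothetical)
sample = systematic_sample(population, k=5)  # n = N / k = 20 units
print(sample)
```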
1.3 Stratified (zoned)
It is used in case of heterogeneity of the general population. The general population is divided into groups (strata). In each stratum, selection is carried out randomly or mechanically.
1.4 Serial (nested or clustered) sampling
With serial sampling, the units of selection are not the objects themselves but groups (clusters or nests). Groups are selected at random, and every object within a selected group is surveyed.

2. Non-probability samples
The selection in such a sample is carried out not according to the principles of chance, but according to subjective criteria - accessibility, typicality, equal representation, etc.
2.1. Quota sampling
First, a certain number of groups of objects is defined (for example, men aged 20-30, 31-45 and 46-60; persons with an income of up to 30 thousand rubles, from 30 to 60 thousand rubles, and more than 60 thousand rubles). For each group, the number of objects to be surveyed is specified. This number is most often set either in proportion to the previously known share of the group in the general population, or equal for every group. Within the groups, objects are selected arbitrarily. Quota sampling is used quite often.
2.2. Snowball Method
The sample is constructed as follows. Each respondent, starting with the first, is asked to contact his friends, colleagues, acquaintances who would fit the selection conditions and could take part in the study. Thus, with the exception of the first step, the sample is formed with the participation of the objects of study themselves. The method is often used when it is necessary to find and interview hard-to-reach groups of respondents (for example, respondents with a high income, respondents belonging to the same professional group, respondents who have some similar hobbies / passions, etc.)
2.3 Spontaneous sampling
The most accessible respondents are polled. Typical examples of spontaneous samples are questionnaires printed in newspapers and magazines for respondents to fill in themselves, and most Internet surveys. The size and composition of a spontaneous sample are not known in advance and are determined by a single parameter: the activity of the respondents.
2.4 Sample of typical cases
Units of the general population are selected that have an average (typical) value of the attribute. This raises the problem of choosing a feature and determining its typical value.

Course of lectures on the theory of statistics


Sample survey.

The concept of the sampling method.

Sample observation is a non-continuous observation in which the units of the population to be studied are selected at random, the selected part is examined, and the results are then extended to the entire population.

The sampling method is used when:

1 the observation itself damages or destroys the observed units (testing yarn for breaking strength, electric light bulbs for burning time);

2 the population is very large;

3 the costs (financial and labor) of a complete survey are high.

Usually, 5-10% of the total population is subjected to a sample survey, less often 15-25%.

The purpose of sampling is to determine the characteristics of the general population: the general mean ($\bar{x}$) and the general proportion ($P$). The characteristics of the sample population, the sample mean ($\tilde{x}$) and the sample proportion ($w$), differ from the general characteristics by the amount of the sampling error ($\Delta$). It is therefore necessary to calculate the sampling error, or representativeness error, using the formulas that probability theory provides for each type of sample and selection method.

There are the following ways to select units:

1 Selection by the "returned ball" scheme, usually called repeated sampling (sampling with replacement).

With repeated selection, the probability of each individual unit entering the sample remains constant, because after a unit is selected it is returned to the population and can be selected again.

2 Selection by the "unreturned ball" scheme, called non-repeated sampling (sampling without replacement). In this case a selected unit is not returned, and the probability of the individual units entering the sample changes all the time (for the remaining units it increases). Units are selected by drawing lots or with tables of random numbers, for example, 75 out of 780.
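The two selection schemes can be sketched in Python with the standard `random` module. The population of 780 numbered units and the sample of 75 follow the example in the text; the fixed seed is only an assumption to make the run repeatable.

```python
import random

population = list(range(1, 781))   # 780 numbered units
rng = random.Random(0)             # fixed seed, for repeatability only

# Repeated selection: each unit is "returned" and may be drawn again
repeated = rng.choices(population, k=75)

# Non-repeated selection: a drawn unit cannot be drawn a second time
non_repeated = rng.sample(population, k=75)

print(len(set(repeated)), len(set(non_repeated)))
```

Only the non-repeated sample is guaranteed to contain 75 distinct units.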

Sample types.

1 Simple random sampling.

In this type of sampling, units are selected directly from the entire mass of units in the general population.

The number of selected units is usually determined from the accepted sampling fraction.

The sampling fraction is the ratio of the number of units in the sample, n, to the number of units in the general population, N: $n/N$.

So, with a 5% sample from a batch of goods of 2000 units, the sample size n is 100 units ($n = 0.05 \cdot 2000 = 100$), and with a 20% sample it is 400 units ($n = 0.20 \cdot 2000 = 400$).

An important condition for simple random sampling is that each unit of the population has an equal chance of being included in the sample.

With repeated random selection, the marginal sampling error for the mean is

$$\Delta_{\tilde{x}} = t\sqrt{\frac{\sigma^2}{n}}$$

where $\sigma^2$ is the sample variance,

n is the sample size,

t is the confidence factor, found from the table of values of the Laplace integral function for a given probability P.

With non-repeated sampling, the marginal sampling error for the mean is

$$\Delta_{\tilde{x}} = t\sqrt{\frac{\sigma^2}{n}\left(1 - \frac{n}{N}\right)}$$

where N is the size of the general population.

For the share, $\sigma^2$ in these formulas is replaced by the variance of the share, $w(1 - w)$.

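Example. To determine the ash content of coal, 100 samples of coal were examined at random. The survey found that the average ash content in the sample is 16% with $\sigma = 5\%$. In 10 samples the ash content was above 20%. With a probability of 0.954, determine the limits within which the average ash content of coal in the deposit and the proportion of coal with an ash content above 20% will lie.

Average ash content.

Determine the marginal sampling error (at P = 0.954, t = 2):

$$\Delta_{\tilde{x}} = t\sqrt{\frac{\sigma^2}{n}} = 2\sqrt{\frac{5^2}{100}} = 2 \cdot 0.5 = 1\%$$

With a probability of 0.954 it can be stated that the average ash content of coal in the deposit lies within $16\% \pm 1\%$, or $15\% \le \bar{x} \le 17\%$.

Share of coal with ash content above 20%.

The sample share is

$$w = \frac{m}{n} = \frac{10}{100} = 0.1$$

where m is the number of units that have the attribute.

Sampling error for the share:

$$\Delta_w = t\sqrt{\frac{w(1-w)}{n}} = 2\sqrt{\frac{0.1 \cdot 0.9}{100}} = 0.06$$

With a probability of 0.954 it can be stated that the proportion of coal with an ash content above 20% in the deposit lies within

$P = 10\% \pm 6\%$, or $4\% \le P \le 16\%$.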
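The arithmetic of the coal example can be checked with a short Python sketch. It assumes the repeated-selection formulas (the size of the deposit is not given, so no finite-population correction is applied); the function names are illustrative.

```python
import math

def margin_mean(sigma, n, t):
    """Marginal error of the mean, repeated selection: t * sigma / sqrt(n)."""
    return t * sigma / math.sqrt(n)

def margin_share(w, n, t):
    """Marginal error of a share, repeated selection: t * sqrt(w(1-w)/n)."""
    return t * math.sqrt(w * (1 - w) / n)

n, t = 100, 2                      # 100 coal samples, t = 2 for P = 0.954
mean_ash, sigma = 16.0, 5.0        # percent
d_mean = margin_mean(sigma, n, t)

w = 10 / n                         # 10 samples with ash content above 20%
d_share = margin_share(w, n, t)

print(f"mean: {mean_ash - d_mean:.0f}% .. {mean_ash + d_mean:.0f}%")
print(f"share: {w - d_share:.0%} .. {w + d_share:.0%}")
```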

Mechanical sampling.

This is a variety of simple random sampling. The entire population is divided into n equal parts, and one unit is selected from each part.

All units of the population must be arranged in a certain order. With respect to the indicator under study, the units may be ordered by an essential, a secondary, or a neutral attribute. The unit in the middle of each group should be selected; this avoids a systematic sampling error.

Application: surveying shoppers in stores or visitors to clinics, taking every 5th, 4th, 3rd, etc.

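Example: mechanical sampling

To determine the average term of use of a short-term loan at a bank, a 5% mechanical sample of 100 accounts was taken. The survey found that the average term of use of a short-term loan is 30 days with $\sigma = 9$ days, and that in 5 accounts the loan term exceeds 60 days.

Sampling error for the mean (t = 2 at P = 0.954; a 5% sample means $n/N = 0.05$):

$$\Delta_{\tilde{x}} = t\sqrt{\frac{\sigma^2}{n}\left(1 - \frac{n}{N}\right)} = 2\sqrt{\frac{9^2}{100}(1 - 0.05)} \approx 1.75 \approx 2 \text{ days}$$

i.e. with a probability of 0.954 it can be stated that

1 the term of use of the loan lies within 30 days ± 2 days, i.e. $28 \le \bar{x} \le 32$ days;

2 the share of loans with a term above 60 days is found as follows.

The sample share is

$$w = \frac{m}{n} = \frac{5}{100} = 0.05$$

Determine the error of the share:

$$\Delta_w = t\sqrt{\frac{w(1-w)}{n}\left(1 - \frac{n}{N}\right)} = 2\sqrt{\frac{0.05 \cdot 0.95}{100}(1 - 0.05)} \approx 0.042$$

With a probability of 0.954 it can be stated that the share of bank loans with a term above 60 days lies within $5\% \pm 4.2\%$, or $0.8\% \le P \le 9.2\%$.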

Typical sample.

The general population is divided into homogeneous typical groups. Then, from each typical group, an individual selection of units into the sample is made by a random or mechanical sample.

For example: the labor productivity of workers who fall into separate groups by qualification.

An important feature: it gives more accurate results than the other methods, because units of every typical group are included in the sample.

Units of observation can be selected into the sample in various ways. Consider a typical sample with proportional selection within typical groups.

With selection proportional to the size of the typical groups, the sample size from a typical group is determined by the formula

$$n_i = n \cdot \frac{N_i}{N}$$

where $n_i$ is the sample size from the typical group,

$N_i$ is the size of the typical group.

The marginal error of the sample mean and of the proportion for non-repeated random and mechanical selection within typical groups is calculated by the formulas

$$\Delta_{\tilde{x}} = t\sqrt{\frac{\overline{\sigma^2}}{n}\left(1 - \frac{n}{N}\right)}, \qquad \Delta_w = t\sqrt{\frac{\overline{w(1-w)}}{n}\left(1 - \frac{n}{N}\right)}$$

where $\overline{\sigma^2}$ is the average of the within-group sample variances.
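Proportional allocation of a typical (stratified) sample can be sketched in Python as follows. The three group sizes and the total sample size are hypothetical values chosen only for illustration; they do not come from the source.

```python
def proportional_allocation(group_sizes, n):
    """Allocate a total sample of n units in proportion to the group sizes N_i."""
    N = sum(group_sizes)
    return [n * N_i / N for N_i in group_sizes]

group_sizes = [1200, 600, 200]     # hypothetical typical groups, N = 2000
alloc = proportional_allocation(group_sizes, n=100)
print(alloc)                       # sample size taken from each group
```

In real data the results are rarely whole numbers, so some rounding rule has to be agreed on.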

Example: typical sample

To determine the average age of men entering marriage, a 5% sample was taken in the district, with units selected in proportion to the size of the typical groups.

Mechanical selection was used within the groups.

With a probability of 0.954, determine the limits within which the average age of men entering marriage and the proportion of men entering a second marriage will lie.

Average age of marriage for men in the sample:

Marginal sampling error:

With a probability of 0.954 it can be stated that the average age of men entering marriage lies within

Proportion of men entering a second marriage.

The sample share is determined:

The sample variance of the alternative attribute is

With a probability of 0.954 it can be stated that the proportion of men marrying for the second time lies within

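Serial sampling.

With serial sampling, the population is divided into groups of the same size, called series. Whole series are selected into the sample, and all units within the selected series are observed.

With repeated selection of series, the marginal sampling error of the mean is determined by the formula

$$\Delta_{\tilde{x}} = t\sqrt{\frac{\delta^2}{r}}$$

and with non-repeated selection by the formula

$$\Delta_{\tilde{x}} = t\sqrt{\frac{\delta^2}{r}\left(1 - \frac{r}{R}\right)}$$

where $\delta^2$ is the interseries variance

$$\delta^2 = \frac{\sum (\tilde{x}_i - \tilde{x})^2}{r}$$

where $\tilde{x}_i$ is the sample mean of the i-th series,

$\tilde{x}$ is the overall mean of the serial sample,

R is the number of series in the general population,

r is the number of selected series.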

Example: a workshop has 10 brigades. To study their labor productivity, a 20% serial sample was taken, which included 2 brigades. The survey found that

With a probability of 0.997, determine the limits within which the average output of the workshop's workers will lie.

The sample mean of the serial sample is determined by the formula

With a probability of 0.997 it can be stated that the average output of the workshop's workers lies within

Example: the finished-goods warehouse of a workshop holds 200 boxes of parts, 40 parts in each box. To check the quality of the finished products, a 10% serial sample was taken. The sampling found that the proportion of defective parts is 15%. The variance of the serial sample is 0.0049.

With a probability of 0.997, determine the limits within which the proportion of defective products in the batch of boxes lies.

The marginal sampling error for the share (t = 3 at P = 0.997; r = 20 of R = 200 boxes selected, without repetition) is

$$\Delta_w = t\sqrt{\frac{\delta^2}{r}\left(1 - \frac{r}{R}\right)} = 3\sqrt{\frac{0.0049}{20}\left(1 - \frac{20}{200}\right)} \approx 0.045$$

With a probability of 0.997 it can be stated that the proportion of defective parts in the batch lies within $15\% \pm 4.5\%$, or $10.5\% \le P \le 19.5\%$.
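The boxes example can be worked through in Python. This sketch assumes non-repeated selection of series, so the finite-population factor 1 - r/R is applied; the source does not state which variant was intended, and the function name is illustrative.

```python
import math

def serial_margin(delta2, r, R, t):
    """Marginal error for serial sampling, non-repeated selection of series."""
    return t * math.sqrt(delta2 / r * (1 - r / R))

R = 200                 # boxes in the warehouse
r = 20                  # 10% serial sample
w = 0.15                # share of defective parts in the sample
margin = serial_margin(delta2=0.0049, r=r, R=R, t=3)   # t = 3 for P = 0.997

print(f"{w - margin:.3f} .. {w + margin:.3f}")
```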

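When designing a sample observation, one needs to find the sample size that ensures a given accuracy in estimating the general characteristics, the mean and the proportion.

The marginal sampling error, its probability, and the variation of the attribute are assumed to be known in advance.

With random repeated selection, the sample size is determined by the formula

$$n = \frac{t^2\sigma^2}{\Delta^2}$$

With random non-repeated and mechanical selection, the sample size is

$$n = \frac{t^2\sigma^2 N}{\Delta^2 N + t^2\sigma^2}$$

For a typical sample

$$n = \frac{t^2\overline{\sigma^2} N}{\Delta^2 N + t^2\overline{\sigma^2}}$$

For serial sampling

$$r = \frac{t^2\delta^2 R}{\Delta^2 R + t^2\delta^2}$$

where $\overline{\sigma^2}$ is the average within-group variance, $\delta^2$ is the interseries variance, R is the number of series in the general population and r is the number of series to be selected.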

For example, 2000 families live in a district.

A sample survey by random non-repeated selection is planned to find the average family size.

Determine the required sample size, given that with a probability of 0.954 the sampling error must not exceed 1 person, with a standard deviation of 3 people.

A city has 10 thousand families. Using mechanical sampling, the proportion of families with three or more children is to be determined. What sample size is needed for the sampling error not to exceed 0.02 with a probability P = 0.954, if the variance is known from previous surveys to be 0.02?
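Both problems can be solved with one helper. The Python sketch below assumes the standard sample-size formula for non-repeated (and mechanical) selection, $n = t^2\sigma^2 N / (\Delta^2 N + t^2\sigma^2)$, and rounds up so that the required accuracy is not undershot; the function name is illustrative.

```python
import math

def sample_size_nonrepeated(N, t, variance, delta):
    """Required sample size for non-repeated selection, rounded up."""
    n = t**2 * variance * N / (delta**2 * N + t**2 * variance)
    return math.ceil(n)

# Families in the district: N = 2000, t = 2 (P = 0.954), sigma = 3, error 1 person
families = sample_size_nonrepeated(N=2000, t=2, variance=3**2, delta=1)

# Share of families with three or more children: N = 10 000, variance 0.02, error 0.02
share = sample_size_nonrepeated(N=10_000, t=2, variance=0.02, delta=0.02)

print(families, share)
```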

Plan:

1. Problems of mathematical statistics.

2. Sample types.

3. Selection methods.

4. Statistical distribution of the sample.

5. Empirical distribution function.

6. Polygon and histogram.

7. Numerical characteristics of the variation series.

8. Statistical estimates of distribution parameters.

9. Interval estimates of distribution parameters.

1. Tasks and methods of mathematical statistics

Mathematical statistics is a branch of mathematics devoted to methods of collecting, analyzing and processing statistical observation data for scientific and practical purposes.

Suppose we need to study a set of homogeneous objects with respect to some qualitative or quantitative attribute that characterizes them. For example, for a batch of parts, conformity of a part to the standard can serve as a qualitative attribute, and the controlled dimension of a part as a quantitative one.

Sometimes a complete survey is carried out, i.e. every object is examined with respect to the attribute of interest. In practice, a complete survey is rarely used. If the population contains a very large number of objects, a complete survey is physically impossible. If examining an object destroys it or requires large material costs, a complete survey makes no sense. In such cases, a limited number of objects (the sample) is selected at random from the entire population and studied.

The main task of mathematical statistics is to study the entire population on the basis of sample data, i.e. to study, depending on the goal, its probabilistic properties (the distribution law, numerical characteristics, etc.) in order to make managerial decisions under uncertainty.

2. Sample types

The general population is the set of objects from which the sample is drawn.

The sample population (sample) is the set of randomly selected objects.

The size of a population (general or sample) is the number of objects in it. The size of the general population is denoted N, and that of the sample n.

Example:

If 100 parts out of 1000 are selected for examination, then the size of the general population is N = 1000 and the sample size is n = 100.

Sampling can be done in two ways: after an object has been selected and observed, it may or may not be returned to the general population. Accordingly, samples are divided into repeated and non-repeated.

A sample is called repeated if the selected object is returned to the general population before the next one is selected.

A sample is called non-repeated if the selected object is not returned to the general population.

In practice, non-repeated random selection is usually used.

For sample data to support sufficiently confident judgments about the attribute of interest in the general population, the objects in the sample must represent it correctly: the sample must reproduce the proportions of the general population. In other words, the sample must be representative.

By virtue of the law of large numbers, it can be argued that the sample will be representative if it is carried out randomly.

If the size of the general population is large enough, and the sample is only a small part of this population, then the distinction between repeated and non-repeated samples is erased; in the limiting case, when an infinite general population is considered, and the sample has a finite size, this difference disappears.

Example:

The American magazine Literary Digest used statistical methods to forecast the outcome of the upcoming 1936 US presidential election. The candidates were F. D. Roosevelt and A. M. Landon. Telephone directories were taken as the source for the general population of Americans under study. From these, 4 million addresses were selected at random, and the editors sent out postcards asking the recipients to state their attitude towards the candidates. After processing the results of the poll, the magazine published a forecast that Landon would win the election by a large margin. It was wrong: Roosevelt won.
This can be seen as an example of a non-representative sample. In the United States in the first half of the twentieth century, telephones were owned mainly by the wealthier part of the population, which supported Landon.

3. Selection methods

In practice, various methods of selection are used, which can be divided into 2 types:

1. Selection that does not require dividing the general population into parts: (a) simple random non-repeated; (b) simple random repeated.

2. Selection in which the general population is divided into parts: (a) typical selection; (b) mechanical selection; (c) serial selection.

Simple random selection is selection in which objects are drawn one at a time, at random, from the entire general population.

Typical selection is selection in which objects are drawn not from the entire general population but from each of its "typical" parts. For example, if a part is manufactured on several machines, the selection is made not from the entire set of parts produced by all machines but from the output of each machine separately. Such selection is used when the attribute under study varies noticeably between the "typical" parts of the general population.

Mechanical selection is selection in which the general population is "mechanically" divided into as many groups as there are objects to be included in the sample, and one object is selected from each group. For example, to select 20% of the parts made by a machine, every 5th part is taken; to select 5% of the parts, every 20th, and so on. Sometimes such selection does not give a representative sample (if every 20th turned roller is selected and the cutter is replaced immediately after each selection, then all the selected rollers will have been turned with blunt cutters).

Serial selection is selection in which objects are drawn from the general population not one at a time but in "series", each of which is surveyed completely. For example, if products are manufactured by a large group of automatic machines, the output of only a few machines is examined in full.

In practice, combined selection is often used, in which the above methods are combined.

4. Statistical distribution of the sample

Let a sample be taken from the general population, in which the value x_1 was observed n_1 times, x_2 n_2 times, ..., x_k n_k times, and n = n_1 + n_2 + ... + n_k is the sample size. The observed values x_i are called variants, and the sequence of variants written in ascending order is called a variational series. The numbers of observations n_i are called frequencies (absolute frequencies), and their ratios to the sample size, w_i = n_i/n, are called relative frequencies or statistical probabilities.

If the number of variants is large or the sample is drawn from a continuous general population, the variational series is compiled not from individual point values but from intervals of values. Such a series is called an interval series. The lengths of the intervals must be equal.

The statistical distribution of the sample is the list of variants and their corresponding frequencies or relative frequencies.

The statistical distribution can also be specified as a sequence of intervals and their corresponding frequencies (the sum of the frequencies of the variants that fall into each interval).

The point variational series of frequencies can be represented by a table:

x_i | x_1 | x_2 | … | x_k
n_i | n_1 | n_2 | … | n_k

Similarly, one can represent a point variational series of relative frequencies.

And:

Example:

The number of letters in some text X turned out to be 1000. The first letter was "я", the second "и", the third "а", the fourth "ю". Then came the letters "о", "е", "у", "э", "ы".

Writing down the places these letters occupy in the Russian alphabet, we have: 33, 10, 1, 32, 16, 6, 21, 31, 29.

Ordering these numbers in ascending order, we get a variational series: 1, 6, 10, 16, 21, 29, 31, 32, 33.

The frequencies of the letters in the text: "а" - 75, "е" - 87, "и" - 75, "о" - 110, "у" - 25, "ы" - 8, "э" - 3, "ю" - 7, "я" - 22.

We compose a point variational series of frequencies:

x_i | 1 | 6 | 10 | 16 | 21 | 29 | 31 | 32 | 33
n_i | 75 | 87 | 75 | 110 | 25 | 8 | 3 | 7 | 22

Example:

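A frequency distribution of a sample of size n = 20 is given.

Compose the point variational series of relative frequencies.

x_i | 2 | 6 | 12
n_i | 3 | 10 | 7

Solution:

Find the relative frequencies:

$$w_1 = \frac{3}{20} = 0.15, \quad w_2 = \frac{10}{20} = 0.5, \quad w_3 = \frac{7}{20} = 0.35$$

x_i | 2 | 6 | 12
w_i | 0.15 | 0.5 | 0.35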
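The same calculation as a minimal Python sketch, using the variants and frequencies from this example:

```python
values = [2, 6, 12]      # variants x_i
freqs = [3, 10, 7]       # frequencies n_i

n = sum(freqs)                       # sample size, 20
rel = [f / n for f in freqs]         # relative frequencies w_i = n_i / n

for x, w in zip(values, rel):
    print(x, w)
```

The relative frequencies always sum to one, which is a convenient check on the table.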

When constructing an interval distribution, there are rules for choosing the number of intervals or the width of each interval. The criterion is an optimal trade-off: as the number of intervals grows, the representation becomes more detailed, but the amount of data and the processing time increase. The difference x_max - x_min between the largest and smallest variants is called the range of the sample.

To find the number of intervals k, the empirical Sturges formula is usually applied (with rounding to the nearest convenient integer): k = 1 + 3.322 lg n, where lg is the base-10 logarithm.

Accordingly, the width of each interval h can be calculated using the formula:

$$h = \frac{x_{max} - x_{min}}{k}$$
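A small Python sketch of the Sturges rule follows. The inputs (n = 20, smallest value 212, largest value 232) are taken from the voltage example later in this text; rounding to the nearest integer is one reading of "nearest convenient integer" and is an assumption here.

```python
import math

def sturges_intervals(n):
    """Number of intervals by the Sturges formula, rounded to the nearest integer."""
    return round(1 + 3.322 * math.log10(n))

def interval_width(x_min, x_max, k):
    """Width of each interval: range of the sample divided by the number of intervals."""
    return (x_max - x_min) / k

k = sturges_intervals(20)
h = interval_width(212, 232, k)
print(k, h)
```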

5. Empirical distribution function

Consider a sample from the general population. Let the statistical distribution of the frequencies of the quantitative attribute X be known. Introduce the notation: $n_x$ is the number of observations in which a value of the attribute less than x was observed; n is the total number of observations (the sample size). The relative frequency of the event X < x is $n_x/n$. If x changes, the relative frequency changes too, i.e. the relative frequency $n_x/n$ is a function of x. Because it is found empirically, it is called empirical.

The empirical distribution function (sample distribution function) is the function $F^*(x)$ that gives, for each x, the relative frequency of the event X < x:

$$F^*(x) = \frac{n_x}{n}$$

where $n_x$ is the number of variants less than x,

n is the sample size.

In contrast to the empirical distribution function of the sample, the distribution function F(x) of the general population is called the theoretical distribution function.

The difference between the empirical and theoretical distribution functions is that the theoretical function F(x) determines the probability of the event X < x, while the empirical function F*(x) determines the relative frequency of the same event. It follows from Bernoulli's theorem that the relative frequency of the event X < x, i.e. F*(x), tends in probability to the probability F(x) of this event. That is, for large n, F*(x) and F(x) differ little from each other.

Thus, it is advisable to use the empirical distribution function of the sample as an approximation of the theoretical (integral) distribution function of the general population.

F*(x) has all the properties of F(x).

1. The values of F*(x) belong to the interval [0, 1].

2. F*(x) is a non-decreasing function.

3. If x_1 is the smallest variant, then F*(x) = 0 for x ≤ x_1; if x_k is the largest variant, then F*(x) = 1 for x > x_k.

Thus, F*(x) serves to estimate F(x).

If the sample is given by a variational series, the empirical function has the form:

$$F^*(x) = \begin{cases} 0, & x \le x_1 \\ \frac{n_1}{n}, & x_1 < x \le x_2 \\ \frac{n_1 + n_2}{n}, & x_2 < x \le x_3 \\ \dots \\ 1, & x > x_k \end{cases}$$

The graph of the empirical function is called the cumulative curve (cumulate).

Example:

Plot the empirical function for the given sample distribution.

x_i | 2 | 6 | 10
n_i | 12 | 18 | 30

Solution:

The sample size is n = 12 + 18 + 30 = 60. The smallest variant is 2, so F*(x) = 0 for x ≤ 2. The event X < 6 (i.e. x_1 = 2) was observed 12 times, so F*(x) = 12/60 = 0.2 for 2 < x ≤ 6. The event X < 10 (i.e. x_1 = 2, x_2 = 6) was observed 12 + 18 = 30 times, so F*(x) = 30/60 = 0.5 for 6 < x ≤ 10. Since x = 10 is the largest variant, F*(x) = 1 for x > 10. The desired empirical function has the form:

$$F^*(x) = \begin{cases} 0, & x \le 2 \\ 0.2, & 2 < x \le 6 \\ 0.5, & 6 < x \le 10 \\ 1, & x > 10 \end{cases}$$

Cumulate:

The cumulate makes the information easy to read off graphically, for example to answer the question: "Determine the number of observations in which the value of the attribute was less than 6 and not less than 6." Since F*(6) = 0.2, the number of observations in which the value was less than 6 is 0.2 * n = 0.2 * 60 = 12. The number of observations in which the value was not less than 6 is (1 - 0.2) * n = 0.8 * 60 = 48.
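The empirical distribution function of this example can be sketched in Python. The variants 2, 6, 10 with frequencies 12, 18, 30 are those used in the solution; the function follows the definition in the text (relative frequency of the event X < x, with a strict inequality), and its name is illustrative.

```python
def empirical_cdf(values, freqs):
    """Return F*(x) = (number of observations with value < x) / n."""
    n = sum(freqs)

    def F(x):
        return sum(f for v, f in zip(values, freqs) if v < x) / n

    return F

F = empirical_cdf([2, 6, 10], [12, 18, 30])

for x in (2, 6, 10, 11):
    print(x, F(x))
```

Note that many libraries define the empirical distribution function with a non-strict inequality (X ≤ x), which shifts the jumps to the other side of each variant.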

If an interval variation series is given, then to compile the empirical distribution function, the midpoints of the intervals are found and the empirical distribution function is obtained from them similarly to the point variation series.

6. Polygon and histogram

For clarity, various graphs of the statistical distribution are built: polygons and histograms.

A frequency polygon is a broken line whose segments connect the points (x_1; n_1), (x_2; n_2), …, (x_k; n_k), where x_i are the variants and n_i are the corresponding frequencies.

A polygon of relative frequencies is a broken line whose segments connect the points (x_1; w_1), (x_2; w_2), …, (x_k; w_k), where x_i are the variants and w_i are the corresponding relative frequencies.

Example:

Plot the relative frequency polygon for the given sample distribution:

Solution:

For a continuous attribute it is advisable to build a histogram: the interval containing all observed values of the attribute is divided into several partial intervals of length h, and for each partial interval n_i is found, the sum of the frequencies of the variants that fall into the i-th interval. (For example, a person's height or weight is a continuous attribute.)

A frequency histogram is a stepped figure consisting of rectangles whose bases are the partial intervals of length h and whose heights are equal to the ratio n_i/h (the frequency density).

The area of the i-th partial rectangle is h * (n_i/h) = n_i, the sum of the frequencies of the variants in the i-th interval; hence the area of the frequency histogram equals the sum of all frequencies, i.e. the sample size.

Example:

The results of measuring the voltage (in volts) in an electrical network are given. Compose a variational series and build a polygon and a frequency histogram if the voltage values are as follows: 227, 215, 230, 232, 223, 220, 228, 222, 221, 226, 226, 215, 218, 220, 216, 220, 225, 212, 217, 220.

Solution:

Let's compose the variational series. We have n = 20, x_min = 212, x_max = 232.

Let's use the Sturges formula to calculate the number of intervals: $k = 1 + 3.322 \lg 20 \approx 5.32 \approx 5$, so the interval width is $h = (232 - 212)/5 = 4$.

The interval variational series of frequencies has the form:

Interval | Frequency n_i | Frequency density n_i/h
212-216 | 3 | 0.75
216-220 | 3 | 0.75
220-224 | 7 | 1.75
224-228 | 4 | 1
228-232 | 3 | 0.75

(A value on a boundary is counted in the interval it opens, except 232, which belongs to the last interval.)

Let's build a histogram of frequencies:

Let's construct a polygon of frequencies by first finding the midpoints of the intervals:
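The grouping can be checked with a short Python sketch. It assumes five intervals of width 4 starting at 212 (the edges shown in the table above), with a boundary value counted in the interval it opens and 232 counted in the last interval; that boundary convention is an assumption, chosen because it reproduces the densities given in the text.

```python
voltages = [227, 215, 230, 232, 223, 220, 228, 222, 221, 226,
            226, 215, 218, 220, 216, 220, 225, 212, 217, 220]

x_min, h, k = 212, 4, 5

counts = [0] * k
for v in voltages:
    i = min((v - x_min) // h, k - 1)   # the last interval is closed on the right
    counts[i] += 1

densities = [c / h for c in counts]                          # heights of the histogram bars
midpoints = [x_min + h * i + h / 2 for i in range(k)]        # x-coordinates of the polygon

print(counts)
print(densities)
print(midpoints)
```

The polygon connects the points (midpoint; frequency), while the histogram uses the densities as bar heights.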


A histogram of relative frequencies is a stepped figure consisting of rectangles whose bases are the partial intervals of length h and whose heights are equal to the ratio w_i/h (the relative frequency density).

The area of the i-th partial rectangle equals the relative frequency of the variants that fall into the i-th interval. Thus, the area of the histogram of relative frequencies equals the sum of all relative frequencies, i.e. one.

7. Numerical characteristics of the variation series

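Consider the main characteristics of the general and sample populations.

The general mean is the arithmetic mean of the values of the attribute in the general population.

For distinct values x_1, x_2, x_3, …, x_N of the attribute in a general population of size N we have:

$$\bar{x}_G = \frac{x_1 + x_2 + \dots + x_N}{N}$$

If the attribute values x_1, x_2, …, x_k have the corresponding frequencies N_1, N_2, …, N_k, with N_1 + N_2 + … + N_k = N, then

$$\bar{x}_G = \frac{x_1 N_1 + x_2 N_2 + \dots + x_k N_k}{N}$$

The sample mean is the arithmetic mean of the values of the attribute in the sample population. For distinct values x_1, x_2, …, x_n in a sample of size n:

$$\bar{x}_S = \frac{x_1 + x_2 + \dots + x_n}{n}$$

If the attribute values x_1, x_2, …, x_k have the corresponding frequencies n_1, n_2, …, n_k, with n_1 + n_2 + … + n_k = n, then

$$\bar{x}_S = \frac{x_1 n_1 + x_2 n_2 + \dots + x_k n_k}{n}$$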
Example:

Calculate the sample mean for the sample: x_1 = 51.12; x_2 = 51.07; x_3 = 52.95; x_4 = 52.93; x_5 = 51.1; x_6 = 52.98; x_7 = 52.29; x_8 = 51.23; x_9 = 51.07; x_10 = 51.04.

Solution:

The sample mean is the sum of the values divided by their number: $517.78 / 10 = 51.778$.
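The sample mean of the ten values in this example can be computed with Python's standard `statistics` module:

```python
from statistics import fmean

sample = [51.12, 51.07, 52.95, 52.93, 51.1,
          52.98, 52.29, 51.23, 51.07, 51.04]

mean = fmean(sample)          # arithmetic mean of the sample values
print(round(mean, 3))
```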

The general variance is the arithmetic mean of the squared deviations of the values of the attribute X in the general population from the general mean $\bar{x}_G$.

For distinct values x_1, x_2, x_3, …, x_N of the attribute in a general population of size N we have:

$$D_G = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x}_G)^2$$

If the attribute values x_1, x_2, …, x_k have the corresponding frequencies N_1, N_2, …, N_k, with N_1 + N_2 + … + N_k = N, then

$$D_G = \frac{1}{N}\sum_{i=1}^{k} N_i (x_i - \bar{x}_G)^2$$

The general standard deviation is the square root of the general variance:

$$\sigma_G = \sqrt{D_G}$$

The sample variance is the arithmetic mean of the squared deviations of the observed values of the attribute from the sample mean $\bar{x}_S$.

For distinct values x_1, x_2, x_3, ..., x_n of the attribute in a sample of size n we have:

$$D_S = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x}_S)^2$$

If the attribute values x_1, x_2, …, x_k have the corresponding frequencies n_1, n_2, …, n_k, with n_1 + n_2 + … + n_k = n, then

$$D_S = \frac{1}{n}\sum_{i=1}^{k} n_i (x_i - \bar{x}_S)^2$$

The sample standard deviation is the square root of the sample variance:

$$\sigma_S = \sqrt{D_S}$$


Example:

The sample is given by a distribution table. Find the sample variance.


Solution:

Theorem: The variance is equal to the difference between the mean of the squares of the feature values and the square of the mean: $D = \overline{x^2} - (\bar{x})^2$.
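The theorem can be checked numerically. The sketch below uses a small hypothetical distribution table (the values `xs` and frequencies `ns` are made up for illustration and are not the table from the example) and compares the variance computed by definition with the shortcut formula.

```python
# Hypothetical distribution table: values x_i with frequencies n_i (not from the text).
xs = [1, 2, 3, 4]
ns = [20, 15, 10, 5]

n = sum(ns)
mean = sum(x * f for x, f in zip(xs, ns)) / n              # mean of the values
mean_sq = sum(x ** 2 * f for x, f in zip(xs, ns)) / n      # mean of the squares

# Variance by definition: mean squared deviation from the mean.
var_by_definition = sum(f * (x - mean) ** 2 for x, f in zip(xs, ns)) / n

# Variance by the theorem: mean of squares minus square of the mean.
var_by_theorem = mean_sq - mean ** 2

print(var_by_definition, var_by_theorem)
```

Both expressions give the same number, and the second needs only one pass over the table.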

Example:

Find the variance for this distribution.



Solution:

8. Statistical estimates of distribution parameters

Let the general population be studied by some sample. In this case, it is possible to obtain only an approximate value of the unknown parameter Q, which serves as its estimate. It is obvious that estimates can vary from one sample to another.

A statistical estimate Q* of an unknown parameter Q of the theoretical distribution is a function f of the observed sample values. The task of statistical estimation of unknown parameters from a sample is to construct, from the available observations, a function that gives the most accurate approximate values of the true parameter values, which are unknown to the researcher.

Statistical estimates are divided into point and interval, depending on the way they are provided (number or interval).

A point estimate is a statistical estimate of the parameter Q of the theoretical distribution that is determined by a single value Q* = f(x_1, x_2, ..., x_n), where x_1, x_2, ..., x_n are the results of empirical observations of the quantitative feature X in a certain sample.

Such parameter estimates obtained from different samples usually differ from each other. The absolute difference |Q* - Q| is called the sampling error (estimation error).

In order for statistical estimates to give reliable results about the estimated parameters, it is necessary that they be unbiased, efficient and consistent.

A point estimate whose mathematical expectation is equal (not equal) to the estimated parameter is called unbiased (biased). For an unbiased estimate, M(Q*) = Q.

The difference M(Q*) - Q is called the bias, or systematic error. For unbiased estimates, the systematic error is 0.

An estimate Q* is called efficient if, for a given sample size n, it has the smallest possible variance: D(Q*) = D_min (n = const). An efficient estimate has the smallest spread compared with other unbiased and consistent estimates.

An estimate Q* is called consistent if it tends in probability to the estimated parameter Q as n → ∞, i.e. as the sample size n increases, the estimate tends in probability to the true value of the parameter Q.

The consistency requirement is consistent with the law of large numbers: the more initial information about the object under study, the more accurate the result. If the sample size is small, then the point estimate of the parameter can lead to serious errors.

Any sample of size n can be regarded as an ordered set x_1, x_2, ..., x_n of independent identically distributed random variables.

Sample means of different samples of the same size n from the same population will differ. That is, the sample mean can be considered a random variable, which means that we can talk about the distribution of the sample mean and its numerical characteristics.

The sample mean satisfies all the requirements imposed on statistical estimates, i.e. gives an unbiased, efficient, and consistent estimate of the population mean.

It can be proved that $M(D_S) = \frac{n-1}{n} D_G$, where D_S is the sample variance and D_G is the general variance. Thus, the sample variance is a biased estimate of the general variance and underestimates it; with a small sample size it gives a systematic error. To obtain an unbiased, consistent estimate, it suffices to take the quantity $s^2 = \frac{n}{n-1} D_S$, which is called the corrected variance, i.e. $s^2 = \frac{1}{n-1}\sum_{i=1}^{k} n_i (x_i - \bar{x}_S)^2$.

In practice, the corrected variance is used to estimate the general variance when n < 30. For n > 30 the difference between D_S and s^2 is hardly noticeable, so for large n the bias can be neglected.
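The relation between the biased and the corrected variance is easy to see on data. The sketch below reuses the ten values from the sample-mean example in section 7 and relies on Python's standard `statistics` module, where `pvariance` divides by n and `variance` divides by n - 1.

```python
import statistics

sample = [51.12, 51.07, 52.95, 52.93, 51.1, 52.98, 52.29, 51.23, 51.07, 51.04]
n = len(sample)

mean = statistics.mean(sample)          # sample mean, an unbiased estimate of the general mean
d_s = statistics.pvariance(sample)      # biased sample variance (divides by n)
s2 = statistics.variance(sample)        # corrected variance (divides by n - 1)

print(f"mean = {mean:.3f}, D_S = {d_s:.4f}, s^2 = {s2:.4f}")
print(f"n/(n-1) * D_S = {n / (n - 1) * d_s:.4f}")  # coincides with s^2
```

The factor n/(n - 1) is about 1.11 for n = 10 and approaches 1 as n grows, which is why the correction matters mainly for small samples.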

One can also prove that the relative frequency n_i/n is an unbiased and consistent estimate of the probability P(X = x_i). The empirical distribution function F*(x) is an unbiased and consistent estimate of the theoretical distribution function F(x) = P(X < x).

Example:

Find the unbiased estimates of the mean and variance from the sample table.

x_i
n_i

Solution:

Sample size n=20.

The unbiased estimate of the mathematical expectation is the sample mean.


To calculate the unbiased estimate of the variance, we first find the sample variance:

Now let's find the unbiased estimate:

9. Interval estimates of distribution parameters

An interval estimate is a statistical estimate determined by two numbers, the ends of the interval.

A number δ > 0 such that |Q - Q*| < δ characterizes the accuracy of the interval estimate.

A confidence interval is an interval that covers the unknown value of the parameter Q with a given probability γ. The complement of the confidence interval in the set of all possible values of the parameter Q is called the critical region. If the critical region lies on only one side of the confidence interval, the confidence interval is called one-sided: left-sided if the critical region lies only on the left, and right-sided if it lies only on the right. Otherwise, the confidence interval is called two-sided.

The reliability, or confidence level, of the estimate of Q (by means of Q*) is the probability γ with which the inequality |Q - Q*| < δ holds.

Most often, the confidence probability is set in advance (0.95; 0.99; 0.999), and it is required to be close to one.

The probability α = 1 - γ is called the probability of error, or the significance level.

Let |Q - Q*| < δ; then Q* - δ < Q < Q* + δ. This means that with probability γ it can be stated that the true value of the parameter Q belongs to the interval (Q* - δ, Q* + δ). The smaller the deviation δ, the more accurate the estimate.

The boundaries (ends) of the confidence interval are called confidence boundaries, or critical boundaries.

The values of the boundaries of the confidence interval depend on the distribution law of the parameter Q*.

The deviation δ, equal to half the width of the confidence interval, is called the accuracy of the estimate.

Methods for constructing confidence intervals were first developed by the American statistician J. Neyman. The accuracy of the estimate δ, the confidence probability γ and the sample size n are interrelated, so knowing the values of two of these quantities, one can always calculate the third.

Finding the confidence interval for the mathematical expectation of a normal distribution when the standard deviation is known.

Let a sample be drawn from a general population that follows the normal distribution law. Let the general standard deviation σ be known, while the mathematical expectation a of the theoretical distribution is unknown.

The following formula is valid:

$P\left(|\bar{x} - a| < \delta\right) = 2\Phi\left(\frac{\delta\sqrt{n}}{\sigma}\right) = 2\Phi(t)$, where $\delta = \frac{t\sigma}{\sqrt{n}}$ and Φ is the Laplace function.

That is, from a specified deviation δ one can find the probability with which the unknown general mean belongs to the interval $(\bar{x} - \delta, \bar{x} + \delta)$, and vice versa. The formula shows that as the sample size increases at a fixed confidence probability, δ decreases, i.e. the accuracy of the estimate increases. As the reliability (confidence probability) increases, δ increases, i.e. the accuracy of the estimate decreases.

Example:

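As a result of the tests, the following values were obtained: -25, 34, -20, 10, 21. It is known that they obey the normal distribution law with a standard deviation of 2. Find the estimate a* of the mathematical expectation a and construct a 90% confidence interval for it.

Solution:

Let's find the unbiased estimate: $a^* = \bar{x} = \frac{-25 + 34 - 20 + 10 + 21}{5} = 4$.

Then, from 2Φ(t) = 0.9 we get Φ(t) = 0.45 and, from the table of the Laplace function, t = 1.645, so

$\delta = \frac{t\sigma}{\sqrt{n}} = \frac{1.645 \cdot 2}{\sqrt{5}} \approx 1.47$

The confidence interval for a has the form: 4 - 1.47 < a < 4 + 1.47, or 2.53 < a < 5.47.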
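The same interval can be obtained without tables. This is a minimal sketch using `statistics.NormalDist` from the Python standard library; instead of looking up the Laplace function, it takes the two-sided normal quantile at level (1 + γ)/2, which is an equivalent way to find t.

```python
from math import sqrt
from statistics import NormalDist, mean

values = [-25, 34, -20, 10, 21]
sigma = 2.0     # known general standard deviation
gamma = 0.90    # confidence probability

a_star = mean(values)                        # point estimate of a
t = NormalDist().inv_cdf((1 + gamma) / 2)    # two-sided quantile, about 1.645
delta = t * sigma / sqrt(len(values))        # accuracy of the estimate

low, high = a_star - delta, a_star + delta
print(f"a* = {a_star}, delta = {delta:.2f}, interval = ({low:.2f}; {high:.2f})")
```

Raising `gamma` to 0.95 or 0.99 widens the interval, while a larger sample narrows it in proportion to the square root of n.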

Finding the confidence interval for the mathematical expectation of a normal distribution when the standard deviation is unknown.

Let it be known that the general population follows the normal distribution law, where a and σ are unknown. The accuracy δ of the confidence interval that covers the true value of the parameter a with reliability γ is in this case calculated by the formula:

$\delta = \frac{t_\gamma s}{\sqrt{n}}$, where n is the sample size, s is the corrected sample standard deviation, and $t_\gamma$ is Student's coefficient (found from the given values of n and γ in the table "Critical points of Student's distribution").

Example:

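As a result of the tests, the following values were obtained: -35, -32, -26, -35, -30, -17. It is known that they obey the law of normal distribution. Find the confidence interval for the population mean a with a confidence level of 0.9.

Solution:

Let's find the unbiased estimate: $\bar{x} = \frac{-35 - 32 - 26 - 35 - 30 - 17}{6} \approx -29.2$.

Let's find the corrected standard deviation: $s = \sqrt{\frac{1}{n-1}\sum (x_i - \bar{x})^2} \approx 6.85$.

Then, with Student's coefficient $t_\gamma \approx 2.01$ for γ = 0.9 and n = 6, we get $\delta = \frac{t_\gamma s}{\sqrt{n}} \approx 5.62$.

The confidence interval takes the form (-29.2 - 5.62; -29.2 + 5.62), or (-34.82; -23.58).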
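The calculation for the unknown standard deviation can be sketched as follows. The Python standard library has no Student quantile function, so the coefficient is entered as a constant: `T_GAMMA = 2.015` is the usual tabulated value for a two-sided confidence level of 0.9 with 5 degrees of freedom, and it is an assumption of this sketch rather than a number quoted in the text.

```python
from math import sqrt
from statistics import mean, stdev

values = [-35, -32, -26, -35, -30, -17]
n = len(values)

# Tabulated Student coefficient for gamma = 0.9 and n - 1 = 5 degrees of freedom
# (assumed table value, entered by hand).
T_GAMMA = 2.015

x_bar = mean(values)     # unbiased estimate of the mean
s = stdev(values)        # corrected standard deviation (divides by n - 1)
delta = T_GAMMA * s / sqrt(n)

print(f"x_bar = {x_bar:.2f}, s = {s:.2f}, delta = {delta:.2f}")
print(f"interval = ({x_bar - delta:.2f}; {x_bar + delta:.2f})")
```

Without intermediate rounding the half-width comes out at about 5.64, which agrees with the 5.62 of the worked solution up to rounding of s and of the table value.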

Finding the confidence interval for the variance and standard deviation of a normal distribution

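Let a random sample of size n < 30 be taken from a general population distributed according to the normal law, and let the sample variances be calculated for it: the biased one, $\sigma_S^2$, and the corrected one, s^2. Then, to find interval estimates with a given reliability γ for the general variance D and the general standard deviation σ, the following formulas are used:

$\frac{\sqrt{n}\,\sigma_S}{\chi_2} < \sigma < \frac{\sqrt{n}\,\sigma_S}{\chi_1}$

or, $\frac{\sqrt{n-1}\,s}{\chi_2} < \sigma < \frac{\sqrt{n-1}\,s}{\chi_1}$

The values $\chi_1^2$ and $\chi_2^2$ are found from the table of critical points of the Pearson (χ²) distribution with k = n - 1 degrees of freedom, at the levels (1 + γ)/2 and (1 - γ)/2 respectively.

The confidence interval for the variance is obtained from these inequalities by squaring all parts.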

Example:

The quality of 15 bolts was checked. Assuming that the error in their manufacture follows the normal distribution law and that the sample standard deviation is 5 mm, determine with reliability γ = 0.95 the confidence interval for the unknown parameter σ.

We represent the boundaries of the interval as a double inequality:

The ends of the two-sided confidence interval for the variance can be determined without performing arithmetic operations for a given level of confidence and sample size using the corresponding table (Bounds of confidence intervals for the variance depending on the number of degrees of freedom and reliability). To do this, the ends of the interval obtained from the table are multiplied by the corrected variance s 2.

Example:

Let's solve the previous problem in a different way.

Solution:

Let's find the corrected variance: $s^2 = \frac{n}{n-1}\,\sigma_S^2 = \frac{15}{14} \cdot 25 \approx 26.79$.

According to the table "Bounds of confidence intervals for the variance depending on the number of degrees of freedom and reliability", we find the boundaries of the confidence interval for the variance at k = 14 and γ = 0.95: lower limit 0.513 and upper limit 2.354.

We multiply the obtained bounds by s^2 and take the square root (because we need a confidence interval for the standard deviation, not for the variance): $\sqrt{0.513 \cdot 26.79} < \sigma < \sqrt{2.354 \cdot 26.79}$, i.e. 3.71 < σ < 7.94.

As can be seen from the examples, the value of the confidence interval depends on the method of its construction and gives close but different results.

For samples of sufficiently large size (n > 30) the boundaries of the confidence interval for the general standard deviation can be determined by the formula $s(1 - q) < \sigma < s(1 + q)$, where q is a number that is tabulated in the corresponding reference table as a function of the reliability and the sample size.

If q > 1, the formula takes the form: $0 < \sigma < s(1 + q)$.

Example:

Let's solve the previous problem in the third way.

Solution:

Previously we found s = 5.17. From the table we find q(0.95; 15) = 0.46.

Then: 5.17 (1 - 0.46) < σ < 5.17 (1 + 0.46), i.e. 2.79 < σ < 7.55.
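The second and third ways of solving the bolt problem reduce to a few arithmetic steps once the table values are known. The sketch below takes the constants 0.513, 2.354 and q = 0.46 exactly as quoted in the text and assumes that the 5 mm in the problem statement is the biased sample standard deviation; the variable names are illustrative.

```python
from math import sqrt

n = 15
sigma_s = 5.0                       # biased sample standard deviation, mm (assumed reading of the problem)
s2 = n / (n - 1) * sigma_s ** 2     # corrected variance
s = sqrt(s2)                        # corrected standard deviation, about 5.17

# Second way: tabulated bounds for the variance at k = 14, as quoted in the text.
LOWER_FACTOR, UPPER_FACTOR = 0.513, 2.354
way2 = (sqrt(LOWER_FACTOR * s2), sqrt(UPPER_FACTOR * s2))

# Third way: tabulated q(0.95; 15) = 0.46, as quoted in the text.
Q = 0.46
way3 = (s * (1 - Q), s * (1 + Q))

print(f"s = {s:.2f}")
print(f"second way: {way2[0]:.2f} < sigma < {way2[1]:.2f}")
print(f"third way:  {way3[0]:.2f} < sigma < {way3[1]:.2f}")
```

The two intervals overlap but do not coincide, which illustrates the remark that the result depends on the method of construction.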

Statistical population: a set of units characterized by mass character, typicality, qualitative homogeneity and the presence of variation.

The statistical population consists of materially existing objects (employees, enterprises, countries, regions) and is the object of statistical study.

Population unit: each individual unit of the statistical population.

One and the same statistical population can be homogeneous in one feature and heterogeneous in another.

Qualitative homogeneity: the similarity of all units of the population in some feature and their dissimilarity in all the others.

In a statistical population, the differences between one unit and another are most often quantitative. Quantitative changes in the values of a feature across different units of the population are called variation.

Feature variation: the quantitative change of a feature (for a quantitative feature) in passing from one unit of the population to another.

Feature: a property, characteristic or other peculiarity of units, objects and phenomena that can be observed or measured. Features are divided into quantitative and qualitative. The diversity and variability of the value of a feature in individual units of the population is called variation.

Attributive (qualitative) features cannot be expressed numerically (composition of the population by sex). Quantitative features have a numerical expression (composition of the population by age).

Indicator: a generalizing quantitative and qualitative characteristic of some property of units or of the population as a whole under specific conditions of time and place.

System of indicators: a set of indicators that comprehensively reflects the phenomenon under study.

For example, consider salary:
  • Feature: wages
  • Statistical population: all employees
  • Population unit: each employee
  • Qualitative homogeneity: accrued salary
  • Feature variation: a series of numbers

General population and sample from it

Statistical analysis is based on a set of data obtained by measuring one or more features. The actually observed set of objects, statistically represented by a series of observations of a random variable X, is the sample, and the hypothetically existing (conceived) set is the general population. The general population can be finite (number of observations N = const) or infinite (N = ∞), whereas a sample from the general population is always the result of a limited number of observations. The number of observations that make up a sample is called the sample size. If the sample size is large enough (n → ∞), the sample is considered large; otherwise it is called a sample of limited size. The sample is considered small if, when a one-dimensional random variable is measured, the sample size does not exceed 30 (n <= 30), or, when several (k) features are measured simultaneously in a multidimensional space, the ratio of n to k is less than 10 (n/k < 10). The sample forms a variation series if its members are order statistics, i.e. the sample values of the random variable X are sorted in ascending order (ranked); the values of the feature are then called variants.

Example. One and the same randomly selected set of objects, the commercial banks of one administrative district of Moscow, can be considered as a sample from the general population of all commercial banks in this district, as a sample from the general population of all commercial banks in Moscow, as a sample of the commercial banks in the country, and so on.

Basic sampling methods

The reliability of statistical conclusions and the meaningful interpretation of results depend on the representativeness of the sample, i.e. on how completely and adequately it reflects the properties of the general population with respect to which the sample can be considered representative. The statistical properties of a population can be studied in two ways: by continuous or by non-continuous observation. Continuous observation examines all units of the population under study, whereas non-continuous (sample) observation examines only part of it.

There are five main ways to organize sampling:

1. simple random selection, in which objects are drawn at random from the general population (for example, using a table of random numbers or a random number generator), and each possible sample has an equal probability. Such samples are called properly random;

2. simple selection by a regular procedure is carried out using a mechanical component (for example, dates, days of the week, apartment numbers, letters of the alphabet, etc.), and the samples obtained in this way are called mechanical;

3. stratified selection consists in subdividing the general population of size N into subsets, or layers (strata), of sizes N_1, N_2, ..., N_k so that N_1 + N_2 + ... + N_k = N. Strata are groups of objects that are homogeneous in their statistical characteristics (for example, the population is divided into strata by age group or social class; enterprises by industry). In this case, the samples are called stratified (also typical or zoned);

4. serial selection methods are used to form serial, or cluster, samples. They are convenient when a "block" or a series of objects has to be examined at once (for example, a consignment of goods, products of a certain series, or the population of a territorial-administrative division of the country). The series can be selected randomly or mechanically. Within each selected series, a continuous survey is carried out of the whole batch of goods or the entire territorial unit (a residential building or a block);

5. combined (multistage) selection can combine several selection methods at once (for example, stratified and random, or random and mechanical); such a sample is called combined.
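The first four ways of organizing a sample can be illustrated on a toy population. This is a minimal sketch using only Python's `random` module; the population of 100 numbered units, the two strata and the series of 10 units are hypothetical and chosen only to keep the example small.

```python
import random

random.seed(0)  # fixed seed so that the sketch is reproducible

population = list(range(1, 101))   # hypothetical population of 100 numbered units
n = 10                             # desired sample size

# 1. Simple random selection without replacement.
simple = random.sample(population, n)

# 2. Mechanical selection: every step-th unit, starting from a random position.
step = len(population) // n
start = random.randrange(step)
mechanical = population[start::step]

# 3. Stratified selection: a random draw from each stratum, proportional to its size.
strata = {"A": population[:60], "B": population[60:]}   # hypothetical strata
stratified = []
for units in strata.values():
    share = round(n * len(units) / len(population))
    stratified.extend(random.sample(units, share))

# 4. Serial selection: whole series are chosen at random and surveyed completely.
series = [population[i:i + 10] for i in range(0, len(population), 10)]
serial = [unit for block in random.sample(series, 2) for unit in block]

print(simple, mechanical, stratified, serial, sep="\n")
```

Combined selection would chain these steps, for example by first choosing series at random and then drawing a simple random sample inside each chosen series.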

Selection types

By type, selection can be individual, group or combined. In individual selection, individual units of the general population are selected into the sample; in group selection, qualitatively homogeneous groups (series) of units are selected; and combined selection is a combination of the first and second types.

By method, selection is divided into repeated and non-repeated.

Non-repeated selection is selection in which a unit that has fallen into the sample is not returned to the original population and does not take part in further selection; the number of units of the general population N therefore decreases during the selection process. In repeated selection, a unit that has fallen into the sample is returned to the general population after registration and thus retains an equal chance, along with the other units, of being used in the further selection procedure; the number of units of the general population N remains unchanged (this method is rarely used in socio-economic studies). However, for large N (N → ∞) the formulas for non-repeated selection are close to those for repeated selection, and in practice the latter are used more often (N = const).

The main characteristics of the parameters of the general and sample population

Statistical conclusions are based on the distribution of a random variable X, and the observed values (x_1, x_2, ..., x_n) are called realizations of the random variable X (n is the sample size). The distribution of the random variable in the general population is theoretical, ideal in nature, and its sample analogue is the empirical distribution. Some theoretical distributions are specified analytically, i.e. their parameters determine the value of the distribution function at each point in the space of possible values of the random variable. For a sample, the distribution function is difficult and sometimes impossible to determine, so the parameters are estimated from empirical data and then substituted into the analytical expression describing the theoretical distribution. The assumption (or hypothesis) about the type of distribution can be either statistically correct or erroneous. In any case, the empirical distribution reconstructed from the sample characterizes the true one only approximately. The most important distribution parameters are the mathematical expectation and the variance.

By their nature, distributions are continuous or discrete. The best known continuous distribution is the normal distribution. The sample analogues of its parameters, the mathematical expectation and the variance, are the sample mean and the empirical variance. Among discrete distributions, the alternative (dichotomous) distribution is used most often in socio-economic studies. The mathematical expectation of this distribution expresses the relative size (or share) of the units of the population that have the feature under study (it is denoted by the letter p); the share of the population that does not have this feature is denoted by the letter q (q = 1 - p). The variance of the alternative distribution also has an empirical analogue.

Depending on the type of distribution and on the method of selecting population units, the characteristics of the distribution parameters are calculated differently. The main ones for the theoretical and empirical distributions are given in Table 9.1.

The sampling fraction k_n is the ratio of the number of units in the sample to the number of units in the general population:

k_n = n/N.

The sample share w is the ratio of the number of units that have the feature under study, n_n, to the sample size n:

w = n_n/n.

Example. In a batch of goods containing 1000 units, a 5% sample (k_n = 0.05) amounts in absolute terms to 50 units (n = N*0.05); if 2 defective items are found in this sample, the sample share of defects w is 0.04 (w = 2/50 = 0.04, or 4%).

Since the sample population is different from the general population, there are sampling errors.

Table 9.1 Main parameters of the general and sample populations

Sampling errors

In any observation (continuous or sample), errors of two types can occur: registration errors and representativeness errors. Registration errors can be random or systematic. Random errors arise from many different uncontrollable causes, are unintentional, and usually balance each other out in the aggregate (for example, changes in instrument readings due to temperature fluctuations in the room).

Systematic errors are biased, because they violate the rules for selecting objects into the sample (for example, deviations in measurements when the settings of the measuring device are changed).

Example. To assess the social status of the population of a city, it is planned to survey 25% of families. If every fourth apartment is selected by its number, there is a danger of selecting only apartments of one type (for example, one-room apartments), which will introduce a systematic error and distort the results; choosing the apartment number by lot is preferable, since the error will then be random.

Representativeness errors are inherent only in sample observation; they cannot be avoided, and they arise because the sample does not fully reproduce the general population. The values of indicators obtained from the sample differ from the values of the same indicators in the general population (or obtained by continuous observation).

Sampling error is the difference between the value of a parameter in the general population and its sample value. For the mean of a quantitative feature it is the difference between the general mean and the sample mean, and for the share (alternative feature) it is the difference between the general share p and the sample share w.

Sampling errors are inherent only in sample observations. The larger these errors, the more the empirical distribution differs from the theoretical one. The parameters of the empirical distribution (the sample mean and the sample share) are random variables; therefore sampling errors are also random variables that can take different values for different samples, and so it is customary to calculate the average error.

The average sampling error m is a value expressing the standard deviation of the sample mean from the mathematical expectation. Provided the principle of random selection is observed, this value depends primarily on the sample size and on the degree of variation of the feature: the larger n and the smaller the variation of the feature (hence the value of the variance), the smaller the average sampling error m. The ratio between the variances of the general population and of the sample is expressed by the formula:

$\sigma^2_{gen} = \sigma^2_{sample} \cdot \frac{n}{n-1}$

i.e. for sufficiently large n we can assume that $\sigma^2_{gen} \approx \sigma^2_{sample}$. The average sampling error shows the possible deviations of a parameter of the sample from the corresponding parameter of the general population. Table 9.2 gives expressions for calculating the average sampling error for different methods of organizing observation.

Table 9.2 Mean error (m) of sample mean and proportion for different sample types

The notation in the table is as follows: the average of the within-group sample variances for a continuous feature; the average of the within-group variances of the share; the number of series selected and the total number of series; the between-series variance of the mean, computed from the mean of the i-th series and the overall mean over the entire sample for a continuous feature; and the between-series variance of the share, computed from the share of the feature in the i-th series and the overall share of the feature over the entire sample.

However, the magnitude of the average error can be judged only with a certain probability P (P ≤ 1). A. M. Lyapunov proved that the distribution of sample means, and hence of their deviations from the general mean, approximately obeys the normal distribution law when the number of observations is sufficiently large, provided that the general population has a finite mean and a bounded variance.

Mathematically, this statement for the mean is expressed as:

and for the fraction, expression (1) will take the form:

where Δ is the marginal sampling error, which is a multiple of the average sampling error m (Δ = t·m), and the multiplicity factor t is Student's criterion (the "confidence factor"), proposed by W. S. Gosset (pseudonym "Student"); its values for different sample sizes are given in a special table.

The values ​​of the function Ф(t) for some values ​​of t are:

Therefore, expression (3) can be read as follows: with probability P = 0.683 (68.3%) it can be stated that the difference between the sample mean and the general mean will not exceed one mean error m (t = 1); with probability P = 0.954 (95.4%), that it will not exceed two mean errors (t = 2); and with probability P = 0.997 (99.7%), that it will not exceed three mean errors (t = 3). Thus, the probability that this difference exceeds three times the mean error determines the error level and is no more than 0.3%.
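The three probabilities quoted here follow from the normal distribution and can be reproduced with the error function from Python's `math` module. In this sketch the helper name `coverage` is illustrative; it returns the probability that a standard normal variable deviates from zero by no more than t.

```python
from math import erf, sqrt

def coverage(t):
    """Probability that a standard normal variable lies within [-t, t]."""
    return erf(t / sqrt(2))

for t in (1, 2, 3):
    print(f"t = {t}: P = {coverage(t):.3f}")
```

The complement 1 - P for t = 3 is about 0.0027, which is the "no more than 0.3%" error level mentioned in the text.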

Table 9.3 gives formulas for calculating the marginal sampling error.

Table 9.3 Marginal sampling error (Δ) for the mean and the share (p) for different types of sampling

Extending Sample Results to the Population

The ultimate goal of sample observation is to characterize the general population. For small sample sizes, the empirical estimates of the parameters (the sample mean and the sample share) may deviate significantly from their true values (the general mean and the general share). It is therefore necessary to establish the boundaries within which the true values lie for given sample values of the parameters.

The confidence interval of a parameter θ of the general population is a random range of values of this parameter that contains its true value with a probability close to 1 (the reliability).

The marginal sampling error Δ makes it possible to determine the limiting values of the characteristics of the general population and their confidence intervals, which are equal to $\bar{x} \pm \Delta$ for the mean and w ± Δ for the share.

The lower bound of the confidence interval is obtained by subtracting the marginal error from the sample mean (share), and the upper bound by adding it.

The confidence interval for the mean uses the marginal sampling error and, for a given confidence level, is determined by the formula: $\bar{x} \pm \Delta = \bar{x} \pm t \cdot m$.

This means that with a given probability P, which is called the confidence level and is uniquely determined by the value of t, it can be stated that the true value of the mean lies in the range from $\bar{x} - \Delta$ to $\bar{x} + \Delta$, and the true value of the share lies in the range from w - Δ to w + Δ.

When the confidence interval is calculated for the three standard confidence levels P = 95%, P = 99% and P = 99.9%, the value of t is taken from Student's table in the Appendix, depending on the number of degrees of freedom. If the sample size is large enough, the values of t corresponding to these probabilities are 1.96, 2.58 and 3.29. Thus, the marginal sampling error makes it possible to determine the limiting values of the characteristics of the general population and their confidence intervals.
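For large samples the values of t for the standard confidence levels are quantiles of the normal distribution. A minimal sketch with `statistics.NormalDist` (the function name `t_for_confidence` is illustrative) recovers the three numbers quoted in the text.

```python
from statistics import NormalDist

def t_for_confidence(p):
    """Large-sample confidence factor t for a two-sided confidence level p."""
    return NormalDist().inv_cdf((1 + p) / 2)

for p in (0.95, 0.99, 0.999):
    print(f"P = {p}: t = {t_for_confidence(p):.2f}")
```

For small samples these values are too low, which is why the text refers to Student's table with the appropriate number of degrees of freedom.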

Extending the results of sample observation to the general population in socio-economic studies has its own peculiarities, since it requires that all types and groups of the population be fully represented. The possibility of such an extension is assessed by calculating the relative error:

$\Delta_\% = \frac{\Delta}{\bar{x}} \cdot 100\%$ for the mean and $\Delta_\% = \frac{\Delta}{w} \cdot 100\%$ for the share, where $\Delta_\%$ is the relative marginal sampling error.

There are two main methods for extending a sample observation to the population: direct conversion and the method of coefficients.

The essence of direct conversion is to multiply the sample mean $\bar{x}$ by the size of the population N.

Example. Suppose the average number of toddlers per young family in the city is estimated by sampling to be 1.2. If there are 1000 young families in the city, then the number of places required in the municipal nursery is obtained by multiplying this average by the size of the general population N = 1000, i.e. it will be 1200 places.

The method of coefficients is advisable when sample observation is carried out in order to refine the data of a continuous observation.

In doing so, the formula is used:

where all variables are the size of the population:

Required sample size

Table 9.4 Required sample size (n) for different types of sampling organization

When planning a sample survey with a predetermined allowable sampling error, the required sample size must be estimated correctly. It can be determined from the allowable error of the sample observation for a given probability that guarantees an acceptable error level (taking into account the way the observation is organized). Formulas for the required sample size n are easily obtained directly from the formulas for the marginal sampling error. Thus, from the expression for the marginal error:

$\Delta = t \sqrt{\frac{\sigma^2}{n}}$

the sample size n is determined directly:

$n = \frac{t^2 \sigma^2}{\Delta^2}$

This formula shows that as the marginal sampling error Δ decreases, the required sample size, which is proportional to the variance and to the square of Student's t, increases substantially.

For a specific method of organizing observation, the required sample size is calculated by the formulas given in Table 9.4.
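The dependence of the required sample size on the allowable error can be sketched in a few lines. The function below assumes repeated simple random sampling, for which n = t²σ²/Δ², and rounds the result up; the function name and the numbers in the usage lines are hypothetical and serve only as an illustration.

```python
from math import ceil

def required_sample_size(t, variance, delta):
    """Required n for repeated simple random sampling: n = t^2 * variance / delta^2, rounded up."""
    return ceil(t ** 2 * variance / delta ** 2)

# Hypothetical inputs: t = 2 (P = 0.954), variance 53.1, allowable errors of 2 and 1 units.
print(required_sample_size(2, 53.1, 2))
print(required_sample_size(2, 53.1, 1))
```

Halving the allowable error roughly quadruples the required sample size, because Δ enters the formula squared.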

Practical Calculation Examples

Example 1. Calculation of the mean value and confidence interval for a continuous quantitative characteristic.

To assess the speed of settlements with creditors, a bank took a random sample of 10 payment documents. Their values (in days) were: 10; 3; 15; 15; 22; 7; 8; 1; 19; 20.

With probability P = 0.954, determine the marginal error Δ of the sample mean and the confidence limits of the average settlement time.

Solution. The mean is calculated by the formula from Table 9.1 for the sample: $\bar{x} = \frac{10 + 3 + 15 + 15 + 22 + 7 + 8 + 1 + 19 + 20}{10} = 12.0$ days.

The variance is calculated by the formula from Table 9.1: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{478}{9} \approx 53.1$.

The standard deviation is $s = \sqrt{53.1} \approx 7.3$ days.

The error of the mean is calculated by the formula:

$m = \frac{s}{\sqrt{n}} = \frac{7.3}{\sqrt{10}} \approx 2.3$ days,

i.e. the mean value is $\bar{x}$ ± m = 12.0 ± 2.3 days.

The reliability of the mean was $t = \bar{x}/m = 12.0/2.3 \approx 5.2$.

The marginal error is calculated by the formula from Table 9.3 for repeated sampling, since the size of the population is unknown, for the confidence level P = 0.954: Δ = t·m = 2 · 2.3 = 4.6 days.

Thus, the mean value is $\bar{x}$ ± Δ = $\bar{x}$ ± 2m = 12.0 ± 4.6, i.e. its true value lies in the range from 7.4 to 16.6 days.

Using Student's table from the Appendix, we conclude that for n - 1 = 10 - 1 = 9 degrees of freedom the obtained value is reliable at the significance level α ≤ 0.001, i.e. the resulting mean value is significantly different from 0.
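The figures of Example 1 can be checked with the standard `statistics` module. This sketch assumes, as the result 12.0 ± 2.3 implies, that the variance is computed with the divisor n - 1, and it takes t = 2 for the confidence level P = 0.954.

```python
from math import sqrt
from statistics import mean, stdev

days = [10, 3, 15, 15, 22, 7, 8, 1, 19, 20]
t = 2   # confidence factor for P = 0.954

x_bar = mean(days)            # sample mean
s = stdev(days)               # standard deviation with divisor n - 1
m = s / sqrt(len(days))       # mean error for repeated sampling
delta = t * m                 # marginal error

print(f"mean = {x_bar}, s = {s:.1f}, m = {m:.1f}, delta = {delta:.1f}")
print(f"interval: from {x_bar - delta:.1f} to {x_bar + delta:.1f} days")
```

The ratio `x_bar / m`, about 5.2, is the value compared with Student's table when the significance of the mean is assessed.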

Example 2. Estimate of the probability (general share) r.

With a mechanical sampling method of surveying the social status of 1000 families, it was revealed that the proportion of low-income families was w = 0.3 (30%)(the sample was 2% , i.e. n/N = 0.02). Required with confidence level p = 0.997 define an indicator R low-income families throughout the region.

Solution. From the tabulated values of the function Ф(t), we find t = 3 for the given confidence level P = 0.997 (see formula 3). The marginal error of the share w is determined by the formula from Table 9.3 for non-repetitive sampling (mechanical sampling is always non-repetitive):

The marginal relative sampling error, in %, will be:

The probability (general share) of low-income families in the region will be p = w ± Δw, and the confidence limits of p are calculated from the double inequality

w − Δw ≤ p ≤ w + Δw, i.e. the true value of p lies within

0.3 − 0.014 < p < 0.3 + 0.014, namely from 28.6% to 31.4%.

Thus, with a probability of 0.997, it can be stated that the proportion of low-income families among all families in the region ranges from 28.6% to 31.4%.

Example 3. Calculation of the mean value and confidence interval for a discrete feature specified by an interval series.

Table 9.5 gives the distribution of production orders by the time the enterprise took to complete them.

Table 9.5. Distribution of observations by time of occurrence

Solution. The average order completion time is calculated by the weighted arithmetic mean formula, where x_i is the midpoint of an interval and f_i is the number of orders in it:

$\bar{x} = \frac{\sum x_i f_i}{\sum f_i}$

The average time will be:

$\bar{x}$ = (3·20 + 9·80 + 24·60 + 48·20 + 72·20)/200 = 23.1 months

We get the same answer if we use the shares p_i from the penultimate column of Table 9.5.

Note that the midpoint of the last, open-ended interval is found by artificially closing it with the width of the preceding interval, equal to 60 − 36 = 24 months.
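The weighted mean of Example 3 can be reproduced as follows. This is a minimal sketch: the midpoints and counts are those appearing in the calculation above, and treating p_i as fractions of the total (rather than percentages) is an assumption.

```python
# Interval midpoints (months) and order counts from the calculation in Example 3.
midpoints = [3, 9, 24, 48, 72]
counts = [20, 80, 60, 20, 20]

total = sum(counts)

# Weighted arithmetic mean using the absolute frequencies f_i.
mean_by_counts = sum(x * f for x, f in zip(midpoints, counts)) / total

# Same mean using shares p_i = f_i / total (assumed to be fractions, not percent).
shares = [f / total for f in counts]
mean_by_shares = sum(x * p for x, p in zip(midpoints, shares))

print(round(mean_by_counts, 1), round(mean_by_shares, 1))
```

Both routes give the same value because dividing each frequency by the total before summing is algebraically identical to dividing the weighted sum by the total afterwards.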

The variance is calculated from the squared deviations of the values x_i from the mean, where x_i is the midpoint of each interval.

Therefore $\sigma^2 = \frac{20^2 + 14^2 + 1 + 25^2 + 49^2}{4} \approx 905.8$, and the standard deviation is $\sigma \approx 30.1$ months.

The error of the mean is $m \approx 13.4$ months, i.e. the mean is $\bar{x} \pm m = 23.1 \pm 13.4$.

The marginal error is calculated by the formula from Table 9.3 for repeated selection, because the population size is unknown, at the 0.954 confidence level: $\Delta = 2m = 26.8$ months.

So the mean is $\bar{x} \pm \Delta = 23.1 \pm 26.8$,

i.e. its true value lies in the range from 0 to 50 months (the lower limit is cut off at 0, since a completion time cannot be negative).

Example 4. To determine the speed of settlements with creditors for the N = 500 enterprises of a corporation, a commercial bank needs to conduct a sample study by random non-repetitive selection. Determine the required sample size n such that, with probability P = 0.954, the error of the sample mean does not exceed 3 days, given that trial estimates showed a standard deviation σ of 10 days.

Solution. To determine the required number of observations n, we use the formula for non-repetitive selection from Table 9.4:

$n = \frac{t^2 \sigma^2 N}{\Delta_x^2 N + t^2 \sigma^2}$

In it, the value of t is determined by the confidence level P = 0.954 and equals 2. The standard deviation is σ = 10, the population size is N = 500, and the marginal error of the mean is Δx = 3. Substituting these values into the formula, we get:

$n = \frac{2^2 \cdot 10^2 \cdot 500}{3^2 \cdot 500 + 2^2 \cdot 10^2} = \frac{200000}{4900} \approx 40.8$,

i.e. a sample of 41 enterprises is enough to estimate the required parameter, the speed of settlements with creditors.
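The sample-size calculation of Example 4 can be sketched in code. The function names below are illustrative, not taken from the source; the non-repetitive formula is the standard one with the finite-population term, and the result is rounded up because a sample size must be a whole number.

```python
import math

def sample_size_non_repetitive(t, sigma, delta, population):
    """n = t^2 * sigma^2 * N / (delta^2 * N + t^2 * sigma^2), rounded up."""
    numerator = t**2 * sigma**2 * population
    denominator = delta**2 * population + t**2 * sigma**2
    return math.ceil(numerator / denominator)

def sample_size_repeated(t, sigma, delta):
    """n = t^2 * sigma^2 / delta^2, rounded up (no finite-population term)."""
    return math.ceil(t**2 * sigma**2 / delta**2)

# Values from Example 4: t = 2 (P = 0.954), sigma = 10 days, delta = 3 days, N = 500.
n_non_repetitive = sample_size_non_repetitive(2, 10, 3, 500)
n_repeated = sample_size_repeated(2, 10, 3)
print(n_non_repetitive, n_repeated)
```

With these inputs the non-repetitive formula gives 41 enterprises; repeated selection would need slightly more for the same accuracy.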

Sample observation is used when continuous (complete) observation is physically impossible because of the large amount of data, or is economically impractical. Physical impossibility arises, for example, in studying passenger flows, market prices, or family budgets. Economic impracticality arises when assessing the quality of goods requires destroying them, for example in tasting or in testing bricks for strength.

The statistical units selected for observation make up the sample population, or sample, and their entire array is the general population (GP). The number of units in the sample is denoted n, and the number in the entire GP is denoted N. The ratio n/N is called the relative size of the sample, or the sample share.

The quality of sampling results depends on the representativeness of the sample, that is, on how well it represents the GP. To ensure representativeness, the principle of random selection of units must be observed: nothing other than chance may influence whether a unit of the GP is included in the sample.

There are four ways of random selection into a sample:

  1. Simple random selection, or the "lotto method": serial numbers are assigned to the statistical values and written on objects (for example, kegs), which are then mixed in a container (for example, a bag) and drawn at random. In practice, this method is carried out with a random number generator or with mathematical tables of random numbers.
  2. Mechanical selection, in which every (N/n)-th value of the general population is selected. For example, if the population contains 100,000 values and 1,000 are to be selected, then every 100,000 / 1,000 = 100th value falls into the sample. If the values are not ranked, the first one is chosen at random from the first hundred, and each subsequent number is one hundred greater. For example, if unit number 19 was the first, then number 119 is next, then number 219, then number 319, and so on. If the population units are ranked, then No. 50 is selected first, then No. 150, then No. 250, and so on.
  3. Stratified selection, used for a heterogeneous data array: the general population is first divided into homogeneous groups, and random or mechanical selection is then applied within each group.
  4. Serial selection, a special method in which it is not individual values that are chosen randomly or mechanically but whole series of them (runs of consecutive numbers), within which continuous observation is carried out.
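The four selection methods listed above can be sketched as small functions. This is an illustration under stated assumptions, not a reference implementation: all function names are hypothetical, the mechanical variant shown uses only the random start described for unranked populations, and the stratified variant assumes the same sampling fraction in every group.

```python
import random

def simple_random(units, n, rng):
    """'Lotto method': draw n distinct units at random."""
    return rng.sample(units, n)

def mechanical(units, n, rng):
    """Take every (N/n)-th unit, starting at a random position in the first step."""
    step = len(units) // n
    start = rng.randrange(step)
    return units[start::step][:n]

def stratified(strata, fraction, rng):
    """Draw the same fraction at random from each homogeneous group."""
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

def serial(series, n_series, rng):
    """Pick whole series at random and observe every unit inside them."""
    chosen = rng.sample(series, n_series)
    return [unit for one_series in chosen for unit in one_series]

rng = random.Random(0)                    # fixed seed so the sketch is repeatable
population = list(range(1, 100_001))      # units numbered 1..100000
systematic_sample = mechanical(population, 1000, rng)
```

For the 100,000-unit population, `mechanical` returns 1,000 unit numbers spaced exactly 100 apart, matching the 19, 119, 219, ... pattern described in item 2.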

The quality of sample observations also depends on the type of sampling: repeated or non-repetitive.
With repeated selection, the statistical values or series that fell into the sample are returned to the general population after use and have a chance of getting into a new sample. All values of the general population then have the same probability of being included in the sample.
Non-repetitive selection means that the statistical values or series included in the sample are not returned to the general population after use, so the probability of being selected next increases for the values that remain.

Non-repetitive sampling gives more accurate results, so it is used more often. But there are situations in which it cannot be applied (the study of passenger flows, consumer demand, etc.), and repeated selection is then used.
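The difference between the two types of selection can be illustrated with Python's standard `random` module. This is a toy sketch with a hypothetical population of 20 numbered units: `choices` draws with replacement, `sample` draws without.

```python
import random

rng = random.Random(1)                 # fixed seed so the sketch is repeatable
population = list(range(1, 21))        # 20 numbered units (illustrative)

# Repeated selection: each unit is returned after being drawn, so duplicates can occur.
repeated = rng.choices(population, k=10)

# Non-repetitive selection: a drawn unit is not returned, so all units are distinct.
non_repetitive = rng.sample(population, 10)

print("repeated:      ", sorted(repeated))
print("non-repetitive:", sorted(non_repetitive))
```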

Sampling errors

The sample population can be formed on the basis of a quantitative characteristic of the statistical values, or on an alternative (attributive) characteristic. In the first case the generalizing characteristic of the sample is the sample mean, and in the second it is the sample share, denoted w. The corresponding characteristics of the general population are the general mean and the general share p.

The differences between the sample mean and the general mean, and between w and p, are called the sampling error, which is divided into registration error and representativeness error. The first part of the sampling error arises from incorrect or inaccurate information caused by misunderstanding of the question, carelessness of the registrar in filling out questionnaires and forms, and so on. It is fairly easy to detect and correct. The second part arises from systematic or spontaneous failure to observe the principle of random selection. It is difficult to detect and eliminate and is much larger than the first, so the main attention is paid to it.

The size of the sampling error may differ between samples drawn from the same general population, so statistics works with the average sampling error μ, determined for repeated and non-repetitive sampling by the formulas:

$\mu = \sqrt{\frac{D_v}{n}}$ (repeated);

$\mu = \sqrt{\frac{D_v}{n}\left(1 - \frac{n}{N}\right)}$ (non-repetitive);

where Dv is the sample variance.

For example, at a factory with 1000 employees, a 5% random non-repetitive sample was drawn to determine the average length of service of the employees. The results of the sample observation are given in the first two columns of the following table:

Table columns: X, years (work experience) | f, persons (number of employees in the sample) | X_mid (interval midpoint) | X_mid · f

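The 3rd column gives the midpoints X_mid of the X intervals (half the sum of the lower and upper boundaries of each interval), and the 4th column gives the products X_mid · f, which are used to find the sample mean by the weighted arithmetic mean formula:

$\bar{x} = \frac{\sum X_{mid} f}{\sum f}$ = 143.0/50 = 2.86 (years).

Calculate the weighted sample variance:
$D_v = \frac{\sum (X_{mid} - \bar{x})^2 f}{\sum f}$ = 105.520/50 = 2.110.

Now let us find the average error for non-repetitive sampling:
$\mu = \sqrt{\frac{D_v}{n}\left(1 - \frac{n}{N}\right)} = \sqrt{\frac{2.110}{50}\left(1 - \frac{50}{1000}\right)}$ = 0.200 (years).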
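The average error for the factory example can be checked numerically. The sketch below assumes the usual forms of the two formulas (variance divided by n, with the extra factor 1 − n/N for non-repetitive sampling); the function name and its `population` parameter are illustrative.

```python
import math

def mean_error(variance, n, population=None):
    """Average sampling error; applies the (1 - n/N) factor when N is given."""
    error_squared = variance / n
    if population is not None:
        error_squared *= 1 - n / population   # non-repetitive sampling
    return math.sqrt(error_squared)

# Factory example: sample variance 2.110, n = 50 employees out of N = 1000.
non_repetitive_error = mean_error(2.110, 50, population=1000)
repeated_error = mean_error(2.110, 50)

print(f"non-repetitive: {non_repetitive_error:.3f} years")
print(f"repeated:       {repeated_error:.3f} years")
```

The non-repetitive value rounds to the 0.200 years obtained above, and the repeated-sampling value is slightly larger because the factor 1 − n/N is absent.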

The formulas for the average sampling error show that the error is smaller with non-repetitive sampling. As probability theory proves, the actual error stays within the average error with a probability of 0.683 (that is, if 1000 samples are drawn from one general population, in about 683 of them the error will not exceed the average sampling error). This probability is not high, so the average error is of little use in practical calculations, where a higher probability is needed. To bound the sampling error with a probability higher than 0.683, the marginal sampling error is calculated:

$\Delta = t \mu$,

where μ is the average sampling error and t is the confidence coefficient, which depends on the probability with which the marginal sampling error is determined.

Values of the confidence coefficient t have been calculated for various probabilities and are available in special tables (the Laplace integral). The following combinations are widely used in statistics:

Probability   0.683   0.866   0.950   0.954   0.988   0.990   0.997   0.999
t             1       1.5     1.96    2       2.5     2.58    3       3.5
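The tabulated pairs can be cross-checked without a printed table. The sketch below assumes the probabilities are those of a standard normal variable falling within ±t, which is what the Laplace integral tabulates; the function name is illustrative.

```python
import math

def probability_for_t(t):
    """Probability that a standard normal variable lies within +/- t."""
    return math.erf(t / math.sqrt(2))

for t in (1, 1.5, 1.96, 2, 2.5, 2.58, 3, 3.5):
    print(f"t = {t:<4}  probability = {probability_for_t(t):.4f}")
```

The computed probabilities agree with the table to within rounding of the third decimal place.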

Given a specific probability level, the corresponding value of t is taken from the table and the marginal sampling error is determined by the formula.
Most commonly the probability 0.95 and t = 1.96 are used, that is, with a probability of 95% the marginal sampling error is taken to be 1.96 times the average error. This probability (0.95) is considered standard and is applied by default in calculations.

In our example, we determine the marginal sampling error at the standard 95% probability (taking t = 1.96 from the table): Δ = 1.96 · 0.200 = 0.392 (years).

After the marginal error has been calculated, the confidence interval for the generalizing characteristic of the general population is found. For the general mean it runs from the sample mean minus Δ to the sample mean plus Δ; in our example, from 2.86 − 0.392 to 2.86 + 0.392.
That is, the average length of service of workers at the entire plant lies in the range from 2.468 to 3.252 years.
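The step from the average error to the confidence interval is short enough to write as one function. This is a sketch using the figures of the factory example; the function name and the default t = 1.96 are illustrative choices.

```python
def confidence_interval(sample_mean, average_error, t=1.96):
    """Return (lower, upper) limits: sample mean -/+ t * average error."""
    marginal_error = t * average_error
    return sample_mean - marginal_error, sample_mean + marginal_error

# Factory example: sample mean 2.86 years, average error 0.200 years, t = 1.96.
low, high = confidence_interval(2.86, 0.200)
print(f"{low:.3f} to {high:.3f} years")
```

Passing a different `t` (for example 3 for a probability of 0.997) widens the interval accordingly.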

Determining the sample size

When a sample observation program is developed, a specific value of the marginal error and a probability level are sometimes set in advance, while the minimum sample size that provides this accuracy is unknown. It can be obtained from the formulas for the average and marginal errors, depending on the type of sample. Substituting the formula for the average error into the formula for the marginal error and solving for the sample size gives:
for repeated sampling: $n = \frac{t^2 D_v}{\Delta^2}$;
for non-repetitive sampling: $n = \frac{t^2 D_v N}{\Delta^2 N + t^2 D_v}$.
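The intermediate algebra can be written out as follows. This is a sketch of the derivation, assuming the marginal error is t times the average error and writing $D_v$ for the sample variance.

```latex
% Repeated sampling: square the marginal error and solve for n.
\Delta = t\sqrt{\frac{D_v}{n}}
\;\Longrightarrow\;
\Delta^2 = \frac{t^2 D_v}{n}
\;\Longrightarrow\;
n = \frac{t^2 D_v}{\Delta^2}

% Non-repetitive sampling: the factor (1 - n/N) turns D_v/n into D_v(1/n - 1/N).
\Delta^2 = t^2 \frac{D_v}{n}\left(1 - \frac{n}{N}\right)
         = t^2 D_v \left(\frac{1}{n} - \frac{1}{N}\right)
\;\Longrightarrow\;
\frac{1}{n} = \frac{\Delta^2}{t^2 D_v} + \frac{1}{N}
            = \frac{\Delta^2 N + t^2 D_v}{t^2 D_v N}
\;\Longrightarrow\;
n = \frac{t^2 D_v N}{\Delta^2 N + t^2 D_v}
```

As N grows without bound, the second result tends to the first, so the repeated-sampling formula can be read as the limiting case of a very large population.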

In addition, for statistical values with quantitative characteristics the sample variance must also be known, but it too is unknown when the calculations begin. It is therefore taken approximately, in one of the following ways (in order of priority):

When non-numerical characteristics are studied, even if no approximate information about the sample share is available, w = 0.5 is assumed, which, by the share variance formula, corresponds to the maximum possible sample variance: Dv = 0.5 · (1 − 0.5) = 0.25.
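Why w = 0.5 is the cautious choice can be seen by scanning the share variance w(1 − w) over a grid of shares. The sketch below also plugs the result into the repeated-sampling sample-size formula; the marginal error of 0.05 and t = 1.96 in the usage line are hypothetical inputs chosen for illustration, not values from the text.

```python
import math

def share_variance(w):
    """Variance of an alternative (yes/no) characteristic with share w."""
    return w * (1 - w)

# Scan shares 0.00, 0.01, ..., 1.00 and find where the variance is largest.
grid = [i / 100 for i in range(101)]
worst_case_share = max(grid, key=share_variance)

def sample_size_for_share(delta, t=1.96, w=0.5):
    """Repeated-sampling size n = t^2 * w(1 - w) / delta^2, rounded up."""
    return math.ceil(t**2 * share_variance(w) / delta**2)

print(worst_case_share, share_variance(worst_case_share))
print(sample_size_for_share(0.05))   # hypothetical target: marginal error of 5 points
```

Because any other share gives a smaller variance, a sample size computed with w = 0.5 is never too small for the chosen marginal error.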