The sample is representative. Representativeness - what is it? Systematic random sampling

In fact, we will start with not one but three questions: What is a sample? When is it representative? What does it represent?

A population is any group of people, organizations, or events of interest to us about which we want to draw conclusions, and a case, or object, is any element of such a population.

A sample is any subgroup of the population of cases (objects) selected for analysis.

If we want to study the decision-making activity of state legislators, we could examine such activity in the legislatures of Virginia, North Carolina, and South Carolina rather than in all fifty states, and on this basis generalize the resulting data to the population from which these three states were selected. If we want to investigate the voting preferences of Pennsylvania voters, we could do so by interviewing 50 U.S. Steel workers in Pittsburgh and generalizing the results of the poll to all voters in the state.

Similarly, if we want to measure the intelligence of college students, we could test all the defensive players enrolled at Ohio State in a given football season and then extend the results to the population of which they are a part. In each example, we proceed as follows: we establish a subgroup within the population, study this subgroup, or sample, in some detail, and extend our results to the entire population. These are the main stages of sampling.

However, it seems quite obvious that each of these samples has a significant drawback. For example, although the legislatures of Virginia, North Carolina, and South Carolina are part of the population of state legislatures, they are, for historical, geographic, and political reasons, likely to operate in very similar ways and very differently from the legislatures of such distinct states as New York, Nebraska, and Alaska. Although the fifty steelworkers in Pittsburgh may indeed be Pennsylvania voters, they may well, by virtue of socioeconomic status, education, and life experience, hold views different from those of many other people who are likewise voters.

Likewise, although Ohio State football players are college students, they may well differ from other students for a variety of reasons. In other words, although each of these subgroups is indeed a sample, the members of each are systematically different from most other members of the population from which they are selected. Taken as a group, none of them is typical in terms of the distribution of opinions, behavioral motives, and characteristics in the population with which it is associated. Accordingly, political scientists would say that none of these samples is representative.


A representative sample is a sample in which all the main features of the population from which it is drawn are present in approximately the same proportion, or with the same frequency, as in that population. Thus, if 50% of all state legislatures meet only once every two years, about half of a representative sample of state legislatures should be of this type. If 30% of Pennsylvania's voters are blue-collar, about 30% of a representative sample of those voters (rather than 100%, as in the example above) should be blue-collar.

And if 2% of all college students are athletes, about the same proportion of a representative sample of college students should be athletes. In other words, a representative sample is a microcosm, a smaller but accurate model of the population it is intended to represent. To the extent that the sample is representative, the conclusions based on the study of this sample can be safely considered applicable to the original population. This extension of results is what we call generalizability.
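The definition can be checked mechanically by comparing the share of each category in the sample with its share in the population. The following is only a minimal sketch for a single feature: the function names, the 5-percentage-point tolerance, and the toy voter data are assumptions made for illustration, not part of the original text.

```python
from collections import Counter


def category_shares(units):
    """Return the share of each category among the given units."""
    counts = Counter(units)
    total = len(units)
    return {category: count / total for category, count in counts.items()}


def is_representative(population, sample, tolerance=0.05):
    """Check that every population category appears in the sample in
    roughly the same proportion (within `tolerance`, an assumed cutoff)."""
    pop_shares = category_shares(population)
    sample_shares = category_shares(sample)
    return all(
        abs(share - sample_shares.get(category, 0.0)) <= tolerance
        for category, share in pop_shares.items()
    )


# Toy population: 30% blue-collar voters, 70% other voters.
population = ["blue-collar"] * 300 + ["other"] * 700

balanced_sample = ["blue-collar"] * 15 + ["other"] * 35  # 30% blue-collar
steelworkers_only = ["blue-collar"] * 50                  # 100% blue-collar

print(is_representative(population, balanced_sample))    # True
print(is_representative(population, steelworkers_only))  # False
```

A real study would have to repeat such a comparison for every feature that matters, which is exactly why the true population shares are often the missing ingredient.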

Perhaps a graphic illustration will help clarify this. Suppose we want to study patterns of political group membership among US adults. Figure 5.1 shows three circles divided into six equal sectors. Figure 5.1a represents the entire population under consideration. The members of the population are classified according to the number of political groups (such as parties and interest groups) to which they belong.

In this example every adult belongs to at least one and no more than six political groups, and these six levels of membership are equally common in the population (hence the equal sectors). Suppose we want to investigate people's motives for joining a group, their choice of group, and their patterns of participation, but because of resource constraints we are able to examine only one out of every six members of the population. Who should be selected for analysis?

Fig. 5.1. Formation of a sample from the general population

One of the possible samples of this size is illustrated by the shaded area in Fig. 5.1b; however, it clearly does not reflect the structure of the population.

If we were to make generalizations based on this sample, we would conclude:

1) that all adult Americans belong to five political groups and

2) that all group behavior of Americans matches the behavior of those who belong to exactly five groups.

However, we know that the first conclusion is not true, and this may cause us to doubt the validity of the second.

Thus, the sample shown in Figure 5.1b is not representative because it does not reflect the actual distribution of a given population property (often called a parameter). Such a sample is said to be biased toward members of five groups, or biased against all other patterns of group membership. Conclusions about the population drawn from such a biased sample are usually erroneous.

This is demonstrated most clearly by the catastrophe that befell the magazine Literary Digest in the 1930s, when it ran a public opinion poll on the outcome of the election. Literary Digest was a periodical that reprinted newspaper editorials and other materials reflecting public opinion; the magazine was very popular at the beginning of the century.

Beginning in 1920, the magazine conducted a large-scale nationwide poll in which more than a million people were mailed ballots asking them to mark their preferred candidate in the upcoming presidential election. For a number of years the magazine's poll results were so accurate that the September poll seemed to leave the November election with little importance.

How could a mistake occur with such a large sample? Yet in 1936 this is exactly what happened: victory by a large margin (60:40) was predicted for the Republican candidate, Alf Landon. In the election, Landon lost to the incumbent, Franklin D. Roosevelt, by practically the same margin by which he had been expected to win. The credibility of the Literary Digest was so severely undermined that the magazine ceased publication soon after. What happened? It is very simple: the Digest poll used a biased sample. Postcards were sent to people whose names were drawn from two sources: telephone directories and car registration lists.

Although this method of selection had previously differed little from other methods, things were quite different in 1936, during the Great Depression, when the less affluent voters, Roosevelt's most likely base of support, could not afford a telephone, let alone a car. Thus the sample used in the Digest poll was in fact biased toward those most likely to vote Republican, and it is even surprising that Roosevelt did as well in it as he did.

How can this problem be solved? Returning to our example, let us compare the sample in Fig. 5.1b with the sample in Fig. 5.1c. In the latter case, a sixth of the population was also selected for analysis, but each of the main types in the population is represented in the sample in the proportion in which it is represented in the entire population. Such a sample shows that one in every six American adults belongs to one political group, one in six to two, and so on. Such a sample would also reveal other differences among its members that could be related to participation in a different number of groups. Thus, the sample presented in Figure 5.1c is a representative sample for the population under consideration.

Of course, this example is simplified in at least two extremely important respects. First, most of the populations of interest to political scientists are more diverse than the one in the example. People, documents, governments, organizations, decisions, and so on differ from one another not in one but in a much larger number of characteristics. Therefore, a representative sample should be such that each of the main distinguishing characteristics is represented in proportion to its share in the population.

Secondly, the situation where the actual distribution of the variables, or characteristics that we want to measure, is not known in advance, is much more common than the opposite - perhaps it was not measured in the previous population census. Thus, a representative sample must be designed so that it can accurately reflect the existing distribution even when we cannot directly assess its validity. The sampling procedure must have an internal logic capable of convincing us that, if we were able to compare the sample with the census, it would indeed be representative.

To make it possible to reflect accurately the complex organization of a given population, and to gain a certain degree of confidence that the proposed procedures can do this, researchers turn to statistical methods. In doing so, they work in two directions. First, using certain rules (internal logic), researchers decide which specific objects to study, that is, what exactly to include in a particular sample. Second, using very different rules, they decide how many objects to select. We will not study these numerous rules in detail; we will consider only their role in political science research. Let's start with the strategies for selecting objects that form a representative sample.

The ultimate goal of studying a sample is always to obtain information about the population. To do this, a sample study must meet certain conditions. One of the main conditions is the representativeness of the sample. As discussed earlier, a distinction is made between qualitative and quantitative representativeness.

Randomness, which guarantees the qualitative (structural) representativeness of statistical studies, is achieved by fulfilling a number of conditions for the formation of sample groups (sets):

1. Each member of the population must have an equal probability of being included in the sample.

2. The selection of units of observation from the general population must be carried out regardless of the trait under study. If the selection is carried out purposefully, then it is also necessary to observe the conditions for the independence of the distribution of the trait under study.

3. The selection should be carried out from homogeneous groups.

Compliance with the conditions that guarantee the maximum proximity of the sample and the general population is ensured by special methods of selection. Depending on the method of formation, the following samples are distinguished:

1. Samples that do not require the division of the general population into parts (simple random sampling proper, repeated or non-repeated).

2. Samples that require splitting the general population into parts (mechanical, typical or typological samples, cohort, paired-conjugate samples).

A simple random sample proper is formed by random selection, that is, by chance. Random selection is based on mixing. For example: drawing a ball in a sports lottery after all the balls have been mixed, drawing winning lottery numbers, choosing patient cards at random for a study, and so on. Sometimes random numbers are used, taken from tables of random numbers or produced by random number generators. Observation units whose numbers match the random numbers drawn are then selected from a pre-numbered list of the general population.

When a random sample is drawn, once an object has been selected and all the necessary data about it recorded, there are two options: the object can be returned to the general population or not returned. Accordingly, the sample is called repeated (the object is returned to the population) or non-repeated (the object is not returned to the population). Since in most statistical studies there is practically no difference between repeated and non-repeated samples, it is accepted a priori that the sample is repeated.
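Selection from a pre-numbered list using a random number generator, with or without returning the selected object, can be sketched as follows. This is an illustrative sketch only: the function name, the `repeated` flag, and the list of 100 patient cards are assumptions introduced here.

```python
import random


def simple_random_sample(units, n, repeated=False, seed=None):
    """Draw n units at random.

    repeated=False: non-repeated selection (object is not returned);
    repeated=True:  repeated selection (object is returned each time).
    """
    rng = random.Random(seed)
    if repeated:
        # Every draw is made from the full list, so a unit may recur.
        return rng.choices(units, k=n)
    # Every unit can be selected at most once.
    return rng.sample(units, n)


# Pre-numbered population of 100 patient cards (hypothetical).
cards = list(range(1, 101))

without_return = simple_random_sample(cards, 10, repeated=False, seed=1)
with_return = simple_random_sample(cards, 10, repeated=True, seed=1)

print(without_return)  # 10 distinct card numbers
print(with_return)     # 10 card numbers, duplicates possible
```

When the sample is small relative to the population, duplicates in the repeated variant are rare, which is one way to see why the two schemes give nearly the same results in practice.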

Estimating the required sample size

In order for the sample to be quantitatively representative of the general population, it is necessary to first estimate the amount of data to be included in the sample.

When the size of the general population is unknown, the size of a repeated sample that guarantees representative results, if the result is expressed as a relative value (share), is determined by the formula:

$$ n = \frac{t^2 \, p \, q}{\Delta^2} $$

where p is the value of the indicator of the trait under study, in %; q = (100 - p);

t is a confidence coefficient showing the probability that the indicator will not go beyond the limits of the marginal error (usually t = 2 is taken, which provides a 95% probability of an error-free forecast);

Δ is the marginal error of the indicator.

For example: one of the indicators characterizing the health of workers in industrial enterprises is the percentage of workers who were not ill during the year. Suppose that for the industrial sector to which the surveyed enterprise belongs, this indicator is 25%. The marginal error that can be allowed so that the spread of the indicator values does not exceed reasonable limits is 5%. In this case, the indicator can take values of 25% ± 5%, i.e. from 20% to 30%. Assuming t = 2, we get

$$ n = \frac{2^2 \cdot 25 \cdot 75}{5^2} = 300 $$

If the indicator is an average value, the number of observations can be determined by the formula:

$$ n = \frac{t^2 \sigma^2}{\Delta^2} $$

where σ is the standard deviation, which can be obtained from previous studies or from trial (pilot) studies.

With non-repeated selection from a general population of known size, the required random sample size for relative values (shares) is determined by the formula:

$$ n = \frac{t^2 \, p \, q \, N}{N \Delta^2 + t^2 \, p \, q} $$

for average values the formula is used:

$$ n = \frac{t^2 \sigma^2 N}{N \Delta^2 + t^2 \sigma^2} $$

where N is the size of the general population.

Based on the conditions of the above example and assuming the size of the general population N = 500 workers, we get:

$$ n = \frac{2^2 \cdot 25 \cdot 75 \cdot 500}{500 \cdot 5^2 + 2^2 \cdot 25 \cdot 75} = \frac{3\,750\,000}{20\,000} = 187.5 \approx 188 $$

It is easy to see that the required sample size for non-repeated sampling is smaller than for repeated sampling (188 and 300 workers, respectively).

In general, the number of observations required to obtain representative data varies inversely with the square of the error allowed.
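The sample-size calculation for a share can be sketched in code. The sketch assumes the standard formulas n = t²pq/Δ² (repeated selection, population size unknown) and n = t²pqN/(NΔ² + t²pq) (non-repeated selection, known N), which reproduce the 300 and 188 workers quoted in the example. The function name and the rounding up to a whole person are assumptions made here.

```python
import math


def sample_size_share(p, delta, t=2.0, population_size=None):
    """Sample size for a share p (in %) with marginal error delta (in %).

    population_size=None -> repeated selection, population size unknown.
    population_size=N    -> non-repeated selection from N units.
    """
    q = 100.0 - p
    if population_size is None:
        # n = t^2 * p * q / delta^2
        n = t**2 * p * q / delta**2
    else:
        # n = t^2 * p * q * N / (N * delta^2 + t^2 * p * q)
        N = population_size
        n = t**2 * p * q * N / (N * delta**2 + t**2 * p * q)
    return math.ceil(n)  # round up to a whole observation (assumption)


# The worked example: p = 25%, marginal error 5%, t = 2.
print(sample_size_share(25, 5))                       # 300
print(sample_size_share(25, 5, population_size=500))  # 188

# Halving the allowed error quadruples the repeated sample size.
print(sample_size_share(25, 2.5))                     # 1200
```

The last call illustrates the inverse-square relationship: reducing the marginal error from 5% to 2.5% raises the required sample from 300 to 1200.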

Mechanical sampling is sampling in which units of observation are selected mechanically from the surveyed population, for example every fifth or every tenth worker according to the cards of the enterprise's personnel department or the outpatient cards of the medical unit's polyclinic.

Typical (typological, or zoned) sampling involves dividing the general population into a number of qualitatively homogeneous groups. For example: when studying morbidity among university students, student groups that are typical in composition are selected in each year of study for in-depth examination. This selection method is often combined with others. For example: the territory of a city is divided into typical areas according to the degree of pollution, and observation groups are then formed within these areas by random selection.

Cohort selection is a form of purposive selection. With this method, individuals are selected from the general population (the distribution into subgroups is non-random) who are united by the moment at which some sign or studied effect that plays a significant role in the study appeared (year of birth, onset of the disease, start of taking a drug, etc.).

A case-control study is a type of epidemiological study in which the distribution of a risk factor is compared between a group of patients with a disease and a control group. Such a study is retrospective, since the researcher, having divided subjects into groups according to whether or not they have the disease, collects information from them about the past.

The use of the sampling method in sanitary statistics for studying the general morbidity of the population deserves separate mention. The theoretical premises of the sampling method have been tested in special studies. Thus, in 1928 V.S. Bykhovsky et al. processed 132.8 thousand cards with morbidity data in parallel, both by the continuous (complete) method and by mechanical selection of every fifth card. An analysis of the results showed that the data from the sample study of morbidity were highly representative. However, to this day there are no unified methodological approaches to conducting sample-based sanitary-statistical studies in wide practice.
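The combination described in the second example, dividing the population into homogeneous groups and then selecting at random within each group, might look like this in code. It is a sketch under stated assumptions: the same fraction is taken from every group (proportional allocation), and the function name, the area labels, and the resident counts are hypothetical.

```python
import random


def typical_sample(groups, fraction, seed=None):
    """Draw the same fraction from every homogeneous group at random.

    `groups` maps a group name to the list of its units.
    """
    rng = random.Random(seed)
    sample = {}
    for name, units in groups.items():
        k = round(len(units) * fraction)  # proportional allocation
        sample[name] = rng.sample(units, k)  # random selection in the group
    return sample


# Hypothetical residents grouped by the pollution level of their area.
residents_by_area = {
    "low": [f"low-{i}" for i in range(600)],
    "medium": [f"medium-{i}" for i in range(300)],
    "high": [f"high-{i}" for i in range(100)],
}

sample = typical_sample(residents_by_area, fraction=0.1, seed=42)
print({name: len(units) for name, units in sample.items()})
# {'low': 60, 'medium': 30, 'high': 10}
```

Because every group contributes in proportion to its size, the 6:3:1 structure of the population is preserved in the sample by construction.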

Sample representativeness


Sample Requirements

A number of mandatory requirements are imposed on the sample, determined first of all by the goals and objectives of the study. Planning an experiment should take into account both the sample size and a number of its features. Thus, psychological research imposes the requirement of sample homogeneity. It means that a psychologist studying, for example, adolescents cannot include adults in the same sample. By contrast, a study performed by the cross-sectional (age-slice) method in principle assumes subjects of different ages. Even so, the homogeneity of the sample should be observed in this case too, but according to other criteria, primarily such as age and gender. A homogeneous sample can be formed on the basis of different characteristics, such as level of intelligence, nationality, absence of certain diseases, etc., depending on the objectives of the study.

In general statistics there is the concept of repeated and non-repeated selection, or, in other words, selection with return and without return. The usual example is drawing a ball from a container. In a draw with return, each chosen ball is put back into the container and can therefore be chosen again. In non-repeated selection, a ball once selected is set aside and can no longer take part in the selection. Analogues of these ways of organizing a sample study can be found in psychological research, since a psychologist often has to test the same subjects several times using the same method. Strictly speaking, however, what is repeated in this case is the testing procedure. A sample of subjects with exactly the same composition will always show some differences across repeated studies because of the functional and age variability inherent in all people. By the nature of the procedure such a selection is repeated, although the meaning of the term here is obviously different from that in the case of the balls.

It is important to emphasize that all the requirements for any sample boil down to the fact that on its basis the psychologist must obtain the most complete, undistorted information about the characteristics of the general population from which this sample is taken. In other words, the sample should reflect as fully as possible the characteristics of the general population being studied.

The composition of the experimental sample should represent (simulate) the general population, since the conclusions obtained in the experiment are supposed to be transferred to the entire general population in the future. For this reason, the sample must have a special quality - representativeness, making it possible to extend the conclusions obtained on it to the entire general population.

The representativeness of the sample is very important; however, for objective reasons it is extremely difficult to maintain. It is a well-known fact that 70% to 90% of all psychological studies of human behavior conducted in the USA in the 1960s used college students as subjects, most of them psychology students. In laboratory studies performed on animals, the most common subjects are rats. It is therefore no coincidence that psychology used to be called "the science of sophomores and white rats". College psychology students make up only 3% of the total US population. Obviously, a sample of students is not representative as a model that claims to represent the entire population of the country.

A representative sample is a sample in which all the main features of the general population are represented in approximately the same proportion and with the same frequency as in that population. In other words, a representative sample is a smaller but accurate model of the population it is intended to represent. To the extent that the sample is representative, the conclusions based on the study of this sample can be considered, with a high degree of certainty, applicable to the entire population. This extension of results is called generalizability.

Ideally, a representative sample should be such that each of the basic characteristics, traits, personality features, etc. studied by the psychologist is represented in it in proportion to the same features in the general population. In line with these requirements, the sampling procedure must have an internal logic that can convince the researcher that the sample, when compared with the general population, will indeed turn out to be representative.

In his specific activity, the psychologist acts as follows: he establishes a subgroup (sample) within the general population, studies this sample in detail (carries out experimental work with it), and then, if the results of statistical analysis allow, extends the findings to the entire population. These are the main stages of the work of a psychologist with a sample.

The novice psychologist must keep in mind a frequently repeated mistake: every time he collects any data by any method and from any source, he is always tempted to extend his conclusions to the entire population. In order to avoid such a mistake, one must not only have common sense, but, above all, have a good command of the basic concepts of mathematical statistics.


The concept of representativeness is often found in statistical reporting and in the preparation of speeches and reports. Perhaps, without it, it is difficult to imagine any type of presentation of information for review.

Representativeness - what is it?

Representativeness reflects how the selected objects or parts correspond to the content and meaning of the data set from which they were selected.

Other definitions

The concept of representativeness can be developed in different contexts. In essence, however, representativeness is the correspondence between the features and properties of the units selected from the general population and the characteristics of the entire population as a whole, which the selected units accurately reflect.

Also, the representativeness of information is defined as the ability of sample data to represent the parameters and properties of the population that are important from the point of view of the ongoing study.

Representative Sample

The principle of sampling is to select the most important units, those that accurately reflect the properties of the total data set. Various methods are used for this, making it possible to obtain accurate results and a general picture from selected materials alone that describe the quality of all the data.

Thus, it is not necessary to study the entire material; it is enough to consider a representative sample. What is it? It is a selection of individual data made in order to gain an idea of the total mass of information.

Depending on the method of selection, samples are divided into probability and non-probability samples. A non-probability sample is made by picking out the data that the researcher considers most important and interesting, which then serve as representatives of the general population. This is a deliberate choice, or a haphazard selection that is nevertheless justified by its content.

A probability sample is a random sample compiled on the principle of an ordinary lottery. The opinion of the person compiling the sample plays no part; only blind chance is used.

Non-probability samples

Non-probability samples can be divided into several types:

  • One of the simplest and most understandable principles is convenience (non-representative) sampling. This method is often used in social surveys, for example. Survey participants are not picked out of the crowd on any specific grounds; information is simply obtained from the first 50 people who take part.
  • Intentional samples differ in that selection is subject to a number of requirements and conditions, but they still rely on chance coincidence and do not pursue the goal of achieving good statistics.
  • Quota sampling is another variety of non-probability sampling that is often used to examine large data sets. It sets a number of terms and conditions, and objects that meet them are selected. Using the example of a social survey, 100 people may be interviewed, but only the opinions of a certain number of people who meet the established requirements are taken into account when the statistical report is compiled.

Probability samples

For probability samples, a number of parameters are specified that the objects in the sample must meet, and the data that will make up the representative sample are then selected from among them in one of several ways. Such ways include:

  • Simple random sampling. Within the selected segment, the required amount of data is drawn by a completely random lottery method; these data form the representative sample.
  • Systematic random sampling makes it possible to set up a system for selecting the necessary data starting from a randomly selected point. Thus, if the first random number, which gives the sequence number of the first unit selected from the population, is 5, then the subsequent units selected may be, for example, 15, 25, 35, and so on. This example shows clearly that even a random choice can rest on a systematic calculation of the necessary input data.
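Systematic selection with a random starting point can be sketched as follows. This is an illustration only: the function name and the list of 100 numbered units are assumptions, and the fixed `start=5` simply reproduces the 5, 15, 25, 35 sequence from the example.

```python
import random


def systematic_sample(units, step, start=None, seed=None):
    """Select every `step`-th unit, beginning at a random position.

    `start` is a 1-based sequence number; if omitted, it is drawn at
    random from 1..step.
    """
    if start is None:
        start = random.Random(seed).randint(1, step)
    # Convert the 1-based start to a 0-based index and stride through.
    return units[start - 1::step]


numbered_units = list(range(1, 101))  # sequence numbers 1..100

print(systematic_sample(numbered_units, step=10, start=5)[:4])
# [5, 15, 25, 35]

# With a random start, only the first unit is left to chance.
print(systematic_sample(numbered_units, step=10, seed=7))
```

Only the first unit is chosen at random; every later unit is fixed by the step, which is why the ordering of the list matters for this method.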

Sample of consumers

Intentional sampling is a method in which each individual segment is considered and, on the basis of its assessment, a sample is compiled that reflects the characteristics and properties of the overall data set. In this way more data are collected that meet the requirements of a representative sample. A number of options can easily be left out of the total without lowering the quality of the selected data that represent the population. This is how the representativeness of the study results is determined.

Sample size

Another important issue is the sample size needed to represent the population. The sample size does not always depend on the number of sources in the general population. However, the representativeness of the sample depends directly on how many segments the result is to be divided into. The more such segments, the more data end up in the resulting sample. If the results need only a general description and no specifics, the sample can be smaller, because the information is presented more superficially, without going into detail, and will be read in general terms.

The concept of representativeness error

Representativeness error is the discrepancy between the characteristics of the population and those of the sample. No sample study can yield absolutely accurate data of the kind obtained from a complete study of the population: a sample supplies only part of the information and parameters, whereas a fully detailed study is possible only when the entire population is examined. Some errors are therefore inevitable.

Types of errors

There are some errors that occur when compiling a representative sample:

  • Systematic.
  • Random.
  • Deliberate.
  • Unintentional.
  • Mean (standard).
  • Marginal.

Random errors arise because the study of the general population is not exhaustive: only a sample is examined. The random error of representativeness is typically small.

Systematic errors, meanwhile, arise when the rules for selecting data from the total population are violated.

The mean error is the difference between the mean of the sample and the mean of the underlying population. It depends on the number of units in the sample and is inversely related to it: the larger the sample, the smaller the mean error.

Marginal error is the largest possible difference between the average values of the sample and of the total population. Such an error is characterized as the maximum of the probable errors under the given conditions of their occurrence.

Intentional and unintentional errors of representativeness

Bias errors can be intentional or unintentional.

Deliberate errors arise when data are selected tendentiously, to fit a predetermined trend. Unintentional errors arise as early as the stage of preparing the sample observation and forming the representative sample. To avoid such errors, a good sampling frame listing the sampling units must be created. It must fully meet the objectives of the sampling, be reliable, and cover all aspects of the study.

Validity, reliability, representativeness. Error Calculation

Calculation of the error of representativeness (m_M) of the arithmetic mean (M), for a sample size n > 30:

$$ m_M = \frac{\sigma}{\sqrt{n}} $$

Standard deviation, for a sample size n > 30:

$$ \sigma = \sqrt{\frac{\sum d^2}{n}} $$

where d is the deviation of each value from the mean M.

Error of representativeness (m_P) of a relative value (P, in %), for a sample size n > 30:

$$ m_P = \sqrt{\frac{P \, q}{n}}, \qquad q = 100 - P $$

When the population under study is small and the sample contains fewer than 30 units, the number of observations in these formulas is reduced by one, that is, n - 1 is used in place of n.

The magnitude of the error is inversely proportional to the square root of the sample size. The marginal error reflects how representative the information is and how likely it is that an accurate forecast can be made.
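As a rough illustration of how these errors behave, the sketch below assumes the standard textbook formulas: m = σ/√n for a mean, m = √(p·q/n) for a share in percent, and marginal error Δ = t·m, with n - 1 used in place of n for samples of fewer than 30 units. The function names and input figures are hypothetical, and how n = 30 itself is treated is an assumption.

```python
import math


def effective_n(n):
    """Use n - 1 instead of n for small samples (fewer than 30 units)."""
    return n - 1 if n < 30 else n


def error_of_mean(sigma, n):
    """Representativeness (mean) error of the arithmetic mean."""
    return sigma / math.sqrt(effective_n(n))


def error_of_share(p, n):
    """Representativeness error of a share p given in percent."""
    q = 100.0 - p
    return math.sqrt(p * q / effective_n(n))


def marginal_error(mean_error, t=2.0):
    """Marginal error; t = 2 corresponds to roughly 95% confidence."""
    return t * mean_error


# Hypothetical figures: sigma = 10; share = 25% in a sample of 300.
print(error_of_mean(10, 100))                   # 1.0
print(error_of_mean(10, 400))                   # 0.5
print(error_of_share(25, 300))                  # 2.5
print(marginal_error(error_of_share(25, 300)))  # 5.0
```

Quadrupling the sample from 100 to 400 only halves the error, and a 25% share measured on 300 units carries a marginal error of 5 percentage points at t = 2.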

Representational systems

A representative sample is not only used in evaluating how information is presented; the person receiving the information also uses representational systems. The brain processes information by creating a representative sample from the entire flow in order to evaluate the incoming data quickly and well and to grasp the essence of the matter. On the scale of human consciousness, the question "Representativeness - what is it?" is fairly easy to answer. The brain uses all the sense organs available to it, depending on what kind of information needs to be isolated from the general flow. Accordingly, the following are distinguished:

  • The visual representational system, which involves the organs of visual perception, the eyes. People who rely on this system are called visuals. With its help a person processes information that arrives in the form of images.
  • The auditory representational system. The main organ used is the organ of hearing. Information arriving as sound or speech is processed by this system. People who perceive information better by ear are called auditory types.
  • The kinesthetic representational system processes the flow of information by perceiving it through the olfactory and tactile channels.
  • The digital representational system is used together with the others as a means of obtaining information from outside. It is the perception and comprehension of the received data.

So, representativeness - what is it? A simple selection from a set, or an integral procedure in information processing? We can say with confidence that representativeness largely determines how we perceive data flows, helping to isolate the most significant parts of them.

Population - a set of units that is characterized by mass character, typicality, qualitative homogeneity and the presence of variation.

The statistical population consists of materially existing objects (employees, enterprises, countries, regions) and is the object of statistical study.

Population unit - each individual unit of the statistical population.

One and the same statistical population can be homogeneous in one feature and heterogeneous in another.

Qualitative homogeneity - similarity of all units of the population in some feature and dissimilarity in all the others.

In a statistical population, the differences between one unit and another are most often quantitative. Quantitative changes in the value of a feature from one unit of the population to another are called variation.

Feature variation - the quantitative change of a feature (for a quantitative feature) in passing from one unit of the population to another.

Feature (sign) - a property, characteristic or other attribute of units, objects and phenomena that can be observed or measured. Features are divided into quantitative and qualitative.

Attributive (qualitative) features cannot be expressed numerically (for example, the composition of the population by sex). Quantitative features have a numerical expression (for example, the composition of the population by age).

Indicator - a generalizing quantitative and qualitative characteristic of some property of units or of the population as a whole under specific conditions of time and place.

System of indicators - a set of indicators that comprehensively reflects the phenomenon under study.

For example, consider salary:
  • Sign - wages
  • Statistical population - all employees
  • The unit of the population is each worker
  • Qualitative homogeneity - accrued salary
  • Feature variation - a series of numbers

General population and sample from it

The basis of a study is a set of data obtained by measuring one or more features. The actually observed set of objects, statistically represented by a series of observations of a random variable X, is the sample, and the hypothetically existing (conceptual) set is the general population. The general population can be finite (number of observations N = const) or infinite (N = ∞), whereas a sample from it is always the result of a limited number of observations. The number of observations that make up a sample is called the sample size n. If the sample size is large enough (n → ∞), the sample is considered large; otherwise it is called a sample of limited size. A sample is considered small if, when a one-dimensional random variable is measured, its size does not exceed 30 (n ≤ 30), or if, when several (k) features are measured simultaneously in a multidimensional space, the ratio of n to k is less than 10 (n/k < 10). A sample forms a variation series if its members are order statistics, i.e. the sample values of the random variable X are sorted in ascending order (ranked); the values of the feature are then called variants.

Example. One and the same randomly selected set of objects, say the commercial banks of one administrative district of Moscow, can be regarded as a sample from the general population of all commercial banks in that district, as a sample from the general population of all commercial banks in Moscow, as a sample of the commercial banks in the country, and so on.

Basic sampling methods

The reliability of statistical conclusions and the meaningful interpretation of results depend on the representativeness of the sample, i.e. on how completely and adequately it reproduces the properties of the general population with respect to which the sample can be considered representative. The statistical properties of a population can be studied in two ways: by complete (continuous) observation or by partial (non-continuous) observation. Complete observation examines all units of the population under study, while partial (sample) observation examines only part of them.

There are five main ways to organize sampling:

1. Simple random selection, in which objects are drawn at random from the general population (for example, using a table of random numbers or a random number generator), and every possible sample has an equal probability. Such samples are called properly random.

2. Simple selection by a regular procedure, which is carried out using a mechanical criterion (for example, dates, days of the week, apartment numbers, letters of the alphabet). Samples obtained in this way are called mechanical.

3. Stratified selection, in which the general population of size N is subdivided into subsets, or layers (strata), of sizes N_1, ..., N_k so that N_1 + ... + N_k = N. Strata are groups of objects that are homogeneous in their statistical characteristics (for example, the population is divided into strata by age group or social class, enterprises by industry). The resulting samples are called stratified (also typical or zoned).

4. Serial selection, which is used to form serial, or nested, samples. It is convenient when a whole "block" or series of objects has to be examined at once (for example, a consignment of goods, products of a certain series, or the population of a territorial-administrative unit of the country). The series can be selected randomly or mechanically. Every unit within a selected batch of goods or territorial unit (a residential building or a block) is then surveyed.

5. Combined (multistage) selection, which can combine several selection methods at once (for example, stratified and random, or random and mechanical). Such a sample is called combined.
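The first four selection schemes can be sketched in a few lines of Python. This is a minimal illustration rather than a survey-grade implementation: the function names, the toy population of 100 numbered units, the two strata and the ten series are all hypothetical, and the standard `random` module stands in for a table of random numbers.

```python
import random

def simple_random_sample(population, n, rng):
    """Simple random selection: every subset of size n is equally likely."""
    return rng.sample(population, n)

def mechanical_sample(population, step, rng):
    """Mechanical selection: random starting unit, then every step-th unit."""
    start = rng.randrange(step)
    return population[start::step]

def stratified_sample(strata, fraction, rng):
    """Stratified selection: draw the same fraction at random from each stratum."""
    sample = []
    for units in strata.values():
        k = max(1, round(len(units) * fraction))
        sample.extend(rng.sample(units, k))
    return sample

def serial_sample(series, r, rng):
    """Serial selection: pick r whole series at random and survey each completely."""
    chosen = rng.sample(list(series), r)
    return [unit for key in chosen for unit in series[key]]

# Hypothetical toy data: 100 numbered units, two strata, ten series of ten units.
rng = random.Random(0)
population = list(range(100))
strata = {"first": list(range(0, 40)), "second": list(range(40, 100))}
series = {b: list(range(b * 10, b * 10 + 10)) for b in range(10)}

print(len(simple_random_sample(population, 10, rng)))  # 10 units
print(len(mechanical_sample(population, 10, rng)))     # 10 units
print(len(stratified_sample(strata, 0.1, rng)))        # 4 + 6 = 10 units
print(len(serial_sample(series, 2, rng)))              # 2 series of 10 = 20 units
```

A combined selection would simply chain these functions, for example by applying `simple_random_sample` inside each series returned by `serial_sample`.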

Selection types

By type, selection is divided into individual, group and combined. In individual selection, individual units of the general population are selected into the sample; in group selection, qualitatively homogeneous groups (series) of units are selected; combined selection is a combination of the first two types.

By method, selection is divided into repeated and non-repeated.

Non-repeated selection is selection in which a unit that has entered the sample is not returned to the original population and takes no part in further selection; the number of units in the general population N therefore decreases during the selection process. In repeated selection, a unit that has entered the sample is returned to the general population after registration and thus retains an equal chance, along with the other units, of being used in further selection; the number of units in the general population N remains unchanged (this method is rarely used in socio-economic studies). However, for large N (N → ∞) the formulas for non-repeated selection approach those for repeated selection, and in practice the latter are used more often (N = const).
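The difference between the two methods is easy to see in code. The sketch below is only an illustration with a hypothetical population of ten numbered units; it uses `random.sample` for non-repeated selection and `random.choices` for repeated selection.

```python
import random

rng = random.Random(1)            # fixed seed so the run is reproducible
population = list(range(1, 11))   # ten numbered units (hypothetical)

# Non-repeated selection: a drawn unit is not returned, so no unit can repeat.
non_repeated = rng.sample(population, 5)

# Repeated selection: each unit is returned after registration,
# so the same unit may be drawn more than once.
repeated = rng.choices(population, k=5)

print("non-repeated:", non_repeated)
print("repeated:    ", repeated)
```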

The main characteristics of the parameters of the general and sample population

Statistical conclusions in a study rest on the distribution of a random variable X, and the observed values (x_1, x_2, ..., x_n) are called realizations of the random variable X (n is the sample size). The distribution of a random variable in the general population is theoretical and ideal in nature, and its sample analogue is the empirical distribution. Some theoretical distributions are specified analytically, i.e. their parameters determine the value of the distribution function at every point in the space of possible values of the random variable. For a sample, the distribution function is difficult and sometimes impossible to determine, so the parameters are estimated from the empirical data and then substituted into the analytical expression that describes the theoretical distribution. The assumption (or hypothesis) about the type of distribution may turn out to be statistically correct or erroneous. In either case, the empirical distribution reconstructed from the sample characterizes the true one only approximately. The most important distribution parameters are the mathematical expectation and the variance.

By their nature, distributions are either continuous or discrete. The best-known continuous distribution is the normal distribution. The sample analogues of its parameters (the expectation and the variance) are the sample mean and the empirical variance. Among discrete distributions, the one most commonly used in socio-economic studies is the alternative (dichotomous) distribution. Its expectation parameter expresses the relative share of population units that have the characteristic under study (denoted by the letter p); the share of the population that does not have this characteristic is denoted by the letter q (q = 1 - p). The variance of the alternative distribution also has an empirical analogue.

Depending on the type of distribution and on the method of selecting population units, the characteristics of the distribution parameters are calculated differently. The main ones for the theoretical and empirical distributions are given in Table. 9.1.

The sampling fraction k_n is the ratio of the number of units in the sample to the number of units in the general population:

k_n = n/N.

The sample share w is the ratio of the number of sample units n_n that have the trait under study to the sample size n:

w = n_n/n.

Example. In a batch of goods containing 1000 units, a 5% sampling fraction k_n corresponds in absolute terms to a sample of 50 units (n = N*0.05); if 2 defective products are found in this sample, the sample share of defects w is 0.04 (w = 2/50 = 0.04, or 4%).

Since the sample population is different from the general population, there are sampling errors.

Table 9.1 Main parameters of the general and sample populations

Sampling errors

In any observation (complete or sample), errors of two types can occur: registration errors and representativeness errors. Registration errors can be random or systematic. Random errors result from many different uncontrollable causes, are unintentional, and usually cancel each other out in aggregate (for example, changes in instrument readings due to temperature fluctuations in the room).

Systematic errors are biased, because they stem from violations of the rules for selecting objects into the sample or from a consistent distortion of the measurements (for example, deviations that appear when the settings of the measuring device are changed).

Example. To assess the social status of the population in the city, it is planned to examine 25% of families. If, however, the selection of every fourth apartment is based on its number, then there is a danger of selecting all apartments of only one type (for example, one-room apartments), which will introduce a systematic error and distort the results; the choice of the apartment number by lot is more preferable, since the error will be random.

Representativeness errors are inherent only in sample observation. They cannot be avoided, and they arise because the sample does not fully reproduce the general population. The values of indicators obtained from the sample differ from the values of the same indicators in the general population (or obtained by complete observation).

Sampling error is the difference between the value of a parameter in the general population and its sample value. For the mean of a quantitative feature, it is the difference between the general mean and the sample mean; for the share (an alternative feature), it is the difference between the general share p and the sample share w.

Sampling errors are inherent only in sample observation. The larger these errors, the more the empirical distribution differs from the theoretical one. The parameters of the empirical distribution (the sample mean and the sample share) are random variables, so sampling errors are random variables too: they can take different values for different samples, which is why it is customary to calculate the average error.

The average sampling error m is the standard deviation of the sample mean from the mathematical expectation. Provided that selection is random, it depends primarily on the sample size and on the degree of variation of the feature: the larger the sample size n and the smaller the variation of the feature (and hence the variance), the smaller the average sampling error m. The relationship between the variances of the general and sample populations is expressed by the formula:

$\sigma^2_{\text{gen}} = \sigma^2_{\text{sample}} \cdot \frac{n}{n-1}$,

i.e. for sufficiently large n we can assume that $\sigma^2_{\text{gen}} \approx \sigma^2_{\text{sample}}$. The average sampling error shows the possible deviations of a sample parameter from the corresponding parameter of the general population. Table 9.2 gives expressions for calculating the average sampling error for different methods of organizing observation.
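As an illustration of how the average error depends on the sample size and on the variation of the feature, here is a minimal Python sketch. It assumes the standard formulas for properly random selection, m = sqrt(s²/n) for repeated selection and m = sqrt(s²/n · (1 − n/N)) for non-repeated selection; the function names and the numbers are hypothetical.

```python
import math

def mean_error_repeated(variance, n):
    """Average sampling error for repeated random selection: sqrt(s^2 / n)."""
    return math.sqrt(variance / n)

def mean_error_non_repeated(variance, n, N):
    """Average sampling error for non-repeated random selection:
    the same expression multiplied by the correction (1 - n/N)."""
    return math.sqrt(variance / n * (1 - n / N))

def share_variance(w):
    """Variance of an alternative feature with sample share w."""
    return w * (1 - w)

# Hypothetical numbers: variance 100, sample of 25 units.
print(round(mean_error_repeated(100, 25), 3))                 # 2.0
print(round(mean_error_non_repeated(100, 25, 100), 3))        # 1.732
print(round(mean_error_non_repeated(100, 25, 1_000_000), 3))  # 2.0
```

The last line shows the two formulas converging when the general population is very large compared with the sample. For a share, `share_variance(w)` is passed in place of the variance.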

Table 9.2 Mean error (m) of sample mean and proportion for different sample types

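In Table 9.2, the following notation is used: $\overline{s^2}$ is the average of the intragroup sample variances for a continuous feature; $\overline{w(1-w)}$ is the average of the intragroup variances of the share; $r$ is the number of series selected and $R$ is the total number of series;

$\delta_{\bar{x}}^2 = \frac{\sum_{i=1}^{r} (\bar{x}_i - \bar{x})^2}{r}$,

where $\bar{x}_i$ is the mean of the $i$-th series and $\bar{x}$ is the general mean over the entire sample for a continuous feature;

$\delta_{w}^2 = \frac{\sum_{i=1}^{r} (w_i - w)^2}{r}$,

where $w_i$ is the share of the trait in the $i$-th series and $w$ is the total share of the trait over the entire sample.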

However, the magnitude of the average error can only be judged with a certain probability P (P ≤ 1). A. M. Lyapunov proved that, for a sufficiently large number of observations, the distribution of sample means, and hence of their deviations from the general mean, approximately obeys the normal distribution law, provided that the general population has a finite mean and a bounded variance.

Mathematically, this statement for the mean is expressed as:

$P\left(|\bar{x} - \mu| \le \Delta_{\bar{x}}\right) = \Phi(t)$, (1)

and for the share, expression (1) takes the form:

$P\left(|w - p| \le \Delta_{w}\right) = \Phi(t)$,

where $\bar{x}$ and $\mu$ are the sample and general means, $w$ and $p$ are the sample and general shares, and $\Delta = t \cdot m$ is the marginal sampling error, which is a multiple of the average sampling error $m$. The multiplicity factor $t$ is Student's coefficient (the "confidence factor"), proposed by W. S. Gosset (pseudonym "Student"); its values for different sample sizes are tabulated.

The values ​​of the function Ф(t) for some values ​​of t are:

Therefore, expression (3) can be read as follows: with probability P = 0.683 (68.3%) it can be stated that the difference between the sample mean and the general mean will not exceed one mean error m (t = 1); with probability P = 0.954 (95.4%), that it will not exceed two mean errors m (t = 2); with probability P = 0.997 (99.7%), that it will not exceed three mean errors m (t = 3). Thus, the probability that this difference exceeds three times the mean error determines the error level and is no more than 0.3%.
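The three probabilities quoted here can be checked numerically. The sketch below assumes that Φ(t) denotes the probability that a standard normal variable falls within ±t, which for the normal law equals erf(t/√2); the function name `phi` is chosen only for this illustration.

```python
import math

def phi(t):
    """Probability that a standard normal variable lies within +/- t."""
    return math.erf(t / math.sqrt(2))

for t in (1, 2, 3):
    print(t, round(phi(t), 3))
# 1 0.683
# 2 0.954
# 3 0.997
```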

Table 9.3 gives the formulas for calculating the marginal sampling error.

Table 9.3 Marginal sampling error (D) for mean and proportion (p) for different types of sampling

Extending Sample Results to the Population

The ultimate goal of sample observation is to characterize the general population. For small sample sizes, the empirical estimates of the parameters (the sample mean and the sample share) may deviate considerably from their true values (the general mean and the general share). It therefore becomes necessary to establish the boundaries within which, for the given sample values, the true values of the parameters lie.

The confidence interval of a parameter θ of the general population is a random range of values of this parameter that contains its true value with a probability close to 1 (the reliability).

The marginal sampling error Δ makes it possible to determine the limiting values of the characteristics of the general population and their confidence intervals. The lower bound of the confidence interval is obtained by subtracting the marginal error from the sample mean (share), and the upper bound by adding it.

The confidence interval for the mean uses the marginal sampling error and, for a given confidence level, is determined by the formula:

$\bar{x} - \Delta_{\bar{x}} \le \mu \le \bar{x} + \Delta_{\bar{x}}$.

This means that with a given probability P, which is called the confidence level and is uniquely determined by the value of t, it can be stated that the true value of the mean $\mu$ lies in the range from $\bar{x} - \Delta_{\bar{x}}$ to $\bar{x} + \Delta_{\bar{x}}$, and the true value of the share p lies in the range from $w - \Delta_w$ to $w + \Delta_w$.

When the confidence interval is calculated for the three standard confidence levels P = 95%, P = 99% and P = 99.9%, the value of t is taken from the Student's table in the Appendix according to the number of degrees of freedom. If the sample size is large enough, the values of t corresponding to these probabilities are 1.96, 2.58 and 3.29. Thus, the marginal sampling error allows us to determine the limiting values of the characteristics of the general population and their confidence intervals.
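For large samples the values of t can be recovered from the normal distribution. The following sketch uses `statistics.NormalDist` from the Python standard library; the helper names `t_large_sample` and `confidence_interval` are hypothetical, and the approach assumes the normal approximation, so it does not replace the Student's table for small samples.

```python
from statistics import NormalDist

def t_large_sample(P):
    """Two-sided normal quantile for confidence level P (large samples only)."""
    return NormalDist().inv_cdf((1 + P) / 2)

def confidence_interval(estimate, m, P):
    """Interval estimate +/- t*m, where m is the average sampling error."""
    delta = t_large_sample(P) * m
    return estimate - delta, estimate + delta

for P in (0.95, 0.99, 0.999):
    print(P, round(t_large_sample(P), 2))
# 0.95 1.96
# 0.99 2.58
# 0.999 3.29
```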

Extending the results of sample observation to the general population has its own specifics in socio-economic studies, since it requires that all types and groups of the population be fully represented. The possibility of such an extension is judged from the relative error:

$\Delta_{\%} = \frac{\Delta_{\bar{x}}}{\bar{x}} \cdot 100\%$ for the mean and $\Delta_{\%} = \frac{\Delta_{w}}{w} \cdot 100\%$ for the share,

where $\Delta_{\%}$ is the relative marginal sampling error.

There are two main methods for extending the results of sample observation to the population: direct conversion and the method of coefficients.

The essence of direct conversion is to multiply the sample mean $\bar{x}$ by the size of the population N.

Example. Suppose that the average number of toddlers per young family in the city has been estimated by sampling at 1.2 children. If there are 1000 young families in the city, the number of places required in the municipal nurseries is obtained by multiplying this average by the size of the general population N = 1000, i.e. 1200 places.

The method of coefficients is advisable when sample observation is carried out in order to refine the data of a complete observation.

In doing so, the formula is used:

where all variables are the size of the population:

Required sample size

Table 9.4 Required sample size (n) for different types of sampling organization

When a sample survey is planned with a predetermined allowable sampling error, the required sample size must be estimated correctly. It can be determined from the allowable error and from a given probability that guarantees an acceptable error level, taking into account the way the observation is organized. Formulas for the required sample size n are easily obtained directly from the formulas for the marginal sampling error. Thus, from the expression for the marginal error:

$\Delta = t \sqrt{\frac{s^2}{n}}$

the sample size n is determined directly:

$n = \frac{t^2 s^2}{\Delta^2}$.

This formula shows that as the marginal sampling error Δ decreases, the required sample size grows substantially: it is inversely proportional to the square of Δ and directly proportional to the variance $s^2$ and to the square of Student's coefficient t.
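A short numerical sketch makes the dependence on Δ concrete. It assumes the repeated-selection relation n = t²s²/Δ², rounds the result up to a whole unit, and uses hypothetical input values; the function name is likewise chosen only for this example.

```python
import math

def required_sample_size(t, s, delta):
    """Required n for repeated selection: n = t^2 * s^2 / delta^2, rounded up."""
    return math.ceil((t * s / delta) ** 2)

# Hypothetical inputs: t = 2, standard deviation 10, allowable error 3.
print(required_sample_size(2, 10, 3))    # 45
# Halving the allowable error roughly quadruples the required sample size.
print(required_sample_size(2, 10, 1.5))  # 178
```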

For a specific method of organizing observation, the required sample size is calculated according to the formulas given in Table. 9.4.

Practical Calculation Examples

Example 1. Calculation of the mean value and confidence interval for a continuous quantitative characteristic.

To assess the speed of settlements with creditors, a bank drew a random sample of 10 payment documents. Their values (in days) were: 10; 3; 15; 15; 22; 7; 8; 1; 19; 20.

With probability P = 0.954, determine the marginal error Δ of the sample mean and the confidence limits of the average settlement time.

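Solution. The mean is calculated by the formula from Table 9.1 for the sample population:

$\bar{x} = \frac{\sum x_i}{n} = \frac{120}{10} = 12.0$ days.

The variance is calculated by the formula from Table 9.1:

$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{478}{9} \approx 53.1$.

The standard deviation is $s = \sqrt{53.1} \approx 7.3$ days.

The error of the mean is calculated by the formula:

$m = \frac{s}{\sqrt{n}} = \frac{7.3}{\sqrt{10}} \approx 2.3$ days,

i.e. the mean value is $\bar{x} \pm m = 12.0 \pm 2.3$ days.

The reliability of the mean was

$t_{\text{obs}} = \frac{\bar{x}}{m} = \frac{12.0}{2.3} \approx 5.2$.

The marginal error is calculated by the formula from Table 9.3 for repeated selection, since the size of the population is unknown, and for the P = 0.954 confidence level (t = 2):

$\Delta = 2m = 2 \cdot 2.3 = 4.6$ days.

Thus, the mean value is $\bar{x} \pm \Delta = \bar{x} \pm 2m = 12.0 \pm 4.6$, i.e. its true value lies in the range from 7.4 to 16.6 days.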

Using the Student's table in the Appendix, we conclude that for n - 1 = 10 - 1 = 9 degrees of freedom the obtained value is reliable at a significance level α ≤ 0.001, i.e. the resulting mean is significantly different from 0.
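The arithmetic of Example 1 can be reproduced with the Python standard library. The sketch assumes the variance is computed with the n - 1 denominator (which is what `statistics.stdev` does) and takes t = 2 for P = 0.954; the variable names are chosen only for this illustration.

```python
import math
import statistics

days = [10, 3, 15, 15, 22, 7, 8, 1, 19, 20]  # the ten observations, in days
n = len(days)

mean = statistics.mean(days)   # sample mean
s = statistics.stdev(days)     # standard deviation with n - 1 in the denominator
m = s / math.sqrt(n)           # error of the mean
delta = 2 * m                  # marginal error for P = 0.954 (t = 2)

print(mean, round(s, 1), round(m, 1))                     # 12 7.3 2.3
print(round(mean - delta, 1), round(mean + delta, 1))     # 7.4 16.6
```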

Example 2. Estimate of the probability (general share) p.

In a mechanical sample survey of the social status of 1000 families, the share of low-income families was found to be w = 0.3 (30%); the sample was 2%, i.e. n/N = 0.02. With confidence level P = 0.997, determine the share p of low-income families in the whole region.

Solution. From the tabulated values of the function Φ(t), we find t = 3 for the given confidence level P = 0.997 (see formula 3). The marginal error of the share w is determined by the formula from Table 9.3 for non-repeated sampling (mechanical sampling is always non-repeated):

$\Delta_w = t \sqrt{\frac{w(1-w)}{n}\left(1 - \frac{n}{N}\right)} = 3 \sqrt{\frac{0.3 \cdot 0.7}{1000}(1 - 0.02)} \approx 3 \cdot 0.0143 \approx 0.043$.

The relative marginal sampling error in % will be:

$\Delta_{\%} = \frac{\Delta_w}{w} \cdot 100\% = \frac{0.043}{0.3} \cdot 100\% \approx 14.3\%$.

The probability (general share) of low-income families in the region is p = w ± Δw, and the confidence limits of p are found from the double inequality

w - Δw ≤ p ≤ w + Δw, i.e. the true value of p lies within:

0.3 - 0.043 ≤ p ≤ 0.3 + 0.043, namely from 25.7% to 34.3%.

Thus, with a probability of 0.997, it can be stated that the share of low-income families among all families in the region lies between 25.7% and 34.3%.

Example 3 Calculation of the mean value and confidence interval for a discrete feature specified by an interval series.

Table 9.5 gives the distribution of orders placed with the enterprise by the time taken to complete them.

Table 9.5 Distribution of observations by time of occurrence

Solution. The average order completion time is calculated by the formula:

$\bar{x} = \frac{\sum x_i f_i}{\sum f_i}$,

where $x_i$ is the midpoint of the i-th interval and $f_i$ is the number of orders in it. The average time will be:

$\bar{x}$ = (3*20 + 9*80 + 24*60 + 48*20 + 72*20)/200 = 23.1 months

We get the same answer if we use the data on p_i from the penultimate column of Table 9.5, using the formula:

$\bar{x} = \sum x_i p_i$.

Note that the midpoint of the interval for the last gradation is found by artificially closing it with the width of the interval of the previous gradation, equal to 60 - 36 = 24 months.
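Both ways of computing the mean of the interval series can be checked with a few lines of Python. The midpoints and frequencies are taken from the calculation above; the shares p_i are derived here by dividing each frequency by the total, which is an assumption about how the penultimate column of Table 9.5 was obtained.

```python
midpoints = [3, 9, 24, 48, 72]   # interval midpoints, in months
freqs = [20, 80, 60, 20, 20]     # number of orders in each interval
total = sum(freqs)               # 200 orders in all

# Weighted mean using the frequencies.
mean_by_freq = sum(x * f for x, f in zip(midpoints, freqs)) / total

# The same mean using the shares p_i = f_i / total.
shares = [f / total for f in freqs]
mean_by_share = sum(x * p for x, p in zip(midpoints, shares))

print(round(mean_by_freq, 1), round(mean_by_share, 1))  # 23.1 23.1
```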

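The variance is calculated by the formula

$s^2 = \frac{\sum (x_i - \bar{x})^2}{k - 1}$,

where $x_i$ is the midpoint of the i-th interval of the series and k = 5 is the number of intervals.

Therefore $s^2 = \frac{20^2 + 14^2 + 1^2 + 25^2 + 49^2}{4} \approx 905.8$, and the standard deviation is $s \approx 30$ months.

The error of the mean is calculated by the formula $m = \frac{s}{\sqrt{k}} = \frac{30}{\sqrt{5}} \approx 13.4$ months, i.e. the mean is $\bar{x} \pm m = 23.1 \pm 13.4$.

The marginal error is calculated by the formula from Table 9.3 for repeated selection, because the population size is unknown, for the 0.954 confidence level:

$\Delta = 2m = 2 \cdot 13.4 = 26.8$ months.

So the mean is:

$\bar{x} \pm \Delta = 23.1 \pm 26.8$ months,

i.e. its true value lies in the range from 0 to 50 months.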

Example 4. To determine the speed of settlements with creditors for the N = 500 enterprises of a corporation, a commercial bank needs to conduct a sample study using random non-repeated selection. Determine the required sample size n such that, with probability P = 0.954, the error of the sample mean does not exceed 3 days, given that trial estimates put the standard deviation s at 10 days.

Solution. To determine the required number of observations n, we use the formula for non-repeated selection from Table 9.4:

$n = \frac{t^2 s^2 N}{\Delta_{\bar{x}}^2 N + t^2 s^2}$.

Here the value of t is determined for the confidence level P = 0.954 and is equal to 2. The standard deviation is s = 10, the population size is N = 500, and the marginal error of the mean is $\Delta_{\bar{x}} = 3$. Substituting these values into the formula, we get:

$n = \frac{2^2 \cdot 10^2 \cdot 500}{3^2 \cdot 500 + 2^2 \cdot 10^2} = \frac{200000}{4900} \approx 41$,

i.e. a sample of 41 enterprises is enough to estimate the required parameter, the speed of settlements with creditors.
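The calculation in Example 4 can be verified with a small function. It assumes the usual sample-size relation for non-repeated random selection, n = t²s²N / (Δ²N + t²s²), and rounds the result up to a whole enterprise; the function name is hypothetical.

```python
import math

def required_n_non_repeated(t, s, delta, N):
    """Required sample size for non-repeated random selection, rounded up."""
    return math.ceil(t**2 * s**2 * N / (delta**2 * N + t**2 * s**2))

# Values from Example 4: t = 2, s = 10 days, allowable error 3 days, N = 500.
print(required_n_non_repeated(t=2, s=10, delta=3, N=500))  # 41
```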