Definition of a variational series

Variational series, their elements.

A researcher interested in the wage grade (tariff category) of workers in a mechanical
shop surveyed 100 workers. Arrange the observed values of the attribute
in ascending order. This operation is called ranking the statistical
data. The result is the following series, which is called
ranked:

1, 1, ..., 1, 2, 2, ..., 2, 3, 3, ..., 3, 4, 4, ..., 4, 5, 5, ..., 5, 6, 6, ..., 6.

It follows from the ranked series that the attribute under study (the wage
grade) took six different values: 1, 2, 3, 4, 5, and 6.

From here on, the distinct values of the attribute are called variants,
and variation means the change in the values of the attribute.

Depending on the values they take, attributes are divided
into discretely varying and continuously varying.

The wage grade is a discretely varying attribute. The number showing
how many times the variant x occurs in the series of observations is called
the frequency of the variant and is denoted $m_x$.

Instead of the frequency of the variant x, one can consider its ratio to the total
number of observations n, which is called the relative frequency of the variant and is denoted $w_x$:

$w_x = m_x / n = m_x / \sum m_x$

A table that shows how the frequencies (or relative frequencies) are distributed among the variants is called a discrete variational series.
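The construction of a discrete variational series can be sketched in a few lines of Python. This is only an illustration: the function name `discrete_variational_series` and the ten-element sample are hypothetical and are not data from the survey described above.

```python
from collections import Counter


def discrete_variational_series(observations):
    """Return rows (x, m_x, w_x) sorted by variant x.

    m_x is the frequency of the variant x and w_x = m_x / n
    is its relative frequency.
    """
    n = len(observations)
    counts = Counter(observations)
    return [(x, m, m / n) for x, m in sorted(counts.items())]


# Hypothetical sample of wage grades (illustration only).
sample = [3, 5, 2, 5, 4, 5, 6, 1, 5, 4]
series = discrete_variational_series(sample)

for x, m, w in series:
    print(f"x={x}  m_x={m}  w_x={w:.2f}")
```

Sorting the variants before printing performs the ranking step, so the output is already the ranked series collapsed into a table.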

Along with frequency, the concept of accumulated frequency is used,
which is denoted $m_x^{acc}$. The accumulated frequency shows in how many
observations the attribute took values less than the given value x. The ratio
of the accumulated frequency to the total number n of observations is called the accumulated
relative frequency
and is denoted $w_x^{acc}$. Clearly,

$w_x^{acc} = m_x^{acc} / n = m_x^{acc} / \sum m_x$.

Accumulated frequencies (and accumulated relative frequencies) for a discrete variational series are calculated in the following table:

x          m_x    m_x^acc       w_x^acc
1          4      0             0.00
2          6      0+4=4         0.04
3          12     4+6=10        0.10
4          16     10+12=22      0.22
5          44     22+16=38      0.38
6          18     38+44=82      0.82
Above 6    -      82+18=100     1.00
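The accumulated columns can be reproduced with a running sum. The sketch below assumes the frequencies 4, 6, 12, 16, 44, 18 for grades 1 to 6, which are the increments read off the sums in the table, and uses the "less than x" definition given above. The variable names are illustrative.

```python
from itertools import accumulate

variants = [1, 2, 3, 4, 5, 6]
# Frequencies m_x, taken as the increments in the table's running sums.
frequencies = [4, 6, 12, 16, 44, 18]
n = sum(frequencies)

# Running totals of the frequencies: 4, 10, 22, ...
running = list(accumulate(frequencies))

# m_acc[i] = number of observations with a value LESS than variants[i],
# so the first variant has accumulated frequency 0.
m_acc = [0] + running[:-1]
w_acc = [m / n for m in m_acc]

for x, m, ma, wa in zip(variants, frequencies, m_acc, w_acc):
    print(f"x={x}  m_x={m:2d}  m_acc={ma:3d}  w_acc={wa:.2f}")
print(f"above {variants[-1]}: m_acc={running[-1]}  w_acc={running[-1] / n:.2f}")
```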

Suppose we need to study the output per worker (machine operator) of a mechanical shop in the reporting year as a percentage of the previous year. Here the attribute x under study is the output in the reporting year as a percentage of the previous year's output. This is a continuously varying attribute. To reveal the characteristic features of the variation, we combine into groups the workers whose output varies within a 10% range. The grouped data are presented in the table:

Attribute x, %    Number of workers m    Share of workers w    m_x^acc        w_x^acc
80-90             8                      8/117                 8              8/117
90-100            15                     15/117                8+15=23        23/117
100-110           46                     46/117                23+46=69       69/117
110-120           29                     29/117                69+29=98       98/117
120-130           13                     13/117                98+13=111      111/117
130-140           3                      3/117                 111+3=114      114/117
140-150           3                      3/117                 114+3=117      117/117
Σ                 117                    1

In the table, the frequency m shows in how many observations the attribute took values belonging to a given interval. This frequency is called the interval frequency, and its ratio to the total number of observations is the interval relative frequency w. A table that shows how the frequencies (or relative frequencies) are distributed among the intervals of values of the attribute is called an interval variational series.


Construction of an interval variational series


An interval variational series is built from observations of a continuously varying attribute, and also of a discretely varying one if the number of observed variants is large. A discrete variational series is built only for a discretely varying attribute.

Sometimes an interval variational series is conditionally replaced by a discrete one. Then the midpoint of each interval is taken as the variant x, and the corresponding interval frequency is taken as $m_x$.

To construct an interval variational series, one must determine the interval width, establish the full scale of intervals, and group the observations according to it.

To determine the optimal constant interval width h, Sturges' formula is often used:

$h = \frac{x_{\max} - x_{\min}}{1 + 3.322 \lg n}$,

where $x_{\max}$ and $x_{\min}$ are the maximum and minimum variants, respectively, and n is the number of observations. If h turns out to be fractional, either the nearest integer or the nearest simple fraction should be taken as the interval width.

It is recommended to take $a_1 = x_{\min} - h/2$ as the start of the first interval; the start of the second interval coincides with the end of the first and equals $a_2 = a_1 + h$; the start of the third coincides with the end of the second and equals $a_3 = a_2 + h$. Intervals are added until the start of the next interval exceeds $x_{\max}$. Once the scale of intervals is established, the observations are grouped.
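The procedure just described can be sketched as follows. The data are hypothetical, the function names are illustrative, and the intervals are treated as half-open [a, a + h), which is an assumption the text does not state.

```python
import math


def sturges_width(x_min, x_max, n):
    """Interval width by Sturges' formula, with a base-10 logarithm."""
    return (x_max - x_min) / (1 + 3.322 * math.log10(n))


def build_intervals(x_min, x_max, h):
    """Start at a_1 = x_min - h/2 and add intervals of width h
    until the start of the next interval exceeds x_max."""
    intervals = []
    start = x_min - h / 2
    while start <= x_max:
        intervals.append((start, start + h))
        start += h
    return intervals


def group(observations, intervals):
    """Count observations per interval (assumed half-open: lo <= x < hi)."""
    counts = [0] * len(intervals)
    for x in observations:
        for k, (lo, hi) in enumerate(intervals):
            if lo <= x < hi:
                counts[k] += 1
                break
    return counts


# Hypothetical outputs, in percent of the previous year.
data = [82, 95, 97, 101, 103, 104, 108, 110, 112, 115, 118, 121, 125, 133, 147]
h = round(sturges_width(min(data), max(data), len(data)))  # nearest integer
intervals = build_intervals(min(data), max(data), h)
counts = group(data, intervals)

for (lo, hi), m in zip(intervals, counts):
    print(f"{lo:6.1f} - {hi:6.1f}: m = {m}")
```

Rounding h to the nearest integer follows the recommendation above; a simple fraction would work equally well if it keeps the boundaries readable.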

5) The concept, forms of expression and types of statistical indicators.

A statistical indicator is a quantitative characteristic of socio-economic phenomena and processes that is qualitatively definite. The qualitative definiteness of an indicator lies in its direct connection with the inner content, the essence, of the phenomenon or process being studied.

A system of statistical indicators is a set of interrelated indicators that has a single-level or multi-level structure and is aimed at solving a specific statistical problem.

Unlike an attribute, a statistical indicator is obtained by calculation. This can be a simple count of population units, a summation of their attribute values, a comparison of two or more values, or a more complex calculation.

A distinction is made between a specific statistical indicator and an indicator-category.

A specific statistical indicator characterizes the size or magnitude of the phenomenon or process under study at a given place and a given time. In theoretical work and at the design stage of statistical observation, however, abstract indicators, or indicator-categories, are also used.

Indicator-categories reflect the essence and the general distinctive properties of specific statistical indicators of the same type, without specifying place, time, or numerical value. By coverage of population units, all statistical indicators are divided into individual and summary; by form, into absolute, relative, and average.

Individual indicators characterize a separate object or a separate unit of the population: an enterprise, a firm, a bank, etc. An example is the number of industrial and production personnel of an enterprise. Relating two individual absolute indicators that characterize the same object or unit gives an individual relative indicator.

Summary indicators, unlike individual ones, characterize a group of units that is either a part of the statistical population or the population as a whole. These indicators are divided into volume indicators and calculated indicators.

Volume indicators are obtained by adding up the attribute values of individual population units. The resulting value, called the volume of the attribute, can serve as a volume absolute indicator, or it can be compared with another volume absolute value or with the size of the population. In the latter two cases, volume relative and volume average indicators are obtained.

Calculated indicators, computed by various formulas, serve particular tasks of statistical analysis: measuring variation, characterizing structural shifts, assessing relationships, and so on. They are also divided into absolute, relative, and average.

This group includes indices, coefficients of closeness of association, sampling errors, and other indicators.

Coverage of population units and form of expression are the main, but not the only, classification criteria for statistical indicators. Time is another important criterion. Socio-economic processes and phenomena are reflected in statistical indicators either as of a certain moment, usually a certain date such as the beginning or end of a month or year, or over a certain period: a day, week, month, quarter, or year. In the first case the indicators are moment indicators; in the second, interval indicators.

Depending on whether they relate to one or two objects of study, indicators are single-object or inter-object. The former characterize only one object; the latter are obtained by comparing two quantities relating to different objects.

In terms of spatial definiteness, statistical indicators are divided into nationwide ones, characterizing the object or phenomenon for the country as a whole, and regional and local ones, relating to a part of the territory or to a separate object.

6) Types and relationship of relative indicators.

A relative indicator is the result of dividing one absolute indicator by another; it expresses the ratio between quantitative characteristics of socio-economic processes and phenomena. Relative indicators are therefore derived from absolute ones.

When a relative indicator is calculated, the absolute indicator in the numerator of the ratio is called the current or compared indicator. The indicator with which the comparison is made, and which stands in the denominator, is called the base of comparison. Relative indicators can be expressed as coefficients, percentages, or per mille, or they can be named numbers.

All relative indicators used in practice are divided into indicators of:

dynamics; plan; plan fulfilment; structure; coordination; intensity and level of economic development; comparison.

The relative indicator of dynamics is the ratio of the level of the process or phenomenon under study in a given period to the level of the same process or phenomenon in the past.

OPD = current level / previous (or base) level.

The value calculated in this way shows how many times the current level exceeds the previous (base) one, or what fraction of it the current level is. Expressed as a multiple ratio, this indicator is called the growth factor; multiplied by 100%, it gives the growth rate.

The relative indicator of structure is the ratio of the structural parts of the object under study to the whole. It is expressed in fractions of a unit or as a percentage. The calculated values $d_i$, called shares or specific weights, show what share the i-th part has in the total.

Relative indicators of coordination characterize the ratio of individual parts of the whole to each other. The part that has the largest share, or that has priority from an economic, social, or other point of view, is chosen as the base of comparison. The result shows how many units of each structural part there are per 1 unit of the base structural part.

The relative indicator of intensity characterizes the degree to which the process or phenomenon under study is spread in its natural environment. It is calculated when the absolute value is insufficient for well-founded conclusions about the scale of the phenomenon, its size, saturation, or density of distribution. It can be expressed as a percentage, per mille, or be a named value. A variety of intensity indicators are relative indicators of the level of economic development, which characterize production per capita and play an important role in assessing the development of a country's economy. In form of expression these indicators are close to averages, which often leads to their being confused or identified with them. The difference is that when calculating an average we deal with a set of units, each of which is a carrier of the averaged attribute.

The relative indicator of comparison is the ratio of like-named absolute indicators characterizing different objects (enterprises, firms, regions, districts, etc.).
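A minimal sketch of three of these ratios in Python may help fix the definitions. All figures and names below are hypothetical; only the ratios themselves follow the text.

```python
def dynamics(current, base):
    """Relative indicator of dynamics: current level / previous (base) level."""
    return current / base


def structure(parts):
    """Relative indicators of structure: each part divided by the whole."""
    total = sum(parts.values())
    return {name: value / total for name, value in parts.items()}


def coordination(parts, base_key):
    """Relative indicators of coordination: each part per 1 unit of the base part."""
    base = parts[base_key]
    return {name: value / base for name, value in parts.items() if name != base_key}


# Hypothetical output of an enterprise in two consecutive years.
growth_factor = dynamics(500.0, 400.0)
growth_rate = growth_factor * 100  # in percent

# Hypothetical staff composition.
staff = {"workers": 60, "engineers": 30, "clerks": 10}
shares = structure(staff)
per_worker = coordination(staff, "workers")

print(growth_factor, growth_rate)
print(shares)
print(per_worker)
```

Here the largest group ("workers") is used as the base for coordination, in line with the rule that the part with the largest share is usually chosen.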

Variation indicators

The study of variation (the change in the values of an attribute within a population) is of great importance in statistics and in socio-economic research in general. Absolute and relative indicators of variation, which characterize the fluctuation of the values of a varying attribute, make it possible, in particular, to measure the strength of relationships, to assess the homogeneity of the population and the typicality and stability of the mean, and to determine the size of the possible error of sample observation.

The absolute indicators of variation include the range of variation, the mean linear deviation, the variance, the standard deviation, and the quartile deviation.

The range of variation shows by how much the value of a quantitatively varying attribute changes:

$R = x_{\max} - x_{\min}$, where $x_{\max}$ ($x_{\min}$) is the maximum (minimum) value of the attribute in the population (in the distribution series).

The mean linear deviation $\bar d$ is defined as the mean of the absolute values of the deviations of the variants from their mean $\bar x$:

$\bar d = \frac{\sum |x_i - \bar x|}{n}$

The mean linear deviation is used relatively rarely to assess the variation of an attribute. Typically, the variance and the standard deviation are calculated.

If the variability of several attributes in one population, or of the same attribute in several populations with different centers of distribution, has to be compared, relative indicators of variation are used.

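These include the following indicators, each relating an absolute indicator of variation to the center of the distribution (here $\bar x$ is the mean):

1. Oscillation coefficient: $V_R = \frac{R}{\bar x} \cdot 100\%$, where R is the range of variation

2. Relative linear deviation: $V_{\bar d} = \frac{\bar d}{\bar x} \cdot 100\%$, where $\bar d$ is the mean linear deviation

3. Coefficient of variation: $V_\sigma = \frac{\sigma}{\bar x} \cdot 100\%$, where $\sigma$ is the standard deviation

4. Relative indicator of quartile variation: $V_Q = \frac{Q}{Me} \cdot 100\%$, where Q is the quartile deviation and Me is the median

The most commonly used measure of relative variation is the coefficient of variation. It is used not only for a comparative assessment of variation but also as a characteristic of the homogeneity of the population. The population is considered homogeneous if the coefficient of variation is below 0.33 (33%).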
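The indicators of variation can be computed as in the sketch below. It assumes ungrouped data and the population (not sample) variance. The relative indicators are taken here as the corresponding absolute indicator divided by the mean, which is the usual convention but is an assumption where the formulas are not shown. The data are hypothetical.

```python
import statistics


def variation_indicators(values):
    """Absolute and relative indicators of variation for ungrouped data."""
    n = len(values)
    mean = statistics.fmean(values)
    value_range = max(values) - min(values)
    mean_linear_dev = sum(abs(x - mean) for x in values) / n
    variance = statistics.pvariance(values, mu=mean)  # population variance
    std = variance ** 0.5
    return {
        "mean": mean,
        "range": value_range,
        "mean_linear_deviation": mean_linear_dev,
        "variance": variance,
        "std": std,
        # Relative indicators, as fractions of the mean (assumed convention).
        "oscillation_coefficient": value_range / mean,
        "relative_linear_deviation": mean_linear_dev / mean,
        "coefficient_of_variation": std / mean,
    }


# Hypothetical attribute values.
data = [2, 4, 4, 4, 5, 5, 7, 9]
result = variation_indicators(data)
homogeneous = result["coefficient_of_variation"] < 0.33

for name, value in result.items():
    print(f"{name}: {value:.3f}")
print("homogeneous:", homogeneous)
```

For this sample the coefficient of variation is 0.4, so by the 0.33 threshold the set would not be regarded as homogeneous.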

Forms of statistical observation.

1. Statistical reporting is an organizational form in which the units of observation submit information about their activities on forms of a regulated standard.

The peculiarity of reporting is that it must be documented, is mandatory, and is legally confirmed by the signature of the head or responsible person.

2. Specially organized observation. The most striking and simple example of this form of observation is the census. A census is usually carried out at regular intervals, simultaneously throughout the study area and as of the same moment in time.

Russian statistical bodies conduct censuses of the population, of certain types of settlements and organizations, material resources, perennial plantations, unfinished construction projects, etc.

3. The register form of observation is based on maintaining a statistical register. In the register, each unit of observation is characterized by a number of indicators. In domestic statistical practice, the most widely used are population registers and enterprise registers.

The population register is maintained by the civil registry offices.

The enterprise register (USREO) is maintained by the statistical authorities.

Types of statistical observation.

Observations can be divided into groups by the following criteria:

a) by time of registration

b) by coverage of the units of the population

By time of registration, observations are:

Current (continuous)

Discontinuous (periodic and one-time)

In current observation, changes in phenomena and processes are recorded as they occur (registration of births, deaths, marriages, divorces, etc.).

Periodic observation is carried out at regular intervals (the population census every 10 years).

One-time observation is held either irregularly or only once (a referendum).

By coverage of the units of the population, statistical observation is:

complete

non-complete

Complete observation is a survey of all units of the population.

Non-complete observation assumes that only part of the population under study is surveyed.

There are several types of non-complete observation:

Main-array method

Sample observation (sampling proper)

Monographic observation

The main-array method is characterized by the selection, as a rule, of the most significant, usually the largest, units of the population, in which a considerable part of all the observed attributes is concentrated.

In monographic observation, individual units of the population under study are analyzed thoroughly; these units are either typical of the population or represent some new varieties of the phenomenon.

Such observation is carried out to identify existing or emerging trends in the development of the phenomenon.

Methods of statistical observation

Direct observation

Documentary observation

Survey

Direct observation is observation in which the registrars themselves establish the fact to be recorded by direct measurement, counting, or weighing, and on this basis make an entry in the form.

The documentary method of observation is based on using various documents, as a rule of an accounting nature, as sources of information (for example, statistical reporting).

A survey is a method of observation in which the necessary information is obtained from the words of the respondent (oral, correspondent, questionnaire, in-person surveys, etc.).

Determination of sampling errors.

In sample observation, two types of errors are distinguished: registration errors and representativeness errors.

Registration errors are deviations between the value of an indicator obtained during statistical observation and its actual value. They can occur in both complete and non-complete observation. Registration errors arise from incorrect or inaccurate information. Their sources can be a misunderstanding of the question, inattention of the registrar, or the omission or double counting of individual units of observation. Registration errors are divided into systematic ones, which stem from causes acting in one direction and distort the results of the survey (for example, rounding of figures), and random ones, which result from various random factors (for example, transposition of adjacent digits). Random errors act in different directions and, when the surveyed population is sufficiently large, cancel each other out.

Representativeness errors are deviations of the value of an indicator in the surveyed population from its value in the original population. They are also divided into systematic ones, which result from violating the principles of selecting units from the original population, and random ones, which arise because the selected units do not fully reproduce the population as a whole. The size of the random error can be estimated.

Sampling error is the difference between the value of the attribute in the general population and its value calculated from the sample observation. In the practice of sample surveys, the mean and marginal sampling errors are determined most often.

The mean sampling error is calculated differently for different selection methods. For simple random or mechanical selection:

For the mean: $m = \sqrt{s^2 / n}$

For the share: $m = \sqrt{w(1-w) / n}$, where

m is the mean sampling error;

$s^2$ is the variance in the general population;

w is the share of units that have the attribute under study;

n is the sample size.

If the sample is formed by typical (stratified) selection, with units selected in proportion to the size of the typical groups, the mean error equals:

For the mean: $m = \sqrt{\overline{s_i^2} / n}$

For the share: $m = \sqrt{\overline{w_i(1-w_i)} / n}$, where the bar denotes averaging over the groups,

$\overline{s_i^2}$ is the mean of the within-group variances;

$w_i$ is the share of units in the i-th group that have the attribute under study.

$\overline{s_i^2} = \frac{\sum s_i^2 n_i}{\sum n_i}$

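The mean error of serial (cluster) sampling equals:

For the mean: $m = \sqrt{d_x^2 / r}$

For the share: $m = \sqrt{d_w^2 / r}$, where

$d_w^2$ is the between-series variance of the share;

$d_x^2$ is the between-series variance of the quantitative attribute;

r is the number of selected series.

$d_x^2 = \frac{\sum (\bar x_i - \bar x)^2}{r}$

$d_w^2 = \frac{\sum (w_i - \bar w)^2}{r}$

Here $\bar x_i$ and $w_i$ are the mean and the share in the i-th series, and $\bar x$ and $\bar w$ are the corresponding values over all selected series.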

If units are selected from the general population without replacement, the mean error formulas are multiplied by the correction factor $\sqrt{1 - n/N}$, where N is the size of the general population.

The marginal sampling error D is calculated as the product of the confidence coefficient t and the mean sampling error: $D = t \cdot m$. D is tied to the probability level with which it is guaranteed: that level determines the confidence coefficient t, and vice versa. The values of t are given in special mathematical tables.
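For simple random or mechanical selection, the mean and marginal errors can be computed as in the following sketch. The function names and numbers are hypothetical, and t = 2 is used only as an example of a tabulated confidence coefficient.

```python
import math


def mean_error_of_mean(variance, n, N=None):
    """Mean sampling error for the mean: sqrt(s^2 / n).

    If the population size N is given, selection is treated as being
    without replacement and the factor sqrt(1 - n/N) is applied.
    """
    error = math.sqrt(variance / n)
    if N is not None:
        error *= math.sqrt(1 - n / N)
    return error


def mean_error_of_share(w, n, N=None):
    """Mean sampling error for the share: sqrt(w(1 - w) / n)."""
    return mean_error_of_mean(w * (1 - w), n, N)


def marginal_error(t, mean_error):
    """Marginal sampling error D = t * m."""
    return t * mean_error


# Hypothetical inputs.
m_repl = mean_error_of_mean(variance=25.0, n=100)            # with replacement
m_norepl = mean_error_of_mean(variance=25.0, n=100, N=400)   # without replacement
m_share = mean_error_of_share(w=0.2, n=100)
D = marginal_error(t=2, mean_error=m_repl)

print(m_repl, m_norepl, m_share, D)
```

Because the correction factor is always below 1, selection without replacement gives a smaller error than selection with replacement for the same n.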

Determining the sample size.

The sample size is calculated, as a rule, at the stage of designing a sample survey. The formulas for determining the sample size follow from the formulas for the marginal sampling errors.

The size of a simple random or mechanical sample with replacement is determined by the formulas:

For the mean: $n = \frac{t^2 s^2}{D^2}$

For the share: $n = \frac{t^2 w(1-w)}{D^2}$

For sampling without replacement:

For the mean: $n = \frac{t^2 s^2 N}{N D^2 + t^2 s^2}$

For the share: $n = \frac{t^2 w(1-w) N}{N D^2 + t^2 w(1-w)}$

The values of $s^2$ and w are unknown before the sample observation is carried out. They are found approximately as follows:

1. they are taken from previous surveys;

2. if the maximum and minimum values of the attribute are known, the standard deviation is determined by the "three sigma" rule:

$s = \frac{x_{\max} - x_{\min}}{6}$

3. when studying an alternative attribute with no information about its share in the general population, w = 0.5 is taken, the value at which the variance w(1-w) is largest.
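The sample size formulas for random and mechanical selection, together with the approximations for the unknown variance, can be sketched as follows. The inputs are hypothetical, and rounding the result up to a whole unit is an assumption of this sketch rather than something stated in the text.

```python
import math


def sample_size(t, variance, D, N=None):
    """Required sample size for a given marginal error D.

    variance is s^2 for the mean or w(1 - w) for the share.
    With N given, the formula for selection without replacement is used.
    """
    if N is None:
        n = t ** 2 * variance / D ** 2
    else:
        n = t ** 2 * variance * N / (N * D ** 2 + t ** 2 * variance)
    # Round first to absorb floating-point noise, then round up (assumption).
    return math.ceil(round(n, 9))


def three_sigma_std(x_min, x_max):
    """Approximate standard deviation by the 'three sigma' rule."""
    return (x_max - x_min) / 6


# Hypothetical design: attribute known to lie between 20 and 80.
s = three_sigma_std(20, 80)
n_mean = sample_size(t=2, variance=s ** 2, D=2)
n_mean_finite = sample_size(t=2, variance=s ** 2, D=2, N=1000)

# Share with no prior information: w = 0.5 gives the largest variance.
n_share = sample_size(t=2, variance=0.5 * (1 - 0.5), D=0.05)

print(s, n_mean, n_mean_finite, n_share)
```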

With typical selection proportional to the size of the typical groups, the sample size for each group is determined by the formula $n_i = n \cdot \frac{N_i}{N}$, where

$n_i$ is the sample size from the i-th group;

$N_i$ is the size of the i-th group in the general population.

With selection proportional to the variation of the attribute, the sample size from each group is found as $n_i = \frac{n N_i s_i}{\sum N_i s_i}$, where $s_i$ is the standard deviation of the attribute in the i-th group.
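Both allocation rules are easy to compare side by side. The group sizes and standard deviations below are hypothetical, and the results are left unrounded.

```python
def proportional_allocation(n, group_sizes):
    """n_i = n * N_i / N: allocation proportional to group size."""
    N = sum(group_sizes)
    return [n * N_i / N for N_i in group_sizes]


def variation_allocation(n, group_sizes, group_stds):
    """n_i = n * N_i * s_i / sum(N_i * s_i): allocation proportional
    to group size times the group's standard deviation."""
    weights = [N_i * s_i for N_i, s_i in zip(group_sizes, group_stds)]
    total = sum(weights)
    return [n * w / total for w in weights]


# Hypothetical typical groups.
sizes = [500, 300, 200]
stds = [2.0, 4.0, 5.0]

by_size = proportional_allocation(100, sizes)
by_variation = variation_allocation(100, sizes, stds)

print(by_size)
print(by_variation)
```

With these numbers the second rule moves part of the sample from the largest group to the smaller but more variable ones.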

With typical selection with replacement, proportional to the size of the groups, the total sample size is found as follows, where $\overline{s_i^2}$ is the mean of the within-group variances:

For the mean: $n = \frac{t^2 \overline{s_i^2}}{D^2}$

For the share: $n = \frac{t^2 w(1-w)}{D^2}$

For typical selection without replacement:

For the mean: $n = \frac{t^2 \overline{s_i^2} N}{D^2 N + t^2 \overline{s_i^2}}$

For the share: $n = \frac{t^2 w(1-w) N}{D^2 N + t^2 w(1-w)}$

Basic concepts and prerequisites for the use of correlation and regression analysis.

Correlation is a statistical dependence between random variables that is not strictly functional, in which a change in one random variable leads to a change in the mathematical expectation of the other.

Correlation analysis has as its task the quantitative determination of the closeness of the relationship between two attributes and between the resultant attribute and several factor attributes. The closeness of the relationship is expressed quantitatively by the values of correlation coefficients.

Correlation-regression analysis as a general concept includes measuring the closeness and direction of the relationship and establishing its analytical expression (form), the latter being regression analysis.

Regression analysis consists in determining the analytical expression of a relationship in which the change in one quantity (called the dependent or resultant attribute) is due to the influence of one or more independent quantities (factors), while the set of all other factors that also affect the dependent quantity is taken as constant at average values. Regression can be single-factor (paired) or multi-factor (multiple).

The purpose of regression analysis is to estimate the functional dependence of the conditional mean of the resultant attribute (Y) on the factor attributes ($x_1, x_2, \ldots, x_k$).

The main premise of regression analysis is that only the resultant attribute (Y) obeys the normal distribution law, while the factor attributes $x_1, x_2, \ldots, x_k$ can have an arbitrary distribution law. In the analysis of time series, time t acts as the factor attribute. Regression analysis also presupposes causal relationships between the resultant attribute (Y) and the factor attributes ($x_1, x_2, \ldots, x_k$). The regression equation, or the statistical model of the relationship between socio-economic phenomena, expressed by the function $Y_x = f(x_1, x_2, \ldots, x_k)$, is sufficiently adequate to the real phenomenon or process being modeled if the following requirements for its construction are met.

1. The totality of the initial data under study is homogeneous and mathematically described by continuous functions.

2. The possibility of describing the simulated phenomenon by one or more equations of cause-and-effect relationships.

3. All factor signs must have a quantitative (numerical) expression.

4. The presence of a sufficiently large volume of the sample under study.

5. Cause-and-effect relationships between phenomena and processes should be described by a linear form of dependence or one reducible to linear.

6. Absence of quantitative restrictions on the parameters of the relationship model.

7. The constancy of the territorial and temporal structure of the studied population.

The theoretical validity of the relationship models built on the basis of correlation and regression analysis is ensured by observing the following basic conditions.

1. All signs and their joint distributions must obey the normal distribution law;

2. The variance of the modeled attribute (Y) should remain constant as the value of (Y) and the values of the factor attributes change.

3. Individual observations should be independent, i.e., the results of the i-th observation should not be related to the previous ones, contain information about subsequent observations, or influence them.

OBJECTIVES AND CONTENT OF THE SUMMARY

Statistical observation provides information on each unit of the object under study. The data obtained are not generalizing indicators. Without preliminary processing, they cannot be used to draw conclusions about the object as a whole.

Therefore, the goal of the next stage of statistical research is to systematize the primary data and, on this basis, obtain a summary characteristic of the entire object by means of generalizing statistical indicators.

A summary is a set of sequential operations that generalize the specific individual facts forming a population, in order to identify the typical features and patterns inherent in the phenomenon under study as a whole.

Whereas statistical observation collects data about each unit of an object, the result of the summary is detailed data that reflect the population as a whole.

A statistical summary should be based on a preliminary theoretical analysis of the phenomena and processes, so that no information about the phenomenon is lost during the summary and all statistical results reflect the most important characteristic features of the object.

According to the depth of material processing, the summary can be simple and complex.

A simple summary is the operation of calculating the totals for the same units of observation.

A complex summary is a set of operations that includes grouping observation units, counting the totals for each group and for the entire object, and presenting the grouping and summary results in the form of statistical tables.

The summary is preceded by the development of its program, which consists of the following stages: selection of grouping attributes; determination of the order in which groups are formed; development of a system of statistical indicators to characterize the groups and the object as a whole; development of layouts for the statistical tables in which the results of the summary are to be presented.

By the form of processing, the summary is either decentralized or centralized.

In a decentralized summary (used, as a rule, in processing statistical reporting), the material is processed in successive stages. Thus, the reports of enterprises are summarized by the statistical authorities of the constituent entities of the Russian Federation, and the results for each region are then sent to the State Statistics Committee of Russia, where the indicators for the country's national economy as a whole are determined.

In a centralized summary, all primary material goes to one organization, where it is processed from beginning to end. The centralized summary is usually used to process materials from one-time statistical surveys.

By the technique of execution, the statistical summary is divided into mechanized and manual.

A mechanized summary is one in which all operations are carried out using computers. In a manual summary, all basic operations (calculating group and overall totals) are carried out by hand.

To carry out the summary, a plan is drawn up that sets out the organizational issues: by whom and when all operations will be carried out, the procedure for conducting the summary, and the composition of the information to be published in the periodical press.

Closing time series

When analyzing time series (series of dynamics), it sometimes becomes necessary to close them, that is, to combine two or more series into one. Closing is needed when the levels of the series are not comparable because of territorial changes, price changes, or changes in the methodology for calculating the levels. Two such series can be combined into one using a comparability coefficient. Multiplying the data recorded before the change by this coefficient gives a closed (comparable) series of absolute values. Alternatively, the levels of the transition period before and after the change are each taken as 100%, and the remaining levels are recalculated as percentages relative to these levels, respectively.
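Closing with a comparability coefficient might look like the sketch below. It assumes that the coefficient is the ratio of the new-basis level to the old-basis level in the one period for which both are known. That definition, the function name, and the figures are all assumptions made for illustration.

```python
def close_series(old_levels, new_levels):
    """Combine two series that overlap in one transition period.

    old_levels: levels on the old basis; the LAST one is the transition period.
    new_levels: levels on the new basis; the FIRST one is the same period.
    Returns the comparability coefficient and the closed series.
    """
    k = new_levels[0] / old_levels[-1]
    # Recalculate the earlier levels to the new basis, then append the new series.
    closed = [level * k for level in old_levels[:-1]] + list(new_levels)
    return k, closed


# Hypothetical levels: the third period is recorded on both bases.
old = [20.0, 22.0, 25.0]   # old territorial boundaries
new = [30.0, 33.0, 36.0]   # new territorial boundaries

k, closed = close_series(old, new)
print(k)
print(closed)
```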

30. Methods of smoothing time series

Any time series can theoretically be represented as three components:

Trend (the main tendency of development of the time series);

Cyclic (periodic) fluctuations, including seasonal ones;

Random fluctuations.

One of the tasks in the analysis of time series is to establish the pattern of change in the levels of the phenomenon under study. In some cases this pattern is quite clear, for example a systematic decrease or a systematic increase in the levels of the series. Sometimes, however, the levels undergo a variety of changes (now increasing, now decreasing). In this case one can speak only of the general tendency of development: toward growth or toward decline.

Identifying the main tendency of development (the trend) is called smoothing the time series, and the methods of identifying the main tendency are called smoothing methods.

The trend can be isolated directly by three methods.

* Method of enlarged intervals. This method is based on enlarging the time periods to which the levels of the series relate. For example, a series of
daily output is replaced by a series of monthly output, and so on.

* Moving average method. In this method, the initial levels of the series are replaced by averages obtained from a given level and several levels symmetrically surrounding it. The whole number of levels over which the average is calculated is called the smoothing interval. The smoothing interval can be odd (3, 5, 7, etc. points) or even (2, 4, 6, etc. points). The averages are calculated by sliding, that is, by gradually dropping the first level from the current window and including the next one. With an odd interval, the resulting arithmetic mean is assigned to the middle of the window.

The drawback of smoothing by moving averages is the conventionality of determining smoothed levels for points at the beginning and end of the series.
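A moving average with an odd smoothing interval can be sketched as follows. The series is hypothetical, and this sketch simply leaves the points at the beginning and end of the series without smoothed values.

```python
def moving_average(levels, window=3):
    """Centered moving average with an odd smoothing interval.

    Each smoothed value is assigned to the middle of its window, so the
    first and last window // 2 points of the series get no smoothed value.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("this sketch supports only odd smoothing intervals")
    half = window // 2
    return [
        sum(levels[i - half:i + half + 1]) / window
        for i in range(half, len(levels) - half)
    ]


# Hypothetical levels of a time series.
levels = [10, 12, 11, 15, 14, 18, 17]
smoothed = moving_average(levels, window=3)

# smoothed[k] corresponds to the original point with index k + 1.
for k, value in enumerate(smoothed, start=1):
    print(f"t={k + 1}: level={levels[k]}, smoothed={value:.2f}")
```

The smoothed series is shorter than the original by window - 1 points, which is the loss at the ends mentioned above.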

* Analytical smoothing is the most effective way to identify the main tendency of development. In this case the levels of the time series are expressed as a function of time: Y_t = f(t).

The purpose of analytical smoothing of a time series is to determine the analytical function f(t). In practice, the form of f(t) is chosen from the available time series and its parameters are found, and then the behavior of the deviations from the trend is analyzed.

In economics, a polynomial function of the following form is often used: Y_t = a_0 + Σ_{i=1}^{p} a_i·t^i (3.12)

Of the functions of the form (3.12), the ones most often used for smoothing are the linear function f(t) = a_0 + a_1·t and the parabola f(t) = a_0 + a_1·t + a_2·t².

The coefficients a_0, a_1, a_2, ..., a_p are found by the method of least squares.

According to this method, the parameters of a polynomial of degree p are found by solving a system of so-called normal equations. For a linear trend the system has the form:

n·a_0 + a_1·Σt = ΣY

a_0·Σt + a_1·Σt² = ΣY·t
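As an illustration, the two normal equations for a linear trend can be solved directly. The sketch below is only one way to do it; it assumes that time is numbered t = 1, 2, ..., n, and the function name `linear_trend` and the sample levels are hypothetical.

```python
def linear_trend(levels):
    """Fit Y_t = a0 + a1*t by solving the two normal equations.

    Assumption: time points are numbered t = 1, 2, ..., n.
    """
    n = len(levels)
    t = range(1, n + 1)
    sum_t = sum(t)
    sum_tt = sum(ti * ti for ti in t)
    sum_y = sum(levels)
    sum_ty = sum(ti * yi for ti, yi in zip(t, levels))
    # System:  n*a0     + a1*sum_t  = sum_y
    #          a0*sum_t + a1*sum_tt = sum_ty
    det = n * sum_tt - sum_t ** 2
    a1 = (n * sum_ty - sum_t * sum_y) / det
    a0 = (sum_y - a1 * sum_t) / n
    return a0, a1


# Hypothetical levels lying exactly on the line Y = 1 + 2t.
levels = [3, 5, 7, 9, 11]
a0, a1 = linear_trend(levels)
print(a0, a1)
```

For real data the fitted values a0 + a1*t will not coincide with the levels, and the deviations from them are what remains after the trend is removed.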

The trend shows how systematic factors affect the levels of the time series. The fluctuation of the levels around the trend serves as a measure of the impact of residual (random) factors. This impact can be assessed using the standard deviation formula.

Basic concepts of correlation-regression analysis.

Variation series

The observed values x_1, x_2, …, x_k of a random variable are called variants.

The frequency of a variant x_i is the number n_i (i = 1, …, k) showing how many times this variant occurs in the sample.

The relative frequency (share) of a variant x_i (i = 1, …, k) is the ratio of its frequency n_i to the sample size n: w_i = n_i / n.

Frequencies and relative frequencies are called weights.

The cumulative (accumulated) frequency is the number of variants whose values are less than a given x: n_x^acc = Σ_{x_i < x} n_i.

The cumulative relative frequency is the ratio of the cumulative frequency to the sample size: w_x^acc = n_x^acc / n.

A variation series (statistical series) is a sequence of variants written in ascending order together with their corresponding weights.

A variation series may be discrete (a sample of values of a discrete random variable) or continuous (interval) (a sample of values of a continuous random variable).

The discrete variation series has the form:

Variants       x_1   x_2   …   x_k
Frequencies    n_1   n_2   …   n_k

When the number of variants is large or the characteristic is continuous (the random variable can take any value in a certain interval), an interval variation series is compiled.

To build an interval variation series, the variants are grouped, that is, divided into separate intervals: [x_0; x_1), [x_1; x_2), …, [x_{k−1}; x_k).

The number of intervals is sometimes determined using Sturges' formula: k = 1 + log_2 n ≈ 1 + 3.322·lg n.

Then the number of variants that fall into each interval is counted, giving the frequencies n_i (or relative frequencies n_i/n). If a variant lies on the boundary of an interval, it is assigned to the right-hand interval.

The interval variation series has the form:

Variants       [x_0; x_1)   [x_1; x_2)   …   [x_{k−1}; x_k)
Frequencies    n_1          n_2          …   n_k

The empirical (statistical) distribution function is the function whose value at a point x equals the relative frequency with which a variant takes a value less than x (the cumulative relative frequency for x): F*(x) = n_x / n, where n_x is the number of observations with values less than x and n is the sample size.
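The definition can be tried out with a short Python sketch. The helper name `empirical_cdf` and the sample values are hypothetical; the strict inequality "less than x" follows the definition given in the text.

```python
from collections import Counter


def empirical_cdf(sample):
    """Return F(x): the share of sample values strictly less than x."""
    n = len(sample)
    counts = Counter(sample)  # variant -> frequency

    def F(x):
        accumulated = sum(freq for value, freq in counts.items() if value < x)
        return accumulated / n

    return F


# Hypothetical sample (illustrative only).
sample = [1, 2, 2, 3, 3, 3, 4]
F = empirical_cdf(sample)
print(F(1), F(3), F(5))
```

The function is zero at and below the smallest variant and reaches one only to the right of the largest variant, which is a consequence of the strict inequality.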

The frequency polygon is a broken line whose segments connect the points with coordinates (x_1; n_1), (x_2; n_2), …, (x_k; n_k). The frequency polygon is the statistical analogue of the distribution polygon.

For a continuous variation series, a polygon can be built if the midpoints of the intervals are taken as the values x_1, x_2, …, x_k.

An interval variation series is usually depicted graphically by a histogram.

A histogram is a stepped figure consisting of rectangles whose bases are the partial intervals of length h = x_{i+1} − x_i, i = 0, …, k−1, and whose heights are equal to the frequencies (or relative frequencies) of the intervals, n_i (w_i).

The cumulative curve (cumulate) is the curve of accumulated frequencies (or accumulated relative frequencies). For a discrete series the cumulative curve is a broken line connecting the points (x_i; n_i^acc) or (x_i; w_i^acc), i = 1, …, k. For an interval series the cumulative curve starts from the point whose abscissa is the beginning of the first interval and whose ordinate is an accumulated frequency (relative frequency) of zero. The other points of this broken line correspond to the ends of the intervals.


    Variation series: definition, types, main characteristics. Methods of calculating the mode, median, and arithmetic mean in medical and statistical studies (illustrated with a hypothetical example).

    A variation series is a series of numerical values of the characteristic under study that differ from each other in magnitude and are arranged in a certain sequence (in ascending or descending order). Each numerical value of the series is called a variant (V), and the number showing how often a given variant occurs in the series is called the frequency (p).

    The total number of observations that make up the variation series is denoted by the letter n. The differences in the values of the characteristic under study are called variation. If the varying characteristic has no quantitative measure, the variation is called qualitative and the distribution series is called attributive (for example, distribution by disease outcome, health status, etc.).

    If the varying characteristic is expressed quantitatively, the variation is called quantitative and the distribution series is called a variation series.

    Variation series are divided into discrete (discontinuous) and continuous according to the nature of the quantitative characteristic, and into simple and weighted according to the frequency of occurrence of the variants.

    In a simple variation series each variant occurs only once (p = 1); in a weighted one the same variant occurs several times (p > 1). Examples of such series are discussed later in the text. If a quantitative characteristic is continuous, i.e. there are intermediate fractional values between the integers, the variation series is called continuous.

    For example: 10.0 - 11.9

    14.0 - 15.9, etc.

    If the quantitative characteristic is discontinuous, i.e. its individual values (variants) differ from each other by a whole number and have no intermediate fractional values, the variation series is called discontinuous or discrete.

    Using the data from the previous example on the heart rate of 21 students, we will build a variation series (Table 1).

    Table 1

    Distribution of medical students by pulse rate (bpm)

    Thus, to build a variation series means to systematize and order the available numerical values (variants), i.e. to arrange them in a certain sequence (ascending or descending) with their corresponding frequencies. In the example under consideration, the variants are arranged in ascending order and are expressed as discrete integers, and each variant occurs several times, i.e. we are dealing with a weighted discrete variation series.

    As a rule, if the number of observations in the statistical population under study does not exceed 30, it is enough to arrange all the values of the characteristic in a variation series in ascending order, as in Table 1, or in descending order.

    With a large number of observations (n > 30), the number of distinct variants can be very large. In this case an interval (grouped) variation series is compiled, in which the variants are combined into groups to simplify subsequent processing and to clarify the nature of the distribution.

    The number of variant groups usually ranges from 8 to 15.

    There must be at least 5 of them, because otherwise the grouping is too coarse, which distorts the overall picture of variation and greatly affects the accuracy of the averages. When the number of variant groups exceeds 20-25, the accuracy of the averages increases, but the features of the variation of the characteristic are significantly distorted and mathematical processing becomes more complicated.

    When compiling a grouped series, the following must be taken into account:

    - the variant groups must be placed in a definite order (ascending or descending);

    - the intervals in the variant groups should be the same;

    - the boundary values of the intervals should not coincide, because otherwise it will not be clear to which group individual variants belong;

    - the qualitative features of the collected material must be taken into account when setting the limits of the intervals (for example, when studying the weight of adults an interval of 3-4 kg is acceptable, whereas for children in the first months of life it should not exceed 100 g).

    Let's build a grouped (interval) series that characterizes the data on the pulse rate (number of beats per minute) for 55 medical students before the exam: 64, 66, 60, 62,

    64, 68, 70, 66, 70, 68, 62, 68, 70, 72, 60, 70, 74, 62, 70, 72, 72,

    64, 70, 72, 76, 76, 68, 70, 58, 76, 74, 76, 76, 82, 76, 72, 76, 74,

    79, 78, 74, 78, 74, 78, 74, 74, 78, 76, 78, 76, 80, 80, 80, 78, 78.

    To build a grouped series, you need to:

    1. Determine the size of the interval;

    2. Determine the midpoint, beginning, and end of each variant group of the variation series.

    ● The size of the interval (i) is determined from the expected number of groups (r), which is set depending on the number of observations (n) according to a special table.

    Number of groups depending on the number of observations:

    In our case, for 55 students, 8 to 10 groups can be formed.

    The size of the interval (i) is determined by the following formula:

    i = (Vmax − Vmin) / r

    In our example, the size of the interval is (82 − 58) / 8 = 3.

    If the interval size is a fractional number, the result should be rounded to a whole number.
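The interval calculation for the 55 pulse values can be reproduced with a short Python sketch. It is only an illustration: the groups are assumed to start at the smallest value and to cover whole numbers without overlapping boundaries (58-60, 61-63, and so on), the helper `group_of` is hypothetical, and Python's built-in `round` stands in for the rounding rule.

```python
from collections import Counter

# Pulse rate of 55 medical students before the exam (data from the text).
pulse = [
    64, 66, 60, 62,
    64, 68, 70, 66, 70, 68, 62, 68, 70, 72, 60, 70, 74, 62, 70, 72, 72,
    64, 70, 72, 76, 76, 68, 70, 58, 76, 74, 76, 76, 82, 76, 72, 76, 74,
    79, 78, 74, 78, 74, 78, 74, 74, 78, 76, 78, 76, 80, 80, 80, 78, 78,
]

r = 8                                   # expected number of groups
v_min, v_max = min(pulse), max(pulse)
width = round((v_max - v_min) / r)      # (82 - 58) / 8 = 3


def group_of(value):
    """Return the (lower, upper) bounds of the group containing `value`.

    Assumption: groups start at v_min and hold `width` consecutive integers.
    """
    index = (value - v_min) // width
    lower = v_min + index * width
    return (lower, lower + width - 1)


grouped = Counter(group_of(v) for v in pulse)
for (lower, upper), count in sorted(grouped.items()):
    print(f"{lower}-{upper}: {count}")
```

With these assumptions the largest value, 82, opens a ninth group, which is still within the 8 to 10 groups allowed for 55 observations.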

    There are several types of averages:

    ● arithmetic mean,

    ● geometric mean,

    ● harmonic mean,

    ● root mean square,

    ● progressive mean,

    ● median

    In medical statistics, arithmetic averages are most often used.

    The arithmetic mean (M) is a generalizing value that expresses what is typical of the entire population. The main methods for calculating M are the arithmetic mean method and the method of moments (conditional deviations).

    The arithmetic mean method is used to calculate the simple arithmetic mean and the weighted arithmetic mean. The choice of method depends on the type of variation series. In the case of a simple variation series, in which each variant occurs only once, the simple arithmetic mean is determined by the formula:

    M = ∑V/n

    where: M is the arithmetic mean;

    V is the value of the varying characteristic (variant);

    Σ denotes summation;

    n is the total number of observations.

    An example of calculating the simple arithmetic mean. Respiratory rate (number of breaths per minute) in 9 men aged 35: 20, 22, 19, 15, 16, 21, 17, 23, 18.

    To determine the average respiratory rate in men aged 35, it is necessary to:

    1. Build a variation series, placing all variants in ascending or descending order: 15, 16, 17, 18, 19, 20, 21, 22, 23. This is a simple variation series, because each variant value occurs only once.

    2. Calculate the simple arithmetic mean: M = ∑V/n = 171/9 = 19 breaths per minute

    Conclusion. The respiratory rate in men aged 35 averages 19 breaths per minute.

    If individual values of the variants are repeated, there is no need to write out each variant in a line; it is enough to list the variant values that occur (V) and to indicate next to each the number of its repetitions (p). Such a variation series, in which the variants are, as it were, weighted by their corresponding frequencies, is called a weighted variation series, and the average calculated from it is the weighted arithmetic mean.

    The weighted arithmetic mean is determined by the formula: M = ∑Vp/n

    where n is the number of observations, equal to the sum of the frequencies, Σp.

    An example of calculating the arithmetic weighted average.

    The duration of disability (in days) in 35 patients with acute respiratory infections (ARI) treated by a local doctor during the first quarter of the current year was: 6, 7, 5, 3, 9, 8, 7, 5, 6, 4, 9, 8, 7, 6, 6, 9, 6, 5, 10, 8, 7, 11, 13, 5, 6, 7, 12, 4, 3, 5, 2, 5, 6, 6, 7 days.

    The procedure for determining the average duration of disability in patients with acute respiratory infections is as follows:

    1. Build a weighted variation series, because individual variant values are repeated several times. To do this, arrange all the variants in ascending or descending order with their corresponding frequencies.

    In our case, the variants are in ascending order.

    2. Calculate the weighted arithmetic mean using the formula: M = ∑Vp/n = 233/35 = 6.7 days

    Distribution of patients with acute respiratory infections by duration of disability:

    Duration of disability, days (V)    Number of patients (p)    Vp
    2                                   1                         2
    3                                   2                         6
    4                                   2                         8
    5                                   6                         30
    6                                   8                         48
    7                                   6                         42
    8                                   3                         24
    9                                   3                         27
    10                                  1                         10
    11                                  1                         11
    12                                  1                         12
    13                                  1                         13
                                        ∑p = n = 35               ∑Vp = 233

    Conclusion. The duration of disability in patients with acute respiratory infections averaged 6.7 days.
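The same calculation can be checked with a few lines of Python. This is only an illustrative sketch: it takes the 35 durations listed in the text, builds the weighted series with `collections.Counter`, and applies M = ∑Vp/n; the variable names are arbitrary.

```python
from collections import Counter

# Duration of disability in days for 35 patients (data from the text).
days = [6, 7, 5, 3, 9, 8, 7, 5, 6, 4, 9, 8, 7, 6, 6, 9, 6, 5, 10, 8,
        7, 11, 13, 5, 6, 7, 12, 4, 3, 5, 2, 5, 6, 6, 7]

freq = Counter(days)                      # variant V -> frequency p
n = sum(freq.values())                    # n = sum of p
total = sum(v * p for v, p in freq.items())   # sum of V*p
mean = total / n                          # weighted arithmetic mean

for v in sorted(freq):
    print(v, freq[v], v * freq[v])
print(n, total, round(mean, 1))
```

Because the weighted series is just a compact form of the raw data, the result is the same as averaging the 35 values directly.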

    The mode (Mo) is the variant that occurs most often in the variation series. For the distribution presented in the table, the mode is the variant equal to 10: it occurs more often than the others, 6 times.

    Distribution of patients by length of hospital stay (in days)

    V
    p

    Sometimes it is difficult to determine the exact value of the mode, since the data may contain several values that occur "most often".

    The median (Me) is a non-parametric indicator that divides the variation series into two equal halves: the same number of variants lies on either side of the median.

    For example, for the distribution shown in the table, the median is 10, because 14 variants lie on either side of this value, i.e. the number 10 occupies the central position in the series and is its median.

    Given that the number of observations in this example is even (n = 34), the position of the median can be determined as follows:

    (2+3+4+5+6+5+4+3+2)/2 = 34/2 = 17

    This means that the middle of the series falls on the seventeenth variant, which corresponds to a median of 10. For the distribution presented in the table, the arithmetic mean is:

    M = ∑Vp/n = 334/34 = 10.1

    So, for the 34 observations from Table 8 we obtained Mo = 10, Me = 10, and an arithmetic mean (M) of 10.1. In our example all three indicators turned out to be equal or close to each other, although they are different in nature.

    The arithmetic mean is the resultant of all influences; all variants without exception take part in its formation, including the extreme ones, which are often atypical for the given phenomenon or population.

    The mode and the median, in contrast to the arithmetic mean, do not depend on all the individual values of the varying characteristic (on the values of the extreme variants and the degree of scatter of the series). The arithmetic mean characterizes the entire mass of observations, the mode and median characterize the bulk

    A variation series is a series in which the variants (in ascending or descending order) are set against their corresponding frequencies.

    Variants are the individual quantitative expressions of a characteristic. They are denoted by the Latin letter V. In the classical understanding, the term "variant" refers to each unique value of the characteristic, regardless of the number of repetitions.

    For example, in the variation series of indicators of systolic blood pressure measured in ten patients:

    110, 120, 120, 130, 130, 130, 140, 140, 160, 170;

    only 6 values are variants:

    110, 120, 130, 140, 160, 170.

    The frequency is a number indicating how many times a variant is repeated. It is denoted by the Latin letter P. The sum of all frequencies (which, of course, is equal to the number of all subjects studied) is denoted n.

      In our example, the frequencies will take on the following values:
    • for variant 110 frequency P = 1 (value 110 occurs in one patient),
    • for variant 120 frequency P = 2 (value 120 occurs in two patients),
    • for variant 130 frequency P = 3 (value 130 occurs in three patients),
    • for variant 140 frequency P = 2 (value 140 occurs in two patients),
    • for variant 160 frequency P = 1 (value 160 occurs in one patient),
    • for variant 170 frequency P = 1 (value 170 occurs in one patient).

    Types of variation series:

    1. simple: a series in which each variant occurs only once (all frequencies are equal to 1);
    2. weighted: a series in which one or more variants occur repeatedly.

    The variation series is used to describe large arrays of numbers; it is in this form that the data collected in most medical studies are initially presented. To characterize a variation series, special indicators are calculated, including average values, indicators of variability (the so-called dispersion), and indicators of the representativeness of sample data.

    Variation series indicators

    1) The arithmetic mean is a generalizing indicator that characterizes the size of the characteristic under study. The arithmetic mean, denoted M, is the most common type of average. It is calculated as the ratio of the sum of the values of all units of observation to the number of all subjects examined. The method of calculation differs for a simple and a weighted variation series.

    Formula for calculating the simple arithmetic mean:

    M = ΣV / n

    Formula for calculating the weighted arithmetic mean:

    M = Σ(V * P) / n

    2) The mode is another average of the variation series, corresponding to the most frequently repeated variant. In other words, it is the variant with the highest frequency. It is denoted Mo. The mode is calculated only for weighted series, since in simple series none of the variants is repeated and all frequencies are equal to one.

    For example, in the variation series of heart rate values:

    80, 84, 84, 86, 86, 86, 90, 94;

    the value of the mode is 86, since this variant occurs 3 times, therefore its frequency is the highest.

    3) The median is the value of the variant that divides the variation series in half, so that an equal number of variants lies on either side of it. The median, like the arithmetic mean and the mode, is an average value. It is denoted Me.

    4) The standard deviation (synonyms: root-mean-square deviation, sigma deviation, sigma) is a measure of the variability of the variation series. It is an integral indicator that combines all the deviations of the variants from the mean. In effect, it answers the question of how far and how often the variants spread from the arithmetic mean. It is denoted by the Greek letter σ ("sigma").

    When the population size is more than 30 units, the standard deviation is calculated using the following formula:

    σ = √( Σ(V − M)²·P / n )

    For small populations (30 observation units or fewer) the standard deviation is calculated using a different formula:

    σ = √( Σ(V − M)²·P / (n − 1) )
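The indicators listed above can be computed with Python's standard `statistics` module. The sketch below uses the heart-rate series from the mode example. It assumes that the large-population formula divides by n (`statistics.pstdev`) and the small-population formula divides by n − 1 (`statistics.stdev`); this is the usual convention, but it is an assumption here.

```python
import statistics

# Heart-rate series from the mode example in the text.
heart_rate = [80, 84, 84, 86, 86, 86, 90, 94]

mode = statistics.mode(heart_rate)       # most frequent variant
median = statistics.median(heart_rate)   # mean of the two central variants
mean = statistics.mean(heart_rate)       # arithmetic mean M

# Assumed convention: divide by n for n > 30, by n - 1 otherwise.
if len(heart_rate) > 30:
    sigma = statistics.pstdev(heart_rate)
else:
    sigma = statistics.stdev(heart_rate)

print(mode, median, mean, round(sigma, 2))
```

For this series the mode and the median coincide, while the mean is slightly higher because it is pulled up by the extreme value 94.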

    Let us call the distinct values in the sample variants and denote them x_1, x_2, …. First, we rank the variants, i.e. arrange them in ascending or descending order. Each variant is assigned its own weight, i.e. a number that characterizes the contribution of this variant to the whole population. Frequencies or relative frequencies act as weights.

    The frequency n_i of a variant x_i is the number showing how many times this variant occurs in the sample under consideration.

    The relative frequency w_i of a variant x_i is the ratio of the frequency of the variant to the sum of the frequencies of all variants. The relative frequency shows what share of the units of the sample has the given variant.

    The sequence of variants with their corresponding weights (frequencies or relative frequencies), written in ascending (or descending) order, is called a variation series.

    Variation series are discrete and interval.

    For a discrete variation series, point values of the characteristic are specified; for an interval series, the values of the characteristic are specified as intervals. A variation series can show the distribution of frequencies or of relative frequencies, depending on which value is indicated for each variant.

    The discrete variation series of the frequency distribution has the form:

    Relative frequencies are found by the formula w_i = n_i / n, i = 1, 2, …, m.

    w_1 + w_2 + … + w_m = 1.

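    Example 4.1. For the given set of numbers

    4, 6, 6, 3, 4, 9, 6, 4, 6, 6

    build the discrete variation series of the frequency distribution and of the relative-frequency distribution.

    Solution. The size of the population is n = 10. The discrete series of the frequency distribution has the form

    x_i    3     4     6     9
    n_i    1     3     5     1

    and the discrete series of the relative-frequency distribution has the form

    x_i    3     4     6     9
    w_i    0.1   0.3   0.5   0.1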
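Example 4.1 can be checked with a minimal Python sketch that counts the frequency of each variant and divides by n. The variable names are arbitrary.

```python
from collections import Counter

# Data of Example 4.1.
data = [4, 6, 6, 3, 4, 9, 6, 4, 6, 6]
n = len(data)

# Frequencies n_i, with the variants in ascending order.
freq = dict(sorted(Counter(data).items()))
# Relative frequencies w_i = n_i / n.
rel = {x: count / n for x, count in freq.items()}

print(freq)
print(rel)
```

The frequencies add up to n and the relative frequencies add up to one, which is a quick consistency check for any series built this way.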

    Interval series are written in a similar form.

    The interval variation series of the frequency distribution is written as:

    The sum of all frequencies is equal to the total number of observations, i.e. the size of the population: n = n_1 + n_2 + … + n_m.

    The interval variation series of the distribution of relative frequencies has the form:

    The relative frequency is found by the formula w_i = n_i / n, i = 1, 2, …, m.

    The sum of all relative frequencies is equal to one: w_1 + w_2 + … + w_m = 1.

    Interval series are the ones used most often in practice. If the sample contains a great deal of data and the values differ from each other by arbitrarily small amounts, the discrete series for these data will be cumbersome and inconvenient for further study. In this case the data are grouped: the interval containing all the values of the characteristic is divided into several partial intervals and, once the frequency of each interval has been calculated, an interval series is obtained. Let us write down in more detail the scheme for constructing an interval series, assuming that the partial intervals all have the same length.

    2.2 Building an interval series

    To build an interval series, you need:

    Determine the number of intervals;

    Determine the length of the intervals;

    Determine the location of the intervals on the axis.

    To determine the number of intervals k there is Sturges' formula, according to which

    k = 1 + log_2 n ≈ 1 + 3.322·lg n,

    where n is the size of the population.

    For example, if there are 100 values of the characteristic (variants), the formula gives k = 1 + 3.322·lg 100 ≈ 7.6, so about 8 intervals are recommended for constructing the interval series.
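A short Python sketch of the calculation follows. It assumes the standard form of Sturges' formula, k = 1 + log2(n); rounding the result up to a whole number is one possible convention, and the function name `sturges` is hypothetical.

```python
import math


def sturges(n):
    """Number of intervals suggested by Sturges' formula (not rounded)."""
    return 1 + math.log2(n)


k_exact = sturges(100)
k = math.ceil(k_exact)   # assumed convention: round up to a whole number
print(round(k_exact, 2), k)
```

The formula grows slowly with n, so doubling the sample size adds only one more interval.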

    However, very often in practice the number of intervals is chosen by the researcher himself, considering that this number should not be very large, so that the series is not cumbersome, but also not very small, so as not to lose some properties of the distribution.

    The interval length h is determined by the following formula:

    h = (x_max − x_min) / k,

    where x_max and x_min are the largest and smallest values of the variants.

    The value R = x_max − x_min is called the range of the series.

    The intervals themselves can be constructed in different ways. One of the simplest is as follows. The value a_1 = x_min − h/2 is taken as the beginning of the first interval. The remaining interval boundaries are then found by the formula a_{i+1} = a_i + h. Obviously, the end of the last interval a_{m+1} must satisfy the condition a_{m+1} ≥ x_max.

    After all the interval boundaries have been found, the frequencies (or relative frequencies) of the intervals are determined. To do this, all the variants are examined and the number of variants falling into each interval is counted. Let us look at the full construction of an interval series using an example.

    Example 4.2. For the following statistics, written in ascending order, build an interval series with the number of intervals equal to 5:

    11, 12, 12, 14, 14, 15, 21, 21, 22, 23, 25, 38, 38, 39, 42, 42, 44, 45, 50, 50, 55, 56, 58, 60, 62, 63, 65, 68, 68, 68, 70, 75, 78, 78, 78, 78, 80, 80, 86, 88, 90, 91, 91, 91, 91, 91, 93, 93, 95, 96.

    Solution. Total n=50 variant values.

    The number of intervals is specified in the problem condition, i.e. k=5.

    The length of the intervals is
    h = (96 − 11) / 5 = 17.

    Let's define the boundaries of the intervals:

    a_1 = 11 − 8.5 = 2.5; a_2 = 2.5 + 17 = 19.5; a_3 = 19.5 + 17 = 36.5;

    a_4 = 36.5 + 17 = 53.5; a_5 = 53.5 + 17 = 70.5; a_6 = 70.5 + 17 = 87.5;

    a_7 = 87.5 + 17 = 104.5.

    To determine the frequency of an interval, we count the number of variants that fall into it. For example, the variants 11, 12, 12, 14, 14, 15 fall into the first interval, from 2.5 to 19.5. Their number is 6, so the frequency of the first interval is n_1 = 6. The relative frequency of the first interval is w_1 = 6/50 = 0.12. The variants 21, 21, 22, 23, 25, of which there are 5, fall into the second interval, from 19.5 to 36.5. Therefore the frequency of the second interval is n_2 = 5, and its relative frequency is w_2 = 5/50 = 0.1. Having found the frequencies and relative frequencies of all intervals in the same way, we obtain the following interval series.

    The interval series of the frequency distribution has the form:

    Interval     2.5-19.5   19.5-36.5   36.5-53.5   53.5-70.5   70.5-87.5   87.5-104.5
    Frequency    6          5           9           11          8           11

    The sum of the frequencies is 6+5+9+11+8+11=50.

    The interval series of the relative-frequency distribution has the form:

    Interval              2.5-19.5   19.5-36.5   36.5-53.5   53.5-70.5   70.5-87.5   87.5-104.5
    Relative frequency    0.12       0.1         0.18        0.22        0.16        0.22

    The sum of the relative frequencies is 0.12+0.1+0.18+0.22+0.16+0.22=1. ■
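The construction in Example 4.2 can be reproduced with a short Python sketch. It follows the scheme used in the example (first boundary half an interval below the smallest value, equal interval lengths). Treating each interval as closed on the left and open on the right is an assumption; it does not affect this data set, since no value coincides with a boundary.

```python
# Data of Example 4.2 (already in ascending order).
data = [11, 12, 12, 14, 14, 15, 21, 21, 22, 23, 25, 38, 38, 39, 42, 42,
        44, 45, 50, 50, 55, 56, 58, 60, 62, 63, 65, 68, 68, 68, 70, 75,
        78, 78, 78, 78, 80, 80, 86, 88, 90, 91, 91, 91, 91, 91, 93, 93,
        95, 96]

k = 5                                    # number of intervals from the task
h = (max(data) - min(data)) / k          # interval length: (96 - 11) / 5
bounds = [min(data) - h / 2]             # first boundary: 11 - 8.5
while bounds[-1] <= max(data):           # add boundaries until x_max is covered
    bounds.append(bounds[-1] + h)

# Frequency of each interval [lo, hi) and the matching relative frequency.
counts = [sum(lo <= x < hi for x in data)
          for lo, hi in zip(bounds, bounds[1:])]
rel = [c / len(data) for c in counts]

for lo, hi, c, w in zip(bounds, bounds[1:], counts, rel):
    print(f"{lo}-{hi}: n={c}, w={w}")
```

Because the first boundary is shifted half an interval to the left, covering the whole range takes one interval more than the k used to compute h, which is why the example ends with six intervals.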

    When constructing interval series, other rules can also be applied, depending on the specific conditions of the problem, namely:

    1. An interval variation series can consist of partial intervals of different lengths. Unequal interval lengths make it possible to bring out the properties of a statistical population in which the characteristic is distributed unevenly. For example, if the interval boundaries are numbers of inhabitants of cities, it is advisable to use intervals of unequal length. Obviously, for small towns even a small difference in the number of inhabitants matters, whereas for large cities a difference of tens or hundreds of inhabitants is insignificant. Interval series with partial intervals of unequal length are studied mainly in the general theory of statistics, and their consideration is beyond the scope of this manual.

    2. In mathematical statistics, interval series are sometimes considered in which the left boundary of the first interval is taken to be −∞ and the right boundary of the last interval is taken to be +∞. This is done to bring the statistical distribution closer to the theoretical one.

    3. When constructing interval series, the value of some variant may coincide exactly with an interval boundary. The best course in this case is as follows. If there is only one such coincidence, the variant with its frequency is assigned to the interval lying closer to the middle of the interval series. If there are several such variants, either all of them are assigned to the intervals to the right of these variants, or all to the left.

    4. After the number of intervals and their length have been determined, the intervals can be positioned in another way. Find the arithmetic mean x̄ of all the variant values and build the first interval so that this sample mean lies inside it. This gives the interval from x̄ − 0.5h to x̄ + 0.5h. Then, adding the interval length to the left and to the right, build the remaining intervals until x_min and x_max fall into the first and last intervals respectively.

    5. Interval series with a large number of intervals are conveniently written vertically, i.e. with the intervals in the first column rather than the first row, and the frequencies (or relative frequencies) in the second column.

    Sample data can be considered as values ​​of some random variable X. A random variable has its own distribution law. It is known from probability theory that the law of distribution of a discrete random variable can be specified as a distribution series, and for a continuous one, using the distribution density function. However, there is a universal distribution law that holds for both discrete and continuous random variables. This distribution law is given as a distribution function F(x) = P(X<x). For sample data, you can specify an analogue of the distribution function - the empirical distribution function.

