Plot a statistical interval distribution series. Grouping data and building a distribution series

The results of grouping the collected statistical data are usually presented in the form of distribution series. A distribution series is an ordered distribution of population units into groups according to the trait under study.

Distribution series are divided into attributive and variational, depending on the feature underlying the grouping. If the feature is qualitative, the distribution series is called attributive. An example of an attributive series is the distribution of enterprises and organizations by form of ownership (see Table 3.1).

If the attribute on which the distribution series is constructed is quantitative, then the series is called variational.

A variational distribution series always consists of two parts: the variants and their corresponding frequencies (or relative frequencies). A variant is a value that the feature can take in units of the population; a frequency is the number of units of observation that have a given value of the feature. The sum of the frequencies is always equal to the size of the population. Sometimes, instead of frequencies, relative frequencies are calculated: these are frequencies expressed either as fractions of a unit (then the sum of all relative frequencies is equal to 1) or as a percentage of the population size (then the sum is equal to 100%).

Variational series are either discrete or interval. In discrete series (Table 3.7), the variants are expressed as specific numbers, most often integers.

Table 3.8. Distribution of employees by length of service in an insurance company

Length of service in the company, full years (variants)	Number of employees
	persons (frequencies)	in % of total (relative frequencies)
less than a year	15	11,6
1	17	13,2
2	19	14,7
3	26	20,2
4	10	7,8
5	18	13,9
6	24	18,6
Total	129	100,0
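The conversion from frequencies to relative frequencies can be sketched in a few lines of Python. This is only an illustration using the frequency column of the table above; the variable names are chosen here and do not come from the source. Note that individually rounded percentages need not add up to exactly 100.0, so a printed value may differ from the table in the last digit.

```python
# Frequencies from the table above (number of employees per full year of service).
frequencies = [15, 17, 19, 26, 10, 18, 24]

total = sum(frequencies)  # size of the population

# Relative frequencies as fractions of a unit and as percentages of the total.
shares = [f / total for f in frequencies]
percents = [100 * s for s in shares]

for f, p in zip(frequencies, percents):
    print(f"{f:3d}  {p:5.1f}%")
print("total:", total)
```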

In an interval series (see Table 3.2), the values of the indicator are given as intervals. Each interval has two boundaries: lower and upper. Intervals can be open or closed. Open intervals lack one of the boundaries; thus, in Table 3.2 the first interval has no lower bound, and the last has no upper bound. When constructing an interval series, depending on the spread of the feature values, both equal and unequal intervals are used (Table 3.2 shows a variation series with equal intervals).

If the feature takes a limited number of values, usually no more than 10, a discrete distribution series is built. If the number of variants is larger, the discrete series loses its clarity; in this case, it is advisable to use the interval form of the variational series. With continuous variation of a feature, when its values within certain limits can differ from each other by an arbitrarily small amount, an interval distribution series is also built.

3.3.1. Construction of discrete variational series

Consider the technique for constructing discrete variational series using an example.

Example 3.2. The following data on the quantitative composition of 60 families are available:

In order to get an idea of the distribution of families according to the number of their members, a variational series should be constructed. Since the attribute takes a limited number of integer values, we construct a discrete variational series. To do this, it is first recommended to write out all the values of the attribute (the number of members in the family) in ascending order (i.e., to rank the statistical data):

Then you need to count the number of families with the same composition. The numbers of family members (the values of the varying feature) are the variants (we denote them by x); the numbers of families with the same composition are the frequencies (we denote them by f). We present the grouping results as the following discrete variational distribution series:

Table 3.11.

Number of family members (x)	Number of families (f)
1	8
2	14
3	20
4	9
5	5
6	4
Total	60

3.3.2. Construction of interval variation series

Let us show the method of constructing interval variational distribution series using the following example.

Example 3.3. As a result of statistical observation, the following data were obtained on the average interest rate of 50 commercial banks (%):

Table 3.12.

14,7	19,0	24,5	20,8	12,3	24,6	17,0	14,2	19,7	18,8
18,1	20,5	21,0	20,7	20,4	14,7	25,1	22,7	19,0	19,6
19,0	18,9	17,4	20,0	13,8	25,6	13,0	19,0	18,7	21,1
13,3	20,7	15,2	19,9	21,9	16,0	16,9	15,3	21,4	20,4
12,8	20,8	14,3	18,0	15,1	23,8	18,5	14,4	14,4	21,0

As you can see, such an array of data is extremely inconvenient to read, and no patterns in the indicator are visible. Let's construct an interval distribution series.

Let's define the number of intervals.
In practice, the number of intervals is often set by the researcher based on the objectives of the particular observation. However, it can also be calculated using Sturges' formula

n = 1 + 3.322 lg N,

where n is the number of intervals;

N is the volume of the population (the number of units of observation).

For our example, we get: n = 1 + 3.322 lg N = 1 + 3.322 lg 50 = 6.6 ≈ 7.
Let us determine the width of the intervals (i) by the formula

i = (x_max - x_min) / n,

where x_max is the maximum value of the feature;

x_min is the minimum value of the feature.

For our example, i = (25.6 - 12.3) / 7 = 1.9.
The intervals of a variational series are easier to read if their boundaries have "round" values, so we round the interval width 1.9 to 2, and the minimum value of the feature 12.3 to 12.0.
Let us define the boundaries of the intervals.
Intervals, as a rule, are written in such a way that the upper limit of one interval is simultaneously the lower limit of the next interval. So, for our example, we get: 12.0-14.0; 14.0-16.0; 16.0-18.0; 18.0-20.0; 20.0-22.0; 22.0-24.0; 24.0-26.0.
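The calculation of the number of intervals, the interval width and the boundaries can be retraced with a short Python sketch. It assumes the extreme values 12.3 and 25.6 from Table 3.12; the rounded start (12.0) and width (2.0) are entered by hand, because rounding to convenient values is the researcher's decision rather than a computed result. All variable names are illustrative.

```python
import math

N = 50                      # number of banks in Example 3.3
x_min, x_max = 12.3, 25.6   # smallest and largest rate in Table 3.12

n_exact = 1 + 3.322 * math.log10(N)   # Sturges' formula
n = round(n_exact)                    # number of intervals
i_exact = (x_max - x_min) / n         # interval width before rounding

# "Round" values chosen by the researcher (see the text above).
start, width = 12.0, 2.0

# Boundaries: the upper limit of one interval is the lower limit of the next.
boundaries = [start + k * width for k in range(n + 1)]
intervals = list(zip(boundaries[:-1], boundaries[1:]))

print(f"n = {n_exact:.1f} -> {n}, i = {i_exact:.1f}")
for low, high in intervals:
    print(f"{low:.1f}-{high:.1f}")
```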

Such a record means that the feature is continuous. If the values of the feature are strictly defined, for example, only integers, but there are too many of them to build a discrete series, then you can create an interval series in which the upper limit of one interval does not coincide with the lower limit of the next (this indicates that the feature is discrete). For example, in the distribution of employees of an enterprise by age, you can create the following interval groups, in years: 18-25, 26-33, 34-41, 42-49, 50-57, 58-65, 66 and more.

Also, in our example, we could make the first and last intervals open, i.e. write: up to 14.0; 24.0 and above.

Based on the initial data, we construct a ranked series. To do this, we write the values that the feature takes in ascending order. The results are presented in the table.

Table 3.13. Ranked series of interest rates of commercial banks

Bank rate, % (variants)
12,3	17,0	19,9	23,8
12,8	17,4	20,0	24,5
13,0	18,0	20,0	24,6
13,3	18,1	20,4	25,1
13,8	18,5	20,4	25,6
14,2	18,7	20,5
14,3	18,8	20,7
14,4	18,9	20,7
14,7	19,0	20,8
14,7	19,0	21,0
15,1	19,0	21,0
15,2	19,0	21,1
15,3	19,0	21,4
16,0	19,6	21,9
16,9	19,7	22,7

Let's calculate the frequencies.
When counting frequencies, a value of the feature may fall on the border of an interval. In this case, you can follow the rule: the unit is assigned to the interval for which its value is the upper limit. So, the value 16.0 in our example belongs to the second interval.

The grouping results obtained in our example will be presented in a table.

Table 3.14. Distribution of commercial banks by lending rate

Lending rate, %	Number of banks, units (frequencies)	Accumulated frequencies
12,0-14,0	5	5
14,0-16,0	9	14
16,0-18,0	4	18
18,0-20,0	15	33
20,0-22,0	11	44
22,0-24,0	2	46
24,0-26,0	4	50
Total	50	-

The last column of the table presents the accumulated frequencies, which are obtained by successive summation of frequencies, starting from the first (for example, for the first interval: 5, for the second interval: 5 + 9 = 14, for the third interval: 5 + 9 + 4 = 18, etc.). The accumulated frequency of 33, for example, shows that 33 banks have a lending rate that does not exceed 20% (the upper limit of the corresponding interval).
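The counting of frequencies and accumulated frequencies can be reproduced with the following minimal sketch. It takes the ranked values of Table 3.13 as input and applies the rule stated above (a value on a boundary goes to the interval for which it is the upper limit); using `bisect_left` on the list of upper limits is one possible way to express that rule, not necessarily how the original table was prepared.

```python
from bisect import bisect_left
from itertools import accumulate

# Ranked values from Table 3.13 (bank rates, %).
rates = [12.3, 12.8, 13.0, 13.3, 13.8, 14.2, 14.3, 14.4, 14.7, 14.7,
         15.1, 15.2, 15.3, 16.0, 16.9, 17.0, 17.4, 18.0, 18.1, 18.5,
         18.7, 18.8, 18.9, 19.0, 19.0, 19.0, 19.0, 19.0, 19.6, 19.7,
         19.9, 20.0, 20.0, 20.4, 20.4, 20.5, 20.7, 20.7, 20.8, 21.0,
         21.0, 21.1, 21.4, 21.9, 22.7, 23.8, 24.5, 24.6, 25.1, 25.6]

# Upper limits of the intervals 12.0-14.0, 14.0-16.0, ..., 24.0-26.0.
upper_limits = [14.0, 16.0, 18.0, 20.0, 22.0, 24.0, 26.0]

# bisect_left returns the index of the first upper limit >= value,
# so a value equal to an upper limit stays in that interval.
counts = [0] * len(upper_limits)
for r in rates:
    counts[bisect_left(upper_limits, r)] += 1

# Accumulated frequencies: running sum of the frequencies.
cumulative = list(accumulate(counts))

for hi, f, c in zip(upper_limits, counts, cumulative):
    print(f"{hi - 2.0:.1f}-{hi:.1f}  {f:2d}  {c:2d}")
```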

In the process of grouping data when constructing variational series, unequal intervals are sometimes used. This applies to cases where the feature values follow an arithmetic or geometric progression, or when the application of Sturges' formula leads to "empty" interval groups that do not contain a single observation unit. Then the boundaries of the intervals are set by the researcher, based on common sense and the objectives of the survey, or according to formulas. So, for data that changes in an arithmetic progression, the size of the intervals is calculated as follows.

A discrete variational series is constructed for discrete features.

In order to build a discrete variation series, you need to do the following:

1) order the units of observation in ascending order of the studied feature value;

2) determine all possible values of the feature x_i and sort them in ascending order;

3) count how many times each value occurs, that is, the frequency f_i of each feature value;

4) write the resulting data into a table of two rows (columns), x_i and f_i.

The sum of all frequencies of the series is equal to the number of elements in the studied population.

Example 1 .

List of grades obtained by students in exams: 3; 4; 3; 5; 4; 2; 2; 4; 4; 3; 5; 2; 4; 5; 4; 3; 4; 3; 3; 4; 4; 2; 2; 5; 5; 4; 5; 2; 3; 4; 4; 3; 4; 5; 2; 5; 5; 4; 3; 3; 4; 2; 4; 4; 5; 4; 3; 5; 3; 5; 4; 4; 5; 4; 4; 5; 4; 5; 5; 5.

Here the grade X is a discrete random variable, and the resulting list of grades is the statistical (observed) data.

1) order the units of observation in ascending order of the studied feature value:

2; 2; 2; 2; 2; 2; 2; 2; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5.

2) determine all possible values of the attribute x i , sort them in ascending order:

In this example, all grades can be divided into four groups with the following values: 2; 3; 4; 5.

The value of a random variable corresponding to a separate group of observed data is called the feature value, or variant, and is denoted x_i.

3) count how often each value occurs. The number that shows how many times the corresponding feature value occurs in a series of observations is called the frequency of the feature value and is denoted f_i.

For our example

grade 2 occurs 8 times,

grade 3 occurs 12 times,

grade 4 occurs 23 times,

grade 5 occurs 17 times.

There are 60 ratings in total.

4) write the resulting data into a table of two rows (columns): x_i and f_i.

Based on these data, it is possible to construct a discrete variational series.

A discrete variation series is a table in which the occurring values of the studied feature are listed as separate values in ascending order, together with their frequencies.
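The four steps above can be sketched in Python with the list of 60 grades from this example. `collections.Counter` does the counting; sorting its items yields the variants x_i in ascending order together with their frequencies f_i. The variable names are illustrative.

```python
from collections import Counter

# The 60 exam grades from the example, in the order they were observed.
grades = [3, 4, 3, 5, 4, 2, 2, 4, 4, 3, 5, 2, 4, 5, 4, 3, 4, 3, 3, 4,
          4, 2, 2, 5, 5, 4, 5, 2, 3, 4, 4, 3, 4, 5, 2, 5, 5, 4, 3, 3,
          4, 2, 4, 4, 5, 4, 3, 5, 3, 5, 4, 4, 5, 4, 4, 5, 4, 5, 5, 5]

ranked = sorted(grades)                    # step 1: ranked series
series = sorted(Counter(grades).items())   # steps 2-3: (x_i, f_i) pairs

# Step 4: print the discrete variation series as a two-column table.
print("x_i  f_i")
for x, f in series:
    print(f"{x:3d}  {f:3d}")
print("sum of frequencies:", sum(f for _, f in series))
```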

Construction of an interval variation series

In addition to a discrete variational series, there is often such a way of grouping data as an interval variational series.

An interval series is built if:

the feature varies continuously;

there are a lot of discrete values (more than 10);

the frequencies of discrete values are very small (do not exceed 1-3 with a relatively large number of units of observation);

there are many discrete values of the feature with the same frequencies.

An interval variation series is a way of grouping data in the form of a table that has two columns (feature values in the form of an interval of values and the frequency of each interval).

Unlike a discrete series, the values of the characteristic of an interval series are not represented by individual values, but by an interval of values ("from - to").

The number that shows how many observation units fall into each interval is called the interval frequency and is denoted f_i. The sum of all frequencies of the series is equal to the number of elements (observation units) in the studied population.

If a unit has a feature value equal to the value of the upper limit of the interval, then it should be referred to the next interval.

For example, a child with a height of 100 cm will fall into the 2nd interval, and not into the first; and a child with a height of 130 cm will fall into the last interval, and not into the third.
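This boundary rule (a value equal to an upper limit moves to the next interval) can be sketched as follows. The inner boundaries 100, 110 and 130 cm are an assumption made for illustration, chosen only to be compatible with the height example; the function name `interval_index` is hypothetical.

```python
from bisect import bisect_right

# Assumed inner boundaries (cm): under 100, 100-110, 110-130, 130 and over.
inner_boundaries = [100, 110, 130]

def interval_index(height):
    """Return the 0-based interval number for a height.

    bisect_right counts the boundaries that are <= height, so a value
    equal to a boundary is placed in the interval that starts there.
    """
    return bisect_right(inner_boundaries, height)

for h in (99, 100, 129, 130):
    print(h, "-> interval", interval_index(h) + 1)
```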

Based on these data, it is possible to construct an interval variation series.

Each interval has a lower limit (x_lower), an upper limit (x_upper) and an interval width (i).

An interval boundary is a feature value that lies on the border of two intervals.

Children's height (cm)	Number of children
over 130

If an interval has both an upper and a lower bound, it is called a closed interval. If the interval has only a lower or only an upper bound, it is an open interval. Only the very first or the very last interval can be open. In the above example, the last interval is open.

The interval width (i) is the difference between the upper and lower bounds:

i = x_upper - x_lower

The width of an open interval is assumed to be the same as the width of an adjacent closed interval.

Children's height (cm)	Number of children	Interval width (i)
over 130 (for calculations, 130 + 20 = 150)		20 (because the width of the adjacent closed interval is 20)

All interval series are divided into series with equal intervals and series with unequal intervals. In interval series with equal intervals, the width of all intervals is the same. In interval series with unequal intervals, the widths of the intervals differ.

The example above is an interval series with unequal intervals.

Problem:

There is data on the age composition of workers (years): 18, 38, 28, 29, 26, 38, 34, 22, 28, 30, 22, 23, 35, 33, 27, 24, 30, 32, 28, 25, 29, 26, 31, 24, 29, 27, 32, 25, 29, 29.

1. Build an interval distribution series.
2. Build a graphic representation of the series.
3. Graphically determine the mode and median.

Solution:

1) According to Sturges' formula, the population should be divided into 1 + 3.322 lg 30 ≈ 6 groups.

The maximum age is 38, the minimum is 18.

Interval width: (38 - 18) / 6 ≈ 3.3. Since the ends of the intervals must be integers, we divide the population into 5 groups instead, with an interval width of (38 - 18) / 5 = 4.
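The choice of 5 groups can be retraced with a small sketch. The helper `nearest_k_with_integer_width` is hypothetical: it assumes that the intended rule is "take the number of groups closest to the Sturges value for which the range divides evenly", which reproduces the choice made here but is only one way to formalize it.

```python
import math

def sturges(n):
    """Number of intervals suggested by Sturges' formula (not rounded)."""
    return 1 + 3.322 * math.log10(n)

def nearest_k_with_integer_width(n, x_min, x_max):
    """Pick the group count closest to the Sturges value that gives an
    integer interval width. Assumes integer x_min and x_max."""
    span = x_max - x_min
    target = sturges(n)
    candidates = [k for k in range(2, span + 1) if span % k == 0]
    return min(candidates, key=lambda k: abs(k - target))

n, x_min, x_max = 30, 18, 38          # sample size and extreme ages
k = nearest_k_with_integer_width(n, x_min, x_max)
width = (x_max - x_min) // k

print(f"Sturges: {sturges(n):.1f}, chosen k = {k}, width = {width}")
```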

To facilitate the calculations, let's arrange the data in ascending order: 18, 22, 22, 23, 24, 24, 25, 25, 26, 26, 27, 27, 28, 28, 28, 29, 29, 29, 29, 29, 30, 30, 31, 32, 32, 33, 34, 35, 38, 38.

Age distribution of workers

Graphically, a series can be displayed as a histogram or a polygon. A histogram is a bar chart: the base of each bar is the width of the interval, and the height of the bar is equal to the frequency.

A polygon (or distribution polygon) is a graph of frequencies. To build it according to the histogram, we connect the midpoints of the upper sides of the rectangles. We close the polygon on the x-axis at distances equal to half the interval from the extreme x values.
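The vertices of such a polygon can be computed without any plotting library. As an illustration, the sketch below uses the boundaries and frequencies of Table 3.14 (the bank example) rather than the age data; the resulting list of points could then be passed to any charting tool.

```python
# Interval boundaries and frequencies from Table 3.14.
boundaries = [12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0, 26.0]
frequencies = [5, 9, 4, 15, 11, 2, 4]

# Midpoints of the upper sides of the histogram rectangles.
midpoints = [(a + b) / 2 for a, b in zip(boundaries[:-1], boundaries[1:])]

# Half of the interval width, used to close the polygon on the x-axis.
half = (boundaries[1] - boundaries[0]) / 2

polygon = (
    [(boundaries[0] - half, 0)]             # closing point on the left
    + list(zip(midpoints, frequencies))     # one vertex per interval
    + [(boundaries[-1] + half, 0)]          # closing point on the right
)

for x, y in polygon:
    print(f"({x:.1f}, {y})")
```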

Mode (Mo) is the value of the trait under study, which occurs most frequently in a given population.

To determine the mode from the histogram, select the highest rectangle, draw a line from its upper right vertex to the upper right corner of the previous rectangle, and draw a line from the upper left vertex of the modal rectangle to the upper left vertex of the next rectangle. From the point of intersection of these lines, drop a perpendicular to the x-axis. Its abscissa is the mode. Mo ≈ 27.5. This means that the most common age in this population is 27-28 years.

The median (Me) is the value of the trait under study, which is in the middle of an ordered variation series.

We find the median from the cumulate. The cumulate is the graph of accumulated frequencies: the abscissas are the variants of the series, and the ordinates are the accumulated frequencies.

To determine the median from the cumulate, we find on the ordinate axis the point corresponding to 50% of the accumulated frequencies (in our case, 15), draw a straight line through it parallel to the x-axis, and drop a perpendicular to the x-axis from the point of its intersection with the cumulate. The abscissa is the median. Me ≈ 25.9. This means that half of the workers in this population are under 26 years of age.
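The two graphical constructions have simple arithmetic counterparts: intersecting the two diagonals in the modal rectangle, and linearly interpolating the cumulate at half of the total frequency. The sketch below applies them to the frequencies of Table 3.14 (the bank example), not to the age data, and assumes equal interval widths; it illustrates the idea and is not the method used to obtain the values quoted above.

```python
from itertools import accumulate

# Lower limits, common width and frequencies from Table 3.14.
lower = [12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0]
width = 2.0
freq = [5, 9, 4, 15, 11, 2, 4]

# Mode: intersection of the two diagonals drawn in the modal rectangle.
m = freq.index(max(freq))
f_prev = freq[m - 1] if m > 0 else 0
f_next = freq[m + 1] if m < len(freq) - 1 else 0
d_prev, d_next = freq[m] - f_prev, freq[m] - f_next
mode = lower[m] + width * d_prev / (d_prev + d_next)

# Median: point where the cumulate reaches half of the total frequency.
cum = list(accumulate(freq))
half = sum(freq) / 2
j = next(k for k, c in enumerate(cum) if c >= half)
before = cum[j - 1] if j > 0 else 0
median = lower[j] + width * (half - before) / freq[j]

print(f"Mo = {mode:.2f}, Me = {median:.2f}")
```

For the bank data this gives a mode of about 19.47% and a median of about 18.93%, both inside the 18.0-20.0 interval.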

Laboratory work №1. Primary processing of statistical data

Construction of distribution series

The ordered distribution of population units into groups according to a single feature is called a distribution series. The feature can be quantitative, in which case the series is called variational, or qualitative, in which case the series is called attributive. So, for example, the population of a city can be distributed by age group in a variational series, or by profession in an attributive series (of course, many more qualitative and quantitative features can be proposed for constructing distribution series; the choice of feature is determined by the task of the statistical research).

Any distribution series is characterized by two elements:

- variant (x_i): the individual values of the feature for units of the sample population. For a variational series, the variants take numerical values; for an attributive series, qualitative ones (for example, x = "civil servant");

- frequency (n_i): a number showing how many times a given feature value occurs. If the frequency is expressed as a relative number (i.e., the proportion of population elements with a given variant value in the total size of the population), it is called the relative frequency.

Variation series can be:

- discrete, when the feature under study is characterized by a specific number (usually an integer);

- interval, when the boundaries "from" and "to" are defined for a continuously varying feature. An interval series is also built if the set of values of a discretely varying feature is large.

An interval series can be constructed either with intervals of equal length (an equal-interval series) or with unequal intervals, if this is dictated by the conditions of the statistical study. For example, a series of income distribution of the population can be considered with the following intervals: under 5 thousand rubles, 5-10 thousand rubles, 10-20 thousand rubles, 20-50 thousand rubles, etc. If the purpose of the study does not determine how the interval series is constructed, an equal-interval series is built, in which the number of intervals is determined by Sturges' formula:

k = 1 + 3.322 lg n,

where k is the number of intervals and n is the sample size. (The formula usually gives a fractional number, and the nearest integer is chosen as the number of intervals.) The length of the interval in this case is determined by the formula

d = (x_max - x_min) / k.

Graphically, variational series can be represented as histograms (a column of height corresponding to the frequency in the interval is built above each interval of the interval series), distribution polygons (a broken line connecting the points (x_i; n_i)) or cumulates (built from the accumulated frequencies, i.e. for each value of the feature, the frequency of objects in the population with a feature value less than the given one is taken).

When working in Excel, the following functions can be used to build variational series:

COUNT(data array) - to determine the sample size. The argument is the range of cells that contains the sample data.

COUNTIF(range; criterion) - can be used to build an attributive or variation series. The arguments are the range containing the sample values of the feature and the criterion: the numeric or text value of the feature, or the address of the cell in which it is located. The result is the frequency of occurrence of that value in the sample.

FREQUENCY(data array; interval array) - to build a variational series. The arguments are the range of the sample data and the column of intervals. If a discrete series is required, the values of the variants are given here; if an interval series is required, the upper boundaries of the intervals (they are also called "pockets"). Since the result is a column of frequencies, entry of the function must be completed by pressing the CTRL+SHIFT+ENTER key combination. Note that when setting the array of intervals, the last value in it can be omitted: all values that did not fall into the previous "pockets" will be placed in the last "pocket". This sometimes helps to avoid the error in which the largest sample value is not automatically placed in the last "pocket".
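The counting rule of FREQUENCY can be imitated outside the spreadsheet. The sketch below is a simplified stand-in written for illustration, not Excel's implementation: it assumes numeric data and ascending "pocket" boundaries, counts each value in the first pocket whose upper boundary is not smaller than the value, and adds one extra element for values above the last boundary. The sample data are made up.

```python
from bisect import bisect_left

def frequency(data, bins):
    """Simplified imitation of the FREQUENCY counting rule.

    bins must be sorted in ascending order. The result has
    len(bins) + 1 elements; the last one counts values above bins[-1].
    """
    counts = [0] * (len(bins) + 1)
    for value in data:
        counts[bisect_left(bins, value)] += 1
    return counts

sample = [1, 2, 2, 3, 5, 7]   # made-up data, for illustration only
pockets = [2, 4, 6]           # upper boundaries of the pockets

print(frequency(sample, pockets))
```

With the last boundary omitted from `pockets`, the extra element collects everything above the remaining boundaries, which is the behaviour the note above relies on.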

In addition, for complex groupings (by several criteria), the PivotTable tool is used. Pivot tables can also be used to build attributive and variation series, but this unnecessarily complicates the task. Also, to build a variation series and a histogram, there is the "Histogram" tool from the "Analysis ToolPak" add-in (to use add-ins in Excel, you must first enable them; they are not loaded by default).

We illustrate the process of primary data processing with the following examples.

Example 1.1. There are data on the quantitative composition of 60 families.

Build a variation series and a distribution polygon.

Solution.

Let's open Excel and enter the data array in the range A1:L5. If you are studying the document in electronic form (in Word format, for example), all you need to do is select the table with the data and copy it to the clipboard, then select cell A1 and paste the data: they will automatically occupy the appropriate range. Let's calculate the sample size n, the number of sample data; for this, in cell B7, enter the formula =COUNT(A1:L5). Note that to enter the desired range into the formula, it is not necessary to type its designation from the keyboard; it is enough to select it. Let's determine the minimum and maximum values in the sample by entering the formula =MIN(A1:L5) into cell B8 and the formula =MAX(A1:L5) into cell B9.

Fig.1.1 Example 1. Primary processing of statistical data in Excel tables

Next, let's prepare a table for building the variation series by entering names for the interval column (variant values) and the frequency column. In the column of intervals, enter the values of the feature from the minimum (1) to the maximum (6), occupying the range B12:B17. Select the cells of the frequency column opposite these values, enter the formula =FREQUENCY(A1:L5;B12:B17) and press the key combination CTRL+SHIFT+ENTER.

Fig.1.2 Example 1. Construction of a variation series

As a check, we calculate the sum of frequencies using the SUM function (the Σ icon in the Editing group on the Home tab); the calculated sum must match the sample size previously calculated in cell B7.

Now let's build a polygon: having selected the resulting frequency range, select the "Line" chart ("Graph") command on the "Insert" tab. By default, the values on the horizontal axis will be ordinal numbers, in our case from 1 to 6, which coincides with the values of the variants (the numbers of family members).

The chart's series name, "Series 1", can either be changed using the "Select Data" option on the "Design" tab or simply deleted.

Fig.1.3. Example 1. Building a frequency polygon

Example 1.2. Data are available on pollutant emissions from 50 sources:

10,4	18,6	10,3	26,0	45,0	18,2	17,3	19,2	25,8	18,7
28,2	25,2	18,4	17,5	41,8	14,6	10,0	37,8	10,5	16,0
18,1	16,8	38,5	37,7	17,9	29,0	10,1	28,0	12,0	14,0
14,2	20,8	13,5	42,4	15,5	17,9	19,	10,8	12,1	12,4
12,9	12,6	16,8	19,7	18,3	36,8	15,0	37,0	13,0	19,5

Compile an equal-interval series and build a histogram.

Solution

Let's add the data array to an Excel sheet; it will occupy the range A1:J5. As in the previous task, we determine the sample size n and the minimum and maximum values in the sample. Since we now need an interval series rather than a discrete one, and the number of intervals is not specified in the problem, we calculate the number of intervals k using Sturges' formula. To do this, in cell B10, enter the formula =1+3.322*LOG10(B7).

Fig.1.4. Example 2. Construction of an equal interval series

The resulting value is not an integer, it is approximately 6.64. Since for k=7 the length of the intervals will be expressed as an integer (in contrast to the case of k=6), we will choose k=7 by entering this value in cell C10. We calculate the length of the interval d in cell B11 by entering the formula = (B9-B8) / C10.

Let's define an array of intervals, specifying the upper bound for each of the 7 intervals. To do this, in cell E8, calculate the upper limit of the first interval by entering the formula =B8+B11; in cell E9, calculate the upper limit of the second interval by entering the formula =E8+B11. To calculate the remaining upper limits, we fix the reference to cell B11 in the formula using the $ sign, so that the formula in cell E9 becomes =E8+B$11, and copy the contents of cell E9 to cells E10:E14. The last value obtained is equal to the maximum value in the sample calculated earlier in cell B9.
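The same chain of cell formulas can be mirrored in a few lines of Python, which makes it easy to check that the last upper limit coincides with the sample maximum. The sketch assumes the minimum 10.0 and maximum 45.0 found in the data table of Example 1.2 and the chosen k = 7; the comments map each variable to the corresponding spreadsheet cell.

```python
x_min, x_max = 10.0, 45.0    # cells B8 and B9
k = 7                        # cell C10, chosen number of intervals
d = (x_max - x_min) / k      # cell B11, interval length

# Cells E8:E14: each upper limit is the previous one plus d.
upper_limits = []
bound = x_min
for _ in range(k):
    bound += d
    upper_limits.append(bound)

print("d =", d)
print(upper_limits)
```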

Fig.1.5. Example 2. Construction of an equal interval series

Now let's fill the array of "pockets" using the FREQUENCY function, as was done in example 1.

Fig.1.6. Example 2. Construction of an equal interval series

Based on the resulting variational series, we build a histogram: select the frequency column and select "Histogram" on the "Insert" tab. Having obtained the histogram, we change the labels of the horizontal axis to the values in the range of intervals; for this, we choose the "Select Data" option on the "Design" tab. In the window that appears, select the "Edit" command in the "Horizontal axis labels" section and enter the range of variant values by selecting it with the mouse.

Fig.1.7. Example 2. Building a histogram

Fig.1.8. Example 2. Building a histogram

Lab #1

in mathematical statistics

Topic: Primary processing of experimental data

3. Score in points

5. Control questions

6. Methodology for performing laboratory work

Purpose of the work

Acquisition of skills of primary processing of empirical data by methods of mathematical statistics.

On the basis of a set of experimental data, perform the following tasks:

Task 1. Construct an interval variation series of distribution.

Task 2. Construct a histogram of the frequencies of the interval variation series.

Task 3. Compose an empirical distribution function and plot it.

Task 4. Calculate the numerical characteristics of the sample:

a) mode and median;

b) conditional initial moments;

c) sample mean;

d) sample variance, corrected population variance, corrected standard deviation;

e) coefficient of variation;

f) skewness;

g) kurtosis.

Task 5. Determine the boundaries of the true values of the numerical characteristics of the random variable under study with a given reliability.

Task 6. Meaningful interpretation of the results of primary processing according to the condition of the problem.

Score in points

Tasks 1-5 – 6 points

Task 6 – 2 points

Lab defense (oral interview on control questions and laboratory work) - 2 points

The work is submitted in writing on A4 sheets and includes:

1) Title page (Appendix 1)

2) Initial data.

3) Presentation of work according to the specified sample.

4) Calculation results (performed manually and/or using MS Excel) in the specified order.

5) Conclusions - a meaningful interpretation of the results of primary processing according to the condition of the problem.

6) Oral interview on work and control questions.

5. Control questions

Methodology for performing laboratory work

Task 1. Construct an interval variation series of distribution

In order to present statistical data in the form of a variational series with equally spaced variants, it is necessary:

1. In the original data table, find the smallest and largest values.

2. Determine the range of variation: R = x_max - x_min.

3. Determine the length of the interval h. If there are up to 1000 data points in the sample, use the formula h = R / (1 + 3.322 lg n), where n is the sample size, i.e. the amount of data in the sample (the decimal logarithm lg n is taken for calculations).

The calculated ratio is rounded to a convenient integer value.

4. To determine the beginning of the first interval for an even number of intervals, it is recommended to take the value ; and for an odd number of intervals .

5. Record the grouping intervals and arrange them in ascending order of boundaries:

a_0 - a_1, a_1 - a_2, ..., a_(k-1) - a_k,

where a_0 is the lower bound of the first interval. A convenient number no greater than x_min is taken for a_0; the upper limit of the last interval, a_k, must be no less than x_max. It is recommended that the intervals cover all the initial values of the random variable and that there be from 5 to 20 intervals.

6. Distribute the initial data over the grouping intervals, i.e. count from the original table the number of values of the random variable that fall within each interval. If some values coincide with the boundaries of the intervals, then they are attributed either only to the previous or only to the subsequent interval.

Remark 1. The intervals need not be equal in length. In areas where the values are denser, it is more convenient to take shorter intervals, and where they are sparser, longer ones.

Remark 2. If "zero" or small frequencies are obtained for some intervals, then it is necessary to regroup the data, enlarging the intervals (increasing the step h).
