Discrete series online. Construction of an interval variation series for continuous quantitative data

Laboratory work №1. Primary processing of statistical data

Construction of distribution series

An ordered distribution of population units into groups according to a single attribute is called a distribution series. The attribute can be quantitative, in which case the series is called a variation series, or qualitative, in which case it is called an attribute series. For example, the population of a city can be distributed by age group in a variation series, or by profession in an attribute series. (Many other qualitative and quantitative attributes could be used to construct distribution series; the choice of attribute is determined by the task of the statistical study.)

Any distribution series is characterized by two elements:

- variant ($x_i$): an individual value of the attribute for units of the sample population. For a variation series the variant takes numerical values, for an attribute series qualitative ones (for example, x = "civil servant");

- frequency ($n_i$): a number showing how many times a given attribute value occurs. If the frequency is expressed as a relative number (that is, the share of population elements with the given variant value in the total size of the population), it is called the relative frequency.

Variation series can be:

- discrete, when the attribute under study takes separate numerical values (usually integers);

- interval, when boundaries "from" and "to" are defined for a continuously varying attribute. An interval series is also built when a discrete attribute takes a large number of distinct values.

An interval series can be built with intervals of equal length (an equal-interval series) or with unequal intervals, if the conditions of the statistical study require it. For example, an income distribution series might use the following intervals: <5 thousand rubles, 5-10 thousand rubles, 10-20 thousand rubles, 20-50 thousand rubles, and so on. If the purpose of the study does not dictate how the interval series should be built, an equal-interval series is constructed, with the number of intervals determined by Sturges' formula:

$$k = 1 + 3.322 \lg n,$$

where k is the number of intervals and n is the sample size. (The formula usually gives a fractional number, so the nearest integer is taken as the number of intervals.) The length of the interval d is then determined by the formula

$$d = \frac{x_{\max} - x_{\min}}{k}.$$
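As a rough illustration of the two formulas, here is a minimal Python sketch. The function names `sturges_intervals` and `interval_length` are hypothetical, and the sample values (n = 50, minimum 10.0, maximum 45.0) are the ones that appear in Example 1.2.

```python
import math


def sturges_intervals(n: int) -> int:
    """Number of intervals by Sturges' formula, rounded to the nearest integer."""
    return round(1 + 3.322 * math.log10(n))


def interval_length(x_min: float, x_max: float, k: int) -> float:
    """Length of each interval in an equal-interval series."""
    return (x_max - x_min) / k


k = sturges_intervals(50)           # 1 + 3.322 * lg(50) is about 6.64, so k = 7
d = interval_length(10.0, 45.0, k)  # (45 - 10) / 7 = 5.0
print(k, d)
```

Rounding to the nearest integer is only the default rule; a neighbouring value of k may be preferred when it gives a more convenient interval length.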

Graphically, variation series can be represented as a histogram (a bar whose height corresponds to the frequency is drawn above each interval of the interval series), a distribution polygon (a broken line connecting the points ($x_i$; $n_i$)) or a cumulate, that is, a cumulative frequency curve (built from the accumulated frequencies: for each attribute value, the frequency taken is the number of objects in the population whose attribute value is less than the given one).

When working in Excel, the following functions can be used to build variational series:

COUNT(data array) - to determine the sample size. The argument is the range of cells that contains the sample data.

COUNTIF(range; criterion) - can be used to build an attribute or variation series. The arguments are the range containing the sample values of the attribute and the criterion: a numeric or text value of the attribute, or the address of the cell that contains it. The result is the frequency of that value in the sample.

FREQUENCY(data array; interval array) - to build a variation series. The arguments are the range containing the sample data and the column of intervals. For a discrete series, the variant values are given here; for an interval series, the upper boundaries of the intervals (also called "pockets", or bins). Since the result is a column of frequencies, the function must be entered as an array formula by pressing CTRL+SHIFT+ENTER. Note that the last value in the interval array can be omitted: all values that did not fall into the previous "pockets" are then placed in an additional final "pocket". This sometimes helps to avoid the error in which the largest sample value is not placed in the last "pocket".
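To make the counting rule concrete, the following Python sketch mimics how FREQUENCY assigns values to "pockets": each value goes to the first pocket whose upper boundary is greater than or equal to it, and one extra count at the end collects values above the last boundary. The function name `frequency` and the sample numbers are hypothetical, and the sketch assumes the boundaries are sorted in ascending order.

```python
from bisect import bisect_left


def frequency(data, bins):
    """Count values v with bins[i-1] < v <= bins[i].

    The returned list has one more element than `bins`; the last element
    counts the values greater than the largest boundary.
    Assumes `bins` is sorted in ascending order.
    """
    counts = [0] * (len(bins) + 1)
    for v in data:
        # bisect_left finds the first boundary that is >= v
        counts[bisect_left(bins, v)] += 1
    return counts


sample = [1, 2, 2, 3, 5, 5, 6, 9]  # hypothetical sample values
print(frequency(sample, [2, 5]))   # counts for v <= 2, 2 < v <= 5, v > 5
```

The extra trailing count is what makes it possible to omit the last boundary from the interval array.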

In addition, for complex groupings (by several criteria), the PivotTable tool is used. PivotTables can also be used to build attribute and variation series, but this unnecessarily complicates the task. To build a variation series and a histogram there is also the Histogram procedure from the Analysis ToolPak add-in (add-ins must be loaded in Excel before use; they are not active by default).

We illustrate the process of primary data processing with the following examples.

Example 1.1. Data are available on the size of 60 families.

Build a variation series and a distribution polygon.

Solution.

Open Excel and enter the data array in the range A1:L5. If you are reading this document in electronic form (in Word format, for example), it is enough to select the table with the data, copy it to the clipboard, then select cell A1 and paste: the data will automatically occupy the appropriate range. Next, calculate the sample size n (the number of sample values) by entering the formula =COUNT(A1:L5) in cell B7. Note that the range does not have to be typed into the formula from the keyboard; it is enough to select it with the mouse. Determine the minimum and maximum values in the sample by entering the formula =MIN(A1:L5) in cell B8 and =MAX(A1:L5) in cell B9.

Fig.1.1 Example 1. Primary processing of statistical data in Excel tables

Next, prepare a table for building the variation series by entering headings for the interval column (variant values) and the frequency column. In the interval column, enter the attribute values from the minimum (1) to the maximum (6), occupying the range B12:B17. Select the frequency column, enter the formula =FREQUENCY(A1:L5;B12:B17) and press CTRL+SHIFT+ENTER.

Fig.1.2 Example 1. Construction of a variation series

As a check, calculate the sum of the frequencies using the SUM function (the Σ icon in the Editing group on the Home tab); the result must match the sample size calculated earlier in cell B7.

Now build the polygon: with the resulting frequency range selected, choose the "Graph" (line chart) command on the "Insert" tab. By default, the values on the horizontal axis are ordinal numbers, in our case from 1 to 6, which coincides with the variant values (family sizes).

The series name "Series 1" on the chart can either be changed using the "Select data" option on the "Design" tab, or simply deleted.

Fig.1.3. Example 1. Building a frequency polygon

Example 1.2. Data are available on pollutant emissions from 50 sources:

10,4 18,6 10,3 26,0 45,0 18,2 17,3 19,2 25,8 18,7
28,2 25,2 18,4 17,5 41,8 14,6 10,0 37,8 10,5 16,0
18,1 16,8 38,5 37,7 17,9 29,0 10,1 28,0 12,0 14,0
14,2 20,8 13,5 42,4 15,5 17,9 19, 10,8 12,1 12,4
12,9 12,6 16,8 19,7 18,3 36,8 15,0 37,0 13,0 19,5

Build an equal-interval series and a histogram.

Solution

Enter the data array on an Excel sheet; it will occupy the range A1:J5. As in the previous task, determine the sample size n and the minimum and maximum values in the sample. Since we now need an interval series rather than a discrete one, and the number of intervals is not specified in the problem, calculate the number of intervals k using Sturges' formula. To do this, enter the formula =1+3.322*LOG10(B7) in cell B10.

Fig.1.4. Example 2. Construction of an equal interval series

The resulting value is not an integer; it is approximately 6.64. Since for k=7 the length of the intervals is an integer (unlike the case k=6), choose k=7 and enter this value in cell C10. Calculate the length of the interval d in cell B11 by entering the formula =(B9-B8)/C10.

Now define the array of intervals by specifying the upper bound of each of the 7 intervals. In cell E8, calculate the upper bound of the first interval with the formula =B8+B11; in cell E9, the upper bound of the second interval with the formula =E8+B11. To calculate the remaining upper bounds, fix the reference to cell B11 in this formula using the $ sign, so that the formula in cell E9 becomes =E8+B$11, and copy the contents of cell E9 to cells E10-E14. The last value obtained equals the maximum value in the sample, calculated earlier in cell B9.
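The same array of upper bounds can be sketched in Python. The variable names are hypothetical; the numbers are the minimum (10.0), maximum (45.0) and k = 7 from this example.

```python
x_min, x_max, k = 10.0, 45.0, 7

# Length of each interval, as in cell B11
d = (x_max - x_min) / k

# Upper bound of interval i is x_min + i * d; multiplying (rather than
# repeatedly adding d) avoids accumulating rounding error
upper_bounds = [x_min + d * i for i in range(1, k + 1)]

print(upper_bounds)
# The last bound coincides with the sample maximum
print(upper_bounds[-1] == x_max)
```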

Fig.1.5. Example 2. Construction of an equal interval series


Now let's fill the array of "pockets" using the FREQUENCY function, as was done in example 1.

Fig.1.6. Example 2. Construction of an equal interval series

Based on the resulting variation series, build a histogram: select the frequency column and choose "Histogram" on the "Insert" tab. Once the histogram is created, change the labels of the horizontal axis to the values in the interval range: choose the "Select data" option on the "Design" tab, then in the window that appears click "Change" in the "Horizontal axis labels" section and specify the range of variant values by selecting it with the mouse.

Fig.1.7. Example 2. Building a histogram

Fig.1.8. Example 2. Building a histogram

A discrete variation series is constructed for discrete attributes.

To build a discrete variation series, do the following:

1) order the units of observation in ascending order of the studied attribute value;

2) determine all possible values of the attribute $x_i$ and sort them in ascending order;

3) count how many times each attribute value occurs, that is, the frequency of the attribute value, denoted $f_i$;

4) write the data obtained into a table of two rows (columns): $x_i$ and $f_i$.

The sum of all frequencies of the series is equal to the number of elements in the studied population.

Example 1.

List of grades obtained by students in exams: 3; 4; 3; 5; 4; 2; 2; 4; 4; 3; 5; 2; 4; 5; 4; 3; 4; 3; 3; 4; 4; 2; 2; 5; 5; 4; 5; 2; 3; 4; 4; 3; 4; 5; 2; 5; 5; 4; 3; 3; 4; 2; 4; 4; 5; 4; 3; 5; 3; 5; 4; 4; 5; 4; 4; 5; 4; 5; 5; 5.

Here the grade X is a discrete random variable, and the resulting list of grades is the statistical (observed) data.

1) order the units of observation in ascending order of the studied attribute value:

2; 2; 2; 2; 2; 2; 2; 2; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 3; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 4; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5; 5.

2) determine all possible values of the attribute $x_i$ and sort them in ascending order:

In this example, all grades fall into four groups with the values 2; 3; 4; 5.

The value of the random variable corresponding to a separate group of observed data is called the attribute value, or variant, and is denoted $x_i$.

3) count how many times each value occurs. The number that shows how many times a given attribute value occurs in a series of observations is called the frequency of the attribute value and is denoted $f_i$.

In our example:

grade 2 occurs 8 times,

grade 3 occurs 12 times,

grade 4 occurs 23 times,

grade 5 occurs 17 times.

There are 60 grades in total.

4) write the data obtained into a table of two rows (columns): $x_i$ and $f_i$.

Based on these data, a discrete variation series can be constructed.

A discrete variation series is a table that lists the observed values of the studied attribute individually, in ascending order, together with their frequencies.
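The four steps can be sketched in Python using the grade list from this example. The variable names are hypothetical; `collections.Counter` does the counting in step 3.

```python
from collections import Counter

# Grades from the example (60 observations)
grades = [
    3, 4, 3, 5, 4, 2, 2, 4, 4, 3, 5, 2, 4, 5, 4, 3, 4, 3, 3, 4,
    4, 2, 2, 5, 5, 4, 5, 2, 3, 4, 4, 3, 4, 5, 2, 5, 5, 4, 3, 3,
    4, 2, 4, 4, 5, 4, 3, 5, 3, 5, 4, 4, 5, 4, 4, 5, 4, 5, 5, 5,
]

# Steps 1-3: distinct values x_i in ascending order with their frequencies f_i
series = sorted(Counter(grades).items())

# The sum of the frequencies must equal the number of observations
n = sum(f for _, f in series)

# Step 4: the discrete variation series as a two-column table,
# with the relative frequency added as a third column
for x, f in series:
    print(x, f, round(f / n, 3))
```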

  1. Construction of an interval variation series

Besides the discrete variation series, data are often grouped into an interval variation series.

An interval series is built if:

    the attribute varies continuously;

    there are many discrete values (more than 10);

    the frequencies of the discrete values are very small (no more than 1-3 with a relatively large number of observation units);

    many discrete values of the attribute have the same frequencies.

An interval variation series is a way of grouping data in the form of a table with two columns: attribute values given as intervals, and the frequency of each interval.

Unlike a discrete series, the attribute values of an interval series are represented not by individual values but by intervals of values ("from - to").

The number that shows how many observation units fell into each interval is called the frequency of the attribute value and is denoted $f_i$. The sum of all frequencies of the series is equal to the number of elements (observation units) in the studied population.

If a unit has an attribute value equal to the upper limit of an interval, it should be assigned to the next interval.

For example, a child with a height of 100 cm falls into the 2nd interval, not the first; and a child with a height of 130 cm falls into the last interval, not the third.
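The boundary rule can be sketched in Python with `bisect_right`, which places a value equal to a boundary into the interval that starts at that boundary. The function name `interval_index` is hypothetical. The boundaries 100 and 130 come from the text; the boundary 110 is an assumption inferred from the stated width of 20 for the interval adjacent to the last one.

```python
from bisect import bisect_right


def interval_index(value, boundaries):
    """0-based index of the interval that contains `value`.

    `boundaries` are the inner boundaries between intervals, in ascending
    order. A value equal to a boundary is assigned to the next interval.
    """
    return bisect_right(boundaries, value)


# 100 and 130 are from the text; 110 is an assumed boundary
boundaries = [100, 110, 130]

print(interval_index(100, boundaries) + 1)  # 2: the 2nd interval, not the first
print(interval_index(130, boundaries) + 1)  # 4: the last interval, not the third
```

Note that this convention is the opposite of the one used by Excel's FREQUENCY function, where a value equal to an upper boundary stays in that "pocket".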

Based on these data, it is possible to construct an interval variation series.

Each interval has a lower limit ($x_{\text{lower}}$), an upper limit ($x_{\text{upper}}$) and an interval width ($i$).

An interval boundary is an attribute value that lies on the border between two intervals.

[Table: distribution of children by height. Columns: children's height (cm); number of children. The last interval is "over 130".]

If an interval has both an upper and a lower bound, it is called a closed interval. If it has only a lower or only an upper bound, it is an open interval. Only the very first or the very last interval can be open. In the example above, the last interval is open.

The interval width ($i$) is the difference between the upper and lower bounds:

$$i = x_{\text{upper}} - x_{\text{lower}}$$

The width of an open interval is assumed to be the same as the width of an adjacent closed interval.

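[Table: children's height (cm); number of children; interval width (i). For the open interval "over 130", the upper bound used in calculations is 130 + 20 = 150, and the width is taken as 20, because the width of the adjacent closed interval is 20.]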

All interval series are divided into series with equal intervals and series with unequal intervals. In an interval series with equal intervals, all intervals have the same width. In an interval series with unequal intervals, the widths differ.

The series in this example has unequal intervals.

Higher professional education

"RUSSIAN ACADEMY OF PEOPLE'S ECONOMY AND

CIVIL SERVICE UNDER THE PRESIDENT

RUSSIAN FEDERATION"

(Kaluga branch)

Department of Natural Science and Mathematical Disciplines

TEST

Subject "Statistics"

Student ___ Mayboroda Galina Yurievna ______

Correspondence department faculty State and municipal management group G-12-V

Lecturer ____________________ Hamer G.V.

PhD, Associate Professor

Kaluga-2013

Task 1.

Task 1.1.

Task 1.2.

Task 1.3.

Task 1.4.

Task 2.

Task 2.1.

Task 2.2.

Task 2.3.

Task 2.4.

Task 3.

Task 3.1.

Task 3.2.

Task 3.3.

Task 3.4.

Task 4.

Task 4.1.

Task 4.2.

Task 4.3.

Task 4.4.

List of used sources.

Task 1.

Task 1.1.

There are the following data on the output and the amount of profit by the enterprises of the region (table 1).

Table 1

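Data on production output and the amount of profit by enterprises

| Company number | Output, million rubles | Profit, million rubles | Company number | Output, million rubles | Profit, million rubles |
|---|---|---|---|---|---|
| | 63.0 | 6.7 | | 56.0 | 7.2 |
| | 48.0 | 6.2 | | 81.0 | 9.6 |
| | 39.0 | 6.5 | | 55.0 | 6.3 |
| | 28.0 | 3.0 | | 76.0 | 9.1 |
| | 72.0 | 8.2 | | 54.0 | 6.0 |
| | 61.0 | 7.6 | | 53.0 | 6.4 |
| | 47.0 | 5.9 | | 68.0 | 8.5 |
| | 37.0 | 4.2 | | 52.0 | 6.5 |
| | 25.0 | 2.8 | | 44.0 | 5.0 |
| | 60.0 | 7.9 | | 51.0 | 6.4 |
| | 46.0 | 5.5 | | 50.0 | 5.8 |
| | 34.0 | 3.8 | | 65.0 | 6.7 |
| | 21.0 | 2.1 | | 49.0 | 6.1 |
| | 58.0 | 8.0 | | 42.0 | 4.8 |
| | 45.0 | 5.7 | | 32.0 | 4.6 |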

According to the original data:

1. Build a statistical series of distribution of enterprises by output, forming five groups at equal intervals.

Build graphs of the distribution series: polygon, histogram, cumulate. Determine the mode and median graphically.

2. Calculate the characteristics of the distribution series of enterprises by output: arithmetic mean, variance, standard deviation, coefficient of variation.

Make a conclusion.

3. Using the method of analytical grouping, establish the presence and nature of the correlation between the value of output and the amount of profit per enterprise.

4. Measure the closeness of the correlation between the value of output and the amount of profit using the empirical correlation ratio.

Draw general conclusions.

Solution:

Let's build the statistical distribution series.

To construct an interval variation series that characterizes the distribution of enterprises by output, it is necessary to calculate the width and boundaries of the intervals of the series.

When constructing a series with equal intervals, the interval width h is determined by the formula:

$$h = \frac{x_{\max} - x_{\min}}{k},$$

where $x_{\max}$ and $x_{\min}$ are the largest and smallest values of the attribute in the studied set of enterprises;

k is the number of groups in the interval series.

The number of groups k is specified in the assignment: k = 5.

$x_{\max}$ = 81 million rubles, $x_{\min}$ = 21 million rubles.

Calculation of the interval width:

$$h = \frac{81 - 21}{5} = 12 \text{ million rubles}$$

By successively adding the interval width h = 12 million rubles to the lower boundary of each interval, we obtain the following groups:

Group 1: 21 - 33 million rubles.

Group 2: 33 - 45 million rubles.

Group 3: 45 - 57 million rubles.

Group 4: 57 - 69 million rubles.

Group 5: 69 - 81 million rubles.

To construct the interval series, it is necessary to count the number of enterprises in each group (the group frequencies).

The grouping of enterprises by output volume is presented in auxiliary Table 2. The profit column of this table (column 4) is needed to build the analytical grouping (item 3 of the task).

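Table 2

Table for constructing the interval distribution series and the analytical grouping

| Groups of enterprises by output, million rubles | Company number | Output, million rubles | Profit, million rubles |
|---|---|---|---|
| 21-33 | | 21.0 | 2.1 |
| | | 25.0 | 2.8 |
| | | 28.0 | 3.0 |
| | | 32.0 | 4.6 |
| Total | | 106.0 | 12.5 |
| 33-45 | | 34.0 | 3.8 |
| | | 37.0 | 4.2 |
| | | 39.0 | 6.5 |
| | | 42.0 | 4.8 |
| | | 44.0 | 5.0 |
| Total | | 196.0 | 24.3 |
| 45-57 | | 45.0 | 5.7 |
| | | 46.0 | 5.5 |
| | | 47.0 | 5.9 |
| | | 48.0 | 6.2 |
| | | 49.0 | 6.1 |
| | | 50.0 | 5.8 |
| | | 51.0 | 6.4 |
| | | 52.0 | 6.5 |
| | | 53.0 | 6.4 |
| | | 54.0 | 6.0 |
| | | 55.0 | 6.3 |
| | | 56.0 | 7.2 |
| Total | | 606.0 | 74.0 |
| 57-69 | | 58.0 | 8.0 |
| | | 60.0 | 7.9 |
| | | 61.0 | 7.6 |
| | | 63.0 | 6.7 |
| | | 65.0 | 6.7 |
| | | 68.0 | 8.5 |
| Total | | 375.0 | 45.4 |
| 69-81 | | 72.0 | 8.2 |
| | | 76.0 | 9.1 |
| | | 81.0 | 9.6 |
| Total | | 229.0 | 26.9 |
| Total | | 1512.0 | 183.1 |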

Based on the group "Total" rows of Table 2, the final Table 3 is formed, representing the interval distribution series of enterprises by output.

Table 3

Distribution series of enterprises by output volume

Conclusion. The grouping shows that the distribution of enterprises by output is not uniform. Enterprises with output from 45 to 57 million rubles are the most common (12 enterprises). Enterprises with output from 69 to 81 million rubles are the least common (3 enterprises).
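The grouping can be reproduced with a short Python sketch. The output and profit values are taken from Table 1; the variable and function names are hypothetical. Following the rule stated earlier, a value equal to a group's upper boundary goes to the next group, except that the maximum (81) stays in the last group.

```python
# Output and profit (million rubles) of the 30 enterprises from Table 1
output = [63, 48, 39, 28, 72, 61, 47, 37, 25, 60, 46, 34, 21, 58, 45,
          56, 81, 55, 76, 54, 53, 68, 52, 44, 51, 50, 65, 49, 42, 32]
profit = [6.7, 6.2, 6.5, 3.0, 8.2, 7.6, 5.9, 4.2, 2.8, 7.9, 5.5, 3.8, 2.1, 8.0, 5.7,
          7.2, 9.6, 6.3, 9.1, 6.0, 6.4, 8.5, 6.5, 5.0, 6.4, 5.8, 6.7, 6.1, 4.8, 4.6]

k = 5
x_min, x_max = min(output), max(output)
h = (x_max - x_min) / k  # interval width: (81 - 21) / 5 = 12


def group_index(x):
    """0-based group number; the maximum value is kept in the last group."""
    return min(int((x - x_min) / h), k - 1)


freq = [0] * k          # number of enterprises in each group
out_sum = [0] * k       # total output per group
profit_sum = [0.0] * k  # total profit per group

for x, p in zip(output, profit):
    g = group_index(x)
    freq[g] += 1
    out_sum[g] += x
    profit_sum[g] += p

for g in range(k):
    lower = x_min + g * h
    print(f"{lower:.0f}-{lower + h:.0f}", freq[g], out_sum[g], round(profit_sum[g], 1))
```

The frequencies form the interval distribution series, while the per-group totals correspond to the "Total" rows of Table 2.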

Let's build graphs of the distribution series.

A polygon is often used to represent discrete series. To construct a polygon in a rectangular coordinate system, the values of the argument, that is, the variants, are plotted on the abscissa axis (for an interval variation series, the midpoint of each interval is taken as the argument), and the frequencies are plotted on the ordinate axis. Points whose coordinates are the corresponding pairs of numbers from the variation series are then plotted and connected in sequence by straight line segments. The polygon is shown in Figure 1.

A histogram is a bar chart. It allows the symmetry of the distribution to be assessed. The histogram is shown in Figure 2.

Figure 1 - Polygon of the distribution of enterprises by output volume

Figure 2 - Histogram of the distribution of enterprises by output volume

Mode (Mo) is the attribute value that occurs most often in the studied population.

For an interval series, the mode can be determined graphically from the histogram (Figure 2). To do this, the tallest rectangle is selected; it is the modal one (here 45-57 million rubles). The upper right vertex of the modal rectangle is then connected to the upper right corner of the preceding rectangle, and the upper left vertex of the modal rectangle to the upper left corner of the following rectangle. From the point where these lines intersect, a perpendicular is dropped to the abscissa axis. The abscissa of the intersection point is the mode of the distribution.

Mo ≈ 52 million rubles.

Conclusion. In the considered set of enterprises, an output of about 52 million rubles is the most common.
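The graphical construction has a simple algebraic equivalent: solving for the intersection of the two diagonals gives the usual interpolation formula for the mode. Below is a minimal Python sketch of that calculation; the function name `graphical_mode` is hypothetical, and the frequencies 5, 12 and 6 are the sizes of the second, third and fourth groups in Table 2.

```python
def graphical_mode(x_lower, h, f_prev, f_mode, f_next):
    """Abscissa of the intersection of the two diagonals drawn on the histogram.

    x_lower : lower boundary of the modal interval
    h       : width of the modal interval
    f_prev, f_mode, f_next : frequencies of the preceding, modal and
                             following intervals
    """
    d_prev = f_mode - f_prev  # excess of the modal bar over the preceding bar
    d_next = f_mode - f_next  # excess of the modal bar over the following bar
    return x_lower + h * d_prev / (d_prev + d_next)


mode = graphical_mode(x_lower=45, h=12, f_prev=5, f_mode=12, f_next=6)
print(round(mode, 1))  # about 51.5
```

The computed value, about 51.5 million rubles, agrees with the value of about 52 read from the chart to within the accuracy of a graphical reading.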

A cumulate is a broken line built from the accumulated frequencies (calculated in Table 4). The cumulate starts from the lower boundary of the first interval (21 million rubles); the accumulated frequency of each interval is plotted at its upper boundary. The cumulate is shown in Figure 3.

Figure 3 - Cumulate of the distribution of enterprises by output volume

The median (Me) is the attribute value that falls in the middle of the ranked series. There is the same number of population units on either side of the median.

In an interval series, the median can be determined graphically from the cumulate. From the point on the cumulative frequency scale corresponding to 50% (30:2 = 15), a straight line is drawn parallel to the abscissa axis until it intersects the cumulate. From the intersection point, a perpendicular is then dropped to the abscissa axis. The abscissa of the intersection point is the median.

Me ≈ 52 million rubles.

Conclusion. In the considered set of enterprises, half of the enterprises have an output of no more than 52 million rubles, and the other half no less than 52 million rubles.
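The reading from the cumulate can be approximated numerically by linear interpolation along the segment of the cumulate that crosses the 50% level. The sketch below is only an illustration: the function name `graphical_median` is hypothetical, and the group boundaries and frequencies are those of the grouping built above.

```python
from itertools import accumulate

bounds = [21, 33, 45, 57, 69, 81]  # group boundaries, million rubles
freq = [4, 5, 12, 6, 3]            # number of enterprises in each group

# Accumulated frequencies plotted at the upper boundary of each interval
cum = list(accumulate(freq))       # [4, 9, 21, 27, 30]


def graphical_median(bounds, freq):
    """Abscissa at which the cumulate reaches half of the total frequency."""
    cum = [0] + list(accumulate(freq))
    half = cum[-1] / 2
    for i, f in enumerate(freq):
        if cum[i + 1] >= half:
            width = bounds[i + 1] - bounds[i]
            # Linear interpolation inside the interval that contains the median
            return bounds[i] + width * (half - cum[i]) / f


median = graphical_median(bounds, freq)
print(cum, median)
```

Interpolation gives 51 million rubles, close to the value of about 52 read from the chart; a graphical reading is approximate, so a small difference is to be expected.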




When processing large amounts of information, which is especially important in modern research, the researcher faces the serious task of grouping the initial data correctly. If the data are discrete, then, as we have seen, there are no problems: it is enough to calculate the frequency of each attribute value. If the attribute under study is continuous (which is more common in practice), choosing the optimal number of intervals for grouping is by no means a trivial task.

To group continuous random variables, the entire range of variation of the attribute is divided into a certain number of intervals k.

A grouped interval (continuous) variation series is a set of intervals ranked by attribute value, given together with the corresponding frequencies $m_i$ (the number of observations that fell into the i-th interval) or relative frequencies:

[Table with two rows: attribute value intervals; frequency $m_i$.]

The histogram and the cumulate (ogive), already discussed in detail, are excellent data visualization tools that give a first impression of the structure of the data. Such graphs (Fig. 1.15) are built for continuous data in the same way as for discrete data, with one difference: continuous data completely fill the range of their possible values and can take any value in it.

Fig. 1.15.

Therefore the bars of the histogram and of the cumulate must touch one another, with no regions inside the range of possible values into which attribute values cannot fall (that is, the histogram and the cumulate should have no "holes" along the abscissa axis, as in Fig. 1.16). The height of a bar corresponds to the frequency (the number of observations that fall into the given interval) or to the relative frequency (the proportion of observations). Intervals must not overlap and usually have the same width.

Fig. 1.16.

The histogram and the polygon are approximations of the probability density curve (differential function) f(x) of the theoretical distribution, which is studied in probability theory. This is why their construction matters so much in the primary statistical processing of continuous quantitative data: their shape suggests the hypothetical distribution law.

The cumulate is the curve of accumulated frequencies (or accumulated relative frequencies) of the interval variation series. It is compared with the graph of the cumulative (integral) distribution function F(x), also studied in probability theory.

The concepts of histogram and cumulate are associated mainly with continuous data and their interval variation series, since these graphs are empirical estimates of the probability density function and the distribution function, respectively.

The construction of an interval variation series begins with determining the number of intervals k. This task is perhaps the most difficult, important and controversial part of the whole topic.

The number of intervals should not be too small, otherwise the histogram will be oversmoothed and will lose all the features of the variability of the initial data. Fig. 1.17 (left graph) shows a histogram built from the same data as the graphs in Fig. 1.15, but with a smaller number of intervals.

At the same time, the number of intervals should not be too large, otherwise it becomes impossible to estimate the distribution density of the data along the numerical axis: the histogram turns out undersmoothed, uneven and with empty intervals (see Fig. 1.17, right graph).

Fig. 1.17.
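The effect of the number of intervals can be tried out numerically. The sketch below is only an illustration: it uses the 30 output values from Table 1 of this document as sample data (an assumption; the figures here are built from other data), and the function name `bin_counts` is hypothetical.

```python
# Sample data: output of 30 enterprises from Table 1 (million rubles)
output = [63, 48, 39, 28, 72, 61, 47, 37, 25, 60, 46, 34, 21, 58, 45,
          56, 81, 55, 76, 54, 53, 68, 52, 44, 51, 50, 65, 49, 42, 32]


def bin_counts(data, k):
    """Frequencies of an equal-interval series with k intervals."""
    lo, hi = min(data), max(data)
    h = (hi - lo) / k
    counts = [0] * k
    for x in data:
        # The maximum value is kept in the last interval
        counts[min(int((x - lo) / h), k - 1)] += 1
    return counts


for k in (2, 5, 30):
    counts = bin_counts(output, k)
    print(k, counts, "empty intervals:", counts.count(0))
```

With k = 2 all detail of the distribution is lost (oversmoothing), while with k = 30 some intervals stay empty and the frequencies become uneven (undersmoothing).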

How to determine the most preferred number of intervals?

Back in 1926, Herbert Sturges proposed a formula for calculating the number of intervals into which the initial set of values of the studied attribute should be divided. The formula has become extremely popular: most statistics textbooks recommend it, and many statistical packages use it by default. Whether this is justified in all cases is a very serious question.

So what is Sturges' formula based on?

Consider the binomial distribution )