Build an interval distribution series. Summary and grouping of statistics

Lab #1

in Mathematical Statistics

Topic: Primary processing of experimental data

3. Score in points

5. Control questions

6. Methodology for performing laboratory work

Objective

Acquisition of skills of primary processing of empirical data by methods of mathematical statistics.

On the basis of a set of experimental data, perform the following tasks:

Task 1. Construct an interval variation series of the distribution.

Task 2. Construct a histogram of the frequencies of the interval variation series.

Task 3. Construct the empirical distribution function and plot it.

Task 4. Calculate the numerical characteristics of the sample:

a) mode and median;

b) conditional initial moments;

c) sample mean;

d) sample variance, corrected population variance, corrected standard deviation;

e) coefficient of variation;

f) skewness (asymmetry);

g) kurtosis;

Task 5. Determine the confidence bounds for the true values of the numerical characteristics of the random variable under study at a given confidence level.

Task 6. Give a substantive interpretation of the results of primary processing in the context of the problem.

Score in points

Tasks 1-5: 6 points

Task 6: 2 points

Lab defense (oral interview on the control questions and the laboratory work): 2 points

The work is submitted in writing on A4 sheets and includes:

1) Title page (Appendix 1)

2) Initial data.

3) Presentation of work according to the specified sample.

4) Calculation results (performed manually and/or using MS Excel) in the specified order.

5) Conclusions - a meaningful interpretation of the results of primary processing according to the condition of the problem.

6) Oral interview on work and control questions.



5. Control questions


Methodology for performing laboratory work

Task 1. Construct an interval variation series of distribution

To present statistical data as a variation series with equally spaced variants, proceed as follows:

1. In the original data table, find the smallest value x_min and the largest value x_max.

2. Determine the range of variation: R = x_max - x_min.

3. Determine the interval length h. If the sample contains up to 1000 values, use the formula h = R / (1 + 3.322 lg n), where n is the sample size (the number of values in the sample) and lg n is the decimal logarithm of n.

The computed ratio is rounded to a convenient integer value of h.

4. To determine the beginning of the first interval for an even number of intervals, it is recommended to take the value ; and for an odd number of intervals .

5. Write down the grouping intervals and arrange them in ascending order of their boundaries:

a_0 to a_1, a_1 to a_2, ..., a_(k-1) to a_k,

where a_0 is the lower bound of the first interval. For a_0, take a convenient number no greater than x_min; the upper bound of the last interval, a_k, must be no less than x_max. It is recommended that the intervals contain all the initial values of the random variable and that there be from 5 to 20 intervals.

6. Distribute the initial data among the grouping intervals, i.e. count from the original table how many values of the random variable fall within each interval. If some values coincide with interval boundaries, assign them consistently either only to the preceding interval or only to the following one.

Remark 1. The intervals need not be equal in length. Where the values are denser, it is more convenient to take shorter intervals, and where they are sparser, longer ones.

Remark 2. If zero or very small frequencies are obtained for some intervals, regroup the data by enlarging the intervals (increasing the step h).
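The steps of this procedure can be sketched in Python. This is only an illustration under stated assumptions: the step is taken as h = R / (1 + 3.322 lg n) and rounded up to an integer, the first interval starts at floor(x_min), boundary values are assigned to the following interval, and both the function name `interval_series` and the sample values are hypothetical.

```python
import math

def interval_series(data):
    """Group data into equal intervals of integer step h (a sketch, not the handout's code)."""
    n = len(data)
    x_min, x_max = min(data), max(data)
    r = x_max - x_min                                   # range of variation R
    # Assumption: h = R / (1 + 3.322 lg n), rounded up to a convenient integer.
    h = max(1, math.ceil(r / (1 + 3.322 * math.log10(n))))
    start = math.floor(x_min)                           # assumption: convenient start <= x_min
    k = math.floor((x_max - start) / h) + 1             # enough intervals to cover x_max
    bounds = [start + i * h for i in range(k + 1)]
    freqs = [0] * k
    for x in data:
        # A value on a boundary is counted in the following interval.
        freqs[int((x - start) // h)] += 1
    return list(zip(bounds[:-1], bounds[1:], freqs))

# Hypothetical sample, used only to exercise the function.
sample = [12, 15, 11, 19, 23, 17, 14, 21, 18, 16, 13, 20]
series = interval_series(sample)
for lower, upper, freq in series:
    print(f"[{lower}; {upper}): {freq}")
```

If the last intervals come out with very small frequencies, Remark 2 applies: increase the step and regroup.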

The most important stage in the study of socio-economic phenomena and processes is the systematization of primary data and, on this basis, obtaining a summary characteristic of the entire object using generalizing indicators, which is achieved by summarizing and grouping primary statistical material.

A statistical summary is a set of sequential operations that generalize the individual facts forming a population in order to identify the typical features and patterns of the phenomenon under study as a whole. A statistical summary includes the following steps:

  • choice of grouping feature;
  • determination of the order of formation of groups;
  • development of a system of statistical indicators to characterize groups and the object as a whole;
  • development of layouts of statistical tables for presenting summary results.

Statistical grouping is the division of the units of the studied population into homogeneous groups according to characteristics that are essential for them. Grouping is the most important method of summarizing statistical data and the basis for the correct calculation of statistical indicators.

There are the following types of groupings: typological, structural, analytical. All these groupings are united by the fact that the units of the object are divided into groups according to some attribute.

A grouping attribute is the attribute by which the units of the population are divided into separate groups. The conclusions of a statistical study depend on the correct choice of the grouping attribute. Groupings should be based on significant, theoretically substantiated attributes (quantitative or qualitative).

Quantitative signs of grouping have a numerical expression (trading volume, age of a person, family income, etc.), and qualitative features of the grouping reflect the state of the population unit (sex, marital status, industry affiliation of the enterprise, its form of ownership, etc.).

After the basis of the grouping is determined, the question of the number of groups into which the study population should be divided should be decided. The number of groups depends on the objectives of the study and the type of indicator underlying the grouping, the size of the population, the degree of variation of the trait.

For example, the grouping of enterprises according to the forms of ownership takes into account municipal, federal and the property of the subjects of the federation. If the grouping is carried out according to a quantitative attribute, then it is necessary to pay special attention to the number of units of the object under study and the degree of fluctuation of the grouping attribute.

Once the number of groups is determined, the grouping intervals should be determined. An interval is the set of values of a varying attribute that lie within certain boundaries. Each interval has its own size and upper and lower bounds, or at least one of them.

The lower bound of an interval is the smallest value of the attribute in the interval, and the upper bound is the largest value of the attribute in the interval. The interval size is the difference between the upper and lower bounds.

Depending on their size, grouping intervals are either equal or unequal. If the variation of the attribute lies within relatively narrow boundaries and the distribution is uniform, a grouping with equal intervals is built. The size of an equal interval is determined by the following formula:

h = (Xmax - Xmin) / n,

where Xmax and Xmin are the maximum and minimum values of the attribute in the population, and n is the number of groups.

The simplest grouping, in which each selected group is characterized by one indicator, is a distribution series.

A statistical distribution series is an ordered distribution of population units into groups according to a certain attribute. Depending on the attribute underlying the series, attributive and variation distribution series are distinguished.

Attributive distribution series are built on qualitative attributes, that is, attributes that have no numerical expression (distribution by type of labor, by sex, by profession, etc.). They characterize the composition of the population according to some essential attribute. Taken over several periods, these data make it possible to study changes in the structure.

Variation series are distribution series built on a quantitative attribute. Any variation series consists of two elements: variants and frequencies. Variants are the individual values that the attribute takes in the variation series, that is, the specific values of the varying attribute.

Frequencies are the counts of each individual variant or each group of the variation series, that is, numbers showing how often particular variants occur in the distribution series. The sum of all frequencies gives the size of the entire population. Relative frequencies are frequencies expressed as fractions of one or as percentages of the total. Accordingly, the sum of the relative frequencies equals 1 or 100%.
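The relationship between frequencies and relative frequencies can be checked with a few lines of Python. This is a minimal sketch: the function name `relative_frequencies` and the frequency values are hypothetical, and exact fractions are used so that the sum is exactly 1.

```python
from fractions import Fraction

def relative_frequencies(freqs):
    """Convert absolute frequencies into fractions of the population size."""
    total = sum(freqs)                     # size of the entire population
    return [Fraction(f, total) for f in freqs]

# Hypothetical frequencies of three variants.
freqs = [2, 5, 3]
shares = relative_frequencies(freqs)
percentages = [float(s) * 100 for s in shares]
print(shares, percentages)
```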

Depending on the nature of the variation of a feature, three forms of a variation series are distinguished: a ranked series, a discrete series, and an interval series.

A ranked variation series is the list of the individual units of the population in ascending or descending order of the attribute under study. Ranking makes it easy to divide quantitative data into groups, to spot the smallest and largest values of the attribute at once, and to pick out the values that repeat most often.

A discrete variation series characterizes the distribution of population units by a discrete attribute that takes only integer values, for example the tariff category, the number of children in a family, or the number of employees in an enterprise.

If an attribute varies continuously, so that within certain limits it can take any value ("from - to"), an interval variation series should be built for it. Examples are the amount of income, length of service, and the value of an enterprise's fixed assets.

Examples of solving problems on the topic "Statistical summary and grouping"

Task 1 . There is information on the number of books received by students by subscription for the past academic year.

Construct ranked and discrete variation series of the distribution, labeling the elements of the series.

Solution

This data set is a collection of variants of the number of books received by students. Count how many times each variant occurs and arrange the results as a ranked variation series and as a discrete variation series.

Task 2 . There is data on the value of fixed assets for 50 enterprises, thousand rubles.

Build a distribution series, highlighting 5 groups of enterprises (at equal intervals).

Solution

First find the largest and smallest values of the fixed assets of the enterprises. These are 30.0 and 10.2 thousand rubles.

Find the size of the interval: h = (30.0 - 10.2) / 5 = 3.96 thousand rubles.

Then the first group includes enterprises whose fixed assets range from 10.2 thousand rubles to 10.2 + 3.96 = 14.16 thousand rubles. There are 9 such enterprises. The second group includes enterprises whose fixed assets range from 14.16 thousand rubles to 14.16 + 3.96 = 18.12 thousand rubles. There are 16 such enterprises. The number of enterprises in the third, fourth and fifth groups is found in the same way.

The resulting distribution series is placed in the table.
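The interval bounds for this task can be reproduced with a short Python sketch. Only the minimum (10.2), the maximum (30.0) and the number of groups (5) come from the task; the function name `equal_interval_bounds` and the rounding to two decimals are assumptions made for illustration. The enterprise data themselves are not reproduced here, so the sketch computes bounds only, not frequencies.

```python
def equal_interval_bounds(x_min, x_max, groups, ndigits=2):
    """Return the interval size h and the list of group boundaries."""
    h = (x_max - x_min) / groups
    bounds = [round(x_min + i * h, ndigits) for i in range(groups + 1)]
    return h, bounds

h, bounds = equal_interval_bounds(10.2, 30.0, 5)
print(round(h, 2))   # interval size in thousand rubles
print(bounds)        # boundaries of the five groups
```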

Task 3 . For a number of light industry enterprises, the following data were obtained:

Make a grouping of enterprises according to the number of workers, forming 6 groups at equal intervals. Count for each group:

1. number of enterprises
2. number of workers
3. volume of manufactured products per year
4. average actual output per worker
5. amount of fixed assets
6. average size of fixed assets of one enterprise
7. average value of manufactured products by one enterprise

Record the results of the calculation in tables. Draw your own conclusions.

Solution

First find the largest and smallest values of the average number of workers in an enterprise. These are 256 and 43.

Find the size of the interval: h = (256 - 43) / 6 = 35.5.

Then the first group includes enterprises with an average number of workers from 43 to 43 + 35.5 = 78.5 people. There are 5 such enterprises. The second group includes enterprises with an average number of workers from 78.5 to 78.5 + 35.5 = 114 people. There are 12 such enterprises. The number of enterprises in the third, fourth, fifth and sixth groups is found in the same way.

We put the resulting distribution series in a table and calculate the necessary indicators for each group:
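A grouping of this kind, with totals and averages per group, might be computed as in the sketch below. The minimum (43), maximum (256) and number of groups (6) come from the task; the records in `sample` are hypothetical placeholders, not the enterprise data of the task, and the function and key names are illustrative. The lower bound of each group is included and the upper bound is not, except that the maximum is placed in the last group (an assumption).

```python
def group_enterprises(records, x_min, x_max, groups):
    """records: iterable of (workers, output, fixed_assets) tuples."""
    h = (x_max - x_min) / groups                       # equal interval size
    table = [
        {"lower": x_min + i * h, "upper": x_min + (i + 1) * h,
         "enterprises": 0, "workers": 0, "output": 0.0, "assets": 0.0}
        for i in range(groups)
    ]
    for workers, output, assets in records:
        # The maximum value belongs to the last group.
        i = min(int((workers - x_min) // h), groups - 1)
        row = table[i]
        row["enterprises"] += 1
        row["workers"] += workers
        row["output"] += output
        row["assets"] += assets
    for row in table:
        if row["enterprises"]:                         # skip empty groups (no division by zero)
            row["output_per_worker"] = row["output"] / row["workers"]
            row["assets_per_enterprise"] = row["assets"] / row["enterprises"]
            row["output_per_enterprise"] = row["output"] / row["enterprises"]
    return table

# Hypothetical records (workers, output, fixed assets); NOT the data of the task.
sample = [(43, 86.0, 40.0), (60, 150.0, 55.0), (100, 300.0, 120.0), (256, 1024.0, 500.0)]
table = group_enterprises(sample, x_min=43, x_max=256, groups=6)
for row in table:
    print(row)
```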

Conclusion : As can be seen from the table, the second group of enterprises is the most numerous. It includes 12 enterprises. The smallest are the fifth and sixth groups (two enterprises each). These are the largest enterprises (in terms of the number of workers).

Since the second group is the most numerous, the volume of output per year by the enterprises of this group and the volume of fixed assets are much higher than others. At the same time, the average actual output of one worker at the enterprises of this group is not the highest. The enterprises of the fourth group are in the lead here. This group also accounts for a fairly large amount of fixed assets.

In conclusion, we note that the average size of fixed assets and the average value of the output of one enterprise are directly proportional to the size of the enterprise (in terms of the number of workers).

If the random variable under study is continuous, ranking and grouping the observed values often fail to reveal the characteristic features of its variation. The reason is that individual values of a continuous random variable can differ from each other by arbitrarily small amounts, so identical values rarely occur in the observed data and the frequencies of the variants differ little from each other.

It is also impractical to construct a discrete series for a discrete random variable with a large number of possible values. In such cases an interval variation series of the distribution should be built.

To construct such a series, the entire range of variation of the observed values of the random variable is divided into a number of partial intervals, and the frequency with which values fall into each partial interval is counted.

An interval variation series is an ordered set of intervals of variation of the values of a random variable, together with the corresponding frequencies or relative frequencies with which values fall into each of them.

To build an interval series, you need:

  1. determine the number of partial intervals;
  2. determine the width of the intervals;
  3. set the upper and lower bounds of each interval;
  4. group the observation results.

1. The number and width of the grouping intervals have to be chosen in each specific case based on the goals of the study, the sample size, and the degree of variation of the attribute in the sample.

The approximate number of intervals k can be estimated from the sample size n alone in one of the following ways:

  • by Sturges' formula: k = 1 + 3.32 lg n;
  • using table 1.

Table 1

2. Intervals of the same width are generally preferred. To determine the interval width h, calculate:

  • the range of variation R of the sample values: R = x_max - x_min,

where x_max and x_min are the maximum and minimum sample variants;

  • the width of each interval h by the formula h = R / k.

3. The lower bound of the first interval, x_h1, is chosen so that the minimum sample variant x_min falls approximately in the middle of this interval: x_h1 = x_min - 0.5h.

The subsequent bounds are obtained by adding the length of the partial interval h to the bound of the previous interval:

x_hi = x_h(i-1) + h.

The scale of intervals is extended in this way as long as the value x_hi satisfies the relation

x_hi < x_max + 0.5h.

4. In accordance with the scale of intervals, the attribute values are grouped: for each partial interval, the sum n_i of the frequencies of the variants that fall into the i-th interval is calculated. An interval includes the values of the random variable that are greater than or equal to its lower bound and less than its upper bound.
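The four steps above can be sketched in Python as follows. Several details are assumptions: k from Sturges' formula is rounded to the nearest integer, the stopping rule is implemented as "keep adding intervals until x_max is covered" (one reading of the relation above), and the function name `build_interval_series` is illustrative. The sample values are the 16 profit figures from Example 2 of this text.

```python
import math

def build_interval_series(sample):
    """Return (k, h, intervals, freqs) for equal-width intervals."""
    n = len(sample)
    x_min, x_max = min(sample), max(sample)
    k = round(1 + 3.32 * math.log10(n))      # Sturges' formula; rounding is an assumption
    h = (x_max - x_min) / k                  # h = R / k
    if h == 0:
        raise ValueError("all sample values are equal; no intervals can be built")
    lower = x_min - 0.5 * h                  # x_min falls in the middle of the first interval
    intervals = []
    while lower <= x_max:                    # assumption: stop once x_max is covered
        intervals.append((lower, lower + h))
        lower += h
    # An interval includes its lower bound and excludes its upper bound.
    freqs = [sum(1 for x in sample if lo <= x < hi) for lo, hi in intervals]
    return k, h, intervals, freqs

profits = [23, 48, 57, 12, 118, 9, 16, 22, 27, 48, 56, 87, 45, 98, 88, 63]
k, h, intervals, freqs = build_interval_series(profits)
for (lo, hi), f in zip(intervals, freqs):
    print(f"[{lo:.1f}; {hi:.1f}): {f}")
```

Because the first interval starts half a step below x_min, this construction yields one interval more than k.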

Polygon and histogram

For clarity, various graphs of the statistical distribution are built.

Based on the data of a discrete variation series, a polygon of frequencies or of relative frequencies is constructed.

A frequency polygon is a polyline whose segments connect the points (x_1; n_1), (x_2; n_2), ..., (x_k; n_k). To build a frequency polygon, the variants x_i are plotted on the abscissa axis and the corresponding frequencies n_i on the ordinate axis. The points (x_i; n_i) are connected by straight line segments to give the frequency polygon (Fig. 1).

A relative frequency polygon is a polyline whose segments connect the points (x_1; W_1), (x_2; W_2), ..., (x_k; W_k). To build a relative frequency polygon, the variants x_i are plotted on the abscissa axis and the corresponding relative frequencies W_i on the ordinate axis. The points (x_i; W_i) are connected by straight line segments to give the relative frequency polygon.

For a continuous attribute it is advisable to build a histogram.

A frequency histogram is a stepped figure consisting of rectangles whose bases are the partial intervals of length h and whose heights are equal to the ratio n_i / h (the frequency density).

To build a frequency histogram, the partial intervals are plotted on the abscissa axis, and segments parallel to the abscissa axis are drawn above them at a distance n_i / h.
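The quantities needed for both graphs are easy to compute before plotting. The sketch below is illustrative only: the function names, the variants and the frequencies are hypothetical. It also checks a useful property of the frequency histogram, namely that the total area of the rectangles equals the sample size n.

```python
def polygon_points(variants, freqs):
    """Vertices (x_i; n_i) of the frequency polygon."""
    return list(zip(variants, freqs))

def histogram_heights(freqs, h):
    """Rectangle heights n_i / h (frequency density) for intervals of equal length h."""
    return [n_i / h for n_i in freqs]

# Hypothetical discrete series and interval series.
points = polygon_points([1, 2, 3, 4], [3, 7, 6, 4])
heights = histogram_heights([3, 7, 6, 4], h=2.0)
area = sum(height * 2.0 for height in heights)   # equals the sample size n = 20
print(points, heights, area)
```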

Grouping is the division of a population into groups that are homogeneous in some respect.

Purpose of the service. With the online calculator you can:

  • build a variation series, build a histogram and a polygon;
  • find indicators of variation (mean, mode (including graphically), median, range of variation, quartiles, deciles, quartile coefficient of differentiation, coefficient of variation and other indicators);

Instruction. To group a series, you must select the type of the resulting variation series (discrete or interval) and specify the amount of data (number of rows). The resulting solution is saved in a Word file (see the example of grouping statistical data).


If the grouping has already been done and a discrete or interval variation series is given, use the online calculator Variation indicators. Hypotheses about the type of distribution are tested with the service Study of the form of distribution.

Types of statistical groupings

Variation series. When a discrete random variable is observed, the same value can occur several times. Each such value x_i of the random variable is recorded together with n_i, the number of times it appears in n observations; this is the frequency of that value.
For a continuous random variable, grouping is used in practice.
  1. Typological grouping is the division of a qualitatively heterogeneous population under study into classes, socio-economic types, and homogeneous groups of units. To build this grouping, use the Discrete variational series parameter.
  2. Structural grouping is a grouping in which a homogeneous population is divided into groups that characterize its structure according to some varying attribute. To build this grouping, use the Interval series parameter.
  3. A grouping that reveals the relationships between the studied phenomena and their attributes is called an analytical grouping (see analytical grouping of series).

Principles of building statistical groupings

A series of observations arranged in ascending order is called a variation series. The grouping attribute is the attribute by which the population is divided into separate groups; it is also called the basis of the grouping. A grouping can be based on either quantitative or qualitative attributes.
Once the basis of the grouping is determined, the number of groups into which the population should be divided must be decided.

When using personal computers for processing statistical data, the grouping of units of an object is carried out using standard procedures.
One such procedure uses Sturges' formula to determine the optimal number of groups:

k = 1 + 3.322 lg N,

where k is the number of groups and N is the number of population units.

The length of the partial intervals is calculated as h = (x_max - x_min) / k.

Then the number of observations falling into each interval is counted; these counts are taken as the frequencies n_i. Small frequencies, with values less than 5 (n_i < 5), should be merged; in that case the corresponding intervals must be merged as well.
The midpoints of the intervals, x_i = (c_(i-1) + c_i) / 2, are taken as the new values.
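The merging of low-frequency intervals and the computation of midpoints can be sketched as follows. The text does not say how neighbors should be combined, so the greedy left-to-right strategy used here is an assumption, as are the function names and the example boundaries and frequencies.

```python
import math

def sturges_groups(n):
    """Number of groups k = 1 + 3.322 lg N, rounded to the nearest integer (assumption)."""
    return round(1 + 3.322 * math.log10(n))

def merge_small(bounds, freqs, min_count=5):
    """Merge adjacent intervals left to right until each has at least min_count values."""
    merged_bounds, merged_freqs, acc = [bounds[0]], [], 0
    for upper, f in zip(bounds[1:], freqs):
        acc += f
        if acc >= min_count:
            merged_bounds.append(upper)
            merged_freqs.append(acc)
            acc = 0
    if merged_bounds[-1] != bounds[-1]:        # leftover tail with too few values
        if merged_freqs:
            merged_freqs[-1] += acc            # fold it into the last merged interval
            merged_bounds[-1] = bounds[-1]
        else:
            merged_freqs.append(acc)           # everything collapses into one interval
            merged_bounds.append(bounds[-1])
    return merged_bounds, merged_freqs

def midpoints(bounds):
    """x_i = (c_(i-1) + c_i) / 2 for consecutive boundaries."""
    return [(a + b) / 2 for a, b in zip(bounds, bounds[1:])]

# Hypothetical boundaries c_0..c_6 and frequencies n_1..n_6.
new_bounds, new_freqs = merge_small([0, 10, 20, 30, 40, 50, 60], [2, 4, 9, 12, 3, 1])
print(sturges_groups(50), new_bounds, new_freqs, midpoints(new_bounds))
```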

What the grouping of statistical data is and how it relates to distribution series was covered in the lecture, which also explains what discrete and variational distribution series are.

Distribution series are one of the varieties of statistical series (the other variety used in statistics is time series); they are used to analyze data on the phenomena of social life. Anyone can construct a variation series, but there are rules to remember.

How to build a discrete variational distribution series

Example 1 Data are available on the number of children in 20 surveyed families. Construct a discrete variational series distribution of families by number of children.

0 1 2 3 1
2 1 2 1 0
4 3 2 1 1
1 0 1 0 2

Solution:

  1. Start with the layout of the table into which the data will be entered. Since a distribution series has two elements, the table consists of two columns. The first column is always the variant, i.e. what is being studied. Its name comes from the task (the end of the task sentence, "by number of children"), so the variant here is the number of children.

The second column is the frequency, i.e. how often the variant occurs in the phenomenon under study. Its name also comes from the task ("distribution of families"), so the frequency here is the number of families with the corresponding number of children.

  2. From the initial data, select the values that occur at least once. In this case they are 0, 1, 2, 3 and 4.

Arrange these values in the first column of the table in a logical order, here increasing from 0 to 4.

  3. Finally, count how many times each variant value occurs.

0 1 2 3 1

2 1 2 1 0

4 3 2 1 1

1 0 1 0 2

As a result, we obtain a complete table or the required series of distribution of families by the number of children.
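The counting in Example 1 can be reproduced with `collections.Counter` from the Python standard library. The data are the 20 values given in the example; the variable names are illustrative.

```python
from collections import Counter

# Number of children in the 20 surveyed families (data of Example 1).
children = [0, 1, 2, 3, 1,
            2, 1, 2, 1, 0,
            4, 3, 2, 1, 1,
            1, 0, 1, 0, 2]

# Variants in ascending order with their frequencies.
series = sorted(Counter(children).items())
print("children  families")
for variant, frequency in series:
    print(f"{variant:>8}  {frequency:>8}")
print("total:", sum(frequency for _, frequency in series))
```

The frequencies should add up to the number of surveyed families, which is a quick check that no value was missed.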

Exercise. There are data on the tariff categories of 30 workers of an enterprise. Construct a discrete variation series for the distribution of workers by tariff category.

2 3 2 4 4 5 5 4 6 3

1 4 4 5 5 6 4 3 2 3

4 5 4 5 5 6 6 3 3 4

How to build an interval variation series of distribution

Let's build an interval distribution series, and see how its construction differs from a discrete series.

Example 2. There are data on the profit of 16 enterprises, million rubles: 23 48 57 12 118 9 16 22 27 48 56 87 45 98 88 63. Construct an interval variation series for the distribution of enterprises by profit, forming 3 groups with equal intervals.

The general principle of constructing a series, of course, will be preserved, the same two columns, the same variants and frequency, but in this case the variants will be located in the interval and the frequencies will be counted differently.

Solution:

  1. As in the previous task, start with the layout of the table into which the data will be entered. Since a distribution series has two elements, the table consists of two columns. The first column is always the variant, i.e. what is being studied. Its name comes from the task (the end of the task sentence, "by profit"), so the variant here is the amount of profit.

The second column is the frequency, i.e. how often the variant occurs in the phenomenon under study. Its name also comes from the task ("distribution of enterprises"), so the frequency here is the number of enterprises whose profit falls into the corresponding interval.

As a result, the table has two columns: the profit interval and the number of enterprises.

  2. Determine the size of the interval by the formula

i = (Xmax - Xmin) / n,

where i is the size (length) of the interval,

Xmax and Xmin are the maximum and minimum values of the attribute,

n is the required number of groups according to the condition of the problem.

Calculate the interval size for this example. First find the largest and smallest values among the initial data:

23 48 57 12 118 9 16 22 27 48 56 87 45 98 88 63: the maximum value is 118 million rubles and the minimum is 9 million rubles. Substituting into the formula gives i = (118 - 9) / 3 = 36.33...

The calculation gives 36.(3), a repeating decimal. In such cases the interval size must be rounded up so that the maximum value is not lost after the calculations, which is why the interval size used here is 36.4 million rubles.
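The rounding rule can be made explicit with `math.ceil`. This is a small sketch: the function name `interval_size` and the choice of one decimal place are assumptions that match the 36.4 used in the example.

```python
import math

def interval_size(x_max, x_min, groups, ndigits=1):
    """(x_max - x_min) / groups, rounded UP to ndigits decimal places."""
    raw = (x_max - x_min) / groups
    factor = 10 ** ndigits
    return math.ceil(raw * factor) / factor

step = interval_size(118, 9, 3)
print(step)
# Rounding down instead would leave the maximum outside the last interval.
print(9 + 3 * 36.3 < 118, 9 + 3 * step >= 118)
```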

  3. Now build the intervals, which are the variants in this problem. The first interval starts from the minimum value; adding the interval size to it gives the upper limit of the first interval. The upper limit of the first interval then becomes the lower limit of the second interval; adding the interval size to it gives the second interval. This is repeated as many times as needed to build the number of intervals required by the condition.

Note that if the interval size were not rounded up to 36.4 but left at 36.3, the last upper limit would be 117.9, below the maximum of 118. The interval size is rounded up precisely to avoid such data loss.

  4. Count the number of enterprises that fall into each interval. When processing the data, remember that the upper limit of an interval is not included in that interval but in the next one (the lower limit is included, the upper limit is not), except for the last interval.

When processing the data, it helps to mark the selected values with symbols or color.

23 48 57 12 118 9 16 22

27 48 56 87 45 98 88 63

We will mark the first interval in yellow - and determine how much data falls into the interval from 9 to 45.4, while this 45.4 will be taken into account in the second interval (provided that it is in the data) - as a result, we get 7 enterprises in the first interval. And so on for all intervals.

  5. (additional step) Calculate the total profit of the enterprises in each interval and overall. To do this, add up the values marked with each color.

For the first interval 23 + 12 + 9 + 16 + 22 + 27 + 45 = 154 million rubles

For the second interval - 48 + 57 + 48 + 56 + 63 = 272 million rubles.

For the third interval - 118 + 87 + 98 + 88 = 391 million rubles.
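The whole of Example 2 can be checked with a short Python sketch. The data, the number of groups and the interval size of 36.4 come from the example; the variable names are illustrative. The lower limit of each interval is included and the upper limit is not, except in the last interval.

```python
# Profit of 16 enterprises, million rubles (data of Example 2).
profits = [23, 48, 57, 12, 118, 9, 16, 22, 27, 48, 56, 87, 45, 98, 88, 63]
groups = 3
step = 36.4                      # interval size, rounded up as in the example
lower0 = float(min(profits))

table = []
for g in range(groups):
    lo = lower0 + g * step
    hi = lo + step
    last = g == groups - 1
    # Upper limit excluded, except for the last interval.
    members = [x for x in profits if (lo <= x <= hi if last else lo <= x < hi)]
    table.append((round(lo, 1), round(hi, 1), len(members), sum(members)))

for lo, hi, count, total in table:
    print(f"{lo} - {hi}: {count} enterprises, total profit {total} million rubles")
print("overall:", sum(row[3] for row in table))
```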

Exercise. There are data on the size of the bank deposits of 30 depositors, thousand rubles:

150, 120, 300, 650, 1500, 900, 450, 500, 380, 440,

600, 80, 150, 180, 250, 350, 90, 470, 1100, 800,

500, 520, 480, 630, 650, 670, 220, 140, 680, 320

Build an interval variation series for the distribution of depositors by deposit size, forming 4 groups with equal intervals. For each group, calculate the total amount of deposits.