Approximation in Excel. Approximation methods in Microsoft Excel

Polynomial approximation of a function continuous on a segment.

Approximation (from the Latin "approximare" - "to approach") is an approximate expression of mathematical objects (for example, numbers or functions) in terms of other, simpler ones that are more convenient to use or simply better known. In scientific research, approximation is used to describe, analyze, generalize and further use empirical results.

As is known, there can be an exact (functional) relationship between quantities, when one value of the argument corresponds to one specific value of the function, and a less exact (correlation) relationship, when one specific value of the argument corresponds to an approximate value or to some set of function values that are more or less close to each other. When conducting scientific research or processing the results of an observation or experiment, one usually deals with the second option. When studying the quantitative dependences of various indicators whose values are determined empirically, there is, as a rule, some variability. It is partly determined by the heterogeneity of the studied objects of inanimate and, especially, living nature, and partly by the error of observation and of the quantitative processing of materials. The last component cannot always be eliminated completely; it can only be minimized by a careful choice of an adequate research method and by accuracy of work. Therefore, in any research work the problem arises of identifying the true nature of the dependence of the studied indicators, masked to one degree or another by the variability of the values. For this, approximation is used: an approximate description of the correlation dependence of variables by a suitable equation of functional dependence that conveys the main tendency of the dependence (its "trend").

When choosing an approximation, one should proceed from the specific task of the study. Usually, the simpler the equation used for approximation, the more approximate the obtained description of the dependence.

Therefore, it is important to understand how significant the deviations of specific values from the resulting trend are and what caused them. When describing the dependence of empirically determined values, much greater accuracy can be achieved using some more complex, multi-parameter equation. However, there is no point in trying to convey random deviations of values in specific series of empirical data with maximum accuracy. It is much more important to capture the general regularity, which in such cases is often expressed most logically, and with acceptable accuracy, by a simple two-parameter equation such as a power function. Thus, when choosing an approximation method, the researcher always makes a compromise: he decides to what extent it is expedient to "sacrifice" the details and, accordingly, how generally the dependence of the compared variables should be expressed. Along with revealing patterns masked by random deviations of empirical data from the general regularity, approximation also allows solving many other important problems: formalizing the found dependence, and finding unknown values of the dependent variable by interpolation or, where applicable, extrapolation.

Here polynomial approximation will be considered. This means that, given the initial data (a function and a segment), our task is to find a polynomial whose graph deviates as little as possible from the graph of the original function.

The most popular polynomial approximation method is the least squares method. In Excel, it is implemented using a chart and a trend line.

Let's analyze this method in Excel.

Initial data: the function f(x) = √(1 + x²) + 2^(−x) on the segment [−1, 3].

First, we need to partition this segment using Chebyshev nodes, since this choice of nodes generally gives a more accurate polynomial approximation than a uniform partition.

In column I (Fig. 1) we write the numbers from 0 to 8, since the segment is divided into 8 parts (9 nodes).

In column z we calculate each cell by the formula =COS(3.141593*I/8), using the corresponding value of I in each row.

The value of each x is found by the formula =2*z+1 (this maps the nodes from [−1, 1] onto the segment [−1, 3]).

In column F(x) we calculate the value of this function for each x.
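The three worksheet columns above can be sketched outside Excel as well; a minimal Python reproduction of the node construction (the function f is taken from the worksheet's deviation formula):

```python
import math

# Chebyshev nodes on [-1, 1]: z_i = cos(pi * i / 8), i = 0..8,
# then mapped onto the segment [-1, 3] by x = 2*z + 1
# (mirrors the worksheet formulas =COS(3.141593*I/8) and =2*z+1).
n = 8
z = [math.cos(math.pi * i / n) for i in range(n + 1)]
x = [2 * zi + 1 for zi in z]

# f(x) = sqrt(1 + x^2) + 2^(-x), the function being approximated
f = [math.sqrt(1 + xi**2) + 2**(-xi) for xi in x]
```

Note that the nodes run from x = 3 (i = 0) down to x = −1 (i = 8), clustering toward the ends of the segment.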


Figure 1
Next, in cells H2, I2, J2, we set the initial values ​​of the coefficients a, b and c in the required polynomial (Fig. 2).


Figure 2
In column F (rows 2 through 10) we calculate the deviations, i.e. the absolute value of the difference between the value of the original function and the polynomial.

Formula: ABS((1+x^2)^0.5+2^(-x)-($H$2*x^2+$I$2*x+$J$2)).

In cell F11 the sum of the deviations is calculated, and in cell F12 the average deviation (Fig. 3).


Figure 3
Using the Chart Wizard, we build a scatter plot based on the data of the x and F(x) columns. Now, on the "Chart" menu, select "Add Trendline", choose a polynomial trend of degree 2, and check the box to show the equation on the chart (Fig. 4).


Figure 4
Now we substitute the coefficients from the resulting equation into cells H2, I2 and J2 (Fig. 5).


Figure 5
As you can see, the average deviation is 0.117006252.

Found polynomial: 0.363*x² - 0.6901*x + 2.2203.

Let us propose another method of polynomial approximation.

Open the "Tools" menu and select "Solver". In the window that appears, specify F11 as the target cell, to be set equal to the minimum value. In the "By changing cells" field, specify H2, I2 and J2.

Click the "Solve" button. After the procedure runs, we see that the results have changed (Fig. 6).


Figure 6
This time the mean deviation is 0.106084329.

Found polynomial: 0.35724*x² - 0.702*x + 2.259158.

This result is somewhat more accurate than the previous one, which is expected: here we directly minimize the sum of absolute deviations, whereas the least-squares trend line minimizes the sum of squared deviations, a different error measure.
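Solver's minimization of the sum of absolute deviations can be imitated in code. A sketch assuming NumPy and SciPy are available, starting from the least-squares (trend-line) coefficients found above:

```python
import math
import numpy as np
from scipy.optimize import minimize

x = np.array([2 * math.cos(math.pi * i / 8) + 1 for i in range(9)])
y = np.sqrt(1 + x**2) + 2.0**(-x)           # original function values

def total_abs_dev(c):
    """Solver's target cell: the sum of absolute deviations."""
    a, b, c0 = c
    return np.sum(np.abs(y - (a * x**2 + b * x + c0)))

# start from the least-squares coefficients (the trend-line values)
res = minimize(total_abs_dev, x0=[0.363, -0.6901, 2.2203], method="Nelder-Mead")
avg = res.fun / len(x)                      # average deviation after minimization
```

The resulting average deviation falls below the least-squares value of ≈0.117, in line with the Solver result reported in the text.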

DEPENDENCIES

Excel has tools to predict processes. The approximation problem arises when it is necessary to analytically describe the phenomena that take place in life and are given in the form of tables containing the values ​​of the argument (arguments) and functions. If the dependence can be found, it is possible to make a prediction about the behavior of the system under study in the future and, possibly, choose the optimal direction for its development. Such an analytical function (also called a trend) can have a different form and a different level of complexity, depending on the complexity of the system and the desired representation accuracy.

10.1. Linear Regression

The simplest and most popular is the straight line approximation - linear regression.

Suppose we have factual information about profit levels Y as a function of the size X of capital investments, Y(X). Fig. 10.1-1 shows four such points M(X,Y). Suppose we also have reason to assume that this dependence is linear, i.e. has the form Y = A + BX. If we could find the coefficients A and B and build a straight line from them (for example, the one in the figure), we could then make informed assumptions about the dynamics of the business and the possible commercial state of the enterprise in the future. Obviously, we would be satisfied with a straight line that is as close as possible to the known points M(X,Y), i.e. one having the minimum sum of deviations, or sum of errors (in the figure, the deviations are shown by dotted lines). It is known that there is only one such line.

To solve this problem, the method of least squares of errors is used. The difference (error) between the known value Y1 of the point M1(X1,Y1) and the value Y(X1) calculated from the straight-line equation for the same X1 will be

D1 = Y1 − A − B·X1.

The same difference

for X = X2 will be D2 = Y2 − A − B·X2;

for X = X3: D3 = Y3 − A − B·X3;

and for X = X4: D4 = Y4 − A − B·X4.

Let us write an expression for the sum of squares of these errors

Φ(A,B) = (Y1 − A − B·X1)² + (Y2 − A − B·X2)² + (Y3 − A − B·X3)² + (Y4 − A − B·X4)²

or, abbreviated, Φ(A,B) = Σ(Yi − A − B·Xi)².

Here all Xi and Yi are known, and the coefficients A and B are unknown. The minimality conditions are the well-known relations

∂Φ(A,B)/∂A = 0 and ∂Φ(A,B)/∂B = 0.

Let us compute these derivatives (omitting the indices under the summation sign):

∂[Σ(Yi − A − B·Xi)²]/∂A = 2·Σ(Yi − A − B·Xi)·(−1),

∂[Σ(Yi − A − B·Xi)²]/∂B = 2·Σ(Yi − A − B·Xi)·(−Xi).

Equating them to zero and simplifying, we obtain the system of normal equations

ΣYi = n·A + B·ΣXi,
ΣXi·Yi = A·ΣXi + B·ΣXi²,

from which the coefficients A and B are found.
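Solving the normal equations for A and B gives a closed-form least-squares fit; a small self-contained Python sketch (the four data points here are hypothetical, not from the text):

```python
def fit_line(xs, ys):
    """Least-squares estimates of A and B in Y = A + B*X,
    obtained from the normal equations (derivatives of Phi set to zero)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

# four illustrative points (hypothetical data)
a, b = fit_line([1.0, 2.0, 3.0, 4.0], [3.1, 4.9, 7.2, 8.8])
```

For points lying exactly on a line, the formulas recover the line's coefficients exactly.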

Microsoft Excel (also sometimes referred to as Microsoft Office Excel) is a spreadsheet program created by Microsoft for Microsoft Windows, Windows NT, and Mac OS. It provides economic and statistical calculation capabilities, graphical tools, and, with the exception of Excel 2008 under Mac OS X, the VBA (Visual Basic for Applications) macro programming language. Microsoft Excel is part of Microsoft Office and today Excel is one of the most popular applications in the world.

In MS Excel, approximation of experimental data is carried out by plotting the data (a line chart when x takes abstract values, or a scatter plot when x has specific numeric values) and then selecting an appropriate approximating function (trend line).

The following function options are possible:

· Linear - y = ax + b. Usually used in the simplest cases, when the experimental data increase or decrease at a constant rate.

· Polynomial - y = a0 + a1·x + a2·x² + … + an·xⁿ, where n ≤ 6 and the ai are constants. Used to describe experimental data that alternately increase and decrease. The degree of the polynomial is determined by the number of extrema (maxima or minima) of the curve: a polynomial of the second degree can describe only one maximum or minimum, a polynomial of the third degree can have one or two extrema, a polynomial of the fourth degree no more than three extrema, and so on.

· Logarithmic - y = a·ln x + b, where a and b are constants and ln is the natural logarithm. Used to describe experimental data that first increase or decrease rapidly and then gradually stabilize.

· Power - y = b·xᵃ, where a and b are constants. Used for experimental data with a constantly increasing (or decreasing) growth rate. The data must not contain zero or negative values.

· Exponential - y = b·e^(ax), where a and b are constants and e is the base of the natural logarithm. Used to describe experimental data that rise or fall rapidly and then gradually stabilize. Often its use stems from theoretical considerations.

The closeness of the approximation of the experimental data by the selected function is estimated by the coefficient of determination (R²). Thus, if several types of approximating function are suitable, one can choose the function with the larger coefficient of determination (the one closer to 1).
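The coefficient of determination mentioned above is straightforward to compute by hand; a minimal Python sketch of the standard definition:

```python
def r_squared(ys, fitted):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    mean = sum(ys) / len(ys)
    ss_tot = sum((y - mean) ** 2 for y in ys)            # total sum of squares
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # residual sum
    return 1 - ss_res / ss_tot

# a perfect fit gives R^2 = 1; a fit no better than the mean gives R^2 = 0
assert r_squared([1, 2, 3], [1, 2, 3]) == 1.0
assert r_squared([1, 2, 3], [2, 2, 2]) == 0.0
```

This is the same R² that Excel displays next to a trend line when the corresponding checkbox is enabled.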

Approximation of experimental data in MathCAD

MathCAD is a system with its own specific programming language that makes it easy to solve mathematical problems. It is a computer algebra system from the class of computer-aided design systems, oriented toward the preparation of interactive documents with calculations and visual support; it is easy to use and to apply for teamwork. MathCAD is well suited for mathematical modeling: solving various kinds of equations and reporting on the results.

There are very few data types in MathCAD compared to general-purpose programming languages ​​- only three. Let us briefly characterize them (they will be described in more detail later).

Numbers (both real and complex): MathCAD stores all numbers in the same format (double-precision floating point), without separating them into integers and reals. Each number occupies 64 bits; the mantissa holds about 17 significant decimal digits, and the decimal exponent must lie between −307 and 307. Complex numbers at the implementation level are a pair of real numbers, and in many kinds of calculation a number is treated as complex even if it has no imaginary part. These features apply only to numerical calculations; when working in symbolic mode, entirely different levels of precision are available.

Strings: In general, any text enclosed in quotation marks. In practice, strings are mainly used to specify error messages that occur when running programs in the MathCAD language.

Arrays: These include matrices, vectors, tensors and tables - any ordered sequence of elements of arbitrary type. Range variables can also be classified as data of this type. The so-called dimensional variables, i.e. units of measurement, which are of great importance in science and technology, form a separate group. There is no Boolean data type in MathCAD; logical operators and functions use the numbers 0 and 1 to represent true and false.

There are several functions in MathCAD that allow you to perform regression using dependencies that are most often encountered in practice. There are only six such functions in MathCAD. Here are some of them:

· expfit(vx,vy,vg) - regression with the exponential function y = a·e^(b·x) + c.

· sinfit(vx,vy,vg) - regression with the sinusoidal function y = a·sin(x + b) + c.

· pwrfit(vx,vy,vg) - regression with the power function y = a·x^b + c.

The listed functions use a three-parameter approximating function that is non-linear in its parameters. When calculating the optimal values of the three parameters of the regression function by the least squares method, it becomes necessary to solve a complex system of three nonlinear equations. Such a system can often have multiple solutions. Therefore, the MathCAD functions that perform regression with three-parameter dependencies take an additional argument vg: a three-component vector containing approximate values of the parameters a, b and c in the approximating function. An incorrect choice of the elements of vg can lead to an unsatisfactory regression result.

MathCAD also has tools for carrying out regression of the most general form: any functions can be used as approximants, and the optimal values of any of their parameters, both linear and non-linear, can be found. When the regression function is linear in all its parameters, i.e. is a linear combination of fixed functions, regression can be done with the built-in function linfit(vx,vy,F). The argument F is a vector function whose elements form the linear combination that best approximates the given sequence of points. The result of linfit is a vector of linear coefficients; each element is the coefficient of the function at the corresponding position in F. Thus, to obtain the regression function, it is enough to take the scalar (dot) product of these two vectors.
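The behavior of linfit can be sketched in Python: a list of callables stands in for MathCAD's vector function F, and a least-squares solve finds the coefficients of the linear combination. A hypothetical illustration, not MathCAD code:

```python
import numpy as np

def linfit(vx, vy, F):
    """Analog of MathCAD's linfit(vx, vy, F): least-squares coefficients
    of a linear combination of the basis functions in F."""
    A = np.column_stack([f(vx) for f in F])     # design matrix
    coeffs, *_ = np.linalg.lstsq(A, vy, rcond=None)
    return coeffs

# example: recover y = 2 + 3*sin(x) using the basis (1, sin x)
vx = np.linspace(0, 6, 50)
vy = 2 + 3 * np.sin(vx)
F = [lambda t: np.ones_like(t), np.sin]
c = linfit(vx, vy, F)                           # ≈ [2, 3]
```

The regression function is then the dot product of `c` with the basis values, just as the text describes for MathCAD.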

The average approximation error is the average relative deviation of the calculated values from the actual ones:

Ā = (1/n) · Σ |(yᵢ − y_x) / yᵢ| · 100%,

where y_x is the value calculated from the equation.

An average approximation error of up to 15% indicates a reasonably well-chosen model equation.
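The formula above is easy to turn into a helper function; a minimal Python sketch:

```python
def avg_approx_error(actual, calculated):
    """Average approximation error, in percent:
    (1/n) * sum(|y - y_x| / y) * 100."""
    n = len(actual)
    return 100 / n * sum(abs(y - yx) / y for y, yx in zip(actual, calculated))

# relative errors of 10% and 5% average to 7.5%
err = avg_approx_error([100, 200], [90, 210])
```

A result under 15% would, by the rule of thumb above, count the model as acceptable.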

For seven territories of the Ural region in 199X, the values of two attributes are known.

Required:
1. To characterize the dependence of y on x, calculate the parameters of the following functions:
a) linear;
b) power;
c) exponential;
d) equilateral hyperbola (you also need to figure out how to pre-linearize this model).
2. Evaluate each model using the average approximation error Ā and Fisher's F-test.

We carry out the solution using the online calculator Linear regression equation.
a) linear regression equation;
Using the graphical method.
This method is used to visualize the form of communication between the studied economic indicators. To do this, a graph is plotted in a rectangular coordinate system, the individual values ​​of the resulting attribute Y are plotted along the ordinate axis, and the individual values ​​of the factor attribute X are plotted along the abscissa axis.
The set of points of the resulting and factor attributes is called the correlation field.


Based on the correlation field, one can hypothesize (for the general population) that the relationship between all possible values ​​of X and Y is linear.
The linear regression equation is y = bx + a + ε
Here ε is a random error (deviation, perturbation).
Reasons for the existence of a random error:
1. Not including significant explanatory variables in the regression model;
2. Aggregation of variables. For example, the total consumption function is an attempt at a general expression of the totality of individual spending decisions of individuals. This is only an approximation of individual relationships that have different parameters.
3. Incorrect description of the model structure;
4. Wrong functional specification;
5. Measurement errors.
Since the deviations εᵢ for each particular observation i are random and their values in the sample are unknown:

1) from the observations xᵢ and yᵢ only estimates of the parameters α and β can be obtained;

2) these estimates are, respectively, the values a and b, which are random in nature, since they correspond to a random sample.

The estimated regression equation (built from the sample data) then has the form y = bx + a + e, where eᵢ are the observed values (estimates) of the errors εᵢ, and a and b are, respectively, the estimates of the parameters α and β of the regression model that we want to find.
To estimate the parameters α and β, the least squares method (LSM) is used.

For paired linear regression, the least-squares estimates are:

b = (n·Σxy − Σx·Σy) / (n·Σx² − (Σx)²),   a = (Σy − b·Σx) / n.

Substituting the column sums from the table (n = 7, Σx = 384.3, Σy = 405.2, Σxy = 22162.34, Σx² = 21338.41):
We get b = -0.35, a = 76.88
Regression equation:
y = -0.35 x + 76.88

x      y      x²         y²         x·y        y(x)    (yᵢ−ȳ)²   (y−y(x))²   |y−y(x)|/y
45.1   68.8   2034.01    4733.44    3102.88    61.28   119.12    56.61       0.1094
59.0   61.2   3481.00    3745.44    3610.80    56.47   10.98     22.40       0.0773
57.2   59.9   3271.84    3588.01    3426.28    57.09   4.06      7.90        0.0469
61.8   56.7   3819.24    3214.89    3504.06    55.50   1.41      1.44        0.0212
58.8   55.0   3457.44    3025.00    3234.00    56.54   8.33      2.36        0.0279
47.2   54.3   2227.84    2948.49    2562.96    60.55   12.86     39.05       0.1151
55.2   49.3   3047.04    2430.49    2721.36    57.78   73.71     71.94       0.1720
Σ: 384.3   405.2   21338.41   23685.76   22162.34   405.20   230.47   201.71   0.5699

Note: y(x) values ​​are found from the resulting regression equation:
y(45.1) = -0.35*45.1 + 76.88 = 61.28
y(59) = -0.35*59 + 76.88 = 56.47
... ... ...
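The regression coefficients can be recomputed directly from the seven (x, y) pairs in the table; a short Python check:

```python
# the seven (x, y) observations from the table
xs = [45.1, 59.0, 57.2, 61.8, 58.8, 47.2, 55.2]
ys = [68.8, 61.2, 59.9, 56.7, 55.0, 54.3, 49.3]

n = len(xs)
sx, sy = sum(xs), sum(ys)
sxy = sum(x * y for x, y in zip(xs, ys))
sxx = sum(x * x for x in xs)

# least-squares estimates for y = b*x + a
b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
a = (sy - b * sx) / n
print(round(b, 2), round(a, 2))   # -0.35 76.88, as in the text
```

The computed coefficients match the equation y = -0.35x + 76.88 obtained above.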

Approximation error

Let us evaluate the quality of the regression equation using the average approximation error, the average relative deviation of the calculated values from the actual ones:

Ā = (1/n) · Σ |y − y(x)| / y · 100% = (0.5699 / 7) · 100% ≈ 8.1%.

Since the error is less than 15%, this equation can be used as a regression.

F-statistics. Fisher's criterion.

The statistical significance of the equation as a whole is assessed as follows:

1. A null hypothesis is put forward that the equation as a whole is statistically insignificant: H₀: R² = 0 at the significance level α.

2. The actual value of the F-criterion is determined; for paired linear regression F = R²/(1 − R²) · (n − 2). Here R² = 1 − 201.71/230.47 ≈ 0.125, so F ≈ 0.71.

3. The table value is determined from Fisher distribution tables for the given significance level, taking into account that the number of degrees of freedom for the total sum of squares (larger variance) is 1, and the number of degrees of freedom for the residual sum of squares (smaller variance) in linear regression is n − 2.

4. If the actual value of the F-criterion is less than the table value, there is no reason to reject the null hypothesis. Otherwise, the null hypothesis is rejected and the alternative hypothesis about the statistical significance of the equation as a whole is accepted with probability (1 − α).

The table value with degrees of freedom k1 = 1 and k2 = 5 is Fkp = 6.61. Since F ≈ 0.71 < Fkp, the coefficient of determination is not statistically significant (the found estimate of the regression equation is statistically unreliable).
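The F-test for this model can be verified numerically from the two sums of squares in the table; a small Python sketch (the table value 6.61 is taken from the text):

```python
# sums of squares from the table: total 230.47, residual 201.71
ss_tot, ss_res, n = 230.47, 201.71, 7

r2 = 1 - ss_res / ss_tot              # coefficient of determination
F = r2 / (1 - r2) * (n - 2)           # F-statistic for paired regression
F_table = 6.61                        # F(0.05; 1; 5) from the tables

print(round(r2, 3), round(F, 2))      # 0.125 0.71
assert F < F_table                    # no reason to reject H0
```

Since the actual F falls well below the critical value, the linear equation is not statistically significant for this sample.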

b) power regression;
The solution is carried out using the Nonlinear regression service. Select the power function y = a·x^b.
c) exponential regression;
d) model of an equilateral hyperbola.
System of normal equations (the equilateral hyperbola y = b/x + a is pre-linearized by the substitution z = 1/x).

For our data, the system of equations has the form
7a + 0.1291b = 405.2
0.1291a + 0.0024b = 7.51
Express a from the first equation and substitute it into the second equation
We get b = 1054.67, a = 38.44
Regression equation:
y = 1054.67 / x + 38.44
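The hyperbola fit can be checked from the raw data: substitute z = 1/x and apply ordinary least squares to (z, y). A Python sketch (note that the system is ill-conditioned, so small rounding in the sums shifts the coefficients noticeably):

```python
# fit y = b/x + a by linearizing with z = 1/x
xs = [45.1, 59.0, 57.2, 61.8, 58.8, 47.2, 55.2]
ys = [68.8, 61.2, 59.9, 56.7, 55.0, 54.3, 49.3]

zs = [1 / x for x in xs]
n = len(xs)
sz, sy = sum(zs), sum(ys)
szz = sum(z * z for z in zs)
szy = sum(z * y for z, y in zip(zs, ys))

# normal equations: n*a + sz*b = sy ; sz*a + szz*b = szy
b = (n * szy - sz * sy) / (n * szz - sz * sz)
a = (sy - b * sz) / n
# close to the text's b = 1054.67, a = 38.44
# (small differences come from rounding of the intermediate sums)
```

This reproduces the equation y = 1054.67/x + 38.44 to within rounding.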
Approximation error.
Let us evaluate the quality of the regression equation using the average approximation error.

Since the error is less than 15%, this equation can be used as a regression.

Fisher's criterion.
The significance of the regression model is checked using the Fisher F-test, the calculated value of which is found as the ratio of the variance of the initial series of observations of the studied indicator and the unbiased estimate of the variance of the residual sequence for this model.
If the calculated value with k1=(m) and k2=(n-m-1) degrees of freedom is greater than the tabular value at a given significance level, then the model is considered significant.

where m is the number of factors in the model.
The assessment of the statistical significance of paired linear regression is carried out according to the following algorithm:
1. A null hypothesis is put forward that the equation as a whole is statistically insignificant: H₀: R² = 0 at the significance level α.
2. Next, the actual value of the F-criterion is determined:

F = R²/(1 − R²) · (n − m − 1)/m,

where m = 1 for pairwise regression.
Table value of the criterion with degrees of freedom k1=1 and k2=5, Fkp = 6.61
Since the actual value F < Fkp, the coefficient of determination is not statistically significant (the found estimate of the regression equation is statistically unreliable).

Among the various forecasting methods, one cannot fail to mention approximation. With its help, you can make approximate calculations and compute planned indicators by replacing the original objects with simpler ones. In Excel, this method can also be used for forecasting and analysis. Let's look at how it can be applied in this program with built-in tools.

The name of this method comes from the Latin word proxima - "nearest". Approximation consists of simplifying and smoothing known indicators, arranging them along a trend that underlies them. But this method can be used not only for forecasting, but also for studying existing results: after all, approximation is essentially a simplification of the initial data, and a simplified version is easier to study.

The main tool with which smoothing is carried out in Excel is the construction of a trend line. The idea is that, on the basis of existing indicators, the graph of the function is extended into future periods. The main purpose of the trend line, as you might guess, is to make forecasts or to identify a general tendency.

But it can be built using one of five types of approximation:

  • linear;
  • exponential;
  • logarithmic;
  • polynomial;
  • power.

Let's consider each of the options in more detail separately.
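Under the hood, most of these trend types reduce to a straight-line fit after a suitable transformation of the data. A hypothetical Python sketch on synthetic data (not the workbook's values):

```python
import numpy as np

# transformations that linearize the non-polynomial trend types:
#   exponential y = b*e^(a*x):  ln y = ln b + a*x      (line in x, ln y)
#   logarithmic y = a*ln x + b:                         (line in ln x, y)
#   power       y = b*x^a:      ln y = ln b + a*ln x   (line in ln x, ln y)
x = np.linspace(1, 10, 30)
y = 2.0 * np.exp(0.5 * x)                 # synthetic exponential data

a, log_b = np.polyfit(x, np.log(y), 1)    # straight line through (x, ln y)
b = np.exp(log_b)
# recovers a = 0.5, b = 2.0 for this noise-free example
```

Excel performs an analogous computation internally when it fits exponential, logarithmic and power trend lines.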

Method 1: Linear Smoothing

First of all, let's consider the simplest variant of approximation, namely using a linear function. We will dwell on it in more detail, since here we cover the general points common to all the methods (plotting and some other details) that we will not repeat when considering the subsequent options.

First of all, let's build a graph, on the basis of which we will carry out the smoothing procedure. To build a graph, let's take a table in which the cost of a unit of output produced by the enterprise and the corresponding profit in a given period are indicated on a monthly basis. The graphical function that we will build will display the dependence of the increase in profit on the decrease in the cost of production.


The smoothing used in this case is described by the linear function:

y = a·x + b, where a and b are constants.

In our particular case, the formula takes the following form:

y=-0.1156x+72.255

The approximation reliability value (R²) is equal to 0.9418, which is a fairly acceptable result, characterizing the smoothing as reliable.

Method 2: Exponential Approximation

Now let's look at the exponential type of approximation in Excel.


The general form of the smoothing function is:

y = b·e^(a·x),

where e is the base of the natural logarithm and a and b are constants.

In our particular case, the formula took the following form:

y=6282.7*e^(-0.012*x)

Method 3: logarithmic smoothing

Now it is the turn to consider the logarithmic approximation method.


In general, the smoothing formula looks like this:

y = a·ln(x) + b,

where ln is the natural logarithm; hence the name of the method.

In our case, the formula takes the following form:

y=-62.81ln(x)+404.96

Method 4: Polynomial Smoothing

The time has come to consider the method of polynomial smoothing.


The formula that describes this type of smoothing has taken the following form:

y=8E-08x^6-0.0003x^5+0.3725x^4-269.33x^3+109525x^2-2E+07x+2E+09

Method 5: power smoothing

In conclusion, consider the power approximation method in Excel.


This method is effectively used in cases of intensive change of function data. It is important to note that this option is applicable only if the function and argument do not take negative or zero values.

The general formula describing this method is:

y = b·xᵃ, where a and b are constants.

In our particular case, it looks like this:

y = 6E+18x^(-6.512)

As you can see, on the specific data used in this example, polynomial approximation with a sixth-degree polynomial showed the highest reliability (R² = 0.9844), while the linear method had the lowest (R² = 0.9418). But this does not mean that the same ranking will hold for other examples: the efficiency of the above methods can vary significantly depending on the specific function for which the trend line is built. A method that is the most efficient for one function may not be optimal in another situation.

If, based on the above recommendations, you cannot immediately determine which type of approximation suits your case, it makes sense to try all the methods. After plotting the trend lines and comparing their reliability values (R²), you can choose the best option.