
Kendall correlation calculation. Rank correlation and Kendall's rank correlation coefficient

To calculate Kendall's rank correlation coefficient τ_k, the data must be ranked in ascending order by one of the attributes and the corresponding ranks determined for the second attribute. Then, for each rank of the second feature, the number of subsequent ranks greater than it is counted, and the sum of these counts is found.

Kendall's rank correlation coefficient is determined by the formula

τ_k = 4R / (n(n − 1)) − 1, where R = Σ_i R_i

and R_i is the number of ranks of the second variable, among positions i + 1, ..., n, that are greater than the rank in position i.

Tables of percentage points of the distribution of the coefficient τ_k exist, allowing one to test the hypothesis that the correlation coefficient is significant.

For large sample sizes, critical values of τ_k are not tabulated and have to be calculated from approximate formulas, based on the fact that under the null hypothesis H0: τ_k = 0 and for large n the random variable

z = τ_k / sqrt( 2(2n + 5) / (9n(n − 1)) )

is distributed approximately according to the standard normal law.
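A minimal Python sketch of this computation, using made-up ranks; scipy's kendalltau is assumed available as a cross-check:

    from math import sqrt
    from scipy.stats import kendalltau

    x = [1, 2, 3, 4, 5, 6, 7, 8]          # ranks by the first feature
    y = [3, 1, 2, 5, 4, 8, 6, 7]          # hypothetical ranks of the second feature

    # Order pairs by x, then for each position count later y-ranks that are larger.
    ys = [yi for _, yi in sorted(zip(x, y))]
    n = len(ys)
    R = sum(sum(1 for later in ys[i + 1:] if later > ys[i]) for i in range(n))

    tau = 4 * R / (n * (n - 1)) - 1
    z = tau / sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))   # approx. N(0, 1) under H0

    tau_check, p = kendalltau(x, y)       # agrees with tau when there are no ties
    print(tau, z, tau_check)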

40. Relationship between traits measured in nominal or ordinal scales

The problem often arises of checking the independence of two features measured on a nominal or ordinal scale.

Suppose two features X and Y, with r and s levels respectively, are measured on a set of objects. The results of such observations are conveniently presented in the form of a table called a contingency table.

In the table, u_i (i = 1, ..., r) and v_j (j = 1, ..., s) are the values taken by the features, and n_ij is the number of objects, out of the total number n, for which the feature X took the value u_i and the feature Y took the value v_j.

We introduce the following quantities:

n_i· is the number of objects that have the value u_i;

n_·j is the number of objects that have the value v_j.

In addition, there are the obvious equalities

Σ_{i=1}^{r} n_i· = Σ_{j=1}^{s} n_·j = n.

The discrete random variables X and Y are independent if and only if

P(X = u_i, Y = v_j) = P(X = u_i) · P(Y = v_j)

for all pairs i, j.

Therefore, the conjecture about the independence of the discrete random variables X and Y can be written as

H0: P(X = u_i, Y = v_j) = P(X = u_i) · P(Y = v_j) for all i, j.

As the alternative, as a rule, one uses the hypothesis

H1: P(X = u_i, Y = v_j) ≠ P(X = u_i) · P(Y = v_j) for at least one pair (i, j).

The validity of the hypothesis H0 should be judged from the sample frequencies n_ij of the contingency table. In accordance with the law of large numbers, as n → ∞ the relative frequencies are close to the corresponding probabilities:

n_ij / n ≈ P(X = u_i, Y = v_j),  n_i· / n ≈ P(X = u_i),  n_·j / n ≈ P(Y = v_j).

To test the hypothesis H0, the statistic

χ² = Σ_{i=1}^{r} Σ_{j=1}^{s} (n_ij − n_i· n_·j / n)² / (n_i· n_·j / n)

is used; if the hypothesis is true, it has the χ² distribution with rs − (r + s − 1) = (r − 1)(s − 1) degrees of freedom.

The χ² independence criterion rejects the hypothesis H0 at significance level α if

χ² > χ²_{α; (r−1)(s−1)},

where χ²_{α; (r−1)(s−1)} is the α-percentage point of the χ² distribution.
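A sketch of the same test on a hypothetical contingency table, using scipy, whose chi2_contingency implements exactly this statistic:

    import numpy as np
    from scipy.stats import chi2_contingency

    table = np.array([[20, 30, 25],       # n_ij: made-up counts, r = 2 rows,
                      [15, 35, 40]])      # s = 3 columns

    chi2, p, dof, expected = chi2_contingency(table, correction=False)
    print(chi2, p, dof)                   # dof = (r - 1)(s - 1) = 2
    # Reject H0 (independence) at level alpha if p < alpha.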


41. Regression analysis. Basic concepts of regression analysis

For a mathematical description of the statistical relationships between the studied variables, the following problems should be solved:

• choose a class of functions in which it is advisable to seek the best (in a certain sense) approximation of the dependence of interest;

• find estimates of the unknown values of the parameters entering the equations of the sought dependence;

• establish the adequacy of the obtained equation of the sought dependence;

• identify the most informative input variables.

Together, the listed tasks constitute the subject of regression analysis.

The regression function (or regression) is the dependence of the mathematical expectation of one random variable on the value taken by another random variable, which forms a two-dimensional system of random variables with the first.

Let there be a system of random variables (X, Y). Then the regression function of Y on X is

f(x) = M[Y | X = x],

and the regression function of X on Y is

φ(y) = M[X | Y = y].

The regression functions f(x) and φ(y) are not mutually invertible unless the relationship between X and Y is functional.

For an n-dimensional vector with coordinates X_1, X_2, ..., X_n, the conditional mathematical expectation of any component can be considered. For example, for X_1,

M[X_1 | X_2 = x_2, ..., X_n = x_n],

called the regression of X_1 on X_2, ..., X_n.

For a complete definition of the regression function, it is necessary to know the conditional distribution of the output variable for fixed values ​​of the input variable.

Since such information is not available in a real situation, one is usually limited to searching for a suitable approximating function f_a(x) for f(x), based on statistical data of the form (x_i, y_i), i = 1, ..., n. These data are the result of n independent observations y_1, ..., y_n of the random variable Y at the values x_1, ..., x_n of the input variable; regression analysis assumes that the values of the input variable are specified exactly.

The problem of choosing the best approximating function f_a(x), though central to regression analysis, has no formalized procedures for its solution. Sometimes the choice is made from analysis of the experimental data, more often from theoretical considerations.

If the regression function is assumed to be sufficiently smooth, the approximating function f_a(x) can be represented as a linear combination of a set of linearly independent basis functions ψ_k(x), k = 0, 1, ..., m−1, i.e., in the form

f_a(x; θ) = Σ_{k=0}^{m−1} θ_k ψ_k(x),

where m is the number of unknown parameters θ_k (in general this number is unknown and is refined while the model is being built).

Such a function is linear in parameters, therefore, in the case under consideration, we speak of a regression function model that is linear in parameters.

Then the problem of finding the best approximation to the regression function f(x) reduces to finding the parameter values θ for which f_a(x; θ) is most adequate to the available data. One of the methods for solving this problem is the least squares method.

42. Least squares method

Let the set of points (x_i, y_i), i = 1, ..., n, lie on the plane along some straight line.

Then, as the function f_a(x) approximating the regression function f(x) = M[Y|x], it is natural to take a linear function of the argument x:

f_a(x) = θ_0 + θ_1 x.

That is, the basis functions chosen here are ψ_0(x) ≡ 1 and ψ_1(x) ≡ x. Such a regression is called simple linear regression.

If the set of points (x_i, y_i), i = 1, ..., n, lies along some curve, then as f_a(x) it is natural to try a family of exponential curves, for example

f_a(x) = θ_0 e^{θ_1 x}.

This function is nonlinear in the parameters θ_0 and θ_1; however, a functional transformation (in this case, taking the logarithm) reduces it to a new function f′_a(x) that is linear in the parameters:

f′_a(x) = ln f_a(x) = ln θ_0 + θ_1 x.
43. Simple Linear Regression

The simplest regression model is the simple (one-dimensional, single-factor, paired) linear model, which has the form

y_i = a + b x_i + ε_i, i = 1, ..., n,

where the ε_i are mutually uncorrelated random variables (errors) with zero mathematical expectation and the same variance σ², and a and b are constant coefficients (parameters) that must be estimated from the measured response values y_i.

To find the estimates a and b of the linear regression parameters, determining the straight line that best fits the experimental data,

ŷ = a + b x,

the method of least squares is applied.

According to least squares, the parameter estimates a and b are found from the condition of minimizing the sum of squared vertical deviations of the values y_i from the fitted regression line:

D = Σ_{i=1}^{n} (y_i − a − b x_i)² → min.

To minimize D we set the partial derivatives with respect to a and b equal to zero:

∂D/∂a = −2 Σ (y_i − a − b x_i) = 0,
∂D/∂b = −2 Σ x_i (y_i − a − b x_i) = 0.

As a result, we obtain the following system of equations for finding the estimates a and b:

n a + b Σ x_i = Σ y_i,
a Σ x_i + b Σ x_i² = Σ x_i y_i.

Solving these two equations gives:

b = (n Σ x_i y_i − Σ x_i Σ y_i) / (n Σ x_i² − (Σ x_i)²),  a = ȳ − b x̄.

The expressions for the parameter estimates a and b can also be represented as:

b = Σ (x_i − x̄)(y_i − ȳ) / Σ (x_i − x̄)²,  a = ȳ − b x̄.

Then the empirical equation of the regression line of Y on X can be written as:

ŷ = a + b x, or equivalently ŷ − ȳ = b (x − x̄).
The unbiased estimate of the variance σ² of the deviations of the values y_i from the fitted regression line is given by the expression

s_0² = Σ_{i=1}^{n} (y_i − ŷ_i)² / (n − 2).
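A minimal sketch of the least-squares formulas above, on made-up illustrative data:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])   # hypothetical responses

    n = len(x)
    b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    a = y.mean() - b * x.mean()

    y_hat = a + b * x
    s0_sq = ((y - y_hat) ** 2).sum() / (n - 2)  # unbiased estimate of sigma^2
    print(a, b, s0_sq)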

44. Checking the Significance of the Regression Line

The estimate b ≠ 0 that has been found may be a realization of a random variable whose mathematical expectation is zero; that is, it may turn out that there is actually no regression dependence.

To deal with this situation, one should test the hypothesis H0: b = 0 against the competing hypothesis H1: b ≠ 0.

The test of the significance of the regression line can be carried out using analysis of variance.

Consider the following identity:

y_i − ŷ_i = (y_i − ȳ) − (ŷ_i − ȳ).

The quantity y_i − ŷ_i = e_i is called the residual and is the difference between two quantities:

• the deviation of the observed value (response) from the overall mean response;

• the deviation of the predicted response value ŷ_i from the same mean.

The identity written above can be rewritten as

y_i − ȳ = (ŷ_i − ȳ) + (y_i − ŷ_i).

Squaring both sides and summing over i, we get:

Σ_i (y_i − ȳ)² = Σ_i (ŷ_i − ȳ)² + Σ_i (y_i − ŷ_i)².
Here the quantities are named:

• SK_n, the total sum of squares, equal to the sum of squared deviations of the observations from the mean of the observations;

• SK_p, the sum of squares due to regression, equal to the sum of squared deviations of the regression-line values from the mean of the observations;

• SK_0, the residual sum of squares, equal to the sum of squared deviations of the observations from the values of the regression line.

Thus the spread of the Y values about their mean can be attributed, to some extent, to the fact that not all observations lie on the regression line. If they all did, the sum of squares about the regression, SK_0, would be zero. It follows that the regression is significant if the sum of squares SK_p is substantially greater than the sum of squares SK_0.

The calculations for the regression significance test are arranged in an ANOVA table.

If the errors ε_i are distributed according to the normal law, then, when the hypothesis H0: b = 0 is valid, the statistic

F = (SK_p / 1) / (SK_0 / (n − 2))

is distributed according to Fisher's law with 1 and n − 2 degrees of freedom.

The null hypothesis is rejected at significance level α if the calculated value of the statistic F is greater than the α-percentage point f_{1; n−2; α} of the Fisher distribution.
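A sketch of this significance test, continuing the illustrative data of the earlier least-squares snippet:

    import numpy as np
    from scipy.stats import f as fisher_f

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
    n = len(x)

    b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    a = y.mean() - b * x.mean()
    y_hat = a + b * x

    sk_p = ((y_hat - y.mean()) ** 2).sum()      # sum of squares due to regression
    sk_0 = ((y - y_hat) ** 2).sum()             # residual sum of squares
    F = (sk_p / 1) / (sk_0 / (n - 2))

    alpha = 0.05
    f_crit = fisher_f.ppf(1 - alpha, 1, n - 2)  # alpha-point of F(1, n-2)
    print(F, f_crit, F > f_crit)                # True -> reject H0: b = 0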

45. Checking the adequacy of the regression model. Residual method

The adequacy of the constructed regression model is understood as the fact that no other model gives a significant improvement in predicting the response.

If all the response values are obtained at different values of x, i.e., there are no multiple response values obtained at the same x_i, then only a limited check of the adequacy of the linear model can be carried out. The basis for such a check is the residuals

d_i = y_i − ŷ_i, i = 1, ..., n,

the deviations from the fitted pattern.

Since X is a one-dimensional variable, the points (x_i, d_i) can be plotted on a plane as a so-called residual plot. Such a representation sometimes makes it possible to find some regularity in the behavior of the residuals. In addition, analysis of the residuals allows one to examine the assumption about the error distribution.

When the errors are distributed according to the normal law and an a priori estimate of their variance σ² is available (an estimate obtained from previously performed measurements), a more accurate assessment of the adequacy of the model is possible.

Fisher's F-test can then be used to check whether the residual variance s_0² differs significantly from the a priori estimate. If it is significantly greater, there is inadequacy and the model should be revised.

If there is no prior estimate of σ², but the response Y has been measured two or more times at the same values of X, these repeated observations can be used to obtain another estimate of σ² (the first being the residual variance). Such an estimate is said to represent "pure" error, because when x is the same for two or more observations, only random variation can affect the results and create scatter between them.

The resulting estimate turns out to be a more reliable estimate of the variance than the estimate obtained by other methods. For this reason, when planning experiments, it makes sense to set up experiments with repetitions.

Suppose there are m different values of X: x_1, x_2, ..., x_m, and for each of these values x_i there are n_i observations of the response Y. In total, n = Σ_{i=1}^{m} n_i observations are obtained.

Then the simple linear regression model can be written as:

y_ij = a + b x_i + ε_ij, i = 1, ..., m, j = 1, ..., n_i.


Let us find the variance of the "pure" errors. It is the pooled estimate of the variance σ² obtained by treating the response values y_ij at x = x_i as a sample of size n_i. As a result, the variance of the "pure" errors is

s_pe² = Σ_{i=1}^{m} Σ_{j=1}^{n_i} (y_ij − ȳ_i)² / (n − m),

where ȳ_i is the mean of the n_i responses observed at x_i. This variance serves as an estimate of σ² regardless of whether the fitted model is correct.

Let us show that the sum of squares of the "pure" errors is part of the residual sum of squares (the sum of squares entering the expression for the residual variance). The residual for the j-th observation at x_i can be written as

y_ij − ŷ_i = (y_ij − ȳ_i) + (ȳ_i − ŷ_i).

Squaring both sides of this equality and summing over j and over i, we get:

Σ_i Σ_j (y_ij − ŷ_i)² = Σ_i Σ_j (y_ij − ȳ_i)² + Σ_i n_i (ȳ_i − ŷ_i)².

On the left of this equality is the residual sum of squares. The first term on the right is the sum of squares of the "pure" errors; the second term can be called the lack-of-fit sum of squares. The latter has m − 2 degrees of freedom, so the lack-of-fit variance is

s_lof² = Σ_i n_i (ȳ_i − ŷ_i)² / (m − 2).

The statistic of the criterion for testing the hypothesis H0 (the simple linear model is adequate) against the hypothesis H1 (the simple linear model is inadequate) is the random variable

F = s_lof² / s_pe².

If the null hypothesis is true, F has a Fisher distribution with m − 2 and n − m degrees of freedom. The hypothesis of linearity of the regression line should be rejected at significance level α if the obtained value of the statistic is greater than the α-percentage point of the Fisher distribution with m − 2 and n − m degrees of freedom.
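A sketch of this lack-of-fit test with hypothetical repeated observations (m = 3 distinct x values):

    import numpy as np
    from scipy.stats import f as fisher_f

    # x value -> repeated responses at that x (made-up numbers)
    data = {1.0: [2.0, 2.3], 2.0: [3.8, 4.1, 4.0], 3.0: [6.1, 5.8]}

    xs = np.concatenate([[x] * len(v) for x, v in data.items()])
    ys = np.concatenate(list(data.values()))
    n, m = len(ys), len(data)

    b = ((xs - xs.mean()) * (ys - ys.mean())).sum() / ((xs - xs.mean()) ** 2).sum()
    a = ys.mean() - b * xs.mean()

    ss_pe = sum(((np.array(v) - np.mean(v)) ** 2).sum() for v in data.values())
    ss_lof = sum(len(v) * (np.mean(v) - (a + b * x)) ** 2 for x, v in data.items())

    F = (ss_lof / (m - 2)) / (ss_pe / (n - m))
    print(F, fisher_f.ppf(0.95, m - 2, n - m))  # reject linearity if F exceeds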

46. Checking the adequacy of the regression model (see 45). ANOVA

47. Checking the adequacy of the regression model (see 45). Coefficient of determination

Sometimes, to characterize the quality of the regression line, the sample coefficient of determination R² is used, showing what part (fraction) of the total sum of squares SK_n is accounted for by the sum of squares due to regression, SK_p:

R² = SK_p / SK_n = 1 − SK_0 / SK_n.
The closer R² is to one, the better the regression approximates the experimental data and the closer the observations lie to the regression line. If R² = 0, the changes in the response are entirely due to the influence of unaccounted factors, and the regression line is parallel to the x-axis. In the case of simple linear regression, the coefficient of determination R² equals the square of the correlation coefficient, r².

The maximum value R² = 1 can be achieved only when all observations are made at different values of x. If the data contain repeated experiments, R² cannot reach unity no matter how good the model is.
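A sketch computing R² for the illustrative fit used earlier, with numpy's corrcoef as a cross-check:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

    b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    a = y.mean() - b * x.mean()
    y_hat = a + b * x

    sk_n = ((y - y.mean()) ** 2).sum()          # total sum of squares
    sk_p = ((y_hat - y.mean()) ** 2).sum()      # regression sum of squares
    r2 = sk_p / sk_n
    print(r2, np.corrcoef(x, y)[0, 1] ** 2)     # equal for simple linear regression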

48. Confidence Intervals for Simple Linear Regression Parameters

Just as the sample mean is an estimate of the true mean (the population mean), the sample parameters a and b of the regression equation are nothing more than estimates of the true regression coefficients. Different samples give different estimates of the mean, just as different samples give different estimates of the regression coefficients.

Assuming that the errors ε_i are described by the normal law, the parameter estimate b has a normal distribution with parameters

M[b] = b,  D[b] = σ² / Σ (x_i − x̄)².

Since the parameter estimate a is a linear combination of independent normally distributed quantities, it also has a normal distribution, with mean and variance

M[a] = a,  D[a] = σ² Σ x_i² / (n Σ (x_i − x̄)²).
In this case, the (1 − α) confidence interval for the variance σ², taking into account that the ratio (n − 2) s_0² / σ² is distributed according to the χ² law with n − 2 degrees of freedom, is determined by the expression

(n − 2) s_0² / χ²_{α/2; n−2} < σ² < (n − 2) s_0² / χ²_{1−α/2; n−2}.

49. Confidence intervals for the regression line. Confidence interval for dependent variable values

We usually do not know the true values of the regression coefficients a and b; we only know their estimates. In other words, the true regression line may run higher or lower, be steeper or shallower, than the one constructed from the sample data. We have calculated confidence intervals for the regression coefficients; a confidence region can also be calculated for the regression line itself.

Suppose that for simple linear regression it is necessary to construct a (1 − α) confidence interval for the mathematical expectation of the response Y at the value x = x_0. This mathematical expectation is a + b x_0, and its estimate is

ŷ_0 = a + b x_0.

Since a = ȳ − b x̄, it follows that ŷ_0 = ȳ + b (x_0 − x̄).

The obtained estimate of the mathematical expectation is a linear combination of uncorrelated normally distributed quantities and therefore also has a normal distribution, centered at the true value of the conditional mathematical expectation, with variance

D[ŷ_0] = σ² (1/n + (x_0 − x̄)² / Σ (x_i − x̄)²).

Therefore, the confidence interval for the regression line at each value x_0 can be represented as

ŷ_0 ± t_{n−2; α/2} · s_0 · sqrt(1/n + (x_0 − x̄)² / Σ (x_i − x̄)²).

As can be seen, the confidence interval is narrowest at x_0 equal to the mean value of x and widens as x_0 "moves away" from the mean in either direction.

To obtain a set of joint confidence intervals suitable for the entire regression function, along its whole length, t_{n−2; α/2} in the expression above must be replaced by sqrt(2 f_{2; n−2; α}), where f_{2; n−2; α} is the α-percentage point of the Fisher distribution with 2 and n − 2 degrees of freedom.

One of the factors limiting the application of criteria based on the assumption of normality is sample size. As long as the sample is sufficiently large (for example, 100 or more observations), the sampling distribution can be assumed normal even if there is no certainty that the distribution of the variable in the population is normal. If the sample is small, however, these criteria should be used only when there is confidence that the variable really is normally distributed, and there is no way to test this assumption in a small sample.

The use of criteria based on the assumption of normality is also limited by the scale of measurement (see the chapter Basic concepts of data analysis). Statistical methods such as the t-test and regression assume that the original data are continuous. There are situations, however, where the data are merely ranked (measured on an ordinal scale) rather than measured exactly.

A typical example is given by the ratings of sites on the Internet: the first position is taken by the site with the maximum number of visitors, the second position by the site with the maximum number of visitors among the remaining sites (with the first site removed), and so on. Knowing the ratings, we can say that the number of visitors to one site is greater than the number of visitors to another, but not by how much. Imagine five sites, A, B, C, D, E, occupying the top five places. Suppose that in the current month the order was A, B, C, D, E, and in the previous month D, E, A, B, C. The question is whether there have been significant changes in the site ratings or not. In this situation we obviously cannot use the t-test to compare the two groups of data and must move into the area of specific probabilistic calculations (and any statistical criterion contains a probabilistic calculation!). We reason as follows: how likely is it that the difference between the two arrangements is due to purely random causes, or is the difference too large to be explained by pure chance? In this reasoning we use only the ranks, or permutations, of the sites and make no use of any specific form of the distribution of the number of visitors to them.

For the analysis of small samples and of data measured on weak scales, nonparametric methods are used.

A quick tour of nonparametric procedures

Essentially, for each parametric criterion there is at least one nonparametric alternative.

In general, these procedures fall into one of the following categories:

  • distinction criteria for independent samples;
  • distinction criteria for dependent samples;
  • assessment of the degree of dependence between the variables.

In general, the approach to statistical criteria in data analysis should be pragmatic and not burdened with unnecessary theoretical reasoning. With the STATISTICA package on your computer, you can easily apply several criteria to your data. Knowing about some of the pitfalls of the methods, you will choose the right solution through experimentation. The development of the plot is quite natural: if you need to compare the values of two variables, you use the t-test. However, it should be remembered that it is based on the assumptions of normality and of equal variances in each group. When these assumptions fail, nonparametric tests come to the rescue; they are especially useful for small samples.

The development of the t-test leads to analysis of variance, which is used when the number of compared groups is more than two. The corresponding development of nonparametric procedures leads to a nonparametric analysis of variance, although it is significantly poorer than the classical analysis of variance.

To assess dependence, or, to put it somewhat pompously, the degree of closeness of the connection, the Pearson correlation coefficient is calculated. Strictly speaking, its application has limitations associated, for example, with the type of scale on which the data are measured and with nonlinearity of the dependence; therefore, as an alternative, nonparametric, or so-called rank, correlation coefficients are also used, which apply, for example, to ranked data. If the data are measured on a nominal scale, it is natural to present them in contingency tables, which are analyzed using Pearson's chi-square test with its various variations and corrections for accuracy.

So, in essence, there are only a few types of criteria and procedures that you need to know and be able to use, depending on the specifics of the data. You need to determine which criterion should be applied in a particular situation.

Nonparametric methods are most appropriate when sample sizes are small. If there is a lot of data (for example, n> 100), it often doesn't make sense to use nonparametric statistics.

If the sample size is very small (for example, n = 10 or less), then the significance levels for those nonparametric tests that use the normal approximation can only be considered as rough estimates.

Differences between independent groups. If there are two samples (for example, men and women) that need to be compared with respect to some mean value, for example mean blood pressure or the blood leukocyte count, the t-test for independent samples can be used.

Nonparametric alternatives to this test are the Wald-Wolfowitz runs test and the Mann-Whitney U test.

Geometric mean

The geometric mean is calculated as exp( Σ ln(x_i) / n ), where x_i is the i-th value and n is the number of observations. If the variable contains negative values or zero (0), the geometric mean cannot be calculated.

Harmonic mean

The harmonic mean is sometimes used to average frequencies. It is calculated by the formula HM = n / Σ (1 / x_i), where HM is the harmonic mean, n is the number of observations, and x_i is the value of observation number i. If the variable contains zero (0), the harmonic mean cannot be calculated.

Variance and standard deviation

Sample variance and standard deviation are the most commonly used measures of variability (variation) in data. The variance is calculated as the sum of the squares of the deviations of the values ​​of the variable from the sample mean, divided by n-1 (but not by n). The standard deviation is calculated as the square root of the variance estimate.

Range

The range of a variable is an indicator of variability, calculated as the maximum minus the minimum.

Interquartile range

The interquartile range is, by definition, the upper quartile minus the lower quartile (the 75th percentile minus the 25th percentile). Since the 75th percentile (upper quartile) is the value to the left of which 75% of the cases lie, and the 25th percentile (lower quartile) is the value to the left of which 25% of the cases lie, the interquartile range is the interval around the median that contains 50% of the cases (variable values).

Skewness

Skewness is a characteristic of the shape of the distribution. The distribution is skewed to the left if the skewness value is negative, and skewed to the right if it is positive. The skewness of the standard normal distribution is 0. Skewness is associated with the third moment and is defined as

skewness = n · M_3 / [(n − 1)(n − 2) s³],

where M_3 = Σ (x_i − x̄)³, s³ is the standard deviation raised to the third power, and n is the number of observations.

Kurtosis

Kurtosis is a characteristic of the shape of a distribution, namely a measure of the sharpness of its peak (relative to the normal distribution, whose kurtosis is 0). As a rule, distributions with a sharper peak than the normal have positive kurtosis, and distributions whose peak is less sharp than that of the normal distribution have negative kurtosis. Kurtosis is associated with the fourth moment and is determined by the formula:

kurtosis = [n(n + 1) M_4 − 3 M_2² (n − 1)] / [(n − 1)(n − 2)(n − 3) s⁴],

where M_j = Σ (x_i − x̄)^j, s⁴ is the standard deviation raised to the fourth power, and n is the number of observations.
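A sketch of these adjusted skewness and kurtosis formulas, cross-checked against scipy's bias-corrected versions on hypothetical data:

    import numpy as np
    from scipy.stats import skew, kurtosis

    x = np.array([2.0, 3.5, 3.9, 4.1, 4.4, 5.0, 6.8, 9.2])
    n = len(x)
    s = x.std(ddof=1)
    m2 = ((x - x.mean()) ** 2).sum()
    m3 = ((x - x.mean()) ** 3).sum()
    m4 = ((x - x.mean()) ** 4).sum()

    skw = n * m3 / ((n - 1) * (n - 2) * s ** 3)
    krt = (n * (n + 1) * m4 - 3 * m2 ** 2 * (n - 1)) / (
        (n - 1) * (n - 2) * (n - 3) * s ** 4)

    print(skw, skew(x, bias=False))             # the two agree
    print(krt, kurtosis(x, bias=False))         # excess kurtosis, same convention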

The needs of economic and social practice require the development of methods for quantitatively describing processes that allow the accurate registration not only of quantitative but also of qualitative factors. Provided that the values of the qualitative characteristics can be ordered, or ranked, according to the degree of decrease (increase) of the characteristic, it is possible to assess the closeness of the relationship between the qualitative characteristics. Qualitative means a feature that cannot be measured exactly but allows objects to be compared with one another and hence arranged in order of decreasing or increasing quality. The real content of measurements on rank scales is the order in which the objects are arranged according to the severity of the measured feature.

For practical purposes the use of rank correlation is very helpful. For example, if a high rank correlation is established between two qualitative features of products, it is enough to control the products by only one of the features, which makes control cheaper and faster.

As an example, consider the connection between the availability of commercial output at a number of enterprises and the overhead costs of sales. In the course of 10 observations the following table was obtained:

Let us arrange the values of X in ascending order, assigning each value its ordinal number (rank):


Let us build a table in which the observed pairs X and Y are written together with their ranks:

Denoting the difference between the ranks as d_i, we write the formula for calculating Spearman's sample correlation coefficient:

ρ = 1 − 6 Σ d_i² / (n (n² − 1)),

where n is the number of observations, which is also the number of pairs of ranks.

Spearman's coefficient has the following properties:

If there is a complete direct relationship between the qualitative features X and Y, in the sense that the ranks of the objects coincide for all values of i, then Spearman's sample correlation coefficient is 1. Indeed, in this case every d_i = 0, and substituting into the formula we get ρ = 1.

If there is a complete inverse relationship between the qualitative features X and Y, in the sense that rank i of one feature corresponds to rank n − i + 1 of the other, then Spearman's sample correlation coefficient is −1.

Indeed, in this case d_i = 2i − n − 1 and Σ d_i² = n(n² − 1)/3.

Substituting this value into the formula for the Spearman correlation coefficient, we get −1.

If there is neither a complete direct nor a complete inverse relationship, Spearman's sample correlation coefficient lies between −1 and 1, and the closer its value is to 0, the weaker the relationship between the features.

For the example above, let us find the value of ρ; to do this, we complete the table with the values of d_i and d_i²:
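A sketch of the Spearman computation on hypothetical paired observations:

    from scipy.stats import rankdata, spearmanr

    x = [12, 7, 15, 9, 11, 5, 14, 8, 10, 6]     # made-up feature values
    y = [8, 5, 14, 7, 12, 4, 15, 9, 10, 3]

    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    d_sq = sum((a - b) ** 2 for a, b in zip(rx, ry))
    rho = 1 - 6 * d_sq / (n * (n ** 2 - 1))

    rho_check, p = spearmanr(x, y)              # agrees when there are no tied ranks
    print(rho, rho_check)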

Kendall's sample correlation coefficient. The relationship between two qualitative features can also be assessed using Kendall's rank correlation coefficient.

Let the ranks of the objects of a sample of size n be:

by the feature X: 1, 2, ..., n;

by the feature Y: y_1, y_2, ..., y_n.

Suppose that to the right of y_1 there are R_1 ranks greater than y_1, to the right of y_2 there are R_2 ranks greater than y_2, and so on. Let us introduce the notation for the sum of these counts:

R+ = R_1 + R_2 + ... + R_{n−1}.

Similarly, let R− denote the sum of the numbers of ranks lying to the right of each y_i but smaller than it.

Kendall's sample correlation coefficient is written by the formula

τ = 2S / (n(n − 1)), where S = R+ − R−

and n is the sample size.

Kendall's coefficient has the same properties as Spearman's coefficient:

If there is a complete direct relationship between the qualitative features X and Y, in the sense that the ranks of the objects coincide for all values of i, then Kendall's sample correlation coefficient is 1. Indeed, to the right of y_1 there are n − 1 larger ranks, to the right of y_2 there are n − 2, and so on, so R+ = (n − 1) + (n − 2) + ... + 1 = n(n − 1)/2 and R− = 0. Then S = n(n − 1)/2, and Kendall's coefficient is τ = 2S / (n(n − 1)) = 1.

If there is a complete inverse relationship, in the sense that rank i of one feature corresponds to rank n − i + 1 of the other, then Kendall's sample correlation coefficient is −1. Indeed, in this case there are no larger ranks to the right of any y_i, so R+ = 0, and likewise R− = n(n − 1)/2. Substituting S = −n(n − 1)/2 into the formula, we get τ = −1.

For a sufficiently large sample size, and for values of the rank correlation coefficients not close to 1, the approximate equality holds:

ρ ≈ (3/2) τ.

Kendall's coefficient gives a more conservative estimate of the correlation than Spearman's coefficient (the numerical value of τ is always smaller than that of ρ). Although the calculation of the coefficient τ is more laborious than that of the coefficient ρ, the former is easier to recalculate if a new term is added to the series.

An important advantage of the coefficient τ is that it can be used to determine the partial rank correlation coefficient, which makes it possible to assess the degree of "pure" interconnection between two rank features while eliminating the influence of a third:

τ_{12·3} = (τ_12 − τ_13 τ_23) / sqrt( (1 − τ_13²)(1 − τ_23²) ).

The significance of the rank correlation coefficients. When determining the strength of rank correlation from sample data, the following question must be considered: with what degree of reliability can one conclude that a correlation exists in the population, given that a certain sample rank correlation coefficient has been obtained? In other words, the significance of the observed rank correlations should be tested against the hypothesis that the two rankings under consideration are statistically independent.

With a relatively large sample size n, the significance of the rank correlation coefficients can be checked using the normal distribution table (Appendix Table 1). To test the significance of the Spearman coefficient ρ (for n > 20), calculate the value

t_ρ = ρ sqrt( (n − 2) / (1 − ρ²) ),

and to test the significance of the Kendall coefficient τ (for n > 10), calculate the value

t_τ = 3S / sqrt( n(n − 1)(2n + 5) / 2 ),

where S = R+ − R− and n is the sample size.

Next, the significance level α is set, the critical value t_cr(α, k) is determined from the table of critical points of Student's distribution, and the calculated value t_ρ or t_τ is compared with it. The number of degrees of freedom is taken as k = n − 2. If t_ρ or t_τ exceeds t_cr, the corresponding value of ρ or τ is considered significant.
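A sketch of these significance checks, with purely hypothetical sample values of ρ, τ and S:

    from math import sqrt
    from scipy.stats import t as student_t

    n = 25
    rho, tau, S = 0.52, 0.38, 114               # hypothetical; tau = 2S/(n(n-1))

    t_rho = rho * sqrt((n - 2) / (1 - rho ** 2))
    t_tau = 3 * S / sqrt(n * (n - 1) * (2 * n + 5) / 2)

    alpha = 0.05
    t_cr = student_t.ppf(1 - alpha / 2, n - 2)  # critical point, k = n - 2
    print(t_rho > t_cr, t_tau > t_cr)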

Fechner's correlation coefficient.

Finally, we should mention the Fechner coefficient, which characterizes the elementary degree of closeness of a connection and which it is advisable to use to establish the fact of a connection when there is little initial information. Its calculation is based on the direction of the deviations of the variants of each series from their arithmetic mean and on the consistency of the signs of these deviations for the two series whose relationship is measured.

This coefficient is determined by the formula

K_f = (n_a − n_b) / (n_a + n_b),

where n_a is the number of coincidences of the signs of the deviations of individual values from their arithmetic means, and n_b is the corresponding number of mismatches.

Fechner's coefficient can vary within −1.0 ≤ K_f ≤ +1.0.
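A sketch of the Fechner coefficient on two hypothetical series (observations whose deviation from the mean is exactly zero fall into neither count):

    def fechner(x, y):
        # Count sign agreements / disagreements of deviations from each mean.
        mx, my = sum(x) / len(x), sum(y) / len(y)
        n_a = sum(1 for a, b in zip(x, y) if (a - mx) * (b - my) > 0)
        n_b = sum(1 for a, b in zip(x, y) if (a - mx) * (b - my) < 0)
        return (n_a - n_b) / (n_a + n_b)

    print(fechner([2, 4, 6, 8, 10], [1, 3, 7, 6, 9]))   # 1.0 for these series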

Applied aspects of rank correlation. As already noted, the rank correlation coefficients can be used not only for qualitative analysis of the relationship between two rank features but also in determining the strength of the relationship between a rank feature and a quantitative one. In this case the values of the quantitative feature are ordered and assigned the corresponding ranks.

There are also situations where calculating the rank correlation coefficients is advisable for determining the strength of the relationship between two quantitative features. Thus, when the distribution of one of them (or both) deviates significantly from the normal distribution, determining the significance level of the sample correlation coefficient r becomes incorrect, whereas the rank coefficients ρ and τ are not subject to such restrictions when the significance level is determined.

Another situation of this kind arises when the relationship between two quantitative features is nonlinear (but monotonic). If the number of objects in the sample is small, or if the sign of the connection matters to the researcher, using the correlation ratio η may be inadequate here. Calculating a rank correlation coefficient allows these difficulties to be circumvented.

Practical part

Task 1. Correlation-regression analysis

Statement and formalization of the problem:

An empirical sample is given, compiled from a series of observations of the state of equipment (failures) and the quantity of manufactured products. The sample implicitly characterizes the relationship between the amount of failed equipment and the number of manufactured items. From the meaning of the sample it is clear that products are made on the equipment remaining in service, since the higher the percentage of failed equipment, the fewer items are produced. It is required to study the sample for a correlation-regression dependence, that is, to establish the form of the dependence and estimate the regression function (regression analysis), as well as to identify the relationship between the random variables and assess its closeness (correlation analysis). An additional task of the correlation analysis is to estimate the regression equation of one variable on the other. In addition, it is necessary to predict the number of products manufactured at a 30% equipment failure rate.

Let us formalize the given sample in a table, denoting the data "Equipment failure, %" as X and the data "Number of products" as Y:

Initial data. Table 1

From the physical meaning of the problem it can be seen that the number of manufactured products Y depends directly on the percentage of equipment failure, i.e., there is a dependence of Y on X. In regression analysis it is required to find the mathematical relationship (regression) connecting the values of X and Y. Unlike correlation analysis, regression analysis assumes that X acts as an independent variable, or factor, and Y as a variable dependent on it, or an effective feature. Thus, it is required to synthesize an adequate economic-mathematical model, i.e., to determine (find, select) the function Y = f(X) characterizing the relationship between X and Y, with which it will be possible to predict the value of Y at X = 30. This problem can be solved using correlation-regression analysis.

A brief overview of methods for solving correlation-regression problems and the rationale for the chosen solution method.

Regression analysis methods are subdivided into single-factor and multifactor ones according to the number of factors affecting the effective feature:

single-factor: the number of independent factors is 1, i.e. Y = F(X);

multifactor: the number of factors is greater than 1, i.e. Y = F(X_1, X_2, ..., X_m).

According to the number of dependent variables investigated (effective indicators), regression problems can likewise be divided into problems with one or with many effective indicators. In general, a problem with many effective features can be written as (Y_1, Y_2, ..., Y_k) = F(X_1, X_2, ..., X_m).

The method of correlation-regression analysis consists in finding the parameters of an approximating dependence of the form Y = f(X; a_0, a_1, ..., a_m).

Since only one independent variable appears in the problem above, i.e., the dependence on only one factor influencing the result is investigated, a study of single-factor dependence, or paired regression, should be applied.

If there is only one factor, the dependence is defined as Y = f(X).

The form of a specific regression equation depends on the choice of the function reflecting the statistical relationship between the factor and the effective indicator, and includes the following:

linear regression, an equation of the form y = a_0 + a_1 x;

parabolic, an equation of the form y = a_0 + a_1 x + a_2 x²;

cubic, an equation of the form y = a_0 + a_1 x + a_2 x² + a_3 x³;

hyperbolic, an equation of the form y = a_0 + a_1 / x;

semilogarithmic, an equation of the form y = a_0 + a_1 ln x;

exponential, an equation of the form y = a_0 e^(a_1 x);

power-law, an equation of the form y = a_0 x^(a_1).

Finding the function is reduced to determining the parameters of the regression equation and assessing the reliability of the equation itself. To determine the parameters, you can use both the least squares method and the least modulus method.

The first of them consists in requiring that the sum of the squared deviations of the empirical values y_i from the calculated values ŷ_i be minimal.

The method of least moduli consists in minimizing the sum of the absolute differences between the empirical values y_i and the calculated values ŷ_i.

To solve the problem we choose the least squares method, since it is the simplest and gives good estimates in terms of statistical properties.

The technology for solving the problem of regression analysis using the least squares method.

The type of dependence (linear, quadratic, cubic, etc.) between the variables can be determined by evaluating the deviation of the actual values of y from the calculated ones:

S = Σ (y_i − ŷ_i)²,

where y_i are the empirical values and ŷ_i the values calculated by the approximating function. Evaluating S for the various functions and choosing the smallest value, we select the approximating function.

The specific function is determined by finding its coefficients, which are obtained for each function as the solution of a certain system of equations:

linear regression, an equation of the form y = a_0 + a_1 x, with the system

n a_0 + a_1 Σ x_i = Σ y_i,
a_0 Σ x_i + a_1 Σ x_i² = Σ x_i y_i;

parabolic, an equation of the form y = a_0 + a_1 x + a_2 x², with the system

n a_0 + a_1 Σ x_i + a_2 Σ x_i² = Σ y_i,
a_0 Σ x_i + a_1 Σ x_i² + a_2 Σ x_i³ = Σ x_i y_i,
a_0 Σ x_i² + a_1 Σ x_i³ + a_2 Σ x_i⁴ = Σ x_i² y_i;

cubic, an equation of the form y = a_0 + a_1 x + a_2 x² + a_3 x³, with the analogous system of four normal equations involving sums up to Σ x_i⁶.

Having solved the system, we find the coefficients, which give the specific expression of the analytical function; from it the calculated values ŷ_i are found. We then have all the data needed to estimate the deviation S and analyze it for a minimum.

For a linear relationship, we estimate the closeness of the relationship between the factor X and the effective indicator Y by the correlation coefficient r:

r = Σ (x_i − x̄)(y_i − ȳ) / (n σ_x σ_y),

where

ȳ is the average value of the indicator;

x̄ is the average value of the factor;

y is the experimental value of the indicator;

x is the experimental value of the factor;

σ_x is the standard deviation in x;

σ_y is the standard deviation in y.

If the correlation coefficient is r = 0, the relationship between the features is considered insignificant or absent; if r = 1, the features are connected by a very close, functional relationship.

Using the Chaddock table, the closeness of the correlation between the features can be assessed qualitatively:

Chaddock table. Table 2

|r| from 0.1 to 0.3: weak; 0.3 to 0.5: moderate; 0.5 to 0.7: noticeable; 0.7 to 0.9: high; 0.9 to 0.99: very high.

For a nonlinear dependence, the correlation ratio η (0 ≤ η ≤ 1) and the correlation index R are determined, calculated from the dependence

R = sqrt( 1 − Σ (y_i − ŷ_i)² / Σ (y_i − ȳ)² ),

where ŷ_i is the value of the indicator calculated from the regression dependence.

As an estimate of calculation accuracy, we use the average relative approximation error

ε̄ = (1/n) Σ | (y_i − ŷ_i) / y_i | · 100%.

For high accuracy it lies in the range 0-12%.

To assess the choice of functional dependence, we use the coefficient of determination

R² = 1 − Σ (y_i − ŷ_i)² / Σ (y_i − ȳ)².
The coefficient of determination is used as a "generalized" measure of the quality of the selection of a functional model, since it expresses the ratio between the factorial and total variance, or rather the share of the factorial variance in the total.

To assess the significance of the correlation index R, Fisher's F-test is used. The actual value of the criterion is determined by the formula

F = (R² / (1 − R²)) · ((n − m) / (m − 1)),

where m is the number of parameters of the regression equation and n is the number of observations. This value is compared with the critical value F_cr, determined from the F-test table taking into account the accepted significance level α and the numbers of degrees of freedom k_1 = m − 1 and k_2 = n − m. If F > F_cr, the correlation index R is considered significant.

For the selected form of regression, the coefficients of the regression equation are calculated. For convenience, the calculation results are arranged in a table of the following structure (in general, the number of columns and their form depend on the type of regression):

Table 3

The solution of the problem.

Observations were made of the economic phenomenon - the dependence of the release of products on the percentage of equipment failure. A set of values ​​is obtained.

The selected values ​​are described in table 1.

We build a graph of empirical dependence for the given sample (Fig. 1)

From the form of the graph we determine that the analytical dependence can be represented by a linear function:

ŷ = a_0 + a_1 x.

Let's calculate the pairwise correlation coefficient to assess the relationship between X and Y:

Let's build an auxiliary table:

Table 4

We solve the system of normal equations to find the coefficients a_0 and a_1: expressing a_0 from the first equation and substituting it into the second, we obtain a_1; we then find a_0 and arrive at the form of the regression equation.

To assess the closeness of the found relationship, we use the correlation coefficient r:

According to the Chaddock table, we establish that for r = 0.90 the relationship between X and Y is very high, and therefore the reliability of the regression equation is also high. To estimate the accuracy of the calculations, we use the average relative approximation error:

We conclude that the obtained value provides a high degree of reliability for the regression equation.

For a linear relationship between X and Y, the index of determination equals the square of the correlation coefficient r: R² = r² = 0.81. Consequently, 81% of the total variation is explained by variation in the factor feature X.

To assess the significance of the correlation index R, which in the case of a linear relationship is equal in absolute value to the correlation coefficient r, Fisher's F-test is used. We determine the actual value by the formula

F = (R² / (1 − R²)) · ((n − m) / (m − 1)),

where m is the number of parameters of the regression equation and n is the number of observations; here n = 5 and m = 2.

Taking the accepted significance level α = 0.05 and the numbers of degrees of freedom k_1 = 1 and k_2 = 3, we obtain the critical tabular value F_cr. Since F > F_cr, the correlation index R is recognized as significant.

Let's calculate the predicted value Y at X = 30:

Let's build a graph of the found function:

Determine the error of the correlation coefficient from the standard deviation

σ_r = (1 − r²) / sqrt(n),

and then the normalized deviation

t_r = r / σ_r.

Since the ratio r / σ_r > 2, with a probability of 95% we can speak of the significance of the obtained correlation coefficient.

Problem 2. Linear optimization

Option 1.

The regional development plan calls for bringing into operation 3 oil fields with a total production volume of 9 million tons. At the first field the production volume is at least 1 million tons; at the second, 3 million tons; at the third, 5 million tons. To achieve this productivity it is necessary to drill at least 125 wells. For the implementation of this plan, 25 million rubles of capital investment (indicator K) and 80 km of pipes (indicator L) have been allocated.

It is required to determine the optimal (maximum) number of wells to ensure the planned productivity of each field. The initial data on the task are given in the table.

Initial data

The problem statement is given above.

Let us formalize the conditions and constraints specified in the problem. The goal of solving this optimization problem is to find the maximum value of oil production with the optimal number of wells for each field, taking into account the existing constraints on the problem.

The objective function, in accordance with the requirements of the problem, takes the form

Z = c_1 X_1 + c_2 X_2 + c_3 X_3 → max,

where X_1, X_2, X_3 are the numbers of wells for each field and c_i is the per-well output.

The problem's constraints concern:

the pipe-laying length;

the number of wells in each field;

the construction cost of one well.

Linear optimization problems are solved, for example, by the following methods:

Graphically

Simplex method

Using the graphical method is convenient only for linear optimization problems with two variables. With a larger number of variables, an algebraic apparatus is necessary. Let us consider a general method for solving linear optimization problems, the simplex method.

The simplex method is a typical example of the iterative computations used in solving most optimization problems. We consider iterative procedures of this kind, which provide the solution of problems by means of operations research models.

To solve the optimization problem by the simplex method, the number of unknowns x_i must be greater than the number of equations, i.e., the system of equations

A x = b    (1)

must satisfy the relation m < n, and the rank of the matrix A must be equal to m.

Let us denote the j-th column of the matrix A by A_j and the column of free terms by b.

A basic solution of system (1) is a solution obtained by setting n − m of the unknowns to zero and solving the system for the remaining m unknowns.

Briefly, the algorithm of the simplex method is described as follows:

An original constraint written as an inequality of the form ≤ (≥) can be represented as an equality by adding a slack variable to the left side of the constraint (or subtracting a surplus variable from the left side).

For example, a slack variable is introduced into the left side of an original constraint of the form a_1 X_1 + a_2 X_2 + a_3 X_3 ≤ L, as a result of which the original inequality turns into the equality a_1 X_1 + a_2 X_2 + a_3 X_3 + X_4 = L, X_4 ≥ 0.

If the original constraint limits the consumption of pipe, then the variable X_4 should be interpreted as the remainder, the unused part of this resource.

Maximizing the objective function is equivalent to minimizing the same function taken with the opposite sign. That is, in our case maximizing Z is equivalent to minimizing −Z.

A simplex table of the following form is compiled for the basic solution. It indicates: the cells that, after the problem has been solved, will contain the basic solution; the quotients obtained by dividing the free-term column by the resolving column; the additional multipliers used to zero out the table cells in the resolving column; the minimum value of the objective function −Z; and the values of the coefficients of the objective function at the unknowns.

A positive value is sought among the coefficients of the −Z row. If there is none, the problem is considered solved. Otherwise, a column of the table containing such a value is selected; it is called the resolving column. If there are no positive numbers among the elements of the resolving column, the problem is unsolvable because the objective function is unbounded on the set of its solutions. If positive numbers are present in the resolving column, go to step 5.

The quotient column is filled with fractions whose numerators are the elements of the free-term column and whose denominators are the corresponding elements of the resolving column. The smallest of all these values is selected, and the row with the smallest result is called the resolving row. At the intersection of the resolving row and the resolving column is the resolving element, which is highlighted in some way, for example with color.

Based on the first simplex table, the next one is compiled, in which:

the row vector leaving the basis is exchanged with the column vector entering it;

the resolving row is replaced by the same row divided by the resolving element;

each of the other rows of the table is replaced by its sum with the resolving row multiplied by a specially chosen additional factor, so as to obtain 0 in that row's cell of the resolving column.

With the new table, we return to step 4.
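For comparison with the hand simplex computation, a sketch using scipy's linprog with purely hypothetical coefficients (per-well outputs, pipe lengths, costs and bounds are made up); linprog minimizes, so the objective is negated, mirroring the −Z device described above:

    import numpy as np
    from scipy.optimize import linprog

    c = np.array([100.0, 80.0, 60.0])   # hypothetical per-well output for each field
    A_ub = [[0.2, 0.5, 1.0],            # km of pipe per well, total <= 80
            [0.15, 0.2, 0.25]]          # cost per well (mln rub), total <= 25
    b_ub = [80.0, 25.0]
    bounds = [(10, None)] * 3           # hypothetical minimum wells per field

    res = linprog(-c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    print(res.x, -res.fun)              # optimal well counts and Z_max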

The solution of the problem.

Based on the formulation of the problem, we have the following system of inequalities:

and the objective function

We transform the system of inequalities into a system of equations by introducing additional variables:

Let us reduce the objective function to its equivalent:

Let's build the original simplex table:

We choose the resolving column and calculate the quotient column, entering the values into the table. By the smallest of them, equal to 10, we determine the resolving row. At the intersection of the resolving row and the resolving column we find the resolving element, equal to 1. We fill in the part of the table with additional factors such that the resolving row, multiplied by them and added to the other rows of the table, produces 0 in the elements of the resolving column.

We compose the second simplex table:

In it we take the resolving column, calculate the quotient values, and enter them into the table. By the minimum we obtain the resolving row; the resolving element is 1. We find the additional factors and fill in the columns.

We create the following simplex table:

Similarly, we find the resolving column, resolving row and resolving element = 2. We build the following simplex table:

Since there are no positive values in the −Z row, this table is final. The first column gives the desired values of the unknowns, i.e., the optimal basic solution:

In this case, the value of the objective function is -Z = -8000, which is equivalent to Zmax = 8000. The problem is solved.

Task 3. Cluster analysis

Formulation of the problem:

Partition the objects based on the data given in the table. The solution method is to be chosen independently; a graph of the data is also to be plotted.

Option 1.

Initial data

Review of methods for solving this type of problems. Justification of the solution method.

Cluster analysis tasks are solved using the following methods:

The joining, or tree clustering, method is used to form clusters from "dissimilarities" or "distances between objects". These distances can be defined in one-dimensional or multidimensional space.

Two-way joining is used (relatively rarely) in circumstances where the data are interpreted not in terms of "objects" and "object properties" but in terms of observations and variables, both of which are expected to contribute simultaneously to the detection of meaningful clusters.

K-means method. Used when there is already a hypothesis regarding the number of clusters. You can tell the system to form exactly, for example, three clusters so that they are as different as possible. In general, the K-means method builds exactly K different clusters located at the greatest possible distances from each other.

There are the following ways to measure distances:

Euclidean distance. This is the most common type of distance. It is simply the geometric distance in multidimensional space and is computed as

d(x, y) = sqrt( Σ_i (x_i − y_i)² ).

Note that the Euclidean distance (and its square) is computed from the original, not standardized, data.

City-block (Manhattan) distance. This distance is simply the sum of the absolute differences across the coordinates. In most cases it leads to the same results as the ordinary Euclidean distance. Note, however, that for this measure the influence of individual large differences (outliers) is reduced (since they are not squared). The Manhattan distance is computed as

d(x, y) = Σ_i |x_i − y_i|.

Chebyshev distance. This distance can be useful when one wants to define two objects as "different" if they differ in any one coordinate (any one dimension). The Chebyshev distance is computed as

d(x, y) = max_i |x_i − y_i|.

Power distance. Sometimes one wishes to progressively increase or decrease the weight attached to a dimension on which the corresponding objects differ greatly. This can be achieved with the power distance, computed as

d(x, y) = ( Σ_i |x_i − y_i|^p )^(1/r),

where r and p are user-defined parameters. A few sample calculations show how this measure "works". The parameter p controls the progressive weighting of differences in individual coordinates, the parameter r the progressive weighting of large distances between objects. If both parameters r and p equal two, this distance coincides with the Euclidean distance.

Percent disagreement. This measure is used when the data are categorical. The distance is computed as

d(x, y) = (number of coordinates with x_i ≠ y_i) / (total number of coordinates).

To solve the problem we choose the joining method (tree clustering) as the one best matching the conditions and formulation of the problem (to partition the objects). The joining method, in turn, can use several variants of linkage rules:

Single linkage (nearest neighbor method). In this method the distance between two clusters is determined by the distance between the two closest objects (nearest neighbors) in the different clusters. That is, any two objects in the two clusters are closer to each other than the corresponding linkage distance. This rule tends, in a sense, to string objects together into clusters, and the resulting clusters are typically long "chains".

Complete linkage (farthest neighbor method). In this method the distance between clusters is determined by the largest distance between any two objects in the different clusters (i.e., the "farthest neighbors").

There are also many other clustering methods like these (e.g., unweighted pair-group averaging, weighted pair-group averaging, etc.).
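A sketch of complete-linkage (farthest neighbor) tree clustering with scipy, on hypothetical two-feature objects standing in for the task's table:

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import pdist

    points = np.array([[2.0, 3.0], [2.5, 3.5], [9.0, 8.0],
                       [8.5, 8.5], [9.5, 7.5], [9.0, 9.0]])

    d = pdist(points, metric="euclidean")     # condensed pairwise distance matrix
    tree = linkage(d, method="complete")      # farthest-neighbor merging
    labels = fcluster(tree, t=2, criterion="maxclust")  # cut into two clusters
    print(labels)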

Solution method technology. Calculation of indicators.

In the first step, when each object is a separate cluster, the distances between these objects are determined by the selected measure.

Since the task does not specify the units of measure for the characteristics, it is assumed that they are the same. Therefore, there is no need to normalize the initial data, so we immediately proceed to calculating the distance matrix.

The solution of the problem.

Let's build a graph of dependence according to the initial data (Fig. 2)

We take the ordinary Euclidean distance as the distance between objects. Then, according to the formula

d(i, j) = sqrt( Σ_{l=1}^{k} (x_il − x_jl)² ),

where l indexes the features and k is the number of features, the distance between objects 1 and 2 is computed:

We continue to calculate the remaining distances:

Let's build a table from the obtained values:

By the smallest distance, we combine elements 3, 6 and 5 into one cluster. We obtain the following table:

By the smallest distance, elements 3, 6, 5 and 4 are combined into one cluster. We obtain a table of two clusters:

The minimum distance is that between items 3 and 6, so elements 3 and 6 are combined into one cluster. We choose the maximum distance between the newly formed cluster and the remaining elements: for example, the distance between element 1 and cluster {3, 6} is max(13.34166, 13.60147) = 13.60147. We compose the following table:

In it, the minimum distance is the distance between clusters 1 and 2. Combining 1 and 2 into one cluster, we get:

Thus, using the farthest neighbor method, two clusters were obtained: {1, 2} and {3, 4, 5, 6}, the distance between which is 13.60147.

The problem has been solved.

Applications. Solving problems using software packages (MS Excel 7.0)

The problem of correlation and regression analysis.

We enter the initial data into the table (Fig. 1)

Select the menu item "Tools / Data Analysis". In the window that appears, select the line "Regression" (Fig. 2).

In the next window we set the input ranges for X and Y, set the confidence level to 95%, and direct the output to the separate sheet "Report Sheet" (Fig. 3).

After carrying out the calculation, we obtain the final data of the regression analysis on the "Report Sheet" sheet:

It also displays a scatter plot of the approximating function (the line fit plot):


The calculated values and deviations are shown in the table in the "Predicted Y" and "Residuals" columns, respectively.

Based on the initial data and deviations, a residual graph is plotted:

Optimization task


We enter the initial data as follows:

The unknowns X1, X2, X3 are entered into cells C9, D9, E9, respectively.

The objective function coefficients for X1, X2, X3 are entered into C7, D7, E7, respectively.

Enter the objective function into cell B11 as the formula: = C7 * C9 + D7 * D9 + E7 * E9.

Existing task restrictions

For the length of pipe laying:

we add to cells C5, D5, E5, F5, G5

The number of wells in each field:

X3 ≤ 100; we add these to cells C8, D8, E8.

Cost of construction of 1 well:

we add to cells C6, D6, E6, F6, G6.

The formula for calculating the total length C5 * C9 + D5 * D9 + E5 * E9 is placed in cell B5, the formula for calculating the total cost C6 * C9 + D6 * D9 + E6 * E9 is placed in cell B6.


We select "Tools / Solver" in the menu and enter the parameters for finding a solution in accordance with the initial data (Fig. 4):

Using the "Parameters" button, set the following parameters for finding a solution (Fig. 5):


After searching for a solution, we get a report on the results:

Microsoft Excel 8.0e Results Report

Report Created: 11/17/2002 1:28:30 AM

Target cell (Maximum): "Total production", with its initial value and result.

Adjustable cells: "Number of wells" for each of the three fields, with their initial values and results.

Constraints: "Length" is binding; "Project cost" is not binding; the "Number of wells" constraints are not binding for the first field and binding for the other two.

The first table shows the initial and final (optimal) value of the target cell, in which the objective function of the problem being solved was placed. In the second table we see the initial and final values of the variables being optimized, which are contained in the adjustable cells. The third table in the results report contains information about the constraints. The "Value" column contains the optimal values of the required resources and of the variables being optimized. The "Formula" column contains the limits on the consumed resources and on the variables being optimized, written as references to the cells containing these data. The "Status" column determines whether particular constraints are binding or not binding; here "binding" constraints are those realized in the optimal solution as strict equalities. The "Difference" column, for resource constraints, gives the remainder of the resources used, i.e., the difference between the required amount of resources and their availability.

Similarly, writing the result of the solution search into the "Sensitivity Report" form, we obtain the following tables:

Microsoft Excel 8.0e Sensitivity Report

Worksheet: [Solution of the optimization problem.xls] Solution of the optimization problem

Report Created: 11/17/2002 1:35:16 AM

Adjustable cells: for each "Number of wells", the resulting value, the reduced cost, the objective coefficient, and the allowable increase and decrease.

Constraints: for "Length" and "Project cost", the resulting value, the shadow price, the constraint right-hand side, and the allowable increase and decrease.

The sensitivity report contains information about the adjustable (optimized) variables and the model constraints. This information is related to the simplex method used in optimizing linear problems, described above as applied to the solution of the problem. It allows one to assess how sensitive the obtained optimal solution is to possible changes in the model parameters.

The first part of the report contains information about the adjustable cells holding the numbers of wells in the fields. The "Resulting value" column indicates the optimal values of the variables being optimized. The "Objective coefficient" column contains the initial values of the coefficients of the objective function. The next two columns illustrate the allowable increase and decrease of these coefficients without changing the found optimal solution.

The second part of the sensitivity report contains information on the constraints imposed on the variables being optimized. The first column shows the resource requirements of the optimal solution. The second contains the values of the shadow prices for the types of resources used. The last two columns contain data on the possible increase or decrease in the amount of available resources.

Clustering problem.

The step-by-step method for solving the problem is given above. Here are Excel tables illustrating the progress of solving the problem:

Nearest neighbor method

Solving the problem of cluster analysis - "NEAREST NEIGHBOR'S METHOD"

Initial data

where x1 is the volume of output;

x2 is the average annual cost of fixed industrial production assets

Farthest neighbor method

Solution of the cluster analysis problem - "FARTHEST NEIGHBOR METHOD"

Initial data

where x1 is the volume of output;

x2 is the average annual cost of fixed industrial production assets