
Introduction

1. History of "cluster analysis"

2. Terminology

2.1 Object and attribute

2.2 Distance between objects (metric)

2.3 Density and locality of clusters

2.4 Distance between clusters

3. Grouping methods

3.1 Features of hierarchical agglomerative methods

3.2 Features of iterative clustering methods

4. Feature clustering

5. Stability and quality of clustering

Conclusions

Bibliography

INTRODUCTION

"Cluster analysis is a set of mathematical methods designed to form relatively "remote" from each other groups of "close" objects according to information about distances or connections (measures of proximity) between them. It is similar in meaning to the terms: automatic classification, taxonomy, pattern recognition without a teacher." This definition of cluster analysis is given in the latest edition of the Statistical Dictionary. In fact, "cluster analysis" is a generalized name for a fairly large set of algorithms used to create a classification. A number of publications also use such synonyms for cluster analysis as classification and partitioning. Cluster analysis is widely used in science as a means of typological analysis. In any scientific activity, classification is one of the fundamental components, without which it is impossible to build and test scientific hypotheses and theories. Thus, in my work, I consider it necessary to consider the issues of cluster analysis (the basis of cluster analysis), as well as to consider its terminology and give some examples of using this method with data processing as my main goal.

1. HISTORY OF "CLUSTER ANALYSIS"

An analysis of domestic and foreign publications shows that cluster analysis is used in a wide variety of scientific fields: chemistry, biology, medicine, archeology, history, geography, economics, philology, etc. V.V. Nalimov's book "Probabilistic Model of Language" describes the use of cluster analysis in the study of 70 analytical samples. Most of the literature on cluster analysis has appeared during the last three decades, although the first works mentioning cluster methods appeared quite long ago. The Polish anthropologist Jan Czekanowski put forward the idea of "structural classification", which contained the main idea of cluster analysis: the identification of compact groups of objects.

In 1925, the Soviet hydrobiologist P.V. Terentyev developed the so-called "method of correlation pleiades", intended for grouping correlated features; this method gave impetus to the development of grouping methods based on graphs. The term "cluster analysis" was first proposed by R. Tryon. The English word "cluster" means "clot, bunch, group", which is why this type of analysis was originally so named. In the early 1950s, publications by R. Lewis, E. Fix and J. Hodges appeared on hierarchical cluster analysis algorithms. A noticeable impetus to work on cluster analysis was given by F. Rosenblatt's work on a recognition device (the perceptron), which laid the foundation for the theory of "unsupervised pattern recognition".

The impetus for the development of clustering methods was the book "Principles of Numerical Taxonomy", published in 1963 by two biologists, Robert Sokal and Peter Sneath. The authors proceeded from the premise that, to create effective biological classifications, the clustering procedure should use various indicators characterizing the organisms under study, assess the degree of similarity between these organisms, and place similar organisms in the same group. The resulting groups should be sufficiently "local", i.e. the similarity of objects (organisms) within groups should exceed the similarity between groups. Subsequent analysis of the identified groups, in the authors' opinion, can clarify whether these groups correspond to different biological species. Thus, Sokal and Sneath assumed that revealing the structure of the distribution of objects into groups helps to establish the process of formation of these structures, and that the differences and similarities between organisms of different clusters (groups) can serve as a basis for understanding the ongoing evolutionary process and elucidating its mechanism.

In the same years, many algorithms were proposed by such authors as J. MacQueen, G. Ball and D. Hall on k-means methods, and by G. Lance and W. Williams, N. Jardine and others on hierarchical methods. A significant contribution to the development of cluster analysis methods was made by domestic scientists: E.M. Braverman, A.A. Dorofeyuk, I.B. Muchnik, L.A. Rastrigin, Yu.I., and others. In particular, in the 1960s-70s, the numerous algorithms developed by the Novosibirsk mathematicians N.G. Zagoruiko, V.N. Elkina and G.S. Lbov enjoyed great popularity: such well-known algorithms as FOREL, BIGFOR, KRAB, NTTP, DRET, TRF, etc. On the basis of these packages the specialized OTEX software package was created. No less interesting software products, PPSA and Klass-Master, were created by the Moscow mathematicians S.A. Aivazyan, I.S. Enyukov and B.G. Mirkin.

Cluster analysis methods are available, to one extent or another, in most of the best-known domestic and foreign statistical packages: SIGAMD, DataScope, STADIA, SOMI, PNP-BIM, COPRA-2, SITO, SAS, SPSS, STATISTICA, BMDP, STATGRAPHICS, GENSTAT, S-PLUS, etc. Of course, in the 10 years since that review was published quite a lot has changed: new versions of many statistical programs have appeared, as have completely new programs using both new algorithms and the greatly increased power of computing technology. However, most statistical packages still use algorithms proposed and developed in the 1960s and 70s.

According to rough expert estimates, the number of publications on cluster analysis and its applications in various fields of knowledge doubles every three years. What are the reasons for such intense interest in this type of analysis? Objectively, there are three main reasons. The first is the emergence of powerful computing technology, without which cluster analysis of real data is practically infeasible. The second is that modern science relies ever more heavily on classification in its constructions, and this process keeps deepening, since in parallel there is an increasing specialization of knowledge, which is impossible without a reasonably objective classification.

The third reason is that the deepening of specialized knowledge inevitably increases the number of variables taken into account in the analysis of particular objects and phenomena. As a result, subjective classification, which previously relied on a fairly small number of features, often turns out to be unreliable, while objective classification, with an ever-growing set of object characteristics, requires complex clustering algorithms that can only be implemented on modern computers. It was these reasons that gave rise to the "cluster boom". However, among physicians and biologists cluster analysis has not yet become a sufficiently popular and routine research method.

2. TERMINOLOGY

2.1 OBJECT AND ATTRIBUTE

Let us first introduce the concepts of object and attribute. "Object", from the Latin objectum, means a thing or subject of study. In relation to chemistry and biology, by objects we will mean the specific subjects of research that are studied by physical, chemical and other methods: for example, samples, plants, animals, etc. The set of objects available to the researcher for study is called a sample, or sample set. The number of objects in such a set is called the sample size, usually denoted by the Latin letter "n" or "N".

An attribute (synonyms: property, variable, characteristic; English "variable") is a specific property of an object. These properties can be expressed as numeric or non-numeric values. For example, blood pressure (systolic or diastolic) is measured in millimeters of mercury, weight in kilograms, height in centimeters, and so on; such attributes are quantitative. In contrast to these continuous numerical characteristics (scales), a number of attributes can take discrete, discontinuous values. Such discrete attributes are usually divided into two groups.

1) The first group consists of rank, or ordinal, variables (scales). Such attributes are characterized by an ordering of their values. They include the stages of a disease, age groups, students' knowledge scores, the 12-point earthquake intensity scale, etc.

2) The second group of discrete attributes lacks such an order; these are called nominal (from the Latin nomen, "name") or classification attributes. Examples of such attributes are a patient's condition ("healthy" or "sick"), the patient's sex, or the observation period ("before treatment" and "after treatment"). In these cases such attributes are said to belong to the scale of names.

Data on objects and their attributes are usually arranged in an "object-property" or "object-attribute" matrix: a rectangular table consisting of the values of the attributes that describe the properties of the sample of observations under study. In this layout, each observation is recorded as a separate row consisting of the values of the attributes used, and each attribute is represented by a column consisting of the values of that attribute for all objects in the sample.
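To make the layout concrete, here is a minimal sketch of such a matrix in Python (the values are hypothetical, chosen only for illustration):

```python
import numpy as np

# A minimal "object-attribute" data matrix: rows are objects (observations),
# columns are attributes. The values are hypothetical.
X = np.array([
    [2.0, 7.0],   # object 1: attribute 1, attribute 2
    [5.0, 9.0],   # object 2
    [7.0, 10.0],  # object 3
])

n, v = X.shape  # n = sample size, v = number of attributes
print(f"{n} objects described by {v} attributes")
print("attribute 1 for all objects (a column):", X[:, 0])
print("all attributes of object 1 (a row):", X[0, :])
```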

2.2 DISTANCE BETWEEN OBJECTS (METRIC)

Let us introduce the concept of the "distance between objects", an integral measure of the similarity of objects to each other. The distance between objects in the feature space is a value d_ij that satisfies the following axioms:

1. d_ij ≥ 0 (non-negativity of the distance)

2. d_ij = d_ji (symmetry)

3. d_ij + d_jk ≥ d_ik (triangle inequality)

4. If d_ij = 0, then i = j (distinguishability of non-identical objects)

5. If i = j, then d_ij = 0 (indistinguishability of identical objects)

It is convenient to define the measure of proximity (similarity) of objects as a quantity inversely related to the distance between them. Numerous publications on cluster analysis describe more than 50 different ways of calculating the distance between objects. In addition to the term "distance", the literature often uses the term "metric", which implies the method used to calculate a particular distance. The most accessible for perception and understanding, in the case of quantitative features, is the so-called "Euclidean distance" or "Euclidean metric". The formula for calculating this distance is:

d_ij = √( ∑_{k=1..v} (x_ik − x_jk)² )

This formula uses the following notation:

· d_ij is the distance between the i-th and j-th objects;

· x_ik is the numerical value of the k-th variable for the i-th object;

· x_jk is the numerical value of the k-th variable for the j-th object;

· v is the number of variables that describe the objects.

Thus, for v = 2, when there are only two quantitative attributes, the distance d_ij equals the length of the hypotenuse of a right triangle connecting two points in a rectangular coordinate system; these two points correspond to the i-th and j-th observations of the sample. Often, instead of the usual Euclidean distance, its square d²_ij is used. In addition, in some cases a "weighted" Euclidean distance is used, in which weight coefficients are applied to the individual terms. To illustrate the Euclidean metric, we use a simple training example: the data matrix shown in the table below consists of 5 observations and two variables.

Table 1. Data matrix of five observations and two variables.

Using the Euclidean metric, we calculate the matrix of inter-object distances, consisting of the values d_ij, the distance between the i-th and j-th objects. In our case i and j index the objects (observations); since the sample size is 5, i and j can take values from 1 to 5, and the number of all possible pairwise distances is 5×5 = 25. Indeed, for the first object these are the distances 1-1, 1-2, 1-3, 1-4, 1-5; for object 2 there are likewise 5 distances: 2-1, 2-2, 2-3, 2-4, 2-5, and so on. However, the number of distinct distances is less than 25. First, by the property of indistinguishability of identical objects, d_ij = 0 for i = j: the distance between object No. 1 and itself is zero, and the same holds for all other cases with i = j. Second, by the symmetry property, d_ij = d_ji for any i and j: the distance between objects No. 1 and No. 2 equals the distance between objects No. 2 and No. 1.
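A short sketch of this computation in Python; since the values of Table 1 are not reproduced here, illustrative numbers are used in place of the original data:

```python
import numpy as np

# Pairwise Euclidean distance matrix d_ij for a 5-observation, 2-variable
# data matrix (hypothetical numbers in place of the original Table 1).
X = np.array([[1.0, 9.0],
              [2.0, 7.0],
              [6.0, 8.0],
              [12.0, 7.0],
              [13.0, 5.0]])

# d_ij = sqrt(sum_k (x_ik - x_jk)^2), computed for all pairs at once
diff = X[:, None, :] - X[None, :, :]   # shape (5, 5, 2)
D = np.sqrt((diff ** 2).sum(axis=-1))  # shape (5, 5)

assert np.allclose(D, D.T)             # symmetry: d_ij = d_ji
assert np.allclose(np.diag(D), 0.0)    # indistinguishability: d_ii = 0
print(np.round(D, 2))
```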

The expression for the Euclidean distance is a special case of the so-called generalized Minkowski power distance, in which some other value is used in the exponents instead of two; in the general case this value is denoted by the symbol "p", and for p = 2 we recover the usual Euclidean distance. The expression for the generalized Minkowski metric has the form:

d_ij = ( ∑_{k=1..v} |x_ik − x_jk|^p )^(1/p)

The choice of a specific value of the exponent "p" is made by the researcher himself.

A special case of the Minkowski distance is the so-called Manhattan distance, or "city-block" distance, corresponding to p = 1:

d_ij = ∑_{k=1..v} |x_ik − x_jk|

Thus, the Manhattan distance is the sum of the absolute differences of the corresponding features of the objects. Letting p tend to infinity, we obtain the "dominance" metric, or sup-metric:

d_ij = lim_{p→∞} ( ∑_{k=1..v} |x_ik − x_jk|^p )^(1/p),

which can also be written as d_ij = max_k |x_ik − x_jk|.
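A minimal sketch of the Minkowski family in Python, using two hypothetical observations; note how p = 1, p = 2 and the limit p → ∞ give the Manhattan, Euclidean and dominance metrics respectively:

```python
import numpy as np

def minkowski(x_i, x_j, p):
    # d_ij = ( sum_k |x_ik - x_jk|^p )^(1/p)
    return (np.abs(x_i - x_j) ** p).sum() ** (1.0 / p)

x_i = np.array([1.0, 9.0])  # two hypothetical observations
x_j = np.array([6.0, 8.0])

print(minkowski(x_i, x_j, p=1))  # p = 1: Manhattan (city-block) distance
print(minkowski(x_i, x_j, p=2))  # p = 2: Euclidean distance
print(np.abs(x_i - x_j).max())   # p -> infinity: dominance (sup) metric
```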

The Minkowski metric is in fact a large family of metrics that includes the most popular ones. However, there are also methods of calculating the distance between objects that differ fundamentally from the Minkowski metrics. The most important of these is the so-called Mahalanobis distance, which has rather specific properties. The expression for this metric is:

d_ij = √( (X_i − X_j)^T S⁻¹ (X_i − X_j) )

Here X_i and X_j are the column vectors of variable values for the i-th and j-th objects; the symbol T in (X_i − X_j)^T denotes the operation of vector transposition; S is the common within-group variance-covariance matrix; and the superscript −1 over S indicates that the matrix S must be inverted. Unlike the Minkowski and Euclidean metrics, the Mahalanobis distance, through the covariance matrix S, takes account of the correlations between the variables. When the variables are uncorrelated and standardized, the Mahalanobis distance coincides with the ordinary Euclidean distance.
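A sketch of the Mahalanobis computation in Python with hypothetical data; S is estimated from the sample with numpy, and the identity-matrix case illustrates the reduction to the Euclidean distance:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))  # hypothetical sample: 50 objects, 3 variables

S = np.cov(X, rowvar=False)   # variance-covariance matrix S
S_inv = np.linalg.inv(S)      # S^{-1}

def mahalanobis(x_i, x_j, S_inv):
    d = x_i - x_j
    # d_ij = sqrt( (X_i - X_j)^T S^{-1} (X_i - X_j) )
    return float(np.sqrt(d @ S_inv @ d))

print(mahalanobis(X[0], X[1], S_inv))
# With S equal to the identity matrix, the expression reduces to the
# ordinary Euclidean distance:
print(mahalanobis(X[0], X[1], np.eye(3)), np.linalg.norm(X[0] - X[1]))
```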

For dichotomous (two-valued) qualitative features, the Hamming distance is widely used:

d_ij = ∑_{k=1..v} [x_ik ≠ x_jk],

i.e. the number of mismatches between the values of the corresponding features of the i-th and j-th objects under consideration.
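A one-line sketch in Python, with hypothetical 0/1 feature vectors:

```python
# Hamming distance for dichotomous (0/1) features: the number of positions
# at which the two objects disagree. Hypothetical feature vectors.
x_i = [1, 0, 1, 1, 0]
x_j = [1, 1, 1, 0, 0]

d_ij = sum(a != b for a, b in zip(x_i, x_j))
print(d_ij)  # 2 mismatches
```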

2.3 DENSITY AND LOCALITY OF CLUSTERS

The main goal of cluster analysis is to find groups of mutually similar objects in the sample. Suppose that by one of the possible methods we have obtained such groups (clusters). Several important properties of clusters should be noted. One is the density of the distribution of points (observations) within a cluster. This property allows us to define a cluster as an accumulation of points in multidimensional space that is relatively dense compared with other regions of that space, which contain either no points at all or few observations; in other words, it describes how compact or, conversely, how sparse the cluster is. Despite the apparent self-evidence of this property, there is no unambiguous way to compute such a density indicator. The most successful indicator of the compactness, the density of "packing" of multidimensional observations in a given cluster, is the dispersion of the distances from the center of the cluster to its individual points. The smaller this dispersion, the closer the observations lie to the center and the denser the cluster; the larger it is, the sparser the cluster, with points lying both near the center and quite far from it.
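A sketch of this density indicator in Python for one hypothetical cluster:

```python
import numpy as np

# One hypothetical cluster of two-dimensional observations.
cluster = np.array([[1.0, 1.2], [0.8, 0.9], [1.1, 1.0], [0.9, 1.1]])

center = cluster.mean(axis=0)                     # cluster center (centroid)
dists = np.linalg.norm(cluster - center, axis=1)  # distances to the center
print("dispersion of distances:", dists.var())    # small value = dense cluster
print("cluster 'radius':", dists.max())           # one simple size measure
```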

The next property of clusters is their size. The main indicator of cluster size is its "radius". This property reflects the actual size of a cluster adequately only if the cluster is spherical (a hypersphere in multidimensional space); if clusters have elongated shapes, the concepts of radius or diameter no longer reflect their true size.

Another important property of clusters is their locality, or separability, which characterizes the degree of overlap and mutual remoteness of clusters in multidimensional space. For example, consider the distribution of three clusters in a space of new, integrated features in the figure below. Axes 1 and 2 were obtained by a special method from 12 features describing the reflective properties of different forms of erythrocytes studied by electron microscopy.

Figure 1

We see that cluster 1 has the smallest size, while clusters 2 and 3 are approximately equal in size. At the same time, the minimum density, and hence the maximum dispersion of distances, is characteristic of cluster 3. Moreover, cluster 1 is separated by sufficiently large regions of empty space from both cluster 2 and cluster 3, whereas clusters 2 and 3 partially overlap. It is also of interest that cluster 1 differs from clusters 2 and 3 much more along axis 1 than along axis 2, while clusters 2 and 3 differ from each other approximately equally along both axes. Obviously, for such a visual analysis the sample observations must be projected onto special axes in which the projections of the cluster elements are visible as separate accumulations.

2.4 DISTANCE BETWEEN CLUSTERS

In a broader sense, objects can be understood not only as the original subjects of research, represented in the "object-property" matrix as separate rows or as individual points in multidimensional feature space, but also as whole groups of such points united into a cluster by one algorithm or another. The question then arises of how to understand, and how to calculate, the distance between such accumulations of points (clusters). Here the variety of possibilities is even greater than for the distance between two observations in multidimensional space, since, unlike points, clusters occupy a volume of multidimensional space and consist of many points. Cluster analysis widely uses inter-cluster distances calculated by the nearest-neighbor principle, by the center of gravity, by the farthest neighbor, and by medians. Four methods are most widely used: single linkage, complete linkage, average linkage, and Ward's method. In the single linkage method, an object is attached to an existing cluster if at least one element of the cluster is at least as similar to the candidate as a given threshold. In the complete linkage method, an object is attached to a cluster only if its similarity to every element of the cluster is not less than a certain threshold. The average linkage method has several modifications that represent a compromise between single and complete linkage: the average similarity of the candidate with all objects of the existing cluster is calculated, and attachment occurs when this average reaches or exceeds a certain threshold. Most often the arithmetic mean of the similarities between the objects of the cluster and the candidate is used.
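The three linkage rules can be sketched in Python as follows, over a matrix D of pairwise object distances (A and B are the index sets of two clusters; the matrix values are hypothetical):

```python
import numpy as np

def single_link(D, A, B):
    return min(D[i, j] for i in A for j in B)   # nearest neighbor

def complete_link(D, A, B):
    return max(D[i, j] for i in A for j in B)   # farthest neighbor

def average_link(D, A, B):
    return float(np.mean([D[i, j] for i in A for j in B]))  # mean pairwise

D = np.array([[0.0, 2.2, 5.1],   # hypothetical pairwise object distances
              [2.2, 0.0, 4.0],
              [5.1, 4.0, 0.0]])
A, B = [0, 1], [2]               # two clusters given by object indices
print(single_link(D, A, B), complete_link(D, A, B), average_link(D, A, B))
```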

Many clustering methods differ in that their algorithms calculate different partition-quality functionals at each step. The popular Ward method is constructed so as to minimize the growth of the within-cluster dispersion of distances. At the first step each cluster consists of one object, so the within-cluster dispersion of distances is 0. The method merges those objects that give the minimum increment of dispersion, as a result of which it tends to generate hyperspherical clusters.

Attempts to classify cluster analysis methods lead to dozens, or even hundreds, of different classes. Such variety is generated by the large number of possible ways of calculating the distance between individual observations, by a no smaller number of methods for calculating the distance between clusters during clustering, and by the various estimates of the optimality of the final cluster structure.

The most widely used in popular statistical packages are two groups of cluster analysis algorithms: hierarchical agglomerative methods and iterative grouping methods.

3. GROUPING METHODS

3.1 FEATURES OF HIERARCHICAL AGGLOMERATIVE METHODS

In agglomerative hierarchical algorithms, which are used more often in real biomedical research, all objects (observations) are initially treated as separate, independent clusters consisting of a single element each. Without powerful computing technology, cluster analysis of real data of this kind is hardly feasible.

The choice of metric is made by the researcher. After the distance matrix has been calculated, the process of agglomeration (from the Latin agglomero, "I attach, I accumulate") begins, proceeding sequentially step by step. At the first step, the two initial observations (monoclusters) with the smallest distance between them are combined into one cluster consisting of two objects (observations). Thus, instead of the former N monoclusters (clusters consisting of one object), after the first step there are N−1 clusters, of which one contains two objects (observations) while N−2 still consist of a single object each. At the second step, the remaining N−1 clusters can be combined in various ways, since one of them already contains two objects. For this reason, two main questions arise:

· how to calculate the coordinates of such a cluster of two (and further more than two) objects;

· how to calculate the distance to such "poly-object" clusters from "monoclusters" and between "poly-object" clusters.

Ultimately, these questions determine the final structure of the resulting clusters (by the structure of clusters we mean the composition of the individual clusters and their relative position in multidimensional space). The variety of cluster analysis methods is generated precisely by the various combinations of metrics and of methods for calculating cluster coordinates and inter-cluster distances. At the second step, depending on the chosen methods for calculating the coordinates of a cluster consisting of several objects and for calculating inter-cluster distances, it is possible either to combine two further separate observations into a new cluster or to join one new observation to the cluster consisting of two objects. For convenience, most programs implementing agglomerative hierarchical methods can provide two main plots at the end of the run. The first is called a dendrogram (from the Greek dendron, "tree"); it reflects the process of agglomeration, the merging of individual observations into a single final cluster. Let us give an example of a dendrogram of 5 observations in two variables.

Chart 1

The vertical axis of such a graph is the axis of inter-cluster distance, and the numbers of the objects (the cases used in the analysis) are marked along the horizontal axis. This dendrogram shows that objects No. 1 and No. 2 are combined into one cluster first, since the distance between them is the smallest, equal to 1. The merger is displayed on the graph by a horizontal line connecting the vertical segments coming out of the points marked C_1 and C_2; note that this horizontal line passes exactly at the level of the inter-cluster distance of 1. At the second step, object No. 3, designated C_3, joins this cluster, which already includes two objects. The next step is the merger of objects No. 4 and No. 5, the distance between which equals 1.41. At the last step, the cluster of objects 1, 2 and 3 is combined with the cluster of objects 4 and 5. The graph shows that the distance between these two penultimate clusters (the final cluster includes all 5 objects) is much larger: the upper horizontal line connecting them passes at a level of approximately 7, whereas the level at which objects 4 and 5 were joined is 1.41.
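A sketch in Python using scipy; the five coordinates are hypothetical, chosen so that the first merges occur at distances 1 and about 1.41, as in the description above:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Five hypothetical observations in two variables, chosen so that objects
# 1 and 2 merge at distance 1 and objects 4 and 5 at about 1.41.
X = np.array([[0.0, 0.0],   # C_1
              [1.0, 0.0],   # C_2
              [0.0, 2.0],   # C_3
              [6.0, 6.0],   # C_4
              [7.0, 7.0]])  # C_5

Z = linkage(X, method="single", metric="euclidean")  # agglomeration steps
print(Z)  # each row: the two merged clusters, merge distance, cluster size

dendrogram(Z, labels=["C_1", "C_2", "C_3", "C_4", "C_5"])
plt.ylabel("intercluster distance")
plt.show()
```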

The dendrogram below was obtained by analyzing a real data set consisting of 70 processed chemical samples, each characterized by 12 features.

Chart 2

The graph shows that at the last step, when the last two clusters merge, the distance between them is about 200 units. It can also be seen that the first cluster includes far fewer objects than the second. Below is an enlarged section of the dendrogram in which the observation numbers, denoted C_65, C_58, etc., are clearly visible (left to right): 65, 58, 59, 64, 63, 57, 60, 62, 56, 44, 94, etc.

Chart 3. Enlarged portion of Chart 2 above

It can be seen that object 44 is a monocluster that joins the right-hand cluster at the penultimate step, after which, at the last step, all observations are combined into one cluster.

Another plot built in such procedures is the plot of the inter-cluster distance at each merging step. Below is such a plot for the dendrogram above.

Chart 4

A number of programs can also display, in tabular form, the results of combining objects at each clustering step. In most such tables, to avoid confusion, different terminology is used for the initial observations (monoclusters) and for clusters proper, consisting of two or more observations. In English-language statistical packages, the initial observations (rows of the data matrix) are designated as "case". To demonstrate how the cluster structure depends on the choice of metric and of the cluster-merging algorithm, we present below the dendrogram produced by the complete linkage algorithm. Here we see that object No. 44 merges with the rest of the sample only at the very last step.

Chart 5

Now let us compare this with the dendrogram obtained by applying the single linkage method to the same data. In contrast to the complete linkage method, single linkage generates long chains of objects attached to one another in sequence. Nevertheless, in all three cases we can say that two main groups stand out.

Chart 6

Note also that in all three cases object No. 44 joins as a monocluster, although at different steps of the clustering process. The detection of such monoclusters is a good means of finding anomalous observations, called outliers. Let us delete this "suspicious" object No. 44 and carry out the clustering again. We obtain the following dendrogram:

Chart 7

It can be seen that the "chain" effect is preserved, as is the division into two local groups of observations.

3.2 FEATURES OF ITERATIVE CLUSTERING METHODS

Among iterative methods, the most popular is MacQueen's k-means method. Unlike hierarchical methods, most implementations of this method require the user to specify the desired number of final clusters, usually denoted "k". As in hierarchical clustering methods, the user can choose one or another type of metric. Different algorithms of the k-means method also differ in the way the initial centers of the k clusters are chosen. In some versions of the method, the user can (or must) specify these initial points, either by selecting them from real observations or by specifying their coordinates for each of the variables. In other implementations, the given number k of initial points is chosen at random, and these initial points (cluster seeds) can subsequently be refined in several stages. Such methods have 4 main stages (a minimal code sketch follows the list):

· select or assign k observations to serve as the initial cluster centers;

· if necessary, form intermediate clusters by assigning each observation to the nearest of the specified cluster centers;

· after assigning all observations to individual clusters, replace the initial cluster centers by the cluster means;

· repeat the previous iteration until the changes in the coordinates of the cluster centers become minimal.
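A minimal sketch of these four stages in Python with hypothetical data (k = 2, Euclidean metric, random initial centers; empty clusters are not handled):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(70, 2))  # hypothetical data matrix
k = 2

# stage 1: select k observations as the initial cluster centers
centers = X[rng.choice(len(X), size=k, replace=False)]
for _ in range(100):
    # stage 2: assign each observation to the nearest center
    labels = np.argmin(
        np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2), axis=1)
    # stage 3: replace the centers by the cluster means
    new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    # stage 4: repeat until the centers change only minimally ("convergence")
    if np.allclose(new_centers, centers, atol=1e-6):
        break
    centers = new_centers

print(np.round(centers, 3))
```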

In some versions of this method the user can set a numerical criterion interpreted as the minimum distance for selecting new cluster centers: an observation is not considered a candidate for a new cluster center if its distance to the center it would replace exceeds the specified number. In some programs this parameter is called the "radius". In addition, one can set the maximum number of iterations, or a certain (usually quite small) number against which the change in distance of all cluster centers is compared; this setting is commonly called "convergence", since it reflects the convergence of the iterative clustering process. Below we present some of the results obtained by applying MacQueen's k-means method to the previous data, with the number of desired clusters set first to 3 and then to 2. The first part contains the results of a one-way analysis of variance in which the cluster number acts as the grouping factor: the first column lists the 12 variables, followed by the sums of squares (SS) and degrees of freedom (df), then Fisher's F-test, and, in the last column, the achieved significance level "p".

Table 2. One-way ANOVA results for MacQueen's k-means applied to the 70 test samples (columns: Variable, SS, df, F, p).

As can be seen from this table, the null hypothesis of equal means in the three groups is rejected. Below is a plot of the means of all variables for the individual clusters; the same cluster means are also presented in the form of a table.

Table 3. Cluster means of the variables for the three-cluster solution (columns: Variable, Cluster #1, Cluster #2, Cluster #3).

Chart 8

Analysis of the mean values of the variables for each cluster suggests that on feature X1 clusters 1 and 3 have close values, while cluster 2 has a mean much lower than in the other two clusters. On the contrary, on feature X2 the first cluster has the lowest value, while clusters 2 and 3 have higher, close mean values. For features X3-X12 the means in cluster 1 are significantly higher than in clusters 2 and 3. The following ANOVA table for the two-cluster solution likewise shows that the null hypothesis of equal group means must be rejected for almost all 12 features, except for variable X4, for which the achieved significance level turned out to be above 5%.

Table 4. ANOVA table for the two-cluster solution (columns: Variable, SS, df, F, p).

Below are the plot and the table of group means for the two-cluster solution.

Table 5. Cluster means of the variables for the two-cluster solution (columns: Variable, Cluster #1, Cluster #2).

Chart 9.

When the researcher cannot determine the most probable number of clusters in advance, he has to repeat the calculations with different values of k, as was done above, and then, comparing the results with one another, settle on the most acceptable clustering variant.

4. CLUSTERING OF FEATURES

In addition to clustering individual observations, there are also algorithms for clustering features. One of the first such methods was P.V. Terentyev's method of correlation pleiades. Primitive images of such pleiades can often be found in biomedical publications in the form of a circle studded with arrows connecting the features for which the authors found correlations. A number of programs have separate procedures for clustering objects and for clustering features. For example, the SAS package uses the VARCLUS procedure (from VARiable and CLUSter) for feature clustering, while cluster analysis of observations is performed by other procedures, FASTCLUS and CLUSTER; the dendrogram in both cases is built by the TREE procedure.

In other statistical packages, the choice of elements to cluster (objects or features) is made in the same module. As a metric for feature clustering, expressions are often used that involve coefficients reflecting the strength of the relationship between a pair of features. It is then convenient to assign a distance of zero to features whose relationship strength equals one (functional dependence): indeed, under a functional relationship the value of one feature exactly determines the value of the other. As the strength of the relationship between features decreases, the distance correspondingly increases. Below is a dendrogram of the clustering of the 12 features that were used above in clustering the 70 analytical samples.
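A sketch of one common choice of such a metric, d = 1 − |r|, in Python with hypothetical data (this particular formula is an illustration, not necessarily the one used by the packages mentioned):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(70, 4))  # hypothetical sample: features in columns
X[:, 1] = 2 * X[:, 0]         # feature 2 is a function of feature 1

R = np.corrcoef(X, rowvar=False)  # feature-by-feature correlation matrix
D = 1.0 - np.abs(R)               # distance 0 for a functional relationship

print(np.round(D, 2))  # D[0, 1] is 0: functionally related features
```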

Chart 10. Dendrogram of the clustering of the 12 features.

As this dendrogram shows, we are dealing with two local groupings of features: X1-X10 and X11-X12. The group X1-X10 is characterized by fairly small inter-cluster distances, not exceeding approximately 100 units. Within it we also see paired subgroups: X1 and X2, X3 and X4, X6 and X7; the near-zero distances within these pairs indicate strong pairwise relationships. For the pair X11 and X12, the inter-cluster distance is much larger, about 300 units. Finally, the very large distance between the left (X1-X10) and right (X11-X12) clusters, equal to about 1150 units, indicates that the relationship between these two groups of features is quite weak.

5. STABILITY AND QUALITY OF CLUSTERING

Obviously, it would be absurd to ask how absolute a classification obtained by cluster analysis methods is; the meaningful question is how stable it is. For the data considered above, when the clustering method is changed, this stability shows itself in the fact that two clusters remain clearly visible on the dendrograms.

One possible way to check the stability of cluster analysis results is to compare the results obtained by different clustering algorithms. Other ways include the bootstrap method proposed by B. Efron in 1977, the "jackknife", and "sliding control" (cross-validation). The simplest means of checking the stability of a cluster solution is to randomly divide the initial sample into two approximately equal parts, cluster both parts, and compare the results. A more laborious way involves successively excluding the first object and clustering the remaining (N − 1) objects, then repeating this procedure with the exclusion of the second, third, etc. objects and analyzing the structure of all N resulting clusterings. Another stability-checking algorithm involves repeatedly duplicating the original sample of N objects, combining all the duplicated samples into one large sample (a pseudo-population), randomly drawing a new sample of N objects from it, clustering it, then drawing another random sample and clustering again, and so on; this too is quite labor-intensive.
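A sketch of the simplest of these checks, the random split into two halves, in Python with hypothetical two-group data (Ward linkage and the two-cluster cut are arbitrary choices for the illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, size=(35, 2)),   # hypothetical two-group data
               rng.normal(5, 1, size=(35, 2))])

idx = rng.permutation(len(X))                    # random split into two halves
halves = [X[idx[:35]], X[idx[35:]]]

for h, half in enumerate(halves, start=1):
    Z = linkage(half, method="ward")
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
    print(f"half {h}: cluster sizes", np.bincount(labels)[1:])
# Similar cluster sizes (and profiles) in both halves argue for stability.
```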

Assessing the quality of clustering poses no fewer problems. Quite a few algorithms for optimizing cluster solutions are known. The first works containing formulations of a criterion minimizing the within-cluster variance, and an algorithm (of the k-means type) for finding the optimal solution, appeared in the 1950s; J. Ward's article of 1963 presented a similar hierarchical optimization algorithm. There is no universal criterion for the optimality of a cluster solution, which makes the researcher's choice difficult. In such a situation, the best ground for asserting that the cluster solution found is optimal at a given stage of the study is the consistency of this solution with the conclusions obtained by other methods of multivariate statistics.

Positive results in checking the predictive power of the obtained solution on other objects of study also argue for the optimality of a clustering. When using hierarchical methods of cluster analysis, one can recommend comparing several plots of the increment of the inter-cluster distance across the merging steps. Preference should be given to the variant in which this increment stays flat from the first step up to the several penultimate steps and then rises sharply in the last 1-2 steps of clustering.
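A sketch of this recommendation in Python: compute the merge distances from a hierarchical clustering and examine their increments (hypothetical two-group data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, size=(35, 2)),   # hypothetical two-group data
               rng.normal(6, 1, size=(35, 2))])

Z = linkage(X, method="ward")
merge_dist = Z[:, 2]                 # inter-cluster distance at each step
increments = np.diff(merge_dist)
print(np.round(increments[-5:], 2))  # a flat run followed by a sharp jump at
                                     # the last step suggests a two-cluster cut
```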

CONCLUSIONS

In this work I have tried to show not only the complexity of this type of analysis but also its capabilities for data processing, since the accuracy of the results often requires the use of tens or even hundreds of samples. This type of analysis helps to classify and process results. I also consider important the applicability of computer technology to this analysis, which makes the processing of results less laborious and thereby allows more attention to be paid to the correctness of sampling for analysis.

The use of cluster analysis involves subtleties and details that emerge in individual specific cases and are not immediately visible. For example, the role of feature scale may be minimal in some cases and dominant in others. In such cases, variable transformations must be used. This is especially effective with methods that produce nonlinear feature transformations, which generally increase the overall level of correlation between features.

There is even greater specificity in applying cluster analysis to objects described only by qualitative features. In this case, methods of preliminary digitization of the qualitative features followed by cluster analysis of the new features are quite successful. In this work I have shown that cluster analysis provides much new and original information both when applied to well-studied systems and in the study of systems of unknown structure.

It should also be noted that cluster analysis has become indispensable in evolutionary research, allowing the construction of phylogenetic trees that show evolutionary paths. These methods are also widely used in research programs in physical and analytical chemistry.

BIBLIOGRAPHY

1) Aivazyan S. A., Enyukov I. S., Meshalkin L. D. On the structure and content of the software package for applied statistical analysis // Algorithmic and software support of applied statistical analysis.--M., 1980.

2) Aivazyan S. A., Bezhaeva Z. I., Staroverov O. V. Classification of multidimensional observations.--M.: Statistics, 1974.

3) Becker V. A., Lukatskaya M. L. On the analysis of the structure of the matrix of association coefficients // Issues of economic and statistical modeling and forecasting in industry.--Novosibirsk, 1970.

4) Braverman E. M., Muchnik I. B. Structural methods of data processing.--M.: Nauka, 1983.

5) Voronin Yu. A. Classification theory and its applications.--Novosibirsk: Nauka, 1987.

6) Good I. J. Botryology of botryology // Classification and cluster.--M.: Mir, 1980.

7) Dubrovsky S. A. Applied multivariate statistical analysis.--M.: Finance and Statistics, 1982.

8) Duran B., Odell P. Cluster analysis.--M.: Statistics, 1977.

9) Eliseeva I. I., Rukavishnikov V. S. Grouping, correlation, pattern recognition.--M.: Statistics, 1977.

10) Zagoruiko N. G. Recognition methods and their application.--M.: Soviet Radio, 1972.

11) Zadeh L. A. Fuzzy sets and their application in pattern recognition and cluster analysis // Classification and cluster.--M.: Mir, 1980.

12) Kildishev G. S., Abolentsev Yu. I. Multidimensional groupings.--M.: Statistics, 1978.

13) Raiskaya I. I., Gostilin N. I., Frenkel A. A. On one way of checking the validity of partitioning in cluster analysis // Application of multivariate statistical analysis in economics and product quality assessment. Part II.--Tartu, 1977.

14) Shurygin A. M. Distribution of interpoint distances and differences // Software and algorithmic support of applied multidimensional statistical analysis.--M., 1983.

15) Eeremaa R. General theory of designing cluster systems and algorithms for finding their numerical representations: Proceedings of the Computing Center of TSU.--Tartu, 1978.

16) Yastremsky B. S. Selected works.--M.: Statistics, 1964.


CONTROL WORK (VZFEI, Moscow, 2008)


1. Introduction. The concept of the cluster analysis method.

2. Description of the methodology for applying cluster analysis. A control example of problem solving.

3. Solving the problems of the control work.

4. List of used literature

  1. Introduction. The concept of the cluster analysis method.

Cluster analysis is a set of methods that allow classifying multidimensional observations, each of which is described by a set of features (parameters) X1, X2, ..., Xk.

The purpose of cluster analysis is the formation of groups of objects similar to each other, which are commonly called clusters (class, taxon, concentration).

Cluster analysis is one of the areas of statistical research. It occupies a particularly important place in those branches of science that deal with the study of mass phenomena and processes. The need to develop and use cluster analysis methods is dictated by the fact that they help to build scientifically grounded classifications and to identify internal connections between the units of the observed population. In addition, cluster analysis methods can be used to compress information, which is important in the face of the constant growth and complication of flows of statistical data.

Cluster analysis methods allow solving the following problems:

Carrying out a classification of objects that takes into account the features reflecting their essence and nature. Solving such a problem, as a rule, deepens one's knowledge of the set of objects being classified;

Checking assumptions about the presence of some structure in the studied set of objects, i.e. searching for an existing structure;

Constructing new classifications for poorly studied phenomena, when it is necessary to establish the presence of connections within the population and to try to introduce structure into it (1, pp. 85-86).

2. Description of the methodology for applying cluster analysis. Control example of problem solving.

Cluster analysis allows one to form a partition of n objects, characterized by k features, into homogeneous groups (clusters). The homogeneity of objects is determined by the distance p(x_i, x_j), where x_i = (x_i1, ..., x_ik) and x_j = (x_j1, ..., x_jk) are vectors composed of the values of the k attributes of the i-th and j-th objects, respectively.

For objects characterized by numerical features, the distance is determined by the following formula:

p(x_i, x_j) = √( ∑_{m=1..k} (x_im − x_jm)² )    (1)*

Objects are considered homogeneous if p(x_i, x_j) < p_limit, where p_limit is a chosen threshold distance.

A graphical representation of the merging process can be obtained using a cluster merge tree, a dendrogram (2, Chapter 39).

Test case (example 92).

(Data table for example 92: five objects characterized by volume of sales and a second feature; the values are not reproduced here.)

Let us classify these objects using the "nearest neighbor" principle. We find the distances between the objects by formula (1)* and fill in the table.

Let's explain how the table is filled.

At the intersection of row i and column j, the distance p(x_i, x_j) is given (rounded to two decimal places).

For example, at the intersection of row 1 and column 3 is the distance p(x_1, x_3) = √((1−6)² + (9−8)²) ≈ 5.10, and at the intersection of row 3 and column 5 the distance p(x_3, x_5) = √((6−12)² + (8−7)²) ≈ 6.08. Since p(x_i, x_j) = p(x_j, x_i), the lower part of the table need not be filled in.

Let's apply the "nearest neighbor" principle. In the table we find the smallest of the distances (if there are several, we choose any one of them). Here p_1,2 ≈ p_4,5 = 2.24; let p_min = p_4,5 = 2.24. We can then combine objects 4 and 5 into one group: the combined column "4 and 5" will contain the smaller of the corresponding entries of columns 4 and 5 of the original distance table, and the same is done with rows 4 and 5. We obtain a new table.

In the resulting table we find the smallest of the distances (if there are several, we choose any one of them): p_min = p_1,2 = 2.24. We can then combine objects 1, 2 and 3 into one group: the combined column "1, 2, 3" will contain the smallest of the corresponding entries of columns 1, 2 and 3 of the previous distance table, and the same is done with the rows. We obtain a new table.

We got two clusters: (1,2,3) and (4,5).
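The procedure used above can be sketched in Python as follows; the demonstration coordinates are reconstructed to be consistent with the distances quoted in the example (p(x_1, x_3) ≈ 5.10, p(x_3, x_5) ≈ 6.08, p_1,2 ≈ p_4,5 ≈ 2.24), since the original data table is not reproduced:

```python
def nearest_neighbour_clustering(D):
    # D: symmetric matrix (list of lists) of pairwise object distances
    n = len(D)
    groups = {i: (i + 1,) for i in range(n)}  # 1-based object labels
    dist = {(i, j): D[i][j] for i in range(n) for j in range(i + 1, n)}

    def d(a, b):
        return dist[(min(a, b), max(a, b))]

    while len(groups) > 1:
        a, b = min(dist, key=dist.get)        # the smallest distance p_min
        print(f"merge {groups[a]} and {groups[b]} at p_min = {d(a, b):.2f}")
        for c in groups:
            if c not in (a, b):
                # the combined row/column keeps the smaller of the distances
                dist[(min(a, c), max(a, c))] = min(d(a, c), d(b, c))
        groups[a] += groups.pop(b)
        dist = {k: v for k, v in dist.items() if b not in k}
    return groups

# Reconstructed coordinates consistent with the quoted distances:
X = [(1, 9), (2, 7), (6, 8), (12, 7), (13, 5)]
D = [[((xi - xj) ** 2 + (yi - yj) ** 2) ** 0.5 for (xj, yj) in X]
     for (xi, yi) in X]
nearest_neighbour_clustering(D)  # merges 1-2 and 4-5 at 2.24, then 3, then all
```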

3. Solving the problems of the control work.

Problem 85.

Conditions: Five production facilities are characterized by two features: volume of sales and average annual cost of fixed production assets.

Object:                                          1    2    3    4    5
Volume of sales:                                 2    5    7   12   13
Average annual cost of fixed production assets:  7    9   10    8    5

(The values are restored from the distance calculations in the solution below.)

Solution: We find the distances between the objects by formula (1)*, rounding to two decimal places:

p_1,1 = √((2−2)² + (7−7)²) = 0

p_1,2 = √((2−5)² + (7−9)²) ≈ 3.61

p_1,3 = √((2−7)² + (7−10)²) ≈ 5.83

p_1,4 = √((2−12)² + (7−8)²) ≈ 10.05

p_1,5 = √((2−13)² + (7−5)²) ≈ 11.18

p_2,2 = √((5−5)² + (9−9)²) = 0

p_2,3 = √((5−7)² + (9−10)²) ≈ 2.24

p_2,4 = √((5−12)² + (9−8)²) ≈ 7.07

p_2,5 = √((5−13)² + (9−5)²) ≈ 8.94

p_3,4 = √((7−12)² + (10−8)²) ≈ 5.39

p_3,5 = √((7−13)² + (10−5)²) ≈ 7.81

p_4,5 = √((12−13)² + (8−5)²) ≈ 3.16

Based on the results of the calculations, we fill in the table:

Let's apply the "nearest neighbor" principle. In the table we find the smallest of the distances (if there are several, we choose any one of them): p_2,3 = 2.24. Let p_min = p_2,3 = 2.24; then we can combine the columns and rows of objects "2" and "3". Into the combined group of the new table we enter the smallest of the corresponding values from the original table.

In the new table we find the smallest of the distances: p_4,5 = 3.16. Let p_min = p_4,5 = 3.16; then we combine the columns and rows of objects "4" and "5", again entering the smallest of the corresponding values into the combined group.

In the new table we find the smallest of the distances: p(1, {2,3}) = 3.61. Let p_min = 3.61; then we combine the column and row of object "1" with those of the group "2 and 3", entering the smallest values into the combined group.

We get two clusters: (1,2,3) and (4,5).

The dendrogram shows the order in which the elements were joined and the corresponding minimum distances p_min.

Answer: As a result of cluster analysis by the "nearest neighbor" principle, two clusters of mutually similar objects are formed: (1,2,3) and (4,5).
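The result can be cross-checked with a library implementation of single linkage ("nearest neighbor"); the coordinates below are read off the distance calculations in the solution:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Coordinates read off the distance calculations above:
# (2,7), (5,9), (7,10), (12,8), (13,5).
X = np.array([[2, 7], [5, 9], [7, 10], [12, 8], [13, 5]], dtype=float)

Z = linkage(X, method="single", metric="euclidean")
print(np.round(Z, 2))  # merges at 2.24, 3.16, 3.61 and finally 5.39

labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)          # two clusters: objects (1,2,3) and (4,5)
```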

Problem 211.

Conditions: Five production facilities are characterized by two features: volume of sales and average annual cost of fixed production assets.

(Data table: volume of sales and average annual cost of fixed production assets for five objects; the values are not reproduced here.)

Classify these objects using the nearest neighbor principle.

Solution: We present the data in a table, determine the distances between the objects, and classify the objects according to the "nearest neighbor" principle; the results are presented in the form of a dendrogram.


Using formula (1)*, we find the distances between objects:

p_1,1 = 0, p_1,2 = 6, p_1,3 = 8.60, p_1,4 = 6.32, p_1,5 = 6.71, p_2,2 = 0, p_2,3 = 7.07, p_2,4 = 2, p_2,5 = 3.32, p_3,3 = 0, p_3,4 = 5.10, p_3,5 = 4.12, p_4,4 = 0, p_4,5 = 1, p_5,5 = 0.

The results are presented in the table:

The smallest distance in the table is p_4,5 = 1. Let p_min = p_4,5 = 1; then we can combine the columns and rows of objects "4" and "5". Into the combined group of the new table we enter the smallest of the corresponding values from the original table.

The smallest distance in the new table is p(2, {4,5}) = min(p_2,4, p_2,5) = min(2, 3.32) = 2. We therefore combine the column and row of object "2" with those of the group "4 and 5", entering the smallest values from the previous table into the combined group.

The smallest distance in the new table is p(3, {2,4,5}) = min(7.07, 5.10, 4.12) = 4.12. We combine object "3" with the group "2, 4 and 5" in the same way. Finally, the remaining object "1" joins at p(1, {2,3,4,5}) = min(6, 8.60, 6.32, 6.71) = 6, which completes the dendrogram: the objects merge in the order (4,5), then 2, then 3, then 1.


This book is devoted to just one of the most promising approaches to the analysis of multidimensional processes and phenomena: cluster analysis.

Cluster analysis is a way of grouping multidimensional objects based on representing the results of individual observations as points of a suitable geometric space and then identifying groups as "clumps" of these points. In English, "cluster" means "clot", "bunch (of grapes)", "cluster (of stars)", and so on. The term fits scientific terminology unusually well: its first syllable corresponds to the traditional term "class", while the second, as it were, points to its artificial origin. We have no doubt that the terminology of cluster analysis will supersede all the constructs used for this purpose earlier (unsupervised pattern recognition, stratification, taxonomy, automatic classification, and so on). The potential of cluster analysis is obvious for solving, say, the problems of identifying groups of enterprises operating in similar conditions or with similar results, or homogeneous groups of the population by various aspects of life or by lifestyle in general.

As a scientific direction, cluster analysis declared itself in the mid-1960s and has been developing rapidly ever since as one of the fastest-growing branches of statistical science. Suffice it to say that the number of monographs on cluster analysis published to date in different countries runs into the hundreds (whereas for such a "venerable" method of multivariate statistical analysis as factor analysis one can hardly count a few dozen books). And this is quite understandable: what is being modeled is, in effect, the grouping operation, one of the most important operations not only in statistics but in cognition and decision-making generally.

A number of monographs have been published in our country devoted to the study of specific socio-economic problems by means of cluster analysis (1), to the methodology of applying cluster analysis in socio-economic research (2), and to the methodology of cluster analysis as such (3).

The book by I. D. Mandel offered here is, as it were, perpendicular to this classification: its content is connected with each of these three areas.

The purpose of the book is to summarize the current state of cluster analysis, to analyze the possibilities of its use, and to outline the tasks of its further development. This idea in itself cannot but command respect: unbiased analysis and generalization demand a great deal of work, erudition and courage, and are valued by the scientific community far less than the promotion and development of one's own constructions. (The book nevertheless also contains original results of the author's, related to "intensional" analysis and the duality of classifications.)

Both the advantages of the book and its shortcomings are connected with the realization of this goal. Its advantages include:

· methodological study of the concepts of homogeneity, grouping and classification, taking into account the multidimensionality of phenomena and processes;

· a systematic review of approaches and methods of cluster analysis (including up to 150 specific algorithms);

· presentation of the technology and the results of experimental comparison of cluster analysis procedures;

· development of general schemes for the use of cluster analysis methods, implemented in fairly illustrative tables;

· the recommendatory character of the presentation.

These advantages determine the independent place of I. D. Mandel's book among other publications.

The shortcomings of the book are the ambiguity of some of its recommendations and the absence of a systematic analysis of the use of cluster analysis methods in substantive socio-economic applications. The latter, it is true, reflects the still insufficient use of cluster analysis in this area.

The book provides a springboard from which it is easier to advance on the most difficult question of any theory: the practical use of the tools it offers.

B. G. Mirkin

Research topics range from the analysis of the morphology of mummified rodents in New Guinea to the study of the voting records of US senators, from the analysis of the behavioral functions of frozen cockroaches upon thawing to the study of the geographical distribution of certain types of lichen in Saskatchewan.

This explosion of publications has had a huge impact on the development and application of cluster analysis. Unfortunately, it also has its negative sides. The rapid growth of publications on cluster analysis has led to the formation of separate groups of users and, as a consequence, to the creation of jargon used only by the groups that created it (Blashfield and Aldenderfer, 1978; Blashfield, 1980).

The formation of jargon by specialists in the social sciences is evidenced, for example, by the varied terminology surrounding Ward's method. "Ward's method" goes by different names in the literature; at least four others are known: "minimum variance method", "sum of squared errors method", "hierarchical grouping to minimize tr W" and "HGROUP". The first two names simply refer to the criterion whose optimum Ward's method seeks; the third reflects the fact that the sum of squared errors is monotonically related to the trace of W, the within-group covariance matrix; finally, the widely used name "HGROUP" is the name of a popular computer program implementing Ward's method (Veldman, 1967).
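Since all four names denote one and the same procedure, a short sketch may help. The following Python fragment (SciPy's implementation of the method and the toy data are our assumptions for illustration) runs Ward's method and then computes the error sum of squares, i.e. the trace of W, that the method seeks to keep small at every merge.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Toy data: two compact groups of points in the plane.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(5, 0.5, (20, 2))])

# method="ward" is SciPy's name for the minimum variance criterion.
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")

# The quantity the method keeps small: the within-group error sum of
# squares, which equals the trace of the pooled scatter matrix W.
ess = sum(((X[labels == k] - X[labels == k].mean(axis=0)) ** 2).sum()
          for k in np.unique(labels))
print(labels)
print(ess)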

The formation of jargon hinders the development of interdisciplinary connections, impedes effective comparison of the methodology and results of applying cluster analysis in different fields of science, leads to wasted effort (the re-invention of the same algorithms) and, finally, denies new users a deep understanding of the methods they have chosen (Blashfield and Aldenderfer, 1978). For example, one social science study (Rogers and Linden, 1973) compared three different clustering methods on the same data, calling them "hierarchical grouping", "hierarchical clustering or HCG" and "cluster analysis"; none of these is a familiar name for the clustering methods actually used. A novice user of cluster analysis programs will be confused by all the existing names and unable to relate them to other descriptions of clustering methods, while experienced users will find themselves in a difficult position when comparing their research with similar work. We may be going to extremes, but jargon is a serious problem.

In recent years the development of cluster analysis has slowed somewhat, judging by the number of publications and the number of disciplines in which the method is applied. One may say that psychology, sociology, biology, statistics and some technical disciplines are now entering a stage of consolidation with respect to cluster analysis.

The number of articles extolling the virtues of cluster analysis is gradually decreasing. At the same time, more and more works compare the applicability of different clustering methods on control data, and the literature is paying greater attention to applications. Many studies aim to develop practical measures for testing the validity of results obtained by cluster analysis. All this testifies to serious attempts to create a sound statistical theory of clustering methods.
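As an illustration of comparisons of this kind, here is a minimal sketch (scikit-learn and synthetic control data are assumed for the example): two clustering methods are run on the same control data and compared by the silhouette coefficient, one widely used practical validity measure.

from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Control data with a known group structure (three planted clusters).
X, _ = make_blobs(n_samples=150, centers=3, random_state=1)

for name, model in [
    ("k-means", KMeans(n_clusters=3, n_init=10, random_state=0)),
    ("Ward", AgglomerativeClustering(n_clusters=3, linkage="ward")),
]:
    labels = model.fit_predict(X)
    # Silhouette values close to 1 indicate compact, well-separated clusters.
    print(name, silhouette_score(X, labels))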

