CONSTRUCTION OF HIERARCHICAL CLASSIFICATION BY SIMILARITY MATRIX
G. Tsitsiashvili, M. Osipova
Institute for Applied Mathematics, Far Eastern Branch of RAS, Vladivostok, Russia; Far Eastern Federal University, Vladivostok, Russia
e-mails: [email protected], [email protected]
ABSTRACT
This paper solves the problem of hierarchical classification of objects from their similarity matrix. The approach yields a unique solution of the classification problem. Each hierarchical level is defined by a critical value of similarity. Using the critical value, the similarity matrix is transformed into the adjacency matrix of an undirected graph, whose connected components are then constructed. By successively increasing the critical value, a hierarchical classification of the initial objects is obtained. The approach is closely connected with reliability theory and mathematical statistics, in which the reaching of a critical value is an important problem.
1 INTRODUCTION
Classification by a matrix of pairwise similarities (or differences) between objects is widely used in data processing (Aivazian, Enukov, Meshalkin, 1983). Such problems are usually posed as a minimax problem in which the similarity between objects within a class is maximized (the difference between objects within a class is minimized) and the similarity between classes is minimized (the difference between classes is maximized). One disadvantage of this formulation is that the solution is not unique and that enumerating all possible solutions is difficult.
In recent years, however, applications have increasingly tied the classification problem to the construction of a hierarchical classification. This formulation raises the demand for uniqueness of the classification. In this paper uniqueness is achieved by transforming the similarity matrix into a zero-one matrix through comparison of its elements with a critical value. This zero-one matrix becomes the adjacency matrix of an undirected graph. The connected components of this graph are then constructed and identified with classes of objects. To define a hierarchical classification of the objects, the critical value is increased. As a result the classes are divided into subclasses, and so on. In this way a hierarchical classification is constructed from the matrix of pairwise similarities (or pairwise differences).
Another applied problem is to determine the upper bound (supremum) for a similarity matrix, or the lower bound (infimum) for a difference matrix, of the critical values for which the classification procedure gives a single solution. This problem may be solved by the method of dichotomous division (bisection). For this purpose the initial step takes a pair of critical values: zero and the maximum of the pairwise similarities (the minimum and the maximum of the pairwise differences).
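The dichotomous division can be sketched as follows. This is only an illustration under stated assumptions, not the authors' procedure: "single solution" is read here as "all objects fall into one class", similarities are assumed to be nonnegative integers, two objects are assumed to be joined when their similarity is at least the critical value, and connectivity is checked with an ordinary depth-first search. The names `is_connected` and `largest_connected_level` and the 4-object matrix are hypothetical.

```python
def is_connected(M, k):
    """True if the graph with an edge i-j whenever M[i][j] >= k is connected.

    Assumption: plain depth-first search is used as a stand-in connectivity test.
    """
    n = len(M)
    seen = {0}
    stack = [0]
    while stack:
        i = stack.pop()
        for j in range(n):
            if j != i and j not in seen and M[i][j] >= k:
                seen.add(j)
                stack.append(j)
    return len(seen) == n


def largest_connected_level(M):
    """Bisection over integer critical values between 0 and the largest similarity.

    Relies on monotonicity: if the graph is connected at level k,
    it is connected at every smaller level.
    """
    n = len(M)
    lo = 0  # at level 0 every pair is joined, so the graph is connected
    hi = max(M[i][j] for i in range(n) for j in range(n) if i != j)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if is_connected(M, mid):
            lo = mid        # still one class: move the lower end up
        else:
            hi = mid - 1    # already split: move the upper end down
    return lo


# Hypothetical similarity matrix of 4 objects (diagonal = max off-diagonal + 1).
M = [
    [4, 3, 1, 0],
    [3, 4, 1, 0],
    [1, 1, 4, 2],
    [0, 0, 2, 4],
]
level = largest_connected_level(M)
```

For the hypothetical matrix above the objects stay in one class up to level 1 and split into two classes at level 2.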
Consequently, hierarchical classification reduces to finding the connected components of an undirected graph. Known algorithms for constructing connected components (Kormen, Leizerson, Rivest, 2004), (Graham, Hell, 1985) are based on depth-first and breadth-first search. A disadvantage of these algorithms is that they repeatedly revisit previously considered graph edges in the search tree and must determine, for each node, all nodes adjacent to it. In this paper we suggest an algorithm that does not have these disadvantages. If the graph under consideration is connected, this algorithm is similar to the algorithm of spanning tree construction (Eppstein, 1999).
2 ALGORITHM OF HIERARCHICAL CLASSIFICATION
Consider $n$ objects and denote by $m_{ij}$ the similarity measure of the objects $i, j$, $i \neq j$. The similarity matrix of these objects is $M = \|m_{ij}\|_{i,j=1}^{n}$. It consists of nonnegative numbers with $m_{ii} = m$, where $m = \max_{1 \le i \neq j \le n} m_{ij} + 1$. To each integer $k$, $0 \le k \le m$, associate the matrix $M^{(k)} = \|m_{ij}^{(k)}\|_{i,j=1}^{n}$, where $m_{ij}^{(k)} = 1$ if $m_{ij} \ge k$, and $m_{ij}^{(k)} = 0$ otherwise. The matrix $M^{(k)}$ consists of zeros and ones and may be considered as the adjacency matrix of an undirected graph $G^{(k)}$ with $n$ nodes, which represent the initial objects.
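The thresholding step can be illustrated with a short sketch. It assumes that an entry of the zero-one matrix equals 1 when the similarity is at least the critical value $k$; the function name `threshold_matrix` and the 4-object matrix are hypothetical and serve only as an example.

```python
def threshold_matrix(M, k):
    """Return the zero-one matrix whose entry is 1 where M[i][j] >= k, else 0."""
    return [[1 if value >= k else 0 for value in row] for row in M]


# Hypothetical similarity matrix of 4 objects.
# The diagonal equals m = (largest off-diagonal similarity) + 1 = 4.
M = [
    [4, 3, 1, 0],
    [3, 4, 1, 0],
    [1, 1, 4, 2],
    [0, 0, 2, 4],
]

A0 = threshold_matrix(M, 0)  # level 0: every pair of objects is joined
A2 = threshold_matrix(M, 2)  # level 2: only the pairs (1, 2) and (3, 4) are joined
A4 = threshold_matrix(M, 4)  # level m = 4: only the diagonal survives
```

At level 2 the graph of this example splits into the two classes {1, 2} and {3, 4}, and at level $m$ every object forms its own class.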
In the graph $G^{(k)}$ construct the connected components $J_1^{(k)}, J_2^{(k)}, \ldots, J_{n(k)}^{(k)}$, so that for any two nodes $i, j \in J_t^{(k)}$ there is a path in $G^{(k)}$ that connects them. If $i \in J_t^{(k)}$, $j \in J_l^{(k)}$, $t \neq l$, then there is no path connecting the nodes $i, j$ in $G^{(k)}$. Note that in passing from $k$ to $k+1$ each set $J_t^{(k+1)}$ is either completely contained in some set $J_l^{(k)}$ or does not intersect it.
Consequently the subsets $J_1^{(k)}, J_2^{(k)}, \ldots, J_{n(k)}^{(k)}$ form a decomposition of the set $\{1, \ldots, n\}$ into classes by the levels $k$, $0 \le k \le m$:
$$\{J_1^{(0)}, J_2^{(0)}, \ldots, J_{n(0)}^{(0)}\},\ \{J_1^{(1)}, J_2^{(1)}, \ldots, J_{n(1)}^{(1)}\},\ \ldots,\ \{J_1^{(m)}, J_2^{(m)}, \ldots, J_{n(m)}^{(m)}\},$$
in which for any class $J_t^{(k+1)}$ of the level $k+1$ there is a class $J_l^{(k)}$ satisfying the inclusion $J_t^{(k+1)} \subseteq J_l^{(k)}$. Further, construct the tree $D$ with the height $m$; its root is the node $J_1^{(0)} = \{1, \ldots, n\}$. On the level $k = 1$ consider the nodes $J_1^{(1)}, J_2^{(1)}, \ldots, J_{n(1)}^{(1)}$ and connect them by edges with the node $J_1^{(0)}$. On the level $k = 2$ consider the nodes $J_1^{(2)}, J_2^{(2)}, \ldots, J_{n(2)}^{(2)}$ and connect the node $J_t^{(2)}$ by an edge with the node $J_l^{(1)}$ of the level 1 if the inclusion $J_t^{(2)} \subseteq J_l^{(1)}$ holds. This procedure continues to the level $m$, on which the set $\{1, \ldots, n\}$ is divided into $n$ one-node subsets. To simplify the description of the tree it is possible to replace the inclusion $J_t^{(k+1)} \subseteq J_l^{(k)}$ by the strict inclusion $J_t^{(k+1)} \subset J_l^{(k)}$: if $J_t^{(k+1)} = J_l^{(k)}$, then the nodes $J_t^{(k+1)}$, $J_l^{(k)}$ of the tree $D$ are glued.
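The construction of the tree with glued nodes can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: an edge joins two objects when their similarity is at least $k$, the classes of each level are found with a simple union-find helper chosen for brevity, and the tree is stored as a map from each class to its parent class. Equal classes on successive levels are glued automatically because a class is identified by its set of objects. The names `components` and `build_tree` and the 4-object matrix are hypothetical.

```python
def components(M, k):
    """Classes of level k: connected components of the graph with
    an edge i-j whenever M[i][j] >= k (union-find used as a stand-in)."""
    n = len(M)
    root = list(range(n))

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]  # path halving
            x = root[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if M[i][j] >= k:
                root[find(i)] = find(j)

    classes = {}
    for i in range(n):
        classes.setdefault(find(i), set()).add(i + 1)  # objects numbered from 1
    return [frozenset(c) for c in classes.values()]


def build_tree(M):
    """Map each class to its parent class; equal classes on
    successive levels are glued (no edge is recorded between them)."""
    n = len(M)  # assumes n >= 2
    m = max(M[i][j] for i in range(n) for j in range(n) if i != j) + 1
    parent = {}
    previous = components(M, 0)
    for k in range(1, m + 1):
        current = components(M, k)
        for child in current:
            # the unique class of the previous level that contains the child
            holder = next(p for p in previous if child <= p)
            if child != holder:
                parent[child] = holder
        previous = current
    return parent


# Hypothetical similarity matrix of 4 objects (m = 4).
M = [
    [4, 3, 1, 0],
    [3, 4, 1, 0],
    [1, 1, 4, 2],
    [0, 0, 2, 4],
]
tree = build_tree(M)
```

In this example the root {1, 2, 3, 4} has the children {1, 2} and {3, 4}, and each of these has two one-object children.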
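To construct the connected components of the undirected graph $G^{(k)}$ with $n$ nodes we use the following algorithm. On step 1 take the node 1 and construct the connected component $K_1^{(1)} = \{1\}$. Assume that on step $t-1$ the set of nodes $\{1, \ldots, t-1\}$ is divided into the connected components $K_i^{(t-1)}$, $i \in L_{t-1}$:
$$K_i^{(t-1)} \cap K_j^{(t-1)} = \emptyset,\ i \neq j,\ i, j \in L_{t-1}, \qquad \bigcup_{i \in L_{t-1}} K_i^{(t-1)} = \{1, \ldots, t-1\}.$$
On step $t$ consider the next node $t$, calculate $c_i = \max_{j \in K_i^{(t-1)}} m_{tj}^{(k)}$, $i \in L_{t-1}$, and put
$$I = \{i \in L_{t-1} : c_i = 1\}, \qquad K_i^{(t)} := K_i^{(t-1)},\ i \in L_{t-1} \setminus I, \qquad K_t^{(t)} := \{t\} \cup \bigcup_{i \in I} K_i^{(t-1)}, \qquad L_t := (L_{t-1} \setminus I) \cup \{t\}.$$
This means that the classes $K_i^{(t-1)}$, $i \in L_{t-1}$, with which the node $t$ is connected by some edges, are aggregated with $t$ into the new class $K_t^{(t)}$.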
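The step-by-step procedure above can be sketched in code. This is a minimal illustration, assuming the graph is given as a symmetric zero-one adjacency matrix and that the merged class is labelled by the newly added node; the function name `incremental_components` and the sample matrices are hypothetical.

```python
def incremental_components(A):
    """Connected components of an undirected graph given by a symmetric
    zero-one adjacency matrix A, built by adding nodes 1, 2, ..., n in turn.

    Returns a dict: class label -> set of nodes (nodes numbered from 1).
    """
    n = len(A)
    if n == 0:
        return {}
    classes = {1: {1}}  # step 1: the single class {1}
    for t in range(2, n + 1):
        # classes containing at least one node adjacent to t (c_i = 1)
        touched = [
            label
            for label, members in classes.items()
            if any(A[t - 1][j - 1] == 1 for j in members)
        ]
        merged = {t}
        for label in touched:
            merged |= classes.pop(label)  # aggregate the class with t
        classes[t] = merged  # new class, labelled by the node t
    return classes


# Hypothetical adjacency matrix: nodes 1-2 are joined and nodes 3-4 are joined.
A = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
result = incremental_components(A)
```

On step $t$ only the entries of row $t$ in columns $1, \ldots, t-1$ are read, and each of them at most once, so no edge is examined twice over the whole run.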
3 NUMERICAL EXAMPLE
Assume that the similarity matrix of 15 objects is given, with $m = 8$.
For any critical level $k = 0, 1, \ldots, 8$ the graph $G^{(k)}$ has the following connected components (connected components with a single element are not repeated on successive levels):
the level $k = 0$: $J_1^{(0)} = \{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15\}$
the level $k = 1$: $J_1^{(1)} = \{1,2,3,4,5,6,7,8,9,10,11,12,14,15\}$, $J_2^{(1)} = \{13\}$
the level $k = 2$: $J_1^{(2)} = \{1,2,3,4,5,6,7,8,9,11,12,14,15\}$, $J_2^{(2)} = \{10\}$
the level $k = 3$: $J_1^{(3)} = \{1,2,3,6,7,8,9,11,12,15\}$, $J_2^{(3)} = \{4\}$, $J_3^{(3)} = \{5\}$, $J_4^{(3)} = \{14\}$
the level $k = 4$: $J_1^{(4)} = \{1\}$, $J_2^{(4)} = \{2,3,6,7\}$, $J_3^{(4)} = \{8,9,11\}$, $J_4^{(4)} = \{12\}$, $J_5^{(4)} = \{15\}$
the level $k = 5$: $J_1^{(5)} = \{2,3,6,7\}$, $J_2^{(5)} = \{8,9,11\}$
the level $k = 6$: $J_1^{(6)} = \{2\}$, $J_2^{(6)} = \{3,6,7\}$, $J_3^{(6)} = \{8\}$, $J_4^{(6)} = \{9\}$, $J_5^{(6)} = \{11\}$
the level $k = 7$: $J_1^{(7)} = \{3,6,7\}$
the level $k = 8$: $J_1^{(8)} = \{3\}$, $J_2^{(8)} = \{6\}$, $J_3^{(8)} = \{7\}$
Then we construct the tree $D$ with the height 7 with glued nodes: $J_2^{(6)}$ with $J_1^{(7)}$, $J_2^{(4)}$ with $J_1^{(5)}$, $J_3^{(4)}$ with $J_2^{(5)}$ (Fig. 1).
Figure 1. The tree D with the height 7.
4 REFERENCES
1. Aivazian S.A., Enukov I.S., Meshalkin L.D. 1983. Applied statistics. Bases of modeling and initial data processing. Moscow: Finances and statistics. (In Russian).
2. Eppstein D. 1999. Spanning trees and spanners. In: Sack J.R., Urrutia J. (eds.). Handbook of Computational Geometry. Elsevier. P. 425-461.
3. Graham R.L., Hell P. 1985. On the history of the minimum spanning tree problem. Annals of the History of Computing 7 (1): 43-57.
4. Kormen T., Leizerson Ch., Rivest R. 2004. Algorithms: construction and analysis. Moscow: Laboratory of Basic Knowledge. (In Russian).