UDC 519.178, 025.4.03(06)
Gleb B. Sologub
ON MEASURING SIMILARITY BETWEEN TREE NODES
Abstract
In this paper, a survey of similarity measures between vertices of a graph is presented. Distance-based and structural equivalence measures are described. It is demonstrated that most of them degenerate if applied directly to tree nodes. An adjusted path-based similarity measure is proposed, as well as a new method for representing tree nodes as binary vectors that is based on the use of an ancestor matrix. It is shown that applying ordinary similarity measures to this representation gives the desired non-trivial results.
Keywords: similarity measure, distance on tree nodes, structural equivalence, ancestor matrix.
1. INTRODUCTION
The concept of similarity is commonly used in connection with clustering and collaborative filtering methods in many fields, including information retrieval, data mining, network analysis, pattern recognition and machine learning. The basic task for these methods is to calculate the similarity between data entries and find those most similar to one another.
Tree structures are used to represent various types of hierarchical data. Examples include different ontologies, catalogs, genealogies, XML documents, language corpora, etc.
In our work on intelligent tutoring and testing systems, we need to evaluate the similarity between the questions of a test in order to predict answer scores. We use tree data structures for domain modeling. Nodes of a tree represent themes or subjects; leaves represent questions. So, the main goal of our study is to develop an effective and accurate measure of similarity between tree leaves.
In this paper, following the work [1], we discuss only abstract graph-theoretic methods for computing similarity on tree nodes, without any regard to the problem domain.
Thorough studies of different approaches to measuring similarity as semantic distance that do relate to the problem domain, namely information retrieval, can be found in [2], [3], and [4].
2. PRELIMINARIES
A tree is a connected undirected simple graph with no cycles. Any two nodes of a tree are connected by a unique simple path, which is the shortest path between them. We consider a rooted tree, which has a root node and leaves.
We denote the number of tree nodes by $n$; nodes (vertices) by $v_1, v_2, \ldots$; in particular, the root node by $t$, leaves by $q_1, q_2, \ldots$, and parent nodes by $t_1, t_2, \ldots$; the lowest common ancestor of vertices $v_i$ and $v_j$ by $lca_{ij}$; the length of the shortest path between vertices $v_i$ and $v_j$ by $l(v_i, v_j)$; and the number of common neighbors of vertices $v_i$ and $v_j$ by $n_{ij}$.
Also we use the following notation: $A$ for the adjacency matrix, $a_{ij}$ for its elements, $A_i$ for its rows, $A_{\cdot j}$ for its columns; $I$ for the identity matrix; $\hat{A}$ for the ancestor matrix, with elements $\hat{a}_{ij} = 1$ iff the $j$th vertex is an ancestor of the $i$th vertex; $k_i$ for the degree of the $i$th vertex; $D$ for the diagonal degree matrix with elements $d_{ii} = k_i$; and $L$ for the Laplacian matrix, which is $D - A$.
Note that $a_{ij} = a_{ji} \in \{0, 1\}$ and $a_{ij}^2 = a_{ij}$ for all $i, j$; $n_{ij} = \sum_k a_{ik} a_{kj}$; $k_i = \sum_k a_{ik}$.
3. DISTANCE ON VERTICES
Similarity is, in a sense, the opposite of the concept of distance between information elements. One can use distances or metrics to construct a similarity measure for any kind of elements. For example, if $d(x, y)$ is a distance between $x$ and $y$, then their similarity could be measured as follows [5]:
$$s(x, y) = \frac{1}{1 + d(x, y)}. \quad (1)$$
In general, many types of monotonically decreasing functions could be used for this purpose.
3.1. PATH METRIC
The obvious measure for distance on tree nodes could be a path metric [6], i.e. length of the shortest path between them:
$$l(v_i, v_j) = l(v_i, lca_{ij}) + l(v_j, lca_{ij}). \quad (2)$$
The similarity measure based on the path metric could then be expressed as
$$s_l(v_i, v_j) = \frac{1}{1 + l(v_i, v_j)} = \frac{1}{1 + l(v_i, lca_{ij}) + l(v_j, lca_{ij})}. \quad (3)$$
But it is not very useful for hierarchical data structures, because it does not distinguish between similarities of node pairs located at different depths.
Consider a simple curriculum (fig. 1). It is obvious that the similarity between questions q1 and q2 should be greater than the similarity between questions q5 and q6, because q1 and q2 belong to the more specific theme «Matrices». However, the path distances within the two pairs are equal.
Later, we shall improve the path-based similarity measure by removing this effect.
Fig. 1. Example of a simple curriculum
3.2. RESISTANCE DISTANCE
The resistance distance $\Omega_{ij}$ between vertices $v_i$ and $v_j$ of a simple connected graph $G$ could be used to compute similarity [7] and is defined as
$$\Omega_{ij} = \Gamma_{ii} + \Gamma_{jj} - \Gamma_{ij} - \Gamma_{ji}, \quad (4)$$
where $\Gamma$ is the Moore-Penrose inverse of the Laplacian matrix $L$ of $G$.
However, it is shown in [8] that in the case of a tree
$$\Omega_{ij} = \det L[i; j] = l(v_i, v_j), \quad (5)$$
where $L[i; j]$ is the submatrix of $L$ obtained by deleting the $i$th and the $j$th rows and columns from $L$. Sadly, the resistance distance in a tree is just the path metric again.
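As a quick numerical check, the sketch below evaluates formula (4) with the Moore-Penrose pseudoinverse on a small path-shaped tree (the tree is an assumed example, not taken from the paper) and confirms that it reproduces the shortest-path length, as formula (5) states.

```python
import numpy as np

# Assumed example: a path-shaped tree 0 - 1 - 2 - 3 (any tree works).
edges = [(0, 1), (1, 2), (2, 3)]
n = 4
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1

D = np.diag(A.sum(axis=1))   # diagonal degree matrix
L = D - A                    # Laplacian matrix
G = np.linalg.pinv(L)        # Moore-Penrose inverse of L

def resistance(i, j):
    # Formula (4): Omega_ij = Gamma_ii + Gamma_jj - Gamma_ij - Gamma_ji
    return G[i, i] + G[j, j] - G[i, j] - G[j, i]

print(round(resistance(0, 3), 6))  # 3.0, equal to the path length l(v_0, v_3)
```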
3.3. ADJUSTED PATH-BASED SIMILARITY
Now return to the path metric. The simplest way to account for the granularity of the domain to which the concerned vertices belong is to adjust formula (2) as
$$l_a(v_i, v_j) = \frac{l(v_i, lca_{ij}) + l(v_j, lca_{ij})}{1 + l(lca_{ij}, t)}. \quad (6)$$
Obviously, $l_a$ is not a metric. This can be illustrated by a simple counterexample: in fig. 2, $l_a(q_i, q_k) = 6$ and $l_a(q_i, q_j) = 4/3$, so $l_a(q_i, q_k) > l_a(q_i, q_j) + l_a(q_j, q_k)$.
Fig. 2. Example of a tree where $l_a(q_i, q_k) > l_a(q_i, q_j) + l_a(q_j, q_k)$
Nevertheless, we can use $l_a$ as a dissimilarity measure, since it is larger for vertices that are more distant from each other.
So the adjusted path-based similarity measure can be written as
$$s_a(v_i, v_j) = \frac{1}{1 + l_a(v_i, v_j)} = \frac{1 + l(lca_{ij}, t)}{1 + l(lca_{ij}, t) + l(v_i, lca_{ij}) + l(v_j, lca_{ij})}. \quad (7)$$
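The following sketch illustrates formulas (3) and (7) on a small curriculum-like tree (the tree and node names are assumed for illustration; the exact tree of fig. 1 is not reproduced here). The plain path-based similarity gives equal values for a deep pair and a shallow pair of leaves, while the adjusted measure ranks the deeper pair higher.

```python
# Assumed tree given by parent links; "Mathematics" plays the role of the root t.
parent = {
    "Linear Algebra": "Mathematics", "Matrices": "Linear Algebra",
    "q1": "Matrices", "q2": "Matrices",
    "q5": "Mathematics", "q6": "Mathematics",
}
root = "Mathematics"

def ancestors(v):
    """Path from v up to the root, including v itself."""
    path = [v]
    while path[-1] != root:
        path.append(parent[path[-1]])
    return path

def depth(v):
    return len(ancestors(v)) - 1

def lca(u, v):
    anc_u = set(ancestors(u))
    # the first ancestor of v (going upwards) that is also an ancestor of u
    return next(w for w in ancestors(v) if w in anc_u)

def path_len(u, v):                      # formula (2)
    return depth(u) + depth(v) - 2 * depth(lca(u, v))

def s_l(u, v):                           # formula (3)
    return 1 / (1 + path_len(u, v))

def s_a(u, v):                           # formula (7)
    d = depth(lca(u, v))                 # d = l(lca, t)
    return (1 + d) / (1 + d + path_len(u, v))

print(s_l("q1", "q2"), s_l("q5", "q6"))  # 0.333... and 0.333... : indistinguishable
print(s_a("q1", "q2"), s_a("q5", "q6"))  # 0.6 and 0.333... : the deeper pair is more similar
```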
4. STRUCTURAL EQUIVALENCE
Two vertices of a graph are called structurally equivalent if they share the same neighbors. Thus, the similarity of vertices could be expressed by a generalization of the number of common neighbors.
The simplest and most obvious measure of structural equivalence is the number of common neighbors itself [1].
But in the case of a tree it turns out to be a binary variable that is equal to 1 if two vertices have the same parent, and is equal to 0 otherwise. So it is an almost useless value.
4.1. COSINE SIMILARITY
One of the most popular similarity measures is the cosine similarity. It is defined by the following simple formula [9]:
$$s(x, y) = \cos\theta = \frac{(x, y)}{\|x\|\,\|y\|}, \quad (8)$$
where $x$ and $y$ are two vectors, $\|x\|$ and $\|y\|$ are the norms of $x$ and $y$, $(x, y)$ is their dot product and $\theta$ is the angle between them.
It is often proposed to represent vertices of a graph as corresponding rows (or columns) of the adjacency matrix, so we could obtain that [1]:
$$s(A_i, A_j) = \frac{\sum_k a_{ik} a_{kj}}{\sqrt{k_i k_j}} = \frac{n_{ij}}{\sqrt{k_i k_j}}. \quad (9)$$
This value is almost useless again in the case of tree nodes. This is especially true for tree leaves, because they always have degree 1.
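A minimal numerical illustration of this degeneracy on an assumed small tree (not the paper's fig. 1): formula (9) gives cosine similarity 1 for any two sibling leaves and 0 for any other pair of leaves.

```python
import numpy as np

# Assumed tree: root 0 with children 1 and 2; leaves 3 and 4 under 1, leaf 5 under 2.
edges = [(0, 1), (0, 2), (1, 3), (1, 4), (2, 5)]
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1

def cosine(i, j):
    return A[i] @ A[j] / (np.linalg.norm(A[i]) * np.linalg.norm(A[j]))

print(cosine(3, 4))  # 1.0 -- sibling leaves look identical
print(cosine(3, 5))  # 0.0 -- leaves under different parents share no neighbors
```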
4.2. EUCLIDEAN DISTANCE
Given two vectors x and y we could compute the Euclidean distance between them:
$$\rho_E(x, y) = \|x - y\| = \sqrt{\sum_i (x_i - y_i)^2}. \quad (10)$$
For the distance on graph nodes it could be written as
$$\rho_E(A_i, A_j) = \sqrt{\sum_k (a_{ik} - a_{jk})^2} = \sqrt{\|A_i\|^2 + \|A_j\|^2 - 2(A_i, A_j)} = \sqrt{k_i + k_j - 2 n_{ij}}. \quad (11)$$
This formula gives another degenerate measure on nodes and, especially, leaves of a tree.
4.3. TANIMOTO SIMILARITY MEASURE
The next similarity measure that deals with vectors is the Tanimoto coefficient [9]:
$$s_T(x, y) = \frac{(x, y)}{\|x\|^2 + \|y\|^2 - (x, y)}, \quad (12)$$
or, using the previous representation of graph vertices as rows of the adjacency matrix:
$$s_T(A_i, A_j) = \frac{n_{ij}}{k_i + k_j - n_{ij}}. \quad (13)$$
It is a different mix of degrees and common neighbor counts that gives trivial results on tree nodes and leaves.
Consider two sets $M$ and $N$. The Jaccard index [6] is defined on these two sets as
$$J(M, N) = \frac{|M \cap N|}{|M \cup N|} = \frac{|M \cap N|}{|M| + |N| - |M \cap N|}. \quad (14)$$
Jaccard index measures the similarity between two given sets as the size of their intersection divided by the size of their union.
Let us arrange all members of $M \cup N$ in an ordered list $L$ with elements $l_i$. Consider binary vectors $x$ and $y$ with respective components:
$$x_i = \begin{cases} 1, & \text{if } l_i \in M; \\ 0, & \text{otherwise,} \end{cases} \qquad y_i = \begin{cases} 1, & \text{if } l_i \in N; \\ 0, & \text{otherwise.} \end{cases} \quad (15)$$
The Tanimoto coefficient of these vectors is equal to the Jaccard index of the given sets [9]:
$$s_T(x, y) = J(M, N). \quad (16)$$
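A small sketch confirming equation (16) on arbitrary example sets: the Tanimoto coefficient of the indicator vectors from (15) coincides with the Jaccard index of the underlying sets.

```python
M = {"a", "b", "c"}
N = {"b", "c", "d", "e"}

universe = sorted(M | N)                  # the ordered list of all members of M and N
x = [1 if item in M else 0 for item in universe]
y = [1 if item in N else 0 for item in universe]

dot = sum(xi * yi for xi, yi in zip(x, y))
tanimoto = dot / (sum(x) + sum(y) - dot)  # formula (12) for binary vectors
jaccard = len(M & N) / len(M | N)         # formula (14)

print(tanimoto, jaccard)                  # both equal 0.4
```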
4.4. PEARSON COEFFICIENT
One can use the standard Pearson correlation coefficient as a measure of similarity between two given vertices:
$$r_{ij} = \frac{\operatorname{cov}(A_i, A_j)}{\sigma_{A_i}\,\sigma_{A_j}} = \frac{n_{ij} - k_i k_j/n}{\sqrt{k_i - k_i^2/n}\,\sqrt{k_j - k_j^2/n}} = \frac{n_{ij}\,n - k_i k_j}{\sqrt{k_i n - k_i^2}\,\sqrt{k_j n - k_j^2}}. \quad (17)$$
And again, for leaves of a tree, we obtain a degenerate formula:
$$r_{ij} = \frac{n_{ij}\,n - 1}{n - 1} = \begin{cases} 1, & \text{if } v_i \text{ and } v_j \text{ have the same parent;} \\ -\dfrac{1}{n-1}, & \text{otherwise.} \end{cases} \quad (18)$$
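A quick check of formula (18) with NumPy on an assumed small tree (n = 6 nodes): sibling leaves correlate perfectly, while any other pair of leaves gets −1/(n − 1).

```python
import numpy as np

# Assumed tree: root 0, internal nodes 1 and 2, leaves 3, 4, 5.
edges = [(0, 1), (0, 2), (1, 3), (1, 4), (2, 5)]
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1

print(np.corrcoef(A[3], A[4])[0, 1])  # 1.0 -- leaves with the same parent
print(np.corrcoef(A[3], A[5])[0, 1])  # -0.2 == -1/(n - 1)
```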
4.5. DIFFERENT REPRESENTATION OF TREE VERTICES
Other kinds of measures can be applied to binary vectors. Examples include various weighted metrics, set and string distances, and even logical comparison [9]. But if we keep applying them directly to the rows of the adjacency matrix of a tree, the results will be trivial again.
We propose another way to represent tree nodes, based on using an ancestor matrix instead of the adjacency matrix. The ancestor matrix $\hat{A}$ of a graph is defined as a square matrix whose element $\hat{a}_{ij}$ is set to 1 if the $j$th vertex is an ancestor of the $i$th vertex, and to 0 otherwise. The ancestor matrix of a tree is less sparse than its adjacency matrix, so it carries more information.
It should be noted that different vertices $v_i$ and $v_j$ of a graph can have equal corresponding rows $A_i$ and $A_j$ of its adjacency matrix. In particular, this applies to any pair of leaves that are children of the same parent node in a tree. Thus any of the similarity measures described in this chapter would give the highest value on such a pair of leaves. This behavior is undesirable, because we assume that only identical elements should have the highest value of similarity [3]. The same can be observed when using the rows $\hat{A}_i$ and $\hat{A}_j$ of the ancestor matrix.
To get rid of this effect, we propose to use the rows of the matrix $C = I + \hat{A}$ as binary vectors for measuring distances and similarity between nodes of a tree.
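A minimal sketch of this representation (the tree below is an assumed example): build the ancestor matrix, form $C = I + \hat{A}$, and apply an ordinary measure to its rows. Sibling leaves are now similar but no longer indistinguishable from identical nodes.

```python
import numpy as np

# Assumed tree given by parent links; node 0 is the root.
parent = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2}
n = 6

Ahat = np.zeros((n, n))        # ancestor matrix: Ahat[i, j] = 1 iff j is an ancestor of i
for v in range(1, n):
    p = parent[v]
    while True:
        Ahat[v, p] = 1
        if p == 0:
            break
        p = parent[p]

C = np.eye(n) + Ahat           # extended ancestor matrix

def tanimoto(i, j):
    dot = C[i] @ C[j]
    return dot / (C[i] @ C[i] + C[j] @ C[j] - dot)

print(tanimoto(3, 3))  # 1.0 -- only identical nodes reach the maximum
print(tanimoto(3, 4))  # 0.5 -- sibling leaves: high but not maximal
print(tanimoto(3, 5))  # 0.2 -- leaves in different subtrees
```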
The two following results reveal the relationship between the graph distance on tree nodes and metrics on the rows of the extended ancestor matrix $C$ of the tree.
Theorem 1. Let $T$ be a rooted tree with ancestor matrix $\hat{A}$. Then
$$s_a(v_i, v_j) = s_T(C_i, C_j) \quad (19)$$
for any two vertices $v_i$, $v_j$ of $T$ and corresponding rows $C_i$, $C_j$ of $C = I + \hat{A}$.
Proof. Consider the sets $P_i = \{v_i, t_{i_1}, t_{i_2}, \ldots, lca_{ij}, t_{k_1}, t_{k_2}, \ldots, t\}$ and $P_j = \{v_j, t_{j_1}, t_{j_2}, \ldots, lca_{ij}, t_{k_1}, t_{k_2}, \ldots, t\}$, where $t$ is the root of $T$, $lca_{ij}$ is the lowest common ancestor of $v_i$ and $v_j$ in $T$, $t_{k_p}$ are their other common ancestors, and $t_{i_m}$, $t_{j_m}$ are the other ancestors of the given vertices $v_i$ and $v_j$, respectively. In this notation, using equation (16) and the definition of the Jaccard index (14), we directly obtain that
$$s_T(C_i, C_j) = J(P_i, P_j) = \frac{|P_i \cap P_j|}{|P_i| + |P_j| - |P_i \cap P_j|}. \quad (20)$$
We recall that the length of the shortest path between two vertices is one less than the number of vertices in this path. Then we notice that
$$(C_i, C_j) = |P_i \cap P_j| = |\{lca_{ij}, t_{k_1}, t_{k_2}, \ldots, t\}| = 1 + l(lca_{ij}, t), \quad (21)$$
$$\|C_i\|^2 = |P_i| = 1 + l(v_i, t) = 1 + l(v_i, lca_{ij}) + l(lca_{ij}, t), \quad (22)$$
$$\|C_j\|^2 = |P_j| = 1 + l(v_j, t) = 1 + l(v_j, lca_{ij}) + l(lca_{ij}, t). \quad (23)$$
Finally, we can write
$$s_T(C_i, C_j) = \frac{1 + l(lca_{ij}, t)}{1 + l(v_i, lca_{ij}) + l(lca_{ij}, t) + 1 + l(v_j, lca_{ij}) + l(lca_{ij}, t) - (1 + l(lca_{ij}, t))}. \quad (24)$$
By some trivial algebra, this turns exactly into $s_a(v_i, v_j)$. ■
Corollary 1. We can define a proper metric on the vertices of $T$ as
$$\tilde{l}_a(v_i, v_j) = \frac{l(v_i, v_j)}{1 + l(lca_{ij}, t) + l(v_i, v_j)}. \quad (25)$$
Proof. Formula (25) is derived immediately by defining $\tilde{l}_a(v_i, v_j) = 1 - s_a(v_i, v_j)$ from
$$1 - s_a(v_i, v_j) = 1 - \frac{1 + l(lca_{ij}, t)}{1 + l(lca_{ij}, t) + l(v_i, lca_{ij}) + l(v_j, lca_{ij})} = \frac{l(v_i, lca_{ij}) + l(v_j, lca_{ij})}{1 + l(lca_{ij}, t) + l(v_i, lca_{ij}) + l(v_j, lca_{ij})}.$$
Theorem 1 shows that $1 - s_a(v_i, v_j)$ is equal to the Tanimoto distance $1 - s_T(C_i, C_j)$, and the Tanimoto distance is known to be a proper metric [9]. ■
Theorem 2. Within the notation of Theorem 1,
$$\rho_E(C_i, C_j) = \sqrt{l(v_i, v_j)}. \quad (26)$$
Proof. Using the same approach and formulas (11), (21), (22) and (23), we obtain
$$\rho_E(C_i, C_j) = \sqrt{1 + l(v_i, lca_{ij}) + l(lca_{ij}, t) + 1 + l(v_j, lca_{ij}) + l(lca_{ij}, t) - 2(1 + l(lca_{ij}, t))}, \quad (27)$$
which is equal to $\sqrt{l(v_i, lca_{ij}) + l(v_j, lca_{ij})} = \sqrt{l(v_i, v_j)}$. ■
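Both theorems can be checked numerically. The sketch below uses an assumed tree given by parent links and verifies that the Tanimoto similarity of rows of $C$ equals the adjusted path-based similarity (7), and that the Euclidean distance between those rows equals the square root of the path length.

```python
import numpy as np

# Assumed tree given by parent links; node 0 is the root.
parent = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2}
n = 6

def anc_set(v):
    """The extended ancestor set of v: v itself plus all its ancestors up to the root."""
    s = {v}
    while v != 0:
        v = parent[v]
        s.add(v)
    return s

C = np.zeros((n, n))           # rows of C = I + ancestor matrix
for v in range(n):
    C[v, list(anc_set(v))] = 1

def path_len(u, v):            # l(u, v) computed via ancestor sets
    common = len(anc_set(u) & anc_set(v))
    return (len(anc_set(u)) - common) + (len(anc_set(v)) - common)

def s_a(u, v):                 # formula (7); depth of lca = |common ancestor set| - 1
    d = len(anc_set(u) & anc_set(v)) - 1
    return (1 + d) / (1 + d + path_len(u, v))

def tanimoto(i, j):
    dot = C[i] @ C[j]
    return dot / (C[i] @ C[i] + C[j] @ C[j] - dot)

for i in range(n):
    for j in range(n):
        assert abs(tanimoto(i, j) - s_a(i, j)) < 1e-12                           # Theorem 1
        assert abs(np.linalg.norm(C[i] - C[j]) - path_len(i, j) ** 0.5) < 1e-12  # Theorem 2
print("Theorems 1 and 2 hold on this example")
```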
For other kinds of metrics and similarity measures defined on the rows of the extended ancestor matrix, path-based expressions can be obtained using the same technique.
5. DISCUSSION
Similarity measures on tree nodes have been discussed primarily in relation to semantic similarity and its applications [2, 3, 4]. Edge-counting methods are well developed in this area. The closest form to our adjusted path-based similarity measure is the one proposed in [10]. It would be interesting to adopt the technique shared by previous researchers [2, 3, 4] to compare these measures in terms of correlation with human judgment.
Some tree comparison methods, e.g. consensus methods, are also based on computing similarity between tree nodes. Moreover, in [11] a set similarity measure in the form of the Jaccard index is used. The difference is that there it is defined on leaf sets under nodes of leaf-labeled trees, whereas we consider extended ancestor sets for nodes of any rooted tree.
While the theoretical relationship between the resistance distance in a graph and the Euclidean distance in some vector space is well known [8], we believe that the particular result obtained in Theorem 2 has not been observed earlier.
Theorem 1 and Theorem 2 give us a way to compute distances on nodes of a tree using standard vector operations.
On the other hand, they provide us with simple path-based methods to measure similarity in tree-structured data.
We propose to use these measures in many other related areas, for example, content-based image retrieval [12] or case-based reasoning for student diagnosis [13].
6. CONCLUSION
This work provides a survey of similarity measures on the nodes of a tree. Distance-based and structural equivalence measures are discussed. A new method for representing tree nodes and its use for measuring similarity is described. Theorems 1 and 2 give interesting results about the relationship between paths on tree nodes and metrics on the rows of the extended ancestor matrix of the tree. Future work will be related to further study of different similarity measures and their comparative analysis.
References
1. Newman M.E.J. Networks: An Introduction. Oxford University Press, 2010.
2. Hliaoutakis A., Varelas G., Voutsakis E., Petrakis E.G.M., Milios E. Information Retrieval by Semantic Similarity // Intern. Journal on Semantic Web and Information Systems (IJSWIS), 3(3), July/Sept. 2006. P. 55-73.
3. Lin D. An Information-Theoretic Definition of Similarity // In Proc. of the 15th Int. Conference on Machine Learning, 1998. P. 296-304.
4. Jiang J.J., Conrath D. W. Semantic similarity based on corpus statistics and lexical taxonomy // In Proc. of Int. Conference Research on Computational Linguistics (ROCLING X). Taiwan, 1997.
5. Segaran T. Programming Collective Intelligence. O'Reilly Media, 2007.
6. Deza M. M., Deza E. Encyclopedia of Distances. Springer, 2009.
7. Kunegis J., Schmidt S., Albayrak S., Bauckhage C., Mehlitz M. Modeling Collaborative Similarity with the Signed Resistance Distance Kernel // In Proc. European Conf. on Artificial Intelligence, 2008. P. 261-265.
8. Klein D.J., Randic M. Resistance distance // Journal of Mathematical Chemistry, 1993. Vol. 12, № 1. P. 81-95.
9. Kohonen T. Self-Organizing Maps. Springer, 2001.
10. Wu Z., Palmer M. Verb semantics and lexical selection // In Proc. of the 32nd Annual Meeting of the Associations for Computational Linguistics, Las Cruces, New Mexico, 1994. P. 133-138.
11. Zhang L. On matching nodes between trees. Tech. Rep. № 2003-2067. HP Labs, 2003.
12. Manouvrier M., Rukoz M., Jomier G. A generalized metric distance between hierarchically partitioned images // In Proc. of the 6th Int. Workshop MDM/KDD'05, August 21, 2005. P. 33-41.
13. Tsaganou G., Grigoriadou M., Cavoura T. Case-based reasoning diagnosis of students' cognitive profiles on historical text comprehension // In Proc. IEEE Int. Conf. on Advanced Learning Technologies (ICALT 2002), 2002.
Gleb B. Sologub,
Department of Applied Mathematics and Physics, Moscow Aviation Institute (State Technical University),
glebsologub@ya.ru