
COMPUTER SCIENCE

APPLICATION OF STATISTICALLY SIMILAR METHODS TO IMPROVE THE RELIABILITY OF OBJECT CLASSIFICATION

DOI 10.24411/2072-8735-2018-10203

Veniamin N. Tarasov,

Povolzhskiy State University of Telecommunications and Informatics, Samara, Russia, [email protected]

Ekaterina M. Mezentseva,

Povolzhskiy State University of Telecommunications and Informatics, Samara, Russia, [email protected]

Sergey V. Malakhov,

Povolzhskiy State University of Telecommunications and Informatics, Samara, Russia, [email protected]

Keywords: object categorization, probability theory, classification algorithms, Bayesian classifier, Fisher method, a priori probability, a posteriori probability, decision thresholds, subsets of the intersection of sets, combined classifier.

This article provides an algorithm for partitioning a set of objects into a finite set of classes (categories). The task is to determine whether an object belongs to one of the pre-selected classes based on the analysis of the set of features that characterize the object. To solve the problem, we consider three categories (classes) to which objects should be assigned when they are classified. The third is usually the category of "undefined objects", which neither classifier recognized. The article suggests the simultaneous use of two statistically similar data-mining methods belonging to parametric statistics: the Bayes and Fisher methods. A mathematical description is given of the Bayesian classifier, which is based on so-called joint probabilities, and of the Fisher method. The a priori and a posteriori probabilities, the a priori odds, and the combined probabilities of objects belonging to the given classes are calculated. For the software implementation of the Fisher method, a Gauss quadrature formula with 15 nodes was applied. Based on the results of testing the developed filter, optimal decision thresholds for these classification methods are set. The initial training of the classifier is described, and a justification is given that continuous training of the classifier should take place throughout its entire life cycle. Optimality criteria for classifying messages with statistical methods, taking into account errors of the first and second kind, are presented. As the optimal criterion for assessing the quality of classifier training, the article takes the maximum value of the measure of proximity of the two sets S_B and S_F, that is, the absolute measure N(S_B ∩ S_F), the number of common objects in these sets. An algorithm for the software implementation of the intersection of two sets is given. The results of experimental studies evaluating the speed of message-filtering algorithms using the Bayes and Fisher methods, each separately and combined, as well as the throughput of the combined filter, are described. The described way of organizing a combined classifier can be used in many areas: information technology, telecommunications, medicine, biology, etc.

Information about authors:

Veniamin N. Tarasov, Povolzhskiy State University of Telecommunications and Informatics, Professor, Software and Management in technical Systems Department, Samara, Russia

Ekaterina M. Mezentseva, Povolzhskiy State University of Telecommunications and Informatics, Assistant Professor, Software and Management in technical Systems Department, Samara, Russia

Sergey V. Malakhov, Povolzhskiy State University of Telecommunications and Informatics, Assistant Professor, Software and Management in technical Systems Department, Samara, Russia


For citation:

Tarasov V.N., Mezentseva E.M., Malakhov S.V. (2018). Application of statistically similar methods to improve the reliability of object classification. T-Comm, vol. 12, no. 12, pp. 66-70.


Introduction

The classical classification task is to determine whether an object (or observation) belongs to one of the pre-allocated classes based on the analysis of all of the features that characterize this object. For objects measured on the classification scale, we can determine their membership in one or another class, as well as their number and frequency. Measurement on the next, rank (or ordinal) scale, besides determining the class of membership, allows objects to be ordered so that one can choose which of two objects is preferable. That is why objects measured on ordinal and classification scales are often called categorical: they can be correlated with one or another class defined in advance.

The statistical methods that allow this belong to the methods of parametric statistics. To solve the classification problem, the article suggests, firstly, using two statistically similar methods at the same time. This is motivated by a simple problem of probability theory: the probability that at least one of two shooters firing simultaneously at a target hits it is higher than the probability of either of them hitting it separately.

Secondly, to improve the reliability of the results of the two classifiers, it is proposed to analyze the subset formed by the intersection of the sets classified by both methods in order to establish their proximity. As statistical classifiers, we consider the well-known methods of Bayes and Fisher. Both methods have proven themselves in real-world e-mail filtering systems (spam, not spam).

The formulation of the problem

Let Ω = {ω} be a set of objects, and let this set be partitioned into a finite set of classes (or categories) L_k, k = 1, ..., K. An object ω is defined by the values of its attributes x_n, n = 1, ..., N, which are the same for all objects within a given classification task. For example, when filtering messages, the attributes are the concrete words of the message, broken down into terms. The set of attribute values defines the description of the object:

O(ω) = {x_1(ω), ..., x_N(ω)}.

The attributes can take values from different sets of valid values. For example, the task can be formalized by reducing it to the approximation of a function

Φ(ω, L_k) = 1 if ω ∈ L_k, and 0 otherwise.

The object ω is then called a positive example of the category L_k if Φ(ω, L_k) = 1, and a negative one otherwise [5].

When statistical classifiers are used, each attribute x_n is additionally matched with a weight (significance) in the form of a real number 0 ≤ w_n ≤ 1, which has a statistical or probabilistic nature and depends on how the classifier is selected.

The weights are normalized in such a way that the sum of the squares of the weights of each object is 1. The object classification problem is then posed as follows: the object ω is assigned to the class L_k according to the description O(ω) = {x_1(ω), ..., x_N(ω)}, using the learning rule G = {L_1, ..., L_K} about the classes L_k, based on the calculation of the statistical probabilities p_k(ω) = P(ω ∈ L_k) and their further analysis.
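
As a minimal illustration of this formulation, the sketch below encodes an object description O(ω) as a list of term attributes together with the indicator function Φ(ω, L_k). The names are hypothetical, not from the paper; this is only one possible representation:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ClassifiedObject:
    """An object w with its description O(w) = {x_1(w), ..., x_N(w)}."""
    attributes: list            # e.g. the terms of a message
    label: Optional[str] = None # known class L_k, if any (training data)

def phi(obj: ClassifiedObject, category: str) -> int:
    """Indicator Phi(w, L_k): 1 for a positive example of L_k, else 0."""
    return 1 if obj.label == category else 0
```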

The solution of the problem

Let us adopt a simplification and consider three categories (classes) L_k to which objects should be assigned when they are classified. We denote them A, B, and C, where C is usually the category of "undefined objects", which neither classifier recognized. Consider first the Bayesian classifier, which is based on so-called joint probabilities.

We define the statistical probability that an individual attribute x_n of an object belongs to one of the categories A, B, and C. To do this, we could divide the number of objects found with attribute n in a given category by the total number of objects in the same category. Here, however, it is suggested to use another method, described below.

Let:

N_ai be the number of objects with attribute i in class A;
N_bi be the number of objects with attribute i in class B.

Then the statistical probability of attribute i appearing in class A is

p_ai = N_ai / (N_ai + N_bi),    (1)

and the probability of it appearing in class B is

p_bi = N_bi / (N_ai + N_bi).    (2)

Thus, the number of objects with attribute i in one of the categories is divided by the total number of objects with that attribute. Note that the above formulas give an accurate result only for those attributes that the classifier has already met in both categories. This makes it too sensitive, in the early stages of learning, to rare objects. To cope with this problem, we calculate a new probability, starting from an assumed a priori probability p_0 with a weight w given to this probability, and then adding in the probabilities calculated by formulas (1) and (2).

If the probability p_0 = 0.5 and w = 1 (the weight of the assumed probability is equal to one attribute), then the weighted average probabilities based on expressions (1), (2) are

p'_ai = (w × p_0 + p_ai × (N_ai + N_bi)) / (w + N_ai + N_bi),
p'_bi = (w × p_0 + p_bi × (N_ai + N_bi)) / (w + N_ai + N_bi).

Such an approach makes it possible to avoid division by zero in the formulas below in their software implementation, as well as to take into account rarely encountered attributes.
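
A minimal sketch of how these weighted probabilities might be computed (the function and parameter names are illustrative, not from the paper):

```python
def attr_probability(n_a: int, n_b: int) -> float:
    """p_ai by formula (1): the share of attribute i falling in class A."""
    return n_a / (n_a + n_b)

def weighted_probability(n_a: int, n_b: int,
                         p0: float = 0.5, w: float = 1.0) -> float:
    """Weighted average of the assumed prior p0 and the observed p_ai.

    With p0 = 0.5 and w = 1, an attribute never seen before
    (n_a = n_b = 0) yields 0.5 instead of a division by zero, and
    rare attributes stay pulled toward the prior until enough
    evidence accumulates.
    """
    total = n_a + n_b
    p = attr_probability(n_a, n_b) if total else 0.0
    return (w * p0 + total * p) / (w + total)
```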

To obtain combined probabilities for the entire object, we use the description O(ω) = {x_1(ω), ..., x_N(ω)} (in spam filters, for example, these are continuously updated dictionaries), which must be constantly modified at the classifier-training stage. In the Bayesian formulas the probabilities are assumed to be independent, and therefore they can be multiplied:

P(ω ∈ A) = p_a1 × p_a2 × ... × p_aN    (3)

for the probability that the object belongs to class A, and

P(ω ∈ B) = p_b1 × p_b2 × ... × p_bN    (4)

for the probability that the object belongs to class B.

To calculate the statistical probability that an object belongs to one of the two categories (A, B), we introduce two hypotheses:

H_A - the object belongs to class A (ω ∈ A);
H_B - the object belongs to class B (ω ∈ B).

We introduce the definitions:

N_a - the total number of objects of class A;
N_b - the total number of objects of class B;

p_a = N_a / (N_a + N_b) - the a priori probability that the object belongs to class A;
p_b = N_b / (N_a + N_b) - the a priori probability that the object belongs to class B.

The a priori odds of ω ∈ A and ω ∈ B are, respectively,

p*_a = p_a / (1 − p_a) and p*_b = p_b / (1 − p_b).

Then, based on the Bayes theorem with the use of this a priori knowledge, we get

P(H_A) = P(ω ∈ A) × p*_a / (P(ω ∈ A) × p*_a + P(ω ∈ B) × p*_b),    (5)

the posterior probability of the object belonging to class A, and

P(H_B) = P(ω ∈ B) × p*_b / (P(ω ∈ A) × p*_a + P(ω ∈ B) × p*_b),    (6)

the posterior probability of the object belonging to class B. Here the statistical probabilities P(ω ∈ A) and P(ω ∈ B) are determined by formulas (3) and (4).
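
A compact sketch of how formulas (3)-(6) might be implemented. The names are illustrative, and the log-space accumulation of the products is my own numerical safeguard (long attribute lists would otherwise underflow to zero), not something the paper prescribes:

```python
import math

def bayes_posteriors(p_a_attrs: list, p_b_attrs: list,
                     odds_a: float, odds_b: float):
    """Posterior probabilities (5), (6) for classes A and B.

    p_a_attrs, p_b_attrs hold the per-attribute probabilities p_ai, p_bi;
    odds_a, odds_b are the a priori odds p*_a, p*_b.
    """
    log_pa = sum(math.log(p) for p in p_a_attrs)  # ln P(w in A), cf. (3)
    log_pb = sum(math.log(p) for p in p_b_attrs)  # ln P(w in B), cf. (4)
    # Rescale by the larger exponent before exponentiating (stability).
    m = max(log_pa, log_pb)
    sa = math.exp(log_pa - m) * odds_a
    sb = math.exp(log_pb - m) * odds_b
    return sa / (sa + sb), sb / (sa + sb)         # (5) and (6)
```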

Fisher-based decision making

According to the Fisher method, all the probabilities p_ai and p_bi are multiplied, as in the Bayesian method, but then the natural logarithm of the product is taken and the result is multiplied by −2:

χ² = −2 × ln(P(ω ∈ A)) or χ² = −2 × ln(P(ω ∈ B)),

where the statistical probabilities P(ω ∈ A) and P(ω ∈ B) are again determined by formulas (3) and (4).

Fisher proved that if there is a set of independent random probabilities (3) or (4), then the value −2 × ln(P(ω ∈ A)) obeys the χ² (chi-square) distribution with 2n degrees of freedom (n is the number of features of the object):

F(χ²) = (1 / (2^n × Γ(n))) × ∫_0^{χ²} x^(n−1) e^(−x/2) dx,    (7)

where Γ(n) is the gamma function.

Taking this into account, together with the representation of the gamma function of an integer argument, we rewrite integral (7) in the form

F(χ²) = (1 / (2^n × (n − 1)!)) × ∫_0^{χ²} x^(n−1) e^(−x/2) dx.    (8)

The value of (n − 1)! on its own, and the integrand of (8) as a whole, may cause an overflow error in a software implementation. For this reason, their calculation in the program must be implemented by a recurrent formula.

The calculation of the probability from expression (8) in the program for the combined classifier [1] is implemented using the Gauss quadrature formula with 15 nodes:

∫_a^b f(t) dt ≈ ((b − a) / 2) × Σ_{i=1}^{15} A_i f(t_i),

where t_i = (b + a)/2 + (b − a)x_i/2, x_i are the nodes of the Gauss quadrature formula, and A_i are the Gaussian coefficients (i = 1, 2, ..., 15) [3]. In our case a = 0, b = χ².

The number returned by the function F(χ²) will be small if there are many attributes of class A in the object. In order to classify the object correctly, we need the opposite result. To do this, we subtract the value of F(χ²) from one. Accordingly, by subtracting from 1 the value of the function F(χ²) computed for a large number of attributes of class B, we obtain the probability that the object belongs to class B.

The Fisher method is not symmetric; therefore, it is necessary to combine the probabilities of the object belonging to classes A and B into one number, which gives the value of this probability in the range from 0 to 1. To do this, we use the so-called Fisher indicator:

I = (1 + P(H'_A) − P(H'_B)) / 2,

where P(H'_A) = 1 − F(−2 ln(P(A))) is the probability that the object belongs to class A, and P(H'_B) = 1 − F(−2 ln(P(B))) is the probability that the object belongs to class B [2].
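
The sketch below shows one common way to compute 1 − F(χ²) for 2n degrees of freedom: for an even number of degrees of freedom the complement has a closed form whose terms are built by a recurrence, so (n − 1)! is never formed explicitly and the overflow noted above is avoided. The paper itself evaluates integral (8) with the 15-node Gauss quadrature; the recurrence here is a substitute playing the same role, and all names are illustrative:

```python
import math

def chi2_complement(chi2: float, n: int) -> float:
    """1 - F(chi2) for the chi-square law with 2n degrees of freedom.

    For even degrees of freedom 2n:
        1 - F(x) = exp(-x/2) * sum_{k=0}^{n-1} (x/2)^k / k!
    Each term is obtained from the previous one (term *= m/k), which
    sidesteps any explicit factorial and the overflow it would cause.
    """
    m = chi2 / 2.0
    term = math.exp(-m)   # k = 0 term
    total = term
    for k in range(1, n):
        term *= m / k
        total += term
    return min(total, 1.0)

def fisher_indicator(p_a_attrs: list, p_b_attrs: list) -> float:
    """Fisher indicator I = (1 + P(H'_A) - P(H'_B)) / 2, in [0, 1].

    Assumes all probabilities lie in (0, 1]; the weighted averaging of
    formulas (1), (2) guarantees this, so log() never sees zero.
    """
    n = len(p_a_attrs)
    chi2_a = -2.0 * math.fsum(math.log(p) for p in p_a_attrs)
    chi2_b = -2.0 * math.fsum(math.log(p) for p in p_b_attrs)
    ph_a = chi2_complement(chi2_a, n)  # P(H'_A) = 1 - F(-2 ln P(A))
    ph_b = chi2_complement(chi2_b, n)  # P(H'_B) = 1 - F(-2 ln P(B))
    return (1.0 + ph_a - ph_b) / 2.0
```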

Decision-making thresholds

In the Bayes and Fisher classification methods, at the stage of training the classifiers it is necessary to set the initial values of the lower and upper thresholds for final decision making. Let the values T and L determine the upper and lower decision thresholds, respectively, and let:

H be one of the previously defined classes (A, B);
P(H) be the probability of the object falling into one of the classes (A, B);
I be the Fisher indicator. Then:

- we assume that the object belongs to the group H if P(H), I ≥ T;
- the object does not belong to the group H if P(H), I < L;
- if L ≤ P(H), I < T, then no decision can be made and the object belongs to class C.

For example, testing the developed filter on the classification of messages from site forums at various decision thresholds showed that the optimal values are an upper threshold of 0.95 and a lower threshold of 0.4 [1].
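
A sketch of this three-way thresholding rule, applied to whichever score is in use (the Bayes posterior P(H) or the Fisher indicator I). The function name is hypothetical; the default thresholds are the values reported in [1]:

```python
def decide(score: float, upper: float = 0.95, lower: float = 0.4) -> str:
    """Three-way decision on a class score in [0, 1]."""
    if score >= upper:
        return "H"        # object assigned to the class under test
    if score < lower:
        return "not H"    # object rejected for this class
    return "C"            # no decision: object falls into class C
```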

Initial training of the classifier

For the classifier to work correctly, it is necessary to conduct its initial training. This requires a training data set consisting of example objects whose membership in a particular class is known in advance, i.e. the expert labels them manually.

Then the learning process is initiated: using the mathematical methods given above, this set is analyzed and a model is built, which is later used to classify new objects.

Depending on the result, if an object could be classified, the system marks it as an object of class A or B, or as undefined, and conducts automatic training. If the object could not be classified, the expert must decide which category the given object belongs to. Thus, continuous training of the classifier should take place throughout its entire life cycle.

Thanks to this constant training of the classifier, the accuracy of object classification with the above mathematical methods increases significantly.

The criteria for evaluating the performance of classification algorithms

The criteria used in assessing the quality of classification algorithms are based on the truth of the classification result. There are two types of errors:

- errors of the first kind, or false negatives: an object of class A mistakenly recognized by the algorithm as an object of class B;
- errors of the second kind, or false positives: an object of class B mistakenly recognized by the algorithm as an object of class A.

Classifiers usually strike a compromise between acceptable levels of errors of the first and second kind and, when making a decision, use threshold values that can vary. This determines whether the classifier is more "strict" or more "soft" [4].

Improving the reliability and quality of the classifier

The proposed approach to organizing the classifier consists in the joint use of the Bayes and Fisher methods; to improve the quality of the classification, it is proposed to analyze the subset formed by the intersection of the sets recognized by both methods (for classes A, B, and C). Let S = {ω_i} (i = 1, ..., M) be the set of objects to be classified, and let S_B ⊂ S and S_F ⊂ S be the sets of objects recognized respectively by the Bayes and Fisher classifiers in categories A, B, and C. Then the intersection S_B ∩ S_F in all three of the above categories can be used to judge the quality of work of the combined classifier. The completeness of the intersection S_B ∩ S_F will also give estimates for the subsets S_B \ S_F and S_F \ S_B (fig. 1).

As a measure of proximity of the two sets S_B and S_F, we use the absolute measure N(S_B ∩ S_F), the number of common objects in these sets. Thus, the maximum value of the proximity measure of the two subsets is taken as the optimal criterion for assessing the quality of training of the combined classifier (fig. 1):

N(S_B ∩ S_F) → max.    (9)

Then the general scheme for assessing the quality of training of the combined classifier consists of the following steps, determining the membership:

- ω ∈ A based on the Bayesian algorithm;
- ω ∈ A based on the Fisher algorithm;
- ω ∈ A common to both algorithms (intersection);
- ω ∈ B based on the Bayesian algorithm;
- ω ∈ B based on the Fisher algorithm;
- ω ∈ B common to both algorithms (intersection);
- ω ∈ C based on the Bayesian algorithm;
- ω ∈ C based on the Fisher algorithm;
- ω ∈ C common to both algorithms (intersection).

Fig. 1. Illustration of the measure of proximity of two sets S_B and S_F

In this case, we really can most fully evaluate all the components of the overall picture: only with such a complete picture can we reasonably judge the quality of the combined classifier. The calculation of the intersection of two sets is organized according to the merge algorithm, which computes the intersection of two sets represented by ordered lists. At the program input, two intersecting sets A and B are specified, each given by its own pointer a and b. At the output of the program, we obtain the intersection C = A ∩ B, given by the pointer c.

Justification of the algorithm

At each step of the main cycle, one of three situations is possible: the current element of A is less than, greater than, or equal to the current element of B. In the first case, the current element of A does not belong to the intersection; it is skipped and the cursor in this set advances. In the second case, the same is done with set B. In the third case, matching elements have been found: one copy of the element is added to the result, and the cursors in both sets advance at once. Thus, all the coinciding elements of both sets fall into the result, each exactly once.
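
A direct transcription of this merge scheme, with list indices playing the role of the pointers a, b, c (the function name is illustrative). Applied to the sorted sets S_B and S_F, len(intersect_sorted(s_b, s_f)) then gives the proximity measure N(S_B ∩ S_F) from criterion (9):

```python
def intersect_sorted(a: list, b: list) -> list:
    """Intersection of two sets given as ordered lists (merge scheme).

    Two cursors advance through a and b; on a match, one copy goes to
    the result and both cursors advance, so each common element
    appears in the result exactly once.
    """
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1               # a[i] cannot be in the intersection
        elif a[i] > b[j]:
            j += 1               # likewise for b[j]
        else:
            out.append(a[i])     # common element, keep one copy
            i += 1
            j += 1
    return out
```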

Conclusion

The description of the principles of organization of the combined classifier became possible after the practical implementation of such a classifier as a spam filter and after numerous experimental studies assessing the quality and performance of the developed classifier. It has been established that the quality of the filter depends on the degree of its knowledge, and therefore the filter itself must be continuously retrained. The running time of the Bayes algorithm on messages 1 KB long was 0.0001 s, of the Fisher algorithm 0.0007 s, and of the combined algorithm 0.0009 s. The combined filter throughput averaged 17 messages per second, which satisfies the requirements of the majority of potential users of the system.

This method of organizing a combined classifier can be used in many areas: in information technology, medicine, biology, etc.

References

1. Mezenceva E.M., Tarasov V.N. (2010). Computer networks security. Web programming of the multi-module spam filter. Software Engineering, vol. 4, pp. 27-32.

2. Peter Seibel. (2005). Practical Common Lisp. New York: Apress. 528 p.

3. Nikolskiy S. (1974). Quadrature Formula. Moscow: Nauka. 226 p.

4. Mezenceva E.M., Tarasov V.N. (2013). An optimal filter construction based on combining statistical classifiers. Information and Communications Technologies, book 1, vol. 4, pp. 53-57.

5. Mezenceva E.M., Tarasov V.N., Samarkin M.E. (2017). Deep analysis of the data of a telecommunications company to identify abnormal customers. Problems of Infocommunications Science and Technology (PIC S&T), 4th International Scientific-Practical Conference, 13-15 Oct. 2017, Kharkiv, Ukraine, pp. 311-314. DOI: 10.1109/INFOCOMMST.2017.8246404.
