A heuristic algorithm for generating the numerical terms of a linguistic variable
Elena N. Chujkova
Associate Professor, Department of Computer Systems and Information Security, Don State Technical University
Address: 1, Gagarin Square, Rostov-on-Don, 344000, Russia E-mail: [email protected]
Vasilij V. Galushka
Associate Professor, Department of Computer Systems and Information Security, Don State Technical University
Address: 1, Gagarin Square, Rostov-on-Don, 344000, Russia E-mail: [email protected]
Abstract
In this paper we describe an easy-to-implement algorithm for automated generation of the linguistic variable term membership functions to allow for information search in a relational database based on qualitative criteria by means of the SQL query language.
The proposed algorithm makes it possible to calculate the parameters of the triangular and trapezoid membership functions taking into account the distribution of the variable of interest stored in the database. The algorithm defines the intervals covered by the term bases, so that each interval contains about the same number of values. Upper bounds of the defined intervals are used to calculate the parameters of membership functions. The parameters of the membership functions generated with this algorithm can be easily calculated with the limited computational means of the SQL language.
We review the algorithm realizations for the generation of 3 and 5 terms of a linguistic variable based on a sample from a database containing 100 or 500 different values.
The membership functions obtained through the algorithm have the required properties of orderliness, completeness, consistency and normality. They do not require further approximation. Unlike the known methods, the algorithm does not require significant computing resources, specialized software, parameter tuning, or the construction of a training set.
Implementing the algorithm makes it possible to support fuzzy search queries in relational databases using the means of the SQL language, as limited as they are. Thus, the system's level of intelligence would be increased, and the user would be provided with the means of search query formulation in a natural language. The linguistic variable terms generated using our algorithm can be used within the framework of a fuzzy rule-based knowledge base of an information system, as well as to perform fuzzy inference.
Key words: relational database; SQL language; fuzzy logic; linguistic variable; fuzzy set; membership function.
Citation: Chujkova E.N., Galushka V.V. (2018) A heuristic algorithm for generating the numerical terms of a linguistic variable. Business Informatics, no. 3 (45), pp. 29-38. DOI: 10.17323/1998-0663.2018.3.29.38.
Introduction
At the core of any information system lies a database that stores the information processed by the system. One of the functions of the system is searching the database for any requested data. Currently, relational databases that support the search for information using the SQL language are the most widely used. Composing a search request in SQL requires the user to set definite ranges for the values of the data being requested, something which is often impossible due to the lack of such information. Moreover, the user, especially a novice, is better accustomed to using qualitative search criteria that could be expressed verbally. This kind of user-system interaction is also more convenient for the user.
Fuzzy logic tools can be used to implement information search in a database based on qualitative criteria. They involve the use of linguistic variables and the generation of corresponding base term sets (fuzzy sets) that provide a quantitative interpretation for the qualitative search criteria [1-5].
Performing a fuzzy search query in a relational database requires converting it first into a command of the standard SQL language. Paper [6] describes such a conversion procedure that is implemented with the help of the graphical interface of the information system. Within this procedure, the values of the membership functions of the fuzzy sets used are calculated on-the-fly in the course of performing an SQL query. Linguistic variable terms are represented by parametric fuzzy numbers with trapezoid and triangular membership functions. Their parameters can be easily calculated using the commands of the SQL language, which is not intended for complex calculations.
Paper [6] assumes that the data sampled from the database are uniformly distributed when finding the parameters of the terms. This assumption limits the applicability of the proposed method and might lead to unacceptable results when it is not valid. In particular, if the distribution of the experimental data is nonuniform, it might happen that no values at all from the database would be in the base of some terms. To avoid such issues, a calculation of the terms' parameters must take the experimental data distribution into account. The term bases must contain approximately equal numbers of values available in the database.
Traditionally, the membership functions of the linguistic variable terms are generated based on expert information, which makes the procedure difficult to automate. For this reason, many researchers have made considerable effort towards the development of algorithmic methods of membership function generation. The papers [7; 8] provide an overview of the existing methods of automatic generation of membership functions. Among them are the methods of inductive logical inference [9], fuzzy c-means clustering methods [10-13], neural networks [14; 15], histograms [16], methods based on fuzzy entropy and other special measures [17; 18], genetic algorithms [19-21], ant colony system algorithms [22], heuristic algorithms [23], and particle swarm optimization [24]. The characteristics of the aforementioned methods are presented in Table 1 [8].
All the methods listed above have been developed for particular classification applications and are used to assign experimental data to fuzzy sets. However, they are not guaranteed to produce a uniform distribution with an approximately equal number of elements in each set.
Paper [25] proposed a method of partitioning a source sample of values into intervals that contain equal numbers of values. However, using this method one might see identical values being assigned to different intervals, which defeats the purpose of classification.
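The effect described above can be seen with a naive equal-count split of a sorted sample. The following is only an illustrative sketch: the function name `equal_count_split` and the sample data are assumptions made for this example, and the code is not the method of [25].

```python
def equal_count_split(values, parts):
    """Cut a sorted sample into `parts` consecutive chunks of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), parts)
    chunks, start = [], 0
    for j in range(parts):
        # the first `extra` chunks receive one additional element
        stop = start + size + (1 if j < extra else 0)
        chunks.append(ordered[start:stop])
        start = stop
    return chunks


sample = [1, 2, 2, 2, 3, 4]              # illustrative data, not from the paper
chunks = equal_count_split(sample, 2)
shared = set(chunks[0]) & set(chunks[1])  # values that landed in both chunks
print(chunks)                             # [[1, 2, 2], [2, 3, 4]]
print(shared)                             # {2}
```

Both chunks hold three elements, but the value 2 belongs to both of them, which is the situation a partition for term generation has to avoid.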
The membership functions generated using the above methods do not always meet the necessary conditions of being ordered, complete, consistent, or normalized [2; 4], and still require further approximation. For example, the Gaussian membership functions generated by the particle swarm optimization are subnormal [8].
Table 1.
Characteristics of the methods of automatic generation of membership functions
| Methods | Source of information | Area of applicability | Shape of the membership functions | Number of fuzzy sets |
|---|---|---|---|---|
| Neural networks | dataset | classification | arbitrary | fixed |
| Histograms | dataset | image recognition, classification | Gaussian | arbitrary |
| Fuzzy c-means clustering | dataset | classification | triangular, trapezoid | fixed |
| Genetic algorithms | dataset and expert estimates | controllers | triangular, trapezoid, Gaussian | fixed |
| Ant colony system algorithms | dataset | data analysis, controllers | arbitrary | arbitrary |
| Particle swarm methods | dataset and expert estimates | controllers, image processing | Gaussian, triangular, S-shaped | fixed |
| Other methods | dataset | classification, controllers | triangular, Gaussian | 2-9 |
The existing algorithmic methods of generating membership functions typically have a high computational complexity. Using neural networks makes it necessary to provide a training set. The genetic algorithms, ant colony system algorithms, and particle swarm optimization methods require an objective function to be specified. They are also known for long convergence times and for the possibility of converging to a local optimum. The implementation of most methods listed above requires some specialized software.
In this paper, we propose a simple-to-implement method for automatic generation of the membership functions for linguistic variable terms. The method takes into account the distribution of values of the parameter at hand stored in a database, which makes it possible to assign an approximately equal number of values to each term base. In addition, this method can be used to generate an arbitrary number of membership functions that possess the required properties.
1. Problem formulation
Tables of a relational database store numerical values of objects' characteristics used by an information system that must be evaluated and selected based on some qualitative fuzzy criteria (for example, "low," "medium," "high"). To perform a fuzzy search query, one has to provide linguistic variables corresponding to the properties being evaluated. To enable the automatic generation of a linguistic variable, we need an algorithm that would allow us to calculate the number and parameters of the membership functions of the linguistic variable terms. The choice of the number of values (terms) of the linguistic variable is up to the user. The basic scale of a linguistic variable is specified based on a set of all values of the property under investigation obtained
from the database and is given by the interval $U = [u_{\min}, u_{\max}]$, where $u_{\min}$ is the minimum value of that property and $u_{\max}$ is the maximum value of that property. Each term base must contain an approximately equal number of values of the property being evaluated that are available in the database. The set of values and their frequencies can be obtained from the database using an SQL query as a matrix $H_{k \times 2}$ sorted in ascending order of the values, where the number of rows $k$ is the number of different measurements (distinct values of the parameter at hand in the database), the matrix elements $h_{i1}$ are the values of the $i$-th measurement, and $h_{i2}$ are the frequencies of the $i$-th measurement, $i = 1, \ldots, k$. The membership functions we are seeking to generate should be defined in such a way that their values could be easily calculated by means of the SQL language, which falls short of the computational capabilities of a general-purpose programming language. The membership functions must meet certain requirements [2; 4].
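One possible way to obtain the matrix of values and frequencies is a grouping query. The sketch below uses Python's standard `sqlite3` module with an in-memory database; the table name `items`, the column name `price`, and the helper `value_frequency_matrix` are hypothetical and serve only as an illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (price INTEGER)")      # hypothetical table
conn.executemany(
    "INSERT INTO items VALUES (?)",
    [(354,), (616,), (354,), (662,), (616,), (354,)],   # illustrative rows
)


def value_frequency_matrix(connection, table, column):
    """Return H as a list of (value, frequency) rows sorted by value."""
    # Identifiers are interpolated directly, so pass trusted names only.
    query = (
        f"SELECT {column}, COUNT(*) FROM {table} "
        f"WHERE {column} IS NOT NULL "
        f"GROUP BY {column} ORDER BY {column} ASC"
    )
    return connection.execute(query).fetchall()


H = value_frequency_matrix(conn, "items", "price")
print(H)   # [(354, 3), (616, 2), (662, 1)]
```

Each returned row corresponds to one row of the matrix: the first element is the value and the second is its frequency, and the number of rows is the number of distinct values.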
We will represent the terms of a linguistic variable by parametric fuzzy numbers with the most common trapezoid and triangular membership functions. We will use the trapezoid membership functions for the minimum and maximum terms, and the triangular membership functions for the rest of the terms. The above choice of the membership function types was influenced by our goal to ensure the possibility of calculation on-the-fly while running an SQL query, considering the limited capabilities of SQL language. The calculation of the values of the membership functions is performed according to the following formula:
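$$
\mu(x) =
\begin{cases}
0, & (x < a) \lor (x > d), \\
\dfrac{x - a}{b - a}, & (a \le x < b) \land (a < b), \\
1, & b \le x \le c, \\
\dfrac{d - x}{d - c}, & (c < x \le d) \land (c < d),
\end{cases}
\qquad (1)
$$
where $a \le b \le c \le d$ are the parameters of the membership function $\mu(x)$.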
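Formula (1) can be sketched as a short function. This is a minimal illustration that assumes non-strict ordering of the parameters (so that a triangular function is the special case b = c); the function name `membership` is chosen for this example only.

```python
def membership(x, a, b, c, d):
    """Trapezoid membership function (1); b == c gives a triangular function."""
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:                       # here a <= x < b, hence a < b
        return (x - a) / (b - a)
    return (d - x) / (d - c)        # here c < x <= d, hence c < d


# Parameters of the second term as listed in Table 3 of this paper.
print(membership(744, 718, 770, 770, 791))   # 0.5
print(membership(770, 718, 770, 770, 791))   # 1.0
print(membership(800, 718, 770, 770, 791))   # 0.0
```

Because the zero and plateau cases are tested first, neither division can have a zero denominator.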
Let us split the base of each term into two intervals. The total number of such intervals is $m = l + 1$, where $l$ is the number of terms. Figure 1 shows the shape of the membership functions of linguistic variable terms. Their parameters and the upper bounds of the values of the intervals are provided. The following notation is used:
$T_j$ - the $j$-th term;
$a_j, b_j, c_j, d_j$ - the parameters of the $j$-th term, $j = 1, \ldots, l$;
$p_j$ - the upper bound of the values of the $j$-th interval, $j = 1, \ldots, m$.
Fig. 1. Membership functions of linguistic variable terms
We need to define the intervals covered by the term bases so that each interval would contain an approximately equal number of values of the property at hand, and moreover, so that identical values would lie within the same interval. The upper bounds $p_j$ of the intervals being defined are used to calculate the parameters $a_j$, $b_j$, $c_j$, $d_j$ of the terms' membership functions and are given by the quantiles of the uniform discretization of the source sample.
Based on the aforementioned matrix $H$, the number $n$ of values that must lie within each interval, assuming a uniform distribution, is given by the equation
$$n = \frac{1}{m} \sum_{i=1}^{k} h_{i2},$$
where $\sum_{i=1}^{k} h_{i2}$ is the total number of measurements (the sample size).
Let us define the vector $z$, its components being the numbers of measurements within each interval. Then the $j$-th component of the vector $z$ is the number $z_j$ of measurements within the $j$-th interval ($j = 1, \ldots, m$):
$$z_j = \sum_{i \in G_j} h_{i2},$$
where $G_j$ is the set of numbers $i$ of the rows of matrix $H$ that contain the values of the frequencies $h_{i2}$ included in $z_j$.
Let us denote by $i_j^*$ the largest element of $G_j$: $i_j^* = \max_{i \in G_j} i$. Then, to calculate the quantiles $p_j$ ($j = 1, \ldots, l$), we only need to choose $p_j$ to be such elements $h_{i_j^* 1}$ ($i_j^* = 1, \ldots, k-1$) of matrix $H$ that the function $E(z) = \max_{j = 1, \ldots, l+1} |z_j - n|$ is minimized. Alternatively, another equivalent criterion of optimality could be used:
$$E(z) = \sum_{j=1}^{l+1} |z_j - n| \to \min.$$
The value $p_{l+1}$ coincides with the largest value in the sample: $p_{l+1} = h_{k1}$.
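The optimization is performed under the following conditions:
$$\sum_{j=1}^{l+1} z_j = \sum_{i=1}^{k} h_{i2};$$
for every $j$ ($j = 1, \ldots, l+1$), $z_j = \sum_{i \in G_j} h_{i2}$, where the row numbers in $G_j$ are consecutive: $i_{g+1} = i_g + 1$ for $g$ running from $\min_{i \in G_j} i$ to $\max_{i \in G_j} i$;
for any $j_1$, $j_2$, $j_1 \ne j_2$, for which $z_{j_1} = \sum_{i \in G_{j_1}} h_{i2}$ and $z_{j_2} = \sum_{i \in G_{j_2}} h_{i2}$ are defined, $G_{j_1} \cap G_{j_2} = \emptyset$.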
To obtain non-trivial and nondegenerate solutions, the number $l$ of the terms must obey the following condition: $2 \le l < k$. The number of measurements contained in each of the $m$ intervals cannot be smaller than the maximum frequency of the measurements, therefore $m \le \sum_{i=1}^{k} h_{i2} \big/ \max_{i = 1, \ldots, k} h_{i2}$. This means that the number $l$ of the terms must meet the following condition:
$$2 \le l \le \frac{\sum_{i=1}^{k} h_{i2}}{\max_{i = 1, \ldots, k} h_{i2}} - 1. \qquad (2)$$
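Condition (2) can be checked directly from the frequency column of matrix H. The sketch below reads the condition as: the number of terms is at least 2 and at most the sample size divided by the largest frequency, minus one. The function names are chosen for this illustration, and the frequencies are those listed in Table 2 of this paper.

```python
def max_terms(frequencies):
    """Largest integer number of terms allowed by condition (2)."""
    # l + 1 <= total / max_freq, and l is an integer, so floor division is enough
    return sum(frequencies) // max(frequencies) - 1


def condition_2_holds(num_terms, frequencies):
    """True if 2 <= num_terms <= sum(h_i2) / max(h_i2) - 1."""
    return 2 <= num_terms <= max_terms(frequencies)


# Frequencies h_i2 from Table 2 (sample of 100 price values).
freqs = [5, 5, 5, 4, 2, 7, 2, 5, 2, 2, 9, 6, 2, 14, 2, 9, 11, 5, 2, 1]

print(max_terms(freqs))              # 6
print(condition_2_holds(3, freqs))   # True
print(condition_2_holds(7, freqs))   # False
```

For this sample the largest frequency is 14, so up to 6 terms can be generated; both the 3-term and the 5-term cases considered in the paper satisfy the condition.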
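The parameters $a_j$, $b_j$, $c_j$, $d_j$ of the terms' membership functions are found as follows:
$$
a_j = \begin{cases} h_{11}, & j = 1, \\ p_{j-1}, & 1 < j \le l, \end{cases}
\qquad
b_j = \begin{cases} h_{11}, & j = 1, \\ p_j, & 1 < j \le l, \end{cases}
$$
$$
c_j = \begin{cases} p_j, & 1 \le j < l, \\ p_{l+1}, & j = l, \end{cases}
\qquad
d_j = \begin{cases} p_{j+1}, & 1 \le j < l, \\ p_{l+1}, & j = l. \end{cases}
$$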
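The parameter rules can be sketched as follows, reading them as: the first term starts at the smallest sample value, each inner term is a triangle whose peak is its own interval bound and whose feet are the neighbouring bounds, and the last term ends at the largest sample value. The function name `term_parameters` is an assumption of this example; the bounds used in the check are the ones the paper selects in Section 3.

```python
def term_parameters(p, h11):
    """Return [(a_j, b_j, c_j, d_j)] for l = len(p) - 1 terms.

    p   -- upper bounds p_1..p_m of the m = l + 1 intervals (p[0] holds p_1)
    h11 -- smallest value in the sample
    """
    l = len(p) - 1
    params = []
    for j in range(1, l + 1):               # 1-based term number
        a = h11 if j == 1 else p[j - 2]     # p_{j-1}
        b = h11 if j == 1 else p[j - 1]     # p_j
        c = p[j - 1] if j < l else p[l]     # p_j, or p_{l+1} for the last term
        d = p[j] if j < l else p[l]         # p_{j+1}, or p_{l+1} for the last term
        params.append((a, b, c, d))
    return params


params = term_parameters([718, 770, 791, 855], 354)
for row in params:
    print(row)
# (354, 354, 718, 770)
# (718, 770, 770, 791)
# (770, 791, 855, 855)
```

With these inputs the result coincides with the parameter values reported in Table 3.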
2. Algorithm for generation of membership functions
Let us find the actual number $z_j$ of the measurements that get assigned to the $j$-th interval ($j = 1, \ldots, m$) and the parameters $a_j$, $b_j$, $c_j$, $d_j$ of the linguistic variable terms according to the following algorithm:
Step 0. Verify that the conditions (2) of the existence of non-trivial and nondegenerate solutions are satisfied. If the conditions (2) are satisfied, then go to Step 1.
Step 1. Process the rows of matrix $H$ in direct order beginning with the first row:
1.1. Initialize $i$ ($i := 0$).
1.2. For each $j$ ($j = 1, \ldots, m-1$), find $z_j^{(1)}$ as follows:
1.2.1. Initialize $z_j^{(1)}$ ($z_j^{(1)} := 0$).
1.2.2. While $z_j^{(1)} < n$, sequentially accumulate the values $h_{i2}$ in $z_j^{(1)}$:
$i := i + 1$; $z_j^{(1)} := z_j^{(1)} + h_{i2}$.
1.2.3. If $z_j^{(1)} > n$, then find $R$ and $Q$:
$R := n - (z_j^{(1)} - h_{i2})$; $Q := z_j^{(1)} - n$.
Here $R$ is the difference between the required number $n$ of measurements within the interval and the actual number $z_j^{(1)}$ of measurements within the $j$-th interval, not counting the $i$-th measurement; $Q$ is the difference between the actual number of measurements within the $j$-th interval (counting the $i$-th measurement) and the required number $n$ of measurements. The value of $R$ shows by how much the number of measurements within the $j$-th interval falls short of the required number (the shortage), while the value of $Q$ shows by how much the number of measurements within the $j$-th interval exceeds the required number $n$ (the surplus).
1.2.4. If the shortage without the last measurement is smaller than the surplus ($R < Q$), then remove from $z_j^{(1)}$ the last measurement $h_{i2}$ added to it:
$z_j^{(1)} := z_j^{(1)} - h_{i2}$; $i := i - 1$.
1.2.5. Define the upper bound $p_j^{(1)}$ of the $j$-th interval as follows: $p_j^{(1)} := h_{i1}$.
1.3. Find the number $z_m^{(1)}$ of measurements that are assigned to the last $m$-th interval:
1.3.1. Initialize $z_m^{(1)}$ ($z_m^{(1)} := 0$).
1.3.2. Add to $z_m^{(1)}$ all the remaining measurements: for all $i$ ($i = i + 1, \ldots, k$)
$z_m^{(1)} := z_m^{(1)} + h_{i2}$.
1.3.3. Define the upper bound $p_m^{(1)}$ of the $m$-th interval as follows: $p_m^{(1)} := h_{k1}$.
Step 2. Find the total deflection $\delta_1$ of the number of measurements from the required $n$:
$$\delta_1 = \sum_{j=1}^{m} |z_j^{(1)} - n|.$$
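Steps 1 and 2 can be sketched as below. This is a minimal illustration that assumes condition (2) holds, stores matrix H as a list of (value, frequency) pairs, and uses 0-based row indices; the function name `forward_pass` and the extra bounds check in the loop are additions of this example. The data are the rows of Table 2.

```python
def forward_pass(H, m):
    """Steps 1-2: split the rows of H into m intervals, first row to last."""
    k = len(H)
    n = sum(freq for _, freq in H) / m       # required count per interval
    z, p = [], []
    i = -1                                    # 0-based index of the last row used
    for _ in range(m - 1):
        zj = 0
        while zj < n and i + 1 < k:           # bounds check is an added safeguard
            i += 1
            zj += H[i][1]
        if zj > n:
            shortage = n - (zj - H[i][1])     # R: deficit without the last row
            surplus = zj - n                  # Q: excess with the last row
            if shortage < surplus:            # closer to n without the last row
                zj -= H[i][1]
                i -= 1
        z.append(zj)
        p.append(H[i][0])                     # upper bound of this interval
    z.append(sum(freq for _, freq in H[i + 1:]))   # last interval takes the rest
    p.append(H[-1][0])
    delta = sum(abs(zj - n) for zj in z)      # total deflection
    return z, p, delta


H = [(354, 5), (616, 5), (662, 5), (695, 4), (699, 2), (718, 7), (741, 2),
     (745, 5), (758, 2), (764, 2), (770, 9), (784, 6), (785, 2), (790, 14),
     (791, 2), (800, 9), (810, 11), (813, 5), (814, 2), (855, 1)]

z1, p1, delta1 = forward_pass(H, 4)
print(z1)       # [28, 26, 27, 19]
print(p1)       # [718, 784, 800, 855]
print(delta1)   # 12.0
```

On this sample the sketch reproduces the forward-pass counts, upper bounds and total deflection listed in Table 2.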
Step 3. Repeat the calculations of $z_j$ by processing the rows of matrix $H$ in reverse order beginning with the last row:
3.1. Initialize $i$ ($i := k + 1$), so that the first decrement in step 3.3.2 selects the last row $k$.
3.2. Define the upper bound $p_m^{(2)}$ of the $m$-th interval as follows: $p_m^{(2)} := h_{k1}$.
3.3. For each $j$ ($j = m, \ldots, 2$), find $z_j^{(2)}$ as follows:
3.3.1. Initialize $z_j^{(2)}$ ($z_j^{(2)} := 0$).
3.3.2. While $z_j^{(2)} < n$, sequentially accumulate the values $h_{i2}$ in $z_j^{(2)}$:
$i := i - 1$; $z_j^{(2)} := z_j^{(2)} + h_{i2}$.
3.3.3. If $z_j^{(2)} > n$, then find $R$ and $Q$:
$R := n - (z_j^{(2)} - h_{i2})$; $Q := z_j^{(2)} - n$.
3.3.4. If $R < Q$, then remove from $z_j^{(2)}$ the value $h_{i2}$:
$z_j^{(2)} := z_j^{(2)} - h_{i2}$; $i := i + 1$.
3.3.5. Define the upper bound $p_{j-1}^{(2)}$ of the previous $(j-1)$-th interval as follows:
$p_{j-1}^{(2)} := h_{i-1,1}$.
3.4. Find the number $z_1^{(2)}$ of measurements assigned to the first interval:
3.4.1. Initialize $z_1^{(2)}$ ($z_1^{(2)} := 0$).
3.4.2. Add to $z_1^{(2)}$ all the remaining measurements: for all $i$ ($i = i - 1, \ldots, 1$)
$z_1^{(2)} := z_1^{(2)} + h_{i2}$.
Step 4. Find the total deflection $\delta_2$ of the number of measurements from the required $n$:
$$\delta_2 = \sum_{j=1}^{m} |z_j^{(2)} - n|.$$
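Steps 3 and 4 can be sketched in the same way. The sketch assumes condition (2) holds, stores matrix H as a list of (value, frequency) pairs with 0-based indices, and takes the bound of each earlier interval from the row just before the current interval, which is one reading of step 3.3.5. The function name `reverse_pass` and the bounds check in the loop are additions of this example; the data are the rows of Table 2.

```python
def reverse_pass(H, m):
    """Steps 3-4: split the rows of H into m intervals, last row to first."""
    k = len(H)
    n = sum(freq for _, freq in H) / m       # required count per interval
    z = [0] * m
    p = [None] * m
    p[m - 1] = H[-1][0]                       # bound of the last interval
    i = k                                     # no row used yet (rows are 0..k-1)
    for j in range(m - 1, 0, -1):             # intervals m, ..., 2 (0-based m-1..1)
        zj = 0
        while zj < n and i > 0:               # bounds check is an added safeguard
            i -= 1
            zj += H[i][1]
        if zj > n:
            shortage = n - (zj - H[i][1])     # R
            surplus = zj - n                  # Q
            if shortage < surplus:
                zj -= H[i][1]
                i += 1
        z[j] = zj
        p[j - 1] = H[i - 1][0]                # bound of the previous interval
    z[0] = sum(freq for _, freq in H[:i])     # first interval takes the rest
    delta = sum(abs(zj - n) for zj in z)      # total deflection
    return z, p, delta


H = [(354, 5), (616, 5), (662, 5), (695, 4), (699, 2), (718, 7), (741, 2),
     (745, 5), (758, 2), (764, 2), (770, 9), (784, 6), (785, 2), (790, 14),
     (791, 2), (800, 9), (810, 11), (813, 5), (814, 2), (855, 1)]

z2, p2, delta2 = reverse_pass(H, 4)
print(z2)       # [21, 27, 24, 28]
print(delta2)   # 10.0
```

On this sample the counts and the total deflection agree with the reverse-pass values listed in Table 2.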
Step 5. Find the values of the parameters $a_j$, $b_j$, $c_j$, $d_j$ of the linguistic variable terms using the elements of the vector $p$ corresponding to the distribution $z$ of measurements for which the least total deflection $\delta$ is achieved (if $\delta_1 < \delta_2$, then use the vector $p^{(1)}$ corresponding to the distribution $z^{(1)}$ of measurements, otherwise use $p^{(2)}$ corresponding to $z^{(2)}$) as follows:
$a_1 := h_{11}$;
$b_1 := h_{11}$.
Let $\delta_1 < \delta_2$. Then for each $j$ ($j = 1, \ldots, m-1$) do:
if $j < m - 1$, then $c_j := p_j^{(1)}$, $a_{j+1} := p_j^{(1)}$;
if $j > 1$, then $d_{j-1} := p_j^{(1)}$, $b_j := p_j^{(1)}$;
$c_{m-1} := p_m^{(1)}$;
$d_{m-1} := p_m^{(1)}$.
3. Analysis of the results produced by the algorithm
We consider the realizations of the above algorithm used to generate 3 and 5 terms of a linguistic variable based on samples from a database that contained 100 or 500 different values.
For a sample of 100 values and 3 terms, the number of intervals is $m = 4$; the number of values per interval is $n = 100/4 = 25$; the number of distinct measurements (the number of rows of matrix $H$ of the values and their frequencies) is $k = 20$. The source data (the elements of matrix $H$) along with the results of Steps 1-4 of the algorithm are presented in Table 2.
Table 2 demonstrates that the total deflection of the number of measurements from the required $n$ for the downward processing of matrix $H$ is greater than that for the reverse processing ($\delta_1 > \delta_2$), therefore the parameters of the membership functions are defined using vector $p^{(2)}$. The values of the parameters of membership functions calculated at Step 5 of the algorithm are presented in Table 3.
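Table 2.
Source data and the results of steps 1-4 of the algorithm
Elements of matrix $H$:
| Row number $i$ | Value of the $i$-th measurement ($h_{i1}$) | Frequency of the $i$-th measurement ($h_{i2}$) |
|---|---|---|
| 1 | 354 | 5 |
| 2 | 616 | 5 |
| 3 | 662 | 5 |
| 4 | 695 | 4 |
| 5 | 699 | 2 |
| 6 | 718 | 7 |
| 7 | 741 | 2 |
| 8 | 745 | 5 |
| 9 | 758 | 2 |
| 10 | 764 | 2 |
| 11 | 770 | 9 |
| 12 | 784 | 6 |
| 13 | 785 | 2 |
| 14 | 790 | 14 |
| 15 | 791 | 2 |
| 16 | 800 | 9 |
| 17 | 810 | 11 |
| 18 | 813 | 5 |
| 19 | 814 | 2 |
| 20 | 855 | 1 |
Results of steps 1-4:
| Interval number $j$ | Number of measurements in the $j$-th interval ($z_j^{(1)}$) | Upper bound of the $j$-th interval ($p_j^{(1)}$) | Number of measurements in the $j$-th interval ($z_j^{(2)}$) | Upper bound of the $j$-th interval ($p_j^{(2)}$) |
|---|---|---|---|---|
| 1 | 28 | 718 | 21 | 718 |
| 2 | 26 | 784 | 27 | 770 |
| 3 | 27 | 800 | 24 | 791 |
| 4 | 19 | 855 | 28 | 855 |
Total deflection: $\delta_1 = 12$ (direct processing), $\delta_2 = 10$ (reverse processing).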
Table 3.
Parameters of membership functions calculated by the algorithm
| Membership function number $j$ | $a_j$ | $b_j$ | $c_j$ | $d_j$ |
|---|---|---|---|---|
| 1 | 354 | 354 | 718 | 770 |
| 2 | 718 | 770 | 770 | 791 |
| 3 | 770 | 791 | 855 | 855 |
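To illustrate how such parameters might be used inside a query, the sketch below evaluates the membership function of the first term from Table 3 with a plain SQL `CASE` expression, run through Python's standard `sqlite3` module. The table `items`, its columns, and the sample rows are hypothetical, and the query is only one possible formulation, not the conversion procedure of [6].

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT, price REAL)")   # hypothetical table
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [("A", 354), ("B", 744), ("C", 770), ("D", 800)],        # illustrative rows
)

# Membership of `price` in a term with parameters (a, b, c, d) as a CASE expression.
MU_SQL = """
SELECT name,
       CASE
         WHEN price < :a OR price > :d THEN 0.0
         WHEN price >= :b AND price <= :c THEN 1.0
         WHEN price < :b THEN (price - :a) * 1.0 / (:b - :a)
         ELSE (:d - price) * 1.0 / (:d - :c)
       END AS mu
FROM items
ORDER BY name
"""

term1 = {"a": 354, "b": 354, "c": 718, "d": 770}   # first row of Table 3
rows = conn.execute(MU_SQL, term1).fetchall()
print(rows)   # [('A', 1.0), ('B', 0.5), ('C', 0.0), ('D', 0.0)]
```

Only comparisons and basic arithmetic are needed, so the degree of membership can be computed while the query runs; a threshold on `mu` could then serve as the selection criterion.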
Plots of the membership functions for 3 and 5 terms obtained through the algorithm are shown in Figures 2 and 3, respectively. The experimental data sampled from the database are represented by histograms in the figures.
The shapes of the membership functions calculated without accounting for the distribution of the experimental data, for the same data samples, are shown in Figure 4.
Fig. 2. Membership functions for 3 terms of the linguistic variable "Item price" for a sample of 100 price values (horizontal axis: item price, rub.; curves: membership functions 1-3; histogram: experimental data)
As seen from Figures 2 and 3, the proposed algorithm generates membership functions for the linguistic variable terms that meet all the essential requirements [2; 4]. In particular, it is ensured that each term base contains an approximately equal number of values. By contrast, the membership functions generated without accounting for the experimental data distribution (see Figure 4) do not possess the required properties and require further approximation.
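The contrast can be made concrete by counting how many measurements fall into intervals of equal width. The sketch below uses an equal-width split of the value range as a stand-in for ignoring the data distribution; this is an assumption of the example and not necessarily how Figure 4 was produced. The function name `equal_width_counts` is hypothetical, and the data are the rows of Table 2.

```python
def equal_width_counts(H, m):
    """Count measurements falling into m equal-width intervals over [min, max]."""
    lo, hi = H[0][0], H[-1][0]               # H is sorted by value
    width = (hi - lo) / m
    counts = [0] * m
    for value, freq in H:
        j = min(int((value - lo) / width), m - 1)   # the maximum goes to the last interval
        counts[j] += freq
    return counts


H = [(354, 5), (616, 5), (662, 5), (695, 4), (699, 2), (718, 7), (741, 2),
     (745, 5), (758, 2), (764, 2), (770, 9), (784, 6), (785, 2), (790, 14),
     (791, 2), (800, 9), (810, 11), (813, 5), (814, 2), (855, 1)]

counts = equal_width_counts(H, 4)
print(counts)   # [5, 0, 23, 72]
```

With four equal-width intervals one interval receives no values at all and another receives 72 of the 100, whereas the counts reported in Table 2 for the proposed algorithm stay between 19 and 28.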
Fig. 3. Membership functions for 5 terms of the linguistic variable "Item price" for a sample of 100 price values (horizontal axis: item price, rub.; curves: membership functions 1-5; histogram: experimental data)
Fig. 4. Membership functions for 3 terms of the linguistic variable "Item price" generated without accounting for the experimental data distribution for a sample of 100 price values (curves: membership functions 1-3; histogram: experimental data)
Conclusion
The proposed algorithm makes it possible to automate the procedure of generating the membership functions for the linguistic variable terms. The parameters of such membership functions can be stored in a database and later used to submit fuzzy queries in the SQL language. In contrast to the existing methods of automatic generation of the membership functions, our algorithm is simple to implement, does not require considerable computational resources or specialized software, is free of adjustable parameters, does not require a training set, and enables the generation of any user-specified number of triangular and trapezoid membership functions. The membership functions generated with the help of this algorithm take into account the experimental data distribution. Additionally, the respective term bases of the linguistic variable contain approximately equal numbers of values selected from the database. The membership functions obtained through this algorithm possess all the required properties [2; 4] and do not necessitate further approximation. At the same time, the algorithm was developed for the solution of the particular problem stated above, and cannot be claimed to be universal.
Implementing the algorithm makes it possible to support fuzzy search queries in relational databases using the means of the SQL language, as limited as they are. Thus, the system's level of intelligence would be increased, and the user would be provided with the means of search query formulation in a natural language. The membership functions of the linguistic variable terms generated using our algorithm can be used within the framework of a fuzzy rule-based knowledge base of an information system, as well as to perform fuzzy inference.
References
1. Shtovba S. (2001) Vvedenie v teoriyu nechetkih mnozhestv i nechetkuyu logiku [Introduction to the theory of fuzzy sets and fuzzy logic]. Available at: http://matlab.exponenta.ru/fuzzylogic/book1 (accessed 14 October 2013) (in Russian).
2. Korneev V.V., Gareev A.F., Vasjutin S.V., Rajh V.V. (2000) Bazy dannyh. Intellektual'naya obrabotka informatsii [Databases. Intelligent processing of information]. Moscow: Knowledge (in Russian).
3. Zadeh L. (1965) Fuzzy sets. Information and Control, vol. 8, no. 3, pp. 338-353.
4. Zadeh L. (1975) The concept of a linguistic variable and its application to approximate reasoning - II. Information Sciences, vol. 8, no. 4, pp. 301-357.
5. Andrejchikov A.V. (2006) Intellektual'nye informatsionnye sistemy [Intelligent information systems]. Moscow: Finance and Statistics (in Russian).
6. Chujkova E.N. (2014) Realizatsiya nechetkogo vybora oborudovaniya v sisteme proektirovaniya informatsionnoy seti [Realization of the fuzzy selection of equipment in the information network design system]. Vestnik of Don State Technical University, no. 3, pp. 164-171 (in Russian).
7. Medasani S., Kim J., Krishnapuram R. (1998) An overview of membership function generation techniques for pattern recognition. International Journal of Approximate Reasoning, no. 19, pp. 391-417.
8. Schwaab A., Nassar S., Filho P. (2015) Automatic methods for generation of type-1 and interval type-2 fuzzy membership functions. Journal of Computer Sciences, no. 11, pp. 976-987.
9. Kim C., Russell B. (1993) Automatic generation of membership function and fuzzy rule using inductive reasoning. Proceedings of the 3rd International Conference on Industrial Fuzzy Control and Intelligent Systems (IFIS '93), Houston, Texas, 1-3 December 1993, pp. 93-96.
10. Chen M., Wang S. (1999) Fuzzy clustering analysis for optimizing membership functions. Fuzzy Sets and Systems, no. 103, pp. 239-254.
11. Liao T., Celmins A., Hammell R. (2001) A fuzzy c-means variant for the generation of fuzzy term sets. Fuzzy Sets and Systems, vol. 135, no. 2, pp. 241-257.
12. Lopes P., Camargo H. (2012) Automatic labeling by means of semi-supervised fuzzy clustering as a boosting mechanism in the generation of fuzzy rules. Proceedings of the 13th IEEE International Conference on Information Reuse and Integration (IRI). Las Vegas, USA, 8-10 August 2012, pp. 279-286.
13. Jamsandekar S., Mudholkar R. (2014) Fuzzy classification system by self generated membership function using clustering technique. BVICAM's International Journal of Information Technology, vol. 6, no. 1,
14. Wu S., Er M., Gao Y. (2001) A fast approach for automatic generation of fuzzy rules by generalized dynamic fuzzy neural networks. IEEE Transactions on Fuzzy Systems, vol. 9, no. 4, pp. 578-594.
15. Castellano G., Castiello C., Fanelli A., Mencar C. (2005) Knowledge discovery by a neuro-fuzzy modeling framework. Fuzzy Sets and Systems, no. 149, pp. 187-207.
16. Refaey M. (2016) Automatic generation of membership functions and rules in a fuzzy logic system. Proceedings of the 5th International Conference on Informatics and Applications (ICIA 2016). Takamatsu, Japan, 14-16 November 2016, pp. 117-122.
17. Cheng H., Chen J. (1997) Automatically determine the membership function based on the maximum entropy principle. Information Sciences, vol. 96, no. 3-4, pp. 163-182.
18. Nieradka G., Butkiewicz B. (2007) A method for automatic membership function estimation based on fuzzy measures. Proceedings of the International Fuzzy Systems Association World Congress (IFSA). Cancun, Mexico, 18-21 June 2007, pp. 451-460.
19. Homaifar A., McCormick E. (1995) Simultaneous design of membership functions and rule sets for fuzzy controllers using genetic algorithms. IEEE Transactions on Fuzzy Systems, vol. 3, no. 2, pp. 129-139.
20. Shimojima K., Fukuda T., Hasegawa Y. (1995) Self-tuning fuzzy modeling with adaptive membership function, rules, and hierarchical structure based on genetic algorithm. Fuzzy Sets and Systems, vol. 71, no. 3, pp. 295-309.
21. Kaya M., Alhajj R. (2004) Integrating multi-objective genetic algorithms into clustering for fuzzy association rules mining. Proceedings of the 4th IEEE International Conference on Data Mining (ICDM '04). Brighton, UK, 1-4 November 2004, pp. 431-434.
22. Hong T., Tung Y., Wang S., Wu M., Wu Y. (2009) An ACS-based framework for fuzzy data mining. Expert Systems with Applications, no. 36, pp. 11844-11852.
23. Ishibuchi H., Nozaki K., Tanaka H. (1993) Efficient fuzzy partition of pattern space for classification problems. Fuzzy Sets and Systems, vol. 59, no. 3, pp. 295-304.
24. Permana K., Zaiton S. (2010) Fuzzy membership function generation using particle swarm optimization. International Journal of Open Problems in Computer Science and Mathematics, vol. 3, no. 1, pp. 27-41.
25. Piatetsky-Shapiro G., Connell C. (1984) Accurate estimation of the number of tuples satisfying a condition. ACM SIGMOD Record, vol. 19, no. 2, pp. 256-276.