THE BASIC ORGANIZATION OF AN ONTOLOGY DRIVEN RETRIEVAL SYSTEM
Alexander Yu. Krylov, 4th year student, Ivanovo State University of Chemistry and Technology, Russia,
Ivanovo, Krylov [email protected]
Edward G. Galiaskarov, Associate Professor, Ivanovo State University of Chemistry and Technology,
Russia, Ivanovo, galiaskarov@isuct.ru
Introduction
Information retrieval is a central problem of the World Wide Web today, because Internet content grows every day and accumulates new facts and knowledge. According to newsru.com, in May 2009 the servers connected to the Internet held 487 exabytes of data [1]. The volume of Internet data had thus approached 500 exabytes (that is, 500 billion gigabytes). The probability that the required information exists somewhere has grown considerably, but finding it remains hard, given the growing number and complexity of queries. The reason is that simple context search is slow and still rather ineffective. Modern search engines address this only partly, using various specialized retrieval algorithms. Moreover, such systems are designed for general-purpose search that does not depend on any domain or search theme. As a result, they return a large number of documents that belong to entirely different domains and are often irrelevant to the user’s query or needs. The most advanced search engines do offer extended search options, but these do not solve the problem completely. We therefore conclude that general-purpose search engines solve the problem of searching within a specific theme only partly.
One way to solve this problem is so-called “intellectual” search, which lets the system “comprehend” the user’s query and find documents that match the meaning of the query rather than documents that merely contain its words. This significantly raises the probability of finding the information the user is interested in and, at the same time, reduces the effort the search requires.
In this article we present one way to organize intellectual search over a data array. It is based on a domain-oriented ontology, which makes it possible to “comprehend”, refine and supplement the user’s query in order to identify the most relevant information.
There are many philosophical and technical definitions of the term “ontology”. We use the following one: an ontology is a computer resource that corresponds to a certain view of the world as it concerns a specific body of documents. Formally, an ontology is a system consisting of a set of concepts and a set of statements about these concepts, from which classes, objects, relationships, functions and terms can be built [2]. In other words, an ontology is a conceptual model of a domain that contains all the classes of objects, their relationships and the rules of the domain.
The next notion we use is the “search query”: the source information on which the system performs a search. This information can take quite different forms, depending on how the search system is arranged and what is being searched for. For text documents the search query is a text; for graphical files it is an image or a set of its characteristics; the same holds for music and so on.
Let us introduce the notions of “upper level search” and “lower level search”:
1. The upper level search is a logical component of the system that uses the conceptual scheme of a domain. Here it refines and converts the user’s search query with the help of the ontology, i.e. the system “understands” the query within the bounds of the domain in use, in preparation for a subsequent context search;
2. The lower level search is a logical component of the system that interacts with the physical database of documents in order to search for and select information. The lower level search is performed by an existing search engine that runs a context search over the documents in the database.
The search is divided into levels because the conceptual domain scheme (the ontology) and the document base are logically separate.
The ontology makes it possible to reveal the meaning of a query and the semantic relations between the terms it contains. One of the main goals in developing such a search system is therefore to describe, as precisely as possible and by means of the ontology, the domain represented by the document database over which the search runs. This makes the ontology as effective as possible for information search within the specified domain.
Technically, this can be done with special software that analyzes the text of a document and extracts from it the most frequently used and most important words that match the theme of the document.
1. The functioning of the system under development
Having identified the main stages a query passes through in a search engine, we can present the scheme of operation (Fig. 1) of a system that solves the information search problem with the help of an ontology.
Fig. 1 - The scheme of a search query passage in the system
The system operates as follows:
1. The user’s search query is split into its component words and passed through a morphology subsystem, which produces the set of word forms of every element of the query for the subsequent conversion of the query. This phase is important because the ontology used in the system stores the terms and definitions of the domain in one specific form; at least one form of each query element must match that form for the element to be found in the conceptual scheme;
2. At this stage the query is “comprehended”: the whole set of word forms, together with the initial text of the query, is run through the ontology, which identifies the connections between the terms that make up the query. These can be connections such as “part of hierarchy”, “synonym”, “association” and so forth. Using these connections, the user’s query is supplemented with the corresponding elements, producing a search image of the query that is then used to search the document array. This stage implements the “upper level search”;
3. The search in the document array is performed by a dedicated search engine (existing software), which implements the “lower level search”, a context search in the document database;
4. The results may be ranked, filtered, grouped and so on: anything that helps the user select the materials of interest from everything that was found.
The passage of a query through the system may be iterative: the user can not only apply the operations available at the last stage but also modify the text of the initial query and search again, either in the whole base or within the results already found. This approach allows a query to be backtracked, refined as search results arrive, and corrected at every stage of its passage that permits it. In the course of this process a search image of the required document is formed that is the most relevant to the particular user’s request [3].
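The four stages above can be illustrated with a minimal sketch. It is not the implementation of the system: the dictionaries WORD_FORMS, ONTOLOGY and DOCUMENTS and all function names are hypothetical stand-ins for the morphology subsystem, the ontology and the document base, and the scoring by term overlap is a simplifying assumption.

```python
from typing import Dict, List, Set

# Hypothetical toy data; the real system would call the morphology
# service, the ontology module and an external search engine instead.
WORD_FORMS: Dict[str, Set[str]] = {
    "taxes": {"tax", "taxes"},
    "regions": {"region", "regions"},
}
ONTOLOGY: Dict[str, Set[str]] = {
    "tax": {"levy", "duty"},        # e.g. synonyms
    "region": {"municipality"},     # e.g. part of hierarchy
}
DOCUMENTS: Dict[int, str] = {
    1: "regional levy statistics for each municipality",
    2: "football results",
    3: "tax policy overview",
}


def refine(query: str) -> Set[str]:
    """Stage 1: split the query into words and add their word forms."""
    terms: Set[str] = set()
    for word in query.lower().split():
        terms.add(word)
        terms |= WORD_FORMS.get(word, set())
    return terms


def convert(terms: Set[str]) -> Set[str]:
    """Stage 2 (upper level search): add terms related via the ontology."""
    expanded = set(terms)
    for term in terms:
        expanded |= ONTOLOGY.get(term, set())
    return expanded


def context_search(terms: Set[str]) -> Dict[int, int]:
    """Stage 3 (lower level search): count matching terms per document."""
    hits: Dict[int, int] = {}
    for doc_id, text in DOCUMENTS.items():
        score = len(terms & set(text.split()))
        if score:
            hits[doc_id] = score
    return hits


def rank(hits: Dict[int, int]) -> List[int]:
    """Stage 4: order documents by descending score, then by id."""
    return sorted(hits, key=lambda doc_id: (-hits[doc_id], doc_id))


def run_query(query: str) -> List[int]:
    """Pass a query through all four stages in order."""
    return rank(context_search(convert(refine(query))))
```

In this toy setting the query "taxes regions" reaches document 1 only through the ontology terms "levy" and "municipality", which is the effect the upper level search is meant to produce.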
2. The architecture of the system under development
To implement the search mechanism described above, the system can be presented as a set of components, each of which performs the operations assigned to it by the scheme of a query passage through the system. The system must provide the following:
• the ability to use various ways of organizing domain schemes (a domain ontology, a class diagram, a glossary and so on), depending on the search goal;
• dependence of the search theme only on the content of the document database and on the domain scheme, which makes the other components of the system a purely technological framework that can be reused with another conceptual scheme and document base;
• discreteness of the search process, which makes it possible to backtrack at each stage of a search query passage through the system.
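The third requirement, backtracking at any stage, can be sketched as a session object that stores one snapshot per completed stage. The class SearchSession, its methods and the stage names are assumptions made for this illustration only; the article does not specify how the system stores intermediate results.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SearchSession:
    """Keeps one snapshot per completed stage so the user can step back."""
    history: List[Tuple[str, object]] = field(default_factory=list)

    def record(self, stage: str, result: object) -> None:
        """Store the result produced by a stage."""
        self.history.append((stage, result))

    def backtrack(self, stage: str) -> object:
        """Drop everything recorded after `stage` and return its result."""
        for index, (name, result) in enumerate(self.history):
            if name == stage:
                del self.history[index + 1:]
                return result
        raise KeyError(f"stage not recorded: {stage}")


# Hypothetical stage names and results, for illustration only.
session = SearchSession()
session.record("refine", ["tax", "taxes"])
session.record("convert", ["tax", "taxes", "levy"])
session.record("search", [3, 1])
```

Calling session.backtrack("convert") returns the converted query and discards the search results, so the search can be repeated from that stage with a corrected query.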
3. Graphical user interface
It gives the user an interface for composing a search query, refining the query at different stages of its passage through the system, and performing various operations on the search results. It is implemented as HTML pages generated by Java Server Pages and displayed in the user’s browser.
4. Query processing subsystem
4.1. Query refinement module
It implements the first stage of a query passage through the system. Using the morphology module, which is located in the data subsystem and provides a corresponding interface, the query refinement module refines the source text of the query by adding the various word forms of its elements. To deliver the results of the refinement, this module provides a corresponding interface, “IRefinedQuery”.
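A possible shape of this module is sketched below. Only the interface name “IRefinedQuery” comes from the text; the method refined_terms, the class QueryRefinementModule and the lookup table FORMS are hypothetical, and the morphology web service is replaced by a local function.

```python
from typing import Callable, Dict, List, Protocol


class IRefinedQuery(Protocol):
    """Interface name from the text; the method signature is assumed."""

    def refined_terms(self) -> List[str]: ...


class QueryRefinementModule:
    """Adds word forms to each word of the query (first stage)."""

    def __init__(self, query: str, word_forms: Callable[[str], List[str]]):
        self._query = query
        self._word_forms = word_forms  # stand-in for the morphology service

    def refined_terms(self) -> List[str]:
        terms: List[str] = []
        for word in self._query.lower().split():
            for form in [word, *self._word_forms(word)]:
                if form not in terms:  # keep order, drop duplicates
                    terms.append(form)
        return terms


# Hypothetical morphology lookup used only for this illustration.
FORMS: Dict[str, List[str]] = {"budgets": ["budget"], "city": ["cities"]}


def lookup(word: str) -> List[str]:
    return FORMS.get(word, [])


module: IRefinedQuery = QueryRefinementModule("City budgets", lookup)
```

Passing the morphology lookup in as a callable keeps the module independent of how word forms are actually obtained, which matches the interchangeability of components the architecture aims for.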
4.2. Query conversion module
It implements the second stage of a query passage through the system. Using the ontology building module, which is located in the data subsystem and provides a corresponding interface, the query conversion module converts the user’s query on the basis of the connections found between the query elements. To deliver the converted query, this module provides a corresponding interface, “IConvertedQuery”.
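The conversion step can be sketched as an expansion of the query along typed ontology relations. The table RELATIONS, the relation labels and the function convert_query are assumptions for illustration; in the described system the relations would come from the OWL ontology rather than from a dictionary.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical ontology fragment: term -> list of (relation, related term).
RELATIONS: Dict[str, List[Tuple[str, str]]] = {
    "budget": [("synonym", "treasury"), ("part_of", "public finance")],
    "city": [("association", "municipality")],
}


def convert_query(terms: List[str], allowed: Set[str]) -> List[str]:
    """Append terms linked to the query by one of the allowed relations."""
    result = list(terms)
    for term in terms:
        for relation, related in RELATIONS.get(term, []):
            if relation in allowed and related not in result:
                result.append(related)
    return result
```

Restricting the expansion to a chosen set of relation types is one way to let the user control how far the query is broadened.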
4.3. Documents list conversion module
It implements the fourth stage of a query passage through the system. Given the descriptors of documents (for example, search results), it performs operations that present those documents in a form convenient for the user: adding metainformation and fragments of the document text, setting sorting and filtering rules, and so on. For this purpose the module provides the “IProcessedDocsList” interface. Because the user can perform various operations on the search results, the same interface is used to set the parameters of those operations on the specified documents and to obtain their results.
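The operations on a list of document descriptors can be sketched as follows. The fields of DocDescriptor and the three functions are hypothetical; the article names only the kinds of operations (sorting, filtering, grouping), not their signatures.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DocDescriptor:
    """Assumed minimal descriptor of a found document."""
    doc_id: int
    title: str
    year: int
    score: float


def filter_by_year(docs: List[DocDescriptor], min_year: int) -> List[DocDescriptor]:
    """Keep documents published in `min_year` or later."""
    return [d for d in docs if d.year >= min_year]


def sort_by_score(docs: List[DocDescriptor]) -> List[DocDescriptor]:
    """Order documents from the highest relevance score to the lowest."""
    return sorted(docs, key=lambda d: d.score, reverse=True)


def group_by_year(docs: List[DocDescriptor]) -> Dict[int, List[DocDescriptor]]:
    """Group documents by publication year, preserving input order."""
    groups: Dict[int, List[DocDescriptor]] = {}
    for d in docs:
        groups.setdefault(d.year, []).append(d)
    return groups


# Hypothetical search results used only for this illustration.
RESULTS = [
    DocDescriptor(1, "Regional budgets", 2009, 0.4),
    DocDescriptor(2, "Tax statistics", 2011, 0.9),
    DocDescriptor(3, "Municipal reform", 2011, 0.7),
]
```

Because each function takes a list and returns a new collection, the operations can be chained in any order the user chooses, for example sort_by_score(filter_by_year(RESULTS, 2010)).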
5. Search engine module
It implements the third stage of a query passage through the system. This module is existing software that indexes and searches the documents in a database; the “Sphinx” search system is used here. To perform a search, the system connects to the database that stores the documents and their metainformation. The search results are returned as descriptors of the found documents together with the content that matched the search. To run searches, this module exposes an interface in the form of a web service that is accessed through a specific port.
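To show what the lower level search contributes, the sketch below replaces the search engine with a tiny in-memory inverted index. It does not use or imitate the Sphinx API; the functions build_index and search and the sample documents are assumptions, and a real deployment would delegate both indexing and search to the engine.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_index(documents: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each lower-cased word to the ids of documents containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], terms: List[str]) -> List[int]:
    """Return ids of documents containing at least one term, sorted by id."""
    found: Set[int] = set()
    for term in terms:
        found |= index.get(term.lower(), set())
    return sorted(found)


# Hypothetical document base used only for this illustration.
DOCS = {
    1: "Tax revenue by region",
    2: "Football results",
    3: "Regional tax policy",
}
INDEX = build_index(DOCS)
```

The stand-in matches exact words only, which is why the word forms and ontology terms added at the earlier stages matter for recall.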
Fig. 2 - A diagram of components of the system under development
6. Data subsystem
6.1. Morphology module
This module is used by the query refinement module to convert the search query entered by the user. It is implemented as a web service that returns the various word forms of a received word.
6.2. Ontology building module
This module stores the conceptual scheme of a domain and delivers it for use or editing. The conceptual scheme held in this module is defined in the OWL language. The module provides a user-access interface for browsing the ontology, which at the initial stage corresponds to a rubricator. Which ontology is delivered in response to a request depends on the specified search parameters.
6.3. Domain documents module
This module holds the collections of documents, together with their metainformation, over which the context search module performs the search. It provides interfaces for accessing the documents, both to add to or edit the collections and to let the corresponding system module index and search them.
7. Administration subsystem
7.1. Documents editor
It is special software that prepares documents for loading, loads them into the base, computes statistics and edits documents in the database. It is implemented in two variants:
• on the system server, accessed through a web interface;
• as a graphical application for the Windows operating system, in which all document preparation is done on the computer running the application, which somewhat reduces the server load.
While running, the editor uses the document database to store the processing results and the context search module to match the documents being processed against keywords.
7.2. Ontology editor
This module lets ontology developers build and edit the ontology used to convert the user’s query. The Protege ontology editor is used for this purpose, as it provides all the required functionality.
8. Main system search control module
This module is the core of the whole system under development. It runs the search process and controls it.
Conclusions
The developed architecture makes it possible to meet the requirements described above and offers some further advantages:
• interchangeability of the system components, which allows a component to be changed or replaced without affecting the others;
• openness of the system, which allows existing components such as a database management system, a search engine or an ontology editor to be integrated into it;
• an object-based organization, which allows distributed implementations of the system (for example, on several functional servers). It is worth noting, however, that in a number of cases placing the components as close together as possible significantly improves the safety of the system.
The technical framework obtained as a result of the development can be regarded as a framework for building search systems. This follows from the second requirement, the dependence of the search theme only on the ontology and the document base, and from the flexible operating parameters of the system, such as the ability to switch the use of morphology and of the conceptual scheme on or off and to choose which operations on search results are offered to the user.
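The switchable parameters mentioned above can be sketched as a settings object that decides which query-preparation stages are assembled. The names SearchSettings, build_stages and prepare_query are hypothetical, and the two stage functions are trivial stand-ins for the morphology and ontology steps.

```python
from dataclasses import dataclass
from typing import Callable, List

Stage = Callable[[List[str]], List[str]]


@dataclass
class SearchSettings:
    """Assumed flags for switching optional stages on or off."""
    use_morphology: bool = True
    use_ontology: bool = True


def build_stages(settings: SearchSettings, morphology: Stage, ontology: Stage) -> List[Stage]:
    """Assemble only the stages enabled in the settings."""
    stages: List[Stage] = []
    if settings.use_morphology:
        stages.append(morphology)
    if settings.use_ontology:
        stages.append(ontology)
    return stages


def prepare_query(terms: List[str], stages: List[Stage]) -> List[str]:
    """Run the query terms through the enabled stages in order."""
    for stage in stages:
        terms = stage(terms)
    return terms


# Trivial stand-ins for the real stages, for illustration only.
def add_forms(terms: List[str]) -> List[str]:
    return terms + ["taxes"]


def add_related(terms: List[str]) -> List[str]:
    return terms + ["levy"]
```

With both flags off the query reaches the search engine unchanged, so the same framework degrades to a plain context search.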
The retrieval system is being developed within the project “The informational system for the search of data and documents and the integration of knowledge in the subject field “public management” by means of theme ontology”, supported by grant № 11-02-12012v of the Russian Humanist Scientific Fund (project director: A.V. Bogomolova, associate professor of the Economic Faculty of M.V. Lomonosov MSU).
The ontology is being developed as an additional retrieval tool for the “Russia” University Informational System, to integrate statistical data and subject-specific analytical publications on the socio-economic development of Russia, its regions and municipalities [4, 5].
References
1. Data volume on the Internet is very close to 500 exabytes. It's 500 billion gigabytes. Technologies, Newsru.com, 2009. - [Internet resource]:[Article]. - URL: http://hitech.newsru.com/article/19may2009/netvolume
2. Lukashevych N.V. Thesauri in information retrieval tasks. Moscow: Published by Moscow State University, 2011. 512 p.
3. Rosseeva O.I., Zagorulko Yu.A. Organization of efficient search based on ontologies. Proceedings of International Workshop on Computer linguistics and its applications “Dialog'2001”, v.2, 2001. - [Internet resource]:[Article]. - URL: http://www.dialog-21.ru/materials/archive.asp?id=7029&y=2001&vol=6078
4. Yudina T.N. GIS technology for systems analysis subjects of Russian Federation for geostatistical data, Prikladnaya Informatica, 2011, №1 (31), p. 61-66.
5. Yudina T.N., Bogomolova A.V. MIS RUSSIA: building the infrastructure for a modern statistical education. Proceedings of the XII Russian Joint Conference “Internet and Modern Society” (IMS-2009), 27-29 October 2009, St. Petersburg, p. 8-12.