Vestnik ugrovedeniya (Bulletin of Ugric Studies), No. 1 (20), 2015
UDC 811.511.1
M. Bakro-Nagy, Zs. Duray, N. Mus, B. Oszko, M. Sipos, D. Takacs, Zs. Varnai. «Gold» mining. Exploitation of an etymological database: Uralonet
Abstract. This paper presents a multipurpose etymological database of the Uralic languages, which is available on the website of the Research Institute for Linguistics, Hungarian Academy of Sciences (http://www.uralonet.nytud.hu). It is based on the data of the German-language Uralic Etymological Dictionary (Redei 1986-1989), which was compiled by Karoly Redei and his colleagues in the 1980s. The dictionary, and consequently our database, contains the following data: daughter language forms and their meanings, the protoforms and their meanings derived from daughter language data, supplemented with linguistic explanations, and bibliographical data. Even in its present state, the database contains more linguistic information than the original dictionary. Besides German and Hungarian, reconstructed meanings are also available in English. The dictionary has been expanded with Dornseiff's system (1959) of semantic categorization and can thus also be searched by semantic field. The search options of the user interface are demonstrated through an example in which the etymology serving as the basis for illustration was chosen from the Ugric protolanguage. The language technological background of the database is also described.
Keywords: Uralic, etymology, electronic database, corpus.
Introduction
In the present paper our aim is to introduce Uralonet, a multipurpose etymological database of the Uralic languages, which has been launched at the Research Institute for Linguistics of the Hungarian Academy of Sciences (RIL HAS). The database is built from the data of the German-language Uralic Etymological Dictionary (UEW; Redei 1986-1989), which was compiled by Karoly Redei and his colleagues in the 1980s.
Overview
The database includes daughter language forms and their meanings, the protoforms and their meanings derived from daughter language data, as well as linguistic explanations and bibliographical data.1
Uralonet has been established to assist historical-etymological, phonological, morphological, semantic and lexicological research. At the same time, we intended to make it useful not only for historical linguists and etymologists but also for those involved in secondary and higher education. With that in mind, we made the terminology of the web user interface comprehensible to the wider public.
Even in its present state, the database contains more linguistic information than the original dictionary. Besides the German and the recently added Hungarian ones, reconstructed meanings will soon be available in English as well. The dictionary has been expanded with Dornseiff's system (1959) of semantic categorization,2 so reconstructed meanings can be searched by semantic field.
The present query options are in accordance with the original aims of the dictionary, and they also point to further research directions. The web user interface is now available in German, English and Hungarian. For a better understanding of our presentation, the following background information should be taken into consideration. Today Uralic languages are spoken in Europe and Northwest Asia by about 24 million people. Table 1 shows the protolanguages and their daughter languages.
Proto-Uralic
    Proto-Samoyedic: Nenets/Yurak, Enets, Nganasan, Selkup, +Kamassian, +Koibal, +Mator, +Karagas, +Tavgi
    Proto-Finno-Ugric
        Proto-Finno-Permic
            Proto-Permic: Udmurt/Votyak, Komi/Zyryan
            Proto-Finno-Volgaic: Mari/Cheremis, Mordvin, Saami/Lappish
                Proto-Balto-Finnic: *Finnish, *Estonian, Karelian, Olonets, Ludic, Veps, Votic, Ingrian, Livonian
        Proto-Ugric: *Hungarian
            Proto-Ob-Ugric: Khanty/Ostyak, Mansi/Vogul
Table 1. Uralic languages and protolanguages3
By focusing on the complete common lexicon of the Uralic languages, the compilers of the dictionary aimed at reconstructing the protoforms and their meanings according to uniform principles and methodology. Most of the languages of the Uralic family, which split up about 6,000 years ago, had no written records until the 1920s. The description of the earlier stages of the language family, including the setting up of etymologies, was therefore based exclusively on reconstruction, and the interpretation of the data was a demanding task. The dialectal diversity of the daughter languages and the significant amount of data required an adequate presentation in the dictionary. The multipurpose linguistic processing of this material can only be carried out through a database.
Digitization of the dictionary started in the 1990s at RIL HAS (see: Batori - Csucs 2001). Earlier online versions were realized in cooperation with the University of Koblenz (for details see: http://blade.uni-koblenz.de:8080/Uralothek/pdom/portal.html).
The present project, Uralonet, has been going on since 2010. The new concept is being developed on the basis of the digital data of earlier versions4 and on modern technical support. The present structure as well as the web user interface are the work of our team.
1 It contains 1,876 etymologies, 2,421 protoforms and protoform variants, 4,688 reconstructed meanings, 223 dialects/languages, 21,829 linguistic data in the sets of cognates, as well as 154 languages and protolanguages other than Uralic.
2 Bakro-Nagy 1991; Sipos 1999, 2003, 2010.
3 The database includes only the protolanguages printed bold in the table. Daughter languages are listed on the right with exonyms and endonyms separated by «/». Extinct languages are marked with «+». Languages with official status are marked with «*». All the other languages have minority status.
Language technological background
The main purpose of implementing the online dictionary is to digitize and preserve the original UEW and to make it publicly available online. A main principle was therefore formulated: we store and publish each and every bit of information, and no information shall be lost or changed (except for the errata). This principle has been followed throughout the building of the database and the web user interface. Another goal was to ensure the usability of the online dictionary by following current web standards and on-site search engine optimization guidelines.
Therefore the main requirements for the implementation were set as follows:
• The visual representation and the inner structure of the XHTML pages are as fine-grained as the arrangement of data in the database. In other words, every item of metadata is displayed and made searchable as far as possible.
• The language code of every linguistic data item is given both visually and in xml:lang attributes.
• All linguistic data are encoded in standard Unicode, so custom fonts can be omitted.
• The search form makes it possible to filter the results by any data field.
• The pages of the web site are valid XHTML, and XHTML semantic tags are used where applicable.
• The web user interface can be translated into any language when needed.
• Similarly, translations and commentaries in multiple languages may be given for the linguistic data.
• The URL of every page of the site is a permalink, meaning that it can be copied and used as a link to return to the very same page at any later time.
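The permalink requirement can be illustrated with a minimal sketch in which the whole state of a search is encoded in the query string, so the same filters always yield the same URL. The sketch is in Python for brevity, the base address is a placeholder, and the parameter names (protolanguage, certainty, lexical_field) are hypothetical; it does not reproduce the actual URL scheme of Uralonet.

```python
from urllib.parse import urlencode, parse_qsl, urlsplit

# Placeholder address for illustration only, not the real site.
BASE_URL = "https://example.org/search"

def permalink(**filters):
    """Build a stable URL from search filters (hypothetical parameter names)."""
    # Drop empty filters and sort the keys so that equal searches
    # always produce byte-identical URLs.
    query = sorted((k, v) for k, v in filters.items() if v not in (None, ""))
    return f"{BASE_URL}?{urlencode(query)}"

def filters_from(url):
    """Recover the search filters from a permalink."""
    return dict(parse_qsl(urlsplit(url).query))

url = permalink(protolanguage="all", certainty="all", lexical_field="1.11")
print(url)
print(filters_from(url))
```

Because the filters are sorted before encoding, two users who fill in the form in a different order would still obtain the same link under this design.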
From a technical point of view, we wanted a software system built from widely used and well-supported components, so that we can count on them for the next decade, and which make the development process easy and efficient. This led us to a Linux and Apache architecture with MySQL as the database engine and Perl as our programming language of choice.
Every text on the pages of the site, including the language data and their translations, is tagged with its corresponding language code. This makes it easy for text editors, search engines and other text processing software to determine the language of each part of a page and to handle it accordingly. The proper way to do this is to apply the language codes defined in the ISO 639-1, -2 and -3 standards in xml:lang attributes. Thus we had to collect the standard language codes of the Uralic (and some other) languages. We found that there are some Uralic languages to which the standard does not assign separate codes. In these cases we applied the fiu code as a fallback.
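The tagging with a fallback code can be sketched as follows. This is an illustration in Python rather than the project's Perl code; the small code table is a hypothetical sample, and the function names are ours. The codes shown (hu, fi, kca, mns) are the ISO 639 codes for Hungarian, Finnish, Khanty and Mansi, and fiu is the ISO 639-2 collective code for Finno-Ugrian languages.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical sample of a language-name -> ISO 639 code table.
ISO_CODES = {
    "Hungarian": "hu",  # ISO 639-1
    "Finnish": "fi",    # ISO 639-1
    "Khanty": "kca",    # ISO 639-3
    "Mansi": "mns",     # ISO 639-3
}
FALLBACK_CODE = "fiu"   # used when no separate code is available

def lang_code(language):
    """Return the ISO 639 code of a language, or the fiu fallback."""
    return ISO_CODES.get(language, FALLBACK_CODE)

def tag_form(form, language):
    """Wrap a linguistic form in a span carrying its xml:lang attribute."""
    return f"<span xml:lang={quoteattr(lang_code(language))}>{escape(form)}</span>"

print(tag_form("arany", "Hungarian"))
print(tag_form("form", "Variety without its own code"))
```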
The encoding of the vast amount of language data stored in the database went through three phases. Originally, a custom font was created with more than a thousand custom characters (code points in Unicode terms), in order to have a separate code point for every possible group of formatted letters with diacritic marks used in the transcription of the linguistic data. There are two viable ways to use a custom font in web pages. One is to expect the user to have installed the given font file on their computer or at least in their browser. The other is to tell the browser through CSS directives which special font file the page uses, so that it loads that font automatically behind the scenes; this is possible in principle, but in practice we came across problems with this approach. Consequently, in the second phase we focused on the visual representation of the pages and switched to another solution: the written form of the linguistic data is generated on the server side when needed, stored in plain image files and shown on the web pages together with the enclosing text. This approach works fairly well. However, a text mixed with in-line images makes it hard to reuse the content of the pages: copying and pasting such a text into another text processing application may lead to unexpected results. We think that Uralonet should make it possible, if not easy, to reuse its contents in a standardized way, and therefore another representation had to be developed, based on the Unicode standard. To do this, two conversion tables were built: one maps the original complex representation to an easily readable and editable encoding, and the second maps this encoding to standard Unicode letters and diacritic marks. The whole database was then converted using these two cascading conversion tables. The conversion tables are editable and extendable, so the text contents may be converted to other encoding styles if needed.
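The two cascading conversion tables can be pictured with a minimal sketch. It is written in Python for illustration, and everything specific in it is an assumption: the private-use code points standing in for the custom font, the brace notation of the intermediate encoding, and the table entries themselves.

```python
import re
import unicodedata

# Stage 1 (hypothetical entries): custom-font code point -> readable notation.
LEGACY_TO_READABLE = {
    "\ue001": "a{macron}",
    "\ue002": "s{caron}",
}

# Stage 2 (hypothetical entries): diacritic name -> Unicode combining mark.
READABLE_TO_UNICODE = {
    "macron": "\u0304",  # COMBINING MACRON
    "caron": "\u030c",   # COMBINING CARON
}

def to_readable(text):
    """Replace custom-font characters with the editable intermediate notation."""
    return "".join(LEGACY_TO_READABLE.get(ch, ch) for ch in text)

def to_unicode(text):
    """Replace {name} tokens with combining marks, then compose with NFC."""
    converted = re.sub(r"\{(\w+)\}", lambda m: READABLE_TO_UNICODE[m.group(1)], text)
    return unicodedata.normalize("NFC", converted)

def convert(text):
    """Apply both tables in sequence."""
    return to_unicode(to_readable(text))

print(to_readable("\ue002\ue001"))
print(convert("\ue002\ue001"))
```

Keeping the intermediate notation as a separate stage means that only the second table has to be swapped to target a different output encoding.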
To provide search capabilities over the CV-structure (consonant-vowel structure) of the linguistic data, we have built a so-called CV-index. It is stored in the database together with the original linguistic data and is generated with a hand-made conversion table which maps each character to a C or a V sign. The search engine matches the query terms against this CV-index. Combined queries can then be realized by mixing exact characters and C/V criteria.
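A minimal sketch of such an index and of a combined query is given below, in Python for illustration. The character table is a small hypothetical sample, and the query convention (upper-case C and V as wildcards, any other character as an exact match) is our assumption, not a description of the actual search syntax.

```python
# Hypothetical hand-made table mapping each character to "C" or "V".
CV_TABLE = {ch: "V" for ch in "aeiouäöü"}
CV_TABLE.update({ch: "C" for ch in "bdghjklmnprstvwzśšŋ"})

def cv_index(form):
    """Map every character of a form to C or V ("?" if not in the table)."""
    return "".join(CV_TABLE.get(ch, "?") for ch in form)

def matches(form, query):
    """Match a query mixing exact characters with C/V wildcards."""
    if len(form) != len(query):
        return False
    for ch, cv, q in zip(form, cv_index(form), query):
        if q in ("C", "V"):
            if cv != q:      # wildcard: compare against the CV-index
                return False
        elif ch != q:        # exact character: compare against the form
            return False
    return True

forms = ["waśke", "kala", "wolna"]  # illustrative sample forms
print([(f, cv_index(f)) for f in forms])
print([f for f in forms if matches(f, "CVCCe")])
```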
Another targeted usage scenario is to include syllable boundaries in the search criteria. This requires the database to store the word forms together with the positions of the syllables in them. The syllable index has been generated automatically.
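How a word form and its syllable positions might be stored together can be sketched as follows (Python, for illustration). The source does not describe the syllabification procedure, so the rule used here is a deliberately naive assumption: a consonant directly followed by a vowel opens a new syllable. The vowel inventory and the function names are hypothetical as well.

```python
VOWELS = set("aeiouäöü")  # hypothetical vowel inventory

def syllable_boundaries(form):
    """Return the positions at which a new syllable starts (naive rule)."""
    bounds = []
    for i in range(1, len(form) - 1):
        opens_syllable = form[i] not in VOWELS and form[i + 1] in VOWELS
        vowel_before = any(ch in VOWELS for ch in form[:i])
        if opens_syllable and vowel_before:
            bounds.append(i)
    return bounds

def index_entry(form):
    """Store the form together with its syllable positions."""
    return {"form": form, "boundaries": syllable_boundaries(form)}

def syllabified(entry):
    """Render an entry with '.' at the stored boundaries."""
    form, cuts = entry["form"], entry["boundaries"]
    return ".".join(form[a:b] for a, b in zip([0] + cuts, cuts + [len(form)]))

def has_syllable(entry, syllable):
    """Check whether the form contains the given syllable."""
    return syllable in syllabified(entry).split(".")

entry = index_entry("waske")
print(entry, syllabified(entry))
```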
The main purpose of the web user interface is, of course, to provide the users with as much information as possible. Besides this, we took into account some widely accepted web usability concerns. Thus a light, easy-to-understand design has been applied to the website, and the elements of the user interface and their layout are similar to those of well-known search engines.
Demonstration of the user interface search options
The demonstration below is restricted to showing one single example illustrating the potential links of the database to Hungarian text corpora.
A) Question: Are there words meaning precious metal reconstructed for the protolanguages?
The search form5 consists of four units. When searching, these units can be used separately or in combination. In the present search only the first one needs to be filled in, as it is the only unit necessary for answering the question.6
Search setting:
Protolanguage: All
Certainty: All
Lexical fields: 1.11. Mineralien
All the other fields remain unmarked.7
5 As the table shows, the database and the user interface are presently under construction.
6 The next three units are as follows: «Search in cognates» lists the daughter language data with their meanings; borrowings within the language family are also shown here. The third unit offers free word search in the linguistic explanations. The fourth one searches within the bibliography.
7 No.: numbers assigned to etymologies for technical reasons; Browse: search within the reconstructed forms listed according to the base characters.
Hits (display of data, 1-9 of 9):
Protoform     Protolanguage  Meaning                                                Bibliography  No.
[illegible]   Ug             'só', 'Salz'                                           UEW           1737
[illegible]   FP             'réz', 'Kupfer'                                        UEW           1232
[illegible]   FP             'vas (fn)', 'Eisen'                                    UEW           128?
[illegible]   FP             'só', 'Salz'                                           UEW           1534
[illegible]   FP             'rozsda; rozsdásodik', 'Rost; rosten, rostig werden'   UEW           1554
waśke         U              'vmilyen fém, ?réz', 'irgendein Metall, ?Kupfer'       UEW           1123
wolna         FU             'ón', 'Zinn'                                           UEW           1162
[illegible]   Ug             'ólom (fn)', 'Blei'                                    UEW           1871
[illegible]   Ug             'arany; réz', 'Gold; Kupfer'                           UEW           1746
Table 3.
This semantic field displays 9 etymologies. The only etymology satisfying the search criterion ('precious metal') is the one with the meaning 'gold; copper'.
B) Question: Which meaning is preserved in Hungarian?
Search setting: There is no need to set a new search field. It is sufficient to click on the previous hit(s).
Hits:
Table 4.
While the meaning 'copper' was preserved in Khanty and Mansi, in Hungarian the meaning 'gold' can be documented. The comments make clear that the Ugric protolanguage borrowed the word from Old Iranian, and they show that the meaning of its Avestan, Old Persian and Sanskrit reflexes is 'gold'. In a future development, links to non-Uralic (e.g. Indo-European) dictionaries, etymological databases and corpora might be inserted here. Relying on further external data could help in evaluating the phonetic and semantic aspects of etymologies. Moreover, the analysis of further source language data might lead to modifications of the Uralic etymological results.
Further links attached to the Hungarian data lead to the Hungarian corpora of RIL HAS connected to Uralonet: the Hungarian National Corpus8 (HNC); HNC beta, which presents a 10-item sample from the data of the HNC; and the Hungarian Historical Corpus (HHC), a corpus of Hungarian historical data from the period from 1772 to 1997.9 Further information on the Hungarian data can be obtained there (Table 5: the HHC result page for arany).
[HHC result page for arany: a keyword-in-context concordance of hits 1-30, in which forms such as arany, aranynak, aranyas, aranyos, aranyok, aranyat and aranyra are shown with their left and right contexts.]
Table 5.
Further developments
The database is intended to be developed in two directions. One of them is concerned with establishing further search options in the present database, and the other is related to establishing connections with external electronic text corpora and dictionary-based databases.
Ongoing
Search options:
• refining the options related to CV-structures and syllable structures
• searching according to the morphological structure of the reconstructed forms
Connection with databases:
• linking the Hungarian database to other corpora of RIL HAS (e.g. Old Hungarian Corpus,10 New Hungarian Etymological Dictionary)
• establishing links to the database of EUROBABEL Ob-Ugric Languages11
Further plans
Search options:
• searching onomatopoetic etymologies
• establishing links among etymologies belonging to the same semantic field
Connection with databases:
• establishing links with the etymological and dictionary-based databases and corpora of further Uralic and non-Uralic languages
8 At present the Hungarian National Corpus includes 187.6 million text words. It is divided into five regional language versions and it includes texts from five registers (Varadi 2002).
9 «... text samples were selected by professionals (literary historians, historians, mathematicians etc.) from printed works between 1772 and 1997. Currently the corpus consists of 25,822,775 words, with a relative majority (40%) from the second half of the 20th century. [...] A representation of the whole spectrum of written language was aimed while compiling HHC, that is why several genres are present in this corpus. We can find prosaic and rhythmic texts and texts from different registers» (Kozma, Martonfi, Szabo 2012).
10 Simon - Sass - Mittelholcz 2011.
11 For further information see: http://www.babel.gwi.uni-muenchen.de/
Bibliography
1. Bakró-Nagy Marianne. Die Begriffsgruppen des Wortschatzes im PU/PFU. UAJb. N.F. 1991. P. 13-40.
2. Bátori István, Csúcs Sándor. Uralische Etymologische Databasis. In: CIFU 9. Tartu 2001. Pars 4. P. 113-121.
3. Kozma Judit, Mártonfi Attila, Szabó Tamás Péter. A new genre in Hungarian lexicography. Scenes from the workshop of a new corpus-based dictionary. 4th International Conference on Corpus Linguistics - CCILC2012. Jaén, Spain, 22-24 March 2012.
4. Rédei Károly (Hrsg.). Uralisches etymologisches Wörterbuch. Akadémiai Kiadó, Budapest.
5. Simon Eszter, Sass Bálint, Mittelholcz Iván. Korpuszépítés ómagyar kódexekből. VIII. Magyar számítógépes konferencia. Szeged, 2011. December 1-2. P. 81-89.
6. Sipos Mária. Az ugor kori szókincs fogalomkörök szerinti csoportosítása. NyK 96. P. 158-169.
7. Sipos Mária. A finn-volgai alapnyelv fogalomköri felosztása. In: Molnár Zoltán - Zaicz Gábor (szerk.): Permistica et Uralica: Ünnepi könyv Csúcs Sándor tiszteletére. Fenno-Ugrica Pázmániensia I. Pázmány Péter Katolikus Egyetem. Piliscsaba. P. 223-228.
8. Sipos Mária. A finn-permi alapnyelv szókészletének fogalomköri felosztása. Kézirat. Budapest, MTA Nyelvtudományi Intézet.
9. Váradi Tamás. The Hungarian National Corpus. In: Proceedings of the 3rd LREC Conference, Las Palmas, Spain. P. 385-389. [Electronic resource] // URL: http://corpus.nytud.hu/mnsz.