Russian Journal of Linguistics
ISSN 2687-0088 (print), ISSN 2686-8024 (online)
2022 Vol. 26 No. 2 391-408
http://journals.rudn.ru/linguistics
https://doi.org/10.22363/2687-0088-30122
Research article
A cognitive linguistic approach to analysis and correction of orthographic errors
Robert REYNOLDS1,2, Laura JANDA1 and Tore NESSET1
1 UiT The Arctic University of Norway, Tromsø, Norway; 2 Brigham Young University, Provo, Utah, USA [email protected]
Abstract
In this paper, we apply usage-based linguistic analysis to systematize the inventory of orthographic errors observed in the writing of non-native users of Russian. The data comes from a longitudinal corpus (560K tokens) of non-native academic writing. Traditional spellcheckers mark errors and suggest corrections, but do not attempt to model why errors are made. Our approach makes it possible to recognize not only the errors themselves, but also the conceptual causes of these errors, which lie in misunderstandings of Russian phonotactics and morphophonology and the way they are represented by orthographic conventions. With this linguistically-based system in place, we can propose targeted grammar explanations that improve users' command of Russian morphophonology rather than merely correcting errors. Based on errors attested in the non-native academic writing corpus, we introduce a taxonomy of errors, organized by pedagogical domains. Then, on the basis of this taxonomy, we create a set of mal-rules to expand an existing finite-state analyzer of Russian. The resulting morphological analyzer tags wordforms that fit our taxonomy with specific error tags. For each error tag, we also develop an accompanying grammar explanation to help users understand why and how to correct the diagnosed errors. Using our augmented analyzer, we build a webapp to allow users to type or paste a text and receive detailed feedback and correction on common Russian morphophonological and orthographic errors.
Keywords: morphophonology, phonotactics, orthography, corpus, error taxonomy, webapp
For citation:
Reynolds, Robert, Laura Janda & Tore Nesset. 2022. A cognitive linguistic approach to analysis and correction of orthographic errors. Russian Journal of Linguistics 26 (2). 391-408. https://doi.org/10.22363/2687-0088-30122
© Robert Reynolds, Laura Janda and Tore Nesset, 2022
This work is licensed under a Creative Commons Attribution 4.0 International License https://creativecommons.org/licenses/by/4.0/
Research article
A linguo-cognitive approach to the classification and correction of orthographic errors
Robert REYNOLDS1,2, Laura JANDA1, Tore NESSET1
1 UiT The Arctic University of Norway, Tromsø, Norway; 2 Brigham Young University, Provo, Utah, USA [email protected]
Abstract
In this article, we propose a systematization of the orthographic errors of non-native speakers of Russian on the basis of linguistic and cognitive criteria. The material for the study comes from a longitudinal corpus (560,000 words) of Russian-language texts written by foreign students. Traditional automatic spell checkers detect errors and suggest corrections, but cannot build explanatory cognitive models. The proposed approach makes it possible to recognize not only the errors themselves, but also their conceptual causes, which lie in a misunderstanding of Russian phonotactics and morphophonology and of the ways they are represented by orthographic rules. This method makes it possible to account for the causes of grammatical errors and to recommend rules that improve users' command of Russian morphophonology rather than merely correcting errors. The principle for systematizing the annotated errors in the corpus of non-native academic writing, and the error taxonomy itself, are oriented toward teaching. On the basis of this taxonomy, we developed a set of mal-rules that extend the functionality of a finite-state analyzer of Russian. The resulting morphological analyzer annotates wordforms with dedicated error tags. For each error tag, we offer an accompanying explanation to help users understand why and how to correct the diagnosed errors. Using our extended analyzer, we are building a web application that allows users to type or paste a text and receive detailed comments and corrections for common morphophonological and orthographic errors in Russian.
Keywords: morphophonology, phonotactics, orthography, corpus, error taxonomy
For citation:
Reynolds R., Janda L., Nesset T. A cognitive linguistic approach to analysis and correction of orthographic errors. Russian Journal of Linguistics. 2022. Vol. 26. № 2. P. 391-408. https://doi.org/10.22363/2687-0088-30122
1. Introduction
Traditional approaches to spell checking are sometimes inadequate for the needs of non-native users because they are optimized for native speakers. Not only is it assumed that the user is capable of choosing between suggested corrections, but the suggestions themselves are optimized for the kinds of errors that native speakers make. Even if a non-native user were able to select the correct form from the suggested corrections, it is entirely possible that the user would not understand why it is the correct form in contrast to the form they wrote. Furthermore, whereas spell checking for native speakers is mainly a matter of fixing one-off random errors, non-native users need to acquire rules that they can apply in the future. The mistakes that non-native writers make tend to be systematic, so they can be analyzed linguistically and present excellent opportunities for targeted learning.
The output of a spellchecker will frequently be either too broad (merely marking a word as misspelled) or too specific (suggesting an alternative for a single given misspelled word) to support the acquisition of useful generalizations. Our proposed tool, the Russian Mentor for Orthographic Rules (RuMOR), is designed to help non-native users connect each specific error to linguistic generalizations, orthographic rules, and examples. This design encourages the user to update their understanding of Russian linguistic and orthographic patterns so that they can avoid making similar errors in the future.
Section 2 reviews related research in the fields of morphological analysis, spelling correction, and intelligent tutoring systems. In Section 3, we describe our methodology, including the process of classifying errors in the RULEC (ENA, April 16, 2022)1 corpus, modeling the errors in a finite-state framework using mal-rules, evaluating the model, and applying the model in a webapp for users. Section 4 contains a summary of our results and future research directions.
2. Related work
Our project is connected to research in a number of disparate fields, including Natural Language Processing (NLP), Intelligent Computer-Assisted Language Learning (ICALL), Russian Linguistics, and Second Language Acquisition (SLA).
2.1. Pedagogical foundations
Textbooks of Russian typically state spelling rules and contain explanations about pronunciation. However, the connection between this material and what it means for confident writing skills is underrepresented. In other words, students may learn that they should pronounce the letter е like an и when unstressed, or that the letter и sounds like ы when preceded by ж or ш. But students are not warned that these conventions will present challenges in spelling. Furthermore, these rules are typically not exercised in any systematic way and tend to remain peripheral from the students' perspective.
Traditional textbooks take an instruction-based perspective, with the idea of mere transfer of knowledge. A better model for pedagogy is learning by doing, whereby each student constructs their own knowledge network through active engagement. This framework, which is known as constructivism (Biggs 1999, Biggs & Tang 2011), promotes student-centered learning activities both within and outside the classroom. When a student of Russian makes a spelling error, RuMOR can capitalize on that event as an opportunity to engage students with targeted feedback on the relevant spelling and pronunciation conventions. A spelling error is something that is directly relevant to the student in the moment, thus opening up a "teachable moment", when the student is receptive to improvement of their skills. When used over time, RuMOR will engage each student with all of the typical errors that they need to focus on.
1 http://www.web-corpora.net/RLC/rulec
2.2. Morphological analysis
The Russian language has widespread fusional morphology, with each major word class having multiple inflection classes. Since the complexity of the morphological system is itself the source of many errors, a morphological analysis is frequently essential for determining what feedback will be most helpful to the user. Table 1 shows two authentic orthographic errors which, at the surface level, appear to be the same (mistakenly replacing и with е) but which are motivated by entirely different parts of the linguistic system.
Table 1. Different underlying motivations for identical surface substitutions
Correct form | Erroneous form | Substitution | Motivation
Марии 'Maria' | Марие | и → е | inflectional
умирает 'dies' | умерает | и → е | phonological
The erroneous Марие is morphologically motivated by the fact that the default Locative singular (and for feminine nouns like this one, Dative singular) ending is -е, but the writer has failed to take into account the exceptional rule that nouns in -ия instead take the ending -и. The incorrect spelling умерает is phonologically motivated by the fact that the pronunciation of е is indistinguishable from that of и in unstressed syllables, and in all forms of this verb the stress is on the vowel а.
The output of traditional spellcheckers would be able to tell the user what substitution is needed to correct the error, but it would be inadequate for determining feedback that helps get at the root of the mistake. On the other hand, a morphological analyzer that is sensitive to the grammatical structure of words can model errors such that these two errors can be linked to distinct and appropriate feedback that is relevant to the different factors that led to the error.
Approaches to automatic morphological analysis of Russian have historically gravitated toward rule- and lexicon-based methods. One reason for this is the existence of the seemingly prescient Grammatical dictionary of Russian (Zaliznjak 1977), which specifies the inflectional patterns of more than 100 000 words. On the basis of this dictionary, computational linguists have produced many Russian morphological analyzers/taggers. These include RUSTWOL (Vilkki 1997, 2005), StarLing (ENA, April 17, 2022)2 (Krylov & Starostin 2003), DiaLing (ENA, April 17, 2022)3, Mystem (Nozhov, 2003)4 (Segalovich 2003), pymorphy2 (ENA, April 17, 2022)5 (Korobov 2015, Boxarov et al. 2013), and UDAR (ENA, April 17, 2022)6. Although all of these analyzers could theoretically be augmented or adapted to provide more informative feedback than a traditional spellchecker, UDAR is best suited to our needs for a number of reasons. First, it is free and open-source, which facilitates operating in an Open Research paradigm. Second, it includes specification of word stress position, which is crucial for predicting some kinds of spelling errors. Third, it is integrated with a Constraint Grammar, a framework designed to deal with inherent ambiguity, a property which errors are notorious for. Fourth, the finite-state paradigm enables extremely fast lookup times, avoiding procedural logic at runtime.
2 http://starling.rinet.ru/downl.php
3 http://www.aot.ru (In Russ.)
4 https://yandex.ru/dev/mystem/
5 https://yandex.ru/dev/mystem/
2.3. Spelling and grammar correction
Rozovskaya and Roth (2019) classified errors from the RULEC corpus (The Russian Learner Corpus of Academic Writing, Alsufieva et al. 2012), and found that spelling errors were by far the most frequent class of errors, accounting for 18.6% of non-native errors and 42.4% of heritage speaker errors. Since spelling errors are by definition limited to the modality of writing, it seems safe to say that most, if not all, of these errors are a direct reflection of writing proficiency, as opposed to general language proficiency. Therefore, significant improvement in spelling ability is one of the most straightforward paths to build writing confidence and proficiency.
In recent years, there has been a significant uptick in research on spelling correction for Russian (Sorokin 2017), including SpellRuEval, a competition on automatic spelling correction for Russian (Sorokin et al. 2016). However, so far these research projects have understandably been focused only on surface-level correction, without regard to the underlying linguistic sources of the errors. A natural result of this narrow focus is that grammatical input is generally not included because it is not helpful to these models. Whereas grammatical awareness is sometimes a crucial element of pedagogically oriented spelling correction, the official report from SpellRuEval states that adding morphological and semantic features to these models for traditional spelling correction yields little to no gains.
Research on automatic grammatical error correction has been dominated by studies of English, but Rozovskaya and Roth (2019, 2021) have recently extended this research to Russian as well, with impressive results for certain kinds of errors. Although their research path is promising, it falls short for our application in the same way that recent spelling correction does: the training data (and by extension the outputs of the models) do not contain hypotheses about why errors are made.
2.4. Intelligent Language Tutoring Systems (ILTS)
Intelligent Language Tutoring Systems (ILTS) use Natural Language Processing to provide individualized feedback to users without the need for human graders or tutors. Historically, research on ILTS has been focused on workbook-style exercises with tightly controlled context (Heift 2010, Nagata 2009, Amaral & Meurers 2011, Choi 2016, Meurers et al. 2019). In these systems, limiting the context allows the designers to anticipate what kinds of feedback are appropriate. The more controlled the context, the less sophisticated the language analysis needs to be. Conversely, providing feedback on every aspect of language with unlimited context in an ILTS would require something near artificial general intelligence.
6 https://github.com/giellalt/lang-rus and https://github.com/reynoldsnlp/udar; UDAR is an abbreviated form of udarenie 'word stress', and it is also a recursive acronym: "UDAR Does Accented Russian."
One departure from the strategy of tightly controlling the context for feedback in ILTS is the Revita system (Kopotev et al. 2019), which allows users to upload their own texts in a number of languages, including Russian, and generate workbook exercises for that text. Notably, the feedback for incorrect responses is generally limited to connecting the mistake to another word in the sentence that governs the target word, or with which the target word should agree. Unlimited possibilities require limited feedback.
While the goal of RuMOR is also to provide feedback to any arbitrary text entered by the user, it is limited to spelling errors, which tend to be interpretable without reference to any surrounding context. Because the scope of the task is limited to only spelling errors, it is possible to provide detailed feedback with high confidence that the feedback will be germane.
Given that all major Russian morphological analyzers are lexicon- and rule-based, the most natural approach to analyzing Russian produced by non-native speakers in an ILTS is through the use of mal-rules (cf. Sleeman 1982, Matthews 1992). Mal-rules are rules that are added to license structures that are not valid in the standard language, but are expected in non-native language production. For example, UDAR uses two-level orthographic and phonological rules7 to generate standard Russian surface forms from an underlying representation. By modifying or deleting subsets of these rules, one can compile an analyzer that recognizes erroneous wordforms of the sort that non-native writers produce.
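The idea of deleting a rule to obtain a mal-rule can be illustrated with a minimal Python sketch. This is not UDAR's implementation (UDAR is compiled from finite-state formalisms); the toy lexicon, the function names `standard_rule`, `mal_rule`, and `build_analyzer`, and the stress flags are all assumptions made for illustration. The standard rule replaces unstressed о with е after hushers and ц, and the mal-rule is simply the same generator with that rule switched off.

```python
from typing import Dict, List, Optional, Tuple

HUSHERS_AND_TS = set("жчшщц")

# Hypothetical toy entries: (stem, underlying ending, is the ending stressed?)
Entry = Tuple[str, str, bool]
TOY_LEXICON: List[Entry] = [
    ("наш", "ой", False),   # surfaces as нашей
    ("больш", "ой", True),  # surfaces as большой (stressed ending keeps о)
]


def standard_rule(stem: str, ending: str, ending_stressed: bool) -> str:
    """Standard orthography: unstressed о becomes е after hushers and ц."""
    if ending.startswith("о") and not ending_stressed and stem[-1] in HUSHERS_AND_TS:
        return stem + "е" + ending[1:]
    return stem + ending


def mal_rule(stem: str, ending: str, ending_stressed: bool) -> str:
    """Mal-rule: the spelling rule is deleted, so the underlying о surfaces."""
    return stem + ending


def build_analyzer(lexicon: List[Entry]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each surface form to (correct form, error tag or None)."""
    analyzer: Dict[str, Tuple[str, Optional[str]]] = {}
    # Standard forms are added first so they are never shadowed by mal-forms.
    for stem, ending, stressed in lexicon:
        correct = standard_rule(stem, ending, stressed)
        analyzer[correct] = (correct, None)
    for stem, ending, stressed in lexicon:
        correct = standard_rule(stem, ending, stressed)
        wrong = mal_rule(stem, ending, stressed)
        if wrong not in analyzer:
            analyzer[wrong] = (correct, "SRo")
    return analyzer


ANALYZER = build_analyzer(TOY_LEXICON)
```

In this sketch, looking up нашой returns the correct form нашей together with the tag SRo, while нашей and большой are recognized without any error tag.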
3. Methodology
In this section, we describe the methods used to 1) identify the classes of errors to model in our analyzer, 2) augment UDAR to label these errors, and 3) implement the analyzer in the RuMOR webapp.
3.1. Classifying RULEC errors
Russian morphology is more complex than that of many major world languages, and the size of the paradigms, as well as the large number of arcane exceptions, pose a significant challenge. Although RuMOR is not designed to teach inflectional morphology, there are a number of morphophonological phenomena, such as stem alternations, that directly lead to spelling mistakes. Orthographies tend to accrete idiosyncratic conventions that can be especially obscure to non-native writers, and Russian orthography is rife with challenges. Russian orthography can be characterized as morphophonemic, as it does not always reflect phonological phenomena, such as vowel reduction, consonant voicing assimilation and final-obstruent devoicing.
7 Cf. Koskenniemi, Kimmo. 1983. Two-level morphology: A general computational model for word-form recognition and production. Technical report, University of Helsinki, Department of General Linguistics.
In order to determine which errors should be included in our model, we turned to the Russian Learner Corpus of Academic Writing (RULEC) (Alsufieva et al. 2012), currently the largest freely available corpus of Russian writing produced by non-native users. It consists of approximately 560 000 words, written by 15 non-native and 13 heritage writers, all residing in the United States. We analyzed the corpus using the udar (ENA, April 17, 2022)8 python package to output a list of all words not recognized by the analyzer. This method admittedly overlooks real-word errors, but we suspect that such errors are extremely infrequent in this corpus because opportunities for homophone errors in Russian are mostly limited to a few rare word pairs that are confusable due to final devoicing/voicing assimilation, such as лук 'onion' vs. луг 'meadow', both pronounced with final [k].
After generating this list of unrecognized tokens, we constructed a frequency distribution of errors and manually classified the tokens according to whether we believed the token was an actual error, or simply a valid token that UDAR did not recognize, such as the acronym СПбГУ 'Saint Petersburg State University'. For those tokens that we believe are spelling errors, we classified them linguistically according to the motivation behind the error, relying on our expertise as professional linguists and teachers. Each of these error tags is discussed in the following subsections.
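The first step of this procedure, collecting unrecognized tokens and counting them, might look roughly like the following Python sketch. It does not use the udar package's actual API: `is_recognized` is a hypothetical callback standing in for whatever analyzer lookup is available, and the tokenizer regex and the `KNOWN` word set are assumptions made only so that the example is self-contained.

```python
import re
from collections import Counter
from typing import Callable

# Simplified tokenizer (assumption): runs of Cyrillic letters, optionally hyphenated.
TOKEN_RE = re.compile(r"[а-яё]+(?:-[а-яё]+)*", re.IGNORECASE)


def unrecognized_counts(text: str, is_recognized: Callable[[str], bool]) -> Counter:
    """Return a frequency distribution of lowercased tokens the analyzer rejects."""
    tokens = (match.group(0).lower() for match in TOKEN_RE.finditer(text))
    return Counter(tok for tok in tokens if not is_recognized(tok))


# Stand-in for an analyzer lookup: a tiny set of known wordforms.
KNOWN = {"это", "означает", "больше"}
counts = unrecognized_counts(
    "Это означает больше. Ето озночает болше, болше!",
    KNOWN.__contains__,
)
```

The resulting counter can then be sorted by frequency (for example with `counts.most_common()`) to prioritize manual classification of the most common unrecognized tokens.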
The goal of RuMOR is to improve mastery of Russian orthography by making generalizations that users can apply in the future. In this sense, RuMOR has a different and more advanced linguistic goal than that of a spell-checker. Since RuMOR relies on linguistic analysis, it seizes upon spelling errors as teachable moments when it is most appropriate to deliver systematic explanations. Therefore, the tags are linguistically motivated rather than aimed at simple correction. Each tag can be considered an index to link the error to a relevant mini-lesson to help correct the error.
3.1.1. Overview of error tags
Table 2 contains a summary of the error tags currently included in our spelling model and webapp. The "Tag" column is the name of the tag, as implemented in UDAR. Many of the tag names merely describe the substitution that caused the error, so "a2o" means that the letter "a" was erroneously spelled as an "o". The "Linguistic label" column is a short pithy description of how to fix the error. More detailed descriptions of the error types are given in the "Tag explanation" column, and relevant examples of misspelled words are provided in the "Examples" column.
8 https://github.com/reynoldsnlp/udar
Table 2. Summary of error tags
Tag | Linguistic label | Tag explanation | Example(s)
a2o | о→а | Misspelling (о should be а) | озночает
e2je | е→э | Misspelling (е should be э) | ето
FV | no fill vowel | Presence of unnecessary fleeting vowel | отеца
H2S | ь→ъ | Misspelling (ь should be ъ) | подьезд
i2j | й→и | Misspelling (й should be и) | миллйард
i2y | ы→и | Misspelling (ы should be и) | блызко
ii | ие→ии | ие should be ии | Марие
Ikn | и→е/я/а | Ikanje (и should be е/я/а) | дитей
j2i | и→й | Misspelling (и should be й) | рабочии
je2e | э→е | Misspelling (э should be е) | проэкта
NoFV | add fill vowel | Missing fleeting vowel | окн
NoGem | add double letter | Geminate letter is missing | имено
NoSS | add ь | Misspelling (ь is missing) | болше
o2a | а→о | Akanje (а should be о) | каторый
Pal | add softening | Missing palatalization at stem-ending interface | землу
sh2shch | щ→ш | Misspelling (щ should be ш) | лучще
shch2sh | ш→щ | Misspelling (ш should be щ) | вообше
ski | ский→ски | по-~ский instead of по-~ски | по-русский
SRo | о→е | Spelling Rule о>е | нашой
SRy | ы→и | Spelling Rule ы>и | книгы
y2i | и→ы | Misspelling (и should be ы) | описивают
prijti | прийти | Misspelling the stem of прийти | прийду
revIkn | е/я/а→и | Reversed Ikanje (е, а, я should be и) | умерает
Gem | no double letter | Should be just single, not geminate letter | рассширить
3.1.2. Fill vowels
Fill vowels (also known as "fleeting" or "mobile" vowels) are vowels that are only realized if there is no inflectional ending, or if the inflectional ending does not begin with a vowel. For example, окно 'window.SG.NOM' has an inflectional ending, so there is no fill vowel, but окон 'window.PL.GEN' has no inflectional ending so the fill vowel appears between the к and the н.
Fill vowel errors clearly demonstrate both the linguistic motivation for our project, as well as the methodological necessity of a morphological analyzer. There are generalizations that help predict which fill vowels appear in what contexts, but ultimately, they are lexically specified and must be memorized. A traditional spellchecker cannot identify that a particular letter omission or insertion is related to fill vowels, so it cannot direct users to remedial resources. Further, because the "rules" for fill vowels have many exceptions, it is essential to rely on a structured lexicon, such as that in UDAR, to model which errors are related to fill vowels.
We currently have two fill vowel (FV) error tags. The FV tag indicates the presence of a fill vowel that should not be present, and the NoFV tag indicates the absence of a fill vowel that should be present. Since users tend to think in terms of generating oblique forms from the lemma, these tags are far more likely to appear on oblique forms, e.g., erroneous отеца 'father.SG.GEN.FV' (cf. correct отца) and erroneous окн 'window.PL.GEN.NoFV' (cf. correct окон). Errors in the lemma, which users are most familiar with, are quite rare, e.g., отц 'father.SG.NOM.NoFV' instead of correct отец, or оконо 'window.SG.NOM.FV' instead of correct окно. Our analyzer recognizes all of these forms.
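The interplay between lexically specified fill vowels and the two tags can be sketched in Python as follows. This is an illustrative simplification rather than UDAR's lexc/twolc encoding: the function name `fill_vowel_forms` and the representation of a lexeme as a pair of stems (with and without the fill vowel) are assumptions. The condition itself follows the description above: the fill vowel is realized when there is no ending or the ending does not begin with a vowel.

```python
from typing import Tuple

VOWELS = set("аеёиоуыэюя")


def fill_vowel_forms(bare_stem: str, filled_stem: str, ending: str) -> Tuple[str, str, str]:
    """Return (correct form, expected mal-form, error tag of the mal-form).

    bare_stem   -- stem without the fill vowel, e.g. "отц", "окн"
    filled_stem -- stem with the fill vowel, e.g. "отец", "окон"
    ending      -- inflectional ending, "" for a zero ending
    """
    if not ending or ending[0] not in VOWELS:
        # Fill vowel must be realized; omitting it yields a NoFV error.
        return filled_stem + ending, bare_stem + ending, "NoFV"
    # Vowel-initial ending: no fill vowel; inserting it yields an FV error.
    return bare_stem + ending, filled_stem + ending, "FV"


genitive_father = fill_vowel_forms("отц", "отец", "а")
nominative_father = fill_vowel_forms("отц", "отец", "")
nominative_window = fill_vowel_forms("окн", "окон", "о")
genitive_plural_window = fill_vowel_forms("окн", "окон", "")
```

Because the two stems come from the lexicon rather than from a surface pattern, the sketch never needs to guess which consonant clusters host a fill vowel, which is the reason a structured lexicon is needed for this error class.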
3.1.3. Vowel reduction
Russian vowels are always spelled as they would be pronounced if they were stressed, despite the fact that the sounds of some vowels are very different when they are not stressed. What sounds like unstressed [i] might be spelled и, е, а, or я; and what sounds like unstressed [a] might be spelled а or о. Spelling unstressed vowels is therefore a major challenge, even for native Russian speakers. Native speakers can often solve this problem by remembering a related word or wordform where the given vowel is stressed. For example, to spell река [rik'a] 'river.SG.NOM' a native speaker can think of a form of the word with different stress, such as реку [r''eku] 'river.SG.ACC'. However, non-native users have more limited relevant knowledge to draw on, and vowel reduction is one of the most frequent causes for spelling errors in the RULEC corpus.
The pronunciation of an orthographic о as [a] is called "akanje" by linguists, and the associated spelling error is tagged o2a. The pronunciation of orthographic е, а, or я after palatalized consonants as [i] is called "ikanje", and the associated spelling error is tagged Ikn. These are the most common error tags for vowel reduction. However, we were surprised to find that akanje and ikanje create enough confusion in the minds of users that they sometimes do the exact opposite (hypercorrection). The tag a2o identifies instances where an orthographic а is replaced by о, even though it is pronounced [a], as with the token озночает 'signify.PRS.3P.SG.a2o' (cf. correct означает). Similarly, the tag revIkn identifies instances where an orthographic и is replaced by а, е, or я, as with the token умерает 'die.PRS.3P.SG.revIkn' (cf. correct умирает).
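Why stress information matters for these tags can be seen in a small Python sketch that generates the expected akanje-related mal-forms of a word. It is a toy under stated assumptions: the function name `reduction_malforms` is hypothetical, the stress position is passed in as a character index, only one substitution per mal-form is generated, and only the о/а pair is handled (ikanje would need knowledge of the preceding consonant).

```python
from typing import Dict

# Correct letter -> (erroneous letter, tag), for unstressed positions only.
SWAPS = {"о": ("а", "o2a"), "а": ("о", "a2o")}


def reduction_malforms(word: str, stress_index: int) -> Dict[str, str]:
    """Map each single-substitution mal-form of `word` to its error tag.

    stress_index is the position of the stressed vowel, which is never altered.
    """
    malforms: Dict[str, str] = {}
    for i, letter in enumerate(word):
        if i == stress_index or letter not in SWAPS:
            continue
        wrong_letter, tag = SWAPS[letter]
        malforms[word[:i] + wrong_letter + word[i + 1:]] = tag
    return malforms


kotoryj = reduction_malforms("который", 3)    # stress on the second о
oznachaet = reduction_malforms("означает", 5)  # stress on the second а
```

Here каторый is predicted as an o2a error for который, and озночает as an a2o hypercorrection for означает; without the stress index, the stressed vowels would wrongly be treated as candidates for substitution too.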
3.1.4. Phonetic competence
Depending on a user's first language, some of the sounds of Russian are difficult to distinguish, so choosing between letters whose sounds seem indistinguishable is a common problem.
The first instance of confusion that we model is between the letters ш and щ, both representing voiceless fricatives that English-speaking users associate with "sh". The former is post-alveolar, and the latter is palatal. Whether because of the similarity of the orthographic symbols or the similarity of the sounds, non-native writers frequently substitute these letters for one another. The tag sh2shch identifies instances where ш has been replaced by щ, as with the erroneous token лучще 'better.ADV.sh2shch' (cf. correct лучше). Conversely, the tag shch2sh marks instances where щ has been replaced by ш, as in erroneous вообше 'generally.ADV.shch2sh' (cf. correct вообще).
Another phonetic difficulty is the distinction between the high central unrounded vowel [ɨ] and the high front vowel [i]. Although linguists do not agree on the phonemic status of [ɨ] and [i], they are represented in standard orthography by two separate letters, ы and и, respectively. Not only is the vowel [ɨ] difficult to pronounce for many non-native speakers, but it is not represented consistently in standard orthography. Although the vowel [ɨ] is mostly represented by the letter ы, in some contexts it is written as и, most notably when preceded by the letters ж or ш. The difficulty of phonetic competence, combined with orthographic inconsistency of [ɨ], leads to many spelling errors substituting these letters for one another. The tag y2i marks tokens where ы has been replaced by и, as in описивают 'describe.PRS.3P.PL.y2i' (cf. correct описывают). The i2y tag marks tokens with the inverse substitution, such as блызко 'close.ADV.i2y' (cf. correct близко).9
Two of our error tags are motivated by a misunderstanding of phonemic palatalization in Russian consonants. In modern usage, the soft sign ь indicates that the preceding consonant is palatalized, and the hard sign ъ indicates that the preceding consonant is not palatalized. Generally speaking, consonants are assumed to be hard, so the hard sign appears in only one context: between prefixes that end in a consonant, and stems that begin with е, ё, ю, or я, as in подъезд 'stairwell'. However, given the relative frequency of the visually similar soft sign ь, non-native writers frequently use the soft sign in place of the hard sign, as in подьезд 'stairwell.H2S' (cf. correct подъезд). Similarly, for users that have not acquired palatalization in their language, the role of the soft sign ь is difficult to grasp. This leads to its frequent omission, as in болше 'bigger/more.NoSS' (cf. correct больше).
A prominent feature of Russian phonology is consonant palatalization (commonly referred to as hardness vs. softness). Russian orthography marks consonant hardness or softness by two parallel sets of vowel letters (and the symbols ь and ъ), so that hard consonants are followed by one set, and soft consonants by the other. When inflecting words, users are prone to change the hardness or softness of the stem-final consonant by using a vowel from the wrong set. In particular, it is most common to change soft consonants to hard consonants. Errors of this type are indicated with the tag Pal, as in the error землу 'earth.ACC.Pal' (cf. correct землю).
3.1.5. Alphabetic confusion
Some spelling errors are either evidence of misunderstanding of the sounds or roles associated with a given letter, or interference from the alphabet of the user's first language. These errors differ from those in Section 3.1.4 (Phonetic competence) in that the users are proficient at producing and perceiving these sounds, but simply fail to associate the sounds with their corresponding symbols. The first pair of such letters is the vowel letter и [i] and the consonant letter й [j]. Examples of these errors include рабочии 'worker.SG.NOM.j2i' (cf. correct рабочий) and миллйард 'billion.SG.NOM.i2j' (cf. correct миллиард).
9 Note that the i2y tag and the SRy tag are complementary. The i2y tag applies anywhere that the SRy tag does not.
Another pair of letters that are easily confused are е [je] and э [e]. The letter э only occurs in a small number of high-frequency types, almost exclusively word-initially. Examples of these errors include ето 'this.e2je' (cf. correct это) and проэкта 'project.SG.GEN.je2e' (cf. correct проекта).
3.1.6. Spelling Rules
A small set of consonant letters have restrictions on which vowel letters are allowed to follow them, in some cases motivated by phonological restrictions at the time of orthographic standardization. The relevant consonants are the so-called hushers (ж, ч, ш, and щ), velars (г, к, and х), and the letter ц. These spelling rules are generally mentioned by Russian textbooks because they are especially relevant for inflectional endings. However, in many cases textbooks merely state these rules rather than attempting to actively engage students in acquiring them. As a result, such rules tend to remain abstract and students get little opportunity to work out their implications.
The first spelling rule is that after the so-called hushers and ц, an unstressed letter о is replaced by the letter е. Violations of this rule are indicated with the tag SRo, as in the error нашой 'our.FEM.SG.GEN.SRo' (cf. correct нашей).
Another spelling restriction is that after velars or hushers, the letter ы is replaced by и. Unfortunately, for two of the hushers, this restriction is no longer a valid reflection of modern phonology, since ж and ш are now non-palatalized consonants. Because of this, not only is the rule sometimes difficult to remember and apply, but it is also phonetically misleading. Violations of this spelling rule are indicated with the tag SRy, as in the error душы 'soul.PL.NOM.SRy' (cf. correct души).
The third spelling rule is one that is not explicitly discussed in any textbooks that we are aware of but is nonetheless a cause for confusion for many non-native speakers. The letter ц can be followed by either ы or и, depending on whether it is in the stem or the inflectional ending. In stems, ц is followed by и (e.g., цирк 'circus'),10 and in endings ц is followed by ы. Violations of this rule are indicated with the tag SRc, as in the error цыфровой 'digital.SRc' (cf. correct цифровой).
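The second and third spelling rules depend on where the stem ends and the ending begins, which a surface-only checker cannot see. The Python sketch below assumes that a morphological analysis has already split the token into stem and ending; the function name `spelling_rule_tags`, the prefix-based exception list, and the treatment of ци in endings as an SRc violation are illustrative assumptions, not the authors' twolc rules.

```python
from typing import List

VELARS = set("гкх")
HUSHERS = set("жчшщ")
# Exceptions named in the footnote (цыплёнок, цыган, на цыпочках), matched by prefix.
TS_EXCEPTION_PREFIXES = ("цыпл", "цыган", "цыпоч")


def spelling_rule_tags(stem: str, ending: str) -> List[str]:
    """Return the spelling-rule error tags triggered by a stem + ending split."""
    tags: List[str] = []
    # SRy: ы may not follow a velar or husher.
    if ending.startswith("ы") and stem and stem[-1] in VELARS | HUSHERS:
        tags.append("SRy")
    # SRc: inside stems ц is followed by и (apart from a few exceptions) ...
    if "цы" in stem and not stem.startswith(TS_EXCEPTION_PREFIXES):
        tags.append("SRc")
    # ... and in endings ц is followed by ы (assumed here to use the same tag).
    if stem.endswith("ц") and ending.startswith("и"):
        tags.append("SRc")
    return tags
```

For example, the split душ + ы triggers SRy and цыфров + ой triggers SRc, while отц + ы and the exceptional stem цыган trigger nothing.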
3.1.7. По-_ски
Many adjectives ending in -ский can be converted to adverbs by adding the hyphenated prefix "по-" and removing the final й. For example, русский 'Russian' becomes по-русски 'in Russian'. Non-native writers frequently forget to remove the final й. This error is indicated by the tag ski, as in the error по-русский 'Russian.ski'.
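Because this error has a fixed surface shape, it is one of the few that can be approximated without a lexicon. The following Python sketch uses a regular expression as a rough stand-in for the corresponding mal-rule; the pattern and the function name `diagnose_ski` are assumptions for illustration, and the pattern does not verify that the base adjective actually exists.

```python
import re
from typing import Optional

# по- + any word characters + ск, followed by the superfluous adjectival -ий.
SKI_ERROR = re.compile(r"^(по-\w+ск)ий$")


def diagnose_ski(token: str) -> Optional[str]:
    """Return the corrected adverb if the token shows the `ski` error, else None."""
    match = SKI_ERROR.match(token.lower())
    if match:
        return match.group(1) + "и"
    return None
```

With this sketch, `diagnose_ski("по-русский")` yields the corrected adverb по-русски, whereas the correct по-русски and the plain adjective русский are left alone.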
10 There are a handful of exceptions to this rule, including цыплёнок 'chick', цыган 'gypsy', на цыпочках 'on tiptoe'.
3.1.8. Morphological errors
Another common error is particular to stems ending in an underlying /ij/, whose lemmas orthographically end in -ий, -ие, and -ия, such as критерий 'criterion', здание 'building', and Мария 'Maria'. For such stems, any paradigmatic cell that would otherwise end in -е ends in -и instead. For all three classes, this includes the locative (i.e. prepositional) case and for feminine nouns, the dative case. Errors regarding this principle are indicated with the tag ii, as in о критерие 'about the criterion.LOC.ii' (cf. correct о критерии).
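The rule and its expected violation can be written down compactly. The Python sketch below is a simplified illustration that works directly on lemmas; the function name `locative_forms` is hypothetical, and a real implementation would rely on the analyzer's inflection classes rather than on the spelling of the lemma.

```python
from typing import Optional, Tuple

IJ_LEMMA_ENDINGS = ("ий", "ие", "ия")


def locative_forms(lemma: str) -> Optional[Tuple[str, str]]:
    """For lemmas in -ий, -ие, -ия return (correct locative, `ii` mal-form).

    Returns None for lemmas outside this class.
    """
    if not lemma.endswith(IJ_LEMMA_ENDINGS):
        return None
    stem = lemma[:-1]  # критери-, здани-, Мари-
    return stem + "и", stem + "е"


criterion = locative_forms("критерий")
maria = locative_forms("Мария")
```

Note that for lemmas in -ие the mal-form coincides with the lemma itself (здание), so an isolated token cannot reveal the error in that subclass without information about the intended case.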
3.1.9. Geminates
As in many languages, it is difficult for writers to know which letters are duplicated. Errors that include geminate letters where they should not be are indicated using the tag Gem, as in колличество 'quantity.Gem' (cf. correct количество).11 Errors that do not include geminate letters where they should be are indicated with the tag NoGem, as in искуство 'art.NoGem' (cf. correct искусство).
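The asymmetry between the two tags can be seen by generating the candidate misspellings for a single word. The following sketch is illustrative only and uses hypothetical function names: a NoGem variant exists only where the correct form already has a doubled letter, whereas a Gem variant can be produced at every letter of every word.

```python
def nogem_variants(word: str) -> set[str]:
    """Forms obtained by reducing one doubled letter to a single letter."""
    return {
        word[:i] + word[i + 1:]
        for i in range(len(word) - 1)
        if word[i] == word[i + 1]
    }


def gem_variants(word: str) -> set[str]:
    """Forms obtained by doubling one letter; one candidate per position."""
    return {
        word[:i] + word[i] + word[i:]
        for i in range(len(word))
        if word[i].isalpha()
    }
```

For искусство there is a single NoGem candidate (искуство), while количество, a ten-letter word with no doubled letters, has no NoGem candidates but ten Gem candidates, one of which is the attested колличество.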
3.1.10. Прийти
The stem of the lexeme прийти 'to come' causes problems for native and non-native speakers alike. The й appears in the infinitive прийти, but not the indicative: пришла 'come.PST.FEM', придет 'come.NONPST.3P.SG'. This may feel unexpected when compared with some other prefixed forms of идти 'go' which do have й in the non-past: зайдет 'drop by.NONPST.3P.SG', пройдет 'pass.NONPST.3P.SG'. Errors related to this lexeme are indicated with the tag prijti, as in прийду 'come.NONPST.1P.SG.prijti' (cf. correct приду).
3.2. Automatic error diagnosis: extending UDAR
Each of the error types discussed in Section 3.1 can be formalized as a rule. As mentioned in Section 2.4, rules that license non-normative words or structures are referred to as mal-rules (cf., e.g., Sleeman 1982, Matthews 1992 and references therein). In this section, we provide an abbreviated overview of the mechanics of applying our mal-rules to UDAR.
11 The insertion of geminates is problematic for practical reasons. The corresponding mal-rule would apply to virtually every letter of every word in the analyzer, exploding the amount of storage and memory required for the analyzer. Although theoretically possible, the Gem tag is usually omitted for practical reasons.
UDAR is a finite-state transducer, built using three formalisms: the lexc language for creating the finite-state lexical network; the twolc language for realizing orthographic and morphophonological rules on surface forms; and vislcg3 for writing a Constraint Grammar to resolve morphosyntactic ambiguity on the basis of surrounding context.12 Our mal-rules are applied in one of two ways. First, rules that are sensitive to underlying morphophonological structure (such as ii, FV, NoFV, Pal, and SRo) are implemented as alternative twolc rules.13 Second, rules that can be modeled as simple character substitution are implemented as XFST regular expression replace rules.14 In either case, the process for adding a tag to the transducer is the following.
First, a standard transducer is compiled, using UDAR's original rules. Then, for each tag, the mal-rule is applied to make an error transducer. The standard transducer is subtracted from the error transducer so that only wordforms that were affected by the mal-rule remain. Then, the error tag is added to all forms in the error transducer, and the resulting transducer is added to the standard transducer by disjunction (ENA, April 17, 2022).15 In this way, all of UDAR's original contents are preserved, and all additions are tagged with the appropriate error tags.
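The four steps can be mimicked with ordinary Python sets, treating a transducer as a finite set of (surface form, analysis) pairs. This is a toy model for illustration only: the actual build operates on compiled finite-state transducers, and the two-entry lexicon, the analysis strings, and the function names below are all hypothetical.

```python
import re
from typing import Callable

Pair = tuple[str, str]  # (surface form, analysis)


def sry_mal_rule(surface: str) -> str:
    """Toy mal-rule: write ы instead of и after a velar or husher."""
    return re.sub(r"(?<=[гкхжчшщ])и", "ы", surface)


def add_error_tag(
    standard: set[Pair], mal_rule: Callable[[str], str], tag: str
) -> set[Pair]:
    # Step 2: apply the mal-rule to build the error "transducer".
    error = {(mal_rule(surface), analysis) for surface, analysis in standard}
    # Step 3: subtract the standard forms, keeping only affected wordforms.
    error -= standard
    # Step 4a: add the error tag to every remaining analysis.
    tagged = {(surface, f"{analysis}.{tag}") for surface, analysis in error}
    # Step 4b: disjunction (set union) with the standard transducer.
    return standard | tagged


# Step 1: the "standard transducer", here a two-entry toy lexicon.
standard = {("души", "душа.PL.NOM"), ("сыр", "сыр.SG.NOM")}
augmented = add_error_tag(standard, sry_mal_rule, "SRy")
```

In this toy model, the augmented set keeps both original entries and gains exactly one tagged entry, душы with the analysis душа.PL.NOM.SRy; сыр is untouched by the mal-rule and is therefore removed again by the subtraction step.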
To the extent possible, errors are accumulated, one after the other, so that words with more than one kind of error can be recognized. However, several of the rules feed into one another, or could even reverse one another. For example, if e2je were added on top of je2e, the resulting surface form would be identical to the correct form, but would be tagged for both errors. Therefore, the errors are grouped by context, and all errors affecting the same context are added in parallel. In this way, errors in different context-groups can stack on one another, but errors in the same context-group do not.
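The difference between stacking rules and adding them in parallel can be shown with two toy rules that reverse one another. The sketch below again models a transducer as a set of (surface, analysis) pairs; the rules x2y and y2x, the toy entry, and the function names are invented for illustration and do not correspond to actual UDAR tags.

```python
from typing import Callable, Iterable

Pair = tuple[str, str]  # (surface form, analysis)
TaggedRule = tuple[str, Callable[[str], str]]


def add_group(current: set[Pair], group: Iterable[TaggedRule]) -> set[Pair]:
    """Add every rule in one context-group in parallel: all rules see the same input."""
    additions: set[Pair] = set()
    for tag, rule in group:
        for surface, analysis in current:
            changed = rule(surface)
            if changed != surface:  # keep only forms the rule affected
                additions.add((changed, f"{analysis}.{tag}"))
    return current | additions


def build(standard: set[Pair], groups: Iterable[Iterable[TaggedRule]]) -> set[Pair]:
    """Stack context-groups: each group applies to the output of the previous one."""
    current = set(standard)
    for group in groups:
        current = add_group(current, group)
    return current


x2y: TaggedRule = ("x2y", lambda s: s.replace("x", "y"))
y2x: TaggedRule = ("y2x", lambda s: s.replace("y", "x"))
standard = {("xa", "XA")}

stacked = build(standard, [[x2y], [y2x]])  # rules in separate groups
parallel = build(standard, [[x2y, y2x]])   # rules in the same group
```

When the two rules are stacked, the correct surface form xa acquires a spurious reading tagged with both errors; when they are placed in the same group, only the genuine error form ya is added.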
3.2.1. Evaluation
We analyzed the entire RULEC corpus using our augmented analyzer, compiled a list of all types that are tagged as errors, and compared the output of the analyzer with our manual labels. We found that for our target errors, the analyzer has perfect recall, meaning that every token that was manually labeled with one of our target error tags was also labeled as such by the augmented analyzer. However, not all of the errors identified in the corpus fit into these categories. Out of 279 manually labeled error types, our analyzer labeled 124 (44.4%). Out of 999 manually labeled error tokens, our analyzer labeled 467 (46.7%).
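The two coverage figures follow directly from the reported counts, as the short calculation below shows. The function name is hypothetical; the counts are those given above.

```python
def coverage(labeled: int, total: int) -> float:
    """Percentage of manually labeled errors that the analyzer also labels."""
    return round(100 * labeled / total, 1)


type_coverage = coverage(124, 279)   # error types
token_coverage = coverage(467, 999)  # error tokens
```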
12 The lexc and twolc source files can be compiled using either Xerox Finite-State Tools (XFST) (Beesley and Karttunen 2003) or Helsinki Finite-State Transducer Technology (HFST) (Linden et al. 2011).
13 For a detailed explanation of how the twolc rules in UDAR function, see chapter 2 of Reynolds, Robert. 2016. Russian natural language processing for computer-assisted language learning: capturing the benefits of deep morphological analysis in real-life applications. Ph.D. thesis, UiT - The Arctic University of Norway.
14 For a detailed explanation of XFST regular expressions, see Beesley and Karttunen (2003).
15 The Makefile that builds the error transducer can be found at https://github.com/giellalt/lang-rus/blob/8839887e986ae15a255e3396f08d394e8efac363/src/Makefile L2
3.3. RuMOR webapp
RuMOR is a free and open-source webapp allowing users to get interactive feedback on Russian spelling errors (ENA, April 17, 2022).16 RuMOR was built as a mobile-first webapp, so that it can be used comfortably on desktops, laptops, and mobile devices. Currently, two interface languages are available: English and Norwegian. A screenshot of the app is shown in Figure 1.
The user is prompted to type or paste a text, and upon submitting the text, words identified by our augmented analyzer as spelling errors are turned into clickable links. Tokens are considered errors only if all possible readings are errors, so our system does not currently attempt to handle real-word errors. For example, in Figure 1, the token эй 'hey' is obviously intended to be ей 'she.DAT', but because the analyzer outputs at least one non-error reading, it is not treated as an error by RuMOR.17
When an error is clicked, all possible readings are shown in a pane to the side of the text. For each reading, we display the dictionary form, the type of error that would lead to the attested token, and the corrected form (which is shown by clicking or hovering). The readings are sorted by lemma frequency, so the most likely reading is listed first. In Figure 1, the token Ана is selected, and four possible readings are displayed: она 'she.o2a', оно 'it.o2a', Анна 'Anna.NoGem', and Аня 'Anya.Pal'.
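The flagging and ranking logic described in the two preceding paragraphs can be sketched as follows. This is a simplified illustration rather than the RuMOR source code: the Reading class, the function names, the lemma frequencies, and the placeholder tag ERR are all hypothetical. The sketch also assumes that a token with no readings at all is not flagged.

```python
from typing import NamedTuple


class Reading(NamedTuple):
    lemma: str
    error_tags: frozenset  # empty for a normative (non-error) reading


def is_flagged(readings: list[Reading]) -> bool:
    """Flag a token only if it has readings and every reading carries an error tag."""
    return bool(readings) and all(r.error_tags for r in readings)


def rank(readings: list[Reading], lemma_freq: dict[str, int]) -> list[Reading]:
    """Sort readings so that the most frequent lemma comes first."""
    return sorted(readings, key=lambda r: -lemma_freq.get(r.lemma, 0))


# Hypothetical lemma frequencies, invented for illustration only.
LEMMA_FREQ = {"она": 900, "оно": 400, "Анна": 50, "Аня": 10}

# Readings for the token Ана, with the tags shown in Figure 1.
ana = [
    Reading("Аня", frozenset({"Pal"})),
    Reading("она", frozenset({"o2a"})),
    Reading("Анна", frozenset({"NoGem"})),
    Reading("оно", frozenset({"o2a"})),
]

# Readings for the token эй: one normative reading and one error reading.
# "ERR" is a placeholder, not an actual tag from the taxonomy.
ej = [Reading("эй", frozenset()), Reading("она", frozenset({"ERR"}))]
```

Under these assumptions, Ана is flagged because all four of its readings are error readings, and its readings are listed in the order она, оно, Анна, Аня. The token эй is not flagged, since one of its readings is normative.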
When the user clicks on any of the error tags, the error explanation is shown in the next column. These explanations are intended to be as short as possible while still giving enough explanation and examples to be reasonably complete. The explanations are open-source and hosted separately (ENA, April 17, 2022).18
Figure 1. Screenshot of the RuMOR webapp
4. Conclusions and future work
This article has introduced RuMOR, a free, open-source, interactive webapp for identifying, diagnosing, correcting, and explaining a variety of common spelling errors, based on linguistic analysis. The webapp uses a modified version of the UDAR analyzer, which we augmented using mal-rules. The validity of our model was maximized by deriving error tags from real-world errors identified in the RULEC corpus. To our knowledge, this is the first such application for Russian that attempts to provide comparable targeted feedback to any arbitrary running text.
16 The source code for the webapp is available at https://github.com/reynoldsnlp/rus_L2_flask. At the time of writing, the app is accessible at https://icall.byu.edu/rumor.
17 Although this particular example would be difficult to disambiguate, some real-word errors can be resolved by Constraint Grammar rules which would remove some real-word readings on the basis of the surrounding context.
18 https://github.com/reynoldsnlp/rus_grammar_explanations.
This linguistic approach is especially well-suited to error annotation, but also facilitates text normalization. As demonstrated in the webapp, UDAR can automatically generate the corrected wordform.
Another potential application of our error-augmented analyzer is automatic corpus annotation. Until now, corpora of Russian texts produced by non-native speakers have relied almost exclusively on human annotators to analyze and classify errors. Our analyzer can make this process faster and more consistent by giving annotators a preliminary linguistic analysis of orthographic errors to review.
Future work will focus on adding more classes of errors attested in corpora. These errors include conjugation errors, especially related to stem alternations and inflection class selection. Hapaxes in RULEC were excluded from the present study, but we know that there are some error types represented among them that deserve to be included in our error model. For example, users whose first language uses the Latin alphabet frequently misuse alphabetic false friends, i.e., letters that appear the same as Latin letters, but which represent different sounds. In addition to expanding our spelling error model, we also intend to expand UDAR's existing Constraint Grammar to add syntactic error labels.
Finally, although it is tempting to assume that RuMOR is an effective tool, it is crucial to understand how such tools are actually used, and what effect they have on motivation and proficiency outcomes. We hope to perform evaluations and experiments to understand the outcomes of this project.
REFERENCES
Amaral, Luiz & Detmar Meurers. 2011. On using intelligent computer-assisted language learning in real-life foreign language teaching and learning. ReCALL 23(1). 4-24.
Beesley, Kenneth R. & Lauri Karttunen. 2003. Finite State Morphology. Stanford, CA: CSLI Publications.
Biggs, John & Catherine Tang. 2011. Teaching for Quality Learning at University. Maidenhead, UK: Open University Press.
Biggs, John. 1999. What the student does: Teaching for enhanced learning. Higher Education & Development 18(1). 57-75.
Bocharov, Victor, Svetlana Alexeeva, Dmitry Granovsky, E. Protopopova, Anastasia Bodrova, Svetlana Volskaya, I.V. Krylova & A.S. Chuchunkov. 2013. Crowdsourcing morphological annotations. In Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference "Dialog" 1. http://opencorpora.org/doc/articles/2013_Dialog.pdf (accessed 20.04.2022).
Choi, Inn-Chull. 2016. Efficacy of an ICALL tutoring system and process-oriented corrective feedback. Computer Assisted Language Learning 29. 334-364.
Heift, Trude. 2010. Developing an Intelligent Language Tutor. CALICO Journal 27(3). 443-459.
Kopotev, Mixail, Sardana Ivanova, Anisia Katinskaia & Roman Yangarber. 2019. Corpus-based language teaching tool. Trudy Mezdunarodnii Konferencii «KORPUSNAYA LINGVISTIKA-2019». 30-39. (In Russ.)
Korobov, Mikhail. 2015. Morphological analyzer and generator for Russian and Ukrainian languages. In Proceedings of AIST'2015. 320-332. New York: Springer.
Krylov, Sergej & Sergej Starostin. 2003. Upcoming tasks for morphological analysis and generation in the integrated information environment STARLING. In Proceedings of the International Conference "Dialog 2003". https://www.dialog-21.ru/media/2655/krylov.pdf (In Russ.) (accessed 20.04.2022).
Linden, Krister, Erik Axelson, Sam Hardwick & Tommi A. Pirinen. 2011. HFST - framework for compiling and applying morphologies. In Cerstin Mahlow & Michael Piotrowski (eds.), Systems and frameworks for computational morphology, vol. 100 of Communications in Computer and Information Science, 67-85. New York: Springer.
Matthews, Clive. 1992. Going AI: Foundations of ICALL. Computer Assisted Language Learning 5(1). 13-31.
Meurers, Detmar, Kordula De Kuthy, Florian Nuxoll, Björn Rudzewitz & Ramon Ziai. 2019. Scaling up intervention studies to investigate real-life foreign language learning in school. Annual Review of Applied Linguistics 39.
Nagata, Noriko. 2009. Robo-Sensei's NLP-Based Error detection and feedback generation. CALICO Journal 26(3). 562-579.
Rozovskaya, Alla & Dan Roth. 2019. Grammar Error Correction in Morphologically Rich Languages: The Case of Russian. Transactions of the Association for Computational Linguistics 7. 1-17. https://doi.org/10.1162/tacl_a_00251
Rozovskaya, Alla & Dan Roth. 2021. How Good (really) are Grammatical Error Correction Systems? In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 2686-2698.
Segalovich, Ilya. 2003. A fast morphological algorithm with unknown word guessing induced by a dictionary for a web search engine. In International Conference on Machine Learning; Models, Technologies and Applications. 273-280.
Sleeman, Derek. 1982. Inferring (mal) rules from pupil's protocols. In Proceedings of the 5th European Conference on Artificial Intelligence (ECAI). 160-164. Orsay, France.
Vilkki, Liisa. 2005. RUSTWOL: A tool for automatic Russian word form recognition. In Antti Arppe, Lauri Carlson, Krister Linden, Jussi Piitulainen, Mickael Suominen, Martti Vainio, Hanna Westerlund & Anssi Yli-Jyrä (eds.), Inquiries into words, constraints and contexts: Festschrift for Kimmo Koskenniemi on his 60th Birthday, 151-162. Stanford, CA: CSLI Publications.
Vilkki, Liisa. 1997. RUSTWOL: A System for Automatic Recognition of Russian Words. Technical report, Lingsoft, Inc.
Dictionaries
Zaliznjak, Andrej A. 1977. Grammatical dictionary of the Russian language: Inflection: Approx 100 000 words. Russkij Jazyk. (In Russ.)
Article history:
Received: 20 October 2021 Accepted: 21 January 2022
Bionotes:
Robert REYNOLDS is employed as Assistant Research Professor in the Office of Digital Humanities at Brigham Young University. He holds a PhD in Russian Language Technology from UiT The Arctic University of Norway. His research interests include Intelligent Computer-Assisted Language Learning (ICALL), Natural Language Processing for low-resource languages, automatic analysis of text complexity/readability, automatic reading proficiency assessment using eye-tracking, structure of Russian, and morphological complexity.
Contact Information:
Brigham Young University
Provo, UT 84602
e-mail: [email protected]
ORCID: 0000-0003-0306-087X
Laura A. JANDA is Professor of Russian in the Department of Language and Culture at UiT The Arctic University of Norway. She holds a PhD in Slavic Linguistics from UCLA (1984). She pursues research in the framework of cognitive linguistics applied mostly to the analysis of grammatical categories and constructions in Russian using corpus data. She also works on the development of research-based electronic resources for learners of Russian.
Contact Information:
UiT The Arctic University of Norway
UiT Norges arktiske universitet
Postboks 6050 Langnes 9037 Tromsø
e-mail: [email protected]
ORCID: 0000-0001-5047-1909
Tore NESSET is Professor of Russian linguistics in the Department of Language and Culture at UiT The Arctic University of Norway. He received his doctoral degree from the University of Oslo in 1997. His research interests include corpus and cognitive linguistics which he applies to the study of Russian and Norwegian. He also works on historical linguistics and is the author of the widely used textbook How Russian Came to Be the Way It Is (2015).
Contact Information:
UiT The Arctic University of Norway
UiT Norges arktiske universitet
Postboks 6050 Langnes 9037 Tromsø
e-mail: [email protected]
ORCID: 0000-0003-1308-3506