CC BY

Visualizing Embeddings to Study Gender-Related Differences in Word Meaning

Tatiana A. Litvinova, Polina V. Panicheva, Elena S. Kotlyarova, Victoria V. Zavarzina

Abstract— Developing models of distributional semantics is one of the most important research directions in modern NLP, and the field is developing rapidly. New transformer-based models achieve good results in many practical tasks, although the problem of their interpretability remains largely unsolved despite research efforts in this direction. Despite the obvious progress in the field, very little attention has been given to estimating and assessing the differences in word meaning (in the sense of distributional semantics) related to the characteristics of text authors (gender, age, psychological traits, etc.). This problem has not only theoretical but also practical value. Currently, no attention is paid to the characteristics of the authors whose texts are used to build the pretrained models widely used in NLP, and knowing individual differences in word meaning is crucial to understanding the biases in these models. We use existing methods of word embedding visualization to show the differences in the structure of word meaning related to the gender of authors and propose clustering methods to study this structure. We conclude that developing methods for visualizing and interpreting individual differences in word meaning is crucial both for the efficient solution of various NLP tasks and for the theory of word meaning.

Keywords— Clustering, Distributional Semantics, Visualization, Word embedding.

I. Introduction

Models of distributional semantics (word2vec [16], Glove [6], the latest transformer-based models) are widely used to solve many NLP tasks, but much of the research effort has been dedicated to the engineering aspects of the models and their performance. The very essence of word meaning from the point of view of distributional semantics remains unclear, and the overall impact of distributional semantics on theoretical linguistics "has so far been limited" [4].

It is well known that distributional semantics assumes that the meaning of words is stored in word co-occurrence patterns across text corpora. However, distributional semantics models ignore social aspects of language processing [12].

Manuscript received October 2, 2022. This work was supported by the Russian Science Foundation, project number 21-78-10148 "Modeling the meaning of a word in individual linguistic consciousness based on distributive semantics".

T.A. Litvinova is with Voronezh State Pedagogical University, Voronezh, 394043 Russia (corresponding author, phone: +7-980-342-0073; e-mail: [email protected]).

P.V. Panicheva is with Voronezh State Pedagogical University, Voronezh, 394043 Russia (e-mail: [email protected]).

E.S. Kotlyarova is with Voronezh State Pedagogical University, Voronezh, 394043 Russia (e-mail: [email protected]).

V.V. Zavarzina is with Voronezh State Pedagogical University, Voronezh, 394043 Russia (e-mail: [email protected]).

One of the most important social aspects of language processing and production is gender. A recent trend in computational semantics is exploring and eliminating bias in word embeddings, with gender bias being the most studied type [13]. The classical example of such bias is "man is to computer programmer as woman is to x", with x = homemaker.

Little is known, however, about differences in the meaning of words (in terms of distributional semantics) related to the gender of authors. It is well known that the characteristics of authors, including gender, are reflected in their texts at different linguistic levels, as shown by classification experiments that use word embeddings as features [1]. However, the problem of individual semantics, that is, differences in word meaning related to author characteristics, remains unsolved [8].

To facilitate the interpretation of word embeddings, several visualization methods have been developed. In this paper, we employ various methods of word embedding visualization 1) to study the differences in word meaning related to the gender of the authors of the texts on which the distributional models were built; 2) to highlight the advantages and disadvantages of these methods; 3) to show the merits of clustering methods for studying the hierarchical structure of word meaning; and 4) to show their usefulness for studying word association structure.

II. Methodology

A. Corpus

As the aim of our analysis is to understand differences in word meaning related to the gender of the speaker, it is crucial to train our models on a corpus of texts that meets the following requirements:

1) texts are of the same genre;

2) the genre of texts allows for free expression of identity;

3) information about the authors (at least gender) is available.

Obviously, when training embeddings, the more data there is, the better, but for our specific task the quality of data is of utmost importance.

For the current study we used a corpus of blogs in Russian. The algorithm of corpus development is described in [15]. After the corpus was collected, it was manually cleaned. First, we removed profiles for which the information about the author's gender did not match the gender explicitly shown in the texts through grammatical gender. Second, we checked the texts for plagiarism and removed the profiles of authors whose texts were copy-pasted news, etc. Overall, there are 966 female and 1399 male authors in our corpus. Besides gender, the age of the authors is also known. The mean age of female authors is 32.7 years (sd = 9.9) and of male authors 39.5 years (sd = 11.3).

The corpus of blogs is freely available through the RusIdiolect database, created to facilitate authorship attribution and profiling studies [14].

B. Text Preprocessing

It is well known that text preprocessing is a crucial step before any experiments with word embeddings [8]. We preprocessed our corpus as follows. First, we removed all tokens other than words (symbols, punctuation marks, etc.). Second, we lowercased the texts. Third, we removed stop words (a custom stop-word dictionary was constructed). Then we lemmatized the texts using TreeTagger [5], a probabilistic part-of-speech tagger based on decision trees. TreeTagger has been shown to be one of the most efficient tools (along with mystem) for lemmatization and POS tagging of Russian texts from different collections [2]. A manual check of lemmatized texts randomly selected from our corpus also confirmed its usefulness for lemmatizing blog texts. At the next preprocessing step, we removed lemmas used in fewer than 20 texts.
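The filtering steps described above can be illustrated with a minimal sketch. It is not the pipeline used in the study: the regular expression, the function names and the toy stop-word list are assumptions made for illustration, and the lemmatization step (performed with TreeTagger in the study) is omitted.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters (digits, punctuation and symbols are dropped).
WORD_RE = re.compile(r"[^\W\d_]+", re.UNICODE)

def tokenize(text, stop_words):
    """Lowercase the text, keep only alphabetic tokens, and drop stop words."""
    tokens = WORD_RE.findall(text.lower())
    return [t for t in tokens if t not in stop_words]

def filter_rare(docs, min_docs):
    """Drop tokens that occur in fewer than `min_docs` documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document
    return [[t for t in doc if doc_freq[t] >= min_docs] for doc in docs]

# Toy data; the study uses a document-frequency threshold of 20 on lemmatized texts.
stop_words = {"и", "в"}
raw_texts = ["Дом и улица, 2022!", "В доме тепло.", "Дом стоит."]
docs = [tokenize(text, stop_words) for text in raw_texts]
filtered = filter_rare(docs, min_docs=2)
print(filtered)
```

With real data, lemmatization would run between the two functions, so that the document-frequency threshold is applied to lemmas rather than to word forms.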

Then we divided our corpus into a Male corpus and a Female corpus. The volume of the Male corpus after cleaning is 1,940,643 tokens, with a mean text length of 1387 tokens (sd = 871 tokens). The volume of the Female corpus is 1,134,465 tokens, with a mean text length of 1174 tokens (sd = 842 tokens).

As this study examines the possibility of using word embedding visualization methods to estimate differences in distributional word meaning related to the gender of text authors, we used one lemma for our case study: дом 'house'. We chose this word because it has a high absolute frequency in both the Male and Female corpora (2960 and 2381 respectively) and is among the 30 most frequent words in both corpora.

C. Word embedding models training

We trained two types of models, word2vec and Glove, for comparison purposes. Each model was trained on the Male and Female corpora separately as well as on the whole corpus. Word2vec models were trained using the R package word2vec with the following settings: type = "skip-gram", window = 5L, dim = 100, iter = 20. Glove models were trained using the R package text2vec with the following parameters: doc_count_min = 20, skip_grams_window = 5L, number of dimensions = 100.
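Both models rely on a symmetric context window of 5 tokens. As a hedged illustration of what such a window means, the sketch below enumerates the (center, context) pairs that a skip-gram model is trained on. The function name is hypothetical, and actual implementations may add further steps (for example, subsampling of frequent words) that are not shown here.

```python
def skipgram_pairs(tokens, window=5):
    """Return all (center, context) pairs whose positions differ by at most `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a token is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Toy lemmatized sentence; window=1 keeps the output short.
tokens = ["старый", "дом", "стоять", "улица"]
for pair in skipgram_pairs(tokens, window=1):
    print(pair)
```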

It should be noted that despite the fact that Glove is very rarely used for training distributional models on Russian texts (unlike word2vec), this model deserves more profound research on Russian texts.

D. Methods of word embedding visualization

There are several methods of word embedding visualization. All of them, however, have both advantages and disadvantages [3]. To visualize high-dimensional embeddings, dimensionality reduction algorithms (PCA, t-SNE and, more recently, UMAP) are commonly used. The most widespread method is PCA, which applies a linear transformation to the data; however, this is often not suitable for linguistic data.

t-SNE is suitable for non-linear data, but its performance is not satisfactory on large datasets [11; 17].

UMAP is a new technique which is 'competitive with t-SNE for visualization quality, and arguably preserves more of the global structure with superior run time performance' [7]. This method uses the concept of k-nearest neighbor and optimizes the results using stochastic gradient descent.

In our work we performed UMAP using R package uwot.

The two most important parameters of the UMAP algorithm are the number of neighbors connected to a single point and min_dist. We ran several iterations to find the UMAP parameter settings that gave the best compromise between local and global structure visualization.

Our experiments have shown that UMAP is much faster than t-SNE.

Another way of visualizing word embeddings, which is rarely used but proved useful in our preliminary experiments, is hierarchical clustering. It allows the structure of word meaning to be interpreted from a different point of view than UMAP: it reveals the hierarchical structure of word meaning.
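To make the idea of a hierarchy over word vectors concrete, the following minimal sketch performs agglomerative clustering with average linkage on cosine distances. It is an illustration only: the toy two-dimensional vectors are invented, and the linkage rule and distance are assumptions rather than a description of the exact settings used in the study.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def average_linkage(vectors):
    """Agglomerative clustering with average linkage.

    Returns the merge history as (members_a, members_b, distance) tuples,
    in the order in which the clusters were merged.
    """
    clusters = [(word,) for word in vectors]
    merges = []

    def linkage(a, b):
        # Average of all pairwise distances between members of a and b.
        total = sum(cosine_distance(vectors[x], vectors[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Invented toy vectors: two "inside" words and two "outside" words.
toy_vectors = {
    "комната": [1.0, 0.1],
    "кухня": [0.9, 0.2],
    "улица": [0.1, 1.0],
    "двор": [0.3, 0.9],
}
for a, b, d in average_linkage(toy_vectors):
    print(a, b, round(d, 3))
```

The merge history is what a dendrogram draws: early merges join close words at the bottom of the tree, and later merges join whole groups higher up.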

For clustering, we used the R package cluster. To compare the results of clustering with different settings, we used the R package NbClust, which yields around 30 indices of cluster quality after searching over different combinations of the number of groups, distance metrics and clustering methods. We used a voting scheme to choose the best parameters.
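A voting scheme of this kind can be sketched as a simple majority count over the number of clusters proposed by each quality index. The index names and the proposed values below are illustrative only and are not results from the study; breaking ties toward the smaller number of clusters is likewise an assumption.

```python
from collections import Counter

def vote_best_k(index_choices):
    """Return the number of clusters proposed by the most indices, plus the tally.

    `index_choices` maps an index name to the number of clusters it prefers.
    Ties are broken toward the smaller number of clusters (an assumption).
    """
    tally = Counter(index_choices.values())
    top_votes = max(tally.values())
    best_k = min(k for k, votes in tally.items() if votes == top_votes)
    return best_k, dict(tally)

# Illustrative values only.
choices = {"silhouette": 3, "ch": 4, "db": 3, "dunn": 2, "gap": 3}
best_k, tally = vote_best_k(choices)
print(best_k, tally)
```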

Another method of word embedding visualization, used in a few NLP studies, is the heatmap. Despite their popularity in fields such as bioinformatics, heatmaps remain an underutilized visualization tool in data analysis.

We used heatmaps to visualize the cosine similarity matrix calculated on the word vectors of associates of the cue word "house" found in word association databases with available demographic information. For this task we used a 300-dimensional word2vec model pretrained on Russian Wikipedia (https://wikipedia2vec.github.io/wikipedia2vec/pretrained/).

We used the R package superheat [10] for heatmap construction. Superheat was used to run k-means (the default clustering algorithm) on the data matrix and to group together the rows and columns that are in the same cluster.
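The matrix behind such a heatmap can be sketched as follows: pairwise cosine similarities are computed for the associate vectors, and rows and columns are reordered so that words with the same cluster label are adjacent. The cluster labels are assumed to be given (for example, by a k-means run), and the toy vectors and function names are invented for illustration; this is not the superheat implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def grouped_similarity_matrix(vectors, labels):
    """Cosine similarity matrix with rows and columns ordered by cluster label."""
    order = sorted(vectors, key=lambda word: (labels[word], word))
    matrix = [
        [cosine_similarity(vectors[a], vectors[b]) for b in order]
        for a in order
    ]
    return order, matrix

# Invented toy vectors and cluster labels (labels would normally come from k-means).
toy_vectors = {"крыша": [1.0, 0.0], "уют": [0.0, 1.0], "дверь": [1.0, 1.0]}
toy_labels = {"крыша": 0, "уют": 1, "дверь": 0}

order, matrix = grouped_similarity_matrix(toy_vectors, toy_labels)
print(order)
for row in matrix:
    print([round(value, 2) for value in row])
```

Each cell of the resulting matrix corresponds to one colored cell of the heatmap, so blocks of high similarity along the diagonal indicate groups of associates that occur in similar contexts.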

III. Results and discussion

A. UMAP vizualization

First, we applied UMAP visualization to the word дом 'a house' in the Glove model trained on the blog corpus without dividing texts by the gender of authors (Fig. 1), then separately to the models trained on the Male (Fig. 2) and Female (Fig. 3) corpora.

We tested different sets of UMAP parameters; the best tradeoff between preserving the local and global structure of distributional meaning was obtained with the following settings: n_components = 2, metric = "cosine", n_neighbors = 25, min_dist = 0.1.

Visual comparison of the UMAP visualizations of the lemma дом 'house' in the three Glove models shows that the structure of the distributional meaning of the examined word differs in each model.

Visualization of the nearest neighbors of the word дом in the Glove model trained on the concatenated corpus (Fig. 1) shows that the clusters of nearest neighbors are less dense than in the Male and Female models and less structured thematically.

Fig. 1. UMAP visualization of the lemma дом in the Glove model trained on the whole corpus (without division by text author gender)

The first cluster of the words closest to the lemma дом consists of the words квартира 'a flat', улица 'a street', соседний 'nearest', напротив 'opposite', which form the core meaning of the word and denote the location of the object (home) with respect to other entities.

The second cluster consists of words united by the idea of movement (ходить 'walk', выходить (из дома) 'leave home', прямо 'straight').

The third cluster contains words related to the parts of дом and to typical physical actions that can be performed in it (комната 'a room', подъезд 'a block', стоять 'stand', сидеть 'to sit', also 'to stay at home').

The densest cluster consists of words related to the idea of construction (здание 'a building', строить, построить 'to build') and to the large entities where дом is located (город 'a city', деревня 'a village', находиться 'to be located').

Another cluster consists of words related to the idea of cost (место 'a place', стоить 'to cost') and general verbs of human action connected with perception (видеть 'to see') and life (жить 'to live').

There is only one word related to the characteristics of дом (старый 'old'), and it forms a separate cluster.

As Fig. 2 and 3 strongly suggest, the structure of the distributional meaning of the word дом differs between the Male and Female Glove models.

The structure of the meaning of дом in the Male model is closer to its structure in the model trained on the concatenated corpus (the General model) than the structure in the Female model is. As in the General model, the core meaning includes the words квартира 'flat', улица 'a street', соседний 'nearest' as well as words related to the idea of construction (построить 'to build', здание 'a building') and location (деревня 'a village', район 'a district', город 'a city', находиться 'to be located'). The same cluster related to physical action (выходить 'to leave home') is observed. The characteristics of the house include the designation of age (старый 'old'), as in the General model. A separate cluster consists of the adjective детский, which is a part of the collocation детский дом 'a children's house'. Like in the General model, there is a cluster of words describing general aspects related to дом (жить 'to live', место 'a place', остаться 'to stay', вид 'a view').

Fig. 2. UMAP visualization of the lemma дом in Glove model trained on the Male Corpus

Compared to the General model, in the Male model the cluster of words describing the parts of a house is larger and contains only nouns (комната 'a room', окно 'a window', крыша 'a roof', стена 'a wall', этаж 'a floor').

At the core of distributional meaning of the word дом in the Female model (Fig. 3) there are words related to its essential parts (комната 'a room'), surroundings (улица 'a street', двор 'a yard') and location (город 'a city', центр 'a center').

As in the other two models, the word квартира 'a flat' constitutes the core of the distributional meaning of the word дом 'a house'.

Fig. 3. UMAP visualization of the lemma дом in Glove model trained on the Female Corpus

The verb denoting physical action (ходить 'to walk') is among the nearest neighbors of дом in the General and Male model. Only in the Female model the word сидеть 'to sit' is included in the nearest neighbors of дом (the Russian collocation сидеть дома has the meaning 'not to work, be a housewife').

Unlike the General and Male models, the Female model has a large cluster of verbs related to non-physical actions (думать 'to think', хотеть 'to want', видеть 'to see', стать 'to become', знать 'to know'). The Female model also has a unique set of adjectives describing дом 'a house': новый 'new', большой 'big', свой 'own'.

It is interesting to note that the closest words (in terms of cosine similarity) to дом in the pretrained multilingual Glove model (after the case forms of дом, since this model is non-lemmatized) are семья 'a family', комната 'a room', стена 'a wall', квартира 'a flat', территория 'an area', помещение 'premises', коробка 'a box', картина 'a picture', кухня 'a kitchen'.

The closest words for дом in the Male model are квартира 'a flat', улица 'street', соседний 'nearest', старый 'old', построить 'to build', здание 'a building', in the Female model - квартира 'a flat', жить 'to live', выходить 'to leave', улица 'street', комната 'a room'.

Visualization of the word2vec models (not shown for lack of space) reveals a quite different distributional meaning of the word дом. In the General model we find clusters related to outer entities where дом is located (улица 'street', переулок 'lane', квартал 'quarter') and to entities directly related to дом as its parts (подъезд 'a block', квартира 'a flat', этаж 'a floor', крыша 'a roof', помещение 'premises').

Unlike Glove, word2vec offers a set of adjectives describing objective characteristics of дом related to the number of floors (двухэтажный 'two-storey', двухэтажка 'a two-floor building', пятиэтажный 'five-floor'), the number of flats (многоквартирный 'multi-flat') and the number of rooms (трехкомнатный 'three-room'). There is a cluster of synonyms that treat дом as a physical object, the result of construction: двухэтажка 'a two-floor building', здание 'a building', постройка 'a construction object'. There is also a full synonym for дом with a pejorative suffix (домик 'a small bad-looking house').

There is a cluster of words nominating дом in the Female model (квартира 'flat', жилище 'dwelling', однушка 'a one-room flat', комната 'a room', комнатка 'a small room'). In the Male model there are fewer words which form this cluster (квартира 'a flat', помещение 'premises', комната 'a room').

Female authors more often use "outer" words denoting the relation of дом to other objects (улица 'a street', соседний 'the nearest', двор 'yard'), whereas male authors concentrate more on the characteristics of the house (многоквартирный 'multi-flat', силикатный 'silicate', пятиэтажный 'five-floor').

The closest word (in terms of cosine similarity) to дом in the Male model is квартира 'flat', and in the Female model it is домик 'a small house' (in a word2vec model trained on 600 million words, the closest words to дом are особняк 'a mansion', квартира 'a flat', домик 'a small house', усадьба 'a homestead', здание 'a building').

B. Hierarchical clustering of word embeddings

We also applied another method for visualizing the distributional structure of the word дом, namely hierarchical clustering.

First, we searched for the top 10 nearest neighbors of the word дом for each model and combined them into the "house vector". Then we selected 50 closest neighbors to the house vector in each model and performed clustering on word embeddings to find groups in the data. In other words, we created a hierarchy of the "house words" based on their distances from each other (Fig. 4-7).
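The selection of the "house words" can be sketched as below. The text does not state how the ten neighbors are combined into the "house vector", so averaging their vectors is an assumption here, as are the function names, the exclusion of the cue word from the seed set, and the invented toy vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest(target, vectors, k, exclude=()):
    """Return the k words whose vectors are most similar to `target`."""
    scored = [(cosine(target, vec), word)
              for word, vec in vectors.items() if word not in exclude]
    scored.sort(reverse=True)
    return [word for _, word in scored[:k]]

def centroid(words, vectors):
    """Component-wise mean of the vectors of `words`."""
    dim = len(vectors[words[0]])
    return [sum(vectors[w][i] for w in words) / len(words) for i in range(dim)]

def house_words(cue, vectors, n_seed=10, n_final=50):
    """Seed with the cue's nearest neighbors, then take neighbors of their mean."""
    seed = nearest(vectors[cue], vectors, n_seed, exclude={cue})
    house_vector = centroid(seed, vectors)  # assumption: combine seeds by averaging
    return nearest(house_vector, vectors, n_final)

# Invented toy vectors; small n_seed and n_final keep the example readable.
toy_vectors = {
    "дом": [1.0, 0.0],
    "квартира": [0.9, 0.1],
    "комната": [0.8, 0.2],
    "море": [0.0, 1.0],
}
print(house_words("дом", toy_vectors, n_seed=2, n_final=3))
```

The resulting word list is the input to the hierarchical clustering step, which is run on the embeddings of these words.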

[Dendrogram; axis label: housefem_distances, hclust (*, "average")]

Fig. 4. Hierarchy of the "house words" based on word2vec model trained on blogs by female authors

[Dendrogram; axis label: housemale_distances, hclust (*, "complete")]

Fig. 5. Hierarchy of the "house words" based on word2vec model trained on blogs by male authors

As Fig. 4-7 suggest, word2vec and Glove provide different views on the hierarchical structure of the meaning of the "house words".

In the word2vec model trained on female blogs, three large clusters are observed, which could be described as (from left to right) "different entities related to the building construction" (1); "house from inside" (2); "house from outside" (3).

In the word2vec model trained on male blogs, the following clusters are revealed (from left to right): "house from outside" (1); "house from inside" (2); "house as an engineering object" (3); "house as a place of living" (4).


Fig. 6. Hierarchy of the "house words" based on Glove model trained on blogs by female authors

Fig. 7. Hierarchy of the "house words" based on Glove model trained on blogs by male authors

Clear differences can be seen in the level of detail for the same topics. Female authors tend to pay more attention to details of the interior (окно 'a window', проем 'an opening', балконный 'balcony', дверь 'a door', линолеум 'linoleum', санузел 'a toilet unit', умывальник 'a wash basin'), while male authors tend to use more generalized characteristics related to material (деревянный 'wooden', силикатный 'silicate', сруб 'log'). Male authors typically have a more engineering view of a house as an object of construction. In the Male model, the separate cluster "house as an engineering object" comprises, among others, the words гараж 'a garage', подвал 'a basement', подъезд 'a block', перекрытие 'an overlap', этаж 'a floor', этажность 'a number of floors'. In the Female model, the words подъезд 'a block', улица 'a street', двор 'a yard', прибирать 'to clean' are constituents of the cluster "house from outside", which also contains the words гараж 'a garage', сторожить 'to guard', возле 'near'. This might be due to different ways of categorizing space by male and female speakers (the importance of classifying objects and defining their structure for men, and the importance of function for women).

Some similar patterns could be observed in the Glove clusters. As with word2vec, the Female Glove model relates дом and кухня 'kitchen', дом and двор 'a yard', соседний 'the nearest' in the first cluster (on the left), which can be named 'a house and its parts'.

It is interesting that the same subclusters enter different clusters in the Female and Male Glove models. For example, the subcluster сидеть дома 'to sit at home' falls into the same cluster as the words denoting physical actions (гулять 'to walk', ходить 'to go', пойти 'to head', идти 'to be going') and location (центр 'a center', город 'a city') in the Female model, whereas in the Male model it is clustered with the word офис 'office'.

In the Male Glove model there is a large cluster of words related to description of the location of a house with respect to other locations (the first cluster from the right) (мост 'a bridge', площадь 'a square', улица 'a street', поселок 'a settlement', деревня 'a village', район 'a district', центр 'a center', город 'a city').

"Visual" clusters are observed both in the Male and Female models, although in the Female one it contains more elements (видеть 'to see', вид 'a view', красивый 'beautiful', etc.).

C. Word associates visualization using heatmaps

Heatmaps (Fig. 8-9) built on word vectors of associates of the cue word "house" presented in Russian Associative Thesaurus (RAT) revealed clear differences in associative word meaning of the cue word related to the gender of the respondents.

The data for the Russian Associative Thesaurus were obtained through a three-stage questionnaire survey of native speakers of Russian during a mass associative experiment. The thesaurus contains over 1 million associations, over 6 thousand unique stimuli and 100 thousand responses from 11+ thousand respondents. According to its authors, RAT models the verbal memory and linguistic consciousness of an "average" Russian speaker. The database was accessed via the web interface http://tesaurus.ru/dict/, which allows male and female associates to be selected.

In the "male" heatmap, a cluster of words related to different "inner" and "outer" elements of the house that are used in similar contexts (as indicated by high cosine similarity) is revealed in the bottom left corner; most of them are parts of the house: крыша 'roof', забор 'fence', труба 'pipe', дверь 'door'. Another cluster contains adjectives of visual description (красивый 'beautiful', белый 'white', желтый 'yellow'). A further cluster (in the centre of the picture) consists of words denoting different types of residential construction (изба 'hut', дача 'dacha', деревня 'village', квартира 'flat'), in which the topic of the village is clearly observed.

A different structure is revealed in the "female" heatmap. There is a large cluster of words related to the house and the space around it (камень 'rock', дерево 'tree', балкон 'balcony', мезонин 'mezzanine', улица 'street', дорога 'road'). A cluster related to the topic of cosiness is observed (уют 'comfort', гостеприимный 'hospitable', жить 'live', бабушка 'grandma'), as well as the topic of fashion (мода 'fashion', эстет 'esthete'), which is due to the collocation модный дом 'fashion house'.

Fig. 8. Heatmap of word vectors of associates of the cue word "house" given by male respondents

Fig. 9. Heatmap of word vectors of associates of the cue word "house" given by female respondents

As another source of word association norms, we used a database of word associates given by military respondents (students of military universities and officers, all of them men) [9], accessed via the web interface http://adictru.nsu.ru/dict#.

As is clearly seen from Fig. 10, the structure of the associative meaning of the cue word "house" differs between military and non-military respondents.

A dense cluster of words related to the family is observed (bottom left cluster): семья 'family', родня 'relatives', родной 'relative', жена 'wife', бабушка 'grandmother'. Close to this cluster is a cluster of words related to positive emotions and feelings (любовь 'love', радость 'joy', хорошо 'good', милый 'cute'). Another cluster is formed by words denoting residential constructions (дом 'house', хата 'house').

Fig. 10. Heatmap of word vectors of associates given by military respondents

This brief analysis allows us to draw some preliminary conclusions.

First, it is crucial to apply different tools for word embedding visualization when examining the structure of the distributional meaning of words derived from models trained on texts by people of different backgrounds. To date, this problem remains understudied. It is very important to apply not just one particular tool (at present PCA is the most widespread one) but a set of tools to obtain more objective results.

Second, it is critical to use different models of distributional semantics. As our analysis revealed, word2vec and Glove provide different views on the semantics of individual words. Glove provides a broader view, allowing for exploration of concepts and their relations, whereas word2vec provides a more local one.

Third, visualization of the distributional meaning of frequent words is a useful methodology that can generate new insights into the patterns of world categorization by people of different backgrounds.

IV. Conclusion

Distributional semantics is one of the main directions of NLP, but despite technical advances it still suffers from the low interpretability of its models. Methods for word embedding visualization have been proposed to facilitate interpretation, but to the best of our knowledge they have never been applied to examine the differences in word embeddings between groups of authors, e.g. female and male authors. This direction of research in distributional semantics has not received much attention to date despite its obvious theoretical and practical importance. Widely used pretrained models are developed without any consideration of the characteristics of the authors of the texts they are trained on, which might lead to underrepresentation of certain groups.

Like any case study, our research has obvious limitations, but it also highlights directions for future research in individual distributional semantics, such as the study of other frequent words and groups of words and the expansion of the list of visualization methods.

Acknowledgment

The study is supported by Russian Science Foundation, project number 21-78-10148 "Modeling the meaning of a word in individual linguistic consciousness based on distributive semantics".

References

[1] C. B. Ritesh, "Word Representations For Gender Classification Using Deep Learning", Procedia Computer Science, vol. 132, pp. 614-622, 2018.

[2] E. Kotelnikov, E. Razova, and I. Fishcheva, "A Close Look at Russian Morphological Parsers: Which One Is the Best?", Communications in Computer and Information Science, vol. 789, pp. 131-142, 2017.

[3] F. Heimerl and M. Gleicher, "Interactive analysis of word vector embeddings", Computer Graphics Forum, vol. 37, no. 3, pp. 253-265, 2018.

[4] G. Boleda, "Distributional Semantics and Linguistic Theory", Annu. Rev. Linguist., vol. 6, pp. 213-234, 2020.

[5] H. Schmid, "Probabilistic Part-of-Speech Tagging Using Decision Trees", in Proceedings of International Conference on New Methods in Language Processing, Manchester, UK, 1994.

[6] J. Pennington, R. Socher, and C. Manning, "Glove: Global Vectors for Word Representation", in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014, pp. 1532-1543.

[7] L. McInnes and J. Healy, "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction", ArXiv, abs/1802.03426, 2018.

[8] P. Rodriguez and L.A. Spirling, "Word Embeddings: What Works, What Doesn't, and How to Tell the Difference for Applied Research", Journal of Politics, vol. 84, pp. 101-115, 2022.

[9] PVAS (2015-2018), Subcorpus of associates of military respondents (R.A. Kaftanov, A.A. Romanenko) [Online]. Available: http://adictru.nsu.ru.

[10] R. L. Barter and B. Yu, "Superheat: An R package for creating beautiful and extendable heatmaps for visualizing complex data", Journal of Computational and Graphical Statistics, vol. 27, no. 4, pp. 910-922, 2018.

[11] R. Heuser, "Word Vectors in the Eighteenth Century, Episode 2: Methods", Adventures of the Virtual, 2016 [Online]. Available: http://ryan-heuser.org/word-vectors-2.

[12] T.J. Brendan, "Distributional social semantics: Inferring word meanings from communication patterns", Cognitive Psychology, vol. 131, 101441, 2021.

[13] T. Bolukbasi, K.-W. Chang, J. Zou, V. Saligrama, and A. Kalai, "Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings", in NIPS16: Proceedings of the 30th International Conference on Neural Information Processing Systems, 2016, pp. 4356-4364.

[14] T. Litvinova, "RusIdiolect: A New Resource for Authorship Studies", in Lecture Notes in Networks and Systems, vol. 186, 2021, pp. 14-23.

[15] T. Litvinova, A. Sboev, and P. Panicheva, "Profiling the Age of Russian Bloggers", in Artificial Intelligence and Natural Language. AINL 2018. Communications in Computer and Information Science, vol. 930, 2018, pp. 167-177.

[16] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality", in NIPS13: Proceedings of the 26th International Conference on Neural Information Processing Systems, vol. 2, 2013, pp. 3111-3119.

[17] X. Liu, Z. Zhang, R. Leontie, A. Stylianou, and R. Pless, "2-MAP: Aligned Visualizations for Comparison of High-Dimensional Point Sets", in 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), 2020, pp. 2539-2547.