UDC 004.9
Vestnik SibGAU Vol. 17, No. 1, P. 84-90
HUMAN-HUMAN TASK-ORIENTED CONVERSATIONS CORPUS FOR INTERACTION QUALITY MODELING
A. V. Spirina1, 2*, M. Yu. Sidorov2, R. B. Sergienko2, E. S. Semenkin1, W. Minker2
1 Reshetnev Siberian State Aerospace University 31, Krasnoyarsky Rabochy Av., Krasnoyarsk, 660037, Russian Federation
2Ulm University 43, Albert-Einstein-Allee, Ulm, 89081, Germany E-mail: [email protected]
Speech is the main modality of human communication. It can tell a lot about its owner: their emotions, intelligence, age, psychological portrait and other properties. Such information can be useful in different fields: in call centres for improving the quality of service, in designing Spoken Dialogue Systems for better adaptation of a system to users' behaviour, and in the automation of some processes for analysing people's psychological state in situations with a high level of responsibility, for example, in a space programme. One such characteristic is the Interaction Quality. The Interaction Quality is a quality metric used in the field of Spoken Dialogue Systems to evaluate the quality of human-computer interaction. As well as in Spoken Dialogue Systems, the Interaction Quality can be applied to estimating the quality of human-human conversations. As with any investigation in the field of speech analytics, modelling the Interaction Quality for human-human conversations requires a specific corpus of task-oriented dialogues. Although there is a large number of speech corpora, for some tasks, such as Interaction Quality modelling, it is still difficult to find an appropriate specific corpus. That is why we decided to create our own corpus based on dialogues between the customers and agents of one company. In this paper we describe the current state of this corpus. It contains 53 dialogues, corresponding to 1165 exchanges, and includes audio features, paralinguistic information and experts' labels. We plan to extend this corpus both in the feature set and in the number of observations.
Keywords: interaction quality, human-human conversation, speech analysis, speech corpus.
1. Introduction
Speech analytics is applied to extract different kinds of information from speech data. Human speech can tell a lot about a person: their emotions, intelligence, age, psychological portrait and other properties. At the dialogue level, it can reveal, for example, the cooperativeness of the speakers, the involvement of each speaker in the dialogue and the topic of the discussion.
Speech analytics is useful for call centres in such tasks as estimating customer satisfaction and detecting problems in an agent's work. Moreover, such characteristics as customer satisfaction, emotions, Interaction Quality (IQ) and others are important for designing Spoken Dialogue Systems (SDS), allowing such systems to better adapt to user behaviour throughout the dialogue. Besides, these characteristics can be used for automatically assessing relationships between people from their speech. This is especially important for space programmes, where crew members spend a lot of time in a small space inside the space station.
The IQ is a quality metric used in the field of SDS to evaluate the quality of human-computer (HC) interaction. The IQ metric was proposed by Schmitt et al. in [1]. This metric can be useful not only for measuring the quality of the interaction between humans and computers, but for human-human (HH) dialogues as well. A model of the IQ for HH task-oriented conversations can then help make SDS more flexible, more human-like and friendlier.
As with any investigation in the field of speech analytics, modelling the IQ for HH conversations requires a specific corpus of task-oriented dialogues. Such a corpus can be developed from calls to call centres offering support, information or help services. However, it is difficult to get access to such call databases, as the calls contain speakers' private information.
That is why we developed our own speech corpus based on a call database. The corpus consists of calls between company workers and customers. In this paper we present a first overview of this corpus.
This paper is organized as follows. A brief description of related work (existing HH task-oriented conversation corpora) is presented in Section 2. Section 3 gives information about the developed HH task-oriented conversation corpus for IQ modelling for HH dialogues. Section 4 describes the manually annotated variables in the corpus and compares the rules for annotating the IQ for HC and HH task-oriented spoken dialogues. In Section 5 we describe future work for extending this corpus both in terms of variables and observations. Finally, we present our conclusions in Section 6.
2. Related work: existing corpora
Such organisations as ELRA (European Language Resources Association) [2] and LDC (Linguistic Data Consortium) [3] offer huge corpus databases for different purposes in the field of speech analytics, such as:
- emotion recognition;
- speech recognition;
- language identification;
- speaker identification;
- speaker segmentation;
- speaker verification;
- topic detection and others.
Although there is a huge number of corpora, some researchers are forced to develop specific corpora for their research.
DECODA is a call centre human-human spoken conversation corpus consisting of dialogues from the call centre of the Paris public transport authority. It consists of 1514 dialogues, corresponding to about 74 hours of speech [4].
The corpus described in [5; 6] consists of 213 manually transcribed conversations of a help desk call centre in the banking domain. Unfortunately it includes text data without audio files.
Another example of a task-oriented corpus is described in [7]. The EDF (French power supply company) CallSurf corpus consists of almost 5800 calls (620 hours) between customers and operators of an EDF Pro call centre.
Many other corpora are described in the literature, but some of them are difficult to find or to get access to, and some existing HH task-oriented conversation corpora are not appropriate for our task of IQ modelling for various reasons.
The main reasons that forced us to design our own corpus are as follows:
- such corpora as DECODA and CallSurf are in French, which makes the labelling process difficult without knowing French;
- some corpora, such as the corpus described in [5; 6], do not include audio files, which leads to a loss in information enclosed in audio features;
- some corpora, unfortunately, are not accessible to the public because the content is private information.
3. Corpus description
The current version of the corpus consists of 53 task-oriented dialogues in English between the customers and agents of one company, corresponding to about 87 minutes of signal. The raw audio data is in mono format. The average duration of a dialogue is 99.051 seconds. The distribution of dialogue durations is presented in fig. 1.
First of all, it was necessary to perform speaker diarization. Speaker diarization consists of speaker segmentation and speaker clustering; in other words, it determines who speaks in each speech fragment. We tried such open-source diarization toolkits as LIUM [8] and SHoUT [9]. Unfortunately, the diarization results were not suitable for us, because we needed diarization without errors. That is why the diarization was performed manually with the help of Audacity, a free, open-source, cross-platform application for recording and editing sounds [10]. The audio files were then split with FFmpeg, a free software project that includes libraries for working with multimedia data [11]. In this way 1791 audio-file fragments were extracted, each containing customer speech, agent speech or overlapping speech. At this stage, information such as gender, speaker type (customer or agent) and overlapping speech was annotated manually, although this could have been done automatically, at the cost of some error.
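For illustration, the splitting step can be scripted. The sketch below is a minimal example rather than our actual tooling: it assumes the manual diarization results are stored in a hypothetical CSV file (diarization_segments.csv) with one row per fragment (dialogue file, speaker label, start and end times in seconds) and calls FFmpeg to cut each fragment out of the dialogue recording.

import csv
import subprocess
from pathlib import Path

SEGMENTS_CSV = Path("diarization_segments.csv")  # assumed columns: file,speaker,start,end
OUT_DIR = Path("fragments")

def split_fragments() -> None:
    OUT_DIR.mkdir(exist_ok=True)
    with SEGMENTS_CSV.open(newline="") as f:
        for i, row in enumerate(csv.DictReader(f)):
            out = OUT_DIR / f"{Path(row['file']).stem}_{i:04d}_{row['speaker']}.wav"
            # FFmpeg cuts the fragment between the start/end timestamps
            # without re-encoding the mono WAV stream ("-c copy")
            subprocess.run(
                ["ffmpeg", "-loglevel", "error", "-y",
                 "-i", row["file"],
                 "-ss", row["start"], "-to", row["end"],
                 "-c", "copy", str(out)],
                check=True,
            )

if __name__ == "__main__":
    split_fragments()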
Fig. 1. Distribution of the dialogues by their duration (number of dialogues per one-minute bin, from 0-1 to 5-6 minutes)
In the next stage all these fragments were manually joined into exchanges. Each exchange consists of the turns of a customer and an agent. All fragment concatenations can be divided into three groups: sequential, chain and mixed.
For example, we can have such a scheme of a dialogue: ACACAC, where A is the agent's turn and C is the customer's turn. The sequential type of concatenation will look like this: AC-AC-AC. If we speak about the chain type of concatenation, it will be like this: AC-CA-AC-CA-AC. An example of the mixed type of concatenation can be like this: AC-CA-CA-AC.
To concatenate turns into exchanges we applied the mixed type. Thus we retrieved 1165 exchanges.
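The sequential and chain schemes can be stated precisely in code. The following sketch is illustrative only; the mixed type we actually applied combines both schemes depending on the dialogue flow and involved manual decisions that are not captured here.

from typing import List

def sequential_exchanges(turns: str) -> List[str]:
    """Non-overlapping pairs of turns: "ACACAC" -> ["AC", "AC", "AC"]."""
    return [turns[i:i + 2] for i in range(0, len(turns) - 1, 2)]

def chain_exchanges(turns: str) -> List[str]:
    """Every adjacent pair of turns: "ACACAC" -> ["AC", "CA", "AC", "CA", "AC"]."""
    return [turns[i:i + 2] for i in range(len(turns) - 1)]

For the example dialogue ACACAC above, sequential_exchanges yields the AC-AC-AC grouping and chain_exchanges yields AC-CA-AC-CA-AC, matching the schemes described in the text.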
For extracting features for IQ modelling for HH conversations we used the three parameter levels described in [12]:
- exchange level, containing information about the current exchange;
- window level, containing information about the last n exchanges;
- dialogue level, containing information from the beginning of the dialogue up to the current exchange.
The scheme of the three different parameter levels is depicted in fig. 2.
On the exchange level there are four blocks of features:
- features describing the exchange as a whole;
- features describing the agent's speech in the exchange;
- features describing the customer's speech in the exchange;
- features describing overlapping speech in the exchange.
The list of the features on the exchange level is presented in tab. 1.
The lists of features on the window level and the dialogue level are identical and are presented in tab. 2; the only difference is the number of exchanges included in the computation.
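To make the three levels concrete, here is a minimal sketch of how a single exchange-level value yields window-level and dialogue-level aggregates. The feature and variable names are ours for illustration; exchange duration stands in for any of the features in tab. 2.

from statistics import mean
from typing import Dict, List

def level_parameters(durations: List[float], n: int = 3) -> List[Dict[str, float]]:
    """For each exchange, compute one example feature (duration) on the three
    levels described in [12]: the exchange itself, a window of the last n
    exchanges, and the whole dialogue up to the current exchange."""
    rows = []
    for i, d in enumerate(durations):
        window = durations[max(0, i - n + 1): i + 1]  # last n exchanges incl. current
        dialogue = durations[: i + 1]                 # everything up to now
        rows.append({
            "exchange_duration": d,
            "window_mean_duration": mean(window),
            "dialogue_mean_duration": mean(dialogue),
        })
    return rows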
Fig. 2. The three parameter levels: exchange, window and dialogue (figure from [12])
Table 1
Features on the exchange level
Feature | Description
Features describing the exchange as a whole
Agent speech | Is there agent speech in the exchange?
Customer speech | Is there customer speech in the exchange?
Overlapping | Is there overlapping speech in the exchange?
{#} overlapping exchange | Number of overlapping speech moments in the exchange
Start time | Time elapsed from the beginning of the dialogue to the start of the exchange
First speaker | Who starts the exchange?
Duration | Duration of the exchange
First exchange | Is the exchange the first one in the dialogue?
Pause duration | Total duration of the pauses between the customer's and the agent's speech in the exchange
{%} pause duration | Percentage of the total pause duration in the exchange
Type of turns concatenation | 1 if the previous exchange overlaps the current exchange in time, 0 otherwise
Pause before duration | Duration of the pause between the current and the previous exchange
Features describing agent speech / customer speech / overlapping speech
Audio features | All audio features were extracted by openSMILE (Speech & Music Interpretation by Large Space Extraction), an open-source feature extraction utility for automatic speech, music and paralinguistics recognition research [13]. We applied the openSMILE configuration emo IS09.conf, which yields 384 features obtained by applying statistical functionals to 16 low-level descriptor contours (see the invocation sketch following this table)
Split duration | Duration of the agent's/customer's/overlapping speech in the exchange
Split overlapping | Is there overlapping speech within the agent's or the customer's speech?
Start time split | Time elapsed from the beginning of the dialogue to the start of the agent's/customer's/overlapping speech
{%} duration | Percentage of the agent's/customer's/overlapping speech duration in the exchange
Gender | Gender of the agent and the customer
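The audio feature extraction referred to in tab. 1 can be scripted around openSMILE's command-line extractor, SMILExtract. The sketch below is a plausible invocation rather than our exact pipeline; the config file name and location are assumptions that may differ between openSMILE releases.

import subprocess
from pathlib import Path

# The IS09 emotion config shipped with openSMILE; the exact file name and
# location are assumptions and may differ per release.
CONFIG = Path("config/emo_IS09.conf")

def extract_features(wav: Path, out_file: Path) -> None:
    # SMILExtract is openSMILE's command-line extractor: -C selects the
    # config, -I the input audio, -O the output file (format set by the config)
    subprocess.run(
        ["SMILExtract", "-C", str(CONFIG), "-I", str(wav), "-O", str(out_file)],
        check=True,
    )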
Table 2
Features on the window/dialogue level

Feature | Description
Total duration | Total duration of the exchanges
Mean duration | Mean duration of an exchange
A duration, C duration, O duration | Total duration of the agent's/customer's/overlapping speech
Pauses duration | Total duration of the pauses between the customer's and the agent's speech in the exchanges
A mean duration, C mean duration, O mean duration | Mean duration of the agent's/customer's/overlapping speech
Pause mean duration | Mean duration of the pauses between the customer's and the agent's speech in the exchanges
A_percent duration, C_percent duration, O_percent duration | Percentage of the agent's/customer's/overlapping speech duration
Pauses_percent duration | Percentage of the pause duration between the customer's and the agent's speech
#A start dialogue, #C start dialogue, #O start dialogue | Number of exchanges where the first speech is the agent's/customer's/overlapping speech
#overlapping | Number of fragments with overlapping speech
Mean num overlapping | Mean number of overlaps
Pauses between exchanges duration | Total duration of the pauses between exchanges
The manually annotated variables, such as emotions and the IQ, are discussed in the next section.
4. Annotation (Target variables)
The corpus has been manually annotated with a number of target variables, such as emotions and IQ score.
Emotions. Three sets of emotion categories were selected from [14]. The first set consists of: angry (1), sad (2), neutral (3) and happy (4). The second set includes such emotions as anxiety (1), anger (2), sadness (3), disgust (4), boredom (5), neutral (6) and happiness (7). The third set contains fear (1), anger (2), sadness (3), disgust (4), neutral (5), surprise (6) and happiness (7). All audio fragments were annotated by one expert rater. The distribution of the emotion labels for each set is presented in fig. 3.
Fig. 3. Distributions of the emotion labels for the three sets of emotions

For emotion labelling a web form was designed; it is presented in fig. 4. The same web form was also used to join the split audio fragments into exchanges. For visualizing the diarization results we used Flotr2, a library for drawing HTML5 charts and graphs [15].
Fig. 4. The web form for emotion annotation and manually joining agent/customer turns into an exchange
IQ labels. For IQ score annotation we used rater guidelines adapted from [1]. Since IQ labelling for HC interaction always starts with "5", which for HH conversations can lead to a loss of useful information, we added a scale of IQ changes, which mirrors the value of the change between the previous and the current exchange. The rater guidelines for both the absolute scale and the scale of changes are described in tab. 3.

Table 3
Rater guidelines for annotating the IQ on the absolute scale and on the scale of changes ("same" means the rule is identical for both scales)
№ | The absolute scale | The scale of changes
1 | The rater should try to assess the interaction as a whole as objectively as possible, but pay more attention to the customer's point of view in the interaction | same
2 | An exchange consists of the agent's and the customer's turns | same
3 | The IQ score is defined on a 5-point scale with "1=bad", "2=poor", "3=fair", "4=good" and "5=excellent" | The IQ score is defined on a 6-point scale with "-2", "-1", "0", "1", "2" and "abs 1". The first five points reflect the change in the IQ from the previous exchange to the current one; "abs 1" means "1=bad" on the absolute scale
4 | The IQ is to be rated for each exchange in the dialogue. The history of the dialogue should be kept in mind when assigning the score. For example, a dialogue that has proceeded fairly poorly for a long time should require some time to recover | same
5 | A dialogue always starts with an IQ score of "5" | A dialogue always starts with an IQ score of "0"
6 | In general, the score from one exchange to the following exchange is increased or decreased by one point at the most | same
7 | Exceptions, where the score can be decreased by two points, are e. g. hot anger or sudden frustration. The rater's perception is decisive here | same
8 | If the dialogue obviously collapses due to agent or customer behaviour, the score can be set to "1" immediately. An example is a reasonably frustrated sudden hang-up | If the dialogue obviously collapses due to agent or customer behaviour, the score can be set to "abs 1" immediately. An example is a reasonably frustrated sudden hang-up
9 | Anger does not need to influence the score, but it can. The rater should try to figure out whether the anger was caused by the dialogue behaviour or not | same
10 | If a customer realizes that he should adapt his dialogue strategy to obtain the desired result or information, and succeeds in that way, the IQ score can be raised by up to two points per turn. In other words, the customer realizes that he caused the poor IQ himself | same
11 | If a dialogue consists of several independent queries, the quality of each query is to be rated independently. The former dialogue history should not be considered when a new query begins. However, the score provided for the first exchange should be equal to the last label of the previous query | same
12 | If a dialogue proceeds fairly poorly for a long time, the rater should consider increasing the score more slowly if the dialogue starts to recover. Also, in general, he should observe the remaining dialogue more critically | same
13 | If a constantly low-quality dialogue finishes with a reasonable result, the IQ can be increased | same
All exchanges were annotated with IQ scores by one expert rater. For IQ labelling we designed the web form depicted in fig. 5.

Fig. 5. The web form for annotating the IQ on the absolute scale and the scale of changes by the expert rater
Fig. 6. Distributions of the IQ score on the absolute scale and on the absolute scale converted from the scale of changes
Then we converted the scale of changes into the absolute scale. The distributions of the IQ score for both scales (the absolute scale and the absolute scale converted from the scale of changes) are presented in fig. 6.
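A minimal sketch of this conversion as a running sum is shown below. The assumptions here (the running score starts at "5" and is clipped to the 1-5 range of the absolute scale, and "abs 1" resets it to 1) are our reading of the guidelines in tab. 3, not an explicit specification.

from typing import List, Union

def changes_to_absolute(changes: List[Union[int, str]], start: int = 5) -> List[int]:
    """Convert change labels (-2..2, or "abs 1" for an immediate collapse)
    into absolute IQ scores, clipped to the 1..5 range."""
    scores, current = [], start
    for c in changes:
        current = 1 if c == "abs 1" else max(1, min(5, current + int(c)))
        scores.append(current)
    return scores

# e.g. changes_to_absolute([0, -1, -1, "abs 1", 1]) -> [5, 4, 3, 1, 2]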
5. Future work
We plan to extend this corpus both in the feature set and in the number of observations. The feature set will be extended, for example, by adding audio features such as shimmer, jitter and formants, computed with openSMILE or PRAAT [16], and by adding manually annotated features such as task completion. Moreover, we plan to apply automatic speech recognition software to our audio files.
The use of the two scales (the absolute scale and the scale of changes) has revealed that applying the method of IQ estimation that starts with "5" to HH conversations can lead to a loss of information in the modelling process. An example of such a situation is depicted in fig. 7.
Fig. 7. An example of possible loss of information due to IQ estimations starting with "5"
In HH task-oriented conversations, for example, in call centres, there are a number of reasons why a dialogue cannot start with the IQ label "5": a long waiting time, or customer complaints against the company accompanied by aggressive customer behaviour, because of which the IQ cannot initially be good.
Moreover, in its present form the IQ is a very subjective estimate. That is why we plan to change the rater guidelines for the IQ annotation: to describe all possible situations in each IQ category and the transitions from one category to another. This should decrease the subjectivity of the IQ assessment. To further reduce subjectivity, we also plan to increase the number of expert raters.
6. Conclusion
Although there are a large number of speech corpora, for some speech analysis tasks, such as IQ modelling for HH task-oriented conversations, it is still difficult to find an appropriate specific corpus. In this paper we described the current version of our corpus for IQ modelling for HH task-oriented conversations. In spite of some drawbacks, it can be useful for further investigations.
Extending this corpus will help to design a more accurate IQ model, which in turn can help to improve the quality of service in call centres, to make SDS friendlier, more flexible and more human-like, and to automate some processes of analysing people's psychological state in situations with a high level of responsibility, for example, in space programmes.
Acknowledgment. This work was partly supported by the DAAD (German Academic Exchange Service) within the program "Research Grants - One-Year Grants".
References
1. Schmitt A., Schatz B., Minker W. Modeling and predicting quality in spoken human-computer interaction. Proceedings of the SIGDIAL 2011 Conference. Association for Computational Linguistics, 2011, P. 173-184.
2. European Language Resources Association. Available at: http://elra.info/Language-Resources-LRs.html (accessed 03.10.2015).
3. Linguistic Data Consortium. Available at: https://catalog.ldc.upenn.edu/ (accessed 03.10.2015).
4. Bechet F., Maza B., Bigouroux N., Bazillon T., El-Beze M., De Mori R., Arbillot E. DECODA: a call-center human-human spoken conversation corpus. International Conference on Language Resources and Evaluation (LREC), 2012, P. 1343-1347.
5. Pallotta V., Delmonte R., Vrieling L., Walker D. Interaction Mining: the new frontier of Call Center Analytics. CEUR Workshop Proceedings, 2011, P. 1-12.
6. Rafaelli A., Ziklik L., Doucet L. The Impact of Call Center Employees' Customer Orientation Behaviors on Service Quality. Journal of Service Research, 2008, Vol. 10, No. 3, P. 239-255.
7. Lavalley R., Clavel C., Bellot P., El-Beze M. Combining text categorization and dialog modeling for speaker role identification on call center conversations. INTERSPEECH, 2010, P. 3062-3065.
8. Meignier S., Merlin T. LIUM SpkDiarization: An Open Source Toolkit For Diarization. Proceedings of CMU SPUD Workshop, 2010.
9. SHoUT. Available at: http://shout-toolkit.sourceforge.net/ (accessed 03.10.2015).
10. Audacity. Available at: http://audacityteam.org/ (accessed 03.10.2015).
11. FFmpeg. Available at: https://www.ffmpeg.org/ (accessed 03.10.2015).
12. Schmitt A., Ultes S., Minker W. A Parameterized and Annotated Corpus of the CMU Let's Go Bus Information System. International Conference on Language Resources and Evaluation (LREC), 2012, P. 3369-3373.
13. Eyben F., Weninger F., Gross F., Schuller B. Recent Developments in openSMILE, the Munich Open-Source Multimedia Feature Extractor. Proceedings of ACM Multimedia (MM), 2013, P. 835-838.
14. Sidorov M., Schmitt A., Semenkin E. Automated Recognition of Paralinguistic Signals in Spoken Dialogue Systems: Ways of Improvement. Journal of Siberian Federal University, Mathematics and Physics, 2015, Vol. 8, No. 2, P. 208-216.
15. Flotr2. Available at: http://www.humblesoftware.com/flotr2/ (accessed 03.10.2015).
16. Praat: doing phonetics by computer. Available at: http://www.fon.hum.uva.nl/praat/ (accessed 03.10.2015).
© Spirina A. V., Sidorov M. Yu., Sergienko R. B., Semenkin E. S., Minker W., 2016