
UDC 004

Pugashbek А. А.

Master's student in Information Security, School of Digital Technologies, NARXOZ University (Almaty, Kazakhstan)

REVIEW OF RESEARCH ARTICLES ON PERSONAL IDENTIFICATION USING VOICE BIOMETRICS

Abstract: this paper reviews recent advancements in voice biometrics for personal identification. It covers machine learning techniques such as i-vectors, x-vectors, and deep neural networks, highlighting their accuracy and limitations. Special focus is given to challenges like noise, channel variability, and spoofing attacks using synthetic voices. The review also explores the benefits of combining voice with other biometrics to enhance security in real-world applications.

Keywords: voice biometrics, speaker recognition, deep learning, i-vectors, x-vectors, spoofing attacks, deepfake detection, multimodal authentication, noise robustness, machine learning, personal identification.

1. Introduction.

Modern security systems predominantly rely on traditional authentication methods such as passwords and PIN codes, which still serve as the primary means of authentication for both everyday users and corporate environments. However, as technology advances, cybercriminals become increasingly sophisticated in their tactics, employing methods such as phishing and fake websites to harvest credentials. These breaches can lead to grave consequences, including identity theft, financial losses, and unauthorized disclosure of corporate secrets, ultimately undermining an organization's credibility and inflicting both financial and reputational damage.

The growing adoption of online services, remote work, and cloud-based solutions has amplified the demand for authentication strategies that are both robust and convenient. Voice biometrics is emerging as a promising avenue, capitalizing on the unique acoustic characteristics of human speech. Its appeal lies in the fact that almost any device equipped with a microphone can support voice-based authentication, thereby offering ease of deployment and user-friendliness. At the same time, the proliferation of "deepfake" technologies capable of generating highly realistic synthetic voices highlights the urgent need for advanced anti-spoofing measures.

Recent research in speaker recognition demonstrates impressive accuracy levels under controlled laboratory conditions, particularly with the advent of modern machine learning techniques such as i-vectors, x-vectors, and deep neural networks. Nevertheless, real-world environments present a range of challenges, including background noise, varying communication channels, emotional and physiological changes in speech, and intentional voice alterations (spoofing). These factors complicate the practical implementation of voice biometrics in settings like banking contact centers, smart home systems, and mobile applications, where user experience and security are both paramount.

Given these complexities, this paper offers a review of contemporary research on personal identification through voice biometrics. Specifically, we analyze existing approaches, discuss commonly used datasets, and explore open challenges such as robustness to noise and resistance to spoofing. The overarching goal is to illustrate how, in tandem with other biometric technologies, voice-based authentication can significantly enhance security without sacrificing user convenience in real-world scenarios.

2. State of the art.

2.1 Common solutions.

There are currently many solutions for identifying individuals by their voice characteristics, deployed across various fields.

One such solution is voice-controlled intercom and entry verification systems. They are used to control access to buildings and premises, allowing users to confirm their identity with their voice. These systems assess distinct vocal characteristics such as pitch and tonal quality. Such voice intercoms are simple to operate since they need no keys or cards. Nonetheless, they may be susceptible to attacks employing recorded voices.

Another popular option involves mobile devices with voice recognition, such as Samsung Bixby Voice, Apple Siri, and Google Assistant Voice Match. These technologies enable users to control devices, retrieve data, and execute commands through voice interaction. They offer convenience and personalization, but can encounter issues in loud settings or when the voice changes due to health conditions. There is also a potential threat from attacks that replay recorded voices.

Consequently, voice recognition is now commonly utilized in daily life. Nonetheless, current solutions are still unable to ensure total protection against attacks and external influences like noise or variations in voice.

2.2 Voice-based solutions.

Recent research on automatic identification and authentication of identity by voice shows remarkable progress and high accuracy. An overview of a number of works illustrating relevant approaches and results in this area is provided below.

The study [1] proposes an i-vector and deep neural network (DNN) approach. According to the experimental results, the Type I and Type II error probabilities (FRR and FAR) are 0.025 and 0.005 respectively, which indicates high system reliability against attacks. The use of a cluster model of elementary speech units and a DNN in i-vector space made it possible to achieve such low error rates.
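Error rates of this kind are computed from raw trial scores. The sketch below is a minimal illustration, using made-up toy scores rather than data from any cited study, of how FAR and FRR are measured at a decision threshold:

```python
import numpy as np

def far_frr(genuine_scores, impostor_scores, threshold):
    """Compute false-acceptance and false-rejection rates at a decision threshold.

    Scores at or above the threshold are accepted as the claimed speaker.
    """
    genuine = np.asarray(genuine_scores)
    impostor = np.asarray(impostor_scores)
    frr = np.mean(genuine < threshold)    # genuine trials wrongly rejected
    far = np.mean(impostor >= threshold)  # impostor trials wrongly accepted
    return far, frr

# Toy similarity scores: genuine pairs score high, impostors low.
genuine = [0.91, 0.85, 0.78, 0.95, 0.88]
impostor = [0.12, 0.40, 0.33, 0.05, 0.62]
far, frr = far_frr(genuine, impostor, threshold=0.70)
print(far, frr)  # 0.0 0.0 at this well-separated threshold
```

Sweeping the threshold and finding the point where the two rates meet gives the equal error rate (EER) reported by several of the reviewed papers.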

The work [2] considers a multimodal biometric authentication system based on face and voice recognition. A Gaussian Mixture Model (GMM) was used for voice and the FaceNet model, via embeddings, for face. When the results are merged (score-level fusion), the minimum EER is reduced to 0.011%. For voice recognition alone, the EER was 0.13%, and for face recognition it was 0.22%.
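Score-level fusion of this kind can be illustrated with a simple weighted sum of per-modality match scores; the weight value below is an assumption for illustration, not the fusion rule from the paper:

```python
import numpy as np

def fuse_scores(voice_score, face_score, w_voice=0.5):
    """Weighted-sum score-level fusion of two modality match scores in [0, 1].

    The weight is a tunable assumption; in practice it would be chosen on a
    development set, e.g. to minimize the fused EER.
    """
    return w_voice * voice_score + (1.0 - w_voice) * face_score

# A noisy voice score is compensated by a confident face score.
fused = fuse_scores(voice_score=0.55, face_score=0.93, w_voice=0.4)
print(round(fused, 3))  # 0.778
```

The benefit shown in [2] comes precisely from this compensation effect: a weak score in one modality rarely coincides with a weak score in the other.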

In a study [3], a fast Fourier transform (FFT) with different window functions (Hamming, 4-term Blackman-Harris, Flat Top, and Hanning) is used to identify a person by voice. The extracted spectral features were used to train two models: k-nearest neighbors (k-NN) and feed-forward neural networks (FFNN). In k-NN, the highest accuracy (97.68%) was achieved with the Cityblock metric at k = 3. When using an FFNN with linear activation (purelin), the accuracy in certain network configurations reached 100%.
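This FFT-plus-k-NN pipeline can be sketched with synthetic signals. The Hamming window and the Cityblock metric with k = 3 follow the configuration reported as best in [3], while the toy two-speaker data below is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def spectral_features(signal, window="hamming"):
    """Magnitude spectrum of a windowed frame; the window choice mirrors
    the Hamming/Hanning variants compared in the study."""
    win = np.hamming(len(signal)) if window == "hamming" else np.hanning(len(signal))
    return np.abs(np.fft.rfft(signal * win))

def knn_cityblock(train_X, train_y, x, k=3):
    """k-NN with the Cityblock (L1) metric and a majority vote among k neighbors."""
    d = np.sum(np.abs(train_X - x), axis=1)
    votes = train_y[np.argsort(d)[:k]]
    return np.bincount(votes).argmax()

# Two synthetic "speakers": sinusoids at different fundamental frequencies.
n = 512
t = np.arange(n) / 8000.0
def sample(f0):
    return np.sin(2 * np.pi * f0 * t) + 0.05 * rng.standard_normal(n)

train_X = np.array([spectral_features(sample(f)) for f in (120, 118, 122, 220, 218, 222)])
train_y = np.array([0, 0, 0, 1, 1, 1])
pred = knn_cityblock(train_X, train_y, spectral_features(sample(121)), k=3)
print(pred)  # 0: the probe is closest to the low-pitched "speaker"
```

Real systems would of course frame continuous speech and aggregate over many frames rather than classify a single synthetic tone.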

Work [4] indicates that verification (confirmation of identity) under ideal conditions yields less than 1% errors, but as the database grows and conditions become more complex (noise, short recordings), identification errors can reach 20-30%. It is also noted that with recordings of 2 seconds or less, errors can exceed 30%.

In a study [5] on voice biometrics using neural networks and mel-frequency cepstral coefficients (MFCC), the accuracy was 82% when recognizing the "target" speaker (FRR = 18%) and 86% when rejecting "impostors" (FAR = 14%). The half total error rate (HTER) in this experiment was 16%. The importance of correct feature extraction (MFCC) for improving accuracy is emphasized.
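For context, a bare-bones single-frame MFCC computation, the feature type used in [5] and several other cited works, looks roughly as follows. Parameter values such as the number of mel filters are illustrative assumptions, and a production system would use a tested library implementation:

```python
import numpy as np

def mfcc_frame(frame, sr=16000, n_mels=20, n_ceps=13):
    """Minimal single-frame MFCC sketch (assumed parameter values):
    window -> power spectrum -> mel filterbank -> log -> DCT."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    # Mel scale conversions: triangular filters are spaced evenly in mel.
    hz2mel = lambda f: 2595 * np.log10(1 + f / 700.0)
    mel2hz = lambda m: 700 * (10 ** (m / 2595.0) - 1)
    mels = np.linspace(hz2mel(0), hz2mel(sr / 2), n_mels + 2)
    bins = np.floor((len(frame) + 1) * mel2hz(mels) / sr).astype(int)
    fbank = np.zeros((n_mels, len(spec)))
    for i in range(n_mels):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        if mid > lo:
            fbank[i, lo:mid] = np.linspace(0, 1, mid - lo, endpoint=False)
        if hi > mid:
            fbank[i, mid:hi] = np.linspace(1, 0, hi - mid)
    logmel = np.log(fbank @ spec + 1e-10)
    # DCT-II decorrelates the log-mel energies; keep the first n_ceps coefficients.
    k = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * k + 1) / (2 * n_mels))
    return dct @ logmel

frame = np.sin(2 * np.pi * 200 * np.arange(512) / 16000)
coeffs = mfcc_frame(frame)
print(coeffs.shape)  # (13,)
```

Full pipelines additionally apply pre-emphasis, overlapping frames, and often delta coefficients, all omitted here for brevity.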

A separate research area [6] analyzes the influence of intentional voice changes (imitation, camouflage). It shows that changing the formants (F1-F4), the fundamental frequency (F0), and other parameters increases the level of recognition errors. This is especially noticeable for "young" and "old" voice disguises, where men often prove more successful at disguise.

An Introduction to Biometric Recognition [7] reports that voice systems may achieve a False Match Rate (FMR) of no more than 0.001% in controlled conditions, but in practice this is degraded by noise and the quality of the communication channel. The FNMR (False Non-Match Rate) also increases when the voice changes, for example due to illness.

Work [8] describes multimodal authentication for mobile devices, including face, teeth, and voice recognition. Voice is analyzed using MFCC and Gaussian Mixture Models (GMM). Using the voice modality alone, the EER was 8.98%, whereas in combination with the other modalities (face + teeth) the total EER could be reduced to 1.64%.

Article [9] also considers multimodal biometrics (teeth and voice) in a mobile environment. For voice authentication, an EER of 6.24% is reached based on fundamental frequency (pitch) characteristics and MFCC, with classification built on a GMM. It was noted that under noise and signal compression, accuracy may decrease.

In [10], a method based on personal identification of voice (PIV) and fuzzy similarity (trapezoid fuzzy similarity) is proposed. The system accepted 78.75% of registered users, which is 29.78% higher than the nearest-neighbor method, while still rejecting 100% of non-registered users. Variations in voice characteristics are accounted for through 242 repetitions of phrases.
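The trapezoid fuzzy similarity of [10] builds on the standard trapezoidal membership function. The sketch below shows that function alone, with illustrative parameters that are not taken from the paper:

```python
import numpy as np

def trapezoid_membership(x, a, b, c, d):
    """Trapezoidal fuzzy membership: 0 outside [a, d], 1 on the plateau [b, c],
    linear ramps in between. Parameter values here are illustrative."""
    x = np.asarray(x, dtype=float)
    rise = np.clip((x - a) / (b - a), 0, 1)  # left ramp up
    fall = np.clip((d - x) / (d - c), 0, 1)  # right ramp down
    return np.minimum(rise, fall)

# Membership of pitch values in a speaker's hypothetical "typical F0" fuzzy set.
mu = trapezoid_membership([100, 120, 150, 180, 210], a=110, b=130, c=170, d=200)
print(mu)
```

Comparing such membership values across a speaker's many phrase repetitions tolerates natural voice variation far better than a crisp distance threshold.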

The paper [11], on an interactive voice response (IVR) system with speech recognition based on artificial neural networks, also reports 82% accuracy, FRR = 18%, and FAR = 14% (similar to the figures in [5]). The system was trained on 50 samples of the genuine user and 125 samples of impostors, using a four-layer neural network.

The article [12] presents a method for classifying voice biometric data based on the REPTree algorithm, which reached 96.68% accuracy on a set of 3168 samples. The study highlights the importance of pre-processing and dimensionality reduction. It is suggested that such solutions can be applied in mobile banking and public administration systems.

Another line of work [13] analyzes the task of detecting voice pathologies. The best accuracy, 85.77%, was shown by the support vector machine (SVM), with sensitivity reaching 87.59%, specificity 83.94%, and an area under the curve (AUC) of 0.858.

In [14], a caller identification system for emergency services is described, where a 30-second voice sample during the training phase is sufficient for recognition. The system identifies a speaker within 5 seconds when their profile is in the database, but noisy and compressed audio remains a serious problem.

Finally, the paper [15] examines the effectiveness of various algorithms (SVM, k-NN, LMkNN, FkNN) for human voice biometric systems as well as for frog bioacoustics. The best result for human voice biometrics (93.38%) is obtained using a fuzzy FkNN classifier with 20 training examples. This indicates the high accuracy and versatility of such classification methods.

Thus, modern voice biometrics and multimodal authentication systems demonstrate high efficiency (accuracy up to 99-100% under controlled conditions). However, a number of problems remain unsolved, related to noise, voice change, deliberate masking strategies, and the need for large amounts of training data.

2.3 Analysis.

Table 1. Comparison of the analyzed studies.

Paper | Data Sources | Algorithm Used | Results
Identification and authentication of user voice using DNN features and i-vector [1] | 80-200 voice samples | DNN with i-vector approach | FRR: 0.025, FAR: 0.005, high resistance to attacks
Face-voice based multimodal biometric authentication system via FaceNet and GMM [2] | 700 voice samples | FaceNet embeddings and Gaussian Mixture Model (GMM) | Voice EER: 0.13%, face EER: 0.22%, fusion EER: 0.011%
Voice analysis for personal identification using FFT, machine learning and AI techniques [3] | 3168 voice samples | k-NN and feed-forward neural networks (FFNN) | k-NN (97.68%), FFNN (up to 100%)
Speaker Recognition—Identifying People by Their Voices [4] | Real-world recordings, 2-second and longer voice samples | Statistical models and machine learning classifiers | <1% verification errors, 20-30% identification errors with noise
Voice Biometrics [5] | 50 target and 125 impostor voice samples | Multilayer perceptron (MLP) with MFCC | FRR: 18%, FAR: 14%, HTER: 16%
Human-induced voice modification and speaker recognition [6] | 100 voice samples | Acoustic analysis with formant (F1-F4) and pitch modifications | Higher errors for disguised voices, sensitive to formant changes
An Introduction to Biometric Recognition [7] | Not specified | Various statistical classifiers with thresholds | FMR: 0.001% in controlled settings, FNMR higher with noise
Person authentication using face, teeth and voice modalities for mobile device security [8] | 1000 voice samples | MFCC and GMM for multimodal fusion | Voice EER: 8.98%, combined EER: 1.64%
Multimodal biometric authentication using teeth image and voice in mobile environment [9] | 1000 images and voice samples | GMM and MFCC for teeth and voice analysis | Voice EER: 6.24%, sensitive to noise and compression
A Speaker Recognition Method based on Personal Identification Voice and Trapezoid Fuzzy Similarity [10] | 242 voice samples | Trapezoid fuzzy similarity | Acceptance: 78.75%, higher than k-NN by 29.78%
Interactive Voice Response with Pattern Recognition Based on Artificial Neural Network Approach [11] | 50 target and 125 impostor voice samples | Artificial neural networks (ANN) | FRR: 18%, FAR: 14%, similar to MLP
Feature based classification of voice based biometric data through Machine learning algorithm [12] | 3168 voice samples | REPTree with dimensionality reduction | 96.68% accuracy with REPTree
Voice disorder identification by using machine learning techniques [13] | 1370 voice samples | Support vector machine (SVM) | Sensitivity: 87.59%, specificity: 83.94%, AUC: 0.858
Caller identification by voice [14] | Emergency call recordings, 45 hours of data | Random Forest and ensemble methods | 5-second identification time, challenges with noisy data
Evaluation of Human Voice Biometrics and Frog Bioacoustics Identification Systems [15] | 49 voice samples | SVM, k-NN, LMkNN, and FkNN | FkNN (93.38%) for human voice, 97% for frog bioacoustics

Table 1 summarizes key studies on voice-based biometric systems, including multimodal approaches. The datasets range from small sets of targeted voice samples to multimodal AVSpeech data, mobile recordings, emergency calls, and even frog sounds. The algorithms include i-vectors combined with DNNs, FaceNet+GMM, FFT with k-NN/FFNN, statistical models, MLP with MFCC, SVM, Random Forest, FkNN, REPTree, and others; the reported metrics cover FRR, FAR, EER, accuracy, sensitivity, and specificity. The best accuracy is most often shown by deep neural networks or multimodal systems (where the EER can be reduced to 0.011%), while classical methods like GMM and Random Forest remain convenient under limited resources. A major challenge is still noise and deliberate voice modification, which points to the need for either special preprocessing methods or additional biometric features to increase resilience.

2.4 Conclusion.

Modern research in voice biometrics demonstrates high accuracy under controlled conditions (EER down to 0.011%), but the real-world application of such systems, especially in noisy and dynamically changing environments, requires more "natural" recordings, advanced anti-spoofing mechanisms, and an orientation towards multimodal solutions. Laboratory datasets and staged scenarios often fail to reflect real difficulties (noise, emotional and physiological changes, varying communication channels), and modern deep learning methods need sufficiently large and diverse datasets to ensure reliability and attack resilience in practical environments.

3. Discussion.

Despite the impressive accuracy reported in many of the controlled laboratory experiments, practical voice biometrics deployments still face a number of unresolved challenges. First, noise and channel variability substantially affect recognition performance: even small background disturbances or differences in microphone quality can degrade accuracy, increasing both FRR (False Rejection Rate) and FAR (False Acceptance Rate). This is especially evident in real-world settings like call centers and mobile environments, where the acoustic channel cannot be strictly controlled. Advanced front-end signal processing (e.g., speech enhancement, reverberation compensation) and robust training strategies (including data augmentation with noisy or low-quality samples) have emerged as essential techniques to mitigate these issues.
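One common augmentation mentioned above, mixing noise into clean training audio at a controlled signal-to-noise ratio, can be sketched as follows (synthetic signals and an illustrative SNR target):

```python
import numpy as np

rng = np.random.default_rng(42)

def add_noise_at_snr(clean, noise, snr_db):
    """Mix noise into a clean signal at a target signal-to-noise ratio,
    a standard augmentation for training noise-robust speaker models."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

clean = np.sin(2 * np.pi * 150 * np.arange(16000) / 16000)
noise = rng.standard_normal(16000)
noisy = add_noise_at_snr(clean, noise, snr_db=10)

# Verify the achieved SNR matches the 10 dB target.
achieved = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
print(round(achieved, 1))  # 10.0
```

Training on copies of each utterance at several SNR levels exposes the model to the channel conditions it will face in call centers and mobile deployments.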

A second major challenge involves spoofing attacks, particularly with the rise of deepfake technology. An attacker might generate synthetic voices that closely mimic the target user's timbre and intonation. Many current methods (e.g., i-vector, GMM) were not originally designed to detect synthetic or replayed speech, emphasizing the urgent need for anti-spoofing measures. Recent works use specific classifiers to detect anomalies in phase information or high-frequency components that differ between natural and artificially generated speech.

Third, speaker variability due to emotional state, aging, or intentional voice modifications (camouflage and imitation) also tests the stability of voice biometric systems. Longitudinal studies confirm that the voice changes over time, and a model trained on relatively short or homogeneous data may need continuous updates (enrollment re-checks, adaptive templates). Consequently, approaches integrating domain adaptation or incremental learning are increasingly important for maintaining high accuracy across changing conditions.
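One simple form of adaptive template maintenance is an exponential moving average over enrollment embeddings. This is a generic sketch of the idea, not a method from any cited paper; the drift simulation and the alpha value are invented for illustration:

```python
import numpy as np

def update_template(template, new_embedding, alpha=0.1):
    """Exponential-moving-average template adaptation: a sketch of keeping an
    enrolled voiceprint current as the voice drifts over time.
    The alpha value is an illustrative assumption."""
    updated = (1 - alpha) * template + alpha * new_embedding
    return updated / np.linalg.norm(updated)  # keep unit length for cosine scoring

rng = np.random.default_rng(7)
template = rng.standard_normal(64)
template /= np.linalg.norm(template)

# Simulate gradual drift: each session's embedding moves toward a new voice state.
drift_target = rng.standard_normal(64)
drift_target /= np.linalg.norm(drift_target)
for step in range(1, 11):
    session = (1 - 0.05 * step) * template + (0.05 * step) * drift_target
    template = update_template(template, session / np.linalg.norm(session))

print(round(float(np.linalg.norm(template)), 6))  # 1.0
```

In deployed systems such updates would be gated by a high-confidence verification decision, so an impostor's sample never contaminates the template.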

Lastly, multimodal fusion has proven to be one of the more promising solutions to achieve both security and usability. Combining voice with face, teeth, or other biometric cues reliably reduces the likelihood of both false accepts and false rejects, even in challenging environments. This synergy is critical for large-scale applications (e.g., in finance, healthcare, or government services) where a single biometric may not fully handle all corner cases (noise, replay attacks, etc.). By drawing on the different strengths of each modality, a multimodal system can substantially enhance robustness against spoofing while delivering a user-friendly experience.

Overall, research trends indicate that future breakthroughs will come from integrating deep learning architectures (e.g., x-vector, ECAPA-TDNN, transformer-based models) with sophisticated anti-spoofing strategies, extensive data augmentation, and, wherever feasible, multimodal solutions. While current voice identification systems already show high accuracy in controlled scenarios, continuous refinement in these directions is crucial for reliable deployment in everyday, unpredictable real-world settings.

REFERENCES:

1. Aizat, K., Mohamed, O., Orken, M., Ainur, A., & Zhumazhanov, B. (2020). Identification and authentication of user voice using DNN features and i-vector. Cogent Engineering, 7(1), 1751557. https://doi.org/10.1080/23311916.2020.1751557;

2. Alharbi, B., & Alshanbari, H. S. (2023). Face-voice based multimodal biometric authentication system via FaceNet and GMM. PeerJ Computer Science, 9, e1468. https://doi.org/10.7717/peerj-cs.1468;

3. Balabanova, I., Georgiev, G., Karapenev, B., & Rankovska, V. (2022). Voice analysis for personal identification using FFT, machine learning and AI techniques. AIP Conference Proceedings. https://doi.org/10.1063/5.0099672;

4. Doddington, G. (1985). Speaker recognition—Identifying people by their voices. Proceedings of the IEEE, 73(11), 1651-1664. https://doi.org/10.1109/proc.1985.13345;

5. González-Rodríguez, J., Toledano, D. T., & Ortega-García, J. (2007). Voice Biometrics. In Springer eBooks (pp. 151-170). https://doi.org/10.1007/978-0-387-71041-9_8;

6. Hautamaki, R. G. (2017). Human-induced voice modification and speaker recognition: automatic, perceptual and acoustic perspectives. https://erepo.uef.fi/handle/123456789/18781;

7. Jain, A., Ross, A., & Prabhakar, S. (2004). An introduction to biometric recognition. IEEE Transactions on Circuits and Systems for Video Technology, 14(1), 4-20. https://doi.org/10.1109/tcsvt.2003.818349;

8. Kim, D., Chung, K., & Hong, K. (2010). Person authentication using face, teeth and voice modalities for mobile device security. IEEE Transactions on Consumer Electronics, 56(4), 2678-2685. https://doi.org/10.1109/tce.2010.5681156;

9. Kim, D., & Hong, K. (2008). Multimodal biometric authentication using teeth image and voice in mobile environment. IEEE Transactions on Consumer Electronics, 54(4), 1790-1797. https://doi.org/10.1109/tce.2008.4711236;

10. Nguyen, T. H. L., Arai, Y., Sato, H., Hayashi, T., Dong, F., & Hirota, K. (2008). A Speaker Recognition Method based on Personal Identification Voice and Trapezoid Fuzzy Similarity. SCIS & ISIS 2008, 1596-1601. https://doi.org/10.14864/softscis.2008.0.1596.0;

11. Shah, S. a. A., Asar, A. U., & Shah, S. W. (2007). Interactive Voice Response with Pattern Recognition Based on Artificial Neural Network Approach. International Conference on Emerging Technologies, 249-252. https://doi.org/10.1109/icet.2007.4516352;

12. Shakil, S., Arora, D., & Zaidi, T. (2021). Feature based classification of voice based biometric data through Machine learning algorithm. Materials Today Proceedings, 51, 240-247. https://doi.org/10.1016/j.matpr.2021.05.261;

13. Verde, L., De Pietro, G., & Sannino, G. (2018). Voice disorder identification by using machine learning techniques. IEEE Access, 6, 16246-16255. https://doi.org/10.1109/access.2018.2816338;

14. Witkowski, M., Igras, M., Grzybowska, J., Jaciow, P., Galka, J., & Ziolko, M. (2014). Caller identification by voice. IEEE, 1-7. https://doi.org/10.1109/pvc.2014.6845420;

15. (2018). Evaluation of human voice biometrics and frog bioacoustics identification systems based on feature extraction method and classifiers. 176. https://doi.org/10.36458/1253-000-031-010
