Speech emotion recognition with light gradient boosting decision trees machine

Kah Liang Ong; Chin Poo Lee; Heng Siong Lim; Kian Ming Lim

doi:10.11591/ijece.v13i4.pp4020-4028

Research output: Journal Publication › Article › peer-review

10 Citations (Scopus)

Abstract

Speech emotion recognition aims to identify the emotion expressed in speech by analyzing the audio signal. In this work, data augmentation is first performed on the audio samples to increase the number of samples for better model learning. The audio samples are then encoded as frequency-domain and temporal-domain features. For classification, a light gradient boosting machine (LightGBM) is used, and hyperparameter tuning is performed to determine its optimal settings. Because speech emotion recognition datasets are imbalanced, the class weights are set inversely proportional to the sample distribution, so that minority classes are assigned higher weights. The experimental results show that the proposed method outperforms state-of-the-art methods, achieving 84.91% accuracy on the Berlin database of emotional speech (emo-DB) dataset, 67.72% on the Ryerson audio-visual database of emotional speech and song (RAVDESS) dataset, and 62.94% on the interactive emotional dyadic motion capture (IEMOCAP) dataset.
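
The class-weighting idea in the abstract can be illustrated with a small sketch. The abstract states only that the weights are inversely proportional to the sample distribution; the exact formula below, weight = N / (K * n_c) for N samples, K classes and n_c samples in class c, is an assumption (it is the common "balanced" heuristic), and the function name and toy labels are hypothetical rather than taken from the paper.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Map each class to a weight inversely proportional to its frequency.

    Assumed formula (not confirmed by the paper): N / (K * n_c), where N is
    the number of samples, K the number of classes, n_c the class count.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Made-up, imbalanced toy labels; not drawn from emo-DB, RAVDESS or IEMOCAP.
labels = ["angry"] * 6 + ["neutral"] * 3 + ["fear"] * 1
weights = inverse_frequency_weights(labels)

# The rarest class ("fear") gets the largest weight, the most common the smallest.
for emotion, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{emotion}: {weight:.3f}")
```

A dictionary of this shape could plausibly be passed as the `class_weight` argument of `lightgbm.LGBMClassifier`, which accepts a class-to-weight mapping; whether the authors used that interface is not stated in the abstract.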

Original language	English
Pages (from-to)	4020-4028
Number of pages	9
Journal	International Journal of Electrical and Computer Engineering
Volume	13
Issue number	4
DOIs	https://doi.org/10.11591/ijece.v13i4.pp4020-4028
Publication status	Published - Aug 2023
Externally published	Yes

Keywords

Light gradient boosting machine
Machine learning
Speech
Speech emotion
Speech emotion recognition

ASJC Scopus subject areas

General Computer Science
Electrical and Electronic Engineering

Cite this

@article{c0ee385f67a546d58f50f80633833393,

title = "Speech emotion recognition with light gradient boosting decision trees machine",

keywords = "Light gradient boosting machine, Machine learning, Speech, Speech emotion, Speech emotion recognition",

author = "Ong, {Kah Liang} and Lee, {Chin Poo} and Lim, {Heng Siong} and Lim, {Kian Ming}",

year = "2023",

month = aug,

doi = "10.11591/ijece.v13i4.pp4020-4028",

language = "English",

volume = "13",

pages = "4020--4028",

journal = "International Journal of Electrical and Computer Engineering",

issn = "2088-8708",

publisher = "Institute of Advanced Engineering and Science",

number = "4",

}

TY - JOUR

T1 - Speech emotion recognition with light gradient boosting decision trees machine

AU - Ong, Kah Liang

AU - Lee, Chin Poo

AU - Lim, Heng Siong

AU - Lim, Kian Ming

PY - 2023/8

Y1 - 2023/8

KW - Light gradient boosting machine

KW - Machine learning

KW - Speech

KW - Speech emotion

KW - Speech emotion recognition

UR - http://www.scopus.com/inward/record.url?scp=85151899421&partnerID=8YFLogxK

U2 - 10.11591/ijece.v13i4.pp4020-4028

DO - 10.11591/ijece.v13i4.pp4020-4028

M3 - Article

AN - SCOPUS:85151899421

SN - 2088-8708

VL - 13

SP - 4020

EP - 4028

JO - International Journal of Electrical and Computer Engineering

JF - International Journal of Electrical and Computer Engineering

IS - 4

ER -
