Abstract
Deep neural networks have been applied to audio spectrograms for respiratory sound classification. Existing models often treat the spectrogram as a synthetic image and overlook its physical characteristics. In this paper, a Multi-View Spectrogram Transformer (MVST) is proposed to embed different views of the time-frequency characteristics into a vision transformer. Specifically, the proposed MVST splits the mel-spectrogram into patches of different sizes, representing multi-view acoustic elements of a respiratory sound. The patches, together with positional embeddings, are fed into transformer encoders, which extract attentional information among patches through the self-attention mechanism. Finally, a gated fusion scheme automatically weights the multi-view features, highlighting the most informative view for a given input. Experimental results on the ICBHI dataset demonstrate that MVST significantly outperforms state-of-the-art methods for respiratory sound classification. The code is available at: https://github.com/wentaoheunnc/MVST.
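The sketch below illustrates the two ideas the abstract describes: patchifying the mel-spectrogram at several patch sizes (one transformer branch per view) and fusing the pooled view features with learned gates. It is a minimal illustration, not the authors' released implementation (see the GitHub link above); the class names, patch sizes, embedding dimension, and input resolution are all assumptions chosen for a self-contained example.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: ViewBranch / GatedFusionMVST and all hyperparameters
# (patch sizes, embed_dim, depth, input size) are hypothetical, not the paper's.

class ViewBranch(nn.Module):
    """One view: patchify the mel-spectrogram at a given patch size,
    add positional embeddings, and encode with a transformer."""
    def __init__(self, patch_size, in_chans=1, embed_dim=192,
                 depth=4, num_heads=4, img_size=(128, 256)):
        super().__init__()
        # Non-overlapping patch projection: stride equals the patch size.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        num_patches = (img_size[0] // patch_size[0]) * (img_size[1] // patch_size[1])
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x):                                  # x: (B, 1, F, T)
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # (B, N, D)
        tokens = self.encoder(tokens + self.pos_embed)
        return tokens.mean(dim=1)                          # (B, D) pooled feature


class GatedFusionMVST(nn.Module):
    """Runs several patch-size views and fuses them with learned gates."""
    def __init__(self, patch_sizes=((16, 16), (16, 32), (32, 16)),
                 embed_dim=192, num_classes=4):
        super().__init__()
        self.views = nn.ModuleList(ViewBranch(p, embed_dim=embed_dim)
                                   for p in patch_sizes)
        # The gate maps the concatenated view features to one weight per view.
        self.gate = nn.Linear(embed_dim * len(patch_sizes), len(patch_sizes))
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        feats = torch.stack([v(x) for v in self.views], dim=1)        # (B, V, D)
        weights = torch.softmax(self.gate(feats.flatten(1)), dim=-1)  # (B, V)
        fused = (weights.unsqueeze(-1) * feats).sum(dim=1)            # (B, D)
        return self.head(fused)                                       # logits


model = GatedFusionMVST()
logits = model(torch.randn(2, 1, 128, 256))  # batch of 2 mel-spectrograms
print(logits.shape)                          # torch.Size([2, 4])
```

The gate lets the network emphasize, per input, whichever patch granularity best resolves the relevant acoustic structure, rather than averaging the views uniformly.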
| Original language | English |
| --- | --- |
| Journal | ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |
| DOIs | |
| Publication status | Published Online - 18 Mar 2024 |
Keywords
- Respiratory sound classification
- Mel-spectrogram
- Vision Transformer
- ICBHI dataset