Real and synthetic Punjabi speech datasets for automatic speech recognition

Satwinder Singh, Feng Hou, Ruili Wang

Research output: Journal Publication › Article › peer-review

1 Citation (Scopus)

Abstract

Automatic speech recognition (ASR) has been an active area of research. Training with large annotated datasets is key to developing robust ASR systems. However, most available datasets focus on high-resource languages like English, leaving a significant gap for low-resource languages. Punjabi is one such language: despite its large number of speakers, it lacks high-quality annotated datasets for accurate speech recognition. To address this gap, we introduce three labeled Punjabi speech datasets: Punjabi Speech (a real speech dataset) and Google-synth and CMU-synth (synthesized speech datasets). The Punjabi Speech dataset consists of read speech recordings captured in various environments, including both studio and open settings. The Google-synth dataset is synthesized using Google's Punjabi text-to-speech cloud service, and the CMU-synth dataset is created using the Clustergen model available in the Festival speech synthesis system developed by CMU. These datasets aim to facilitate the development of accurate Punjabi speech recognition systems, bridging the resource gap for this important language.

Original language: English
Article number: 109865
Journal: Data in Brief
Volume: 52
DOIs
Publication status: Published - Feb 2024
Externally published: Yes

Keywords

  • Automatic speech recognition
  • Low-resource languages
  • Punjabi language
  • Speech dataset

ASJC Scopus subject areas

  • General
