Automatic collection of transcribed speech for low resources languages

Aguiar, Thales; Da Costa Abreu, Marjory

AGUIAR, Thales and DA COSTA ABREU, Marjory (2023). Automatic collection of transcribed speech for low resources languages. In: 2023 IEEE 13th International Conference on Pattern Recognition Systems (ICPRS). IEEE. [Book Section]

Documents

PDF
ttbacc_dataset_paper.pdf - Accepted Version
Available under License Creative Commons Attribution.

Abstract

Speech is crucial for human communication, and with the rise of voice-based instant messaging and automated chatbots, its importance has grown. While the majority of speech technologies have achieved high accuracy, they fail when tested on accents that deviate from the “standard” of a language. This is more concerning for languages that lack datasets and have scarce literature, such as Brazilian Portuguese. Thus, this paper proposes a methodology to collect and release a speech dataset for Brazilian Portuguese. The method exploits the availability of data and information on video platforms and automatically extracts the audio from TEDx Talks.

More Information

Official URL:

https://ieeexplore.ieee.org/document/10179033

Identifiers

Identification Number:

10.1109/icprs58416.2023.10179033

ORCID for Thales Aguiar: