Automatic collection of transcribed speech for low resources languages

AGUIAR, Thales and DA COSTA ABREU, Marjory (2023). Automatic collection of transcribed speech for low resources languages. In: 2023 IEEE 13th International Conference on Pattern Recognition Systems (ICPRS). IEEE.

ttbacc_dataset_paper.pdf - Accepted Version
Creative Commons Attribution.

Speech is crucial for human communication, and with the rise of voice-based instant messaging and automated chatbots, its importance has grown. While the majority of speech technologies have achieved high accuracy, they fail when tested on accents that deviate from the “standard” of a language. This is more concerning for languages that lack datasets and have scarce literature, such as Brazilian Portuguese. Thus, this paper proposes a methodology to collect and release a speech dataset for Brazilian Portuguese. The method exploits the availability of data and information on video platforms and automatically extracts the audio from TEDx Talks.

Item Type: Book Section
SWORD Depositor: Symplectic Elements
Depositing User: Symplectic Elements
Date Deposited: 21 Jul 2023 08:44
Last Modified: 11 Oct 2023 13:15
