TEDxSK and JumpSK Lecture Speech Corpus
TEDxSK and JumpSK is a new Slovak spoken language resource built from TEDx and Jump Slovensko lectures. The corpus consists of 220 lectures with a total duration of 58 hours. The annotated corpus was generated automatically, in an unsupervised manner, using acoustic speech segmentation based on principal component analysis and automatic speech transcription with two complementary speech recognition systems. To evaluate the quality of the automatic transcription, an evaluation set of 50 manually transcribed lectures with a total duration of 12 hours was created.
Creative Commons 3.0
Please cite the following references if you use the TEDxSK and JumpSK lecture speech corpus in your research:
 J. Staš, D. Hládek, P. Viszlay, T. Koctúr, “TEDxSK and JumpSK: A new Slovak speech recognition dedicated corpus,” Journal of Linguistics, Vol. 68, No. 2, 2017, pp. 346-354.
 J. Staš, T. Koctúr, P. Viszlay, “Automatická anotácia a tvorba rečového korpusu prednášok TEDxSK a JumpSK,” in Proc. of 11th Workshop on Intelligent and Knowledge Oriented Technologies, WIKT & Data a Znalosti 2016, Smolenice, Slovakia, 2016, pp. 127-132.
- training set, speech recordings
- development and evaluation set, speech recordings
- manual transcriptions, v1 (26102016)
- filtration, variable CMS thresholding, config1, v1 (~13.57% WER)
- filtration, variable CMS thresholding, config2, v1 (~9.44% WER)
- filtration, variable CMS thresholding, config3, v1 (~4.94% WER)
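The WER (word error rate) figures quoted for the filtered sets can be read against the usual definition: the number of word substitutions, deletions, and insertions needed to turn the automatic transcript into the manual one, divided by the number of words in the manual transcript. The Python sketch below illustrates that standard definition only; it is not the scoring tool used to produce the numbers above, and the function name `word_error_rate` and the sample sentences are made up for illustration. It also assumes whitespace tokenization with no text normalization, which a real evaluation would likely handle differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical manual and automatic transcripts of one segment.
    manual = "dobrý deň vitajte na prednáške"
    automatic = "dobrý deň vitajte prednáške"
    print(f"WER: {word_error_rate(manual, automatic):.2%}")
```

Corpus-level WER is normally computed by summing the edit counts and reference word counts over all segments before dividing, rather than by averaging per-segment rates.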