Transcription Data

The Kenya Language Corpus (KenCorpus) was founded in early 2021 by Maseno University, the University of Nairobi and Africa Nazarene University. The three universities are jointly building a language corpus and, using machine learning and natural language processing, are working toward tomorrow's African-language chatbot. Although natural language processing has seen considerable modernization and upkeep over the years, KenCorpus aims to take it a step further and process our own African languages on our own devices.

Contact Info

Kisumu Busia Road, Maseno
+254 722 268 484
kencorpus@maseno.ac.ke

This speech dataset includes both read and spontaneous speech, recorded in Kenya with native Swahili speakers. In total, it contains 27 hours 31 minutes 50 seconds of speech from 26 speakers (19 female and 7 male). The recordings are WAV files (.wav): 16-bit, 16 kHz, mono, little-endian. Of the total, 26 hours 32 minutes 37 seconds are read speech and 59 minutes 13 seconds are spontaneous speech. Each audio file has a corresponding transcript.
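
As a minimal sketch, the stated audio format and durations can be checked with Python's standard-library wave module. The function names check_wav and to_seconds are illustrative assumptions, not part of the dataset or any KenCorpus tooling, and the file path passed to check_wav is whatever local copy of a recording you have. The wave module reads standard RIFF (little-endian) PCM files, which is consistent with the format described above.

```python
import wave

# Format stated in the dataset description.
EXPECTED_SAMPLE_WIDTH_BYTES = 2      # 16-bit samples
EXPECTED_SAMPLE_RATE_HZ = 16_000     # 16 kHz
EXPECTED_CHANNELS = 1                # mono


def check_wav(path):
    """Return (matches_format, duration_seconds) for one WAV file.

    Hypothetical helper: compares the file header against the
    format stated in the description and computes its length.
    """
    with wave.open(path, "rb") as wav:
        matches = (
            wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
            and wav.getframerate() == EXPECTED_SAMPLE_RATE_HZ
            and wav.getnchannels() == EXPECTED_CHANNELS
        )
        duration = wav.getnframes() / wav.getframerate()
    return matches, duration


def to_seconds(hours, minutes, seconds):
    """Convert an h/m/s duration to whole seconds."""
    return hours * 3600 + minutes * 60 + seconds


# Durations quoted in the description.
read_speech = to_seconds(26, 32, 37)
spontaneous = to_seconds(0, 59, 13)
total = to_seconds(27, 31, 50)

# The read and spontaneous portions add up to the stated total.
assert read_speech + spontaneous == total
```

Calling check_wav on each recording and summing the returned durations is one way to confirm that a downloaded copy matches the totals quoted here.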

To cite this dataset: