
This project collected text and speech corpora for languages spoken in Kenya. For the KenCorpus project, three languages were strategically selected: Kiswahili, Luhya, and Dholuo. Luhya has several dialects, and three were chosen as a starting point: Lumarachi, Logooli, and Lubukusu. Primary data was collected from the respective language communities and included indigenous stories and other narratives drawn from student compositions, native-language media stations, and publishers. The collection went beyond the conventional religious texts to include other genres, making the corpus more representative of everyday language use in the communities.

Text data: A total of 4,442 texts were collected, including 546 texts for Dholuo, 483 texts for Luhya-Lumarachi, 135 texts for Luhya-Lubukusu, and 359 texts for Luhya-Logooli.

Spontaneous speech data: A total of 1,152 files were collected, amounting to 176hr 29min 46sec of spontaneous speech: 104 files (19hr 10min 57sec) for Kiswahili, 512 files (99hr 3min 8sec) for Dholuo, 138 files (15hr 37min 46sec) for Luhya-Lumarachi, 354 files (30hr 11min) for Luhya-Lubukusu, and 44 files (12hr 26min 55sec) for Luhya-Logooli.
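
For readers who want to check the spontaneous speech totals against the per-language figures, the arithmetic can be sketched as follows. This is a minimal illustration only: the dictionary layout and the `fmt` helper are assumptions made for this example and are not part of the KenCorpus release; the numbers are copied from the figures quoted above.

```python
from datetime import timedelta

# (files, hours, minutes, seconds) per language, copied from the figures above.
# The dictionary layout is an assumption made for this illustration.
speech = {
    "Kiswahili": (104, 19, 10, 57),
    "Dholuo": (512, 99, 3, 8),
    "Luhya-Lumarachi": (138, 15, 37, 46),
    "Luhya-Lubukusu": (354, 30, 11, 0),
    "Luhya-Logooli": (44, 12, 26, 55),
}

# Sum the file counts across all languages.
total_files = sum(files for files, _, _, _ in speech.values())

# Sum the durations as timedelta objects so carries are handled for us.
total_time = sum(
    (timedelta(hours=h, minutes=m, seconds=s) for _, h, m, s in speech.values()),
    timedelta(),
)


def fmt(td: timedelta) -> str:
    """Format a duration in the 'Xhr Ymin Zsec' style used on this page."""
    secs = int(td.total_seconds())
    hours, rem = divmod(secs, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours}hr {minutes}min {seconds}sec"


print(total_files, fmt(total_time))  # 1152 176hr 29min 46sec
```

The per-language file counts and durations add up to the stated totals of 1,152 files and 176hr 29min 46sec.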
