Over the last two years, a team of researchers has developed KenCorpus, an open-source collection of text and speech data in three prominent Kenyan languages: Swahili, Dholuo, and Luhya. Others have led the collection of a Swahili dataset through crowdsourcing on Mozilla's Common Voice (MCV) platform. These datasets are fundamental resources for developers who want to build applications such as chatbots or automatic translation services. But how do you use such datasets as an aspiring NLP developer? And how can such a repository keep growing with and for communities?