Lost in translation – How Africa is trying to close the AI language gap

Although Africa is home to a huge proportion of the world’s languages – well over a quarter according to some estimates – many are missing when it comes to the development of AI.

 

This is an issue of both a lack of investment and a lack of readily available data.

 

Most AI tools in use today, such as ChatGPT, are trained on English as well as other European languages and Chinese.

 

These have vast quantities of online text to draw from.

 

But as many African languages are mostly spoken rather than written down, there is too little text on which to train AI that would be useful for speakers of those languages.

 

For millions across the continent this means being left out.

 

Researchers who have been trying to address this issue have recently released what is thought to be the largest known dataset of African languages.

 

“We think in our own languages, dream in them and interpret the world through them. If technology doesn’t reflect that, a whole group risks being left behind,” the University of Pretoria’s Prof Vukosi Marivate, who worked on the project, tells the BBC.

 

“We’re going through this AI revolution, imagining all that can be done with it. Now imagine there’s a part of the population that just doesn’t have that access because all the information is in English.”

 

The African Next Voices project brought together linguists and computer scientists to create AI-ready datasets in 18 African languages.

 

That may be just a small portion of the more than 2,000 languages estimated to be spoken across the continent, but those involved in the project say they hope to expand in the future.

 

In two years, the team recorded 9,000 hours of speech across Kenya, Nigeria and South Africa, capturing everyday scenarios in farming, health and education.

 

The languages recorded included Kikuyu and Dholuo in Kenya, Hausa and Yoruba in Nigeria and isiZulu and Tshivenda in South Africa, some of which are spoken by millions of people.

 

“You need some basis to start off with and that’s what African Next Voices is and then people will build on top of that and add their own innovations,” says Prof Marivate, who led the research in South Africa.

 

His Kenyan counterpart, computational linguist Lilian Wanzare, says recording the speech on the continent meant creating data aimed at reflecting how people really live and speak.

 

“We gathered voices from different regions, ages and backgrounds so it’s as inclusive as possible. Big tech can’t always see those nuances,” she says.

 

The project was made possible by a $2.2m (£1.6m) Gates Foundation grant.

 

The data will be open access, allowing developers to build tools that translate, transcribe and respond in African languages.

 

There are already small examples of how AI built on indigenous languages can help solve real-life challenges in Africa, according to Prof Marivate.

Credit: BBC
