In just a few months, ChatGPT has become an essential part of our workflow. I use it to generate titles and descriptions for many of my videos. Today, I thought I’d run an experiment with Pallavi, my colleague, who produces videos for Prajavani, our sister company. Over the past few days, Chandrayaan-3 has been all the buzz. We’ve been doing everything we can to get our videos to rank higher, so I thought we’d try ChatGPT to generate titles for them.
First, I asked ChatGPT to give me a title for a video about Chandrayaan-3. It easily generated the title ‘Chandrayaan-3: India’s Next Leap to the Lunar Frontier.’ However, when Pallavi tried it, the response was entirely in English. We then specified that the title needed to be in Kannada, but the sentence formation in the result was wrong. This highlights the problem of AI models not working well in regional languages.
GPT is a large language model that learns patterns from vast amounts of text in order to produce human-like language. In March 2023, Google released Bard, a competitor to ChatGPT that works in 46 languages, including Indian regional languages. However, the performance of these models in regional languages is still not as good as in English.
The main reason for this is the lack of data in regional languages. English content makes up 45.2% of the training data for GPT, while Kannada, for example, accounts for only 0.0132%. This means that the roughly 70% of Indians who don’t communicate in English are largely shut out of these AI tools.
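To put those two percentages side by side, here is a quick back-of-the-envelope calculation. It uses only the two figures quoted above; the variable names are just for illustration.

```python
# Shares of GPT's training data, in percent, as quoted above.
english_share = 45.2
kannada_share = 0.0132

# How many times more English text than Kannada text the model has seen.
ratio = english_share / kannada_share

print(f"English is represented about {ratio:,.0f} times more than Kannada")
```

In other words, for every unit of Kannada text in the training data, there are several thousand units of English text.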
To address this issue, initiatives like AI4Bharat and Bhashini are being developed. These projects aim to build open-source language AI models that can function in Indian languages. Bhashini provides an API that allows other AI models to translate information between languages. The Ministry of Electronics and Information Technology is also working on the National Language Translation Mission to make texts available in all 22 official Indian languages.
One application of these language AI models is Jugalbandi, a WhatsApp bot that provides information about government schemes. It works in around 10 languages, including Kannada and Hindi. Jugalbandi uses GPT-3 and the Bhashini API to turn scheme data into a conversational service, so that citizens can discover schemes and check their eligibility for them.
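One plausible way to picture how a translation API and a language model fit together in a bot like this is a translate, ask, translate-back loop. The sketch below is only an illustration under that assumption, not Jugalbandi's actual code: `answer_scheme_query`, `translate` and `ask_llm` are hypothetical names, and the two stand-in functions at the bottom are fakes that call neither Bhashini nor GPT.

```python
from typing import Callable

# Hypothetical interfaces, assumed for illustration only.
Translate = Callable[[str, str, str], str]  # (text, source_lang, target_lang) -> text
AskLLM = Callable[[str], str]               # English question -> English answer


def answer_scheme_query(
    user_text: str,
    user_lang: str,
    translate: Translate,
    ask_llm: AskLLM,
    pivot_lang: str = "en",
) -> str:
    """Answer a question asked in a regional language via an English pivot."""
    # 1. Translate the user's question into English.
    english_question = translate(user_text, user_lang, pivot_lang)
    # 2. Let the language model answer in English, where it performs best.
    english_answer = ask_llm(english_question)
    # 3. Translate the answer back into the user's language.
    return translate(english_answer, pivot_lang, user_lang)


# Stand-ins so the sketch runs offline; a real bot would call actual services.
def fake_translate(text: str, source_lang: str, target_lang: str) -> str:
    return f"[{source_lang}->{target_lang}] {text}"


def fake_llm(prompt: str) -> str:
    return f"answer to: {prompt}"


demo_reply = answer_scheme_query(
    "Which schemes am I eligible for?", "kn", fake_translate, fake_llm
)
print(demo_reply)
```

Passing the translator and the model in as functions keeps the flow itself independent of any particular service, which is why the fakes can stand in for them here.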
While there is still a long way to go, steps are being taken to bridge the gap in AI models for regional languages. Data is being scraped daily, and datasets are being created to train these models. The next video will look at how Jugalbandi was trained and at the challenges of training AI models with limited regional language data.