AI for all: Can India bridge LLM gap?
LLMs trained on datasets featuring Indian languages would have many positives. They would help democratise education and access to information for those not proficient in English.
Adya Madhavan
FILE PHOTO: Figurines with computers and smartphones are seen in front of the words "Artificial Intelligence AI" in this illustration taken, February 19, 2024. REUTERS/Dado Ruvic/Illustration/File Photo

In the two years since ChatGPT was released to the public, AI has gone from a sci-fi concept to a buzzword, and then to a tool that more and more people use daily. Like many countries, India has tried to hop on the AI train to capitalise on AI’s potential to transform critical sectors and drive growth. The government of India has created the IndiaAI Mission, an initiative aimed at bolstering the domestic AI ecosystem. As of July-September 2024, it had allocated $125 million to the mission. India has adopted AI to varying degrees across different sectors, but overall, the numbers indicate that the country is willing to adapt to this new technology. For instance, the banking, financial services, and insurance sector has a 58% adoption rate as of this year, according to data from Teamlease Digital.

One would have thought that a country as linguistically and culturally diverse as India would seize this opportunity to develop LLMs and multimodal LLMs in regional languages, thus bringing their benefits to a much wider population. Yet LLMs proficient in Indian languages are still scarce. The Ministry of Electronics and Information Technology released a report that revealed the breakdown of IndiaAI’s budget, and regional language LLMs are likely to get about Rs 200 crore in investments. However, one can’t help but wonder whether funds alone will solve the issue.

India has 22 recognised languages and scores of regional dialects. Merely 10% of the country’s 142.86 crore people speak some English, a figure that does not even reflect fluency. The rest have not been able to effectively reap the benefits of the global LLM wave. Currently, ChatGPT’s capabilities in some of the more widely spoken Indian languages, such as Tamil, Hindi and Bengali, are limited to basic conversational queries and script recognition. Moreover, many users who have fed the LLM native scripts complain of mistranslations. Nuanced and context-specific uses of Indian languages are currently beyond the ambit of ChatGPT, and dialectal variations prove to be another roadblock.

Some Indian startups and companies are now attempting to develop LLMs proficient in Indian languages. There are large language models, such as Tech Mahindra’s Indus 2.0 and CoRoverAi’s BharatGPT, that are being trained on Indian language data. However, these models are yet to have any widespread impact. Building an LLM from the ground up requires large, diverse datasets, research talent and computational power, in addition to vast financial resources.

Widening the reach

Indian language LLMs are an untapped market with immense potential. The number of users for models fluent in native languages would be much higher than the number who currently use LLMs that are fluent only in English. LLMs trained on datasets featuring Indian languages would have many positives. Firstly, they would help democratise education and access to information for those not proficient in English.

Multimodal LLMs would be especially helpful, as outputs across different data modalities such as images and audio would further help with learning complex concepts. Marginalised communities could access healthcare and financial services in their native tongue instead of struggling with platforms that do not cater to their dialects. On the economic front, businesses could leverage Indian language LLMs to reach a broader audience and expand their market. Once models have been trained extensively, AI-powered platforms could bring services like telemedicine and voter registration platforms to the broader population.

Additionally, AI-augmented technologies could help counter disinformation, with increased proficiency in regional languages allowing them to filter a wider range of content. Finally, models trained on regional dialects could help translate and preserve a large slice of Indian culture that is increasingly forgotten.

The current digital divide, rooted in linguistic barriers, stops millions from accessing opportunities and services that are supposed to be available to the public. If built well, regional language LLMs would be able to bridge this gap. LLMs are trained on large and diverse datasets, typically in the form of a large corpus of unlabelled text. While in theory Indian LLMs would be an excellent means to combat many issues, the major roadblock seems to be access to the data required to build these models; there are even questions as to whether the data required is available at all. Until there is enough diverse data available to train and build such models, the government should, perhaps, direct its expenditure to training pre-existing models on the available data. It may not be a catch-all solution, but at present it appears to be much more feasible.

(The writer is a policy researcher at Takshashila Institution)

(Published 10 December 2024, 02:44 IST)