‘India’s AI edge will come from our deep understanding of language’: Mission Bhashini architect

When Mission Bhashini was first conceived under the Ministry of Electronics and Information Technology (MeitY) in 2018-19, it promised to bridge India’s language divide by developing speech-to-text AI models and multilingual translation tools. The timing proved prescient, coming four years before OpenAI launched ChatGPT in 2022 and set off the global AI arms race.

Today, Indian AI startups are racing to build Indic large language models (LLMs) in an effort to catch up with global tech giants. While key players such as Sarvam AI, Krutrim, and the BharatGPT consortium have launched foundational AI models supporting several Indic languages, progress in this domain has been slow due to the lack of digitised, labelled, and cleaned training datasets.

Since Indian language content on the internet is limited, developers have had to source language data from a variety of other places in order to train LLMs that understand how Indians actually speak or ask queries. This is a key challenge that the Bhashini AI mission looks to address.

Over 350 AI models supporting all 22 scheduled Indian languages have been built under Bhashini. The free Bhashini app has crossed one million downloads, and around 200 higher education courses have been translated from English into eight Indian languages with subtitles. Many government ministries are reportedly leveraging Bhashini’s translation technology to build platforms such as the Pehchaan and SabhaSaar apps. However, its Bhasha Daan initiative to crowdsource language data for AI training has struggled to gain traction, recording fewer than 80,000 language samples so far, according to its official website.

In a conversation with The Indian Express, Professor Rajeev Sangal, the founding chair of the Executive Committee of Mission Bhashini and founder director of IIIT Hyderabad, discusses why Bhasha Daan failed to take off, his reservations about open-sourcing Indic language datasets, the roadmap for Bhashini 2.0, and what he expects from the India AI Impact Summit.

Q: Can you explain what it means to be a computational linguist? How did you get into doing this work that brings together language and computers?

Rajeev Sangal: I did my B.Tech in electronics at the Indian Institute of Technology (IIT) Kanpur back in 1975. I went abroad to do my PhD in computer science, since there was no computer science department in India at that time. Then I decided to come back and became a faculty member at IIT Kanpur.

After I returned, I felt that I should work in artificial intelligence, as it was my favourite field. I also felt I should work on problems of great relevance to India. As a result, I chose to focus on natural language processing, specifically in Indian languages and English. That’s how I got into language processing, or computational linguistics as it is also called.

Q: Mission Bhashini was conceived pre-ChatGPT. What happened when ChatGPT was launched in 2022? How did it change things for the AI mission?

Sangal: ChatGPT came a full four years after Mission Bhashini was conceived. Mission Bhashini was already in operation at that time.

However, the goals of the two are slightly different. ChatGPT is, in some sense, a question-answering system: the better the prompt you give, the more detailed the answer you get. Mission Bhashini, on the other hand, focused on translation among Indian languages and from Indian languages to English. In that sense, it has a slightly different goal and a different application.

Of course, advances in one area always influence advances in the other. If we want LLMs to work effectively in Indian languages, machine translation plays a crucial role. For example, when someone asks a question in Hindi, it can be translated into English in the background, processed by ChatGPT, and then translated back into Hindi for the user.
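To make this pivot-translation workflow concrete, here is a minimal sketch in Python. The functions `translate` and `ask_llm` are hypothetical placeholders, not real Bhashini or OpenAI APIs; the only point being illustrated is the Hindi-to-English-to-Hindi round trip described above.

```python
# A minimal sketch of the pivot-translation workflow described above.
# `translate` and `ask_llm` are hypothetical placeholders standing in for a
# machine-translation service and an English-centric LLM; they are not real
# Bhashini or OpenAI API calls.

def translate(text: str, source: str, target: str) -> str:
    """Placeholder: translate `text` from the `source` to the `target` language."""
    raise NotImplementedError("Plug in a real machine-translation service here.")


def ask_llm(prompt: str) -> str:
    """Placeholder: send an English prompt to an LLM and return its answer."""
    raise NotImplementedError("Plug in a real LLM client here.")


def answer_in_hindi(hindi_query: str) -> str:
    # 1. Translate the user's Hindi question into English in the background.
    english_query = translate(hindi_query, source="hi", target="en")
    # 2. Let the largely English-trained LLM answer the English question.
    english_answer = ask_llm(english_query)
    # 3. Translate the answer back into Hindi for the user.
    return translate(english_answer, source="en", target="hi")
```

In practice, the two translation steps would call whatever translation service is available (a Bhashini-style system, for instance), but the exact interfaces are not specified here.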
Most online content is in English, while Indian language content (Hindi, Tamil, Marathi, Bengali, and others combined) makes up less than 0.1 per cent of the internet. English, by contrast, accounts for around 50 per cent.

So to get accurate answers, queries in Indian languages often need to be translated into English first. That’s why translation technology is so important. While machine translation and LLMs are distinct fields, they can work together and complement each other effectively.

Q: What has been the most surprising use case for Bhashini’s AI tools that you didn’t anticipate?

Sangal: I would say that the most interesting use case for me is also somewhat conventional. We have translated recorded lectures and educational course material from English to Indian languages and from one Indian language to another.

However, machines do make errors, so for frequently accessed content, human editing can be applied, especially for educational material like Swayam courses and NPTEL lectures. The Bhashini technology is flexible: it can work in fully automatic mode for speech-to-speech or text-to-text translation, or in a human-edited mode. It was designed with all these use cases in mind.

Q: Bhasha Daan was launched to crowdsource voice and language data from citizens. How many voice recordings and data samples have been collected so far across all languages? Which languages have seen the strongest and weakest participation?

Sangal: I think Bhasha Daan has not really worked. We expected Indians to contribute, but I think our people haven’t yet learned how to contribute their time and effort so that nation-building takes place. Sometimes people need to be energised, so that aspect of the mission needs to be undertaken, where you socially inspire people. You tell people: if you love your language, this is something you must do.

Because it isn’t a platform where everyday users naturally interact, Bhasha Daan contributions usually happen only when they fit into people’s normal routines. There is a Bhashini mobile app, and the platform is also available on desktop. If it were integrated with popular services like search engines or messaging platforms, such as WhatsApp or the Indian alternative, Zoho’s Arattai, data collection would likely increase. In that case, the data would go to whoever operates the commercial platform. Still, we’re glad that Indians are contributing and helping build these language datasets.

Q: Do you think global tech companies are also using Bhashini’s datasets to improve their AI models? You have mentioned that Bhashini was made open source despite having some reservations. Could you share what those reservations were?

Sangal: My reservation was that this is Indian language data created by the Indian government using taxpayers’ money, and it should primarily be made available to Indian researchers, academic institutions, and startups so that we can build technology that competes with the multinationals.
If you look at the multinational giants, they have hundreds of times more data than us. They have infinitely more computing resources than we have. And they also have a greater amount of money.

However, our data is of higher quality. The data they have is perhaps of lower quality because it is not collected under very good conditions.

I did have some reservations about making Bhashini open source. But we later realised, and this was also pointed out by others in government and the ministry, that whenever restrictions are placed on technology or datasets, it becomes harder for Indian researchers to access them. Multinational giants still find ways to get the data, while our own people are denied access. So we decided to accept that reality and make it open.

I fully support open-sourcing data within the Indian research community, but I do have concerns when multinational companies gain access to it.

Q: Is Bhashini also working on creating synthetically generated language datasets?

Sangal: Synthetic datasets do not work out well for training language models. Synthetic means you start with a dataset and increase the amount of data by introducing variations. However, language models require high-quality data. People have also talked earlier about scraping language data from the internet and aligning it to produce parallel translations, but those efforts have yielded very poor quality data which has not helped at all.

You actually need to go and collect the data from all these different regions. Only then will your AI training give you good results.

Q: Where does Bhashini fit in the broader IndiaAI Mission? How much compute has been set aside for Bhashini under the IndiaAI Mission?

Sangal: Bhashini is India’s first AI mission, launched even before the larger IndiaAI Mission, which is now being rolled out. It received formal funding in 2022.

As the country’s first AI mission, Bhashini has already delivered strong results. Being a focused initiative, it brought together the right components, such as speech processing, language processing, and a range of related technologies like named entity recognition, lip synchronisation, and disfluency removal. All these elements, which lie at the intersection of speech and language, were developed in a coordinated, targeted manner.

Bhashini offers an important lesson for India’s broader AI ecosystem: to succeed in any area of AI, one must define a clear focus and integrate all the necessary elements specific to that domain. The IndiaAI Mission is a much larger effort, and these lessons will be crucial as it begins, particularly in understanding how to build effective technology and then take it to the next stage, bringing it to users.

I don’t think a decision on compute for Bhashini has been made. The mission requires very little compute power for training. Once the training is done, one goes to commercial organisations to actually build the applications. It is not compute power, which I’m sure will be made available, but the entire design of Bhashini that needs to change once it is at a mature stage.

Q: Since Mission Bhashini is expected to come to an end in March 2026, what do you think should be the key priorities for Bhashini 2.0?

Sangal: Bhashini 2.0 should not be merged with IndiaAI. It should remain a separate mission. Why? Because Bhashini is at an advanced stage, while the IndiaAI Mission would be just starting.
And the requirements of the two missions would be different.

The datasets that are available today for training simply lack discourse markers. Discourse markers mean you don’t translate sentence by sentence; you translate paragraph by paragraph. They also lack prosody markers. So when Mission Bhashini was conceived, we had also said that we should focus on these two areas in particular.

In Bhashini, we started that research. But the results of that research will come more slowly, because Bhashini has been more occupied with building systems for all Indian languages, collecting data, developing training modules, and so on.

I hope that the main models from Bhashini remain open under the second phase of the mission. If startups want to adopt Bhashini’s language models, they may need to retrain them on data specific to their application or domain. While retraining requires less compute, it requires know-how that research groups in Bhashini possess. Hence, we need to set up an arm within Bhashini that can support startups, not just with compute power, but by showing them what needs to be done. This was written into the original mission document but has not been implemented yet.

Q: The global AI race is often seen as a contest between the United States and China. How do you view India’s progress in AI? Do you believe the focus on developing AI models for Indian languages could give the country an edge?

Sangal: AI is an evolving technology, and in the West it has largely advanced through a brute-force approach: using massive compute power and enormous, often low-quality datasets to achieve results.

China, with DeepSeek, has shown that it’s possible to achieve comparable results with far less data. I believe we can go a step further: not just smaller data, but smarter data.

Understanding how language conveys meaning, and how sentences connect through discourse and prosody, can dramatically reduce the amount of data and compute needed. Instead of translating sentence by sentence, we should focus on paragraph-level translation that captures these connections. This theoretical grounding can help us build better, more efficient models that don’t hallucinate or lose context.

India is well positioned to follow this path, but too often we try to replicate what’s being done in the West, where countries already have greater resources, compute power, and expertise. Competing on those terms will not help us stand out.

Q: The India AI Impact Summit is scheduled to be held in February 2026. What does Mission Bhashini plan on showcasing there, and what are you expecting from the summit?

Sangal: Of course, there are major issues in AI that need to be discussed at the summit. You mentioned some of them, such as bias, the ethical use of AI, privacy, and control over people. Then there are broader concerns, such as robots taking over.

These are complex topics that deserve a separate and detailed discussion. At the summit, not all of these issues will be covered, but some will certainly be addressed. These are also immediate concerns for India’s high-tech, software, and IT services industries, which are already feeling the impact of AI, and that impact will only grow in the coming years.