
BharatGen: IIT Bombay Launches India's Sovereign AI for All 22 Scheduled Languages

IIT Bombay unveiled BharatGen at Bharat Innovates 2026 in Nice: a sovereign AI ecosystem covering all 22 scheduled Indian languages, with models for text, speech, and documents. It is backed by DST and the IndiaAI Mission with ₹988.6 crore in funding. Here is what it includes and why it matters.

6 min read · Yash Thakker
India AI · Open Source AI · Multilingual AI · IIT Bombay · BharatGen

India's Sovereign AI Goes Public

On June 15, 2026, at Bharat Innovates 2026 in Nice, France, IIT Bombay formally presented BharatGen to the world — a sovereign AI ecosystem built for India's 1.4 billion people across all 22 scheduled languages.

The announcement, which drew 118,900 views on X within hours, represents the culmination of a multi-year national effort by 9 premier academic institutions and 60+ researchers, engineers, and linguists, backed by India's Department of Science and Technology, the IndiaAI Mission, and ₹988.6 crore (roughly ₹9.9 billion) in funding.

BharatGen is not a single model. It is a four-family ecosystem covering the full stack of language interaction — text, speech, and documents — in every officially scheduled Indian language.


The Four Model Families

Param2 — Foundational Text Model

The cornerstone of BharatGen. Param2 is a foundational large language model that works across all 22 scheduled Indian languages with:

  • Reasoning capabilities (multi-step problem solving)
  • Coding support
  • Tool calling for agentic use cases

Param2 is built to handle not just translation but genuine native-language understanding — the cultural nuances, idioms, and domain knowledge that Western models trained primarily on English-language data routinely miss in Indian language contexts.

The use cases highlighted span governance, healthcare, education, insurance, finance, and cultural preservation — domains where the language gap between frontier AI and India's actual population has been most acute.

Shrutam2 — Multilingual Speech Recognition

Automatic speech recognition across Indian languages. Many regions of India have a predominantly oral culture: literacy rates vary significantly, and for hundreds of millions of users, voice is the primary interaction modality.

Shrutam2 addresses the specific challenges of Indian speech recognition: phonetic complexity, tonal variations, code-switching (mixing multiple languages within a single utterance), and the acoustic diversity across India's geographic spread.

Sooktam2 — Text-to-Speech with Voice Cloning

Text-to-speech synthesis across Indian languages, with a notable capability: zero-shot voice cloning. The model can reproduce a target speaker's voice characteristics without fine-tuning on that speaker's data — enabling personalised speech synthesis for applications ranging from accessibility tools to personalised education.

Zero-shot voice cloning in multilingual Indian language contexts is technically demanding — Indian language prosody and phonology differ substantially from the Latin-script languages where most voice cloning research has been conducted. This is a meaningful technical achievement.

Patram — Document Vision Model

A vision-language model specifically designed for understanding Indian documents. This is a more specialised challenge than it might appear: Indian paperwork spans multiple scripts (Devanagari, Tamil, Bengali, Telugu, and others), mixed-language content, handwritten text, and domain-specific formats used in Indian governance, legal, and financial systems, all of which generic document AI models handle poorly.

Patram is positioned as infrastructure for digitising and understanding the enormous volume of India's existing document corpus — from government records to land registry documents to healthcare records.
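To make the mixed-script problem concrete, here is a minimal Python sketch that counts which scripts appear in a line of text using Unicode block ranges. It is an illustration only and says nothing about how Patram works internally; the function names `script_of` and `script_histogram` and the sample string are assumptions made for this example, and only a handful of scripts are covered.

```python
from collections import Counter
from typing import Optional

# Unicode block ranges for a few major Indic scripts.
# This table is deliberately partial; it is not a full list of scripts used in India.
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Gurmukhi": (0x0A00, 0x0A7F),
    "Gujarati": (0x0A80, 0x0AFF),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
    "Kannada": (0x0C80, 0x0CFF),
    "Malayalam": (0x0D00, 0x0D7F),
}


def script_of(ch: str) -> Optional[str]:
    """Return the script name for one character, or None if unclassified."""
    cp = ord(ch)
    for name, (lo, hi) in SCRIPT_RANGES.items():
        if lo <= cp <= hi:
            return name
    if ch.isascii() and ch.isalpha():
        return "Latin"
    return None  # spaces, ASCII digits, punctuation, uncovered scripts


def script_histogram(text: str) -> Counter:
    """Count characters per script, skipping unclassified characters."""
    return Counter(s for s in map(script_of, text) if s is not None)


# Hypothetical form field mixing Devanagari, Latin, and Tamil.
sample = "नाम: Asha  பெயர்: ஆஷா"
print(script_histogram(sample))
```

A single form field like the sample already mixes three scripts, which is why a pipeline that assumes one script per document tends to break on this kind of input.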


The Dataset: India's Largest Open AI Corpus

Underlying all four model families is what BharatGen describes as the world's largest dataset of its kind focused on underrepresented Indian data:

  • Text, speech, and images tied to Indian languages, culture, history, and philosophy
  • 15,000+ hours of annotated voice data across 22 Indian languages
  • Secure, version-controlled corpus for reproducibility
  • Coverage of rural dialects and urban contexts
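The article does not say how the corpus is versioned. As a rough sketch of why versioning helps reproducibility, the Python below builds a content-hash manifest: each file is pinned by its SHA-256 digest, and the version ID is derived from all digests, so any change to the data yields a new version. The function names, file paths, and manifest layout are hypothetical and are not BharatGen's actual tooling.

```python
import hashlib
import json
from typing import Dict, List


def file_digest(data: bytes) -> str:
    """SHA-256 hex digest of one corpus file's bytes."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(files: Dict[str, bytes]) -> dict:
    """Map each path to its digest, plus a version ID derived from all digests."""
    entries = {path: file_digest(data) for path, data in files.items()}
    # Sorted keys give a canonical encoding, so the same files always
    # produce the same version ID regardless of insertion order.
    canonical = json.dumps(entries, sort_keys=True).encode("utf-8")
    version = hashlib.sha256(canonical).hexdigest()[:12]
    return {"version": version, "files": entries}


def changed_files(old: dict, new: dict) -> List[str]:
    """Paths added, removed, or modified between two manifests."""
    paths = set(old["files"]) | set(new["files"])
    return sorted(p for p in paths if old["files"].get(p) != new["files"].get(p))


# Hypothetical two-file corpus, before and after an edit to one file.
v1 = build_manifest({"hi/train.txt": b"namaste", "ta/train.txt": b"vanakkam"})
v2 = build_manifest({"hi/train.txt": b"namaste!", "ta/train.txt": b"vanakkam"})
print(v1["version"] != v2["version"])
print(changed_files(v1, v2))
```

A training run that records the manifest version can later be checked against exactly the same data, which is the practical meaning of a reproducible corpus.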

The dataset itself is a significant contribution independent of the models. India's AI development has been constrained by the absence of high-quality, culturally representative training data in Indian languages. BharatGen's dataset, partially released as open source, eases that constraint for the entire research community.


Why Sovereign AI Matters for India

The framing of BharatGen as "sovereign AI" is deliberate and politically significant. Three concerns motivate it:

1. Data sovereignty. When Indian citizens interact with AI models trained primarily on Western data and hosted on Western infrastructure, their data flows through systems India does not control. A sovereign AI ecosystem keeps that data — and the value derived from it — within Indian institutions.

2. Cultural representation. AI systems trained predominantly on English-language data encode cultural assumptions that may not apply to Indian contexts. Legal norms, medical practices, educational conventions, and social structures differ — and AI systems that don't understand those differences produce worse outcomes for Indian users even when translated into Indian languages.

3. Capability independence. India's experience with the US export ban on Fable 5 illustrates the vulnerability of depending on foreign-controlled frontier AI. A domestically developed and controlled AI ecosystem provides resilience against access restrictions.

These are not abstract concerns. They map directly onto BharatGen's target domains: governance, healthcare, and education are areas where the Indian state cannot afford dependency on AI infrastructure it does not control.


The Institutional Architecture

What distinguishes BharatGen from previous Indian AI initiatives is its institutional depth. The project is structured as:

  • Lead institution: IIT Bombay, Department of Computer Science and Engineering
  • Leadership: Prof. Ganesh Ramakrishnan (academic lead), Rishi Bal (CEO), Dr. Maneesh Singh (VP, Machine Learning)
  • Consortium: 9 premier Indian academic institutions
  • Team: 60+ researchers, engineers, linguists
  • Funding: DST + IndiaAI Mission, ₹988.6 crore secured

The involvement of 9 institutions rather than a single lab signals an attempt to build durable infrastructure rather than a one-time research project. The presence of a CEO suggests commercialisation is a design goal, not an afterthought.


What India's AI Ecosystem Gets

BharatGen's launch changes the landscape for Indian AI development in several ways:

For developers: Open-source model weights for the text, speech recognition, and text-to-speech models, with training recipes, so Indian developers can build on BharatGen without rebuilding from scratch.

For enterprises: Production-ready models for governance, healthcare, and finance domains in all 22 scheduled Indian languages, with IIT Bombay's research backing.

For researchers: The dataset corpus and benchmarks tailored to Indian language performance — enabling rigorous evaluation of Indian language AI that previous infrastructure did not support.

For policymakers: An Indian-controlled AI stack that can be deployed in sensitive domains without foreign data dependencies.


Where BharatGen Fits Globally

BharatGen is the most comprehensive Indian-language AI initiative to date, but it exists in a global context where language-specific sovereign AI is becoming a policy priority across multiple countries. France has Mistral (and its own language concerns, as illustrated by the Le Chaton Fat phenomenon), China has its domestic model ecosystem, and now India has BharatGen.

The pattern suggests we are entering a period of AI multipolarity — not a single global frontier model, but multiple national or regional AI ecosystems serving their own populations and regulatory contexts. BharatGen is India's entry into that multipolar world.

For Indian AI-native companies building on top of frontier models, BharatGen creates a new option: build on infrastructure that understands Indian languages natively, is controlled domestically, and does not expose user data to foreign jurisdictions.

