Sarvam AI, an emerging player in India’s generative AI space, has launched a new language model trained specifically for Indian languages.
The new AI model, called Sarvam-1, is open-source and supports ten Indian languages – Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu – besides English.
The Bengaluru-based company had launched its first foundational AI model, Sarvam 2B, in August this year. However, it claims that Sarvam-1 is unique because it “demonstrates that careful curation of training data can yield superior performance even with a relatively modest parameter count.”
The newly released AI model has 2 billion parameters. Parameter count is often used as a rough indicator of an AI model’s complexity and of its capacity to map inputs to outputs. For context, Microsoft’s Phi-3 Mini has 3.8 billion parameters.
AI models like Sarvam-1 and Phi-3 Mini fall under the category of small language models (SLMs), which have fewer than ten billion parameters, as opposed to large language models (LLMs) like OpenAI’s GPT-4, which reportedly has more than a trillion.
Notably, Sarvam AI said that its latest AI model was trained on 1,024 Graphics Processing Units (GPUs) supplied by data infrastructure company Yotta, using NVIDIA’s NeMo framework.
Sarvam-1’s training approach also sets it apart. “A key challenge in developing effective language models for Indian languages has been the scarcity of high-quality training data,” the company said, adding that existing datasets often lack the depth, diversity, and quality necessary for training world-class models.
For this reason, the company said it built its own training corpus, Sarvam-2T, consisting of an estimated 2 trillion tokens distributed roughly evenly across the ten languages. The dataset was built using synthetic data generation techniques to sidestep the depth and quality issues in Indic-language data scraped from the web.
While about 20 per cent of the Sarvam-2T dataset is Hindi, a substantial chunk of it consists of English and programming-language code, which the company says helps the model perform both monolingual and multilingual tasks.
Sarvam-1 is also said to handle Indic scripts more efficiently than previous LLMs, using fewer tokens per word. The company claims that Sarvam-1 surpassed larger AI models like Meta’s Llama-3 and Google’s Gemma-2 on benchmarks such as MMLU, ARC-Challenge, and IndicGenBench.
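The tokens-per-word claim is something readers can check themselves with any tokenizer. The snippet below is a minimal sketch of such a comparison, assuming the model is published under the Hugging Face repository id sarvamai/sarvam-1 (GPT-2’s tokenizer stands in here for an older English-centric model):

```python
# A rough check of tokenizer efficiency ("fertility") on Indic text.
# The repo id "sarvamai/sarvam-1" is an assumption - verify it on Hugging Face.
from transformers import AutoTokenizer

text = "भारत एक विशाल और विविधतापूर्ण देश है।"  # a short Hindi sample sentence

for repo_id in ("sarvamai/sarvam-1", "gpt2"):
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    tokens_per_word = len(tokenizer.tokenize(text)) / len(text.split())
    # Fewer tokens per word means shorter sequences, and therefore
    # cheaper and faster training and inference on Indic text.
    print(f"{repo_id}: {tokens_per_word:.2f} tokens per word")
```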
It achieved an accuracy of 86.11 across Indic languages on the TriviaQA benchmark, well above the 61.47 scored by Meta’s Llama-3.1 8B.
Sarvam-1 is also said to be more computationally efficient, with inference speeds 4-6 times faster than those of larger models like Gemma-2-9B and Llama-3.1-8B. “This combination of strong performance and superior inference efficiency makes Sarvam-1 particularly well-suited for practical applications, including on edge devices,” the company said.
Sarvam-1 is available for download on Hugging Face, an online repository for open-source AI models.
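For readers who want to try the model, the snippet below is a minimal sketch of loading it with the Hugging Face transformers library; the repository id sarvamai/sarvam-1 is again an assumption to be checked against the model card:

```python
# A minimal sketch of running Sarvam-1 for text generation.
# The repo id "sarvamai/sarvam-1" is an assumption - check Hugging Face.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "sarvamai/sarvam-1"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # a 2B-parameter model fits on a single consumer GPU
    device_map="auto",
)

prompt = "भारत की राजधानी"  # "The capital of India" in Hindi
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```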