
Anoop Kunchukuttan
@anoopk
Followers: 1K · Following: 635 · Media: 12 · Statuses: 910
I am a researcher in the Machine Translation group at Microsoft India, and co-lead and co-founder of AI4Bharat, a research center at IIT Madras for Indian-language NLP.
Hyderabad, India
Joined September 2008
Anyone got access to the Param 1 models?
India’s open-source AI ecosystem just got a powerful new foundation. BharatGen, the government-backed AI initiative under @GoI_MeitY, has unveiled Param 1, a 2.9-billion-parameter bilingual language model, boasting an unprecedented 25% Indic data, compared to Llama’s mere 0.01%.
RT @soumithchintala: the PyTorch Foundation is becoming an umbrella for great AI open-source projects. @vllm_project and @DeepSpeedAI are…
All of AI4Bharat's models and datasets are on Hugging Face now, with getting-started scripts! It is an effort to ensure ease of use for all our offerings and to foster research and innovation for Indian languages.
🚀 AI4Bharat: Advancing Indian Language AI - Open & Scalable! 🇮🇳✨ Over the past 4 years, we at AI4Bharat have been on a mission to accelerate Indian language AI 🚀, building large-scale datasets, models, and tools and releasing everything open-source for the community. Now, all…
Try out our latest LLM and speech translation models!
IndicTrans3 and IndicSeamless have arrived! IndicTrans3: following in the steps of IndicTrans2, we have released a beta version of IndicTrans3 (yes, a better one is coming soon). It works at the sentence as well as the document level. It's lightweight, and of fairly high…
Happy to share IndicSeamless - a step towards broad-coverage multi-modal Indic language technology! Kudos to the team for undertaking this effort! @prajdabre1 - thanks for driving this!
📢 Presenting IndicSeamless: A Speech Translation Model for Indian Languages 🎙️🌍 IndicSeamless is a speech translation model fine-tuned from SeamlessM4Tv2-large on 13 Indian languages. Trained on a curated subset of BhasaAnuvaad, the largest open-source speech translation…
Timely and extremely well said. This is nice: "Remember Douglas Adams' Hitchhiker's Guide? The answer is apparently 42, but nobody knows the right question. That's research in a nutshell."
I shared a controversial take the other day at an event and I decided to write it down in a longer format: I’m afraid AI won't give us a "compressed 21st century". The "compressed 21st century" comes from Dario's "Machines of Loving Grace" and if you haven’t read it, you probably…
RT @ravi_iitm: One of the greatest privileges I enjoyed is that I worked with Andy Barto during my PhD! He is a great mentor as well as a…
RT @rohanpaul_ai: The Chain-of-Thought (CoT) method improves LLM reasoning. However, relying on natural language for reasoning may be inef…
Very useful. Another wonderful read is the Tulu 3 paper by Allen AI: a completely documented, open-source effort for LLM post-training. We need more reports like that:
arxiv.org
Language model post-training is applied to refine behaviors and unlock new skills across a wide range of recent language models, but open recipes for applying these techniques lag behind...
Beautiful paper. A comprehensive survey of post-training methods, including fine-tuning, reinforcement learning, and test-time scaling to refine LLM reasoning. Methods explored in this paper 🔧: → Systematically explores fine-tuning techniques that adapt LLMs for specific…
RT @Thom_Wolf: if you didn't have time yet to read the 100-page "Ultrascale Playbook", I gathered all the NotebookLM audios spread along th…
New datasets for low-resource translation. We need more of these. Just to point out: AI4Bharat's BPCC corpus contains training and test data (IN22) for India's 22 scheduled languages, of which about 10 are extremely low-resource. (1/2)
arxiv.org
India has a rich linguistic landscape with languages from 4 major language families spoken by over a billion people. 22 of these languages are listed in the Constitution of India (referred to as...
Two new datasets from Google Translate targeting high- and low-resource languages! WMT24++: 46 new en->xx languages added to WMT24, bringing the total to 55. SMOL: 6M tokens for 115 very low-resource languages. WMT24++: SMOL:
Interesting!
Honored to share SAMANVAYA! 📖 This project dives deep into India’s linguistic roots, crafting an interlingua inspired by Sanskrit & native languages. It’s about more than grammar - it’s about connection. Details: #SAMANVAYA #IndianLanguages
@prajdabre1 @ratishsp Romanization probably plays a role in ensuring better cross-lingual transfer from English. This analytical study supports the positive results we saw when feeding romanized text to language models in our previous work, RomanSetu. (4/n)
@prajdabre1 @ratishsp Just as previous work has shown English-like representations in the intermediate layers during multilingual processing, we see romanized representations for languages with non-Roman scripts in the layers prior to surface generation of the native script. (3/n)