Anoop Kunchukuttan Profile
Anoop Kunchukuttan

@anoopk

Followers
1K
Following
635
Media
12
Statuses
910

I am a researcher in the Machine Translation group at Microsoft India, and co-lead and co-founder of AI4Bharat, a research center at IIT Madras for Indian-language NLP.

Hyderabad, India
Joined September 2008
@anoopk
Anoop Kunchukuttan
2 months
Anyone got access to the Param 1 models?
@Analyticsindiam
AIM
2 months
India’s open-source AI ecosystem just got a powerful new foundation. BharatGen, the government-backed AI initiative under @GoI_MeitY , has unveiled Param 1, a 2.9 billion parameter bilingual language model, boasting an unprecedented 25% Indic data, compared to Llama’s mere 0.01%.
4
1
21
@anoopk
Anoop Kunchukuttan
3 months
RT @soumithchintala: the PyTorch Foundation is becoming an umbrella for great AI open-source projects. @vllm_project and @DeepSpeedAI are….
0
32
0
@anoopk
Anoop Kunchukuttan
5 months
All of AI4Bharat's models and datasets are on Hugging Face now, with getting-started scripts! This is an effort to ensure ease of use for all our offerings and to foster research and innovation for Indian languages.
@ai4bharat
AI4Bharat
5 months
🚀 AI4Bharat: Advancing Indian Language AI - Open & Scalable! 🇮🇳✨. Over the past 4 years, we at AI4Bharat have been on a mission to accelerate Indian language AI 🚀 —building large-scale datasets, models, and tools and releasing everything open-source for the community. Now, all.
8
42
462
@anoopk
Anoop Kunchukuttan
5 months
Try out our latest LLM and speech translation models!
@prajdabre
Raj Dabre
5 months
IndicTrans3 and IndicSeamless have arrived!. IndicTrans3 . Following in the steps of IndicTrans2, we have released a beta version of IndicTrans3 (yes a better one is coming soon). It works at the sentence as well as the document level. It's lightweight, and of fairly high.
0
1
36
@anoopk
Anoop Kunchukuttan
5 months
Happy to share IndicSeamless - a step towards broad-coverage multi-modal Indic language technology! Kudos to the team for undertaking this effort! @prajdabre1 - thanks for driving this!
@ai4bharat
AI4Bharat
5 months
📢 Presenting IndicSeamless: A Speech Translation Model for Indian Languages 🎙️🌍. IndicSeamless is a speech translation model fine-tuned from SeamlessM4Tv2-large on 13 Indian languages. Trained on a curated subset of BhasaAnuvaad, the largest open-source Speech Translation.
1
3
26
@anoopk
Anoop Kunchukuttan
5 months
Timely and extremely well said. This is nice: "Remember Douglas Adams' Hitchhiker's Guide? The answer is apparently 42, but nobody knows the right question. That's research in a nutshell."
@Thom_Wolf
Thomas Wolf
5 months
I shared a controversial take the other day at an event and I decided to write it down in a longer format: I’m afraid AI won't give us a "compressed 21st century". The "compressed 21st century" comes from Dario's "Machine of Loving Grace" and if you haven’t read it, you probably.
1
1
3
@anoopk
Anoop Kunchukuttan
5 months
RT @ravi_iitm: One of the greatest privileges I enjoyed is that I worked with Andy Barto during my PhD! He is a great mentor as well as a….
0
16
0
@anoopk
Anoop Kunchukuttan
5 months
RT @rohanpaul_ai: The Chain-of-Thought (CoT) method improves LLM reasoning. However, relying on natural language for reasoning may be inef….
0
62
0
@anoopk
Anoop Kunchukuttan
5 months
Happy to share this talk on DeepSeek R1 and recent Open Source efforts following the release of DeepSeek. Extended version of the ones I gave recently at IIT Hyderabad and internally at Microsoft. 📜Slides: 📽️Video:
0
4
38
@anoopk
Anoop Kunchukuttan
5 months
Very useful. Also another wonderful read is the Tulu3 paper by Allen AI. Completely documented and open-source effort for LLM post-training. We need more reports like that:
arxiv.org
Language model post-training is applied to refine behaviors and unlock new skills across a wide range of recent language models, but open recipes for applying these techniques lag behind...
@rohanpaul_ai
Rohan Paul
5 months
Beautiful Paper. A comprehensive survey of post-training methods including fine-tuning, reinforcement learning, and test-time scaling to refine LLMs reasoning. Methods Explored in this Paper 🔧:. → Systematically explores fine-tuning techniques that adapt LLMs for specific
0
3
26
@anoopk
Anoop Kunchukuttan
5 months
RT @prayanks: India has a very vibrant AI ecosystem. I myself have met / reviewed close to 2000 startups in last two years. We have been mi….
0
143
0
@anoopk
Anoop Kunchukuttan
5 months
RT @Thom_Wolf: if you didn't have time yet to read the 100-page "Ultrascale Playbook", I gathered all the NotebookLM audios spread along th….
0
81
0
@anoopk
Anoop Kunchukuttan
5 months
We need more North East Indian language data. That's been a blind spot.
0
0
2
@anoopk
Anoop Kunchukuttan
5 months
New datasets for low-resource translation. We need more of these. Just to point out - AI4Bharat's BPCC corpus contains training and test data (IN22) for India's 22 scheduled languages, of which about 10 are extremely low-resource. (1/2)
arxiv.org
India has a rich linguistic landscape with languages from 4 major language families spoken by over a billion people. 22 of these languages are listed in the Constitution of India (referred to as...
@markuseful
Markus Freitag
5 months
Two new datasets from Google Translate targeting high and low resource languages!. WMT24++: 46 new en->xx languages to WMT24, bringing the total to 55.SMOL: 6M tokens for 115 very low-resource languages. WMT24++: SMOL:
2
2
26
@anoopk
Anoop Kunchukuttan
5 months
Interesting!
@ganramkr
Ganesh Ramakrishnan
5 months
Honored to share SAMANVAYA! 📖. This project dives deep into India's linguistic roots, crafting an interlingua inspired by Sanskrit & native languages. It's about more than grammar - it's about connection. Details: #SAMANVAYA #IndianLanguages
0
0
2
@anoopk
Anoop Kunchukuttan
5 months
Any suggestions for challenging reasoning benchmarks in domains other than the math and code ones (MATH500, AIME, SWE-bench, etc.) that papers typically test on?
0
0
4
@anoopk
Anoop Kunchukuttan
5 months
@prajdabre1 @ratishsp Romanization probably plays a role in ensuring better cross-lingual transfer from English. This analytical study kind of supports the positive results we saw when using romanized text input to language models in our previous work RomanSetu. (4/n).
1
0
1
@anoopk
Anoop Kunchukuttan
5 months
@prajdabre1 @ratishsp Just as previous work has shown English representations in the intermediate layers during multilingual processing, we see Romanized representations for languages using non-Roman scripts in layers prior to the surface generation of the native script. (3/n).
1
0
1
@anoopk
Anoop Kunchukuttan
5 months
Work with Alan Saji, Jaavid Aktar, Thanmay Jayakumar, @prajdabre1 and @ratishsp. (2/n)
1
0
0
@anoopk
Anoop Kunchukuttan
5 months
🚨🚨🚨 Excited to share our preprint, "RomanLens: Latent Romanization and its Role in Multilinguality in LLMs". TLDR; Intermediate layers occasionally represent non-Roman scripts in a romanized form - Latent Romanization. Here is a visualization (1/n)
1
2
16
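The RomanLens thread above describes "latent romanization": intermediate layers representing non-Roman-script tokens in a romanized form before the model surfaces the native script. As a toy illustration of what a romanized form of Devanagari text looks like (this is not the paper's method; the character map is a tiny hand-picked subset, and real schemes like ISO 15919 or ITRANS are far more complete):

```python
# Toy Devanagari-to-Roman transliteration sketch.
# Devanagari consonants carry an inherent 'a' vowel; a dependent vowel
# sign or a virama (halant) cancels it. This handles just enough
# characters to romanize two example words.
DEV_TO_ROMAN = {
    "न": "na", "म": "ma", "स": "sa", "त": "ta", "भ": "bha", "र": "ra",
}
VOWEL_SIGNS = {"े": "e", "ा": "aa"}  # dependent vowel signs
VIRAMA = "्"  # suppresses the inherent vowel of the preceding consonant

def romanize(text: str) -> str:
    out = []
    for ch in text:
        if ch == VIRAMA or ch in VOWEL_SIGNS:
            # Both the virama and a vowel sign cancel the inherent 'a'.
            if out and out[-1].endswith("a"):
                out[-1] = out[-1][:-1]
            if ch in VOWEL_SIGNS:
                out.append(VOWEL_SIGNS[ch])
        else:
            # Consonant (or passthrough for unmapped characters).
            out.append(DEV_TO_ROMAN.get(ch, ch))
    return "".join(out)

print(romanize("नमस्ते"))  # namaste
print(romanize("भारत"))   # bhaarata
```

The point of the paper's visualization is that representations resembling this kind of romanized string show up in intermediate layers, before the native script appears at the surface.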