
Anoop Kunchukuttan
@anoopk
Followers: 1K · Following: 635 · Media: 12 · Statuses: 910
I am a researcher in the Machine Translation group at Microsoft India, and co-lead and co-founder of AI4Bharat, a research center at IIT Madras for Indian-language NLP.
Hyderabad, India
Joined September 2008
Anyone got access to the Param 1 models?
India’s open-source AI ecosystem just got a powerful new foundation. BharatGen, the government-backed AI initiative under @GoI_MeitY, has unveiled Param 1, a 2.9-billion-parameter bilingual language model, boasting an unprecedented 25% Indic data, compared to Llama’s mere 0.01%.
RT @soumithchintala: the PyTorch Foundation is becoming an umbrella for great AI open-source projects. @vllm_project and @DeepSpeedAI are…
All of AI4Bharat's models and datasets are on Hugging Face now, with getting-started scripts! It is an effort to ensure ease of use for all our offerings and to foster research and innovation for Indian languages.
🚀 AI4Bharat: Advancing Indian Language AI - Open & Scalable! 🇮🇳✨ Over the past 4 years, we at AI4Bharat have been on a mission to accelerate Indian language AI 🚀, building large-scale datasets, models, and tools and releasing everything open-source for the community. Now, all…
Try out our latest LLM and speech translation models!
IndicTrans3 and IndicSeamless have arrived! IndicTrans3: following in the steps of IndicTrans2, we have released a beta version of IndicTrans3 (yes, a better one is coming soon). It works at the sentence as well as the document level. It's lightweight, and of fairly high…
Happy to share IndicSeamless - a step towards broad-coverage multi-modal Indic language technology! Kudos to the team for undertaking this effort! @prajdabre1 - thanks for driving this!
📢 Presenting IndicSeamless: A Speech Translation Model for Indian Languages 🎙️🌍 IndicSeamless is a speech translation model fine-tuned from SeamlessM4Tv2-large on 13 Indian languages. Trained on a curated subset of BhasaAnuvaad, the largest open-source speech translation…
Timely and extremely well said. This is nice: "Remember Douglas Adams' Hitchhiker's Guide? The answer is apparently 42, but nobody knows the right question. That's research in a nutshell."
I shared a controversial take the other day at an event and I decided to write it down in a longer format: I’m afraid AI won't give us a "compressed 21st century". The "compressed 21st century" comes from Dario's "Machines of Loving Grace" and if you haven’t read it, you probably…
RT @ravi_iitm: One of the greatest privileges I enjoyed is that I worked with Andy Barto during my PhD! He is a great mentor as well as a…
RT @rohanpaul_ai: The Chain-of-Thought (CoT) method improves LLM reasoning. However, relying on natural language for reasoning may be inef…
Very useful. Another wonderful read is the Tulu 3 paper by Allen AI: a completely documented, open-source effort for LLM post-training. We need more reports like that:
arxiv.org
Language model post-training is applied to refine behaviors and unlock new skills across a wide range of recent language models, but open recipes for applying these techniques lag behind...
Beautiful paper. A comprehensive survey of post-training methods, including fine-tuning, reinforcement learning, and test-time scaling to refine LLM reasoning. Methods explored in this paper 🔧: → Systematically explores fine-tuning techniques that adapt LLMs for specific…
RT @Thom_Wolf: if you didn't have time yet to read the 100-page "Ultrascale Playbook", I gathered all the NotebookLM audios spread along th…
New datasets for low-resource translation. We need more of these. Just to point out: AI4Bharat's BPCC corpus contains training and test data (IN22) for India's 22 scheduled languages, of which about 10 are extremely low-resource. (1/2)
arxiv.org
India has a rich linguistic landscape with languages from 4 major language families spoken by over a billion people. 22 of these languages are listed in the Constitution of India (referred to as...
Two new datasets from Google Translate targeting high- and low-resource languages! WMT24++: 46 new en->xx languages added to WMT24, bringing the total to 55. SMOL: 6M tokens for 115 very low-resource languages. WMT24++: SMOL:
Interesting!
Honored to share SAMANVAYA! 📖 This project dives deep into India’s linguistic roots, crafting an interlingua inspired by Sanskrit & native languages. It’s about more than grammar - it’s about connection. Details: #SAMANVAYA #IndianLanguages
@prajdabre1 @ratishsp Romanization probably plays a role in ensuring better cross-lingual transfer from English. This analytical study supports the positive results we saw when feeding romanized text to language models in our previous work, RomanSetu. (4/n)
@prajdabre1 @ratishsp Just as previous work has shown English-like representations in the intermediate layers during multilingual processing, we see romanized representations for languages with non-Roman scripts in the layers prior to surface generation of the native script. (3/n)