Josh Barua
@BaruaJosh
59 Followers · 130 Following · 11 Media · 34 Statuses
Research @UT_Linguistics. Trying to understand language models. Prev @ucberkeley @berkeleynlp, intern @LTIatCMU.
Austin, TX
Joined June 2019
🌍 LLMs can use long chain-of-thought (CoT) to reason in English, but what about other languages? New paper w/ @BerkeleyNLP: We study how scaling, pretraining, post-training & inference affect long CoT across 9 languages. Spoiler: English long CoT ≠ multilingual long CoT 🧵
Delighted Sasha's work using mech interp to study complex syntax constructions won an Outstanding Paper Award at EMNLP! And delighted the ACL community continues to recognize unabashedly linguistic topics like filler-gaps, and the huge potential for LMs to inform such topics!
A key hypothesis in the history of linguistics is that different constructions share underlying structure. We take advantage of recent advances in mechanistic interpretability to test this hypothesis in Language Models. New work with @kmahowald and @ChrisGPotts! 🧵👇
Finally, here are some other great works in this space covering inference-time scaling, language mixing, and language compliance.
Yong et al: https://t.co/vmXcyFdTbI
Son et al: https://t.co/sgEz7VccNf
Wang et al: https://t.co/IThpJMYpvd
Qi et al:
arxiv.org
Recent Large Reasoning Models (LRMs) with thinking traces have shown strong performance on English reasoning tasks. However, their ability to think in other languages is less studied. This...
Huge thank you to my collaborators Seun Eisape @kayo_yin @alsuhr!
Paper: https://t.co/mic0Ld6C2J
Code:
github.com
Code for Paper: Long Chain-of-Thought Reasoning Across Languages [SCALR @ COLM 2025] - Berkeley-NLP/Multilingual-Long-CoT
(2/2) If multilingual data were mixed into the reasoning stage of pretraining, scaling model capacity alone could be sufficient to enable long CoT transfer. Perhaps the ability to generate reasoning in multiple languages is a kind of learned invariance that improves robustness 🤷♂️.
Thoughts (1/2): SFT/RL for long CoT works best when the base model already has the right behaviors; post-training just elicits them. But typical multilingual pretraining still relies on documents like Wikipedia, which are good for imparting general knowledge but less so for teaching reasoning.
🔎 The failure modes diverge: We observe that while English reasoning fails due to logical errors, target-language reasoning often fails due to knowledge gaps and repetitive generation before it even reaches the reasoning stage, suggesting much work is needed (especially in pretraining).
🔎 CoT analysis: Do different languages fail differently? We categorize errors into 5 types and analyze CoTs with Gemini-2.5-Pro across English vs. target languages.
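A minimal sketch of how this kind of LLM-as-judge categorization could look with the Gemini API. Only the first three labels are named in this thread; the remaining two and the prompt wording are placeholders, not the paper's rubric.

```python
# Sketch of LLM-as-judge error categorization with Gemini 2.5 Pro.
# The first three labels appear in this thread; the last two are placeholders,
# and the prompt wording is an assumption rather than the paper's rubric.
from google import genai

ERROR_TYPES = [
    "logical error",          # named in the thread
    "knowledge gap",          # named in the thread
    "repetitive generation",  # named in the thread
    "comprehension error",    # placeholder
    "other",                  # placeholder
]

client = genai.Client()  # reads the API key from the environment

def categorize_failure(question: str, cot: str, answer: str, gold: str) -> str:
    prompt = (
        "You are auditing a model's chain-of-thought on a math problem.\n"
        f"Problem: {question}\nChain of thought: {cot}\n"
        f"Model answer: {answer}\nGold answer: {gold}\n"
        f"Classify the failure as exactly one of: {', '.join(ERROR_TYPES)}."
    )
    response = client.models.generate_content(model="gemini-2.5-pro", contents=prompt)
    return response.text.strip().lower()
```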
🧪 Post-training results: Translated data > distilled data (suggesting MT is a viable path for creating synthetic reasoning data). Target-language SFT is highly sample-efficient: target-language traces outperform English traces for mid/low-resource languages with 20x less data.
🧪 Post-training challenge: High-quality reasoning traces barely exist outside EN/ZH. We test two synthetic data approaches:
Translation: MT from English traces
Distillation: Generate from DeepSeek-R1 with language forcing (injecting translated phrases to guide output language)
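As a rough sketch of the distillation route, language forcing can be implemented by appending a translated opening phrase to the chat prompt so the teacher continues its reasoning trace in the target language. The model name (a distilled stand-in for DeepSeek-R1), the forcing phrase, and the decoding settings below are assumptions for illustration, not the paper's exact recipe.

```python
# Sketch of "language forcing" for distillation: seed the reasoning trace with a
# translated phrase so generation continues in the target language.
# Model name, forcing phrase, and decoding settings are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # stand-in teacher
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

question = "..."  # math problem, already in the target language
forced_prefix = "まず、問題を整理しましょう。"  # e.g. "First, let's organize the problem." in Japanese

# Build the chat prompt, then append the forced opening of the reasoning trace.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
) + forced_prefix

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.6)
trace = forced_prefix + tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
```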
💡 Key insight so far: Scaling + better multilingual data improve target-language understanding across languages. But generating long reasoning chains in the target language remains the critical bottleneck. Can post-training fix this? →
🧰 Surprising result: Continual pretraining on English/Chinese math data hurts target-language reasoning—even for Chinese (in-distribution)! Broader multilingual pretraining helps both comprehension and generation, but large gaps remain for Target-CoT.
🧰 Pretraining levers: What is the impact of specialized reasoning data vs. multilingual coverage in pretraining? We hold post-training fixed and vary the backbone:
Qwen2.5-7B: seen 29 langs
Qwen2.5-Math-7B: +1T math tokens (EN/ZH only)
Qwen3-8B: +reasoning data, seen 119 langs
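The controlled comparison can be pictured as the same SFT recipe swept over different backbones; the hyperparameters and the finetune() placeholder below are illustrative assumptions, not the paper's settings.

```python
# Sketch of the controlled backbone comparison: one fixed post-training (SFT)
# recipe applied to each pretrained backbone. Hyperparameters are illustrative.
BACKBONES = {
    "Qwen/Qwen2.5-7B": "general pretraining, 29 languages seen",
    "Qwen/Qwen2.5-Math-7B": "+1T math tokens (EN/ZH only)",
    "Qwen/Qwen3-8B": "+reasoning data, 119 languages seen",
}

SFT_RECIPE = {"dataset": "target_language_long_cot_traces",  # same data for every backbone
              "epochs": 3, "lr": 1e-5, "max_seq_len": 8192}

def finetune(backbone: str, **recipe):
    """Placeholder for the shared SFT pipeline (e.g., a TRL SFTTrainer run)."""
    raise NotImplementedError

for backbone, notes in BACKBONES.items():
    print(f"SFT on {backbone} ({notes}) with fixed recipe: {SFT_RECIPE}")
    # finetune(backbone, **SFT_RECIPE)
```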
📈 Scaling isn’t enough: Across DeepSeek-R1-Distill models (1.5B→32B), scaling pushes En-CoT close to English performance, but Target-CoT still lags badly. At 32B, switching from En-CoT to Target-CoT drops accuracy by 29% on average, with low-resource languages near zero.
🔬 Experimental setup: We test 3 configurations across 9 languages:
En-Only: input + reasoning in English
En-CoT: input in target language, reasoning in English
Target-CoT: input + reasoning in target language
This setup disentangles target-language comprehension from reasoning.
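For concreteness, here is a minimal sketch of how the three configurations could be constructed as prompts. The instruction wording and the translate() helper are illustrative assumptions, not the paper's exact prompts.

```python
# Illustrative sketch of the three evaluation configurations.
# The instruction strings and the translate() helper are assumptions,
# not the paper's exact prompts.

def translate(text: str, lang: str) -> str:
    """Placeholder for machine translation into the target language."""
    raise NotImplementedError

def build_prompt(question_en: str, target_lang: str, config: str) -> str:
    if config == "en_only":
        # Question and reasoning both in English.
        return f"{question_en}\nThink step by step in English."
    if config == "en_cot":
        # Question in the target language, reasoning in English.
        return f"{translate(question_en, target_lang)}\nThink step by step in English."
    if config == "target_cot":
        # Question and reasoning both in the target language.
        question = translate(question_en, target_lang)
        instruction = translate("Think step by step.", target_lang)
        return f"{question}\n{instruction}"
    raise ValueError(f"unknown config: {config}")
```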
🔑 Why this matters:
1. If models cannot generate reasoning in the user's language, auditing responses and diagnosing errors becomes difficult, reducing trust.
2. It is unclear whether English long CoT + multilingual understanding transfers to long CoT generation in other languages.
Induction heads are commonly associated with in-context learning, but are they the primary driver of ICL at scale? We find that recently discovered "function vector" heads, which encode the ICL task, are the actual primary drivers of few-shot ICL. https://t.co/zTpiOKatEF 🧵
Understanding and control are two sides of the problem of communicating differing concepts between humans and machines. New position paper: Robert Geirhos, @_beenkim, and I argue we must develop neologisms - new words - for human and machine concepts to understand and control AI
Come to my EMNLP oral today at 2pm in Flagler room!!
🚨New dataset + challenge #EMNLP2024🚨 We release ASL STEM Wiki: the first signing dataset of STEM articles!
📰 254 Wikipedia articles
📹 ~300 hours of ASL interpretations
👋 New task: automatic sign suggestion to make STEM education more accessible
https://t.co/U2pky8fmxq 🧵
Huge thank you to my amazing collaborators @sanjayssub @kayo_yin @alsuhr, without whom this work would not have been possible!
Code: https://t.co/pBFngu6lLK
Paper:
github.com
Code for Paper: Using Language Models to Disambiguate Lexical Choices in Translation [EMNLP 2024] - Berkeley-NLP/Lex-Rules