Josh Barua
@BaruaJosh
59 Followers · 130 Following · 11 Media · 34 Statuses
Research @UT_Linguistics. Trying to understand language models. Prev @ucberkeley @berkeleynlp, intern @LTIatCMU.
Austin, TX
Joined June 2019
🌍 LLMs can use long chain-of-thought (CoT) to reason in English, but what about other languages? New paper w/ @BerkeleyNLP: We study how scaling, pretraining, post-training & inference affect long CoT across 9 languages. Spoiler: English long CoT ≠ multilingual long CoT 🧵
Delighted Sasha's work using mech interp to study complex syntax constructions won an Outstanding Paper Award at EMNLP! And delighted the ACL community continues to recognize unabashedly linguistic topics like filler-gaps, and the huge potential for LMs to inform such topics!
A key hypothesis in the history of linguistics is that different constructions share underlying structure. We take advantage of recent advances in mechanistic interpretability to test this hypothesis in Language Models. New work with @kmahowald and @ChrisGPotts! 🧵👇
Finally, here are some other great works in this space covering inference-time scaling, language mixing, and language compliance.
Yong et al: https://t.co/vmXcyFdTbI
Son et al: https://t.co/sgEz7VccNf
Wang et al: https://t.co/IThpJMYpvd
Qi et al:
arxiv.org
Recent Large Reasoning Models (LRMs) with thinking traces have shown strong performance on English reasoning tasks. However, their ability to think in other languages is less studied. This...
Huge thank you to my collaborators Seun Eisape @kayo_yin @alsuhr!
Paper: https://t.co/mic0Ld6C2J
Code:
github.com
Code for Paper: Long Chain-of-Thought Reasoning Across Languages [SCALR @ COLM 2025] - Berkeley-NLP/Multilingual-Long-CoT
(2/2) If multilingual data were mixed into the reasoning stage of pretraining, scaling model capacity alone could be sufficient to enable long CoT transfer. Perhaps the ability to generate reasoning in multiple languages is a kind of learned invariance that improves robustness 🤷♂️.
Thoughts (1/2): SFT/RL for long CoT works best when the base model already has the right behaviors; post-training just elicits them. But typical multilingual pretraining still relies on documents like Wikipedia, which are good for imparting general knowledge but less so for teaching reasoning.
🔎 The failure modes diverge: We observe that while English reasoning fails due to logical errors, target-language reasoning often fails due to knowledge gaps and repetitive generation before it even reaches the reasoning stage, suggesting much work is needed (especially in pretraining).
🔎 CoT analysis: Do different languages fail differently? We categorize errors into 5 types and analyze CoTs with Gemini-2.5-Pro across English vs. target languages.
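A minimal sketch of how this kind of LLM-as-judge categorization could look with the Gemini API. Only the first three labels are named in this thread; the remaining two and the prompt wording are placeholders, not the paper's rubric.

```python
# Sketch of LLM-as-judge error categorization with Gemini 2.5 Pro.
# The first three labels appear in this thread; the last two are placeholders,
# and the prompt wording is an assumption rather than the paper's rubric.
from google import genai

ERROR_TYPES = [
    "logical error",          # named in the thread
    "knowledge gap",          # named in the thread
    "repetitive generation",  # named in the thread
    "comprehension error",    # placeholder
    "other",                  # placeholder
]

client = genai.Client()  # reads the API key from the environment

def categorize_failure(question: str, cot: str, answer: str, gold: str) -> str:
    prompt = (
        "You are auditing a model's chain-of-thought on a math problem.\n"
        f"Problem: {question}\nChain of thought: {cot}\n"
        f"Model answer: {answer}\nGold answer: {gold}\n"
        f"Classify the failure as exactly one of: {', '.join(ERROR_TYPES)}."
    )
    response = client.models.generate_content(model="gemini-2.5-pro", contents=prompt)
    return response.text.strip().lower()
```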
🧪 Post-training results: Translated data > distilled data (suggesting MT is a viable path for creating synthetic reasoning data). Target-language SFT is highly sample-efficient: target-language traces outperform English traces for mid/low-resource languages with 20x less data.
🧪 Post-training challenge: High-quality reasoning traces barely exist outside EN/ZH. We test two synthetic data approaches:
Translation: MT from English traces
Distillation: Generate from DeepSeek-R1 with language forcing (injecting translated phrases to guide output language)
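As a rough sketch of the distillation route, language forcing can be implemented by appending a translated opening phrase to the chat prompt so the teacher continues its reasoning trace in the target language. The model name (a distilled stand-in for DeepSeek-R1), the forcing phrase, and the decoding settings below are assumptions for illustration, not the paper's exact recipe.

```python
# Sketch of "language forcing" for distillation: seed the reasoning trace with a
# translated phrase so generation continues in the target language.
# Model name, forcing phrase, and decoding settings are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # stand-in teacher
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

question = "..."  # math problem, already in the target language
forced_prefix = "まず、問題を整理しましょう。"  # e.g. "First, let's organize the problem." in Japanese

# Build the chat prompt, then append the forced opening of the reasoning trace.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
) + forced_prefix

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=True, temperature=0.6)
trace = forced_prefix + tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
```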
💡 Key insight so far: Scaling + better multilingual data improve target-language understanding across languages. But generating long reasoning chains in the target language remains the critical bottleneck. Can post-training fix this? →
🧰 Surprising result: Continual pretraining on English/Chinese math data hurts target-language reasoning—even for Chinese (in-distribution)! Broader multilingual pretraining helps both comprehension and generation, but large gaps remain for Target-CoT.
🧰 Pretraining levers: What is the impact of specialized reasoning data vs. multilingual coverage in pretraining? We hold post-training fixed and vary the backbone:
Qwen2.5-7B: seen 29 langs
Qwen2.5-Math-7B: +1T math tokens (EN/ZH only)
Qwen3-8B: +reasoning data, seen 119 langs
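The controlled comparison can be pictured as the same SFT recipe swept over different backbones; the hyperparameters and the finetune() placeholder below are illustrative assumptions, not the paper's settings.

```python
# Sketch of the controlled backbone comparison: one fixed post-training (SFT)
# recipe applied to each pretrained backbone. Hyperparameters are illustrative.
BACKBONES = {
    "Qwen/Qwen2.5-7B": "general pretraining, 29 languages seen",
    "Qwen/Qwen2.5-Math-7B": "+1T math tokens (EN/ZH only)",
    "Qwen/Qwen3-8B": "+reasoning data, 119 languages seen",
}

SFT_RECIPE = {"dataset": "target_language_long_cot_traces",  # same data for every backbone
              "epochs": 3, "lr": 1e-5, "max_seq_len": 8192}

def finetune(backbone: str, **recipe):
    """Placeholder for the shared SFT pipeline (e.g., a TRL SFTTrainer run)."""
    raise NotImplementedError

for backbone, notes in BACKBONES.items():
    print(f"SFT on {backbone} ({notes}) with fixed recipe: {SFT_RECIPE}")
    # finetune(backbone, **SFT_RECIPE)
```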
📈 Scaling isn’t enough: Across DeepSeek-R1-Distill models (1.5B→32B), scaling pushes En-CoT close to English performance, but Target-CoT still lags badly. At 32B, switching from En-CoT to Target-CoT drops accuracy by 29% on average, with low-resource languages near zero.
🔬 Experimental setup: We test 3 configurations across 9 languages:
En-Only: input + reasoning in English
En-CoT: input in target language, reasoning in English
Target-CoT: input + reasoning in target language
This setup disentangles target-language comprehension from reasoning.
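For concreteness, here is a minimal sketch of how the three configurations could be constructed as prompts. The instruction wording and the translate() helper are illustrative assumptions, not the paper's exact prompts.

```python
# Illustrative sketch of the three evaluation configurations.
# The instruction strings and the translate() helper are assumptions,
# not the paper's exact prompts.

def translate(text: str, lang: str) -> str:
    """Placeholder for machine translation into the target language."""
    raise NotImplementedError

def build_prompt(question_en: str, target_lang: str, config: str) -> str:
    if config == "en_only":
        # Question and reasoning both in English.
        return f"{question_en}\nThink step by step in English."
    if config == "en_cot":
        # Question in the target language, reasoning in English.
        return f"{translate(question_en, target_lang)}\nThink step by step in English."
    if config == "target_cot":
        # Question and reasoning both in the target language.
        question = translate(question_en, target_lang)
        instruction = translate("Think step by step.", target_lang)
        return f"{question}\n{instruction}"
    raise ValueError(f"unknown config: {config}")
```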
🔑 Why this matters:
1. If models cannot generate reasoning in the user's language, auditing responses and diagnosing errors becomes difficult, reducing trust.
2. It is unclear whether English long CoT + multilingual understanding transfers to long CoT generation in other languages.
Induction heads are commonly associated with in-context learning, but are they the primary driver of ICL at scale? We find that recently discovered "function vector" heads, which encode the ICL task, are the actual primary drivers of few-shot ICL. https://t.co/zTpiOKatEF 🧵
Understanding and control are two sides of the problem of communicating differing concepts between humans and machines. New position paper: Robert Geirhos, @_beenkim, and I argue we must develop neologisms - new words - for human and machine concepts to understand and control AI
Come to my EMNLP oral today at 2pm in Flagler room!!
🚨New dataset + challenge #EMNLP2024🚨 We release ASL STEM Wiki: the first signing dataset of STEM articles!
📰 254 Wikipedia articles
📹 ~300 hours of ASL interpretations
👋 New task: automatic sign suggestion to make STEM education more accessible
https://t.co/U2pky8fmxq 🧵
Huge thank you to my amazing collaborators @sanjayssub @kayo_yin @alsuhr, without whom this work would not have been possible!
Code: https://t.co/pBFngu6lLK
Paper:
github.com
Code for Paper: Using Language Models to Disambiguate Lexical Choices in Translation [EMNLP 2024] - Berkeley-NLP/Lex-Rules