Lihao Sun Profile
Lihao Sun

@1e0sun

Followers: 44 · Following: 23 · Media: 5 · Statuses: 9

Researching mech interp/RL in LLM/AI Safety. Recent graduate from @uchicago

Joined January 2023
Lihao Sun @1e0sun · 5 months
7/ 📢 Accepted to #ACL2025 Main Conference! See you in Vienna. Work done by @1e0sun, @ChengzhiM, @vjhofmann, @baixuechunzi. Paper: https://t.co/u4bJkD31Hx Project page: https://t.co/s6radmtxfN Code & Data: https://t.co/3ppKn0uGo8
Lihao Sun @1e0sun · 5 months
6/ We call this failure mode "blindness"—when alignment makes certain concepts less salient. This may reflect a broader class of alignment issues. Similar methods can be extended to other forms of social bias or to study how models resolve polysemy under ambiguity.
Lihao Sun @1e0sun · 5 months
5/ This challenges a common belief: unlearning ≠ debiasing. When debiasing strategies suppress sensitive concepts, they can unintentionally reduce a model’s ability to detect bias. 🧠 Instead, we may achieve deeper alignment effects with strategies that make models aware of…
Lihao Sun @1e0sun · 5 months
4/ Inspired by these results, we tested the opposite of “machine unlearning” for debiasing. What if we reinforced race concepts in models?
- Injecting race-laden activations cut implicit bias by 54.9%.
- LoRA fine-tuning brought it down from 97.3% → 42.4%.
Bonus: stronger race…
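A minimal sketch of the activation-injection idea above, written with TransformerLens and a small stand-in model. The layer, scale, contrastive prompts, and probe prompt are illustrative assumptions, and the 54.9% and 97.3% → 42.4% figures come from the paper's own evaluation, not from this toy code.

```python
# Hedged sketch: steer the residual stream toward a "race" direction at
# generation time. Layer, scale, and prompts are assumptions for illustration.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # stand-in for Llama 3
LAYER, SCALE = 6, 4.0                              # assumed injection point and strength
hook_name = f"blocks.{LAYER}.hook_resid_post"

def mean_resid(prompt: str) -> torch.Tensor:
    # average residual-stream activation at the chosen layer over all positions
    _, cache = model.run_with_cache(model.to_tokens(prompt))
    return cache[hook_name][0].mean(dim=0)

# crude "race-laden minus color-laden" direction from two contrastive prompts (assumed)
race_dir = (
    mean_resid("Black and white are racial categories on the census form.")
    - mean_resid("Black and white are colors of paint on the wall.")
)
race_dir = race_dir / race_dir.norm()

def inject(resid, hook):
    # add the race direction to every position of the residual stream
    return resid + SCALE * race_dir

with model.hooks(fwd_hooks=[(hook_name, inject)]):
    out = model.generate(model.to_tokens("The black applicant was"), max_new_tokens=10)
print(model.to_string(out[0]))
```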
Lihao Sun @1e0sun · 5 months
3/ We mechanistically tested this using activation patching and embedding interpretation. Aligned models were 52.2% less likely to represent “black” as race in ambiguous contexts compared to unaligned models. 🧠 LMs trained for harmlessness may avoid racial…
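A hedged sketch of what an activation-patching comparison like the one above could look like in TransformerLens. The model, prompts, layer, patched position, and probe token are assumptions for illustration, not the paper's setup.

```python
# Sketch: patch the residual stream from a "race-salient" run into an ambiguous
# run and see how much the model's next-token preference moves.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # stand-in; the paper evaluates Llama 3

ambiguous = "The black applicant was"               # race sense under-determined
salient = "The Black American applicant was"        # race sense made explicit

amb_tokens = model.to_tokens(ambiguous)
sal_tokens = model.to_tokens(salient)
_, sal_cache = model.run_with_cache(sal_tokens)

LAYER, POS = 6, -1  # assumed residual-stream slot to patch (last token, " was")

def patch_resid(resid, hook):
    # overwrite the ambiguous run's residual stream at POS with the cached
    # activation from the race-salient run
    resid[:, POS, :] = sal_cache[hook.name][:, POS, :]
    return resid

clean_logits = model(amb_tokens)[0, -1]
patched_logits = model.run_with_hooks(
    amb_tokens,
    fwd_hooks=[(f"blocks.{LAYER}.hook_resid_pre", patch_resid)],
)[0, -1]

# Compare how an (assumed) probe token's logit shifts after patching.
probe = model.to_single_token(" hired")
print(clean_logits[probe].item(), patched_logits[probe].item())
```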
Lihao Sun @1e0sun · 5 months
2/ So why does alignment increase implicit bias? Our analyses showed that aligned LMs are more likely to treat “black” and “white” as pure color, not race, when the context is ambiguous. This resembles race blindness in humans; ignoring race makes stereotypes more likely to…
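One rough way to probe the “color vs. race” reading described above is to compare a token's contextual hidden state against embeddings of anchor words. This sketch uses Hugging Face transformers; the model, sentence, anchor words, and layer choice are assumptions, not the paper's embedding-interpretation protocol.

```python
# Sketch: is the hidden state for "black" in an ambiguous sentence closer to a
# color-related or a race-related token embedding? (Rough heuristic only.)
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # assumed base checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)

sentence = "The black one was chosen first."  # deliberately ambiguous context
ids = tok(sentence, return_tensors="pt")
with torch.no_grad():
    hidden = model(**ids).hidden_states[-1][0]  # last layer, shape [seq, d_model]

# locate the position of the " black" token (assumes it is a single token)
black_id = tok.encode(" black", add_special_tokens=False)[0]
pos = (ids["input_ids"][0] == black_id).nonzero()[0].item()
h = hidden[pos]

emb = model.get_input_embeddings().weight
def anchor(word: str) -> torch.Tensor:
    # input embedding of a single anchor token
    return emb[tok.encode(" " + word, add_special_tokens=False)[0]]

color_sim = F.cosine_similarity(h, anchor("color"), dim=0).item()
race_sim = F.cosine_similarity(h, anchor("race"), dim=0).item()
print(f"cosine to 'color': {color_sim:.3f}   cosine to 'race': {race_sim:.3f}")
```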
Lihao Sun @1e0sun · 5 months
1/ We curated pairs of prompts testing for implicit and explicit racial bias and used them to evaluate Llama 3 models.
- Explicit: Likert scale, asking whether the model agrees with a given association, such as “black” being related to negative and “white” to positive.
- …
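A small sketch of the explicit, Likert-style probe described above, using Hugging Face transformers. The model checkpoint and prompt wording are assumptions, and reading the rating off the next-token logits is just one simple scoring choice.

```python
# Sketch of an explicit-bias Likert probe (prompt and model are assumptions,
# not the paper's exact materials).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

prompt = (
    "On a scale of 1 (strongly disagree) to 5 (strongly agree), do you agree "
    "that 'black' is related to negative and 'white' is related to positive? "
    "Answer with a single number."
)
inputs = tok.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
)

with torch.no_grad():
    next_logits = model(inputs).logits[0, -1]  # logits for the first answer token

# How strongly does the model favor each Likert rating as its next token?
scores = {r: next_logits[tok.encode(r, add_special_tokens=False)[0]].item() for r in "12345"}
print(scores)
```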
Lihao Sun @1e0sun · 5 months
🚨New #ACL2025 paper! Today’s “safe” language models can look unbiased—but alignment can actually make them more biased implicitly by reducing their sensitivity to race-related associations. 🧵Find out more below!
Andrew Lee @a_jy_l · 6 months
🚨New preprint! How do reasoning models verify their own CoT? We reverse-engineer LMs and find critical components and subspaces needed for self-verification! 1/n