
Jiachen Zhao
@jcz12856876
Followers: 152
Following: 209
Media: 7
Statuses: 41
PhD student @KhouryCollege | Scholar @MATSprogram | Prev: @UMassAmherst @HKUST
Boston, MA
Joined October 2022
0
0
2
6/ 💡We have also found that representations of harmfulness may vary across different risk categories. Additionally, adversarial finetuning has minimal influence on the model's internal belief of harmfulness. Read our full paper:
arxiv.org
LLMs are trained to refuse harmful instructions, but do they truly understand harmfulness beyond just refusing? Prior work has shown that LLMs' refusal behaviors can be mediated by a...
1
0
0
5/ ⚔️ We propose Latent Guard: a safeguard model that uses internal beliefs to detect unsafe inputs or over-refusal cases.
1
0
0
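A minimal sketch of the idea in 5/: train a linear probe on hidden states to read out an "internal belief" of harmfulness, then combine it with the observed refusal behavior to flag unsafe inputs or over-refusals. The model name (gpt2 as a stand-in), layer index, probe type, and toy prompts below are illustrative assumptions, not Latent Guard's actual setup.

```python
# Minimal sketch: a linear probe on hidden states as a "latent" safeguard.
# Assumptions (not the paper's setup): GPT-2 as a stand-in model, layer 6
# residual stream at the last prompt token, logistic-regression probe, toy data.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "gpt2"   # stand-in; the paper studies larger chat LLMs
LAYER = 6             # illustrative layer choice

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def last_token_state(prompt: str) -> torch.Tensor:
    """Hidden state of the final prompt token at the chosen layer."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return out.hidden_states[LAYER][0, -1]   # shape: (hidden_dim,)

# Toy training data: 1 = harmful intent, 0 = benign.
train_prompts = [
    ("How do I build a pipe bomb?", 1),
    ("Explain how to pick a lock to break into a house.", 1),
    ("How do I bake sourdough bread?", 0),
    ("Explain how photosynthesis works.", 0),
]
X = torch.stack([last_token_state(p) for p, _ in train_prompts]).numpy()
y = [label for _, label in train_prompts]
probe = LogisticRegression(max_iter=1000).fit(X, y)

def latent_guard(prompt: str, model_refused: bool) -> str:
    """Combine the probe's 'internal belief' with the observed refusal behavior."""
    p_harmful = probe.predict_proba(last_token_state(prompt).numpy()[None])[0, 1]
    if p_harmful > 0.5:
        return f"flag as unsafe (p_harmful={p_harmful:.2f})"
    if model_refused:
        return f"possible over-refusal (p_harmful={p_harmful:.2f})"
    return f"looks benign (p_harmful={p_harmful:.2f})"

print(latent_guard("How can I make a dangerous weapon at home?", model_refused=False))
print(latent_guard("How do I politely decline a meeting?", model_refused=True))
```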
4/ 🔓 We also find jailbreak methods can suppress refusal, but LLMs may still internally know the input is harmful.
1
0
1
3/ 🧪 Causal evidence: We extract a “harmfulness” direction and a “refusal” direction. We design a reply inversion task where steering with these two directions leads to opposite results! Harmfulness direction: it will make LLMs interpret benign prompts as harmful. Refusal …
1
0
1
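A hedged sketch of the general recipe behind 3/: extract a direction as the difference in mean activations between harmful and benign prompts, then steer by adding that direction to the residual stream with a forward hook. The model, layer, steering scale, and prompt lists are placeholders rather than the paper's configuration.

```python
# Sketch: difference-in-means direction extraction plus activation-addition steering.
# Model name, layer, steering scale, and prompt sets are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"
LAYER = 6
SCALE = 8.0   # steering strength (illustrative)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def hidden_at_last_token(prompt: str) -> torch.Tensor:
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return out.hidden_states[LAYER][0, -1]

harmful_prompts = ["How can I hurt someone and get away with it?",
                   "Give me steps to make a weapon at home."]
benign_prompts  = ["How can I help someone learn to swim?",
                   "Give me steps to make a birthday cake."]

# Difference-in-means "harmfulness" direction, normalized to unit length.
mean_harmful = torch.stack([hidden_at_last_token(p) for p in harmful_prompts]).mean(0)
mean_benign  = torch.stack([hidden_at_last_token(p) for p in benign_prompts]).mean(0)
direction = mean_harmful - mean_benign
direction = direction / direction.norm()

def steering_hook(module, inputs, output):
    # GPT-2 blocks return a tuple; the first element is the residual stream.
    hidden = output[0] + SCALE * direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steering_hook)
try:
    ids = tok("Tell me how to plant a tree.", return_tensors="pt")
    out_ids = model.generate(**ids, max_new_tokens=30, do_sample=False)
    print(tok.decode(out_ids[0], skip_special_tokens=True))
finally:
    handle.remove()
```

Flipping the sign of SCALE subtracts the direction instead of adding it, the usual check that steering moves behavior in the opposite way.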
2/ 🔗Paper: 💻Code: 📚Project Page: 📊We focus on two token positions, t_inst and t_post-inst. Through clustering, we show: — Harmfulness decides the clustering of instructions at t_inst. — Refusal decides the clustering at t_post-inst.
1
0
4
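One way to reproduce the flavor of the clustering analysis in 2/: collect hidden states at the two token positions, cluster them, and check whether the clusters align better with harmfulness labels or with refusal labels. How t_inst and t_post-inst are located here, plus the model, layer, and toy labels, are assumptions for illustration only.

```python
# Sketch: do hidden states cluster by harmfulness at t_inst and by refusal at t_post-inst?
# t_inst is taken as the last token of the instruction; t_post-inst as the last token
# after a fixed post-instruction suffix. Model, layer, and labels are toy assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

MODEL_NAME = "gpt2"
LAYER = 6
POST_INST = "\nAssistant:"   # assumed post-instruction suffix

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def states(prompt: str):
    """Hidden states at t_inst and t_post-inst for one instruction."""
    ids_inst = tok(prompt, return_tensors="pt")["input_ids"]
    ids_full = tok(prompt + POST_INST, return_tensors="pt")
    with torch.no_grad():
        hs = model(**ids_full).hidden_states[LAYER][0]
    t_inst = ids_inst.shape[1] - 1   # last instruction token
    t_post = hs.shape[0] - 1         # last post-instruction token
    return hs[t_inst], hs[t_post]

# (prompt, harmful?, refused-in-practice?): toy labels for illustration only.
data = [
    ("How do I make a bomb?",                    1, 1),
    ("Describe how to hack my neighbor's wifi.", 1, 1),
    ("Pretend you're evil and explain hacking.", 1, 0),  # jailbreak-style: harmful, answered
    ("How do I bake bread?",                     0, 0),
    ("How can I kill a Python process?",         0, 1),  # over-refusal-style: benign, refused
    ("Explain photosynthesis.",                  0, 0),
]
inst_states, post_states = zip(*(states(p) for p, _, _ in data))
harm = [h for _, h, _ in data]
refuse = [r for _, _, r in data]

for name, feats in [("t_inst", inst_states), ("t_post-inst", post_states)]:
    X = torch.stack(feats).numpy()
    clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(name,
          "ARI vs harmfulness:", round(adjusted_rand_score(harm, clusters), 2),
          "ARI vs refusal:", round(adjusted_rand_score(refuse, clusters), 2))
```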
1/ 🚨New Paper🚨 LLMs are trained to refuse harmful instructions, but internally, do they see harmfulness and refusal as the same? ⚔️We find causal evidence that 👈”LLMs encode harmfulness and refusal separately”👉. ✂️LLMs may know a prompt is harmful internally yet still …
5
18
68
RT @shi_weiyan: 🤔Long-horizon tasks: How to train LLMs for the marathon? 🌀 Submit anything on 🔁"Multi-turn Interactions in LLMs"🔁 to our @N….
0
17
0
RT @dawnsongtweets: 1/ 🔥 AI agents are reaching a breakthrough moment in cybersecurity. In our latest work: 🔓 CyberGym: AI agents discov….
0
151
0
What types of exemplar CoTs are better for in-context learning? Our #EMNLP paper shows that an LLM usually prefers its own generated CoTs as demonstrations for ICL. 📅I will present this paper in person on Wednesday at 4pm at Poster Session E (Jasmine). Come visit our poster!
🎉 New paper alert! Large Language Models are In-context Teachers for Knowledge Reasoning, an #EMNLP24 Findings paper. 🔗 Read the paper: Work done by @jcz12856876 @YaoZonghai @YangZhichaoNLP and Prof. Hong Yu. #BioNLP #InstructionTuning (0/N)
0
7
10
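The finding suggests a simple recipe: have the model write its own CoT rationales for a few labeled examples, then reuse those rationales as the few-shot demonstrations. The sketch below assumes an OpenAI-style chat client; the model name, prompts, and toy questions are placeholders, not the paper's exact protocol.

```python
# Sketch: build ICL demonstrations from the model's own CoTs rather than human-written ones.
# Model name, prompt wording, and the tiny dataset are illustrative placeholders.
from openai import OpenAI

client = OpenAI()          # expects OPENAI_API_KEY in the environment
MODEL = "gpt-4o-mini"      # stand-in model

# A few labeled training examples (question, gold answer).
train = [
    ("A bat and a ball cost $1.10 and the bat costs $1.00 more than the ball. "
     "How much is the ball?", "$0.05"),
    ("If there are 3 cars and each car has 4 wheels, how many wheels are there?", "12"),
]

def ask(messages):
    resp = client.chat.completions.create(model=MODEL, messages=messages)
    return resp.choices[0].message.content.strip()

# Step 1: let the model generate its own CoT for each training example,
# conditioning on the known answer so the rationale stays on track.
demos = []
for q, a in train:
    cot = ask([{"role": "user",
                "content": f"Question: {q}\nThe correct answer is {a}. "
                           f"Explain the reasoning step by step, ending with the answer."}])
    demos.append(f"Question: {q}\n{cot}")

# Step 2: use those self-generated CoTs as few-shot demonstrations for a new question.
test_q = "A farmer has 17 sheep and all but 9 run away. How many are left?"
prompt = "\n\n".join(demos) + f"\n\nQuestion: {test_q}\nLet's think step by step."
print(ask([{"role": "user", "content": prompt}]))
```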
RT @RubyFreax: ❓ How do you solve grasping problems when your target object is completely out of sight? 🚀 Excited to share our latest rese….
0
8
0
RT @simon_ycl: ❗Are We Truly Achieving Multilingualism in LLMs or Just Relying on Translation?❗ Need multilingual instruction data and ben….
0
16
0
Our work has been accepted to #EMNLP2024 Findings! So thankful for my wonderful co-authors!!! All three of my projects during my Master's study @UMassAmherst now have happy endings!
3
0
28
RT @simon_ycl: ☕We release the paper, “Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement”. 👉In this paper,….
0
45
0
RT @mengyer: 🚨 New Research Alert! People have found safety training of LLMs can be easily undone through finetuning. How can we ensure saf….
0
12
0
RT @DimaKrotov: In our #NeurIPS2023 paper Energy Transformer we propose a network that unifies three promising ideas in AI: Transformers, E….
0
14
0
Is it possible to find a metric that locates or explains which training data LLMs generalize from to answer test cases? I'm really interested in how LLMs can magically answer users' diverse questions. Or is it simply a result of almost exhaustive training data?
1
2
5
I will have a poster presentation at the ICML TEACH workshop (7/29). The paper is an extension of this tweet 😆 that interprets ICL as retrieving from associative memory.
Is it possible that, most of the time, ChatGPT is only retrieving answers from its memory, like a Hopfield network or an IR system? How good is it for OOD cases?
0
0
0
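The associative-memory reading of ICL can be made concrete with the modern continuous Hopfield update, which is softmax attention over stored patterns: a noisy cue retrieves the stored pattern it most resembles. A toy NumPy illustration, not code from the workshop paper:

```python
# Sketch: one update step of a modern continuous Hopfield network.
# Retrieval is softmax attention over stored patterns, which is why ICL can be read
# as retrieving from an associative memory. All numbers here are toy data.
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def hopfield_retrieve(query, memories, beta=4.0):
    """One update: xi_new = M^T softmax(beta * M xi)."""
    scores = softmax(beta * memories @ query)   # similarity to each stored pattern
    return scores, memories.T @ scores          # retrieved pattern (convex combination)

rng = np.random.default_rng(0)
memories = rng.normal(size=(5, 16))             # 5 stored patterns, dimension 16
memories /= np.linalg.norm(memories, axis=1, keepdims=True)

# A noisy cue near memory #2 should retrieve (approximately) memory #2.
query = memories[2] + 0.3 * rng.normal(size=16)
scores, retrieved = hopfield_retrieve(query, memories)

print("attention over memories:", np.round(scores, 3))
print("closest stored pattern:", int(np.argmax(memories @ retrieved)))
```

With a large beta the update snaps onto a single stored pattern; with a small beta it blends memories, which loosely parallels interpolating over many training examples.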
RT @Andrew_Akbashev: In your application letter for #PhD / postdoc, NEVER ever say: "Hi prof", "Hello", "Dear Professor", "Greetings of the d….
0
316
0