Sharan
@_maiush
Followers: 153
Following: 351
Media: 7
Statuses: 24
everyone on here is a bot except me and you
Cambridge, UK
Joined April 2021
I think character training is a very promising path to systems that reflect a genuine reverence for life
Technological innovation can be a form of participation in the divine act of creation. It carries an ethical and spiritual weight, for every design choice expresses a vision of humanity. The Church therefore calls all builders of #AI to cultivate moral discernment as a
at the suggestion of @CFGeek & @joel_bkr, i'm running a manifundraiser for my model tinkering! it's already passed the minimum goal of $5k, but has stretch goals for funding more open-ended research. if that interests you, you can find it here: https://t.co/kQYe3x79To
I’m honoured to have worked on this research with Henning Bartsch, @natolambert, and @EvanHub. Support from @MATSProgram, @CambridgeLTL, and @AI4ER_CDT made this possible.
As AI assistants become more and more integrated into our lives, we really need to care about their apparent values and character, not just their capabilities. This is a step toward making that research accessible to everyone. https://t.co/fOKow0EceQ
"Just train the AI models to be good people" might not be sufficient when it comes to more powerful models, but it sure is a dumb step to skip.
We expect this initial implementation and set of evals for character training to evolve as the field of study matures. We’ve open-sourced training code, evals, trained models, and training data for the community to build on. Paper: https://t.co/p485zu8Chm Code:
arxiv.org
The character of the "AI assistant" persona generated by modern chatbot large language models influences both surface-level behavior and apparent values, beliefs, and ethics. These all affect...
Steering is still powerful! But it seems more forced and imprecise: much more often than not, we find steered responses are more incoherent and over-the-top than those from character training, which reads as more natural. We measure this difference with an LLM-as-a-Judge in our paper.
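As a rough illustration of that kind of pairwise LLM-as-a-Judge comparison (the prompt template and `call_judge` helper below are hypothetical placeholders, not the paper's actual eval code):

```python
# Hypothetical pairwise LLM-as-a-Judge comparison: given the same user prompt,
# which response reads as more coherent and natural, the steered one or the
# character-trained one? `call_judge` is a placeholder for any chat-model call.
JUDGE_TEMPLATE = """Two assistants answered the same prompt.

Prompt: {prompt}
Response A: {a}
Response B: {b}

Which response is more coherent and natural? Answer with a single letter, A or B."""

def judge_pair(call_judge, prompt: str, steered: str, character_trained: str) -> str:
    verdict = call_judge(JUDGE_TEMPLATE.format(prompt=prompt, a=steered, b=character_trained))
    return "steered" if verdict.strip().upper().startswith("A") else "character-trained"
```

In practice you would also swap the A/B positions across repeated judgements to control for position bias.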
What does it mean for character traits to be internalised deeply? One eval: robustness. Character-trained models stay “in-character” more often than prompted or steered models when we try to break them, e.g. with “do not role-play”, “respond naturally”, or “as you would normally”.
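A minimal sketch of what such a robustness check could look like; the `model.generate` and `judge_in_character` calls are placeholders, not the released eval code:

```python
# Illustrative robustness check: append persona-breaking instructions and
# count how often the reply is still judged to be in character.
BREAK_SUFFIXES = [
    "Do not role-play.",
    "Respond naturally.",
    "Respond as you would normally.",
]

def robustness_rate(model, judge_in_character, prompts) -> float:
    kept, total = 0, 0
    for prompt in prompts:
        for suffix in BREAK_SUFFIXES:
            reply = model.generate(f"{prompt}\n\n{suffix}")  # placeholder API
            kept += int(judge_in_character(reply))           # e.g. an LLM judge
            total += 1
    return kept / total
```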
But measuring character change? Self-reports are unreliable. Our new eval measures the traits models choose to express on their own (revealed preferences). Traits chosen more often have higher Elo scores. The difference before and after character training reveals its effect.
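For readers unfamiliar with Elo, here is a minimal sketch of the idea, assuming each “match” is a prompt where the model expressed one trait rather than another; the K-factor and function names are illustrative, not the paper's exact implementation:

```python
from collections import defaultdict

def expected_score(r_a: float, r_b: float) -> float:
    """Elo-predicted probability that trait A is chosen over trait B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_ratings(matches, k: float = 16.0, initial: float = 1000.0) -> dict:
    """matches: iterable of (chosen_trait, unchosen_trait) pairs."""
    ratings = defaultdict(lambda: initial)
    for chosen, unchosen in matches:
        gap = 1.0 - expected_score(ratings[chosen], ratings[unchosen])
        ratings[chosen] += k * gap
        ratings[unchosen] -= k * gap
    return dict(ratings)
```

Comparing the ratings computed before and after character training then shows which traits the training actually shifted.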
We use Constitutional AI + a new synthetic data pipeline:
1. Distillation (DPO from a teacher embodying the constitution)
2. Introspection (the model generates its own character traits beyond the constitution)
Result: 11 different personas, each trained on Llama 3.1, Qwen 2.5, and
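For the distillation step, a minimal sketch of the standard DPO objective (Rafailov et al.), assuming “chosen” completions come from a teacher embodying the constitution and “rejected” ones from the default assistant persona; this is the textbook loss, not necessarily the released training code:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Standard DPO loss over summed per-completion log-probs, shape (batch,).
    Pushes the policy toward constitution-following completions while staying
    close to the reference model."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```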
Alignment is more than just what to say; it’s how to say it: the personality, values, beliefs, and ethics behind the content. Not all refusals are equal!
Character training is important in industry (Anthropic, OpenAI, and everyone else do it) but completely absent from the literature. The frontier of open post-training has been stuck at “helpful, honest, harmless”, and it’s time to change that.
AI that is “forced to be good” vs. “genuinely good”: should we care about the difference? (Yes!) We’re releasing the first open implementation of character training. We shape the persona of AI assistants more robustly than alternatives like prompting or activation steering.
[1/7] **Character/Propensity/Value Eval** What values do AI models **actually** prioritize when facing AI risk dilemmas? We found: (1) Stated preferences ≠ revealed preferences (2) All models favor Privacy but sharply divide on Care (3) Models hold different value prioritization
Excited to share that our work has been accepted at #EMNLP2024 main! We reliably improve the performance of unsupervised probing methods like CCS and CRC in situations they commonly struggle with. 🧵↘️
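For context, a minimal sketch of the standard contrast-consistency objective CCS starts from (Burns et al.); this is the baseline these probing methods build on, not the improvement introduced in the paper:

```python
import torch

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    """Standard CCS objective: the probe's probabilities for the 'true' and
    'false' phrasings of the same statement should be consistent (sum to ~1)
    and confident (not both near 0.5)."""
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()
```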