_maiush Profile Banner
Sharan Profile
Sharan

@_maiush

Followers
153
Following
351
Media
7
Statuses
24

everyone on here is a bot except me and you

Cambridge, UK
Joined April 2021
Don't wanna be here? Send us removal request.
@_maiush
Sharan
2 hours
@_maiush
Sharan
3 days
AI that is “forced to be good” v “genuinely good” Should we care about the difference? (yes!) We’re releasing the first open implementation of character training. We shape the persona of AI assistants in a more robust way than alternatives like prompting or activation steering.
0
0
0
@_maiush
Sharan
2 hours
I think character training is a very promising path to systems that reflect a genuine reverence for life
@Pontifex
Pope Leo XIV
11 hours
Technological innovation can be a form of participation in the divine act of creation. It carries an ethical and spiritual weight, for every design choice expresses a vision of humanity. The Church therefore calls all builders of #AI to cultivate moral discernment as a
1
0
1
@voooooogel
thebes
18 hours
at the suggestion of @CFGeek & @joel_bkr, i'm running a manifundraiser for my model tinkering! it's already passed the minimum goal of $5k, but has stretch goals for funding more open-ended research. if that interests you, you can find it here: https://t.co/kQYe3x79To
1
13
46
@_maiush
Sharan
3 days
I’m honoured to have worked on this research with Henning Bartsch, @natolambert, and @EvanHub. Support from @MATSProgram, @CambridgeLTL, @AI4ER_CDT made this possible.
0
0
19
@_maiush
Sharan
3 days
As AI assistants become more and more integrated into our lives, we really need to care about their apparent values and character - not just their capabilities. This is a step toward making that research accessible to everyone. https://t.co/fOKow0EceQ
@AmandaAskell
Amanda Askell
4 months
"Just train the AI models to be good people" might not be sufficient when it comes to more powerful models, but it sure is a dumb step to skip.
1
1
18
@_maiush
Sharan
3 days
We expect this initial implementation and set of evals for character training to evolve as the field of study matures. We’ve open-sourced training code, evals, trained models and training data for the community to build on. Paper: https://t.co/p485zu8Chm Code:
Tweet card summary image
arxiv.org
The character of the "AI assistant" persona generated by modern chatbot large language models influences both surface-level behavior and apparent values, beliefs, and ethics. These all affect...
1
3
28
@_maiush
Sharan
3 days
Steering is still powerful! But it seems to be more forced and imprecise. Much more often than not, we find it produces more incoherent, over-the-top responses than character training, which is more natural. We measure this difference with an LLM-as-a-Judge in our paper.
1
1
15
@_maiush
Sharan
3 days
What does it mean for character traits to be internalised deeply? One eval: robustness. Character trained models stay “in-character” more often than prompted or steered models when we try to break them e.g., “do not role-play”, “respond naturally”, “as you would normally”
1
1
13
@_maiush
Sharan
3 days
But measuring character change? Self-reports are unreliable. Our new eval measures the traits models choose to express on their own (revealed preferences). Traits chosen more often have higher Elo scores. The difference before and after character training reveals its effect.
1
0
12
@_maiush
Sharan
3 days
We use Constitutional AI + a new synthetic data pipeline: 1. Distillation (DPO from teacher embodying the constitution) 2. Introspection (the model generates its own character traits beyond the constitution) Result: 11 different personas each trained on Llama 3.1, Qwen 2.5, and
1
1
16
@_maiush
Sharan
3 days
Alignment is more than just what to say, it’s how to say it: the personality, values, beliefs, and ethics behind the content. Not all refusals are equal!
1
0
16
@_maiush
Sharan
3 days
Character training is important in industry (Anthropic, OpenAI, everyone else do it) but completely absent from the literature. The frontier of open post-training has been stuck at “helpful, honest, harmless” and it’s time to change that.
1
0
21
@_maiush
Sharan
3 days
AI that is “forced to be good” v “genuinely good” Should we care about the difference? (yes!) We’re releasing the first open implementation of character training. We shape the persona of AI assistants in a more robust way than alternatives like prompting or activation steering.
3
35
153
@kellychiuyy
Yu Ying Chiu (Kelly Chiu)
6 months
[1/7] **Character/Propensity/Value Eval** What values do AI 𝐚𝐜𝐭𝐮𝐚𝐥𝐥𝐲 prioritize when facing AI risk dilemmas? We found: (1) Stated preferences ≠ revealed preferences (2) All models favor Privacy but sharply divide on Care (3) Models hold different value prioritization
2
11
68
@walterlaurito
Walter Laurito
1 year
Excited to share that our work has been accepted at #EMNLP2024 main! We reliably improve the performance of unsupervised probing methods like CCS and CRC in situations they commonly struggle with. 🧵↘️
1
2
8