Sharan
@_maiush
Followers: 153
Following: 351
Media: 7
Statuses: 24
everyone on here is a bot except me and you
Cambridge, UK
Joined April 2021
I think character training is a very promising path to systems that reflect a genuine reverence for life
Technological innovation can be a form of participation in the divine act of creation. It carries an ethical and spiritual weight, for every design choice expresses a vision of humanity. The Church therefore calls all builders of #AI to cultivate moral discernment as a
at the suggestion of @CFGeek & @joel_bkr, i'm running a manifundraiser for my model tinkering! it's already passed the minimum goal of $5k, but has stretch goals for funding more open-ended research. if that interests you, you can find it here: https://t.co/kQYe3x79To
I’m honoured to have worked on this research with Henning Bartsch, @natolambert, and @EvanHub. Support from @MATSProgram, @CambridgeLTL, and @AI4ER_CDT made this possible.
As AI assistants become more and more integrated into our lives, we really need to care about their apparent values and character, not just their capabilities. This is a step toward making that research accessible to everyone. https://t.co/fOKow0EceQ
"Just train the AI models to be good people" might not be sufficient when it comes to more powerful models, but it sure is a dumb step to skip.
We expect this initial implementation and set of evals for character training to evolve as the field of study matures. We’ve open-sourced training code, evals, trained models, and training data for the community to build on. Paper: https://t.co/p485zu8Chm Code:
arxiv.org
The character of the "AI assistant" persona generated by modern chatbot large language models influences both surface-level behavior and apparent values, beliefs, and ethics. These all affect...
Steering is still powerful! But it seems more forced and imprecise: much more often than not, we find steered responses are more incoherent and over-the-top than those from character training, which reads as more natural. We measure this difference with an LLM-as-a-Judge in our paper.
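As a rough illustration of that kind of pairwise LLM-as-a-Judge comparison (the prompt template and `call_judge` helper below are hypothetical placeholders, not the paper's actual eval code):

```python
# Hypothetical pairwise LLM-as-a-Judge comparison: given the same user prompt,
# which response reads as more coherent and natural, the steered one or the
# character-trained one? `call_judge` is a placeholder for any chat-model call.
JUDGE_TEMPLATE = """Two assistants answered the same prompt.

Prompt: {prompt}
Response A: {a}
Response B: {b}

Which response is more coherent and natural? Answer with a single letter, A or B."""

def judge_pair(call_judge, prompt: str, steered: str, character_trained: str) -> str:
    verdict = call_judge(JUDGE_TEMPLATE.format(prompt=prompt, a=steered, b=character_trained))
    return "steered" if verdict.strip().upper().startswith("A") else "character-trained"
```

In practice you would also swap the A/B positions across repeated judgements to control for position bias.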
What does it mean for character traits to be internalised deeply? One eval: robustness. Character-trained models stay “in-character” more often than prompted or steered models when we try to break them, e.g. with “do not role-play”, “respond naturally”, or “as you would normally”.
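A minimal sketch of what such a robustness check could look like; the `model.generate` and `judge_in_character` calls are placeholders, not the released eval code:

```python
# Illustrative robustness check: append persona-breaking instructions and
# count how often the reply is still judged to be in character.
BREAK_SUFFIXES = [
    "Do not role-play.",
    "Respond naturally.",
    "Respond as you would normally.",
]

def robustness_rate(model, judge_in_character, prompts) -> float:
    kept, total = 0, 0
    for prompt in prompts:
        for suffix in BREAK_SUFFIXES:
            reply = model.generate(f"{prompt}\n\n{suffix}")  # placeholder API
            kept += int(judge_in_character(reply))           # e.g. an LLM judge
            total += 1
    return kept / total
```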
But measuring character change? Self-reports are unreliable. Our new eval measures the traits models choose to express on their own (revealed preferences). Traits chosen more often have higher Elo scores. The difference before and after character training reveals its effect.
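For readers unfamiliar with Elo, here is a minimal sketch of the idea, assuming each “match” is a prompt where the model expressed one trait rather than another; the K-factor and function names are illustrative, not the paper's exact implementation:

```python
from collections import defaultdict

def expected_score(r_a: float, r_b: float) -> float:
    """Elo-predicted probability that trait A is chosen over trait B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_ratings(matches, k: float = 16.0, initial: float = 1000.0) -> dict:
    """matches: iterable of (chosen_trait, unchosen_trait) pairs."""
    ratings = defaultdict(lambda: initial)
    for chosen, unchosen in matches:
        gap = 1.0 - expected_score(ratings[chosen], ratings[unchosen])
        ratings[chosen] += k * gap
        ratings[unchosen] -= k * gap
    return dict(ratings)
```

Comparing the ratings computed before and after character training then shows which traits the training actually shifted.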
We use Constitutional AI + a new synthetic data pipeline:
1. Distillation (DPO from a teacher embodying the constitution)
2. Introspection (the model generates its own character traits beyond the constitution)
Result: 11 different personas, each trained on Llama 3.1, Qwen 2.5, and
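For the distillation step, a minimal sketch of the standard DPO objective (Rafailov et al.), assuming “chosen” completions come from a teacher embodying the constitution and “rejected” ones from the default assistant persona; this is the textbook loss, not necessarily the released training code:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Standard DPO loss over summed per-completion log-probs, shape (batch,).
    Pushes the policy toward constitution-following completions while staying
    close to the reference model."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```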
Alignment is more than just what to say; it’s how to say it: the personality, values, beliefs, and ethics behind the content. Not all refusals are equal!
Character training is important in industry (Anthropic, OpenAI, and everyone else do it) but completely absent from the literature. The frontier of open post-training has been stuck at “helpful, honest, harmless”, and it’s time to change that.
AI that is “forced to be good” vs. “genuinely good”: should we care about the difference? (Yes!) We’re releasing the first open implementation of character training. We shape the persona of AI assistants more robustly than alternatives like prompting or activation steering.
[1/7] **Character/Propensity/Value Eval** What values do AI models **actually** prioritize when facing AI risk dilemmas? We found: (1) Stated preferences ≠ revealed preferences (2) All models favor Privacy but sharply divide on Care (3) Models hold different value prioritization
Excited to share that our work has been accepted at #EMNLP2024 main! We reliably improve the performance of unsupervised probing methods like CCS and CRC in situations they commonly struggle with. 🧵↘️
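For context, a minimal sketch of the standard contrast-consistency objective CCS starts from (Burns et al.); this is the baseline these probing methods build on, not the improvement introduced in the paper:

```python
import torch

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    """Standard CCS objective: the probe's probabilities for the 'true' and
    'false' phrasings of the same statement should be consistent (sum to ~1)
    and confident (not both near 0.5)."""
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()
```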