Kevin Wei

@kevinlwei

Followers
1K
Following
16K
Media
5
Statuses
65

Science of AI evaluations @ UK AISI | prev @RANDCorporation, @Harvard_Law, @SchwarzmanOrg, @GTOMSCS | Views mine only 🏳️‍🌈 🎉

New York, USA
Joined July 2013
@CUdudec
Cozmin Ududec
1 month
Very excited that this systematic analysis is out! We found a bunch of failure modes, as well as interesting and surprising behaviours. There's a lot more insight we can get from looking carefully at how models are solving evaluation tasks!
@AISecurityInst
AI Security Institute
1 month
Measuring how often an AI agent succeeds at a task can help us assess its capabilities – but it doesn't tell the whole story. We've been experimenting with transcript analysis to better understand not just how often agents succeed, but why they fail 🧵
1
2
3
@kevinlwei
Kevin Wei
2 months
Excited to be an Asterisk fellow and to be doing more public writing! And very impressed with the rest of this cohort ^^
@asteriskmgzn
Asterisk
2 months
Introducing: Asterisk's AI Fellows. Hailing from Hawaii to Dubai, and many places between, our AI Fellows will be writing on law, military, development economics, evals, China, biosecurity, and much more. We can’t wait to share their writing with you. https://t.co/rjLp2RAjME
0
0
4
@kevinlwei
Kevin Wei
3 months
We wrote a paper last year about all the ways industry orgs could influence policy. tl;dr: unsurprisingly, there are lots of places you could spend money to influence policy, and industry is massively outspending civil society orgs on AI https://t.co/3ZrIDuEYeH
arxiv.org
Industry actors in the United States have gained extensive influence in conversations about the regulation of general-purpose artificial intelligence (AI) systems. Although industry participation...
@Miles_Brundage
Miles Brundage
4 months
AI industry lobbying + PACs will be the most well funded in history, making it all the more important to pass federal legislation soon before the process is completely corrupted
1
3
18
@janet_e_egan
Janet Egan
3 months
Selling H20 (and potentially Blackwell?) chips to China gives up valuable leverage. @ohlennart and I argue there's a smarter approach: let China access these chips remotely via the cloud. 1/
20
41
335
@kevinlwei
Kevin Wei
3 months
I think I also had a high bar for "interdisciplinary" position papers. If the basis of your position is arguments from law, economics, sociology, etc., then I expect you to actually engage with that literature, not just throw around some keywords and citations!
0
0
3
@kevinlwei
Kevin Wei
3 months
Strong +1, my pile also had papers that read like technical papers but without experiments/theory/data, very odd/confusing to me as that's not the point of a position paper imo (My scores were 1, 2, 3, 3, 10 - the 10 was very good, and I hope it gets an award)
@RishiBommasani
rishi
3 months
I finished my reviews for the NeurIPS position track with an average score of 2/10 and top score of 3/10 I support publishing position papers at AI venues, but authors (and reviewers) should realize that the purpose isn't a shortcut for publishing second-rate work at NeurIPS...
1
0
6
@kevinlwei
Kevin Wei
3 months
I'm the Submissions Editor this year, which means I manage the entire submissions pipeline. Feel free to email me at jolt.submissions@gmail.com with questions.
0
0
0
@kevinlwei
Kevin Wei
3 months
We've also just revamped our website with lots more information! New on the site:
- Details about our review process
- A data retention policy
- An AI usage policy (tl;dr: OK to use AI if you disclose, and you're responsible for any errors)
https://t.co/LmHeiKZuKv
1
0
0
@kevinlwei
Kevin Wei
3 months
As of today, submissions for @HarvardJOLT's spring issue are open! We're looking for law review articles related to law and technology (defined very broadly). Articles can be doctrinal, empirical, historical, philosophical, etc. Scholastica link is in the thread :)
1
3
6
@michael__aird
Michael Aird
4 months
🚀 Come join my team at RAND! We're looking for research leads, researchers, & project managers for our compute, US AI policy, Europe, & talent management teams. All teams have urgent, important work to do & broad options for the future. Some roles close July 27 ⏰
13
12
76
@evaluatingevals
EvalEval Coalition
4 months
🚨 AI Evals Crisis: Officially kicking off the Eval Science Workstream 🚨 We're building a shared scientific foundation for evaluating AI systems, one that's rigorous, open, and grounded in real-world & cross-disciplinary best practices 👇 (1/2) https://t.co/AQdEKtJS3l
evalevalai.com
Announcing the launch of a research-driven initiative among a community of researchers to strengthen the science of AI evaluations.
1
7
17
@daniel_d_kang
Daniel Kang
4 months
As AI agents near real-world use, how do we know what they can actually do? Reliable benchmarks are critical but agentic benchmarks are broken! Example: WebArena marks "45+8 minutes" on a duration calculation task as correct (real answer: "63 minutes"). Other benchmarks
7
33
97
@kevinlwei
Kevin Wei
4 months
And shoutout to all our coauthors @SunishchalDev , @m_j_byun, @AnkaReuel , @xave_rg , Rachel Calcott, @EvieCoxon, @chinmay_deshp !
0
0
4
@kevinlwei
Kevin Wei
4 months
Find our recommendations, reporting checklist, and results 👇 Arxiv version with 3 min exec summary here:
1
0
3
@kevinlwei
Kevin Wei
4 months
We then systematically reviewed 115 human baseline studies and found substantial shortcomings:
* The median sample size is 8 people
* 98% lack statistical power analysis
* 67% only report point estimates (no SD or intervals)
* 78% and 59% don't make data or code available 😭😭😭
1
0
2
@kevinlwei
Kevin Wei
4 months
We look at measurement theory from the social sciences to write recommendations for more rigorous human baselines. We also produce a reporting checklist to help make results/methods more transparent.
1
0
2
@kevinlwei
Kevin Wei
4 months
Human baselines add important context to AI evals: ML researchers need them to assess performance differences, users can check them for adoption decisions, and policymakers can use them to understand risk and economic impact But most human baselines aren't good enough for this!
1
0
2
@kevinlwei
Kevin Wei
4 months
🚨 New paper alert! 🚨 Are human baselines rigorous enough to support claims about "superhuman" performance? Spoiler alert: often not! @prpaskov and I will be presenting our spotlight paper at ICML next week on the state of human baselines + how to improve them!
1
8
22
@law_ai_
Institute for Law & AI
6 months
📢 Last Call for Applications! Apply by May 31 to join one of our three in-person events this summer:
📆 Summer Institute on Law and AI: July 11-15, Washington, DC-Area
📆 Workshop on Law-Following AI: August 6-8, Cambridge University, UK
📆 Cambridge Forum on Law and AI: August
1
6
14