Kevin Wei
@kevinlwei
Followers: 1K
Following: 16K
Media: 5
Statuses: 65
Science of AI evaluations @ UK AISI | prev @RANDCorporation, @Harvard_Law, @SchwarzmanOrg, @GTOMSCS | Views mine only
New York, USA
Joined July 2013
Very excited that this systematic analysis is out! We found a bunch of failure modes, as well as interesting and surprising behaviours. There's a lot more insight we can get from looking carefully at how models are solving evaluation tasks!
Measuring how often an AI agent succeeds at a task can help us assess its capabilities – but it doesn't tell the whole story. We've been experimenting with transcript analysis to better understand not just how often agents succeed, but why they fail 🧵
1
2
3
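(Not from the thread itself, but as a rough illustration of what transcript analysis adds beyond a pass rate: a minimal hypothetical sketch of tallying annotated failure modes across agent transcripts. The transcript structure and failure-mode labels here are invented for illustration, not the team's actual pipeline.)

```python
from collections import Counter

# Hypothetical annotated transcripts: each run records success and,
# for failures, a manually assigned failure-mode label.
transcripts = [
    {"task": "book_flight", "success": True,  "failure_mode": None},
    {"task": "book_flight", "success": False, "failure_mode": "tool_misuse"},
    {"task": "summarise",   "success": False, "failure_mode": "gave_up_early"},
    {"task": "summarise",   "success": False, "failure_mode": "tool_misuse"},
]

# The headline number: how often the agent succeeds...
success_rate = sum(t["success"] for t in transcripts) / len(transcripts)

# ...and the extra signal from reading transcripts: *why* it fails.
failure_modes = Counter(t["failure_mode"] for t in transcripts if not t["success"])

print(f"success rate: {success_rate:.0%}")      # 25%
print(f"failure modes: {dict(failure_modes)}")  # {'tool_misuse': 2, 'gave_up_early': 1}
```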
Excited to be an Asterisk fellow and to be doing more public writing! And very impressed with the rest of this cohort ^^
Introducing: Asterisk's AI Fellows. Hailing from Hawaii to Dubai, and many places between, our AI Fellows will be writing on law, military, development economics, evals, China, biosecurity, and much more. We can't wait to share their writing with you. https://t.co/rjLp2RAjME
0
0
4
We wrote a paper last year about all the ways industry orgs could influence policy. tl;dr: unsurprisingly, there are lots of places you could spend money to influence policy, and industry is massively outspending civil society orgs on AI https://t.co/3ZrIDuEYeH
arxiv.org
Industry actors in the United States have gained extensive influence in conversations about the regulation of general-purpose artificial intelligence (AI) systems. Although industry participation...
AI industry lobbying + PACs will be the most well funded in history, making it all the more important to pass federal legislation soon before the process is completely corrupted
1
3
18
Selling H20 (and potentially Blackwell?) chips to China gives up valuable leverage. @ohlennart and I argue there's a smarter approach: let China access these chips remotely via the cloud. 1/
20
41
335
I think I also had a high bar for "interdisciplinary" position papers. If the basis of your position is arguments from law, economics, sociology, etc., then I expect you to actually engage with that literature, not just throw around some keywords and citations!
0
0
3
Strong +1, my pile also had papers that read like technical papers but without experiments/theory/data. Very odd/confusing to me, as that's not the point of a position paper imo (My scores were 1, 2, 3, 3, 10 - the 10 was very good, and I hope it gets an award)
I finished my reviews for the NeurIPS position track with an average score of 2/10 and top score of 3/10. I support publishing position papers at AI venues, but authors (and reviewers) should realize that the purpose isn't a shortcut for publishing second-rate work at NeurIPS...
1
0
6
I'm the Submissions Editor this year, which means I manage the entire submissions pipeline. Feel free to email me at jolt.submissions@gmail.com with questions.
0
0
0
We've also just revamped our website with lots more information! New on the site: - Details about our review process - A data retention policy - An AI usage policy (tl;dr: OK to use AI if you disclose, and you're responsible for any errors) https://t.co/LmHeiKZuKv
1
0
0
As of today, submissions for @HarvardJOLT's spring issue are open! We're looking for law review articles related to law and technology (defined very broadly). Articles can be doctrinal, empirical, historical, philosophical, etc. Scholastica link is in the thread :)
1
3
6
Come join my team at RAND! We're looking for research leads, researchers, & project managers for our compute, US AI policy, Europe, & talent management teams. All teams have urgent, important work to do & broad options for the future. Some roles close July 27 ⏰
13
12
76
🚨 AI Evals Crisis: Officially kicking off the Eval Science Workstream 🚨 We're building a shared scientific foundation for evaluating AI systems, one that's rigorous, open, and grounded in real-world & cross-disciplinary best practices (1/2) https://t.co/AQdEKtJS3l
evalevalai.com
Announcing the launch of a research-driven initiative among a community of researchers to strengthen the science of AI evaluations.
1
7
17
As AI agents near real-world use, how do we know what they can actually do? Reliable benchmarks are critical but agentic benchmarks are broken! Example: WebArena marks "45+8 minutes" on a duration calculation task as correct (real answer: "63 minutes"). Other benchmarks
7
33
97
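(As an illustration of the grading gap described above: a minimal hypothetical sketch of a stricter numeric check that would catch an unevaluated expression like "45+8 minutes". This is not WebArena's actual checker; the parsing rules here are assumptions made for the example.)

```python
import re
from typing import Optional

def parse_minutes(answer: str) -> Optional[int]:
    """Evaluate simple 'a+b+...' expressions or bare numbers as minutes.
    Returns None if no number is found."""
    match = re.search(r"\d+(?:\s*\+\s*\d+)*", answer)
    if not match:
        return None
    return sum(int(part) for part in match.group(0).split("+"))

def strict_duration_check(predicted: str, reference: str) -> bool:
    """Mark the prediction correct only if the evaluated durations agree."""
    pred, ref = parse_minutes(predicted), parse_minutes(reference)
    return pred is not None and pred == ref

print(parse_minutes("45+8 minutes"))                        # 53
print(strict_duration_check("45+8 minutes", "63 minutes"))  # False: 53 != 63
print(strict_duration_check("63 minutes", "63 minutes"))    # True
```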
And shoutout to all our coauthors @SunishchalDev , @m_j_byun, @AnkaReuel , @xave_rg , Rachel Calcott, @EvieCoxon, @chinmay_deshp !
0
0
4
Find our recommendations, reporting checklist, and results below. arXiv version with 3 min exec summary here:
1
0
3
We then systematically reviewed 115 human baseline studies and found substantial shortcomings: * The median sample size is 8 people * 98% lack statistical power analysis * 67% only report point estimates (no SD or intervals) * 78% and 59% don't make data or code available, respectively
1
0
2
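(To see why a median sample size of 8 and point estimates alone are problematic, here is a minimal illustrative sketch, not from the paper: a Wilson 95% confidence interval for a hypothetical baseline of 6 successes out of 8 trials spans roughly 0.41 to 0.93, far too wide to ground claims about human-vs-model differences.)

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score confidence interval for a binomial proportion (95% by default)."""
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width

# Hypothetical baseline: 6 of 8 participants solve the task (point estimate 75%).
low, high = wilson_interval(6, 8)
print(f"point estimate: 0.75, 95% CI: [{low:.2f}, {high:.2f}]")  # roughly [0.41, 0.93]
# The interval spans ~50 percentage points, so reporting "75%" alone
# says very little about whether a model scoring, say, 80% is actually better.
```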
We draw on measurement theory from the social sciences to develop recommendations for more rigorous human baselines. We also provide a reporting checklist to help make results and methods more transparent.
1
0
2
Human baselines add important context to AI evals: ML researchers need them to assess performance differences, users can check them for adoption decisions, and policymakers can use them to understand risk and economic impact. But most human baselines aren't good enough for this!
1
0
2
🚨 New paper alert! 🚨 Are human baselines rigorous enough to support claims about "superhuman" performance? Spoiler alert: often not! @prpaskov and I will be presenting our spotlight paper at ICML next week on the state of human baselines + how to improve them!
1
8
22
Last Call for Applications! Apply by May 31 to join one of our three in-person events this summer: • Summer Institute on Law and AI: July 11-15, Washington, DC-Area • Workshop on Law-Following AI: August 6-8, Cambridge University, UK • Cambridge Forum on Law and AI: August
1
6
14