
Stanford Trustworthy AI Research (STAIR) Lab
@stai_research
611 Followers · 320 Following · 3 Media · 281 Statuses
A research group in @StanfordAILab researching AI Capabilities, Trust and Safety, Equity, and Reliability. Website: https://t.co/CgOHvNHL4x
Stanford, CA
Joined November 2023
RT @DeepIndaba: 🚨 Keynote alert! We’re thrilled to welcome @sanmikoyejo as our next speaker in #DLI2025! Catch the session "Beyond benchma….
RT @BrandoHablando: @OpenAI @RylanSchaeffer 🎯 From “looks right” ➜ mathematically verified. Visit our poster #ICML2025 West Ballroom C. Fri….
RT @BrandoHablando: @_akhaliq @_alycialee Joint work with @ObbadElyas Mario Krrish Aryan @sanmikoyejo Me Sudarsan at @stai_research! Than….
RT @BrandoHablando: @_akhaliq @_alycialee @ObbadElyas @sanmikoyejo @stai_research Preprint on arxiv: 🧵4/3.
arxiv.org
Contrary to the conventional emphasis on dataset size, we explore the role of data alignment -- an often overlooked aspect of data quality -- in training capable Large Language Models (LLMs). To...
RT @BrandoHablando: Come to Convention Center West room 208-209 2nd floor to learn about optimal data selection using compression like gzip….
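The tweet above mentions data selection with off-the-shelf compressors like gzip. As a hedged illustration of that general idea (not the paper's actual method), normalized compression distance scores how well a candidate document aligns with a target-domain sample; the example strings and variable names below are hypothetical:

```python
import gzip

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: measures how much better x and y
    compress together than apart, with gzip standing in for an ideal
    compressor. Lower values mean more shared structure."""
    cx = len(gzip.compress(x))
    cy = len(gzip.compress(y))
    cxy = len(gzip.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Hypothetical use: rank candidate training documents by alignment
# with a target-domain sample (most aligned first).
target = b"Theorem: every finite group of prime order is cyclic."
candidates = [
    b"Lemma: a group whose order is a prime has no proper nontrivial subgroup.",
    b"Recipe: whisk the eggs, fold in the flour, bake at 180 C for twenty minutes.",
]
ranked = sorted(candidates, key=lambda doc: ncd(target, doc))
```

The appeal of the compression-based score is that it needs no training or embeddings, just a standard library compressor.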
RT @BrandoHablando: 🕵️‍♂️ Takeaway: report dynamic splits + step metrics or risk over-claiming your model’s reasoning skills. Putnam-AXIOM….
openreview.net
Current mathematical reasoning benchmarks for large language models (LLMs) are approaching saturation, with some achieving >90% accuracy, and are increasingly compromised by training-set...
RT @sangttruong: GitHub: HuggingFace: Come talk to us to learn more about better LM evalua….
huggingface.co
RT @sangttruong: We thank Andrew Myers and Jill Wu from @StanfordEng for bringing our research to the broader community: ….
RT @sangttruong: The adaptive testing is integrated into HELM. HELM integration blog: … You….
RT @sangttruong: Adaptive testing needs a large & diverse question bank, but manual curation is costly. We use the amortized difficulty pre….
RT @sangttruong: During calibration, IRT estimates question difficulty from LM responses, but querying LMs is costly. We introduce *amortiz….
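As a toy illustration of the amortization idea in the tweet above (a cheap predictor replaces costly LM queries when estimating the difficulty of new questions), here is a closed-form linear fit; the feature (question length) and all numbers are made up for the example, and this is not the paper's predictor:

```python
def fit_linear(xs, ys):
    """Closed-form simple linear regression (ordinary least squares)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

# Hypothetical calibrated items: (question length in tokens, IRT difficulty
# estimated once, offline, from LM responses).
lengths = [20, 45, 80, 120, 160]
difficulties = [-1.2, -0.4, 0.3, 0.9, 1.6]
slope, intercept = fit_linear(lengths, difficulties)

def predict_difficulty(length: float) -> float:
    """Amortized estimate: new questions get a difficulty without any
    further LM queries."""
    return slope * length + intercept
```

Once the predictor is fit, growing the question bank costs only a feature computation per item instead of a round of model responses.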
RT @sangttruong: IRT includes 2 phases: calibration (estimate question difficulty) and adaptive testing (select informative questions to ev….
RT @sangttruong: LMs are evaluated by average scores on benchmark subsets to save costs, but that’s unreliable. Item response theory (IRT)….
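The thread above describes IRT-based evaluation in two phases. A minimal Rasch-model (1PL) sketch of both phases follows: ability estimation given calibrated difficulties, then selection of the most informative next item. It illustrates textbook IRT, not the paper's implementation, and all numbers are hypothetical:

```python
import math

def p_correct(ability: float, difficulty: float) -> float:
    """Rasch (1PL) model: probability that a model of the given ability
    answers an item of the given difficulty correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def estimate_ability(responses, difficulties, steps=200, lr=0.1):
    """Maximum-likelihood ability estimate by gradient ascent on the
    Rasch log-likelihood. responses: 0/1 outcomes per item."""
    theta = 0.0
    for _ in range(steps):
        grad = sum(r - p_correct(theta, b)
                   for r, b in zip(responses, difficulties))
        theta += lr * grad
    return theta

def fisher_info(theta: float, difficulty: float) -> float:
    """Item information p(1-p); maximized when difficulty matches ability."""
    p = p_correct(theta, difficulty)
    return p * (1 - p)

def most_informative(theta, remaining):
    """Adaptive testing step: ask the unanswered question that is most
    informative at the current ability estimate."""
    return max(remaining, key=lambda b: fisher_info(theta, b))
```

In use, the two phases alternate: re-estimate ability after each response, then pick the next item, so far fewer questions are needed than scoring a whole benchmark subset.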
RT @sangttruong: @sanmikoyejo gives a nice talk contextualizing our paper contribution in the broader AI Measurement Sciences community in….
hai.stanford.edu
The widespread deployment of AI systems in critical domains demands more rigorous approaches to evaluating their capabilities and safety.
RT @sangttruong: Interested in LLM evaluation reliability & efficiency? Check our ICML’25 paper: Reliable and Efficient Amortized Model-ba….