@_arohan_
rohan anil
5 months
Called it with some knowledge about the model. Ultra is going to break ground! Those quibbling over HellaSwag and MMLU are just showing their misunderstanding of evaluation. Onwards 🚀
@lmsysorg
lmsys.org
5 months
🔥Breaking News from Arena: Google's Bard has just made a stunning leap, surpassing GPT-4 to take the SECOND SPOT on the leaderboard! Big congrats to @Google for the remarkable achievement! The race is heating up like never before! Super excited to see what's next for Bard + Gemini…

Replies

@keveman
keveman
5 months
@_arohan_ Ultra, perhaps the first one to exceed 1300 Elo?
@_arohan_
rohan anil
5 months
@keveman 🫡
@natolambert
Nathan Lambert
5 months
@_arohan_ 🫡 haters gonna hate, most realistic people still count Google in / as the favorite in the AI race
@_arohan_
rohan anil
5 months
@natolambert I want to be petty and respond to everyone who makes a big deal about MMLU and HellaSwag, but I just don't have enough time in the day to make an alt account and start memeing
@Shawnryan96
Shawn
5 months
@_arohan_ I have never found evals to mean anything to me as the user. It is better to test it yourself and see what works.
@kohjingyu
Jing Yu Koh
5 months
@_arohan_ Can't wait for multimodal Ultra!
@gblazex
Blaze (Balázs Galambosi)
5 months
@_arohan_ You can argue about HellaSwag, but MMLU is far from useless. Human- or LLM-judged responses can be biased toward conversational style and length. For the strictly general chatbot use case, yes, Arena is king. But if you're using the model for other things, breadth of knowledge (MMLU) can matter a lot.
@programmer_dude
Programmer Dude
5 months
@_arohan_ When we ramping?
@gblazex
Blaze (Balázs Galambosi)
5 months
@_arohan_ Also, congrats on the results. GPT-4 could be using similar techniques in the background to enhance its responses, as Bard does. Competition is good for everyone. Just be cautious about ditching every single benchmark other than Arena. They can have their place too (e.g. MMLU).