Explore tweets tagged as #BenchMark
LLMs can lie in different ways. How do we know if lie detectors are catching all of them? We introduce LIARS' BENCH, a new benchmark made of 7 different datasets, containing over 72,000 on-policy lies and honest responses to evaluate lie detectors for LLMs.
1992 Ferrari F40 | Asking Price: $3,250,990 The Ferrari F40 remains the benchmark for pure driver engagement. With its featherweight chassis, twin-turbo punch, and iconic rear wing, it represents the essence of Ferrari's golden era and continues to command respect from
The "Benchmark"
New @scale_AI paper: ResearchRubrics, a benchmark for evaluating Deep Research (DR) agents. Even top agents like Gemini & OpenAI DR achieve <68% rubric compliance. We built 2.5K+ expert rubrics with 2.8K+ hrs of human labor to measure why.
Real marketers. Real results. See what our users are saying about how Benchmark Email helps them send smarter, faster, and better campaigns. ✨ Because your success is the best story we can tell. #benchmarkemail #emailmarketing #customerlove #emailstrategy
If $WIF is a behavior benchmark for the broader altcoin market, then Alts are sitting exactly on support, holding their April-bottom structure despite BTC's heavy sell pressure. Alts, which were nuked in October, are resisting the final rounds far better than expected. 🧵
Introducing Gemini 3 Pro for understanding research papers. Highlight any section of a paper to ask questions and "@" other papers for quick context, comparisons, and benchmark references
Next Level Intensity: The Hummer Fight Sequence Sets A New Benchmark In Cinematic Action 😮‍💨
Freight continues its epic collapse, with the Cass Shipment Index giving off some of the most significant warning signs about the state of the goods economy. The Cass Shipment Index, the benchmark freight index, has dropped to October 2009 levels, the height of the Great
Gemini 3.0 vs Radiologists: RadLE Benchmark Results Are OUT! ⚠️ Is it game over for Radiology? Let us find out! ⬇️ 🫨 Since yesterday, Gemini 3.0 has been everywhere for crushing benchmarks. My inbox exploded asking: "But how did it do on the hardest visual reasoning
Asus is lazy. I showed them full video proof, live footage of the crash, dump files and every detail possible. My 5090 keeps crashing randomly and all they do is launch a 15 minute benchmark and send it back. This has been going on for 3 months. I tell them it's not fixed so they
The founder of Google flying his $150M blimp over San Francisco on the day Gemini 3 beats nearly every model benchmark is the exact type of big baller energy this city loves. "I'm still daddy" - Sergey Brin, probably
Surrey residents deserve clear facts. According to the Surrey Police Service (SPS), 608 SPS officers are currently deployed. Combined with remaining RCMP officers, Surrey has the highest number in the city's history, above the benchmark agreed upon by both the Province and the
Gemini 3.0 Pro absolutely dominates every benchmark! The jump from 2.5 is nuts! Its scores on the most difficult benchmarks suggest this is essentially baby AGI! Humanity's Last Exam: 37.5%; ARC-AGI-2: 31.1%; LiveCodeBench Pro: 2439; MathArena Apex: 23.4%; SimpleQA: 72.1%
Announcing AA-Omniscience, our new benchmark for knowledge and hallucination across >40 topics, where all but three models are more likely to hallucinate than give a correct answer. Embedded knowledge in language models is important for many real world use cases. Without