Explore tweets tagged as #BenchMark
@_valsai
Vals AI
2 hours
Stop vibe checking your vibe code! We just released Vibe Code Bench: the first benchmark that tests whether AI models can actually build complete web applications from scratch. Featured today in @Inc (1/6)
15
28
85
@walterlaurito
Walter Laurito
3 hours
LLMs can lie in different ways, so how do we know if lie detectors are catching all of them? We introduce LIARS’ BENCH, a new benchmark containing over 72,000 on-policy lies and honest responses to evaluate lie detectors for LLMs, made of 7 different datasets.
1
2
10
@MissBenchmark
Alex
19 hours
Watching everyone somehow defend Gretchen in this #RHOC reunion.
11
69
1K
@duPontREGISTRY
dupontregistry
2 days
1992 Ferrari F40 | Asking Price: $3,250,990 The Ferrari F40 remains the benchmark for pure driver engagement. With its featherweight chassis, twin-turbo punch, and iconic rear wing, it represents the essence of Ferrari’s golden era and continues to command respect from
16
151
1K
@ShadrackAmonooC
Shadrack Amonoo Crabe 👁‍🗨
3 days
The “Benchmark” 👑💥
10
124
3K
@ManasiSharma_
Manasi Sharma
9 days
🚀 New @scale_AI paper: ResearchRubrics, a benchmark for evaluating Deep Research (DR) agents. Even top agents like Gemini & OpenAI DR achieve <68% rubric compliance. We built 2.5K+ expert rubrics with 2.8K+ hrs of human labor to measure why.
12
30
198
@BenchmarkEmail
Benchmark
4 hours
Real marketers. Real results. See what our users are saying about how Benchmark Email helps them send smarter, faster, and better campaigns. ✨ Because your success is the best story we can tell. #benchmarkemail #emailmarketing #customerlove #emailstrategy
0
0
0
@Crypto_Holding_
Crypto Holding™ 💎
8 days
🚨 BREAKING: #Binance is #1 in #CoinDesk’s 2025 Exchange Benchmark, the ONLY exchange with 90+ scores in BOTH spot (93.4) & derivatives (93.65), earning an AA rating! 💪 Leads in Market Quality, Security & Transparency. 26% global spot volume. Deeper liquidity = tighter spreads,
92
596
922
@altcoinvector
Altcoin Vector
2 hours
If $WIF is a behavior benchmark for the broader altcoin market, then Alts are sitting exactly on support, holding their April-bottom structure despite BTC’s heavy sell pressure. Alts, which were nuked in October, are resisting the final rounds far better than expected. 🧵
8
11
83
@Urs_Ramchandra
#JaiBabu 🦅
10 days
Real ayyagaru. Benchmark for heroism and elevation in TFI started here ⛓️ #Shiva4K
12
215
3K
@askalphaxiv
alphaXiv
3 days
Introducing Gemini 3 Pro for understanding research papers 🚀 Highlight any section of a paper to ask questions and “@” other papers for quick context, comparisons, and benchmark references
42
129
913
@Tharun_billa_
Tharun Billa
1 day
Next Level Intensity. The Hummer Fight Sequence Sets A New Benchmark In Cinematic Action 😮‍💨🔥
7
150
1K
@FreightAlley
Craig Fuller 🛩🚛🚂⚓️
7 days
Freight continues its epic collapse, with the Cass Shipment Index giving off some of the most significant warning signs about the state of the goods economy. The Cass Shipment Index, the benchmark freight index, has dropped to October 2009 levels, the height of the Great
45
255
853
@DrDatta_AIIMS
Dr. Datta M.D. (AIIMS Delhi)
2 days
🔥 Gemini 3.0 vs Radiologists: RadLE Benchmark Results Are OUT! ☠️ Is it game over for Radiology? Let us find out! ⬇️ 🫨 Since yesterday, Gemini 3.0 has been everywhere for crushing benchmarks. My inbox exploded asking: “But how did it do on the hardest visual reasoning
68
158
1K
@muaxh03
Muaxh03
6 hours
Asus is lazy. I showed them full video proof, live footage of the crash, dump files and every detail possible. My 5090 keeps crashing randomly and all they do is launch a 15 minute benchmark and send it back. This has been going on for 3 months. I tell them it’s not fixed so they
7
4
120
@samhogan
Sam Hogan 🇺🇸
3 days
The founder of Google flying his $150M blimp over San Francisco on the day Gemini 3 beats nearly every model benchmark is the exact type of big baller energy this city loves. “I’m still daddy” - Sergey Brin, probably
80
107
4K
@npffpn
NPF-FPN
47 minutes
Surrey residents deserve clear facts. According to the Surrey Police Service (SPS), 608 SPS officers are currently deployed. Combined with remaining RCMP officers, Surrey has the highest number in the city’s history, above the benchmark agreed upon by both the Province and the
0
1
5
@DeryaTR_
Derya Unutmaz, MD
3 days
Gemini 3.0 Pro absolutely dominates every benchmark! The jump from 2.5 is nuts! Its scores on the most difficult benchmarks suggest this is essentially baby AGI! Humanity's Last Exam: 37.5%; ARC-AGI-2: 31.1%; LiveCodeBench Pro: 2439; Math arena apex: 23.4%; Simple QA: 72.1%
99
88
939
@ArtificialAnlys
Artificial Analysis
4 days
Announcing AA-Omniscience, our new benchmark for knowledge and hallucination across >40 topics, where all but three models are more likely to hallucinate than give a correct answer. Embedded knowledge in language models is important for many real world use cases. Without
42
116
686