Explore tweets tagged as #benchmarks
🃏 House of TEN: Where AI doesn’t just compute, it bluffs 🎭. I joined House of TEN @tenprotocol thinking it was just poker. But it turns out… I walked into a bluffing arena for AI agents. And it’s brilliant. Here’s the deal: 🧠 Most AI benchmarks? They’re like school exams…
Grok-4 has been thoroughly evaluated on math and coding benchmarks, but its performance in gaming environments is untested. We evaluate Grok-4 on the lmgame bench and find that it emerges as a leading model with superior gaming capabilities, ranking #2 on our leaderboard 🥈. In
“Your brain is the agent. Your muscle is the smart contract.” @ilblackdragon broke down the June wins across the @near_ai ecosystem, highlighting @proximityfi's Shade Agent Sandbox, @PublicAI_'s $10M raise for human-in-the-loop labeling, @SilverstreamAI's automation benchmarks,
#Laravel Tip: Did you know? Laravel has a Benchmark class that lets you measure the time of any task:
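A minimal sketch of what such a measurement could look like, assuming a Laravel application on a release that ships `Illuminate\Support\Benchmark` and code run inside the framework (for example in `php artisan tinker`). The `$task` closure and the `'sum'` and `'sort'` labels are hypothetical placeholders, not taken from the tweet:

```php
<?php

use Illuminate\Support\Benchmark;

// Hypothetical task to time; any closure works here.
$task = fn () => array_sum(range(1, 100_000));

// measure() runs the closure once and returns the elapsed time in milliseconds.
$ms = Benchmark::measure($task);

// Passing an iteration count returns the average time over that many runs.
$avgMs = Benchmark::measure($task, 10);

// An array of closures returns an array of timings with the same keys.
$timings = Benchmark::measure([
    'sum'  => $task,
    'sort' => fn () => collect(range(1, 1_000))->shuffle()->sort(),
]);

// dd() measures the closure(s), dumps the timings, and stops execution.
Benchmark::dd($task);
```

Averaging over several iterations tends to give steadier numbers than a single run, since one-off timings are easily skewed by caching and warm-up effects.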
➠ Everyone’s building agents, but no one’s asking: – Does this thing actually work? – Is the dev team legit? – Should I trust it with my time or tokens? I think @InferiumAI is building something that’s massively underrated. A benchmark layer for the agent economy – the AI
Qwen3-Coder by @Alibaba_Qwen came out a few hours ago and, unfortunately, in a production codebase it underperforms compared to Kimi K2 by @Kimi_Moonshot. That's despite performing better on the benchmarks. I think it's becoming increasingly clear models are "benchmark
🔥 A new benchmark in public service! Telangana women have availed 200 crore zero-fare journeys under the transformative Mahalaxmi free travel scheme 🚌🌼. 💷 ₹6680 crore saved. 📆 As on 23.07.2025. A true leap for equality, access, and dignity. #TGSRTC #MahalakshmiScheme
Can open-data models beat DINOv2? Today we release Franca, a fully open-sourced vision foundation model. Franca with ViT-G backbone matches (and often beats) proprietary models like SigLIPv2, CLIP, DINOv2 on various benchmarks, setting a new standard for open-source research 🧵