8 models of the week to pay attention to:
▪️ FastVLM by @Apple
▪️ OLMoASR
▪️ gpt-realtime and Realtime API updates
▪️ InternVL3.5
▪️ Hermes 4
▪️ USO
▪️ rStar2-Agent
▪️ VibeVoice
Find the latest updates about AI here: https://t.co/GmOBazWXUP
Some details about the models in 🧵
1. @Apple's FastVLM on Hugging Face
🚨 Apple just released FastVLM on Hugging Face: 0.5B, 1.5B and 7B real-time VLMs with WebGPU support 🤯
> up to 85x faster and 3.4x smaller than comparably sized VLMs
> 7.9x faster TTFT for larger models
> designed to output fewer tokens and reduce encoding time for high-resolution images
2. OLMoASR: open speech recognition models by @allen_ai
6 fully open ASR models (39M–1.5B parameters) trained on curated datasets of up to 680K hours.
- OLMoASR-medium.en achieves 12.8%/11.0% WER (short/long-form), matching Whisper-medium.en.
- Built from a 3M-hour pool filtered…
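The WER numbers quoted above are word error rate: word-level edit distance divided by the number of reference words. A minimal sketch (plain Levenshtein over words; real ASR evaluations also normalize casing and punctuation first):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table for Levenshtein distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

So a 12.8% WER means roughly 1 in 8 reference words is substituted, deleted, or inserted.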
3. gpt-realtime and Realtime API for voice agents
This speech-to-speech model:
- Achieves 82.8% accuracy on Big Bench Audio
- Scores 30.5% on MultiChallenge
- Supports image inputs, SIP phone calling, and remote MCP servers
- Improves function calling accuracy to 66.5%
- Adds 2 new voices…
4. InternVL3.5: open-source LLM-based multimodal model family
- 4.05× faster inference and SOTA performance across general multimodal and agentic tasks
- +16.0% gain on MMMU and MathVista
- Features Cascade Reinforcement Learning (offline + online RL) to enhance reasoning
- The…
5. Hermes 4
A hybrid reasoning LLM family built on 5M post-training samples (19B tokens), including 3.5M reasoning-heavy examples with sequences up to 16K tokens.
- Uses DataForge for structured synthetic data generation and Atropos for rejection sampling across task-specific…
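Rejection sampling here means drawing several candidate completions and keeping only those a task-specific verifier accepts. A toy sketch of the idea, assuming hypothetical `generate`/`verify` callables (not the actual Atropos API):

```python
import random

def rejection_sample(prompt, generate, verify, n_candidates=8):
    """Draw several candidate completions and keep only those that
    pass a task-specific verifier."""
    candidates = [generate(prompt) for _ in range(n_candidates)]
    return [c for c in candidates if verify(prompt, c)]

# Toy stand-ins: a random "model" and a verifier that checks the answer.
gen = lambda p: f"answer: {random.randint(1, 4)}"
ver = lambda p, c: c.endswith("3")
kept = rejection_sample("2+1=?", gen, ver)
```

The surviving samples then become training data, so the verifier's quality directly bounds the dataset's quality.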
6. USO: unified style- and subject-driven generation via disentangled and reward learning
- Uses a triplet dataset (content, style, stylized image) and trains with style-alignment and content-style disentanglement objectives
- A Style Reward Learning (SRL) module further…
7. rStar2-Agent
A 14B-parameter math reasoning model trained with agentic RL.
- Uses GRPO-RoC, an RL strategy that handles noisy code environments
- Trained efficiently on only 64 MI300X GPUs
- In just 510 RL steps, achieves 80.6% on AIME24 and 69.8% on…
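GRPO-style training scores each rollout relative to its sampling group instead of a learned value function. A minimal sketch of the group-relative advantage step (plain GRPO; the resample-on-correct rollout filtering that GRPO-RoC adds is omitted here):

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each rollout's reward by the
    mean and std of its own sampling group (no critic network)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All rollouts scored the same: no learning signal from this group.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

With binary math-verifier rewards, e.g. `[1, 0, 1, 0]`, correct rollouts get positive advantage and incorrect ones negative, which is what the policy update then amplifies.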
8. VibeVoice
A long-form speech synthesis model that uses next-token diffusion for continuous data generation.
- Generates up to 90 minutes of speech with up to 4 speakers in a 64K-token window, delivering high-fidelity multi-speaker dialogue synthesis
- A novel tokenizer…
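Back-of-envelope on why that window size matters: fitting 90 minutes of audio into 64K tokens implies only about 12 tokens per second of speech (ignoring text and speaker tokens sharing the window), i.e. the tokenizer must compress audio very aggressively:

```python
# Rough implied acoustic token rate for a 90-minute, 64K-token session.
tokens = 64_000
seconds = 90 * 60  # 5400 s
tokens_per_second = tokens / seconds
print(round(tokens_per_second, 2))  # ~11.85 tokens/s
```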
@Apple Follow @TheTuringPost for more. Get deep analysis, guides & breakdowns of what AI is about now. Join 90,000+ readers from top AI labs, VC funds & universities.
@TheTuringPost @Apple vibey stuff with VibeVoice. interested to see how it handles both voice and text inputs in real scenarios. looks like it could be a game-changer for interactive systems.
@TheTuringPost @Apple An insightful list of models to watch. FastVLM and gpt-realtime updates stand out. It’ll be interesting to see their long-term impact. Which model do you think will shape the future of AI the most?
@TheTuringPost @Apple eight models, huh? sounds like a lineup for a bad sci-fi movie. fastvlm better not be as slow as its name suggests. let's see who survives the AI apocalypse. 2025 is just getting warmed up.