Orr Zohar (@orr_zohar)
@nvidia • @Stanford • @KnightHennessy scholar • Researching large multimodal models
Joined May 2023 · 521 followers · 229 following · 20 media · 126 statuses
FineVision is out 🚀 And ready to empower the next generation of LMMs. Check it out!
Today, we are releasing FineVision, a huge open-source dataset for training state-of-the-art Vision-Language Models:
> 17.3M images
> 24.3M samples
> 88.9M turns
> 9.5B answer tokens
Here are my favourite findings:
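Some quick ratios that fall out of those headline numbers (my own rounding; only the figures quoted above come from the announcement):

```python
# Quick ratios from the announcement's own numbers (rounded).
images, samples, turns, answer_tokens = 17.3e6, 24.3e6, 88.9e6, 9.5e9

print(f"turns per sample       ≈ {turns / samples:.1f}")        # ≈ 3.7
print(f"answer tokens per turn ≈ {answer_tokens / turns:.0f}")  # ≈ 107
print(f"samples per image      ≈ {samples / images:.1f}")       # ≈ 1.4
```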
People underestimate how token-heavy video understanding & generation are - and their massive untapped potential. We grasp images in a split second. Videos? Not so much: Comprehending a 1-hour clip often demands watching the whole thing. Video AI is a tiny sliver of tokens today, …
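To put "token-heavy" in rough numbers, a back-of-envelope sketch under illustrative assumptions (1 fps sampling and 256 visual tokens per frame are my own choices, not figures from the post):

```python
# Illustrative arithmetic only: assumed sampling rate and tokens per frame.
frames_per_hour = 60 * 60 * 1          # 1 fps over one hour
tokens_per_frame = 256                 # assumed vision-encoder output per frame

video_tokens = frames_per_hour * tokens_per_frame
image_tokens = tokens_per_frame        # a single image at the same budget

print(f"1-hour video ≈ {video_tokens:,} visual tokens")   # 921,600
print(f"single image ≈ {image_tokens:,} visual tokens")   # 256
print(f"ratio ≈ {video_tokens // image_tokens}x")         # 3600x
```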
@StefanoErmon @_inception_ai Diffusion will obviously work on any bitstream. With text, since humans read from first word to last, there is just the question of whether the delay to first sentence for diffusion is worth it. That said, the vast majority of AI workload will be video understanding and …
we think humanity’s biggest challenges won’t be solved by ai thinking for 1000 hours coming back with an answer they’ll be solved by many collaborating humans, and ai that understands them and their different skills, goals, values, etc to empower them to do more together
some folks and i are making something new if you're hopeful about AI empowering everyone if you've worked on multiturn, memory, model behavior, multiagent RL, user sim, AI interfaces/products, kernels, or dist systems if you want frontier-scale compute & top infra let's chat!
Started the early release of Vision Language Models (O’Reilly). My first VLM book—already on chapter 3 and loving it. Clear and easy to follow. Great work @mervenoyann, @andimarafioti, @micuelll & @orr_zohar! Looking forward to the remaining chapters.
And we at #NVIDIA Research are still seeking research interns to explore omni-modal LLMs across a variety of domains, including robotics (VLA), visual agentic tool use, world modeling, and unified understanding and generation. Drop me an email if you are interested!
OmniVinci is now the #1 paper on Hugging Face!!! 🤗 Building omni-modal LLMs is MORE than just mixing tokens 😉 At @NVIDIA, we explored deeper possibilities in building truly omni-modal systems — leading to OmniVinci-9B, which introduces three key innovations:
- OmniAlignNet – a …
Encoder-Decoder models — making a comeback? 🤔 Discrete diffusion LMs can accelerate text generation (similar to speculative decoding). Exciting to see where these models go, and how many of the traditional AR design decisions will hold.
🚨In our NeurIPS paper, we bring encoder-decoders back... for diffusion language models! ⚡️Encoder-decoders make diffusion sampling fast: a small (fast) decoder denoises tokens progressively and a large (slower) encoder represents clean context.
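Roughly what that sampling recipe looks like, as a minimal sketch under my own interface assumptions (the encoder/decoder callables, masking scheme, and confidence-based unmasking schedule below are illustrative, not the paper's code):

```python
import torch

def sample_block(encoder, decoder, clean_context_ids, block_len, num_steps, mask_id):
    """Denoise one block of new tokens given already-generated clean context."""
    # One (expensive) pass of the large encoder over the clean context.
    context_states = encoder(clean_context_ids)               # e.g. [1, ctx_len, d]

    # Start the new block fully masked; the small decoder denoises it step by step.
    block = torch.full((1, block_len), mask_id, dtype=torch.long)
    for _ in range(num_steps):
        logits = decoder(block, context_states)                # cheap per-step pass
        conf, pred = logits.softmax(-1).max(-1)                # [1, block_len]
        still_masked = block.eq(mask_id)
        k = max(1, int(still_masked.sum()) // num_steps)
        # Unmask the k most confident still-masked positions this step.
        scores = torch.where(still_masked[0], conf[0], torch.full_like(conf[0], -1.0))
        idx = scores.topk(k).indices
        block[0, idx] = pred[0, idx]
    block[block.eq(mask_id)] = pred[block.eq(mask_id)]         # finalize any leftovers
    return block
```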
🚨Huge for multimodal/vision AI: Datasets hit 100s of TB, making on-prem storage a nightmare. 🤗Now stream them directly from Hugging Face to GPUs - unlocking scalable training of everything from VLMs to world models. 🚀 I've battled storage limits for years; thrilled to move …
You can now train SOTA models without any storage!🌩️ We completely revamped the Hub’s backend to enable streaming at scale. We streamed TBs of data to 100s of H100s to train SOTA VLMs and saw serious speed-ups. But how?
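For a sense of what training from a stream looks like in practice, a minimal sketch assuming the `datasets` streaming API (the repo id is a placeholder and the shuffle buffer size is arbitrary):

```python
# Minimal sketch: stream training data from the Hub instead of local disk.
# "your-org/your-vlm-dataset" is a placeholder repo id, not a real dataset.
from datasets import load_dataset

stream = load_dataset("your-org/your-vlm-dataset", split="train", streaming=True)
stream = stream.shuffle(buffer_size=10_000, seed=0)   # approximate shuffle over the stream

for step, sample in enumerate(stream):
    # preprocessing + forward/backward would go here; nothing is staged on local disk
    if step == 4:
        break
```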
Open data is the foundation of open science. FineVision is our step toward making VLM research transparent, reproducible, and actually open. You can find more HF daily papers: https://t.co/fkrvVscUUe And a big shoutout to the first authors: @lusxvr and @orr_zohar ! Titans!
NYC open-source AI infra contributors — we’ve launched a community research hub above Grand Central where GPUs go brrr 🔥🗽 A place to hack, benchmark, and collaborate — vLLM, SGLang, kernels, inference optimizations all welcome. Open space. Open source. Weekends too. Huge …
🚀 Excited to release SciVideoBench — a new benchmark that pushes Video-LMMs to think like scientists! Designed to probe video reasoning and the synergy between accurate perception, expert knowledge, and logical inference. 1,000 research-level Qs across Physics, Chemistry, …
🚨 New Paper Alert! Introducing SciVideoBench — a comprehensive benchmark for scientific video reasoning! 🔬SciVideoBench: 1. Spans Physics, Chemistry, Biology & Medicine with authentic experimental videos. 2. Features 1,000 challenging MCQs across three reasoning types: …
🚀 Excited to share that our work is featured in the State of AI Report! Check it out - lots of interesting insights. Our research on the potential of small language models (SLMs) for Agentic AI is highlighted on slide 82: - Tasks like form filling, using a calculator, creating …
stateof.ai: The State of AI Report analyses the most interesting developments in AI.
🪩The one and only @stateofaireport 2025 is live! 🪩 It’s been a monumental 12 months for AI. Our 8th annual report is the most comprehensive it's ever been, covering what you *need* to know about research, industry, politics, safety and our new usage data. My highlight reel:
we just released another chapter on O'Reilly for our vision language model book about modern VLM architectures ✨ @andimarafioti @micuelll @orr_zohar it covers early & late fusion, encoder-decoders, multimodal attention types and more! 🤗
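As a taste of the early-fusion idea covered there, a minimal sketch under illustrative assumptions (dimensions and module names are mine, not the book's code): vision features are projected into the language model's width and concatenated with the text tokens before the LM runs.

```python
import torch
import torch.nn as nn

d_vision, d_model = 1024, 4096                 # assumed encoder / LM widths
projector = nn.Linear(d_vision, d_model)       # maps vision features into LM space

image_feats = torch.randn(1, 576, d_vision)    # e.g. ViT patch features for one image
text_embeds = torch.randn(1, 32, d_model)      # embedded text tokens

# Early fusion: one combined sequence, then the language model attends over both.
fused = torch.cat([projector(image_feats), text_embeds], dim=1)
print(fused.shape)                             # torch.Size([1, 608, 4096])
```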
🚀 MiMo‑VL 2508 is live! Same size, much smarter. We’ve upgraded performance, thinking control, and overall user experience. 📈 Benchmark gains across image + video: MMMU 70.6, VideoMME 70.8. Consistent improvements across the board. 🤖 Thinking Control: toggle reasoning with …
🚀 How far can RL scaling take LLMs? Dropping ProRLv2! 🔥With ProRLv2, we keep expanding LLMs’ reasoning boundaries through 3,000+ RL steps over 5 domains and set a new state-of-the-art 🌟 among 1.5B reasoning models. 🔗 Full blog: https://t.co/mxpaVXZdjj 🤗Open model: …
🧠 How can we truly test long-context video understanding in video-LMMs? ⏱️ TimeScope benchmarks models from 1 min to 8 hours using “needle-in-a-haystack” probes. 🚀 Gemini 2.5-Pro leads the pack—but even it struggles as context length grows. Long-range memory is still a …
🧵 Introducing TimeScope, an open-source benchmark rigorously evaluating the true “temporal context window” of video-language models on videos ranging from 1 minute to 8 hours. #AI #MachineLearning
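In the spirit of that needle-in-a-haystack setup, a minimal sketch of building one probe (frame lists stand in for decoded video; the function and field names are my own illustration, not TimeScope's actual pipeline):

```python
import random

def build_probe(haystack_frames, needle_frames, question, answer, seed=0):
    """Splice a short needle clip into a long haystack video at a random point."""
    rng = random.Random(seed)
    insert_at = rng.randint(0, len(haystack_frames))
    frames = haystack_frames[:insert_at] + needle_frames + haystack_frames[insert_at:]
    return {
        "frames": frames,
        "needle_span": (insert_at, insert_at + len(needle_frames)),
        "question": question,     # asks only about the inserted needle
        "answer": answer,
    }

haystack = [f"hay_{i}" for i in range(28_800)]     # ~8 h of video at 1 fps
needle = ["needle_0", "needle_1", "needle_2"]
probe = build_probe(haystack, needle, "What happens in the inserted clip?", "placeholder answer")
print(probe["needle_span"])
```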
Thrilled to announce our MiMo-VL series hit 100K downloads on HuggingFace last month! 🚀🚀 Incredible to see the community's enthusiasm for our VLMs. More exciting updates coming soon! 😜 https://t.co/7NhlMds1A5
timescope: testing whether large models understand long videos or just claim to 🤠 they randomly insert needles (short videos/static images) in long videos and ask questions about the needle itself 🤯 Gemini seems to be the best! very cool work by @orr_zohar et al 👏
Big thanks to @huggingface and my amazing collaborators: Rui Li, @XiaohanWang96, and @andimarafioti For more details, check out: 📑Blog post: https://t.co/NBxrigForI ⚖️Leaderboard: https://t.co/krDXab4Gd9 🔥Dataset: https://t.co/kIOqMQRr5j
🚀Evaluations expose significant gaps even in leading models like Gemini 2.5-Pro, highlighting challenges in tasks requiring detailed motion analysis and information synthesis.