Orr Zohar (@orr_zohar)
@nvidia • @Stanford • @KnightHennessy scholar • Researching large multimodal models
Joined May 2023 · 521 followers · 229 following · 20 media · 126 statuses
FineVision is out 🚀 And ready to empower the next generation of LMMs. Check it out!
Today, we are releasing FineVision, a huge open-source dataset for training state-of-the-art Vision-Language Models:
> 17.3M images
> 24.3M samples
> 88.9M turns
> 9.5B answer tokens
Here are my favourite findings:
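Some quick ratios that fall out of those headline numbers (my own rounding; only the figures quoted above come from the announcement):

```python
# Quick ratios from the announcement's own numbers (rounded).
images, samples, turns, answer_tokens = 17.3e6, 24.3e6, 88.9e6, 9.5e9

print(f"turns per sample       ≈ {turns / samples:.1f}")        # ≈ 3.7
print(f"answer tokens per turn ≈ {answer_tokens / turns:.0f}")  # ≈ 107
print(f"samples per image      ≈ {samples / images:.1f}")       # ≈ 1.4
```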
People underestimate how token-heavy video understanding & generation are - and their massive untapped potential. We grasp images in a split second. Videos? Not so much: Comprehending a 1-hour clip often demands watching the whole thing. Video AI is a tiny sliver of tokens today, …
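To put "token-heavy" in rough numbers, a back-of-envelope sketch under illustrative assumptions (1 fps sampling and 256 visual tokens per frame are my own choices, not figures from the post):

```python
# Illustrative arithmetic only: assumed sampling rate and tokens per frame.
frames_per_hour = 60 * 60 * 1          # 1 fps over one hour
tokens_per_frame = 256                 # assumed vision-encoder output per frame

video_tokens = frames_per_hour * tokens_per_frame
image_tokens = tokens_per_frame        # a single image at the same budget

print(f"1-hour video ≈ {video_tokens:,} visual tokens")   # 921,600
print(f"single image ≈ {image_tokens:,} visual tokens")   # 256
print(f"ratio ≈ {video_tokens // image_tokens}x")         # 3600x
```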
@StefanoErmon @_inception_ai Diffusion will obviously work on any bitstream. With text, since humans read from first word to last, there is just the question of whether the delay to first sentence for diffusion is worth it. That said, the vast majority of AI workload will be video understanding and …
we think humanity’s biggest challenges won’t be solved by ai thinking for 1000 hours coming back with an answer they’ll be solved by many collaborating humans, and ai that understands them and their different skills, goals, values, etc to empower them to do more together
some folks and i are making something new if you're hopeful about AI empowering everyone if you've worked on multiturn, memory, model behavior, multiagent RL, user sim, AI interfaces/products, kernels, or dist systems if you want frontier-scale compute & top infra let's chat!
Started the early release of Vision Language Models (O’Reilly). My first VLM book—already on chapter 3 and loving it. Clear and easy to follow. Great work @mervenoyann, @andimarafioti, @micuelll & @orr_zohar! Looking forward to the remaining chapters.
And we at #NVIDIA Research are still seeking research interns to explore omni-modal LLMs across a variety of domains, including robotics (VLA), visual agentic tool use, world modeling, and unified understanding and generation. Drop me an email if you are interested!
OmniVinci is now the #1 paper on Hugging Face!!! 🤗 Building omni-modal LLMs is MORE than just mixing tokens 😉 At @NVIDIA, we explored deeper possibilities in building truly omni-modal systems — leading to OmniVinci-9B, which introduces three key innovations:
- OmniAlignNet – a …
Encoder-Decoder models — making a comeback? 🤔 Discrete diffusion LMs can accelerate text generation (similar to speculative decoding). Exciting to see where these models go, and how many of the traditional AR design decisions will hold.
🚨In our NeurIPS paper, we bring encoder-decoders back... for diffusion language models! ⚡️Encoder-decoders make diffusion sampling fast: a small (fast) decoder denoises tokens progressively and a large (slower) encoder represents clean context.
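Roughly what that sampling recipe looks like, as a minimal sketch under my own interface assumptions (the encoder/decoder callables, masking scheme, and confidence-based unmasking schedule below are illustrative, not the paper's code):

```python
import torch

def sample_block(encoder, decoder, clean_context_ids, block_len, num_steps, mask_id):
    """Denoise one block of new tokens given already-generated clean context."""
    # One (expensive) pass of the large encoder over the clean context.
    context_states = encoder(clean_context_ids)               # e.g. [1, ctx_len, d]

    # Start the new block fully masked; the small decoder denoises it step by step.
    block = torch.full((1, block_len), mask_id, dtype=torch.long)
    for _ in range(num_steps):
        logits = decoder(block, context_states)                # cheap per-step pass
        conf, pred = logits.softmax(-1).max(-1)                # [1, block_len]
        still_masked = block.eq(mask_id)
        k = max(1, int(still_masked.sum()) // num_steps)
        # Unmask the k most confident still-masked positions this step.
        scores = torch.where(still_masked[0], conf[0], torch.full_like(conf[0], -1.0))
        idx = scores.topk(k).indices
        block[0, idx] = pred[0, idx]
    block[block.eq(mask_id)] = pred[block.eq(mask_id)]         # finalize any leftovers
    return block
```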
🚨Huge for multimodal/vision AI: Datasets hit 100s of TB, making on-prem storage a nightmare. 🤗Now stream them directly from Hugging Face to GPUs - unlocking scalable training of everything from VLMs to world models. 🚀 I've battled storage limits for years; thrilled to move …
You can now train SOTA models without any storage!🌩️ We completely revamped the Hub’s backend to enable streaming at scale. We streamed TBs of data to 100s of H100s to train SOTA VLMs and saw serious speed-ups. But how?
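For a sense of what training from a stream looks like in practice, a minimal sketch assuming the `datasets` streaming API (the repo id is a placeholder and the shuffle buffer size is arbitrary):

```python
# Minimal sketch: stream training data from the Hub instead of local disk.
# "your-org/your-vlm-dataset" is a placeholder repo id, not a real dataset.
from datasets import load_dataset

stream = load_dataset("your-org/your-vlm-dataset", split="train", streaming=True)
stream = stream.shuffle(buffer_size=10_000, seed=0)   # approximate shuffle over the stream

for step, sample in enumerate(stream):
    # preprocessing + forward/backward would go here; nothing is staged on local disk
    if step == 4:
        break
```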
Open data is the foundation of open science. FineVision is our step toward making VLM research transparent, reproducible, and actually open. You can find more HF daily papers: https://t.co/fkrvVscUUe And a big shoutout to the first authors: @lusxvr and @orr_zohar ! Titans!
NYC open-source AI infra contributors — we’ve launched a community research hub above Grand Central where GPUs go brrr 🔥🗽 A place to hack, benchmark, and collaborate — vLLM, SGLang, kernels, inference optimizations all welcome. Open space. Open source. Weekends too. Huge …
🚀 Excited to release SciVideoBench — a new benchmark that pushes Video-LMMs to think like scientists! Designed to probe video reasoning and the synergy between accurate perception, expert knowledge, and logical inference. 1,000 research-level Qs across Physics, Chemistry, …
🚨 New Paper Alert! Introducing SciVideoBench — a comprehensive benchmark for scientific video reasoning! 🔬SciVideoBench: 1. Spans Physics, Chemistry, Biology & Medicine with authentic experimental videos. 2. Features 1,000 challenging MCQs across three reasoning types: …
🚀 Excited to share that our work is featured in the State of AI Report! Check it out - lots of interesting insights. Our research on the potential of small language models (SLMs) for Agentic AI is highlighted on slide 82: - Tasks like form filling, using a calculator, creating …
stateof.ai: The State of AI Report analyses the most interesting developments in AI.
🪩The one and only @stateofaireport 2025 is live! 🪩 It’s been a monumental 12 months for AI. Our 8th annual report is the most comprehensive it's ever been, covering what you *need* to know about research, industry, politics, safety and our new usage data. My highlight reel:
we just released another chapter on O'Reilly for our vision language model book about modern VLM architectures ✨ @andimarafioti @micuelll @orr_zohar it covers early & late fusion, encoder-decoders, multimodal attention types and more! 🤗
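As a taste of the early-fusion idea covered there, a minimal sketch under illustrative assumptions (dimensions and module names are mine, not the book's code): vision features are projected into the language model's width and concatenated with the text tokens before the LM runs.

```python
import torch
import torch.nn as nn

d_vision, d_model = 1024, 4096                 # assumed encoder / LM widths
projector = nn.Linear(d_vision, d_model)       # maps vision features into LM space

image_feats = torch.randn(1, 576, d_vision)    # e.g. ViT patch features for one image
text_embeds = torch.randn(1, 32, d_model)      # embedded text tokens

# Early fusion: one combined sequence, then the language model attends over both.
fused = torch.cat([projector(image_feats), text_embeds], dim=1)
print(fused.shape)                             # torch.Size([1, 608, 4096])
```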
🚀 MiMo‑VL 2508 is live! Same size, much smarter. We’ve upgraded performance, thinking control, and overall user experience. 📈 Benchmark gains across image + video: MMMU 70.6, VideoMME 70.8. Consistent improvements across the board. 🤖 Thinking Control: toggle reasoning with …
🚀 How far can RL scaling take LLMs? Dropping ProRLv2! 🔥With ProRLv2, we keep expanding LLMs’ reasoning boundaries through 3,000+ RL steps over 5 domains and set a new state-of-the-art 🌟 among 1.5B reasoning models. 🔗 Full blog: https://t.co/mxpaVXZdjj 🤗Open model: …
🧠 How can we truly test long-context video understanding in video-LMMs? ⏱️ TimeScope benchmarks models from 1 min to 8 hours using “needle-in-a-haystack” probes. 🚀 Gemini 2.5-Pro leads the pack—but even it struggles as context length grows. Long-range memory is still a …
🧵 Introducing TimeScope, an open-source benchmark rigorously evaluating the true “temporal context window” of video-language models on videos ranging from 1 minute to 8 hours. #AI #MachineLearning
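In the spirit of that needle-in-a-haystack setup, a minimal sketch of building one probe (frame lists stand in for decoded video; the function and field names are my own illustration, not TimeScope's actual pipeline):

```python
import random

def build_probe(haystack_frames, needle_frames, question, answer, seed=0):
    """Splice a short needle clip into a long haystack video at a random point."""
    rng = random.Random(seed)
    insert_at = rng.randint(0, len(haystack_frames))
    frames = haystack_frames[:insert_at] + needle_frames + haystack_frames[insert_at:]
    return {
        "frames": frames,
        "needle_span": (insert_at, insert_at + len(needle_frames)),
        "question": question,     # asks only about the inserted needle
        "answer": answer,
    }

haystack = [f"hay_{i}" for i in range(28_800)]     # ~8 h of video at 1 fps
needle = ["needle_0", "needle_1", "needle_2"]
probe = build_probe(haystack, needle, "What happens in the inserted clip?", "placeholder answer")
print(probe["needle_span"])
```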
Thrilled to announce our MiMo-VL series hit 100K downloads on HuggingFace last month! 🚀🚀 Incredible to see the community's enthusiasm for our VLMs. More exciting updates coming soon! 😜 https://t.co/7NhlMds1A5
timescope: testing whether large models understand long videos or just claim to 🤠 they randomly insert needles (short videos/static images) in long videos and ask questions about the needle itself 🤯 Gemini seems to be the best! very cool work by @orr_zohar et al 👏
Big thanks to @huggingface and my amazing collaborators: Rui Li, @XiaohanWang96, and @andimarafioti For more details, check out: 📑Blog post: https://t.co/NBxrigForI ⚖️Leaderboard: https://t.co/krDXab4Gd9 🔥Dataset: https://t.co/kIOqMQRr5j
🚀Evaluations expose significant gaps even in leading models like Gemini 2.5-Pro, highlighting challenges in tasks requiring detailed motion analysis and information synthesis.