Chenhao Zheng Profile
Chenhao Zheng

@Michael3014018

Followers 75 · Following 22 · Media 10 · Statuses 20

Computer Vision PhD student @uwcse | Student Researcher @allen_ai | ex-undergrad @UMichCSE and @sjtu1896

Seattle, WA
Joined December 2022
@Michael3014018
Chenhao Zheng
29 days
Having trouble with the excessive number of tokens when processing a video? Check out our paper, accepted at ICCV 2025 with an average score of 5.5! We tokenize video with tokens grounded in the trajectories of all objects rather than fixed-size patches. Trained with a…
Tweet media one
1
25
112
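For the gist in code, here is a minimal sketch (illustrative names, not the paper's actual API) contrasting fixed-size space-time patchification with trajectory-grounded tokens, assuming a tracker has already produced per-object mask sequences and a pooling `encoder` is given:

```python
import torch

def patch_tokenize(video: torch.Tensor, p: int = 16, t: int = 2) -> torch.Tensor:
    # ViT3D-style tokenization: fixed-size space-time patches.
    # video: (T, C, H, W) -> (num_tokens, t*C*p*p); token count grows with T*H*W.
    T, C, H, W = video.shape
    v = video.reshape(T // t, t, C, H // p, p, W // p, p)
    v = v.permute(0, 3, 5, 1, 2, 4, 6)   # -> (T/t, H/p, W/p, t, C, p, p)
    return v.reshape(-1, t * C * p * p)

def trajectory_tokenize(video, trajectories, encoder):
    # TrajViT-style idea (sketch): one token per tracked object trajectory.
    # `trajectories` is a list of per-object mask sequences from a tracker;
    # `encoder` pools each masked region sequence into a single embedding.
    return torch.stack([encoder(video, masks) for masks in trajectories])
```

The point of the contrast: the patch tokenizer's output length is fixed by resolution and clip length, while the trajectory tokenizer's output length scales with how many objects the scene actually contains.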
@Michael3014018
Chenhao Zheng
29 days
Extremely grateful to work with the amazing team!! @JieyuZhang20, @mrezasal1, Ziqi Gao, Vishnu Iyengar, Norimasa Kobori, Quan Kong, @RanjayKrishna.
1
0
2
@Michael3014018
Chenhao Zheng
29 days
We also connect TrajViT and ViT3D to Llama-3, forming two VideoLLMs. Across six Video-QA benchmarks, TrajViT-LLM delivers +5.24 pp accuracy, trains 4× faster, and requires 18× fewer inference FLOPs than its ViT3D counterpart. We also demonstrate in the paper that our model has…
1
0
2
@Michael3014018
Chenhao Zheng
29 days
For efficiency: despite the cost of trajectory extraction, TrajViT trains faster, consumes less GPU memory, and runs inference faster for sequences of ≥ 64 frames.
Tweet media one
1
0
3
@Michael3014018
Chenhao Zheng
29 days
For evaluation, we compare it with a standard ViT using space-time patch tokens (ViT3D) and with state-of-the-art token-merging methods, across video-text retrieval, spatiotemporal detection, action classification, and VideoLLM QA. TrajViT surpasses ViT3D on all tasks. At our…
Tweet media one
1
0
3
@Michael3014018
Chenhao Zheng
29 days
We train a video encoder, TrajViT, using our tokenization paradigm with a CLIP objective on a large-scale dataset of 50M images + 10M videos. TrajViT can naturally process image data by treating each image segment as a trajectory of length one, allowing seamless joint training.
1
0
2
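A minimal sketch of the length-one-trajectory idea, assuming a hypothetical `segment(image)` that returns per-object masks and the trajectory tokenizer sketched above:

```python
def image_to_trajectories(image, segment):
    # Treat an image as a one-frame video: each segment mask becomes a
    # trajectory of length one, so the same tokenizer handles both modalities.
    video = image.unsqueeze(0)                   # (C, H, W) -> (1, C, H, W)
    trajs = [[mask] for mask in segment(image)]  # each trajectory spans 1 frame
    return video, trajs
```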
@Michael3014018
Chenhao Zheng
29 days
Naively splitting the video tensor into patches is known to introduce a memory bottleneck, but it remains the de facto way of tokenizing video due to its strong performance. As shown in the figure, we fundamentally transform traditional video tokenization by reorganizing video tokens to…
Tweet media one
1
0
2
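A quick back-of-the-envelope on why patchification becomes a memory bottleneck (illustrative numbers, not the paper's exact settings):

```python
# Token count for naive space-time patchification of one clip.
T, H, W = 64, 224, 224   # frames, height, width
p, t = 16, 2             # spatial patch size, temporal tubelet length
tokens = (T // t) * (H // p) * (W // p)
print(tokens)            # 32 * 14 * 14 = 6272 tokens
print(tokens ** 2)       # 39,337,984 attention pairs per layer
```

Self-attention cost is quadratic in token count, so doubling the frame count quadruples the attention work.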
@Michael3014018
Chenhao Zheng
2 months
RT @jae_sung_park96: 🔥We are excited to present our work Synthetic Visual Genome (SVG) at #CVPR25 tomorrow! 🕸️ Dense scene graph with d…
0
8
0
@Michael3014018
Chenhao Zheng
2 months
RT @JieyuZhang20: Calling all #CVPR2025 attendees! Join us at the SynData4CV Workshop at @CVPR (Jun 11 full day at Grand C2, starting at 9…
syndata4cv.github.io
CVPR 2025 Workshop · June 11th, 2025, Grand C2 · Nashville, TN, United States
0
8
0
@Michael3014018
Chenhao Zheng
8 months
RT @ainaz_eftekhar: 🎉 Excited to introduce "The One RING: a Robotic Indoor Navigation Generalist" – our latest work on achieving cross-embo…
0
36
0
@Michael3014018
Chenhao Zheng
8 months
RT @jbhuang0604: The slide is bad, her response to an audience is even worse… "Maybe there is one, maybe they are common, who knows what…
0
20
0
@Michael3014018
Chenhao Zheng
8 months
This project is led by @zitong_lan, in collaboration with @zhiwei_zzz, and advised by @mingmin_zhao. Thanks to everyone for their great efforts!
0
0
0
@Michael3014018
Chenhao Zheng
8 months
In parallel with AVR, we develop AcoustiX, an acoustic simulation platform that generates more physically accurate signals than existing simulators. SoundSpaces, for example, exhibits significant errors in signal phase and arrival time; AcoustiX is built to address these issues.
Tweet media one
1
0
0
@Michael3014018
Chenhao Zheng
8 months
One cool thing: we can generate accurate binaural audio despite training only on monaural audio. Existing methods either require in-domain training or manually create signal delays; for us, rendering the impulse response at the position of each ear separately is good enough.
1
0
0
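A minimal sketch of that trick, assuming the trained model is exposed as a hypothetical `render_ir(position)` that predicts a monaural impulse response at a 3D point (the actual AVR interface may differ):

```python
import numpy as np

def binaural_render(render_ir, head_pos, ear_offset, dry_audio):
    # Render an impulse response at each ear's position, then convolve the
    # dry (anechoic) source audio with each IR to get left/right channels.
    ir_left = render_ir(head_pos - ear_offset)
    ir_right = render_ir(head_pos + ear_offset)
    left = np.convolve(dry_audio, ir_left)
    right = np.convolve(dry_audio, ir_right)
    return np.stack([left, right])  # (2, n): binaural signal
```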
@Michael3014018
Chenhao Zheng
8 months
The resulting method surpasses the previous state of the art by a large margin on both real and simulated benchmarks. Below is the evaluation on a set of real-world benchmarks; more results in the paper.
Tweet media one
Tweet media two
1
0
0
@Michael3014018
Chenhao Zheng
8 months
In this paper, we introduce volumetric rendering to audio synthesis for the first time. We propose various techniques to solve the challenges mentioned above by closely following acoustic wave principles. See our paper for details.
Tweet media one
1
0
0
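For intuition, a heavily simplified sketch of the delay-aware idea, under assumed names: a `field(x)` returning a density and a locally emitted signal, with each ray sample's contribution shifted by its propagation delay before alpha-compositing. The paper's actual formulation is more careful than this.

```python
import numpy as np

def render_ray_ir(field, origin, direction, n_samples=64, far=10.0,
                  fs=16000, c=343.0, ir_len=8000):
    # Accumulate an impulse response along one ray, NeRF-style, but with
    # each sample delayed by its travel time (distance / speed of sound).
    ds = np.linspace(0.1, far, n_samples)              # distances along ray
    step = ds[1] - ds[0]
    ir = np.zeros(ir_len)
    transmittance = 1.0
    for d in ds:
        sigma, signal = field(origin + d * direction)  # density, local signal
        alpha = 1.0 - np.exp(-sigma * step)
        delay = int(round(d / c * fs))                 # delay in samples
        n = min(len(signal), ir_len - delay)
        if n > 0:
            ir[delay:delay + n] += transmittance * alpha * signal[:n]
        transmittance *= 1.0 - alpha
    return ir
```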
@Michael3014018
Chenhao Zheng
8 months
One may ask: why can't we apply the same NeRF techniques to audio? The challenge is that sound and light have quite different properties: microphone-captured sound lacks directionality; the propagation time of sound can't be ignored, so different locations see different signal delays; etc.
1
0
0
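To put the delay issue in numbers (simple arithmetic, not from the paper): at roughly 343 m/s, small moves of the listener shift the signal by many samples at audio rates.

```python
fs, c = 48000, 343.0             # sample rate (Hz), speed of sound (m/s)
for d in (1.0, 3.0, 10.0):       # source-listener distance in meters
    print(d, round(d / c * fs))  # 1 m ≈ 140 samples, 10 m ≈ 1399 samples
```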
@Michael3014018
Chenhao Zheng
8 months
In the 3D vision community, novel view synthesis has advanced a lot since NeRF came out. However, the audio field still struggles to apply similar physics priors to spatial audio synthesis. The figure below shows how previous models, unlike ours, fail to model the field distribution.
Tweet media one
1
0
0
@Michael3014018
Chenhao Zheng
8 months
Excited to share our #NeurIPS2024 spotlight: Acoustic Volume Rendering (AVR) for Neural Impulse Response Fields. AVR greatly improves the state of the art in novel-view spatial audio synthesis by introducing acoustic volume rendering. Listen with headphones to the example below.
1
6
17
@Michael3014018
Chenhao Zheng
11 months
RT @DJiafei: Humans learn and improve from failures. Similarly, foundation models adapt based on human feedback. Can we leverage this failu…
0
43
0