Introducing SMERF: a streamable, memory-efficient method for real-time exploration of large, multi-room scenes on everyday devices. Our method brings the realism of Zip-NeRF to your phone or laptop!
Project page:
ArXiv:
(1/n)
Our paper, “NeRF in the Wild”, is out! NeRF-W is a method for reconstructing 3D scenes from internet photography. We apply it to the kinds of photos you might take on vacation: tourists, poor lighting, filters, and all. (1/n)
For lighting and image post-processing, we introduce a low-dimensional embedding space controlling NeRF’s radiance field. This not only gives NeRF-W the capacity to model photo-specific lighting, it enables us to “relight” a scene from new angles. (3/n)
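A rough sketch of the mechanism (illustrative code only; the layer sizes and the tiny MLP are made up, not our actual architecture): each training photo gets a learned embedding that is fed to the color branch, so lighting can vary per photo while geometry stays shared.

```python
import numpy as np

N_IMAGES, EMBED_DIM, FEAT_DIM, HIDDEN = 100, 48, 256, 128
rng = np.random.default_rng(0)
appearance = rng.normal(scale=0.01, size=(N_IMAGES, EMBED_DIM))  # learned per-image codes
weights = {
    "w1": rng.normal(scale=0.01, size=(HIDDEN, FEAT_DIM + 3 + EMBED_DIM)),
    "b1": np.zeros(HIDDEN),
    "w2": rng.normal(scale=0.01, size=(3, HIDDEN)),
    "b2": np.zeros(3),
}

def color_branch(scene_feat, view_dir, image_id):
    """RGB from shared scene features + a photo-specific appearance embedding."""
    x = np.concatenate([scene_feat, view_dir, appearance[image_id]])
    h = np.maximum(0.0, weights["w1"] @ x + weights["b1"])               # ReLU
    return 1.0 / (1.0 + np.exp(-(weights["w2"] @ h + weights["b2"])))    # sigmoid RGB

# Rendering the same point with a different image_id (or an interpolated embedding)
# changes lighting/post-processing but not the underlying geometry.
rgb = color_branch(rng.normal(size=FEAT_DIM), np.array([0.0, 0.0, 1.0]), image_id=7)
```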
This project wouldn’t have been possible without my amazing coauthors:
@rmbrualla
, Noha Radwan, Mehdi S. M. Sajjadi,
@jon_barron
, and Alexey Dosovitskiy. Check out our paper:
We build on NeRF, a method for learning a volumetric radiance field from a posed photo collection. We introduce two extensions to soften NeRF’s “static world” assumption: one for lighting/post-processing, the other for transient objects. (2/n)
NeRF-W improves on the SOTA by >5dB in PSNR and reduces error on other metrics by 20-50%. Qualitatively, NeRF-W produces consistent, crisp 3D geometry without fog or checkerboard artifacts. Check out the project website for more videos and the paper. (5/n)
For transient objects, we introduce a secondary volumetric radiance field combined with an uncertainty field. The former explicitly captures transient objects; the latter captures uncertainty about the color contributed by each region of 3D space a pixel's ray passes through. (4/n)
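A hedged sketch of how the two fields can be composited along a single ray (the names, shapes, and loss weighting here are illustrative, not the paper's exact formulation):

```python
import numpy as np

def composite_ray(sig_s, rgb_s, sig_t, rgb_t, beta_t, deltas):
    """Alpha-composite static + transient fields along one ray; return the color
    and an aggregated per-ray uncertainty (used to down-weight the photometric loss)."""
    sigma = sig_s + sig_t
    alpha = 1.0 - np.exp(-sigma * deltas)                          # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha]))[:-1]  # transmittance so far
    w = trans * alpha
    # Each sample's color is a density-weighted mix of the two fields.
    mix = (sig_s[:, None] * rgb_s + sig_t[:, None] * rgb_t) / np.maximum(sigma[:, None], 1e-8)
    return (w[:, None] * mix).sum(0), (w * beta_t).sum()

color, uncertainty = composite_ray(
    sig_s=np.array([0.1, 2.0, 0.0]), rgb_s=np.ones((3, 3)) * 0.8,
    sig_t=np.array([0.0, 0.0, 3.0]), rgb_t=np.zeros((3, 3)),
    beta_t=np.array([0.01, 0.01, 0.5]), deltas=np.full(3, 0.2))
```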
SMERF has the best of both worlds: we produce renders nearly indistinguishable from Zip-NeRF while rendering at 60 fps or more on desktops, laptops, and even recent smartphones, all while scaling to scenes as big as a house!
(3/n)
How does one trade-off sample quality and diversity in a language model? Which decoding method is best? We introduce a multi-objective framework maximizing human judgement score subject to a constraint on diversity (entropy). (1/7)
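Schematically (a sketch of the framing: HJ denotes a human-judgement score for a sample, q the sampling distribution induced by a decoding method, and H(q) its entropy):

```latex
\max_{q}\;\; \mathbb{E}_{x \sim q}\big[\mathrm{HJ}(x)\big]
\quad \text{subject to} \quad H(q) \ \ge\ H_{\min}.
```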
@Knusper2000
We used a few hundred to a few thousand photos. Based on the results of NeRF, if you capture your images in a controlled environment, you might be able to get away with as few as one hundred!
How do we achieve this? We distill a teacher model into a family of MERF-like student submodels, each of which specializes to a different part of the scene. Each submodel captures the entire scene, so rendering stays fast and GPU memory consumption stays low.
(4/n)
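As an illustration of the "one submodel per region" idea (a toy uniform grid; the real partitioning and distillation details are in the paper), the renderer only needs to know which submodel covers the current camera position:

```python
import numpy as np

SCENE_MIN = np.array([-10.0, -10.0, -3.0])   # hypothetical scene bounds
SCENE_MAX = np.array([10.0, 10.0, 3.0])
GRID = np.array([4, 4, 1])                   # toy 4x4 grid of submodels in x/y

def submodel_index(camera_position):
    """Map a camera position to the index of the submodel that covers it."""
    frac = (camera_position - SCENE_MIN) / (SCENE_MAX - SCENE_MIN)
    cell = np.clip((frac * GRID).astype(int), 0, GRID - 1)
    return int(np.ravel_multi_index(tuple(cell), tuple(GRID)))

print(submodel_index(np.array([1.5, -4.0, 0.0])))   # -> which submodel to render with
```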
Only a single submodel needs to be in memory at a time, and while the user explores the space, we swap out old submodels and stream in new ones. We train submodels to be mutually consistent, making transitions barely noticeable.
(6/n)
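A toy loader shape for this, assuming a hypothetical `fetch_submodel(id)` download step and the `submodel_index` helper from the sketch above:

```python
class SubmodelCache:
    """Toy loader: keep only the active submodel resident, swap on region change.
    `fetch_submodel` stands in for whatever download/decompress step is used."""

    def __init__(self, fetch_submodel):
        self.fetch = fetch_submodel
        self.active_id = None
        self.active = None

    def get(self, camera_position):
        sid = submodel_index(camera_position)   # helper from the sketch above
        if sid != self.active_id:               # user crossed into a new region:
            self.active = self.fetch(sid)       # stream the new submodel in...
            self.active_id = sid                # ...and let the old one be freed
        return self.active

cache = SubmodelCache(fetch_submodel=lambda sid: f"weights-for-submodel-{sid}")
print(cache.get(np.array([1.5, -4.0, 0.0])))
```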
We also modify MERF to significantly improve visual fidelity on small-to-medium size scenes. Our submodels capture thin geometry, high-resolution textures, and specular highlights better than ever before.
(5/n)
The result: a set of compact, streaming-ready submodels that render at up to 60 fps in your browser. The best part: you can try it out yourself:
(7/n)
Existing approaches to view synthesis are torn between two conflicting goals: high quality and fast rendering. Most methods achieve only one or the other.
(2/n)
Stochastic natural gradient descent corresponds to Bayesian training of neural networks, with a modified prior. This equivalence holds *even away from local minima*. Very proud of this work with Sam Smith, Daniel Duckworth, and Quoc Le.
I'm stoked to be a contributor on Object SRT, a new method for unsupervised, posed-images-to-3D-scene representation and segmentation! It's crazy fast and, while far from perfect, is leaps and bounds better than anything I've seen yet :)
So excited to share Object Scene Representation Transformer (OSRT):
OSRT learns about complex 3D scenes & decomposes them into objects w/o supervision, while rendering novel views up to 3000x faster than prior methods!
🖥️
📜
1/7
I'm proud to announce the release of our new paper relating Whitening, Newton's Method, and Generalization! tl;dr whitening w/o regularization significantly reduces a model's ability to generalize.
Work with
@negative_result
@sschoenholz
@ethansdy
@jaschasd
Whitening and second-order optimization both destroy information about the dataset, and can make generalization impossible. We examine what information is usable for training neural networks, and how second-order methods destroy exactly that information.
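For concreteness, a minimal ZCA-style whitening sketch (one of several whitening variants; illustrative only): after the transform, the empirical covariance is approximately the identity.

```python
import numpy as np

def whiten(X, eps=1e-8):
    """ZCA-whiten rows of X using its own empirical covariance."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / len(Xc)
    evals, evecs = np.linalg.eigh(cov)
    W = evecs @ np.diag(1.0 / np.sqrt(evals + eps)) @ evecs.T
    return Xc @ W

X = np.random.default_rng(0).normal(size=(1000, 5)) @ np.diag([3.0, 2.0, 1.0, 0.5, 0.1])
print(np.cov(whiten(X), rowvar=False).round(2))   # ~ identity: per-direction scales are gone
```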
Super proud of my ACL publication with
@daphneipp
! tl;dr we find that the decoding methods that produce the most "human-like" text are also the easiest for BERT-style classifiers to identify. We humans and our models don't see text the same way!
But all is not lost! We also find that *regularized* second-order optimization leads to better generalization than un-regularized second-order optimization or gradient descent.
@w4nderlus7
I believe that work and this work aren't in conflict! Two points: (1) The Meena paper says, "given two models, the one with better perplexity produces better samples." This work says, "given two samples, the more likely one isn't always better."
@GKopanas
Thanks for the kind words, Georgios! I look forward to the next generation of 3DGS work as well. It's just a matter of time till 3D capture & presentation is accessible as 2D is today.
@NextWorldOfTech
It is, but the scene representation is a "volumetric radiance field". I really like the original presentation by the NeRF authors on the subject:
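For reference, the standard volume-rendering quadrature behind such a radiance field, with densities sigma_i, colors c_i, and sample spacings delta_i along a camera ray:

```latex
C(\mathbf{r}) \;=\; \sum_i T_i \,\big(1 - e^{-\sigma_i \delta_i}\big)\, \mathbf{c}_i,
\qquad
T_i \;=\; \exp\!\Big(-\textstyle\sum_{j<i} \sigma_j \delta_j\Big).
```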
Have you wondered how effective social distancing is? Or quarantining? What happens if a few people ignore social distancing? How bad is it to go to the grocery store?
In short, everything helps -- especially early testing and quarantine! We're all in this together.
New video: Simulating an epidemic.
What happens when people avoid each other for the most part but still go to a common central location like a store?
What if you can track and isolate cases, but 20% slip through the cracks? 50%?
And much more.
Key takeaways: (i) very high likelihood samples are bad, (ii) compare decoding methods fairly by controlling entropy, and (iii) there's more to decoding methods than favoring high-likelihood samples. (6/7)
Spot on article on the state of AI and the Mind. Definitely worth the read!
"Despite the remarkable commercial success of current AI systems...we still have a long way to go in mimicking truly human like intelligence."
When using log likelihood as a proxy for human judgement ("quality"), we obtain "Global Temperature Sampling", a globally-normalized decoding method that optimally traverses the quality-diversity curve. (2/7)
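A toy illustration of the "globally normalized" part: whole-sequence probabilities are raised to a power 1/T and renormalized over complete sequences, rather than rescaling logits token by token. (The four sequences and their probabilities below are made up; in practice the sequence set can't be enumerated.)

```python
import numpy as np

seq_logprobs = np.log(np.array([0.5, 0.3, 0.15, 0.05]))  # made-up p(x) for 4 sequences

def global_temperature(logprobs, T):
    """q(x) proportional to p(x)^(1/T), normalized over *whole sequences*."""
    z = logprobs / T
    z -= z.max()                      # for numerical stability
    q = np.exp(z)
    return q / q.sum()

for T in (0.5, 1.0, 2.0):
    q = global_temperature(seq_logprobs, T)
    avg_loglik = float(q @ seq_logprobs)        # "quality" proxy
    entropy = float(-(q * np.log(q)).sum())     # diversity
    print(f"T={T}: q={q.round(3)}, avg log-lik={avg_loglik:.3f}, entropy={entropy:.3f}")
```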
While this blog post may only have two authors, the project itself is the hard work of a number of amazing teammates. Take a peek at the "Acknowledgments" section -- you may spot a few familiar names :)
@w4nderlus7
On the decoding side, the Meena paper also advocates for a sample-and-rank method, N=20. I hypothesize that the method doesn't surface decodes on the "too likely" side of the Likelihood Trap. We didn't compare sample-and-rank as...
@supremebeme
@RadianceFields
This and other large scenes are captured with a DSLR camera and a fisheye lens. Approximately 1,500 photos are used, and capture takes 30-60 minutes.
We perform the first large-scale human study (>38,000 ratings) comparing decoding method/hyperparameter combinations against each other. When controlling for entropy, we find nucleus > top-k > temperature sampling in low-entropy regimes. (4/7)
We further find that, when pairing samples from decoding methods with random samples from the model *with equal likelihood*, temperature sampling is preferred to nucleus and top-k sampling by human raters. (5/7)
We fit a small CNN, MLPs, and a linear model w/ and w/o Natural Gradient Descent (a second-order optimizer), and find that the models trained w/ NGD all generalize more poorly than those trained w/ plain gradient descent.
Surprisingly, this method is *worse* than token-by-token decoding methods according to human raters! We discover this is a consequence of the "Likelihood Trap", wherein samples with exceptionally high likelihood receive low human judgement scores. (3/7)
This is a highly unintuitive result! In linear regression, training on a whitened dataset w/ fewer data points than dimensions results in a model that is *incapable* of doing better than random chance on a validation set!
From the moment NeRF was first published, the research community knew it would be something game-changing. I'm proud to be part of the team turning this amazing line of work into a real product experience!
Immersive View gives users a virtual, close-up look at indoor spaces in 3D! Learn how it uses neural radiance fields to seamlessly fuse photos to produce realistic, multidimensional reconstructions of your favorite businesses and public spaces →
Further, applying gradient descent on a whitened dataset is *exactly* equivalent to applying Newton's Method on the original dataset. This suggests that models trained w/ second-order methods may generalize no better than models trained w/ SGD on whitened data.
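For least squares the equivalence is a short calculation (a sketch: Hessian H = X^T X / n, whitened inputs \tilde{X} = X H^{-1/2}, and reparametrization w = H^{-1/2} \tilde{w}):

```latex
\tilde{w}_{t+1}
  = \tilde{w}_t - \eta \,\nabla_{\tilde{w}} \tfrac{1}{2n}\|\tilde{X}\tilde{w}_t - y\|^2
  = \tilde{w}_t - \eta \, H^{-1/2} \tfrac{1}{n} X^\top (X w_t - y),
\qquad w_t := H^{-1/2}\tilde{w}_t,
```

so mapping back to the original coordinates gives a (damped) Newton step on the un-whitened data:

```latex
w_{t+1} = H^{-1/2}\tilde{w}_{t+1} = w_t - \eta \, H^{-1} \nabla_w L(w_t).
```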
@wxswxs
The original NeRF folks show depth maps (and meshes derived from them) on their project website. You can see our depth map of Trevi Fountain in the overview video @ 2:25
Here’s a head-to-head comparison of nucleus, top-k, and temperature sampling, and our newly proposed decoding method. Sampling directly from the model is by far the worst and nucleus p=0.3 is best according to human judgement. (7/7)