WhiskerSplat: Feed-Forward Neural 3D Reconstruction from Sparse Views
1 CatTower Vision Lab · 2 Whiskr Inc
✉ corresponding: kitty.park@cattower.example
Reconstructing 3D scenes from a handful of casual photos typically demands known camera poses and minutes of per-scene optimization. We present WhiskerSplat, a feed-forward model that predicts a metric-scale anisotropic-splat 3D field from as few as three unposed RGB views in a single pass (about 0.9 seconds per scene, with no test-time optimization). A cross-view attention module resolves the scale ambiguity that otherwise requires pose supervision. On the CT-Scenes-Hard sparse-view benchmark, WhiskerSplat improves novel-view PSNR by 0.99 dB over the strongest baseline (TabbyFormer, 27.42 dB to 28.41 dB) while running ~40× faster than optimization-based methods.
What's new
- Feed-forward reconstruction. A single encoder maps sparse, unposed views to a metric 3D field — no per-scene fitting.
- Cross-view attention resolves metric-scale ambiguity across views, replacing explicit pose supervision.
- State of the art on CT-Scenes-Hard at 0.9 s/scene, ~40× faster than optimization baselines.
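To make the cross-view attention idea concrete, here is a minimal, dependency-free sketch in which every token of every view attends to the pooled tokens of all views. It is an illustration only, not the WhiskerSplat module: the function names, the toy 2-D tokens, and the absence of learned query/key/value projections are all assumptions made for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_view_attention(views):
    """Let every token attend to the tokens of all views.

    `views` is a list of views; each view is a list of token vectors
    (lists of floats with a shared length d). Queries, keys and values
    are the raw tokens themselves, a simplification (hypothetical, no
    learned projections).
    """
    # Pool the tokens of all views into one shared key/value set.
    pooled = [tok for view in views for tok in view]
    d = len(pooled[0])
    scale = 1.0 / math.sqrt(d)  # standard scaled dot-product factor

    fused_views, weights_per_view = [], []
    for view in views:
        fused_tokens, token_weights = [], []
        for q in view:
            # Similarity of this query to every token in every view.
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in pooled]
            w = softmax(scores)
            # Weighted average of the pooled tokens.
            mixed = [sum(wi * v[j] for wi, v in zip(w, pooled)) for j in range(d)]
            fused_tokens.append(mixed)
            token_weights.append(w)
        fused_views.append(fused_tokens)
        weights_per_view.append(token_weights)
    return fused_views, weights_per_view


# Three toy "views", two tokens each, feature dimension 2 (made-up values).
views = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 1.0]],
    [[0.0, 2.0], [2.0, 0.0]],
]
fused, weights = cross_view_attention(views)
```

Each output token is a convex combination of tokens from all three views, which is the mechanism that lets information about relative scale flow between views; how the actual model uses it is not specified on this page.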
Feed-forward reconstruction pipeline
Rendered vs. ground truth
Reconstructions on CT-Scenes-Hard
Each card compares WhiskerSplat with the FelineGS baseline on the same scene. All renders are synthetic stand-ins (self-made, CC0).
Quantitative results (3 views, CT-Scenes-Hard)
| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Chamfer-L1 ↓ | Time ↓ |
|---|---|---|---|---|---|
| Meowtrics-NeRF | 24.10 | 0.842 | 0.171 | 0.092 | 38 min |
| PawSplat | 26.30 | 0.881 | 0.124 | 0.063 | 4.2 s |
| Yarn-3R | 26.78 | 0.889 | 0.115 | 0.058 | 1.1 s |
| FelineGS | 27.05 | 0.894 | 0.108 | 0.055 | 1.6 s |
| TabbyFormer | 27.42 | 0.901 | 0.097 | 0.049 | 1.0 s |
| WhiskerSplat (ours) | **28.41** | **0.912** | **0.087** | **0.041** | **0.9 s** |
↑ higher is better, ↓ lower is better. Best per column in bold. All numbers fictional.
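As a quick sanity check, the short script below recomputes the headline PSNR margin from the table above and includes the standard PSNR definition for reference. The `psnr` helper assumes pixel values normalized to [0, 1] (`max_value=1.0`), which this page does not state; the PSNR figures are copied from the table.

```python
import math


def psnr(mse, max_value=1.0):
    """PSNR in dB for a given mean squared error and peak pixel value.

    Standard definition: 10 * log10(MAX^2 / MSE). The [0, 1] pixel
    range is an assumption, not something this page specifies.
    """
    return 10.0 * math.log10(max_value ** 2 / mse)


# Novel-view PSNR (dB) of the baselines, copied from the table above.
baselines = {
    "Meowtrics-NeRF": 24.10,
    "PawSplat": 26.30,
    "Yarn-3R": 26.78,
    "FelineGS": 27.05,
    "TabbyFormer": 27.42,
}
ours = 28.41  # WhiskerSplat

# The strongest baseline is the one with the highest PSNR.
best_name = max(baselines, key=baselines.get)
margin = round(ours - baselines[best_name], 2)
print(f"strongest baseline: {best_name}, margin: +{margin:.2f} dB")
```

Because PSNR is logarithmic in the error, a gain of about 1 dB corresponds to roughly a 20% reduction in mean squared error (10^(-0.099) ≈ 0.80).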
What matters
| Variant | PSNR ↑ | LPIPS ↓ | Δ PSNR |
|---|---|---|---|
| 2 input views | 26.90 | 0.118 | −1.51 |
| w/o cross-view attention | 27.10 | 0.112 | −1.31 |
| w/o depth supervision | 27.60 | 0.101 | −0.81 |
| full model (3 views) | 28.41 | 0.087 | — |
| 5 input views | 29.80 | 0.071 | +1.39 |
| 9 input views | 30.60 | 0.063 | +2.19 |
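The Δ PSNR column is each variant's PSNR minus that of the full three-view model. A small sketch that recomputes the column from the PSNR values in the table above (the dictionary and variable names are illustrative):

```python
FULL = "full model (3 views)"

# PSNR (dB) per variant, copied from the ablation table above.
ablation_psnr = {
    "2 input views": 26.90,
    "w/o cross-view attention": 27.10,
    "w/o depth supervision": 27.60,
    FULL: 28.41,
    "5 input views": 29.80,
    "9 input views": 30.60,
}

# Delta of every variant relative to the full model, rounded to 2 decimals.
reference = ablation_psnr[FULL]
deltas = {
    name: round(value - reference, 2)
    for name, value in ablation_psnr.items()
    if name != FULL
}

# List variants from the largest drop to the largest gain.
for name, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{name:28s} {delta:+.2f} dB")
```

Read this way, dropping to two views costs slightly more than removing cross-view attention (1.51 dB versus 1.31 dB), and both cost more than removing depth supervision (0.81 dB).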
BibTeX
@inproceedings{park2026whiskersplat,
title = {WhiskerSplat: Feed-Forward Neural 3D Reconstruction from Sparse Views},
author = {Park, Kitty and Kim, Ya-ong and Lee, Calico and Choi, Mittens},
booktitle = {Proceedings of PURRCV},
year = {2026},
note = {Fictional demo citation}
}
We thank the CatTower Vision Lab for compute on the fictional Meowtrics cluster, and Whiskr Inc for capture hardware. This page, its authors, dataset, venue, and every number on it are fictional.