PURRCV 2026 · Oral

WhiskerSplat: Feed-Forward Neural 3D Reconstruction from Sparse Views

Kitty Park¹, Ya-ong Kim¹,², Calico Lee¹, Mittens Choi¹

¹ CatTower Vision Lab · ² Whiskr Inc

✉ corresponding: kitty.park@cattower.example

RECONSTRUCTION · CT-Scenes / living-room-07 · 3 input views

Figure 1. A live point-cloud reconstruction (≈45k splats) predicted feed-forward from 3 unposed views. Drag to orbit. All scenes fictional.

PSNR 28.41 dB ↑ · SSIM 0.912 ↑ · LPIPS 0.087 ↓ · 0.9 s per scene ↓

Abstract

Reconstructing 3D scenes from a handful of casual photos typically demands known camera poses and minutes of per-scene optimization. We present WhiskerSplat, a feed-forward model that predicts a metric-scale anisotropic-splat 3D field from as few as three unposed RGB views in a single pass — about 0.9 seconds per scene, with no test-time optimization. A cross-view attention module resolves the scale ambiguity that otherwise requires pose supervision. On the CT-Scenes-Hard sparse-view benchmark, WhiskerSplat improves novel-view PSNR by +0.99 dB over the strongest baseline while running ~40× faster than optimization-based methods.
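
The abstract does not spell out how an anisotropic splat is parameterized. As a minimal sketch, assuming each splat is a 3D Gaussian with per-axis scales and a rotation quaternion (a common convention in Gaussian-splatting pipelines, not something this page confirms), its covariance can be assembled as Σ = R·diag(s²)·Rᵀ. The function names and the example splat below are hypothetical.

```python
import math

def quat_to_rotation(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix (nested lists)."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n  # normalize so R is a proper rotation
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def splat_covariance(scales, quat):
    """Assumed parameterization: Sigma = R diag(s^2) R^T."""
    R = quat_to_rotation(quat)
    return [
        [sum(R[i][k] * scales[k] ** 2 * R[j][k] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]

# One hypothetical splat: longest along its local x axis, rotated 90 degrees about z.
half = math.sqrt(0.5)
sigma = splat_covariance(scales=(2.0, 1.0, 0.5), quat=(half, 0.0, 0.0, half))
for row in sigma:
    print([round(v, 6) for v in row])
```

With these inputs the long axis (scale 2) ends up along world y, so `sigma` is approximately diag(1, 4, 0.25). Unequal scales are what make the splat anisotropic; equal scales would reduce it to an isotropic blob.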

Contributions

What's new

Feed-forward reconstruction. A single encoder maps sparse, unposed views to a metric 3D field — no per-scene fitting.
Cross-view attention. A cross-view attention module resolves metric-scale ambiguity across views, replacing explicit pose supervision.
State of the art. Best results on CT-Scenes-Hard at 0.9 s/scene, ~40× faster than optimization baselines.
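
As a rough illustration of the cross-view attention idea in the list above, the sketch below lets every token from one view attend to the tokens of all views using plain scaled dot-product attention. It is not the WhiskerSplat architecture: the single head, the absence of learned query/key/value projections, and the names `cross_view_attention` and `views` are assumptions made for illustration only.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_view_attention(views):
    """views: list of views, each a list of token vectors (lists of floats).

    Returns the same nested structure, with every token replaced by an
    attention-weighted average of the tokens pooled from all views.
    """
    # Pool tokens from every view into one shared key/value set.
    pooled = [tok for view in views for tok in view]
    dim = len(pooled[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for view in views:
        fused_view = []
        for query in view:
            weights = softmax([dot(query, key) * scale for key in pooled])
            mixed = [sum(w * val[d] for w, val in zip(weights, pooled)) for d in range(dim)]
            fused_view.append(mixed)
        fused.append(fused_view)
    return fused

# Three hypothetical views with 2, 1, and 2 tokens of dimension 2.
views = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0]],
    [[0.5, -0.5], [2.0, 0.0]],
]
fused = cross_view_attention(views)
print(fused[1][0])
```

Because the keys and values are pooled across views, each output token can draw on evidence from the other images, which is the kind of information sharing a model would need to agree on a common scale without pose supervision.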

FIG. 02 — Method

Feed-forward reconstruction pipeline

Figure 2. The pipeline is feed-forward at test time; the dashed path is supervision used only during training.

FIG. 03 — Novel view

Rendered vs. ground truth

OURS (RENDERED) · GROUND TRUTH

Figure 3. Drag the divider: WhiskerSplat's rendered novel view (left) vs. held-out ground truth (right). Fictional scene.

FIG. 04 — Qualitative

Reconstructions on CT-Scenes-Hard

Hover a card to swap WhiskerSplat → FelineGS baseline on the same scene. All renders are synthetic stand-ins (self-made, CC0).

Table 1 — Comparison

Quantitative results (3 views, CT-Scenes-Hard)

Method	PSNR (dB) ↑	SSIM ↑	LPIPS ↓	Chamfer-L1 ↓	Time per scene ↓
Meowtrics-NeRF	24.10	0.842	0.171	0.092	38 min
PawSplat	26.30	0.881	0.124	0.063	4.2 s
Yarn-3R	26.78	0.889	0.115	0.058	1.1 s
FelineGS	27.05	0.894	0.108	0.055	1.6 s
TabbyFormer	27.42	0.901	0.097	0.049	1.0 s
WhiskerSplat (ours)	28.41	0.912	0.087	0.041	0.9 s

↑ higher is better, ↓ lower is better. WhiskerSplat (last row) has the best value in every column. All numbers fictional.
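
To put the PSNR gap in Table 1 in perspective, the sketch below converts PSNR to mean squared error, assuming the standard definition PSNR = 10·log10(peak² / MSE) with images scaled to [0, 1]. The page does not state which PSNR convention was used, and reported PSNR is usually averaged over test views, so this is an indicative conversion only. The helper names are hypothetical.

```python
import math

def psnr_from_mse(mse, peak=1.0):
    """PSNR in dB for a given mean squared error and peak signal value."""
    return 10.0 * math.log10(peak ** 2 / mse)

def mse_from_psnr(psnr_db, peak=1.0):
    """Inverse of psnr_from_mse."""
    return peak ** 2 / (10.0 ** (psnr_db / 10.0))

ours_db = 28.41           # WhiskerSplat, Table 1
best_baseline_db = 27.42  # TabbyFormer, strongest baseline by PSNR in Table 1

gain_db = ours_db - best_baseline_db
mse_ratio = mse_from_psnr(ours_db) / mse_from_psnr(best_baseline_db)

print(f"PSNR gain: {gain_db:.2f} dB")
print(f"MSE ratio (ours / baseline): {mse_ratio:.3f}")
```

Under these assumptions, the +0.99 dB gain corresponds to roughly 20% lower mean squared error than the strongest baseline.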

Table 2 — Ablations

What matters

Variant	PSNR (dB) ↑	LPIPS ↓	Δ PSNR vs. full model (dB)
2 input views	26.90	0.118	−1.51
w/o cross-view attention	27.10	0.112	−1.31
w/o depth supervision	27.60	0.101	−0.81
full model (3 views)	28.41	0.087	—
5 input views	29.80	0.071	+1.39
9 input views	30.60	0.063	+2.19
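
The Δ PSNR column can be re-derived from the PSNR column, and the same numbers show how the benefit of extra views tapers off. The sketch below copies the values from Table 2; it assumes the view-count rows use the full model with only the number of inputs changed, which the table implies but does not state. Variable names are hypothetical.

```python
# PSNR (dB) by number of input views, copied from Table 2.
psnr_by_views = {2: 26.90, 3: 28.41, 5: 29.80, 9: 30.60}
# PSNR (dB) of the 3-view model with one component removed.
psnr_without = {"cross-view attention": 27.10, "depth supervision": 27.60}

full = psnr_by_views[3]

# Drop in PSNR when a component is removed (magnitude of the table's delta column).
component_drop = {name: round(full - psnr, 2) for name, psnr in psnr_without.items()}

# Average PSNR gained per additional view between consecutive rows.
counts = sorted(psnr_by_views)
gain_per_view = {
    (a, b): round((psnr_by_views[b] - psnr_by_views[a]) / (b - a), 3)
    for a, b in zip(counts, counts[1:])
}

print(component_drop)
print(gain_per_view)
```

Removing cross-view attention costs 1.31 dB, more than removing depth supervision (0.81 dB). The marginal gain per added view falls from about 1.5 dB (2 to 3 views) to about 0.7 dB (3 to 5) and 0.2 dB (5 to 9).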

Citation

BibTeX

@inproceedings{park2026whiskersplat,
  title     = {WhiskerSplat: Feed-Forward Neural 3D Reconstruction from Sparse Views},
  author    = {Park, Kitty and Kim, Ya-ong and Lee, Calico and Choi, Mittens},
  booktitle = {Proceedings of PURRCV},
  year      = {2026},
  note      = {Fictional demo citation}
}

Acknowledgements

We thank the CatTower Vision Lab for compute on the fictional Meowtrics cluster, and Whiskr Inc for capture hardware. This page, its authors, dataset, venue, and every number on it are fictional.