PURRCV 2026 · Oral

WhiskerSplat: Feed-Forward Neural 3D Reconstruction from Sparse Views

Kitty Park¹, Ya-ong Kim¹,², Calico Lee¹, Mittens Choi¹

¹ CatTower Vision Lab  ·  ² Whiskr Inc

✉ corresponding: kitty.park@cattower.example

[Interactive 3D viewer: CT-Scenes / living-room-07, 3 input views]

Figure 1. A point-cloud reconstruction (≈45k splats) predicted feed-forward from 3 unposed views. All scenes fictional.

PSNR 28.41 dB ↑  ·  SSIM 0.912 ↑  ·  LPIPS 0.087 ↓  ·  0.9 s per scene ↓

Abstract

Reconstructing a 3D scene from a handful of casual photos typically requires known camera poses and minutes of per-scene optimization. We present WhiskerSplat, a feed-forward model that predicts a metric-scale 3D field of anisotropic splats from as few as three unposed RGB views in a single pass, taking about 0.9 seconds per scene with no test-time optimization. A cross-view attention module resolves the scale ambiguity that otherwise requires pose supervision. On the CT-Scenes-Hard sparse-view benchmark, WhiskerSplat improves novel-view PSNR by 0.99 dB over the strongest baseline (TabbyFormer, 27.42 dB vs. 28.41 dB) while running ~40× faster than optimization-based methods.

Contributions

What's new

FIG. 02 — Method

Feed-forward reconstruction pipeline

[Pipeline diagram: 3 unposed views → View Encoder (ViT-B, shared) → Cross-View Attention (resolves metric scale) → 3D Field (~45k splats) → Differentiable Renderer (views + depth). Dashed path: photometric + depth loss (training only).]
Figure 2. The pipeline is feed-forward at test time; the dashed path is supervision used only during training.
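
The page does not specify how the cross-view attention stage is built. As a rough illustration only, the sketch below shows one generic way tokens from several views can exchange information: all tokens are concatenated into one sequence and passed through single-head scaled dot-product attention. The function names, the absence of learned query/key/value projections, and the single head are assumptions made for brevity, not details of WhiskerSplat.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_view_attention(views):
    """Single-head attention in which every token attends to tokens of all views.

    `views` is a list of views; each view is a list of token vectors
    (lists of floats sharing one dimension). Hypothetical simplification:
    queries, keys and values are the raw tokens, whereas a trained model
    would apply learned projections first.
    """
    # Flatten tokens from all views into one sequence so attention spans views.
    tokens = [tok for view in views for tok in view]
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)

    fused = []
    for query in tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        weights = softmax(scores)
        fused.append([
            sum(w * value[d] for w, value in zip(weights, tokens))
            for d in range(dim)
        ])

    # Restore the original per-view grouping of the fused tokens.
    out, start = [], 0
    for view in views:
        out.append(fused[start:start + len(view)])
        start += len(view)
    return out


# Three views with one 2-D token each (toy values, not real features).
toy_views = [[[1.0, 0.0]], [[0.0, 1.0]], [[1.0, 1.0]]]
fused_views = cross_view_attention(toy_views)
```

Each output token is a convex combination of tokens from every view, which is the property that lets information such as relative scale propagate between views in attention-based designs.
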
FIG. 03 — Novel view

Rendered vs. ground truth

Figure 3. WhiskerSplat's rendered novel view (left) vs. held-out ground truth (right). Fictional scene.
FIG. 04 — Qualitative

Reconstructions on CT-Scenes-Hard

Each card compares WhiskerSplat with the FelineGS baseline on the same scene. All renders are synthetic stand-ins (self-made, CC0).

Table 1 — Comparison

Quantitative results (3 views, CT-Scenes-Hard)

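| Method              | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Chamfer-L1 ↓ | Time ↓ |
|---------------------|--------|--------|---------|--------------|--------|
| Meowtrics-NeRF      | 24.10  | 0.842  | 0.171   | 0.092        | 38 min |
| PawSplat            | 26.30  | 0.881  | 0.124   | 0.063        | 4.2 s  |
| Yarn-3R             | 26.78  | 0.889  | 0.115   | 0.058        | 1.1 s  |
| FelineGS            | 27.05  | 0.894  | 0.108   | 0.055        | 1.6 s  |
| TabbyFormer         | 27.42  | 0.901  | 0.097   | 0.049        | 1.0 s  |
| WhiskerSplat (ours) | 28.41  | 0.912  | 0.087   | 0.041        | 0.9 s  |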

↑ higher is better, ↓ lower is better. WhiskerSplat (ours) is best in every column. All numbers fictional.
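
For readers unfamiliar with the PSNR column, the sketch below uses the standard definition, PSNR = 10 · log10(MAX² / MSE), and then recomputes the margin over the strongest baseline from the Table 1 values. The helper name `psnr`, the flat-list pixel representation, and the [0, 1] intensity range are assumptions for illustration; the page does not describe its evaluation code.

```python
import math


def psnr(rendered, target, max_value=1.0):
    """PSNR in dB between two equal-length sequences of pixel intensities."""
    if len(rendered) != len(target) or not rendered:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - t) ** 2 for r, t in zip(rendered, target)) / len(rendered)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Novel-view PSNR (dB) copied from Table 1.
table1_psnr = {
    "Meowtrics-NeRF": 24.10,
    "PawSplat": 26.30,
    "Yarn-3R": 26.78,
    "FelineGS": 27.05,
    "TabbyFormer": 27.42,
    "WhiskerSplat (ours)": 28.41,
}

ours = "WhiskerSplat (ours)"
# Strongest baseline = highest PSNR among all methods except ours.
best_baseline = max((m for m in table1_psnr if m != ours), key=table1_psnr.get)
margin_db = round(table1_psnr[ours] - table1_psnr[best_baseline], 2)

print(best_baseline, margin_db)  # TabbyFormer 0.99
```
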

Table 2 — Ablations

What matters

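| Variant                 | PSNR ↑ | LPIPS ↓ | Δ PSNR |
|-------------------------|--------|---------|--------|
| 2 input views           | 26.90  | 0.118   | −1.51  |
| w/o cross-view attention| 27.10  | 0.112   | −1.31  |
| w/o depth supervision   | 27.60  | 0.101   | −0.81  |
| full model (3 views)    | 28.41  | 0.087   | (ref.) |
| 5 input views           | 29.80  | 0.071   | +1.39  |
| 9 input views           | 30.60  | 0.063   | +2.19  |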
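
The Δ PSNR column is each variant's PSNR minus that of the full 3-view model. The short sketch below recomputes it from the Table 2 values and, as an extra reading of the same numbers, derives the average PSNR gain per added input view. The variable and function names are illustrative, and the per-view figure is a derived quantity that the page does not report.

```python
FULL_MODEL_PSNR = 28.41  # full model (3 views), Table 2

# PSNR (dB) per ablation variant, copied from Table 2.
variants = {
    "2 input views": 26.90,
    "w/o cross-view attention": 27.10,
    "w/o depth supervision": 27.60,
    "5 input views": 29.80,
    "9 input views": 30.60,
}


def delta_vs_full(psnr_by_variant, reference=FULL_MODEL_PSNR):
    """Difference in dB between each variant and the reference model."""
    return {name: round(value - reference, 2)
            for name, value in psnr_by_variant.items()}


deltas = delta_vs_full(variants)

# Average PSNR gain per additional view between consecutive settings.
psnr_by_views = [(2, 26.90), (3, 28.41), (5, 29.80), (9, 30.60)]
gain_per_view = [
    (lo_n, hi_n, round((hi_p - lo_p) / (hi_n - lo_n), 3))
    for (lo_n, lo_p), (hi_n, hi_p) in zip(psnr_by_views, psnr_by_views[1:])
]

for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f} dB")
for lo_n, hi_n, gain in gain_per_view:
    print(f"{lo_n} -> {hi_n} views: {gain:+.3f} dB per added view")
```

On these numbers the gain per added view shrinks as views are added (about 1.51, 0.70 and 0.20 dB), which suggests diminishing returns beyond a handful of inputs.
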
Citation

BibTeX

@inproceedings{park2026whiskersplat,
  title     = {WhiskerSplat: Feed-Forward Neural 3D Reconstruction from Sparse Views},
  author    = {Park, Kitty and Kim, Ya-ong and Lee, Calico and Choi, Mittens},
  booktitle = {Proceedings of PURRCV},
  year      = {2026},
  note      = {Fictional demo citation}
}
Acknowledgements

We thank the CatTower Vision Lab for compute on the fictional Meowtrics cluster, and Whiskr Inc for capture hardware. This page, its authors, dataset, venue, and every number on it are fictional.