Strong benchmark results for vision–language models (VLMs) on spatial tasks are usually read as evidence that the models learn structured 3D reasoning. This paper challenges that view with a simpler and more troubling explanation: many VLMs absorb a photographer’s perspective bias, so that vertical image position becomes conflated with distance. That single insight accounts for the large accuracy gaps on counter-heuristic examples and suggests why benchmark accuracy can rise even as true spatial understanding remains shallow.
Key Findings
- Vertical-distance entanglement: Across diverse VLM families, embedding axes show a consistent entanglement where higher (upward) image positions correlate with greater inferred distance. So what: models can succeed on natural images by exploiting this photographic bias rather than learning geometric depth.
- Performance gap on counter-heuristic examples: Models score much lower when evaluated on examples that break the natural vertical–distance correlation. So what: benchmark aggregates mask brittle shortcuts and overstate real-world spatial reasoning.
- Scaling often worsens the bias: As models and datasets scale, overall benchmark accuracy grows while the vertical–distance entanglement often intensifies. So what: more data and parameters do not automatically yield more principled spatial representations.
- Representations predict robustness: Models with better-separated spatial axes (less entanglement) show higher robustness across spatial benchmarks. So what: internal representation structure is a better predictor of reliable spatial reasoning than raw benchmark numbers.
- SpatialTunnel benchmark & code: The authors provide SpatialTunnel, a synthetic evaluation suite that removes common photographic correlations to reveal shortcut reliance, plus probing code and a project page for reproduction.
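The masking effect described in the findings above (a healthy aggregate score hiding a weak counter-heuristic score) can be illustrated with a small bookkeeping sketch. Everything here is an assumption made for illustration: the `Example` record, its field names, and the toy numbers are hypothetical and are not SpatialTunnel's data format or the paper's results.

```python
from dataclasses import dataclass


@dataclass
class Example:
    # Hypothetical schema: True when the farther object also sits higher in the
    # image, i.e. the example agrees with the photographic prior.
    farther_object_is_higher: bool
    model_correct: bool


def accuracy(examples):
    """Fraction of examples the model answered correctly."""
    if not examples:
        raise ValueError("cannot compute accuracy on an empty subset")
    return sum(e.model_correct for e in examples) / len(examples)


def heuristic_gap(examples):
    """Split accuracy by whether an example agrees with the vertical-distance prior."""
    consistent = [e for e in examples if e.farther_object_is_higher]
    counter = [e for e in examples if not e.farther_object_is_higher]
    consistent_acc = accuracy(consistent)
    counter_acc = accuracy(counter)
    return {
        "overall": accuracy(examples),
        "consistent": consistent_acc,
        "counter": counter_acc,
        "gap": consistent_acc - counter_acc,
    }


# Made-up toy data: 8 prior-consistent examples (7 correct) and
# 2 counter-heuristic examples (1 correct).
toy = (
    [Example(True, True)] * 7
    + [Example(True, False)]
    + [Example(False, True)]
    + [Example(False, False)]
)
report = heuristic_gap(toy)
print(report)
```

Because natural-image benchmarks are dominated by prior-consistent examples, the overall number in a report like this tracks the consistent subset closely, which is why reporting the two subsets separately is the more informative view.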
Who it's for and trade-offs
Great fit if you care about diagnosing what VLMs actually represent: researchers building multimodal probes, model evaluators, and developers aiming for robust spatial reasoning. The paper offers a compact representation-level toolkit and a stress-test benchmark to separate genuine geometric understanding from dataset shortcuts. Look elsewhere if you only need application-level performance comparisons or deployment-ready fixes. The work diagnoses representation problems and backs the diagnosis with evaluation results and correlational evidence, but it does not itself deliver production-ready methods to fully remove the bias. Also note: synthetic benchmarks like SpatialTunnel are good at isolating specific shortcuts but may not capture every real-world distributional nuance.
Methods (brief)
The study uses a representation-level analysis that constructs minimal contrastive pairs to probe how spatial axes are organized and disentangled in VLM embeddings. By measuring axis alignment and disentanglement metrics across model families and scales, and by testing on SpatialTunnel (which breaks image-level correlations), the authors isolate model-intrinsic biases from dataset skew and link representation geometry to downstream robustness.
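One way to picture the contrastive-pair probing described above is the following minimal sketch. It is not the authors' code or metric: it assumes that a spatial axis can be estimated as the mean embedding difference across pairs that differ in a single attribute, and that entanglement can be scored as the absolute cosine between two such axes. The function names and the synthetic embeddings (which stand in for real VLM embeddings) are hypothetical.

```python
import math
import random


def axis_direction(pairs):
    """Unit-norm mean difference (high - low) over contrastive embedding pairs."""
    dim = len(pairs[0][0])
    mean_diff = [0.0] * dim
    for low, high in pairs:
        for i in range(dim):
            mean_diff[i] += (high[i] - low[i]) / len(pairs)
    norm = math.sqrt(sum(x * x for x in mean_diff))
    if norm == 0.0:
        raise ValueError("contrastive pairs have zero mean difference")
    return [x / norm for x in mean_diff]


def entanglement(axis_a, axis_b):
    """Absolute cosine between two unit axes: 0 = orthogonal, 1 = collinear."""
    return abs(sum(a * b for a, b in zip(axis_a, axis_b)))


def synthetic_pairs(rng, direction, n_pairs=200, noise=0.1):
    """Toy stand-in for model embeddings: 'high' is 'low' shifted along `direction`."""
    pairs = []
    for _ in range(n_pairs):
        base = [rng.gauss(0.0, 1.0) for _ in direction]
        low = [b + rng.gauss(0.0, noise) for b in base]
        high = [b + d + rng.gauss(0.0, noise) for b, d in zip(base, direction)]
        pairs.append((low, high))
    return pairs


rng = random.Random(0)
dim = 16
up = [1.0] + [0.0] * (dim - 1)                  # hypothetical "higher in image" shift
far_entangled = [0.8, 0.6] + [0.0] * (dim - 2)  # "farther" shift that leans on `up`
far_separate = [0.0, 1.0] + [0.0] * (dim - 2)   # "farther" shift orthogonal to `up`

vertical_axis = axis_direction(synthetic_pairs(rng, up))
entangled_score = entanglement(
    vertical_axis, axis_direction(synthetic_pairs(rng, far_entangled))
)
separate_score = entanglement(
    vertical_axis, axis_direction(synthetic_pairs(rng, far_separate))
)
print(f"entangled: {entangled_score:.2f}, separate: {separate_score:.2f}")
```

In this toy setup a model whose "farther" direction leans on its "higher" direction scores close to the cosine built into the synthetic data, while a model with separated axes scores near zero. With real models, the pairs would come from embedding matched images or prompts that differ only in vertical position or only in distance.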
