VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

Zhiwen Fan1†*, Jian Zhang2*, Renjie Li3, Junge Zhang4, Runjin Chen1, Hezhen Hu1, Kevin Wang1, Huaizhi Qu5, Dilin Wang6, Zhicheng Yan6, Hongyu Xu6, Justin Theiss6, Tianlong Chen5, Jiachen Li4, Zhengzhong Tu3, Zhangyang Wang1, Rakesh Ranjan6

1UT Austin    2XMU    3TAMU    4UCR    5UNC    6Meta

†Corresponding Author. *Equal contribution.

zhiwenfan@utexas.edu

Video: VLM-3R Architecture Overview.

A unified Vision-Language Model (VLM) framework integrating 3D reconstructive instruction tuning for deep spatial understanding from monocular video.

Abstract

The rapid advancement of Large Multimodal Models (LMMs) for 2D images and videos has motivated extending these models to understand 3D scenes, aiming for human-like visual-spatial intelligence. Nevertheless, achieving deep spatial understanding comparable to human capabilities poses significant challenges in model encoding and data acquisition. Existing methods frequently depend on external depth sensors for geometry capture or on off-the-shelf algorithms for pre-constructing 3D maps, which limits their scalability, especially for prevalent monocular video inputs and time-sensitive applications. In this work, we introduce VLM-3R, a unified framework for Vision-Language Models (VLMs) that incorporates 3D Reconstructive instruction tuning. VLM-3R processes monocular video frames with a geometry encoder that derives implicit 3D tokens representing spatial understanding. Using our Spatial-Visual-View Fusion technique and over 200K curated 3D reconstructive instruction-tuning question-answer (QA) pairs, VLM-3R effectively aligns real-world spatial context with language instructions. This enables the model to perform monocular 3D spatial assistance and embodied reasoning. To facilitate the evaluation of temporal reasoning capabilities, we introduce the Vision-Spatial-Temporal Intelligence benchmark, featuring over 138.6K QA pairs across five distinct tasks focused on evolving spatial relationships. Extensive experiments demonstrate that VLM-3R not only enables robust visual-spatial reasoning but also understands 3D contextual changes over time, excelling in both accuracy and scalability.

Overview

VLM-3R Overview

Figure: VLM-3R Overview. Our framework (b) utilizes an end-to-end architecture to process video directly, unlike prior methods (a) that rely on explicit 3D data. This enables the model to understand spatial context, instance layout, and temporal dynamics, achieving leading performance on benchmarks (results in c).

Key Innovations

End-to-End Monocular Video 3D Understanding

VLM-3R directly processes monocular RGB videos without needing external depth sensors or pre-built 3D maps, significantly enhancing scalability and practical applicability.

3D Reconstructive Instruction Tuning

Instruction tuning with over 200K QA pairs enables the model to effectively align visual information with 3D spatial context and language instructions.

Spatial-Visual-View Fusion

A novel fusion mechanism integrates 3D geometric tokens, per-view camera tokens, and 2D appearance features for joint spatio-linguistic understanding.

Vision-Spatial-Temporal Intelligence Benchmark (VSTI-Bench)

A new benchmark with over 138.6K QA pairs, specifically designed to evaluate the model's understanding of spatio-temporal relationships evolving from camera motion within 3D environments.

VLM-3R Architecture

VLM-3R Network Architecture Diagram

Figure: Network Architecture. Our method takes monocular video and a language instruction as input. A visual encoder, coupled with a spatial encoder, extracts frame-level appearance, camera view positions, and globally aligned geometry. Visual-Geometry Fusion integrates these through attention and projection layers to produce 3D-aware visual features for the LMM. At inference time, this fusion enables reliable spatial and temporal reasoning.

Architectural Overview

The core of VLM-3R is a pre-trained Large Multimodal Model (LMM) integrated with modules that derive geometric encodings, camera view encodings, and visual features from the input video; these inputs are then fused with the language representations. VLM-3R does not rely on pre-built 3D maps or external depth sensors. This design directly addresses key limitations of existing approaches: Video LLMs commonly fail to perceive rich spatial context from monocular video, and many specialized 3D-LLMs depend restrictively on prior 3D maps or depth sensor inputs.

Key Components:

- Pre-trained LMM backbone that consumes the fused, 3D-aware visual tokens together with the language instruction.
- Visual Encoder that extracts frame-level 2D appearance features from the monocular video.
- Spatial Encoder that derives implicit, globally aligned 3D geometry tokens and per-view camera tokens from the same frames.
- Spatial-Visual-View (Visual-Geometry) Fusion that combines these tokens through attention and projection layers into 3D-aware visual features for the LMM (a minimal sketch follows below).
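
As a rough illustration of how such a fusion module could be wired up, the PyTorch sketch below lets the 2D appearance tokens cross-attend to the concatenated geometry and camera tokens and then projects the result into the language model's token space. The class name, dimensions, and single-layer layout are illustrative assumptions, not the released VLM-3R implementation.

```python
import torch
import torch.nn as nn


class SpatialVisualViewFusion(nn.Module):
    """Fuses per-frame 2D appearance tokens with 3D geometry and camera-view tokens."""

    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)  # projection into the LMM token space

    def forward(self, visual_tokens, geometry_tokens, camera_tokens):
        # visual_tokens:   (B, N_v, D) 2D appearance features from the visual encoder
        # geometry_tokens: (B, N_g, D) implicit, globally aligned 3D tokens
        # camera_tokens:   (B, N_c, D) per-view camera pose tokens
        context = torch.cat([geometry_tokens, camera_tokens], dim=1)
        fused, _ = self.cross_attn(query=visual_tokens, key=context, value=context)
        # Residual keeps the original appearance signal; the projection yields
        # 3D-aware visual tokens that are passed to the language model.
        return self.proj(visual_tokens + fused)


# Toy usage: 8 frames x 196 patch tokens, 256 geometry tokens, 1 camera token per frame.
if __name__ == "__main__":
    fusion = SpatialVisualViewFusion(dim=1024)
    v = torch.randn(1, 8 * 196, 1024)
    g = torch.randn(1, 256, 1024)
    c = torch.randn(1, 8, 1024)
    print(fusion(v, g, c).shape)  # torch.Size([1, 1568, 1024])
```

Using the appearance tokens as queries keeps the output sequence the same length as the visual token sequence the LMM already expects, so the fused, 3D-aware features can stand in for the standard visual features without changing the language model's interface.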

Datasets & Benchmarks

VSTI-Bench Data Statistics Diagram

Figure: VSTI-Bench Overview. (a) Statistical distribution of QA pairs by primary category (inner ring) and sub-category (outer ring). (b) Example QA pairs for different task types.

Multimodal Spatial Instruction Data Generation

We developed a scalable, automated data generation pipeline to instill robust spatial intelligence in LMMs. This pipeline produced:

- Over 200K 3D reconstructive instruction-tuning QA pairs used to train VLM-3R.
- Approximately 138.6K QA pairs that constitute the Vision-Spatial-Temporal Intelligence benchmark (VSTI-Bench) used for evaluation.

This data is derived from existing 3D datasets like ScanNet, ScanNet++, and ARKitScenes, processed via detailed spatio-temporal scene graphs to automatically generate QA pairs for tasks such as object counting, relative distance/direction, appearance order, object size, absolute distance, and room size.
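
As a rough sketch of this idea, the snippet below derives a single relative-distance QA pair from 3D bounding-box centers; the helper function, object names, and question template are hypothetical and intended only to convey the flavor of the automated pipeline, not its exact implementation.

```python
import numpy as np


def relative_distance_qa(centers: dict, anchor: str, obj_a: str, obj_b: str) -> dict:
    """Build one relative-distance QA pair from 3D bounding-box centers (meters)."""
    d_a = float(np.linalg.norm(centers[obj_a] - centers[anchor]))
    d_b = float(np.linalg.norm(centers[obj_b] - centers[anchor]))
    question = f"Which object is closer to the {anchor}: the {obj_a} or the {obj_b}?"
    answer = obj_a if d_a < d_b else obj_b
    return {"question": question, "answer": answer}


if __name__ == "__main__":
    # Hypothetical bounding-box centers from a reconstructed indoor scene.
    centers = {
        "sofa": np.array([1.2, 0.4, 0.0]),
        "table": np.array([2.5, 0.1, 0.0]),
        "lamp": np.array([0.3, 1.8, 0.0]),
    }
    print(relative_distance_qa(centers, anchor="sofa", obj_a="table", obj_b="lamp"))
    # {'question': 'Which object is closer to the sofa: the table or the lamp?', 'answer': 'table'}
```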

Vision-Spatial-Temporal Intelligence Benchmark (VSTI-Bench)

To evaluate the understanding of dynamic 3D environments, we introduce VSTI-Bench. This benchmark contains approximately 138,600 QA pairs, distributed across three main categories: Camera Dynamics (49.6%), Camera-Object Interactions (38.4%), and Object Relative Position (12.0%). It is designed to assess LMMs' ability to perceive and reason about relative camera/object motion, dynamic object-camera relationships, and evolving spatial configurations.

Evaluation Metrics

For Multiple-Choice Answer (MCA) tasks, standard Accuracy (ACC) is used. For Numerical Answer (NA) tasks, Mean Relative Accuracy (MRA) is utilized:

MRA = (1/10) · Σ_{θ ∈ {0.50, 0.55, ..., 0.95}} 𝟙(|ŷ − y| / y < 1 − θ), where ŷ is the predicted value and y the ground-truth value.
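
A minimal Python helper, shown below with an illustrative function name and example values, evaluates this metric directly from the formula:

```python
import numpy as np


def mean_relative_accuracy(y_pred: float, y_true: float) -> float:
    """MRA: fraction of thresholds theta for which the relative error is below 1 - theta."""
    thresholds = np.linspace(0.50, 0.95, 10)  # {0.50, 0.55, ..., 0.95}
    relative_error = abs(y_pred - y_true) / y_true
    return float(np.mean(relative_error < (1.0 - thresholds)))


if __name__ == "__main__":
    # An 8.8 m prediction against a 10.0 m ground truth has 12% relative error,
    # so it satisfies thresholds theta = 0.50 through 0.85 (8 of 10).
    print(mean_relative_accuracy(8.8, 10.0))  # 0.8
```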

VSTI-Bench: Interactive Examples

The interactive examples on the project page cover five task types: Camera Displacement, Camera Movement Direction, Camera-Object Absolute Distance, Camera-Object Relative Distance, and Object-Object Relative Position. A representative Camera Displacement question, posed over a 32-frame clip, asks: "Approximately how far (in meters) did the camera move between frame 6 and frame 14 of 32?"

Experimental Results

VSI-Bench Evaluation

On VSI-Bench, VLM-3R (7B) is the top-performing open-source Vision-Language Model, outperforming other models in its parameter class (around 7-8B) as well as smaller ones; it even surpasses several significantly larger 72B-parameter models and proprietary systems. This highlights the effectiveness of its reconstructive instruction tuning: the integration of spatial encoding substantially boosts the LMM's performance on distance, size, and direction estimation tasks.

Table 1: VSI-Bench Evaluation Results. VLM-3R ranks first among open-source VLMs, showcasing the effectiveness of its reconstructive instruction tuning and validating that our spatial encoding significantly improves 3D understanding and reasoning, particularly for distance, size, direction, and spatial-planning tasks. Results on the VSI-Bench tiny set are presented following established setups. Obj. Count, Abs. Dist., Obj. Size, and Room Size are numerical-answer (NA) tasks; Rel. Dist., Rel. Dir., Route Plan, and Appr. Order are multiple-choice (MCA) tasks.
| Methods | Rank | Avg. | Obj. Count | Abs. Dist. | Obj. Size | Room Size | Rel. Dist. | Rel. Dir. | Route Plan | Appr. Order |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | | | | | | | | | | |
| Chance Level (Random) | - | - | - | - | - | - | 25.0 | 36.1 | 28.3 | 25.0 |
| Chance Level (Frequency) | - | 34.0 | 62.1 | 32.0 | 29.9 | 33.1 | 25.1 | 47.9 | 28.4 | 25.2 |
| VSI-Bench Perf. (Tiny Set) | | | | | | | | | | |
| Human Level | - | 79.2 | 94.3 | 47.0 | 60.4 | 45.9 | 94.7 | 95.8 | 95.8 | 100.0 |
| Gemini-1.5 Flash | - | 45.7 | 50.8 | 33.6 | 56.5 | 45.2 | 48.0 | 39.8 | 32.7 | 59.2 |
| Gemini-1.5 Pro | - | 48.8 | 49.6 | 28.8 | 58.6 | 49.4 | 46.0 | 48.1 | 42.0 | 68.0 |
| Gemini-2.0 Flash | - | 45.4 | 52.4 | 30.6 | 66.7 | 31.8 | 56.0 | 46.3 | 24.5 | 55.1 |
| Proprietary Models (API) | | | | | | | | | | |
| GPT-4o | 3 | 34.0 | 46.2 | 5.3 | 43.8 | 38.2 | 37.0 | 41.3 | 31.5 | 28.5 |
| Gemini-1.5 Flash | 2 | 42.1 | 49.8 | 30.8 | 53.5 | 54.4 | 37.7 | 41.0 | 31.5 | 37.8 |
| Gemini-1.5 Pro | 1 | 45.4 | 56.2 | 30.9 | 64.1 | 43.6 | 51.3 | 46.3 | 36.0 | 34.6 |
| Open-Source VLMs | | | | | | | | | | |
| LLaVA-OneVision-0.5B | 11 | 28.0 | 46.1 | 28.4 | 15.4 | 28.3 | 28.9 | 36.9 | 34.5 | 5.8 |
| InternVL2-2B | 12 | 27.4 | 21.8 | 24.9 | 22.0 | 35.0 | 33.8 | 44.2 | 30.5 | 7.1 |
| LLaVA-NeXT-Video-7B | 5 | 35.6 | 48.5 | 14.0 | 47.8 | 24.2 | 43.5 | 42.4 | 34.0 | 30.6 |
| InternVL2-8B | 6 | 34.6 | 23.1 | 28.7 | 48.2 | 39.8 | 36.7 | 30.7 | 29.9 | 39.6 |
| LLaVA-OneVision-7B | 7 | 32.4 | 47.7 | 20.2 | 47.4 | 12.3 | 42.5 | 35.2 | 29.4 | 24.4 |
| LongVA-7B | 9 | 29.2 | 38.0 | 16.6 | 38.9 | 22.2 | 33.1 | 43.3 | 25.4 | 15.7 |
| VILA-1.5-8B | 10 | 28.9 | 17.4 | 21.8 | 50.3 | 18.8 | 32.1 | 34.8 | 31.0 | 24.8 |
| LongVILA-8B | 13 | 21.6 | 29.1 | 9.1 | 16.7 | 0.0 | 29.6 | 30.7 | 32.5 | 25.5 |
| InternVL2-40B | 4 | 36.0 | 34.9 | 26.9 | 46.5 | 31.8 | 42.1 | 32.2 | 34.0 | 39.6 |
| VILA-1.5-40B | 8 | 31.2 | 22.4 | 24.8 | 48.7 | 22.7 | 40.5 | 25.7 | 31.5 | 32.9 |
| LLaVA-NeXT-Video-72B | 2 | 40.9 | 48.9 | 22.8 | 57.4 | 35.3 | 42.4 | 36.7 | 35.0 | 48.6 |
| LLaVA-OneVision-72B | 3 | 40.2 | 43.5 | 23.9 | 57.6 | 37.5 | 42.5 | 39.9 | 32.5 | 44.6 |
| VLM-3R (7B) | 1 | 60.9 | 70.2 | 49.4 | 69.2 | 67.1 | 65.4 | 80.5 | 45.4 | 40.1 |

VSTI-Bench Evaluation

On VSTI-Bench, VLM-3R likewise leads all evaluated models, demonstrating strong understanding of spatial context and temporal motion and accurately answering questions about evolving camera and object relationships in video.

Table 2: VSTI-Bench Evaluation Results. VLM-3R achieves the best performance among all evaluated models on this benchmark, showcasing its strong spatio-temporal reasoning and its ability to understand evolving camera dynamics, camera-object interactions, and inter-object relationships from monocular video. Cam-Obj Abs. Dist. and Cam. Displace. are numerical-answer (NA) tasks; Cam. Mov. Dir., Obj-Obj Rel. Pos., and Cam-Obj Rel. Dist. are multiple-choice (MCA) tasks.
| Methods | Rank | Avg. | Cam-Obj Abs. Dist. | Cam. Displace. | Cam. Mov. Dir. | Obj-Obj Rel. Pos. | Cam-Obj Rel. Dist. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | | | | | | | |
| Chance Level (Random) | - | - | - | - | 36.1 | 50.0 | 36.1 |
| Chance Level (Frequency) | - | 27.4 | 5.4 | 6.2 | 40.7 | 52.2 | 32.4 |
| Human Performance | | | | | | | |
| Human Level | - | 77.0 | 51.4 | 46.8 | 95.1 | 97.5 | 94.3 |
| Proprietary Models (API) | | | | | | | |
| GPT-4o | 1 | 38.2 | 29.5 | 23.4 | 37.3 | 58.1 | 42.5 |
| Gemini-1.5 Flash | 2 | 32.1 | 28.5 | 20.9 | 24.4 | 52.6 | 33.9 |
| Open-Source VLMs | | | | | | | |
| LLaVA-OneVision-0.5B | 9 | 36.9 | 16.5 | 32.4 | 46.1 | 50.5 | 39.0 |
| InternVL2-2B | 7 | 38.1 | 17.7 | 27.8 | 43.0 | 54.9 | 47.2 |
| LLaVA-NeXT-Video-7B | 5 | 40.0 | 28.2 | 1.8 | 49.8 | 64.7 | 55.6 |
| LLaVA-OneVision-7B | 4 | 41.7 | 29.9 | 19.3 | 47.5 | 62.1 | 49.8 |
| LongVA-7B | 10 | 32.3 | 13.5 | 5.1 | 43.7 | 57.9 | 41.2 |
| InternVL2-8B | 3 | 43.5 | 32.9 | 13.5 | 48.0 | 68.0 | 55.0 |
| LongVILA-8B | 11 | 30.5 | 20.0 | 11.6 | 35.4 | 52.3 | 33.4 |
| VILA-1.5-8B | 8 | 37.3 | 30.1 | 27.3 | 42.2 | 50.4 | 36.7 |
| VILA-1.5-40B | 6 | 38.2 | 28.2 | 15.7 | 28.8 | 65.4 | 53.0 |
| LLaVA-NeXT-Video-72B | 2 | 44.0 | 32.3 | 10.5 | 48.1 | 78.3 | 50.9 |
| VLM-3R (7B) | 1 | 58.8 | 39.4 | 39.6 | 60.6 | 86.5 | 68.6 |

Ablation Studies

Ablation studies confirm that both geometric token fusion and camera token fusion are critical to VLM-3R's performance, especially in tasks reliant on scene structure and directional awareness. The overall 3D fusion mechanism also shows clear performance benefits.

Table 3: Ablation Study of VLM-3R Components on VSI-Bench. This table shows the impact of VLM-3R's key components, the geometry tokens and camera tokens, by comparing the full model against a fine-tuned LLaVA-NeXT-Video baseline and VLM-3R variants with individual components removed. Scores are accuracy for multiple-choice tasks and Mean Relative Accuracy for numerical-answer tasks (see Evaluation Metrics); task columns follow the same grouping as Table 1.
| Methods | Rank | Avg. | Obj. Count | Abs. Dist. | Obj. Size | Room Size | Rel. Dist. | Rel. Dir. | Route Plan | Appr. Order |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-NeXT-Video ft (w/o C&G Tok.) | 4 | 57.74 | 70.64 | 43.67 | 70.82 | 63.72 | 64.93 | 68.93 | 40.72 | 38.51 |
| VLM-3R w/o Cam. Tok. | 3 | 59.09 | 69.50 | 48.66 | 68.47 | 65.21 | 62.82 | 78.86 | 42.78 | 36.41 |
| VLM-3R w/o Geo. Tok. | 2 | 59.46 | 70.30 | 49.27 | 68.36 | 66.01 | 61.27 | 81.35 | 41.75 | 37.38 |
| VLM-3R (Full Model) | 1 | 60.90 | 70.16 | 49.38 | 69.15 | 67.12 | 65.35 | 80.52 | 45.36 | 40.13 |