REM: Evaluating LLM Embodied Spatial Reasoning
through Multi‑Frame Trajectories

A benchmark for object permanence, spatial relations, temporal ordering, and counting across egocentric multi‑frame sequences in controllable 3D environments.

Jacob Thompson, Emiliano Garcia‑Lopez, Yonatan Bisk
Carnegie Mellon University • Department of Computer Science • Pittsburgh, PA
3,119 trajectories • 47,019 QA • Tasks: counting, comparison, left/right, temporal
Top‑down layout, egocentric views, and QA examples.

Abstract

Humans build viewpoint‑independent cognitive maps that support robust spatial reasoning during navigation. Despite large‑scale video training, current MLLMs struggle with embodied spatial reasoning. REM introduces egocentric multi‑frame trajectories with explicit egomotion to evaluate object permanence and distinction, spatial relations, temporal ordering, and numerical tracking under viewpoint change. Reasoning models perform well in simple cases but degrade with congestion, duplicates, and longer horizons, falling far short of human reliability.

Datasets

Three Blender‑generated egocentric datasets designed to isolate distinct challenges in visuospatial reasoning.

Property             | Baseline                           | Single Frame          | Full Rotation
Num. Trajectories    | 3,119 (18)                         | 350                   | 100
Total QA Pairs       | 47,019 (154)                       | 1,289                 | 2,424
Trajectory Length(s) | 2, 4, 8, 16, 32, 64 (4, 8, 16, 32) | 1                     | 24
Object Count         | 8–48                               | 24–55                 | 24
Duplicate Count      | 0–46                               | 0–20                  | 1–2
Purpose              | General capabilities               | Single‑frame counting | Object distinction
Parenthesized values denote the mini baseline subset.

Baseline

47,019 QA pairs across 3,119 trajectories. Varies trajectory length, scene congestion, and duplicate rate. Tasks: counting, numerical comparison, left/right positioning, temporal ordering.

Baseline dataset examples with varying trajectory lengths.
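For concreteness, the sketch below shows what a single baseline QA record could look like; the field names and values are illustrative assumptions, not the released schema.

# Illustrative only: a hypothetical REM-style QA record for one baseline
# trajectory. Field names are assumptions, not the dataset's actual schema.
example_qa = {
    "trajectory_id": "baseline_00042",   # hypothetical identifier
    "num_frames": 8,                     # baseline lengths range from 2 to 64
    "task": "numerical_comparison",      # counting | comparison | left_right | temporal
    "question": "Did you see more red cubes or blue spheres across the trajectory?",
    "options": ["more red cubes", "more blue spheres", "equal"],
    "answer": "more red cubes",
}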

Single Frame

Isolates visual counting without frame‑to‑frame tracking, disentangling perception vs. identity maintenance.

Single‑frame dataset statistics.

Full Rotation

360° rotation in a cluttered scene; 0° and 180° views appear similar but contain different object identities (with 1–2 intentional duplicates). Tests whether models integrate egomotion and context.

Full rotation dataset visualization showing 360° scene coverage.
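A minimal Blender (bpy) sketch of such a capture, assuming a camera that yaws in place about the Z axis in 15° steps; the actual generation scripts live in the linked repository and differ in detail.

import math
import bpy  # Blender's Python API; run inside Blender

NUM_FRAMES = 24  # full-rotation trajectories use 24 views (15° apart)
scene = bpy.context.scene
camera = scene.camera  # assumes the scene already has an active camera

for i in range(NUM_FRAMES):
    # Rotate the camera in place and render one egocentric view.
    camera.rotation_euler[2] = math.radians(i * 360.0 / NUM_FRAMES)
    scene.render.filepath = f"//frames/rot_{i:02d}.png"
    bpy.ops.render.render(write_still=True)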

Generation & Verification

Per‑frame annotations (IDs, pixel coverage) power automated QA generation and a keyword‑aware verifier for grading.

REM generation and evaluation pipeline diagram.
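As a sketch of the keyword‑aware grading idea (our own minimal approximation, not the verifier shipped with REM), a grader can normalize a free‑form answer, map number words to digits, and accept it if the gold answer or a listed synonym appears as a whole word:

import re

# Minimal keyword-aware grader sketch; illustrative, not REM's released verifier.
NUMBER_WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
                "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}

def normalize(text: str) -> str:
    text = text.lower()
    for word, digit in NUMBER_WORDS.items():
        text = re.sub(rf"\b{word}\b", digit, text)  # "seven" -> "7"
    return re.sub(r"[^a-z0-9 ]", " ", text)         # drop punctuation

def grade(model_answer: str, gold: str, synonyms: tuple = ()) -> bool:
    """True if the gold answer or any synonym appears as a whole word."""
    answer = normalize(model_answer)
    keywords = [normalize(k).strip() for k in (gold, *synonyms)]
    return any(re.search(rf"\b{re.escape(k)}\b", answer) for k in keywords)

print(grade("I believe there are seven mugs.", "7"))  # True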

Results

Reasoning models (e.g., o3) lead overall but still underperform humans—particularly on counting and comparison under viewpoint change, congestion, and duplicates.

Scaling laws for non‑numerical tasks vs. congestion, duplicates, and horizon length.
Comparison accuracy drops as target counts converge.
Counting underestimation increases with true count; weak dependence on sequence length.
Predictions track max objects in any single frame more than full‑trajectory totals.
Question Metrics  | Overall | Num. Comparison | Left/Right Rel. | Temp. Ord. | Counting
Full Count        | 47,019  | 15,580          | 1,576           | 14,304     | 15,559
Mini Count        | 154     | 39              | 38              | 37         | 40
Random Chance (%) | —       | 33.3            | 50.0            | 33.3       | —
Distribution of question types across the dataset.

Selected Figures

System prompt and message format used for model queries.
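A minimal sketch of assembling such a multi‑frame query in the OpenAI chat format; the system prompt text, model name, and file paths below are placeholders, not the exact configuration used in the paper.

import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def frame_part(path: str) -> dict:
    """Encode one egocentric frame as an inline image content part."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}

frames = [f"frames/rot_{i:02d}.png" for i in range(24)]  # placeholder paths
messages = [
    {"role": "system", "content": "You observe an egocentric trajectory; frames are in order."},  # placeholder prompt
    {"role": "user", "content": [
        {"type": "text", "text": "How many mugs did you see across the whole trajectory?"},
        *[frame_part(p) for p in frames],
    ]},
]
response = client.chat.completions.create(model="gpt-4o", messages=messages)
print(response.choices[0].message.content)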
Mini baseline dataset statistics.

BibTeX

@inproceedings{thompson2025rem,
  title     = {REM: Evaluating LLM Embodied Spatial Reasoning through Multi-Frame Trajectories},
  author    = {Jacob Thompson and Emiliano Garcia-Lopez and Yonatan Bisk},
  booktitle = {Proceedings of COLM 2025},
  year      = {2025},
  note      = {Code and dataset: https://github.com/EmilianoGarciaLopez/REM}
}
Open PDF • GitHub

Acknowledgments

We thank Lockheed Martin Corporation, the DCIST Collaborative Research Alliance, and Fujitsu Limited for partial support.