AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents

Anonymous


AtlasVA replaces text-only agent memory with a native visual skill memory hierarchy, evolving spatial atlases from rollouts and turning them into dense reward signals for long-horizon VLM agents.


AtlasVA teaser comparing text-centric memory with visual atlas memory

Paper Abstract

Memory That Stays Visual

Vision-language model (VLM) agents increasingly rely on memory-augmented reinforcement learning to reuse experience across long-horizon tasks, yet most existing frameworks store memory as text and depend on proprietary teacher models to summarize or refine it. This design is poorly matched to spatial decision making: geometric priors are compressed into lossy language, and sparse interaction is often supervised through delayed textual feedback rather than dense visually grounded signals.

AtlasVA is a teacher-free visual skill memory framework that organizes memory into three complementary layers: spatial heatmaps, visual exemplars, and symbolic text skills. It evolves danger and affinity atlases directly from trajectory statistics and lightweight grid heuristics, then reuses these self-evolving atlases as potential-based shaping rewards for reinforcement learning.

Core Ideas

Teacher-Free Visual Skill Memory

Spatial Heatmaps

Danger and affinity maps preserve local hazards, promising regions, and topological structure in the visual modality instead of flattening layouts into text.

Visual Exemplars

Representative success and failure screenshots provide concrete visual references that help the policy avoid repeated mistakes across episodes.

Atlas Evolution

Trajectory statistics and lightweight grid heuristics update the atlases through exponential moving average (EMA) blending, removing the need for external LLM teachers.
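A minimal sketch of how such an update could look, under stated assumptions rather than as the paper's implementation: the function names, the blending rate `alpha`, the choice to credit every cell of a successful rollout, and the choice to blame only the final cell of a failed rollout are all hypothetical. The only element taken from the text is that per-cell trajectory statistics are blended into the stored atlases with an EMA.

```python
import numpy as np


def trajectory_statistics(trajectories, grid_shape):
    """Build one danger map and one affinity map from a batch of rollouts.

    `trajectories` is a list of (cells, success) pairs, where `cells` is the
    ordered list of (row, col) grid positions the agent visited.
    Assumed heuristic (not from the paper): successful rollouts add to the
    affinity of every visited cell; failed rollouts add danger only to the
    cell where the episode ended.
    """
    danger = np.zeros(grid_shape, dtype=np.float64)
    affinity = np.zeros(grid_shape, dtype=np.float64)
    for cells, success in trajectories:
        if not cells:
            continue  # skip empty rollouts
        if success:
            for r, c in cells:
                affinity[r, c] += 1.0
        else:
            r, c = cells[-1]
            danger[r, c] += 1.0
    # Scale each map into [0, 1] so the blended atlases stay bounded.
    for m in (danger, affinity):
        peak = m.max()
        if peak > 0:
            m /= peak
    return danger, affinity


def ema_update(atlas, observation, alpha=0.2):
    """EMA blend: atlas <- (1 - alpha) * atlas + alpha * observation."""
    return (1.0 - alpha) * atlas + alpha * observation
```

Calling `ema_update` once per batch of rollouts lets older evidence decay geometrically, so a cell that stops causing failures gradually loses its danger value without any external model rewriting the memory.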

Dense Shaping

The evolved atlases become potential functions that reward motion toward high-affinity regions and penalize historically risky coordinates.
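The shaping idea can be sketched as follows. The shaping term gamma * Phi(s') - Phi(s) is the standard potential-based form, which is known to preserve optimal policies. Everything else here is an assumption made for illustration: the potential is taken to be affinity minus a weighted danger value, and the names `potential`, `shaping_reward`, and `danger_weight` are hypothetical.

```python
import numpy as np


def potential(affinity, danger, cell, danger_weight=1.0):
    """Assumed potential: high in promising cells, low in risky ones."""
    r, c = cell
    return float(affinity[r, c] - danger_weight * danger[r, c])


def shaping_reward(affinity, danger, cell, next_cell,
                   gamma=0.99, danger_weight=1.0):
    """Potential-based shaping term F = gamma * Phi(s') - Phi(s).

    Positive when the agent moves toward higher potential (more affinity,
    less danger), negative when it moves toward lower potential.
    """
    phi_now = potential(affinity, danger, cell, danger_weight)
    phi_next = potential(affinity, danger, next_cell, danger_weight)
    return gamma * phi_next - phi_now
```

In training, this term would be added to the environment reward at every step, which is what makes the signal dense: the agent receives feedback on each move rather than only at the end of the episode.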

Architecture

Closed Perception-Optimization Loop

Task Demos

AtlasVA Rollout Examples

Representative agent rollouts across grid puzzles, navigation, and manipulation tasks.

Sokoban

FrozenLake

Navigation

PrimitiveSkill Swap

Reported Experiments

Spatial Reasoning Benchmarks

Figure: Quantitative comparison on the Sokoban, FrozenLake, Navigation, and PrimitiveSkill benchmarks, spanning 2D puzzles, 3D navigation, and robotic manipulation. Deeper color denotes higher success rates.

Citation

BibTeX

@misc{wang2026atlasvaselfevolvingvisualskill,
      title={AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents}, 
      author={Pan Wang and Yihao Hu and Xiujin Liu and Jingchu Yang and Hang Wang and Zhihao Wen},
      year={2026},
      eprint={2605.17933},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.17933}, 
}