Sections

    TL;DR. We propose Action Images, a pixel-grounded multiview action representation that turns 7-DoF robot control into action videos, enabling zero-shot control and unified video-action generation with a single video backbone.

    Action Images teaser

    Zero-shot Results

    Our video backbone can generate high-quality results across diverse environments and viewpoints.

    Instruction: Place the black cup in the paper box (Unseen environment, unseen embodiment)

    View 1

    View 2



    Abstract

    World action models (WAMs) have emerged as a promising direction for robot policy learning, as they can leverage powerful video backbones to model future states. However, existing approaches often rely on separate action modules or use action representations that are not pixel-grounded, making it difficult to fully exploit the pretrained knowledge of video models and limiting transfer across viewpoints and environments. In this work, we present Action Images, a unified world action model that formulates policy learning as multiview video generation. Instead of encoding control as low-dimensional tokens, we translate 7-DoF robot actions into interpretable action images: multi-view action videos that are grounded in 2D pixels and explicitly track robot-arm motion. This pixel-grounded action representation allows the video backbone itself to act as a zero-shot policy, without a separate policy head or action module. Beyond control, the same unified model supports video-action joint generation, action-conditioned video generation, and action labeling under a shared representation. On RLBench and real-world evaluations, our model achieves the strongest zero-shot success rates and improves video-action joint generation quality over prior video-space world models, suggesting that interpretable action images are a promising route to policy learning.


    Method

    Our method builds a single video-space interface for observations and control. We rasterize each 7-DoF command into multi-view Gaussian action images aligned with the RGB views, decode the predicted heatmaps back to continuous actions with a lightweight multi-view procedure, and fine-tune a pretrained video generator with masked latent training that covers joint generation, action-conditioned video generation, video-to-action labeling, and video-only rollouts. The result is one model that both imagines the world and acts in it.
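    The decoding step can be sketched in two parts: take the peak of each predicted per-view heatmap, then triangulate the resulting 2D points across views using known camera projection matrices. The sketch below uses standard linear (DLT) triangulation with NumPy; the function names and the exact triangulation method are illustrative assumptions, not the paper's precise procedure.

```python
import numpy as np

def heatmap_peak(heatmap):
    """Return the (u, v) pixel coordinate of the maximum heatmap response."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return np.array([col, row], dtype=float)  # u = column, v = row

def triangulate(points_2d, proj_mats):
    """Linear (DLT) triangulation of one 3D point from two or more views.

    points_2d: list of (u, v) pixel coordinates, one per view.
    proj_mats: list of 3x4 camera projection matrices, one per view.
    """
    rows = []
    for (u, v), P in zip(points_2d, proj_mats):
        # Each view contributes two linear constraints on the homogeneous point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # Homogeneous least squares: the smallest right singular vector of A.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```

    Given two or more calibrated views, this recovers the continuous 3D action point without any learned decoder, which is what lets the video backbone alone serve as the policy.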

    Action as image. We convert each 7-DoF robot action into three semantic 3D points (position, normal, and up), project them into image space, and render them as RGB Gaussian heatmaps. The blue channel further encodes gripper openness in the low-response background, producing a pixel-grounded action representation.
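    The rasterization above can be sketched as follows: project each of the three semantic 3D points into a view and splat an isotropic Gaussian into one RGB channel, with gripper openness written into the blue channel's background. The channel assignment, the Gaussian width, and the background level are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def render_action_image(points_3d, P, hw, sigma=3.0, gripper_open=1.0):
    """Rasterize three semantic 3D points (position, normal, up) into one
    RGB action image for a single view.

    points_3d: (3, 3) array, one row per semantic point.
    P: 3x4 camera projection matrix for this view.
    hw: (height, width) of the output image.
    """
    h, w = hw
    img = np.zeros((h, w, 3), dtype=np.float32)
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    for c, X in enumerate(points_3d):
        # Project the 3D point to pixel coordinates (u, v).
        x = P @ np.append(X, 1.0)
        u, v = x[0] / x[2], x[1] / x[2]
        # Splat an isotropic Gaussian heatmap into channel c.
        img[..., c] = np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2 * sigma ** 2))
    # Encode gripper openness as a low constant floor in the blue channel's
    # background, kept well below the Gaussian peaks (assumed constant 0.3).
    img[..., 2] = np.maximum(img[..., 2], 0.3 * gripper_open)
    return img
```

    Because each channel is an ordinary image, the rendered action frames can be concatenated with RGB observations and handled by the video backbone without any new tokenizer.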

    Unified world-action model training. Multi-view video and action latents are packed with text and camera conditions, and trained under diverse mask strategies.
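    One common way to implement such mask strategies is to sample, per training example, which latent streams serve as clean conditioning and which are predicted. The mode names, layout, and sampling weights below are hypothetical, intended only to illustrate how one sampler can cover all four capabilities.

```python
import random

# Hypothetical mask modes: each marks the video and action latent streams
# as either clean conditioning ("condition"), denoising targets ("predict"),
# or dropped entirely ("absent").
MODES = {
    "joint":           {"video": "predict",   "action": "predict"},
    "action_to_video": {"video": "predict",   "action": "condition"},
    "video_to_action": {"video": "condition", "action": "predict"},
    "video_only":      {"video": "predict",   "action": "absent"},
}

def sample_mask(rng=random, weights=None):
    """Pick one training mode per example; weights are an assumed knob."""
    names = list(MODES)
    w = weights if weights is not None else [1.0] * len(names)
    name = rng.choices(names, weights=w, k=1)[0]
    return name, MODES[name]
```

    Training under a mixture of these modes is what lets a single set of weights perform joint generation, action-conditioned video generation, and action labeling at inference time, simply by choosing which streams to clamp.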



    BibTeX