We present VITA, a VIsion-To-Action flow matching policy that evolves latent visual representations into latent actions for visuomotor control. Conventional flow matching and diffusion policies face a fundamental inefficiency: they sample from standard source distributions (e.g., Gaussian noise) and then require additional conditioning mechanisms, such as cross-attention, to condition action generation on visual information, incurring time and space overhead. We propose VITA, a novel paradigm that treats latent images as the source of the flow and learns an inherent mapping from vision to action. Because the source of the flow is visually grounded, VITA eliminates the need for repeated conditioning during generation and inherently enables simpler architectures such as MLPs. We evaluate MLP-only VITA on 9 simulation and 5 real-world tasks from ALOHA and Robomimic. Despite its simplicity, VITA outperforms or matches state-of-the-art policies while speeding up inference by 1.5x to 2.3x. To our knowledge, VITA is the first MLP-only flow matching policy capable of solving complex bi-manual manipulation tasks like those in the ALOHA benchmarks.
VITA learns a continuous flow from latent visual representations to latent actions without requiring additional conditioning mechanisms. By leveraging visual and action vector representations, VITA reduces flow matching to a conditioning-free vector-to-vector mapping, allowing for the use of an MLP-only architecture while achieving strong performance across both simulation and real-world visuomotor tasks.
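To make the idea concrete, here is a minimal sketch of the training objective implied above: a standard linear-interpolant flow matching loss in which the source sample is the latent image rather than Gaussian noise, so the velocity field is a plain MLP with no cross-attention conditioning. All module names and sizes are illustrative, not taken from the released implementation.

```python
import torch
import torch.nn as nn

# Hypothetical latent dimension; in VITA this would come from the vision/action encoders.
dim = 64

# Conditioning-free velocity field: a plain MLP over (x_t, t).
velocity_net = nn.Sequential(
    nn.Linear(dim + 1, 256), nn.SiLU(),  # +1 input for the flow time t
    nn.Linear(256, 256), nn.SiLU(),
    nn.Linear(256, dim),
)

def fm_loss(latent_image, latent_action):
    """Linear-interpolant flow matching loss with x0 = latent image, x1 = latent action."""
    t = torch.rand(latent_image.shape[0], 1)
    x_t = (1 - t) * latent_image + t * latent_action   # point on the straight-line path
    target_v = latent_action - latent_image            # constant target velocity
    pred_v = velocity_net(torch.cat([x_t, t], dim=-1))
    return ((pred_v - target_v) ** 2).mean()

# Dummy latents stand in for encoder outputs.
loss = fm_loss(torch.randn(8, dim), torch.randn(8, dim))
loss.backward()
```

Because the source already carries the visual information, the vector-to-vector mapping needs no extra conditioning pathway, which is what makes the MLP-only architecture viable.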
Comparison of the denoising process between conventional flow matching and VITA. Conventional flow matching denoises random Gaussian noise into actions, whereas VITA flows from latent images to latent actions. Surprisingly, latent images manifest action semantics through VITA training: the latent image can be decoded into a smooth trajectory that is progressively refined by the ODE process. This explains why MLP-only VITA achieves strong performance. VITA learns action-centric visual representations, yielding closely aligned latent manifolds of vision and action. A lightweight MLP therefore suffices for VITA, whereas MLP-only FM struggles because it must transport an unstructured Gaussian prior to a structured action space.
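The progressive refinement described above amounts to integrating the learned ODE dx/dt = v(x, t) from the latent image toward the latent action. A few fixed-step Euler updates suffice; the sketch below uses a stand-in linear layer for the trained velocity field, and all names are hypothetical.

```python
import torch
import torch.nn as nn

def euler_sample(velocity_net, latent_image, steps=8):
    """Integrate the flow ODE from the latent image (t=0) toward the latent action (t=1)."""
    x = latent_image
    for i in range(steps):
        t = torch.full((x.shape[0], 1), i / steps)
        x = x + velocity_net(torch.cat([x, t], dim=-1)) / steps  # one Euler step of size 1/steps
    return x

dim = 64
net = nn.Linear(dim + 1, dim)  # stand-in for a trained MLP velocity field
latent_action = euler_sample(net, torch.randn(4, dim))
```

Each intermediate x along this trajectory can be decoded, which is how the progressively refined action trajectories in the figure are visualized.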
VITA is the first policy that learns the flow jointly with the latent actions in an end-to-end manner. Unlike latent diffusion for image generation, where the target latent space can be pre-trained on abundant image data, action data is sparse and limited, so the target latent space is hard to pre-train well and freeze as the target for flow matching. Naively training flow matching end-to-end along with the target latent space leads to latent collapse (Figure (a), left). We identify, for the first time, the cause of this issue: the train-test gap between encoder-based latents and ODE-generated latents. We propose flow latent decoding (FLD), which backpropagates through the flow ODE solver during training, closing the gap by anchoring latent representations to ground-truth targets.
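A rough sketch of the FLD idea as stated: keep the Euler rollout of the ODE solver on the autograd graph, decode the ODE-generated latent, and anchor it to the ground-truth action, so gradients flow through every solver step. The module names, dimensions, and loss form here are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

dim, act_dim = 64, 14  # hypothetical latent and action dimensions
velocity_net = nn.Linear(dim + 1, dim)     # stand-in velocity field
action_decoder = nn.Linear(dim, act_dim)   # stand-in latent-action decoder

def fld_loss(latent_image, gt_action, steps=4):
    """Decode the ODE-generated latent and anchor it to the ground-truth action."""
    x = latent_image
    for i in range(steps):  # differentiable Euler rollout (no detach / no_grad)
        t = torch.full((x.shape[0], 1), i / steps)
        x = x + velocity_net(torch.cat([x, t], dim=-1)) / steps
    pred_action = action_decoder(x)         # decode the ODE-generated latent
    return ((pred_action - gt_action) ** 2).mean()

loss = fld_loss(torch.randn(8, dim), torch.randn(8, act_dim))
loss.backward()  # gradients reach the velocity field through all solver steps
```

Anchoring on ODE-generated latents rather than encoder-produced ones is what closes the train-test gap described above, since training then sees the same latents that inference produces.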
The MLP-only VITA shows significantly faster convergence and lower action errors compared to conventional methods (FM, DP, ACT) using transformers or U-Nets.
It is commonly believed that DP/FM policies achieve superior performance because of their stochasticity and ability to capture multi-modality. However, we show that a fully deterministic policy, VITA, achieves higher precision, faster convergence, and superior online success rates. Recently, Much Ado about Noising (Chaoyi Pan et al.) showed that multi-modality does not meaningfully explain the success of DP/FM policies, resonating with our findings.
We evaluate VITA on 9 simulation and 5 real-world tasks from ALOHA, AV-ALOHA, and Robomimic, covering both challenging bimanual and single-arm manipulation. The MLP-only VITA consistently outperforms or matches state-of-the-art policies (including a transformer-based conventional flow matching policy) while being significantly more efficient.
VITA achieves a throughput of 4,500 chunks/s with 0.22 ms/chunk inference time on a single NVIDIA 4090. Compared with conventional methods, whose inference latency is approximately 50% to 130% higher, VITA also reduces peak memory usage by 18% to 27%.
Two challenging bimanual manipulation tasks on AV-ALOHA, with an additional 7-DoF arm carrying an active-vision camera. The robot must predict and reach the best viewpoint to avoid occlusions and increase precision.
@misc{gao2025vitavisiontoactionflowmatching,
title={VITA: Vision-to-Action Flow Matching Policy},
author={Dechen Gao and Boqi Zhao and Andrew Lee and Ian Chuang and Hanchu Zhou and Hang Wang and Zhe Zhao and Junshan Zhang and Iman Soltani},
year={2025},
eprint={2507.13231},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2507.13231},
}