VITA: Vision-To-Action Flow Matching Policy

¹Department of Computer Science, University of California, Davis
²Department of Electrical Engineering and Computer Sciences, University of California, Berkeley
³Department of Electrical and Computer Engineering, University of California, Davis
⁴Department of Mechanical and Aerospace Engineering, University of California, Davis
VITA directly flows from latent images to latent actions, without sampling from Gaussian noise or relying on conditioning modules.

Abstract

We present VITA, a VIsion-To-Action flow matching policy that evolves latent visual representations into latent actions for visuomotor control. Conventional flow matching and diffusion policies suffer from a structural inefficiency: they sample from a standard source distribution (e.g., Gaussian noise) and then need additional conditioning mechanisms, such as cross-attention, to inject visual information into action generation, incurring time and space overhead. VITA instead treats latent images as the source of the flow and learns an inherent mapping from vision to action, eliminating separate conditioning modules while preserving generative modeling capability. We implement VITA with simple MLP layers and evaluate it on challenging bi-manual manipulation tasks on the ALOHA platform, covering 5 simulation and 2 real-world tasks. Despite its simplicity, MLP-only VITA outperforms or matches state-of-the-art generative policies while achieving a 50% to 130% inference speedup over conventional flow matching policies that require conditioning mechanisms or complex architectures. To our knowledge, VITA is the first MLP-only flow matching policy capable of solving complex bi-manual manipulation tasks such as those in the ALOHA benchmarks.
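To make the core idea concrete, below is a minimal sketch of a flow matching objective whose source distribution is the visual latent rather than Gaussian noise. All names and dimensions here (`velocity_mlp`, `latent_dim`, the specific layer widths) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

latent_dim = 256  # assumed shared width of visual and action latents

# Velocity field: a plain MLP conditioned only on the flow state and time;
# no cross-attention or other conditioning module is involved.
velocity_mlp = nn.Sequential(
    nn.Linear(latent_dim + 1, 512), nn.GELU(),
    nn.Linear(512, 512), nn.GELU(),
    nn.Linear(512, latent_dim),
)

def flow_matching_loss(z_img: torch.Tensor, z_act: torch.Tensor) -> torch.Tensor:
    """Flow matching with latent images as the source of the flow.

    The straight path x_t = (1 - t) * z_img + t * z_act replaces the usual
    noise-to-data path, so visual information enters through the source
    point itself rather than through a separate conditioning mechanism.
    """
    t = torch.rand(z_img.shape[0], 1)      # per-sample time in [0, 1]
    x_t = (1 - t) * z_img + t * z_act      # point on the linear path
    v_target = z_act - z_img               # constant velocity along that path
    v_pred = velocity_mlp(torch.cat([x_t, t], dim=-1))
    return ((v_pred - v_target) ** 2).mean()

# Usage: z_img and z_act would come from image/action encoders;
# random tensors stand in here just to show the shapes.
loss = flow_matching_loss(torch.randn(8, latent_dim), torch.randn(8, latent_dim))
```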

VITA Framework



VITA learns a continuous flow from latent visual representations to latent actions without requiring additional conditioning mechanisms. By leveraging compact 1D visual and action latent spaces, VITA enables a highly efficient MLP-only architecture while achieving strong performance across both simulation and real-world visuomotor tasks.
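At inference time, such a flow can be integrated from the visual latent to an action latent with a handful of ODE steps. The sketch below uses simple Euler integration and reuses the hypothetical `velocity_mlp` and `latent_dim` from the training sketch above; the step count is an assumption, not a reported hyperparameter.

```python
import torch

@torch.no_grad()
def generate_action_latent(z_img: torch.Tensor, num_steps: int = 10) -> torch.Tensor:
    """Integrate the learned flow from a visual latent to an action latent."""
    x = z_img.clone()        # start the flow at the image latent, not at noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((x.shape[0], 1), i * dt)
        x = x + dt * velocity_mlp(torch.cat([x, t], dim=-1))  # Euler step
    return x                 # approximate action latent at t = 1, ready to decode
```

Because the state, time, and velocity all live in one compact 1D latent space, each step is a single MLP forward pass, which is where the architecture's efficiency comes from.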

Results



We evaluate VITA on challenging bi-manual manipulation tasks on ALOHA, including 5 simulation and 2 real-world tasks. VITA consistently outperforms or matches state-of-the-art generative policies (including a transformer-based conventional flow matching policy) while remaining significantly more efficient thanks to its conditioning-free, MLP-only architecture.
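For readers who want to reproduce latency comparisons like these on their own hardware, here is a hedged timing sketch; it continues the hypothetical snippets above and its numbers are not the paper's reported results.

```python
import time
import torch

def mean_latency_ms(policy_fn, z_img: torch.Tensor,
                    warmup: int = 10, iters: int = 100) -> float:
    """Average wall-clock latency of one policy call, in milliseconds."""
    for _ in range(warmup):                  # warm up kernels / caches
        policy_fn(z_img)
    if torch.cuda.is_available():
        torch.cuda.synchronize()             # flush pending GPU work
    start = time.perf_counter()
    for _ in range(iters):
        policy_fn(z_img)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3

# Usage with the earlier sketch: mean_latency_ms(generate_action_latent,
# torch.randn(1, latent_dim))
```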

Autonomous Rollouts of VITA: Real-World Tasks

Third-Person Views

Hidden Pick Task

Transfer From Box Task

Robot Active Vision Views

Hidden Pick Task

Transfer From Box Task

Autonomous Rollouts of VITA: Simulation Tasks

Pour Test Tube

Thread Needle

Hook Package

Slot Insertion

Transfer Cube

PushT

BibTeX

@misc{gao2025vitavisiontoactionflowmatching,
    title={VITA: Vision-to-Action Flow Matching Policy}, 
    author={Dechen Gao and Boqi Zhao and Andrew Lee and Ian Chuang and Hanchu Zhou and Hang Wang and Zhe Zhao and Junshan Zhang and Iman Soltani},
    year={2025},
    eprint={2507.13231},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2507.13231}, 
}