We present VITA, a VIsion-To-Action flow matching policy that evolves latent visual representations into latent actions for visuomotor control. Conventional flow matching and diffusion policies face a structural inefficiency: they sample from a standard source distribution (e.g., Gaussian noise) and then require additional conditioning mechanisms, such as cross-attention, to inject visual information into action generation, incurring extra time and memory overhead. VITA instead treats latent images as the source of the flow and learns an inherent mapping from vision to action, eliminating separate conditioning modules while preserving generative modeling capability. We implement VITA with simple MLP layers and evaluate it on challenging bi-manual manipulation tasks on the ALOHA platform, comprising 5 simulation and 2 real-world tasks. Despite its simplicity, the MLP-only VITA outperforms or matches state-of-the-art generative policies while reducing inference latency by 50% to 130% relative to conventional flow matching policies that rely on separate conditioning mechanisms or more complex architectures. To our knowledge, VITA is the first MLP-only flow matching policy capable of solving complex bi-manual manipulation tasks such as those in the ALOHA benchmarks.
VITA learns a continuous flow from latent visual representations to latent actions without requiring additional conditioning mechanisms. By leveraging compact 1D visual and action latent spaces, VITA enables a highly efficient MLP-only architecture while achieving strong performance across both simulation and real-world visuomotor tasks.
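To make the idea concrete, the following is a minimal PyTorch sketch of flow matching where the visual latent (rather than Gaussian noise) serves as the source distribution. The network shape, latent dimension, linear interpolation path, and Euler integration schedule are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class VelocityMLP(nn.Module):
    """Illustrative MLP velocity field v_theta(z_t, t).
    Depth and widths here are assumptions for the sketch."""
    def __init__(self, dim=512, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, z, t):
        # Append the scalar flow time t to each latent vector.
        return self.net(torch.cat([z, t.expand(z.size(0), 1)], dim=-1))

def flow_matching_loss(model, z_vision, z_action):
    """Conditional flow matching loss with the vision latent as the
    source: the flow starts at z_vision and ends at z_action."""
    t = torch.rand(z_vision.size(0), 1)
    z_t = (1 - t) * z_vision + t * z_action   # linear probability path
    target_v = z_action - z_vision            # constant target velocity
    pred_v = model(z_t, t)
    return ((pred_v - target_v) ** 2).mean()

@torch.no_grad()
def sample_action_latent(model, z_vision, steps=10):
    """Euler integration from the vision latent to an action latent;
    no cross-attention or other conditioning module is needed."""
    z = z_vision.clone()
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1, 1), i * dt)
        z = z + dt * model(z, t)
    return z
```

Because the source sample already carries the visual observation, the velocity network needs no extra conditioning pathway, which is what permits the MLP-only architecture.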
We evaluate VITA on challenging bi-manual manipulation tasks on ALOHA, including 5 simulation and 2 real-world tasks. VITA consistently outperforms or matches state-of-the-art generative policies (including transformer-based conventional flow matching policies), while being significantly more efficient thanks to its conditioning-free, MLP-only architecture.
@misc{gao2025vitavisiontoactionflowmatching,
  title={VITA: Vision-to-Action Flow Matching Policy},
  author={Dechen Gao and Boqi Zhao and Andrew Lee and Ian Chuang and Hanchu Zhou and Hang Wang and Zhe Zhao and Junshan Zhang and Iman Soltani},
  year={2025},
  eprint={2507.13231},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.13231},
}