VITA: Vision-To-Action Flow Matching Policy

1Department of Computer Science, University of California, Davis
2Department of Electrical Engineering and Computer Sciences, University of California, Berkeley
3Department of Electrical and Computer Engineering, University of California, Davis
4Department of Mechanical and Aerospace Engineering, University of California, Davis
VITA directly flows from latent images to latent actions
without sampling from Gaussian noise or relying on conditioning modules.
VITA: Vision-to-Action Flow Matching
Noise-Free, Conditioning-Free Policy Learning
Interactive flow visualization: Camera Image → Latent Images → (Flow Matching) → Latent Actions → Action Sequence

Abstract

We present VITA, a VIsion-To-Action flow matching policy that evolves latent visual representations into latent actions for visuomotor control. Conventional flow matching and diffusion policies face a fundamental inefficiency: they sample from standard source distributions (e.g., Gaussian noise) and then require additional conditioning mechanisms, such as cross-attention, to condition action generation on visual information, incurring time and space overhead. VITA instead treats latent images as the source of the flow and learns an inherent mapping from vision to action. Because the flow source is visually grounded, VITA eliminates the need for repeated conditioning during generation, which in turn enables simpler architectures such as MLPs. We evaluate MLP-only VITA on 9 simulation and 5 real-world tasks from ALOHA and Robomimic. Despite its simplicity, VITA outperforms or matches state-of-the-art policies, while speeding up inference by 1.5x to 2.3x. To our knowledge, VITA is the first MLP-only flow matching policy capable of solving complex bimanual manipulation tasks like those in the ALOHA benchmarks.

VITA Framework



VITA learns a continuous flow from latent visual representations to latent actions without requiring additional conditioning mechanisms. By leveraging compact 1D visual and action latent spaces, VITA reduces flow matching to a conditioning-free vector-to-vector mapping, allowing for the use of an MLP-only architecture while achieving strong performance across both simulation and real-world visuomotor tasks.
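To make the conditioning-free, vector-to-vector formulation concrete, here is a minimal PyTorch sketch of an MLP velocity field trained with a straight-line flow matching objective from a latent image to a latent action. All module names, dimensions, and the linear interpolant are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn as nn

# Hypothetical MLP velocity field; names and sizes are illustrative.
class VelocityMLP(nn.Module):
    def __init__(self, latent_dim=512, hidden_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 1, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, latent_dim),
        )

    def forward(self, z, t):
        # Flow time t (batch, 1) is simply appended to the latent;
        # no cross-attention or other conditioning module is needed.
        return self.net(torch.cat([z, t], dim=-1))

def flow_matching_loss(model, z_img, z_act):
    """One training step of vision-to-action flow matching.

    z_img: latent image (flow source), z_act: latent action (flow target).
    A straight-line interpolant gives the target velocity z_act - z_img.
    """
    t = torch.rand(z_img.shape[0], 1, device=z_img.device)
    z_t = (1 - t) * z_img + t * z_act   # point on the straight path
    v_target = z_act - z_img            # constant velocity along the path
    v_pred = model(z_t, t)
    return ((v_pred - v_target) ** 2).mean()

Because the source is the image latent rather than noise, there is no separate conditioning input: the visual information enters the flow as its starting point.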

VITA Denoising

Comparison of the denoising process between conventional flow matching and VITA. Conventional flow matching denoises random Gaussian noise into actions, whereas VITA flows from latent images to latent actions, so the actions generated at early steps are already more structured. We find that through VITA training, latent images acquire action semantics: the latent image can be decoded into a smooth trajectory, which is then progressively refined by the ODE process.
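A minimal sketch of the corresponding inference-time ODE integration, assuming the `VelocityMLP` above and a fixed-step Euler solver (the step count is illustrative). Since the flow starts at the latent image, the latent at any intermediate step can be decoded into a trajectory that the solver progressively refines.

import torch

@torch.no_grad()
def generate_action_latent(model, z_img, num_steps=10):
    """Euler integration of the learned ODE from latent image to latent action.

    The step-0 latent is the image latent itself, which per the paper's
    observation already decodes into a structured trajectory.
    """
    z = z_img.clone()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((z.shape[0], 1), i * dt, device=z.device)
        z = z + dt * model(z, t)  # follow the velocity field
    return z  # latent action, passed to the action decoder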



Addressing Latent Collapse in End-to-End Latent Flow Matching Training

Unlike latent diffusion for image generation, where the target latent space can be pre-trained on abundant image data, action data is scarce, so the target latent space is difficult to pre-train well and freeze as the flow matching target. Naively training flow matching end-to-end together with the target latent space leads to latent collapse (Figure (a), left). We identify, for the first time, the cause of this issue: the train-test gap between encoder-produced latents and ODE-generated latents. We propose flow latent decoding (FLD), which backpropagates through the flow ODE solver during training, closing the gap by anchoring latent representations to ground-truth targets.
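Below is a hedged sketch of how FLD could look in code, under our reading of the description: the ODE solve is run with gradients enabled, the ODE-generated latent is decoded, and the decoded actions are anchored to ground truth. `action_decoder`, the solver step count, and the plain MSE loss are all assumptions, not the authors' exact formulation.

import torch

def fld_loss(model, action_decoder, z_img, actions_gt, num_steps=5):
    """Sketch of flow latent decoding (FLD): differentiate through the ODE
    solve so the decoder is trained on ODE-generated latents (matching test
    time), anchored by ground-truth actions.
    """
    z = z_img
    dt = 1.0 / num_steps
    for i in range(num_steps):  # no torch.no_grad(): gradients flow through
        t = torch.full((z.shape[0], 1), i * dt, device=z.device)
        z = z + dt * model(z, t)
    actions_pred = action_decoder(z)   # decode the ODE-generated latent
    return ((actions_pred - actions_gt) ** 2).mean()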



Efficiency



The table compares the inference latency and peak memory usage of different flow matching policies using vector-based (Vector) or grid-based (Grid) visual feature representations. VITA achieves 1.5x to 2x faster inference and reduces memory usage by 18.6% to 28.7%.

Success Rates



We evaluate VITA on 9 simulation and 5 real-world tasks from ALOHA, AV-ALOHA, and Robomimic, covering both bimanual and single-arm manipulation. The MLP-only VITA consistently outperforms or matches state-of-the-art policies (including transformer-based conventional flow matching policies), while being significantly more efficient.

VITA Demos: Real-World Tasks

VITA demonstrates robustness to online perturbations.

Online Perturbations

VITA demonstrates generalization to unseen objects.

Unseen Objects

Bimanual Tasks with Active Vision

Two challenging bimanual manipulation tasks on AV-ALOHA, where an additional 7-DoF arm carries an active vision camera. The robot must predict and reach the best viewpoint to avoid occlusions and increase precision.

Hidden Pick

Transfer From Box

VITA Demos: Real-World and Simulation Tasks

Hidden Pick

Transfer From Box

Pick Ball

Store Drawer

Thread Needle

Pour Test Tube

Hook Package

Slot Insertion

Transfer Cube

Square

PushT

BibTeX

@misc{gao2025vitavisiontoactionflowmatching,
    title={VITA: Vision-to-Action Flow Matching Policy}, 
    author={Dechen Gao and Boqi Zhao and Andrew Lee and Ian Chuang and Hanchu Zhou and Hang Wang and Zhe Zhao and Junshan Zhang and Iman Soltani},
    year={2025},
    eprint={2507.13231},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2507.13231}, 
}