FRAPPE:
Infusing World Modeling into Generalist Policies via Multiple Future Representation Alignment
Han Zhao1,2†‡, Jingbo Wang3,4†, Wenxuan Song3†,
Shuai Chen5, Yang Liu2, Yan Wang6, Haoang Li3*, Donglin Wang2*
†Equal contribution ‡Project Lead *Corresponding Authors
1Zhejiang University, 2MILAB, Westlake University, 3HKUST (GZ), 4South China University of Technology, 5ShanghaiTech University, 6Tsinghua University

We introduce Future Representation Alignment via Parallel Progressive Expansion (FRAPPE), a VLA training framework that performs implicit world modeling by aligning the policy's predicted future representations with those of multiple visual foundation models. FRAPPE achieves significantly improved generalization and data efficiency in both simulated and real-world robotic tasks.
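To make the core idea concrete, here is a minimal numpy sketch of a multi-teacher future-representation alignment objective: the policy's predicted future features are pulled toward the features each visual foundation model teacher extracts from the future frame. The function names and the cosine form of the loss are our illustrative assumptions, not the paper's exact objective.

```python
import numpy as np

def cosine_alignment_loss(pred, target):
    # Negative-cosine-style alignment: 1 - cos(pred, target),
    # averaged over the batch (a common representation-alignment loss).
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    target = target / np.linalg.norm(target, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(pred * target, axis=-1)))

def multi_teacher_alignment_loss(preds_per_teacher, teacher_feats):
    # Sum the alignment loss over all visual foundation model teachers;
    # each teacher gets its own predicted-future head (an assumption here).
    return sum(cosine_alignment_loss(p, t)
               for p, t in zip(preds_per_teacher, teacher_feats))
```

Because the teachers supervise representations rather than actions, this objective can also be computed on video frames without action labels.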

We demonstrate that FRAPPE significantly outperforms state-of-the-art models in complex simulated and real-world scenarios, and that it can effectively leverage data from different levels of the training data pyramid.

Framework

Training: The model progressively learns to align with the representation spaces of multiple visual foundation models simultaneously.

Two-Stage Training Strategy:
1. Mid-Training fully fine-tunes the model to align with a single visual encoder (Theia).
2. Post-Training introduces parallel experts, built from multiple prefixes and LoRA modules (MiPA), each aligned with a different visual teacher, and aggregates their outputs via a router to achieve parameter-efficient scaling of model capacity.
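The Post-Training stage can be sketched as a router-weighted mixture of per-teacher LoRA experts on top of a frozen base projection. This is our reading of MiPA in minimal numpy form; the names (`mipa_forward`, `router_w`) and the softmax router are assumptions for illustration.

```python
import numpy as np

def lora_delta(x, A, B):
    # Standard low-rank LoRA update: x @ A @ B with rank r << d.
    return x @ A @ B

def mipa_forward(x, W, experts, router_w):
    # Frozen base projection plus a router-weighted sum of per-teacher
    # LoRA expert updates (an illustrative sketch of MiPA aggregation).
    logits = x @ router_w                        # (batch, n_experts)
    gates = np.exp(logits - logits.max(-1, keepdims=True))
    gates /= gates.sum(-1, keepdims=True)        # softmax router weights
    out = x @ W
    for i, (A, B) in enumerate(experts):
        out += gates[:, i:i + 1] * lora_delta(x, A, B)
    return out
```

Because each expert only adds low-rank matrices and a small router, capacity grows with the number of teachers while trainable parameters stay small.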

Inference: The parallel experts are executed concurrently at inference time, so the added capacity does not translate into proportionally higher latency.
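One common way to realize such parallel execution is to stack all expert weights and replace the per-expert Python loop with a single batched contraction. The sketch below is our assumption of how the LoRA expert deltas could be batched; the paper's actual implementation may differ.

```python
import numpy as np

def batched_expert_deltas(x, A_stack, B_stack):
    # Compute all LoRA expert updates in one batched einsum rather than
    # looping over experts, so extra experts add little wall-clock cost.
    # Shapes: x (n, d), A_stack (E, d, r), B_stack (E, r, D) -> (E, n, D).
    return np.einsum('nd,edr,erD->enD', x, A_stack, B_stack)
```

On accelerators the same pattern maps to one batched matmul kernel, which is why the router-weighted experts keep inference latency competitive.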

Experiments

SOTA on RoboTwin

FRAPPE outperforms SOTA models on average across 8 tasks under Easy and Hard settings on the RoboTwin simulation benchmark.


Result on Small-Scale Models

FRAPPE demonstrates strong performance even on small-scale models: a 130M-parameter RDT backbone trained with FRAPPE achieves success rates comparable to those of RDT-1B with naive fine-tuning.


Real-World Evaluation


Mid-Training with Human Egocentric Data

FRAPPE's implicit world modeling can benefit from human video demonstrations without action labels, including both large-scale internet video data (Ego (Web)) and small-scale task-related video demonstrations (Ego (Task)).


Inference Efficiency

FRAPPE enables efficient inference, maintaining competitive inference latency and GPU memory usage.

BibTeX

@article{zhao2026frappe,
    title={FRAPPE: Infusing World Modeling into Generalist Policies via Multiple Future Representation Alignment},
    author={Han Zhao and Jingbo Wang and Wenxuan Song and Shuai Chen and Yang Liu and Yan Wang and Haoang Li and Donglin Wang},
    journal={arXiv preprint arXiv:2602.17259},
    year={2026}
}