Less is More: an Attention-free Sequence Prediction Modeling for Offline Embodied Learning

NeurIPS 2025

Wei Huang1,2, Jianshu Zhang3, Leiyu Wang3, Heyue Li4, Luoyi Fan3, Yichen Zhu5,
Nanyang Ye3†, Qinying Gu1†
1Shanghai AI Laboratory, 2Tsinghua University,
3Shanghai Jiao Tong University, 4Wuhan University, 5Midea Group,

† indicates corresponding author

Entropy-reward Correlation

Entropy-reward correlation (left) and attention maps (right) for DT (top) and DT with Token Merger (bottom), where lower entropy signifies more focused attention and thus stronger local dependency modeling. DT exhibits state-dominated attention, with sparse weights on RTG and action tokens, leading to higher attention entropy and a strong correlation between local entropy maxima and reward minima. In contrast, DT with Token Merger achieves significantly lower entropy, more balanced inter-step dependencies, and notably higher rewards.
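
One way to compute such an attention entropy is sketched below; this is a minimal sketch assuming row-normalized attention weights of shape (heads, queries, keys), and the exact averaging scheme used in the paper may differ.

import torch

def attention_entropy(attn_weights: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Shannon entropy of each query's attention distribution.

    attn_weights: (num_heads, num_queries, num_keys), rows summing to 1.
    Returns per-query entropy averaged over heads; lower values indicate
    more focused (less dispersed) attention.
    """
    entropy = -(attn_weights * (attn_weights + eps).log()).sum(dim=-1)  # (heads, queries)
    return entropy.mean(dim=0)                                          # (queries,)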


Architecture for Offline RL


Hierarchical modeling framework and the micro-structure of the Token Merger. The Token Merger fuses each (R, s, a) triplet into a unified per-timestep representation, while the Token Mixer models dependencies across timesteps.
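
As an illustration of the intra-step fusion idea, the sketch below implements a simple convolutional Token Merger in PyTorch; the kernel size, stride, and module names are illustrative assumptions rather than the paper's exact configuration.

import torch
import torch.nn as nn

class ConvTokenMerger(nn.Module):
    """Fuse the three (R, s, a) tokens of each timestep into one token.

    Input:  (batch, timesteps, 3, d)  -- three embedded tokens per step
    Output: (batch, timesteps, d)     -- one merged token per step
    """
    def __init__(self, d_model: int):
        super().__init__()
        # Kernel size 3 with stride 3 slides over the flattened
        # (R, s, a, R, s, a, ...) token stream, collapsing each triplet.
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, stride=3)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        b, t, k, d = tokens.shape                         # k == 3
        x = tokens.reshape(b, t * k, d).transpose(1, 2)   # (b, d, 3t)
        return self.conv(x).transpose(1, 2)               # (b, t, d)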


Real World


We further evaluate the model’s inference speed and generalization ability in real-world settings using a 7-DOF Franka Emika Panda robotic arm equipped with a single-joint torque sensor.


Visualization

t-SNE visualization of embeddings with/without Token Merger. Post-merger embeddings exhibit structured clustering, indicating enhanced local associations.
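
The t-SNE projection itself is standard; a hedged sketch follows, assuming the embeddings have been collected into a (num_tokens, d) NumPy array and colored by an arbitrary label such as token type or timestep. The perplexity and styling are illustrative choices, not the paper's settings.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(embeddings: np.ndarray, labels: np.ndarray, title: str) -> None:
    """Project embeddings to 2D with t-SNE and color points by label."""
    coords = TSNE(n_components=2, perplexity=30, init="pca",
                  random_state=0).fit_transform(embeddings)
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=4, cmap="tab10")
    plt.title(title)
    plt.show()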

Abstract

Offline reinforcement learning (offline RL) is increasingly approached as a sequence modeling task, with methods leveraging advanced architectures such as Transformers to capture trajectory dependencies. Despite significant progress, the mechanisms underlying their effectiveness and limitations remain insufficiently understood. Through an entropy-based analysis of the representative Decision Transformer (DT), we identify inconsistencies in the state-action-reward (s, a, R) distributions that cause attention "dispersal". To address this, we propose a hierarchical framework that decomposes sequence modeling into intra-step relational modeling, handled by a Token Merger that fuses each (s, a, R) triplet, and inter-step modeling, handled by a Token Mixer across timesteps. We investigate several Token Merger designs and validate their effectiveness across various offline RL methods. Furthermore, our theoretical analysis and experimental results suggest that while the Token Mixer is important, a lightweight architecture can match or even outperform more complex ones. We therefore propose a parameter-free Average Pooling Token Mixer, which, combined with a convolutional Token Merger, forms our final model, Decision HiFormer (DHi). Compared to DT on the D4RL benchmark, DHi achieves a 73.6% improvement in inference speed and a 9.3% gain in policy performance. DHi also generalizes well to real-world robotic manipulation tasks, offering both practical benefits and insights into sequence-based policy design for offline RL.
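
To make the parameter-free Average Pooling Token Mixer concrete, the sketch below shows one plausible causal-average instantiation over merged per-timestep tokens; whether the model uses a cumulative or windowed average, and how actions are decoded, are assumptions here rather than the official implementation.

import torch
import torch.nn as nn

class AvgPoolTokenMixer(nn.Module):
    """Parameter-free inter-step mixing: each timestep is represented by the
    causal (running) mean of all merged tokens up to and including itself."""
    def forward(self, merged: torch.Tensor) -> torch.Tensor:
        # merged: (batch, timesteps, d) -- output of a Token Merger
        cumsum = merged.cumsum(dim=1)
        counts = torch.arange(1, merged.size(1) + 1,
                              device=merged.device).view(1, -1, 1)
        return cumsum / counts   # causal average; no learnable parameters

In such a pipeline, a linear head on the mixed tokens would predict the next action, replacing the self-attention stack of DT while keeping the same trajectory interface.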

BibTeX


@InProceedings{huang2025decisionhiformer,
  author    = {Wei Huang and Jianshu Zhang and Leiyu Wang and Heyue Li and Luoyi Fan and Yichen Zhu and Nanyang Ye and Qinying Gu},
  title     = {Less is More: an Attention-free Sequence Prediction Modeling for Offline Embodied Learning},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025},
}