Comprehensive results on Vision, MLM and more LION variants
In the final part of our LION series, we will present and discuss a selection of experimental results across various domains, including vision tasks, masked language modeling (MLM), and different LION architectures. These results not only highlight LION’s versatility and efficiency across diverse applications but also serve as a preview of the comprehensive findings detailed in the full paper.
We evaluated LION’s performance, efficiency, and training times against state-of-the-art SSMs and Transformers for image classification. The results demonstrate that LION achieves competitive performance while offering significant advantages in training speed and efficiency.
Model | #Param | ImageNet Top-1 Acc. (%) | Train. time |
---|---|---|---|
$\text{ViT}$ | 86M | $77.9$ | $\times 1$ |
$\text{DeiT}$ | 86M | $\underline{81.8}$ | $\times 1$ |
$\text{Hydra}$ | 104M | $81.0$ | $\times 2.51$ |
$\text{Vim}$ | 98M | $\mathbf{81.9}$ | $\times 10.86$ |
$\text{LION-}\text{🔥}$ | 86M | $74.7$ | $\mathbf{\times 0.73}$ |
$\text{LION-D}$ | 86M | $77.8$ | $\times \underline{1.39}$ |
$\text{LION-D}^{\natural}$ | 86M | $80.2$ | $\times 1.48$ |
$\text{LION-S}$ | 86M | $76.3$ | $\times 1.46$ |
$\text{LION-S}^{\natural}$ | 86M | $79.9$ | $\times 1.68$ |
As shown in the table above, LION models deliver competitive accuracy while training significantly faster than vision-specific SSMs such as Hydra and Vim. LION-D matches ViT's accuracy at a comparable training cost, and the $^{\natural}$ variants (LION-D$^{\natural}$ and LION-S$^{\natural}$) come within roughly one to two points of Hydra and Vim while training substantially faster ($\times 1.48$ and $\times 1.68$ versus $\times 2.51$ and $\times 10.86$).
The LION family also demonstrates excellent memory efficiency. The figure below shows inference memory usage with a batch size of 64 across different image resolutions: in their RNN form, LION models maintain reasonable memory consumption even at resolutions up to 2496 pixels, while baselines such as ViT and DeiT run out of memory (OOM) at much lower resolutions. The same efficiency carries over to BERT-style masked language modeling, where LION adds only minimal training overhead.
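To get a feel for why attention-form baselines hit OOM while the RNN form keeps scaling, here is a rough back-of-the-envelope estimate. The settings below (patch size 16, 12 heads, head dimension 64, fp16, batch 64) are illustrative assumptions for this sketch, not the exact benchmark configurations:

```python
# Order-of-magnitude estimate of inference memory for an L x L attention map
# vs. a fixed-size linear-attention recurrent state. All settings here are
# illustrative assumptions, not the measured configurations from the paper.

BYTES_FP16 = 2

def attention_matrix_bytes(resolution, patch=16, heads=12, batch=64):
    """Memory to materialize one L x L attention map per head."""
    seq_len = (resolution // patch) ** 2
    return batch * heads * seq_len * seq_len * BYTES_FP16

def recurrent_state_bytes(head_dim=64, heads=12, batch=64):
    """Memory for a d x d linear-attention state per head; resolution-independent."""
    return batch * heads * head_dim * head_dim * BYTES_FP16

for res in (224, 1024, 2496):
    attn_gb = attention_matrix_bytes(res) / 1e9
    state_mb = recurrent_state_bytes() / 1e6
    print(f"{res:>4}px  attention map ~{attn_gb:8.1f} GB   RNN state ~{state_mb:6.1f} MB")
```

Since the token count grows quadratically with resolution, the attention map grows with the fourth power of the image side, while the recurrent state stays constant in size; that asymmetry is what the memory figure captures.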
Training is similarly efficient across both vision and language tasks. As shown in the table below (times are relative to the Transformer baseline of each task), LION variants add far less training overhead than the SSM baselines.
Task | LION-🔥 | LION-D | LION-S | Hydra | Vim |
---|---|---|---|---|---|
Vision | $\times 0.73$ | $\times 1.39$ | $\times 1.46$ | $\times 2.51$ | $\times 10.86$ |
MLM | $\times 0.95$ | $\times 1.10$ | $\times 1.32$ | $\times 3.13$ | ✗ |
For vision tasks, LION-🔥 achieves remarkable speed, training 27% faster than standard vision Transformers (a relative training time of $\times 0.73$), while LION-D and LION-S remain well below the training cost of Hydra and Vim.
In MLM tasks, the efficiency gains are even more pronounced. LION-🔥 nearly matches Transformer training speed at just 0.95x, while LION-D adds only 10% overhead. Even LION-S remains efficient at 1.32x. All LION variants significantly outperform Hydra’s 3.13x slowdown, while Vim is not applicable to MLM tasks (marked as ✗).
For masked language modeling (MLM), we evaluated LION models against BERT and Hydra, comparing MLM accuracy, GLUE benchmark score, and relative training time.
Model | MLM Acc. | GLUE | Train. time |
---|---|---|---|
BERT | $\underline{69.88}$ | $\mathbf{82.95}$ | $\times 1$ |
Hydra | $\mathbf{71.18}$ | $\underline{81.77}$ | $\times 3.13$ |
LION-🔥 | $67.11$ | $80.76$ | $\times \mathbf{0.95}$ |
LION-D | $68.64$ | $81.34$ | $\times \underline{1.10}$ |
LION-S | $69.16$ | $81.58$ | $\times 1.32$ |
Let’s explore how different LION variants handle the trade-off between memory usage and inference speed. We will look at three key approaches:

- Full Attention: materializing the complete attention matrix, the most memory-hungry option.
- RNN: the recurrent form, which keeps only a fixed-size state and is the most memory-efficient.
- LION Chunk: the chunkwise parallel form, which materializes attention only within chunks and sits between the two.
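As a refresher on what these forms actually compute, the sketch below contrasts a simplified, unmasked bidirectional linear-attention layer (in the spirit of LION-🔥, with no decay or selective mask) in its full attention form and its equivalent two-pass recurrent form. This is a readability-first illustration under our own assumptions (ELU+1 feature map, single head, no scaling, our own function and tensor names), not the paper's implementation:

```python
import torch

def phi(x):
    return torch.nn.functional.elu(x) + 1  # positive feature map (assumption)

def full_attention_form(q, k, v):
    """Materialize the L x L attention map: O(L^2) memory."""
    q, k = phi(q), phi(k)
    attn = q @ k.transpose(-1, -2)                     # (L, L)
    return (attn @ v) / attn.sum(-1, keepdim=True)     # row-normalized output

def rnn_form(q, k, v):
    """Equivalent two-pass recurrence: fixed-size state, no L x L map."""
    q, k = phi(q), phi(k)
    L, d = q.shape
    S_f = torch.zeros(d, v.shape[1]); z_f = torch.zeros(d)
    S_b = torch.zeros(d, v.shape[1]); z_b = torch.zeros(d)
    num = torch.zeros_like(v); den = torch.zeros(L)
    for i in range(L):                                  # forward pass
        S_f = S_f + torch.outer(k[i], v[i]); z_f = z_f + k[i]
        num[i] = q[i] @ S_f; den[i] = q[i] @ z_f
    for i in reversed(range(L)):                        # backward pass
        S_b = S_b + torch.outer(k[i], v[i]); z_b = z_b + k[i]
        # the diagonal term k_i v_i^T appears in both passes: subtract it once
        num[i] = num[i] + q[i] @ (S_b - torch.outer(k[i], v[i]))
        den[i] = den[i] + q[i] @ (z_b - k[i])
    return num / den.unsqueeze(-1)

q, k, v = torch.randn(8, 16), torch.randn(8, 16), torch.randn(8, 16)
print(torch.allclose(full_attention_form(q, k, v), rnn_form(q, k, v), atol=1e-4))
```

Both functions return the same output; the difference is that the recurrent form never materializes the $L \times L$ map, which is exactly the memory trade-off visible in the plots below.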
The first plot below shows how these approaches compare in terms of memory efficiency and inference speed for LION-D. The RNN approach proves to be the most memory-efficient, while Full Attention uses the most memory. LION Chunk provides a nice middle ground: it uses less memory than Full Attention while achieving faster inference than either alternative. This makes it particularly attractive when you need to balance performance with resource constraints.
For LION-🔥, we see a similar pattern, and the advantage of the chunking approach is even more pronounced.
Lastly, for LION-S, the chunking approach is only faster at lower resolutions; at higher resolutions, the overhead from mask calculations starts to slow it down.
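To make the chunking idea concrete, here is a toy chunkwise version of the same unmasked layer from the previous sketch (helpers repeated so it runs standalone). The chunk size and function names are our own, and the decay/selective masks used by LION-D and LION-S are omitted; computing those per-chunk mask terms is precisely the overhead that catches up with LION-S at high resolutions:

```python
import torch

def phi(x):
    return torch.nn.functional.elu(x) + 1  # positive feature map (assumption)

def full_attention_form(q, k, v):
    """Reference: full L x L attention (same as the previous sketch)."""
    qf, kf = phi(q), phi(k)
    attn = qf @ kf.transpose(-1, -2)
    return (attn @ v) / attn.sum(-1, keepdim=True)

def chunkwise_form(q, k, v, chunk=4):
    """Chunkwise-parallel version: attention is materialized only inside each
    chunk, while cross-chunk contributions travel through small k v^T states."""
    q, k = phi(q), phi(k)
    L, d = q.shape; dv = v.shape[1]
    qs, ks, vs = q.split(chunk), k.split(chunk), v.split(chunk)
    # per-chunk summaries: sum_j k_j v_j^T and sum_j k_j
    S = [kc.transpose(0, 1) @ vc for kc, vc in zip(ks, vs)]   # (d, dv) each
    z = [kc.sum(0) for kc in ks]                              # (d,) each
    out = []
    for c in range(len(qs)):
        # states summarizing every token outside chunk c
        S_out = sum(S[:c] + S[c + 1:], torch.zeros(d, dv))
        z_out = sum(z[:c] + z[c + 1:], torch.zeros(d))
        attn = qs[c] @ ks[c].transpose(0, 1)                  # (chunk, chunk) only
        num = attn @ vs[c] + qs[c] @ S_out
        den = attn.sum(-1) + qs[c] @ z_out
        out.append(num / den.unsqueeze(-1))
    return torch.cat(out)

q, k, v = torch.randn(8, 16), torch.randn(8, 16), torch.randn(8, 16)
print(torch.allclose(chunkwise_form(q, k, v), full_attention_form(q, k, v), atol=1e-4))
```

Only a chunk × chunk attention block is ever materialized, while cross-chunk information is carried by small $d \times d_v$ states, which is why chunking sits between the RNN and Full Attention forms in memory while remaining fast.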
Expanding LION’s Potential: Our experiments focused on three main mask choices, but LION has the potential to accelerate other Linear Transformer variants for bidirectional tasks.
Optimizing Chunkwise Parallelism: The chunkwise parallel implementation used during inference is written in PyTorch, leaving room for optimization through GPU kernel programming to reduce I/O overhead and further improve speed.
Stabilizing Hydra and Mamba with LION: Hydra and Mamba could also be expressed and stabilized within the LION framework; we see this as a promising direction for future work.
We encourage readers of this blog post to read the full paper for more details about the LION framework and the experimental setups. The implementation details are available in the code repository.
If you use this work, please consider citing the paper:
@article{afzal2025linear,
title={Linear Attention for Efficient Bidirectional Sequence Modeling},
author={Afzal, Arshia and Abad Rocamora, Elias and Candogan, Leyla Naz and Puigdemont, Pol and Tonin, Francesco and Wu, Yongtao and Shoaran, Mahsa and Cevher, Volkan},
journal={arXiv preprint arXiv:2502.16249},
year={2025},
url={https://arxiv.org/abs/2502.16249},
doi={10.48550/arXiv.2502.16249}
}