Complementary Attention Head Pruning for Efficient Transformers
The remarkable success of Transformer-based models in natural language processing stems from architectural scaling, w...
Why it matters: CAHP (Complementary Attention Head Pruning) compresses large-scale Transformers automatically by using a graph-theoretic formulation to select a diverse set of attention heads. This removes the need for a manually chosen pruning ratio and mitigates the proximity bias that affects gradient-based methods. In practice, models can be compressed further before accuracy collapses, which matters when deploying NLP models on mobile and edge devices while keeping accuracy competitive.
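To make the idea of graph-based head selection concrete, here is a minimal sketch. It is not the paper's algorithm: this summary does not say how CAHP builds or searches its graph, so everything below is an assumption for illustration. The sketch treats each head as a node, links two heads when their (hypothetical) feature vectors have cosine similarity above a `threshold`, and keeps a greedy independent set, so that no two kept heads are linked as redundant. The function names, the `threshold` value, and the toy features are all made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_redundancy_graph(head_features, threshold):
    """Assumed construction: connect heads whose features are near-duplicates."""
    n = len(head_features)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(head_features[i], head_features[j]) > threshold:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def select_diverse_heads(adj):
    """Greedy independent set: repeatedly keep the head with the fewest
    remaining redundant neighbours, then drop those neighbours."""
    remaining = set(adj)
    kept = []
    while remaining:
        node = min(remaining, key=lambda i: (len(adj[i] & remaining), i))
        kept.append(node)
        remaining -= adj[node] | {node}
    return sorted(kept)


# Toy features for five hypothetical heads: 0/1 and 2/3 are near-duplicates.
features = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
    [0.0, 0.0, 1.0],
]
graph = build_redundancy_graph(features, threshold=0.9)
kept_heads = select_diverse_heads(graph)
print(kept_heads)  # [0, 2, 4]
```

In this toy setup the number of kept heads comes from the graph structure rather than from a preset pruning ratio. The similarity threshold is still a hyperparameter of the sketch, though, and the summary does not say whether CAHP has an analogous one.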
Primary Source
arXiv (Yaniv Livertovsky)