Polynomial Trajectory Compression for Protein Language Model Embeddings

Protein language models (PLMs) generate rich, layer-wise embeddings that capture diverse biological information, but storing and computing them at scale is expensive. In this work, we propose a compact surrogate representation for PLM embeddings across transformer layers that combines low-dimensional PCA projections with cubic polynomial trajectories. This approach enables efficient storage and on-demand reconstruction of protein-level embeddings at any layer without rerunning the PLM. We evaluate our method on two downstream tasks, protein-protein interaction and subcellular localization, using the ESM-35M and ESM-3B PLMs. The surrogate embeddings achieve high reconstruction fidelity while significantly reducing storage and computational requirements, and they retain the downstream prediction performance of the original embeddings. Our approach provides a scalable and practical solution for large-scale protein embedding storage and reuse.
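
The general idea of a PCA projection followed by a cubic fit over layers can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the PCA dimension, the data on which PCA is fit, or how the layer index is parametrized, so the function names (`fit_pca`, `compress`, `reconstruct`), the normalized layer position in [0, 1], and the ordinary least-squares fit are all assumptions made for illustration.

```python
import numpy as np

def fit_pca(X, k):
    """Fit PCA on X of shape (n_samples, D); return the mean (D,) and top-k components (k, D).

    Assumption: PCA is fit on embeddings pooled over proteins and layers.
    """
    mean = X.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def compress(emb, mean, components, degree=3):
    """Compress one protein's layer-wise embeddings emb of shape (n_layers, D).

    Returns polynomial coefficients of shape (degree + 1, k): one cubic
    trajectory over layers for each PCA coordinate.
    """
    n_layers = emb.shape[0]
    # Assumption: layers are mapped to evenly spaced positions in [0, 1].
    t = np.linspace(0.0, 1.0, n_layers)
    Z = (emb - mean) @ components.T        # (n_layers, k) PCA coordinates
    V = np.vander(t, degree + 1)           # columns t^degree, ..., t^0
    coeffs, *_ = np.linalg.lstsq(V, Z, rcond=None)
    return coeffs

def reconstruct(coeffs, layer, n_layers, mean, components):
    """Evaluate the stored trajectory at one layer and map back to D dimensions."""
    t = layer / (n_layers - 1)
    degree = coeffs.shape[0] - 1
    powers = t ** np.arange(degree, -1, -1)  # same column order as np.vander
    z = powers @ coeffs                      # (k,) PCA coordinates at this layer
    return z @ components + mean             # (D,) approximate embedding
```

Under these assumptions, each protein is stored as a `(4, k)` coefficient array instead of an `(n_layers, D)` embedding stack, plus one shared PCA mean and component matrix per model. Reconstruction at any layer is then two small matrix products, with no forward pass through the PLM.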

Source

https://www.biorxiv.org/content/10.64898/2026.06.05.730461v1?rss=1