Pretraining Objective Shapes Cross-Category Generalization in Affective Image Prediction: A Geometric Comparison of Vision Transformer Encoders
The geometry of representations learned by deep neural networks is shaped jointly by architecture and pretraining objective, yet disentangling these two factors remains difficult. Here we isolate the contribution of pretraining objective by comparing two Vision Transformers from the same backbone family but trained under different objectives: language-image contrastive learning (CLIP) and ImageNet-21k classification. Using continuous Valence-Arousal prediction on the OASIS dataset as a probe of representational quality, we evaluated frozen features under Leave-One-Theme-Out and Leave-One-Category-Out cross-validation, the latter requiring extrapolation to entirely unseen semantic categories. The contrastively pretrained encoder generalized substantially better than the classification-pretrained encoder under both protocols, with the gap widening sharply when held-out categories required cross-category generalization. To characterize why the two representations differ, we developed a ge…
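The evaluation protocol described above, fitting a linear probe on frozen features and scoring it on a fully held-out semantic category, can be sketched as follows. This is a minimal illustration only: the features, Valence-Arousal targets, and category labels below are synthetic stand-ins, not the paper's actual ViT embeddings or OASIS data, and ridge regression is assumed as the probe.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic stand-ins: in the paper's setup these would be frozen ViT
# embeddings (CLIP or ImageNet-21k) and OASIS Valence-Arousal ratings.
dim, n_per_cat = 64, 50
categories = ["animals", "objects", "people", "scenes"]  # hypothetical labels
w_true = rng.normal(size=(dim, 2))                       # shared VA structure

X = rng.normal(size=(len(categories) * n_per_cat, dim))
y = X @ w_true + rng.normal(scale=0.5, size=(X.shape[0], 2))
cat_labels = np.repeat(categories, n_per_cat)

def leave_one_category_out(X, y, cat_labels, alpha=1.0):
    """Fit a ridge probe on all but one category; score on the held-out one."""
    scores = {}
    for held_out in np.unique(cat_labels):
        train = cat_labels != held_out
        model = Ridge(alpha=alpha).fit(X[train], y[train])
        scores[held_out] = r2_score(y[~train], model.predict(X[~train]))
    return scores

scores = leave_one_category_out(X, y, cat_labels)
for cat, r2 in scores.items():
    print(f"{cat}: R^2 = {r2:.3f}")
```

Because the probe never sees the held-out category, its score measures cross-category extrapolation rather than interpolation, which is the distinction the abstract draws between the two protocols.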