Understanding Task Transfer in Vision-Language Models

CVPR 2026 (Oral)

Microsoft Research India
*Indicates Equal Contribution (order decided by coin toss)
Teaser overview figure

Vision-Language Models (VLMs) are usually finetuned on multiple tasks, yet little is known about how such finetuning affects a model's performance on other tasks. We therefore ask: How does finetuning on one visual perception task affect performance on other tasks?

Key Findings

1. Low-level tasks transfer best. Tasks such as Relative Depth, Relative Reflectance, and Visual Correspondence are highly positively transferable and malleable. Finetuning on low-level tasks is more beneficial than finetuning on mid- or high-level tasks.

2. Scale and granularity matter. Image-level tasks (Art Style, Counting, Forensic Detection) and pixel-level tasks (Functional Correspondence, Relative Depth) are both highly positively transferable, outperforming patch- and crop-level tasks.

3. Bigger models transfer more. The magnitude of both positive transferability and malleability increases with model size.

4. PGF guides data selection. When supervised data is scarce, PGF-informed data selection can guide alternative dataset designs that match, and even exceed, the performance of direct finetuning.

Interactive Task Transfer Graph

Qwen-2.5-VL 32B Task Transfer Graph (averaged across seeds). Click a node to highlight its connections, and drag a node to rearrange the layout. Edge color intensity indicates the strength of transfer between the two tasks.

Perfection Gap Factor (PGF) Score

Understanding how finetuning on one task affects performance on another—task transferability—is central to analyzing the behavior of Vision-Language Models (VLMs). Traditional transfer metrics often fail to account for how close a target task already is to its ceiling performance: a small improvement near saturation may be far more meaningful than a large improvement on a task with plenty of remaining headroom.

To address this limitation, we introduce the Perfection Gap Factor (PGF), a normalized measure that captures how much of the remaining performance gap on a target task is closed by finetuning on a source task. PGF enables fair comparison of transfer effects across tasks with different difficulty levels and performance ceilings.

Consider a VLM M finetuned on a source task Ti (denoted as M(Ti)) and evaluated on a target task Tj, which has an upper-bound or ceiling performance Uj. Let Acc(M, T) denote the accuracy of model M on task T.

$$\mathrm{PGF}_{i \to j}=\frac{\mathrm{Acc}(M(T_i),\,T_j) - \mathrm{Acc}(M,\,T_j)}{U_j - \mathrm{Acc}(M,\,T_j)}$$

Here, the numerator measures the change in performance on Tj due to finetuning on Ti, while the denominator captures the remaining gap to the ceiling Uj. A larger gap means more room to improve, while a smaller gap indicates a saturated task where improvements are harder.

Interpretation: A positive PGF indicates beneficial transfer, and a negative PGF indicates harmful transfer. Values near 1 reflect strong positive transfer (closing most of the remaining gap), while values near -1 indicate substantial negative transfer. PGF therefore provides a calibrated and comparable view of cross-task influence in VLM finetuning.
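As a concrete illustration, the PGF defined above can be computed directly from the three accuracies it depends on. In this minimal sketch (the function name and numbers are illustrative, not from the paper), `acc_ft` stands for Acc(M(Ti), Tj), `acc_base` for Acc(M, Tj), and `ceiling` for Uj:

```python
def pgf(acc_ft: float, acc_base: float, ceiling: float) -> float:
    """Perfection Gap Factor: the fraction of the remaining gap to the
    ceiling that finetuning on the source task closes (negative when
    finetuning on the source task hurts the target task)."""
    gap = ceiling - acc_base  # remaining headroom on the target task
    if gap <= 0:
        raise ValueError("Base accuracy must lie strictly below the ceiling.")
    return (acc_ft - acc_base) / gap

# Illustrative numbers: base accuracy 60%, ceiling 90%.
print(pgf(0.75, 0.60, 0.90))  # closes half of the 30-point gap, PGF ~ 0.5
print(pgf(0.54, 0.60, 0.90))  # loses 6 points of the 30-point gap, PGF ~ -0.2
```

Because the denominator shrinks as a task saturates, the same absolute gain yields a larger PGF on a nearly-saturated task, which is exactly the calibration the metric is designed to provide.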

Task Transfer Heatmaps

Perfection Gap Factor (PGF) between all pairs of source (rows) and target (columns) tasks.

Color scale: negative transfer → neutral → positive transfer.

BibTeX

@article{Sachdeva2025TaskTransfer,
  title={Understanding Task Transfer in Vision-Language Models},
  author={Bhuvan Sachdeva and Karan Uppal and Abhinav Java and Vineeth N. Balasubramanian},
  journal={arXiv preprint},
  year={2025},
  url={https://arxiv.org/abs/2511.18787}
}