Understanding Task Transfer in Vision-Language Models

CVPR 2026 (Oral)

Microsoft Research India
*Indicates Equal Contribution (order decided by coin toss)
Teaser overview figure

Vision-Language Models (VLMs) are usually finetuned on multiple tasks, yet little is known about how such finetuning affects a model's performance on other tasks. We therefore ask: How does finetuning on one visual perception task affect performance on other tasks?

Key Findings

1. Low-level tasks transfer best. Tasks such as Relative Depth, Relative Reflectance, and Visual Correspondence are highly positively transferable and malleable. Finetuning on low-level tasks is more beneficial than finetuning on mid- or high-level tasks.

2. Scale and granularity matter. Image-level tasks (Art Style, Counting, Forensic Detection) and pixel-level tasks (Functional Correspondence, Relative Depth) are both highly positively transferable, outperforming patch- and crop-level tasks.

3. Bigger models transfer more. The magnitude of both positive transferability and malleability increases with model size.

4. PGF guides data selection. When supervised data is scarce, PGF-informed data selection can guide alternative dataset designs that match, and even exceed, the performance of direct finetuning.

Interactive Task Transfer Graph

Qwen-2.5-VL 32B Task Transfer Graph (averaged across seeds). Click a node to highlight its connections, and drag a node to rearrange the layout. Edge color intensity indicates the strength of transfer between the two tasks.

Perfection Gap Factor (PGF) Score

Understanding how finetuning on one task affects performance on another—task transferability—is central to analyzing the behavior of Vision-Language Models (VLMs). Traditional transfer metrics often fail to account for how close a target task already is to its ceiling performance: a small improvement near saturation may be far more meaningful than a large improvement on a task with plenty of remaining headroom.

To address this limitation, we introduce the Perfection Gap Factor (PGF), a normalized measure that captures how much of the remaining performance gap on a target task is closed by finetuning on a source task. PGF enables fair comparison of transfer effects across tasks with different difficulty levels and performance ceilings.

Consider a VLM M finetuned on a source task Ti (denoted as M(Ti)) and evaluated on a target task Tj, which has an upper-bound or ceiling performance Uj. Let Acc(M, T) denote the accuracy of model M on task T.

$$\mathrm{PGF}_{i \to j}=\frac{\mathrm{Acc}(M(T_i),\,T_j) - \mathrm{Acc}(M,\,T_j)}{U_j - \mathrm{Acc}(M,\,T_j)}$$

Here, the numerator measures the change in performance on Tj due to finetuning on Ti, while the denominator captures the remaining gap to the ceiling Uj. A larger gap means more room to improve, while a smaller gap indicates a saturated task where improvements are harder.

Interpretation: A positive PGF indicates beneficial transfer, and a negative PGF indicates harmful transfer. Values near 1 reflect strong positive transfer (closing most of the remaining gap), while values near -1 indicate substantial negative transfer. PGF therefore provides a calibrated and comparable view of cross-task influence in VLM finetuning.
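As a concrete illustration, the PGF defined above can be computed directly from the three accuracies it depends on. In this minimal sketch (the function name and numbers are illustrative, not from the paper), `acc_ft` stands for Acc(M(Ti), Tj), `acc_base` for Acc(M, Tj), and `ceiling` for Uj:

```python
def pgf(acc_ft: float, acc_base: float, ceiling: float) -> float:
    """Perfection Gap Factor: the fraction of the remaining gap to the
    ceiling that finetuning on the source task closes (negative when
    finetuning on the source task hurts the target task)."""
    gap = ceiling - acc_base  # remaining headroom on the target task
    if gap <= 0:
        raise ValueError("Base accuracy must lie strictly below the ceiling.")
    return (acc_ft - acc_base) / gap

# Illustrative numbers: base accuracy 60%, ceiling 90%.
print(pgf(0.75, 0.60, 0.90))  # closes half of the 30-point gap, PGF ~ 0.5
print(pgf(0.54, 0.60, 0.90))  # loses 6 points of the 30-point gap, PGF ~ -0.2
```

Because the denominator shrinks as a task saturates, the same absolute gain yields a larger PGF on a nearly-saturated task, which is exactly the calibration the metric is designed to provide.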

Task Transfer Heatmaps

Perfection Gap Factor (PGF) between all pairs of source (rows) and target (columns) tasks.

Color scale: negative transfer → neutral → positive transfer.

BibTeX

@article{Sachdeva2025TaskTransfer,
  title={Understanding Task Transfer in Vision-Language Models},
  author={Bhuvan Sachdeva and Karan Uppal and Abhinav Java and Vineeth N. Balasubramanian},
  journal={arXiv preprint},
  year={2025},
  url={https://arxiv.org/abs/2511.18787}
}