Hide to See: Reasoning-prefix Masking for
Visual-anchored Thinking in VLM Distillation

1KAIST   2NVIDIA   3POSTECH
Corresponding authors
Reasoning-prefix masking during VLM distillation
With the full teacher thinking trace (naïve distillation), the student leans on exposed textual prefixes and attends weakly to the image. By masking salient reasoning prefixes, Masking-KD pushes the student to exploit visual evidence instead, improving its visual-anchored thinking.

Abstract

Recent think-answer approaches in VLMs, such as Qwen3-VL-Thinking, boost reasoning performance by leveraging intermediate thinking steps before the final answer, but their computational cost becomes substantial, especially for larger VLMs. To distill such capabilities into compact think-answer VLMs, a primary objective is to improve the student's ability to utilize visual evidence throughout its reasoning trace, as long think-answer traces suffer from visual forgetting issues. To this end, we introduce a novel think-answer distillation framework that encourages the student to anchor its thinking on visual information by masking the student's salient reasoning prefixes. To compensate for such masked textual cues, the student is encouraged to rely more on visual evidence as an alternative source of information during distillation. Our masking strategies include: (1) token-wise salient reasoning-prefix masking, which masks high-influence reasoning prefixes selectively for each next-token prediction, and (2) self-paced masking budget scheduling, which gradually increases the masking scale according to distillation difficulty, measured by the discrepancy between teacher–student distributions. In the distillation phase, the student is guided by our salient reasoning-prefix mask, which blocks both future tokens and salient reasoning cues, in place of the standard causal mask used for auto-regressive language modeling. Experimental results show that our approach outperforms recent open-source VLMs, VLM distillation, and self-distillation methods on multimodal reasoning benchmarks, while further analyses confirm enhanced visual utilization along the student thinking process.

Method

Masking-KD framework
During distillation, the student is guided by a salient reasoning-prefix mask that blocks both future tokens and salient reasoning prefixes, while the teacher operates under the standard causal mask. The mask is derived from an auxiliary student forward pass: a response-to-response attention map (which prefixes to mask) and a token-wise reverse-KL signal (how strongly to mask).

Masking-KD is a think-answer distillation framework that masks the student's salient reasoning prefixes during distillation, forcing the student to rely on visual evidence in place of the missing textual cues. When distilling long reasoning traces, accumulated prefixes can become highly informative, allowing the student to imitate the teacher's reasoning trace from a few exposed textual cues. This reduces the student's need to attend to the image and causes visual forgetting. Masking-KD removes this textual shortcut via reasoning-prefix masking. We construct a salient reasoning-prefix mask using the two strategies below and integrate it with the standard causal mask, so that the student can access neither salient reasoning prefixes nor future tokens during distillation:

  • Token-wise salient reasoning-prefix masking. Using a response-to-response attention map, it selects the highest-attended prefixes with a nucleus (top-ρ) rule and masks them — differently at every decoding step, since the most influential context shifts as generation proceeds.
  • Self-paced masking budget scheduling. The masking strength ρ_n is set adaptively from the token-wise teacher–student reverse-KL divergence: easily imitated tokens get stronger masking to amplify weak learning signals, while harder tokens keep more access to the reasoning context for stable training.
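
The two strategies above can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the exponential mapping from reverse KL to the budget ρ_n, and the parameters `rho_max` and `tau` are all hypothetical, since the page does not give the exact schedule. The sketch also covers only the response-to-response block and assumes each token always stays visible to itself.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def reverse_kl(student_logits, teacher_logits):
    """Token-wise reverse KL, i.e. KL(student || teacher), for one position."""
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_s, log_t))

def masking_budget(rkl, rho_max=0.5, tau=1.0):
    """ASSUMED schedule (not from the source): a low reverse KL (an easily
    imitated token) gives a budget near rho_max, a high reverse KL (a hard
    token) gives a budget near 0."""
    return rho_max * math.exp(-max(rkl, 0.0) / tau)

def salient_prefix_mask(attn, rho):
    """Build a visibility mask over response tokens.

    attn[n][j]: attention of response token n to response token j.
    rho[n]:     masking budget for token n, in [0, 1].
    Returns visible[n][j], True where token n may attend to token j.
    """
    T = len(attn)
    # Start from the standard causal mask (future tokens are blocked).
    visible = [[j <= n for j in range(T)] for n in range(T)]
    for n in range(1, T):
        weights = attn[n][:n]          # strictly earlier response tokens
        total = sum(weights)
        if total <= 0.0 or rho[n] <= 0.0:
            continue
        # Nucleus (top-rho) rule: mask the highest-attended prefixes until
        # their renormalized attention mass reaches the budget rho[n].
        order = sorted(range(n), key=lambda j: -weights[j])
        mass = 0.0
        for j in order:
            visible[n][j] = False
            mass += weights[j] / total
            if mass >= rho[n]:
                break
    return visible
```

Because the selection is made row by row, each decoding step can hide a different set of prefixes, which matches the token-wise behavior described above. In an actual training loop the resulting boolean mask would replace the causal mask in the student's attention layers, while image and prompt tokens remain visible.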

Results

Pass@1 (%) on multimodal reasoning benchmarks (greedy decoding, max length 4096). Masking-KD-8B is self-distilled; the 4B and 2B students are distilled from the Qwen3-VL-8B-Thinking teacher. Notably, on average our 2B student surpasses the undistilled 4B model (47.34 vs. 44.11), and our 4B student surpasses the undistilled 8B model (56.66 vs. 53.65).

Method                                Geo3K  MathVista  We-Math  MMK12  MathVerse  LogicVista  MMMU-Pro  Avg.
~8B Models
Qwen3-VL-8B-Thinking                  54.58  65.20      66.15    42.55  63.81      43.40       39.83     53.65
Masking-KD-8B (ours, self-distilled)  58.24  67.10      71.72    49.95  67.84      48.10       43.47     58.06
~4B Models
Qwen3-VL-4B-Thinking                  43.93  62.60      49.37    31.55  49.86      39.37       32.08     44.11
Masking-KD-4B (ours)                  52.58  66.50      71.03    51.00  62.66      52.35       40.52     56.66
~2B Models
Qwen3-VL-2B-Thinking                  26.29  43.10      25.17    13.00  28.21      18.57       14.51     24.12
Masking-KD-2B (ours)                  40.93  59.20      63.79    37.20  57.89      41.61       30.75     47.34
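
As a quick sanity check on the table, the short script below recomputes the Avg. column as the unweighted mean of the seven benchmark scores and derives the average gain of each Masking-KD model over its same-size baseline. The scores are copied from the table; the helper names `average` and `gain` are illustrative only.

```python
# Per-benchmark Pass@1 scores copied from the results table, in column order:
# Geo3K, MathVista, We-Math, MMK12, MathVerse, LogicVista, MMMU-Pro.
SCORES = {
    "Qwen3-VL-8B-Thinking": [54.58, 65.20, 66.15, 42.55, 63.81, 43.40, 39.83],
    "Masking-KD-8B":        [58.24, 67.10, 71.72, 49.95, 67.84, 48.10, 43.47],
    "Qwen3-VL-4B-Thinking": [43.93, 62.60, 49.37, 31.55, 49.86, 39.37, 32.08],
    "Masking-KD-4B":        [52.58, 66.50, 71.03, 51.00, 62.66, 52.35, 40.52],
    "Qwen3-VL-2B-Thinking": [26.29, 43.10, 25.17, 13.00, 28.21, 18.57, 14.51],
    "Masking-KD-2B":        [40.93, 59.20, 63.79, 37.20, 57.89, 41.61, 30.75],
}

def average(name):
    """Unweighted mean over the seven benchmarks, rounded like the table."""
    values = SCORES[name]
    return round(sum(values) / len(values), 2)

def gain(student, baseline):
    """Difference between the rounded averages of two models."""
    return round(average(student) - average(baseline), 2)

if __name__ == "__main__":
    for size in ("8B", "4B", "2B"):
        student = f"Masking-KD-{size}"
        baseline = f"Qwen3-VL-{size}-Thinking"
        print(size, average(baseline), average(student), gain(student, baseline))
```

The recomputed means agree with the reported Avg. column, and the same-size gains come out to 4.41, 12.55, and 23.22 points for the 8B, 4B, and 2B models respectively.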

Visual-anchored Thinking

Visual attention over generation

Visual attention stays high throughout generation, mitigating visual forgetting.
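
One plausible way to quantify such a curve is the share of attention mass that each generated token places on image tokens. The sketch below is an assumption about the measurement, not the metric defined in the paper: the function name `visual_attention_share`, the choice of a single (for example head-averaged) attention row per generated token, and the `image_positions` argument are all hypothetical.

```python
def visual_attention_share(attn, image_positions):
    """Fraction of attention mass on image tokens at each generation step.

    attn:            one row per generated token; each row holds that token's
                     attention weights over all context positions.
    image_positions: indices of the context positions that are image tokens.
    """
    image_positions = set(image_positions)
    shares = []
    for row in attn:
        total = sum(row)
        visual = sum(w for j, w in enumerate(row) if j in image_positions)
        # Guard against an all-zero row rather than dividing by zero.
        shares.append(visual / total if total > 0 else 0.0)
    return shares
```

Plotting the returned list against the generation step gives a curve of the kind described here: a share that decays along the trace would indicate visual forgetting, while a flat or high curve would indicate sustained use of the image.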

Visual attention maps

Masking-KD attends more strongly to relevant image regions than the naïve KD and attention-distillation baselines.

Prediction behavior during distillation

When salient prefixes are masked, the student compensates by activating relevant visual regions.

BibTeX

@article{yu2026hide,
  title   = {Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation},
  author  = {Yu, Seonghoon and Nam, Dongjun and Lee, Byung-Kwan and Son, Jeany},
  journal = {arXiv preprint arXiv:2605.11651},
  year    = {2026}
}