StochasT: Learning with Stochastic Turn Depth for Visual Instruction Tuning

Abstract

Large Vision-Language Models (LVLMs) rely extensively on Visual Instruction Tuning (VIT) to elicit their multimodal reasoning capabilities. However, we find a discrepancy: VIT often packs multiple language tasks about the same image for conversational, multi-turn training, whereas existing benchmarks evaluate LVLMs in isolated, single-turn scenarios. The models can suffer from visual attention decay and contextual overfitting during multi-turn training, making it hard for them to realize their full potential in the mismatched test phase. To close the gap, we propose learning with Stochastic Turn Depth (StochasT), which stochastically groups language tasks for the same image into clusters of varying sizes (turn depth) while preserving their organic order. Although StochasT draws on Dropout and on stochastic depth for ResNets, it does not actually drop any training data, which maximizes the utility of the data. Furthermore, we introduce a challenging, benchmark-agnostic evaluation mechanism based on the Balanced Latin Square to measure LVLMs' robustness under varying contextual dependencies. Extensive experiments demonstrate that StochasT effectively grants LVLMs strong, harmonized capabilities for both single-turn and multi-turn use cases.

Visual Instruction Tuning and LVLM Evaluation

Unlike pure language tasks, the high information density inherent to visual data naturally affords multi-turn (multiT) language queries. A single image often grounds multiple distinct instructions (as illustrated in the figure below), and this one-image-multiT format has been frequently used in VIT. However, a significant discrepancy exists between this multiT training paradigm and currently prevalent single-turn (singleT) evaluation protocols. As depicted in the figure below, multiT training groups all instruction-answer pairs about the same image into one training example, while singleT evaluation anchors every instruction individually to the image. As a result, a model may fail to answer a simple question in isolation but succeed when the exact same question is contextualized within a conversation. Most existing LVLM benchmarks employ singleT testing exclusively, treating related questions about the same image as independent, isolated runs. However, if we instead evaluate LVLMs in the same, multiT way as used in training, we can boost their performance significantly. Surprisingly, the literature has largely overlooked this discrepancy between multiT VIT and singleT evaluation.
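To make the two formats concrete, the following is a minimal sketch of how the same instruction-answer pairs could be packed for multiT training versus singleT evaluation. The dictionary layout and the function names `to_multi_turn` and `to_single_turn` are illustrative assumptions, not the data format of any particular VIT codebase.

```python
def to_multi_turn(image, qa_pairs):
    """Pack every instruction-answer pair about one image into ONE example."""
    return {
        "image": image,
        "turns": [{"instruction": q, "answer": a} for q, a in qa_pairs],
    }


def to_single_turn(image, qa_pairs):
    """Create one example PER instruction, each anchored to the image alone."""
    return [
        {"image": image, "turns": [{"instruction": q, "answer": a}]}
        for q, a in qa_pairs
    ]


# Hypothetical toy data, used only to show the shapes of the two formats.
pairs = [
    ("What plant is this?", "A fern."),
    ("What color are the leaves?", "Green."),
    ("Is it indoors?", "Yes."),
]
multi = to_multi_turn("img_001.jpg", pairs)
single = to_single_turn("img_001.jpg", pairs)
print(len(multi["turns"]), len(single))
```

In the multiT example, the second and third questions are conditioned on the earlier turns; in the singleT examples, each question sees only the image.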

Our Research Questions

In this work, we investigate two primary research questions hinging on this observation:

1) How can we effectively balance the one-image-multiT training paradigm to optimize performance across both singleT and multiT evaluation?
2) How can we systematically evaluate LVLMs' robustness to varying contextual dependencies, and how does this robustness correlate with their performance in singleT vs. multiT scenarios?

Stochastic Turn Depth

Training Large Vision-Language Models (LVLMs) typically forces a compromise between isolated single-turn responses and rigid, sequential multi-turn dialogues. To bridge this gap, we introduce Stochastic Turn Depth (StochasT). Inspired by the principles of Dropout, StochasT dynamically masks the historical context of conversations during training. Instead of feeding the model a strict, linear history, our algorithm randomly connects each conversational turn to a preceding parent node via a causal backward traversal. This expands standard sequential data into a diverse tree of conversational trajectories. By training over this implicit ensemble of varying context lengths, StochasT strictly preserves chronological causality while teaching the model to adapt to conversations of any length.
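One way to read the parent-linking step is sketched below. This is an illustration under stated assumptions rather than the exact StochasT procedure: the fixed link-drop probability `p_drop`, the `ROOT` marker, and the function names are all hypothetical. Each turn walks backward over earlier turns and attaches to the first one whose link is kept; if none is kept, it attaches directly to the image. The context of a turn is then its chain of ancestors, so every turn is still trained on exactly once and chronological order is never violated.

```python
import random

ROOT = -1  # hypothetical marker: the turn attaches directly to the image


def sample_parents(num_turns, p_drop, rng):
    """For each turn, walk backward and attach to the first earlier turn kept."""
    parents = []
    for turn in range(num_turns):
        parent = ROOT
        for candidate in range(turn - 1, -1, -1):
            # Keep the link to `candidate` with probability 1 - p_drop.
            if rng.random() >= p_drop:
                parent = candidate
                break
        parents.append(parent)
    return parents


def context_of(turn, parents):
    """Ancestor chain of `turn`, in chronological order, excluding the turn."""
    chain = []
    node = parents[turn]
    while node != ROOT:
        chain.append(node)
        node = parents[node]
    return chain[::-1]


rng = random.Random(0)
parents = sample_parents(5, p_drop=0.5, rng=rng)
for t in range(5):
    print(f"turn {t}: history = {context_of(t, parents)}")
```

Under these assumptions, `p_drop=0` recovers the full linear history (standard multiT training) and `p_drop=1` gives every turn an empty history (singleT training); values in between yield a tree with mixed turn depths.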

Balanced Latin Square Turn Permutations

Large Vision-Language Models often suffer from context sensitivity—a model might answer a question perfectly in isolation, but fail when that same question is buried deep within a complex dialogue. To rigorously test a model's true reliability, we introduce a novel evaluation paradigm using Balanced Latin Square (BLS) Turn Permutations. Instead of testing static question-answer pairs, we systematically shuffle the order of the conversation. Our BLS approach ensures that every question appears in every possible dialogue position, and immediately follows every other question exactly once. To quantify a model's intrinsic capabilities under these shifting contexts, we propose two new metrics:

Context-Robust Accuracy (CRA): This measures the model's average accuracy for a specific question across all the different conversational reorderings, giving a broad view of its stability.
Strict Context-Robust Accuracy (CRA+): The ultimate stress test. This stringent metric only awards credit if the model answers the question correctly across every single permutation. It ensures the model relies on genuine, context-invariant knowledge rather than lucky guesses or favorable formatting.
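The two metrics can be sketched as follows, assuming the evaluation results are stored as a table `correct[q][k]` that records whether question `q` was answered correctly under ordering `k`. That layout, the function names, and the averaging of per-question scores into a single benchmark number are assumptions made for illustration.

```python
def context_robust_accuracy(correct):
    """Mean over questions of the fraction of orderings answered correctly."""
    per_question = [sum(row) / len(row) for row in correct]
    return sum(per_question) / len(per_question)


def strict_context_robust_accuracy(correct):
    """Fraction of questions answered correctly under EVERY ordering."""
    return sum(all(row) for row in correct) / len(correct)


# Hypothetical results for 3 questions evaluated under 4 orderings each.
correct = [
    [True, True, True, True],      # robust: right in every ordering
    [True, False, True, True],     # context-sensitive: fails once
    [False, False, False, False],  # never answered correctly
]
print(context_robust_accuracy(correct))
print(strict_context_robust_accuracy(correct))
```

In this toy table, CRA is about 0.58 while CRA+ is about 0.33: the second question still earns partial credit under CRA but none under CRA+.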

An example of a 4x4 BLS is shown in the figure below.
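A square with these properties can be generated with the classic Williams construction, sketched here for an even number of questions. The function name `balanced_latin_square` is illustrative, and the resulting square may label or order its rows differently from the one in the figure. For an odd number of questions a single square cannot be balanced in this sense, so this sketch simply rejects that case.

```python
def balanced_latin_square(n):
    """Return n orderings of range(n) forming a balanced Latin square (even n)."""
    if n < 2 or n % 2 != 0:
        raise ValueError("this sketch handles even n >= 2 only")
    # First row interleaves low and high indices: 0, 1, n-1, 2, n-2, ...
    first, lo, hi = [0], 1, n - 1
    while len(first) < n:
        first.append(lo)
        lo += 1
        if len(first) < n:
            first.append(hi)
            hi -= 1
    # Every other row is a cyclic shift of the first row's values.
    return [[(x + shift) % n for x in first] for shift in range(n)]


for row in balanced_latin_square(4):
    print(row)
```

For n = 4 this yields the orderings [0, 1, 3, 2], [1, 2, 0, 3], [2, 3, 1, 0], and [3, 0, 2, 1]: each question occupies each position once, and each ordered pair of distinct questions is adjacent exactly once.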

Experiments

Main Results

To rigorously evaluate our approach, we selected a diverse suite of visual instruction tuning datasets that inherently feature multi-turn interactions across specialized domains. Our setup includes iNat-Plant for fine-grained botanical understanding, PathVQA for clinical pathology (which we custom-formatted into a standard instruction-response structure), CoralVQA for targeted marine analysis, and TaiwanVQA to test culturally specific localized recognition and nuanced reasoning. Additionally, we validated our method on MMDU, a dataset explicitly designed to push the limits of an LVLM's extended conversational depth and complex multi-image reasoning capabilities. We compare our method against the standard multiT visual instruction tuning baseline as well as a specialized singleT training setting. The evaluations are conducted under both singleT and multiT settings, as detailed in the table below.

Ablation on Dropout Rates

To control how we drop conversational history, we sample dropout rates from a Beta distribution. This lets us bias the model: we can aggressively drop context to force the model to focus on the image (similar to single-turn training), or preserve more history for long-context reasoning. Our ablation studies show that a symmetric distribution (where both alpha and beta are set to 2) gives the best balance and achieves the highest Context-Robust Accuracy. Conversely, if we heavily suppress the dropout rate, the model regresses to standard multi-turn training, which reintroduces overfitting and significantly lowers its context-robust scores.
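The effect of the Beta parameters on the sampled dropout rate can be illustrated with Python's standard library. Only the symmetric setting (alpha = beta = 2) comes from the text; the two skewed settings and the helper name `sample_drop_rate` are hypothetical choices made to show the two extremes.

```python
import random


def sample_drop_rate(alpha, beta, rng):
    """Draw one dropout rate in [0, 1] from Beta(alpha, beta)."""
    return rng.betavariate(alpha, beta)


rng = random.Random(0)
settings = {
    "symmetric (2, 2)": (2, 2),     # setting reported in the ablation
    "low dropout (1, 8)": (1, 8),   # hypothetical: mostly keeps history
    "high dropout (8, 1)": (8, 1),  # hypothetical: mostly drops history
}
means = {}
for name, (a, b) in settings.items():
    rates = [sample_drop_rate(a, b, rng) for _ in range(10000)]
    means[name] = sum(rates) / len(rates)
    print(f"{name}: mean rate = {means[name]:.2f}")
```

Because the mean of Beta(alpha, beta) is alpha / (alpha + beta), the symmetric setting centers the rate at 0.5 while still covering both low and high rates, whereas the skewed settings push training toward the multi-turn or single-turn extreme.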

Visual Contribution

A major bottleneck in current Large Vision-Language Models is that as conversations get longer, the model tends to ignore the image and rely almost entirely on the text history. To see if StochasT prevents this "visual attention degradation," we measured the model's Visual Contribution (VC), a metric that tests how much the model actually relies on the image to generate its answer. Our results show that StochasT achieves a higher Visual Contribution than both the standard single-turn and multi-turn paradigms. By dynamically disrupting the text history during training, StochasT forces the model to stay grounded in the visual evidence, ultimately reducing the risk of hallucinations.