Single-domain RL fine-tuning for large language models typically improves the target domain but often degrades others. This paper's core insight is that such interference is not necessarily a diffuse, full-model phenomenon: updates are sparse and small in magnitude, yet they act on shared active computation routes where a low-dimensional conflict subspace concentrates cross-domain harm. That localization makes selective, low-cost recovery possible.
Key Findings
- Sparse, route-focused updates. The authors show single-domain RL produces sparse, small-magnitude parameter edits with weak overlap among the top-changed neurons; different domains nonetheless share active computation routes. So what? Interference depends more on which computation routes are activated and how updates project onto those routes than on wholesale gradient alignment across the whole model.
- Local perturbation theory and the second-order damage term. Under a local perturbation model the dominant harm from later-domain training to an earlier domain arises from a second-order damage term that, given the observed sparsity, concentrates in a low-dimensional shared conflict subspace. So what? Theoretical framing explains why focused interventions (not full rollback) can be effective.
- Empirical, inexpensive recovery. A short domain-specific "Re-Math" refresh after the sequence Code → Math → QA → CW restored Math performance from 57.66 to 66.04 while largely preserving other domains, improving the overall average to 66.39. So what? Brief, targeted retraining can recover lost capability without expensive full-model re-training.
- Training-free, sparse rollback evidence. A training-free rollback applied on a sparse proxy conflict coordinate set for the Math–QA pair partially restored Math, providing direct proxy-level evidence that damage is localized and addressable without full retraining.
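To make the sparsity, overlap, and rollback findings concrete, here is a minimal NumPy sketch on a toy flat parameter vector. It is an illustration under stated assumptions, not the authors' procedure: the function names, the tolerance `tol`, and the choice to define the proxy conflict set as the intersection of each domain's top-k changed coordinates are all hypothetical, since the summary does not say how the paper selects its coordinates.

```python
import numpy as np


def update_sparsity(theta_before, theta_after, tol=1e-6):
    """Fraction of coordinates whose change is at most tol in magnitude."""
    delta = theta_after - theta_before
    return float(np.mean(np.abs(delta) <= tol))


def top_k_coords(delta, k):
    """Indices of the k largest-magnitude entries of an update vector."""
    order = np.argsort(-np.abs(delta))
    return set(order[:k].tolist())


def jaccard(a, b):
    """Overlap between two coordinate sets (0 = disjoint, 1 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sparse_rollback(theta_current, theta_reference, coords):
    """Training-free rollback: reset only `coords` to their reference values."""
    restored = theta_current.copy()
    idx = np.array(sorted(coords), dtype=int)
    restored[idx] = theta_reference[idx]
    return restored


# Toy example (all values are made up for illustration).
theta_base = np.zeros(10)

delta_math = np.zeros(10)
delta_math[[1, 2]] = [0.5, -0.3]   # Math training touches coordinates 1 and 2
theta_math = theta_base + delta_math

delta_qa = np.zeros(10)
delta_qa[[2, 7]] = [0.4, 0.2]      # later QA training touches coordinates 2 and 7
theta_qa = theta_math + delta_qa

# Assumed proxy: conflict coordinates are those in both domains' top-k sets.
math_top = top_k_coords(delta_math, k=2)
qa_top = top_k_coords(delta_qa, k=2)
conflict = math_top & qa_top

# Undo the QA update only where it collides with the Math update.
theta_repaired = sparse_rollback(theta_qa, theta_math, conflict)

print("QA update sparsity:", update_sparsity(theta_math, theta_qa))
print("top-k overlap (Jaccard):", jaccard(math_top, qa_top))
print("conflict coordinates:", sorted(conflict))
```

In this toy setup the rollback leaves the QA-only coordinate untouched and restores just the shared one, which mirrors the idea that repair can be confined to a small coordinate set. In a real model the vectors would be flattened weight tensors or per-neuron statistics, and the conflict set would come from whatever probe the practitioner trusts.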
Who It's For & Trade-offs
Great fit if you are an ML researcher or model maintainer investigating RLHF/multi-domain fine-tuning, want mechanistic explanations for cross-domain failure modes, or need low-cost repair strategies for deployed LLMs. Look elsewhere if your primary failure mode is dense, large-magnitude catastrophic forgetting across the entire parameter space, or if you cannot observe or identify neuron-level activations or active computation routes. The proposed methods rely on route-sparsity assumptions and probes that may be harder to apply to very different architectures or opaque production stacks.
Methodological Notes
The paper combines empirical probing (route/activity analysis and sparse coordinate interventions) with a theoretical local-perturbation model that formalizes damage as a second-order term concentrated on shared active routes. The combination of theory and targeted empirical interventions gives practical guidance: identify conflict coordinates or run a short domain refresh to contract harmful components in the conflict subspace before resorting to large-scale retraining.
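One way to read the second-order framing is through a generic Taylor expansion. The notation below is an assumption for illustration and may differ from the paper's own: $L_A$ is the earlier domain's objective, $\theta_A$ the parameters after training on it, $\Delta$ the later-domain update, and $S$ the small set of coordinates where $\Delta$ is non-negligible.

```latex
\[
L_A(\theta_A + \Delta) \;\approx\; L_A(\theta_A)
  + g_A^{\top}\Delta
  + \tfrac{1}{2}\,\Delta^{\top} H_A\,\Delta,
\qquad
g_A = \nabla L_A(\theta_A),\quad H_A = \nabla^{2} L_A(\theta_A).
\]
% If \theta_A is close to a stationary point of L_A, then g_A \approx 0
% and the quadratic term dominates the damage. With \Delta supported on S:
\[
L_A(\theta_A + \Delta) - L_A(\theta_A)
  \;\approx\; \tfrac{1}{2}\,\Delta_S^{\top}\,(H_A)_{SS}\,\Delta_S .
\]
```

Under these assumptions the damage depends only on the block of curvature restricted to $S$, which is consistent with the summary's claim that shrinking or reverting the update on a small conflict set can recover much of the lost performance.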
