Most skill-based agentic RL methods treat every skill the same way: they either keep skills external, which costs context and invocation overhead, or fully internalize them, which risks overfitting and knowledge conflicts. The core insight behind Skill0.5 is that general cognitive skills and task-specific execution skills should be handled differently: internalize the former to build a stable foundation, and invoke the latter explicitly to prevent shortcut behavior. This split, routed dynamically by task difficulty, improves both in-distribution competence and out-of-distribution (OOD) generalization.
Key Findings
- Difficulty-aware routing: Tasks are streamed into mastery tiers so the agent applies different training objectives depending on difficulty; this reduces wasted context and focuses learning where it matters. (So what: fewer context tokens for common tasks, targeted learning for hard tasks.)
- Privileged distillation for general skills: For hard tasks, general skills are internalized via distillation from privileged/expert signals to form a reusable cognitive base. (So what: better transfer when encountering novel or harder situations.)
- Diagnostic probing and utilization for easy tasks: Easy tasks are used to probe and enforce explicit skill invocation, penalizing shortcut behaviors and ensuring task-specific skills are actually utilized. (So what: reduces brittle shortcuts that fail OOD.)
- Empirical validation: Evaluations on ALFWorld and WebShop show consistent improvements over memory-based and prior skill-based RL baselines across in-distribution and out-of-distribution splits, indicating stronger generalization rather than mere memorization.
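The two training signals in the findings above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes privileged distillation is a KL divergence from a skill-conditioned teacher distribution to a skill-free student distribution, and that diagnostic probing subtracts a fixed penalty for each required skill the agent never invoked. The names `distillation_loss`, `probed_reward`, and `penalty` are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of action logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def distillation_loss(teacher_logits, student_logits):
    """KL(teacher || student) for one action distribution.

    Assumption: the teacher sees privileged skill text, the student does not.
    """
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def probed_reward(task_reward, invoked_skills, required_skills, penalty=0.5):
    """Subtract a penalty for each required skill that was never invoked.

    Assumption: shortcut behavior shows up as a missing skill invocation.
    """
    missing = set(required_skills) - set(invoked_skills)
    return task_reward - penalty * len(missing)
```

Under these assumptions, an agent that solves an easy task while skipping one of two required skills receives `probed_reward(1.0, {"search"}, {"search", "filter"})`, which is `0.5` rather than the full task reward.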
Who it's for & trade-offs
Great fit if you research or build autonomous agents that must generalize beyond their training distribution and you can separate skills into general and task-specific categories. The approach is particularly relevant for environments where invoking external skill libraries is costly or unreliable. Look elsewhere if your application (1) has no clear skill decomposition, (2) cannot provide the privileged/expert signals needed for distillation, or (3) demands minimal architectural change and prefers purely memory-based skill invocation: Skill0.5 introduces routing and per-tier training objectives that add design complexity.
Method highlights
The framework centers on a dynamic, difficulty-aware router that assigns tasks to tiers. For hard-tier tasks, the framework applies privileged distillation to internalize broad skills; for easy-tier tasks, it runs diagnostic probes that penalize shortcut policies and enforce explicit skill use. This hybrid (hence “0.5”) aims to balance context efficiency and robustness, reducing both context overhead and overfitting-related failures in OOD scenarios.
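The routing step can be sketched as follows. This is an illustration under stated assumptions, not the paper's router: difficulty is approximated here by a rolling per-task success rate, and the class `DifficultyRouter`, the `window` size, and the `mastery_threshold` value are all hypothetical.

```python
from collections import defaultdict, deque
from enum import Enum


class Tier(Enum):
    EASY = "easy"  # routed to diagnostic probing of task-specific skills
    HARD = "hard"  # routed to privileged distillation of general skills


class DifficultyRouter:
    """Hypothetical router that tiers a task by its rolling success rate."""

    def __init__(self, window=8, mastery_threshold=0.7):
        self.mastery_threshold = mastery_threshold
        # Keep only the most recent `window` outcomes per task.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def record(self, task_id, success):
        self.history[task_id].append(1.0 if success else 0.0)

    def success_rate(self, task_id):
        outcomes = self.history[task_id]
        # Unseen tasks count as unmastered, so they start in the hard tier.
        return sum(outcomes) / len(outcomes) if outcomes else 0.0

    def route(self, task_id):
        if self.success_rate(task_id) >= self.mastery_threshold:
            return Tier.EASY
        return Tier.HARD


router = DifficultyRouter(window=4, mastery_threshold=0.75)
for outcome in (True, True, True, False):
    router.record("pick_and_place", outcome)
print(router.route("pick_and_place"))  # Tier.EASY (3 of 4 recent successes)
```

Because the window is bounded, a task can move back to the hard tier if the agent's recent success rate drops, which is one simple way to make the routing dynamic.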
