Frontier models are often gated or costly; synthetic distillation datasets let practitioners transfer characteristic reasoning patterns into open or smaller weights without direct access to the original system. This dataset bundles 25k chat-style SFT examples with rich metadata to help instruction-tuning workflows approximate Mythos-style capabilities in focused areas (security, coding, formal reasoning, and agentic planning).
What Sets It Apart
- Distilled voice and structure: examples are authored to reflect an autonomous, multi-step "Mythos-style" decomposition (numbered reasoning, risk matrices, detection heuristics) so fine-tuned models learn structured chain-of-thought-like outputs without exposing proprietary model outputs. This is useful when you need consistency in reply structure for downstream evaluation.
- Balanced, high-signal categories: explicit splits (cybersecurity ~7k, advanced coding ~5.5k, reasoning ~3k, agentic planning ~3.5k, scientific analysis ~2.5k, general expert QA ~3.5k) allow curriculum or targeted upsampling during SFT/TRL training.
- Trainer-friendly format and metadata: chat
messagesplus category, id, source, timestamp enable selective sampling, loss-masking on assistant tokens, and integration with TRL/Axolotl/standard Hugging Face trainers. - License and reproducibility: Apache-2.0 license and included generator script support commercial use and reproducible extensions.
Who It's For and Tradeoffs
Great fit if you want to bootstrap instruction-tuned models (Llama-family, Qwen, Mistral, Gemma, etc.) toward structured, multi-step expert answers in security, coding, or agentic workflows and need a compact, metadata-rich synthetic curriculum. Look elsewhere if you require genuine proprietary model outputs, human-preference-aligned labels, or real-world exploit code datasets — this corpus is synthetic and framed defensive-only. Also plan for human preference tuning and careful safety review: the cyber content is defensively oriented but requires governance and evaluation to avoid unintended dual-use behavior.
Where It Fits
Use this dataset as a middle-ground between small open instruction datasets (fast to train but shallow) and inaccessible gated frontier traces (deep but unavailable). It accelerates capability transfer when paired with a smaller amounts of human preference data or real-world corpora for calibration.
