Why this matters
Language models are often aligned to refuse certain prompts, which can reduce their utility for legitimate research and analysis. Heretic approaches the problem from the model-internals side: instead of post-training fine-tuning or dataset-based unalignment, it automatically identifies the internal directions associated with refusals and ablates them while minimizing divergence from the original model. Balancing those two goals, suppressing refusals while preserving capability, is the project's core idea.
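To make "identify and ablate a direction" concrete, here is a minimal NumPy sketch of plain directional ablation on toy data. It is an illustration under stated assumptions, not Heretic's implementation: the difference-of-means estimate, the function names `refusal_direction` and `ablate_direction`, and the `strength` parameter are all hypothetical choices made for this example.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations (an assumed estimator)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate_direction(weight, direction, strength=1.0):
    """Remove the component along `direction` from what `weight` can output.

    `weight` has shape (d_model, d_in) and writes into the activation space.
    strength=1.0 is a full projection; smaller values ablate only partially.
    """
    projector = np.outer(direction, direction)
    return weight - strength * (projector @ weight)

# Toy activations: "harmful" prompts are shifted along axis 0 (synthetic data).
rng = np.random.default_rng(0)
d_model, d_in = 8, 4
shift = np.zeros(d_model)
shift[0] = 3.0
harmless = rng.normal(size=(64, d_model))
harmful = rng.normal(size=(64, d_model)) + shift

r = refusal_direction(harmful, harmless)
W = rng.normal(size=(d_model, d_in))
W_ablated = ablate_direction(W, r)

# After full ablation the matrix has no output component along r.
print("max |r . W_ablated| =", np.abs(r @ W_ablated).max())
```

The `strength` knob hints at why a parameter search is useful: how much to ablate, and where, are free choices that trade refusal suppression against damage to the model.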
What sets it apart
- Automatic, optimization-driven ablation: Heretic pairs a parametrized implementation of directional ablation with a Tree-structured Parzen Estimator (TPE) optimizer, provided by Optuna. So what: users do not need to hand-tune ablation parameters; the tool jointly optimizes for fewer refusals and lower KL divergence from the base model.
- Low damage to model capability: The built-in objective explicitly minimizes KL divergence on harmless prompts, producing decensored checkpoints that (in the author's benchmarks) match human-built abliterations in refusal reduction while showing lower KL divergence. So what: you get fewer refusals without severely degrading model utility for unrelated tasks.
- Research-oriented diagnostics: Optional research extras generate per-layer residual-vector plots (PaCMAP projections), residual geometry metrics, and animated layer transition visualizations. So what: researchers can inspect how 'refusal' signals evolve through layers rather than treating decensoring as a black box.
- Broad model support with pragmatic constraints: Heretic supports most dense transformer models and several MoE variants, offers quantization (bitsandbytes) to lower VRAM cost, and exposes evaluation workflows to compare original vs. decensored checkpoints. So what: it's practical for experimentation on a wide range of community models, but not every exotic architecture is supported.
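The KL-divergence objective mentioned in the list above can be illustrated with a short NumPy sketch. It compares next-token distributions from an "original" and a "modified" model on the same prompts, using synthetic logits. The direction of the divergence (original relative to modified), the averaging over prompts, and the helper names are assumptions made for this example rather than details confirmed by the project.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_kl(original_logits, modified_logits):
    """Mean KL(original || modified) across prompts, in nats."""
    log_p = log_softmax(original_logits)
    log_q = log_softmax(modified_logits)
    kl_per_prompt = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl_per_prompt.mean())

# Synthetic logits: 16 "harmless prompts", vocabulary of 100 tokens.
rng = np.random.default_rng(0)
orig = rng.normal(size=(16, 100))
light = orig + 0.05 * rng.normal(size=orig.shape)  # gentle modification
heavy = orig + 1.00 * rng.normal(size=orig.shape)  # aggressive modification

kl_same = mean_kl(orig, orig)
kl_light = mean_kl(orig, light)
kl_heavy = mean_kl(orig, heavy)
print(f"same={kl_same:.6f} light={kl_light:.6f} heavy={kl_heavy:.6f}")
```

A lower value on harmless prompts means the modified model's predictions stay closer to the original's, which is the sense in which a decensored checkpoint does "low damage".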
Who it’s for (and tradeoffs)
Great fit if you are a researcher or practitioner who wants to study or remove refusal behavior from transformer models without manual, per-model ablation engineering. It is useful when you can run PyTorch-based workloads (GPU recommended) and want both automated parameter search and quantitative/human evaluation of decensored checkpoints.
Look elsewhere if you need a GUI-first product, a hosted inference service for live production use, or support for architectures outside the standard transformer family, such as SSMs or hybrid-attention models. Heretic is a command-line, research-forward tool that assumes familiarity with running local model workloads and evaluating model outputs.
Where it fits
Heretic sits between interpretability toolkits and model surgery utilities: it is not a new model or a training framework but a model-modification and evaluation system that automates a specific interpretability-inspired intervention (directional ablation) and frames it as a constrained optimization problem. Compared with manual abliterations, it prioritizes repeatability, built-in metrics, and lower KL damage.
Practical notes (brief)
- Requires Python 3.10+, PyTorch (2.x series recommended), and a GPU for practical runtimes; quantized flows are available to reduce VRAM use. The CLI-centric workflow offers options to save or upload decensored checkpoints and to run comparisons.
- Not a turnkey hosted decensoring service: it runs locally or in your own compute environment and is best suited to those comfortable running model inference/IO workflows and interpreting evaluation metrics.
Overall, Heretic is useful when you want an automated, evaluable approach to reducing refusal behavior in transformer models while keeping a close eye on capability preservation.
