
Heretic

Automatically removes censorship from transformer language models using parameterized directional ablation plus a TPE optimizer to auto-search ablation parameters. Command-line tool with built-in evaluation and visualization; supports many dense models and some MoE architectures.

Introduction

Why this matters

Language models are often aligned to refuse certain prompts, which can reduce their utility for legitimate research and analysis. Heretic approaches the problem from the model-internals side: instead of post-training fine-tuning or dataset-based unalignment, it automatically identifies and ablates the directions responsible for refusals while minimizing divergence from the original model. That trade-off, suppressing refusals while preserving capability, is the project's core insight.
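To make "ablating a direction" concrete, here is a minimal sketch in plain Python. It is not Heretic's code: the function names (`mean_difference`, `ablate_direction`), the `weight` parameter, and the toy vectors are all assumptions for illustration. The difference-of-means step is a common way to estimate a refusal direction in the abliteration literature, not something this page confirms about Heretic. A real implementation would work on model tensors rather than Python lists.

```python
import math

def mean_difference(harmful, harmless):
    """Estimate a candidate direction as mean(harmful) - mean(harmless).

    Assumption: each argument is a list of equal-length activation vectors.
    """
    dim = len(harmful[0])
    mean_bad = [sum(v[i] for v in harmful) / len(harmful) for i in range(dim)]
    mean_ok = [sum(v[i] for v in harmless) / len(harmless) for i in range(dim)]
    return [b - o for b, o in zip(mean_bad, mean_ok)]

def ablate_direction(hidden, direction, weight=1.0):
    """Subtract `weight` times the component of `hidden` along `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    # Scalar projection of the hidden state onto the unit direction.
    projection = sum(h * u for h, u in zip(hidden, unit))
    return [h - weight * projection * u for h, u in zip(hidden, unit)]

# Toy activations (hypothetical values, three dimensions).
harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
refusal_direction = mean_difference(harmful, harmless)

hidden = [3.0, 4.0, 0.0]
fully_ablated = ablate_direction(hidden, refusal_direction)
half_ablated = ablate_direction(hidden, refusal_direction, weight=0.5)
```

With `weight=1.0` the component along the direction is removed entirely. Smaller weights remove only part of it, which is the kind of knob an automated parameter search could tune.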

What sets it apart
  • Automatic, optimization-driven ablation: Heretic pairs a parameterized directional ablation implementation with a Tree-structured Parzen Estimator (TPE) optimizer (via Optuna). So what: users do not need to design ablation parameters manually; the tool jointly optimizes for fewer refusals and lower KL divergence from the base model.
  • Low damage to model capability: The built-in objective explicitly minimizes KL divergence on harmless prompts, producing decensored checkpoints that (in the author's benchmarks) match human-built abliterations in refusal reduction while showing lower KL divergence. So what: you get fewer refusals without severely degrading model utility for unrelated tasks.
  • Research-oriented diagnostics: Optional research extras generate per-layer residual-vector plots (PaCMAP projections), residual geometry metrics, and animated layer transition visualizations. So what: researchers can inspect how 'refusal' signals evolve through layers rather than treating decensoring as a black box.
  • Broad model support with pragmatic constraints: Heretic supports most dense transformer models and several MoE variants, offers quantization (bitsandbytes) to lower VRAM cost, and exposes evaluation workflows to compare original vs. decensored checkpoints. So what: it's practical for experimentation on a wide range of community models, but not every exotic architecture is supported.
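The two goals named in this list, fewer refusals and low KL divergence from the base model, can be illustrated with a small sketch. Everything here is hypothetical: the candidate names, refusal counts, and token distributions are made up, the KL direction (base relative to modified) is an assumption, and the hand-written candidate list stands in for the TPE sampler, which is not reproduced. The sketch only shows how two competing objectives can be scored and compared.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def pareto_front(trials):
    """Keep trials that no other trial beats on both refusals and KL."""
    front = []
    for t in trials:
        dominated = any(
            o["refusals"] <= t["refusals"]
            and o["kl"] <= t["kl"]
            and (o["refusals"] < t["refusals"] or o["kl"] < t["kl"])
            for o in trials
        )
        if not dominated:
            front.append(t)
    return front

# Hypothetical next-token distribution of the base model on a harmless prompt.
base = [0.7, 0.2, 0.1]

# Hypothetical candidates: (modified model's distribution, refusals counted).
candidates = {
    "light": ([0.68, 0.21, 0.11], 40),
    "medium": ([0.6, 0.25, 0.15], 5),
    "heavy": ([0.3, 0.3, 0.4], 5),
}

trials = [
    {"name": name, "refusals": refusals, "kl": kl_divergence(base, probs)}
    for name, (probs, refusals) in candidates.items()
]
front = pareto_front(trials)
front_names = [t["name"] for t in front]
```

In this toy data, "heavy" suppresses refusals no better than "medium" while drifting much further from the base distribution, so it is dropped from the front. "light" and "medium" remain as different trade-offs between the two objectives.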

Who it’s for: tradeoffs

Great fit if you are a researcher or practitioner who wants to study or remove refusal behavior from transformer models without manual, per-model ablation engineering. It is useful when you can run PyTorch-based workloads (GPU recommended) and want both automated parameter search and quantitative/human evaluation of decensored checkpoints.

Look elsewhere if you need a GUI-first product, a hosted inference service for live production use, or support for non-transformer or hybrid architectures (SSMs, hybrid attention types). Heretic is a command-line, research-forward tool that assumes familiarity with running local model workloads and evaluating model outputs.

Where it fits

Heretic sits between interpretability toolkits and model surgery utilities: it is not a new model or a training framework but a model-modification and evaluation system that automates a specific interpretability-inspired intervention (directional ablation) and frames it as a constrained optimization problem. Compared with manual abliterations, it prioritizes repeatability, built-in metrics, and lower KL damage.

Practical notes (brief)
  • Requires Python 3.10+, PyTorch (2.x series recommended), and a GPU for practical runtimes; quantized flows are available to reduce VRAM use. The CLI-centric workflow offers options to save or upload decensored checkpoints and to run comparisons.
  • Not a turnkey hosted decensoring service: it runs locally or in your own compute environment and is best used by those comfortable running model inference and I/O workflows and interpreting evaluation metrics.

Overall, Heretic is useful when you want an automated, evaluable approach to reducing refusal behavior in transformer models while keeping a close eye on capability preservation.

Information

  • Website: github.com
  • Authors: p-e-w
  • Published date: 2025/09/21
