Most molecular benchmarks measure property prediction. MolDeTox instead tests minimal structural edits that flip a toxicity label. By framing paired molecules as "toxicity cliffs," the dataset forces language and vision-language models to reason about which substructure changes are plausibly responsible for toxicity, and to propose minimal, property-preserving edits.
What Sets It Apart
- Focus on toxicity cliffs: pairs are structurally similar but oppositely labeled, so success requires pinpointing small, targeted edits rather than global optimization. This tests a model's causal and counterfactual reasoning about chemistry, not just pattern matching.
- Multi-format evaluation: the dataset provides both tabular CSV toxicity-cliff splits and QA-formatted configs (single/multi prompts, SMILES and safe-text generation), enabling standardized evaluation across generative LLMs and VLMs.
- Downstream safety angle: tasks include identifying toxic fragments, proposing non-toxic alternatives, and generating detoxified molecules while preserving physicochemical constraints. This makes the benchmark directly relevant to model safety and medicinal chemistry workflows.
- Built from public toxicity sources: curated from established toxicity repositories and restructured to emphasize minimal edits and benchmark detoxification capability.
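To make the "structurally similar but oppositely labeled" idea concrete, here is a minimal sketch of how cliff pairs could be picked out of a labeled set. It is not MolDeTox's actual construction procedure: the function names, the 0.5 threshold, and the toy labels are all assumptions, and character bigrams of the SMILES string are only a crude stand-in for a real chemical fingerprint.

```python
from itertools import combinations


def bigrams(smiles: str) -> set[str]:
    """Character bigrams of a SMILES string (crude stand-in for a fingerprint)."""
    return {smiles[i:i + 2] for i in range(len(smiles) - 1)}


def tanimoto(a: set[str], b: set[str]) -> float:
    """Tanimoto (Jaccard) similarity between two feature sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_cliff_pairs(records, threshold=0.5):
    """Return (smiles_a, smiles_b, similarity) for similar, oppositely labeled pairs.

    records: iterable of (smiles, is_toxic) tuples.
    threshold: assumed similarity cutoff, chosen only for this illustration.
    """
    pairs = []
    for (s1, y1), (s2, y2) in combinations(records, 2):
        if y1 == y2:
            continue  # a cliff needs opposite labels
        sim = tanimoto(bigrams(s1), bigrams(s2))
        if sim >= threshold:
            pairs.append((s1, s2, sim))
    return pairs


# Toy records with invented labels, purely for illustration (not from MolDeTox).
toy = [
    ("c1ccccc1N", True),
    ("c1ccccc1O", False),
    ("CCCCCCCC", False),
]
cliffs = find_cliff_pairs(toy)
```

In practice the similarity function would be swapped for a proper fingerprint-based measure from a cheminformatics toolkit; the pairing logic stays the same.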
Who It's For (and Tradeoffs)
Great fit if you want to evaluate or fine-tune LLMs/VLMs on chemistry-aware, safety-sensitive generation tasks. Teams working on AI-driven drug design, safety evaluation, or model alignment will find it most useful. Look elsewhere if you need large-scale property-prediction benchmarks (e.g., broad ADMET prediction) or raw experimental assay data: MolDeTox emphasizes paired-edit reasoning and QA-style evaluation rather than exhaustive assay-level coverage.
Where It Fits
Use MolDeTox alongside conventional property datasets to measure a different axis of capability: instead of asking "can the model predict a label?", it asks "can the model suggest minimal, chemically plausible edits that change that label while keeping key properties?" That makes it complementary to regression/classification benchmarks.
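The question above breaks down into three checks: did the label flip, did the molecule stay close to the original, and did key properties stay within tolerance. The sketch below shows one way to combine them. Everything here is hypothetical: `score_edit`, `EditResult`, the thresholds, and the toy predictor and property functions are illustrative assumptions rather than the benchmark's official metric, and string similarity stands in for a structural similarity measure.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class EditResult:
    label_flipped: bool
    similarity: float
    max_property_drift: float
    success: bool


def score_edit(source, proposal, predict_toxic, properties,
               min_similarity=0.5, max_drift=0.2):
    """Score one proposed detoxifying edit.

    predict_toxic: callable SMILES -> bool (assumed external toxicity model).
    properties: callable SMILES -> dict of property name -> float (assumed).
    """
    # The edit only counts if a toxic source becomes a non-toxic proposal.
    flipped = bool(predict_toxic(source)) and not bool(predict_toxic(proposal))
    # String similarity as a stand-in for structural similarity.
    similarity = SequenceMatcher(None, source, proposal).ratio()
    before, after = properties(source), properties(proposal)
    # Largest relative change across the tracked properties.
    drift = max(
        (abs(after[k] - before[k]) / max(abs(before[k]), 1e-9) for k in before),
        default=0.0,
    )
    success = flipped and similarity >= min_similarity and drift <= max_drift
    return EditResult(flipped, similarity, drift, success)


# Toy stand-ins, invented for illustration only.
def toy_predict_toxic(smiles):
    return "Br" in smiles  # not a real toxicity rule


def toy_properties(smiles):
    # Uppercase-letter count as a rough proxy for atom count.
    return {"approx_atoms": float(sum(ch.isupper() for ch in smiles))}


result = score_edit("CCCBr", "CCCCl", toy_predict_toxic, toy_properties)
unchanged = score_edit("CCCBr", "CCCBr", toy_predict_toxic, toy_properties)
```

Here `result.success` is true because the toy label flips while similarity and property drift stay within the assumed limits, whereas `unchanged` fails because the label does not flip. A label-prediction benchmark would only exercise the `predict_toxic` part of this check.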
Practical Notes on Use
Expect configurations for several tasks (toxicity cliff CSV splits plus multiple QA configs covering single/multi items and SMILES vs. safe-text generation). The dataset size and split structure are suitable for benchmarking and short fine-tuning experiments, and its QA formats make it easy to evaluate generative outputs with both automated metrics and human assessment.
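Because the tasks are spread across several configs, automated results are most useful when reported per config rather than pooled. A minimal sketch of that aggregation follows; the config names and outcomes are made up for illustration and do not correspond to MolDeTox's actual config identifiers.

```python
from collections import defaultdict


def pass_rate_by_config(results):
    """Map each config name to the fraction of items that passed.

    results: iterable of (config_name, passed) tuples, where `passed` is the
    outcome of whatever automated check is applied to a model output.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for config, passed in results:
        totals[config] += 1
        passes[config] += int(passed)
    return {config: passes[config] / totals[config] for config in totals}


# Hypothetical config names and outcomes, for illustration only.
toy_results = [
    ("qa_single_smiles", True),
    ("qa_single_smiles", False),
    ("qa_multi_smiles", True),
]
rates = pass_rate_by_config(toy_results)
```

Items that fail the automated check are natural candidates to route to human assessment.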
