AIAny - MolDeTox

Most molecular benchmarks measure property prediction; MolDeTox instead stresses minimal-structure interventions that flip toxicity labels. By framing paired molecules as "toxicity cliffs," the dataset forces language and vision-language models to reason about which substructure changes are plausibly responsible for toxicity and propose minimal, property-preserving edits.

What Sets It Apart

Focus on toxicity cliffs: pairs are structurally similar but oppositely labeled, so success requires pinpointing small, targeted edits rather than global optimization. This tests a model's causal or counterfactual reasoning about chemistry, not just pattern matching.
Multi-format evaluation: provides both tabular CSV toxicity-cliff splits and QA-formatted configs (single/multi prompts, SMILES and safe-text generation), enabling standardized evaluation across generative LLMs and VLMs.
Downstream safety angle: tasks include identifying toxic fragments, proposing non-toxic alternatives, and generating detoxified molecules while preserving physicochemical constraints, which makes the dataset directly relevant to model safety and medicinal chemistry workflows.
Built from public toxicity sources: curated from established toxicity repositories and restructured to emphasize minimal edits and benchmark detoxification capability.
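
To make the "toxicity cliff" idea concrete, here is a minimal sketch of how such pairs could be picked out of a labeled molecule list: keep pairs that are highly similar but carry opposite labels. It is an illustration only, not the MolDeTox construction pipeline. The similarity function is a crude stand-in (Jaccard overlap of SMILES character bigrams); real work would more likely use fingerprint-based Tanimoto similarity from a cheminformatics toolkit. The threshold, function names, and toy labels are all assumptions.

```python
from itertools import combinations

def bigrams(smiles):
    """Character bigrams of a SMILES string (a crude structural proxy)."""
    return {smiles[i:i + 2] for i in range(len(smiles) - 1)}

def similarity(a, b):
    """Jaccard overlap of the two bigram sets, in [0, 1]."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)

def find_cliffs(molecules, threshold=0.6):
    """Return (smiles_a, smiles_b, similarity) for similar, oppositely labeled pairs."""
    cliffs = []
    for (sa, la), (sb, lb) in combinations(molecules, 2):
        if la != lb:  # only oppositely labeled pairs can form a cliff
            s = similarity(sa, sb)
            if s >= threshold:
                cliffs.append((sa, sb, s))
    return cliffs

# Toy records: (SMILES, toxic?) with made-up labels, for illustration only.
toy = [
    ("CCOc1ccccc1N", True),
    ("CCOc1ccccc1O", False),
    ("CCCCCCCC", False),
]
cliffs = find_cliffs(toy)
print(cliffs)
```

In this toy run only the first two molecules form a cliff: they differ by a single atom yet carry opposite (made-up) labels, while the third is too dissimilar to pair with the toxic one.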

Who It's For (and Tradeoffs)

MolDeTox is a good fit if you want to evaluate or fine-tune LLMs/VLMs on chemistry-aware, safety-sensitive generation tasks; teams working on AI-driven drug design, safety evaluation, or model alignment will find it most useful. Look elsewhere if you need large-scale property-prediction benchmarks (e.g., broad ADMET prediction) or raw experimental assay data: MolDeTox emphasizes paired-edit reasoning and QA-style evaluation rather than exhaustive assay-level coverage.

Where It Fits

Use MolDeTox alongside conventional property datasets to measure a different axis of capability: instead of asking "can the model predict a label?", it asks "can the model suggest minimal, chemically plausible edits that change that label while keeping key properties?" That makes it complementary to regression/classification benchmarks.
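
The question above bundles three checks: the label flips, the edit is small, and key properties stay close. A minimal sketch of scoring one proposed edit along those lines follows. It is not the benchmark's official metric; the property names, the relative tolerance, the similarity threshold, and the `score_edit` function are all hypothetical, and the toxicity label of the edited molecule is assumed to come from some external predictor or assay.

```python
def score_edit(source_props, edited_props, source_toxic, edited_toxic,
               similarity, min_similarity=0.6, tolerance=0.2):
    """Check whether a proposed edit detoxifies with a small, property-preserving change."""
    # 1. The label must flip from toxic to non-toxic.
    flipped = source_toxic and not edited_toxic
    # 2. The edit must be minimal: the edited molecule stays similar to the source.
    minimal = similarity >= min_similarity
    # 3. Each tracked property may drift by at most `tolerance` (relative change).
    preserved = all(
        abs(edited_props[name] - value) / max(abs(value), 1e-9) <= tolerance
        for name, value in source_props.items()
    )
    return {
        "flipped": flipped,
        "minimal": minimal,
        "preserved": preserved,
        "success": flipped and minimal and preserved,
    }

# Hypothetical property values for a source molecule and its proposed edit.
source = {"logp": 2.0, "mol_weight": 150.0}
edited = {"logp": 2.2, "mol_weight": 151.0}

good = score_edit(source, edited, source_toxic=True, edited_toxic=False, similarity=0.75)
bad = score_edit(source, edited, source_toxic=True, edited_toxic=True, similarity=0.75)
print(good["success"], bad["success"])
```

Reporting the three checks separately, rather than only the combined flag, makes it easier to see whether a model fails by not detoxifying, by over-editing, or by breaking property constraints.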

Practical Notes on Use

Expect configurations for several tasks (toxicity cliff CSV splits plus multiple QA configs covering single/multi items and SMILES vs. safe-text generation). The dataset size and split structure are suitable for benchmarking and short fine-tuning experiments, and its QA formats make it easy to evaluate generative outputs with both automated metrics and human assessment.
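
As a rough illustration of how a tabular toxicity-cliff split relates to a QA-style config, the sketch below turns paired CSV rows into question/reference items. The column names (`toxic_smiles`, `nontoxic_smiles`), the prompt wording, and the helper names are assumptions made for this example; the actual MolDeTox schema and prompts may differ.

```python
import csv
import io

# Hypothetical CSV layout; the real MolDeTox column names may differ.
SAMPLE_CSV = """toxic_smiles,nontoxic_smiles
CCOc1ccccc1N,CCOc1ccccc1O
"""

def build_prompt(toxic_smiles):
    """Format a single-molecule detoxification question (illustrative wording)."""
    return (
        "The following molecule is labeled toxic.\n"
        f"SMILES: {toxic_smiles}\n"
        "Propose a minimally edited, non-toxic analogue as a SMILES string."
    )

def to_qa_items(csv_text):
    """Convert paired rows into question/reference dictionaries."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {
            "question": build_prompt(row["toxic_smiles"]),
            "reference": row["nontoxic_smiles"],
        }
        for row in reader
    ]

items = to_qa_items(SAMPLE_CSV)
print(items[0]["question"])
```

The reference molecule here is only one acceptable answer; a generated molecule that differs from it could still be a valid detoxifying edit, which is why automated matching is usually paired with property checks or human review.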
