This release combines two complementary datasets produced during Anthropic's RLHF and red-teaming efforts: (1) preference (chosen/rejected) pairs designed to train reward or preference models, and (2) full red-team conversation transcripts annotated for harmfulness. Together, these records show both what annotators favor and which adversarial prompts succeed. That makes them useful for building and evaluating reward models and harm-detection systems, but explicitly unsuitable as supervised dialogue fine-tuning data.
Key Contents
- Preference (PM) data: JSONL files of paired responses ("chosen" vs "rejected") collected across model checkpoints and collection tranches for helpfulness and harmlessness modeling. This is the primary substrate for reward-model training.
- Red-team transcripts: full conversation transcripts produced by crowdworkers attempting to break models, annotated with a red-team success rating, harmlessness scores, tags, and metadata (model type, num_params, worker source). Useful for empirical analysis of attack types and model vulnerabilities.
- Metadata & constraints: dataset size falls in the 100K–1M range, contains potentially upsetting content (violence, discrimination, self-harm), and is released under an MIT-style license on the Hugging Face hub.
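As a rough illustration of the preference format described above, the sketch below reads JSONL records and yields (chosen, rejected) text pairs. It assumes each line is a JSON object with string fields named "chosen" and "rejected", as the description suggests. The helper name `read_preference_pairs` and the two-record sample are hypothetical and stand in for a real file.

```python
import io
import json


def read_preference_pairs(lines):
    """Yield (chosen, rejected) tuples from an iterable of JSONL lines.

    Assumes every non-blank line is a JSON object with "chosen" and
    "rejected" keys; raises ValueError if either key is missing.
    """
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        try:
            yield record["chosen"], record["rejected"]
        except KeyError as missing:
            raise ValueError(
                f"line {line_number}: missing field {missing}"
            ) from None


# Hypothetical sample standing in for a real JSONL file.
sample = io.StringIO(
    '{"chosen": "Human: Hi\\n\\nAssistant: Hello!", '
    '"rejected": "Human: Hi\\n\\nAssistant: Go away."}\n'
    "\n"
    '{"chosen": "A", "rejected": "B"}\n'
)
pairs = list(read_preference_pairs(sample))
```

With a real file, the same function can be given an open file handle instead of the in-memory sample, since it only iterates over lines.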
What Sets It Apart
- Collected specifically for RLHF pipelines: the preference pairs are structured for reward-model objectives, so they support RLHF workflows directly rather than supervised conversational training.
- Red-team examples are human-generated, rated, and accompanied by model-level metadata, enabling research into which attacks scale with model size and which mitigation patterns matter.
- Clear safety guidance from the maintainers: parts of the dataset (especially harmlessness and red-team subsets) are flagged as inappropriate for naive fine-tuning into dialogue agents because that practice can amplify harmful behaviors.
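To make "structured for reward-model objectives" concrete, here is a minimal sketch of a pairwise preference loss of the kind commonly used with chosen/rejected pairs: the negative log-sigmoid of the score margin (a Bradley-Terry-style objective). This is an assumption about a typical setup, not a statement of the exact objective used by the dataset's authors. The function name and the scalar scores are hypothetical; in practice the scores would come from a reward model.

```python
import math


def pairwise_preference_loss(score_chosen, score_rejected):
    """Return -log(sigmoid(score_chosen - score_rejected)).

    The loss is small when the chosen response outscores the rejected
    one and grows roughly linearly when the ordering is reversed.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branching on the sign
    # keeps the argument of exp() non-positive and avoids overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scores: equal scores give log(2); a correct ordering
# gives a lower loss than the reversed ordering.
loss_tied = pairwise_preference_loss(0.0, 0.0)
loss_correct = pairwise_preference_loss(2.0, 0.0)
loss_reversed = pairwise_preference_loss(0.0, 2.0)
```

Averaging this quantity over a batch of pairs gives a training signal that depends only on which response was preferred, which is why the data suits reward modeling rather than direct imitation of either response.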
Who It's For and Trade-offs
Great fit if you want to train or evaluate reward/preference models, analyze human adversarial strategies, or study harm detection and mitigation empirically. Look elsewhere if you need clean, conversational supervised training data for general-purpose chatbots: this dataset contains adversarial and sensitive content and was not intended for that use. Also plan for annotation noise and domain bias inherited from the crowdworker procedures described in the accompanying papers.
