Most benchmarks for code-centric reasoning rely on synthetic problems or curated snippets. This dataset instead draws on real-world GitHub issues and verifies solutions with unit tests, so an evaluation measures whether a model's patch actually makes the repository's tests pass rather than whether it merely produces plausible code. The Verified split contains 500 Issue→PR pairs sampled from popular Python repositories and manually validated for quality, providing a compact, high-trust evaluation set.
What Sets It Apart
- Human-verified ground truth: each example in the 500-sample test split was manually checked for correctness, reducing label noise common in large mined corpora. This improves confidence in reported pass/fail metrics.
- Unit-test verification as the metric: evaluation runs the repository's tests against the model's patch, with the post-PR behavior serving as the reference. A passing result therefore implies that the proposed change genuinely resolves the issue in context.
- Minimal input surface for inference: the dataset provides the problem_statement and the base_commit (the repository state before the fix). It intentionally omits auxiliary retrieval artifacts (e.g., oracle corpora) so users can reproduce strict, retrieval-agnostic evaluations.
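To make the minimal input surface concrete, here is a small sketch of how a record might be reduced to what a model is allowed to see before inference. The helper name `to_inference_input`, the `INFERENCE_FIELDS` tuple, and the sample record are all hypothetical. Keeping `repo` alongside `problem_statement` and `base_commit` is an assumption made so that the base commit can be located.

```python
from typing import Any, Dict

# Fields exposed to the model at inference time. This selection is an
# assumption based on the description above, not an official schema.
INFERENCE_FIELDS = ("repo", "base_commit", "problem_statement")


def to_inference_input(example: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `example` restricted to the inference-time fields.

    Reference fields such as the gold patch and test lists are dropped so
    they cannot leak into the prompt.
    """
    missing = [key for key in INFERENCE_FIELDS if key not in example]
    if missing:
        raise KeyError(f"example is missing required fields: {missing}")
    return {key: example[key] for key in INFERENCE_FIELDS}


# Hypothetical record, for illustration only (not taken from the dataset).
example = {
    "repo": "example-org/example-repo",
    "base_commit": "0123abc",
    "problem_statement": "Calling parse('') raises IndexError.",
    "patch": "diff --git a/parser.py b/parser.py ...",
    "test_patch": "diff --git a/test_parser.py b/test_parser.py ...",
}

model_input = to_inference_input(example)
print(sorted(model_input))
```

Filtering with an explicit allow-list, rather than deleting known reference fields, means any field added to the records later is excluded by default.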
Who It's For and Trade-offs
Great fit if you need a compact, high-quality benchmark to compare models' real-world issue-resolution ability under reproducible, test-driven evaluation. Use it when you want a low-noise signal for leaderboard comparisons or ablation studies focused on patch correctness rather than retrieval strategy. Look elsewhere if you need large-scale retrieval contexts (see the related SWE-bench Lite oracle/bm25 variants) or coverage of languages other than Python: this split contains only Python repositories and is intentionally small (500 examples) to favor verification quality over scale.
Practical Notes
- Publication and metadata dates: created on 2025-04-29 and last modified on 2026-02-27, per the dataset card. The original SWE-bench benchmark covers broader evaluation settings and maintains the leaderboard (see the SWE-bench site for competition results).
- Data structure highlights: each example includes repo, base_commit, problem_statement, patch/test_patch, FAIL_TO_PASS/PASS_TO_PASS lists, and environment_setup_commit to aid reproducible test runs. Because the dataset supplies base commits, accurate environment setup and dependency installation per the provided version fields are important to reproduce unit-test outcomes.
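As an illustration of how the FAIL_TO_PASS and PASS_TO_PASS lists feed the pass/fail metric, the sketch below marks an example as resolved only when every listed test passes after the model's patch is applied. The function names, the `"PASSED"`/`"FAILED"` status strings, and the sample data are hypothetical. The code also assumes the lists may arrive as JSON-encoded strings and decodes them if so.

```python
import json
from typing import Any, Dict, Iterable, List, Union


def as_test_list(value: Union[str, Iterable[str]]) -> List[str]:
    """Normalize a test-name field to a list of strings.

    Assumption: the field is either a real list or a JSON-encoded list.
    """
    if isinstance(value, str):
        return list(json.loads(value))
    return list(value)


def is_resolved(example: Dict[str, Any], test_status: Dict[str, str]) -> bool:
    """Return True if all FAIL_TO_PASS and PASS_TO_PASS tests passed.

    `test_status` maps test names to a status string produced by your own
    test runner; a test missing from the mapping counts as not passed.
    """
    required = as_test_list(example["FAIL_TO_PASS"]) + as_test_list(
        example["PASS_TO_PASS"]
    )
    return all(test_status.get(name) == "PASSED" for name in required)


# Hypothetical record and test outcomes, for illustration only.
example = {
    "FAIL_TO_PASS": '["tests/test_parser.py::test_empty_input"]',
    "PASS_TO_PASS": ["tests/test_parser.py::test_basic"],
}
good_run = {
    "tests/test_parser.py::test_empty_input": "PASSED",
    "tests/test_parser.py::test_basic": "PASSED",
}
regressed_run = {
    "tests/test_parser.py::test_empty_input": "PASSED",
    "tests/test_parser.py::test_basic": "FAILED",
}

print(is_resolved(example, good_run))       # fix works, nothing regressed
print(is_resolved(example, regressed_run))  # fix works, but a test regressed
```

Treating a missing test result as a failure is a deliberately conservative choice: a broken environment that silently skips tests should not count as a resolved issue.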
