Most code-eval benchmarks focus on single-file edits or short prompts; this dataset tests whether an agent can reason across a whole repository, produce large patches, and validate fixes against real tests. Success here requires long-horizon change planning, multi-file diffs, and test orchestration, not just single-line completions.
What Sets It Apart
- Enterprise-scale, repo-level tasks: packaged instances include full repo identifiers, base commit hashes, golden patches, and test patches so evaluations simulate real repair workflows (731 test examples, dataset size ~23.7MB, parquet format). This emphasizes integration and CI-like verification rather than synthetic unit edits.
- Rich metadata for evaluation: problem statements, requirements, interfaces, selected test files, and fail-to-pass / pass-to-pass test records allow nuanced metrics (e.g., exact-test recovery, regression risk) beyond simple edit similarity.
- Aligned with SWE-Bench Verified: the schema follows the SWE-Bench Verified structure while adding fields specific to agent evaluation, enabling comparisons across agent policies and automated toolchains.
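
The test-record metrics mentioned above can be sketched as follows. This is a minimal illustration, not the dataset's official scorer: it assumes `fail_to_pass` and `pass_to_pass` are lists of test identifiers, and that a harness has already produced the set of tests that passed after the candidate patch was applied. The names `InstanceScore` and `score_instance` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class InstanceScore:
    resolved: bool                # all fail-to-pass tests recovered, no regressions
    fail_to_pass_recovery: float  # fraction of fail-to-pass tests now passing
    regressions: list             # pass-to-pass tests that no longer pass


def score_instance(fail_to_pass, pass_to_pass, passed_tests):
    """Score one instance from its test records and the observed passing tests."""
    passed = set(passed_tests)
    recovered = [t for t in fail_to_pass if t in passed]
    regressions = [t for t in pass_to_pass if t not in passed]
    # Assumption: an instance with no fail-to-pass tests counts as fully recovered.
    recovery = len(recovered) / len(fail_to_pass) if fail_to_pass else 1.0
    resolved = len(recovered) == len(fail_to_pass) and not regressions
    return InstanceScore(resolved, recovery, regressions)


# Hypothetical test names: one of two target tests recovered, nothing regressed.
score = score_instance(
    fail_to_pass=["test_a", "test_b"],
    pass_to_pass=["test_c"],
    passed_tests=["test_a", "test_c"],
)
print(score.resolved, score.fail_to_pass_recovery, score.regressions)
```

Keeping recovery and regressions as separate values lets a pipeline report partial progress and regression risk separately from the strict pass/fail outcome.
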
Who it's for and trade-offs
Great fit if you:
- want to benchmark LLM-based agents or automated repair systems on realistic, long-horizon software engineering tasks;
- need test-driven validation and golden patches to compute robust pass/fail metrics; or
- are evaluating agent orchestration across multi-file changes.
Look elsewhere if you:
- need tiny, fast unit tests for quick iteration (this dataset's CI-style runs and larger patches slow evaluation); or
- cannot provide reproducible test environments, since many instances require repo setup and test execution to verify fixes.
Practical notes
Expect nontrivial engineering overhead to run full evaluations (repo checkout, dependency setup, test selection). The dataset is distributed in parquet and exposes explicit fields such as repo, base_commit, patch, test_patch, problem_statement, fail_to_pass, and pass_to_pass to support automated scoring pipelines. It pairs well with evaluation harnesses that can run containerized tests or use reproducible CI runners.
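
To make the overhead concrete, here is a rough sketch of how one instance's fields might translate into setup steps. It only plans the commands and runs nothing. It assumes `repo` is a GitHub-style `owner/name` identifier and that `patch`-like fields are unified diffs that `git apply` can read from standard input; the function name `plan_evaluation` and the `workspace` directory are hypothetical. Dependency setup and the test command are project-specific and left out.

```python
def plan_evaluation(instance, candidate_patch, workdir="workspace"):
    """Return (argv, stdin_text) steps for one instance; nothing is executed."""
    # Assumption: `repo` looks like "owner/name" and is hosted on GitHub.
    repo_url = f"https://github.com/{instance['repo']}.git"
    git = ["git", "-C", workdir]
    return [
        (["git", "clone", repo_url, workdir], None),
        (git + ["checkout", instance["base_commit"]], None),
        # `git apply` with no file argument reads the diff from standard input.
        (git + ["apply"], candidate_patch),
        (git + ["apply"], instance["test_patch"]),
    ]


# Hypothetical instance values, for illustration only.
example = {"repo": "octo/demo", "base_commit": "abc123", "test_patch": "<test diff>"}
for argv, stdin_text in plan_evaluation(example, "<candidate diff>"):
    print(" ".join(argv), "(reads stdin)" if stdin_text is not None else "")
```

A runner could execute each step with `subprocess.run(argv, input=stdin_text, text=True, check=True)`, ideally inside a container so that dependency installation and test runs stay reproducible.
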
