A 57-subject multiple-choice benchmark for measuring broad language understanding in LLMs. It provides per-subject configs and test, dev, and auxiliary_train splits for few-shot and zero-shot evaluation, and is widely used for model comparison and academic reporting.
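As a rough illustration of how the dev split can supply few-shot examples for a test question, here is a minimal sketch. The row layout (`question`, `choices`, and an integer `answer` index), the prompt wording, and the helper names `format_example` and `build_prompt` are assumptions made for this example, not a documented schema or an official evaluation harness.

```python
from typing import Dict, List

LETTERS = "ABCD"  # assumes four answer options per question


def format_example(row: Dict, include_answer: bool = True) -> str:
    """Render one multiple-choice row as text (hypothetical row layout)."""
    lines = [row["question"]]
    for letter, choice in zip(LETTERS, row["choices"]):
        lines.append(f"{letter}. {choice}")
    # The answer is assumed to be stored as an integer index into `choices`.
    answer = f" {LETTERS[row['answer']]}" if include_answer else ""
    lines.append(f"Answer:{answer}")
    return "\n".join(lines)


def build_prompt(subject: str, dev_rows: List[Dict], test_row: Dict, k: int = 5) -> str:
    """Build a k-shot prompt: k solved dev rows, then the unsolved test row.

    With k=0 this reduces to a zero-shot prompt.
    """
    header = f"The following are multiple choice questions (with answers) about {subject}."
    shots = [format_example(r) for r in dev_rows[:k]]
    return "\n\n".join([header, *shots, format_example(test_row, include_answer=False)])
```

The prompt ends at `Answer:` so that a model's next token can be compared against the option letters; per-subject accuracy would then be computed over that subject's test split.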
Provides pre-parsed Parquet snapshots of English and French Wikipedia articles with structured fields (sections, infoboxes, tables, references, images) and credibility signals, optimized for large-scale analysis, retrieval-augmented generation, and model development.
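To suggest how pre-parsed sections could feed a retrieval-augmented generation pipeline, the sketch below splits one article record into fixed-size text chunks. The record layout (a `title` plus a `sections` list of `heading`/`text` dicts) and the helper name `sections_to_chunks` are hypothetical; the actual column names in the snapshots may differ.

```python
from typing import Dict, List


def sections_to_chunks(article: Dict, max_chars: int = 500) -> List[Dict]:
    """Split each section's text into chunks of at most `max_chars` characters.

    Assumes a hypothetical record shape:
    {"title": str, "sections": [{"heading": str, "text": str}, ...]}
    """
    chunks = []
    for section in article.get("sections", []):
        text = section.get("text", "").strip()
        if not text:
            continue  # skip sections with no body text
        for start in range(0, len(text), max_chars):
            chunks.append({
                "title": article["title"],
                "heading": section.get("heading", ""),
                "text": text[start:start + max_chars],
            })
    return chunks
```

In practice the records would come from a Parquet reader such as `pandas.read_parquet`, and a character-based split like this one would likely be replaced by a sentence-aware or token-aware splitter.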