SMOL

Provides professionally translated parallel corpora and a multilingual lexicon across 100+ low-resource languages for training and evaluating multilingual MT and NLP models. Includes SmolDoc, SmolSent, GATITOS, and factuality annotations; licensed CC-BY-4.0.

Introduction

Low-resource languages remain badly underrepresented in parallel corpora and lexica, which constrains both model quality and real-world usefulness. SMOL offers a focused remedy: professionally produced, volunteer-augmented translations at the document, sentence, and token levels, giving practitioners trusted bilingual data for training, evaluation, and analysis.

What Sets It Apart
  • Multi-granularity collection: document-level (SmolDoc), sentence-level (SmolSent), and token/lexicon-level (GATITOS), plus factuality annotations for a subset of SmolDoc. This mix supports both model training and fine-grained error analysis.
  • Broad low-resource coverage: the release covers translations into many under-represented languages (hundreds of language pairs across the components), enabling work on languages often missing from public corpora.
  • Professional + volunteer pipeline: core professional translations are complemented by volunteer contributions and targeted post-editing, improving quality for many smaller languages while scaling coverage.
  • Open license and reproducibility: released under CC-BY-4.0 on Hugging Face, with configs for many language-pair splits and regular updates (e.g., April 2026 content expansion and MediSMOL additions).

Who it's for and trade-offs

Great fit if you need high-quality bilingual data for low-resource MT research, multilingual evaluation, or to bootstrap lexicons and post-editing workflows. The dataset is particularly useful for experiments requiring document context, token lexica, or per-document factuality labels.

Look elsewhere if you need very large parallel corpora for high-resource languages (SMOL is intentionally compact per language), or if you require gold-standard human translations for every language without any volunteer-contributed splits. Quality and size vary by language and split.

Where it fits

Use SMOL to augment web-crawled or synthetic parallel data, to evaluate model robustness on diverse languages and scripts, or to build lexicon-based augmentations for unsupervised/zero-shot MT. Combine SmolDoc/SmolSent/GATITOS according to whether your task benefits from document context, sentence pairs, or lexical coverage.
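As a rough illustration of the lexicon-based augmentation mentioned above, the sketch below swaps source words for lexicon translations to produce code-switched training text. It is a minimal example under stated assumptions: the lexicon entries are hypothetical placeholders rather than real GATITOS data, and the function name `lexicon_augment` and its `swap_prob` parameter are invented for this example. A real pipeline would load the lexicon from the dataset and handle tokenization and morphology more carefully.

```python
import random

# Toy English -> target-language lexicon. These entries are hypothetical
# placeholders for illustration, not real GATITOS data.
TOY_LEXICON = {
    "water": ["tgt_water"],
    "house": ["tgt_house", "tgt_home"],
    "dog": ["tgt_dog"],
}


def lexicon_augment(sentence, lexicon, swap_prob=0.5, seed=None):
    """Replace words found in the lexicon with a translation.

    Each matching word is swapped independently with probability swap_prob.
    """
    rng = random.Random(seed)
    out = []
    for token in sentence.split():
        # Split off trailing punctuation so "house." still matches "house".
        core = token.rstrip(".,;:!?")
        tail = token[len(core):]
        candidates = lexicon.get(core.lower())
        if candidates and rng.random() < swap_prob:
            out.append(rng.choice(candidates) + tail)
        else:
            out.append(token)
    return " ".join(out)


if __name__ == "__main__":
    print(lexicon_augment("The dog drinks water.", TOY_LEXICON, swap_prob=1.0))
```

Setting `swap_prob` below 1.0 leaves some words untranslated, which yields mixed-language sentences; whether that helps depends on the model and language pair.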

Information

  • Website: huggingface.co
  • Authors: google, Isaac Caswell, Elizabeth Nielsen, Jiaming Luo, Colin Cherry
  • Published date: 2025/02/14

Categories