AIAny - SMOL

Low-resource languages remain badly underrepresented in parallel corpora and lexica, which constrains both model quality and real-world usefulness. SMOL offers a focused remedy: professionally produced, volunteer-augmented translations at the document, sentence, and token levels, giving practitioners trusted bilingual data for training, evaluation, and analysis.

What Sets It Apart

Multi-granularity collection: document-level (SmolDoc), sentence-level (SmolSent), and token/lexicon-level (GATITOS), plus factuality annotations for a subset of SmolDoc. This mix supports both model training and fine-grained error analysis.
Broad low-resource coverage: the release covers translations into many under-represented languages (hundreds of language pairs across the components), enabling work on languages often missing from public corpora.
Professional + volunteer pipeline: core professional translations are complemented by volunteer contributions and targeted post-editing, improving quality for many smaller languages while scaling coverage.
Open license and reproducibility: released under CC-BY-4.0 on Hugging Face, with configs for many language-pair splits and regular updates (e.g., April 2026 content expansion and MediSMOL additions).

Who it's for and trade-offs

Great fit if you need high-quality bilingual data for low-resource MT research or multilingual evaluation, or want to bootstrap lexicons and post-editing workflows. The dataset is particularly useful for experiments requiring document context, token lexica, or per-document factuality labels.

Look elsewhere if you need very large parallel corpora for high-resource languages (SMOL is intentionally compact per language), or if you require gold-standard human translations for every language without any volunteer-contributed splits. Quality and size vary by language and split.

Where it fits

Use SMOL to augment web-crawled or synthetic parallel data, to evaluate model robustness on diverse languages and scripts, or to build lexicon-based augmentations for unsupervised/zero-shot MT. Combine SmolDoc/SmolSent/GATITOS according to whether your task benefits from document context, sentence pairs, or lexical coverage.
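As a rough illustration of the lexicon-based augmentation mentioned above, the sketch below swaps source words for target-side lexicon entries to produce code-switched training text. It is a minimal example under stated assumptions, not SMOL's or GATITOS's own tooling: the function name `lexicon_augment`, the `swap_prob` parameter, and the `TOY_LEXICON` entries (with placeholder `xx_` targets rather than real translations) are all hypothetical. In practice the dictionary would be built from whatever token-level entries you load for your language pair.

```python
import random
import re

# Hypothetical toy lexicon standing in for a token-level resource.
# The "xx_" targets are placeholders, not real translations.
TOY_LEXICON = {
    "water": ["xx_water"],
    "house": ["xx_house", "xx_home"],
    "good": ["xx_good"],
}


def lexicon_augment(sentence, lexicon, swap_prob=0.5, rng=None):
    """Replace words found in the lexicon with a target-side entry.

    Each matching word is swapped independently with probability swap_prob.
    Lookup is case-insensitive; punctuation and spacing are preserved.
    """
    if rng is None:
        rng = random.Random()
    out = []
    # Split into alternating word and non-word runs so nothing is lost on rejoin.
    for piece in re.findall(r"\w+|\W+", sentence):
        candidates = lexicon.get(piece.lower())
        if candidates and rng.random() < swap_prob:
            out.append(rng.choice(candidates))
        else:
            out.append(piece)
    return "".join(out)


if __name__ == "__main__":
    # Seeded generator so repeated runs give the same augmented sentence.
    print(lexicon_augment("Good water near the house.", TOY_LEXICON,
                          swap_prob=1.0, rng=random.Random(0)))
```

Passing a seeded `random.Random` keeps augmentation reproducible across runs, and a lower `swap_prob` keeps more of the original sentence intact. Whole-word, case-insensitive matching is a simplification; morphologically rich languages would likely need lemmatization or subword-aware matching.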
