Nemotron‑Personas‑El‑Salvador matters because a lack of regionally grounded persona data can amplify bias and degrade model behaviour in local deployments. This release gives model builders a census‑anchored synthetic population in Salvadoran Spanish, so they can condition outputs on realistic age, occupation, municipality, and education distributions without exposing real PII.
What Sets It Apart
- Census‑anchored generation: Personas are conditioned on the VII Census (2024) distributions for age, department/municipality, education, marital status, and employment categories, which enables geographically and demographically realistic sampling for El Salvador.
- High coverage & scale: Published as 148k Parquet records (7 personas per record, ≈1M personas in total) containing ≈300M tokens (≈161M of them persona tokens), with 25 fields (7 persona narratives + 18 contextual attributes) to support fine‑grained conditioning in training and evaluation.
- Synthetic + provenance controls: Generated with NeMo Data Designer using a probabilistic graphical model plus an Apache 2.0‑licensed LLM (openai/gpt-oss-120b) and validators. Name distributions were used during generation, but name fields are not exposed, which reduces memorization and re‑identification risk.
- Practical license & local focus: Released under CC BY 4.0 and built in collaboration with WideLabs and NVIDIA to support Sovereign AI efforts and localized model development.
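As a rough illustration of the "fine‑grained conditioning" mentioned above, the sketch below draws persona records uniformly at random and turns each one into a conditioning prompt. Uniform sampling over records preserves, in expectation, whatever demographic distribution the records already carry. The field names (`age`, `occupation`, `municipality`, `education`, `persona`), the sample records, and both helper functions are hypothetical and chosen only for illustration; check the published schema before adapting this.

```python
import random

# Hypothetical persona records. Field names and values are illustrative
# assumptions, not the dataset's published schema or real entries.
RECORDS = [
    {"age": 34, "occupation": "docente", "municipality": "Santa Ana",
     "education": "universitaria", "persona": "Maestra de primaria."},
    {"age": 52, "occupation": "agricultor", "municipality": "Chalatenango",
     "education": "primaria", "persona": "Cultiva maíz y frijol."},
    {"age": 27, "occupation": "comerciante", "municipality": "San Miguel",
     "education": "bachillerato", "persona": "Atiende un puesto en el mercado."},
]

# Attributes copied into the prompt, in display order (assumed names).
PROMPT_FIELDS = ("age", "occupation", "municipality", "education")


def sample_personas(records, k, seed=0):
    """Draw up to k distinct records uniformly at random (reproducible via seed)."""
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


def build_system_prompt(record):
    """Format one persona record as a conditioning prompt for an LLM."""
    attributes = "\n".join(f"- {name}: {record[name]}" for name in PROMPT_FIELDS)
    return (
        "Adopt the following persona and answer in Salvadoran Spanish.\n"
        f"{attributes}\n"
        f"Narrative: {record['persona']}"
    )


if __name__ == "__main__":
    for rec in sample_personas(RECORDS, k=2):
        print(build_system_prompt(rec))
        print("---")
```

In practice the in-memory list would be replaced by rows read from the released Parquet files, and the prompt template would be adjusted to the model being trained or evaluated.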
Who it's for — and tradeoffs
Great fit if you are training or evaluating LLMs for Salvadoran Spanish, building synthetic-data pipelines that require realistic demographic anchors, or researching bias mitigation and model collapse from synthetic corpora. Look elsewhere if you need labeled clinical or financial personas, under-18 profiles, or an authoritative Salvadoran NLP corpus for dialectal speech generation: the persona narratives approximate Salvadoran Spanish rather than deriving from a verified local conversational corpus. Known limitations include underrepresentation of some Indigenous and Afrodescendant identities (small-cell suppression), omission of religion, and potential residual gender-role artifacts from the narrative LLM. Use responsibly: verify the data against your downstream regulatory and privacy requirements before production deployment.
