IndustryBench

Multilingual benchmark for evaluating LLMs' industrial domain knowledge: 2,049 expert-curated QA pairs spanning 10 product verticals and four languages, each grounded in industry or national standards and scored through an LLM-as-judge evaluation pipeline.

Introduction

Most LLM benchmarks emphasize common-sense, coding, or web knowledge. Industrial expertise is different: it relies on standards, manuals, and precise safety constraints. This benchmark probes whether LLMs can retrieve, reason over, and safely apply grounded industrial knowledge rather than produce plausible-sounding but unsafe answers.

What Sets It Apart
  • Grounded, expert-curated QA pairs: 2,049 human-validated question–answer pairs drawn from authoritative sources (national/industry standards, product manuals), so evaluations emphasize verifiable correctness rather than subjective plausibility.
  • Cross-industry coverage with fine-grained capability tags: items span 10 industrial sectors (e.g., Machinery, Chemical, Electronics, Electrical) and are annotated across seven capability dimensions (Selection & Substitution; Standards & Terminology; Process Principles; Safety & Compliance; Quality & Metrology; Fault Diagnosis; Engineering Calculation), enabling targeted analysis of model strengths and blind spots.
  • Multilingual, human-reviewed translations: questions and reference answers are available in Chinese, English, Russian, and Vietnamese, supporting multilingual model evaluation and cross-lingual consistency checks.
  • Safety-aware evaluation pipeline: uses an LLM-as-judge 0–3 rubric plus an explicit safety review that forces a score of 0 for responses containing industrial safety violations, prioritizing real-world compliance over surface-level correctness.
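
The safety override in the last bullet can be sketched as below. This is a minimal illustration, not the benchmark's actual pipeline: the function name `final_score` and the boolean `safety_violation` flag are hypothetical, and the judge score is assumed to arrive as an integer from 0 to 3.

```python
def final_score(judge_score: int, safety_violation: bool) -> int:
    """Combine a 0-3 judge score with a safety review (hypothetical helper)."""
    if not 0 <= judge_score <= 3:
        raise ValueError("judge_score must be in the range 0-3")
    # A flagged safety violation overrides the rubric score entirely.
    return 0 if safety_violation else judge_score


# An otherwise top-scoring answer that violates a safety constraint gets 0.
scores = [final_score(3, False), final_score(3, True), final_score(1, False)]
```

Applying the override after the rubric score, rather than folding safety into the rubric itself, keeps a dangerous answer from earning partial credit for being otherwise accurate.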

Who It's For

Great fit if you are assessing or fine-tuning LLMs for industrial applications where standards compliance, terminology accuracy, and safety are critical: for example, vendors evaluating domain adaptation, companies validating assistant recommendations against manuals, or researchers studying knowledge grounding and evaluation methods. Look elsewhere if your focus is general-purpose NLP benchmarks, multimodal perception, or very large-scale unsupervised corpora. This dataset is medium-sized (2,049 items) and deliberately curated for diagnostic, standards-grounded evaluation rather than large-scale pretraining.

Practical notes and trade-offs
  • The dataset is intentionally unbalanced by capability because it reflects the frequency of issues in the source pool: Selection & Substitution, Standards & Terminology, and Process Principles dominate, while Fault Diagnosis and Engineering Calculation have far fewer items—treat per-dimension quantitative comparisons for the rarer labels as indicative rather than definitive.
  • Each item includes a knowledge_text reference used by the safety module and judge prompts; reproducing the benchmark faithfully requires respecting those references during evaluation.
  • Medium scale (2,049 items) makes thorough manual inspection and repeat experiments tractable, but it is not a substitute for large-scale industrial QA collections when training data volume is the goal.
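
One way to act on the imbalance caveat is to report the item count next to every per-dimension mean and flag sparse dimensions. The sketch below assumes results are available as `(capability, score)` pairs. The helper name `per_dimension_summary` and the `min_items` threshold are hypothetical choices for illustration, not part of the benchmark.

```python
from collections import defaultdict


def per_dimension_summary(results, min_items=30):
    """Aggregate (capability, score) pairs into per-dimension statistics.

    `min_items` is an arbitrary, hypothetical cutoff below which a
    dimension's mean is flagged as indicative rather than definitive.
    """
    buckets = defaultdict(list)
    for capability, score in results:
        buckets[capability].append(score)

    summary = {}
    for capability, scores in buckets.items():
        summary[capability] = {
            "n": len(scores),
            "mean": sum(scores) / len(scores),
            "indicative_only": len(scores) < min_items,
        }
    return summary


# Toy data only; these are not real benchmark results.
toy_results = [
    ("Selection & Substitution", 3),
    ("Selection & Substitution", 2),
    ("Selection & Substitution", 1),
    ("Fault Diagnosis", 0),
]
summary = per_dimension_summary(toy_results, min_items=3)
```

With the toy data, `Fault Diagnosis` has a single item and is flagged `indicative_only`, while `Selection & Substitution` meets the threshold.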

Information

  • Website: huggingface.co
  • Authors: alibaba-multimodal-industrial-ai, Songlin Bai, Xintong Wang, Linlin Yu, Bin Chen, Zhiang Xu, Yuyang Sheng, Changtong Zan, Xiaofeng Zhu, Yizhe Zhang, Jiru Li, Mingze Guo, Ling Zou, Yalong Li, Chengfu Huo, Liang Ding
  • Published date: 2026/05/12
