Modern retrieval systems rely heavily on topical overlap or surface cues in embeddings; OBLIQ-Bench asks a different question: what happens when relevance is latent, stylistic, structural, or only obvious when a human (or reasoning LLM) inspects a candidate? The dataset highlights retrieval blind spots that standard benchmarks miss and surfaces where retrieval and verification should be decoupled.
What Sets It Apart
- Focus on "oblique queries": queries whose relevant signals are implicit (stance, error modes, abstract proof strategy, stylistic fingerprint, or a fuzzy memory) rather than expressed as keywords. This stresses retrieval mechanisms beyond topical similarity.
- Five heterogeneous tasks spanning three mechanisms of obliqueness (descriptive, analogue, tip-of-tongue), with realistic corpus sizes (examples: ~72k tweets; ~507k conversations; ~213k congressional passages; ~3.5k math problems; ~10k writing snippets). Each task provides queries.jsonl and qrels in standard TREC format; some include pooled judgments and per-query exclusion lists to support fair evaluation.
- Design encourages separation of retrieval and verification: many positives are easy to recognize once paired with a query (a good fit for an LLM verifier) but hard to locate by embedding similarity alone, which makes the benchmark useful for RAG pipelines that combine retrieval with reasoning.
- Evaluation-ready: intended metrics include NDCG@10/50 and Recall@k, with qrels_pool.tsv available for pooled evaluations to mitigate unjudged positives.
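As a rough illustration of the metrics named above, the sketch below parses qrels and scores one ranked list with NDCG@k and Recall@k. It assumes the classic four-column TREC qrels layout (query ID, iteration, document ID, relevance) and a linear gain. The exact column layout of the released files, including qrels_pool.tsv, should be checked against the dataset itself, and the function names are illustrative only.

```python
import math
from collections import defaultdict


def parse_trec_qrels(lines):
    """Parse 4-column TREC qrels lines: query_id, iteration, doc_id, relevance.

    Assumption: the files use this classic layout; adjust if they differ.
    """
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        qid, _iteration, doc_id, rel = parts
        qrels[qid][doc_id] = int(rel)
    return dict(qrels)


def ndcg_at_k(ranked_doc_ids, judged, k):
    """NDCG@k with linear gain; unjudged documents count as non-relevant."""
    gains = [judged.get(d, 0) for d in ranked_doc_ids[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(judged.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def recall_at_k(ranked_doc_ids, judged, k):
    """Fraction of judged-relevant documents that appear in the top k."""
    relevant = {d for d, rel in judged.items() if rel > 0}
    if not relevant:
        return 0.0
    return len(relevant.intersection(ranked_doc_ids[:k])) / len(relevant)
```

Treating unjudged documents as non-relevant is the usual default, and it is exactly the case where pooled judgments help: swapping in the pooled qrels changes only the `judged` dictionary, not the metric code.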
Who It's For and Tradeoffs
Great fit if you are researching or developing: embedding or dense retrievers, cross-encoder rerankers, LLM-based verifier modules, or benchmark suites that surface failure modes not covered by topical retrieval tasks. The dataset aids stress-testing systems on nuance (stance, style, strategy) and on retrieval+verification workflows.
Look elsewhere if you need: large-scale web crawl corpora, conversational-turn-level downstream metrics, or exhaustive multilingual coverage. OBLIQ-Bench is English-focused and aims to expose oblique relevance rather than serve as a general-purpose, production-scale search index.
Where It Fits
Use OBLIQ-Bench to compare embedding families, evaluate hybrid sparse+dense pipelines, or measure gains from adding an LLM verifier after retrieval. Its per-query exclusion files and pooled judgments make it suitable for careful IR evaluations and reproducible experiments.
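One way to picture the "verifier after retrieval" comparison is the minimal sketch below. It assumes that a per-query exclusion list names document IDs to drop from a ranking before scoring, which should be confirmed against the dataset documentation. The `verify` callable is a hypothetical stand-in for an LLM or cross-encoder that returns a higher score for a better match; none of these names come from the benchmark itself.

```python
from typing import Callable, Dict, Iterable, List


def retrieve_then_verify(
    query: str,
    candidates: List[str],
    texts: Dict[str, str],
    verify: Callable[[str, str], float],
    excluded: Iterable[str] = (),
    pool_size: int = 50,
) -> List[str]:
    """Rerank first-stage candidates with a verifier score.

    candidates: doc IDs in first-stage (retriever) order.
    excluded:   doc IDs to drop for this query (assumed semantics).
    verify:     hypothetical scorer, higher means more relevant.
    """
    skip = set(excluded)
    # Drop excluded documents first, then keep only the top of the ranking.
    pool = [d for d in candidates if d not in skip][:pool_size]
    # Keep the original rank so ties fall back to the retriever's order.
    scored = [(verify(query, texts[d]), rank, d) for rank, d in enumerate(pool)]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [d for _, _, d in scored]
```

Scoring the `candidates` list and the reranked list against the same qrels gives a before/after measurement of what the verifier adds. Note that the verifier can only reorder what the first stage returned, so Recall at `pool_size` bounds the achievable gain.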
Practical Notes
Data files follow the common corpus, queries, and qrels layout (with JSONL files) and are released under CC-BY-4.0. Because many tasks have few queries (e.g., 40 for WildChat errors), per-task scores can vary widely from one query sample to another; report confidence intervals and use pooled qrels when possible.
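For small query sets, one common way to attach a confidence interval to a mean metric is a percentile bootstrap over per-query scores. The sketch below is a generic illustration of that technique, not a procedure prescribed by the benchmark. The function name and defaults are assumptions.

```python
import random
from statistics import mean


def bootstrap_ci(per_query_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-query metric values.

    Returns (point_estimate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(per_query_scores)
    # Resample queries with replacement and record each resample's mean.
    means = sorted(
        mean(rng.choices(per_query_scores, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return mean(per_query_scores), lower, upper
```

With around 40 queries the interval is often wide enough that small differences between systems are not distinguishable, which is the practical reason to report it alongside the mean.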
