Tag

Explore by tags

All

30u30

ASR

ChatGPT

GNN

IDE

RAG

agent-skills

ai

ai-agent

ai-api

ai-api-management

ai-client

ai-coding

ai-demos

ai-deploy

ai-development

ai-framework

ai-image

ai-image-demos

ai-inference

ai-leaderboard

ai-library

ai-rank

ai-serving

ai-tools

ai-train

ai-video

ai-workflow

AIGC

algorithms

alibaba

amazon

android

anthropic

audio

aws

benchmark

biology

blog

book

bytedance

chatbot

chatgpt

chemistry

claude

claude-code

cli

code

codex

copilot

course

cuda

cursor

deepmind

deepseek

depth

devops

diffusers

docker

drug-discovery

electron

embeddings

engineering

evaluation

facebook

finance

flow-matching

foundation

foundation-model

gemini

gemini-cli

gemma

genomics

gitHub

github

go

google

gradient-booting

grok

groq

huggingface

image

ios

java

javascript

json

kimi

llama.cpp

LLM

llm

lora

mLOps

math

mcp

mcp-client

mcp-server

meta-ai

meta-pytorch

metal

microsoft

mlops

mobile

multilingual

multimodal

mysql

NLP

nlp

nodejs

numpy

nvidia

ocr

ollama

openai

opencode

pandas

paper

physics

pi

plugin

polars

postgres

privacy

prompt-engineering

pwa

python

pytorch

qwen

react

reasoning

retrieval

RL

robotics

rust

science

security

segmentation

shodan

skillkit

sora

speech

sqlite

ssh

stt

swe

tensorrt

terminal

transformers

translation

tts

tutorial

typescript

vibe-coding

video

vision

vllm

voice

web-search

windsurf

xAI

xai

AI Leaderboard · 2023

Arena Leaderboard (formerly LMArena)

LMSYS Org, Arena, Arena Intelligence Inc.

Blind side-by-side voting site where users send one prompt to two anonymous chat models, pick the winner, and millions of votes become Elo rankings across text, coding, vision, image, and video. Crowd preference, not static benchmarks, decides the order.

ai-leaderboard ai-rank LLM chatbot ai-tools
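
The step from pairwise votes to a ranking, as described in the entry above, can be illustrated with a textbook Elo update. This is a minimal sketch only: the listing does not state how the site actually computes its scores, so the starting rating of 1000, the K-factor of 32, and the function names `expected_score`, `update_elo`, and `rank_from_votes` are all assumptions made for illustration.

```python
from collections import defaultdict


def expected_score(r_a, r_b):
    """Probability that a player rated r_a beats one rated r_b under Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(ratings, winner, loser, k=32.0):
    """Apply one vote: shift points from the loser to the winner.

    k (the K-factor) is an assumed value, not taken from the listing.
    """
    e_w = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - e_w)  # larger gain when the win was unexpected
    ratings[winner] += delta
    ratings[loser] -= delta


def rank_from_votes(votes, initial=1000.0, k=32.0):
    """Turn (winner, loser) votes into a list sorted by rating, best first."""
    ratings = defaultdict(lambda: initial)  # assumed starting rating
    for winner, loser in votes:
        update_elo(ratings, winner, loser, k)
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical votes between three anonymous models.
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
ranking = rank_from_votes(votes)
for name, rating in ranking:
    print(f"{name}: {rating:.1f}")
```

Because each update moves the same number of points from loser to winner, the total rating mass stays constant, and a sequential scheme like this one depends on the order in which votes arrive.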

AI Dataset · 2024

hf-audio/open-asr-leaderboard

hf-audio

Provides leaderboard-ready test splits for the Open ASR Leaderboard: converts unsafe custom loaders to Parquet, sorts samples by audio length, and packages eight ESB test sets (LibriSpeech, Common Voice, GigaSpeech, SPGISpeech, etc.) for reproducible ASR benchmarking.

ASR audio huggingface ai-leaderboard ai-rank

AI Dataset · 2025

SWE-bench Verified

SWE-bench

A human-verified subset of 500 SWE-bench test cases for evaluating models that turn GitHub issues into PRs, with each fix checked by unit tests. Contains problem statements and pre-fix base commits for reproducible evaluation; suited to benchmarking code-fix and issue-resolution capabilities.

github nlp python ai-leaderboard ai-rank +1

AI Dataset · 2026

TAAC2026 Demo Dataset (1000 Samples)

TAAC2026

Provides a 1,000-row sample Parquet file of user–item interactions for the TAAC2026 recommendation task, using a flat layout with 120 top-level columns (IDs, labels, user/item integer and dense features, and four-domain behavioral sequences). Updated 2026-04-10.

huggingface pandas python ai ai-rank

AI Dataset · 2026

HealthBench Professional

OpenAI

Benchmark dataset for evaluating clinician-facing chat assistants: physician-authored conversations plus rubric items, use-case and difficulty labels, specialty metadata, and a built-in canary to reduce benchmark contamination. Hosted on Hugging Face under an MIT license.

openai huggingface ai-rank ai-leaderboard nlp +2

AI Dataset · 2026

OBLIQ-Bench

dianetc

A retrieval benchmark suite focused on “oblique queries,” where relevance depends on latent attributes rather than surface keywords. Includes five tasks with large corpora, qrels (and pooled judgments), and task-specific constraints for evaluating embedding-based retrievers and reasoning-augmented retrieval.

huggingface nlp embeddings llm math +3

AI Dataset · 2026

Agents Last Exam — Task Card Metadata

agents-last-exam (RDI Berkeley)

Provides task-card metadata for 147 long-horizon professional tasks from the Agents Last Exam benchmark: titles, prompts, taxonomy, and input-file descriptors. This v1.0 release is metadata-only; companion repos host the input files and gated reference outputs.

huggingface ai-agent agent-skills pandas ai-rank

AI Dataset · 2026

WBench

meituan-longcat

Provides a 289-case (1,058-turn) multi-turn benchmark that evaluates interactive video world models across 22 metrics and five dimensions (quality, setting, interaction, consistency, physics). Includes first-/third-person and navigation splits plus a 20-model leaderboard for head-to-head comparisons.

video ai-video vision physics huggingface +4

AI Agent Papers · 2026

A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

Tomer Keren, Nitay Calderon +4

Proposes TASTE, an automatic pipeline that synthesizes challenging agent benchmark tasks by sampling and evolving valid tool-sequence patterns; uses an adaptive contrastive n-gram model and LLM validity judgments to produce τ^c-Bench with broader tool-use coverage and higher difficulty.

agent-skills ai-agent paper LLM ai-rank

Large Language Model Papers · 2026

MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

Jiacheng Chen, Xinyu Zhang +21

Applies a population-level test-time scaling strategy that uses one model as generator, verifier, refiner, and ranker to search over candidate proofs. Combines generative-verifier RL and a low false-positive verifier with tournament selection to reach competition-level performance on IMO and USAMO.

paper LLM RL ai ai-rank +2