AI Leaderboard · 2023

OpenRouter LLM Rankings

OpenRouter, Inc.

Ranks LLMs by real production token usage, not benchmarks. Aggregates traffic from millions of users hitting 400+ models through one API — sliced by model, lab market share, tool-call frequency, and image volume, updated weekly.

AI Leaderboard · 2023

OpenCompass CompassRank

OpenCompass Contributors, Shanghai AI Laboratory

Public leaderboard ranking LLMs and multimodal models across 70+ datasets — reasoning, knowledge, coding, math, and long-context. Blends open-source and proprietary benchmarks into one comparative view spanning GPT-4, Claude, Qwen, and InternLM.

Language Model Evaluation Harness

Unified framework for few-shot evaluation of generative language models across 60+ academic benchmarks. Supports multiple model backends (Hugging Face, vLLM, APIs, local servers), configurable prompts and YAML configs, and reproducible exports for leaderboards and research comparisons.

llm ai-leaderboard huggingface vllm github +3

AI Leaderboard · 2023

Arena Leaderboard (formerly LMArena)

LMSYS Org, Arena, Arena Intelligence Inc.

Blind side-by-side voting site where users send one prompt to two anonymous chat models, pick the winner, and millions of votes become Elo rankings across text, coding, vision, image, and video. Crowd preference, not static benchmarks, decides the order.

ai-leaderboard ai-rank LLM chatbot ai-tools
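
To illustrate how pairwise votes can become a ranking, here is a minimal sketch of the classic online Elo update. It assumes each vote is a (winner, loser) pair with no ties; the function name `elo_ratings`, the K-factor of 32, and the starting rating of 1000 are illustrative choices, not values taken from the Arena site, whose production ratings may be fitted differently.

```python
from collections import defaultdict


def elo_ratings(votes, k=32.0, base=1000.0):
    """Replay (winner, loser) votes in order and return a rating per model.

    k and base are illustrative defaults, not the leaderboard's settings.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in votes:
        # Probability the winner was expected to win under the logistic Elo model.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        # An upset (low expected score) moves both ratings further.
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


# Hypothetical votes between three anonymous models.
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
ratings = elo_ratings(votes)
ranking = sorted(ratings, key=ratings.get, reverse=True)
```

Because every update adds to one rating exactly what it subtracts from the other, the total rating mass stays constant; only the relative order carries information.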

AI Leaderboard · 2023

VLMEvalKit

open-compass (OpenCompass community), OpenCompass, Shanghai AI Laboratory

Runs one-command evaluation of vision-language models across 80+ multimodal benchmarks, handling data download, inference, and metric scoring in a single pass. Supports 220+ LMMs; adding a new model means writing one generate_inner() function.

vision ai-leaderboard huggingface github ai-tools +1

hf-audio/open-asr-leaderboard

Provides leaderboard-ready test splits for the Open ASR Leaderboard: converts unsafe custom loaders to Parquet, sorts samples by audio length, and packages eight ESB test sets (LibriSpeech, Common Voice, GigaSpeech, SPGISpeech, etc.) for reproducible ASR benchmarking.

ASR audio huggingface ai-leaderboard ai-rank

Humanity's Last Exam

Center for AI Safety (cais), Scale AI

Multimodal, closed-ended academic benchmark of 2,500 multiple-choice and short-answer exam questions spanning math, natural sciences, and humanities, designed for automated grading. Curated by subject-matter experts, released under MIT, and includes a canary string to help prevent dataset leakage into model training.

huggingface multimodal image nlp pandas +1
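
A canary string only helps if training pipelines actually look for it. The sketch below shows one way a data-cleaning step might drop documents that contain a benchmark canary. The value `EXAMPLE-CANARY-0000` is a made-up placeholder, not the real canary of any benchmark, and `drop_contaminated` is a hypothetical helper name.

```python
# Placeholder only: substitute the canary string published by the benchmark.
CANARY = "EXAMPLE-CANARY-0000"


def drop_contaminated(documents, canary=CANARY):
    """Split documents into (clean, flagged) by exact substring match on the canary."""
    clean, flagged = [], []
    for doc in documents:
        (flagged if canary in doc else clean).append(doc)
    return clean, flagged


# Hypothetical crawl sample: the second document quotes benchmark material.
corpus = [
    "An ordinary web page about leaderboards.",
    "Question 17 ... EXAMPLE-CANARY-0000 ... answer key",
]
clean, flagged = drop_contaminated(corpus)
```

Exact matching is the simplest possible check; it will miss text that was paraphrased or stripped of the canary before being republished.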

SWE-bench Verified

A human-verified subset of 500 SWE-bench test cases for evaluating models that resolve GitHub issues into PRs, with fixes verified by unit tests. Contains problem statements and pre-fix base commits for reproducible evaluation; suitable for benchmarking code-fix and issue-resolution capabilities.

github nlp python ai-leaderboard ai-rank +1

ScaleAI/SWE-bench_Pro

Benchmark dataset for evaluating agents on long-horizon software-engineering tasks (repo-level patches, test-driven fixes). Includes golden patches, related tests, and problem statements in Parquet format; aimed at agent debugging and code-modification evaluation, but requires full test environments to run.

huggingface ai-coding agent-skills ai-leaderboard code

OpenAI/parameter-golf

A challenge repository for training the best language model that fits inside a 16,000,000-byte (16 MB) submission artifact; provides baseline training code, FineWeb bpb (bits-per-byte) evaluation, a public leaderboard, and compute-grant instructions for short 8×H100 runs.

openai ai-train ai-leaderboard github pytorch +2
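
For readers unfamiliar with the metric, the following sketch shows how a bits-per-byte score and a size check could be computed. It assumes "bpb" means bits per byte over UTF-8 text and that the model's loss is a summed negative log-likelihood in nats; the function names and the way the limit is applied are illustrative, not the repository's actual evaluation code.

```python
import math

SIZE_LIMIT_BYTES = 16_000_000  # limit stated in the card above


def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Convert a summed negative log-likelihood (nats) into bits per UTF-8 byte."""
    n_bytes = len(text.encode("utf-8"))
    # Divide by ln(2) to go from nats to bits, then normalise by byte count.
    return total_nll_nats / (math.log(2) * n_bytes)


def fits_budget(artifact_size_bytes: int, limit: int = SIZE_LIMIT_BYTES) -> bool:
    """Return True when the artifact is no larger than the byte limit."""
    return artifact_size_bytes <= limit


# Hypothetical numbers: a model that spends exactly one bit on each of 4 bytes.
score = bits_per_byte(4 * math.log(2), "abcd")
```

Normalising by bytes rather than tokens keeps scores comparable across submissions that use different tokenizers.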

ParseBench

llamaindex (dataset), Boyang Zhang +5

Benchmarks document-parsing systems on real-world enterprise PDFs and images—evaluates tables, charts, content faithfulness, semantic formatting, and visual grounding with human-verified, rule-level tests. Ships with ~2,000 pages, ~169K test rules, and an open evaluation framework for end-to-end pipeline scoring.

huggingface github paper ocr vision +3

HealthBench Professional

Benchmark dataset for evaluating clinician-facing chat assistants: physician-authored conversations plus rubric items, use-case and difficulty labels, specialty metadata, and a built-in canary to reduce benchmark contamination. Hosted on Hugging Face under an MIT license.

openai huggingface ai-rank ai-leaderboard nlp +2