MLOps · 2024

Inspect: Framework for Large Language Model Evaluations

Runs reproducible evaluations of large language models through a Python API with built-in solvers and scorers, including model-graded scoring. Ships 200+ ready-to-run evals spanning capability and safety testing, and connects to most major model providers.


Introduction

Most teams discover their LLM evaluation harness is a pile of one-off scripts the moment they need to reproduce a result or defend a number. Inspect comes at the problem from the opposite end: it was built inside the UK's AI Security Institute, where evaluations have to survive scrutiny, so it treats an eval as structured, auditable infrastructure rather than a notebook. The core abstraction is a clean separation between the dataset (the inputs and expected targets), the solver (how the model is prompted and given tools), and the scorer (how answers are graded, either by rule or by another model).
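To make that three-way split concrete, here is a minimal sketch in plain Python. It does not use Inspect's actual API: `Sample`, `prompt_solver`, `exact_match`, `run_eval`, and `stub_model` are all hypothetical names invented for illustration, and the "model" is a hard-coded stub rather than a provider call. The point is only that each piece can be swapped without touching the other two.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# NOTE: illustrative only. None of these names come from Inspect itself.


@dataclass
class Sample:
    """One dataset row: the input shown to the model and the expected target."""
    input: str
    target: str


# A model is any callable from prompt text to completion text.
Model = Callable[[str], str]
Solver = Callable[[Sample, Model], str]
Scorer = Callable[[str, Sample], float]


def prompt_solver(template: str) -> Solver:
    """Build a solver that wraps the sample input in a prompt template."""
    def solve(sample: Sample, model: Model) -> str:
        return model(template.format(question=sample.input))
    return solve


def exact_match(output: str, sample: Sample) -> float:
    """Score 1.0 when output and target match after trimming and case-folding."""
    same = output.strip().casefold() == sample.target.strip().casefold()
    return 1.0 if same else 0.0


def run_eval(dataset: List[Sample], solver: Solver, scorer: Scorer, model: Model) -> Dict:
    """Run every sample through the solver, score it, and keep a per-sample record."""
    records = []
    for sample in dataset:
        output = solver(sample, model)
        records.append({
            "input": sample.input,
            "output": output,
            "target": sample.target,
            "score": scorer(output, sample),
        })
    accuracy = sum(r["score"] for r in records) / len(records) if records else 0.0
    return {"accuracy": accuracy, "records": records}


def stub_model(prompt: str) -> str:
    """Stand-in for a provider call; it only knows one answer."""
    return "Paris" if "France" in prompt else "unknown"


dataset = [
    Sample("What is the capital of France?", "paris"),
    Sample("What is the capital of Peru?", "Lima"),
]
result = run_eval(dataset, prompt_solver("Answer in one word. {question}"), exact_match, stub_model)
print(result["accuracy"])
```

Keeping the per-sample records alongside the aggregate score is what makes a result auditable: a reviewer can see which inputs failed and why, not just the final number. Swapping `exact_match` for a scorer that asks a second model to grade the output would leave the dataset and solver untouched.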

What Sets It Apart

200+ pre-built evaluations run against supported model providers out of the box, so you start from a real baseline instead of reinventing MMLU-style harnesses.
First-class agent evaluation: tool use, multi-turn dialogue, sandboxed code execution, and trajectory inspection are built in, which matters as benchmarks shift from Q&A toward autonomous tasks.
A log viewer records every prompt, tool call, and score, so a failing eval is debuggable and a passing one is reproducible — the difference between a demo and evidence.
Extensible through ordinary Python packages, so new scorers or eval techniques ship without forking the core.

Who It's For

Great fit if you're doing serious capability or safety evaluation — red-teaming, dangerous-capability testing, or comparing models on agentic tasks — and need results that hold up to review. Look elsewhere if you just want a quick accuracy number on a single benchmark; a lightweight script or a hosted leaderboard is less ceremony. The Python-first design also assumes you're comfortable writing code rather than clicking through a UI.


Information

Website: github.com
Organizations: UK AI Security Institute
Authors: UK AI Security Institute, UK Government BEIS
Published date: 2024/05/10
