
promptfoo

CLI and library for automated LLM prompt and red-team evaluations: run evals locally on sensitive data, compare models, and integrate checks into CI/CD with declarative configs and built-in red-teaming.

Introduction

Most teams ship LLM-powered features without systematic, repeatable checks for prompt correctness, cross-model regressions, or emerging safety issues. This tool treats prompt and agent testing like unit testing: declarative evals, repeatable runs, and CI hooks that find functional regressions and security failures before they reach users.

What Sets It Apart
  • Local-first, privacy-minded evals: tests and red-team scans can run locally so that prompts and test data stay in your environment. This matters for sensitive domains, because teams can validate behavior without exposing proprietary prompts or user data.
  • Multi-provider comparison with a single config: swap OpenAI, Anthropic, local models, and others in the same test suite to surface behavioral differences and regressions across providers — so you can make model-selection decisions based on measured metrics rather than anecdote.
  • Declarative workflows + CI/CD integration: test suites are defined in human-readable configs and can be executed via CLI or in CI pipelines, enabling automated guardrails on PRs and deployments to prevent regressions.
  • Red-teaming & vulnerability scanning baked in: includes patterns and tooling for prompt injection, data exfiltration, and other safety checks, plus human-reviewed reporting to turn findings into actionable fixes.
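
The idea behind the bullets above (one declarative suite, several providers, a pass/fail gate for CI) can be illustrated with a minimal Python sketch. This is not promptfoo's API or config format: `provider_a`, `provider_b`, `EvalCase`, `run_suite`, and the 0.5 threshold are all hypothetical names and values chosen for illustration, and the "providers" are canned stand-ins rather than real model calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


# Hypothetical stand-ins for real model providers; each returns canned text.
def provider_a(prompt: str) -> str:
    return "Bonjour le monde" if "Hello world" in prompt else "Je ne sais pas"


def provider_b(prompt: str) -> str:
    return "HELLO WORLD"  # simulates a model that ignores the instruction


@dataclass
class EvalCase:
    variables: Dict[str, str]  # values substituted into the prompt template
    must_contain: str          # case-insensitive substring the output must include


# The "declarative" part: the suite is plain data, separate from the runner.
PROMPT_TEMPLATE = "Translate to French: {text}"
SUITE: List[EvalCase] = [
    EvalCase({"text": "Hello world"}, "bonjour"),
    EvalCase({"text": "Good night"}, "bonne nuit"),
]
PROVIDERS: Dict[str, Callable[[str], str]] = {
    "provider_a": provider_a,
    "provider_b": provider_b,
}


def run_suite(
    providers: Dict[str, Callable[[str], str]],
    template: str,
    suite: List[EvalCase],
) -> Dict[str, float]:
    """Run every case against every provider and return per-provider pass rates."""
    rates: Dict[str, float] = {}
    for name, call in providers.items():
        passed = 0
        for case in suite:
            output = call(template.format(**case.variables))
            if case.must_contain.lower() in output.lower():
                passed += 1
        rates[name] = passed / len(suite)
    return rates


pass_rates = run_suite(PROVIDERS, PROMPT_TEMPLATE, SUITE)

# A CI job could fail the build when any provider falls below a threshold.
THRESHOLD = 0.5  # illustrative value
failing = [name for name, rate in pass_rates.items() if rate < THRESHOLD]
print(pass_rates, failing)
```

In a real pipeline, the script would exit with a non-zero status when `failing` is non-empty so that the PR check fails. The actual tool expresses the suite in its own config files rather than in Python data structures.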

Who It's For and Tradeoffs

Great fit if you’re a developer or engineering team that ships LLM features and needs repeatable evaluation, model comparison, or automated red-team checks in CI. It’s particularly useful when privacy matters and you want local runs, or when you support multiple providers and need objective comparisons. Look elsewhere if you need a non-developer GUI-first product for end users, if you require an all-in-one model-hosting/inference platform, or if you only need lightweight ad-hoc manual testing — this tool assumes developer workflows and CI integration.

Where It Fits

Compared with simple ad-hoc prompt experiments or spreadsheet-based comparisons, this tool formalizes tests as code and ties them into CI. Compared with full MLOps suites, it focuses on evaluation, red-teaming, and prompt quality rather than model training or serving. The project has broad adoption (reportedly including use at OpenAI and Anthropic), an active open-source community, and integrations into production workflows at scale.

Information

  • Website: github.com
  • Authors: Promptfoo team (now part of OpenAI)
  • Published date: 2023/04/28

Categories