AI Infra2023

LocalAI

Puts OpenAI-, Anthropic- and Ollama-compatible endpoints in front of 60+ inference backends, so existing client code runs unchanged against local models for text, vision, audio, image and embeddings. Runs CPU-only or accelerated, data stays local.

Visit Website

Introduction

The hard part of "running models locally" was never the model — it was the integration tax. Every local runner speaks its own dialect, so swapping in a self-hosted model usually means rewriting client code. LocalAI removes that tax by being an API shim, not a runtime: it presents the OpenAI, Anthropic, ElevenLabs and Ollama wire formats and dispatches each call to whichever of 60+ backends can serve it.

What Sets It Apart

It is a compatibility surface, not an engine. The same /v1/chat/completions you already call can route to llama.cpp, vLLM, SGLang, transformers, whisper.cpp, diffusers or MLX — you change a model name, not your code.
One process covers modalities most runners treat as separate products: text, vision, speech-to-text, text-to-speech, image and video generation, embeddings, object detection and reranking.
"A small core, not a bundle" — backends ship as separate OCI images pulled on demand, so a CPU-only text deployment never carries CUDA diffusion weight it won't use.
A distributed mode (PostgreSQL + NATS) lets you scale horizontally instead of vertically stacking one big box.

Who It's For

Great fit if you have an app already wired to a hosted API and want to move inference onto your own NVIDIA, AMD, Intel, Apple Silicon or plain-CPU hardware with near-zero client changes, or if you need many modalities behind one endpoint. Look elsewhere if you want a polished chat UI out of the box — this is infrastructure that other clients talk to — or if you only ever need one model in one format, where a single-purpose runner is leaner.

Where It Fits

Against Ollama it trades simplicity for breadth: many more backends and modalities and multi-vendor API shapes, at the cost of a larger surface to configure. It is MIT-licensed and community-driven rather than a vendor's funnel toward a paid tier.

Back

Information

Websitegithub.com
OrganizationsIndependent
AuthorsEttore Di Giacinto (mudler), Community contributors
Published date2023/03/18

More Items

AI API2026

CPA Manager Plus

seakee

Self-hosted CPA / CLIProxyAPI management and observability panel that stores request history, tracks cost/usage/quota, and centralizes provider/credential/OAuth and plugin management. Designed for local analytics, failure diagnosis and account automation without telemetry.

ai-api-management mLOps docker sqlite go+9

Reinforcement Learning Papers2026

LongStraw: Long-Context RL Beyond 2M Tokens under a Fixed GPU Budget

Changhai Zhou, Kieran Liu +18

Enables RL post-training with million-token prompts under a fixed GPU budget by evaluating shared prompt state without autograd, retaining only minimal model state, and replaying short response branches; instantiated as GRPO and demonstrated on Qwen3.6-27B and GLM-5.2 up to multi-million token execution.

RL llm qwen mLOps ai-train+1

AI Infra2026

OpenTelemetry GenAI Semantic Conventions

OpenTelemetry

Defines OpenTelemetry semantic conventions for generative AI telemetry — spans, metrics, and events for GenAI clients, the Model Context Protocol (MCP), and provider-specific integrations. Includes YAML models, human-readable docs, and reference implementations to standardize observability across GenAI deployments.

mcp mcp-client mcp-server mlops ai-api+3