AIAny - vllm

AI Infra2020

Language Model Evaluation Harness

Unified framework for few-shot evaluation of generative language models across 60+ academic benchmarks. Supports multiple model backends (Hugging Face, vLLM, APIs, local servers), configurable prompts/YAML configs, and reproducible exports for leaderboards and research comparisons.

llm ai-leaderboard huggingface vllm github+3

AI Deploy2021

KServe

KServe communityGoogle, IBM +4

Serves predictive and generative ML models on Kubernetes via a single InferenceService CRD, with scale-to-zero, canary rollouts, and an OpenAI-compatible LLM path on vLLM. One autoscaling abstraction over PyTorch, XGBoost, ONNX, and HuggingFace.

mlops ai-inference ai-serving ai-deploy vllm+3

AI Deploy2023

vLLM

vLLM Project, Sky Computing Lab (UC Berkeley)Sky Computing Lab, UC Berkeley, PyTorch Foundation

Open-source LLM inference and serving engine built around PagedAttention, which manages the KV cache like OS virtual memory to cut waste and raise throughput. Supports continuous batching, KV cache sharing, quantization, and an OpenAI-compatible API.

vllm llm ai-serving ai-inference huggingface+4

AI Infra2023

LocalAI

Ettore Di Giacinto (mudler), Community contributorsIndependent

Puts OpenAI-, Anthropic- and Ollama-compatible endpoints in front of 60+ inference backends, so existing client code runs unchanged against local models for text, vision, audio, image and embeddings. Runs CPU-only or accelerated, data stays local.

llm ai-inference ai-serving docker mcp+8

AI Train2023

LLaMA-Factory

hiyougaBeihang University, Peking University

Fine-tunes 100+ LLMs and VLMs from one config file or a no-code web UI, unifying LoRA, QLoRA, full tuning, DPO, PPO, KTO and ORPO behind a single interface. Bundles GaLore, Unsloth, FlashAttention-2 and 2-8bit quantization to fit a single 24GB GPU.

llm ai-train pytorch vllm docker+4

AI Deploy2023

Xorbits Inference (Xinference)

Xorbits (xorbitsai)Xorbits AI

Run any open-source LLM, embedding, speech, image, or multimodal model behind one OpenAI-compatible API — swap GPT for an open model in a single line. Routes across vLLM, llama.cpp, GGML, and TensorRT, scaling from a laptop to a multi-node GPU cluster.

ai-serving ai-inference mlops vllm tensorrt+4

AI Train2023

OpenRLHF

Jian Hu, Xibin Wu +5OpenRLHF Team, ByteDance +3

Trains LLMs with RLHF at scale by splitting actor, critic, reward, and reference models across separate GPU groups via Ray, with vLLM-accelerated generation and DeepSpeed ZeRO-3. Supports PPO, GRPO, REINFORCE++, DPO, plus async and agentic multi-turn RL.

RL vllm llm huggingface pytorch+4

AI Deploy2023

LitServe | Deploy any AI model Lightning fast

Lightning AI

Builds custom AI inference servers in pure Python on top of FastAPI, keeping full control over request logic while batching, GPU autoscaling, streaming, and OpenAI-spec endpoints come built in. Claims a 2x+ throughput edge over plain FastAPI.

mlops ai-inference pytorch docker vllm+3

AI Model2024

Surya

Vikas Paruchuri, Datalab Team

Performs document OCR, layout analysis, reading-order detection and table recognition across 90+ languages using a ~650M-parameter vision–language model; offers per-page and per-block modes and supports GPU (vllm) and CPU/Apple Silicon backends.

ocr multilingual vision vllm ai-inference+5

AI Model2024

MiniCPM-V

OpenBMB, ModelBest +1

Pocket-sized multimodal LLM for efficient image- and video-understanding on mobile and edge devices, featuring mixed 4x/16x visual-token compression (MiniCPM‑V 4.6), compact 1.3B variants, and ready guides for iOS/Android/HarmonyOS deployment.

multimodal vision video LLM huggingface+5

AI Deploy2024

FlashInfer

FlashInfer teamNVIDIA, University of Washington +2

GPU kernel library for LLM inference attention, sampling, and KV-cache, built on block-sparse formats with JIT-compiled customizable templates. Reports 29-69% inter-token-latency cuts vs compiler backends; powers SGLang, vLLM, and MLC-Engine.

ai-inference ai-serving llm vllm tensorrt+3

AI Train2024

Oumi

Oumi Community, oumi-ai

Streamlines the full lifecycle of foundation models — data prep, fine-tuning (SFT/LoRA/QLoRA/GRPO), evaluation, and deployment — with ready-to-run recipes, multi-engine inference support, and cloud/CLI workflows for both laptop experiments and large-scale runs.

mlops llm foundation-model vllm python+5

Tag

Explore by tags

Tag

Explore by tags

All

30u30

ASR

ChatGPT

GNN

IDE

RAG

agent-skills

ai

ai-agent

ai-api

ai-api-management

ai-client

ai-coding

ai-demos

ai-deploy

ai-development

ai-framework

ai-image

ai-image-demos

ai-inference

ai-leaderboard

ai-library

ai-rank

ai-serving

ai-tools

ai-train

ai-video

ai-workflow

AIGC

algorithms

alibaba

amazon

android

anthropic

audio

aws

benchmark

biology

blog

book

bytedance

chatbot

chatgpt

chemistry

claude

claude-code

cli

code

codex

copilot

course

cuda

cursor

deepmind

deepseek

depth

devops

diffusers

docker

drug-discovery

electron

embeddings

engineering

evaluation

facebook

finance

flow-matching

foundation

foundation-model

gemini

gemini-cli

gemma

genomics

gitHub

github