SkyPilot is an open-source AI infrastructure system that lets AI and infrastructure teams run, manage, and scale AI workloads on any compute: Kubernetes, Slurm, 20+ clouds, or on-prem clusters. It exposes a simple unified interface (YAML or Python API, plus a CLI), so the same task definition can be launched on any of these backends without code changes. SkyPilot emphasizes portability, cost efficiency (spot instances, autostop), and operational ergonomics: it handles intelligent scheduling, automatic environment setup and file sync, auto-recovery, and integrations for training, distributed jobs, and model serving.
SkyPilot supports a wide range of infrastructures, including Kubernetes, Slurm, AWS, GCP, Azure, OCI, CoreWeave, Lambda Cloud, RunPod, Fluidstack, Paperspace, Vast.ai, and VMware vSphere, which makes it well suited to hybrid and multi-cloud environments.
Running `sky launch <task.yaml>` provisions resources: SkyPilot finds suitable infrastructure, provisions it, syncs the working directory, runs the setup commands, and starts the job. Example YAML snippet:

```yaml
resources:
  accelerators: A100:8

num_nodes: 1

workdir: ~/torch_examples

setup: |
  pip install -r requirements.txt

run: |
  python main.py --epochs 1
```

SkyPilot started in the Sky Computing Lab at UC Berkeley, and the project links to related academic work (e.g., an NSDI 2023 paper and a Sky Computing whitepaper). It maintains documentation, demos, and a blog with case studies and benchmarks. The project is open source on GitHub, with an active set of examples for training and serving modern LLMs and other AI workloads.
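The launch workflow described above can be sketched as a short CLI session. This is an illustrative sequence, not an exhaustive reference: the cluster name `mycluster` and the idle timeout are assumptions, and exact flags may vary across SkyPilot versions.

```shell
# Launch the task on a cluster named "mycluster" (name is illustrative);
# SkyPilot picks infra, provisions it, syncs workdir, runs setup, starts the job.
sky launch -c mycluster task.yaml

# Inspect clusters and their jobs.
sky status

# Autostop the cluster after 10 idle minutes to control cost.
sky autostop mycluster -i 10

# Tear the cluster down entirely when finished.
sky down mycluster
```

Autostop and teardown are how SkyPilot's cost-efficiency features show up in day-to-day use: idle clusters are stopped or removed instead of accruing charges.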
Visit the official documentation for installation, quickstart, and CLI references. The typical install is via pip, with nightly and from-source options available for the latest fixes and features.
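A minimal install might look like the following. The base package and nightly package names match what the project publishes on PyPI; the specific extras shown are an assumption about which clouds you intend to use.

```shell
# Stable release; extras pull in per-cloud dependencies (aws/gcp are examples).
pip install "skypilot[aws,gcp]"

# Or the nightly build for the latest fixes and features.
pip install skypilot-nightly
```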