GPT-3: Language Models are Few-Shot Learners

At 175 billion parameters, this autoregressive model becomes a strong few-shot learner: it handles translation, QA, and reasoning from a few prompt examples with no gradient updates, establishing in-context learning as an alternative to fine-tuning.

Introduction

Fine-tuning was the assumed cost of using a language model in 2020: collect thousands of labeled examples per task, then update the weights. GPT-3's argument is that at sufficient scale you can skip that entirely — describe the task in the prompt, show a few examples, and the frozen model adapts on the fly. The 175-billion-parameter size is the headline, but "no gradient updates" is the idea that changed how people use models.
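To make "describe the task in the prompt, show a few examples" concrete, here is a minimal Python sketch of how such a prompt could be assembled. The function name build_prompt, its parameters, and the " => " separator are illustrative assumptions, not an API from the paper or from OpenAI. The translation pairs follow the style of the paper's English-to-French illustration. The only thing that changes between zero-shot and few-shot is how many demonstrations are placed in the text; the model's weights are never touched.

```python
from typing import List, Tuple


def build_prompt(
    task_description: str,
    demonstrations: List[Tuple[str, str]],
    query: str,
    separator: str = " => ",  # assumed formatting choice, not mandated by the paper
) -> str:
    """Assemble a plain-text prompt: task description, K solved examples, one open query."""
    lines = [task_description]
    # Each demonstration is a completed input/output pair shown to the frozen model.
    for source, target in demonstrations:
        lines.append(f"{source}{separator}{target}")
    # The final line is left open so the model's continuation is the answer.
    lines.append(f"{query}{separator}".rstrip())
    return "\n".join(lines)


demos = [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")]

# Zero-shot: task description only, no demonstrations.
zero_shot = build_prompt("Translate English to French:", [], "cheese")

# Few-shot: the same call with K = 2 demonstrations prepended.
few_shot = build_prompt("Translate English to French:", demos, "cheese")

print(few_shot)
```

The resulting string would be sent as-is to an autoregressive model, and whatever text it generates after the final separator is read as the prediction. Switching tasks means changing the string, not retraining.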

Key Findings

In-context learning scales with size. Zero-, one-, and few-shot accuracy all improve sharply as the model grows, with few-shot closing much of the gap to fine-tuned systems on many benchmarks.
One model, many tasks, no retraining. Translation, cloze, QA, arithmetic, and word unscrambling are all driven purely through text prompts — the operational basis for prompt engineering.
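Honest about limits. The paper documents where few-shot still lags behind fine-tuned systems (for example natural language inference and some reading-comprehension benchmarks), the risk that web-scale training data contaminates test sets, and the tasks where multi-step reasoning remains weak.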
Human-indistinguishable text. Evaluators struggled to separate GPT-3 news samples from human-written ones, prompting an extended discussion of societal impact.

Great Fit If

Read it for the empirical foundation of prompting and the few-shot paradigm, and the scaling evidence that motivated much of what followed. Look elsewhere if you want architectural novelty — GPT-3 is deliberately a scaled-up GPT-2 — or the instruction-following and alignment behavior that arrived later with InstructGPT and ChatGPT.

Information

Website: arxiv.org
Organizations: OpenAI
Authors: Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell …
Published date: 2020/05/28

More Items

Large Language Model Papers, 2026

RAGU: A Multi-Step GraphRAG Engine with a Compact Domain-Adapted LLM

Mikhail Komarov, Ivan Bondarenko +6

Builds structured knowledge graphs for retrieval-augmented generation via a multi-step GraphRAG pipeline that separates extraction from consolidation. Key features include typed two-stage extraction, DBSCAN-backed deduplication, LLM summarization, Leiden community detection, and a compact 7B extractor model (Meno-Lite-0.1).

RAG LLM nlp retrieval benchmark +5

Large Language Model Papers, 2026

Loop the Loopies!

Zitian Gao, Yilong Chen +5

A looped-Transformer LLM series using Mixture-of-Experts (20B with 2B active; 6B with 0.6B active) that trades extra pretraining compute for repeated looping. Shows superior compute-efficiency versus matched-compute vanilla baselines and attains gold-medal performance on 2025 IMO and IPhO after a post-training pipeline.

transformers LLM reasoning evaluation paper +2

Large Language Model Papers, 2026

Cura 1T: Specialized Model for Agentic Healthcare

Haolin Chen, Leon Qi +8 (actAVA AI)

Specialized LLM for clinical workflows trained via a human-gated self-evolution loop to improve patient consultation, multimodal clinical reasoning, interactive diagnosis, and EHR tool use. Iteratively refines targeted synthetic and curated data based on benchmark failures to raise specific capabilities without broad regressions.

LLM multimodal agent-skills evaluation reasoning +2
