Large Language Model Papers · 2018

GPT-1: Improving Language Understanding by Generative Pre-Training

Alec Radford, Karthik Narasimhan +2 · OpenAI

Introduced the two-stage recipe behind the GPT lineage: unsupervised generative pre-training on unlabeled text, then supervised fine-tuning per task. A single 12-layer Transformer decoder beat bespoke architectures on 9 of 12 NLP benchmarks.

openai transformers foundation-model paper LLM +1

AI Model · 2018

Transformers

Hugging Face

Provides unified model definitions and a single API for pretrained text, vision, audio, and multimodal models for both training and inference. Emphasizes cross-framework compatibility (PyTorch/TF/JAX), pipeline-based inference, and direct access to 1M+ Hub checkpoints.

transformers huggingface ai-library pytorch python +6

AI Dataset · 2022

Measuring Massive Multitask Language Understanding (MMLU)

cais (Hugging Face dataset curator), Dan Hendrycks et al. (original MMLU paper)

A 57-subject multiple-choice benchmark for measuring broad language understanding in LLMs; provides per-subject configs and test/dev/auxiliary_train splits for few-/zero-shot evaluation, widely used for model comparison and academic reporting.

huggingface NLP LLM pandas transformers +1

AI Dataset · 2022

Wikimedia / Wikipedia (Hugging Face dataset)

Wikimedia

Provides cleaned, per-language snapshots of Wikipedia articles (id, url, title, text) packaged as Hugging Face dataset configs (Parquet). Covers 300+ language configs and dated dumps — useful for language modeling, multilingual NLP, retrieval, and RAG pipelines.

huggingface multilingual nlp LLM transformers +2

AI Train · 2023

NVIDIA PhysicsNeMo

NVIDIA

Modular PyTorch-based framework for building, training, and deploying physics-informed ML models (neural operators, PINNs, GNNs, diffusion). Provides GPU‑optimized training, domain-specific datapipes for meshes/point clouds, distributed scaling and a model zoo.

nvidia physics pytorch ai-framework ai-train +6

AI Dataset · 2023

TinyStories

roneneldan

Contains short, small-vocabulary stories synthetically generated by GPT-3.5 and GPT-4 for training and evaluating compact language models. Includes multiple splits, a GPT-4-only V2 subset, and archive files with prompts and metadata for reproducible experiments.

huggingface NLP LLM prompt-engineering transformers +1

AI Dataset · 2024

FineWeb-Edu

HuggingFaceFW, Anton Lozhkov +3

Provides ~1.3 trillion tokens of web pages filtered for educational quality using a classifier trained on LLM-generated annotations; includes per-crawl configs, smaller random samples (10B/100B/350B tokens), and the classifier code and model for reproducible filtering.

huggingface LLM nlp foundation-model ai-train +2

AI Model · 2024

ESM (Biohub/esm)

Biohub (EvolutionaryScale Team)

Provides code, pretrained weights, and tooling for protein language models and structure prediction — including ESMC, ESMFold2, sparse autoencoders (SAEs), and the ESM Atlas. Includes model checkpoints, tutorials, Hugging Face & Biohub integration, and an MIT license.

foundation-model transformers huggingface pytorch biology +4

AI Dataset · 2024

Wikimedia Structured Contents Dataset

Wikimedia Enterprise, Wikimedia Foundation

Provides pre-parsed Parquet snapshots of English and French Wikipedia articles with structured fields (sections, infoboxes, tables, references, images) and credibility signals — optimized for large-scale analysis, retrieval-augmented generation, and model development.

huggingface multilingual nlp RAG pandas +3

Embodied AI · 2024

NVIDIA Cosmos

NVIDIA

Provides an open platform of omnimodal world models, datasets, and tools to build Physical AI — joint perception, generation, and action reasoning for robots, autonomous vehicles, and smart infrastructure. Supports images, video, audio, and action-conditioned workflows.

nvidia multimodal foundation-model diffusers vllm +9

Large Language Model Tutorials · 2025

Train LLM From Scratch

Fareed Khan

Provides end-to-end PyTorch scripts to download/prepare data, implement a transformer from scratch, train LLMs (13M→billion-scale) and generate text. Emphasizes educational clarity and single‑GPU experiments; useful for researchers or hobbyists, but large-scale training still requires substantial compute and engineering.

pytorch LLM python nlp ai-train +3

AI Dataset · 2025

Ultra-FineWeb

openbmb

Provides a high-quality web corpus for LLM pretraining, efficiently verified and filtered, with ~1 trillion English tokens and ~120 billion Chinese tokens in separate English and Chinese Parquet splits. Designed for large-scale pretraining experiments and data-filtering research.

LLM huggingface nlp multilingual transformers +2