GitHub - tanishqkumar/beyond-nanogpt: Minimal and annotated implementations of key ideas from modern deep learning research.
Summary
Main Summary
"Beyond NanoGPT" se presenta como un repositorio educativo fundamental que busca elevar a principiantes en LLMs al nivel de investigadores de IA, sirviendo de puente entre implementaciones básicas y el deep learning de vanguardia. Este recurso integral ofrece casi 100 implementaciones anotadas y desde cero de técnicas cruciales y modernas, abarcando un amplio espectro de la IA. Desde optimizaciones de modelos de lenguaje como el KV caching y la decodificación especulativa, hasta arquitecturas avanzadas como Vision Transformers (ViT) y Mamba, pasando por diversas variantes de atención y modelos generativos como los Denoising Diffusion Models y GANs. El repositorio también explora algoritmos seminales de aprendizaje por refuerzo (RL) como PPO, A3C y AlphaZero, e incluso fundamentos de sistemas como la paralelización de datos y tensores. Su valor pedagógico reside en la implementación manual de cada componente, con comentarios detallados que desentrañan complejidades a menudo omitidas en la literatura y el código de producción, fomentando un aprendizaje activo a través de la lectura, experimentación y recreación del código.
Key Elements
- Technical breadth and depth: Beyond NanoGPT stands out for its exhaustive compilation of nearly 100 modern techniques, implemented from scratch and annotated in detail. It covers everything from advanced architectures (ViT, DiT, Mamba, MoE) and attention variants (Multi-Latent, Sparse) to LLM optimizations (KV Caching, Speculative Decoding, RoPE) and generative models.
Contents
Beyond NanoGPT: Go From LLM Beginner to AI Researcher!
Beyond-NanoGPT is a minimal, educational repo aiming to bridge the gap between nanoGPT and research-level deep learning. It contains annotated, from-scratch implementations of almost 100 crucial modern techniques in frontier deep learning, so that newcomers can learn enough to start running experiments of their own.
The repo implements everything from KV caching and speculative decoding for LLMs to architectures like vision transformers and MLP-mixers; from attention variants like linear or multi-latent attention to generative models like denoising diffusion models and flow matching algorithms; from landmark RL papers like PPO, A3C, and AlphaZero to systems fundamentals like GPU communication algorithms and data/tensor parallelism.
Because everything is implemented by hand, the code comments explain the especially subtle details that are often glossed over in both papers and production codebases.
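To make that concrete, here is a minimal sketch (my own illustration, not code from the repo) of the core idea behind KV caching during autoregressive decoding: under causal attention, keys and values for past tokens never change, so each decode step only projects the new token and appends to a cache. The single-head layout and the fused `W_qkv` projection below are assumptions made to keep the bookkeeping visible; the repo's language-models/KV_cache.ipynb will differ.

```python
import torch
import torch.nn.functional as F

def attend_with_cache(x_t, W_qkv, cache):
    """One decode step for a single attention head, reusing cached K/V.

    x_t:    (batch, d_model) -- the newly generated token's hidden state
    W_qkv:  (d_model, 3 * d_head) -- fused projection weights (hypothetical layout)
    cache:  dict with 'k' and 'v' of shape (batch, t, d_head), empty on the first step
    """
    q, k_new, v_new = (x_t @ W_qkv).chunk(3, dim=-1)       # each (batch, d_head)
    k_new, v_new = k_new.unsqueeze(1), v_new.unsqueeze(1)  # (batch, 1, d_head)
    # The whole point of KV caching: past keys/values are fixed under causal
    # attention, so we append the new ones instead of recomputing everything.
    cache["k"] = k_new if "k" not in cache else torch.cat([cache["k"], k_new], dim=1)
    cache["v"] = v_new if "v" not in cache else torch.cat([cache["v"], v_new], dim=1)
    scores = (cache["k"] @ q.unsqueeze(-1)).squeeze(-1) / q.size(-1) ** 0.5  # (batch, t)
    w = F.softmax(scores, dim=-1)
    return (w.unsqueeze(1) @ cache["v"]).squeeze(1)          # (batch, d_head)

# Toy usage: decode 5 steps with random weights standing in for a trained model.
torch.manual_seed(0)
d_model, d_head = 32, 16
W_qkv = torch.randn(d_model, 3 * d_head) / d_model ** 0.5
cache = {}
for t in range(5):
    out = attend_with_cache(torch.randn(2, d_model), W_qkv, cache)
print(out.shape, cache["k"].shape)  # torch.Size([2, 16]) torch.Size([2, 5, 16])
```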
A glimpse of some plots you can make: (left) language model speedups from attention-variants/linear_attention.ipynb; (center) samples from a small denoising diffusion model trained on MNIST in generative-models/train_ddpm.py; (right) reward over time for a small MLP policy on CartPole in rl/fundamentals/train_ppo.py.
LESSONS.md documents some of the things I've learned in the months spent writing this codebase.
Quickstart
- Clone the repo:
  git clone https://github.com/tanishqkumar/beyond-nanogpt.git
- Get minimal dependencies:
  pip install torch numpy torchvision wandb tqdm transformers datasets diffusers matplotlib pillow jupyter gym
- Start learning! The code is meant for you to read carefully, hack around with, then re-implement yourself from scratch and compare against. You can run the .py files with vanilla Python, for example:
  cd architectures/
  python train_dit.py
  or, for instance:
  cd rl/fundamentals/
  python train_reinforce.py --verbose --wandb
  Everything is written to be run on a single GPU. The code is self-documenting, with comments that build intuition and elaborate on subtleties I found tricky to implement. Arguments are specified at the bottom of each file. Jupyter notebooks are meant to be stepped through.
Current Implementations and Roadmap
Architectures
- Basic Transformer: language-models/transformer.py and train_naive.py [paper]
- Vision Transformer (ViT): architectures/train_vit.py [paper]
- Diffusion Transformer (DiT): architectures/train_dit.py [paper]
- Recurrent Neural Network (RNN): architectures/train_rnn.py [paper]
- Residual Networks (ResNet): architectures/train_resnet.py [paper]
- MLP-Mixer: architectures/train_mlp_mixer.py [paper]
- LSTM: architectures/train_lstm.py [paper]
- Mixture-of-Experts (MoE): architectures/train_moe.py [paper] (see the routing sketch after this list)
- Mamba: architectures/train_mamba.py [paper]
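As a taste of what an entry like the MoE implementation involves, here is a minimal top-k routing sketch in PyTorch. It is my own illustration under simple assumptions (a dense loop over experts, no auxiliary load-balancing loss, a made-up `TinyMoE` class name), not the code in architectures/train_moe.py:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Minimal top-k mixture-of-experts MLP block (illustrative sketch)."""

    def __init__(self, d_model=64, d_hidden=128, n_experts=4, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # learns which experts each token should use
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                             # x: (batch, seq, d_model)
        B, T, D = x.shape
        flat = x.reshape(B * T, D)                    # route every token independently
        topk_logits, topk_idx = self.router(flat).topk(self.k, dim=-1)
        gates = F.softmax(topk_logits, dim=-1)        # renormalize over the chosen experts only
        out = torch.zeros_like(flat)
        # Dense loop over experts for clarity; real implementations batch tokens per expert.
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = topk_idx[:, slot] == e         # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += gates[mask, slot:slot + 1] * expert(flat[mask])
        return out.reshape(B, T, D)

x = torch.randn(2, 10, 64)
print(TinyMoE()(x).shape)  # torch.Size([2, 10, 64])
```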
Attention Variants
- Vanilla Self-Attention: attention-variants/vanilla_attention.ipynb [paper]
- Multi-head Self-Attention: attention-variants/mhsa.ipynb [paper]
- Grouped-Query Attention: attention-variants/gqa.ipynb [paper]
- Linear Attention: attention-variants/linear_attention.ipynb [paper] (see the sketch after this list)
- Sparse Attention: attention-variants/sparse_attention.ipynb [paper]
- Cross Attention: attention-variants/cross_attention.ipynb [paper]
- Multi-Latent Attention: attention-variants/mla.ipynb [paper]
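For example, the idea behind the linear attention notebook (and the speedup plot mentioned above) can be sketched as follows: replace softmax(QKᵀ)V with a positive feature map, here the common elu(x)+1 choice, so that causal attention becomes a running sum and the T×T score matrix never materializes. This is my own illustrative sketch, not the notebook's code; it trades the quadratic score matrix for a cumulative-sum tensor of per-token outer products:

```python
import torch
import torch.nn.functional as F

def causal_linear_attention(q, k, v, eps=1e-6):
    """Causal linear attention via prefix sums (illustrative sketch).

    q, k, v: (batch, heads, seq, dim)
    """
    phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1          # positive feature map
    # S_t = sum_{s<=t} phi(k_s) v_s^T  and  z_t = sum_{s<=t} phi(k_s), via cumulative sums.
    kv = torch.einsum("bhtd,bhte->bhtde", phi_k, v).cumsum(dim=2)   # (B, H, T, D, E)
    z = phi_k.cumsum(dim=2)                                          # (B, H, T, D)
    num = torch.einsum("bhtd,bhtde->bhte", phi_q, kv)                # numerator per token
    den = torch.einsum("bhtd,bhtd->bht", phi_q, z).unsqueeze(-1)     # normalizer per token
    return num / (den + eps)

q, k, v = (torch.randn(2, 4, 128, 32) for _ in range(3))
print(causal_linear_attention(q, k, v).shape)  # torch.Size([2, 4, 128, 32])
```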
Language Models
- Optimized Dataloading: language-models/dataloaders [reference]
  - Producer-consumer asynchronous dataloading
  - Sequence packing
- Byte-Pair Encoding: language-models/bpe.ipynb [paper]
- KV Caching: language-models/KV_cache.ipynb [reference]
- Speculative Decoding: language-models/speculative_decoding.ipynb [paper]
- RoPE embeddings: language-models/rope.ipynb [paper] (see the sketch after this list)
- Multi-token Prediction: language-models/train_mtp.py [paper]
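As an example of what one of these entries covers, here is a compact sketch of rotary position embeddings (RoPE): each pair of channels in a query or key is rotated by a position-dependent angle, so attention scores depend only on relative positions. This is my own illustration, not the code in language-models/rope.ipynb:

```python
import torch

def apply_rope(x, base=10000.0):
    """Apply rotary position embeddings to queries/keys (illustrative sketch).

    x: (batch, heads, seq, dim) with even dim.
    """
    B, H, T, D = x.shape
    # One frequency per channel pair; lower pairs rotate faster.
    theta = base ** (-torch.arange(0, D, 2, device=x.device) / D)         # (D/2,)
    angles = torch.arange(T, device=x.device)[:, None] * theta[None, :]   # (T, D/2)
    cos, sin = angles.cos(), angles.sin()                                  # broadcast over B, H
    x1, x2 = x[..., 0::2], x[..., 1::2]                                    # even / odd channels
    # 2D rotation of each (x1, x2) pair by its position-dependent angle.
    rot = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rot.flatten(-2)                                                 # re-interleave pairs

x = torch.randn(1, 2, 6, 8)
print(apply_rope(x).shape)  # torch.Size([1, 2, 6, 8])
```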
Reinforcement Learning
- Deep RL
  - Fundamentals: rl/fundamentals (e.g., train_reinforce.py, train_ppo.py; a PPO loss sketch follows this list)
  - Actor-Critic and Key Variants: rl/actor-critic
  - Model-based RL: rl/model-based
    - Model Predictive Control (MPC): train_mpc.py [reference]
    - Expert Iteration (MCTS): train_expert_iteration.py [paper]
    - Probabilistic Ensembles with Trajectory Sampling (PETS)
  - Neural Chess Engine (AlphaZero): rl/chess [paper]
    - Define the architecture and environment: model.py and env.py
    - MCTS for move search: mcts.py
    - Self-play: train.py
    - Dynamic batching and multiprocessing: mcts.py
- LLMs
  - RLHF a base model with UltraFeedback
  - DPO a base model with UltraFeedback
  - GRPO for reasoning: outcome reward on MATH
  - Distributed RLAIF for tool use
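To give a flavor of the deep-RL fundamentals (the plot mentioned near the top shows reward over time from rl/fundamentals/train_ppo.py), here is a sketch of PPO's clipped surrogate loss, the piece that distinguishes PPO from a vanilla policy gradient. It is my own minimal illustration, assuming advantages and the old log-probabilities are already computed, and is not the repo's implementation:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO's clipped surrogate policy loss (illustrative sketch).

    logp_new:   log pi_theta(a|s) under the current policy, shape (batch,)
    logp_old:   log-probs recorded when the actions were sampled, shape (batch,)
    advantages: estimated advantages A(s, a), shape (batch,)
    """
    ratio = (logp_new - logp_old).exp()                  # pi_new / pi_old per action
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Taking the elementwise min makes the objective pessimistic: the policy gets
    # no extra credit for pushing the ratio outside the trust region.
    return -torch.min(unclipped, clipped).mean()

# Toy check: the loss is differentiable w.r.t. the new log-probs.
logp_new = torch.randn(64, requires_grad=True)
loss = ppo_clip_loss(logp_new, logp_new.detach() + 0.1 * torch.randn(64), torch.randn(64))
loss.backward()
print(loss.item(), logp_new.grad.shape)
```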
Generative Models
- Generative Adversarial Networks (GAN): generative-models/train_gan.py [paper]
- Pix2Pix (Conditional GANs): generative-models/train_pix2pix.py [paper]
- Variational Autoencoders (VAE): generative-models/train_vae.py [paper]
  - Train an autoencoder for reconstruction: generative-models/train_autoencoder.py
- Neural Radiance Fields (NeRF)
- Denoising Diffusion Probabilistic Models (DDPM): generative-models/train_ddpm.py [paper] (see the training-loss sketch after this list)
- Classifier-based diffusion guidance: generative-models/ddpm_classifier_guidance.py [paper]
- Classifier-free diffusion guidance: generative-models/ddpm_classifier_free_guidance.py [paper]
- Flow matching: generative-models/train_flow_matching.py [paper]
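As an illustration of the DDPM entry, here is a minimal sketch of the epsilon-prediction training objective: sample a timestep, noise a clean example with the closed-form forward process, and regress the network onto the injected noise. This is my own toy version (a linear noise schedule and an MLP that ignores the timestep), not the code in generative-models/train_ddpm.py:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                 # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)        # \bar{alpha}_t, the cumulative signal fraction

def ddpm_loss(model, x0):
    """The 'simple' DDPM loss: predict the noise added by q(x_t | x_0)."""
    t = torch.randint(0, T, (x0.size(0),), device=x0.device)
    noise = torch.randn_like(x0)
    a_bar = alphas_bar.to(x0.device)[t].view(-1, *([1] * (x0.dim() - 1)))
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise       # closed-form forward process
    # A real model also conditions on t (e.g. sinusoidal time embeddings);
    # this toy MLP ignores t to keep the sketch short.
    return F.mse_loss(model(x_t), noise)

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 784))
x0 = torch.randn(8, 784)                               # stand-in for flattened MNIST digits
print(ddpm_loss(model, x0).item())
```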
MLSys
- GPU Communication Algorithms (scatter, gather, ring/tree allreduce): mlsys/comms.py [reference] (a ring allreduce sketch follows this list)
- Distributed Data Parallel: mlsys/train_ddp.py [paper]
- Tensor Parallel
- Ring Attention (Context Parallel)
- Paged Attention
- Flash Attention in Triton
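To illustrate the communication-algorithms entry, here is a plain-Python simulation of ring allreduce (a reduce-scatter phase followed by an all-gather phase). It is my own sketch of the algorithm, not the contents of mlsys/comms.py, and it simulates the ranks sequentially rather than with real GPU communication:

```python
import numpy as np

def ring_allreduce(chunks_per_rank):
    """Simulate ring allreduce on N 'ranks' in NumPy (illustrative sketch).

    Each rank holds a vector split into N chunks. Reduce-scatter: after N-1 steps,
    rank i holds the fully reduced chunk (i+1) % N. All-gather: another N-1 steps
    circulate the reduced chunks so every rank ends up with the full sum. Each step
    moves only 1/N of the data, which is what makes the ring bandwidth-efficient.
    """
    N = len(chunks_per_rank)
    data = [[c.copy() for c in rank] for rank in chunks_per_rank]
    # Reduce-scatter: at step s, rank i sends chunk (i - s) % N to its right neighbor,
    # which accumulates it into its own copy of that chunk.
    for step in range(N - 1):
        for i in range(N):
            send_idx = (i - step) % N
            data[(i + 1) % N][send_idx] += data[i][send_idx]
    # All-gather: at step s, rank i forwards the reduced chunk (i + 1 - s) % N.
    for step in range(N - 1):
        for i in range(N):
            send_idx = (i + 1 - step) % N
            data[(i + 1) % N][send_idx] = data[i][send_idx].copy()
    return data

ranks = [[np.full(4, r + 1.0) for _ in range(4)] for r in range(4)]  # 4 ranks, 4 chunks each
result = ring_allreduce(ranks)
print(result[0][0])  # every chunk on every rank is now [10. 10. 10. 10.] (1+2+3+4)
```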
[Coming Soon]: RAG, Agents, Multimodality, Robotics, Evals.
Notes
- The codebase will generally work on either a CPU or a GPU, but most implementations essentially require a GPU, since they will be untenably slow otherwise. I recommend a consumer laptop with a GPU, paying for Colab/Runpod, or simply asking a compute provider or local university for a compute grant if those are out of budget (this works surprisingly well; people are very generous).
- Most .py scripts take --verbose and --wandb as command-line arguments when you run them, enabling detailed logging and sending logs to wandb, respectively (a hypothetical sketch of this pattern follows these notes). Feel free to hack these to your needs.
- Feel free to email me at tanishq@stanford.edu with feedback and implementation/feature requests, and to raise any bugs as GitHub issues. I am committed to implementing new techniques people want over the next month, and welcome contributions or bug fixes from others.
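For reference, the CLI convention described above might look roughly like the following. Only --verbose and --wandb come from the repo's own description; everything else here (the --lr flag, the project name) is a hypothetical sketch, not the actual argument block at the bottom of any script:

```python
import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Train a small model (illustrative sketch).")
    parser.add_argument("--verbose", action="store_true", help="print detailed logs")
    parser.add_argument("--wandb", action="store_true", help="send logs to Weights & Biases")
    parser.add_argument("--lr", type=float, default=3e-4, help="hypothetical extra hyperparameter")
    args = parser.parse_args()

    if args.wandb:
        import wandb  # listed in the repo's minimal dependencies
        wandb.init(project="beyond-nanogpt-sketch", config=vars(args))
    if args.verbose:
        print(f"Running with args: {vars(args)}")
```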
If this codebase helped you, please share it and give it a star! You can cite the repository in your work as follows.
@misc{kumar2025beyond,
  author = {Tanishq Kumar},
  title = {Beyond-NanoGPT: From LLM Beginner to AI Researcher},
  year = {2025},
  howpublished = {\url{https://github.com/tanishqkumar/beyond-nanogpt}},
  note = {Accessed: 2025-01-XX}
}
Happy coding, and may your gradients never vanish!
Source: GitHub
