✦ Visual Learning Guide · 2025 Edition

AI that
Creates.

Understand Generative AI from neurons to ChatGPT — interactive visuals, architecture diagrams, and curated resources all in one place.

⚡ How a Transformer Processes Your Prompt
📝 Prompt Text
Raw input string
🔢 Tokenizer
Split → token IDs
📐 Embeddings
High-dim vectors
🕸️ Attention
Context weighting
✨ Output
Next token logits
[The] [quick] [brown] [fox] [▶ 5837] [▶ 2214] [▶ ...] [EOS]
🟣 Live generation
Generative AI can create text, images, music, code, and video from a simple prompt
1T+
Parameters (GPT-4 est.)
128K
Token context window
50+
Foundation models
Creative possibilities

6 Core Concepts

Everything in Generative AI builds on these fundamental ideas.

🔀

Generative vs. Discriminative

Discriminative models learn the boundary between classes (e.g., "is this a cat?"). Generative models learn the full distribution of the data, enabling them to create new samples.

🔍
DISCRIMINATIVE
P(y|x)
Input→Label
Classify, detect
GENERATIVE
P(x) or P(x|y)
Noise→Sample
Create, imagine
GAN · VAE · Diffusion · LLM
🧠

Neural Network Layers

Information flows through stacked layers of interconnected neurons. Each layer learns increasingly abstract representations — from edges to concepts.

Input
↓ weights & biases ↓
Hidden 1
↓ activation (ReLU/GeLU) ↓
Hidden 2
↓ softmax ↓
Output
Backprop · Gradient Descent · ReLU
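The flow above can be sketched in plain Python: a toy 3→4→4→2 network with random, untrained weights (every number here is illustrative, not a trained model).

```python
import math
import random

random.seed(0)

def linear(x, W, b):
    """Dense layer: each output is a weighted sum of the inputs plus a bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(v):
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    s = sum(exps)
    return [e / s for e in exps]

def rand_layer(n_out, n_in):
    """Random (untrained) weights, zero biases."""
    return ([[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out)

W1, b1 = rand_layer(4, 3)   # Input (3) -> Hidden 1 (4)
W2, b2 = rand_layer(4, 4)   # Hidden 1 -> Hidden 2
W3, b3 = rand_layer(2, 4)   # Hidden 2 -> Output (2 classes)

x = [0.5, -1.2, 0.3]
h1 = relu(linear(x, W1, b1))         # weights & biases, then activation
h2 = relu(linear(h1, W2, b2))
probs = softmax(linear(h2, W3, b3))  # softmax turns scores into probabilities
```

Training would adjust the weights via backprop and gradient descent; the forward pass itself is just this chain of matrix-vector products and nonlinearities.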
🕸️

Transformer Attention

Self-attention lets every token look at every other token and decide what's relevant. "The bank by the river" vs "the bank account" — attention resolves context.

Attention heatmap — brighter = stronger attention
The cat sat on the mat
Q·K·V · Multi-head · Softmax
🔮

LLM Token Prediction

Language models don't "think" — they predict the most likely next token given all previous tokens. Repeated thousands of times, this creates coherent text.

Next token probabilities:
"The capital of France is ___"
Paris
78%
Lyon
9%
France
6%
a
4%
...
3%
Temperature · Top-p · Greedy/Beam
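Temperature and top-p can be demonstrated on a made-up version of the distribution above (the logit values are illustrative, not from a real model).

```python
import math, random

# Illustrative logits for "The capital of France is ___"
logits = {"Paris": 3.0, "Lyon": 0.84, "France": 0.43, "a": 0.03}

def apply_temperature(logits, temp):
    """softmax(logit / T): low T sharpens the distribution, high T flattens it."""
    vals = [v / temp for v in logits.values()]
    m = max(vals)
    exps = [math.exp(v - m) for v in vals]
    s = sum(exps)
    return dict(zip(logits, (e / s for e in exps)))

def top_p(probs, p=0.9):
    """Keep the smallest set of most-likely tokens reaching cumulative prob p."""
    kept, cum = {}, 0.0
    for tok, pr in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[tok] = pr
        cum += pr
        if cum >= p:
            break
    z = sum(kept.values())
    return {t: pr / z for t, pr in kept.items()}   # renormalize survivors

random.seed(0)
probs = top_p(apply_temperature(logits, temp=0.8), p=0.9)
tokens, weights = zip(*probs.items())
next_token = random.choices(tokens, weights=weights)[0]
```

With these numbers the long tail ("France", "a") is filtered out by top-p, and sampling picks between "Paris" and "Lyon"; greedy decoding would always take "Paris".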
🌀

Diffusion: Noise → Image

Diffusion models learn to reverse a noise process. Training: gradually add Gaussian noise until the image is pure static. Inference: start from noise and iteratively denoise using the learned model.

Forward (add noise) vs Reverse (denoise):
🖼️
+noise
🌫
+noise
▓▒
+noise
U-Net
denoise
1000 steps (DDPM) → ~20 steps (DDIM) → 4 steps (LCM)
DDPM · Stable Diffusion · FLUX
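The forward (noising) process has a closed form, x_t = √ᾱ_t·x₀ + √(1−ᾱ_t)·ε, which a few lines of plain Python can sketch. The β schedule follows DDPM's linear range; the "image" is a made-up 4-pixel row.

```python
import math, random

random.seed(0)

# Linear β schedule from 1e-4 to 0.02, as in the DDPM paper
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alpha_bar, prod = [], 1.0
for b in betas:
    prod *= 1.0 - b
    alpha_bar.append(prod)        # ᾱ_t = Π (1 − β_s): remaining signal fraction

def q_sample(x0, t):
    """Closed-form forward process: x_t = √ᾱ_t·x0 + √(1−ᾱ_t)·ε, ε ~ N(0,1)."""
    a = alpha_bar[t]
    return [math.sqrt(a) * x + math.sqrt(1 - a) * random.gauss(0, 1) for x in x0]

pixel_row = [0.9, 0.1, 0.5, 0.7]          # a made-up 4-pixel "image"
slightly_noisy = q_sample(pixel_row, 10)  # early step: image still visible
pure_static = q_sample(pixel_row, T - 1)  # final step: essentially pure noise
```

Training teaches a U-Net to predict the ε that was added at each step; inference runs the learned denoiser in reverse, from pure static back to a clean image.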
🌐

Multimodal AI

Modern models accept and produce multiple modalities — text, images, audio, video, and code. A single model can read a chart, describe it, and write code to recreate it.

📝
Text
🖼️
Images
🎵
Audio
💻
Code
GPT-4o · Gemini 1.5 · Claude 3 · LLaVA
Vision · Audio · Cross-modal

GenAI Landscape

The major model families shaping what's possible today.


Prompt Engineering

The quality of your prompt determines the quality of the output. Here's how the same request evolves from weak to powerful.

❌ Weak
Your prompt:
"Write about marketing."
Typical output:
Marketing is the process of promoting and selling products or services. It involves advertising, market research, and various strategies to reach customers...
❌ No audience specified
❌ No format/length
❌ No context or goal
❌ Too broad — AI guesses everything
⚡ Better
Your prompt:
"Write a short blog post about content marketing for small businesses. Use simple language and include 3 tips."
Output quality:
Content Marketing for Small Businesses: 3 Tips to Get Started. Content marketing can transform how customers find you online. Here are three proven strategies...
✅ Format specified (blog post)
✅ Audience defined (small biz)
✅ Constraint given (3 tips)
⚠️ Still missing tone & length
🚀 Expert
Your prompt:
"You are a B2B marketing expert. Write a 400-word LinkedIn article for SaaS founders with <10 employees. Cover: (1) why content beats cold outreach, (2) the '1-10-100' content repurposing rule, (3) a 30-day quick-start plan. Tone: authoritative but conversational. End with a CTA."
Output quality:
Why Your Coldest Leads Are Actually Waiting for Your Blog Post... [Expert, targeted, actionable content exactly as requested]
✅ Role assigned (persona)
✅ Exact length + platform
✅ Specific audience + context
✅ Structured output required
✅ Tone + CTA specified

🛠️ Core Prompt Engineering Techniques

🎭
Role Prompting
"You are a senior data scientist..." — sets expertise and perspective
🪜
Chain of Thought
"Think step by step..." — forces explicit reasoning before the answer
📌
Few-Shot Examples
Provide 2–5 input/output examples before your actual request
🎯
Constraint Setting
Length, format, tone, audience, forbidden topics, output structure
🔁
Iterative Refining
"Make it shorter / more formal / add X / remove Y" — refine in turns
📋
Template + Fill
Provide a skeleton template and ask the model to fill in specific sections
🔬
Self-Critique
"Now critique your answer and improve it" — forces self-correction
🌲
Tree of Thought
Explore multiple reasoning branches, evaluate each, pick the best path

How LLMs Work

From your keystrokes to the model's output — every step explained.

01

Tokenization

Your text is split into tokens — roughly word-pieces. "unbelievable" → ["un","believ","able"]. GPT-4's BPE vocabulary has ~100K tokens. Each token gets an integer ID.

Hello , Gen er ative AI !
→ [9906, 11, 3469, 261, 1413, 9552, 0]
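A toy greedy longest-match tokenizer shows the idea. Real BPE vocabularies are learned from data; the vocabulary below is invented for illustration (the gen/er/ative IDs echo the example above, the rest are made up).

```python
# Invented toy vocabulary — real models learn ~100K sub-word pieces via BPE
vocab = {"un": 701, "believ": 4402, "able": 481,
         "gen": 3469, "er": 261, "ative": 1413}

def tokenize(word):
    """Greedy longest-match segmentation into known sub-word pieces."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest piece first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {i}")
    return pieces, [vocab[p] for p in pieces]

print(tokenize("unbelievable"))  # (['un', 'believ', 'able'], [701, 4402, 481])
print(tokenize("generative"))    # (['gen', 'er', 'ative'], [3469, 261, 1413])
```

Production tokenizers apply learned merge rules rather than longest-match, but the output shape is the same: a list of sub-word pieces mapped to integer IDs.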
02

Embeddings

Each token ID maps to a high-dimensional vector (e.g., 12,288 floats in GPT-3). Positional encodings are added so the model knows token order. Similar words cluster in this space.

"king" → [0.23, -0.71, 0.45, ...]
"queen" → [0.21, -0.68, 0.47, ...]
king - man + woman ≈ queen (famous word2vec result)
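The famous analogy can be checked with hand-built 2-D vectors. Real embeddings have thousands of learned dimensions; these numbers (a "royalty" axis and a "maleness" axis) are purely illustrative.

```python
import math

# Hand-built 2-D "embeddings": [royalty, maleness] — illustrative only
emb = {
    "king":  [0.9,  0.7],
    "queen": [0.9, -0.7],
    "man":   [0.1,  0.7],
    "woman": [0.1, -0.7],
}

def cosine(a, b):
    """Cosine similarity: dot product of the vectors over their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# king - man + woman: remove maleness, keep royalty
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]
best = max(emb, key=lambda word: cosine(emb[word], target))
print(best)  # queen
```

In trained word2vec embeddings the arithmetic is only approximate, which is why the classic result is stated with "≈" rather than "=".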
03

Self-Attention (the magic)

For each token, compute Query (Q), Key (K), Value (V) vectors. Attention score = softmax(Q·Kᵀ / √d). This determines how much each token "attends to" every other token.

Attention formula:
Attention(Q,K,V) = softmax(QKᵀ/√dₖ)V
Multi-head attention runs this H times in parallel, capturing different relationship types
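The formula translates directly into code. This plain-Python sketch runs scaled dot-product attention over three toy tokens with d_k = 2 (all matrix values are illustrative).

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax_rows(M):
    out = []
    for row in M:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QKᵀ/√d_k)V."""
    d_k = len(K[0])
    Kt = [list(col) for col in zip(*K)]                       # transpose K
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, Kt)]
    return matmul(softmax_rows(scores), V)                    # weight the values

# Three tokens, d_k = 2 (toy numbers)
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = attention(Q, K, V)   # each row: context-weighted mix of the V rows
```

Each output row is a convex combination of the value vectors, so every component stays within the range of the corresponding V column; multi-head attention simply runs this routine H times on different projections and concatenates the results.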
04

Feed-Forward + Layer Norm

After attention, each token passes through a Feed-Forward Network (two linear layers + activation). Layer Normalization stabilizes training. This block repeats N times (e.g., 96 layers in GPT-3).

LayerNorm
FFN (4x expand)
GeLU / ReLU
Residual
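A minimal pre-norm residual sub-block in plain Python, with made-up weights and a tiny model dimension (real blocks operate on thousands of dimensions).

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize to zero mean and unit variance across the feature dimension."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def gelu(v):
    """tanh approximation of GeLU."""
    return 0.5 * v * (1 + math.tanh(math.sqrt(2 / math.pi) * (v + 0.044715 * v ** 3)))

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN: expand 4x, apply nonlinearity, project back down."""
    h = [gelu(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    return [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(W2, b2)]

def block(x, W1, b1, W2, b2):
    """Pre-norm residual sub-block: x + FFN(LayerNorm(x))."""
    return [xi + yi for xi, yi in zip(x, ffn(layer_norm(x), W1, b1, W2, b2))]

d, hidden = 2, 8                                    # toy sizes (4x expansion)
W1 = [[0.1 * (i + j) for j in range(d)] for i in range(hidden)]
b1 = [0.0] * hidden
W2 = [[0.05 * (i - j) for j in range(hidden)] for i in range(d)]
b2 = [0.0] * d
y = block([1.0, -2.0], W1, b1, W2, b2)
```

The residual connection (adding x back) is what lets gradients flow through dozens of stacked blocks without vanishing.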
05

Output: Logits → Token

The final hidden state feeds into an unembedding matrix producing logits over the vocabulary. Softmax converts to probabilities. Temperature controls randomness. Then one token is sampled.

temp=0.1 (focused)
Paris
94%
temp=1.0 (balanced)
Paris
78%
temp=2.0 (creative)
Lyon
35%

Architecture Gallery

Five landmark model architectures that power modern generative AI.

🤖

Transformer

Vaswani et al., 2017
Input Embedding + Pos Enc
Multi-Head Self-Attention
Feed-Forward Network × N
Linear + Softmax

The foundation of virtually all modern LLMs. "Attention Is All You Need" replaced RNNs entirely.

GPT, BERT, T5, Claude, Llama
🔵

VAE

Kingma & Welling, 2013
Input x
Encoder → μ, σ
Sample z ~ N(μ,σ²)
Decoder → x̂

Learns a continuous latent space. The reparameterization trick enables backprop through sampling.

Image generation, drug discovery, anomaly detection
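The reparameterization trick fits in two lines: sampling z ~ N(μ, σ²) is rewritten as z = μ + σ·ε with ε ~ N(0,1), so the randomness is external and gradients can flow through μ and σ. The values below are illustrative.

```python
import random

random.seed(0)

def reparameterize(mu, sigma):
    """z = μ + σ·ε with ε ~ N(0,1): sampling becomes differentiable in μ, σ."""
    return [m + s * random.gauss(0, 1) for m, s in zip(mu, sigma)]

# Illustrative encoder outputs for a 2-D latent space
mu, sigma = [0.2, -1.0], [0.5, 0.1]
z = reparameterize(mu, sigma)   # latent sample to feed the decoder
```

Without this trick, the sampling step would be a non-differentiable node and backprop could not reach the encoder.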
⚔️

GAN

Goodfellow et al., 2014
Random Noise z
Generator G(z)
↕ compete
Discriminator D
Real / Fake verdict

Two networks in adversarial training. Generator creates fakes; discriminator catches them. Nash equilibrium = photorealistic outputs.

StyleGAN, CycleGAN, Pix2Pix
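The adversarial objective, evaluated on illustrative discriminator outputs. The generator loss below is the non-saturating variant suggested in the original paper.

```python
import math

def d_loss(d_real, d_fake):
    """Discriminator objective as a loss: −[log D(x) + log(1 − D(G(z)))]."""
    return -(math.log(d_real) + math.log(1 - d_fake))

def g_loss(d_fake):
    """Non-saturating generator loss: −log D(G(z)), i.e. fool the discriminator."""
    return -math.log(d_fake)

# Illustrative discriminator outputs: 0.9 on a real sample, 0.2 on a fake
print(round(d_loss(0.9, 0.2), 4), round(g_loss(0.2), 4))  # 0.3285 1.6094
```

As the fake gets more convincing (D(G(z)) → 1), the generator loss falls and the discriminator loss rises: the minimax tension that drives both networks to improve.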
🌀

Diffusion

Ho et al., 2020
Pure Gaussian Noise
↓ × T steps
U-Net ε-prediction
↓ conditioned on
Text / Class embed
Clean Image

State-of-the-art image generation. Classifier-free guidance enables text conditioning. DDIM/LCM reduce steps dramatically.

Stable Diffusion, DALL·E 3, Midjourney, FLUX
🐍

Mamba / SSM

Gu & Dao, 2023
Input sequence
Selective State Space
↓ O(n) vs O(n²)
Hardware-aware scan
Output (linear time)

Challenges transformers with linear-time sequence modeling. No attention matrix — uses selective state spaces. Excels on very long sequences.

Mamba, Jamba, Vision Mamba

Curated Resources

The best places to go deeper — vetted by practitioners.


Learning Roadmap

Seven stages from zero to shipping GenAI products. Follow in order.

1
Python & Math Foundations
⏱ 4–8 weeks

Python, NumPy, linear algebra, calculus, and basic probability. These underpin everything.

Python · NumPy · Linear Algebra · Calculus · Probability
📺 3Blue1Brown — Essence of Linear Algebra  ·  📚 fast.ai
2
ML Fundamentals
⏱ 4–6 weeks

Supervised/unsupervised learning, gradient descent, regularization, evaluation metrics. Practice with Scikit-learn.

Scikit-learn · Gradient Descent · Cross-validation
📺 Andrew Ng ML Specialization (Coursera)
3
Deep Learning
⏱ 6–10 weeks

Neural networks, backpropagation, CNNs, RNNs. Build an image classifier and character-level LM from scratch in PyTorch.

PyTorch · Backprop · CNNs · RNNs
📺 Andrej Karpathy: Neural Nets Zero to Hero
4
Transformers & LLMs
⏱ 4–8 weeks

Self-attention, positional encodings, BERT vs GPT. Read "Attention Is All You Need." Use Hugging Face Transformers.

Attention · Hugging Face · BERT / GPT
📚 Hugging Face NLP Course (free)  ·  📄 "Attention Is All You Need"
5
Fine-tuning & Alignment
⏱ 3–5 weeks

LoRA, QLoRA, instruction tuning, RLHF, DPO. Fine-tune Llama on a custom dataset using Axolotl or Unsloth.

LoRA / QLoRA · RLHF · DPO
📺 Sebastian Raschka: Fine-tuning LLMs
6
Building GenAI Apps
⏱ 4–6 weeks

RAG pipelines, vector databases, LangChain/LlamaIndex, function calling, tool-use agents.

RAG · Vector DBs · LangChain · Agents
📚 DeepLearning.AI Short Courses (free)
7
LLMOps & Production
⏱ Ongoing

Evals, observability, cost optimization, model routing, guardrails, red-teaming, and continuous monitoring in production.

Evals · Observability · Guardrails · Red-teaming
🔧 LangSmith, Weights & Biases, Braintrust

GenAI Cheat Sheet

20 essential terms every AI practitioner must know cold.

🔢 Token
The atomic unit of text for LLMs — roughly a word-piece. "tokenization" → 3 tokens. GPT-4's BPE vocabulary has ~100K tokens. ~4 chars per token on average. Costs are billed per token.
📐 Embedding
A high-dimensional vector (e.g., 1536 floats) representing semantic meaning. Similar concepts cluster nearby. Foundation of semantic search, RAG, and recommendation systems.
🌡️ Temperature
Controls output randomness. temp=0 → deterministic (top token always). temp=1 → default sampling. temp>1 → chaotic/creative. Use 0 for code/facts, 0.7–1.0 for creative writing.
🎯 Top-p (Nucleus)
Sample from the smallest set of most-likely tokens whose cumulative probability reaches p. top-p=0.9 ignores the long tail of unlikely tokens. Works alongside temperature. Reduces repetition and incoherence.
🔍 RAG
Retrieval-Augmented Generation — fetch relevant docs from a vector DB at inference time and inject into the prompt. Gives LLMs fresh/private data without retraining. The #1 GenAI app pattern.
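A minimal sketch of the RAG pattern. Here a bag-of-words Counter stands in for a real embedding model, and the three-document corpus is made up; production systems swap in a learned embedder and a vector database.

```python
from collections import Counter
import math, re

docs = [
    "The Eiffel Tower is in Paris.",
    "Transformers use self-attention to weigh context.",
    "RAG injects retrieved documents into the prompt.",
]

def embed(text):
    """Stand-in 'embedding': a bag-of-words vector (real systems use a model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, k=1):
    """Rank documents by similarity to the query, return the top k."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

question = "What do transformers use?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only this context:\n{context}\n\nQ: {question}"
```

The retrieved text is injected into the prompt at inference time, so the LLM answers from fresh or private data without any retraining.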
🎓 RLHF
Reinforcement Learning from Human Feedback. Humans rank outputs → train reward model → optimize LLM with PPO to maximize reward. Used in GPT-4, Claude, Gemini.
👻 Hallucination
When an LLM confidently generates plausible-sounding but factually wrong content. Root cause: optimizes for likely text, not truth. Mitigated by RAG, verification chains, and grounding.
🪟 Context Window
Max tokens the model can "see" at once (input + output combined). GPT-4: 128K. Claude 3: 200K. Gemini 1.5: 1M. Longer context = better coherence but quadratically higher attention cost.
🎛️ Fine-tuning
Continue training a pre-trained model on a smaller task-specific dataset. Updates all or some weights. LoRA/QLoRA make this affordable on consumer hardware with 24GB VRAM.
✏️ Prompt Engineering
Designing inputs to elicit the best outputs without changing weights. Includes zero-shot, few-shot, chain-of-thought, system prompts, and output constraints. Core skill for AI applications.
🔑 LoRA
Low-Rank Adaptation — inject tiny trainable rank-decomposition matrices instead of updating all weights. Reduces trainable params by 10,000x. QLoRA adds 4-bit quantization for even cheaper training.
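The adapter math in miniature: a frozen 3×3 weight plus rank-2 A and B matrices. Shapes and values are illustrative; as in the LoRA paper, B starts at zero so the adapter is initially a no-op.

```python
def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=16, r=2):
    """y = W·x + (α/r)·B·(A·x); W stays frozen, only A and B are trained."""
    return [base + (alpha / r) * delta
            for base, delta in zip(matvec(W, x), matvec(B, matvec(A, x)))]

# Toy shapes: frozen 3x3 weight, rank-2 adapters A (2x3) and B (3x2)
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0]]
B = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]   # zero-init: adapter starts as a no-op

x = [1.0, 2.0, 3.0]
print(lora_forward(W, A, B, x))  # [1.0, 2.0, 3.0] until B is trained
```

The savings come from the shapes: a d×d weight has d² parameters, while the adapters have only 2·d·r, which is tiny when r ≪ d.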
📊 Perplexity
How well a model predicts a test set. Lower = better. Perplexity of X means the model is as uncertain as uniformly choosing among X options at each step. Standard LLM benchmark.
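Perplexity in code: exponentiate the average negative log-probability the model assigned to the actual next tokens (the probabilities below are illustrative).

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to each true token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model assigning 25% to every true next token is as uncertain as
# picking uniformly among 4 options
uniform4 = perplexity([0.25, 0.25, 0.25, 0.25])   # ≈ 4.0
confident = perplexity([0.9, 0.8, 0.95])          # lower = better
```

This is why perplexity is read as an "effective branching factor": a perfect model scores 1.0, and random guessing over a V-token vocabulary scores V.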
🤖 Agent
An LLM + tools in a loop: observe → reason → act → observe. Enabled by function calling, tool-use APIs, and ReAct prompting. Multi-agent systems collaborate on complex tasks.
📦 Vector Database
Stores and searches embedding vectors by semantic similarity via ANN algorithms (HNSW, IVF). Powering RAG. Examples: Pinecone, Weaviate, Chroma, Qdrant, pgvector (Postgres).
🔄 DPO
Direct Preference Optimization — simpler RLHF alternative. Train directly on (preferred, rejected) pairs without a separate reward model. More stable, cheaper, and increasingly preferred.
🌊 Diffusion
Forward: add Gaussian noise until image is pure static. Reverse: train U-Net to denoise step by step. At inference, start from noise, run denoising chain conditioned on text. Powers DALL·E 3, FLUX, SD.
🏗️ Foundation Model
Large model pre-trained on massive data, adaptable to many tasks. "Pre-train once, fine-tune many." GPT-4, Claude, Llama, Stable Diffusion are all foundation models.
⚡ Inference
Running a trained model to generate outputs (forward pass only). Optimized via quantization, KV caching, speculative decoding, Flash Attention, and specialized hardware (H100, TPU v5).
🔮 Zero-Shot
Prompting without examples — just describe the task. Works because LLMs generalize from pretraining. Add 2–5 examples (few-shot) for harder or more idiosyncratic tasks.
🧪 Evals
Evaluation frameworks measuring LLM quality: accuracy, factuality, safety, latency, cost. Use LLM-as-judge, unit tests, and benchmarks (MMLU, HumanEval, MATH). Critical before shipping to production.

GenAI Learning Path

A structured progression from fundamentals to advanced architectures — built on landmark research papers and open learning resources.

Fundamentals · Models & Concepts
Architectures · Transformers & Diffusion
Alignment · RLHF & Safety
Production · Deployment & Scale
Stage 1 · Fundamentals
CORE CONCEPTS

Generative AI Fundamentals

Core GenAI concepts: how generative systems learn, create, and improve. Covers GANs (Goodfellow 2014), VAEs, Transformers (Vaswani 2017), the training loop, and latent space representations.

GANs · VAEs · Transformers · LLMs · Latent Space · Training Loop · Attention Mechanism
Stage 1 · Foundations
DEEP LEARNING

Predictive AI & Deep Learning

Foundation for all AI work: supervised & unsupervised learning, neural network types (FFNNs, CNNs, RNNs), computer vision, NLP fundamentals, and the end-to-end ML pipeline.

Supervised Learning · FFNNs · CNNs · RNNs · NLP/NLU · Computer Vision
Stage 2 · Scale
LARGE-SCALE AI

Foundation Models & Scaling

AI at scale: Scaling Laws (Kaplan 2020), GPUs/TPUs, Machine Learning as a Service (MLaaS), containers, cloud-based training, pre-built AI APIs, and automated deployment & governance.

Scaling Laws · GPU/TPU · MLaaS · Pre-Built APIs · Foundation Models · Auto-Deploy
Stage 3 · Production
AI ARCHITECTURE

AI Architecture & Alignment

Production AI patterns: RLHF (Ouyang 2022), Constitutional AI (Anthropic 2022), cloud-native design patterns — Data-Centric (Serverless Pipeline, Feature Store), Model-Centric (Federated Learning, Drift Detection).

RLHF · Constitutional AI · Federated Learning · Model Drift Detection · Serverless Inference · Feature Store

AI systems are broadly classified into three types based on their primary function. Understanding which type solves which problem guides model selection and system design.

🔮
Predictive AI System
Analyzes data to forecast outcomes, classify inputs, detect anomalies. Uses FFNNs, CNNs, RNNs.
Generative AI System
Creates new content (text, image, audio, code) from learned patterns. Uses GANs, VAEs, Transformers.
🤖
Agentic AI System
Autonomously plans and executes multi-step tasks using tools, memory, and environmental feedback.

Key Terms Defined

Precise definitions of the core concepts powering modern generative AI systems — from GANs and Transformers to production design patterns.

Generative AI System GEN AI
An AI system designed to create new content — text, images, audio, video, or code — by learning patterns from training data. Contrasted with Predictive AI systems, which classify or forecast, and Agentic AI systems, which plan and act.
Goodfellow et al. 2014 · Vaswani et al. 2017
Transformer ARCH
A neural network architecture that uses encoder and decoder layers with attention mechanisms to process sequential data. The foundation of Large Language Models (LLMs). Excels at text generation, translation, and sentiment analysis.
Vaswani et al. 2017 · arXiv:1706.03762
Large Language Model (LLM) LLM
A Transformer-based generative AI model trained on massive text datasets that can generate human-quality text, answer questions, write code, and perform complex language tasks. Powers AI-driven chatbots and content generation systems.
Brown et al. 2020 · arXiv:2005.14165 (GPT-3)
Generator GAN
The component in a Generative Adversarial Network (GAN) training loop that creates synthetic content from latent space input. Receives feedback from the Loss Function to progressively improve the realism of its output in each training iteration.
Goodfellow et al. 2014 · arXiv:1406.2661
Discriminator GAN
The component in a GAN training loop that evaluates whether content submitted by the Generator is real (from training data) or synthetic (generated). Provides critical feedback to the Loss Function, which then instructs the Generator.
Goodfellow et al. 2014 · arXiv:1406.2661
Encoder VAE / TF
A neural network component that compresses input data into a compact latent space representation. Used in VAEs to extract essential patterns from training data, and in Transformers to identify relevant input using attention mechanisms.
Vaswani et al. 2017 · arXiv:1706.03762
Decoder VAE / TF
A neural network component that reconstructs or generates new content from latent space representations. In VAEs, it reconstructs data for comparison with the original. In Transformers, it creates new content using attention mechanisms and encoded data.
Vaswani et al. 2017 · arXiv:1706.03762
Latent Space LS
A compressed mathematical representation of data patterns learned by the model. The Generator samples from latent space to create new content. In VAEs, the latent space is kept organized and smooth (a key design goal of the Loss Function).
Goodfellow et al. 2014 · Ho et al. 2020
Attention Mechanism TF
A Transformer component that allows the model to focus on the most relevant parts of the input when generating each element of output. Enables LLMs to maintain context across long sequences — the core innovation of the Transformer architecture.
Vaswani et al. 2017 · arXiv:1706.03762
Loss Function LF
Measures the difference between generated and target output during training. Scores the Generator's content and provides guidance for improvement. In VAEs, the Loss Function balances reconstruction accuracy against latent space organization.
Goodfellow et al. 2014 · Vaswani et al. 2017
GAN (Generative Adversarial Network) GAN
A neural network architecture where a Generator and Discriminator are trained simultaneously in an adversarial relationship. Excels at sophisticated visual content generation, data augmentation, and product design generation. The adversarial dynamic drives quality improvement.
Goodfellow et al. 2014 · arXiv:1406.2661
VAE (Variational Autoencoder) VAE
A generative neural network that learns smooth latent space distributions through encoder/decoder pairs. Used for anomaly detection, data imputation, customer data anonymization, and reconstruction of incomplete sensor data. Different from GANs in that it has no adversarial component.
Kingma & Welling 2013 · arXiv:1312.6114
Pre-Trained Model PTM
A model already trained on large datasets that can be used directly or adapted via fine-tuning. Enables Transfer Learning — leveraging existing knowledge rather than training from scratch. Foundation models like GPT-3 and BERT are pre-trained models.
Brown et al. 2020 · arXiv:2005.14165
Hallucination RISK
A generative AI failure mode where the model produces plausible-sounding but factually incorrect or invented content. Distinct from data bias. Particularly problematic in LLMs used for information retrieval or factual question-answering.
Ji et al. 2022 · Survey of Hallucination in NLG
Mode Collapse RISK
A GAN training failure where the Generator learns to produce only a limited variety of outputs — enough to fool the Discriminator — rather than the full diversity of the training data. Countered via architecture choices and training regularization techniques.
Goodfellow et al. 2014 · arXiv:1406.2661
Serverless AI Inference PATTERN
A cloud AI design pattern where AI model inference runs on demand using serverless functions — no persistent server required. Reduces cost for variable workloads and eliminates infrastructure management, at the expense of potential cold-start latency.
Cloud-Native AI Design Pattern · Production Architecture
Federated AI Learning PATTERN
A distributed training design pattern where multiple nodes train on local data and share only model updates — not raw data — to produce a global model. Addresses data privacy requirements and enables training on data that cannot be centralized.
McMahan et al. 2017 · Communication-Efficient Learning
AI Model Drift Detection PATTERN
A production monitoring pattern for detecting performance degradation caused by shifts in real-world data distributions over time. When drift is detected, the model is retrained or updated — preventing silent accuracy degradation in production.
MLOps Best Practice · Production AI Architecture
Distributed Feature Store PATTERN
A data-centric design pattern providing a centralized, distributed repository of ML features accessible across multiple AI systems and training runs. Eliminates duplicate feature engineering, ensures consistency, and enables feature reuse across teams.
MLOps Best Practice · Feature Engineering
Training Data
The dataset used to fit a model's parameters, distinct from validation and test data. Quality, diversity, and volume of training data are primary determinants of model performance — central to scaling laws (Kaplan 2020) and foundational to all generative AI systems.
Kaplan et al. 2020 · arXiv:2001.08361

Practice Quiz

8 questions covering core generative AI concepts. Click an answer to reveal the explanation.

Question 1 of 8 · GAN Architecture
Which component of a GAN evaluates whether submitted content is real (from training data) or synthetic (generated)?
The Discriminator scrutinizes the Generator's content and provides feedback to the Loss Function, which scores the Generator and guides its next iteration. (Goodfellow et al. 2014 — Generative Adversarial Nets, arXiv:1406.2661)
Question 2 of 8 · Latent Space
What is the primary role of Latent Space in a GAN training loop?
Latent Space is the compressed mathematical representation from which the Generator draws creative input at each training step. (Goodfellow et al. 2014 — Generative Adversarial Nets)
Question 3 of 8 · Transformer Architecture
Which neural network architecture forms the foundation of Large Language Models and uses attention mechanisms?
Transformers use encoder-decoder layers with self-attention mechanisms to process sequences in parallel — the core contribution of Vaswani et al. 2017 "Attention Is All You Need" (arXiv:1706.03762).
Question 4 of 8 · LLM Risks
What is a "hallucination" in the context of large language models?
Hallucination is a failure mode where LLM output sounds confident and plausible but is factually wrong or invented. Distinct from data bias; a well-documented risk in production LLMs. (Ji et al. 2022 — Survey of Hallucination in Natural Language Generation)
Question 5 of 8 · GAN Training
Which GAN training failure occurs when the Generator produces only a narrow variety of outputs rather than the full data diversity?
Mode Collapse — the Generator finds a small set of outputs that consistently fool the Discriminator, stopping exploration of the full output space. A key known failure mode identified in Goodfellow et al. 2014 (arXiv:1406.2661).
Question 6 of 8 · Alignment
What does RLHF stand for, and which landmark 2022 paper applied it to align large language models with human intent?
RLHF (Reinforcement Learning from Human Feedback) was the technique used in InstructGPT (Ouyang et al. 2022, arXiv:2203.02155) to fine-tune GPT-3 to follow instructions. Human raters ranked outputs; those preferences trained a reward model used in PPO-based RL fine-tuning.
Question 7 of 8 · Distributed Training
Which design pattern trains models on distributed local data, sharing only model updates — not raw data — to preserve privacy?
Federated Learning (McMahan et al. 2017) enables training on data that cannot be centralized due to privacy or regulations. Each node trains locally and shares only model parameters — never raw data — which are aggregated to update a global model.
Question 8 of 8 · Scaling Laws
Kaplan et al. 2020 (Scaling Laws for Neural Language Models) found that LLM performance scales predictably with which three factors?
Scaling Laws (Kaplan et al. 2020, arXiv:2001.08361) showed that language model loss follows a power-law with model size (parameters), dataset size, and compute — spanning 7+ orders of magnitude. Larger models are more sample-efficient, justifying training very large models on modest data.

Landmark Papers & Breakthroughs

The papers that defined modern generative AI — each a step-change in what machines can create, understand, or align with human values.

2017 Transformers

Attention Is All You Need

Vaswani et al. introduced the Transformer — replacing RNNs entirely with self-attention. Encoder-decoder with multi-head attention processes sequences in parallel, enabling models to attend to all positions simultaneously. The direct ancestor of GPT, BERT, T5, and every modern LLM.

Self-Attention · Multi-Head Attention · Positional Encoding
Vaswani, Shazeer, Parmar et al. · Google Brain arXiv:1706.03762
2014 Generative Models

Generative Adversarial Networks

Goodfellow et al. proposed training two neural networks simultaneously: a Generator that creates synthetic samples and a Discriminator that distinguishes real from fake. The adversarial minimax game — G maximizing D's error while D minimizes it — produces increasingly realistic output. Foundation of image synthesis, deepfakes, and data augmentation.

Minimax Game · Adversarial Training · Nash Equilibrium
Goodfellow, Pouget-Abadie, Mirza et al. · Montréal arXiv:1406.2661
2020 Diffusion Models

Denoising Diffusion Probabilistic Models

Ho, Jain & Abbeel showed that learning to reverse a Gaussian noise process produces high-quality images. Training adds noise iteratively until the image is pure static; the model learns to denoise step by step. DDPM underpins Stable Diffusion, DALL-E 2, Midjourney, and essentially all modern image generation.

Forward Diffusion · Reverse Denoising · Score Matching
Ho, Jain & Abbeel · UC Berkeley arXiv:2006.11239
2020 Few-Shot Learning

GPT-3: Language Models are Few-Shot Learners

Brown et al. trained a 175B-parameter autoregressive Transformer on roughly 300B tokens drawn from a ~500B-token corpus. The key finding: simply scaling dramatically improves few-shot performance — the model learns new tasks from just a few examples in the prompt, without gradient updates. Introduced the paradigm of prompting as a programming interface for AI.

175B Parameters · In-Context Learning · Zero/One/Few-Shot
Brown, Mann, Ryder et al. · OpenAI arXiv:2005.14165
2022 Alignment · RLHF

InstructGPT: Training LLMs with Human Feedback

Ouyang et al. introduced RLHF for LLMs: fine-tune GPT-3 on curated demonstrations, train a reward model from human rankings, then optimize with PPO. The 1.3B InstructGPT model was preferred by human raters over the 175B GPT-3. Landmark proof that alignment techniques could outperform raw scale. Direct foundation for ChatGPT.

RLHF · Reward Model · PPO · Fine-tuning
Ouyang, Wu, Jiang et al. · OpenAI arXiv:2203.02155
2022 AI Safety · Anthropic

Constitutional AI: Harmlessness from AI Feedback

Bai et al. (Anthropic) proposed Constitutional AI: rather than relying solely on human feedback labels, a "constitution" of principles guides an AI to critique and revise its own outputs (RLAIF). Dramatically reduces reliance on human labelers for harmlessness. Foundation of Claude's training approach and a major alternative to pure RLHF.

RLAIF · AI Self-Critique · Principle-Based
Bai, Kadavath, Kundu et al. · Anthropic arXiv:2212.08073
2020 Scaling & Compute

Scaling Laws for Neural Language Models

Kaplan et al. empirically demonstrated that LLM loss scales as a power law with model size (N), dataset size (D), and compute (C) — spanning 7+ orders of magnitude. Architectural details matter far less than scale. Larger models are more sample-efficient. The paper provided the theoretical framework that drove the race to train ever-larger models.

Power-Law Scaling · Compute Efficiency · N·D·C Tradeoffs
Kaplan, McCandlish, Henighan et al. · OpenAI arXiv:2001.08361
Ashish Vaswani
Google Brain → Inceptive
Lead author of "Attention Is All You Need" — invented the Transformer architecture
🔮
Ian Goodfellow
Montréal → Google → Apple
Invented GANs (2014) — one of the most cited ML papers ever
🧠
Geoffrey Hinton
Toronto / Google · Nobel 2024
Godfather of deep learning. Backpropagation, Boltzmann Machines, deep belief nets
👁️
Yann LeCun
Meta AI · NYU · Turing Award 2018
Pioneered convolutional neural networks (CNNs) — foundational to computer vision
🌐
Yoshua Bengio
MILA · Montréal · Turing Award 2018
Word embeddings, attention precursors, GAN co-author. AI safety advocate
🚀
Sam Altman
OpenAI CEO
Led deployment of GPT-3, ChatGPT, GPT-4. Drove RLHF-aligned AI to mainstream

Key Terms — Research Definitions

Exact terminology drawn from landmark papers — the language used by researchers who built modern generative AI.

Transformer ARCH
A sequence-to-sequence architecture relying entirely on self-attention, dispensing with recurrence and convolutions. Stacked encoder and decoder layers each contain multi-head self-attention and position-wise feed-forward networks. "The dominant sequence transduction model" as of 2017 and the backbone of every modern LLM.
Vaswani et al. 2017 · arXiv:1706.03762
Self-Attention ATTN
A mechanism relating different positions of a single sequence to compute its representation. Each token attends to all others with weights from query-key dot products: Attention(Q,K,V) = softmax(QKᵀ/√dₖ)V. Allows every token to "see" every other token simultaneously — enabling global context in a single layer.
Vaswani et al. 2017 · arXiv:1706.03762
Multi-Head Attention MHA
Running h parallel attention heads on linearly projected Q, K, V, then concatenating and re-projecting outputs. Allows the model to jointly attend to information from different representation subspaces at different positions — capturing diverse contextual relationships that a single attention head would miss.
Vaswani et al. 2017 · arXiv:1706.03762
Positional Encoding PE
Vectors added to token embeddings to inject sequence position information. Since self-attention is permutation-invariant, positional encoding is essential for understanding word order. Vaswani et al. used sine/cosine functions of different frequencies; modern models use learned positions or Rotary Positional Embeddings (RoPE).
Vaswani et al. 2017 · arXiv:1706.03762
Token & Embedding TOK
A token is the atomic text unit (sub-word from BPE tokenization). An embedding is the dense continuous vector learned during training so semantically similar tokens are geometrically close. GPT-3 used a 50,257-token vocabulary and 12,288-dimensional embeddings. Token count drives context window limits and cost.
Brown et al. 2020 · arXiv:2005.14165
Latent Space LS
A compressed continuous mathematical space where the model represents data. GANs: G: Z → X maps latent vectors z ~ p(z) to data samples. VAEs: latent space is regularized toward a prior N(0,I) via KL divergence to ensure smoothness. Enables interpolation between concepts by moving continuously through the space.
Goodfellow et al. 2014 · Kingma & Welling 2013
Diffusion Process — Ho et al. DIFF
A two-phase Markov process: forward q(xₜ|xₜ₋₁) adds Gaussian noise over T steps until x_T ≈ N(0,I); reverse p_θ(xₜ₋₁|xₜ) is a learned denoiser. Ho et al. showed that minimizing a simplified ELBO equivalent to denoising score matching produces high-quality samples — FID 3.17 on CIFAR-10, state-of-the-art in 2020.
Ho, Jain & Abbeel 2020 · arXiv:2006.11239
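The forward process has a closed form, so x at any timestep can be sampled directly from x₀ without stepping through the chain. A sketch using the linear β schedule from the paper, on a toy array standing in for an image:

```python
import numpy as np

def forward_diffuse(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]
    eps = rng.normal(size=x0.shape)   # the noise the network learns to predict
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)    # linear schedule from Ho et al.
x0 = rng.normal(size=(8, 8))          # toy "image"
x_late = forward_diffuse(x0, t=T - 1, betas=betas, rng=rng)
# By t = T-1, alpha_bar is vanishingly small: x_late is essentially pure noise.
```

Training then amounts to handing the network (x_t, t) and asking it to predict eps, which is the simplified objective the paper shows works best.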
GAN — Goodfellow's Formulation GAN
"Simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from training data rather than G." Minimax game: min_G max_D E[log D(x)] + E[log(1−D(G(z)))]. At Nash equilibrium, G perfectly replicates the data distribution.
Goodfellow et al. 2014 · arXiv:1406.2661
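A quick numerical check of the minimax objective: the paper shows that at the global optimum the discriminator outputs 1/2 everywhere, giving V(D,G) = −log 4. A sketch with hypothetical discriminator outputs on small batches:

```python
import numpy as np

def gan_value(D_real, D_fake):
    """Estimate V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))] on batches."""
    return np.mean(np.log(D_real)) + np.mean(np.log(1.0 - D_fake))

# At equilibrium the discriminator cannot tell real from fake: D(.) = 1/2.
half = np.full(4, 0.5)
print(round(gan_value(half, half), 4))  # -1.3863, i.e. -log(4)
```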
VAE (Variational Autoencoder) VAE
Encodes inputs to a distribution q(z|x) over latent variables, samples z, then decodes to reconstruct x. Optimizes ELBO: reconstruction term − KL divergence from prior p(z) = N(0,I). The KL term forces the latent space to be smooth and structured. Unlike GANs, has no adversarial component — training is stable and principled.
Kingma & Welling 2013 · arXiv:1312.6114
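The two ELBO terms and the reparameterization trick can be sketched directly. Mean squared error stands in for the reconstruction term (a Gaussian-decoder assumption), and the KL term uses its closed form for Gaussian q and prior:

```python
import numpy as np

def vae_loss_terms(x, x_recon, mu, logvar):
    """ELBO = E[log p(x|z)] - KL(q(z|x) || N(0, I)).
    KL closed form: 0.5 * sum(mu^2 + exp(logvar) - 1 - logvar)."""
    recon = np.mean((x - x_recon) ** 2)
    kl = 0.5 * np.sum(mu**2 + np.exp(logvar) - 1.0 - logvar)
    return recon, kl

def reparameterize(mu, logvar, rng):
    """z = mu + sigma * eps keeps sampling differentiable w.r.t. mu, sigma."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

rng = np.random.default_rng(0)
mu, logvar = np.zeros(4), np.zeros(4)   # encoder output for one input
z = reparameterize(mu, logvar, rng)
recon, kl = vae_loss_terms(np.ones(4), np.ones(4), mu, logvar)
print(kl)  # 0.0: here q(z|x) = N(0, I) matches the prior exactly
```

The KL term is what pulls every posterior toward the same prior, which is why the latent space ends up smooth enough to interpolate through.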
RLHF ALIGN
Reinforcement Learning from Human Feedback. Three steps: (1) supervised fine-tuning on human demonstrations; (2) train a reward model from human rankings of model outputs; (3) optimize the LM with PPO against the reward model. InstructGPT's 1.3B RLHF model was preferred by raters over the raw 175B GPT-3 — alignment beats scale.
Ouyang et al. 2022 · arXiv:2203.02155
Constitutional AI CAI
Anthropic's alignment method: a "constitution" of natural-language principles guides an AI to critique and revise its own outputs (RLAIF — Reinforcement Learning from AI Feedback). Trains harmlessness without per-response human labeling — making alignment scalable. Foundation of Claude's training. Provides transparency: the principles are public.
Bai et al. 2022 · Anthropic · arXiv:2212.08073
Hallucination RISK
A failure mode where a generative model produces output that is fluent and confident but factually incorrect, unsupported by source, or fabricated. Intrinsic: contradicts source material. Extrinsic: cannot be verified against any source. Arises from optimizing token probability rather than factual accuracy. Central unsolved challenge in production LLMs.
Ji et al. 2022 · Survey of Hallucination in NLG
Temperature SAMP
A scalar T applied to logits before softmax: P(token_i) ∝ exp(logit_i / T). T < 1 sharpens the distribution (more deterministic, less creative); T > 1 flattens it (more diverse, more random); T → 0 is greedy decoding. Standard hyperparameter in all GPT-family models — the primary knob for controlling output randomness.
Brown et al. 2020 · arXiv:2005.14165
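The effect is easy to see on a toy distribution. Three hypothetical next-token logits, scaled at three temperatures:

```python
import numpy as np

def sample_probs(logits, T):
    """P(i) proportional to exp(logit_i / T): T < 1 sharpens, T > 1 flattens."""
    scaled = logits / T
    scaled -= scaled.max()            # numerical stability
    p = np.exp(scaled)
    return p / p.sum()

logits = np.array([5.0, 3.0, 1.0])    # hypothetical next-token logits
print(sample_probs(logits, T=1.0).round(3))   # moderately peaked
print(sample_probs(logits, T=0.2).round(3))   # near-greedy
print(sample_probs(logits, T=5.0).round(3))   # near-uniform
```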
Top-k / Top-p (Nucleus) Sampling SAMP
Top-k: sample only from the k highest-probability tokens. Top-p (nucleus): sample from the smallest token set whose cumulative probability ≥ p — adaptively sizing the pool. When the model is confident, few tokens qualify; when uncertain, more do. Avoids degenerate repetition while preserving diversity better than pure temperature scaling.
Holtzman et al. 2020 · The Curious Case of Neural Text Degeneration
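A sketch of the nucleus filter, using the "capital of France" distribution from above as the hypothetical input:

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability >= p."""
    order = np.argsort(probs)[::-1]          # indices, descending probability
    csum = np.cumsum(probs[order])
    cutoff = np.searchsorted(csum, p) + 1    # number of tokens in the nucleus
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()         # renormalize over the nucleus

probs = np.array([0.78, 0.09, 0.06, 0.04, 0.03])   # "Paris", "Lyon", ...
nucleus = top_p_filter(probs, p=0.9)
print(np.count_nonzero(nucleus))  # 3 — a confident model keeps a small pool
```

With a flatter input distribution more tokens would survive the cutoff, which is exactly the adaptive-pool behavior described above.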
Context Window CTX
The maximum token span a Transformer can attend to simultaneously — its working memory. Standard attention is O(n²) in sequence length. GPT-3: 2,048 tokens. Modern models extend to 128K–1M via FlashAttention, sliding window attention, or ring attention. Determines how much prior conversation and document content the model "remembers."
Vaswani et al. 2017 · Brown et al. 2020
Fine-tuning FT
Continuing training of a pre-trained model on a smaller task-specific dataset to specialize behavior. Step 1 of RLHF is supervised fine-tuning on curated human demonstrations. Parameter-efficient variants (LoRA, prefix tuning, adapters) update only a small fraction of parameters — enabling fine-tuning on consumer hardware.
Ouyang et al. 2022 · arXiv:2203.02155
Few-Shot Learning FSL
"The ability to perform new tasks given only a small number of examples in the prompt, with no gradient updates." Zero-shot uses only the task description; one-shot uses one example; few-shot uses 10–100. GPT-3 demonstrated that scale enables this emergent generalization — no task-specific fine-tuning required for strong performance.
Brown et al. 2020 · arXiv:2005.14165
Scaling Laws SCALE
Empirical power-law relationships: test loss L scales with model size N, dataset size D, and compute C over 7+ orders of magnitude. L(N) ∝ N^(−0.076), L(D) ∝ D^(−0.095). Architectural choices within a wide range matter far less than scale. Key insight: larger models are more sample-efficient — justifying training huge models on modest data.
Kaplan et al. 2020 · arXiv:2001.08361
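The parameter scaling law can be evaluated directly. N_c ≈ 8.8 × 10¹³ and α_N ≈ 0.076 are the fitted values reported by Kaplan et al.; loss is in nats per token:

```python
def loss_vs_params(N, N_c=8.8e13, alpha_N=0.076):
    """L(N) = (N_c / N) ** alpha_N: Kaplan et al.'s parameter scaling law."""
    return (N_c / N) ** alpha_N

# Each 10x in parameters buys a steady, predictable drop in loss.
for N in (1e8, 1e9, 1e10, 1e11):
    print(f"N = {N:.0e}:  L = {loss_vs_params(N):.3f}")
```

The smoothness of this curve is the whole point: it let labs forecast the payoff of a 100x training run before spending the compute.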

Knowledge Check — Expert Quiz

8 questions drawn directly from landmark papers. Click an answer to reveal the explanation and source citation.

Question 1 of 8 · Vaswani et al. 2017
In what year was "Attention Is All You Need" published, and what key component did it introduce to replace recurrence?
2017 — Self-Attention / Transformer. Vaswani et al. published "Attention Is All You Need" at NeurIPS 2017 (arXiv:1706.03762). The Transformer replaced RNNs and CNNs entirely with multi-head self-attention, enabling full parallelization and long-range dependency modeling. It became the direct foundation of GPT, BERT, T5, and every modern LLM.
Question 2 of 8 · Ouyang et al. 2022
What does RLHF stand for, and what was the surprising core finding of the InstructGPT paper (Ouyang et al. 2022)?
RLHF = Reinforcement Learning from Human Feedback. Ouyang et al. 2022 (arXiv:2203.02155) found that their 1.3B InstructGPT model fine-tuned with RLHF was preferred by human raters over the raw 175B GPT-3 — 100× fewer parameters outperforming raw scale. Alignment quality matters as much as model size. The direct precursor to ChatGPT.
Question 3 of 8 · Goodfellow et al. 2014
Who published the original GAN paper, and at what institution?
Ian Goodfellow et al. — Université de Montréal (2014). "Generative Adversarial Nets" (arXiv:1406.2661) was co-authored with Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, and Yoshua Bengio — all at Montréal. Goodfellow reportedly conceived the adversarial training idea on a single night after a research debate at a bar.
Question 4 of 8 · Bai et al. 2022 (Anthropic)
What does "Constitutional AI" (Anthropic, 2022) mean — and what is its key difference from standard RLHF?
Constitutional AI = principle-guided self-critique (RLAIF). Bai et al. 2022 (arXiv:2212.08073) used a written set of natural-language principles (a "constitution") to have an AI critique and revise its own outputs. This RLAIF approach trains harmlessness without human labelers rating every response — making alignment scalable. The foundation of how Claude is trained at Anthropic.
Question 5 of 8 · Ho et al. 2020
What is the core mechanism of Denoising Diffusion Probabilistic Models (Ho et al. 2020)?
Reversing a noise process. Ho, Jain & Abbeel (arXiv:2006.11239) trained a neural net to reverse a Markov chain that gradually adds Gaussian noise to data. Key simplification: the training objective reduces to predicting the noise added at each step — equivalent to denoising score matching. Achieved FID 3.17 on CIFAR-10. Foundation of Stable Diffusion, DALL-E 2, and Midjourney.
Question 6 of 8 · Kaplan et al. 2020
Scaling Laws (Kaplan et al. 2020) found LLM performance scales predictably with which three factors?
N (parameters), D (tokens), C (compute). Kaplan et al. 2020 (arXiv:2001.08361) showed test loss scales as power laws with these three quantities over 7+ orders of magnitude. Architectural choices (depth vs. width, attention heads) matter far less within a wide range. Key implication: larger models are more sample-efficient — justifying the push to train enormous models.
Question 7 of 8 · Brown et al. 2020
What is "in-context learning" as demonstrated by GPT-3 (Brown et al. 2020)?
In-context learning = task adaptation via the prompt, no gradient updates. Brown et al. (arXiv:2005.14165) showed GPT-3 could learn to do translation, arithmetic, or code generation from just a few in-context examples. The model "learns" from examples within a single forward pass — demonstrating emergent generalization that made prompting a new programming paradigm.
Question 8 of 8 · Vaswani et al. 2017
The self-attention formula divides by √dₖ: Attention(Q,K,V) = softmax(QKᵀ / √dₖ)·V. Why?
Gradient stability — preventing softmax saturation. Vaswani et al. (arXiv:1706.03762) noted that in high-dimensional spaces, dot products grow large in magnitude, pushing softmax into regions with extremely small gradients — making learning slow or impossible. Dividing by √dₖ keeps dot products in a well-behaved range. A small but crucial detail enabling stable Transformer training at scale.
All questions sourced from original arXiv papers