⚡ Interactive Visual Learning
Explore the 5 V's, build pipelines, map the ecosystem, and launch your big data career — all in one interactive app.
Core Concepts
Six foundational pillars every big data engineer must master
Hadoop Distributed File System splits files into 128 MB blocks replicated across DataNodes (default 3×). The NameNode holds metadata only.
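As a rough illustration, the block-and-replication math can be sketched in plain Python (the constants mirror HDFS defaults; the function name is ours, not an HDFS API):

```python
# Back-of-envelope HDFS sizing: how many blocks a file occupies,
# and its raw on-disk footprint once replication is applied.
import math

BLOCK_SIZE_MB = 128   # HDFS default block size
REPLICATION = 3       # default replication factor

def hdfs_footprint(file_size_mb: float) -> tuple[int, float]:
    """Return (block_count, raw_storage_mb) for a file of the given size."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # Each replica stores the file's actual bytes, not a full padded block,
    # so raw usage is file size times the replication factor.
    return blocks, file_size_mb * REPLICATION

blocks, raw_mb = hdfs_footprint(1000)   # a ~1 GB file
print(blocks, raw_mb)                   # 8 blocks, 3000.0 MB raw
```

Note the last block is usually partial: a 1000 MB file occupies 8 blocks even though 7 × 128 MB = 896 MB.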
The original Hadoop processing model. Splits a job into parallel Map tasks, then aggregates results via Reduce. Disk-heavy but fault-tolerant for batch ETL.
⚠️ Writes to disk between each stage; Spark replaced this with in-memory DAGs for up to 100× speedups.
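The map-then-shuffle-then-reduce flow can be sketched in a few lines of plain Python (a toy single-machine simulation, not the Hadoop API):

```python
# Toy MapReduce word count: map emits (word, 1) pairs, a shuffle
# groups pairs by key, and reduce sums the counts per word.
from collections import defaultdict

def map_phase(line: str):
    for word in line.split():
        yield word, 1

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big pipelines", "big wins"]
pairs = [p for line in lines for p in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)   # {'big': 3, 'data': 1, 'pipelines': 1, 'wins': 1}
```

In real Hadoop, each phase runs as many parallel tasks across the cluster, with intermediate results written to disk between stages.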
Spark keeps data in RAM across a Directed Acyclic Graph (DAG) of transformations. Up to 100× faster than MapReduce for iterative ML, and roughly 10× faster for batch ETL.
Spark APIs: RDD (low-level), DataFrame (SQL-like), Dataset (typed). PySpark enables Python. MLlib provides distributed ML.
Kafka is a distributed commit log — a durable, ordered, replayable stream of events. Used by 80%+ of Fortune 100 for real-time data pipelines.
Kafka Connect syncs external data systems like databases. Kafka Streams handles in-process stream processing. Confluent Cloud = managed Kafka.
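A minimal sketch of the commit-log idea in plain Python (illustrative names, not the Kafka client API): an append-only sequence where each record gets a monotonically increasing offset, and any consumer can replay from any offset.

```python
# Sketch of Kafka's core abstraction: a durable, ordered, replayable
# log of events. Real Kafka adds partitions, replication, and
# persistence; this shows only the offset-and-replay idea.
class CommitLog:
    def __init__(self):
        self._records = []

    def append(self, event: str) -> int:
        """Append an event and return its offset."""
        self._records.append(event)
        return len(self._records) - 1

    def replay(self, from_offset: int = 0):
        """Yield (offset, event) pairs from a given offset onward."""
        for offset in range(from_offset, len(self._records)):
            yield offset, self._records[offset]

log = CommitLog()
for e in ["user_signup", "page_view", "checkout"]:
    log.append(e)

# A new consumer can replay the full history...
print(list(log.replay()))
# ...while another resumes from where it left off.
print(list(log.replay(from_offset=2)))  # [(2, 'checkout')]
```

Replayability is the key difference from a traditional message queue, where a consumed message is gone.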
Pipeline Builder
Toggle components to see how a real Big Data pipeline comes together
All 7 components enabled. Data flows: Sources → Kafka (stream ingestion) → HDFS/S3 (raw storage) → Spark (ETL/processing) → Hive/DW (serving layer) → Superset (visualization). Airflow orchestrates every stage. Toggle components to explore partial architectures.
Ecosystem Galaxy Map
The Big Data tool landscape — organized by layer
Click any tool in the map above to see details about it — what it does, who uses it, and where it fits in the stack.
Learning Resources
Curated videos, courses, books & podcasts to go deep
Learning Roadmap
Your 6-phase journey from data curious to cloud-scale engineer
Big Data Cheat Sheet
20 essential terms — hover a card and click 📋 to copy the definition
Foundational Papers & Architecture
The landmark research and architectural patterns that built the Big Data world
A scalable distributed file system for large data-intensive applications on commodity hardware. Files are split into fixed 64 MB chunks, each replicated 3× across chunkservers. A single master holds namespace metadata; clients talk directly to chunkservers for I/O. Designed for append-dominant, sequential-read workloads — not random writes.
⚡ Directly inspired HDFS — the storage backbone of Apache Hadoop.
"MapReduce is a programming model and an associated implementation for processing and generating large datasets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key." The runtime auto-parallelises across thousands of machines and handles node failures transparently.
⚡ >100,000 MapReduce jobs/day at Google within 4 years. Foundation of Apache Hadoop.
Manages petabytes of structured data across thousands of commodity servers. Data model: a sparse, distributed, persistent multi-dimensional sorted map indexed by (row key, column key, timestamp) → value. Served Google web indexing, Google Earth, and Google Finance — wildly different latency and scale requirements, one unified system.
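The (row key, column key, timestamp) → value model can be sketched in plain Python (names are ours, and real Bigtable tablets, SSTables, and compactions are far more sophisticated):

```python
# Sketch of Bigtable's data model: a sparse map whose keys stay in
# sorted order, so range scans are cheap and a cell lookup can pick
# the newest timestamp.
import bisect

class SparseSortedMap:
    def __init__(self):
        self._keys = []    # sorted list of (row, column, timestamp)
        self._values = {}

    def put(self, row, column, timestamp, value):
        key = (row, column, timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)   # keep keys sorted
        self._values[key] = value

    def latest(self, row, column):
        """Return the value with the highest timestamp for a cell."""
        cell = [k for k in self._keys if k[0] == row and k[1] == column]
        return self._values[cell[-1]] if cell else None

t = SparseSortedMap()
t.put("com.example/index", "contents", 1, "<html>v1</html>")
t.put("com.example/index", "contents", 2, "<html>v2</html>")
print(t.latest("com.example/index", "contents"))  # <html>v2</html>
```

The sparse part matters: a row stores only the cells it actually has, which is why Bigtable handles wildly different schemas in one system.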
⚡ Directly inspired Apache HBase, Cassandra & the entire NoSQL movement.
Doug Cutting implemented the GFS and MapReduce papers in open-source Java to power the Nutch web crawler, naming it after his son's toy elephant. HDFS is a direct open-source implementation of the GFS architecture. Became the de-facto Big Data platform and spawned the entire ecosystem: Hive, HBase, Pig, and Apache Spark.
⚡ The most influential open-source project in data engineering history.
Lambda vs. Kappa Architecture
Two competing patterns for combining batch and real-time data processing
The CAP Theorem
Eric Brewer's conjecture (PODC 2000) — formally proven by Gilbert & Lynch, MIT (2002)
A distributed system can guarantee at most 2 of these 3 properties simultaneously; when a network partition occurs, you must choose between consistency and availability.
Key Terms — Primary Source Definitions
Authoritative definitions drawn directly from foundational papers, documentation, and original research
Knowledge Check
8 questions from foundational papers and primary sources — click an answer to see the explanation
filter() and map() are transformations — they add nodes to the DAG but compute nothing immediately. Only actions like count() or collect() trigger the DAG scheduler to optimise and execute the full pipeline.
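The same lazy-until-action behaviour can be mimicked with plain Python generators (an analogy in stdlib Python, not the Spark API):

```python
# Lazy pipelines: generator expressions describe work, like Spark
# transformations, but compute nothing until a terminal step
# (the analogue of a Spark action) pulls data through.
data = range(1, 11)

evens = (x for x in data if x % 2 == 0)   # "filter": no work done yet
squares = (x * x for x in evens)          # "map": still nothing computed

# Only this "action" executes the whole pipeline end to end:
result = sum(1 for _ in squares)          # analogue of count()
print(result)   # 5
```

Unlike Python generators, Spark uses the full DAG at action time to optimise the plan (e.g. pushing filters down) before executing it across the cluster.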