
⚡ Interactive Visual Learning

Master Big Data
From Zero to Production

Explore the 5 V's, build pipelines, map the ecosystem, and launch your big data career — all in one interactive app.

Volume · Velocity · Variety · Veracity · Value
2.5 Quintillion bytes of data created every day
5 Core V's of Big Data

Core Concepts

Six foundational pillars every big data engineer must master

🌊
The 5 V's — Deep Dive
📦 Volume — Scale of Data
Data measured in petabytes and exabytes. Facebook processes 4 PB/day; Walmart handles 2.5 PB of transaction data per hour. Horizontal scaling with distributed storage (HDFS, S3) is essential — no single machine can hold it all.
⚡ Velocity — Speed of Data
Real-time vs. batch processing. Twitter generates 6,000 tweets/sec; NYSE processes 1M+ trades/sec. Stream processing (Kafka, Spark Streaming, Flink) handles high-velocity data without buffering entire datasets.
🎲 Variety — Types of Data
Structured (SQL tables), Semi-structured (JSON/XML/logs), Unstructured (images, video, text). 80% of enterprise data is unstructured. Data lakes handle all types; warehouses require schema-on-write.
🎯 Veracity — Quality of Data
Trustworthiness, accuracy, consistency. Biased sensors, dirty records, missing values corrupt analytics. Data quality pipelines (Great Expectations, dbt tests) validate schema, nulls, and statistical distributions.
💎 Value — Business Impact
Converting raw data into insights. Only ~1% of data is ever analyzed. ROI requires clear use cases — churn prediction, fraud detection, personalization. Value justifies the infrastructure investment.
Fundamentals
🐘
Hadoop & HDFS Architecture

Hadoop Distributed File System splits files into 128 MB blocks replicated across DataNodes (default 3×). The NameNode holds metadata only.

[ NameNode ]
├─ Metadata only
├─ Block map & locations
└─ Single point of truth
[ DataNodes × N ]
├─ Block storage (128 MB)
├─ Heartbeat every 3s
└─ Replication factor: 3
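
Back-of-the-envelope, in Python: how the 128 MB block size and 3× replication translate into block counts and raw disk. A minimal sketch; the 1 TB file size is purely illustrative.

import math

BLOCK_SIZE_MB = 128   # HDFS default block size
REPLICATION = 3       # HDFS default replication factor

def hdfs_footprint(file_size_mb: float) -> tuple[int, float]:
    """Return (block count, raw storage consumed in MB) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return blocks, file_size_mb * REPLICATION

# Hypothetical 1 TB log file: 8192 blocks, 3 TB of raw disk across DataNodes
blocks, raw_mb = hdfs_footprint(1024 * 1024)
print(f"{blocks} blocks, {raw_mb / 1024 / 1024:.1f} TB raw storage")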
Industry Adoption: 72%
Storage
🗺️
MapReduce Pipeline

The original Hadoop processing model. Input is split into chunks processed by parallel map tasks, then aggregated via reduce. Disk-heavy but fault-tolerant, well suited to batch ETL.

INPUT → Split into chunks
MAP    → key-value pairs
SHUFFLE → group by key
REDUCE → aggregate per key
OUTPUT → write to HDFS

⚠️ Writes to disk between each stage — Spark replaced this with in-memory DAGs for 100× speed.
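
The same map → shuffle → reduce shape can be sketched in plain Python: a toy single-machine word count, not Hadoop itself.

from collections import defaultdict

lines = ["big data big pipelines", "big data tools"]

# MAP: emit (word, 1) pairs
mapped = [(word, 1) for line in lines for word in line.split()]

# SHUFFLE: group values by key
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# REDUCE: aggregate per key
counts = {word: sum(ones) for word, ones in groups.items()}
print(counts)  # {'big': 3, 'data': 2, 'pipelines': 1, 'tools': 1}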

Processing
⚡
Apache Spark — In-Memory Magic

Spark keeps data in RAM across a Directed Acyclic Graph (DAG) of transformations. 100× faster than MapReduce for iterative ML. 10× faster for batch ETL.

MapReduce: 💾 Disk — writes every stage  |  Spark: 🧠 RAM — in-memory DAG

Spark APIs: RDD (low-level), DataFrame (SQL-like), Dataset (typed). PySpark enables Python. MLlib provides distributed ML.
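
A minimal PySpark sketch, assuming pyspark is installed and a local session: the transformations only build the DAG, and the action on the last line triggers execution.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-sketch").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 36), ("alice", 30)],
    ["user", "value"],
)

# Transformations: recorded in the DAG, nothing computed yet
agg = (df.filter(F.col("value") > 30)
         .groupBy("user")
         .agg(F.avg("value").alias("avg_value")))

agg.show()  # Action: triggers plan optimisation and execution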

Compute
🏗️
Lake vs. Warehouse vs. Lakehouse
🏞️ Data Lake
Raw, schema-on-read. Any format. S3/ADLS/GCS. Cheap storage. Risk: data swamp if ungoverned.
🏢 Data Warehouse
Structured, schema-on-write. SQL. Snowflake/BigQuery/Redshift. Fast queries, expensive, limited formats.
🏠 Lakehouse
Best of both: ACID transactions on lake storage. Delta Lake/Apache Iceberg/Hudi. Databricks/Unity Catalog.
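
A sketch of the lakehouse pattern using Delta Lake's PySpark integration (assumes the delta-spark package; the /tmp path and sample rows are illustrative):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse-sketch")
    # Delta Lake's documented Spark session settings
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "event"])

# ACID append on cheap lake storage: lake economics, warehouse guarantees
df.write.format("delta").mode("append").save("/tmp/lake/events")

# Time travel: read the table as of an earlier version
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/lake/events")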
Architecture
🌊
Apache Kafka — Event Streaming

Kafka is a distributed commit log — a durable, ordered, replayable stream of events. Used by 80%+ of Fortune 100 for real-time data pipelines.

PRODUCERS → publish to Topics
BROKERS   → partition & replicate
TOPICS    → split into Partitions
CONSUMERS → read at own offset
Throughput: 1M+ msg/sec

Kafka Connect syncs external systems such as databases; Kafka Streams handles in-process stream processing; Confluent Cloud offers managed Kafka.
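
A minimal produce/consume round-trip with the kafka-python client: a sketch that assumes a broker on localhost:9092 and an "events" topic.

import json
from kafka import KafkaProducer, KafkaConsumer

# PRODUCER: publish a JSON event to a topic
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("events", {"user": "alice", "action": "click"})
producer.flush()

# CONSUMER: read from the earliest offset, at its own pace
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.offset, message.value)
    break  # stop after one message for the demo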

Streaming

Pipeline Builder

Toggle components to see how a real Big Data pipeline comes together

📡 Data Source
🌊 Kafka
🗄️ HDFS / S3
Spark
🐝 Hive / DW
📊 Superset
🌬️ Airflow
📡 Data Source → 🌊 Kafka Broker → 🗄️ HDFS / S3 → ⚡ Spark ETL → 🐝 Hive / DW → 📊 Apache Superset
🌬️ Airflow — Orchestrator: schedules & monitors all stages

⚡ Full Lambda Architecture Active

All 7 components enabled. Data flows: Sources → Kafka (stream ingestion) → HDFS/S3 (raw storage) → Spark (ETL/processing) → Hive/DW (serving layer) → Superset (visualization). Airflow orchestrates every stage. Toggle components to explore partial architectures.
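
As a sketch, the orchestration layer might look like this in Airflow (2.4+ API); the DAG id, task ids, and echo commands are placeholders, not a real deployment.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="lambda_pipeline",   # hypothetical name
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
):
    ingest = BashOperator(task_id="kafka_to_s3",
                          bash_command="echo 'stream -> raw storage'")
    transform = BashOperator(task_id="spark_etl",
                             bash_command="echo 'spark-submit etl_job.py'")
    serve = BashOperator(task_id="load_hive_dw",
                         bash_command="echo 'refresh serving tables'")

    ingest >> transform >> serve   # dependency chaining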

Ecosystem Galaxy Map

The Big Data tool landscape — organized by layer

INGEST   🌊 Kafka · 🔌 Flume · ☁️ Kinesis · 📨 Pub/Sub · 🦑 Sqoop · Flink
STORE    🐘 HDFS · ☁️ AWS S3 · 🏛️ HBase · Δ Delta Lake · 👁️ Cassandra · 🧊 Iceberg
PROCESS  Apache Spark · 🐝 Hive · 🔍 Presto · 🔧 dbt · 🌬️ Airflow
SERVE    📊 Superset · 📈 Tableau · ❄️ Snowflake · 🔎 BigQuery · 🧱 Databricks
Click any tool to learn more ↓

🗺️ Big Data Ecosystem Map

Click any tool in the map above to see details about it — what it does, who uses it, and where it fits in the stack.

Learning Resources

Curated videos, courses, books & podcasts to go deep

🎬
Big Data in 5 Minutes — Simplilearn
Fast-paced overview of Hadoop, Spark, and the Big Data ecosystem. Perfect starting point.
→ Watch on YouTube
🎬
Hadoop Full Course — edureka!
7-hour deep dive into HDFS, MapReduce, YARN, Hive, Pig, HBase, and Spark.
→ Watch on YouTube
🎬
Apache Spark Tutorial — edureka!
Comprehensive Spark tutorial covering RDDs, DataFrames, Streaming, MLlib, and GraphX.
→ Watch on YouTube
🎬
Ken Jee — Data Engineering Roadmap
Practical career advice from a data science veteran. What to learn and in what order.
→ Ken Jee Channel
🎬
Alex The Analyst — SQL & Data Tools
Beginner-friendly tutorials on SQL, Python, Power BI — essential data foundations.
→ Alex The Analyst Channel
🎬
Kafka in 100 Seconds — Fireship
Brilliant quick explainer of Kafka's publish-subscribe model and when to use it.
→ Watch on YouTube
💻
IBM Data Engineering Professional — Coursera
16-course program covering SQL, NoSQL, Big Data, Spark, ETL, Kafka, and Airflow. Industry-recognized certificate.
→ coursera.org
💻
Big Data Fundamentals with Hadoop & Spark — DataCamp
Hands-on course with real PySpark exercises in a browser-based environment. No install needed.
→ datacamp.com
💻
Apache Spark with Python — Udemy (Frank Kane)
Best-selling Spark/PySpark course. Covers DataFrames, Spark SQL, Streaming, and ML with Amazon EMR labs.
→ udemy.com
💻
Data Engineering Zoomcamp — DataTalks.Club
Free 10-week bootcamp: containerization, workflow orchestration, data warehousing, Spark, and Kafka.
→ GitHub (Free)
💻
Fundamentals of Data Engineering — O'Reilly
Comprehensive course matching the landmark 2022 book by Joe Reis & Matt Housley.
→ oreilly.com
📚
Hadoop: The Definitive Guide — Tom White
The canonical Hadoop reference. Covers HDFS, MapReduce, YARN, HBase, Hive, Pig, and Flume with production patterns.
→ O'Reilly
📚
Learning Spark (2nd Ed.) — Damji et al.
Authoritative Spark 3.0 guide from Databricks engineers. RDDs, DataFrames, Structured Streaming, Delta Lake.
→ O'Reilly (Free PDF)
📚
Designing Data-Intensive Applications — Kleppmann
The bible of distributed systems engineering. Replication, sharding, consensus, stream processing — all explained.
→ O'Reilly
📚
Fundamentals of Data Engineering — Reis & Housley
Modern DE stack: the data engineering lifecycle, orchestration, storage, transformation, and serving.
→ O'Reilly
🎧
Data Skeptic
Weekly episodes on data science, ML, and AI — with strong Big Data engineering segments. Over 400 episodes.
→ dataskeptic.com
🎧
Super Data Science — Jon Krohn
Interviews with leading data professionals. Strong coverage of Spark, MLOps, and data platform engineering.
→ superdatascience.com
🎧
Streaming Audio — Confluent
Deep dives into Apache Kafka, event streaming architectures, and real-time data engineering from the Kafka creators.
→ Confluent Podcast
🎧
The Data Engineering Podcast
Tobias Macey interviews engineers building data platforms. Tools, patterns, and war stories from production systems.
→ dataengineeringpodcast.com
🧪
Databricks Community Edition
Free Spark cluster in the cloud. Run PySpark notebooks, experiment with Delta Lake, and try MLflow — no credit card.
→ Sign Up Free
🧪
Google BigQuery Sandbox
Analyze public datasets (Wikipedia, GitHub, NYC Taxi) with SQL — free 1 TB/month quota. No billing setup needed.
→ Google Cloud Console
🧪
Apache Kafka Quickstart
Official Kafka quickstart guide. Run a local broker, create topics, produce, and consume events in under 15 minutes.
→ kafka.apache.org
🧪
Snowflake Free Trial (30-day)
$400 of free credits to explore Snowflake's cloud data warehouse, Time Travel, and data sharing features.
→ signup.snowflake.com

Learning Roadmap

Your 6-phase journey from data curious to cloud-scale engineer

1

Data Literacy

What is Big Data? The 5 V's
Structured vs. unstructured data
File formats: JSON, CSV, Parquet, Avro
Basic statistics & probability
Excel / Google Sheets fluency
Intro to databases & RDBMS
⏱ 4–6 weeks
2

SQL + Python

SQL: SELECT, JOIN, GROUP BY, CTEs
Window functions & subqueries
Python: pandas, numpy, matplotlib
Data cleaning & transformation
Jupyter notebooks workflow
Git version control basics
⏱ 6–10 weeks
3

Hadoop Ecosystem

HDFS: blocks, replication, NameNode
MapReduce: map, shuffle, reduce
YARN resource management
Hive: HiveQL & metastore
HBase: wide-column NoSQL
Run Hadoop locally via Docker
⏱ 6–8 weeks
4

Apache Spark

RDDs, DataFrames, Datasets
PySpark & Spark SQL
Lazy evaluation & DAGs
Spark Structured Streaming
MLlib: classification & clustering
Databricks Community Edition labs
⏱ 8–12 weeks
5

Streaming & Pipelines

Kafka: topics, partitions, offsets
Kafka Connect & Kafka Streams
Apache Flink stateful streaming
Airflow DAGs & operators
dbt for data transformation
Lambda vs. Kappa architecture
⏱ 8–10 weeks
6

Cloud Big Data

AWS EMR / Glue / Kinesis
Google BigQuery & Dataflow
Azure HDInsight & Synapse
Snowflake & Databricks Lakehouse
Delta Lake / Apache Iceberg
CI/CD for data pipelines
⏱ 10–16 weeks

Big Data Cheat Sheet

20 essential terms — hover a card and click 📋 to copy the definition

HDFS
Hadoop Distributed File System. 128 MB blocks replicated 3× across DataNodes. NameNode holds block map.
MapReduce
Batch model: Input→Map(k/v)→Shuffle(group)→Reduce(aggregate). Disk I/O between stages; superseded by Spark.
Apache Spark
In-memory cluster compute. 100× faster than MapReduce for ML. APIs: RDD, DataFrame, Dataset. Batch & streaming.
RDD
Resilient Distributed Dataset. Spark's low-level immutable partitioned abstraction. Lazy evaluation. Transformations + Actions.
Apache Kafka
Distributed commit log for event streaming. Topics→Partitions→Offsets. Replayable & durable. 1M+ msg/sec.
Data Lake
Raw data in any format, schema-on-read. Cheap object storage (S3/ADLS/GCS). Risk: data swamp without governance.
Data Warehouse
Structured, SQL, schema-on-write analytical store. OLAP workloads. Snowflake, BigQuery, Redshift, Synapse.
Lakehouse
Lake storage + warehouse ACID features (Delta Lake, Iceberg, Hudi). Schema enforcement, BI support, time travel.
Apache Airflow
Python DAG-based orchestration. Schedules, monitors, retries pipeline tasks. 1000+ built-in operators.
dbt
Transform data inside warehouse using SQL SELECT models. Auto-generates tests, lineage graphs, and documentation.
Scale: KB → ZB
KB(10³)→MB(10⁶)→GB(10⁹)→TB(10¹²)→PB(10¹⁵)→EB(10¹⁸)→ZB(10²¹). Big Data at TB+. Internet ~5 ZB/yr.
Parquet
Columnar storage format. 3–10× better compression than CSV/JSON. Column pruning & predicate pushdown for fast queries.
Partitioning
Divide data by key (date, region) to reduce files scanned per query. repartition() / coalesce() in Spark.
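A quick PySpark illustration of those two calls (toy DataFrame, arbitrary partition counts):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-sketch").getOrCreate()
df = spark.range(1_000_000)             # toy DataFrame with an `id` column

wide = df.repartition(200)              # full shuffle into 200 partitions
narrow = wide.coalesce(10)              # merge down to 10, avoiding a full shuffle
print(narrow.rdd.getNumPartitions())    # 10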
Lambda Architecture
Batch layer (full recompute) + Speed layer (real-time) + Serving layer (merged). Complexity → Kappa alternative.
CAP Theorem
Distributed systems guarantee only 2 of: Consistency (latest write), Availability (always responds), Partition Tolerance.
Delta Lake
ACID transactions on Parquet files. Time travel (query history), schema evolution, DML on data lake storage.
Shuffle
Most expensive Spark op — redistributes data across partitions (groupBy, join, distinct). Minimize with broadcast joins.
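A broadcast-join sketch in PySpark: the tiny dimension table ships to every executor, so the large fact table is never shuffled (toy data throughout).

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-sketch").getOrCreate()

facts = spark.createDataFrame([(1, 9.99), (2, 4.50)], ["sku", "price"])
dims = spark.createDataFrame([(1, "book"), (2, "pen")], ["sku", "name"])

# Small table is replicated to executors; no shuffle of the fact table
facts.join(broadcast(dims), "sku").show()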
Snowflake
Cloud DW with separated storage & compute. Virtual warehouses scale independently. Time Travel 90 days. Multi-cloud.
ETL vs ELT
ETL: transform before loading. ELT: load raw then transform inside warehouse. Cloud DWs make ELT preferred & cheaper.
YARN
Yet Another Resource Negotiator — Hadoop cluster manager. ResourceManager + NodeManagers. Spark, Hive, MR share one cluster.

Foundational Papers & Architecture

The landmark research and architectural patterns that built the Big Data world

2003
The Google File System (GFS)
Ghemawat, Gobioff & Leung — Google
SOSP 2003

A scalable distributed file system for large data-intensive applications on commodity hardware. Files are split into fixed 64 MB chunks, each replicated 3× across chunkservers. A single master holds namespace metadata; clients talk directly to chunkservers for I/O. Designed for append-dominant, sequential-read workloads — not random writes.

⚡ Directly inspired HDFS — the storage backbone of Apache Hadoop.

2004
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean & Sanjay Ghemawat — Google
OSDI 2004

"MapReduce is a programming model and an associated implementation for processing and generating large datasets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key." The runtime auto-parallelises across thousands of machines and handles node failures transparently.

⚡ >100,000 MapReduce jobs/day at Google within 4 years. Foundation of Apache Hadoop.

2006
Bigtable: A Distributed Storage System for Structured Data
Chang, Dean, Ghemawat, Hsieh et al. — Google
OSDI 2006 — Best Paper Award

Manages petabytes of structured data across thousands of commodity servers. Data model: a sparse, distributed, persistent multi-dimensional sorted map indexed by (row key × column key × timestamp → value). Served Google's web indexing, Google Earth, and Google Finance — wildly different latency and scale requirements, one unified system.

⚡ Directly inspired Apache HBase, Cassandra & the entire NoSQL movement.

2006
Apache Hadoop
Doug Cutting & Mike Cafarella — Yahoo! / Apache Foundation
Open-Source Implementation of GFS + MapReduce

Doug Cutting implemented the GFS and MapReduce papers in open-source Java to power the Nutch web crawler, naming the project after his son's toy elephant. HDFS is a direct open-source implementation of the GFS architecture. Hadoop became the de facto Big Data platform and spawned the entire ecosystem: Hive, HBase, Pig, and Apache Spark.

⚡ The most influential open-source project in data engineering history.

Lambda vs. Kappa Architecture

Two competing patterns for combining batch and real-time data processing

λ Lambda Architecture
Nathan Marz, 2011 — Creator of Apache Storm
Batch Layer — Immutable master dataset; recomputes views over full history (Hadoop / Spark)
Speed Layer — Fills the batch gap with low-latency real-time views (Flink / Storm)
Serving Layer — Merges batch + speed views to answer queries (Druid / HBase)
✓ Highly fault-tolerant, historically accurate  |  ✗ Two codebases to maintain, complex result reconciliation
κ Kappa Architecture
Jay Kreps, 2014 — Co-creator of Apache Kafka
Stream Layer — All data treated as a stream; reprocess history by replaying the Kafka log from offset 0
Serving Layer — Single unified view from stream output (Kafka + Flink + query store)
✓ Single codebase, simpler ops, stream-first  |  ✗ Reprocessing huge history is expensive; deep historical analytics harder

The CAP Theorem

Eric Brewer's conjecture (PODC 2000) — formally proven by Gilbert & Lynch, MIT (2002)

A distributed system can guarantee at most 2 of these 3 properties simultaneously

CA — Consistency + Availability
Every read sees the latest write; every request gets a response. Cannot tolerate network partitions — must be a single-node or tightly coupled cluster.
Examples: PostgreSQL, MySQL, traditional RDBMS
CP — Consistency + Partition Tolerance
Returns correct, up-to-date data even across partitions; may block or refuse requests to maintain consistency. Availability is sacrificed.
Examples: HBase, Zookeeper, MongoDB (default config)
AP — Availability + Partition Tolerance
Always responds and survives partitions, but data may be temporarily stale. Nodes converge to the same state over time — eventual consistency.
Examples: Cassandra, DynamoDB, CouchDB

Key Terms — Primary Source Definitions

Authoritative definitions drawn directly from foundational papers, documentation, and original research

MapReduce
— Dean & Ghemawat, Google Research, OSDI 2004
"A programming model and an associated implementation for processing and generating large datasets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key." The runtime auto-parallelises and handles machine failures transparently.
HDFS (Hadoop Distributed File System)
— Apache Hadoop Documentation
An open-source implementation of the Google File System architecture. Files are split into large blocks (default 128 MB) distributed across DataNodes. A NameNode holds all filesystem metadata. Designed for sequential, high-throughput reads of very large files on commodity hardware — not random-access I/O.
Commodity Hardware
— GFS & MapReduce papers, Google Research, 2003–2004
Standard, inexpensive off-the-shelf servers — as opposed to specialised, high-cost machines. Both GFS and MapReduce were explicitly designed with the assumption that component failures are the norm, not the exception, enabling massive horizontal scaling at low unit cost.
Data Locality
— MapReduce paper, Dean & Ghemawat, 2004
The principle of moving computation to the data rather than moving data to the computation. The MapReduce runtime schedules map tasks on the node — or a nearby rack node — that already holds the relevant HDFS block, dramatically reducing network I/O and improving throughput on large datasets.
Fault Tolerance
— GFS, MapReduce & Bigtable papers, Google, 2003–2006
The ability of a system to continue operating correctly when one or more components fail. In Big Data systems this is achieved through data replication (typically 3x), automatic task re-execution on failure, and lineage-based recovery (Spark RDDs) rather than expensive checkpointing.
RDD (Resilient Distributed Dataset)
— Zaharia et al., "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing", NSDI 2012
An immutable, fault-tolerant, partitioned collection of elements that can be operated on in parallel. RDDs are "resilient" because they maintain a lineage graph (DAG) of transformations, enabling any lost partition to be recomputed from the original source rather than requiring physical data replication.
DAG Execution
— Apache Spark Documentation
In Spark, all transformations on RDDs or DataFrames are lazy — recorded as nodes in a Directed Acyclic Graph (DAG) but not executed until an action is called. The DAG scheduler then optimises the plan, coalescing stages to minimise expensive data shuffles and enabling intelligent fault recovery.
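A small PySpark demonstration of this: explain() prints the optimised plan before anything runs, and count() is the action that finally executes it.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-sketch").getOrCreate()
df = spark.range(10_000)

# Transformations: only recorded in the DAG
pipeline = df.filter(df.id % 2 == 0).selectExpr("id * 10 AS scaled")

pipeline.explain()        # inspect the optimised plan; still no execution
print(pipeline.count())   # action: the DAG scheduler runs the job → 5000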
Stream Processing
— Apache Kafka & Apache Flink Documentation
Processing data records continuously as they arrive, in near-real-time, rather than accumulating them into a batch first. Each event is processed individually or in micro-batches with millisecond-to-second latency. Used for fraud detection, live dashboards, real-time recommendations, and IoT sensor analytics.
Batch Processing
— MapReduce paradigm, Dean & Ghemawat, 2004
Processing a bounded, pre-accumulated dataset in a single job run — typically on a schedule (hourly, daily). Optimised for high throughput over low latency. Classic examples: nightly ETL pipelines, end-of-day financial aggregations, weekly model retraining on historical data.
Schema-on-Read
— Data Lake architectural pattern
Data is stored in its raw, native format without enforcing a schema at write time. The schema is applied only when the data is read or queried. Allows ingesting any data immediately with no up-front modelling; the trade-off is that schema errors are discovered later, at query time.
Schema-on-Write
— Traditional Data Warehouse pattern
A schema is defined and enforced before data is written to the store. Guarantees data quality and structure at the point of ingestion. Used in traditional relational databases and cloud data warehouses (Snowflake, Redshift). Slower ingestion pipeline but faster, more reliable queries.
Data Lakehouse
— Armbrust et al., Databricks / Delta Lake, 2021
A data management architecture combining the low-cost, flexible storage of a Data Lake with the ACID transaction guarantees and governance of a Data Warehouse. Implemented through open table formats — Delta Lake, Apache Iceberg, Apache Hudi — that add transactional metadata layers over cloud object storage.
ACID Transactions (Data Context)
— Delta Lake / Databricks Documentation
Atomicity: a transaction fully completes or fully fails — no partial writes. Consistency: data remains in a valid state. Isolation: concurrent transactions do not interfere. Durability: committed data survives failures. Brought to data lakes via Delta Lake and Iceberg to prevent corrupt tables during concurrent streaming writes.
ETL vs. ELT
— Modern Data Stack architectural pattern
ETL (Extract-Transform-Load): data is cleaned and transformed before loading into the destination — classic in on-premise warehouses with limited target compute. ELT (Extract-Load-Transform): raw data lands first, then transformed in-place using the warehouse's own compute — preferred in cloud-native stacks (Snowflake, BigQuery, dbt).
Partitioning
— Apache Hive, Spark & distributed database documentation
Dividing a large dataset into smaller segments based on one or more column values (e.g., date, region). Queries targeting a specific partition read only that segment — known as partition pruning. In HDFS/Hive, partitions are physical sub-directories on disk. Reduces I/O dramatically on large datasets.
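In PySpark this looks roughly like the following (path and columns are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pruning-sketch").getOrCreate()
df = spark.createDataFrame(
    [("2024-01-01", "eu", 5), ("2024-01-02", "us", 7)],
    ["date", "region", "clicks"],
)

# Physical layout: one sub-directory per date value (.../date=2024-01-01/)
df.write.partitionBy("date").mode("overwrite").parquet("/tmp/events")

# Partition pruning: only the matching directory is scanned
jan1 = spark.read.parquet("/tmp/events").filter("date = '2024-01-01'")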
Sharding
— Distributed database systems (MongoDB, Cassandra)
Horizontal partitioning of data across multiple independent database nodes, where each shard holds a distinct subset of rows. Unlike replication (which copies data for redundancy), sharding distributes data to increase write throughput and total storage capacity. Each shard operates as an independent database instance.
Replication Factor
— GFS paper, Ghemawat et al., SOSP 2003; HDFS Documentation
The number of copies of each data block maintained across different nodes. GFS and HDFS default to 3x: one primary copy plus two replicas, ideally placed on different racks for rack-fault tolerance. Higher replication increases fault tolerance and read throughput at the cost of storage overhead.
Compaction
— Apache HBase, Cassandra & Delta Lake documentation
The background process of merging many small files or SSTables into fewer, larger ones to improve read performance, reclaim space from deleted and updated records, and reduce metadata overhead. In Delta Lake, compaction (OPTIMIZE + ZORDER) co-locates related Parquet data for faster query performance via data skipping.
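In Delta Lake the SQL form is roughly as follows, a sketch assuming a Delta-enabled session (as in the lakehouse example above) and OPTIMIZE/ZORDER support (Delta Lake 2.0+ or Databricks); the path and column are illustrative.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("compaction-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

spark.sql("OPTIMIZE delta.`/tmp/lake/events`")                    # merge small Parquet files
spark.sql("OPTIMIZE delta.`/tmp/lake/events` ZORDER BY (event)")  # co-locate related rows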

Knowledge Check

8 questions from foundational papers and primary sources — click an answer to see the explanation

QUESTION 1 OF 8
Google's MapReduce paper was presented at which venue and in which year?
Correct: OSDI 2004. "MapReduce: Simplified Data Processing on Large Clusters" by Jeffrey Dean and Sanjay Ghemawat was presented at the 6th Symposium on Operating Systems Design and Implementation (OSDI), San Francisco, 2004. It is one of the most cited papers in computer science history.
QUESTION 2 OF 8
What does RDD stand for, and who introduced the concept?
Correct: Resilient Distributed Dataset, Matei Zaharia et al., NSDI 2012. RDDs were introduced in "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." An RDD is an immutable, fault-tolerant collection partitioned across nodes, recoverable via its lineage graph (DAG) without physical data replication.
QUESTION 3 OF 8
The CAP theorem states a distributed system can simultaneously guarantee at most how many of its three properties?
Correct: At most two. Eric Brewer conjectured this at PODC 2000; Gilbert & Lynch (MIT) formally proved it in 2002. In practice, network partitions are unavoidable, so designers choose between CP (Consistency + Partition tolerance) and AP (Availability + Partition tolerance).
QUESTION 4 OF 8
Lambda Architecture was proposed by whom, and in what year?
Correct: Nathan Marz, 2011. Nathan Marz (creator of Apache Storm) proposed Lambda Architecture in a 2011 blog post. It combines a Batch Layer (full historical recomputation), a Speed Layer (real-time gap-filling), and a Serving Layer (merged query interface).
QUESTION 5 OF 8
In the original Google File System paper (SOSP 2003), what is the default chunk size used to store files?
Correct: 64 MB. Ghemawat et al. specified 64 MB chunks — far larger than typical file system blocks. This reduces the master's metadata overhead and supports the sequential, large-file read patterns of Google's workloads. Apache HDFS adopted 128 MB as its default block size.
QUESTION 6 OF 8
Kappa Architecture (Kreps, 2014) simplifies Lambda by replacing which two layers with a single streaming layer?
Correct: Batch Layer and Speed Layer. In "Questioning the Lambda Architecture" (2014), Kreps argued that maintaining two separate processing codebases is unnecessarily complex. Kappa replaces both with a single stream layer backed by a replayable Kafka log — reprocess history by replaying from offset 0.
QUESTION 7 OF 8
The Google Bigtable paper received what distinction at OSDI 2006, and which open-source systems did it directly inspire?
Correct: Best Paper Award at OSDI 2006. Bigtable (Chang, Dean, Ghemawat et al.) won Best Paper at OSDI '06 and later the SIGOPS Hall of Fame Award in 2016. Its sparse, multi-dimensional sorted map model directly inspired Apache HBase (an open-source Bigtable clone) and strongly influenced Cassandra's column-family design.
QUESTION 8 OF 8
In Apache Spark's execution model, what does "lazy evaluation" mean?
Correct: Transformations build a DAG; execution waits for an action. Operations like filter() and map() are transformations — they add nodes to the DAG but compute nothing immediately. Only actions like count() or collect() trigger the DAG scheduler to optimise and execute the full pipeline.