
⚡ Interactive Visual Learning

Master Big Data
From Zero to Production

Explore the 5 V's, build pipelines, map the ecosystem, and launch your big data career — all in one interactive app.

Volume · Velocity · Variety · Veracity · Value
2.5 Quintillion bytes of data created every day
5 Core V's of Big Data

Core Concepts

Six foundational pillars every big data engineer must master

🌊
The 5 V's — Deep Dive
📦 Volume — Scale of Data
Data measured in petabytes and exabytes. Facebook processes 4 PB/day; Walmart handles 2.5 PB of transaction data per hour. Horizontal scaling with distributed storage (HDFS, S3) is essential — no single machine can hold it all.
⚡ Velocity — Speed of Data
Real-time vs. batch processing. Twitter generates 6,000 tweets/sec; NYSE processes 1M+ trades/sec. Stream processing (Kafka, Spark Streaming, Flink) handles high-velocity data without buffering entire datasets.
🎲 Variety — Types of Data
Structured (SQL tables), Semi-structured (JSON/XML/logs), Unstructured (images, video, text). 80% of enterprise data is unstructured. Data lakes handle all types; warehouses require schema-on-write.
🎯 Veracity — Quality of Data
Trustworthiness, accuracy, consistency. Biased sensors, dirty records, missing values corrupt analytics. Data quality pipelines (Great Expectations, dbt tests) validate schema, nulls, and statistical distributions.
💎 Value — Business Impact
Converting raw data into insights. Only ~1% of data is ever analyzed. ROI requires clear use cases — churn prediction, fraud detection, personalization. Value justifies the infrastructure investment.
Fundamentals
🐘
Hadoop & HDFS Architecture

Hadoop Distributed File System splits files into 128 MB blocks replicated across DataNodes (default 3×). The NameNode holds metadata only.

[ NameNode ]
├─ Metadata only
├─ Block map & locations
└─ Single point of truth
[ DataNodes × N ]
├─ Block storage (128 MB)
├─ Heartbeat every 3s
└─ Replication factor: 3
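
Back-of-the-envelope, in Python: how the 128 MB block size and 3× replication translate into block counts and raw disk. A minimal sketch; the 1 TB file size is purely illustrative.

import math

BLOCK_SIZE_MB = 128   # HDFS default block size
REPLICATION = 3       # HDFS default replication factor

def hdfs_footprint(file_size_mb: float) -> tuple[int, float]:
    """Return (block count, raw storage consumed in MB) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return blocks, file_size_mb * REPLICATION

# Hypothetical 1 TB log file: 8192 blocks, 3 TB of raw disk across DataNodes
blocks, raw_mb = hdfs_footprint(1024 * 1024)
print(f"{blocks} blocks, {raw_mb / 1024 / 1024:.1f} TB raw storage")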
Industry Adoption: 72%
Storage
🗺️
MapReduce Pipeline

The original Hadoop processing model. Input is split into chunks processed by parallel map tasks, then aggregated via reduce. Disk-heavy but fault-tolerant, well suited to batch ETL.

INPUT → Split into chunks
MAP    → key-value pairs
SHUFFLE → group by key
REDUCE → aggregate per key
OUTPUT → write to HDFS

⚠️ Writes to disk between each stage — Spark replaced this with in-memory DAGs for 100× speed.
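
The same map → shuffle → reduce shape can be sketched in plain Python: a toy single-machine word count, not Hadoop itself.

from collections import defaultdict

lines = ["big data big pipelines", "big data tools"]

# MAP: emit (word, 1) pairs
mapped = [(word, 1) for line in lines for word in line.split()]

# SHUFFLE: group values by key
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# REDUCE: aggregate per key
counts = {word: sum(ones) for word, ones in groups.items()}
print(counts)  # {'big': 3, 'data': 2, 'pipelines': 1, 'tools': 1}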

Processing
⚡
Apache Spark — In-Memory Magic

Spark keeps data in RAM across a Directed Acyclic Graph (DAG) of transformations. 100× faster than MapReduce for iterative ML. 10× faster for batch ETL.

MapReduce: 💾 Disk — writes every stage  |  Spark: 🧠 RAM — in-memory DAG

Spark APIs: RDD (low-level), DataFrame (SQL-like), Dataset (typed). PySpark enables Python. MLlib provides distributed ML.
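
A minimal PySpark sketch, assuming pyspark is installed and a local session: the transformations only build the DAG, and the action on the last line triggers execution.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("spark-sketch").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 36), ("alice", 30)],
    ["user", "value"],
)

# Transformations: recorded in the DAG, nothing computed yet
agg = (df.filter(F.col("value") > 30)
         .groupBy("user")
         .agg(F.avg("value").alias("avg_value")))

agg.show()  # Action: triggers plan optimisation and execution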

Compute
🏗️
Lake vs. Warehouse vs. Lakehouse
🏞️ Data Lake
Raw, schema-on-read. Any format. S3/ADLS/GCS. Cheap storage. Risk: data swamp if ungoverned.
🏢 Data Warehouse
Structured, schema-on-write. SQL. Snowflake/BigQuery/Redshift. Fast queries, expensive, limited formats.
🏠 Lakehouse
Best of both: ACID transactions on lake storage. Delta Lake/Apache Iceberg/Hudi. Databricks/Unity Catalog.
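
A sketch of the lakehouse pattern using Delta Lake's PySpark integration (assumes the delta-spark package; the /tmp path and sample rows are illustrative):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse-sketch")
    # Delta Lake's documented Spark session settings
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "event"])

# ACID append on cheap lake storage: lake economics, warehouse guarantees
df.write.format("delta").mode("append").save("/tmp/lake/events")

# Time travel: read the table as of an earlier version
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/lake/events")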
Architecture
🌊
Apache Kafka — Event Streaming

Kafka is a distributed commit log — a durable, ordered, replayable stream of events. Used by 80%+ of Fortune 100 for real-time data pipelines.

PRODUCERS → publish to Topics
BROKERS   → partition & replicate
TOPICS    → split into Partitions
CONSUMERS → read at own offset
Throughput: 1M+ msg/sec

Kafka Connect syncs external systems such as databases; Kafka Streams handles in-process stream processing; Confluent Cloud offers managed Kafka.
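
A minimal produce/consume round-trip with the kafka-python client: a sketch that assumes a broker on localhost:9092 and an "events" topic.

import json
from kafka import KafkaProducer, KafkaConsumer

# PRODUCER: publish a JSON event to a topic
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("events", {"user": "alice", "action": "click"})
producer.flush()

# CONSUMER: read from the earliest offset, at its own pace
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.offset, message.value)
    break  # stop after one message for the demo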

Streaming

Pipeline Builder

Toggle components to see how a real Big Data pipeline comes together

📡 Data Source
🌊 Kafka
🗄️ HDFS / S3
Spark
🐝 Hive / DW
📊 Superset
🌬️ Airflow
📡 Data Source → 🌊 Kafka Broker → 🗄️ HDFS / S3 → ⚡ Spark ETL → 🐝 Hive / DW → 📊 Apache Superset
🌬️ Airflow — Orchestrator: schedules & monitors all stages

⚡ Full Lambda Architecture Active

All 7 components enabled. Data flows: Sources → Kafka (stream ingestion) → HDFS/S3 (raw storage) → Spark (ETL/processing) → Hive/DW (serving layer) → Superset (visualization). Airflow orchestrates every stage. Toggle components to explore partial architectures.
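
As a sketch, the orchestration layer might look like this in Airflow (2.4+ API); the DAG id, task ids, and echo commands are placeholders, not a real deployment.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="lambda_pipeline",   # hypothetical name
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
):
    ingest = BashOperator(task_id="kafka_to_s3",
                          bash_command="echo 'stream -> raw storage'")
    transform = BashOperator(task_id="spark_etl",
                             bash_command="echo 'spark-submit etl_job.py'")
    serve = BashOperator(task_id="load_hive_dw",
                         bash_command="echo 'refresh serving tables'")

    ingest >> transform >> serve   # dependency chaining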

Ecosystem Galaxy Map

The Big Data tool landscape — organized by layer

INGEST   🌊 Kafka · 🔌 Flume · ☁️ Kinesis · 📨 Pub/Sub · 🦑 Sqoop · Flink
STORE    🐘 HDFS · ☁️ AWS S3 · 🏛️ HBase · Δ Delta Lake · 👁️ Cassandra · 🧊 Iceberg
PROCESS  Apache Spark · 🐝 Hive · 🔍 Presto · 🔧 dbt · 🌬️ Airflow
SERVE    📊 Superset · 📈 Tableau · ❄️ Snowflake · 🔎 BigQuery · 🧱 Databricks
Click any tool to learn more ↓

🗺️ Big Data Ecosystem Map

Click any tool in the map above to see details about it — what it does, who uses it, and where it fits in the stack.

Learning Resources

Curated videos, courses, books & podcasts to go deep

🎬
Big Data in 5 Minutes — Simplilearn
Fast-paced overview of Hadoop, Spark, and the Big Data ecosystem. Perfect starting point.
→ Watch on YouTube
🎬
Hadoop Full Course — edureka!
7-hour deep dive into HDFS, MapReduce, YARN, Hive, Pig, HBase, and Spark.
→ Watch on YouTube
🎬
Apache Spark Tutorial — edureka!
Comprehensive Spark tutorial covering RDDs, DataFrames, Streaming, MLlib, and GraphX.
→ Watch on YouTube
🎬
Ken Jee — Data Engineering Roadmap
Practical career advice from a data science veteran. What to learn and in what order.
→ Ken Jee Channel
🎬
Alex The Analyst — SQL & Data Tools
Beginner-friendly tutorials on SQL, Python, Power BI — essential data foundations.
→ Alex The Analyst Channel
🎬
Kafka in 100 Seconds — Fireship
Brilliant quick explainer of Kafka's publish-subscribe model and when to use it.
→ Watch on YouTube
💻
IBM Data Engineering Professional — Coursera
16-course program covering SQL, NoSQL, Big Data, Spark, ETL, Kafka, and Airflow. Industry-recognized certificate.
→ coursera.org
💻
Big Data Fundamentals with Hadoop & Spark — DataCamp
Hands-on course with real PySpark exercises in a browser-based environment. No install needed.
→ datacamp.com
💻
Apache Spark with Python — Udemy (Frank Kane)
Best-selling Spark/PySpark course. Covers DataFrames, Spark SQL, Streaming, and ML with Amazon EMR labs.
→ udemy.com
💻
Data Engineering Zoomcamp — DataTalks.Club
Free 10-week bootcamp: containerization, workflow orchestration, data warehousing, Spark, and Kafka.
→ GitHub (Free)
💻
Fundamentals of Data Engineering — O'Reilly
Comprehensive course matching the landmark 2022 book by Joe Reis & Matt Housley.
→ oreilly.com
📚
Hadoop: The Definitive Guide — Tom White
The canonical Hadoop reference. Covers HDFS, MapReduce, YARN, HBase, Hive, Pig, and Flume with production patterns.
→ O'Reilly
📚
Learning Spark (2nd Ed.) — Damji et al.
Authoritative Spark 3.0 guide from Databricks engineers. RDDs, DataFrames, Structured Streaming, Delta Lake.
→ O'Reilly (Free PDF)
📚
Designing Data-Intensive Applications — Kleppmann
The bible of distributed systems engineering. Replication, sharding, consensus, stream processing — all explained.
→ O'Reilly
📚
Fundamentals of Data Engineering — Reis & Housley
Modern DE stack: the data engineering lifecycle, orchestration, storage, transformation, and serving.
→ O'Reilly
🎧
Data Skeptic
Weekly episodes on data science, ML, and AI — with strong Big Data engineering segments. Over 400 episodes.
→ dataskeptic.com
🎧
Super Data Science — Jon Krohn
Interviews with leading data professionals. Strong coverage of Spark, MLOps, and data platform engineering.
→ superdatascience.com
🎧
Streaming Audio — Confluent
Deep dives into Apache Kafka, event streaming architectures, and real-time data engineering from the Kafka creators.
→ Confluent Podcast
🎧
The Data Engineering Podcast
Tobias Macey interviews engineers building data platforms. Tools, patterns, and war stories from production systems.
→ dataengineeringpodcast.com
🧪
Databricks Community Edition
Free Spark cluster in the cloud. Run PySpark notebooks, experiment with Delta Lake, and try MLflow — no credit card.
→ Sign Up Free
🧪
Google BigQuery Sandbox
Analyze public datasets (Wikipedia, GitHub, NYC Taxi) with SQL — free 1 TB/month quota. No billing setup needed.
→ Google Cloud Console
🧪
Apache Kafka Quickstart
Official Kafka quickstart guide. Run a local broker, create topics, produce, and consume events in under 15 minutes.
→ kafka.apache.org
🧪
Snowflake Free Trial (30-day)
$400 of free credits to explore Snowflake's cloud data warehouse, Time Travel, and data sharing features.
→ signup.snowflake.com

Learning Roadmap

Your 6-phase journey from data curious to cloud-scale engineer

1

Data Literacy

What is Big Data? The 5 V's
Structured vs. unstructured data
File formats: JSON, CSV, Parquet, Avro
Basic statistics & probability
Excel / Google Sheets fluency
Intro to databases & RDBMS
⏱ 4–6 weeks
2

SQL + Python

SQL: SELECT, JOIN, GROUP BY, CTEs
Window functions & subqueries
Python: pandas, numpy, matplotlib
Data cleaning & transformation
Jupyter notebooks workflow
Git version control basics
⏱ 6–10 weeks
3

Hadoop Ecosystem

HDFS: blocks, replication, NameNode
MapReduce: map, shuffle, reduce
YARN resource management
Hive: HiveQL & metastore
HBase: wide-column NoSQL
Run Hadoop locally via Docker
⏱ 6–8 weeks
4

Apache Spark

RDDs, DataFrames, Datasets
PySpark & Spark SQL
Lazy evaluation & DAGs
Spark Structured Streaming
MLlib: classification & clustering
Databricks Community Edition labs
⏱ 8–12 weeks
5

Streaming & Pipelines

Kafka: topics, partitions, offsets
Kafka Connect & Kafka Streams
Apache Flink stateful streaming
Airflow DAGs & operators
dbt for data transformation
Lambda vs. Kappa architecture
⏱ 8–10 weeks
6

Cloud Big Data

AWS EMR / Glue / Kinesis
Google BigQuery & Dataflow
Azure HDInsight & Synapse
Snowflake & Databricks Lakehouse
Delta Lake / Apache Iceberg
CI/CD for data pipelines
⏱ 10–16 weeks

Big Data Cheat Sheet

20 essential terms — hover a card and click 📋 to copy the definition

HDFS
Hadoop Distributed File System. 128 MB blocks replicated 3× across DataNodes. NameNode holds block map.
MapReduce
Batch model: Input→Map(k/v)→Shuffle(group)→Reduce(aggregate). Disk I/O between stages; superseded by Spark.
Apache Spark
In-memory cluster compute. 100× faster than MapReduce for ML. APIs: RDD, DataFrame, Dataset. Batch & streaming.
RDD
Resilient Distributed Dataset. Spark's low-level immutable partitioned abstraction. Lazy evaluation. Transformations + Actions.
Apache Kafka
Distributed commit log for event streaming. Topics→Partitions→Offsets. Replayable & durable. 1M+ msg/sec.
Data Lake
Raw data in any format, schema-on-read. Cheap object storage (S3/ADLS/GCS). Risk: data swamp without governance.
Data Warehouse
Structured, SQL, schema-on-write analytical store. OLAP workloads. Snowflake, BigQuery, Redshift, Synapse.
Lakehouse
Lake storage + warehouse ACID features (Delta Lake, Iceberg, Hudi). Schema enforcement, BI support, time travel.
Apache Airflow
Python DAG-based orchestration. Schedules, monitors, retries pipeline tasks. 1000+ built-in operators.
dbt
Transform data inside warehouse using SQL SELECT models. Auto-generates tests, lineage graphs, and documentation.
Scale: KB → ZB
KB(10³)→MB(10⁶)→GB(10⁹)→TB(10¹²)→PB(10¹⁵)→EB(10¹⁸)→ZB(10²¹). Big Data at TB+. Internet ~5 ZB/yr.
Parquet
Columnar storage format. 3–10× better compression than CSV/JSON. Column pruning & predicate pushdown for fast queries.
Partitioning
Divide data by key (date, region) to reduce files scanned per query. repartition() / coalesce() in Spark.
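A quick PySpark illustration of those two calls (toy DataFrame, arbitrary partition counts):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-sketch").getOrCreate()
df = spark.range(1_000_000)             # toy DataFrame with an `id` column

wide = df.repartition(200)              # full shuffle into 200 partitions
narrow = wide.coalesce(10)              # merge down to 10, avoiding a full shuffle
print(narrow.rdd.getNumPartitions())    # 10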
Lambda Architecture
Batch layer (full recompute) + Speed layer (real-time) + Serving layer (merged). Complexity → Kappa alternative.
CAP Theorem
Distributed systems guarantee only 2 of: Consistency (latest write), Availability (always responds), Partition Tolerance.
Delta Lake
ACID transactions on Parquet files. Time travel (query history), schema evolution, DML on data lake storage.
Shuffle
Most expensive Spark op — redistributes data across partitions (groupBy, join, distinct). Minimize with broadcast joins.
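A broadcast-join sketch in PySpark: the tiny dimension table ships to every executor, so the large fact table is never shuffled (toy data throughout).

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-sketch").getOrCreate()

facts = spark.createDataFrame([(1, 9.99), (2, 4.50)], ["sku", "price"])
dims = spark.createDataFrame([(1, "book"), (2, "pen")], ["sku", "name"])

# Small table is replicated to executors; no shuffle of the fact table
facts.join(broadcast(dims), "sku").show()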
Snowflake
Cloud DW with separated storage & compute. Virtual warehouses scale independently. Time Travel 90 days. Multi-cloud.
ETL vs ELT
ETL: transform before loading. ELT: load raw then transform inside warehouse. Cloud DWs make ELT preferred & cheaper.
YARN
Yet Another Resource Negotiator — Hadoop cluster manager. ResourceManager + NodeManagers. Spark, Hive, MR share one cluster.

Foundational Papers & Architecture

The landmark research and architectural patterns that built the Big Data world

2003
The Google File System (GFS)
Ghemawat, Gobioff & Leung — Google
SOSP 2003

A scalable distributed file system for large data-intensive applications on commodity hardware. Files are split into fixed 64 MB chunks, each replicated 3× across chunkservers. A single master holds namespace metadata; clients talk directly to chunkservers for I/O. Designed for append-dominant, sequential-read workloads — not random writes.

⚡ Directly inspired HDFS — the storage backbone of Apache Hadoop.

2004
MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean & Sanjay Ghemawat — Google
OSDI 2004

"MapReduce is a programming model and an associated implementation for processing and generating large datasets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key." The runtime auto-parallelises across thousands of machines and handles node failures transparently.

⚡ >100,000 MapReduce jobs/day at Google within 4 years. Foundation of Apache Hadoop.

2006
Bigtable: A Distributed Storage System for Structured Data
Chang, Dean, Ghemawat, Hsieh et al. — Google
OSDI 2006 — Best Paper Award

Manages petabytes of structured data across thousands of commodity servers. Data model: a sparse, distributed, persistent multi-dimensional sorted map indexed by (row key × column key × timestamp → value). Served Google's web indexing, Google Earth, and Google Finance — wildly different latency and scale requirements, one unified system.

⚡ Directly inspired Apache HBase, Cassandra & the entire NoSQL movement.

2006
Apache Hadoop
Doug Cutting & Mike Cafarella — Yahoo! / Apache Foundation
Open-Source Implementation of GFS + MapReduce

Doug Cutting implemented the GFS and MapReduce papers in open-source Java to power the Nutch web crawler, naming the project after his son's toy elephant. HDFS is a direct open-source implementation of the GFS architecture. Hadoop became the de facto Big Data platform and spawned the entire ecosystem: Hive, HBase, Pig, and Apache Spark.

⚡ The most influential open-source project in data engineering history.

Lambda vs. Kappa Architecture

Two competing patterns for combining batch and real-time data processing

λ Lambda Architecture
Nathan Marz, 2011 — Creator of Apache Storm
Batch Layer — Immutable master dataset; recomputes views over full history (Hadoop / Spark)
Speed Layer — Fills the batch gap with low-latency real-time views (Flink / Storm)
Serving Layer — Merges batch + speed views to answer queries (Druid / HBase)
✓ Highly fault-tolerant, historically accurate  |  ✗ Two codebases to maintain, complex result reconciliation
κ Kappa Architecture
Jay Kreps, 2014 — Co-creator of Apache Kafka
Stream Layer — All data treated as a stream; reprocess history by replaying the Kafka log from offset 0
Serving Layer — Single unified view from stream output (Kafka + Flink + query store)
✓ Single codebase, simpler ops, stream-first  |  ✗ Reprocessing huge history is expensive; deep historical analytics harder

The CAP Theorem

Eric Brewer's conjecture (PODC 2000) — formally proven by Gilbert & Lynch, MIT (2002)

A distributed system can guarantee at most 2 of these 3 properties simultaneously

CA — Consistency + Availability
Every read sees the latest write; every request gets a response. Cannot tolerate network partitions — must be a single-node or tightly coupled cluster.
Examples: PostgreSQL, MySQL, traditional RDBMS
CP — Consistency + Partition Tolerance
Returns correct, up-to-date data even across partitions; may block or refuse requests to maintain consistency. Availability is sacrificed.
Examples: HBase, Zookeeper, MongoDB (default config)
AP — Availability + Partition Tolerance
Always responds and survives partitions, but data may be temporarily stale. Nodes converge to the same state over time — eventual consistency.
Examples: Cassandra, DynamoDB, CouchDB

Key Terms — Primary Source Definitions

Authoritative definitions drawn directly from foundational papers, documentation, and original research

MapReduce
— Dean & Ghemawat, Google Research, OSDI 2004
"A programming model and an associated implementation for processing and generating large datasets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key." The runtime auto-parallelises and handles machine failures transparently.
HDFS (Hadoop Distributed File System)
— Apache Hadoop Documentation
An open-source implementation of the Google File System architecture. Files are split into large blocks (default 128 MB) distributed across DataNodes. A NameNode holds all filesystem metadata. Designed for sequential, high-throughput reads of very large files on commodity hardware — not random-access I/O.
Commodity Hardware
— GFS & MapReduce papers, Google Research, 2003–2004
Standard, inexpensive off-the-shelf servers — as opposed to specialised, high-cost machines. Both GFS and MapReduce were explicitly designed with the assumption that component failures are the norm, not the exception, enabling massive horizontal scaling at low unit cost.
Data Locality
— MapReduce paper, Dean & Ghemawat, 2004
The principle of moving computation to the data rather than moving data to the computation. The MapReduce runtime schedules map tasks on the node — or a nearby rack node — that already holds the relevant HDFS block, dramatically reducing network I/O and improving throughput on large datasets.
Fault Tolerance
— GFS, MapReduce & Bigtable papers, Google, 2003–2006
The ability of a system to continue operating correctly when one or more components fail. In Big Data systems this is achieved through data replication (typically 3x), automatic task re-execution on failure, and lineage-based recovery (Spark RDDs) rather than expensive checkpointing.
RDD (Resilient Distributed Dataset)
— Zaharia et al., "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing", NSDI 2012
An immutable, fault-tolerant, partitioned collection of elements that can be operated on in parallel. RDDs are "resilient" because they maintain a lineage graph (DAG) of transformations, enabling any lost partition to be recomputed from the original source rather than requiring physical data replication.
DAG Execution
— Apache Spark Documentation
In Spark, all transformations on RDDs or DataFrames are lazy — recorded as nodes in a Directed Acyclic Graph (DAG) but not executed until an action is called. The DAG scheduler then optimises the plan, coalescing stages to minimise expensive data shuffles and enabling intelligent fault recovery.
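A small PySpark demonstration of this: explain() prints the optimised plan before anything runs, and count() is the action that finally executes it.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-sketch").getOrCreate()
df = spark.range(10_000)

# Transformations: only recorded in the DAG
pipeline = df.filter(df.id % 2 == 0).selectExpr("id * 10 AS scaled")

pipeline.explain()        # inspect the optimised plan; still no execution
print(pipeline.count())   # action: the DAG scheduler runs the job → 5000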
Stream Processing
— Apache Kafka & Apache Flink Documentation
Processing data records continuously as they arrive, in near-real-time, rather than accumulating them into a batch first. Each event is processed individually or in micro-batches with millisecond-to-second latency. Used for fraud detection, live dashboards, real-time recommendations, and IoT sensor analytics.
Batch Processing
— MapReduce paradigm, Dean & Ghemawat, 2004
Processing a bounded, pre-accumulated dataset in a single job run — typically on a schedule (hourly, daily). Optimised for high throughput over low latency. Classic examples: nightly ETL pipelines, end-of-day financial aggregations, weekly model retraining on historical data.
Schema-on-Read
— Data Lake architectural pattern
Data is stored in its raw, native format without enforcing a schema at write time. The schema is applied only when the data is read or queried. Allows ingesting any data immediately with no up-front modelling; the trade-off is that schema errors are discovered later, at query time.
Schema-on-Write
— Traditional Data Warehouse pattern
A schema is defined and enforced before data is written to the store. Guarantees data quality and structure at the point of ingestion. Used in traditional relational databases and cloud data warehouses (Snowflake, Redshift). Slower ingestion pipeline but faster, more reliable queries.
Data Lakehouse
— Armbrust et al., Databricks / Delta Lake, 2021
A data management architecture combining the low-cost, flexible storage of a Data Lake with the ACID transaction guarantees and governance of a Data Warehouse. Implemented through open table formats — Delta Lake, Apache Iceberg, Apache Hudi — that add transactional metadata layers over cloud object storage.
ACID Transactions (Data Context)
— Delta Lake / Databricks Documentation
Atomicity: a transaction fully completes or fully fails — no partial writes. Consistency: data remains in a valid state. Isolation: concurrent transactions do not interfere. Durability: committed data survives failures. Brought to data lakes via Delta Lake and Iceberg to prevent corrupt tables during concurrent streaming writes.
ETL vs. ELT
— Modern Data Stack architectural pattern
ETL (Extract-Transform-Load): data is cleaned and transformed before loading into the destination — classic in on-premise warehouses with limited target compute. ELT (Extract-Load-Transform): raw data lands first, then transformed in-place using the warehouse's own compute — preferred in cloud-native stacks (Snowflake, BigQuery, dbt).
Partitioning
— Apache Hive, Spark & distributed database documentation
Dividing a large dataset into smaller segments based on one or more column values (e.g., date, region). Queries targeting a specific partition read only that segment — known as partition pruning. In HDFS/Hive, partitions are physical sub-directories on disk. Reduces I/O dramatically on large datasets.
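In PySpark this looks roughly like the following (path and columns are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pruning-sketch").getOrCreate()
df = spark.createDataFrame(
    [("2024-01-01", "eu", 5), ("2024-01-02", "us", 7)],
    ["date", "region", "clicks"],
)

# Physical layout: one sub-directory per date value (.../date=2024-01-01/)
df.write.partitionBy("date").mode("overwrite").parquet("/tmp/events")

# Partition pruning: only the matching directory is scanned
jan1 = spark.read.parquet("/tmp/events").filter("date = '2024-01-01'")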
Sharding
— Distributed database systems (MongoDB, Cassandra)
Horizontal partitioning of data across multiple independent database nodes, where each shard holds a distinct subset of rows. Unlike replication (which copies data for redundancy), sharding distributes data to increase write throughput and total storage capacity. Each shard operates as an independent database instance.
Replication Factor
— GFS paper, Ghemawat et al., SOSP 2003; HDFS Documentation
The number of copies of each data block maintained across different nodes. GFS and HDFS default to 3x: one primary copy plus two replicas, ideally placed on different racks for rack-fault tolerance. Higher replication increases fault tolerance and read throughput at the cost of storage overhead.
Compaction
— Apache HBase, Cassandra & Delta Lake documentation
The background process of merging many small files or SSTables into fewer, larger ones to improve read performance, reclaim space from deleted and updated records, and reduce metadata overhead. In Delta Lake, compaction (OPTIMIZE + ZORDER) co-locates related Parquet data for faster query performance via data skipping.
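In Delta Lake the SQL form is roughly as follows, a sketch assuming a Delta-enabled session (as in the lakehouse example above) and OPTIMIZE/ZORDER support (Delta Lake 2.0+ or Databricks); the path and column are illustrative.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("compaction-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

spark.sql("OPTIMIZE delta.`/tmp/lake/events`")                    # merge small Parquet files
spark.sql("OPTIMIZE delta.`/tmp/lake/events` ZORDER BY (event)")  # co-locate related rows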

Knowledge Check

8 questions from foundational papers and primary sources — click an answer to see the explanation

QUESTION 1 OF 8
Google's MapReduce paper was presented at which venue and in which year?
Correct: OSDI 2004. "MapReduce: Simplified Data Processing on Large Clusters" by Jeffrey Dean and Sanjay Ghemawat was presented at the 6th Symposium on Operating Systems Design and Implementation (OSDI), San Francisco, 2004. It is one of the most cited papers in computer science history.
QUESTION 2 OF 8
What does RDD stand for, and who introduced the concept?
Correct: Resilient Distributed Dataset, Matei Zaharia et al., NSDI 2012. RDDs were introduced in "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing." An RDD is an immutable, fault-tolerant collection partitioned across nodes, recoverable via its lineage graph (DAG) without physical data replication.
QUESTION 3 OF 8
The CAP theorem states a distributed system can simultaneously guarantee at most how many of its three properties?
Correct: At most two. Eric Brewer conjectured this at PODC 2000; Gilbert & Lynch (MIT) formally proved it in 2002. In practice, network partitions are unavoidable, so designers choose between CP (Consistency + Partition tolerance) and AP (Availability + Partition tolerance).
QUESTION 4 OF 8
Lambda Architecture was proposed by whom, and in what year?
Correct: Nathan Marz, 2011. Nathan Marz (creator of Apache Storm) proposed Lambda Architecture in a 2011 blog post. It combines a Batch Layer (full historical recomputation), a Speed Layer (real-time gap-filling), and a Serving Layer (merged query interface).
QUESTION 5 OF 8
In the original Google File System paper (SOSP 2003), what is the default chunk size used to store files?
Correct: 64 MB. Ghemawat et al. specified 64 MB chunks — far larger than typical file system blocks. This reduces the master's metadata overhead and supports the sequential, large-file read patterns of Google's workloads. Apache HDFS adopted 128 MB as its default block size.
QUESTION 6 OF 8
Kappa Architecture (Kreps, 2014) simplifies Lambda by replacing which two layers with a single streaming layer?
Correct: Batch Layer and Speed Layer. In "Questioning the Lambda Architecture" (2014), Kreps argued that maintaining two separate processing codebases is unnecessarily complex. Kappa replaces both with a single stream layer backed by a replayable Kafka log — reprocess history by replaying from offset 0.
QUESTION 7 OF 8
The Google Bigtable paper received what distinction at OSDI 2006, and which open-source systems did it directly inspire?
Correct: Best Paper Award at OSDI 2006. Bigtable (Chang, Dean, Ghemawat et al.) won Best Paper at OSDI '06 and later the SIGOPS Hall of Fame Award in 2016. Its sparse, multi-dimensional sorted map model directly inspired Apache HBase (an open-source Bigtable clone) and strongly influenced Cassandra's column-family design.
QUESTION 8 OF 8
In Apache Spark's execution model, what does "lazy evaluation" mean?
Correct: Transformations build a DAG; execution waits for an action. Operations like filter() and map() are transformations — they add nodes to the DAG but compute nothing immediately. Only actions like count() or collect() trigger the DAG scheduler to optimise and execute the full pipeline.