MapReduce Framework

A programming model for processing large datasets in parallel across distributed clusters, built around a simple two-phase structure: Map and Reduce.

MapReduce divides a job into two phases: 1) a Map phase that transforms input records into intermediate key-value pairs, and 2) a Reduce phase that aggregates all values sharing the same key. Between the two, the framework shuffles and sorts the intermediate pairs so that each reducer receives all values for a key together (see the sketch below).
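As a minimal sketch of the model, the classic word count can be simulated in plain Python. The input lines, map_fn, and reduce_fn here are illustrative stand-ins for an HDFS input split and Hadoop's Mapper/Reducer classes, and the explicit shuffle step is what the framework normally performs for you:

```python
from collections import defaultdict
from itertools import chain

# Stand-in input: a list of text lines playing the role of an HDFS split.
lines = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Map phase: emit a (word, 1) pair for every word in every input line.
def map_fn(line):
    return [(word, 1) for word in line.split()]

mapped = list(chain.from_iterable(map_fn(line) for line in lines))

# Shuffle/sort: group all values by key (done by the framework in Hadoop).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate the list of values for each key.
def reduce_fn(key, values):
    return (key, sum(values))

counts = [reduce_fn(key, values) for key, values in groups.items()]
print(sorted(counts))  # [('brown', 1), ('dog', 2), ('fox', 1), ...]
```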

Originated at Google as a proprietary system (Dean and Ghemawat, 2004); Hadoop provides the best-known open-source implementation, and Apache Spark's RDD API exposes the same map/reduce semantics.
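The same word count expressed through Spark's RDD API looks like the following; a hedged sketch assuming a local PySpark installation, with input.txt and output as placeholder paths:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCount")

counts = (sc.textFile("input.txt")               # placeholder input path
            .flatMap(lambda line: line.split())  # map: one record per word
            .map(lambda word: (word, 1))         # emit key-value pairs
            .reduceByKey(lambda a, b: a + b))    # reduce: sum counts per key

counts.saveAsTextFile("output")                  # placeholder output path
sc.stop()
```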

Handles fault tolerance automatically: if a node fails mid-computation, its map and reduce tasks are re-executed on healthy nodes.

Optimized for batch processing of large-scale unstructured data (e.g., web crawling, log analysis).
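To illustrate how little changes between such workloads, usually only the map function differs. A hypothetical mapper for counting HTTP status codes in access-log lines (the index-8 field position assumes Apache combined-log format; adjust it for other formats):

```python
# Map function for a log-analysis job: emit (status_code, 1) per request.
# Assumes whitespace-separated Apache combined-log lines, where the HTTP
# status code is the 9th field (index 8).
def map_log_line(line):
    fields = line.split()
    return [(fields[8], 1)] if len(fields) > 8 else []

# The shuffle and reduce logic from the word-count sketch is reused unchanged.
```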

Limitations include disk I/O bottlenecks, since each job writes its output to disk (addressed by Spark's in-memory processing), and awkwardness for iterative algorithms, which must be expressed as chains of separate jobs.
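To make the iterative limitation concrete: each MapReduce iteration is a separate job that writes its result to disk and reads it back, whereas Spark can pin the working set in executor memory. A toy sketch, assuming PySpark; the 1-D clustering-style update is purely illustrative:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "IterativeSketch")

# cache() keeps the dataset in executor memory across passes; an equivalent
# chain of MapReduce jobs would re-read it from HDFS on every iteration.
points = sc.parallelize(range(1_000_000)).map(lambda x: x * 0.001).cache()

center = 0.0
for _ in range(10):
    # Toy 1-D update: recenter on the mean of nearby points. Each pass
    # scans the full cached dataset without touching disk.
    center = points.filter(lambda p: abs(p - center) < 100).mean()

print(center)
sc.stop()
```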