Apache Hadoop is an open-source framework that enables distributed storage and processing of massive datasets (petabytes) across clusters of commodity hardware. It provides high fault tolerance, allowing it to detect and handle hardware failures at the software layer. Key components include HDFS (storage), YARN (resource management), and MapReduce (processing).

  • Hadoop Common: common utilities and libraries that support other Hadoop modules.
  • HDFS (Hadoop Distributed File System): A highly fault-tolerant file system that stores large data blocks across multiple machines, typically replicating data three times for reliability.
  • YARN (Yet Another Resource Negotiator): The cluster resource management layer that schedules jobs and manages computing resources.
  • MapReduce: A parallel programming model for processing vast amounts of structured, semi-structured, and unstructured data.
  • Hadoop Ozone: A scalable, redundant, distributed object store.