HDFS: Hadoop Modules

Hadoop Modules

Below are the Hadoop modules, which together form the Hadoop ecosystem. We will cover each of them in this topic:

  • HDFS
  • YARN
  • MapReduce
  • Hadoop Common

HDFS  

Hadoop Distributed File System (HDFS) is a sub-project of the Apache Hadoop project. This Apache Software Foundation project is designed to provide a fault-tolerant file system that runs on commodity hardware.

Goals of HDFS

Fault detection and Recovery- Since HDFS includes a large number of commodity hardware components, device failures are common. Therefore, HDFS should provide mechanisms for quick, automatic fault detection and recovery.

Huge Datasets- To manage applications with huge datasets, HDFS should scale to hundreds of nodes per cluster.

Hardware at data- A requested task can be performed efficiently when the computation takes place near the data. Especially with large datasets, this reduces network traffic and increases throughput.

Features of HDFS

Fault Tolerance- Fault tolerance in Hadoop HDFS is the system's ability to keep working under unfavorable conditions; HDFS is highly fault-tolerant. The Hadoop framework divides data into blocks and creates multiple copies of those blocks on different machines in the cluster. So, if any machine in the cluster goes down, a client can easily access its data from another machine that holds the same copy of the data blocks.

High Availability- In HDFS, data is replicated among the nodes of the Hadoop cluster by creating replicas of the blocks on the other slaves in the HDFS cluster. Therefore, whenever a client wants to access this data, it can be fetched from the slaves that hold its blocks. In adverse situations, such as the failure of a node, a user can easily access the data from the other nodes, because duplicate copies of the blocks are present on different nodes of the HDFS cluster.

High Reliability- HDFS provides reliable data storage; it can store data in the range of hundreds of petabytes. HDFS stores data reliably in a cluster by dividing it into blocks, and the Hadoop framework stores these blocks on the nodes of the HDFS cluster. HDFS also creates a replica of each block present in the cluster, which provides fault tolerance: if a node containing data fails, a user can still access that data from the other nodes. By default, HDFS creates three replicas of each block, so data is quickly available and users do not face the problem of data loss.

Replication- Data replication is a unique feature of HDFS. Replication solves the problem of data loss in adverse conditions, such as hardware failure or the crash of a node. HDFS maintains the replication process at regular intervals and keeps creating replicas of user data on different machines in the cluster, so when a node goes down, the user can access the data from other machines. Thus, the possibility of data loss is reduced.
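As a minimal sketch of how a client can control replication, the snippet below raises the target replication factor of an existing file through Hadoop's Java FileSystem API. The NameNode address and file path are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetReplicationExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical NameNode address
            FileSystem fs = FileSystem.get(conf);

            // Ask the NameNode to keep 3 copies of this (hypothetical) file;
            // the NameNode then schedules the extra block copies on the DataNodes.
            fs.setReplication(new Path("/data/sample.txt"), (short) 3);
            fs.close();
        }
    }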

Scalability- The data in Hadoop HDFS is stored on multiple nodes in the cluster, so the cluster can be resized as requirements change. HDFS offers two scalability mechanisms: vertical scalability, which adds resources (such as disks) to existing nodes, and horizontal scalability, which adds more nodes to the cluster.

Distributed Storage- All features of HDFS are obtained through distributed storage and replication. HDFS stores data in a distributed manner across the nodes: the data is divided into blocks, which are stored on the nodes of the HDFS cluster, and HDFS then creates a replica of each block and saves it on other nodes. When one machine in the cluster goes down, we can easily access our data from the other nodes that contain its replica.

HDFS Concepts

The three concepts of HDFS are listed below:

  • NameNode
  • DataNode
  • Block

1. NameNode

NameNode is the master node in the HDFS architecture of Apache Hadoop; it maintains and manages the blocks on the DataNodes (slave nodes). The NameNode manages the file system namespace and regulates clients’ access to files. The HDFS architecture is constructed in such a way that user data never resides on the NameNode; the data resides only on the DataNodes.

Functions of NameNode

  • It is the master daemon that maintains and manages the DataNodes (slave nodes).
  • It records the metadata of all the files stored in the cluster, e.g., block locations, file size, permissions, hierarchy, etc. Two files are associated with the metadata: FsImage and EditLogs (see the sketch after this list).
  • It records every change made to the file system metadata. For example, if a file is deleted in HDFS, the NameNode immediately records this in the EditLog.
  • It receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are alive.
  • In the event of a DataNode failure, the NameNode selects new DataNodes for new replicas, balances disk usage, and manages the communication traffic to the DataNodes.
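As an illustration of the metadata the NameNode serves, here is a minimal client-side sketch that reads a file's status through the Hadoop Java API. The NameNode address and file path are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FileMetadataExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical address
            FileSystem fs = FileSystem.get(conf);

            // getFileStatus() is answered by the NameNode from its namespace
            // metadata; no DataNode is contacted for this request.
            FileStatus status = fs.getFileStatus(new Path("/data/sample.txt"));
            System.out.println("size        : " + status.getLen());
            System.out.println("permissions : " + status.getPermission());
            System.out.println("replication : " + status.getReplication());
            System.out.println("block size  : " + status.getBlockSize());
            fs.close();
        }
    }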

2. DataNode

DataNodes are the slave nodes in HDFS. Unlike the NameNode, DataNodes run on commodity hardware, i.e., inexpensive systems that are neither high-quality nor highly available.

Functions of DataNode:

  • These are slave processes that run on each slave machine.
  • DataNodes store the actual data.
  • DataNodes serve low-level read and write requests from the file system’s clients.

Secondary NameNode: - Besides these two daemons, there is a third daemon or process called the Secondary NameNode, which works concurrently with the NameNode as a helper daemon.

Secondary NameNode

Functions of Secondary NameNode:-

  • The Secondary NameNode constantly reads the NameNode’s file system state and metadata from the NameNode’s RAM and writes it to the hard disk or file system.
  • It is responsible for combining the NameNode’s EditLogs with the FsImage.
  • It regularly downloads the EditLogs from the NameNode, applies them to the FsImage, and copies the new FsImage back to the NameNode (master node), where it can be used the next time the NameNode starts.

3. Block

Block is nothing but the smallest storage unit in a computer system; it is the smallest contiguous storage allocated to a file. The default block size in Hadoop is 128 MB or 256 MB. A file larger than the block size is split across several blocks; for example, with 128 MB blocks, a 500 MB file occupies three full blocks plus one 116 MB block.
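To see how a file maps onto blocks, the following minimal sketch lists each block's boundaries and the DataNodes hosting its replicas via the Hadoop Java API. The cluster address and file path are hypothetical.

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical address
            FileSystem fs = FileSystem.get(conf);

            Path path = new Path("/data/big-file.bin"); // hypothetical file
            FileStatus status = fs.getFileStatus(path);

            // One BlockLocation per block: its offset, its length, and the
            // DataNodes that hold a replica of it.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + Arrays.toString(block.getHosts()));
            }
            fs.close();
        }
    }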

HDFS Read/Write Operation

Read Operation in HDFS:-

A data read request in HDFS is served by the NameNode and the DataNodes. Let us call the reader a client. The diagram below illustrates the process of reading a file in Hadoop.

(Diagram: a client reading a file from HDFS)

The diagram depicts the client’s communication with the NameNode and DataNodes.
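As a minimal client-side sketch of this read path, using the Hadoop Java API: open() asks the NameNode for the block locations, and the bytes themselves are then streamed directly from the DataNodes. The NameNode address and file path are hypothetical.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical address
            FileSystem fs = FileSystem.get(conf);

            // open() contacts the NameNode for block locations; the data is
            // then read directly from the DataNodes that hold the blocks.
            try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"));
                 BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
            fs.close();
        }
    }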

Write Operation in HDFS:-

Let us understand the write operation on files in HDFS using the diagram below:

(Diagram: a client writing a file to HDFS)

The diagram depicts the client’s interaction with NameNode and DataNode.
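A corresponding minimal write sketch, under the same hypothetical cluster address: create() registers the new file with the NameNode, and the client then writes the data to a pipeline of DataNodes, which replicate the blocks among themselves.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000"); // hypothetical address
            FileSystem fs = FileSystem.get(conf);

            // create() asks the NameNode to allocate blocks; the data is sent
            // to the first DataNode in the pipeline, which forwards it to the
            // DataNodes holding the other replicas.
            try (FSDataOutputStream out = fs.create(new Path("/data/output.txt"))) {
                out.writeBytes("Hello, HDFS!\n");
            }
            fs.close();
        }
    }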