Hadoop Tutorial

What is Hadoop?

Apache Hadoop is an open-source software framework developed in Java that is used to store and analyze large sets of unstructured data. It allows distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from a single server to thousands of machines, each offering local computation and storage. It does not rely on hardware to deliver high availability.

Hadoop is based on Google's MapReduce, a programming model that splits a large task into a set of smaller tasks that can be processed independently.
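To make the model concrete, below is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API; the class name and the command-line input/output paths are illustrative, and the code assumes a working Hadoop 2.x installation such as the one set up later in this tutorial.

// WordCount.java - the mapper emits (word, 1) for every word in the input,
// and the reducer sums the counts for each word.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Split the input line into words and emit (word, 1) for each one.
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Sum all the partial counts produced by the mappers for this word.
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A job like this is typically packaged into a jar and submitted with hadoop jar wordcount.jar WordCount <input> <output>, where <input> and <output> are HDFS directories.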


Hadoop Architecture

Hadoop has four modules which are used in Big Data analysis:

  • Hadoop Distributed File System (HDFS): It allows data to be stored in an easily accessible format across a large number of linked storage devices.
  • MapReduce: It reads data from the file system and processes it into a format that can be used for analysis.
  • Hadoop Common: It provides the common Java libraries and utilities required by the other Hadoop modules.
  • YARN: It manages the cluster resources and schedules the jobs that run the analysis.
[Image: Hadoop architecture diagram]

Advantages of Hadoop

  • The Hadoop framework allows the user to quickly write and test distributed systems. It is very efficient: it automatically distributes the data and work across the machines and utilizes the underlying parallelism of the CPU cores.
  • It does not rely on hardware to provide fault tolerance and high availability (FTHA); the Hadoop library is designed to detect and handle failures at the application layer.
  • Hadoop offers the option to add or remove servers from the cluster dynamically, and it continues to operate without interruption.
  • Hadoop is compatible with all platforms because it is based on Java.
  • Hadoop uses a cluster of commodity hardware to store data. Commodity hardware is not very costly, so the cost of adding nodes is low.
  • Hadoop is an open-source software framework, and its source code is freely available, so we can modify it as per our needs.
  • It supports multiple languages. Developers can code in languages such as C, C++, Perl, Python, Ruby, and Groovy on Hadoop.
  • Hadoop keeps network traffic low because each job is split into a number of independent sub-tasks that are assigned to the data nodes, so a small amount of code moves to the data instead of huge amounts of data moving across the network.
  • Hadoop has a distributed processing and distributed storage architecture that processes massive amounts of data at high speed. In 2008, Hadoop beat the supercomputers to become the fastest system to sort a terabyte of data.

Disadvantages of Hadoop

  • Hadoop works well with a small number of large files, but it fails for applications that deal with a large number of small files. Its high-capacity design means it cannot efficiently support reading many small files.
  • Hadoop is written in Java, a widely used language whose exploits are well known to cybercriminals, which makes Hadoop more vulnerable to security breaches.
  • Hadoop has no ability to do in-memory computation, so it incurs processing overhead that diminishes its performance.
  • In Hadoop, data is read from and written to disk, which makes read/write operations very costly when dealing with terabytes and petabytes of data.
  • In MapReduce, data is distributed and processed over the cluster, and the Map and Reduce tasks take considerable time, which increases the overall processing time and reduces processing speed.
  • Hadoop is not suitable for real-time data processing because it supports batch processing only. Although batch processing is very good for handling a high volume of data, the output can be delayed significantly depending on the size of the data and the computational power of the system.
  • Hadoop by itself is not efficient for iterative processing. Iterative processing requires a cyclic data flow, whereas in Hadoop data flows through a chain of stages in which the output of one stage becomes the input of the next.
  • Hadoop offers no higher-level abstraction, so MapReduce developers must hand-code each operation, which makes development difficult.
  • Hadoop uses Kerberos authentication, which is hard to manage, and it lacks encryption at the storage and network levels, both of which are significant concerns.

Hadoop 2.8.0 Installation on Windows 10

Hadoop needs Java to run, and the Java and Hadoop versions must fit together. This is the case for Java 1.8.0 and Hadoop 2.8.0, so we restrict this guide to these versions. Hadoop can be installed on any Windows version, but the installation process differs slightly. The assumption is that the user has a desktop or laptop with Windows 10 installed and an Internet connection to download the software; Windows 10 is the most common desktop OS, used in more than 40% of the desktop/laptop OS market.

This step-by-step method should be easy to follow for an experienced Windows 10 user without programming skills: someone who knows how to download and install programs but does not usually work with the command prompt, setups, or configurations.

Preparations

Java SE Development Kit

Set Up-

  • Check whether Java 1.8.0 is already installed on your system by running the command javac -version.
[Screenshots: running javac -version in the Command Prompt; the error messages show that Java is not installed on the machine.]
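For reference, when Java is missing, the check typically fails with the standard Windows "not recognized" error; the user folder shown in the prompt below is just a placeholder:

C:\Users\you>javac -version
'javac' is not recognized as an internal or external command,
operable program or batch file.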

  • If Java is not installed on your system, install it first. Navigate to C:\ [1], make a new folder [2], and name it Java [3].
[Screenshot: creating the C:\Java folder]
  • (i) Run the Java installation file. Install it directly into the folder C:\Java.
[Screenshots: the Java installer and choosing C:\Java as the destination folder]

(ii) Click the Next button and wait for the installation. It will ask again to choose a destination folder, so create a sub-folder jre1.8.0 under C:\Java.

[Screenshot: creating the jre1.8.0 sub-folder]

(iii) Now choose this sub-folder jre1.8.0 and finish the installation.

[Screenshots: choosing the jre1.8.0 destination folder and finishing the installation]

After installation, check for Java again using the command javac -version.

  • Extract Hadoop 2.8.0 right under C:\ like this:
[Screenshot: the hadoop-2.8.0 folder extracted under C:\]
  • Install Notepad++ anywhere

Setup Environment Variables-

  • Use the search option to find the environment variables.
[Screenshot: searching for environment variables]
  • In System Properties, click the button Environment Variables…
[Screenshot: the Environment Variables button in System Properties]

A new window will open with two tables and buttons. The upper table is for User variables and the lower for System variables.

  • Make a new User variable [1]. Name it JAVA_HOME and set it to the Java bin-folder [2]. Click OK [3].
[Screenshot: creating the JAVA_HOME user variable]
  • Make another New User Variable [1]. Name it HADOOP_HOME and set it to the hadoop-2.8.0 bin-folder [2]. Click OK [3].
[Screenshot: creating the HADOOP_HOME user variable]
  • Now add Java and Hadoop to the system variable Path: click on Path [1] and then click Edit [2]. The editor window opens. Choose New [3] and add the address C:\Java\bin [4]. Choose New again [5] and add the address C:\hadoop-2.8.0\bin [6]. Click OK [7] in the editor window and click OK [8] to change the system variables. (You can verify the variables afterwards, as shown below.)
[Screenshot: adding the Java and Hadoop bin folders to the system Path]
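As an optional sanity check (not part of the original steps), you can open a new Command Prompt and confirm that the variables point to the folders used in this guide; javac should report a 1.8.0 version, with the exact update number depending on your download:

C:\>echo %JAVA_HOME%
C:\Java\bin

C:\>echo %HADOOP_HOME%
C:\hadoop-2.8.0\bin

C:\>javac -version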

Configuration-

  • Go to the file C:\hadoop-2.8.0\etc\hadoop\core-site.xml [1]. Right-click on the file and edit it with Notepad++ [2].
Goto Notepad++
  • At the end of the file, you have two configuration tags.
<configuration>
</configuration>

Paste the following code between the two tags and save (spacing doesn’t matter):

<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>

It should look like this in Notepad++:

[Screenshot: core-site.xml after adding the property]
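In plain text, the end of core-site.xml should then read roughly as follows:

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>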
  • Rename C:\hadoop-2.8.0\etc\hadoop\mapred-site.xml.template to mapred-site.xml and edit this file with Notepad++. Paste the following code between the configuration tags and save:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
  • Under C:\hadoop-2.8.0, create a folder named "data" [1] with two subfolders, "datanode" and "namenode" [2]. (They can also be created from the Command Prompt, as shown below.)
[Screenshot: the data folder with its datanode and namenode subfolders]
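If you prefer the Command Prompt to Explorer, the same folders can be created like this (the paths match the ones used in this guide):

mkdir C:\hadoop-2.8.0\data\namenode
mkdir C:\hadoop-2.8.0\data\datanode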
  • Edit the file C:\hadoop-2.8.0\etc\hadoop\hdfs-site.xml with Notepad++. Paste the following code between the configuration tags and save:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\hadoop-2.8.0\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>C:\hadoop-2.8.0\data\datanode</value>
</property>
  • Edit the file C:\hadoop-2.8.0\etc\hadoop\yarn-site.xml with Notepad++. Paste the following codes between the configuration tags and save:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
  • Edit the file C:\hadoop-2.8.0\etc\hadoop\hadoop-env.cmd with Notepad++. Write @rem in front of “set JAVA_HOME=%JAVA_HOME%”.

Write set JAVA_HOME=C:\Java on the next row.

It should look like this in Notepad++:

[Screenshot: hadoop-env.cmd after setting JAVA_HOME]
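In plain text, the two edited rows in hadoop-env.cmd should read:

@rem set JAVA_HOME=%JAVA_HOME%
set JAVA_HOME=C:\Java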

Replace the bin folder-

Before we can start testing, we must replace a folder in Hadoop.

Unzip the "Hadoop Configuration.zip" file (available on GitHub).

  • Delete the bin folder from C:\hadoop-2.8.0 and copy/paste the new bin folder from "Hadoop Configuration.zip" into the same directory, C:\hadoop-2.8.0.

Testing-

  • Search for cmd [1] and open the Command Prompt [2]. Write hdfs namenode -format [3] and press Enter.
[Screenshots: running hdfs namenode -format in the Command Prompt]

If this first test works, the Command Prompt will print a lot of information. That is a good sign.

  • Now you must change the directory in the Command Prompt. Write cd C:\hadoop-2.8.0\sbin and press Enter. In the sbin folder, write start-all.cmd and press Enter.
[Screenshot: running start-all.cmd from the sbin folder]

If the configuration is right, four applications will start running (the HDFS namenode and datanode, and the YARN resourcemanager and nodemanager), and it will look something like this:

[Screenshot: the four Hadoop daemon windows]
  • Now open a browser, write localhost:8088 in the address field, and press Enter. It should look like this:
[Screenshot: the YARN web UI at localhost:8088]
  • Now, write localhost:50070 in the address field. It should look like this:
[Screenshot: the HDFS NameNode web UI at localhost:50070]

Now, Hadoop is installed successfully.
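As an optional final check (not part of the original steps), you can try a couple of basic HDFS commands from a new Command Prompt to confirm that the file system accepts requests; the /test directory name is just an example:

hdfs dfs -mkdir /test
hdfs dfs -ls /

The second command should list the newly created /test directory.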
