Apache Hama Tutorial

What is Apache Hama?

Apache Hama is a Java-based framework for Big Data analytics. It uses the Bulk Synchronous Parallel (BSP) computing model for massive scientific computations such as matrix, graph, and network algorithms. It is a distributed computing framework inspired by Pregel, Google’s large-scale graph computing framework described in 2010. Hama became a Top-Level Project of The Apache Software Foundation in 2012.

Architecture of Apache Hama

Hama has three major components:

BSPMaster

Each time the BSPMaster receives a heartbeat message, it updates its record of that groom server’s status. It then uses the status of all groom servers to assign tasks to the idle ones. Finally, it returns a heartbeat response containing the assigned tasks and any other actions the groom server has to perform. BSPMaster is responsible for:
  • Maintaining groom server status
  • Controlling super steps in a cluster
  • Maintaining job progress information
  • Scheduling jobs and assigning tasks to groom servers
  • Handling faults
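
The heartbeat cycle described above can be sketched in a few lines of plain Java. This is only an illustrative model, not Hama's actual implementation: the class `HeartbeatSketch`, its methods, and the idea of reporting a free-slot count are all hypothetical names and simplifications chosen for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical, simplified model of the heartbeat exchange; these are not Hama classes.
class HeartbeatSketch {
    // Latest free-slot count known for each groom server (assumed status model).
    private final Map<String, Integer> freeSlotsByGroom = new HashMap<>();
    // Tasks that are waiting to be scheduled.
    private final Queue<String> pendingTasks = new ArrayDeque<>();

    void submit(String taskId) {
        pendingTasks.add(taskId);
    }

    // Handles one heartbeat: refresh the status, assign tasks, return the response.
    List<String> onHeartbeat(String groomName, int freeSlots) {
        freeSlotsByGroom.put(groomName, freeSlots);          // 1. update groom status
        List<String> assigned = new ArrayList<>();
        while (assigned.size() < freeSlots && !pendingTasks.isEmpty()) {
            assigned.add(pendingTasks.poll());               // 2. give work to idle slots
        }
        freeSlotsByGroom.put(groomName, freeSlots - assigned.size());
        return assigned;                                     // 3. heartbeat response
    }

    int knownFreeSlots(String groomName) {
        return freeSlotsByGroom.getOrDefault(groomName, 0);
    }

    public static void main(String[] args) {
        HeartbeatSketch master = new HeartbeatSketch();
        master.submit("task-0");
        master.submit("task-1");
        master.submit("task-2");
        System.out.println(master.onHeartbeat("groom-a", 2)); // two free slots, two tasks
        System.out.println(master.onHeartbeat("groom-b", 0)); // busy groom, nothing assigned
        System.out.println(master.onHeartbeat("groom-b", 3)); // picks up the remaining task
    }
}
```

A real master would also track job progress and react to grooms that stop sending heartbeats; the sketch leaves those responsibilities out to keep the request/response shape visible.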

GroomServer

A GroomServer is a process that performs the BSP tasks assigned by the BSPMaster. Each groom contacts the BSPMaster, takes its assigned tasks, and reports its status to the BSPMaster through periodic heartbeat messages. Each groom is designed to run with HDFS (Hadoop Distributed File System) or another distributed storage system. An assigned task runs in a child virtual machine, which is spawned each time a new assignment arrives. For the best data locality, a groom server and a data node should run on the same physical node.

ZooKeeper

ZooKeeper manages the barrier synchronization of the BSPPeers efficiently. It is also used as part of the fault tolerance system.
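
To make the idea of barrier synchronization concrete, here is a minimal in-process sketch. It swaps in `java.util.concurrent.CyclicBarrier` as a stand-in for the distributed barrier; it does not use ZooKeeper or any Hama API, and the class name `BarrierSketch` is an assumption made for this example. Each thread plays the role of one peer, and no peer may start superstep s+1 until every peer has finished superstep s.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

// In-process stand-in for a distributed barrier; threads play the role of peers.
class BarrierSketch {
    // Runs the peers through the given number of supersteps and
    // returns how many times the barrier was completed.
    static int run(int peers, int supersteps) {
        AtomicInteger completed = new AtomicInteger();
        AtomicInteger violations = new AtomicInteger();
        // The barrier action runs once per superstep, after all peers have arrived.
        CyclicBarrier barrier = new CyclicBarrier(peers, completed::incrementAndGet);

        Thread[] threads = new Thread[peers];
        for (int p = 0; p < peers; p++) {
            threads[p] = new Thread(() -> {
                for (int step = 0; step < supersteps; step++) {
                    // Local computation would happen here. A peer working on
                    // superstep `step` must see exactly `step` completed barriers.
                    if (completed.get() != step) {
                        violations.incrementAndGet();
                    }
                    try {
                        barrier.await(); // wait for every other peer
                    } catch (InterruptedException | BrokenBarrierException e) {
                        violations.incrementAndGet();
                        return;
                    }
                }
            });
            threads[p].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        if (violations.get() != 0) {
            throw new IllegalStateException("a peer ran ahead of the barrier");
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("barriers completed: " + run(4, 3));
    }
}
```

In a cluster the peers live in different processes on different machines, which is why a coordination service is needed instead of a shared in-memory object.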

Process of Communication and Synchronization

Each BSP task has an Outgoing Message Manager and an Incoming Queue. The Outgoing Message Manager collects the messages to be sent, serializes and compresses them, and puts them into bundles. During the barrier synchronization phase, the BSP tasks exchange the bundles; each receiving task decompresses and deserializes its bundles and puts the messages into its incoming queue.
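
The pack and unpack steps can be illustrated with the Java standard library alone. The sketch below is an assumption-laden simplification: it treats messages as strings, uses GZIP for compression, and uses the hypothetical class name `MessageBundleSketch`. Hama's real message classes and wire format are not shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Simplified model of bundling; string messages and GZIP are assumptions.
class MessageBundleSketch {
    // Sender side: serialize the messages and compress them into one bundle.
    static byte[] pack(List<String> messages) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(new GZIPOutputStream(bytes))) {
            out.writeInt(messages.size());
            for (String message : messages) {
                out.writeUTF(message);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Receiver side: decompress the bundle, deserialize it, fill the incoming queue.
    static Queue<String> unpack(byte[] bundle) {
        Queue<String> incoming = new ArrayDeque<>();
        try (DataInputStream in = new DataInputStream(
                new GZIPInputStream(new ByteArrayInputStream(bundle)))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                incoming.add(in.readUTF());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return incoming;
    }

    public static void main(String[] args) {
        byte[] bundle = pack(Arrays.asList("rank=0.25", "rank=0.75"));
        System.out.println(unpack(bundle));
    }
}
```

Sending one bundle per destination instead of one network call per message is what keeps the communication phase cheap when a superstep produces many small messages.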

Why use Hama?

The volume of digital data is growing exponentially, which makes processing it efficiently increasingly challenging. Most big data projects are limited to a particular processing domain. Hama, by contrast, is a framework that supports diverse large-scale computational tasks and is best suited to intensive iterative applications. In such domains it outperforms MapReduce frameworks because it avoids the overhead of sorting, shuffling, and reducing the vertices on every pass. MapReduce incurs this overhead on each iteration, and an iterative algorithm may need a very large number of iterations. Hama instead provides a message passing interface, and each BSP superstep is faster than a full job execution in a MapReduce framework such as Hadoop.
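
The shape of an iterative BSP computation can be shown with a small single-process simulation. This sketch does not use the Hama API; the class `SuperstepSketch` and the ring topology are assumptions made for illustration. Peers sit on a ring and repeatedly send their current value to the next peer until every peer holds the global maximum. Each loop pass is one superstep (communicate, synchronize, compute) inside a single running job, rather than a fresh job per iteration.

```java
import java.util.Arrays;

// Single-process simulation of BSP supersteps; not Hama code.
class SuperstepSketch {
    // Spreads the maximum value to all peers and returns the number of supersteps used.
    static int propagateMax(int[] values) {
        int n = values.length;
        int supersteps = 0;
        boolean changed = true;
        while (changed) {
            // Communication phase: each peer sends its value to the next peer on the ring.
            int[] inbox = new int[n];
            for (int peer = 0; peer < n; peer++) {
                inbox[(peer + 1) % n] = values[peer];
            }
            // Barrier: every message is delivered before any peer continues.
            supersteps++;
            // Computation phase: each peer folds in the message it received.
            changed = false;
            for (int peer = 0; peer < n; peer++) {
                if (inbox[peer] > values[peer]) {
                    values[peer] = inbox[peer];
                    changed = true;
                }
            }
        }
        return supersteps;
    }

    public static void main(String[] args) {
        int[] values = {3, 9, 1, 4};
        int supersteps = propagateMax(values);
        System.out.println(Arrays.toString(values) + " after " + supersteps + " supersteps");
    }
}
```

The state of each peer survives from one superstep to the next, so nothing has to be re-read, re-sorted, or re-shuffled between iterations.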

Hama Implementation over Hadoop

Hama was implemented on top of Hadoop for two reasons:
  • The Hadoop framework has limitations in several application domains, such as complex algorithmic computation, graph processing, and streaming data, whereas Hama uses BSP techniques for such computations.
  • Hama can run MapReduce in the same framework, in parallel with the BSP engine.
Hama’s architecture is similar to that of Apache Hadoop, and it draws inspiration from Google Pregel, but it differs significantly from both. It aims to be a more general-purpose framework than Pregel and Apache Hadoop, supporting massive scientific computations such as matrix, graph, machine learning, business intelligence, and network algorithms. It is not limited to graph processing; it also provides a full set of primitives for building generic BSP applications.

The main differences between the Hama and Hadoop architectures are illustrated in the following image:

 (Figure: Differences between Hama and Hadoop architecture)

Each framework has its own advantages and disadvantages, and there is no single yardstick that truly embodies a one-size-fits-all solution.

Advantages of Apache Hama

  • Hama provides BSP primitives rather than graph processing APIs, which enables programmers to work at a lower level.
  • Hama uses the BSP model to avoid conflicts and deadlocks during communication.
  • It provides not only the BSP programming model but also a SQL-like query interface and vertex/neuron-centric programming models.
  • Hama provides explicit support for message passing.
  • It is primarily Java-based but also supports C++ programs.
  • Hama is not limited to HDFS and may be used with any distributed file system.
  • It supports general purpose computing on graphics processing units (GPGPU).
  • Hama is an open-source software framework whose source code is freely available, so we can modify it to suit our needs.

Disadvantages of Hama

  • Hama uses a graph partitioning strategy, which may hurt performance because of unnecessary communication between nodes.
  • Hama was proposed in 2010, but it has not yet been widely adopted.
  • Although Hama is an open-source framework, it still lacks community support for communicating with the developers.
  • Hama should have a better interface for the models that it readily supports.
  • Like any big data framework, Hama is subject to security attacks and privacy issues, e.g., denial of service (DoS) attacks or data use without permission.

Installation of Apache Hama

Preparations

Make sure that you have installed all the required software on all nodes in your cluster:
  • Hadoop 1.0.x or higher (non-secure version)
  • Sun Java JDK 1.7.x or higher
  • SSH access to manage the BSP daemons
  • Download Hama from the release page: https://hama.apache.org/downloads.html

Hama Installation

Untar the files to your destination of choice with the command:
tar -xzf hama-0.x.0.tar.gz
Make sure the directory is owned by the same user that you configured Hadoop to run as.

Startup Script

The $HAMA_HOME/bin directory contains the scripts used to start the Hama daemons.
  • start-bspd.sh
It starts all Hama daemons: the BSPMaster, the GroomServers, and ZooKeeper. Make sure to start Hama as the same user that is configured for Hadoop.

Configuration Files

The $HAMA_HOME/conf directory contains the following configuration files for Hama:
  • hama-env.sh
It contains the environment variable settings that Hama requires. It can be used to adjust aspects of the Hama daemons’ behavior, such as where log files are stored and the maximum amount of heap used. The only variable that must be changed here is JAVA_HOME, which specifies the path to Java.
  • groomservers
This file lists the hosts, one per line, where the GroomServer daemons will run. By default, it contains the single entry localhost.
  • hama-default.xml
It contains some generic default settings for Hama daemons. This file should not be modified.
  • hama-site.xml
It contains site-specific settings for all Hama daemons and BSP jobs. It is empty by default. The settings in this file override those in hama-default.xml. This file should contain only those settings that must be respected by all the servers and clients in a Hama installation.
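
The override relationship between hama-default.xml and hama-site.xml can be sketched as a simple map overlay. The code below is only an illustration written with the JDK's XML parser, not Hama's configuration loader. The class name `ConfigOverlaySketch`, the `<configuration>` root element, the default values, and the host names are all assumptions made for this example; only the property names bsp.master.address, fs.defaultFS, and bsp.checkpoint.enabled come from this tutorial.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Illustrative overlay of site settings on defaults; not Hama's own loader.
class ConfigOverlaySketch {
    // Reads every <property> with a <name> and a <value> into a map.
    static Map<String, String> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> settings = new LinkedHashMap<>();
            NodeList properties = doc.getElementsByTagName("property");
            for (int i = 0; i < properties.getLength(); i++) {
                Element property = (Element) properties.item(i);
                String name = property.getElementsByTagName("name").item(0).getTextContent().trim();
                String value = property.getElementsByTagName("value").item(0).getTextContent().trim();
                settings.put(name, value);
            }
            return settings;
        } catch (Exception e) {
            throw new IllegalArgumentException("not a valid configuration document", e);
        }
    }

    // Site settings win over defaults with the same name.
    static Map<String, String> effective(String defaultsXml, String siteXml) {
        Map<String, String> merged = parse(defaultsXml);
        merged.putAll(parse(siteXml));
        return merged;
    }

    public static void main(String[] args) {
        // Hypothetical default values, for illustration only.
        String defaults = "<configuration>"
                + "<property><name>bsp.master.address</name><value>local</value></property>"
                + "<property><name>bsp.checkpoint.enabled</name><value>false</value></property>"
                + "</configuration>";
        // Hypothetical site values; the host names are placeholders.
        String site = "<configuration>"
                + "<property><name>bsp.master.address</name><value>master.example.com:40000</value></property>"
                + "<property><name>fs.defaultFS</name><value>hdfs://namenode.example.com:9000</value></property>"
                + "</configuration>";
        effective(defaults, site).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Keeping hama-default.xml untouched and placing only the differences in hama-site.xml makes it easy to see what a given installation has actually changed.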

Setting up Hama

This section will help you to understand how to get started by setting up a Hama cluster.
  • BSPMaster and ZooKeeper settings
Decide where you want to run the HDFS namenode and the BSPMaster. Then set the variable bsp.master.address to the BSPMaster’s intended host:port, and set the variable fs.defaultFS to the HDFS namenode’s intended host:port.

Starting a Hama Cluster

If you are in Local Mode, skip this step. Otherwise, run the command:
% $HAMA_HOME/bin/start-bspd.sh
It will start a BSPMaster, GroomServers, and ZooKeeper on your machine.

Stopping a Hama Cluster

To stop the Hama cluster, run the command:
% $HAMA_HOME/bin/stop-bspd.sh
It will stop all the daemons running on your cluster.

Enabling Fault Tolerance Service

The Fault Tolerance (FT) service is disabled by default. To enable it, set the following properties:
<property>
  <name>bsp.ft.enabled</name>
  <value>true</value>
  <description>Enable Fault Tolerance in BSP Task execution.</description>
</property>
<property>
  <name>bsp.checkpoint.enabled</name>
  <value>true</value>
  <description>Enable Hama to checkpoint the messages transferred among BSP tasks during the BSP synchronization period.</description>
</property>
<property>
  <name>bsp.checkpoint.interval</name>
  <value>10</value>
  <description>If bsp.checkpoint.enabled is set to true, the checkpointing is initiated on the valueth synchronization process of BSP tasks.</description>
</property>
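
One plausible reading of bsp.checkpoint.interval is that a checkpoint is taken on every Nth synchronization. The sketch below encodes that reading only; it is an assumption about the behavior, not a description of Hama's actual checkpointing code, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Encodes one assumed interpretation of the checkpoint interval setting.
class CheckpointSketch {
    // True when the given superstep (counted from 1) falls on a checkpoint boundary.
    static boolean shouldCheckpoint(boolean enabled, int interval, int superstep) {
        return enabled && interval > 0 && superstep > 0 && superstep % interval == 0;
    }

    // Lists the supersteps that would be checkpointed during a run.
    static List<Integer> checkpointedSupersteps(boolean enabled, int interval, int totalSupersteps) {
        List<Integer> result = new ArrayList<>();
        for (int superstep = 1; superstep <= totalSupersteps; superstep++) {
            if (shouldCheckpoint(enabled, interval, superstep)) {
                result.add(superstep);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // With the interval of 10 used in the properties above and a 35-superstep job.
        System.out.println(checkpointedSupersteps(true, 10, 35));
    }
}
```

A smaller interval means less recomputation after a failure but more time spent writing checkpoints, so the value is a trade-off that depends on the job.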

Hama Web Interfaces

The web user interface provides BSP job statistics for the Hama cluster, including running, completed, and failed jobs. By default, it’s available at http://localhost:40013

For additional information, check out the Compatibility Table:
Apache Hama Release Version | Apache Hadoop Release Version           | Java        | Known Compatibility Problems
0.5.0                       | 0.20.2, 0.20.2-cdh3u3b, and 1.0.x       | 1.6         | None
0.6.1                       | 0.20.2, 0.20.2-cdh3u3b, and 1.x         | 1.6 and 1.7 | None
0.6.2                       | 0.20.2, 0.20.2-cdh3u3b, and 1.x         | 1.6 and 1.7 | None
0.6.3                       | Hadoop 0.20.x, 1.x, 2.x, and CDH3, CDH4 | 1.6 and 1.7 | Support only HDFS2
0.6.4                       | Hadoop 0.20.x, 1.x, 2.x, and CDH3, CDH4 | 1.6 and 1.7 | Support only HDFS2
0.7.0                       | Hadoop 1.x, 2.6+, and CDH3, CDH4        | 1.7+        | Support HDFS2, Mesos, and YARN

 (Compatibility Table)

By default, Hama 0.7 ships with hadoop-2.7.0 jar files. If you are going to use a lower or higher version of Hadoop, you need to replace the Hadoop jar files in the ${HAMA_HOME}/lib folder. To run with Hadoop 1.x, download a src-release tarball and build it with the following command:
% mvn clean install -Phadoop1 -Dhadoop.version=1.x.x

Run Examples

Hama provides an examples package that lets you run examples on a Hama cluster quickly. To run one of them, use the command:
% $HAMA_HOME/bin/hama jar hama-examples-x.x.x.jar
For example, download an Iris dataset from here: http://people.apache.org/~edwardyoon/kmeans.txt Then run K-Means using the command:
% $HAMA_HOME/bin/hama jar hama-examples-x.x.x.jar kmeans /tmp/kmeans.txt /tmp/result 10 3

Conclusion

If you need a lot of parallel processing over a massive amount of data, it is a good idea to adopt the bulk synchronous parallel programming model and use the Apache Hama framework for the computations.

Note: Apache Hama is an open source volunteer project under the Apache Software Foundation, which encourages us to learn about the project and contribute our expertise. Doing so also helps stimulate interdisciplinary research and development in Hama to unleash its full potential.