What is Apache Storm?

Apache Storm is a free and open-source distributed real-time computation framework written primarily in the Clojure programming language. It is an advanced big data processing engine that processes real-time streaming data at very high speed, much faster than Apache Hadoop can. Apache Storm processes unbounded streams of data in a consistent, reliable way.

Basic Concept of Apache Storm

Storm reads a raw stream of freshly generated data at one end, passes it through a sequence of small processing units, and outputs the processed, useful information at the other end. The following diagram shows the concept of a topology.


  • Tuple – an ordered list of elements.
  • Stream – an unbounded sequence of tuples.
  • Spout – the entry point, or source of streams, in a topology. It continuously receives data from a data source, transforms that data into a stream of tuples, and emits the tuples to the bolts for processing.
  • Bolt – processes input streams and produces output streams. A bolt holds the logic required for processing, saving, and transmitting data. It can run functions, filter tuples, aggregate and join streams, and talk to databases.
  • Storm Topology – a network of spouts and bolts. Every node in the network contains processing logic, and the links between nodes describe how data flows and how processing executes. Each time a topology is submitted to the Storm cluster, Nimbus consults the Supervisor nodes about the available workers.
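The spout-to-bolt flow above can be sketched as a tiny word-count simulation in Python. This is only an illustration of the concepts, not Storm's actual API; the class names (SentenceSpout, SplitBolt, CountBolt) are made up for this example:

```python
# Minimal simulation of a Storm-style topology: a spout emits tuples,
# and bolts transform them in sequence. Names are illustrative, not Storm's API.

class SentenceSpout:
    """Source of the stream: emits raw sentences as one-element tuples."""
    def __init__(self, sentences):
        self.sentences = sentences

    def emit(self):
        for s in self.sentences:
            yield (s,)          # a tuple: an ordered list of elements

class SplitBolt:
    """Bolt: splits each sentence tuple into word tuples."""
    def process(self, stream):
        for (sentence,) in stream:
            for word in sentence.split():
                yield (word,)

class CountBolt:
    """Bolt: aggregates the word stream into running counts."""
    def __init__(self):
        self.counts = {}

    def process(self, stream):
        for (word,) in stream:
            self.counts[word] = self.counts.get(word, 0) + 1
        return self.counts

# Wire the topology: spout -> split bolt -> count bolt
spout = SentenceSpout(["storm is fast", "storm is reliable"])
counts = CountBolt().process(SplitBolt().process(spout.emit()))
print(counts)   # {'storm': 2, 'is': 2, 'fast': 1, 'reliable': 1}
```

In a real topology these units run as parallel tasks across a cluster; here they are chained in a single process purely to show how tuples flow from spout to bolts.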

Apache Storm vs Hadoop

Hadoop and Apache Storm are both frameworks for analyzing big data. They complement each other but differ in several respects: Storm handles everything Hadoop does except persistence, while Hadoop is good at batch processing but lags in real-time computation. The following table compares the attributes of Storm and Hadoop.

Storm | Hadoop
Supports real-time stream processing | Supports batch processing
Stateless | Stateful
Master-slave architecture with ZooKeeper-based coordination; the master node is called Nimbus and the slaves are Supervisors | Master-slave architecture with or without ZooKeeper-based coordination; the master node is called JobTracker and the slave nodes are called TaskTrackers
A topology runs until it is shut down by the user or hits an unexpected, unrecoverable failure | MapReduce jobs are executed in sequential order and eventually complete
If Nimbus or a Supervisor dies, restarting it resumes from where it stopped, so nothing is changed or lost | If the JobTracker dies, all active and running jobs are lost

Both frameworks are distributed and fault-tolerant.

Streaming vs. Batch Processing

 | Stream Processing | Batch Processing
Type | Continuous, real-time | Periodic batches
Model | DAG/graph | MapReduce-like jobs
Workload | CPU/memory intensive | CPU, memory, and I/O intensive
State | Stateless | Stateful
Cluster | Master-slave with ZooKeeper | Master-slave (JobTracker/TaskTracker)
Fault tolerance | Fault-tolerant | Fault-tolerant

Storm vs. Spark

 | Spark | Storm
Stream processing | Micro-batch processing | Native, record-at-a-time stream processing
Latency | Latency of a few seconds | Latency of milliseconds
Multi-language support | Fewer languages supported | Multiple languages supported

Use cases of Apache Storm

Typical use cases include real-time analytics, online machine learning, continuous computation, distributed RPC, and ETL. Some examples are as follows:

  • Wego – Wego is a travel metasearch engine. Travel-related data arrives from many sources all over the world, each with different timing. Storm helps Wego search real-time data, resolve concurrency issues, and find the best match for the end user.
  • NaviSite – NaviSite uses Apache Storm for its event-log monitoring and auditing system. Every log message generated in the system passes through Storm, which checks the message against a configured set of regular expressions; if there is a match, the message is saved to the database.
  • Twitter – Twitter uses Storm for its range of publisher analytics products, which process every tweet and click on the Twitter platform. Apache Storm is deeply integrated into Twitter's infrastructure.
  • Yahoo – Yahoo is working on a next-generation platform that enables the convergence of big data and low-latency processing. While Hadoop is its primary technology for batch processing, Storm powers the processing of user events, feeds, and logs.

Benefits of Apache Storm

  • Storm is user-friendly, robust, and open source.
  • Storm is fault-tolerant, reliable, and flexible, and can be used with many programming languages.
  • It supports real-time stream processing.
  • It guarantees data processing even if nodes in the cluster die or messages are lost.
  • Storm provides operational intelligence.
  • It has low latency.
  • It is highly scalable.
  • It can process data very fast.

Why should we use Apache Storm?

Storm makes it easy to work with a cluster: it includes facilities for retrieving metrics and configuration information as well as for starting and stopping topologies.

  • Apache Storm can process on the order of a million 100-byte messages per second on a single node.
  • It works on a fail-fast, auto-restart approach.
  • Each tuple is processed at least once, even when a failure occurs.
  • Storm is highly scalable and can continue calculations in parallel at the same speed under heavy load.

Apache Storm Architecture

Storm's architecture is very similar to Hadoop's. However, there are some differences, which are best understood with a closer look at its cluster.


Nodes: There are two types of nodes in a Storm cluster, similar to Hadoop.

Master Node

The master node of Storm runs a daemon called "Nimbus", which is similar to the "JobTracker" of a Hadoop cluster. Nimbus is responsible for assigning tasks to machines and monitoring their performance.

Worker Node

Similar to the master node, each worker node runs a daemon, called "Supervisor", which can run one or more worker processes on its node. Nimbus assigns work to the Supervisors, which start and stop worker processes according to requirements. Because Storm cannot manage its cluster state itself, it depends on ZooKeeper.

Zookeeper Framework

ZooKeeper facilitates communication between Nimbus and the Supervisors, carrying message acknowledgements, processing status, and so on.

Stream Grouping

Stream grouping controls how tuples are routed between tasks in a topology and helps you understand the flow of tuples through it. There are six types of grouping:

  1. Shuffle Grouping
  2. Fields Grouping
  3. Global Grouping
  4. All Grouping
  5. None Grouping
  6. Local or Shuffle Grouping

Graphical Representation of Grouping


  1. Shuffle Grouping – Tuples are randomly distributed across the bolt's tasks in such a way that each task is guaranteed to receive a roughly equal number of tuples.
  2. Fields Grouping – The stream is partitioned by the specified fields, so tuples with the same values for those fields always go to the same task. For example, if a stream is grouped by a "book-id" field, tuples with the same "book-id" always go to the same task, while tuples with different "book-id"s may go to different tasks.
  3. Global Grouping – The entire stream is sent to a single one of the bolt's tasks (specifically, the task with the lowest ID).
  4. All Grouping – A copy of each tuple is sent to all tasks of the receiving bolt. This is useful for sending signals to bolts.
  5. None Grouping – Specifies that you don't care how the stream is grouped. Currently, none groupings are equivalent to shuffle groupings.
  6. Local or Shuffle Grouping – If the target bolt has one or more tasks in the same worker process, tuples are shuffled to just those in-process tasks; otherwise it behaves like an ordinary shuffle grouping.
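The difference between shuffle and fields grouping can be sketched in a few lines of Python. The helper functions below are hypothetical, not part of Storm, but they mimic the key routing property: fields grouping hashes the grouping field, so tuples with equal values always land on the same task:

```python
# Sketch of how shuffle vs. fields grouping route tuples to bolt tasks.
# This mimics the routing rules only; it is not Storm's implementation.
import itertools
import zlib

NUM_TASKS = 3

def fields_grouping(tuples, key):
    """Same key value -> same task, via a stable hash of the field."""
    routes = {}
    for t in tuples:
        task = zlib.crc32(str(t[key]).encode()) % NUM_TASKS
        routes.setdefault(task, []).append(t)
    return routes

def shuffle_grouping(tuples):
    """Round-robin: tuples spread evenly across tasks."""
    routes = {}
    for task, t in zip(itertools.cycle(range(NUM_TASKS)), tuples):
        routes.setdefault(task, []).append(t)
    return routes

tuples = [{"book-id": 1, "qty": 2}, {"book-id": 2, "qty": 5},
          {"book-id": 1, "qty": 7}, {"book-id": 3, "qty": 1}]

# Fields grouping: both tuples with book-id 1 reach the same task.
by_field = fields_grouping(tuples, "book-id")
tasks_for_1 = {task for task, ts in by_field.items()
               for t in ts if t["book-id"] == 1}
print(len(tasks_for_1))   # 1

# Shuffle grouping: 4 tuples over 3 tasks split as evenly as possible.
by_shuffle = shuffle_grouping(tuples)
print(sorted(len(ts) for ts in by_shuffle.values()))   # [1, 1, 2]
```

Real Storm uses its own hashing and load-balancing internally, but the invariants shown here (co-location of equal field values, even spread under shuffle) are the ones that matter when choosing a grouping.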

Apache Storm Workflow

Let’s take a close look at the workflow of the storm.

  • Initially, Nimbus waits for a Storm topology to be submitted to it.
  • Once a topology is submitted, Nimbus processes it and gathers all the tasks to be carried out, along with the order in which they should execute.
  • At regular intervals, every Supervisor sends a heartbeat to Nimbus to report that it is still alive.
  • If a Supervisor dies and stops sending heartbeats, Nimbus reassigns its tasks to another Supervisor.
  • If Nimbus itself dies, the Supervisors keep working on their already-assigned tasks without interruption.
  • When all tasks are completed, the Supervisors wait for new tasks.
  • Meanwhile, the dead Nimbus is restarted automatically by service monitoring tools.
  • The restarted Nimbus continues from where it stopped, and a dead Supervisor can likewise restart automatically. Hence every task is guaranteed to be processed at least once.
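The heartbeat-and-reassignment step above can be sketched as a toy Python model. The timeout value, supervisor names, and data structures here are invented for illustration; they are not Storm's internals:

```python
# Toy model of Nimbus reassigning tasks when a Supervisor stops heartbeating.
# Uses a simulated clock; the timeout is illustrative, not Storm's default.

HEARTBEAT_TIMEOUT = 1.0  # seconds

last_heartbeat = {"sup-1": 0.0, "sup-2": 0.0}
assignments = {"sup-1": ["task-a", "task-b"], "sup-2": ["task-c"]}

def heartbeat(sup, now):
    """A Supervisor reports to Nimbus that it is still alive."""
    last_heartbeat[sup] = now

def reassign_dead(now):
    """Nimbus moves tasks off any Supervisor whose heartbeat is too old."""
    for sup, beat in list(last_heartbeat.items()):
        if now - beat > HEARTBEAT_TIMEOUT:
            alive = [s for s, b in last_heartbeat.items()
                     if s != sup and now - b <= HEARTBEAT_TIMEOUT]
            if alive:  # hand the dead Supervisor's tasks to a live one
                assignments[alive[0]].extend(assignments.pop(sup))
                del last_heartbeat[sup]

heartbeat("sup-2", 2.0)      # sup-2 keeps beating; sup-1 has gone silent
reassign_dead(2.0)
print(assignments)           # {'sup-2': ['task-c', 'task-a', 'task-b']}
```

In real Storm this bookkeeping flows through ZooKeeper rather than in-process dictionaries, which is what lets a restarted Nimbus pick up where the old one stopped.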

Modes in Storm Cluster

There are two modes in storm cluster-

  1. Local Mode
  2. Production Mode

Local Mode – In this mode we can modify parameters and see how our topology behaves under different Storm configurations. It is used for development, testing, and debugging.

Production Mode – In this mode we submit our topology to a working Storm cluster, which is composed of many processes, usually running on different machines. The cluster runs indefinitely until it is shut down.



