Cassandra vs HBase

Cassandra:

Apache Cassandra is a distributed NoSQL database management system. It is designed to handle large amounts of data stored across many machines, and its architecture is what allows it to operate at that scale.

The same data is stored on several machines as replicas, which provides high availability and no single point of failure.

HBase:

Apache HBase is an open-source, non-relational distributed database. It is a scalable, distributed big data store written in Java.

HBase is a column-oriented database management system that runs on top of HDFS (Hadoop Distributed File System) or Alluxio, giving Hadoop Bigtable-like capabilities.

Similarities:

Both HBase and Cassandra trace back to Google's Bigtable. HBase was modeled directly on Bigtable, while Cassandra combines a Bigtable-style data model with the distributed design of Amazon's Dynamo. The following are the similarities between Cassandra and HBase:

  • Database:

Cassandra and HBase are both open-source NoSQL databases. Both were originally built to handle large amounts of data, especially in the field of Big Data, and their architectures were designed around that requirement.

  • Scalability:

Both use a node, datacenter and cluster architecture, so they can scale to very large data sets. Both also scale close to linearly as nodes are added. They are well suited to projects that deal with large amounts of data.

  • Replication:

Replication is available in both Cassandra and HBase.

With replication, data written to one node is copied to several other nodes. This guards against a single point of failure, which could otherwise cause data loss or degraded performance for some time.

Hence, if one node fails, another node holding a replica takes over.

  • Programming:

Both Cassandra and HBase are column-oriented databases. The column is the main unit of storage, and more columns can be added as required. On the write path, both log each write operation to a log file, which ensures durability.

Both can be programmed from Java, as both were written in Java.

The following are some of the differences between Cassandra and HBase:

  • Transaction:

Both Cassandra and HBase have mechanisms for transactions, and each has two of them.

Cassandra transactions are lightweight. 'Row Level Write Isolation' and 'Compare & Set' are the methods it uses for transactions. The mechanisms implemented for transactions in HBase are 'Check and Put' and 'Read Check Delete'.
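
The compare-and-set idea behind these mechanisms can be sketched in plain Python. This is a toy, in-memory illustration, not the Cassandra or HBase client API: the `Row` class, its `compare_and_set` method and the lock that stands in for row-level write isolation are all hypothetical names chosen for the example.

```python
import threading


class Row:
    """Toy single-row store; NOT the Cassandra or HBase API."""

    def __init__(self, value=None):
        self._value = value
        # Assumption: one lock per row stands in for row-level write isolation.
        self._lock = threading.Lock()

    def compare_and_set(self, expected, new):
        """Write `new` only if the current value still equals `expected`."""
        with self._lock:
            if self._value != expected:
                return False  # someone else changed the row first
            self._value = new
            return True

    def get(self):
        with self._lock:
            return self._value


row = Row("draft")
first = row.compare_and_set("draft", "published")   # applied
second = row.compare_and_set("draft", "archived")   # rejected, value has changed
```

The second call fails because its expectation is stale, which is what lets concurrent writers avoid silently overwriting each other.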

  • Query language:

Cassandra has its own dedicated query language, called CQL (Cassandra Query Language). CQL is modeled on SQL and adapted to Cassandra's requirements, and it is the primary language for working with Cassandra. HBase, by contrast, has its own shell that runs on top of Hadoop.

Compared to HBase, Cassandra offers more advanced options in terms of query features and functions.

  • Documentation:

Cassandra's documentation is simpler and easier to learn from, as the way Cassandra works is transparent. Hence, the documentation is considered much better for Cassandra than for HBase.

Furthermore, cluster setup is also much easier in Cassandra than in HBase.

  • Infrastructure:

Cassandra applications often run Cassandra alongside other systems in their infrastructure, such as another DBMS, Hadoop or Storm. Within Cassandra itself there is a single type of node, and every node performs the same role. When Cassandra is used on its own, i.e. not alongside anything else, a Cassandra node acts as the coordinator.

Complexity mostly arises when Cassandra is used together with another DBMS.

HBase, on the other hand, works differently from Cassandra. It has several moving parts, such as the HBase master, data nodes and name nodes.

  • Support:

Cassandra supports ordered partitioning, and rows can be tens of megabytes in size. Ordered partitioning can, however, create hotspots. Cassandra also has the downside of limited support for range scans over rows, and it does not support coprocessor-like functionality.

HBase, by contrast, does not offer ordered partitioning as a selectable option; its tables are always split by row key range. HBase does offer triggers and coprocessor capability. Each row is served by only one region server at a time, so read load balancing against a single row is not supported. HBase also supports range-based scans.
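
To see why ordered partitioning can produce hotspots, here is a minimal sketch that places the same keys using an ordered scheme and a hashed scheme. The node names, split points and the MD5-based hash are assumptions made for illustration; they are not the partitioners that Cassandra or HBase actually ship.

```python
import bisect
import hashlib
from collections import Counter

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster


def ordered_node(key, split_points=("g", "n", "t")):
    """Ordered partitioning: each node owns a contiguous, sorted key range."""
    return NODES[bisect.bisect_right(split_points, key)]


def hashed_node(key):
    """Hash partitioning: the key's hash, not its sort order, picks the node."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


# Sequential keys that share a prefix, as often produced by counters or timestamps.
keys = [f"user{i:04d}" for i in range(1000)]
ordered_load = Counter(ordered_node(k) for k in keys)
hashed_load = Counter(hashed_node(k) for k in keys)
```

Under the ordered scheme every `user…` key falls into the same range and lands on one node, while the hashed scheme spreads the keys across the cluster. The trade-off is that only the ordered layout keeps neighbouring keys together, which is what makes range scans cheap.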

  • Nodes:

When using Cassandra, the user must designate some of the nodes as seed nodes. These nodes serve as contact points for communication within the cluster. HBase, on the other hand, has master nodes, which coordinate and monitor the region servers and the actions they take in response to requests.

In Cassandra, the presence of multiple seed nodes helps ensure high scalability and availability. In HBase, the master nodes play the same role. If the active master node fails, a standby master takes over.

  • Inter Node Communication:

Both Cassandra and HBase have inter-node communication. Cassandra's nodes communicate using the gossip protocol. For writes, the coordinator node contacts the other nodes in the data centre to copy and store replicas of the data. This approach is efficient and contributes to Cassandra's performance and durability.

HBase, on the other hand, relies on ZooKeeper for coordination. Here, a master node monitors the other nodes and collects their data.

  • Miscellaneous:

Both Cassandra and HBase have Bloom filters, but they use them for different purposes.

In Cassandra, Bloom filters are used for key lookup. Cassandra's random partitioning allows a single row to be replicated across the WAN.

In HBase, Bloom filters are used for indexing. HBase provides asynchronous replication across the WAN, with the cluster as the unit of replication.
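
For readers unfamiliar with the data structure, a Bloom filter can be sketched in a few lines of Python. This is a generic illustration of the technique, not the implementation used by either database; the `BloomFilter` class, its sizes and the SHA-256 based hashing are assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Generic Bloom filter sketch: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive several bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter()
for row_key in ("alice", "bob", "carol"):
    bf.add(row_key)
```

A store can consult such a filter before touching disk: if `might_contain` returns `False`, the file certainly does not hold the key and the read can be skipped.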

Architectural Differences:

  • Cassandra:

Dynamo:

It has a ring-type structure. All the nodes are connected to each other and there is no master node. An incoming request hits one of the nodes, and that node processes the request and writes to the database.

  • Gossip: After a write, the receiving node forwards it to the other replica nodes. Nodes also periodically exchange state information with one another; this peer-to-peer interaction is known as gossip.
  • Failure Detection: With no central node, it is up to the nodes themselves to notice when a peer fails to respond or update. Such failures are detected and recovered from.
  • Replication: Data can be replicated across the nodes as required.
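
The ring structure and replication described above can be sketched as follows. This is a simplified model under stated assumptions: one token per node, an MD5-based `token` function and a "next N nodes clockwise" placement rule. The `Ring` class and node names are hypothetical and do not reproduce Cassandra's actual partitioner or replication strategies.

```python
import bisect
import hashlib


def token(value):
    """Map a string to a ring position (illustrative hash, not Cassandra's partitioner)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class Ring:
    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        self.ring = sorted((token(n), n) for n in nodes)  # nodes ordered by token
        self.tokens = [t for t, _ in self.ring]

    def replicas(self, key):
        """Walk clockwise from the key's token and take the next N nodes."""
        start = bisect.bisect_right(self.tokens, token(key)) % len(self.ring)
        count = min(self.replication_factor, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"])
owners = ring.replicas("user:42")  # three distinct nodes that hold this key
```

Because every node can compute `replicas` on its own, any node can accept a request and forward it to the right peers, which is why no master is needed.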

Storage Engine:

  • Commit-Log: Every write in Cassandra goes to the commit-log first and then to the mem-tables. This improves durability and guards against data loss on an unexpected shutdown.
  • Mem-tables: These are in-memory structures in which Cassandra buffers writes.
  • SSTables: The mem-tables are later flushed to disk and converted into SSTables. These are immutable data files: once written, they are never modified.
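
The write path above can be illustrated with a small in-memory model. It is only a sketch: the `TinyStore` class and its flush threshold are hypothetical, lists and tuples stand in for files on disk, and details such as compaction and commit-log cleanup are left out.

```python
class TinyStore:
    """Toy write path: commit log -> mem-table -> immutable SSTables."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only record of every write
        self.memtable = {}     # in-memory buffer of recent writes
        self.sstables = []     # flushed, immutable, sorted segments
        self.memtable_limit = memtable_limit  # assumed flush threshold

    def write(self, key, value):
        self.commit_log.append((key, value))  # 1. log first, for durability
        self.memtable[key] = value            # 2. then buffer in memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # 3. freeze the mem-table into a sorted, immutable segment
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.sstables):  # newest segment wins
            for k, v in segment:
                if k == key:
                    return v
        return None


store = TinyStore()
store.write("b", 1)
store.write("a", 2)  # reaches the limit and triggers a flush
store.write("a", 3)  # newer value stays in the mem-table
```

Reads check the mem-table before the flushed segments, so the newest value for a key is returned even though older segments are never rewritten.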

Cassandra's core objectives of large scalability, high availability and meeting storage requirements are achieved through this architecture.

  • HBase:

HBase has three major components: HMaster, Region Server and ZooKeeper.

  • HMaster:

HMaster is the implementation of the master server in HBase. It assigns regions to the region servers and handles DDL operations, i.e. creating and deleting tables. HMaster monitors all the region servers present in the cluster. In a distributed environment, it runs several background threads. It provides features such as failover and control of load balancing.

  • Region Server:

Regions are the basic building blocks of an HBase cluster: they hold the distributed portions of tables and are made up of column families. HBase tables are split horizontally into regions by row key range. A region server runs on an HDFS data node in the Hadoop cluster. It is responsible for serving and managing a set of regions and for executing the read and write operations on them. The default size of an HBase region is 256 MB in older releases; newer releases use a much larger default.
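
Splitting a table by row key range means that finding the region for a key is a search over sorted start keys. The sketch below assumes a hypothetical three-region table and made-up server names; it illustrates the lookup idea only and is not the HBase client API.

```python
import bisect

# Hypothetical regions: (start_row_key, region_server). "" means "from the first key".
REGIONS = [
    ("", "regionserver-1"),
    ("g", "regionserver-2"),
    ("p", "regionserver-3"),
]
START_KEYS = [start for start, _ in REGIONS]


def locate_region(row_key):
    """Return the (start_key, server) pair of the region whose range holds row_key."""
    index = bisect.bisect_right(START_KEYS, row_key) - 1
    return REGIONS[index]


def scan(start_key, stop_key):
    """List the servers a range scan over [start_key, stop_key) would touch."""
    first = bisect.bisect_right(START_KEYS, start_key) - 1
    last = bisect.bisect_left(START_KEYS, stop_key)
    return [server for _, server in REGIONS[first:last]]
```

For example, `locate_region("apple")` resolves to the first region, and `scan("a", "h")` touches only the first two servers, since neighbouring row keys live in the same or adjacent regions.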

  • Zookeeper:

ZooKeeper is used by the master node of the HBase cluster. Clients and region servers reach the master node via ZooKeeper. It helps the master node monitor and maintain the cluster. It also provides naming, configuration information, distributed synchronization and server failure notification.

Architecture Difference summary:

HBase has a master-based architecture, while Cassandra has a masterless architecture. This means the HBase master is a potential single point of failure (mitigated by a standby master), while Cassandra has none.

A node in Cassandra does not have to contact any master, which saves the user a lot of time. In HBase, the client also communicates directly with the region server without going through the master server, which saves time, though less than in Cassandra.

Cassandra has its own downside: replicating data across nodes can cause consistency problems. For use cases that need strong consistency, this makes HBase the recommended choice and Cassandra a poor one.

HBase handles only data management and relies on HDFS for storage, whereas Cassandra handles both data management and storage.