Apache Spark Components
The Spark project consists of various types of components that are closely integrated. Spark is at its core a computational engine capable of scheduling, distributing, and monitoring multiple apps.
The main components of Spark are:
- Spark Core
- Spark SQL
- Spark Streaming
- Mlib Machine Learning
- GraphX graph Processing
Spark core
- Spark Core is the heart of Spark, which is built on all other functionalities.
- It includes the components for job scheduling, fault recovery, communicating with storage, and memory management systems.
Spark SQL
- On top of Spark Core, the Spark SQL is developed. It supports structured data.
- Spark SQL(Structured Query Language) allows querying data from SQL as well as Apache Hive of SQL, which is called HQL (Hive Query Language).
- It supports connections between JDBC and ODBC that create a relationship between Java objects and existing databases, data warehouses, and business intelligence tools.
- It also supports complex data sources such as Hive tables, Parquet, and JSON.
SQL Streaming
- Spark Streaming is a component of Spark that supports scalable and fault-tolerant streaming data processing.
- It uses the quick scheduling capabilities of Spark Core to conduct streaming analytics.
- It accepts mini-batch data and executes RDD transformations on that data.
- Its architecture ensures that the applications written for streaming data can be reused with little modification to analyze batches of historical data.
- Log files generated by web servers can be considered as an example of a data stream in real-time.
MLlib
- MLlib is a type of machine library, which includes a variety of machine learning algorithms.
- It comprises checking of associations and theories, classification and regression, clustering, and study of main components.
- The disk-based implementations have been used nine times by the Apache mahout to make it faster.
GraphX
- GraphX is a library used to manipulate graphs and perform parallel graph computing.
- It facilitates the development of a directed graph with arbitrary properties attached to each vertex and edge.
- To control the graph, it supports various key operators, such as subgraph, merges Vertices, and aggregate Messages.