PySpark Tutorial
Introduction to PySpark
The Apache Spark community released the 'PySpark' tool to support Python on Spark. PySpark is a combination of Python and Apache Spark — in short, PySpark = Python + Spark. Both Python and Apache Spark are trendy terms in the analytics industry. Before moving on to PySpark, let us understand Python and Apache Spark.
What is Apache Spark?
Big Data computation is hurtling towards a future where processing speed has to keep pace with the rate at which data is generated in structured, unstructured, and semi-structured formats. Apache Spark fits this picture: it is a highly scalable, fault-tolerant, resilient, and versatile processing engine for Big Data.
Apache Spark is a cluster-computing framework used for processing, querying, and analyzing Big Data. It is a fast, in-memory Big Data processing engine with built-in machine learning capabilities. Apache Spark is written in the Scala programming language. Spark performs nearly all operations in memory, which makes it faster than MapReduce, which writes intermediate results to disk after each step. These qualities have made Apache Spark a widely deployed computation engine at some of the biggest enterprises, such as Google, Alibaba, and eBay.
What is Python?
Python is a programming language that is easy to learn and implement. It provides a comprehensive and straightforward API. It offers many options for data visualization, whereas data visualization is difficult in Scala and Java. Python also has a wide range of libraries such as Pandas, NumPy, Seaborn, scikit-learn, etc.
Fundamentals of PySpark
Following is the list of fundamentals of PySpark:
- RDDs
- DataFrame
- PySpark SQL
- PySpark Streaming
- Machine Learning
Let us see the fundamentals in detail:
- RDDs (Resilient Distributed Datasets)
Resilient Distributed Datasets are the basic building blocks of Spark’s application.
Resilient: Fault-tolerant and able to reconstruct the data on failure.
Distributed: The data is distributed among all the nodes of the cluster.
Datasets: Collections of partitioned data with values.
An RDD is a layer of abstraction over a distributed collection of data. It is immutable and follows lazy evaluation. Two kinds of operations are applied to RDDs: transformations and actions. A transformation creates a new RDD from an existing one, whereas an action instructs Apache Spark to apply the computation and pass the result back to the driver.
- DataFrame
A DataFrame is a distributed collection of data in a structured or semi-structured format. The data in a DataFrame is stored in the form of tables/relations, as in an RDBMS. DataFrames share some properties with RDDs: they are immutable, distributed in nature, and follow lazy evaluation. DataFrames support a wide range of formats such as JSON, TXT, CSV, and many more.
- PySpark SQL
PySpark SQL is an abstraction module in PySpark. It is used on structured or semi-structured datasets. It provides an optimized API and can read data from various data sources with different file formats. The user can then process the data with SQL.
- PySpark Streaming
PySpark Streaming is a scalable, fault-tolerant system that follows the RDD batch model. It operates on batch intervals, which range from 500 ms to larger interval windows.
In PySpark Streaming, input data is received from sources such as Kafka, Apache Flume, TCP sockets, and Kinesis. The collected stream data is divided into batches at each interval and forwarded to the Spark engine, which processes the batches using sophisticated algorithms. After processing, the results are pushed to databases, file systems, and live dashboards.
- Machine Learning
Python has been used for machine learning and data science for a long time. Spark ships with MLlib (Machine Learning Library), and PySpark uses MLlib to facilitate machine learning.
MLlib provides core machine-learning functionality: data preparation, machine-learning algorithms, and utilities.
Data preparation: Data preparation includes selection, extraction, transformation, and hashing.
Machine learning algorithms: It provides regression, classification, and clustering algorithms.
Utilities: It has statistical methods such as chi-square testing, linear algebra, and model-evaluation methods.
Features of PySpark
PySpark has multiple features that make it a unique and better framework than others:
- Speed
- Deployment
- Powerful Caching
- Data Scientist Interface
- Polyglot
- Real Time
Let us see the features of the PySpark one by one:
- Speed:
PySpark can be up to 100x faster than traditional large-scale data processing engines like MapReduce.
- Deployment
PySpark can be deployed in several ways: over Hadoop via YARN, on Mesos, or with Spark's own standalone cluster manager.
- Powerful Caching
PySpark provides a simple programming layer with powerful caching and disk persistence capabilities.
- Data Scientist Interface
PySpark offers a convenient interface for data scientists: it exposes RDDs through the Py4J library, which bridges the Python API and the JVM-based Spark core.
- Polyglot
It supports programming in many programming languages like R, Scala, Java, and Python.
- Real Time
Because of in-memory computation, PySpark achieves real-time computation with low latency.
PySpark SparkContext
PySpark makes it possible to work with RDDs. The PySpark shell initializes the SparkContext, which is the heart of any Spark application; the shell links the Spark core with the Python API.
- SparkContext sets up internal services and establishes a connection to the Spark execution environment.
- In a driver program, the SparkContext instance/object coordinates all the distributed processes and handles resource allocation.
- The cluster manager launches JVM processes on the worker nodes, and these JVM processes act as executors.
- The SparkContext sends tasks to run in each executor.
Advantages of PySpark
Advantages of PySpark (Python) over Scala programming:
- Simple to write
With PySpark, it is effortless to write parallelized code for simple problems.
- Framework handles error
The framework easily handles errors and synchronization problems.
- Algorithms
Most of the important algorithms are already implemented in Apache Spark.
- Libraries
Compared with Scala, Python has a rich set of libraries, such as Py4J and the MLlib bindings. Machine learning and data science interfaces are possible because of these libraries.
- Good local tools
Multiple effective visualization tools are available in Python, whereas Scala has fewer and less effective ones.
- Learning curve
Python has a gentler learning curve compared with Scala.
- Ease of use
Python is very easy to use.
Disadvantages of PySpark
Disadvantages of PySpark (Python) over Scala programming:
- Difficult to express
It can be challenging to express a problem in MapReduce fashion.
- Less efficient
Python is less efficient than several other programming languages, and efficiency drops further when a lot of communication is needed.
- Slow
Python cannot handle heavy jobs well, and for Spark jobs its performance is poorer than Scala's. Scala can be roughly 10x faster and handles heavy jobs smoothly.
- Immature
For streaming, Scala is a good option, whereas Python is not mature enough to handle streaming well.
- Cannot use the internal functioning of Spark
Apache Spark is written in the Scala programming language, so any change to Spark's internal functioning must be made in Scala; Python cannot be used in this case.
PySpark in Industry
Apache Spark is used by many companies around the world for various purposes.
Yahoo!
Yahoo uses Apache Spark for machine-learning capabilities, such as personalizing its news and web pages and for targeted advertising.
Yahoo uses PySpark for reasons such as:
- To learn what kind of news users are interested in reading.
- To categorize news stories and learn which categories of stories different users are interested in reading.
TripAdvisor
TripAdvisor uses Apache Spark to advise millions of travelers, finding the best hotel prices for its customers by comparing hundreds of websites.
Alibaba
Alibaba, the world's largest e-commerce platform, uses Apache Spark to analyze hundreds of petabytes of data.