PySpark Tutorial for Beginners

Introduction to PySpark

The Apache Spark community released PySpark to support Python on Spark. PySpark is a combination of Python and Apache Spark (PySpark = Python + Spark), and both are popular in the analytics industry. Before moving on to PySpark, let us look at Python and Apache Spark.

What is Apache Spark?

Big Data computation is heading towards a future in which processing speed has to keep pace with the speed at which data is generated, in structured, unstructured, and semi-structured formats. This is where Apache Spark comes into the picture: it is a highly scalable, fault-tolerant, resilient, and versatile processing engine for Big Data.

Apache Spark is a cluster-computing framework used for processing, querying, and analyzing Big Data. It is a fast in-memory Big Data processing engine with machine learning capabilities. Apache Spark is written in the Scala programming language. Spark performs most operations in memory, which makes it faster than MapReduce, which writes data to disk after each step. Its data query capabilities have made Apache Spark a widely deployed computation engine at some of the biggest enterprises, such as Google, Alibaba, and eBay.

What is Python?

Python is a programming language that is easy to learn and use. It provides a comprehensive and straightforward API. It offers many options for data visualization, which is more difficult in Scala and Java. Python has a wide range of libraries such as Pandas, NumPy, Seaborn, and scikit-learn.

Fundamentals of PySpark

Following is the list of fundamentals of PySpark:

  1. RDDs
  2. DataFrame
  3. PySpark SQL
  4. PySpark Streaming
  5. Machine Learning

Let us see the fundamentals in detail:

  1. RDDs (Resilient Distributed Datasets)

Resilient Distributed Datasets are the basic building blocks of a Spark application.

Resilient: Fault tolerant and able to reconstruct the data on failure.

Distributed: The data is distributed among the nodes of the cluster.

Datasets: A collection of partitioned data with values.

An RDD is an abstraction layer over a distributed collection of data. It is immutable and evaluated lazily. Two kinds of operations apply to RDDs: transformations and actions. A transformation creates a new RDD from an existing one. An action instructs Apache Spark to apply the computation and pass the result back to the driver.

  2. DataFrame

A DataFrame is a distributed collection of data in a structured or semi-structured format. The data in a DataFrame is stored as tables (relations), as in an RDBMS. DataFrames and RDDs share some properties: both are immutable, distributed, and lazily evaluated. DataFrames support a wide range of formats such as JSON, TXT, CSV, and many more.

  3. PySpark SQL

PySpark SQL is an abstraction module in PySpark. It is used with structured or semi-structured datasets. It provides an optimized API and reads data from various data sources with different file formats. The user can process the data with SQL.

  4. PySpark Streaming

PySpark Streaming is a scalable, fault-tolerant system that follows the RDD batch model. It operates on batch intervals, which range from 500 ms to larger interval windows.

In PySpark Streaming, Spark Streaming receives input data from sources such as Kafka, Apache Flume, TCP sockets, and Kinesis. The collected stream is divided into batch intervals and forwarded to the Spark engine. The Spark engine processes each batch using sophisticated algorithms. After processing, the results are pushed to databases, file systems, and live dashboards.
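The batch-interval idea can be illustrated without a running stream. The following is a toy model in plain Python, not the PySpark Streaming API: the function names, the 500 ms interval, and the sample events are all assumptions chosen for illustration.

```python
from collections import defaultdict


def split_into_batches(events, interval_ms):
    """Group (timestamp_ms, value) events into fixed-width batch intervals."""
    batches = defaultdict(list)
    for timestamp_ms, value in events:
        batch_start = (timestamp_ms // interval_ms) * interval_ms
        batches[batch_start].append(value)
    # Return the batches ordered by the start of their interval.
    return sorted(batches.items())


def process_batch(values):
    """Stand-in for the job the engine runs on each batch: a word count."""
    counts = defaultdict(int)
    for word in values:
        counts[word] += 1
    return dict(counts)


# Made-up stream of (arrival time in ms, word) pairs.
events = [(0, "a"), (120, "b"), (480, "a"), (510, "b"), (990, "b"), (1300, "a")]

results = [
    (start, process_batch(values))
    for start, values in split_into_batches(events, 500)
]
for start, counts in results:
    print(start, counts)
```

Each tuple in `results` corresponds to one batch interval; in a real deployment the per-batch output would be written to a database, file system, or dashboard instead of printed.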

 

  5. Machine Learning

Python has been used for machine learning and data science for a long time. Spark provides MLlib (Machine Learning Library), and PySpark uses MLlib to facilitate machine learning.

MLlib offers core machine learning functionality such as data preparation, machine learning algorithms, and utilities.

Data preparation: Includes feature selection, extraction, transformation, and hashing.

Machine learning algorithms: Regression, classification, and clustering algorithms.

Utilities: Statistical methods such as chi-square testing, linear algebra, and model evaluation methods.

Features of PySpark

PySpark has several features that set it apart from other frameworks:

  1. Speed
  2. Deployment
  3. Powerful Caching
  4. Data Scientist Interface
  5. Polyglot
  6. Real Time

Let us see the features of the PySpark one by one:

  1. Speed

Spark can run workloads up to 100x faster than traditional large-scale data processing engines such as MapReduce, largely because it keeps data in memory.

  2. Deployment

PySpark can be deployed in several ways: through Hadoop via YARN, through Mesos, or through Spark's own cluster manager.

  3. Powerful Caching

PySpark has a simple programming layer that provides excellent caching and disk persistence capabilities.

  4. Data Scientist Interface

PySpark gives data scientists a Python interface to Spark's RDDs, built on the Py4J library, which lets Python code drive the JVM-based Spark engine.

  5. Polyglot

Spark supports programming in several languages, such as R, Scala, Java, and Python.

  6. Real Time

Because of in-memory computation, PySpark achieves low-latency, near-real-time processing.

PySpark SparkContext

PySpark makes it possible to work with RDDs from Python. SparkContext is the heart of a Spark application. The PySpark shell links the Python API to Spark Core and initializes the SparkContext.

  • SparkContext sets up internal services and establishes a link to the Spark execution environment.
  • In the driver program, the SparkContext object coordinates all the distributed processes and handles resource allocation.
  • The cluster manager provides executors, which are JVM processes that run the application logic.
  • SparkContext sends tasks to the executors, where they are executed.

Advantages of PySpark

Advantages of PySpark (Python) over Scala:

  1. Simple to write

With PySpark, it is easy to write parallelized code for simple problems.

  2. Framework handles errors

The framework handles errors and synchronization problems.

  3. Algorithms

Most common algorithms are already implemented in Apache Spark.

  4. Libraries

Compared with Scala, Python has a rich set of libraries. Together with Py4J and MLlib, they make a machine learning and data science interface possible.

  5. Good local tools

Python has many good visualization tools, whereas Scala has fewer and less effective ones.

  6. Learning curve

Python has a gentler learning curve than Scala.

  7. Ease of use

Python is very easy to use.

Disadvantages of PySpark

Disadvantages of PySpark (Python) compared with Scala:

  1. Difficult to express

It can be challenging to express a problem in MapReduce fashion.

  2. Less efficient

Python is less efficient than some other programming languages, and efficiency drops further when a job needs a lot of communication.

  3. Slow

Python does not handle heavy jobs as well, and for Spark jobs its performance is poorer than Scala's. Scala can be up to 10x faster and handles heavy jobs smoothly.

  4. Immature

For streaming, Scala is the better option; Python support is not yet mature enough to handle streaming well.

  5. Cannot change the internal functioning of Spark

Apache Spark is written in the Scala programming language, so changes to its internal functioning have to be made in Scala. Python cannot be used in this case.

PySpark in Industry

Apache Spark is used by many companies around the world for various purposes.

Yahoo!

Yahoo uses Apache Spark for machine learning capabilities, such as personalizing its news and web pages and targeting advertising.

Yahoo uses PySpark for reasons such as:

  • To learn what kind of news users are interested in reading.
  • To categorize news stories and learn which kinds of users are interested in which categories.

TripAdvisor

TripAdvisor uses Apache Spark to advise millions of travelers and to find the best hotel prices for its customers by comparing hundreds of websites.

Alibaba

Alibaba, the world's largest e-commerce platform, uses Apache Spark to analyze hundreds of petabytes of data.
