Introduction to Python Pandas

According to Wikipedia, the name Pandas is derived from "panel data", an econometrics term for multidimensional data sets that include observations over multiple time periods for the same individuals. The name is also commonly expanded as Python Data Analysis Library. Pandas is an open-source, BSD-licensed library for the Python programming language, started by Wes McKinney in 2008. It gives developers highly optimized tools for data analysis, cleaning, and manipulation, built around powerful, expressive, and flexible data structures such as DataFrame and Series.

Pandas is built on top of NumPy: it depends on and interoperates with NumPy for fast numeric array computation. NumPy is a Python library for computation on matrices and on single- and multi-dimensional arrays, along with an extensive collection of high-level numerical tools for array operations. Pandas enables developers to carry out their entire data analysis workflow in Python without having to switch to a more domain-specific language like R.

Pandas is well suited for various kinds of data, such as:

Ordered and unordered time series data
Arbitrary matrix data
Tabular data with heterogeneously-typed columns
Unlabeled data
Observational or statistical data sets

Before Pandas, Python was used mainly for data munging and preparation and contributed little to data analysis itself. Pandas closed this gap by supporting the five typical steps of data processing and analysis, regardless of where the data comes from: load, prepare, manipulate, model, and analyze. Python with Pandas is used in a wide range of academic and commercial domains, including finance, economics, statistics, and analytics.

Pandas Installation

To install Pandas, open a command line or terminal and type:

pip install pandas

Alternatively, you can install Pandas through the Anaconda Python distribution (https://www.anaconda.com/), which this tutorial recommends as the best way, and then type:

conda install pandas

Once the installation is complete, open your IDE (Jupyter, PyCharm, etc.) and import Pandas by typing:

import pandas as pd
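A quick way to confirm that the installation worked is to import the library and print its version. This is only a minimal check; the version shown depends on what pip or conda installed on your machine.

```python
import pandas as pd

# If the import succeeds, Pandas is installed; the version string
# tells you which release you are working with.
version = pd.__version__
print(version)
```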

Features of Pandas

Pandas provides an elegant and simple API.
Pandas offers high performance when merging and joining high-volume datasets.
Pandas is easy to learn, use, and maintain, so you can focus more on research and less on programming.
Pandas bridges the gap between rapid iterations of ad-hoc analysis and production quality code.
Pandas provides the DataFrame object for fast and efficient data manipulation with integrated indexing.
Pandas offers time-series functionality that includes:

date range generation and frequency conversion
moving window statistics and window linear regressions
date shifting and lagging
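The time-series features listed above can be sketched with a small daily Series. The dates and values below are made up for illustration, and the 2-day and 3-day windows are arbitrary choices.

```python
import pandas as pd

# Date range generation: six consecutive daily timestamps (illustrative dates).
days = pd.date_range("2024-01-01", periods=6, freq="D")
prices = pd.Series([10.0, 12.0, 11.0, 13.0, 15.0, 14.0], index=days)

# Frequency conversion: downsample the daily values to 2-day means.
two_day = prices.resample("2D").mean()

# Moving window statistics: 3-day rolling mean (the first two entries are NaN).
rolling = prices.rolling(window=3).mean()

# Date shifting and lagging: move every value one row (one day) later.
lagged = prices.shift(1)

print(two_day)
print(rolling)
print(lagged)
```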

Data analysis in Pandas is done with two fast data structures, Series and DataFrame, which are built on top of NumPy arrays. A Series is a one-dimensional, homogeneous, labeled array whose size is immutable and which can hold data of any type (integer, string, float, Python objects, etc.).

A Pandas Series can be created with the following constructor:

pandas.Series(data, index, dtype, copy)

where:

data can take various forms, such as an ndarray, a list, a scalar constant (for example an integer or a string), or a Python dictionary of key-value pairs.
index is the collection of axis labels. Its values must be hashable and it must have the same length as data. The labels do not have to be unique.
dtype is the data type of the values.
copy controls whether the input data is copied. The default is False.

NOTE: If index is not given explicitly, Pandas constructs a RangeIndex running from 0 to N-1, where N is the total number of elements in the Series.
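As a small sketch of how these parameters interact, the snippet below builds one Series with the default index and one with explicit labels and an explicit dtype. The values and labels are arbitrary examples.

```python
import pandas as pd

# No index given: Pandas builds a RangeIndex running from 0 to N-1.
default_idx = pd.Series([10, 20, 30])

# Explicit labels and dtype; the index has one label per data element.
labelled = pd.Series([10, 20, 30], index=["x", "y", "z"], dtype="float64")

print(default_idx.index)
print(labelled)
```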

Program to create empty Series.

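An empty Series is the most basic Series that can be created.

#import the pandas library
import pandas as pd
#dtype is passed explicitly because the default dtype of an empty Series differs between Pandas versions
s = pd.Series(dtype='float64')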
print(s)
OUTPUT:
Series([], dtype: float64)

Program to create a Series from an ndarray (N-dimensional array)

If no index is passed, the default index is range(n), where n is the length of the array.

#import the pandas library
import pandas as pd
#import NumPy, which Pandas is built on top of
import numpy as np
data = np.array(['a','b','c','d','e'])
#no index is passed, so the labels default to 0..4
sndx = pd.Series(data)
print(sndx)

OUTPUT:

0    a
1    b
2    c
3    d
4    e
dtype: object

If an index is passed for ndarray data, it must have the same length as the array.

#import the pandas library
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d','e'])
#the index has five labels, one per array element
sndx = pd.Series(data,index=[0,1,2,3,4])
print(sndx)

OUTPUT:

0    a
1    b
2    c
3    d
4    e
dtype: object

Program to create a Series from a dictionary

A dictionary is a Python data structure that stores data as key-value pairs, where the key is used to look up the value (a piece of data).

If no index is passed, the dictionary keys are used to construct the index. Current Pandas versions keep the keys in the order in which they appear in the dictionary.

import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
sdict = pd.Series(data)
print(sdict)

OUTPUT:

a    0.0
b    1.0
c    2.0
dtype: float64

If an index is passed, the values in data that correspond to the labels in the index are pulled out in index order. Labels with no matching key get NaN (Not a Number).

#import the pandas library
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
sdict = pd.Series(data,index=['c','b','a','e'])
print(sdict)

OUTPUT:

c    2.0
b    1.0
a    0.0
e    NaN
dtype: float64

Program to create a Series from a scalar

If data is a scalar value, an index should be provided. The value is repeated to match the length of the index.

#import the pandas library
import pandas as pd
import numpy as np
sclr = pd.Series(26, index=[100, 101, 102, 103])
print(sclr)

OUTPUT:

100    26
101    26
102    26
103    26
dtype: int64

Pandas Operations with Series

Create a Series.

# import pandas library
import pandas as pd
# Create Series
series= pd.Series([1,5,2,7,3,8], index = ['a','b','c','d','e','f'])
print(series)

OUTPUT:

a    1
b    5
c    2
d    7
e    3
f    8
dtype: int64

Example 1: To retrieve the third element from the Series by position.

print(series.iloc[2])

OUTPUT:

2

Example 2: To retrieve the first three elements from the Series by position.

print(series[:3])

OUTPUT:

a    1
b    5
c    2
dtype: int64

Example 3: To retrieve the last two elements from the Series by position.

print(series[-2:])

OUTPUT:

e    3
f    8
dtype: int64

Example 4: To retrieve multiple elements from the Series, use a list of index labels.

print(series[['b','e','f']])

OUTPUT:

b    5
e    3
f    8
dtype: int64
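Because this Series has string labels, access by position and access by label can be kept apart by writing them explicitly with .iloc and .loc. The sketch below repeats the lookups from the examples in that style. Recent Pandas releases discourage integer position lookups through plain square brackets on a labeled Series, which is one reason to prefer the explicit form.

```python
import pandas as pd

series = pd.Series([1, 5, 2, 7, 3, 8], index=["a", "b", "c", "d", "e", "f"])

third = series.iloc[2]             # by position: third element
by_label = series.loc["c"]         # by label: same element
first_three = series.iloc[:3]      # positional slice, end excluded
label_slice = series.loc["b":"d"]  # label slice, both endpoints included

print(third, by_label)
print(first_three)
print(label_slice)
```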

DataFrame

A DataFrame is a size-mutable tabular data structure with rows and columns. It behaves like a dictionary of Series instances, where each column is a Series object and each row is made up of the elements at the same position in those Series.

Ac.No.	Name	Amount
15330000110	Niti Ahuja	5,45,000
22451200001	Om Kashyap	33000
54322344002	Nikhil Marathee	1,27,800
32894352323	Ranya Khan	3,95,300
17344000658	Sanjana Gupta	77000

A Pandas DataFrame can be created with the following constructor:

pandas.DataFrame(data, index, columns, dtype, copy)

where:

data can take various forms, such as a map, lists, constants, a 2D NumPy ndarray, one or more Series, one or more dictionaries, or another DataFrame.
index holds the row labels.
columns holds the column labels.
dtype is the data type of each column.
copy controls whether the input data is copied. The default is False.
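To make the parameters concrete, here is a minimal sketch that passes data, index, columns, and dtype together. The row labels, column names, and values are arbitrary examples.

```python
import pandas as pd

data = [[1, 2.5], [2, 3.5], [3, 4.5]]

# index labels the rows, columns labels the columns,
# and dtype forces every column to the same type.
df = pd.DataFrame(
    data,
    index=["r1", "r2", "r3"],
    columns=["count", "score"],
    dtype="float64",
)

print(df)
print(df.dtypes)
```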

Program to create empty DataFrame

An empty DataFrame is the most basic DataFrame that can be created.

import pandas as pd
df = pd.DataFrame()
print(df)

OUTPUT:

Empty DataFrame
Columns: []
Index: []

Program to create DataFrame from List

Example 1:

import pandas as pd
data = [11,12,13,14,15]
df = pd.DataFrame(data)
print(df)

OUTPUT:

    0
0  11
1  12
2  13
3  14
4  15

Example 2:

import pandas as pd
data = [[1,'Alok'],[2,'Riya'],[3,'John']]
#dtype is left for Pandas to infer, because the Name column holds strings that cannot be cast to int
df = pd.DataFrame(data,columns=['Roll No.','Name'])
print(df)

OUTPUT:

   Roll No.  Name
0         1  Alok
1         2  Riya
2         3  John

Program to create DataFrame from Dictionaries

# import pandas library
import pandas as pd
dict1 ={0:'Arun',1:'Aliya',2:'Geet'}# Dictionary 1
dict2 ={0:19,1:23,2:24}  # Dictionary 2
dict3 ={0:'B+',1:'O-',2:'A+'} # Dictionary 3
dict4 ={0:77, 1:58,2:54}  # Dictionary 4
Data = {'Name':dict1, 'Age':dict2, 'Blood.Grp':dict3, 'Weight':dict4}  # Data of all dictionaries 
df = pd.DataFrame(Data)  # DataFrame
print(df)

OUTPUT:

    Name  Age  Blood.Grp  Weight
0   Arun   19         B+      77
1  Aliya   23         O-      58
2   Geet   24         A+      54

Program to create DataFrame from Series

import pandas as pd
s1 = pd.Series([ 123, 211,135])  # series 1
s2 = pd.Series([ 'Jiya Sheikh','Suvreen Arora','Pari Khan']) # series 2
Data ={'Ac.No.':s1, 'Name':s2} # Data of both the defined Series
dfseries = pd.DataFrame(Data)  # Create DataFrame
print(dfseries)

OUTPUT:

   Ac.No.           Name
0     123    Jiya Sheikh
1     211  Suvreen Arora
2     135      Pari Khan

Pandas Operations with DataFrame

Create DataFrame

import pandas as pd
Std_dict={'Roll.No.':[4113,5432,3462,9532,3214],"St.Name":['Ashu','David','Sonu','Moin','Heena']}
df= pd.DataFrame(Std_dict)
print(df)

OUTPUT:

   Roll.No.  St.Name
0      4113     Ashu
1      5432    David
2      3462     Sonu
3      9532     Moin
4      3214    Heena

Slicing of rows

Use the following command to get the first two rows:

print(df.head(2))

OUTPUT:

   Roll.No.  St.Name
0      4113     Ashu
1      5432    David

Use the following command to get the last two rows:

print(df.tail(2))

OUTPUT:

   Roll.No.  St.Name
3      9532     Moin
4      3214    Heena
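Besides head and tail, rows can be sliced by position. The sketch below rebuilds the same student DataFrame and takes a slice from the middle. The variable names middle and same_as_head are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Roll.No.": [4113, 5432, 3462, 9532, 3214],
    "St.Name": ["Ashu", "David", "Sonu", "Moin", "Heena"],
})

middle = df.iloc[1:4]    # rows at positions 1, 2 and 3
same_as_head = df[:2]    # a plain slice also selects rows by position

print(middle)
print(same_as_head)
```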

Column Selection

import pandas as pd
Std_dict={'Roll.No.':pd.Series([4113,5432],index=[0,1]),
         'St.Name':pd.Series(['Ashu','David','Sonu'],index=[0,1,2])}
df= pd.DataFrame(Std_dict)
print(df['St.Name'])

OUTPUT:

0     Ashu
1    David
2     Sonu
Name: St.Name, dtype: object
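In this example the two Series have different lengths, so it is worth looking at what the full DataFrame contains. As a sketch: selecting one column returns a Series, selecting a list of columns returns a DataFrame, and the shorter Roll.No. column is padded with NaN for the missing label.

```python
import pandas as pd

std_dict = {
    "Roll.No.": pd.Series([4113, 5432], index=[0, 1]),
    "St.Name": pd.Series(["Ashu", "David", "Sonu"], index=[0, 1, 2]),
}
df = pd.DataFrame(std_dict)

one_col = df["St.Name"]                  # single label: a Series
two_cols = df[["Roll.No.", "St.Name"]]   # list of labels: a DataFrame
missing = df["Roll.No."].isna()          # True where no roll number was given

print(two_cols)
print(missing)
```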

Column Addition

import pandas as pd
Std_dict={'Roll.No.':pd.Series([4113,5432,3462],index=[0,1,2]),
         'St.Name':pd.Series(['Ashu','David','Sonu'],index=[0,1,2])}
df=pd.DataFrame(Std_dict)      
#Adding a new column by passing as Series
df['Percentile']=pd.Series([86,74,95],index=[0,1,2])
print(df)

OUTPUT:

   Roll.No.  St.Name  Percentile
0      4113     Ashu          86
1      5432    David          74
2      3462     Sonu          95
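A new column does not have to come from a complete Series. The sketch below shows three variations: a Series that is aligned on the index, a plain list assigned by position, and a column computed from an existing one. The Section and Above80 columns are hypothetical names added only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Roll.No.": [4113, 5432, 3462],
    "St.Name": ["Ashu", "David", "Sonu"],
})

# A Series is aligned on the index: label 2 is missing here, so that row gets NaN.
df["Percentile"] = pd.Series([86, 74], index=[0, 1])

# A plain list is assigned by position and must match the number of rows.
df["Section"] = ["A", "B", "A"]

# A column can also be computed from existing columns.
df["Above80"] = df["Percentile"] > 80

print(df)
```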

Column Deletion

A column can be deleted with the del statement:

import pandas as pd
Std_dict={'Roll.No.':pd.Series([4113,5432,3462],index=[0,1,2]),
         'St.Name':pd.Series(['Ashu','David','Sonu'],index=[0,1,2]),
         'Percentile': pd.Series([86,74,95],index=[0,1,2])}
df=pd.DataFrame(Std_dict)
#Deleting the third column with the del statement
del df['Percentile']
print(df)

OUTPUT:

   Roll.No.  St.Name
0      4113     Ashu
1      5432    David
2      3462     Sonu

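Or with the pop method, which removes the column from the DataFrame and returns it. Use it in place of del on the DataFrame created above, since the column can only be removed once.

#Deleting the third column with the pop method
df.pop('Percentile')
print(df)

OUTPUT:

   Roll.No.  St.Name
0      4113     Ashu
1      5432    David
2      3462     Sonu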
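The sketch below shows what pop hands back, next to drop, which is a third option that leaves the original DataFrame untouched and returns a new one. The variable names without and removed are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Roll.No.": [4113, 5432, 3462],
    "St.Name": ["Ashu", "David", "Sonu"],
    "Percentile": [86, 74, 95],
})

# drop returns a new DataFrame; df itself still has all three columns.
without = df.drop(columns=["Percentile"])

# pop modifies df in place and returns the removed column as a Series.
removed = df.pop("Percentile")

print(without)
print(removed)
print(df.columns.tolist())
```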

Reference: https://www.guru99.com/python-pandas-tutorial.html