Python Pandas Tutorial

Introduction to Python Pandas

According to Wikipedia, the name Pandas is derived from "panel data", an econometrics term for multidimensional data sets that include observations over multiple time periods for the same individuals. Pandas also stands for Python Data Analysis Library. Pandas is an open-source, BSD-licensed Python library, started by Wes McKinney in 2008, that gives developers high-performance tools for data analysis, cleaning, and manipulation through powerful, expressive, and flexible data structures such as DataFrame and Series. Pandas is built on top of NumPy: it depends on and interoperates with NumPy for fast numeric array computations. NumPy is a Python library for computing with matrices and with single and multi-dimensional arrays, along with an extensive collection of high-level numerical tools for array operations. Pandas lets developers carry out their entire data analysis workflow in Python without having to switch to a more domain-specific language like R. Pandas is well suited for various kinds of data, such as:
  • Ordered and unordered time series data
  • Arbitrary matrix data
  • Tabular data with heterogeneously-typed columns
  • Unlabeled data
  • Observational or statistical data sets
Before Pandas, Python was used mainly for data munging and preparation and contributed little to data analysis itself. Pandas closed this gap by supporting the five typical steps of data processing and analysis, regardless of where the data comes from: load, prepare, manipulate, model, and analyze. Python with Pandas is used in a wide range of academic and commercial domains, including finance, economics, statistics, and analytics.
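As a rough illustration of the load, prepare, manipulate, and analyze steps named above, here is a minimal sketch. The CSV text, the column names (city, year, sales), and the variable names are made up for this example, and the "model" step is left out because it usually relies on other libraries.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a real data file.
csv_text = """city,year,sales
Delhi,2020,100
Delhi,2021,
Mumbai,2020,80
Mumbai,2021,120
"""

# Load: parse the CSV text into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Prepare: drop the row whose sales figure is missing.
clean = df.dropna(subset=["sales"])

# Manipulate: add a derived column.
clean = clean.assign(sales_thousands=clean["sales"] / 1000)

# Analyze: average sales per city.
mean_sales = clean.groupby("city")["sales"].mean()
print(mean_sales)
```

The same pattern applies to data read from files or databases; only the loading call changes.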

Pandas Installation

To install Pandas, open a command line or terminal and type
pip install pandas
Alternatively, install the Anaconda Python distribution (https://www.anaconda.com/), which is a convenient way to manage packages, and then type
conda install pandas
Once the installation is complete, open your IDE (Jupyter, PyCharm, etc.) and import Pandas by typing
import pandas as pd
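One quick way to confirm that the installation worked is to import both libraries and print their versions. This is only a sanity check; the version numbers you see depend on your environment.

```python
import numpy as np
import pandas as pd

# If either import fails, the package is not installed in the active environment.
print("pandas version:", pd.__version__)
print("numpy version:", np.__version__)
```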

Features of Pandas

  1. Pandas provides a simple and elegant API.
  2. Pandas performs well when merging and joining high-volume datasets.
  3. Pandas is easy to learn, use, and maintain, so you can focus more on research and less on programming.
  4. Pandas bridges the gap between rapid iterations of ad-hoc analysis and production-quality code.
  5. Pandas provides the DataFrame object for fast and efficient data manipulation with integrated indexing.
  6. Pandas offers time-series functionality that includes:
  • date range generation and frequency conversion
  • moving window statistics and moving window linear regressions
  • date shifting and lagging
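The time-series features listed above can be sketched briefly as follows. The dates and values are invented for illustration, and moving window regression is not shown because it is not covered by the simple calls used here.

```python
import pandas as pd

# Date range generation: six consecutive calendar days.
days = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.Series([10, 12, 11, 15, 14, 18], index=days)

# Frequency conversion: downsample to 2-day bins, taking the mean of each bin.
two_day = ts.resample("2D").mean()

# Moving window statistics: 3-day rolling mean (the first two values are NaN).
rolling = ts.rolling(window=3).mean()

# Shifting / lagging: move every value one step later in time.
lagged = ts.shift(1)

print(two_day)
print(rolling)
print(lagged)
```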
Data analysis in Pandas is built around two fast data structures, Series and DataFrame, which are themselves built on top of NumPy arrays. A Series is a one-dimensional, homogeneous, labeled array of fixed size that can store any data type (integers, strings, floats, Python objects, etc.). For example, a Series might hold the values:
10 32 11 54 27 83 91 33 67 77 22
  • A Pandas Series can be created with the following constructor:
pandas.Series(data, index, dtype, copy)
where data can take various forms, such as an ndarray, a list, a scalar constant (for example an integer or a string), or a Python dictionary of key-value pairs. index is the collection of axis labels; its values must be hashable, and for array-like data it must have the same length as data (duplicate labels are allowed). dtype is the data type; if omitted, it is inferred from data. copy controls whether the input data is copied; the default is False. NOTE: If index is not given explicitly, Pandas constructs a RangeIndex running from 0 to N-1, where N is the total number of elements in the Series.
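The constructor parameters described above can be tried out with a short sketch. The variable names and labels here are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([1, 2, 3])

# No index given: Pandas builds a RangeIndex 0..N-1.
default_idx = pd.Series(values)

# Explicit labels and an explicit dtype.
labelled = pd.Series(values, index=["x", "y", "z"], dtype="float64")

# An index whose length differs from the data is rejected.
try:
    pd.Series(values, index=["x", "y"])
    length_error = None
except ValueError as exc:
    length_error = str(exc)

print(default_idx.index)
print(labelled)
print("Error for mismatched lengths:", length_error)
```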
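  1. Program to create an empty Series.
  • An empty Series is the most basic Series: it holds no values.
#import the pandas library
import pandas as pd
#pass dtype explicitly, because the default dtype of an empty Series depends on the pandas version
s = pd.Series(dtype='float64')
print(s)
OUTPUT:
Series([], dtype: float64)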
  2. Program to create a Series from an ndarray (N-dimensional array)
  • If no index is passed, the default index is range(n), where n is the length of the array.
#import the pandas library
import pandas as pd
#import NumPy, which Pandas is built on top of
import numpy as np
data = np.array(['a','b','c','d','e'])
#no index is passed, so Pandas assigns the default labels 0 to 4
sndx = pd.Series(data)
print(sndx)
OUTPUT:
0 a 
1 b
2 c
3 d
4 e
dtype: object
  • If an index is passed for ndarray data, it must have the same length as the array. Here the labels are passed explicitly and happen to match the default ones.
#import the pandas library
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d','e'])
sndx = pd.Series(data,index=[0,1,2,3,4])
print(sndx)
OUTPUT:
0 a
1 b
2 c
3 d
4 e
dtype: object
  3. Program to create a Series from a dictionary
Dictionaries are Python data structures that store data as key-value pairs, where each key is used to access its value (a piece of data).
  • If no index is passed, the dictionary keys are used to construct the index. Recent pandas versions keep the keys in insertion order; older versions sorted them.
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
sdict = pd.Series(data)
print(sdict)
OUTPUT:
a 0.0
b 1.0
c 2.0
dtype: float64
  • If an index is passed, the values in data that correspond to the labels in the index are pulled out. A label with no matching key gets NaN.
#import the pandas library
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
sdict = pd.Series(data,index=['c','b','a','e'])
print(sdict)
OUTPUT:
c    2.0
b    1.0
a    0.0
e    NaN
dtype: float64
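Since a label without a matching dictionary key ends up as NaN, it can be useful to find or handle such entries. The sketch below reuses the data from the example above; the replacement value -1.0 is an arbitrary choice.

```python
import pandas as pd

data = {'a': 0., 'b': 1., 'c': 2.}
sdict = pd.Series(data, index=['c', 'b', 'a', 'e'])

# Labels that had no matching key in the dictionary.
missing = sdict[sdict.isna()].index.tolist()

# Two common ways to deal with them: fill with a value, or drop them.
filled = sdict.fillna(-1.0)
dropped = sdict.dropna()

print(missing)
print(filled)
print(dropped)
```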

  4. Program to create a Series from a scalar.
  • If data is a scalar value, an index must be provided. The value is repeated to match the length of the index.
#import the pandas library
import pandas as pd
import numpy as np
sclr = pd.Series(26, index=[100, 101, 102, 103])
print(sclr)
OUTPUT:
100   26
101   26
102   26
103   26
dtype: int64
Pandas Operations with Series

Create a Series.
# import pandas library
import pandas as pd
# Create Series
series= pd.Series([1,5,2,7,3,8], index = ['a','b','c','d','e','f'])
print(series)
OUTPUT:
a  1
b  5
c  2
d  7
e  3
f  8
dtype: int64
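Example 1: To retrieve the third element from Series with position.
#.iloc selects by integer position; plain series[2] is deprecated for a label-indexed Series in recent pandas
print(series.iloc[2])
OUTPUT:
2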
Example 2: To retrieve first three elements from Series with position.
print(series[:3])
OUTPUT:
a   1
b   5
c   2
dtype: int64
Example 3: To retrieve last two elements from Series with position.
print(series[-2:])
OUTPUT:
e   3
f   8
dtype: int64
Example 4:  To retrieve multiple elements from Series, use a list of index label values.
print(series[['b','e','f']])
OUTPUT:
b   5
e   3
f  8
dtype: int64
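The examples above mix selection by position and by label. A minimal sketch of the explicit accessors, .iloc for positions and .loc for labels, makes the distinction visible; it reuses the same Series, and the variable names are illustrative.

```python
import pandas as pd

series = pd.Series([1, 5, 2, 7, 3, 8], index=['a', 'b', 'c', 'd', 'e', 'f'])

# Position-based selection.
third = series.iloc[2]          # element at position 2
first_three = series.iloc[:3]   # positions 0, 1, 2 (end is excluded)

# Label-based selection.
by_label = series.loc['c']          # element labelled 'c'
label_slice = series.loc['b':'d']   # labels 'b' through 'd' (end is included)

print(third, by_label)
print(first_three)
print(label_slice)
```

Note that a label slice includes its end label, while a positional slice excludes its end position.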
DataFrame

A DataFrame is a size-mutable tabular data structure with rows and columns. It behaves like a dictionary of Series objects: each column is a Series, and each row consists of the elements at the same position across those Series. For example:
Ac.No.        Name              Amount
15330000110   Niti Ahuja        5,45,000
22451200001   Om Kashyap        33000
54322344002   Nikhil Marathee   1,27,800
32894352323   Ranya Khan        3,95,300
17344000658   Sanjana Gupta     77000
  • A Pandas DataFrame can be created with the following constructor:
pandas.DataFrame(data, index, columns, dtype, copy)
where data can take various forms, such as a map (dictionary), lists, constants, a 2-D NumPy ndarray, one or more Series, one or more dictionaries, or another DataFrame. index holds the row labels. columns holds the column labels. dtype is a single data type to force on all columns; if omitted, the types are inferred. copy controls whether the input data is copied; in recent pandas versions the default depends on the type of the input.
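To see the index, columns, and dtype parameters working together, here is a small sketch built from a 2-D NumPy array. The row labels, column names, and values are invented for illustration.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2], [3, 4], [5, 6]])

# Row labels, column labels, and a single dtype applied to every column.
df = pd.DataFrame(arr, index=['r1', 'r2', 'r3'], columns=['A', 'B'], dtype='float64')

print(df)
print(df.dtypes)
```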
  1. Program to create an empty DataFrame
  • An empty DataFrame is the most basic DataFrame: it has no rows and no columns.
import pandas as pd
df = pd.DataFrame()
print(df)
OUTPUT:
Empty DataFrame
Columns: []
Index: []
  2. Program to create a DataFrame from a list
Example 1:
import pandas as pd
data = [11,12,13,14,15]
df = pd.DataFrame(data)
print(df)
OUTPUT:
   0
0  11
1  12
2  13
3  14
4  15
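Example 2:
import pandas as pd
data = [[1,'Alok'],[2,'Riya'],[3,'John']]
#dtype=int is omitted: the Name column holds strings, and recent pandas versions raise an error if it cannot be cast
df = pd.DataFrame(data,columns=['Roll No.','Name'])
print(df)
OUTPUT:
   Roll No.  Name
0         1  Alok
1         2  Riya
2         3  John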
  3. Program to create a DataFrame from dictionaries
# import pandas library
import pandas as pd
dict1 = {0:'Arun', 1:'Aliya', 2:'Geet'}  # Dictionary 1
dict2 = {0:19, 1:23, 2:24}  # Dictionary 2
dict3 = {0:'B+', 1:'O-', 2:'A+'}  # Dictionary 3
dict4 = {0:77, 1:58, 2:54}  # Dictionary 4
Data = {'Name':dict1, 'Age':dict2, 'Blood.Grp':dict3, 'Weight':dict4}  # Data of all dictionaries
df = pd.DataFrame(Data)  # DataFrame
print(df)
OUTPUT (recent pandas versions keep the columns in insertion order; older versions sorted them by name):
    Name  Age Blood.Grp  Weight
0   Arun   19        B+      77
1  Aliya   23        O-      58
2   Geet   24        A+      54
  4. Program to create a DataFrame from Series
import pandas as pd
s1 = pd.Series([123, 211, 135])  # series 1
s2 = pd.Series(['Jiya Sheikh','Suvreen Arora','Pari Khan'])  # series 2
Data = {'Ac.No.':s1, 'Name':s2}  # Data of both the defined Series
dfseries = pd.DataFrame(Data)  # Create DataFrame
print(dfseries)
OUTPUT:
   Ac.No.           Name
0     123    Jiya Sheikh
1     211  Suvreen Arora
2     135      Pari Khan
Pandas Operations with DataFrame

Create a DataFrame.
import pandas as pd
Std_dict={'Roll.No.':[4113,5432,3462,9532,3214],"St.Name":['Ashu','David','Sonu','Moin','Heena']}
df= pd.DataFrame(Std_dict)
print(df)
OUTPUT:
   Roll.No. St.Name
0      4113    Ashu
1      5432   David
2      3462    Sonu
3      9532    Moin
4      3214   Heena
  1. Slicing of rows
  • Use head(2) to get the first two rows
print(df.head(2))
OUTPUT:
   Roll.No. St.Name
0      4113    Ashu
1      5432   David
  • Use tail(2) to get the last two rows
print(df.tail(2))
OUTPUT:
   Roll.No. St.Name
3      9532    Moin
4      3214   Heena
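head and tail only reach the two ends of the table. For rows in the middle, or rows that satisfy a condition, a sketch along the following lines may help. It rebuilds the same student DataFrame, and the threshold 5000 is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Roll.No.': [4113, 5432, 3462, 9532, 3214],
                   'St.Name': ['Ashu', 'David', 'Sonu', 'Moin', 'Heena']})

# Rows at positions 1, 2, 3 (end position excluded).
middle = df.iloc[1:4]

# Rows with labels 1 through 3 (end label included); the same rows here,
# because the default index labels match the positions.
by_label = df.loc[1:3]

# Rows that satisfy a condition (boolean mask).
big = df[df['Roll.No.'] > 5000]

print(middle)
print(big)
```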
  2. Column Selection
import pandas as pd
Std_dict={'Roll.No.':pd.Series([4113,5432],index=[0,1]),
         'St.Name':pd.Series(['Ashu','David','Sonu'],index=[0,1,2])}
df= pd.DataFrame(Std_dict)
print(df['St.Name'])
OUTPUT:
0     Ashu
1    David
2     Sonu
Name: St.Name, dtype: object
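In the example above the two Series have different lengths, so the shorter column is padded with NaN where it has no value. The sketch below shows that effect and also selects several columns at once by passing a list of names; the variable names are illustrative.

```python
import pandas as pd

Std_dict = {'Roll.No.': pd.Series([4113, 5432], index=[0, 1]),
            'St.Name': pd.Series(['Ashu', 'David', 'Sonu'], index=[0, 1, 2])}
df = pd.DataFrame(Std_dict)

# Single column: returns a Series. Row 2 has no roll number, so it is NaN
# and the column is stored as float64 instead of int64.
roll = df['Roll.No.']

# List of column names: returns a DataFrame with the columns in that order.
both = df[['St.Name', 'Roll.No.']]

print(roll)
print(both)
```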
  3. Column Addition
import pandas as pd
Std_dict={'Roll.No.':pd.Series([4113,5432,3462],index=[0,1,2]),
         'St.Name':pd.Series(['Ashu','David','Sonu'],index=[0,1,2])}
df=pd.DataFrame(Std_dict)      
#Adding a new column by passing as Series
df['Percentile']=pd.Series([86,74,95],index=[0,1,2])
print(df)
OUTPUT:
    Roll.No.   St.Name    Percentile
0       4113        Ashu                 86
1       5432        David                74
2       3462        Sonu                 95
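A new column can also be computed from existing ones. The following sketch assumes the same student data as above; the column names Above80 and Rank and the threshold 80 are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Roll.No.': [4113, 5432, 3462],
                   'St.Name': ['Ashu', 'David', 'Sonu'],
                   'Percentile': [86, 74, 95]})

# Add a column in place, derived from an existing column.
df['Above80'] = df['Percentile'] > 80

# assign returns a new DataFrame and leaves df itself unchanged.
df2 = df.assign(Rank=df['Percentile'].rank(ascending=False).astype(int))

print(df)
print(df2)
```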
  4. Column Deletion
A column can be deleted either with the del statement
import pandas as pd
Std_dict={'Roll.No.':pd.Series([4113,5432,3462],index=[0,1,2]),
         'St.Name':pd.Series(['Ashu','David','Sonu'],index=[0,1,2]),
         'Percentile': pd.Series([86,74,95],index=[0,1,2])}
df=pd.DataFrame(Std_dict)
#Deleting the third column with the del statement
del df['Percentile']
print(df)
OUTPUT:
   Roll.No. St.Name
0      4113    Ashu
1      5432   David
2      3462    Sonu
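or with the pop method, which removes the column and returns it as a Series
#Deleting the third column with the pop method
#(the column must still exist, so use this instead of the del statement, not after it)
df.pop('Percentile')
print(df)
OUTPUT:
   Roll.No. St.Name
0      4113    Ashu
1      5432   David
2      3462    Sonu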
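Both approaches above change the DataFrame in place. When the original should stay intact, the drop method is another option, since it returns a new DataFrame. The sketch below contrasts drop with pop, using the same data and illustrative variable names.

```python
import pandas as pd

df = pd.DataFrame({'Roll.No.': [4113, 5432, 3462],
                   'St.Name': ['Ashu', 'David', 'Sonu'],
                   'Percentile': [86, 74, 95]})

# drop returns a new DataFrame without the column; df is unchanged.
trimmed = df.drop(columns=['Percentile'])
still_there = 'Percentile' in df.columns

# pop removes the column from df in place and returns it as a Series.
popped = df.pop('Percentile')

print(trimmed)
print(popped)
print(df)
```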
Reference: https://www.guru99.com/python-pandas-tutorial.html
