XML parsing in Python
What is XML?
Extensible Markup Language (XML) is a file format used to store, transmit, and reconstruct arbitrary data. It defines a set of rules for encoding data in a format that is both human-readable and machine-readable.
What is XML Parsing?
XML parsing refers to reading the data in an XML file and providing an interface through which a program can access that data. The software component that performs this operation is called an XML parser.
In this tutorial, the XML file that will be parsed is an RSS feed.
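As a small illustration of what a parser does, the sketch below parses an XML string held in memory and reads two values out of it. The note document and its element names are made up for this example and are not part of the RSS feed used later.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML snippet, used only for illustration.
xml_text = """<note>
    <to>Reader</to>
    <body>Hello, XML</body>
</note>"""

# fromstring() parses the text and returns the root Element.
root = ET.fromstring(xml_text)

print(root.tag)                # name of the root element
print(root.find('to').text)    # text inside <to>
print(root.find('body').text)  # text inside <body>
```

The parser turns the raw text into objects, so the program works with tags and text instead of searching the string by hand.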
What is RSS?
RSS stands for Rich Site Summary (also known as Really Simple Syndication). RSS is used to publish frequently updated information such as blog entries, news headlines, videos, and audio. An RSS feed is plain text in XML format.
- RSS is simple enough to be read by both humans and machines.
- The RSS feed processed in this tutorial is the top news feed of Hindustan Times, as given by the URL in the code below.
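To make the shape of an RSS feed concrete, here is a minimal sketch that parses a tiny RSS 2.0 document from a string. The feed title, story titles, and example.com URLs are invented for illustration; a real feed carries more elements per item.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document; all titles and URLs are made up.
rss_text = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <link>http://example.com/</link>
    <description>Illustrative feed</description>
    <item>
      <title>First story</title>
      <link>http://example.com/first</link>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.com/second</link>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(rss_text)
channel = root.find('channel')

# One <channel> describes the feed; each <item> is one published entry.
print(root.tag, root.get('version'))
print(channel.findtext('title'))
titles = [item.findtext('title') for item in channel.findall('item')]
print(titles)
```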
Python module used for parsing:
The module used in Python for parsing is xml.etree.ElementTree, which implements the ElementTree XML API. It is part of the built-in xml package, so no additional installation is needed for it.
Implementation of the required modules to parse an XML file:
# A code to illustrate parsing of an XML
# file with the help of required modules
import requests
import csv
import xml.etree.ElementTree as et
# A function to load RSS
def load_RSS( ):
    # The URL of the RSS feed
    u = 'http://www.hindustantimes.com/rss/topnews/rssfeed.xml'
    # Creating the HTTP response object from the given URL
    rsp = requests.get( u )
    # Saving the xml file
    with open('top_news_feed.xml', 'wb') as f:
        f.write( rsp.content )
def parse_XML( xml_file ):
    # Creating an element tree object
    t = et.parse( xml_file )
    # Getting the root element
    root = t.getroot( )
    # Creating an empty list for the news items
    news_items = [ ]
    # Iterating the news items
    for itm in root.findall('./channel/item'):
        # Empty news dictionary
        news = { }
        # Iterating the child elements of the item
        for chd in itm:
            # Checking for the namespace object content: media
            if chd.tag == '{http://search.yahoo.com/mrss/}content':
                news[ 'media' ] = chd.attrib[ 'url' ]
            else:
                # Storing the text as str (no manual encoding is needed in Python 3)
                news[ chd.tag ] = chd.text
        # Appending the news dictionary to the news items list
        news_items.append( news )
    # Returning the news items list
    return news_items
def savetoCSV(news_items, filename):
    # Specifying the fields for the csv file
    fields = ['title', 'guid', 'media', 'link', 'pubDate', 'description']
    # Writing to the csv file (the csv module expects newline='')
    with open(filename, 'w', newline='', encoding='utf-8') as csv_file:
        # Creating a csv dictionary writer object
        wrter = csv.DictWriter(csv_file, fieldnames = fields)
        # Writing the headers (field names)
        wrter.writeheader()
        # Writing the data rows
        wrter.writerows(news_items)
def main( ):
    # Loading RSS from the web to update the existing xml file
    load_RSS( )
    # Parsing the XML file
    news_items = parse_XML( 'top_news_feed.xml' )
    # Storing the news items into a CSV file
    savetoCSV( news_items, 'top_news.csv')
if __name__ == "__main__":
    # Calling the main function
    main( )
The above-given code will:
- Load the RSS feed from the URL given in the code and store it in an XML file.
- Parse the XML file and save the data in a list of dictionaries, where each dictionary is a single news item.
- Finally, save all the items to a CSV file.
Let us break the code into smaller fragments and try to understand it more clearly:
- Loading and saving RSS feeds
def load_RSS( ):
    # The URL of the RSS feed
    u = 'http://www.hindustantimes.com/rss/topnews/rssfeed.xml'
    # Creating the HTTP response object from the given URL
    rsp = requests.get( u )
    # Saving the xml file
    with open('top_news_feed.xml', 'wb') as f:
        f.write( rsp.content )
In this part, an HTTP response object is first created by sending an HTTP request to the URL of the RSS feed. The content of the response is the XML file, which is then saved as top_news_feed.xml in the local directory.
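The save step can be tried without network access. The sketch below is an assumption-laden stand-in: a hard-coded byte string plays the role of rsp.content, and the file goes to a temporary directory instead of the local directory. It shows that writing the bytes unchanged in 'wb' mode leaves a file the parser can read back.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Stand-in for rsp.content: a hypothetical feed as raw bytes.
content = (b'<?xml version="1.0" encoding="UTF-8"?>'
           b'<rss version="2.0"><channel><title>Example</title></channel></rss>')

# Write the bytes unchanged, as load_RSS() does with mode 'wb'.
path = os.path.join(tempfile.mkdtemp(), 'feed.xml')
with open(path, 'wb') as f:
    f.write(content)

# The saved file can then be parsed from disk.
root = ET.parse(path).getroot()
print(root.tag, root.find('channel/title').text)
```

Writing in binary mode keeps the bytes exactly as received, so the parser can decode them using the encoding declared in the XML itself.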
- Parsing of the XML file
To parse the XML file, we created a function named parse_XML( ). XML is an inherently hierarchical data format, so it can be naturally represented as a tree.
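The hierarchy can be made visible with a short sketch that prints one indented line per element. The outline( ) helper and the sample document are hypothetical and exist only for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with the same nesting as an RSS feed.
xml_text = ("<rss><channel><title>Example</title>"
            "<item><title>Story</title></item></channel></rss>")

def outline(element, depth=0):
    """Return one indented line per element, parents before children."""
    lines = ['  ' * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

print('\n'.join(outline(ET.fromstring(xml_text))))
```

Each level of indentation in the output corresponds to one level of nesting in the document.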
Here, the xml.etree.ElementTree module (imported as ‘et’) is used. It provides two classes for this purpose:
- ElementTree represents the whole XML document as a tree.
- Element represents a single node in this tree.
Interactions with the whole document are done at the ElementTree level, while interactions with a single XML element and its sub-elements are done at the Element level.
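The difference between the two classes can be seen in a short sketch. It assumes a small in-memory document wrapped in io.StringIO so that parse( ) can read it like a file; the document itself is made up.

```python
import io
import xml.etree.ElementTree as ET

xml_text = "<rss version='2.0'><channel><title>Example</title></channel></rss>"

# parse() accepts a filename or a file object and returns an ElementTree.
tree = ET.parse(io.StringIO(xml_text))
# getroot() returns the top-level Element of that tree.
root = tree.getroot()

print(type(tree).__name__)            # ElementTree
print(type(root).__name__)            # Element
print(root.tag, root.attrib)          # element-level access
print([child.tag for child in root])  # direct children of the root

# Document-level operation: serialize the whole tree.
buffer = io.BytesIO()
tree.write(buffer)
print(buffer.getvalue())
```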
Now let us have a look at the parse_XML( ) function:
t = et.parse( xml_file )
This line parses the passed xml_file and creates an ElementTree object from it.
root = t.getroot( )
In this portion, the getroot( ) function returns the root of the tree as an Element object.
for itm in root.findall( './channel/item' ):
In this portion, ./channel/item is an XPath expression (XPath is a language used to address parts of an XML document). Here, it selects every item element that is a child of a channel element, which in turn is a child of the root element. In other words, it selects the item grandchildren of the root.
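To see how the path expression behaves, the sketch below runs three related expressions against a small made-up feed. Only ./channel/item comes from the tutorial; the other two are shown for comparison.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed used only for illustration.
rss_text = """<rss>
  <channel>
    <title>Example Feed</title>
    <item><title>First</title></item>
    <item><title>Second</title></item>
  </channel>
</rss>"""
root = ET.fromstring(rss_text)

# item children of channel children of the root
items = root.findall('./channel/item')
# direct children of the root only; there are no item elements at that level
direct = root.findall('./item')
# item elements at any depth below the root
anywhere = root.findall('.//item')

print(len(items), len(direct), len(anywhere))
print([itm.findtext('title') for itm in items])
```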
# Iterating the news items
for itm in root.findall('./channel/item'):
    # Creating an empty news dictionary
    news = { }
    # Iterating the child elements of the item
    for chd in itm:
        # Checking for the namespace object content: media
        if chd.tag == '{http://search.yahoo.com/mrss/}content':
            news[ 'media' ] = chd.attrib[ 'url' ]
        else:
            # Storing the text as str (no manual encoding is needed in Python 3)
            news[ chd.tag ] = chd.text
    # Appending the news dictionary to the news items list
    news_items.append( news )
In this part of the code, we iterate through the item elements, where each item element contains one news story. To store all the data available about a news item, an empty dictionary named news is created for it.
Now to iterate through the child elements of an element, we execute the following line:
for chd in itm:
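A typical item element contains plain children such as title, guid, link, pubDate, and description, along with a namespaced media content element. The namespaced tags have to be handled separately, because during parsing the namespace prefix is expanded to its full URI, written in curly braces in front of the tag name. To handle this, we write the following lines: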
if chd.tag == '{http://search.yahoo.com/mrss/}content':
    news['media'] = chd.attrib[ 'url' ]
Here, chd.attrib is a dictionary of all the attributes that are related to an element.
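The namespace expansion and the attribute dictionary can be checked on a small made-up item. The namespace URI is the one used in the tutorial code; the media prefix, the example.com URL, and the type attribute are assumptions for this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical item; the URL and attributes are invented for illustration.
item_text = """<item xmlns:media="http://search.yahoo.com/mrss/">
  <title>Example story</title>
  <media:content url="http://example.com/image.jpg" type="image/jpeg"/>
</item>"""
item = ET.fromstring(item_text)

# The prefix is replaced by the full namespace URI in curly braces.
tags = [child.tag for child in item]
print(tags)

# attrib is a plain dictionary of the element's attributes.
media = item.find('{http://search.yahoo.com/mrss/}content')
print(media.attrib)
print(media.attrib['url'])

# Alternative lookup: pass a prefix-to-URI mapping to find().
ns = {'media': 'http://search.yahoo.com/mrss/'}
same = item.find('media:content', ns)
print(same.attrib['url'])
```

Comparing against the fully expanded tag, as the tutorial code does, works without any extra mapping; the namespaces argument is simply a more readable way to write the same lookup.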
For the rest of the children, the following line is written:
news[ chd.tag ] = chd.text
Here, chd.tag holds the name of the child element, and chd.text holds the text inside that child element. In Python 3 this text is already a str, so it can be stored as it is. In this way, the sample item element is converted into a dictionary. The following output can be referred to for a clearer understanding:
{ 'guid': 'http://www.hindustantimes.com/autos/maruti-ignis-launch.... ,
'description': 'Ignis has tough competition already, from Hyun.... ,
'media': 'http://www.hindustantimes.com/rf/image_size_630x354/HT/... ,
'link': 'http://www.hindustantimes.com/autos/maruti-ignis-launch.... ,
'pubDate': 'Thu, 12 Jan 2017 12:33:04 GMT ' }
After this, we will simply add this dictionary element to the list, i.e., news_items. And then the list is returned.
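The whole conversion can be tried on in-memory data. The sketch below is a variation on parse_XML( ), not the tutorial's function: the helper name items_to_dicts( ), the sample feed, and the use of an empty string for elements without text are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

MEDIA_CONTENT = '{http://search.yahoo.com/mrss/}content'

# Hypothetical feed with one item, including an empty <description/>.
rss_text = """<rss xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>Example story</title>
      <link>http://example.com/story</link>
      <description/>
      <media:content url="http://example.com/image.jpg"/>
    </item>
  </channel>
</rss>"""

def items_to_dicts(root):
    """Convert every ./channel/item element into a dictionary."""
    news_items = []
    for itm in root.findall('./channel/item'):
        news = {}
        for chd in itm:
            if chd.tag == MEDIA_CONTENT:
                news['media'] = chd.attrib['url']
            else:
                # chd.text is None for an empty element such as <description/>
                news[chd.tag] = chd.text or ''
        news_items.append(news)
    return news_items

news_items = items_to_dicts(ET.fromstring(rss_text))
print(news_items)
```

Replacing None with an empty string is a design choice for this sketch; it keeps every value a string, which is convenient when the dictionaries are later written to a CSV file.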
- Saving elements to the CSV file
Now, to use or modify the elements of the list of news items, we save it to a CSV file using the savetoCSV( ) function. After this step, the data is in tabular form, with one row per news item and one column per field.
All the news stories are now stored in a table, whereas before the conversion the data was in the form of a hierarchical XML file. Converting the data to a CSV file makes it easier to extend the database. The list of dictionaries is also a JSON-like structure that can be used directly in other applications.
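The CSV step can be sketched without touching the disk by writing to an in-memory buffer. The rows, the reduced field list, and the example.com URLs below are made up; the second row deliberately lacks a media value and contains a comma in its title.

```python
import csv
import io

fields = ['title', 'link', 'media']
# Hypothetical news items; the second one has no 'media' key.
news_items = [
    {'title': 'First story', 'link': 'http://example.com/1',
     'media': 'http://example.com/1.jpg'},
    {'title': 'Second, with comma', 'link': 'http://example.com/2'},
]

# Write the rows to an in-memory buffer instead of a file.
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=fields)
writer.writeheader()
writer.writerows(news_items)
print(buffer.getvalue())

# Read the text back to confirm that the table round-trips.
rows = list(csv.DictReader(io.StringIO(buffer.getvalue(), newline='')))
print(rows[1]['title'], repr(rows[1]['media']))
```

DictWriter fills missing keys with an empty string by default and quotes values that contain commas, so both rows survive the round trip.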
XML parsing is a practical way to extract data from websites, since not every website provides a public API, while many of them do provide RSS feeds.