Python RegEx

Python RegEx (Python regular expressions) is a way of describing and finding patterns in a body of text.

A regular expression (RegEx) is a sequence of characters that defines a search pattern.

For example:

^p....n$

The above expression matches any string that starts with p, ends with n, and has exactly four characters in between.

Expression	String	Matched?
^p....n$	python	Matched
	pigeon	Matched
	pepper	No Match
	peer	No Match

In the above example, the words "python" and "pigeon" start with p, end with n, and have four characters in between, so they match.

import re

pattern = r'^p....n$'
test_string = 'python'

# re.match() returns a match object on success and None otherwise
result = re.match(pattern, test_string)

if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")

Output:

Search successful.

In the above code, we tested a regular expression pattern against a test string using the match() function of Python's re module.

Regular Expressions

Let's see how regular expressions are written and which characters are used to write them.

In the previous example, '^p....n$', the characters ^, . and $ are metacharacters.

Metacharacters

Metacharacters are characters that the regular expression engine interprets in a special way instead of matching them literally. The metacharacters are:

[] . ^ $ * + ? {} () \ |

Square Bracket []

[aip] is a regular expression. The square brackets specify a set of characters to match against the text. If any one of the characters inside the brackets occurs in the test text, the match is successful.

Expression	String	Matched?
[aip]	and	matched
	april	matched
	iris	matched
	pupil	matched
	mate	matched (contains a)
	money	Not matched

As we can see, the match succeeds if any of the characters specified in the brackets occurs anywhere in the test string.

A range of characters can be defined using a dash (-):

[a-d] is equal to [abcd]

[1-5] is equal to [12345]

[1-39] is equal to [1239] and not the numbers 1 to 39

By using ^ as the first character inside [], we can specify the complement of a set:

[^abc] means any character except a, b and c.

[^0-9] means any non-digit character.
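
The bracket rules above can be checked with re.findall(), which returns every non-overlapping match. This is only a minimal sketch; the sample strings and variable names are chosen for illustration.

```python
import re

# Any one of the characters a, i or p
set_hits = re.findall(r"[aip]", "april")
print(set_hits)        # ['a', 'p', 'i']

# A range: [a-d] is the same as [abcd]
range_hits = re.findall(r"[a-d]", "decade")
print(range_hits)      # ['d', 'c', 'a', 'd']

# [1-39] means 1, 2, 3 or 9, not the numbers 1 to 39
digit_hits = re.findall(r"[1-39]", "1459")
print(digit_hits)      # ['1', '9']

# A leading ^ negates the set: any character that is not a digit
non_digit_hits = re.findall(r"[^0-9]", "a1b2")
print(non_digit_hits)  # ['a', 'b']
```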

Period (.)

A period (.) in a regular expression matches any single character except the newline character (\n).

Expression	Test String	Matched?
...	abc	1 Match
	abcd	1 Match (the first 3 characters)
	abcdef	2 Matches (6 characters)
	ab	No Match
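
A short sketch of the period in practice, assuming illustrative test strings. The re.DOTALL flag shown at the end is a standard flag of the re module that lets the period match a newline as well.

```python
import re

# Each match consumes three characters; matches do not overlap
three_chars = re.findall(r"...", "abcdef")
print(three_chars)   # ['abc', 'def']

# Too short for even one match
too_short = re.findall(r"...", "ab")
print(too_short)     # []

# The period does not match a newline unless re.DOTALL is set
no_newline = re.findall(r"a.b", "a\nb")
with_dotall = re.findall(r"a.b", "a\nb", flags=re.DOTALL)
print(no_newline)    # []
print(with_dotall)   # ['a\nb']
```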

Caret (^)

The caret (^) symbol checks whether the test string starts with the characters that follow it.

Expression	Test String	Matched?
^p	pathway	1 Match
	python	1 Match
	apron	No Match
^pa	pathway	1 Match
	python	No Match
	apron	No Match

Dollar ($)

Just as the caret (^) anchors a pattern to the start of the string, the dollar ($) anchors it to the end. The dollar is written after the characters it applies to.

Expression	Test String	Matched?
n$	pathway	No Match
	python	1 Match
	apron	1 Match
on$	pathway	No Match
	python	1 Match
	apron	1 Match
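
The two anchors can be tried on the same word list. This is a small sketch with hypothetical variable names; note that ^ is written before the characters it anchors and $ after them.

```python
import re

words = ["pathway", "python", "apron"]

# ^p : the string must start with p
starts_with_p = [w for w in words if re.search(r"^p", w)]
print(starts_with_p)   # ['pathway', 'python']

# on$ : the string must end with on
ends_with_on = [w for w in words if re.search(r"on$", w)]
print(ends_with_on)    # ['python', 'apron']
```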

Star (*)

A star (*) placed to the right of a character matches zero or more occurrences of that character.

Expression	Test String	Matched?
pa*n	pn	1 Match
	pan	1 Match
	paan	1 Match
	paaaaaaan	1 Match
	pain	No Match (a is not followed by n)
	par	No Match (there is no n)
	paamn	No Match (a is not followed by n)

Plus (+)

A plus (+) placed to the right of a character matches one or more occurrences of that character.

Expression	Test String	Matched?
pa+n	pn	No Match
	pan	1 Match
	paan	1 Match
	paaaaaaan	1 Match
	pain	No Match (a is not followed by n)
	pa	No Match (a is not followed by n)
	paamn	No Match (a is not followed by n)

Question mark (?)

A question mark (?) placed to the right of a character matches zero or one occurrence of that character.

Expression	Test String	Matched?
pa?n	pn	1 Match
	pan	1 Match
	paan	No Match
	paaaaaaan	No Match
	pain	No Match (a is not followed by n)
	pa	No Match (a is not followed by n)
	paamn	No Match (a is not followed by n)
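
The three quantifiers *, + and ? can be compared side by side. The helper function matching() below is a hypothetical name introduced only for this sketch.

```python
import re

tests = ["pn", "pan", "paan", "pain"]

def matching(pattern):
    """Return the test strings in which the pattern is found."""
    return [s for s in tests if re.search(pattern, s)]

star = matching(r"pa*n")      # zero or more a
plus = matching(r"pa+n")      # one or more a
optional = matching(r"pa?n")  # zero or one a

print(star)      # ['pn', 'pan', 'paan']
print(plus)      # ['pan', 'paan']
print(optional)  # ['pn', 'pan']
```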

Braces ({})

Braces ({}) placed immediately to the right of a character specify how many times that character may occur. The expression a{2,3} means a minimum of 2 and a maximum of 3 consecutive occurrences of a.

Expression	Test String	Matched?
a{2,3}	ab	No Match
	aab	1 Match (aa)
	aab aab	2 Matches (aa and aa)
	aaab	1 Match (aaa)
	aab ab	1 Match (aa)
	abc paar	1 Match (aa)
	paaaar	1 Match (aaa; the single a left over is too short)

Let's try one more example: [0-9]{1,4} matches a run of at least 1 and at most 4 digits. The quantifier is written without a space after the comma, because Python treats {1, 4} as literal text.

Expression	Test String	Matched?
[0-9]{1,4}	Ab12	1 Match (12)
	Ab1234512	2 Matches (1234 and 512)
	ab	No Match
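
A brief sketch of the brace quantifier, using made-up test strings. Quantifiers are greedy, so each match takes as many characters as the upper bound allows.

```python
import re

# Between two and three consecutive a characters
a_runs = re.findall(r"a{2,3}", "aab aaab ab")
print(a_runs)       # ['aa', 'aaa']

# One to four consecutive digits
digit_runs = re.findall(r"[0-9]{1,4}", "Ab1234512")
print(digit_runs)   # ['1234', '512']
```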

Alternation (|)

The alternation symbol (|) separates alternatives, as in a|c|d. The expression matches any one of a, c or d.

Expression	Test String	Matched?
a|c|d	pn	No Match
	pan	1 Match
	pbn	No Match
	cn	1 Match
	pnkj	No Match
	cpajd	3 Matches (c, a and d)
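
Alternation can be tried out as follows. This is a minimal sketch; the second pattern, cat|dog, is an illustrative addition showing that the alternatives may be longer than one character.

```python
import re

pattern = r"a|c|d"

three_hits = re.findall(pattern, "cpajd")
print(three_hits)   # ['c', 'a', 'd']

no_hits = re.findall(pattern, "pnkj")
print(no_hits)      # []

# Alternation also works on whole words
pets = re.findall(r"cat|dog", "a cat chased a dog")
print(pets)         # ['cat', 'dog']
```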

Group – ()

A group is used to combine sub-patterns into a single unit. For example, (a|b|c)yz matches any string that contains a, b or c followed by yz.

Expression	Test String	Match?
(a|b|c)yz	ayz	1 Match
	yzb	No Match (b is not followed by yz)
	xyz	No Match

In the above example, () has grouped alternative options of a, b and c with fixed characters yz. If any test string contains a, b or c immediately followed by yz, it will show a match.
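
The grouping behaviour described above can be sketched as follows. The last lines show, as an illustration, what happens when the parentheses are left out: the alternatives then become a, b and cyz.

```python
import re

pattern = r"(a|b|c)yz"

hit = re.search(pattern, "ayz")
print(hit.group())    # ayz   (the whole match)
print(hit.group(1))   # a     (the text captured by the group)

miss = re.search(pattern, "xyz")
print(miss)           # None

# Without the parentheses the alternatives are: a, b, or cyz
loose = re.findall(r"a|b|cyz", "ab cyz")
print(loose)          # ['a', 'b', 'cyz']
```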

Backslash (\)

The backslash is used to escape metacharacters. Placed immediately before a character, it removes that character's special meaning. For example, $ in a regular expression normally marks the end of the string, but in \$n the backslash makes the regular expression engine read it as a literal $ followed by n.

Python string literals use the backslash in a similar way. If we write \\n in an ordinary string, the first backslash cancels the special meaning of the second, so printing it shows \n and not a new line. Because the two layers of escaping can collide, regular expression patterns are usually written as raw strings, such as r'\$n'.

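Let's see an example.

import re
print(re.findall(r"\$n", "price: $n"))  # \$ matches a literal $
print("\\n")  # the first backslash escapes the second
print('those are dog\'s biscuit')  # \' is a quote inside the string

Output:

['$n']
\n
those are dog's biscuit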

Special Sequences

Special sequences are shorthand for commonly used patterns, which saves time when writing expressions. A few of them are described below.

\A – It matches if the given characters are found at the start of the string.

Expression	Test String	Matched?
\Athe	the moon in the night sky	Match
	If the moon in the night sky	No Match
	the bright star in the night sky	Match

\b – It matches if the given characters are found at the start or the end of a word.

Expression	Test String	Matched?
\bant	These antiques are amazing	Match
	antenna is not signaling right	Match
	there were a lot of red ants	Match
	Its adamant to say	No Match

Expression	Test String	Matched?
ant\b	These antiques are amazing	No Match
	antenna is not signaling right	No Match
	there were a lot of red ants	No Match
	Its adamant to say	Match

\B – It is the opposite of \b. It matches if the given characters are found but not at the start or the end of a word.

Expression	Test String	Matched?
\Bant	These antiques are amazing	No Match
	antenna is not signaling right	No Match
	there was a lot of comment	No Match (the string does not contain ant)
	Its adamant to say	Match

Expression	Test String	Matched?
ant\B	These antiques are amazing	Match
	antenna is not signaling right	Match
	there were a lots of comment	No Match
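
The word-boundary sequences can be checked with a short sketch. The sentence is invented for illustration. The patterns are written as raw strings because in an ordinary Python string "\b" means a backspace character.

```python
import re

sentence = "antiques, ants and an adamant tenant"

# ant at the start of a word
word_start = re.findall(r"\bant\w*", sentence)
print(word_start)     # ['antiques', 'ants']

# ant at the end of a word
word_end = re.findall(r"\w*ant\b", sentence)
print(word_end)       # ['adamant', 'tenant']

# ant that is not at the start of a word
inside = re.findall(r"\Bant", sentence)
print(inside)         # ['ant', 'ant']
```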

\d – It matches any decimal digit (0-9) in the test string.

Expression	Test String	Matched?
\d	123and4	Match (1, 2, 3 and 4)
	If the moon in the night sky	No Match
	bright star in the night sky01	Match (0 and 1)

\D – It matches any character that is not a decimal digit, like [^0-9].

Expression	Test String	Matched?
\D	123and4	Match (a, n and d)
	If the moon in the night sky	Match
	bright star in the night sky01	Match

\s – It matches any whitespace character in the test string. For ASCII text it is equivalent to [ \t\n\r\f\v].

Expression	Test String	Matched?
\s	123and4	No Match
	If the moon in the night sky	Match
	bright star in the night sky01	Match

\S – It matches any character that is not whitespace. For ASCII text it is equivalent to [^ \t\n\r\f\v].

Expression	Test String	Matched?
\S	123and4	Match
	If the moon in the night sky	Match
	bright star in the night sky01	Match
	(a string of spaces only)	No Match
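
The digit and whitespace sequences can be sketched on one sample string. The string and variable names are assumptions made for this example.

```python
import re

sample = "room 12\tfloor 3"

digits = re.findall(r"\d", sample)        # single digits
digit_runs = re.findall(r"\d+", sample)   # runs of digits
spaces = re.findall(r"\s", sample)        # whitespace characters
chunks = re.findall(r"\S+", sample)       # runs of non-whitespace
non_digits = re.findall(r"\D+", sample)   # runs of non-digits

print(digits)       # ['1', '2', '3']
print(digit_runs)   # ['12', '3']
print(spaces)       # [' ', '\t', ' ']
print(chunks)       # ['room', '12', 'floor', '3']
print(non_digits)   # ['room ', '\tfloor ']
```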

\w – It matches any word character (a letter, a digit or the underscore) in the test string. For ASCII text it is equivalent to [a-zA-Z0-9_].

Expression	Test String	Matched?
\w	123_and4	Match
	If the moon in the night sky_	Match
	***********	No Match

\W – It matches any character that is not a word character. For ASCII text it is equivalent to [^a-zA-Z0-9_]. Since the underscore counts as a word character, \W does not match it.

Expression	Test String	Matched?
\W	123_and4	No Match
	If the moon in the night sky_	Match (the spaces)
	***********	Match
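
A small sketch of \w and \W, using an invented sample string. It also shows that the underscore is treated as a word character.

```python
import re

sample = "sky_01 at 9pm!"

words = re.findall(r"\w+", sample)     # runs of word characters
non_word = re.findall(r"\W", sample)   # everything else

print(words)      # ['sky_01', 'at', '9pm']
print(non_word)   # [' ', ' ', '!']

# A string made only of word characters has no \W match
all_word = re.search(r"\W", "123_and4")
print(all_word)   # None
```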

\Z – It matches if the given characters are found at the end of the string.

Expression	Test String	Matched?
sky\Z	the moon in the night	No Match
	If the moon in the night sky	Match
	bright star in the night	No Match
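
The difference between \A, \Z and the ^ metacharacter becomes visible with a multi-line string. The sketch below assumes an illustrative two-line text and uses the standard re.MULTILINE flag, under which ^ also matches after each newline while \A still matches only at the very start.

```python
import re

text = "the moon\nthe night sky"

# \A matches only at the very start of the string
start_only = re.findall(r"\Athe", text, flags=re.MULTILINE)
# ^ also matches at the start of each line in MULTILINE mode
line_starts = re.findall(r"^the", text, flags=re.MULTILINE)
print(len(start_only), len(line_starts))   # 1 2

# \Z matches only at the very end of the string
ends_with_sky = bool(re.search(r"sky\Z", text))
ends_with_moon = bool(re.search(r"moon\Z", text))
print(ends_with_sky, ends_with_moon)       # True False
```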

Python RegEx Module – re

The re module is Python's built-in module for working with regular expressions. It contains functions and constants for that purpose. Let's look at a few of the most commonly used functions.

findall()

The re.findall() function is one of the most commonly used functions of this module. It returns a list of all non-overlapping matches found in the given string, or an empty list if no match is found. An example is shown below.

Code:

# program to find all numbers in a string
import re

string = 'java 2 has 3 many45 features'
pattern = r'\d+'

result = re.findall(pattern, string)
print(result)

Output:

['2', '3', '45']

split()

The re.split() function splits the string at every place where the pattern matches and returns the pieces as a list. If the pattern does not match, the list contains only the original string.

Code:

# program to split a string at every number
import re

string = 'java 2 has 3 many45 features'
pattern = r'\d+'

result = re.split(pattern, string)
print(result)

Output:

['java ', ' has ', ' many', ' features']

Note: We can pass a maxsplit argument to re.split(). It limits the number of splits performed, and the rest of the string is returned as the final list element. The default value is 0, which means that all possible splits are made.

Code:

# program to split a string at the first number only
import re

string = 'java 2 has 3 many45 features'
pattern = r'\d+'

# maxsplit=1 splits the string only once
result = re.split(pattern, string, maxsplit=1)
print(result)

Output:

['java ', ' has 3 many45 features']

sub()

The re.sub() function looks for matches of the pattern and replaces each one with the given replacement string. It returns the original string unchanged if no match is found.

Code:

# program to remove all the whitespace from the text
import re

string = 'java 2 has 3 many45 features'

# matches runs of whitespace characters
pattern = r'\s+'

# empty string
replace = ''

new_string = re.sub(pattern, replace, string)
print(new_string)

Output:

java2has3many45features

We can also pass a count argument. It limits the number of matches that are replaced. The default value is 0, which replaces every match.

Code:

# program to remove only the first run of whitespace from the text
import re

string = 'java 2 has 3 many45 features'

# matches runs of whitespace characters
pattern = r'\s+'

# empty string
replace = ''

# count=1 replaces only the first match
new_string = re.sub(pattern, replace, string, count=1)
print(new_string)

Output:

java2 has 3 many45 features

subn()

The re.subn() function works like re.sub(), but it returns a tuple of two items: the new string and the number of replacements made. If no match is found, the tuple contains the original string and 0.

Code:

# program to remove all the whitespace from the text
import re

string = 'java 2 has 3 many 45 features'

# matches runs of whitespace characters
pattern = r'\s+'

# empty string
replace = ''

new_string = re.subn(pattern, replace, string)
print(new_string)

Output:

('java2has3many45features', 6)

search()

The re.search() function scans the string for the first location where the pattern matches. It returns a match object if a match is found and None otherwise.

Code:

import re

string = 'java 2 has 3 many 45 features'

# check if java is at the beginning of the string
pattern = r'\Ajava'
match = re.search(pattern, string)

if match:
    print("match successful")
else:
    print("match not successful")

Output:

match successful
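
This article uses both re.match() and re.search(), so a brief comparison may help. The sketch below, with an invented test string, shows that re.match() only tries the pattern at the start of the string, while re.search() scans the whole string.

```python
import re

text = "learn java 2"

# match() only looks at the beginning of the string
at_start = re.match(r"java", text)
print(at_start)                       # None

# search() finds the first match anywhere in the string
found = re.search(r"java", text)
print(found.group(), found.start())   # java 6
```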

Match Object in re module

The match object in the re module has several methods and attributes. A few commonly used methods of match objects are:

match.group()

The group() method returns the matched portion of the text string.

Code:

import re

string = 'java 256 88 has 3 many 45 features'

# three digits, a space, then two digits
pattern = r'(\d{3}) (\d{2})'
match = re.search(pattern, string)

if match:
    print(match.group())
else:
    print("match not found")

Output:

256 88

match.start(), match.end() and match.span()

match.start() returns the index at which the match begins. Similarly, match.end() returns the index just past the last matched character.

match.span() returns a tuple of the start and end indices.

>>> match.start()
5
>>> match.end()
11
>>> match.span()
(5, 11)
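
The match object methods can be seen together in one short sketch. It reuses the string and pattern from the group() example; group(1) and groups() are standard match object methods that return the text captured by the parentheses.

```python
import re

text = "java 256 88 has 3 many 45 features"
match = re.search(r"(\d{3}) (\d{2})", text)

if match:
    print(match.group())     # 256 88   (the whole match)
    print(match.group(1))    # 256      (the first group)
    print(match.groups())    # ('256', '88')
    print(match.span())      # (5, 11)
    # Slicing with start() and end() gives back the matched text
    matched_text = text[match.start():match.end()]
    print(matched_text)      # 256 88
```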

There are several other methods and functions in the re module. Refer to the official documentation to learn more about it.
