Python RegEx
Python RegEx (Python regular expressions) is a way of describing text patterns so that they can be found in a string of characters.
A regular expression (RegEx) is a sequence of characters that defines a search pattern.
For example:
^p....n$
The above expression is a RegEx that matches strings that start with p, end with n, and have exactly four characters in between.
Expression | String | Matched?
^p....n$ | python | Matched
^p....n$ | pigeon | Matched
^p....n$ | pepper | No Match
^p....n$ | peer | No Match
In the above example, the words "python" and "pigeon" start with p, end with n, and have four characters in between, so they match.
import re

pattern = '^p....n$'
test_string = 'python'
result = re.match(pattern, test_string)

if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")
Output:
Search successful.
In the above code, we tested a regular expression pattern against a test string. We used the match() function of Python's re module, which looks for a match starting at the beginning of the string.
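As a quick check of the four words in the table, the same pattern can be run against each of them. This is a minimal sketch; the names `words` and `outcome` are illustrative and not part of the original example.

```python
import re

pattern = r'^p....n$'
words = ['python', 'pigeon', 'pepper', 'peer']

# Map each word to True/False depending on whether the pattern matches.
outcome = {word: bool(re.match(pattern, word)) for word in words}

for word, matched in outcome.items():
    print(word, '->', 'Matched' if matched else 'No Match')
```

Only the two six-letter words that start with p and end with n are reported as matched.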
Regular Expressions
Let’s see how regular expressions are written and which characters are used to write them.
In the previous example, '^p....n$', both ^ and $ are metacharacters used to write regular expressions.
Metacharacters
Metacharacters are characters that the RegEx engine interprets in a special way instead of matching them literally. The metacharacters are:
[] . ^ $ * + ? {} () \ |
- Square Bracket []
[aip] is a regular expression. The square brackets specify a set of characters to match against the text. If any one of the characters inside the brackets occurs in the test string, the match is successful.
Expression | String | Matched?
[aip] | and | Matched
[aip] | april | Matched
[aip] | iris | Matched
[aip] | pupil | Matched
[aip] | mate | Matched (contains a)
As we can see, if any of the characters specified in the brackets occurs in the test string, the match succeeds. A string with none of these characters, such as "hello", would not match.
A range of characters can be defined using dash(-):
[a-d] is equal to [abcd]
[1-5] is equal to [12345]
[1-39] is equal to [1239] and not [12345...39]
By using ^ at the start of [], we can specify a complementary (negated) set:
[^abc] matches any character except a, b and c.
[^0-9] means any non-digit character.
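The range and negation rules above can be tried directly with re.findall(). The following is a small sketch; the sample strings and variable names are made up for illustration.

```python
import re

# [a-d] matches any single character from a to d.
in_range = re.findall(r'[a-d]', 'badge')

# [1-39] is the range 1-3 plus the literal 9, not the numbers 1 to 39.
digits = re.findall(r'[1-39]', '1 5 9 39')

# [^0-9] matches any single character that is not a digit.
non_digits = re.findall(r'[^0-9]', 'a1-b2')

print(in_range)    # ['b', 'a', 'd']
print(digits)      # ['1', '9', '3', '9']
print(non_digits)  # ['a', '-', 'b']
```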
- Period (.)
A period (.) in a regular expression matches any single character except the newline character (\n).
Expression | Test String | Matched?
... | abc | 1 Match
... | ancd | 1 Match (first 3 characters)
... | abcdef | 2 Matches (6 characters)
... | ab | No Match
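The rows above can be reproduced with re.findall(), which returns every non-overlapping match. This is a sketch with illustrative variable names; the last line shows that the period does not match a newline.

```python
import re

six_chars = re.findall(r'...', 'abcdef')   # two groups of three characters
four_chars = re.findall(r'...', 'ancd')    # only the first three characters fit
two_chars = re.findall(r'...', 'ab')       # too short for even one match

# The period does not match "\n", so only the first "abc" is found here.
with_newline = re.findall(r'a.c', 'abc a\nc')

print(six_chars)     # ['abc', 'def']
print(four_chars)    # ['anc']
print(two_chars)     # []
print(with_newline)  # ['abc']
```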
- Caret (^)
The caret (^) symbol in a regular expression checks whether the test string starts with the given characters.
Expression | Test String | Matched?
^p | pathway | 1 Match
^p | python | 1 Match
^p | apron | No Match
^pa | pathway | 1 Match
^pa | python | No Match
^pa | apron | No Match
- Dollar ($)
Just as the caret (^) checks the start of the string, the dollar ($) checks the end. It is written after the characters that the string must end with.
Expression | Test String | Matched?
n$ | pathway | No Match
n$ | python | 1 Match
n$ | apron | 1 Match
on$ | pathway | No Match
on$ | python | 1 Match
on$ | apron | 1 Match
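Both anchors can be checked against the three test strings from the tables. A minimal sketch, using re.search() and list names chosen only for this illustration:

```python
import re

words = ['pathway', 'python', 'apron']

# ^ anchors the pattern at the start of the string.
starts_with_p = [w for w in words if re.search(r'^p', w)]
starts_with_pa = [w for w in words if re.search(r'^pa', w)]

# $ anchors the pattern at the end of the string.
ends_with_on = [w for w in words if re.search(r'on$', w)]

print(starts_with_p)   # ['pathway', 'python']
print(starts_with_pa)  # ['pathway']
print(ends_with_on)    # ['python', 'apron']
```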
- Star (*)
A star (*) placed to the right of a character matches zero or more occurrences of that character in the test string.
Expression | Test String | Matched?
pa*n | pn | 1 Match
pa*n | pan | 1 Match
pa*n | paan | 1 Match
pa*n | paaaaaaan | 1 Match
pa*n | pain | No Match (a is not followed by n)
pa*n | par | No Match (a is not followed by n)
pa*n | paamn | No Match (a is not followed by n)
- Plus (+)
A plus (+) placed to the right of a character matches one or more occurrences of that character in the test string.
Expression | Test String | Matched?
pa+n | pn | No Match (there is no a)
pa+n | pan | 1 Match
pa+n | paan | 1 Match
pa+n | paaaaaaan | 1 Match
pa+n | pain | No Match (a is not followed by n)
pa+n | pa | No Match (a is not followed by n)
pa+n | paamn | No Match (a is not followed by n)
- Question mark (?)
A question mark (?) placed to the right of a character matches zero or one occurrence of that character in the test string.
Expression | Test String | Matched?
pa?n | pn | 1 Match
pa?n | pan | 1 Match
pa?n | paan | No Match (more than one a)
pa?n | paaaaaaan | No Match (more than one a)
pa?n | pain | No Match (a is not followed by n)
pa?n | pa | No Match (a is not followed by n)
pa?n | paamn | No Match (a is not followed by n)
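The difference between the three quantifiers is easiest to see side by side. The sketch below runs each pattern over the same test strings; the dictionary names `patterns` and `results` are assumptions made for this illustration.

```python
import re

patterns = {'star': r'pa*n', 'plus': r'pa+n', 'question': r'pa?n'}
tests = ['pn', 'pan', 'paan', 'pain']

# For each quantifier, collect the test strings in which the pattern is found.
results = {
    name: [s for s in tests if re.search(pattern, s)]
    for name, pattern in patterns.items()
}

print(results['star'])      # ['pn', 'pan', 'paan']
print(results['plus'])      # ['pan', 'paan']
print(results['question'])  # ['pn', 'pan']
```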
- Braces ({})
Braces ({}) placed immediately to the right of a character specify how many times that character must occur. The expression a{2,3} means a minimum of 2 and a maximum of 3 consecutive occurrences of a in the test string.
Expression | Test String | Matched?
a{2,3} | ab | No Match
a{2,3} | aab | 1 Match (aa)
a{2,3} | aab aab | 2 Matches (aa and aa)
a{2,3} | aaab | 1 Match (aaa)
a{2,3} | aab ab | 1 Match (aa)
a{2,3} | abc paar | 1 Match (aa in paar)
a{2,3} | paaaar | 1 Match (aaa; the single remaining a is too short to match)
Let’s try one more example: [0-9]{1,4} matches a minimum of 1 and a maximum of 4 consecutive digits from 0 to 9.
Expression | Test String | Matched?
[0-9]{1,4} | Ab12 | 1 Match (12)
[0-9]{1,4} | Ab1234512 | 2 Matches (1234 and 512)
[0-9]{1,4} | ab | No Match
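The counts in the two tables above can be confirmed with re.findall(). This is a short sketch; the combined test string in the first call is made up for illustration.

```python
import re

# Two or three consecutive a's; matching is greedy, so three are taken when available.
a_runs = re.findall(r'a{2,3}', 'aab aaab paaaar')

# One to four consecutive digits.
digit_runs = re.findall(r'[0-9]{1,4}', 'Ab1234512')

print(a_runs)      # ['aa', 'aaa', 'aaa']
print(digit_runs)  # ['1234', '512']
```

In "paaaar" the first three a's are consumed by one match, which leaves a single a that cannot satisfy the minimum of two.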
- Alternation (|)
The alternation (|) symbol works like "or". For example, a|c|d matches any one of a, c or d.
Expression | Test String | Matched?
a|c|d | pn | No Match
a|c|d | pan | 1 Match
a|c|d | pbn | No Match
a|c|d | cn | 1 Match
a|c|d | pnkj | No Match
a|c|d | cpajd | 3 Matches (c, a and d)
- Group – ()
A group is used in an expression to group sub-patterns. For example, (a|b|c)yz matches any string that contains a, b or c followed by yz.
Expression | Test String | Match?
(a|b|c)yz | ayz | 1 Match
(a|b|c)yz | yzb | No Match (b is not followed by yz)
(a|b|c)yz | xyz | No Match
In the above example, () has grouped alternative options of a, b and c with fixed characters yz. If any test string contains a, b or c immediately followed by yz, it will show a match.
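A short sketch of alternation and grouping in code follows. The test string "xx byz" is an illustrative choice; group(1) returns the text captured by the first pair of parentheses.

```python
import re

# Alternation on its own: every a, c or d is a separate match.
alternatives = re.findall(r'a|c|d', 'cpajd')

# Grouped alternation followed by the fixed characters yz.
found = re.search(r'(a|b|c)yz', 'xx byz')
not_found = re.search(r'(a|b|c)yz', 'yzb')

print(alternatives)    # ['c', 'a', 'd']
print(found.group())   # byz
print(found.group(1))  # b
print(not_found)       # None
```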
- Backslash (\)
The backslash (\) is used to escape metacharacters. A backslash removes the special meaning of the character that comes immediately after it. For example, $ in a regular expression normally marks the end of the string. In \$n, however, the backslash comes before $, so the RegEx engine reads the pattern as a literal $ followed by n.
Python string literals use the backslash in a similar way. If we write \\n in an ordinary string, the first backslash removes the special meaning of the second one, so printing it shows \n and not a new line. Because the two escaping schemes interact, patterns are usually written as raw strings such as r'\$n', which pass backslashes to the re module unchanged.
Let’s see an example.
import re
print(re.findall(r'\$n', 'the price is $n'))
print("\\n")
print('those are dog\'s biscuit')
Output:
['$n']
\n
those are dog's biscuit
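To make the effect of escaping concrete, the sketch below compares an unescaped and an escaped period, and uses re.escape() from the standard library to build an escaped pattern from literal text. The sample strings are assumptions chosen for illustration.

```python
import re

text = '5x00 and 5.00'

# Unescaped, the period matches any character, so "5x00" is found too.
loose = re.findall(r'5.00', text)

# Escaped, the period only matches a literal period.
strict = re.findall(r'5\.00', text)

# re.escape() adds the backslashes needed to match a piece of text literally.
escaped = re.escape('$5.00')
literal_hit = re.findall(escaped, 'total: $5.00')

print(loose)        # ['5x00', '5.00']
print(strict)       # ['5.00']
print(escaped)      # \$5\.00
print(literal_hit)  # ['$5.00']
```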
Special Sequences
Special sequences are shorthand for commonly used patterns, which saves time when writing expressions.
A few of these special sequences are described below.
\A – It matches if the given characters are found at the start of the string.
Expression | Test String | Matched?
\Athe | the moon in the night sky | Match
\Athe | If the moon in the night sky | No Match
\Athe | the bright star in the night sky | Match
\b – It matches if the given characters are at the start or the end of a word.
Expression | Test String | Matched?
\bant | These antiques are amazing | Match
\bant | antenna is not signaling right | Match
\bant | there were a lot of red ants | Match
\bant | Its adamant to say | No Match
Expression | Test String | Matched?
ant\b | These antiques are amazing | No Match
ant\b | antenna is not signaling right | No Match
ant\b | there were a lot of red ants | No Match
ant\b | Its adamant to say | Match
\B – It is the opposite of \b. It matches if the given characters are not at the start or the end of a word.
Expression | Test String | Matched?
\Bant | These antiques are amazing | No Match
\Bant | antenna is not signaling right | No Match
\Bant | there was a lot of comment | No Match (the string does not contain ant)
\Bant | Its adamant to say | Match
Expression | Test String | Matched?
ant\B | These antiques are amazing | Match
ant\B | antenna is not signaling right | Match
ant\B | there were a lot of comments | No Match
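The word-boundary sequences can be tried on a single sentence. This is a sketch with a made-up test sentence. The patterns are written as raw strings because "\b" in an ordinary Python string means a backspace character.

```python
import re

text = 'antiques, ants and an adamant ant'

# \bant: "ant" at the start of a word, followed by the rest of that word.
starts_with_ant = re.findall(r'\bant\w*', text)

# ant\b: words that end in "ant".
ends_with_ant = re.findall(r'\w*ant\b', text)

# \Bant: "ant" that is NOT at the start of a word (only inside "adamant").
inside_word = re.findall(r'\Bant', text)

print(starts_with_ant)  # ['antiques', 'ants', 'ant']
print(ends_with_ant)    # ['adamant', 'ant']
print(inside_word)      # ['ant']
```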
\d – It matches any decimal digit (0-9) in the test string.
Expression | Test String | Matched?
\d | 123and4 | Match (1, 2, 3 and 4)
\d | If the moon in the night sky | No Match
\d | bright star in the night sky01 | Match (0 and 1)
\D – It matches any character that is not a decimal digit, like [^0-9].
Expression | Test String | Matched?
\D | 123and4 | Match (a, n and d)
\D | If the moon in the night sky | Match
\D | bright star in the night sky01 | Match
\s – It matches any whitespace character in the test string. For ASCII text it is equivalent to [ \t\n\r\f\v].
Expression | Test String | Matched?
\s | 123and4 | No Match
\s | If the moon in the night sky | Match
\s | bright star in the night sky01 | Match
\S – It matches any character that is not whitespace. For ASCII text it is equivalent to [^ \t\n\r\f\v].
Expression | Test String | Matched?
\S | 123and4 | Match
\S | If the moon in the night sky | Match (every letter is a non-whitespace character)
\S | bright star in the night sky01 | Match
\w – It matches any word character in the test string: letters, digits and the underscore _. For ASCII text it is equivalent to [a-zA-Z0-9_].
Expression | Test String | Matched?
\w | 123_and4 | Match
\w | If the moon in the night sky_ | Match
\w | *********** | No Match
\W – It matches any character that is not a word character. For ASCII text it is equivalent to [^a-zA-Z0-9_]. Since the underscore _ counts as a word character, \W does not match it.
Expression | Test String | Matched?
\W | 123_and4 | No Match
\W | If the moon in the night sky_ | Match (the spaces)
\W | *********** | Match
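The character-class sequences can be compared on one sample string. The string below is an assumption chosen to contain digits, whitespace, an underscore and punctuation.

```python
import re

text = 'Room 42, floor_3!'

digits = re.findall(r'\d', text)          # single digits
numbers = re.findall(r'\d+', text)        # runs of digits
spaces = re.findall(r'\s', text)          # whitespace characters
words = re.findall(r'\w+', text)          # runs of letters, digits and _
non_word = re.findall(r'\W', text)        # everything that is not a word character
non_space_runs = re.findall(r'\S+', text) # runs of non-whitespace characters

print(digits)          # ['4', '2', '3']
print(numbers)         # ['42', '3']
print(spaces)          # [' ', ' ']
print(words)           # ['Room', '42', 'floor_3']
print(non_word)        # [' ', ',', ' ', '!']
print(non_space_runs)  # ['Room', '42,', 'floor_3!']
```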
\Z – It matches if the given characters are found at the end of the string.
Expression | Test String | Matched?
sky\Z | the moon in the night | No Match
sky\Z | If the moon in the night sky | Match
sky\Z | bright star in the night | No Match
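On a single-line string, \A and \Z behave like ^ and $. The sketch below shows one place where they differ: with the re.MULTILINE flag, ^ matches at the start of every line, while \A still matches only at the start of the whole string. The two-line test string is made up for illustration.

```python
import re

text = 'the moon\nthe sky'

# With MULTILINE, ^ matches at the start of each line.
caret_hits = re.findall(r'^the', text, flags=re.MULTILINE)

# \A ignores MULTILINE and matches only at the very start of the string.
start_hits = re.findall(r'\Athe', text, flags=re.MULTILINE)

# \Z matches only at the very end of the string.
end_hits = re.findall(r'sky\Z', text)

print(caret_hits)  # ['the', 'the']
print(start_hits)  # ['the']
print(end_hits)    # ['sky']
```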
Python RegEx Module – re
The re module is a Python module for working with regular expressions. It contains functions and constants for dealing with regular expressions in Python. Let's look at a few of the most commonly used functions.
- findall()
The re.findall() method is one of the most commonly used methods of this module. It returns a list of all the matches found in the given string, or an empty list if no match is found. An example is shown below.
Code:
# program to find all numbers in a string
import re

string = 'java 2 has 3 many45 features'
pattern = r'\d+'
result = re.findall(pattern, string)
print(result)
Output:
['2', '3', '45']
- split()
The re.split() method looks for matches and splits the string wherever the pattern matches. It returns a list of the pieces. If no match is found, the list contains only the original string.
Code:
# program to split a string at every number
import re

string = 'java 2 has 3 many45 features'
pattern = r'\d+'
result = re.split(pattern, string)
print(result)
Output:
['java ', ' has ', ' many', ' features']
Note: We can pass the maxsplit argument to re.split(). It limits the number of splits that occur. The default value of maxsplit is 0, which means the string is split at every match.
Code:
# program to split a string at the first number only
import re

string = 'java 2 has 3 many45 features'
pattern = r'\d+'
# maxsplit=1 splits the string only once
result = re.split(pattern, string, maxsplit=1)
print(result)
Output:
['java ', ' has 3 many45 features']
- sub()
The re.sub() method looks for matches of the pattern and replaces each matched value with the given replacement. It returns the original string if no match is found.
Code:
# program to remove all the whitespace from the text
import re

string = 'java 2 has 3 many45 features'
# matches runs of whitespace characters
pattern = r'\s+'
# empty string
replace = ''
new_string = re.sub(pattern, replace, string)
print(new_string)
Output:
java2has3many45features
We can pass a count argument to re.sub(). It limits the number of matches that are replaced. The default value of count is 0, which replaces every match.
Code:
# program to remove only the first run of whitespace from the text
import re

string = 'java 2 has 3 many45 features'
# matches runs of whitespace characters
pattern = r'\s+'
# empty string
replace = ''
# count=1 replaces only the first match
new_string = re.sub(pattern, replace, string, count=1)
print(new_string)
Output:
java2 has 3 many45 features
- subn()
The re.subn() method is similar to re.sub(): it looks for matches of the pattern and replaces each matched value with the given replacement. It returns a tuple with two items, the new string and the number of replacements made. If no match is found, it returns the original string with a count of 0.
Code:
# program to remove all the whitespace from the text
import re

string = 'java 2 has 3 many 45 features'
# matches runs of whitespace characters
pattern = r'\s+'
# empty string
replace = ''
new_string = re.subn(pattern, replace, string)
print(new_string)
Output:
('java2has3many45features', 6)
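As a further sketch of the substitution functions, the example below masks the numbers in the same sample sentence instead of removing whitespace. The "#" replacement and the doubling function are assumptions made for illustration. Passing a function as the replacement is a documented feature of re.sub(): the function receives each match object and returns the replacement text.

```python
import re

text = 'java 2 has 3 many45 features'

# subn() returns the new string together with the number of replacements.
masked, replacements = re.subn(r'\d+', '#', text)

# count=1 replaces only the first match.
first_only = re.sub(r'\d+', '#', text, count=1)

# The replacement can also be a function that is called with each match object.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), text)

print(masked, replacements)  # java # has # many# features 3
print(first_only)            # java # has 3 many45 features
print(doubled)               # java 4 has 6 many90 features
```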
- search()
The re.search() method scans the string for the first location where the pattern matches. It returns a match object if a match is found, and None otherwise.
Code:
import re

string = 'java 2 has 3 many 45 features'
# check if java is at the beginning of the string
pattern = r'\Ajava'
match = re.search(pattern, string)

if match:
    print("match successful")
else:
    print("match not successful")
Output:
match successful
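The introductory example used re.match(), while this section uses re.search(). The sketch below contrasts the two, together with re.fullmatch(), which requires the whole string to match. The variable names are illustrative.

```python
import re

text = 'java 2 has 3 many 45 features'

# match() only tries at the beginning of the string, where there is no digit.
at_start = re.match(r'\d+', text)

# search() scans forward and stops at the first match.
anywhere = re.search(r'\d+', text)

# fullmatch() succeeds only if the entire string fits the pattern.
whole = re.fullmatch(r'p....n', 'python')
partial = re.fullmatch(r'p....n', 'pythons')

print(at_start)          # None
print(anywhere.group())  # 2
print(bool(whole))       # True
print(partial)           # None
```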
Match Object in re module
The match object in the re module has several methods and attributes. A few commonly used methods of match objects are:
match.group()
The group() method returns the matched portion of the text string.
Code:
import re

string = 'java 256 88 has 3 many 45 features'
# three digits, a space, then two digits
pattern = r'(\d{3}) (\d{2})'
match = re.search(pattern, string)

if match:
    print(match.group())
else:
    print("match not found")
Output:
256 88
match.start(), match.end() and match.span()
match.start() returns the index at which the matched substring starts. Similarly, match.end() returns the index just past the end of the matched substring.
On the other hand, match.span() returns a tuple of the start and end indexes.
>>> match.start()
5
>>> match.end()
11
>>> match.span()
(5, 11)
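Because the pattern in this section contains two pairs of parentheses, the match object also gives access to each captured group separately. The following is a minimal sketch using the same string and pattern; slicing the string with start() and end() recovers the matched text.

```python
import re

string = 'java 256 88 has 3 many 45 features'
match = re.search(r'(\d{3}) (\d{2})', string)

if match:
    whole = match.group(0)     # the entire match, same as match.group()
    first = match.group(1)     # text captured by the first ( )
    second = match.group(2)    # text captured by the second ( )
    both = match.groups()      # all captured groups as a tuple
    second_span = match.span(2)  # start and end index of the second group
    sliced = string[match.start():match.end()]

    print(whole)         # 256 88
    print(first, second) # 256 88
    print(both)          # ('256', '88')
    print(second_span)   # (9, 11)
    print(sliced)        # 256 88
```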
There are several other methods and functions in the re module. Please refer to the official documentation to learn more about it.