Java Regular Expressions
Java Regex (regular expressions) is an API for defining patterns that are used to search or manipulate strings. A regular expression can be as simple as a single character or a combination of characters that forms a more complex pattern. To work with regular expressions, import the package java.util.regex. The package provides the following classes (along with the MatchResult interface).
Matcher Class: Implements the MatchResult interface. It performs match operations on a character sequence by interpreting a Pattern.
Pattern Class: A compiled representation of the regular expression that has to be searched.
PatternSyntaxException Class: An unchecked exception that is thrown when the regular expression pattern contains a syntax error.
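As a small illustration of the third class, the sketch below compiles a deliberately malformed pattern and catches the resulting PatternSyntaxException. The class name SyntaxCheckDemo and the helper isValidRegex are made up for this example and are not part of java.util.regex.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class SyntaxCheckDemo {

    // Hypothetical helper: returns true if rgx compiles, false otherwise.
    static boolean isValidRegex(String rgx) {
        try {
            Pattern.compile(rgx);
            return true;
        } catch (PatternSyntaxException ex) {
            // getPattern() returns the faulty regex, getDescription() explains the error
            System.out.println("Invalid pattern \"" + ex.getPattern() + "\": " + ex.getDescription());
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidRegex("[a-z]+")); // a well-formed character class
        System.out.println(isValidRegex("[a-z"));   // the character class is never closed
    }
}
```

Because the exception is unchecked, the compiler does not force a try/catch block; catching it is mainly useful when the pattern comes from user input.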
Matcher Class
The following table lists commonly used methods of the Matcher class:
Method Name | Description |
public int start() | Returns the start index of the previous match. |
public int groupCount() | Returns the number of capturing groups in the matcher's pattern. |
public String group() | Returns the input subsequence matched by the previous match. |
public boolean find() | Searches for the next subsequence of the input that matches the given pattern. If a match is found, returns true, else returns false. |
public boolean find(int st) | Resets the matcher and searches for the next subsequence that matches the given pattern, starting from the index st. If a match is found, returns true, else returns false. |
public boolean matches() | Tries to match the entire input sequence against the regular expression pattern. Returns true only if the whole sequence matches, else returns false. |
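The methods in the table can be tried out with a short sketch like the one below. The class MatcherMethodsDemo and its helper describeMatches are hypothetical names chosen for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherMethodsDemo {

    // Hypothetical helper: describes every match of rgx in input as "text@startIndex".
    static String describeMatches(String rgx, String input) {
        Matcher m = Pattern.compile(rgx).matcher(input);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(m.group()).append('@').append(m.start());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "cat bat rat";
        System.out.println(describeMatches("[a-z]at", input)); // cat@0, bat@4, rat@8

        Matcher m = Pattern.compile("([a-z])(at)").matcher(input);
        System.out.println(m.groupCount()); // 2: capturing groups in the pattern, not matches
        System.out.println(m.find(4));      // true: resets the matcher, then searches from index 4
        System.out.println(m.group());      // bat
        System.out.println(m.matches());    // false: the pattern does not cover the whole input
    }
}
```

Note that start() and group() may only be called after a successful find() or matches(); otherwise they throw an IllegalStateException.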
Pattern Class
The following table lists commonly used methods of the Pattern class:
Method Name | Description |
public Matcher matcher(CharSequence cs) | Creates a Matcher that will match the sequence cs against this pattern. |
public static boolean matches(String re, CharSequence cs) | A static method that compiles the regular expression re and checks whether the entire sequence cs matches it. |
public String pattern() | Returns the regular expression from which this pattern was compiled. |
public String[] split(CharSequence cs, int limit) | Splits the input cs around matches of this pattern and returns the pieces as an array of strings. The second parameter limit controls how many times the pattern is applied, and therefore the length of the resulting array. |
public static Pattern compile(String rgx) | The method compiles the string rgx to generate a pattern. The pattern is then returned. |
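A brief sketch of the Pattern methods follows. The class PatternMethodsDemo and the helper splitOnComma are illustrative names only; the comma separator is an assumption made for the example.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class PatternMethodsDemo {

    // Hypothetical helper: splits input around commas using the given limit.
    static String[] splitOnComma(String input, int limit) {
        return Pattern.compile(",").split(input, limit);
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("\\d+");
        System.out.println(p.pattern());                      // \d+
        System.out.println(Pattern.matches("\\d+", "12345")); // true: the whole input is digits
        System.out.println(Pattern.matches("\\d+", "123ab")); // false: only a prefix matches

        // A limit of 2 applies the pattern at most once, so at most 2 pieces are returned.
        System.out.println(Arrays.toString(splitOnComma("a,b,c,d", 2))); // [a, b,c,d]
        // A limit of 0 applies the pattern as many times as possible.
        System.out.println(Arrays.toString(splitOnComma("a,b,c,d", 0))); // [a, b, c, d]
    }
}
```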
Let’s understand the concept of regular expression through a Java program.
Java Program
Consider the following program that shows how to use a regular expression.
FileName: RegexExample.java
// importing the class Matcher
import java.util.regex.Matcher;
// importing the class Pattern
import java.util.regex.Pattern;

public class RegexExample {
    // main method
    public static void main(String[] argvs) {
        // the pattern is Tutorial & example
        Pattern pt = Pattern.compile("Tutorial & example", Pattern.CASE_INSENSITIVE);

        // the input sequence in which the pattern is searched.
        Matcher matcherObj = pt.matcher("Visit tutorial & example for learning about Java!");

        // invoking the find() method
        boolean isMatchFound = matcherObj.find();

        // checking whether match is found or not
        if (isMatchFound) {
            System.out.println("Match found for the given pattern.");
        } else {
            System.out.println("Match is not found for the given pattern.");
        }
    }
}
Output:
Match found for the given pattern.
Explanation: The second parameter (Pattern.CASE_INSENSITIVE) of the compile() method is a flag that tells the matcher to ignore case while searching for the pattern. By default, matching is case-sensitive. The flag is optional: the overloaded compile() method that takes only the regular expression can be used instead. The matcher() method returns an object of the Matcher class. On this returned object, the find() method is invoked to check whether the regular expression pattern occurs in the input sequence or not.
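To see the effect of the flag in isolation, the following minimal sketch runs the same search with and without Pattern.CASE_INSENSITIVE. The class CaseFlagDemo and the helper contains are hypothetical names introduced for this comparison.

```java
import java.util.regex.Pattern;

class CaseFlagDemo {

    // Hypothetical helper: reports whether rgx occurs anywhere in input,
    // optionally ignoring case.
    static boolean contains(String rgx, String input, boolean ignoreCase) {
        Pattern p = ignoreCase
                ? Pattern.compile(rgx, Pattern.CASE_INSENSITIVE)
                : Pattern.compile(rgx);
        return p.matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "Visit tutorial & example for learning about Java!";
        System.out.println(contains("Tutorial & example", input, true));  // true
        System.out.println(contains("Tutorial & example", input, false)); // false: 'T' vs 't'
    }
}
```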
Metacharacters
Characters that have a special meaning in a regular expression are known as metacharacters. The following table shows the commonly used metacharacters.
Metacharacter | Definition |
| (vertical bar) | Matches any one of the patterns separated by |. For example, fish|dog|cat |
\d | Matches a digit |
\s | Matches a whitespace character |
\uxxxx | Matches the Unicode character with the hexadecimal code xxxx |
^ | Matches at the start of the string, e.g., ^World |
$ | Matches at the end of the string, e.g., World$ |
. | Matches any single character (by default, except line terminators) |
\b | Matches at a word boundary, i.e., at the start or the end of a word, e.g., \bWorld or World\b |
\w | Matches any word character (a letter, a digit, or an underscore) |
Let’s use the metacharacters in a Java program.
FileName: RegexMetacharactersExample.java
// import statements
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMetacharactersExample {
    // main method
    public static void main(String[] argvs) {
        // regular expression using \b: "own" followed by a word boundary
        String rx = "own\\b";
        String input = "crown own town brown owner grown flown blown";
        Pattern ptrn = Pattern.compile(rx);
        Matcher matcher = ptrn.matcher(input);
        int cnt = 0;
        while (matcher.find()) {
            cnt = cnt + 1;
        }
        System.out.println("Number of matches for the word \"own\" is : " + cnt);

        // regular expression using | (no spaces around |, as a space would be part of the pattern)
        rx = "world|hello";
        input = "world is hello is world is hello is";
        ptrn = Pattern.compile(rx);
        matcher = ptrn.matcher(input);
        cnt = 0; // resetting the value of cnt to 0
        // checks for either world or hello
        while (matcher.find()) {
            cnt = cnt + 1;
        }
        System.out.println("Number of matches for the word \"world\" and \"hello\" is : " + cnt);

        // regular expression using ^
        rx = "^hello";
        input = "hello hello is world is hello is";
        ptrn = Pattern.compile(rx);
        matcher = ptrn.matcher(input);
        cnt = 0; // resetting the value of cnt to 0
        // checks whether the input string starts with the word hello or not
        while (matcher.find()) {
            cnt = cnt + 1;
        }
        System.out.println("Number of matches for the word \"hello\" is : " + cnt);

        // regular expression using $
        rx = "hello$";
        input = "hello is world is hello";
        ptrn = Pattern.compile(rx);
        matcher = ptrn.matcher(input);
        cnt = 0; // resetting the value of cnt to 0
        // checks whether the input string ends with the word hello or not
        while (matcher.find()) {
            cnt = cnt + 1;
        }
        System.out.println("Number of matches for the word \"hello\" is : " + cnt);

        // regular expression using .
        rx = ".";
        input = "hello world";
        ptrn = Pattern.compile(rx);
        matcher = ptrn.matcher(input);
        cnt = 0; // resetting the value of cnt to 0
        // every character of the input (including the space) is a separate match
        while (matcher.find()) {
            cnt = cnt + 1;
        }
        System.out.println("Total number of characters are : " + cnt);

        // regular expression using \d
        rx = "hello\\d";
        input = "hello hello hello9";
        ptrn = Pattern.compile(rx);
        matcher = ptrn.matcher(input);
        cnt = 0; // resetting the value of cnt to 0
        // checks whether the string contains hello[0-9]
        while (matcher.find()) {
            cnt = cnt + 1;
        }
        System.out.println("Number of matches for hello[0-9] : " + cnt);

        // regular expression using \s
        rx = "hello\\s";
        input = "hello hello hello9";
        ptrn = Pattern.compile(rx);
        matcher = ptrn.matcher(input);
        cnt = 0; // resetting the value of cnt to 0
        // checks whether the string contains hello followed by a whitespace
        while (matcher.find()) {
            cnt = cnt + 1;
        }
        System.out.println("Number of matches for hello with whitespace: " + cnt);
    }
}
Output:
Number of matches for the word "own" is : 7
Number of matches for the word "world" and "hello" is : 4
Number of matches for the word "hello" is : 1
Number of matches for the word "hello" is : 1
Total number of characters are : 11
Number of matches for hello[0-9] : 1
Number of matches for hello with whitespace: 2
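The metacharacters \w and \uxxxx from the table are not exercised by the program above. The following sketch covers them and also contrasts one word boundary with two. The class MoreMetacharactersDemo and the helper countMatches are hypothetical names, and the inputs are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MoreMetacharactersDemo {

    // Hypothetical helper: counts how many times rgx matches within input.
    static int countMatches(String rgx, String input) {
        Matcher m = Pattern.compile(rgx).matcher(input);
        int cnt = 0;
        while (m.find()) {
            cnt++;
        }
        return cnt;
    }

    public static void main(String[] args) {
        // word characters are letters, digits and the underscore: a, _, 1, b, 2
        System.out.println(countMatches("\\w", "a_1 b-2"));               // 5

        // the regex-level Unicode escape for 'A' (hexadecimal code 0041)
        System.out.println(countMatches("\\u0041", "ABCA"));              // 2

        // a boundary on one side matches words that end in "own" ...
        System.out.println(countMatches("own\\b", "crown own owner"));    // 2
        // ... while a boundary on both sides matches only the standalone word
        System.out.println(countMatches("\\bown\\b", "crown own owner")); // 1
    }
}
```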
Quantifiers
Quantifiers specify how many times a character, character class, or group must occur in the input to get a match.
Quantifier | Description |
Y* | Matches zero or more occurrences of Y |
Y+ | Matches one or more occurrences of Y |
Y{n,} | Matches at least n occurrences of Y |
Y{n} | Matches exactly n occurrences of Y |
Y{n1,n2} | Matches at least n1 but not more than n2 occurrences of Y |
Y? | Matches zero or one occurrence of Y |
Java Program
The following program uses the quantifiers defined above.
FileName: QuantifiersExample.java
// import statements
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifiersExample {
    // main method
    public static void main(String[] argvs) {
        System.out.println("For + quantifiers \n");
        // regular expression for at least one 't'.
        String rx = "t+";
        Pattern ptrn = Pattern.compile(rx);
        // Creating an object of the Matcher class
        Matcher mtchr = ptrn.matcher("ttttst");
        while (mtchr.find()) {
            System.out.println("Pattern found from " + mtchr.start() + " to " + (mtchr.end() - 1));
        }
        System.out.println("\n");

        // s t or k can appear zero or one time
        rx = "[stk]?";
        System.out.println("For ? quantifier \n");
        ptrn = Pattern.compile(rx);
        mtchr = ptrn.matcher("ssttkk");
        while (mtchr.find()) {
            System.out.println("Pattern found at index " + mtchr.start());
        }

        // s t or k can appear zero or more times
        rx = "[stk]*";
        System.out.println();
        System.out.println("For * quantifier \n");
        ptrn = Pattern.compile(rx);
        mtchr = ptrn.matcher("ssttkk");
        while (mtchr.find()) {
            System.out.println("Pattern found at index " + mtchr.start());
        }

        // k has to appear at least 3 times
        rx = "k{3,}";
        System.out.println();
        System.out.println("For {n, } quantifier \n");
        ptrn = Pattern.compile(rx);
        mtchr = ptrn.matcher("ssttkkkk");
        while (mtchr.find()) {
            System.out.println("Pattern found at index " + mtchr.start());
        }

        // k has to appear at least 3 times but not greater than 6 times
        rx = "k{3,6}";
        System.out.println();
        System.out.println("For {n, m} quantifier \n");
        ptrn = Pattern.compile(rx);
        mtchr = ptrn.matcher("ssttkkkkkkkkkkkkkk");
        while (mtchr.find()) {
            System.out.println("Pattern found from " + mtchr.start() + " to " + (mtchr.end() - 1));
        }

        // k has to appear exactly 3 times
        rx = "k{3}";
        System.out.println();
        System.out.println("For {n} quantifier \n");
        ptrn = Pattern.compile(rx);
        mtchr = ptrn.matcher("ssttkkkkkkkkkkkkkk");
        while (mtchr.find()) {
            System.out.println("Pattern found from " + mtchr.start() + " to " + (mtchr.end() - 1));
        }
    }
}
Output:
For + quantifiers

Pattern found from 0 to 3
Pattern found from 5 to 5


For ? quantifier

Pattern found at index 0
Pattern found at index 1
Pattern found at index 2
Pattern found at index 3
Pattern found at index 4
Pattern found at index 5
Pattern found at index 6

For * quantifier

Pattern found at index 0
Pattern found at index 6

For {n, } quantifier

Pattern found at index 4

For {n, m} quantifier

Pattern found from 4 to 9
Pattern found from 10 to 15

For {n} quantifier

Pattern found from 4 to 6
Pattern found from 7 to 9
Pattern found from 10 to 12
Pattern found from 13 to 15
Explanation: The + quantifier is greedy: it consumes as many matching characters as possible in a single match. Hence, the first match covers every 't' from index 0 to 3. Then comes 's', which is not part of the regular expression, so the match ends there. After that, a single 't' occurs at index 5, which gives the second match shown in the output.
The ? quantifier takes at most one character at a time. Therefore, in the output, we see every index from 0 to 5. Index 6 is shown because the ? quantifier also accepts zero characters. After index 5, the input string is exhausted, but the pattern still matches the empty string at the end of the input, so index 6 is displayed in the output.
For the * quantifier, index 6 is shown for the same reason: the * quantifier also accepts zero characters. However, the * quantifier consumes all consecutive matching characters in a single match. The whole input is therefore matched at index 0, and we only see indices 0 and 6 in the output.
For the {n,} quantifier, a single match consumes n or more consecutive matching characters. The four k's starting at index 4 therefore form one match, and only index 4 is seen in the output.
For the {n,m} quantifier (m should be greater than or equal to n), a single match consumes between n and m consecutive matching characters. If the run of matching characters is longer than m, the first m characters form one match, and the search for the next match continues with the remaining characters. This is evident in the output: two matches of 6 characters each are reported, and the k's left over after index 15 are fewer than 3, so they do not produce another match.
For the {n} quantifier, each match consumes exactly n characters, and the next match starts where the previous one ended. As n = 3 in our case, successive matches in the output start 3 indices apart.
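The zero-character matches described above can be made visible by printing both start() and end() of every match: for an empty match the two values are equal. The sketch below assumes a hypothetical helper named spans in a class named EmptyMatchDemo; end() is reported as an exclusive index here, unlike the program above, which prints end() - 1.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchDemo {

    // Hypothetical helper: lists each match as "start-end", where end is exclusive.
    static String spans(String rgx, String input) {
        Matcher m = Pattern.compile(rgx).matcher(input);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(m.start()).append('-').append(m.end());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // one character per match, then an empty match (6-6) at the end of the input
        System.out.println(spans("[stk]?", "ssttkk")); // 0-1 1-2 2-3 3-4 4-5 5-6 6-6
        // the whole input in one match, then the same empty match at the end
        System.out.println(spans("[stk]*", "ssttkk")); // 0-6 6-6
    }
}
```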
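The way the bounded quantifiers carve a long run into successive matches can be checked by collecting the length of each match. The following is a minimal sketch; the class QuantifierChunksDemo and the helper matchLengths are hypothetical, and the input is assumed to be "sstt" followed by 14 k's.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierChunksDemo {

    // Hypothetical helper: returns the length of every match of rgx in input.
    static List<Integer> matchLengths(String rgx, String input) {
        List<Integer> lengths = new ArrayList<>();
        Matcher m = Pattern.compile(rgx).matcher(input);
        while (m.find()) {
            lengths.add(m.end() - m.start());
        }
        return lengths;
    }

    public static void main(String[] args) {
        // build the assumed input: "sstt" followed by 14 k's
        StringBuilder sb = new StringBuilder("sstt");
        for (int i = 0; i < 14; i++) {
            sb.append('k');
        }
        String input = sb.toString();

        System.out.println(matchLengths("k{3,}", input));  // [14]: one greedy match
        System.out.println(matchLengths("k{3,6}", input)); // [6, 6]: the last 2 k's are too few
        System.out.println(matchLengths("k{3}", input));   // [3, 3, 3, 3]: 2 k's are left over
    }
}
```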