Regular Expression | Compiler Design

Regular Expression

A regular expression is a set of patterns that can match a character or string. It can also match alternative characters or strings. The grammar defined by the regular expression is known as regular grammar, and the language is known as regular language. Any string matched by the regular expression is a set of symbols over an alphabet. The repetition and alternation in any string are expressed using *, +, and |.

In any regular expression, a* means a can occurs zero or more times. It can generate (e, aa, aaa, aaaa ...).

In any regular expression, a+ means a can occurs one or more times. It can generate (a, aa, aaa, aaaa ...).

Here are the rules that define regular expression over some alphabet and the language those expressions denote.

Let a and b are regular expressions expressing the language L(a) and L(b).

  1. (a)|(b) is a regular expression representing the language L(a) union L(b).
  2. (a)(b) is regular expression representing  the language L(a)L(b).
  3. (a)* is regular expression representing (L(a))*
  4. (a) is a regular expression representing L(r).

Operation on Regular Language

The various operations on the regular language are discussed below:

Union: If X and Y are regular expressions, L union M is also union.

X U Y = {a | a is in X or a is in Y}

Intersection: If X and Y are regular expressions, their intersection is also an intersection.

X ? Y = {ap | a is in X and p is in Y}

Kleene closure: If X is a regular language, its Kleene closure X1* will also be a regular language.

X* = the language L can occur zero or more times.

Precedence and Associativity

  • Unary operator * is left-associative and with the highest precedence.
  • Concatenation is the left-associative and has the second-highest precedence.
  • | (pipe sign) is also left-associative with the lowest precedence amongst all of them.

Example -

Let X = (a, b)

  • The regular expression a|b denote the language {a, b}.
  • (a|b)(a|b) represent {aa, ab, ba, bb} the language of all strings having length two over the alphabet X. One more regular expressions that can accept the same language is aa|ab|ba|bb.
  • a* represents the group of all strings that have zero or more a's, i.e. (e, aa, aaa, aaaa, ......).
  • (a|b)* represent the group of all strings having zero or more times of a or b, i.e., all string that contains a's and b's: {e, a, b, aa, ab, ba, bb, aaa,}. One more regular expression that accepts the same language is (a*b*)*.                                                                                                                                                  
  • a|a*b denotes the language {a, b, ab, aab, aaan ...}, i.e., the string a, and all strings have zero or more a's and ending with b.

Extensions of Regular Expressions

Kleene introduced regular expression in the 1950s with the basic operation for a union, concatenation, and Kleene closure.

Here are the few notational extension mentioned that are currently in use:

  1. One or more instance: Unary postfix operator + shows positive closure of a regular expression and its language. It stated that if a is the regular expression, then (a)+ denotes the language (L(a)+.  Two algebraic laws r* = r+|e and r+ =rr* = r*r relate the positive closure and Kleene closure.
  2. Zero or one instance: Unary postfix operator? means zero or one occurrence. It means that r? is equivalent to r|e or L(r?) = L(r) U {e}. This operator has the same precedence and associativity as * and +.