Lexical Analysis in Compiler Design

Lexical Analysis

Lexical analysis is the first phase of a compiler. The lexical analyzer is also known as a scanner. Its input is the source code. It breaks the source code into valid tokens, discarding whitespace and comments along the way. If the source code contains any invalid tokens, it reports an error. The output of lexical analysis is a sequence of tokens, which is passed to the syntax analysis phase as its input.

    If the source language uses a macro preprocessor, the lexical analyzer may also perform the expansion of macros.



A token is a pair consisting of a token name and an optional attribute value. The sequence of characters in the source code that makes up a token is known as a lexeme. Predefined rules determine whether a lexeme is a valid token. These rules, known as patterns, are defined with the help of the grammar of the programming language.
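The relationship between tokens, lexemes, and patterns can be sketched in a few lines of Python. This is a minimal illustration, not a complete lexer: the `Token` class, the `PATTERNS` table, and the tiny C-like language it describes are all assumptions made for the example. Each pattern is written as a regular expression, whitespace and comments are matched but discarded, and any character that fits no pattern is reported as an error.

```python
import re
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass(frozen=True)
class Token:
    name: str                        # token name, e.g. "KEYWORD"
    attribute: Optional[str] = None  # optional attribute value (here: the lexeme)


# Hypothetical pattern table for a tiny C-like language (an assumption,
# not a complete language definition). Order matters: earlier entries win.
PATTERNS = [
    ("COMMENT",    r"//[^\n]*|/\*.*?\*/"),
    ("WHITESPACE", r"\s+"),
    ("KEYWORD",    r"\b(?:int|float|return)\b"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("NUMBER",     r"\d+"),
    ("OPERATOR",   r"[=+\-*/]"),
    ("SEPARATOR",  r"[;(),]"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in PATTERNS), re.DOTALL)


def tokenize(source: str) -> Iterator[Token]:
    """Yield the tokens of `source`, skipping whitespace and comments."""
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            # No pattern accepts the text at this position: invalid token.
            raise SyntaxError(f"invalid character {source[pos]!r} at offset {pos}")
        pos = match.end()
        if match.lastgroup in ("COMMENT", "WHITESPACE"):
            continue  # recognized, but not passed on to the parser
        yield Token(match.lastgroup, match.group())


for token in tokenize("int count = 42; // set counter"):
    print(token)
```

Here the attribute is simply the lexeme itself; a real compiler would more likely store a symbol-table reference or a converted numeric value.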

There are various kinds of tokens in a programming language, such as keywords, identifiers, operators, separators, numbers, strings, and punctuation.


int value = 10;


In this example, int (keyword), value (identifier), = (operator), 10 (constant), and ; (symbol) are the tokens. So the total number of tokens present in this example is 5.


printf ("Tutorialandexample");


'printf', '(', '"Tutorialandexample"', ')', and ';' are the tokens present in this example. The string literal, including its quotes, counts as a single token.

So here, we can say that there are five valid tokens.
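To check the token counts for both statements, here is a small hand-written scanner that reads the input one character at a time. It is only a sketch under stated assumptions: the `KEYWORDS` set is a made-up subset of C keywords, string literals are taken to run to the next double quote with no escape handling, and only the operator and separators needed for these two statements are recognized.

```python
KEYWORDS = {"int", "float", "char", "return"}  # assumed subset of C keywords


def scan(source):
    """Return a list of (token_name, lexeme) pairs for a tiny C-like subset."""
    tokens = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch.isspace():                      # whitespace only separates tokens
            i += 1
        elif ch.isalpha() or ch == "_":       # keyword or identifier
            start = i
            while i < len(source) and (source[i].isalnum() or source[i] == "_"):
                i += 1
            lexeme = source[start:i]
            kind = "keyword" if lexeme in KEYWORDS else "identifier"
            tokens.append((kind, lexeme))
        elif ch.isdigit():                    # integer constant
            start = i
            while i < len(source) and source[i].isdigit():
                i += 1
            tokens.append(("constant", source[start:i]))
        elif ch == '"':                       # string literal, quotes included
            end = source.find('"', i + 1)
            if end == -1:
                raise SyntaxError(f"unterminated string starting at offset {i}")
            tokens.append(("string", source[i:end + 1]))
            i = end + 1
        elif ch == "=":
            tokens.append(("operator", ch))
            i += 1
        elif ch in "();,":
            tokens.append(("separator", ch))
            i += 1
        else:
            raise SyntaxError(f"invalid character {ch!r} at offset {i}")
    return tokens


first = scan("int value = 10;")
second = scan('printf ("Tutorialandexample");')
print(len(first), [lexeme for _, lexeme in first])
print(len(second), [lexeme for _, lexeme in second])
```

Both calls report five tokens. Note that `printf` is classified as an identifier here, since it is a library function name rather than a keyword.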

Specification of Tokens:

The following concepts are used to specify the tokens of a programming language:


An alphabet is any finite set of symbols. For example, {0, 1} is the binary alphabet, {0-9, A-F} is the hexadecimal alphabet, and {a-z, A-Z} is the alphabet of the English language.


In a programming language, a string is a finite sequence of characters. The number of symbols in a string is the length of the string. ε denotes the empty string, which has length zero.


Tutorialandexample: There are 18 characters in Tutorialandexample, so the length of the string is 18.
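These definitions translate directly into code. The sketch below models an alphabet as a Python set of one-character symbols and a string as an ordinary `str`; the names `BINARY`, `HEXADECIMAL`, `EPSILON`, and `is_string_over` are chosen for this illustration only.

```python
BINARY = set("01")
HEXADECIMAL = set("0123456789ABCDEF")


def is_string_over(text, alphabet):
    """True if every symbol of `text` belongs to `alphabet`."""
    return all(symbol in alphabet for symbol in text)


EPSILON = ""  # the empty string, conventionally written ε

print(len("Tutorialandexample"))          # 18
print(len(EPSILON))                       # 0
print(is_string_over("1011", BINARY))     # True
print(is_string_over("1G", HEXADECIMAL))  # False
print(is_string_over(EPSILON, BINARY))    # True: ε is a string over every alphabet
```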

Operators/Symbols:

Arithmetic operators: +, -, *, /, %
Relational operators: !=, <, <=, >, >=, ==
Logical operators: &, &&, |, ||, !
Assignment operators: +=, /=, *=, -=, =
Shift operators: >>, >>>, <<, <<<
Punctuation: comma (,), semicolon (;), dot (.)
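Several of these operators are prefixes of others (for example, > is a prefix of >=, >>, and >>>), so a scanner has to decide where one operator ends. A common convention is the longest-match rule: at each position, take the longest lexeme that forms a valid operator. The sketch below is one simple way to do that, assuming the operator list above; the function name `split_operators` is hypothetical.

```python
# Operator lexemes taken from the list above.
OPERATORS = ["+", "-", "*", "/", "%",
             "!=", "<", "<=", ">", ">=", "==",
             "&", "&&", "|", "||", "!",
             "+=", "/=", "*=", "-=", "=",
             ">>", ">>>", "<<", "<<<"]
# Try longer lexemes first so that ">=" is not split into ">" and "=".
BY_LENGTH = sorted(OPERATORS, key=len, reverse=True)


def split_operators(text):
    """Split a run of operator characters using the longest-match rule."""
    result = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        for op in BY_LENGTH:
            if text.startswith(op, i):
                result.append(op)
                i += len(op)
                break
        else:
            # The for loop found no operator starting at this offset.
            raise SyntaxError(f"no operator matches at offset {i}")
    return result


print(split_operators("<= >>> && !"))
print(split_operators(">>>>"))
```

The first call yields the four operators as written. The second shows the rule at work: the scanner takes `>>>` and then a single `>`, rather than two `>>`.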


In general, a language is a collection of words (strings) formed over a finite alphabet. A computer language is a set of instructions with which some kind of output is associated.
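As a small illustration of a language as a set of strings over an alphabet, the sketch below enumerates every string over {0, 1} up to a given length and keeps those belonging to an example language. The language chosen here (binary strings ending in 1) and the helper names are assumptions made purely for the example.

```python
from itertools import product


def strings_up_to(alphabet, max_length):
    """All strings over `alphabet` with length <= max_length, shortest first."""
    for length in range(max_length + 1):
        for symbols in product(sorted(alphabet), repeat=length):
            yield "".join(symbols)


# Hypothetical example language: binary strings that end in "1".
def in_language(text):
    return text.endswith("1")


candidates = list(strings_up_to({"0", "1"}, 2))
language = [s for s in candidates if in_language(s)]
print(candidates)  # ['', '0', '1', '00', '01', '10', '11']
print(language)    # ['1', '01', '11']
```

The full language is infinite; the length bound only limits how much of it is listed.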