6 Phases of Compiler

What is a Compiler?

A compiler is a program that translates source code written in a high-level programming language into low-level code, typically machine code, for a target computer architecture. Along the way it checks the source program for errors and can optimize the generated code for performance. Formally, a compiler performs a sequence of transformations from high-level source code to low-level target code.

The Importance of a Compiler

The need for compilers comes from the way processors execute code: a processor understands only machine instructions, which its circuits carry out as patterns of high and low electrical signals driving the arithmetic logic unit. High-level source code, with its identifiers, expressions, function calls, and control structures, means nothing to the hardware, so it must first be translated into machine code.

Types of Compilers

There are several kinds of compilers in use:

  • Cross compiler: A compiler that runs on one platform but produces executable code for a different platform. Cross compilers matter when maintaining code that must run on several systems, and especially when bringing up an entirely new platform that cannot yet host its own compiler.
  • JIT (just-in-time) compiler: A compiler that translates code while the program runs, typically compiling bytecode into machine code just before it is executed. Because compilation happens on demand, a JIT compiler can use run-time information to optimize the code paths that are actually executed, improving the performance of long-running programs.
  • Source-to-source compiler: Also called a transpiler, this compiler translates source code written in one high-level language into equivalent source code in another high-level language (or another version of the same language). The output is then compiled or interpreted by the tools of the target language.
  • Bytecode compiler: A compiler that translates a high-level language into bytecode, a compact intermediate form executed by a virtual machine rather than directly by the hardware. Because the bytecode targets the virtual machine instead of a specific processor, the same compiled program can run on any platform that implements the virtual machine; Java's javac, which produces bytecode for the Java Virtual Machine, is a well-known example.
  • Binary compiler: A compiler that translates source code files directly into binary machine code. This format stores the program compactly in a form the processor can execute without further translation.
  • Hardware compiler: Also known as a silicon compiler, this tool translates a description of hardware (for example, code written in a hardware description language) into a hardware design such as an FPGA configuration or a chip layout. Hardware compilers are used when designing embedded devices and custom processors.

Phases of a Compiler

The six phases of a compiler are as follows:

  1. Lexical Analysis
  2. Syntax Analysis
  3. Semantic Analysis
  4. Intermediate Code Generation
  5. Code Optimization
  6. Code Generation

All of the preceding steps entail the following tasks:

  • Symbol table maintenance.
  • Handling of errors.

Lexical Analysis

This is the first phase of the compiler, in which the high-level input program is converted into a sequence of tokens. A token is a sequence of characters that is treated as a single unit in the grammar of the programming language. Lexical analysis can be implemented with deterministic finite automata. The result is a stream of tokens that is passed to the parser for syntax analysis.

  • Token: A token is a sequence of characters that represents a lexical unit matching a given pattern, such as a keyword, operator, or identifier.
  • Lexeme: A lexeme is the actual sequence of characters in the source program that matches the pattern for a token; it is an instance of that token.
  • Pattern: A pattern is the rule that all the lexemes of a token follow; it is the description that strings must match.

When a token is created, an entry in the symbol table is created for it.

  • Input: character stream
  • Output: tokens
  • Token template: <token-name, attribute-value>

Example: c = a + b * 5;

  Lexeme    Token
  c         identifier
  =         = (assignment symbol)
  a         identifier
  +         + (addition symbol)
  b         identifier
  *         * (multiplication symbol)
  5         5 (number)

As a result, the statement is converted into the token stream <id,1> <=> <id,2> <+> <id,3> <*> <5>.
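As a sketch of how such a tokenizer can work, the following Python snippet breaks the example statement into (token-name, lexeme) pairs. It is illustrative only: the token names and patterns are assumptions, and production lexers are usually generated from DFA specifications rather than hand-written regular expressions.

```python
import re

# Hypothetical token patterns for the example statement c = a + b * 5;
TOKEN_PATTERNS = [
    ("id",     r"[A-Za-z_]\w*"),   # identifiers
    ("number", r"\d+"),            # integer literals
    ("assign", r"="),
    ("plus",   r"\+"),
    ("star",   r"\*"),
    ("semi",   r";"),
    ("ws",     r"\s+"),            # whitespace, skipped below
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_PATTERNS))

def tokenize(source):
    """Yield (token-name, lexeme) pairs, skipping whitespace."""
    for m in MASTER.finditer(source):
        if m.lastgroup != "ws":
            yield (m.lastgroup, m.group())

print(list(tokenize("c = a + b * 5;")))
# [('id', 'c'), ('assign', '='), ('id', 'a'), ('plus', '+'),
#  ('id', 'b'), ('star', '*'), ('number', '5'), ('semi', ';')]
```

Each pair corresponds to one entry of the lexeme/token table above.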

Syntactic Analysis or Parsing

Syntactic analysis, also known as parsing, is the second phase, in which the token stream is checked against the grammar of the programming language to determine whether the structure of the input is valid.

The parser converts the tokens generated by the lexical analyzer into a tree-like representation known as the parse tree, which represents the grammatical structure of the input. The syntax tree is a condensed version of the parse tree in which operators appear as interior nodes and operands as children of the node associated with the operator.

  • Input: tokens
  • Output: syntax tree
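To make the idea concrete, here is a minimal recursive-descent parser sketch. It is a hypothetical toy, handling only + and * with the usual precedence, and it builds the syntax tree as nested tuples with operators as interior nodes:

```python
# Toy grammar: expr -> term ('+' term)* ; term -> atom ('*' atom)*
# Input is a list of token strings; output is (tree, remaining-tokens).

def parse_expr(tokens):
    node, rest = parse_term(tokens)
    while rest and rest[0] == "+":             # left-associative addition
        right, rest = parse_term(rest[1:])
        node = ("+", node, right)
    return node, rest

def parse_term(tokens):
    node, rest = tokens[0], tokens[1:]         # an atom is a single token
    while rest and rest[0] == "*":             # '*' binds tighter than '+'
        node = ("*", node, rest[1])
        rest = rest[2:]
    return node, rest

tree, _ = parse_expr(["a", "+", "b", "*", "5"])
print(tree)
# ('+', 'a', ('*', 'b', '5'))
```

Note how the multiplication ends up deeper in the tree than the addition, reflecting operator precedence.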

Semantic Analysis

In the third phase, the compiler checks whether the parse tree obeys the rules of the language, tracking identifiers and expressions, for example to verify that types are used consistently and that identifiers are declared before use. The semantic analyzer judges the validity of the parse tree, and its output is an annotated syntax tree.
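A sketch of one such check in Python: assuming a hypothetical symbol table in which a, b, and c are declared as floats and integer literals are promoted to float, it annotates each node of the syntax tree with a type and inserts the conversion node that later phases will use.

```python
# Hypothetical declarations for the running example c = a + b * 5
SYMBOLS = {"a": "float", "b": "float", "c": "float"}

def check(node):
    """Return (annotated-node, type). Literals are int; mixing promotes to float."""
    if isinstance(node, tuple):                # operator node: (op, left, right)
        op, left, right = node
        (left, lt), (right, rt) = check(left), check(right)
        if lt != rt:                           # implicit int -> float promotion
            if lt == "int":
                left = ("inttofloat", left)
            else:
                right = ("inttofloat", right)
        return (op, left, right), "float"
    if node.isdigit():                         # integer literal
        return node, "int"
    return node, SYMBOLS[node]                 # declared identifier

tree, t = check(("+", "a", ("*", "b", "5")))
print(tree, t)  # the literal 5 is wrapped in an inttofloat node
```

The annotated tree produced here is exactly the input the next phase, intermediate code generation, works from.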

Intermediate Code Generation

After the parse tree has been semantically checked, the intermediate code generator produces an intermediate representation, most commonly three-address code. This middle-level code sits between the high-level source language and the low-level target language.

Intermediate code generation can produce the following intermediate forms of the source program:

  • Postfix notation
  • Three-address code
  • The syntax tree

Three-address code is the most widely used format. For the statement c = a + b * 5; (with the identifiers represented as id1, id2, and id3), the generated code is:

t1 = inttofloat(5)

t2 = id3 * t1

t3 = id2 + t2

id1 = t3

Intermediate code should have the following characteristics: 

  • It should be easy to generate.
  • It should be easy to translate into the target program.
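The three-address code above can be produced by a simple walk over the annotated syntax tree. The following Python sketch is illustrative only (the temporary-naming scheme and tuple tree shape are assumptions); it flattens a nested expression into one instruction per operator:

```python
# Flatten a nested-tuple expression tree into three-address instructions.
_temp = 0

def new_temp():
    """Return a fresh temporary name t1, t2, ..."""
    global _temp
    _temp += 1
    return f"t{_temp}"

def gen(node, code):
    """Emit instructions for node into code; return the name holding its value."""
    if not isinstance(node, tuple):            # identifier or literal
        return node
    if node[0] == "inttofloat":                # unary conversion node
        t = new_temp()
        code.append(f"{t} = inttofloat({gen(node[1], code)})")
        return t
    op, left, right = node                     # binary operator node
    left, right = gen(left, code), gen(right, code)
    t = new_temp()
    code.append(f"{t} = {left} {op} {right}")
    return t

code = []
result = gen(("+", "id2", ("*", "id3", ("inttofloat", "5"))), code)
code.append(f"id1 = {result}")
print("\n".join(code))
# t1 = inttofloat(5)
# t2 = id3 * t1
# t3 = id2 + t2
# id1 = t3
```

The emitted sequence matches the four three-address instructions shown above.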

Code Optimization

This is a compiler phase that is optional and is used to optimize the intermediate code. The software runs faster and takes up less space as a result of this enhancement. To boost program performance, superfluous lines of code are removed and statement sequences are arranged.

The code optimization phase takes the intermediate program as input and outputs optimized intermediate code. By removing unnecessary lines of code and rearranging statement sequences, it reduces the size of the program and makes the resulting machine code run faster. Code optimization never changes the program's output; it only improves the code from which the final machine code is generated.

Typical optimizations include:

  • Detection and elimination of dead code (code that can never be reached).
  • Constant folding: expressions whose operands are all constants are evaluated at compile time.
  • Common subexpression elimination: a repeated computation is replaced by a reference to a temporary variable.
  • Loop unrolling.
  • Moving loop-invariant code out of the loop.
  • Elimination of unneeded temporary variables.
For the running example, the optimizer folds the conversion of the constant 5 into the float literal 5.0 and eliminates the extra temporaries, leaving:

t1 = id3 * 5.0

id1 = id2 + t1
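As an illustration of one of these transformations, the sketch below performs constant folding on three-address instructions. The string-based instruction format is an assumption for the example, not how a real optimizer represents code:

```python
# Constant folding sketch: instructions are strings "dest = left op right";
# when both operands are integer literals, compute the result at compile time.
OPS = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}

def fold_constants(instrs):
    out = []
    for ins in instrs:
        dest, expr = ins.split(" = ", 1)
        parts = expr.split()
        if len(parts) == 3 and parts[1] in OPS \
                and parts[0].isdigit() and parts[2].isdigit():
            out.append(f"{dest} = {OPS[parts[1]](int(parts[0]), int(parts[2]))}")
        else:
            out.append(ins)   # leave non-constant instructions unchanged
    return out

print(fold_constants(["t1 = 2 * 3", "id1 = id2 + t1"]))
# ['t1 = 6', 'id1 = id2 + t1']
```

Because 2 * 3 involves only constants, the multiplication disappears from the generated machine code entirely.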

Code Generation

Code generation is the last phase of the compiler, in which it takes the fully optimized intermediate code as input and converts it into machine code for the target architecture.

Intermediate instructions are converted into a series of machine instructions which perform the same function.

The code generation process entails:

  • Allocation of memory and registers.
  • Generation of correct references.
  • Generation of correct data types.
  • Selection of machine instructions for each intermediate instruction.
For the running example, the generated target code might be:

LDF R2, id3

MULF R2, R2, #5.0

LDF R1, id2

ADDF R1, R1, R2

STF id1, R1
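A toy instruction selector for this step might look as follows. This is a sketch under strong assumptions: it keeps every result in a single register R1, handles only the ADDF/MULF/LDF/STF opcodes from the example, and does no real register allocation:

```python
def select(instrs):
    """Map each 'dest = left op right' line to load/compute/store instructions."""
    opcodes = {"+": "ADDF", "*": "MULF"}       # assumed float opcodes
    asm = []
    for ins in instrs:
        dest, expr = ins.split(" = ", 1)
        left, op, right = expr.split()
        asm += [f"LDF R1, {left}",             # load left operand
                f"{opcodes[op]} R1, R1, {right}",  # compute into R1
                f"STF {dest}, R1"]             # store the result
    return asm

for line in select(["t1 = id3 * #5.0", "id1 = id2 + t1"]):
    print(line)
```

A real code generator would also decide which values stay in registers between instructions instead of storing and reloading each temporary.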

Symbol Table Management

  • The symbol table holds all of the information about the program's identifiers.
  • It is a data structure containing a record for every identifier, with fields for the identifier's attributes.
  • It allows the compiler to quickly find the record for an identifier and to store or retrieve information from that record.
  • Whenever an identifier is encountered in any phase, it is entered into the symbol table.

Example

int a, b; float c; char d;

  Symbol Name    Type     Address
  a              int      1000
  b              int      1002
  c              float    1004
  d              char     1008
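A minimal symbol table along these lines can be sketched as a dictionary keyed by identifier. The type sizes of 2, 4, and 1 bytes are assumptions chosen to reproduce the addresses in the table above:

```python
# Assumed sizes in bytes: int 2, float 4, char 1
SIZES = {"int": 2, "float": 4, "char": 1}

class SymbolTable:
    def __init__(self, base=1000):
        self.entries = {}          # identifier -> attribute record
        self.next_addr = base      # next free address to assign

    def insert(self, name, typ):
        """Create a record for a newly declared identifier."""
        self.entries[name] = {"type": typ, "address": self.next_addr}
        self.next_addr += SIZES[typ]

    def lookup(self, name):
        """Return the record for an identifier, or None if undeclared."""
        return self.entries.get(name)

table = SymbolTable()
for name, typ in [("a", "int"), ("b", "int"), ("c", "float"), ("d", "char")]:
    table.insert(name, typ)
print(table.lookup("c"))
# {'type': 'float', 'address': 1004}
```

A lookup that returns None is exactly how the semantic analyzer would detect use of an undeclared variable.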

Handling of Errors

  • Errors can occur at any phase. When an error is detected, the phase must deal with it so that compilation can continue.
  • In lexical analysis, errors arise while separating the input into tokens.
  • In syntax analysis, errors arise while building the syntax tree.
  • In semantic analysis, errors occur when the compiler finds constructs that are syntactically correct but have no meaning, or during type conversion.
  • In code optimization, an error occurs if an optimization changes the result of the program; in code generation, errors are reported, for example, when code is missing.

Errors Occur in Different Phases

Errors can occur at any stage. After identifying a mistake, a phase must deal with the issue in some way so that compilation may continue.

At various phases of development, a program might exhibit the following types of errors:

Lexical Errors

These include misspelled or invalid identifier names and character sequences that do not match the pattern of any token.

Syntactic Errors

A syntactic error might be a missing semicolon or an unbalanced pair of parentheses. Syntax analyzers (parsers) handle syntactic errors.

When an error is identified, the parser must handle it in order to continue parsing the remainder of the input. Problems can be detected at many stages of compilation, but most of them are syntactic, and the parser should be able to detect and report them.

The following are the purposes of the parser's error handler:

  • Report the presence of errors clearly and accurately.
  • Recover from each error quickly enough to detect subsequent errors.
  • Add as little overhead as possible to the processing of correct programs.

To cope with problems in the code, the parser can choose one of four typical error-recovery mechanisms.

  • Panic mode
  • Phrase level
  • Error productions
  • Global correction
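Panic mode is the simplest of these to sketch: on an unexpected token, the parser discards input until a synchronizing token (here, the semicolon) and then resumes. The toy grammar below, in which a statement is just id '=' id ';', is an assumption for illustration:

```python
def parse_one(tokens, i):
    """Recognize the toy statement form: id '=' id ';'."""
    if (i + 3 < len(tokens) and tokens[i].isidentifier()
            and tokens[i + 1] == "=" and tokens[i + 2].isidentifier()
            and tokens[i + 3] == ";"):
        return (tokens[i], tokens[i + 2]), i + 4
    raise SyntaxError(f"unexpected token near position {i}")

def parse_statements(tokens):
    """Parse a token list, recovering from errors in panic mode."""
    statements, errors, i = [], [], 0
    while i < len(tokens):
        try:
            stmt, i = parse_one(tokens, i)
            statements.append(stmt)
        except SyntaxError as e:
            errors.append(str(e))
            while i < len(tokens) and tokens[i] != ";":
                i += 1          # discard tokens up to the synchronizing ';'
            i += 1              # consume the ';' and resume parsing
    return statements, errors

stmts, errs = parse_statements(["a", "=", "b", ";", "x", "+", ";", "c", "=", "d", ";"])
print(stmts, errs)  # parsing resumes after the malformed middle statement
```

The bad statement is reported once, yet both well-formed statements around it are still parsed, which is exactly the goal of error recovery.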

Semantic Errors

These errors are caused by incompatible value assignments and similar inconsistencies. The semantic errors that a semantic analyzer is expected to detect are as follows:

  • Type mismatch.
  • Use of an undeclared variable.
  • Misuse of a reserved identifier.
  • Multiple declarations of a variable in the same scope.
  • Access to a variable that is out of scope.
  • Mismatch between actual and formal parameters.

Logical errors

  • These errors result from logically incorrect programs, such as unreachable code or an infinite loop; the program compiles but does not behave as intended.
  • Other typical errors found during compilation are invalid character sequences (lexical), invalid token sequences (syntactic), scoping errors, and type errors (semantic).
  • An error may occur during any of the preceding phases; after discovering it, the phase must deal with it in order to proceed with compilation.
  • Errors are passed to the error handler, which addresses them so that the compilation process can continue; errors are usually reported in the form of a message.

Advantages

  • Portability: Compilers allow programs to be written in a high-level programming language and executed on multiple hardware platforms without change. A program can therefore be written once and run on many platforms, which makes it more portable.
  • Optimization: Compilers can apply optimization techniques such as loop unrolling, dead code elimination, and constant propagation to increase the efficiency of the generated machine code.
  • Error Checking: Compilers perform a thorough examination of the source code, detecting syntax and semantic errors at compile time and lowering the risk of runtime failures.
  • Maintainability: High-level language programs are easier to understand and maintain than low-level assembly language programs. Compilers translate high-level code into machine code, making programs simpler to maintain and modify.
  • Productivity: High-level programming languages and compilers improve developer productivity. Developers can write code quickly in a high-level language and have it converted into fast machine code.

Conclusion

  • Each process of compiler design changes the original program from one form to another.
  • The six phases in compiler design are as follows:
  • 1) Lexical analysis 2) Syntax analysis 3) Semantic analysis 4) Intermediate code generation 5) Code optimization 6) Code generation
  • The initial phase of the compiler's scan of the source code is lexical analysis.
  • The goal of syntax analysis is to uncover structure in text.
  • Semantic analysis verifies the code's semantic correctness.
  • When the semantic analysis phase is complete, the compiler generates intermediate code for the target machine.
  • The code optimization process eliminates unneeded code lines and rearranges the statement order.
  • The code generation phase receives input from the code optimization phase and produces target code or object code as a result.
  • A symbol table has an entry for each symbol with columns for the identifier's characteristics.
  • During various steps, the error handling function handles errors and reports.