Master Computer Science The Smart Way

How a Compiler Actually Works (With Visuals)

How a Compiler Actually Works (With Visuals)
0

1. The Big Picture

text

SOURCE CODE                     MACHINE CODE
(hello.c)   →  [ COMPILER ]  →  (hello.exe / a.out)

A compiler is a translator from a high-level language (C, C++, Rust, etc.) to low-level machine instructions (binary) that a CPU can execute.

But internally, it’s not a single step — it’s a pipeline of phases.


2. The Six Major Phases (With Visuals)

Let’s take a tiny C program:

c

int add(int a, int b) {
    return a + b;
}

Phase 1: Lexical Analysis (Tokenization)

Input: stream of characters
Output: tokens (atoms like keywords, identifiers, operators, literals)

text

characters:   i n t   a d d ( i n t   a , i n t   b )
tokens:       KEYWORD(int)  IDENT(add)  LPAREN  KEYWORD(int)  IDENT(a)  COMMA  KEYWORD(int)  IDENT(b)  RPAREN  LBRACE  RETURN  IDENT(a)  PLUS  IDENT(b)  SEMICOLON  RBRACE

Visual:
"int" → [INT]
"add" → [ID]
"(" → [LPAREN]


Phase 2: Syntax Analysis (Parsing)

Tokens → Abstract Syntax Tree (AST) following grammar rules.

text

        return_stmt
        /          \
      PLUS         binary_op? no, just a+b
     /    \
  ID(a)   ID(b)

Better visualized as:

text

          function_definition
         /        |         \
     type: int   name: add   body (block)
                               |
                           return_stmt
                               |
                             binary_op: +
                             /        \
                        expr: a      expr: b

AST ignores punctuation like commas and semicolons — just structure.


Phase 3: Semantic Analysis

Checks meaning (type checking, scope, etc.).

Example:
int a = "hello"; ← Type error, caught here.
Also builds a symbol table:

NameTypeScope
addint→int→intglobal
aintparam
bintparam

Phase 4: Intermediate Representation (IR) Generation

AST → lower-level, machine-independent IR (e.g., three-address code).

text

t1 = a + b
return t1

Or in LLVM IR style:

text

%result = add i32 %a, %b
ret i32 %result

IR is easier to optimize than raw AST.


Phase 5: Optimization

Optimize IR without knowing the target CPU.

Example improvements:

  • Constant folding: x = 3 + 5 → x = 8
  • Dead code elimination: remove if (false) { ... }
  • Inline small functions: replace add(a,b) with a+b

Phase 6: Code Generation

IR → assembly / machine code for a specific CPU (x86, ARM, etc.).

add(a, b) on x86-64 might become:

asm

lea eax, [rdi + rsi]   ; rdi = a, rsi = b, eax = result
ret

Then assembler + linker produce executable binary.


3. Visual Pipeline Summary (Text Diagram)

text

Source File
     │
     ▼
┌─────────────┐
│ Lexer       │ ──► token stream
└─────────────┘
     │
     ▼
┌─────────────┐
│ Parser      │ ──► AST
└─────────────┘
     │
     ▼
┌─────────────┐
│ Semantic    │ ──► annotated AST + symbol table
│ Analyzer    │
└─────────────┘
     │
     ▼
┌─────────────┐
│ IR Gen      │ ──► IR (three-address code)
└─────────────┘
     │
     ▼
┌─────────────┐
│ Optimizer   │ ──► optimized IR
└─────────────┘
     │
     ▼
┌─────────────┐
│ Code Gen    │ ──► assembly / machine code
└─────────────┘
     │
     ▼
Executable

4. Real Compiler Examples

CompilerLanguagePhases combined
GCCC/C++6+ passes, many optimizations
Clang/LLVMC/C++/Rustclean separation: frontend (Clang) → IR → backend (LLVM)
javacJavasource → bytecode (JVM IR)

5. Key Takeaway

A compiler is not magic — it’s a deterministic pipeline:

  • Lexer breaks text into tokens.
  • Parser builds a tree (AST).
  • Semantic validates types and scopes.
  • IR flattens to lower-level ops.
  • Optimizer improves IR.
  • Code generator emits machine code.
Leave A Reply

Your email address will not be published.