C++ Series

Inside the C++ Compiler: From Tokens to Assembly

When you compile a C++ program, a surprising amount of work happens behind the scenes. Compilation is not a single transformation from source code to executable, but a carefully designed pipeline of stages that progressively analyze, verify, and transform your code.

1. Lexical Analysis (Lexing)

The compiler reads your source code as plain text and groups characters into meaningful units called tokens.

int sum = a + b;

This becomes a sequence of tokens:

  • int — keyword
  • sum — identifier
  • = — assignment operator
  • a, b — identifiers
  • + — arithmetic operator
  • ; — statement terminator
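
To make the idea concrete, here is a minimal sketch of a lexer that produces the token sequence above. The names TokenKind, Token, and lex are hypothetical and chosen only for illustration. It handles just identifiers, the keyword int, and single-character symbols, so it is far simpler than what a real C++ compiler does.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token categories for this sketch; real compilers have many more.
enum class TokenKind { Keyword, Identifier, Operator, Punctuator };

struct Token {
    TokenKind kind;
    std::string text;
};

// Splits a source string into tokens. Only identifiers, the keyword `int`,
// and one-character operators/punctuators are handled; numeric literals,
// comments, and multi-character operators are deliberately left out.
std::vector<Token> lex(const std::string& src) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) {
            ++i;  // whitespace separates tokens but is not a token itself
        } else if (std::isalpha(c) || c == '_') {
            // Consume the longest run of letters, digits, and underscores.
            std::size_t start = i;
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_')) {
                ++i;
            }
            std::string word = src.substr(start, i - start);
            TokenKind kind = (word == "int") ? TokenKind::Keyword : TokenKind::Identifier;
            tokens.push_back({kind, word});
        } else if (c == ';') {
            tokens.push_back({TokenKind::Punctuator, ";"});
            ++i;
        } else {
            // Assumption of this sketch: any other character is a one-character operator.
            tokens.push_back({TokenKind::Operator, std::string(1, src[i])});
            ++i;
        }
    }
    return tokens;
}
```

Calling lex("int sum = a + b;") yields seven tokens in source order, starting with the keyword int and ending with the punctuator ;.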

2. Syntax Analysis (Parsing)

The parser organizes tokens according to the C++ grammar and builds an Abstract Syntax Tree (AST), which represents the hierarchical structure of the code.

Parsing answers the question: "Is this code grammatically valid C++?" Syntax errors such as missing semicolons or mismatched braces are detected here.
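
As a rough illustration, the sketch below parses a toy grammar (identifiers joined by +) into a tree and rejects input that does not fit the grammar. Node, parse_expr, and show are hypothetical names, and tokens are passed as plain strings to keep the example short. Real C++ parsers are vastly more involved.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical AST node: a leaf naming a variable, or a binary "+" node.
struct Node {
    std::string value;            // identifier name, or "+" for an addition
    std::unique_ptr<Node> left;   // null for leaves
    std::unique_ptr<Node> right;  // null for leaves
};

// Parses the toy grammar:  expr := identifier ('+' identifier)*
// Throws std::runtime_error when the tokens do not match the grammar.
std::unique_ptr<Node> parse_expr(const std::vector<std::string>& tokens) {
    std::size_t pos = 0;

    auto parse_identifier = [&]() {
        if (pos >= tokens.size() || tokens[pos] == "+") {
            throw std::runtime_error("syntax error: expected identifier");
        }
        auto leaf = std::make_unique<Node>();
        leaf->value = tokens[pos++];
        return leaf;
    };

    std::unique_ptr<Node> tree = parse_identifier();
    while (pos < tokens.size()) {
        if (tokens[pos] != "+") {
            throw std::runtime_error("syntax error: expected '+'");
        }
        ++pos;
        // The tree built so far becomes the left child, so "+" is left-associative.
        auto parent = std::make_unique<Node>();
        parent->value = "+";
        parent->left = std::move(tree);
        parent->right = parse_identifier();
        tree = std::move(parent);
    }
    return tree;
}

// Renders the tree in prefix form, e.g. (+ a b), to make its shape visible.
std::string show(const Node& n) {
    if (!n.left) return n.value;
    return "(+ " + show(*n.left) + " " + show(*n.right) + ")";
}
```

With this sketch, the tokens a, +, b produce a tree that show renders as (+ a b), while the incomplete input a, + throws, which is the toy equivalent of a syntax error.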

3. Semantic Analysis

After structural validation, the compiler verifies whether the code makes sense:

  • Ensuring type correctness in expressions and assignments
  • Resolving names and checking that each variable is used within its scope
  • Checking function declarations, definitions, and calls
  • Validating access control and const-correctness
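
The checks listed above can be illustrated with code that is grammatically valid but semantically wrong. In the sketch below, Account, twice, and semantic_demo are hypothetical names. Each commented-out line parses without trouble, yet a conforming compiler should reject it during semantic analysis if it is uncommented.

```cpp
class Account {
    int balance_ = 0;  // private by default in a class
public:
    int balance() const { return balance_; }
    void deposit(int amount) { balance_ += amount; }
};

int twice(int x) { return 2 * x; }

int semantic_demo() {
    const int limit = 10;
    Account acct;
    acct.deposit(twice(limit));  // well-formed: types, scope, and access all check out

    // Syntactically fine, semantically invalid:
    // limit = 20;               // const-correctness: assignment to a const variable
    // int* p = 42;              // type check: no implicit conversion from int to int*
    // int y = undeclared + 1;   // scope: the name was never declared
    // twice(1, 2);              // call check: no overload takes two arguments
    // acct.balance_ = 5;        // access control: balance_ is private

    return acct.balance();
}
```

Uncommenting the lines one at a time is a quick way to see which diagnostic each kind of semantic check produces in your compiler.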

4. Intermediate Representation (IR)

Most modern compilers lower the program into an Intermediate Representation (IR): a simplified, largely target-independent form that makes analysis and optimization easier. LLVM-based compilers such as Clang use LLVM IR; GCC uses its own internal representations, including GIMPLE and RTL.
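
To give a feel for what "lowering" means, the sketch below rewrites an expression by hand into a three-address style, where each step performs one operation and stores it in a temporary. This is only an analogy written in C++; it is not actual LLVM IR or GIMPLE, and the function names are made up for illustration.

```cpp
// Source-level form: one nested expression.
int original(int a, int b, int c) {
    return a + b * c;
}

// Hand-lowered form: one operation per step, each result in a named temporary.
// Real IRs look different, but they break expressions down in a similar way.
int lowered(int a, int b, int c) {
    int t0 = b * c;   // multiplication first, following operator precedence
    int t1 = a + t0;  // then the addition
    return t1;
}

// To inspect the real thing (file names here are placeholders):
//   clang++ -S -emit-llvm example.cpp -o example.ll   // writes textual LLVM IR
//   g++ -c -fdump-tree-gimple example.cpp             // writes a GIMPLE dump file
```

Both functions compute the same value; the second simply makes the order of operations explicit, which is what makes later analysis passes easier to write.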

5. Optimization

With the program in IR form, the compiler applies transformations to improve performance without changing observable behavior:

  • Dead code elimination — removes unused computations
  • Constant folding — evaluates expressions at compile time
  • Function inlining — replaces calls with function bodies
  • Loop unrolling — reduces loop overhead

Optimization levels control how aggressively these are applied. In GCC and Clang, -O0 disables most optimizations, while -O1, -O2, and -O3 enable progressively more of them.
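
The sketch below applies three of these transformations by hand to show the kind of rewrite an optimizer is allowed to make. The function names are hypothetical, and whether a given compiler performs exactly these rewrites depends on the compiler and the flags used.

```cpp
// Before: written the straightforward way.
int seconds_in_days(int days) {
    int total = days * 2;        // dead store: overwritten below before it is ever read
    int per_day = 60 * 60 * 24;  // constant expression: can be folded at compile time
    total = days * per_day;
    return total;
}

// After dead code elimination and constant folding (applied by hand).
int seconds_in_days_optimized(int days) {
    return days * 86400;
}

// Before: a small helper called twice.
static int square(int x) { return x * x; }

int sum_of_squares(int a, int b) {
    return square(a) + square(b);
}

// After function inlining (applied by hand): the calls are replaced by the body.
int sum_of_squares_inlined(int a, int b) {
    return a * a + b * b;
}
```

Each "after" version returns the same result as its "before" version for the same input, which is the sense in which observable behavior is preserved.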

C++ → Assembly Example

int add(int a, int b) {
  return a + b;
}

With optimizations enabled on x86-64 (System V calling convention, shown in Intel syntax), the compiler may generate something like:

add:
  mov eax, edi
  add eax, esi
  ret
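
Here the arguments arrive in the edi and esi registers, and the result is returned in eax. Optimizations can drastically change the generated assembly: variables may live only in registers or disappear entirely, and instructions may be reordered. This is why debugging optimized builds is significantly harder than debugging unoptimized ones.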

Key Takeaways

  • Compilation is a multi-stage pipeline, not a black box
  • Syntax and semantic errors are reported before any machine code is generated
  • Intermediate representations enable powerful optimizations
  • Optimization improves performance but reduces debuggability