
🔬 Link Time Optimization in Embedded Systems

Whole-Program Optimization, Veneers, Map-File Forensics and Execution-Time Effects with Keil + armclang


In deeply embedded systems, optimization is not about squeezing synthetic benchmarks.

It is about:

  • Flash footprint

  • RAM usage

  • ISR latency

  • WCET predictability

  • Structural determinism


And if you are using Keil MDK with Arm Compiler 6, you are not working with generic GCC behavior. You are working with an LLVM-based compiler (armclang) paired with Arm's own linker (armlink).


Understanding how LTO behaves in armclang, and how it changes veneers and map-file structure, is critical.

1️⃣ The Toolchain Context: Keil + armclang


Recent Keil MDK releases (v5 and later) use:

  • Arm Compiler 6 (armclang)

  • Frontend: LLVM/Clang

  • Backend: LLVM code generation with Arm-specific tuning

  • Linker: armlink


When LTO is enabled in Keil:

  • Each object file carries LLVM bitcode (IR) instead of final machine code

  • Whole-program optimization runs at link time, driven by armlink

  • Code generation is deferred until the whole program is visible


In Keil µVision:

Project → Options for Target → C/C++ → Optimization → Link Time Optimization

Or via command line:

armclang --target=arm-arm-none-eabi -mcpu=cortex-m4 -O3 -flto -c main.c motor.c control.c
armlink --lto main.o motor.o control.o --map --list=firmware.map

2️⃣ Separate Compilation vs LTO (armclang)


Without LTO:

armclang --target=arm-arm-none-eabi -mcpu=cortex-m4 -O3 -c main.c
armclang --target=arm-arm-none-eabi -mcpu=cortex-m4 -O3 -c motor.c
armlink main.o motor.o --map --list=firmware.map

Each .o contains final machine code.


The linker:

  • Resolves symbols

  • Inserts veneers

  • Does limited relaxation

But it cannot optimize across function boundaries, because it only sees finished machine code.


With LTO:

armclang --target=arm-arm-none-eabi -mcpu=cortex-m4 -O3 -flto -c main.c
armclang --target=arm-arm-none-eabi -mcpu=cortex-m4 -O3 -flto -c motor.c
armlink --lto main.o motor.o --map --list=firmware.map

Now:

  • Objects contain LLVM IR

  • armlink merges IR

  • Full call graph is available

  • Whole-program optimization runs


This is not cosmetic.

It changes call structure, code layout and veneer behavior.
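
To make the difference concrete, here is a minimal sketch of the kind of code that benefits. The file split, the names motor_get_gain and control_update, and the gain value are hypothetical. In a real project the two parts would live in separate .c files; they are shown in one listing only so the example compiles as a single file.

```c
#include <stdint.h>

/* ---- motor.c (hypothetical translation unit) ---- */
/* A tiny accessor: cheap to inline, but invisible to other
   translation units when each file is compiled separately. */
int32_t motor_get_gain(void)
{
    return 3;
}

/* ---- control.c (hypothetical translation unit) ---- */
int32_t motor_get_gain(void);   /* normally comes from motor.h */

/* Without LTO the compiler must emit a real call (BL) here,
   because it cannot see the body of motor_get_gain().
   With -flto and armlink --lto the optimizer sees both bodies
   and may fold this into a multiply by a constant. */
int32_t control_update(int32_t error)
{
    return error * motor_get_gain();
}
```

Whether the call actually disappears is up to the optimizer's inlining heuristics, so it is worth confirming in the disassembly or map file rather than assuming it.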

3️⃣ Veneers in Arm Compiler 6


On Cortex-M (Thumb-2):


Branch instructions (B, BL) have limited range: roughly ±16 MB for the 32-bit Thumb-2 encodings.


If the destination exceeds encoding limits, armlink inserts veneers.
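
As a rough illustration of what "exceeds encoding limits" means, the sketch below checks whether a direct call can reach its target. It assumes the 32-bit Thumb-2 BL encoding, whose offset spans about ±16 MB relative to the PC. The function name bl_out_of_range is hypothetical, and the real decision is made by armlink, not by application code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Thumb-2 BL (32-bit encoding) reaches -16 MiB .. +16 MiB - 2
   relative to the PC, which reads as the BL address + 4. */
#define BL_MIN_OFFSET (-16777216LL)
#define BL_MAX_OFFSET ( 16777214LL)

/* Returns true when a direct BL at call_site cannot reach target,
   i.e. when the call would have to be routed indirectly. */
bool bl_out_of_range(uint32_t call_site, uint32_t target)
{
    int64_t offset = (int64_t)target - ((int64_t)call_site + 4);
    return offset < BL_MIN_OFFSET || offset > BL_MAX_OFFSET;
}
```

For example, bl_out_of_range(0x08000100, 0x20000000) is true: with flash assumed at 0x08000000 (an illustrative address) and SRAM at 0x20000000, a flash-to-SRAM call is far outside direct BL range.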


In the map file, veneers appear as linker-generated entries. The exact format depends on the armlink version; schematically:

Veneer Table
  Veneer for motor_step
  Veneer for control_loop

Or:

$$Ven$$0001

Conceptually:

Caller → Veneer → Target

Each veneer introduces:

  • Extra branch

  • Pipeline refill

  • Extra flash consumption


On Cortex-M4:

Roughly a 2–5 cycle penalty per call through a veneer, depending on the veneer type and flash wait states.
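
A common way to meet a veneer in practice is to call a function placed in SRAM from code running in flash, since the two regions are typically much further apart than a direct BL can reach. The sketch below assumes a hypothetical section name .ramfunc that your scatter file maps into RAM. The RAM_FUNC macro and the function names are illustrative, not part of any Keil API.

```c
#include <stdint.h>

/* Place the function in a named section only when building for an
   M-profile Arm target; on a host build the macro expands to nothing. */
#if defined(__ARM_ARCH_PROFILE) && (__ARM_ARCH_PROFILE == 'M')
#define RAM_FUNC __attribute__((section(".ramfunc"), noinline))
#else
#define RAM_FUNC
#endif

/* Hypothetical time-critical routine, intended to execute from SRAM. */
RAM_FUNC int32_t motor_step(int32_t position)
{
    return position + 1;
}

/* Runs from flash. If motor_step() ends up out of BL range, the call
   path becomes caller -> veneer -> motor_step. */
int32_t control_loop_tick(int32_t position)
{
    return motor_step(position);
}
```

The noinline attribute is there on purpose: if the optimizer inlined motor_step into its flash-resident caller, the code would no longer run from RAM at all.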

4️⃣ How LTO Changes Veneer Generation (Keil Context)


Without LTO:

  • Each translation unit is compiled independently

  • Cross-module calls remain real calls

  • Long-distance calls are more likely

  • armlink inserts veneers where a call is out of range


With LTO:

  • Cross-module inlining removes some calls entirely

  • Related functions tend to end up closer together

  • The call graph is flattened

  • Fewer long-distance edges are left to need veneers

5️⃣ Map-File Analysis in Keil (Real Example Structure)


Generate the map file in Keil:

Project → Options for Target → Listing → Linker Listing (Memory Map)

Or via armlink:

armlink main.o motor.o --map --info veneers --info sizes --list=firmware.map

5.1 Image Component Sizes

Image Component Sizes

Code (inc. veneers):  0x0001F200
RO Data:              0x00001200
RW Data:              0x00000400
ZI Data:              0x00000800

After LTO:

Code (inc. veneers):  0x0001C800

Note the reduction: 0x1F200 − 0x1C800 = 0x2A00, i.e. 10,752 bytes (about 8.4%) less code.


5.2 Veneer Reporting

Use:

armlink --info veneers

Without LTO:

Veneers created:
  4 veneers inserted

With LTO:

Veneers created:
  1 veneer inserted
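
When comparing builds it can help to count veneer entries mechanically instead of by eye. The helper below is a small host-side sketch that counts the lines of a text buffer containing a given marker. The function name count_lines_with is hypothetical, and the marker string is left to the caller because the way veneers are labelled depends on the armlink version; check your own map file before choosing one.

```c
#include <stddef.h>
#include <string.h>

/* Counts the lines of `text` that contain `marker` at least once. */
size_t count_lines_with(const char *text, const char *marker)
{
    size_t count = 0;
    size_t marker_len = strlen(marker);
    if (marker_len == 0) {
        return 0;
    }

    while (*text != '\0') {
        const char *eol = strchr(text, '\n');
        size_t line_len = eol ? (size_t)(eol - text) : strlen(text);

        /* Search only inside the current line. */
        for (size_t i = 0; i + marker_len <= line_len; i++) {
            if (memcmp(text + i, marker, marker_len) == 0) {
                count++;
                break;
            }
        }
        /* Skip the line and its newline, if there is one. */
        text += line_len + (eol ? 1 : 0);
    }
    return count;
}
```

Reading the map file into memory is left out; running the same count over the non-LTO and LTO map files gives a quick before/after figure to compare against the linker's own veneer report.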

6️⃣ Execution-Time Implications (Cycle-Level View)

Consider an ISR:

void control_loop(void);   /* defined in another translation unit */

void SysTick_Handler(void) {
    control_loop();
}

Without LTO:

BL control_loop
  → veneer
    → branch

With LTO:

  • control_loop can be inlined into the handler

  • its arithmetic is embedded directly in the ISR body


Effects:

  • No branch

  • No veneer

  • No pipeline refill


In a high-frequency ISR (20 kHz):

Even a 5-cycle reduction matters.

That is 20,000 × 5 = 100,000 cycles saved per second.
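
The same arithmetic can be written down as a small helper, which makes it easy to plug in your own interrupt rate and clock. This is only a sketch: the function names are hypothetical, and the 100 MHz core clock used in the note below is an assumed figure, not one taken from a specific part.

```c
#include <stdint.h>

/* Cycles saved per second when each ISR invocation gets
   `cycles_per_call` cycles cheaper at an interrupt rate of `isr_hz`. */
uint64_t cycles_saved_per_second(uint32_t isr_hz, uint32_t cycles_per_call)
{
    return (uint64_t)isr_hz * cycles_per_call;
}

/* The same saving expressed in parts per million of a core clock
   running at `core_hz`. */
uint64_t saving_ppm(uint32_t isr_hz, uint32_t cycles_per_call, uint32_t core_hz)
{
    return cycles_saved_per_second(isr_hz, cycles_per_call) * 1000000u / core_hz;
}
```

At 20 kHz and 5 cycles, cycles_saved_per_second returns 100000. On an assumed 100 MHz core, saving_ppm returns 1000, i.e. about 0.1% of the CPU budget recovered from a single call site.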

7️⃣ Branch Relaxation in armlink


armlink performs:

  • Branch shortening

  • Stub insertion

  • Veneer creation


Without LTO:

  • Relaxation is reactive

  • Layout fixed


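With LTO:

  • Code generation sees the whole program (final placement is still decided by armlink)

  • Better proximity placement

  • Reduced need for stubs


This can improve:

  • Flash prefetch efficiency

  • I-cache locality (on parts with an instruction cache or flash accelerator; the Cortex-M4 core itself has none)

  • Sequential fetch behavior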

8️⃣ Determinism and WCET in Safety Context


In automotive / aerospace contexts:

  • Predictability > raw speed


Veneers are hidden control-flow elements: they exist only in the linked image, not in the source.

LTO can reduce:

  • Indirection

  • Stub insertion

  • Call depth


Map-file inspection becomes part of verification.


In Keil-based projects, this is critical for:

  • Stack-depth analysis

  • Timing certification

  • Toolchain qualification

9️⃣ Practical µCore₀₁ Workflow (Keil Edition)


  1. Build without LTO and keep the map file as a baseline

  2. Enable:

    • -O3

    • -flto (armclang)

    • --lto (armlink)

  3. Enable:

    • --map

    • --info veneers

    • --info sizes

  4. Compare:

    • Code size

    • Veneer count

    • Symbol presence

    • Function clustering


Optimization must be measured.
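
For the execution-time side, one way to measure on target is the DWT cycle counter found on Cortex-M3/M4/M7 class cores. The sketch below uses the architectural register addresses directly so it does not depend on a vendor header. The helper names are hypothetical, the counter is an optional feature that your device may not implement, and the host branch is only a stand-in so the code can be compiled off-target.

```c
#include <stdint.h>

#if defined(__ARM_ARCH_PROFILE) && (__ARM_ARCH_PROFILE == 'M')
/* Debug registers at their architectural addresses (Armv7-M). */
#define DEMCR      (*(volatile uint32_t *)0xE000EDFCu)
#define DWT_CTRL   (*(volatile uint32_t *)0xE0001000u)
#define DWT_CYCCNT (*(volatile uint32_t *)0xE0001004u)

void cycle_counter_init(void)
{
    DEMCR |= (1u << 24);   /* TRCENA: enable the DWT unit */
    DWT_CYCCNT = 0u;
    DWT_CTRL |= 1u;        /* CYCCNTENA: start counting   */
}

uint32_t cycle_counter_read(void) { return DWT_CYCCNT; }
#else
/* Host stand-in: every read advances a fake counter by 7. */
static uint32_t fake_cycles;
void cycle_counter_init(void) { fake_cycles = 0u; }
uint32_t cycle_counter_read(void) { return fake_cycles += 7u; }
#endif

/* Returns the cycles elapsed while fn() ran. Unsigned subtraction
   stays correct across a single counter wrap-around. */
uint32_t measure_cycles(void (*fn)(void))
{
    uint32_t start = cycle_counter_read();
    fn();
    return cycle_counter_read() - start;
}

/* Hypothetical workload used for illustration. */
void empty_workload(void) { }
```

Calling cycle_counter_init() once at startup and wrapping the code under test in measure_cycles() gives a per-build number to set next to the code size and veneer count. The result includes the overhead of the indirect call and the counter reads, so it is best used for comparing builds rather than as an absolute figure.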

🔟 µCore₀₁ Final Perspective


With Keil + armclang, LTO is not just:

“smaller binary”.


It reshapes:

  • Call graph topology

  • Code layout

  • Branch distance

  • Veneer generation

  • Execution-time characteristics

  • WCET stability


And the evidence is visible:

In your map file. In your veneer count. In your ISR cycle budget.

If you are not reading your map file, you are not really optimizing your firmware.



