🔬 Link Time Optimization in Embedded Systems

Alessandro Salvato
Mar 1
Reading time: 3 min

Whole-Program Optimization, Veneers, Map-File Forensics and Execution-Time Effects with Keil + armclang

In deeply embedded systems, optimization is not about squeezing synthetic benchmarks.

It is about:

Flash footprint
RAM usage
ISR latency
WCET predictability
Structural determinism

And if you are using Keil MDK with Arm Compiler 6, you are not working with generic GCC behavior — you are working with a highly optimized LLVM-based toolchain.

Understanding how LTO behaves in armclang — and how it changes veneers and map-file structure — is critical.

1️⃣ The Toolchain Context: Keil + armclang

Recent Keil MDK releases (later MDK 5 versions and MDK 6) use:

Arm Compiler 6
Frontend: LLVM/Clang
Backend: LLVM code generation with Arm-specific optimizations
Linker: armlink

When LTO is enabled in Keil:

IR is preserved across object files
Whole-program optimization happens inside armlink
Code generation is deferred

In Keil µVision:

Project → Options for Target → C/C++ → Optimization → Link Time Optimization

Or via command line:

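armclang --target=arm-arm-none-eabi -mcpu=cortex-m4 -O3 -flto -c main.c
armclang --target=arm-arm-none-eabi -mcpu=cortex-m4 -O3 -flto -c motor.c
armclang --target=arm-arm-none-eabi -mcpu=cortex-m4 -O3 -flto -c control.c
armlink --lto main.o motor.o control.o --map --list=firmware.map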

2️⃣ Separate Compilation vs LTO (armclang)

Without LTO:

armclang --target=arm-arm-none-eabi -mcpu=cortex-m4 -O3 -c main.c
armclang --target=arm-arm-none-eabi -mcpu=cortex-m4 -O3 -c motor.c
armlink main.o motor.o --map --list=firmware.map

Each .o contains final machine code.

The linker:

Resolves symbols
Inserts veneers
Does limited relaxation

But cannot semantically optimize.

With LTO:

armclang --target=arm-arm-none-eabi -mcpu=cortex-m4 -O3 -flto -c main.c
armclang --target=arm-arm-none-eabi -mcpu=cortex-m4 -O3 -flto -c motor.c
armlink --lto main.o motor.o --map --list=firmware.map

Now:

Objects contain LLVM IR
armlink merges IR
Full call graph is available
Whole-program optimization runs

This is not cosmetic.

It changes call structure, code layout and veneer behavior.

3️⃣ Veneers in Arm Compiler 6

On Cortex-M (Thumb-2):

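Branch instructions (B, BL) have limited range: the 32-bit Thumb-2 encodings reach roughly ±16 MB from the call site.

If the destination exceeds that range, for example a call from internal flash into RAM-resident or external-memory code, armlink inserts veneers.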
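
A quick way to reason about which calls need a veneer is to compute the branch offset yourself. The C sketch below assumes the 32-bit Thumb-2 BL encoding, whose signed, halfword-aligned offset covers about ±16 MB. The helper name bl_reaches and the addresses used with it are hypothetical and not part of any Arm tool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Offset limits of the 32-bit Thumb-2 BL encoding: a signed 25-bit,
 * halfword-aligned displacement, i.e. about +/-16 MB. */
#define BL_MIN_OFFSET (-16777216LL)
#define BL_MAX_OFFSET (16777214LL)

/* Returns true if a BL located at call_site can reach target directly,
 * without a veneer. Hypothetical helper for illustration only. */
bool bl_reaches(uint32_t call_site, uint32_t target)
{
    assert((call_site & 1u) == 0u); /* Thumb instructions are halfword-aligned */

    /* In Thumb state the PC reads as the instruction address plus 4.
     * Bit 0 of a function address is the Thumb bit, not part of the address. */
    int64_t pc = (int64_t)call_site + 4;
    int64_t dest = (int64_t)(target & ~UINT32_C(1));
    int64_t offset = dest - pc;

    return offset >= BL_MIN_OFFSET && offset <= BL_MAX_OFFSET;
}
```

With these assumed addresses, a call from flash at 0x08000100 to another flash function at 0x08001000 is in range, while a call to RAM-resident code at 0x20000000 is not, so the linker would have to route it through a veneer.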

In the map file you may see entries like these (the exact format depends on the armlink version):

Veneer Table
  Veneer for motor_step
  Veneer for control_loop

Or:

$$Ven$$0001

Conceptually:

Caller → Veneer → Target

Each veneer introduces:

Extra branch
Pipeline refill
Extra flash consumption

On Cortex-M4:

Roughly 2–5 cycles of penalty per call through a veneer, depending on veneer type and flash wait states.

4️⃣ How LTO Changes Veneer Generation (Keil Context)

Without LTO:

Functions are compiled independently
Call graph edges remain intact
Long-distance calls more likely
armlink inserts veneers

With LTO:

Cross-module inlining removes calls
Functions are clustered
Call graph flattened
Fewer long-distance edges

5️⃣ Map-File Analysis in Keil (Real Example Structure)

Generate map in Keil:

Project → Options for Target → Listing → Linker Listing → Memory Map

Or via armlink:

armlink --map --info veneers --info sizes

5.1 Image Component Sizes

Image Component Sizes

Code (inc. veneers):  0x0001F200
RO Data:              0x00001200
RW Data:              0x00000400
ZI Data:              0x00000800

After LTO:

Code (inc. veneers):  0x0001C800

Note the reduction: 0x2A00 bytes (10.5 KiB) of code.
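
The before/after comparison is simple arithmetic, sketched here in C. The input values are the example figures from the listings above; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes saved between two builds (0 if the second build is not smaller). */
uint32_t code_bytes_saved(uint32_t before_bytes, uint32_t after_bytes)
{
    return (before_bytes > after_bytes) ? (before_bytes - after_bytes) : 0u;
}

/* Saving in tenths of a percent of the original size. Integer math only,
 * so the helper also works on targets without an FPU. */
uint32_t code_saving_permille(uint32_t before_bytes, uint32_t after_bytes)
{
    assert(before_bytes > 0u); /* avoid division by zero */
    uint64_t saved = code_bytes_saved(before_bytes, after_bytes);
    return (uint32_t)((saved * 1000u) / before_bytes);
}
```

For 0x0001F200 before and 0x0001C800 after, the first helper returns 0x2A00 (10,752 bytes) and the second returns 84, i.e. a saving of roughly 8.4%.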

5.2 Veneer Reporting

Use:

armlink --info veneers

Without LTO:

Veneers created:
  4 veneers inserted

With LTO:

Veneers created:
  1 veneer inserted

6️⃣ Execution-Time Implications (Cycle-Level View)

Consider an ISR:

void SysTick_Handler(void) {
    control_loop();
}

Without LTO:

BL control_loop
  → veneer
    → branch

With LTO:

control_loop inlined
arithmetic directly embedded

Effects:

No branch
No veneer
No pipeline refill

In a high-frequency ISR (20 kHz):

Even a 5-cycle reduction matters.

That is 20,000 × 5 = 100,000 cycles saved per second.
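
The arithmetic behind that figure, as a small C sketch. The function names are hypothetical, and the 168 MHz core clock used in the note after the code is an assumed value, not something stated in this article.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles saved per second when each ISR invocation gets cheaper. */
uint64_t isr_cycles_saved_per_second(uint32_t isr_rate_hz,
                                     uint32_t cycles_saved_per_call)
{
    return (uint64_t)isr_rate_hz * cycles_saved_per_call;
}

/* The same saving expressed in parts per million of the CPU cycle budget. */
uint32_t isr_saving_ppm(uint32_t isr_rate_hz,
                        uint32_t cycles_saved_per_call,
                        uint32_t cpu_hz)
{
    assert(cpu_hz > 0u); /* avoid division by zero */
    uint64_t saved = isr_cycles_saved_per_second(isr_rate_hz, cycles_saved_per_call);
    return (uint32_t)((saved * 1000000u) / cpu_hz);
}
```

At 20 kHz and 5 cycles per call the first function returns 100,000. On an assumed 168 MHz core that is about 595 ppm of the cycle budget, so the gain shows up mainly as a shorter and more predictable ISR path rather than as raw throughput.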

Branch Relaxation in armlink

armlink performs:

Branch shortening
Stub insertion
Veneer creation

Without LTO:

Relaxation is reactive
Layout fixed

With LTO:

Code generation sees the whole program, not one translation unit at a time
Better proximity placement
Reduced need for stubs

This improves:

Flash prefetch efficiency
I-cache locality (on parts that have an instruction cache)
Sequential fetch behavior

Determinism and WCET in Safety Context

In automotive / aerospace contexts:

Predictability > raw speed

Veneers are hidden control-flow elements.

LTO reduces:

Indirection
Stub insertion
Call depth

Map-file inspection becomes part of verification.

In Keil-based projects, this is critical for:

Stack-depth analysis
Timing certification
Toolchain qualification

Practical µCore₀₁ Workflow (Keil Edition)

Build without LTO (baseline)
Rebuild with LTO enabled:
- -O3
- -flto
- armlink --lto
Enable in both builds:
- --map
- --info veneers
- --info sizes
Compare the two map files:
- Code size
- Veneer count
- Symbol presence
- Function clustering
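
One way to make the comparison repeatable is to record the two headline numbers per build and check them in code. This is a minimal C sketch under stated assumptions: the struct, its field names and the function are hypothetical, and the values are expected to be copied from the map files by hand or by a separate script.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Metrics taken from one map file. Hypothetical layout for illustration. */
struct build_metrics {
    uint32_t code_bytes;   /* code total from the image component sizes */
    uint32_t veneer_count; /* number of veneers reported by --info veneers */
};

/* True if the LTO build is no larger and needs no more veneers
 * than the baseline build. */
bool lto_build_is_no_worse(const struct build_metrics *baseline,
                           const struct build_metrics *lto)
{
    assert(baseline != NULL && lto != NULL);
    return lto->code_bytes <= baseline->code_bytes &&
           lto->veneer_count <= baseline->veneer_count;
}
```

With the example figures from this article (0x0001F200 bytes and 4 veneers before, 0x0001C800 bytes and 1 veneer after) the check passes. Symbol presence and function clustering still need a manual read of the map file.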

Optimization must be measured.

µCore₀₁ Final Perspective

With Keil + armclang, LTO is not just:

“smaller binary”.

It reshapes:

Call graph topology
Code layout
Branch distance
Veneer generation
Execution-time characteristics
WCET stability

And the evidence is visible:

In your map file. In your veneer count. In your ISR cycle budget.

If you are not reading your map file, you are not really optimizing your firmware.
