🔬 Link Time Optimization in Embedded Systems
- Alessandro Salvato
- Mar 1
Whole-Program Optimization, Veneers, Map-File Forensics and Execution-Time Effects with Keil + armclang
In deeply embedded systems, optimization is not about squeezing synthetic benchmarks.
It is about:
Flash footprint
RAM usage
ISR latency
WCET predictability
Structural determinism
And if you are using Keil MDK with Arm Compiler 6, you are not working with generic GCC behavior — you are working with a highly optimized LLVM-based toolchain.
Understanding how LTO behaves in armclang — and how it changes veneers and map-file structure — is critical.
1️⃣ The Toolchain Context: Keil + armclang
Recent Keil MDK releases (later v5 versions and MDK 6) use:
Arm Compiler 6
Frontend: LLVM/Clang
Backend: ARM proprietary optimizations
Linker: armlink
When LTO is enabled in Keil:
IR is preserved across object files
Whole-program optimization happens inside armlink
Code generation is deferred
In Keil µVision:
Project → Options for Target → C/C++ → Optimization → Link Time Optimization
Or via command line (repeat the compile step for motor.c and control.c; armclang requires an explicit --target):
armclang --target=arm-arm-none-eabi -O3 -flto -mcpu=cortex-m4 -c main.c
armlink --lto main.o motor.o control.o --map --list=firmware.map
2️⃣ Separate Compilation vs LTO (armclang)
Without LTO:
armclang --target=arm-arm-none-eabi -O3 -mcpu=cortex-m4 -c main.c
armclang --target=arm-arm-none-eabi -O3 -mcpu=cortex-m4 -c motor.c
armlink main.o motor.o --map --list=firmware.map
Each .o contains final machine code.
The linker:
Resolves symbols
Inserts veneers
Does limited relaxation
But cannot semantically optimize.
With LTO:
armclang --target=arm-arm-none-eabi -O3 -flto -mcpu=cortex-m4 -c main.c
armclang --target=arm-arm-none-eabi -O3 -flto -mcpu=cortex-m4 -c motor.c
armlink --lto main.o motor.o --map --list=firmware.map
Now:
Objects contain LLVM IR
armlink merges IR
Full call graph is available
Whole-program optimization runs
This is not cosmetic.
It changes call structure, code layout and veneer behavior.
3️⃣ Veneers in Arm Compiler 6
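On Cortex-M (Thumb-2):
Branch instructions (B, BL) have limited range: roughly ±16 MB for the 32-bit BL encoding.
If the destination exceeds the encoding limit, for example a call from flash into code placed in RAM or external memory, armlink inserts a veneer.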
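The range check behind that decision can be sketched in a few lines of C. This is only an illustration: it assumes the 32-bit Thumb-2 BL encoding with its signed, halfword-aligned offset of about ±16 MB, and the helper name `bl_in_range` and the sample addresses are hypothetical. armlink's real logic also has to deal with interworking and section placement.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Reach of the 32-bit Thumb-2 BL encoding (signed 25-bit byte offset, bit 0 always 0). */
#define BL_MIN_OFFSET (-16777216LL) /* -16 MB */
#define BL_MAX_OFFSET (16777214LL)  /* +16 MB - 2 */

/* Hypothetical helper: true when a BL placed at call_site can reach target
 * directly, false when a veneer or another long-branch sequence is needed. */
bool bl_in_range(uint32_t call_site, uint32_t target)
{
    /* Thumb instructions are halfword aligned. */
    assert((call_site & 1u) == 0u);

    /* In Thumb state the PC used for the offset is the instruction address + 4. */
    int64_t offset = (int64_t)target - ((int64_t)call_site + 4);
    return offset >= BL_MIN_OFFSET && offset <= BL_MAX_OFFSET;
}
```

With these assumptions, a call between two functions inside the same flash bank is in range, while a call from flash at 0x08000100 to a routine at 0x20000000 in SRAM is not, which is the typical situation where a veneer appears on a small Cortex-M part.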
In the map file you may see entries like:
Veneer Table
Veneer for motor_step
Veneer for control_loop
Or:
$$Ven$$0001
Conceptually:
Caller → Veneer → Target
Each veneer introduces:
Extra branch
Pipeline refill
Extra flash consumption
On Cortex-M4:
Roughly a 2–5 cycle penalty per call through a veneer, depending on flash wait states.
4️⃣ How LTO Changes Veneer Generation (Keil Context)
Without LTO:
Functions are compiled independently
Call graph edges remain intact
Long-distance calls more likely
armlink inserts veneers
With LTO:
Cross-module inlining removes calls
Functions are clustered
Call graph flattened
Fewer long-distance edges
5️⃣ Map-File Analysis in Keil (Real Example Structure)
Generate map in Keil:
Project → Options → Linker → Listing → Generate Map File
Or via armlink:
armlink --map --info veneers --info sizes
5.1 Image Component Sizes
Image Component Sizes
Code (inc. veneers): 0x0001F200
RO Data: 0x00001200
RW Data: 0x00000400
ZI Data: 0x00000800
After LTO:
Code (inc. veneers): 0x0001C800
Notice the reduction.
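To put a number on the difference between the two "Code (inc. veneers)" totals, here is a minimal sketch in C. The function names are hypothetical and the two values are simply the ones from the listing above; integer math is used so the same helper could run on a target without an FPU.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes saved between two "Code (inc. veneers)" totals. */
uint32_t code_bytes_saved(uint32_t before, uint32_t after)
{
    assert(before >= after); /* this sketch only handles a reduction */
    return before - after;
}

/* Saving in hundredths of a percent of the original size (rounds down). */
uint32_t code_saving_centipercent(uint32_t before, uint32_t after)
{
    assert(before != 0u);
    return (uint32_t)(((uint64_t)code_bytes_saved(before, after) * 10000u) / before);
}
```

For 0x0001F200 before and 0x0001C800 after, this gives 0x2A00 (10,752) bytes saved, about 8.4% of the original code size.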
5.2 Veneer Reporting
Use:
armlink --info veneers
Without LTO:
Veneers created:
4 veneers inserted
With LTO:
Veneers created:
1 veneer inserted
6️⃣ Execution-Time Implications (Cycle-Level View)
Consider an ISR:
void SysTick_Handler(void) {
    control_loop();
}
Without LTO (with control_loop out of direct branch range):
BL control_loop
→ veneer
→ branch
With LTO:
control_loop inlined
arithmetic directly embedded
Effects:
No branch
No veneer
No pipeline refill
In a high-frequency ISR (20 kHz), even a 5-cycle reduction matters:
20,000 interrupts per second × 5 cycles = 100k cycles saved per second.
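The same arithmetic can be written down as a small C helper, which makes it easy to plug in your own ISR rate and measured saving. This is a sketch, not a measurement: the function names are hypothetical, and the 100 MHz core clock used in the note below is an assumed figure, not something stated for any particular part.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles saved per second when every ISR invocation gets shorter by cycles_per_call. */
uint64_t cycles_saved_per_second(uint32_t isr_rate_hz, uint32_t cycles_per_call)
{
    return (uint64_t)isr_rate_hz * cycles_per_call;
}

/* The same saving expressed in parts per million of the core clock. */
uint64_t saving_ppm_of_clock(uint32_t isr_rate_hz, uint32_t cycles_per_call,
                             uint32_t core_clock_hz)
{
    assert(core_clock_hz != 0u);
    return cycles_saved_per_second(isr_rate_hz, cycles_per_call) * 1000000u / core_clock_hz;
}
```

At 20 kHz and 5 cycles per call this returns 100,000 cycles per second. Against an assumed 100 MHz clock that is 1,000 ppm of the cycle budget; how much it matters in practice depends on the clock and on how tight the ISR deadline is.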
Branch Relaxation in armlink
armlink performs:
Branch shortening
Stub insertion
Veneer creation
Without LTO:
Relaxation is reactive
Layout fixed
With LTO:
Code generation sees the whole program
Better proximity placement
Reduced need for stubs
This improves:
Flash prefetch efficiency
I-cache locality (on parts that have a cache or flash accelerator; the Cortex-M4 core itself has no I-cache)
Sequential fetch behavior
Determinism and WCET in Safety Context
In automotive / aerospace contexts:
Predictability > raw speed
Veneers are hidden control-flow elements.
LTO reduces:
Indirection
Stub insertion
Call depth
Map-file inspection becomes part of verification.
In Keil-based projects, this is critical for:
Stack-depth analysis
Timing certification
Toolchain qualification
Practical µCore₀₁ Workflow (Keil Edition)
Build without LTO
Enable:
-O3
-flto
armlink --lto
Enable:
--map
--info veneers
--info sizes
Compare:
Code size
Veneer count
Symbol presence
Function clustering
Optimization must be measured.
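One way to make the veneer comparison repeatable is to count veneer entries in the two map files instead of eyeballing them. The sketch below is a generic line counter in C; it assumes the map text has already been read into memory and that veneer entries look like the "Veneer for ..." lines shown earlier, which is an assumption about the listing format rather than a documented armlink guarantee. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts the lines of text that contain needle at least once. */
size_t count_lines_containing(const char *text, const char *needle)
{
    assert(text != NULL && needle != NULL && needle[0] != '\0');

    size_t count = 0;
    size_t needle_len = strlen(needle);
    const char *line = text;

    while (*line != '\0') {
        const char *end = strchr(line, '\n');
        size_t len = (end != NULL) ? (size_t)(end - line) : strlen(line);

        /* strstr searches to the end of the buffer, so accept a hit only
         * if it lies entirely inside the current line. */
        const char *hit = strstr(line, needle);
        if (hit != NULL && (size_t)(hit - line) + needle_len <= len) {
            count++;
        }

        line += len;
        if (*line == '\n') {
            line++; /* skip the newline and move to the next line */
        }
    }
    return count;
}
```

Running it over the baseline map and the LTO map with the same search string gives two counts that can be logged next to the code-size totals for each build.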
µCore₀₁ Final Perspective
With Keil + armclang, LTO is not just:
“smaller binary”.
It reshapes:
Call graph topology
Code layout
Branch distance
Veneer generation
Execution-time characteristics
WCET stability
And the evidence is visible in the map file.
If you are not reading your map file, you are not really optimizing your firmware.


