# **Experiment with Linux and ARM Thumb-2 ISA**

Philippe Robin ARM Ltd.



# **Summary**

- ARM Roadmap and Processor Families
- Performance vs Code Size and ISA selection process
- Thumb-2 encoding and new instructions
- Changes in the Linux kernel
- Size reduction with kernel, libraries and applications
- Exception handler example
- Summary

## **ARM Activities**



# **Linux and ARM Processor Roadmap**



## **Processor Families**




### The Performance vs. Code Size Dilemma

- The 16-bit Thumb ISA was created by analysing the 32-bit ARM instruction set and deriving the best-fit 16-bit instruction set, thus reducing code size
  - Users are required to "blend" the instruction sets by compiling performance-critical code to ARM and the rest to Thumb
- But manual code blending is not optimal
  - Requires profiling
  - Modifications can reduce performance
  - Best results are only obtained near the end of the project
  - Difficult to manage in distributed development



A "blended ISA" is a better solution

## The ISA Selection Process



The ARM Thumb-2 core technology combines 16- and 32-bit instructions in a single instruction set and allows programmers / compilers to freely mix the instructions together without mode switching.

# **Thumb-2 Encoding**



Halfword pairs (hw1, hw2) of instructions are inserted into the Thumb (thm) instruction stream.

The encodings selected are compatible with the existing Thumb BL and BLX instructions:

```
              hw1                  hw2
Thumb BL{X}:  11110 offset[22:12]  111n1 offset[11:1]
T-2 BL{X}:    11110 offset[22:12]  11AnB offset[11:1]
```

Two extra offset bits are generated by XORing the A and B bits with Offset[22] and inverting the result. When A = B = 1 both extra bits equal Offset[22], so the offset is simply sign-extended, which ensures backwards compatibility with the existing instructions.
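The offset reconstruction can be sketched in a few lines of Python. This is a minimal model, not a full decoder: the function names are illustrative, the halfwords are assumed to be a BL pair already (no opcode checking), and the extra bits are computed as NOT(A XOR Offset[22]), which is what makes A = B = 1 reduce to plain sign extension.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def decode_bl_offset(hw1, hw2):
    """Return the byte offset encoded by a Thumb-2 BL halfword pair.

    Assumes hw1/hw2 really are a BL pair; no opcode validation is done.
    """
    s = (hw1 >> 10) & 1      # Offset[22], the sign bit
    hi = hw1 & 0x3FF         # Offset[21:12]
    a = (hw2 >> 13) & 1      # "A" bit (fixed to 1 in the original Thumb BL)
    b = (hw2 >> 11) & 1      # "B" bit (fixed to 1 in the original Thumb BL)
    lo = hw2 & 0x7FF         # Offset[11:1]
    extra1 = 1 - (a ^ s)     # inverted XOR: equals s when a == 1
    extra2 = 1 - (b ^ s)     # inverted XOR: equals s when b == 1
    raw = (s << 24) | (extra1 << 23) | (extra2 << 22) | (hi << 12) | (lo << 1)
    return sign_extend(raw, 25)


# A legacy Thumb BL (A = B = 1) with all offset bits set decodes to -4.
print(decode_bl_offset(0xF7FF, 0xFFFE))
# Clearing A with a positive offset reaches beyond the old 4 MB range.
print(hex(decode_bl_offset(0xF000, 0xD800)))
```

With A = B = 1 the result matches the 23-bit sign-extended offset of the original encoding; other A/B combinations extend the branch range.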

## **Thumb-2 32-bit Instructions**

#### ARM-like

- Data Processing Instructions
- DSP and Media instructions
- Load and Store instructions
- Branch instructions
- System control BXJ, RFE, SRS etc.
- Coprocessor (VFP, MOVE™, etc.)

#### New

- Bitfield insert/extract/clear BFI, {S|U}BFX, BFC
- Bit reverse RBIT
- 16-bit immediate instructions MOVW, MOVT
- Table branch TB{B|H} [Rbase, Rindex]
- Additional memory system hints (PLI)



## **Thumb-2 Move 16-bit Constant**

Two 32-bit instructions load a 32-bit constant, one instruction for each halfword

- Replaces one 32-bit instruction and a 32-bit literal (ARM) or one 16-bit instruction and a 32-bit literal (Thumb)
  - A single MOVW is enough in the majority of cases
- Reduces the size of literal pools
- Reduces data accesses to I-TCM via the D-side for constant loads (~5X)

```
MOVW Rd,#imm16
Rd = ZeroExtend( imm16 )
```

```
MOVT Rd,#imm16
Rd[31:16] = imm16 // Rd[15:0] unaffected
```
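As a rough illustration of the two pseudo-code fragments above, the following Python sketch models the register effects of MOVW and MOVT and shows how a compiler might combine them. The helper `load_constant` is hypothetical and only mirrors the "single MOVW in most cases" observation: it skips MOVT when the upper halfword is zero.

```python
def movw(imm16):
    """MOVW Rd, #imm16: Rd = ZeroExtend(imm16)."""
    return imm16 & 0xFFFF


def movt(rd, imm16):
    """MOVT Rd, #imm16: write Rd[31:16], leave Rd[15:0] unchanged."""
    return ((imm16 & 0xFFFF) << 16) | (rd & 0xFFFF)


def load_constant(value):
    """Hypothetical helper: build a 32-bit constant with MOVW (+ MOVT)."""
    rd = movw(value & 0xFFFF)
    if value >> 16:                 # MOVT only needed for a non-zero top half
        rd = movt(rd, value >> 16)
    return rd


print(hex(load_constant(0xDEADBEEF)))
```

No literal pool entry is involved, which is the point of the instruction pair: the constant lives entirely in the instruction stream.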

## **Thumb-2 Bit Field Instructions**

#### Allow insertion and extraction of signed/unsigned bit fields

- Provides better handling of packed structures
- Replaces bit mask and shift operations

BFC, BFI, SBFX, UBFX

| ARM\* or Thumb-2                 | ARM                            |
|----------------------------------|--------------------------------|
| BFI R0, R1, #bitpos, #fieldwidth | AND R2, R1, #bitmask           |
|                                  | BIC R0, R0, #bitmask << bitpos |
|                                  | ORR R0, R0, R2, LSL #bitpos    |
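The equivalence between the single BFI and the three-instruction AND/BIC/ORR sequence can be checked with a small Python model. This is a sketch of the register-level behaviour only (32-bit registers, no flag effects); the function names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def bfi(rd, rn, lsb, width):
    """BFI: copy the low `width` bits of rn into rd starting at bit `lsb`."""
    mask = ((1 << width) - 1) << lsb
    return ((rd & ~mask) | ((rn << lsb) & mask)) & MASK32


def bfi_three_step(rd, rn, lsb, width):
    """The AND / BIC / ORR sequence that BFI replaces."""
    bitmask = (1 << width) - 1
    r2 = rn & bitmask                      # AND R2, R1, #bitmask
    rd = rd & ~(bitmask << lsb) & MASK32   # BIC R0, R0, #bitmask << bitpos
    return (rd | (r2 << lsb)) & MASK32     # ORR R0, R0, R2, LSL #bitpos


def bfc(rd, lsb, width):
    """BFC: clear `width` bits of rd starting at bit `lsb`."""
    return rd & ~(((1 << width) - 1) << lsb) & MASK32


def ubfx(rn, lsb, width):
    """UBFX: extract a field and zero-extend it."""
    return (rn >> lsb) & ((1 << width) - 1)


def sbfx(rn, lsb, width):
    """SBFX: extract a field and sign-extend it to 32 bits."""
    field = ubfx(rn, lsb, width)
    sign = 1 << (width - 1)
    return ((field ^ sign) - sign) & MASK32


print(hex(bfi(0xFFFFFFFF, 0x5, 8, 4)))
```

Both `bfi` and `bfi_three_step` give the same result for any field that fits in the register, which is why the single instruction is a drop-in replacement.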

## **Thumb-2 Table Branch Instructions**

New Base + Offset Branching mechanism for switch statements generates branch targets directly from a table of destination offsets



- Thumb-2 code size as small or smaller than Thumb -Ospace
- Thumb-2 code performance as fast as ARM -Otime
- Thumb-2 code executes in a single instruction and uses a packed table
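A minimal Python sketch of the table-branch address calculation follows. It assumes the usual TBB/TBH behaviour: the table holds unsigned byte (TBB) or halfword (TBH) entries, each entry is doubled and added to the PC value seen by the instruction. The memory image, little-endian layout and addresses below are made up for illustration.

```python
import struct


def tbb(pc, memory, base, index):
    """TBB [Rbase, Rindex]: branch target from a table of byte offsets."""
    return pc + 2 * memory[base + index]


def tbh(pc, memory, base, index):
    """TBH [Rbase, Rindex, LSL #1]: branch target from halfword offsets."""
    (half,) = struct.unpack_from("<H", memory, base + 2 * index)
    return pc + 2 * half


# Hypothetical packed table: four one-byte entries for a 4-way switch.
table = bytes([2, 5, 9, 0])
print(hex(tbb(0x8004, table, 0, 1)))
```

Because each entry is one or two bytes rather than a full 32-bit address, the table is much smaller than an ARM jump table of absolute addresses.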

## **New Thumb-2 Flow Control Instructions**

#### Compare and Branch

CBZ Rn, <label> CBNZ Rn, <label>

Optimises for the common case of "Branch If Zero" or "Branch If Non-Zero"

| ARM                    | Thumb-2          | Thumb                  |
|------------------------|------------------|------------------------|
| CMP r0, #0             | CBZ r0, ln       | CMP r0, #0             |
| BEQ ln                 |                  | BEQ ln                 |
| 8 Bytes, 1 or 2 cycles | 2 Bytes, 1 cycle | 4 Bytes, 1 or 2 cycles |

#### If-Then Conditional

IT{x{y{z}}} <cond>

- The If-Then (IT) instruction causes the next 1-4 instructions in memory to be conditional
- Allows short conditional execution bursts in 16-bit instruction set

| ARM                | Thumb-2                 | Thumb                    |
|--------------------|-------------------------|--------------------------|
| LDREQ r0, [r1]     | ITETE EQ                | BNE l1                   |
| LDRNE r0, [r2]     | LDREQ r0, [r1]          | LDR r0, [r1]             |
| ADDEQ r0, r3, r0   | LDRNE r0, [r2]          | ADD r0, r3, r0           |
| ADDNE r0, r4, r0   | ADDEQ r0, r3, r0        | B l2                     |
|                    | ADDNE r0, r4, r0        | l1 LDR r0, [r2]          |
|                    |                         | ADD r0, r4, r0           |
|                    |                         | l2                       |
| 16 Bytes, 4 cycles | 10 Bytes, 4 or 5 cycles | 12 Bytes, 4 to 20 cycles |
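How the IT mnemonic maps onto per-instruction conditions can be sketched as follows. This is an illustrative Python helper (the function name and table are not from any toolchain): the first instruction always takes the stated condition, each following `T` repeats it and each `E` takes its inverse.

```python
# Inverse of each condition code usable in an IT block (AL has no inverse).
OPPOSITE = {
    "EQ": "NE", "NE": "EQ", "CS": "CC", "CC": "CS",
    "MI": "PL", "PL": "MI", "VS": "VC", "VC": "VS",
    "HI": "LS", "LS": "HI", "GE": "LT", "LT": "GE",
    "GT": "LE", "LE": "GT",
}


def it_conditions(mnemonic, cond):
    """Return the condition applied to each instruction in an IT block."""
    mnemonic, cond = mnemonic.upper(), cond.upper()
    if not mnemonic.startswith("IT") or len(mnemonic) > 5:
        raise ValueError("expected IT followed by at most three T/E flags")
    conditions = [cond]                      # first slot is always <cond>
    for flag in mnemonic[2:]:
        if flag == "T":
            conditions.append(cond)          # Then: same condition
        elif flag == "E":
            conditions.append(OPPOSITE[cond])  # Else: inverse condition
        else:
            raise ValueError("flags must be T or E")
    return conditions


print(it_conditions("ITETE", "EQ"))
```

For the `ITETE EQ` example in the table this yields EQ, NE, EQ, NE, matching the LDREQ / LDRNE / ADDEQ / ADDNE sequence.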

# **Thumb-2 Compiled Code Size**



Thumb-2 Performance Optimized: 26% smaller than ARM



Thumb-2 Space Optimized: 32% smaller than ARM

## **Thumb-2 Performance**

Analysis of code performance for EEMBC\* benchmarks on ARM11-like cores





- Thumb-2 performance is 98% of ARM performance
- Thumb-2 code achieves 125% of Thumb performance

\* Based on uncertified EEMBC benchmark results; shows relative performance ONLY

# Thumb-2 – Changes to Linux Kernel

- A new control bit has been introduced with ARMv7 to control whether exceptions are taken in ARM or Thumb state
  - Modified Interrupt and Exception handling code accordingly
- Most 32-bit Thumb instructions are unconditional (whereas most ARM instructions can be conditional)
- Many changes are due to adding unified syntax and flow control instructions
  - Use of If-Then (IT) instruction for instance
- There is no increase in the number of general purpose or special purpose registers, and no increase in register sizes
- Most Thumb 32-bit instructions cannot use the PC as a source or destination register.
- BL and BLX instructions are treated as 32-bit instructions instead of two 16-bit instructions
  - Note that 32-bit Thumb instructions can only take exceptions on their start address
- New T variants of LDR, STR
- New variants of LDREX and STREX
  - Thumb-2 has B, H, and D (Byte, Halfword, and Doubleword) variants

# **ARM vs Thumb-2 Memory Footprint**

- Using GCC 4.1 with the -O2 option
  - Average 20% size reduction on common libraries
  - Kernel is 29% smaller in Thumb-2 compared to ARM

|                   | ARM mode (bytes) | Thumb-2 mode (bytes) | Ratio |
|-------------------|------------------|----------------------|-------|
| libc-2.3.6.so     | 1123552          | 824544               | 73%   |
| libm-2.3.6.so     | 669496           | 542520               | 81%   |
| 2.6.19 kernel     | 1019832          | 724888               | 71%   |
| MPlayer (dynamic) | 5793064          | 5619000              | 96%   |
| MPlayer (static)  | 6707792          | 5176036              | 77%   |
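The Ratio column can be reproduced from the two size columns. The short Python check below is only a sanity sketch; it assumes the ratio is Thumb-2 size divided by ARM size with the fractional part truncated, which is the rounding that matches every row of the table.

```python
# (ARM size, Thumb-2 size) taken from the table above.
SIZES = {
    "libc-2.3.6.so": (1123552, 824544),
    "libm-2.3.6.so": (669496, 542520),
    "2.6.19 kernel": (1019832, 724888),
    "MPlayer (dynamic)": (5793064, 5619000),
    "MPlayer (static)": (6707792, 5176036),
}


def ratio_percent(arm_size, thumb2_size):
    """Thumb-2 size as a percentage of ARM size, truncated to an integer."""
    return thumb2_size * 100 // arm_size


for name, (arm_size, thumb2_size) in SIZES.items():
    print(f"{name:18} {ratio_percent(arm_size, thumb2_size)}%")
```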

# Sample - Exception Handler in Thumb-2

```
        .macro  vector_stub, name, mode, correction=0
vector_\name:
        .if \correction
        sub.w   lr, lr, #\correction
        .endif
        @ save lr_<exception> (parent PC) and spsr_<exception>
        @ (parent CPSR) to the SVC stack
        srsdb   sp, #SVC_MODE
        @ Switch to SVC32 mode, save sp and lr and set up the stack.
        @ IRQs remain disabled.
        mrs     lr, cpsr
        eor.w   lr, lr, #(\mode ^ SVC_MODE)
        msr     cpsr_cxsf, lr
        @ may be overwritten by the usr handlers
        str.w   sp, [sp, #(S_SP - S_FRAME_SIZE)] @ save sp_svc to the SVC stack
        str.w   lr, [sp, #(S_LR - S_FRAME_SIZE)] @ save lr_svc to the SVC stack
        sub.w   sp, sp, #S_FRAME_SIZE
        @ the branch table must immediately follow this code
        ldr.w   lr, [sp, #S_PSR]        @ read the saved spsr_<exception>
        and.w   lr, lr, #0x0f
        add.w   lr, pc, lr, lsl #2      @ address in the branch table
        ldr.w   pc, [lr, #4]            @ branch to handler in SVC mode
        .endm
[...]
        .macro  svc_entry
        stmia   sp, {r0 - r12}
        .endm
__dabt_svc:
        svc_entry
        add     r0, sp, #S_PC
        @ get ready to re-enable interrupts if appropriate
        mrs     r9, cpsr
        tst     r3, #PSR_I_BIT
        biceq   r9, r9, #PSR_I_BIT
        @ Call the processor-specific abort handler:
        @  r2 - aborted context pc
        @  r3 - aborted context cpsr
        @ The abort handler must return the aborted address in r0, and
        @ the fault status register in r1. r9 must be preserved.
        ldmia   r0, {r2, r3}            @ load the lr_<exception> and spsr_<exception>
#ifdef MULTI_ABORT
        ldr     r4, .LCprocfns
        mov     lr, pc
        ldr     pc, [r4]
#else
        bl      CPU_ABORT_HANDLER
#endif
        @ set desired IRQ state, then call main handler
        msr     cpsr_c, r9
        mov     r2, sp
        bl      do_DataAbort
        @ IRQs off again before pulling preserved data off the stack
        disable_irq
        @ restore the registers and restart the instruction
        ldmia   sp, {r0-r12}
        ldr     lr, [sp, #S_LR]
        add     sp, sp, #S_PC
        rfeia   sp!                     @ restore pc, cpsr
```

# Sample – Exception Handler in ARM

```
        .macro  vector_stub, name, mode, correction=0
vector_\name:
        .if \correction
        sub     lr, lr, #\correction
        .endif
        @ Save r0, lr_<exception> (parent PC) and spsr_<exception> (parent CPSR)
        stmia   sp, {r0, lr}            @ save r0, lr
        mrs     lr, spsr
        str     lr, [sp, #8]            @ save spsr
        @ Prepare for SVC32 mode. IRQs remain disabled.
        mrs     r0, cpsr
        eor     r0, r0, #(\mode ^ SVC_MODE)
        msr     spsr_cxsf, r0
        @ the branch table must immediately follow this code
        and     lr, lr, #0x0f
        mov     r0, sp
        ldr     lr, [pc, lr, lsl #2]
        movs    pc, lr                  @ branch to handler in SVC mode
        .endm
[...]
        .macro  svc_entry
        sub     sp, sp, #S_FRAME_SIZE
        stmib   sp, {r1 - r12}
        ldmia   r0, {r1 - r3}
        add     r5, sp, #S_SP           @ here for interlock avoidance
        mov     r4, #-1
        add     r0, sp, #S_FRAME_SIZE
        str     r1, [sp]                @ save the "real" r0 copied
                                        @ from the exception stack
        mov     r1, lr
        @ We are now ready to fill in the remaining blanks on the stack:
        @  r0 - sp_svc
        @  r1 - lr_svc
        @  r2 - lr_<exception>, already fixed up for correct return/restart
        @  r3 - spsr_<exception>
        @  r4 - orig_r0 (see pt_regs definition in ptrace.h)
        stmia   r5, {r0 - r4}
        .endm
__dabt_svc:
        svc_entry
        @ get ready to re-enable interrupts if appropriate
        mrs     r9, cpsr
        tst     r3, #PSR_I_BIT
        biceq   r9, r9, #PSR_I_BIT
        @ Call the processor-specific abort handler:
        @  r2 - aborted context pc
        @  r3 - aborted context cpsr
        @ The abort handler must return the aborted address in r0, and
        @ the fault status register in r1. r9 must be preserved.
#ifdef MULTI_ABORT
        ldr     r4, .LCprocfns
        mov     lr, pc
        ldr     pc, [r4]
#else
        bl      CPU_ABORT_HANDLER
#endif
        @ set desired IRQ state, then call main handler
        msr     cpsr_c, r9
        mov     r2, sp
        bl      do_DataAbort
        @ IRQs off again before pulling preserved data off the stack
        disable_irq
        @ restore SPSR and restart the instruction
        ldr     r0, [sp, #S_PSR]
        msr     spsr_cxsf, r0
        ldmia   sp, {r0 - pc}^          @ load r0 - pc, cpsr
```
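Both versions of the vector stub end the same way: the low four bits of the saved SPSR (the mode the exception came from) index a 16-entry branch table that follows the stub. The Python sketch below models only that selection step. The table contents and handler names are illustrative assumptions, not taken from the listings; the only facts relied on are that User mode is 0x10 and Supervisor mode is 0x13, so they land in slots 0 and 3.

```python
MODE_MASK = 0x0F  # same mask as "and lr, lr, #0x0f" in the stub


def select_handler(saved_spsr, branch_table):
    """Pick the handler for the mode the exception was taken from."""
    return branch_table[saved_spsr & MODE_MASK]


# Hypothetical 16-entry table for data aborts: slot 0 is User mode,
# slot 3 is Supervisor mode, everything else is treated as invalid here.
dabt_table = ["dabt_invalid"] * 16
dabt_table[0] = "dabt_usr"
dabt_table[3] = "dabt_svc"

print(select_handler(0x60000013, dabt_table))
```

The ARM stub performs the lookup with a single PC-relative `ldr`, while the Thumb-2 stub first forms the table address with `add.w` and then loads the PC from it, since most 32-bit Thumb instructions cannot use the PC as a base with a shifted register offset.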

# **Summary**

- Thumb-2 core technology improves both ARM and Thumb ISAs to increase system performance and reduce cost.
- Thumb-2 core technology extends the Thumb ISA to provide a blended instruction set.
  - Average 20% better code density than ARM for Linux kernel and libraries using GCC
- With Thumb-2 developers don't have to manually balance between ARM and Thumb code
- Kernel changes to be contributed to mainline in 2007/2008
  - Thumb-2 support has been available with GNU compilation tools since 2006
- Higher code density can be achieved using optimized tool chains such as ARM RealView compilation tools