#### **Arm Processor**

Arm = Advanced RISC Machines, Ltd.

#### References:

- Computers as Components, 4th Ed., by Marilyn Wolf
- ARM Cortex-M4 User Guide (link on course web page)
- ARM Architecture Reference Manual (link on course web page)

#### Arm instruction set - outline

- Arm versions.
- Arm assembly language.
- Arm programming model.
- Arm memory organization.
- Arm data operations.
- Arm flow of control.

#### Cortex-A series (Application)

- High-performance processors capable of full operating system (OS) support
- Applications include smartphones, digital TV, smart books

#### Cortex-R series (Real-time)

- High performance and reliability for real-time applications
- Applications include automotive braking systems, powertrains

#### Cortex-M series (Microcontroller)

- Cost-sensitive solutions for deterministic microcontroller applications
- Applications include microcontrollers, smart sensors

#### SecurCore series

- High-security applications

Earlier classic processors include the Arm7, Arm9, and Arm11 families.

#### **Equipment Adopting Arm Cores**

ELEC 5260/6260/6266 Embedded Systems

Source: Arm University Program Overview

#### Arm architecture

- Describes the details of the instruction set, programmer's model, exception model, and memory map
- Documented in the Architecture Reference Manual

#### Arm processor

- Developed using one of the Arm architectures
- Adds implementation details, such as timing information
- Documented in the processor's Technical Reference Manual

#### Arm Architecture versions (From Arm.com)

#### Arm Cortex-M series

- Cortex-M series: Cortex-M0, M0+, M3, M4, M7, M23, M33
- Low cost, low power, bit and byte operations, fast interrupt response
- Energy efficiency
  - Lower energy cost, longer battery life
- **Smaller code** (Thumb mode instructions)
  - Lower silicon costs
- Ease of use
  - Faster software development and reuse
- Embedded applications
  - Smart metering, human interface devices, automotive and industrial control
    systems, white goods, consumer products and medical instrumentation

# Arm Cortex-M processor profile

- M0: Optimized for size and power (13 µW/MHz dynamic power)
- M0+: Lower power (11 µW/MHz dynamic power), shorter pipeline
- M3: Full Thumb and Thumb-2 instruction sets, single-cycle multiply instruction, hardware divide, saturated math (32 µW/MHz)
- M4: Adds DSP instructions, optional floating point unit
- M7: Designed for embedded applications requiring high performance
- M23, M33: Include Arm TrustZone® technology for solutions that require optimized, efficient security

# Arm Cortex-M series family

| Processor | Arm Architecture | Core Architecture | Thumb® | Thumb®-2 | Hardware Multiply | Hardware Divide | Saturated Math | DSP Extensions | Floating Point |
|-----------------|----------|-------------|--------|--------|---------------|-----|-----|-----|----------|
| Cortex-M0 | Armv6-M | Von Neumann | Most | Subset | 1 or 32 cycle | No | No | No | No |
| Cortex-M0+ | Armv6-M | Von Neumann | Most | Subset | 1 or 32 cycle | No | No | No | No |
| Cortex-M3 | Armv7-M | Harvard | Entire | Entire | 1 cycle | Yes | Yes | No | No |
| Cortex-M4 | Armv7E-M | Harvard | Entire | Entire | 1 cycle | Yes | Yes | Yes | Optional |
| Cortex-M7 | Armv7E-M | Harvard | Entire | Entire | 1 cycle | Yes | Yes | Yes | Optional |
| Cortex-M23, M33 | Armv8-M | Harvard | Entire | Entire | 1 cycle | Yes | Yes | Yes | Optional |

#### RISC CPU Characteristics

- 32-bit load/store architecture
- Fixed instruction length
- Fewer/simpler instructions than a CISC CPU
- Limited addressing modes, operand types
- Simple design is easier to speed up, pipeline & scale

### Arm assembly language

• Fairly standard RISC assembly language:

### Arm Cortex register set

# Arm Register Set

Current visible registers: 16 32-bit general-purpose registers

#### Arm data types

- Word is 32 bits long.
- Word can be divided into four 8-bit bytes.
- Arm addresses can be 32 bits long.
- Address refers to byte.
  - Address 4 starts at byte 4.
- Configure at power-up in either little- or big-endian mode.

# CPSR: Current Processor Status Register

| Bit(s) | 31 | 30 | 29 | 28 | 7 | 6 | 5 | 4:0 |
|--------|----|----|----|----|---|---|---|--------|
| Field | N | Z | C | V | I | F | T | M4..M0 |

- N, Z, C, V: ALU flags
- I: IRQ disable
- F: FIQ disable
- T: Thumb/Arm mode
- M4..M0: processor mode\*\*

Must be in a "privileged" mode to change the CPSR:

- MRS rn,CPSR
- MSR CPSR,rn

| M4..M0 | Processor mode\*\* |
|--------|-----------------------------------|
| 10000 | User |
| 10001 | FIQ |
| 10010 | IRQ |
| 10011 | Supervisor (SWI) |
| 10111 | Abort (data/instruction memory) |
| 11011 | Undefined instruction |
| 11111 | System |

\*\* 2 modes in Cortex: Thread & Handler

#### Arm status bits

- Every arithmetic, logical, or shifting operation <u>can</u> set CPSR bits:
  - N (negative), Z (zero), C (carry), V (overflow)
- Examples:
  - $-1 + 1 = 0$: NZCV = 0110.
  - $2^{31}-1+1 = -2^{31}$: NZCV = 1001.
- Setting status bits must be explicitly enabled on each instruction
  - ex. "adds" sets status bits, whereas "add" does not

#### Arm data instructions

• Basic format:

```
ADD r0,r1,r2
```

- Computes r1+r2, stores in r0.
- Immediate operand (8-bit constant can be scaled by $2^k$): ADD r0,r1,#2
  - Computes r1+2, stores in r0.
- Set condition flags based on operation:

```
ADDS r0,r1,r2    ; set status flags
```

• Assembler translation: ADD r1,r2 => ADD r1,r1,r2 (but not MUL)

# Flexible 2<sup>nd</sup> operand

- 2<sup>nd</sup> operand = constant or register
- Constant with optional shift: (#8bit_value)
  - Assembler/compiler turns constant into one of:
    - 8-bit value, shifted left any number of bits (up to 32)
    - 0x00ab00ab, 0xab00ab00, 0xabababab (a, b hex digits)
  - Assembler error if constant cannot be represented as above
- Register with optional shift: Rm, shift_type #nbits
  - shift_type = ASR, LSL, LSR, ROR, with nbits < 32
  - shift_type RRX (rotate right with extend, through the carry flag) by 1 bit

# Barrel shifter for 2<sup>nd</sup> operand

#### Arm arithmetic instructions

- ADD, ADC : add (w.
  carry) [Rd] <= Op1 + Op2 + C (the C terms apply to the with-carry forms only)
- SUB, SBC : subtract (w. carry) [Rd] <= Op1 - Op2 + (C - 1)
- RSB, RSC : reverse subtract (w. carry) [Rd] <= Op2 - Op1 + (C - 1)
- MUL : multiply (32-bit product, no immediate for Op2) [Rd] <= Op1 x Op2
- MLA : multiply and accumulate (32-bit result) MLA Rd,Rm,Rs,Rn : [Rd] <= (Rm x Rs) + Rn

#### Arm logical instructions

- AND, ORR, EOR : bit-wise logical ops
- BIC : bit clear [Rd] <= Op1 AND (NOT Op2)
- LSL, LSR : logical shift left/right (combine with data ops)
  - ADD r1,r2,r3, LSL #4 : [r1] <= r2 + (r3 x 16)
  - Vacated bits filled with 0's
- ASL, ASR : arithmetic shift left/right (maintain sign)
- ROR : rotate right
- RRX : rotate right extended with C from CPSR

#### Arm comparison instructions

These instructions only set the NZCV bits of CPSR; no other result is saved. ("Set Status" is implied)

- CMP : compare : Op1 - Op2
- CMN : negated compare : Op1 + Op2
- TST : bit-wise AND : Op1 AND Op2
- TEQ : bit-wise XOR : Op1 XOR Op2

# New Thumb2 bit operations

• Bit field insert/clear (to pack/unpack data within a register)

```
BFC r0, #5, #4       ; Clear 4 bits of r0, starting with bit #5
BFI r0, r1, #5, #4   ; Insert low 4 bits of r1 into r0, starting at bit #5
```

- Bit reversal (RBIT): reverse order of bits within a register
  - Bit [n] moved to bit [31-n], for n = 0..31
  - (REV, by contrast, reverses the order of the four bytes)
- Example:

```
RBIT r0,r1           ; reverse order of bits in r1 and put in r0
```

#### Arm move instructions

MOV, MVN : move (negated), constant = 8 or 16 bits

    MOV  r0, r1        ; sets r0 to r1
    MVN  r0, r1        ; sets r0 to NOT r1 (bit-wise complement)
    MOV  r0, #55       ; sets r0 to 55
    MOV  r0, #0x5678   ; Thumb2: r0[15:0]
    MOVT r0, #0x1234   ; Thumb2: r0[31:16]

• Use shift modifier to scale a value:

```
MOV r0, r1, LSL #6   ; [r0] <= r1 x 64
```

• Special pseudo-op:

```
LSL rd, rn, shift = MOV rd, rn, LSL shift
```

#### Arm 32-bit load pseudo-op\*

- An instruction operand cannot be a memory address or a large constant
- LDR r3,=0x55555555 ← 32-bit constant or symbol.
  Ex: =VariableName
  - Places 0x55555555 in r3
  - Produces MOV if immediate constant can be found
  - Otherwise puts the constant in a "literal pool" and uses:

        LDR r3,[PC,#imm12]    ; PC-relative load, 12-bit immediate offset
        ...
        DCD 0x55555555        ; in literal pool following code

\* Not an actual Arm instruction; translated to Arm ops by the assembler

#### Arm memory access instructions

- Load operand from memory into target register
  - LDR : load 32 bits
  - LDRH : load halfword (16-bit unsigned number) & zero-extend to 32 bits
  - LDRSH : load signed halfword & sign-extend to 32 bits
  - LDRB : load byte (8-bit unsigned number) & zero-extend to 32 bits
  - LDRSB : load signed byte & sign-extend to 32 bits
- Store operand from register to memory
  - STR : store 32-bit word
  - STRH : store 16-bit halfword (right-most 16 bits of register)
  - STRB : store 8-bit byte (right-most 8 bits of register)

# Arm load/store addressing

- Addressing modes: base address + offset
  - register indirect: LDR r0,[r1]
  - with second register: LDR r0,[r1,-r2]
  - with constant: LDR r0,[r1,#4]
  - pre-indexed: LDR r0,[r1,#4]!
  - post-indexed: LDR r0,[r1],#8

Immediate #offset = 12 bits (in the Arm-mode encoding, a 12-bit magnitude plus a separate add/subtract bit)

## Arm load/store examples

```
ldr r1,[r2]             ; address = (r2)
ldr r1,[r2,#5]          ; address = (r2)+5
ldr r1,[r2,#-5]         ; address = (r2)-5
ldr r1,[r2,r3]          ; address = (r2)+(r3)
ldr r1,[r2,-r3]         ; address = (r2)-(r3)
ldr r1,[r2,r3,LSL #2]   ; address = (r2)+(r3 x 4), scaled index
```

Base register r2 is not altered in these instructions.

# Arm load/store examples (base register updated by auto-indexing)

```
ldr r1,[r2,#4]!     ; use address = (r2)+4
                    ; r2 <= (r2)+4 (pre-index)
ldr r1,[r2,r3]!     ; use address = (r2)+(r3)
                    ; r2 <= (r2)+(r3) (pre-index)
ldr r1,[r2],#4      ; use address = (r2)
                    ; r2 <= (r2)+4 (post-index)
ldr r1,[r2],r3      ; use address = (r2)
                    ; r2 <= (r2)+(r3) (post-index)
```

### Additional addressing modes

• Base-plus-offset addressing:

```
LDR r0,[r1,#16]
```

- Loads from location [r1+16]
- Auto-indexing increments base register:

```
LDR r0,[r1,#16]!
```

- Loads from location [r1+16], then sets r1 = r1 + 16
- Post-indexing fetches, then does offset:

```
LDR r0,[r1],#16
```

- Loads r0 from [r1], then sets r1 = r1 + 16
- Recent assembler addition: $M[rn] \rightarrow rd$, $rd \rightarrow M[rn]$

#### Arm ADR pseudo-op

- Assembler will <u>try</u> to translate LDR Rd,label to LDR Rd,[pc,#offset]
- If the address is in the Code Area, generate the address value by performing arithmetic on PC.
- ADR pseudo-op generates the instruction required to calculate the address (in Code Area ONLY)
  - ADR r1, LABEL (uses MOV, MVN, ADD, SUB ops)

## Example: C assignments

• C:

```
x = (a + b) - c;
```

• Assembler:

```
ADR r4,a        ; get address for a (in code area)
LDR r0,[r4]     ; get value of a
LDR r4,=b       ; get address for b, reusing r4
LDR r1,[r4]     ; get value of b
ADD r3,r0,r1    ; compute a+b
LDR r4,=c       ; get address for c
LDR r2,[r4]     ; get value of c
SUB r3,r3,r2    ; complete computation of x
LDR r4,=x       ; get address for x
STR r3,[r4]     ; store value of x
```

# Example: C assignment

```
; C: y = a*(b+c);
LDR r4,=b       ; get address for b
LDR r0,[r4]     ; get value of b
LDR r4,=c       ; get address for c
LDR r1,[r4]     ; get value of c
ADD r2,r0,r1    ; compute partial result
LDR r4,=a       ; get address for a
LDR r0,[r4]     ; get value of a
MUL r2,r2,r0    ; compute final value for y
LDR r4,=y       ; get address for y
STR r2,[r4]     ; store y
```

# Example: C assignment

```
; C: z = (a << 2) | (b & 15);
LDR r4,=a        ; get address for a
LDR r0,[r4]      ; get value of a
MOV r0,r0,LSL #2 ; perform shift
LDR r4,=b        ; get address for b
LDR r1,[r4]      ; get value of b
AND r1,r1,#15    ; perform AND
ORR r1,r0,r1     ; perform OR
LDR r4,=z        ; get address for z
STR r1,[r4]      ; store value for z
```

#### Arm flow control operations

- All operations can be performed conditionally, testing CPSR (only branches in Thumb/Thumb2):
  - EQ, NE, CS, CC, MI, PL, VS, VC, HI, LS, GE, LT, GT, LE
- Branch operation:

```
B label         ; target within ±32M (Arm), ±2K (Thumb), ±16M (Thumb2)
```

• Conditional branch:

```
BNE label       ; target within ±32M (Arm), -252..+258 (Thumb), ±1M (Thumb2)
```

• Thumb2 additions (compare & branch if zero/nonzero):

```
CBZ  r0, label  ; branch if r0 == 0
CBNZ r0, label  ; branch if r0 != 0
```

## Example: if statement

• C:

    if (a > b) { x = 5; y = c + d; } else x = c - d;

• Assembler:

    ; compute and test condition
            LDR r4,=a       ; get address for a
            LDR r0,[r4]     ; get value of a
            LDR r4,=b       ; get address for b
            LDR r1,[r4]     ; get value of b
            CMP r0,r1       ; compare a with b
            BLE fblock      ; if a <= b, branch to false block

#### If statement, cont'd.

```
; true block
        MOV r0,#5       ; generate value for x
        LDR r4,=x       ; get address for x
        STR r0,[r4]     ; store x
        LDR r4,=c       ; get address for c
        LDR r0,[r4]     ; get value of c
        LDR r4,=d       ; get address for d
        LDR r1,[r4]     ; get value of d
        ADD r0,r0,r1    ; compute y
        LDR r4,=y       ; get address for y
        STR r0,[r4]     ; store y
        B   after       ; branch around false block
```

#### If statement, cont'd.

```
; false block
fblock  LDR r4,=c       ; get address for c
        LDR r0,[r4]     ; get value of c
        LDR r4,=d       ; get address for d
        LDR r1,[r4]     ; get value of d
        SUB r0,r0,r1    ; compute c-d
        LDR r4,=x       ; get address for x
        STR r0,[r4]     ; store value of x
after   ...
```

# Example: Conditional instruction implementation

(**Arm mode** only; not available in Thumb/Thumb2 mode)

```
        CMP   r0,r1      ; LT-suffixed instructions execute only if r0 < r1
; true block
        MOVLT r0,#5      ; generate value for x
        ADRLT r4,x       ; get address for x
        STRLT r0,[r4]    ; store x
        ADRLT r4,c       ; get address for c
        LDRLT r0,[r4]    ; get value of c
        ADRLT r4,d       ; get address for d
        LDRLT r1,[r4]    ; get value of d
        ADDLT r0,r0,r1   ; compute y
        ADRLT r4,y       ; get address for y
        STRLT r0,[r4]    ; store y
```

# Conditional instruction implementation, cont'd.

#### Thumb2 conditional execution

- (IF-THEN) instruction, IT, supports conditional execution in Thumb2 of up to 4 instructions in a "block"
- Designate instructions to be executed for THEN and ELSE
- Format: ITxyz condition, where x, y, z are T/E/blank

# Example: C switch statement

• C:

    switch (test) { case 0: ... break; case 1: ... }

• Assembler:

              LDR r2,=test            ; get address for test
              LDR r0,[r2]             ; load value for test
              ADR r1,switchtab        ; load switch table address
              LDR pc,[r1,r0,LSL #2]   ; index switch table
    switchtab DCD case0
              DCD case1

# Example: switch statement with new "Table Branch" instruction

Branch address = PC + 2\*offset from table of offsets

Offset = byte (TBB) or half-word (TBH)

• C:

    switch (test) { case 0: ... break; case 1: ... }

• Assembler:

              LDR r2,=test            ; get address for test
              LDR r0,[r2]             ; load value for test
              TBB [pc,r0]             ; add offset byte to PC
    switchtab DCB (case0 - switchtab) >> 1   ; byte offset
              DCB (case1 - switchtab) >> 1   ; byte offset
    case0     instructions
    case1     instructions

(TBH similar, but with 16-bit offsets/DCI)

# Finite impulse response (FIR) filter

$X_i$'s are data samples; $C_i$'s are constants.

### Example: FIR filter

• C:

    for (i=0, f=0; i<N; i++)
        f = f + c[i]*x[i];

• Assembler:

    ; loop initiation code
            MOV r0,#0       ; use r0 for i
            LDR r2,=N       ; get address for N
            LDR r1,[r2]     ; get value of N
            MOV r2,#0       ; use r2 for f
            LDR r3,=c       ; load r3 with base of c
            LDR r5,=x       ; load r5 with base of x

#### FIR filter, cont'd.

```
; loop body
loop    LDR r4,[r3,r0,LSL #2]   ; get c[i]
        LDR r6,[r5,r0,LSL #2]   ; get x[i]
        MUL r4,r4,r6            ; compute c[i]*x[i]
        ADD r2,r2,r4            ; add into running sum f
        ADD r0,r0,#1            ; add 1 to i
        CMP r0,r1               ; exit?
        BLT loop                ; if i < N, continue
; Finalize result
        LDR r3,=f               ; point to f
        STR r2,[r3]             ; f = result
```

#### FIR filter with MLA & auto-index

```
        AREA  TestProg, CODE, READONLY
        ENTRY
        mov   r0,#0           ;accumulator
        mov   r1,#3           ;number of iterations
        ldr   r2,=carray      ;pointer to constants
        ldr   r3,=xarray      ;pointer to variables
loop    ldr   r4,[r2],#4      ;get c[i] and move pointer
        ldr   r5,[r3],#4      ;get x[i] and move pointer
        mla   r0,r4,r5,r0     ;sum = sum + c[i]*x[i]
        subs  r1,r1,#1        ;decrement iteration count
        bne   loop            ;repeat until count=0
        ldr   r2,=f           ;point to f
        str   r0,[r2]         ;f = result
here    b     here
        AREA  MyData, DATA
carray  dcd   1,2,3
xarray  dcd   10,20,30
f       space 4               ;one word for the result
        END
```

Also, need "time delay" to prepare x array for next sample.

# Arm subroutine linkage

• Branch and link instruction:

```
BL foo          ; copies return address (next instruction) to r14
```

• To return from subroutine:

```
BX  r14         ; branch to address in r14
; or:
MOV r15,r14     ; not recommended for Cortex
```

- May need subroutine to be "reentrant"
  - interrupt it, with interrupting routine calling the subroutine (2 instances of the subroutine)
  - support by creating a "stack" (not supported directly)

For branches, the CPU shifts the offset field left by 2 positions, sign-extends it, and adds it to the PC:

- ±32 Mbyte range (Arm)
- Thumb: ±16 Mbyte (unconditional), ±1 Mbyte (conditional)
- How to perform longer branches?
- Bcond is the only conditional instruction allowed outside of an IT block

#### Nested subroutine calls

Nested function calls in C:

```
void f1(int a){
    f2(a);
}
void f2(int r){
    int g;
    g = r+5;
}
main () {
    f1(xyz);
}
```

#### Nested subroutine calls (1)

• Nesting/recursion requires a "coding convention" to save/pass parameters:

```
        AREA Code1,CODE        ; (omit if using Cortex-M startup code)
Main    LDR  r13,=StackEnd     ; r13 points to last element on stack
        MOV  r1,#5             ; pass value 5 to func1
        STR  r1,[r13,#-4]!     ; push argument onto stack
        BL   func1             ; call func1()
here
```

#### Nested subroutine calls (2)

```
; void f1(int a){ f2(a);}
func1   LDR  r0,[r13]          ; load arg a into r0 from stack
; call func2()
        STR  r14,[r13,#-4]!    ; store func1 return address
        STR  r0,[r13,#-4]!     ; store arg to f2 on stack
        BL   func2             ; branch and link to f2
; return from func1()
        ADD  r13,#4            ; "pop" func2's arg off stack
        LDR  r15,[r13],#4      ; restore stack and return
```

### Nested subroutine calls (3)

```
; void f2(int r){ int g; g = r+5; }
func2   ldr   r4,[r13]         ;get argument r from stack
        add   r5,r4,#5         ;r5 = local variable g
        BX    r14              ;preferred return instruction
; Stack area
        AREA  Data1, DATA
Stack   SPACE 20               ;allocate stack space
StackEnd
        END
```

# Register usage conventions

| Reg | Usage* | Reg | Usage* |
|-----|--------|-----|-----------------------------------|
| r0 | a1 | r8 | v5 |
| r1 | a2 | r9 | v6 |
| r2 | a3 | r10 | v7 |
| r3 | a4 | r11 | v8 |
| r4 | v1 | r12 | ip (intra-procedure scratch reg.) |
| r5 | v2 | r13 | sp (stack pointer) |
| r6 | v3 | r14 | lr (link register) |
| r7 | v4 | r15 | pc (program counter) |

\* Alternate register designation

- a1-a4: argument/result/scratch
- v1-v8: variables

# Saving/restoring multiple registers

- LDM/STM : load/store multiple registers
  - LDMIA : increment address after xfer
  - LDMIB : increment address before xfer
  - LDMDA : decrement address after xfer
  - LDMDB : decrement address before xfer
  - LDM/STM default to LDMIA/STMIA

#### Examples:

```
ldmia r13!,{r8-r12,r14}   ;r13 updated at end
stmda r13,{r8-r12,r14}    ;r13 not updated at end
```

Lowest numbered register at lowest memory address.

#### Arm assembler additions

- PUSH {reglist} = STMDB sp!,{reglist}
- POP {reglist} = LDMIA sp!,{reglist}

## uP startup_stm32l476.s

• Stack definition:

• <u>Vector table:</u>

```
          AREA  RESET, DATA, READONLY
__Vectors DCD   __initial_sp    ; Top of Stack
          DCD   Reset_Handler   ; Reset Handler
          DCD   NMI_Handler     ; NMI Handler
```

• Reset handler:

```
Reset_Handler PROC
          EXPORT Reset_Handler [WEAK]
          IMPORT SystemInit
          IMPORT __main
          LDR    R0, =SystemInit
          BLX    R0
          LDR    R0, =__main
          BX     R0
          ENDP
```

#### Mutual exclusion support

- Test and set a "lock/semaphore" for shared data access
  - Lock=0 indicates shared resource is unlocked
    (free to use)
  - Lock=1 indicates the shared resource is "locked" (in use)
- LDREX Rt,[Rn{,#offset}]
  - Read lock value into Rt from memory to request exclusive access to a resource
  - Cortex notes that LDREX has been performed, and waits for STREX
- STREX Rd,Rt,[Rn{,#offset}]
  - Write Rt value to memory and return status to Rd
  - Rd=0 if successful write, Rd=1 if unsuccessful write
  - "Fail" if LDREX by another thread before STREX performed by first thread
- CLREX
  - Force next STREX to return status of 1 to Rd (cancels LDREX)

## Mutual exclusion example

• Location "Lock" is 0 if a resource is free, 1 if not free

```
        ldr     r0,=Lock       ;point to lock
        mov     r1,#1          ;prepare to lock the resource
try     ldrex   r2,[r0]        ;read Lock value
        cmp     r2,#0          ;is resource unlocked/free?
        itt     eq             ;next 2 ops if resource free
        strexeq r2,r1,[r0]     ;store 1 in Lock
        cmpeq   r2,#0          ;was store successful?
        bne     try            ;repeat loop if lock unsuccessful
```

LDREXB/LDREXH and STREXB/STREXH for byte/halfword Lock

#### Common assembler directives

• Allocate storage and store initial values (CODE area)

    Label DCD value1, value2...   ; allocate word
    Label DCW value1, value2...   ; allocate half-word
    Label DCB value1, value2...   ; allocate byte

• Allocate storage without initial values (DATA area)

    Label SPACE n                 ; reserve n bytes (uninitialized)

# Summary

- Load/store architecture
- Most instructions are RISCy, operate in single cycle.
  - Some multi-register operations take longer.
- All instructions can be executed conditionally (Arm mode; Thumb2 uses IT blocks).

#### **Arm Instruction Code Format**

## Arm Load/Store Code Format
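
The encoding figure for this slide is not reproduced in the text. As a rough sketch only, assuming the classic 32-bit Arm-mode (A32) single data transfer layout (condition in bits 31:28, bits 27:26 = 01, then the I, P, U, B, W, L flags, Rn, Rd, and a 12-bit offset field), the C fragment below pulls those fields out of an instruction word. The names `ldst_fields`, `decode_ldst`, and `ldst_signed_offset` are illustrative and not from the slides; the sketch ignores corner cases of the encoding space, and the Thumb/Thumb2 encodings used on Cortex-M are laid out differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative container (not from the slides) for the fields of an
   A32 single data transfer (LDR/STR) instruction word. */
typedef struct {
    unsigned cond;      /* bits 31:28: condition code (0xE = always)       */
    unsigned reg_off;   /* bit 25 (I): 0 = 12-bit immediate, 1 = register  */
    unsigned pre;       /* bit 24 (P): 1 = offset applied before access    */
    unsigned up;        /* bit 23 (U): 1 = add offset, 0 = subtract        */
    unsigned byte;      /* bit 22 (B): 1 = byte access, 0 = word access    */
    unsigned writeback; /* bit 21 (W): 1 = write address back to base      */
    unsigned load;      /* bit 20 (L): 1 = LDR, 0 = STR                    */
    unsigned rn;        /* bits 19:16: base register number                */
    unsigned rd;        /* bits 15:12: data register number                */
    unsigned offset12;  /* bits 11:0: immediate or shifted-register field  */
} ldst_fields;

/* Split word w into its fields. Returns 1 if bits 27:26 are 01 (the
   load/store word/byte class); otherwise returns 0 and leaves *out alone. */
int decode_ldst(uint32_t w, ldst_fields *out)
{
    assert(out != NULL);
    if (((w >> 26) & 0x3u) != 0x1u)
        return 0;
    out->cond      = (w >> 28) & 0xFu;
    out->reg_off   = (w >> 25) & 0x1u;
    out->pre       = (w >> 24) & 0x1u;
    out->up        = (w >> 23) & 0x1u;
    out->byte      = (w >> 22) & 0x1u;
    out->writeback = (w >> 21) & 0x1u;
    out->load      = (w >> 20) & 0x1u;
    out->rn        = (w >> 16) & 0xFu;
    out->rd        = (w >> 12) & 0xFu;
    out->offset12  = w & 0xFFFu;
    return 1;
}

/* Signed byte offset of the immediate form: a 12-bit magnitude whose
   sign is selected by the U bit. */
int32_t ldst_signed_offset(const ldst_fields *f)
{
    assert(f != NULL);
    return f->up ? (int32_t)f->offset12 : -(int32_t)f->offset12;
}
```

For instance, the word 0xE5910004 is the Arm-mode encoding of `LDR r0,[r1,#4]`: `decode_ldst` reports L = 1, P = 1, U = 1, Rn = 1, Rd = 0, and an offset of 4, while a data-processing word such as 0xE0810002 (`ADD r0,r1,r2`) is rejected because bits 27:26 are 00.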