X86 Instructions and ARM Architecture

Site: Saylor Academy
Course: CS301: Computer Architecture
Book: X86 Instructions and ARM Architecture
Date: Saturday, November 28, 2020, 9:17 AM

Description

Read this article, which gives two examples of instruction set architectures (ISAs). Look over how the different microprocessors address memory. Take note of the similarities and differences in format, instructions and types of instructions, and addressing modes between these two, as well as between these and the MIPS instructions of the previous sections.

x86 Instructions

Conventions

The following template will be used for instructions that take no operands:

Instr

The following template will be used for instructions that take 1 operand:

Instr arg

The following template will be used for instructions that take 2 operands. Notice how the format of the instruction is different for different assemblers.

Instr src, dest GAS Syntax
Instr dest, src Intel Syntax


The following template will be used for instructions that take 3 operands. Notice how the format of the instruction is different for different assemblers.

Instr aux, src, dest GAS Syntax
Instr dest, src, aux Intel Syntax



Suffixes

Some instructions, especially when assembled with GAS for non-Windows platforms (e.g. Unix, Linux), require a suffix that specifies the size of the data the operation works on. Some possible suffixes are:

  • b (byte) = 8 bits.
  • w (word) = 16 bits.
  • l (long) = 32 bits.
  • q (quad) = 64 bits.

An example of the usage with the mov instruction on a 32-bit architecture, GAS syntax:

movl $0x000F, %eax   # Store the value F into the eax register

In Intel syntax you don't have to use the suffix. Based on the register name and the immediate value used, the assembler knows which data size to use.

MOV EAX, 0x000F

Source: Wikibooks, https://en.wikibooks.org/wiki/X86_Assembly/X86_Instructions
Creative Commons License This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.

Data Transfer Instructions

Some of the most important and most frequently used instructions are those that move data. Without them, there would be no way for registers or memory to even have anything in them to operate on.



Move
mov src, dest GAS Syntax
mov dest, src Intel Syntax

mov stands for move. Despite its name the mov instruction copies the src operand into the dest operand. After the operation both operands contain the same contents.


Operands
legal operands for mov instruction

dest operand | src: immediate value       | src: register   | src: memory
register     | Yes (into larger register) | Yes (same size) | Yes (register determines size of retrieved memory)
memory       | Yes (up to 32-bit values)  | Yes             | No

Modified flags

  • No FLAGS are modified by this instruction


Example

.data
value:
    .long 2
.text
    .globl _start
_start:
    movl $6, %eax               # eax ≔ 6
    movw %ax, value             # value ≔ ax (the low 16 bits of eax)
    movl $0, %ebx               # ebx ≔ 0
    movb %al, %bl               # bl ≔ al
                                # %ebx is now 6
    movl value, %ebx            # ebx ≔ value

    movl $value, %esi           # esi ≔ @value
                                # %esi is now the address of value
    xorl %ebx, %ebx             # ebx ≔ ebx ⊻ ebx
                                # %ebx is now 0
    movw value(, %ebx, 1), %bx  # bx ≔ value[ebx*1]
                                # %ebx is now 6
    # Linux sys_exit
    movl $1, %eax               # eax ≔ 1
    xorl %ebx, %ebx             # ebx ≔ 0
    int $0x80

Data swap
xchg src, dest GAS Syntax
xchg dest, src Intel Syntax

xchg stands for exchange. The xchg instruction swaps the src operand with the dest operand. It is like doing three mov operations:

  1. from dest to a temporary (another register),
  2. then from src to dest, and finally
  3. from the temporary storage to src,

except that no register needs to be reserved for temporary storage.

This exchange pattern of three consecutive mov instructions can be detected by the DFU present in some architectures, which will trigger special treatment. The opcode for  xchg is shorter though.


Operands

Any combination of register or memory operands, except that at most one operand may be a memory operand. You cannot exchange two memory blocks.


Modified Flags

None.


Example

.data
value:
    .long 2
.text
    .global _start
_start:
    movl $54, %ebx
    xorl %eax, %eax
    xchgl value, %ebx
    # %ebx is now 2
    # value is now 54
    xchgw %ax, value
    # value is now 0
    # %eax is now 54
    xchgb %al, %bl
    # %ebx is now 54
    # %eax is now 2
    xchgw value(%eax), %ax
    # value is now 0x00020000 = 131072
    # %eax is now 0
    # Linux sys_exit
    mov $1, %eax
    xorl %ebx, %ebx
    int $0x80

Application

If one of the operands is a memory address, then the operation has an implicit lock prefix, that is, the exchange operation is atomic. This can have a large performance penalty.

However, on some platforms exchanging two registers will trigger the register renamer. The register renamer is a unit that merely renames registers, so no data actually have to be moved. This is very fast (branded as “zero-latency”). Renaming registers can be useful since

  • some instructions require certain operands to be located in a specific register, while the data currently held there will be needed later on,
  • or encoding some opcodes is shorter if one of the operands is the accumulator register.

It is also worth noting that the common nop (no operation) instruction, 0x90, is the opcode for xchgl %eax, %eax.


Data swap based on comparison
cmpxchg arg2, arg1 GAS Syntax
cmpxchg arg1, arg2 Intel Syntax

cmpxchg stands for compare and exchange. “Exchange” is misleading, as no data are actually exchanged.

The cmpxchg instruction has one implicit operand: the al/ax/eax depending on the size of arg1.

  1. The instruction compares arg1 to al/ax/eax.
  2. If they are equal, arg1 becomes arg2. (arg1 = arg2)
  3. Otherwise, al/ax/eax becomes arg1.

Unlike xchg there is no implicit lock prefix, and if the instruction is required to be atomic,  lock has to be prefixed.
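
The three steps above can be sketched as a small Python model. This is only an illustration of the described semantics, not of the hardware: the function name cmpxchg_model and its tuple return value are hypothetical, and the model says nothing about atomicity (which on real hardware requires the lock prefix).

```python
def cmpxchg_model(acc, arg1, arg2):
    """Model of cmpxchg; returns (new_acc, new_arg1, zf).

    acc  stands for al/ax/eax (the implicit operand),
    arg1 for the register or memory operand,
    arg2 for the register operand that may be stored into arg1.
    """
    if acc == arg1:
        # Equal: arg1 takes the value of arg2 and ZF is set.
        return acc, arg2, 1
    # Not equal: the accumulator takes the value of arg1 and ZF is cleared.
    return arg1, arg1, 0


# A lock word of 0 means "free"; try to store 1 ("taken") into it.
acc, lock_word, zf = cmpxchg_model(acc=0, arg1=0, arg2=1)
print(lock_word, zf)  # the lock was free, so it is now taken

# A second attempt fails: the lock word is no longer 0.
acc, lock_word, zf = cmpxchg_model(acc=0, arg1=lock_word, arg2=1)
print(acc, lock_word, zf)
```

A caller that sees ZF set knows its store happened; a caller that sees ZF cleared finds the current value of arg1 in the accumulator and can retry.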


Operands

arg2 has to be a register. arg1 may be either a register or memory operand.


Modified flags

  • ZF ≔ arg1 = (al|ax|eax) [depending on arg1’s size]
  • CF, PF, AF, SF, and OF are altered, too.


Application

The following example shows how to use the cmpxchg instruction to create a spin lock which will be used to protect the result variable. The last thread to grab the spin lock will get to set the final value of result:

example for a spin lock

Move with zero extend
movz src, dest GAS Syntax
movzx dest, src Intel Syntax

movz stands for move with zero extension. Like the regular mov, the movz instruction copies data from the src operand to the dest operand, but the remaining bits in dest that are not provided by src are filled with zeros. This instruction is useful for copying a small, unsigned value to a bigger register.


Operands

Dest has to be a register, and src can be either another register or a memory operand. For this operation to make sense dest has to be larger than src.


Modified flags

There are none.


Example

.data
byteval:
    .byte 204
.text
    .global _start
_start:
    movzbw byteval, %ax
    # %ax is now 204
    movzwl %ax, %ebx
    # %ebx is now 204
    movzbl byteval, %esi
    # %esi is now 204
    # Linux sys_exit
    mov $1, %eax
    xorl %ebx, %ebx
    int $0x80

Move with sign extend
movs src, dest GAS Syntax
movsx dest, src Intel Syntax

movsx stands for move with sign extension. The movsx instruction copies the src operand into the dest operand and pads the remaining bits not provided by src with the sign bit (the MSB) of src.

This instruction is useful for copying a signed small value to a bigger register.
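
The difference between zero extension and sign extension can be sketched in Python. The helper names zero_extend and sign_extend are hypothetical and only model the bit patterns; registers are represented as non-negative integers.

```python
def zero_extend(value, src_bits):
    """Model of movz/movzx: keep the low src_bits, fill the rest with zeros."""
    return value & ((1 << src_bits) - 1)


def sign_extend(value, src_bits, dest_bits):
    """Model of movs/movsx: copy the MSB of src into all higher bits of dest."""
    value &= (1 << src_bits) - 1
    sign = 1 << (src_bits - 1)
    signed = (value ^ sign) - sign            # reinterpret as a signed integer
    return signed & ((1 << dest_bits) - 1)    # bit pattern in the wider register


print(hex(zero_extend(0xE8, 8)))        # 0xe8 stays 232 when treated as unsigned
print(hex(sign_extend(0xE8, 8, 16)))    # 0xe8 is -24 as a signed byte
print(hex(sign_extend(0xE8, 8, 32)))
```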


Operands

movsx accepts the same operands as movzx.


Modified Flags

movsx does not modify any flags, either.


Example

.data
byteval:
    .byte -24       # = 0xe8
.text
    .global _start
_start:
    movsbw byteval, %ax
    # %ax is now -24 = 0xffe8
    movswl %ax, %ebx
    # %ebx is now -24 = 0xffffffe8
    movsbl byteval, %esi
    # %esi is now -24 = 0xffffffe8
    # Linux sys_exit
    mov $1, %eax
    xorl %ebx, %ebx
    int $0x80

Move String

movsb

Move byte.

The movsb instruction copies one byte from the memory location specified in esi to the location specified in edi. If the direction flag is cleared, then esi and edi are incremented after the operation. Otherwise, if the direction flag is set, then the pointers are decremented. In that case the copy happens in the reverse direction, starting at the highest address and moving toward lower addresses. When prefixed with rep, the instruction is repeated, decrementing ecx each time, until ecx is zero.


Operands

There are no explicit operands, but

  • ecx determines the number of iterations,
  • esi specifies the source address,
  • edi the destination address, and
  • DF is used to determine the direction (it can be altered by the cld and std instruction).


Modified flags

No flags are modified by this instruction.


Example

section .text
    ; copy mystr into mystr2
    mov esi, mystr      ; loads address of mystr into esi
    mov edi, mystr2     ; loads address of mystr2 into edi
    cld                 ; clear direction flag (forward)
    mov ecx, 6
    rep movsb           ; copy six times
section .bss
    mystr2: resb 6
section .data
    mystr db "Hello", 0x0

movsw

Move word

The movsw instruction copies one word (two bytes) from the location specified in esi to the location specified in edi. It basically does the same thing as movsb, except with words instead of bytes.


Operands

None.


Modified flags

  • No FLAGS are modified by this instruction


Example

section .text
    ; copy mystr into mystr2
    mov esi, mystr
    mov edi, mystr2
    cld
    mov ecx, 4
    rep movsw
    ; mystr2 is now AaBbCca\0
section .bss
    mystr2: resb 8
section .data
    mystr db "AaBbCca", 0x0

Load Effective Address
lea src, dest GAS Syntax
lea dest, src Intel Syntax


lea stands for load effective address. The lea instruction calculates the address of the src operand and loads it into the dest operand.


Operands

src

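  • Memory (an address expression, which may combine registers and an immediate displacement; a bare register or a bare immediate is not a valid src)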

dest

  • Register


Modified flags

  • No FLAGS are modified by this instruction


Note

Load Effective Address calculates its src operand in the same way as the mov instruction does, but rather than loading the contents of that address into the dest operand, it loads the address itself.

lea can be used not only for calculating addresses, but also general-purpose unsigned integer arithmetic (with the caveat and possible advantage that FLAGS are unmodified). This can be quite powerful, since the src operand can take up to 4 parameters: base register, index register, scalar multiplier and displacement, e.g. [eax + edx*4 -4] (Intel syntax) or -4(%eax, %edx, 4) (GAS syntax). The scalar multiplier is limited to constant values 1, 2, 4, or 8 for byte, word, double word or quad word offsets respectively. This by itself allows for multiplication of a general register by constant values 2, 3, 4, 5, 8 and 9, as shown below (using NASM syntax):

lea ebx, [ebx*2]        ; Multiply ebx by 2
lea ebx, [ebx*8+ebx]    ; Multiply ebx by 9, which totals ebx*18
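
As a rough illustration, the address calculation that lea performs can be modelled in Python. The function lea_model and its keyword parameters are hypothetical names for the four components described above; the result is truncated to 32 bits, as it would be in a 32-bit register.

```python
MASK32 = 0xFFFFFFFF


def lea_model(base=0, index=0, scale=1, disp=0):
    """Compute base + index*scale + disp the way a 32-bit lea would."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    return (base + index * scale + disp) & MASK32


ebx = 7
ebx = lea_model(index=ebx, scale=2)             # lea ebx, [ebx*2]
ebx = lea_model(base=ebx, index=ebx, scale=8)   # lea ebx, [ebx*8+ebx]
print(ebx)                                      # 7 * 2 * 9

# [eax + edx*4 - 4] with eax = 0x1000 and edx = 3
print(hex(lea_model(base=0x1000, index=3, scale=4, disp=-4)))
```

No memory is read and no flags are involved, which is why the same calculation can be used for plain integer arithmetic.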

Conditional Move
cmovcc src, dest GAS Syntax
cmovcc dest, src Intel Syntax

cmov stands for conditional move. It behaves like mov, but whether it executes depends on various flags. The following instructions are available:


available cmovcc combinations

condition       | … = 1                 | … = 0
ZF              | cmovz, cmove          | cmovnz, cmovne
OF              | cmovo                 | cmovno
SF              | cmovs                 | cmovns
CF              | cmovc, cmovb, cmovnae | cmovnc, cmovnb, cmovae
CF ∨ ZF         | cmovbe                | cmova, cmovnbe
PF              | cmovp, cmovpe         | cmovnp, cmovpo
SF = OF         | cmovge, cmovnl        | cmovnge, cmovl
ZF ∨ (SF ≠ OF)  | cmovng, cmovle        | cmovg, cmovnle

Put differently, cmova/cmovnbe require ¬CF ∧ ¬ZF, and cmovg/cmovnle require ¬ZF ∧ (SF = OF).

The cmov instruction needs to be available on the platform. This can be checked by using the cpuid instruction.


Operands

Dest has to be a register. Src can be either a register or memory operand.


Application

The cmov instruction can be used to eliminate branches, and thus avoids branch mispredictions. However, cmov instructions need to be used wisely: the dependency chain will become longer.


Data transfer instructions of 8086 microprocessor

General

General purpose byte or word transfer instructions:

mov

copy byte or word from specified source to specified destination

push

copy specified word to top of stack.

pop

copy word from top of stack to specified location

pusha

copy all registers to stack

popa

copy words from stack to all registers

xchg

Exchange bytes or exchange words

xlat

translate a byte in al using a table in memory

Input/Output

These are I/O port transfer instructions:

in

copy a byte or word from specific port to accumulator

out

copy a byte or word from accumulator to specific port

Address Transfer Instruction

Special address transfer Instructions:

lea

load effective address of operand into specified register

lds

load DS register and other specified register from memory

les

load ES register and other specified register from memory

Flags

Flag transfer instructions:

lahf

load ah with the low byte of flag register

sahf

stores ah register to low byte of flag register

pushf

copy flag register to top of stack

popf

copy top of stack word to flag register

Control Flow Instructions

Almost all programming languages have the ability to change the order in which statements are evaluated, and assembly is no exception. The instruction pointer (EIP) register contains the address of the next instruction to be executed. To change the flow of control, the programmer must be able to modify the value of EIP. This is where control flow instructions come in.

mov eip, label   ; wrong
jmp label        ; right

Comparison Instructions
test arg0, arg1 GAS Syntax
test arg1, arg0 Intel Syntax


Performs a bit-wise logical and on arg0 and arg1, the result of which we will refer to as commonBits, and sets the ZF (zero), SF (sign) and PF (parity) flags based on commonBits. commonBits is then discarded.


Operands

arg0

  • Register
  • Immediate

arg1

  • AL/AX/EAX (only if arg0 is an immediate value)
  • Register
  • Memory


Modified flags

  • SF ≔ MostSignificantBit(commonBits)
  • ZF ≔ (commonBits = 0), so a set ZF means, arg0 and arg1 do not have any set bits in common
  • PF ≔ BitWiseXorNor(commonBits[7:0]), so PF is set if and only if the least significant byte of commonBits has an even number of 1 bits
  • CF ≔ 0
  • OF ≔ 0
  • AF is undefined
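
The flag rules listed above can be sketched as a Python model. The function name flags_after_test and the dictionary it returns are hypothetical; the model assumes that PF is derived from the least significant byte of the result only, and it leaves out AF because that flag is undefined.

```python
def flags_after_test(arg0, arg1, bits=32):
    """Model of the flags produced by test arg0, arg1 for a given operand size."""
    common = arg0 & arg1 & ((1 << bits) - 1)      # commonBits, discarded afterwards
    low_byte_ones = bin(common & 0xFF).count("1")
    return {
        "ZF": int(common == 0),
        "SF": (common >> (bits - 1)) & 1,
        "PF": int(low_byte_ones % 2 == 0),
        "CF": 0,
        "OF": 0,
    }


# No bits in common: ZF is set.
print(flags_after_test(0b1010, 0b0101))

# 0x0103 has three set bits overall, but only two in its low byte: PF is set.
print(flags_after_test(0x0103, 0xFFFF, bits=16))
```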


Application

  • passing the same register twice: test rax, rax
    • SF becomes the sign of rax, which is a simple test for non-negativity
    • ZF is set if rax is zero
    • PF is set if the least significant byte of rax has an even number of set bits

cmp subtrahend, minuend GAS Syntax
cmp minuend, subtrahend Intel Syntax


Performs a comparison operation between minuend and subtrahend. The comparison is performed by a (signed) subtraction of subtrahend from minuend, the result of which can be called difference. difference is then discarded. If subtrahend is an immediate value it will be sign extended to the length of minuend. The EFLAGS register is set in the same manner as by a sub instruction.

Note that the GAS/AT&T syntax can be rather confusing, as for example cmp $0, %rax followed by jl branch will branch if %rax < 0 (and not the opposite as might be expected from the order of the operands).


Operands

minuend

  • AL/AX/EAX (only if subtrahend is immediate)
  • Register
  • Memory

subtrahend

  • Register
  • Immediate
  • Memory


Modified flags

  • SF ≔ MostSignificantBit(difference), so, provided no overflow occurred, an unset SF means the difference is non-negative (minuend ≥ subtrahend [NB: signed comparison])
  • ZF ≔ (difference = 0)
  • PF ≔ BitWiseXorNor(difference[7:0])
  • CF, OF and AF are set in the same manner as by a sub instruction


Jump Instructions

The jump instructions allow the programmer to (indirectly) set the value of the EIP register. The location passed as the argument is usually a label. The first instruction executed after the jump is the instruction immediately following the label. All of the jump instructions, with the exception of jmp, are conditional jumps, meaning that program flow is diverted only if a condition is true. These instructions are often used after a comparison instruction (see above), but since many other instructions set flags, this order is not required.

See chapter “X86 architecture”, § “EFLAGS register” for more information about the flags and their meaning.


Unconditional Jumps

jmp loc

Loads EIP with the specified address (i.e. the next instruction executed will be the one specified by jmp).


Jump if Equal

je loc

ZF = 1

Loads EIP with the specified address, if operands of previous cmp instruction are equal. For example:

mov ecx, 5
mov edx, 5
cmp ecx, edx
je equal
; if it did not jump to the label equal,
; then this means ecx and edx are not equal.
equal:
; if it jumped here, then this means ecx and edx are equal

Jump if Not Equal

jne loc

ZF = 0

Loads EIP with the specified address, if operands of previous cmp instruction are not equal.


Jump if Greater

jg loc

SF = OF and ZF = 0

Loads EIP with the specified address, if the minuend of the previous cmp instruction is greater than the subtrahend (performs signed comparison).


Jump if Greater or Equal

jge loc

SF = OF (this includes the case ZF = 1)

Loads EIP with the specified address, if the minuend of the previous cmp instruction is greater than or equal to the subtrahend (performs signed comparison).


Jump if Above (unsigned comparison)

ja loc

CF = 0 and ZF = 0

Loads EIP with the specified address, if the minuend of the previous cmp instruction is greater than the subtrahend. ja is the same as jg, except that it performs an unsigned comparison.

That means, the following piece of code always jumps (unless rbx is -1, too), because negative one is represented as all bits set in the two's complement.

mov rax, -1     ; rax ≔ -1
cmp rax, rbx
ja loc

Interpreting all bits set (without treating any bit as the sign) has the value 2ⁿ-1 (where n is the length of the register). That is the largest unsigned value a register can hold.


Jump if Above or Equal (unsigned comparison)

jae loc

CF = 0 (this includes the case ZF = 1)

Loads EIP with the specified address, if the minuend of the previous cmp instruction is greater than or equal to the subtrahend. jae is the same as jge, except that it performs an unsigned comparison.


Jump if Less

jl loc

The criterion required for a jl is that SF ≠ OF. It loads EIP with the specified address, if the criterion is met. So either SF or OF can be set, but not both in order to satisfy this criterion. If we take the sub (which is basically what a cmp does) instruction as an example, we have:

minuend - subtrahend

With respect to sub and cmp there are several cases that fulfill this criterion:

  1. minuend < subtrahend and the operation does not have overflow
  2. minuend > subtrahend and the operation has an overflow

In the first case SF will be set but not OF and in second case OF will be set but not SF since the overflow will reset the most significant bit to zero and thus preventing SF being set. The SF ≠ OF criterion avoids the cases where:

  1. minuend > subtrahend and the operation does not have overflow
  2. minuend < subtrahend and the operation has an overflow
  3. minuend = subtrahend

In the first case neither SF nor OF are set, in the second case OF will be set and SF will be set since the overflow will reset the most significant bit to one and in the last case neither SF nor OF will be set.
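
The case analysis above can be checked exhaustively for a small operand size. The following Python sketch models the flags of cmp for 8-bit operands and confirms that SF ≠ OF holds exactly when the minuend is less than the subtrahend as signed numbers, while CF reflects the unsigned comparison. The names cmp_flags, to_signed and jl_matches_signed_less are hypothetical helpers introduced for this illustration.

```python
def cmp_flags(minuend, subtrahend, bits=8):
    """Model of the flags set by cmp (minuend - subtrahend) for a given size."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    a, b = minuend & mask, subtrahend & mask
    diff = (a - b) & mask
    return {
        "SF": int((diff & sign) != 0),
        "ZF": int(diff == 0),
        "CF": int(a < b),                       # unsigned borrow
        # Signed overflow: operand signs differ and the result's sign
        # differs from the minuend's sign.
        "OF": int(((a ^ b) & (a ^ diff) & sign) != 0),
    }


def to_signed(value, bits=8):
    """Interpret a bit pattern as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def jl_matches_signed_less(bits=8):
    """Check SF != OF <=> signed less-than for every operand pair."""
    for a in range(1 << bits):
        for b in range(1 << bits):
            flags = cmp_flags(a, b, bits)
            if (flags["SF"] != flags["OF"]) != (to_signed(a, bits) < to_signed(b, bits)):
                return False
    return True


print(cmp_flags(0x80, 0x01))        # -128 compared with 1: overflow, OF set, SF clear
print(jl_matches_signed_less())
```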


Jump if Less or Equal

jle loc

SF ≠ OF or ZF = 1.

Loads EIP with the specified address, if the minuend of the previous cmp instruction is less than or equal to the subtrahend. See the jl section for a more detailed description of the criteria.


Jump if Below (unsigned comparison)

jb loc

CF = 1

Loads EIP with the specified address, if the minuend of the previous cmp instruction is less than the subtrahend. jb is the same as jl, except that it performs an unsigned comparison.

mov rax, 0      ; rax ≔ 0
cmp rax, rbx    ; rax ≟ rbx
jb loc          ; always jumps, unless rbx is also 0

Jump if Below or Equal (unsigned comparison)

jbe loc

CF = 1 or ZF = 1

Loads EIP with the specified address, if the minuend of the previous cmp instruction is less than or equal to the subtrahend. jbe is the same as jle, except that it performs an unsigned comparison.


Jump if Overflow

jo loc

OF = 1

Loads EIP with the specified address, if the overflow bit is set on a previous arithmetic expression.


Jump if Not Overflow

jno loc

OF = 0

Loads EIP with the specified address, if the overflow bit is not set on a previous arithmetic expression.


Jump if Zero

jz loc

ZF = 1

Loads EIP with the specified address, if the zero bit is set from a previous arithmetic expression. jz is identical to je.


Jump if Not Zero

jnz loc

ZF = 0

Loads EIP with the specified address, if the zero bit is not set from a previous arithmetic expression. jnz is identical to jne.


Jump if Signed

js loc

SF = 1

Loads EIP with the specified address, if the sign bit is set from a previous arithmetic expression.


Jump if Not Signed

jns loc

SF = 0

Loads EIP with the specified address, if the sign bit is not set from a previous arithmetic expression.


Jump if counter register is zero

jcxz loc

CX = 0

jecxz loc

ECX = 0

jrcxz loc

RCX = 0

Loads EIP with the specified address, if the counter register is zero.


Function Calls

call proc

Pushes the address of the instruction that follows the call, i.e. usually the next line in your source code, onto the top of the stack, and then jumps to the specified location. This is used mostly for subroutines.

ret [val]

Loads the next value on the stack into EIP, and then pops the specified number of bytes off the stack. If val is not supplied, the instruction will not pop any values off the stack after returning.


Loop Instructions

loop arg

The loop instruction decrements ECX and jumps to the address specified by arg unless decrementing ECX caused its value to become zero. For example:

mov ecx, 5      ; ecx ≔ 5
head:
; the code here would be executed 5 times
loop head

loop does not set any flags.
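
Because the decrement happens before the test, the loop body always runs at least once, and a counter that starts at zero wraps around instead of skipping the loop. The following Python sketch models this; loop_iterations is a hypothetical helper, and a narrow 8-bit counter is used in the second call only to keep the wrap-around case short.

```python
def loop_iterations(counter, bits=32):
    """Count how often a body closed by `loop` runs for a given start value."""
    mask = (1 << bits) - 1
    iterations = 0
    while True:
        iterations += 1                     # the loop body executes
        counter = (counter - 1) & mask      # loop decrements the counter first...
        if counter == 0:                    # ...and falls through once it reaches zero
            break
    return iterations


print(loop_iterations(5))           # matches the example above
print(loop_iterations(0, bits=8))   # a zero counter wraps around: 2**8 iterations
```

This is one reason the counter is often checked with jecxz before entering such a loop.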


loopcc arg

These loop instructions decrement ECX and jump to the address specified by arg if their condition is satisfied (that is, a specific flag is set), unless decrementing ECX caused its value to become zero.

  • loope loop if equal
  • loopne loop if not equal
  • loopnz loop if not zero
  • loopz loop if zero

That way, only testing for a non-zero ECX can be combined with testing ZF. Other flags can not be tested for, say there is no loopnc “loop while ECX ≠ 0 and CF unset”.


Enter and Leave

enter arg

enter creates a stack frame with the specified amount of space allocated on the stack.

leave

leave destroys the current stack frame, and restores the previous frame. Using Intel syntax this is equivalent to:

mov esp, ebp    ; esp ≔ ebp
pop ebp

This will set EBP and ESP to their respective value before the function prologue began therefore reversing any modification to the stack that took place during the prologue.


Other Control Instructions

hlt

Halts the processor. Execution will be resumed after processing next hardware interrupt, unless IF is cleared.

nop

No operation. This instruction doesn't do anything, but wastes (an) instruction cycle(s) in the processor.

This instruction is often represented as an xchg operation with the operands EAX and EAX (an operation without side-effects), because there is no designated opcode for doing nothing. This is just a passing remark, so that you do not get confused by disassembled code.

lock

Asserts #LOCK prefix on next instruction.

wait

Waits for the FPU to finish its last calculation.

Arithmetic Instructions


Arithmetic instructions take two operands: a destination and a source. The destination must be a register or a memory location. The source may be either a memory location, a register, or a constant value. Note that at most one of the two may be a memory location, because operations may not use memory for both the source and the destination.

add src, dest GAS Syntax
add dest, src Intel Syntax

This adds src to dest. If you are using the Intel syntax, then the result is stored in the first argument; if you are using the GAS syntax, it is stored in the second argument.

sub src, dest GAS Syntax
sub dest, src Intel Syntax

Like ADD, only it subtracts source from destination instead. In C: dest -= src;


mul arg

This multiplies arg by the value of the corresponding size in the accumulator register (AL, AX, or EAX).

result registers used by mul

operand size                    | 1 byte | 2 bytes | 4 bytes
other operand                   | AL     | AX      | EAX
higher part of result stored in | AH     | DX      | EDX
lower part of result stored in  | AL     | AX      | EAX

In the second case, the target is not EAX for backward compatibility with code written for older processors.
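
How the double-width product is split across the two result registers can be sketched in Python. The function mul_model is a hypothetical helper that returns the (higher part, lower part) pair described in the table for a given operand size in bits.

```python
def mul_model(accumulator, arg, bits):
    """Model of unsigned mul: returns (high, low) halves of the product."""
    mask = (1 << bits) - 1
    product = (accumulator & mask) * (arg & mask)
    return (product >> bits) & mask, product & mask


# 16-bit case: AX * arg, result in DX:AX
dx, ax = mul_model(0xFFFF, 0xFFFF, bits=16)
print(hex(dx), hex(ax))

# 8-bit case: AL * arg, result in AH:AL (together AX)
ah, al = mul_model(200, 3, bits=8)
print(hex(ah), hex(al))     # 600 = 0x0258
```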


imul arg

As mul, only signed. The imul instruction has the same format as mul, but also accepts two other formats like so:

imul src, dest GAS Syntax
imul dest, src Intel Syntax

This multiplies src by dest. If you are using the Intel syntax, then the result is stored in the first argument; if you are using the GAS syntax, it is stored in the second argument.

imul aux, src, dest GAS Syntax
imul dest, src, aux Intel Syntax

This multiplies src by aux and places the result into dest. If you are using the Intel syntax, then the result is stored in the first argument; if you are using the GAS syntax, it is stored in the third argument.


div arg

This divides the value in the dividend register(s) by arg, see table below.

result registers for div

divisor size       | 1 byte | 2 bytes | 4 bytes
dividend           | AX     | DX:AX   | EDX:EAX
remainder stored in | AH    | DX      | EDX
quotient stored in | AL     | AX      | EAX

The colon (:) means concatenation. With divisor size 4, this means that EDX are the bits 32-63 and EAX are bits 0-31 of the input number (with lower bit numbers being less significant, in this example).

As you typically have 32-bit input values for division, you first need to extend the dividend into EDX: clear EDX (for example with xor edx, edx) before an unsigned div, or use CDQ to sign-extend EAX into EDX before a signed idiv.

If the quotient does not fit into the quotient register, a divide error exception (#DE) occurs. All flags are in an undefined state after the operation.
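
The concatenated dividend and the overflow condition can be illustrated with a Python model of the unsigned case. The function div_model is a hypothetical helper; it uses Python exceptions to stand in for the processor's divide error and returns the (quotient, remainder) pair.

```python
def div_model(high, low, divisor, bits=32):
    """Model of unsigned div on high:low; returns (quotient, remainder)."""
    mask = (1 << bits) - 1
    divisor &= mask
    if divisor == 0:
        raise ZeroDivisionError("division by zero raises a divide error")
    dividend = ((high & mask) << bits) | (low & mask)   # e.g. EDX:EAX
    quotient, remainder = divmod(dividend, divisor)
    if quotient > mask:
        raise OverflowError("quotient does not fit into the quotient register")
    return quotient, remainder


print(div_model(0, 100, 7))         # EDX cleared, EAX = 100
print(div_model(1, 0, 2))           # EDX:EAX = 2**32, divided by 2

try:
    div_model(1, 0, 1)              # quotient 2**32 needs 33 bits
except OverflowError as error:
    print("divide error:", error)
```

The last call shows why a stale value in the high register is dangerous: even a small divisor can then produce a quotient that no longer fits.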


idiv arg

As div, only signed.


neg arg

Arithmetically negates the argument (i.e. two's complement negation).


Carry Arithmetic Instructions

adc src, dest GAS Syntax
adc dest, src Intel Syntax


Add with carry. Adds src + CF to dest, storing result in dest. Usually follows a normal add instruction to deal with values twice as large as the size of the register. In the following example, source contains a 64-bit number which will be added to destination.

mov eax, [source]           ; read low 32 bits
mov edx, [source+4]         ; read high 32 bits
add [destination], eax      ; add low 32 bits
adc [destination+4], edx    ; add high 32 bits, plus carry

sbb src, dest GAS Syntax
sbb dest, src Intel Syntax

Subtract with borrow. Subtracts src + CF from dest, storing result in dest. Usually follows a normal sub instruction to deal with values twice as large as the size of the register.
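
The way add and adc cooperate on a value twice the register size can be sketched in Python. The helpers add32 and add64_in_halves are hypothetical names; add32 returns the 32-bit result together with the carry flag it would produce.

```python
MASK32 = 0xFFFFFFFF


def add32(dest, src, carry_in=0):
    """Model of a 32-bit add (carry_in=0) or adc (carry_in=CF); returns (result, CF)."""
    total = (dest & MASK32) + (src & MASK32) + carry_in
    return total & MASK32, total >> 32


def add64_in_halves(destination, source):
    """Add two 64-bit values using only 32-bit additions."""
    low, cf = add32(destination & MASK32, source & MASK32)       # add
    high, cf = add32(destination >> 32, source >> 32, cf)        # adc
    return (high << 32) | low


# The low halves overflow, so the carry has to reach the high half.
print(hex(add64_in_halves(0x00000000FFFFFFFF, 0x0000000000000001)))
```

sbb follows the same pattern for subtraction, with CF acting as the borrow from the lower half.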


Increment and Decrement

Increment

inc augend

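Increments the value in the argument (a register or memory operand) by 1. Unlike add arg, 1, it leaves CF unchanged and has a shorter encoding; on older processors it also performed faster.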


Decrement

dec minuend


Operation

Decrements the value in minuend by 1. Unlike the equivalent sub minuend, 1, it leaves CF unchanged and has a shorter encoding; on older processors it also performed faster.


Operands

Minuend may be either a register or memory operand.


Application

  • Some programming languages represent Boolean values with either all bits zero, or all bits set to one. When you are programming Boolean functions you need to take account of this. The dec instruction can help you with this. Very often you set the final (Boolean) result based on flags. By choosing an instruction that is the opposite of the intended one and then decrementing the resulting value you will obtain a value satisfying the programming language’s requirements. Here is a trivial example testing for zero.
    xor rax, rax    ; rax ≔ false (ensure result is not wrong due to any residue)
    test rdi, rdi   ; rdi ≟ 0 (ZF ≔ rdi = 0)
    setnz al        ; al ≔ ¬ZF
    dec rax         ; rax ≔ rax − 1

    If you intend to set false, the “erroneously” set 1 will be “fixed” by dec. If you intend to set true, which is represented by −1, you will decrement the value zero.

Pointer arithmetic

The lea instruction can be used for arithmetic, especially on pointers.

Logic Instructions

The instructions on this page deal with bit-wise logical instructions.

All logical instructions presented in this section are executed in the arithmetic logic unit, as the name already suggests.


binary operations

These instructions require two operands.


logical and
and mask, destination GAS Syntax
and destination, mask Intel Syntax


operation

and performs a bit-wise and of the two operands, and stores the result in destination.


example

movl $0x1, %edx     # edx ≔ 1
movl $0x0, %ecx     # ecx ≔ 0
andl %edx, %ecx     # ecx ≔ edx ∧ ecx
# here ecx would be 0 because 1 ∧ 0 ⇔ 0

application

  • An and can be used to calculate the intersection of two “sets”, or a value representing a “mask”. Some programming languages require that Boolean values are stored exactly as either 1 or 0. An and rax, 1 will ensure that at most the LSB is set.
  • If partial register addressing is not available in the desired size, an and can be used for a destination mod mask operation, that is, the remainder of integer division. For that, mask has to contain the value 2ⁿ − 1 (i.e. all lower bits set up to a certain threshold), where 2ⁿ equals your desired divisor.
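
The second application can be verified with a short Python sketch. The helper mod_pow2 is a hypothetical name; it assumes an unsigned value and a divisor that is a power of two.

```python
def mod_pow2(value, n):
    """Remainder of value divided by 2**n, computed with a single and."""
    mask = (1 << n) - 1         # 2**n - 1: the n lowest bits set
    return value & mask


print(mod_pow2(29, 3))          # 29 mod 8
print(hex(mod_pow2(0x1234, 8))) # keeps only the low byte, like reading a byte register

# The mask and the % operator agree for every unsigned value tried here.
print(all(mod_pow2(v, 4) == v % 16 for v in range(1000)))
```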


logical or
or addend, destination GAS Syntax
or destination, addend Intel Syntax


operation

The or instruction performs a bit-wise or of the two operands, and stores the result in destination.


example

movl $0x1, %edx     # edx ≔ 1
movl $0x0, %ecx     # ecx ≔ 0
orl %edx, %ecx      # ecx ≔ edx ∨ ecx
# here ecx would be 1 because 1 ∨ 0 ⇔ 1

application

  • An or can be used to calculate the union of two “sets”, or a value representing a “mask”.


logical xor
xor flip, destination GAS Syntax
xor destination, flip Intel Syntax


operation

Performs a bit-wise xor of the two operands, and stores the result in destination.


example

movl $0x1, %edx     # edx ≔ 1
movl $0x0, %ecx     # ecx ≔ 0
xorl %edx, %ecx     # ecx ≔ edx ⊕ ecx
# here ecx would be 1 because 1 ⊕ 0 ⇔ 1

application

  • xor rax, rax (or any GPR twice) will clear all bits. It is a specially recognized idiom. However, since xor affects flags it might introduce bogus dependencies.


common remarks

side effects for and, or, and xor

  • OF ≔ 0
  • CF ≔ 0
  • SF becomes the value of the most significant bit of the calculated result
  • ZF ≔ result = 0
  • PF is set according to the result


unary operations

logical not


not argument


operation

Performs a bit-wise inversion of argument.


side-effects

None.


example

movl $0x1, %edx     # edx ≔ 1
notl %edx           # edx ≔ ¬edx
# here edx would be 0xFFFFFFFE because a bitwise NOT 0x00000001 = 0xFFFFFFFE

application

  • not is frequently used to get a register with all bits set.

Shift and Rotate Instructions

Logical Shift Instructions

In a logical shift instruction (also referred to as unsigned shift), the bits that slide off the end disappear (except for the last, which goes into the carry flag), and the spaces are always filled with zeros. Logical shifts are best used with unsigned numbers.

shr src, dest GAS Syntax
shr dest, src Intel Syntax


Logical shift dest to the right by src bits.

shl src, dest GAS Syntax
shl dest, src Intel Syntax


Logical shift dest to the left by src bits.

Examples (GAS Syntax):

movw $0xff00, %ax   # ax=1111.1111.0000.0000 (0xff00, unsigned 65280, signed -256)
shrw $3, %ax        # ax=0001.1111.1110.0000 (0x1fe0, signed and unsigned 8160)
                    # (logical shifting unsigned numbers right by 3
                    # is like integer division by 8)
shlw $1, %ax        # ax=0011.1111.1100.0000 (0x3fc0, signed and unsigned 16320)
                    # (logical shifting unsigned numbers left by 1
                    # is like multiplication by 2)

Arithmetic Shift Instructions

In an arithmetic shift (also referred to as signed shift), like a logical shift, the bits that slide off the end disappear (except for the last, which goes into the carry flag). But in an arithmetic shift, the spaces are filled in such a way to preserve the sign of the number being slid. For this reason, arithmetic shifts are better suited for signed numbers in two's complement format.

sar src, dest GAS Syntax
sar dest, src Intel Syntax


Arithmetic shift dest to the right by src bits. Spaces are filled with sign bit (to maintain sign of original value), which is the original highest bit.

sal src, dest GAS Syntax
sal dest, src Intel Syntax


Arithmetic shift dest to the left by src bits. The bottom bits do not affect the sign, so the bottom bits are filled with zeros. This instruction is synonymous with SHL.

Examples (GAS Syntax):

movw $0xff00,%ax # ax=1111.1111.0000.0000 (0xff00, unsigned 65280, signed -256)
salw $2,%ax # ax=1111.1100.0000.0000 (0xfc00, unsigned 64512, signed -1024)
 # (arithmetic shifting left by 2 is like multiplication by 4 for
 # negative numbers, but has an impact on positives with most
 # significant bit set (i.e. set bits shifted out))
sarw $5,%ax # ax=1111.1111.1110.0000 (0xffe0, unsigned 65504, signed -32)
 # (arithmetic shifting right by 5 is like integer division by 32
 # for negative numbers)
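To check the signed interpretation in the example above, the arithmetic shifts can be modelled in Python. The helper names (to_signed16, sar16, sal16) are illustrative, the register is assumed to be 16 bits wide, and flags are not modelled.

```python
MASK16 = 0xFFFF  # models a 16-bit register such as %ax


def to_signed16(value):
    """Interpret a 16-bit pattern as a two's complement number."""
    value &= MASK16
    return value - 0x10000 if value & 0x8000 else value


def sar16(value, count):
    """Arithmetic right shift: copies of the sign bit fill the high bits."""
    return (to_signed16(value) >> count) & MASK16


def sal16(value, count):
    """Arithmetic left shift: the same operation as a logical left shift."""
    return (value << count) & MASK16


ax = sal16(0xFF00, 2)   # models: salw $2, %ax
bx = sar16(ax, 5)       # models: sarw $5, %ax
print(hex(ax), to_signed16(ax), hex(bx), to_signed16(bx))
```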

Extended Shift Instructions

The names of the double precision shift operations are somewhat misleading, hence they are listed as extended shift instructions on this page.

They are available for use with 16- and 32-bit data entities (registers/memory locations). The src operand is always a register, the dest operand can be a register or memory location, the cnt operand is an immediate byte value or the CL register. In 64-bit mode it is possible to address 64-bit data as well.

shld cnt, src, dest GAS Syntax
shld dest, src, cnt Intel Syntax

The operation performed by shld is to shift the most significant cnt bits out of dest, but instead of filling up the least significant bits with zeros, they are filled with the most significant cnt bits of src.

shrd cnt, src, dest GAS Syntax
shrd dest, src, cnt Intel Syntax

Likewise, the shrd operation shifts the least significant cnt bits out of dest, and fills up the most significant cnt bits with the least significant bits of the src operand.

Intel's nomenclature is misleading, in that the shift does not operate on double the basic operand size (i.e. specifying 32-bit operands doesn't make it a 64-bit shift): the src operand always remains unchanged.

Also, Intel's manual[1] states that the results are undefined when cnt is greater than the operand size, but at least for 32- and 64-bit data sizes it has been observed that shift operations are performed by (cnt mod n), with n being the data size.

Examples (GAS Syntax):

xorw %ax,%ax # ax=0000.0000.0000.0000 (0x0000)
notw %ax # ax=1111.1111.1111.1111 (0xffff)
movw $0x5500,%bx # bx=0101.0101.0000.0000
shrdw $4,%ax,%bx # bx=1111.0101.0101.0000 (0xf550), ax is still 0xffff
shldw $8,%bx,%ax # ax=1111.1111.1111.0101 (0xfff5), bx is still 0xf550

Other examples (decimal digits are used instead of binary digits to explain the concept):

# ax = 1234 5678
# bx = 8765 4321
shrd $3, %ax, %bx # ax = 1234 5678 bx = 6788 7654
# ax = 1234 5678
# bx = 8765 4321
shld $3, %ax, %bx # bx = 5432 1123 ax = 1234 5678
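The binary example above (shrdw followed by shldw) can be traced with a short Python sketch. The functions shld and shrd below are illustrative models, not an assembler API; they assume 0 < count < bits and do not model flags.

```python
def shld(dest, src, count, bits=16):
    """Shift dest left; fill the low bits with the high bits of src."""
    mask = (1 << bits) - 1
    return ((dest << count) | ((src & mask) >> (bits - count))) & mask


def shrd(dest, src, count, bits=16):
    """Shift dest right; fill the high bits with the low bits of src."""
    mask = (1 << bits) - 1
    return (((dest & mask) >> count) | (src << (bits - count))) & mask


ax = 0xFFFF
bx = 0x5500
bx = shrd(bx, ax, 4)   # models: shrdw $4, %ax, %bx
ax = shld(ax, bx, 8)   # models: shldw $8, %bx, %ax
print(hex(bx), hex(ax))
```

Note that src is only read in both helpers, which mirrors the statement that the src operand always remains unchanged.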

Rotate Instructions

Rotate Right

In a rotate instruction, the bits that slide off the end of the register are fed back into the spaces.

ror offset, variable GAS Syntax
ror variable, offset Intel Syntax

Rotate variable to the right by offset bits. Here is a graphical representation of the operation (shown on four bits for brevity):

            ╭─────────╮
%al old     │ 0 1 1 1 │
            │ │ │ │ ╰─╯
            │ │ │ ╰─╮
ror 1, %al  │ │ ╰─╮ │
            │ ╰─╮ │ │
            ╰─╮ │ │ │
%al new       1 0 1 1

The number of bits to rotate, offset, is masked to the lower 5 bits (or 6 bits in 64-bit mode). This is equivalent to an offset mod 32 operation, i.e. the remainder of integer division by 32 (note: 2^5 = 32).


Operands

  • Variable has to be a register or memory location.
  • Offset can be either
    • an immediate value (where the value 1 has a dedicated opcode),
    • or the cl register (that is the lower byte of ecx).


Modified Flags

ror only alters flags if the masked offset is non-zero. CF becomes the result’s MSB, i.e. its “sign” bit.

Furthermore, if the masked offset = 1, OF ≔ result[MSB] ⊻ result[MSB−1], so OF tells us whether “the sign” has changed.
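The masking and flag rules for ror can be summarised in a Python sketch. The function name ror and its return convention are assumptions made for illustration: it returns (result, cf, of), where None means the flag is left unchanged or is not modelled here. Only the 5-bit mask is modelled, not the 6-bit variant.

```python
def ror(value, offset, bits=8):
    """Model of rotate right on a register that is `bits` wide."""
    mask = (1 << bits) - 1
    value &= mask
    masked = offset & 0x1F          # offset masked to its lower 5 bits
    if masked == 0:
        return value, None, None    # flags are not altered
    n = masked % bits               # a full turn leaves the bits unchanged
    result = ((value >> n) | (value << (bits - n))) & mask
    cf = (result >> (bits - 1)) & 1             # CF becomes the result's MSB
    of = None
    if masked == 1:
        of = cf ^ ((result >> (bits - 2)) & 1)  # MSB xor MSB-1
    return result, cf, of


print(ror(0b0111, 1, bits=4))   # the four-bit pattern 0111 rotated by one
print(ror(0x81, 33))            # 33 is masked to 1
```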


Rotate Left

rol src, dest GAS Syntax
rol dest, src Intel Syntax


Rotate dest to the left by src bits.


Rotate With Carry Instructions

Like with shifts, the rotate can use the carry bit as the "extra" bit that it shifts through.

rcr src, dest GAS Syntax
rcr dest, src Intel Syntax


Rotate dest to the right by src bits with carry.

rcl src, dest GAS Syntax
rcl dest, src Intel Syntax


Rotate dest to the left by src bits with carry.
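Rotating through the carry flag effectively rotates a value that is one bit wider than the register. The following Python sketch illustrates this; rcl and rcr are hypothetical helper names that take the current carry flag and return (result, new carry). The count is reduced modulo bits + 1 because a full turn through that many positions restores the original state.

```python
def rcl(value, cf, count, bits=8):
    """Rotate left through carry, one bit position per step."""
    mask = (1 << bits) - 1
    value &= mask
    for _ in range(count % (bits + 1)):
        new_cf = (value >> (bits - 1)) & 1   # MSB moves into the carry
        value = ((value << 1) | cf) & mask   # old carry enters at bit 0
        cf = new_cf
    return value, cf


def rcr(value, cf, count, bits=8):
    """Rotate right through carry, one bit position per step."""
    mask = (1 << bits) - 1
    value &= mask
    for _ in range(count % (bits + 1)):
        new_cf = value & 1                          # LSB moves into the carry
        value = (value >> 1) | (cf << (bits - 1))   # old carry enters at the MSB
        cf = new_cf
    return value, cf


print(rcl(0x80, 0, 1), rcr(0x00, 1, 1))
```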


Number of arguments

Unless stated, these instructions can take either one or two arguments. If only one is supplied, it is assumed to be a register or memory location and the number of bits to shift/rotate is one (this may be dependent on the assembler in use, however). shrl $1, %eax is equivalent to shrl %eax (GAS syntax).

Other Instructions

Stack Instructions

push arg

This instruction decrements the stack pointer and stores the data specified as the argument into the location pointed to by the stack pointer.


pop arg

This instruction loads the data stored in the location pointed to by the stack pointer into the argument specified and then increments the stack pointer. For example:

mov eax, 5
mov ebx, 6
push eax    ; the stack is now: [5]
push ebx    ; the stack is now: [6] [5]
pop eax     ; the topmost item (which is 6) is now stored in eax; the stack is now: [5]
pop ebx     ; ebx is now equal to 5; the stack is now empty
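The push and pop behaviour described above can be sketched in Python. Stack32 is a hypothetical class written for this illustration: it assumes 4-byte pushes and an arbitrary initial stack pointer of 0x1000, and it uses a dictionary in place of real memory.

```python
class Stack32:
    """Minimal model of a downward-growing stack with 32-bit slots."""

    def __init__(self, top=0x1000):
        self.esp = top       # stack pointer (initial value is arbitrary)
        self.memory = {}     # address -> stored 32-bit value

    def push(self, value):
        self.esp -= 4                              # decrement first...
        self.memory[self.esp] = value & 0xFFFFFFFF  # ...then store

    def pop(self):
        # Real hardware leaves the old bytes in memory; the entry is
        # removed here only to keep the model small.
        value = self.memory.pop(self.esp)          # load first...
        self.esp += 4                              # ...then increment
        return value


stack = Stack32()
eax, ebx = 5, 6
stack.push(eax)
stack.push(ebx)
eax = stack.pop()   # eax receives the topmost item
ebx = stack.pop()
print(eax, ebx, hex(stack.esp))
```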

pushf

This instruction decrements the stack pointer and then loads the location pointed to by the stack pointer with the contents of the flag register.


popf

This instruction loads the flag register with the contents of the memory location pointed to by the stack pointer and then increments the contents of the stack pointer.


pusha

This instruction pushes all the general purpose registers onto the stack in the following order: AX, CX, DX, BX, SP, BP, SI, DI. The value of SP pushed is the value before the instruction is executed. It is useful for saving state before an operation that could potentially change these registers.


popa

This instruction pops all the general purpose registers off the stack in the reverse order of PUSHA. That is, DI, SI, BP, SP, BX, DX, CX, AX. Used to restore state after a call to PUSHA.
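The register order used by pusha and popa can be illustrated with a Python sketch. The names PUSHA_ORDER, pusha and popa are illustrative; registers are modelled as a dictionary of 16-bit values. One detail beyond the text above is included as a labelled assumption in the code: popa reads the saved SP slot but does not load it into SP.

```python
PUSHA_ORDER = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]


def pusha(regs, memory):
    """Push all eight registers; SP is stored with its pre-push value."""
    original_sp = regs["sp"]
    for name in PUSHA_ORDER:
        value = original_sp if name == "sp" else regs[name]
        regs["sp"] = (regs["sp"] - 2) & 0xFFFF
        memory[regs["sp"]] = value


def popa(regs, memory):
    """Pop in reverse order; the saved SP slot is skipped (assumption)."""
    for name in reversed(PUSHA_ORDER):
        value = memory[regs["sp"]]
        regs["sp"] = (regs["sp"] + 2) & 0xFFFF
        if name != "sp":
            regs[name] = value


regs = {"ax": 1, "cx": 2, "dx": 3, "bx": 4, "sp": 0x100, "bp": 5, "si": 6, "di": 7}
memory = {}
pusha(regs, memory)
regs.update(ax=0, cx=0, dx=0, bx=0)   # an operation clobbers some registers
popa(regs, memory)
print(regs)
```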


pushad

This instruction works similarly to pusha, but pushes the 32-bit general purpose registers onto the stack instead of their 16-bit counterparts.


popad

This instruction works similarly to popa, but pops the 32-bit general purpose registers off of the stack instead of their 16-bit counterparts.


Flags instructions

While the flags register is used to report on results of executed instructions (overflow, carry, etc.), it also contains flags that affect the operation of the processor. These flags are set and cleared with special instructions.


Interrupt Flag

The IF flag tells a processor if it should accept hardware interrupts. It should be kept set under normal execution. In fact, in protected mode, neither of the instructions that change it (sti and cli) can be executed by user-level programs.


sti

Sets the interrupt flag. If set, the processor can accept interrupts from peripheral hardware.


cli

Clears the interrupt flag. Hardware interrupts cannot interrupt execution. Programs can still generate interrupts, called software interrupts, and change the flow of execution. Non-maskable interrupts (NMI) cannot be blocked using this instruction.


Direction Flag

The DF flag tells the processor which way to read data when using string instructions. That is, whether to decrement or increment the esi and edi registers after a movs instruction.


std

Sets the direction flag. Registers will decrement, reading backwards.


cld

Clears the direction flag. Registers will increment, reading forwards.
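The effect of the direction flag on a byte-wise string move can be sketched in Python. The function copy_bytes is a hypothetical model of repeating a byte-sized movs: memory is a bytearray, esi and edi are plain indices, and df is 0 (cleared) or 1 (set).

```python
def copy_bytes(memory, esi, edi, count, df):
    """Copy `count` bytes, stepping esi/edi by +1 (df=0) or -1 (df=1)."""
    step = -1 if df else 1
    for _ in range(count):
        memory[edi] = memory[esi]   # one byte-sized movs
        esi += step                 # both index registers move the same way
        edi += step
    return esi, edi


forward = bytearray(b"abcd....")
print(copy_bytes(forward, 0, 4, 4, df=0), forward)    # after cld

backward = bytearray(b"abcd....")
print(copy_bytes(backward, 3, 7, 4, df=1), backward)  # after std
```

With the flag set, the copy has to start at the last byte of each buffer, which is why the second call passes indices 3 and 7.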


Carry Flag

The CF flag is often modified after arithmetic instructions, but it can be set or cleared manually as well.


stc

Sets the carry flag.


clc

Clears the carry flag.


cmc

Complements (inverts) the carry flag.


Other

sahf

Stores the content of AH register into the lower byte of the flag register.


lahf

Loads the AH register with the contents of the lower byte of the flag register.
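As an illustration of what travels through AH, the following Python sketch packs and unpacks the lower byte of the flag register. The bit positions (CF at bit 0, PF at bit 2, AF at bit 4, ZF at bit 6, SF at bit 7, with bit 1 reading as 1) are not stated in the text above and are taken here as an assumption based on the usual x86 flag layout. The function names simply mirror the instructions.

```python
FLAG_BITS = {"CF": 0, "PF": 2, "AF": 4, "ZF": 6, "SF": 7}


def lahf(flags):
    """Pack a dict of flag values into the byte that lahf places in AH."""
    ah = 0b00000010                 # bit 1 is assumed to always read as 1
    for name, bit in FLAG_BITS.items():
        if flags.get(name):
            ah |= 1 << bit
    return ah


def sahf(ah):
    """Unpack an AH byte into the five flags that sahf writes."""
    return {name: (ah >> bit) & 1 for name, bit in FLAG_BITS.items()}


ah = lahf({"CF": 1, "ZF": 1})
print(hex(ah), sahf(ah))
```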


I/O Instructions

in src, dest GAS Syntax
in dest, src Intel Syntax


The IN instruction almost always has the operands AX and DX (or EAX and EDX) associated with it. DX (src) frequently holds the port address to read, and AX (dest) receives the data from the port. In Protected Mode operating systems, the IN instruction is frequently locked, and normal users can't use it in their programs.

out src, dest GAS Syntax
out dest, src Intel Syntax


The OUT instruction is very similar to the IN instruction. OUT outputs data from a given register (src) to a given output port (dest). In protected mode, the OUT instruction is frequently locked so normal users can't use it.


System Instructions

These instructions were added with the Pentium II.


sysenter

This instruction causes the processor to enter protected system mode (supervisor mode or "kernel mode").


sysexit

This instruction causes the processor to leave protected system mode, and enter user mode.


Misc Instructions

Read time stamp counter

RDTSC

RDTSC was introduced with the Pentium processor. The instruction reads the number of clock cycles since reset and returns the value in EDX:EAX. This can be used to obtain low-overhead, high-resolution CPU timing. However, with modern CPU microarchitectures (multi-core, hyperthreading) and multi-CPU machines, cycle counters are not guaranteed to be synchronized between cores and CPUs. The CPU frequency may also vary due to power saving or dynamic overclocking. The instruction may therefore be less reliable than when it was first introduced, and it should be used with care for performance measurements.

It is possible to use just the lower 32-bits of the result but it should be noted that on a 600 MHz processor the register would overflow every 7.16 seconds:

2^32 cycles × (1 second / 600,000,000 cycles) ≈ 7.16 seconds

While using the full 64-bits allows for 974.9 years between overflows:

2^64 cycles × (1 second / 600,000,000 cycles) × (1 day / 86,400 seconds) × (1 year / 365 days) ≈ 974.9 years
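Both overflow figures can be checked with a few lines of Python. This is a plain arithmetic sketch: the 600 MHz clock and the 365-day year come from the text, while the names FREQ_HZ and seconds_until_wrap are illustrative.

```python
FREQ_HZ = 600_000_000          # 600 MHz processor, as in the text
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365


def seconds_until_wrap(bits, freq_hz=FREQ_HZ):
    """Time for a counter of the given width to overflow at freq_hz."""
    return 2 ** bits / freq_hz


low32_seconds = seconds_until_wrap(32)
full64_years = seconds_until_wrap(64) / SECONDS_PER_DAY / DAYS_PER_YEAR
print(round(low32_seconds, 2), round(full64_years, 1))
```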

The following program (using NASM syntax) is an example of using RDTSC to measure the number of cycles a small block takes to execute:

global main
extern printf
section .data
  align 4
  a: dd 10.0
  b: dd 5.0
  c: dd 2.0
  fmtStr: db "edx:eax=%llu edx=%d eax=%d", 0x0A, 0
section .bss
  align 4
  cycleLow:  resd 1
  cycleHigh: resd 1
  result:    resd 1
section .text
  main:                   ; Using main since we are using gcc to link
;
;  op    dst, src
;
  xor   eax, eax
  cpuid
  rdtsc
  mov   [cycleLow], eax
  mov   [cycleHigh], edx
;
; Do some work before measurements
;
  fld   dword [a]
  fld   dword [c]
  fmulp st1
  fmulp st1
  fld   dword [b]
  fld   dword [b]
  fmulp st1
  faddp st1
  fsqrt
  fstp  dword [result]
;
; Done work
;
  cpuid
  rdtsc
;
; break points so we can examine the values
; before we alter the data in edx:eax and
; before we print out the results.
;
break1:
  sub   eax, [cycleLow]
  sbb   edx, [cycleHigh]
break2:
  push  eax
  push  edx
  push  edx
  push  eax
  push  dword fmtStr
  call  printf
  add   esp, 20           ; Pop stack 5 times 4 bytes
;
; Call exit(3) syscall
;   void exit(int status)
;
  mov   ebx, 0            ; Arg one: the status
  mov   eax, 1            ; Syscall number:
  int   0x80

In order to assemble, link and run the program we need to do the following:

$ nasm -felf -g rdtsc.asm -l rdtsc.lst
$ gcc -m32 -o rdtsc rdtsc.o
$ ./rdtsc
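The sub/sbb pair at the break1 label subtracts two 64-bit counter values that are held as 32-bit halves. A Python sketch of that borrow propagation follows; the function names sub_sbb and elapsed_cycles are illustrative, and the sample counter values are made up for the demonstration.

```python
MASK32 = 0xFFFFFFFF


def sub_sbb(end_low, end_high, start_low, start_high):
    """Model of `sub eax, [cycleLow]` followed by `sbb edx, [cycleHigh]`."""
    low = end_low - start_low
    borrow = 1 if low < 0 else 0          # sub sets the carry flag on borrow
    low &= MASK32
    high = (end_high - start_high - borrow) & MASK32  # sbb consumes it
    return low, high


def elapsed_cycles(end_low, end_high, start_low, start_high):
    """Combine the two halves into one 64-bit cycle count."""
    low, high = sub_sbb(end_low, end_high, start_low, start_high)
    return (high << 32) | low


# Hypothetical readings that straddle a wrap of the low 32 bits.
print(elapsed_cycles(0x00000010, 2, 0xFFFFFFF0, 1))
```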

x86 Interrupts

What is an Interrupt?

In modern operating systems, the programmer often doesn't need to use interrupts. In Windows, for example, the programmer conducts business with the Win32 API. However, these API calls interface with the kernel, and the kernel will often trigger interrupts to perform different tasks. In older operating systems (specifically DOS), the programmer didn't have an API to use, and so they had to do all their work through interrupts.


Interrupt Instruction

int arg

This instruction issues the specified interrupt. For instance:

int 0x0A

Calls interrupt 10 (0x0A (hex) = 10 (decimal)).


Types of Interrupts

There are 3 types of interrupts: Hardware Interrupts, Software Interrupts and Exceptions.


Hardware Interrupts

Hardware interrupts are triggered by hardware devices. For instance, when you type on your keyboard, the keyboard triggers a hardware interrupt. The processor stops what it is doing, and executes the code that handles keyboard input (typically reading the key you pressed into a buffer in memory). Hardware interrupts are typically asynchronous - their occurrence is unrelated to the instructions being executed at the time they are raised.


Software Interrupts

There are also a series of software interrupts that are usually used to transfer control to a function in the operating system kernel. Software interrupts are triggered by the instruction int. For example, the instruction "int 14h" triggers interrupt 0x14. The processor then stops the current program, and jumps to the code to handle interrupt 14. When interrupt handling is complete, the processor returns flow to the original program.


Exceptions

Exceptions are caused by exceptional conditions in the code which is executing, for example an attempt to divide by zero or access a protected memory area. The processor will detect this problem, and transfer control to a handler to service the exception. This handler may re-execute the offending code after changing some value (for example, the zero divisor), or if this cannot be done, the program causing the exception may be terminated.

ARM Architecture

Arm (previously officially written all caps as ARM and usually written as such today), previously Advanced RISC Machine, originally Acorn RISC Machine, is a family of reduced instruction set computing (RISC) architectures for computer processors, configured for various environments. Arm Holdings develops the architecture and licenses it to other companies, who design their own products that implement one of those architectures, including systems-on-chips (SoC) and systems-on-modules (SoM) that incorporate memory, interfaces, radios, etc. It also designs cores that implement this instruction set and licenses these designs to a number of companies that incorporate those core designs into their own products.

Processors that have a RISC architecture typically require fewer transistors than those with a complex instruction set computing (CISC) architecture (such as the x86 processors found in most personal computers), which improves cost, power consumption, and heat dissipation. These characteristics are desirable for light, portable, battery-powered devices, including smartphones, laptops and tablet computers, and other embedded systems, but are also useful for servers and desktops to some degree. For supercomputers, which consume large amounts of electricity, Arm is also a power-efficient solution.

Arm Holdings periodically releases updates to the architecture. Architecture versions Armv3 to Armv7 support 32-bit address space (pre-Armv3 chips, made before Arm Holdings was formed, as used in the Acorn Archimedes, had 26-bit address space) and 32-bit arithmetic; most architectures have 32-bit fixed-length instructions. The Thumb version supports a variable-length instruction set that provides both 32- and 16-bit instructions for improved code density. Some older cores can also provide hardware execution of Java bytecodes; and newer ones have one instruction for JavaScript. Released in 2011, the Armv8-A architecture added support for a 64-bit address space and 64-bit arithmetic with its new 32-bit fixed-length instruction set. Some recent Arm CPUs have simultaneous multithreading (SMT) with e.g. Arm Neoverse E1 being able to execute two threads concurrently for improved aggregate throughput performance. Arm Cortex-A65AE for automotive applications is also a multithreaded processor, and has Dual Core Lock-Step for fault-tolerant designs (supporting Automotive Safety Integrity Level D, the highest level). The Neoverse N1 is designed for "as few as 8 cores" or "designs that scale from 64 to 128 N1 cores within a single coherent system".

With over 130 billion Arm processors produced, as of 2019, Arm is the most widely used instruction set architecture (ISA) and the ISA produced in the largest quantity. Currently, the widely used Cortex cores, older "classic" cores, and specialized SecurCore cores are each available in variants that include or exclude optional capabilities.


History


Microprocessor-based system on a chip


Arm1 2nd processor for the BBC Micro

The British computer manufacturer Acorn Computers first developed the Acorn RISC Machine architecture (Arm) in the 1980s to use in its personal computers. Its first Arm-based products were coprocessor modules for the 6502B based BBC Micro series of computers. After the successful BBC Micro computer, Acorn Computers considered how to move on from the relatively simple MOS Technology 6502 processor to address business markets like the one that was soon dominated by the IBM PC, launched in 1981. The Acorn Business Computer (ABC) plan required that a number of second processors be made to work with the BBC Micro platform, but processors such as the  Motorola 68000 and National Semiconductor 32016 were considered unsuitable, and the 6502 was not powerful enough for a  graphics-based user interface.

According to Sophie Wilson, all the processors tested at that time performed about the same, with about a 4 Mbit/second bandwidth.

After testing all available processors and finding them lacking, Acorn decided it needed a new architecture. Inspired by papers from the Berkeley RISC project, Acorn considered designing its own processor. A visit to the Western Design Center in  Phoenix, where the 6502 was being updated by what was effectively a single-person company, showed Acorn engineers Steve Furber and Sophie Wilson they did not need massive resources and state-of-the-art research and development facilities.

Wilson developed the instruction set, writing a simulation of the processor in BBC BASIC that ran on a BBC Micro with a 6502 second processor. This convinced Acorn engineers they were on the right track. Wilson approached Acorn's CEO, Hermann Hauser, and requested more resources. Hauser gave his approval and assembled a small team to implement Wilson's model in hardware.


Acorn RISC Machine: Arm2

The official Acorn RISC Machine project started in October 1983. They chose VLSI Technology as the silicon partner, as they were a source of ROMs and custom chips for Acorn. Wilson and Furber led the design. They implemented it with efficiency principles similar to the 6502. A key design goal was achieving low-latency input/output (interrupt) handling like the 6502. The 6502's memory access architecture had let developers produce fast machines without costly direct memory access (DMA) hardware. The first samples of Arm silicon worked properly when first received and tested on 26 April 1985.

The first Arm application was as a second processor for the BBC Micro, where it helped in developing simulation software to finish development of the support chips (VIDC, IOC, MEMC), and sped up the CAD software used in Arm2 development. Wilson subsequently rewrote BBC BASIC in Arm assembly language. The in-depth knowledge gained from designing the instruction set enabled the code to be very dense, making Arm BBC BASIC an extremely good test for any Arm emulator. The original aim of a principally Arm-based computer was achieved in 1987 with the release of the Acorn Archimedes. In 1992, Acorn once more won the Queen's Award for Technology for the Arm.

The Arm2 featured a 32-bit data bus, 26-bit address space and 27 32-bit registers. Eight bits from the program counter register were available for other purposes; the top six bits (available because of the 26-bit address space) served as status flags, and the bottom two bits (available because the program counter was always word-aligned) were used for setting modes. The address bus was extended to 32 bits in the Arm6, but program code still had to lie within the first 64 MB of memory in 26-bit compatibility mode, due to the reserved bits for the status flags. The Arm2 had a transistor count of just 30,000, compared to Motorola's six-year-older 68000 model with around 40,000. Much of this simplicity came from the lack of microcode (which represents about one-quarter to one-third of the 68000) and from (like most CPUs of the day) not including any cache. This simplicity enabled low power consumption, yet better performance than the Intel 80286. A successor, Arm3, was produced with a 4 KB cache, which further improved performance.
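The way the Arm2 shared one 32-bit program counter register between flags, address and mode can be sketched in Python. The helper names pack_r15 and unpack_r15 and the constants are illustrative; only the field widths (six flag bits at the top, two mode bits at the bottom, a word-aligned address in between) come from the paragraph above.

```python
FLAG_SHIFT = 26           # top six bits hold the status flags
ADDRESS_MASK = 0x03FFFFFC  # bits 25..2 hold the word-aligned address
MODE_MASK = 0x3           # bottom two bits hold the processor mode


def pack_r15(flags, address, mode):
    """Combine flags, a code address and a mode into one 32-bit word."""
    if address < 0 or address & 0x3 or address >= 1 << 26:
        raise ValueError("address must be word-aligned and below 64 MB")
    return ((flags & 0x3F) << FLAG_SHIFT) | address | (mode & MODE_MASK)


def unpack_r15(word):
    """Split a 32-bit word back into (flags, address, mode)."""
    return (word >> FLAG_SHIFT) & 0x3F, word & ADDRESS_MASK, word & MODE_MASK


word = pack_r15(flags=0b100000, address=0x8000, mode=3)
print(hex(word), unpack_r15(word))
```

Because the address field is 26 bits wide, the largest code address it can express is just under 2^26 bytes, which is the 64 MB limit mentioned above.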


Advanced RISC Machines Ltd. – Arm6


Die of an Arm610 microprocessor

In the late 1980s, Apple Computer and VLSI Technology started working with Acorn on newer versions of the Arm core. In 1990, Acorn spun off the design team into a new company named Advanced RISC Machines Ltd., which became Arm Ltd when its parent company, Arm Holdings plc, floated on the London Stock Exchange and NASDAQ in 1998. The new Apple-Arm work would eventually evolve into the Arm6, first released in early 1992. Apple used the Arm6-based Arm610 as the basis for their Apple Newton PDA.


Early licensees

In 1994, Acorn used the Arm610 as the main central processing unit (CPU) in their RiscPC computers. DEC licensed the Armv4 architecture and produced the StrongARM. At 233 MHz, this CPU drew only one watt (newer versions draw far less). This work was later passed to Intel as part of a lawsuit settlement, and Intel took the opportunity to supplement their i960 line with the StrongARM. Intel later developed its own high performance implementation named XScale, which it has since sold to Marvell. Transistor count of the Arm core remained essentially the same throughout these changes; Arm2 had 30,000 transistors, while Arm6 grew only to 35,000.


Market share

In 2005, about 98% of all mobile phones sold used at least one Arm processor. In 2010, producers of chips based on Arm architectures reported shipments of 6.1 billion Arm-based processors, representing 95% of smartphones, 35% of digital televisions and set-top boxes and 10% of mobile computers. In 2011, the 32-bit Arm architecture was the most widely used architecture in mobile devices and the most popular 32-bit one in embedded systems.  In 2013, 10 billion were produced and "Arm-based chips are found in nearly 60 percent of the world's mobile devices".


Licensing


Die of a STM32F103VGT6 ARM Cortex-M3 microcontroller with 1 MB flash memory by STMicroelectronics


Core licence

Arm Holdings' primary business is selling IP cores, which licensees use to create microcontrollers (MCUs),  CPUs, and systems-on-chips based on those cores. The original design manufacturer combines the Arm core with other parts to produce a complete device, typically one that can be built in existing semiconductor fabrication plants (fabs) at low cost and still deliver substantial performance. The most successful implementation has been the Arm7TDMI with hundreds of millions sold. Atmel has been a precursor design center in the Arm7TDMI-based embedded system.

The Arm architectures used in smartphones, PDAs and other mobile devices range from Armv5 to Armv7-A, used in low-end and midrange devices, to Armv8-A used in current high-end devices.

In 2009, some manufacturers introduced netbooks based on Arm architecture CPUs, in direct competition with netbooks based on Intel Atom.

Arm Holdings offers a variety of licensing terms, varying in cost and deliverables. Arm Holdings provides to all licensees an integratable hardware description of the Arm core as well as complete software development toolset (compiler,  debugger, software development kit) and the right to sell manufactured  silicon containing the Arm CPU.

SoC packages integrating Arm's core designs include Nvidia Tegra's first three generations, CSR plc's Quatro family, ST-Ericsson's Nova and NovaThor, Silicon Labs's Precision32 MCU, Texas Instruments's OMAP products, Samsung's Hummingbird and  Exynos products, Apple's A4, A5, and  A5X, and  NXP's i.MX.

Fabless licensees, who wish to integrate an Arm core into their own chip design, are usually only interested in acquiring a ready-to-manufacture verified  semiconductor intellectual property core. For these customers, Arm Holdings delivers a gate netlist description of the chosen Arm core, along with an abstracted simulation model and test programs to aid design integration and verification. More ambitious customers, including integrated device manufacturers (IDM) and foundry operators, choose to acquire the processor IP in synthesizable  RTL (Verilog) form. With the synthesizable RTL, the customer has the ability to perform architectural level optimisations and extensions. This allows the designer to achieve exotic design goals not otherwise possible with an unmodified netlist (high clock speed, very low power consumption, instruction set extensions, etc.). While Arm Holdings does not grant the licensee the right to resell the Arm architecture itself, licensees may freely sell manufactured product such as chip devices, evaluation boards and complete systems. Merchant foundries can be a special case; not only are they allowed to sell finished silicon containing Arm cores, they generally hold the right to re-manufacture Arm cores for other customers.

Arm Holdings prices its IP based on perceived value. Lower performing Arm cores typically have lower licence costs than higher performing cores. In implementation terms, a synthesizable core costs more than a hard macro (blackbox) core. Complicating price matters, a merchant foundry that holds an Arm licence, such as Samsung or Fujitsu, can offer fab customers reduced licensing costs. In exchange for acquiring the Arm core through the foundry's in-house design services, the customer can reduce or eliminate payment of Arm's upfront licence fee.

Compared to dedicated semiconductor foundries (such as TSMC and UMC) without in-house design services, Fujitsu/Samsung charge two- to three-times more per manufactured wafer. For low to mid volume applications, a design service foundry offers lower overall pricing (through subsidisation of the licence fee). For high volume mass-produced parts, the long term cost reduction achievable through lower wafer pricing reduces the impact of Arm's NRE (Non-Recurring Engineering) costs, making the dedicated foundry a better choice.

Companies that have developed chips with cores designed by Arm Holdings include Amazon.com's Annapurna Labs subsidiary,  Analog Devices, Apple, AppliedMicro (now:  MACOM Technology Solutions), Atmel,  Broadcom, Cavium, Cypress Semiconductor,  Freescale Semiconductor (now NXP Semiconductors),  Huawei,  Intel, Maxim Integrated, Nvidia, NXP,  Qualcomm,  Renesas, Samsung Electronics, ST Microelectronics,  Texas Instruments and Xilinx.


Built on Arm Cortex Technology licence

In February 2016, Arm announced the Built on Arm Cortex Technology licence, often shortened to Built on Cortex (BoC) licence. This licence allows companies to partner with Arm and make modifications to Arm Cortex designs. These design modifications will not be shared with other companies. These semi-custom core designs also have brand freedom, for example Kryo 280.

Companies that are current licensees of Built on Arm Cortex Technology include Qualcomm.


Architectural licence

Companies can also obtain an Arm architectural licence for designing their own CPU cores using the Arm instruction sets. These cores must comply fully with the Arm architecture. Companies that have designed cores that implement an Arm architecture include Apple, AppliedMicro, Broadcom, Cavium (now: Marvell), Digital Equipment Corporation, Intel, Nvidia, Qualcomm, and Samsung Electronics.


Arm Flexible Access

On 16 July 2019, Arm announced Arm Flexible Access. Arm Flexible Access provides unlimited access to included Arm intellectual property (IP) for development. Per-product licence fees are required once customers reach foundry tapeout or prototyping.

75% of Arm's most recent IP over the last two years is included in Arm Flexible Access. As of October 2019:

  • CPUs: Cortex-A5, Cortex-A7, Cortex-A32,  Cortex-A34, Cortex-A35, Cortex-A53,  Cortex-R5, Cortex-R8, Cortex-R52,  Cortex-M0, Cortex-M0+, Cortex-M3,  Cortex-M4, Cortex-M7, Cortex-M23,  Cortex-M33
  • GPUs: Mali-G52, Mali-G31. Includes Mali Driver Development Kits (DDK).
  • Interconnect: CoreLink NIC-400, CoreLink NIC-450, CoreLink CCI-400, CoreLink CCI-500, CoreLink CCI-550, ADB-400 AMBA, XHB-400 AXI-AHB
  • System Controllers: CoreLink GIC-400, CoreLink GIC-500, PL192 VIC, BP141 TrustZone Memory Wrapper, CoreLink TZC-400, CoreLink L2C-310, CoreLink MMU-500, BP140 Memory Interface
  • Security IP: CryptoCell-312, CryptoCell-712, TrustZone True Random Number Generator
  • Peripheral Controllers: PL011 UART, PL022 SPI, PL031 RTC
  • Debug & Trace: CoreSight SoC-400, CoreSight SDC-600, CoreSight STM-500, CoreSight System Trace Macrocell, CoreSight Trace Memory Controller
  • Design Kits: Corstone-101, Corstone-201
  • Physical IP: Artisan PIK for Cortex-M33 TSMC 22ULL including memory compilers, logic libraries, GPIOs and documentation
  • Tools & Materials: Socrates IP Tooling, Arm Design Studio, Virtual System Models
  • Support: Standard Arm Technical support, Arm online training, maintenance updates, credits towards onsite training and design reviews


Cores

Arm Holdings provides a list of vendors who implement Arm cores in their design (application specific standard products (ASSP), microprocessor and microcontrollers).

Example applications of Arm cores


Tronsmart MK908, a Rockchip-based quad-core Android "mini PC", with a microSD card next to it for a size comparison


Arm cores are used in a number of products, particularly PDAs and smartphones. Some computing examples are Microsoft's first generation Surface and  Surface 2, Apple's iPads and  Asus's  Eee Pad Transformer tablet computers, and several  Chromebook laptops. Others include Apple's iPhone smartphone and  iPod  portable media player, Canon PowerShot digital cameras,  Nintendo Switch hybrid and 3DS handheld game consoles, and  TomTom turn-by-turn navigation systems.

In 2005, Arm Holdings took part in the development of Manchester University's computer SpiNNaker, which used Arm cores to simulate the human brain.

Arm chips are also used in Raspberry Pi, BeagleBoard, BeagleBone,  PandaBoard and other single-board computers, because they are very small, inexpensive and consume very little power.


32-bit architecture


An Armv7 was used to power older versions of the popular Raspberry Pi single board computers like this Raspberry Pi 2 from 2015.


An Armv7 is also used to power the CuBox family of single board computers.

The 32-bit Arm architecture, such as Armv7-A (implementing AArch32; see section on ARMv8 for more on it), was the most widely used architecture in mobile devices as of 2011.

Since 1995, the Arm Architecture Reference Manual has been the primary source of documentation on the Arm processor architecture and instruction set, distinguishing interfaces that all Arm processors are required to support (such as instruction semantics) from implementation details that may vary. The architecture has evolved over time, and version seven of the architecture, Armv7, defines three architecture "profiles":

  • A-profile, the "Application" profile, implemented by 32-bit cores in the Cortex-A series and by some non-Arm cores
  • R-profile, the "Real-time" profile, implemented by cores in the Cortex-R series
  • M-profile, the "Microcontroller" profile, implemented by most cores in the Cortex-M series

Although the architecture profiles were first defined for Armv7, Arm subsequently defined the Armv6-M architecture (used by the Cortex M0/M0+/ M1) as a subset of the Armv7-M profile with fewer instructions.


CPU modes

Except in the M-profile, the 32-bit Arm architecture specifies several CPU modes, depending on the implemented architecture features. At any moment in time, the CPU can be in only one mode, but it can switch modes due to external events (interrupts) or programmatically.

  • User mode: The only non-privileged mode.
  • FIQ mode: A privileged mode that is entered whenever the processor accepts a fast interrupt request.
  • IRQ mode: A privileged mode that is entered whenever the processor accepts an interrupt.
  • Supervisor (svc) mode: A privileged mode entered whenever the CPU is reset or when an SVC instruction is executed.
  • Abort mode: A privileged mode that is entered whenever a prefetch abort or data abort exception occurs.
  • Undefined mode: A privileged mode that is entered whenever an undefined instruction exception occurs.
  • System mode (Armv4 and above): The only privileged mode that is not entered by an exception. It can only be entered by executing an instruction that explicitly writes to the mode bits of the Current Program Status Register (CPSR) from another privileged mode (not from user mode).
  • Monitor mode (Armv6 and Armv7 Security Extensions, Armv8 EL3): A monitor mode is introduced to support TrustZone extension in Arm cores.
  • Hyp mode (Armv7 Virtualization Extensions, Armv8 EL2): A hypervisor mode that supports Popek and Goldberg virtualization requirements for the non-secure operation of the CPU.
  • Thread mode (Armv6-M, Armv7-M, Armv8-M): A mode which can be specified as either privileged or unprivileged. Whether the Main Stack Pointer (MSP) or the Process Stack Pointer (PSP) is used can also be specified in the CONTROL register, which requires privileged access. This mode is designed for user tasks in an RTOS environment, but it is also typically used for the super-loop in bare-metal systems.
  • Handler mode (Armv6-M, Armv7-M, Armv8-M): A mode dedicated to exception handling (except for RESET, which is handled in Thread mode). Handler mode always uses the MSP and runs at the privileged level.


Instruction set

The original (and subsequent) Arm implementation was hardwired without microcode, like the much simpler 8-bit 6502 processor used in prior Acorn microcomputers.

The 32-bit Arm architecture (and the 64-bit architecture for the most part) includes the following RISC features:

  • Load/store architecture.
  • No support for unaligned memory accesses in the original version of the architecture. Armv6 and later, except some microcontroller versions, support unaligned accesses for half-word and single-word load/store instructions with some limitations, such as no guaranteed atomicity.
  • Uniform 16 × 32-bit register file (including the program counter, stack pointer and the link register).
  • Fixed instruction width of 32 bits to ease decoding and pipelining, at the cost of decreased code density. Later, the Thumb instruction set added 16-bit instructions and increased code density.
  • Mostly single clock-cycle execution.

To compensate for the simpler design, compared with processors like the Intel 80286 and Motorola 68020, some additional design features were used:

  • Conditional execution of most instructions reduces branch overhead and compensates for the lack of a branch predictor in early chips.
  • Arithmetic instructions alter condition codes only when desired.
  • 32-bit barrel shifter can be used without performance penalty with most arithmetic instructions and address calculations.
  • Has powerful indexed addressing modes.
  • A link register supports fast leaf function calls.
  • A simple, but fast, 2-priority-level interrupt subsystem has switched register banks.


Arithmetic instructions

Arm includes integer arithmetic operations for add, subtract, and multiply; some versions of the architecture also support divide operations.

Arm supports 32-bit × 32-bit multiplies with either a 32-bit result or 64-bit result, though Cortex-M0 / M0+ / M1 cores don't support 64-bit results. Some Arm cores also support 16-bit × 16-bit and 32-bit × 16-bit multiplies.
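
On a core without a 64-bit multiply result, a full 32-bit × 32-bit product has to be assembled from narrower multiplies. The following C sketch shows the schoolbook approach with four 16-bit × 16-bit partial products. The function name umull_soft is hypothetical, and the code is an illustration of the idea rather than the routine used by any particular runtime library.

```c
#include <stdint.h>

/* Hypothetical helper: 32 x 32 -> 64-bit unsigned multiply built from
   four 16 x 16 -> 32-bit partial products. */
uint64_t umull_soft(uint32_t a, uint32_t b)
{
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    /* Each partial product fits in 32 bits (0xFFFF * 0xFFFF = 0xFFFE0001). */
    uint32_t lo_lo = a_lo * b_lo;
    uint32_t lo_hi = a_lo * b_hi;
    uint32_t hi_lo = a_hi * b_lo;
    uint32_t hi_hi = a_hi * b_hi;

    /* Sum the partial products at their respective bit offsets. */
    uint64_t result = (uint64_t)hi_hi << 32;
    result += (uint64_t)lo_hi << 16;
    result += (uint64_t)hi_lo << 16;
    result += lo_lo;
    return result;
}
```

On cores that do provide a 64-bit result, a compiler would normally translate a plain (uint64_t)a * b into a single long-multiply instruction instead.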

The divide instructions are only included in the following Arm architectures:

  • Armv7-M and Armv7E-M architectures always include divide instructions.
  • Armv7-R architecture always includes divide instructions in the Thumb instruction set, but optionally in its 32-bit instruction set.
  • Armv7-A architecture optionally includes the divide instructions. The instructions might not be implemented, or implemented only in the Thumb instruction set, or implemented in both the Thumb and Arm instruction sets, or implemented if the Virtualization Extensions are included.
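
Where the divide instructions are absent, division has to be carried out in software. A minimal sketch of one classic technique, shift-and-subtract (restoring) division, is given below in C. The names udiv_soft and urem_soft are hypothetical, and real runtime libraries typically use more heavily optimised routines.

```c
#include <stdint.h>

/* Hypothetical helper: unsigned 32-bit division by shift-and-subtract.
   The caller must ensure divisor != 0. */
uint32_t udiv_soft(uint32_t dividend, uint32_t divisor)
{
    uint32_t quotient = 0;
    uint64_t remainder = 0;   /* 64-bit so the shift below cannot overflow */

    for (int bit = 31; bit >= 0; bit--) {
        /* Bring down the next dividend bit, most significant first. */
        remainder = (remainder << 1) | ((dividend >> bit) & 1u);
        if (remainder >= divisor) {
            remainder -= divisor;
            quotient |= (uint32_t)1 << bit;
        }
    }
    return quotient;
}

/* Remainder derived from the quotient. */
uint32_t urem_soft(uint32_t dividend, uint32_t divisor)
{
    return dividend - udiv_soft(dividend, divisor) * divisor;
}
```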


Registers

Registers R0 through R7 are the same across all CPU modes; they are never banked.

Registers R8 through R12 are the same across all CPU modes except FIQ mode. FIQ mode has its own distinct R8 through R12 registers.

R13 and R14 are banked across all privileged CPU modes except system mode. That is, each mode that can be entered because of an exception has its own R13 and R14. These registers generally contain the stack pointer and the return address from function calls, respectively.

Aliases:

  • R13 is also referred to as SP, the Stack Pointer.
  • R14 is also referred to as LR, the Link Register.
  • R15 is also referred to as PC, the Program Counter.
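
The banking rules above can be summarised as a small lookup. The C sketch below is a simplified model that covers only the seven classic modes (Monitor and Hyp modes are deliberately left out); the enum and function names are hypothetical.

```c
enum arm_mode {
    MODE_USR, MODE_FIQ, MODE_IRQ, MODE_SVC, MODE_ABT, MODE_UND, MODE_SYS
};

/* Returns the mode that owns the physical copy of register `reg` (0..15)
   seen while the CPU runs in `mode`. MODE_USR stands for the shared,
   unbanked copy. */
enum arm_mode register_bank(enum arm_mode mode, unsigned reg)
{
    if (reg <= 7 || reg == 15)          /* R0-R7 and PC are never banked */
        return MODE_USR;
    if (reg <= 12)                      /* R8-R12: private only in FIQ mode */
        return mode == MODE_FIQ ? MODE_FIQ : MODE_USR;
    if (mode == MODE_SYS)               /* System mode shares SP/LR with User */
        return MODE_USR;
    return mode;                        /* R13/R14: one copy per exception mode */
}
```

The private R8 through R12 in FIQ mode mean that a fast interrupt handler can use those registers without first saving them.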

The Current Program Status Register (CPSR) has the following 32 bits.

  • M (bits 0–4) is the processor mode bits.
  • T (bit 5) is the Thumb state bit.
  • F (bit 6) is the FIQ disable bit.
  • I (bit 7) is the IRQ disable bit.
  • A (bit 8) is the imprecise data abort disable bit.
  • E (bit 9) is the data endianness bit.
  • IT (bits 10–15 and 25–26) is the if-then state bits.
  • GE (bits 16–19) is the greater-than-or-equal-to bits.
  • DNM (bits 20–23) is the do not modify bits.
  • J (bit 24) is the Java state bit.
  • Q (bit 27) is the sticky overflow bit.
  • V (bit 28) is the overflow bit.
  • C (bit 29) is the carry/borrow/extend bit.
  • Z (bit 30) is the zero bit.
  • N (bit 31) is the negative/less than bit.
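
As an illustration of this layout, the following C sketch extracts some of the fields from a 32-bit CPSR value. The struct and function names are hypothetical. The mode-number encodings in cpsr_mode_name are not listed in the text above; they are given here as recalled from the Arm Architecture Reference Manual and should be checked against it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Selected CPSR fields, using the bit positions listed above. */
struct cpsr_fields {
    bool n, z, c, v;      /* condition flags, bits 31..28 */
    bool q;               /* sticky overflow, bit 27 */
    unsigned ge;          /* GE bits, 19..16 */
    bool e, a, i, f, t;   /* bits 9, 8, 7, 6 and 5 */
    unsigned mode;        /* processor mode bits, 4..0 */
};

static bool cpsr_bit(uint32_t word, unsigned pos)
{
    return ((word >> pos) & 1u) != 0;
}

struct cpsr_fields cpsr_decode(uint32_t cpsr)
{
    struct cpsr_fields s;
    s.n = cpsr_bit(cpsr, 31);
    s.z = cpsr_bit(cpsr, 30);
    s.c = cpsr_bit(cpsr, 29);
    s.v = cpsr_bit(cpsr, 28);
    s.q = cpsr_bit(cpsr, 27);
    s.ge = (cpsr >> 16) & 0xFu;
    s.e = cpsr_bit(cpsr, 9);
    s.a = cpsr_bit(cpsr, 8);
    s.i = cpsr_bit(cpsr, 7);
    s.f = cpsr_bit(cpsr, 6);
    s.t = cpsr_bit(cpsr, 5);
    s.mode = cpsr & 0x1Fu;
    return s;
}

/* Assumed mode encodings (not given in the text above). */
const char *cpsr_mode_name(unsigned mode)
{
    switch (mode & 0x1Fu) {
    case 0x10: return "User";
    case 0x11: return "FIQ";
    case 0x12: return "IRQ";
    case 0x13: return "Supervisor";
    case 0x16: return "Monitor";
    case 0x17: return "Abort";
    case 0x1A: return "Hyp";
    case 0x1B: return "Undefined";
    case 0x1F: return "System";
    default:   return "(reserved)";
    }
}
```

For example, decoding the value 0x600001D3 yields Z and C set, IRQ and FIQ disabled, Arm (not Thumb) state, and mode bits 0x13.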


Conditional execution

Almost every Arm instruction has a conditional execution feature called predication, which is implemented with a 4-bit condition code selector (the predicate). To allow for unconditional execution, one of the four-bit codes causes the instruction to be always executed. Most other CPU architectures only have condition codes on branch instructions.

Though the predicate takes up four of the 32 bits in an instruction code, and thus cuts down significantly on the encoding bits available for displacements in memory access instructions, it avoids branch instructions when generating code for small if statements. Apart from eliminating the branch instructions themselves, this preserves the fetch/decode/execute pipeline at the cost of only one cycle per skipped instruction.

An algorithm that provides a good example of conditional execution is the subtraction-based Euclidean algorithm for computing the greatest common divisor. In the C programming language, the algorithm can be written as:

int gcd(int a, int b) {
    while (a != b)  // We enter the loop when a < b or a > b, but not when a == b
        if (a > b)  // When a > b we do this
            a -= b;
        else        // When a < b we do that (no if (a < b) needed since a != b is checked in while condition)
            b -= a;
    return a;
}

The same algorithm can be rewritten in a way closer to target Arm instructions as:

loop:
    // Compare a and b
    GT = a > b;
    LT = a < b;
    NE = a != b;
    // Perform operations based on flag results
    if (GT) a -= b;     // Subtract *only* if greater-than
    if (LT) b -= a;     // Subtract *only* if less-than
    if (NE) goto loop;  // Loop *only* if compared values were not equal
    return a;

and coded in assembly language as:

; assign a to register r0, b to r1
loop:   CMP    r0, r1       ; set condition "NE" if (a != b),
                            ;               "GT" if (a > b),
                            ;            or "LT" if (a < b)
        SUBGT  r0, r0, r1   ; if "GT" (Greater Than), a = a-b;
        SUBLT  r1, r1, r0   ; if "LT" (Less Than), b = b-a;
        BNE    loop         ; if "NE" (Not Equal), then loop
        BX     lr           ; if the loop is not entered, we can safely return

which avoids the branches around the then and else clauses. If r0 and r1 are equal, neither of the SUB instructions is executed, so no conditional branch is needed to implement the while check at the top of the loop; such a branch would have been required had, for example, SUBLE (less than or equal) been used instead of SUBLT.
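
To make the mechanism concrete, the C sketch below models the flags set by CMP and the condition check applied to each predicated instruction, then replays the assembly loop step by step. It is a simplified model: only a few condition codes are handled, all names are hypothetical, and inputs are assumed to be positive and below 2^31 because GT and LT are signed comparisons.

```c
#include <stdbool.h>
#include <stdint.h>

struct flags { bool n, z, c, v; };

/* Model of CMP a, b: compute a - b and keep only the flags. */
static struct flags arm_cmp(uint32_t a, uint32_t b)
{
    uint32_t r = a - b;
    struct flags f;
    f.n = (r >> 31) != 0;                          /* result negative */
    f.z = (r == 0);                                /* result zero */
    f.c = (a >= b);                                /* no borrow occurred */
    f.v = ((((a ^ b) & (a ^ r)) >> 31) & 1u) != 0; /* signed overflow */
    return f;
}

/* 4-bit condition codes as encoded in the top bits of an Arm instruction. */
enum cond { EQ = 0x0, NE = 0x1, GE = 0xA, LT = 0xB, GT = 0xC, LE = 0xD, AL = 0xE };

static bool cond_passed(enum cond c, struct flags f)
{
    switch (c) {
    case EQ: return f.z;
    case NE: return !f.z;
    case GE: return f.n == f.v;
    case LT: return f.n != f.v;
    case GT: return !f.z && f.n == f.v;
    case LE: return f.z || f.n != f.v;
    default: return true;                          /* AL: always execute */
    }
}

uint32_t gcd_predicated(uint32_t r0, uint32_t r1)
{
    struct flags f;
    do {
        f = arm_cmp(r0, r1);                       /* CMP   r0, r1     */
        if (cond_passed(GT, f)) r0 = r0 - r1;      /* SUBGT r0, r0, r1 */
        if (cond_passed(LT, f)) r1 = r1 - r0;      /* SUBLT r1, r1, r0 */
    } while (cond_passed(NE, f));                  /* BNE   loop       */
    return r0;
}
```

Note that both SUB steps test the flags from the single CMP, which is why the second subtraction is unaffected by the first having changed r0.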

One of the ways that Thumb code provides a more dense encoding is to remove the four-bit selector from non-branch instructions.


Other features

Another feature of the instruction set is the ability to fold shifts and rotates into the "data processing" (arithmetic, logical, and register-register move) instructions, so that, for example, the C statement

a += (j << 2);

could be rendered as a single-word, single-cycle instruction:

ADD Ra, Ra, Rj, LSL #2

This results in the typical Arm program being denser than expected with fewer memory accesses; thus the pipeline is used more efficiently.
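
A few C expressions that fit this add-with-shifted-operand pattern are sketched below. The function names are hypothetical, and the instruction shown in each comment is the form a compiler could choose; the code actually generated depends on the compiler and its options.

```c
#include <stdint.h>

/* a += (j << 2);            possible encoding: ADD Ra, Ra, Rj, LSL #2 */
uint32_t add_shifted(uint32_t a, uint32_t j)
{
    return a + (j << 2);
}

/* Address of element i in an array of 32-bit words: base + 4*i.
   Same pattern: ADD Rd, Rbase, Ri, LSL #2 */
uint32_t word_address(uint32_t base, uint32_t i)
{
    return base + (i << 2);
}

/* Multiplication by 5 without a multiply: x + 4*x.
   Same pattern: ADD Rd, Rx, Rx, LSL #2 */
uint32_t times5(uint32_t x)
{
    return x + (x << 2);
}
```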

The Arm processor also has features rarely seen in other RISC architectures, such as PC-relative addressing (indeed, on the 32-bit Arm the PC is one of its 16 registers) and pre- and post-increment addressing modes.

The Arm instruction set has increased over time. Some early Arm processors (before Arm7TDMI), for example, have no instruction to store a two-byte quantity.


Pipelines and other implementation issues

The Arm7 and earlier implementations have a three-stage pipeline; the stages being fetch, decode and execute. Higher-performance designs, such as the Arm9, have deeper pipelines; the Cortex-A8, for example, has thirteen stages. Additional implementation changes for higher performance include a faster adder and more extensive branch prediction logic. The difference between the Arm7DI and Arm7DMI cores, for example, was an improved multiplier; hence the added "M".


Coprocessors

The Arm architecture (pre-Armv8) provides a non-intrusive way of extending the instruction set using "coprocessors" that can be addressed using MCR, MRC, MRRC, MCRR and similar instructions. The coprocessor space is divided logically into 16 coprocessors with numbers from 0 to 15, coprocessor 15 (cp15) being reserved for some typical control functions like managing the caches and MMU operation on processors that have one.
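
As a rough illustration of how such an access is encoded, the C sketch below assembles the 32-bit instruction word for an MRC (coprocessor to Arm register) or MCR (Arm register to coprocessor) transfer. The field layout is given as recalled from the A32 encoding in the Arm Architecture Reference Manual and should be verified against it; the function name is hypothetical and the condition field is fixed to "always".

```c
#include <stdint.h>

/* Hypothetical encoder for  MRC/MCR p<coproc>, <opc1>, R<rt>, c<crn>, c<crm>, <opc2>
   to_arm != 0 selects MRC (read from coprocessor), 0 selects MCR (write). */
uint32_t encode_coproc_move(int to_arm, unsigned coproc, unsigned opc1,
                            unsigned rt, unsigned crn, unsigned crm,
                            unsigned opc2)
{
    return ((uint32_t)0xE << 28)              /* condition: AL (always)        */
         | ((uint32_t)0xE << 24)              /* coprocessor register transfer */
         | ((uint32_t)(opc1 & 0x7u) << 21)
         | ((uint32_t)(to_arm ? 1u : 0u) << 20)
         | ((uint32_t)(crn & 0xFu) << 16)
         | ((uint32_t)(rt & 0xFu) << 12)
         | ((uint32_t)(coproc & 0xFu) << 8)   /* coprocessor number 0..15      */
         | ((uint32_t)(opc2 & 0x7u) << 5)
         | ((uint32_t)1 << 4)
         | (uint32_t)(crm & 0xFu);
}
```

Under these assumptions, a read of cp15 register c0 with all opcode fields zero into r0 (MRC p15, 0, r0, c0, c0, 0) encodes as 0xEE100F10.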

In Arm-based machines, peripheral devices are usually attached to the processor by mapping their physical registers into Arm memory space, into the coprocessor space, or by connecting to another device (a bus) that in turn attaches to the processor. Coprocessor accesses have lower latency, so some peripherals—for example, an XScale interrupt controller—are accessible in both ways: through memory and through coprocessors.

In other cases, chip designers only integrate hardware using the coprocessor mechanism. For example, an image processing engine might be a small Arm7TDMI core combined with a coprocessor that has specialised operations to support a specific set of HDTV transcoding primitives.


Debugging

All modern Arm processors include hardware debugging facilities, allowing software debuggers to perform operations such as halting, stepping, and breakpointing of code starting from reset. These facilities are built using JTAG support, though some newer cores optionally support Arm's own two-wire "SWD" protocol. In Arm7TDMI cores, the "D" represented JTAG debug support, and the "I" represented presence of an "EmbeddedICE" debug module. For Arm7 and Arm9 core generations, EmbeddedICE over JTAG was a de facto debug standard, though not architecturally guaranteed.

The Armv7 architecture defines basic debug facilities at an architectural level. These include breakpoints, watchpoints and instruction execution in a "Debug Mode"; similar facilities were also available with EmbeddedICE. Both "halt mode" and "monitor" mode debugging are supported. The actual transport mechanism used to access the debug facilities is not architecturally specified, but implementations generally include JTAG support.

There is a separate Arm "CoreSight" debug architecture, which is not architecturally required by Armv7 processors.


Debug Access Port

The Debug Access Port (DAP) is an implementation of an Arm Debug Interface. There are two different supported implementations, the Serial Wire JTAG Debug Port (SWJ-DP) and the Serial Wire Debug Port (SW-DP). CMSIS-DAP is a standard interface that describes how various debugging software on a host PC can communicate over USB to firmware running on a hardware debugger, which in turn talks over SWD or JTAG to a CoreSight-enabled Arm Cortex CPU.


DSP enhancement instructions

To improve the Arm architecture for digital signal processing and multimedia applications, DSP instructions were added to the set. These are signified by an "E" in the name of the Armv5TE and Armv5TEJ architectures. E-variants also imply T, D, M, and I.

The new instructions are common in digital signal processor (DSP) architectures. They include variations on signed multiply–accumulate, saturated add and subtract, and count leading zeros.
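
The behaviour of two of these operation types can be modelled in portable C. The sketch below shows a 32-bit signed saturating add and subtract and a count-leading-zeros routine; the function names are hypothetical, and the model does not track the sticky overflow (Q) flag that the hardware instructions update on saturation.

```c
#include <stdint.h>

/* Saturating add: clamps to the int32_t range instead of wrapping. */
int32_t qadd32(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + b;
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}

/* Saturating subtract. */
int32_t qsub32(int32_t a, int32_t b)
{
    int64_t diff = (int64_t)a - b;
    if (diff > INT32_MAX) return INT32_MAX;
    if (diff < INT32_MIN) return INT32_MIN;
    return (int32_t)diff;
}

/* Number of zero bits above the most significant set bit; 32 for x == 0. */
unsigned clz32(uint32_t x)
{
    unsigned count = 0;
    if (x == 0)
        return 32;
    while ((x & 0x80000000u) == 0) {
        x <<= 1;
        count++;
    }
    return count;
}
```

Saturation is useful in signal processing because a clipped sample is usually far less objectionable than one that has wrapped around to the opposite sign.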


SIMD extensions for multimedia

Introduced in the Armv6 architecture, this was a precursor to Advanced SIMD, also known as Neon.


Jazelle

Jazelle DBX (Direct Bytecode eXecution) is a technique that allows Java bytecode to be executed directly in the Arm architecture as a third execution state (and instruction set) alongside the existing Arm and Thumb-mode. Support for this state is signified by the "J" in the Armv5TEJ architecture, and in Arm9EJ-S and Arm7EJ-S core names. Support for this state is required starting in Armv6 (except for the Armv7-M profile), though newer cores only include a trivial implementation that provides no hardware acceleration.


Thumb

To improve compiled code-density, processors since the Arm7TDMI (released in 1994) have featured the Thumb instruction set, which has its own state. (The "T" in "TDMI" indicates the Thumb feature.) When in this state, the processor executes the Thumb instruction set, a compact 16-bit encoding for a subset of the Arm instruction set. Most of the Thumb instructions are directly mapped to normal Arm instructions. The space-saving comes from making some of the instruction operands implicit and limiting the number of possibilities compared to the Arm instructions executed in the Arm instruction set state.

In Thumb, the 16-bit opcodes have less functionality. For example, only branches can be conditional, and many opcodes are restricted to accessing only half of all of the CPU's general-purpose registers. The shorter opcodes give improved code density overall, even though some operations require extra instructions. In situations where the memory port or bus width is constrained to less than 32 bits, the shorter Thumb opcodes allow increased performance compared with 32-bit Arm code, as less program code may need to be loaded into the processor over the constrained memory bandwidth.

Unlike processor architectures with variable length (16- or 32-bit) instructions, such as the Cray-1 and Hitachi SuperH, the Arm and Thumb instruction sets exist independently of each other. Embedded hardware, such as the Game Boy Advance, typically has a small amount of RAM accessible with a full 32-bit datapath; the majority is accessed via a 16-bit or narrower secondary datapath. In this situation, it usually makes sense to compile Thumb code and hand-optimise a few of the most CPU-intensive sections using full 32-bit Arm instructions, placing these wider instructions into the 32-bit bus accessible memory.

The first processor with a Thumb instruction decoder was the Arm7TDMI. All Arm9 and later families, including XScale, have included a Thumb instruction decoder. The Thumb instruction set includes instructions adopted from the Hitachi SuperH (1992), which was licensed by Arm. Arm's smallest processor families (Cortex M0 and M1) implement only the 16-bit Thumb instruction set for maximum performance in lowest cost applications.


Thumb-2

Thumb-2 technology was introduced in the Arm1156 core, announced in 2003. Thumb-2 extends the limited 16-bit instruction set of Thumb with additional 32-bit instructions to give the instruction set more breadth, thus producing a variable-length instruction set. A stated aim for Thumb-2 was to achieve code density similar to Thumb with performance similar to the Arm instruction set on 32-bit memory.
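
Because 16-bit and 32-bit encodings are mixed in one instruction stream, a decoder has to determine the length of each instruction from its first halfword. The C sketch below shows the rule as recalled from the Arm Architecture Reference Manual: a halfword whose top five bits are 0b11101, 0b11110 or 0b11111 begins a 32-bit instruction, and any other halfword is a complete 16-bit instruction. The function name is hypothetical.

```c
#include <stdint.h>

/* Returns the size in bytes (2 or 4) of the Thumb instruction that
   starts with the given halfword. */
unsigned thumb_instruction_size(uint16_t first_halfword)
{
    unsigned top5 = (unsigned)first_halfword >> 11;   /* bits 15..11 */
    if (top5 == 0x1Du || top5 == 0x1Eu || top5 == 0x1Fu)
        return 4;
    return 2;
}
```

A disassembler or emulator would call such a function before fetching the second halfword, stepping through the code two or four bytes at a time.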

Thumb-2 extends the Thumb instruction set with bit-field manipulation, table branches and conditional execution. At the same time, the Arm instruction set was extended to maintain equivalent functionality in both instruction sets. A new "Unified Assembly Language" (UAL) supports generation of either Thumb or Arm instructions from the same source code; versions of Thumb seen on Armv7 processors are essentially as capable as Arm code (including the ability to write interrupt handlers). This requires a bit of care, and use of a new "IT" (if-then) instruction, which permits up to four successive instructions to execute based on a tested condition, or on its inverse. When compiling into Arm code, this is ignored, but when compiling into Thumb it generates an actual instruction. For example:

; if (r0 == r1)
CMP   r0, r1
ITE   EQ        ; ARM: no code ... Thumb: IT instruction
; then r0 = r2;
MOVEQ r0, r2    ; ARM: conditional; Thumb: condition via ITE 'T' (then)
; else r0 = r3;
MOVNE r0, r3    ; ARM: conditional; Thumb: condition via ITE 'E' (else)
; recall that the Thumb MOV instruction has no bits to encode "EQ" or "NE".
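
For readers curious how one 16-bit IT instruction can describe up to four following instructions, the C sketch below builds its encoding. The layout (0xBF00 combined with a 4-bit first condition and a 4-bit mask) is given as recalled from the Arm Architecture Reference Manual and should be checked against it. The function name and the string-based interface are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* firstcond: 4-bit condition code (0x0 = EQ, 0x1 = NE, ...).
   pattern:   the letters that follow "IT" in the mnemonic, for example
              "" for IT, "E" for ITE, "TE" for ITTE (at most three).
   Returns the 16-bit encoding, or 0 if the pattern is invalid. */
uint16_t encode_it(unsigned firstcond, const char *pattern)
{
    size_t extra = strlen(pattern);
    unsigned fc0 = firstcond & 1u;   /* lowest bit of the condition code */
    unsigned mask = 0;

    if (extra > 3)
        return 0;

    for (size_t i = 0; i < extra; i++) {
        unsigned bit;
        if (pattern[i] == 'T')
            bit = fc0;               /* "then": same condition */
        else if (pattern[i] == 'E')
            bit = fc0 ^ 1u;          /* "else": inverted condition */
        else
            return 0;
        mask |= bit << (3 - i);
    }
    mask |= 1u << (3 - extra);       /* terminating 1 marks the block length */

    return (uint16_t)(0xBF00u | ((firstcond & 0xFu) << 4) | mask);
}
```

Under these assumptions, the ITE EQ instruction from the example above encodes as 0xBF0C.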

All Armv7 chips support the Thumb instruction set. All chips in the Cortex-A series, Cortex-R series, and Arm11 series support both "Arm instruction set state" and "Thumb instruction set state", while chips in the Cortex-M series support only the Thumb instruction set.


Thumb Execution Environment (ThumbEE)

ThumbEE (erroneously called Thumb-2EE in some Arm documentation), which was marketed as Jazelle RCT (Runtime Compilation Target), was announced in 2005, first appearing in the Cortex-A8 processor. ThumbEE is a fourth instruction set state, making small changes to the Thumb-2 extended instruction set. These changes make the instruction set particularly suited to code generated at runtime (e.g. by JIT compilation) in managed Execution Environments. ThumbEE is a target for languages such as Java, C#, Perl, and Python, and allows JIT compilers to output smaller compiled code without impacting performance.

New features provided by ThumbEE include automatic null pointer checks on every load and store instruction, an instruction to perform an array bounds check, and special instructions that call a handler. In addition, because it utilises Thumb-2 technology, ThumbEE provides access to registers r8-r15 (where the Jazelle/DBX Java VM state is held). Handlers are small sections of frequently called code, commonly used to implement high level languages, such as allocating memory for a new object. These changes come from repurposing a handful of opcodes, and knowing the core is in the new ThumbEE state.

On 23 November 2011, Arm Holdings deprecated any use of the ThumbEE instruction set, and Armv8 removes support for ThumbEE.


Floating-point (VFP)

VFP (Vector Floating Point) technology is a floating-point unit (FPU) coprocessor extension to the Arm architecture (implemented differently in Armv8 – coprocessors not defined there). It provides low-cost single-precision and double-precision floating-point computation fully compliant with the ANSI/IEEE Std 754-1985 Standard for Binary Floating-Point Arithmetic. VFP provides floating-point computation suitable for a wide spectrum of applications such as PDAs, smartphones, voice compression and decompression, three-dimensional graphics and digital audio, printers, set-top boxes, and automotive applications. The VFP architecture was intended to support execution of short "vector mode" instructions but these operated on each vector element sequentially and thus did not offer the performance of true single instruction, multiple data (SIMD) vector parallelism. This vector mode was therefore removed shortly after its introduction, to be replaced with the much more powerful Neon Advanced SIMD unit.

Some devices such as the Arm Cortex-A8 have a cut-down VFPLite module instead of a full VFP module, and require roughly ten times more clock cycles per float operation. Pre-Armv8 architecture implemented floating-point/SIMD with the coprocessor interface. Other floating-point and/or SIMD units found in Arm-based processors using the coprocessor interface include FPA, FPE, iwMMXt, some of which were implemented in software by trapping but could have been implemented in hardware. They provide some of the same functionality as VFP but are not opcode-compatible with it.


VFPv1

Obsolete


VFPv2

An optional extension to the Arm instruction set in the Armv5TE, Armv5TEJ and Armv6 architectures. VFPv2 has 16 64-bit FPU registers.


VFPv3 or VFPv3-D32

Implemented on most Cortex-A8 and A9 Armv7 processors. It is backwards compatible with VFPv2, except that it cannot trap floating-point exceptions. VFPv3 has 32 64-bit FPU registers as standard, adds VCVT instructions to convert between scalar, float and double, adds immediate mode to VMOV such that constants can be loaded into FPU registers.


VFPv3-D16

As above, but with only 16 64-bit FPU registers. Implemented on Cortex-R4 and R5 processors and the Tegra 2 (Cortex-A9).


VFPv3-F16

Uncommon; it supports IEEE754-2008 half-precision (16-bit) floating point as a storage format.
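
"Storage format" means that half-precision values are kept compactly in memory and widened before arithmetic is performed on them. The C sketch below models the widening step in software, assuming the IEEE 754-2008 binary16 layout (1 sign bit, 5 exponent bits with bias 15, 10 fraction bits) and an IEEE 754 single-precision float type. The function name is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Widen an IEEE 754-2008 binary16 bit pattern to a 32-bit float. */
float half_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exponent = (h >> 10) & 0x1Fu;
    uint32_t fraction = h & 0x3FFu;
    uint32_t bits;
    float result;

    if (exponent == 0x1Fu) {
        /* Infinity or NaN: maximum exponent, fraction carried over. */
        bits = sign | 0x7F800000u | (fraction << 13);
    } else if (exponent != 0) {
        /* Normal number: rebias the exponent from 15 to 127. */
        bits = sign | ((exponent + 112u) << 23) | (fraction << 13);
    } else if (fraction == 0) {
        /* Signed zero. */
        bits = sign;
    } else {
        /* Subnormal half: normalise it, since float has enough range. */
        exponent = 113u;
        while ((fraction & 0x400u) == 0) {
            fraction <<= 1;
            exponent--;
        }
        fraction &= 0x3FFu;          /* drop the now-implicit leading 1 */
        bits = sign | (exponent << 23) | (fraction << 13);
    }

    memcpy(&result, &bits, sizeof result);
    return result;
}
```

Every half-precision value is exactly representable in single precision, so this direction of the conversion never rounds.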


VFPv4 or VFPv4-D32

Implemented on the Cortex-A12 and A15 Armv7 processors, Cortex-A7 optionally has VFPv4-D32 in the case of an FPU with Neon. VFPv4 has 32 64-bit FPU registers as standard, adds both half-precision support as a storage format and fused multiply-accumulate instructions to the features of VFPv3.


VFPv4-D16

As above, but it has only 16 64-bit FPU registers. Implemented on Cortex-A5 and A7 processors (in case of an FPU without Neon).


VFPv5-D16-M

Implemented on Cortex-M7 when single and double-precision floating-point core option exists.

In Debian GNU/Linux, and derivatives such as Ubuntu and Linux Mint, armhf (Arm hard float) refers to the Armv7 architecture including the additional VFPv3-D16 floating-point hardware extension (and Thumb-2) above. Software packages and cross-compiler tools use the armhf vs. arm/armel suffixes to differentiate.


Advanced SIMD (Neon)

The Advanced SIMD extension (aka Neon or "MPE" Media Processing Engine) is a combined 64- and 128-bit SIMD instruction set that provides standardized acceleration for media and signal processing applications. Neon is included in all Cortex-A8 devices, but is optional in Cortex-A9 devices. Neon can execute MP3 audio decoding on CPUs running at 10 MHz, and can run the GSM adaptive multi-rate (AMR) speech codec at 13 MHz. It features a comprehensive instruction set, separate register files, and independent execution hardware. Neon supports 8-, 16-, 32-, and 64-bit integer and single-precision (32-bit) floating-point data and SIMD operations for handling audio and video processing as well as graphics and gaming processing. In Neon, the SIMD supports up to 16 operations at the same time. The Neon hardware shares the same floating-point registers as used in VFP. Devices such as the Arm Cortex-A8 and Cortex-A9 support 128-bit vectors, but will execute with 64 bits at a time, whereas newer Cortex-A15 devices can execute 128 bits at a time.

A quirk of Neon in Armv7 devices is that it flushes all subnormal numbers to zero, and as a result the GCC compiler will not use it unless -funsafe-math-optimizations, which allows losing denormals, is turned on. "Enhanced" Neon defined since Armv8 does not have this quirk, but as of GCC 8.2 the same flag is still required to enable Neon instructions. On the other hand, GCC does consider Neon safe on AArch64 for Armv8.
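
The effect of flushing subnormals to zero can be modelled in software. The C sketch below assumes an IEEE 754 single-precision float and replaces any subnormal value with a zero of the same sign. The function name is hypothetical, and the sketch models only the treatment of a single value, not the full Neon arithmetic behaviour.

```c
#include <stdint.h>
#include <string.h>

/* Returns x unchanged unless it is subnormal, in which case a zero of
   the same sign is returned. */
float flush_subnormal_to_zero(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);

    /* Subnormal: exponent field all zeros, fraction field non-zero. */
    if ((bits & 0x7F800000u) == 0 && (bits & 0x007FFFFFu) != 0) {
        bits &= 0x80000000u;         /* keep only the sign bit */
        memcpy(&x, &bits, sizeof x);
    }
    return x;
}
```

This loss of very small values is why a compiler that must preserve strict IEEE 754 semantics cannot substitute such arithmetic without being told that the deviation is acceptable.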

Project Ne10 is Arm's first open-source project from its inception (Arm had earlier acquired an older project, now known as Mbed TLS). The Ne10 library is a set of common, useful functions written in both Neon and C (for compatibility). The library was created to allow developers to use Neon optimisations without learning Neon, but it also serves as a set of highly optimised Neon intrinsic and assembly code examples for common DSP, arithmetic, and image processing routines. The source code is available on GitHub.


Arm Helium technology

Helium adds more than 150 scalar and vector instructions.


Security extensions

TrustZone (for Cortex-A profile)

The Security Extensions, marketed as TrustZone Technology, is in Armv6KZ and later application profile architectures. It provides a low-cost alternative to adding another dedicated security core to an SoC, by providing two virtual processors backed by hardware based access control. This lets the application core switch between two states, referred to as worlds (to reduce confusion with other names for capability domains), in order to prevent information from leaking from the more trusted world to the less trusted world. This world switch is generally orthogonal to all other capabilities of the processor, thus each world can operate independently of the other while using the same core. Memory and peripherals are then made aware of the operating world of the core and may use this to provide access control to secrets and code on the device.

Typically, a rich operating system is run in the less trusted world, with smaller security-specialized code in the more trusted world, aiming to reduce the attack surface. Typical applications include DRM functionality for controlling the use of media on Arm-based devices, and preventing any unapproved use of the device.

In practice, since the specific implementation details of proprietary TrustZone implementations have not been publicly disclosed for review, it is unclear what level of assurance is provided for a given threat model, but they are not immune from attack.

Open Virtualization is an open source implementation of the trusted world architecture for TrustZone.

AMD has licensed and incorporated TrustZone technology into its Secure Processor Technology. Enabled in some but not all products, AMD's APUs include a Cortex-A5 processor for handling secure processing. In fact, the Cortex-A5 TrustZone core had been included in earlier AMD products, but was not enabled due to time constraints.

Samsung Knox uses TrustZone for purposes such as detecting modifications to the kernel.


TrustZone for Armv8-M (for Cortex-M profile)

The Security Extension, marketed as TrustZone for Armv8-M Technology, was introduced in the Armv8-M architecture.


No-execute page protection

As of Armv6, the Arm architecture supports no-execute page protection, which is referred to as XN, for eXecute Never.


Large Physical Address Extension (LPAE)

The Large Physical Address Extension (LPAE), which extends the physical address size from 32 bits to 40 bits, was added to the Armv7-A architecture in 2011. Physical address size is larger, 44 bits, in Cortex-A75 and Cortex-A65AE.


Armv8-R and Armv8-M

The Armv8-R and Armv8-M architectures, announced after the Armv8-A architecture, share some features with Armv8-A, but don't include any 64-bit AArch64 instructions.


Armv8.1-M

The Armv8.1-M architecture, announced in February 2019, is an enhancement of the Armv8-M architecture. It brings new features including:

  • A new vector instruction set extension. The M-Profile Vector Extension (MVE), or Helium, is for signal processing and machine learning applications.
  • Additional instruction set enhancements for loops and branches (Low Overhead Branch Extension).
  • Instructions for half-precision floating-point support.
  • Instruction set enhancement for TrustZone management for Floating Point Unit (FPU).
  • New memory attribute in the Memory Protection Unit (MPU).
  • Enhancements in debug including Performance Monitoring Unit (PMU), Unprivileged Debug Extension, and additional debug support focus on signal processing application developments.
  • Reliability, Availability and Serviceability (RAS) extension.


64/32-bit architecture


Armv8-A Platform with Cortex A57/A53 MPCore big.LITTLE CPU chip

Armv8-A


Announced in October 2011, Armv8-A (often called Armv8 while the Armv8-R is also available) represents a fundamental change to the Arm architecture. It adds an optional 64-bit architecture (e.g. Cortex-A32 is a 32-bit Armv8-A CPU while most Armv8-A CPUs support 64-bit, unlike all Armv8-R), named "AArch64", and the associated new "A64" instruction set. AArch64 provides user-space compatibility with Armv7-A, the 32-bit architecture, therein referred to as "AArch32" and the old 32-bit instruction set, now named "A32". The Thumb instruction set is referred to as "T32" and has no 64-bit counterpart. Armv8-A allows 32-bit applications to be executed in a 64-bit OS, and a 32-bit OS to be under the control of a 64-bit hypervisor. Arm announced their Cortex-A53 and Cortex-A57 cores on 30 October 2012. Apple was the first to release an Armv8-A compatible core (Apple A7) in a consumer product (iPhone 5S). AppliedMicro, using an FPGA, was the first to demo Armv8-A. The first Armv8-A SoC from Samsung is the Exynos 5433 used in the Galaxy Note 4, which features two clusters of four Cortex-A57 and Cortex-A53 cores in a big.LITTLE configuration; but it will run only in AArch32 mode.

To both AArch32 and AArch64, Armv8-A makes VFPv3/v4 and advanced SIMD (Neon) standard. It also adds cryptography instructions supporting AES, SHA-1/SHA-256 and finite field arithmetic. AArch64 was introduced in Armv8-A and is carried forward in its subsequent revisions. AArch64 is not included in the 32-bit Armv8-R and Armv8-M architectures.


Platform Security Architecture

Platform Security Architecture (PSA) is an architecture-agnostic security framework and evaluation scheme, intended to help secure Internet of Things (IoT) devices built on system-on-a-chip (SoC) processors. It was introduced by Arm in 2017 at the annual TechCon event and will be first used on Arm Cortex-M processor cores intended for microcontroller use. The PSA includes freely available threat models and security analyses that demonstrate the process for deciding on security features in common IoT products. The PSA also provides freely downloadable application programming interface (API) packages, architectural specifications, open-source firmware implementations, and related test suites. PSA Certified offers a multi-level security evaluation scheme for chip vendors, OS providers and IoT device makers.


Operating system support

32-bit operating systems


Android, a popular operating system which is primarily used on the Arm architecture.

Historical operating systems

The first 32-bit Arm-based personal computer, the Acorn Archimedes, ran an interim operating system called Arthur, which evolved into RISC OS, used on later Arm-based systems from Acorn and other vendors. Some Acorn machines also had a Unix port called RISC iX. (Neither is to be confused with RISC/os, a contemporary Unix variant for the MIPS architecture.)


Embedded operating systems

The 32-bit Arm architecture is supported by a large number of embedded and real-time operating systems, including:

  • A2
  • Android
  • ChibiOS/RT
  • Deos
  • DRYOS
  • eCos
  • embOS
  • FreeRTOS
  • Integrity
  • Linux
  • Micro-Controller Operating Systems
  • MQX
  • Nucleus PLUS
  • NuttX
  • OSE
  • OS-9
  • Pharos
  • Plan 9
  • PikeOS
  • QNX
  • RIOT
  • RTEMS
  • RTXC Quadros
  • SCIOPTA
  • ThreadX
  • TizenRT
  • T-Kernel
  • VxWorks
  • Windows Embedded Compact
  • Windows 10 IoT Core
  • Zephyr


Mobile device operating systems

The 32-bit Arm architecture is the primary hardware environment for most mobile device operating systems such as:

  • Android
  • Bada
  • BlackBerry OS/BlackBerry 10
  • Chrome OS
  • Firefox OS
  • MeeGo
  • Sailfish
  • Symbian
  • Tizen
  • Ubuntu Touch
  • webOS
  • Windows RT
  • Windows Mobile
  • Windows Phone
  • Windows 10 Mobile

Previously supported, but now discontinued:

  • iOS 10 and earlier


Desktop/server operating systems

The 32-bit Arm architecture is supported by RISC OS and by multiple Unix-like operating systems including:

  • FreeBSD
  • NetBSD
  • OpenBSD
  • OpenSolaris
  • several Linux distributions, such as:
    • Debian
    • Armbian
    • Gentoo
    • Ubuntu
    • Raspbian
    • Slackware


64-bit operating systems

Embedded operating systems

  • Integrity
  • OSE
  • SCIOPTA
  • seL4
  • Pharos
  • FreeRTOS


Mobile device operating systems

  • iOS supports Armv8-A in iOS 7 and later on 64-bit Apple SoCs. iOS 11 and later support only 64-bit Arm processors and applications.
  • Android supports Armv8-A in Android Lollipop (5.0) and later.


Desktop/server operating systems

  • Support for Armv8-A was merged into the Linux kernel version 3.7 in late 2012. Armv8-A is supported by a number of Linux distributions, such as:
    • Debian
    • Armbian
    • Ubuntu
    • Fedora
    • openSUSE
    • SUSE Linux Enterprise
    • RHEL
  • Support for Armv8-A was merged into FreeBSD in late 2014.
  • OpenBSD has experimental Armv8 support as of 2017.
  • NetBSD has Armv8 support as of early 2018.
  • Windows 10 runs 32-bit x86 and 32-bit Arm applications, as well as native Arm64 desktop apps. Support for 64-bit Arm apps in the Microsoft Store has been available since November 2018.
  • macOS will support Arm starting with macOS Big Sur in late 2020; in beta versions of macOS Big Sur, this support is limited to the Arm-based Developer Transition Kit.


Porting to 32- or 64-bit Arm operating systems

Windows applications recompiled for Arm and linked with Winelib (from the Wine project) can run on 32-bit or 64-bit Arm in Linux, FreeBSD, or other compatible operating systems. x86 binaries that have not been recompiled for Arm have been demonstrated running on Arm using QEMU with Wine (on Linux and other systems), but they do not run at full speed or with the same capabilities as applications built with Winelib.



Source: Wikipedia, https://en.wikipedia.org/wiki/ARM_architecture
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 License.