VC4ASM - Assembler instructions

↑ Top, ALU, mov, ldi, nop, read, Semaphore, Branch, Signal, Conditional assignment, Pack/unpack

General rules

Each line of code with QPU assembler instructions generates exactly one 64 bit opcode for the QPU.
Multiple logical instructions can be placed in one line as long as their combination still fits into one opcode. The delimiter is ; (semicolon). Example:
mov ra11, rb11; mov rb11, ra11; ldtmu0
The sequence of the instructions in one line is normally not significant. Only if an instruction can be realized by the ADD and the MUL ALU then the ADD ALU is preferred as long as it has not already been used in the current line. You might use nop to explicitly mark the ADD ALU as used and place the second ALU instruction to the MUL ALU.
Operands are delimited by , (comma), destination operands first.
Instruction extensions can be applied to the opcode delimited by a . (dot). Some of the extensions can also be applied to the source or destination operands.

Instruction joining

A trailing ; indicates that the assembler my try to merge the current instruction with the next one if next instruction is preceded with ; and the two instructions fit into a single opcode. E.g.

add r0, r0, 1;
;mov r2, 64

generates the same code as

add r0, r0, 1;  mov r2, 64

This might cross macro boundaries. But be aware that dependencies might break your code. E.g. a read to a register file might move immediately after a write to the same register. Take special care of branch targets. They should not be joined with the previous instruction. Placing a bare colon in front of a line defines an anonymous label and prevents instruction joining over this point. Ordinary labels will do the same.

ALU instruction

binaryopcode destination, source1, source2
unaryopcode destination, source1
opcode.setf ...
opcode.ifcc ...

binaryopcode, unaryopcode: ALU opcode, see below.
destination: Target register.
source1, source2: Source register or small immediate value.

Opcodes

opcode	source1	source2	destination		flags (if `.setf` is used)
opcode	type	type	type	value	Z	N	C
`add`	`uint32`	`uint32`	`uint32`	`source1 + source2`	`destination == 0`	`destination >>> 31`	`source1 + source2 > 0xffffffff`
`sub`	`uint32`	`uint32`	`uint32`	`source1 - source2`	`destination == 0`	`destination >>> 31`	`source1 < source2`
`min`	`int32`	`int32`	`int32`	`source1 > source2 ? source2 : source1`	`destination == 0`	`destination >>> 31`	`source1 > source2`
`max`	`int32`	`int32`	`int32`	`source1 > source2 ? source1 : source2`	`destination == 0`	`destination >>> 31`	`source1 > source2`
`and`	`uint32`	`uint32`	`uint32`	`source1 & source2`	`destination == 0`	`destination >>> 31`	`0`
`or`	`uint32`	`uint32`	`uint32`	`source1 \| source2`	`destination == 0`	`destination >>> 31`	`0`
`xor`	`uint32`	`uint32`	`uint32`	`source1 ^ source2`	`destination == 0`	`destination >>> 31`	`0`
`shl`	`uint32`	`uint32`	`uint32`	`source1 <<< (source2 & 32)`	`destination == 0`	`destination >>> 31`	`source1 >>> 32-source2 & 1`
`shr`	`uint32`	`uint32`	`uint32`	`source1 >>> (source2 & 31)`	`destination == 0`	`destination >>> 31`	`source1 >>> (source2 & 31)-1 & 1`
`asr`	`int32`	`int32`	`int32`	`source1 >> (source2 & 31)`	`destination == 0`	`destination >>> 31`	`source1 >>> (source2 & 31)-1 & 1`
`ror`	`uint32`	`uint32`	`uint32`	`source1 >>< (source2 & 31)`	`destination == 0`	`destination >>> 31`	`0`
`not`	`uint32`		`uint32`	`~source1`	`destination == 0`	`destination >>> 31`	`0`
`clz`	`uint32`		`uint32`	`32 - floor(log₂(source1))`	`destination == 0`	`0`	`0`
`mul24`	`uint24`	`uint24`	`uint32`	`source1 * source2`	`destination == 0`	`destination >>> 31`	`source1 * source2 > 0xffffff`
`fadd`	`float32`	`float32`	`float32`	`source1 + source2`	`destination == 0`	`destination >>> 31`	`destination > 0` (incl. `+NaN`)
`fsub`	`float32`	`float32`	`float32`	`source1 - source2`	`destination == 0`	`destination >>> 31`	`destination > 0` (incl. `+NaN`)
`fmin`	`float32`	`float32`	`float32`	`source1 > source2 ? source2 : source1`	`destination == 0`	`destination >>> 31`	`source1 > source2`
`fmax`	`float32`	`float32`	`float32`	`source1 > source2 ? source1 : source2`	`destination == 0`	`destination >>> 31`	`source1 > source2`
`fminabs`	`float32`	`float32`	`float32`	`abs(source1) > abs(source2) ? abs(source2) : abs(source1)`	`destination == 0`	`destination >>> 31`	`abs(source1) > abs(source2)`
`fmaxabs`	`float32`	`float32`	`float32`	`abs(source1) > abs(source2) ? abs(source1) : abs(source2)`	`destination == 0`	`destination >>> 31`	`abs(source1) > abs(source2)`
`fmul`	`float32`	`float32`	`float32`	`source1 * source2`	`destination == 0`	`destination >>> 31`	`0`
`itof`	`int32`		`float32`	`source1`	`destination == 0`	`destination >>> 31`	`0`
`ftoi`	`float32`		`int32`	`source1`	`destination == 0`	`destination >>> 31`	`0`
`v8adds`	`uint8[4]`	`uint8[4]`	`uint8[4]`	`min(source1[] + source2[], 255)`	`destination == 0`	`destination >>> 31`	`0`
`v8subs`	`uint8[4]`	`uint8[4]`	`uint8[4]`	`max(min(source1[] - source2[], 255), 0)`	`destination == 0`	`destination >>> 31`	`0`
`v8min`	`uint8[4]`	`uint8[4]`	`uint8[4]`	`min(source1[], source2[])`	`destination == 0`	`destination >>> 31`	`0`
`v8max`	`uint8[4]`	`uint8[4]`	`uint8[4]`	`max(source1[], source2[])`	`destination == 0`	`destination >>> 31`	`0`
`v8muld`	`uint8[4]`	`uint8[4]`	`uint8[4]`	`(source1 * source2 + 127) / 255`	`destination == 0`	`destination >>> 31`	`0`

Example

add.setf r3, ra0, unif
mul24 r0, r1, r2

Move instruction

mov destination, source
mov destination, register << rotate
mov destination, register >> rotate
mov destination, register >> r5
mov destination1, destination2, source
mov.setf ...
mov.ifcc ...

destination: Target register(s).
source: Source register or immediate value.
register: Source register for small rotate instructions.
rotate: Optional rotation of the value.

Strictly speaking mov is no QPU instruction. It is simply a convenient way to create a identity ALU instruction like or with two identical source arguments or an ldi instruction, whatever fits best.

If source is a register, the assembler preferably uses the ADD ALU to realize the movement. If either the ADD ALU is already used by the current instruction or a rotate operation is requested it uses the MUL ALU. The op-code or is used in case of the ADD ALU and v8min for the MUL ALU. Except when 16 bit floating point unpack is requested, in this case the instruction fmin is used.

If source fits into a small immediate value then the assembler prefers this over load immediate. The assembler is quite smart when using small immediate. E.g. the immediate value 64 which has no direct equivalent can be achieved by passing 8 to both inputs of the MUL ALU with instruction mul24. Again the ADD ALU is preferred when available. But some hacks like the example before require the MUL ALU, but the same value could be constructed by the ADD ALU from the value 4 with the shl instruction. Even some pack modes are considered to achieve the desired constant. See the small immediate table for a list of supported values. The value 0 can be assigned without the use of a small immediate value by any ALU using xor or v8subs with an identical source.
Be aware that the carry flag is not well defined in case .setf is used because of the free choice of the opcode.

If neither the second ALU nor signalling flags are used at the end then the instruction is converted back to ldi to save ALU power.

If source does not fit into a small immediate than a ldi instruction is generated.

With some restrictions you can handle two move instructions in a single cycle. E.g. if both sources are registers or if one source is from register file A and the other source fits into a small immediate value of if both sources can be created from the same small immediate value.

Examples

mov ra29, 16
mov r3, rb4 << 2;  mov r2, ra11 # Uses the MUL ALU for the first move (because of the vector rotation) and the ADD ALU for the second one.
mov r0, 0x8000000; mov tmurs, 1 # Uses small immediate value 1 with ror r0, 1, 1 to create the 0x80000000.

Load immediate

ldi destination, constant
ldi destination1, destination2, constant
ldi.setf ...
ldi.ifcc ...

destination: Target register.
constant: Immediate value.

In contrast to mov ldi always generates a load immediate instruction even if the constant fits into a small immediate value. The same value can be assigned to two targets at once by using the ADD and the MUL ALU output.

Example

ldi ra7, 0xffff0000

No operation

nop
anop
mnop
mnop destination

destination: Target register.

nop does nothing, well not really. At least it reserves an instruction word that causes a delay. This could be required to meet some instruction constraints.
The variants anop and mnop explicitly schedule to the ADD ALU or MUL ALU respectively. Otherwise vc4asm would take whatever ALU is available.

vc4asm allows the nop instructions to have a target. This can be used to access previous ALU results again.

Read pseudo instruction

read source

source: Source register. This must be from register file A or B including peripherals or a small immediate value.

read is also no instruction of the QPU. It is just an extension of vc4asm to create a register file A or B access without allocation of an ALU instruction. Semantically it is identical to mov -, source, but it will not create any opcode to one of the ALUs. Instead only the raddr_a or raddr_b field is assigned. You can combine up to two read with up to two ALU instructions into a single instruction word as long as they do not require the particular register file source.

Use cases

Wait for peripheral register

read vw_wait

When you read a register only for the purpose to create a QPU stall then there is no need to involve an ALU. In most cases it is a good advise to prefer read ... over mov -, ....

Discard uniform or VPM value

read unif

Prefetch small immediate value

read 8; ... and.setf -, elem_num, rb39; ldtmu0

Small immediate values cannot be combined with signals. But if you can prefetch the value in the previous instruction, you are able to use the value together with a signal without the need for a temporary accumulator or even one of the ALUs of the previous instruction.

Semaphore instruction

sacq destination, number
srel destination, number
mov destination, sacqnumber
mov destination, srelnumber

destination: Target register, usually -, since the output of a semaphore instruction is not generally useful. But if it happens to be useful you may assign the value like with an ldi instruction.
number: Semaphore number to acquire or release. Only the low order 4 bits of the value are used to identify the semaphore number. Bit 4 is controlled by the acquire/release flag and any further bits are placed unchanged into the immediate value field of the instruction and may be chosen arbitrary to if you want to assign a destination.

Example

sacq -, 7
mov -, sacq7

The two instructions above are equivalent. The following function below provides Broadcom compatible syntax.

.set sacq(i) sacq0 + i
mov -, sacq(7)

Branch instruction

bra.cond destination, target
brr.cond destination, target
bra.cond destination, target1, target2
brr.cond destination, target1, target2
bra.cond destination1, destination2, target1, target2
brr.cond destination1, destination2, target1, target2

.cond

Branch condition, optional. One of:

condition	zero flag	negative flag	carry flag
set on all SIMD elements	`.allz`	`.alln`	`.allc`
not set on all SIMD elements	`.allnz`	`.allnn`	`.allnc`
set on at least one SIMD element	`.anyz`	`.anyn`	`.anyc`
not set on at least one SIMD element	`.anynz`	`.anynn`	`.anync`

destination, destination1, destination2

Target register or -. The destination(s) receive the PC position where the branch takes place, i.e. PC + 4, but the assignment only takes place if the branch is actually taken.
The option to have two destination registers require to specify two branch targets also. In doubt use 0 or -, e.g: brr ra_link, r0, -, r:target

target, target1, target2

Register from register file A, constant or label. Branch instructions can add two targets if one of them is a register and the other one is a constant or label.
Note that the use of odd register numbers implies .setf which is generally not intended.

bra creates an absolute branch, i.e. target must be a physical memory address.
brr creates a relative branch, i.e. it adds PC + 4 to the target.

Remember that branch instructions are executed 3 instructions delayed, i.e. three further instructions are always executed before any branch is taken.

Signaling instruction

bkpt
thrsw
thrend
sbwait
sbdone
lthrsw
loadcv
loadc
ldcend
ldtmu0
ldtmu1
loadam

The above signals can be combined with any normal ALU instructions in one line, i.e. no load immediate, no small immediate, no semaphore and no branch. See Broadcom reference guide for details.