↑ Top, ALU, mov, ldi, nop, read, Semaphore, Branch, Signal, Conditional assignment, Pack/unpack
A trailing ; indicates that the assembler my try to merge the current instruction with the next one if next instruction is preceded with ; and the two instructions fit into a single opcode. E.g.
add r0, r0, 1;
;mov r2, 64
generates the same code as
add r0, r0, 1; mov r2, 64
This might cross macro boundaries. But be aware that dependencies might break your code. E.g. a read to a register file might move immediately after a write to the same register. Take special care of branch targets. They should not be joined with the previous instruction. Placing a bare colon in front of a line defines an anonymous label and prevents instruction joining over this point. Ordinary labels will do the same.
binaryopcode destination, source1, source2
unaryopcode destination, source1
opcode.setf ...
opcode.ifcc ...
op |
source1 | source2 | destination | flags (if .setf is used) | |||
---|---|---|---|---|---|---|---|
type | type | type | value | Z | N | C | |
add | uint32 | uint32 | uint32 | source1 + source2 | destination == 0 | destination >>> 31 | source1 + source2 > 0xffffffff |
sub | uint32 | uint32 | uint32 | source1 - source2 | destination == 0 | destination >>> 31 | source1 < source2 |
min | int32 | int32 | int32 | source1 > source2 ? source2 : source1 | destination == 0 | destination >>> 31 | source1 > source2 |
max | int32 | int32 | int32 | source1 > source2 ? source1 : source2 | destination == 0 | destination >>> 31 | source1 > source2 |
and | uint32 | uint32 | uint32 | source1 & source2 | destination == 0 | destination >>> 31 | 0 |
or | uint32 | uint32 | uint32 | source1 | source2 | destination == 0 | destination >>> 31 | 0 |
xor | uint32 | uint32 | uint32 | source1 ^ source2 | destination == 0 | destination >>> 31 | 0 |
shl | uint32 | uint32 | uint32 | source1 <<< (source2 & 32) | destination == 0 | destination >>> 31 | source1 >>> 32-source2 & 1 |
shr | uint32 | uint32 | uint32 | source1 >>> (source2 & 31) | destination == 0 | destination >>> 31 | source1 >>> (source2 & 31)-1 & 1 |
asr | int32 | int32 | int32 | source1 >> (source2 & 31) | destination == 0 | destination >>> 31 | source1 >>> (source2 & 31)-1 & 1 |
ror | uint32 | uint32 | uint32 | source1 >>< (source2 & 31) | destination == 0 | destination >>> 31 | 0 |
not | uint32 | uint32 | ~source1 | destination == 0 | destination >>> 31 | 0 | |
clz | uint32 | uint32 | 32 - floor(log₂(source1)) | destination == 0 | 0 |
0 | |
mul24 | uint24 | uint24 | uint32 | source1 * source2 | destination == 0 | destination >>> 31 | source1 * source2 > 0xffffff |
fadd | float32 | float32 | float32 | source1 + source2 | destination == 0 | destination >>> 31 | destination > 0 (incl. +NaN) |
fsub | float32 | float32 | float32 | source1 - source2 | destination == 0 | destination >>> 31 | destination > 0 (incl. +NaN) |
fmin | float32 | float32 | float32 | source1 > source2 ? source2 : source1 | destination == 0 | destination >>> 31 | source1 > source2 |
fmax | float32 | float32 | float32 | source1 > source2 ? source1 : source2 | destination == 0 | destination >>> 31 | source1 > source2 |
fminabs | float32 | float32 | float32 | abs(source1) > abs(source2) ? abs(source2) : abs(source1) | destination == 0 | destination >>> 31 | abs(source1) > abs(source2) |
fmaxabs | float32 | float32 | float32 | abs(source1) > abs(source2) ? abs(source1) : abs(source2) | destination == 0 | destination >>> 31 | abs(source1) > abs(source2) |
fmul | float32 | float32 | float32 | source1 * source2 | destination == 0 | destination >>> 31 | 0 |
itof | int32 | float32 | source1 | destination == 0 | destination >>> 31 | 0 | |
ftoi | float32 | int32 | source1 | destination == 0 | destination >>> 31 | 0 | |
v8adds | uint8[4] | uint8[4] | uint8[4] | min(source1[] + source2[], 255) | destination == 0 | destination >>> 31 | 0 |
v8subs | uint8[4] | uint8[4] | uint8[4] | max(min(source1[] - source2[], 255), 0) | destination == 0 | destination >>> 31 | 0 |
v8min | uint8[4] | uint8[4] | uint8[4] | min(source1[], source2[]) | destination == 0 | destination >>> 31 | 0 |
v8max | uint8[4] | uint8[4] | uint8[4] | max(source1[], source2[]) | destination == 0 | destination >>> 31 | 0 |
v8muld | uint8[4] | uint8[4] | uint8[4] | (source1 * source2 + 127) / 255 | destination == 0 | destination >>> 31 | 0 |
add.setf r3, ra0, unif
mul24 r0, r1, r2
mov destination, source
mov destination, register << rotate
mov destination, register >> rotate
mov destination, register >> r5
mov destination1, destination2, source
mov.setf ...
mov.ifcc ...
Strictly speaking mov is no QPU instruction. It is simply a convenient way to create a identity ALU instruction like or with two identical source arguments or an ldi instruction, whatever fits best.
If source is a register, the assembler preferably uses the ADD ALU to realize the movement. If either the ADD ALU is already used by the current instruction or a rotate operation is requested it uses the MUL ALU. The op-code or is used in case of the ADD ALU and v8min for the MUL ALU. Except when 16 bit floating point unpack is requested, in this case the instruction fmin is used.
If source fits into a small immediate value
then the assembler prefers this over load immediate. The assembler is
quite smart when using small immediate. E.g. the immediate value 64 which
has no direct equivalent can be achieved by passing 8 to both inputs of
the MUL ALU with instruction mul24. Again the ADD ALU is
preferred when available. But some hacks like the example before require
the MUL ALU, but the same value could be constructed by the ADD ALU from
the value 4 with the shl instruction. Even some pack modes are
considered to achieve the desired constant. See the small
immediate table for a list of supported values. The value 0
can be assigned without the use of a small immediate value by any ALU
using xor or v8subs with an identical source.
Be aware that the carry flag is not well defined in case .setf
is used because of the free choice of the opcode.
If neither the second ALU nor signalling flags are used at the end then the instruction is converted back to ldi to save ALU power.
If source does not fit into a small immediate than a ldi instruction is generated.
With some restrictions you can handle two move instructions in a single cycle. E.g. if both sources are registers or if one source is from register file A and the other source fits into a small immediate value of if both sources can be created from the same small immediate value.
mov ra29, 16
mov r3, rb4 << 2; mov r2, ra11 # Uses the MUL ALU for the first move (because of the vector rotation) and the ADD ALU for the second one.
mov r0, 0x8000000; mov tmurs, 1 # Uses small immediate value 1 with ror r0, 1, 1 to create the 0x80000000.
ldi destination, constant
ldi destination1, destination2, constant
ldi.setf ...
ldi.ifcc ...
ldi ra7, 0xffff0000
nop
anop
mnop
mnop destination
nop does nothing, well not really. At least it reserves an
instruction word that causes a delay. This could be required to meet some
instruction constraints.
The variants anop and mnop explicitly schedule to the
ADD ALU or MUL ALU respectively. Otherwise vc4asm would take whatever ALU
is available.
vc4asm allows the nop instructions to have a target. This can be used to access previous ALU results again.
read source
read is also no instruction of the QPU. It is just an extension of vc4asm to create a register file A or B access without allocation of an ALU instruction. Semantically it is identical to mov -, source, but it will not create any opcode to one of the ALUs. Instead only the raddr_a or raddr_b field is assigned. You can combine up to two read with up to two ALU instructions into a single instruction word as long as they do not require the particular register file source.
read vw_wait
When you read a register only for the purpose to create a QPU stall then there is no need to involve an ALU. In most cases it is a good advise to prefer read ... over mov -, ....
read unif
read 8; ...
and.setf -, elem_num, rb39; ldtmu0
Small immediate values cannot be combined with signals. But if you can prefetch the value in the previous instruction, you are able to use the value together with a signal without the need for a temporary accumulator or even one of the ALUs of the previous instruction.
sacq destination, number
srel destination, number
mov destination, sacqnumber
mov destination, srelnumber
sacq -, 7
mov -, sacq7
The two instructions above are equivalent. The following function below provides Broadcom compatible syntax.
.set sacq(i) sacq0 + i
mov -, sacq(7)
bra.cond destination, target
brr.cond destination, target
bra.cond destination, target1, target2
brr.cond destination, target1, target2
bra.cond destination1, destination2, target1, target2
brr.cond destination1, destination2, target1, target2
condition | zero flag | negative flag | carry flag |
---|---|---|---|
set on all SIMD elements | .allz | .alln | .allc |
not set on all SIMD elements | .allnz | .allnn | .allnc |
set on at least one SIMD element | .anyz | .anyn | .anyc |
not set on at least one SIMD element | .anynz | .anynn | .anync |
bra creates an absolute branch, i.e. target must
be a physical memory address.
brr creates a relative branch, i.e. it adds PC + 4
to the target.
Remember that branch instructions are executed 3 instructions delayed, i.e. three further instructions are always executed before any branch is taken.
bkptThe above signals can be combined with any normal ALU instructions in one line, i.e. no load immediate, no small immediate, no semaphore and no branch. See Broadcom reference guide for details.
thrsw
thrend
sbwait
sbdone
lthrsw
loadcv
loadc
ldcend
ldtmu0
ldtmu1
loadam