Addendum to the Broadcom VideoCore IV documentation

Documentation bugs

Section 3: Writing to r5

r5 can be written either per slice, from the elements 0, 4, 8 and 12, or from element 0 only, depending on whether you write to regfile A or regfile B. vc4asm supports the register names r5quad and r5rep for this purpose. The documentation is wrong: nothing is concatenated from low-order bytes.

Section 3: Table 1: ALU Instruction Fields - set flags

The flags are only updated when the corresponding condition code evaluates true and the target value is assigned. You may use this feature to do boolean operations with the flags.

Furthermore, the flags are only updated from the MUL ALU when the ADD ALU executes a nop. In contrast to the Broadcom documentation, the condition code .never has no effect on this behavior.
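
The per-element flag update described above can be modelled in a few lines. This is only a sketch in Python, not QPU code: the function name update_z_flags is hypothetical, only the Z flag is modelled, and 4 elements are used instead of 16 for brevity.

```python
def update_z_flags(old_z, result, cond):
    """Model of a conditional .setf: element i gets a new Z flag only
    where cond[i] is true, otherwise the old flag is kept."""
    return [(r == 0) if c else z for z, r, c in zip(old_z, result, cond)]

a = [0, 0, 1, 1]
b = [0, 1, 0, 1]

# First test is unconditional: Z = (a == 0)
z = update_z_flags([False] * 4, a, [True] * 4)
# Second test only runs where Z is still set (like an .ifz condition),
# so the result is Z = (a == 0) AND (b == 0)
z = update_z_flags(z, b, z)
print(z)
```

Chaining a second conditional .setf this way is what gives you a boolean AND of two tests without touching a register.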

Section 3: Branch Instruction - implicit set flags

Contrary to what the Broadcom documentation suggests, the branch instruction is able to set the flags. The sf bit occupies the same position in the instruction word as the lowest bit of the raddr_a field, so every branch instruction with an odd register source will set the flags. This is an unwanted side effect, since flags derived from a PC location are normally useless.

vc4asm reports a warning when a branch instruction is used with an odd source register. You can use explicit .setf to suppress the warning.

and.setf -,  elem_num, 1
bra -, ra1 # implies .setf => warning
mov.ifz r0, 1
...

Since a branch target will never be at the physical address 0 the mov.ifz instruction in the above example will never assign a value.
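
The overlap can be illustrated with a bit of arithmetic. The following Python sketch assumes the bit positions from the instruction encoding figures of the Broadcom guide (sf at bit 45 of an ALU instruction, raddr_a at bits 45..49 of a branch instruction). These positions and the function name are assumptions for illustration, not taken from this addendum.

```python
SF_BIT = 45          # assumed position of the sf bit in ALU instructions
RADDR_A_SHIFT = 45   # assumed lowest bit of the 5 bit raddr_a field in branches

def branch_sets_flags(raddr_a):
    """Return True if a branch reading regfile A register number raddr_a
    has the bit set that the hardware interprets as sf."""
    word = (raddr_a & 0x1F) << RADDR_A_SHIFT
    return bool((word >> SF_BIT) & 1)

for reg in (0, 1, 2, 31):
    print("ra%d" % reg, branch_sets_flags(reg))
```

Under these assumptions exactly the odd register numbers trigger the implicit .setf, which matches the warning issued by vc4asm.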

Section 3: Branch Instruction - register source

The branch target is taken from SIMD element 15 if reg is set (rather than element 0). The documentation gets this wrong in several places.
However, this can be particularly useful if you want to use the same regfile A register as branch target and for VPM/VCD setup concurrently, since the latter only uses SIMD element 0.

Section 3: Branch Instruction - target assignment

The assignment of the target register(s) of a branch instruction depends on the flags. The target register is only assigned if the branch is actually taken. The same applies to the flags if sf is true. So you cannot abuse brr.never to load the relocated values of labels into a register - what a pity!

Section 3: QPU Instruction Set - no inf, nan, denormal support

The handling of Inf and NaN seems to be broken or just not implemented in VideoCore IV. E.g. 0.0 + NaN = +Inf. In fact it only seems to support something like NaN, but it uses the binary representation of ±Inf for it. So be careful when exchanging such numbers between the GPU and the ARM core.

There is also no support for denormal numbers. They are just truncated to 0. Not that uncommon for GPUs.
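
On the ARM side it may be worth checking values before handing them to the GPU. The following Python sketch classifies IEEE 754 single precision bit patterns; the helper names are hypothetical, and what you do with a flagged value (clamp, replace, reject) is up to the application.

```python
import struct

def float32_bits(x):
    """Return the 32 bit single precision pattern of a Python float."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def classify_float32(bits):
    """Flag the number classes that the QPU does not handle like an ARM core."""
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent == 0xFF:
        return "nan" if mantissa else "inf"
    if exponent == 0 and mantissa:
        return "denormal"   # ends up as 0 on the QPU
    return "ok"

for value in (1.0, 0.0, 1e-40, float("inf"), float("nan")):
    print(value, classify_float32(float32_bits(value)))
```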

Section 3: Register-Mapped Input/Output - Host Interrupt

You need to write a non-zero value to the interrupt register, otherwise nothing happens. Conditional write access, however, seems to work as expected.

Section 4: TMU FIFO

When using direct memory access, the TFREQ FIFO can supposedly hold up to 8 slots. In practice it turned out to be unreliable to use more than 4 of them: sometimes the first QPU elements receive the result from 4 slots ahead in case of TMU cache hits.

Section 4: Writing tmu_noswap

When writing tmu_noswap, only the value of SIMD element 0 counts. Furthermore, any non-zero value will disable TMU swap, not just 1.

Section 7: QPU Reading and Writing of VPM

VPM reads seem to block immediately if the FIFO is empty. No undefined data is returned when the reads are made too early.

Furthermore, VPM read setups cannot be queued. A new setup is only accepted once the result of the previous VPM read setup has been fully transferred to the VPM FIFO, i.e. there is no more than one value outstanding.
Example: if you request 2 reads with the first write to vr_setup and another 2 reads with the second write, then the second job is ignored and you can read only two values without a deadlock. In contrast, if you request 1 read with the first setup and 3 with the second one, then everything is fine, since the single read of the first setup is immediately transferred to the FIFO and the setup is discarded. No delay slot instructions are required in the latter case.

Table 35: VCD DMA Write (VDW) Stride Setup Format

Contrary to what the documentation suggests, the STRIDE field is 16 bits wide. Probably just a typo.

Section 10: Performance counters

The documentation of the V3D_PCTRE register is wrong. You need to set bit 31 (allegedly reserved) to enable performance counters at all.
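
As a small illustration, the register value could be composed like this. The sketch assumes that the lower 16 bits of V3D_PCTRE are the per-counter enable bits, as in the Broadcom guide. The helper name pctre_value is hypothetical, and how the value reaches the register (mailbox, mmap, driver) is not covered here.

```python
PCTRE_MASTER_ENABLE = 1 << 31   # bit 31, documented as reserved

def pctre_value(counters):
    """Build a V3D_PCTRE value that enables the given counter slots 0..15."""
    value = PCTRE_MASTER_ENABLE
    for n in counters:
        if not 0 <= n < 16:
            raise ValueError("counter slot out of range: %r" % (n,))
        value |= 1 << n
    return value

print(hex(pctre_value([0, 1])))
```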

Instruction constraints

No conditional write to peripheral registers

A write to the TMU retiring register (TMU0_S, TMU1_S) or VPM must not use conditional write access. Although the conditional write itself works, the TMU/VPM FIFO is triggered unconditionally and processes the request with undefined data if the condition is not true.
You should also not write to the same register from both ALUs with inverse condition flags. E.g. mov.ifz vpm, 0; mov.ifnz vpm, 1 will write an undefined value from the MUL ALU.
Probably all other peripheral registers share the same problem.

vc4asm warns about this kind of access in verify mode.

Distance of branch instructions

There must be at least two non-branch instructions between every two branch instructions. Otherwise no branch is taken or the thread will crash. This also applies if the branch conditions are mutually exclusive and only one of the branches can actually be taken.

However, you can enqueue the next branch just before the last one is taken. Example:

# r0 contains semaphore number [0..15]
mul24 ra31, r0, 3*8
nop
brr ra31, ra31, r:sacq
nop
nop
bra -, ra31
...

:sacq
.rep i, 16
nop
nop
sacq -, i
.endr

The above code fragment dynamically acquires a semaphore depending on the number in the r0 register. This is the shortest possible code fragment to do this task.
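
Two details of the fragment can be checked with simple arithmetic: the factor 3*8 is the size of one .rep block (3 instructions of 8 bytes each), and brr and bra are separated by exactly two non-branch instructions. The Python sketch below models both. The function names are hypothetical, and the check is only a simplified stand-in for what an assembler might verify.

```python
INSTRUCTION_SIZE = 8      # QPU instructions are 64 bits wide
BLOCK_INSTRUCTIONS = 3    # nop, nop, sacq in each .rep iteration
BRANCHES = {"bra", "brr"}

def sacq_entry_offset(semaphore):
    """Byte offset of the block for the given semaphore relative to :sacq."""
    return semaphore * BLOCK_INSTRUCTIONS * INSTRUCTION_SIZE

def branch_distance_violations(mnemonics):
    """Indices of branches that follow the previous branch with fewer
    than two non-branch instructions in between."""
    violations = []
    last_branch = None
    for i, mnemonic in enumerate(mnemonics):
        if mnemonic.split(".")[0] in BRANCHES:   # ignore condition suffix
            if last_branch is not None and i - last_branch - 1 < 2:
                violations.append(i)
            last_branch = i
    return violations

print(sacq_entry_offset(15))
print(branch_distance_violations(["mul24", "nop", "brr", "nop", "nop", "bra"]))
print(branch_distance_violations(["brr", "nop", "bra"]))
```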

MUL ALU pack modes and I/O register targets

The MUL ALU pack modes that write only a single byte are not available for I/O register targets. The write-only hardware registers cannot write slices of a word.

Concurrent access to VPM registers

While VPM read and VPM write can coexist, no other concurrent VPM access in one instruction is reliable. This especially includes the ldtmu signals.

Undocumented features

Section 3: Horizontal vector rotation

Both source operands must be from the accumulators r0..r3 for full vector rotation. If you choose not to do so, the rotation will only take place within the slices. I.e. [0,1,2,3, 4,5,6,7, 8,9,10,11, 12,13,14,15] rotates to [3,0,1,2, 7,4,5,6, 11,8,9,10, 15,12,13,14]. In fact all source operands are taken from the current element and only the write is rotated within the slice by the lower two bits of the rotation count. This can be particularly useful in some cases.
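
The rotation within slices can be modelled as follows. This is a Python sketch of the behavior described above, with a hypothetical function name. The rotation count is meant in the hardware direction, i.e. values move to higher element numbers.

```python
def rotate_within_slices(values, rotation):
    """Model of a vector rotation whose source is not an accumulator:
    each value is written to a rotated position inside its own slice,
    and only the lower two bits of the rotation count matter."""
    r = rotation & 3
    out = [None] * 16
    for i, v in enumerate(values):
        base = i & ~3                    # first element of the slice
        out[base + ((i + r) & 3)] = v    # write position wraps inside the slice
    return out

print(rotate_within_slices(list(range(16)), 1))
```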

Furthermore, the restriction that the source register of the vector rotation must not be written in the previous instruction only applies if values are transferred to a lower QPU quad. Rotations within a slice and to a higher quad are safe. You can ensure this with an appropriate .ifcc extension, either at the previous write or at the rotation instruction itself. Examples:

and.setf -, elem_num, 1
mov      r0, ra0
mov.ifz  r1, r0<<1

shr.setf -, elem_num, 1
mov.ifc  r0, ra0
mov      r1, r0<<1

and.setf -, elem_num, 12
mov r0, ra0
mov.ifnz r1, r0>>3

Section 3, Table 5: Small immediate values

The small immediate codes for vector rotations can also be used as additional constants. Well, all of them are redundant with [16..31] but this allows you to combine vector rotations with immediate values. vc4asm will take care of this.

encoding    value
0x30 = 48   0xfffffff0 = -16
0x31 = 49   0xfffffff1 = -15
0x32 = 50   0xfffffff2 = -14
0x33 = 51   0xfffffff3 = -13
0x34 = 52   0xfffffff4 = -12
0x35 = 53   0xfffffff5 = -11
0x36 = 54   0xfffffff6 = -10
0x37 = 55   0xfffffff7 = -9
0x38 = 56   0xfffffff8 = -8
0x39 = 57   0xfffffff9 = -7
0x3a = 58   0xfffffffa = -6
0x3b = 59   0xfffffffb = -5
0x3c = 60   0xfffffffc = -4
0x3d = 61   0xfffffffd = -3
0x3e = 62   0xfffffffe = -2
0x3f = 63   0xffffffff = -1

Example:

shl.setf -, elem_num, 30;  mov r0, r0<<2  # Set C flag to bit 2 of elem_num, set N flag to bit 1
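
The relation between the rotation codes and their value as a constant is plain two's complement arithmetic on the 6 bit code. A short Python sketch with a hypothetical function name:

```python
def rotation_code_as_immediate(code):
    """Value seen by the other ALU when a small immediate code in the
    rotation range 48..63 is used as a constant."""
    if not 48 <= code <= 63:
        raise ValueError("not a vector rotation code: %r" % (code,))
    return code - 64

for code in (0x30, 0x3e, 0x3f):
    value = rotation_code_as_immediate(code)
    print(hex(code), value, hex(value & 0xFFFFFFFF))
```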

Secret access to previous input values through the NOP register

Reading one of the NOP registers always returns the last value read from the particular register file by the SIMD elements 12 to 15. So reading ra39 returns the last 4 element values read from register file A, repeated for each quad, and reading rb39 returns the last 4 element values read from regfile B. Small immediate values, including vector rotations, count as a register file B read as well. Even values read from peripheral registers like unif stay there. Branch instructions and load immediate do not change the values.

Of course, this is an undocumented side effect. But it seems not to be just a dangling reference to some residual voltage at internal bus lines. Even after several seconds the values are stable if they are not overwritten by other instructions in the meantime.
See also Figure 2 (QPU Core Pipeline) of the documentation. The input values from the register files simply stay at the ALU muxes.

Possible applications

First of all, this is a kind of vector rotation, because you can move values from elements 12 to 15 to lower element numbers this way - even by the ADD ALU. Example:

read elem_num  # regfile A read
mov r0, ra39 # results in [12,13,14,15, 12,13,14,15, 12,13,14,15, 12,13,14,15]
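
The result of the example above follows directly from the rule that only the values of elements 12 to 15 survive. A minimal Python model with a hypothetical function name:

```python
def read_nop_register(last_read):
    """Model of reading ra39/rb39: the values that elements 12..15 saw at
    the last read of this register file, repeated for each quad."""
    return list(last_read[12:16]) * 4

elem_num = list(range(16))          # what the regfile A read delivered
print(read_nop_register(elem_num))
```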

Secondly, you can access a value read by a previous instruction again without the need to store it in an accumulator. This is always interesting when you cannot read the value from the same register again, because either it has been overwritten in the meantime or it is a non-repeatable read from a peripheral register like unif. But keep in mind that in the latter case it makes an important difference whether regfile A or regfile B was used to read the uniform value. So do not use the pseudo register unif in this case, since vc4asm will choose freely which register file to use. Example:

add ra0, ra32, 8  # read uniform
sub rb0, ra39, 8 # read the same uniform again

Combine immediate value with signal:

add ra0, ra0, 8
add ra1, ra1, rb39; ldtmu0 # adds immediate value 8 to ra1 too

Secret access to previous MUL ALU result

Using the MUL ALU opcode 0 in conjunction with waddr_mul lets you access the last result of the MUL ALU again. Because of the virtual parallelism of the quads, only the results from the last quad, i.e. only SIMD elements 12..15, are available. But be careful: if the value has not been assigned because of an .if condition, the result is unreliable.

vc4asm supports the special instruction mnop to explicitly use the nop instruction of the MUL ALU. Example:

ldi.setf r0, [0,1,2,3, 1,2,3,0, 2,3,0,1, 3,0,1,2]
nop; v8adds vpm, elem_num, r0 # results in [ 0, 2, 4, 6, 5, 7, 9, 7, 10,12,10,12, 15,13,15,17]
nop; mnop r2 # results in [15,13,15,17, 15,13,15,17, 15,13,15,17, 15,13,15,17]
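
The numbers in the comments can be reproduced with a small model. The Python sketch below implements v8adds as a per-byte unsigned saturating add and then repeats the last quad, as described above. It only models this example, not the MUL ALU in general.

```python
def v8adds(a, b):
    """Per-byte unsigned saturating add of two 32 bit words."""
    result = 0
    for shift in (0, 8, 16, 24):
        s = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        result |= min(s, 0xFF) << shift
    return result

r0 = [0,1,2,3, 1,2,3,0, 2,3,0,1, 3,0,1,2]
elem_num = list(range(16))

mul_result = [v8adds(e, r) for e, r in zip(elem_num, r0)]   # value written to vpm
mnop_result = mul_result[12:16] * 4                          # value seen by mnop r2
print(mul_result)
print(mnop_result)
```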

Cache details

Cache              size      associativity  cache line  address bits
Instruction cache  4 kiB     4-way          64 B        0:7
TMU cache          4 kiB     -              64 B        0:9
V3D L2C            32 kiB ?  4-way          64 B        0:10 ?

TMU cache

The access sequence counts! The items are loaded in order of the QPU element number. I.e. if two elements load the same address, the second load still causes a cache miss if an element in between loads a memory address at a 4 kiB offset.
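
The effect of the access order can be illustrated with a toy model. The Python sketch below assumes a direct-mapped cache of 4 kiB with 64 byte lines. That is an assumption derived from the sizes in the table and the 4 kiB conflict described above, not a verified description of the TMU cache, and the function name is hypothetical.

```python
LINE_SIZE = 64
CACHE_SIZE = 4096   # assumed direct-mapped, so one line per slot

def count_misses(addresses):
    """Count cache misses for a sequence of loads in element order."""
    slots = {}   # slot index -> line number currently held
    misses = 0
    for addr in addresses:
        line = addr // LINE_SIZE
        slot = line % (CACHE_SIZE // LINE_SIZE)
        if slots.get(slot) != line:
            misses += 1
            slots[slot] = line
    return misses

print(count_misses([0x1000, 0x1000, 0x2000]))   # same address back to back
print(count_misses([0x1000, 0x2000, 0x1000]))   # 4 kiB offset in between
```

In this model, keeping loads of the same address in adjacent elements avoids the extra miss.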