VC4ASM - macro assembler for Broadcom VideoCore IV
aka Raspberry Pi GPU

The goal of the vc4asm project is a full-featured macro assembler and disassembler with constraint checking.
The work is based on qpu-asm by Pete Warden, which itself is based on Eman's work, with some ideas taken from Herman H Hermitage. But it goes considerably further. Above all, it supports macros and functions.
Unfortunately this area is largely undocumented in the public domain. So the work uses only the code from hello_fft, which is available as both source and binary, as a Rosetta Stone. However, I try to keep it backward compatible with the Broadcom tools.

→ Assembler, → Disassembler, → Known problems, → Build instructions, → Samples, → Change log, → Contact

Download

Download Source code, Raspberry Pi 1 binary, examples and this documentation (750k)

The current version is 0.3. See change log for further details.

The source code is also available at github.com/maazl/vc4asm.

Assembler vc4asm

The heart of the software. It assembles QPU code to binary, ELF or C source.

vc4asm [<options> ...] <qasm-file> [<qasm-file2> ...]

Options

-o <bin-output>
File name for binary output. If omitted no binary output is generated.
Note that vc4asm always writes little endian binaries.
-C <C-fragment-output>
File name for C/C++ output. The result does not include surrounding braces. So write it to a separate file and include it from C as follows:
static const uint32_t qpu_code[] = {
#include<C-output>
};
-c <C-output>
Write full C output file. Requires header file also (option -h).
-h <C-header>
Write C header file, containing global symbols.
This file is compatible with the -c and the -e output.
-H <C-header>
Write C header file without inline symbol values. This variant causes all symbols to be resolved by the linker. So no recompile is required when the GPU code changes unless new symbols are used in the referring C code.
This file is compatible with the -e output only.
-v
Decorate C output with comments containing code offsets, labels and source code lines.
-e <ELF-output>
Write the assembled binary directly to an ARM compatible object file in ELF format that can be passed to ld or gcc.
The ELF object will contain all symbols that have been exported by .global and the following predefined symbols:
- <filename-wo-extension> points to the starting address of the generated binary code.
- <filename-wo-extension>_end points behind the generated binary code.
- <filename-wo-extension>_size receives the size of the generated binary code in bytes.
Special characters in the file name are replaced by an underscore.
-s
Do not automatically create the predefined symbols derived from the file name for -e and -C output.
You need to use .global to be able to access the code by a linker symbol.
-I <include-path>
Add an include path to the search path list. These paths are used by .include <...>. Note that this is a prefix rather than a path, i.e. if it is a folder it should contain a trailing slash.
-i <qinc-file>
Load a file using the include search path (see option -I). This is useful to include vc4.qinc without an absolute path: -i vc4.qinc
-V
Check for VideoCore IV constraints, e.g. reading a register file address immediately after writing it.
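
The naming rule for the predefined symbols of the -e option (file name without extension, special characters replaced by an underscore) can be sketched in C as follows. This is only an illustration, not code taken from vc4asm: the helper symbol_from_filename is hypothetical, it expects a bare file name without a directory part, and it assumes that every character other than letters, digits and the underscore counts as "special".

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of vc4asm: derive the predefined symbol name
 * from a source file name. ASSUMPTION: every character other than
 * [A-Za-z0-9_] counts as "special"; the exact set used by vc4asm may differ. */
static void symbol_from_filename(const char *filename, char *out, size_t out_size)
{
    if (out_size == 0)
        return;
    /* Drop the extension: everything from the last '.' onwards. */
    const char *dot = strrchr(filename, '.');
    size_t len = dot ? (size_t)(dot - filename) : strlen(filename);
    /* Truncate if the output buffer is too small. */
    if (len >= out_size)
        len = out_size - 1;
    for (size_t i = 0; i < len; ++i) {
        unsigned char c = (unsigned char)filename[i];
        out[i] = (isalnum(c) || c == '_') ? (char)c : '_';
    }
    out[len] = '\0';
}
```

Under these assumptions a file named gpu-fft 1k.qasm would yield the symbol gpu_fft_1k, and correspondingly gpu_fft_1k_end and gpu_fft_1k_size.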

File arguments

You can pass multiple files to vc4asm but this will not create separate object modules. Instead the files are simply concatenated in order of appearance. You may use this feature to include platform specific definitions without the need to include them explicitly from every file. E.g.:
vc4asm -o code.bin -i vc4.qinc gpu_fft_1k.qasm

Assembler language reference

  1. Expressions and operators
  2. Assembler directives
  3. Instructions
  4. Standard macros vc4.qinc

See the Broadcom VideoCore IV Reference Guide for the semantics of the instructions and registers.
See also the Addendum for further details and bugs in the reference guide.

Disassembler vc4dis

vc4dis [-o <qasm-output>] [-x[<input-format>]] [-M] [-F] [-v] [-b <base-addr>] <input-file> [<input-file2> ...]

Options

-o <qasm-output>
Assembler output file, stdout by default.
-x<input-format>
32 - 32 bit hexadecimal input, i.e. two 32 bit words per instruction, default if <input-format> is missing.
64 - 64 bit hexadecimal input.
- binary input, little endian, default without -x.
-M
Do not generate mov instructions. mov is not a native QPU instruction; it is emulated by trivial operators like or r1, r0, r0. Without this option vc4dis generates mov instead of the real instruction if such a situation is detected.
Note that -M is required to make the disassembler result round-trip stable, i.e. even if the code contains some binary fragments, vc4asm should return the same binary result from the disassembly.
-F
Do not write floating point constants. Without this option vc4dis writes immediate values that are likely to be a floating point number as float. This may not always hit the nail on the head.
-v
Write binary code and offset as a comment to the right of each instruction.
-v2
As -v, but also write the QPU instruction set bit fields as a comment to the right of each instruction. This is mainly for debugging purposes.
-V
Check for VideoCore IV constraints, e.g. reading a register file address immediately after writing it.
-b <base-addr>
Base address. This is the physical memory address of the first instruction code passed to vc4dis. This is only significant for absolute branch instructions.
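
To illustrate why the floating point detection described for -F can guess wrong, here is a deliberately simple heuristic in C. It is NOT the rule used by vc4dis, which is not described here. The function looks_like_float and its magnitude limits are made up for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical heuristic, NOT the one implemented in vc4dis: treat a 32 bit
 * immediate as "probably a float" if its bit pattern, read as IEEE 754
 * single precision, has a magnitude in an everyday range.
 * ASSUMPTION: the limits 1e-3 and 1e6 are arbitrary illustration values. */
static bool looks_like_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);   /* reinterpret the bits without aliasing issues */
    /* NaN fails every comparison, infinities and denormals fall outside the range. */
    return (f >= 1e-3f && f <= 1e6f) || (f <= -1e-3f && f >= -1e6f);
}
```

With this rule 0x3f800000 is reported as a float (1.0), although the author of the code may have meant the integer 1065353216. Any heuristic of this kind has such ambiguous cases, which is why -F exists.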

File arguments

If you pass multiple input files they are disassembled all together into a single result as if they were concatenated.
The format of the input is controlled by the -x option. All input files must use the same format.
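
Since the binary format is little endian, a host program that inspects the raw code should not depend on its own byte order. The following C sketch decodes raw bytes into instruction words portably. It assumes that each QPU instruction is stored as one 64 bit little endian value; the function names are chosen for this example only.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one 64 bit value from 8 little endian bytes,
 * independent of the byte order of the host. */
static uint64_t read_le64(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; --i)
        v = (v << 8) | p[i];       /* most significant byte is stored last */
    return v;
}

/* Decode count instructions from 8 * count raw bytes into out.
 * ASSUMPTION: one instruction equals one 64 bit little endian value. */
static void decode_instructions(const unsigned char *bytes, size_t count, uint64_t *out)
{
    for (size_t i = 0; i < count; ++i)
        out[i] = read_le64(bytes + 8 * i);
}
```

Under this assumption the lower 32 bits of an instruction come first in the file, followed by the upper 32 bits.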

Known problems

Build instructions

The source code has hopefully no major platform dependencies, i.e. you don't need to build it on the Raspberry. But it requires a C++11 compliant compiler to build. Current Raspbian ships with gcc 4.9 which works fine. Raspbian Wheezy seems not to be sufficient. While I succeeded with gcc 4.7.3 on another platform, gcc 4.7.2 of Wheezy fails to compile the disassembler. But you can install gcc 4.8 in Raspbian Wheezy. This will work.

Furthermore CMake is required. Most Linux distributions should provide this as package.

Build vc4asm

Build the samples

Note that the samples will neither build nor run on anything else but one of the Raspberry Pi models.

Running the test cases

Method 1: using make rules

The test_... targets run all test cases that have not yet run or that need to be rerun after changes to the code or to the test case itself. The run stops at the first failed test. If you need an overview of all failed tests, use method 2 below.

Method 2: using cmake test cases

The CMake test cases invoke the test targets from method 1. This is significantly slower, especially on a Raspberry Pi.

Sample programs

Notes

  1. You cannot use the GPU if you have the vc4 display driver running. It simply does not support this.

  2. It is recommended to install the vcio2 driver to run the sample programs. Without it, all samples must be run with root privileges because they need to call mmap. The sample programs automatically detect the presence of the vcio2 driver and use it when available.

  3. There is one side effect of using vcio2: for safety and concurrency reasons this driver does not support direct access to the V3D hardware. This prevents the direct hardware access used by the hello_fft sample for small transforms. So these transforms are significantly slower because of the round trip through the kernel and the firmware.

simple

This is a very simple program that demonstrates the use of all available operators with small immediate values. It is not optimized in any way.

hello_fft

This is the well-known hello_fft sample. The main difference is that it is faster than GPU_FFT 3.0 because the shader code has been significantly optimized. The savings are about 40% of the code size and roughly 9% of the run time. The code will no longer build with another assembler since it uses several special features for instruction packing and scheduling.

          batch 1                            batch 10
points    gpu_fft 3   optimized   gain      gpu_fft 3   optimized   gain
2^8       25 µs       20 µs*      -20%      16.0 µs     13.0 µs     -19%
2^9       39 µs       32 µs       -18%      28.0 µs     22.9 µs     -18%
2^10      57 µs*      49 µs       -14%      48.0 µs     39.9 µs     -17%
2^11      102 µs      92 µs*      -10%      100.6 µs    82.7 µs     -18%
2^12      230 µs      241 µs      +5%       250 µs      245 µs      -2%
2^13      598 µs      649 µs      +9%       612 µs      655 µs      +7%
2^14      1.12 ms     1.31 ms     +17%      1.148 ms    1.306 ms    +13%
2^15      3.08 ms     2.82 ms     -9%       2.96 ms*    2.696 ms    -9%
2^16      6.05 ms     5.52 ms     -9%       5.93 ms     5.38 ms*    -9%
2^17      12.20 ms    11.27 ms    -8%       12.06 ms    11.07 ms    -8%
2^18      26.76 ms    24.73 ms    -8%       26.64 ms    24.59 ms    -8%
2^19      88.55 ms    81.75 ms    -8%       88.41 ms    81.60 ms    -8%
2^20      181.6 ms    171.8 ms    -5%
2^21      360.4 ms    340.8 ms    -5%
2^22      731.8 ms    693.7 ms    -5%


All timings are medians from repeated executions. The Raspi was slightly overclocked. (*) Timing is unstable, reason unknown.
It is not yet known why the 2^14 FFT in particular is significantly slower. Maybe a bug.
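
The gain columns are consistent with the relative change in run time against gpu_fft 3, where negative values mean that the optimized code is faster. This reading is inferred from the published numbers rather than stated explicitly. A minimal C sketch of the calculation, with a function name chosen for this example:

```c
/* Relative change of the optimized run time against the gpu_fft 3 baseline,
 * in percent. Negative values mean the optimized code is faster.
 * ASSUMPTION: this is how the gain columns were computed; it matches the
 * published rows after rounding. Both times must use the same unit. */
static double gain_percent(double baseline, double optimized)
{
    return (optimized - baseline) / baseline * 100.0;
}
```

For example, the 2^8 row at batch size 1 gives gain_percent(25, 20), which is -20, and the 2^12 row gives gain_percent(230, 241), which is about +4.8 and appears rounded to +5%.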

Contact

Comments, ideas, bugs, improvements to raspi at maazl dot de.