VC4ASM - macro assembler for Broadcom VideoCore IV
aka Raspberry Pi GPU

The goal of the vc4asm project is a full-featured macro assembler and disassembler with constraint checking.
The work is based on qpu-asm from Pete Warden, which itself is based on Eman's work, and it also takes some ideas from Herman H Hermitage. But it goes much further. First of all, it supports macros and functions.
Unfortunately, this area is largely undocumented in public. The work therefore relies only on the code from hello_fft, which is available as both source and binary, as its Rosetta Stone. Nevertheless, I try to keep it backward compatible with the Broadcom tools.


Download

Download Source code, Raspberry Pi 1 binary, examples and this documentation (750k)

The current version is 0.3. See change log for further details.

The source code is also available at github.com/maazl/vc4asm.

Assembler `vc4asm`

The heart of the software. It assembles QPU code to binary, ELF or C source.

vc4asm [<options> ...] <qasm-file> [<qasm-file2> ...]

Options

-o <bin-output>: File name for binary output. If omitted no binary output is generated.
Note that vc4asm always writes little endian binaries.
-C <C-fragment-output>: File name for C/C++ output. The result does not include surrounding braces. So write it to a separate file and include it from C as follows:
static const uint32_t qpu_code[] = {
#include<C-output>
};
-c <C-output>: Write full C output file. Requires header file also (option -h).
-h <C-header>: Write C header file, containing global symbols.
This file is compatible with the -c and the -e output.
-H <C-header>: Write C header file without inline symbol values. This variant causes all symbols to be resolved by the linker. So no recompile is required when the GPU code changes unless new symbols are used in the referring C code.
This file is compatible with the -e output only.
-v: Decorate C output with comments containing code offsets, labels and source code lines.
-e <ELF-output>: Write the assembled binary directly to an ARM compatible object file in ELF format that can be passed to ld or gcc.
The ELF object will contain all symbols that have been exported by .global and the following predefined symbols:
- <filename-wo-extension> points to the starting address of the generated binary code.
- <filename-wo-extension>_end points behind the generated binary code.
- <filename-wo-extension>_size receives the size of the generated binary code in bytes.
Special characters in the file name are replaced by an underscore.
-s: Do not automatically create the predefined symbols derived from the file name for -e and -C output.
You need to use .global to be able to access the code by a linker symbol.
-I <include-path>: Add an include path to the search path list. These paths are used by .include <...>. Note that this is a prefix rather than a path, i.e. if it is a folder it should contain a trailing slash.
-i <qinc-file>: Load a file using the include search path (see option -I). This is useful to include vc4.qinc without an absolute path: -i vc4.qinc
-V: Check for VideoCore IV constraints, e.g. reading a register file address immediately after writing it.
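
The prefix semantics of -I can be illustrated with a small C sketch. It assumes that the assembler forms each candidate file name by plain concatenation of the prefix and the name given to .include <...> or -i, as the description of -I suggests. join_include_prefix is a hypothetical helper for illustration only and not part of vc4asm.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of vc4asm: builds the candidate file name
 * for one -I prefix by plain string concatenation (assumed behaviour).
 * Returns 0 on success, -1 if the result does not fit into out. */
int join_include_prefix(const char *prefix, const char *name,
                        char *out, size_t out_size)
{
    assert(prefix != NULL && name != NULL && out != NULL);
    size_t plen = strlen(prefix);
    size_t nlen = strlen(name);
    if (plen + nlen + 1 > out_size)
        return -1;
    memcpy(out, prefix, plen);
    memcpy(out + plen, name, nlen + 1); /* also copies the terminating NUL */
    return 0;
}
```

Under this assumption the prefix include/ combined with vc4.qinc gives include/vc4.qinc, while the prefix include without the trailing slash gives includevc4.qinc, which is most likely not the intended file.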

File arguments

You can pass multiple files to vc4asm but this will not create separate object modules. Instead the files are simply concatenated in order of appearance. You may use this feature to include platform specific definitions without the need to include them explicitly from every file. E.g.:
vc4asm -o code.bin -i vc4.qinc gpu_fft_1k.qasm

Assembler language reference

See the Broadcom VideoCore IV Reference Guide for the semantics of the instructions and registers.
See also the Addendum for further details and bugs in the reference guide.

Disassembler `vc4dis`

vc4dis [-o <qasm-output>] [-x[<input-format>]] [-M] [-F] [-v] [-b <base-addr>] <input-file> [<input-file2> ...]

Options

-o <qasm-output>: Assembler output file, stdout by default.
-x<input-format>: Format of the input files.
32 - 32 bit hexadecimal input, i.e. two 32 bit words per instruction, default if <input-format> is missing.
64 - 64 bit hexadecimal input.
0 - binary input, little endian, default without -x.
-M: Do not generate mov instructions. mov is not a native QPU instruction; it is emulated by trivial operators like or r1, r0, r0. Without this option vc4dis generates mov instead of the real instruction if such a situation is detected.
Note that -M is required to make the disassembler result round-trip stable, i.e. even if the code contains some binary fragments, vc4asm should return the same binary result from the disassembly.
-F: Do not write floating point constants. Without this option vc4dis writes immediate values that are likely to be floating point numbers as float. This guess may not always be right.
-v: Write binary code and offset as a comment to the right of each instruction.
-v2: As -v but also write QPU instruction set bit fields as a comment to the right of each instruction. This is mainly for debugging purposes.
-V: Check for VideoCore IV constraints, e.g. reading a register file address immediately after writing it.
-b <base-addr>: Base address. This is the physical memory address of the first instruction code passed to vc4dis. This is only significant for absolute branch instructions.
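
The relation between the input formats of -x can be sketched in C. A QPU instruction is 64 bits wide, so the 32 bit hexadecimal format needs two words per instruction. The sketch assumes that the first word of each pair holds the low 32 bits, which matches the little endian binary layout. The function names qpu_join_words and qpu_store_le are hypothetical and not part of vc4dis.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Combine the two 32 bit words of one instruction as they appear in
 * 32 bit hexadecimal input. Assumption: the first word is the low half. */
uint64_t qpu_join_words(uint32_t first, uint32_t second)
{
    return ((uint64_t)second << 32) | first;
}

/* Store one 64 bit instruction as 8 bytes, least significant byte first,
 * i.e. in the little endian layout of the binary format. */
void qpu_store_le(uint64_t insn, uint8_t out[8])
{
    assert(out != NULL);
    for (int i = 0; i < 8; ++i)
        out[i] = (uint8_t)(insn >> (8 * i));
}
```

With this layout the bytes of the first word come first in the binary file, followed by the bytes of the second word, each word with its least significant byte first.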

File arguments

If you pass multiple input files they are disassembled all together into a single result as if they were concatenated.
The format of the input is controlled by the -x option. All input files must use the same format.

Known problems

There are insufficient test cases so far, so there are likely still bugs in the assembler.

Build instructions

The source code hopefully has no major platform dependencies, i.e. you do not need to build it on the Raspberry Pi. But it requires a C++11 compliant compiler to build. Current Raspbian ships with gcc 4.9, which works fine. The compiler of Raspbian Wheezy does not seem to be sufficient: while I succeeded with gcc 4.7.3 on another platform, gcc 4.7.2 of Wheezy fails to compile the disassembler. But you can install gcc 4.8 in Raspbian Wheezy, which will work.

Furthermore, CMake is required. Most Linux distributions should provide it as a package.

Build vc4asm

Download and extract the source.
Go to the folder where you extracted the files.
Execute one of the scripts makeDebug.Cmd or makeRelease.Cmd and do not worry about the error message "CMake has been ran to create an out of source build => abort. This is NOT an error!". Other error messages, however, deserve your attention.
Now vc4asm and vc4dis executables as well as libvc4asm should build in a folder matching your platform, e.g. build-Linux-x86_64-Debug. You can run the executables from this folder ...
... or go to the folder and enter make install to install the binaries on your system.

Build the samples

Ensure that you have run makeRelease.Cmd to build vc4asm as mentioned above.
Go to one of the sample folders.
Enter cmake .
Enter make
Run the executables with sudo prefix, e.g. sudo hello_fft ...
sudo is required because there is currently no secure device driver to access the Raspberry Pi GPU.

Note that the samples will neither build nor run on anything other than a Raspberry Pi.

Running the test cases

Ensure that you have Perl installed. (There should be no notable version dependencies.)
Go to the folder where you extracted the files.

Method 1: using make rules

Execute makeTest.Cmd. You may also use the targets test_qasm, test_cout, test_parser or test_validator to run only a subset of the test cases.

The test_... targets run all test cases that have not yet run or that need to be rerun after changes to the code or to the test case itself. They stop at the first failed test. If you need an overview of all failed tests, use method 2 below.

Method 2: using cmake test cases

Execute runTest.Cmd. You may also use ctest.

The CMake test cases invoke the test targets from method 1. This is significantly slower, especially on a Raspberry Pi.

Sample programs

Notes

You cannot use the GPU if you have the vc4 display driver running. It simply does not support this.
It is recommended to install the vcio2 driver to run the sample programs. This removes the need to run all samples with root privileges, which are otherwise required because the samples need to call mmap. The sample programs automatically detect the presence of the vcio2 driver and use it when available.
There is one side effect of using vcio2: for safety and concurrency reasons this driver does not support direct access to the V3D hardware. This prevents the direct hardware access used by the hello_fft sample for small transforms. So these transforms are significantly slower because of the round trip through the kernel and the firmware.

`simple`

This is a very simple program that demonstrates the use of all available operators with small immediate values. It is not optimized in any way.

`hello_fft`

This is the well-known hello_fft sample. The main difference is that it is faster than GPU_FFT 3.0 because the shader code has been significantly optimized. The gain is about 40% of code size and roughly 9% of the run time. The code will no longer build with another assembler since it uses several special features for instruction packing and scheduling.

batch→	1			10
↓points	gpu_fft 3	optimized	gain	gpu_fft 3	optimized	gain
2⁸	25 µs	20 µs*	-20%	16.0 µs	13.0 µs	-19%
2⁹	39 µs	32 µs	-18%	28.0 µs	22.9 µs	-18%
2¹⁰	57 µs*	49 µs	-14%	48.0 µs	39.9 µs	-17%
2¹¹	102 µs	92 µs*	-10%	100.6 µs	82.7 µs	-18%
2¹²	230 µs	241 µs	+5%	250 µs	245 µs	-2%
2¹³	598 µs	649 µs	+9%	612 µs	655 µs	+7%
2¹⁴	1.12 ms	1.31 ms	+17%	1.148 ms	1.306 ms	+13%
2¹⁵	3.08 ms	2.82 ms	-9%	2.96 ms*	2.696 ms	-9%
2¹⁶	6.05 ms	5.52 ms	-9%	5.93 ms	5.38 ms*	-9%
2¹⁷	12.20 ms	11.27 ms	-8%	12.06 ms	11.07 ms	-8%
2¹⁸	26.76 ms	24.73 ms	-8%	26.64 ms	24.59 ms	-8%
2¹⁹	88.55 ms	81.75 ms	-8%	88.41 ms	81.60 ms	-8%
2²⁰	181.6 ms	171.8 ms	-5%
2²¹	360.4 ms	340.8 ms	-5%
2²²	731.8 ms	693.7 ms	-5%

All timings are medians from repeated executions. The Raspi was slightly overclocked. (*) Timing is unstable, reason unknown.
It is not yet known why the 2¹⁴ FFT in particular is significantly slower. Maybe a bug.
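
The gain columns are consistent with the relative change in run time against gpu_fft 3, so negative values mean that the optimized code is faster. A minimal sketch of that arithmetic, assuming this reading of the table; gain_percent is a hypothetical helper name:

```c
#include <assert.h>

/* Relative change of the optimized run time against the baseline in percent.
 * Negative results mean the optimized code is faster (assumed convention). */
double gain_percent(double baseline, double optimized)
{
    assert(baseline > 0.0);
    return (optimized - baseline) / baseline * 100.0;
}
```

For the 2⁸ row with batch size 1 this gives (20 - 25) / 25 = -20%, as listed in the table.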

Contact

Comments, ideas, bugs, improvements to raspi at maazl dot de.
