VC4ASM - macro assembler for Broadcom VideoCore IV
aka Raspberry Pi GPU

The goal of the vc4asm project is a full-featured macro assembler and disassembler with constraint checking.
The work is based on qpu-asm by Pete Warden, which itself is based on Eman's work, with some ideas taken from Herman H Hermitage. But it goes considerably further. First of all, it supports macros and functions.
Unfortunately, this area is largely undocumented in the public domain. So the work relies only on the code of hello_fft, which is available as source and as binary, as a Rosetta Stone. However, I try to keep it backward compatible with the Broadcom tools.

→ Assembler, → Disassembler, → Known problems, → Build instructions, → Samples, → Change log, → Contact


Download Source code, Raspberry Pi 1 binary, examples and this documentation (750k)

The current version is 0.2.3. See change log for further details.

The source code is also available at

Assembler vc4asm

The heart of the software. It assembles QPU code to binary or C constants.

vc4asm [-o <bin-output>] [-C <C-output>] [-e <ELF-output>] [-I <include-path> [-I <include-path2> ...]] <qasm-file> [<qasm-file2> ...]


-o <bin-output>
File name for binary output. If omitted no binary output is generated.
Note that vc4asm always writes little endian binaries.
-c <C-output>
File name for C/C++ output. The result does not include surrounding braces, so write it to a separate file and include it from C as follows:
static const uint32_t qpu_code[] = {
#include "<C-output>"
};
-C <C-output>
Same as -c, but suppress trailing ','.
-e <ELF-output>
Write the assembled binary directly to an ARM compatible object file in ELF format that can be passed to ld or gcc respectively.
The ELF object will contain all symbols that have been exported by .global and the following predefined symbols:
- <filename-wo-extension> points to the starting address of the generated binary code.
- <filename-wo-extension>_end points behind the generated binary code.
- <filename-wo-extension>_size receives the size of the generated binary code in bytes.
Special characters in the file name are replaced by an underscore.
-E <ELF-output>
Like -e but do not automatically create the predefined symbols derived from the file name.
You need to use .global to be able to access the code by a linker symbol.
-I <include-path>
Add an include path to the search path list. These paths are used for .include <...>. Note that this is a prefix rather than a path, i.e. if it is a folder it should contain a trailing slash.
-i <file>
Search the include path for command line arguments (files) as well.
Useful to include vc4.qinc without an absolute path: -i vc4.qinc
Check for Videocore IV constraints, e.g. reading a register file address immediately after writing it.
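
Because -I takes a prefix rather than a directory, looking up an included file presumably amounts to plain string concatenation of prefix and file name. The following C sketch illustrates that reading. join_include_prefix is a hypothetical helper for illustration only, not part of vc4asm, and the folder name share is made up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Build a candidate file name by plain concatenation of prefix and name.
   No path separator is inserted; that is the point of "prefix" semantics.
   Returns 0 on success, -1 if the result does not fit into out. */
int join_include_prefix(char *out, size_t out_size,
                        const char *prefix, const char *name)
{
    assert(out != NULL && prefix != NULL && name != NULL);
    size_t prefix_len = strlen(prefix);
    size_t name_len = strlen(name);
    if (prefix_len + name_len + 1 > out_size)
        return -1;
    memcpy(out, prefix, prefix_len);
    memcpy(out + prefix_len, name, name_len + 1); /* copies the '\0' too */
    return 0;
}
```

With the prefix share/ the name vc4.qinc becomes share/vc4.qinc. With share (no trailing slash) it becomes sharevc4.qinc, which is most likely not the intended file.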

File arguments

You can pass multiple files to vc4asm but this will not create separate object modules. Instead the files are simply concatenated in order of appearance. You may use this feature to include platform specific definitions without the need to include them explicitly from every file. E.g.:
vc4asm -o code.bin BCM2835.qinc gpu_fft_1k.qasm

Assembler reference

  1. Expressions and operators
  2. Assembler directives
  3. Instructions
  4. Standard macros vc4.qinc

See the Broadcom VideoCore IV Reference Guide for the semantics of the instructions and registers.
See also the Addendum for further details and bugs in the reference guide.

Disassembler vc4dis

vc4dis [-o <qasm-output>] [-x[<input-format>]] [-M] [-F] [-v] [-b <base-addr>] <input-file> [<input-file2> ...]


-o <qasm-output>
Assembler output file, stdout by default.
-x[<input-format>]
32 - 32 bit hexadecimal input, i.e. two 32 bit words per instruction, default if <input-format> is missing.
64 - 64 bit hexadecimal input.
- binary input, little endian, default without -x.
-M
Do not generate mov instructions. mov is not a native QPU instruction; it is emulated by trivial operators like or r1, r0, r0. Without this option vc4dis generates mov instead of the real instruction if such a situation is detected.
-F
Do not write floating point constants. Without this option vc4dis writes immediate values that are likely to be floating point numbers as float. This may not always hit the nail on the head.
-v
Write binary code and offset as a comment to the right of each instruction.
As -v but also write QPU instruction set bit fields as a comment to the right of each instruction. This is mainly for debugging purposes.
Check for Videocore IV constraints, e.g. reading a register file address immediately after writing it.
-b <base-addr>
Base address. This is the physical memory address of the first instruction code passed to vc4dis. This is only significant for absolute branch instructions.

File arguments

If you pass multiple input files they are disassembled all together into a single result as if they were concatenated.
The format of the input is controlled by the -x option. All input files must use the same format.
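
To relate the input formats to each other, the following C sketch shows how one 64 bit QPU instruction maps to two 32 bit words and to eight little endian bytes. It assumes that in the 32 bit format the low word precedes the high word, as it does in little endian memory. The function names are hypothetical and not part of vc4dis, and the sample value in the note below is arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Combine two 32 bit words into one 64 bit instruction.
   Assumption: the low word comes first in the input. */
uint64_t qpu_instr_from_words(uint32_t low, uint32_t high)
{
    return ((uint64_t)high << 32) | low;
}

/* Read the instruction with index i from a little endian byte buffer
   of len bytes, independent of the byte order of the host. */
uint64_t qpu_instr_from_le_bytes(const uint8_t *buf, size_t len, size_t i)
{
    assert(buf != NULL && (i + 1) * 8 <= len);
    uint64_t value = 0;
    /* Start at the most significant byte, which is stored last. */
    for (int b = 7; b >= 0; --b)
        value = (value << 8) | buf[i * 8 + b];
    return value;
}
```

Under this assumption the words 0x00000010 and 0xe0021e67 and the bytes 10 00 00 00 67 1e 02 e0 all describe the same 64 bit value 0xe0021e6700000010.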

Known problems

Build instructions

The source code hopefully has no major platform dependencies, i.e. you don't need to build it on the Raspberry Pi. But it requires a C++11 compliant compiler. Current Raspbian ships with gcc 4.9, which works fine. The default compiler of Raspbian Wheezy seems not to be sufficient: while I succeeded with gcc 4.7.3 on another platform, gcc 4.7.2 of Wheezy fails to compile the disassembler. But you can install gcc 4.8 in Raspbian Wheezy, which works.

Sample programs

All sample programs require root access to run. This is because of the need to call mmap. See vcio2 driver for an alternative without root privileges.

Furthermore, you need a recent Raspbian kernel (use rpi-update), or you have to create a local character device named /dev/vcio to access the vcio driver of the Raspi kernel: sudo mknod /dev/vcio c 100 0

All these restrictions apply to hello_fft from the Raspberry Pi Foundation as well.


This is a very simple program that demonstrates the use of all available operators with small immediate values. It is not optimized in any way.


This is the well-known hello_fft sample. The main difference is that it is faster than GPU_FFT 3.0 because the shader code has been significantly optimized. The gain is about 40% in code size and roughly 9% in run time. The code will no longer build with another assembler since it uses several special features for instruction packing and scheduling.

         batch 1                           batch 10
points   gpu_fft 3   optimized   gain      gpu_fft 3   optimized   gain
2^8      25 µs       20 µs*      -20%      16.0 µs     13.0 µs     -19%
2^9      39 µs       32 µs       -18%      28.0 µs     22.9 µs
2^10     57 µs*      49 µs       -14%      48.0 µs     39.9 µs
2^11     102 µs      92 µs*      -10%      100.6 µs    82.7 µs
2^12     230 µs      241 µs      +5%       250 µs      245 µs
2^13     598 µs      649 µs      +9%       612 µs      655 µs
2^14     1.12 ms     1.31 ms     +17%      1.148 ms    1.306 ms
2^15     3.08 ms     2.82 ms     -9%       2.96 ms*    2.696 ms
2^16     6.05 ms     5.52 ms     -9%       5.93 ms     5.38 ms*
2^17     12.20 ms    11.27 ms    -8%       12.06 ms    11.07 ms
2^18     26.76 ms    24.73 ms    -8%       26.64 ms    24.59 ms
2^19     88.55 ms    81.75 ms    -8%       88.41 ms    81.60 ms
2^20     181.6 ms    171.8 ms    -5%
2^21     360.4 ms    340.8 ms    -5%
2^22     731.8 ms    693.7 ms    -5%

All timings are medians from repeated executions. The Raspi was slightly overclocked. (*) Timing is unstable, reason unknown.
It is not yet known why the 2^14 FFT in particular is significantly slower. Maybe a bug.
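
The gain columns appear consistent with the relative change in run time against gpu_fft 3, rounded to whole percent. A minimal C sketch of that arithmetic follows. gain_percent is a hypothetical helper for illustration and not part of the sample code.

```c
#include <assert.h>

/* Relative change of the optimized run time against the baseline in percent.
   Negative values mean the optimized code is faster.
   Both timings must use the same unit. */
double gain_percent(double baseline, double optimized)
{
    assert(baseline > 0.0);
    return (optimized - baseline) / baseline * 100.0;
}
```

For example, gain_percent(25.0, 20.0) gives about -20, matching the first row of the table (25 µs versus 20 µs), and gain_percent(1.12, 1.31) gives about +17.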


Comments, ideas, bugs, improvements to raspi at maazl dot de.