The goal of the vc4asm project is a full-featured
macro assembler and disassembler with constraint checking.
The work is based on qpu-asm by Pete Warden, which is itself based on Eman's work, with some ideas also taken from Herman H Hermitage. However, vc4asm goes considerably further. First of all, it supports macros and functions.
Unfortunately, this area is largely undocumented in the public domain. The work therefore uses only the code from hello_fft, which is available as both source and binary, as a Rosetta Stone. Nevertheless, I try to keep vc4asm backward compatible with the Broadcom tools.
→ Assembler, → Disassembler, → Known problems, → Build instructions, → Samples, → Change log, → Contact
Download Source code, Raspberry Pi 1 binary, examples and this documentation (750k)
The current version is 0.2.1. See change log for further details.
The source code is also available at github.com/maazl/vc4asm.
The assembler is the heart of the software. It assembles QPU code to binary or C constants.
vc4asm [-o <bin-output>] [-c <c-output>] [-e <ELF-output>] [-I <include-path> [-I <include-path2> ...]] <qasm-file> [<qasm-file2> ...]
You can pass multiple files to vc4asm but this will not
create separate object modules. Instead the files are simply concatenated
in order of appearance. You may use this feature to include platform
specific definitions without the need to include them explicitly from
every file. E.g.:
vc4asm -o code.bin BCM2835.qinc gpu_fft_1k.qasm
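As a rough sketch of how a host program might pick up the result: the following C function reads a file such as code.bin into memory. It assumes that the file written with -o is a raw sequence of 64-bit QPU instruction words in the byte order of the host (little endian on the Raspberry Pi). The function name load_qpu_code is hypothetical and not part of vc4asm.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper, not part of vc4asm.
 * Reads a raw binary file into a heap buffer of 64-bit words.
 * Assumption: the file consists only of 64-bit instruction words.
 * Returns NULL on failure (missing file, empty file, or a size that is
 * not a multiple of 8 bytes). On success *count receives the number of
 * instructions and the caller must free() the buffer. */
uint64_t *load_qpu_code(const char *path, size_t *count)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return NULL;

    uint64_t *code = NULL;
    long size = 0;
    if (fseek(f, 0, SEEK_END) == 0 && (size = ftell(f)) > 0
        && size % (long)sizeof(uint64_t) == 0) {
        rewind(f);
        code = malloc((size_t)size);
        /* Discard the buffer if the file could not be read completely. */
        if (code && fread(code, 1, (size_t)size, f) != (size_t)size) {
            free(code);
            code = NULL;
        }
    }
    fclose(f);

    if (code)
        *count = (size_t)size / sizeof(uint64_t);
    return code;
}
```

A caller would use it like `size_t n; uint64_t *code = load_qpu_code("code.bin", &n);` and then copy the n instructions into GPU memory by whatever means the application already uses.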
See the Broadcom
VideoCore IV Reference Guide for the semantics of the instructions.
See also the Addendum for further details and bugs in the reference guide.
vc4dis [-o <qasm-output>] [-x[<input-format>]] [-M] [-F] [-v] [-b <base-addr>] <input-file> [<input-file2> ...]
If you pass multiple input files, they are disassembled together into
a single result as if they were concatenated.
The format of the input is controlled by the -x option. All input files must use the same format.
The source code hopefully has no major platform dependencies, i.e. you don't need to build it on the Raspberry Pi. It does, however, require a C++11 compliant compiler. Current Raspbian ships with gcc 4.9, which works. The default compiler of Raspbian Wheezy seems to be insufficient: while I succeeded with gcc 4.7.3 on another platform, gcc 4.7.2 from Wheezy fails to compile the disassembler. However, you can install gcc 4.8 on Raspbian Wheezy, and this works.
All sample programs require root access to run. This is because of the need to call mmap. See vcio2 driver for an alternative without root privileges.
Furthermore, you need either a recent Raspbian kernel (use rpi-update) or a local character device named /dev/vcio to access the vcio driver of the Raspberry Pi kernel. To create the device: sudo mknod /dev/vcio c 100 0
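A program could check for the device node up front instead of failing later. The following is a minimal sketch for Linux with glibc; the function name is_char_device is hypothetical, and the major number 100 is simply the one taken from the mknod command above.

```c
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/* Hypothetical helper, not part of the samples.
 * Returns 1 if path exists and is a character device with the given
 * major number, 0 otherwise. */
int is_char_device(const char *path, unsigned int maj)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return 0;               /* missing node or no permission */
    return S_ISCHR(st.st_mode) && major(st.st_rdev) == maj;
}
```

For example, `is_char_device("/dev/vcio", 100)` returning 0 would suggest that the node still has to be created with mknod.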
All these restrictions apply to the original hello_fft as well.
This is a very simple program that demonstrates the use of all available operators with small immediate values. It is not optimized in any way.
This is the well-known hello_fft sample. The main difference is that it is faster than GPU_FFT 3.0 because the shader code has been optimized. The gain is about 40% in code size and roughly 9% in run time. The code will no longer build with another assembler since it uses some special features for instruction packing and scheduling.
|↓points||gpu_fft 3||optimized||gain||gpu_fft 3||optimized||gain|
|2^8||25 µs||20 µs*||-20%||16.0 µs||13.0 µs||-19%|
| || ||32 µs||-18%||28.0 µs||22.9 µs|| |
| || ||49 µs||-14%||48.0 µs||39.9 µs|| |
| || ||92 µs*||-10%||100.6 µs||82.7 µs|| |
| || ||241 µs||+5%||250 µs||245 µs|| |
| || ||649 µs||+9%||612 µs||655 µs|| |
| || ||1.31 ms||+17%||1.148 ms||1.306 ms|| |
| || ||2.82 ms||-9%||2.96 ms*||2.696 ms|| |
| || ||5.52 ms||-9%||5.93 ms||5.38 ms*|| |
| || ||11.27 ms||-8%||12.06 ms||11.07 ms|| |
| || ||24.73 ms||-8%||26.64 ms||24.59 ms|| |
| || ||81.75 ms||-8%||88.41 ms||81.60 ms|| |
|2^22||731.8 ms||693.7 ms||-5%|
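The gain columns appear to be the relative change in run time of the optimized code versus gpu_fft 3, with negative values meaning faster. This formula is an assumption inferred from the complete rows of the table, not something stated explicitly; the function name gain_percent is hypothetical.

```c
/* Hypothetical helper: relative change of the optimized run time
 * versus the reference run time, in percent.
 * Negative results mean the optimized code is faster.
 * Both arguments must use the same unit. */
double gain_percent(double reference, double optimized)
{
    return (optimized - reference) / reference * 100.0;
}
```

With the first row, `gain_percent(25.0, 20.0)` yields -20 and `gain_percent(16.0, 13.0)` yields -18.75, which matches the listed -20% and -19% after rounding.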
Comments, ideas, bugs, improvements to raspi at maazl dot de.