The goal of the vc4asm project is a full-featured macro assembler and disassembler with constraint
checking.
The work is based on qpu-asm by Pete Warden, which itself is based on Eman's work, with some ideas taken from Herman H Hermitage.
But it goes considerably further. Above all, it supports macros and functions.
Unfortunately, this area is largely undocumented in the public domain, so the work uses only the code from hello_fft,
which is available as both source and binary, as a Rosetta Stone. However, I try to keep it backward compatible with the Broadcom
tools.
→ Assembler, → Disassembler, → Known problems, → Build instructions, → Samples, → Change log, → Contact
Download: source code, Raspberry Pi 1 binary, examples and this documentation (750k)
The current version is 0.3. See change log for further details.
The source code is also available at github.com/maazl/vc4asm.
The assembler is the heart of the software. It assembles QPU code to binary, ELF or C source.
vc4asm [<options> ...] <qasm-file> [<qasm-file2> ...]
You can pass multiple files to vc4asm, but this will not create separate object modules. Instead, the files
are simply concatenated in order of appearance. You may use this feature to supply platform-specific definitions without the
need to include them explicitly from every file. E.g.:
vc4asm -o code.bin -i vc4.qinc gpu_fft_1k.qasm
See the Broadcom VideoCore IV Reference Guide for the semantics of the instructions and registers.
See also the Addendum for further details and bugs in the reference guide.
vc4dis [-o <qasm-output>] [-x[<input-format>]] [-M] [-F] [-v] [-b <base-addr>] <input-file> [<input-file2> ...]
If you pass multiple input files they are disassembled all together into a single result as if they were concatenated.
The format of the input is controlled by the -x option. All input files must use the same format.
The source code hopefully has no major platform dependencies, i.e. you don't need to build it on the Raspberry Pi. It does, however, require a C++11 compliant compiler. Current Raspbian ships with gcc 4.9, which works fine. The default compiler of Raspbian Wheezy does not seem to be sufficient: while I succeeded with gcc 4.7.3 on another platform, gcc 4.7.2 from Wheezy fails to compile the disassembler. You can, however, install gcc 4.8 on Raspbian Wheezy, and this works.
Furthermore, CMake is required. Most Linux distributions should provide it as a package.
Note that the samples will neither build nor run on anything other than a Raspberry Pi.
The test_... targets run all test cases that have not yet run or that need to be rerun after changes to the code or to the test case itself. They stop at the first failed test. If you need an overview of all failed tests, use method 2 below.
The CMake test cases invoke the test targets from method 1. This is significantly slower, especially on a Raspberry Pi.
You cannot use the GPU while the vc4 display driver is running. That driver simply does not support this.
It is recommended to install the vcio2 driver to run the sample programs. This removes the need to run all samples with root privileges, which is otherwise required because they need to call mmap. The sample programs automatically detect the presence of the vcio2 driver and use it when available.
There is one side effect of using vcio2: for safety and concurrency reasons, this driver does not support direct access to the V3D hardware. This prevents the direct hardware access that the hello_fft sample uses for small transforms. These transforms are therefore significantly slower because of the round trip through the kernel and the firmware.
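The automatic detection described above can be pictured roughly as follows. This is only an illustrative sketch in Python, not the code used by the samples. The device node names `/dev/vcio2` and `/dev/vcio` and the helper `pick_mailbox_device` are assumptions made for this example and should be checked against your driver installation.

```python
import os

# Assumed device node names (not taken from this documentation).
VCIO2_DEVICE = "/dev/vcio2"   # node assumed to be created by the vcio2 driver
LEGACY_DEVICE = "/dev/vcio"   # assumed fallback, typically requiring root


def pick_mailbox_device(exists=os.path.exists) -> str:
    """Return the vcio2 device if it is present, otherwise the fallback.

    The `exists` predicate is injectable so the logic can be tried
    on a machine without the driver.
    """
    return VCIO2_DEVICE if exists(VCIO2_DEVICE) else LEGACY_DEVICE


if __name__ == "__main__":
    print("would use", pick_mailbox_device())
```

A program following this pattern would open whichever device the function returns and only fall back to the privileged path when vcio2 is missing.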
This is a very simple program that demonstrates the use of all available operators with small immediate values. It is not optimized in any way.
This is the well-known hello_fft sample. The main difference is that it is faster than GPU_FFT 3.0 because the shader code has been significantly optimized. The gain is about 40% of the code size and roughly 9% of the run time. The code will no longer build with another assembler since it uses several special features for instruction packing and scheduling.
points | batch 1: gpu_fft 3 | batch 1: optimized | batch 1: gain | batch 10: gpu_fft 3 | batch 10: optimized | batch 10: gain |
---|---|---|---|---|---|---|
2^8 | 25 µs | 20 µs* | -20% | 16.0 µs | 13.0 µs | -19% |
2^9 | 39 µs | 32 µs | -18% | 28.0 µs | 22.9 µs | -18% |
2^10 | 57 µs* | 49 µs | -14% | 48.0 µs | 39.9 µs | -17% |
2^11 | 102 µs | 92 µs* | -10% | 100.6 µs | 82.7 µs | -18% |
2^12 | 230 µs | 241 µs | +5% | 250 µs | 245 µs | -2% |
2^13 | 598 µs | 649 µs | +9% | 612 µs | 655 µs | +7% |
2^14 | 1.12 ms | 1.31 ms | +17% | 1.148 ms | 1.306 ms | +13% |
2^15 | 3.08 ms | 2.82 ms | -9% | 2.96 ms* | 2.696 ms | -9% |
2^16 | 6.05 ms | 5.52 ms | -9% | 5.93 ms | 5.38 ms* | -9% |
2^17 | 12.20 ms | 11.27 ms | -8% | 12.06 ms | 11.07 ms | -8% |
2^18 | 26.76 ms | 24.73 ms | -8% | 26.64 ms | 24.59 ms | -8% |
2^19 | 88.55 ms | 81.75 ms | -8% | 88.41 ms | 81.60 ms | -8% |
2^20 | 181.6 ms | 171.8 ms | -5% | | | |
2^21 | 360.4 ms | 340.8 ms | -5% | | | |
2^22 | 731.8 ms | 693.7 ms | -5% | | | |
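The gain column appears to be the relative change in run time of the optimized code against gpu_fft 3, so negative values mean faster. The following short Python sketch, with the hypothetical helper `relative_gain`, reproduces three of the batch 1 figures from the table under that assumption.

```python
def relative_gain(baseline: float, optimized: float) -> float:
    """Relative change in run time in percent; negative means faster."""
    return (optimized - baseline) / baseline * 100.0


# (gpu_fft 3 time, optimized time) in µs, batch 1, copied from the table
timings = [(25.0, 20.0), (39.0, 32.0), (230.0, 241.0)]

for baseline, optimized in timings:
    gain = relative_gain(baseline, optimized)
    print(f"{baseline:g} µs -> {optimized:g} µs: {gain:+.0f}%")
# prints:
# 25 µs -> 20 µs: -20%
# 39 µs -> 32 µs: -18%
# 230 µs -> 241 µs: +5%
```

Since the published timings are themselves rounded, recomputing other rows this way can differ from the table by a percentage point.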
Comments, ideas, bugs, improvements to raspi at maazl dot de.