Multiplication
The AM335x TRM has a detailed section explaining the PRU multiplier. It turned out, though, that the section was not detailed enough for me to write a working assembly program. Below I document how I got the PRU multiplier working.
Scroll to the bottom if you're in a hurry for a working example.
First I needed a testbench to evaluate my assembly code. I took the HC-SR04 example and, instead of the ultrasonic range, returned the result of a PRU multiplication to the host.
Full listing of the new assembly function that I used to replace the existing one from the C example:
.text
.section .text
.global measure_distance_mm
measure_distance_mm:
# Start with non-zero values, simply to add randomness to our test.
fill r14, 4 * (29-14)
# Load the MUL operands
ldi r28, 1001
ldi r29, 2002
# Do the multiplication, per the TRM.
# (First attempt: there is no delay cycle before XIN; see the fix further down.)
xin 0, r26, 4
# Move the MUL result to the function return value register.
mov r14, r26
ret
Thinking that I fully understood the PRU multiplier, I wrote the following snippet. When I tested it, though, the result was quite different from what I expected.
ldi r28, 1001
ldi r29, 2002
xin 0, r26, 4
# Wrong result: -1001
Re-reading the TRM, a few facts stood out:
- MUL is single-cycle.
- The R28/R29 operands are sampled every cycle.
- An XIN instruction is required to transfer the result back to R26.
Drawing the pipeline made my mistake obvious:
            .-- one cycle to execute multiplication
            |
            V
        |<----->|
|  LDI  |  MAC  |  XIN  |
        ^       ^
        |       |
        |       `-- result ready for transfer to CPU R26/R27 registers
        `-- sample R28/R29 operands
Adding a nop to account for the MUL/MAC cycle led to a correct result:
ldi r28, 1001
ldi r29, 2002
nop
xin 0, r26, 4
# Result: 2004002
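The timing in the diagram can be mimicked with a small host-side model. This is only a sketch in Python, not PRU code: the `simulate` function and its tuple-based instruction format are made up for illustration, and it assumes that XIN returns the product of whatever R28/R29 held at the end of the instruction two slots earlier. It also assumes the stale R29 value was 0xFFFFFFFF, which the listing does not guarantee. Under those assumptions the model reproduces both the wrong -1001 and the correct 2004002.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(value):
    """Interpret the low 32 bits of value as a signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value

def simulate(program, r28=0, r29=0):
    """Toy model: XIN sees the product of the operands as they were
    at the end of the instruction two slots before it."""
    regs = {"r28": r28 & MASK32, "r29": r29 & MASK32}
    # Operand snapshots taken at the end of each cycle; the first two
    # entries stand for the state before the program starts.
    snapshots = [(regs["r28"], regs["r29"])] * 2
    result = None
    for op, *args in program:
        if op == "ldi":
            reg, imm = args
            regs[reg] = imm & MASK32
        elif op == "xin":
            a, b = snapshots[-2]
            result = (a * b) & MASK32  # only R26 (the low word) is read
        elif op != "nop":
            raise ValueError(f"unknown instruction: {op}")
        snapshots.append((regs["r28"], regs["r29"]))
    return result

STALE = 0xFFFFFFFF  # assumed leftover register contents (hypothetical)

without_nop = [("ldi", "r28", 1001), ("ldi", "r29", 2002), ("xin",)]
with_nop = [("ldi", "r28", 1001), ("ldi", "r29", 2002), ("nop",), ("xin",)]

print(to_signed32(simulate(without_nop, STALE, STALE)))  # -1001
print(simulate(with_nop, STALE, STALE))                  # 2004002
```

In the model, the only difference between the two programs is which operand snapshot the XIN ends up reading, which is the point the diagram makes.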
The PRU multiplier also has a MAC mode, in which results are accumulated. This time there were no hiccups when following the TRM-suggested sequence:
ldi r25, 1
xout 0, r25, 1
ldi r25, 3
xout 0, r25, 1
ldi r25, 1
ldi32 r28, 99787
ldi r29, 3319
xout 0, r25, 1
ldi32 r28, 64663
ldi r29, 9521
xout 0, r25, 1
xin 0, r26, 4
# Success! Got the expected 946849476
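The expected value can be cross-checked on the host. A minimal Python sketch, assuming the accumulator simply sums the products in 64 bits and that a 4-byte XIN returns the low word of that sum; `mac_expected` is a hypothetical helper, not part of any PRU toolchain:

```python
def mac_expected(operand_pairs):
    """Sum of products, truncated to the 64 bits of R27:R26."""
    acc = 0
    for a, b in operand_pairs:
        acc = (acc + a * b) & 0xFFFFFFFFFFFFFFFF
    return acc

total = mac_expected([(99787, 3319), (64663, 9521)])
print(total & 0xFFFFFFFF)  # low word, as read by "xin 0, r26, 4": 946849476
```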
The two consecutive writes to the MAC's mode register seemed odd. Indeed, with the following two lines removed, the result was still correct:
ldi r25, 1
xout 0, r25, 1
Finally, let's check that the MAC accumulator "reset" works. A reset sets the accumulator to zero so that a new sequence of multiply-accumulate operations can start.
For a change, let's also test the full 64-bit result. This requires a trivial change to pass a 64-bit value to the host, which I won't show here.
# First MAC cycle that we'll ignore
ldi r25, 3
xout 0, r25, 1
ldi r25, 1
ldi32 r28, 99787
ldi r29, 3319
xout 0, r25, 1
ldi32 r28, 64663
ldi r29, 9521
xout 0, r25, 1
xin 0, r26, 8
# Second MAC cycle "for real"
ldi r25, 3
xout 0, r25, 1
ldi r25, 1
ldi32 r28, 100931
ldi32 r29, 1000033
xout 0, r25, 1
ldi32 r28, 104701
ldi32 r29, 1000003
xout 0, r25, 1
xin 0, r26, 8
# Success! Got the expected 0x2FE0D6ED9A (205635644826)
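A host-side sketch of what this test checks, again in Python and again only a model: the hypothetical `MacModel` class assumes a 64-bit accumulator that `clear()` zeroes, standing in for the mode-register write of 3. If the reset had no effect, the first sequence would leak into the second result.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

class MacModel:
    """Toy accumulator; it does not model the real mode-register bits."""

    def __init__(self):
        self.acc = 0

    def clear(self):
        self.acc = 0

    def mac(self, a, b):
        self.acc = (self.acc + a * b) & MASK64
        return self.acc

first = [(99787, 3319), (64663, 9521)]
second = [(100931, 1000033), (104701, 1000003)]

m = MacModel()
m.clear()
for a, b in first:
    m.mac(a, b)
m.clear()                     # the reset under test
for a, b in second:
    m.mac(a, b)
print(hex(m.acc))             # 0x2fe0d6ed9a

leaky = MacModel()
for a, b in first + second:   # what a failed reset would produce
    leaky.mac(a, b)
print(leaky.acc - m.acc)      # 946849476, the leftover from the first sequence
```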
While playing with the above examples, I noticed something peculiar. My previously working MUL example gave wrong results when I ran it right after testing the MAC example, even though I was reloading the PRU remoteproc firmware between test sessions. For kernel 4.4.52-ti-r91 running on my BBG, the commands are:
echo "4a338000.pru1" > /sys/bus/platform/drivers/pru-rproc/unbind
echo "4a338000.pru1" > /sys/bus/platform/drivers/pru-rproc/bind
Evidently, the MAC mode register is not cleared on remoteproc reset. I therefore recommend explicitly initializing the MUL/MAC mode register at the beginning of your assembly program:
ldi r25, 0
xout 0, r25, 1
In case you're in a hurry, here is how to multiply two 32-bit integers and get a 32-bit result:
ldi r25, 0
xout 0, r25, 1 # Reset the MAC mode register (one-time initialization).
ldi r28, 1001 # mov or ldi to load operand into R28.
ldi r29, 4567 # mov or ldi to load second operand into R29.
nop # Delay one cycle before acquiring the result.
xin 0, r26, 4 # Load the MUL result into R26.
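Reading only 4 bytes keeps the low word of the product, so the result is exact only when the full product fits in 32 bits. A small Python sketch of that truncation, assuming (as in the 64-bit test above) that R26 holds the low word and R27 the high word; `split_product` is a hypothetical helper:

```python
def split_product(a, b):
    """Return (r26, r27): the low and high 32-bit words of a * b."""
    product = (a & 0xFFFFFFFF) * (b & 0xFFFFFFFF)
    return product & 0xFFFFFFFF, product >> 32

print(split_product(1001, 4567))        # (4571567, 0): fits in R26
print(split_product(0x10000, 0x10000))  # (0, 1): the low word alone is wrong
```

When the operands can be large, reading 8 bytes with XIN, as in the 64-bit test, avoids the truncation.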