diff --git a/content/post/multiply.md b/content/post/multiply.md index 0feb1e2..84d3c21 100644 --- a/content/post/multiply.md +++ b/content/post/multiply.md @@ -1,18 +1,22 @@ +++ title = "Investigating Multiplication on the Gameboy Advance" -date = "2024-10-22" +date = "2023-10-22" author = "zayd" tags = ["dev", "gba", "arm"] +++ +This blog post assumes a base level of knowledge - comfort in the C programming language and bitwise math is recommended. Also, if you ever have any questions, any at all, while reading this blog post, feel free to reach out to me [here]( +https://github.com/bmchtech/blog/discussions). + The Gameboy Advance has a pretty neat CPU - the ARM7TDMI. And by neat, I mean a chaotic and sadistic bundle of questionable design decisions. Seriously, they decided that the program counter should -be a _general purpose register_. Why??? Insert simile here. I'm not even joking, you can use the program +be a _general purpose register_. Why??? That's like allowing a drunk driver to change their tires while going 30 over the speed limit near a school. I'm not even joking, you can use the program counter as the output to, say, an XOR instruction. Or an AND instruction. Or a multiply instruction. -The ARM7TDMI has six different multiply instructions. The type signatures are: + +The ARM7TDMI has six different multiply instructions. They are: - u32 = u32 x u32 - u64 = u32 x u32 - i64 = i32 x i32 @@ -20,15 +24,15 @@ The ARM7TDMI has six different multiply instructions. The type signatures are: - u64 = u32 x u32 + u64 - i64 = i32 x i32 + i64 -Why are we talking about these instructions? Well the ARM7TDMI's multiplications instruction have a pretty interesting side effect. Here the manual says that -after a multiplication instruction executes, the carry and overflow flags are `UNPREDICTABLE`. +Why are we talking about these instructions? Well the ARM7TDMI's multiplication instructions have a pretty interesting side effect. Here the manual says that +after a multiplication instruction executes, the carry flag is `UNPREDICTABLE`. -![An image of the ARM7TDMI manual explaining that the carry and overflow flags are `UNPREDICTABLE` after a multiply instruction.](/manual.png) +![An image of the ARM7TDMI manual explaining that the carry and overflow flags are UNPREDICTABLE after a multiply instruction.](../../manual.png) A short description of carry and overflow flags after a multiplication instruction from the ARM7TDMI manual. [[1](#cite1)] As if anything else in this god forsaken CPU was predictable. What this means is that software cannot and should not rely on the value of the carry flag after multiplication executes. It can be set to anything. Any -value. 0, 1, a horse, whatever. This has been a source of memes in the emudev community for a bit - +value. 0, 1, a horse, whatever. This has been a source of memes in the emulator development community for a few years - people would frequently joke about how the implementation of the carry flag may as well be `cpu.flags.c = rand() & 1;`. And they had a point - the carry flag seemed to defy all patterns; nobody understood why it behaves the way it does. But the one thing we did know, was that the carry flag seemed to be @@ -37,16 +41,9 @@ same value. This was big news, because it meant that understanding the carry fla insight into how this CPU performs multiplication. And just to get this out of the way, the carry flag's behavior after multiplication isn't an important detail to -emulate at all. Software doesn't rely on it. And if software _did_ rely on it, then screw that software, those -developers got what was coming to them. But the carry flag is a meme, and it's a really tough puzzle, and +emulate at all. Software doesn't rely on it. And if software _did_ rely on it, then screw the developers who wrote that software. But the carry flag is a meme, and it's a really tough puzzle, and that was motivation enough for me to give it a go. Little did I know it'd take _3 years_ of on and off work. -<<<<<<< HEAD -======= -Now is probably the time to say that this blog post assumes a base level of knowledge - comfort in the C programming language and bitwise math is recommended. Also, if you ever have any questions, any at all, while reading this blog post, feel free to reach out to me [here]( -https://github.com/bmchtech/blog/discussions). - ->>>>>>> ab429a6 (fixes) # Standard Algorithm What's the simplest, most basic multiplication algorithm you can think of to multiply a **multiplier** with a **multiplicand**? One really easy way is to leverage the distributive property of multiplication like so: @@ -67,7 +64,7 @@ $$ The convenient thing about binary is that it's all ones and zeros, meaning the addends are only ever `0`, or the **multiplicand** left shifted by some factor. This makes the addends easy to compute, and means that for an `N-bit` number, we need to produce `N` different addends, and add them all up to get the result. -That's slow. We can do better. +That's a lot of addends, which is slow. We can do better. # Modified Booth's Algorithm The main slowness of the Standard Algorithm is that it requires you to add a _lot_ of numbers together. @@ -78,7 +75,7 @@ $$ \end{aligned} $$ -Now we apply the following transformations. Yes I know this looks scary, you could skip to the final equation if you want. +Now we apply the following transformations. Yes I know this looks scary, you could skip to [the final equation](#finaleq) if you want. $$ \begin{align} \color{#3a7dc9}{m}\color{#4A4358} \cdot \color{#DC6A76} \alpha \color{#4A4358} &= \sum_{i=0}^{n-1} (2^i \cdot \color{#3a7dc9}{m[i]}\color{#4A4358} \cdot \color{#DC6A76} \alpha \color{#4A4358} )\cr\cr @@ -111,7 +108,7 @@ $$ This is always one of `(-2, -1, 0, 1, 2)`. - + Multiplication by those five numbers is easy to calculate in hardware (well, negation is tricky - the algorithm implements negation as bitwise inversion, with an additional 1 added at a later stage. More information about this is given later). Note also that if we define: @@ -125,15 +122,17 @@ and $$ \begin{aligned} \text{For}&\text{ unsigned multiplication:}\cr\cr -&x \geq n, \color{#3a7dc9}{m[x]}\color{#4A4358} = 0\cr\cr +&x \geq n: \color{#3a7dc9}{m[x]}\color{#4A4358} = 0\cr\cr \text{For}&\text{ signed multiplication:}\cr\cr -&x \geq n, \color{#3a7dc9}{m[x]}\color{#4A4358} = \color{#3a7dc9}m[n-1]\cr\cr +&x \geq n: \color{#3a7dc9}{m[x]}\color{#4A4358} = \color{#3a7dc9}m[n-1]\cr\cr \end{aligned} $$ Then the leftover three terms outside the summation can be absorbed into the summation, by expanding the summations range by one on both boundaries. And so we have: + + $$ \begin{aligned} \color{#3a7dc9}{m}\color{#4A4358} \cdot \color{#DC6A76} \alpha \color{#4A4358} &= \sum_{i=0}^{\frac{n}{2}} ((2^{2i} \cdot \color{#DC6A76} \alpha \color{#4A4358}) \cdot (\color{#3a7dc9}{m[2i]}\color{#4A4358} + \color{#3a7dc9}{m[2i - 1]}\color{#4A4358} - 2 \cdot \color{#3a7dc9}{m[2i + 1]}\color{#4A4358}))\cr @@ -164,6 +163,12 @@ struct BoothRecodingOutput booth_recode(u64 input, BoothChunk booth_chunk) { case 6: return (struct BoothRecodingOutput) { ~input, 1 }; case 7: return (struct BoothRecodingOutput) { 0, 0 }; } + + // Note that case 4 can *not* be implemented as 2 * (~input). The reason why + // is that the real value of the addend as represented by the struct is + // recoded_output + carry. Doing the inversion after the multiplication by 2 + // will put a 1 in the LSB of the recoded_output, allowing the carry to be + // added correctly. } ``` For the curious, more information about Booth Recoding can be found in this resource. [[2](#cite2)] @@ -202,8 +207,9 @@ So you can chain a bunch of CSAs to get yourself down to two addends, and then y u64 add_csa_results(u64 result, u64 carries) { // Exercise for the reader: Why do you suppose we multiply // carries by 2? Think about how a full adder is implemented, - // and what the variable "carries" in the csa function actually - // represents. The answer is given after this code block. + // and what the variable "carries" in the perform_csa function + // above actually represents. The answer is given after this + // code block. return result + carries * 2; } @@ -212,55 +218,50 @@ The reason we multiply `carries` by two is because, if we think about how a full from bit `i` is added to bits `i + 1` of the addends. So, bit `i` of carries has double the "weight" of bit `i` of result. This is a **very** important detail that will come in handy later, so do make sure you understand this. -<<<<<<< HEAD -======= Using CSAs, the ARM7TDMI can sum up the addends together much faster. [[4, p. 94](#cite4)] ->>>>>>> ab429a6 (fixes) # Parallelism Until now, we've mostly treated "generate the addends" and "add the addends" as two separate, entirely discrete steps of the algorithm. But, turns out, we can do both of these steps _at the same time_. We know we can only add 4 addends per cycle, so what if we generate 4 addends per cycle, and compress -them using four CSAs to generate only two addends? So, we pipe 4 CSAs into each other, allowing us to process 6 `N`-bit inputs into two `N + 8` bit outputs. Each CSA widens the -output by `2`, because the carries that the CSA outputs has twice the weight of the sum, meaning the -carries needs to be represented using an `N + 1` bit number. Each cycle, we can generate 4 addends, -feed them into 4 of the 6 outputs of this CSA array, and then when we have our two results, feed those -results back to the very top of the CSA array for the next cycle. We can initialize those two inputs to the -CSA array with `0`s. Or, if we want to be clever, we can implement multiply accumulate by initializing one -of those two inputs with the accumulate value, and get multiply accumulate for free. This trick is what the -ARM7TDMI employs to do multiply accumulate. (This is a moot point, because the CPU is stupid and can only read two register values at a time per cycle. So, using an accumulate causes the CPU to take +them using four CSAs to generate only two addends? So, we pipe 4 CSAs into each other, allowing us to process 6 `N`-bit inputs into two `N + 8` bit outputs. The reason the outputs are of size `N + 8` can be derived from [the equation above](#finaleq) - each addend is shifted left by 2 more than the previous addend. + + Each cycle, we read 8 bits from the **multiplier**, and with it, we generate 4 addends. We then +feed them into 4 of the 6 outputs of this CSA array, and when we have our 2 results, feed those +2 results back to the very top of the CSA array for the next cycle. On the first cycle of the algorithm, we can initialize those 2 inputs to the +CSA array with `0`s. + +A clever trick can be done here. The ARM7TDMI [supports mutliply accumulates](#instructions), which perform multiplication and addition in one instruction. We can implement multiply accumulate by initializing one +of those two inputs with the accumulate value, and get multiply accumulate without extra cycles. This trick is what the +ARM7TDMI employs to do multiply accumulate. (This ends up being a moot point, because the CPU is stupid and can only read two register values at a time per cycle. So, using an accumulate causes the CPU to take an extra cycle _anyway_). [[4, p.95](#cite4)] # Early Termination The ARM7TDMI does something really clever here. In our current model of the algorithm, there are 4 -cycles of CSA compression, where each cycle `i` processes bits `8 * i` to `8 * i + 7` of the **multiplier**. -(explain this in previous section). The observation is that if the remaining upper bits of the **multiplier** are all +cycles of CSA compression, where each cycle `i` processes bits `8 * i` to `8 * i + 7` of the **multiplier**. The observation is that if the remaining upper bits of the **multiplier** are all zeros, then, we can skip that cycle, since the addends produced will be all zeros, which cannot possibly affect the values of the partial result + partial carry. We can do the same trick if the remaining upper bits are all ones (assuming we are performing a signed multiplication), as those also produce addends that -<<<<<<< HEAD -are all zeros. -======= are all zeros. [[4, p.95](#cite4)] ->>>>>>> ab429a6 (fixes) -# Putting it all together +# Putting it all Together Here's a rough diagram, provided by Steve Furber in his book, Arm System-On-Chip Architecture: -![An image of the high level overview of the multiplier's organization, provided by Steve Furber in his book, Arm System-On-Chip Architecture](/booth.png) + +![An image of the high level overview of the multiplier's organization, provided by Steve Furber in his book, Arm System-On-Chip Architecture](../../booth.png) An image of the high level overview of the multiplier's organization, provided by Steve Furber in his book, Arm System-On-Chip Architecture. [[4, p.95](#cite4)] Partial Sum / Partial Carry contain the results obtained by the CSAs, and are rotated right by 8 on each cycle. Rm is recoded using booth's algorithm to produce the addends for the CSA array. [[4, p.95](#cite4)] -Ok, but remember when I said (make sure I said this) that there will be an elegant way to handle booth's negation of the addends? The way the algorithm gets around this is kind of genius. Remember how the carry output of a CSA has to be left shifted by 1? Well, this left-shift creates a zero in the LSB of the carry output of the CSA, so why don't we just put the carry in that bit? [[5, p. 12](#cite5)] Like so: +Ok, but [remember when I said](#negation) that there will be an elegant way to handle booth's negation of the addends? The way the algorithm gets around this is kind of genius. Remember how the carry output of a CSA has to be left shifted by 1? Well, this left-shift creates a zero in the LSB of the carry output of the CSA, so why don't we just put the carry in that bit? [[5, p. 12](#cite5)] Like so: ```c struct CSAOutput perform_csa_array(u64 partial_sum, u64 partial_carry, - BoothRecodingOutput addends) { + struct RecodedMultiplicands addends) { struct CSAOutput csa_output = { partial_sum, partial_carry }; struct CSAOutput final_csa_output = { 0, 0 }; @@ -268,7 +269,8 @@ struct CSAOutput perform_csa_array(u64 partial_sum, u64 partial_carry, csa_output.output &= 0x1FFFFFFFFULL; csa_output.carry &= 0x1FFFFFFFFULL; - struct CSAOutput result = perform_csa(csa_output.output, addends.m[i].recoded_output & 0x1FFFFFFFFULL, csa_output.carry); + struct CSAOutput result = perform_csa(csa_output.output, + addends.m[i].recoded_output & 0x1FFFFFFFFULL, csa_output.carry); // Inject the carry caused by booth recoding result.carry <<= 1; @@ -300,27 +302,39 @@ struct CSAOutput perform_csa_array(u64 partial_sum, u64 partial_carry, (Yes, this insanity is indeed done by the actual CPU.) # Fatal Contradiction -So I lied to you all. There's a small, but very meaningful difference between the algorithm I described and + +Didn't we just finish the section titled "Putting it all Together"? Why then is the scroll bar still halfway down the page? + +Because I lied to you all. There's a small, but very meaningful difference between the algorithm I described and the ARM7TDMI's algorithm. Let's consider the following multiplication: $$ \color{#3a7dc9}0x000000FF\color{#4A4358} \cdot \color{#DC6A76}0x00000001 \color{#4A4358} $$ How many cycles should this take? 1, right? Because the upper 24 bits of the **multiplier** are zeros, then the second, third, and fourth cycles of addends will all be zeros... right? -Right? -Well, that's how long it takes the ARM7TDMI to do it. So what's the issue? Turns out, the second cycle of -the algorithm does have a single non-zero addend: +Right? +Well, that's how long it takes the ARM7TDMI to do it. So what's the issue? Let's take a closer look. The first cycle of the algorithm should have the following four chunks: $$ \begin{aligned} -&\text{Chunk #1: }\color{#3a7dc9}\text{0b001}\cr -&\text{Chunk #2: }\color{#3a7dc9}\text{0b000}\cr -&\text{Chunk #3: }\color{#3a7dc9}\text{0b000}\cr -&\text{Chunk #4: }\color{#3a7dc9}\text{0b000}\cr +&\text{Chunk #1: }\color{#3a7dc9}\text{0b110 (obtained from m[1..-1])}\cr +&\text{Chunk #2: }\color{#3a7dc9}\text{0b111 (obtained from m[3..1])}\cr +&\text{Chunk #3: }\color{#3a7dc9}\text{0b111 (obtained from m[5..3])}\cr +&\text{Chunk #4: }\color{#3a7dc9}\text{0b111 (obtained from m[7..5])}\cr \end{aligned} $$ -Because the LSB of Chunk #1 uses the MSB of Chunk #3 of Cycle #1, our algorithm would be forced to +Turns out, in our current version of the algorithm, the second cycle does have a single non-zero addend: + +$$ +\begin{aligned} +&\text{Chunk #1: }\color{#3a7dc9}\text{0b001 (obtained from m[9..7])}\cr +&\text{Chunk #2: }\color{#3a7dc9}\text{0b000 (obtained from m[11..9])}\cr +&\text{Chunk #3: }\color{#3a7dc9}\text{0b000 (obtained from m[13..11])}\cr +&\text{Chunk #4: }\color{#3a7dc9}\text{0b000 (obtained from m[15..13])}\cr +\end{aligned} +$$ +Because the LSB of Chunk #1 of Cycle #2 uses the MSB of Chunk #4 of Cycle #1, our algorithm would be forced to take 2 cycles of CSAs. And yet, on the ARM7TDMI, this multiplication would terminate early, after only 1 cycle of CSAs. And there doesn't seem to be a good way around this. And so I sat there thinking of workarounds. @@ -341,7 +355,7 @@ be capable of allowing the CPU to do 5 chunks per cycle. magic to pull this off? **Rebuttal:** Yeah, that's assigning too much credit to this god forsaken bundle of wires that somehow -obtained the title of "CPU". +obtained the title of "CPU" (actually, this solution ends up being closest to the right answer) I was kind of out of ideas. I was pretty much ready to give up - my current algorithm was nowhere near @@ -350,7 +364,7 @@ once in a while. # Descent into Madness Congrats for getting this far, now comes the tricky stuff. I require anyone who wants to continue reading to -put on [this music](https://www.google.com/url?sa=t&source=web&rct=j&opi=89978449&url=https://www.youtube.com/watch?v=ntgaStqpmjQ&ved=2ahUKEwj03pfgvs2IAxWjJTQIHTp_EToQtwJ6BAgIEAI&usg=AOvVaw1qRD2jAXNcY-9YA6Uhb9Ig) in the background, as it most accurately models the trek into insanity we are about to endure. +put on this music in the background, as it most accurately models the trek into insanity we are about to endure. So fast forward about a year, I'm out for a walk and I decide to give this problem a thought again. And so I considered something that, at the outset, sounds really, really stupid. @@ -358,45 +372,48 @@ So fast forward about a year, I'm out for a walk and I decide to give this probl I mean, it's kind of dumb, right? The entire issue is that the **multiplier** is _too big_. Left shifting it would only exacerbate this issue. Congrats, we went from being able to process 7 bits on the first cycle to 6. -But pay attention to the **first addend** that would be produced. The corresponding **chunk** would either be `000` or `100`. Two options, both of which are really easy to compute. This is a **chunk** that would only exist on the first cycle of the algorithm. Coincidentally, if you refer to the diagram[have actual link or figure #] up above, you'll notice that, in the first cycle of the algorithm, we have an extra input in the CSA array that we initialized to zero. What if, instead, we initialize it to the addend produced by this mythical **chunk**? [[5, p. 14](#cite5)] +But pay attention to the **first addend** that would be produced. The corresponding **chunk** would either be `000` or `100`. Two options, both of which are really easy to compute. This is a behavior that would only exist on the first cycle of the algorithm. Coincidentally, if you refer to [the diagram](#diagram) up above, you'll notice that, in the first cycle of the algorithm, we have an extra input in the CSA array that we initialized to zero. What if, instead, we initialize it to the addend produced by this mythical **chunk**? Allowing us to process one additional bit on the first cycle only? [[5, p. 14](#cite5)] It'd solve the issue. It'd get us the extra bit we needed, and make us match the ARM7TDMI's cycle counts completely. But that's not all. Remember the carry flag from earlier? With this simple change, we go from matching hardware about 50% of the time (no better than randomly guessing) to matching hardware _**85%**_ of the time. This sudden increase was something no other theory was able to do, and made me really confident that I was on to something. However, this percentage only happens if we set the carry flag to bit `30` of the partial carry result, which seems super arbitrary. It turns out that bit of the partial carry result had a special meaning I did not realize at the time, and I would only find out that meaning much, much later. +(Obviously, shifting the **multiplier** left by 1 means the result is now twice what it should be. This is handled later.) + # Mathematical Black Magic -It feels like we are finally making some sort of progress, however my algorithm still failed to calculate the carry flag properly around 15% of the time, and failed way more than that on long / signed multiplies. It was around this time that I found two patents, that almost _entirely_ explained the algorithm. No idea how these hadn't been found up until this point, but they were quite illuminating. [[5](#cite5)], [[6](#cite6)] +It feels like we are finally making some sort of progress, however my algorithm still failed to calculate the carry flag properly around 15% of the time, and failed way more than that on 64-bit and signed multiplies. It was around this time that I found two patents, that almost _entirely_ explained the algorithm. No idea how these hadn't been found up until this point, but they were quite illuminating. [[5](#cite5)], [[6](#cite6)] After reading the patents, it turns out my implementation of the CSA array is slightly flawed (see [`perform_csa_array`](#perform_csa_array) above). In particular, that function uses CSAs with a width of _64_ bits. That's way too large and wastes space on the chip - the actual hardware gets away with only using _31_. Another difference is that my algorithm has no way yet of supporting long accumulate values. Sure, I can initialize the partial output with the accumulate value, but the partial output is only 32 bits wide. -Turns out, the patents describe a way to deal with both of these issues at once, using some mathematical trickery. Pretty much the entire rest of this section is derived from [5, pp. 14-17]. This is the hardest part of the algorithm, so hang in there. +Turns out, the patents describe a way to deal with both of these issues at once, using some mathematical trickery. Pretty much the entire rest of this section is derived from one of the patents [[5, pp. 14-17](#cite5)]. This is the hardest part of the algorithm, so hang in there. -Roughly, on each CSA, we want to add three numbers together to produce two numbers. Let's give these five numbers some names. Define `S` to be a 33-bit value (even though the actual S is 32-bits, adding an extra bit allows us to handle both signed and unsigned multiplication) representing the previous CSA's sum, `C` to be a 33-bit value representing the previous CSA's carry, and `S'` and `C'` to be 33-bit values representing the resulting CSA sum / carry. Finally, define `X` to be a 34-bit value containing the current booths addend. Then we have: +Roughly, on each CSA, we want to add three numbers together to produce two numbers. Let's give these five numbers some names. Define `S` to be a 33-bit value representing the previous CSA's sum (even though the actual sum is 32-bits, adding an extra bit allows us to handle both signed and unsigned multiplication), `C` to be a 33-bit value representing the previous CSA's carry, and `S'` and `C'` to be 33-bit values representing the resulting CSA sum / carry. Finally, define `X` to be a 34-bit value containing the current addend. Then we have: $$ S', C' = S + C + X $$ -This, mathematically speaking, can be represented as a 65-bit addition. The reason why is that `X` can be left-shifted by as little as 0, and as much as 32. If we define `i` to be a number from `[0 - 3]` representing the CSA's position in the CSA array, we can divide the 65 bit CSA addition region into five chunks: -- Lower: A region of size `2i` that represents `final_csa_output`. This region is unaffected by future CSAs. +This, mathematically speaking, can be represented as a 65-bit addition. The reason why is that `X` can be left-shifted by as little as 0, and as much as 32. + +Now, if we define `i` to be a number from `[0 - 3]` representing the CSA's position in the CSA array, we can divide the 65 bit CSA addition region into five chunks: +- Lower: A region of size `2i` that represents `final_csa_output`. This region is unaffected by future CSAs, since all future addends are multiplied by at least `2^(2*i)`. - TransL: The two bits of CSA `#i` that will become Lower bits in CSA `#(i + 1)`. -- Active: The 31-bit region where, including TransL, the actual CSA will be performed. +- Active: The 31-bit region where, including TransL, the actual CSA will be performed. Active itself is 31-bits wide, but with TransL, this is 33-bits. - TransH: The two bits of CSA `#i` that will become Active bits in CSA `#(i + 1)` - High: Contains values that have not yet been put into the CSA. - 34-bit value containing the current booths addend. -Define `A` to be a `65-bit` accumulate (even though the actual accumulate is 64-bits, adding an extra bit allows us to handle both signed and unsigned accumulates). Define `SL` and `CL` to be the analogue of `final_csa_output` in (see [`the code snippet above`](#perform_csa_array) above). Finally, define `XC` to be the carry flag produced by booth recoding. Then, we can model the CSA as follows: +Define `A` to be a `65-bit` accumulate (even though the actual accumulate is 64-bits, adding an extra bit allows us to handle both signed and unsigned accumulates). Define `SL` and `CL` to be the analogue of `final_csa_output` in (see [`the code snippet above`](#perform_csa_array) above). Finally, define `XC` to be the carry flag produced by booth recoding. Then, we can model the addition of the 3 operands in the CSA as follows: | Region: | High | TransH | Active | TransL | Lower | -| - | - | - | - | - | - | +| -- | - | - | - | - | - | | Size: | 30 - 2i | 2 | 31 | 2 | 2i | -| Addend #1: | 0 | 0 | S[32:2] | S[1:0] | SL[2i:0] -| Addend #2: | C[32], ..., C[32] | C[32] C[32] | C[32:2] | C[1:0] | CL[2i:0] -| Addend #3: | X[33], ..., X[33] | X[33] X[33] | X[32:2] | X[1:0] | 0 +| Operand #1: | 0 | 0 | S[32:2] | S[1:0] | SL[2i:0] +| Operand #2: | C[32], ..., C[32] | C[32] C[32] | C[32:2] | C[1:0] | CL[2i:0] +| Operand #3: | X[33], ..., X[33] | X[33] X[33] | X[32:2] | X[1:0] | 0 | Result Sum: | 0 | S'[32:31] | S'[30:0] | SL[2i+2:2i] | SL[2i:0] | Result Carry: | C'[32], ..., C'[32] | C'[32:31] | C'[30:0] | CL[2i+2], XC (aka CL[2i+1]) | CL[2i:0] @@ -407,9 +424,9 @@ Here's a simple way to implement long accumulates. 33 bits of the `A` will be pl | Region: | High | TransH | Active | TransL | Lower | | - | - | - | - | - | - | | Size: | 30 - 2i | 2 | 31 | 2 | 2i | -| Addend #1: |0 | A[2i+35 : 2i+34] | S[32:2] | S[1:0] | SL[2i:0] -| Addend #2: | C[32], ..., C[32] | C[32] C[32] | C[32:2] | C[1:0] | CL[2i:0] -| Addend #3: | X[33], ..., X[33] | X[33] X[33] | X[32:2] | X[1:0] | 0 +| Operand #1: |0 | A[2i+35 : 2i+34] | S[32:2] | S[1:0] | SL[2i:0] +| Operand #2: | C[32], ..., C[32] | C[32] C[32] | C[32:2] | C[1:0] | CL[2i:0] +| Operand #3: | X[33], ..., X[33] | X[33] X[33] | X[32:2] | X[1:0] | 0 | Result Sum: | 0 |S'[32:31] | S'[30:0] | SL[2i+2:2i] | SL[2i:0] | Result Carry: | C'[32], ..., C'[32] | C'[32:31] | C'[30:0] | CL[2i+2], XC (aka CL[2i+1]) | CL[2i:0] @@ -418,46 +435,46 @@ We can ignore the Lower column, since the result there is always the same as the | Region: | High | TransH | - | - | - | | Size: | 30 - 2i | 2 -| Addend #1: |0 | A[2i+35 : 2i+34] | S[32:2] | -| Addend #2: | C[32], ..., C[32] | C[32] C[32] | C[32:2] | C[1:0] | CL[2i:0] -| Addend #3: | X[33], ..., X[33] | X[33] X[33] | X[32:2] | X[1:0] | 0 +| Operand #1: |0 | A[2i+35 : 2i+34] | S[32:2] | +| Operand #2: | C[32], ..., C[32] | C[32] C[32] | C[32:2] | C[1:0] | CL[2i:0] +| Operand #3: | X[33], ..., X[33] | X[33] X[33] | X[32:2] | X[1:0] | 0 | Result Sum: | 0 |S'[32:31] | S'[30:0] | SL[2i+2:2i] | SL[2i:0] | Result Carry: | C'[32], ..., C'[32] | C'[32:31] | C'[30:0] | CL[2i+2], XC (aka CL[2i+1]) | CL[2i:0] -We can replace Addend #2 with one row of all ones, and another row with just `!C[N]`. Convince yourself why this is mathematically OK. +We can do some trickery to replace Operand #2 with one row of all ones, and another row with just `!C[N]`. Convince yourself why this is mathematically OK. | Region: | High | TransH | - | - | - | | Size: | 30 - 2i | 2 -| Addend #1: |0 | A[2i+35 : 2i+34] | S[32:2] | -| Addend #2: | 1, ..., 1 | 1 1 | C[32:2] | C[1:0] | CL[2i:0] -| Addend #2.5: | 0 | 0 !C[32] | C[32:2] | C[1:0] | CL[2i:0] -| Addend #3: | X[33], ..., X[33] | X[33] X[33] | X[32:2] | X[1:0] | 0 +| Operand #1: |0 | A[2i+35 : 2i+34] | S[32:2] | +| Operand #2: | 1, ..., 1 | 1 1 | C[32:2] | C[1:0] | CL[2i:0] +| Operand #2.5: | 0 | 0 !C[32] | C[32:2] | C[1:0] | CL[2i:0] +| Operand #3: | X[33], ..., X[33] | X[33] X[33] | X[32:2] | X[1:0] | 0 | Result Sum: | 0 |S'[32:31] | S'[30:0] | SL[2i+2:2i] | SL[2i:0] | Result Carry: | C'[32], ..., C'[32] | C'[32:31] | C'[30:0] | CL[2i+2], XC (aka CL[2i+1]) | CL[2i:0] -Do the same to Addend #3: +Do the same to Operand #3: | Region: | High | TransH | - | - | - | | Size: | 30 - 2i | 2 -| Addend #1: |0 | A[2i+35 : 2i+34] | S[32:2] | -| Addend #2: | 1, ..., 1 | 1 1 | C[32:2] | C[1:0] | CL[2i:0] -| Addend #2.5: | 0 | 0 !C[32] | C[32:2] | C[1:0] | CL[2i:0] -| Addend #3: | 1, ..., 1 | 1 1 | C[32:2] | C[1:0] | CL[2i:0] -| Addend #3.5: | 0 | 0 !X[33] | C[32:2] | C[1:0] | CL[2i:0] +| Operand #1: |0 | A[2i+35 : 2i+34] | S[32:2] | +| Operand #2: | 1, ..., 1 | 1 1 | C[32:2] | C[1:0] | CL[2i:0] +| Operand #2.5: | 0 | 0 !C[32] | C[32:2] | C[1:0] | CL[2i:0] +| Operand #3: | 1, ..., 1 | 1 1 | C[32:2] | C[1:0] | CL[2i:0] +| Operand #3.5: | 0 | 0 !X[33] | C[32:2] | C[1:0] | CL[2i:0] | Result Sum: | 0 |S'[32:31] | S'[30:0] | SL[2i+2:2i] | SL[2i:0] | Result Carry: | C'[32], ..., C'[32] | C'[32:31] | C'[30:0] | CL[2i+2], XC (aka CL[2i+1]) | CL[2i:0] -Now, Addends #2 and #3 can be added together, being replaced by a new `Addend #4`. +Now, Operands #2 and #3 can be added together, being replaced by a new `Operand #4`. | Region: | High | TransH | - | - | - | | Size: | 30 - 2i | 2 -| Addend #1: |0 | A[2i+35 : 2i+34] | S[32:2] | -| Addend #2.5: | 0 | 0 !C[32] | C[32:2] | C[1:0] | CL[2i:0] -| Addend #3.5: | 0 | 0 !X[33] | C[32:2] | C[1:0] | CL[2i:0] -| Addend #4: | 1, ..., 1 | 1 0 | C[32:2] | C[1:0] | CL[2i:0] +| Operand #1: |0 | A[2i+35 : 2i+34] | S[32:2] | +| Operand #2.5: | 0 | 0 !C[32] | C[32:2] | C[1:0] | CL[2i:0] +| Operand #3.5: | 0 | 0 !X[33] | C[32:2] | C[1:0] | CL[2i:0] +| Operand #4: | 1, ..., 1 | 1 0 | C[32:2] | C[1:0] | CL[2i:0] | Result Sum: | 0 |S'[32:31] | S'[30:0] | SL[2i+2:2i] | SL[2i:0] | Result Carry: | C'[32], ..., C'[32] | C'[32:31] | C'[30:0] | CL[2i+2], XC (aka CL[2i+1]) | CL[2i:0] @@ -469,19 +486,21 @@ We can now remove `S'` and the bits used to calculate it. Let's see what's left. | Region: | High | TransH | - | - | - | | Size: | 30 - 2i | 2 -| Addend #1: |0 | A[2i+35] 0 | S[32:2] | -| Addend #4: | 1, ..., 1 | 1 0 | C[32:2] | C[1:0] | CL[2i:0] +| Operand #1: |0 | A[2i+35] 0 | S[32:2] | +| Operand #4: | 1, ..., 1 | 1 0 | C[32:2] | C[1:0] | CL[2i:0] | Result Carry: | C'[32], ..., C'[32] | C'[32:31] | C'[30:0] | CL[2i+2], XC (aka CL[2i+1]) | CL[2i:0] Meaning `C'[32] = !A[2i+35]`. -And with that, we managed to go from using 64 bits of CSA, to only 33. [5 pp. 14-17] Our final algorithm for the CSAs is as follows: +And with that, we managed to go from using 64 bits of CSA, to only 33. [[5 pp. 14-17](#cite5)] Our final algorithm for the CSAs is as follows: + ```C -struct CSAOutput perform_csa_array(u64 partial_sum, u64 partial_carry, BoothRecodingOutput addends[4]) { +struct CSAOutput perform_csa_array(u64 partial_sum, u64 partial_carry, + struct RecodedMultiplicands addends[4]) { struct CSAOutput csa_output = { partial_sum, partial_carry }; struct CSAOutput final_csa_output = { 0, 0 }; @@ -489,7 +508,8 @@ struct CSAOutput perform_csa_array(u64 partial_sum, u64 partial_carry, BoothReco csa_output.output &= 0x1FFFFFFFFULL; csa_output.carry &= 0x1FFFFFFFFULL; - struct CSAOutput result = perform_csa(csa_output.output, addends.m[i].recoded_output & 0x1FFFFFFFFULL, csa_output.carry); + struct CSAOutput result = perform_csa(csa_output.output, + addends.m[i].recoded_output & 0x1FFFFFFFFULL, csa_output.carry); // Inject the carry caused by booth recoding result.carry <<= 1; @@ -511,7 +531,8 @@ struct CSAOutput perform_csa_array(u64 partial_sum, u64 partial_carry, BoothReco // Perform the magic described in the tables for the handling of TransH // and High. acc_shift_register contains the upper 31 bits of the acc // in its lower bits. - u64 magic = bit(acc_shift_register, 0) + !bit(csa_output.carry, 32) + !bit(addends.m[i].recoded_output, 33); + u64 magic = bit(acc_shift_register, 0) + + !bit(csa_output.carry, 32) + !bit(addends.m[i].recoded_output, 33); result.output |= magic << 31; result.carry |= (u64) !bit(acc_shift_register, 1) << 32; acc_shift_register >>= 2; @@ -528,7 +549,7 @@ struct CSAOutput perform_csa_array(u64 partial_sum, u64 partial_carry, BoothReco # The Specifics of Early Termination -We already touched on early termination briefly, but turns out it gets a bit more complicated. The patents don't exactly explain how early termination works here in much detail, besides some cryptic references to shift types / shift values. More importantly, we can implement early termination quite simply, like so: +We already touched on early termination briefly, but turns out it gets a bit more complicated. The patents don't exactly explain how early termination works in much detail, besides some cryptic references to shift types / shift values. But, I gave it my best guess. We know that we have the following condition for early termination: ```c bool should_terminate(u64 multiplier, enum MultiplicationFlavor flavor) { @@ -540,12 +561,12 @@ bool should_terminate(u64 multiplier, enum MultiplicationFlavor flavor) { } ``` -Note that **multiplier** is a signed 33-bit number. Now here's the main issue with early termination. After every cycle of booth's algorithm, a total of 41 bits of result are produced. To convince yourself of this, look at the final two lines of `perform_csa_array`. The bottom eight bits contain the result of each CSA, and the upper 33 bits above those 8 contain `csa_output`. After every cycle of booth's algorithm, the bottom eight bits are fed into a result register, since the _next_ cycle of booth's algorithm cannot change the value of those bottom eight bits. The upper 33 bits become the next input into the next cycle of booth's algorithm. Something like this: +Note that **multiplier** is a signed 33-bit number. Now here's the main issue with early termination. After every cycle of booth's algorithm, a total of 41 bits of result are produced. To convince yourself of this, look at the final two lines of [`perform_csa_array`](#perform_csa_array2). The bottom eight bits contain the result of each CSA, and the upper 33 bits above those 8 contain `csa_output`. After every cycle of booth's algorithm, the bottom eight bits are fed into a result register, since the _next_ cycle of booth's algorithm cannot change the value of those bottom eight bits. The upper 33 bits become the next input into the next cycle of booth's algorithm. Something like this: ```c // I'm using this over a __uint128_t since the latter isn't available -// on a GBA, and I need this code to compile on a GBA so I can fuzz the +// on a GBA, and I needed this code to compile on a GBA so I can fuzz the // outputs. struct u128 { u64 lo; @@ -553,38 +574,44 @@ struct u128 { }; // Latches that contain the final results of the algorithm. -// Really, these only need to be 66 bits, but 128 is good enough. -// Why 66? Because: -// - We obtain 1 bit from initialization (EXPLAIN) -// - We obtain 8 * 4 bits from booths algorithm -// - We obtain 33 more bits also from booths algorithm. -__uint128_t partial_sum; -__uint128_t partial_carry; +u128 partial_sum; +u128 partial_carry; do { csa_output = perform_one_cycle_of_booths_mutliplication( csa_output, multiplicand, multiplier); + // The bottom 8 bits of this cycle cannot be changed by future + // addends, since those addends will be at least 256 times as + // big as this cycle's addends. So, put them into the result + // latches now. partial_sum.lo |= csa_output.output & 0xFF; partial_carry.lo |= csa_output.carry & 0xFF; + // Get csa_output ready to be fed back into the CSAs on the next + // cycle csa_output.output >>= 8; csa_output.carry >>= 8; - partial_carry >>= 8; + // ROR == ROtate Right + partial_sum = u128_ror(partial_sum, 8); + partial_carry = u128_ror(partial_carry, 8); - multiplier = asr(multiplier, 8, 33); + // ASR = Arithmetic Shift Right for 33-bit numbers + multiplier = asr_33(multiplier, 8); } while (!should_terminate(multiplier, flavor)); partial_sum.lo |= csa_output.output; partial_carry.lo |= csa_output.carry; ``` -Since `partial_sum` and `partial_carry` are shift registers that get rotated with each iteration of booths algorithm, we need to rotate them again after the algorithm ends in order to correct them to their proper values. The `partial_carry`'s rotation is done via the ARM7TDMI's barrel shifter (explain what tha barrel shifteris), with the output of the barrel shifter going to the ALU. For long (64-bit) multiplies, two rotations occur, since the ALU can only add 32-bits at a time and so must be used twice. +Since `partial_sum` and `partial_carry` are shift registers that get rotated with each iteration of booths algorithm, we need to rotate them again after the algorithm ends in order to correct them to their proper values. Thankfully, the ARM7TDMI has something called a barrel shifter. The barrel shifter is a nifty piece of hardware that allows the CPU to perform an arbitrary shift/rotate before an ALU operation, all in one cycle. Since we plan to add `partial_sum` and `partial_carry` in the ALU, we may as well use the barrel shifter to rotate one of those two operands, with no additional cost. + +For long (64-bit) multiplies, two right rotations (known on the CPU as RORs) occur, since the ALU can only add 32-bits at a time and so must be used twice. Spoiler alert, the value of the carry flag after a multiply instruction comes from the carryout of this barrel shifter. -So, what rotation values does the ARM7TDMI use? According to one of the patents, for an unsigned multiply, all (1 or 2) uses of the barrel shifter do this. [[6, p. 9](#cite6)] +So, what rotation values does the ARM7TDMI use? According to one of the patents, for an unsigned multiply, all (1 for 32-bit multiplies or 2 for 64-bit ones) uses of the barrel shifter do this: [[6, p. 9](#cite6)] | # Iterations | Type | Rotation | | - | - | - | @@ -593,7 +620,7 @@ So, what rotation values does the ARM7TDMI use? According to one of the patents, | 3 |ROR|6 | | 4 |ROR|30 | -Signed multiplies differ from unsigned multiplies in their **second** barrel shift. The second one for signed multiplies looks like this. [[6, p. 9](#cite6)] +Signed multiplies differ from unsigned multiplies in their **second** barrel shift. The second one for signed multiplies uses Arithmetic Shift Rights (ASRs) and looks like this: [[6, p. 9](#cite6)] | # Iterations | Type | Rotation | | - | - | - | @@ -602,7 +629,7 @@ Signed multiplies differ from unsigned multiplies in their **second** barrel shi | 3 |ASR|6 | | 4 |ROR|30 | -I'm not going to lie, I couldn't make sense of these rotation values. At all. Maybe they were wrong, since they patents already had a couple major errors at this point. No idea. Turns out it doesn't _really_ matter for calculating the carry flag of a multiply instruction. Observe the operation of the ARM7TDMI's `ROR` and `ASR`. +I'm not going to lie, I couldn't make sense of these rotation values. At all. Maybe they were wrong, since they patents already had a couple major errors at this point. No idea. Turns out it doesn't _really_ matter for calculating the carry flag of a multiply instruction. Why? Well, observe what hapens when the ARM7TDMI does a `ROR` or `ASR`: Code from fleroviux's wonderful NanoBoyAdvance. [[7]](#cite7) ```C++ @@ -650,25 +677,31 @@ void ASR(u32& operand, u8 amount, int& carry, bool immediate) { } ``` -Note that in both ROR and ASR the carry will always be set to the last bit of the `operand` to be shifted out. e.g., if I rotate a value by `n`, then the carry will always be bit `n - 1` of the `operand`, since that was the last bit to be rotated out. Same goes for ASR. +Note that in both ROR and ASR the carry will always be set to the last bit of the `operand` to be shifted out. i.e., if I rotate a value by `n`, then the carry will always be bit `n - 1` of the `operand`, since that was the last bit to be rotated out. Same goes for ASR. -So, _it doesn't matter_ if I don't use the same rotation values as the patents. Since, no matter the rotation value, as long as the output from _my_ barrel shifter is the same as the output from the _ARM7TDMI's_ barrel shifter, and the input to _my_ barrel shift is the same as the input to the _ARM7TDMI's_ barrel shifter, then the last bit to be shifted out must be the same, and therefore the carry flag must _also_ have been the same. +So, _it doesn't matter_ if I don't use the same rotation values as the patents. Since, no matter the rotation value, as long as the output from _my_ barrel shifter is the same as the output from the _ARM7TDMI's_ barrel shifter, then the last bit to be shifted out must be the same, and therefore the carry flag must _also_ have been the same. -So, here's my implementation. I tried to somewhat mimic the table from above, but I didn't do a very good job. But it works, so fuck it. +So, here's my implementation. I tried to somewhat mimic the table from above at the cost of code readability, but I admittedly didn't do a very good job. But hey it works, so fuck it. ```C // I'm using this over a __uint128_t since the latter isn't available -// on a GBA, and I need this code to compile on a GBA so I can fuzz the +// on a GBA, and I needed this code to compile on a GBA so I can fuzz the // outputs. struct u128 { u64 lo; u64 hi; }; -// we have ror'd partial_sum and partial_carry by 8 * num_iterations + 1 -// we now need to ror backwards, i tried my best to mimic the table, but -// i'm off by one for whatever reason. +// The final output of multiplication +struct MultiplicationOutput { + u64 output; + bool carry; +}; + +// We have ror'd partial_sum and partial_carry by 8 * num_iterations + 1. +// We now need to ror backwards (rol). I tried my best to mimic the tables, but +// I'm off by one for whatever reason. int correction_ror; if (num_iterations == 1) correction_ror = 23; if (num_iterations == 2) correction_ror = 15; @@ -681,6 +714,7 @@ partial_carry = u128_ror(partial_carry, correction_ror); int alu_carry_in = bit(multiplier, 0); if (is_long(flavor)) { + // Did we not early-terminate? if (num_iterations == 4) { struct AdderOutput adder_output_lo = adder(partial_sum.hi, partial_carry.hi, alu_carry_in); @@ -698,10 +732,11 @@ if (is_long(flavor)) { int shift_amount = 1 + 8 * num_iterations; - // why this is needed is unknown, but the multiplication doesn't work + // Why this is needed is unknown, but the multiplication doesn't work // without it shift_amount++; + // Sign extend partial_carry.lo from shift_amount to 64-bits partial_carry.lo = sign_extend(partial_carry.lo, shift_amount, 64); partial_sum.lo |= acc_shift_register << (shift_amount); @@ -713,6 +748,7 @@ if (is_long(flavor)) { }; } } else { + // Did we not early-terminate? if (num_iterations == 4) { struct AdderOutput adder_output = adder(partial_sum.hi, partial_carry.hi, alu_carry_in); @@ -732,7 +768,7 @@ if (is_long(flavor)) { ``` -Anyway, that's basically it. If you're interested in the full code, take a look [here](https://github.com/zaydlang/multiplication-algorithm/tree/master). +Anyway, that's basically it. What a meme. If you're interested in the full code, take a look [here](https://github.com/zaydlang/multiplication-algorithm/tree/master). # Works Cited diff --git a/static/booth.png b/static/booth.png new file mode 100644 index 0000000..652bcf7 Binary files /dev/null and b/static/booth.png differ