2 comments

  • foltik 1 minute ago
    Compilers already optimize division of a 32 bit integer x by a constant d like:

      c = ceil(2^a / d)
      so, approximately: d = 2^a / c
      so: x / d = (x * c) / 2^a   (exact for all 32-bit x once a is chosen large enough)
    
    And dividing by 2^a is just >> a, so we have (x * c) >> a: a multiply and a shift, which is much cheaper than hardware integer division.

    The problem is that for some divisors (7, 14, 21, ...) the smallest c that works for all 32-bit inputs is 33 bits wide. That's awkward on a 32-bit CPU: the constant just barely doesn't fit in a register, but you can account for that extra bit manually with a short fixup sequence of sub, shift, and add.

    What the paper points out is twofold:

    1. On 64-bit CPUs the extra sequence isn't required: you can just do a full 64x64-bit multiply and a shift of the 128-bit product. That's back to just a multiply and shift, like the original optimization, and already faster than the workaround.

    2. Even better, the shift itself can be optimized away. A 64x64-bit multiply stores its 128-bit product in a pair of 64-bit registers, so you get >> 64 essentially for free just by reading the upper register of the pair. That means if you pre-shift the constant to:

      c' = c << (64 - a)
    
    Now (x * c') >> 64 is equivalent to (x * c) >> a, and the result is just waiting for you in a 64-bit register, no shift needed.

    Just one multiply to divide a 32-bit integer by 7. Nice.

  • trendbuilder 20 minutes ago
    Compiler backends have handled this for decades via magic number multiplication, but the 32-bit-on-64-bit case is genuinely tricky — you often get "free" upper bits from the wider arithmetic that change which magic numbers are valid. LLVM and GCC still occasionally disagree on edge cases here, especially around divisors near powers of two where the shift correction term interacts with the zero-extension behavior.