This is an archive of the discontinued LLVM Phabricator instance.

[x86] promote all multiply i8 by constant to i32
ClosedPublic

Authored by spatel on Nov 21 2018, 8:35 AM.

Details

Summary

This is an alternative implementation of where D54770 would likely end up, so I'll abandon that if this is preferred.

We have these 2 "isDesirable" promotion hooks (I'm not sure why we need both of them, but that's independent of this patch), and we can adjust them to promote "mul i8 X, C" to i32. Then, all of our existing LEA and other multiply expansion magic happens as it would for i32 ops.

Some of the test diffs show that we could end up with an actual 32-bit mul instruction here because we choose not to expand to simpler ops. That instruction could be slower depending on the subtarget. On the plus side, this means we don't need a separate instruction to load the constant operand and possibly an extra instruction to move the result. If we need to tune mul i32 further, we could add a later transform that tries to shrink it back to i8 based on subtarget timing.

I did not bother to duplicate all of the 32-bit test file RUNs and target settings that exist to test whether LEA expansion is cheap or not. The diffs here assume a default target, so that means LEA is generally cheap.

Diff Detail

Repository
rL LLVM

Event Timeline

spatel created this revision.Nov 21 2018, 8:35 AM

This makes some sense to me. There is no 8-bit multiply with immediate instruction anyway, so we have to load the constant in a register. And the 8-bit multiply instruction has bad register allocation constraints. On Intel CPUs the 8-bit multiply with 16-bit result in AL/AH is the same latency/throughput as a 32-bit multiply with 32-bit result. The annoying thing is that we sometimes emit an explicit movzbl for the anyextend i8->i32 to do the promotion and prevent a partial register access. We explicitly don't do that for i16->i32 any extend due to the heavy promotion of i16 types.

lebedev.ri added inline comments.Nov 24 2018, 9:37 AM
lib/Target/X86/X86ISelLowering.cpp
41066–41077 ↗(On Diff #174934)

I'm trying to parse this and i'm failing.

The old code was:

  • If the type is not i16, then it is desirable. (so i8/i32/i64)
  • Else, see switch.

New code says:

  • If the type is not i16 (so i8/i32/i64) and either
    • the type is not i8 (so i32/i64) or
    • [the type is i8 and] the opcode is not ISD::MUL then it is desirable
  • Else, see switch.

Doesn't this not only mark i8 MUL's as undesirable, but also mark all other i8 as desirable?

Can this code please at least be uncondenced?

spatel marked an inline comment as done.Nov 24 2018, 9:59 AM
spatel added inline comments.
lib/Target/X86/X86ISelLowering.cpp
41066–41077 ↗(On Diff #174934)

Your description sounds right - all i8 ops besides mul are still desirable - and that's the intended logic. I think it's just a DeMorgan illusion that makes it confusing (and I probably got it wrong in my initial draft locally!). I'll rearrange it to try to avoid that problem.

spatel updated this revision to Diff 175163.Nov 24 2018, 10:58 AM

Patch updated:
After looking again, Roman's analysis was correct - the previous logic was bogus. But I'm not sure if we could expose the bug because of the interaction of these 2 hooks. Ie, the test diffs in this version of the patch are unchanged from before.

In any case, updated the code and comments to hopefully be clearer now.

Ok, i think now this looks good to me.

This revision is now accepted and ready to land.Nov 25 2018, 7:19 PM
This revision was automatically updated to reflect the committed changes.

Intel core CPUs from Sandy Bridge on always store bits 63:16 and bits 7:0 in the same physical register file entry. Only bits 15:8 of EAX/EBX/ECX/EDX can be separated due to a write to AH/BH/CH/DH. For most binary arithmetic operations one of the input register is also the output register. So its easy to pass the upper bits through without modifying them. So "add %al, %bl" reads all 64-bits of %rax and %rbx (ignoring that %AH and %BH could have been written separately) and leaves bits 63:8 of %rbx unmodified. Instructions that write only bits 7:0 or bits 15:8 of a register and don't also read part of the same register trigger a merge uop to be inserted. This would be instructions like a load into %al or %ax. I believe move immediate into %al or %ax doesn't have a separate merge uop. Its single uop just reads the whole destination register and merges the immediate into the lower bits. 16-bit popcnt/lzcnt/tzcnt also have a false dependency on the upper bits so the single uop can do the merge. MOVSX/MOVZX from 8-bit to 16-bit are similar so that the upper bits can be preserved. If bits 15:8 have been separated and an instruction is issued that needs bits 15:8 and any of the other bits then a merge is inserted to join them.