This is an archive of the discontinued LLVM Phabricator instance.

[x86] allow 8-bit adds to be promoted by convertToThreeAddress() to form LEA
ClosedPublic

Authored by spatel on Dec 9 2018, 7:14 AM.

Details

Summary

This extends the code that handles 16-bit add promotion to form LEA to also allow 8-bit adds. That allows us to combine add ops with register moves and save some instructions. This is another step towards allowing add truncation in generic DAGCombiner (see D54640).
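For illustration, here is a hypothetical example (the function name, register choices, and exact instruction sequences are illustrative and not taken from this patch's tests). An 8-bit add whose result must end up in a different register previously needed a register move plus a two-address add; an LEA over the 32-bit super-registers can produce the sum directly:

// Hypothetical C example, x86-64; the exact codegen depends on the surrounding code.
unsigned char add8(unsigned char a, unsigned char b) {
  // before: movl %edi, %eax ; addb %sil, %al
  // after:  leal (%rdi,%rsi), %eax   (only the low 8 bits of the result are used)
  return a + b;
}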

Diff Detail

Event Timeline

spatel created this revision. Dec 9 2018, 7:14 AM
RKSimon added inline comments. Dec 9 2018, 8:48 AM
lib/Target/X86/X86InstrInfo.cpp
864

ADD8rr/ADD16rr

lib/Target/X86/X86InstrInfo.h
592

Would it cause too much code bloat to remove this default bool value? Or remove the argument and drive it off the MIOpc?

bool Is16BitOp = !(MIOpc == X86::ADD8rr || MIOpc == X86::ADD8ri);
spatel updated this revision to Diff 177573. Dec 10 2018, 11:30 AM
spatel marked 3 inline comments as done.

Patch updated:
Determine op size based on the opcode parameter (no need to change the function signature).

spatel added inline comments. Dec 10 2018, 11:31 AM
lib/Target/X86/X86InstrInfo.h
592

Removing the argument looks better. I'll add an assert to make sure we actually have a 16-bit op here if it's not one of the 8-bit add opcodes.

Cheers - @andreadb @craig.topper do you have any comments?
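A minimal sketch of the shape described above (hypothetical code, not the actual diff; the real logic lives in X86InstrInfo.cpp and the opcode list here is illustrative rather than exhaustive):

bool Is8BitOp = (MIOpc == X86::ADD8rr || MIOpc == X86::ADD8ri);
// Anything else reaching this helper is expected to be a 16-bit add.
assert((Is8BitOp || MIOpc == X86::ADD16rr || MIOpc == X86::ADD16ri) &&
       "Unexpected opcode for narrow add -> LEA promotion");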

craig.topper added inline comments. Dec 10 2018, 12:06 PM
lib/Target/X86/X86InstrInfo.cpp
810

Is this ever called with a 32-bit subtarget? It looks like your new 8-bit cases are only reached in 64-bit mode, which is good, since leaOutReg's register class would be wrong otherwise.

spatel marked an inline comment as done. Dec 10 2018, 12:14 PM
spatel added inline comments.
lib/Target/X86/X86InstrInfo.cpp
810

No - it's always guarded by the Subtarget.is64Bit() check in the calls below, so I just copied that existing code for the 8-bit enhancement. That restriction seemed odd to me, but I wasn't sure how this code would break on 32-bit.

craig.topper added inline comments. Dec 10 2018, 3:50 PM
lib/Target/X86/X86InstrInfo.cpp
810

For the 8-bit case in 32-bit mode you would need to use GR32_ABCD as the leaOutReg register class. But GR32 would be fine for 16-bit in either mode.

The comment from when the 64-bit mode qualification was added for the 16-bit case mentioned partial register stalls. That was in 2009, so I'm not sure what CPUs it was considering. That wouldn't have been 32-bit mode specific, other than that 64-bit mode doesn't use AH/BH/CH/DH. Most of the 16-bit code has no code coverage today due to 16-bit op promotion in lowering. And the DisableLEA16 flag is always true; we should nuke it and all of the unreachable code.
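A hedged sketch of the register-class point above, in the style of the snippet earlier in this review (hypothetical fragment; the committed patch keeps the 64-bit-only guard rather than implementing the 32-bit case, and the names MF, Is8BitOp, and leaOutReg are illustrative):

const TargetRegisterClass *RC;
if (!Subtarget.is64Bit() && Is8BitOp)
  RC = &X86::GR32_ABCDRegClass; // only EAX/EBX/ECX/EDX have a low-8 subreg in 32-bit mode
else
  RC = &X86::GR32RegClass;      // fine for 16-bit in either mode, and for 8-bit in 64-bit mode
unsigned leaOutReg = MF.getRegInfo().createVirtualRegister(RC);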

andreadb added a comment.

This patch looks good to me.

However, I will let Craig give the final approval.

test/CodeGen/X86/popcnt.ll
37–42

As I wrote before, I think this patch looks good to me.

I just wanted to point out that the new codegen might lead to an increase in the number of merge opcodes on Sandybridge (see explanation below). That being said, I don't think it is something that we should be worrying about. Sorry in advance for the pedantic comment below...


On Sandybridge (according to Agner and the Intel docs), a partial write to AL triggers a merge opcode on a later read of AX/EAX/RAX. Basically, it is as if AL is renamed separately from RAX. We gain a small increase in ILP at the cost of introducing a merge opcode on a dirty read of the underlying 2/4/8-byte register.

AMD CPUs and modern Intel CPUs (from IvyBridge onwards) don't rename AL separately from RAX. That means a write to AL always merges into RAX, with a (false) dependency on RAX.

Quoting Agner: IvyBridge inserts an extra μop only in the case where a high 8-bit register (AH, BH, CH, DH) has been modified.

Intel CPUs (but not AMD) rename the high 8-bit registers. A write to AH allocates a distinct physical register. The advantage is that a write to AL can then proceed in parallel with a write to AH. The downside is that a read of AX/EAX/RAX triggers a merge uOp if AH is dirty. The latency of that merge uOp tends to be very small (though it varies from processor to processor).

With this patch, we sometimes trade a partial byte read (e.g. addb) for a full register read (through the LEA), which has the potential of triggering an extra merge uOp on Sandybridge.
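To make the concern concrete, here is a hedged, hypothetical x86-64 snippet using GNU inline asm (purely illustrative; it is not derived from popcnt.ll, and the function name and register pinning are invented). An earlier partial write leaves AL renamed separately, and the full-width read performed by the LEA is where Sandybridge would insert the merge uOp:

unsigned merge_demo(unsigned x, unsigned y) {
  unsigned long long a = x, c = y; // pinned to RAX and RCX by the constraints below
  unsigned r;
  asm("movb %%cl, %%al\n\t"    // earlier partial write: only AL is updated (AL is now "dirty")
      "leal (%%rax,%%rcx), %0" // full-width read of RAX: the Sandybridge merge point
      : "=r"(r), "+a"(a)
      : "c"(c));
  return r;
}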

craig.topper accepted this revision. Dec 11 2018, 7:21 AM

LGTM

I don't think Sandy Bridge and Ivy Bridge are different here. Sandy Bridge should only insert a merge when AH is dirty. I can't find anywhere in Intel's optimization manual that says they are different. The description in 3.5.2.4 describes the behavior from Sandy Bridge on.

This revision is now accepted and ready to land. Dec 11 2018, 7:21 AM
spatel updated this revision to Diff 177715. Dec 11 2018, 7:56 AM

Patch updated:
Rebased after cleanup in rL348845 and rL348851. This should be functionally equivalent, but with less cruft. There's a 'TODO' comment for the 32-bit target based on the comments here.

This revision was automatically updated to reflect the committed changes.