This extends the code that handles 16-bit add promotion to form LEA to also allow 8-bit adds. That allows us to combine add ops with register moves and save some instructions. This is another step towards allowing add truncation in generic DAGCombiner (see D54640).
Patch updated:
- Determine op size based on the opcode parameter (no need to change the function signature).
lib/Target/X86/X86InstrInfo.h:598

Removing the argument looks better. I'll add an assert to make sure we actually have a 16-bit op here if it's not one of the 8-bit add opcodes.
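A minimal sketch of the opcode-driven approach discussed above. The enum and helper names here are illustrative stand-ins, not the actual LLVM code (the real code would switch over an `unsigned MIOpc` using the `X86::` opcode enum); the point is that the 8-bit/16-bit decision and the assert can both be driven off the opcode, so no extra boolean parameter is needed:

```cpp
#include <cassert>

// Illustrative stand-ins for the X86:: opcode enum values under discussion.
enum Opcode { ADD8rr, ADD8ri, INC16r, DEC16r, ADD16rr, ADD16ri };

// Hypothetical helper: classify the op as 8-bit from the opcode alone,
// rather than threading an Is8BitOp flag through the function signature.
static bool is8BitOp(Opcode MIOpc) {
  switch (MIOpc) {
  case ADD8rr:
  case ADD8ri:
    return true;
  default:
    return false;
  }
}

// The assert mentioned above: anything that is not one of the 8-bit add
// opcodes must be a 16-bit op.
static unsigned opSizeInBits(Opcode MIOpc) {
  if (is8BitOp(MIOpc))
    return 8;
  assert((MIOpc == INC16r || MIOpc == DEC16r || MIOpc == ADD16rr ||
          MIOpc == ADD16ri) &&
         "Unexpected opcode; expected a 16-bit op");
  return 16;
}
```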
lib/Target/X86/X86InstrInfo.cpp:816–817

Is this ever called with a 32-bit subtarget? It looks like your new 8-bit calls are all in 64-bit mode only, which is good, since leaOutReg's register class would be wrong otherwise.
lib/Target/X86/X86InstrInfo.cpp:816–817

No - it's always guarded by the Subtarget.is64Bit() check in the calls below here, so I just copied that existing code for the 8-bit enhancement. That seemed weird to me, but I wasn't sure how this code would break on 32-bit.
lib/Target/X86/X86InstrInfo.cpp:816–817

For the 8-bit case in 32-bit mode you would need to use GR32_ABCD as the leaOutReg register class, but GR32 would be fine for 16-bit in either mode.

The comment added when the 64-bit mode qualification was introduced for 16-bit mentioned partial register stalls. That was in 2009, so I'm not sure what CPUs it was considering; it wouldn't have been 32-bit mode specific, other than that 64-bit mode doesn't use AH/BH/CH/DH.

Most of the 16-bit code has no coverage today due to 16-bit op promotion in lowering, and the DisableLEA16 flag is always true; we should nuke it and all the unreachable code.
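The register-class constraint described above can be sketched as follows. This helper is purely illustrative (the class names match LLVM's X86 register classes, but the function itself is not code from the patch): in 32-bit mode only AL/BL/CL/DL are byte-addressable, so an 8-bit LEA result would have to live in GR32_ABCD, while 64-bit mode's REX prefixes expose SPL/BPL/SIL/DIL and let any GR32 register work.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of the leaOutReg register-class choice discussed
// above; returns the class name as a string for illustration only.
static std::string leaOutRegClass(bool Is8BitOp, bool Is64BitMode) {
  if (Is8BitOp && !Is64BitMode) {
    // 32-bit mode: only EAX/EBX/ECX/EDX have addressable low bytes.
    return "GR32_ABCD";
  }
  // 16-bit ops in either mode, and 8-bit ops in 64-bit mode (where REX
  // prefixes make every register's low byte addressable), can use GR32.
  return "GR32";
}
```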
This patch looks good to me.
However, I will let Craig give the final approval.
test/CodeGen/X86/popcnt.ll:37–42

As I wrote before, I think this patch looks good to me. I just wanted to point out that the new codegen might lead to an increase in the number of merge uops on Sandy Bridge (see the explanation below). That being said, I don't think it is something we should be worrying about. Sorry in advance for the pedantic comment below...

On Sandy Bridge (according to Agner's and Intel's docs), a partial write to AL triggers a merge uop on a later read of AX/EAX/RAX. Basically, it is as if AL is renamed separately from RAX: we trade a small increase in ILP for the cost of a merge uop on a dirty read of the underlying 2/4/8-byte register.

AMD CPUs and modern Intel CPUs (from Ivy Bridge onwards) don't rename AL separately from RAX. That means a write to AL always merges into RAX, with a (false) dependency on RAX. Quoting Agner: Ivy Bridge inserts an extra μop only in the case where a high 8-bit register (AH, BH, CH, DH) has been modified.

Intel CPUs (not AMD) rename the high 8-bit registers: a write to AH allocates a distinct physical register. The advantage is that a write to AL can now go in parallel with a write to AH. The downside is that a read of AX/EAX/RAX triggers a merge uop if AH is dirty. The latency of that merge uop tends to be very small (though it varies from processor to processor).

With this patch, we sometimes trade a partial byte read (e.g., addb) for a full register read (through LEA), which has the potential of triggering an extra merge uop on Sandy Bridge.
LGTM
I don't think Sandy Bridge and Ivy Bridge are different here. Sandy Bridge should only insert a merge when AH is dirty, and I can't find anywhere in Intel's optimization manual that says they differ; the description in 3.5.2.4 covers the behavior from Sandy Bridge onwards.
Would it cause too much code bloat to remove this default bool value? Or remove the argument and drive it off the MIOpc?