This is an archive of the discontinued LLVM Phabricator instance.

PR 23155 - Improvement to X86 16 bit operation promotion for better performance.
AbandonedPublic

Authored by kbsmith1 on Apr 22 2015, 3:11 PM.

Details

Summary

This change improves the code for X86 16 bit operation promotion by checking more carefully for cases where promotion
shouldn't happen, thus allowing more cases to be promoted to 32 bits. This improves performance in some cases where
16 bit operations create false dependencies on the upper portions of the registers they write.

Diff Detail

Repository
rL LLVM

Event Timeline

kbsmith1 updated this revision to Diff 24261.Apr 22 2015, 3:11 PM
kbsmith1 retitled this revision from to PR 23155 - Improvement to X86 16 bit operation promotion for better performance..
kbsmith1 updated this object.
kbsmith1 edited the test plan for this revision. (Show Details)
kbsmith1 set the repository for this revision to rL LLVM.

Ping.

Sanjay, Simon, or Elena, would any or all of you be willing to review this, please?

Thank you,
Kevin Smith

spatel added a subscriber: spatel.

Hi Kevin -

Roping in some other potentially interested reviewers based on past activity.

I also added some comments to https://llvm.org/bugs/show_bug.cgi?id=23155 and linked some other partial reg update bugs.

We need some clarification on what the expected behavior is wrt partial reg updates and the various micro-architectures. E.g., I'm unable to reproduce all of your Haswell perf results locally... which seems to line up with Agner's advice, but then we definitely see a perf hit on bzip2 in https://llvm.org/bugs/show_bug.cgi?id=22473 ... but maybe there are different factors in play there and we're confusing the issues?

For ease of reference, here is the comment I added to 23155:

As with Agner's comments in 17113, I agree that the newer Intel architectures don't really suffer from partial register stalls in the sense that the Pentium Pro, Pentium 4, and older architectures did. As noted in 17113:

  • There is no penalty on Haswell for partial register access.
  • On Sandy Bridge, the cost is a single uop that gets automatically inserted at the cost of 1 cycle latency.
  • On Ivy Bridge there is no penalty except for the "high" byte subregs (AH, BH, etc.), in which case it behaves like Sandy Bridge.

However, whenever a partial register is the destination of an operation that doesn't otherwise need to read the register (as occurs with movw and movb), the instruction creates a read dependence on the upper portion of the register. If a movzbl or movzwl is used instead, the destination register is fully killed, eliminating this "false" dependence on the upper portion of the register. This issue affects both word and byte operations. It is worth noting, however, that this only really matters in relatively tight loops, where the false dependence arc becomes a loop-carried dependence that effectively keeps the out-of-order processor from overlapping multiple iterations of the loop.
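
To make the loop-carried case concrete, here is a minimal C++ sketch of the kind of tight loop being described. The function and its names are purely illustrative assumptions, not part of this patch; the comments just restate the movw/movzwl behavior described above, and the exact instructions any given compiler emits will depend on the target and optimization level.

    // Hedged sketch (not from this patch): a tight loop where a 16 bit load
    // feeds a loop-carried sum.
    #include <cstddef>
    #include <cstdint>

    uint32_t sum16(const uint16_t *p, size_t n) {
      uint32_t sum = 0;
      for (size_t i = 0; i != n; ++i) {
        // If the load of p[i] is emitted as a movw, only the low 16 bits of
        // the destination register are written, so the instruction also
        // depends on whatever last wrote the upper bits -- a false,
        // loop-carried dependence in a loop this tight.
        // If a movzwl is emitted instead, the full register is written
        // (killed) and that dependence disappears.
        sum += p[i];
      }
      return sum;
    }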

From Chandler's comments in 22473:
We need to add a pass that replaces movb (and movw) with movzbl (and movzwl) when the destination is a register and the high bytes aren't used. Then we need to benchmark bzip2 to ensure that this recovers all of the performance that forcing the use of cmpl did, and probably some other sanity benchmarking. Then we can swap the cmpl formation for the movzbl formation.

I am in agreement that this would be a good solution. If you, Chandler, and Eric all like that direction, I will be willing to work on that. I also have access to the SPEC benchmarks, both 2000 and 2006, so I can benchmark bzip2 specifically, since that is something the community considers important.

Kevin

ab added subscribers: Unknown Object (MLST), ab.May 7 2015, 3:36 PM

+llvm-commits

mkuper added a subscriber: mkuper.May 9 2015, 11:44 PM
chandlerc edited edge metadata.May 12 2015, 10:48 AM

From Chandler's comments in 22473:
We need to add a pass that replaces movb (and movw) with movzbl (and movzwl) when the destination is a register and the high bytes aren't used. Then we need to benchmark bzip2 to ensure that this recovers all of the performance that forcing the use of cmpl did, and probably some other sanity benchmarking. Then we can swap the cmpl formation for the movzbl formation.

I am in agreement that this would be a good solution. If you, Chandler, and Eric all like that direction, I will be willing to work on that. I also have access to the SPEC benchmarks, both 2000 and 2006, so I can benchmark bzip2 specifically, since that is something the community considers important.

I would be *very* interested in this, and would love it if you could work on it. I suspect you're in a much better position to implement, document, and evaluate the results. We really need to kill the 'cmpl' hack that is currently used.

Thanks for the support, Chandler. I am starting to work on this.

My initial thoughts are:

1 - A very late pass through the MachineInstrs that would be inserted as part of X86PassConfig::addPreEmitPass.

2 - Initially look for 8 bit and 16 bit operations that would be better expanded into 32 bit operations.

  • There could be several different reasons to do this:
    a - Specifically for the case in PR23155, where a false dependence potentially slows execution.
    b - In general, for cases where partial registers may cost something (Intel X86 prior to Haswell).
    c - Cases where code size could be saved by using an equivalent 32 bit instruction, such as 16 bit instructions that encode shorter as 32 bit instructions.
  We want to do this very late to allow memory operations to still be folded into the 16 and 8 bit operations, rather than relying on heuristics to predict this. (A rough sketch of what such a pass might look like follows.)
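
As a rough illustration of that direction, here is a hedged C++ skeleton of such a late MachineFunction pass. The pass name is an assumption of mine, and the actual opcode checks and liveness queries are left as placeholder comments; only the hook point, X86PassConfig::addPreEmitPass, is taken from the plan above.

    // Hedged skeleton only: the name X86WidenPartialRegOps is illustrative,
    // and the rewrite logic is deliberately stubbed out.
    #include "llvm/CodeGen/MachineBasicBlock.h"
    #include "llvm/CodeGen/MachineFunction.h"
    #include "llvm/CodeGen/MachineFunctionPass.h"
    #include "llvm/CodeGen/MachineInstr.h"

    using namespace llvm;

    namespace {
    struct X86WidenPartialRegOps : public MachineFunctionPass {
      static char ID;
      X86WidenPartialRegOps() : MachineFunctionPass(ID) {}

      bool runOnMachineFunction(MachineFunction &MF) override {
        bool Changed = false;
        for (MachineBasicBlock &MBB : MF)
          for (MachineInstr &MI : MBB) {
            // Placeholder: recognize an 8 or 16 bit move (movb/movw) whose
            // destination's upper bits are dead, and rewrite it to the
            // zero-extending 32 bit form (movzbl/movzwl).
            (void)MI;
          }
        return Changed;
      }
    };
    char X86WidenPartialRegOps::ID = 0;
    } // end anonymous namespace

    // The pass would be registered from X86PassConfig::addPreEmitPass(),
    // e.g. addPass(new X86WidenPartialRegOps()).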

If you have any comments or disagreements with that direction please let me know.

Kevin B. Smith

kbsmith1 abandoned this revision.Aug 26 2015, 10:31 AM

Abandoning this in favor of a later pass to fix these up.