Page MenuHomePhabricator

[X86] Fix False Data Dependency in popcnt
Needs RevisionPublic

Authored by AsafBadouh on Feb 16 2016, 4:12 AM.

Details

Summary

popcnt have a false dependency on the destination register dest, the instruction will wait until dest is ready before executing.

more details in:
http://stackoverflow.com/questions/25078285/replacing-a-32-bit-loop-count-variable-with-64-bit-introduces-crazy-performance

Diff Detail

Repository
rL LLVM

Event Timeline

AsafBadouh updated this revision to Diff 48062.Feb 16 2016, 4:12 AM
AsafBadouh retitled this revision from to [X86] Fix False Data Dependency in popcnt.
AsafBadouh updated this object.
AsafBadouh set the repository for this revision to rL LLVM.
AsafBadouh added a subscriber: llvm-commits.
majnemer added inline comments.
../llvm/lib/Target/X86/X86PopcntOpt.cpp
14

WA?

68

We don't do this dependency breaking if we hasPOPCNT but not hasAVX ? I'd think we'd do this if hasPOPCNT is true regardless of what hasAVX is because we might want to run our program on machines older and newer than Sandy Bridge.

mkuper added a subscriber: mkuper.Feb 16 2016, 10:38 PM
mkuper added inline comments.
../llvm/lib/Target/X86/X86PopcntOpt.cpp
68

This isn't like the other false dependency fixes we have, in the sense that this not an arch issue in the instruction definition, but rather a micro-arch bug. It doesn't exist in anything older than <arch-A>, and is supposed to be fixed in <arch-B> and above.

I think <arch-B> is Skylake, although I'm not entirely sure. Don't know what <arch-A> is. In any case, the point is - we do want a condition that's more complicated than hasPOPCNT(), I just don't know what it is.

AsafBadouh added inline comments.Feb 17 2016, 12:33 AM
../llvm/lib/Target/X86/X86PopcntOpt.cpp
14

workaround, will change it.

68

As Michael explained, Sandy-bridge and later arch have that dependency.
hasAVX flag return true for Sandy-bridge and later.

Michael, what do you mean in "more complicated"?

Your current implementation would affect AMD Jaguar / Bulldozer families as well despite them not suffering from the dependency bug.

It looks like this might need to be hidden behind a feature bit (e.g. FeaturePOPCNTFalseDependency) instead of trying to decode the cpu from the target bits.

I should also add that there is a discussion on PR26183 about whether these types of fixes should all be put under the MachineCombiner pass.

spatel requested changes to this revision.Feb 17 2016, 8:04 AM
spatel added a reviewer: spatel.

Your current implementation would affect AMD Jaguar / Bulldozer families as well despite them not suffering from the dependency bug.

Even if that wasn't the case, hijacking hasAVX() and hasPOPCNT() to predicate this is the wrong approach.

It looks like this might need to be hidden behind a feature bit (e.g. FeaturePOPCNTFalseDependency) instead of trying to decode the cpu from the target bits.

I should also add that there is a discussion on PR26183 about whether these types of fixes should all be put under the MachineCombiner pass.

This is a machine pass, so we shouldn't need to pollute the general CPU feature bits. We have access to the machine models. How about subclassing/adding a bit to MCSchedModel for only the affected CPUs? From its description, it sounds like the right place for this type of "feature":

/// The machine model directly provides basic information about the
/// microarchitecture to the scheduler in the form of properties. It also
/// optionally refers to scheduler resource tables and itinerary
/// tables. Scheduler resource tables model the latency and cost for each
/// instruction type.
This revision now requires changes to proceed.Feb 17 2016, 8:04 AM
DavidKreitzer edited edge metadata.Feb 17 2016, 9:48 PM

Asaf, why did you choose to create a new pass for this rather than modify the existing ExecutionDependencyFixPass? The existing pass is serving exactly the same purpose, and all the conditions it checks (in ExeDepsFix::shouldBreakDependence) apply equally to this popcnt case.

Also, do we have code in the RA that attempts to bias the register assignment choices to assign the same register for the src & dst of popcnt and the other instructions affected by false dependences (e.g. cvtss2sd)? Where feasible, assigning the same register for the src & dst is the most inexpensive way to eliminate a false dependence.

tycho added a subscriber: tycho.May 1 2016, 9:54 AM
tycho added a comment.May 1 2016, 10:05 AM

Sorry to gravedig, but this hasn't been updated in months. Has this been worked on at all recently?

Also, the false dependency microarchitecture bug continues to persist through Skylake, and there's no known fix on the horizon. Why not just xor the destination register unconditionally before any popcnt instruction, though? It shouldn't be a microarch-specific pass. From what I've seen, it's incredibly common for code to be built targeting a much older subtarget, but run on modern ones. That is, this pass would be skipped with -march=corei7, but -march=corei7 code can/will be run on newer hardware with the bug. From what I've seen on Nehalem and Westmere, there's no additional penalty for clearing the destination register before popcnt. So just do it always.

tycho requested changes to this revision.May 1 2016, 10:47 AM
tycho added a reviewer: tycho.

Also I just tested the current patch revision. There's an error in how it is checking whether the destination register is used as the input. The input of popcnt can be a memory location, and the destination register may be part of the address calculation:

200:   48 31 f6                xor    %rsi,%rsi          # clobbered rsi
203:   f3 49 0f b8 34 f4       popcnt (%r12,%rsi,8),%rsi # oops, used rsi
209:   48 01 de                add    %rbx,%rsi
20c:   8d 7a fd                lea    -0x3(%rdx),%edi
20f:   48 31 ff                xor    %rdi,%rdi          # clobbered rdi
212:   f3 49 0f b8 3c fc       popcnt (%r12,%rdi,8),%rdi # oops, used rdi
218:   48 01 f7                add    %rsi,%rdi
21b:   8d 72 fe                lea    -0x2(%rdx),%esi
21e:   48 31 f6                xor    %rsi,%rsi          # clobbered rsi
221:   f3 49 0f b8 34 f4       popcnt (%r12,%rsi,8),%rsi # oops, used rsi
227:   48 01 fe                add    %rdi,%rsi
22a:   8d 7a ff                lea    -0x1(%rdx),%edi
22d:   48 31 db                xor    %rbx,%rbx          # ok
230:   f3 49 0f b8 1c fc       popcnt (%r12,%rdi,8),%rbx # ok

Thanks for the comments, Steven! We are still working on fixing the popcnt false dependence problem, but we are planning to abandon this patch.

Our plan is to first fix the register allocator to bias register assignment choices to hide false dependences. If there is a true dependence on the destination register, then there is no additional cost for the false dependence. So, for example, we will strive to generate

popcnt %rax, %rax
popcnt (%rcx), %rcx

rather than

xor %rdx, %rdx
popcnt %rax, %rdx
xor %rbx, %rbx
popcnt (%rcx), %rbx

The second step will be to enhance the ExecutionDependencyFix pass to support popcnt, which will require us to add support for integer instructions in general. You are right that unconditionally adding an xor (excluding the cases where it is not legal to do so) is better than doing nothing. But since we already have a pass that is designed exactly for the purpose of deciding when it is or isn't profitable to add the xor, we ought to use it.

Note that the RA enhancements will also improve many of the instructions that are currently supported by the ExecutionDependencyFix pass, e.g.

cvtss2sd %xmm1, %xmm0
sqrtss %xmm2, %xmm3
tycho added a comment.May 9 2016, 11:34 AM

Our plan is to first fix the register allocator to bias register assignment choices to hide false dependences. If there is a true dependence on the destination register, then there is no additional cost for the false dependence. So, for example, we will strive to generate

popcnt %rax, %rax
popcnt (%rcx), %rcx

rather than

xor %rdx, %rdx
popcnt %rax, %rdx
xor %rbx, %rbx
popcnt (%rcx), %rbx

My simple tests locally show that does resolve the issue. Nice idea.

Can you CC me on the review for that when it happens?

The second step will be to enhance the ExecutionDependencyFix pass to support popcnt, which will require us to add support for integer instructions in general. You are right that unconditionally adding an xor (excluding the cases where it is not legal to do so) is better than doing nothing. But since we already have a pass that is designed exactly for the purpose of deciding when it is or isn't profitable to add the xor, we ought to use it.

I agree, that sounds sensible.

Note that the RA enhancements will also improve many of the instructions that are currently supported by the ExecutionDependencyFix pass, e.g.

cvtss2sd %xmm1, %xmm0
sqrtss %xmm2, %xmm3