This is an archive of the discontinued LLVM Phabricator instance.

../llvm/lib/Target/X86/X86PopcntOpt.cpp
14	WA?
68	Don't we break this dependency when `hasPOPCNT` is true but `hasAVX` is not? I'd expect us to do this whenever `hasPOPCNT` is true, regardless of `hasAVX`, because we might want to run our program on machines both older and newer than Sandy Bridge.

mkuper added a subscriber: mkuper. Feb 16 2016, 10:38 PM

mkuper added inline comments.

../llvm/lib/Target/X86/X86PopcntOpt.cpp
68	This isn't like the other false-dependency fixes we have, in the sense that this is not an arch issue in the instruction definition but a micro-arch bug. It doesn't exist in anything older than <arch-A>, and it is supposed to be fixed in <arch-B> and above. I think <arch-B> is Skylake, although I'm not entirely sure, and I don't know what <arch-A> is. In any case, the point is that we do want a condition more complicated than hasPOPCNT(); I just don't know what it is.

AsafBadouh added inline comments. Feb 17 2016, 12:33 AM

../llvm/lib/Target/X86/X86PopcntOpt.cpp
14	WA stands for workaround; I will change it.
68	As Michael explained, Sandy Bridge and later architectures have that dependency, and the hasAVX flag returns true for Sandy Bridge and later. Michael, what do you mean by "more complicated"?

Your current implementation would affect AMD Jaguar / Bulldozer families as well despite them not suffering from the dependency bug.

It looks like this might need to be hidden behind a feature bit (e.g. FeaturePOPCNTFalseDependency) instead of trying to decode the cpu from the target bits.

I should also add that there is a discussion on PR26183 about whether these types of fixes should all be put under the MachineCombiner pass.
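To make the objection concrete, here is a minimal standalone sketch, not LLVM code, that contrasts the patch's `hasAVX() && hasPOPCNT()` gate with a dedicated flag. The `CpuModel` struct and its `HasPopcntFalseDep` field are hypothetical stand-ins for a subtarget and the suggested feature bit.

```cpp
// Hypothetical, simplified CPU description; it is not the real X86Subtarget.
struct CpuModel {
  bool HasPOPCNT;
  bool HasAVX;
  bool HasPopcntFalseDep; // assumed dedicated flag, set only on affected cores
};

// Mirrors the gate in the patch: run the pass when both AVX and POPCNT exist.
bool patchPredicate(const CpuModel &Cpu) {
  return Cpu.HasAVX && Cpu.HasPOPCNT;
}

// Gate on the dedicated flag instead of inferring the CPU from ISA features.
bool featureBitPredicate(const CpuModel &Cpu) {
  return Cpu.HasPOPCNT && Cpu.HasPopcntFalseDep;
}
```

For a CPU that has AVX and POPCNT but not the bug (the Jaguar / Bulldozer case described above), `patchPredicate` returns true while `featureBitPredicate` returns false.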

In D17289#354614, @RKSimon wrote:

Your current implementation would affect AMD Jaguar / Bulldozer families as well despite them not suffering from the dependency bug.

Even if that wasn't the case, hijacking hasAVX() and hasPOPCNT() to predicate this is the wrong approach.

It looks like this might need to be hidden behind a feature bit (e.g. FeaturePOPCNTFalseDependency) instead of trying to decode the cpu from the target bits.

I should also add that there is a discussion on PR26183 about whether these types of fixes should all be put under the MachineCombiner pass.

This is a machine pass, so we shouldn't need to pollute the general CPU feature bits. We have access to the machine models. How about subclassing/adding a bit to MCSchedModel for only the affected CPUs? From its description, it sounds like the right place for this type of "feature":

/// The machine model directly provides basic information about the
/// microarchitecture to the scheduler in the form of properties. It also
/// optionally refers to scheduler resource tables and itinerary
/// tables. Scheduler resource tables model the latency and cost for each
/// instruction type.

This revision now requires changes to proceed. Feb 17 2016, 8:04 AM

Asaf, why did you choose to create a new pass for this rather than modify the existing ExecutionDependencyFixPass? The existing pass is serving exactly the same purpose, and all the conditions it checks (in ExeDepsFix::shouldBreakDependence) apply equally to this popcnt case.

Also, do we have code in the RA that attempts to bias the register assignment choices to assign the same register for the src & dst of popcnt and the other instructions affected by false dependences (e.g. cvtss2sd)? Where feasible, assigning the same register for the src & dst is the most inexpensive way to eliminate a false dependence.

tycho added a subscriber: tycho. May 1 2016, 9:54 AM

Sorry to gravedig, but this hasn't been updated in months. Has this been worked on at all recently?

Also, the false dependency microarchitecture bug continues to persist through Skylake, and there's no known fix on the horizon. Why not just xor the destination register unconditionally before any popcnt instruction, though? It shouldn't be a microarch-specific pass. From what I've seen, it's incredibly common for code to be built targeting a much older subtarget, but run on modern ones. That is, this pass would be skipped with -march=corei7, but -march=corei7 code can/will be run on newer hardware with the bug. From what I've seen on Nehalem and Westmere, there's no additional penalty for clearing the destination register before popcnt. So just do it always.

Also I just tested the current patch revision. There's an error in how it is checking whether the destination register is used as the input. The input of popcnt can be a memory location, and the destination register may be part of the address calculation:

200:   48 31 f6                xor    %rsi,%rsi          # clobbered rsi
203:   f3 49 0f b8 34 f4       popcnt (%r12,%rsi,8),%rsi # oops, used rsi
209:   48 01 de                add    %rbx,%rsi
20c:   8d 7a fd                lea    -0x3(%rdx),%edi
20f:   48 31 ff                xor    %rdi,%rdi          # clobbered rdi
212:   f3 49 0f b8 3c fc       popcnt (%r12,%rdi,8),%rdi # oops, used rdi
218:   48 01 f7                add    %rsi,%rdi
21b:   8d 72 fe                lea    -0x2(%rdx),%esi
21e:   48 31 f6                xor    %rsi,%rsi          # clobbered rsi
221:   f3 49 0f b8 34 f4       popcnt (%r12,%rsi,8),%rsi # oops, used rsi
227:   48 01 fe                add    %rdi,%rsi
22a:   8d 7a ff                lea    -0x1(%rdx),%edi
22d:   48 31 db                xor    %rbx,%rbx          # ok
230:   f3 49 0f b8 1c fc       popcnt (%r12,%rdi,8),%rbx # ok
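The listing above suggests that the check has to cover every register the instruction reads, including the base and index registers of a memory operand, and not only the operand at index 1. Below is a minimal standalone sketch of such a check. `SimpleInstr` is a hypothetical simplified model, not LLVM's `MachineInstr`, and it ignores sub-/super-register aliasing (for example `%edi` versus `%rdi`), which a real pass would also have to handle.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-in for a machine instruction. ReadRegs lists
// every register the instruction reads, including the base and index
// registers of a memory operand. Register number 0 means "no register".
struct SimpleInstr {
  unsigned DestReg;
  std::vector<unsigned> ReadRegs;
};

// Returns true when zeroing DestReg immediately before the instruction is
// safe, i.e. when no source operand or address component reads DestReg.
bool canZeroDestBefore(const SimpleInstr &MI) {
  assert(MI.DestReg != 0 && "destination must be a real register");
  for (unsigned Reg : MI.ReadRegs)
    if (Reg == MI.DestReg)
      return false;
  return true;
}
```

With this model, `popcnt (%r12,%rsi,8),%rsi` is described by `ReadRegs` containing both `%r12` and `%rsi`, so the function returns false and no xor is inserted; the last pair in the listing (`%rbx` as destination) passes the check.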

Thanks for the comments, Steven! We are still working on fixing the popcnt false dependence problem, but we are planning to abandon this patch.

Our plan is to first fix the register allocator to bias register assignment choices to hide false dependences. If there is a true dependence on the destination register, then there is no additional cost for the false dependence. So, for example, we will strive to generate

popcnt %rax, %rax
popcnt (%rcx), %rcx

rather than

xor %rdx, %rdx
popcnt %rax, %rdx
xor %rbx, %rbx
popcnt (%rcx), %rbx

The second step will be to enhance the ExecutionDependencyFix pass to support popcnt, which will require us to add support for integer instructions in general. You are right that unconditionally adding an xor (excluding the cases where it is not legal to do so) is better than doing nothing. But since we already have a pass that is designed exactly for the purpose of deciding when it is or isn't profitable to add the xor, we ought to use it.
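As an illustration of what "profitable" can mean here, the sketch below uses a clearance heuristic: break the dependency only when the destination register was written fewer than some number of instructions earlier in the same block. This is a standalone model under stated assumptions, not the ExecutionDependencyFix implementation; `ModelInstr`, `pickBreakPoints`, and `MinClearance` are hypothetical names, and a register with no earlier write in the block is assumed to be far enough away.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified instruction: the register it writes and whether
// the instruction has a false dependency on that register (as popcnt does).
struct ModelInstr {
  unsigned DestReg;
  bool HasFalseDep;
};

// Returns the indices of instructions in Block before which a
// dependency-breaking xor would be worthwhile under the clearance heuristic.
std::vector<std::size_t> pickBreakPoints(const std::vector<ModelInstr> &Block,
                                         std::size_t MinClearance) {
  std::unordered_map<unsigned, std::size_t> LastDef; // register -> last write
  std::vector<std::size_t> BreakPoints;
  for (std::size_t I = 0; I < Block.size(); ++I) {
    const ModelInstr &MI = Block[I];
    auto It = LastDef.find(MI.DestReg);
    // A recent write to DestReg means the false dependency can stall this
    // instruction, so breaking it is likely to pay off.
    if (MI.HasFalseDep && It != LastDef.end() && I - It->second < MinClearance)
      BreakPoints.push_back(I);
    LastDef[MI.DestReg] = I;
  }
  return BreakPoints;
}
```

The model leaves out the legality check (destination also used as a source) and anything that crosses basic-block boundaries.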

Note that the RA enhancements will also improve many of the instructions that are currently supported by the ExecutionDependencyFix pass, e.g.

cvtss2sd %xmm1, %xmm0
sqrtss %xmm2, %xmm3

In D17289#424618, @DavidKreitzer wrote:
Our plan is to first fix the register allocator to bias register assignment choices to hide false dependences. If there is a true dependence on the destination register, then there is no additional cost for the false dependence. So, for example, we will strive to generate
popcnt %rax, %rax
popcnt (%rcx), %rcx
rather than
xor %rdx, %rdx
popcnt %rax, %rdx
xor %rbx, %rbx
popcnt (%rcx), %rbx

My simple tests locally show that does resolve the issue. Nice idea.

Can you CC me on the review for that when it happens?

The second step will be to enhance the ExecutionDependencyFix pass to support popcnt, which will require us to add support for integer instructions in general. You are right that unconditionally adding an xor (excluding the cases where it is not legal to do so) is better than doing nothing. But since we already have a pass that is designed exactly for the purpose of deciding when it is or isn't profitable to add the xor, we ought to use it.

I agree, that sounds sensible.

Note that the RA enhancements will also improve many of the instructions that are currently supported by the ExecutionDependencyFix pass, e.g.
cvtss2sd %xmm1, %xmm0
sqrtss %xmm2, %xmm3

spatel resigned from this revision. May 26 2020, 5:35 AM

Herald added subscribers: hiraditya, mgorny. May 26 2020, 5:35 AM

Revision Contents

Path	Size
../llvm/lib/Target/X86/CMakeLists.txt (revision 260502)	1 line
../llvm/lib/Target/X86/X86.h (revision 260502)	4 lines
../llvm/lib/Target/X86/X86PopcntOpt.cpp (revision 0)	93 lines
../llvm/lib/Target/X86/X86TargetMachine.cpp (revision 260502)	5 lines
../llvm/test/CodeGen/X86/popcnt.ll (revision 260502)	43 lines

Diff 48062

../llvm/lib/Target/X86/CMakeLists.txt

set(sources
X86FloatingPoint.cpp		X86FloatingPoint.cpp
X86FrameLowering.cpp		X86FrameLowering.cpp
X86ISelDAGToDAG.cpp		X86ISelDAGToDAG.cpp
X86ISelLowering.cpp		X86ISelLowering.cpp
X86InstrInfo.cpp		X86InstrInfo.cpp
X86MCInstLower.cpp		X86MCInstLower.cpp
X86MachineFunctionInfo.cpp		X86MachineFunctionInfo.cpp
X86PadShortFunction.cpp		X86PadShortFunction.cpp
		X86PopcntOpt.cpp
X86RegisterInfo.cpp		X86RegisterInfo.cpp
X86SelectionDAGInfo.cpp		X86SelectionDAGInfo.cpp
X86ShuffleDecodeConstantPool.cpp		X86ShuffleDecodeConstantPool.cpp
X86Subtarget.cpp		X86Subtarget.cpp
X86TargetMachine.cpp		X86TargetMachine.cpp
X86TargetObjectFile.cpp		X86TargetObjectFile.cpp
X86TargetTransformInfo.cpp		X86TargetTransformInfo.cpp
X86VZeroUpper.cpp		X86VZeroUpper.cpp

../llvm/lib/Target/X86/X86.h

	/// references and pseudo instructions into floating-point stack references and			/// references and pseudo instructions into floating-point stack references and
	/// physical instructions.			/// physical instructions.
	FunctionPass *createX86FloatingPointStackifierPass();			FunctionPass *createX86FloatingPointStackifierPass();

	/// This pass inserts AVX vzeroupper instructions before each call to avoid			/// This pass inserts AVX vzeroupper instructions before each call to avoid
	/// transition penalty between functions encoded with AVX and SSE.			/// transition penalty between functions encoded with AVX and SSE.
	FunctionPass *createX86IssueVZeroUpperPass();			FunctionPass *createX86IssueVZeroUpperPass();

				/// Return a pass that inserts an xor before each popcnt to remove
				/// the false dependency on popcnt's destination register.
				FunctionPass *createX86PopcntOptPass();

	/// Return a pass that pads short functions with NOOPs.			/// Return a pass that pads short functions with NOOPs.
	/// This will prevent a stall when returning on the Atom.			/// This will prevent a stall when returning on the Atom.
	FunctionPass *createX86PadShortFunctions();			FunctionPass *createX86PadShortFunctions();

	/// Return a pass that selectively replaces certain instructions (like add,			/// Return a pass that selectively replaces certain instructions (like add,
	/// sub, inc, dec, some shifts, and some multiplies) by equivalent LEA			/// sub, inc, dec, some shifts, and some multiplies) by equivalent LEA
	/// instructions, in order to eliminate execution delays in some processors.			/// instructions, in order to eliminate execution delays in some processors.
	FunctionPass *createX86FixupLEAs();			FunctionPass *createX86FixupLEAs();

../llvm/lib/Target/X86/X86PopcntOpt.cpp

				//===-- X86PopcntOpt.cpp - ------------------------------------------------===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				// This file defines the pass which inserts x86 xor instructions
				// before calls to popcnt instruction.
				// Sandy/Ivy Bridge and Haswell processors have false dependency in popcnt
				// instruction on its destination register.
				// The WA is to insert xor before the popcnt so it will remove the dependency.
				//
				//===----------------------------------------------------------------------===//

				#include "X86.h"
				#include "X86Subtarget.h"
				#include "llvm/ADT/Statistic.h"
				#include "llvm/CodeGen/MachineFunctionPass.h"
				#include "llvm/CodeGen/MachineInstrBuilder.h"
				#include "llvm/CodeGen/MachineRegisterInfo.h"
				#include "llvm/Target/TargetInstrInfo.h"

				using namespace llvm;

				#define DEBUG_TYPE "x86-popcnt-opt"

				STATISTIC(NumXOR, "Number of XOR instructions inserted");

				namespace {
				class PopcntOptInserter : public MachineFunctionPass {
				public:
				PopcntOptInserter() : MachineFunctionPass(ID) {}
				bool runOnMachineFunction(MachineFunction &MF) override;
				const char *getPassName() const override { return "X86 Popcnt optimization"; }

				private:
				void insertXor(MachineBasicBlock::iterator I, MachineBasicBlock &MBB,
				unsigned Xor);
				bool EverMadeChange;
				const TargetInstrInfo *TII;
				static char ID;
				};
				char PopcntOptInserter::ID = 0;
				}

				FunctionPass *llvm::createX86PopcntOptPass() { return new PopcntOptInserter(); }

				void PopcntOptInserter::insertXor(MachineBasicBlock::iterator I,
				MachineBasicBlock &MBB, unsigned Xor) {
				DebugLoc dl = I->getDebugLoc();
				// in case srcReg == destReg, there is no need to insert xor
				if (I->getOperand(0).getReg() == I->getOperand(1).getReg())
				return;
				BuildMI(MBB, I, dl, TII->get(Xor), I->getOperand(0).getReg())
				.addReg(I->getOperand(0).getReg())
				.addReg(I->getOperand(0).getReg());
				++NumXOR;
				EverMadeChange = true;
				}

				/// runOnMachineFunction - Loop over all of the basic blocks, inserting
				/// xor instructions before popcnt.
				bool PopcntOptInserter::runOnMachineFunction(MachineFunction &MF) {
				const X86Subtarget &ST = MF.getSubtarget<X86Subtarget>();
				if (!ST.hasAVX() || !ST.hasPOPCNT())
				return false;
				TII = ST.getInstrInfo();
				EverMadeChange = false;
				for (MachineFunction::iterator I = MF.begin(), E = MF.end(); I != E; ++I) {
				for (MachineBasicBlock::iterator MBBI = I->begin(), MBBE = I->end();
				MBBI != MBBE;) {
				MachineInstr *MI = MBBI++;
				switch (MI->getOpcode()) {
				case X86::POPCNT16rr:
				case X86::POPCNT16rm:
				insertXor(MI, *I, X86::XOR16rr);
				break;
				case X86::POPCNT32rr:
				case X86::POPCNT32rm:
				insertXor(MI, *I, X86::XOR32rr);
				break;
				case X86::POPCNT64rr:
				case X86::POPCNT64rm:
				insertXor(MI, *I, X86::XOR64rr);
				break;
				}
				}
				}
				return EverMadeChange;
				}

../llvm/lib/Target/X86/X86TargetMachine.cpp

	}			}

	void X86PassConfig::addPreSched2() { addPass(createX86ExpandPseudoPass()); }			void X86PassConfig::addPreSched2() { addPass(createX86ExpandPseudoPass()); }

	void X86PassConfig::addPreEmitPass() {			void X86PassConfig::addPreEmitPass() {
	if (getOptLevel() != CodeGenOpt::None)			if (getOptLevel() != CodeGenOpt::None)
	addPass(createExecutionDependencyFixPass(&X86::VR128RegClass));			addPass(createExecutionDependencyFixPass(&X86::VR128RegClass));

				// the pass should be called post DCE pass
				// and post RA pass
				if (getOptLevel() != CodeGenOpt::None)
				addPass(createX86PopcntOptPass());

	if (UseVZeroUpper)			if (UseVZeroUpper)
	addPass(createX86IssueVZeroUpperPass());			addPass(createX86IssueVZeroUpperPass());

	if (getOptLevel() != CodeGenOpt::None) {			if (getOptLevel() != CodeGenOpt::None) {
	addPass(createX86PadShortFunctions());			addPass(createX86PadShortFunctions());
	addPass(createX86FixupLEAs());			addPass(createX86FixupLEAs());
	}			}
	}			}

../llvm/test/CodeGen/X86/popcnt.ll

	; RUN: llc -march=x86-64 -mattr=+popcnt < %s | FileCheck %s			; RUN: llc -march=x86-64 -mattr=+popcnt < %s | FileCheck %s
				; RUN: llc < %s -march=x86-64 -mcpu=corei7-avx | FileCheck --check-prefix=ALL %s
				; RUN: llc < %s -march=x86-64 -mcpu=core-avx2 | FileCheck --check-prefix=ALL %s
				; RUN: llc < %s -march=x86-64 -mcpu=haswell | FileCheck --check-prefix=ALL %s
				; RUN: llc < %s -march=x86-64 -mcpu=knl | FileCheck --check-prefix=ALL --check-prefix=POPCNT %s

	define i8 @cnt8(i8 %x) nounwind readnone {			define i8 @cnt8(i8 %x) nounwind readnone {
	%cnt = tail call i8 @llvm.ctpop.i8(i8 %x)			%cnt = tail call i8 @llvm.ctpop.i8(i8 %x)
	ret i8 %cnt			ret i8 %cnt
	; CHECK-LABEL: cnt8:			; CHECK-LABEL: cnt8:
				; CHECK-NOT: xorw
	; CHECK: popcntw			; CHECK: popcntw
	; CHECK: ret			; CHECK: ret

				; ALL-LABEL: cnt8:
				; ALL: popcntw
				; ALL: ret

	}			}

	define i16 @cnt16(i16 %x) nounwind readnone {			define i16 @cnt16(i16 %x) nounwind readnone {
	%cnt = tail call i16 @llvm.ctpop.i16(i16 %x)			%cnt = tail call i16 @llvm.ctpop.i16(i16 %x)
	ret i16 %cnt			ret i16 %cnt
	; CHECK-LABEL: cnt16:			; CHECK-LABEL: cnt16:
				; CHECK-NOT: xorw
	; CHECK: popcntw			; CHECK: popcntw
	; CHECK: ret			; CHECK: ret

				; ALL-LABEL: cnt16:
				; ALL: xorw
				; ALL-NEXT: popcntw
				; ALL: ret
	}			}

	define i32 @cnt32(i32 %x) nounwind readnone {			define i32 @cnt32(i32 %x) nounwind readnone {
	%cnt = tail call i32 @llvm.ctpop.i32(i32 %x)			%cnt = tail call i32 @llvm.ctpop.i32(i32 %x)
	ret i32 %cnt			ret i32 %cnt
	; CHECK-LABEL: cnt32:			; CHECK-LABEL: cnt32:
				; CHECK-NOT: xorl
	; CHECK: popcntl			; CHECK: popcntl
	; CHECK: ret			; CHECK: ret
				; ALL-LABEL: cnt32:
				; ALL: xorl
				; ALL-NEXT: popcntl
				; ALL: ret
	}			}

	define i64 @cnt64(i64 %x) nounwind readnone {			define i64 @cnt64(i64 %x) nounwind readnone {
	%cnt = tail call i64 @llvm.ctpop.i64(i64 %x)			%cnt = tail call i64 @llvm.ctpop.i64(i64 %x)
	ret i64 %cnt			ret i64 %cnt
	; CHECK-LABEL: cnt64:			; CHECK-LABEL: cnt64:
				; CHECK-NOT: xorq
	; CHECK: popcntq			; CHECK: popcntq
	; CHECK: ret			; CHECK: ret

				; ALL-LABEL: cnt64:
				; ALL: xorq
				; ALL-NEXT: popcntq
				; ALL: ret
	}			}

				define <16 x i32> @testv16i32(<16 x i32> %in) nounwind {
				; test case for destReg=srcReg
				; insert xor is illegal
				; POPCNT-LABEL: testv16i32:
				; POPCNT: # BB#0:
				; POPCNT-NEXT: vmovdqa32 (%rcx), %zmm0
				; POPCNT-NEXT: vextracti32x4 $3, %zmm0, %xmm1
				; POPCNT-NEXT: vpextrd $1, %xmm1, %eax
				; POPCNT-NEXT: popcntl %eax, %eax
				; POPCNT-NEXT: vmovd %xmm1, %ecx
				; POPCNT-NEXT: popcntl %ecx, %ecx
				; POPCNT-NEXT: vmovd %ecx, %xmm2
				%out = call <16 x i32> @llvm.ctpop.v16i32(<16 x i32> %in)
				ret <16 x i32> %out
				}
	declare i8 @llvm.ctpop.i8(i8) nounwind readnone			declare i8 @llvm.ctpop.i8(i8) nounwind readnone
	declare i16 @llvm.ctpop.i16(i16) nounwind readnone			declare i16 @llvm.ctpop.i16(i16) nounwind readnone
	declare i32 @llvm.ctpop.i32(i32) nounwind readnone			declare i32 @llvm.ctpop.i32(i32) nounwind readnone
	declare i64 @llvm.ctpop.i64(i64) nounwind readnone			declare i64 @llvm.ctpop.i64(i64) nounwind readnone
				declare <16 x i32> @llvm.ctpop.v16i32(<16 x i32>)

[X86] Fix False Data Dependency in popcnt (Needs Revision, Public)