Details

Reviewers

craig.topper
RKSimon
andreadb
efriedma
rengolin
t.p.northover
fhahn
greened
spatel

Commits

rL372628: [BreakFalseDeps] ignore function with minsize attribute
rG7414151929b6: [BreakFalseDeps] ignore function with minsize attribute

Summary

This came up in the x86-specific:
https://bugs.llvm.org/show_bug.cgi?id=43239
...but it seems like a general problem for the BreakFalseDeps pass. Dependencies are always broken by adding some other instruction, so that should be avoided if the overall goal is to minimize size.

Event Timeline

spatel created this revision. Sep 9 2019, 12:02 PM

Herald added a project: Restricted Project. Sep 9 2019, 12:02 PM

Herald added subscribers: hiraditya, mcrosier.

This would make it policy for -Oz builds to not bother to break dependencies, but -Os/-O0+ builds would still do so.

Does anything else use BreakFalseDeps?

llvm/lib/CodeGen/BreakFalseDeps.cpp
15–16	NFC

In D67363#1663974, @RKSimon wrote:

This would make it policy for -Oz builds to not bother to break dependencies, but -Os/-O0+ builds would still do so.

Correct - that matches my interpretation of the vague clang docs:

-Os Like -O2 with extra optimizations to reduce code size.
-Oz Like -Os (and thus -O2), but reduces code size further.

In my experience, -Oz ("minsize") is used to ensure that code fits in some constrained space, so any extras that can be eliminated should be eliminated. Perf is a secondary concern. This is different than -Os ("optsize") where we are trying to balance speed and size.

Does anything else use BreakFalseDeps?

x86 and ARM are the only targets in trunk that use it from what I can see. Both use it the same way: insert dummy (not required for correctness) instructions to avoid known uarch hazards.

Patch updated:
Pre-committed NFC code comment diffs in rL371516.

spatel marked an inline comment as done. Sep 10 2019, 6:03 AM

Adding some ARM folks since ARM is the only other in-tree target that would be affected by this change.
It's not strictly stated that dependency-breaking always means adding an instruction, but that’s the practical definition based on x86 and ARM's implementation of this pass's hooks.

For VEX instructions don’t we just use the other input register to break the dependency without adding an instruction?

RKSimon mentioned this in rL371525: [X86] Add AVX partial dependency tests as noted on D67363. Sep 10 2019, 7:28 AM

RKSimon mentioned this in rG937ca6815743: [X86] Add AVX partial dependency tests as noted on D67363.

In D67363#1664689, @craig.topper wrote:

For VEX instructions don’t we just use the other input register to break the dependency without adding an instruction?

That sounds right, but it's not controlled by this pass. That's part of memory op folding?
If I'm seeing it correctly, that's already more aggressively optimizing for size than what we're doing here:

// Avoid partial and undef register update stalls unless optimizing for size.
if (!MF.getFunction().hasOptSize() &&
    (hasPartialRegUpdate(MI.getOpcode(), Subtarget, /*ForLoadFold*/true) ||
     shouldPreventUndefRegUpdateMemFold(MF, MI)))
  return nullptr;
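To make the contrast between the two checks concrete, here is a minimal, self-contained sketch. It is a toy model, not LLVM code: `ToyFnAttrs`, `foldKeepsStallAvoidance` and `passMayAddInstruction` are hypothetical names, and the only thing borrowed from LLVM is the assumption that `hasOptSize()` is true for either size attribute while `hasMinSize()` is true only for `minsize`.

```cpp
// Hypothetical stand-in for a function's size attributes; not an LLVM type.
struct ToyFnAttrs {
  bool OptSize; // 'optsize' attribute (-Os)
  bool MinSize; // 'minsize' attribute (-Oz)

  bool hasMinSize() const { return MinSize; }
  // Assumption: modeled on llvm::Function::hasOptSize(), which is true for
  // either attribute.
  bool hasOptSize() const { return OptSize || MinSize; }
};

// Fold path quoted above: stall avoidance is dropped for -Os and -Oz alike.
bool foldKeepsStallAvoidance(const ToyFnAttrs &F) { return !F.hasOptSize(); }

// This patch: only -Oz gives up the dependency-breaking instruction.
bool passMayAddInstruction(const ToyFnAttrs &F) { return !F.hasMinSize(); }
```

Under this model an `optsize` function already loses stall avoidance on the fold path but would still get a dependency-breaking instruction from BreakFalseDeps, which is the sense in which the fold check is the more aggressive of the two.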

Patch updated:
Rebased with improved test coverage for ARM and x86:
rL371526
rL371525

spatel marked an inline comment as done. Sep 10 2019, 8:12 AM

spatel added inline comments.

llvm/test/CodeGen/X86/sqrt-partial.ll
41–42	The AVX form of sqrtsd doesn't have a partial reg update, so no difference here.

In D67363#1664746, @spatel wrote:
In D67363#1664689, @craig.topper wrote:

For VEX instructions don’t we just use the other input register to break the dependency without adding an instruction?

That sounds right, but it's not controlled by this pass. That's part of memory op folding?
If I'm seeing it correctly, that's already more aggressively optimizing for size than what we're doing here:
// Avoid partial and undef register update stalls unless optimizing for size.
if (!MF.getFunction().hasOptSize() &&
    (hasPartialRegUpdate(MI.getOpcode(), Subtarget, /*ForLoadFold*/true) ||
     shouldPreventUndefRegUpdateMemFold(MF, MI)))
  return nullptr;

We disable folding if it takes away our opportunity to use the other input register. But something still needs to pick a register for the implicit_def operand. I thought that was this pass? I think in the absence of this pass it may just be set to xmm0 by the register allocator. Do we have tests where xmm0 isn't the right choice?

I believe this test case compiled with avx needs this pass.

define double @minsize(double %x, double %y) minsize {
  %t6 = tail call fast double @llvm.sqrt.f64(double %y)
  %t = fadd fast double %t6, %y
  ret double %t6
}
declare double @llvm.sqrt.f64(double)

spatel mentioned this in rL371528: [x86] add a test for BreakFalseDeps; NFC. Sep 10 2019, 8:41 AM

spatel mentioned this in rG8812157b11ea: [x86] add a test for BreakFalseDeps; NFC.

spatel mentioned this in rL371551: [x86] add test for false dependency with AVX; NFC. Sep 10 2019, 1:05 PM

spatel mentioned this in rG4d2b4077e708: [x86] add test for false dependency with AVX; NFC.

In D67363#1664788, @craig.topper wrote:
I believe this test case compiled with avx needs this pass.

define double @minsize(double %x, double %y) minsize {
%t6 = tail call fast double @llvm.sqrt.f64(double %y)
%t= fadd fast double %t6, %y
ret double %t6
}
declare double @llvm.sqrt.f64(double)

Thanks! I didn't understand where that happened - it's in here (and no existing regression tests appear to capture that).
So yes, to limit this pass, we need to use a slightly finer check to distinguish between transforms that are only changing registers vs. adding instructions.
Added more tests with rL371528 and rL371551.

Patch updated:
Refine the places where minsize comes into play (where the target is called to add an instruction via TII->breakPartialRegDependency()).
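A rough sketch of the refined behaviour follows, as a toy model rather than the pass itself. `ToyInstr`, `ToyResult` and `runToyPass` are hypothetical names introduced only for illustration. The real pass decides what to do from register clearance heuristics and target hooks that are not modeled here. The point is only that register re-picking stays enabled under minsize while instruction insertion does not.

```cpp
#include <vector>

// Hypothetical, simplified instruction description; not an LLVM type.
struct ToyInstr {
  bool HasUndefRead;     // reads a register whose value does not matter
  bool WantsExtraBreak;  // would benefit from a dependency-breaking instruction
};

struct ToyResult {
  int RegsRepicked = 0;   // register-only changes (no size cost)
  int InstrsInserted = 0; // added instructions such as xorps (size cost)
};

ToyResult runToyPass(const std::vector<ToyInstr> &Block, bool MinSize) {
  ToyResult R;
  for (const ToyInstr &I : Block) {
    // Choosing a better register for an undef read adds no bytes, so it is
    // still done when minimizing size.
    if (I.HasUndefRead)
      ++R.RegsRepicked;
    // Asking the target to insert an instruction grows the code, so this
    // step is skipped under minsize.
    if (I.WantsExtraBreak && !MinSize)
      ++R.InstrsInserted;
  }
  return R;
}
```

In this model, a block with one instruction of each kind yields one re-picked register in both modes, but an inserted instruction only when `MinSize` is false.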

spatel marked an inline comment as done. Sep 10 2019, 1:31 PM

spatel added inline comments.

llvm/test/CodeGen/X86/sqrt-partial.ll
55–56	This is the test showing that the pass is still capable of changing a reg.

X86 looks good to me.

Ping - ARM ok?

efriedma added inline comments. Sep 19 2019, 4:15 PM

llvm/test/CodeGen/ARM/a15-partial-update.ll
59 ↗	(On Diff #219595)	There's a missed optimization here: we could avoid both the false dependence and the extra instruction by modifying the vld1 to a splat load. But generally looks fine.

Patch updated:
Added a TODO comment about using a splat load on the ARM tests.

Marking patch as accepted based on "looks good" and "looks fine". I'll hold off on committing for about a day in case there are any more comments.

This revision is now accepted and ready to land. Sep 22 2019, 8:43 AM

LGTM - maybe raise a bug about the missing splat loads in a15-partial-update.ll ?

In D67363#1678217, @RKSimon wrote:

LGTM - maybe raise a bug about the missing splat loads in a15-partial-update.ll ?

Sure - https://bugs.llvm.org/show_bug.cgi?id=43410

Closed by commit rG7414151929b6: [BreakFalseDeps] ignore function with minsize attribute (authored by spatel). Sep 23 2019, 10:01 AM

This revision was automatically updated to reflect the committed changes.

Diff 219531

llvm/lib/CodeGen/BreakFalseDeps.cpp

 //==- llvm/CodeGen/BreakFalseDeps.cpp - Break False Dependency Fix -- C++ -==//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
 // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 //
 //===----------------------------------------------------------------------===//
 //
 /// \file Break False Dependency pass.
 ///
 /// Some instructions have false dependencies which cause unnecessary stalls.
 /// For example, instructions may write part of a register and implicitly
 /// need to read the other parts of the register. This may cause unwanted
 /// stalls preventing otherwise unrelated instructions from executing in
 /// parallel in an out-of-order CPU.
 /// This pass is aimed at identifying and avoiding these dependencies.
 //
 //===----------------------------------------------------------------------===//

 #include "llvm/CodeGen/LivePhysRegs.h"
 #include "llvm/CodeGen/MachineFunctionPass.h"
 #include "llvm/CodeGen/ReachingDefAnalysis.h"
 #include "llvm/CodeGen/RegisterClassInfo.h"
 #include "llvm/CodeGen/MachineRegisterInfo.h"
[221 lines not shown]
     if (!MI.isDebugInstr())
       processDefs(&MI);
   }
   processUndefReads(MBB);
 }

 bool BreakFalseDeps::runOnMachineFunction(MachineFunction &mf) {
   if (skipFunction(mf.getFunction()))
     return false;

+  // This pass adds instructions to remove dependencies. That opposes the goal
+  // of minimizing size.
+  if (mf.getFunction().hasMinSize())
+    return false;
+
   MF = &mf;
   TII = MF->getSubtarget().getInstrInfo();
   TRI = MF->getSubtarget().getRegisterInfo();
   RDA = &getAnalysis<ReachingDefAnalysis>();

   RegClassInfo.runOnMachineFunction(mf);

   LLVM_DEBUG(dbgs() << "******** BREAK FALSE DEPENDENCIES ********\n");

   // Traverse the basic blocks.
   for (MachineBasicBlock &MBB : mf) {
     processBasicBlock(&MBB);
   }

   return false;
 }

llvm/test/CodeGen/X86/sqrt-partial.ll

[32 lines not shown]
 ; CHECK-NEXT: sqrtsd %xmm0, %xmm0
 ; CHECK-NEXT: retq
 ; CHECK-NEXT: .LBB1_2: # %call.sqrt
 ; CHECK-NEXT: jmp sqrt # TAILCALL
   %res = tail call double @sqrt(double %val)
   ret double %res
 }

 define double @minsize(double %x, double %y) minsize {
 ; CHECK-LABEL: minsize:
 ; CHECK: # %bb.0:
 ; CHECK-NEXT: mulsd %xmm0, %xmm0
 ; CHECK-NEXT: mulsd %xmm1, %xmm1
 ; CHECK-NEXT: addsd %xmm0, %xmm1
-; CHECK-NEXT: xorps %xmm0, %xmm0
 ; CHECK-NEXT: sqrtsd %xmm1, %xmm0
 ; CHECK-NEXT: retq
   %t3 = fmul fast double %x, %x
   %t4 = fmul fast double %y, %y
   %t5 = fadd fast double %t3, %t4
   %t6 = tail call fast double @llvm.sqrt.f64(double %t5)
   ret double %t6
 }

 declare float @sqrtf(float)
 declare double @sqrt(double)
 declare double @llvm.sqrt.f64(double)

llvm/test/CodeGen/X86/stack-folding-fp-sse42.ll

[577 lines not shown]

 define float @stack_fold_cvtsd2ss(double %a0) minsize {
 ; CHECK-LABEL: stack_fold_cvtsd2ss:
 ; CHECK: # %bb.0:
 ; CHECK-NEXT: movsd %xmm0, {{[-0-9]+}}(%r{{[sb]}}p) # 8-byte Spill
 ; CHECK-NEXT: #APP
 ; CHECK-NEXT: nop
 ; CHECK-NEXT: #NO_APP
-; CHECK-NEXT: xorps %xmm0, %xmm0
 ; CHECK-NEXT: cvtsd2ss {{[-0-9]+}}(%r{{[sb]}}p), %xmm0 # 8-byte Folded Reload
 ; CHECK-NEXT: retq
   %1 = tail call <2 x i64> asm sideeffect "nop", "=x,~{xmm1},~{xmm2},~{xmm3},~{xmm4},~{xmm5},~{xmm6},~{xmm7},~{xmm8},~{xmm9},~{xmm10},~{xmm11},~{xmm12},~{xmm13},~{xmm14},~{xmm15},~{flags}"()
   %2 = fptrunc double %a0 to float
   ret float %2
 }

 define <4 x float> @stack_fold_cvtsd2ss_int(<2 x double> %a0) optsize {
[375 lines not shown]

 define double @stack_fold_cvtss2sd(float %a0) minsize {
 ; CHECK-LABEL: stack_fold_cvtss2sd:
 ; CHECK: # %bb.0:
 ; CHECK-NEXT: movss %xmm0, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
 ; CHECK-NEXT: #APP
 ; CHECK-NEXT: nop
 ; CHECK-NEXT: #NO_APP
-; CHECK-NEXT: xorps %xmm0, %xmm0
 ; CHECK-NEXT: cvtss2sd {{[-0-9]+}}(%r{{[sb]}}p), %xmm0 # 4-byte Folded Reload
 ; CHECK-NEXT: retq
   %1 = tail call <2 x i64> asm sideeffect "nop", "=x,~{xmm1},~{xmm2},~{xmm3},~{xmm4},~{xmm5},~{xmm6},~{xmm7},~{xmm8},~{xmm9},~{xmm10},~{xmm11},~{xmm12},~{xmm13},~{xmm14},~{xmm15},~{flags}"()
   %2 = fpext float %a0 to double
   ret double %2
 }

 define <2 x double> @stack_fold_cvtss2sd_int(<4 x float> %a0) optsize {
[1,020 lines not shown]

 define float @stack_fold_roundss(float %a0) minsize {
 ; CHECK-LABEL: stack_fold_roundss:
 ; CHECK: # %bb.0:
 ; CHECK-NEXT: movss %xmm0, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
 ; CHECK-NEXT: #APP
 ; CHECK-NEXT: nop
 ; CHECK-NEXT: #NO_APP
-; CHECK-NEXT: xorps %xmm0, %xmm0
 ; CHECK-NEXT: roundss $9, {{[-0-9]+}}(%r{{[sb]}}p), %xmm0 # 4-byte Folded Reload
 ; CHECK-NEXT: retq
   %1 = tail call <2 x i64> asm sideeffect "nop", "=x,~{xmm1},~{xmm2},~{xmm3},~{xmm4},~{xmm5},~{xmm6},~{xmm7},~{xmm8},~{xmm9},~{xmm10},~{xmm11},~{xmm12},~{xmm13},~{xmm14},~{xmm15},~{flags}"()
   %2 = call float @llvm.floor.f32(float %a0)
   ret float %2
 }
 declare float @llvm.floor.f32(float) nounwind readnone

[152 lines not shown]

 define float @stack_fold_sqrtss(float %a0) minsize {
 ; CHECK-LABEL: stack_fold_sqrtss:
 ; CHECK: # %bb.0:
 ; CHECK-NEXT: movss %xmm0, {{[-0-9]+}}(%r{{[sb]}}p) # 4-byte Spill
 ; CHECK-NEXT: #APP
 ; CHECK-NEXT: nop
 ; CHECK-NEXT: #NO_APP
-; CHECK-NEXT: xorps %xmm0, %xmm0
 ; CHECK-NEXT: sqrtss {{[-0-9]+}}(%r{{[sb]}}p), %xmm0 # 4-byte Folded Reload
 ; CHECK-NEXT: retq
   %1 = tail call <2 x i64> asm sideeffect "nop", "=x,~{xmm1},~{xmm2},~{xmm3},~{xmm4},~{xmm5},~{xmm6},~{xmm7},~{xmm8},~{xmm9},~{xmm10},~{xmm11},~{xmm12},~{xmm13},~{xmm14},~{xmm15},~{flags}"()
   %2 = call float @llvm.sqrt.f32(float %a0)
   ret float %2
 }
 declare float @llvm.sqrt.f32(float) nounwind readnone

[307 lines not shown]

This is an archive of the discontinued LLVM Phabricator instance.

[BreakFalseDeps] ignore function with minsize attribute
ClosedPublic
