This is an archive of the discontinued LLVM Phabricator instance.

Differential D93397

[VectorCombine] loosen alignment constraint for load transform
ClosedPublic

Authored by spatel on Dec 16 2020, 7:25 AM.

Details

Reviewers

lebedev.ri
RKSimon

Commits

rGaaaf0ec72b06: [VectorCombine] loosen alignment constraint for load transform

Summary

As discussed in D93229, we only need a minimal alignment constraint when querying whether a hypothetical vector load is safe. We still pass/use the potentially stronger alignment attribute when checking costs and creating the new load.

There's already a test that changes with the minimum code change, so splitting this off as a preliminary proposal independent of any gep/offset enhancements.
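
The split between the safety query and the cost/creation steps can be sketched outside of LLVM. What follows is a simplified, self-contained C++ model, not the LLVM implementation: `PointerFacts`, `isSafeToLoad`, `canWidenOld`, and `canWidenNew` are hypothetical names, and the model assumes a safety query succeeds only when the pointer is known to cover enough bytes and to meet the requested alignment.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for what is known about a pointer; not an LLVM type.
struct PointerFacts {
  std::uint64_t DereferenceableBytes; // bytes known to be readable
  std::uint64_t KnownAlign;           // known lower bound on alignment
};

// Simplified model of a safety query (an assumption, not LLVM's logic):
// the region must be large enough and the pointer must be known to be at
// least RequiredAlign-aligned.
bool isSafeToLoad(const PointerFacts &P, std::uint64_t Bytes,
                  std::uint64_t RequiredAlign) {
  assert(RequiredAlign != 0 && (RequiredAlign & (RequiredAlign - 1)) == 0 &&
         "alignment must be a power of two");
  return P.DereferenceableBytes >= Bytes && P.KnownAlign >= RequiredAlign;
}

// Before the patch: the safety query reused the scalar load's alignment.
bool canWidenOld(const PointerFacts &P, std::uint64_t VecBytes,
                 std::uint64_t LoadAlign) {
  return isSafeToLoad(P, VecBytes, LoadAlign);
}

// After the patch: the safety query asks only for alignment 1, so it is
// effectively a question about the dereferenceable region.
bool canWidenNew(const PointerFacts &P, std::uint64_t VecBytes) {
  return isSafeToLoad(P, VecBytes, 1);
}
```

In this model, a pointer known to be `align 1 dereferenceable(16)` with a scalar load marked `align 4` fails `canWidenOld(P, 16, 4)` but passes `canWidenNew(P, 16)`, which mirrors the test that changes in this patch. The load's own alignment would still be the value handed to the cost query and to the new vector load.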

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

spatel created this revision. Dec 16 2020, 7:25 AM

Herald added subscribers: hiraditya, mcrosier. Dec 16 2020, 7:25 AM

spatel requested review of this revision. Dec 16 2020, 7:25 AM

Herald added a project: Restricted Project. Dec 16 2020, 7:25 AM

LGTM, thanks. This makes sense to me.
The float variant takes too long to prove, but i8 works fine:

----------------------------------------
define <4 x i8> @src(* dereferenceable(4) align(1) %p) {
%0:
  %s = load i8, * dereferenceable(4) align(1) %p, align 4
  %r = insertelement <4 x i8> undef, i8 %s, i32 0
  ret <4 x i8> %r
}
=>
define <4 x i8> @tgt(* dereferenceable(4) align(1) %p) {
%0:
  %1 = bitcast * dereferenceable(4) align(1) %p to *
  %2 = load <4 x i8>, * %1, align 4
  %r = shufflevector <4 x i8> %2, <4 x i8> undef, 0, 4294967295, 4294967295, 4294967295
  ret <4 x i8> %r
}
Transformation seems to be correct!
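
As a rough illustration of what the check above establishes, here is a small C++ analogue. It is a sketch under stated assumptions, not Alive2 or LLVM code, and `scalarLoadLane0` and `vectorLoadLane0` are hypothetical names. It assumes `P` points to at least 4 readable bytes, which is the role `dereferenceable(4)` plays in the IR. Lanes 1 to 3 are undef in the source function, so only lane 0 has to agree. Alignment is deliberately not modelled, because `std::memcpy` has no alignment requirement.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Source pattern: load one byte; it becomes lane 0 of the result.
std::uint8_t scalarLoadLane0(const std::uint8_t *P) {
  assert(P != nullptr);
  return P[0];
}

// Target pattern: load all four bytes at once, then keep lane 0.
// This is only valid because 4 bytes are assumed to be readable.
std::uint8_t vectorLoadLane0(const std::uint8_t *P) {
  assert(P != nullptr);
  std::array<std::uint8_t, 4> V;
  std::memcpy(V.data(), P, V.size());
  return V[0];
}
```

For any 4-byte buffer, both functions return the same value; the wider read is observable only if fewer than 4 bytes are readable, which the `dereferenceable(4)` precondition rules out.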

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
141–142	This probably warrants a comment why we pass this alignment (that we only care about dereferenceability)
llvm/test/Transforms/VectorCombine/X86/load.ll
407	They are not really in conflict, just the first one being overly pessimistic. Since we know what the alignment must be on all execution paths, the alignment specified on the argument could be enhanced by knowledge backpropagation: https://godbolt.org/z/dvqEEb

This revision is now accepted and ready to land. Dec 16 2020, 7:53 AM

spatel added inline comments. Dec 16 2020, 8:50 AM

llvm/test/Transforms/VectorCombine/X86/load.ll
407	Ah, thanks for the explanation. I was thinking of a gep case where the alignment could be over-specified, but that will be UB based on: http://llvm.org/docs/LangRef.html#load-instruction "Overestimating the alignment results in undefined behavior."

lebedev.ri added inline comments. Dec 16 2020, 9:22 AM

llvm/test/Transforms/VectorCombine/X86/load.ll
409	`align 1` here doesn't actually say that the alignment is/will be exactly `1`, it specifies the lower bound, it is perfectly fine for the actual pointer to be 1024-byte aligned instead.
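
The two alignment facts raised in these inline comments can be illustrated with a short C++ sketch. The helper names (`isAlignedTo`, `strengthenedAlign`) are hypothetical and this is not LLVM code. It assumes alignments are powers of two and models an `align N` attribute as a lower bound that any multiple of N satisfies.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// True if Addr is a multiple of Align. An `align N` attribute is a lower
// bound, so every address for which this holds satisfies the attribute.
bool isAlignedTo(std::uintptr_t Addr, std::uint64_t Align) {
  assert(Align != 0 && (Align & (Align - 1)) == 0 &&
         "alignment must be a power of two");
  return (Addr & (Align - 1)) == 0;
}

// Overestimating a load's alignment is undefined behavior, so a load marked
// `align LoadAlign` that executes on every path implies the pointer is at
// least that aligned. The known alignment is then the larger of the two.
std::uint64_t strengthenedAlign(std::uint64_t AttrAlign,
                                std::uint64_t LoadAlign) {
  return std::max(AttrAlign, LoadAlign);
}
```

A 1024-byte-aligned address satisfies both `align 1` and `align 4`, so the attribute and the load in the test do not contradict each other; combining them in this model gives a known alignment of 4.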

Closed by commit rGaaaf0ec72b06: [VectorCombine] loosen alignment constraint for load transform (authored by spatel). Dec 16 2020, 9:27 AM

This revision was automatically updated to reflect the committed changes.

spatel added a commit: rGaaaf0ec72b06: [VectorCombine] loosen alignment constraint for load transform.

spatel mentioned this in D93406: [VectorCombine] optimize alignment for load transform. Dec 16 2020, 10:08 AM

spatel mentioned this in rG38ebc1a13dc8: [VectorCombine] optimize alignment for load transform. Dec 16 2020, 12:26 PM

spatel mentioned this in D93229: [VectorCombine] allow peeking through GEPs when creating a vector load. Dec 16 2020, 1:29 PM

Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/VectorCombine.cpp	7 lines
llvm/test/Transforms/VectorCombine/X86/load.ll	8 lines

Diff 312239

llvm/lib/Transforms/Vectorize/VectorCombine.cpp

@@ bool VectorCombine::vectorizeLoadInsert(Instruction &I) {

   Type *ScalarTy = Scalar->getType();
   uint64_t ScalarSize = ScalarTy->getPrimitiveSizeInBits();
   unsigned MinVectorSize = TTI.getMinVectorRegisterBitWidth();
   if (!ScalarSize || !MinVectorSize || MinVectorSize % ScalarSize != 0)
     return false;

   // Check safety of replacing the scalar load with a larger vector load.
+  // We use minimal alignment (maximum flexibility) because we only care about
+  // the dereferenceable region. When calculating cost and creating a new op,
+  // we may use a larger value based on alignment attributes.
   unsigned MinVecNumElts = MinVectorSize / ScalarSize;
   auto *MinVecTy = VectorType::get(ScalarTy, MinVecNumElts, false);
-  Align Alignment = Load->getAlign();
-  if (!isSafeToLoadUnconditionally(SrcPtr, MinVecTy, Alignment, DL, Load, &DT))
+  if (!isSafeToLoadUnconditionally(SrcPtr, MinVecTy, Align(1), DL, Load, &DT))
     return false;

   // Original pattern: insertelt undef, load [free casts of] PtrOp, 0
+  Align Alignment = Load->getAlign();
   Type *LoadTy = Load->getType();
   int OldCost = TTI.getMemoryOpCost(Instruction::Load, LoadTy, Alignment, AS);
   APInt DemandedElts = APInt::getOneBitSet(MinVecNumElts, 0);
   OldCost += TTI.getScalarizationOverhead(MinVecTy, DemandedElts,
                                           /* Insert */ true, HasExtract);

   // New pattern: load VecPtr
   int NewCost = TTI.getMemoryOpCost(Instruction::Load, MinVecTy, Alignment, AS);

Inline comment on the new isSafeToLoadUnconditionally call:
lebedev.ri (Not Done): This probably warrants a comment why we pass this alignment (that we only care about dereferenceability)

llvm/test/Transforms/VectorCombine/X86/load.ll

 ; CHECK-NEXT: [[R:%.*]] = insertelement <4 x float> undef, float [[S]], i32 0
 ; CHECK-NEXT: ret <4 x float> [[R]]
 ;
   %s = load volatile float, float* %p, align 4
   %r = insertelement <4 x float> undef, float %s, i32 0
   ret <4 x float> %r
 }

-; Negative test? - pointer is not as aligned as load.
+; Pointer is not as aligned as load, but that's ok.
+; The new load uses the larger alignment value.

 define <4 x float> @load_f32_insert_v4f32_align(float* align 1 dereferenceable(16) %p) {
 ; CHECK-LABEL: @load_f32_insert_v4f32_align(
-; CHECK-NEXT: [[S:%.*]] = load float, float* [[P:%.*]], align 4
-; CHECK-NEXT: [[R:%.*]] = insertelement <4 x float> undef, float [[S]], i32 0
+; CHECK-NEXT: [[TMP1:%.*]] = bitcast float* [[P:%.*]] to <4 x float>*
+; CHECK-NEXT: [[TMP2:%.*]] = load <4 x float>, <4 x float>* [[TMP1]], align 4
+; CHECK-NEXT: [[R:%.*]] = shufflevector <4 x float> [[TMP2]], <4 x float> undef, <4 x i32> <i32 0, i32 undef, i32 undef, i32 undef>
 ; CHECK-NEXT: ret <4 x float> [[R]]
 ;
   %s = load float, float* %p, align 4
   %r = insertelement <4 x float> undef, float %s, i32 0
   ret <4 x float> %r
 }

 ; Negative test - not enough bytes.

Inline comments on the updated test comment:
lebedev.ri (Not Done): They are not really in conflict, just the first one being overly pessimistic. Since we know what the alignment must be on all execution paths, the alignment specified on the argument could be enhanced by knowledge backpropagation: https://godbolt.org/z/dvqEEb
spatel (Author, Done): Ah, thanks for the explanation. I was thinking of a gep case where the alignment could be over-specified, but that will be UB based on: http://llvm.org/docs/LangRef.html#load-instruction "Overestimating the alignment results in undefined behavior."

Inline comment on the define line of @load_f32_insert_v4f32_align:
lebedev.ri (Done): `align 1` here doesn't actually say that the alignment is/will be exactly `1`, it specifies the lower bound, it is perfectly fine for the actual pointer to be 1024-byte aligned instead.