This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombine] DAGTypeLegalizer::GenWidenVectorLoads(): make use of dereferenceability knowledge
AbandonedPublic

Authored by lebedev.ri on Jul 21 2021, 7:47 AM.

Details

Summary

In @dot3_float3, while we know that the pointer has sufficient dereferenceable bytes,
the load itself is not aligned sufficiently, and that is the only number
that is currently used to determine legality of performing the wide load.

Diff Detail

Event Timeline

lebedev.ri created this revision.Jul 21 2021, 7:47 AM
lebedev.ri requested review of this revision.Jul 21 2021, 7:47 AM
pengfei added inline comments.Jul 21 2021, 8:05 AM
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
5292

Is this a trick that turning unaligned memory to aligned if the widen bits are dereferenceable?
The problem I can think is we may mistakenly generate movaps on an unaligned address which will crush in runtime.
Did I misunderstand something here? Because I saw the tests are using movups in fact.

lebedev.ri marked an inline comment as done.

Thanks for taking a look!

Make it more obvious that it's not really an alignment, but a count of known-dereferenceable bytes,
and fix another pessimization in process.

lebedev.ri added inline comments.Jul 21 2021, 8:26 AM
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
5292

Note that LdAlign is *not* used as the alignment for the new load.
It's confusingly named as such, and is only used by FindMemType
as a measure of how many bytes are known-dereferenceable.

Hopefully this is better.

foad added inline comments.Jul 21 2021, 8:58 AM
llvm/test/CodeGen/AMDGPU/copy-illegal-type.ll
165 ↗(On Diff #360472)

This looks alarming. It's loading 32 bytes instead of 8.

RKSimon added inline comments.Jul 21 2021, 9:09 AM
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
5350

do we have to reduce NumDereferenecableBytes in the loop?

lebedev.ri added inline comments.Jul 21 2021, 9:12 AM
llvm/test/CodeGen/AMDGPU/copy-illegal-type.ll
165 ↗(On Diff #360472)

Could you please be more specific, do you believe this is a correctness concern, or a performance one?
Because i do believe this is correct:

Legalizing node: t67: v10i32,ch = load<(dereferenceable invariant load (s320) from %ir.1, align 4, addrspace 4)> t0, t28, undef:i64
Analyzing result type: v10i32
Widen node result 0: t67: v10i32,ch = load<(dereferenceable invariant load (s320) from %ir.1, align 4, addrspace 4)> t0, t28, undef:i64

NumDereferenceableBytes 40
Creating new node: t86: v8i32,ch = load<(dereferenceable invariant load (s256) from %ir.1, align 4, addrspace 4)> t0, t28, undef:i64
Creating constant: t87: i64 = Constant<32>
Creating new node: t88: i64 = add nuw t28, Constant:i64<32>
Creating new node: t89: v8i32,ch = load<(dereferenceable invariant load (s256) from %ir.1 + 32, align 4, addrspace 4)> t0, t88, undef:i64
Creating new node: t90: v16i32 = concat_vectors t86, t89
Creating new node: t91: ch = TokenFactor t86:1, t89:1
*** IR Dump Before Module Verifier (verify) *** (function: test_copy_v4i8_x4)
; ModuleID = '/tmp/test.ll'
source_filename = "/tmp/test.ll"
target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7"

; Function Attrs: nounwind
define amdgpu_kernel void @test_copy_v4i8_x4(<4 x i8> addrspace(1)* %out0, <4 x i8> addrspace(1)* %out1, <4 x i8> addrspace(1)* %out2, <4 x i8> addrspace(1)* %out3, <4 x i8> addrspace(1)* %in) #0 {
  %test_copy_v4i8_x4.kernarg.segment = call nonnull align 16 dereferenceable(76) i8 addrspace(4)* @llvm.amdgcn.kernarg.segment.ptr()
  %out0.kernarg.offset = getelementptr inbounds i8, i8 addrspace(4)* %test_copy_v4i8_x4.kernarg.segment, i64 36
  %out0.kernarg.offset.cast = bitcast i8 addrspace(4)* %out0.kernarg.offset to <4 x i8> addrspace(1)* addrspace(4)*
  %1 = bitcast <4 x i8> addrspace(1)* addrspace(4)* %out0.kernarg.offset.cast to <5 x i64> addrspace(4)*, !amdgpu.uniform !0
  %2 = load <5 x i64>, <5 x i64> addrspace(4)* %1, align 4, !invariant.load !0
  %out0.load1 = extractelement <5 x i64> %2, i32 0
  %3 = inttoptr i64 %out0.load1 to <4 x i8> addrspace(1)*
  %out1.load2 = extractelement <5 x i64> %2, i32 1
  %4 = inttoptr i64 %out1.load2 to <4 x i8> addrspace(1)*
  %out2.load3 = extractelement <5 x i64> %2, i32 2
  %5 = inttoptr i64 %out2.load3 to <4 x i8> addrspace(1)*
  %out3.load4 = extractelement <5 x i64> %2, i32 3
  %6 = inttoptr i64 %out3.load4 to <4 x i8> addrspace(1)*
  %in.load5 = extractelement <5 x i64> %2, i32 4
  %7 = inttoptr i64 %in.load5 to <4 x i8> addrspace(1)*
  %out1.kernarg.offset = getelementptr inbounds i8, i8 addrspace(4)* %test_copy_v4i8_x4.kernarg.segment, i64 44
  %out1.kernarg.offset.cast = bitcast i8 addrspace(4)* %out1.kernarg.offset to <4 x i8> addrspace(1)* addrspace(4)*
  %out2.kernarg.offset = getelementptr inbounds i8, i8 addrspace(4)* %test_copy_v4i8_x4.kernarg.segment, i64 52
  %out2.kernarg.offset.cast = bitcast i8 addrspace(4)* %out2.kernarg.offset to <4 x i8> addrspace(1)* addrspace(4)*
  %out3.kernarg.offset = getelementptr inbounds i8, i8 addrspace(4)* %test_copy_v4i8_x4.kernarg.segment, i64 60
  %out3.kernarg.offset.cast = bitcast i8 addrspace(4)* %out3.kernarg.offset to <4 x i8> addrspace(1)* addrspace(4)*
  %in.kernarg.offset = getelementptr inbounds i8, i8 addrspace(4)* %test_copy_v4i8_x4.kernarg.segment, i64 68
  %in.kernarg.offset.cast = bitcast i8 addrspace(4)* %in.kernarg.offset to <4 x i8> addrspace(1)* addrspace(4)*
  %tid.x = call i32 @llvm.amdgcn.workitem.id.x(), !range !1
  %idxprom = sext i32 %tid.x to i64
  %gep = getelementptr <4 x i8>, <4 x i8> addrspace(1)* %7, i64 %idxprom
  %val = load <4 x i8>, <4 x i8> addrspace(1)* %gep, align 4
  store <4 x i8> %val, <4 x i8> addrspace(1)* %3, align 4
  store <4 x i8> %val, <4 x i8> addrspace(1)* %4, align 4
  store <4 x i8> %val, <4 x i8> addrspace(1)* %5, align 4
  store <4 x i8> %val, <4 x i8> addrspace(1)* %6, align 4
  ret void
}

; Function Attrs: nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workitem.id.x() #1

; Function Attrs: nounwind readnone speculatable willreturn
declare i32 @llvm.amdgcn.workitem.id.y() #1

; Function Attrs: nounwind readnone speculatable willreturn
declare align 4 i8 addrspace(4)* @llvm.amdgcn.kernarg.segment.ptr() #2

; Function Attrs: convergent nounwind willreturn
declare { i1, i64 } @llvm.amdgcn.if.i64(i1) #3

; Function Attrs: convergent nounwind willreturn
declare { i1, i64 } @llvm.amdgcn.else.i64.i64(i64) #3

; Function Attrs: convergent nounwind readnone willreturn
declare i64 @llvm.amdgcn.if.break.i64(i1, i64) #4

; Function Attrs: convergent nounwind willreturn
declare i1 @llvm.amdgcn.loop.i64(i64) #3

; Function Attrs: convergent nounwind willreturn
declare void @llvm.amdgcn.end.cf.i64(i64) #3

attributes #0 = { nounwind "amdgpu-memory-bound"="true" "amdgpu-wave-limiter"="true" "target-cpu"="tahiti" }
attributes #1 = { nounwind readnone speculatable willreturn "target-cpu"="tahiti" }
attributes #2 = { nounwind readnone speculatable willreturn }
attributes #3 = { convergent nounwind willreturn }
attributes #4 = { convergent nounwind readnone willreturn }

!0 = !{}
!1 = !{i32 0, i32 1024}
lebedev.ri marked an inline comment as done.

Don't forget to decrease the dereferenceable size in loop.

llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
5350

Oh, hmm, yes i do believe we do indeed, good spot!

foad added inline comments.Jul 21 2021, 9:25 AM
llvm/test/CodeGen/AMDGPU/copy-illegal-type.ll
165 ↗(On Diff #360472)

I meant it looks like an alarming performance regression. I have no reason to believe it's incorrect.

lebedev.ri added inline comments.Jul 21 2021, 10:49 AM
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
5354–5356

This really doesn't look right

llvm/test/CodeGen/AMDGPU/copy-illegal-type.ll
165 ↗(On Diff #360472)

I'm open to suggestions.

RKSimon added inline comments.Jul 21 2021, 11:38 AM
llvm/test/CodeGen/AMDGPU/copy-illegal-type.ll
165 ↗(On Diff #360472)

The problem you're going to hit is FindMemType searches for the biggest possible valid type - and the dereferencable change has relaxed that validity check even more - you're probably going to have to get FindMemType to continue searching the smaller types until it fails a second validity test?

efriedma added inline comments.
llvm/test/CodeGen/X86/widen_load-3.ll
118 ↗(On Diff #360497)

How are we proving this transform is legal?

lebedev.ri marked 4 inline comments as done.

Attempt to fix issues.
Unless i'm doing it wrong, this seems to have fixed regressions,
and maybe fixed a few existing miscompiles?

... and actually avoid regressions.
I think this is it.

lebedev.ri added inline comments.Jul 21 2021, 2:00 PM
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
5336

Ugh, actually, this is still not quite right, because it should be the alignment of the load we will produce now, not of the one we just produced.

5354–5356

Disregard, alignment tracking in SDAG is confusing.

Ok, now *this* should be it.

llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
5336

This is rather horrible :(

efriedma added inline comments.Jul 21 2021, 2:44 PM
llvm/test/CodeGen/ARM/vector-load.ll
257

We probably shouldn't be widening volatile loads like this. Not that this particular testcase reflects anything useful, but still.

And one more regression..

lebedev.ri marked an inline comment as done.Jul 21 2021, 2:49 PM
lebedev.ri added inline comments.
llvm/test/CodeGen/ARM/vector-load.ll
257

Indeed. Looks like we discovered this bug at the same time.

lebedev.ri marked an inline comment as done.Jul 21 2021, 2:50 PM

@foad thoughts on remaining AMDGPU changes?

Hi,
we talked a bit about this internally and found some potential problems.
There were some concerns that assuming that dereferenceable bytes >= alignment is incorrect. Memory protection may work at 4-byte or even byte granularity and objects may be over-aligned in memory. In these cases, loading padding bytes does not work.

Even if it can be proven that enough bytes are dereferenceable, we do not want to widen loads for amdgpu (at least not in cases where the load gets a lot larger, i.e. loading 64 Bytes instead of 32). Widening a load poses more restrictions on the register allocator, as a larger consecutive set of registers needs to be allocated. And, in the case that a widened load hits more cache lines than before, it also consumes more memory bandwidth.

Thanks for taking a look!

Hi,
we talked a bit about this internally and found some potential problems.

There were some concerns that assuming that dereferenceable bytes >= alignment is incorrect. Memory protection may work at 4-byte or even byte granularity and objects may be over-aligned in memory. In these cases, loading padding bytes does not work.

Even if it can be proven that enough bytes are dereferenceable,

This is FUD.

/// isDereferenceable - Return true if V is always dereferenceable for
/// Offset + Size byte.
bool MachinePointerInfo::isDereferenceable(<...>
/// Returns true if V is always dereferenceable for Size byte with alignment
/// greater or equal than requested. If the context instruction is specified
/// performs context-sensitive analysis and returns true if the pointer is
/// dereferenceable at the specified instruction.
bool isDereferenceableAndAlignedPointer(<...>
/// Check if executing a load of this pointer value cannot trap.
///
/// If DT and ScanFrom are specified this method performs context-sensitive
/// analysis and returns true if it is safe to load immediately before ScanFrom.
///
/// If it is not obviously safe to load from the specified pointer, we do
/// a quick local scan of the basic block containing \c ScanFrom, to determine
/// if the address is already accessed.
///
/// This uses the pointee type to determine how many bytes need to be safe to
/// load from the pointer.
bool llvm::isSafeToLoadUnconditionally(Value *V, Align Alignment, APInt &Size,
                                       const DataLayout &DL,
                                       Instruction *ScanFrom,
                                       const DominatorTree *DT,
                                       const TargetLibraryInfo *TLI) {
  // If DT is not specified we can't make context-sensitive query
  const Instruction* CtxI = DT ? ScanFrom : nullptr;
  if (isDereferenceableAndAlignedPointer(V, Alignment, Size, DL, CtxI, DT, TLI))
    return true;

We can not second-guess LLVM semantics.
If we can tell that we are always allowed to load N bytes without trapping, then we can do that.
If we actually can't, then the problem is elsewhere, we shouldn't have deduced that we can.

we do not want to widen loads for amdgpu (at least not in cases where the load gets a lot larger, i.e. loading 64 Bytes instead of 32). Widening a load poses more restrictions on the register allocator, as a larger consecutive set of registers needs to be allocated. And, in the case that a widened load hits more cache lines than before, it also consumes more memory bandwidth.

Ok, so we need a target hook, that's doable.

There were some concerns that assuming that dereferenceable bytes >= alignment is incorrect. Memory protection may work at 4-byte or even byte granularity and objects may be over-aligned in memory. In these cases, loading padding bytes does not work.

Even if it can be proven that enough bytes are dereferenceable,

This is FUD.

We can not second-guess LLVM semantics.
If we can tell that we are always allowed to load N bytes without trapping, then we can do that.
If we actually can't, then the problem is elsewhere, we shouldn't have deduced that we can.

I agree with that, I misunderstand the code then.

if (LD->isSimple()) {
  NumDereferenceableBytes = LD->getAlignment();
  if (!LdWidth.isScalable())
    NumDereferenceableBytes =
        std::max<unsigned>(NumDereferenceableBytes, LdWidth / 8);
  if (!WidenWidth.isScalable() && NumDereferenceableBytes < WidenWidth / 8 &&
      LD->getPointerInfo().isDereferenceable(
          WidenWidth / 8, *DAG.getContext(), DAG.getDataLayout()))
    NumDereferenceableBytes =
        std::max<unsigned>(NumDereferenceableBytes, WidenWidth / 8);
}

As far as I understand this part, NumDereferenceableBytes is set to the (guaranteed minimum) alignment of the load.
In some cases, where it’s ok, e.g. because the dereferenceable size is set, NumDereferenceableBytes is increased.

So, NumDereferenceableBytes is always at least the alignment?
isDereferenceable(LD->getAlignment()) is never checked as far as I see.

CC @nlopes @aqjune - is alive2 correct for LLVM IR semantics?

There were some concerns that assuming that dereferenceable bytes >= alignment is incorrect. Memory protection may work at 4-byte or even byte granularity and objects may be over-aligned in memory. In these cases, loading padding bytes does not work.

Even if it can be proven that enough bytes are dereferenceable,

This is FUD.

We can not second-guess LLVM semantics.
If we can tell that we are always allowed to load N bytes without trapping, then we can do that.
If we actually can't, then the problem is elsewhere, we shouldn't have deduced that we can.

I agree with that, I misunderstand the code then.

if (LD->isSimple()) {
  NumDereferenceableBytes = LD->getAlignment();
  if (!LdWidth.isScalable())
    NumDereferenceableBytes =
        std::max<unsigned>(NumDereferenceableBytes, LdWidth / 8);
  if (!WidenWidth.isScalable() && NumDereferenceableBytes < WidenWidth / 8 &&
      LD->getPointerInfo().isDereferenceable(
          WidenWidth / 8, *DAG.getContext(), DAG.getDataLayout()))
    NumDereferenceableBytes =
        std::max<unsigned>(NumDereferenceableBytes, WidenWidth / 8);
}

As far as I understand this part, NumDereferenceableBytes is set to the (guaranteed minimum) alignment of the load.

That is true, and please note that it is already what is happening regardless of this patch, as you can see from the LHS of the diff.
Looks like that was most originally added by rL94338.

Now that i look, indeed, while deref is the source of truth (https://alive2.llvm.org/ce/z/9TAAwj vs https://alive2.llvm.org/ce/z/bmKMm4),
as per the LLVM IR semantics as modelled by alive2, alignment does not guarantee dereferenceability: https://alive2.llvm.org/ce/z/-HxoLA
(CC @nlopes @aqjune)

In some cases, where it’s ok, e.g. because the dereferenceable size is set, NumDereferenceableBytes is increased.

So, NumDereferenceableBytes is always at least the alignment?
isDereferenceable(LD->getAlignment()) is never checked as far as I see.

Right, but that is so regardless of this patch.

@lebedev.ri Yep, Alive2's encoding assumes that dereferenceable's encoding has no relation with align value.
LangRef does not seem to impose that a pointer must be dereferenceable by at least its alignment.

Hmm, that's, interesting.
So how does the current existing transform motivate the legality of what it does i wonder?

I guess the motivation was that the assumption was valid for certain architectures (I'm not an expert in this area though)
But, I agree that such information is valuable not only in codegen but also for middle-end transformations (e.g. attributor).

Alignment implying lower granularity of "dereferenceable" isn't justified in LangRef anywhere. In practice, it works out on common targets because it's impossible to allocate less than a page of memory. (At least, in userspace...)

It's something that would be nice to preserve in the cases where it's justified, but maybe not worth the complexity of trying to track the relevant page size on every target.

foad added inline comments.Jul 29 2021, 2:29 AM
llvm/test/CodeGen/AMDGPU/kernel-args.ll
457–463

I think this diff is misleading. I've pushed an alternative version here, which is based on first autogenerating the checks in this file before applying D106447: https://github.com/jayfoad/llvm-project/commit/ed4a53da7c265675510caeb2c773f1b461354347

In this particular case (v5i64_arg for SI) the relevant part of the diff is:

-; SI-NEXT:    s_load_dwordx8 s[4:11], s[0:1], 0x19
-; SI-NEXT:    s_load_dwordx2 s[12:13], s[0:1], 0x9
-; SI-NEXT:    s_load_dwordx2 s[0:1], s[0:1], 0x21
-; SI-NEXT:    s_mov_b32 s15, 0xf000
-; SI-NEXT:    s_mov_b32 s14, -1
+; SI-NEXT:    s_load_dwordx16 s[4:19], s[0:1], 0x19
+; SI-NEXT:    s_load_dwordx2 s[0:1], s[0:1], 0x9
+; SI-NEXT:    s_mov_b32 s3, 0xf000
+; SI-NEXT:    s_mov_b32 s2, -1

I.e. the load from scaled offset 0x9 is untouched; the significant difference is that the dwordx8 and dwordx2 load from offsets 0x19 and 0x21 have been replaced by a single dwordx16 load from offset 0x19.

lebedev.ri added inline comments.Jul 29 2021, 2:35 AM
llvm/test/CodeGen/AMDGPU/kernel-args.ll
457–463

That's why nobody likes manually-written check lines..
Let me rebase this..

lebedev.ri added inline comments.Jul 29 2021, 2:46 AM
llvm/test/CodeGen/AMDGPU/kernel-args.ll
457–463

Actually, i see that you have not pushed https://github.com/llvm/llvm-project/commit/00b0f1ed7d5833695ba266854ee1d07467ddcf9c, so there isn't much point in rebasing.
Do you want to push that one?

foad added inline comments.Jul 29 2021, 3:01 AM
llvm/test/CodeGen/AMDGPU/kernel-args.ll
457–463

Pushed for review as D107052 because I had to make some non-trivial changes to the RUN lines.

Rebased after D107052 (thanks!)
I think the AMDGPU test change looks somewhat less scary now.
Thoughts?

foad added a comment.Aug 2 2021, 1:36 AM

Rebased after D107052 (thanks!)
I think the AMDGPU test change looks somewhat less scary now.
Thoughts?

I think we will still need a target hook or some other way of restricting this. In the v5i64_arg case I quoted earlier we're loading 16 dwords instead of 10. Yes the 16-dword load is legal, but that doesn't mean it's a good idea to use it. It increases register pressure (which is especially bad on GPUs) and may load from more cache lines.

lebedev.ri abandoned this revision.Jan 17 2022, 2:37 PM