This is an archive of the discontinued LLVM Phabricator instance.

[amdgpu] Teach load widening to handle non-DWORD aligned loads.
ClosedPublic

Authored by hliao on May 20 2020, 11:46 PM.

Details

Summary
  • If a load is naturally aligned but not DWORD aligned, load widening should handle the case where the address is the sum of a base pointer and a constant offset. Such a load can be widened by aligning the address down to a DWORD boundary and extracting the narrow value from the high part of the wider load.
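  As a minimal illustration (a sketch with hypothetical function and value names, not taken from the patch; it assumes a little-endian target and a base pointer known to be DWORD aligned):

; Before widening: the i16 at base+6 is naturally aligned (2) but not DWORD aligned.
define i16 @narrow_load(i8 addrspace(4)* align 4 %base) {
  %p = getelementptr inbounds i8, i8 addrspace(4)* %base, i64 6
  %h = bitcast i8 addrspace(4)* %p to i16 addrspace(4)*
  %v = load i16, i16 addrspace(4)* %h, align 2
  ret i16 %v
}

; After widening: load the whole DWORD at base+4 (align 4) and extract the high half.
define i16 @widened_load(i8 addrspace(4)* align 4 %base) {
  %p = getelementptr inbounds i8, i8 addrspace(4)* %base, i64 4
  %w = bitcast i8 addrspace(4)* %p to i32 addrspace(4)*
  %d = load i32, i32 addrspace(4)* %w, align 4
  %s = lshr i32 %d, 16
  %v = trunc i32 %s to i16
  ret i16 %v
}

  The second form is what the widening produces conceptually; the patch ends up doing this rewrite late in codegen.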

Diff Detail

Event Timeline

hliao created this revision.May 20 2020, 11:46 PM
arsenm added inline comments.May 21 2020, 7:11 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7469 ↗(On Diff #265435)

Typo: "quaranteed"

7472–7475 ↗(On Diff #265435)

This will be true for all of the preloaded SGPRs. However I think looking for the argument copies here is the wrong approach. The original intrinsic calls should have been annotated with the align return value attribute. Currently this is a burden on the frontend/library call emitting the intrinsic. We could either annotate the intrinsic calls in one of the later passes (maybe AMDGPUCodeGenPrepare), or add a minimum alignment to the intrinsic definition which will always be applied, similar to how intrinsic declarations already get their other attributes
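
For illustration only, a hedged sketch (not necessarily the exact form D80422 or the frontend uses) of what an annotated call could look like in IR:

declare noalias i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr()

define i8 addrspace(4)* @get_dispatch_ptr() {
  ; Call-site annotation: the frontend/library (or a later pass such as
  ; AMDGPUCodeGenPrepare) adds an align return attribute, so the optimizer can
  ; propagate DWORD alignment to loads derived from this pointer.
  %p = call align 4 i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr()
  ; The alternative is to attach a minimum alignment to the intrinsic
  ; definition itself, so every call site inherits it automatically.
  ret i8 addrspace(4)* %p
}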

7509–7550 ↗(On Diff #265435)

We already have essentially the same code in lowerKernargMemParameter (plus an IR version in AMDGPULowerKernelArguments). This just generalizes to any isBaseWithConstantOffset. Can you refactor these to avoid the duplication?

7535 ↗(On Diff #265435)

Can you use one of the simpler getLoad overloads? We don't really ever need to mention unindexed

7537 ↗(On Diff #265435)

I think you could have a better alignment than 4 here

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.dispatch.ptr.ll
32

Should also have some tests with multiple loads. You should make sure that a load using the base address and a load using the offset address both CSE into a single load plus one shift.

Also could test with an explicit align attribute on the dispatch ptr call

hliao marked an inline comment as done.May 21 2020, 11:26 AM
hliao added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7472–7475 ↗(On Diff #265435)

Do we have that attribute available? Could you elaborate in detail?

hliao marked an inline comment as done.May 21 2020, 11:45 AM
hliao added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7472–7475 ↗(On Diff #265435)

OK, I found that.

hliao marked an inline comment as done.May 22 2020, 11:45 AM
hliao added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7472–7475 ↗(On Diff #265435)

In another review, D80422, the relevant intrinsics are being annotated with an assumed alignment on the return pointer. Unfortunately, in SelectionDAG, we don't have the facility to keep track of that hint. I will enhance the AMDGPUCodeGenPrepare pass to do something similar.

arsenm added inline comments.May 22 2020, 1:13 PM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7472–7475 ↗(On Diff #265435)

We don't need to explicitly know the pointer alignment. If the intrinsic was properly annotated from the beginning (as would be the case after D80422), the downstream load users would have the correct alignment assigned. The optimizer propagates alignment information already

hliao marked an inline comment as done.May 22 2020, 2:01 PM
hliao added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7472–7475 ↗(On Diff #265435)

That load won't be marked as DWORD aligned even after those intrinsics are annotated correctly, because it is simply not DWORD aligned. For example, the newly added test has an i16 load from the pointer (dispatch.ptr + 6), which is not DWORD aligned. The transformation performed here widens such a load if its pointer is computed as a DWORD-aligned pointer (base) plus a constant offset (off). With that,

(i16 load (add base, off))

is translated into

(trunc (lshr (i32 load (add base, off - 2)), 16))

We need to know that base is DWORD aligned. However, in SDNode, we don't have any facility to retain the original alignment from LLVM IR.

arsenm added inline comments.May 22 2020, 2:37 PM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7472–7475 ↗(On Diff #265435)

Is computeKnownBitsForTargetNode called for CopyFromReg? If so, we could move the argument check in there.

hliao marked an inline comment as done.May 22 2020, 6:08 PM
hliao added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7472–7475 ↗(On Diff #265435)

Unfortunately, no.

hliao updated this revision to Diff 265945.May 24 2020, 2:49 PM

Rewrite the transformation in LLVM IR as a late codegen preparation pass.

hliao marked an inline comment as done.May 24 2020, 2:49 PM
hliao added a comment.May 26 2020, 2:06 PM

Rewrite the transformation in LLVM IR as a late codegen preparation pass.

This change depends on D80422 to properly add the align attribute on the relevant intrinsics.

hliao updated this revision to Diff 266655.May 27 2020, 1:47 PM

Rebase onto the latest trunk.

I'd still like to find a way to avoid a whole extra pass run for this. In the test here, shouldn't the LoadStoreVectorizer have vectorized these? Why didn't it?

llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp
33–37

If a separate pass is going to do the load widening, you can remove it from AMDGPUCodeGenPrepare

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.dispatch.ptr.ll
59

Missing align?

hliao added a comment.May 28 2020, 8:45 AM

I'd still like to find a way to avoid a whole extra pass run for this. In the test here, shouldn't the LoadStoreVectorizer have vectorized these? Why didn't it?

That's due to the misaligned load after coalescing. This example is intentionally written to skip LSV and to verify that the common widened load can be CSEd within the DAG. In practice, there won't be such input, as the first i16 load should be properly annotated with align 4.

hliao marked 2 inline comments as done.May 28 2020, 8:47 AM
hliao added inline comments.
llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp
33–37

I could clean that up after this.

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.dispatch.ptr.ll
59

That align attribute will be added implicitly if there's no explicit override.

That's due to the misaligned load after coalescing. This example is intentionally written to skip LSV and to verify that the common widened load can be CSEd within the DAG. In practice, there won't be such input, as the first i16 load should be properly annotated with align 4.

But the load isn't misaligned. The point of adding the alignment to the intrinsic declaration was so that the whole optimization pipeline would know about the alignment. Taking this example:

after opt -instcombine:

define amdgpu_kernel void @test3(i32 addrspace(1)* nocapture %out) local_unnamed_addr #0 {
  %dispatch_ptr = tail call noalias i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr()
  %d1 = getelementptr inbounds i8, i8 addrspace(4)* %dispatch_ptr, i64 4
  %h1 = bitcast i8 addrspace(4)* %d1 to i16 addrspace(4)*
  %v1 = load i16, i16 addrspace(4)* %h1, align 4
  %e1 = zext i16 %v1 to i32
  %d2 = getelementptr inbounds i8, i8 addrspace(4)* %dispatch_ptr, i64 6
  %h2 = bitcast i8 addrspace(4)* %d2 to i16 addrspace(4)*
  %v2 = load i16, i16 addrspace(4)* %h2, align 2
  %e2 = zext i16 %v2 to i32
  %o = add nuw nsw i32 %e2, %e1
  store i32 %o, i32 addrspace(1)* %out, align 4
  ret void
}

Now the load's alignment was correctly inferred to be higher, so then the vectorizer handles this as expected:

opt -instcombine -load-store-vectorizer -instsimplify gives:

define amdgpu_kernel void @test3(i32 addrspace(1)* nocapture %out) local_unnamed_addr #0 {
  %dispatch_ptr = tail call noalias i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr()
  %d1 = getelementptr inbounds i8, i8 addrspace(4)* %dispatch_ptr, i64 4
  %h1 = bitcast i8 addrspace(4)* %d1 to i16 addrspace(4)*
  %1 = bitcast i16 addrspace(4)* %h1 to <2 x i16> addrspace(4)*
  %2 = load <2 x i16>, <2 x i16> addrspace(4)* %1, align 4
  %v11 = extractelement <2 x i16> %2, i32 0
  %v22 = extractelement <2 x i16> %2, i32 1
  %e1 = zext i16 %v11 to i32
  %e2 = zext i16 %v22 to i32
  %o = add nuw nsw i32 %e2, %e1
  store i32 %o, i32 addrspace(1)* %out, align 4
  ret void
}

So I think that with the alignment patch, any real-world example would do the right thing. Your example fails here because the IR wasn't pre-optimized. We shouldn't expect perfect optimization from the backend for arbitrary IR; we need to consider the entire pass pipeline.

But the load isn't misaligned. The point of adding the alignment to the intrinsic declaration was so that the whole optimization pipeline would know about the alignment. Taking this example:

But we run only llc in this test and skip all middle-end optimizations. That's why I said this test is impractical in the normal scenario, where the middle end is always run. This test was added to address your previous concern about CSE of widened loads. If it's not required, I could remove it.

But we run only llc in this test and skip all middle-end optimizations. That's why I said this test is impractical in the normal scenario, where the middle end is always run. This test was added to address your previous concern about CSE of widened loads. If it's not required, I could remove it.

This was more of a concern when letting the DAG handle it. Since you added the implied align attribute, I don't think anything else should be needed.

hliao updated this revision to Diff 267029.May 28 2020, 1:50 PM

Remove an impractical test.

Remove an impractical test.

I mean beyond the test, I think the whole patch is unnecessary now. Do you have an example of real source that still needs this?

hliao added a comment.EditedMay 28 2020, 8:01 PM

I mean beyond the test, I think the whole patch is unnecessary now. Do you have an example of real source that still needs this?

test2 is a real example. The widening done in this patch is not handled by any existing code in LLVM. The load, by itself, is only naturally aligned, not DWORD aligned, regardless of whether the alignment markings are correct.

I did some experiments locally and think this can stay in AMDGPUCodeGenPrepare, and doesn't need the split pass. Since you restrict this widening to the case where you're rebasing the load anyway, I don't think this will cause the same problems with the vectorizer the previous IR load widening had (and may help it even?)

test3 should also come back, but should have the explicit align 4 added to the load. This could also use some loads of i8, and <2 x i8>. We could also extend this to handle wider, sub-dword aligned types but that's a separate patch.
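
A rough sketch of the kind of extra cases being suggested (the function names, offsets, and align 4 call-site attribute below are assumptions for illustration, not the actual test file):

; An i8 load at a non-DWORD offset from the dispatch pointer.
define amdgpu_kernel void @widen_i8(i32 addrspace(1)* %out) {
  %dispatch.ptr = call align 4 i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr()
  %p = getelementptr inbounds i8, i8 addrspace(4)* %dispatch.ptr, i64 5
  %v = load i8, i8 addrspace(4)* %p, align 1
  %e = zext i8 %v to i32
  store i32 %e, i32 addrspace(1)* %out, align 4
  ret void
}

; A <2 x i8> load at a 2-byte aligned, non-DWORD offset.
define amdgpu_kernel void @widen_v2i8(i32 addrspace(1)* %out) {
  %dispatch.ptr = call align 4 i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr()
  %p = getelementptr inbounds i8, i8 addrspace(4)* %dispatch.ptr, i64 6
  %q = bitcast i8 addrspace(4)* %p to <2 x i8> addrspace(4)*
  %v = load <2 x i8>, <2 x i8> addrspace(4)* %q, align 2
  %c = bitcast <2 x i8> %v to i16
  %e = zext i16 %c to i32
  store i32 %e, i32 addrspace(1)* %out, align 4
  ret void
}

declare i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr()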

hliao added a comment.May 29 2020, 8:30 PM

I did some experiments locally and think this can stay in AMDGPUCodeGenPrepare, and doesn't need the split pass. Since you restrict this widening to the case where you're rebasing the load anyway, I don't think this will cause the same problems with the vectorizer the previous IR load widening had (and may help it even?)

test3 should also come back, but should have the explicit align 4 added to the load. This could also use some loads of i8, and <2 x i8>. We could also extend this to handle wider, sub-dword aligned types but that's a separate patch.

Scalar load widening should run after LSV; otherwise it generates redundant loads. Cases like a sequence of consecutive i16 loads benefit from this ordering. Here are the details.

For 4 loads of i16:

ld.i16 (ptr + 0)
ld.i16 (ptr + 2)
ld.i16 (ptr + 4)
ld.i16 (ptr + 6)

If we run scalar load widening before LSV, then after widening we have:

ld.i16 (ptr + 0)
ld.i32 (ptr + 0)
ld.i16 (ptr + 4)
ld.i32 (ptr + 4)

After LSV, we have:

ld.i16 (ptr + 0)
ld.i32x2 (ptr + 0)
ld.i16 (ptr + 4)

Those 2 i16 loads are redundant. If we run scalar load widening after LSV, we avoid that result.
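
For reference, a minimal IR sketch of that pattern (hypothetical, not the committed test): four consecutive, naturally aligned i16 loads off a DWORD-aligned pointer, which LSV can merge into a single <4 x i16> load before any widening is attempted:

define <4 x i16> @four_i16_loads(i16 addrspace(4)* align 4 %ptr) {
  ; Four consecutive, naturally aligned i16 loads at byte offsets 0, 2, 4, 6.
  %p1 = getelementptr inbounds i16, i16 addrspace(4)* %ptr, i64 1
  %p2 = getelementptr inbounds i16, i16 addrspace(4)* %ptr, i64 2
  %p3 = getelementptr inbounds i16, i16 addrspace(4)* %ptr, i64 3
  %v0 = load i16, i16 addrspace(4)* %ptr, align 4
  %v1 = load i16, i16 addrspace(4)* %p1, align 2
  %v2 = load i16, i16 addrspace(4)* %p2, align 4
  %v3 = load i16, i16 addrspace(4)* %p3, align 2
  ; LSV can merge these into one load <4 x i16>, align 4. Widening the
  ; odd-offset loads first would instead leave the redundant mix shown above.
  %r0 = insertelement <4 x i16> undef, i16 %v0, i32 0
  %r1 = insertelement <4 x i16> %r0, i16 %v1, i32 1
  %r2 = insertelement <4 x i16> %r1, i16 %v2, i32 2
  %r3 = insertelement <4 x i16> %r2, i16 %v3, i32 3
  ret <4 x i16> %r3
}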

hliao updated this revision to Diff 267904.Jun 2 2020, 9:18 AM

Add test case and comment on why we need to run scalar load widening after LSV.

arsenm added inline comments.Jun 2 2020, 11:15 AM
llvm/test/CodeGen/AMDGPU/vectorize-loads.ll
26 ↗(On Diff #267904)

s/widening/widened

32–33 ↗(On Diff #267904)

Function name needs to be better. This is not merely a v4i16 vectorization, there's the constant widening to consider

Is this still necessary?

rampitec added inline comments.Oct 8 2020, 2:18 PM
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
796

It can simply go to GCNPassConfig::addPreISel().

hliao updated this revision to Diff 299769.Oct 21 2020, 11:53 AM

Rebase and revise.

hliao marked an inline comment as done.Oct 21 2020, 12:03 PM

This patch is still required because an MMO's alignment is computed from its offset relative to the base alignment, and the base alignment comes from the pointer in the IR, so it cannot be modified. We need extra logic to re-align the MMO when we widen the original load. For instance, a 16-bit load from ptr has an alignment of 2; if ptr is equivalent to base - 2 and base's alignment is 4, we could widen that 16-bit load into a 32-bit load from ptr - 2 with alignment 4. But, as we cannot change the IR value referenced by the MMO, we need extra logic so that the new MMO can assume the new alignment.
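
To make that concrete, a hedged IR-level sketch (hypothetical names; it assumes, as the widening itself does, that the covering DWORD is dereferenceable). The explicit align 4 on the new load is the information that has to be carried over to the widened MMO, since it cannot be recomputed from %ptr, whose known alignment is only 2:

define i16 @widen_rebased(i8 addrspace(4)* align 4 %base) {
  ; %ptr = %base - 2 is only 2-byte aligned, so the original i16 load had align 2.
  %ptr = getelementptr i8, i8 addrspace(4)* %base, i64 -2
  ; Widened form: load the covering DWORD at %ptr - 2 (= %base - 4, DWORD aligned)
  ; and extract the high half. The align 4 must be stated explicitly on the new
  ; load (and its MMO); deriving alignment from %ptr would only give align 2.
  %wide.addr = getelementptr i8, i8 addrspace(4)* %ptr, i64 -2
  %wide.ptr = bitcast i8 addrspace(4)* %wide.addr to i32 addrspace(4)*
  %wide = load i32, i32 addrspace(4)* %wide.ptr, align 4
  %hi = lshr i32 %wide, 16
  %v = trunc i32 %hi to i16
  ret i16 %v
}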

LGTM in principle. We have wanted to split CodeGenPrepare for a long time already. We should also drop widening from the early pass then.

llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp
119

Please use "auto *" as tidy suggests.

147

Also "auto *".

167

Same here and in other places.

hliao updated this revision to Diff 301018.Oct 27 2020, 8:55 AM

Fix coding style following clang-tidy.

rampitec accepted this revision.Oct 27 2020, 9:39 AM

LGTM, provided you are planning to remove widening from the early pass.

This revision is now accepted and ready to land.Oct 27 2020, 9:39 AM
This revision was landed with ongoing or failed builds.Oct 27 2020, 11:08 AM
This revision was automatically updated to reflect the committed changes.
foad added a subscriber: foad.Nov 2 2020, 2:55 PM