This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Trim zero components from buffer and image stores
ClosedPublic

Authored by matejam on Mar 23 2023, 9:55 AM.

Details

Summary

For image and buffer stores the default behaviour on GFX11 and
older is to set all unset components to zero. So if we pass
only X component it will be the same as X000, or XY same as XY00.

This patch simplifies the passed vector of components in InstCombine
by removing zero components from the end.

For image stores it also trims DMask if necessary.

Diff Detail

Event Timeline

matejam created this revision.Mar 23 2023, 9:55 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 23 2023, 9:55 AM
matejam requested review of this revision.Mar 23 2023, 9:55 AM
arsenm added inline comments.Mar 24 2023, 7:56 AM
llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
360

Can you use possiblyDemandedEltsInMask or something else from vector utils? This doesn't feel like something you would need to invent yourself

362

This is an unchecked dyn_cast. You already did a dyn_cast at the call site, so just pass in Instruction to begin with?

365

SmallVector

366

Don't use std::map. Also, can this just be a SmallVector in the reverse direction per component?

386

Replace the opcode check with the dyn_cast

401–402

Replace the opcode check with the dyn_cast

1135

Don't bother using APInt here?

Thanks for the review!

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
360

I thought about using that, but this is a bit more than that.
It doesn't just find which elements were set, but also takes into consideration the DMask for image intrinsics and also ignores zeros that were inserted at the end of the vector, which is the primary goal of this patch.
There is also findDemandedEltsByAllUsers, but that works in a different direction, meaning that it will find the uses of the given Value * and update DemandedElts that way.

In my case I look at the definitions (insertelement, shufflevector or ConstantVector) of the given Value * and "recursively" go up and update DemandedElts accordingly.

365

Will be done.

366

I will look into that.

386

Will be done.

401–402

Will be done.

1135

I see I can do the same thing a lot more easier.
Like "if (DMaskVal > (1 << VWidth))"

matejam added inline comments.Mar 24 2023, 9:30 AM
llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
366

What about MapVector?
I'm not sure how would I use SmallVector to implement a map?

matejam updated this revision to Diff 508140.Mar 24 2023, 9:53 AM

Using SmallVector instead of std::vector, MapVector instead of std::map, eliminated some unnecessary dyn_casts and a few more small changes.

arsenm added inline comments.Mar 31 2023, 3:16 PM
llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
366

I meant it's the value to vector indexes? You could just go the other direction and find the value from the index

387

Unchecked dyn_cast, need a test with a variable vector index

matejam updated this revision to Diff 510767.Apr 4 2023, 6:14 AM

Use of SmallVector instead of VectorMap for tracking which components were already added.
Remove some unnecessary dyn_casts.

matejam updated this revision to Diff 512757.Apr 12 2023, 3:52 AM

Rebase and minor changes.

matejam updated this revision to Diff 517104.Apr 26 2023, 2:31 AM

Rebase.

@foad @arsenm would you please review this?
Thank you.

foad added inline comments.Apr 26 2023, 4:06 AM
llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
365–368

Maybe call them ComponentIndices and ComponentValues?

371

I wonder if the whole of this loop could be replaced by calling llvm::computeKnownBits with an appropriate DemandedElts mask to test each element from the last to the first.

1099

Use getArgOperand throughout for consistency?

1102

Why does this need to be an Instruction? Couldn't it be a Constant? E.g. if it was an all zeroes constant, that should be simplified.

1134

Should be >=

1270–1271

Keep the braces around this, because it is more than one physical line.

foad added a comment.Apr 26 2023, 4:08 AM

[AMDGPU] Default component broadcast store

There is no "broadcasting" here. Maybe call it something like "trim zero components from buffer and image stores"?

arsenm added inline comments.Apr 26 2023, 7:51 AM
llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
1096

Don't see why the element type would matter? They're a bit artificial

matejam added inline comments.Apr 27 2023, 6:59 AM
llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
371

I looked that up.
computeKnownBits works only for integers, pointers and vector of integers.

foad added inline comments.Apr 27 2023, 7:54 AM
llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
371

You might be able to use the brand new computeKnownFPClass instead, and check if the result is exactly fcPosZero?

matejam updated this revision to Diff 517563.Apr 27 2023, 8:05 AM
matejam marked 5 inline comments as done.
matejam marked an inline comment as not done.Apr 27 2023, 8:13 AM
matejam added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
371

Thanks.

matejam updated this revision to Diff 517925.Apr 28 2023, 7:58 AM

Changes in findDemandedElts, use computeKnownFPClass.

foad added a comment.Apr 28 2023, 9:18 AM

When I suggested using computeKnownFPClass I meant something like this: https://reviews.llvm.org/differential/diff/517952/
Further cleanups are possible.

matejam updated this revision to Diff 517963.Apr 28 2023, 9:49 AM

Thank you @foad.
findDemandedElts with correct usage of computeKnownFPClass.

matejam retitled this revision from [AMDGPU] Default component broadcast store to [AMDGPU] Trim zero components from buffer and image stores.Apr 28 2023, 9:53 AM
matejam edited the summary of this revision. (Show Details)
matejam updated this revision to Diff 519066.May 3 2023, 7:19 AM

Rebase and change in comments.

Please review this. @foad @arsenm

foad added inline comments.May 9 2023, 6:46 AM
llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
358

This function needs a better name now to explain what it does, and a comment, and should return APInt instead of taking an APInt& argument.

1110–1113

It would be simpler to use dyn_cast and break if it fails.

1119

dyn_cast should be isa since you don't use the result. But what is this actually testing for? Would it be better to test hasNamedOperand?

foad added inline comments.May 9 2023, 7:30 AM
llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
1119

I think these intrinsics all have immarg on the dmask argument, so it should always be a constant?

arsenm added inline comments.May 9 2023, 10:46 AM
llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
1119

Yes, should be able to just cast<ConstantInt>

Thanks for the review.

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
358

Is the name trimTrailingZerosInVector more explanatory?

1119

Both image and buffer intrinsics have the same body in this switch.
For buffer instructions the first operand is not a constant, that is why I used dyn_cast<ConstantInt>.
I could use isa<ConstantInt>.
I'm not sure how would I check for hasNamedOperand, because II is a call instruction.

matejam updated this revision to Diff 520957.May 10 2023, 4:36 AM

Change the name from findDemandedElts to trimTrailingZerosInVector.
Remove some unnecessary dyn_casts.
Refactor and rebase.

foad added inline comments.May 10 2023, 6:14 AM
llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
358

Sure.

1111

You don't use IIVTy for anything else so just test isa<FixedVectorType>(...) instead.

1119

You definitely should not rely on the second argument of the buffer intrinsics not being a constant! Instead you should test the opcode.

matejam updated this revision to Diff 520986.May 10 2023, 7:17 AM

Use Intrinsic opcode to know if the instructions has DMask instead of testing if the instruction has a ConstantInt as the second operand.
Add more run-lines to the test.

foad added a comment.May 12 2023, 4:13 AM

Code looks good now, thanks.

llvm/test/Transforms/InstCombine/AMDGPU/amdgcn-intrinsics.ll
4554

This looks like we have lost an optimization that converts the "mip" form to the non-"mip" form.

matejam updated this revision to Diff 522179.May 15 2023, 7:25 AM

Do the optimizations for image instructions that were done prior to this patch.

matejam updated this revision to Diff 522182.May 15 2023, 7:31 AM

Refactor.

foad added inline comments.May 15 2023, 7:32 AM
llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
1123–1126

Instead of duplicating this code, I think it would be neater to move it after the whole switch (...) {...} statement. Then everything should Just Work.

matejam updated this revision to Diff 522186.May 15 2023, 7:38 AM
matejam updated this revision to Diff 522187.May 15 2023, 7:41 AM

Move the default case out of the switch.

foad accepted this revision.May 15 2023, 7:52 AM

LGTM, thanks!

This revision is now accepted and ready to land.May 15 2023, 7:52 AM
This revision was landed with ongoing or failed builds.May 15 2023, 9:24 AM
This revision was automatically updated to reflect the committed changes.
foad added a comment.May 18 2023, 8:37 AM

matejam added a reverting change: rG9c8c31eea439: Revert "[AMDGPU] Trim zero components from buffer and image stores".

Have you got any more details about what was wrong with it?

matejam added a reverting change: rG9c8c31eea439: Revert "[AMDGPU] Trim zero components from buffer and image stores".

Have you got any more details about what was wrong with it?

No, we're having issues accessing amd network.
I don't know what tests exactly are failing and why.

foad added inline comments.May 24 2023, 12:41 PM
llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
366

I don't think you want to loop to go down to i = 0. If the 0'th element is zero then you will remove the whole store instruction, which would not be right. Can you add a test for that case please?

matejam updated this revision to Diff 526097.May 26 2023, 9:27 AM

Change condition in for loop, instead of i >= 0, put i > 0. We don't want to optimize out the 0th element.

matejam reopened this revision.May 26 2023, 10:16 AM

This patch was reverted on upstream, because of failed cts tests.

This revision is now accepted and ready to land.May 26 2023, 10:16 AM
matejam updated this revision to Diff 526110.May 26 2023, 10:16 AM

Add test case with all zero components.

matejam updated this revision to Diff 527849.Jun 2 2023, 6:54 AM
matejam added reviewers: nhaehnle, dstuttard.

Remove *_buffer_store instructions from being optimized.

matejam requested review of this revision.Jun 2 2023, 7:14 AM
arsenm accepted this revision.Jun 2 2023, 2:25 PM
This revision is now accepted and ready to land.Jun 2 2023, 2:25 PM
foad added inline comments.Jun 5 2023, 12:21 AM
llvm/test/Transforms/InstCombine/AMDGPU/amdgcn-simplify-image-buffer-stores.ll
12–16

Nit: could just use zeroinitializer.

foad added inline comments.Jun 8 2023, 5:12 AM
llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
375

Could also handle undef here and treat it as zero.

matejam closed this revision.Jun 16 2023, 2:01 AM
matejam marked 2 inline comments as done.