This is an archive of the discontinued LLVM Phabricator instance.

[GISel] Rework trunc/shl combine in a generic trunc/shift combine
ClosedPublic

Authored by Pierre-vh on Oct 20 2022, 12:36 AM.

Details

Summary

This combine only handled left shifts, but now it can handle right shifts as well. It handles right shifts conservatively and only truncates them to the size returned by TLI.

AMDGPU benefits from always lowering shifts to 32 bits for instance, but AArch64 would rather keep them at 64 bits.

Diff Detail

Event Timeline

Pierre-vh created this revision.Oct 20 2022, 12:36 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 20 2022, 12:36 AM
Pierre-vh requested review of this revision.Oct 20 2022, 12:36 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 20 2022, 12:36 AM
Pierre-vh updated this revision to Diff 469138.Oct 20 2022, 2:19 AM

Remove dead var

The generic combiner already has trunc(shl), this should go there along with it (preferably with fewer hardcoded sizes)

Pierre-vh updated this revision to Diff 469523.Oct 21 2022, 2:59 AM

Move to generic combine, rework

Pierre-vh retitled this revision from [AMDGPU][GISel] Add trunc/shr combine to [GISel] Rework trunc/shl combine in a generic trunc/shift combine.Oct 21 2022, 3:00 AM
Pierre-vh edited the summary of this revision. (Show Details)
foad added inline comments.Oct 21 2022, 5:27 AM
llvm/include/llvm/CodeGen/GlobalISel/CombinerHelper.h
410

Interesting - this seems like an abuse of getPreferredShiftAmountTy, but I guess it works out OK in practice for the one case we care about (converting 64-bit shifts to 32-bit shifts).

arsenm added inline comments.Oct 21 2022, 10:27 AM
llvm/include/llvm/CodeGen/GlobalISel/CombinerHelper.h
410

This is definitely not what getPreferredShiftAmountTy is for, which is only supposed to help produce already legal shift RHS operands

llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp
2271

This is what getPreferredShiftAmountTy is for

2285

This usage of getPreferredShiftAmountTy doesn't make sense, but you need something to pick an intermediate type here. Without adding anything, I guess you could bisect types until you found a legal shift? Creating the minimum required shift also "should" work out, but may create more legalizer and optimizer work. It would be nice if we could directly inspect the list of legal shift types, but the API doesn't have that.

Does just reusing the truncating type work as well?

Pierre-vh added inline comments.Oct 23 2022, 11:38 PM
llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp
2285

Reusing the truncating type doesn't work as well, and trying to guess a legal one doesn't either for AArch64 because 16 bits shifts seem legal there, but are suboptimal compared to 64 (word size) shifts.

My first version actually did a simplified bisecting to find a good type, but AArch64 suffered and that's why I didn't do it.

I think a new TLI hook would be best so targets can just choose whatever they prefer. I'll post a version with that and we can discuss there.

Pierre-vh marked 4 inline comments as done.Oct 24 2022, 12:50 AM
Pierre-vh added inline comments.
llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp
2285

Nevermind, I found that AArch64 "suffers" because the combine added in D109419 gets confused by the added trunc on the right shift operands. I'm looking into fixing it.

Pierre-vh updated this revision to Diff 470082.Oct 24 2022, 1:51 AM
  • Remove misuse of getPreferredShiftAmountTy
  • Only truncate right shifts to 32 bits for now. 16 bits doesn't seem to benefit any targets, I only observed regressions in our backend and no clear gains.
  • Don't truncate right shifts if the trunc has any store users to prevent blocking truncstore combine.
    • I looked into adding support for the patterns generated by this new combine to the truncstore combine, but it's a complex combine so it's not immediately clear what needs to change. I also didn't observe any regressions caused by adding this restriction so it's not high-priority IMO.

Not sure why you deleted the test

llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp
2262

Start with lowercase letter

2313–2316

I don't understand special casing this, I'd rather just ignore it

All combines are always on GISEL opcodes? There are no AMDGPU native combines?

arsenm added inline comments.Nov 16 2022, 9:18 AM
llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp
2313

Do this first then?

All combines are always on GISEL opcodes? There are no AMDGPU native combines?

Yes. You can also define target specific "generic" opcodes

You have define generic opcodes in the *amdgpu* namespace that represent AMDGPU native instructions?

You have define generic opcodes in the *amdgpu* namespace that represent AMDGPU native instructions?

Yes. They are regbankselectable and have the same restrictions as G_* opcodes

Pierre-vh updated this revision to Diff 476064.Nov 17 2022, 2:50 AM
Pierre-vh marked 3 inline comments as done.

Comments

Not sure why you deleted the test

Test cases are in the new file, not sure why it shows as deleted

llvm/lib/CodeGen/GlobalISel/CombinerHelper.cpp
2313–2316

I spent a while trying to get the truncstore combine to work properly in such cases but it's really not easy. The matching logic is a bit complex.

There's also a comment in that combine's matcher indicating it may be moved to a separate pass in the future to improve its matching logic, so I think it's better to just leave this special case in for now since:

  • Fixing the truncstore combine seems to be hard
  • This seems to be a niche case - this special case doesn't look like it causes any harm, on the contrary, it allows the combine to not interfere with truncstore.

The detailed reason is that the truncstore combine calculates the number of stores it should be merging together based on the type of the source it matches.

  • If we always ignore the trunc, it breaks the combine on AArch64 for 16 bit sources, as the source is always going to be a trunc of a w register. (See trunc_i16_to_i8 testcase).
    • It would always take the 32 bit value as the source, but in those cases the 16 bit source is the correct one.
  • Conditionally ignoring the trunc (e.g. only when the src was matched from a shift) can cause premature matching of the combine.
    • e.g. if this (right-shift trunc) combine kicks in on trunc_i64_to_i16, it causes the shift's source to be 32 bits. We then see that we need 2 or 4 stores (depending on which source we use). As the combiner works top-down within a block, we'll find 2 stores before finding 4 of them, and the matcher has to conservatively assume it won't find more. It'll end up merging 2 stores instead of 4.

TL;DR: I would rather leave the special case in for now as removing isn't worth the effort IMO (and it doesn't seem to improve codegen quality significantly either)

arsenm accepted this revision.Dec 7 2022, 8:33 AM
This revision is now accepted and ready to land.Dec 7 2022, 8:33 AM
Pierre-vh updated this revision to Diff 481177.Dec 8 2022, 12:41 AM

@arsenm I accidentally forgot to leave in the special case to not fold in the presence of G_STORE users. That's why some aarch64 test was failing, see my previous comments.
Is it still good to land even with the special-case left in?

arsenm added a comment.Dec 8 2022, 7:49 PM

@arsenm I accidentally forgot to leave in the special case to not fold in the presence of G_STORE users. That's why some aarch64 test was failing, see my previous comments.
Is it still good to land even with the special-case left in?

I forgot about that. Can you file an issue to pick it up later?

This revision was landed with ongoing or failed builds.Dec 9 2022, 1:46 AM
This revision was automatically updated to reflect the committed changes.
llvm/test/CodeGen/AMDGPU/GlobalISel/combine-trunc-shift.mir