This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add basic support for extended i8 perm matching

Authored by jrbyrnes on Jan 27 2023, 1:30 PM.



Implement a traversal algorithm to match trees to i8 v_perms. For ors that can be combined into perms, we expect to see a pattern that combines four 8-bit operands (actually 16-bit operands, with only 8 nonzero bits) into two 16-bit operands, then combines those two 16-bit operands via the or (after a zext and ext-shift). Trees that perform this type of combination form one of the two relevant classes of trees, and are matched in calculateByteProvider. The 8-bit operands used in such a tree are typically produced by an AND op or an SRL op, and are the leaves of the trees in calculateByteProvider. The other relevant class of trees maps a leaf of calculateByteProvider to an ultimate source; this class is matched in calculateSrcByte.

Through this recursive process, we track an Index (SrcIndex in calculateSrcByte), which is the byte of the current op that maps to the byte of the dest of the or we are currently mapping. For example, the 4th byte of the dest of (shl Src, 16) maps to the 2nd byte of Src. Through basic rules like this we can map src bytes to the dest byte of the or, and using this mapping we can create perm masks.

Much of the code for calculateByteProvider was borrowed from CodeGen/SelectionDAG/DAGCombiner.cpp (MatchLoadCombine). There are still many candidate trees that could be matched into perms which this patch does not attempt; those are saved for future iterations.

Depends on:

Diff Detail

Event Timeline

jrbyrnes created this revision.Jan 27 2023, 1:30 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 27 2023, 1:30 PM
jrbyrnes requested review of this revision.Jan 27 2023, 1:30 PM
arsenm added inline comments.Jan 27 2023, 2:32 PM

Can you keep this as a generic utility?


Needs to move to std::optional


Test needs to use opaque pointers

jrbyrnes updated this revision to Diff 493435.Jan 30 2023, 3:24 PM

Blacklist the or->perm combine for certain users of the or. Some ops (e.g. V_CVT_F32_UBYTE) operate in a bytewise manner. If the or has such a user, it is better to leave the DAG in an uncombined state, since we would otherwise need to byte-extract from the combined result.

jrbyrnes updated this revision to Diff 493458.Jan 30 2023, 5:20 PM
jrbyrnes marked 2 inline comments as done.

Resolve remaining regressions + Separate out ByteProvider + std::optional

jrbyrnes retitled this revision from [AMDGPU] WIP: Add basic support for extended i8 perm matching to [AMDGPU] Add basic support for extended i8 perm matching.Jan 31 2023, 2:36 PM
jrbyrnes edited the summary of this revision. (Show Details)
jrbyrnes added a reviewer: arsenm.

The ratio of test to code changes has me worried the test coverage is incomplete


By generic utility I mean in generic code and used by the load combine as well


6 is the one true recursion depth limit


Don't need a small vector, can just directly iterate an initializer list?


Is the multiple use case covered in the tests?






Don't understand why you need the const_cast. Also, why isn't this an SDValue to begin with? Assuming output 0 can be risky


Drop -opaque-pointers (also direction doesn't make sense with the test contents)


Need to use typed pointers, also should prefer global loads to flat

jrbyrnes marked an inline comment as done.Jan 31 2023, 4:29 PM
jrbyrnes added inline comments.

Hey Matt -- thanks for the comments. I think I don't fully understand this one -- I guess you didn't mean ? By generic, do you mean a templated base class (perhaps in ADT)?

arsenm added inline comments.Feb 2 2023, 6:26 PM

I didn't see that; I mean share code with the MatchLoadCombine that you mentioned. I didn't think about the details of that (something more abstract might be good, since the same thing will need to be ported for GlobalISel).

jrbyrnes updated this revision to Diff 494696.Feb 3 2023, 11:28 AM
jrbyrnes marked 9 inline comments as done.

Rebase + review comments.

The ratio of test to code changes has me worried the test coverage is incomplete

This patch is intended simply to bring in the components necessary for i8 perm matching. Fitting / tuning it to be optimally useful in actual workloads is left to a future iteration. As such, the heuristics / conditions we use to apply the combine are very restrictive (e.g. no multi-use operands in the or, the IsCombineVectorized heuristic, no support for 16-bit ors, etc.). For this iteration, my primary concerns for testing were true positives (i.e. testing accurate production of v_perm when we expect it) and false positives (correctness errors / inefficient codegen). False negatives (missed opportunities) are left to a future iteration.

True positive coverage:
There are 4096 4xi8 shuffle_vector variations. I tested and validated all permutations. The initial tests included cover all trees for these permutations.
There are 8192 4xi8 shuffle_vector variations where one operand is undef. This iteration doesn't fully support these. Of these, about ~2k are lowered to v_perm by this iteration. I validated all of these.

False positive coverage:
lit tests
CK correctness tests
epsdb (to be run)

passes psdb

jrbyrnes updated this revision to Diff 496236.Feb 9 2023, 2:14 PM

Add tests for gating conditions on attempting to v_perm combine + nits

Adding reviewers due to size of diff

jrbyrnes updated this revision to Diff 499591.Feb 22 2023, 11:08 AM

Add opaque tests

arsenm added inline comments.Feb 22 2023, 11:17 AM

Opaque pointer tests are not additional; the existing tests just need to be converted. There are 0 remaining typed-pointer AMDGPU tests.

jrbyrnes updated this revision to Diff 499615.Feb 22 2023, 11:58 AM
jrbyrnes marked an inline comment as done.

Convert test to opaque pointers

foad added inline comments.Feb 23 2023, 2:26 AM
Line 42 (On Diff #499615)

These kinds of changes look like regressions for some combination of code size / latency / SGPR pressure.

arsenm added inline comments.Feb 23 2023, 6:53 AM
Line 64 (On Diff #499615)

Yes, this is worse. Should avoid cases that can use v_lshl_or_b32

jrbyrnes updated this revision to Diff 499971.Feb 23 2023, 1:47 PM

Check that we are actually doing 8 bit extraction before lowering into v_perm.

We can determine (based on the potential perm mask and operands) if we need to insert any 8 bit extraction code.

For example, a perm mask of 0x05040100 suggests we will not need to extract any bits from the operands iff they have only 16 bits of data (e.g. a zext'd 16-bit load into 32 bits). In this case, we assume CodeGen will lower it well and do not combine into v_perm. If, however, the operands are 32-bit, then we will need to insert mask code, so we do lower to v_perm.

As another example, if we have a mask of 0x05040201 then we will lower into v_perm for multiple reasons: 1. the 0x0201 portion of the mask implies a 32-bit operand; 2. the 0x0201 portion of the mask is not well formed, since it requires a shift instruction to address these bits.

Finally, if the mask and operands indicate we are just producing one of the ops, combine the tree into the op.

The feature resulted in changes -- optimally not lowering into v_perm -- in:

All 4096 permutations of <4 x i8> shufflevector produced the desired result (including <i32 0, i32 1, i32 2, i32 3> and <i32 4, i32 5, i32 6, i32 7>, which lower into the corresponding 32-bit operand).

Should ByteProvider really be BitProvider?


Why 8? Usually 6 is the one true recursion depth limit


Do the RHS calls first and short circuit the second call if the first failed


Do you really need the explicit std::optional?



arsenm added inline comments.Mar 17 2023, 11:24 AM

Need to make sure this is scalar


Need to make sure this is scalar

jrbyrnes updated this revision to Diff 506658.Mar 20 2023, 11:12 AM
jrbyrnes marked 6 inline comments as done.

Thanks @arsenm for taking another look.

Address review comments. Need to rerun psdb since short-circuiting LHS slightly changes algorithmic behavior.


Based on testing, I can lower the depth, but we must accept depth 6, since this is the max tree depth across build vectors that should be lowered into v_perm. A relevant test is already included in permute_i8.ll. This depth may need to change in a future iteration.


Yes, it cannot infer std::optional, even after changing the order.

arsenm added inline comments.Apr 20 2023, 10:54 AM



This isn't checking for a scalar type?


Can this be a separate function? It's a lambda that doesn't capture anything

jrbyrnes updated this revision to Diff 520792.May 9 2023, 12:33 PM

Rebase + passed PSDB

Sorry @arsenm, I somehow missed your comments -- I'll address those.

jrbyrnes updated this revision to Diff 520852.May 9 2023, 4:03 PM
jrbyrnes marked 3 inline comments as done.

Address comments

jrbyrnes updated this revision to Diff 525221.May 24 2023, 9:26 AM

Cleanup checking in is16BitScalarOp + ping

Thanks for comments thus far -- any other concerns here?

arsenm accepted this revision.Jun 6 2023, 11:23 AM
arsenm added inline comments.

No reference


Doesn't feel like this is the right way to express it. Should handle CopyToRegs at least too


Don't need llvm::


Don't these default initialize to nullopt?

This revision is now accepted and ready to land.Jun 6 2023, 11:23 AM
jrbyrnes updated this revision to Diff 528985.Jun 6 2023, 12:14 PM
jrbyrnes marked 4 inline comments as done.

Thanks @arsenm for the review.

I'll take another look at the v_perm candidate whitelist method in a subsequent patch which extends this work to capture more patterns.

arsenm accepted this revision.Jun 8 2023, 4:15 PM
This revision was landed with ongoing or failed builds.Jun 19 2023, 9:54 AM
This revision was automatically updated to reflect the committed changes.


Why rebase a closed revision?


It was for posterity. Sorry for confusion.