This is an archive of the discontinued LLVM Phabricator instance.

[X86][Costmodel] `getReplicationShuffleCost()`: implement cost model for 32/64 bit-wide elements with AVX512F
ClosedPublic

Authored by lebedev.ri on Nov 6 2021, 2:02 PM.

Details

Summary

This models lowering to vpermd/vpermq/vpermps/vpermpd,
that take a single input vector and a single index vector,
and are cross-lane. So far i haven't seen evidence that
replication ever results in demanding more than a single
input vector per output vector.

This results in *shockingly* lesser costs :)

Diff Detail

Event Timeline

lebedev.ri created this revision.Nov 6 2021, 2:02 PM
lebedev.ri requested review of this revision.Nov 6 2021, 2:02 PM
lebedev.ri added inline comments.Nov 6 2021, 2:05 PM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3741–3747

Yes, we should not need to override this, but then it does not compile,
it seems like that ^ override hides base implementation of this method.
So far my attempts at fixing this have not been successful.

lebedev.ri added inline comments.Nov 6 2021, 2:17 PM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3702–3704

I guess two sources of demanded elts was a wrong choice,
DemandedSrcElts is implied by DemandedReplicatedElts,
yet DemandedSrcElts provides less knowledge than DemandedReplicatedElts,
and changing DemandedReplicatedElts to DemandedIndices
would also be lossy. I'll drop DemandedSrcElts later.

lebedev.ri updated this revision to Diff 385297.Nov 6 2021, 3:14 PM

Tidy up the code somewhat, NFC.

lebedev.ri updated this revision to Diff 385299.Nov 6 2021, 3:51 PM

Ok, and implement the fast-path that assumes that all the non-trivial invariants hold,
but keep the original complex code in with-asserts mode and verify that results match.

The rough roadmap for the planned follow-ups:

  1. i16 if +BWI
  2. i8 if +VBMI
  3. i16 ?ext -> i32 -> trunc i16 if -BWI
  4. i8 ?ext -> i16 -> trunc i8 if -VBMI +BWI
  5. i8 ?ext -> i32 -> trunc i8 if -VBMI -BWI
  6. i1 ?ext -> i8 -> trunc i1 if +VBMI
  7. i1 ?ext -> i16 -> trunc i1 if -VBMI +BWI
  8. i1 ?ext -> i32 -> trunc i1 if -VBMI -BWI
  9. ???
  10. PROFIT!
lebedev.ri updated this revision to Diff 385758.Nov 9 2021, 3:40 AM

Rebased after refactoring baseline interface, NFC.

RKSimon added inline comments.Nov 10 2021, 7:14 AM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3632

Is this really useful?

3730

can't we extract the length of seq() ?

3742

I'm really torn over whether this should be in the method or not :( I understand its use but its a LOT of debug code.

lebedev.ri marked 2 inline comments as done.

@RKSimon thanks for taking a look!
Dropped debug code, i've convinced myself that this is trivially true for AVX512 shuffles, given that the replication factor is integral.

llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3632

Spelling the exact same thing over and over seems bound to do it wrong at some point.
I can inline if you insist, but it seemed more concise this way.

RKSimon added inline comments.Nov 10 2021, 11:23 AM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3654

We're not using the cost at all. just the legalized type?

lebedev.ri added inline comments.Nov 10 2021, 11:27 AM
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
3654

Could you please clarify, are you asking to change this to

MVT LegalReplicatedVecTy =
      TLI->getTypeLegalizationCost(DL, ReplicatedVecTy).second;

, or asking why we don't use the cost?
I'm mainly not sure why that cost is there or how i should use it here.
We don't need to legalize the input, it already as,
we just need to know how it's laid out (how many ?MM vectors), i believe?

RKSimon accepted this revision.Nov 10 2021, 11:38 AM

Yeah, you're right - LGTM

This revision is now accepted and ready to land.Nov 10 2021, 11:38 AM

Yeah, you're right - LGTM

Thank you for the review!