This is an archive of the discontinued LLVM Phabricator instance.

[ARM] MVE FP16 cost adjustments
ClosedPublic

Authored by dmgreen on Jun 14 2020, 2:38 PM.

Details

Summary

This adjusts the MVE fp16 cost model, similar to how we already do for integer casts. It uses the base cost of 1 per cvt for most fp extend / truncates, but adjusts it for loads and stores where we know that a extending load has been used to get the load into the correct lane, and only an MVE VCVTB is then needed.

Diff Detail

Event Timeline

dmgreen created this revision.Jun 14 2020, 2:38 PM
dmgreen updated this revision to Diff 271746.Jun 18 2020, 9:26 AM

Rebase and add some additional trunc handling. This should not effect a lot, but keeps a symmetry between extending load and truncating stores.

There's a lot of changes here... would it be worth committing them as separate patches: MVE, NEON and generic, or splitting of memory and cast?

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
501

Maybe a comment on your magic number choices, or how about using getIntrinsicInstrCost for the call cost?

985

I thought you stance was that we shouldn't be looking at the context?!

dmgreen marked 2 inline comments as done.Jun 24 2020, 5:55 AM

Yeah, I can split this up. It did grow a bit. The parts are sometimes interrelated but we now have D82456, D82457 and D82458. And I will rebase this to the remaining fp16 parts.

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
501

getCallInstrCost looks like it might work - at least give the right cost. It will not include the same codesize changes, but that can be done separately. It's hard to say exactly what the call will cost with all the stuff that might go on inside it.

985

No. It's more... nuanced than that. We need to look at context in order to ever accurately model things. At least at the moment. Without this bit of code for example, all <4 x f16> would be quite expensive, so you end up with any fp16->fp32 convert not being vectorized, even if it should be beneficial.

But there are only certain parts of the context that are valid to look at. The opcode of the surrounding instructions is almost certainly OK, and in this case I would claim that the _scalar_ type of the float makes sense because it's converting a float (not an int). The absolute type could be wrong because it would be passed as a different type from the vectorizer (with a different number of vector elements). If it was an integer type then that could be wrong too because the vectorizer can promote types as it vectorizes to a smaller bitwidth.

I'm not saying it's a great design and would love a better way to do this, but it's the one we've currently got.

dmgreen updated this revision to Diff 273006.Jun 24 2020, 6:02 AM
dmgreen edited the summary of this revision. (Show Details)

Rebase onto other patches, leaving the FP16 part remaining here.

dmgreen updated this revision to Diff 273015.Jun 24 2020, 7:01 AM

Rebase properly onto tests. They were previously coming through twice.

samparker accepted this revision.Jun 29 2020, 7:50 AM
samparker added inline comments.
llvm/test/Transforms/LoopVectorize/ARM/prefer-tail-loop-folding.ll
565

Looks like these TODOs can now be removed.

This revision is now accepted and ready to land.Jun 29 2020, 7:50 AM
This revision was automatically updated to reflect the committed changes.