This adjusts the MVE fp16 cost model, similar to how we already do for integer casts. It uses the base cost of 1 per cvt for most fp extend / truncates, but adjusts it for loads and stores where we know that a extending load has been used to get the load into the correct lane, and only an MVE VCVTB is then needed.
Details
Diff Detail
Event Timeline
Rebase and add some additional trunc handling. This should not effect a lot, but keeps a symmetry between extending load and truncating stores.
There's a lot of changes here... would it be worth committing them as separate patches: MVE, NEON and generic, or splitting of memory and cast?
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp | ||
---|---|---|
501 | Maybe a comment on your magic number choices, or how about using getIntrinsicInstrCost for the call cost? | |
985 | I thought you stance was that we shouldn't be looking at the context?! |
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp | ||
---|---|---|
501 | getCallInstrCost looks like it might work - at least give the right cost. It will not include the same codesize changes, but that can be done separately. It's hard to say exactly what the call will cost with all the stuff that might go on inside it. | |
985 | No. It's more... nuanced than that. We need to look at context in order to ever accurately model things. At least at the moment. Without this bit of code for example, all <4 x f16> would be quite expensive, so you end up with any fp16->fp32 convert not being vectorized, even if it should be beneficial. But there are only certain parts of the context that are valid to look at. The opcode of the surrounding instructions is almost certainly OK, and in this case I would claim that the _scalar_ type of the float makes sense because it's converting a float (not an int). The absolute type could be wrong because it would be passed as a different type from the vectorizer (with a different number of vector elements). If it was an integer type then that could be wrong too because the vectorizer can promote types as it vectorizes to a smaller bitwidth. I'm not saying it's a great design and would love a better way to do this, but it's the one we've currently got. |
llvm/test/Transforms/LoopVectorize/ARM/prefer-tail-loop-folding.ll | ||
---|---|---|
565 | Looks like these TODOs can now be removed. |
Maybe a comment on your magic number choices, or how about using getIntrinsicInstrCost for the call cost?