This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Allow finer grain control of an unaligned access speed
ClosedPublic

Authored by rampitec on Apr 21 2022, 5:14 PM.

Details

Summary

A target can return if a misaligned access is 'fast' as defined
by the target or not. In reality there can be different levels
of 'fast' and 'slow'. This patch changes the boolean 'Fast'
argument of the allowsMisalignedMemoryAccesses family of functions
to an unsigned representing its speed.

A target can still define it as it wants and the direct translation
of the current code uses 0 and 1 for current false and true. This
makes the change an NFC.

Subsequent patch will start using an actual value of speed in
the load/store vectorizer to compare if a vectorized access going
to be not just fast, but not slower than before.

Diff Detail

Event Timeline

rampitec created this revision.Apr 21 2022, 5:14 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 21 2022, 5:14 PM
rampitec requested review of this revision.Apr 21 2022, 5:14 PM

What are the units?..

What are the units?..

The units are target defined, just like the very definition of 'fast'. The only metric here is that something with a higher number is faster than something with a lower number.

rampitec added a comment.EditedApr 21 2022, 5:41 PM

What are the units?..

The units are target defined, just like the very definition of 'fast'. The only metric here is that something with a higher number is faster than something with a lower number.

For example in the D124219 I am assigning the full bit size of an operation if it is really fast, then 32 if it is less than 4 byte aligned, and 1 otherwise as it is still faster than smaller access. I could have chosen another numbers, but it is simply easier for me to think about it that way: a less than 4 byte aligned wider load has in fact the same speed as a 32 bit load with the same alignment. So vectorizer would chose wider load as the speed is the same and the number of operations goes down.

This can be one way of choosing the units. I am pretty much sure one can come up with a sophisticated mechanism of weights for a specific target.

foad added a comment.Apr 22 2022, 3:21 AM

What are the units?..

The units are target defined, just like the very definition of 'fast'. The only metric here is that something with a higher number is faster than something with a lower number.

For example in the D124219 I am assigning the full bit size of an operation if it is really fast, then 32 if it is less than 4 byte aligned, and 1 otherwise as it is still faster than smaller access. I could have chosen another numbers, but it is simply easier for me to think about it that way: a less than 4 byte aligned wider load has in fact the same speed as a 32 bit load with the same alignment. So vectorizer would chose wider load as the speed is the same and the number of operations goes down.

This can be one way of choosing the units. I am pretty much sure one can come up with a sophisticated mechanism of weights for a specific target.

Is it worth switching to InstructionCost here? Or at least switching to a model where larger numbers mean slower, like InstructionCost uses.

Is it worth switching to InstructionCost here? Or at least switching to a model where larger numbers mean slower, like InstructionCost uses.

Changing it to InstructionCost will make it comparable to the rest of instructions. Further tweaking the cost itself will change codegen for many targets where cost model is used, and not tweaking it will change codegen where this interface is used.

Moreover, turning the meaning from 'fast' or 'speed' into a reverse 'cost' value will require setting actual thresholds on what is slow. Right now slow is just 0 or 'false' before the change. I am not really ready to set such thresholds and to actually define speed units even for our own target. Note that even D124219 defines speed levels only for LDS and this is not comparable to other address spaces.

I intentionally wanted to make this transition NFC and a switch to cost will turn an interface change into a real unwanted functional change.

mariusz-sikora-at-amd added inline comments.
llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
1323

false -> 0

rampitec updated this revision to Diff 424950.Apr 25 2022, 9:55 AM
rampitec marked an inline comment as done.
rampitec added inline comments.
llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
1323

Thanks!

kosarev added inline comments.May 18 2022, 12:41 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
17017

Shouldn't we also rename them all to something like Fastness?

rampitec added inline comments.May 18 2022, 12:52 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
17017

I have a change like this. It is 4 times bigger, so I dropped it. We can rename it if this is reasonable, but probably as a separate NFC. In any way it is almost not reviewable.

kosarev added inline comments.May 18 2022, 1:10 AM
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
17017

OK, makes sense.

arsenm accepted this revision.Nov 16 2022, 3:12 PM

LGTM, I thought you pushed this already

This revision is now accepted and ready to land.Nov 16 2022, 3:12 PM
This revision was landed with ongoing or failed builds.Nov 17 2022, 9:24 AM
This revision was automatically updated to reflect the committed changes.