This is an archive of the discontinued LLVM Phabricator instance.

new method TargetTransformInfo::supportsVectorElementLoadStore() for LoopVectorizer
ClosedPublic

Authored by jonpa on Mar 6 2017, 11:49 PM.

Details

Summary

Since SystemZ supports vector element load / store instructions, there is no need for extracts / inserts if a vector load / store gets scalarized. This patch lets Target specify that it supports such instructions by means of a new TTI virtual method that defaults to false.

The use for this is in the LoopVectorizer getScalarizationOverhead() method. As a side note, I am thinking perhaps this function should go into BasicTTIImpl.h, but since there are a at least two changes in review in this area, I am waiting with that.

Diff Detail

Event Timeline

jonpa created this revision.Mar 6 2017, 11:49 PM
jonpa added a comment.Mar 14 2017, 3:01 AM

ping.

This patch does not have any tests, because it will go in with https://reviews.llvm.org/D29631.

I was hoping to make review simpler by creating a separate diff for this issue.

anemet edited edge metadata.Mar 14 2017, 9:49 AM

Why aren't you putting this in TTI.getOperandsScalarizationOverhead?

Why aren't you putting this in TTI.getOperandsScalarizationOverhead?

I don't see how that would work, because this is dependent on the instruction opcode. It is only relevant if it is a store or a load, and only for the store case are the operands checked for extracts. Are you suggesting adding an extra parameter to that function (opcode)?

Sorry about the delay on this but I was working on something related for ARM that may benefit from this as well. What I need for ARM is something that can communicate to the SLPVectorizer that load-pair and store-pair (of two registers) is efficiently supported on the target. I am wondering if we can combine the two things if your new hook would take the type and the vectorization width.

What do you think?

Sorry about the delay on this but I was working on something related for ARM that may benefit from this as well. What I need for ARM is something that can communicate to the SLPVectorizer that load-pair and store-pair (of two registers) is efficiently supported on the target. I am wondering if we can combine the two things if your new hook would take the type and the vectorization width.

What do you think?

Is this also in the context of scalarizing a load / store?

For SystemZ, a scalarized memory access will have to do VF memory operations, but there is no need to extract or insert any of the data elements, as there are vector element load/store instructions.

I am not sure if you mean that there is no scalarization cost and that there is a lowered (halfed) memoryOpCost, or that you mean that you could extract / insert two elements at a time?

In the latter case, I suppose we could use something like getLoadStoreScalarizationDiscount(VF), instead of the current patch.

In the first case, I suppose you should use this new hook, along with adjusting the getMemoryOpCost() method.

jonpa added a comment.Apr 3 2017, 4:18 AM

ping. Adam - any comments?

anemet added a comment.Apr 4 2017, 9:11 PM

Sorry about the delay on this but I was working on something related for ARM that may benefit from this as well. What I need for ARM is something that can communicate to the SLPVectorizer that load-pair and store-pair (of two registers) is efficiently supported on the target. I am wondering if we can combine the two things if your new hook would take the type and the vectorization width.

What do you think?

Is this also in the context of scalarizing a load / store?

For SystemZ, a scalarized memory access will have to do VF memory operations, but there is no need to extract or insert any of the data elements, as there are vector element load/store instructions.

We have something like this on ARM too. ld1 can load any element of a vector (e.g. ld1.s {v1}[1], [x1] loads lane 1 of vector reg v1) and st1 can store any element. That said, ld1 is still a partial write of the vector register so in terms of performance, it's worse than a regular store which is a full write. I think that modeling its cost as a load + insert (for non-zero-lane) is fairly accurate. Doesn't this match the situation on SystemZ?

jonpa added a comment.Apr 4 2017, 11:13 PM

Sorry about the delay on this but I was working on something related for ARM that may benefit from this as well. What I need for ARM is something that can communicate to the SLPVectorizer that load-pair and store-pair (of two registers) is efficiently supported on the target. I am wondering if we can combine the two things if your new hook would take the type and the vectorization width.

What do you think?

Is this also in the context of scalarizing a load / store?

For SystemZ, a scalarized memory access will have to do VF memory operations, but there is no need to extract or insert any of the data elements, as there are vector element load/store instructions.

We have something like this on ARM too. ld1 can load any element of a vector (e.g. ld1.s {v1}[1], [x1] loads lane 1 of vector reg v1) and st1 can store any element. That said, ld1 is still a partial write of the vector register so in terms of performance, it's worse than a regular store which is a full write. I think that modeling its cost as a load + insert (for non-zero-lane) is fairly accurate. Doesn't this match the situation on SystemZ?

As far as I know there is on SystemZ no extra penalty for using a vector load element, so scalarizing a vector load will really cost e.g. 4 loads at VF 4. This should be better than doing 4 scalar loads and 4 inserts.

Are you saying that this only makes sense for stores on ARM? In that case maybe a boolean argument like IsStore might work?

What about the handing of two registers at a time you mentioned earlier?

bjope added a subscriber: bjope.Apr 4 2017, 11:33 PM

Sorry about the delay on this but I was working on something related for ARM that may benefit from this as well. What I need for ARM is something that can communicate to the SLPVectorizer that load-pair and store-pair (of two registers) is efficiently supported on the target. I am wondering if we can combine the two things if your new hook would take the type and the vectorization width.

What do you think?

Is this also in the context of scalarizing a load / store?

For SystemZ, a scalarized memory access will have to do VF memory operations, but there is no need to extract or insert any of the data elements, as there are vector element load/store instructions.

We have something like this on ARM too. ld1 can load any element of a vector (e.g. ld1.s {v1}[1], [x1] loads lane 1 of vector reg v1) and st1 can store any element. That said, ld1 is still a partial write of the vector register so in terms of performance, it's worse than a regular store which is a full write. I think that modeling its cost as a load + insert (for non-zero-lane) is fairly accurate. Doesn't this match the situation on SystemZ?

As far as I know there is on SystemZ no extra penalty for using a vector load element, so scalarizing a vector load will really cost e.g. 4 loads at VF 4. This should be better than doing 4 scalar loads and 4 inserts.

My point is not whether it's better or not (it certainly is shorter) but whether 4 scalar loads have the same cost as four vector-element loads. The hook would state the latter. Anyhow, for in-order processors I could see how this could be true.

Are you saying that this only makes sense for stores on ARM? In that case maybe a boolean argument like IsStore might work?

Yes, I think so.

What about the handing of two registers at a time you mentioned earlier?

I convinced myself that that is a separate issue. There, we want to communicate that to load or store a pair of registers (<=64bits) only takes one instruction in scalar mode.

include/llvm/Analysis/TargetTransformInfo.h
437

Needs comment.

jonpa updated this revision to Diff 94485.Apr 6 2017, 11:23 PM

Comment added.
Test case for SystemZ.

jonpa marked an inline comment as done.Apr 6 2017, 11:26 PM
Are you saying that this only makes sense for stores on ARM? In that case maybe a boolean argument like IsStore might work?

Yes, I think so.

OK - well, in that case I guess you are fine with this patch, since it will be easy for you to extend it?

jonpa updated this revision to Diff 94515.Apr 7 2017, 6:39 AM

Added 'REQUIRES: asserts' to test case.

anemet added inline comments.Apr 7 2017, 8:52 AM
include/llvm/Analysis/TargetTransformInfo.h
437–439

Again, this should not state the presence of these instructions but whether they are as efficient as their scalar counterparts. Please rephrase.

It's probably even better to express this in the name of the hook, e.g. supportsEfficientVectorElementLoadStore.

jonpa updated this revision to Diff 94649.Apr 10 2017, 1:37 AM

Updated per review with new method name.

anemet accepted this revision.Apr 11 2017, 9:05 AM

LGTM.

This revision is now accepted and ready to land.Apr 11 2017, 9:05 AM
jonpa added a comment.Apr 12 2017, 5:54 AM

Thanks for review!
r300056

jonpa closed this revision.Apr 12 2017, 5:54 AM