This is an archive of the discontinued LLVM Phabricator instance.

[TTI] Add a hook to TTI for choosing scalarized shuffle-reduction sequence for reduction idiom
Abandoned · Public

Authored by FarhanaAleen on Apr 19 2018, 12:08 PM.

Details

Reviewers
RKSimon
Summary

This patch adds a hook to TTI for choosing a scalarized shuffle-reduction sequence, as opposed to a vectorized shuffle-reduction sequence, for the reduction idiom.

Allows generation of:

%0 = extractelement <4 x float> %bin.rdx, i32 0
%1 = extractelement <4 x float> %bin.rdx, i32 1
%res = fadd fast float %0, %1

Instead of

%rdx.shuf1 = shufflevector <4 x float> %bin.rdx, <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
%bin.rdx2 = fadd fast <4 x float> %bin.rdx, %rdx.shuf1
%res = extractelement <4 x float> %bin.rdx2, i32 0
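The two sequences compute the same value; a small Python model (illustrative only: the function names are invented here, and Python floats stand in for the fast-math fadd) makes the correspondence explicit:

```python
# Hypothetical model (not LLVM code) of the two lowerings of the final
# reduction step over the first two lanes of %bin.rdx.

def scalarized_final_step(bin_rdx):
    # %0 = extractelement %bin.rdx, 0 ; %1 = extractelement %bin.rdx, 1
    # %res = fadd fast float %0, %1
    return bin_rdx[0] + bin_rdx[1]

def shuffle_final_step(bin_rdx):
    # %rdx.shuf1 = shufflevector %bin.rdx, undef, <1, undef, undef, undef>
    # Lane 0 of the shuffle holds element 1; the other lanes are undef (None).
    rdx_shuf1 = [bin_rdx[1], None, None, None]
    # %bin.rdx2 = fadd fast <4 x float> %bin.rdx, %rdx.shuf1
    # Only lane 0 matters for the result.
    bin_rdx2_lane0 = bin_rdx[0] + rdx_shuf1[0]
    # %res = extractelement %bin.rdx2, 0
    return bin_rdx2_lane0

v = [1.0, 2.0, 3.0, 4.0]
assert scalarized_final_step(v) == shuffle_final_step(v) == 3.0
```

Both forms sum lanes 0 and 1; the scalarized form simply avoids building a full-width vector whose upper lanes are undef.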

Hi Simon,

This patch reflects your suggestion on https://reviews.llvm.org/D45393

Diff Detail

Event Timeline

FarhanaAleen created this revision. Apr 19 2018, 12:08 PM
hsaito added a subscriber: hsaito. Apr 20 2018, 11:40 AM

Farhana,

It looks to me that you are trying to fine-tune the final step of the log-2 reduction. In my opinion, such an optimization should be done in the Target. There are probably downstream optimizers looking for the proper log-2 shuffle sequence to detect the reduction last-value compute, and this change will break that for the affected target. Is that what you'd like to do? Also, the newly introduced TTI interface name doesn't convey enough information about fine-tuning only the last step.
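For readers following along, the log-2 reduction Hideki refers to halves the active data with a shuffle at every step while keeping the vector width fixed. A rough Python model (illustrative only, not LLVM code; the helper name is invented) of the full idiom that downstream matchers would look for:

```python
# Model of the fixed-width log-2 shuffle-reduction idiom: at each step a
# shufflevector pulls elements `stride` lanes down (undef lanes modeled as
# None), the vectors are added, and the stride halves until lane 0 holds
# the full sum.

def log2_shuffle_reduce(vec):
    width = len(vec)
    acc = list(vec)
    stride = width // 2
    while stride >= 1:
        # shufflevector: lane i reads lane i + stride; out-of-range -> undef
        shuf = [acc[i + stride] if i + stride < width else None
                for i in range(width)]
        # fadd fast: undef lanes left untouched (they never reach the result)
        acc = [acc[i] + shuf[i] if shuf[i] is not None else acc[i]
               for i in range(width)]
        stride //= 2
    # extractelement lane 0
    return acc[0]

assert log2_shuffle_reduce([1.0, 2.0, 3.0, 4.0]) == 10.0
```

Note that every intermediate vector keeps the starting width; the patch under review changes only how the last halving step (stride 1) is emitted.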

If you really think this change should benefit all targets (you mentioned that in D45393), you shouldn't be doing this through TTI as a long-term solution. Have a proper discussion on llvm-dev and get the major Targets to agree with you first. Then TTI should help migrate Targets, and finally the TTI interface should be removed after the migration.

Thanks,
Hideki

Hi Hideki.

"It looks to me that you are trying to fine-tune the final step of log-2 reduction."

This patch actually tries to generate required-size vectors (precise vector length) for the shuffle reduction (where scalarization follows naturally), instead of keeping the starting vector length fixed all the way through, which leaves the vectors filled with unnecessary undefs. I agree that was not clear in the previous patch. To make the distinction clear, I've created a separate function called getVariableLengthShuffleReduction() with additional changes.

I am not opposed to doing this in the target/CodeGen; it's just that the implementation there is a little messy, since there is no good way to classify scalar operations as opposed to vector operations, whereas it can be implemented in a cleaner way in the first place with a TTI hook.

Does it look reasonable to add a TTI hook and allow SLP to generate the precise vector length for the shuffle reduction in the first place if the target wants that?
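The "precise vector length" scheme described above could be sketched as follows (an assumption based on the description, not the actual patch; Python stands in for the IR and the function name is invented): each step narrows the vector to half its width with no undef padding, and the final two lanes are reduced with two extracts and one scalar fadd.

```python
# Variable-length shuffle reduction: every shufflevector produces a
# genuinely narrower vector (no undef lanes), and the last 2-wide step
# is scalarized.

def variable_length_reduce(vec):
    acc = list(vec)
    while len(acc) > 2:
        half = len(acc) // 2
        # shuffle + fadd into a half-width vector: every lane is defined
        acc = [acc[i] + acc[i + half] for i in range(half)]
    # final step: extractelement lanes 0 and 1, then one scalar fadd
    return acc[0] + acc[1]

assert variable_length_reduce([1.0, 2.0, 3.0, 4.0]) == 10.0
```

The end result matches the fixed-width idiom; the difference is purely in the types of the intermediate vectors.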

I don't know about your original intention, but your previous patch (on the left pane as of now) wasn't doing what you described above, while the right pane does.

This time, I can see why you'd want to do this upfront, except for the last step in your code. You may call this last step more optimal, and I don't have any data to support or go against your opinion about that, but I can say it's not consistent. Why bother? Is that REALLY needed?

In any case, you seem to be introducing a new "canonical form" for the reduction last-value compute. You are better off having a good discussion on llvm-dev, explicitly asking those who "latch on" to the reduction last-value compute whether they are okay with adding another form (or, if you really think this is a better representation, arguing that the old form should be retired). I suggest sending an RFC to llvm-dev.

At this moment, you fail to convince me why the community should support both canonical forms (i.e., why you can't optimize from the current form, or why you'd want to keep the old form). If others think supporting both is a great idea, I won't insist.

BTW, have you looked at ARM's experimental last-value-compute intrinsics, e.g., int_experimental_vector_reduce_add? If you don't mind losing IR-level optimizations for the reduction last-value code, this is the easiest way to do custom lowering in the Target. I don't know whether this meets your needs, but I thought it was worth mentioning.

FarhanaAleen abandoned this revision. Apr 25 2018, 2:41 PM

Thanks Hideki, I will think about your suggestion.

"BTW, have you looked at ARM's experimental last-value-compute intrinsics, e.g., int_experimental_vector_reduce_add? If you don't mind losing IR-level optimizations for the reduction last-value code, this is the easiest way to do custom lowering in the Target. I don't know whether this meets your needs, but I thought it was worth mentioning."

Yes, I have. They are my fallback option :).