This is an archive of the discontinued LLVM Phabricator instance.

Improve ISel using across lane addition reduction
ClosedPublic

Authored by junbuml on Aug 25 2015, 11:16 AM.

Details

Summary

In vectorized add reduction code, the final "reduce" step is sub-optimal.
This change wll combine :

ext  v1.16b, v0.16b, v0.16b, #8
add  v0.4s, v1.4s, v0.4s
dup  v1.4s, v0.s[1]
add  v0.4s, v1.4s, v0.4s

into

addv s0, v0.4s

This fixes PR21371.

Diff Detail

Event Timeline

junbuml updated this revision to Diff 33091.Aug 25 2015, 11:16 AM
junbuml retitled this revision from to Improve ISel using across lane addition reduction.
junbuml updated this object.
junbuml added reviewers: jmolloy, mcrosier.
junbuml added subscribers: mssimpso, bmakam, gberry, llvm-commits.
aadg added a subscriber: aadg.Aug 25 2015, 11:33 AM
mcrosier edited edge metadata.Aug 25 2015, 12:35 PM

Thanks for working on this, Jun. Mostly style nits with a few other comments.

lib/Target/AArch64/AArch64ISelLowering.cpp
8603

To improve readability, you could expand this into two parts:

// If the extract isn't feeding an ADD, can't do such combine.
if (N0->getOpcode() != ISD::ADD)
  return SDValue();

// Vector extract idx must constant zero. 
if (!isa<ConstantSDNode>(N1) ||  cast<ConstantSDNode>(N1)->getZExtValue())
  return SValue();
8615

The computation of NumVecElts / 2 appears to be loop invariant. Mind hoisting it out of the conditional check for clarity?

8629

Please add a space between %svn and =.

8631

I don't know the correct answer, but is the SV.hasOneUse() check necessary?

8642

Please pre-increment i.

8650

Please use DL rather than dl.

junbuml updated this revision to Diff 33109.Aug 25 2015, 1:15 PM
junbuml edited edge metadata.

Thanks Chad for the quick review. I addressed your comments.

jmolloy requested changes to this revision.Aug 26 2015, 5:32 AM
jmolloy edited edge metadata.

Hi,

Thanks for working on this! I'm generally happy with the algorithm, but I have a bunch of comment and style requests.

Cheers,

James

lib/Target/AArch64/AArch64ISelLowering.cpp
8588

I'd like to see here an example using SDAG nodes. It's nice to have the assembly before/after, but what's really important is the exact pattern this function is intending to match, and what it replaces it with.

For example:

%2 = SHUFFLE_VECTOR ...
%1 = ADD %2, %3
EXTRACT_ELEMENT %1, 0

Something like that.

And specifically, this is the log2-shuffle pattern that the LoopVectorizer produces. Please make that very explicit, because this code won't match all reductions.

8603

Well actually you can - FMINNUM, FMINNAN, FMAXNUM, FMAXNAN, SMIN, SMAX, UMIN and UMAX all have lane-reduce variants in the A64 instruction set.

I understand you may not want to handle them all at this time, but:

  • Please make the comment explicit about what you are and aren't handling.
  • Please replace uses of "Add" later on with "binop" or something - so that it isn't perceived that this code is specific to Adds. It's not, it's just only been tested with adds. It should work for any commutative binary op.
8616

Why is this called NumMaxSubAddElts? (Where's the Sub come from?)

8620

It'd be more readable IMHO to calculate the number of expected steps beforehand (as lg(N)), and just have normal counter here. It would make the algorithm more clear.

8623

Please don't fully capitalize local variables - stick to the naming convention of UpperCamelCase (unless using an initialism, so SV is fine).

8624

... That said, it'd be nice to call this "Shuffle", to be very explicit about it.

8625

Comment what you're doing here, which is to allow the add's operands to be commutative.

8647

You've given an example here, which is good, but you haven't defined the *requirements* on the mask indices which the code below is intending to enforce.

Without that information it is difficult to know if the code has a bug in it.

Also, use "unsigned" instead of "unsigned int".

8648

I'm assuming this is saying that all elements greater than NumAddElts must be undef? better to use "Mask[i] != 0" instead of ">= 0" I think, so it is more obvious.

8649

Instead of static_casting here, just use a signed induction variable.

8653

I'm not convinced that "InputADD" is the most descriptive name here, but I don't have an explicit alternative in mind.

This revision now requires changes to proceed.Aug 26 2015, 5:32 AM
junbuml marked 6 inline comments as done.Aug 26 2015, 6:21 AM

Thanks James for the review. I will address your comments and update it at this morning.

junbuml updated this revision to Diff 33227.Aug 26 2015, 11:15 AM
junbuml edited edge metadata.

Addressed James' comments.

junbuml marked 11 inline comments as done.Aug 26 2015, 11:39 AM
junbuml added inline comments.
lib/Target/AArch64/AArch64ISelLowering.cpp
8588

. Changed the example using SDAG nodes.
. Specifically mentioned that this function handles the final step of vector reduction.

8603

Add "FIXME" in comment.

8620

Use "NumExpectedStep" and "CurStep" to iterate steps.

8648

To check if an element is an UNDEF, we need to check if the mask is negative value. So, I'm checking Mask[i] < 0.

8653

Now I'm using PreOp and CurOp, instead of InputADD and ADD.

Jun/James,

While this patch is focused on the add reductions, aren't there other opportunities here for optimizing the reduce step of the other reduction types? For example, couldn't we similarly use the sminv/uminv and smaxv/umaxv instructions for min/max reductions? We currently don't do this. Min/max reductions are important for 456.hmmer.

I now see that the above comment was already made, and a FIXME was added. Please disregard.

junbuml marked 5 inline comments as done.Aug 26 2015, 2:20 PM

Yes, there are more across lane reduction instructions as mentioned in FIXME. We could extend it to support other types as well.
I may prefer to get this patch in first, and extend it to support other types. James, let me know your thought.

I may prefer to get this patch in first, and extend it to support other types. James, let me know your thought.

The LLVM community prefers incremental changes, which simplifies code review as well as eases bisection. It might be wise to land this patch before moving onto the other types of reductions.

junbuml updated this revision to Diff 33459.Aug 28 2015, 1:50 PM
junbuml edited edge metadata.

Updated comments little bit more.

Confirmed that this patch was passed correctness tests for spec2000/2006, and applied several spec benchmarks including gcc, h264, hmmer, sjeng, and vpr.

Hi James,

I'm planning on pushing another patch to handle other across vector reductions (SMAX/SMIN/UMAX/UMIN), which require some change in this patch. Do you prefer to have a single merged patch or separate patches? Either ways are fine with me. Please let me know your thought.

@jmolloy: This looks to be in good shape, AFAICT. Mind giving this a LGTM, if you're satisfied.

jmolloy accepted this revision.Sep 3 2015, 9:53 AM
jmolloy edited edge metadata.

Hi,

This looks great now, thanks! LGTM with the few nitpicks below.

Cheers,

James

lib/Target/AArch64/AArch64ISelLowering.cpp
8634

Pedantic: Step -> Steps.

8638

"check" -> "check only"

8650

"Check if it forms a one" -> "Check if it forms one"

This revision is now accepted and ready to land.Sep 3 2015, 9:53 AM
junbuml updated this revision to Diff 33955.Sep 3 2015, 10:06 AM
junbuml edited edge metadata.
junbuml marked 3 inline comments as done.

Address comments.
Thanks James for the review

Thanks James for the review !
Chad, would you mind landing this when you have a chance ? Thanks!

mcrosier closed this revision.Sep 3 2015, 12:20 PM

Committed in r246790.