Page MenuHomePhabricator

Apply accumulator to fadd/fmul experimental vector reductions (PR36734)
Needs ReviewPublic

Authored by RKSimon on Apr 5 2018, 2:48 PM.

Details

Summary

The llvm.experimental.vector.reduce.fadd/fmul intrinsic expansions were ignoring the accumulator scalar argument which should be added/multiplied to the scalar value from the vector reduction.

For (fast) shuffle reductions we should be able to apply the accumulator at the end of the sequence.

NOTE: I am currently working on a second patch for scalarizing strict fadd/fmul vector reductions (PR36732)

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon created this revision.Apr 5 2018, 2:48 PM
gnzlbg added inline comments.
lib/Transforms/Utils/LoopUtils.cpp
1545

Could you add a note explaining why this is the case?

If this isn't the case adding fast math flags is going to have a lot of unexpected side-effects for the users. Is there a way to verify this or assert this?

RKSimon added inline comments.Apr 6 2018, 6:28 AM
lib/Transforms/Utils/LoopUtils.cpp
1545

This is a cut and paste from the code below so I didn't introduce this - but for shuffle reduction you do have to assume fast math (see the equivalent comments further down in the createSimpleTargetReduction calling function).

ABataev added inline comments.Apr 6 2018, 6:32 AM
lib/Transforms/Utils/LoopUtils.cpp
1542
  1. auto->auto &&
  2. It is a bad idea to capture everything by reference, most of the variables can be captured by value, like Op or RedOps. It is better to use explicit capturing rather than implict.
1578–1579

Why are you excluding UndefValue here? If Acc is Undef, the Result must be Undef too, no?

test/CodeGen/Generic/expand-experimental-reductions.ll
103–118

Could you commit these tests separately as NFC and then update them using your changes to see the real effect of your patch?

RKSimon added inline comments.Apr 6 2018, 6:39 AM
lib/Transforms/Utils/LoopUtils.cpp
1578–1579

Undef appears to have been used to ignore the accumulator.... @aemerson can you confirm please?

aemerson added inline comments.Apr 6 2018, 7:11 AM
lib/Transforms/Utils/LoopUtils.cpp
1578–1579

I think I missed out a detail when I wrote the langref, original motivation of the scalar accumulator argument was for the use in strictly ordered FP reductions only. I.e. when the intrinsic call has no FMF flags attached then the accumulator argument is used, otherwise if there are no FMF flags then the argument is meant to be ignored.

If we're talking about the semantics of the intrinsics: whether or not the accumulator is undef should have no effect on the codegen for fast reductions. If it did, as your patch implements, then @ABataev is right in that a value + undef = undef. We would then have to ensure that we generate identity values for the particular reduction kind in the cases where we don't have an accumulator.

gnzlbg added inline comments.Apr 6 2018, 8:02 AM
lib/Transforms/Utils/LoopUtils.cpp
1545

but for shuffle reduction you do have to assume fast math

I think one just needs to assume floating-point math to be associative, but one does not need to assume that floating-point math is, for example, finite.

gnzlbg added a comment.EditedApr 6 2018, 8:10 AM

@aemerson

I think I missed out a detail when I wrote the langref, original motivation of the scalar accumulator argument was for the use in strictly ordered FP reductions only. I.e. when the intrinsic call has no FMF flags attached then the accumulator argument is used, otherwise if there are no FMF flags then the argument is meant to be ignored.

Why do we need the accumulator for this case? That is, why can't we just do:

result = vector[0];
for i in [1, vector.len) {
    result = binary_op(result, vector[i]);
}
return result;

I also wonder whether requiring fast-math to allow tree reductions is overkill. Tree reductions can be implemented reasonably efficiently in many architectures, while linearly ordered reduction appear to me to be more of a niche. Therefore, I wonder if it wouldn't make more sense to add llvm.experimental.vector.reduce.tree.{add,mul} that perform tree reductions without requiring fast math, and to just call those from here if fast-math is enabled.

scanon added a subscriber: scanon.Apr 6 2018, 8:18 AM

I also wonder whether requiring fast-math to allow tree reductions is overkill. Tree reductions can be implemented reasonably efficiently in many architectures, while linearly ordered reduction appear to me to be more of a niche.

Agreed. Tree-reduction is significantly faster and has smaller average error; it should really be the default in both LLVM and source languages except where there's an explicit constraint that forces linear reduction.

it should really be the default in both LLVM and source languages except where there's an explicit constraint that forces linear reduction.

Tree reductions is what Rust portable packed SIMD RFC [0] currently specifies.

@aemerson

For strictly ordered reductions which are supported on some vector architectures like ARM SVE, then
the accumulator operand is used when there are no FMF flags on the call.

This makes sense. Reading from the ARM SVE spec [1]:

Horizontal reductions

These instructions perform arithmetic horizontally across Active elements of a single source vector and deliver a
scalar result.

The floating-point horizontal accumulating sum instruction, FADDA, operates strictly in order of increasing Element
number across a vector, using the scalar destination register as a source for the initial value of the accumulator. This
preserves the original program evaluation order where non-associativity is required.
The other floating-point reductions calculate their result using a recursive pair-wise algorithm that does not preserve
the original program order, but permits increased parallelism for code that does not require strict order of evaluation.

The accumulator does really make sense for FADDA: one can iterate over a large sequence of memory, adding horizontal vectors to the accumulator, and get a result that preserves the ordered arithmetic. That's pretty cool.

[0]: https://github.com/rust-lang/rfcs/pull/2366
[1]: https://static.docs.arm.com/ddi0584/a/DDI0584A_b_SVE_supp_armv8A.pdf

@aemerson

I think I missed out a detail when I wrote the langref, original motivation of the scalar accumulator argument was for the use in strictly ordered FP reductions only. I.e. when the intrinsic call has no FMF flags attached then the accumulator argument is used, otherwise if there are no FMF flags then the argument is meant to be ignored.

Why do we need the accumulator for this case? That is, why can't we just do:

result = vector[0];
for i in [1, vector.len) {
    result = binary_op(result, vector[i]);
}
return result;

I also wonder whether requiring fast-math to allow tree reductions is overkill. Tree reductions can be implemented reasonably efficiently in many architectures, while linearly ordered reduction appear to me to be more of a niche. Therefore, I wonder if it wouldn't make more sense to add llvm.experimental.vector.reduce.tree.{add,mul} that perform tree reductions without requiring fast math, and to just call those from here if fast-math is enabled.

Because not every architecture has a statically defined vector length, so you may need to generate IR loops in order to express it unless you use these intrinsics.

it should really be the default in both LLVM and source languages except where there's an explicit constraint that forces linear reduction.

Tree reductions is what Rust portable packed SIMD RFC [0] currently specifies.

Yup, it's also what the Apple simd module uses.

RKSimon updated this revision to Diff 141381.Apr 6 2018, 11:05 AM

rebased - always apply accumulator if not null

Simon asked this on the PR, let's continue the discussion in one place:

Do we really want to completely ignore an intrinsic argument depending on the fast flags? There might be valid reasons to want to include an accumulation value.

The raison d'être for the argument is for ordered reductions, and the intrinsics were designed to allow the expression of the reduction idiom only, in light of newer vector architectures where the previous representation was inadequate. The use of an accumulator argument for fast reductions wasn't necessary, so the semantics were supposed to be defined in the most minimal form. The accumulator can easily be expressed as a extractelement->op->insertelement sequence. The question here I think becomes our good old friend: what should the canonical form be?

The other issue is that while these intrinsics were experimental (it was on my todo list for later this year to change that), AArch64 has been using them with their original intended semantics for a while now. If we change that, IR generated from a released compiler will be miscompiled since it now becomes legal to fold away undef accumulator reductions to undef, unless we do some IR auto-upgrading based on the bitcode version.

The other issue is that while these intrinsics were experimental (it was on my todo list for later this year to change that), AArch64 has been using them with their original intended semantics for a while now. If we change that, IR generated from a released compiler will be miscompiled since it now becomes legal to fold away undef accumulator reductions to undef, unless we do some IR auto-upgrading based on the bitcode version.

If they are flagged as experimental surely we can't be held resposible for changes in behaviour?

The other issue is that while these intrinsics were experimental (it was on my todo list for later this year to change that), AArch64 has been using them with their original intended semantics for a while now. If we change that, IR generated from a released compiler will be miscompiled since it now becomes legal to fold away undef accumulator reductions to undef, unless we do some IR auto-upgrading based on the bitcode version.

If they are flagged as experimental surely we can't be held resposible for changes in behaviour?

AArch64 was the test guinea pig for this new representation, and it's proved to be a smooth transition. If you change the semantics now, even if the intrinsics are still experimental in name, IR from LLVM 6.0 may be silently miscompiled if someone implements a valid optimization based on your new proposal. That's a fact, and I don't think this patch review is the right place to discuss that if you want to do this, I suggest you send a new RFC or revive the old one.