
[Clang] Add integer mul reduction builtin
ClosedPublic

Authored by RKSimon on Jan 20 2022, 12:20 PM.

Details

Summary

Similar to the existing bitwise reduction builtins, this lowers to a llvm.vector.reduce.mul intrinsic call.

For other reductions, we've tried to share builtins for float/integer vectors, but the fmul reduction intrinsic also takes a starting value argument and can perform either an unordered or a serialized reduction, but not the reduction trees specified for the builtins. However we address fmul support, it shouldn't affect the integer case.

Diff Detail

Event Timeline

RKSimon requested review of this revision. · Jan 20 2022, 12:20 PM
RKSimon created this revision.
Herald added a project: Restricted Project. · View Herald Transcript · Jan 20 2022, 12:20 PM

I should mention - according to https://clang.llvm.org/docs/LanguageExtensions.html __builtin_reduce_add() already exists, which I don't think is true.

This comment was removed by craig.topper.
fhahn added a subscriber: junaire. · Jan 21 2022, 2:09 AM

Thanks for the patch!

For other reductions, we've tried to share builtins for float/integer vectors, but the fadd/fmul reduction builtins also take a starting value argument. Technically I could support float by using default values, but we're probably better off with specific fadd/fmul reduction builtins for both arguments.

Just to double check, you mean LLVM intrinsics here, right? As specified, the Clang __builtin_reduce_add should support both integer and floating point reductions. Do you think that's problematic? If so, we should update the spec.

For float reductions, the key question is how to lower them. Unfortunately llvm.vector.reduce.fadd reductions at the moment are either unordered or sequential, but we need a particular order. @junaire has been looking into this and put up D117480 as an option to extend the intrinsic to have a dedicated order argument. That would make it easy for backends like AArch64 to select the right reduction instruction. Alternatively clang could also create the expanded reduction tree to start with. But we really want to lower this to native reduction instructions if we can.

I should mention - according to https://clang.llvm.org/docs/LanguageExtensions.html __builtin_reduce_add() already exists, which I don't think is true.

Do you mean it is listed in the docs but not implemented yet? That's true, the docs specified the whole set of proposed builtins from the start, regardless of implementation status.

FWIW, I think we intentionally did not specify __builtin_reduce_mul to start with, because it is very prone to overflows for integer vectors. Not saying that we cannot add it, but it would at least require updating the extension documentation. @scanon might have additional feedback. In any case, it might be good to only handle the add case in this patch.

clang/lib/Sema/SemaChecking.cpp:2599

_add should also support floats, add a TODO?

I'm happy to continue with this just for integers or wait until we have a plan for floats as well. I guess we need to decide if we want to support the starting value in the fadd/fmul intrinsics from the builtin or not? If we don't then adding float support to the add/mul reduction builtins now (or later on) would just involve using default starting values, if we do then we probably need a separate fadd/fmul builtin.

Or we could add starting values to the add/mul reduction builtins as well and we manually insert a scalar post-reduction add/mul instruction in cgbuiltin?

wrt float orders, currently the avx512f reductions attach fmf attributes when the builtin is translated to the intrinsic: https://github.com/llvm/llvm-project/blob/d2012d965d60c3258b3a69d024491698f8aec386/clang/lib/CodeGen/CGBuiltin.cpp#L14070

We might be able to get away with just expecting users to handle this with pragmas in code? I was planning to see if that would work for the avx512f fmin/fmax reduction intrinsics so I can use __builtin_reduce_min/max.

I'm mainly interested in __builtin_reduce_mul as avx512f requires it - the (few) cases I've seen it used have always involved char/short pixel data, extended to int/long before the mul reduction to address the overflow issues.

fhahn added a comment. · Jan 21 2022, 3:44 AM

I'm happy to continue with this just for integers or wait until we have a plan for floats as well. I guess we need to decide if we want to support the starting value in the fadd/fmul intrinsics from the builtin or not? If we don't then adding float support to the add/mul reduction builtins now (or later on) would just involve using default starting values, if we do then we probably need a separate fadd/fmul builtin.

I'm not sure if there's a major benefit of specifying the start value? Unless there is, I think we should not add a start value argument.

Or we could add starting values to the add/mul reduction builtins as well and we manually insert a scalar post-reduction add/mul instruction in cgbuiltin?

wrt float orders, currently the avx512f reductions attach fmf attributes when the builtin is translated to the intrinsic: https://github.com/llvm/llvm-project/blob/d2012d965d60c3258b3a69d024491698f8aec386/clang/lib/CodeGen/CGBuiltin.cpp#L14070

I am not familiar with how exactly ia32_reduce_fadd_pd512 & co are specified, but it looks like adding reassociate to the intrinsic call there might be technically incorrect, i.e. claiming the order does not matter, while I assume the C builtin guarantees a particular order? It might not result in an actual mis-compile on X86, because the X86 backend kind of guarantees that reductions with reassociate are lowered exactly as the C builtin requires (relying on some kind of secret handshake between frontend & backend).

As discussed in D117480 this seems like a major weakness in how llvm.vector.reduce.fadd is specified. As long as it is only used for target specific builtins, things might work out fine in most cases, but different targets might lower reassociate reductions differently. Also, things might fall apart if the middle end uses the reassociate property during a transform that changes the reduction order.

We might be able to get away with just expecting users to handle this with pragmas in code?

If the user allows reassociation via a pragma/flag, we can use the reduction builtin at the moment without problems. The motivation behind specifying a well-defined order that can be lowered efficiently by targets is to guarantee portable results by default. This is important for use cases where the builtins are used in code that's executed on different platforms.

I was planning to see if that would work for the avx512f fmin/fmax reduction intrinsics so I can use __builtin_reduce_min/max.

There shouldn't be any problem with fmin/fmax, as the result should be independent of the evaluation order IIUC.

I'm mainly interested in __builtin_reduce_mul as avx512f requires it - the (few) cases I've seen it used have always involved char/short pixel data, extended to int/long before the mul reduction to address the overflow issues.

Sounds reasonable! As mentioned earlier, it would be good to do that as separate patch, also updating LanguageExtensions.rst.

@junaire has been looking into this and put up D117480 as an option to extend the intrinsic to have a dedicated order argument.

Hi @fhahn, I wonder if we should continue working on this, as I remember one of the reviewers had doubts about this change.

I agree that the major problem now is how to lower it with a particular order, and maybe we should reach an agreement before we continue working on it.

RKSimon planned changes to this revision. · Jan 23 2022, 11:24 AM
fhahn added a comment. · Mar 16 2022, 9:12 AM

@RKSimon are you planning on pushing this patch through? From my perspective, it looks good with splitting off __builtin_reduce_mul and adding a TODO to also implement it for floating point types.

Herald added a project: Restricted Project. · View Herald Transcript · Mar 16 2022, 9:12 AM

@RKSimon are you planning on pushing this patch through? From my perspective, it looks good with splitting off __builtin_reduce_mul and adding a TODO to also implement it for floating point types.

You mean handle add/mul in separate patches? I'm happy to continue with this, I just didn't want to make it more difficult for us to add fp support in the future - particularly if we do end up needing the scalar initial_value argument

fhahn added a comment. · Mar 21 2022, 8:12 AM

@RKSimon are you planning on pushing this patch through? From my perspective, it looks good with splitting off __builtin_reduce_mul and adding a TODO to also implement it for floating point types.

You mean handle add/mul in separate patches? I'm happy to continue with this, I just didn't want to make it more difficult for us to add fp support in the future - particularly if we do end up needing the scalar initial_value argument

Yeah that's what I meant. I'm not sure about the need for supporting a scalar initial value, but implementing a subset of the specified behavior for now shouldn't cause much trouble further down the line hopefully.

@fhahn Remind me - why did you want me to split these? If we're initially just going for integer support, can't both be done at the same time in this patch?

RKSimon marked an inline comment as done. · Apr 8 2022, 9:39 AM
fhahn added a comment. · Sun, May 1, 12:17 PM

@fhahn Remind me - why did you want me to split these? If we're initially just going for integer support, can't both be done at the same time in this patch?

I think my original thinking was that __builtin_reduce_mul isn't defined at the moment https://clang.llvm.org/docs/LanguageExtensions.html#vector-builtins

So I think it would be good to either update the docs in this patch or, preferably, split off the straightforward __builtin_reduce_add change :)

RKSimon updated this revision to Diff 426366. · Mon, May 2, 3:31 AM
RKSimon retitled this revision from [Clang] Add integer add/mul reduction builtins to [Clang] Add integer mul reduction builtin.
RKSimon edited the summary of this revision. (Show Details)

Just handle the mul case now that D124741 has landed

RKSimon updated this revision to Diff 426367. · Mon, May 2, 3:33 AM

Add "__builtin_reduce_mul" to the docs listing supported reductions

fhahn accepted this revision. · Thu, May 5, 12:58 PM

LGTM, thanks!

IIRC @scanon had some opinions about __builtin_reduce_mul during some earlier discussions. Please wait a few days in case there are additional comments.

This revision is now accepted and ready to land. · Thu, May 5, 12:58 PM
This revision was automatically updated to reflect the committed changes.