This is an archive of the discontinued LLVM Phabricator instance.

[Clang] Add __builtin_reduce_or and __builtin_reduce_and
ClosedPublic

Authored by junaire on Jan 6 2022, 3:57 AM.

Details

Summary

This patch implements two builtins specified in D111529.
The last __builtin_reduce_add will be seperated into another one.

Diff Detail

Event Timeline

junaire requested review of this revision.Jan 6 2022, 3:57 AM
junaire created this revision.
Herald added a project: Restricted Project. · View Herald TranscriptJan 6 2022, 3:57 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript

LGTM, thanks!

The last __builtin_reduce_add will be seperated into another one.

Are you planning on putting up a patch for this one as well? What makes add a bit different is that ‘llvm.vector.reduce.fadd.*’ can only perform reductions either in the original order or in an unspecified order. For the extension, we need a particular evaluation order (reduction tree adding adjacent element pairs). Technically this order is required for all reduction builtins, but for integers the order doesn't matter, same for min/max.

clang/lib/Sema/SemaChecking.cpp
2240–2243

nit: Those .... vectors ....

LGTM other than the NIT found by @fhahn

junaire updated this revision to Diff 398838.Jan 10 2022, 9:54 PM

fix the comment.

LGTM, thanks!

The last __builtin_reduce_add will be separated into another one.

Are you planning on putting up a patch for this one as well? What makes add a bit different is that ‘llvm.vector.reduce.fadd.*’ can only perform reductions either in the original order or in an unspecified order. For the extension, we need a particular evaluation order (reduction tree adding adjacent element pairs). Technically this order is required for all reduction builtins, but for integers the order doesn't matter, same for min/max.

Sorry about the late response. Yeah, I'm trying to work on this builtin too, but actually, I don't know if I can do this as all my previous work is like kind of boilerplate or something? I have read the whole discussion in the mailing list and the related LLVM IR reference, but I still get confused a little bit.

So the difference of this builtin is because LLVM intrinsic declare it like:

declare float @llvm.vector.reduce.fadd.v4f32(float %start_value, <4 x float> %a)
declare double @llvm.vector.reduce.fadd.v2f64(double %start_value, <2 x double> %a)

And it performs sequential reduction which is not what we want right? We need it to reduce like:

[e3, e2, e1, e0] => (e3, e2) + (e1, e0)

Does it mean we should do something like a for loop? or like recursive calls? or like changing the order of the elements in the vector?

And another thing that confuses me is that pad identity elements after the last element to widen the vector out to a power 2 . According to the IR reference, is the neutral value just zero?

The last confusing point is %start_value, we just simply consider it is 0, isn't it?

I would appreciate it if you can give me any hints, which I think is very helpful to my LLVM learning :-)

fhahn added a comment.Jan 13 2022, 7:28 AM

@junaire did you already get commit access or should I commit this change on your behalf?

LGTM, thanks!

The last __builtin_reduce_add will be separated into another one.

Are you planning on putting up a patch for this one as well? What makes add a bit different is that ‘llvm.vector.reduce.fadd.*’ can only perform reductions either in the original order or in an unspecified order. For the extension, we need a particular evaluation order (reduction tree adding adjacent element pairs). Technically this order is required for all reduction builtins, but for integers the order doesn't matter, same for min/max.

Sorry about the late response. Yeah, I'm trying to work on this builtin too, but actually, I don't know if I can do this as all my previous work is like kind of boilerplate or something? I have read the whole discussion in the mailing list and the related LLVM IR reference, but I still get confused a little bit.

So the difference of this builtin is because LLVM intrinsic declare it like:

declare float @llvm.vector.reduce.fadd.v4f32(float %start_value, <4 x float> %a)
declare double @llvm.vector.reduce.fadd.v2f64(double %start_value, <2 x double> %a)

And it performs sequential reduction which is not what we want right? We need it to reduce like:

[e3, e2, e1, e0] => (e3, e2) + (e1, e0)

Does it mean we should do something like a for loop? or like recursive calls? or like changing the order of the elements in the vector?

One way to go about it would be to extend the @llvm.vector.reduce.fadd to take another integer or boolean argument indicating the order to apply.

Targets that support such horizontal add instructions, like AArch64, can then lower the intrinsic call directly to the right instructions. Otherwise we can generate the right instruction sequence for the reduction. We know the number of vector elements, so there should be no for a loop or recursion, we can just generate instructions for the full tree (extra the lanes using shuffle vector & add them).

And another thing that confuses me is that pad identity elements after the last element to widen the vector out to a power 2 . According to the IR reference, is the neutral value just zero?

The last confusing point is %start_value, we just simply consider it is 0, isn't it?

For fadd reductions it should be -0.0 I think.

I would appreciate it if you can give me any hints, which I think is very helpful to my LLVM learning :-)

@junaire did you already get commit access or should I commit this change on your behalf?

Yeah, I already have commit access, just waiting for your approval ;D

fhahn accepted this revision.Jan 14 2022, 2:35 AM

@junaire did you already get commit access or should I commit this change on your behalf?

Yeah, I already have commit access, just waiting for your approval ;D

Ah sorry, I thought I already approved the change.

LGTM again

This revision is now accepted and ready to land.Jan 14 2022, 2:35 AM
This revision was automatically updated to reflect the committed changes.