This is an archive of the discontinued LLVM Phabricator instance.

Are you planning on putting up a patch for this one as well? What makes add a bit different is that ‘llvm.vector.reduce.fadd.*’ can only perform reductions either in the original order or in an unspecified order. For the extension, we need a particular evaluation order (reduction tree adding adjacent element pairs). Technically this order is required for all reduction builtins, but for integers the order doesn't matter, same for min/max.
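To make the ordering point concrete, here is a minimal C sketch (not code from this patch) contrasting an in-order reduction with the adjacent-pair tree. All function names are hypothetical, and the four-lane width is an assumption chosen for brevity.

```c
#include <stddef.h>

/* In-order reduction: (((start + v[0]) + v[1]) + ...) + v[n-1]. */
float reduce_fadd_sequential(float start, const float *v, size_t n) {
  float acc = start;
  for (size_t i = 0; i < n; ++i)
    acc += v[i];
  return acc;
}

/* Adjacent-pair tree over four lanes: (v[0] + v[1]) + (v[2] + v[3]). */
float reduce_fadd_pairwise4(const float v[4]) {
  float lo = v[0] + v[1];
  float hi = v[2] + v[3];
  return lo + hi;
}

/* The same two shapes for unsigned integers. Unsigned addition wraps and is
   associative, so both orders always give the same result. */
unsigned reduce_add_sequential_u32(const unsigned *v, size_t n) {
  unsigned acc = 0;
  for (size_t i = 0; i < n; ++i)
    acc += v[i];
  return acc;
}

unsigned reduce_add_pairwise4_u32(const unsigned v[4]) {
  return (v[0] + v[1]) + (v[2] + v[3]);
}
```

With an input such as {1e8f, 1.0f, -1e8f, 1.0f}, the two float functions can return different values on a target that evaluates float arithmetic in IEEE-754 single precision, because the small terms are absorbed at different points. The integer versions agree for every input.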

clang/lib/Sema/SemaChecking.cpp
2235	nit: Those .... vectors ....

LGTM other than the NIT found by @fhahn

fix the comment.

Harbormaster completed remote builds in B142581: Diff 398838. Jan 10 2022, 10:37 PM

In D116736#3230040, @fhahn wrote:

LGTM, thanks!

The last __builtin_reduce_add will be separated into another one.

Are you planning on putting up a patch for this one as well? What makes add a bit different is that ‘llvm.vector.reduce.fadd.*’ can only perform reductions either in the original order or in an unspecified order. For the extension, we need a particular evaluation order (reduction tree adding adjacent element pairs). Technically this order is required for all reduction builtins, but for integers the order doesn't matter, same for min/max.

Sorry about the late response. Yeah, I'm trying to work on this builtin too, but honestly I don't know whether I can do it, since all my previous work has been more or less boilerplate. I have read the whole discussion on the mailing list and the related LLVM IR reference, but I'm still a little confused.

So this builtin is different because the LLVM intrinsic is declared like:

declare float @llvm.vector.reduce.fadd.v4f32(float %start_value, <4 x float> %a)
declare double @llvm.vector.reduce.fadd.v2f64(double %start_value, <2 x double> %a)

And it performs a sequential reduction, which is not what we want, right? We need it to reduce like:

[e3, e2, e1, e0] => (e3 + e2) + (e1 + e0)

Does that mean we should use something like a for loop? Or recursive calls? Or change the order of the elements in the vector?

Another thing that confuses me is the part about padding identity elements after the last element to widen the vector out to a power of 2. According to the IR reference, is the neutral value just zero?

The last confusing point is %start_value: can we simply treat it as 0?

I would appreciate any hints you can give me; they would be very helpful for my LLVM learning :-)

@junaire did you already get commit access or should I commit this change on your behalf?

In D116736#3239530, @junaire wrote:
In D116736#3230040, @fhahn wrote:

LGTM, thanks!

The last __builtin_reduce_add will be separated into another one.

Are you planning on putting up a patch for this one as well? What makes add a bit different is that ‘llvm.vector.reduce.fadd.*’ can only perform reductions either in the original order or in an unspecified order. For the extension, we need a particular evaluation order (reduction tree adding adjacent element pairs). Technically this order is required for all reduction builtins, but for integers the order doesn't matter, same for min/max.

Sorry about the late response. Yeah, I'm trying to work on this builtin too, but honestly I don't know whether I can do it, since all my previous work has been more or less boilerplate. I have read the whole discussion on the mailing list and the related LLVM IR reference, but I'm still a little confused.

So this builtin is different because the LLVM intrinsic is declared like:

declare float @llvm.vector.reduce.fadd.v4f32(float %start_value, <4 x float> %a)
declare double @llvm.vector.reduce.fadd.v2f64(double %start_value, <2 x double> %a)

And it performs a sequential reduction, which is not what we want, right? We need it to reduce like:

[e3, e2, e1, e0] => (e3 + e2) + (e1 + e0)

Does that mean we should use something like a for loop? Or recursive calls? Or change the order of the elements in the vector?

One way to go about it would be to extend the @llvm.vector.reduce.fadd to take another integer or boolean argument indicating the order to apply.

Targets that support such horizontal add instructions, like AArch64, can then lower the intrinsic call directly to the right instructions. Otherwise we can generate the right instruction sequence for the reduction. We know the number of vector elements, so there should be no need for a loop or recursion; we can just generate instructions for the full tree (extract the lanes using shufflevector and add them).
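As a rough illustration of the "full tree" idea, the following C sketch models an eight-lane vector as a plain struct and spells out every level of the tree by hand. The types and function names are hypothetical; real lowering would emit shufflevector and fadd instructions rather than C, so this only mirrors the shape of the computation.

```c
/* Hypothetical stand-ins for <8 x float>, <4 x float> and <2 x float>. */
typedef struct { float lane[8]; } f32x8;
typedef struct { float lane[4]; } f32x4;
typedef struct { float lane[2]; } f32x2;

/* One tree level: out[i] = in[2i] + in[2i+1]. In IR this would correspond to
   extracting the even and odd lanes with two shuffles and one vector fadd. */
f32x4 add_adjacent8(f32x8 v) {
  f32x4 r = {{v.lane[0] + v.lane[1], v.lane[2] + v.lane[3],
              v.lane[4] + v.lane[5], v.lane[6] + v.lane[7]}};
  return r;
}

f32x2 add_adjacent4(f32x4 v) {
  f32x2 r = {{v.lane[0] + v.lane[1], v.lane[2] + v.lane[3]}};
  return r;
}

/* The lane count is known statically, so the three levels are written out
   explicitly; no loop or recursion is needed. */
float reduce_fadd_tree8(f32x8 v) {
  f32x4 level1 = add_adjacent8(v);
  f32x2 level2 = add_adjacent4(level1);
  return level2.lane[0] + level2.lane[1];
}
```

Each level halves the number of lanes, so a vector of 2^k lanes needs k levels.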

Another thing that confuses me is the part about padding identity elements after the last element to widen the vector out to a power of 2. According to the IR reference, is the neutral value just zero?

The last confusing point is %start_value: can we simply treat it as 0?

For fadd reductions it should be -0.0 I think.
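A small C sketch of why -0.0 rather than +0.0 is the safer padding value: under IEEE-754 round-to-nearest, x + (-0.0) equals x for every x, while (-0.0) + (+0.0) is +0.0, so padding with +0.0 can flip the sign of an all-negative-zero result. The function name and the three-lane input are assumptions made for illustration.

```c
#include <math.h>

/* Widen a three-lane input to four lanes with `pad`, then reduce with the
   adjacent-pair tree: (w[0] + w[1]) + (w[2] + w[3]). */
float reduce_fadd_padded3(const float v[3], float pad) {
  float w[4] = {v[0], v[1], v[2], pad};
  return (w[0] + w[1]) + (w[2] + w[3]);
}

/* Returns nonzero when x carries a negative sign, including -0.0. */
int is_negative_signed(float x) {
  return signbit(x) != 0;
}
```

For ordinary inputs the choice of pad does not change the value. The difference only shows up in the sign of a zero result, which is observable through signbit or through a later division.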

I would appreciate any hints you can give me; they would be very helpful for my LLVM learning :-)

@junaire did you already get commit access or should I commit this change on your behalf?

Yeah, I already have commit access, just waiting for your approval ;D

In D116736#3242623, @junaire wrote:

@junaire did you already get commit access or should I commit this change on your behalf?

Yeah, I already have commit access, just waiting for your approval ;D

Ah sorry, I thought I already approved the change.

LGTM again

This revision is now accepted and ready to land. Jan 14 2022, 2:35 AM

Closed by commit rG8de0c1feca28: [Clang] Add __builtin_reduce_or and __builtin_reduce_and (authored by junaire). Jan 14 2022, 6:06 AM

This revision was automatically updated to reflect the committed changes.

junaire added a commit: rG8de0c1feca28: [Clang] Add __builtin_reduce_or and __builtin_reduce_and.

junaire mentioned this in D117480: [IR] Extend llvm.vector.reduce.fadd. Jan 17 2022, 6:56 AM

Revision Contents

Path                                            Size
clang/include/clang/Basic/Builtins.def          2 lines
clang/lib/CodeGen/CGBuiltin.cpp                 6 lines
clang/lib/Sema/SemaChecking.cpp                 6 lines
clang/test/CodeGen/builtins-reduction-math.c    22 lines
clang/test/Sema/builtins-reduction-math.c       34 lines

Diff 397846

clang/include/clang/Basic/Builtins.def

 BUILTIN(__builtin_elementwise_abs, "v.", "nct")
 BUILTIN(__builtin_elementwise_max, "v.", "nct")
 BUILTIN(__builtin_elementwise_min, "v.", "nct")
 BUILTIN(__builtin_elementwise_ceil, "v.", "nct")
 BUILTIN(__builtin_reduce_max, "v.", "nct")
 BUILTIN(__builtin_reduce_min, "v.", "nct")
 BUILTIN(__builtin_reduce_xor, "v.", "nct")
+BUILTIN(__builtin_reduce_or, "v.", "nct")
+BUILTIN(__builtin_reduce_and, "v.", "nct")

 BUILTIN(__builtin_matrix_transpose, "v.", "nFt")
 BUILTIN(__builtin_matrix_column_major_load, "v.", "nFt")
 BUILTIN(__builtin_matrix_column_major_store, "v.", "nFt")

 // "Overloaded" Atomic operator builtins. These are overloaded to support data
 // types of i8, i16, i32, i64, and i128. The front-end sees calls to the
 // non-suffixed version of these (which has a bogus type) and transforms them to

clang/lib/CodeGen/CGBuiltin.cpp

@@ case Builtin::BI__builtin_reduce_min: {
     return RValue::get(emitUnaryBuiltin(
         *this, E, GetIntrinsicID(E->getArg(0)->getType()), "rdx.min"));
   }

   case Builtin::BI__builtin_reduce_xor:
     return RValue::get(emitUnaryBuiltin(
         *this, E, llvm::Intrinsic::vector_reduce_xor, "rdx.xor"));
+  case Builtin::BI__builtin_reduce_or:
+    return RValue::get(emitUnaryBuiltin(
+        *this, E, llvm::Intrinsic::vector_reduce_or, "rdx.or"));
+  case Builtin::BI__builtin_reduce_and:
+    return RValue::get(emitUnaryBuiltin(
+        *this, E, llvm::Intrinsic::vector_reduce_and, "rdx.and"));

   case Builtin::BI__builtin_matrix_transpose: {
     const auto *MatrixTy = E->getArg(0)->getType()->getAs<ConstantMatrixType>();
     Value *MatValue = EmitScalarExpr(E->getArg(0));
     MatrixBuilder<CGBuilderTy> MB(Builder);
     Value *Result = MB.CreateMatrixTranspose(MatValue, MatrixTy->getNumRows(),
                                              MatrixTy->getNumColumns());
     return RValue::get(Result);

clang/lib/Sema/SemaChecking.cpp

@@ if (!TyA) {
           << 1 << /* vector ty*/ 4 << Arg->getType();
       return ExprError();
     }

     TheCall->setType(TyA->getElementType());
     break;
   }

-  // __builtin_reduce_xor supports vector of integers only.
+  // This builtins support vector of integers only.
     (inline comment) fhahn: nit: Those .... vectors ....
-  case Builtin::BI__builtin_reduce_xor: {
+  case Builtin::BI__builtin_reduce_xor:
+  case Builtin::BI__builtin_reduce_or:
+  case Builtin::BI__builtin_reduce_and: {
     if (PrepareBuiltinReduceMathOneArgCall(TheCall))
       return ExprError();

     const Expr *Arg = TheCall->getArg(0);
     const auto *TyA = Arg->getType()->getAs<VectorType>();
     if (!TyA || !TyA->getElementType()->isIntegerType()) {
       Diag(Arg->getBeginLoc(), diag::err_builtin_invalid_arg_type)
           << 1 << /* vector of integers */ 6 << Arg->getType();

clang/test/CodeGen/builtins-reduction-math.c

@@ void test_builtin_reduce_xor(si8 vi1, u4 vu1) {
   // CHECK:      [[VI1:%.+]] = load <8 x i16>, <8 x i16>* %vi1.addr, align 16
   // CHECK-NEXT: call i16 @llvm.vector.reduce.xor.v8i16(<8 x i16> [[VI1]])
   short r2 = __builtin_reduce_xor(vi1);

   // CHECK:      [[VU1:%.+]] = load <4 x i32>, <4 x i32>* %vu1.addr, align 16
   // CHECK-NEXT: call i32 @llvm.vector.reduce.xor.v4i32(<4 x i32> [[VU1]])
   unsigned r3 = __builtin_reduce_xor(vu1);
 }
+
+void test_builtin_reduce_or(si8 vi1, u4 vu1) {
+
+  // CHECK:      [[VI1:%.+]] = load <8 x i16>, <8 x i16>* %vi1.addr, align 16
+  // CHECK-NEXT: call i16 @llvm.vector.reduce.or.v8i16(<8 x i16> [[VI1]])
+  short r2 = __builtin_reduce_or(vi1);
+
+  // CHECK:      [[VU1:%.+]] = load <4 x i32>, <4 x i32>* %vu1.addr, align 16
+  // CHECK-NEXT: call i32 @llvm.vector.reduce.or.v4i32(<4 x i32> [[VU1]])
+  unsigned r3 = __builtin_reduce_or(vu1);
+}
+
+void test_builtin_reduce_and(si8 vi1, u4 vu1) {
+
+  // CHECK:      [[VI1:%.+]] = load <8 x i16>, <8 x i16>* %vi1.addr, align 16
+  // CHECK-NEXT: call i16 @llvm.vector.reduce.and.v8i16(<8 x i16> [[VI1]])
+  short r2 = __builtin_reduce_and(vi1);
+
+  // CHECK:      [[VU1:%.+]] = load <4 x i32>, <4 x i32>* %vu1.addr, align 16
+  // CHECK-NEXT: call i32 @llvm.vector.reduce.and.v4i32(<4 x i32> [[VU1]])
+  unsigned r3 = __builtin_reduce_and(vu1);
+}
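For readers who want to try the two new builtins outside the test suite, here is a hedged C sketch. It assumes the `u4` type in the tests is a four-lane unsigned ext_vector_type (the typedef is not shown in this excerpt), and it falls back to plain scalar code when the compiler does not provide the builtins. The wrapper names are hypothetical.

```c
/* Use the builtins only when compiling with a Clang that reports them. */
#if defined(__clang__) && defined(__has_builtin)
#if __has_builtin(__builtin_reduce_or) && __has_builtin(__builtin_reduce_and)
#define HAVE_REDUCE_BUILTINS 1
#endif
#endif

#ifdef HAVE_REDUCE_BUILTINS
/* Assumed to match the u4 typedef used by the tests. */
typedef unsigned u4 __attribute__((ext_vector_type(4)));
#endif

/* Bitwise OR of all four lanes. */
unsigned reduce_or4_u32(const unsigned e[4]) {
#ifdef HAVE_REDUCE_BUILTINS
  u4 v = {e[0], e[1], e[2], e[3]};
  return __builtin_reduce_or(v);
#else
  return e[0] | e[1] | e[2] | e[3];
#endif
}

/* Bitwise AND of all four lanes. */
unsigned reduce_and4_u32(const unsigned e[4]) {
#ifdef HAVE_REDUCE_BUILTINS
  u4 v = {e[0], e[1], e[2], e[3]};
  return __builtin_reduce_and(v);
#else
  return e[0] & e[1] & e[2] & e[3];
#endif
}
```

Because OR and AND are associative and commutative on integers, the evaluation order does not affect the result, which is why these builtins can map straight onto the existing llvm.vector.reduce.or and llvm.vector.reduce.and intrinsics.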

clang/test/Sema/builtins-reduction-math.c

@@ void test_builtin_reduce_xor(int i, float4 v, int3 iv) {
   // expected-error@-1 {{too many arguments to function call, expected 1, have 2}}

   i = __builtin_reduce_xor(i);
   // expected-error@-1 {{1st argument must be a vector of integers (was 'int')}}

   i = __builtin_reduce_xor(v);
   // expected-error@-1 {{1st argument must be a vector of integers (was 'float4' (vector of 4 'float' values))}}
 }
+
+void test_builtin_reduce_or(int i, float4 v, int3 iv) {
+  struct Foo s = __builtin_reduce_or(iv);
+  // expected-error@-1 {{initializing 'struct Foo' with an expression of incompatible type 'int'}}
+
+  i = __builtin_reduce_or();
+  // expected-error@-1 {{too few arguments to function call, expected 1, have 0}}
+
+  i = __builtin_reduce_or(iv, iv);
+  // expected-error@-1 {{too many arguments to function call, expected 1, have 2}}
+
+  i = __builtin_reduce_or(i);
+  // expected-error@-1 {{1st argument must be a vector of integers (was 'int')}}
+
+  i = __builtin_reduce_or(v);
+  // expected-error@-1 {{1st argument must be a vector of integers (was 'float4' (vector of 4 'float' values))}}
+}
+
+void test_builtin_reduce_and(int i, float4 v, int3 iv) {
+  struct Foo s = __builtin_reduce_and(iv);
+  // expected-error@-1 {{initializing 'struct Foo' with an expression of incompatible type 'int'}}
+
+  i = __builtin_reduce_and();
+  // expected-error@-1 {{too few arguments to function call, expected 1, have 0}}
+
+  i = __builtin_reduce_and(iv, iv);
+  // expected-error@-1 {{too many arguments to function call, expected 1, have 2}}
+
+  i = __builtin_reduce_and(i);
+  // expected-error@-1 {{1st argument must be a vector of integers (was 'int')}}
+
+  i = __builtin_reduce_and(v);
+  // expected-error@-1 {{1st argument must be a vector of integers (was 'float4' (vector of 4 'float' values))}}
+}


[Clang] Add __builtin_reduce_or and __builtin_reduce_and (Closed, Public)


Unit Tests: Failed
