This is an archive of the discontinued LLVM Phabricator instance.

Implement custom lowering for ISD::CTTZ_ZERO_UNDEF and ISD::CTTZ.
Closed, Public

Authored by wdng on Aug 31 2017, 12:27 PM.

Details

Reviewers

arsenm
b-sumner
t-tye
kzhuravl
rampitec

Commits

rG5676acad9e0b: Implement custom lowering for ISD::CTTZ_ZERO_UNDEF and ISD::CTTZ.
rL315610: Implement custom lowering for ISD::CTTZ_ZERO_UNDEF and ISD::CTTZ.

Summary

During the DAGCombine optimization phase, the LLVM compiler converts ISD::CTTZ_ZERO_UNDEF to ISD::CTTZ and then expands during the Legalization phase, which prevents the v_ffbl_b32 instruction generation. This patch implements custom lowering for ISD::CTTZ_ZERO_UNDEF and ISD::CTTZ.

Diff Detail

Repository: rL LLVM

Event Timeline

wdng created this revision. Aug 31 2017, 12:27 PM

Herald added a subscriber: nhaehnle. Aug 31 2017, 12:28 PM

arsenm added inline comments. Aug 31 2017, 1:34 PM

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
16680 ↗	(On Diff #113451)	The existing code was the correct way to do this

arsenm added a subscriber: llvm-commits. Aug 31 2017, 1:34 PM

wdng added inline comments. Aug 31 2017, 1:54 PM

lib/CodeGen/SelectionDAG/DAGCombiner.cpp

16680 ↗	(On Diff #113451)

With original code, we will have the following code transformations:

Initial selection DAG: BB#0 'sample_test:entry'
SelectionDAG has 50 nodes:
  t0: ch = EntryToken
  t2: i64,ch = CopyFromReg t0, Register:i64 %vreg2
  t3: i64 = Constant<0>
  t5: i64,ch = load<LD8[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t2, undef:i64
  t6: i64,ch = merge_values t5, t5:1
    t8: i64 = add t2, Constant:i64<8>
  t9: i64,ch = load<LD8[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t8, undef:i64
  t10: i64,ch = merge_values t9, t9:1
  t11: ch = TokenFactor t6:1, t10:1
      t13: i64 = llvm.amdgcn.dispatch.ptr TargetConstant:i32<359>
    t19: i64 = add t13, Constant:i64<4>
  t20: i16,ch = load<LD2[%4(addrspace=2)](align=4)(tbaa=<0x4436db8>)> t11, t19, undef:i64
    t25: i64 = llvm.amdgcn.implicitarg.ptr TargetConstant:i32<460>
  t27: i64,ch = load<LD8[%11(addrspace=2)](tbaa=<0x4435518>)> t11, t25, undef:i64
  t29: i64 = Constant<32>
              t17: i32 = llvm.amdgcn.workgroup.id.x TargetConstant:i32<505>
              t21: i32 = zero_extend t20
            t22: i32 = mul t17, t21
            t15: i32 = llvm.amdgcn.workitem.id.x TargetConstant:i32<508>
          t23: i32 = add t22, t15 
        t26: i64 = zero_extend t23 
      t28: i64 = add t27, t26 
    t31: i64 = shl t28, Constant:i32<32>
  t32: i64 = sra t31, Constant:i32<32>
    t33: i64 = add t6, t32 
  t34: i8,ch = load<LD1[%arrayidx(addrspace=1)](tbaa=<0x4435498>)> t11, t33, undef:i64
  t35: i32 = zero_extend t34 
    t39: i1 = setcc t35, Constant:i32<0>, setne:ch
    t36: i32 = cttz_zero_undef t35
  t40: i32 = select t39, t36, Constant:i32<32>
  t43: i1 = setcc Constant:i32<8>, t40, setult:ch
      t47: ch = TokenFactor t20:1, t27:1, t34:1
        t44: i32 = umin t40, Constant:i32<8>
      t45: i8 = truncate t44 
      t46: i64 = add t10, t32 
    t48: ch = store<ST1[%arrayidx3(addrspace=1)](tbaa=<0x4435498>)> t47, t45, t46, undef:i64
  t49: ch = ENDPGM t48

Optimized lowered selection DAG: BB#0 'sample_test:entry'
SelectionDAG has 35 nodes:
  t0: ch = EntryToken
  t2: i64,ch = CopyFromReg t0, Register:i64 %vreg2
  t5: i64,ch = load<LD8[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t2, undef:i64
    t8: i64 = add t2, Constant:i64<8>
  t9: i64,ch = load<LD8[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t8, undef:i64
  t11: ch = TokenFactor t5:1, t9:1
    t33: i64 = add t5, t63 
  t54: i32,ch = load<LD1[%arrayidx(addrspace=1)](tbaa=<0x4435498>), zext from i8> t11, t33, undef:i64
    t25: i64 = llvm.amdgcn.implicitarg.ptr TargetConstant:i32<460>
  t62: i32,ch = load<LD4[%11(addrspace=2)](align=8)(tbaa=<0x4435518>)> t11, t25, undef:i64
          t17: i32 = llvm.amdgcn.workgroup.id.x TargetConstant:i32<505>
        t22: i32 = mul t17, t64 
        t15: i32 = llvm.amdgcn.workitem.id.x TargetConstant:i32<508>
      t23: i32 = add t22, t15
    t60: i32 = add t62, t23
  t63: i64 = sign_extend t60, ValueType:ch:i32
      t13: i64 = llvm.amdgcn.dispatch.ptr TargetConstant:i32<359>
    t19: i64 = add t13, Constant:i64<4>
  t64: i32,ch = load<LD2[%4(addrspace=2)](align=4)(tbaa=<0x4436db8>), zext from i16> t11, t19, undef:i64
      t47: ch = TokenFactor t64:1, t62:1, t54:1
        t53: i32 = cttz t54
      t44: i32 = umin t53, Constant:i32<8>
      t46: i64 = add t9, t63
    t50: ch = store<ST1[%arrayidx3(addrspace=1)](tbaa=<0x4435498>), trunc to i8> t47, t44, t46, undef:i64
  t49: ch = ENDPGM t50

We won't be able to generate the s/v_ffbl instructions. I found that every llvm.cttz.i32 has been converted to cttz_zero_undef instead of cttz.

If we don't want to change the original implementation, should we instead custom-lower ISD::CTTZ to ISD::CTTZ_ZERO_UNDEF in the AMDGPU backend?

Ping.

I think the actual problem is the implementation of ISD::CTTZ not using v_ffbl and not this transformation.

If v_ffbl is able to produce a defined answer of bit width for 0, then you want to match it with cttz and have the operation action for cttz_zero_undef set to Expand. That will turn all cttz_zero_undef calls into cttz.

If v_ffbl is not capable of handling zero, then you want cttz_zero_undef set to Legal, and cttz set to Expand which will make use of cttz_zero_undef and a select. Or you can make cttz Custom and do your own lowering.
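The "cttz_zero_undef and a select" expansion mentioned above can be sketched on plain 32-bit integers. This is only an illustrative model, not LLVM code: the function names `cttzZeroUndef32` and `cttz32` are hypothetical, and the model asserts on zero where the real node simply has no defined result.

```cpp
#include <cassert>
#include <cstdint>

// Model of ISD::CTTZ_ZERO_UNDEF on i32. The result is only meaningful for
// non-zero input, so this model asserts instead of returning garbage.
inline uint32_t cttzZeroUndef32(uint32_t X) {
  assert(X != 0 && "cttz_zero_undef has no defined result for zero");
  uint32_t Count = 0;
  while ((X & 1u) == 0) {
    X >>= 1;
    ++Count;
  }
  return Count;
}

// Model of ISD::CTTZ on i32 expanded as compare + select:
//   cttz(x) = (x == 0) ? 32 : cttz_zero_undef(x)
inline uint32_t cttz32(uint32_t X) {
  return X == 0 ? 32u : cttzZeroUndef32(X);
}
```

With this model, `cttz32(0)` yields the bit width (32), while every non-zero input is forwarded to the zero-undef variant unchanged.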

I think the instruction behavior is to return -1 on 0 input. IIRC we handle this and fold that for ctlz already, just not cttz.

Just add a custom lowering of ISD::CTTZ to ISD::CTTZ_ZERO_UNDEF

In D37348#859119, @wdng wrote:

Just add a custom lowering of ISD::CTTZ to ISD::CTTZ_ZERO_UNDEF

I don't think that will help. Why not follow exactly how CTLZ* is handled now and implement AMDGPUTargetLowering::LowerCTTZ making use of ffbl?
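As a rough model of what an ffbl-based lowering computes, the sketch below assumes (as stated earlier in the thread, from memory) that the instruction returns -1 for a zero input. The names `ffblB32`, `cttzViaFfbl` and `selectWithMinusOne` are hypothetical and only illustrate the arithmetic; they are not the functions added by the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of v_ffbl_b32: index of the lowest set bit, or -1
// (all ones) when the input is zero. The -1 behavior is an assumption
// taken from the review discussion.
inline uint32_t ffblB32(uint32_t X) {
  if (X == 0)
    return 0xFFFFFFFFu;
  uint32_t Index = 0;
  while ((X & 1u) == 0) {
    X >>= 1;
    ++Index;
  }
  return Index;
}

// cttz with the intrinsic's defined result for zero (the bit width):
// one ffbl plus a select on "source is zero".
inline uint32_t cttzViaFfbl(uint32_t X) {
  uint32_t R = ffblB32(X);
  // Invariant of the model: ffbl returns -1 exactly for zero input.
  assert((X == 0) == (R == 0xFFFFFFFFu));
  return X == 0 ? 32u : R;
}

// The shape a "select with -1" combine would look for. Because ffbl already
// yields -1 for zero, the select is redundant and equals ffblB32(X).
inline uint32_t selectWithMinusOne(uint32_t X) {
  return X != 0 ? ffblB32(X) : 0xFFFFFFFFu;
}
```

In this model `selectWithMinusOne(X)` and `ffblB32(X)` agree for every input, which is why such a select can be folded away.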

wdng updated this revision to Diff 114215. Sep 7 2017, 11:13 AM

wdng retitled this revision from "Tighten conditions for converting ISD::CTTZ_ZERO_UNDEF to ISD::CTTZ" to "Implement custom lowering for ISD::CTTZ_ZERO_UNDEF and ISD::CTTZ."

wdng added a reviewer: craig.topper. Sep 7 2017, 11:36 AM

Ping.

wdng added a reviewer: t-tye. Sep 8 2017, 12:10 PM

arsenm added inline comments. Sep 8 2017, 1:22 PM

lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
2784–2793	I don't understand why you have this or most of the other changes. This shouldn't be substantially different from how we handle ctlz already. i.e. I would expect to see another version of AMDGPUTargetLowering::performCtlzCombine that does essentially the same thing for CTTZ.
lib/Target/AMDGPU/AMDGPUISelLowering.cpp
420	This should definitely remain legal
2029	You didn't enable custom lowering for i64, so this is dead code. You also didn't add any tests for it. In either case, it should be in a separate patch from the i32 handling.
lib/Target/AMDGPU/AMDGPUInstrInfo.td
301–302	This isn't a signed/unsigned operation. There is just one v_ffbl_b32.

craig.topper resigned from this revision. Sep 8 2017, 10:48 PM

wdng marked 2 inline comments as done. Sep 11 2017, 9:05 AM

wdng added inline comments.

lib/Target/AMDGPU/AMDGPUISelLowering.cpp
420	I think it doesn't matter if we define it as Custom, because it will be converted to FFBL_U32 during custom lowering and then pattern-matched to the ffbl instruction at the end anyway. However, if we define it as Legal, we will have a "duplicate" or "extra" pattern (FFBL_U32 and CTTZ_ZERO_UNDEF) for generating the ffbl instruction. Is there a specific reason I am overlooking for why it has to be Legal?

arsenm added inline comments. Sep 12 2017, 7:05 PM

lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
2803–2806	OK, I see the default expansion here isn't the compare and select like I expected. Since the compare+select implementation is likely more instructions with the compare than the sub/ctpop implementation, that one should be tried first.
lib/Target/AMDGPU/AMDGPUISelLowering.cpp
417	We should probably fix this at some point to be legal
1109–1110	Also need the select with -1 optimization (and corresponding tests) as cttz
2021	This is mostly copy-paste from LowerCTLZ. These should be factored into a common helper.
test/CodeGen/AMDGPU/cttz_zero_undef.ll
103	Need i64 tests
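For reference, the default sub/ctpop expansion mentioned in the LegalizeDAG.cpp comment above is the identity quoted in that file's source comments, `popcount(~x & (x - 1))`, with `32 - nlz(~x & (x - 1))` as the alternative when only a leading-zero count is available. The sketch below checks both forms on plain integers; `popcount32`, `nlz32` and the other helper names are hypothetical, written out by hand so the block has no dependencies.

```cpp
#include <cassert>
#include <cstdint>

// Number of set bits, clearing the lowest set bit on each iteration.
inline uint32_t popcount32(uint32_t X) {
  uint32_t N = 0;
  while (X != 0) {
    X &= X - 1;
    ++N;
  }
  return N;
}

// Number of leading zeros; returns 32 for zero input.
inline uint32_t nlz32(uint32_t X) {
  uint32_t N = 0;
  for (uint32_t Mask = 0x80000000u; Mask != 0 && (X & Mask) == 0; Mask >>= 1)
    ++N;
  return N;
}

// ~x & (x - 1) has ones exactly in the trailing-zero positions of x
// (and is all ones when x is zero), so counting them gives cttz.
inline uint32_t cttzViaPopcount(uint32_t X) {
  return popcount32(~X & (X - 1));
}

inline uint32_t cttzViaNlz(uint32_t X) {
  return 32u - nlz32(~X & (X - 1));
}

// Returns cttz(X) and asserts that both formulations agree.
inline uint32_t cttzChecked(uint32_t X) {
  uint32_t A = cttzViaPopcount(X);
  assert(A == cttzViaNlz(X) && "both expansions must agree");
  return A;
}
```

Both forms return 32 for a zero input without any compare or select, which is the trade-off being weighed against the compare+select expansion.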

Changes based on code review feedback.

Upload a full diff.

Missing performCtlzCombine equivalent

lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
2803–2806	I don't see this changed
test/CodeGen/AMDGPU/cttz_zero_undef.ll
109	Missing scalar version

Address code reviews.

Fix variable names that were not capitalized.

wdng marked 2 inline comments as done. Sep 14 2017, 2:50 PM

Ping.

arsenm added inline comments. Sep 15 2017, 10:17 AM

lib/Target/AMDGPU/AMDGPUISelLowering.cpp
2036–2042	Indentation wrong
2043	llvm_unreachable
2061–2064	Select between Zero and One as input to getSetCC
3008–3009	You could just pass in the new opcode directly rather than selecting it again
3010	Commented out code
3035	You didn't add tests for this part
lib/Target/AMDGPU/AMDGPUISelLowering.h
374	Should be named FFBL_B32 to match the instruction
test/CodeGen/AMDGPU/cttz_zero_undef.ll
131	Also should have some tests with i8/i16
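The 64-bit case touched on in the inline comments above follows the formula quoted in the patch's own source comment: `cttz(x) = lo_32(x) == 0 ? cttz(hi_32(x)) + 32 : cttz(lo_32(x))`. The following is a minimal integer model of that formula with hypothetical names, not the DAG code itself. One difference worth noting: a DAG lowering would compute both halves and select between them, whereas this model evaluates only the half it needs so that the zero-undef helper is never called with zero.

```cpp
#include <cassert>
#include <cstdint>

// Model of cttz_zero_undef on i32; undefined (asserts) for zero input.
inline uint32_t cttzZeroUndef32(uint32_t X) {
  assert(X != 0 && "undefined for zero input");
  uint32_t Count = 0;
  while ((X & 1u) == 0) {
    X >>= 1;
    ++Count;
  }
  return Count;
}

// cttz_zero_undef on i64 built from two 32-bit halves:
//   cttz(x) = lo_32(x) == 0 ? cttz(hi_32(x)) + 32 : cttz(lo_32(x))
inline uint32_t cttzZeroUndef64(uint64_t X) {
  assert(X != 0 && "undefined for zero input");
  uint32_t Lo = static_cast<uint32_t>(X);
  uint32_t Hi = static_cast<uint32_t>(X >> 32);
  return Lo == 0 ? cttzZeroUndef32(Hi) + 32u : cttzZeroUndef32(Lo);
}

// Fully defined cttz on i64: the bit width (64) for zero input.
inline uint32_t cttz64(uint64_t X) {
  return X == 0 ? 64u : cttzZeroUndef64(X);
}
```

Note that the condition tests the low half. That mirrors the ctlz case, where the condition tests the high half.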

Will create another separate ticket to fix the v_ffbl_sdwa instruction generation.

Ping.

wdng edited reviewers, added: kzhuravl; removed: craig.topper. Sep 22 2017, 12:29 PM

Ping.

Needs more comprehensive check lines. Just checking the instructions won't demonstrate that the extra instructions you're trying to avoid aren't there

lib/Target/AMDGPU/AMDGPUISelLowering.cpp
3028	Don't include AMDGPUISD in the name of this
3053	Ditto
test/CodeGen/AMDGPU/cttz_zero_undef.ll
85	This needs to check more
97	This needs to check more
121–122	Ditto

Address code reviews.

Ping.

wdng added a reviewer: rampitec. Oct 10 2017, 11:10 AM

arsenm added inline comments. Oct 11 2017, 11:05 AM

lib/Target/AMDGPU/AMDGPUISelLowering.cpp
2031	Extra space after ::
2040	Don't include AMDGPUISD in variable name
2041	Missing space before {
2073	Double // and missing closing )
2077	Double //
test/CodeGen/AMDGPU/cttz_zero_undef.ll
1	Add -enable-var-scope to all of the FileCheck lines. Several of these tests are broken
171–172	This isn't checking the outputs and select
184	Using undefined VAL
198	Undefined VAL

Address code reviews.

wdng marked 3 inline comments as done. Oct 11 2017, 4:14 PM

craig.topper removed a subscriber: craig.topper. Oct 11 2017, 4:16 PM

arsenm added inline comments. Oct 11 2017, 4:24 PM

test/CodeGen/AMDGPU/cttz_zero_undef.ll
175–176	Using 2 -DAGs with identical lines doesn't do anything. It will pass with only one

wdng added inline comments. Oct 11 2017, 4:27 PM

test/CodeGen/AMDGPU/cttz_zero_undef.ll
175–176	No, it won't work if I remove the -DAG, because the generated instructions get interleaved with each other.

Remove duplicate check lines.

Removed -DAG checks completely.

LGTM

This revision is now accepted and ready to land. Oct 12 2017, 10:39 AM

Closed by commit rL315610: Implement custom lowering for ISD::CTTZ_ZERO_UNDEF and ISD::CTTZ. (authored by wdng). Oct 12 2017, 12:37 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
include/llvm/Target/TargetSelectionDAG.td	2 lines
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp	13 lines
lib/Target/AMDGPU/AMDGPUISelLowering.h	3 lines
lib/Target/AMDGPU/AMDGPUISelLowering.cpp	63 lines
lib/Target/AMDGPU/AMDGPUInstrInfo.td	2 lines
lib/Target/AMDGPU/EvergreenInstructions.td	2 lines
lib/Target/AMDGPU/SOPInstructions.td	5 lines
test/CodeGen/AMDGPU/cttz_zero_undef.ll	39 lines

Diff 115230

include/llvm/Target/TargetSelectionDAG.td

Context not available.
	def SDTFPTernaryOp : SDTypeProfile<1, 3, [ // fmadd, fnmsub, etc.	def SDTFPTernaryOp : SDTypeProfile<1, 3, [ // fmadd, fnmsub, etc.
	SDTCisSameAs<0, 1>, SDTCisSameAs<0, 2>, SDTCisSameAs<0, 3>, SDTCisFP<0>	SDTCisSameAs<0, 1>, SDTCisSameAs<0, 2>, SDTCisSameAs<0, 3>, SDTCisFP<0>
	]>;	]>;
	def SDTIntUnaryOp : SDTypeProfile<1, 1, [ // ctlz	def SDTIntUnaryOp : SDTypeProfile<1, 1, [ // ctlz, cttz
	SDTCisSameAs<0, 1>, SDTCisInt<0>	SDTCisSameAs<0, 1>, SDTCisInt<0>
	]>;	]>;
	def SDTIntExtendOp : SDTypeProfile<1, 1, [ // sext, zext, anyext	def SDTIntExtendOp : SDTypeProfile<1, 1, [ // sext, zext, anyext
Context not available.

lib/CodeGen/SelectionDAG/LegalizeDAG.cpp

Context not available.
	// This trivially expands to CTTZ.	// This trivially expands to CTTZ.
	return DAG.getNode(ISD::CTTZ, dl, Op.getValueType(), Op);	return DAG.getNode(ISD::CTTZ, dl, Op.getValueType(), Op);
	case ISD::CTTZ: {	case ISD::CTTZ: {
		EVT VT = Op.getValueType();
		unsigned len = VT.getSizeInBits();

		if (TLI.isOperationLegalOrCustom(ISD::CTTZ_ZERO_UNDEF, VT)) {
		EVT SetCCVT = getSetCCResultType(VT);
		SDValue CTTZ = DAG.getNode(ISD::CTTZ_ZERO_UNDEF, dl, VT, Op);
		SDValue Zero = DAG.getConstant(0, dl, VT);
		SDValue SrcIsZero = DAG.getSetCC(dl, SetCCVT, Op, Zero, ISD::SETEQ);
		return DAG.getNode(ISD::SELECT, dl, VT, SrcIsZero,
		DAG.getConstant(len, dl, VT), CTTZ);
		}
		arsenm (Done): I don't understand why you have this or most of the other changes. This shouldn't be substantially different from how we handle ctlz already, i.e. I would expect to see another version of AMDGPUTargetLowering::performCtlzCombine that does essentially the same thing for CTTZ.

	// for now, we use: { return popcount(~x & (x - 1)); }	// for now, we use: { return popcount(~x & (x - 1)); }
	// unless the target has ctlz but not ctpop, in which case we use:	// unless the target has ctlz but not ctpop, in which case we use:
	// { return 32 - nlz(~x & (x-1)); }	// { return 32 - nlz(~x & (x-1)); }
	// Ref: "Hacker's Delight" by Henry Warren	// Ref: "Hacker's Delight" by Henry Warren
	EVT VT = Op.getValueType();
	SDValue Tmp3 = DAG.getNode(ISD::AND, dl, VT,	SDValue Tmp3 = DAG.getNode(ISD::AND, dl, VT,
	DAG.getNOT(dl, Op, VT),	DAG.getNOT(dl, Op, VT),
	DAG.getNode(ISD::SUB, dl, VT, Op,	DAG.getNode(ISD::SUB, dl, VT, Op,
Context not available.
		arsenm (Done): OK, I see the default expansion here isn't the compare and select like I expected. Since the compare+select implementation is likely more instructions with the compare than the sub/ctpop implementation, that one should be tried first.
		arsenm (Done): I don't see this changed

lib/Target/AMDGPU/AMDGPUISelLowering.h

Context not available.
	SDValue LowerFROUND(SDValue Op, SelectionDAG &DAG) const;	SDValue LowerFROUND(SDValue Op, SelectionDAG &DAG) const;
	SDValue LowerFFLOOR(SDValue Op, SelectionDAG &DAG) const;	SDValue LowerFFLOOR(SDValue Op, SelectionDAG &DAG) const;

	SDValue LowerCTLZ(SDValue Op, SelectionDAG &DAG) const;	SDValue LowerCTTZ_CTLZ(SDValue Op, SelectionDAG &DAG) const;

	SDValue LowerINT_TO_FP32(SDValue Op, SelectionDAG &DAG, bool Signed) const;	SDValue LowerINT_TO_FP32(SDValue Op, SelectionDAG &DAG, bool Signed) const;
	SDValue LowerINT_TO_FP64(SDValue Op, SelectionDAG &DAG, bool Signed) const;	SDValue LowerINT_TO_FP64(SDValue Op, SelectionDAG &DAG, bool Signed) const;
Context not available.
	BFM, // Insert a range of bits into a 32-bit word.	BFM, // Insert a range of bits into a 32-bit word.
	FFBH_U32, // ctlz with -1 if input is zero.	FFBH_U32, // ctlz with -1 if input is zero.
	FFBH_I32,	FFBH_I32,
		FFBL_U32, // cttz with -1 if input is zero.
		arsenm (Done): Should be named FFBL_B32 to match the instruction
	MUL_U24,	MUL_U24,
	MUL_I24,	MUL_I24,
	MULHI_U24,	MULHI_U24,
Context not available.

lib/Target/AMDGPU/AMDGPUISelLowering.cpp

Context not available.
	setOperationAction(ISD::CTLZ_ZERO_UNDEF, MVT::i32, Custom);	setOperationAction(ISD::CTLZ_ZERO_UNDEF, MVT::i32, Custom);
		arsenm (Not Done): We should probably fix this at some point to be legal

	if (Subtarget->hasFFBL())	if (Subtarget->hasFFBL())
	setOperationAction(ISD::CTTZ_ZERO_UNDEF, MVT::i32, Legal);	setOperationAction(ISD::CTTZ_ZERO_UNDEF, MVT::i32, Custom);
	arsenm (Done): This should definitely remain legal
	wdng (author, Not Done): I think it doesn't matter if we define it as Custom, because it will be converted to FFBL_U32 during custom lowering and then pattern-matched to the ffbl instruction at the end anyway. However, if we define it as Legal, we will have a "duplicate" or "extra" pattern (FFBL_U32 and CTTZ_ZERO_UNDEF) for generating the ffbl instruction. Is there a specific reason I am overlooking for why it has to be Legal?

		setOperationAction(ISD::CTTZ, MVT::i64, Custom);
		setOperationAction(ISD::CTTZ_ZERO_UNDEF, MVT::i64, Custom);
	setOperationAction(ISD::CTLZ, MVT::i64, Custom);	setOperationAction(ISD::CTLZ, MVT::i64, Custom);
	setOperationAction(ISD::CTLZ_ZERO_UNDEF, MVT::i64, Custom);	setOperationAction(ISD::CTLZ_ZERO_UNDEF, MVT::i64, Custom);

Context not available.
	case ISD::FP_TO_FP16: return LowerFP_TO_FP16(Op, DAG);	case ISD::FP_TO_FP16: return LowerFP_TO_FP16(Op, DAG);
	case ISD::FP_TO_SINT: return LowerFP_TO_SINT(Op, DAG);	case ISD::FP_TO_SINT: return LowerFP_TO_SINT(Op, DAG);
	case ISD::FP_TO_UINT: return LowerFP_TO_UINT(Op, DAG);	case ISD::FP_TO_UINT: return LowerFP_TO_UINT(Op, DAG);
		case ISD::CTTZ:
		arsenm (Not Done): Also need the select with -1 optimization (and corresponding tests) as cttz
		case ISD::CTTZ_ZERO_UNDEF:
	case ISD::CTLZ:	case ISD::CTLZ:
	case ISD::CTLZ_ZERO_UNDEF:	case ISD::CTLZ_ZERO_UNDEF:
	return LowerCTLZ(Op, DAG);	return LowerCTTZ_CTLZ(Op, DAG);
	case ISD::DYNAMIC_STACKALLOC: return LowerDYNAMIC_STACKALLOC(Op, DAG);	case ISD::DYNAMIC_STACKALLOC: return LowerDYNAMIC_STACKALLOC(Op, DAG);
	}	}
	return Op;	return Op;
Context not available.
	return DAG.getNode(ISD::FADD, SL, MVT::f64, Trunc, Add);	return DAG.getNode(ISD::FADD, SL, MVT::f64, Trunc, Add);
	}	}

	SDValue AMDGPUTargetLowering::LowerCTLZ(SDValue Op, SelectionDAG &DAG) const {	static bool isCtlzOpc(unsigned Opc) {
		arsenm (Done): This is mostly copy-paste from LowerCTLZ. These should be factored into a common helper.
		return Opc == ISD::CTLZ || Opc == ISD::CTLZ_ZERO_UNDEF;
		}

		SDValue AMDGPUTargetLowering:: LowerCTTZ_CTLZ(SDValue Op, SelectionDAG &DAG) const {
	SDLoc SL(Op);	SDLoc SL(Op);
	SDValue Src = Op.getOperand(0);	SDValue Src = Op.getOperand(0);
	bool ZeroUndef = Op.getOpcode() == ISD::CTLZ_ZERO_UNDEF;	bool ZeroUndef = Op.getOpcode() == ISD::CTTZ_ZERO_UNDEF ||
		Op.getOpcode() == ISD::CTLZ_ZERO_UNDEF;
		arsenm (Not Done): You didn't enable custom lowering for i64, so this is dead code. You also didn't add any tests for it. In either case, it should be in a separate patch from the i32 handling.

		unsigned ISDOpc, AMDGPUISDOpc;
		arsenm (Done): Extra space after ::
		if (isCtlzOpc(Op.getOpcode())) {
		ISDOpc = ISD::CTLZ_ZERO_UNDEF;
		AMDGPUISDOpc = AMDGPUISD::FFBH_U32;
		} else {
		ISDOpc = ISD::CTTZ_ZERO_UNDEF;
		AMDGPUISDOpc = AMDGPUISD::FFBL_U32;
		}

	if (ZeroUndef && Src.getValueType() == MVT::i32)	if (ZeroUndef && Src.getValueType() == MVT::i32)
		arsenm (Done): Don't include AMDGPUISD in variable name
	return DAG.getNode(AMDGPUISD::FFBH_U32, SL, MVT::i32, Src);	return DAG.getNode(AMDGPUISDOpc, SL, MVT::i32, Src);
		arsenm (Done): Missing space before {

		arsenm (Done): Indentation wrong
	SDValue Vec = DAG.getNode(ISD::BITCAST, SL, MVT::v2i32, Src);	SDValue Vec = DAG.getNode(ISD::BITCAST, SL, MVT::v2i32, Src);
		arsenm (Done): llvm_unreachable

Context not available.
	EVT SetCCVT = getSetCCResultType(DAG.getDataLayout(),	EVT SetCCVT = getSetCCResultType(DAG.getDataLayout(),
	*DAG.getContext(), MVT::i32);	*DAG.getContext(), MVT::i32);

	SDValue Hi0 = DAG.getSetCC(SL, SetCCVT, Hi, Zero, ISD::SETEQ);	SDValue Hi0;
		if (isCtlzOpc(Op.getOpcode()))
		Hi0 = DAG.getSetCC(SL, SetCCVT, Hi, Zero, ISD::SETEQ);
		else
		Hi0 = DAG.getSetCC(SL, SetCCVT, Hi, One, ISD::SETEQ);

	SDValue CtlzLo = DAG.getNode(ISD::CTLZ_ZERO_UNDEF, SL, MVT::i32, Lo);	SDValue OprLo = DAG.getNode(ISDOpc, SL, MVT::i32, Lo);
	SDValue CtlzHi = DAG.getNode(ISD::CTLZ_ZERO_UNDEF, SL, MVT::i32, Hi);	SDValue OprHi = DAG.getNode(ISDOpc, SL, MVT::i32, Hi);

	const SDValue Bits32 = DAG.getConstant(32, SL, MVT::i32);	const SDValue Bits32 = DAG.getConstant(32, SL, MVT::i32);
	SDValue Add = DAG.getNode(ISD::ADD, SL, MVT::i32, CtlzLo, Bits32);	SDValue Add, NewOpr;
		arsenm (Done): Select between Zero and One as input to getSetCC
		if (isCtlzOpc(Op.getOpcode())) {
	// ctlz(x) = hi_32(x) == 0 ? ctlz(lo_32(x)) + 32 : ctlz(hi_32(x))	Add = DAG.getNode(ISD::ADD, SL, MVT::i32, OprLo, Bits32);
	SDValue NewCtlz = DAG.getNode(ISD::SELECT, SL, MVT::i32, Hi0, Add, CtlzHi);	//// ctlz(x) = hi_32(x) == 0 ? ctlz(lo_32(x)) + 32 : ctlz(hi_32(x)
		NewOpr = DAG.getNode(ISD::SELECT, SL, MVT::i32, Hi0, Add, OprHi);
		} else {
		Add = DAG.getNode(ISD::ADD, SL, MVT::i32, OprHi, Bits32);
		//// cttz(x) = lo_32(x) == 0 ? cttz(hi_32(x)) + 32 : cttz(lo_32(x))
		NewOpr = DAG.getNode(ISD::SELECT, SL, MVT::i32, Hi0, Add, OprLo);
		}
		arsenm (Done): Double // and missing closing )

	if (!ZeroUndef) {	if (!ZeroUndef) {
	// Test if the full 64-bit input is zero.	// Test if the full 64-bit input is zero.
		arsenm (Done): Double //
Context not available.

	// The instruction returns -1 for 0 input, but the defined intrinsic	// The instruction returns -1 for 0 input, but the defined intrinsic
	// behavior is to return the number of bits.	// behavior is to return the number of bits.
	NewCtlz = DAG.getNode(ISD::SELECT, SL, MVT::i32,	NewOpr = DAG.getNode(ISD::SELECT, SL, MVT::i32,
	SrcIsZero, Bits32, NewCtlz);	SrcIsZero, Bits32, NewOpr);
	}	}

	return DAG.getNode(ISD::ZERO_EXTEND, SL, MVT::i64, NewCtlz);	return DAG.getNode(ISD::ZERO_EXTEND, SL, MVT::i64, NewOpr);
	}	}

	SDValue AMDGPUTargetLowering::LowerINT_TO_FP32(SDValue Op, SelectionDAG &DAG,	SDValue AMDGPUTargetLowering::LowerINT_TO_FP32(SDValue Op, SelectionDAG &DAG,
Context not available.
	return false;	return false;
	}	}

	static bool isCtlzOpc(unsigned Opc) {
	return Opc == ISD::CTLZ || Opc == ISD::CTLZ_ZERO_UNDEF;
	}

	SDValue AMDGPUTargetLowering::getFFBH_U32(SelectionDAG &DAG,	SDValue AMDGPUTargetLowering::getFFBH_U32(SelectionDAG &DAG,
	SDValue Op,	SDValue Op,
	const SDLoc &DL) const {	const SDLoc &DL) const {
		arsenm (Done): You could just pass in the new opcode directly rather than selecting it again
		arsenm (Done): Commented out code
		arsenm (Not Done): You didn't add tests for this part
		arsenm (Done): Don't include AMDGPUISD in the name of this
		arsenm (Done): Ditto
Context not available.
	NODE_NAME_CASE(BFM)	NODE_NAME_CASE(BFM)
	NODE_NAME_CASE(FFBH_U32)	NODE_NAME_CASE(FFBH_U32)
	NODE_NAME_CASE(FFBH_I32)	NODE_NAME_CASE(FFBH_I32)
		NODE_NAME_CASE(FFBL_U32)
	NODE_NAME_CASE(MUL_U24)	NODE_NAME_CASE(MUL_U24)
	NODE_NAME_CASE(MUL_I24)	NODE_NAME_CASE(MUL_I24)
	NODE_NAME_CASE(MULHI_U24)	NODE_NAME_CASE(MULHI_U24)
Context not available.

lib/Target/AMDGPU/AMDGPUInstrInfo.td

Context not available.
	def AMDGPUffbh_u32 : SDNode<"AMDGPUISD::FFBH_U32", SDTIntUnaryOp>;	def AMDGPUffbh_u32 : SDNode<"AMDGPUISD::FFBH_U32", SDTIntUnaryOp>;
	def AMDGPUffbh_i32 : SDNode<"AMDGPUISD::FFBH_I32", SDTIntUnaryOp>;	def AMDGPUffbh_i32 : SDNode<"AMDGPUISD::FFBH_I32", SDTIntUnaryOp>;

		def AMDGPUffbl_u32 : SDNode<"AMDGPUISD::FFBL_U32", SDTIntUnaryOp>;

		arsenmUnsubmitted Done Reply Inline Actions This isn't a signed/unsigned operation. There is just one v_ffbl_b32. arsenm: This isn't a signed/unsigned operation. There is just one v_ffbl_b32.
	// Signed and unsigned 24-bit multiply. The highest 8-bits are ignore	// Signed and unsigned 24-bit multiply. The highest 8-bits are ignore
	// when performing the mulitply. The result is a 32-bit value.	// when performing the mulitply. The result is a 32-bit value.
	def AMDGPUmul_u24 : SDNode<"AMDGPUISD::MUL_U24", SDTIntBinOp,	def AMDGPUmul_u24 : SDNode<"AMDGPUISD::MUL_U24", SDTIntBinOp,
Context not available.

lib/Target/AMDGPU/EvergreenInstructions.td

Context not available.
	def FLT16_TO_FLT32 : R600_1OP_Helper <0xA3, "FLT16_TO_FLT32", f16_to_fp, VecALU>;	def FLT16_TO_FLT32 : R600_1OP_Helper <0xA3, "FLT16_TO_FLT32", f16_to_fp, VecALU>;
	def BCNT_INT : R600_1OP_Helper <0xAA, "BCNT_INT", ctpop, VecALU>;	def BCNT_INT : R600_1OP_Helper <0xAA, "BCNT_INT", ctpop, VecALU>;
	def FFBH_UINT : R600_1OP_Helper <0xAB, "FFBH_UINT", AMDGPUffbh_u32, VecALU>;	def FFBH_UINT : R600_1OP_Helper <0xAB, "FFBH_UINT", AMDGPUffbh_u32, VecALU>;
	def FFBL_INT : R600_1OP_Helper <0xAC, "FFBL_INT", cttz_zero_undef, VecALU>;	def FFBL_INT : R600_1OP_Helper <0xAC, "FFBL_INT", AMDGPUffbl_u32, VecALU>;

	let hasSideEffects = 1 in {	let hasSideEffects = 1 in {
	def MOVA_INT_eg : R600_1OP <0xCC, "MOVA_INT", [], VecALU>;	def MOVA_INT_eg : R600_1OP <0xCC, "MOVA_INT", [], VecALU>;
Context not available.

lib/Target/AMDGPU/SOPInstructions.td

Context not available.

	def S_FF0_I32_B32 : SOP1_32 <"s_ff0_i32_b32">;	def S_FF0_I32_B32 : SOP1_32 <"s_ff0_i32_b32">;
	def S_FF0_I32_B64 : SOP1_32_64 <"s_ff0_i32_b64">;	def S_FF0_I32_B64 : SOP1_32_64 <"s_ff0_i32_b64">;
		def S_FF1_I32_B64 : SOP1_32_64 <"s_ff1_i32_b64">;

	def S_FF1_I32_B32 : SOP1_32 <"s_ff1_i32_b32",	def S_FF1_I32_B32 : SOP1_32 <"s_ff1_i32_b32",
	[(set i32:$sdst, (cttz_zero_undef i32:$src0))]	[(set i32:$sdst, (AMDGPUffbl_u32 i32:$src0))]
	>;	>;
	def S_FF1_I32_B64 : SOP1_32_64 <"s_ff1_i32_b64">;

	def S_FLBIT_I32_B32 : SOP1_32 <"s_flbit_i32_b32",	def S_FLBIT_I32_B32 : SOP1_32 <"s_flbit_i32_b32",
	[(set i32:$sdst, (AMDGPUffbh_u32 i32:$src0))]	[(set i32:$sdst, (AMDGPUffbh_u32 i32:$src0))]
Context not available.

test/CodeGen/AMDGPU/cttz_zero_undef.ll

Context not available.
	; RUN: llc -march=r600 -mcpu=cypress -verify-machineinstrs < %s | FileCheck -check-prefix=EG -check-prefix=FUNC %s	; RUN: llc -march=r600 -mcpu=cypress -verify-machineinstrs < %s | FileCheck -check-prefix=EG -check-prefix=FUNC %s

	declare i32 @llvm.cttz.i32(i32, i1) nounwind readnone	declare i32 @llvm.cttz.i32(i32, i1) nounwind readnone
		declare i64 @llvm.cttz.i64(i64, i1) nounwind readnone
	declare <2 x i32> @llvm.cttz.v2i32(<2 x i32>, i1) nounwind readnone	declare <2 x i32> @llvm.cttz.v2i32(<2 x i32>, i1) nounwind readnone
	declare <4 x i32> @llvm.cttz.v4i32(<4 x i32>, i1) nounwind readnone	declare <4 x i32> @llvm.cttz.v4i32(<4 x i32>, i1) nounwind readnone
	declare i32 @llvm.r600.read.tidig.x() nounwind readnone	declare i32 @llvm.r600.read.tidig.x() nounwind readnone
Context not available.
	store <4 x i32> %cttz, <4 x i32> addrspace(1)* %out, align 16	store <4 x i32> %cttz, <4 x i32> addrspace(1)* %out, align 16
	ret void	ret void
	}	}

		; FUNC-LABEL: {{^}}s_cttz_zero_undef_i32_with_select:
		; SI: s_ff1_i32_b32
		; EG: MEM_RAT_CACHELESS STORE_RAW [[RESULT:T[0-9]+\.[XYZW]]]
		; EG: FFBL_INT {{\? }}[[RESULT]]
		define amdgpu_kernel void @s_cttz_zero_undef_i32_with_select(i32 addrspace(1)* noalias %out, i32 %val) nounwind {
		arsenm (Done): This needs to check more
		%cttz = tail call i32 @llvm.cttz.i32(i32 %val, i1 true) nounwind readnone
		%cttz_ret = icmp ne i32 %val, 0
		%ret = select i1 %cttz_ret, i32 %cttz, i32 32
		store i32 %cttz, i32 addrspace(1)* %out, align 4
		ret void
		}

		; FUNC-LABEL: {{^}}v_cttz_zero_undef_i32_with_select:
		; SI: v_ffbl_b32_e32
		; EG: MEM_RAT_CACHELESS STORE_RAW [[RESULT:T[0-9]+\.[XYZW]]]
		define amdgpu_kernel void @v_cttz_zero_undef_i32_with_select(i32 addrspace(1)* noalias %out, i32 addrspace(1)* nocapture readonly %arrayidx) nounwind {
		%val = load i32, i32 addrspace(1)* %arrayidx, align 1
		arsenm (Done): This needs to check more
		%cttz = tail call i32 @llvm.cttz.i32(i32 %val, i1 true) nounwind readnone
		%cttz_ret = icmp ne i32 %val, 0
		%ret = select i1 %cttz_ret, i32 %cttz, i32 32
		store i32 %ret, i32 addrspace(1)* %out, align 4
		ret void
		}
		arsenm (Done): Need i64 tests

		; FUNC-LABEL: {{^}}v_cttz_zero_undef_i64_with_select:
		; SI: v_ffbl_b32_e32
		; SI: v_ffbl_b32_e32
		; EG: MEM_RAT_CACHELESS STORE_RAW [[RESULT:T[0-9]+\.[XYZW]]]
		define amdgpu_kernel void @v_cttz_zero_undef_i64_with_select(i64 addrspace(1)* noalias %out, i64 addrspace(1)* nocapture readonly %arrayidx) nounwind {
		arsenm (Done): Missing scalar version
		%val = load i64, i64 addrspace(1)* %arrayidx, align 1
		%cttz = tail call i64 @llvm.cttz.i64(i64 %val, i1 true) nounwind readnone
		%cttz_ret = icmp ne i64 %val, 0
		%ret = select i1 %cttz_ret, i64 %cttz, i64 32
		store i64 %ret, i64 addrspace(1)* %out, align 4
		ret void
		}

Context not available.
		arsenm (Done): Also should have some tests with i8/i16
		arsenm (Done): Ditto
		arsenm (Done): This isn't checking the outputs and select
		arsenm (Done): Using undefined VAL
		arsenm (Done): Undefined VAL
		arsenm (Done): Using 2 -DAGs with identical lines doesn't do anything. It will pass with only one
		wdng (author, Done): No, it won't work if I remove the -DAG, because the generated instructions get interleaved with each other.
