This is an archive of the discontinued LLVM Phabricator instance.

Implement custom lowering for ISD::CTTZ_ZERO_UNDEF and ISD::CTTZ.
ClosedPublic

Authored by wdng on Aug 31 2017, 12:27 PM.

Details

Summary

During the DAGCombine optimization phase, the LLVM compiler converts ISD::CTTZ_ZERO_UNDEF to ISD::CTTZ and then expands during the Legalization phase, which prevents the v_ffbl_b32 instruction generation. This patch implements custom lowering for ISD::CTTZ_ZERO_UNDEF and ISD::CTTZ.

Diff Detail

Repository
rL LLVM

Event Timeline

wdng created this revision.Aug 31 2017, 12:27 PM
arsenm added inline comments.Aug 31 2017, 1:34 PM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
16680 ↗(On Diff #113451)

The existing code was the correct way to do this

wdng added inline comments.Aug 31 2017, 1:54 PM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
16680 ↗(On Diff #113451)

With original code, we will have the following code transformations:

Initial selection DAG: BB#0 'sample_test:entry'
SelectionDAG has 50 nodes:
  t0: ch = EntryToken
  t2: i64,ch = CopyFromReg t0, Register:i64 %vreg2
  t3: i64 = Constant<0>
  t5: i64,ch = load<LD8[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t2, undef:i64
  t6: i64,ch = merge_values t5, t5:1
    t8: i64 = add t2, Constant:i64<8>
  t9: i64,ch = load<LD8[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t8, undef:i64
  t10: i64,ch = merge_values t9, t9:1
  t11: ch = TokenFactor t6:1, t10:1
      t13: i64 = llvm.amdgcn.dispatch.ptr TargetConstant:i32<359>
    t19: i64 = add t13, Constant:i64<4>
  t20: i16,ch = load<LD2[%4(addrspace=2)](align=4)(tbaa=<0x4436db8>)> t11, t19, undef:i64
    t25: i64 = llvm.amdgcn.implicitarg.ptr TargetConstant:i32<460>
  t27: i64,ch = load<LD8[%11(addrspace=2)](tbaa=<0x4435518>)> t11, t25, undef:i64
  t29: i64 = Constant<32>
              t17: i32 = llvm.amdgcn.workgroup.id.x TargetConstant:i32<505>
              t21: i32 = zero_extend t20
            t22: i32 = mul t17, t21
            t15: i32 = llvm.amdgcn.workitem.id.x TargetConstant:i32<508>
          t23: i32 = add t22, t15 
        t26: i64 = zero_extend t23 
      t28: i64 = add t27, t26 
    t31: i64 = shl t28, Constant:i32<32>
  t32: i64 = sra t31, Constant:i32<32>
    t33: i64 = add t6, t32 
  t34: i8,ch = load<LD1[%arrayidx(addrspace=1)](tbaa=<0x4435498>)> t11, t33, undef:i64
  t35: i32 = zero_extend t34 
    t39: i1 = setcc t35, Constant:i32<0>, setne:ch
    t36: i32 = cttz_zero_undef t35
  t40: i32 = select t39, t36, Constant:i32<32>
  t43: i1 = setcc Constant:i32<8>, t40, setult:ch
      t47: ch = TokenFactor t20:1, t27:1, t34:1
        t44: i32 = umin t40, Constant:i32<8>
      t45: i8 = truncate t44 
      t46: i64 = add t10, t32 
    t48: ch = store<ST1[%arrayidx3(addrspace=1)](tbaa=<0x4435498>)> t47, t45, t46, undef:i64
  t49: ch = ENDPGM t48

Optimized lowered selection DAG: BB#0 'sample_test:entry'
SelectionDAG has 35 nodes:
  t0: ch = EntryToken
  t2: i64,ch = CopyFromReg t0, Register:i64 %vreg2
  t5: i64,ch = load<LD8[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t2, undef:i64
    t8: i64 = add t2, Constant:i64<8>
  t9: i64,ch = load<LD8[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t8, undef:i64
  t11: ch = TokenFactor t5:1, t9:1
    t33: i64 = add t5, t63 
  t54: i32,ch = load<LD1[%arrayidx(addrspace=1)](tbaa=<0x4435498>), zext from i8> t11, t33, undef:i64
    t25: i64 = llvm.amdgcn.implicitarg.ptr TargetConstant:i32<460>
  t62: i32,ch = load<LD4[%11(addrspace=2)](align=8)(tbaa=<0x4435518>)> t11, t25, undef:i64
          t17: i32 = llvm.amdgcn.workgroup.id.x TargetConstant:i32<505>
        t22: i32 = mul t17, t64 
        t15: i32 = llvm.amdgcn.workitem.id.x TargetConstant:i32<508>
      t23: i32 = add t22, t15
    t60: i32 = add t62, t23
  t63: i64 = sign_extend t60, ValueType:ch:i32
      t13: i64 = llvm.amdgcn.dispatch.ptr TargetConstant:i32<359>
    t19: i64 = add t13, Constant:i64<4>
  t64: i32,ch = load<LD2[%4(addrspace=2)](align=4)(tbaa=<0x4436db8>), zext from i16> t11, t19, undef:i64
      t47: ch = TokenFactor t64:1, t62:1, t54:1
        t53: i32 = cttz t54
      t44: i32 = umin t53, Constant:i32<8>
      t46: i64 = add t9, t63
    t50: ch = store<ST1[%arrayidx3(addrspace=1)](tbaa=<0x4435498>), trunc to i8> t47, t44, t46, undef:i64
  t49: ch = ENDPGM t50

We won't be able to generate s/v_ffbl instructions. I found llvm.cttz.i32 has all been converted to cttz_zero_undef instread of 'cttz'.

wdng added a comment.Sep 1 2017, 9:01 AM

If we don't want to change the original way of implementation, we may want to do a custom lowering for ISD::CTTZ at AMDGPU backend to ISD::CTTZ_ZERO_UNDE?

b-sumner edited edge metadata.Sep 1 2017, 9:32 AM

I think the actual problem is the implementation of ISD::CTTZ not using v_ffbl and not this transformation.

If v_ffbl is able to produce a defined answer of bit width for 0, then you want to match it with cttz and have the operation action for cttz_zero_undef set to Expand. That will turn all cttz_zero_undef calls into cttz.

If v_ffbl is not capable of handling zero, then you want cttz_zero_undef set to Legal, and cttz set to Expand which will make use of cttz_zero_undef and a select. Or you can make cttz Custom and do your own lowering.

arsenm edited edge metadata.Sep 1 2017, 10:20 AM

I think the instruction behavior is to return -1 on 0 input. IIRC we handle this and fold that for ctlz already, just not cttz.

wdng updated this revision to Diff 113579.Sep 1 2017, 12:45 PM

Just add a custom lowering ISD:CTTZ to ISD::CTTZ_ZERO_UNDEF

In D37348#859119, @wdng wrote:

Just add a custom lowering ISD:CTTZ to ISD::CTTZ_ZERO_UNDEF

I don't think that will help. Why not follow exactly how CTLZ* is handled now and implement AMDGPUTargetLowering::LowerCTTZ making use of ffbl?

wdng updated this revision to Diff 114215.Sep 7 2017, 11:13 AM
wdng retitled this revision from Tighten conditions for converting ISD::CTTZ_ZERO_UNDEF to ISD::CTTZ to Implement custom lowering for ISD::CTTZ_ZERO_UNDEF and ISD::CTTZ..
wdng added a reviewer: t-tye.Sep 8 2017, 12:10 PM
arsenm added inline comments.Sep 8 2017, 1:22 PM
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
2784–2793

I don't understand why you have this or most of the other changes. This shouldn't be substantially different from how we handle ctlz already. i.e. I would expect to see another version of AMDGPUTargetLowering::performCtlzCombine that does essentially the same thing for CTTZ.

lib/Target/AMDGPU/AMDGPUISelLowering.cpp
420

This should definitely remain legal

2032

You didn't enable custom lowering for i64, so this is dead code. You also didn't a dd any tests for it. In either case, it should be in a separate patch from the i32 handling.

lib/Target/AMDGPU/AMDGPUInstrInfo.td
301–302

This isn't a signed/unsigned operation. There is just one v_ffbl_b32.

craig.topper resigned from this revision.Sep 8 2017, 10:48 PM
wdng marked 2 inline comments as done.Sep 11 2017, 9:05 AM
wdng added inline comments.
lib/Target/AMDGPU/AMDGPUISelLowering.cpp
420

I think it doesn't matter to define it as Custom, because it will be converted to FFBL_U32 during the custom lowering and then pattern matching to the ffbl instruction anyway at the end. However, if we defined it as Legal, we will have a "duplicate" or "extra" pattern (FFBL_U32 and CTTZ_ZERO_UNDEF) for generating the ffbl instruction. Is there any specific reason that I neglect here that we have to define it as Legal?

arsenm added inline comments.Sep 12 2017, 7:05 PM
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
2803–2806

OK, I see the default expansion here isn't the compare and select like I expected. Since the compare+select implementation is likely more instructions with the compare than the sub/ctpop implementation, that one should be tried first.

lib/Target/AMDGPU/AMDGPUISelLowering.cpp
417

We should probably fix this at some point to be legal

1113–1114

Also need the select with -1 optimization (and corresponding tests) as cttz

2024

This is mostly copy past from LowerCTLZ. These should be factored into a common helper.

test/CodeGen/AMDGPU/cttz_zero_undef.ll
107

Need i64 tests

wdng updated this revision to Diff 115230.Sep 14 2017, 8:40 AM
wdng marked 5 inline comments as done.

Changes based on code review feedback.

wdng updated this revision to Diff 115231.Sep 14 2017, 8:42 AM

Upload a full diff.

Missing performCtlzCombine equivalent

lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
2803–2806

I don't see this changed

test/CodeGen/AMDGPU/cttz_zero_undef.ll
112

Missing scalar version

wdng updated this revision to Diff 115294.Sep 14 2017, 2:41 PM
wdng edited the summary of this revision. (Show Details)

Address code reviews.

wdng updated this revision to Diff 115296.Sep 14 2017, 2:47 PM

Fix the issues that variables are not capitalized.

wdng marked 2 inline comments as done.Sep 14 2017, 2:50 PM
arsenm added inline comments.Sep 15 2017, 10:17 AM
lib/Target/AMDGPU/AMDGPUISelLowering.cpp
2038–2044

Indentation wrong

2045

llvm_unreachable

2063–2066

Select between Zero and One as input to getSetCC

3019–3020

You could just pass in the new opcode directly rather than selecting it again

3021

Commented out code

3042

You didn't add tests for this part

lib/Target/AMDGPU/AMDGPUISelLowering.h
374

Should be name FFBL_B32 to match the instruction

test/CodeGen/AMDGPU/cttz_zero_undef.ll
134

Also should have some tests with i8/i16

wdng updated this revision to Diff 115908.Sep 19 2017, 2:53 PM
wdng marked 7 inline comments as done.

Will create another separate ticket to fix the v_ffbl_sdwa instruction generation.

wdng edited reviewers, added: kzhuravl; removed: craig.topper.Sep 22 2017, 12:29 PM
arsenm added a comment.Oct 4 2017, 2:50 PM

Needs more comprehensive check lines. Just checking the instructions won't demonstrate that the extra instructions you're trying to avoid aren't there

lib/Target/AMDGPU/AMDGPUISelLowering.cpp
3028

Don't include AMDGPUISD in the name of this

3053

Ditto

test/CodeGen/AMDGPU/cttz_zero_undef.ll
85

This needs to check more

97

This needs to check more

121–122

Ditto

wdng updated this revision to Diff 118277.Oct 9 2017, 2:53 PM
wdng marked 5 inline comments as done.

Address code reivews.

arsenm added inline comments.Oct 11 2017, 11:05 AM
lib/Target/AMDGPU/AMDGPUISelLowering.cpp
2031

Extra space after ::

2040

Don't includ eAMDGPUISD in variable name

2041

Missing space before {

2073

Double // and missing closing )

2077

Double //

test/CodeGen/AMDGPU/cttz_zero_undef.ll
1–2

Add -enable-var-scope to all of the FileCheck lines. Several of these tests are broken

171–172

This isn't checking the outputs and select

184

Using undefined VAL

198

Undefined VAL

wdng updated this revision to Diff 118718.Oct 11 2017, 4:14 PM
wdng marked 5 inline comments as done.

Address code reviews.

wdng marked 3 inline comments as done.Oct 11 2017, 4:14 PM
arsenm added inline comments.Oct 11 2017, 4:24 PM
test/CodeGen/AMDGPU/cttz_zero_undef.ll
175–176

Using 2 -DAGs with identical lines doesn't do anything. It will pass with only one

wdng added inline comments.Oct 11 2017, 4:27 PM
test/CodeGen/AMDGPU/cttz_zero_undef.ll
175–176

No, it won't work if I remove the -DAG. As the generated instructions get interleaved with each other.

wdng updated this revision to Diff 118729.Oct 11 2017, 5:08 PM
wdng marked 2 inline comments as done.

Remove duplicate check lines.

wdng updated this revision to Diff 118733.Oct 11 2017, 5:45 PM

Removed -DAG checks completely.

arsenm accepted this revision.Oct 12 2017, 10:39 AM

LGTM

This revision is now accepted and ready to land.Oct 12 2017, 10:39 AM
This revision was automatically updated to reflect the committed changes.