This is an archive of the discontinued LLVM Phabricator instance.

Implement custom lowering for ISD::CTTZ_ZERO_UNDEF and ISD::CTTZ.
ClosedPublic

Authored by wdng on Aug 31 2017, 12:27 PM.

Details

Summary

During the DAGCombine optimization phase, the LLVM compiler converts ISD::CTTZ_ZERO_UNDEF to ISD::CTTZ and then expands during the Legalization phase, which prevents the v_ffbl_b32 instruction generation. This patch implements custom lowering for ISD::CTTZ_ZERO_UNDEF and ISD::CTTZ.

Diff Detail

Repository
rL LLVM

Event Timeline

wdng created this revision.Aug 31 2017, 12:27 PM
arsenm added inline comments.Aug 31 2017, 1:34 PM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
16680 ↗(On Diff #113451)

The existing code was the correct way to do this

wdng added inline comments.Aug 31 2017, 1:54 PM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
16680 ↗(On Diff #113451)

With original code, we will have the following code transformations:

Initial selection DAG: BB#0 'sample_test:entry'
SelectionDAG has 50 nodes:
  t0: ch = EntryToken
  t2: i64,ch = CopyFromReg t0, Register:i64 %vreg2
  t3: i64 = Constant<0>
  t5: i64,ch = load<LD8[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t2, undef:i64
  t6: i64,ch = merge_values t5, t5:1
    t8: i64 = add t2, Constant:i64<8>
  t9: i64,ch = load<LD8[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t8, undef:i64
  t10: i64,ch = merge_values t9, t9:1
  t11: ch = TokenFactor t6:1, t10:1
      t13: i64 = llvm.amdgcn.dispatch.ptr TargetConstant:i32<359>
    t19: i64 = add t13, Constant:i64<4>
  t20: i16,ch = load<LD2[%4(addrspace=2)](align=4)(tbaa=<0x4436db8>)> t11, t19, undef:i64
    t25: i64 = llvm.amdgcn.implicitarg.ptr TargetConstant:i32<460>
  t27: i64,ch = load<LD8[%11(addrspace=2)](tbaa=<0x4435518>)> t11, t25, undef:i64
  t29: i64 = Constant<32>
              t17: i32 = llvm.amdgcn.workgroup.id.x TargetConstant:i32<505>
              t21: i32 = zero_extend t20
            t22: i32 = mul t17, t21
            t15: i32 = llvm.amdgcn.workitem.id.x TargetConstant:i32<508>
          t23: i32 = add t22, t15 
        t26: i64 = zero_extend t23 
      t28: i64 = add t27, t26 
    t31: i64 = shl t28, Constant:i32<32>
  t32: i64 = sra t31, Constant:i32<32>
    t33: i64 = add t6, t32 
  t34: i8,ch = load<LD1[%arrayidx(addrspace=1)](tbaa=<0x4435498>)> t11, t33, undef:i64
  t35: i32 = zero_extend t34 
    t39: i1 = setcc t35, Constant:i32<0>, setne:ch
    t36: i32 = cttz_zero_undef t35
  t40: i32 = select t39, t36, Constant:i32<32>
  t43: i1 = setcc Constant:i32<8>, t40, setult:ch
      t47: ch = TokenFactor t20:1, t27:1, t34:1
        t44: i32 = umin t40, Constant:i32<8>
      t45: i8 = truncate t44 
      t46: i64 = add t10, t32 
    t48: ch = store<ST1[%arrayidx3(addrspace=1)](tbaa=<0x4435498>)> t47, t45, t46, undef:i64
  t49: ch = ENDPGM t48

Optimized lowered selection DAG: BB#0 'sample_test:entry'
SelectionDAG has 35 nodes:
  t0: ch = EntryToken
  t2: i64,ch = CopyFromReg t0, Register:i64 %vreg2
  t5: i64,ch = load<LD8[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t2, undef:i64
    t8: i64 = add t2, Constant:i64<8>
  t9: i64,ch = load<LD8[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t8, undef:i64
  t11: ch = TokenFactor t5:1, t9:1
    t33: i64 = add t5, t63 
  t54: i32,ch = load<LD1[%arrayidx(addrspace=1)](tbaa=<0x4435498>), zext from i8> t11, t33, undef:i64
    t25: i64 = llvm.amdgcn.implicitarg.ptr TargetConstant:i32<460>
  t62: i32,ch = load<LD4[%11(addrspace=2)](align=8)(tbaa=<0x4435518>)> t11, t25, undef:i64
          t17: i32 = llvm.amdgcn.workgroup.id.x TargetConstant:i32<505>
        t22: i32 = mul t17, t64 
        t15: i32 = llvm.amdgcn.workitem.id.x TargetConstant:i32<508>
      t23: i32 = add t22, t15
    t60: i32 = add t62, t23
  t63: i64 = sign_extend t60, ValueType:ch:i32
      t13: i64 = llvm.amdgcn.dispatch.ptr TargetConstant:i32<359>
    t19: i64 = add t13, Constant:i64<4>
  t64: i32,ch = load<LD2[%4(addrspace=2)](align=4)(tbaa=<0x4436db8>), zext from i16> t11, t19, undef:i64
      t47: ch = TokenFactor t64:1, t62:1, t54:1
        t53: i32 = cttz t54
      t44: i32 = umin t53, Constant:i32<8>
      t46: i64 = add t9, t63
    t50: ch = store<ST1[%arrayidx3(addrspace=1)](tbaa=<0x4435498>), trunc to i8> t47, t44, t46, undef:i64
  t49: ch = ENDPGM t50

We won't be able to generate s/v_ffbl instructions. I found llvm.cttz.i32 has all been converted to cttz_zero_undef instread of 'cttz'.

wdng added a comment.Sep 1 2017, 9:01 AM

If we don't want to change the original way of implementation, we may want to do a custom lowering for ISD::CTTZ at AMDGPU backend to ISD::CTTZ_ZERO_UNDE?

b-sumner edited edge metadata.Sep 1 2017, 9:32 AM

I think the actual problem is the implementation of ISD::CTTZ not using v_ffbl and not this transformation.

If v_ffbl is able to produce a defined answer of bit width for 0, then you want to match it with cttz and have the operation action for cttz_zero_undef set to Expand. That will turn all cttz_zero_undef calls into cttz.

If v_ffbl is not capable of handling zero, then you want cttz_zero_undef set to Legal, and cttz set to Expand which will make use of cttz_zero_undef and a select. Or you can make cttz Custom and do your own lowering.

arsenm edited edge metadata.Sep 1 2017, 10:20 AM

I think the instruction behavior is to return -1 on 0 input. IIRC we handle this and fold that for ctlz already, just not cttz.

wdng updated this revision to Diff 113579.Sep 1 2017, 12:45 PM

Just add a custom lowering ISD:CTTZ to ISD::CTTZ_ZERO_UNDEF

In D37348#859119, @wdng wrote:

Just add a custom lowering ISD:CTTZ to ISD::CTTZ_ZERO_UNDEF

I don't think that will help. Why not follow exactly how CTLZ* is handled now and implement AMDGPUTargetLowering::LowerCTTZ making use of ffbl?

wdng updated this revision to Diff 114215.Sep 7 2017, 11:13 AM
wdng retitled this revision from Tighten conditions for converting ISD::CTTZ_ZERO_UNDEF to ISD::CTTZ to Implement custom lowering for ISD::CTTZ_ZERO_UNDEF and ISD::CTTZ..
wdng added a reviewer: t-tye.Sep 8 2017, 12:10 PM
arsenm added inline comments.Sep 8 2017, 1:22 PM
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
2784–2793 ↗(On Diff #114215)

I don't understand why you have this or most of the other changes. This shouldn't be substantially different from how we handle ctlz already. i.e. I would expect to see another version of AMDGPUTargetLowering::performCtlzCombine that does essentially the same thing for CTTZ.

lib/Target/AMDGPU/AMDGPUISelLowering.cpp
420 ↗(On Diff #114215)

This should definitely remain legal

2029 ↗(On Diff #114215)

You didn't enable custom lowering for i64, so this is dead code. You also didn't a dd any tests for it. In either case, it should be in a separate patch from the i32 handling.

lib/Target/AMDGPU/AMDGPUInstrInfo.td
301–302 ↗(On Diff #114215)

This isn't a signed/unsigned operation. There is just one v_ffbl_b32.

craig.topper resigned from this revision.Sep 8 2017, 10:48 PM
wdng marked 2 inline comments as done.Sep 11 2017, 9:05 AM
wdng added inline comments.
lib/Target/AMDGPU/AMDGPUISelLowering.cpp
420 ↗(On Diff #114215)

I think it doesn't matter to define it as Custom, because it will be converted to FFBL_U32 during the custom lowering and then pattern matching to the ffbl instruction anyway at the end. However, if we defined it as Legal, we will have a "duplicate" or "extra" pattern (FFBL_U32 and CTTZ_ZERO_UNDEF) for generating the ffbl instruction. Is there any specific reason that I neglect here that we have to define it as Legal?

arsenm added inline comments.Sep 12 2017, 7:05 PM
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
2803–2806 ↗(On Diff #114215)

OK, I see the default expansion here isn't the compare and select like I expected. Since the compare+select implementation is likely more instructions with the compare than the sub/ctpop implementation, that one should be tried first.

lib/Target/AMDGPU/AMDGPUISelLowering.cpp
417 ↗(On Diff #114215)

We should probably fix this at some point to be legal

1109–1110 ↗(On Diff #114215)

Also need the select with -1 optimization (and corresponding tests) as cttz

2021 ↗(On Diff #114215)

This is mostly copy past from LowerCTLZ. These should be factored into a common helper.

test/CodeGen/AMDGPU/cttz_zero_undef.ll
103 ↗(On Diff #114215)

Need i64 tests

wdng updated this revision to Diff 115230.Sep 14 2017, 8:40 AM
wdng marked 5 inline comments as done.

Changes based on code review feedback.

wdng updated this revision to Diff 115231.Sep 14 2017, 8:42 AM

Upload a full diff.

Missing performCtlzCombine equivalent

lib/CodeGen/SelectionDAG/LegalizeDAG.cpp
2803–2806 ↗(On Diff #114215)

I don't see this changed

test/CodeGen/AMDGPU/cttz_zero_undef.ll
109 ↗(On Diff #115231)

Missing scalar version

wdng updated this revision to Diff 115294.Sep 14 2017, 2:41 PM
wdng edited the summary of this revision. (Show Details)

Address code reviews.

wdng updated this revision to Diff 115296.Sep 14 2017, 2:47 PM

Fix the issues that variables are not capitalized.

wdng marked 2 inline comments as done.Sep 14 2017, 2:50 PM
arsenm added inline comments.Sep 15 2017, 10:17 AM
lib/Target/AMDGPU/AMDGPUISelLowering.cpp
2036–2042 ↗(On Diff #115296)

Indentation wrong

2043 ↗(On Diff #115296)

llvm_unreachable

2061–2064 ↗(On Diff #115296)

Select between Zero and One as input to getSetCC

3008–3009 ↗(On Diff #115296)

You could just pass in the new opcode directly rather than selecting it again

3010 ↗(On Diff #115296)

Commented out code

3035 ↗(On Diff #115296)

You didn't add tests for this part

lib/Target/AMDGPU/AMDGPUISelLowering.h
374 ↗(On Diff #115296)

Should be name FFBL_B32 to match the instruction

test/CodeGen/AMDGPU/cttz_zero_undef.ll
131 ↗(On Diff #115296)

Also should have some tests with i8/i16

wdng updated this revision to Diff 115908.Sep 19 2017, 2:53 PM
wdng marked 7 inline comments as done.

Will create another separate ticket to fix the v_ffbl_sdwa instruction generation.

wdng edited reviewers, added: kzhuravl; removed: craig.topper.Sep 22 2017, 12:29 PM
arsenm added a comment.Oct 4 2017, 2:50 PM

Needs more comprehensive check lines. Just checking the instructions won't demonstrate that the extra instructions you're trying to avoid aren't there

lib/Target/AMDGPU/AMDGPUISelLowering.cpp
3028 ↗(On Diff #115908)

Don't include AMDGPUISD in the name of this

3053 ↗(On Diff #115908)

Ditto

test/CodeGen/AMDGPU/cttz_zero_undef.ll
85 ↗(On Diff #115908)

This needs to check more

97 ↗(On Diff #115908)

This needs to check more

121–122 ↗(On Diff #115908)

Ditto

wdng updated this revision to Diff 118277.Oct 9 2017, 2:53 PM
wdng marked 5 inline comments as done.

Address code reivews.

arsenm added inline comments.Oct 11 2017, 11:05 AM
lib/Target/AMDGPU/AMDGPUISelLowering.cpp
2031 ↗(On Diff #118277)

Extra space after ::

2040 ↗(On Diff #118277)

Don't includ eAMDGPUISD in variable name

2041 ↗(On Diff #118277)

Missing space before {

2073 ↗(On Diff #118277)

Double // and missing closing )

2077 ↗(On Diff #118277)

Double //

test/CodeGen/AMDGPU/cttz_zero_undef.ll
1 ↗(On Diff #118277)

Add -enable-var-scope to all of the FileCheck lines. Several of these tests are broken

171–172 ↗(On Diff #118277)

This isn't checking the outputs and select

184 ↗(On Diff #118277)

Using undefined VAL

198 ↗(On Diff #118277)

Undefined VAL

wdng updated this revision to Diff 118718.Oct 11 2017, 4:14 PM
wdng marked 5 inline comments as done.

Address code reviews.

wdng marked 3 inline comments as done.Oct 11 2017, 4:14 PM
arsenm added inline comments.Oct 11 2017, 4:24 PM
test/CodeGen/AMDGPU/cttz_zero_undef.ll
175–176 ↗(On Diff #118718)

Using 2 -DAGs with identical lines doesn't do anything. It will pass with only one

wdng added inline comments.Oct 11 2017, 4:27 PM
test/CodeGen/AMDGPU/cttz_zero_undef.ll
175–176 ↗(On Diff #118718)

No, it won't work if I remove the -DAG. As the generated instructions get interleaved with each other.

wdng updated this revision to Diff 118729.Oct 11 2017, 5:08 PM
wdng marked 2 inline comments as done.

Remove duplicate check lines.

wdng updated this revision to Diff 118733.Oct 11 2017, 5:45 PM

Removed -DAG checks completely.

arsenm accepted this revision.Oct 12 2017, 10:39 AM

LGTM

This revision is now accepted and ready to land.Oct 12 2017, 10:39 AM
This revision was automatically updated to reflect the committed changes.