During the DAGCombine optimization phase, the LLVM compiler converts ISD::CTTZ_ZERO_UNDEF to ISD::CTTZ and then expands during the Legalization phase, which prevents the v_ffbl_b32 instruction generation. This patch implements custom lowering for ISD::CTTZ_ZERO_UNDEF and ISD::CTTZ.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
| lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
|---|---|---|
| 16680 ↗ | (On Diff #113451) | The existing code was the correct way to do this |
| lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
|---|---|---|
| 16680 ↗ | (On Diff #113451) | With original code, we will have the following code transformations: Initial selection DAG: BB#0 'sample_test:entry'
SelectionDAG has 50 nodes:
t0: ch = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %vreg2
t3: i64 = Constant<0>
t5: i64,ch = load<LD8[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t2, undef:i64
t6: i64,ch = merge_values t5, t5:1
t8: i64 = add t2, Constant:i64<8>
t9: i64,ch = load<LD8[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t8, undef:i64
t10: i64,ch = merge_values t9, t9:1
t11: ch = TokenFactor t6:1, t10:1
t13: i64 = llvm.amdgcn.dispatch.ptr TargetConstant:i32<359>
t19: i64 = add t13, Constant:i64<4>
t20: i16,ch = load<LD2[%4(addrspace=2)](align=4)(tbaa=<0x4436db8>)> t11, t19, undef:i64
t25: i64 = llvm.amdgcn.implicitarg.ptr TargetConstant:i32<460>
t27: i64,ch = load<LD8[%11(addrspace=2)](tbaa=<0x4435518>)> t11, t25, undef:i64
t29: i64 = Constant<32>
t17: i32 = llvm.amdgcn.workgroup.id.x TargetConstant:i32<505>
t21: i32 = zero_extend t20
t22: i32 = mul t17, t21
t15: i32 = llvm.amdgcn.workitem.id.x TargetConstant:i32<508>
t23: i32 = add t22, t15
t26: i64 = zero_extend t23
t28: i64 = add t27, t26
t31: i64 = shl t28, Constant:i32<32>
t32: i64 = sra t31, Constant:i32<32>
t33: i64 = add t6, t32
t34: i8,ch = load<LD1[%arrayidx(addrspace=1)](tbaa=<0x4435498>)> t11, t33, undef:i64
t35: i32 = zero_extend t34
t39: i1 = setcc t35, Constant:i32<0>, setne:ch
t36: i32 = cttz_zero_undef t35
t40: i32 = select t39, t36, Constant:i32<32>
t43: i1 = setcc Constant:i32<8>, t40, setult:ch
t47: ch = TokenFactor t20:1, t27:1, t34:1
t44: i32 = umin t40, Constant:i32<8>
t45: i8 = truncate t44
t46: i64 = add t10, t32
t48: ch = store<ST1[%arrayidx3(addrspace=1)](tbaa=<0x4435498>)> t47, t45, t46, undef:i64
t49: ch = ENDPGM t48
Optimized lowered selection DAG: BB#0 'sample_test:entry'
SelectionDAG has 35 nodes:
t0: ch = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %vreg2
t5: i64,ch = load<LD8[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t2, undef:i64
t8: i64 = add t2, Constant:i64<8>
t9: i64,ch = load<LD8[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t8, undef:i64
t11: ch = TokenFactor t5:1, t9:1
t33: i64 = add t5, t63
t54: i32,ch = load<LD1[%arrayidx(addrspace=1)](tbaa=<0x4435498>), zext from i8> t11, t33, undef:i64
t25: i64 = llvm.amdgcn.implicitarg.ptr TargetConstant:i32<460>
t62: i32,ch = load<LD4[%11(addrspace=2)](align=8)(tbaa=<0x4435518>)> t11, t25, undef:i64
t17: i32 = llvm.amdgcn.workgroup.id.x TargetConstant:i32<505>
t22: i32 = mul t17, t64
t15: i32 = llvm.amdgcn.workitem.id.x TargetConstant:i32<508>
t23: i32 = add t22, t15
t60: i32 = add t62, t23
t63: i64 = sign_extend t60, ValueType:ch:i32
t13: i64 = llvm.amdgcn.dispatch.ptr TargetConstant:i32<359>
t19: i64 = add t13, Constant:i64<4>
t64: i32,ch = load<LD2[%4(addrspace=2)](align=4)(tbaa=<0x4436db8>), zext from i16> t11, t19, undef:i64
t47: ch = TokenFactor t64:1, t62:1, t54:1
t53: i32 = cttz t54
t44: i32 = umin t53, Constant:i32<8>
t46: i64 = add t9, t63
t50: ch = store<ST1[%arrayidx3(addrspace=1)](tbaa=<0x4435498>), trunc to i8> t47, t44, t46, undef:i64
t49: ch = ENDPGM t50We won't be able to generate s/v_ffbl instructions. I found llvm.cttz.i32 has all been converted to cttz_zero_undef instread of 'cttz'. |
If we don't want to change the original way of implementation, we may want to do a custom lowering for ISD::CTTZ at AMDGPU backend to ISD::CTTZ_ZERO_UNDE?
I think the actual problem is the implementation of ISD::CTTZ not using v_ffbl and not this transformation.
If v_ffbl is able to produce a defined answer of bit width for 0, then you want to match it with cttz and have the operation action for cttz_zero_undef set to Expand. That will turn all cttz_zero_undef calls into cttz.
If v_ffbl is not capable of handling zero, then you want cttz_zero_undef set to Legal, and cttz set to Expand which will make use of cttz_zero_undef and a select. Or you can make cttz Custom and do your own lowering.
I think the instruction behavior is to return -1 on 0 input. IIRC we handle this and fold that for ctlz already, just not cttz.
I don't think that will help. Why not follow exactly how CTLZ* is handled now and implement AMDGPUTargetLowering::LowerCTTZ making use of ffbl?
| lib/CodeGen/SelectionDAG/LegalizeDAG.cpp | ||
|---|---|---|
| 2784–2793 ↗ | (On Diff #114215) | I don't understand why you have this or most of the other changes. This shouldn't be substantially different from how we handle ctlz already. i.e. I would expect to see another version of AMDGPUTargetLowering::performCtlzCombine that does essentially the same thing for CTTZ. |
| lib/Target/AMDGPU/AMDGPUISelLowering.cpp | ||
| 420 ↗ | (On Diff #114215) | This should definitely remain legal |
| 2029 ↗ | (On Diff #114215) | You didn't enable custom lowering for i64, so this is dead code. You also didn't a dd any tests for it. In either case, it should be in a separate patch from the i32 handling. |
| lib/Target/AMDGPU/AMDGPUInstrInfo.td | ||
| 301–302 ↗ | (On Diff #114215) | This isn't a signed/unsigned operation. There is just one v_ffbl_b32. |
| lib/Target/AMDGPU/AMDGPUISelLowering.cpp | ||
|---|---|---|
| 420 ↗ | (On Diff #114215) | I think it doesn't matter to define it as Custom, because it will be converted to FFBL_U32 during the custom lowering and then pattern matching to the ffbl instruction anyway at the end. However, if we defined it as Legal, we will have a "duplicate" or "extra" pattern (FFBL_U32 and CTTZ_ZERO_UNDEF) for generating the ffbl instruction. Is there any specific reason that I neglect here that we have to define it as Legal? |
| lib/CodeGen/SelectionDAG/LegalizeDAG.cpp | ||
|---|---|---|
| 2803–2806 ↗ | (On Diff #114215) | OK, I see the default expansion here isn't the compare and select like I expected. Since the compare+select implementation is likely more instructions with the compare than the sub/ctpop implementation, that one should be tried first. |
| lib/Target/AMDGPU/AMDGPUISelLowering.cpp | ||
| 417 ↗ | (On Diff #114215) | We should probably fix this at some point to be legal |
| 1109–1110 ↗ | (On Diff #114215) | Also need the select with -1 optimization (and corresponding tests) as cttz |
| 2021 ↗ | (On Diff #114215) | This is mostly copy past from LowerCTLZ. These should be factored into a common helper. |
| test/CodeGen/AMDGPU/cttz_zero_undef.ll | ||
| 103 ↗ | (On Diff #114215) | Need i64 tests |
Missing performCtlzCombine equivalent
| lib/CodeGen/SelectionDAG/LegalizeDAG.cpp | ||
|---|---|---|
| 2803–2806 ↗ | (On Diff #114215) | I don't see this changed |
| test/CodeGen/AMDGPU/cttz_zero_undef.ll | ||
| 109 ↗ | (On Diff #115231) | Missing scalar version |
| lib/Target/AMDGPU/AMDGPUISelLowering.cpp | ||
|---|---|---|
| 2036–2042 ↗ | (On Diff #115296) | Indentation wrong |
| 2043 ↗ | (On Diff #115296) | llvm_unreachable |
| 2061–2064 ↗ | (On Diff #115296) | Select between Zero and One as input to getSetCC |
| 3008–3009 ↗ | (On Diff #115296) | You could just pass in the new opcode directly rather than selecting it again |
| 3010 ↗ | (On Diff #115296) | Commented out code |
| 3035 ↗ | (On Diff #115296) | You didn't add tests for this part |
| lib/Target/AMDGPU/AMDGPUISelLowering.h | ||
| 374 ↗ | (On Diff #115296) | Should be name FFBL_B32 to match the instruction |
| test/CodeGen/AMDGPU/cttz_zero_undef.ll | ||
| 131 ↗ | (On Diff #115296) | Also should have some tests with i8/i16 |
Needs more comprehensive check lines. Just checking the instructions won't demonstrate that the extra instructions you're trying to avoid aren't there
| lib/Target/AMDGPU/AMDGPUISelLowering.cpp | ||
|---|---|---|
| 3028 ↗ | (On Diff #115908) | Don't include AMDGPUISD in the name of this |
| 3053 ↗ | (On Diff #115908) | Ditto |
| test/CodeGen/AMDGPU/cttz_zero_undef.ll | ||
| 85 ↗ | (On Diff #115908) | This needs to check more |
| 97 ↗ | (On Diff #115908) | This needs to check more |
| 121–122 ↗ | (On Diff #115908) | Ditto |
| lib/Target/AMDGPU/AMDGPUISelLowering.cpp | ||
|---|---|---|
| 2031 ↗ | (On Diff #118277) | Extra space after :: |
| 2040 ↗ | (On Diff #118277) | Don't includ eAMDGPUISD in variable name |
| 2041 ↗ | (On Diff #118277) | Missing space before { |
| 2073 ↗ | (On Diff #118277) | Double // and missing closing ) |
| 2077 ↗ | (On Diff #118277) | Double // |
| test/CodeGen/AMDGPU/cttz_zero_undef.ll | ||
| 1 ↗ | (On Diff #118277) | Add -enable-var-scope to all of the FileCheck lines. Several of these tests are broken |
| 171–172 ↗ | (On Diff #118277) | This isn't checking the outputs and select |
| 184 ↗ | (On Diff #118277) | Using undefined VAL |
| 198 ↗ | (On Diff #118277) | Undefined VAL |
| test/CodeGen/AMDGPU/cttz_zero_undef.ll | ||
|---|---|---|
| 175–176 ↗ | (On Diff #118718) | Using 2 -DAGs with identical lines doesn't do anything. It will pass with only one |
| test/CodeGen/AMDGPU/cttz_zero_undef.ll | ||
|---|---|---|
| 175–176 ↗ | (On Diff #118718) | No, it won't work if I remove the -DAG. As the generated instructions get interleaved with each other. |