During the DAGCombine optimization phase, the LLVM compiler converts ISD::CTTZ_ZERO_UNDEF to ISD::CTTZ and then expands during the Legalization phase, which prevents the v_ffbl_b32 instruction generation. This patch implements custom lowering for ISD::CTTZ_ZERO_UNDEF and ISD::CTTZ.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
16680 ↗ | (On Diff #113451) | The existing code was the correct way to do this |
lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
16680 ↗ | (On Diff #113451) | With original code, we will have the following code transformations: Initial selection DAG: BB#0 'sample_test:entry' SelectionDAG has 50 nodes: t0: ch = EntryToken t2: i64,ch = CopyFromReg t0, Register:i64 %vreg2 t3: i64 = Constant<0> t5: i64,ch = load<LD8[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t2, undef:i64 t6: i64,ch = merge_values t5, t5:1 t8: i64 = add t2, Constant:i64<8> t9: i64,ch = load<LD8[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t8, undef:i64 t10: i64,ch = merge_values t9, t9:1 t11: ch = TokenFactor t6:1, t10:1 t13: i64 = llvm.amdgcn.dispatch.ptr TargetConstant:i32<359> t19: i64 = add t13, Constant:i64<4> t20: i16,ch = load<LD2[%4(addrspace=2)](align=4)(tbaa=<0x4436db8>)> t11, t19, undef:i64 t25: i64 = llvm.amdgcn.implicitarg.ptr TargetConstant:i32<460> t27: i64,ch = load<LD8[%11(addrspace=2)](tbaa=<0x4435518>)> t11, t25, undef:i64 t29: i64 = Constant<32> t17: i32 = llvm.amdgcn.workgroup.id.x TargetConstant:i32<505> t21: i32 = zero_extend t20 t22: i32 = mul t17, t21 t15: i32 = llvm.amdgcn.workitem.id.x TargetConstant:i32<508> t23: i32 = add t22, t15 t26: i64 = zero_extend t23 t28: i64 = add t27, t26 t31: i64 = shl t28, Constant:i32<32> t32: i64 = sra t31, Constant:i32<32> t33: i64 = add t6, t32 t34: i8,ch = load<LD1[%arrayidx(addrspace=1)](tbaa=<0x4435498>)> t11, t33, undef:i64 t35: i32 = zero_extend t34 t39: i1 = setcc t35, Constant:i32<0>, setne:ch t36: i32 = cttz_zero_undef t35 t40: i32 = select t39, t36, Constant:i32<32> t43: i1 = setcc Constant:i32<8>, t40, setult:ch t47: ch = TokenFactor t20:1, t27:1, t34:1 t44: i32 = umin t40, Constant:i32<8> t45: i8 = truncate t44 t46: i64 = add t10, t32 t48: ch = store<ST1[%arrayidx3(addrspace=1)](tbaa=<0x4435498>)> t47, t45, t46, undef:i64 t49: ch = ENDPGM t48 Optimized lowered selection DAG: BB#0 'sample_test:entry' SelectionDAG has 35 nodes: t0: ch = EntryToken t2: i64,ch = CopyFromReg t0, Register:i64 %vreg2 t5: i64,ch = load<LD8[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t2, undef:i64 t8: i64 = add t2, Constant:i64<8> t9: i64,ch = load<LD8[undef(addrspace=2)](nontemporal)(dereferenceable)(invariant)> t0, t8, undef:i64 t11: ch = TokenFactor t5:1, t9:1 t33: i64 = add t5, t63 t54: i32,ch = load<LD1[%arrayidx(addrspace=1)](tbaa=<0x4435498>), zext from i8> t11, t33, undef:i64 t25: i64 = llvm.amdgcn.implicitarg.ptr TargetConstant:i32<460> t62: i32,ch = load<LD4[%11(addrspace=2)](align=8)(tbaa=<0x4435518>)> t11, t25, undef:i64 t17: i32 = llvm.amdgcn.workgroup.id.x TargetConstant:i32<505> t22: i32 = mul t17, t64 t15: i32 = llvm.amdgcn.workitem.id.x TargetConstant:i32<508> t23: i32 = add t22, t15 t60: i32 = add t62, t23 t63: i64 = sign_extend t60, ValueType:ch:i32 t13: i64 = llvm.amdgcn.dispatch.ptr TargetConstant:i32<359> t19: i64 = add t13, Constant:i64<4> t64: i32,ch = load<LD2[%4(addrspace=2)](align=4)(tbaa=<0x4436db8>), zext from i16> t11, t19, undef:i64 t47: ch = TokenFactor t64:1, t62:1, t54:1 t53: i32 = cttz t54 t44: i32 = umin t53, Constant:i32<8> t46: i64 = add t9, t63 t50: ch = store<ST1[%arrayidx3(addrspace=1)](tbaa=<0x4435498>), trunc to i8> t47, t44, t46, undef:i64 t49: ch = ENDPGM t50 We won't be able to generate s/v_ffbl instructions. I found llvm.cttz.i32 has all been converted to cttz_zero_undef instread of 'cttz'. |
If we don't want to change the original way of implementation, we may want to do a custom lowering for ISD::CTTZ at AMDGPU backend to ISD::CTTZ_ZERO_UNDE?
I think the actual problem is the implementation of ISD::CTTZ not using v_ffbl and not this transformation.
If v_ffbl is able to produce a defined answer of bit width for 0, then you want to match it with cttz and have the operation action for cttz_zero_undef set to Expand. That will turn all cttz_zero_undef calls into cttz.
If v_ffbl is not capable of handling zero, then you want cttz_zero_undef set to Legal, and cttz set to Expand which will make use of cttz_zero_undef and a select. Or you can make cttz Custom and do your own lowering.
I think the instruction behavior is to return -1 on 0 input. IIRC we handle this and fold that for ctlz already, just not cttz.
I don't think that will help. Why not follow exactly how CTLZ* is handled now and implement AMDGPUTargetLowering::LowerCTTZ making use of ffbl?
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp | ||
---|---|---|
2784–2793 | I don't understand why you have this or most of the other changes. This shouldn't be substantially different from how we handle ctlz already. i.e. I would expect to see another version of AMDGPUTargetLowering::performCtlzCombine that does essentially the same thing for CTTZ. | |
lib/Target/AMDGPU/AMDGPUISelLowering.cpp | ||
420 | This should definitely remain legal | |
2029 | You didn't enable custom lowering for i64, so this is dead code. You also didn't a dd any tests for it. In either case, it should be in a separate patch from the i32 handling. | |
lib/Target/AMDGPU/AMDGPUInstrInfo.td | ||
301–302 | This isn't a signed/unsigned operation. There is just one v_ffbl_b32. |
lib/Target/AMDGPU/AMDGPUISelLowering.cpp | ||
---|---|---|
420 | I think it doesn't matter to define it as Custom, because it will be converted to FFBL_U32 during the custom lowering and then pattern matching to the ffbl instruction anyway at the end. However, if we defined it as Legal, we will have a "duplicate" or "extra" pattern (FFBL_U32 and CTTZ_ZERO_UNDEF) for generating the ffbl instruction. Is there any specific reason that I neglect here that we have to define it as Legal? |
lib/CodeGen/SelectionDAG/LegalizeDAG.cpp | ||
---|---|---|
2803–2806 | OK, I see the default expansion here isn't the compare and select like I expected. Since the compare+select implementation is likely more instructions with the compare than the sub/ctpop implementation, that one should be tried first. | |
lib/Target/AMDGPU/AMDGPUISelLowering.cpp | ||
417 | We should probably fix this at some point to be legal | |
1109–1110 | Also need the select with -1 optimization (and corresponding tests) as cttz | |
2021 | This is mostly copy past from LowerCTLZ. These should be factored into a common helper. | |
test/CodeGen/AMDGPU/cttz_zero_undef.ll | ||
103 | Need i64 tests |
lib/Target/AMDGPU/AMDGPUISelLowering.cpp | ||
---|---|---|
2036–2042 | Indentation wrong | |
2043 | llvm_unreachable | |
2061–2064 | Select between Zero and One as input to getSetCC | |
3008–3009 | You could just pass in the new opcode directly rather than selecting it again | |
3010 | Commented out code | |
3035 | You didn't add tests for this part | |
lib/Target/AMDGPU/AMDGPUISelLowering.h | ||
374 | Should be name FFBL_B32 to match the instruction | |
test/CodeGen/AMDGPU/cttz_zero_undef.ll | ||
131 | Also should have some tests with i8/i16 |
Needs more comprehensive check lines. Just checking the instructions won't demonstrate that the extra instructions you're trying to avoid aren't there
lib/Target/AMDGPU/AMDGPUISelLowering.cpp | ||
---|---|---|
3028 | Don't include AMDGPUISD in the name of this | |
3053 | Ditto | |
test/CodeGen/AMDGPU/cttz_zero_undef.ll | ||
85 | This needs to check more | |
97 | This needs to check more | |
121–122 | Ditto |
lib/Target/AMDGPU/AMDGPUISelLowering.cpp | ||
---|---|---|
2031 | Extra space after :: | |
2040 | Don't includ eAMDGPUISD in variable name | |
2041 | Missing space before { | |
2073 | Double // and missing closing ) | |
2077 | Double // | |
test/CodeGen/AMDGPU/cttz_zero_undef.ll | ||
1 | Add -enable-var-scope to all of the FileCheck lines. Several of these tests are broken | |
171–172 | This isn't checking the outputs and select | |
184 | Using undefined VAL | |
198 | Undefined VAL |
test/CodeGen/AMDGPU/cttz_zero_undef.ll | ||
---|---|---|
175–176 | Using 2 -DAGs with identical lines doesn't do anything. It will pass with only one |
test/CodeGen/AMDGPU/cttz_zero_undef.ll | ||
---|---|---|
175–176 | No, it won't work if I remove the -DAG. As the generated instructions get interleaved with each other. |
I don't understand why you have this or most of the other changes. This shouldn't be substantially different from how we handle ctlz already. i.e. I would expect to see another version of AMDGPUTargetLowering::performCtlzCombine that does essentially the same thing for CTTZ.