Was previously de-optimizating if -march supported lzcnt as there is
no reason to add the extra instruction.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/lib/Target/X86/X86InstrCompiler.td | ||
---|---|---|
2254 ↗ | (On Diff #488100) | This doesn't produce the correct value for $src being 0. The ctlz would return 16 and the xor would turn that to 31. BSR16rr will produce an undefined value. I think this is only valid for the zero_undef nodes. |
The bsr implementation in hardware also reads the destination register to return it if the other input is zero. This can create unintentional dependencies on older instructions.
bsr is also a multiple uop instruction on some AMD CPUs such as Zen1, 2, and 3. According to uops.info its improved on Zen4.
Would be fair to not do this on zen1/2/3. Could also make it last use only (where dst==src is almost guranteed) if the dependency
on dst is a major concern, although for many intel x86 impls lzcnt also has a false-dep on dst so its mostly equivilent.
llvm/lib/Target/X86/X86InstrCompiler.td | ||
---|---|---|
2254 ↗ | (On Diff #488100) |
I think you're right although call i32 @llvm.ctlz.i32(i32, i1 true) doesn't seem to match ctlz_zero_undef. def LZCNT16rr : I<0xBD, MRMSrcReg, (outs GR16:$dst), (ins GR16:$src), "lzcnt{w}\t{$src, $dst|$dst, $src}", [(set GR16:$dst, (ctlz GR16:$src)), (implicit EFLAGS)]>, XS, OpSize16, Sched<[WriteLZCNT]>; def LZCNT16rm : I<0xBD, MRMSrcMem, (outs GR16:$dst), (ins i16mem:$src), "lzcnt{w}\t{$src, $dst|$dst, $src}", [(set GR16:$dst, (ctlz (loadi16 addr:$src))), (implicit EFLAGS)]>, XS, OpSize16, Sched<[WriteLZCNTLd]>; def LZCNT32rr : I<0xBD, MRMSrcReg, (outs GR32:$dst), (ins GR32:$src), "lzcnt{l}\t{$src, $dst|$dst, $src}", [(set GR32:$dst, (ctlz GR32:$src)), (implicit EFLAGS)]>, XS, OpSize32, Sched<[WriteLZCNT]>; ... Which is just set as ctlz so I thought it was just a very confusing overloaded wording in the TD. Do you know what I'm missing? |
We have TuningFastLZCNT which should be a good enough predicate to control when NOT to perform this fold - you will need to add test coverage to clz.ll for fast/slow lzcnt
The false dependency was fixed on Skylake which released in 2015. Is it safe to assume the majority of users won’t be using a CPU older than that?
The false dependency was fixed on Skylake which released in 2015. Is it safe to assume the majority of users won’t be using a CPU older than that?
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
51899 | Use curly braces here for consistency with the other blocks. | |
51923 | This needs to be SDVTList VTs = DAG.getVTList(OpVT, MVT::i32); Op = DAG.getNode(X86ISD::BSR, dl, VTs, Op); We have X86ISD::BSR plumbed to also returns flags so it has a second i32 out put for the flags. |
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
51912 | You can use C->getZExtValue() since we know the VT is 64-bits or less. |
llvm/test/CodeGen/X86/clz.ll | ||
---|---|---|
971–976 | The FIXME is solved :) |
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
51923 |
|
Fix some nits
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
52108 |
Are the two equivilent? Always thought !C is a zero/non-zero check but nullptr != 0 is techincally possible. | |
llvm/test/CodeGen/X86/clz.ll | ||
971–976 |
Zen4 and I think a few other AMD ones (although more have 1c lzcnt and expensive bsr). |
Use curly braces here for consistency with the other blocks.