Was previously de-optimizating if -march supported lzcnt as there is
no reason to add the extra instruction.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
| llvm/lib/Target/X86/X86InstrCompiler.td | ||
|---|---|---|
| 2254 ↗ | (On Diff #488100) | This doesn't produce the correct value for $src being 0. The ctlz would return 16 and the xor would turn that to 31. BSR16rr will produce an undefined value. I think this is only valid for the zero_undef nodes. |
The bsr implementation in hardware also reads the destination register to return it if the other input is zero. This can create unintentional dependencies on older instructions.
bsr is also a multiple uop instruction on some AMD CPUs such as Zen1, 2, and 3. According to uops.info its improved on Zen4.
Would be fair to not do this on zen1/2/3. Could also make it last use only (where dst==src is almost guranteed) if the dependency
on dst is a major concern, although for many intel x86 impls lzcnt also has a false-dep on dst so its mostly equivilent.
| llvm/lib/Target/X86/X86InstrCompiler.td | ||
|---|---|---|
| 2254 ↗ | (On Diff #488100) |
I think you're right although call i32 @llvm.ctlz.i32(i32, i1 true) doesn't seem to match ctlz_zero_undef. def LZCNT16rr : I<0xBD, MRMSrcReg, (outs GR16:$dst), (ins GR16:$src),
"lzcnt{w}\t{$src, $dst|$dst, $src}",
[(set GR16:$dst, (ctlz GR16:$src)), (implicit EFLAGS)]>,
XS, OpSize16, Sched<[WriteLZCNT]>;
def LZCNT16rm : I<0xBD, MRMSrcMem, (outs GR16:$dst), (ins i16mem:$src),
"lzcnt{w}\t{$src, $dst|$dst, $src}",
[(set GR16:$dst, (ctlz (loadi16 addr:$src))),
(implicit EFLAGS)]>, XS, OpSize16, Sched<[WriteLZCNTLd]>;
def LZCNT32rr : I<0xBD, MRMSrcReg, (outs GR32:$dst), (ins GR32:$src),
"lzcnt{l}\t{$src, $dst|$dst, $src}",
[(set GR32:$dst, (ctlz GR32:$src)), (implicit EFLAGS)]>,
XS, OpSize32, Sched<[WriteLZCNT]>;
...Which is just set as ctlz so I thought it was just a very confusing overloaded wording in the TD. Do you know what I'm missing? |
We have TuningFastLZCNT which should be a good enough predicate to control when NOT to perform this fold - you will need to add test coverage to clz.ll for fast/slow lzcnt
The false dependency was fixed on Skylake which released in 2015. Is it safe to assume the majority of users won’t be using a CPU older than that?
The false dependency was fixed on Skylake which released in 2015. Is it safe to assume the majority of users won’t be using a CPU older than that?
| llvm/lib/Target/X86/X86ISelLowering.cpp | ||
|---|---|---|
| 51899 | Use curly braces here for consistency with the other blocks. | |
| 51923 | This needs to be SDVTList VTs = DAG.getVTList(OpVT, MVT::i32); Op = DAG.getNode(X86ISD::BSR, dl, VTs, Op); We have X86ISD::BSR plumbed to also returns flags so it has a second i32 out put for the flags. | |
| llvm/lib/Target/X86/X86ISelLowering.cpp | ||
|---|---|---|
| 51912 | You can use C->getZExtValue() since we know the VT is 64-bits or less. | |
| llvm/test/CodeGen/X86/clz.ll | ||
|---|---|---|
| 971–976 | The FIXME is solved :) | |
| llvm/lib/Target/X86/X86ISelLowering.cpp | ||
|---|---|---|
| 51923 |
| |
Fix some nits
| llvm/lib/Target/X86/X86ISelLowering.cpp | ||
|---|---|---|
| 52108 |
Are the two equivilent? Always thought !C is a zero/non-zero check but nullptr != 0 is techincally possible. | |
| llvm/test/CodeGen/X86/clz.ll | ||
| 971–976 |
Zen4 and I think a few other AMD ones (although more have 1c lzcnt and expensive bsr). | |
Use curly braces here for consistency with the other blocks.